In [1]:
import numpy as np
import pandas as pd

import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import acquire as ac

import warnings
warnings.filterwarnings("ignore")

In [2]:
blogs = ac.get_blog_articles(urls=["https://codeup.com/codeups-data-science-career-accelerator-is-here/",
                       "https://codeup.com/data-science-myths/",
                       "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
                       "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
                       "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"])
blogs = pd.DataFrame(blogs)

### 1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [3]:
# lowercase everything
article = blogs.original_content[0].lower()
article

'the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in\xa0glassdoor’s #1 best job in america.data science is a method of providing actionable intelligence from data.\xa0the data revolution has hit san antonio,\xa0resulting in an explosion in data scientist positions\xa0across companies like usaa, accenture, booz allen hamilton, and heb. we’ve even seen\xa0utsa invest $70 m for a cybersecurity center and school of data science.\xa0we built a program to specifically meet the growing demands of this industry.our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students wi

In [4]:
# normalize unicode characters
article = unicodedata.normalize("NFKD", article)\
            .encode("ascii", "ignore")\
            .decode("utf-8", "ignore")

article

'the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoors #1 best job in america.data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. weve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real da
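To see what the normalization step does in isolation, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the article). NFKD turns the non-breaking space `\xa0` into a plain space and splits `é` into `e` plus a combining accent. Encoding to ASCII with `"ignore"` then drops anything that has no ASCII form, which is why the curly apostrophe in `we’ve` disappears and leaves `weve` in the output above.

```python
import unicodedata

# hypothetical sample: curly apostrophe, non-breaking space, accented letter
sample = "we\u2019ve seen\xa0caf\u00e9 data"

# decompose characters, then drop everything that is not ASCII
normalized = unicodedata.normalize("NFKD", sample)
ascii_only = normalized.encode("ascii", "ignore").decode("utf-8", "ignore")

print(ascii_only)  # weve seen cafe data
```

If keeping contractions matters, one option would be to swap curly apostrophes for straight ones before this step; that is a design choice and not something the exercise asks for.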

In [5]:
# replace anything that is not a letter, number, whitespace or a single quote.
article = re.sub(r"[^a-z0-9\s']", "", article)
article

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in americadata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industryour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems an
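The output above shows that deleting punctuation also fuses hyphenated words (`fulltime`, `handson`, `projectbased`). As a rough comparison on a made-up string, the sketch below applies the same character class twice: once replacing with an empty string, as in the cell above, and once replacing with a space and then collapsing repeated whitespace. The second variant is only an alternative to consider, not what the exercise specifies.

```python
import re

# hypothetical sample, already lowercased
sample = "full-time, hands-on, and project-based: isn't it $70 m?"

# same pattern as the notebook: delete every disallowed character
removed = re.sub(r"[^a-z0-9\s']", "", sample)

# alternative: swap disallowed characters for a space, then collapse whitespace
spaced = re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s']", " ", sample)).strip()

print(removed)  # fulltime handson and projectbased isn't it 70 m
print(spaced)   # full time hands on and project based isn't it 70 m
```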

---
**Write a Function**

In [6]:
blogs

| | title | original_content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie GiustData Scien... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri AntoniouA week ago, Codeup launched... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | 10 Tips to Crush It at the SA Tech Job FairSA ... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


In [7]:
def basic_clean(string):
    """
    This function accepts a string and returns the string after applying some basic text cleaning to each word.
    """
    
    # lowercase all characters
    string = string.lower()
    
    # normalize unicode characters
    string = unicodedata.normalize("NFKD", string)\
                .encode("ascii", "ignore")\
                .decode("utf-8", "ignore")
    
    # replace anything that is not a letter, number, whitespace or a single quote.
    string = re.sub(r"[^a-z0-9\s']", "", string)
    
    return string

In [8]:
blogs

| | title | original_content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie GiustData Scien... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri AntoniouA week ago, Codeup launched... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | 10 Tips to Crush It at the SA Tech Job FairSA ... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


In [9]:
blogs["basic_clean"] = blogs.original_content.apply(basic_clean)
blogs

| | title | original_content | basic_clean |
|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... | the rumors are true the time has arrived codeu... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie GiustData Scien... | by dimitri antoniou and maggie giustdata scien... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri AntoniouA week ago, Codeup launched... | by dimitri antonioua week ago codeup launched ... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | 10 Tips to Crush It at the SA Tech Job FairSA ... | 10 tips to crush it at the sa tech job fairsa ... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamps are closing is the model ... |


In [10]:
blogs.basic_clean[0]

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in americadata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industryour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems an

---
### 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [11]:
tokenizer = ToktokTokenizer()

article = tokenizer.tokenize(article, return_str=True)
article

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in americadata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industryour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems an

---
**Write a Function**

In [12]:
def tokenize(string):
    """
    This function accepts a string and returns the string after tokenizing each word.
    """
    
    # make tokenizer object
    tokenizer = ToktokTokenizer()

    # use tokenizer object and return string
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [13]:
tokenize(blogs.basic_clean[0])

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in americadata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industryour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems an

---
### 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [14]:
ps = nltk.porter.PorterStemmer()
ps

<PorterStemmer>

In [15]:
stems = [ps.stem(word) for word in article.split()]
stems

['the',
 'rumor',
 'are',
 'true',
 'the',
 'time',
 'ha',
 'arriv',
 'codeup',
 'ha',
 'offici',
 'open',
 'applic',
 'to',
 'our',
 'new',
 'data',
 'scienc',
 'career',
 'acceler',
 'with',
 'onli',
 '25',
 'seat',
 'avail',
 'thi',
 'immers',
 'program',
 'is',
 'one',
 'of',
 'a',
 'kind',
 'in',
 'san',
 'antonio',
 'and',
 'will',
 'help',
 'you',
 'land',
 'a',
 'job',
 'in',
 'glassdoor',
 '1',
 'best',
 'job',
 'in',
 'americadata',
 'scienc',
 'is',
 'a',
 'method',
 'of',
 'provid',
 'action',
 'intellig',
 'from',
 'data',
 'the',
 'data',
 'revolut',
 'ha',
 'hit',
 'san',
 'antonio',
 'result',
 'in',
 'an',
 'explos',
 'in',
 'data',
 'scientist',
 'posit',
 'across',
 'compani',
 'like',
 'usaa',
 'accentur',
 'booz',
 'allen',
 'hamilton',
 'and',
 'heb',
 'weve',
 'even',
 'seen',
 'utsa',
 'invest',
 '70',
 'm',
 'for',
 'a',
 'cybersecur',
 'center',
 'and',
 'school',
 'of',
 'data',
 'scienc',
 'we',
 'built',
 'a',
 'program',
 'to',
 'specif',
 'meet',
 'the',


In [16]:
article_stems = " ".join(stems)

In [17]:
article_stems

'the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in americadata scienc is a method of provid action intellig from data the data revolut ha hit san antonio result in an explos in data scientist posit across compani like usaa accentur booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecur center and school of data scienc we built a program to specif meet the grow demand of thi industryour program will be 18 week long fulltim handson and projectbas our curriculum develop and instruct is led by senior data scientist maggi giust who ha work at heb capit group and rackspac along with input from dozen of practition and hire partner student will work with real data set realist problem and the entir data scienc pipelin from collect to deploy they will receiv profession develop train in resum 

---
**Write a Function**

In [18]:
def stem(string):
    """
    This function accepts a string and returns the string after applying stemming to each word.
    """
    
    # create stemmer object
    ps = nltk.porter.PorterStemmer()
    
    # use stemmer to generate list of stems
    stems = [ps.stem(word) for word in string.split()]
    
    # join stems with whitespace to return a cohesive string
    cohesive_stems = " ".join(stems)
    
    return stems, cohesive_stems

In [19]:
# stemmed article
article = blogs.original_content[0]
article = basic_clean(article)
article = tokenize(article)
article_list, article_string = stem(article)
article_list

['the',
 'rumor',
 'are',
 'true',
 'the',
 'time',
 'ha',
 'arriv',
 'codeup',
 'ha',
 'offici',
 'open',
 'applic',
 'to',
 'our',
 'new',
 'data',
 'scienc',
 'career',
 'acceler',
 'with',
 'onli',
 '25',
 'seat',
 'avail',
 'thi',
 'immers',
 'program',
 'is',
 'one',
 'of',
 'a',
 'kind',
 'in',
 'san',
 'antonio',
 'and',
 'will',
 'help',
 'you',
 'land',
 'a',
 'job',
 'in',
 'glassdoor',
 '1',
 'best',
 'job',
 'in',
 'americadata',
 'scienc',
 'is',
 'a',
 'method',
 'of',
 'provid',
 'action',
 'intellig',
 'from',
 'data',
 'the',
 'data',
 'revolut',
 'ha',
 'hit',
 'san',
 'antonio',
 'result',
 'in',
 'an',
 'explos',
 'in',
 'data',
 'scientist',
 'posit',
 'across',
 'compani',
 'like',
 'usaa',
 'accentur',
 'booz',
 'allen',
 'hamilton',
 'and',
 'heb',
 'weve',
 'even',
 'seen',
 'utsa',
 'invest',
 '70',
 'm',
 'for',
 'a',
 'cybersecur',
 'center',
 'and',
 'school',
 'of',
 'data',
 'scienc',
 'we',
 'built',
 'a',
 'program',
 'to',
 'specif',
 'meet',
 'the',


---
### 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [20]:
wnl = nltk.stem.WordNetLemmatizer()
wnl

<WordNetLemmatizer>

In [21]:
article = blogs.original_content[0]
article = basic_clean(article)
article = tokenize(article)

In [22]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
lemmas

['the',
 'rumor',
 'are',
 'true',
 'the',
 'time',
 'ha',
 'arrived',
 'codeup',
 'ha',
 'officially',
 'opened',
 'application',
 'to',
 'our',
 'new',
 'data',
 'science',
 'career',
 'accelerator',
 'with',
 'only',
 '25',
 'seat',
 'available',
 'this',
 'immersive',
 'program',
 'is',
 'one',
 'of',
 'a',
 'kind',
 'in',
 'san',
 'antonio',
 'and',
 'will',
 'help',
 'you',
 'land',
 'a',
 'job',
 'in',
 'glassdoors',
 '1',
 'best',
 'job',
 'in',
 'americadata',
 'science',
 'is',
 'a',
 'method',
 'of',
 'providing',
 'actionable',
 'intelligence',
 'from',
 'data',
 'the',
 'data',
 'revolution',
 'ha',
 'hit',
 'san',
 'antonio',
 'resulting',
 'in',
 'an',
 'explosion',
 'in',
 'data',
 'scientist',
 'position',
 'across',
 'company',
 'like',
 'usaa',
 'accenture',
 'booz',
 'allen',
 'hamilton',
 'and',
 'heb',
 'weve',
 'even',
 'seen',
 'utsa',
 'invest',
 '70',
 'm',
 'for',
 'a',
 'cybersecurity',
 'center',
 'and',
 'school',
 'of',
 'data',
 'science',
 'we',
 'built

---
**Write a Function**

In [23]:
def lemmatize(string):
    """
    This function accepts a string and returns the string after applying lemmatization to each word.
    """
    
    # create lemmatizer object
    wnl = nltk.stem.WordNetLemmatizer()
    
    # use lemmatizer to generate list of lemmas
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # join lemmas with whitespace to return a cohesive string
    cohesive_lemmas = " ".join(lemmas)
    
    return lemmas, cohesive_lemmas

In [24]:
# lemmatized article
article = blogs.original_content[0]
article = basic_clean(article)
article = tokenize(article)
article_list, article_string = lemmatize(article)
article

'the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in americadata science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industryour program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems an

---
### 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we *don't* want to remove.

In [25]:
stopword_list = stopwords.words("english")
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [26]:
stems_sans_stopwords = [word for word in stems if word not in stopword_list]
stems_sans_stopwords

['rumor',
 'true',
 'time',
 'ha',
 'arriv',
 'codeup',
 'ha',
 'offici',
 'open',
 'applic',
 'new',
 'data',
 'scienc',
 'career',
 'acceler',
 'onli',
 '25',
 'seat',
 'avail',
 'thi',
 'immers',
 'program',
 'one',
 'kind',
 'san',
 'antonio',
 'help',
 'land',
 'job',
 'glassdoor',
 '1',
 'best',
 'job',
 'americadata',
 'scienc',
 'method',
 'provid',
 'action',
 'intellig',
 'data',
 'data',
 'revolut',
 'ha',
 'hit',
 'san',
 'antonio',
 'result',
 'explos',
 'data',
 'scientist',
 'posit',
 'across',
 'compani',
 'like',
 'usaa',
 'accentur',
 'booz',
 'allen',
 'hamilton',
 'heb',
 'weve',
 'even',
 'seen',
 'utsa',
 'invest',
 '70',
 'cybersecur',
 'center',
 'school',
 'data',
 'scienc',
 'built',
 'program',
 'specif',
 'meet',
 'grow',
 'demand',
 'thi',
 'industryour',
 'program',
 '18',
 'week',
 'long',
 'fulltim',
 'handson',
 'projectbas',
 'curriculum',
 'develop',
 'instruct',
 'led',
 'senior',
 'data',
 'scientist',
 'maggi',
 'giust',
 'ha',
 'work',
 'heb',
 'c

In [27]:
lemmas_sans_stopwords = [word for word in lemmas if word not in stopword_list]
article_sans_stopwords = " ".join(lemmas_sans_stopwords)
article_sans_stopwords

'rumor true time ha arrived codeup ha officially opened application new data science career accelerator 25 seat available immersive program one kind san antonio help land job glassdoors 1 best job americadata science method providing actionable intelligence data data revolution ha hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demand industryour program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust ha worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforcewe focus applied data science immediat

---
**Write a Function**

In [28]:
empty_list = []

In [29]:
if not empty_list:
    print("List is empty")

List is empty


In [30]:
new_list = ["alec", "is", "good"]
new_list

['alec', 'is', 'good']

In [31]:
new_list.remove("is")
new_list

['alec', 'good']

In [32]:
new_list.append("is")
new_list

['alec', 'good', 'is']

In [33]:
if not new_list:
    new_list
else:
    new_list.append("word")
        
new_list

['alec', 'good', 'is', 'word']

In [34]:
if not new_list:
    new_list
else:
    new_list.extend(("word", "another"))

In [35]:
new_list

['alec', 'good', 'is', 'word', 'word', 'another']

In [36]:
# sum(new_list, [])

In [37]:
def remove_stopwords(lemmas_or_stems, extra_stopwords=None, exclude_stopwords=None):
    """
    This function accepts a list of words (lemmas_or_stems) and returns a string after removing stopwords.
    Extra words can be added to the standard English stopwords using the extra_stopwords parameter.
    Words can be excluded from the standard English stopwords using the exclude_stopwords parameter.
    """
    
    # create stopword list
    stopword_list = stopwords.words("english")
    
    # add extra_stopwords to the stopword list if any were passed
    if extra_stopwords:
        stopword_list.extend(extra_stopwords)
    
    # drop exclude_stopwords from the stopword list if any were passed
    if exclude_stopwords:
        stopword_list = [word for word in stopword_list if word not in exclude_stopwords]
    
    # keep only the words that are not in the stopword list
    lemmas_or_stems_sans_stopwords = [word for word in lemmas_or_stems if word not in stopword_list]
    
    # join lemmas_or_stems_sans_stopwords with whitespace to return a cohesive string
    string_sans_stopwords = " ".join(lemmas_or_stems_sans_stopwords)
    
    return string_sans_stopwords
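The add-then-exclude logic can be checked without NLTK on a tiny example. The sketch below is an illustration only: `filter_words` and the four-word `base` list are hypothetical stand-ins for `remove_stopwords` and `stopwords.words("english")`. It uses a set for membership tests, and, like the function above, it applies exclusions after additions, so a word passed in both lists stays in the text.

```python
def filter_words(words, base_stopwords, extra=None, exclude=None):
    """Join words into a string, skipping anything in the adjusted stopword set."""
    # add the extra words first, then remove the excluded ones
    stopword_set = (set(base_stopwords) | set(extra or [])) - set(exclude or [])
    return " ".join(word for word in words if word not in stopword_set)

# hypothetical stand-in for stopwords.words("english")
base = ["the", "a", "is", "of"]
words = ["the", "rumor", "is", "a", "kind", "of", "true"]

default_result = filter_words(words, base)
custom_result = filter_words(words, base, extra=["rumor"], exclude=["a"])

print(default_result)  # rumor kind true
print(custom_result)   # a kind true
```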

In [38]:
# lemmatized article
blogs = ac.get_blog_articles(urls=["https://codeup.com/codeups-data-science-career-accelerator-is-here/",
                       "https://codeup.com/data-science-myths/",
                       "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
                       "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
                       "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"])
blogs = pd.DataFrame(blogs)
article = blogs.original_content[0]
article = basic_clean(article)
article = tokenize(article)
lemmatized_article_list, lemmatized_article_string = lemmatize(article)
lemmatized_article_list

['the',
 'rumor',
 'are',
 'true',
 'the',
 'time',
 'ha',
 'arrived',
 'codeup',
 'ha',
 'officially',
 'opened',
 'application',
 'to',
 'our',
 'new',
 'data',
 'science',
 'career',
 'accelerator',
 'with',
 'only',
 '25',
 'seat',
 'available',
 'this',
 'immersive',
 'program',
 'is',
 'one',
 'of',
 'a',
 'kind',
 'in',
 'san',
 'antonio',
 'and',
 'will',
 'help',
 'you',
 'land',
 'a',
 'job',
 'in',
 'glassdoors',
 '1',
 'best',
 'job',
 'in',
 'americadata',
 'science',
 'is',
 'a',
 'method',
 'of',
 'providing',
 'actionable',
 'intelligence',
 'from',
 'data',
 'the',
 'data',
 'revolution',
 'ha',
 'hit',
 'san',
 'antonio',
 'resulting',
 'in',
 'an',
 'explosion',
 'in',
 'data',
 'scientist',
 'position',
 'across',
 'company',
 'like',
 'usaa',
 'accenture',
 'booz',
 'allen',
 'hamilton',
 'and',
 'heb',
 'weve',
 'even',
 'seen',
 'utsa',
 'invest',
 '70',
 'm',
 'for',
 'a',
 'cybersecurity',
 'center',
 'and',
 'school',
 'of',
 'data',
 'science',
 'we',
 'built

In [39]:
article_base_stopwords = remove_stopwords(lemmatized_article_list)
article_base_stopwords

'rumor true time ha arrived codeup ha officially opened application new data science career accelerator 25 seat available immersive program one kind san antonio help land job glassdoors 1 best job americadata science method providing actionable intelligence data data revolution ha hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demand industryour program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust ha worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforcewe focus applied data science immediat

In [40]:
article_extra_stopwords = remove_stopwords(lemmatized_article_list, extra_stopwords=["email"])
article_extra_stopwords

'rumor true time ha arrived codeup ha officially opened application new data science career accelerator 25 seat available immersive program one kind san antonio help land job glassdoors 1 best job americadata science method providing actionable intelligence data data revolution ha hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demand industryour program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust ha worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforcewe focus applied data science immediat

In [41]:
article_exclude_stopwords = remove_stopwords(lemmatized_article_list, exclude_stopwords=["the"])
article_exclude_stopwords

'the rumor true the time ha arrived codeup ha officially opened application new data science career accelerator 25 seat available immersive program one kind san antonio help land job glassdoors 1 best job americadata science method providing actionable intelligence data the data revolution ha hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet the growing demand industryour program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust ha worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem the entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition the workforcewe focus appli

In [42]:
article_extra_and_exclude_stopwords = remove_stopwords(lemmatized_article_list,
                                                       extra_stopwords=["rumor"],
                                                       exclude_stopwords=["a"])
article_extra_and_exclude_stopwords

'true time ha arrived codeup ha officially opened application new data science career accelerator 25 seat available immersive program one a kind san antonio help land a job glassdoors 1 best job americadata science a method providing actionable intelligence data data revolution ha hit san antonio resulting explosion data scientist position across company like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 a cybersecurity center school data science built a program specifically meet growing demand industryour program 18 week long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust ha worked heb capital group rackspace along input dozen practitioner hiring partner student work real data set realistic problem entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare a smooth transition workforcewe focus applied data science im

---
**Question**

How do we deal with line breaks?

Examples:
- `americadata`
- `industryour`
- `workforcewe`
- `developmentapplications`
- `antonioif`
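One possible partial answer, offered as a heuristic and not a full fix: some of these fused words come from a sentence ending directly against the next one in the scraped text (`America.Data`, `industry.Our`). Before lowercasing, a space can be inserted wherever a lowercase letter and sentence punctuation are followed immediately by a capital letter. The helper name `split_fused_sentences` is hypothetical. It will not help where there is no punctuation at the join (for example `GiustData`), and the cleaner fix is probably in the acquire step, whose code is not shown here.

```python
import re

def split_fused_sentences(text):
    """Insert a space between sentence-ending punctuation and a following capital letter."""
    # lookbehind: lowercase letter plus . ! or ?   lookahead: uppercase letter
    return re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", " ", text)

fused = "Best Job in America.Data Science meets this industry.Our program"
repaired = split_fused_sentences(fused)

print(repaired)  # Best Job in America. Data Science meets this industry. Our program
```

Because the pattern requires a capital letter after the punctuation, lowercase strings such as `codeup.com` are left alone, which is also why this has to run before `basic_clean`.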

---
### 6. Define a function named `prep_article` that takes in the dictionary representing an article and returns a dictionary that looks like this:

`{
    'title': 'the original title',
    'original': original,
    'stemmed': article_stemmed,
    'lemmatized': article_lemmatized,
    'clean': article_without_stopwords
}`

Note that if the original dictionary has a title property, it should remain unchanged (same goes for the category property).

In [43]:
def prep_article(dictionary, key):
    """
    This function accepts a dictionary representing a single article and the key (key parameter) under which
    its body of text is stored.
    The function then returns the dictionary with the stemmed, lemmatized, and cleaned text added under the
    "stemmed", "lemmatized", and "clean" keys.
    """
    
    # indexing the original content
    content = dictionary[key]
    
    # running basic_clean function on content
    content = basic_clean(content)
    
    # running tokenize function on content
    content = tokenize(content)
    
    # running stem function on content
    stem_list, stem_string = stem(content)
    
    # adding the stemmed text to the dictionary
    dictionary["stemmed"] = stem_string
    
    # running lemmatize function on content
    lemma_list, lemma_string = lemmatize(content)
    
    # adding the lemmatized text to the dictionary
    dictionary["lemmatized"] = lemma_string
    
    # running remove_stopwords on lemma_list
    cleaned_content = remove_stopwords(lemma_list)
    
    # adding the cleaned text to the dictionary
    dictionary["clean"] = cleaned_content
    
    return dictionary
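Note that `prep_article` as written adds the new keys to the dictionary it receives and returns that same object, so the entry in `blogs` is changed as well. If that side effect is unwanted, one option is to build a new dictionary instead. The sketch below shows the pattern in isolation; `with_clean_text` and its `clean` argument are hypothetical stand-ins, not part of the notebook, and `str.lower` stands in for the real cleaning pipeline.

```python
def with_clean_text(article, key, clean):
    """Return a copy of article with a 'clean' entry; the input dict is not modified."""
    # unpacking copies the existing keys into a new dict before adding 'clean'
    return {**article, "clean": clean(article[key])}

original = {"title": "Some Title", "original_content": "The Rumors Are TRUE"}
prepped = with_clean_text(original, "original_content", str.lower)

print("clean" in original)  # False
print(prepped["clean"])     # the rumors are true
```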

In [44]:
blogs = ac.get_blog_articles(urls=["https://codeup.com/codeups-data-science-career-accelerator-is-here/"])
blogs[0]

{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
 'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, 

In [45]:
dictionary = prep_article(blogs[0], "original_content")
dictionary

{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
 'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, 

In [46]:
blogs = ac.get_blog_articles(urls=["https://codeup.com/codeups-data-science-career-accelerator-is-here/"])
blogs

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group

In [47]:
blogs = prep_article(blogs[0], "original_content")
blogs

{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
 'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, 

In [48]:
news_articles = ac.get_news_articles(categories=["business", "sports", "technology", "entertainment"])
news_articles

[{'title': "India's GDP grows 3.1% in January-March quarter, lowest in 11 years",
  'original_content': "India's Gross Domestic Product (GDP) grew 3.1% in the January-March quarter from a year ago, slowing from downwardly revised 4.1% in the prior three months, official data showed on Friday. The is the slowest pace of expansion since the fourth quarter of 2008-09. The country was under the coronavirus lockdown during the last 7 days of the quarter.",
  'category': 'business'},
 {'title': 'Billionaire Icahn loses $2B selling entire stake in bankrupt 102-yr-old firm Hertz',
  'original_content': "Billionaire investor Carl Icahn has sold his entire stake in Hertz, a 102-year-old car rental company that filed for bankruptcy last week, at a loss of almost $2 billion. Icahn was the company's largest shareholder, having bought a 39% stake for an aggregate price of $1.88 billion. However, Icahn sold the stake at only $0.72 per share for $39.8 million.",
  'category': 'business'},
 {'title': '

In [49]:
news_articles = prep_article(news_articles[0], "original_content")
news_articles

{'title': "India's GDP grows 3.1% in January-March quarter, lowest in 11 years",
 'original_content': "India's Gross Domestic Product (GDP) grew 3.1% in the January-March quarter from a year ago, slowing from downwardly revised 4.1% in the prior three months, official data showed on Friday. The is the slowest pace of expansion since the fourth quarter of 2008-09. The country was under the coronavirus lockdown during the last 7 days of the quarter.",
 'category': 'business',
 'stemmed': "india ' s gross domest product gdp grew 31 in the januarymarch quarter from a year ago slow from downwardli revis 41 in the prior three month offici data show on friday the is the slowest pace of expans sinc the fourth quarter of 200809 the countri wa under the coronaviru lockdown dure the last 7 day of the quarter",
 'lemmatized': "india ' s gross domestic product gdp grew 31 in the januarymarch quarter from a year ago slowing from downwardly revised 41 in the prior three month official data showed on 

---
### 7. Define a function named `prepare_article_data` that takes in a list of article dictionaries (read: DataFrame), applies the `prep_article` function to each one, and returns the transformed data.

In [50]:
blogs = ac.get_blog_articles(urls=["https://codeup.com/codeups-data-science-career-accelerator-is-here/",
                       "https://codeup.com/data-science-myths/",
                       "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
                       "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
                       "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"])
blogs 

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group

In [51]:
blogs[1]

{'title': 'Data Science Myths - Codeup',
 'original_content': 'By Dimitri Antoniou and Maggie GiustData Science, Big Data, Machine Learning, NLP, Neural Networks…these buzzwords have rapidly spread into mainstream use over the last few years. Unfortunately, definitions are varied and sources of truth are limited. Data Scientists are in fact not magical unicorn wizards who can snap their fingers and turn a business around! Today, we’ll take a cue from our favorite Mythbusters to tackle some common myths and misconceptions in the field of Data Science.via GIPHYMyth #1: Data Science = StatisticsAt first glance, this one doesn’t sound unreasonable. Statistics is defined as, “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” That sounds a lot like our definition of Data Science: a method of drawing actionable intelligence from data. In truth, statistics is actually one small piece of Data Science. As our Senior Data

In [52]:
blogs[-1]

{'title': 'Competitor Bootcamps Are Closing. Is the Model in Danger? - Codeup',
 'original_content': 'Competitor Bootcamps Are Closing. Is the Model in Danger?\xa0Is the programming bootcamp model in danger?In recent news, DevBootcamp and The Iron Yard announced that they are closing their doors. This is big news. DevBootcamp was the first programming bootcamp model and The Iron Yard is a national player with 15 campuses across the U.S. In both cases, the companies cited an unsustainable business model. Does that mean the boot-camp model is dead?tl;dr “Nope!”Bootcamps exist because traditional education models have failed to provide students job-ready skills for the 21st century. Students demand better employment options from their education. Employers demand skilled and job ready candidates. Big Education’s failure to meet those needs through traditional methods created the fertile ground for the new business model of the programming bootcamp.Education giant Kaplan and Apollo Educatio

In [53]:
[prep_article(blog, "original_content") for blog in blogs]

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group

In [54]:
def prepare_article_data(list_of_dictionaries):
    """
    This function accepts a list of dictionaries and returns a list of dictionaries after applying the 
    prep_article function to each article in the original list.
    """
    
    # list comprehension applying prep_article function to each dictionary
    list_of_dictionaries = [prep_article(dictionary, "original_content") for dictionary in list_of_dictionaries]
    
    return list_of_dictionaries
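
The function above hard-codes the `"original_content"` key. A minimal sketch of a slightly more general variant follows, assuming the per-article function and the key are passed in as arguments. The names `prepare_articles` and `toy_prep` are hypothetical, and `toy_prep` only lowercases the text in place of the real cleaning steps.

```python
def prepare_articles(articles, prep, key="original_content"):
    """Apply `prep` to every article dict and return the results as a new list."""
    return [prep(article, key) for article in articles]


def toy_prep(article, key):
    # Hypothetical stand-in for prep_article: returns a new dict with a "clean" field.
    return {**article, "clean": article[key].lower()}


articles = [
    {"title": "A", "original_content": "First Post"},
    {"title": "B", "original_content": "Second Post"},
]
prepared = prepare_articles(articles, toy_prep)
```

Passing the function in keeps the list-level helper independent of any one cleaning pipeline, which makes it easier to try on small inputs like the two-article list here.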

In [55]:
blogs = ac.get_blog_articles(urls=["https://codeup.com/codeups-data-science-career-accelerator-is-here/",
                       "https://codeup.com/data-science-myths/",
                       "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
                       "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
                       "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"])
blogs 

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'original_content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group

In [56]:
blogs = prepare_article_data(blogs)
blogs = pd.DataFrame(blogs)
blogs

Unnamed: 0,title,original_content,stemmed,lemmatized,clean
0,Codeup’s Data Science Career Accelerator is He...,The rumors are true! The time has arrived. Cod...,the rumor are true the time ha arriv codeup ha...,the rumor are true the time ha arrived codeup ...,rumor true time ha arrived codeup ha officiall...
1,Data Science Myths - Codeup,By Dimitri Antoniou and Maggie GiustData Scien...,by dimitri antoni and maggi giustdata scienc b...,by dimitri antoniou and maggie giustdata scien...,dimitri antoniou maggie giustdata science big ...
2,Data Science VS Data Analytics: What’s The Dif...,"By Dimitri AntoniouA week ago, Codeup launched...",by dimitri antonioua week ago codeup launch ou...,by dimitri antonioua week ago codeup launched ...,dimitri antonioua week ago codeup launched imm...
3,10 Tips to Crush It at the SA Tech Job Fair - ...,10 Tips to Crush It at the SA Tech Job FairSA ...,10 tip to crush it at the sa tech job fairsa t...,10 tip to crush it at the sa tech job fairsa t...,10 tip crush sa tech job fairsa tech job fairt...
4,Competitor Bootcamps Are Closing. Is the Model...,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamp are close is the model in ...,competitor bootcamps are closing is the model ...,competitor bootcamps closing model danger prog...
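
A quick sanity check on the prepared data can catch articles whose text was lost along the way. The sketch below works on the list of dictionaries (before it is converted to a DataFrame; `DataFrame.to_dict("records")` should give an equivalent list afterwards). The helper name `check_prepared` and the sample rows are assumptions made for illustration.

```python
def check_prepared(articles, required=("stemmed", "lemmatized", "clean")):
    """Return (index, problem) pairs for articles that look incorrectly prepared."""
    problems = []
    for i, article in enumerate(articles):
        # Every prepared article should carry non-empty text under each new key.
        for column in required:
            if not article.get(column):
                problems.append((i, f"missing or empty {column!r}"))
        # Stopword removal only drops words, so "clean" should never be longer.
        clean_words = len(article.get("clean", "").split())
        lemma_words = len(article.get("lemmatized", "").split())
        if clean_words > lemma_words:
            problems.append((i, "'clean' has more words than 'lemmatized'"))
    return problems


sample = [
    {"stemmed": "the rumor are true", "lemmatized": "the rumor are true", "clean": "rumor true"},
    {"stemmed": "data scienc", "lemmatized": "data science", "clean": ""},
]
issues = check_prepared(sample)
```

An empty result means every article passed both checks; here the second sample row is flagged because its `clean` text is empty.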


In [57]:
news_articles = ac.get_news_articles(categories=["business", "sports", "technology", "entertainment"])
news_articles = prepare_article_data(news_articles)
news_articles = pd.DataFrame(news_articles)
news_articles

Unnamed: 0,title,original_content,category,stemmed,lemmatized,clean
0,India's GDP grows 3.1% in January-March quarte...,India's Gross Domestic Product (GDP) grew 3.1%...,business,india ' s gross domest product gdp grew 31 in ...,india ' s gross domestic product gdp grew 31 i...,india ' gross domestic product gdp grew 31 jan...
1,Billionaire Icahn loses $2B selling entire sta...,Billionaire investor Carl Icahn has sold his e...,business,billionair investor carl icahn ha sold hi enti...,billionaire investor carl icahn ha sold his en...,billionaire investor carl icahn ha sold entire...
2,Sun Pharma to begin clinical trial of pancreat...,Sun Pharma said on Friday that it has received...,business,sun pharma said on friday that it ha receiv ap...,sun pharma said on friday that it ha received ...,sun pharma said friday ha received approval dr...
3,No investment proposal from Google being consi...,After media reports said Google is looking to ...,business,after media report said googl is look to buy a...,after medium report said google is looking to ...,medium report said google looking buy around 5...
4,Elon Musk earns over $700 million in his first...,Tesla CEO Elon Musk has earned the first tranc...,business,tesla ceo elon musk ha earn the first tranch o...,tesla ceo elon musk ha earned the first tranch...,tesla ceo elon musk ha earned first tranche pe...
...,...,...,...,...,...,...
95,Web platforms a boon for content-based films: ...,South Indian actress Jyotika has said streamin...,entertainment,south indian actress jyotika ha said stream pl...,south indian actress jyotika ha said streaming...,south indian actress jyotika ha said streaming...
96,Released 'Candle' to give people hope: Madhuri...,Madhuri Dixit has said she thought her debut s...,entertainment,madhuri dixit ha said she thought her debut so...,madhuri dixit ha said she thought her debut so...,madhuri dixit ha said thought debut song ' can...
97,He is living: Dylan on how twin Cole is doing ...,Talking about how his twin brother Cole Sprous...,entertainment,talk about how hi twin brother cole sprous is ...,talking about how his twin brother cole sprous...,talking twin brother cole sprouse following br...
98,Something shockingly scary is coming: Nushrat ...,"Actress Nushrat Bharucha, who was previously s...",entertainment,actress nushrat bharucha who wa previous seen ...,actress nushrat bharucha who wa previously see...,actress nushrat bharucha wa previously seen ' ...
