# NLP Preparation Exercises

The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we define functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles acquired previously.

### Imports

In [7]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire
from time import strftime

import warnings
warnings.filterwarnings('ignore')


### Acquire Codeup Blog Data

In [9]:
original = acquire.get_blog_articles_info()
print(original)


                                                title     published  \
0                            Codeup Dallas Open House  Nov 30, 2021   
1   Codeup’s Placement Team Continues Setting Records  Nov 19, 2021   
2   IT Certifications 101: Why They Matter, and Wh...  Nov 18, 2021   
3   A rise in cyber attacks means opportunities fo...  Nov 17, 2021   
4    Use your GI Bill® benefits to Land a Job in Tech   Nov 4, 2021   
5   Which program is right for me: Cyber Security ...  Oct 28, 2021   
6                What the Heck is System Engineering?  Oct 21, 2021   
7      From Speech Pathology to Business Intelligence  Oct 18, 2021   
8                       Boris – Behind the Billboards   Oct 3, 2021   
9   Is Codeup the Best Bootcamp in San Antonio…or ...  Sep 16, 2021   
10           Codeup Launches First Podcast: Hire Tech  Aug 25, 2021   
11        Why Should I Become a System Administrator?  Aug 23, 2021   
12        Announcing our Candidacy for Accreditation!  Jun 30, 2021   
13  Co

### Exercises

1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [10]:
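def basic_clean(string):
    '''
    This function takes in a string and returns it lowercased,
    folded to ASCII, and stripped of punctuation.
    '''
    # Decompose unicode characters, then drop anything without an ASCII equivalent.
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # Remove everything except word characters and whitespace, then lowercase.
    # Note: \w keeps underscores, and this pattern also removes single quotes.
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string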

In [11]:
basic_clean(str(original.content))

'0     come join us for the reopening of our dallas \n1     our placement team is simply defined as a grou\n2     aws google azure red hat comptiathese are\n3     in the last few months the us has experienced\n4     as the end of military service gets closer ma\n5     what it career should i choosenif youre thi\n6     codeup offers a 13week training program syst\n7     by alicia gonzaleznbefore codeup i was a ho\n8                                                      \n9     looking for the best data science bootcamp in \n10    any podcast enthusiasts out there we are plea\n11    with so many tech careers in demand why choos\n12    did you know that even though were an indepen\n13    codeup is moving into another floor of our his\n14    happy pride month pride month is a dedicated \nname content dtype object'
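The pattern `[^\w\s]` used in `basic_clean` removes single quotes and keeps underscores, which differs slightly from the exercise description. A minimal sketch of a variant that follows the description more literally is shown here; the name `basic_clean_keep_quotes` is hypothetical and not part of the original notebook. Note that curly apostrophes (as in "you’re") have no ASCII equivalent, so they are still dropped by the ASCII step.

```python
import re
import unicodedata


def basic_clean_keep_quotes(text):
    """Lowercase, fold to ASCII, and keep only letters, digits, whitespace, and single quotes."""
    text = text.lower()
    # NFKD splits accented characters into a base letter plus a combining mark;
    # the ASCII round trip then drops whatever has no ASCII equivalent.
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8'))
    # Explicit character class: unlike \w it excludes underscores and keeps "'".
    return re.sub(r"[^a-z0-9'\s]", '', text)


print(basic_clean_keep_quotes("Café's Re-Opening: 13-week_program!"))
# prints: cafe's reopening 13weekprogram
```

Whether to keep apostrophes depends on the tokenizer and stopword list used afterwards, so treat this as one option rather than a required change.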

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [12]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

In [13]:
tokenize(str(original.content))

'0 Come join us for the re-opening of our Dallas ... \n1 Our Placement Team is simply defined as a grou ... \n2 AWS , Google , Azure , Red Hat , CompTIA … these are ... \n3 In the last few months , the US has experienced ... \n4 As the end of military service gets closer , ma ... \n5 What IT Career should I choose?\\nIf you ’ re thi ... \n6 Codeup offers a 13-week training program : Syst ... \n7 By : Alicia Gonzalez\\nBefore Codeup , I was a ho ... \n8 \n9 Looking for the best data science bootcamp in ... \n10 Any podcast enthusiasts out there ? We are plea ... \n11 With so many tech careers in demand , why choos ... \n12 Did you know that even though we ’ re an indepen ... \n13 Codeup is moving into another floor of our His ... \n14 Happy Pride Month ! Pride Month is a dedicated ... \nName : content , dtype : object'
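To see what the tokenizer does on its own, it can help to run it on a single sentence rather than on the string form of a whole Series. The sketch below assumes only that `nltk` is installed; the sample sentence is made up for illustration.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
sentence = "In the last few months, the US has offered a 13-week program"

# Default call returns a list of tokens.
tokens = tokenizer.tokenize(sentence)

# return_str=True returns the same tokens as one space-separated string,
# which is the form the tokenize function above returns.
as_string = tokenizer.tokenize(sentence, return_str=True)

print(tokens)
print(as_string)
```

The comma is split off as its own token while the hyphenated `13-week` stays together, matching the behavior visible in the output above.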

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [14]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.stem.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

In [15]:
stem(str(original.content))

'0 come join us for the re-open of our dalla ... 1 our placement team is simpli defin as a grou... 2 aws, google, azure, red hat, comptia…thes are... 3 in the last few months, the us ha experienced... 4 as the end of militari servic get closer, ma... 5 what it career should i choose?\\nif you’r thi... 6 codeup offer a 13-week train program: syst... 7 by: alicia gonzalez\\nbefor codeup, i wa a ho... 8 9 look for the best data scienc bootcamp in ... 10 ani podcast enthusiast out there? we are plea... 11 with so mani tech career in demand, whi choos... 12 did you know that even though we’r an indepen... 13 codeup is move into anoth floor of our his... 14 happi pride month! pride month is a dedic ... name: content, dtype: object'
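The stemmer works one word at a time, so its effect is easiest to see on a short list of words. This sketch assumes `nltk` is installed and uses words that appear in the blog content above; the variable names are illustrative only.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words taken from the blog text shown above.
words = ['simply', 'defined', 'military', 'has', 'dallas']

# Stem each word independently; the stemmer also lowercases its input.
stems = [ps.stem(word) for word in words]

for word, stemmed in zip(words, stems):
    print(f'{word} -> {stemmed}')
```

Stems such as `ha` and `dalla` are not real words, which is the usual trade-off of rule-based stemming.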

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [16]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [17]:
lemmatize(str(original.content))

'0 Come join u for the re-opening of our Dallas ... 1 Our Placement Team is simply defined a a grou... 2 AWS, Google, Azure, Red Hat, CompTIA…these are... 3 In the last few months, the US ha experienced... 4 As the end of military service get closer, ma... 5 What IT Career should I choose?\\nIf you’re thi... 6 Codeup offer a 13-week training program: Syst... 7 By: Alicia Gonzalez\\nBefore Codeup, I wa a ho... 8 9 Looking for the best data science bootcamp in ... 10 Any podcast enthusiast out there? We are plea... 11 With so many tech career in demand, why choos... 12 Did you know that even though we’re an indepen... 13 Codeup is moving into another floor of our His... 14 Happy Pride Month! Pride Month is a dedicated ... Name: content, dtype: object'

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, `extra_words` and `exclude_words`: additional stopwords to remove, and words that should be kept even though they are stopwords.

In [18]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string and optional extra_words and exclude_words
    lists (both default to no words) and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words or [])
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words or []))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [20]:
remove_stopwords(str(original.content))

'0 Come join us re-opening Dallas ... 1 Our Placement Team simply defined grou... 2 AWS, Google, Azure, Red Hat, CompTIA…these are... 3 In last months, US experienced... 4 As end military service gets closer, ma... 5 What IT Career I choose?\\nIf you’re thi... 6 Codeup offers 13-week training program: Syst... 7 By: Alicia Gonzalez\\nBefore Codeup, I ho... 8 9 Looking best data science bootcamp ... 10 Any podcast enthusiasts there? We plea... 11 With many tech careers demand, choos... 12 Did know even though we’re indepen... 13 Codeup moving another floor His... 14 Happy Pride Month! Pride Month dedicated ... Name: content, dtype: object'
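The set arithmetic behind `extra_words` and `exclude_words` can be checked without downloading the NLTK stopword corpus. The following is a minimal sketch with a tiny stand-in stopword list; `filter_stopwords` and `demo_stopwords` are hypothetical names used only for this illustration.

```python
def filter_stopwords(text, stopword_list, extra_words=None, exclude_words=None):
    """Drop stopwords from a whitespace-separated string."""
    # Working set: base list, minus words to keep, plus extra words to drop.
    stop = (set(stopword_list) - set(exclude_words or [])) | set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stop)


# Tiny stand-in list; the real one comes from stopwords.words('english').
demo_stopwords = ['the', 'is', 'no', 'a', 'of']
text = 'there is no end of the codeup blog'

print(filter_stopwords(text, demo_stopwords))
# prints: there end codeup blog
print(filter_stopwords(text, demo_stopwords,
                       extra_words=['codeup'], exclude_words=['no']))
# prints: there no end blog
```

Matching is case-sensitive, which is why capitalized words such as "Our" and "In" survive in the output above: the text was not lowercased before the stopwords were removed.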

6. Use your data from the acquire step to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [29]:
# # use all the functions to see if they work on the content column
# original['content'].apply(basic_clean)\
# .apply(tokenize)\
# .apply(lemmatize)\
# .apply(remove_stopwords)

In [43]:
news_df = pd.read_json('inshorts-2022-02-07.json')

In [44]:
news_df.head()

| | title | published | author | content | category |
|---|---|---|---|---|---|
| 0 | Start demolition of Supertech's 40-storey twin... | 2022-02-07T11:05:31.000Z | Ridham Gambhir | The Supreme Court on Monday ordered that the d... | business |
| 1 | The focus is on growth and adoption for Digita... | 2022-02-07T06:10:25.000Z | Roshan Gupta | Union Budget 2022 is focused on promoting digi... | business |
| 2 | COVID, you did your worst & stole our voice: M... | 2022-02-06T15:37:00.000Z | Sakshita Khosla | Businessman Anand Mahindra on Sunday shared a ... | business |
| 3 | Meta says data regulations may cause Facebook,... | 2022-02-07T09:20:26.000Z | Pragya Swastik | Meta in its annual report to the US SEC said i... | business |
| 4 | Wishy-washy words not needed, say sorry: Shiv ... | 2022-02-07T11:20:42.000Z | Pragya Swastik | Shiv Sena's MP Priyanka Chaturvedi tweeted to ... | business |


In [21]:
def prep_article_data(df, column, extra_words=None, exclude_words=None):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df
    with the article title, the original text, the cleaned and tokenized text,
    the stemmed text, and the lemmatized text, each with stopwords removed.
    '''
    # Use fresh lists when no extra or excluded words are passed.
    extra_words = extra_words or []
    exclude_words = exclude_words or []
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [45]:
prep_article_data(news_df, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Start demolition of Supertech's 40-storey twin... | The Supreme Court on Monday ordered that the d... | supreme court monday ordered demolition works ... | suprem court monday order demolit work superte... | supreme court monday ordered demolition work s... |
| 1 | The focus is on growth and adoption for Digita... | Union Budget 2022 is focused on promoting digi... | union budget 2022 focused promoting digital ec... | union budget 2022 focus promot digit economi f... | union budget 2022 focused promoting digital ec... |
| 2 | COVID, you did your worst & stole our voice: M... | Businessman Anand Mahindra on Sunday shared a ... | businessman anand mahindra sunday shared pictu... | businessman anand mahindra sunday share pictur... | businessman anand mahindra sunday shared pictu... |
| 3 | Meta says data regulations may cause Facebook,... | Meta in its annual report to the US SEC said i... | meta annual report us sec said might stop oper... | meta annual report us sec said might stop oper... | meta annual report u sec said might stop opera... |
| 4 | Wishy-washy words not needed, say sorry: Shiv ... | Shiv Sena's MP Priyanka Chaturvedi tweeted to ... | shiv senas mp priyanka chaturvedi tweeted hyun... | shiv sena mp priyanka chaturvedi tweet hyundai... | shiv senas mp priyanka chaturvedi tweeted hyun... |


7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [41]:
codeup_df = pd.read_json('codeup_blog_2022-02-07.json')

In [42]:
codeup_df.head()

| | title | published | content |
|---|---|---|---|
| 0 | Codeup Dallas Open House | Nov 30, 2021 | Come join us for the re-opening of our Dallas ... |
| 1 | Codeup’s Placement Team Continues Setting Records | Nov 19, 2021 | Our Placement Team is simply defined as a grou... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | Nov 18, 2021 | AWS, Google, Azure, Red Hat, CompTIA…these are... |
| 3 | A rise in cyber attacks means opportunities fo... | Nov 17, 2021 | In the last few months, the US has experienced... |
| 4 | Use your GI Bill® benefits to Land a Job in Tech | Nov 4, 2021 | As the end of military service gets closer, ma... |


8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.


In [47]:
# content = original

In [46]:
prep_article_data(codeup_df, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup Dallas Open House | Come join us for the re-opening of our Dallas ... | come join us reopening dallas campus drinks sn... | come join us reopen dalla campu drink snack co... | come join u reopening dallas campus drink snac... |
| 1 | Codeup’s Placement Team Continues Setting Records | Our Placement Team is simply defined as a grou... | placement team simply defined group manages re... | placement team simpli defin group manag relati... | placement team simply defined group manages re... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | AWS, Google, Azure, Red Hat, CompTIA…these are... | aws google azure red hat comptiathese big name... | aw googl azur red hat comptiathes big name onl... | aws google azure red hat comptiathese big name... |
| 3 | A rise in cyber attacks means opportunities fo... | In the last few months, the US has experienced... | last months us experienced dozens major cybera... | last month us experienc dozen major cyberattac... | last month u experienced dozen major cyberatta... |
| 4 | Use your GI Bill® benefits to Land a Job in Tech | As the end of military service gets closer, ma... | end military service gets closer many transiti... | end militari servic get closer mani transit se... | end military service get closer many transitio... |
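Exercise 8 asks for a column named `original`, while the output above labels the raw text `content`. One way to close that gap is to rename the column after the prepared columns are built. This is a sketch under stated assumptions: it needs `pandas`, the helper `add_original_column` and the `demo` dataframe are hypothetical, and `str.lower` stands in for the real cleaning pipeline.

```python
import pandas as pd


def add_original_column(df, column='content'):
    """Return a copy of df with the raw text column renamed to 'original'."""
    # rename returns a new dataframe, so the input is left unchanged.
    return df.rename(columns={column: 'original'})


demo = pd.DataFrame({
    'title': ['Post A', 'Post B'],
    'content': ['First post text.', 'Second post text.'],
})

# Apply a function to each row's text; str.lower is only a placeholder here.
demo['clean'] = demo['content'].apply(str.lower)

result = add_original_column(demo)[['title', 'original', 'clean']]
print(list(result.columns))
# prints: ['title', 'original', 'clean']
```

Using `.apply` on the column processes each article separately, unlike calling a function on `str(series)`, which only sees the truncated printed form of the Series.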


Ask yourself:

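- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - Lemmatized. The corpus is small enough that the extra cost of lemmatization is negligible, and lemmas are real words.
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - Lemmatized. This is still small enough to process in a reasonable time on a single machine.
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - Stemmed. Stemming is the cheaper operation, and at this scale the cost difference adds up.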