## NLP Prepare Exercises

The end result of this exercise should be a file named ```prepare.py``` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

**1. Define a function named ```basic_clean```. It should take in a string and apply some basic text cleaning to it:**
- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or single quote (replace it with an empty string).

In [1]:
#imports
import unicodedata
import re
import json

import acquire
import prepare

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

In [2]:
def basic_clean(text):
    '''
    This function takes in a string and normalizes it by lowercasing
    everything, normalizing unicode characters to ASCII, and removing
    anything that is not a letter, number, whitespace, or single quote.
    '''
    
    #lowercase all letters in the text
    text = text.lower()
    
    # normalize unicode by encoding into ASCII (ignore non-ASCII characters)
    # then decoding back into unicode 
    text = unicodedata.normalize('NFKD', text)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

    # remove anything that is not a letter, number, single quote, or whitespace
    text = re.sub(r"[^a-z0-9'\s]", '', text)
    
    return text

In [3]:
text = "HERE is a sTring with lőt^^^S óf s\\##tranGe things+ go%$ing ón. I'm attempting to 'normalize' some text."
text



"HERE is a sTring with lőt^^^S óf s\\##tranGe things+ go%$ing ón. I'm attempting to 'normalize' some text."

In [4]:
text = basic_clean(text)
text

"here is a string with lots of strange things going on i'm attempting to 'normalize' some text"

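To see why the normalize/encode/decode chain strips accents, the steps can be looked at one at a time. This is a minimal sketch, not part of the exercise solution: `show_cleaning_steps` is a hypothetical helper name, and the sample string is made up for illustration.

```python
import re
import unicodedata

def show_cleaning_steps(text):
    """Return each intermediate result of the basic_clean steps (hypothetical helper)."""
    lowered = text.lower()
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', lowered)
    # Encoding to ASCII with 'ignore' drops the combining marks, keeping the base letters
    ascii_only = decomposed.encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Remove everything except lowercase letters, digits, single quotes, and whitespace
    stripped = re.sub(r"[^a-z0-9'\s]", '', ascii_only)
    return lowered, decomposed, ascii_only, stripped

# "Café ón!" written with escapes so the accented letters are single code points
lowered, decomposed, ascii_only, stripped = show_cleaning_steps("Caf\u00e9 \u00f3n!")
print(len(lowered), len(decomposed))  # the decomposed string is longer
print(ascii_only)
print(stripped)
```

One consequence of this approach is that characters with no ASCII base letter in their decomposition (for example `ø`) are dropped entirely rather than approximated.
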
**2.  Define a function named ```tokenize```. It should take in a string and tokenize all the words in the string.**

In [5]:
def tokenize(text):
    '''
    This function takes in a string and returns the string with the
    words tokenized.
    '''

    # Create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()

    # Use the tokenizer
    text = tokenizer.tokenize(text, return_str=True)
    
    return text


In [6]:
text = tokenize(text)
text

"here is a string with lots of strange things going on i ' m attempting to ' normalize ' some text"

**3.  Define a function named ```stem```. It should accept some text and return the text after applying stemming to all the words.**

In [7]:
def stem(text):
    '''
    This function takes in a string and returns the string after applying
    stemming to all the words.
    '''

    # Create the porter stemmer
    ps = nltk.porter.PorterStemmer()

    # Apply the stemmer to each word in our string.
    stems = [ps.stem(word) for word in text.split()]
    
    # Join our lists of words into a string again
    text_stemmed = ' '.join(stems)

    return text_stemmed


In [8]:
stem(text)

"here is a string with lot of strang thing go on i ' m attempt to ' normal ' some text"

**4. Define a function named ```lemmatize```. It should accept some text and return the text after applying lemmatization to each word.**

In [9]:
def lemmatize(text):
    '''
    This function takes in a string and returns the string after applying
    lemmatization to all the words.
    '''

    # Create the Lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()

    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in text.split()]

    # Join our list of words into a string again; assign to a variable to save changes.
    text_lemmatized = ' '.join(lemmas)
    
    return text_lemmatized

In [10]:
text = lemmatize(text)
text

"here is a string with lot of strange thing going on i ' m attempting to ' normalize ' some text"

**5. Define a function named ```remove_stopwords```. It should accept some text and return the text after removing all the stopwords.**

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [11]:
# import standard English language stopwords list from nltk
from nltk.corpus import stopwords

def remove_stopwords(text, extra_words=None, exclude_words=None):
    '''
    This function takes in a string and optional lists of extra_words to add to
    the stopword list and exclude_words to keep out of it, then returns the
    string with the stopwords removed.
    '''

    # Define the stop word list
    stopword_list = stopwords.words('english')

    # add extra_words (if any) to the stopwords list;
    # list.append returns None and would add the whole list as a single item
    if extra_words:
        stopword_list = stopword_list + list(extra_words)

    # remove exclude_words (if any) from the stopwords list
    if exclude_words:
        stopword_list = [word for word in stopword_list if word not in exclude_words]

    # Split words in text.
    text = text.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in text if word not in stopword_list]
    
    # Join words in the list back into strings; assign to a variable to keep changes.
    text_without_stopwords = ' '.join(filtered_words)

    return text_without_stopwords



In [12]:
remove_stopwords(text)

"string lot strange thing going ' attempting ' normalize ' text"

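The effect of the two optional parameters can be sketched with plain set operations. This is only an illustration under stated assumptions: `adjust_stopwords` is a hypothetical helper, and `base` is a small hand-picked stand-in for the NLTK English stopword list so that the example runs without downloading any corpus.

```python
def adjust_stopwords(base_words, extra_words=None, exclude_words=None):
    """Return a stopword set with extra_words added and exclude_words removed."""
    stopword_set = set(base_words)
    stopword_set |= set(extra_words or [])    # words we additionally want removed
    stopword_set -= set(exclude_words or [])  # words we want to keep in the text
    return stopword_set

# Hypothetical, hand-picked stand-in for the NLTK English stopword list
base = ['a', 'is', 'no', 'of', 'on', 'with']

# Treat the stray single quote as a stopword, but keep the word 'no'
custom = adjust_stopwords(base, extra_words=["'"], exclude_words=['no'])
words = "there is no string with ' quotes".split()
filtered = ' '.join(word for word in words if word not in custom)
print(filtered)  # there no string quotes
```

Passing `extra_words=["'"]` in this way is one option for dropping the lone apostrophes that tokenization leaves behind, while `exclude_words` protects words such as `no` that may carry meaning for a later model.
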
**6. Use your data from the acquire exercises to produce a dataframe of the news articles. Name the dataframe ```news_df```.**

In [13]:
articles = acquire.acquire_news_articles()
articles



| | title | content | category |
|---|---|---|---|
| 0 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | business |
| 1 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | business |
| 2 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | business |
| 3 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | business |
| 4 | Samsung pledges ₹37 crore to India to fight CO... | Samsung has pledged $5 million (around ₹37 cro... | business |
| ... | ... | ... | ... |
| 142 | Prez Biden raises US' annual refugee admission... | US President Joe Biden has raised the maximum ... | world |
| 143 | Myanmar's military govt bans satellite TV citi... | Myanmar's military government has announced a ... | world |
| 144 | Egypt buys 30 more Rafale jets from France in ... | Egypt will buy 30 more Rafale fighter jets fro... | world |
| 145 | Further violence in Myanmar could lead to civi... | China's Ambassador to the UN Zhang Jun on Mond... | world |


**7.  Make another dataframe for the Codeup blog posts. Name the dataframe ```codeup_df```.**

In [14]:
# acquire the dataframe of codeup blog articles
blog = acquire.acquire_codeup_blog()
blog



| | title | published_date | blog_image | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | September 30, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | October 31, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | October 17, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | August 14, 2018 | | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | August 14, 2018 | | Competitor Bootcamps Are Closing. Is the Model... |


**8.  For each dataframe, produce the following columns:**
- ```title``` to hold the title
- ```original``` to hold the original article/post content
- ```clean``` to hold the normalized and tokenized original with the stopwords removed.
- ```stemmed``` to hold the stemmed version of the cleaned data.
- ```lemmatized``` to hold the lemmatized version of the cleaned data.

In [15]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df with
    the article title, the original text, the stemmed text, the lemmatized text,
    and the clean text (normalized, tokenized, stopwords removed, then lemmatized).
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)\
                            .apply(lemmatize)
    
    df['stemmed'] = df[column].apply(basic_clean).apply(stem)
    
    df['lemmatized'] = df[column].apply(basic_clean).apply(lemmatize)
    
    return df[['title', column, 'stemmed', 'lemmatized', 'clean']]

In [16]:
prep_article_data(articles,'content')

| | title | content | stemmed | lemmatized | clean |
|---|---|---|---|---|---|
| 0 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | indian commerci pilot associ icpa on tuesday s... | indian commercial pilot association icpa on tu... | indian commercial pilot association icpa tuesd... |
| 1 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | speak about india' second covid19 wave former ... | speaking about india's second covid19 wave for... | speaking india ' second covid19 wave former rb... |
| 2 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | pandora the world' biggest jewel ha said that ... | pandora the world's biggest jeweller ha said t... | pandora world ' biggest jeweller said ' stop u... |
| 3 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | south korea richest woman hong rahe ad anoth 7... | south korea richest woman hong rahee added ano... | south korea richest woman hong rahee added ano... |
| 4 | Samsung pledges ₹37 crore to India to fight CO... | Samsung has pledged $5 million (around ₹37 cro... | samsung ha pledg 5 million around 37 crore to ... | samsung ha pledged 5 million around 37 crore t... | samsung pledged 5 million around 37 crore help... |
| ... | ... | ... | ... | ... | ... |
| 142 | Prez Biden raises US' annual refugee admission... | US President Joe Biden has raised the maximum ... | us presid joe biden ha rais the maximum number... | u president joe biden ha raised the maximum nu... | u president joe biden raised maximum number re... |
| 143 | Myanmar's military govt bans satellite TV citi... | Myanmar's military government has announced a ... | myanmar' militari govern ha announc a ban on s... | myanmar's military government ha announced a b... | myanmar ' military government announced ban sa... |
| 144 | Egypt buys 30 more Rafale jets from France in ... | Egypt will buy 30 more Rafale fighter jets fro... | egypt will buy 30 more rafal fighter jet from ... | egypt will buy 30 more rafale fighter jet from... | egypt buy 30 rafale fighter jet france 4 billi... |
| 145 | Further violence in Myanmar could lead to civi... | China's Ambassador to the UN Zhang Jun on Mond... | china' ambassador to the un zhang jun on monda... | china's ambassador to the un zhang jun on mond... | china ' ambassador un zhang jun monday said vi... |


In [17]:
prep_article_data(blog,'content')

| | title | content | stemmed | lemmatized | clean |
|---|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... | the rumor are true the time ha arriv codeup ha... | the rumor are true the time ha arrived codeup ... | rumor true time arrived codeup officially open... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... | by dimitri antoni and maggi giust data scienc ... | by dimitri antoniou and maggie giust data scie... | dimitri antoniou maggie giust data science big... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... | by dimitri antoni a week ago codeup launch our... | by dimitri antoniou a week ago codeup launched... | dimitri antoniou week ago codeup launched imme... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... | sa tech job fair the third biannual san antoni... | sa tech job fair the third biannual san antoni... | sa tech job fair third biannual san antonio te... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamp are close is the model in ... | competitor bootcamps are closing is the model ... | competitor bootcamps closing model danger prog... |


**9.  Ask yourself:**

- **If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?**

Lemmatized. Lemmatization takes longer but produces better-quality text, and with so little data the extra time should be negligible.


- **If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?**

Lemmatized. It takes longer but produces better-quality text, and 25MB is still small enough that the extra time should not be a problem.


- **If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?**

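Stemmed, because stemming runs faster, and at that scale with metered compute the extra processing time of lemmatization translates directly into cost.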
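
One way to ground these answers is to time each function on a small sample and scale the result to the corpus size. The sketch below is an assumption-laden illustration: `estimate_seconds` is a hypothetical helper, it assumes runtime grows linearly with the number of bytes, and `toy_stem` is a deliberately crude stand-in so the example runs without NLTK. In the notebook, the `stem` and `lemmatize` functions defined earlier would be passed in instead.

```python
import time

def estimate_seconds(func, sample_text, corpus_bytes, repeats=5):
    """Time func on sample_text and scale the best run linearly to corpus_bytes."""
    sample_bytes = len(sample_text.encode('utf-8'))
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func(sample_text)
        best = min(best, time.perf_counter() - start)
    # Assumption: processing time is proportional to the size of the text
    return best * corpus_bytes / sample_bytes

def toy_stem(text):
    # Hypothetical stand-in: strips a trailing 's'; this is NOT a real stemmer
    return ' '.join(w[:-1] if w.endswith('s') else w for w in text.split())

sample = "lots of strange things going on " * 1000
seconds = estimate_seconds(toy_stem, sample, corpus_bytes=25 * 1024 ** 2)
print(f"estimated seconds for 25MB: {seconds:.2f}")
```

Comparing the two estimates for a given corpus size gives a rough sense of how much extra time lemmatization would cost before committing to it.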