The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define functions to prepare text data. These functions should work equally well on the Codeup blog articles and the news articles that were previously acquired.

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import warnings
warnings.filterwarnings("ignore")

import acquire

### 1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote (that is, replace it with an empty string).

In [2]:
def basic_clean(string):
    '''
    This function takes in a string and applies basic text cleaning by:
    lowercasing everything,
    normalizing unicode characters,
    replacing anything that is not a letter, number, whitespace, or a single quote
    
    and returns the cleaned string.
    '''
    
    #lowercase
    string = string.lower()
    
    #normalize unicode chars
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    #replace anything not a letter, number, whitespace, or single quote
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    
    return string

In [3]:
#test function
string = "Hello, I am testing this function! I wonder if this'll work?"
string = basic_clean(string)

string

"hello i am testing this function i wonder if this'll work"

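To see what the normalization step in `basic_clean` does on its own, here is a minimal sketch using only the standard library. The helper name `normalize_ascii` is made up for illustration. It shows that accented letters keep their base letter, while characters with no ASCII decomposition, such as a curly apostrophe, are dropped entirely. That is likely why `Korea’s` shows up as `koreas` in the prepared news data further down.

```python
import unicodedata

def normalize_ascii(text):
    # NFKD splits an accented letter into its base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops anything non-ASCII
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(normalize_ascii("café naïve"))  # accents are stripped, base letters stay
print(normalize_ascii("Korea’s"))     # the curly apostrophe has no ASCII form and disappears
```

A straight single quote (`'`) is already ASCII, so it passes through this step untouched and is then kept by the regex.
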
### 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [4]:
def tokenize(string):
    '''
    This function takes in a string, 
    tokenizes all the words in the string,
    
    and returns the tokenized string.
    '''
    
    #create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [5]:
#test function
string = tokenize(string)
string

"hello i am testing this function i wonder if this ' ll work"

### 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [6]:
def stem(string):
    '''
    This function takes in some text
    
    and returns the text after applying stemming to all the words.
    '''
    
    #create the porter stemmer
    ps = nltk.stem.PorterStemmer()
    
    #apply the stemmer to each word in the string
    stems = [ps.stem(word) for word in string.split()]
    
    #join stemmed list of words into a string again
    string_stemmed = ' '.join(stems)
    
    return string_stemmed

In [7]:
#test function
string = stem(string)
string

"hello i am test thi function i wonder if thi ' ll work"

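The output above shows `this` stemmed to `thi`. A stemmer strips suffixes by rule and does not check whether the result is a real word. The sketch below implements only the first rule group of the original Porter algorithm (step 1a) to make that visible. It is a simplified illustration, not NLTK's implementation: `PorterStemmer` runs several more steps and, as far as I know, applies some extensions by default, so results for other words can differ.

```python
def porter_step_1a(word):
    # Rules are tried in order; the first matching suffix wins
    if word.endswith('sses'):
        return word[:-2]   # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]   # ponies -> poni
    if word.endswith('ss'):
        return word        # caress stays caress
    if word.endswith('s'):
        return word[:-1]   # cats -> cat, this -> thi
    return word

for word in ['caresses', 'ponies', 'caress', 'cats', 'this']:
    print(word, '->', porter_step_1a(word))
```
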
### 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [8]:
def lemmatize(string):
    '''
    This function takes in some text
    
    and returns the text after applying lemmatization to each word.
    '''
    
    #create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    #use the lemmatizer on each word in the list of words created by using split
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    #join lemmatized list of words into a string again
    string_lemmatized = ' '.join(lemmas)
    
    return string_lemmatized

In [9]:
#test function w/ lemmatize instead of stem
string = "Hello, I am testing this function! I wonder if this'll work?"
string = basic_clean(string)
string = tokenize(string)
string = lemmatize(string)
string

"hello i am testing this function i wonder if this ' ll work"

### 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords. 

- This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [10]:
def remove_stopwords(string, extra_words = [], exclude_words = []): 
    '''
    This function takes in a string, 
    optional extra_words (additional stop words),
    and optional exclude_words (words that won't be removed) parameters with default empty lists,
    
    and returns a string after removing all the stopwords.
    '''

    #create stopword_list
    stopword_list = stopwords.words('english')
    
    #remove 'exclude_words' from stopword_list to keep these in my text
    stopword_list = set(stopword_list) - set(exclude_words)
    
    #add in 'extra_words' to stopword_list
    stopword_list = stopword_list.union(set(extra_words))
    
    #split words in string
    words = string.split()
    
    #create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    #join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [11]:
#test function
string = remove_stopwords(string)
string

"hello testing function wonder ' work"

In [12]:
#test function w/ extra words and exclude words specified
string = "Hello, I am testing this function! I wonder if this'll work?"
string = basic_clean(string)
string = tokenize(string)
string = lemmatize(string)

string = remove_stopwords(string, extra_words = ['hello'], exclude_words = ['if', 'i'])
string

"i testing function i wonder if ' work"

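The set arithmetic inside `remove_stopwords` can be tried in isolation. The sketch below uses a tiny made-up stop word set instead of NLTK's list, so it runs without any downloads. `build_stopword_set` is a hypothetical helper, and `'ll'` is assumed to be a stop word because it disappears in the output above while the bare apostrophe survives.

```python
# Toy stop word set for illustration only; NLTK's English list is much longer
base_stopwords = {'i', 'am', 'this', 'if', 'll'}

def build_stopword_set(base, extra_words=(), exclude_words=()):
    # Same order as remove_stopwords: drop the excluded words, then add the extras
    return (set(base) - set(exclude_words)) | set(extra_words)

stopword_set = build_stopword_set(base_stopwords,
                                  extra_words=['hello'],
                                  exclude_words=['if', 'i'])

tokens = "hello i am testing this function i wonder if this ' ll work".split()
kept = ' '.join(token for token in tokens if token not in stopword_set)
print(kept)
```

Because the extra words are added after the excluded words are removed, a word passed in both lists ends up being treated as a stop word.
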
### 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [13]:
categories = ['business', 'sports', 'technology', 'entertainment']
news_df = pd.DataFrame(acquire.get_news_articles(categories))

In [14]:
news_df

| | title | content | category |
|---|---|---|---|
| 0 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | business |
| 1 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | business |
| 2 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | business |
| 3 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | business |
| 4 | Will supply 11 cr doses to states, pvt hospita... | Serum Institute of India (SII) CEO Adar Poonaw... | business |
| ... | ... | ... | ... |
| 93 | Everything's a mess, have stopped using Instag... | Calling the ongoing coronavirus pandemic in In... | entertainment |
| 94 | A film I'm super proud of: Hansal on 3 years o... | Filmmaker Hansal Mehta took to Instagram on Tu... | entertainment |
| 95 | Hope Bell Bottom releases in theatres: Huma on... | Amid reports of director Ranjit Tewari's 'Bell... | entertainment |
| 96 | World is interested in negative: Britney on do... | Taking to Instagram, singer Britney Spears sha... | entertainment |


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [15]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/', 
        'https://codeup.com/data-science-myths/', 
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/', 
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/', 
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

codeup_df = pd.DataFrame(acquire.get_blog_articles(urls))

In [16]:
codeup_df

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


### 8. For each dataframe, produce the following columns:

- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

In [17]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column,
    with the option to pass lists for extra_words and exclude_words.
    It renames the 'content' column to 'original', so pass 'original'
    as the column name when preparing the article content.
    
    It returns a df with the article title, the original text, and the
    clean, stemmed, and lemmatized text, each with stopwords removed.
    '''
    
    #rename the content column to original
    df = df.rename(columns={"content": "original"})
    
    #holds the normalized and tokenized original w/ stopwords removed
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    #holds the stemmed version of the cleaned data
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    #holds the lemmatized version of the cleaned data
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column, 'clean', 'stemmed', 'lemmatized']]


In [18]:
#test function for news_df original article/post content
prep_article_data(news_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | indian commercial pilots association icpa tues... | indian commerci pilot associ icpa tuesday said... | indian commercial pilot association icpa tuesd... |
| 1 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | speaking india ' second covid19 wave former rb... | speak india ' second covid19 wave former rbi g... | speaking india ' second covid19 wave former rb... |
| 2 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | south koreas richest woman hong rahee added an... | south korea richest woman hong rahe ad anoth 7... | south korea richest woman hong rahee added ano... |
| 3 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | pandora world ' biggest jeweller said ' stop u... | pandora world ' biggest jewel said ' stop use ... | pandora world ' biggest jeweller said ' stop u... |
| 4 | Will supply 11 cr doses to states, pvt hospita... | Serum Institute of India (SII) CEO Adar Poonaw... | serum institute india sii ceo adar poonawalla ... | serum institut india sii ceo adar poonawalla s... | serum institute india sii ceo adar poonawalla ... |


In [19]:
#test function for codeup_df original article/post content
prep_article_data(codeup_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... | rumors true time arrived codeup officially ope... | rumor true time arriv codeup offici open appli... | rumor true time arrived codeup officially open... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... | dimitri antoniou maggie giust data science big... | dimitri antoni maggi giust data scienc big dat... | dimitri antoniou maggie giust data science big... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... | dimitri antoniou week ago codeup launched imme... | dimitri antoni week ago codeup launch immers d... | dimitri antoniou week ago codeup launched imme... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamps closing model danger prog... | competitor bootcamp close model danger program... | competitor bootcamps closing model danger prog... |

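The `clean`, `stemmed`, and `lemmatized` columns all begin with the same `basic_clean` and `tokenize` steps, written out three times in `prep_article_data`. One possible way to cut the repetition is a small helper that chains functions together. The `pipeline` helper below is hypothetical (it is not part of pandas or NLTK), and the steps are stand-ins so the sketch runs on its own.

```python
from functools import reduce

def pipeline(*funcs):
    """Return a function that applies funcs from left to right (hypothetical helper)."""
    def run(value):
        return reduce(lambda acc, func: func(acc), funcs, value)
    return run

# Stand-in steps; in the notebook these would be basic_clean, tokenize, stem, etc.
strip_ends = str.strip
shout = str.upper

prepare = pipeline(strip_ends, shout)
print(prepare("  hello world  "))
```

Under that assumption, a column could be built with something like `df[column].apply(pipeline(basic_clean, tokenize, stem))`, with `functools.partial` wrapping `remove_stopwords` when it needs its keyword arguments.
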

### 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - Probably lemmatized: the corpus is small, so the extra processing time is negligible.
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - It depends on how much processing time you can afford to spend.
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - Most likely stemmed: it reduces the size of the text, lemmatizing would take too long, and the shorter run saves money.
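
One way to put rough numbers behind these answers is to time a function on a small sample and scale up. The sketch below is only an illustration: `seconds_per_call` and `estimate_total_seconds` are made-up helper names, the transform is a stand-in so the code runs without NLTK, and the linear scaling by corpus size is an assumption that ignores I/O and memory limits.

```python
import timeit

def seconds_per_call(func, sample, repeat=3, number=100):
    """Best average time in seconds for a single call of func(sample)."""
    timer = timeit.Timer(lambda: func(sample))
    return min(timer.repeat(repeat=repeat, number=number)) / number

def estimate_total_seconds(per_call, sample_bytes, corpus_bytes):
    """Scale the sample timing linearly to the corpus size (an assumption)."""
    return per_call * corpus_bytes / sample_bytes

# Stand-in for stem or lemmatize so the sketch runs on its own
def stand_in_transform(text):
    return ' '.join(word[:4] for word in text.split())

sample = "hello i am testing this function i wonder if this ' ll work"
per_call = seconds_per_call(stand_in_transform, sample)
total = estimate_total_seconds(per_call, len(sample.encode('utf-8')), 25 * 1024 ** 2)
print(f"estimated seconds for a 25MB corpus: {total:.2f}")
```

Swapping `stand_in_transform` for the notebook's `stem` and `lemmatize` would give a like-for-like comparison on your own machine.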