# Prepare Exercises

Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [17]:
import pandas as pd
import numpy as np
import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace, or a single quote.

In [18]:
def basic_clean(input_string):
    '''
    basic_clean takes in a string and performs the following cleaning: lowercases it, normalizes unicode
    characters, and removes anything that is not a letter, number, whitespace, or single quote
    returns clean_string
    '''
    # lowercase the original string
    clean_string = input_string.lower()
    
    # normalize unicode characters, dropping any that have no ASCII equivalent
    clean_string = unicodedata.normalize('NFKD', clean_string).encode('ascii','ignore').decode('utf-8')
    
    # remove anything that is not a through z, a number, a single quote, or whitespace
    clean_string = re.sub(r"[^a-z0-9'\s]", '', clean_string)
    
    return clean_string

In [19]:
sample_string = r'1-034-@#$.32ksk|llkm fsadpfo-3-ljf &*^...hi mom...?/\|'

In [23]:
sample_text = "Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp"

In [20]:
basic_clean(sample_string)

'103432kskllkm fsadpfo3ljf hi mom'

In [24]:
basic_clean(sample_text)

'hey amazon  my package never arrived httpswwwamazoncomgpcssorderhistoryrefnavordersfirst please fix asap amazonhelp'
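Neither sample above contains accented characters, so the normalization step has no visible effect on them. The sketch below isolates that step on a made-up string; `strip_accents` is a hypothetical helper name, not part of the exercise.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented character into its base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining marks
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')

print(strip_accents('café naïve résumé'))  # cafe naive resume
```

Characters with no ASCII decomposition at all are dropped entirely rather than transliterated.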

# 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [21]:
def tokenize(input_string):
    '''
    tokenize takes in a string, passes it through the basic_clean function, then tokenizes all the words in the string
    returns token_string
    '''
    
    #basic clean string
    clean_string = basic_clean(input_string)
    
    # create tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
     
    # apply tokenizer to string
    token_string  = tokenizer.tokenize(clean_string, return_str=True)
    
    return token_string

In [22]:
tokenize(sample_string)

'103432kskllkm fsadpfo3ljf hi mom'

In [25]:
tokenize(sample_text)

'hey amazon my package never arrived httpswwwamazoncomgpcssorderhistoryrefnavordersfirst please fix asap amazonhelp'

# 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [39]:
def stem(input_string):
    '''
    stem takes in a string and runs it through the tokenize function
    returns stem_string, a stemmed version of the string
    '''
    # clean string and tokenize
    token_string = tokenize(input_string)
    
    # create stemming object
    ps = nltk.porter.PorterStemmer()
    # stemming string
    stem_string = [ps.stem(word) for word in token_string.split()]
    # join stemmed string
    stem_string = ' '.join(stem_string)
    
    return stem_string

In [40]:
stem(sample_string)

'103432kskllkm fsadpfo3ljf hi mom'

In [41]:
stem(sample_text)

'hey amazon my packag never arriv httpswwwamazoncomgpcssorderhistoryrefnavordersfirst pleas fix asap amazonhelp'

# 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [44]:
def lemmatize(input_string):
    '''
    lemmatize takes in a string and passes it through the tokenize function
    returns lemmas_string, a lemmatized version of the string.
    '''
    
    # clean string and tokenize
    token_string = tokenize(input_string)
    
    # create object
    wnl = nltk.stem.WordNetLemmatizer()
    
    # apply lemmatizer to string
    lemmas_string = [wnl.lemmatize(word) for word in token_string.split()]
    lemmas_string = " ".join(lemmas_string)
    
    return lemmas_string
    
    

In [45]:
lemmatize(sample_string)

'103432kskllkm fsadpfo3ljf hi mom'

In [46]:
lemmatize(sample_text)

'hey amazon my package never arrived httpswwwamazoncomgpcssorderhistoryrefnavordersfirst please fix asap amazonhelp'

# 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [52]:
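def remove_stopwords(input_string, extra_words = None, exclude_words = None):
    '''
    remove_stopwords takes in a string and removes the English stopwords from it
    extra_words and exclude_words are optional; each can be a single word or a list of words
    returns new_string
    '''
    # create stopwords set
    stopwords_eng = set(stopwords.words('english'))
    
    # wrap a single word in a list and turn None into an empty list
    extra_words = [extra_words] if isinstance(extra_words, str) else (extra_words or [])
    exclude_words = [exclude_words] if isinstance(exclude_words, str) else (exclude_words or [])
    
    # words to be added
    stopwords_eng = stopwords_eng | set(extra_words)
    
    # words to be kept in the text
    stopwords_eng = stopwords_eng - set(exclude_words)
    
    # keep only the words that are not stopwords and join them back into a string
    new_string = ' '.join(word for word in input_string.split() if word not in stopwords_eng)
    
    return new_string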
    

In [54]:
sample_text

'Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp'

In [62]:
# keyword arguments must come after positional ones, so exclude_words is passed by name
remove_stopwords(sample_text, extra_words=None, exclude_words='my')
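The effect of the two optional parameters can be checked without NLTK. The following is a minimal sketch that assumes a tiny hand-written stopword list; `filter_stopwords` and `TOY_STOPWORDS` are hypothetical names used only for illustration, not the NLTK list.

```python
def filter_stopwords(text, base_stopwords, extra_words=None, exclude_words=None):
    # start from the base list, add the extra words, then drop the excluded ones
    stop = set(base_stopwords)
    stop |= set(extra_words or [])
    stop -= set(exclude_words or [])
    # keep only the words that are not stopwords
    return ' '.join(word for word in text.split() if word not in stop)

# hypothetical toy list standing in for stopwords.words('english')
TOY_STOPWORDS = ['my', 'the', 'a', 'is']
text = 'my package is the late package'

print(filter_stopwords(text, TOY_STOPWORDS))                        # package late package
print(filter_stopwords(text, TOY_STOPWORDS, extra_words=['late']))  # package package
print(filter_stopwords(text, TOY_STOPWORDS, exclude_words=['my']))  # my package late package
```

Defaulting both parameters to `None` and converting them with `or []` avoids a mutable default argument and lets the function run when neither is supplied.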

# 6. Use your data from the acquire exercises to produce a dataframe of the news articles. Name the dataframe news_df.

# 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

# 8. For each dataframe, produce the following columns:

* title to hold the title
* original to hold the original article/post content
* clean to hold the normalized and tokenized original with the stopwords removed.
* stemmed to hold the stemmed version of the cleaned data.
* lemmatized to hold the lemmatized version of the cleaned data.
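One way to build these columns is sketched below. It assumes the acquired dataframe has `title` and `content` columns (the `content` name is an assumption) and takes the preparation functions as arguments. `add_prepared_columns` is a hypothetical helper, and the demo uses stand-in string methods so the block runs on its own.

```python
import pandas as pd

def add_prepared_columns(df, clean_fn, stem_fn, lemmatize_fn, content_col='content'):
    # start from the title and the untouched article text
    out = pd.DataFrame({'title': df['title'], 'original': df[content_col]})
    # clean is derived from original; stemmed and lemmatized are derived from clean
    out['clean'] = out['original'].apply(clean_fn)
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out

# stand-in data and functions, for demonstration only
demo = pd.DataFrame({'title': ['Post A'], 'content': ['Hello World']})
result = add_prepared_columns(demo, str.lower, str.upper, str.title)
print(result.columns.tolist())
# ['title', 'original', 'clean', 'stemmed', 'lemmatized']
```

With the functions from this exercise, `clean_fn` could be something like `lambda text: remove_stopwords(tokenize(text))`, provided `remove_stopwords` returns a string, with `stem` and `lemmatize` passed as the other two arguments.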

# 9. Ask yourself:

* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
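These questions come down to processing cost: lemmatization looks each word up in a dictionary, which typically takes more time than rule-based stemming. A rough way to measure this on your own data is sketched below. `seconds_per_call` is a hypothetical helper, and the demo times two built-in string methods as stand-ins so the block runs without NLTK.

```python
import timeit

def seconds_per_call(func, text, repeats=5):
    # average wall-clock seconds for one call of func(text)
    total = timeit.timeit(lambda: func(text), number=repeats)
    return total / repeats

# stand-in text and functions; swap in stem and lemmatize to compare them
sample = 'hey amazon my package never arrived ' * 1000
for name, func in [('lower', str.lower), ('title', str.title)]:
    print(f'{name}: {seconds_per_call(func, sample):.6f} s per call')
```

Dividing the measured time by the size of the sample gives a per-byte estimate that can be scaled up to the corpus sizes in the questions above.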