# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.



## 1.


In [19]:
# Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
# - Lowercase everything
# - Normalize unicode characters
# - Replace anything that is not a letter, number, whitespace or a single quote.

import unicodedata
import re

def basic_clean(text):
    """
    This function takes in a string and applies some basic text cleaning to it:
    - Lowercase everything
    - Normalize unicode characters
    - Remove anything that is not a letter, number, whitespace or a single quote.
    """
    # Lowercase the text
    text = text.lower()
    
    # Normalize unicode characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    # Remove anything that is not a letter, number, whitespace or a single quote
    text = re.sub(r"[^a-z0-9'\s]", '', text)
    
    return text
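
A quick check of the three cleaning steps on a made-up sentence. The sample string is an assumption for illustration, and `basic_clean` is repeated in compact form only so the snippet runs on its own.

```python
import re
import unicodedata

def basic_clean(text):
    # Same three steps as above: lowercase, strip accents, drop other characters
    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return re.sub(r"[^a-z0-9'\s]", '', text)

# Hypothetical sample input (not taken from the acquired articles)
sample = "Héllo, World! It's 2023."
cleaned = basic_clean(sample)
print(cleaned)  # hello world it's 2023
```

The accented character is decomposed by NFKD and its combining mark is dropped by the ASCII encoding step, while the single quote survives the regular expression.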



## 2.


In [20]:
# Define a function named tokenize. It should take in a string and tokenize all the words in the string.

import nltk

def tokenize(text):
    """
    This function takes in a string and tokenizes all the words in the string.
    """
    # Create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use the tokenizer; return_str=True returns one string instead of a list of tokens
    text = tokenizer.tokenize(text, return_str=True)
    
    return text


## 3. 


In [21]:
# Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

def stem(text):
    """
    This function takes in some text and returns the text after applying stemming to all the words.
    """
    # Create the nltk stemmer object, then use it
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in text.split()]
    
    # Join our list of words into a string again
    text_stemmed = ' '.join(stems)
    
    return text_stemmed
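
As a small sanity check, the sketch below stems the first few cleaned words of the news article shown in the dataframe output for exercise 8. It assumes `nltk` is installed (the Porter stemmer needs no extra downloads) and repeats `stem` in compact form so it runs on its own.

```python
from nltk.stem import PorterStemmer

def stem(text):
    # Stem each whitespace-separated word and join the results back together
    ps = PorterStemmer()
    return ' '.join(ps.stem(word) for word in text.split())

# Words taken from the 'clean' column shown in exercise 8
sample = "jio financial services posted net profit"
stemmed = stem(sample)
print(stemmed)  # jio financi servic post net profit
```

Stems such as `financi` and `servic` are not dictionary words, which is the main trade-off against lemmatization.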


## 4. 


In [27]:
# Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

import nltk
nltk.download('wordnet')

def lemmatize(text):
    """
    This function takes in some text and returns the text after applying lemmatization to each word.
    """
    # Create the nltk lemmatizer object, then use it
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    
    # Join our list of words into a string again
    text_lemmatized = ' '.join(lemmas)
    
    return text_lemmatized

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/zacschmitz/nltk_data...



## 5. 


In [23]:
# Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
# This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def remove_stopwords(text, extra_words=None, exclude_words=None):
    """
    This function takes in some text and returns the text after removing all the stopwords.
    It defines two optional parameters, extra_words and exclude_words, which define any additional stop words to include,
    and any words that we don't want to remove.
    """
    # Get the list of stopwords
    stopword_list = stopwords.words('english')
    
    # Add any extra words to the stopword list
    if extra_words:
        stopword_list += extra_words
    
    # Remove any exclude words from the stopword list
    if exclude_words:
        stopword_list = [word for word in stopword_list if word not in exclude_words]
    
    # Tokenize the text
    words = text.split()
    
    # Remove the stopwords from the text
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join the filtered words back into a string
    filtered_text = ' '.join(filtered_words)
    
    return filtered_text

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zacschmitz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
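
To see how `extra_words` and `exclude_words` interact, here is a minimal sketch that applies the same logic with sets. The function name `remove_stopwords_sketch` and the tiny `demo_stopwords` list are assumptions made for illustration; they stand in for NLTK's full English list so the snippet runs without any downloads.

```python
def remove_stopwords_sketch(text, stopword_list, extra_words=None, exclude_words=None):
    # Start from the base list, add extras, then drop excluded words
    stopword_set = set(stopword_list)
    stopword_set |= set(extra_words or [])
    stopword_set -= set(exclude_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_set)

# Hypothetical stand-in for stopwords.words('english')
demo_stopwords = ['the', 'a', 'is', 'not', 'of']
text = "the profit of the company is not small"

default = remove_stopwords_sketch(text, demo_stopwords)
with_extra = remove_stopwords_sketch(text, demo_stopwords, extra_words=['company'])
with_exclude = remove_stopwords_sketch(text, demo_stopwords, exclude_words=['not'])

print(default)       # profit company small
print(with_extra)    # profit small
print(with_exclude)  # profit company not small
```

Because exclusions are applied after additions, a word that appears in both `extra_words` and `exclude_words` is kept in the text, which matches the order of operations in `remove_stopwords`.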



## 6. 


In [24]:
# Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

import pandas as pd

# Read in news_shorts.csv
news_df = pd.read_csv('csv_files/news_shorts.csv')


## 7.


In [25]:
# Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

codeup_df = pd.read_csv('csv_files/codeup_blogs.csv')


## 8. 


In [28]:
# For each dataframe, produce the following columns:

# - title to hold the title
# - original to hold the original article/post content
# - clean to hold the normalized and tokenized original with the stopwords removed.
# - stemmed to hold the stemmed version of the cleaned data.
# - lemmatized to hold the lemmatized version of the cleaned data.

In [37]:
news_df = news_df.rename(columns={'content': 'original'})

news_df['clean'] = news_df.original.apply(basic_clean).apply(tokenize).apply(remove_stopwords)

news_df['stemmed'] = news_df.clean.apply(stem)

news_df['lemmatized'] = news_df.clean.apply(lemmatize)

news_df = news_df[['title', 'category','original', 'clean', 'stemmed', 'lemmatized']]

news_df.head(1)

| | title | category | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 0 | Jio Financial Services' Q2 profit jumps 101% Q... | Business | Jio Financial Services posted a net profit of ... | jio financial services posted net profit 668 c... | jio financi servic post net profit 668 crore s... | jio financial service posted net profit 668 cr... |


In [36]:
codeup_df = codeup_df.rename(columns={'content': 'original'})

codeup_df['clean'] = codeup_df.original.apply(basic_clean).apply(tokenize).apply(remove_stopwords)

codeup_df['stemmed'] = codeup_df.clean.apply(stem)

codeup_df['lemmatized'] = codeup_df.clean.apply(lemmatize)

codeup_df.head(1)

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Spotlight on APIDA Voices: Celebrating Heritag... | May is traditionally known as Asian American a... | may traditionally known asian american pacific... | may tradit known asian american pacif island a... | may traditionally known asian american pacific... |



## 9. 


### Ask yourself:


### If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?


I would almost always prefer to lemmatize, because lemmas are real dictionary words and are therefore more accurate than stems. It makes sense here as well: a 493KB corpus is small enough that the extra processing time is negligible.


### If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?


This corpus is roughly 50 times larger, but I would most likely still lemmatize for the sake of accuracy. If I were short on time or had limited resources, stemming could make sense.

### If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?



I would most likely use stemming, since it is generally cheaper to compute and the cost grows with the size of the corpus, unless lemmatized text proved to be absolutely critical.