In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

from acquire import get_codeup_blog, get_inshorts_articles

# Data Preparation Exercises

## 1

Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [2]:
# A string to work with
text = "HERE is some text: α alpha  β beta | something * else. Someone's pencil."

In [3]:
# Lowercase everything
cleaned = text.lower()
cleaned

"here is some text: α alpha  β beta | something * else. someone's pencil."

In [4]:
# Normalize unicode characters
cleaned = unicodedata.normalize('NFKD', cleaned).encode('ascii', 'ignore').decode('utf-8', 'ignore')
cleaned

"here is some text:  alpha   beta | something * else. someone's pencil."

In [5]:
# Replace special characters
regexp = r"[^a-z0-9'\s]"
cleaned = re.sub(regexp, '', cleaned)
cleaned

"here is some text  alpha   beta  something  else someone's pencil"

In [6]:
# Now let's put it in a function

def basic_clean(text):
    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    regexp = r"[^a-z0-9'\s]"
    text = re.sub(regexp, '', text)
    return text

In [7]:
# Let's test it
basic_clean(text)

"here is some text  alpha   beta  something  else someone's pencil"
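The Greek letters vanish in the output above, which can be surprising. The following is a minimal sketch, using only the standard library, of what the normalize/encode/decode step does. The helper name `to_ascii` and the sample strings are made up for illustration. NFKD splits an accented character into a base letter plus a combining mark, and the ASCII encode with `'ignore'` then drops the mark but keeps the base letter. A character such as `α` has no ASCII base letter, so it is dropped entirely.

```python
import unicodedata

def to_ascii(text):
    # Decompose characters (NFKD), then drop every byte that is not ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# 'é' decomposes into 'e' plus a combining accent; only the accent is dropped.
print(to_ascii('café'))      # cafe

# 'α' has no ASCII decomposition, so the whole character is dropped,
# leaving the space that followed it.
print(repr(to_ascii('α alpha')))   # ' alpha'
```

This is why the cleaned string keeps doubled spaces where `α` and `β` used to be.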

## 2

Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [8]:
# Let's create a function that will create a tokenizer object and tokenize the input

def tokenize(text):
    tokenizer = ToktokTokenizer()
    return tokenizer.tokenize(text, return_str = True)

In [9]:
# Let's test it
tokenize(cleaned)

"here is some text alpha beta something else someone ' s pencil"

## 3

Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [10]:
# Let's create a function that applies stemming to the input text

def stem(text):
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in text.split()]
    return ' '.join(stems)

In [11]:
# Let's test it
stem(cleaned)

"here is some text alpha beta someth els someone' pencil"

## 4

Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [12]:
# Let's create a function that will apply lemmatization to the input text

def lemmatize(text):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    return ' '.join(lemmas)

In [13]:
# Let's test it
lemmatize(cleaned)

"here is some text alpha beta something else someone's pencil"

That didn't change anything: the lemmatizer treats every word as a noun by default, and this string has no plural nouns to reduce. Let's try a different string.

In [14]:
lemmatize("He studies the principles of mathematical mumbo jumbo")

'He study the principle of mathematical mumbo jumbo'

## 5

Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [15]:
# Let's try to add and remove words from the stopwords list

stopword_list = stopwords.words('english')
stopword_list[ : 20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

In [16]:
# First let's try adding some words

extra_words = [
    'hubba',
    'bubba'
]

stopword_list += extra_words
stopword_list[-10 : ]

['wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'hubba',
 'bubba']

In [17]:
# Now let's try removing some words

exclude_words = [
    "wouldn't",
    "won't"
]

for word in exclude_words:
    stopword_list.remove(word)

stopword_list[-10 : ]

['shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 'wouldn',
 'hubba',
 'bubba']

In [33]:
# Now let's create the function to remove all stopwords from the input text

def remove_stopwords(text, extra_words = None, exclude_words = None):
    stopword_list = stopwords.words('english')
    
    # Work with a set, and apply each optional parameter only when it was
    # actually passed. The None checks are what make the parameters optional.
    stopword_list = set(stopword_list)
    if extra_words is not None:
        stopword_list |= set(extra_words)
    if exclude_words is not None:
        stopword_list -= set(exclude_words)
    
    text = [word for word in text.split() if word not in stopword_list]
    return ' '.join(text)

In [34]:
# Let's test it
remove_stopwords(cleaned)

"text alpha beta something else someone's pencil"

In [35]:
remove_stopwords(cleaned, extra_words = ['alpha', 'beta'], exclude_words = ['here'])

"here text something else someone's pencil"
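The set arithmetic behind the two optional parameters can be checked without NLTK. This is a sketch only: `filter_words` and `toy_stopwords` are hypothetical names, and the three-word list stands in for `stopwords.words('english')`.

```python
def filter_words(text, stopword_list, extra_words=None, exclude_words=None):
    # Treat a missing argument as an empty list so the set math always works.
    stop_set = set(stopword_list) | set(extra_words or [])
    # Exclusions are applied last, so they win over extra_words.
    stop_set -= set(exclude_words or [])
    return ' '.join(word for word in text.split() if word not in stop_set)

# A tiny, made-up stopword list standing in for stopwords.words('english').
toy_stopwords = ['here', 'is', 'some']

print(filter_words('here is some text alpha beta', toy_stopwords))
# text alpha beta

print(filter_words('here is some text alpha beta', toy_stopwords,
                   extra_words=['alpha'], exclude_words=['here']))
# here text beta
```

Because the exclusions are subtracted after the extras are added, a word that appears in both lists is kept in the text.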

## 6

Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [37]:
news_df = get_inshorts_articles()
news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     99 non-null     object
 1   content   99 non-null     object
 2   category  99 non-null     object
dtypes: object(3)
memory usage: 2.4+ KB


## 7

Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [42]:
codeup_df = get_codeup_blog()
codeup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    220 non-null    object
 1   date     220 non-null    object
 2   content  218 non-null    object
dtypes: object(3)
memory usage: 5.3+ KB


In [43]:
# Two posts have null content (218 non-null out of 220). Not sure why, but there is nothing to clean in those rows, so let's drop them.
codeup_df = codeup_df.dropna()
codeup_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 218 entries, 0 to 219
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    218 non-null    object
 1   date     218 non-null    object
 2   content  218 non-null    object
dtypes: object(3)
memory usage: 6.8+ KB
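Before dropping rows it can help to look at which ones are affected, and to reset the index afterwards so it has no gaps (note the `0 to 219` range for 218 entries above). The sketch below uses a made-up three-row frame, `toy_df`, in place of the real blog data; the titles, dates, and content are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for codeup_df; the real data comes from get_codeup_blog().
toy_df = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post C'],
    'date': ['2022-01-01', '2022-01-02', '2022-01-03'],
    'content': ['some text', None, 'more text'],
})

# Rows where content is missing; inspect these before deciding to drop them.
missing = toy_df[toy_df.content.isna()]
print(missing.title.tolist())   # ['Post B']

# Drop only rows with missing content and rebuild a gap-free 0..n-1 index.
toy_df = toy_df.dropna(subset=['content']).reset_index(drop=True)
print(toy_df.index.tolist())    # [0, 1]
```

Passing `subset=['content']` keeps rows that are missing some other column, which may or may not be what you want.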


## 8

For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [38]:
# We'll use the news_df dataframe to produce a new dataframe with the contents we need.

news_df = pd.DataFrame({
    'title' : news_df.title,
    'original' : news_df.content,
    'clean' : news_df.content.apply(basic_clean).apply(tokenize).apply(remove_stopwords)
})

In [39]:
news_df.head()

| | title | original | clean |
|---|---|---|---|
| 0 | Who is Jared Birchall, who manages the world's... | Jared Birchall is the Managing Director of Exc... | jared birchall managing director excession fam... |
| 1 | Work ethic expectations from Twitter employees... | The world's richest man Elon Musk tweeted "wor... | world ' richest man elon musk tweeted work eth... |
| 2 | Best investment you'll make: Poonawalla sugges... | The Serum Institute of India's Adar Poonawalla... | serum institute india ' adar poonawalla sunday... |
| 3 | Amazon driver leaves 'kind' message for girl f... | Amazon Founder Jeff Bezos took to Instagram St... | amazon founder jeff bezos took instagram stori... |
| 4 | Baseless and untrue: ED as Xiaomi alleges it t... | After Xiaomi alleged in a court filing that it... | xiaomi alleged court filing top executives fac... |


In [40]:
# Now let's create the stemmed and lemmatized columns
news_df['stemmed'] = news_df.clean.apply(stem)
news_df['lemmatized'] = news_df.clean.apply(lemmatize)

In [41]:
# Let's see the results
news_df.head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Who is Jared Birchall, who manages the world's... | Jared Birchall is the Managing Director of Exc... | jared birchall managing director excession fam... | jare birchal manag director excess famili offi... | jared birchall managing director excession fam... |
| 1 | Work ethic expectations from Twitter employees... | The world's richest man Elon Musk tweeted "wor... | world ' richest man elon musk tweeted work eth... | world ' richest man elon musk tweet work ethic... | world ' richest man elon musk tweeted work eth... |
| 2 | Best investment you'll make: Poonawalla sugges... | The Serum Institute of India's Adar Poonawalla... | serum institute india ' adar poonawalla sunday... | serum institut india ' adar poonawalla sunday ... | serum institute india ' adar poonawalla sunday... |
| 3 | Amazon driver leaves 'kind' message for girl f... | Amazon Founder Jeff Bezos took to Instagram St... | amazon founder jeff bezos took instagram stori... | amazon founder jeff bezo took instagram stori ... | amazon founder jeff bezos took instagram story... |
| 4 | Baseless and untrue: ED as Xiaomi alleges it t... | After Xiaomi alleged in a court filing that it... | xiaomi alleged court filing top executives fac... | xiaomi alleg court file top execut face threat... | xiaomi alleged court filing top executive face... |


In [44]:
# Now let's do the same thing with the codeup_df dataframe

codeup_df = pd.DataFrame({
    'title' : codeup_df.title,
    'original' : codeup_df.content,
    'clean' : codeup_df.content.apply(basic_clean).apply(tokenize).apply(remove_stopwords)
})

In [45]:
codeup_df['stemmed'] = codeup_df.clean.apply(stem)
codeup_df['lemmatized'] = codeup_df.clean.apply(lemmatize)

In [46]:
# Let's see the results
codeup_df.head()

| | title | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Meet the new Codeup COO, Stephen Noteboom! | A big welcome to Stephen Noteboom, who will be... | big welcome stephen noteboom joining codeup ch... | big welcom stephen noteboom join codeup chief ... | big welcome stephen noteboom joining codeup ch... |
| 1 | Codeup Launches a Houston Bootcamp! | Houston, we have a problem: there aren’t enoug... | houston problem arent enough software develope... | houston problem arent enough softwar develop 6... | houston problem arent enough software develope... |
| 2 | Codeup Named a Top 30 Coding School | Codeup Named a Top 30 Coding School\n \nWhile ... | codeup named top 30 coding school awards arent... | codeup name top 30 code school award arent nic... | codeup named top 30 coding school award arent ... |
| 3 | Codeup Success Story: Ryan Orsinger | Codeup Success Story: Ryan Orsinger\n \nWatch ... | codeup success story ryan orsinger watch video... | codeup success stori ryan orsing watch video i... | codeup success story ryan orsinger watch video... |
| 4 | Codeup Success Story: Cole Reveal | Codeup Success Story: Cole Reveal\nWatch the v... | codeup success story cole reveal watch video p... | codeup success stori cole reveal watch video p... | codeup success story cole reveal watch video p... |
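The same column-building steps were repeated for both dataframes, so they could be wrapped in one helper. The sketch below is one possible shape, not part of the exercise: `prep_text_df` and its parameters are hypothetical names, and the placeholder callables (`str.lower`, `str.upper`, `str.title`) only stand in for the real cleaning, stemming, and lemmatizing functions so the example runs without NLTK.

```python
import pandas as pd

def prep_text_df(df, content_col, clean_fn, stem_fn, lemmatize_fn):
    # Build the title/original/clean/stemmed/lemmatized layout in one place.
    out = pd.DataFrame({
        'title': df['title'],
        'original': df[content_col],
    })
    out['clean'] = out['original'].apply(clean_fn)
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out

# Placeholder callables so the sketch runs on its own; they are NOT real
# stemming or lemmatization.
toy = pd.DataFrame({'title': ['T'], 'content': ['Some TEXT']})
result = prep_text_df(toy, 'content', str.lower, str.upper, str.title)
print(result.columns.tolist())
# ['title', 'original', 'clean', 'stemmed', 'lemmatized']
```

In this notebook the call might look like `prep_text_df(news_df, 'content', lambda s: remove_stopwords(tokenize(basic_clean(s))), stem, lemmatize)`, run against the dataframe as returned by the acquire function.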


## 9

Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

For a corpus this small I would prefer lemmatized text. Lemmatizing 493KB will not take long, and lemmas stay readable words, unlike stems such as "someth" and "els".

- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

For a corpus this size I would still prefer lemmatizing. It will take longer, but I suspect not unreasonably so.

- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

For a corpus this size I would prefer stemming. Lemmatizing that much data would take a very long time, and at that scale the extra processing would also be expensive.
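These answers are judgment calls. One way to ground them is to time each function on a sample and extrapolate to the corpus size. The sketch below does this with hypothetical helpers (`seconds_per_megabyte`, `projected_hours`) and a placeholder workload (`str.upper`), so the numbers it prints say nothing about real stemming or lemmatizing speed. It also assumes run time grows linearly with text size and uses decimal units (200TB = 200,000,000MB).

```python
import time

def seconds_per_megabyte(fn, sample_text):
    # Time one pass of fn over the sample and scale to one megabyte of text.
    start = time.perf_counter()
    fn(sample_text)
    elapsed = time.perf_counter() - start
    megabytes = len(sample_text.encode('utf-8')) / 1_000_000
    return elapsed / megabytes

def projected_hours(rate_s_per_mb, corpus_megabytes):
    # Linear extrapolation; real pipelines rarely scale this cleanly.
    return rate_s_per_mb * corpus_megabytes / 3600

# Placeholder workload; in the notebook you would pass stem or lemmatize here.
sample = 'here is some text alpha beta something else ' * 1000
rate = seconds_per_megabyte(str.upper, sample)

for label, size_mb in [('493KB', 0.493), ('25MB', 25), ('200TB', 200_000_000)]:
    print(label, projected_hours(rate, size_mb))
```

Running it once with `stem` and once with `lemmatize` on a slice of the real corpus would give two rates to compare before committing to either approach.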