In [1]:
# Import libraries

import pandas as pd
import numpy as np

import unicodedata
import re
import json
import os

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import acquire, prepare

# nltk.download('wordnet')
# nltk.download('stopwords')

### Natural Language Processing
Natural Language Processing (NLP) lets you use **Python libraries like NLTK (Natural Language Toolkit) and spaCy** to create a **machine-usable structure out of natural language text**. In other words, you can transform natural language so that it becomes useful in machine learning. Machines can't read words, but they can work with numbers, so we have **to process the text we want to use in a way that retains the original meaning while representing the text with numbers.**

### Workflow to process text data (text normalization)
- To be used in exploration and modeling, our text data needs to be processed. Such preprocessing is known as text normalization.
- Normalization means performing a series of tasks such as making all text lowercase, removing punctuation, expanding contractions, and removing anything that's not an ASCII character.

## Exercises
The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

In [3]:
# Read the text data into a pandas df from saved json file

df = acquire.get_news_articles(cached=True)
df.head()

Unnamed: 0,topic,title,author,content
0,business,"Lakshmi Vilas Bank withdrawals capped at ₹25,0...",Pragya Swastik,The Centre has imposed a 30-day moratorium on ...
1,business,How does Moderna's COVID-19 vaccine candidate ...,Pragya Swastik,Moderna's initial results of late-stage trial ...
2,business,Shutting Delhi markets may prove counterproduc...,Sakshita Khosla,Traders' body CAIT on Tuesday said a proposal ...
3,business,Pfizer shares drop 4.5% as Moderna says its va...,Krishna Veera Vanamali,Pfizer’s shares fell as much as 4.5% on Monday...
4,business,"Musk gets $15bn richer in 2 hours, becomes wor...",Krishna Veera Vanamali,Billionaire Elon Musk added $15 billion to his...


In [5]:
# Define a string used to test the functions
original = df.content[0]

# Print its dtype
print(type(original))

# Print the content
original

<class 'str'>


'The Centre has imposed a 30-day moratorium on Lakshmi Vilas Bank effective from Tuesday. A withdrawal limit of ₹25,000 with certain exceptions for unforeseen expenses has been imposed for depositors. The RBI said, "The financial position of the bank has undergone a steady decline with continuous losses over the last three years."'

### 1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it.

#### Remove Accented Characters

We will remove accented characters by chaining together the following methods:

**Remove inconsistencies in unicode character encoding.**<br>
`string = unicodedata.normalize(form, unistr)`

**Convert string to ASCII character set and drop non-ASCII characters.**<br>
`string = string.encode('ascii', 'ignore')`

**Convert the bytes back into a string object.**<br>
`string = string.decode('utf-8', 'ignore')`
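
As a minimal sketch of how these three steps chain together, the snippet below wraps them in a small helper. The name `strip_accents` and the sample strings are illustrative assumptions, not part of the exercise.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper name, used only for illustration.
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks
    # and any other non-ASCII characters.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    # Decoding turns the bytes back into a str.
    return ascii_bytes.decode('utf-8', 'ignore')

print(strip_accents('café naïve résumé'))  # cafe naive resume
print(strip_accents('limit of ₹25,000'))   # limit of 25,000
```

Note that a character with no ASCII base letter, such as `₹`, is dropped entirely rather than replaced.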

#### Remove Special Characters

Here are two common patterns for removing special characters; which one to use depends on what you want to remain in your string.

**Remove characters that are not word characters (letters, digits, underscores) or whitespace.**<br>
`string = re.sub(r'[^\w\s]', '', string)`

**Remove characters that are not lowercase letters, digits, single quotes, or whitespace (lowercase the string first).**<br>
`string = re.sub(r"[^a-z0-9'\s]", '', string)`
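
To see how the two patterns differ, here is a small comparison on made-up sample strings (the strings and variable names are assumptions chosen for illustration).

```python
import re

sample = "it's a 30-day_test!"

# \w matches letters, digits, and underscores, so the apostrophe,
# hyphen, and exclamation mark are removed.
keep_word_chars = re.sub(r'[^\w\s]', '', sample)
print(keep_word_chars)  # its a 30day_test

# This pattern keeps the apostrophe but removes the underscore,
# hyphen, and exclamation mark.
keep_quotes = re.sub(r"[^a-z0-9'\s]", '', sample)
print(keep_quotes)  # it's a 30daytest

# The second pattern lists only lowercase letters, so uppercase letters
# are stripped unless the string is lowercased first.
not_lowered = re.sub(r"[^a-z0-9'\s]", '', 'The RBI said')
lowered = re.sub(r"[^a-z0-9'\s]", '', 'The RBI said'.lower())
print(repr(not_lowered))  # 'he  said'
print(repr(lowered))      # 'the rbi said'
```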

In [22]:
def basic_clean(original):
    '''
    The function takes in a string and applies basic cleaning to it:
    lowercasing, dropping non-ASCII characters, and removing special characters.
    '''

    # Convert text to lowercase for consistency
    article = original.lower()

    # Normalize unicode, then drop any accented or other non-ASCII characters
    article = unicodedata.normalize('NFKD', article)\
                .encode('ascii', 'ignore')\
                .decode('utf-8', 'ignore')

    # Remove anything that is not a lowercase letter, digit, single quote, or whitespace
    article = re.sub(r"[^a-z0-9'\s]", '', article)

    return article

In [23]:
# Test the function

article = basic_clean(original)
article

'the centre has imposed a 30day moratorium on lakshmi vilas bank effective from tuesday a withdrawal limit of 25000 with certain exceptions for unforeseen expenses has been imposed for depositors the rbi said the financial position of the bank has undergone a steady decline with continuous losses over the last three years'

### 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

#### Tokenize Text

**Tokenization** is when you split larger strings of text into smaller pieces, or tokens, by setting a boundary. You might chunk a sentence into words using a space as a boundary, or a paragraph into sentences using punctuation as a boundary.
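
The two kinds of boundary can be sketched with the standard library alone. This is only an illustration of the idea on a made-up paragraph; it is not how the NLTK tokenizer used in this notebook works internally.

```python
import re

paragraph = 'The bank is closed. Withdrawals are capped! Why?'

# Paragraph -> sentences: split on whitespace that follows '.', '!' or '?'
sentences = re.split(r'(?<=[.!?])\s+', paragraph)
print(sentences)  # ['The bank is closed.', 'Withdrawals are capped!', 'Why?']

# Sentence -> words: split on whitespace
words = sentences[0].split()
print(words)  # ['The', 'bank', 'is', 'closed.']
```

Splitting on whitespace leaves the period attached to `'closed.'`, which is one reason to reach for a dedicated tokenizer.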

### Using NLTK Tokenization
**Create the tokenizer**<br>
`tokenizer = nltk.tokenize.ToktokTokenizer()`

**Use the tokenizer on my string and assign to a variable.**<br>
`tokenized_string = tokenizer.tokenize(string, return_str=True)`

In [8]:
def tokenize(original):
    '''
    This function takes in a string and returns a tokenized string. 
    '''
    # Create the object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use the tokenizer
    article = tokenizer.tokenize(original, return_str=True)
    
    return article

In [9]:
# Test the function

article = tokenize(original)
article

'The Centre has imposed a 30-day moratorium on Lakshmi Vilas Bank effective from Tuesday. A withdrawal limit of ₹ 25,000 with certain exceptions for unforeseen expenses has been imposed for depositors. The RBI said , " The financial position of the bank has undergone a steady decline with continuous losses over the last three years . "'

### 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

#### Stem Words

**Stemming** is when you reduce related words in your text to their common stem. When you are searching for a particular word in your text, it can be easier to search for the common stem than for every form of the word. Stemmers aren't very sophisticated in the way they chop off word endings; spaCy, another Python NLP library, doesn't even include a stemmer. spaCy only offers the more sophisticated lemmatizer, which we will look at in NLTK next.
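
To make the "chopping" concrete, here is a deliberately naive suffix stripper. It is a toy under stated assumptions (the suffix list and the minimum stem length are arbitrary choices), not the Porter algorithm that NLTK implements.

```python
# Toy suffix list, checked in this order; a real stemmer has many more rules.
SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word):
    # Hypothetical helper: strip the first matching suffix,
    # but only if at least three characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['imposed', 'losses', 'exceptions', 'has', 'undergone']:
    print(word, '->', toy_stem(word))
# imposed -> impos
# losses -> loss
# exceptions -> exception
# has -> has
# undergone -> undergone
```

Even this crude version shows why stems such as `impos` are not always real words.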

#### Using NLTK PorterStemmer
**Create the Stemmer.**<br>
`ps = nltk.porter.PorterStemmer()`

**Apply the stemmer to each word in our string.**<br>
`stems = [ps.stem(word) for word in string.split()]`

**Join our list of stems into a string again and assign to a variable.**<br>
`stemmed_string = ' '.join(stems)`

In [10]:
# Stem function

def stem(article):
    '''
    This function takes in a string and returns a string with words stemmed. 
    '''    
    # Create the nltk stemmer object
    ps = nltk.porter.PorterStemmer()
    
    # Use a list comprehension to stem each word in the article
    stems = [ps.stem(word) for word in article.split()]
    
    # Join the stemmed words back to a string
    stemmed_article = ' '.join(stems)
    
    return stemmed_article

In [11]:
# Test the function

stemmed_article = stem(article)
stemmed_article

'the centr ha impos a 30-day moratorium on lakshmi vila bank effect from tuesday. A withdraw limit of ₹ 25,000 with certain except for unforeseen expens ha been impos for depositors. the rbi said , " the financi posit of the bank ha undergon a steadi declin with continu loss over the last three year . "'

In [12]:
# Count how often each stemmed token appears and show the 10 most frequent
pd.Series(stemmed_article.split()).value_counts().head(10)

the       5
ha        3
impos     2
for       2
"         2
a         2
bank      2
with      2
of        2
except    1
dtype: int64

### 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

#### Lemmatize Words

**Lemmatization** is when you reduce related words in your text to their lemma, or word base, by applying a **morphological analysis** to your text. Like stemming, this is done to reduce the number of forms you have of the same word, so they can be analyzed as a single item. While stemming might create tokens that are no longer actual words after they have been chopped off at their base, lemmatization generally leaves you with real words. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless you pass a part-of-speech tag, which is why "has" comes out as "ha" in the output below. A drawback to lemmatization is that it takes longer than stemming; you can try both to see which gives you better results as you analyze a given text.
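
The contrast with stemming can be sketched with a plain lookup table. The table below is hand-written for illustration and merely stands in for a real vocabulary such as WordNet; it performs no morphological analysis.

```python
# Hand-written lookup table (illustrative only), mapping word forms to lemmas.
LEMMA_TABLE = {
    'losses': 'loss',
    'years': 'year',
    'exceptions': 'exception',
    'studies': 'study',
}

def toy_lemmatize(text):
    # Hypothetical helper: words missing from the table are returned unchanged.
    return ' '.join(LEMMA_TABLE.get(word, word) for word in text.split())

print(toy_lemmatize('continuous losses over three years'))
# continuous loss over three year
```

Every output token is either a known lemma or the original word, so nothing is chopped into a non-word.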

#### Using NLTK WordNetLemmatizer

**Download the first time.**<br>
`nltk.download('wordnet')`

**Create the Lemmatizer.**<br>
`wnl = nltk.stem.WordNetLemmatizer()`

**Use the lemmatizer on each word in the list of words we created by using split.**<br>
`lemmas = [wnl.lemmatize(word) for word in string.split()]`

**Join our list of words into a string again and assign to a variable.**<br>
`lemmatized_string = ' '.join(lemmas)`

In [13]:
# Lemmatize function

def lemmatize(article):
    '''
    This function takes in a string and returns a string with words lemmatized. 
    '''
    
    # Create the nltk lemmatizer object
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use a list comprehension to lemmatize each word in the article
    lemmas = [wnl.lemmatize(word) for word in article.split()]
    
    # Join the lemmatized words back to a string
    lemmatized_article = ' '.join(lemmas)
    
    return lemmatized_article

In [24]:
# Test the function

lemmatized_article = lemmatize(article)
lemmatized_article

'the centre ha imposed a 30day moratorium on lakshmi vila bank effective from tuesday a withdrawal limit of 25000 with certain exception for unforeseen expense ha been imposed for depositor the rbi said the financial position of the bank ha undergone a steady decline with continuous loss over the last three year'

### 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

### This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

#### Remove Stopwords

**Stopwords** are words that are filtered out while preparing your text for analysis and modeling. **Stopwords are those that offer little to the meaning of your text and are basically just adding noise to your analysis.** Or, as Ryan Orsinger would say, "Stopwords aren't the real story of the document." Words such as 'the', 'and', 'a', and the like can be removed, so you can better focus on the good stuff.

#### Using NLTK Stopwords
**Download the first time.**<br>
`nltk.download('stopwords')`

**Create list of stopwords and assign to variable.**<br>
`stopword_list = stopwords.words('english')`

**Create list of words and assign to variable.**<br>
`words = string.split()`

**Create a list of words from my string with stopwords removed and assign to variable.**<br>
`filtered_words = [w for w in words if w not in stopword_list]`

**Join words in the list back into strings and assign to a variable.**<br>
`string_without_stopwords = ' '.join(filtered_words)`
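
The steps above, together with the optional `extra_words` and `exclude_words` parameters, can be sketched without NLTK. The tiny stopword set and the helper name `filter_stopwords` are assumptions made for illustration; NLTK's English list is much longer.

```python
# Tiny hand-picked stopword set for illustration only.
base_stopwords = {'the', 'a', 'has', 'on', 'of'}

def filter_stopwords(text, stopword_set, extra_words=(), exclude_words=()):
    # Hypothetical helper: drop exclude_words from the set, then add extra_words.
    active = (set(stopword_set) - set(exclude_words)) | set(extra_words)
    # Keep only the words that are not in the active stopword set.
    return ' '.join(word for word in text.split() if word not in active)

text = 'the centre has imposed a moratorium on the bank'

print(filter_stopwords(text, base_stopwords))
# centre imposed moratorium bank

print(filter_stopwords(text, base_stopwords, extra_words=['bank'], exclude_words=['the']))
# the centre imposed moratorium the
```

Holding the stopwords in a set rather than a list keeps each membership test fast, which matters once the text grows.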

In [19]:
# Create the list of stopwords
stopword_list = stopwords.words('english')

# Print the size of the list
print(len(stopword_list))

# Take a peek at the first five stopwords
stopword_list[:5]

179


['i', 'me', 'my', 'myself', 'we']

In [25]:
# Split the cleaned article into a list of words

words = article.split()
words[:10]

['the',
 'centre',
 'has',
 'imposed',
 'a',
 '30day',
 'moratorium',
 'on',
 'lakshmi',
 'vilas']

In [26]:
# Create a list of words with the stopwords removed and assign it to a variable

filtered_words = [word for word in words if word not in stopword_list]
filtered_words[:10]

['centre',
 'imposed',
 '30day',
 'moratorium',
 'lakshmi',
 'vilas',
 'bank',
 'effective',
 'tuesday',
 'withdrawal']

In [27]:
# Join the words in the list back into a single string; assign the result to a variable to keep it.

' '.join(filtered_words)

'centre imposed 30day moratorium lakshmi vilas bank effective tuesday withdrawal limit 25000 certain exceptions unforeseen expenses imposed depositors rbi said financial position bank undergone steady decline continuous losses last three years'

In [28]:
# Stopwords function

def remove_stopwords(string, extra_words=[], exclude_words=[]):
    '''
    This function takes in a string and optional extra_words and exclude_words parameters
    (both default to empty lists) and returns the string with stopwords removed.
    '''
    # Create stopword_list
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in the text
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list
    stopword_list = stopword_list.union(set(extra_words))
    
    # Split words in the string
    words = string.split()
    
    # Create a list of words from the string with stopwords removed and assign to variable
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join the words in the list back into a single string and assign to a variable
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [29]:
# Test the function
remove_stopwords(article)

'centre imposed 30day moratorium lakshmi vilas bank effective tuesday withdrawal limit 25000 certain exceptions unforeseen expenses imposed depositors rbi said financial position bank undergone steady decline continuous losses last three years'

In [30]:
# Test the function for adding extra_words to stopword list and removing exclude_words

remove_stopwords(article, extra_words=['bank'], exclude_words=['the'])

'the centre imposed 30day moratorium lakshmi vilas effective tuesday withdrawal limit 25000 certain exceptions unforeseen expenses imposed depositors the rbi said the financial position the undergone steady decline continuous losses the last three years'

### Build the helper function by chaining the above functions

In [36]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the name of a text column, with the option
    to pass lists for extra_words and exclude_words, and returns a df with the article title,
    original text, stemmed text, lemmatized text, and the cleaned, tokenized, stopword-filtered, lemmatized text.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, extra_words=extra_words, exclude_words=exclude_words)\
                            .apply(lemmatize)
    
    df['stemmed'] = df[column].apply(basic_clean).apply(stem)
    
    df['lemmatized'] = df[column].apply(basic_clean).apply(lemmatize)
    
    return df[['title', column, 'stemmed', 'lemmatized', 'clean']]

### 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [3]:
# Read the news with a fresh scrape

news_df = acquire.get_news_articles(cached=False)
news_df.head()

Unnamed: 0,topic,title,author,original
0,business,"Lakshmi Vilas Bank withdrawals capped at ₹25,0...",Pragya Swastik,The Centre has imposed a 30-day moratorium on ...
1,business,Pfizer shares drop 4.5% as Moderna says its va...,Krishna Veera Vanamali,Pfizer’s shares fell as much as 4.5% on Monday...
2,business,How does Moderna's COVID-19 vaccine candidate ...,Pragya Swastik,Moderna's initial results of late-stage trial ...
3,business,Shutting Delhi markets may prove counterproduc...,Sakshita Khosla,Traders' body CAIT on Tuesday said a proposal ...
4,business,"Musk gets $15bn richer in 2 hours, becomes wor...",Krishna Veera Vanamali,Billionaire Elon Musk added $15 billion to his...


In [4]:
# Prepare the text using the helper function

news_df = prepare.prep_article_data(news_df, 'original')
news_df.head()

Unnamed: 0,title,original,stemmed,lemmatized,clean
0,"Lakshmi Vilas Bank withdrawals capped at ₹25,0...",The Centre has imposed a 30-day moratorium on ...,the centr ha impos a 30day moratorium on laksh...,the centre ha imposed a 30day moratorium on la...,centre imposed 30day moratorium lakshmi vila b...
1,Pfizer shares drop 4.5% as Moderna says its va...,Pfizer’s shares fell as much as 4.5% on Monday...,pfizer share fell as much as 45 on monday afte...,pfizers share fell a much a 45 on monday after...,pfizers share fell much 45 monday rival modern...
2,How does Moderna's COVID-19 vaccine candidate ...,Moderna's initial results of late-stage trial ...,moderna' initi result of latestag trial show i...,moderna's initial result of latestage trial sh...,moderna ' initial result latestage trial show ...
3,Shutting Delhi markets may prove counterproduc...,Traders' body CAIT on Tuesday said a proposal ...,traders' bodi cait on tuesday said a propos to...,traders' body cait on tuesday said a proposal ...,trader ' body cait tuesday said proposal impos...
4,"Musk gets $15bn richer in 2 hours, becomes wor...",Billionaire Elon Musk added $15 billion to his...,billionair elon musk ad 15 billion to hi wealt...,billionaire elon musk added 15 billion to his ...,billionaire elon musk added 15 billion wealth ...


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [2]:
# Read the blogs with a fresh scrape

codeup_df = acquire.acquire_codeup_blogs(urls=acquire.get_blog_urls(), cached=False)
codeup_df

Unnamed: 0,title,original
0,How We’re Celebrating World Mental Health Day ...,World Mental Health Day is on October 10th. Al...
1,What is Codeup’s Application Process?,Curious about Codeup’s application process? Wo...
2,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
3,Codeup on Inc. 5000 Fastest Growing Private Co...,We’re excited to announce a huge Codeup achiev...
4,Codeup Grads Win CivTech Datathon,Many Codeup alumni enjoy competing in hackatho...
5,What to Expect at Codeup,"Setting Expectations for Life Before, During, ..."
6,How to Succeed in a Coding Bootcamp,We held a virtual event called “How to Succeed...
7,Build Your Career in Tech: Advice from Alumni!,"Bryan Walsh, Codeup Web Development alum, and ..."
8,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
9,Codeup Launches Houston!,"Houston, we have a problem: there aren’t enoug..."


In [3]:
# Prepare the text using the helper function

codeup_df = prepare.prep_article_data(codeup_df, 'original')
codeup_df.head()

Unnamed: 0,title,original,stemmed,lemmatized,clean
0,How We’re Celebrating World Mental Health Day ...,World Mental Health Day is on October 10th. Al...,world mental health day is on octob 10th all o...,world mental health day is on october 10th all...,world mental health day october 10th u codeup ...
1,What is Codeup’s Application Process?,Curious about Codeup’s application process? Wo...,curiou about codeup applic process wonder whi ...,curious about codeups application process wond...,curious codeups application process wondering ...
2,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will...",come into our data scienc program you will nee...,coming into our data science program you will ...,coming data science program need know math sta...
3,Codeup on Inc. 5000 Fastest Growing Private Co...,We’re excited to announce a huge Codeup achiev...,were excit to announc a huge codeup achiev inc...,were excited to announce a huge codeup achieve...,excited announce huge codeup achievement inc m...
4,Codeup Grads Win CivTech Datathon,Many Codeup alumni enjoy competing in hackatho...,mani codeup alumni enjoy compet in hackathon a...,many codeup alumnus enjoy competing in hackath...,many codeup alumnus enjoy competing hackathons...
