# Data Preprocessing

In [1]:
# import packages
import numpy as np
import pandas as pd
import re

As a review from the last notebook, we scraped speech transcripts from James Clear's ["Great Speeches" page](https://jamesclear.com/great-speeches) (taking the first 10, which are sorted alphabetically by the orator's last name) and the lyrics from all songs on the bestselling albums of all time (according to [Business Insider](https://www.businessinsider.com/50-best-selling-albums-all-time-2016-9)) from [AZLyrics](https://www.azlyrics.com/).

Below is what our raw dataframes look like. Not bad if I do say so myself. Of course, there's a lot of preprocessing left to do, but it looks like we collected the data properly.

In [2]:
speech_df = pd.read_excel('speeches.xlsx', dtype=str)
speech_df

Unnamed: 0,Orator,Title,Transcript,Link
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,I'm a storyteller. And I would like to tell yo...,https://jamesclear.com/great-speeches/the-dang...
1,Jeff Bezos,What Matters More Than Your Talents,"As a kid, I spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...
2,John C. Bogle,Enough,Here’s how I recall the wonderful story that s...,https://jamesclear.com/great-speeches/enough-b...
3,Brené Brown,The Anatomy of Trust,"Oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...
4,John Cleese,Creativity in Management,"You know, when Video Arts asked me if I'd like...",https://jamesclear.com/great-speeches/creativi...
5,William Deresiewicz,Solitude and Leadership,My title must seem like a contradiction. What ...,https://jamesclear.com/great-speeches/solitude...
6,Richard Feynman,Seeking New Laws,What I want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...
7,Neil Gaiman,Make Good Art,I never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...
8,John W. Gardner,Personal Renewal,I'm going to talk about “Self-Renewal.” One of...,https://jamesclear.com/great-speeches/personal...
9,Elizabeth Gilbert,Your Elusive Creative Genius,I am a writer. Writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...


In [3]:
song_df = pd.read_excel('songs.xlsx', dtype=str)
song_df

Unnamed: 0,Artist,Album,Lyric,Artist_Link
0,Michael Jackson,"""Thriller""",\n\r\nI said you wanna be startin' somethin'\n...,https://www.azlyrics.com/j/jackson.html
1,Eagles,"""Hotel California""","\n\r\nOn a dark desert highway, cool wind in m...",https://www.azlyrics.com/e/eagles.html
2,Led Zeppelin,"""Led Zeppelin IV""","\n\r\nHey, hey mama, said the way you move\nGo...",https://www.azlyrics.com/l/ledzeppelin.html
3,Pink Floyd,"""The Wall""",\n\r\n(We came in)\n\nSo ya thought ya might l...,https://www.azlyrics.com/p/pinkfloyd.html
4,AC/DC,"""Back In Black""",\n\r\nI'm rolling thunder pouring rain\nI'm co...,https://www.azlyrics.com/a/acdc.html
5,Garth Brooks,"""Double Live""",\n\r\nWe all came here for a party tonight\nAn...,https://www.azlyrics.com/b/brooks.html
6,Hootie & The Blowfish,"""Cracked Rear View""",\n\r\nYou got your big girl \nNow you've got y...,https://www.azlyrics.com/h/hootie.html
7,Fleetwood Mac,"""Rumours""",\n\r\nI know there's nothing to say\nSomeone h...,https://www.azlyrics.com/f/fleetwood.html
8,Shania Twain,"""Come On Over""",\n\r\nLet's go girls! Come on.\n\nI'm going ou...,https://www.azlyrics.com/t/twain.html
9,The Beatles,"""The Beatles (The White Album)""","\n\r\nOh, flew in by Miami Beach B.O.A.C.\nDid...",https://www.azlyrics.com/b/beatles.html


Below, I've combined our two raw dataframes in the hope that they can be preprocessed at the same time. Twenty rows may be a bit much for some people, but it's simple to take rows out, so I felt it was best to have more rather than fewer. As an aside, it also made web scraping a more interesting exercise, since I needed to deal with different page and URL formats.

In [4]:
comm_df = pd.read_excel('communication_data.xlsx', dtype=str)
comm_df

Unnamed: 0,Originator,Title,Corpus,Source
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,I'm a storyteller. And I would like to tell yo...,https://jamesclear.com/great-speeches/the-dang...
1,Jeff Bezos,What Matters More Than Your Talents,"As a kid, I spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...
2,John C. Bogle,Enough,Here’s how I recall the wonderful story that s...,https://jamesclear.com/great-speeches/enough-b...
3,Brené Brown,The Anatomy of Trust,"Oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...
4,John Cleese,Creativity in Management,"You know, when Video Arts asked me if I'd like...",https://jamesclear.com/great-speeches/creativi...
5,William Deresiewicz,Solitude and Leadership,My title must seem like a contradiction. What ...,https://jamesclear.com/great-speeches/solitude...
6,Richard Feynman,Seeking New Laws,What I want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...
7,Neil Gaiman,Make Good Art,I never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...
8,John W. Gardner,Personal Renewal,I'm going to talk about “Self-Renewal.” One of...,https://jamesclear.com/great-speeches/personal...
9,Elizabeth Gilbert,Your Elusive Creative Genius,I am a writer. Writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...
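
For reference, a combined frame like the one above could be assembled roughly as follows. This is only a sketch: the column mapping is inferred from the headers shown in this notebook, and the tiny inline frames (with placeholder text and links) stand in for `speeches.xlsx` and `songs.xlsx`. The `*_demo` names are hypothetical so that nothing in the notebook gets overwritten.

```python
import pandas as pd

# Tiny stand-ins for the two raw frames (placeholder values, not the scraped data)
speech_demo = pd.DataFrame({
    "Orator": ["Neil Gaiman"],
    "Title": ["Make Good Art"],
    "Transcript": ["placeholder speech text"],
    "Link": ["placeholder-speech-link"],
})
song_demo = pd.DataFrame({
    "Artist": ["Eagles"],
    "Album": ['"Hotel California"'],
    "Lyric": ["placeholder lyric text"],
    "Artist_Link": ["placeholder-artist-link"],
})

# Rename both frames to the shared column names, then stack them
speech_part = speech_demo.rename(
    columns={"Orator": "Originator", "Transcript": "Corpus", "Link": "Source"}
)
song_part = song_demo.rename(
    columns={"Artist": "Originator", "Album": "Title", "Lyric": "Corpus", "Artist_Link": "Source"}
)
combined_demo = pd.concat([speech_part, song_part], ignore_index=True)
print(combined_demo)
```

Using `ignore_index=True` gives the combined frame a fresh 0..n-1 index, which matches the way rows are addressed later on (e.g. `comm_df.at[10,'Corpus']`).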


#### Handling Numbers

I don't plan on handling numbers for this project, mostly because I don't believe they are relevant to its goal. For that reason, I will be removing numerical characters (i.e. 0-9) from my corpora along with other special characters. The code block below shows the total number of numeric characters in our dataset; the commented-out lines show the count for each row.

The digit count for speeches is significantly higher than for the album lyrics. That said, our numbers depend in part on the website's preference for how numbers are written (whether they chose '2' or 'two'). Even if it wasn't James Clear's site, I would suspect speeches include more numbers, since people often reference years ("back in '01"), dates or figures ("$7000"), whereas songs normally don't have such usages.

In [5]:
# If you'd like to see the number of chars by row, uncomment the below lines
# num_digits = comm_df["Corpus"].map(lambda s: sum(c.isdigit() for c in s))
# print('Number of Numerical Chars per Row')
# print(num_digits)

number_of_numerical_chars = sum(c.isdigit() for c in comm_df['Corpus'].str.cat())
print('Total Number of Numerical Chars')
print(number_of_numerical_chars)

Total Number of Numerical Chars
315
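
To compare speeches and lyrics directly, the same `isdigit` count can be grouped by source. A minimal sketch, assuming the two kinds of text are held in plain lists; the sample strings are made up for illustration and `count_digits` is a hypothetical helper name.

```python
# Made-up sample texts; in the notebook these would come from comm_df['Corpus']
sample_texts = {
    "speeches": ["Back in '01 we raised $7000.", "It took 3 years."],
    "lyrics": ["Dancing all night long", "Call me at 9"],
}

def count_digits(text):
    # Count the characters for which str.isdigit() is True
    return sum(ch.isdigit() for ch in text)

# Total digit count per source
digits_by_source = {
    source: sum(count_digits(text) for text in texts)
    for source, texts in sample_texts.items()
}
print(digits_by_source)
```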


The next step is taking out characters or unspoken text from our speeches/lyrics. This includes text between square brackets (used for what goes on in songs, like "[__ singer:]"), number characters, newlines and special characters (besides apostrophes). Notice that I start off by replacing all of these with a space. This makes sure words don't get joined if there's something like a newline or backslash between them. After all that, I replace runs of multiple spaces with a single space.

In [6]:
# example strings
ex_str1 = '[Speaker clears throat]\nThis is an example of what a corpus might look like.'
ex_str2 = 'Using "\r\n" is the standard newline for Windows. \n is used for Linux and Macs. \r is seen for old Macs.'
ex_str3 = "@#$ Let $me a*dd 9te0xt with cha^rs we don't want to keep"

def standardize_text(corpus):
    # remove text between square brackets (effects in lyric text)
    s = re.sub(r'\[.*?\]', ' ', corpus)
    # replace newlines/returns with spaces
    s = re.sub(r'\s',' ',s)
    # set all letters to lower case
    s = s.lower()
    # remove characters that aren't alphabetical, apostrophes or spaces
    s = re.sub(r"[^a-z ']",' ',s)
    # remove extra spaces
    s = re.sub(' +', ' ',s)
    return s

print("Example 1:")
print(standardize_text(ex_str1))
print("Example 2:")
print(standardize_text(ex_str2))
print("Example 3:")
print(standardize_text(ex_str3))

Example 1:
 this is an example of what a corpus might look like 
Example 2:
using is the standard newline for windows is used for linux and macs is seen for old macs 
Example 3:
 let me a dd te xt with cha rs we don't want to keep


In [7]:
comm_df['Corpus'] = comm_df['Corpus'].apply(standardize_text)
comm_df

Unnamed: 0,Originator,Title,Corpus,Source
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,i'm a storyteller and i would like to tell you...,https://jamesclear.com/great-speeches/the-dang...
1,Jeff Bezos,What Matters More Than Your Talents,as a kid i spent my summers with my grandparen...,https://jamesclear.com/great-speeches/what-mat...
2,John C. Bogle,Enough,here s how i recall the wonderful story that s...,https://jamesclear.com/great-speeches/enough-b...
3,Brené Brown,The Anatomy of Trust,oh it just feels like an incredible understate...,https://jamesclear.com/great-speeches/the-anat...
4,John Cleese,Creativity in Management,you know when video arts asked me if i'd like ...,https://jamesclear.com/great-speeches/creativi...
5,William Deresiewicz,Solitude and Leadership,my title must seem like a contradiction what c...,https://jamesclear.com/great-speeches/solitude...
6,Richard Feynman,Seeking New Laws,what i want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...
7,Neil Gaiman,Make Good Art,i never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...
8,John W. Gardner,Personal Renewal,i'm going to talk about self renewal one of yo...,https://jamesclear.com/great-speeches/personal...
9,Elizabeth Gilbert,Your Elusive Creative Genius,i am a writer writing books is my profession b...,https://jamesclear.com/great-speeches/your-elu...
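
Row 2 above begins with `here s`: the source text uses a typographic apostrophe (`’`), which the `[^a-z ']` filter treats as a special character, so the contraction is split in two. Accented letters such as the `é` in "Brené" would be dropped the same way. One possible fix, offered as a sketch and not something this notebook does, is to normalize the text before standardizing it. `normalize_quotes_and_accents` is a hypothetical helper name, and `standardize_text` is repeated here only so the example runs on its own.

```python
import re
import unicodedata

def normalize_quotes_and_accents(text):
    # Map typographic single quotes (U+2019, U+2018) to the ASCII apostrophe
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    # Decompose accented letters, then drop the combining marks (e.g. e-acute -> e)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def standardize_text(corpus):
    # Same steps as the notebook's function
    s = re.sub(r'\[.*?\]', ' ', corpus)
    s = re.sub(r'\s', ' ', s)
    s = s.lower()
    s = re.sub(r"[^a-z ']", ' ', s)
    s = re.sub(' +', ' ', s)
    return s

raw = "Here\u2019s how I recall the story"
print(standardize_text(raw))                                # contraction gets split
print(standardize_text(normalize_quotes_and_accents(raw)))  # contraction survives
```

Running the normalization first means contractions written with curly apostrophes reach the contraction-handling step intact.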


#### Contractions


In [8]:
# get list of contractions
def get_contractions(df,txt_col):
    all_words = df[txt_col].str.cat(sep=' ')
    # could use regex, but more steps and this already runs fast
    unique_contractions = set([word for word in all_words.split() if "'" in word])
    return unique_contractions

# get contractions with their number of occurrences
def get_contraction_count(df,txt_col):
    all_words = df[txt_col].str.cat(sep=' ')
    word_list = all_words.split()
    # could use regex, but more steps and this already runs fast
    contraction_dict = {word: 0 for word in word_list if "'" in word}
    for word in word_list:
        if "'" in word:
            contraction_dict[word] += 1
    # sort by count
    contraction_dict = {w: count for w, count in sorted(contraction_dict.items(), key=lambda item: item[1], reverse=True)}
    return contraction_dict

cont_set = get_contractions(comm_df,'Corpus')
get_contraction_count(comm_df,'Corpus')

{"don't": 341,
 "i'm": 286,
 "it's": 258,
 "you're": 201,
 "can't": 82,
 "there's": 74,
 "we're": 73,
 "i've": 65,
 "won't": 59,
 "she's": 56,
 "that's": 52,
 "i'll": 50,
 "you'll": 44,
 "ain't": 43,
 "didn't": 39,
 "'cause": 39,
 "d'do": 39,
 "i'd": 36,
 "you've": 27,
 "doesn't": 27,
 "they're": 25,
 "let's": 22,
 "startin'": 22,
 "somethin'": 22,
 "'n'": 22,
 "isn't": 20,
 "we've": 17,
 "'em": 16,
 "goin'": 15,
 "c'mon": 14,
 "couldn't": 13,
 "who's": 13,
 "d'da": 13,
 "we'll": 12,
 "he's": 12,
 "haven't": 11,
 "what's": 10,
 "'til": 10,
 "lovin'": 10,
 "'bout": 10,
 "it'll": 10,
 "holdin'": 10,
 "wasn't": 9,
 "aren't": 9,
 "they'll": 9,
 "tryin'": 9,
 "cryin'": 9,
 "everybody's": 8,
 "yesterday's": 8,
 "she'll": 7,
 "one's": 7,
 "we'd": 7,
 "daddy's": 7,
 "mama's": 7,
 "gol'": 7,
 "they've": 6,
 "here's": 6,
 "nobody's": 6,
 "lookin'": 6,
 "feelin'": 6,
 "'round": 5,
 "nature's": 5,
 "wouldn't": 5,
 "livin'": 5,
 "there'll": 5,
 "kickin'": 5,
 "'tween": 5,
 "fide's": 4,
 "'": 4,
 "h

Well, that's a lot of words with apostrophes. I'll do my best to batch process these, but it won't be perfect. To get a better understanding of which words occur most, I created `get_contraction_count(df,txt_col)`, so that I don't accidentally miss common contractions that my batch fixes don't cover.

One of the next steps will be removing words of one or two letters (since they normally don't mean anything in isolation), so we can simply strip the `'s` from words that end with it. Yes, it can also denote possession, but the noun is the part we want if that's the case. It also means `i'm` and `it's` (two of the top 3 contractions) are meant to be removed completely, since only stop words remain once they are expanded.
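
As an aside, the same tally can be written more compactly with `collections.Counter` from the standard library. The sketch below is an alternative to `get_contraction_count`, not a replacement for it; it takes a plain list of strings, uses a made-up sample, and `count_contractions` is a hypothetical name.

```python
from collections import Counter

def count_contractions(texts):
    # Split every text on whitespace and keep only tokens containing an apostrophe
    tokens = (word for text in texts for word in text.split() if "'" in word)
    # most_common() returns (token, count) pairs sorted by count, highest first
    return Counter(tokens).most_common()

sample = ["i'm sure you're right", "i'm not sure it's true", "don't say i'm wrong"]
print(count_contractions(sample))
```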

In [9]:
contractions_or_slang = {
    "c'mon": "come on",
    "i'm": "i am",
    "'em": 'them',
    "'n'": 'and',
    "'fore": 'before',
    "'round": 'around',
    "y'all": "you all",
    "ain't": 'is not',
    "'bout": "about",
    "'cause": 'because',
    "'til": "until",
    "d'ya": "would you",
    "'tween": "between",
    "ya": "you",
    "'ya": "you",
    "m'lord": "my lord",
    "t'aime": 'love you', 
    # yes, this was used somewhere
    "y'all's": "you all is",
    # assumed typo
    "'till": "until",
    # ignored - few occurrences
    "d'do": '', 
    "d'da":''
}

def expand_contraction(s):
    global contractions_or_slang
    if "'" not in s:
        return s
    elif s in contractions_or_slang:
        return contractions_or_slang[s]
    elif len(s)<3:
        return ''
    # whether possession or __ is, 's does not have significant meaning
    elif s[-2:] == "'s":
        return s[:-2]
    elif s[-3:] == "n't":
        return s[:-3] + ' not'
    elif s[-3:] == "'re":
        return s[:-3] + ' are'
    elif s[-2:] == "'d":
        return s[:-2] + ' would'
    elif s[-3:] == "'ve":
        return s[:-3] + ' have'
    elif s[-3:] == "'ll":
        return s[:-3] + ' will'
    elif s[-3:] == "in'":
        return s[:-1] + 'g'
    elif "'" == s[-1]:
        return s[:-1]
    return s
    
## After initial creation of the function
# [word for word in list(map(expand_contraction, cont_set)) if "'" in word]
# >>> ["'till", "o'clock", "t'aime", "d'do", "d'da", "'ya"]

So it seems like we've dealt with the majority of contractions without a hitch. The few odd ones in the comment above will be handled with some assumptions. "'till" is probably a typo of "'til" (i.e. "until"), "o'clock" is acceptable as is, "t'aime" is French for "love you", and everything after that is a bit odd. They'll probably be removed in tf-idf (very low count from `get_contraction_count`), but I'll keep the words they probably represent. I'm removing "d'do" and "d'da" because they don't hold any meaning.

In [10]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def resolve_contractions_in_corpus(corpus):
    word_list = corpus.split()
#     expanded_word_list = [word for word in list(map(expand_contraction, word_list))]
    expanded_word_list = [word for word in list(map(expand_contraction, word_list)) if word not in stop_words and len(word)>2]
    return ' '.join(expanded_word_list)
    
# try it out on MJ's "Thriller" text
resolve_contractions_in_corpus(comm_df.at[10,'Corpus'])[:555]

'said wanna starting something got starting something said wanna starting something got starting something high get yeah yeah low get yeah yeah you are stuck middle yeah yeah pain thunder yeah yeah high get yeah yeah low get yeah yeah you are stuck middle yeah yeah pain thunder yeah yeah took baby doctor fever nothing found time hit street said breakdown someone always trying start baby crying talking squealing lying saying wanna starting something said wanna starting something got starting something said wanna starting something got starting somethi'

This looks clean, so we'll apply this function to the 'Corpus' column of our DataFrame. It's very apparent that there's a lot of repetition in this first song, which is true of most music. Since there's no reason to have a separate section for it, the active list comprehension in `resolve_contractions_in_corpus` (which replaces the commented-out line above it) also removes stop words and words that are 1-2 letters long.
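
One detail in the sample output above: `you are` is still present even though both words are stop words. `expand_contraction` returns the two-word string `you are`, and the filter then checks that whole string, which is not in the stop-word set and is longer than two characters. If that isn't intended, re-splitting after expansion is one way around it. A minimal sketch, using small stand-ins (`STOP_WORDS`, `EXPANSIONS`, `expand`) in place of the NLTK stop-word list and the notebook's `expand_contraction` so that it runs on its own:

```python
# Stand-ins for illustration only (assumptions, not the notebook's full objects)
STOP_WORDS = {"i", "am", "you", "are", "it", "is", "not", "do", "a", "and", "the"}
EXPANSIONS = {"i'm": "i am", "you're": "you are", "don't": "do not"}

def expand(word):
    # Stand-in for expand_contraction: look up known contractions, else return unchanged
    return EXPANSIONS.get(word, word)

def resolve_and_filter(corpus, stop_words=STOP_WORDS):
    # Expand each token, then join and re-split so "i am" becomes two separate tokens
    expanded_tokens = " ".join(expand(word) for word in corpus.split()).split()
    # Now the stop-word and length checks see one word at a time
    kept = [word for word in expanded_tokens if word not in stop_words and len(word) > 2]
    return " ".join(kept)

print(resolve_and_filter("i'm a storyteller and you're stuck"))
```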

In [11]:
comm_df['Corpus'] = comm_df['Corpus'].apply(resolve_contractions_in_corpus)
comm_df

Unnamed: 0,Originator,Title,Corpus,Source
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,i am storyteller would like tell personal stor...,https://jamesclear.com/great-speeches/the-dang...
1,Jeff Bezos,What Matters More Than Your Talents,kid spent summers grandparents ranch texas hel...,https://jamesclear.com/great-speeches/what-mat...
2,John C. Bogle,Enough,recall wonderful story sets theme remarks toda...,https://jamesclear.com/great-speeches/enough-b...
3,Brené Brown,The Anatomy of Trust,feels like incredible understatement say grate...,https://jamesclear.com/great-speeches/the-anat...
4,John Cleese,Creativity in Management,know video arts asked i would like talk creati...,https://jamesclear.com/great-speeches/creativi...
5,William Deresiewicz,Solitude and Leadership,title must seem like contradiction solitude le...,https://jamesclear.com/great-speeches/solitude...
6,Richard Feynman,Seeking New Laws,want talk tonight strictly speaking character ...,https://jamesclear.com/great-speeches/seeking-...
7,Neil Gaiman,Make Good Art,never really expected find giving advice peopl...,https://jamesclear.com/great-speeches/make-goo...
8,John W. Gardner,Personal Renewal,i am going talk self renewal one fundamental t...,https://jamesclear.com/great-speeches/personal...
9,Elizabeth Gilbert,Your Elusive Creative Genius,writer writing books profession course also gr...,https://jamesclear.com/great-speeches/your-elu...


## Data Cleaning

Stemming and lemmatization are important for NLP when looking at word counts, topics and other applications that don't care about correct verb usage or complete words. This means these techniques are generally a poor fit for tasks where the exact word form matters, such as text synthesis, question answering or sentiment analysis.

Stemmers use an algorithm to get the stem/root of given words. This way, variations on a word like "walk", such as "walked" and "walking", are all stemmed to "walk", and we get an appropriate count of the use of the verb in our corpus. The SnowballStemmer is generally preferred in practice, but if you're curious about the popular stemmers and how they differ, they are covered in more detail below.

Lemmatizers work toward the same goal as stemmers, but the approach is different. In NLP, a lemma is the dictionary (base) form of a word, so a lemmatizer looks words up in a vocabulary instead of trimming suffixes. A lemma doesn't always need to be different from the input word. For example, "made", "making" and "makes" would all be lemmatized to "make", while "make" itself stays as it is. Note that a lemmatizer does not merge synonyms: "create" and "make" keep separate lemmas. We'll take a more detailed look at how lemmatizers work in practice below.
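
To make the difference in approach concrete, here is a deliberately tiny toy contrast between suffix stripping and dictionary lookup. Neither function is a real stemmer or lemmatizer; the suffix list and the `LEMMA_TABLE` entries are made up for illustration.

```python
# Toy suffix stripper: removes the first matching suffix, with no linguistic knowledge
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: looks the word up in a tiny hand-written table of base forms
LEMMA_TABLE = {
    "made": "make", "making": "make", "makes": "make",
    "walked": "walk", "walking": "walk",
}

def toy_lemmatize(word):
    # Words missing from the table are returned unchanged
    return LEMMA_TABLE.get(word, word)

for word in ["walked", "walking", "made", "making"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stripper handles regular forms like "walked" but leaves "made" untouched and cuts "making" down to a non-word, while the lookup maps both to "make". Real lemmatizers rely on a much larger vocabulary to do the same thing.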

### Stemming

The three stemmers we'll discuss in this notebook are Porter, Snowball and Lancaster. My information is from [here](https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg) but I'll summarize it below before testing them out on some words. 

**Popular Stemmers:**
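* **Porter:** the original algorithm (Martin Porter, 1980) and the gentlest of the three; it strips common English suffixes in a fixed sequence of steps.
* **Snowball (Porter2):** Porter's own revision of his algorithm, with small fixes to the original rules; generally the recommended default.
* **Lancaster:** the most aggressive of the three (also known as the Paice/Husk stemmer); it produces shorter stems and over-stems more often, so the results can be hard to read.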

In [12]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # aka Porter2
from nltk.stem import LancasterStemmer
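
A quick way to compare the three is to run them over the same handful of words. A minimal sketch using the NLTK classes imported above; the word list is arbitrary and chosen only for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

words = ["walk", "walked", "walking", "generously", "maximum"]

# Print one row per word with each stemmer's output side by side
for word in words:
    stems = {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    print(word, stems)
```

All three agree on the "walk" family; the longer words are where the differences in aggressiveness tend to show up.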

### Lemmatization

### TF-IDF
