# Data Preprocessing

In [1]:
# import packages
import numpy as np
import pandas as pd
import re

As a review from the last notebook, we scraped speech transcripts from James Clear's ["Great Speeches" page](https://jamesclear.com/great-speeches), taking the top 10 as sorted alphabetically by the orator's last name. We also scraped the lyrics of every song on the best-selling albums of all time (according to this [Business Insider article](https://www.businessinsider.com/50-best-selling-albums-all-time-2016-9)) from [AZLyrics](https://www.azlyrics.com/).

Below are what our raw dataframes look like. Not bad if I do say so myself. Of course, there's a lot of preprocessing that has to be done, but it looks like we properly collected the data.

Edit: Temporarily, I will be using the `speeches_raw.xlsx` and `songs_raw.xlsx` datasets, which were generated from the top 24 speeches and songs successfully scraped from the websites. I'm doing this so that I have a larger dataset to work with for the analysis notebook, which makes the visualizations more useful and makes it easier to describe what the data looks like in general. I later plan to debug the scraping errors that started to appear when generating more than 10 rows due to differences in URLs and page structure.

My data where I only scraped the first 10 links for songs and speeches is in the `data_n=10` folder. This is mostly just to save the data somewhere where it won't be overwritten as I work with my longer dataframes.

In [2]:
speech_df = pd.read_excel('speeches_raw.xlsx', dtype=str)
speech_df
# speech_df.iat[2,2]

Unnamed: 0,Orator,Title,Transcript,Link
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,I'm a storyteller. And I would like to tell yo...,https://jamesclear.com/great-speeches/the-dang...
1,Jeff Bezos,What Matters More Than Your Talents,"As a kid, I spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...
2,John C. Bogle,Enough,Here’s how I recall the wonderful story that s...,https://jamesclear.com/great-speeches/enough-b...
3,Brené Brown,The Anatomy of Trust,"Oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...
4,John Cleese,Creativity in Management,"You know, when Video Arts asked me if I'd like...",https://jamesclear.com/great-speeches/creativi...
5,William Deresiewicz,Solitude and Leadership,My title must seem like a contradiction. What ...,https://jamesclear.com/great-speeches/solitude...
6,Richard Feynman,Seeking New Laws,What I want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...
7,Neil Gaiman,Make Good Art,I never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...
8,John W. Gardner,Personal Renewal,I'm going to talk about “Self-Renewal.” One of...,https://jamesclear.com/great-speeches/personal...
9,Elizabeth Gilbert,Your Elusive Creative Genius,I am a writer. Writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...


In [3]:
song_df =  pd.read_excel('songs_raw.xlsx', dtype=str)
song_df

Unnamed: 0,Artist,Album,Lyric,Artist_Link
0,Michael Jackson,"""Thriller""",\n\r\nI said you wanna be startin' somethin'\n...,https://www.azlyrics.com/j/jackson.html
1,Eagles,"""Hotel California""","\n\r\nOn a dark desert highway, cool wind in m...",https://www.azlyrics.com/e/eagles.html
2,Led Zeppelin,"""Led Zeppelin IV""","\n\r\nHey, hey mama, said the way you move\nGo...",https://www.azlyrics.com/l/ledzeppelin.html
3,Pink Floyd,"""The Wall""",\n\r\n(We came in)\n\nSo ya thought ya might l...,https://www.azlyrics.com/p/pinkfloyd.html
4,AC/DC,"""Back In Black""",\n\r\nI'm rolling thunder pouring rain\nI'm co...,https://www.azlyrics.com/a/acdc.html
5,Garth Brooks,"""Double Live""",\n\r\nWe all came here for a party tonight\nAn...,https://www.azlyrics.com/b/brooks.html
6,Hootie & The Blowfish,"""Cracked Rear View""",\n\r\nYou got your big girl \nNow you've got y...,https://www.azlyrics.com/h/hootie.html
7,Fleetwood Mac,"""Rumours""",\n\r\nI know there's nothing to say\nSomeone h...,https://www.azlyrics.com/f/fleetwood.html
8,Shania Twain,"""Come On Over""",\n\r\nLet's go girls! Come on.\n\nI'm going ou...,https://www.azlyrics.com/t/twain.html
9,The Beatles,"""The Beatles (The White Album)""","\n\r\nOh, flew in by Miami Beach B.O.A.C.\nDid...",https://www.azlyrics.com/b/beatles.html


Below, I've combined our two raw data frames so that they can be preprocessed at the same time. Twenty rows may be a bit much for some people, but it's simple to take out rows, so I felt it was best to have more rather than less. As an aside, it also made web scraping a more interesting exercise since I needed to deal with different page and URL formats.

In [4]:
comm_df =  pd.read_excel('communication_data.xlsx', dtype=str)
# let's remove those newlines at the start of our songs
comm_df['Corpus'] = comm_df['Corpus'].str.strip()
comm_df

Unnamed: 0,Originator,Title,Corpus,Link,Text Type
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,I'm a storyteller. And I would like to tell yo...,https://jamesclear.com/great-speeches/the-dang...,speech
1,Jeff Bezos,What Matters More Than Your Talents,"As a kid, I spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...,speech
2,John C. Bogle,Enough,Here’s how I recall the wonderful story that s...,https://jamesclear.com/great-speeches/enough-b...,speech
3,Brené Brown,The Anatomy of Trust,"Oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...,speech
4,John Cleese,Creativity in Management,"You know, when Video Arts asked me if I'd like...",https://jamesclear.com/great-speeches/creativi...,speech
5,William Deresiewicz,Solitude and Leadership,My title must seem like a contradiction. What ...,https://jamesclear.com/great-speeches/solitude...,speech
6,Richard Feynman,Seeking New Laws,What I want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...,speech
7,Neil Gaiman,Make Good Art,I never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...,speech
8,John W. Gardner,Personal Renewal,I'm going to talk about “Self-Renewal.” One of...,https://jamesclear.com/great-speeches/personal...,speech
9,Elizabeth Gilbert,Your Elusive Creative Genius,I am a writer. Writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...,speech


#### Handling Numbers

I don't plan on handling numbers for this project, mostly because I don't believe they are relevant to its goal. For that reason, I will be removing numerical characters (i.e. 0-9) from my corpora along with other special characters. The code block below prints the total number of numeric characters in our dataset; the per-row distribution is available by uncommenting its first few lines.

The digit count for speeches is significantly higher than for the album lyrics. That said, our numbers do in part depend on each website's preference for writing numbers as digits or words (whether they chose to say '2' or 'two'). Even if it weren't for James Clear's site, I would suspect speeches include more numbers, since people often reference years ("back in '01"), dates or figures ("$7000"), whereas songs normally don't.

In [5]:
# If you'd like to see the number of chars by row, uncomment the below lines
# num_digits = comm_df["Corpus"].map(lambda s: sum(c.isdigit() for c in s))
# print('Number of Numerical Chars per Row')
# print(num_digits)

# set cat to song or speech to see the total numerical chars for that category
# leave as empty string for whole df total
cat = ''
number_of_numerical_chars = sum(c.isdigit() for c in comm_df.loc[comm_df['Text Type'].str.contains(cat),'Corpus'].str.cat())
print('Total Number of Numerical Chars')
print(number_of_numerical_chars)

Total Number of Numerical Chars
1338


The next step is taking out unspoken text and unwanted characters from our speeches/lyrics. This includes text between square brackets (used for what goes on in songs, like "[__ singer:]"), number characters and most special characters (apostrophes and basic punctuation are kept). Notice that I replace these with a space rather than deleting them. This makes sure words don't get joined if there's something like a bracketed note or a slash between them. After all that, I replace runs of multiple spaces with a single space.

Newlines are sort of a special case. As you'll see when I get to sentiment analysis, the tools I plan to use (VADER and TextBlob) struggle with large corpora. As a result, we'll preserve newlines between verses (songs) and paragraphs (speeches).

I won't be taking out sentence punctuation in this step. I originally planned to by adding the line
`s = re.sub(r"[^a-z ']",' ',s)`, but later realized that punctuation helps NLTK's part-of-speech tagging, which the lemmatizer relies on to figure out if a word is an adjective, noun or whatever. Removing it is also a pretty simple step, so it might happen in the analysis notebooks. I will also not be taking out stop words and short words (len < 3) yet for the same reason.

In [6]:
# example strings
ex_str1 = '[Speaker clears throat]\nThis is an example of what a corpus might look like.'
ex_str2 = 'Using "\r\n" is the standard newline for Windows. \n is used for Linux and Macs. \r is seen for old Macs.'
ex_str3 = "@#$ Let $me a*dd80943 9te0xt with cha^rs we don't[ really don't](123#$%) want to keep."

def standardize_text(corpus):
    # remove text between square brackets (effects in lyric text)
    s = re.sub(r'\[.*?\]', ' ', corpus)
    # drop carriage returns so every line break is a plain '\n'
    s = re.sub('\r', '', s)
    # set all letters to lower case
    s = s.lower()
    # replace curly apostrophe with the standard one
    s = re.sub(r"’", "'", s)
    # replace everything except letters, basic punctuation, spaces, newlines and apostrophes
    s = re.sub(r"[^a-z.,?! \n']", ' ', s)
    # collapse runs of spaces into one
    s = re.sub(' +', ' ', s)
    return s

print("Example 1:")
print(standardize_text(ex_str1))
print("Example 2:")
print(standardize_text(ex_str2))
# using the function I took out for lemmatization effectiveness
print("Example 3:")
print(standardize_text(ex_str3))

Example 1:
 
this is an example of what a corpus might look like.
Example 2:
using 
 is the standard newline for windows. 
 is used for linux and macs. is seen for old macs.
Example 3:
 let me a dd te xt with cha rs we don't want to keep.


In [7]:
comm_df1 = comm_df.copy()
# comm_df = comm_df1

In [8]:
comm_df['Corpus'] = comm_df['Corpus'].apply(standardize_text)
song_mask = (comm_df['Text Type']=='song')
comm_df.loc[song_mask,'Corpus'] = comm_df.loc[song_mask,'Corpus'].apply(lambda s: re.sub('\n\n','\t',s))
comm_df.loc[song_mask,'Corpus'] = comm_df.loc[song_mask,'Corpus'].apply(lambda s: re.sub('\n',' ',s))
comm_df.loc[song_mask,'Corpus'] = comm_df.loc[song_mask,'Corpus'].apply(lambda s: re.sub('\t','\n',s))
# comm_df.iat[16,2]
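
The three substitutions applied to the song rows are easier to follow on a toy string. Here is a minimal sketch, assuming a made-up lyric (`toy_lyric` and `collapse_line_breaks` are hypothetical names, not part of the notebook's data or code): blank lines between verses are parked as tabs, the remaining single newlines become spaces, and the tabs are turned back into one newline per verse break. The tab works as a placeholder only because `standardize_text` has already replaced any tabs with spaces.

```python
import re

def collapse_line_breaks(text):
    # 1) park verse breaks (blank lines) as a tab placeholder
    s = re.sub('\n\n', '\t', text)
    # 2) join the lines inside each verse with spaces
    s = re.sub('\n', ' ', s)
    # 3) restore a single newline for every verse break
    return re.sub('\t', '\n', s)

# hypothetical two-verse lyric, two lines per verse
toy_lyric = "line one\nline two\n\nline three\nline four"
print(repr(collapse_line_breaks(toy_lyric)))
```

Each verse ends up on its own line, which keeps the chunks small enough for the sentiment tools mentioned earlier.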

#### Contractions


In [9]:
# get the set of contractions (any token containing an apostrophe or hyphen)
def get_contractions(df,txt_col):
    all_words = df[txt_col].str.cat(sep=' ')
    # could use regex, but more steps and this already runs fast
    unique_contractions = {word for word in all_words.split() if "'" in word or "-" in word}
    return unique_contractions

# get contractions with their number of occurrences
def get_contraction_count(df,txt_col):
    all_words = df[txt_col].str.cat(sep=' ')
    word_list = all_words.split()
    contraction_dict = {}
    for word in word_list:
        # same test as get_contractions, so hyphenated tokens are counted too
        if "'" in word or "-" in word:
            contraction_dict[word] = contraction_dict.get(word, 0) + 1
    # sort by count, most frequent first
    contraction_dict = dict(sorted(contraction_dict.items(), key=lambda item: item[1], reverse=True))
    return contraction_dict

cont_set = get_contractions(comm_df,'Corpus')
get_contraction_count(comm_df,'Corpus')

{"i'm": 778,
 "don't": 738,
 "it's": 734,
 "you're": 522,
 "that's": 273,
 "can't": 263,
 "i've": 227,
 "there's": 212,
 "we're": 191,
 "i'll": 185,
 "didn't": 135,
 "won't": 132,
 "ain't": 121,
 "'cause": 118,
 "she's": 112,
 "you'll": 106,
 "they're": 100,
 "i'd": 98,
 "you've": 98,
 "let's": 88,
 "isn't": 72,
 "doesn't": 69,
 "he's": 68,
 "what's": 58,
 "we'll": 54,
 "we've": 48,
 "couldn't": 48,
 "talkin'": 40,
 "wasn't": 39,
 "d'do": 39,
 "goin'": 34,
 "who's": 33,
 "they'll": 32,
 "'til": 31,
 "here's": 30,
 "you'd": 30,
 "'bout": 28,
 "stayin'": 28,
 "hadn't": 27,
 "wouldn't": 25,
 "'em": 25,
 "somethin'": 25,
 "startin'": 24,
 "livin'": 24,
 "holdin'": 23,
 "'n'": 22,
 "we'd": 21,
 "doin'": 21,
 "aren't": 20,
 "haven't": 19,
 "lovin'": 18,
 "thinkin'": 18,
 "it'll": 16,
 "they've": 15,
 "people's": 15,
 "'round": 15,
 "everybody's": 15,
 "nobody's": 14,
 "she'll": 13,
 "shouldn't": 13,
 "d'da": 13,
 "tryin'": 12,
 "feelin'": 12,
 "gettin'": 12,
 "he'll": 11,
 "he'd": 10,
 "one'

Well, that's a lot of words with apostrophes. I'll do my best to batch process these, but it won't be perfect. To get a better understanding of which words occur most, I created `get_contraction_count(df,txt_col)` so that I don't accidentally miss common contractions that my batch fixes don't cover.

One of the next steps will be removing words of two letters or fewer (since they normally don't mean anything in isolation), so we can simply drop a trailing `'s`. Yes, it can also denote possession, but the noun is the part we want if that's the case. It also means `i'm` and `it's` (two of the top 3 contractions) will end up being completely removed.

In [10]:
contractions_or_slang = {
    "c'mon": "come on",
    "i'm": "i am",
    "'em": 'them',
    "'n'": 'and',
    "'fore": 'before',
    "'round": 'around',
    "y'all": "you all",
    "ain't": 'is not',
    "'bout": "about",
    "'cause": 'because',
    "'til": "until",
    "d'ya": "would you",
    "'tween": "between",
    "wanna": "want to",
    "won't": "will not",
    "ya": "you",
    "'ya": "you",
    "m'lord": "my lord",
    "t'aime": 'love you', 
    # yes, this was used somewhere
    "y'all's": "you all is",
    # assumed typo
    "'till": "until",
    # the generic n't rule in expand_contraction would turn this into "ca not"
    "can't": "can not",
    # ignored - few occurrences
    "d'do": '', 
    "d'da":''
}

def expand_contraction(s):
    # s is the regex match object handed over by re.sub
    s = s.group(0)
    if s in contractions_or_slang:
        return contractions_or_slang[s]
    elif "'" not in s or len(s) < 3:
        return s
    # whether possession or "__ is", 's does not carry significant meaning
    elif s[-2:] == "'s":
        return s[:-2]
    elif s[-3:] == "n't":
        return s[:-3] + ' not'
    elif s[-3:] == "'re":
        return s[:-3] + ' are'
    elif s[-2:] == "'d":
        return s[:-2] + ' would'
    elif s[-3:] == "'ve":
        return s[:-3] + ' have'
    elif s[-3:] == "'ll":
        return s[:-3] + ' will'
    elif s[-3:] == "in'":
        return s[:-1] + 'g'
    elif "'" == s[-1]:
        return s[:-1]
    return s

## After initial creation of the function
# [word for word in list(map(expand_contraction, cont_set)) if "'" in word]
# >>> ["'till", "o'clock", "t'aime", "d'do", "d'da", "'ya"]

So it seems like we've dealt with the majority of contractions without a hitch. The few weird ones commented above will be accounted for with some assumptions. "'till" is probably a typo of "'til" (i.e. "until"), "o'clock" is acceptable, "t'aime" is French for "love you" and everything afterward is a bit odd. They'll probably be removed in tf-idf (very low count from `get_contraction_count`), but I'll keep the words they probably represent. I'm removing "d'do" and "d'da" because they don't hold any meaning.

In [11]:
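def resolve_contractions_in_corpus(corpus):
    # expand_contraction is called once for every run of word characters/apostrophes
    r = re.sub(r"[\w']+", expand_contraction, corpus)
    return r

# try it out on row 2 (John C. Bogle's "Enough" in the table above)
resolve_contractions_in_corpus(comm_df.at[2,'Corpus'])[:500]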

'here how i recall the wonderful story that sets the theme for my remarks today at a party given by a billionaire on shelter island, the late kurt vonnegut informs his pal, the author joseph heller, that their host, a hedge fund manager, had made more money in a single day than heller had earned from his wildly popular novel catch over its whole history. heller responds, yes, but i have something he will never have . . . enough. \nenough. i was stunned by its simple eloquence, to say nothing of it'
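
The cell above relies on `re.sub` accepting a function as its replacement argument: the function receives each match object and returns the text to substitute. Below is a minimal, self-contained sketch of that pattern. The two-entry `mini_lookup` and the `expand` helper are hypothetical stand-ins for the notebook's much larger dictionary and rule set, kept small so the flow is easy to trace.

```python
import re

# hypothetical mini lookup, far smaller than the notebook's dictionary
mini_lookup = {"i'm": "i am", "'cause": "because"}

def expand(match):
    # re.sub passes a match object; group(0) is the matched token
    word = match.group(0)
    if word in mini_lookup:
        return mini_lookup[word]
    if word.endswith("n't"):
        return word[:-3] + " not"
    # anything else is returned unchanged
    return word

sample = "i'm sure it didn't rain 'cause the ground is dry"
expanded = re.sub(r"[\w']+", expand, sample)
print(expanded)
```

Because only the matched tokens are rewritten, the spaces and newlines between them pass through untouched.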

This looks clean, so we'll apply this function to the 'Corpus' column of our DataFrame. Song rows, by comparison, show a lot of repetition, which is true of most music. We'll keep stop words and words shorter than 3 characters for now, and save a copy of this dataset with clean text and punctuation. They will be removed below when we clean the data to get a dataset better suited to Bag of Words (unordered) techniques.

In [12]:
comm_df['Corpus'] = comm_df['Corpus'].apply(resolve_contractions_in_corpus)
comm_df.to_excel('communication_data_clean.xlsx',index=False)
comm_df

Unnamed: 0,Originator,Title,Corpus,Link,Text Type
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,i am a storyteller. and i would like to tell y...,https://jamesclear.com/great-speeches/the-dang...,speech
1,Jeff Bezos,What Matters More Than Your Talents,"as a kid, i spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...,speech
2,John C. Bogle,Enough,here how i recall the wonderful story that set...,https://jamesclear.com/great-speeches/enough-b...,speech
3,Brené Brown,The Anatomy of Trust,"oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...,speech
4,John Cleese,Creativity in Management,"you know, when video arts asked me if i would ...",https://jamesclear.com/great-speeches/creativi...,speech
5,William Deresiewicz,Solitude and Leadership,my title must seem like a contradiction. what ...,https://jamesclear.com/great-speeches/solitude...,speech
6,Richard Feynman,Seeking New Laws,what i want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...,speech
7,Neil Gaiman,Make Good Art,i never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...,speech
8,John W. Gardner,Personal Renewal,i am going to talk about self renewal. one of ...,https://jamesclear.com/great-speeches/personal...,speech
9,Elizabeth Gilbert,Your Elusive Creative Genius,i am a writer. writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...,speech


## Data Cleaning

Stemming and lemmatization are important for NLP when looking at word counts, topics and other applications that don't care about the right usage of verbs or complete words. This means these techniques should not be used if you are attempting text synthesis, question answering or sentiment analysis.

Stemmers use an algorithm to get the stem/root of given words. This way, if you have variations on a word like "walk" such as "walked" and "walking", they would all be stemmed to "walk" and we'd have an appropriate count of the use of the verb in our corpus. The SnowballStemmer is preferred for achieving the best practical results, but if you're curious about the different popular types and how they differ, it is explained in more detail below. 
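
To make the idea of rule-based stemming concrete, here is a deliberately tiny sketch. `toy_stem` is a made-up helper that strips one of three suffixes; it is not the Porter, Snowball or Lancaster algorithm, which apply many more rules and conditions.

```python
from collections import Counter

def toy_stem(word):
    # strip the first matching suffix, but keep at least 3 letters of stem
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walk", "walked", "walking", "walks"]
stems = [toy_stem(w) for w in words]
print(stems)
# all four variations are now counted under a single entry
print(Counter(stems))
```

Even this toy shows the main weakness of stemming: the output does not have to be a real word. For instance, `toy_stem("studies")` gives `"studie"`.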

Lemmatizers work toward the same goal as stemmers, except the approach is different. In NLP, a lemma is the dictionary (base) form of a word, so a lemmatizer uses a vocabulary and the word's part of speech to map each inflected form back to that entry instead of chopping off suffixes. A lemma doesn't always need to be different from the input word. For example, "makes", "making" and "made" would all be lemmatized to "make", and "better" (as an adjective) to "good". Note that this does not merge synonyms: "create" stays "create". We'll take a more detailed look at how lemmatizers work in practice below.

Now that we also don't care about punctuation and newlines, let's quickly get rid of those.

In [13]:
def remove_special_chars(corpus):
    # replace remaining non-alpha chars with space
    s = re.sub('[^a-z ]', ' ', corpus)
    # remove extra spaces
    s = re.sub(' +', ' ',s)
    return s

comm_df['Corpus'] = comm_df['Corpus'].apply(remove_special_chars)

### Stemming

The three stemmers we'll discuss in this notebook are Porter, Snowball and Lancaster. My information is from [here](https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg) but I'll summarize it below before testing them out on some words. 

**Popular Stemmers:**
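* **Porter:** The original algorithm, published by Martin Porter in 1980. It strips suffixes with a fixed sequence of rules and is the gentlest of the three.
* **Snowball (Porter2):** Porter's own revision of his algorithm. The linked answer describes it as slightly faster than the original and generally the better choice.
* **Lancaster:** The most aggressive of the three. It trims words heavily, so stems can be short and hard to read, and unrelated words are more likely to be merged.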

In [14]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # aka Porter2
from nltk.stem import LancasterStemmer

ps = PorterStemmer()
ss = SnowballStemmer('english')
ls = LancasterStemmer()

# no difference in results
account_ls = ['account','accountant','accountable','accounting','accountability']
# note how Lancaster is more aggressive
organize_ls = ['organ','organize','organized','organization','organizer','organizable','organizational']
# first time we see a difference between Porter and Snowball
stud_ls = ['stud','student','studious','studiously','studiousness','study','studies','studentized','studied','studying']

# choose an above list or make your own
l = stud_ls
print('Porter Stemmer')
print(list(map(ps.stem,l)))
print('\nSnowball Stemmer')
print(list(map(ss.stem,l)))
print('\nLancaster Stemmer')
print(list(map(ls.stem,l)))

Porter Stemmer
['stud', 'student', 'studiou', 'studious', 'studious', 'studi', 'studi', 'student', 'studi', 'studi']

Snowball Stemmer
['stud', 'student', 'studious', 'studious', 'studious', 'studi', 'studi', 'student', 'studi', 'studi']

Lancaster Stemmer
['stud', 'stud', 'study', 'study', 'study', 'study', 'study', 'stud', 'study', 'study']


From what we've seen on the example lists I created, there isn't much difference between the Porter and Snowball stemmers. Since I've seen Snowball used most commonly (and because the SO link says it's usually the best and faster than Porter), this is the one I'll be using for my stemmed version of the corpora. Lancaster seems a bit too indiscriminate. I'm not sure how much faster it is, but for a small dataset like mine any speed-up is definitely not worth the information I'd lose.

I have read that lemmatization usually works better, but that isn't guaranteed, so it is important to at least be aware of stemming even if it doesn't always work best. Since applying stemmers is easy, we won't save a separate stemmed version of the corpora, but we will compare how stemming differs from lemmatization in later notebooks.

### Lemmatization

Lemmatization is the process of replacing each inflected form of a word with its dictionary form (its lemma). This is done so that all the different forms of the same word appear under one entry and can be analysed together. For us, avoiding the manual process is great because it is quite time consuming and requires good knowledge of how words are used. Not to say you can't do it, but it's much better to use the premade language tools found in Python libraries.

NLTK's WordNetLemmatizer is what we'll be using today. There are other lemmatizers in spaCy, TextBlob and probably other libraries, but we've already been using NLTK and it is widely used, so that's what I'll be sticking with.

In [15]:
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import re


def get_lemmas(corpus):
    word_tokens = word_tokenize(corpus)
    lemmatizer = WordNetLemmatizer()
    token_pos_tuple = pos_tag(word_tokens)
    pos_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "R": wordnet.ADV,
                "V": wordnet.VERB}
    token_wn_pos = list(map(lambda p: (p[0],pos_dict.get(p[1][0],wordnet.NOUN)),token_pos_tuple))
    s = ' '.join(list(map(lambda p: lemmatizer.lemmatize(p[0],p[1]),token_wn_pos)))
    return re.sub(r' ([\.\?!,])','\\1',s)

get_lemmas("geese are cute but vicious. wordnet makes my nlp analysis easier and better than it would otherwise be.")
# get_lemmas("geese")
# >>> "geese"
# get_lemmas(<geese in a sentence where it's clear it's a noun>)
# >>> <goose with rest of lemmatized sentence>

'goose be cute but vicious. wordnet make my nlp analysis easy and good than it would otherwise be.'

In [16]:
lemma_df = comm_df.copy()
lemma_df['Corpus'] = lemma_df['Corpus'].apply(get_lemmas)
lemma_df

Unnamed: 0,Originator,Title,Corpus,Link,Text Type
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,i be a storyteller and i would like to tell yo...,https://jamesclear.com/great-speeches/the-dang...,speech
1,Jeff Bezos,What Matters More Than Your Talents,a a kid i spend my summer with my grandparent ...,https://jamesclear.com/great-speeches/what-mat...,speech
2,John C. Bogle,Enough,here how i recall the wonderful story that set...,https://jamesclear.com/great-speeches/enough-b...,speech
3,Brené Brown,The Anatomy of Trust,oh it just feel like an incredible understatem...,https://jamesclear.com/great-speeches/the-anat...,speech
4,John Cleese,Creativity in Management,you know when video art ask me if i would like...,https://jamesclear.com/great-speeches/creativi...,speech
5,William Deresiewicz,Solitude and Leadership,my title must seem like a contradiction what c...,https://jamesclear.com/great-speeches/solitude...,speech
6,Richard Feynman,Seeking New Laws,what i want to talk to you about tonight be st...,https://jamesclear.com/great-speeches/seeking-...,speech
7,Neil Gaiman,Make Good Art,i never really expect to find myself give advi...,https://jamesclear.com/great-speeches/make-goo...,speech
8,John W. Gardner,Personal Renewal,i be go to talk about self renewal one of your...,https://jamesclear.com/great-speeches/personal...,speech
9,Elizabeth Gilbert,Your Elusive Creative Genius,i be a writer write book be my profession but ...,https://jamesclear.com/great-speeches/your-elu...,speech


In [17]:
from nltk.corpus import stopwords

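# build the stop word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(corpus):
    return ' '.join([word for word in corpus.split() if word not in stop_words and len(word) > 2])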

comm_df['Corpus'] = comm_df['Corpus'].apply(remove_stopwords)
comm_df['Corpus']

0     storyteller would like tell personal stories l...
1     kid spent summers grandparents ranch texas hel...
2     recall wonderful story sets theme remarks toda...
3     feels like incredible understatement say grate...
4     know video arts asked would like talk creativi...
5     title must seem like contradiction solitude le...
6     want talk tonight strictly speaking character ...
7     never really expected find giving advice peopl...
8     going talk self renewal one fundamental tasks ...
9     writer writing books profession course also gr...
10    several years ago brought face face disturbing...
11    thanks believe thinking giving particular pres...
12    first lecture orientation trying purpose cours...
13    honored today commencement one finest universi...
14    asked talk multidisciplinary approach thinking...
15    president powers provost fenves deans members ...
16    midst sea change tidal wave might accurate med...
17    well doubt many wondering speaker old well

Saving for Bag of Words analysis (where order doesn't matter).

In [18]:
comm_df.to_excel('communication_data_BoW.xlsx',index=False)

### TF-IDF


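TF-IDF stands for Term Frequency-Inverse Document Frequency and is a way to numerically measure how important a word is to one document relative to the rest of the corpus. Term frequency (TF) is the number of times a word appears in a document divided by the number of words in that document. Inverse document frequency (IDF) is the log of the number of documents divided by the number of documents containing that word, so it is a property of the word across the whole corpus and is the same whichever document we look at. The TF-IDF weight is the product of the two.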
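
The definition above can be written out directly. This is a minimal sketch of the textbook formula on three made-up documents (`docs`, `tf`, `idf` and `tf_idf` are illustrative names, not part of the notebook); library implementations such as scikit-learn's use a slightly different variant, so the numbers will not match the vectorizer's output.

```python
import math

# three hypothetical, already-tokenized documents
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird sang".split(),
]

def tf(term, doc):
    # share of the document's words that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    # assumes the term occurs in at least one document
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print("the:", tf_idf("the", docs[0], docs))
print("cat:", tf_idf("cat", docs[0], docs))
```

Under this definition a word that appears in every document, like "the" here, gets an IDF of log(1) = 0 and therefore a weight of zero, while "cat", which is unique to the first document, gets a positive weight.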

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

ti_vectorizer = TfidfVectorizer()
X = ti_vectorizer.fit_transform(lemma_df['Corpus'])

In [20]:
# index of the row to base tf off of
i=0

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
first_doc_df = pd.DataFrame(X[i].T.todense(), index=ti_vectorizer.get_feature_names_out(), columns=["TF-IDF"])
first_doc_df = first_doc_df.sort_values('TF-IDF', ascending=False)
first_doc_df.head(20)
# first_doc_df.tail(20)

Unnamed: 0,TF-IDF
be,0.380809
the,0.330233
to,0.258831
of,0.255856
and,0.249906
story,0.227415
that,0.21123
have,0.190404
in,0.166604
african,0.164908


I'm not sure about anyone reading this, but this isn't exactly the kind of result I expected. Much of the code in the block above is standard for calculating tf-idf, since the TfidfVectorizer was built to be used in a certain way. The index `i` selects which row we want to inspect, since tf is a document-specific metric (I'll expand on this later). Consequently, when we look at the first row, we see many words relating to Nigeria and stories, since Chimamanda Ngozi Adichie is a Nigerian writer telling her story (which involves reading stories). From there you understand why those words have high tf-idf weights.

So why do some of our stop words have such high tf-idfs? Shouldn't they be common to many documents and thus have low idf? Great question, I'll be looking into that... 
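
One likely explanation, offered as a sketch rather than a confirmed diagnosis: `lemma_df` was copied from `comm_df` before the stop words were removed, and `TfidfVectorizer` by default uses raw term counts together with a smoothed IDF, ln((1 + n) / (1 + df)) + 1, which never drops to zero. A word found in every document therefore keeps an IDF of exactly 1, and its large count does the rest. The snippet below reproduces that formula by hand with the standard library only; the document count and the term counts are made-up numbers chosen for illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # formula documented for TfidfVectorizer when smooth_idf=True (the default)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 20                            # hypothetical corpus size
idf_common = smoothed_idf(n_docs, 20)  # word present in every document
idf_rare = smoothed_idf(n_docs, 1)     # word present in a single document

# hypothetical raw counts within one document
weight_common = 300 * idf_common
weight_rare = 10 * idf_rare

print(idf_common, idf_rare)
print(weight_common > weight_rare)
```

If that is indeed the cause, fitting the vectorizer on the stop-word-free corpus, or passing `stop_words='english'` to `TfidfVectorizer`, should push words like "be" and "the" out of the top of the list.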