# VIK Exploratory Data Analysis

## Introduction

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each comedian:

1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - look number of unique words and also how quickly someone speaks
    - Probably won't use speed because does not apply to written text
3. **Amount of profanity** - most common terms
    - Find alternative set of words of interest

## Most Common Words

### Analysis

In [1]:
#### Read in the document-term matrix
import pandas as pd

books_dtm = pd.read_pickle('pickle/books_dtm.pkl') # Read previously pickled document-term matrix into a dataframe
books_dtm.head()
books_tdm = books_dtm.transpose() # Convert this document-term matrix to a term-document matrix - i.e., transpose
books_tdm.head() # Display only the first few rows - i.e., the first few unique words or vocabulary

Unnamed: 0,Hell,The Creative Process in the Individual,Select Speeches of Kossuth,The Apricot Tree,The King James Bible,King Richard III,"The Mirror of Literature, Amusement, and Instruction",The Lure of San Francisco,Robbery Under Arms,Alice's Adventures in Wonderland,...,My Life and Work,"Ernest Maltravers, Book 9",David Copperfield,Agnes Grey,"The Bible, King James version, Book 21: Ecclesiastes","The Bible, Douay-Rheims, Book 69: 1 John","Lyrical Ballads with Other Poems, 1800, Vol. 2","Samantha Among the Brethren, Part 4.",Grace Harlowe's Return to Overton Campus,Q and the Magic of Grammar
aa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aam,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaron,0,0,0,0,319,95,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaronites,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aarons,0,0,0,0,31,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Basic dataframe retrieval syntax
- #### new dataframe = olddataframe.set_index("ColumnName", drop = False) 
- #### By default, first column is index; however, can set any column to index

In [2]:
#### Show preset subset of first five rows
books_tdm.head() 

Unnamed: 0,Hell,The Creative Process in the Individual,Select Speeches of Kossuth,The Apricot Tree,The King James Bible,King Richard III,"The Mirror of Literature, Amusement, and Instruction",The Lure of San Francisco,Robbery Under Arms,Alice's Adventures in Wonderland,...,My Life and Work,"Ernest Maltravers, Book 9",David Copperfield,Agnes Grey,"The Bible, King James version, Book 21: Ecclesiastes","The Bible, Douay-Rheims, Book 69: 1 John","Lyrical Ballads with Other Poems, 1800, Vol. 2","Samantha Among the Brethren, Part 4.",Grace Harlowe's Return to Overton Campus,Q and the Magic of Grammar
aa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aam,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaron,0,0,0,0,319,95,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaronites,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aarons,0,0,0,0,31,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
#### Show required subset of rows - indexed to 0, and exclusive of range end
books_tdm[0:10] 

Unnamed: 0,Hell,The Creative Process in the Individual,Select Speeches of Kossuth,The Apricot Tree,The King James Bible,King Richard III,"The Mirror of Literature, Amusement, and Instruction",The Lure of San Francisco,Robbery Under Arms,Alice's Adventures in Wonderland,...,My Life and Work,"Ernest Maltravers, Book 9",David Copperfield,Agnes Grey,"The Bible, King James version, Book 21: Ecclesiastes","The Bible, Douay-Rheims, Book 69: 1 John","Lyrical Ballads with Other Poems, 1800, Vol. 2","Samantha Among the Brethren, Part 4.",Grace Harlowe's Return to Overton Campus,Q and the Magic of Grammar
aa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aam,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaron,0,0,0,0,319,95,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaronites,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aarons,0,0,0,0,31,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ab,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
aback,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
abaddon,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abagtha,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abana,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
#### Show an entire row (term in TDM) that meets certain text criterion, with frequency count
try:
    print(books_tdm.loc["alice", :].sort_values(ascending=False)) 
except:
    print("\nThat text does not exist\n")

Through the Looking-Glass                                      434
Alice's Adventures in Wonderland                               386
A Damsel in Distress                                            39
The Young Step-Mother                                            3
Agnes Grey                                                       1
Ernest Maltravers, Book 9                                        1
Grace Harlowe's Return to Overton Campus                         1
The Apricot Tree                                                 0
The Creative Process in the Individual                           0
The Song Of Hiawatha                                             0
The Federalist Papers                                            0
The Book Of Mormon                                               0
Facino Cane                                                      0
Peter Pan                                                        0
Household Gods                                                

In [5]:
#### Find the top 50 words written by each author
top50words_dict = {}
for eachbook in books_tdm.columns:
    top50words = books_tdm[eachbook].sort_values(ascending=False).head(50)
    top50words_dict[eachbook] = list(zip(top50words.index, top50words.values))



In [6]:
# Print the top 25 words written by each author
for book, top50words in top50words_dict.items():
    print(book)
    print("Top 25 words: " + ', '.join([word for word, count in top50words[0:24]]))
    print('---')

Hell
Top 25 words: thou, thy, thee, guide, spake, hath, shall, round, way, art, words, forth, o, came, oer, like, spirit, turnd, saw, said, th, een, cried, come
---
The Creative Process in the Individual
Top 25 words: spirit, power, life, principle, law, creative, individual, self, mind, personality, nature, process, universal, contemplation, divine, consciousness, action, conditions, recognition, thought, originating, god, new, standard
---
Select Speeches of Kossuth
Top 25 words: hungary, people, nations, states, nation, united, freedom, europe, country, great, government, power, world, cause, liberty, russia, man, independence, free, law, gentlemen, america, principle, men
---
The Apricot Tree
Top 25 words: ned, tom, grandmother, said, good, tree, did, apricots, day, think, told, work, great, apricot, home, going, read, tell, say, evening, garden, know, bible, like
---
The King James Bible
Top 25 words: shall, unto, lord, thou, thy, god, said, ye, thee, man, israel, son, hath, king,

**NOTE:** At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.



In [7]:
# Look at the most common words across *all* books --> add them to a revised stop word list
from collections import Counter

# Let's first pull out the top 50 words for each book, across all books (referencing the above dictionary)
top50words_all = []
for eachbook in books_tdm.columns:
    top = [word for (word, count) in top50words_dict[eachbook]]
    for eachword in top:
        top50words_all.append(eachword)
        
top50words_all[0:10] # This is a list of exactly 2,550 items, because we're simply combining the top 50 words for each of 51 books (50 x 51)

['thou',
 'thy',
 'thee',
 'guide',
 'spake',
 'hath',
 'shall',
 'round',
 'way',
 'art']

In [8]:
# Aggregate above list by book frequency, to identify the 25 (or any other top-n) most common words *across all books*, showing how many books contain each word
Counter(top50words_all).most_common(25)

[('like', 41),
 ('time', 39),
 ('said', 38),
 ('man', 37),
 ('did', 36),
 ('know', 36),
 ('come', 35),
 ('little', 32),
 ('great', 30),
 ('long', 28),
 ('good', 28),
 ('way', 27),
 ('came', 27),
 ('say', 27),
 ('day', 26),
 ('shall', 25),
 ('thought', 23),
 ('just', 22),
 ('life', 20),
 ('think', 20),
 ('away', 20),
 ('make', 20),
 ('went', 20),
 ('hand', 19),
 ('old', 18)]

In [9]:
# If more than half the books (out of 51 books) have any of these 25 most common words, exclude that word - i.e., treat it as a stop word
books_stopwords_addl = [word for word, count in Counter(top50words_all).most_common() if count > (len(books_dtm)/2)]
books_stopwords_addl

['like',
 'time',
 'said',
 'man',
 'did',
 'know',
 'come',
 'little',
 'great',
 'long',
 'good',
 'way',
 'came',
 'say',
 'day']

In [10]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

# Read in cleaned data
books_df_clean = pd.read_pickle('pickle/books_clean_corpus.pkl')

## First, make custom stopwords list from previously edited built-in list
books_stopwords = pd.read_csv('/Users/vix/Repos/Python-Learning/src/NLP/Intro to NLP Alice Zhao/books_stopwords.csv', header=None, encoding='utf-8')
books_stopwords = list(np.squeeze(books_stopwords.values))
books_stopwords = frozenset(books_stopwords)

# Add new stop words
books_stopwords_rev = books_stopwords.union(books_stopwords_addl) # Adding our new stop words to existing stop words

# Recreate document-term matrix

cv = CountVectorizer(stop_words=frozenset(books_stopwords_rev), strip_accents='ascii', token_pattern=r'(?u)\b\w+\b', ngram_range=(1,1))
books_cv = cv.fit_transform(books_df_clean.book_text)

books_dtm_rev1 = pd.DataFrame(books_cv.toarray(), columns=cv.get_feature_names())
books_dtm_rev1.index = books_df_clean.index

# Pickle it for later use
import pickle
pickle.dump(cv, open("pickle/books_cv_rev1.pkl", "wb"))
books_dtm_rev1.to_pickle("pickle/books_dtm_rev1.pkl")

In [11]:
### Check if certain term exists in vocabulary

try:
    print(books_dtm_rev1.loc[:, "q"].sort_values(ascending=False)) 
except:
    print("\nThat text does not exist\n")

Q and the Magic of Grammar                                     618
David Copperfield                                                1
The Young Step-Mother                                            1
Sketches in Lavender, Blue and Green                             1
The Apricot Tree                                                 0
Through the Looking-Glass                                        0
Paradise Lost                                                    0
The Song Of Hiawatha                                             0
The Federalist Papers                                            0
The Book Of Mormon                                               0
Facino Cane                                                      0
Peter Pan                                                        0
Household Gods                                                   0
The Hunting of the Snark                                         0
Paul Kelver                                                   

In [12]:
### Display revised list of stopwords and save to file
import csv
books_stopwords_rev

with open("/Users/vix/Repos/Python-Learning/src/NLP/Intro to NLP Alice Zhao/books_stopwords_rev.csv", 'w', newline='') as csvfile:
    stopwords = csv.writer(csvfile, delimiter='\n')
    stopwords.writerow(cv.get_stop_words())


In [None]:
### Understand WordCloud options with simple example

import os
os.chdir('/Users/vix/Repos/Python-Learning/src/NLP/Intro to NLP Alice Zhao/')

from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plot

mask = np.array(Image.open('books_mask.png'))

test_wordcloud = WordCloud(width=600, height=400, stopwords='', background_color="whitesmoke", colormap="tab20b", random_state=1, collocations=False, regexp=r"[\w']+", mask=mask)

plot.rcParams['figure.figsize'] = [3, 2]
plot.rcParams['figure.dpi'] = 300
test_wordcloud.generate("q and p and the magic of grammar are all here to learn the a b c d and e of grammar rules 1 2 3")
plot.imshow(test_wordcloud)
plot.axis(False)

### plot.savefig("Test_WordCloud.png")
plot.show()


test_wordcloud.words_ ### Word frequencies
### test_wordcloud.layout_

In [None]:
### Let's make some word clouds!
### Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
import os
os.chdir('/Users/vix/Repos/Python-Learning/src/NLP/Intro to NLP Alice Zhao')

from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plot

# mask = np.array(Image.open('mask_peterpan.png'))

books_wordcloud = WordCloud(stopwords=books_stopwords_rev, collocations=False, background_color="whitesmoke", colormap="tab20b", random_state=42, regexp=r"[\w']+")

In [None]:
# Reset the output dimensions
import matplotlib.pyplot as plot
from textwrap import wrap

plot.rcParams['figure.figsize'] = [50, 28] ### For several items in grid
# plot.rcParams['figure.figsize'] = [6, 6] ### For single item
plot.rcParams['figure.dpi'] = 300

book_author = books_df_clean.index.tolist() # Useful code to take a dataframe column and make it a list - in this case, the column in queston is the index

# Create subplots for each book
for each, book in enumerate(books_tdm.columns): # Dataframe whose columns represent each author (ie, document)
    # if book == "Peter Pan": ### Use (with indent) for single item
    books_wordcloud.generate(books_df_clean.book_text[book]) # Invokes actual WordCloud function, which incorporates stop words set above
    
    plot.tight_layout(pad=1)
    plot.subplot(6, 9, each+1) ### Ignore if plotting single item
    plot.imshow(books_wordcloud, interpolation="bilinear")
    plot.axis("off")
    plot.title('\n'.join(wrap(book_author[each],30)), fontsize=15)
            
plot.savefig("Books_WordCloud_Rev.png", dpi='figure', facecolor='whitesmoke') ### Remember to update filename as needed to avoid overwriting
plot.show()

### Findings

* These 50 books are reasonably distinct in terms of their respective most frequent words.

## Number of Words

### Analysis

In [None]:
# Find the number of unique words in each book - i.e., the book's vocabulary

# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
vocabulary_list = []
for eachbook in books_tdm.columns:
    vocabulary = books_tdm[eachbook].to_numpy().nonzero()[0].size
    vocabulary_list.append(vocabulary)

# Create a new dataframe that contains this unique word count
vocabulary_df = pd.DataFrame(list(zip(book_author, vocabulary_list)), columns=['Book Title', 'Vocabulary'])
vocabulary_df_sort = vocabulary_df.sort_values(by='Vocabulary',ascending=False)
pd.options.display.float_format = '${:, 0f}'.format
vocabulary_df_sort

In [None]:
# Calculate the words per minute of each comedian

# Find the total number of words that a comedian uses
total_list = []
for comedian in data.columns:
    totals = sum(data[comedian])
    total_list.append(totals)
    
# Comedy special run times from IMDB, in minutes
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort

In [None]:
# Let's plot our findings
import numpy as np

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)

plt.tight_layout()
plt.show()

### Findings

* **Vocabulary**
   * Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy
   * Louis C.K. (self-depricating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary


* **Talking Speed**
   * Joe Rogan (blue comedy) and Bill Burr (podcast host) talk fast
   * Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow
   
Ali Wong is somewhere in the middle in both cases. Nothing too interesting here.

## Amount of Profanity

### Analysis

In [None]:
# Earlier I said we'd revisit profanity. Let's take a look at the most common words again.
Counter(words).most_common()

In [None]:
# Let's isolate just these bad words
data_bad_words = data.transpose()[['fucking', 'fuck', 'shit']]
data_profanity = pd.concat([data_bad_words.fucking + data_bad_words.fuck, data_bad_words.shit], axis=1)
data_profanity.columns = ['f_word', 's_word']
data_profanity

In [None]:
# Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]

for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)
    plt.xlim(-5, 155) 
    
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)

plt.show()

### Findings

* **Averaging 2 F-Bombs Per Minute!** - I don't like too much swearing, especially the f-word, which is probably why I've never heard of Bill Bur, Joe Rogan and Jim Jefferies.
* **Clean Humor** - It looks like profanity might be a good predictor of the type of comedy I like. Besides Ali Wong, my two other favorite comedians in this group are John Mulaney and Mike Birbiglia.

## Side Note

What was our goal for the EDA portion of our journey? **To be able to take an initial look at our data and see if the results of some basic analysis made sense.**

My conclusion - yes, it does, for a first pass. There are definitely some things that could be better cleaned up, such as adding more stop words or including bi-grams. But we can save that for another day. The results, especially the profanity findings, are interesting and make general sense, so we're going to move on.

As a reminder, the data science process is an interative one. It's better to see some non-perfect but acceptable results to help you quickly decide whether your project is a dud or not, instead of having analysis paralysis and never delivering anything.

**Alice's data science (and life) motto: Let go of perfectionism!**

## Additional Exercises

1. What other word counts do you think would be interesting to compare instead of the f-word and s-word? Create a scatter plot comparing them.

### Adding new markdown cell