# Natural Language Processing

## Examples of sentiment analysis, topic analysis, and text generation

1. Define question - What makes one comedian's material different from another's?
2. Get and clean data
3. Perform exploratory data analysis
4. Apply analysis techniques
    + sentiment analysis
    + topic analysis
    + text generation
5. Share findings

# 1. Define question - What makes one comedian's material different from another's?

Use methods such as sentiment analysis to see how comedians differ in their material.

In [None]:
from bs4 import BeautifulSoup
import os
import re
import spacy
import requests
import pickle

# 2. Data collection

    1. Define the scope of the data to be used (what and how much)
    2. Define where you can get this data
    3. Have a plan for storage

## 2.1 Gather data by way of web scraping - Beautiful soup

In [None]:
# def url_to_transcript(url):
#     '''
#     Returns transcript data from the target website, scrapsfromtheloft.com.
#     Content is taken from the "post-content" class.
#     '''
#     page = requests.get(url).text
#     soup = BeautifulSoup(page, 'lxml')
#     text = [p.text for p in soup.find(class_="post-content").find_all('p')]
#     print(url)
#     return text


# # URLs of transcripts in scope
# urls = [
#     'http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
#     'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
#     'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
#     'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
#     'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/'
# ]

# # Use comedian's short names, in order with the URLs listed, as keys to their respective content.
# comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim',
#              'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']

# # request transcripts (takes a few minutes to run)
# transcripts = [url_to_transcript(url) for url in urls]

# # Save the work as a pickled file for use later
# !mkdir transcripts
# for index, comedian in enumerate(comedians):
#     with open('transcripts/' + comedian + '.txt', 'wb') as file:
#         pickle.dump(transcripts[index], file)

## 2.2 Clean the data

1. Get the corpus, i.e. collect all the data into a pandas dataframe with the comedian in one column and their material in the next.
2. Create a document-term matrix:
    + clean text - remove unnecessary parts of the text, punctuation, etc.
    + tokenize the text - break the text into units (usually words) that a machine can count
    + create the document-term matrix - put the documents into a numeric form the machine can work with

In [None]:
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim',
             'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']
# Load pickled files and check that the data has been recovered
data = {} # use a dictionary with the comedians' names as keys and the transcripts as values
for comedian in comedians:
    with open('transcripts/' + comedian + '.txt', 'rb') as file:
        data[comedian] = pickle.load(file)
# check key, names
# print('{}\n'.format(data.keys()))

# # check value, text. You might notice that there are non-ascii values in the corpus
# print('louis:\t{}\n'.format(data['louis'][:2]))

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There are always exceptions to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve the results.

Common data cleaning steps on all text:

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (e.g. `\n`)
* Tokenize text (break a block of text into sentences or words)
* Remove stop words

More data cleaning steps after tokenization (to be done later):

* Stemming / Lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos and misspelled words
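
As a rough illustration of the common steps listed above, plus bi-gram creation, here is a minimal standard-library sketch. The tiny stop-word set and the helper names `simple_tokens` and `bigrams` are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set for illustration only
TOY_STOP_WORDS = {"a", "the", "is", "and", "of"}

def simple_tokens(text):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\d+", "", text)
    return [word for word in text.split() if word not in TOY_STOP_WORDS]

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = simple_tokens("The crowd is LOUD, and the jokes of 2017 land!")
pairs = bigrams(tokens)
print(tokens)
print(pairs)
```

Bi-grams built this way keep a little word order, which a plain bag of words throws away.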

In [None]:
next(iter(data.keys()))

In [None]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

In [None]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''
    Takes a list of text and combines them into one large chunk of text.
    '''
    # Combine all lines of a comedian's material into one line of text.
    combined_text = ' '.join(list_of_text)
    return combined_text

# create a data object, a dictionary, with the name as key and the combined text as the value.
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

Put the data into a dataframe for easier data manipulation

In [None]:
import pandas as pd
# Limit the number of characters shown per column, to avoid word-wrap
pd.set_option('display.max_colwidth', 125)

# Transpose the data so each comedian is a row (the index) and the text sits in a single column
data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df


In [None]:
# Let's take a look at the transcript for Ali Wong
data_df.transcript.loc['ali']

## 2.3 Tokenize the data - sentence, fragment, word

### Apply a first round of text cleaning techniques

In [None]:
import pandas as pd
import re
import string

# data_df.head()

def clean_text(text):
    '''
    Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.
    '''
    text = text.lower() # make text lower case
    text = re.sub(r'\[.*?\]', '', text) # remove square brackets and the content within them
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove the punctuation marks (string.punctuation)
    text = re.sub(r'\w*\d\w*', '', text) # remove any words with numbers in them
    text = re.sub('[‘’“”…]', '', text) # remove non-ascii quotes, apostrophes and ellipses
    text = re.sub(r'\n', ' ', text) # replace line breaks with a space so adjacent words do not merge
    text = re.sub(r'\s{2,}', ' ', text) # collapse runs of whitespace into a single space
    return text

#==============================================================================
# Make use of the apply method. Identify the series in the dataframe,
# followed by the function you want to apply to that series, and pass the name
# of the function, without its own parentheses, inside the parentheses of the
# apply method. This calls the referenced function on each element in the
# series. In this case data_df.transcript is the series submitted, and each
# element in that series is replaced by the result of the function.
data_df.transcript = data_df.transcript.apply(clean_text)

# Let's take a look at the updated text
data_df.transcript


### Save the work

In [None]:
#==============================================================================
# Let's add the comedians' full names as well
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 
    'Dave Chappelle', 'Hasan Minhaj','Jim Jefferies', 'Joe Rogan', 
    'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']

# add the full names of the comedians
data_df['full_name'] = full_names
data_df

In [None]:
#==============================================================================
# Let's pickle it for later use
data_df.to_pickle('data_df.pkl')

### Load the work

In [None]:
#==============================================================================
# Load the dataframe back from the pickle file.
import pandas as pd

data_df = pd.read_pickle('data_df.pkl')
# data_df

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be 
tokenized, meaning broken down into smaller pieces. The most common tokenization 
technique is to break down text into words. We can do this using scikit-learn's 
CountVectorizer, where every row will represent a different document and every 
column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common 
words that add no additional meaning to text such as 'a', 'the', etc.
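
To make the row-per-document, column-per-word layout concrete, here is a minimal standard-library sketch of a document-term matrix. It only illustrates the idea and is not how CountVectorizer is implemented; the two toy documents are made up for the example.

```python
from collections import Counter

# Hypothetical toy documents standing in for transcripts
docs = {
    "doc_a": "dogs chase cats",
    "doc_b": "cats chase mice and cats nap",
}

# Vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for text in docs.values() for word in text.split()})

# One row per document, one column per vocabulary word; each cell holds a count
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Most cells are zero even in this tiny case, which is why a real document-term matrix is usually stored in a sparse format.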

In [None]:
#==============================================================================
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer that removes English stop words.
# We will add a list of special stop words later.
cv = CountVectorizer(stop_words='english')

# Process the transcripts, removing the stop words, and put the counts into a matrix.
data_cv = cv.fit_transform(data_df.transcript)

#==============================================================================
# Save this spot for lemmatizing and stemming words...
#==============================================================================

# Convert the sparse matrix into a dataframe with one column per term
# (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_df.index
data_dtm

In [None]:
data_dtm.to_pickle("./data_dtm.pkl")

# 3. Exploratory Data Analysis

Use the corpus and the document-term matrix to perform EDA to figure out the main trends in the data and see if it makes sense.

* Data: determine the format of the raw data you will need to begin analysis
* Aggregate: Figure out how to aggregate the data
* Visualize: find the best way to visualize the data 
* Insights: Extract some key insights from the visualizations

Use frequency, or word counts, to see what topic or subject seems to keep being mentioned.

* top word(s)
* vocabulary
* jargon or specialty words (differentiating it from other texts)

### Top words

Run a tally of the top words for each comedian. You can visualize the text as a word cloud, bar plot, scatter plot, etc. to see how one comedian compares with another.

In [None]:
import pandas as pd 
# Retrieve the pickled document-term matrix
data = pd.read_pickle('data_dtm.pkl')
#==============================================================================
# Transpose the dataframe so that each comedian is a column and each word is a
# row. Tallying the words used by each comedian is easier when the counts are
# in a column.
data = data.transpose()
data.head()

In [None]:
# Find the top 30 words said by each comedian
top_dict = {}
for comedian in data.columns:
    top = data[comedian].sort_values(ascending=False).head(30)
    top_dict[comedian]= list(zip(top.index, top.values))

top_dict

In [None]:
# Print the top 15 words said by each comedian
for comedian, top_words in top_dict.items():
    print(comedian)
    print(', '.join([word for word, count in top_words[:15]]))
    print('---')

NOTE: At this point, we could go on and create word clouds. However, by looking
at these top words, you can see that some of them have very little meaning and 
could be added to a stop words list, so let's do just that.

In [None]:
#==============================================================================
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each comedian
words = []
for comedian in data.columns:
    top = [word for (word, count) in top_dict[comedian]]
    for t in top:
        words.append(t)
# words

In [None]:
# Let's aggregate this list and identify the most common words along with how many routines they occur in
# Counter(words).most_common()

In [None]:
# If more than half of the comedians have it as a top word, add it to the stop word list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]
add_stop_words

In [None]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data (saved earlier in this notebook as 'data_df.pkl')
data_clean = pd.read_pickle('data_df.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
# (newer scikit-learn releases validate stop_words and expect a list, not a frozenset)
cv = CountVectorizer(stop_words=list(stop_words))
data_cv = cv.fit_transform(data_clean.transcript)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
with open("cv_stop.pkl", "wb") as file:
    pickle.dump(cv, file)
data_stop.to_pickle("dtm_stop.pkl")

In [None]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

# wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2", max_font_size=150, random_state=42)
wc = WordCloud(stopwords=stop_words, background_color="black", max_font_size=150, random_state=42)

In [None]:
# Reset the output dimensions
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [16, 6]

full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']

# Create subplots for each comedian
for index, comedian in enumerate(data.columns):
    wc.generate(data_clean.transcript[comedian])
    
    plt.subplot(3, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_names[index])
    
plt.show()

## Put data into a matrix for analysis

In [None]:
data.head()

In [None]:
import numpy as np 

# Find the number of unique words that each comedian uses.
# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once.
unique_list = []
for comedian in data.columns:
    uniques = data[comedian].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort

In [None]:
# Calculate the words per minute of each comedian

# Find the total number of words that a comedian uses
total_list = []
for comedian in data.columns:
    totals = sum(data[comedian])
    total_list.append(totals)
    
# Comedy special run times from IMDB, in minutes
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort

In [None]:
# Let's plot our findings
import numpy as np

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)

plt.tight_layout()
plt.show()

* **Vocabulary**
   * Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy
   * Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary


* **Talking Speed**
   * Joe Rogan (blue comedy) and Bill Burr (podcast host) talk fast
   * Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow
   
Ali Wong is somewhere in the middle in both cases. Nothing too interesting here.

## Amount of Profanity

Treat profanity as a distraction: the f-word can stand in for almost any part of speech and often works as a filler word. Profanity is also idiomatic, so raw counts miss meaning: 'that was shit' means the opposite of 'that was the shit'.

In [None]:
# Profanity showed up among the most common words. Let's take a look at them again.
# Counter(words).most_common()

In [None]:
data[:1]

In [None]:
# Let's isolate just these 'blue' words and see how frequently they are used by a given comedian.
data_blue_words = data.transpose()[['fucking', 'fuck', 'shit', 'nigga']]
data_profanity = pd.concat([data_blue_words.fucking + data_blue_words.fuck, data_blue_words.shit, data_blue_words.nigga], axis=1)
data_profanity.columns = ['f_word', 's_word', 'n_word']
data_profanity

In [None]:
# Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]

for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)
    plt.xlim(-5, 155) 
    
plt.title('Number of Blue Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)

plt.show()

Here is a 3D plot of the blue words used by the comedians.

In [None]:
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')
# ax = plt.axes(projection='3d')

# Plot one point per comedian in three dimensions
for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    z = data_profanity.n_word.loc[comedian]
    ax.scatter(x, y, z)

ax.set_title('Blue Words')
ax.set_xlabel('F Bombs')
ax.set_ylabel('S Word')
ax.set_zlabel('N Word')

plt.show()

## Who Talks About Family

In [None]:
# Let's isolate just the words that relate to family and see how frequently they are used by a given comedian.
family_words = ['dad', 'mom', 'kids', 'husband', 'wife', 'grandma']
data_family = data.transpose()[family_words]
data_family

What does that look like?

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 8]

for i, comedian in enumerate(data_family.index):
    x = data_family.dad.loc[comedian]
    y = data_family.mom.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)

# Fit the axis limits to the data, leaving a margin so the labels stay visible
plt.xlim(-1, data_family.dad.max() + 10)
plt.ylim(-1, data_family.mom.max() + 5)
plt.title('Number of References to Parents in a Routine', fontsize=20)
plt.xlabel('Number of Dads', fontsize=15)
plt.ylabel('Number of Moms', fontsize=15)

plt.show()

In [None]:
import matplotlib.pyplot as plt
plt.plot(range(5))
end = 5
plt.plot(range(end))

# plt.xlim(-5, 5)
plt.xlim(0, end)
# plt.ylim(-5, 5)
plt.ylim(0, end)
plt.show()

In [None]:
# # <test>
# import numpy  

# # Set values of data points
# # Note, this uses a numpy array which is not the same as a list

# # n = 10
# # x,y,z = np.random.normal(0,1,(3, n))
# x = [.1,.2,.3,.4,.5,.6,.7,.8,.3,.4,.5,.6,.7,.8,.1,.2,.6,.7,.3,.8,.1,.2,.4,.5] # define a regular list
# x = numpy.array(x) # make regular list into a numpy list

# y = [.3,.4,.5,.6,.7,.8,.1,.2,.6,.7,.3,.8,.1,.2,.4,.5,.1,.2,.3,.4,.5,.6,.7,.8]
# # y = [3,4,5,6,7,8,1,2,6,7,3,8,1,2,4,5,1,2,3,4,5,6,7,8]
# y = numpy.array(y)

# z = [.6,.7,.3,.8,.1,.2,.4,.5,.1,.2,.3,.4,.5,.6,.7,.8,.3,.4,.5,.6,.7,.8,.1,.2]
# z = numpy.array(z)
# # </test>

In [None]:
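# NOTE: x, y and z are defined in the test cell above; uncomment that cell before running this one.
import numpy
import ipyvolume as ipv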
fig = ipv.figure()
ipv.pylab.xlim(0, 1) # Set limits of x axis.
ipv.pylab.ylim(0, 1) # Set limits of y axis.
ipv.pylab.zlim(0, 1) # Set limits of z axis.

ipv.scatter(x,y,z,marker='sphere')
# scatter = ipv.scatter(x,y,z,marker='sphere') # scatter is an object you can use later
ipv.show()

In [None]:
# Let's isolate just these 'blue' words and see how frequently they are used by a given comedian.
data_blue_words2 = data.transpose()[['fucking', 'fuck', 'shit', 'nigga']]
data_profanity2 = pd.concat([data_blue_words2.fucking + data_blue_words2.fuck, data_blue_words2.shit, data_blue_words2.nigga], axis=1)
data_profanity2.columns = ['f_word', 's_word', 'n_word']
data_profanity2

In [None]:
import ipyvolume as ipv
import numpy

fig2 = ipv.figure()
ipv.pylab.xlim(0, 120) # Set limits of x axis.
ipv.pylab.ylim(0, 120) # Set limits of y axis.
ipv.pylab.zlim(0, 120) # Set limits of z axis.

# Use three separate lists; `f_ = s_ = n_ = []` would bind all three names to the same list
f_, s_, n_ = [], [], []

# Collect one (f, s, n) point per comedian
for comedian in data_profanity2.index:
    f = data_profanity2['f_word'][comedian]
    s = data_profanity2['s_word'][comedian]
    n = data_profanity2['n_word'][comedian]
    print('{}: {}, {}, {}'.format(comedian, f, s, n))
    f_.append(f)
    s_.append(s)
    n_.append(n)

f_ = numpy.array(f_)
s_ = numpy.array(s_)
n_ = numpy.array(n_)

ipv.scatter(f_, s_, n_, marker='sphere')
ipv.show()

# Sentiment Analysis

So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.

When it comes to text data, there are a few popular techniques that we'll be going through starting with sentiment analysis. A few key points to remember with sentiment analysis.

* TextBlob Module: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it is in a sentence. The TextBlob module allows us to take advantage of these labels.
* Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
* Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
* Subjectivity: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.

For more info on how TextBlob implements its sentiment function, see [this write-up](https://planspace.org/20150607-textblob_sentiment/).
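
The averaging idea can be sketched with a made-up lexicon. This is a simplified illustration, not TextBlob's actual algorithm (which also handles things like negation and intensifiers); `TOY_LEXICON`, its scores, and `toy_sentiment` are all assumptions for the example.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity); the values are made up
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "funny": (0.25, 1.0),
}

def toy_sentiment(text):
    """Average the polarity and subjectivity of the words found in the lexicon."""
    scores = [TOY_LEXICON[word] for word in text.lower().split() if word in TOY_LEXICON]
    if not scores:
        return 0.0, 0.0  # no known words: treat the text as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

polarity, subjectivity = toy_sentiment("a great show with a terrible ending")
print(polarity, subjectivity)
```

Because the score is an average, one strongly negative word can cancel a positive one, which helps explain why whole-routine scores tend to sit close to zero.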

Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.

## Sentiment of Routine

In [None]:
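# We'll start by reading in the corpus, which preserves word order.
# NOTE: 'corpus.pkl' is not written by any earlier cell in this notebook. It is expected to
# hold a dataframe with 'transcript' and 'full_name' columns; save one before running this cell.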
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

### First, a few words about textblob

* https://github.com/sloria/TextBlob 
* for full documentation, go to https://textblob.readthedocs.io/.
* Features
   * Noun phrase extraction
   * Part-of-speech tagging
   * Sentiment analysis
   * Classification (Naive Bayes, Decision Tree)
   * Tokenization (splitting text into words and sentences)
   * Word and phrase frequencies
   * Parsing
   * n-grams
   * Word inflection (pluralization and singularization) and lemmatization
   * Spelling correction
   * Add new models or languages through extensions
   * WordNet integration


In [None]:

# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data

In [None]:
# Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for index, comedian in enumerate(data.index):
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10)
    plt.xlim(-.01, .12) 
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

Replot the graph over the full range of polarity (-1 to 1) and subjectivity (0 to 1) to put the differences in perspective. Can the comedians now be thought of as a group?

They appear to form a pretty tight cluster after all.

In [None]:
# Let's plot the results - the bigger picture
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for index, comedian in enumerate(data.index):
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10)
    plt.xlim(-1, 1) 
    plt.ylim(0, 1) 
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

## Longitudinal Sentiment Analysis
Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine.

In [None]:
# Split each routine into 10 parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits into n equal parts, with a default of 10 equal parts.'''

    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)
    
    # Pull out equally sized pieces of text and put it into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list

In [None]:
# Let's take a look at our data again
data

In [None]:
# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)
    
list_pieces

In [None]:
# The list has 12 elements, one for each transcript
len(list_pieces)
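
The pieces in `list_pieces` can then be scored one at a time to trace sentiment through a routine. The sketch below shows that step with a stand-in scorer so it runs without TextBlob; `score_pieces` and `toy_score` are hypothetical names, and in the notebook a function such as `lambda piece: TextBlob(piece).sentiment.polarity` would presumably take the scorer's place.

```python
def score_pieces(pieces, score):
    """Apply a scoring function to every chunk of a routine, in order."""
    return [score(piece) for piece in pieces]

# Hypothetical stand-in scorer: the fraction of words that are "good"
def toy_score(piece):
    words = piece.split()
    return words.count("good") / len(words) if words else 0.0

# Made-up chunks standing in for one comedian's split transcript
pieces = ["good good bad", "bad bad bad", "good bad"]
trajectory = score_pieces(pieces, toy_score)
print(trajectory)
```

Plotting each comedian's trajectory against the chunk number would show where a routine turns more positive or more negative.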

# Topic Modeling

## Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will be covering the steps on how to do **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.

Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, you can try changing up the number of topics, the terms in the document-term matrix, model parameters, or even try a different model.

In [None]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

In [None]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

In [None]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [None]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("cv_stop.pkl", "rb") as file:
    cv = pickle.load(file)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term), we need to specify two other parameters - the number of topics and the number of passes. Let's start the number of topics at 2, see if the results make sense, and increase the number from there.
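
To show what these two inputs look like, here is a minimal pure-Python sketch of a bag-of-words corpus and an id-to-word lookup. It mirrors the shape of the data gensim works with (each document as a list of `(term_id, count)` pairs) but does not use gensim itself; the toy vocabulary and counts are made up.

```python
# Hypothetical toy vocabulary in the same shape as CountVectorizer.vocabulary_
vocabulary = {"cats": 0, "chase": 1, "dogs": 2}

# Term-document counts: one row per term, one column per document
term_document = [
    [1, 2],  # cats
    [1, 0],  # chase
    [1, 0],  # dogs
]

# id -> word lookup, built the same way as id2word in the notebook
id2word = {idx: word for word, idx in vocabulary.items()}

# Bag-of-words corpus: for each document (column), keep (term_id, count) where count > 0
n_docs = len(term_document[0])
bow_corpus = [
    [(term_id, row[doc]) for term_id, row in enumerate(term_document) if row[doc] > 0]
    for doc in range(n_docs)
]

print(id2word)
print(bow_corpus)
```

Terms sit in rows here because the conversion reads each column as one document, which is why the notebook transposes the document-term matrix first.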

In [None]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

In [None]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

In [None]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
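
The filtering step itself is simple once the tokens are tagged. The sketch below works on hand-tagged tokens so it runs without NLTK; the sample sentence and the helper name `keep_by_tag` are assumptions for illustration.

```python
# Hypothetical pre-tagged tokens in the (word, tag) shape that nltk.pos_tag returns
tagged = [("my", "PRP$"), ("funny", "JJ"), ("dad", "NN"), ("tells", "VBZ"),
          ("jokes", "NNS"), ("loudly", "RB")]

def keep_by_tag(tagged_tokens, prefixes):
    """Keep words whose Penn Treebank tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_tag(tagged, ["NN"])
nouns_and_adjectives = keep_by_tag(tagged, ["NN", "JJ"])
print(nouns_only)
print(nouns_and_adjectives)
```

Matching on the tag prefix catches the whole family of a tag, so `NN` also covers plural nouns (`NNS`) and proper nouns (`NNP`).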


In [None]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [None]:
# Read in the cleaned data, before the CountVectorizer step (saved earlier in this notebook as 'data_df.pkl')
data_clean = pd.read_pickle('data_df.pkl')
data_clean

In [None]:
import nltk
# nltk.download()
# Apply the nouns function to the transcripts to filter only on nouns

data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns


In [None]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
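# (newer scikit-learn releases expect stop_words as a list and use get_feature_names_out)
cvn = CountVectorizer(stop_words=list(stop_words))
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())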
data_dtmn.index = data_nouns.index
data_dtmn

In [None]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

## Topic Modeling - Attempt #3 (Nouns and Adjectives)


In [None]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [None]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

In [None]:
# Create a new document-term matrix using only nouns and adjectives.
# max_df=.8 also drops terms that appear in more than 80% of the documents.
cvna = CountVectorizer(stop_words=list(stop_words), max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

In [None]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

## Identify Topics in Each Document

Out of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. So let's rerun it here with more passes to get more fine-tuned topics.

In [None]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

These four topics look pretty decent. Let's settle on these for now.
* Topic 0: mom, parents
* Topic 1: husband, wife
* Topic 2: guns
* Topic 3: profanity

In [None]:
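# Let's take a look at which topic dominates each transcript
corpus_transformed = ldana[corpusna]
# Pick the highest-probability topic per document, since a document can be assigned more than one topic
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))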

For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.
* Topic 0: mom, parents [Anthony, Hasan, Louis, Ricky]
* Topic 1: husband, wife [Ali, John, Mike]
* Topic 2: guns [Bill, Bo, Jim]
* Topic 3: profanity [Dave, Joe]
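
Grouping comedians by topic comes down to picking the most probable topic for each document. A minimal sketch of that step, using made-up topic mixtures in the `[(topic_id, probability), ...]` shape an LDA model yields per document (`doc_topics` and `dominant_topic` are hypothetical names, and the probabilities are invented):

```python
# Hypothetical per-document topic mixtures; the probabilities are made up
doc_topics = {
    "ali": [(1, 0.97), (3, 0.03)],
    "dave": [(3, 0.99)],
}

def dominant_topic(mixture):
    """Return the topic id with the highest probability."""
    return max(mixture, key=lambda pair: pair[1])[0]

assignments = {name: dominant_topic(mix) for name, mix in doc_topics.items()}
print(assignments)
```

Taking the maximum keeps the grouping well defined even when a transcript is split across several topics.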

## Additional Exercises

1. Try further modifying the parameters of the topic models above and see if you can get better topics.
2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics.