<a href="https://colab.research.google.com/github/dhillonarman/standup-comedy-nlp-analysis/blob/main/Data_cleaning_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Getting The Data**

In [None]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="site-content").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/comedy/ronny-chieng-love-to-hate-it-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/']

# Comedian names
comedians = [ 'john', 'ronny', 'ali', 'anthony', 'mike', 'joe']

In [None]:
# Request the transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/
http://scrapsfromtheloft.com/comedy/ronny-chieng-love-to-hate-it-transcript/
http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/
http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/
http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/
http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/


In [None]:
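# Pickle the transcripts for later use

# New directory to hold the pickled files (-p avoids an error if it already exists)
!mkdir -p transcripts

# Note: the files keep a .txt extension but hold pickled (binary) data
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)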

In [None]:
# Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
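
A quick way to convince yourself that the save/load cells above preserve the data is a round trip on a small sample. This is a minimal sketch, not part of the original notebook: the `sample` dictionary and the temporary directory are stand-ins chosen for illustration, and `Path.mkdir(exist_ok=True)` replaces the shell `mkdir`.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for one scraped transcript (a list of paragraphs)
sample = {"ali": ["First paragraph.", "Second paragraph."]}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "transcripts"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists

    # Write one pickle file per comedian
    for name, paragraphs in sample.items():
        with open(out_dir / f"{name}.txt", "wb") as f:
            pickle.dump(paragraphs, f)

    # Read the files back into a fresh dictionary
    restored = {}
    for name in sample:
        with open(out_dir / f"{name}.txt", "rb") as f:
            restored[name] = pickle.load(f)

print(restored == sample)
```

If the round trip works, the restored dictionary compares equal to the original, list structure included.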

In [None]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['john', 'ronny', 'ali', 'anthony', 'mike', 'joe'])

In [None]:
# More checks
data['ali'][:5]

['Ladies and gentlemen, please welcome to the stage: Ali Wong!',
 'Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have to get this shit over with, ’cause I have to pee in, like, ten minutes. But thank you, everybody, so much for coming.',
 'Um… It’s a very exciting day for me. It’s been a very exciting year for me. I turned 33 this year. Yes! Thank you, five people. I appreciate that. Uh, I can tell that I’m getting older, because, now, when I see an 18-year-old girl, my automatic thought… is “Fuck you.” “Fuck you. I don’t even know you, but fuck you!” ‘Cause I’m straight up jealous. I’m jealous, first and foremost, of their metabolism. Because 18-year-old girls, they could just eat like shit, and then they take a shit and have a six-pack, right? They got that-that beautiful inner thigh clearance where they put their feet together and there’s that huge gap here with the light of potential just radiating through.',
 'And then, when they go to sleep, they

**Cleaning The Data**

In [None]:
next(iter(data.keys()))

'john'

In [None]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

['Armed with boyish charm and a sharp wit, the former “SNL” writer John Mulaney offers sly takes on marriage, his beef with babies and the time he met Bill Clinton',
 'All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see you again. Thank you. That was very nice. Thank you. Look, now, you’re a wonderful crowd, but I need you to keep your energy up the entire show, okay? Because… No, no, no. Thank you. Some crowds… some crowds, they have big energy in the beginning and then they run out of places to go. So… I don’t judge those crowds, by the way, okay? We’ve all gone too big too fast and then run out of room. We’ve all made a “Happy Birthday” sign… Wait. You get that poster board up, and you’re like, “I don’t need to trace it. I know how big letters should be. To begin with, a big-ass ‘H’. Followed by a big-ass ‘A’ and… Oh, no! Oh, God! Okay, all right. Real skinny ‘P’ with a high hump, and then we

In [None]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of paragraphs for one comedian and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [None]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [None]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
ali,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ..."
anthony,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s..."
joe,"[rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the f*ck is ..."
john,"Armed with boyish charm and a sharp wit, the former “SNL” writer John Mulaney offers sly takes on marriage, his beef with babies and the time he m..."
mike,"Wow. Hey, thank you. Thanks. Thank you, guys. Hey, Seattle. Nice to see you. Look at this. Look at us. We’re here. This is crazy. It’s insane. So ..."
ronny,"[tuning] [gentle Hawaiian music playing over radio] [revving] [announcer] Ladies and gentlemen, make some noise for Ronny Chieng! [crowd cheering]..."


In [None]:
# Let's take a look at the transcript for John Mulaney
data_df.transcript.loc['john']

"Armed with boyish charm and a sharp wit, the former “SNL” writer John Mulaney offers sly takes on marriage, his beef with babies and the time he met Bill Clinton All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see you again. Thank you. That was very nice. Thank you. Look, now, you’re a wonderful crowd, but I need you to keep your energy up the entire show, okay? Because… No, no, no. Thank you. Some crowds… some crowds, they have big energy in the beginning and then they run out of places to go. So… I don’t judge those crowds, by the way, okay? We’ve all gone too big too fast and then run out of room. We’ve all made a “Happy Birthday” sign… Wait. You get that poster board up, and you’re like, “I don’t need to trace it. I know how big letters should be. To begin with, a big-ass ‘H’. Followed by a big-ass ‘A’ and… Oh, no! Oh, God! Okay, all right. Real skinny ‘P’ with a high hump, and then we’ll p

i) Make text all lower case

ii) Remove punctuation

iii) Remove numerical values

iv) Remove common nonsensical text



In [None]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation, remove words containing numbers, and strip curly quotes, ellipses and newlines.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as (invalid) string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
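
To see what each step of the first round does, it can help to run it on a single sentence. The sketch below is only an illustration: `clean_round1_demo` is a hypothetical stand-in that mirrors the steps of `clean_text_round1`, and the sample sentence is made up.

```python
import re
import string

def clean_round1_demo(text):
    """Hypothetical stand-in mirroring the round-1 steps, for illustration only."""
    text = text.lower()                                               # lowercase
    text = re.sub(r'\[.*?\]', '', text)                               # drop [bracketed] stage directions
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # drop tokens containing digits
    text = re.sub('[‘’“”…]', '', text)                                # drop curly quotes and ellipses
    text = re.sub(r'\n', '', text)                                    # drop newlines
    return text

sample = "[audience cheering] I turned 33 this year. It’s GREAT!"
cleaned = clean_round1_demo(sample)
print(repr(cleaned))   # ' i turned  this year its great'
```

Note that removed pieces leave their surrounding spaces behind, so the result has a leading space and a double space where "33" used to be. That is harmless for the word counts built later, because CountVectorizer's default tokenizer ignores whitespace.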

In [None]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
ali,ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
anthony,thank you thank you thank you san francisco thank you so much so good to be here people were surprised when i told em i was gonna tape my special ...
joe,ladies and gentlemen welcome joe rogan what the fck is going on san francisco thanks for coming i appreciate it god damn put your phone down f...
john,armed with boyish charm and a sharp wit the former snl writer john mulaney offers sly takes on marriage his beef with babies and the time he met b...
mike,wow hey thank you thanks thank you guys hey seattle nice to see you look at this look at us were here this is crazy its insane so about five years...
ronny,ladies and gentlemen make some noise for ronny chieng thank you thank you hawaii im turning so im harvesting my wifes eggs this year im h...


**A second round of cleaning, clean_text_round2, uses additional regular expressions to remove filler words and split run-together "yearold" compounds**

In [None]:
def clean_text_round2(text):
    '''Further cleaning: remove filler words and split run-together "yearold" compounds into separate words.'''
    text = re.sub(r'\b(?:uh|um|like|you know|okay|alright)\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'(\w+)(year)(old)', r'\1 \2 \3', text, flags=re.IGNORECASE)
    return text

round2 = lambda x: clean_text_round2(x)
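
The second round is easier to judge on a short example. This is a sketch under assumptions: `clean_round2_demo` is a hypothetical copy of the two substitutions above, and the sample sentence is invented to trigger both of them.

```python
import re

def clean_round2_demo(text):
    """Hypothetical stand-in mirroring the round-2 substitutions, for illustration only."""
    # Remove filler words and phrases wherever they appear as whole words
    text = re.sub(r'\b(?:uh|um|like|you know|okay|alright)\b', '', text, flags=re.IGNORECASE)
    # Split compounds such as "fiveyearold" into "five year old"
    text = re.sub(r'(\w+)(year)(old)', r'\1 \2 \3', text, flags=re.IGNORECASE)
    return text

sample = "um my fiveyearold said okay i like cake you know"
result = clean_round2_demo(sample)

# Collapse the leftover gaps only for display
print(" ".join(result.split()))   # my five year old said i cake
```

One side effect worth keeping in mind: the filler list removes every occurrence of "like", including the verb ("i like cake" becomes "i cake"), so it trades some meaning for less noise.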

In [None]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
ali,ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
anthony,thank you thank you thank you san francisco thank you so much so good to be here people were surprised when i told em i was gonna tape my special ...
joe,ladies and gentlemen welcome joe rogan what the fck is going on san francisco thanks for coming i appreciate it god damn put your phone down f...
john,armed with boyish charm and a sharp wit the former snl writer john mulaney offers sly takes on marriage his beef with babies and the time he met b...
mike,wow hey thank you thanks thank you guys hey seattle nice to see you look at this look at us were here this is crazy its insane so about five years...
ronny,ladies and gentlemen make some noise for ronny chieng thank you thank you hawaii im turning so im harvesting my wifes eggs this year im h...


In [None]:
data_clean.transcript.loc['ali']

'ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get this shit over with cause i have to pee in  ten minutes but thank you everybody so much for coming  its a very exciting day for me its been a very exciting year for me i turned  this year yes thank you five people i appreciate that  i can tell that im getting older because now when i see an  girl my automatic thought is fuck you fuck you i dont even know you but fuck you cause im straight up jealous im jealous first and foremost of their metabolism because  girls they could just eat  shit and then they take a shit and have a sixpack right they got thatthat beautiful inner thigh clearance where they put their feet together and theres that huge gap here with the light of potential just radiating through and then when they go to sleep they just go to sleep right they dont have insomnia yet they dont know what its  to have to take a ambien or downl

**Corpus - a collection of texts, here stored together neatly in a pandas DataFrame (data_clean)**

In [None]:
data_clean.to_pickle("corpus.pkl")

v) Tokenize text

vi) Remove stop words

**Document-Term Matrix - word counts in matrix format (data_dtm)**

In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,abc,ability,able,abortion,abruptly,absolutely,absorbing,absurdities,abuse,abused,...,youre,youtube,youve,yyou,zealand,zeppelin,zero,zombie,zuckerberg,éclair
ali,1,0,2,0,0,0,1,1,1,2,...,31,0,2,1,0,0,0,1,0,0
anthony,0,0,0,2,0,0,0,1,0,0,...,19,0,6,0,10,0,0,0,0,0
joe,0,0,2,0,0,0,0,1,1,0,...,42,3,6,0,0,0,0,0,0,0
john,0,0,3,0,0,1,0,1,0,0,...,28,0,3,0,0,0,0,0,0,1
mike,0,0,0,0,0,0,0,1,0,0,...,28,0,3,0,0,2,1,0,0,0
ronny,0,1,1,0,1,0,0,1,0,0,...,18,7,2,0,0,0,2,0,3,0
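
To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch on two made-up documents. It is not how CountVectorizer is implemented: CountVectorizer also lowercases, drops single-character tokens and stop words, and returns a sparse matrix. The names `docs`, `vocab` and `dtm` are chosen for illustration.

```python
from collections import Counter

# Two tiny made-up "transcripts"
docs = {
    "a": "thank you thank you chicago",
    "b": "thank you seattle",
}

# Vocabulary: every distinct word across all documents, sorted alphabetically
vocab = sorted({word for text in docs.values() for word in text.split()})

# One row per document, one column per vocabulary word, each cell a word count
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocab]

print(vocab)        # ['chicago', 'seattle', 'thank', 'you']
for name, row in dtm.items():
    print(name, row)
```

Each row plays the role of one comedian in `data_dtm`, and each column the role of one term.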


In [None]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

**We will play around with CountVectorizer's parameters: ngram_range, min_df and max_df**

In [None]:
cv2 = CountVectorizer(ngram_range=(1,2), min_df=2, max_df=0.85)
data_cv2 = cv2.fit_transform(data_clean.transcript)
data_dtm2 = pd.DataFrame(data_cv2.toarray(), columns=cv2.get_feature_names_out())
data_dtm2.index = data_clean.index
data_dtm2

Unnamed: 0,able,able to,about all,about and,about getting,about guy,about her,about his,about how,about im,...,youre so,youre supposed,youre the,youre with,yourself thats,youtube,youve ever,youve got,youve seen,zero
ali,2,2,1,0,0,0,0,0,2,1,...,0,1,1,1,0,0,0,0,0,0
anthony,0,0,0,1,1,1,0,1,0,1,...,0,0,1,0,0,0,0,1,1,0
joe,2,2,0,0,1,0,0,0,1,0,...,0,1,2,3,0,3,3,0,1,0
john,3,3,1,0,0,1,1,1,1,1,...,1,1,0,0,1,0,1,0,0,0
mike,0,0,0,2,0,0,2,1,1,0,...,1,0,1,0,0,0,2,0,0,1
ronny,1,1,0,0,0,0,0,0,0,0,...,0,1,1,0,1,7,0,1,0,2


In [None]:
# Save the n-gram matrix (data_dtm2), not the original data_dtm
data_dtm2.to_pickle("dtm2.pkl")

**1) ngram_range** = (a, b) → Defines the word sequences (n-grams) to extract, from length a to length b.

(1,1): Only single words (unigrams)

(1,2): Single words + two-word phrases (bigrams)

(2,2): Only two-word phrases

**2) min_df** = Removes rare terms that appear in fewer than min_df documents.

min_df=2: Ignores terms appearing in fewer than 2 documents

min_df=0.01: Ignores terms appearing in less than 1% of documents

**3) max_df** = Removes overly common terms that appear in more than max_df documents.

max_df=0.85: Removes terms appearing in more than 85% of documents; with only 6 transcripts, that means terms that appear in all 6

Helps eliminate terms that don’t help tell the comedians apart
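
The three parameters above can be sketched in plain Python on a toy corpus. This is an illustration of the filtering logic only, not scikit-learn's implementation; the helper `ngrams` and the three sample documents are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, low, high):
    """Return all n-grams of length low..high, each joined by a space."""
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

docs = ["thank you chicago", "thank you seattle", "thank me later"]

# Document frequency: in how many documents each term appears (counted once per document)
doc_freq = Counter(term for doc in docs for term in set(ngrams(doc.split(), 1, 2)))

min_df = 2                      # integer: an absolute number of documents
max_df = 0.85                   # float: a proportion of all documents
max_count = max_df * len(docs)  # 0.85 * 3 documents = 2.55

# Keep terms that are neither too rare nor too common
kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)
print(kept)   # ['thank you', 'you']
```

Here "thank" is dropped because it appears in all three documents, and terms such as "chicago" are dropped because they appear in only one. What remains are the terms shared by some, but not all, documents.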