# 1 Data Cleaning 

## 1.1 Introduction
we will be walking through:
1. Getting the data
2. Cleaning the data
3. Organizing the data
   
The output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document Text Matrix** - word counts in matrix format

## 1.2 Problem Statement
How someone speech is different than othes?

## 1.3 Getting The Data
Get routine of comedian who has imdb rating greater or equal than 7.5/10 and more than 2000 votes.\
Do scraping from the [SCRAPS FROM THE LOFT](https://scrapsfromtheloft.com/comedy/your-friend-nate-bargatze-transcript/)

In [2]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

In [40]:
def url_to_transcript(url):
    """Return transcript data specifically from scrapsfromloft.com"""
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html')
    text = [p.text for p in soup.find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = [
    'https://scrapsfromtheloft.com/comedy/tom-papa-home-free-transcript/',
    'https://scrapsfromtheloft.com/comedy/anthony-jeselnik-bones-and-all-transcript/',
    'https://scrapsfromtheloft.com/comedy/bill-maher-is-anyone-else-seeing-this-transcript/',
    'https://scrapsfromtheloft.com/comedy/ali-wong-single-lady-transcript/',
    'https://scrapsfromtheloft.com/comedy/ahir-shah-ends-transcript/',
    'https://scrapsfromtheloft.com/comedy/ari-shaffir-americas-sweetheart-transcript/',
    'https://scrapsfromtheloft.com/comedy/gabriel-iglesias-legend-of-fluffy-2025-transcript/',
    'https://scrapsfromtheloft.com/comedy/your-friend-nate-bargatze-transcript/',
    'https://scrapsfromtheloft.com/comedy/joe-rogan-burn-the-boats-transcript/'
]
comedians = ['tom', 'anthony', 'bill', 'ali', 'ahir', 'ari', 'gabriel', 'nate', 'joe']
minutes_runs = [62, 51, 66, 59, 61, 75, 101, 63, 67]

In [41]:
# this will take some time to scrape
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/tom-papa-home-free-transcript/
https://scrapsfromtheloft.com/comedy/anthony-jeselnik-bones-and-all-transcript/
https://scrapsfromtheloft.com/comedy/bill-maher-is-anyone-else-seeing-this-transcript/
https://scrapsfromtheloft.com/comedy/ali-wong-single-lady-transcript/
https://scrapsfromtheloft.com/comedy/ahir-shah-ends-transcript/
https://scrapsfromtheloft.com/comedy/ari-shaffir-americas-sweetheart-transcript/
https://scrapsfromtheloft.com/comedy/gabriel-iglesias-legend-of-fluffy-2025-transcript/
https://scrapsfromtheloft.com/comedy/your-friend-nate-bargatze-transcript/
https://scrapsfromtheloft.com/comedy/joe-rogan-burn-the-boats-transcript/


In [42]:
data = dict()
for i, c in enumerate(comedians):
    data[c] = transcripts[i]

In [6]:
# pickle files for later use
# make a new directory to hold the text files
# import os
# os.mkdir('data')

In [43]:
# Save the dictionary to a pickle file
with open('data/data.pickle', 'wb') as file:
    pickle.dump(data, file)

In [44]:
# check if the data has been loaded or not
data.keys()

dict_keys(['tom', 'anthony', 'bill', 'ali', 'ahir', 'ari', 'gabriel', 'nate', 'joe'])

In [46]:
# double check
data['nate'][:2]

['Your Friend, Nate Bargatze (2024)\nGenre: Comedy, Stand-up\nDirector: Ryan Polito\nStar: Nate Bargatze\nPremiered on Netflix on December 24, 2024',
 'Nate Bargatze’s 2024 Netflix stand-up special, Your Friend, Nate Bargatze, is a comedic journey through the absurdities of everyday life. With his signature deadpan delivery, Bargatze riffs on relatable topics like pizza-ordering dilemmas, domestic quirks with his frugal wife, and parenting mishaps, weaving personal anecdotes with exaggerated humor. The show opens with a spirited introduction from his daughter, setting the stage for a performance rooted in self-deprecating humor, small-town nostalgia, and sharp observations about modern absurdities. Bargatze’s storytelling shines as he recounts his days as a water meter reader, his failed attempts to manage daily routines without his wife’s guidance, and his humorous take on aging and familial dynamics.']

## 1.4 Cleaning The Data
When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers etc.\
With text data there are some common data cleaning techniques, which are also known as text pre-processing techniques.

**Common data cleaning steps on all text:**
* Make test all lower case
* Remove punctuation
* Remove numerical values
* Remove common non-sensical test (\n)
* Tokenize text
* Rome stop words

**More data cleaning steps after tokenizatio:**
* Stemming/ lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

In [47]:
# lets take a look at our data again
next(iter(data.keys()))

'tom'

In [None]:
# Notice that our id currently in key: comedian and values: list of text format
next(iter(data.values()))

In [49]:
def combine_text(list_of_text):
    """Takes a list of text and combines them into one large chunk of text."""
    combined = " ".join(list_of_text)
    return combined

In [50]:
# combine it
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [51]:
# put it into pandas dataframe
import pandas as pd
df = pd.DataFrame(data_combined).transpose()
df.columns=['transcript']

In [52]:
df

Unnamed: 0,transcript
tom,[“Soul Man” by The Blues Brothers plays] [audi...
anthony,[crowd cheering] Thank you all for coming to t...
bill,Bill Maher‘s Is Anyone Else Seeing This? is a ...
ali,[“Get me Bodied (Extended Mix)” playing] [audi...
ahir,Ahir Shah: Ends (2024)\nGenre: Stand-up Comedy...
ari,"[lightning crashes] Every town, I figured out,..."
gabriel,Gabriel Iglesias’ Legend of Fluffy is a lively...
nate,"Your Friend, Nate Bargatze (2024)\nGenre: Come..."
joe,"Joe Rogan, performing live in San Antonio, exp..."


### 👆🏼 so this is `corpus`.

In [76]:
# lets take a look at the transcript for Nate
df.transcript.loc['nate'][:1000]

'Your Friend, Nate Bargatze (2024)\nGenre: Comedy, Stand-up\nDirector: Ryan Polito\nStar: Nate Bargatze\nPremiered on Netflix on December 24, 2024 Nate Bargatze’s 2024 Netflix stand-up special, Your Friend, Nate Bargatze, is a comedic journey through the absurdities of everyday life. With his signature deadpan delivery, Bargatze riffs on relatable topics like pizza-ordering dilemmas, domestic quirks with his frugal wife, and parenting mishaps, weaving personal anecdotes with exaggerated humor. The show opens with a spirited introduction from his daughter, setting the stage for a performance rooted in self-deprecating humor, small-town nostalgia, and sharp observations about modern absurdities. Bargatze’s storytelling shines as he recounts his days as a water meter reader, his failed attempts to manage daily routines without his wife’s guidance, and his humorous take on aging and familial dynamics. The special thrives on Bargatze’s ability to find comedy in the mundane, whether it’s deb

In [54]:
# Apply a first round of text cleaning techniques 
import re
import string

def clean_text_round1(text):
    """Make text lowercase, remove text in square brackets, remove punctuation, etc."""
    text = text.lower()
    text = text.replace("…","")
    # Remove text inside square brackets
    text = re.sub(r'[.*?]', '', text)
    
    # Remove punctuation
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    
    # Remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    
    return text

round1 = lambda x : clean_text_round1(x)

In [55]:
# Lets take a look at the updated text
data_clean = pd.DataFrame(df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
tom,“soul man” by the blues brothers plays audienc...
anthony,crowd cheering thank you all for coming to the...
bill,bill maher‘s is anyone else seeing this is a s...
ali,“get me bodied extended mix” playing audience ...
ahir,ahir shah ends \ngenre standup comedy social c...
ari,lightning crashes every town i figured out eve...
gabriel,gabriel iglesias’ legend of fluffy is a lively...
nate,your friend nate bargatze \ngenre comedy stand...
joe,joe rogan performing live in san antonio expre...


In [77]:
data_clean.transcript.loc['nate'][:1000]

'your friend nate bargatze genre comedy standupdirector ryan politostar nate bargatzepremiered on netflix on december   nate bargatzes  netflix standup special your friend nate bargatze is a comedic journey through the absurdities of everyday life with his signature deadpan delivery bargatze riffs on relatable topics like pizzaordering dilemmas domestic quirks with his frugal wife and parenting mishaps weaving personal anecdotes with exaggerated humor the show opens with a spirited introduction from his daughter setting the stage for a performance rooted in selfdeprecating humor smalltown nostalgia and sharp observations about modern absurdities bargatzes storytelling shines as he recounts his days as a water meter reader his failed attempts to manage daily routines without his wifes guidance and his humorous take on aging and familial dynamics the special thrives on bargatzes ability to find comedy in the mundane whether its debating the dominance of dogs sleeping in beds or lamenting

In [63]:
def clean_text_round2(text):
    """Get rid of some additional punctuation, musical notes, quotes, etc."""
    # Remove musical notes and special characters
    text = re.sub('[‘’“”]', '', text)
    
    # Remove single and double quotes
    text = re.sub("[\"\']", '', text)
    
    # Remove additional punctuation
    text = re.sub('\n', '', text)

    text = re.sub('♪', '', text)
    
    return text

# Lambda function for round2 cleaning
round2 = lambda x: clean_text_round2(x)

In [64]:
# Lets take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
tom,soul man by the blues brothers plays audience ...
anthony,crowd cheering thank you all for coming to the...
bill,bill mahers is anyone else seeing this is a sh...
ali,get me bodied extended mix playing audience ch...
ahir,ahir shah ends genre standup comedy social com...
ari,lightning crashes every town i figured out eve...
gabriel,gabriel iglesias legend of fluffy is a lively ...
nate,your friend nate bargatze genre comedy standup...
joe,joe rogan performing live in san antonio expre...


In [78]:
data_clean.transcript.loc['nate'][:1000]

'your friend nate bargatze genre comedy standupdirector ryan politostar nate bargatzepremiered on netflix on december   nate bargatzes  netflix standup special your friend nate bargatze is a comedic journey through the absurdities of everyday life with his signature deadpan delivery bargatze riffs on relatable topics like pizzaordering dilemmas domestic quirks with his frugal wife and parenting mishaps weaving personal anecdotes with exaggerated humor the show opens with a spirited introduction from his daughter setting the stage for a performance rooted in selfdeprecating humor smalltown nostalgia and sharp observations about modern absurdities bargatzes storytelling shines as he recounts his days as a water meter reader his failed attempts to manage daily routines without his wifes guidance and his humorous take on aging and familial dynamics the special thrives on bargatzes ability to find comedy in the mundane whether its debating the dominance of dogs sleeping in beds or lamenting

**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. \
After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make some edits such as:
* Mark 'cheering' and 'cheer' as the same word (`stemming/lemmatization`)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...

## 1.5  Organizing The Data
These are two types of data formats:
1. `Corpus` - a collection text
2. `DTM` - word counts in matrix format

### 1.5.1 Corpus
It is a collection of texts, and they are all put together neatly in a pandas dataframe.

In [66]:
# lets see the corpus data
df

Unnamed: 0,transcript
tom,[“Soul Man” by The Blues Brothers plays] [audi...
anthony,[crowd cheering] Thank you all for coming to t...
bill,Bill Maher‘s Is Anyone Else Seeing This? is a ...
ali,[“Get me Bodied (Extended Mix)” playing] [audi...
ahir,Ahir Shah: Ends (2024)\nGenre: Stand-up Comedy...
ari,"[lightning crashes] Every town, I figured out,..."
gabriel,Gabriel Iglesias’ Legend of Fluffy is a lively...
nate,"Your Friend, Nate Bargatze (2024)\nGenre: Come..."
joe,"Joe Rogan, performing live in San Antonio, exp..."


In [67]:
# Lets pickle it for later use
df.to_pickle('data/corpus.pkl')

### 1.5.2 Documen-Term Matrix (DTM)
The text must be tokenized, meaning broken down into smaller pieces. The most common tokenizer technique is to break down text into words. We can do this using scikit-learn's `CountVectorizer`, where every row will represent a different document and every column will represent a different word.

In addition, with `CountVectorizer`, we can remove stop words. **Stop words** are common words that add no additional meaning to text such as 'a', 'the', etc.

In [68]:
# create a DTM
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer with English stop words
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer to the data and transform it into a document-term matrix
data_cv = vectorizer.fit_transform(data_clean.transcript)

# Convert the document-term matrix into a DataFrame
data_dtm = pd.DataFrame(data_cv.toarray())

# Set the DataFrame index to match the original data index
data_dtm.index = data_clean.index

# Get feature names using the vocabulary_ attribute
feature_names = list(vectorizer.vocabulary_.keys())

# Set the DataFrame columns to be the feature names
data_dtm.columns = feature_names

In [70]:
data_dtm.shape

(9, 6507)

In [71]:
# Lets pickle it for later use
data_dtm.to_pickle('data/dtm.pkl')

In [72]:
# Pickle the cleaned data for later use
data_clean.to_pickle('data/data_clean.pkl')

# Pickle the CountVectorizer for later use
with open('data/vectorizer.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)

## 1.6 Additionl text processing
```python
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer with custom parameters
vectorizer = CountVectorizer(
    ngram_range=(1, 2),  # Extract unigrams and bigrams
    min_df=5,            # Ignore terms that appear in fewer than 5 documents
    max_df=0.8           # Ignore terms that appear in more than 80% of the documents
)
```
