# Data Cleaning

`Data cleaning is a time-consuming and unenjoyable task, yet it's a very important one. Keep in mind: "garbage in, garbage out".`

#### Feeding dirty data into a model will give us results that are meaningless.

### Objective:

1. Getting the data
2. Cleaning the data
3. Organizing the data - organize the cleaned data in a way that is easy to input into other algorithms

### Output:
#### Cleaned and organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

## Problem Statement

Look at the transcripts of various comedians, note their similarities and differences, and find out whether the stand-up comedian of your choice has a comedy style that differs from that of the other comedians.


## Getting The Data

You can get the transcripts of some comedians from [Scraps From The Loft](http://scrapsfromtheloft.com).

You can use IMDb to select only the 10 or 20 highest-rated comedians.






### For example:

In [None]:
# Web scraping, pickle imports
import requests                                 # make HTTP requests and fetch HTML content
from bs4 import BeautifulSoup                   # parse HTML into a searchable tree
import pickle

In [None]:


def url_to_transcript(url):
    page = requests.get(url).text                                               # send a request to the URL; .text holds the HTML of the page
    soup = BeautifulSoup(page, "lxml")                                          # parse the HTML stored in page with the lxml parser
    post_content = soup.find(class_="elementor-widget-theme-post-content")
    if post_content:
        text = [p.text for p in post_content]                                   # if post_content exists, extract the text of each of its child elements
        print(url)
        return text
    else:
        print(f'No elements with class "elementor-widget-theme-post-content" found at {url}')
        return None

In [None]:
#URL of transcript in scope

urls=[
     'https://scrapsfromtheloft.com/books/yanis-varoufakis-technofeudalism-american-big-tech-has-enslaved-us/',
     'https://scrapsfromtheloft.com/books/john-gray-everything-you-know-about-the-future-is-wrong/',
     'https://scrapsfromtheloft.com/books/2021-year-milley-halted-nuclear-chaos/',
     'https://scrapsfromtheloft.com/books/complete-works-of-oscar-wilde-vyvyan-holland-introduction/',
     'https://scrapsfromtheloft.com/books/mary-mccarthy-on-hannah-arendt-origins-of-totalitarianism/',
     'https://scrapsfromtheloft.com/books/where-i-lived-and-what-i-lived-for-henry-david-thoreau/',
     'https://scrapsfromtheloft.com/books/death-of-dan-greenburg-humorist-and-writer/']

In [None]:

Authors=['YANIS VAROUFAKIS','JOHN GRAY','MILLEY MARKER','Vyvyan Holland','MARY','HENRY','DAN GREENBURG']

In [None]:
!pip3 install lxml



In [None]:
#Actually making a request to transcripts
transcripts=[url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/books/yanis-varoufakis-technofeudalism-american-big-tech-has-enslaved-us/
https://scrapsfromtheloft.com/books/john-gray-everything-you-know-about-the-future-is-wrong/
https://scrapsfromtheloft.com/books/2021-year-milley-halted-nuclear-chaos/
https://scrapsfromtheloft.com/books/complete-works-of-oscar-wilde-vyvyan-holland-introduction/
https://scrapsfromtheloft.com/books/mary-mccarthy-on-hannah-arendt-origins-of-totalitarianism/
https://scrapsfromtheloft.com/books/where-i-lived-and-what-i-lived-for-henry-david-thoreau/
https://scrapsfromtheloft.com/books/death-of-dan-greenburg-humorist-and-writer/


In [None]:
# Pickle files for later use
# This section of code is responsible for saving the scraped transcripts of comedians into individual text files using the pickle module
# Make a new directory to hold the text files
!mkdir transcripts

for i, c in enumerate(Authors):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)                                       # serialize the transcript data and write it to the file

mkdir: cannot create directory ‘transcripts’: File exists


In [None]:
##Load pickled files

data = {}
for i, c in enumerate(Authors):
    with open("transcripts/" + c + ".txt", "rb") as file:                       #rb-->binary read mode
        data[c] = pickle.load(file)                                             #deserialize the data from pickle file to memory

In [None]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['YANIS VAROUFAKIS', 'JOHN GRAY', 'MILLEY MARKER', 'Vyvyan Holland', 'MARY', 'HENRY', 'DAN GREENBURG'])

In [None]:
# More checks
data['JOHN GRAY'][1]

'\nCrossroads of Empire: John Gray’s Vision of Political Realignment and the Fate of the West\nJohn Gray, the prolific political philosopher, has long been a figure of intrigue and controversy. Over the past half-century, Gray has navigated the intellectual currents of political thought, never comfortably fitting into a single ideological box. His ability to provoke thought across the political spectrum is perhaps best encapsulated in his latest work, which examines the changing fortunes of the West through a prism of revived historical paradigms: feudalism, religious orthodoxy, and ultra-nationalism.\nGray’s central thesis challenges the modern consensus on growth and progress, suggesting that neither are guaranteed nor inevitable. This perspective underpins his broader critique of globalization and liberal capitalism, positions that have brought him both acclaim and criticism. His skepticism towards the West’s commitment to these ideologies has proven prescient, with his early critiq

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate.

### Assignment:
1. Perform the following data cleaning on transcripts:
   i) Make text all lower case
   ii) Remove punctuation
   iii) Remove numerical values
   iv) Remove common nonsensical text (\n)
   v) Tokenize text
   vi) Remove stop words
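
The six steps listed above can be sketched as a single function using only the standard library. This is a minimal illustration, not the approach taken in the cells that follow: the function name `clean_and_tokenize`, the tiny `STOP_WORDS` set, and the sample sentence are all assumptions made for the example, and punctuation is replaced with a space here so that neighbouring words stay separate.

```python
import re
import string

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to"}

def clean_and_tokenize(text):
    """Apply the six cleaning steps to one string and return a list of tokens."""
    text = text.lower()                                                 # i) lower case
    text = re.sub(r"[%s]" % re.escape(string.punctuation), " ", text)   # ii) punctuation -> space
    text = re.sub(r"\w*\d\w*", " ", text)                               # iii) words containing digits
    text = re.sub(r"\s+", " ", text).strip()                            # iv) newlines and extra whitespace
    tokens = text.split()                                               # v) tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]                   # vi) remove stop words

sample = "The Year 2021 was,\nin a word, STRANGE."
print(clean_and_tokenize(sample))
# ['year', 'was', 'word', 'strange']
```

In the notebook itself, tokenization and stop-word removal are left to `CountVectorizer` in the "Organizing The Data" section.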

In [None]:
#Cleaning the data

# MVP approach (minimum viable product) - start simple and iterate


next(iter(data.keys()))

'YANIS VAROUFAKIS'

In [None]:

# key: comedian
# value: transcript as a single string

def combine_text(list_of_text):
    #Takes list of text and combines it into one large chunk of text
    combined_text=' '.join(list_of_text)
    return combined_text

In [None]:
#Combine it
data_combined = {key: [combine_text(value)] for key, value in data.items()}

In [None]:
data_combined

{'YANIS VAROUFAKIS': ['\n \nIn his book, ‘Technofeudalism: What Killed Capitalism’, Yanis Varoufakis explores how giant tech firms, both in the US and China are expanding their control over the planet. His analysis is that, whilst material resources certainly matter, the real battle ground is over digital real estate.\nVaroufakis, a renowned economist and political thinker, presents a future where the digital age has not ushered in the equitable, socialist utopia many had hoped for but has instead birthed a new form of inequality and economic disparity reminiscent of feudal times, yet distinctly marked by digital characteristics.\nThe foundation of Varoufakis’s argument rests on the observation that the digital giants—companies like Google, Uber, Facebook, Apple, and Amazon—have amassed unprecedented power and wealth, not merely through traditional capitalist means but by establishing monopolies over digital platforms and resources. These companies, according to Varoufakis, are the new

In [None]:
#We can either keep it in dictionary format or put it in pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df=pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns=['transcript']
data_df=data_df.sort_index()                  #sort alphabetically
data_df

Unnamed: 0,transcript
DAN GREENBURG,"\n \nDan Greenburg, the renowned humorist and writer, passed away at the age of 87. With a career that spanned several decades, Greenburg was cele..."
HENRY,"\n \nHenry David Thoreau was born in 1817 and raised in Concord, Massachusetts, living there for most of his life. Along with Ralph Waldo Emerson,..."
JOHN GRAY,"\n \nCrossroads of Empire: John Gray’s Vision of Political Realignment and the Fate of the West\nJohn Gray, the prolific political philosopher, ha..."
MARY,"\n \nNewport RFD 2\nRhode Island\n4/26/51\nDear Hannah:\nI’ve read your book [The Origins of Totalitarianism], absorbed, for the past two weeks, i..."
MILLEY MARKER,"\n \nCAPITOL HILL, THREE YEARS ON\nCOUP IN WASHINGTON. In the days following the election, fears mounted that an unhinged Trump might bomb Iran. T..."
Vyvyan Holland,"\n \nVyvyan Holland, in his introduction to the 1966 edition of Oscar Wilde’s works, provides an insightful overview of his father’s lineage, life..."
YANIS VAROUFAKIS,"\n \nIn his book, ‘Technofeudalism: What Killed Capitalism’, Yanis Varoufakis explores how giant tech firms, both in the US and China are expandin..."


In [None]:
# Let's take a look at the transcript for Yanis Varoufakis
data_df.transcript.loc['YANIS VAROUFAKIS']

'\n \nIn his book, ‘Technofeudalism: What Killed Capitalism’, Yanis Varoufakis explores how giant tech firms, both in the US and China are expanding their control over the planet. His analysis is that, whilst material resources certainly matter, the real battle ground is over digital real estate.\nVaroufakis, a renowned economist and political thinker, presents a future where the digital age has not ushered in the equitable, socialist utopia many had hoped for but has instead birthed a new form of inequality and economic disparity reminiscent of feudal times, yet distinctly marked by digital characteristics.\nThe foundation of Varoufakis’s argument rests on the observation that the digital giants—companies like Google, Uber, Facebook, Apple, and Amazon—have amassed unprecedented power and wealth, not merely through traditional capitalist means but by establishing monopolies over digital platforms and resources. These companies, according to Varoufakis, are the new lords of a techno-feu

In [None]:
#Applying first round of text cleaning technique

import re  #to work with regular expressions
import string

# Make all the text lowercase
# Remove text within square brackets
# Remove punctuation
# Remove words containing numbers

def clean_text_round1(text):
    text=text.lower()
    text=re.sub(r'\[.*?\]','',text)                                  # raw strings avoid invalid-escape warnings
    text=re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text=re.sub(r'\w*\d\w*','',text)
    return text

In [None]:
# Let's take a look at the updated text
round1=lambda x:clean_text_round1(x)
data_clean=pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
DAN GREENBURG,\n \ndan greenburg the renowned humorist and writer passed away at the age of with a career that spanned several decades greenburg was celebrated...
HENRY,\n \nhenry david thoreau was born in and raised in concord massachusetts living there for most of his life along with ralph waldo emerson thoreau...
JOHN GRAY,\n \ncrossroads of empire john gray’s vision of political realignment and the fate of the west\njohn gray the prolific political philosopher has l...
MARY,\n \nnewport rfd \nrhode island\n\ndear hannah\ni’ve read your book absorbed for the past two weeks in the bathtub riding in the car waiting in l...
MILLEY MARKER,\n \ncapitol hill three years on\ncoup in washington in the days following the election fears mounted that an unhinged trump might bomb iran the j...
Vyvyan Holland,\n \nvyvyan holland in his introduction to the edition of oscar wilde’s works provides an insightful overview of his father’s lineage life and li...
YANIS VAROUFAKIS,\n \nin his book ‘technofeudalism what killed capitalism’ yanis varoufakis explores how giant tech firms both in the us and china are expanding th...


In [None]:
#Applying second round of cleaning

# Get rid of some additional punctuation and
# nonsensical text that was missed the first time around, such as the newline character \n

def clean_text_round2(text):
    text=re.sub('[‘’“”…]', '', text)
    text= re.sub('\n','',text)
    return text

In [None]:
# Let's take a look at the updated text
round2=lambda x: clean_text_round2(x)
data_clean= pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
DAN GREENBURG,dan greenburg the renowned humorist and writer passed away at the age of with a career that spanned several decades greenburg was celebrated for...
HENRY,henry david thoreau was born in and raised in concord massachusetts living there for most of his life along with ralph waldo emerson thoreau was...
JOHN GRAY,crossroads of empire john grays vision of political realignment and the fate of the westjohn gray the prolific political philosopher has long bee...
MARY,newport rfd rhode islanddear hannahive read your book absorbed for the past two weeks in the bathtub riding in the car waiting in line in the gr...
MILLEY MARKER,capitol hill three years oncoup in washington in the days following the election fears mounted that an unhinged trump might bomb iran the joint c...
Vyvyan Holland,vyvyan holland in his introduction to the edition of oscar wildes works provides an insightful overview of his fathers lineage life and literary...
YANIS VAROUFAKIS,in his book technofeudalism what killed capitalism yanis varoufakis explores how giant tech firms both in the us and china are expanding their co...
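
The output above shows a side effect of deleting `\n` outright: the last word of one line is fused with the first word of the next ("westjohn", "oncoup", "islanddear"). One possible variant, sketched below, replaces every run of whitespace with a single space instead. The name `clean_text_round2_spaced` is hypothetical and the function is not used elsewhere in this notebook; the recorded outputs still come from `clean_text_round2`.

```python
import re

def clean_text_round2_spaced(text):
    """Variant of round 2 that keeps word boundaries across line breaks."""
    text = re.sub(r"[‘’“”…]", "", text)    # drop curly quotes and ellipses
    text = re.sub(r"\s+", " ", text)        # newlines and whitespace runs -> one space
    return text.strip()

print(clean_text_round2_spaced("\n \nfate of the west\njohn gray’s vision"))
# fate of the west john grays vision
```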


In [None]:
#Applying 3rd round of cleaning
#Get rid of "♪" replace with empty string ""

def clean_text_round3(text):
    text=re.sub("♪","",text)
    return text

In [None]:
# Let's take a look at the updated text
round3=lambda x: clean_text_round3(x)
data_clean= pd.DataFrame(data_clean.transcript.apply(round3))
data_clean

Unnamed: 0,transcript
DAN GREENBURG,dan greenburg the renowned humorist and writer passed away at the age of with a career that spanned several decades greenburg was celebrated for...
HENRY,henry david thoreau was born in and raised in concord massachusetts living there for most of his life along with ralph waldo emerson thoreau was...
JOHN GRAY,crossroads of empire john grays vision of political realignment and the fate of the westjohn gray the prolific political philosopher has long bee...
MARY,newport rfd rhode islanddear hannahive read your book absorbed for the past two weeks in the bathtub riding in the car waiting in line in the gr...
MILLEY MARKER,capitol hill three years oncoup in washington in the days following the election fears mounted that an unhinged trump might bomb iran the joint c...
Vyvyan Holland,vyvyan holland in his introduction to the edition of oscar wildes works provides an insightful overview of his fathers lineage life and literary...
YANIS VAROUFAKIS,in his book technofeudalism what killed capitalism yanis varoufakis explores how giant tech firms both in the us and china are expanding their co...


## Organizing The Data

### Assignment:
1. Organize the data in two standard text formats:
   a) Corpus - a collection of texts, stored together here in a pandas DataFrame
   b) Document-Term Matrix - word counts in matrix format

### Corpus: Example

A corpus is a collection of texts; here they are stored together in a pandas DataFrame.

In [None]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
DAN GREENBURG,"\n \nDan Greenburg, the renowned humorist and writer, passed away at the age of 87. With a career that spanned several decades, Greenburg was cele..."
HENRY,"\n \nHenry David Thoreau was born in 1817 and raised in Concord, Massachusetts, living there for most of his life. Along with Ralph Waldo Emerson,..."
JOHN GRAY,"\n \nCrossroads of Empire: John Gray’s Vision of Political Realignment and the Fate of the West\nJohn Gray, the prolific political philosopher, ha..."
MARY,"\n \nNewport RFD 2\nRhode Island\n4/26/51\nDear Hannah:\nI’ve read your book [The Origins of Totalitarianism], absorbed, for the past two weeks, i..."
MILLEY MARKER,"\n \nCAPITOL HILL, THREE YEARS ON\nCOUP IN WASHINGTON. In the days following the election, fears mounted that an unhinged Trump might bomb Iran. T..."
Vyvyan Holland,"\n \nVyvyan Holland, in his introduction to the 1966 edition of Oscar Wilde’s works, provides an insightful overview of his father’s lineage, life..."
YANIS VAROUFAKIS,"\n \nIn his book, ‘Technofeudalism: What Killed Capitalism’, Yanis Varoufakis explores how giant tech firms, both in the US and China are expandin..."


In [None]:
# Let's add the comedians' full names as well
# data_df was sorted alphabetically above, so assigning a list in the original Authors order
# would attach names to the wrong rows; map each name to its row by index label instead
full_names=['YANIS VAROUFAKIS','JOHN GRAY','MILLEY MARKER','Vyvyan Holland','MARY','HENRY','DAN GREENBURG']

data_df['full_name'] = data_df.index.map(dict(zip(Authors, full_names)))
data_df

Unnamed: 0,transcript,full_name
DAN GREENBURG,"\n \nDan Greenburg, the renowned humorist and writer, passed away at the age of 87. With a career that spanned several decades, Greenburg was cele...",DAN GREENBURG
HENRY,"\n \nHenry David Thoreau was born in 1817 and raised in Concord, Massachusetts, living there for most of his life. Along with Ralph Waldo Emerson,...",HENRY
JOHN GRAY,"\n \nCrossroads of Empire: John Gray’s Vision of Political Realignment and the Fate of the West\nJohn Gray, the prolific political philosopher, ha...",JOHN GRAY
MARY,"\n \nNewport RFD 2\nRhode Island\n4/26/51\nDear Hannah:\nI’ve read your book [The Origins of Totalitarianism], absorbed, for the past two weeks, i...",MARY
MILLEY MARKER,"\n \nCAPITOL HILL, THREE YEARS ON\nCOUP IN WASHINGTON. In the days following the election, fears mounted that an unhinged Trump might bomb Iran. T...",MILLEY MARKER
Vyvyan Holland,"\n \nVyvyan Holland, in his introduction to the 1966 edition of Oscar Wilde’s works, provides an insightful overview of his father’s lineage, life...",Vyvyan Holland
YANIS VAROUFAKIS,"\n \nIn his book, ‘Technofeudalism: What Killed Capitalism’, Yanis Varoufakis explores how giant tech firms, both in the US and China are expandin...",YANIS VAROUFAKIS


In [None]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix: Example

For many of the techniques we'll be using in future assignments, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's `CountVectorizer`, where every row represents a different document and every column represents a different word.

In addition, with `CountVectorizer`, we can remove stop words. Stop words are common words that carry little meaning on their own, such as 'a', 'the', etc.
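
To make the row-and-column layout concrete, here is a pure-Python sketch that builds a tiny document-term matrix by hand. It only illustrates the idea and is not how `CountVectorizer` is implemented; the two documents and the two-word stop list are made up for the example.

```python
from collections import Counter

docs = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog sat",
}
stop_words = {"the", "on"}  # toy stop-word list, assumed for illustration

# Count the non-stop-word tokens in each document.
counts = {
    name: Counter(w for w in text.split() if w not in stop_words)
    for name, text in docs.items()
}

# The vocabulary (the columns) is every distinct token, sorted alphabetically.
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word.
dtm = {name: [c[w] for w in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
# ['cat', 'dog', 'mat', 'sat']
# doc_a [1, 0, 1, 1]
# doc_b [0, 1, 0, 1]
```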

In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index=data_clean.index


data_dtm

Unnamed: 0,aaron,aberration,ability,able,abroad,absence,absolute,absorbed,absorbing,abuse,...,year,years,yes,york,young,youp,youre,youth,zack,zuocheng
DAN GREENBURG,0,0,5,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,2,0
HENRY,0,0,0,1,0,0,2,0,0,0,...,1,3,1,1,0,0,0,0,0,0
JOHN GRAY,1,1,6,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MARY,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
MILLEY MARKER,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,1,0,0,2
Vyvyan Holland,0,0,0,0,0,0,0,0,1,0,...,3,8,0,1,2,0,0,1,0,0
YANIS VAROUFAKIS,0,0,1,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
word_totals = data_dtm.sum(axis=0)                                  # total count of each word across all documents
max_word = word_totals.sort_values(ascending=False).index[0]
max_count = word_totals.max()
min_word = word_totals.sort_values(ascending=True).index[0]         # ascending sort puts a least frequent word first
min_count = word_totals.min()
print("Most frequently used word:", max_word)
print("Total count of the most frequent word:", max_count)

Most frequently used word: economic
Total count of the most frequent word: 112


In [None]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [None]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:                                  # the with block closes the file after writing
    pickle.dump(cv, file)

In [None]:
data_dtm['able']

DAN GREENBURG       0
HENRY               1
JOHN GRAY           0
MARY                0
MILLEY MARKER       0
Vyvyan Holland      0
YANIS VAROUFAKIS    0
Name: able, dtype: int64

## Additional Assignments:

1. Can you add an additional regular expression to the clean_text_round2 function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?

In [None]:
#Q1. Can you add an additional regular expression to the clean_text_round2 function to further clean the text?
def clean_text_round2(text):
    # Remove additional punctuation and non-sensical text
    text = re.sub(r'[‘’“”…]', '', text)  # Remove special characters
    text = re.sub(r'\n', '', text)  # Remove newline characters
    text = re.sub(r'\(.*?\)', '', text)  # Remove text within parentheses
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    return text

# Apply clean_text_round2 function to the transcript data
data_clean_round2 = data_clean['transcript'].apply(clean_text_round2)

# Print the cleaned text
for transcript in data_clean_round2:
    print(transcript)


 dan greenburg the renowned humorist and writer passed away at the age of with a career that spanned several decades greenburg was celebrated for his sharp wit and satirical writing which shone brightly across a diverse body of work including books essays screenplays and more he made a significant mark in literature and entertainment poking fun at a wide range of subjects with a unique blend of humor and insighthis most notable work how to be a jewish mother humorously dissected the stereotypical jewish mother offering advice with a tongueincheek tone that captivated readers nationwide the book became a cultural touchstone reflecting greenburgs ability to observe and satirize the nuances of everyday life despite its specific cultural references the books underlying themes of familial love and the complexities of identity resonated with a broad audience making it a bestsellerborn in chicago greenburgs career spanned various genres from horror and the occult to murder mysteries and child

### Q2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?

`CountVectorizer` is a scikit-learn class that turns a collection of text documents into a matrix of token counts. It is a standard tool in natural language processing (NLP) and text mining.

In brief, `CountVectorizer` works in three steps:

1) Tokenization: each input document is split into individual words, or tokens.

2) Vocabulary construction: a vocabulary is built from all distinct tokens found in the input documents. Each distinct token becomes one feature (one column) of the resulting matrix.

3) Counting: for every document, the occurrences of each vocabulary token are counted and written into the matrix. The result is a document-term matrix: each row corresponds to a document, each column corresponds to a token, and each value is the number of times that token occurs in that document.


`ngram_range` sets which n-gram sizes are extracted when the text is tokenized. An n-gram is a sequence of n consecutive items (words, by default) taken from a text. For the sentence "The cat sat on the mat", the 1-grams (unigrams) are the individual words: ["The", "cat", "sat", "on", "the", "mat"]. The 2-grams (bigrams) are pairs of consecutive words: ["The cat", "cat sat", "sat on", "on the", "the mat"]. The 3-grams (trigrams) are groups of three consecutive words: ["The cat sat", "cat sat on", "sat on the", "on the mat"]. With its default settings `CountVectorizer` also lowercases the text, so "The" and "the" count as the same token. The parameter is a tuple (min_n, max_n) giving the smallest and largest n to extract; for example, (1, 2) extracts both unigrams and bigrams. The default value is (1, 1), meaning that only unigrams are considered.
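
The effect of the (min_n, max_n) tuple can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's code: the helper `word_ngrams` is hypothetical, it splits on whitespace, and it lowercases the text to mirror `CountVectorizer`'s default.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams of text for every n from min_n to max_n."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("The cat sat on the mat", ngram_range=(2, 3)))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat',
#  'the cat sat', 'cat sat on', 'sat on the', 'on the mat']
```

With scikit-learn, the corresponding setting would be `CountVectorizer(ngram_range=(2, 3))`, which additionally deduplicates the n-grams into a vocabulary.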

`min_df` is a minimum document frequency: the smallest number of documents a token must appear in to be included in the vocabulary. It does not refer to how often the token occurs within a single document. For instance, min_df=2 means that a token must appear in at least 2 documents. If min_df is a float, it is read as a proportion of documents: min_df=0.1 means that the token must appear in at least 10% of the documents. The default value is 1, so any token that appears in at least one document is kept.

`max_df` is the corresponding upper limit: tokens that appear in more documents than this are dropped from the vocabulary. If max_df is an integer, it is an absolute document count: max_df=5 means that a token may appear in at most 5 documents. If max_df is a float, it is a proportion of documents: max_df=0.5 means at most 50% of the documents, and max_df=0.9 means at most 90%. The default value is 1.0, which imposes no upper limit.
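
As a rough illustration of how the two thresholds interact, the sketch below filters a vocabulary by document frequency in plain Python. The helper `filter_vocabulary` and the four toy documents are assumptions made for the example; the sketch follows the integer-versus-float convention described above and is not scikit-learn's implementation.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies between min_df and max_df."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain the token at least once.
    df = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    # Floats are proportions of documents; integers are absolute document counts.
    lo = min_df * n_docs if isinstance(min_df, float) else min_df
    hi = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(tok for tok, count in df.items() if lo <= count <= hi)

docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
print(filter_vocabulary(docs, min_df=2))
# ['cat', 'flew', 'sat', 'the']
print(filter_vocabulary(docs, min_df=2, max_df=0.75))
# ['cat', 'flew', 'sat']
```

Here "the" appears in all four documents, so max_df=0.75 (at most three documents) removes it, while min_df=2 removes "dog" and "bird", which each appear only once.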