# Data Cleaning

`Data cleaning is a time-consuming and unenjoyable task, yet it's a very important one. Keep in mind: "garbage in, garbage out".`

#### Feeding dirty data into a model will give us results that are meaningless.

### Objective:

1. Getting the data 
2. Cleaning the data 
3. Organizing the data - organize the cleaned data in a way that is easy to feed into other algorithms

### Output:
#### Cleaned and organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

## Problem Statement

Look at the transcripts of various comedians, note their similarities and differences, and find out whether the stand-up comedian of your choice has a comedy style different from that of the other comedians.


## Getting The Data

You can get the transcripts of some comedians from [Scraps From The Loft](http://scrapsfromtheloft.com).

You can use IMDB to narrow the selection to the 10 or 20 comedians with the highest ratings.






### For example:

In [27]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="ast-container").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['https://scrapsfromtheloft.com/comedy/dave-chappelle-sticks-stones-transcript/',
        'https://scrapsfromtheloft.com/comedy/pete-davidson-alive-from-new-york-transcript/',
        'https://scrapsfromtheloft.com/comedy/neal-brennan-blocks-transcript/',
        'https://scrapsfromtheloft.com/comedy/vir-das-losing-it-transcript/',
        'https://scrapsfromtheloft.com/comedy/chris-rock-total-blackout-the-tamborine-extended-cut-transcript/',
        'https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/',
        'https://scrapsfromtheloft.com/comedy/norm-macdonald-hitlers-dog-gossip-trickery-2017-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/eddie-murphy-raw-transcript/',
        'https://scrapsfromtheloft.com/comedy/hasan-minhaj-homecoming-king-transcript/',
        'https://scrapsfromtheloft.com/comedy/bill-burr-paper-tiger-transcript/',
        'https://scrapsfromtheloft.com/comedy/ricky-gervais-supernature-transcript/',
        'https://scrapsfromtheloft.com/comedy/louis-c-k-sorry-transcript/',
        'https://scrapsfromtheloft.com/comedy/bo-burnham-inside-transcript/',
        'https://scrapsfromtheloft.com/comedy/taylor-tomlinson-quarter-life-crisis-transcript/',
        'https://scrapsfromtheloft.com/comedy/trevor-noah-i-wish-you-would-transcript/']

# Comedian names
comedians = ['dave', 'pete', 'brennan', 'vir', 'rock', 'carlin', 'norm', 'murphy', 'hasan', 'burr', 'ricky', 'louis', 'burnham', 'taylor', 'trevor']

In [28]:
# Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/dave-chappelle-sticks-stones-transcript/
https://scrapsfromtheloft.com/comedy/pete-davidson-alive-from-new-york-transcript/
https://scrapsfromtheloft.com/comedy/neal-brennan-blocks-transcript/
https://scrapsfromtheloft.com/comedy/vir-das-losing-it-transcript/
https://scrapsfromtheloft.com/comedy/chris-rock-total-blackout-the-tamborine-extended-cut-transcript/
https://scrapsfromtheloft.com/comedy/george-carlin-doin-it-again-transcript/
https://scrapsfromtheloft.com/comedy/norm-macdonald-hitlers-dog-gossip-trickery-2017-full-transcript/
https://scrapsfromtheloft.com/comedy/eddie-murphy-raw-transcript/
https://scrapsfromtheloft.com/comedy/hasan-minhaj-homecoming-king-transcript/
https://scrapsfromtheloft.com/comedy/bill-burr-paper-tiger-transcript/
https://scrapsfromtheloft.com/comedy/ricky-gervais-supernature-transcript/
https://scrapsfromtheloft.com/comedy/louis-c-k-sorry-transcript/
https://scrapsfromtheloft.com/comedy/bo-burnham-inside-transcript/


In [29]:
# Pickle files for later use

# Make a new directory to hold the text files
!mkdir transcripts

for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

mkdir: transcripts: File exists


In [30]:
# Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
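
The `mkdir: transcripts: File exists` message above appears because the directory was already there from an earlier run. A minimal sketch of one way to make the save and load steps re-runnable, using only the standard library, might look like the following. The helper names `save_transcripts` and `load_transcripts`, the `.pkl` extension, and the toy data are assumptions for illustration; the notebook itself uses `!mkdir` and writes `.txt` files.

```python
import pickle
import tempfile
from pathlib import Path


def save_transcripts(names, transcripts, folder):
    """Pickle one transcript per comedian into `folder`; return the paths written."""
    if len(names) != len(transcripts):
        raise ValueError("names and transcripts must have the same length")
    folder = Path(folder)
    # exist_ok=True makes this a no-op when the directory is already there
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, transcript in zip(names, transcripts):
        path = folder / f"{name}.pkl"
        with path.open("wb") as fh:
            pickle.dump(transcript, fh)
        paths.append(path)
    return paths


def load_transcripts(names, folder):
    """Load the pickled transcripts back into a {name: transcript} dict."""
    loaded = {}
    for name in names:
        with (Path(folder) / f"{name}.pkl").open("rb") as fh:
            loaded[name] = pickle.load(fh)
    return loaded


# Toy stand-ins for the scraped data (hypothetical values, not real transcripts)
demo_names = ["alice", "bob"]
demo_transcripts = [["first line", "second line"], ["only line"]]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "transcripts"
    save_transcripts(demo_names, demo_transcripts, target)
    save_transcripts(demo_names, demo_transcripts, target)  # second run does not fail
    demo_data = load_transcripts(demo_names, target)
```

Pairing names and transcripts with `zip` after a length check also guards against a silent mismatch if one of the requests returned nothing.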

In [31]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['dave', 'pete', 'brennan', 'vir', 'rock', 'carlin', 'norm', 'murphy', 'hasan', 'burr', 'ricky', 'louis', 'burnham', 'taylor', 'trevor'])

In [32]:
# More checks
data['burnham'][:2]

['Exploring mental health decline over 2020, the constant challenges our world faces, and the struggles of life itself, Bo Burnham creates a wonderful masterpiece to explain each of these, both from general view and personal experience.',
 '* * *']

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate.
### Assignment:
1. Perform the following data cleaning on the transcripts:
   i) Make text all lower case
   ii) Remove punctuation
   iii) Remove numerical values
   iv) Remove common nonsensical text (`\n`)
   v) Tokenize text
   vi) Remove stop words
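
Before working on the real transcripts, it may help to see the six steps applied to a single sentence. The sketch below is a standard-library illustration only: the function name `basic_clean`, the tiny `STOP_WORDS` set, and the sample sentence are all assumptions, and it strips digits rather than whole words containing digits.

```python
import re
import string

# A tiny, hypothetical stop-word list for illustration; real lists are much longer
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "in", "of", "to", "was"}


def basic_clean(text):
    """Apply steps i) to vi) to one string and return the remaining tokens."""
    text = text.lower()                                   # i) lower case
    text = text.replace("\n", " ")                        # iv) newlines become spaces
    text = text.translate(str.maketrans("", "", string.punctuation))  # ii) punctuation
    text = re.sub(r"\d+", "", text)                       # iii) numerical values
    tokens = text.split()                                 # v) whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]     # vi) stop words


sample = "The special was recorded in 2019,\nand it is a HIT!"
tokens = basic_clean(sample)
print(tokens)  # ['special', 'recorded', 'hit']
```

Newlines are replaced before punctuation is removed so that words on either side of a line break stay separate tokens.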

In [33]:
# Let's take a look at our data again
next(iter(data.keys()))

'dave'

In [34]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

['Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades and a remarkably long t-shirt.',
 '[Morgan Freeman] This is Dave. He tells jokes for a living. Hopefully he makes people laugh, but these days it’s a high stakes game.',
 'Hmm, how did we get here, I wonder? I don’t mean that metaphorically, I’m really asking: how did Dave get here? I mean, what the fuck is this?',
 'But what do I know? I’m just Morgan Freeman.',
 'Anyway, I guess what I’m trying to say is\xa0if you say anything… you risk everything. But if that’s the way it’s gotta be—okay, fine, fuck it! ',
 'Ahahah, he’s back folks!',
 'Sticks & Stones streamed August 26, 2019 on Netflix.',
 '“TELL ME SOMETHING’',
 'YOU MOTHAFUCKAS\nCAN’T TELL ME NOTHIN’',
 'I’D RATHER DIE THAN\nTO LISTEN TO YOU…”',
 '—KENDRICK LAMAR,\nPULITZER PRIZE WINNER',
 '“I KNOW REAL N*GGAS\nHAPPEN TO LOVE IT”',
 '—SHAWN CART

In [35]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [36]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [37]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',190)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
brennan,"[gentle music playing] [audience applauding] [audience cheering] All right, let me explain. Friend of mine… “Former friend,” we’ll call her. [audience laughter] …is an artist, right? And..."
burnham,"Exploring mental health decline over 2020, the constant challenges our world faces, and the struggles of life itself, Bo Burnham creates a wonderful masterpiece to explain each of these,..."
burr,"Recorded Live at the Royal Albert Hall, London, England [cheering and applause] [female announcer] Ladies and gentlemen, please welcome Bill Burr! All right, thank you. Thank you very mu..."
carlin,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New Jersey So you want to talk about it? Oh yeah. It all started in 1977. I mean, that’s when I started doing it regularly...."
dave,"Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades a..."
hasan,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you want to do the special? LA, Chicago, New Y..."
louis,"Recorded at the Madison Square Garden on August 14, 2021 * * * ♪♪ [“Like a Rolling Stone” by Bob Dylan playing] ♪♪ ♪ Once upon a time you dressed so fine ♪\n♪ Threw the bums a dime in yo..."
murphy,"After achieving fame with Saturday Night Live and Beverly Hills Cop, Eddie Murphy released a film version of one of his live stand-up performances. He mainly focuses on the topics of div..."
norm,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork chop. I don’t want to brag or anything like..."
pete,"So, Louis C.K. tried to get me fired from SNL my first year, and this is that story. So, it’s, like, 2014 or ’15, uh, and it’s the finale of SNL, and I-I was so shocked and happy that I ..."


In [38]:
data_df.transcript.loc['dave']

'Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades and a remarkably long t-shirt. [Morgan Freeman] This is Dave. He tells jokes for a living. Hopefully he makes people laugh, but these days it’s a high stakes game. Hmm, how did we get here, I wonder? I don’t mean that metaphorically, I’m really asking: how did Dave get here? I mean, what the fuck is this? But what do I know? I’m just Morgan Freeman. Anyway, I guess what I’m trying to say is\xa0if you say anything… you risk everything. But if that’s the way it’s gotta be—okay, fine, fuck it!  Ahahah, he’s back folks! Sticks & Stones streamed August 26, 2019 on Netflix. “TELL ME SOMETHING’ YOU MOTHAFUCKAS\nCAN’T TELL ME NOTHIN’ I’D RATHER DIE THAN\nTO LISTEN TO YOU…” —KENDRICK LAMAR,\nPULITZER PRIZE WINNER “I KNOW REAL N*GGAS\nHAPPEN TO LOVE IT” —SHAWN CARTER\n(BILLIONAIRE) ♪ I was dreaming When I wrote t

In [39]:
# Let's take a look at the transcript for Eddie Murphy
data_df.transcript.loc['murphy']

'After achieving fame with Saturday Night Live and Beverly Hills Cop, Eddie Murphy released a film version of one of his live stand-up performances. He mainly focuses on the topics of divorce and relations between the sexes, but also goes into some of the problems he’s encountered because of fame, including offended listeners and fans who continually greet him with his unprintable catch phrases. \xa0 – Show me that little dance you-all be doing.\n– I told y’all to stop running in here.\nYes, ma’am. I’m gonna smack one of you now, you hear? Them pants cost $3.98, baby, you hear? See that chocolate cake I bought? The chocolate cake that was on the counter? – Yeah. – Well, check Cousin Cecil’s pockets. He probably got it in there with the turkey leg and the sweet potato pie. Hey, little brother. Show me that little dance y’all be doing. Get down, Lester, you is talking! You move like you’re 21 . That dance ain’t new. lt ain’t nothing but the old shuffle-butt. Well, show me that move. Oh, 

In [40]:
data_df.transcript.loc['burnham']



In [41]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as (invalid) string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [42]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
brennan,all right let me explain friend of mine… “former friend” we’ll call her …is an artist right and the theme of our friendship is kind of feeling alone in the world right so i wrote thi...
burnham,exploring mental health decline over the constant challenges our world faces and the struggles of life itself bo burnham creates a wonderful masterpiece to explain each of these both fr...
burr,recorded live at the royal albert hall london england ladies and gentlemen please welcome bill burr all right thank you thank you very much thank you thank you thank you thank you how ...
carlin,recorded on january – state theatre new brunswick new jersey so you want to talk about it oh yeah it all started in i mean that’s when i started doing it regularly how many times have ...
dave,sticks stones is dave chappelle’s fifth netflix special\nin the promotional trailer morgan freeman narrates as chappelle swaggers across a salt flat in leather pants aviator shades and ...
hasan,what’s up davis what’s up i’m home i had to bring it back here netflix said “where do you want to do the special la chicago new york” i was like “nah son davis california” this has um...
louis,recorded at the madison square garden on august ♪♪ ♪♪ ♪ once upon a time you dressed so fine ♪\n♪ threw the bums a dime in your prime ♪\n♪ didn’t you ♪ ♪♪ ♪ people call say beware ...
murphy,after achieving fame with saturday night live and beverly hills cop eddie murphy released a film version of one of his live standup performances he mainly focuses on the topics of divorc...
norm,then people go “goddamn at least he’s not a hypocrite” “you’ve got to give it to him that’s the worst part of it” all right i ate a pork chop i don’t want to brag or anything like that b...
pete,so louis ck tried to get me fired from snl my first year and this is that story so it’s like or ’ uh and it’s the finale of snl and ii was so shocked and happy that i didn’t get fired a...


In [43]:
# Apply a second round of cleaning
def clean_text_round2(text):
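    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…♪–]', '', text)
    # Note: deleting '\n' outright fuses the words on either side of it (see 'specialin' in the output below)
    text = re.sub('\n', '', text)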
    return text

round2 = lambda x: clean_text_round2(x)

In [44]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
brennan,all right let me explain friend of mine former friend well call her is an artist right and the theme of our friendship is kind of feeling alone in the world right so i wrote this sho...
burnham,exploring mental health decline over the constant challenges our world faces and the struggles of life itself bo burnham creates a wonderful masterpiece to explain each of these both fr...
burr,recorded live at the royal albert hall london england ladies and gentlemen please welcome bill burr all right thank you thank you very much thank you thank you thank you thank you how ...
carlin,recorded on january state theatre new brunswick new jersey so you want to talk about it oh yeah it all started in i mean thats when i started doing it regularly how many times have yo...
dave,sticks stones is dave chappelles fifth netflix specialin the promotional trailer morgan freeman narrates as chappelle swaggers across a salt flat in leather pants aviator shades and a r...
hasan,whats up davis whats up im home i had to bring it back here netflix said where do you want to do the special la chicago new york i was like nah son davis california this has um this h...
louis,recorded at the madison square garden on august once upon a time you dressed so fine threw the bums a dime in your prime didnt you people call say beware doll youre bound ...
murphy,after achieving fame with saturday night live and beverly hills cop eddie murphy released a film version of one of his live standup performances he mainly focuses on the topics of divorc...
norm,then people go goddamn at least hes not a hypocrite youve got to give it to him thats the worst part of it all right i ate a pork chop i dont want to brag or anything like that but its i...
pete,so louis ck tried to get me fired from snl my first year and this is that story so its like or uh and its the finale of snl and ii was so shocked and happy that i didnt get fired and t...
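
In the `dave` row above, "special" and "in" have been fused into `specialin` because the newline between them was deleted rather than replaced. A small sketch of the difference, with hypothetical function names and a shortened sample string, could be:

```python
import re


def strip_newlines_fused(text):
    """Mirror of the notebook's step: delete newlines outright."""
    return re.sub(r"\n", "", text)


def strip_newlines_spaced(text):
    """Alternative: turn each newline into a space, then collapse runs of whitespace."""
    text = re.sub(r"\n", " ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "fifth netflix special\nin the promotional  trailer"
fused = strip_newlines_fused(sample)
spaced = strip_newlines_spaced(sample)

print(fused)   # fifth netflix specialin the promotional  trailer
print(spaced)  # fifth netflix special in the promotional trailer
```

The second variant also tidies the double spaces left behind by earlier substitutions, which keeps later tokenization predictable.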


## Organizing The Data

### Assignment:
1. Organize the data in two standard text formats:
   a) Corpus - a collection of texts, stored here in a pandas dataframe
   b) Document-Term Matrix - word counts in matrix format

### Corpus: Example

A corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [45]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
brennan,"[gentle music playing] [audience applauding] [audience cheering] All right, let me explain. Friend of mine… “Former friend,” we’ll call her. [audience laughter] …is an artist, right? And..."
burnham,"Exploring mental health decline over 2020, the constant challenges our world faces, and the struggles of life itself, Bo Burnham creates a wonderful masterpiece to explain each of these,..."
burr,"Recorded Live at the Royal Albert Hall, London, England [cheering and applause] [female announcer] Ladies and gentlemen, please welcome Bill Burr! All right, thank you. Thank you very mu..."
carlin,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New Jersey So you want to talk about it? Oh yeah. It all started in 1977. I mean, that’s when I started doing it regularly...."
dave,"Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades a..."
hasan,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you want to do the special? LA, Chicago, New Y..."
louis,"Recorded at the Madison Square Garden on August 14, 2021 * * * ♪♪ [“Like a Rolling Stone” by Bob Dylan playing] ♪♪ ♪ Once upon a time you dressed so fine ♪\n♪ Threw the bums a dime in yo..."
murphy,"After achieving fame with Saturday Night Live and Beverly Hills Cop, Eddie Murphy released a film version of one of his live stand-up performances. He mainly focuses on the topics of div..."
norm,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork chop. I don’t want to brag or anything like..."
pete,"So, Louis C.K. tried to get me fired from SNL my first year, and this is that story. So, it’s, like, 2014 or ’15, uh, and it’s the finale of SNL, and I-I was so shocked and happy that I ..."


In [46]:
# Let's add the comedians' full names as well
full_names = ['Neal Brennan','Bo Burnham', 'Bill Burr', 'George Carlin', 'Dave Chappelle', 'Hasan Minhaj', 'Louis C.K.',
              'Eddie Murphy', 'Norm Macdonald', 'Pete Davidson', 'Ricky Gervais', 'Chris Rock', 'Taylor Tomlinson','Trevor Noah','Vir Das']

data_df['full_name'] = full_names
data_df

Unnamed: 0,transcript,full_name
brennan,"[gentle music playing] [audience applauding] [audience cheering] All right, let me explain. Friend of mine… “Former friend,” we’ll call her. [audience laughter] …is an artist, right? And...",Neal Brennan
burnham,"Exploring mental health decline over 2020, the constant challenges our world faces, and the struggles of life itself, Bo Burnham creates a wonderful masterpiece to explain each of these,...",Bo Burnham
burr,"Recorded Live at the Royal Albert Hall, London, England [cheering and applause] [female announcer] Ladies and gentlemen, please welcome Bill Burr! All right, thank you. Thank you very mu...",Bill Burr
carlin,"Recorded on January 12–13, 1990, State Theatre, New Brunswick, New Jersey So you want to talk about it? Oh yeah. It all started in 1977. I mean, that’s when I started doing it regularly....",George Carlin
dave,"Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades a...",Dave Chappelle
hasan,"[theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you want to do the special? LA, Chicago, New Y...",Hasan Minhaj
louis,"Recorded at the Madison Square Garden on August 14, 2021 * * * ♪♪ [“Like a Rolling Stone” by Bob Dylan playing] ♪♪ ♪ Once upon a time you dressed so fine ♪\n♪ Threw the bums a dime in yo...",Louis C.K.
murphy,"After achieving fame with Saturday Night Live and Beverly Hills Cop, Eddie Murphy released a film version of one of his live stand-up performances. He mainly focuses on the topics of div...",Eddie Murphy
norm,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork chop. I don’t want to brag or anything like...",Norm Macdonald
pete,"So, Louis C.K. tried to get me fired from SNL my first year, and this is that story. So, it’s, like, 2014 or ’15, uh, and it’s the finale of SNL, and I-I was so shocked and happy that I ...",Pete Davidson
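
Assigning `full_names` as a plain list works here only because the list happens to follow the sorted order of the index. A sketch of a more order-independent approach is to look names up by key; the helper `names_for` and the constant `NAME_BY_KEY` are hypothetical names introduced for this illustration.

```python
# Short key -> full name, taken from the list in the cell above
NAME_BY_KEY = {
    'brennan': 'Neal Brennan', 'burnham': 'Bo Burnham', 'burr': 'Bill Burr',
    'carlin': 'George Carlin', 'dave': 'Dave Chappelle', 'hasan': 'Hasan Minhaj',
    'louis': 'Louis C.K.', 'murphy': 'Eddie Murphy', 'norm': 'Norm Macdonald',
    'pete': 'Pete Davidson', 'ricky': 'Ricky Gervais', 'rock': 'Chris Rock',
    'taylor': 'Taylor Tomlinson', 'trevor': 'Trevor Noah', 'vir': 'Vir Das',
}


def names_for(index_keys):
    """Return full names in the same order as index_keys; fail loudly on unknown keys."""
    missing = [k for k in index_keys if k not in NAME_BY_KEY]
    if missing:
        raise KeyError(f"no full name recorded for: {missing}")
    return [NAME_BY_KEY[k] for k in index_keys]


# With the notebook's dataframe this would be used as (not run here):
#     data_df['full_name'] = names_for(data_df.index)
shuffled = ['vir', 'dave', 'rock']
full = names_for(shuffled)
print(full)  # ['Vir Das', 'Dave Chappelle', 'Chris Rock']
```

If a comedian is later added or the dataframe is re-sorted, the lookup still attaches the right name to the right row.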


In [47]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix: Example

For many of the techniques we'll be using in future assignments, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's `CountVectorizer`, where every row represents a different document and every column represents a different word.

In addition, with `CountVectorizer`, we can remove stop words. Stop words are common words that add little meaning to a text, such as 'a', 'the', etc.
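
To make the row-per-document, column-per-word idea concrete, here is a hand-rolled sketch built on `collections.Counter`. It is not how `CountVectorizer` is implemented; the function name `document_term_matrix`, the four-word `STOP_WORDS` set, and the two toy documents are assumptions for illustration.

```python
from collections import Counter

# Hypothetical stop-word list for the demo; CountVectorizer ships its own English list
STOP_WORDS = {"a", "the", "is", "and"}


def document_term_matrix(docs):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = [
        Counter(w for w in doc.split() if w not in STOP_WORDS)
        for doc in docs
    ]
    vocabulary = sorted(set().union(*counts))   # one column per distinct word
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["the joke is a good joke", "a good crowd and a bad joke"]
vocab, rows = document_term_matrix(docs)

print(vocab)  # ['bad', 'crowd', 'good', 'joke']
print(rows)   # [[0, 0, 1, 2], [1, 1, 1, 1]]
```

Each inner list is one document's row, and a zero means the word in that column never appears in that document.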

In [48]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm



Unnamed: 0,aaaah,aah,aall,aand,aarrives,abandon,abbreviation,abducted,abducting,abdullahs,...,zoo,zoom,zoomed,zoomers,zucchinis,zucker,zuckerberg,zuckerfuck,zuckerfucker,zuckermother
brennan,0,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
burnham,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,0,0
burr,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
carlin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,1,2,0,2,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
louis,0,0,0,0,0,0,0,0,0,0,...,9,0,0,0,0,0,0,0,0,0
murphy,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
norm,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
pete,0,0,1,3,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [49]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [50]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)

## Additional Assignments:

1. Can you add an additional regular expression to the clean_text_round2 function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?

### Answers:

1. On close inspection of the transcripts after the first round of cleaning, we can see that the music symbol (♪) and the en dash (–) still appear in some comedians' transcripts. Neither is needed for our processing, so I've added them to the regular expression passed to `re.sub` in `clean_text_round2`. After the second round of cleaning they are gone, and the transcripts are now more workable.

2. `CountVectorizer` has many parameters, such as `encoding`, `strip_accents`, `stop_words`, `token_pattern`, and `ngram_range`. `ngram_range` is a `(min_n, max_n)` tuple that sets the shortest and longest word n-grams to extract; `min_df` and `max_df` drop terms that appear in too few or too many documents, respectively. More can be found in the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
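
As a rough illustration of what these three parameters control, the sketch below re-creates the ideas in plain Python. It is not scikit-learn's code: the function names `ngrams` and `filter_by_document_frequency` are hypothetical, the toy documents are made up, and only absolute document counts are handled (scikit-learn also accepts proportions as floats).

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 1)):
    """Yield every n-gram with min_n <= n <= max_n as a space-joined string."""
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def filter_by_document_frequency(docs, ngram_range=(1, 1), min_df=1, max_df=None):
    """Keep terms found in at least min_df and at most max_df documents (absolute counts)."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a term counts once per document, however often it repeats
        doc_freq.update(set(ngrams(doc.split(), ngram_range)))
    upper = len(docs) if max_df is None else max_df
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= upper)


docs = ["good joke", "bad joke", "good crowd"]
bigrams = sorted(set(ngrams("a very good joke".split(), (2, 2))))
common = filter_by_document_frequency(docs, min_df=2)
rare = filter_by_document_frequency(docs, max_df=1)

print(bigrams)  # ['a very', 'good joke', 'very good']
print(common)   # ['good', 'joke']
print(rare)     # ['bad', 'crowd']
```

Raising `min_df` trims one-off words such as stretched-out exclamations, while lowering `max_df` trims words that nearly every comedian uses.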