In [65]:
import pandas as pd
import numpy as np

import os
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from nlp_acquire import get_news_articles

## Natural Language Processing

### <font color=red>What is Natural Language Processing?</font>

Natural Language Processing (NLP) lets you use Python libraries like NLTK (Natural Language Toolkit) and spaCy to turn natural language text into a structure that machines can use. In other words, you can manipulate natural language so that it becomes useful in machine learning. Machines can't read words, but they can work with numbers, so we have to process the text in a way that retains as much of the original meaning as possible while representing it with numbers.
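
As a very small illustration of "representing the text with numbers", the sketch below counts how often each word appears in a made-up sentence. Word counts are only one simple option, and the sample sentence is an assumption for illustration, not part of the article data.

```python
from collections import Counter

# Made-up sample text for illustration.
text = "the vaccine is effective and the vaccine is safe"

# Count how often each word appears; these counts are one simple
# numeric representation of the text.
counts = Counter(text.split())
print(counts["vaccine"])  # 2
print(counts["safe"])     # 1
```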

___

### <font color=orange>So What?</font>

We will establish a workflow to preprocess our text data and prepare it for further use in exploration and modeling. This preprocessing is known as text normalization. **Normalization** is when you perform a series of tasks like making all text lowercase, removing punctuation, expanding contractions, removing anything that's not an ASCII character, etc.

1. Lowercase all characters in our string.
2. Remove special characters from our string.
3. Remove accented characters from our string.
4. Tokenize our string into discrete units. (words, punctuation)
5. Stem and Lemmatize the words in our string. (group by base word)
6. Remove stopwords from our string. (remove noise)
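
Steps 1 through 4 can be sketched on a single plain string using only the standard library. The helper name `normalize_text` and the sample sentence are made up for illustration. The sketch also assumes the `NFKD` normalization form, which keeps the base letter of an accented character; steps 5 and 6 need NLTK and are not covered here.

```python
import re
import unicodedata


def normalize_text(text):
    """Hypothetical helper: lowercase, strip special characters and accents, then tokenize."""
    # 1. Lowercase all characters.
    text = text.lower()
    # 2. Remove anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # 3. Decompose accented characters, then drop the non-ASCII pieces.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # 4. Tokenize on whitespace.
    return text.split()


tokens = normalize_text("The café's COVID-19 update!")
print(tokens)  # ['the', 'cafes', 'covid19', 'update']
```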

___

### <font color=green>Now What?</font>

Let's create functions to help us normalize our text data.

In [66]:
# Read our text data into a pandas DataFrame.

df = get_news_articles()
df.head()

Unnamed: 0,topic,title,author,content
0,business,"Father said 'if you want to blow money up, fine' as I put $250M on vaccine: Adar",Pragya Swastik,"Serum Institute of India's (SII) CEO Adar Poonawalla revealed that his father Cyrus S Poonawalla told him, ""Look, it's your money. If you want to blow it up, fine,"" as he put $250 million in ramping up COVID-19 vaccine manufacturing capacity. Adar told The Washington Post, ""I decided to go all o..."
1,business,China suspends fish imports from Indian firm after coronavirus detected,Krishna Veera Vanamali,"China has suspended imports from India's Basu International for one week after detecting the novel coronavirus on three samples taken from the outer packaging of frozen cuttlefish. Imports will resume automatically after one week, Chinese customs said. Companies from Brazil, Russia, Ecuador and ..."
2,business,Scientist behind '90% effective' COVID-19 vaccine says it can end the pandemic,Krishna Veera Vanamali,"German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he's confident his product can end the pandemic and ""bash the virus over the head."" The vaccine is 90% effective based on initial data from a late-stage trial. ""I believe that even protect..."
3,business,"Russian vaccine arrives in India, video of it being unloaded from truck surfaces",Krishna Veera Vanamali,"Russia's coronavirus vaccine, Sputnik V, has arrived in India after Dr Reddy's Laboratories got approval to conduct late-stage trials of the vaccine in the country. A video has surfaced on social media, which shows containers with logos of Dr Reddy's and Sputnik V being unloaded from a small tru..."
4,business,"Special Indian version of PUBG Mobile to be launched, announces developer",Krishna Veera Vanamali,"South Korea's PUBG Corporation on Thursday announced that it is preparing to launch 'PUBG Mobile India', which has been specially created for India. Additionally, the company said it plans to make investments worth $100 million in India. It will also create an Indian subsidiary which will hire ..."


#### Normalize Text

##### Lowercase Text
```python
# Make all characters in string lowercase.
df.col.str.lower()
```

In [73]:
# Note that I have not reassigned the result yet; this is just a look.

df.content.str.lower()[:2]

0    serum institute of india's (sii) ceo adar poonawalla revealed that his father cyrus s poonawalla told him, "look, it's your money. if you want to blow it up, fine," as he put $250 million in ramping up covid-19 vaccine manufacturing capacity. adar told the washington post, "i decided to go all o...
1    china has suspended imports from india's basu international for one week after detecting the novel coronavirus on three samples taken from the outer packaging of frozen cuttlefish. imports will resume automatically after one week, chinese customs said. companies from brazil, russia, ecuador and ...
Name: content, dtype: object

##### Remove Special Characters

I found [this article](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/) very helpful when using Regex with Pandas!

```python
# Remove characters that are not word characters (letters, digits, underscores) or whitespace.
df.col.str.replace(r"[^\w\s]", '', regex=True)

# Remove characters that are not lowercase letters, digits, single quotes, or whitespace.
# Lowercase the text first, or uppercase letters will be removed too.
df.col.str.replace(r"[^a-z0-9'\s]", '', regex=True)
```

In [74]:
# Test our regex on a string first; we'll use `.replace()` when we use it on our Series.

string = 'If we want to know if our regex is replacing [these] or "" or ! - these.'

string = re.sub(r'[^\w\s]', '', string)
string

'If we want to know if our regex is replacing these or  or   these'

In [75]:
# Again, this is a look because it has not been reassigned or changed in place.

df.content.replace(r'[^\w\s]', '', regex=True)[:2]

0    Serum Institute of Indias SII CEO Adar Poonawalla revealed that his father Cyrus S Poonawalla told him Look its your money If you want to blow it up fine as he put 250 million in ramping up COVID19 vaccine manufacturing capacity Adar told The Washington Post I decided to go all out His father fo...
1    China has suspended imports from Indias Basu International for one week after detecting the novel coronavirus on three samples taken from the outer packaging of frozen cuttlefish Imports will resume automatically after one week Chinese customs said Companies from Brazil Russia Ecuador and Indone...
Name: content, dtype: object
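
To see how the two patterns differ, here is a minimal check on a made-up string (the sample text and variable names are assumptions for illustration). Note that `\w` keeps digits and underscores as well as letters, while the second pattern keeps apostrophes but only recognizes lowercase letters.

```python
import re

# Made-up sample string for illustration.
sample = "it's snake_case, 90% effective!"

# \w keeps letters, digits, and underscores; the apostrophe is removed.
word_chars = re.sub(r"[^\w\s]", "", sample)
print(word_chars)   # its snake_case 90 effective

# This pattern keeps apostrophes but drops underscores; it only knows
# lowercase letters, so apply it after lowercasing.
keep_quotes = re.sub(r"[^a-z0-9'\s]", "", sample)
print(keep_quotes)  # it's snakecase 90 effective
```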

___

##### Remove Accented Characters

We will remove accented characters by chaining together the following methods:

```python
# Remove inconsistencies in unicode character encoding.
# `form` is one of 'NFC', 'NFKC', 'NFD', or 'NFKD'.
df.col.str.normalize(form)
```

```python
# Convert string to ASCII character set and drop non-ASCII characters.
df.col.str.encode('ascii', 'ignore')
```

```python
# Convert the bytes back into a string object.
df.col.str.decode('utf-8', 'ignore')
```

See the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.normalize.html) for more on using `unicodedata.normalize()` as a str method on a Pandas Series.

In [77]:
# Again, this is a look because it has not been reassigned or changed in place.

df.content.str.normalize('NFKC').str.encode('ascii', 'ignore').str.decode('utf-8', 'ignore')[:2]

0    Serum Institute of India's (SII) CEO Adar Poonawalla revealed that his father Cyrus S Poonawalla told him, "Look, it's your money. If you want to blow it up, fine," as he put $250 million in ramping up COVID-19 vaccine manufacturing capacity. Adar told The Washington Post, "I decided to go all o...
1    China has suspended imports from India's Basu International for one week after detecting the novel coronavirus on three samples taken from the outer packaging of frozen cuttlefish. Imports will resume automatically after one week, Chinese customs said. Companies from Brazil, Russia, Ecuador and ...
Name: content, dtype: object
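
The choice of normalization form matters once non-ASCII characters are dropped. The sketch below uses plain `unicodedata` on a name from the article data; the helper `to_ascii` is hypothetical. With `NFKC`, accented letters stay composed and are dropped whole; with `NFKD`, they are split into a base letter plus a combining mark, so only the mark is lost.

```python
import unicodedata

name = "Uğur Şahin"


def to_ascii(text, form):
    """Hypothetical helper: normalize with the given form, then drop non-ASCII characters."""
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("ascii", "ignore").decode("utf-8", "ignore")


# NFKC keeps accented letters composed, so they are dropped whole.
nfkc_result = to_ascii(name, "NFKC")
print(nfkc_result)  # Uur ahin

# NFKD splits each accented letter into a base letter plus a combining mark,
# so only the mark is dropped.
nfkd_result = to_ascii(name, "NFKD")
print(nfkd_result)  # Ugur Sahin
```

If keeping the base letters is preferable for your analysis, swapping `'NFKC'` for `'NFKD'` in the chain above is one option to try.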

##### Basic Clean Function

In [10]:
def basic_clean(df, col):
    '''
    This function takes in a df and a string for a column and
    returns the df with a new column named 'basic_clean' with the
    passed column text normalized.
    '''
    df['basic_clean'] = df[col].str.lower()\
                    .replace(r'[^\w\s]', '', regex=True)\
                    .str.normalize('NFKC')\
                    .str.encode('ascii', 'ignore')\
                    .str.decode('utf-8', 'ignore')
    return df

In [11]:
df = basic_clean(df, 'content')
df.head(2)

Unnamed: 0,topic,title,author,content,basic_clean
0,business,Scientist behind '90% effective' COVID-19 vacc...,Krishna Veera Vanamali,"German scientist Uğur Şahin, the CEO of BioNTe...",german scientist uur ahin the ceo of biontech ...
1,business,"Father said 'if you want to blow money up, fin...",Pragya Swastik,Serum Institute of India's (SII) CEO Adar Poon...,serum institute of indias sii ceo adar poonawa...


___

#### Tokenize Text

**Tokenization** is when you split larger strings of text into smaller pieces, or tokens, by setting a boundary. You might chunk a sentence into words using a space as a boundary, or a paragraph into sentences using punctuation as a boundary.

##### Using `.split()`

Tokenizing using `.split()` is simple but also limited to one delimiter.

In [19]:
text = 'Knowledge is the compound interest of curiosity. - James Clear'

In [20]:
# Split on default whitespace.

text.split()

['Knowledge',
 'is',
 'the',
 'compound',
 'interest',
 'of',
 'curiosity.',
 '-',
 'James',
 'Clear']

In [21]:
text = "There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."

In [22]:
# Split on periods.

text.split('.')

["There's the kind of person who is always the victim in any story they tell",
 ' Always on the receiving end of some injustice',
 " There's the kind of person who is always the kind of hero of every story they tell",
 " There's the smart person; they delivered the clever put down there",
 '']

##### Using Regex

Python Regex Cheatsheet [here](https://www.debuggex.com/cheatsheet/regex/python).

In [23]:
# Split your text using a regex pattern in .findall()

pattern = r'[\w]+'
text = 'Knowledge is the compound interest of curiosity. - James Clear'

tokens = re.findall(pattern, text)
tokens

['Knowledge',
 'is',
 'the',
 'compound',
 'interest',
 'of',
 'curiosity',
 'James',
 'Clear']

In [24]:
# Use `.compile()` with .split(text) to split your text on more than one delimiter

pattern = re.compile(r'[.;!?]')
text = "There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."

pattern.split(text)

["There's the kind of person who is always the victim in any story they tell",
 ' Always on the receiving end of some injustice',
 " There's the kind of person who is always the kind of hero of every story they tell",
 " There's the smart person",
 ' they delivered the clever put down there',
 '']
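
Splitting on punctuation leaves leading spaces and a trailing empty string, as the output above shows. A minimal sketch for tidying the pieces, using a shortened made-up text:

```python
import re

# Shortened, made-up text for illustration.
text = "There's the smart person; they delivered the clever put down there. Always!"

# Split on sentence-ending punctuation, then strip whitespace and
# discard the empty strings left behind by trailing delimiters.
pieces = re.split(r"[.;!?]", text)
sentences = [piece.strip() for piece in pieces if piece.strip()]
print(sentences)
# ["There's the smart person", 'they delivered the clever put down there', 'Always']
```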

##### Using NLTK Tokenization

```python
# Create the tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()

# Apply the tokenizer to a Series or column in a df
df['tokenized'] = df.col.apply(tokenizer.tokenize)
```
**<font color=red>I am not going to join the words in my lists back into sentences because this form will be helpful in my next functions.</font>**
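
For a quick feel of how the NLTK tokenizer differs from `.split()`, here is a small sketch on a single made-up sentence rather than a DataFrame column:

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
text = "Is 9.5 or 525,600 my favorite number?"

# .split() leaves the question mark attached to the last word.
split_tokens = text.split()
print(split_tokens)

# Toktok separates the final punctuation mark into its own token
# while keeping numbers such as 9.5 and 525,600 intact.
tokens = tokenizer.tokenize(text)
print(tokens)
```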

In [61]:
# Create the tokenizer

tokenizer = nltk.tokenize.ToktokTokenizer()

In [62]:
# Apply the tokenizer to a Series or column in a df; now we have a list of tokens.

df.basic_clean.apply(tokenizer.tokenize)[:2]


In [63]:
# Here we apply nltk's tokenizer to each row, or text, in our basic_clean Series

df['clean_tokens'] = df.basic_clean.apply(tokenizer.tokenize)


In [64]:
df.clean_tokens


In [28]:
df[['basic_clean', 'clean_tokens']].head(2)

Unnamed: 0,basic_clean,clean_tokens
0,german scientist uur ahin the ceo of biontech ...,"[german, scientist, uur, ahin, the, ceo, of, b..."
1,serum institute of indias sii ceo adar poonawa...,"[serum, institute, of, indias, sii, ceo, adar,..."


##### Tokenize Function

In [29]:
def tokenize(df, col):
    '''
    This function takes in a df and a string for a column and
    returns a df with a new column named 'clean_tokens' with the
    passed column text tokenized and in a list.
    '''
    tokenizer = nltk.tokenize.ToktokTokenizer()
    df['clean_tokens'] = df[col].apply(tokenizer.tokenize)
    return df

In [30]:
df = tokenize(df, 'basic_clean')
df[['basic_clean', 'clean_tokens']].head(2)

Unnamed: 0,basic_clean,clean_tokens
0,german scientist uur ahin the ceo of biontech ...,"[german, scientist, uur, ahin, the, ceo, of, b..."
1,serum institute of indias sii ceo adar poonawa...,"[serum, institute, of, indias, sii, ceo, adar,..."


___

#### Stem Tokens

**Stemming** is when you reduce related words in your text to their common stem. When you are searching for a particular word in your text, it can be easier to search for the common stem rather than every form of the word. Stemmers aren't that sophisticated in the way they chop off word endings at their common stems; spaCy, another Python NLP library, doesn't even include a stemmer. spaCy only offers the more sophisticated lemmatizer, which we will look at in NLTK next.

##### Using NLTK PorterStemmer

```python
# Create the Stemmer.
ps = nltk.porter.PorterStemmer()
```

```python
# Apply the stemmer to each word in column; now we have a Series of lists with stemmed tokens.
stems = df.tokenized_col.apply(lambda row: [ps.stem(word) for word in row])
```

```python
# Join our lists of words into strings again and assign to our dataframe.
df['stemmed'] = stems.str.join(' ')
```
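
Before applying the stemmer to a whole column, it can help to try it on a few words. This sketch uses words that appear in the article data; the word list itself is chosen for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Related word forms collapse to one stem, which may not be a real word.
words = ["vaccine", "vaccines", "institute", "coronavirus"]
stems = [ps.stem(word) for word in words]
print(stems)  # ['vaccin', 'vaccin', 'institut', 'coronaviru']
```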

In [31]:
# Create porter stemmer

ps = nltk.porter.PorterStemmer()

In [32]:
# Apply the stemmer to each word in column; now we have a Series of lists with stemmed tokens.

stems = df.clean_tokens.apply(lambda row: [ps.stem(word) for word in row])
stems.head(2)

0    [german, scientist, uur, ahin, the, ceo, of, b...
1    [serum, institut, of, india, sii, ceo, adar, p...
Name: clean_tokens, dtype: object

In [33]:
# Join our lists of words into strings again and assign to our dataframe.

df['stemmed'] = stems.str.join(' ')

In [34]:
# We can do a quick visual check that our stemmer is working.

pd.options.display.max_colwidth = 300
df[['basic_clean', 'stemmed']].head(2)

Unnamed: 0,basic_clean,stemmed
0,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from...,german scientist uur ahin the ceo of biontech which codevelop a coronaviru vaccin with pfizer said he confid hi product can end the pandem and bash the viru over the head the vaccin is 90 effect base on initi data from a latestag trial i believ that even protect onli from symptomat infect will h...
1,serum institute of indias sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look its your money if you want to blow it up fine as he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father fo...,serum institut of india sii ceo adar poonawalla reveal that hi father cyru s poonawalla told him look it your money if you want to blow it up fine as he put 250 million in ramp up covid19 vaccin manufactur capac adar told the washington post i decid to go all out hi father found sii in 1966


##### Stem Function

In [35]:
def stem(df, col):
    '''
    This function takes in a df and a string for a column name and
    returns a df with a new column named 'stemmed'.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Apply the stemmer to each word in column; now we have a Series of lists with stemmed tokens.
    stems = df[col].apply(lambda row: [ps.stem(word) for word in row])
    
    # Join our lists of words into strings again and assign to our dataframe.
    df['stemmed'] = stems.str.join(' ')
    
    return df

In [36]:
df = stem(df, 'clean_tokens')
df[['basic_clean', 'stemmed']].tail(2)

Unnamed: 0,basic_clean,stemmed
95,actor matthew perry who played the role of chandler bing in friends on thursday shared an update on the upcoming reunion special he tweeted friends reunion being rescheduled for the beginning of march looks like we have a busy year coming up the unscripted reunion has been delayed multiple times...,actor matthew perri who play the role of chandler bing in friend on thursday share an updat on the upcom reunion special he tweet friend reunion be reschedul for the begin of march look like we have a busi year come up the unscript reunion ha been delay multipl time due to the ongo coronaviru pa...
96,recalling her initial days in the music industry singer sunidhi chauhan said when i started out i was not needed at that time she mentioned that while the industry did not have space for new artists she managed to make her mark they welcomed me with open arms and here i am today thanks to those ...,recal her initi day in the music industri singer sunidhi chauhan said when i start out i wa not need at that time she mention that while the industri did not have space for new artist she manag to make her mark they welcom me with open arm and here i am today thank to those time she ad


___

#### Lemmatize Tokens

**Lemmatization** is when you reduce related words in your text to their lemma, or base word, by applying a morphological analysis to your text. Like stemming, this is done to reduce the number of forms you have of the same word so they can be analyzed as a single item. However, with stemming you might end up with tokens that are no longer actual words after they have been chopped off at their base, while with lemmatization you will end up with real words. A drawback of lemmatization is that it takes longer than stemming; you can try both to see which gives you better results as you analyze a given text.

##### Using NLTK WordNetLemmatizer

```python
# Download the WordNet corpus used by the lemmatizer (only needed once).
nltk.download('wordnet')
```

```python
# Create the Lemmatizer.
wnl = nltk.stem.WordNetLemmatizer()
```

```python
# Lemmatize each token from our clean_tokens Series of lists. 
lemmas = df.tokenized_col.apply(lambda row: [wnl.lemmatize(word) for word in row])
```

```python
# Join our lists of words into strings again and assign to our dataframe.
df['lemmatized'] = lemmas.str.join(' ')
```

In [37]:
# Create the Lemmatizer

wnl = nltk.stem.WordNetLemmatizer()

In [38]:
# Lemmatize each token from our clean_tokens Series of lists

lemmas = df.clean_tokens.apply(lambda row: [wnl.lemmatize(word) for word in row])
lemmas.head(2)

0    [german, scientist, uur, ahin, the, ceo, of, biontech, which, codeveloped, a, coronavirus, vaccine, with, pfizer, said, he, confident, his, product, can, end, the, pandemic, and, bash, the, virus, over, the, head, the, vaccine, is, 90, effective, based, on, initial, data, from, a, latestage, tri...
1    [serum, institute, of, india, sii, ceo, adar, poonawalla, revealed, that, his, father, cyrus, s, poonawalla, told, him, look, it, your, money, if, you, want, to, blow, it, up, fine, a, he, put, 250, million, in, ramping, up, covid19, vaccine, manufacturing, capacity, adar, told, the, washington,...
Name: clean_tokens, dtype: object

In [39]:
# Join our lists of words into strings again and assign to our dataframe.

df['lemmatized'] = lemmas.str.join(' ')

In [40]:
df[['basic_clean', 'lemmatized']].head(2)

Unnamed: 0,basic_clean,lemmatized
0,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from...,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said he confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from ...
1,serum institute of indias sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look its your money if you want to blow it up fine as he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father fo...,serum institute of india sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look it your money if you want to blow it up fine a he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father found...


In [42]:
# Just a quick check that these are actually changing something.

df.stemmed.iloc[90] == df.lemmatized.iloc[90]

False
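
The comparison above only tells us that the two strings differ somewhere. A minimal sketch for seeing which tokens differ, using the opening words of one article's stemmed and lemmatized output from earlier (the variable names are assumptions for illustration):

```python
stemmed = "serum institut of india sii ceo adar poonawalla reveal that hi father"
lemmatized = "serum institute of india sii ceo adar poonawalla revealed that his father"

# Pair tokens by position and keep only the pairs that differ.
# This assumes both strings have the same number of tokens, which holds here
# because stemming and lemmatizing each map one token to one token.
differences = [
    (stem, lemma)
    for stem, lemma in zip(stemmed.split(), lemmatized.split())
    if stem != lemma
]
print(differences)
# [('institut', 'institute'), ('reveal', 'revealed'), ('hi', 'his')]
```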

**<font color=green>Now What?</font>**

##### Lemmatize Function

In [43]:
def lemmatize(df, col):
    '''
    This function takes in a df and a string for column name and
    returns the original df with a new column called 'lemmatized'.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Lemmatize each token from our clean_tokens Series of lists.
    lemmas = df[col].apply(lambda row: [wnl.lemmatize(word) for word in row])
    
    # Join the cleaned and lemmatized tokens back into strings and assign to df.
    df['lemmatized'] = lemmas.str.join(' ')
    return df

In [44]:
df = lemmatize(df, 'clean_tokens')
df[['basic_clean', 'lemmatized']].head(2)

Unnamed: 0,basic_clean,lemmatized
0,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from...,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said he confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from ...
1,serum institute of indias sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look its your money if you want to blow it up fine as he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father fo...,serum institute of india sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look it your money if you want to blow it up fine a he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father found...


#### Remove Stopwords

**Stopwords** are words that are filtered out while preparing your text for analysis and modeling. They offer little to the meaning of your text and are basically just adding noise to your analysis. Or, as Ryan Orsinger would say, "Stopwords aren't the real story of the document." Words such as 'the', 'and', 'a', and the like can be removed so you can better focus on the good stuff.

##### Using NLTK Stopwords

```python
# Download the stopwords corpus (only needed once) and import it.
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
```

```python
# Create list of stopwords.
stopword_list = stopwords.words('english')
```

```python
# Create list of tokens or words.
words = df.col.str.split()
```

```python
# Create a Series of words lists with stopwords removed.
filtered_words = words.apply(lambda row: [word for word in row if word not in stopword_list])
```

```python
# Join the tokens or words in the list back into strings and assign to df.
df['clean_words'] = filtered_words.str.join(' ')
```

In [45]:
# Create the list of stopwords.

stopword_list = stopwords.words('english')

In [46]:
# Split words in lemmatized column

words = df.lemmatized.str.split()

In [47]:
# Check each word in each row of the column against stopword_list and return only those that are not in list

filtered_words = words.apply(lambda row: [word for word in row if word not in stopword_list])
df['clean_lemmatized'] = filtered_words.str.join(' ')

In [48]:
df[['basic_clean', 'clean_lemmatized']].head(2)

Unnamed: 0,basic_clean,clean_lemmatized
0,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from...,german scientist uur ahin ceo biontech codeveloped coronavirus vaccine pfizer said confident product end pandemic bash virus head vaccine 90 effective based initial data latestage trial believe even protection symptomatic infection dramatic effect ahin said
1,serum institute of indias sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look its your money if you want to blow it up fine as he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father fo...,serum institute india sii ceo adar poonawalla revealed father cyrus poonawalla told look money want blow fine put 250 million ramping covid19 vaccine manufacturing capacity adar told washington post decided go father founded sii 1966


In [49]:
# Do the same as above but for stemmed words

words = df.stemmed.str.split()

filtered_words = words.apply(lambda row: [word for word in row if word not in stopword_list])
df['clean_stemmed'] = filtered_words.str.join(' ')

In [50]:
df[['basic_clean', 'clean_stemmed']].head(2)

Unnamed: 0,basic_clean,clean_stemmed
0,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from...,german scientist uur ahin ceo biontech codevelop coronaviru vaccin pfizer said confid hi product end pandem bash viru head vaccin 90 effect base initi data latestag trial believ even protect onli symptomat infect dramat effect ahin said
1,serum institute of indias sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look its your money if you want to blow it up fine as he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father fo...,serum institut india sii ceo adar poonawalla reveal hi father cyru poonawalla told look money want blow fine put 250 million ramp covid19 vaccin manufactur capac adar told washington post decid go hi father found sii 1966


##### Stopwords Function

In [51]:
def remove_stopwords(df, col, extra_words=None):
    '''
    This function takes in a df and a string for a column name, plus an optional
    list of extra_words to treat as stopwords, and returns the df with a new
    column named 'clean_' + col with stopwords removed.
    '''
    # Create stopword_list
    stopword_list = stopwords.words('english')
    
    # Add optional additional stopwords
    stopword_list.extend(extra_words or [])
    
    # Split words in column
    words = df[col].str.split()
    
    # Check each word in each row of the column against stopword_list and return only those that are not in list
    filtered_words = words.apply(lambda row: [word for word in row if word not in stopword_list])
    
    # Create new column of words that have stopwords removed
    df['clean_' + col] = filtered_words.str.join(' ')
    
    return df

In [52]:
df = remove_stopwords(df, 'lemmatized')
df[['basic_clean', 'clean_lemmatized']].head(2)

Unnamed: 0,basic_clean,clean_lemmatized
0,german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from...,german scientist uur ahin ceo biontech codeveloped coronavirus vaccine pfizer said confident product end pandemic bash virus head vaccine 90 effective based initial data latestage trial believe even protection symptomatic infection dramatic effect ahin said
1,serum institute of indias sii ceo adar poonawalla revealed that his father cyrus s poonawalla told him look its your money if you want to blow it up fine as he put 250 million in ramping up covid19 vaccine manufacturing capacity adar told the washington post i decided to go all out his father fo...,serum institute india sii ceo adar poonawalla revealed father cyrus poonawalla told look money want blow fine put 250 million ramping covid19 vaccine manufacturing capacity adar told washington post decided go father founded sii 1966
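
The same filtering idea can be tried on a plain string. In this sketch the tiny stopword set is hand-written to stand in for NLTK's English list, and both the extra stopword and the sample sentence are made up for illustration. It uses a set rather than a list because membership checks on a set are faster.

```python
# A tiny hand-written stopword set standing in for NLTK's English list.
stopword_set = {"the", "of", "a", "as", "he", "it", "in", "that", "his"}

# Hypothetical extra stopwords specific to this corpus.
extra_words = {"said"}
stopword_set |= extra_words

# Made-up sample text for illustration.
text = "the ceo of biontech said that the vaccine is effective"

# Keep only the words that are not in the stopword set.
filtered = [word for word in text.split() if word not in stopword_set]
clean_text = " ".join(filtered)
print(clean_text)  # ceo biontech vaccine is effective
```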


___


#### Prep Article Data Function

In [54]:
df = get_news_articles()

In [55]:
def prep_article_data(df):
    '''
    This function takes in the news articles df and
    returns the df with original columns plus stemmed
    and lemmatized versions of the cleaned content without stopwords.
    '''
    # Do basic clean on article content.
    df = basic_clean(df, 'content')
    
    # Tokenize clean article content.
    df = tokenize(df, 'basic_clean')
    
    # Stem cleaned and tokenized article content.
    df = stem(df, 'clean_tokens')
    
    # Remove stopwords from stemmed article content.
    df = remove_stopwords(df, 'stemmed')
    
    # Lemmatize cleaned and tokenized article content.
    df = lemmatize(df, 'clean_tokens')
    
    # Remove stopwords from Lemmatized article content.
    df = remove_stopwords(df, 'lemmatized')
    
    return df[['topic', 'title', 'author', 'content', 'clean_stemmed', 'clean_lemmatized']]

In [56]:
df = prep_article_data(df)
df.head(2)

Unnamed: 0,topic,title,author,content,clean_stemmed,clean_lemmatized
0,business,Scientist behind '90% effective' COVID-19 vaccine says it can end the pandemic,Krishna Veera Vanamali,"German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he's confident his product can end the pandemic and ""bash the virus over the head."" The vaccine is 90% effective based on initial data from a late-stage trial. ""I believe that even protect...",german scientist uur ahin ceo biontech codevelop coronaviru vaccin pfizer said confid hi product end pandem bash viru head vaccin 90 effect base initi data latestag trial believ even protect onli symptomat infect dramat effect ahin said,german scientist uur ahin ceo biontech codeveloped coronavirus vaccine pfizer said confident product end pandemic bash virus head vaccine 90 effective based initial data latestage trial believe even protection symptomatic infection dramatic effect ahin said
1,business,"Father said 'if you want to blow money up, fine' as I put $250M on vaccine: Adar",Pragya Swastik,"Serum Institute of India's (SII) CEO Adar Poonawalla revealed that his father Cyrus S Poonawalla told him, ""Look, it's your money. If you want to blow it up, fine,"" as he put $250 million in ramping up COVID-19 vaccine manufacturing capacity. Adar told The Washington Post, ""I decided to go all o...",serum institut india sii ceo adar poonawalla reveal hi father cyru poonawalla told look money want blow fine put 250 million ramp covid19 vaccin manufactur capac adar told washington post decid go hi father found sii 1966,serum institute india sii ceo adar poonawalla revealed father cyrus poonawalla told look money want blow fine put 250 million ramping covid19 vaccine manufacturing capacity adar told washington post decided go father founded sii 1966
