In [1]:
import pandas as pd
import numpy as np

import os
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from nlp_acquire import get_news_articles

## Natural Language Processing

### <font color=red>What is Natural Language Processing?</font>

Natural Language Processing (NLP) lets you use Python libraries like NLTK (Natural Language Toolkit) and spaCy to build a machine-usable structure out of natural language text. In other words, you manipulate natural language so that it becomes useful for machine learning. Machines can't read words, but they can work with numbers, so we have to process the text in a way that retains as much of the original meaning as possible while representing it with numbers.

___

### <font color=orange>So What?</font>

We will establish a workflow to preprocess our text data and prepare it for further use in exploration and modeling. This preprocessing is known as text normalization. **Normalization** means performing a series of tasks such as making all text lowercase, removing punctuation, expanding contractions, and removing anything that's not an ASCII character. Our workflow has the following steps:

1. Lowercase all characters in our string.
2. Remove accented & non-ASCII characters from our string.
3. Remove special characters from our string.
4. Tokenize our string into discrete units. (words, punctuation)
5. Stem and Lemmatize the words in our string. (group by base word)
6. Remove stopwords from our string. (remove noise)
7. Store original and cleaned text for future use.

___

### <font color=green>Now What?</font>

Let's create functions to help us preprocess our text data. According to the curriculum, we should build our functions to take in a string. We can use the `.apply()` method when it's time to use our functions on a Series in a DataFrame.
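As a quick illustration of that pattern, here is a minimal sketch of applying string functions to a Series with `.apply()`, including how extra keyword arguments are forwarded to the function. The `strip_chars` helper and the `toy` Series are hypothetical and exist only for this example; they are not part of the article data.

```python
import pandas as pd

def strip_chars(string, chars=''):
    '''Hypothetical helper: return the string without any character found in chars.'''
    return ''.join(ch for ch in string if ch not in chars)

# A toy Series standing in for a text column.
toy = pd.Series(['Hello, world!', 'NLP is fun.'])

# Each function receives one string at a time; keyword arguments
# given to .apply() are passed through to the function.
cleaned = toy.apply(str.lower).apply(strip_chars, chars=',.!')

print(cleaned.tolist())
```

Writing each function for a single string keeps it easy to test on one article before running it on the whole column.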

In [2]:
# Read our text data into a pandas DataFrame from our saved json file.

df = get_news_articles(cached=True)
df.head()

Unnamed: 0,topic,title,author,content
0,business,Scientist behind '90% effective' COVID-19 vacc...,Krishna Veera Vanamali,"German scientist Uğur Şahin, the CEO of BioNTe..."
1,business,"Father said 'if you want to blow money up, fin...",Pragya Swastik,Serum Institute of India's (SII) CEO Adar Poon...
2,business,China suspends fish imports from Indian firm a...,Krishna Veera Vanamali,China has suspended imports from India's Basu ...
3,business,"Russian vaccine arrives in India, video of it ...",Krishna Veera Vanamali,"Russia's coronavirus vaccine, Sputnik V, has a..."
4,business,Special Indian version of PUBG Mobile to be la...,Krishna Veera Vanamali,South Korea's PUBG Corporation on Thursday ann...


In [3]:
articles = df.content
print(type(articles))
articles[:5]

<class 'pandas.core.series.Series'>


0    German scientist Uğur Şahin, the CEO of BioNTe...
1    Serum Institute of India's (SII) CEO Adar Poon...
2    China has suspended imports from India's Basu ...
3    Russia's coronavirus vaccine, Sputnik V, has a...
4    South Korea's PUBG Corporation on Thursday ann...
Name: content, dtype: object

In [4]:
# Define a string that I will use to test my functions; this is just the first row in my Series.

article = articles[0]
print(type(article))
article

<class 'str'>


'German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." The vaccine is 90% effective based on initial data from a late-stage trial. "I believe that even protection only from symptomatic infections will have a dramatic effect," Şahin said.'

#### Normalize Text

##### Lowercase Text
```python
# Make all characters in string lowercase.
string = string.lower()
```

In [5]:
# Note I have to reassign this to save the changes.

article.lower()

'german scientist uğur şahin, the ceo of biontech which co-developed a coronavirus vaccine with pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." the vaccine is 90% effective based on initial data from a late-stage trial. "i believe that even protection only from symptomatic infections will have a dramatic effect," şahin said.'
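The reason for the reassignment is that Python strings are immutable: `.lower()` returns a new string and leaves the original alone. A minimal sketch with a made-up `sample` string (not taken from the dataset):

```python
# .lower() returns a new string; the original is not modified.
sample = 'The CEO of BioNTech'

lowered = sample.lower()

print(sample)   # the original still has its capital letters
print(lowered)  # the lowercased copy

# Reassign to the same name to keep the change.
sample = sample.lower()
```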

___

##### Remove Accented Characters

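We will remove accented characters by chaining together the following methods. Note that with the `'NFKC'` form used below, accented letters stay composed, so the ASCII encoding step drops them entirely (`Uğur` becomes `Uur`); a decomposing form such as `'NFKD'` would keep the base letter and drop only the accent.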

```python
# Remove inconsistencies in unicode character encoding; the first argument is the normalization form.
string = unicodedata.normalize('NFKC', string)
```

```python
# Convert string to ASCII character set and drop non-ASCII characters.
string = string.encode('ascii', 'ignore')
```

```python
# Convert the bytes back into a string object.
string = string.decode('utf-8', 'ignore')
```

In [6]:
# I have to reassign to my variable if I want to save the changes.

unicodedata.normalize('NFKC', article).encode('ascii', 'ignore').decode('utf-8', 'ignore')

'German scientist Uur ahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." The vaccine is 90% effective based on initial data from a late-stage trial. "I believe that even protection only from symptomatic infections will have a dramatic effect," ahin said.'
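The output above loses the letters `ğ` and `Ş` completely. A small sketch comparing two normalization forms on just the name shows why; the `to_ascii` helper is a hypothetical wrapper around the same three chained calls, written only for this comparison.

```python
import unicodedata

def to_ascii(text, form):
    '''Normalize with the given Unicode form, then drop anything that is not ASCII.'''
    return (unicodedata.normalize(form, text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))

name = 'Uğur Şahin'

# Composed form: each accented letter is a single non-ASCII character, so it is dropped whole.
print(to_ascii(name, 'NFKC'))

# Decomposed form: base letter and accent are separate, so only the accent is dropped.
print(to_ascii(name, 'NFKD'))
```

Which form is preferable depends on whether you want names like this to stay recognizable. Some characters have no decomposition at all and would be dropped under either form.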

___

##### Remove Special Characters

Here are two common patterns I might want to use to remove special characters; it just depends on what you want to remain in your string.

```python
# Remove characters that are not letters, numbers, underscores, or whitespace.
string = re.sub(r'[^\w\s]', '', string)

# Remove characters that are not lowercase letters, numbers, single quotes, or whitespace.
# Lowercase the string first, or capital letters will be removed too.
string = re.sub(r"[^a-z0-9'\s]", '', string)
```

In [7]:
# Test our regex on our string first; we have to reassign if we want to save the changes.

re.sub(r'[^\w\s]', '', article)

'German scientist Uğur Şahin the CEO of BioNTech which codeveloped a coronavirus vaccine with Pfizer said hes confident his product can end the pandemic and bash the virus over the head The vaccine is 90 effective based on initial data from a latestage trial I believe that even protection only from symptomatic infections will have a dramatic effect Şahin said'
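To see how the two patterns differ, here is a short sketch that runs both on a made-up `sample` string chosen to contain an apostrophe, an underscore, a hyphen, and a capital letter.

```python
import re

sample = "He's 90% sure_about co-dev!"

# \w keeps letters, digits, and underscores; the apostrophe is removed.
keep_word_chars = re.sub(r'[^\w\s]', '', sample)

# The explicit character class keeps apostrophes but removes underscores.
keep_quotes = re.sub(r"[^a-z0-9'\s]", '', sample.lower())

# Without lowercasing first, the capital 'H' is removed as well.
no_lowercase_first = re.sub(r"[^a-z0-9'\s]", '', sample)

print(keep_word_chars)
print(keep_quotes)
print(no_lowercase_first)
```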

##### Basic Clean Function

In [8]:
def basic_clean(string):
    '''
    This function takes in a string and
    returns the string normalized.
    '''
    string = unicodedata.normalize('NFKC', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string

In [9]:
basic_clean(article)

'german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from symptomatic infections will have a dramatic effect ahin said'

___

#### Tokenize Text

**Tokenization** is when you split larger strings of text into smaller pieces, or tokens, by setting a boundary. You might chunk a sentence into words using a space as a boundary, or a paragraph into sentences using punctuation as a boundary.
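The choice of boundary matters. The sketch below contrasts a plain whitespace split with a simple regex that treats punctuation as its own token. The regex is only an illustration of the idea and is not how any NLTK tokenizer is implemented; the `text` string is made up.

```python
import re

text = 'The vaccine is 90% effective, he said.'

# Whitespace as the boundary: punctuation stays glued to the neighboring word.
whitespace_tokens = text.split()

# Runs of word characters, or any single character that is neither
# a word character nor whitespace, each become a token.
regex_tokens = re.findall(r'\w+|[^\w\s]', text)

print(whitespace_tokens)
print(regex_tokens)
```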

##### Using NLTK Tokenization

```python
# Create the tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()

# Use the tokenizer on my string and assign to a variable. 
tokenized_string = tokenizer.tokenize(string, return_str=True)
```

In [10]:
# Create the tokenizer

tokenizer = nltk.tokenize.ToktokTokenizer()

In [11]:
# Use the tokenizer on my string; assign to variable to save changes

tokenizer.tokenize(article, return_str=True)

'German scientist Uğur Şahin , the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer , said he \' s confident his product can end the pandemic and " bash the virus over the head. " The vaccine is 90 % effective based on initial data from a late-stage trial. " I believe that even protection only from symptomatic infections will have a dramatic effect , " Şahin said .'

##### Tokenize Function

In [12]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [13]:
tokenize(article)

'German scientist Uğur Şahin , the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer , said he \' s confident his product can end the pandemic and " bash the virus over the head. " The vaccine is 90 % effective based on initial data from a late-stage trial. " I believe that even protection only from symptomatic infections will have a dramatic effect , " Şahin said .'

___

#### Stem Words

**Stemming** is when you reduce related words in your text to their common stem. When you are searching for a particular word in your text, it can be easier to search for the common stem rather than every form of the word. Stemmers are not very sophisticated: they chop off word endings by rule, so the result is not always a real word. spaCy, another Python NLP library, doesn't even include a stemmer; it only offers the more sophisticated lemmatizer, which we will look at in NLTK next.
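To make the idea of rule-based chopping concrete, here is a toy suffix stripper. It is a deliberately naive sketch and not the Porter algorithm; the `SUFFIXES` tuple and the `naive_stem` function are invented for this example.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ('ing', 'ed', 'es', 's')

def naive_stem(word):
    '''Lowercase the word and strip one common suffix, keeping at least three characters.'''
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['Called', 'calling', 'calls', 'virus', 'is']:
    print(word, '->', naive_stem(word))
```

Related forms collapse to one stem, but a word like `virus` is also cut down to something that is no longer a word, which is the trade-off described above.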

##### Using NLTK PorterStemmer

```python
# Create the Stemmer.
ps = nltk.porter.PorterStemmer()
```

```python
# Apply the stemmer to each word in our string.
stems = [ps.stem(word) for word in string.split()]
```

```python
# Join our list of words into a string again and assign to a variable.
stemmed_string = ' '.join(stems)
```

In [14]:
# Create porter stemmer.

ps = nltk.porter.PorterStemmer()

In [15]:
# Check stemmer. It works.

ps.stem('Called')

'call'

In [16]:
# Apply the stemmer to each word in our string.

stems = [ps.stem(word) for word in article.split()]
stems[:10]

['german',
 'scientist',
 'uğur',
 'şahin,',
 'the',
 'ceo',
 'of',
 'biontech',
 'which',
 'co-develop']

In [17]:
# Join our lists of words into a string again; assign to a variable to save changes

' '.join(stems)

'german scientist uğur şahin, the ceo of biontech which co-develop a coronaviru vaccin with pfizer, said he\' confid hi product can end the pandem and "bash the viru over the head." the vaccin is 90% effect base on initi data from a late-stag trial. "I believ that even protect onli from symptomat infect will have a dramat effect," şahin said.'

##### Stem Function

In [18]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

In [19]:
stem(article)

'german scientist uğur şahin, the ceo of biontech which co-develop a coronaviru vaccin with pfizer, said he\' confid hi product can end the pandem and "bash the viru over the head." the vaccin is 90% effect base on initi data from a late-stag trial. "I believ that even protect onli from symptomat infect will have a dramat effect," şahin said.'

___

#### Lemmatize Words

**Lemmatization** is when you reduce related words in your text to their lemma, or base word, by applying a morphological analysis to your text. Like stemming, this is done to reduce the number of forms you have of the same word, so they can be analyzed as a single item. While stemming might create tokens that are no longer real words after they have been chopped off at their base, lemmatization will leave you with real words. A drawback to lemmatization is that it takes longer than stemming; you can try both to see which gives you better results as you analyze a given text.

##### Using NLTK WordNetLemmatizer

```python
# Download the first time.
nltk.download('wordnet')
```

```python
# Create the Lemmatizer.
wnl = nltk.stem.WordNetLemmatizer()
```

```python
# Use the lemmatizer on each word in the list of words we created by using split.
lemmas = [wnl.lemmatize(word) for word in string.split()]
```

```python
# Join our list of words into a string again and assign to a variable.
lemmatized_string = ' '.join(lemmas)
```

In [20]:
# Create the Lemmatizer.

wnl = nltk.stem.WordNetLemmatizer()

In [21]:
# Check lemmatizer. The capitalized 'Calls' comes back unchanged here, so lowercase text before lemmatizing.

wnl.lemmatize('Calls')

'Calls'

In [22]:
# Use the lemmatizer on each word in the list of words we created by using split.

lemmas = [wnl.lemmatize(word) for word in article.split()]
lemmas[:10]

['German',
 'scientist',
 'Uğur',
 'Şahin,',
 'the',
 'CEO',
 'of',
 'BioNTech',
 'which',
 'co-developed']

In [23]:
# Join our list of words into a string again; assign to a variable to save changes.

' '.join(lemmas)

'German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." The vaccine is 90% effective based on initial data from a late-stage trial. "I believe that even protection only from symptomatic infection will have a dramatic effect," Şahin said.'

In [24]:
# Are there differences between the stemmed and lemmatized lists of words? Yes.

stems == lemmas

False

##### Lemmatize Function

In [25]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [26]:
lemmatize(article)

'German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." The vaccine is 90% effective based on initial data from a late-stage trial. "I believe that even protection only from symptomatic infection will have a dramatic effect," Şahin said.'

In [27]:
# The functions are indeed doing different things here. Just checking...

lemmatize(article) == stem(article)

False

___

#### Remove Stopwords

**Stopwords** are words that are filtered out while preparing your text for analysis and modeling. They offer little to the meaning of your text and basically just add noise to your analysis. Or, as Ryan Orsinger would say, "Stopwords aren't the real story of the document." Words such as 'the', 'and', 'a', and the like can be removed, so you can better focus on the good stuff.

##### Using NLTK Stopwords

```python
# Download the first time.
nltk.download('stopwords')

# Create list of stopwords and assign to variable.
stopword_list = stopwords.words('english')

# Create list of words and assign to variable.
words = string.split()

# Create a list of words from my string with stopwords removed and assign to variable.
filtered_words = [w for w in words if w not in stopword_list]

# Join words in the list back into strings and assign to a variable.
string_without_stopwords = ' '.join(filtered_words)
```

In [28]:
# Create the list of stopwords.

stopword_list = stopwords.words('english')
stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [29]:
# Split the article string into a list of words.

words = article.split()
words[:10]

['German',
 'scientist',
 'Uğur',
 'Şahin,',
 'the',
 'CEO',
 'of',
 'BioNTech',
 'which',
 'co-developed']

In [30]:
# Create a list of words from my string with stopwords removed and assign to variable.

filtered_words = [word for word in words if word not in stopword_list]
filtered_words[:10]

['German',
 'scientist',
 'Uğur',
 'Şahin,',
 'CEO',
 'BioNTech',
 'co-developed',
 'coronavirus',
 'vaccine',
 'Pfizer,']

In [31]:
# Join words in the list back into strings; assign to a variable to keep changes.

' '.join(filtered_words)

'German scientist Uğur Şahin, CEO BioNTech co-developed coronavirus vaccine Pfizer, said he\'s confident product end pandemic "bash virus head." The vaccine 90% effective based initial data late-stage trial. "I believe even protection symptomatic infections dramatic effect," Şahin said.'
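Notice that `The` survives in the output above: the stopword list is all lowercase and the membership test is case-sensitive. The sketch below shows the effect using a tiny stand-in set; `stopword_set` here is a made-up subset for illustration, not the NLTK list.

```python
# Tiny stand-in for a stopword list (an assumption for this example only).
stopword_set = {'the', 'is', 'a', 'of'}

words = 'The vaccine is the work of a team'.split()

# Case-sensitive test: the capitalized 'The' does not match 'the'.
case_sensitive = [w for w in words if w not in stopword_set]

# Comparing the lowercased word catches it.
case_insensitive = [w for w in words if w.lower() not in stopword_set]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing the text before removing stopwords, as `basic_clean` does, avoids the problem altogether.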

##### Stopwords Function

In [32]:
def remove_stopwords(string, extra_words=[], exclude_words=[]):
    '''
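    This function takes in a string plus optional extra_words and exclude_words
    lists (both empty by default). Words in exclude_words are filtered out along
    with the stopwords, words in extra_words are appended to the result, and the
    resulting string is returned.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Add exclude_words to the stopword list so they are filtered out of the text too.
    stopword_list.extend(exclude_words)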
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Append extra_words to the end of the filtered word list.
    filtered_words.extend(extra_words)
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [33]:
remove_stopwords(article)

'German scientist Uğur Şahin, CEO BioNTech co-developed coronavirus vaccine Pfizer, said he\'s confident product end pandemic "bash virus head." The vaccine 90% effective based initial data late-stage trial. "I believe even protection symptomatic infections dramatic effect," Şahin said.'

In [34]:
# Test my function for adding extra_words and removing exclude_words passed as arguments. 

remove_stopwords(article, extra_words=['USAA', 'Codeup'], exclude_words=['German'])

'scientist Uğur Şahin, CEO BioNTech co-developed coronavirus vaccine Pfizer, said he\'s confident product end pandemic "bash virus head." The vaccine 90% effective based initial data late-stage trial. "I believe even protection symptomatic infections dramatic effect," Şahin said. USAA Codeup'
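The function above removes `exclude_words` from the text and appends `extra_words` to it. Another common reading of those parameter names is that `extra_words` adds to the stopword list and `exclude_words` protects words from being removed. The sketch below shows that alternative under stated assumptions; `remove_stopwords_alt` and the stand-in `base` list are hypothetical and are not the function used in the rest of this notebook.

```python
def remove_stopwords_alt(string, stopword_list, extra_words=None, exclude_words=None):
    '''
    Alternative sketch: extra_words are treated as additional stopwords,
    and exclude_words are never treated as stopwords.
    '''
    # None defaults avoid sharing one mutable list across calls.
    extra_words = extra_words or []
    exclude_words = exclude_words or []

    # Build the final set of words to filter out.
    stopword_set = (set(stopword_list) | set(extra_words)) - set(exclude_words)

    return ' '.join(word for word in string.split() if word not in stopword_set)

# Stand-in stopword list so the example runs without any downloads.
base = ['the', 'is', 'not']
text = 'the vaccine is not ineffective'

print(remove_stopwords_alt(text, base))
print(remove_stopwords_alt(text, base, extra_words=['vaccine'], exclude_words=['not']))
```

Keeping a word like `not` can matter for tasks such as sentiment analysis, where removing it changes the meaning of the sentence.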

#### Prep Article Data Function

In [35]:
# I'm checking my code before I throw it in my function; always check it first!

df['content'].apply(basic_clean)\
             .apply(tokenize)\
             .apply(remove_stopwords)\
             .apply(lemmatize)

0     german scientist uur ahin ceo biontech codevel...
1     serum institute india sii ceo adar poonawalla ...
2     china suspended import india basu internationa...
3     russia coronavirus vaccine sputnik v arrived i...
4     south korea pubg corporation thursday announce...
                            ...                        
93    actor tiger shroffs sister krishna shroff brok...
94    nikhil dwivedi recently opened tweet wherein t...
95    singer selena gomez play peruvian mountaineer ...
96    actress sayani gupta said year diwali people l...
97    talking late actor asif basra found hanging pr...
Name: content, Length: 98, dtype: object

In [36]:
df = get_news_articles(cached=True)
df.head(2)

Unnamed: 0,topic,title,author,content
0,business,Scientist behind '90% effective' COVID-19 vacc...,Krishna Veera Vanamali,"German scientist Uğur Şahin, the CEO of BioNTe..."
1,business,"Father said 'if you want to blow money up, fin...",Pragya Swastik,Serum Institute of India's (SII) CEO Adar Poon...


In [37]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name for a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df with
    the article title, original text, stemmed text, lemmatized text, and the
    cleaned, tokenized, & lemmatized text with stopwords removed.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)\
                            .apply(lemmatize)
    
    df['stemmed'] = df[column].apply(basic_clean).apply(stem)
    
    df['lemmatized'] = df[column].apply(basic_clean).apply(lemmatize)
    
    return df[['title', column, 'stemmed', 'lemmatized', 'clean']]

In [38]:
df = prep_article_data(df, 'content')
df.head()

Unnamed: 0,title,content,stemmed,lemmatized,clean
0,Scientist behind '90% effective' COVID-19 vacc...,"German scientist Uğur Şahin, the CEO of BioNTe...",german scientist uur ahin the ceo of biontech ...,german scientist uur ahin the ceo of biontech ...,german scientist uur ahin ceo biontech codevel...
1,"Father said 'if you want to blow money up, fin...",Serum Institute of India's (SII) CEO Adar Poon...,serum institut of india sii ceo adar poonawall...,serum institute of india sii ceo adar poonawal...,serum institute india sii ceo adar poonawalla ...
2,China suspends fish imports from Indian firm a...,China has suspended imports from India's Basu ...,china ha suspend import from india basu intern...,china ha suspended import from india basu inte...,china suspended import india basu internationa...
3,"Russian vaccine arrives in India, video of it ...","Russia's coronavirus vaccine, Sputnik V, has a...",russia coronaviru vaccin sputnik v ha arriv in...,russia coronavirus vaccine sputnik v ha arrive...,russia coronavirus vaccine sputnik v arrived i...
4,Special Indian version of PUBG Mobile to be la...,South Korea's PUBG Corporation on Thursday ann...,south korea pubg corpor on thursday announc th...,south korea pubg corporation on thursday annou...,south korea pubg corporation thursday announce...


In [39]:
prep_article_data(df, 'content', extra_words=['Codeup'], exclude_words=['german', 'china'])

Unnamed: 0,title,content,stemmed,lemmatized,clean
0,Scientist behind '90% effective' COVID-19 vacc...,"German scientist Uğur Şahin, the CEO of BioNTe...",german scientist uur ahin the ceo of biontech ...,german scientist uur ahin the ceo of biontech ...,scientist uur ahin ceo biontech codeveloped co...
1,"Father said 'if you want to blow money up, fin...",Serum Institute of India's (SII) CEO Adar Poon...,serum institut of india sii ceo adar poonawall...,serum institute of india sii ceo adar poonawal...,serum institute india sii ceo adar poonawalla ...
2,China suspends fish imports from Indian firm a...,China has suspended imports from India's Basu ...,china ha suspend import from india basu intern...,china ha suspended import from india basu inte...,suspended import india basu international one ...
3,"Russian vaccine arrives in India, video of it ...","Russia's coronavirus vaccine, Sputnik V, has a...",russia coronaviru vaccin sputnik v ha arriv in...,russia coronavirus vaccine sputnik v ha arrive...,russia coronavirus vaccine sputnik v arrived i...
4,Special Indian version of PUBG Mobile to be la...,South Korea's PUBG Corporation on Thursday ann...,south korea pubg corpor on thursday announc th...,south korea pubg corporation on thursday annou...,south korea pubg corporation thursday announce...
...,...,...,...,...,...
93,Tiger's sister Krishna announces split from Eb...,"Actor Tiger Shroff's sister, Krishna Shroff, h...",actor tiger shroff sister krishna shroff ha br...,actor tiger shroffs sister krishna shroff ha b...,actor tiger shroffs sister krishna shroff brok...
94,It was a mark of protest: Nikhil on tweet sayi...,Nikhil Dwivedi recently opened up about his tw...,nikhil dwivedi recent open up about hi tweet w...,nikhil dwivedi recently opened up about his tw...,nikhil dwivedi recently opened tweet wherein t...
95,Selena to play mountaineer Silvia Vasquez-Lava...,Singer Selena Gomez will play Peruvian mountai...,singer selena gomez will play peruvian mountai...,singer selena gomez will play peruvian mountai...,singer selena gomez play peruvian mountaineer ...
96,Hope old demons are left behind and we adopt p...,Actress Sayani Gupta said that this year on Di...,actress sayani gupta said that thi year on diw...,actress sayani gupta said that this year on di...,actress sayani gupta said year diwali people l...
