In [3]:
import pandas as pd
import numpy as np

import os
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from nlp_acquire import get_news_articles

## Natural Language Processing

### <font color=red>What is Natural Language Processing?</font>

Natural Language Processing allows you to use techniques in Python libraries like NLTK (Natural Language Toolkit) and spaCy to create a machine-usable structure out of natural language text. In other words, you can manipulate natural language in a way that makes it useful in machine learning. Machines can't read words, but they can work with numbers, so we have to process the text we want to use in a way that retains the original meaning while representing the text with numbers.
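
One simple way to picture "representing the text with numbers" is a word-count vector. The sketch below is only an illustration using the standard library; the two example documents are made up, and real projects would more likely use a vectorizer from a machine learning library.

```python
from collections import Counter

# Two tiny example documents (made up for illustration).
docs = [
    "the vaccine is effective",
    "the vaccine trial is a late stage trial",
]

# Build a sorted vocabulary of every distinct word across the documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Represent each document as a list of word counts, one slot per vocabulary word.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a row of counts in the same column order, which is the kind of structure a model can consume.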

___

### <font color=orange>So What?</font>

We will establish a workflow to preprocess our text data and prepare it for further use in exploration and modeling. This preprocessing is known as text normalization. **Normalization** is a series of tasks such as making all text lowercase, removing punctuation, expanding contractions, and removing anything that's not an ASCII character.

1. Lowercase all characters in our string.
2. Remove special characters from our string.
3. Remove accented characters from our string.
4. Tokenize our string into discrete units. (words, punctuation)
5. Stem or lemmatize the words in our string. (group by base word)
6. Remove stopwords from our string. (remove noise)

___

### <font color=green>Now What?</font>

Let's create functions to help us normalize our text data. According to the curriculum, we should build our functions to take in a string. We can use the `.apply()` method when it's time to use our functions on a Series in a DataFrame.

In [7]:
# Read our text data into a pandas DataFrame from our saved json file.

df = get_news_articles(cached=True)
df.head()

| | topic | title | author | content |
|---|---|---|---|---|
| 0 | business | Scientist behind '90% effective' COVID-19 vacc... | Krishna Veera Vanamali | German scientist Uğur Şahin, the CEO of BioNTe... |
| 1 | business | Father said 'if you want to blow money up, fin... | Pragya Swastik | Serum Institute of India's (SII) CEO Adar Poon... |
| 2 | business | China suspends fish imports from Indian firm a... | Krishna Veera Vanamali | China has suspended imports from India's Basu ... |
| 3 | business | Russian vaccine arrives in India, video of it ... | Krishna Veera Vanamali | Russia's coronavirus vaccine, Sputnik V, has a... |
| 4 | business | Special Indian version of PUBG Mobile to be la... | Krishna Veera Vanamali | South Korea's PUBG Corporation on Thursday ann... |


In [6]:
articles = df.content
print(type(articles))
articles[:5]

<class 'pandas.core.series.Series'>


0    German scientist Uğur Şahin, the CEO of BioNTe...
1    Serum Institute of India's (SII) CEO Adar Poon...
2    China has suspended imports from India's Basu ...
3    Russia's coronavirus vaccine, Sputnik V, has a...
4    South Korea's PUBG Corporation on Thursday ann...
Name: content, dtype: object

In [9]:
# Define a string that I will use to test my functions; this is just the first row in my Series.

article = articles[0]
print(type(article))
article

<class 'str'>


'German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." The vaccine is 90% effective based on initial data from a late-stage trial. "I believe that even protection only from symptomatic infections will have a dramatic effect," Şahin said.'

#### Normalize Text

##### Lowercase Text
```python
# Make all characters in string lowercase.
string = string.lower()
```

In [13]:
# Note I have to reassign this to save the changes.

article = article.lower()
article

'german scientist uğur şahin, the ceo of biontech which co-developed a coronavirus vaccine with pfizer, said he\'s confident his product can end the pandemic and "bash the virus over the head." the vaccine is 90% effective based on initial data from a late-stage trial. "i believe that even protection only from symptomatic infections will have a dramatic effect," şahin said.'

##### Remove Special Characters

Here are two common patterns for removing special characters; which one to use depends on what you want to remain in your string.

```python
# Remove characters that are not letters, numbers, underscores, or whitespace.
string = re.sub(r'[^\w\s]', '', string)

# Remove characters that are not letters, numbers, single quotes, or spaces.
string = re.sub(r"[^a-z0-9'\s]", '', string)
```
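
To see how the two patterns differ, here is a small side-by-side sketch. The sample strings are hypothetical and already lowercased; note that the second pattern only keeps lowercase ASCII letters, so it also strips uppercase and accented letters.

```python
import re

# Hypothetical sample string, already lowercased.
sample = "he's 90% sure it's a late-stage trial!"

# Pattern 1: keep word characters (letters, digits, underscore) and whitespace.
no_punct = re.sub(r'[^\w\s]', '', sample)

# Pattern 2: keep lowercase ASCII letters, digits, single quotes, and whitespace.
keep_quotes = re.sub(r"[^a-z0-9'\s]", '', sample)

print(no_punct)     # hes 90 sure its a latestage trial
print(keep_quotes)  # he's 90 sure it's a latestage trial

# \w is Unicode-aware for str patterns, so pattern 1 leaves accented letters alone,
# while pattern 2 drops them because they fall outside a-z.
name = 'u\u011fur \u015fahin'  # 'uğur şahin'
name_pattern_1 = re.sub(r'[^\w\s]', '', name)
name_pattern_2 = re.sub(r"[^a-z0-9'\s]", '', name)
print(name_pattern_1)  # uğur şahin
print(name_pattern_2)  # uur ahin
```

Keeping the single quote preserves contractions such as `he's`, which matters if you plan to expand contractions or want the stopword list to match them later.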

In [16]:
# Test our regex on our string first; we have to reassign if we want to save the changes.

article = re.sub(r'[^\w\s]', '', article)
article

'german scientist uğur şahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from symptomatic infections will have a dramatic effect şahin said'

___

##### Remove Accented Characters

We will remove accented characters by chaining together the following methods:

```python
# Remove inconsistencies in unicode character encoding.
string = unicodedata.normalize('NFKC', string)
```

```python
# Convert string to ASCII character set and drop non-ASCII characters.
string = string.encode('ascii', 'ignore')
```

```python
# Convert the bytes back into a string object.
string = string.decode('utf-8', 'ignore')
```

In [31]:
# I have to reassign to my variable if I want to save the changes.

article = unicodedata.normalize('NFKC', article).encode('ascii', 'ignore').decode('utf-8', 'ignore')
article

'german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from symptomatic infections will have a dramatic effect ahin said'
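
Notice that the accented letters were dropped entirely (`uğur şahin` became `uur ahin`). That follows from the normalization form: `'NFKC'` keeps accented letters as single composed characters, which the ASCII encoding step then discards. The sketch below compares it with `'NFKD'`, which decomposes each letter into a base letter plus a combining mark so the base letter survives. The helper name `to_ascii` is hypothetical, introduced only for this comparison.

```python
import unicodedata

def to_ascii(text, form):
    # Normalize with the given Unicode form, then drop anything outside ASCII.
    normalized = unicodedata.normalize(form, text)
    return normalized.encode('ascii', 'ignore').decode('utf-8', 'ignore')

name = 'u\u011fur \u015fahin'  # 'uğur şahin'

nfkc_result = to_ascii(name, 'NFKC')
nfkd_result = to_ascii(name, 'NFKD')

print(nfkc_result)  # uur ahin
print(nfkd_result)  # ugur sahin
```

Which form to use is a judgment call: `'NFKD'` keeps names readable, while the `'NFKC'` call used in this notebook removes accented letters altogether.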

##### Basic Clean Function

In [32]:
def basic_clean(string):
    '''
    This function takes in a string and
    returns the string normalized.
    '''
    string = unicodedata.normalize('NFKC', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    string = re.sub(r'[^\w\s]', '', string).lower()
    return string

In [33]:
article = basic_clean(article)
article

'german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from symptomatic infections will have a dramatic effect ahin said'

___

#### Tokenize Text

**Tokenization** is when you split larger strings of text into smaller pieces, or tokens, by setting a boundary. You might chunk a sentence into words using a space as a boundary, or a paragraph into sentences using punctuation as a boundary.
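
As a rough illustration of the two boundaries just described, the sketch below uses only the standard library. The sample text is made up, and the sentence-splitting regex is a naive assumption that will misfire on abbreviations such as "Dr."; it is not what NLTK does internally.

```python
import re

# Hypothetical sample text with punctuation left in.
text = 'The vaccine is 90% effective. "I believe it will help," he said.'

# Word-level tokens: use whitespace as the boundary.
word_tokens = text.split()

# Sentence-level tokens: split on whitespace that follows ., !, or ?
sentence_tokens = re.split(r'(?<=[.!?])\s+', text)

print(word_tokens)
print(sentence_tokens)
```

With a plain whitespace split, punctuation stays glued to the neighboring word (`'effective.'`, `'help,"'`), which is one reason to reach for a dedicated tokenizer when the text has not been stripped of punctuation first.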

##### Using NLTK Tokenization

```python
# Create the tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()

# Use the tokenizer on my string and assign to a variable. 
tokenized_string = tokenizer.tokenize(string, return_str=True)
```

In [38]:
# Create the tokenizer

tokenizer = nltk.tokenize.ToktokTokenizer()

In [45]:
# Use the tokenizer on my string.

tokenized_article = tokenizer.tokenize(article, return_str=True)
tokenized_article

'german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said hes confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from symptomatic infections will have a dramatic effect ahin said'

##### Tokenize Function

In [57]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

___

#### Stem Words

**Stemming** is when you reduce related words in your text to their common stem. When you are searching for a particular word in your text, it can be easier to search for the common stem rather than every form of the word. Stemmers are not very sophisticated: they chop off word endings by rule, so the result is not always a real word. spaCy, another Python NLP library, doesn't include a stemmer at all; it only offers the more sophisticated lemmatizer. We will look at lemmatization in NLTK next.
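
To make the "chopping off word endings" idea concrete, here is a deliberately crude toy stemmer. It is not the Porter algorithm (which applies many more rules and conditions); the function name `toy_stem` and its suffix list are assumptions made up for this illustration.

```python
def toy_stem(word, suffixes=('ing', 'ed', 'es', 's')):
    # Strip the first matching suffix, but keep at least three characters of stem.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['infects', 'infected', 'infecting', 'virus']
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)  # ['infect', 'infect', 'infect', 'viru']
```

The three forms of "infect" collapse to one stem, which is the goal, but "virus" loses its final letter and is no longer a real word. Rule-based stemmers make the same kind of trade-off.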

##### Using NLTK PorterStemmer

```python
# Create the Stemmer.
ps = nltk.porter.PorterStemmer()
```

```python
# Apply the stemmer to each word in our string.
stems = [ps.stem(word) for word in string.split()]
```

```python
# Join our list of words into a string again and assign to a variable.
stemmed_string = ' '.join(stems)
```

In [58]:
# Create porter stemmer

ps = nltk.porter.PorterStemmer()

In [59]:
# Apply the stemmer to each word in our string.

stems = [ps.stem(word) for word in article.split()]
stems[:10]

['german',
 'scientist',
 'uur',
 'ahin',
 'the',
 'ceo',
 'of',
 'biontech',
 'which',
 'codevelop']

In [56]:
# Join our list of words into a string again and assign to a variable. Take a peek.

stemmed_string = ' '.join(stems)
stemmed_string

'german scientist uur ahin the ceo of biontech which codevelop a coronaviru vaccin with pfizer said he confid hi product can end the pandem and bash the viru over the head the vaccin is 90 effect base on initi data from a latestag trial i believ that even protect onli from symptomat infect will have a dramat effect ahin said'

##### Stem Function

In [35]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

___

#### Lemmatize Tokens

**Lemmatization** is when you reduce related words in your text to their lemma, or base word, by applying morphological analysis. Like stemming, this reduces the number of forms you have of the same word so they can be analyzed as a single item. While stemming might create tokens that are no longer real words after they have been chopped off at their base, lemmatization leaves you with real words. A drawback to lemmatization is that it takes longer than stemming; you can try both to see which gives you better results as you analyze a given text.

##### Using NLTK WordNetLemmatizer

```python
# Download the WordNet corpus that the lemmatizer relies on (only needed once).
nltk.download('wordnet')
```

```python
# Create the Lemmatizer.
wnl = nltk.stem.WordNetLemmatizer()
```

```python
# Use the lemmatizer on each word in the list of words we created by using split.
lemmas = [wnl.lemmatize(word) for word in string.split()]
```

```python
# Join our list of words into a string again and assign to a variable.
lemmatized_string = ' '.join(lemmas)
```

In [63]:
# Create the Lemmatizer.

wnl = nltk.stem.WordNetLemmatizer()

In [64]:
# Use the lemmatizer on each word in the list of words we created by using split.

lemmas = [wnl.lemmatize(word) for word in article.split()]

In [65]:
# Join our list of words into a string again and assign to a variable.

lemmatized_string = ' '.join(lemmas)
lemmatized_string

'german scientist uur ahin the ceo of biontech which codeveloped a coronavirus vaccine with pfizer said he confident his product can end the pandemic and bash the virus over the head the vaccine is 90 effective based on initial data from a latestage trial i believe that even protection only from symptomatic infection will have a dramatic effect ahin said'

In [66]:
# Are there differences between the stemmed and lemmatized strings? Yes.

stemmed_string == lemmatized_string

False
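
To see where the two results actually diverge, we can compare them token by token. The sketch below uses short excerpts copied from the stemmed and lemmatized outputs above, and it assumes both strings have the same number of tokens (true here, since each word maps to exactly one stem or lemma).

```python
# Excerpts copied from the stemmed and lemmatized outputs above.
stemmed_excerpt = 'biontech which codevelop a coronaviru vaccin with pfizer'
lemmatized_excerpt = 'biontech which codeveloped a coronavirus vaccine with pfizer'

# Pair up tokens by position and keep only the pairs that differ.
differences = [
    (stemmed_word, lemma)
    for stemmed_word, lemma in zip(stemmed_excerpt.split(), lemmatized_excerpt.split())
    if stemmed_word != lemma
]

print(differences)
# [('codevelop', 'codeveloped'), ('coronaviru', 'coronavirus'), ('vaccin', 'vaccine')]
```

The stemmer truncates words into non-words such as `coronaviru`, while the lemmatizer leaves these particular words as real dictionary forms.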

##### Lemmatize Function

In [67]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

#### Remove Stopwords

**Stopwords** are words that are filtered out while preparing your text for analysis and modeling. They add little to the meaning of your text and are basically just noise in your analysis. Or, as Ryan Orsinger would say, "Stopwords aren't the real story of the document." Words such as 'the', 'and', 'a', and the like can be removed so you can better focus on the good stuff.

##### Using NLTK Stopwords

```python
# Download the stopwords corpus (only needed once) and import it.
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords

# Create list of stopwords and assign to variable.
stopword_list = stopwords.words('english')

# Create list of words and assign to variable.
words = string.split()

# Create a list of words from my string with stopwords removed and assign to variable.
filtered_words = [w for w in words if w not in stopword_list]

# Join words in the list back into a string and assign to a variable.
string_without_stopwords = ' '.join(filtered_words)
```

In [71]:
# Create the list of stopwords.

stopword_list = stopwords.words('english')

In [77]:
# Split the cleaned article string into a list of words.

words = article.split()
words[:10]

['german',
 'scientist',
 'uur',
 'ahin',
 'the',
 'ceo',
 'of',
 'biontech',
 'which',
 'codeveloped']

In [81]:
# Create a list of words from my string with stopwords removed and assign to variable.

filtered_words = [word for word in words if word not in stopword_list]
filtered_words[:10]

['german',
 'scientist',
 'uur',
 'ahin',
 'ceo',
 'biontech',
 'codeveloped',
 'coronavirus',
 'vaccine',
 'pfizer']

In [79]:
# Join words in the list back into a string and assign to a variable.

string_without_stopwords = ' '.join(filtered_words)
string_without_stopwords

'german scientist uur ahin ceo biontech codeveloped coronavirus vaccine pfizer said hes confident product end pandemic bash virus head vaccine 90 effective based initial data latestage trial believe even protection symptomatic infections dramatic effect ahin said'

##### Stopwords Function

In [80]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string, optional extra_words (additional words to
    treat as stopwords) and exclude_words (stopwords to keep in the text),
    and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove exclude_words from the stopword list so they stay in the text.
    stopword_list = set(stopword_list) - set(exclude_words or [])
    
    # Add extra_words to the stopword list so they are removed from the text.
    stopword_list = stopword_list.union(extra_words or [])
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into a string and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords
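
The two optional parameters are easy to mix up, so here is a self-contained sketch of one reading of them: `extra_words` are added to the stopword list (and so removed from the text), while `exclude_words` are taken out of the stopword list (and so kept in the text). The function name `filter_stopwords` and the tiny hand-written stopword set are assumptions standing in for the NLTK list, so the example runs without any downloads.

```python
# A tiny hand-written stopword set standing in for stopwords.words('english').
BASE_STOPWORDS = {'the', 'a', 'of', 'and', 'is', 'not'}

def filter_stopwords(text, extra_words=(), exclude_words=()):
    # Add extra_words to the stopword set and take exclude_words out of it.
    stopword_set = (BASE_STOPWORDS | set(extra_words)) - set(exclude_words)
    return ' '.join(word for word in text.split() if word not in stopword_set)

text = 'the vaccine is not a cure said the ceo'

default_result = filter_stopwords(text)
extra_result = filter_stopwords(text, extra_words=['said'])
exclude_result = filter_stopwords(text, exclude_words=['not'])

print(default_result)  # vaccine cure said ceo
print(extra_result)    # vaccine cure ceo
print(exclude_result)  # vaccine not cure said ceo
```

Keeping a word like "not" can matter for tasks such as sentiment analysis, where dropping it would flip the meaning of the sentence.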

#### Prep Article Data Function

In [82]:
df = get_news_articles(cached=True)
df.head(2)

| | topic | title | author | content |
|---|---|---|---|---|
| 0 | business | Scientist behind '90% effective' COVID-19 vacc... | Krishna Veera Vanamali | German scientist Uğur Şahin, the CEO of BioNTe... |
| 1 | business | Father said 'if you want to blow money up, fin... | Pragya Swastik | Serum Institute of India's (SII) CEO Adar Poon... |


In [55]:
def prep_article_data(df):
    '''
    This function takes in a DataFrame of articles and
    returns the DataFrame with original columns plus cleaned,
    stemmed, and lemmatized content without stopwords.
    '''
    # Do basic clean on article content, then tokenize it.
    clean_tokens = df.content.apply(basic_clean).apply(tokenize)
    
    # Stem cleaned and tokenized article content, then remove stopwords.
    df['clean_stemmed'] = clean_tokens.apply(stem).apply(remove_stopwords)
    
    # Lemmatize cleaned and tokenized article content, then remove stopwords.
    df['clean_lemmatized'] = clean_tokens.apply(lemmatize).apply(remove_stopwords)
    
    return df[['topic', 'title', 'author', 'content', 'clean_stemmed', 'clean_lemmatized']]

In [56]:
df = prep_article_data(df)
df.head(2)

| | topic | title | author | content | clean_stemmed | clean_lemmatized |
|---|---|---|---|---|---|---|
| 0 | business | Scientist behind '90% effective' COVID-19 vaccine says it can end the pandemic | Krishna Veera Vanamali | German scientist Uğur Şahin, the CEO of BioNTech which co-developed a coronavirus vaccine with Pfizer, said he's confident his product can end the pandemic and "bash the virus over the head." The vaccine is 90% effective based on initial data from a late-stage trial. "I believe that even protect... | german scientist uur ahin ceo biontech codevelop coronaviru vaccin pfizer said confid hi product end pandem bash viru head vaccin 90 effect base initi data latestag trial believ even protect onli symptomat infect dramat effect ahin said | german scientist uur ahin ceo biontech codeveloped coronavirus vaccine pfizer said confident product end pandemic bash virus head vaccine 90 effective based initial data latestage trial believe even protection symptomatic infection dramatic effect ahin said |
| 1 | business | Father said 'if you want to blow money up, fine' as I put $250M on vaccine: Adar | Pragya Swastik | Serum Institute of India's (SII) CEO Adar Poonawalla revealed that his father Cyrus S Poonawalla told him, "Look, it's your money. If you want to blow it up, fine," as he put $250 million in ramping up COVID-19 vaccine manufacturing capacity. Adar told The Washington Post, "I decided to go all o... | serum institut india sii ceo adar poonawalla reveal hi father cyru poonawalla told look money want blow fine put 250 million ramp covid19 vaccin manufactur capac adar told washington post decid go hi father found sii 1966 | serum institute india sii ceo adar poonawalla revealed father cyrus poonawalla told look money want blow fine put 250 million ramping covid19 vaccine manufacturing capacity adar told washington post decided go father founded sii 1966 |
