<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Natural-Language-Processing" data-toc-modified-id="Natural-Language-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Natural Language Processing</a></span><ul class="toc-item"><li><span><a href="#What-is-Natural-Language-Processing?" data-toc-modified-id="What-is-Natural-Language-Processing?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><strong><font color="red">What is Natural Language Processing?</font></strong></a></span></li><li><span><a href="#So-What?" data-toc-modified-id="So-What?-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><strong><font color="orange">So What?</font></strong></a></span></li><li><span><a href="#Normalization-Examples:" data-toc-modified-id="Normalization-Examples:-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span><strong><font color="purple">Normalization Examples:</font></strong></a></span><ul class="toc-item"><li><span><a href="#Lowercase-Using-df.col.str.lower()" data-toc-modified-id="Lowercase-Using-df.col.str.lower()-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span><strong>Lowercase Using <code>df.col.str.lower()</code></strong></a></span></li><li><span><a href="#Normalize-Unicode-Characters" data-toc-modified-id="Normalize-Unicode-Characters-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span><strong>Normalize Unicode Characters</strong></a></span></li><li><span><a href="#Remove-Special-Characters-Using-Regex" data-toc-modified-id="Remove-Special-Characters-Using-Regex-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span><strong>Remove Special Characters Using Regex</strong></a></span></li><li><span><a href="#Basic-Clean-Function" data-toc-modified-id="Basic-Clean-Function-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Basic Clean Function</a></span></li></ul></li><li><span><a href="#Tokenization-Examples:" data-toc-modified-id="Tokenization-Examples:-1.4"><span 
class="toc-item-num">1.4&nbsp;&nbsp;</span><strong><font color="purple">Tokenization Examples:</font></strong></a></span><ul class="toc-item"><li><span><a href="#Using-.split()" data-toc-modified-id="Using-.split()-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span><strong>Using <code>.split()</code></strong></a></span></li><li><span><a href="#Using-Regex" data-toc-modified-id="Using-Regex-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Using Regex</a></span></li><li><span><a href="#Using-NLTK-Tokenization" data-toc-modified-id="Using-NLTK-Tokenization-1.4.3"><span class="toc-item-num">1.4.3&nbsp;&nbsp;</span>Using NLTK Tokenization</a></span></li><li><span><a href="#Tokenize-Function" data-toc-modified-id="Tokenize-Function-1.4.4"><span class="toc-item-num">1.4.4&nbsp;&nbsp;</span>Tokenize Function</a></span></li><li><span><a href="#Using-NLTK-PorterStemmer" data-toc-modified-id="Using-NLTK-PorterStemmer-1.4.5"><span class="toc-item-num">1.4.5&nbsp;&nbsp;</span><strong>Using NLTK PorterStemmer</strong></a></span></li><li><span><a href="#Stem-Function" data-toc-modified-id="Stem-Function-1.4.6"><span class="toc-item-num">1.4.6&nbsp;&nbsp;</span>Stem Function</a></span></li><li><span><a href="#Using-NLTK-WordNetLemmatizer" data-toc-modified-id="Using-NLTK-WordNetLemmatizer-1.4.7"><span class="toc-item-num">1.4.7&nbsp;&nbsp;</span><strong>Using NLTK WordNetLemmatizer</strong></a></span></li><li><span><a href="#Lemmatize-Function" data-toc-modified-id="Lemmatize-Function-1.4.8"><span class="toc-item-num">1.4.8&nbsp;&nbsp;</span>Lemmatize Function</a></span></li></ul></li></ul></li></ul></div>

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import os
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from acquire_walkthrough import get_all_urls, get_blog_articles, get_news_articles

### Natural Language Processing

#### **<font color=red>What is Natural Language Processing?</font>**

Natural Language Processing (NLP) uses techniques from Python libraries such as NLTK (Natural Language Toolkit) and spaCy to create machine-usable structure out of natural language text. In other words, you manipulate natural language so that it becomes useful for machine learning. Machines can't read words, but they can work with numbers, so we have to process the text in a way that retains as much of the original meaning as possible while representing it with numbers.
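To make "representing the text with numbers" concrete, here is a minimal sketch of one common approach, a bag-of-words count, using only the standard library. The sample sentence and variable names are illustrative assumptions and are not part of the news data used in this notebook.

```python
from collections import Counter

# Illustrative sentence (an assumption, not from the news dataset)
sentence = "the cat sat on the mat"

# Split on whitespace and count how often each word occurs
word_counts = Counter(sentence.split())

# Fix a vocabulary order so every text maps to a vector of the same length
vocabulary = sorted(word_counts)
vector = [word_counts[word] for word in vocabulary]

print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```

Note that this representation keeps word frequencies but discards word order, which is one reason the cleaning steps that follow matter: "The" and "the" should land in the same slot.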

#### **<font color=orange>So What?</font>**

We need to know some basic terminology to get started:

**Normalization** - performing a series of tasks such as making all text lowercase, removing punctuation, expanding contractions, and removing anything that's not an ASCII character.
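As a rough sketch of what these tasks look like on a single plain Python string (before moving to pandas), the hypothetical helper `normalize_text` below lowercases, folds accented characters to ASCII, and strips punctuation. It does not expand contractions, and it uses the `'NFKD'` Unicode form, which is a choice made for this illustration.

```python
import re
import unicodedata

def normalize_text(text):
    """Lowercase, fold to ASCII, and keep only letters, digits, apostrophes, and whitespace."""
    text = text.lower()
    # NFKD splits an accented letter into a base letter plus a combining mark,
    # so the ASCII encode step keeps the base letter and drops the mark
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r"[^a-z0-9'\s]", '', text)

print(normalize_text("Caf\u00e9 prices ROSE 5%, didn't they?"))
# cafe prices rose 5 didn't they
```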

#### **<font color=purple>Normalization Examples:</font>**

##### **Lowercase Using `df.col.str.lower()`**

In [3]:
df = get_news_articles()
df.head()

Unnamed: 0,topic,title,author,content
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...
2,business,Govt releases fare structure for domestic flig...,Nandini Sinha,The DGCA has released fare structure for domes...
3,business,"Oxfam to fire 1,450 staff, shut offices in 18 ...",Anushka Dixit,Oxfam International has announced that it'll b...
4,business,Vaccine development is like rollercoaster: Ser...,Dharna,Serum Institute of India CEO Adar Poonawalla s...


In [None]:
# Note: the result is not assigned back to df, so this is just a look.

df.content.str.lower()

##### **Normalize Unicode Characters**

[Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.normalize.html) is the documentation for `Series.str.normalize()`, which applies `unicodedata.normalize()` to each string in a Pandas Series.

```python
df.col.str.normalize(form)
```

```python
df.col.str.encode('ascii', 'ignore')
```

```python
df.col.str.decode('utf-8', 'ignore')
```

```python
df.col.str.replace(r"[^A-Za-z0-9'\s]", '', regex=True)
```

In [None]:
# Again, this is a look because it has not been reassigned or changed in place.

df.content.str.normalize('NFKC').str.encode('ascii', 'ignore').str.decode('utf-8', 'ignore')
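The choice of normalization form affects what survives the ASCII step. The sketch below compares `'NFKC'` (composed) with `'NFKD'` (decomposed) on a made-up string; `to_ascii` is a hypothetical helper written for this comparison, not something from pandas or this notebook.

```python
import unicodedata

# Made-up sample: "café" with an accented e, and "file" spelled with the "fi" ligature
word = 'caf\u00e9 \ufb01le'

def to_ascii(text, form):
    """Normalize with the given Unicode form, then drop any non-ASCII characters."""
    normalized = unicodedata.normalize(form, text)
    return normalized.encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(to_ascii(word, 'NFKC'))  # caf file
print(to_ascii(word, 'NFKD'))  # cafe file
```

Both forms expand the ligature, but only the decomposed form separates the accent from its base letter, so only `'NFKD'` keeps the `e`. Which behavior is preferable depends on the text being cleaned.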

##### **Remove Special Characters Using Regex**

I found [this article](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/) very helpful when using Regex in Pandas!

In [None]:
# Again, this is a look because it has not been reassigned or changed in place.

df.content.str.replace(r"[^A-Za-z0-9'\s]", '', regex=True)
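A detail worth knowing about character classes: the range `A-z` is not the same as `A-Za-z`. In ASCII, the characters `[`, `\`, `]`, `^`, `_`, and the backtick sit between `Z` and `a`, so `A-z` treats them as allowed. The sample string below is made up to show the difference.

```python
import re

# Made-up sample containing characters that fall between 'Z' and 'a' in ASCII
sample = "snake_case [tag] x^2 `code` don't"

# A-z spans every ASCII code from 'A' to 'z', including [ \ ] ^ _ and `
loose = re.sub(r"[^A-z0-9'\s]", '', sample)

# A-Za-z allows letters only
strict = re.sub(r"[^A-Za-z0-9'\s]", '', sample)

print(loose)   # snake_case [tag] x^2 `code` don't
print(strict)  # snakecase tag x2 code don't
```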

In [None]:
# Now, I can chain these together and reassign to my df as a new column

df['basic_clean'] = df.content.str.lower()\
                    .str.replace(r"[^A-Za-z0-9'\s]", '', regex=True)\
                    .str.normalize('NFKC')\
                    .str.encode('ascii', 'ignore')\
                    .str.decode('utf-8', 'ignore')

In [None]:
df.head()

##### Basic Clean Function

In [4]:
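def basic_clean(df, col):
    '''
    Takes a df and the name of a text column and returns the df with a new
    column called 'basic_clean' holding lowercased, ASCII-only text with
    special characters removed.
    '''
    # A-Za-z is used instead of A-z, which would also keep [ \ ] ^ _ and `
    df['basic_clean'] = df[col].str.lower()\
                    .str.replace(r"[^A-Za-z0-9'\s]", '', regex=True)\
                    .str.normalize('NFKC')\
                    .str.encode('ascii', 'ignore')\
                    .str.decode('utf-8', 'ignore')
    return df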

In [6]:
df = basic_clean(df, 'content')
df.head(2)

Unnamed: 0,topic,title,author,content,basic_clean
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...


**Tokenization** - splitting larger strings of text into smaller pieces, or tokens, by setting a boundary. You might chunk a sentence into words using a space as a boundary, or a paragraph into sentences using punctuation as a boundary.

#### **<font color=purple>Tokenization Examples:</font>**

##### **Using `.split()`**

Tokenizing with `.split()` is simple, but each call is limited to a single delimiter.

In [None]:
text = 'Knowledge is the compound interest of curiosity. - James Clear'

In [None]:
text.split()

In [None]:
text = """There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."""

In [None]:
text.split('.')
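A few behaviors of `.split()` are easy to miss. The sketch below reuses the quote from above with an extra space added between "the" and "compound" (a modification made only for this illustration).

```python
# Same quote as above, with a double space inserted for demonstration
quote = 'Knowledge is the  compound interest of curiosity. - James Clear'

# No argument: split on any run of whitespace, so the double space disappears
print(quote.split()[:4])     # ['Knowledge', 'is', 'the', 'compound']

# Explicit single-space delimiter: the double space yields an empty string
print(quote.split(' ')[:5])  # ['Knowledge', 'is', 'the', '', 'compound']

# Punctuation stays attached to the neighboring word
print(quote.split()[6])      # curiosity.
```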

##### Using Regex

**<font color=purple>Identifiers</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

**<font color=purple>Quantifiers</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
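The pattern and match pairs from the identifier and quantifier tables can be checked directly with `re.fullmatch`, which succeeds only when the whole string matches the pattern. The list name `checks` is just a label for this sketch.

```python
import re

# (pattern, example match) pairs taken from the two tables above
checks = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]

for pattern, example in checks:
    # fullmatch returns a match object on success and None on failure
    matched = re.fullmatch(pattern, example) is not None
    print(f'{pattern!r:20} {example!r:18} {matched}')
```

Every row prints `True`. Swapping in a non-matching example, such as `'file_2'` for the first pattern, is a quick way to see a quantifier or identifier fail.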

**<font color=purple>More Regex</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example</th></tr>

<tr ><td><span >|</span></td><td>or statement</td><td>r'dog|cat'</td></tr>

<tr ><td><span >.</span></td><td>wildcard (any single character except a newline)</td><td>r'.at'</td></tr>

<tr ><td><span >^</span></td><td>starts with</td><td>r'^\d'</td></tr>

<tr ><td><span >[^]</span></td><td>exclusion</td><td>r'[^a-z]'</td></tr></table>

In [None]:
# Split your text using a regex pattern in .findall()

pattern = r'[\w]+'
text = 'Knowledge is the compound interest of curiosity. - James Clear'

tokens = re.findall(pattern, text)
tokens

In [None]:
# Use `.compile()` with .split(text) to split your text on more than one delimiter

pattern = re.compile(r'[.;!?]')
text = """There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."""

pattern.split(text)
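Splitting on punctuation usually leaves leading spaces on each piece and an empty string after the final delimiter. One way to tidy the result, sketched here on a shortened made-up passage, is to strip each piece and drop the empty ones.

```python
import re

# Shortened, made-up passage with several sentence delimiters
passage = "There's the victim. There's the hero; they win! Really?"

# Splitting on the delimiters leaves leading spaces and a trailing empty string
raw_parts = re.split(r'[.;!?]', passage)

# Strip whitespace and keep only the non-empty pieces
sentences = [part.strip() for part in raw_parts if part.strip()]
print(sentences)
# ["There's the victim", "There's the hero", 'they win', 'Really']
```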

##### Using NLTK Tokenization

```python
tokenizer = nltk.tokenize.ToktokTokenizer()

```

```python
df.col.apply(tokenizer.tokenize).str.join(' ')
```
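Before applying the tokenizer to a whole column, it can help to try it on one string. This is a small sketch assuming NLTK is installed; the sample sentence is made up, and `ToktokTokenizer` does not need any extra corpus download.

```python
from nltk.tokenize import ToktokTokenizer

tokenizer = ToktokTokenizer()

# Made-up sample in the same lowercase style as the basic_clean column
sample = "oxfam has announced that it'll be firing staff"

# Default call returns a list of tokens
tokens = tokenizer.tokenize(sample)
print(tokens)

# return_str=True gives the tokens back as one space-separated string
token_string = tokenizer.tokenize(sample, return_str=True)
print(token_string)
```

The list form is what the later stemming and lemmatizing steps iterate over, while the string form is convenient when the next step expects plain text.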

In [None]:
df.head(2)

In [None]:
tokenizer = nltk.tokenize.ToktokTokenizer()

In [None]:
df.basic_clean.apply(tokenizer.tokenize)[:2]

In [None]:
# Here we apply nltk's tokenizer to each row, or text, in our basic_clean Series

df['clean_tokes'] = df.basic_clean.apply(tokenizer.tokenize)

In [None]:
df.head(2)

##### Tokenize Function

In [7]:
def tokenize(df, col):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    df['clean_tokes'] = df[col].apply(tokenizer.tokenize)
    return df

In [8]:
df = tokenize(df, 'basic_clean')
df.head(2)

Unnamed: 0,topic,title,author,content,basic_clean,clean_tokes
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...,"[rbi, governor, shaktikanta, das, on, friday, ..."
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...,"[reserve, bank, of, india, governor, shaktikan..."


##### **Using NLTK PorterStemmer**

```python
ps = nltk.porter.PorterStemmer()
```

```python
stems = df.col.apply(lambda row: [ps.stem(word) for word in row])
```

```python
df['stemmed'] = stems.str.join(' ')
```

In [10]:
ps = nltk.porter.PorterStemmer()

nltk.stem.porter.PorterStemmer

In [None]:
ps.stem('turning')
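Stemming strips suffixes by rule, so the result is not always a dictionary word. The sketch below, assuming NLTK is installed, runs the Porter stemmer over a handful of words chosen for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words chosen for illustration; several appear in the article text
words = ['turning', 'released', 'has', 'structure']

for word in words:
    # Each stem is produced by suffix-stripping rules, not a dictionary lookup
    print(word, '->', ps.stem(word))
```

Stems such as `releas` are fine as long as every form of the word maps to the same stem, but they make the text harder for a person to read, which is one trade-off between stemming and lemmatizing.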

In [None]:
# Create a Series of lists of stemmed words using our cleaned tokens

stems = df.clean_tokes.apply(lambda row: [ps.stem(word) for word in row])
stems.head(2)

In [None]:
# Join our cleaned, stemmed lists of words back into strings of words/tokens

df['stemmed'] = stems.str.join(' ')

In [None]:
pd.options.display.max_colwidth = 300
df.head(2)

##### Stem Function

In [11]:
def stem(df, col):
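    '''
    This function takes in a df and a string for the name of a column of
    token lists and returns the original df with a new column called 'stemmed'.
    '''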
    # Create porter stemmer
    ps = nltk.porter.PorterStemmer()
    
    # Create a Series of lists of stemmed words using our cleaned tokens
    stems = df[col].apply(lambda row: [ps.stem(word) for word in row])
    
    # Join our cleaned, stemmed lists of words back into strings of words/tokens
    df['stemmed'] = stems.str.join(' ')
    
    return df

In [12]:
df = stem(df, 'clean_tokes')
df.head()

Unnamed: 0,topic,title,author,content,basic_clean,clean_tokes,stemmed
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...,"[rbi, governor, shaktikanta, das, on, friday, ...",rbi governor shaktikanta da on friday announc ...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...,"[reserve, bank, of, india, governor, shaktikan...",reserv bank of india governor shaktikanta da h...
2,business,Govt releases fare structure for domestic flig...,Nandini Sinha,The DGCA has released fare structure for domes...,the dgca has released fare structure for domes...,"[the, dgca, has, released, fare, structure, fo...",the dgca ha releas fare structur for domest fl...
3,business,"Oxfam to fire 1,450 staff, shut offices in 18 ...",Anushka Dixit,Oxfam International has announced that it'll b...,oxfam international has announced that it'll b...,"[oxfam, international, has, announced, that, i...",oxfam intern ha announc that it ' ll be fire 1...
4,business,Vaccine development is like rollercoaster: Ser...,Dharna,Serum Institute of India CEO Adar Poonawalla s...,serum institute of india ceo adar poonawalla s...,"[serum, institute, of, india, ceo, adar, poona...",serum institut of india ceo adar poonawalla sa...
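Because each helper takes `(df, col)` and returns a DataFrame, the steps can also be chained with `DataFrame.pipe`. The sketch below is one possible arrangement under stated assumptions: `add_clean`, `add_tokens`, and `add_stems` are simplified, hypothetical stand-ins for the notebook's functions that work on a copy instead of modifying the input, and the one-row frame is made up.

```python
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import ToktokTokenizer

def add_clean(df, col):
    """Return a copy with a lowercased, punctuation-free 'basic_clean' column."""
    df = df.copy()
    df['basic_clean'] = (df[col].str.lower()
                         .str.replace(r"[^a-z0-9'\s]", '', regex=True))
    return df

def add_tokens(df, col):
    """Return a copy with a 'clean_tokes' column of token lists."""
    df = df.copy()
    tokenizer = ToktokTokenizer()
    df['clean_tokes'] = df[col].apply(tokenizer.tokenize)
    return df

def add_stems(df, col):
    """Return a copy with a 'stemmed' column of space-joined stems."""
    df = df.copy()
    ps = PorterStemmer()
    df['stemmed'] = df[col].apply(lambda row: ' '.join(ps.stem(w) for w in row))
    return df

# Made-up one-row frame standing in for the news articles
demo = pd.DataFrame({'content': ['Banks released NEW fare structures!']})

prepped = (demo.pipe(add_clean, 'content')
               .pipe(add_tokens, 'basic_clean')
               .pipe(add_stems, 'clean_tokes'))

print(prepped.columns.tolist())
print(prepped.loc[0, 'stemmed'])
```

Working on a copy means `demo` keeps only its original column, which makes it easier to rerun a single step while experimenting.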


##### **Using NLTK WordNetLemmatizer**

```python
nltk.download('wordnet')
```

```python
wnl = nltk.stem.WordNetLemmatizer()
```

```python
lemmas = df.col.apply(lambda row: [wnl.lemmatize(word) for word in row])
```

```python
df['lemmatized'] = lemmas.str.join(' ')
```

In [None]:
wnl = nltk.stem.WordNetLemmatizer()

In [None]:
lemmas = df.clean_tokes.apply(lambda row: [wnl.lemmatize(word) for word in row])
lemmas.head(2)

In [None]:
# Join our cleaned, lemmatized lists of words back into strings of words/tokens

df['lemmatized'] = lemmas.str.join(' ')

In [None]:
df.head(2)

##### Lemmatize Function

In [13]:
def lemmatize(df, col):
    '''
    This function takes in a df and a string for column name and
    returns the original df with a new column called 'lemmatized'.
    '''
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = df[col].apply(lambda row: [wnl.lemmatize(word) for word in row])
    df['lemmatized'] = lemmas.str.join(' ')
    return df

In [14]:
df = lemmatize(df, 'clean_tokes')
df.head()

Unnamed: 0,topic,title,author,content,basic_clean,clean_tokes,stemmed,lemmatized
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...,"[rbi, governor, shaktikanta, das, on, friday, ...",rbi governor shaktikanta da on friday announc ...,rbi governor shaktikanta da on friday announce...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...,"[reserve, bank, of, india, governor, shaktikan...",reserv bank of india governor shaktikanta da h...,reserve bank of india governor shaktikanta da ...
2,business,Govt releases fare structure for domestic flig...,Nandini Sinha,The DGCA has released fare structure for domes...,the dgca has released fare structure for domes...,"[the, dgca, has, released, fare, structure, fo...",the dgca ha releas fare structur for domest fl...,the dgca ha released fare structure for domest...
3,business,"Oxfam to fire 1,450 staff, shut offices in 18 ...",Anushka Dixit,Oxfam International has announced that it'll b...,oxfam international has announced that it'll b...,"[oxfam, international, has, announced, that, i...",oxfam intern ha announc that it ' ll be fire 1...,oxfam international ha announced that it ' ll ...
4,business,Vaccine development is like rollercoaster: Ser...,Dharna,Serum Institute of India CEO Adar Poonawalla s...,serum institute of india ceo adar poonawalla s...,"[serum, institute, of, india, ceo, adar, poona...",serum institut of india ceo adar poonawalla sa...,serum institute of india ceo adar poonawalla s...
