<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Natural-Language-Processing" data-toc-modified-id="Natural-Language-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Natural Language Processing</a></span><ul class="toc-item"><li><span><a href="#What-is-Natural-Language-Processing?" data-toc-modified-id="What-is-Natural-Language-Processing?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><strong><font color="red">What is Natural Language Processing?</font></strong></a></span></li><li><span><a href="#So-What?" data-toc-modified-id="So-What?-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><strong><font color="orange">So What?</font></strong></a></span></li><li><span><a href="#Normalization-Examples:" data-toc-modified-id="Normalization-Examples:-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span><strong><font color="purple">Normalization Examples:</font></strong></a></span><ul class="toc-item"><li><span><a href="#Lowercase-Using-df.col.str.lower()" data-toc-modified-id="Lowercase-Using-df.col.str.lower()-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span><strong>Lowercase Using <code>df.col.str.lower()</code></strong></a></span></li><li><span><a href="#Normalize-Unicode-Characters" data-toc-modified-id="Normalize-Unicode-Characters-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span><strong>Normalize Unicode Characters</strong></a></span></li><li><span><a href="#Remove-Special-Characters-Using-Regex" data-toc-modified-id="Remove-Special-Characters-Using-Regex-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span><strong>Remove Special Characters Using Regex</strong></a></span></li><li><span><a href="#Stem-Characters-Using-nltk.porter.PorterStemmer()" data-toc-modified-id="Stem-Characters-Using-nltk.porter.PorterStemmer()-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span><strong>Stem Characters Using <code>nltk.porter.PorterStemmer()</code></strong></a></span></li></ul></li><li><span><a 
href="#Tokenization-Examples:" data-toc-modified-id="Tokenization-Examples:-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span><strong><font color="purple">Tokenization Examples:</font></strong></a></span><ul class="toc-item"><li><span><a href="#Using-.split()" data-toc-modified-id="Using-.split()-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span><strong>Using <code>.split()</code></strong></a></span></li><li><span><a href="#Using-Regex" data-toc-modified-id="Using-Regex-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Using Regex</a></span></li><li><span><a href="#Using-NLTK-Tokenization" data-toc-modified-id="Using-NLTK-Tokenization-1.4.3"><span class="toc-item-num">1.4.3&nbsp;&nbsp;</span>Using NLTK Tokenization</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import os
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from acquire_walkthrough import get_all_urls, get_blog_articles, get_news_articles

### Natural Language Processing

#### **<font color=red>What is Natural Language Processing?</font>**

Natural Language Processing (NLP) uses techniques from Python libraries like NLTK (Natural Language Toolkit) and spaCy to create machine-usable structure out of natural language text. In other words, you manipulate natural language so that it becomes useful in machine learning. Machine learning models can't work with raw words, but they can work with numbers, so we have to process the text in a way that retains as much of the original meaning as possible while representing it numerically.
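To make "representing the text with numbers" concrete, here is a minimal sketch of one common approach, a word-count (bag-of-words) vector. The two sample sentences and the names `docs`, `vocab`, and `to_counts` are made up for illustration; real projects typically lean on a library for this step.

```python
from collections import Counter

# Two tiny, made-up documents
docs = ["the cat sat", "the cat saw the dog"]

# Vocabulary: every distinct word, sorted so each word has a stable position
vocab = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_counts(doc) for doc in docs]

print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a list of numbers with the same length, which is the kind of input a model can work with.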

#### **<font color=orange>So What?</font>**

We need to know some basic terminology to get started:

**Normalization** - performing a series of tasks that make text consistent, like making all text lowercase, removing punctuation, expanding contractions, removing anything that's not an ASCII character, etc.
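Expanding contractions is the one task in that list that the examples in this notebook don't cover, so here is a rough sketch of how it could look. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical and deliberately tiny; a real mapping would be much longer, and some contractions are ambiguous ("there's" can mean "there is" or "there has").

```python
import re

# Hypothetical, deliberately small lookup table
CONTRACTIONS = {
    "it'll": "it will",
    "don't": "do not",
    "can't": "cannot",
    "there's": "there is",  # assumption: could also mean "there has"
}

# One pattern that matches any key as a whole word
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions(text):
    """Lowercase the text, then swap each known contraction for its expansion."""
    return _pattern.sub(lambda match: CONTRACTIONS[match.group(0)], text.lower())

print(expand_contractions("It'll be fine, don't worry"))
# it will be fine, do not worry
```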

#### **<font color=purple>Normalization Examples:</font>**

##### **Lowercase Using `df.col.str.lower()`**

In [2]:
df = get_news_articles()
df.head()

Unnamed: 0,topic,title,author,content
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...
2,business,Govt releases fare structure for domestic flig...,Nandini Sinha,The DGCA has released fare structure for domes...
3,business,"Oxfam to fire 1,450 staff, shut offices in 18 ...",Anushka Dixit,Oxfam International has announced that it'll b...
4,business,Vaccine development is like rollercoaster: Ser...,Dharna,Serum Institute of India CEO Adar Poonawalla s...


In [5]:
# Note: `.str.lower()` returns a new Series. I have not reassigned it yet; this is just a look.

df.content.str.lower()

0     rbi governor shaktikanta das on friday announc...
1     reserve bank of india governor shaktikanta das...
2     the dgca has released fare structure for domes...
3     oxfam international has announced that it'll b...
4     serum institute of india ceo adar poonawalla s...
                            ...                        
95    actress-singer ila arun has said she was initi...
96    late actor irrfan khan's son babil khan on thu...
97    actress esha gupta, while talking about her 20...
98    taking to instagram, priyanka chopra shared a ...
99    reacting to anushka sharma's video wherein vir...
Name: content, Length: 100, dtype: object

##### **Normalize Unicode Characters**

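[Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.normalize.html) is the documentation for `Series.str.normalize()`, which applies `unicodedata.normalize()` to each string in a Pandas Series. Unlike `unicodedata.normalize(form, unistr)`, the Series method takes only the form; the strings come from the Series itself.

`df.col.str.normalize(form)`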

`df.col.str.encode('ascii', 'ignore')`

`df.col.str.decode('utf-8', 'ignore')`

In [6]:
# Again, this is a look because it has not been reassigned or changed in place.

df.content.str.normalize('NFKC').str.encode('ascii', 'ignore').str.decode('utf-8', 'ignore')

0     RBI Governor Shaktikanta Das on Friday announc...
1     Reserve Bank of India Governor Shaktikanta Das...
2     The DGCA has released fare structure for domes...
3     Oxfam International has announced that it'll b...
4     Serum Institute of India CEO Adar Poonawalla s...
                            ...                        
95    Actress-singer Ila Arun has said she was initi...
96    Late actor Irrfan Khan's son Babil Khan on Thu...
97    Actress Esha Gupta, while talking about her 20...
98    Taking to Instagram, Priyanka Chopra shared a ...
99    Reacting to Anushka Sharma's video wherein Vir...
Name: content, Length: 100, dtype: object
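The choice of normalization form matters once you encode to ASCII with `'ignore'`. The sketch below runs the same normalize, encode, decode steps on a single plain string with the standard library; the `to_ascii` helper and the sample string are made up for illustration. `NFKC` keeps accented letters composed, so the whole character is dropped, while `NFKD` decomposes them into a base letter plus a combining mark, so only the mark is dropped.

```python
import unicodedata

def to_ascii(text, form='NFKC'):
    """Normalize, then drop anything that cannot be encoded as ASCII."""
    normalized = unicodedata.normalize(form, text)
    return normalized.encode('ascii', 'ignore').decode('utf-8', 'ignore')

# "Café naïve ﬁnance": two accented letters and an "fi" ligature (U+FB01)
sample = 'Caf\u00e9 na\u00efve \ufb01nance'

print(to_ascii(sample, 'NFKC'))  # Caf nave finance
print(to_ascii(sample, 'NFKD'))  # Cafe naive finance
```

If your text has many accented names, `NFKD` may be the better fit; both forms turn the ligature into plain `fi`.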

##### **Remove Special Characters Using Regex**

I found [this article](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/) very helpful when using Regex in Pandas!

In [7]:
# Again, this is a look because it has not been reassigned or changed in place.

df.content.str.replace(r"[^A-Za-z0-9'\s]", '', regex=True)

0     RBI Governor Shaktikanta Das on Friday announc...
1     Reserve Bank of India Governor Shaktikanta Das...
2     The DGCA has released fare structure for domes...
3     Oxfam International has announced that it'll b...
4     Serum Institute of India CEO Adar Poonawalla s...
                            ...                        
95    Actresssinger Ila Arun has said she was initia...
96    Late actor Irrfan Khan's son Babil Khan on Thu...
97    Actress Esha Gupta while talking about her 201...
98    Taking to Instagram Priyanka Chopra shared a s...
99    Reacting to Anushka Sharma's video wherein Vir...
Name: content, Length: 100, dtype: object
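One thing to watch for in character classes: a range like `A-z` is not the same as `A-Za-z`. It covers every ASCII code point from `A` to `z`, which includes the characters `[`, `\`, `]`, `^`, `_`, and the backtick. Here is a quick check with the standard library `re` module on a made-up string:

```python
import re

sample = "snake_case^2 [draft] it's 100% done!"

# `A-z` also covers [ \ ] ^ _ and the backtick, so those characters survive
loose = re.sub(r"[^A-z0-9'\s]", '', sample)

# Spelling out both ranges keeps only letters, digits, apostrophes, whitespace
strict = re.sub(r"[^A-Za-z0-9'\s]", '', sample)

print(loose)   # snake_case^2 [draft] it's 100 done
print(strict)  # snakecase2 draft it's 100 done
```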

In [3]:
# Now, I can chain these together and reassign to my df as a new column

df['basic_clean'] = df.content.str.lower()\
                    .str.replace(r"[^A-Za-z0-9'\s]", '', regex=True)\
                    .str.normalize('NFKC')\
                    .str.encode('ascii', 'ignore')\
                    .str.decode('utf-8', 'ignore')

In [4]:
df.head()

Unnamed: 0,topic,title,author,content,basic_clean
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...
2,business,Govt releases fare structure for domestic flig...,Nandini Sinha,The DGCA has released fare structure for domes...,the dgca has released fare structure for domes...
3,business,"Oxfam to fire 1,450 staff, shut offices in 18 ...",Anushka Dixit,Oxfam International has announced that it'll b...,oxfam international has announced that it'll b...
4,business,Vaccine development is like rollercoaster: Ser...,Dharna,Serum Institute of India CEO Adar Poonawalla s...,serum institute of india ceo adar poonawalla s...
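If you want to reuse the cleaning steps outside of a DataFrame, the same chain can be sketched as a function on a single string using only the standard library. This mirrors the order of the chained Pandas calls (lowercase, regex, Unicode normalization, ASCII round trip); the function body and the sample sentence are my own illustration.

```python
import re
import unicodedata

def basic_clean(text):
    """Lowercase, strip special characters, and reduce the text to ASCII."""
    text = text.lower()
    # Keep letters, digits, apostrophes, and whitespace
    text = re.sub(r"[^a-z0-9'\s]", '', text)
    text = unicodedata.normalize('NFKC', text)
    return text.encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(basic_clean("Actress-singer Ila Arun, it'll be 2020!"))
# actresssinger ila arun it'll be 2020
```

A function like this could then be applied row by row with `df.content.apply(basic_clean)`.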


**Tokenization** - splitting larger strings of text into smaller pieces, or tokens, by setting a boundary. You might chunk a sentence into words using a space as a boundary, or a paragraph into sentences using punctuation as a boundary.

#### **<font color=purple>Tokenization Examples:</font>**

##### **Using `.split()`**

Tokenizing with `.split()` is simple, but each call is limited to one delimiter: runs of whitespace when called with no argument, or the exact string you pass in.

In [8]:
text = 'Knowledge is the compound interest of curiosity. - James Clear'

In [9]:
text.split()

['Knowledge',
 'is',
 'the',
 'compound',
 'interest',
 'of',
 'curiosity.',
 '-',
 'James',
 'Clear']
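Calling `.split()` with no argument and calling `.split(' ')` behave differently, which is easy to miss. A small sketch on a made-up string with a double space, a tab, and a newline:

```python
text = "Knowledge  is\tthe compound\ninterest"

# No argument: split on any run of whitespace and drop empty strings
whitespace_tokens = text.split()

# Explicit ' ': split on every single space, and nothing else
single_space_tokens = text.split(' ')

print(whitespace_tokens)
# ['Knowledge', 'is', 'the', 'compound', 'interest']
print(single_space_tokens)
# ['Knowledge', '', 'is\tthe', 'compound\ninterest']
```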

In [10]:
text = """There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."""

In [11]:
text.split('.')

["There's the kind of person who is always the victim in any story they tell",
 ' Always on the receiving end of some injustice',
 " There's the kind of person who is always the kind of hero of every story they tell",
 " There's the smart person; they delivered the clever put down there",
 '']
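Splitting on `'.'` leaves leading spaces on each piece and an empty string at the end. One way to tidy that up, sketched here on a shorter made-up string, is to strip each piece and skip the ones that end up empty:

```python
text = "Always on the receiving end. There's the smart person; they delivered. "

# Strip surrounding whitespace and drop pieces that are empty after stripping
sentences = [part.strip() for part in text.split('.') if part.strip()]

print(sentences)
# ['Always on the receiving end', "There's the smart person; they delivered"]
```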

##### Using Regex

**<font color=purple>Identifiers</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

**<font color=purple>Quantifiers</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

**<font color=purple>More Regex</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example</th></tr>
    
<tr ><td><span >|</span></td><td>or statement</td><td>r'dog|cat'</td></tr>

<tr ><td><span >.</span></td><td>wildcard (any single character except a newline)</td><td>r'.at'</td></tr>
    
<tr ><td><span >^</span></td><td>starts with</td><td>r'^\d'</td></tr>
    
<tr ><td><span >[^]</span></td><td>exclusion</td><td>r'[^a-z]'</td></tr></table>
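A quick way to convince yourself that the identifier and quantifier examples above behave as described is to check each pattern against its example match with `re.fullmatch()`, which only succeeds when the pattern accounts for the entire string. This is just a sanity-check sketch; the `examples` list simply restates the table rows.

```python
import re

# (pattern, example match) pairs restated from the tables above
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]

# True when the pattern matches the whole example string
results = {
    pattern: bool(re.fullmatch(pattern, sample))
    for pattern, sample in examples
}

for pattern, matched in results.items():
    print(f'{pattern!r:22} {matched}')
```

Note that `\w` matches underscores as well as letters and digits, which is why `A-b_1` works.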

In [12]:
pattern = r'[\w]+'
text = 'Knowledge is the compound interest of curiosity. - James Clear'

tokens = re.findall(pattern, text)
tokens

['Knowledge',
 'is',
 'the',
 'compound',
 'interest',
 'of',
 'curiosity',
 'James',
 'Clear']
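One side effect of `\w+` worth knowing: an apostrophe is not a word character, so contractions get split in two. If you want to keep them whole, you can add the apostrophe to the character class. A small sketch on a made-up phrase:

```python
import re

text = "There's the kind of person"

# `\w+` stops at the apostrophe, so "There's" becomes two tokens
word_tokens = re.findall(r"\w+", text)

# Allowing apostrophes inside the class keeps the contraction together
contraction_tokens = re.findall(r"[\w']+", text)

print(word_tokens)         # ['There', 's', 'the', 'kind', 'of', 'person']
print(contraction_tokens)  # ["There's", 'the', 'kind', 'of', 'person']
```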

In [13]:
# Use `re.compile()` with `.split(text)` to split your text on more than one delimiter

pattern = re.compile(r'[.;!?]')
text = """There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."""

pattern.split(text)

["There's the kind of person who is always the victim in any story they tell",
 ' Always on the receiving end of some injustice',
 " There's the kind of person who is always the kind of hero of every story they tell",
 " There's the smart person",
 ' they delivered the clever put down there',
 '']
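Splitting on the punctuation itself throws the punctuation away and leaves leading spaces and a trailing empty string. If you would rather keep the punctuation attached to each piece, one option is to split on the whitespace that follows it, using a lookbehind. This is a sketch on a shortened, made-up string, not the only way to do it:

```python
import re

text = "Always on the receiving end. There's the smart person; they delivered."

# Split on whitespace only when it comes right after . ; ! or ?
# The lookbehind (?<=...) checks for the punctuation without consuming it.
pattern = re.compile(r'(?<=[.;!?])\s+')

pieces = pattern.split(text)

print(pieces)
# ['Always on the receiving end.', "There's the smart person;", 'they delivered.']
```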

##### Using NLTK Tokenization



In [15]:
df.head(2)

Unnamed: 0,topic,title,author,content,basic_clean
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...


In [18]:
tokenizer = nltk.tokenize.ToktokTokenizer()

In [32]:
# Here we apply nltk's tokenizer to each row, or text, in our basic_clean Series

df.basic_clean.apply(tokenizer.tokenize).head()

0    [rbi, governor, shaktikanta, das, on, friday, ...
1    [reserve, bank, of, india, governor, shaktikan...
2    [the, dgca, has, released, fare, structure, fo...
3    [oxfam, international, has, announced, that, i...
4    [serum, institute, of, india, ceo, adar, poona...
Name: basic_clean, dtype: object

In [29]:
# We can use `.str.join(' ')` to join the tokens in our list with spaces and store in new col

df['clean_tokes'] = df.basic_clean.apply(tokenizer.tokenize).str.join(' ')

In [30]:
df.head(2)

Unnamed: 0,topic,title,author,content,basic_clean,clean_tokes
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...,rbi governor shaktikanta das on friday announc...,rbi governor shaktikanta das on friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...,reserve bank of india governor shaktikanta das...,reserve bank of india governor shaktikanta das...


##### **Stem Characters Using `nltk.porter.PorterStemmer()`**

