<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Natural-Language-Processing" data-toc-modified-id="Natural-Language-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Natural Language Processing</a></span><ul class="toc-item"><li><span><a href="#What-is-Natural-Language-Processing?" data-toc-modified-id="What-is-Natural-Language-Processing?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><strong><font color="red">What is Natural Language Processing?</font></strong></a></span></li><li><span><a href="#So-What?" data-toc-modified-id="So-What?-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><strong><font color="orange">So What?</font></strong></a></span><ul class="toc-item"><li><span><a href="#Tokenization-Examples:" data-toc-modified-id="Tokenization-Examples:-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span><strong><font color="purple">Tokenization Examples:</font></strong></a></span></li><li><span><a href="#Using-.split()" data-toc-modified-id="Using-.split()-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span><strong>Using <code>.split()</code></strong></a></span></li><li><span><a href="#Using-Regex" data-toc-modified-id="Using-Regex-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Using Regex</a></span></li></ul></li></ul></li></ul></div>

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import os
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from acquire_walkthrough import get_all_urls, get_blog_articles, get_news_articles

### Natural Language Processing

#### **<font color=red>What is Natural Language Processing?</font>**

Natural Language Processing (NLP) uses techniques from Python libraries such as NLTK (Natural Language Toolkit) and spaCy to create machine-usable structure out of natural language text. In other words, you manipulate natural language so that it becomes useful for machine learning. Machine learning models operate on numbers, not words, so we have to process the text in a way that represents it numerically while retaining as much of the original meaning as possible.
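To make "representing the text with numbers" concrete, here is a minimal sketch that maps each distinct word to an integer id. The sentence and the first-appearance numbering scheme are assumptions chosen for illustration; real pipelines typically use richer encodings.

```python
# Toy example: turn a sentence into a list of integer ids.
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Build a vocabulary, assigning ids in order of first appearance.
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

# Replace every token with its id; repeated words share an id.
encoded = [vocab[token] for token in tokens]

print(vocab)    # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(encoded)  # [0, 1, 2, 3, 0, 4]
```

Both occurrences of `the` receive id `0`, so the numeric form still records that the same word appears twice.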

#### **<font color=orange>So What?</font>**

We need to know some basic terminology to get started:

**Tokenization** is splitting a larger string of text into smaller pieces, or tokens, at a chosen boundary. You might split a sentence into words using whitespace as the boundary, or a paragraph into sentences using punctuation as the boundary.

#### **<font color=purple>Tokenization Examples:</font>**

##### **Using `.split()`**

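Tokenizing with `.split()` is simple, but it splits on only one delimiter at a time: runs of whitespace by default, or the exact string you pass in. Punctuation also stays attached to the neighboring word, as with `'curiosity.'` below.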

In [11]:
text = 'Knowledge is the compound interest of curiosity. - James Clear'

In [12]:
text.split()

['Knowledge',
 'is',
 'the',
 'compound',
 'interest',
 'of',
 'curiosity.',
 '-',
 'James',
 'Clear']
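The difference between calling `.split()` with no argument and with an explicit `' '` is easy to miss. The sketch below uses a made-up string (`messy`, not part of the original notebook) containing a double space, a tab, and a newline.

```python
messy = "Knowledge  is\tthe compound\ninterest"

# No argument: any run of whitespace (spaces, tabs, newlines) is one boundary.
default_tokens = messy.split()

# Explicit ' ': only a single space is a boundary, so the double space
# yields an empty string and the tab/newline stay inside the tokens.
space_tokens = messy.split(" ")

print(default_tokens)  # ['Knowledge', 'is', 'the', 'compound', 'interest']
print(space_tokens)    # ['Knowledge', '', 'is\tthe', 'compound\ninterest']
```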

In [38]:
text = """There's the kind of person who is always the victim in any story they tell. Always on the receiving end of some injustice. There's the kind of person who is always the kind of hero of every story they tell. There's the smart person; they delivered the clever put down there."""

In [39]:
text.split('.')

["There's the kind of person who is always the victim in any story they tell",
 ' Always on the receiving end of some injustice',
 " There's the kind of person who is always the kind of hero of every story they tell",
 " There's the smart person; they delivered the clever put down there",
 '']
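Splitting on `'.'` leaves leading spaces and a trailing empty string, and it ignores other sentence-ending punctuation. One possible cleanup, sketched here on a short made-up sample (`sample` is an assumption, not the notebook's text), is to split on a character class with `re.split` and then strip and filter the pieces.

```python
import re

sample = "Always the victim. Always the hero? Always the smart person!"

# Plain .split('.') only knows about periods.
by_period = sample.split(".")

# re.split accepts a character class, so '.', '?' and '!' all act as boundaries.
# The comprehension strips surrounding whitespace and drops empty pieces.
sentences = [s.strip() for s in re.split(r"[.?!]", sample) if s.strip()]

print(by_period)  # ['Always the victim', ' Always the hero? Always the smart person!']
print(sentences)  # ['Always the victim', 'Always the hero', 'Always the smart person']
```

This is still a naive sentence splitter: abbreviations such as "Dr." or decimal numbers would be cut in the wrong place.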

##### Using Regex

**<font color=purple>Identifiers</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Neither alphanumeric nor underscore</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
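The rows of the Identifiers table can be checked directly with Python's `re` module. This is a small sketch; the `examples` list simply restates the table's pattern and match columns.

```python
import re

# (pattern, example match) pairs taken from the Identifiers table.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

for pattern, candidate in examples:
    # re.fullmatch succeeds only if the whole string matches the pattern.
    matched = re.fullmatch(pattern, candidate) is not None
    print(f"{pattern!r:20} {candidate!r:12} {matched}")

# Identifiers also work with re.findall, e.g. pulling the digits out of a string.
digits = re.findall(r"\d", "file_25")
print(digits)  # ['2', '5']
```

Note that `\w` matches the underscore in `A-b_1`, which is why that example succeeds.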

**<font color=purple>Quantifiers</font>**

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
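The quantifier examples can be verified the same way, and a quantifier is what turns an identifier into a simple regex tokenizer. The sketch below restates the table's patterns (with the zero-or-more quantifier written as a plain `*`) and then applies `\w+` to the James Clear quote used earlier; treating `\w+` as a word tokenizer is an illustrative assumption, not the only option.

```python
import re

# (pattern, example match) pairs based on the Quantifiers table.
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "abc"),
    (r"\d{2,4}", "123"),
    (r"\w{3,}", "anycharacters"),
    (r"A*B*C*", "AAACC"),   # zero or more A, then B, then C
    (r"plurals?", "plural"),
]
results = [re.fullmatch(p, c) is not None for p, c in examples]
print(results)  # [True, True, True, True, True, True]

# {2,4} rejects strings that are too short or too long.
too_short = re.fullmatch(r"\d{2,4}", "1")
too_long = re.fullmatch(r"\d{2,4}", "12345")
print(too_short, too_long)  # None None

# \w+ grabs each run of word characters, so punctuation is dropped.
quote = "Knowledge is the compound interest of curiosity. - James Clear"
words = re.findall(r"\w+", quote)
print(words)
# ['Knowledge', 'is', 'the', 'compound', 'interest', 'of', 'curiosity', 'James', 'Clear']
```

Compared with `.split()`, the regex version returns `'curiosity'` without the trailing period and discards the standalone `-`.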

In [9]:
df = get_news_articles()

In [10]:
df.head()

Unnamed: 0,topic,title,author,content
0,business,RBI allows banks to offer moratorium on EMI pa...,Krishna Veera Vanamali,RBI Governor Shaktikanta Das on Friday announc...
1,business,GDP growth in 2020-21 expected to remain in ne...,Ankush Verma,Reserve Bank of India Governor Shaktikanta Das...
2,business,Govt releases fare structure for domestic flig...,Nandini Sinha,The DGCA has released fare structure for domes...
3,business,"Oxfam to fire 1,450 staff, shut offices in 18 ...",Anushka Dixit,Oxfam International has announced that it'll b...
4,business,Vaccine development is like rollercoaster: Ser...,Dharna,Serum Institute of India CEO Adar Poonawalla s...
