# Natural Language Processing (NLP)

### NLP

▪ NLP is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language.

<img src="nlp.jpg" width="500">

### Structured Data vs. Unstructured Data

▪ Structured data is standardized, clearly defined, and searchable data, while unstructured data is raw data of various forms.

<img src="structured_vs_unstructured_data.png" width="500">

### Garbage In, Garbage Out

▪ Garbage in, garbage out is a concept common to computer science where the quality of output is determined by the quality of the input. 

<img src="garbage.png" width="500">

### Text Preprocessing

<img src="text_preprocessing.png" width="700">

▪ Unstructured data must first be cleaned and pre-processed before analysis.

▪ Two Python libraries commonly used for text preprocessing are TextBlob and NLTK.

# NLTK and Preprocessing Techniques

### NLP Toolkits

▪ NLTK, which stands for Natural Language Toolkit, is a suite of libraries built for working with NLP in Python.

### Prerequisites

▪ NLTK and NLTK dataset

### Installing NLTK via Anaconda Prompt

pip install nltk

### Installing NLTK Dataset via Jupyter Notebook

In [4]:
import nltk

# nltk.download() opens the NLTK downloader; downloading all data and models will take a while
# Do this step only if nltk_data is not available on your PC
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Text Data

▪ Text data is messy.

▪ To analyze this data, it has to be preprocessed into clean text in a format that machine models can understand.

![](https://i.imgur.com/3L6x92C.png)

## Text Data: Sample 

In [5]:
original_text = "Hi Mr. Smith! I'm going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"

## Remove Punctuation

In [6]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
import re # Regular expression library

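# re.escape() ensures characters such as \ ] ^ - are treated literally inside the character class
clean_text = re.sub('[%s]' % re.escape(string.punctuation), '', original_text)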
clean_text

In [None]:
# https://pynative.com/python-regex-special-sequences-and-character-classes/

# Remove every character that is neither a word character nor whitespace
clean_text = re.sub(r'[^\w\s]', '', original_text)
clean_text
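
▪ The same result can be obtained without regular expressions. The sketch below, which assumes the original_text defined earlier, uses str.maketrans() and str.translate() to delete every character listed in string.punctuation; the names table and no_punct are only illustrative.

```python
import string

original_text = ("Hi Mr. Smith! I'm going to buy some vegetables "
                 "(2 tomatoes and 4 cucumbers) from the store. "
                 "Should I pick up some black-eyed peas as well?")

# Build a translation table that maps every punctuation character to None
table = str.maketrans('', '', string.punctuation)

# translate() deletes those characters in a single pass
no_punct = original_text.translate(table)
print(no_punct)
```

▪ Note that "I'm" becomes "Im" and "black-eyed" becomes "blackeyed", exactly as with the regex versions above.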

## Remove Numbers

In [None]:
# Removes all digit characters (letters attached to a digit are kept)
clean_text = re.sub(r'\d', '', clean_text)
clean_text

In [None]:
# Keep only letters and whitespace: removes punctuation and digits in one step
clean_text = re.sub(r'[^A-Za-z\s]', '', original_text)
clean_text
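
▪ Deleting digits and deleting whole tokens that contain digits are different operations. A minimal sketch of the difference, using a made-up sentence (sample is not part of the notebook):

```python
import re

sample = "Order 66 shipped to room 4b on day 2"

# Remove only the digit characters; letters attached to them survive
digits_removed = re.sub(r'\d', '', sample)

# Remove every token that contains at least one digit, then tidy the spaces
tokens_removed = re.sub(r'\w*\d\w*', '', sample)
tokens_removed = re.sub(r'\s+', ' ', tokens_removed).strip()

print(repr(digits_removed))
print(repr(tokens_removed))
```

▪ In the first result "4b" is reduced to "b", while in the second the whole token disappears.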

## Convert Text to Lowercase

<img src="vegetarian.jpg" width="500">

In [None]:
sample_text = "The Nature's Vegetarian Restaurant located at Bangsar is a fantastic vegetarian restaurant."
print(sample_text)

In [None]:
clean_text = clean_text.lower()
clean_text

## Word Tokenization (original_text)

▪ Tokenization is the process of breaking down a phrase, sentence, paragraph, or an entire text document into smaller units.

<img src="tokenization.jpeg" width="700">

In [None]:
original_text = "Hi Mr. Smith! I'm going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"

print(original_text)

### \#1 Word Tokenization with word_tokenize()

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(original_text) 
print(tokens)

### \#2 Word Tokenization with regexp_tokenize()

In [None]:
from nltk import regexp_tokenize

tokens = regexp_tokenize(original_text, pattern = r'\w+')
print(tokens)

### \#3 Word Tokenization with split()

In [None]:
tokens = original_text.split()
print(tokens)

### \#4 Word Tokenization with Regex

In [None]:
import re

tokens = re.split(r"\W+", original_text)
print(tokens)

In [None]:
import re

tokens = re.findall(r"\w+", original_text)
print(tokens)
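
▪ re.split() and re.findall() do not behave identically: when the text begins or ends with a non-word character, re.split() returns an empty string at that edge. A small sketch, assuming the same original_text (raw_split and split_tokens are illustrative names):

```python
import re

original_text = ("Hi Mr. Smith! I'm going to buy some vegetables "
                 "(2 tomatoes and 4 cucumbers) from the store. "
                 "Should I pick up some black-eyed peas as well?")

raw_split = re.split(r"\W+", original_text)
find_tokens = re.findall(r"\w+", original_text)

# The text ends with "?", so split() leaves a trailing empty string
print(raw_split[-1] == "")

# Dropping empty strings makes the two results identical
split_tokens = [token for token in raw_split if token]
print(split_tokens == find_tokens)
```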

## Word Tokenization (clean_text)

### \#1 Word Tokenization with word_tokenize()

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(clean_text) 
print(tokens)

### \#2 Word Tokenization with regexp_tokenize()

In [None]:
from nltk import regexp_tokenize

tokens = regexp_tokenize(clean_text, pattern = r'\w+')
print(tokens)

### \#3 Word Tokenization with split()

In [None]:
tokens = clean_text.split()
print(tokens)

### \#4 Word Tokenization with Regex

In [None]:
import re

tokens = re.split(r"\W+", clean_text)
print(tokens)

In [None]:
import re

tokens = re.findall(r"\w+", clean_text)
print(tokens)

## Sentence Tokenization (original_text)

### \#1 Sentence Tokenization with sent_tokenize()

In [None]:
from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(original_text)
sent_tokens

### \#2 Sentence Tokenization with split()

In [None]:
sent_tokens = original_text.split(". ")
sent_tokens

In [None]:
sent_tokens = re.split("[?!.] ", original_text)
sent_tokens
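
▪ Splitting on punctuation followed by a space discards the punctuation and also breaks at abbreviations such as "Mr.". The sketch below uses a lookbehind so that the punctuation is kept; it still splits after "Mr.", which is the kind of case sent_tokenize() typically handles better.

```python
import re

original_text = ("Hi Mr. Smith! I'm going to buy some vegetables "
                 "(2 tomatoes and 4 cucumbers) from the store. "
                 "Should I pick up some black-eyed peas as well?")

# (?<=[.!?]) is a lookbehind: split at whitespace that follows ., ! or ?
# without consuming the punctuation itself
sentences = re.split(r"(?<=[.!?])\s+", original_text)

for sentence in sentences:
    print(sentence)
```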

## Remove Stop Words

![](https://i.imgur.com/T5RJXrX.png)

In [None]:
from nltk.corpus import stopwords

print(stopwords.fileids())

### Print English Stop Words

In [None]:
stop_words = stopwords.words('english')
print(stop_words)

stop_words_2 = stopwords.words('chinese')
print(stop_words_2)

### Identify Stop Words from the following Sentence

<h3><center>Stop words are commonly used words that generally don’t contribute anything to the meaning of the text.</center></h3>

### Sort the list of Stop Words

In [None]:
stop_words.sort() # sorts the list in place; sorted(stop_words) would return a new sorted list instead
print(stop_words)

### Print Language-X Stop Words

In [None]:
stop_words_3 = stopwords.words('danish')
print(stop_words_3)

### Remove Stop Words from clean_text

In [None]:
print(tokens)

In [None]:
tokens_x_stopwords = [token for token in tokens if token not in stop_words]

print(tokens_x_stopwords)
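
▪ NLTK's English stop word list is lowercase, so the membership test is case-sensitive: "The" is not filtered unless the tokens are lowercased first. A minimal sketch of this, using a small hand-picked set (demo_stop_words is a stand-in for illustration, not NLTK's list):

```python
# A tiny stand-in for stopwords.words('english'), for illustration only
demo_stop_words = {"the", "is", "a", "at", "to", "i"}

tokens = ["The", "store", "is", "at", "the", "corner"]

# Case-sensitive test: "The" slips through because only "the" is listed
kept_as_is = [token for token in tokens if token not in demo_stop_words]

# Lowercasing each token before the test removes it as intended
kept_lowered = [token for token in tokens
                if token.lower() not in demo_stop_words]

print(kept_as_is)
print(kept_lowered)
```

▪ Storing the stop words in a set rather than a list also makes each lookup faster on large corpora.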

## Stemming

![](https://i.imgur.com/9qllh8j.png)

### Stemming with LancasterStemmer

In [7]:
words_1 = ['Connects', 'Connecting', 'Connections', 'Connected', 'Connection', 'Connectings', 'Connect']
words_2 = ['drive', 'drives', 'driver', 'drivers', 'driven', 'driving']

In [8]:
from nltk.stem.lancaster import LancasterStemmer
lc_stemmer = LancasterStemmer()

for word_1 in words_1:
    print(word_1, "-->", lc_stemmer.stem(word_1))

Connects --> connect
Connecting --> connect
Connections --> connect
Connected --> connect
Connection --> connect
Connectings --> connect
Connect --> connect


In [9]:
for word_2 in words_2:
    print(word_2, "-->", lc_stemmer.stem(word_2))

drive --> driv
drives --> driv
driver --> driv
drivers --> driv
driven --> driv
driving --> driv


### Stemming with PorterStemmer

In [10]:
from nltk.stem import PorterStemmer
pt_stemmer = PorterStemmer()

for word_1 in words_1:
    print(word_1, "-->", pt_stemmer.stem(word_1))

Connects --> connect
Connecting --> connect
Connections --> connect
Connected --> connect
Connection --> connect
Connectings --> connect
Connect --> connect


In [11]:
for word_2 in words_2:
    print(word_2, "--->", pt_stemmer.stem(word_2))

drive ---> drive
drives ---> drive
driver ---> driver
drivers ---> driver
driven ---> driven
driving ---> drive


### Stemming with SnowballStemmer

In [12]:
from nltk.stem import SnowballStemmer
sb_stemmer = SnowballStemmer(language = 'english')

for word_1 in words_1:
    print(word_1, "--->", sb_stemmer.stem(word_1))

Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect


In [13]:
for word_2 in words_2:
    print(word_2, "--->", sb_stemmer.stem(word_2))

drive ---> drive
drives ---> drive
driver ---> driver
drivers ---> driver
driven ---> driven
driving ---> drive
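
▪ The stemmers above apply suffix-rewriting rules rather than dictionary lookups, which is why they can return non-words such as "driv". The toy function below is a deliberately naive sketch of that idea; naive_stem and its suffix list are made up for illustration and are far cruder than the Lancaster, Porter, or Snowball rules.

```python
def naive_stem(word, suffixes=("ings", "ions", "ing", "ion", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words_1 = ['Connects', 'Connecting', 'Connections', 'Connected',
           'Connection', 'Connectings', 'Connect']

for word in words_1:
    print(word, "-->", naive_stem(word))
```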


## Lemmatization

![](https://i.imgur.com/9qllh8j.png)

### Lemmatization with WordNetLemmatizer

In [14]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wn_lemmatizer = WordNetLemmatizer()

for word_1 in words_1:
    print(word_1, "--->", wn_lemmatizer.lemmatize(word_1))

Connects ---> Connects
Connecting ---> Connecting
Connections ---> Connections
Connected ---> Connected
Connection ---> Connection
Connectings ---> Connectings
Connect ---> Connect


In [15]:
for word_2 in words_2:
    print(word_2, "--->", wn_lemmatizer.lemmatize(word_2))

drive ---> drive
drives ---> drive
driver ---> driver
drivers ---> driver
driven ---> driven
driving ---> driving


### Lemmatization with WordNetLemmatizer on clean_text

In [16]:
print(tokens_x_stopwords)


In [17]:
lemma_x_stopwords = [wn_lemmatizer.lemmatize(word_1) for word_1 in tokens_x_stopwords]

print(lemma_x_stopwords)


## Parts of Speech Tagging

![](https://i.imgur.com/8edVsCR.png)

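▪ POS tagging should be done at the very beginning, on the original text, since a word's tag depends on its context and that context may change after the text is cleaned.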

In [18]:
# upenn_tagset() prints the tag descriptions itself and returns None
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### POS Tagging on Sample Text

In [19]:
from nltk.tag import pos_tag

text_1 = "James Smith lives in the United States."

tokens = pos_tag(word_tokenize(text_1))
print(tokens)

[('James', 'NNP'), ('Smith', 'NNP'), ('lives', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]


### POS Tagging on original_text

In [20]:
print(original_text)

Hi Mr. Smith! I'm going to buy some vegetables (2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?


In [21]:
tokens = pos_tag(word_tokenize(original_text))
print(tokens)

[('Hi', 'NNP'), ('Mr.', 'NNP'), ('Smith', 'NNP'), ('!', '.'), ('I', 'PRP'), ("'m", 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('buy', 'VB'), ('some', 'DT'), ('vegetables', 'NNS'), ('(', '('), ('2', 'CD'), ('tomatoes', 'NNS'), ('and', 'CC'), ('4', 'CD'), ('cucumbers', 'NNS'), (')', ')'), ('from', 'IN'), ('the', 'DT'), ('store', 'NN'), ('.', '.'), ('Should', 'MD'), ('I', 'PRP'), ('pick', 'VB'), ('up', 'RP'), ('some', 'DT'), ('black-eyed', 'JJ'), ('peas', 'NNS'), ('as', 'IN'), ('well', 'RB'), ('?', '.')]


### POS Tagging on clean_text

In [22]:
tokens = pos_tag(lemma_x_stopwords)
print(tokens)


## Chunking

▪ Chunking is the step that follows POS tagging: it structures the sentence into "chunks" by identifying contiguous words that can be grouped together.

In [None]:
from nltk.chunk import ne_chunk

text_1 = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(text_1))

# This extracts entities from the list of words
result = ne_chunk(tokens) 
result.draw()

In [None]:
text_2 = "The Nature's Vegetarian Restaurant located at Bangsar is a fantastic vegetarian restaurant."
tokens = pos_tag(word_tokenize(text_2))

result = ne_chunk(tokens) 
result.draw()

## Compound Term Extraction

![](https://i.imgur.com/q1WuWai.png)

### Compound Term Extraction

In [None]:
from nltk.tokenize import MWETokenizer 

mwe_tokenizer = MWETokenizer([('James', 'Smith'), ('United', 'States')])

text_1 = "James Smith lives in the United States."
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text_1))
mwe_tokens

In [None]:
mwe_tokenizer = MWETokenizer([("Nature's", "Vegetarian", "Restaurant"), ('Subang', 'Jaya')])

text_3 = "The Nature's Vegetarian Restaurant located at Subang Jaya is a fantastic vegetarian restaurant."
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text_3))
mwe_tokens
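
▪ By default, MWETokenizer joins the parts of a recognized multi-word expression with an underscore; the separator argument changes that. A short sketch, using str.split() instead of word_tokenize() so that no NLTK data download is needed:

```python
from nltk.tokenize import MWETokenizer

words = "James Smith lives in the United States".split()
expressions = [('James', 'Smith'), ('United', 'States')]

# Default behaviour: matched expressions are merged with "_"
default_tokenizer = MWETokenizer(expressions)
default_tokens = default_tokenizer.tokenize(words)
print(default_tokens)

# A custom separator keeps the expression readable as a phrase
spaced_tokenizer = MWETokenizer(expressions, separator=' ')
spaced_tokens = spaced_tokenizer.tokenize(words)
print(spaced_tokens)
```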

## Basic NumPy Functionality

▪ NumPy, which stands for Numerical Python, is a Python library used for working with arrays.

### Create an ndarray (1D and 2D Arrays)

In [23]:
import numpy as np

arr_1d = np.array([2, 4, 6, 8]) 
arr_1d

array([2, 4, 6, 8])

In [24]:
print(type(arr_1d))

<class 'numpy.ndarray'>


In [25]:
arr_1d.shape

(4,)

In [27]:
arr_2d = np.random.randn(6, 4)
arr_2d

array([[ 0.85780906,  0.31365262, -0.70057474, -0.30541553],
       [ 1.55990202,  0.14329982, -0.27839342,  0.62009336],
       [ 0.59011357,  0.68103976, -0.11370023,  0.44723942],
       [-0.51140401,  0.57171169,  0.0826537 , -0.03379511],
       [ 2.73911057, -1.66741524,  0.70033702,  0.84138539],
       [-0.43435138,  2.11636159, -0.88252768,  1.76292792]])

In [29]:
print(type(arr_2d))

<class 'numpy.ndarray'>


In [30]:
arr_2d.shape

(6, 4)

## Basic Pandas Functionality

▪ Pandas stands for "Python Data Analysis Library".

![](https://i.imgur.com/HpgLFOT.png)

### Create a dataframe

▪ A dataframe is a 2-dimensional labeled data structure with columns of potentially different types. 

In [34]:
import pandas as pd

df = pd.DataFrame(arr_2d)
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.857809 | 0.313653 | -0.700575 | -0.305416 |
| 1 | 1.559902 | 0.1433 | -0.278393 | 0.620093 |
| 2 | 0.590114 | 0.68104 | -0.1137 | 0.447239 |
| 3 | -0.511404 | 0.571712 | 0.082654 | -0.033795 |
| 4 | 2.739111 | -1.667415 | 0.700337 | 0.841385 |
| 5 | -0.434351 | 2.116362 | -0.882528 | 1.762928 |


In [35]:
df.shape

(6, 4)

### Check the Labels of Rows and Columns

In [36]:
df.columns.values

array([0, 1, 2, 3], dtype=int64)

In [37]:
df.index.values

array([0, 1, 2, 3, 4, 5], dtype=int64)

### Add Labels to Rows and Columns

In [38]:
df.columns = ['A', 'B', 'C', 'D']
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.857809 | 0.313653 | -0.700575 | -0.305416 |
| 1 | 1.559902 | 0.1433 | -0.278393 | 0.620093 |
| 2 | 0.590114 | 0.68104 | -0.1137 | 0.447239 |
| 3 | -0.511404 | 0.571712 | 0.082654 | -0.033795 |
| 4 | 2.739111 | -1.667415 | 0.700337 | 0.841385 |
| 5 | -0.434351 | 2.116362 | -0.882528 | 1.762928 |


In [39]:
df.index = ['Rec_1', 'Rec_2', 'Rec_3', 'Rec_4', 'Rec_5', 'Rec_6']
df

|   | A | B | C | D |
|---|---|---|---|---|
| Rec_1 | 0.857809 | 0.313653 | -0.700575 | -0.305416 |
| Rec_2 | 1.559902 | 0.1433 | -0.278393 | 0.620093 |
| Rec_3 | 0.590114 | 0.68104 | -0.1137 | 0.447239 |
| Rec_4 | -0.511404 | 0.571712 | 0.082654 | -0.033795 |
| Rec_5 | 2.739111 | -1.667415 | 0.700337 | 0.841385 |
| Rec_6 | -0.434351 | 2.116362 | -0.882528 | 1.762928 |


## cookie_reviews.csv

### Reading data from cookie_reviews.csv

In [40]:
df = pd.read_csv('cookie_reviews.csv')
df

|   | user_id | stars | reviews |
|---|---|---|---|
| 0 | A368Z46FIKHSEZ | 5 | I love these cookies! Not only are they healt... |
| 1 | A1JAPP1CXRG57A | 5 | Quaker Soft Baked Oatmeal Cookies with raisins... |
| 2 | A2Z9JNXPIEL2B9 | 5 | I am usually not a huge fan of oatmeal cookies... |
| 3 | A31CYJQO3FL586 | 5 | I participated in a product review that includ... |
| 4 | A2KXQ2EKFF3K2G | 5 | My kids loved these. I was very pleased to giv... |
| ... | ... | ... | ... |
| 908 | A366PSH7KFLRPB | 5 | I loved these cookies and so did my kids. You ... |
| 909 | A2KV6EYQPKJRR5 | 5 | This is a great tasting cookie. It is very sof... |
| 910 | A3O7REI0OSV89M | 4 | These are great for a quick snack! They are sa... |
| 911 | A9JS5GQQ6GIQT | 5 | I love the Quaker soft baked cookies. The rea... |


In [41]:
df.head()

|   | user_id | stars | reviews |
|---|---|---|---|
| 0 | A368Z46FIKHSEZ | 5 | I love these cookies! Not only are they healt... |
| 1 | A1JAPP1CXRG57A | 5 | Quaker Soft Baked Oatmeal Cookies with raisins... |
| 2 | A2Z9JNXPIEL2B9 | 5 | I am usually not a huge fan of oatmeal cookies... |
| 3 | A31CYJQO3FL586 | 5 | I participated in a product review that includ... |
| 4 | A2KXQ2EKFF3K2G | 5 | My kids loved these. I was very pleased to giv... |


In [42]:
df.tail()

|   | user_id | stars | reviews |
|---|---|---|---|
| 908 | A366PSH7KFLRPB | 5 | I loved these cookies and so did my kids. You ... |
| 909 | A2KV6EYQPKJRR5 | 5 | This is a great tasting cookie. It is very sof... |
| 910 | A3O7REI0OSV89M | 4 | These are great for a quick snack! They are sa... |
| 911 | A9JS5GQQ6GIQT | 5 | I love the Quaker soft baked cookies. The rea... |
| 912 | AMAVEZAGCH52H | 5 | This cookie is really good and works really we... |


In [43]:
df.head(10)

|   | user_id | stars | reviews |
|---|---|---|---|
| 0 | A368Z46FIKHSEZ | 5 | I love these cookies! Not only are they healt... |
| 1 | A1JAPP1CXRG57A | 5 | Quaker Soft Baked Oatmeal Cookies with raisins... |
| 2 | A2Z9JNXPIEL2B9 | 5 | I am usually not a huge fan of oatmeal cookies... |
| 3 | A31CYJQO3FL586 | 5 | I participated in a product review that includ... |
| 4 | A2KXQ2EKFF3K2G | 5 | My kids loved these. I was very pleased to giv... |
| 5 | A2U5TAIAQ675BL | 5 | I really enjoyed these individually wrapped bi... |
| 6 | A1R4PIBZBD3NZ0 | 4 | I was surprised at how soft the cookie was. I ... |
| 7 | A1ECQ8LJMXG4WI | 5 | Filled with oats and raisins you'll love this ... |
| 8 | A3MSG4E5MLI1XP | 5 | I was recently given a complimentary "vox box"... |
| 9 | A3BUDUV9GORLWH | 5 | the best and freshest cookie that comes in a p... |


In [44]:
df.tail(10)

|   | user_id | stars | reviews |
|---|---|---|---|
| 903 | A55PK06Q6AKFY | 4 | These cookies are soft and delicious. Possibl... |
| 904 | A1YJMG0QJXZLD4 | 5 | I cannot say these taste like home made but th... |
| 905 | A1W0EK0033YVGP | 5 | this cookie is super soft and chewy.i love it ... |
| 906 | AFJFINIKFOFSB | 3 | These cookies are reasonably tasty without bei... |
| 907 | AQUMNB8YWE595 | 5 | THIS IS A FAB PRODUCT SOFT AND CHEWY YOU CANT ... |
| 908 | A366PSH7KFLRPB | 5 | I loved these cookies and so did my kids. You ... |
| 909 | A2KV6EYQPKJRR5 | 5 | This is a great tasting cookie. It is very sof... |
| 910 | A3O7REI0OSV89M | 4 | These are great for a quick snack! They are sa... |
| 911 | A9JS5GQQ6GIQT | 5 | I love the Quaker soft baked cookies. The rea... |
| 912 | AMAVEZAGCH52H | 5 | This cookie is really good and works really we... |


### Print the Labels of All Columns

In [45]:
df.columns.values

array(['user_id', 'stars', 'reviews'], dtype=object)

### Print All Values of a Column

In [46]:
df.user_id

0      A368Z46FIKHSEZ
1      A1JAPP1CXRG57A
2      A2Z9JNXPIEL2B9
3      A31CYJQO3FL586
4      A2KXQ2EKFF3K2G
            ...      
908    A366PSH7KFLRPB
909    A2KV6EYQPKJRR5
910    A3O7REI0OSV89M
911     A9JS5GQQ6GIQT
912     AMAVEZAGCH52H
Name: user_id, Length: 913, dtype: object

In [47]:
df.reviews

0      I love these cookies!  Not only are they healt...
1      Quaker Soft Baked Oatmeal Cookies with raisins...
2      I am usually not a huge fan of oatmeal cookies...
3      I participated in a product review that includ...
4      My kids loved these. I was very pleased to giv...
                             ...                        
908    I loved these cookies and so did my kids. You ...
909    This is a great tasting cookie. It is very sof...
910    These are great for a quick snack! They are sa...
911    I love the Quaker soft baked cookies.  The rea...
912    This cookie is really good and works really we...
Name: reviews, Length: 913, dtype: object

## DataFrame Slicing

▪ The pandas iloc indexer is used for integer-location-based indexing, i.e. selection by position. It is an indexer rather than a function, so it takes square brackets: df.iloc[row, column].

In [48]:
df.iloc[0] # first row of data frame

user_id                                       A368Z46FIKHSEZ
stars                                                      5
reviews    I love these cookies!  Not only are they healt...
Name: 0, dtype: object

In [49]:
df.iloc[-1] # last row of data frame

user_id                                        AMAVEZAGCH52H
stars                                                      5
reviews    This cookie is really good and works really we...
Name: 912, dtype: object

In [50]:
df.iloc[:,0] # first column of data frame

0      A368Z46FIKHSEZ
1      A1JAPP1CXRG57A
2      A2Z9JNXPIEL2B9
3      A31CYJQO3FL586
4      A2KXQ2EKFF3K2G
            ...      
908    A366PSH7KFLRPB
909    A2KV6EYQPKJRR5
910    A3O7REI0OSV89M
911     A9JS5GQQ6GIQT
912     AMAVEZAGCH52H
Name: user_id, Length: 913, dtype: object

In [51]:
df.iloc[:,-1] # last column of data frame

0      I love these cookies!  Not only are they healt...
1      Quaker Soft Baked Oatmeal Cookies with raisins...
2      I am usually not a huge fan of oatmeal cookies...
3      I participated in a product review that includ...
4      My kids loved these. I was very pleased to giv...
                             ...                        
908    I loved these cookies and so did my kids. You ...
909    This is a great tasting cookie. It is very sof...
910    These are great for a quick snack! They are sa...
911    I love the Quaker soft baked cookies.  The rea...
912    This cookie is really good and works really we...
Name: reviews, Length: 913, dtype: object

In [52]:
df.iloc[0, 1] # first row, second column of the dataframe

5

In [53]:
df.iloc[0:4, 0:2] # first 4 rows and first 2 columns of data frame

|   | user_id | stars |
|---|---|---|
| 0 | A368Z46FIKHSEZ | 5 |
| 1 | A1JAPP1CXRG57A | 5 |
| 2 | A2Z9JNXPIEL2B9 | 5 |
| 3 | A31CYJQO3FL586 | 5 |
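
▪ iloc selects by position and excludes the end of a slice, whereas loc selects by label and includes it. A small sketch of the difference on a made-up frame (the data and the Rec_ labels are illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({'stars': [5, 4, 3, 5]},
                    index=['Rec_1', 'Rec_2', 'Rec_3', 'Rec_4'])

# Position-based: rows at positions 0 and 1 (end position 2 is excluded)
by_position = demo.iloc[0:2]

# Label-based: rows Rec_1 through Rec_3 (the end label is included)
by_label = demo.loc['Rec_1':'Rec_3']

print(by_position)
print(by_label)
```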


## Preprocessing Exercise: cookie_reviews.csv

#### Question 1: Determine how many reviews there are in total.

#### Question 2: Determine the percentage of 1, 2, 3, 4 and 5 star reviews.

#### Question 3: Remove stop words

#### Question 4: Change to lower case

#### Question 5: Perform stemming

## Text Similarity Measures

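▪ Text similarity measures quantify how close two strings or documents are to each other, expressed either as a distance or as a similarity score.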

<img src="similarity.png" width="700">

▪ Some examples of its application include information retrieval, text classification, document clustering, and topic modeling.

### Levenshtein distance

▪ **Levenshtein distance** is one way to measure word similarity.

▪ It is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one word into the other.

![](https://i.imgur.com/FkdJmPi.png)
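
▪ The definition can be turned into a short dynamic-programming routine. The function below is a minimal sketch written for illustration (levenshtein is not an NLTK or TextBlob function); it counts insertions, deletions, and substitutions, each with a cost of 1.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # Start with the distances from the empty string to each prefix of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("graat", "great"))
```

▪ "kitten" to "sitting" needs two substitutions and one insertion, so the distance is 3; "graat" to "great" needs a single substitution.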

# TextBlob

▪ Besides NLTK, TextBlob is another Python library for processing textual data.

▪ TextBlob capabilities: tokenization, part-of-speech tagging, sentiment analysis, spell check, etc.

## TextBlob Demo: Tokenization

In [1]:
#pip install textblob

from textblob import TextBlob
my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words


## TextBlob Demo: Spell Check

▪ The correct() function calculates the Levenshtein distance between the word "graat" and all words in its word list. Of the words with the smallest Levenshtein distance, it outputs the most popular word.

In [None]:
blob = TextBlob("I'm graat at speling.")
print(blob.correct()) # print function requires Python 3
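
▪ The selection rule described above can be sketched in a few lines. This is only an illustration of the idea, not TextBlob's actual implementation: word_counts is a made-up frequency table, and levenshtein() and suggest() are hypothetical helpers.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

# Hypothetical word frequencies standing in for a real word list
word_counts = {"great": 900, "grant": 300, "goat": 120,
               "spelling": 80, "spewing": 5}

def suggest(word):
    """Pick the closest known word; break ties by popularity."""
    if word in word_counts:
        return word
    distances = {known: levenshtein(word, known) for known in word_counts}
    best = min(distances.values())
    closest = [known for known, dist in distances.items() if dist == best]
    return max(closest, key=word_counts.get)

print(suggest("graat"))
print(suggest("speling"))
```

▪ Here "great" and "grant" are both one edit away from "graat"; the more frequent "great" wins the tie.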

## TextBlob Demo: Tagging

In [None]:
blob = TextBlob("John hits the ball.")
for word, tag in blob.tags:
    print(word, tag)

## TextBlob Demo: Language Translation

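▪ TextBlob's translate() method uses Google Translate as its translation engine. The method was deprecated and has been removed from recent TextBlob releases, so the cells below may only run on older versions.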

https://thinkinfi.com/natural-language-processing-using-textblob/

In [None]:
word = TextBlob("Bonjour, comment allez-vous")
word.translate(from_lang = 'fr', to = 'cn')

In [None]:
word.translate(from_lang = 'fr', to = 'zh-CN')

## Text Format for Analysis: Count Vectorizer

![](https://i.imgur.com/OQDeQlb.png)

### Features Extraction and CountVectorizer

▪ Feature extraction is the process of transforming textual data into numerical data.

▪ CountVectorizer is a tool used to vectorize text data by converting it into a matrix of token counts.

![](Count-Vectorization.png)
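
▪ To make the idea concrete, the sketch below builds a small document-term matrix by hand with collections.Counter. It approximates what CountVectorizer does with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) and is not a replacement for it; tokenize() is a helper written for this illustration.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

def tokenize(document):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", document.lower())

# One Counter of token frequencies per document
counts = [Counter(tokenize(document)) for document in corpus]

# The vocabulary is every distinct token, sorted alphabetically
vocabulary = sorted(set().union(*counts))

# Each row holds one document's count for every vocabulary term
matrix = [[count[term] for term in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

▪ Each row is one document and each column is one vocabulary term; "one" appears twice in the third sentence, so its cell is 2.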

### Example 1: Create a DataFrame from original_text using CountVectorizer

In [54]:
original_text = ["Hi Mr. Smith! I'm going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"]

In [57]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
           
# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words = 'english') 

X = cv.fit_transform(original_text)
print(X)
print()

cv2 = CountVectorizer() 

Y = cv2.fit_transform(original_text)
print(Y)

  (0, 5)	1
  (0, 6)	1
  (0, 9)	1
  (0, 4)	1
  (0, 1)	1
  (0, 12)	1
  (0, 11)	1
  (0, 2)	1
  (0, 10)	1
  (0, 8)	1
  (0, 0)	1
  (0, 3)	1
  (0, 7)	1

  (0, 8)	1
  (0, 9)	1
  (0, 13)	1
  (0, 7)	1
  (0, 17)	1
  (0, 3)	1
  (0, 14)	2
  (0, 20)	1
  (0, 18)	1
  (0, 0)	1
  (0, 4)	1
  (0, 6)	1
  (0, 16)	1
  (0, 15)	1
  (0, 12)	1
  (0, 11)	1
  (0, 19)	1
  (0, 2)	1
  (0, 5)	1
  (0, 10)	1
  (0, 1)	1
  (0, 21)	1


In [58]:
cv.vocabulary_

{'hi': 5,
 'mr': 6,
 'smith': 9,
 'going': 4,
 'buy': 1,
 'vegetables': 12,
 'tomatoes': 11,
 'cucumbers': 2,
 'store': 10,
 'pick': 8,
 'black': 0,
 'eyed': 3,
 'peas': 7}

In [59]:
df = pd.DataFrame(X.toarray())
df

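|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |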


In [60]:
df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())
df

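|   | black | buy | cucumbers | eyed | going | hi | mr | peas | pick | smith | store | tomatoes | vegetables |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |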


### Example 2: Create a DataFrame from clean_text using CountVectorizer

In [61]:
print(lemma_x_stopwords)


In [62]:
text = [" ".join(lemma_x_stopwords)]
text


In [63]:
cv = CountVectorizer() 

X = cv.fit_transform(text)

df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())
df


### Example 3: Create a DataFrame from the following Corpus using CountVectorizer

In [64]:
corpus = ['This is the first document.', 
          'This is the second document.', 
          'And the third one. One is fun.']

cv = CountVectorizer() 

X = cv.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())
df

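|   | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 |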


### Example 4: Create a DataFrame from the following Corpus using CountVectorizer

In [65]:
corpus = ['The weather is hot under the sun',
          'I make my hot chocolate with milk',
          'One hot encoding',
          'I will have a chai latte with milk',
          'There is a hot sale today']

cv = CountVectorizer(stop_words = 'english') 

X = cv.fit_transform(corpus).toarray()

df = pd.DataFrame(X, columns = cv.get_feature_names_out())
df

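|   | chai | chocolate | encoding | hot | latte | make | milk | sale | sun | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |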


## Document Similarity

![](https://i.imgur.com/PyirXsy.png)

### Measuring Document Similarity

In [66]:
from itertools import combinations

pairs = list(combinations(['A', 'B', 'C', 'D'], 2))
pairs

[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

In [67]:
# range() returns an immutable sequence of numbers that can be easily converted to lists
x = list(range(5))
x

[0, 1, 2, 3, 4]

In [68]:
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# list all combinations of the 5 sentences in pairs, in terms of indexes
# (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), ..., (3,4)
pairs = list(combinations(range(len(corpus)), 2)) 
pairs

[(0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 3),
 (2, 4),
 (3, 4)]

In [69]:
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
combos

[('The weather is hot under the sun', 'I make my hot chocolate with milk'),
 ('The weather is hot under the sun', 'One hot encoding'),
 ('The weather is hot under the sun', 'I will have a chai latte with milk'),
 ('The weather is hot under the sun', 'There is a hot sale today'),
 ('I make my hot chocolate with milk', 'One hot encoding'),
 ('I make my hot chocolate with milk', 'I will have a chai latte with milk'),
 ('I make my hot chocolate with milk', 'There is a hot sale today'),
 ('One hot encoding', 'I will have a chai latte with milk'),
 ('One hot encoding', 'There is a hot sale today'),
 ('I will have a chai latte with milk', 'There is a hot sale today')]

In [70]:
# Calculate the cosine similarity for all pairs of phrases and sort by most similar
results = [cosine_similarity([X[a_index]], [X[b_index]]) for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse = True)

[(array([[0.40824829]]),
  ('The weather is hot under the sun', 'One hot encoding')),
 (array([[0.40824829]]), ('One hot encoding', 'There is a hot sale today')),
 (array([[0.35355339]]),
  ('I make my hot chocolate with milk', 'One hot encoding')),
 (array([[0.33333333]]),
  ('The weather is hot under the sun', 'There is a hot sale today')),
 (array([[0.28867513]]),
  ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
 (array([[0.28867513]]),
  ('I make my hot chocolate with milk', 'There is a hot sale today')),
 (array([[0.28867513]]),
  ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('The weather is hot under the sun', 'I will have a chai latte with milk')),
 (array([[0.]]), ('One hot encoding', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('I will have a chai latte with milk', 'There is a hot sale today'))]
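
▪ Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a check on the first value in the output above, the sketch below recomputes it by hand for 'The weather is hot under the sun' and 'One hot encoding', using their rows from the Example 4 count matrix; cosine() is an illustrative helper, not part of scikit-learn.

```python
import math

# Count-vector rows for the two documents (columns: chai ... weather)
doc_sun = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]
doc_encoding = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

def cosine(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine(doc_sun, doc_encoding)
print(round(similarity, 8))
```

▪ The two documents share only the word "hot", so the dot product is 1 and the result is 1 / (√3 · √2) ≈ 0.408.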

### Question: Which Two Documents are Most Similar?

![](https://i.imgur.com/jrfN6Jj.png)

![](https://i.imgur.com/BI8XP92.png)

![](https://i.imgur.com/3IbfQXT.png)

![](https://i.imgur.com/pnNqzql.png)

### CountVectorizer vs. TfidfVectorizer

▪ Original documents

![](table2.png)

▪ Documents with stopwords removed

![](table1.png)

▪ Feature extraction with CountVectorizer

![](table3.png)

▪ Feature extraction with TfidfVectorizer

![](table4.png)

https://medium.com/codex/document-indexing-using-tf-idf-189afd04a9fc

In [71]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=cv.get_feature_names_out())
df

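|   | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 |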


In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
df_tfidf = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
df_tfidf

|   | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.450145 | 0.591887 | 0.0 | 0.349578 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.450145 |
| 1 | 0.0 | 0.450145 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.591887 | 0.349578 | 0.0 | 0.450145 |
| 2 | 0.36043 | 0.0 | 0.0 | 0.36043 | 0.212876 | 0.72086 | 0.0 | 0.212876 | 0.36043 | 0.0 |


![](https://i.imgur.com/xlJibKw.png)
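
▪ The TF-IDF values above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer: a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The names doc_freq and tfidf_row are hypothetical helpers written for this illustration.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

# Term counts per document (lowercased, tokens of 2+ word characters)
docs = [Counter(re.findall(r"\b\w\w+\b", text.lower())) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in doc)

def tfidf_row(doc):
    """Weight raw counts by smoothed IDF, then scale the row to unit length."""
    weights = {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
               for term, tf in doc.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_row = tfidf_row(docs[0])
print({term: round(value, 6) for term, value in sorted(first_row.items())})
```

▪ Terms that occur in every document, such as "is" and "the", get the lowest weight in a row, while a term unique to one document, such as "first", gets the highest.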

### Document Similarity: Example with TF-IDF

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['The weather is hot under the sun',
          'I make my hot chocolate with milk',
          'One hot encoding',
          'I will have a chai latte with milk',
          'There is a hot sale today']

# Create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words = "english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf, columns = cv_tfidf.get_feature_names_out())
dt_tfidf

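|   | chai | chocolate | encoding | hot | latte | make | milk | sale | sun | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.370086 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6569 | 0.0 | 0.6569 |
| 1 | 0.0 | 0.580423 | 0.0 | 0.327 | 0.0 | 0.580423 | 0.468282 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.871247 | 0.490845 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.614189 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.495524 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.370086 | 0.0 | 0.0 | 0.0 | 0.6569 | 0.0 | 0.6569 | 0.0 |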


In [74]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]]) for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)

[(array([[0.23204486]]),
  ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
 (array([[0.18165505]]),
  ('The weather is hot under the sun', 'One hot encoding')),
 (array([[0.18165505]]), ('One hot encoding', 'There is a hot sale today')),
 (array([[0.16050661]]),
  ('I make my hot chocolate with milk', 'One hot encoding')),
 (array([[0.1369638]]),
  ('The weather is hot under the sun', 'There is a hot sale today')),
 (array([[0.12101835]]),
  ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
 (array([[0.12101835]]),
  ('I make my hot chocolate with milk', 'There is a hot sale today')),
 (array([[0.]]),
  ('The weather is hot under the sun', 'I will have a chai latte with milk')),
 (array([[0.]]), ('One hot encoding', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('I will have a chai latte with milk', 'There is a hot sale today'))]

### Question: Which Two Documents are Most Similar?

![](https://i.imgur.com/mj4J60v.png)

## Text Similarity Exercise

We will be using a song lyric dataset from Kaggle to identify songs with similar lyrics. The data set contains artists, songs and lyrics for 55K+ songs, but today we will be focusing on songs by one group in particular - The Beatles. The following code will help you load in the data and get set up for this exercise.

In [None]:
import nltk
import pandas as pd

In [None]:
data = pd.read_csv('songdata.csv')
data.head()

### Question 1: Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.

In [None]:
# Code?

### Question 2: List all the rows with "Imagine" in the title.

In [None]:
# Code?

### Question 3: Extract the first line of the lyrics from the first song.

In [None]:
# Code?

### Question 4: Find out the sentiment of the extracted lyric. 

In [None]:
# Code?