# Natural Language Processing (NLP)

### NLP

▪ NLP is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language.

<img src="nlp.jpg" width="500">

### Structured Data vs. Unstructured Data

▪ Structured data is standardized, clearly defined, and searchable data, while unstructured data is raw data of various forms.

<img src="structured_vs_unstructured_data.png" width="500">

### Garbage In, Garbage Out

▪ Garbage in, garbage out is a concept common to computer science where the quality of output is determined by the quality of the input. 

<img src="garbage.png" width="500">

### Text Preprocessing

<img src="text_preprocessing.png" width="700">

▪ Unstructured data must first be cleaned and pre-processed before analysis.

# NLTK and Preprocessing Techniques

### NLP Toolkits

▪ NLTK, which stands for Natural Language Toolkit, is a suite of libraries built for working with NLP in Python.

### Prerequisites

▪ NLTK and NLTK dataset

### Installing NLTK via Anaconda Prompt

pip install nltk

### Installing NLTK Dataset via Jupyter Notebook

In [None]:
import nltk

# The following command downloads all data and models, and it will take a while
# Do this step only if nltk_data is not already available on your machine
# nltk.download()

## Text Data

▪ Text data is messy.

▪ To analyze this data, it has to be preprocessed into clean text in a format that machine learning models can understand.

![](https://i.imgur.com/3L6x92C.png)

## Text Data: Sample 

In [None]:
original_text = "Hi Mr. Smith! I'm going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"

## Remove Punctuation

In [None]:
import string

print(string.punctuation)

In [None]:
import re # Regular expression library

# re.escape() keeps characters such as ], ^, - and \ from being read as regex syntax
clean_text = re.sub('[%s]' % re.escape(string.punctuation), '', original_text)
clean_text

In [None]:
# https://pynative.com/python-regex-special-sequences-and-character-classes/

# Raw string: remove every character that is neither a word character nor whitespace
clean_text = re.sub(r'[^\w\s]', '', original_text)
clean_text

## Remove Numbers

In [None]:
# Removes all digit characters (words that merely contain digits keep their letters)
clean_text = re.sub(r'\d', '', clean_text)
clean_text

In [None]:
# Removes punctuation and numbers in one pass by keeping only letters and whitespace
clean_text = re.sub(r'[^A-Za-z\s]', '', original_text)
clean_text
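
Removing digit characters and removing whole words that contain digits are two different operations. The sketch below contrasts them on a made-up sentence (the sentence and the variable names are assumptions for illustration, not part of the original notebook):

```python
import re

text = "Order 2 tomatoes and 4 cucumbers from aisle b12"

# Remove digit characters only: "b12" becomes "b"
digits_removed = re.sub(r"\d", "", text)

# Remove every whole token that contains at least one digit: "b12" disappears
words_removed = re.sub(r"\w*\d\w*", "", text)

# Collapse the extra spaces left behind by the removals
digits_removed = re.sub(r"\s+", " ", digits_removed).strip()
words_removed = re.sub(r"\s+", " ", words_removed).strip()

print(digits_removed)
print(words_removed)
```

Which variant is appropriate depends on whether tokens such as product codes should survive as letters or be dropped entirely.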

## Convert Text to Lowercase

<img src="vegetarian.jpg" width="500">

In [None]:
sample_text = "The Nature's Vegetarian Restaurant located at Bangsar is a fantastic vegetarian restaurant."
print(sample_text)

In [None]:
clean_text = clean_text.lower()
clean_text

## Word Tokenization (original_text)

▪ Tokenization is the process of breaking down a phrase, sentence, paragraph, or an entire text document into smaller units.

<img src="tokenization.jpeg" width="700">

In [None]:
original_text = "Hi Mr. Smith! I'm going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"

print(original_text)

### \#1 Word Tokenization with word_tokenize()

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(original_text) 
print(tokens)

### \#2 Word Tokenization with regexp_tokenize()

In [None]:
from nltk import regexp_tokenize

tokens = regexp_tokenize(original_text, pattern = r'\w+')
print(tokens)

### \#3 Word Tokenization with split()

In [None]:
tokens = original_text.split()
print(tokens)

### \#4 Word Tokenization with Regex

In [None]:
import re

tokens = re.split(r"\W+", original_text)
print(tokens)

In [None]:
import re

tokens = re.findall(r"\w+", original_text)
print(tokens)

## Word Tokenization (clean_text)

### \#1 Word Tokenization with word_tokenize()

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(clean_text) 
print(tokens)

### \#2 Word Tokenization with regexp_tokenize()

In [None]:
from nltk import regexp_tokenize

tokens = regexp_tokenize(clean_text, pattern = r'\w+')
print(tokens)

### \#3 Word Tokenization with split()

In [None]:
tokens = clean_text.split()
print(tokens)

### \#4 Word Tokenization with Regex

In [None]:
import re

tokens = re.split(r"\W+", clean_text)
print(tokens)

In [None]:
import re

tokens = re.findall(r"\w+", clean_text)
print(tokens)

## Sentence Tokenization (original_text)

### \#1 Sentence Tokenization with sent_tokenize()

In [None]:
from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(original_text)
sent_tokens

### \#2 Sentence Tokenization with split()

In [None]:
sent_tokens = original_text.split(". ")
sent_tokens

In [None]:
sent_tokens = re.split("[?!.] ", original_text)
sent_tokens

## Remove Stop Words

![](https://i.imgur.com/T5RJXrX.png)

In [None]:
from nltk.corpus import stopwords

print(stopwords.fileids())

### Print English Stop Words

In [None]:
stop_words = stopwords.words('english')
print(stop_words)

### Identify Stop Words from the following Sentence

<h3><center>Stop words are commonly used words that generally don’t contribute anything to the meaning of the text.</center></h3>

### Sort the list of Stop Words

In [None]:
stop_words.sort() # sorted()
print(stop_words)

### Print Language-X Stop Words

In [None]:
stop_words_2 = stopwords.words('danish')
print(stop_words_2)

### Remove Stop Words from clean_text

In [None]:
print(tokens)

In [None]:
tokens_x_stopwords = [token for token in tokens if token not in stop_words]

print(tokens_x_stopwords)

## Stemming

![](https://i.imgur.com/9qllh8j.png)

### Stemming with LancasterStemmer

In [None]:
words_1 = ['Connects', 'Connecting', 'Connections', 'Connected', 'Connection', 'Connectings', 'Connect']
words_2 = ['drive', 'drives', 'driver', 'drivers', 'driven', 'driving']

In [None]:
from nltk.stem.lancaster import LancasterStemmer
lc_stemmer = LancasterStemmer()

for word_1 in words_1:
    print(word_1, "-->", lc_stemmer.stem(word_1))

In [None]:
for word_2 in words_2:
    print(word_2, "-->", lc_stemmer.stem(word_2))

### Stemming with PorterStemmer

In [None]:
from nltk.stem import PorterStemmer
pt_stemmer = PorterStemmer()

for word_1 in words_1:
    print(word_1, "-->", pt_stemmer.stem(word_1))

In [None]:
for word_2 in words_2:
    print(word_2, "--->", pt_stemmer.stem(word_2))

### Stemming with SnowballStemmer

In [None]:
from nltk.stem import SnowballStemmer
sb_stemmer = SnowballStemmer(language = 'english')

for word_1 in words_1:
    print(word_1, "--->", sb_stemmer.stem(word_1))

In [None]:
for word_2 in words_2:
    print(word_2, "--->", sb_stemmer.stem(word_2))

## Lemmatization

![](https://i.imgur.com/9qllh8j.png)

### Lemmatization with WordNetLemmatizer

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wn_lemmatizer = WordNetLemmatizer()

for word_1 in words_1:
    print(word_1, "--->", wn_lemmatizer.lemmatize(word_1))

In [None]:
for word_2 in words_2:
    print(word_2, "--->", wn_lemmatizer.lemmatize(word_2))

### Lemmatization with WordNetLemmatizer on clean_text

In [None]:
print(tokens_x_stopwords)

In [None]:
lemma_x_stopwords = [wn_lemmatizer.lemmatize(word_1) for word_1 in tokens_x_stopwords]

print(lemma_x_stopwords)

## Parts of Speech Tagging

![](https://i.imgur.com/8edVsCR.png)

In [None]:
# upenn_tagset() prints the tag descriptions itself and returns None
nltk.help.upenn_tagset()

### POS Tagging on Sample Text

In [None]:
from nltk.tag import pos_tag

text_1 = "James Smith lives in the United States."

tokens = pos_tag(word_tokenize(text_1))
print(tokens)

### POS Tagging on original_text

In [None]:
print(original_text)

In [None]:
tokens = pos_tag(word_tokenize(original_text))
print(tokens)

### POS Tagging on clean_text

In [None]:
tokens = pos_tag(lemma_x_stopwords)
print(tokens)

## Chunking

▪ Chunking follows POS tagging and structures a sentence into "chunks" by identifying contiguous words that can be grouped together.

In [None]:
from nltk.chunk import ne_chunk

text_1 = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(text_1))

# This extracts entities from the list of words
result = ne_chunk(tokens) 
result.draw()

In [None]:
text_2 = "The Nature's Vegetarian Restaurant located at Bangsar is a fantastic vegetarian restaurant."
tokens = pos_tag(word_tokenize(text_2))

result = ne_chunk(tokens) 
result.draw()
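
The cells above use ne_chunk(), which groups words into named entities. Grouping contiguous words by their POS tags can also be sketched with NLTK's RegexpParser. The example below is a minimal sketch, assuming hand-written Penn Treebank tags (so no tagger model is needed) and an illustrative noun-phrase grammar that is not part of the original notebook:

```python
from nltk import RegexpParser

# Hand-tagged tokens (Penn Treebank tags), an assumption for illustration
tagged = [("James", "NNP"), ("Smith", "NNP"), ("lives", "VBZ"), ("in", "IN"),
          ("the", "DT"), ("United", "NNP"), ("States", "NNPS"), (".", ".")]

# Illustrative grammar: optional determiner, any adjectives, one or more nouns
parser = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = parser.parse(tagged)

# Collect the words of every subtree labelled NP
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

Printing the chunks as text avoids the pop-up window that draw() opens, which can be convenient on machines without a display.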

## Compound Term Extraction

![](https://i.imgur.com/q1WuWai.png)

### Compound Term Extraction

In [None]:
from nltk.tokenize import MWETokenizer 

mwe_tokenizer = MWETokenizer([('James', 'Smith'), ('United', 'States')])

text_1 = "James Smith lives in the United States."
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text_1))
mwe_tokens

In [None]:
mwe_tokenizer = MWETokenizer([("Nature's", "Vegetarian", "Restaurant"), ('Subang', 'Jaya')])

text_3 = "The Nature's Vegetarian Restaurant located at Subang Jaya is a fantastic vegetarian restaurant."
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text_3))
mwe_tokens

## Basic NumPy Functionality

▪ NumPy, which stands for Numerical Python, is a Python library used for working with arrays.

### Create an ndarray (1D and 2D Arrays)

In [None]:
import numpy as np

arr_1d = np.array([2, 4, 6, 8]) 
arr_1d

In [None]:
print(type(arr_1d))

In [None]:
arr_1d.shape

In [None]:
arr_2d = np.random.randn(6, 4)
arr_2d

In [None]:
print(type(arr_2d))

In [None]:
arr_2d.shape

## Basic Pandas Functionality

▪ Pandas is a Python library for data analysis. Its name derives from "panel data" and is also a play on "Python data analysis".

![](https://i.imgur.com/HpgLFOT.png)

### Create a dataframe

▪ A dataframe is a 2-dimensional labeled data structure with columns of potentially different types. 

In [None]:
import pandas as pd

df = pd.DataFrame(arr_2d)
df

In [None]:
df.shape

### Check the Labels of Rows and Columns

In [None]:
df.columns.values

In [None]:
df.index.values

### Add Labels to Rows and Columns

In [None]:
df.columns = ['A', 'B', 'C', 'D']
df

In [None]:
df.index = ['Rec_1', 'Rec_2', 'Rec_3', 'Rec_4', 'Rec_5', 'Rec_6']
df

## cookie_reviews.csv

### Reading data from cookie_reviews.csv

In [None]:
df = pd.read_csv('cookie_reviews.csv')
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.head(10)

In [None]:
df.tail(10)

### Print the Labels of All Columns

In [None]:
df.columns.values

### Print All Values of a Column

In [None]:
df.user_id

In [None]:
df.reviews

## DataFrame Slicing

▪ The pandas iloc indexer is used for integer-location based indexing (selection by position). It is written with square brackets, for example df.iloc[0], not called like a function.

In [None]:
df.iloc[0] # first row of data frame

In [None]:
df.iloc[-1] # last row of data frame

In [None]:
df.iloc[:,0] # first column of data frame

In [None]:
df.iloc[:,-1] # last column of data frame

In [None]:
df.iloc[0, 1] # first row, second column of the dataframe

In [None]:
df.iloc[0:4, 0:2] # first 4 rows and first 2 columns of data frame

## Preprocessing Exercise: cookie_reviews.csv

#### Question 1: Determine how many reviews there are in total.

#### Question 2: Determine the percentage of 1, 2, 3, 4 and 5 star reviews.

#### Question 3: Remove stop words

#### Question 4: Change to lower case

#### Question 5: Perform stemming

## Text Similarity Measures

▪ Text similarity measures quantify the distance between two strings.

<img src="similarity.png" width="700">

▪ Some examples of its application include information retrieval, text classification, document clustering, and topic modeling.

### Levenshtein distance

▪ **Levenshtein distance** is one way to measure word similarity.

▪ It is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one word into another.

![](https://i.imgur.com/FkdJmPi.png)
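
The definition above can be sketched in plain Python with the standard dynamic-programming approach. This is a minimal illustration rather than a library implementation, and the function name levenshtein is an assumption chosen for this example:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("graat", "great"))     # 1
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the second word rather than with the product of both lengths.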

# TextBlob

▪ Other than NLTK, TextBlob is another Python library for processing textual data.

▪ TextBlob capabilities: Tokenization, Parts of speech tagging, Sentiment analysis, Spell check, etc.

## TextBlob Demo: Tokenization

In [None]:
#pip install textblob

from textblob import TextBlob
my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words

## TextBlob Demo: Spell Check

▪ The correct() function calculates the Levenshtein distance between the word "graat" and all words in its word list. Of the words with the smallest Levenshtein distance, it outputs the most popular word.

In [None]:
blob = TextBlob("I'm graat at speling.")
print(blob.correct()) # print function requires Python 3
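
The idea described above (closest word first, popularity as the tie-breaker) can be sketched as follows. This is a simplified illustration and not TextBlob's actual implementation; the word list, the usage counts, and the name toy_correct are all made up for this example:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,
                               previous[j] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]

# Hypothetical word list with made-up usage counts
word_counts = {"great": 900, "grant": 150, "goat": 80,
               "spelling": 300, "spilling": 40}

def toy_correct(word):
    """Pick the closest known word; break distance ties by popularity."""
    if word in word_counts:
        return word
    return min(word_counts,
               key=lambda known: (levenshtein(word, known), -word_counts[known]))

print(toy_correct("graat"))
print(toy_correct("speling"))
```

Here "great" and "grant" are both one edit away from "graat", so the higher count decides between them.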

## TextBlob Demo: Tagging

In [None]:
blob = TextBlob("John hits the ball.")
for word, tag in blob.tags:
    print(word, tag)

## TextBlob Demo: Language Translation

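▪ TextBlob's translate() method uses Google Translate as its translation engine. The method was deprecated in TextBlob 0.16.0 and removed in 0.18.0, so the cells below run only on older releases.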

https://thinkinfi.com/natural-language-processing-using-textblob/

In [None]:
word = TextBlob("Bonjour, comment allez-vous")
# 'cn' is not a standard language code for Chinese; the next cell uses 'zh-CN'
word.translate(from_lang = 'fr', to = 'cn')

In [None]:
word.translate(from_lang = 'fr', to = 'zh-CN')

## Text Format for Analysis: Count Vectorizer

![](https://i.imgur.com/OQDeQlb.png)

### Features Extraction and CountVectorizer

▪ Feature extraction is the process of transforming textual data into numerical data.

▪ CountVectorizer is a tool used to vectorize text data by converting it into a matrix of token counts.

![](Count-Vectorization.png)
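
To make the "matrix of token counts" concrete, the sketch below builds one by hand with the standard library. It only approximates what CountVectorizer does by default (lowercasing and keeping tokens of two or more word characters); the names docs, tokenize, vocabulary, and matrix are assumptions for this illustration:

```python
import re
from collections import Counter

docs = ["This is the first document.",
        "This is the second document.",
        "And the third one. One is fun."]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of token frequencies per document
counts = [Counter(tokenize(doc)) for doc in docs]

# Columns of the matrix: every distinct token, in alphabetical order
vocabulary = sorted(set().union(*counts))

# Rows of the matrix: one list of counts per document (0 for absent tokens)
matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row describes one document and each column one vocabulary term, which is the same layout as the DataFrames built in the examples that follow.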

### Example 1: Create a DataFrame from original_text using CountVectorizer

In [None]:
original_text = ["Hi Mr. Smith! I'm going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
           
# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words = 'english') 

X = cv.fit_transform(original_text)
print(X)

In [None]:
cv.vocabulary_

In [None]:
df = pd.DataFrame(X.toarray())
df

In [None]:
df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())
df

### Example 2: Create a DataFrame from clean_text using CountVectorizer

In [None]:
print(lemma_x_stopwords)

In [None]:
text = [" ".join(lemma_x_stopwords)]
text

In [None]:
cv = CountVectorizer() 

X = cv.fit_transform(text)

df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())
df

### Example 3: Create a DataFrame from the following Corpus using CountVectorizer

In [None]:
corpus = ['This is the first document.', 
          'This is the second document.', 
          'And the third one. One is fun.']

cv = CountVectorizer() 

X = cv.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns = cv.get_feature_names_out())
df

### Example 4: Create a DataFrame from the following Corpus using CountVectorizer

In [None]:
corpus = ['The weather is hot under the sun',
          'I make my hot chocolate with milk',
          'One hot encoding',
          'I will have a chai latte with milk',
          'There is a hot sale today']

cv = CountVectorizer(stop_words = 'english') 

X = cv.fit_transform(corpus).toarray()

df = pd.DataFrame(X, columns = cv.get_feature_names_out())
df

## Document Similarity

![](https://i.imgur.com/PyirXsy.png)

### Measuring Document Similarity

In [None]:
from itertools import combinations

pairs = list(combinations(['A', 'B', 'C', 'D'], 2))
pairs

In [None]:
# range() returns an immutable sequence of numbers that can be easily converted to lists
x = list(range(5))
x

In [None]:
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# list all combinations of the 5 sentences in pairs, in terms of indexes
# (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), ..., (3,4)
pairs = list(combinations(range(len(corpus)), 2)) 
pairs

In [None]:
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
combos

In [None]:
# Calculate the cosine similarity for all pairs of phrases and sort by most similar
# cosine_similarity() returns a 1x1 array for each pair; [0, 0] extracts the plain score
results = [cosine_similarity([X[a_index]], [X[b_index]])[0, 0] for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse = True)
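
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch of that calculation follows; the two count vectors and their four-term vocabulary are hypothetical and chosen only for illustration:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over the vocabulary [chocolate, hot, milk, sale]
doc_a = [1, 1, 1, 0]
doc_b = [0, 1, 0, 1]

print(round(cosine(doc_a, doc_b), 3))  # 0.408
```

The two documents share one term out of three and two respectively, giving 1 / (sqrt(3) * sqrt(2)). Identical vectors score 1 and vectors with no shared terms score 0.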

### Question: Which Two Documents are Most Similar?

![](https://i.imgur.com/jrfN6Jj.png)

![](https://i.imgur.com/BI8XP92.png)

![](https://i.imgur.com/3IbfQXT.png)

![](https://i.imgur.com/pnNqzql.png)

### CountVectorizer vs. TfidfVectorizer

▪ Original documents

![](table2.png)

▪ Documents with stopwords removed

![](table1.png)

▪ Feature extraction with CountVectorizer

![](table3.png)

▪ Feature extraction with TfidfVectorizer

![](table4.png)

https://medium.com/codex/document-indexing-using-tf-idf-189afd04a9fc

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=cv.get_feature_names_out())
df

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
df_tfidf = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
df_tfidf

![](https://i.imgur.com/xlJibKw.png)
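
The TF-IDF weights can be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer (smoothed IDF, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); the tiny count matrix and its three-term vocabulary are made up for illustration:

```python
import math

# Hypothetical term counts for three documents over the terms [hot, milk, sun]
counts = [[1, 0, 1],
          [1, 1, 0],
          [1, 0, 0]]
n_docs = len(counts)

# Document frequency: the number of documents in which each term appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# Smoothed inverse document frequency: rarer terms get larger weights
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])  # scale each row to unit length

for row in tfidf:
    print([round(x, 3) for x in row])
```

The term that appears in every document ("hot" in this made-up vocabulary) receives the lowest IDF, so within a row it is outweighed by rarer terms.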

### Document Similarity: Example with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['The weather is hot under the sun',
          'I make my hot chocolate with milk',
          'One hot encoding',
          'I will have a chai latte with milk',
          'There is a hot sale today']

# Create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words = "english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf, columns = cv_tfidf.get_feature_names_out())
dt_tfidf

In [None]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar
# pairs and combos are reused from the CountVectorizer example above (same corpus)
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]])[0, 0] for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)

### Question: Which Two Documents are Most Similar?

![](https://i.imgur.com/mj4J60v.png)

## Text Similarity Exercise

We will be using a song lyrics dataset from Kaggle to identify songs with similar lyrics. The dataset contains artists, songs, and lyrics for 55K+ songs, but today we will focus on songs by one group in particular: The Beatles. The following code will help you load the data and get set up for this exercise.

In [None]:
import nltk
import pandas as pd

In [None]:
data = pd.read_csv('songdata.csv')
data.head()

### Question 1: Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.

In [None]:
# Code?

### Question 2: List all the rows with "Imagine" in the title.

In [None]:
# Code?

### Question 3: Extract the first line of the lyrics from the first song.

In [None]:
# Code?

### Question 4: Find out the sentiment of the extracted lyric. 

In [None]:
# Code?