# Chapter 09
## Natural Language Processing
### Tokenization
The cells below tokenize text with the Natural Language Toolkit (NLTK), a widely used Python library designed for working with human language data.
- Let us start by downloading the tokenizer data, importing the relevant function, and using it.

In [1]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from nltk.tokenize import word_tokenize
corpus = 'This is a book about algorithms.'

tokens = word_tokenize(corpus)
print(tokens)

['This', 'is', 'a', 'book', 'about', 'algorithms', '.']
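
To make the idea of a token concrete, here is a minimal sketch of a regular-expression tokenizer that uses only the standard library. The name simple_word_tokenize and the pattern are illustrative assumptions, not a description of how word_tokenize works internally; NLTK's tokenizer handles contractions, abbreviations, and other cases that this sketch ignores.

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and every other non-space character is a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize('This is a book about algorithms.')
print(tokens)
```

For this particular sentence the sketch yields the same seven tokens as word_tokenize, including the final period as a separate token.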


To split text into sentences rather than words, you can use the sent_tokenize function from the nltk.tokenize module.

In [3]:
from nltk.tokenize import sent_tokenize
corpus = 'This is a book about algorithms. It covers various topics in depth.'


In this example, the corpus variable contains two sentences. The sent_tokenize function takes the corpus as input and returns a list of sentences, as the next cell shows:

In [4]:
sentences = sent_tokenize(corpus)
print(sentences)

['This is a book about algorithms.', 'It covers various topics in depth.']


Sometimes we need to break a large text into paragraph-level chunks. NLTK can help with that task, although the simple approach shown here relies only on Python's built-in string methods. Paragraph-level chunking is particularly useful in applications such as document summarization, where understanding the structure at the paragraph level may be crucial. Tokenizing text into paragraphs might seem straightforward, but it can be complex depending on the structure and format of the text. A simple approach is to split the text on two consecutive newline characters, which often separate paragraphs in plain-text documents.

In [5]:
def tokenize_paragraphs(text):
    # Split on two consecutive newline characters
    paragraphs = text.split('\n\n')
    # Strip each chunk and drop chunks that are empty or whitespace-only
    return [p.strip() for p in paragraphs if p.strip()]
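
The function above is defined but never called, so here is a small usage sketch. It uses a slightly more tolerant variant, split_paragraphs (a hypothetical name), which assumes that a "blank" line may also contain spaces or a carriage return; the sample text is invented for illustration.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines,
    # where a blank line may still contain spaces, tabs, or '\r'.
    chunks = re.split(r'\n\s*\n', text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample_text = (
    "Tokenization splits text into units.\n"
    "Words and sentences are common units.\n\n"
    "Paragraphs are larger units.\n\n"
    "   \n\n"
    "They often carry one main idea."
)

paragraphs = split_paragraphs(sample_text)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, repr(paragraph))
```

The whitespace-only chunk in the middle of the sample is discarded, so three paragraphs are returned.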


## Cleaning data using Python
Let us study some techniques used to clean data and prepare it for machine learning tasks:

In [6]:
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Make sure to download the NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Let us look into how we can clean text using Python.

In [7]:
def clean_text(text):
    """
    Clean the input text: lowercase it, remove punctuation, digits, surrounding whitespace, and stop words, then stem the remaining words.
    """
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    filtered_text = [word for word in tokens if word not in stop_words]
    text = ' '.join(filtered_text)

    # Stemming
    ps = PorterStemmer()
    tokens = nltk.word_tokenize(text)
    stemmed_text = [ps.stem(word) for word in tokens]
    text = ' '.join(stemmed_text)

    return text


Let us test the clean_text() function:

In [8]:
corpus = "7- Today, Ottawa is becoming cold again "
clean_text(corpus)

'today ottawa becom cold'
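
To see what each cleaning stage contributes, the following sketch records the text after every step. It uses only the standard library, so two things are assumptions made for illustration: the tiny STOP_WORDS set stands in for the much longer NLTK English list, and stemming is left out. The name trace_cleaning is hypothetical.

```python
import re
import string

# Assumption: a hand-picked stop-word list for illustration only
STOP_WORDS = {'is', 'again', 'a', 'the'}

def trace_cleaning(text):
    """Return the text after each cleaning stage (stemming is omitted)."""
    stages = {}
    text = text.lower()
    stages['lowercase'] = text
    text = text.translate(str.maketrans('', '', string.punctuation))
    stages['no punctuation'] = text
    text = re.sub(r'\d+', '', text)
    stages['no digits'] = text
    # split() with no argument also discards leading, trailing,
    # and repeated whitespace
    words = [word for word in text.split() if word not in STOP_WORDS]
    stages['no stop words'] = ' '.join(words)
    return stages

stages = trace_cleaning("7- Today, Ottawa is becoming cold again ")
for name, value in stages.items():
    print(f"{name:>15}: {value!r}")
```

The final stage differs from the clean_text() result only in that "becoming" has not been stemmed to "becom".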

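### Understanding the Term Document Matrix
A Term Document Matrix (TDM) records how often each term occurs in each document. This matrix structure allows efficient storage, organization, and analysis of large text datasets. In Python, the CountVectorizer class from the sklearn library can be used to create a TDM as follows. In the printed array, each row corresponds to a document and each column to a term of the alphabetically sorted vocabulary: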

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Define a list of documents
documents = ["Machine Learning is useful", "Machine Learning is fun", "Machine Learning is AI"]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents into a TDM
tdm = vectorizer.fit_transform(documents)

# Print the TDM
print(tdm.toarray())

[[0 0 1 1 1 1]
 [0 1 1 1 1 0]
 [1 0 1 1 1 0]]
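
The same matrix can be rebuilt by hand, which makes the meaning of the rows and columns explicit. This is a sketch under the assumption that CountVectorizer runs with its default settings (lowercasing, and tokens made of at least two word characters); the helper name tokenize is hypothetical.

```python
import re
from collections import Counter

documents = ["Machine Learning is useful", "Machine Learning is fun", "Machine Learning is AI"]

def tokenize(document):
    # Lowercase the text and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", document.lower())

# One Counter of term frequencies per document
counts = [Counter(tokenize(document)) for document in documents]

# The vocabulary is the sorted set of all terms; it labels the columns
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for terms that a document does not contain
matrix = [[count[term] for term in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The vocabulary comes out as ai, fun, is, learning, machine, useful, which explains why only the first, second, and last columns differ between the three rows.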


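Raw counts treat every term as equally important. TF-IDF (term frequency-inverse document frequency) instead gives less weight to terms that appear in many documents. The TfidfVectorizer class from sklearn computes these weights:

In [10]: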
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a list of documents
documents = ["Machine Learning enables learning", "Machine Learning is fun", "Machine Learning is useful"]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Loop over the feature names and print the TF-IDF score for each term
for i, term in enumerate(feature_names):
    tfidf = tfidf_matrix[:, i].toarray().flatten()
    print(f"{term}: {tfidf}")

enables: [0.60366655 0.         0.        ]
fun: [0.         0.66283998 0.        ]
is: [0.         0.50410689 0.50410689]
learning: [0.71307037 0.39148397 0.39148397]
machine: [0.35653519 0.39148397 0.39148397]
useful: [0.         0.         0.66283998]
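
The scores above can be reproduced by hand for the first document. The sketch below assumes TfidfVectorizer's default settings: raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document vector. The function name tfidf_vector is hypothetical.

```python
import math
from collections import Counter

documents = [
    "machine learning enables learning",
    "machine learning is fun",
    "machine learning is useful",
]
tokenized = [document.split() for document in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    term_freq = Counter(tokens)
    # Term count times the smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in term_freq.items()
    }
    # L2-normalise so that the document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf_vector(tokenized[0])
for term in sorted(first_doc):
    print(f"{term}: {first_doc[term]:.8f}")
```

The printed values agree with the first entry of the enables, learning, and machine rows above. The term learning scores twice as high as machine in this document because it occurs twice, while both have the same idf since they appear in all three documents.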


### Implementing word embedding with Word2Vec
Word2Vec is a prominent method for obtaining vector representations of words, commonly referred to as word embeddings. Rather than generating words, the algorithm learns a numerical vector for each word in the vocabulary, so that words used in similar contexts end up with similar vectors.

In [11]:
import gensim

# Define a text corpus
corpus = [['apple', 'banana', 'orange', 'pear'],
          ['car', 'bus', 'train', 'plane'],
          ['dog', 'cat', 'fox', 'fish']]

# Train a word2vec model on the corpus
model = gensim.models.Word2Vec(corpus, window=5, min_count=1, workers=4)

In [12]:
print(model.wv.similarity('car', 'train'))

-0.057745814


In [13]:
print(model.wv.similarity('car', 'apple'))

0.11117952
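
The similarity method returns the cosine similarity of the two word vectors, a value between -1 and 1. With a corpus of only three four-word sentences the model has almost nothing to learn from, so the scores above are close to zero, do not reflect meaning, and can vary between runs. The sketch below shows the computation itself; the three-dimensional vectors are hypothetical values invented for illustration, not embeddings taken from the model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: 'car' and 'train' point in similar directions,
# 'apple' points elsewhere
car = [0.9, 0.1, 0.0]
train = [0.8, 0.2, 0.1]
apple = [0.0, 0.2, 0.9]

print(cosine_similarity(car, train))
print(cosine_similarity(car, apple))
```

With embeddings trained on a large corpus, related words such as car and train would be expected to behave like the first pair here, and unrelated words like the second.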


## Case study: Restaurant review sentiment analysis
We will use the Yelp Reviews dataset, which contains reviews labelled as positive (5 stars) or negative (1 star). We will train a model that can classify a restaurant review as negative or positive.

In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
import numpy as np
import pandas as pd
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [16]:
url = 'https://storage.googleapis.com/neurals/data/2023/Restaurant_Reviews.tsv'
dataset = pd.read_csv(url, delimiter='\t', quoting=3)
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [17]:
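# Build the stop-word set and the stemmer once instead of once per word
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean_text(text):
    # Replace every non-letter character with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = text.split()
    # Drop stop words and stem the remaining words
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)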

corpus = [clean_text(review) for review in dataset['Review']]

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Initialize the CountVectorizer and transform the corpus
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(corpus).toarray()

# Get the target labels (1 = liked, 0 = not liked) by column name
y = dataset['Liked'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Initialize and train the Gaussian Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)


[[55 42]
 [12 91]]
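
The confusion matrix can be turned into the usual summary metrics with a few lines of arithmetic. The sketch below copies the four counts from the output above and relies on scikit-learn's layout, in which rows are true labels and columns are predicted labels, with class 0 (negative) first.

```python
# Counts copied from the printed confusion matrix
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1 score:  {f1:.3f}")
```

The accuracy is 0.73 on the 200 test reviews. Of the 97 negative reviews, 42 are classified as positive, whereas only 12 of the 103 positive reviews are missed, so this classifier errs mostly toward predicting positive.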
