# Lab 8: NLTK

Name: **Krish Agarwal** <br>
Reg No: **21112016** <br>
Class: **5BSc DS A** <br>
Date: **26/10/2023**

---------------

## `Objective`:  
1. Use NLTK on a movie review CSV file for:
    1. Tokenization
    1. Stemming
    1. Lemmatization
1. Find the frequency of each word
1. Categorize words by part of speech (POS)
1. Classify the text

## `References`:  
1. **Dataset**: https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format/
1. **NLTK API**: https://www.nltk.org/

## `Completion Status`:

| Question Number | Status |
| --- | --- |
| 1A | Completed |
| 1B | Completed |
| 1C | Completed |
| 2 | Completed |
| 3 | Completed |
| 4 | Completed |

## `Code`:

In [1]:
# Importing the data
import pandas as pd

data = pd.read_csv('D:/Z/Downloads/Movie Dataset.csv')
data

| Unnamed: 0 | text | label |
| --- | --- | --- |
| 0 | I grew up (b. 1965) watching and loving the Th... | 0 |
| 1 | When I put this movie in my DVD player, and sa... | 0 |
| 2 | Why do people who do not know what a particula... | 0 |
| 3 | Even though I have great interest in Biblical ... | 0 |
| 4 | Im a die hard Dads Army fan and nothing will e... | 1 |
| ... | ... | ... |
| 39995 | "Western Union" is something of a forgotten cl... | 1 |
| 39996 | This movie is an incredible piece of work. It ... | 1 |
| 39997 | My wife and I watched this movie because we pl... | 0 |
| 39998 | When I first watched Flatliners, I was amazed.... | 1 |


In [2]:
# importing the necessary libraries/modules
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [3]:
# Downloading the NLTK resources used below.
# Uncomment and run once if the resources are not yet available locally.
# Recent NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; the LookupError message names the missing resource.

# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

In [4]:
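# Pre-processing helper: tokenize, drop stopwords and non-alphabetic tokens, then stem and lemmatize.
# Note: `stemmer` and `lemmatizer` are created in the Q1B and Q1C cells below,
# so those cells must run before this function is called.

# Build the stopword set once; rebuilding the list for every token is very slow on 40,000 reviews.
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    # Tokenization (lowercased)
    tokens = word_tokenize(text.lower())
    # Remove stopwords and non-alphabetic tokens
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    # Stemming and lemmatization are applied independently to the same tokens
    stemmed = [stemmer.stem(t) for t in tokens]
    lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(lemmatized)  # Returning lemmatized text; join 'stemmed' instead if stemmed text is preferred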

**Q1A) Tokenization**

In [5]:
# tokenization function
def tokenize(text):
    return word_tokenize(text)

In [6]:
# applying the above function
data['tokens'] = data['text'].apply(tokenize)

print(data['tokens'].head())

0    [I, grew, up, (, b, ., 1965, ), watching, and,...
1    [When, I, put, this, movie, in, my, DVD, playe...
2    [Why, do, people, who, do, not, know, what, a,...
3    [Even, though, I, have, great, interest, in, B...
4    [Im, a, die, hard, Dads, Army, fan, and, nothi...
Name: tokens, dtype: object
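
As the output above shows, `word_tokenize` keeps punctuation such as `(`, `.`, and `)` as separate tokens. When only word-like tokens are wanted, one option is NLTK's `RegexpTokenizer`. The snippet below is a small sketch on an illustrative sentence, not part of the original lab; the pattern `\w+` is an assumption about what should count as a word.

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of letters, digits, and underscores; punctuation is dropped.
word_only_tokenizer = RegexpTokenizer(r'\w+')

sample = "I grew up (b. 1965) watching and loving it."  # illustrative sentence
tokens = word_only_tokenizer.tokenize(sample)
print(tokens)
```

Unlike `word_tokenize`, this tokenizer does not need the `punkt` resource, but it also splits contractions such as "don't" at the apostrophe.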


**Q1B) Stemming** 

In [7]:
# defining the stemmer
stemmer = PorterStemmer()

# stemming function 
def stem(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)  # Join the stemmed tokens back into a string

In [8]:
# applying the above function
data['stemmed text'] = data['text'].apply(stem)

print(data['stemmed text'].head())

0    i grew up ( b . 1965 ) watch and love the thun...
1    when i put thi movi in my dvd player , and sat...
2    whi do peopl who do not know what a particular...
3    even though i have great interest in biblic mo...
4    im a die hard dad armi fan and noth will ever ...
Name: stemmed text, dtype: object
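
The stems above are not always dictionary words (`movi`, `peopl`, `biblic`) because the Porter algorithm strips suffixes by rule rather than looking words up. A minimal sketch on a few words taken from the reviews above makes this visible; the variable names are illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the first few reviews
words = ['watching', 'loving', 'movie', 'people', 'biblical']

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```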


**Q1C) Lemmatization**

In [9]:
# defining the lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatization function
def lemmatize(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)  # Join the lemmatized tokens back into a string

In [10]:
# applying the above function
data['lemmatized text'] = data['text'].apply(lemmatize)

print(data['lemmatized text'].head())

0    I grew up ( b . 1965 ) watching and loving the...
1    When I put this movie in my DVD player , and s...
2    Why do people who do not know what a particula...
3    Even though I have great interest in Biblical ...
4    Im a die hard Dads Army fan and nothing will e...
Name: lemmatized text, dtype: object


**Q2) Find the frequency of each word**

In [11]:
# applying process_text (defined in cell In [4]) to every review
data['processed text'] = data['text'].apply(process_text)

In [12]:
word_frequency = FreqDist(' '.join(data['processed text']).split())
print(word_frequency.most_common(10))

[('br', 161465), ('movie', 80093), ('film', 72210), ('one', 42940), ('like', 32261), ('time', 23760), ('good', 23223), ('character', 22280), ('would', 21187), ('even', 19895)]
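
The most frequent token, `br`, is probably not a real word. It most likely comes from HTML line-break tags (`<br />`) in the raw reviews: the tokenizer splits the tag into pieces and the `isalpha()` filter keeps only `br`. This is an assumption about the raw data, not something verified in this lab. A minimal sketch of removing such tags before counting, using a made-up review and a hypothetical helper `strip_html_breaks`:

```python
import re
from nltk.probability import FreqDist

def strip_html_breaks(text):
    """Replace <br>, <br/>, and <br /> tags with a space (hypothetical helper)."""
    return re.sub(r'<br\s*/?>', ' ', text, flags=re.IGNORECASE)

sample = "Great movie.<br /><br />Great cast."  # illustrative review text
cleaned = strip_html_breaks(sample)

# Simple alphabetic tokenization, used here only to keep the sketch self-contained
tokens = re.findall(r'[a-z]+', cleaned.lower())
freq = FreqDist(tokens)
print(freq.most_common(3))
```

Applying such a helper to `data['text']` before `process_text` would keep `br` out of the frequency table.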


**Q3) Categorize Part of speech (POS)**

In [13]:
sample_text = data['processed text'].iloc[0]  # Taking first review as sample
pos_tags = pos_tag(word_tokenize(sample_text))
print(pos_tags)

[('grew', 'VBD'), ('b', 'NN'), ('watching', 'VBG'), ('loving', 'VBG'), ('thunderbird', 'NN'), ('mate', 'JJ'), ('school', 'NN'), ('watched', 'VBD'), ('played', 'JJ'), ('thunderbird', 'NN'), ('school', 'NN'), ('lunch', 'JJ'), ('school', 'NN'), ('wanted', 'VBD'), ('virgil', 'NN'), ('scott', 'NN'), ('one', 'CD'), ('wanted', 'VBD'), ('alan', 'NN'), ('counting', 'VBG'), ('became', 'VBD'), ('art', 'JJ'), ('form', 'NN'), ('took', 'VBD'), ('child', 'NN'), ('see', 'NN'), ('movie', 'NN'), ('hoping', 'VBG'), ('would', 'MD'), ('get', 'VB'), ('glimpse', 'NN'), ('loved', 'VBN'), ('child', 'JJ'), ('bitterly', 'RB'), ('disappointing', 'JJ'), ('high', 'JJ'), ('point', 'NN'), ('snappy', 'JJ'), ('theme', 'NN'), ('tune', 'NN'), ('could', 'MD'), ('compare', 'VB'), ('original', 'JJ'), ('score', 'NN'), ('thunderbird', 'NN'), ('thankfully', 'RB'), ('early', 'JJ'), ('saturday', 'NN'), ('morning', 'NN'), ('one', 'CD'), ('television', 'NN'), ('channel', 'NN'), ('still', 'RB'), ('play', 'VB'), ('rerun', 'JJ'), ('s
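
The tags above follow the Penn Treebank convention (`NN` noun, `VBD`/`VBG` verb forms, `JJ` adjective, `RB` adverb, and so on). To categorize the words, the fine-grained tags can be grouped by prefix and counted. The sketch below does this for the first eight tagged pairs from the output above; the function `coarse_category` and its groupings are illustrative assumptions.

```python
from collections import Counter

# First eight (word, tag) pairs copied from the output above
tagged = [('grew', 'VBD'), ('b', 'NN'), ('watching', 'VBG'), ('loving', 'VBG'),
          ('thunderbird', 'NN'), ('mate', 'JJ'), ('school', 'NN'), ('watched', 'VBD')]

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse part-of-speech label."""
    if tag.startswith('NN'):
        return 'noun'
    if tag.startswith('VB'):
        return 'verb'
    if tag.startswith('JJ'):
        return 'adjective'
    if tag.startswith('RB'):
        return 'adverb'
    return 'other'

category_counts = Counter(coarse_category(tag) for _, tag in tagged)
print(category_counts)
```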

**Q4) Classify the text**

In [14]:
# splitting the data into training and testing
X = data['processed text']
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
# Converting the text into numerical TF-IDF features; the vectorizer is fitted on the training split only
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
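
The vectorizer is fitted on the training split and only applied to the test split, so the vocabulary and IDF weights come from training data alone. A toy sketch of what that means, using made-up two-word documents: a word that appears only at test time has no column and is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['good movie', 'bad movie']  # illustrative training documents
test_docs = ['good film']                 # 'film' never appears in training

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
toy_test = toy_vectorizer.transform(test_docs)        # reuses them; unseen words are dropped

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train.shape, toy_test.shape)
```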

In [16]:
# Training a Multinomial Naive Bayes model 
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
y_pred = classifier.predict(X_test_tfidf)

In [17]:
# Evaluating the classifier
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.864125
              precision    recall  f1-score   support

           0       0.85      0.88      0.87      3966
           1       0.88      0.84      0.86      4034

    accuracy                           0.86      8000
   macro avg       0.86      0.86      0.86      8000
weighted avg       0.86      0.86      0.86      8000
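
The report above summarizes precision and recall per class. A confusion matrix shows the raw counts behind those ratios. The sketch below uses small made-up label lists rather than the lab's `y_test` and `y_pred`, purely to show the layout returned by `confusion_matrix` (rows are true labels, columns are predicted labels).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels, not the lab's results
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)
toy_accuracy = accuracy_score(toy_true, toy_pred)

print(toy_cm)        # row i, column j: samples of true class i predicted as class j
print(toy_accuracy)  # fraction of correct predictions
```

The same call with `y_test` and `y_pred` would show how the misclassified reviews are split between false positives and false negatives.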



In [18]:
# Classify a new text

# Defining a new, unseen review
new_text = "This movie is fantastic, and the acting is superb"
# Processing the text with the process_text function defined earlier
new_text_processed = process_text(new_text)
# Vectorizing the processed string with the fitted vectorizer
new_text_tfidf = vectorizer.transform([new_text_processed])
# Classifying the review
prediction = classifier.predict(new_text_tfidf)
prediction = classifier.predict(new_text_tfidf)

print(f'Classification: {prediction[0]}')

Classification: 1
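
Classifying new text requires repeating the same vectorizing step that was used during training. One way to keep the two steps together is scikit-learn's `Pipeline`, sketched below on a tiny made-up training set. This is an alternative arrangement, not the approach used in the cells above, and the four training sentences are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative training set: 1 = positive, 0 = negative
toy_reviews = ['great film loved it', 'wonderful acting great story',
               'terrible film hated it', 'awful acting boring story']
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in a single fit/predict call
toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])
toy_model.fit(toy_reviews, toy_labels)

toy_predictions = toy_model.predict(['great wonderful story', 'terrible boring film'])
print(toy_predictions)
```

With this arrangement raw strings go straight into `predict`, so the vectorizer cannot be accidentally skipped or refitted on new data.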


<hr>