Student Name: Fatima Nawab

##**Natural Language Processing | DS & AI Cohort 13**
**Objective**

Understand and apply core NLP techniques - stemming, lemmatization, N-grams,
vectorization methods, and Naive Bayes classification to build and evaluate a complete text classification pipeline.

Your goal is to transform raw text into meaningful representations and
use a machine learning model to perform sentiment classification.


##Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix




In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

##1. Stemming and Lemmatization


In [4]:
# Load the IMDB dataset, using the 'python' engine for better handling of parsing errors
df = pd.read_csv('/content/IMDB Dataset.csv', engine='python')

# Display the first 5 rows
print("First 5 rows of the DataFrame:")
print(df.head())

# Print the shape of the DataFrame
print("\nShape of the DataFrame:")
print(df.shape)

# Display concise summary of the DataFrame
print("\nConcise summary of the DataFrame:")
df.info()


First 5 rows of the DataFrame:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

Shape of the DataFrame:
(50000, 2)

Concise summary of the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
df['sentiment'].value_counts()


sentiment  count
positive   25000
negative   25000


In [6]:
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})


##Text Preprocessing
###1.1. Define Stopwords, Stemmer & Lemmatizer

In [7]:
# Shared resources used by the cleaning, stemming, and lemmatization steps below
stop_words = set(stopwords.words('english'))
porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()


###1.2. Text Cleaning Function

In [8]:
def clean_review_text(text):
    """Lowercase a review, strip HTML tags and punctuation, and drop English stopwords."""
    # Convert text to lowercase
    text = text.lower()

    # Remove HTML tags such as <br />
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation and special characters (keeps letters, digits, whitespace)
    text = re.sub(r'[^a-z0-9\s]', '', text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove English stopwords, reusing the stop_words set defined in section 1.1
    # instead of rebuilding it on every call
    tokens = [word for word in tokens if word not in stop_words]

    # Return cleaned text as a single string
    return " ".join(tokens)
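
The effect of the tag-removal step can be checked on a short string. The sketch below uses only `re` and a made-up sentence (an assumption, not a row from the dataset) to compare replacing tags with an empty string, as the function above does, against replacing them with a space.

```python
import re

# Hypothetical input, chosen only to illustrate the behaviour
raw = "It happened with me.<br /><br />The first thing I noticed"

def strip_punct(text):
    # Same character filter as in clean_review_text
    return re.sub(r"[^a-z0-9\s]", "", text)

# Variant 1: tags replaced with nothing (what clean_review_text does)
tokens_joined = strip_punct(re.sub(r"<.*?>", "", raw.lower())).split()

# Variant 2: tags replaced with a space (an alternative, not the notebook's code)
tokens_spaced = strip_punct(re.sub(r"<.*?>", " ", raw.lower())).split()

print(tokens_joined)
print(tokens_spaced)
```

With the empty-string replacement, the words on either side of a tag fuse into one token. This is the likely source of tokens such as "methe" in the cleaned reviews. Substituting a space avoids the fusion, but it would change the vocabulary size and scores reported later.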



###1.3. Apply Text Cleaning

In [9]:
print("Starting text preprocessing...")
df['clean_review'] = df['review'].apply(clean_review_text)
print("Text preprocessing completed.")


Starting text preprocessing...
Text preprocessing completed.


###1.4. Tokenization Function

In [10]:
def tokenize_review(text):
    return word_tokenize(text)


In [11]:
print("Tokenizing cleaned reviews...")
df['review_tokens'] = df['clean_review'].apply(tokenize_review)
print("Tokenization completed.")


Tokenizing cleaned reviews...
Tokenization completed.


###1.5. Define Stemming and Lemmatization Functions

In [12]:
def perform_stemming(token_list):
    return [porter_stemmer.stem(token) for token in token_list]

def perform_lemmatization(token_list):
    return [wordnet_lemmatizer.lemmatize(token) for token in token_list]


###1.6. Apply Stemming & Lemmatization

In [39]:
print("Applying stemming...")
df['review_stemmed'] = df['review_tokens'].apply(perform_stemming)
print("Stemming completed.")


Applying stemming...
Stemming completed.


In [14]:
print("Applying lemmatization...")
df['review_lemmatized'] = df['review_tokens'].apply(perform_lemmatization)
print("Lemmatization completed.")


Applying lemmatization...
Lemmatization completed.


###1.7. Compare Results on Sample Review

In [15]:
sample_row = 1

print("\n--- Text Processing Comparison ---")
print("Original Review:\n", df['review'].iloc[sample_row])
print("\nTokenized Review:\n", df['review_tokens'].iloc[sample_row])
print("\nStemmed Review:\n", df['review_stemmed'].iloc[sample_row])
print("\nLemmatized Review:\n", df['review_lemmatized'].iloc[sample_row])



--- Text Processing Comparison ---
Original Review:
 A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's mura

###1.8. Analysis: Stemming vs Lemmatization

Stemming is a rule-based technique that strips suffixes to reduce words to a common stem. It is computationally cheap, but the resulting stem is not always a dictionary word.

Lemmatization uses a vocabulary and morphological analysis to return the dictionary base form (lemma) of a word. It is slower than stemming, but it preserves meaningful word forms. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so the calls above leave verb forms such as "watching" unchanged.

In this experiment, the lemmatized text stayed more readable and interpretable than the stemmed text, which makes it easier to inspect in downstream NLP tasks such as sentiment analysis.
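
The contrast between the two approaches can be illustrated without NLTK. The following is a toy sketch only: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical and far simpler than the Porter algorithm or WordNet.

```python
# Hypothetical suffix rules, checked in order (not the Porter rule set)
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Hypothetical dictionary of irregular and regular forms (stands in for WordNet)
LEMMA_TABLE = {"studies": "study", "better": "good", "was": "be", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

words = ["studies", "better", "was", "mice", "running"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

The rule-based version produces non-words such as "stud" and "runn", while the lookup version returns real words but only for entries its dictionary covers.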

##2. N-gram Language Modeling (Unigrams, Bigrams, Trigrams)

In [16]:
from nltk.util import ngrams
from collections import Counter
from nltk.tokenize import word_tokenize

###2.1. Function to Create N-grams

In [17]:
def create_ngrams(token_list, n_value):
    """
    Generates n-grams from a list of tokens.
    """
    return list(ngrams(token_list, n_value))


###2.2. Generate Unigrams, Bigrams & Trigrams for Reviews

In [18]:
print("Creating unigram representations...")
df['review_unigrams'] = df['review_tokens'].apply(lambda tokens: create_ngrams(tokens, 1))

print("Creating bigram representations...")
df['review_bigrams'] = df['review_tokens'].apply(lambda tokens: create_ngrams(tokens, 2))

print("Creating trigram representations...")
df['review_trigrams'] = df['review_tokens'].apply(lambda tokens: create_ngrams(tokens, 3))

print("N-gram creation completed successfully.")


Creating unigram representations...
Creating bigram representations...
Creating trigram representations...
N-gram creation completed successfully.


###2.3. Examine N-grams for a Sample Review

In [19]:
example_index = 0

example_text = df['clean_review'].iloc[example_index]
example_unigrams = df['review_unigrams'].iloc[example_index]
example_bigrams = df['review_bigrams'].iloc[example_index]
example_trigrams = df['review_trigrams'].iloc[example_index]

print(f"\n--- N-gram Analysis for Review at Index {example_index} ---")
print("\nCleaned Review Text:\n", example_text)
print("\nSample Unigrams (first 10):\n", example_unigrams[:10])
print("\nSample Bigrams (first 10):\n", example_bigrams[:10])
print("\nSample Trigrams (first 10):\n", example_trigrams[:10])



--- N-gram Analysis for Review at Index 0 ---

Cleaned Review Text:
 one reviewers mentioned watching 1 oz episode youll hooked right exactly happened methe first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awayi would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards whol

###2.4. Define Counters for all N-grams and display frequencies and probabilities

In [20]:
from collections import Counter

sample_index = 0
# Sample review N-grams
sample_unigrams = df['review_unigrams'].iloc[sample_index]
sample_bigrams = df['review_bigrams'].iloc[sample_index]
sample_trigrams = df['review_trigrams'].iloc[sample_index]

# Count frequencies
unigram_counts = Counter(sample_unigrams)
bigram_counts = Counter(sample_bigrams)
trigram_counts = Counter(sample_trigrams)

# Display frequencies and probabilities
print("\n--- Sample N-gram Frequencies and Probabilities ---")

print("\nUnigrams:")
total_unigrams = len(sample_unigrams)
for gram, count in unigram_counts.most_common():
    print(f"  {gram}: Count={count}, Probability={count/total_unigrams:.4f}")

print("\nBigrams:")
total_bigrams = len(sample_bigrams)
if total_bigrams > 0:
    for gram, count in bigram_counts.most_common():
        print(f"  {gram}: Count={count}, Probability={count/total_bigrams:.4f}")
else:
    print("  No bigrams generated.")

print("\nTrigrams:")
total_trigrams = len(sample_trigrams)
if total_trigrams > 0:
    for gram, count in trigram_counts.most_common():
        print(f"  {gram}: Count={count}, Probability={count/total_trigrams:.4f}")
else:
    print("  No trigrams generated.")

print("\nN-gram probability calculation complete.")


--- Sample N-gram Frequencies and Probabilities ---

Unigrams:
  ('oz',): Count=5, Probability=0.0298
  ('violence',): Count=4, Probability=0.0238
  ('show',): Count=3, Probability=0.0179
  ('prison',): Count=3, Probability=0.0179
  ('forget',): Count=3, Probability=0.0179
  ('watching',): Count=2, Probability=0.0119
  ('episode',): Count=2, Probability=0.0119
  ('right',): Count=2, Probability=0.0119
  ('first',): Count=2, Probability=0.0119
  ('struck',): Count=2, Probability=0.0119
  ('city',): Count=2, Probability=0.0119
  ('high',): Count=2, Probability=0.0119
  ('say',): Count=2, Probability=0.0119
  ('due',): Count=2, Probability=0.0119
  ('wholl',): Count=2, Probability=0.0119
  ('inmates',): Count=2, Probability=0.0119
  ('get',): Count=2, Probability=0.0119
  ('one',): Count=1, Probability=0.0060
  ('reviewers',): Count=1, Probability=0.0060
  ('mentioned',): Count=1, Probability=0.0060
  ('1',): Count=1, Probability=0.0060
  ('youll',): Count=1, Probability=0.0060
  ('hooke

##Explanation & Analysis
🔍 Understanding N-gram Representations

Unigrams treat each word independently and represent text as individual tokens.
This approach is simple but ignores word order and context.

Bigrams capture pairs of consecutive words, allowing the model to learn short-range dependencies such as negations (“not good”).

Trigrams model longer word sequences and preserve more contextual information, but they increase dimensionality and sparsity.

As the value of n increases, contextual understanding improves, but the number of unique N-grams grows significantly, making the representation more sparse and computationally expensive.
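
The growth in sparsity can be seen even on a single sentence. The sketch below builds n-grams with a plain sliding window (the helper `sliding_ngrams` and the example sentence are illustrative assumptions, not part of the notebook) and compares total and unique counts for n = 1, 2, 3.

```python
def sliding_ngrams(tokens, n):
    """Return all n-grams (as tuples) from a token list using a sliding window."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the film was not good the film was long".split()

summary = {}
for n in (1, 2, 3):
    grams = sliding_ngrams(tokens, n)
    summary[n] = (len(grams), len(set(grams)))
    print(f"n={n}: total={len(grams)}, unique={len(set(grams))}")
```

In this toy sentence the share of n-grams that are unique rises from 6 of 9 unigrams to 6 of 7 trigrams. On a full corpus the same trend means most higher-order n-grams occur only once.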

##3. Vectorization Techniques
**Objective**

Convert textual reviews into numerical form so they can be used by machine learning models.

**Approach**

1. Text data cannot be directly processed by machine learning algorithms.

2. Therefore, vectorization techniques are used to represent text as numerical vectors.

3. In this project, Bag of Words using CountVectorizer is applied.
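
To make the idea concrete, here is a minimal Bag of Words sketch in plain Python. It is not how `CountVectorizer` is implemented internally; the two example documents are made up for illustration.

```python
from collections import Counter

# Two hypothetical, already-cleaned documents
docs = ["good film good cast", "bad film"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are counts
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is mostly zeros once the vocabulary grows, which is why libraries store these matrices in sparse form.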

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform cleaned text data
X_vectors = vectorizer.fit_transform(df['clean_review'])

# Display matrix shape
print("Vectorized data shape:", X_vectors.shape)
print("Total documents:", X_vectors.shape[0])
print("Vocabulary size:", X_vectors.shape[1])

Vectorized data shape: (50000, 221431)
Total documents: 50000
Vocabulary size: 221431


In [23]:
# Convert one review into dense format for display
sample_vector = X_vectors[0].toarray()
print("Sample vector (first 15 values):", sample_vector[0][:15])


Sample vector (first 15 values): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [24]:
vocab_words = vectorizer.get_feature_names_out()
print("Sample vocabulary words:", vocab_words[100:110])


Sample vocabulary words: ['1000000000' '10000000000' '1000000000000' '1000000000000010'
 '1000000000000010000000000000' '100000dm' '100001' '10002000' '10005000'
 '1000lb']


##4. Text Classification Model (Naive Bayes)
Model Used

Multinomial Naive Bayes, suitable for count-based text features.
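
As a rough illustration of what the model computes, the sketch below implements multinomial Naive Bayes with add-one (Laplace) smoothing from scratch. The function names `train_nb` and `predict_nb` and the four training documents are assumptions for this example; scikit-learn's `MultinomialNB` is what the notebook actually uses.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total_words + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {
        label: log_prior + sum(ll[w] for w in tokens if w in ll)
        for label, (log_prior, ll) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = positive, 0 = negative
train_docs = [["great", "film"], ["wonderful", "acting"],
              ["terrible", "film"], ["boring", "plot"]]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
pred_pos = predict_nb(model, ["wonderful", "film"])
pred_neg = predict_nb(model, ["boring", "film"])
print(pred_pos, pred_neg)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.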

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


In [29]:
# Prepare features and target labels
y = df['sentiment']   # already encoded: 1 = positive, 0 = negative
X = X_vectors


In [30]:
# Split into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print("Training samples:", X_train.shape)
print("Testing samples:", X_test.shape)


Training samples: (37500, 221431)
Testing samples: (12500, 221431)


In [31]:
# Train the model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)


In [32]:
# Predict labels for the test set
y_predictions = nb_model.predict(X_test)


##5. Model Evaluation
Evaluation Metrics Used

1. Accuracy

2. Precision

3. Recall

4. F1-Score

5. Confusion Matrix

In [33]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


###Metric Result

In [34]:
print("Accuracy:", accuracy_score(y_test, y_predictions))
print("Precision:", precision_score(y_test, y_predictions))
print("Recall:", recall_score(y_test, y_predictions))
print("F1 Score:", f1_score(y_test, y_predictions))


Accuracy: 0.8636
Precision: 0.8748539963290506
Recall: 0.845918038076799
F1 Score: 0.8601427282421459


###Confusion Matrix

In [35]:
print("Confusion Matrix:\n", confusion_matrix(y_test, y_predictions))


Confusion Matrix:
 [[5552  750]
 [ 955 5243]]
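
The scores reported above can be reproduced by hand from this confusion matrix. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels, ordered 0 then 1); the variable names are illustrative.

```python
# Values copied from the confusion matrix printed above
tn, fp = 5552, 750
fn, tp = 955, 5243

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all reviews classified correctly
precision = tp / (tp + fp)            # of predicted positives, how many are truly positive
recall = tp / (tp + fn)               # of true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

Precision is higher than recall here because the model makes more false negatives (955) than false positives (750).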


###Sample Predictions

In [36]:
for i in range(5):
    # y_test.index holds the original row labels, so look them up with .loc
    print("\nReview:", df['review'].loc[y_test.index[i]][:100])
    print("Actual:", "Positive" if y_test.iloc[i] == 1 else "Negative")
    print("Predicted:", "Positive" if y_predictions[i] == 1 else "Negative")



Review: With No Dead Heroes you get stupid lines like that as this woefully abysmal action flick needs to be
Actual: Negative
Predicted: Negative

Review: I thought maybe... maybe this could be good. An early appearance by the Re-Animator (Jeffery Combs);
Actual: Negative
Predicted: Negative

Review: An elite American military team which of course happens to include two good looking women and a guy 
Actual: Negative
Predicted: Negative

Review: Ridiculous horror film about a wealthy man (John Carradine) dying and leaving everything to his four
Actual: Negative
Predicted: Negative

Review: Well, if you are one of those Katana's film-nuts (just like me) you sure will appreciate this metaph
Actual: Positive
Predicted: Positive


###Unigrams vs Bigrams Comparison

In [38]:
# ngram_range=(1, 2) builds features from both unigrams and bigrams
vectorizer_bi = CountVectorizer(ngram_range=(1,2))
X_bi = vectorizer_bi.fit_transform(df['clean_review'])

X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(
    X_bi, y, test_size=0.25, random_state=1
)

nb_bi = MultinomialNB()
nb_bi.fit(X_train_bi, y_train_bi)

y_pred_bi = nb_bi.predict(X_test_bi)

print("Bigram Accuracy:", accuracy_score(y_test_bi, y_pred_bi))


Bigram Accuracy: 0.88672


##**6. Reflection**
Short Answers

###Effect of stemming/lemmatization:
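Both techniques group inflected forms of a word under a single token, which reduces vocabulary size and improves consistency. The classifier in this notebook was trained on `clean_review` rather than the stemmed or lemmatized columns, so their effect on accuracy was not measured here.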

###Did n-grams improve performance?
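Yes. Adding bigrams to the unigram features raised test accuracy from 0.8636 to 0.8867, because short phrases can carry sentiment that their individual words do not. Negations such as “not good” are the typical example, although in this pipeline NLTK's stopword list removes “not” during cleaning, so negation words would have to be kept for such phrases to be captured.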

###BoW vs CountVectorizer:
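Bag of Words is the representation: each document becomes a vector of word counts, ignoring word order. CountVectorizer is scikit-learn's implementation of it, with built-in options for tokenization, lowercasing, stopword removal, and n-gram ranges.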

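###Why does Naive Bayes work well for text?
It trains quickly on high-dimensional sparse count data, and its assumption that words are independent given the class, while unrealistic, works well in practice with word-frequency features.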

###Future improvements:

1. Use TF-IDF weighting instead of raw counts

2. Apply word embeddings (Word2Vec, GloVe)

3. Experiment with deep learning models (LSTM, BERT)
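
As a starting point for the first improvement, the sketch below computes TF-IDF weights in plain Python. The smoothed IDF formula is one common variant chosen here as an assumption, and the three toy documents are made up; in practice a library vectorizer would replace this code.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
docs = [["good", "film"], ["bad", "film"], ["good", "cast"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency (assumed variant): rarer words get larger weights
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

def tfidf(doc):
    """Weight each word's raw count in the document by its IDF."""
    counts = Counter(doc)
    return {word: count * idf[word] for word, count in counts.items()}

weights = [tfidf(doc) for doc in docs]
for doc, w in zip(docs, weights):
    print(doc, {word: round(value, 3) for word, value in w.items()})
```

Words that appear in many reviews, such as "film", receive less weight than words that appear in few, which can help the classifier focus on distinctive terms.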