# Task 1
## Nzambuli Daniel
## 665721
#  Exploring Language Modeling with N-Gram Sizes and Smoothing Techniques
## 22/09/2024


# Objective

The goal of this lab assignment is to understand how different n-gram sizes and smoothing
techniques affect the performance of language models. You will implement n-gram models,
apply various smoothing techniques, and evaluate their performance using a sample text
dataset, the
[IMDB Movie Review Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

## Background

Language modeling is a crucial task in natural language processing (NLP) that involves
predicting the next word in a sequence given the previous words. N-gram models are a
type of statistical language model that uses the probabilities of sequences of n words
to make predictions. Smoothing techniques are employed to handle the problem of zero
probabilities for unseen n-grams in the training data.

## Materials Needed

- Python 3.x
- Libraries: NLTK, NumPy, Pandas, Matplotlib (for visualization)
- A text dataset, e.g., a large text corpus like the Movie Review Dataset available on Kaggle

# Assignment Steps

## Step 1: Data Preparation

In [1]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Define a custom method to apply a function in chunks
def apply_in_chunks(self, func, chunk_size=50, **kwargs):
    results = []
    for start in range(0, len(self), chunk_size):
        chunk = self.iloc[start:start+chunk_size]
        result_chunk = chunk.apply(func, axis=1, **kwargs)
        results.append(result_chunk)
    return pd.concat(results).reset_index(drop=True)

# Monkey patch the method to DataFrame class
pd.DataFrame.apply_in_chunks = apply_in_chunks

In [4]:
dataset = pd.read_csv("/content/drive/MyDrive/NLP Quiz 1/IMDB Dataset.csv")
dataset.head(n = 5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Reasons for Data Preprocessing
Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.

## Goal for Preprocessing

1. Cleaning makes the data suitable for an ML model
2. Increase the accuracy and efficiency of ML models

## Steps

- get the dataset
- import the libraries
- import the dataset
- find missing data
- encode categorical data
- split data into training and test set
- feature scaling

## Key terms

1. **Dataset** -- the collected data for a particular problem in a proper format
2. **Comma-Separated Values** -- file used to save tabular data with comma separation
3. **NumPy** -- Python library for mathematical operations, with support for multidimensional arrays and matrices
4. **Matplotlib** -- 2-D plotting library with a sub-library *pyplot*
5. **Pandas** -- library for importing, managing, and manipulating datasets
6. **Scikit-learn** -- library for building machine learning models

In [5]:
# find the row and column indices of any missing values
np.where(pd.isnull(dataset))

(array([], dtype=int64), array([], dtype=int64))

# Display basic statistics about reviews and sentiments

In [6]:
dataset['sentiment'].value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
positive,25000
negative,25000


In [7]:
dataset['review'].apply(len).describe()

Unnamed: 0,review
count,50000.0
mean,1309.43102
std,989.728014
min,32.0
25%,699.0
50%,970.0
75%,1590.25
max,13704.0


## Clean the data

Parts of this data need to be modified:

1. Remove the HTML `<br />` tags that were left in the reviews
2. Convert all reviews to lowercase
3. Collapse repeated whitespace into single spaces



In [8]:
import re

def clean_review(text):
    text = re.sub(r'<br\s*/>', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower()

dataset['clean_review'] = dataset['review'].apply(clean_review)

dataset.head(n = 5)

Unnamed: 0,review,sentiment,clean_review
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production. the filming tec...
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"petter mattei's ""love in the time of money"" is..."


There are no missing values in this dataset, so the next step is to encode the categorical data.

# Encoding Categorical Data

In [9]:
from sklearn.preprocessing import LabelEncoder

label_encoder_x = LabelEncoder()
dataset.iloc[:, 1]= label_encoder_x.fit_transform(dataset.iloc[:, 1])
dataset.head(n = 5)

Unnamed: 0,review,sentiment,clean_review
0,One of the other reviewers has mentioned that ...,1,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,1,a wonderful little production. the filming tec...
2,I thought this was a wonderful way to spend ti...,1,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,0,basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,"petter mattei's ""love in the time of money"" is..."


The data has been encoded such that

| Value| Representation|
|:-----|-----:|
|Positive| 1|
|Negative| 0|
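
The mapping in the table follows from how `LabelEncoder` assigns codes: it sorts the distinct class labels and numbers them from 0. A minimal pure-Python sketch of that behaviour is shown below; the helper name `encode_labels` is hypothetical and only used for illustration.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))  # e.g. ['negative', 'positive']
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


codes, mapping = encode_labels(["positive", "positive", "negative", "positive"])
print(mapping)  # {'negative': 0, 'positive': 1}
print(codes)    # [1, 1, 0, 1]
```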


Now the data can be split into training and test sets.

# Perform Splitting of Data into Training and Test Set

In [10]:
dataset.to_csv("/content/drive/MyDrive/NLP Quiz 1/Clean_IMDB_Dataset.csv")

In [11]:
dataset.columns

Index(['review', 'sentiment', 'clean_review'], dtype='object')

In [12]:
from sklearn.model_selection import train_test_split

x = np.array(dataset.iloc[:,2])
y = np.array(dataset.iloc[:,1])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 21)

## Parameters of train_test_split

1. **x** -- the independent variable
2. **y** -- the dependent variable
3. **test_size** -- the proportion of the whole dataset that is held out as the test set (0.2 here, so 80% is used for training)
4. **random_state** -- the seed for random selection
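
To make `test_size` and `random_state` concrete, the sketch below shuffles row indices with a fixed seed and slices off the test share. It is only an illustration written with the standard library, not scikit-learn's implementation, so the rows it selects will differ from those chosen by `train_test_split`. The function name `split_indices` is hypothetical.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=21):
    """Shuffle row indices with a fixed seed and slice off the test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed -> same shuffle
    n_test = math.ceil(n_samples * test_size)  # rows reserved for testing
    return indices[n_test:], indices[:n_test]  # (train, test)


train_idx, test_idx = split_indices(50_000, test_size=0.2, seed=21)
print(len(train_idx), len(test_idx))  # 40000 10000
```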

There is no need to perform feature scaling here, because the *independent variable* is raw text that will be used to build a corpus.

# Create a Corpus

A corpus needs:
- removal of punctuation
- tokenization of the text

In [13]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from collections import defaultdict, Counter

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
def preproc_text(text):
    '''
    preproc_text

    A function that converts the text from regular text to tokenized text for a corpus

    input:
        text
    output:
        tokens
    '''
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    return tokens

In [15]:
clean_df = pd.DataFrame({
    'review':x_train,
    'sentiment': y_train})
clean_df['tokens'] = clean_df['review'].apply(preproc_text)
clean_df.head(n = 5)

Unnamed: 0,review,sentiment,tokens
0,the clouded yellow is a compact psychological ...,1,"[the, clouded, yellow, is, a, compact, psychol..."
1,dvd has become the equivalent of the old late ...,0,"[dvd, has, become, the, equivalent, of, the, o..."
2,"good drama/comedy, with two good performances ...",1,"[good, dramacomedy, with, two, good, performan..."
3,not worth the video rental or the time or the ...,0,"[not, worth, the, video, rental, or, the, time..."
4,"we've all been there, sitting with some friend...",0,"[weve, all, been, there, sitting, with, some, ..."
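
The token lists above show a side effect of deleting punctuation outright: `drama/comedy` becomes `dramacomedy` and `we've` becomes `weve`. If that is undesirable, one possible variation is to replace punctuation with a space while keeping apostrophes inside words. The sketch below uses `str.split` as a stand-in for `word_tokenize` so that it runs without NLTK data; the function name `tokenize_keep_boundaries` is hypothetical.

```python
import re


def tokenize_keep_boundaries(text):
    """Replace punctuation with spaces (keeping in-word apostrophes), then split."""
    # Turn anything that is not a word character, whitespace, or apostrophe into a space.
    text = re.sub(r"[^\w\s']", " ", text)
    # Drop apostrophes that are not between two word characters (e.g. quote marks).
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    return text.split()


print(tokenize_keep_boundaries("good drama/comedy, with two 'good' performances"))
# ['good', 'drama', 'comedy', 'with', 'two', 'good', 'performances']
print(tokenize_keep_boundaries("we've all been there"))
# ["we've", 'all', 'been', 'there']
```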


In [16]:
clean_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/token_data.csv")

# Create N-grams

In [17]:
def gen_ngram(token, no_n_gram):
    ngrams = zip(*[token[i:] for i in range(no_n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

# store n-grams
ngram_counts = defaultdict(lambda: defaultdict(int))

In [18]:
for tokens in clean_df['tokens']:
    for n in range(1, 4):  # Generate 1, 2, 3
        ngrams = gen_ngram(tokens, n)
        for ngram in ngrams:
            ngram_counts[n][ngram] += 1
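
To see what `gen_ngram` produces, it can be run on a short token list. The sketch repeats the function so that it runs on its own; the toy sentence is an assumption made for illustration. Because punctuation is stripped from a whole review before tokenizing, n-grams also span sentence boundaries, such as `grey the` below.

```python
from collections import Counter


def gen_ngram(token, no_n_gram):
    """Slide a window of size `no_n_gram` over the token list."""
    ngrams = zip(*[token[i:] for i in range(no_n_gram)])
    return [' '.join(ngram) for ngram in ngrams]


tokens = "the cloud is dark grey the sky is dark grey".split()

bigram_counts = Counter(gen_ngram(tokens, 2))
print(bigram_counts["is dark"])   # 2
print(bigram_counts["grey the"])  # 1  (crosses the sentence boundary)
print(len(gen_ngram(tokens, 3)))  # 8 trigrams from 10 tokens
```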

## Unigram

**example** The cloud is dark grey. The sky is very grey.

|Unigram| Count|
|:---|----:|
|the| 2|
|cloud| 1|
|is| 2|
|dark| 1|
|grey| 2|
|sky| 1|
|very| 1|

In [19]:
unigrams_df = pd.DataFrame(ngram_counts[1].items(), columns = ['Unigram', 'Counts'])
unigrams_df.head(n = 5)

Unnamed: 0,Unigram,Counts
0,the,530591
1,clouded,10
2,yellow,173
3,is,167942
4,a,256171


In [20]:
unigrams_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/unigrams_df.csv")

## Bigrams

**example** The cloud is dark grey. The sky is dark grey.

|Bigram| Counts|
|:----|----:|
|The cloud| 1|
|cloud is| 1|
|is dark| 2|
|dark grey|2 |
|The sky| 1|
|sky is| 1|

In [21]:
bigram_df = pd.DataFrame(ngram_counts[2].items(), columns = ['Bigram', "Counts"])
bigram_df.head(n = 5)

Unnamed: 0,Bigram,Counts
0,the clouded,1
1,clouded yellow,1
2,yellow is,9
3,is a,20661
4,a compact,6


In [22]:
bigram_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/bigram_df.csv")

## Trigrams

**example** This cloud is very dark.

|Trigram|Counts|
|:----|----:|
|This cloud is| 1|
|cloud is very| 1|
|is very dark| 1|

In [23]:
trigram_df = pd.DataFrame(ngram_counts[3].items(), columns = ['Trigram', "Counts"])
trigram_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/trigram_df.csv")

In [24]:
trigram_df.head(n = 5)

Unnamed: 0,Trigram,Counts
0,the clouded yellow,1
1,clouded yellow is,1
2,yellow is a,4
3,is a compact,1
4,a compact psychological,1


# Calculate the Probability

For each n-gram, calculate the probability using the formula:

$$
P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, w_2, \ldots, w_n)}{C(w_1, w_2, \ldots, w_{n-1})}
$$
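
As a worked illustration of the formula for n = 2, the conditional probability of a word given its predecessor is the bigram count divided by the count of the preceding word. The sketch below uses a toy token list (an assumption, not the IMDB data). It also prints the bigram's share of all bigrams, which is a different quantity: a joint relative frequency rather than a conditional probability.

```python
from collections import Counter

tokens = "the cloud is dark grey the sky is very grey".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def conditional_prob(prev_word, word):
    """Maximum-likelihood estimate P(word | prev_word) = C(prev_word, word) / C(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


# "the" occurs twice, followed once by "cloud" and once by "sky".
print(conditional_prob("the", "cloud"))  # 0.5

# An unseen bigram gets probability zero, which is what smoothing addresses.
print(conditional_prob("the", "dark"))   # 0.0

# Share of ("the", "cloud") among all 9 bigrams: a joint relative frequency (1/9).
print(bigram_counts[("the", "cloud")] / sum(bigram_counts.values()))
```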

In [25]:
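def calc_ngram_prob(ngram_df: pd.DataFrame) -> tuple[np.ndarray, int]:
  '''
  calc_ngram_prob calculates the relative frequency of each n-gram,
  i.e. its count divided by the total count of all n-grams of that size.
  Note that this is a joint estimate, not the conditional probability
  given by the formula above.

  input
    ngram_df -- a dataframe with a 'Counts' column
  output
    a numpy array of the probabilities and the total n-gram count
  '''
  total = ngram_df['Counts'].sum()
  ngram_probs = np.array(ngram_df['Counts'] / total)

  return ngram_probs, total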

In [26]:
unigrams_df['Probabilities'],unigram_tot  = calc_ngram_prob(unigrams_df)
unigrams_df.head(n = 5)

Unnamed: 0,Unigram,Counts,Probabilities
0,the,530591,0.058198
1,clouded,10,1e-06
2,yellow,173,1.9e-05
3,is,167942,0.018421
4,a,256171,0.028098


In [27]:
bigram_df['Probabilities'],bigram_tot = calc_ngram_prob(bigram_df)
bigram_df.head(n = 5)

Unnamed: 0,Bigram,Counts,Probabilities
0,the clouded,1,1.101694e-07
1,clouded yellow,1,1.101694e-07
2,yellow is,9,9.91525e-07
3,is a,20661,0.002276211
4,a compact,6,6.610167e-07


In [28]:
trigram_df['Probabilities'], trigram_tot = calc_ngram_prob(trigram_df)
trigram_df.head(n = 5)

Unnamed: 0,Trigram,Counts,Probabilities
0,the clouded yellow,1,1.106571e-07
1,clouded yellow is,1,1.106571e-07
2,yellow is a,4,4.426283e-07
3,is a compact,1,1.106571e-07
4,a compact psychological,1,1.106571e-07


# Write the Probabilities


In [29]:
def write_ngram_df():
  unigrams_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/unigrams_df.csv")
  bigram_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/bigram_df.csv")
  trigram_df.to_csv("/content/drive/MyDrive/NLP Quiz 1/trigram_df.csv")

write_ngram_df()

# Step 3: Smoothing Techniques
## 1. Laplace (Add-One) Smoothing

In [30]:
# create test ngrams
clean_test = pd.DataFrame({
    'review':x_test,
    'sentiment': y_test})

clean_test['tokens'] = clean_test['review'].apply(preproc_text)

# create the ngrams
ngram_tests = defaultdict(lambda: defaultdict(int))
for tokens in clean_test['tokens']:
    for n in range(1, 4):  # Generate 1, 2, 3
        ngrams = gen_ngram(tokens, n)
        for ngram in ngrams:
            ngram_tests[n][ngram] += 1

# create test dataframes
test_unigram = pd.DataFrame(ngram_tests[1].items(), columns = ['Unigram', 'Counts'])
test_bigram = pd.DataFrame(ngram_tests[2].items(), columns = ['Bigram', 'Counts'])
test_trigram = pd.DataFrame(ngram_tests[3].items(), columns = ['Trigram', 'Counts'])

# # calculate the probabilities of each element
# test_unigram['Probability'] = test_unigram['Counts'] / unigram_tot
# test_bigram['Probability'] = test_bigram['Counts'] / bigram_tot
# test_trigram['Probability'] = test_trigram['Counts'] / trigram_tot

# 1. Laplace smoothing

In [31]:
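def laplace_smth(row, df, vocab_sz, lower_order_ngram):
  '''
  laplace_smth applies Laplace (add-one) smoothing to one n-gram row

  input
    row -- a row of df with the n-gram text and its "Counts"
    df -- the dataframe the row comes from ('Unigram', 'Bigram' or 'Trigram')
    vocab_sz -- the vocabulary size V added to the denominator
    lower_order_ngram -- dataframe of (n-1)-grams, or None for unigrams
  output
    the smoothed probability (count + 1) / (context count + V)
  '''
  ngram_count = row["Counts"]

  if 'Unigram' in df.columns:
    # unigrams have no context, so the denominator is the total token count
    n_minus_1_gram_count = df["Counts"].sum()

  elif 'Bigram' in df.columns:
    # context = the bigram without its last word
    lower_ngram = ' '.join(row['Bigram'].split()[:-1])
    n_minus_1_gram_count = lower_order_ngram[lower_order_ngram['Unigram'] == lower_ngram]["Counts"].sum()
  elif 'Trigram' in df.columns:
    # context = the trigram without its last word
    lower_ngram = ' '.join(row['Trigram'].split()[:-1])
    n_minus_1_gram_count = lower_order_ngram[lower_order_ngram['Bigram'] == lower_ngram]["Counts"].sum()

  else:
    raise ValueError("Dataframe must contain 'Unigram', 'Bigram', or 'Trigram' columns")

  smoothed_prob = (ngram_count + 1) / (n_minus_1_gram_count + vocab_sz)
  return smoothed_prob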
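
Filtering the lower-order dataframe once per row, as `laplace_smth` does, scans the whole table for every n-gram and becomes slow on millions of bigrams. One possible alternative is to look up the context count in a dictionary. The sketch below applies add-one smoothing to bigrams on a toy token list, taking V as the number of distinct words; the name `laplace_bigram_prob` and the toy data are assumptions for illustration.

```python
from collections import Counter

tokens = "the cloud is dark grey the sky is very grey".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)  # V = number of distinct words (7 here)


def laplace_bigram_prob(prev_word, word):
    """Add-one estimate: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)


print(laplace_bigram_prob("the", "cloud"))  # (1 + 1) / (2 + 7) = 0.2222...
print(laplace_bigram_prob("the", "dark"))   # (0 + 1) / (2 + 7) = 0.1111..., unseen but non-zero

# The smoothed probabilities of all possible next words after "the" sum to 1.
total = sum(laplace_bigram_prob("the", w) for w in unigram_counts)
print(round(total, 10))  # 1.0
```

Dictionary lookups take roughly constant time per n-gram, so the whole pass is linear in the number of n-grams rather than quadratic.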

# 2. Good-Turing Discounting

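Good-Turing smoothing re-estimates the count of an n-gram from the *frequency of frequencies*. If $N_r$ is the number of distinct n-grams that occur exactly $r$ times, an n-gram seen $r$ times receives the adjusted count $r^* = (r + 1) N_{r+1} / N_r$. The probability mass freed by discounting low-frequency n-grams is reassigned to unseen ones, whose total share is estimated as $N_1 / N$, where $N$ is the total number of observed n-grams.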

In [32]:
def calculate_N_r(df):
  '''
  frequency of frequencies
  '''
  counts = df['Counts'].values
  N_r = Counter(counts)
  return N_r

def good_turing_smth(row, N_r, total_ngrams):
  '''
  good-turing smoothing formula
  '''
  r = row['Counts']
  r_plus_1 = r + 1

  # unseen n-grams: the total probability mass reserved for them is N_1 / N
  if r == 0:
    return N_r.get(1, 0) / total_ngrams

  # adjusted count
  if r_plus_1 in N_r and r in N_r:
    r_star = (r_plus_1 * N_r[r_plus_1])/ N_r[r]
  else:
    r_star = r

  smoothed_prob = r_star / total_ngrams
  return smoothed_prob
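
A small worked example may help make the adjusted count concrete. With toy counts (an assumption for illustration), `N_r` is the number of distinct n-grams seen exactly `r` times, the adjusted count is (r + 1) * N_{r+1} / N_r, and the total probability mass reserved for unseen n-grams is N_1 / N.

```python
from collections import Counter

# Toy n-gram counts: three n-grams seen once, two seen twice, one seen three times.
counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}

N_r = Counter(counts.values())  # frequency of frequencies: {1: 3, 2: 2, 3: 1}
total = sum(counts.values())    # N = 10 observed n-grams


def adjusted_count(r):
    """Good-Turing r* = (r + 1) * N_{r+1} / N_r, falling back to r when N_{r+1} is missing."""
    if N_r.get(r + 1, 0) > 0 and N_r.get(r, 0) > 0:
        return (r + 1) * N_r[r + 1] / N_r[r]
    return r


print(adjusted_count(1))  # 2 * 2 / 3 = 1.333...
print(adjusted_count(2))  # 3 * 1 / 2 = 1.5
print(adjusted_count(3))  # 3 (no n-gram was seen 4 times, so the count is kept)
print(N_r[1] / total)     # 0.3 = probability mass reserved for unseen n-grams
```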

# 3.

## Laplace Output

In [33]:
unigram_sz = len(unigrams_df)
bigram_sz = len(bigram_df)
trigram_sz = len(trigram_df)

unigrams_df['lap_smth'] = unigrams_df.apply(laplace_smth,
                                              axis = 1,
                                              df = unigrams_df,
                                              lower_order_ngram = None,
                                              vocab_sz = unigram_sz)
unigrams_df.head(n = 5)

Unnamed: 0,Unigram,Counts,Probabilities,lap_smth
0,the,530591,0.058198,0.057272
1,clouded,10,1e-06,1e-06
2,yellow,173,1.9e-05,1.9e-05
3,is,167942,0.018421,0.018128
4,a,256171,0.028098,0.027651


In [None]:
bigram_df['lap_smth'] = bigram_df.apply(laplace_smth,
                                              axis = 1,
                                              df = bigram_df,
                                              lower_order_ngram = unigrams_df,
                                              vocab_sz = unigram_sz)  # V is the number of distinct words

bigram_df.head(n = 5)

In [None]:
trigram_df['lap_smth'] = trigram_df.apply(laplace_smth,
                                              axis = 1,
                                              df = trigram_df,
                                              lower_order_ngram = bigram_df,
                                              vocab_sz = unigram_sz)  # V is the number of distinct words
trigram_df.head(n = 5)

## Good-Turing Output

In [None]:
unigram_n_r = calculate_N_r(unigrams_df)
bigram_n_r = calculate_N_r(bigram_df)
trigram_n_r = calculate_N_r(trigram_df)

In [None]:
unigrams_df['good_tur_smth'] = unigrams_df.apply(
    good_turing_smth,
    axis = 1,
    N_r = unigram_n_r,
    total_ngrams = unigram_tot
)

unigrams_df.head(n=5)

In [None]:
bigram_df['good_tur_smth'] = bigram_df.apply(
    good_turing_smth,
    axis = 1,
    N_r = bigram_n_r,
    total_ngrams = bigram_tot
)

bigram_df.head(n=5)

In [None]:
trigram_df['good_tur_smth'] = trigram_df.apply(
    good_turing_smth,
    axis = 1,
    N_r = trigram_n_r,
    total_ngrams = trigram_tot
)

trigram_df.head(n=5)

# Step 4: Model Performance Evaluation

## 1. Perplexity Calculation

In [None]:
def calculate_perplexity(df, prob_column):
  log_probs = np.log(df[prob_column])
  avg_log_prob = np.mean(log_probs)
  perplexity = np.exp(-avg_log_prob)
  return perplexity
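
Perplexity is usually reported per token of the evaluated text, so an n-gram that occurs many times contributes many times to the average. `calculate_perplexity` above averages over the distinct n-grams in a dataframe, giving each one equal weight. A hedged sketch of a count-weighted variant follows, using a hypothetical helper `weighted_perplexity` and made-up probabilities.

```python
import math


def weighted_perplexity(ngram_probs, ngram_counts):
    """exp of the negative count-weighted mean log-probability.

    ngram_probs  -- dict mapping n-gram -> model probability (must be > 0)
    ngram_counts -- dict mapping n-gram -> occurrences in the evaluated text
    """
    total = sum(ngram_counts.values())
    log_sum = sum(count * math.log(ngram_probs[ngram])
                  for ngram, count in ngram_counts.items())
    return math.exp(-log_sum / total)


# Made-up numbers for illustration only.
probs = {"is a": 0.25, "a compact": 0.25}
counts = {"is a": 3, "a compact": 1}
print(round(weighted_perplexity(probs, counts), 6))  # 4.0 (every n-gram has probability 1/4)
```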

In [None]:
Unigram_laplace_perplexity = calculate_perplexity(unigrams_df, 'lap_smth')
Unigram_good_turing_perplexity = calculate_perplexity(unigrams_df, 'good_tur_smth')

print(f"Unigram Laplace Perplexity: {Unigram_laplace_perplexity}")
print(f"Unigram Good-Turing Perplexity: {Unigram_good_turing_perplexity}")

In [None]:
Bigram_laplace_perplexity = calculate_perplexity(bigram_df, 'lap_smth')
Bigram_good_turing_perplexity = calculate_perplexity(bigram_df, 'good_tur_smth')

print(f"Bigram Laplace Perplexity: {Bigram_laplace_perplexity}")
print(f"Bigram Good-Turing Perplexity: {Bigram_good_turing_perplexity}")

In [None]:
Trigram_laplace_perplexity = calculate_perplexity(trigram_df, 'lap_smth')
Trigram_good_turing_perplexity = calculate_perplexity(trigram_df, 'good_tur_smth')

print(f"Trigram Laplace Perplexity: {Trigram_laplace_perplexity}")
print(f"Trigram Good-Turing Perplexity: {Trigram_good_turing_perplexity}")

## 2. Visualize Analysis

In [None]:
import matplotlib.pyplot as plt

ngram_sizes = [1, 2, 3]

lap_perplex = [Unigram_laplace_perplexity,
               Bigram_laplace_perplexity,
               Trigram_laplace_perplexity
               ]
good_tur_perplex = [Unigram_good_turing_perplexity,
                    Bigram_good_turing_perplexity,
                    Trigram_good_turing_perplexity
                    ]

plot_data = pd.DataFrame({
    'N-gram Size': ngram_sizes,
    'Laplace Perplexity': lap_perplex,
    'Good-Turing Perplexity': good_tur_perplex
})

In [None]:
plt.plot(plot_data['N-gram Size'], plot_data['Laplace Perplexity'], marker='o', label='Laplace Smoothing')
plt.plot(plot_data['N-gram Size'], plot_data['Good-Turing Perplexity'], marker='s', label='Good-Turing Smoothing')

plt.title('Perplexity vs N-gram Size')
plt.xlabel('N-gram Size')
plt.ylabel('Perplexity')

plt.legend()

plt.show()