# Task 1
## Nzambuli Daniel
## 665721
# Exploring Language Modeling with N-Gram Sizes and Smoothing Techniques
## 22/09/2024

# Objective

The goal of this lab assignment is to understand how different n-gram sizes and smoothing
techniques affect the performance of language models. You will implement n-gram models,
apply various smoothing techniques, and evaluate their performance on a sample text
dataset, the [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

## Background

Language modeling is a crucial task in natural language processing (NLP) that involves
predicting the next word in a sequence given the previous words. N-gram models are a
type of statistical language model that uses the probabilities of sequences of n words
to make predictions. Smoothing techniques are employed to handle the problem of zero
probabilities for unseen n-grams in the training data.
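
To make the zero-probability problem concrete, here is a minimal sketch of a bigram model on a made-up sentence (not taken from the IMDB data). The function name `bigram_prob` and the parameter `k` are assumptions chosen for this illustration; `k = 0` gives the unsmoothed estimate and `k = 1` gives add-one (Laplace) smoothing.

```python
from collections import Counter

# Toy corpus, made up for illustration only.
tokens = "the movie was good and the movie was long".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)  # 6 distinct words

def bigram_prob(prev, word, k=0.0):
    """Estimate P(word | prev) with add-k smoothing (k=0 means no smoothing)."""
    numerator = bigram_counts[(prev, word)] + k
    denominator = unigram_counts[prev] + k * vocab_size
    return numerator / denominator

seen_mle = bigram_prob("movie", "was")                # 2 / 2 = 1.0
unseen_mle = bigram_prob("movie", "good")             # 0 / 2 = 0.0
unseen_laplace = bigram_prob("movie", "good", k=1.0)  # 1 / (2 + 6) = 0.125
```

Without smoothing, the unseen bigram "movie good" gets probability 0, which would make any sentence containing it impossible under the model. Add-one smoothing moves a little probability mass to it.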

## Materials Needed

- Python 3.x
- Libraries: NLTK, NumPy, Pandas, Matplotlib (for visualization)
- A text dataset, e.g., a large text corpus such as the Movie Review Dataset available on Kaggle

# Assignment Steps

## Step 1: Data Preparation

In [1]:
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv("IMDB Dataset.csv")
print(dataset)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]

# Reasons for Data Preprocessing
Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.

## Goal for Preprocessing

1. Make the data suitable for an ML model
2. Increase the accuracy and efficiency of ML models

## Steps

- get the dataset
- import the libraries
- import the dataset
- find missing data
- encode categorical data
- split data into training and test set
- feature scaling

## Key terms

1. **Dataset** -- the data collected for a particular problem, in a proper format
2. **Comma-Separated Values (CSV)** -- a file format that stores tabular data with values separated by commas
3. **NumPy** -- a Python library for mathematical operations, with support for multidimensional arrays and matrices
4. **Matplotlib** -- a 2-D plotting library with a sub-library *pyplot*
5. **Pandas** -- a library for importing, managing and manipulating datasets
6. **Scikit-learn** -- a library for building machine learning models

In [3]:
# find the row and column positions of any missing values
np.where(pd.isnull(dataset))

(array([], dtype=int64), array([], dtype=int64))
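
Both arrays are empty, so no cell in the dataset is null. As a complementary check, a per-column count of missing values can be easier to read. The sketch below uses a small made-up frame (the name `toy` and its contents are assumptions for illustration) so that one missing value actually shows up.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing review, for illustration only.
toy = pd.DataFrame({
    "review": ["great film", None, "too long"],
    "sentiment": ["positive", "negative", "negative"],
})

# Number of missing cells in each column.
missing_per_column = toy.isnull().sum()

# Row and column positions of the missing cells, as in the cell above.
rows, cols = np.where(pd.isnull(toy))
```

Here `missing_per_column` reports one missing value for `review` and none for `sentiment`, while `rows` and `cols` point at row 1, column 0.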

# Display basic statistics about reviews and sentiments

In [5]:
dataset['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [6]:
dataset['review'].apply(len).describe()

count    50000.000000
mean      1309.431020
std        989.728014
min         32.000000
25%        699.000000
50%        970.000000
75%       1590.250000
max      13704.000000
Name: review, dtype: float64

## Clean the data

Some parts of the data need to be modified:

1. Remove the HTML `<br />` tags that were left in the reviews
2. Convert all reviews to lowercase
3. Remove all extra spaces
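
The effect of these three steps can be sketched on a single made-up string before applying them to the whole column. The sample text is an assumption for illustration; the function mirrors the `clean_review` cell in this notebook.

```python
import re

def clean_review(text):
    # Replace HTML line breaks such as <br /> or <br/> with a space.
    text = re.sub(r'<br\s*/>', ' ', text)
    # Collapse any run of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.lower()

# Made-up sample string, for illustration only.
sample = "A wonderful little production. <br /><br />The filming   is unassuming."
cleaned = clean_review(sample)
# cleaned == "a wonderful little production. the filming is unassuming."
```

Note that this pattern only matches self-closing line breaks. A bare `<br>` or any other tag would be left in the text.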



In [7]:
import re

def clean_review(text):
    text = re.sub(r'<br\s*/>', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower()

dataset['clean_review'] = dataset['review'].apply(clean_review)

print(dataset)

                                                  review sentiment  \
0      One of the other reviewers has mentioned that ...  positive   
1      A wonderful little production. <br /><br />The...  positive   
2      I thought this was a wonderful way to spend ti...  positive   
3      Basically there's a family where a little boy ...  negative   
4      Petter Mattei's "Love in the Time of Money" is...  positive   
...                                                  ...       ...   
49995  I thought this movie did a down right good job...  positive   
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative   
49997  I am a Catholic taught in parochial elementary...  negative   
49998  I'm going to have to disagree with the previou...  negative   
49999  No one expects the Star Trek movies to be high...  negative   

                                            clean_review  
0      one of the other reviewers has mentioned that ...  
1      a wo

There are no missing values in this dataset, so the next step is to encode the categorical data.

# Encoding Categorical Data

In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder_x = LabelEncoder()
dataset.iloc[:, 1] = label_encoder_x.fit_transform(dataset.iloc[:, 1])
print(dataset)

                                                  review sentiment  \
0      One of the other reviewers has mentioned that ...         1   
1      A wonderful little production. <br /><br />The...         1   
2      I thought this was a wonderful way to spend ti...         1   
3      Basically there's a family where a little boy ...         0   
4      Petter Mattei's "Love in the Time of Money" is...         1   
...                                                  ...       ...   
49995  I thought this movie did a down right good job...         1   
49996  Bad plot, bad dialogue, bad acting, idiotic di...         0   
49997  I am a Catholic taught in parochial elementary...         0   
49998  I'm going to have to disagree with the previou...         0   
49999  No one expects the Star Trek movies to be high...         0   

                                            clean_review  
0      one of the other reviewers has mentioned that ...  
1      a wo

The sentiment labels have been encoded as follows:

| Sentiment | Encoded value |
|:----------|--------------:|
| positive  | 1 |
| negative  | 0 |
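
This mapping follows from how `LabelEncoder` works: it sorts the distinct labels and numbers them from 0, so `negative` comes before `positive`. A minimal sketch to confirm the mapping on a made-up list of labels (the names `encoder` and `mapping` are assumptions for this illustration):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["positive", "negative", "positive"])

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform turns codes back into the original labels.
decoded = [str(label) for label in encoder.inverse_transform([0, 1])]
```

Keeping the fitted encoder around is useful later, because `inverse_transform` recovers the readable labels from model output.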


Now the data can be split into training and test sets.

# Perform Splitting of Data into Training and Test Set

In [9]:
dataset.columns

Index(['review', 'sentiment', 'clean_review'], dtype='object')

In [10]:
from sklearn.model_selection import train_test_split

x = np.array(dataset.iloc[:,2])
y = np.array(dataset.iloc[:,1])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 21)

## Parameters of train_test_split

1. **x** -- the independent variable
2. **y** -- the dependent variable
3. **test_size** -- the proportion of the whole dataset that will be held out as the test set
4. **random_state** -- the seed for random selection
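
With `test_size = 0.2`, 20% of the rows go to the test set and the remaining 80% to the training set, so the 50,000 reviews split into 40,000 training and 10,000 test rows. A small sketch on ten made-up samples (the `x_toy` and `y_toy` arrays are assumptions for illustration) shows the same proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples standing in for the reviews and their labels.
x_toy = np.array([f"review {i}" for i in range(10)])
y_toy = np.array([0, 1] * 5)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=21
)

# 8 samples end up in the training set and 2 in the test set.
sizes = (len(x_tr), len(x_te))
```

Fixing `random_state` makes the split reproducible: running the call again with the same seed returns the same rows in each set.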

There is no need to perform feature scaling here, because the *independent variable* is text that will be used to build a corpus rather than a numeric feature.

# Create a Corpus

A corpus needs:
- removal of punctuation
- tokenization of the text

In [11]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from collections import defaultdict, Counter

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [12]:
def preproc_text(text):
    '''
    preproc_text

    A function that converts the text from regular text to tokenized text for a corpus

    input:
        text
    output:
        tokens
    '''
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    return tokens
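
One side effect of the pattern `[^\w\s]` is worth seeing on an example: punctuation is replaced by an empty string, so words joined by a slash or an apostrophe are merged into one token. The sketch below is an illustration only. The helper name `strip_punctuation` and the sample string are made up, and a plain whitespace split stands in for `word_tokenize` so the example runs without downloading NLTK data.

```python
import re

def strip_punctuation(text):
    # Same pattern as in preproc_text: keep only word characters and whitespace.
    return re.sub(r'[^\w\s]', '', text)

# Made-up sample, for illustration only.
sample = "good drama/comedy, we've seen it."
stripped = strip_punctuation(sample)

# Whitespace split used as a stand-in for word_tokenize (assumption).
tokens = stripped.split()
# tokens == ['good', 'dramacomedy', 'weve', 'seen', 'it']
```

If merged tokens such as `dramacomedy` are not wanted, one option is to replace punctuation with a space instead of an empty string. That is a design choice and not something this notebook does.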

In [15]:
clean_df = pd.DataFrame({
    'review':x_train,
    'sentiment': y_train})
clean_df['tokens'] = clean_df['review'].apply(preproc_text)
print(clean_df)

                                                  review sentiment  \
0      the clouded yellow is a compact psychological ...         1   
1      dvd has become the equivalent of the old late ...         0   
2      good drama/comedy, with two good performances ...         1   
3      not worth the video rental or the time or the ...         0   
4      we've all been there, sitting with some friend...         0   
...                                                  ...       ...   
39995  shintarô katsu, best known for the zatôichi fi...         1   
39996  this is easily one of the worst movies i have ...         0   
39997  excellent film. suzy kendall will hold your in...         1   
39998  simply put, the only saving grace this movie h...         0   
39999  when i first heard about this movie, i noticed...         1   

                                                  tokens  
0      [the, clouded, yellow, is, a, compact, psychol...  
1      [dvd

# Create N-grams

In [16]:
def gen_ngram(token, no_n_gram):
    ngrams = zip(*[token[i:] for i in range(no_n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

# store n-grams
ngram_counts = defaultdict(lambda: defaultdict(int))
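
As a quick check of `gen_ngram`, the sketch below applies it to a short made-up token list. The function body is copied from the cell above; the sample tokens are an assumption for illustration.

```python
def gen_ngram(token, no_n_gram):
    # Zip the token list with shifted copies of itself to form sliding windows.
    ngrams = zip(*[token[i:] for i in range(no_n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

sample_tokens = ['the', 'movie', 'was', 'great']

unigrams = gen_ngram(sample_tokens, 1)
# ['the', 'movie', 'was', 'great']
bigrams = gen_ngram(sample_tokens, 2)
# ['the movie', 'movie was', 'was great']
trigrams = gen_ngram(sample_tokens, 3)
# ['the movie was', 'movie was great']

# A review shorter than n simply yields no n-grams.
too_short = gen_ngram(['great'], 2)
# []
```

A list of `k` tokens produces `k - n + 1` n-grams, because `zip` stops at the shortest shifted copy.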

In [19]:
for tokens in clean_df['tokens']:
    for n in range(1, 4):  # unigrams, bigrams and trigrams
        ngrams = gen_ngram(tokens, n)
        for ngram in ngrams:
            ngram_counts[n][ngram] += 1
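
Because this loop adds to the existing `ngram_counts` dictionary, running the cell more than once without re-creating the dictionary would accumulate the counts again. A possible way to avoid that is to wrap the counting in a function that builds a fresh dictionary on each call. The sketch below is one such variant on made-up tokens; the names `count_ngrams`, `toy_tokens` and `top_bigrams` are assumptions for illustration and not part of the notebook.

```python
from collections import defaultdict

def gen_ngram(token, no_n_gram):
    ngrams = zip(*[token[i:] for i in range(no_n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

def count_ngrams(token_lists, max_n=3):
    """Return {n: {ngram: count}}, built from scratch on every call."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for ngram in gen_ngram(tokens, n):
                counts[n][ngram] += 1
    return counts

# Two made-up tokenized reviews, for illustration only.
toy_tokens = [
    ['the', 'movie', 'was', 'great'],
    ['the', 'movie', 'was', 'long'],
]
toy_counts = count_ngrams(toy_tokens)

# The two most frequent bigrams, highest count first.
top_bigrams = sorted(toy_counts[2].items(), key=lambda kv: kv[1], reverse=True)[:2]
# [('the movie', 2), ('movie was', 2)]
```

Calling `count_ngrams(toy_tokens)` a second time returns the same counts, which makes it easy to inspect the most frequent n-grams before any smoothing is applied.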