# Programming Machine Learning Lab
# Exercise 08

**General Instructions:**

1. You need to submit the PDF as well as the filled notebook file.
1. Name your submissions by prefixing your matriculation number to the filename. For example, if your matriculation number is 12345, rename the files to **"12345_Exercise_08.xxx"**.
1. Complete all your tasks and then do a clean run before generating the final PDF (use the _Clear All Outputs_ and _Run All_ commands in Jupyter Notebook).

**Exercise-specific instructions:**

1. You are allowed to use only NumPy and Pandas (unless stated otherwise). You can use any library for visualizations.

### Part 1

**TF-IDF and BOW**

In this part, you will work with the IMDB movie review dataset to perform various natural language processing tasks. Download the dataset from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

1. Download and read the dataset (subset the data to only use 10,000 rows).
1. Perform tokenization on the review text.
1. Remove stop words from the tokenized text.
1. Use regular expressions to clean the text, removing any HTML tags, emails, and other unnecessary information.
1. Convert the cleaned data into a TF-IDF and BOW representation from scratch.

*Note: you can use NLTK for all sub-parts except the last*
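
The regular-expression cleaning step could look roughly like the sketch below. The helper name `clean_review` and the patterns are assumptions chosen for illustration; they are deliberately simple and will not catch every email or URL format.

```python
import re

# Hypothetical patterns; simple approximations rather than complete validators.
TAG_RE = re.compile(r"<[^>]+>")                              # HTML tags such as <br />
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")   # rough email shape
URL_RE = re.compile(r"https?://\S+|www\.\S+")                # bare URLs
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")                    # anything but letters, digits, whitespace


def clean_review(text):
    """Lowercase a review and strip tags, emails, URLs, and punctuation."""
    text = TAG_RE.sub(" ", text)      # replace tags with a space so neighbouring words do not merge
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub("", text.lower())
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


print(clean_review("I gave this film 10 stars.<br /><br />Mail me: fan@example.com!"))
# prints: i gave this film 10 stars mail me
```

Replacing tags with a space (rather than an empty string) keeps the last word before a `<br />` from being glued to the first word after it.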

**Main task**:
Using the BOW and TF-IDF representations, implement a Naive Bayes classifier for the data from scratch. Use Laplace smoothing in your implementation. **Do not use sklearn for this part.**

[Reference Slide](https://www.ismll.uni-hildesheim.de/lehre/ml-16w/script/ml-09-A8-bayesian-networks.pdf)
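
For orientation, one common way to write the multinomial Naive Bayes estimate with Laplace (add-$\alpha$) smoothing is given below. The notation is an assumption and may differ from the reference slides: $V$ is the vocabulary, $\operatorname{count}(w, c)$ is the total count (or TF-IDF weight) of word $w$ in documents of class $c$, and $x_{d,w}$ is the feature value of word $w$ in document $d$.

$$
\hat{P}(w \mid c) = \frac{\operatorname{count}(w, c) + \alpha}{\sum_{w' \in V} \operatorname{count}(w', c) + \alpha \, |V|}
$$

$$
\hat{c}(d) = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{w \in V} x_{d,w} \, \log \hat{P}(w \mid c) \right]
$$

With $\alpha = 1$, every word receives a non-zero probability in every class, so a single unseen word cannot drive the whole product to zero.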

In [1]:
### Your code here
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from collections import Counter

# Download the IMDB dataset from Kaggle and load it
# Please make sure to have the dataset downloaded and change the path accordingly
#dataset_path = ""
df = pd.read_csv("IMDB Dataset.csv")

# Subset the data to use only 10,000 rows
imdb = df.sample(n=10000, random_state=8)  # tried different random states to get 5k reviews per class
imdb.describe()
imdb.head()

#['sentiment'].value_counts()
#dataset is now balanced 


Unnamed: 0,review,sentiment
40552,I gave this film my rare 10 stars.<br /><br />...,positive
6106,This is one of the best films I have seen in y...,positive
27831,"About halfway through, I realized I didn't car...",negative
28253,The only good part of this movie was the endin...,negative
37221,"After reading several comments, I felt I had t...",positive


In [2]:
from nltk.tokenize.toktok import ToktokTokenizer
import nltk
#Tokenization of text
tokenizer=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

In [5]:
from bs4 import BeautifulSoup
import string
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    # Raw string so that the backslashes reach the regex engine unchanged
    return re.sub(r'\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text
#Apply function on review column
imdb['review']=imdb['review'].apply(denoise_text)



In [6]:
#set stopwords to english
stop=set(stopwords.words('english'))
#print(stop)

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    # Look tokens up in the set `stop`: membership tests on a set are much faster than on a list
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
#Apply function on review column
imdb['review']=imdb['review'].apply(remove_stopwords)


In [7]:
#Stemming the text
ps = nltk.stem.PorterStemmer()  # create the stemmer once and reuse it for every review
def simple_stemmer(text):
    return ' '.join(ps.stem(word) for word in text.split())
#Apply function on review column
imdb['review']=imdb['review'].apply(simple_stemmer)

In [8]:
imdb.head(10)

Unnamed: 0,review,sentiment
40552,gave film rare 10 starswhen first began watch ...,positive
6106,one best film seen year gwyneth paltrow fan ex...,positive
27831,halfway realiz didnt care charact least howev ...,negative
28253,good part movi end know part movi light come l...,negative
37221,read sever comment felt add two cent well sorr...,positive
12236,good lord movi need new classif cover watch ab...,negative
36969,littl pictur succe mani big pictur fail littl ...,positive
42315,fan bad movi mst3k member mft3k must say ive s...,negative
3201,arent enough gaythem movi arent enough come mo...,negative
30303,like film lot actor great particularli potent ...,positive


The review text is now clean: no stop words, no punctuation, and every word reduced to its stem.

In [8]:
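from collections import Counter

def create_bow_representation(data, vocabulary=None):
    # Reviews are cleaned, space-separated strings, so split them into tokens first
    docs = [text.split() for text in data]
    # Build the vocabulary from the data unless one is passed in (e.g. the training columns)
    if vocabulary is None:
        vocabulary = sorted({token for tokens in docs for token in tokens})
    column_of = {token: j for j, token in enumerate(vocabulary)}

    counts = np.zeros((len(docs), len(column_of)), dtype=np.int32)
    for i, tokens in enumerate(docs):
        for token, count in Counter(tokens).items():
            j = column_of.get(token)
            if j is not None:  # ignore tokens outside the vocabulary
                counts[i, j] = count

    # Keep the input's index so that rows stay aligned with the labels
    index = data.index if isinstance(data, pd.Series) else range(len(docs))
    return pd.DataFrame(counts, index=index, columns=list(vocabulary))


def create_tfidf_representation(data, bow_matrix):
    # bow_matrix (the training BOW) supplies the vocabulary and the document frequencies
    n_docs = len(bow_matrix)
    doc_freq = (bow_matrix.values > 0).sum(axis=0)
    idf = np.log(n_docs / (1 + doc_freq))

    # Term frequency: token count divided by the length of the document
    counts = create_bow_representation(data, bow_matrix.columns)
    doc_lengths = np.array([max(len(text.split()), 1) for text in data]).reshape(-1, 1)
    tf = counts.values / doc_lengths

    return pd.DataFrame(tf * idf, index=counts.index, columns=counts.columns)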

In [9]:
class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = {}
        self.word_probs = {}
        self.vocabulary = None

    def fit(self, X, y, alpha=1):
        # Compare labels by position so that X and y do not need matching indices
        y = np.asarray(y)
        self.vocabulary = X.columns

        # Class priors P(c)
        labels, counts = np.unique(y, return_counts=True)
        self.class_probs = dict(zip(labels, counts / len(y)))

        # Word likelihoods P(w | c) with Laplace (add-alpha) smoothing
        for label in labels:
            word_counts = X.values[y == label].sum(axis=0) + alpha
            self.word_probs[label] = pd.Series(word_counts / word_counts.sum(), index=X.columns)

    def predict(self, X):
        # Align columns with the training vocabulary; words unseen in training are dropped
        features = X.reindex(columns=self.vocabulary, fill_value=0).values
        labels = list(self.class_probs)
        # Log-posterior (up to a constant) of every document under every class
        scores = np.column_stack([
            np.log(self.class_probs[label]) + features @ np.log(self.word_probs[label].values)
            for label in labels
        ])
        return np.array(labels)[scores.argmax(axis=1)]


In [10]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(imdb, test_size=0.2, random_state=42)
# Create BOW representation for training and testing sets
bow_train = create_bow_representation(train_df['review'])
bow_test = create_bow_representation(test_df['review'])
# Create TF-IDF representation for training and testing sets
tfidf_train = create_tfidf_representation(train_df['review'], bow_train)
tfidf_test = create_tfidf_representation(test_df['review'], bow_train)

# Train Naive-Bayes classifier using BOW representation
nb_classifier_bow = NaiveBayesClassifier()
nb_classifier_bow.fit(bow_train, train_df['sentiment'])

# Train Naive-Bayes classifier using TF-IDF representation
nb_classifier_tfidf = NaiveBayesClassifier()
nb_classifier_tfidf.fit(tfidf_train, train_df['sentiment'])

# Make predictions on the test set
predictions_bow = nb_classifier_bow.predict(bow_test)
predictions_tfidf = nb_classifier_tfidf.predict(tfidf_test)

# Evaluate the accuracy
accuracy_bow = np.mean(predictions_bow == test_df['sentiment'])
accuracy_tfidf = np.mean(predictions_tfidf == test_df['sentiment'])

print(f"Accuracy using BOW representation: {accuracy_bow:.2%}")
print(f"Accuracy using TF-IDF representation: {accuracy_tfidf:.2%}")


**Evaluation**

Use sklearn implementation of Naive-Bayes classifier and compare the results with your implementation.
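
A comparison along these lines could be sketched as follows. The four training texts and two test texts are made-up stand-ins for the cleaned reviews; `CountVectorizer`, `TfidfVectorizer`, and `MultinomialNB` are the sklearn classes assumed here, and other choices are possible.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the cleaned reviews and their labels (made up for illustration).
train_texts = ["good great film", "great act good", "bad bore film", "bore bad plot"]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts = ["good great", "bad bore"]
test_labels = ["positive", "negative"]

for name, vectorizer in [("BOW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on the training data only
    X_test = vectorizer.transform(test_texts)        # reuse that vocabulary for the test data
    model = MultinomialNB(alpha=1.0)                 # alpha=1.0 corresponds to Laplace smoothing
    model.fit(X_train, train_labels)
    accuracy = model.score(X_test, test_labels)
    print(f"sklearn accuracy using {name}: {accuracy:.2%}")
```

Note that `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its feature values will not match a from-scratch TF-IDF exactly; small differences in accuracy are therefore expected.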

In [2]:
### Your code here

**References**

1. Source code for ToktokTokenizer: https://www.nltk.org/_modules/nltk/tokenize/toktok.html
1. https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews

### Part 2

**N-gram Language Model**


You won't believe what happened ??? !

Is the word "next" on the tip of your tongue? Although there are other possibilities, that is undoubtedly the most likely one. Other options are "after", "after that", and "to them". Our intuition tells us that some sentence endings are more plausible than others, especially when we take into account the previous information, the location of the phrase, and the speaker or author.

N-gram language models formalize that intuition. An n-gram model assigns each candidate a probability by taking into account only the n-1 words that came before it. In our example, the probability of "next" might be 80\%, whereas the probabilities of "after", "after that", and "to them" might be 10\%, 5\%, and 5\%, respectively.
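
As a minimal sketch of this idea, the snippet below estimates next-word probabilities for a bigram model (n = 2) by counting. The three-sentence corpus and the helper name `next_word_probs` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration); <s> and </s> mark sentence boundaries.
corpus = [
    "<s> you will not believe what happened next </s>",
    "<s> guess what happened next </s>",
    "<s> nobody knows what happened after that </s>",
]

# Count how often each word follows each one-word context.
follower_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for previous, current in zip(tokens, tokens[1:]):
        follower_counts[previous][current] += 1


def next_word_probs(previous):
    """Maximum-likelihood estimate of P(word | previous) from the counts."""
    counts = follower_counts[previous]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


print(next_word_probs("happened"))
```

In this toy corpus "happened" is followed by "next" twice and by "after" once, so the estimates are 2/3 and 1/3. A context that never occurs yields an empty dictionary, which is one reason real models add smoothing.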

By leveraging these statistics, n-grams fuel the development of language models, which in turn contribute to an overall speech recognition system.

**Main task**:

In this part, you are tasked with coding an N-gram language model on the dataset (https://www.kaggle.com/datasets/nltkdata/europarl). Use the English portion of the data for the task.


Evaluate your model based on perplexity and generate sentences using n-grams with n={2,3,4,5}. 
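
One possible shape for such a model is sketched below, assuming add-k smoothing and a tiny made-up corpus in place of Europarl. The class `NGramModel` and the helper `ngrams` are hypothetical names, and out-of-vocabulary handling (for example an `<unk>` token) is left out for brevity.

```python
import math
import random
from collections import Counter


def ngrams(tokens, n):
    """Pad a tokenized sentence and return its n-grams as tuples."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class NGramModel:
    """Add-k smoothed n-gram model (illustrative sketch, not a reference solution)."""

    def __init__(self, n, k=1.0):
        self.n, self.k = n, k
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"</s>"}

    def fit(self, sentences):
        for tokens in sentences:
            self.vocab.update(tokens)
            for gram in ngrams(tokens, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def prob(self, context, word):
        # P(word | context) with add-k smoothing over the vocabulary
        numerator = self.ngram_counts[context + (word,)] + self.k
        denominator = self.context_counts[context] + self.k * len(self.vocab)
        return numerator / denominator

    def perplexity(self, sentences):
        # exp of the negative average log-probability per predicted token
        log_prob, n_tokens = 0.0, 0
        for tokens in sentences:
            for gram in ngrams(tokens, self.n):
                log_prob += math.log(self.prob(gram[:-1], gram[-1]))
                n_tokens += 1
        return math.exp(-log_prob / n_tokens)

    def generate(self, max_len=20, seed=None):
        # Sample one word at a time until </s> is drawn or max_len is reached
        rng = random.Random(seed)
        context = ("<s>",) * (self.n - 1)
        words = sorted(self.vocab)
        output = []
        for _ in range(max_len):
            weights = [self.prob(context, w) for w in words]
            word = rng.choices(words, weights=weights)[0]
            if word == "</s>":
                break
            output.append(word)
            context = (context + (word,))[1:]  # slide the context window
        return output


# Toy data standing in for tokenized Europarl sentences.
train = [s.split() for s in ["the session is resumed", "the session is closed", "the debate is closed"]]
held_out = [s.split() for s in ["the debate is resumed"]]

for n in (2, 3, 4, 5):
    lm = NGramModel(n)
    lm.fit(train)
    print(n, round(lm.perplexity(held_out), 2), " ".join(lm.generate(seed=0)))
```

Lower perplexity on held-out text means the model is less surprised by it. On a corpus this small, larger n mostly hits unseen contexts, so the smoothed estimates approach a uniform distribution; on the full dataset the comparison across n is more informative.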

*Reading Material: https://web.stanford.edu/~jurafsky/slp3/3.pdf*

In [None]:
### Your code here