# Programming Machine Learning Lab
# Exercise 08

**General Instructions:**

1. You need to submit the PDF as well as the filled notebook file.
1. Name your submissions by prefixing your matriculation number to the filename. For example, if your matriculation number is 12345, rename the files to **"12345_Exercise_08.xxx"**.
1. Complete all your tasks and then do a clean run before generating the final PDF. (_Clear All Outputs_ and _Run All_ commands in Jupyter notebook)

**Exercise-Specific Instructions:**

1. You are allowed to use only NumPy and Pandas (unless stated otherwise). You can use any library for visualizations.

### Part 1

**TF-IDF and BOW**

In this part, you will work with the IMDB movie review dataset to perform various natural language processing tasks. You need to get the dataset from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

1. Download and read the dataset (subset the data to only use 10,000 rows).
1. Perform tokenization on the review text.
1. Remove stop words from the tokenized text.
1. Use regular expressions to clean the text, removing any HTML tags, emails, and other unnecessary information.
1. Convert the cleaned data into a TF-IDF and BOW representation from scratch.

*Note: you can use NLTK for all sub-parts except the last*
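As a rough illustration of the last step, the sketch below builds BOW and TF-IDF vectors for a tiny hand-made corpus using only the standard library. The corpus and the helper names (`bow_vector`, `idf_weights`, `tfidf_vector`) are hypothetical. It assumes raw counts as term frequency and the unsmoothed idf(w) = log(N / df(w)); other conventions exist and may be expected in your solution.

```python
import math
from collections import Counter

# Toy corpus of already tokenized, stop-word-filtered documents (hypothetical data).
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot"],
]

# Fixed vocabulary order so every document maps to a vector of the same length.
vocab = sorted({word for doc in docs for word in doc})

def bow_vector(doc, vocab):
    """Raw term counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

def idf_weights(docs, vocab):
    """idf(w) = log(N / df(w)), where df(w) is the number of documents containing w."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / doc_freq[word]) for word in vocab}

def tfidf_vector(doc, vocab, idf):
    """Raw count times idf (an assumed tf variant; length-normalized tf is also common)."""
    counts = Counter(doc)
    return [counts[word] * idf[word] for word in vocab]

idf = idf_weights(docs, vocab)
bow = [bow_vector(doc, vocab) for doc in docs]
tfidf = [tfidf_vector(doc, vocab, idf) for doc in docs]
```

Words that occur in every document get an idf of zero under this definition, so they contribute nothing to the TF-IDF vector.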

**Main task**:
Using the BOW and TF-IDF representations, implement a Naive-Bayes classifier for the data from scratch. Use Laplace smoothing in the implementation. **Do not use sklearn for this part.**

[Reference Slide](https://www.ismll.uni-hildesheim.de/lehre/ml-16w/script/ml-09-A8-bayesian-networks.pdf)
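For orientation, one common way to write the Laplace-smoothed multinomial estimate and the resulting decision rule is given below. The notation is an assumption and may differ from the reference slides: $V$ is the vocabulary, $\operatorname{count}(w, c)$ is the number of occurrences of word $w$ in training documents of class $c$, $\alpha$ is the smoothing constant ($\alpha = 1$ for Laplace smoothing), and $w_1, \dots, w_n$ are the words of the document to classify.

$$
\hat{P}(w \mid c) = \frac{\operatorname{count}(w, c) + \alpha}{\sum_{w' \in V} \operatorname{count}(w', c) + \alpha \, |V|},
\qquad
\hat{c} = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{i=1}^{n} \log \hat{P}(w_i \mid c) \right]
$$

Summing logarithms instead of multiplying probabilities gives the same argmax while avoiding floating-point underflow on long documents.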

In [1]:
import pandas as pd
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

In [7]:

# Download the dataset from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
# Load the dataset and subset it to 10,000 rows
df = pd.read_csv('IMDB Dataset.csv', header=0, index_col=None)
df = df.sample(n=10000, random_state=42)

# Clean text using regular expressions
def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove emails
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove other unnecessary information
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

df['cleaned_text'] = df['review'].apply(clean_text)

# Tokenization
df['tokenized_text'] = df['cleaned_text'].apply(lambda x: word_tokenize(x.lower()))

# Remove stop words
stop_words = set(stopwords.words('english'))
df['filtered_text'] = df['tokenized_text'].apply(lambda x: [word for word in x if word.isalpha() and word not in stop_words])

# Prepare data for classification
train_data = df['filtered_text'].to_numpy()[:8000]
test_data = df['filtered_text'].to_numpy()[8000:]
train_target = df['sentiment'].to_numpy()[:8000]
test_target = df['sentiment'].to_numpy()[8000:]

In [26]:
# Naive-Bayes classifier using BOW representation with Laplace smoothing
def train_naive_bayes_bow(data, target):
    vocabulary = set([word for sublist in data for word in sublist])
    word_counts = defaultdict(int)
    class_counts = defaultdict(int)
    total_docs = len(data)
    
    for i in range(total_docs):
        current_class = target[i]
        class_counts[current_class] += 1
        
        for word in data[i]:
            word_counts[(word, current_class)] += 1
    
    return vocabulary, word_counts, class_counts, total_docs

def predict_naive_bayes_bow(vocabulary, word_counts, class_counts, total_docs, document, alpha=1):
    # Work with log-probabilities: summing logs avoids the floating-point
    # underflow that multiplying many small probabilities would cause.
    scores = {}
    vocab_size = len(vocabulary)

    # c is the class label in this iteration
    for c in class_counts:
        # Log prior: fraction of training documents that belong to class c
        scores[c] = np.log(class_counts[c] / total_docs)
        # Laplace-smoothed likelihood of each word. The denominator uses the
        # number of training documents in class c (not the class's token total).
        denominator = class_counts[c] + alpha * vocab_size
        for word in document:
            # .get avoids inserting unseen (word, class) keys into the defaultdict
            scores[c] += np.log((word_counts.get((word, c), 0) + alpha) / denominator)

    # Get the class with the highest log-score
    predicted_class = max(scores, key=scores.get)

    return predicted_class

# Train the Naive-Bayes classifier with BOW representation
vocabulary, word_counts_per_class, class_counts, total_docs = train_naive_bayes_bow(train_data, train_target)

# Test the Naive-Bayes classifier on the test set
correct_predictions = 0
for i in range(len(test_data)):
    prediction = predict_naive_bayes_bow(vocabulary, word_counts_per_class, class_counts, total_docs, test_data[i])
    if prediction == test_target[i]:
        correct_predictions += 1

accuracy = correct_predictions / len(test_data)
print(f"Accuracy: {accuracy}")

Accuracy: 0.834
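
The classifier above smooths with the number of documents per class in the denominator. For comparison, here is a minimal sketch of the more common multinomial variant, in which the smoothed likelihood is normalized by the total number of tokens in the class and scores are accumulated as log-probabilities. The function names and the toy data are hypothetical, and skipping words unseen in training is just one possible convention; no accuracy is claimed for this variant on the IMDB subset.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    vocab = {word for doc in docs for word in doc}
    doc_counts = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # word_counts[c][w] = occurrences of w in class c
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    # Total number of tokens observed in each class
    token_totals = {c: sum(word_counts[c].values()) for c in doc_counts}
    return vocab, doc_counts, word_counts, token_totals

def predict_multinomial_nb(model, doc, alpha=1.0):
    """Return the class with the highest log-posterior for a tokenized document."""
    vocab, doc_counts, word_counts, token_totals = model
    n_docs = sum(doc_counts.values())
    best_class, best_score = None, -math.inf
    for c in doc_counts:
        score = math.log(doc_counts[c] / n_docs)          # log prior
        denom = token_totals[c] + alpha * len(vocab)      # smoothed normalizer
        for word in doc:
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            score += math.log((word_counts[c][word] + alpha) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data standing in for the filtered review tokens.
toy_docs = [["great", "fun"], ["great", "cast"], ["boring", "slow"], ["boring", "plot"]]
toy_labels = ["positive", "positive", "negative", "negative"]
model = train_multinomial_nb(toy_docs, toy_labels)
```

With the notebook's variables, the same functions could be called as `train_multinomial_nb(train_data, train_target)` followed by `predict_multinomial_nb(model, test_data[i])`.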


In [None]:
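# Function to calculate TF-IDF representation from scratch
def calculate_tfidf(data):
    n_docs = len(data)
    unique_words = set(word for review in data for word in review)

    # Term frequency: raw count of each word in each document
    # (one dense vector of length n_docs per word, so memory grows as vocabulary x documents)
    word_count_in_each_doc = {word: np.zeros(n_docs) for word in unique_words}
    # Document frequency: number of documents that contain the word at least once
    doc_freq = defaultdict(int)

    for i, review in enumerate(data):
        for word in review:
            word_count_in_each_doc[word][i] += 1
        # Count each word once per document, in a single pass over the data
        for word in set(review):
            doc_freq[word] += 1

    # Inverse document frequency: idf(w) = log(N / df(w))
    idf = {word: np.log(n_docs / doc_freq[word]) for word in unique_words}

    # TF-IDF: term count scaled by the word's idf, one array of length n_docs per word
    tfidf = {word: word_count_in_each_doc[word] * idf[word] for word in unique_words}

    return tfidf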

# Calculate TF-IDF representation for training and testing data
tfidf_train = calculate_tfidf(train_data)
print(tfidf_train)
# tfidf_test = calculate_tfidf(test_data)

# # Function to implement Naive-Bayes classifier with Laplace smoothing
# def naive_bayes_classifier(tfidf_data, labels, alpha=1):
#     classes = set(labels)
#     class_probabilities = {c: sum(1 for label in labels if label == c) / len(labels) for c in classes}

#     word_probabilities = {word: {c: [] for c in classes} for word in tfidf_data.keys()}

#     for word, values in tfidf_data.items():
#         for c in classes:
#             word_probabilities[word][c] = (
#                 (sum(1 for i, label in enumerate(labels) if label == c and values[i] > 0) + alpha) /
#                 (sum(1 for i, label in enumerate(labels) if label == c) + alpha * len(tfidf_data))
#             )

#     return class_probabilities, word_probabilities

# # Train the Naive-Bayes classifier
# class_probabilities, word_probabilities = naive_bayes_classifier(tfidf_train, train_target)

# # Function to predict labels using the trained classifier
# def predict(tfidf_data, class_probabilities, word_probabilities):
#     predictions = []
#     classes = class_probabilities.keys()

#     for values in zip(*tfidf_data.values()):
#         scores = {c: np.log(class_probabilities[c]) for c in classes}

#         for word, value in zip(tfidf_data.keys(), values):
#             for c in classes:
#                 scores[c] += np.log(word_probabilities[word][c] if value > 0 else 1 - word_probabilities[word][c])

#         predictions.append(max(scores, key=scores.get))

#     return predictions

# # Make predictions on the test set
# predictions = predict(tfidf_test, class_probabilities, word_probabilities)

# # Evaluate the accuracy of the classifier
# accuracy = sum(1 for pred, true in zip(predictions, test_target) if pred == true) / len(test_target)
# print(f'Accuracy: {accuracy:.2%}')

**Evaluation**

Use the sklearn implementation of the Naive-Bayes classifier and compare its results with those of your implementation.

In [2]:
### Your code here

### Part 2

**N-gram Language Model**


You won't believe what happened ??? !

Is the word "next" on the tip of your tongue? Although there are other possibilities, that is undoubtedly the most likely one. Other options are "after", "after that", and "to them". Our intuition tells us that some sentence endings are more plausible than others, especially when we take into account the previous information, the location of the phrase, and the speaker or author.

N-gram language models simply formalize that intuition. An n-gram model assigns each candidate a probability by taking into account only the previous n-1 words. In our example, the probability of the word "next" may be 80\%, whereas the probabilities of "after", "after that", and "to them" might be 10\%, 5\%, and 5\%, respectively.

By leveraging these statistics, n-grams fuel the development of language models, which in turn contribute to an overall speech recognition system.
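
To make the idea concrete, the following sketch estimates bigram probabilities P(w | prev) by simple counting (maximum likelihood, no smoothing). The four-sentence corpus and the function name `bigram_probabilities` are made up for illustration; in the exercise the input would be tokenized Europarl sentences.

```python
from collections import Counter, defaultdict

def bigram_probabilities(sentences):
    """Maximum-likelihood estimates P(w | prev) = count(prev, w) / count(prev, *)."""
    following = defaultdict(Counter)
    for tokens in sentences:
        # Pair every token with the one that follows it
        for prev, word in zip(tokens, tokens[1:]):
            following[prev][word] += 1
    return {
        prev: {word: n / sum(counter.values()) for word, n in counter.items()}
        for prev, counter in following.items()
    }

# Hypothetical toy corpus, already split into tokens.
corpus = [
    "you will not believe what happened next".split(),
    "guess what happened next".split(),
    "nobody knows what happened after that".split(),
    "tell me what happened next".split(),
]
probs = bigram_probabilities(corpus)
```

In this toy corpus "happened" is followed by "next" three times out of four, so `probs["happened"]` holds 0.75 for "next" and 0.25 for "after".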

**Main task**:

In this part you are tasked with coding an N-gram language model on the dataset (https://www.kaggle.com/datasets/nltkdata/europarl). Use the English language for the task.


Evaluate your model based on perplexity and generate sentences using n-grams with n={2,3,4,5}. 

*Reading Material: https://web.stanford.edu/~jurafsky/slp3/3.pdf*
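
One possible shape for such a model is sketched below: counting n-grams with sentence padding, computing perplexity with add-alpha smoothing, and sampling a sentence from the counts. All names (`train_ngram`, `prob`, `perplexity`, `generate`, the `<s>` and `</s>` markers) and the toy sentences are assumptions for illustration. Handling of out-of-vocabulary words (for example an `<unk>` token) and better smoothing schemes are left out.

```python
import math
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # assumed sentence start/end markers

def pad(tokens, n):
    """Prefix n-1 start markers and append one end marker."""
    return [BOS] * (n - 1) + list(tokens) + [EOS]

def train_ngram(sentences, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    vocab = {EOS}
    for tokens in sentences:
        padded = pad(tokens, n)
        vocab.update(tokens)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context][padded[i]] += 1
    return counts, vocab

def prob(counts, vocab, context, word, alpha=1.0):
    """Add-alpha smoothed P(word | context)."""
    counter = counts.get(context, Counter())
    return (counter[word] + alpha) / (sum(counter.values()) + alpha * len(vocab))

def perplexity(counts, vocab, sentences, n):
    """exp of the average negative log-probability per predicted token."""
    log_prob, n_tokens = 0.0, 0
    for tokens in sentences:
        padded = pad(tokens, n)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            log_prob += math.log(prob(counts, vocab, context, padded[i]))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

def generate(counts, n, max_len=20, seed=0):
    """Sample words from the unsmoothed counts until EOS or max_len."""
    rng = random.Random(seed)
    context, out = (BOS,) * (n - 1), []
    while len(out) < max_len:
        counter = counts.get(context)
        if not counter:
            break  # context never seen in training
        words = list(counter)
        word = rng.choices(words, weights=[counter[w] for w in words])[0]
        if word == EOS:
            break
        out.append(word)
        context = (context + (word,))[1:]  # slide the context window
    return out

# Hypothetical toy data; replace with tokenized Europarl sentences.
toy_sentences = [["a", "b"], ["a", "b"]]
toy_counts, toy_vocab = train_ngram(toy_sentences, n=2)
toy_perplexity = perplexity(toy_counts, toy_vocab, toy_sentences, n=2)
toy_sentence = generate(toy_counts, n=2)
```

Perplexity should be computed on held-out sentences rather than on the training data; the toy call above reuses the training sentences only to keep the example short.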

In [None]:
### Your code here