<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

<img src="https://drive.google.com/uc?id=1fyePzwUvVF9OBK2q9-t9f8fojx6Sp0Es" style="width:250px; height:250px">

# Movie Reviews

In this and the following exercise, you will use the famous <a href="https://www.imdb.com/">IMDB movie dataset</a> as saved on <a href="https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews">Kaggle</a>. The Kaggle dataset contains 50K movie reviews, but working with all of them tends to be slow and can crash your kernel. The version provided today has been downsampled to 25K reviews.

Your task is to classify movie reviews as positive or negative. You will:

- Preprocess the reviews (remove punctuation and lower-case the text)
- Vectorize them as a Bag-of-Words
- Train and score a Naive Bayes model

Let's start by importing the data. We will use `cross_validate` today, so we won't worry too much about a test set (though for serious NLP you would want to have one!)
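
To make the idea of cross-validation concrete, here is a minimal sketch of how 5-fold splitting partitions row indices. It is plain Python for illustration only: `k_fold_indices` is a hypothetical helper, not scikit-learn's implementation, and scikit-learn's splitters differ in details such as shuffling and stratification.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row ends up in a test fold exactly once, which is why a separate hold-out set matters less for a quick exercise like this one.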

P.S. Look at the spelling of "results" in the image. Dall-E is getting better in 2023, but it is not yet perfect!

In [None]:
from nbta.utils import download_data
download_data(id='1DtuD7LtrfUfGSZioocYZvTzkpvLwgqwA')

In [None]:
import pandas as pd

data = pd.read_csv("raw_data/IMDB_dataset_25k.csv")

# This data is too large for most of your systems, so we will take only 10% of the dataset:
data = data.sample(frac=0.1, random_state=42).reset_index(drop=True)

data.head()

The dataset is made up of positive and negative movie reviews.

## Preprocessing

Create a new column in `data` called `clean_text`. It will contain a cleaned version of `review`: remove punctuation, lower-case the text, remove digits, tokenize it, remove English stop words, and lemmatize the remaining tokens. Finally, join the tokens back together so that each cleaned review is stored as a single string.
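
The character-level steps above can be sketched on a toy sentence with the standard library alone. The tiny stop-word set is a made-up stand-in for NLTK's English list, whitespace splitting is a simplification of a real tokenizer, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Hypothetical mini stop-word list, standing in for NLTK's English stop words
TOY_STOP_WORDS = {"the", "was", "a", "i", "it", "and"}

def toy_clean(text):
    # Drop punctuation and digits in one pass, then lower-case
    table = str.maketrans("", "", string.punctuation + string.digits)
    text = text.translate(table).lower()
    # Whitespace tokenization is a simplification of a real tokenizer
    tokens = text.split()
    # Keep non-stop-words and re-join so the result is still one string
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

cleaned = toy_clean("I watched it 3 times... The ending was GREAT!")
print(cleaned)  # watched times ending great
```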

In [None]:
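# First, let's make sure the NLTK resources used below are downloaded on your system:

from nltk import download

download('stopwords')  # English stop-word list
download('punkt')      # tokenizer models used by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
download('wordnet')    # lexical database used by WordNetLemmatizer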

In [None]:
# To reduce the memory footprint
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(dtype=np.int8)

In [None]:
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Build these once instead of on every call to prepare_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def prepare_text(text):
    # Remove punctuation and lower-case the text
    text = ''.join(char for char in text if char not in string.punctuation).lower()
    # Remove digits
    text = ''.join(char for char in text if not char.isdigit())
    # Tokenize, drop English stop words, lemmatize, and re-join into one string
    word_tokens = word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(w) for w in word_tokens if w not in stop_words)

In [None]:
data['clean_text'] = data.review.apply(prepare_text)
data

### ☑️ Test your code

Note: this only checks the first cleaned review in `clean_text`. It does not check the quality of your code or the completeness of your answer.


In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('check_data',
                         sentence = data.clean_text[0],
)

result.write()
print(result.check())

## Bag-of-Words modelling

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts. Save its mean test accuracy as a variable named `bow_accuracy`. <details><summary>hint</summary>Use a `CountVectorizer`!</details>
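
As a rough illustration of what a Bag-of-Words representation holds, the sketch below counts words by hand with `collections.Counter`. It is not how scikit-learn builds its matrix (which is sparse and uses its own tokenization), and the two toy reviews are invented.

```python
from collections import Counter

docs = ["great movie great cast", "boring movie"]  # invented toy reviews

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    counts = Counter(doc.split())
    # One column per vocabulary word; word order in the review is discarded
    return [counts[word] for word in vocabulary]

matrix = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['boring', 'cast', 'great', 'movie']
for row in matrix:
    print(row)
```

Each row is one review and each column one word, so the model only sees how often a word occurs, not where.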

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X_v = vectorizer.fit_transform(data.clean_text)

In [None]:
X_v.toarray()

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Boolean target: True for positive reviews, False for negative ones
y = data.sentiment == 'positive'

cross_val = cross_validate(MultinomialNB(), X_v, y, cv=5)

In [None]:
bow_accuracy = cross_val['test_score'].mean()
bow_accuracy

## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts. You will again use the `CountVectorizer()` class, but you need to choose the right parameters. Save the mean test accuracy of your cross-validation as a variable named `ng_accuracy`.
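
For intuition, a 2-gram representation counts pairs of consecutive tokens instead of single tokens. The sketch below builds them by hand; `ngrams` is a hypothetical helper written for this illustration, and the token list is invented.

```python
def ngrams(tokens, n=2):
    """Return the n-grams of a token list, each joined by a space, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "plot twist ruined ending".split()
unigrams = ngrams(tokens, n=1)
bigrams = ngrams(tokens, n=2)
print(unigrams)  # ['plot', 'twist', 'ruined', 'ending']
print(bigrams)   # ['plot twist', 'twist ruined', 'ruined ending']
```

A review with N tokens yields N - 1 bigrams, but the number of distinct bigrams across a corpus is usually much larger than the number of distinct words, which is where the extra computational cost comes from.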

In [None]:
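from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) keeps only 2-grams (pairs of consecutive tokens)
vectorizer = CountVectorizer(ngram_range=(2, 2))

X_v = vectorizer.fit_transform(data.clean_text)

# Preview only the first rows: converting the full 2-gram matrix to a dense array can exhaust memory
pd.DataFrame(X_v[:5].toarray(), columns=vectorizer.get_feature_names_out())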

In [None]:
cross_val = cross_validate(MultinomialNB(), X_v, y, cv=5)

In [None]:
ng_accuracy = cross_val['test_score'].mean()
ng_accuracy

## Assessing your model

Which model performed better, and why do you think that is?

<details><summary>Solution</summary>We would normally expect the N-gram model to outperform the Bag-of-Words model by a small margin. However, because of our reduced dataset, this is not really the case here. N-grams are normally better (though more computationally costly) because they capture the context of the words around a single token. This gives more meaning to words that could otherwise have a different meaning depending on the context. You will see this further in deep learning, when you learn about the attention mechanism for `Transformers`, the de facto architecture for NLP today.</details>

### ☑️ Test your code

Note: this only checks the `bow_accuracy` and `ng_accuracy` values you computed. It does not check the quality of your code or the completeness of your answer.


In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('model_performance',
                         bow_model = bow_accuracy,
                         ng_model = ng_accuracy
)

result.write()
print(result.check())

# Saving your data

To save time, we will reuse the preprocessed data in the next exercise. Therefore, save the `data` dataframe as a `csv` file on the path `../02-Tuning-for-NLP/raw_data` as `processed_data.csv`.
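
pandas' `to_csv` does not create missing folders, so the save fails if the target directory does not exist yet. A minimal sketch of one way to guard against that, assuming you are happy to create the folder on the fly; `ensure_parent_dir` is a hypothetical helper, demonstrated here in a temporary directory:

```python
from pathlib import Path
import tempfile

def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Demonstrated in a temporary directory so nothing is written to your project
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "02-Tuning-for-NLP" / "raw_data" / "processed_data.csv")
    parent_exists = target.parent.is_dir()
    print(parent_exists)
```

In the notebook, the same helper could wrap the real path, for example `data.to_csv(ensure_parent_dir('../02-Tuning-for-NLP/raw_data/processed_data.csv'), index=False)`.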

In [None]:
data.to_csv('../02-Tuning-for-NLP/raw_data/processed_data.csv', index=False)

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.