# Movie Reviews

In [2]:
import pandas as pd

data = pd.read_pickle("reviews")

data.head()

| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


The dataset is made up of movie reviews, each labeled as positive (`pos`) or negative (`neg`) in the `target` column.
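
Before modelling, it can help to check how the two classes are balanced. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up rows) that only mirrors the `target` and `reviews` column names; it is not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (made-up rows, same column names).
toy = pd.DataFrame({
    "target": ["neg", "neg", "pos", "pos", "pos"],
    "reviews": ["bad plot", "dull acting", "great film", "loved it", "a fine epic"],
})

# Count how many reviews fall in each class.
class_counts = toy["target"].value_counts()
print(class_counts)
```

Running the same `value_counts()` call on `data["target"]` would show whether accuracy is a fair metric here or whether one class dominates.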

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [4]:
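import string

def remove_punctuation(df, column):
    # Strip every character listed in string.punctuation from the column.
    # Note: this overwrites df[column] in place before returning it.
    for punctuation in string.punctuation:
        df[column] = df[column].apply(lambda x: x.replace(punctuation, ''))
    return df[column]

def lower(df, column):
    # Lowercase the column (also modified in place).
    df[column] = df[column].apply(lambda x: x.lower())
    return df[column]

# "reviews" itself loses its punctuation here, which is why both
# columns look the same in the output.
data['clean_text'] = remove_punctuation(data, "reviews")
data['clean_text'] = lower(data, "clean_text")
data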

| | target | reviews | clean_text |
|---|---|---|---|
| 0 | neg | plot two teen couples go to a church party d... | plot two teen couples go to a church party d... |
| 1 | neg | the happy bastards quick movie review \ndamn t... | the happy bastards quick movie review \ndamn t... |
| 2 | neg | it is movies like these that make a jaded movi... | it is movies like these that make a jaded movi... |
| 3 | neg | quest for camelot is warner bros first fe... | quest for camelot is warner bros first fe... |
| 4 | neg | synopsis a mentally unstable man undergoing p... | synopsis a mentally unstable man undergoing p... |
| ... | ... | ... | ... |
| 1995 | pos | wow what a movie \nits everything a movie ca... | wow what a movie \nits everything a movie ca... |
| 1996 | pos | richard gere can be a commanding actor but he... | richard gere can be a commanding actor but he... |
| 1997 | pos | glorystarring matthew broderick denzel washin... | glorystarring matthew broderick denzel washin... |
| 1998 | pos | steven spielbergs second epic film on world wa... | steven spielbergs second epic film on world wa... |
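
The helpers in this section overwrite the column they are given, so the raw `reviews` text is lost. If keeping the original is useful, one possible alternative is a function that returns a cleaned copy. The sketch below is an assumption, not the exercise's reference solution: `clean_text` (the function) and the `toy` DataFrame are hypothetical names introduced for illustration.

```python
import string

import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Return a lowercased, punctuation-free copy of a text column."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    # The .str accessor returns new Series objects, so the input is untouched.
    return series.str.translate(table).str.lower()


# Hypothetical two-row example (made-up text).
toy = pd.DataFrame({"reviews": ["Plot : two teen couples , go !", "Damn... what a MOVIE ?"]})
toy["clean_text"] = clean_text(toy["reviews"])
print(toy)
```

Removing punctuation this way leaves the surrounding spaces in place, so runs of whitespace can remain in the cleaned text, as they do in the notebook's own output.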


## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate

# Convert texts to Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data.clean_text)
y = data.target

# Define the model
model = MultinomialNB()

# Evaluate the model using cross-validation
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])

# Print the results
print("Accuracy: ", scores["test_accuracy"].mean())
print("F1-Score (Macro-Average): ", scores["test_f1_macro"].mean())


Accuracy:  0.8145
F1-Score (Macro-Average):  0.8144339958520785
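
In the cell above, the vectorizer is fitted on the whole dataset before cross-validation, so its vocabulary is learned from the validation folds as well. A common way to keep each fold fully held out is to wrap the vectorizer and the model in a scikit-learn pipeline, so both are refitted on the training part of every split. The sketch below shows this on a hypothetical toy corpus (`texts` and `labels` are made up); its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four negative and four positive reviews.
texts = [
    "boring plot and bad acting", "bad movie with a dull script",
    "terrible pacing and boring scenes", "dull story and bad ending",
    "great acting and a wonderful story", "wonderful film with great scenes",
    "a great script and lovely pacing", "lovely movie with a wonderful ending",
]
labels = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]

# The vectorizer is now part of the estimator, so it is fitted per fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

toy_scores = cross_validate(pipeline, texts, labels, cv=2, scoring=["accuracy", "f1_macro"])
print("Accuracy: ", toy_scores["test_accuracy"].mean())
```

With the notebook's data, the equivalent call would pass the text column and `data.target` instead of the toy lists.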


## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [7]:
# Convert texts to 2-gram Bag-of-Words representation
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(data.clean_text)
y = data.target

# Define the model
model = MultinomialNB()

# Evaluate the model using cross-validation
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])

# Print the results
print("Accuracy: ", scores["test_accuracy"].mean())
print("F1-Score (Macro-Average): ", scores["test_f1_macro"].mean())

Accuracy:  0.8365
F1-Score (Macro-Average):  0.8364234703972191

