## Movie Reviews Sentiment Analysis

This experiment uses a dataset of IMDB movie reviews: the text of each review and its label (positive or negative sentiment). The goal is to predict the sentiment of the reviews in the test dataset.

Predictions are evaluated with the area under the ROC curve (ROC AUC), which measures how well the predicted scores rank positive reviews above negative ones.
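
To make the metric concrete, here is a minimal pure-Python sketch of AUC computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the notebook; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumed): one positive (0.35) is ranked below one negative (0.4)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, AUC does not require the scores to be calibrated probabilities or to lie in any particular range.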

In [None]:
# Importing the necessary libraries

# Pandas for data handling
import pandas as pd

# BeautifulSoup for HTML parsing
from bs4 import BeautifulSoup

# Regular expressions for text preprocessing
import re

# NLTK for text processing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# CountVectorizer for creating feature vectors
from sklearn.feature_extraction.text import CountVectorizer

# Random Forest and Naive Bayes classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# ROC AUC score for model evaluation
from sklearn.metrics import roc_auc_score

# Train-test split for evaluation
from sklearn.model_selection import train_test_split

In [9]:
# quoting=3 is csv.QUOTE_NONE: double quotes inside reviews are read as ordinary characters
train = pd.read_csv('../root/input/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('../root/input/testData.tsv', header=0, delimiter='\t', quoting=3)

## Data Preprocessing

In [10]:
# Build the stopword set once; rebuilding it for every review is needlessly slow
STOPS = set(stopwords.words('english'))


def text_to_words(text):
    """
    Extract words from text.

    Args:
        text: The input text to process.

    Returns:
        A string containing the meaningful words extracted from the text.
    """
    # Removing HTML tags using BeautifulSoup
    text = BeautifulSoup(text, 'lxml').get_text()

    # Replacing non-alphabetic characters with spaces using regular expressions
    letters = re.sub(r'[^a-zA-Z]', ' ', text)

    # Converting text to lowercase and splitting into words
    words = letters.lower().split()

    # Removing stopwords using NLTK
    meaningful_words = [w for w in words if w not in STOPS]

    # Joining the meaningful words back into a string
    return ' '.join(meaningful_words)

In [11]:
print(text_to_words(train['review'][0]))

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working

In [12]:
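def clean(a):
    """
    Lazily yield the cleaned text of each review in a pandas Series.
    """
    # Iterate over the values directly: a[i] looks up by index label and
    # fails when the Series does not have a default RangeIndex
    for text in a:
        yield text_to_words(text)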

In [13]:
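# Bag-of-words features: drop words that occur in more than half of the
# reviews (max_df) and keep the 10,000 most frequent remaining words
vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_df=0.5,
                             max_features=10000)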
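
The effect of `max_df` and `max_features` can be sketched in plain Python. This is a rough approximation of the filtering logic only, not scikit-learn's implementation: the helpers `build_vocabulary` and `count_matrix` and the toy corpus are hypothetical, tokens are split on whitespace (scikit-learn's default tokenizer also drops single-character tokens), and ties between equally frequent words may be broken differently by the library.

```python
from collections import Counter


def build_vocabulary(docs, max_df=0.5, max_features=None):
    """Pick vocabulary terms roughly the way max_df / max_features do."""
    tokenized = [doc.split() for doc in docs]
    # Total occurrences of each term across the corpus
    term_freq = Counter(tok for toks in tokenized for tok in toks)
    # Number of documents each term appears in
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Drop terms that appear in more than max_df (a fraction) of the documents
    limit = max_df * len(docs)
    kept = [t for t in term_freq if doc_freq[t] <= limit]
    # Keep the most frequent remaining terms (ties broken alphabetically here)
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


def count_matrix(docs, vocabulary):
    """Return one row of term counts per document, columns ordered as vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows


# Toy corpus (assumed): "movie" occurs in every document, so max_df=0.5 removes it
docs = [
    "movie great great acting",
    "movie boring plot",
    "movie great plot twist",
    "movie terrible acting",
]
vocab = build_vocabulary(docs, max_df=0.5, max_features=3)
matrix = count_matrix(docs, vocab)
print(vocab)   # ['acting', 'great', 'plot']
print(matrix)  # [[1, 2, 0], [0, 0, 1], [0, 1, 1], [1, 0, 0]]
```

Words that appear in most reviews carry little information about sentiment, which is the usual motivation for a `max_df` cutoff on top of the stopword removal done earlier.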

In [14]:
# Convert train reviews to cleaned text
train_reviews = list(clean(train['review']))

# Convert train reviews to feature vectors
train_data_features = vectorizer.fit_transform(train_reviews)

# Convert test reviews to cleaned text
test_reviews = list(clean(test['review']))

# Convert test reviews to feature vectors
test_data_features = vectorizer.transform(test_reviews)

In [15]:
Xtrain, Xtest, ytrain, ytest = train_test_split(train_data_features, train['sentiment'], test_size=0.20, random_state=36)

## Modelling

### Multinomial Naive Bayes Classifier

In [16]:
# Initialize Multinomial Naive Bayes classifier
mnb = MultinomialNB(alpha=0.0001)

# Fit the classifier on training data and predict probabilities on validation data
y_val_m = mnb.fit(Xtrain, ytrain).predict_proba(Xtest)[:, 1]

# Fit the classifier on the entire training data and predict probabilities on test data
y_pred_m = mnb.fit(train_data_features, train['sentiment']).predict_proba(test_data_features)[:, 1]

# Evaluate the validation predictions with ROC AUC (a ranking metric, not accuracy)
roc_auc_score(ytest, y_val_m)

0.9238410855634166

### Random Forest Classifier

In [17]:
# Initialize Random Forest classifier
forest = RandomForestClassifier(n_estimators=300, criterion='gini')

# Fit the classifier on training data and predict probabilities on validation data
y_val_f = forest.fit(Xtrain, ytrain).predict_proba(Xtest)[:, 1]

# Fit the classifier on the entire training data and predict probabilities on test data
y_pred_f = forest.fit(train_data_features, train['sentiment']).predict_proba(test_data_features)[:, 1]

# Evaluate the validation predictions with ROC AUC (a ranking metric, not accuracy)
roc_auc_score(ytest, y_val_f)

0.9281989299312834

### Ensemble

In [18]:
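# Ensemble of both models: sum the predicted probabilities.
# AUC depends only on the ranking, so the sum scores the same as the mean.
roc_auc_score(ytest, y_val_m + y_val_f)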

0.9401949221736932
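
The following toy sketch illustrates why summing the two probability vectors is a valid ensemble under AUC: the summed and averaged scores rank the examples identically, and two models that make different mistakes can rank better together than either does alone. The helper `pairwise_auc` and all scores below are made up for illustration and are not taken from the notebook's models.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and scores from two models
labels = [0, 0, 0, 1, 1, 1]
model_a = [0.1, 0.6, 0.2, 0.5, 0.9, 0.8]  # ranks the second negative too high
model_b = [0.3, 0.1, 0.7, 0.6, 0.8, 0.9]  # ranks the third negative too high

summed = [a + b for a, b in zip(model_a, model_b)]
averaged = [s / 2 for s in summed]

auc_a = pairwise_auc(labels, model_a)
auc_b = pairwise_auc(labels, model_b)
auc_sum = pairwise_auc(labels, summed)
auc_mean = pairwise_auc(labels, averaged)

print(auc_a, auc_b)       # each model misorders one of the nine pairs
print(auc_sum, auc_mean)  # identical: halving the scores does not change their order
```

When the two models produce scores on very different scales, averaging ranks instead of raw probabilities is a common alternative; the notebook does not do this, and it is mentioned here only as an option.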

In [19]:
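# The submitted score is the sum of both models' probabilities (range 0 to 2);
# this is fine for AUC, which depends only on the ordering of the scores
output = pd.DataFrame(data={'id': test['id'], 'sentiment': y_pred_m + y_pred_f})

output.to_csv('cv-model.csv', index=False, quoting=3)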