The following is based on:
- https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews
- https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/7047916-apply-classifier-models-for-sentiment-analysis

# 5. Sentiment analysis modeling

Sentiment analysis is one of the most common text classification applications of vectorization.

The goal of sentiment analysis is to classify an opinion as positive or negative. Sentiment analysis is mostly used to infer customers' views on a product or a service on social networks. You can also apply it to political discourse, digital sociology, or information extraction.

Consider the following statements, taken from movie reviews and social media:
- "This movie was amazing." The positive word is easy to identify, right?
- Now, what about this one? "That movie was amazingly bad." Slightly more complex!
- And what about this social media comment: "It's not unattractive!" Even more complex!

Negations, idioms, slang, and sarcasm all complicate sentiment analysis.
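
To illustrate why such phrases are hard for word-count features, here is a minimal sketch that compares single-word features with word-pair features using scikit-learn's `CountVectorizer`. It only looks at two of the example sentences above; the variable names are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

examples = ["That movie was amazingly bad.", "It's not unattractive!"]

# Unigrams only: every word is counted in isolation
unigram_cv = CountVectorizer(ngram_range=(1, 1)).fit(examples)
# Unigrams and bigrams: pairs of adjacent words become features too
bigram_cv = CountVectorizer(ngram_range=(1, 2)).fit(examples)

unigram_vocab = set(unigram_cv.vocabulary_)
bigram_vocab = set(bigram_cv.vocabulary_)

# Features that only exist once word pairs are taken into account
print(sorted(bigram_vocab - unigram_vocab))
```

With unigrams alone, "amazingly" and "bad" are separate features, so a model cannot tell that one modifies the other. Bigrams keep pairs such as "amazingly bad" and "not unattractive" together, at the cost of a much larger vocabulary.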

You will practice sentiment analysis with the IMDB dataset, which contains 50K movie reviews.
You will train a classification model on 40K of these reviews, then submit your predictions for the remaining 10K reviews to the Kaggle competition (https://www.kaggle.com/t/46868a736945459381c048c676afc0ce), where your submission will be ranked.

## 5.1. Sample submission

In [33]:
import pandas as pd

In [35]:
# Importing the training data
imdb_data = pd.read_csv('data/imdb.csv', index_col=0)
print(imdb_data.shape)
imdb_data.head(10)

(40000, 2)


id,review,sentiment
1,One of the other reviewers has mentioned that ...,positive
2,A wonderful little production. <br /><br />The...,positive
3,I thought this was a wonderful way to spend ti...,positive
4,Basically there's a family where a little boy ...,negative
5,"Petter Mattei's ""Love in the Time of Money"" is...",positive
6,"Probably my all-time favorite movie, a story o...",positive
7,I sure would like to see a resurrection of a u...,positive
8,"This show was an amazing, fresh & innovative i...",negative
9,Encouraged by the positive comments about this...,negative
10,If you like original gut wrenching laughter yo...,positive
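
Some reviews in this preview contain HTML line-break tags such as `<br />`, which would otherwise end up as tokens in the vocabulary. The sketch below shows one possible cleaning step using only the standard library; `clean_review` is a hypothetical helper, not part of the original notebook.

```python
import re

def clean_review(text):
    """Replace HTML line breaks with spaces and collapse repeated whitespace."""
    # Matches <br>, <br/> and <br /> regardless of case
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", text).strip()

sample = "A wonderful little production. <br /><br />The"
cleaned = clean_review(sample)
print(cleaned)
```

Assuming the reviews live in `imdb_data['review']` as above, the helper could be applied with `imdb_data['review'].apply(clean_review)` before vectorization.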


In [36]:
# Summary of the dataset
imdb_data.describe()

,review,sentiment
count,40000,40000
unique,39734,2
top,Loved today's show!!! It was a variety and not...,negative
freq,5,20007


In [37]:
# Sentiment count
imdb_data['sentiment'].value_counts()

negative    20007
positive    19993
Name: sentiment, dtype: int64

In [38]:
# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
train_reviews, test_reviews, train_sentiments, test_sentiments = train_test_split(imdb_data['review'], imdb_data['sentiment'], test_size=0.2, random_state=42)
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)

(32000,) (32000,)
(8000,) (8000,)


In [39]:
# Count vectorizer for bag of words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = 'english')
# Transformed train reviews
cv_train_reviews = cv.fit_transform(train_reviews)
# Transformed test reviews
cv_test_reviews = cv.transform(test_reviews)

print('BOW_cv_train:', cv_train_reviews.shape)
print('BOW_cv_test:', cv_test_reviews.shape)

BOW_cv_train: (32000, 84446)
BOW_cv_test: (8000, 84446)
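
One side effect of `stop_words='english'` is worth keeping in mind: scikit-learn's built-in English stop word list contains negation words such as "not", so they are dropped before the model ever sees them. The following sketch checks this on one of the example sentences from the introduction; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# "not" is part of scikit-learn's built-in English stop word list
print("not" in ENGLISH_STOP_WORDS)

# Vectorizing a negated sentence with stop word removal enabled
stop_cv = CountVectorizer(stop_words="english").fit(["It's not unattractive!"])
kept_words = sorted(stop_cv.vocabulary_)
print(kept_words)
```

Only "unattractive" survives here, so the negation is lost. Whether removing stop words helps or hurts on the IMDB reviews is an empirical question that you can test by comparing both settings.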


In [40]:
# Training a logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
# Fitting the model for Bag of words
lr_bow = lr.fit(cv_train_reviews,train_sentiments)
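
To get a feel for what the fitted model has learned, you can look at the weight that logistic regression assigns to each vocabulary word. The sketch below does this on a tiny made-up dataset (the `toy_*` names and sentences are assumptions for illustration, not the IMDB data). The same idea should carry over to `lr` and `cv` above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hand-written dataset, for illustration only
toy_reviews = ["great great movie", "awful awful movie", "great acting", "awful plot"]
toy_labels = ["positive", "negative", "positive", "negative"]

toy_cv = CountVectorizer()
toy_features = toy_cv.fit_transform(toy_reviews)
toy_lr = LogisticRegression(max_iter=500).fit(toy_features, toy_labels)

# For a binary problem, coef_[0] holds one weight per vocabulary word;
# positive weights push the prediction towards classes_[1]
weights = {word: toy_lr.coef_[0][idx] for word, idx in toy_cv.vocabulary_.items()}

print(toy_lr.classes_)
for word, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word:>8}  {weight:+.3f}")
```

Words with large positive weights push a review towards the second class in `classes_`, and words with large negative weights push it towards the first.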

In [41]:
# Predicting test-set sentiments from the bag-of-words features
lr_bow_predict = lr.predict(cv_test_reviews)
print(lr_bow_predict)
# Accuracy score for bag of words
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
lr_bow_score = accuracy_score(test_sentiments,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)
# Classification report for bag of words
# (scikit-learn sorts labels alphabetically, so 'negative' comes before 'positive')
lr_bow_report = classification_report(test_sentiments, lr_bow_predict, target_names=['Negative','Positive'])
print(lr_bow_report)

['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']
lr_bow_score : 0.88475
              precision    recall  f1-score   support

    Negative       0.89      0.88      0.88      4002
    Positive       0.88      0.89      0.88      3998

    accuracy                           0.88      8000
   macro avg       0.88      0.88      0.88      8000
weighted avg       0.88      0.88      0.88      8000
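
When reading per-class metrics, keep in mind that scikit-learn orders string labels alphabetically, so 'negative' comes before 'positive' and any `target_names` list needs to follow that order. A confusion matrix with explicit `labels` makes the ordering unambiguous. The sketch below uses a few made-up labels rather than the real predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true = ["negative", "negative", "positive", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive", "negative"]

# Passing labels explicitly fixes the row/column order
label_order = ["negative", "positive"]
cm = confusion_matrix(y_true, y_pred, labels=label_order)

# Rows are true classes, columns are predicted classes
print(label_order)
print(cm)
```

On the real data, the equivalent call would be `confusion_matrix(test_sentiments, lr_bow_predict, labels=['negative', 'positive'])`.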



In [46]:
# Predicting on evaluation set for Kaggle competition
imdb_eval_data = pd.read_csv("data/imdb_eval.csv", index_col=0)
print(imdb_eval_data.shape)
imdb_eval_data['sentiment'] = lr.predict(cv.transform(imdb_eval_data['review']))
imdb_eval_data

(10000, 1)


id,review,sentiment
40001,First off I want to say that I lean liberal on...,negative
40002,I was excited to see a sitcom that would hopef...,negative
40003,When you look at the cover and read stuff abou...,positive
40004,"Like many others, I counted on the appearance ...",negative
40005,"This movie was on t.v the other day, and I did...",negative
...,...,...
49996,I thought this movie did a down right good job...,positive
49997,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49998,I am a Catholic taught in parochial elementary...,positive
49999,I'm going to have to disagree with the previou...,positive


In [47]:
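# Saving for Kaggle submission (create the output folder if it does not exist yet)
import os
os.makedirs('data/submission', exist_ok=True)
imdb_eval_data[['sentiment']].to_csv('data/submission/submission1_cv_logreg.csv')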

## 5.2. Your turn!

Try to beat this first submission by experimenting with the vectorization method, the classification model, or the text preprocessing. Be creative!
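
As one possible starting point, the sketch below swaps the raw word counts for TF-IDF weights with unigrams and bigrams, and wraps the vectorizer and classifier in a scikit-learn `Pipeline`. It runs on a tiny made-up dataset, and the chosen parameters are assumptions for illustration, so there is no guarantee that it beats the baseline on the IMDB reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hand-written dataset, for illustration only
toy_reviews = [
    "This movie was amazing",
    "A wonderful little production",
    "That movie was amazingly bad",
    "Bad plot, bad dialogue, bad acting",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# The pipeline applies the same vectorization at fit and predict time
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=500, random_state=42)),
])
model.fit(toy_reviews, toy_labels)

predictions = model.predict(["What a wonderful movie", "Bad acting"])
print(predictions)
```

With the real data, you would fit on `train_reviews` and `train_sentiments`, compare the accuracy on `test_reviews` with the baseline, and only then predict on the evaluation set.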