The following is based on:
- https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews
- https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/7047916-apply-classifier-models-for-sentiment-analysis

# 5. Sentiment analysis modeling

Sentiment analysis is one of the most common text classification applications of vectorization.

The goal of sentiment analysis is to classify an opinion as positive or negative. Sentiment analysis is mostly used to infer customers' views on a product or a service on social networks. You can also apply it to political discourse, digital sociology, or information extraction.

Consider the following statements:
- "This movie was amazing." Rather easy to identify the positive word, right? 
- Now, what about this one? "That movie was amazingly bad." Slightly more complex! 
- Now what about this social media comment: "It's not unattractive!" Even more complex!

Negations, idioms, slang, and sarcasm all complicate sentiment analysis.
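To make the difficulty with negation concrete, here is a minimal sketch using scikit-learn's `CountVectorizer`. The two sentences are made up for illustration. They contain exactly the same words but carry opposite sentiment, so a unigram bag of words gives them identical vectors, while adding bigrams separates them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words but opposite sentiment
sentences = [
    "the movie was good not bad",
    "the movie was bad not good",
]

# Unigram bag of words: word order is lost, so both rows are identical
unigram_counts = CountVectorizer().fit_transform(sentences).toarray()
same_with_unigrams = bool((unigram_counts[0] == unigram_counts[1]).all())

# Unigrams + bigrams: pairs such as "not bad" and "not good" tell them apart
bigram_counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(sentences).toarray()
same_with_bigrams = bool((bigram_counts[0] == bigram_counts[1]).all())

print("Identical with unigrams:", same_with_unigrams)
print("Identical with unigrams + bigrams:", same_with_bigrams)
```

Bigrams only capture very local context, so this is a partial remedy at best: sarcasm and idioms remain hard for any count-based representation.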

## 5.1. A first submission

In [3]:
import pandas as pd

In [27]:
# Importing the training data
imdb_data = pd.read_csv('data/imdb.csv')
print(imdb_data.shape)
imdb_data.head(10)

(40000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,Basically there's a family where a little boy ...,negative
3,"Petter Mattei's ""Love in the Time of Money"" is...",positive
4,"Probably my all-time favorite movie, a story o...",positive
5,I sure would like to see a resurrection of a u...,positive
6,"This show was an amazing, fresh & innovative i...",negative
7,Encouraged by the positive comments about this...,negative
8,If you like original gut wrenching laughter yo...,positive
9,The cast played Shakespeare.<br /><br />Shakes...,negative


In [28]:
# Summary of the dataset
imdb_data.describe()

Unnamed: 0,review,sentiment
count,40000,40000
unique,39727,2
top,"Hilarious, clean, light-hearted, and quote-wor...",positive
freq,4,20044


In [29]:
# Sentiment count
imdb_data['sentiment'].value_counts()

positive    20044
negative    19956
Name: sentiment, dtype: int64
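The two classes are almost perfectly balanced, which makes accuracy a reasonable metric here. As a quick sanity check, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
class_counts = {"positive": 20044, "negative": 19956}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(f"Always predicting '{majority_label}' gives accuracy {baseline_accuracy:.4f}")
```

Any model worth submitting should score well above this baseline of roughly 0.50 on the full dataset.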

In [31]:
# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
train_reviews, test_reviews, train_sentiments, test_sentiments = train_test_split(imdb_data['review'], imdb_data['sentiment'], test_size=0.2, random_state=42)
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)
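The split above is purely random. If you want both splits to keep the same class ratio, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on made-up toy data (not the IMDB reviews), assuming a 50/50 class balance.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 short "reviews", half positive and half negative
toy_reviews = [f"review {i}" for i in range(100)]
toy_sentiments = ["positive"] * 50 + ["negative"] * 50

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_reviews,
    toy_sentiments,
    test_size=0.2,
    random_state=42,
    stratify=toy_sentiments,  # keep the class ratio identical in both splits
)

print(len(toy_train_x), len(toy_test_x))
print(Counter(toy_test_y))
```

With a dataset as balanced as this one, stratification probably changes little, but it costs nothing and removes one source of variance between runs.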

In [37]:
# Count vectorizer for bag of words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = 'english')
# Transformed train reviews
cv_train_reviews = cv.fit_transform(train_reviews)
# Transformed test reviews
cv_test_reviews = cv.transform(test_reviews)

print('BOW_cv_train:', cv_train_reviews.shape)
print('BOW_cv_test:', cv_test_reviews.shape)

BOW_cv_train: (32000, 84801)
BOW_cv_test: (8000, 84801)
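Both matrices have the same number of columns because the vocabulary is learned on the training reviews only (`fit_transform`) and then reused on the test reviews (`transform`). Here is a small sketch of that behavior on made-up sentences; `toy_cv` and the example texts are illustrative names, not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
train_docs = ["This movie was amazing", "That movie was amazingly bad"]
test_docs = ["An amazing, unforgettable movie"]

toy_cv = CountVectorizer(stop_words="english")
train_matrix = toy_cv.fit_transform(train_docs)  # learns the vocabulary
test_matrix = toy_cv.transform(test_docs)        # reuses it, learns nothing new

print(sorted(toy_cv.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print("unforgettable" in toy_cv.vocabulary_)  # word unseen during fit is dropped
```

Note that scikit-learn's built-in English stop word list includes "not", so `stop_words='english'` discards negations before the classifier ever sees them.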


In [57]:
# Training a logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
# Fitting the model for Bag of words
lr_bow = lr.fit(cv_train_reviews,train_sentiments)
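One advantage of logistic regression on a bag of words is that each word gets a single weight you can inspect. The sketch below shows the idea on a made-up six-document corpus, so the exact numbers say nothing about the IMDB model. All names prefixed with `toy_` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up corpus for illustration
toy_docs = ["great movie", "great acting", "great fun",
            "awful movie", "awful acting", "awful plot"]
toy_labels = ["positive"] * 3 + ["negative"] * 3

toy_cv = CountVectorizer()
toy_features = toy_cv.fit_transform(toy_docs)

toy_lr = LogisticRegression(penalty="l2", C=1, max_iter=500, random_state=42)
toy_lr.fit(toy_features, toy_labels)

# classes_ is sorted, so a positive weight pushes toward classes_[1]
print(toy_lr.classes_)

# Map each word to its learned weight and list them from most negative to most positive
word_weights = {word: toy_lr.coef_[0][index] for word, index in toy_cv.vocabulary_.items()}
for word, weight in sorted(word_weights.items(), key=lambda item: item[1]):
    print(f"{word:>8s} {weight:+.3f}")
```

The same dictionary comprehension should work with `cv` and `lr` from the cells above, although with more than 80,000 words you would only want to print the extremes.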

In [41]:
# Predicting sentiments with the bag-of-words model
lr_bow_predict = lr.predict(cv_test_reviews)
print(lr_bow_predict)
# Accuracy score for bag of words
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
lr_bow_score = accuracy_score(test_sentiments,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)
# Classification report for bag of words
# Labels are sorted alphabetically ('negative', 'positive'), so target_names must follow that order
lr_bow_report = classification_report(test_sentiments, lr_bow_predict, target_names=['Negative','Positive'])
print(lr_bow_report)

['negative' 'negative' 'negative' ... 'positive' 'positive' 'positive']
lr_bow_score : 0.8775
              precision    recall  f1-score   support

    Negative       0.88      0.87      0.88      4039
    Positive       0.87      0.88      0.88      3961

    accuracy                           0.88      8000
   macro avg       0.88      0.88      0.88      8000
weighted avg       0.88      0.88      0.88      8000
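To see where the precision and recall figures in such a report come from, the following sketch builds a confusion matrix on made-up labels and recomputes both metrics by hand. The label lists are hypothetical and unrelated to the IMDB predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical), not taken from the IMDB data
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "negative",
          "positive", "negative", "negative", "positive"]

labels = ["negative", "positive"]  # alphabetical order, as scikit-learn sorts them
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows are true classes, columns are predicted classes

# Precision and recall for the 'positive' class, computed by hand
true_pos = cm[1, 1]
precision_by_hand = true_pos / cm[:, 1].sum()  # correct among predicted positives
recall_by_hand = true_pos / cm[1, :].sum()     # correct among actual positives

precision_sklearn = precision_score(y_true, y_pred, pos_label="positive")
recall_sklearn = recall_score(y_true, y_pred, pos_label="positive")
print(precision_by_hand, precision_sklearn)
print(recall_by_hand, recall_sklearn)
```

Applying `confusion_matrix(test_sentiments, lr_bow_predict)` to the real predictions would show whether the model errs more often on positive or on negative reviews.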



In [54]:
# Predicting on evaluation set for Kaggle competition
imdb_eval_data = pd.read_csv("data/imdb_eval.csv")
print(imdb_eval_data.shape)
imdb_eval_data['sentiment'] = lr.predict(cv.transform(imdb_eval_data['review']))
imdb_eval_data

(8003, 1)


Unnamed: 0,review,sentiment
0,With No Dead Heroes you get stupid lines like ...,negative
1,I thought maybe... maybe this could be good. A...,negative
2,An elite American military team which of cours...,negative
3,Ridiculous horror film about a wealthy man (Jo...,negative
4,"Well, if you are one of those Katana's film-nu...",negative
...,...,...
7998,If you like the excitement of a good submarine...,positive
7999,"I sat down to watch ""Midnight Cowboy"" thinking...",positive
8000,I voted this a 10 out of 10 simply because it ...,positive
8001,"Okay, my title is kinda lame, and almost sells...",negative
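The prediction cell has to remember to call `cv.transform` before `lr.predict`. One way to avoid forgetting that step is to bundle the vectorizer and the classifier in a scikit-learn pipeline, which then accepts raw text directly. This is a sketch on made-up data under that assumption, not the approach used in the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration
toy_docs = ["great movie", "great acting", "great fun",
            "awful movie", "awful acting", "awful plot"]
toy_labels = ["positive"] * 3 + ["negative"] * 3

# The pipeline vectorizes and classifies in one call
toy_pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=500, random_state=42),
)
toy_pipeline.fit(toy_docs, toy_labels)

# Raw, unseen text goes straight into predict()
new_reviews = ["a great film with great acting", "awful plot and awful pacing"]
predictions = toy_pipeline.predict(new_reviews)
print(predictions)
```

A pipeline also makes it easier to swap one component, for example the vectorizer, without touching the rest of the code.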


In [56]:
# Saving for Kaggle submission
imdb_eval_data.to_csv('data/submission/cv_and_logistic.csv')

## 5.2. Your turn!

Try to beat this first submission by varying the vectorization method, the classification model, or the text preprocessing. Be creative!
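As one possible starting point for the preprocessing, note that the raw reviews contain HTML line breaks such as `<br />`. With the default token pattern of `CountVectorizer`, the string "br" ends up counted as a word. The helper below is a hypothetical sketch (the name `clean_review` and the sample string are made up) that strips those tags before vectorization. It is not guaranteed to improve the score.

```python
import re

def clean_review(text: str) -> str:
    """Hypothetical helper: remove HTML line breaks, lowercase, collapse whitespace."""
    text = re.sub(r"<br\s*/?>", " ", text)  # "<br />" tags appear in the raw reviews
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample string for illustration
sample = "Great cast.<br /><br />Terrible   ending."
cleaned = clean_review(sample)
print(cleaned)
```

You could apply it with `imdb_data['review'].apply(clean_review)` before splitting, or pass it as the `preprocessor` argument of `CountVectorizer`.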