# Case Study: Sentiment Analysis

### Setup

The following dataset contains product reviews and their numeric ratings (1 to 5). We treat the rating as a sentiment label: ratings 1 and 2 are negative, 3 is neutral, and 4 and 5 are positive. We will build a sentiment analysis model on this dataset and, for simplicity, exclude the neutral reviews.
 

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import nltk

# Read in the data
df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/reviews.csv")
# Sample the data to speed up computation
# Comment out this line to match with lecture

df.head()


| Unnamed: 0.1 | Unnamed: 0 | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|---|
| 0 | 394349 | Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat... | | 244.95 | 5 | Very good one! Better than Samsung S and iphon... | 0.0 |
| 1 | 34377 | Apple iPhone 5c 8GB (Pink) - Verizon Wireless | Apple | 194.99 | 1 | The phone needed a SIM card, would have been n... | 1.0 |
| 2 | 248521 | Motorola Droid RAZR MAXX XT912 M Verizon Smart... | Motorola | 174.99 | 5 | I was 3 months away from my upgrade and my Str... | 3.0 |
| 3 | 167661 | CNPGD [U.S. Office Extended Warranty] Smartwat... | CNPGD | 49.99 | 1 | an experience i want to forget | 0.0 |
| 4 | 73287 | Apple iPhone 7 Unlocked Phone 256 GB - US Vers... | Apple | 922.0 | 5 | GREAT PHONE WORK ACCORDING MY EXPECTATIONS. | 1.0 |


In [6]:
# Remove missing values
df.dropna(inplace=True)

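# Drop reviews with neutral ratings; copy the result so that the new
# column below is added to an independent DataFrame, not a view
df = df[df['Rating'] != 3].copy()

# Example reviews the model should be able to tell apart:
#   "Super! No problem with the phone"   (positive)
#   "Many problems with the phone"       (negative)

# Map ratings 4 and 5 to 1 (positive)
# Map ratings 1 and 2 to 0 (negative)
df["positive"] = np.where(df['Rating'] > 3, 1, 0)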
df.Reviews.head(10)

1     The phone needed a SIM card, would have been n...
2     I was 3 months away from my upgrade and my Str...
3                        an experience i want to forget
4           GREAT PHONE WORK ACCORDING MY EXPECTATIONS.
5     I fell in love with this phone because it did ...
6     I am pleased with this Blackberry phone! The p...
7     Great product, best value for money smartphone...
9             I've bought 3 no problems. Fast delivery.
10                         Great phone for the price...
11    My mom is not good with new technoloy but this...
Name: Reviews, dtype: object

In [7]:
# Most ratings are positive
df['positive'].mean()

0.7475259604695931

In [8]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],
                                                    df['positive'],
                                                    random_state=0)

In [9]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Broke in like 20 mins


X_train shape:  (27662,)


# TF-IDF

Fit another logistic regression model, this time vectorizing the reviews with TF-IDF.

In [31]:
# Fit the TfidfVectorizer to the training data with default settings
# (no minimum document frequency is set here)
idf_vect = TfidfVectorizer()
idf_vect.fit(X_train)
len(idf_vect.get_feature_names_out())

21374
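One possible way to carry out the rest of this step is sketched below: transform the texts with the fitted vectorizer, fit a logistic regression, and score it with AUC on held-out texts. To keep the block self-contained it uses a few made-up reviews; the `toy_*` names are hypothetical stand-ins for `X_train`, `y_train`, `X_test`, and `y_test`, and `max_iter=1000` is an assumed setting, not a value from the lecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical stand-ins for X_train / y_train / X_test / y_test
toy_train_texts = [
    "great phone, works great",
    "love this phone",
    "excellent battery and great screen",
    "terrible phone, broke in a week",
    "awful battery, do not buy",
    "broke after one day, terrible",
]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_texts = ["great screen, love it", "terrible, broke quickly"]
toy_test_labels = [1, 0]

# Learn the vocabulary and idf weights from the training texts only
toy_tfidf = TfidfVectorizer()
toy_train_matrix = toy_tfidf.fit_transform(toy_train_texts)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_train_matrix, toy_train_labels)

# Reuse the fitted vectorizer on the test texts (transform, not fit)
toy_test_matrix = toy_tfidf.transform(toy_test_texts)
toy_scores = toy_model.predict_proba(toy_test_matrix)[:, 1]
toy_auc = roc_auc_score(toy_test_labels, toy_scores)
print("Toy AUC:", toy_auc)
```

In the notebook itself the same pattern would apply with `idf_vect.transform(X_train)` and `idf_vect.transform(X_test)` in place of the toy matrices.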

# N-grams

Include bigrams in the model. Fit another logistic regression model and compare its test performance with the previous model. Look at the top 10 features for each class.
