In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

Link to the data file (5-core Digital Music reviews): http://jmcauley.ucsd.edu/data/amazon/

In [2]:
df = pd.read_json('Digital_Music_5.json', lines=True)

In [3]:
df.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,5555991584,"[3, 3]",5,"It's hard to believe ""Memory of Trees"" came ou...","09 12, 2006",A3EBHHCZO6V2A4,"Amaranth ""music fan""",Enya's last great album,1158019200
1,5555991584,"[0, 0]",5,"A clasically-styled and introverted album, Mem...","06 3, 2001",AZPWAXJG9OJXV,bethtexas,Enya at her most elegant,991526400
2,5555991584,"[2, 2]",5,I never thought Enya would reach the sublime h...,"07 14, 2003",A38IRL0X2T4DPF,bob turnley,The best so far,1058140800
3,5555991584,"[1, 1]",5,This is the third review of an irish album I w...,"05 3, 2000",A22IK3I6U76GX0,Calle,Ireland produces good music.,957312000
4,5555991584,"[1, 1]",4,"Enya, despite being a successful recording art...","01 17, 2008",A1AISPOIIHTHXX,"Cloud ""...""",4.5; music to dream to,1200528000


In [4]:
df.shape

(64706, 9)

Our goal is to predict whether a review is positive or negative from its review text, as practice in sentiment analysis. We will drop all but two columns: the review text and a binary label derived from the `overall` rating.

In [5]:
# Convert the 1-5 review score to a binary label:
# ratings above 3 are positive (1), everything else is negative (0)
df['is_positive'] = np.where(df['overall'] > 3, 1, 0)

# Keep only the review text and the label
df = df[['reviewText', 'is_positive']]

In [6]:
df.is_positive.value_counts()

1    52116
0    12590
Name: is_positive, dtype: int64

The classes are imbalanced: about 80% of the reviews are positive. We will have to take this into account.
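One common way to account for the imbalance is to weight each class inversely to its frequency. As a rough sketch, the snippet below applies the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * np.bincount(y))`, to the counts printed above. The variable names are illustrative only, and the counts are for the full dataset rather than a training split.

```python
import numpy as np

# Class counts taken from the value_counts() output above:
# index 0 = negative reviews, index 1 = positive reviews
counts = np.array([12590, 52116])

n_samples = counts.sum()
n_classes = len(counts)

# "Balanced" weighting: n_samples / (n_classes * count_per_class)
weights = n_samples / (n_classes * counts)

print('share of positive reviews:', round(counts[1] / n_samples, 3))
print('balanced class weights:', weights.round(3))
```

The rarer negative class receives the larger weight, so misclassifying a negative review costs the model more than misclassifying a positive one.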

According to Wang and Greiner (2005), support vector machines with a bag-of-words approach give the best results for sentiment classification. We will use scikit-learn's CountVectorizer to build the bag of words from the training set, then use that same vocabulary to predict on the test set.

Source: https://pdfs.semanticscholar.org/aa3d/afab5bd4112b3f55929582bfec48139ff4c3.pdf
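To make the bag-of-words idea concrete, here is a minimal sketch on a few toy sentences (invented for illustration, not taken from the dataset). It assumes CountVectorizer's default settings, fits the vocabulary on the "train" sentences only, and shows that words unseen during fitting are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration
train_docs = ["good album great songs", "bad album"]
test_docs = ["great album awful"]

toy_vec = CountVectorizer()
train_counts = toy_vec.fit_transform(train_docs)  # learn the vocabulary from train only
test_counts = toy_vec.transform(test_docs)        # reuse it; unseen words are dropped

# Vocabulary words ordered by their column index in the count matrix
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row is a document and each column is a vocabulary word; "awful" never appeared in the training sentences, so it contributes nothing to the test row.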

In [7]:
# Set features X and target y
X = df.reviewText
y = df.is_positive

# Split with stratify=y so the class ratio stays the same in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)
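As a quick illustration of what `stratify` does, the sketch below splits a toy label vector (80 positives, 20 negatives, invented for illustration) and checks that both parts keep the 80/20 ratio. The `random_state` argument is an addition made only so the sketch is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels, invented for illustration: 80 positives and 20 negatives
toy_y = np.array([1] * 80 + [0] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

# random_state is fixed only to make this sketch reproducible
_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.20, random_state=0
)

print('positive share in train:', y_tr.mean())
print('positive share in test: ', y_te.mean())
```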

In [8]:
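from sklearn.feature_extraction.text import CountVectorizer
# Use the built-in English stop-word list and a minimum document frequency of 2.
vec = CountVectorizer(stop_words='english', min_df=2)
# Learn the vocabulary on the training text only, then reuse it for the test text
vec.fit(X_train)
X_train = vec.transform(X_train)
X_test = vec.transform(X_test)
# Print our features (the "bag of words").
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = vec.get_feature_names_out()
print('\nnumber of features: ', len(feature_names),'\n')
print(feature_names)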


number of features:  49502 



In [9]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

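# class_weight='balanced' weights each class inversely to its frequency
clf = LinearSVC(class_weight='balanced', C=.1)
clf.fit(X_train, y_train)

# 5-fold cross-validated accuracy on the training set
print('train', cross_val_score(clf, X_train, y_train, cv=5))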

train [ 0.83745412  0.82856867  0.83278594  0.83413833  0.83549073]


In [10]:
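# Note: cross_val_score refits clones of clf on folds of the test set,
# so these scores do not come from the model fitted on the training data
print('test', cross_val_score(clf, X_test, y_test, cv=5))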

test [ 0.82425647  0.82193897  0.83314021  0.82225657  0.81600309]


In [11]:
from sklearn.metrics import confusion_matrix, classification_report

train_pred = clf.predict(X_train)
test_pred = clf.predict(X_test)

print('train confusion matrix and report:')
print(confusion_matrix(y_train, train_pred))
print(classification_report(y_train, train_pred))

print('test confusion matrix and report:')
print(confusion_matrix(y_test, test_pred))
print(classification_report(y_test, test_pred))

train confusion matrix and report:
[[ 9876   196]
 [ 1562 40130]]
             precision    recall  f1-score   support

          0       0.86      0.98      0.92     10072
          1       1.00      0.96      0.98     41692

avg / total       0.97      0.97      0.97     51764

test confusion matrix and report:
[[1687  831]
 [1283 9141]]
             precision    recall  f1-score   support

          0       0.57      0.67      0.61      2518
          1       0.92      0.88      0.90     10424

avg / total       0.85      0.84      0.84     12942



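The model is heavily overfit to the training data. Predicting on the same data it was fit on gives about 0.97 accuracy, while 5-fold cross-validation on the training set gives about 0.83 and the held-out test set about 0.84. The negative class suffers most: its F1 score drops from 0.92 on train to 0.61 on test.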
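To quantify the gap, a small sketch can recompute accuracy and negative-class recall directly from the two confusion matrices printed above. The helper names are hypothetical, and the matrices are copied by hand from the output rather than recomputed from the model.

```python
# Confusion matrices copied from the output above
# (rows = true class, columns = predicted class)
train_cm = [[9876, 196], [1562, 40130]]
test_cm = [[1687, 831], [1283, 9141]]

def accuracy(cm):
    """Fraction of correct predictions: diagonal sum over total."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

def recall_per_class(cm):
    """Recall for each true class: correct predictions over the row total."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

train_acc, test_acc = accuracy(train_cm), accuracy(test_cm)
print('train accuracy: %.3f' % train_acc)
print('test accuracy:  %.3f' % test_acc)
print('gap:            %.3f' % (train_acc - test_acc))
print('negative-class recall, train vs test: %.2f vs %.2f'
      % (recall_per_class(train_cm)[0], recall_per_class(test_cm)[0]))
```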