#### This analysis uses 400K+ Amazon reviews of unlocked mobile phones to predict whether a review is positive or negative from its written text.

In [1]:
import pandas as pd
import numpy as np

review_data = pd.read_csv('Amazon_Unlocked_Mobile.csv')
review_data.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0
9,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,3,It's battery life is great. It's very responsi...,0.0


In [2]:
len(review_data)  # number of records

413840

First, drop all rows with a Rating of 3, which is treated as neutral.

In [3]:
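# .copy() gives an independent DataFrame, so adding a column later does not raise SettingWithCopyWarning
review_data = review_data[review_data['Rating'] != 3].copy()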

In [4]:
len(review_data)

382075

Create a dummy column 'Recommend', converting Ratings of 4 or 5 to 1 and Ratings of 1 or 2 to 0.

In [5]:
review_data['Recommend'] = np.where(review_data['Rating'] > 3, 1, 0)

Remove all columns except Reviews and Recommend

In [6]:
review_data = review_data[['Reviews', 'Recommend']]

In [7]:
review_data.head()

Unnamed: 0,Reviews,Recommend
0,I feel so LUCKY to have found this used (phone...,1
1,"nice phone, nice up grade from my pantach revu...",1
2,Very pleased,1
3,It works good but it goes slow sometimes but i...,1
4,Great phone to replace my lost phone. The only...,1


Remove any rows with a missing (NaN) value in Reviews or Recommend.

In [8]:
review_data = review_data.dropna()

In [9]:
len(review_data)

382015

Check the balance of the target data

In [10]:
review_data['Recommend'].mean()

0.7459235893878513

About 3/4 of the reviews are positive, so the classes are imbalanced.
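
With roughly three quarters of the labels positive, accuracy needs a reference point. The sketch below uses a small made-up label list (a stand-in for review_data['Recommend'], not the real data) to show that always predicting the majority class scores an accuracy equal to the majority share. On this dataset that baseline would be about 0.746.

```python
# Hypothetical labels standing in for review_data['Recommend'] (6 of 8 positive).
labels = [1, 1, 1, 0, 1, 0, 1, 1]

positive_share = sum(labels) / len(labels)

# A "model" that ignores the text and always predicts the more common class.
majority_class = 1 if positive_share >= 0.5 else 0
baseline_predictions = [majority_class] * len(labels)

baseline_accuracy = sum(
    pred == actual for pred, actual in zip(baseline_predictions, labels)
) / len(labels)

print(positive_share, baseline_accuracy)
```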

Split the data into training and test sets. With no test_size given, train_test_split holds out 25% of the rows for testing.

In [11]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(review_data['Reviews'], 
                                                    review_data['Recommend'], 
                                                    random_state=0)

In [12]:
np.shape(X_train)

(286511,)

In [13]:
np.shape(X_test)

(95504,)

# CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

#  Use CountVectorizer to learn the vocabulary of the training set Reviews
#  Create features for document term matrix
features = CountVectorizer().fit(X_train)

In [15]:
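# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
len(features.get_feature_names())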

58586

About 58.6K distinct tokens in the feature set.
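
To make the vocabulary and the document-term matrix concrete, here is a plain-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation; it only mimics two documented CountVectorizer defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters). The example sentences and helper names are made up.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Made-up training documents.
docs = ["Great phone, great price", "Battery died. Not a great phone"]

# "fit": learn the vocabulary and assign each token a column index.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocabulary)}

def to_counts(text):
    """"transform": count known tokens; tokens outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in index:
            row[index[tok]] = n
    return row

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

The token pattern also explains truncated features: `tokenize("doesn't")` returns `['doesn']`, because the apostrophe splits the word and the single-character `t` is dropped.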

In [16]:
features.get_feature_names()[::2000]  # just a sample of every 2000th feature

['00',
 '3the',
 'acdc',
 'aplicaciones',
 'battety',
 'burla',
 'cingularproxy',
 'corazon',
 'demás',
 'droid',
 'ev',
 'flaps',
 'goe',
 'hmmmmmmm',
 'insuperable',
 'lad',
 'managerbackupbrowsercalculatorcalendarcaller',
 'mounted',
 'occassionaly',
 'peaseng',
 'predated',
 'ramblr',
 'response',
 'section',
 'sned',
 'suis',
 'thr',
 'undesirable',
 'vários',
 'yeahhh']

Now transform the training set Reviews into the document-term matrix

In [17]:
X_train_dtm = features.transform(X_train)

In [18]:
np.shape(X_train_dtm)

(286511, 58586)

286K observations x 58K features

Use Logistic Regression model

In [19]:
from sklearn.linear_model import LogisticRegression

# instantiate and train the model
model = LogisticRegression()
model.fit(X_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now transform X_test into a document-term matrix using the vocabulary learned from the training set.

In [20]:
X_test_dtm = features.transform(X_test)

In [21]:
predictions = model.predict(X_test_dtm)

In [22]:
predictions[:20]  # check the first 20

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Now compare predictions with y_test

In [23]:
checkPredictions = (y_test == predictions)

In [24]:
checkPredictions.mean()

0.9494471435751383

Nearly 95% accuracy on the test data.
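
Because the classes are imbalanced, a single accuracy figure can hide how the model does on the smaller (negative) class. The sketch below builds a confusion matrix by hand from made-up labels and predictions (stand-ins for y_test and predictions). On the real data, `sklearn.metrics.confusion_matrix(y_test, predictions)` should give the same kind of breakdown.

```python
from collections import Counter

# Hypothetical true labels and predictions, not the notebook's real results.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# Count each (actual, predicted) pair.
counts = Counter(zip(y_true, y_pred))
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

accuracy = (tp + tn) / len(y_true)
recall_positive = tp / (tp + fn)   # share of positive reviews found
recall_negative = tn / (tn + fp)   # share of negative reviews found

print(tn, fp, fn, tp)
print(accuracy, recall_positive, recall_negative)
```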

What were the most important features in the dtm?

In [25]:
arglist = model.coef_[0].argsort()

In [26]:
coeffs = model.coef_[0]

In [27]:
feature_words = np.array(features.get_feature_names())

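Features with the highest coefficients are at the end of the sorted arglist. Note that the slice arglist[:-20:-1] stops before position -20, so it returns 19 indices rather than 20, which is why the outputs below list 19 values.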

In [28]:
highest20 = arglist[:-20:-1]
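
The behaviour of a negative-step slice is easy to check on a small example. This sketch uses a plain Python list with made-up scores; the same slicing rule applies to NumPy arrays such as arglist. The name `order` is a stand-in for the result of `argsort()`.

```python
# Made-up coefficients.
scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.4]

# Plain-Python analogue of argsort: indices ordered from lowest to highest score.
order = sorted(range(len(scores)), key=scores.__getitem__)

k = 3
short = order[:-k:-1]          # walks back from the end, stops BEFORE position -k
top_k = order[:-(k + 1):-1]    # k indices, highest score first
top_k_alt = order[-k:][::-1]   # same result: take the last k, then reverse

print(len(short), top_k, top_k_alt)
```

Either of the last two forms, with k = 20, would return a full top 20.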

In [29]:
coeffs[highest20]

array([5.28914406, 4.79018593, 3.67920014, 3.56791797, 3.49802113,
       3.34494266, 3.12687901, 2.91986122, 2.9090389 , 2.90853096,
       2.88045286, 2.84730278, 2.83701767, 2.81879718, 2.76078955,
       2.7027161 , 2.66894747, 2.64151028, 2.60697809])

In [30]:
feature_words[highest20]

array(['excelent', 'excelente', 'excellent', 'exelente', 'loves',
       'loving', 'lovely', 'love', 'superb', 'perfecto', 'awesome',
       '4eeeks', 'perfect', 'amazing', 'downside', 'exelent', 'buen',
       'flawlessly', '27'], dtype='<U567')

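Interesting: the misspelling 'excelent' and the Spanish 'excelente' had higher coefficients than 'excellent'.  
Odd tokens such as '4eeeks' and '27' probably occur in only a few reviews, nearly all of them positive, which lets the model assign them large weights.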

What were the most negative features?

In [31]:
lowest20 = arglist[:20]

In [32]:
coeffs[lowest20]

array([-3.7763103 , -3.35673008, -3.16946378, -3.04321123, -2.95516872,
       -2.94279382, -2.80269725, -2.80131284, -2.79911234, -2.76965969,
       -2.76111553, -2.75120412, -2.68779548, -2.6877409 , -2.65757652,
       -2.6387338 , -2.59703318, -2.58344036, -2.54864704, -2.50800302])

In [33]:
feature_words[lowest20]

array(['worst', 'worthless', 'mony', 'false', 'junk', 'lemon', 'terrible',
       'dissatisfied', 'horrible', 'synchronize', 'pos', 'garbage',
       'nope', 'blacklist', 'waste', 'messing', 'useless', 'cheaply',
       'blocked', 'disliked'], dtype='<U567')

# Tfidf: term frequency - inverse document frequency

Try the analysis using the tf-idf vectorizer, limiting features to words that appear in at least 5 reviews.

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

features_tfidf = TfidfVectorizer(min_df=5).fit(X_train)

In [35]:
len(features_tfidf.get_feature_names())

20070

Previously 58,586. The min_df=5 cutoff (not the tf-idf weighting itself) reduces the number of features by about two thirds.
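
For reference, here is a plain-Python sketch of the weighting TfidfVectorizer applies under its documented defaults: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document's row. The helper name `tfidf_row` and the corpus statistics are made up for illustration.

```python
import math

def tfidf_row(term_counts, doc_freq, n_docs):
    """L2-normalized tf-idf weights for one document, using smoothed idf."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Hypothetical corpus of 4 documents; doc_freq = number of documents containing each term.
n_docs = 4
doc_freq = {"phone": 4, "great": 3, "grieta": 1}

# One review in which each of the three terms appears once.
row = tfidf_row({"phone": 1, "great": 1, "grieta": 1}, doc_freq, n_docs)
print(row)
```

With equal counts, the rarest term ("grieta") gets the largest weight and the term found in every document ("phone") gets the smallest.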

In [36]:
X_train_dtm_tfidf = features_tfidf.transform(X_train)

Use Logistic Regression model again.  
Train the model

In [37]:
model_tfidf = LogisticRegression()
model_tfidf.fit(X_train_dtm_tfidf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Transform the X_test data into the tfidf feature vector

In [38]:
X_test_dtm_tfidf = features_tfidf.transform(X_test) 

In [39]:
predictions_tfidf = model_tfidf.predict(X_test_dtm_tfidf)

Compare predictions with y_test

In [40]:
check_pred_tfidf = (y_test == predictions_tfidf)
check_pred_tfidf.mean()

0.9498345619031664

Previously 0.94945, now 0.94983. Very slightly better results, with a much smaller dtm and so less processing time.

Most important features?

In [41]:
arglist = model_tfidf.coef_[0].argsort()
coeffs = model_tfidf.coef_[0]
feature_words = np.array(features_tfidf.get_feature_names())

In [42]:
highest20 = arglist[:-20:-1]
lowest20 = arglist[:20]

In [43]:
coeffs[highest20]

array([13.46168417, 13.37184933, 11.16672123,  9.39678989,  9.37336164,
        8.31237879,  8.19713052,  8.08741865,  7.96099975,  7.90564656,
        7.28891172,  6.78600496,  6.22260021,  6.02962363,  5.92200796,
        5.7700605 ,  5.64398939,  5.54642785,  5.47961211])

In [44]:
feature_words[highest20]

array(['love', 'great', 'excellent', 'perfect', 'amazing', 'awesome',
       'loves', 'easy', 'perfectly', 'best', 'far', 'excelent', 'exactly',
       'good', 'complaints', 'pleased', 'excelente', 'fantastic', 'fast'],
      dtype='<U31')

In [45]:
coeffs[lowest20]

array([-11.07896341,  -9.78517042,  -7.39252584,  -7.38764742,
        -7.28327914,  -7.12248775,  -6.8261299 ,  -6.81918869,
        -6.51068972,  -6.44883585,  -5.89145572,  -5.88386692,
        -5.80929061,  -5.78895662,  -5.62905154,  -5.55644425,
        -5.5085627 ,  -5.4129117 ,  -5.32826609,  -5.14076882])

In [46]:
feature_words[lowest20]

array(['not', 'worst', 'terrible', 'waste', 'useless', 'disappointed',
       'return', 'poor', 'horrible', 'returning', 'doesn', 'junk', 'slow',
       'stopped', 'unable', 'freezes', 'wouldn', 'worthless', 'locked',
       'garbage'], dtype='<U31')

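Highest and lowest tf-idf scores: for each feature, take its maximum tf-idf value over all training reviews, then sort the features by that maximum.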

In [47]:
arglist_tfidf = X_train_dtm_tfidf.max(0).toarray()[0].argsort()

In [48]:
highest20 = arglist_tfidf[:-20:-1]
lowest20 = arglist_tfidf[:20]

In [49]:
X_train_dtm_tfidf.max(0).toarray()[0][highest20]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1.])

In [50]:
feature_words[highest20]  # highest tfidf score, not highest model coefficients

array(['grieta', 'greay', 'uff', 'bulls', 'gud', 'gucci', 'grest', 'una',
       'unacceptable', 'greatt', 'happy', 'greatness', 'unbelievable',
       'greate', 'great', 'calida', 'calidad', 'grate', 'handy'],
      dtype='<U31')

In [51]:
X_train_dtm_tfidf.max(0).toarray()[0][lowest20]

array([0.02026129, 0.02026129, 0.02026129, 0.02026129, 0.02026129,
       0.02325681, 0.02325681, 0.0233752 , 0.0233752 , 0.0233752 ,
       0.0233752 , 0.0233752 , 0.0233752 , 0.0233752 , 0.0233752 ,
       0.0233752 , 0.0233752 , 0.0233752 , 0.0233752 , 0.0233752 ])

In [52]:
feature_words[lowest20]

array(['warmness', 'aggregration', 'commenter', 'storageso', 'pthalo',
       '1300', '403', '34ghz', '16nm', 'keynote', 'a10', 'messiah',
       'reading___', '770', '401p', '700nits', 'excites', 'submarket',
       'brawns', 'resins'], dtype='<U31')
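
The maximum scores of exactly 1.0 above are most likely a side effect of L2 normalization, which TfidfVectorizer applies to each row by default: a review containing a single distinct term always normalizes to a weight of 1, whatever its idf. Features that only ever occur in long reviews get small maximum scores. A minimal sketch with made-up weights:

```python
import math

def l2_normalize(weights):
    """Scale a list of weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# A review with one distinct term: its weight becomes 1 regardless of the starting value.
one_term_review = l2_normalize([2.0])

# A review with 400 distinct, equally weighted terms: each weight shrinks to 1/sqrt(400).
long_review = l2_normalize([1.0] * 400)

print(one_term_review[0], long_review[0])
```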

### Trying my own reviews

In [53]:
rev1 = pd.Series(['The phone was amazing.  The sound was very good.  I like the way it feels in my hand.  I would recommend it to others.'])

Transform the rev1 text to the tfidf document term matrix.

In [54]:
rev1_dtm = features_tfidf.transform(rev1)

Have the model make predictions on rev1

In [55]:
pred = model_tfidf.predict(rev1_dtm)

In [56]:
pred

array([1])

Positive!

In [57]:
rev2 = pd.Series(['The quality could be better.  It felt plasticy.  There are not so many apps available for this phone.  Would not buy again.'])

In [58]:
rev2_dtm = features_tfidf.transform(rev2)

In [59]:
pred = model_tfidf.predict(rev2_dtm)

In [60]:
pred

array([0])

Negative!
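
Behind these two predictions, logistic regression computes a weighted sum of the tf-idf features plus an intercept, passes it through the sigmoid function, and predicts 1 when the resulting probability is above 0.5. The sketch below uses made-up coefficients, intercept, and feature values rather than the fitted model's. On the real model, `model_tfidf.predict_proba(rev1_dtm)` should expose the corresponding probabilities.

```python
import math

# Hypothetical coefficients and intercept, NOT the values learned in this notebook.
coefficients = {"amazing": 2.0, "good": 1.0, "not": -2.5, "better": -0.5}
intercept = 0.2

def predict(features):
    """Return (probability of class 1, predicted label) for a dict of feature values."""
    z = intercept + sum(
        coefficients.get(term, 0.0) * value for term, value in features.items()
    )
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, int(probability > 0.5)

# Made-up tf-idf style feature values for a positive and a negative review.
p_pos, label_pos = predict({"amazing": 0.6, "good": 0.4})
p_neg, label_neg = predict({"not": 0.7, "better": 0.3})

print(label_pos, label_neg)
```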