# Project 4: Kaggle Competition - Amazon Alexa Reviews

## Notebook Contents
- Part 1 - Data Cleansing, EDA, and Pre-Processing
- Part 2 - Modeling, Conclusion

----------------------------------------

###  Part 1 - Data Cleansing, EDA, and Pre-Processing

In [1]:
#libraries

#################################
import nltk
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import seaborn as sns

#################################
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

#################################
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

I downloaded the 'amazon_alexa.tsv' file provided by Kaggle and loaded it as a tab-separated file:

In [2]:
train = pd.read_csv('./data/amazon_alexa.tsv',sep='\t')

In [3]:
train.head(5)

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [5]:
train.isnull().sum()

rating              0
date                0
variation           0
verified_reviews    0
feedback            0
dtype: int64

In [6]:
train.describe()

Unnamed: 0,rating,feedback
count,3150.0,3150.0
mean,4.463175,0.918413
std,1.068506,0.273778
min,1.0,0.0
25%,4.0,1.0
50%,5.0,1.0
75%,5.0,1.0
max,5.0,1.0


In [7]:
#unique rating values
train_test = train.copy()
train_test['rating'] = train_test['rating'].astype(str)

unique = train_test['rating'].unique()
print(sorted(unique))

['1', '2', '3', '4', '5']


---------------------------------

The 'rating' field in the train dataset takes integer values from 1 to 5. I mapped each value to the following self-defined sentiment label:

***1 - very negative***

***2 - negative***

***3 - neutral***

***4 - positive***

***5 - very positive***

I added a new field to the train data reflecting these values:

In [8]:
#map each numeric rating to its sentiment label ('none' for anything outside 1-5)
rating_labels = {
    1: 'very negative',
    2: 'negative',
    3: 'neutral',
    4: 'positive',
    5: 'very positive',
}

Sentiment_Rating = [rating_labels.get(row, 'none') for row in train.rating]

In [9]:
train['Sentiment_Rating'] = Sentiment_Rating

In [10]:
train.head(2)

Unnamed: 0,rating,date,variation,verified_reviews,feedback,Sentiment_Rating
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,very positive
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,very positive


In [11]:
#rating-type counts (the top-level pd.value_counts is deprecated in recent pandas)
train['Sentiment_Rating'].value_counts()

very positive    2286
positive          455
very negative     161
neutral           152
negative           96
dtype: int64

---------------------------------

I used WordNetLemmatizer to clean the 'verified_reviews' strings for modeling:

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

In [13]:
#source: https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

#initiate WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#build the stop-word set once instead of on every word
stop_words = set(stopwords.words('english'))

def review_cleansed(phrase):
    phrase = re.sub(r'[^a-zA-Z]', ' ', phrase) #replace non-letter characters with spaces
    phrase = phrase.lower() #convert all letters to lower case
    phrase = phrase.split() #split the phrase into a list of words

    #drop stop words and lemmatize the rest (lemmatize() treats words as nouns by default)
    phrase = [lemmatizer.lemmatize(a) for a in phrase if a not in stop_words]
    return ' '.join(phrase)
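
To make the cleaning steps concrete without the NLTK downloads, here is a minimal standard-library sketch of the regex, lower-casing, and stop-word steps only. It assumes a tiny hand-picked stop-word set (`SMALL_STOPWORDS`, hypothetical) in place of NLTK's English list and omits lemmatization, so it illustrates the idea and is not a replacement for `review_cleansed`.

```python
import re

# Hypothetical, hand-picked stop words for illustration only;
# the notebook itself uses NLTK's full English stop-word list.
SMALL_STOPWORDS = {"my", "it", "i", "a", "the", "with", "this", "have", "had"}

def basic_clean(phrase):
    """Keep letters only, lower-case, and drop stop words (no lemmatization)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", phrase)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in SMALL_STOPWORDS)

print(basic_clean("Love my Echo!"))
print(basic_clean("Loved it!"))
```

On the first two reviews this gives 'love echo' and 'loved', matching the first two entries of the cleaned list in this notebook.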

In [14]:
#clean every review, keeping the same row order as the train data
clean = [review_cleansed(review) for review in train['verified_reviews']]

In [15]:
clean

['love echo',
 'loved',
 'sometimes playing game answer question correctly alexa say got wrong answer like able turn light away home',
 'lot fun thing yr old learns dinosaur control light play game like category nice sound playing music well',
 'music',
 'received echo gift needed another bluetooth something play music easily accessible found smart speaker wait see else',
 'without cellphone cannot use many feature ipad see use great alarm u r almost deaf hear alarm bedroom living room reason enough keep fun ask random question hear response seem smartbon politics yet',
 'think th one purchased working getting one every room house really like feature offer specifily playing music echo controlling light throughout house',
 'look great',
 'love listened song heard since childhood get news weather information great',
 'sent year old dad talk constantly',
 'love learning knew thing eveyday still figuring everything work far easy use understand make laugh time',
 'purchased mother knee prob

In [16]:
#add 'Clean_Review' column to train data
train['Clean_Review'] = clean

In [17]:
train.head(5)

Unnamed: 0,rating,date,variation,verified_reviews,feedback,Sentiment_Rating,Clean_Review
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,very positive,love echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,very positive,loved
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,positive,sometimes playing game answer question correct...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,very positive,lot fun thing yr old learns dinosaur control l...
4,5,31-Jul-18,Charcoal Fabric,Music,1,very positive,music


---------------------------------

###  Part 2 - Modeling, Conclusion

In [18]:
# baseline accuracy: share of rows in the most frequent class
train['Sentiment_Rating'].value_counts().max() / train['Sentiment_Rating'].value_counts().sum()

0.7257142857142858
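
The baseline is the accuracy of always predicting the most common class. As a quick check of the arithmetic, this small sketch recomputes it from the class counts printed earlier in this notebook, using only the standard library.

```python
from collections import Counter

# Class counts copied from the value_counts() output earlier in this notebook.
counts = Counter({
    "very positive": 2286,
    "positive": 455,
    "very negative": 161,
    "neutral": 152,
    "negative": 96,
})

# Always predicting the majority class is correct for exactly that class's rows.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(counts.values())

print(majority_label, round(baseline_accuracy, 4))
```

Any model scoring near 0.7257 is doing no better than always answering 'very positive'.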

In [19]:
#initiated CountVectorizer
cv = CountVectorizer(stop_words='english', max_features= 10000)

In [20]:
#set X & Y
X = cv.fit_transform(clean).toarray()
y = train['rating']

In [None]:
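#not run: alternative X built from the raw 'Clean_Review' text
#running it would overwrite the count matrix above, and the pipelines below would then need a vectorizer step
#X = train[['Clean_Review']]
#y = train['rating']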

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

##### 1. Naive Bayes Model

In [22]:
#created pipeline
#TfidfTransformer re-weights the word counts by how relevant each word is
#MultinomialNB is the classifier (step named 'clf' to match the fitted output below)
pipeline1 = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])
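
As a rough illustration of what the TF-IDF step does to the count matrix, here is a standard-library sketch of the weighting, assuming the `TfidfTransformer` defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper `tfidf_rows` and the toy counts are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed idf and L2 row normalization."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term appears.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard all-zero rows
        result.append([v / norm for v in weighted])
    return result

# Hypothetical counts: two documents, three terms; the middle term is in both.
toy_counts = [[1, 1, 0],
              [0, 1, 1]]
toy_tfidf = tfidf_rows(toy_counts)
print([round(v, 3) for v in toy_tfidf[0]])
```

The term that appears in only one document ends up with a larger weight than the term shared by both, which is the sense in which TF-IDF measures relevance.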

In [23]:
#fit pipeline
pipeline1.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

In [24]:
#predictions
y_pred = pipeline1.predict(X_test)

In [25]:
#y_pred

In [26]:
#accuracy
pipeline1.score(X_test, y_test)

0.7238095238095238
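
In the cells above, `CountVectorizer` is fit on every review before the train/test split. One alternative, sketched here on a few made-up reviews, is to put the vectorizer inside the pipeline so the vocabulary is learned only from the rows passed to `fit`. The toy texts, labels, and the name `text_pipeline` are hypothetical and are not part of the notebook's results.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for Clean_Review and rating.
toy_reviews = ["love echo", "great sound love", "stopped working", "bad sound stopped"]
toy_ratings = [5, 5, 1, 1]

text_pipeline = Pipeline([
    ("counts", CountVectorizer()),   # raw text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
    ("clf", MultinomialNB()),        # classifier on the weighted counts
])

# Fit on raw strings: the vocabulary comes only from the rows passed to fit().
text_pipeline.fit(toy_reviews, toy_ratings)
toy_predictions = text_pipeline.predict(["love great echo", "stopped working bad"])
print(toy_predictions)
```

With this layout the pipeline can be passed to `cross_val_score` on raw text, and each fold refits the vectorizer on its own training rows.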

In [None]:
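#cross validated to check the accuracy (folds are drawn from the held-out test split)
#random_state removed: it has no effect without shuffle=True, and recent scikit-learn versions raise an error for that combination
kfold = KFold(n_splits = 10)
kfold_results = cross_val_score(pipeline1, X_test, y_test, cv = kfold)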

In [28]:
kfold_results.mean()

0.7253968253968253
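
For reference, here is a minimal sketch of how an unshuffled 10-fold split partitions the rows, assuming the 630-row test split that `test_size = 0.2` produces from 3,150 reviews. The helper `kfold_indices` is hypothetical; it mirrors the documented behaviour of `KFold` without shuffling (consecutive folds, with the first `n_samples % n_splits` folds one row larger).

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(630, 10))
print([len(test_idx) for _, test_idx in folds])
```

Each fold holds out 63 reviews, so every individual fold score rests on a small sample and the mean is the more meaningful number.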

##### 2. Logistic Regression Model

In [29]:
#created pipeline
#TfidfTransformer re-weights the word counts by how relevant each word is
#LogisticRegression is the classifier (step named 'clf' to match the fitted output below)
pipeline2 = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=10000))
])

In [30]:
#fit pipeline
pipeline2.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(max_iter=10000))])

In [31]:
#predictions
y_pred2 = pipeline2.predict(X_test)

In [32]:
#y_pred2

In [33]:
#accuracy
pipeline2.score(X_test, y_test)

0.7428571428571429

In [None]:
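#cross validated to check the accuracy (folds are drawn from the held-out test split)
#random_state removed: it has no effect without shuffle=True, and recent scikit-learn versions raise an error for that combination
kfold = KFold(n_splits = 10)
kfold_results2 = cross_val_score(pipeline2, X_test, y_test, cv = kfold)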

In [35]:
kfold_results2.mean()

0.7222222222222221

---------------------------------

### Conclusion

Both models predict a user's rating score from their written review correctly about 72-74% of the time, roughly in line with the 72.6% baseline computed at the start of Part 2. Exact figures vary between runs because the train/test split is not seeded:

- Naive Bayes Model Accuracy Score: 72% correct
- Logistic Regression Model Accuracy Score: 74% correct

For both models the accuracy score and the k-fold cross-validation score are close:

- Naive Bayes Model: 72% vs 73%
- Logistic Regression Model: 74% vs 72%