## Amazon Alexa Reviews Analysis

### [Dataset link](https://www.kaggle.com/sid321axn/amazon-alexa-reviews)

* The project analyzes reviews by users of **Amazon’s Alexa products**. 
* Using **Natural Language Processing** on the product reviews and some additional features, a machine learning model should be able to predict if the feedback is **positive (1) or negative (0).**

* The primary methods used for this dataset are **Random Forest and Gradient Boosting**.

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline 

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
sns.set_palette("bright")

### Importing Data

In [None]:
data = pd.read_csv("../input/amazon_alexa.tsv", sep="\t")

In [None]:
data.head()

<br>
### Exploring the dataset 
This data set has five columns:
* rating
* date
* variation
* verified_reviews
* feedback

We will explore each column with the help of charts and see how it impacts our target column **feedback**.

In [None]:
data.columns

Rating column has values:

In [None]:
data['rating'].unique()

#### Converting *date* attribute from string to datetime.date datatype
We will be using date column for feature engineering, so it would be a good idea if we convert this column from a **string** datatype to a **datetime.date** datatype.

In [None]:
type(data['date'][0]) , data['date'][0]

In [None]:
data['date'] = pd.to_datetime(data['date'])
data['date'][0]

In [None]:
dates = data['date']
# Keep only the calendar date (a datetime.date object) of each timestamp
data['only_dates'] = dates.dt.date
data['only_dates'][0]

### Feature Engineering:

#### Extracting *Year, Month, Day of the Week* from date.
* We will be using these features later in the model.
* We will extract month, year and day of the week into separate columns.

In [None]:
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month

# ISO weekday numbering: 1 -> Monday, 7 -> Sunday
# (dt.dayofweek starts at 0 for Monday, hence the + 1)
data['day_of_week'] = data['date'].dt.dayofweek + 1

#### Estimating length of the reviews 
* The length of the text often proves to be a useful feature for classifying text in a Natural Language Processing problem.

In [None]:
# Number of characters in each review
data['len_of_reviews'] = data['verified_reviews'].str.len()

In [None]:
data['len_of_reviews'][0], data['verified_reviews'][0]

#### Updated Column List:
* As a result, we have added new columns in our dataset.

In [None]:
data.columns

### Visualizing your Exploratory Data Analysis:

* This graph shows that 5-star ratings dominate the dataset. <br>
* In other words, customers seem to be very happy with Alexa products.

In [None]:
plt.figure(figsize=(15,7))
plt.bar(height = data.groupby('rating').count()['date'], x = sorted(data['rating'].unique(), reverse= False))
plt.xlabel("Ratings")
plt.ylabel("Count")
plt.title("Count of Ratings")
plt.show()

* On applying a hue of feedback, we can see that reviews with a rating of more than 2 result in positive feedback (1).

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x="rating", hue="feedback", data=data)
plt.show()

* The bar plot of rating with respect to variation highlights that black dot is the most frequently ordered product and also most liked.

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x="rating", y="variation", hue="feedback", data=data, estimator= sum, ci = None)
plt.show()

* On changing the aggregation function to mean (the default), the average rating seems to be about 4.5 for positive-feedback reviews.

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x="rating", y="variation", hue="feedback", data=data, ci = None)
plt.show()

* When we take month into consideration, most reviews in this dataset come from the month of July.

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(y="rating", x="month", hue="feedback", data=data, ci = None, estimator= sum)
plt.show()

* Changing the aggregation function back to mean does not highlight anything important, just the fact that the products have high ratings.

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(y="rating", x="month", hue="feedback", data=data, ci = None)
plt.show()

* When day of the week is considered, Monday appears to be the day when most people write their reviews.
* This may relate to the Prime two-day delivery guarantee, with orders most often being placed on Saturday or over the weekend.

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x="day_of_week", hue="feedback", data=data)
plt.show()

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(y="rating", x="day_of_week", hue="feedback", data=data, ci = None)
plt.show()

* Overall, this dataset is imbalanced towards positive reviews.
* Therefore, the important score to look at when judging how the model performed is the **F1 Score**.

In [None]:
plt.figure(figsize=(15,7))
sns.countplot(x="feedback", data=data)
plt.show()
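
To make the point about imbalance concrete, here is a minimal sketch in plain Python of how accuracy, precision, recall and F1 are computed from predictions. The helper name `classification_scores` and the toy labels are hypothetical and are not taken from the dataset. The toy classifier always predicts the majority class (1): accuracy looks high, while the F1 score for the minority class (0) drops to zero.

```python
def classification_scores(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, made-up labels: nine positive reviews and one negative review
y_true = [1] * 9 + [0]
# A classifier that always answers "positive"
y_pred = [1] * 10

scores_majority = classification_scores(y_true, y_pred, positive=1)
scores_minority = classification_scores(y_true, y_pred, positive=0)
print(scores_majority)
print(scores_minority)
```

Note that scikit-learn's `f1_score` scores the class labelled 1 by default (`pos_label=1`), so it may be worth also checking the score for class 0 on a dataset like this one.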

* Finally, the length column shows that customers who leave negative feedback tend to write longer reviews.

In [None]:
plt.figure(figsize=(15,7))
sns.distplot(data[data['feedback'] == 0]['len_of_reviews'], label = 'Feedback - 0')
sns.distplot(data[data['feedback'] == 1]['len_of_reviews'], label = 'Feedback - 1')
plt.legend()
plt.show()

### Data Preprocessing:

#### TfidfVectorizer:<br>

* Since we cannot directly feed text data into our machine learning models, we will have to use a vectorizer.
* The most common vectorizer for text data happens to be the Count Vectorizer, because it is easy to understand and relate to.
* We will use the term frequency-inverse document frequency (TF-IDF) vectorizer for this dataset.
* The formula is as follows:
![tdf](https://skymind.ai/images/wiki/tfidf.png)
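
As a rough illustration of the idea, the sketch below computes the textbook form of the weight, `tf(t, d) * log(N / df(t))`, in plain Python. The function name `tf_idf` and the three toy reviews are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up reviews
docs = ["love my echo", "echo stopped working", "love the sound"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

Terms that occur in few documents (such as "my") receive a higher weight than terms shared by several documents (such as "echo"), which is the behaviour that distinguishes TF-IDF from raw counts.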

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tdf = TfidfVectorizer(stop_words='english')

In [None]:
pd.DataFrame(tdf.fit_transform(data['verified_reviews']).toarray())

In [None]:
tdf_data = pd.DataFrame(tdf.fit_transform(data['verified_reviews']).toarray())

### One Hot Encoding: <br>

* For variation we will be using one hot encoding, which can be explained by the image below.

![ohe](https://i.imgur.com/mtimFxh.png)
<br>
* One important thing to take care of: no matter how many dummy variables you end up with, make sure that you drop one of them.
* You can do this by setting **drop_first = True**.
* This problem is known as the dummy variable trap.
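
A small sketch of the effect of `drop_first`, using a made-up series of variation names (the values are illustrative, not read from the dataset). With all dummy columns kept, each row sums to 1, so any one column is fully determined by the others; dropping the first category removes that redundancy.

```python
import pandas as pd

# Toy, made-up variation values
toy = pd.Series(["Black", "White", "Charcoal", "Black"], name="variation")

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (alphabetically) dropped

print(list(full.columns))
print(list(reduced.columns))
```

A row of all zeros in `reduced` now stands for the dropped category, so no information is lost.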

In [None]:
pd.get_dummies(data['variation'], drop_first= True)

In [None]:
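# Drop the first dummy column, as described above, to avoid the dummy variable trap
one_hot_data = pd.get_dummies(data['variation'], drop_first=True)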

* Now, we can just concatenate all the features which we intend to use into a single dataframe called **X**.

In [None]:
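X = pd.concat([data['rating'], one_hot_data, tdf_data, data['month'], data['day_of_week'], data['len_of_reviews']], axis=1)
# The TF-IDF columns have integer names; recent scikit-learn versions reject mixed name types, so cast all to strings
X.columns = X.columns.astype(str)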

In [None]:
X.head()

* And the target vector **y**.

In [None]:
y = data['feedback']

### K Fold Cross Validation:
* K Fold cross validation gives a good idea of how our selected model performs on different chunks of data.
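* We are getting perfect scores through cross validation, so we will not be performing hyperparameter tuning. Such scores are plausibly explained by **rating** being among the features, since the plots above show that it largely determines feedback.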
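
For intuition, here is a minimal sketch of how unshuffled K-fold splitting partitions row indices into consecutive chunks, which is what `KFold(n_splits=5)` does by default. The helper name `k_fold_indices` is hypothetical and the code is only an illustration, not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row lands in the test set exactly once, so the five scores together cover the whole dataset.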

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier()

k_fold = KFold(n_splits=5)

cross_val_score(rf, X, y, cv=k_fold, scoring='accuracy')

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

#### Random Forest Classifier:

In [None]:
rf = RandomForestClassifier()
fit_model = rf.fit(X_train, y_train)

* One of the most useful attributes of the random forest classifier in scikit-learn is **feature_importances_**.
* Let us have a look at the top 10 features.

In [None]:
# Pair each importance with its column name and print the 10 largest
ranked = sorted(zip(fit_model.feature_importances_, X_train.columns), key=lambda x: x[0], reverse=True)
for element in ranked[:10]:
    print(element)

In [None]:
y_pred = rf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

In [None]:
print("==============================================")
print("For Random Forest Classifier:\n")
print("Accuracy Score: ",accuracy_score(y_test, y_pred))
print("Precision Score: ",precision_score(y_test, y_pred))
print("Recall Score: ",recall_score(y_test, y_pred))
print("F1 Score: ",f1_score(y_test, y_pred))
print("Confusion Matrix:\t \n",confusion_matrix(y_test, y_pred))

print("==============================================")

#### Gradient Boosting Classifier: 

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
# Predict with the gradient boosting model (not the random forest fitted earlier)
y_pred = gbc.predict(X_test)

print("==============================================")
print("For Gradient Boosting Classifier:\n")
print("Accuracy Score: ",accuracy_score(y_test, y_pred))
print("Precision Score: ",precision_score(y_test, y_pred))
print("Recall Score: ",recall_score(y_test, y_pred))
print("F1 Score: ",f1_score(y_test, y_pred))
print("Confusion Matrix:\t \n",confusion_matrix(y_test, y_pred))
print("==============================================")

### Conclusions: 
* Feature Engineering is the most crucial step when it comes to Natural Language Processing. 
* Switching from the Count Vectorizer to the TF-IDF vectorizer also made a difference to the F1 score.


In [None]:
results = pd.DataFrame(data = {'Y Test': y_test, 'Y Predictions': y_pred})

In [None]:
results.head()

In [None]:
results.to_csv('Results.csv')