## Sentiment Analysis on Amazon Alexa Reviews  

In this exercise we will explore customer sentiment toward Amazon Alexa products such as the Alexa Echo, Echo Dot, and Firestick. The dataset is taken from [Kaggle](https://www.kaggle.com/sid321axn/amazon-alexa-reviews) and consists of 3150 Amazon customer reviews, each with a star rating, date of review, product variant, and feedback label.

In [None]:
import numpy as np
import pandas as pd
import os
import re


import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

import xgboost as xgb   # Optional: only needed if we decide to try an XGBoost classifier
from xgboost import XGBClassifier


import matplotlib.pyplot as plt
import seaborn as sea
%matplotlib inline

### Data 
We will use Amazon customer reviews for Alexa products (Echo, Firestick, Echo Dot, etc.) as our dataset. The code below expects the file at:
**../data/amazon_alexa.tsv**

In [None]:
# Sanity check: count the lines in the data file. The count includes the header row,
# so expect one more than the number of reviews.
with open('../data/amazon_alexa.tsv') as f:
    dataset = [line.rstrip() for line in f]
print(len(dataset))

#### Inspect the data with Pandas

Read the dataset into a pandas DataFrame using `pandas.read_csv()`.
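
The cell below passes `quoting=3`, which is the numeric value of `csv.QUOTE_NONE`. As a minimal sketch of why that can matter for free-form review text, the example here parses two made-up rows (not taken from the real file) where one review contains a stray double quote. The names `sample_tsv` and `sample` are hypothetical.

```python
import csv
import io

import pandas as pd

# Two made-up rows for illustration only; the real file has more columns.
sample_tsv = (
    "rating\tverified_reviews\tfeedback\n"
    '5\t"Love it\t1\n'
    "2\tStopped working\t0\n"
)

# With csv.QUOTE_NONE (== 3), quote characters are treated as ordinary text,
# so an unbalanced quote cannot swallow the fields that follow it.
sample = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)

print(sample.shape)
print(sample["verified_reviews"].tolist())
```

The stray quote stays in the review text, where the later regex cleaning step removes it along with other non-letter characters.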

In [None]:
dataset = pd.read_csv('../data/amazon_alexa.tsv', delimiter='\t', quoting=3)
dataset.tail()

In [None]:
# Inspect summary statistics. They show that positive feedback (feedback = 1) is the predominant class
# in this dataset and that most ratings are 5.
dataset.describe()

In [None]:
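# Count reviews per star rating, split by feedback label. We can interpret ratings under 3 as negative.
# Recent seaborn versions require the column to be passed by keyword (x=...).
sea.countplot(x="rating", hue="feedback", data=dataset)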

In [None]:
# Plot the number of reviews per product variation, which serves as a proxy for the most purchased products
plt.figure(figsize=(40,8))
sea.countplot(x="variation", hue="feedback", data=dataset)

In [None]:
dataset.groupby('rating').describe()

#### Preprocessing Text

Here we apply some of the text preprocessing steps covered in the preprocessing notebook and Exercise 1: removing punctuation, removing stopwords, and stemming. We use the relevant nltk and regex methods, with the goal of improving the accuracy of our model.
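
Before running the full loop, it can help to see the cleaning steps on a single sentence. The sketch below uses only the standard library: `TOY_STOPWORDS` is a small hand-picked set standing in for nltk's English stopword list, `clean_review` is a hypothetical helper, and stemming is left out for brevity.

```python
import re

# Hypothetical, hand-picked stopwords; the notebook itself uses nltk's English list.
TOY_STOPWORDS = {"i", "my", "it", "so", "to", "the", "is", "a", "and"}

def clean_review(text):
    """Keep letters only, lowercase, split on whitespace, and drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(token for token in tokens if token not in TOY_STOPWORDS)

print(clean_review("I LOVE my Echo Dot!!! It's so easy to set up."))
# prints: love echo dot s easy set up
```

Note the leftover `s`: replacing the apostrophe in "It's" with a space splits the contraction into two tokens. This is a side effect of the letters-only regex, and the same thing happens in the full loop below.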

In [None]:
# Clean the data: remove stopwords and reduce words to their stems, a practice that often improves sentiment analysis results

# The stopword list must be downloaded once; this does nothing if it is already present
nltk.download('stopwords', quiet=True)

corpus = []

# Build the stopword set and the stemmer once, outside the loop.
# PorterStemmer is one of the stemmers in the nltk library; it removes morphological affixes from words.
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# We are interested in the 'verified_reviews' column of the dataset, which holds the text of each review.
# Missing reviews are treated as empty strings.
for review in dataset['verified_reviews'].fillna(''):

    # Keep letters only, lowercase the text, and split it into words
    data = re.sub('[^a-zA-Z]', ' ', review)
    data = data.lower()
    data = data.split()

    # Stem the words that are not English stopwords as defined in nltk
    data = [stemmer.stem(word) for word in data if word not in stop_words]
    data = ' '.join(data)
    corpus.append(data)

#### Implement vectorization of words (feature extraction) as you learned in Exercise 1 with CountVectorizer class
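
As a small illustration of what `CountVectorizer` produces, the sketch below fits it on a two-document toy corpus (made up for this example) and prints the learned vocabulary next to the count matrix. The names `toy_corpus`, `toy_vectorizer`, `toy_vocab`, and `toy_counts` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned documents for illustration
toy_corpus = ["love echo", "love love alexa"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each term to its column index; sort terms by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)   # column order of the matrix
print(toy_counts)  # one row per document, one column per term
```

Each row is a document and each column counts how often one vocabulary term occurs in it, so the second row has a 2 in the `love` column.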

In [None]:
# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer


# Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
# The 1500 most frequent words become the features for training our classifier.
vectorizer = CountVectorizer(max_features=1500)

# fit_transform() learns the vocabulary from the corpus and converts each text document
# into a numeric vector of word counts.
X = vectorizer.fit_transform(corpus).toarray()
print('Displaying the bag-of-words count matrix:\n ', X)

# The target is the 'feedback' column (the column at index 4)
y = dataset['feedback'].values

#### Split the dataset into train and test sets. Use sklearn train_test_split() function
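
Because positive feedback is the predominant class, one optional variation (not used in the cell below) is to pass `stratify=y` so that both splits keep the same class proportions. A minimal sketch on made-up labels, with hypothetical `toy_` names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 10 negatives (0) and 90 positives (1)
toy_y = np.array([0] * 10 + [1] * 90)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y keeps the 10% / 90% class ratio in both the train and test sets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

print(len(toy_y_test), int(toy_y_test.sum()))    # test size, positives in test
print(len(toy_y_train), int(toy_y_train.sum()))  # train size, positives in train
```

Without stratification, a small test set can end up with very few negative reviews by chance, which makes the per-class metrics noisier.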

In [None]:
# Split the dataset to train and test
from sklearn.model_selection import train_test_split

# test_size=0.2 means that X_test and y_test contain 20% of our data, which we reserve for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#### Initialize a classifier from scikit-learn and train it on the data

In [None]:
from sklearn.ensemble import RandomForestClassifier

sentiment_classifier = RandomForestClassifier(n_estimators = 1000, random_state = 0)

#### Fit and predict the results of your model 
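
The fit/predict pattern can be tried on a tiny made-up count matrix before running it on the full dataset. This is only a sketch: the `toy_` arrays are invented, and the forest is kept small (`n_estimators=10`) to run quickly. It also shows `predict_proba`, which returns per-class probabilities instead of hard labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up word-count features (2 columns) and labels (1 = positive, 0 = negative)
toy_X_train = np.array([[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]])
toy_y_train = np.array([1, 1, 1, 0, 0, 0])

toy_clf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_clf.fit(toy_X_train, toy_y_train)

toy_X_new = np.array([[2, 0], [0, 2]])
toy_pred = toy_clf.predict(toy_X_new)         # hard class labels
toy_proba = toy_clf.predict_proba(toy_X_new)  # one probability column per class

print(toy_pred)
print(toy_proba)
```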

In [None]:
'''
   In scikit-learn, a classification estimator is a Python object that implements the methods fit(X, y) and predict(T).
   Train the classifier: once the model is initialized, we fit it to our specific dataset with scikit-learn's fit() method.
   This is where our machine learning classifier actually learns the relationship between the features and the labels.
'''
sentiment_classifier.fit(X_train, y_train)

In [None]:
# To predict the sentiment for the documents in our test set we can use the predict method of 
# the RandomForestClassifier class

y_pred = sentiment_classifier.predict(X_test)

#### Create a confusion matrix (True/False Negatives/Positives)
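
To read the matrix correctly it helps to know its layout: in scikit-learn, rows are the true labels and columns are the predicted labels. The sketch below uses made-up labels (`toy_y_true` and `toy_y_hat` are hypothetical names) and unpacks the four cells of a binary matrix.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (0 = negative, 1 = positive)
toy_y_true = [0, 0, 1, 1, 1, 1]
toy_y_hat = [0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels
toy_cm = confusion_matrix(toy_y_true, toy_y_hat, labels=[0, 1])

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = toy_cm.ravel()

print(toy_cm)
print(tn, fp, fn, tp)  # prints: 1 1 1 3
```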

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

#### Evaluate your model's results using scikit-learn evaluation metrics
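
With an imbalanced dataset, accuracy alone can hide poor performance on the minority class. The worked arithmetic below uses made-up confusion matrix counts (they are not results from this notebook) to show how accuracy, precision, and recall are derived, and why per-class recall is worth checking in the classification report.

```python
# Made-up counts for illustration: 50 true negatives-class reviews, 580 positive-class reviews
tn, fp, fn, tp = 5, 45, 5, 575

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision_pos = tp / (tp + fp)              # of the predicted positives, how many are right
recall_pos = tp / (tp + fn)                 # of the actual positives, how many were found
recall_neg = tn / (tn + fp)                 # of the actual negatives, how many were found

print(f"accuracy:           {accuracy:.3f}")
print(f"precision (pos):    {precision_pos:.3f}")
print(f"recall (pos):       {recall_pos:.3f}")
print(f"recall (neg):       {recall_neg:.3f}")
```

In this hypothetical case accuracy is above 0.92 even though only 10% of the negative reviews are detected.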

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

#### Congratulations! You have implemented your sentiment analysis model using the Natural Language Toolkit (NLTK) and scikit-learn libraries!

+ For more practice, try a different machine learning algorithm to see if you can improve the performance. You can also change the parameters of the `CountVectorizer` class to see if you get any improvement, or use a different vectorizer, such as the `TfidfVectorizer` class from the scikit-learn library.
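
As a starting point for the vectorizer experiment, here is a minimal sketch of `TfidfVectorizer` on a made-up three-document corpus (the `toy_` names are hypothetical). In the notebook, the same class could be fit on `corpus` in place of `CountVectorizer`; whether it improves the results is something to test, not a given.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned documents for illustration
toy_docs = ["love echo", "love love alexa", "echo stop work"]

# Same max_features cap as the CountVectorizer used earlier
toy_tfidf_vectorizer = TfidfVectorizer(max_features=1500)
toy_tfidf = toy_tfidf_vectorizer.fit_transform(toy_docs).toarray()

print(toy_tfidf.shape)                      # (documents, vocabulary size)
print(np.linalg.norm(toy_tfidf, axis=1))    # rows are L2-normalized by default
```

Unlike raw counts, TF-IDF weights down terms that appear in many documents, and each row is scaled to unit length by default.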