## Sentiment Analysis on Amazon Alexa Reviews  

In this exercise we will explore customer sentiment toward Amazon Alexa products. The dataset is taken from [Kaggle](https://www.kaggle.com/sid321axn/amazon-alexa-reviews) and consists of 3151 Amazon customer reviews, each with a star rating, review date, product variant, and feedback label, for products such as the Alexa Echo, Echo Dot, and Firestick.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import os
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

'''
VADER (Valence Aware Dictionary and Sentiment Reasoner) is a lexicon and rule-based sentiment analysis 
tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

'''
nltk.download('vader_lexicon')
nltk.download('punkt')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sea
%matplotlib inline

#### Inspect the data with Pandas

Read the dataset into a Pandas dataframe using `pandas.read_csv()`

In [None]:
# Your code here
dataset = pd.read_csv('../data/amazon_alexa.tsv', delimiter='\t', quoting=3)

#### (Optional) Inspect the data with Pandas methods such as `describe()`, `info()`, and `groupby()` to get an overall idea of the data statistics

In [None]:
# Inspect some summary statistics. They show that positive feedback (1) is the
# predominant class in this dataset and that most ratings are 5.
dataset.describe()

In [None]:
# Your code here (optional)
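
As one possible way to approach the optional inspection, here is a minimal sketch on a small made-up dataframe. The `toy` dataframe and its values are hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, using the same column names
toy = pd.DataFrame({
    "rating": [5, 5, 4, 1, 2, 5],
    "variation": ["Black Dot", "Black Dot", "Charcoal Fabric",
                  "Black Dot", "Charcoal Fabric", "Charcoal Fabric"],
    "verified_reviews": ["Love it", "Great", "Good sound",
                         "Stopped working", "Not great", "Perfect"],
    "feedback": [1, 1, 1, 0, 0, 1],
})

# Column types and non-null counts
toy.info()

# Number of reviews in each feedback class
feedback_counts = toy["feedback"].value_counts()
print(feedback_counts)

# Average rating per product variation
mean_rating = toy.groupby("variation")["rating"].mean()
print(mean_rating)
```

Running the same calls on `dataset` instead of `toy` should give a quick picture of class balance and of how ratings differ across variations.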

#### Plotting data

In [None]:
#TODO: Run this cell

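# Ratings under 3 can be interpreted as negative feedback.
# Column names are passed as keywords; recent seaborn versions reject them as positional arguments.
sea.countplot(x="rating", hue="feedback", data=dataset)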

In [None]:
#TODO: Run this cell

# Plot the number of reviews per product variation, split by feedback
plt.figure(figsize=(40,8))
sea.countplot(x="variation", hue="feedback", data=dataset)

#### Inspecting data with nltk VADER

+ VADER, available through the nltk library, calculates negative, neutral, positive, and compound scores for a piece of text. nltk also provides tokenizers that split text into sentences or words. Inspect the code below.

In [None]:
#TODO: Run this cell

# Initializing VADER
sentiment_analyzer = SentimentIntensityAnalyzer()

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Join the verified_reviews column into a single string and split it into sentences
sentences = dataset['verified_reviews'].astype(str)
sentences = ' '.join(sentences)
sent = tokenizer.tokenize(sentences)


'''
Next, we will find all sentences in our text file that include a specific keyword and designate 
these sentences as a list. In this example, we choose "love" as our keyword and designate our list 
of all sentences that include this word "love_list".  

'''
# Each ".*" matches any run of characters before and after the word "love".
# Note that the pattern requires a lowercase "love" with a space on each side.
r = re.compile(".* love .*")

# love_list now contains all the sentences that have the word 'love' in them
love_list = list(filter(r.match, sent))

print(len(love_list))
print(love_list[:10])


# We will now run the sentiment analysis on the sentences in love_list.
# Running this loop will allow us to see the calculated compound, negative, neutral,
# and positive scores for each sentence that contains the word "love" in our text source
for sentence in love_list:
    print(sentence)
    scores = sentiment_analyzer.polarity_scores(sentence)
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')
    print()
    
# Future practice: change the keyword and inspect the results
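
The keyword filter above only matches a lowercase "love" with a space on both sides, so sentences such as "Love my Echo!" or "I love it." can be missed. The following is a minimal sketch of a more tolerant filter using a word-boundary, case-insensitive pattern. The helper name `find_sentences` and the sample sentences are hypothetical and are not part of the original exercise.

```python
import re

def find_sentences(sentences, keyword):
    """Return the sentences that contain `keyword` as a whole word, ignoring case."""
    # re.escape guards against keywords that contain regex metacharacters
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

sample = [
    "Love my Echo!",
    "I love it.",
    "My kids loved the games.",
    "Sound quality is poor.",
]
matches = find_sentences(sample, "love")
print(matches)
```

Because of the word boundaries, "loved" is not counted as a match for "love"; dropping the trailing `\b` would be one way to include such inflected forms.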

#### Preprocessing Text

Here we apply some of the text preprocessing steps covered in the Preprocessing Data notebook, using the relevant nltk and regex methods. These steps get the data ready for training and can improve the accuracy of the model's results:

+ Lowercase
+ Tokenization
+ Stemming 
+ Remove stopwords
+ Remove punctuation

In [None]:
# TODO: Your code here
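
One way the steps listed above could be combined is sketched below. It is not the reference solution. The function name `clean_review` and the tiny `STOPWORDS` set are assumptions made for illustration; the full English list is available through `stopwords.words('english')` once `nltk.download('stopwords')` has been run.

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny illustrative stopword set (an assumption). For the full list, run
# nltk.download('stopwords') and use set(stopwords.words('english')).
STOPWORDS = {"i", "it", "the", "a", "an", "is", "and", "my", "this", "to"}

stemmer = PorterStemmer()

def clean_review(text):
    # Replace every non-letter character (punctuation, digits) with a space
    text = re.sub(r"[^a-zA-Z]", " ", str(text))
    # Lowercase and tokenize on whitespace
    tokens = text.lower().split()
    # Drop stopwords, then stem what remains
    stems = [stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS]
    return " ".join(stems)

print(clean_review("I LOVE this speaker, it is AMAZING!!!"))
```

Applied to the notebook's data, this could look like `dataset['verified_reviews'].apply(clean_review)`, which returns one cleaned string per review.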

#### Sentiment Classification

Now that we have a clean, preprocessed dataset, we are ready to build our sentiment analysis model using one of the classifiers in the scikit-learn library.

#### Implement vectorization of words (feature extraction) with the `CountVectorizer` class, as you learned in Exercise 1

In [None]:
# Your code here
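
As a rough illustration of bag-of-words feature extraction, the sketch below vectorizes three made-up cleaned reviews. The `corpus` contents and the `max_features` cap are assumptions, not values taken from the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned reviews standing in for the preprocessed column
corpus = ["love echo", "love sound", "stop work"]

# max_features caps the vocabulary size; 1500 is an arbitrary choice here
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(corpus).toarray()

# Columns of X follow the alphabetical order of the learned vocabulary
print(sorted(vectorizer.vocabulary_))
print(X)
```

Each row of `X` is one review and each column counts how often one vocabulary word appears in it.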

#### Split the dataset into train and test sets using the scikit-learn `train_test_split()` function

In [None]:
# Your code here
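
A minimal sketch of the split on made-up data follows. The array contents, the 25% test size, and the use of `stratify` are assumptions; stratifying is one reasonable option when one class dominates, as positive feedback does here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (20 reviews, 2 features) and imbalanced labels
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 16 + [0] * 4)

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.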

#### Initialize a classifier from scikit-learn library and train it on the data 

In [None]:
# You can choose another classifier from scikit-learn library if you prefer doing so; 
# however, it might require some modifications to your code so far. -- Great practice for later!

# Your code here


#### Fit and predict the results of your model using the scikit-learn `fit()` and `predict()` methods

In [None]:
# Your code here
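
The sketch below shows the general fit and predict pattern on a few made-up reviews. `MultinomialNB` is an assumption chosen for illustration, since the exercise leaves the choice of classifier open, and the training texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical cleaned reviews and feedback labels (1 = positive, 0 = negative)
train_text = ["love echo", "great sound", "love great speaker",
              "stop work", "bad sound", "return broken"]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# New text must go through the same fitted vectorizer (transform, not fit_transform)
X_new = vectorizer.transform(["love speaker", "broken return"])
y_pred = classifier.predict(X_new)
print(y_pred)
```

In the exercise itself, `fit()` would be called on the training split and `predict()` on the test split produced by `train_test_split()`.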

#### Create a confusion matrix using `scikit-learn`

In [None]:
from sklearn.metrics import confusion_matrix

# Your code here
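
To make the layout of the matrix concrete, here is a small sketch on hand-written labels. The lists `y_true` and `y_hat` are hypothetical. For binary labels 0 and 1, scikit-learn puts true classes in rows and predicted classes in columns, so `ravel()` yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# Row = true class, column = predicted class
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```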

#### Evaluate your results using scikit-learn evaluation metrics

+ `scikit-learn` library provides a number of metrics to evaluate the accuracy of models. `accuracy_score()` and `classification_report()` are two of those that you might find useful. 

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
# Your code here
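
A minimal sketch of both metrics on hand-written labels is shown here. The label lists and the `target_names` are hypothetical and stand in for `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 0, 1, 0, 1, 1, 0]

# Fraction of labels predicted correctly
acc = accuracy_score(y_true, y_hat)
print("Accuracy:", acc)

# Per-class precision, recall, and F1; names follow sorted label order (0, then 1)
print(classification_report(y_true, y_hat, target_names=["negative", "positive"]))
```

Since positive feedback is the predominant class in this dataset, accuracy alone can look good even for a model that rarely predicts the negative class, so the per-class recall in the report is worth checking.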