# Alexa Customer Reviews Case Study

In this project, I use NLP methods to classify the sentiment of product reviews as positive or negative.

Project Overview:
- Used NLP text processing techniques such as lemmatization and TF-IDF to prepare the data for training
- Used vectorization (bag-of-words counts) to convert the text into numerical features for the machine learning model
- Performed classification (positive vs. negative) using the Random Forest algorithm.

=======================================================================================================================

Context: the dataset contains over 3,000 customer reviews of Amazon Alexa products.

Attribute Information:
1. rating - 1 to 5 stars
2. date - date of purchase
3. variation - color variation
4. verified_reviews - customers' reviews
5. feedback - 0: Negative feedback, 1: Positive feedback

Data Source: https://www.kaggle.com/sid321axn/amazon-alexa-reviews

In [4]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
import re

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier

# the English stop-word list is needed by the vectorizer later on
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# import dataset
df = pd.read_csv('amazon_alexa.tsv', sep = '\t')
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [7]:
# we will extract x and y
X = df['verified_reviews']
y = df['feedback']

In [8]:
# preprocessing

from nltk.stem import WordNetLemmatizer

documents = []

# WordNetLemmatizer maps words to their dictionary form (lemmatization, not stemming)
lemmatizer = WordNetLemmatizer()

for review in X:
    # Replace every non-word character with a space (str() also covers missing values)
    document = re.sub(r'\W', ' ', str(review))

    # Remove single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start (an unescaped ^ anchors to the start of the string)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove a leading 'b' left over from a bytes literal, if any
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize each token and rejoin the tokens into one string
    tokens = [lemmatizer.lemmatize(word) for word in document.split()]
    documents.append(' '.join(tokens))
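
To make the effect of the regular-expression steps concrete, here is a minimal sketch that applies them to one made-up review, with the start-of-string pattern written as an unescaped `^` anchor. It skips lemmatization so that it runs without any NLTK data. The helper name `clean_review` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def clean_review(text: str) -> str:
    """Apply the regex cleaning steps to one review (lemmatization omitted)."""
    text = re.sub(r'\W', ' ', text)               # non-word characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # single letters between spaces
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # single letter at the very start
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    text = re.sub(r'^b\s+', '', text)             # stray leading 'b'
    # lower-case, then split/join to drop leading and trailing spaces
    return ' '.join(text.lower().split())

sample = "I love it! It's great."   # hypothetical review, for illustration only
cleaned = clean_review(sample)
print(cleaned)   # prints: love it it great
```

The apostrophe in "It's" becomes a space, which leaves a lone "s" that the single-character rule then removes, and the leading "I" is dropped by the start-of-string rule.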

In [9]:
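# convert the cleaned text into token-count features using CountVectorizer
# max_features=1500: keep the 1500 most frequent terms
# min_df=5: drop terms that appear in fewer than 5 reviews
# max_df=0.7: drop terms that appear in more than 70% of the reviews
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()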
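
The `min_df` and `max_df` filters are easier to follow on a tiny example. The sketch below uses a made-up four-document corpus (`toy_docs` is an assumption for illustration, not data from the Alexa file) and smaller thresholds than the notebook, so the pruning is visible by eye.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "echo great alarm",
    "echo love",
    "echo music great",
    "love music",
]

# min_df=2 drops terms found in fewer than 2 documents ("alarm");
# max_df=0.7 drops terms found in more than 70% of documents ("echo", 3 of 4)
vec = CountVectorizer(min_df=2, max_df=0.7)
counts = vec.fit_transform(toy_docs)

kept_terms = sorted(vec.vocabulary_)
print(kept_terms)      # prints: ['great', 'love', 'music']
print(counts.shape)    # prints: (4, 3)
```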

In [10]:
# reweight the raw counts with TF-IDF so that terms common to many reviews count for less

from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
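
The two steps above (counting, then TF-IDF reweighting) can also be done in one step: scikit-learn documents `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks this on a small made-up corpus with default settings. The corpus is an assumption for illustration, and the notebook itself uses the two-step form.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, for illustration only
toy_docs = ["love my echo", "loved it", "music sounds great", "echo stopped working"]

# Two-step route, as in the notebook
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# One-step route
one_step = TfidfVectorizer().fit_transform(toy_docs).toarray()

same = np.allclose(two_step, one_step)
print(same)   # prints: True
```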

In [11]:
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
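
The evaluation further down shows far fewer negative than positive reviews in the test set (54 vs. 576). One optional variation, which the notebook does not use, is to pass `stratify=y` so that both splits keep the same class ratio. The sketch below demonstrates the parameter on made-up labels. `toy_X` and `toy_y` are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 10 negative (0) and 90 positive (1) labels
toy_X = list(range(100))
toy_y = [0] * 10 + [1] * 90

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# With stratification the 10% negative share is preserved in the test split
n_negative_test = sum(1 for label in y_te if label == 0)
print(len(y_te), n_negative_test)   # prints: 20 2
```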

In [12]:
# train using Random Forest Classifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [13]:
y_pred = classifier.predict(X_test)

In [14]:
# evaluate model

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('Accuracy score:', accuracy_score(y_test, y_pred))

[[ 15  39]
 [  1 575]]
              precision    recall  f1-score   support

           0       0.94      0.28      0.43        54
           1       0.94      1.00      0.97       576

    accuracy                           0.94       630
   macro avg       0.94      0.64      0.70       630
weighted avg       0.94      0.94      0.92       630

Accuracy score: 0.9365079365079365
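
The headline accuracy hides how differently the two classes are handled. As a worked check, the sketch below recomputes the key figures directly from the confusion matrix printed above (rows are true classes, columns are predicted classes) and compares accuracy with the score obtained by always predicting the positive class. The variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = [[15, 39],
      [1, 575]]

tn, fp = cm[0]   # true negatives, negatives predicted as positive
fn, tp = cm[1]   # positives predicted as negative, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall_negative = tn / (tn + fp)       # share of negative reviews that were caught
precision_negative = tn / (tn + fn)    # share of "negative" predictions that were right
majority_baseline = (fn + tp) / total  # accuracy of always predicting class 1

print(round(accuracy, 4))            # prints: 0.9365
print(round(recall_negative, 4))     # prints: 0.2778
print(round(precision_negative, 4))  # prints: 0.9375
print(round(majority_baseline, 4))   # prints: 0.9143
```

The model detects only 15 of the 54 negative reviews, and its accuracy is about two percentage points above the always-positive baseline, so the per-class recall is the more informative number here.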
