## Plan of Action


1.   We are using **Amazon Alexa Reviews dataset (3150 reviews)**, that contains: **customer reviews, rating out of 5**, date of review, Alexa variant 
2.   First we  **generate sentiment labels: positive/negative**, by marking *positive for reviews with rating >3 and negative for remaining*
3. Then, we **clean dataset through Vectorization Feature Engineering** (TF-IDF) - a popular technique
4. Post that, we use **Support Vector Classifier for Model Fitting** and check for model performance (*we are getting >90% accuracy*)
5. Last, we use our model to do **predictions on real Amazon reviews** using: a simple way and then a fancy way



## Import datasets

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('Dataset-SA.csv') 
df


Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive
3,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,1,useless product,very bad product its a only a fan,negative
4,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,ok ok product,neutral
...,...,...,...,...,...,...
205044,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,must buy!,good product,positive
205045,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,super!,nice,positive
205046,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,3,nice,very nice and fast delivery,positive
205047,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,just wow!,awesome product,positive


## Data Pre-Processing

In [3]:
dataset = df[['Review','Rate']]
dataset.columns = ['Review', 'Sentiment']

dataset

Unnamed: 0,Review,Sentiment
0,super!,5
1,awesome,5
2,fair,3
3,useless product,1
4,fair,3
...,...,...
205044,must buy!,5
205045,super!,5
205046,nice,3
205047,just wow!,5


In [4]:
# Define a function to compute sentiments based on numeric labels
def compute_sentiments(labels):
    sentiments = []
    for label in labels:
        if label > 3.0:
            sentiment = 1
        elif label <= 3.0:
            sentiment = 0
        sentiments.append(sentiment)
    return sentiments

# Apply the compute_sentiments function to the 'Sentiment' column
dataset['Sentiment'] = compute_sentiments(dataset['Sentiment'])
dataset

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Sentiment'] = compute_sentiments(dataset['Sentiment'])


Unnamed: 0,Review,Sentiment
0,super!,1
1,awesome,1
2,fair,0
3,useless product,0
4,fair,0
...,...,...
205044,must buy!,1
205045,super!,1
205046,nice,0
205047,just wow!,1


In [5]:
#dataset['Sentiment'] = compute_sentiments(dataset.Sentiment)
#dataset

In [6]:
dataset.head()

Unnamed: 0,Review,Sentiment
0,super!,1
1,awesome,1
2,fair,0
3,useless product,0
4,fair,0


In [7]:
# check distribution of sentiments
dataset['Sentiment'].value_counts()

Sentiment
1    160659
0     44390
Name: count, dtype: int64

In [8]:
# check for null values
dataset.isnull().sum()

# no null values in the data

Review       24664
Sentiment        0
dtype: int64

In [9]:
dataset['Review'].fillna(' ', inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset['Review'].fillna(' ', inplace=True)


### Data Transformation

In [10]:
x = dataset['Review']
y = dataset['Sentiment']


In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')

import string
punct = string.punctuation
# punct

from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS) # list of stopwords

class CustomTokenizerExample():
    def __init__(self):
        pass

    def text_data_cleaning(self,sentence):
        doc = nlp(sentence)                         # spaCy tokenize text & call doc components, in order

        tokens = [] # list of tokens
        for token in doc:
            if token.lemma_ != "-PRON-":
                temp = token.lemma_.lower().strip()
            else:
              temp = token.lower_
            tokens.append(temp)

        cleaned_tokens = []
        for token in tokens:
            if token not in stopwords and token not in punct:
                cleaned_tokens.append(token)
        return cleaned_tokens

In [12]:
stopwords

['first',
 'bottom',
 'must',
 'therefore',
 'ever',
 'since',
 'off',
 'them',
 'ours',
 'do',
 'per',
 'in',
 'almost',
 'themselves',
 'for',
 'that',
 'but',
 'twelve',
 'somehow',
 'eleven',
 'seeming',
 'whereby',
 '’re',
 'nothing',
 'every',
 'just',
 'last',
 'they',
 'hereby',
 'several',
 'their',
 'elsewhere',
 'too',
 'whence',
 'against',
 'no',
 'amongst',
 'again',
 'even',
 'each',
 'though',
 'anyway',
 'part',
 'unless',
 'some',
 'due',
 'himself',
 'mostly',
 'where',
 'me',
 'his',
 'ourselves',
 'all',
 'give',
 'see',
 'done',
 'between',
 'everything',
 'indeed',
 'few',
 'wherever',
 'many',
 'am',
 'my',
 'although',
 'enough',
 'further',
 'nor',
 'anyone',
 'various',
 "'s",
 'became',
 'such',
 'herself',
 'whoever',
 'take',
 'your',
 'side',
 'becoming',
 'down',
 'toward',
 'had',
 'was',
 'ca',
 'than',
 'whatever',
 'beside',
 'the',
 'we',
 'she',
 'throughout',
 'have',
 'very',
 'been',
 'were',
 'whereas',
 'made',
 'go',
 'often',
 'become',
 'th

In [13]:
# if root form of that word is not proper noun then it is going to convert that into lower form
# and if that word is a proper noun, then we are directly taking lower form,
# because there is no lemma for proper noun

# stopwords and punctuations removed

In [14]:
# let's do a test
token = CustomTokenizerExample()
token.text_data_cleaning("Those were the best days of my life!")

['good', 'day', 'life']

### Feature Engineering (TF-IDF)

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
tfidf = TfidfVectorizer()
# tokenizer=text_data_cleaning, tokenization will be done according to this function

In [17]:
tfidf

## Train the model

In [18]:
# Train_Test_Split

In [19]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.8, stratify = dataset.Sentiment, random_state = 0)

In [20]:
x_train.shape, x_test.shape


((41009,), (164040,))

In [21]:
print(x.shape, x_train.shape, x_test.shape,y_train.shape, y_test.shape)

(205049,) (41009,) (164040,) (41009,) (164040,)


In [22]:
x_train

127870             excellent
204911    highly recommended
66452      terrific purchase
171850             very good
59308                       
                 ...        
626                wonderful
86183              hated it!
137832                  nice
121919    highly recommended
101630        classy product
Name: Review, Length: 41009, dtype: object

In [23]:
text = []

for i in x_train:
    for j in i:
        cleaned_text = token.text_data_cleaning(j)
        text.append(cleaned_text)

text


[['e'],
 ['x'],
 ['c'],
 ['e'],
 ['l'],
 ['l'],
 ['e'],
 ['n'],
 ['t'],
 ['h'],
 [],
 ['g'],
 ['h'],
 ['l'],
 ['y'],
 [],
 ['r'],
 ['e'],
 ['c'],
 ['o'],
 ['m'],
 ['m'],
 ['e'],
 ['n'],
 ['d'],
 ['e'],
 ['d'],
 ['t'],
 ['e'],
 ['r'],
 ['r'],
 [],
 ['f'],
 [],
 ['c'],
 [],
 ['p'],
 ['u'],
 ['r'],
 ['c'],
 ['h'],
 [],
 ['s'],
 ['e'],
 ['v'],
 ['e'],
 ['r'],
 ['y'],
 [],
 ['g'],
 ['o'],
 ['o'],
 ['d'],
 [],
 ['s'],
 ['u'],
 ['p'],
 ['e'],
 ['r'],
 [],
 [],
 ['w'],
 ['e'],
 ['s'],
 ['o'],
 ['m'],
 ['e'],
 [],
 ['n'],
 ['o'],
 ['t'],
 [],
 ['s'],
 ['p'],
 ['e'],
 ['c'],
 [],
 ['f'],
 [],
 ['e'],
 ['d'],
 ['w'],
 ['o'],
 ['n'],
 ['d'],
 ['e'],
 ['r'],
 ['f'],
 ['u'],
 ['l'],
 ['r'],
 ['e'],
 ['c'],
 ['o'],
 ['m'],
 ['m'],
 ['e'],
 ['n'],
 ['d'],
 ['e'],
 ['d'],
 [],
 ['p'],
 ['r'],
 ['o'],
 ['d'],
 ['u'],
 ['c'],
 ['t'],
 ['s'],
 [],
 ['m'],
 ['p'],
 ['l'],
 ['y'],
 [],
 [],
 ['w'],
 ['e'],
 ['s'],
 ['o'],
 ['m'],
 ['e'],
 ['n'],
 ['o'],
 ['t'],
 [],
 ['s'],
 ['p'],
 ['e'],
 ['c'],
 [],
 ['f

In [24]:
tfidf_matrix = tfidf.fit_transform(text)

AttributeError: 'list' object has no attribute 'lower'

In [25]:
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer(tokenizer=token.text_data_cleaning)
x_train_tfidf = tfidf.fit_transform(x_train)

# Create and fit a logistic regression model
logistic_classifier = LogisticRegression()
logistic_classifier.fit(x_train, y_train)

# Make predictions
x_test_tfidf = tfidf.transform(x_test)
predictions = logistic_classifier.predict(x_test_tfidf)




ValueError: could not convert string to float: 'excellent'

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# confusion_matrix
confusion_matrix(y_test, predictions)

array([[ 5963,  2915],
       [   31, 32101]], dtype=int64)

In [None]:
# classification_report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.99      0.67      0.80      8878
           1       0.92      1.00      0.96     32132

    accuracy                           0.93     41010
   macro avg       0.96      0.84      0.88     41010
weighted avg       0.93      0.93      0.92     41010



# Predict Sentiments using Model

In [None]:
review = 'bad review'

# Transform the review using the same TF-IDF vectorizer
review_tfidf = tfidf.transform([review])

# Make a prediction for the review
prediction = logistic_classifier.predict(review_tfidf)

# Print the prediction
print("Predicted class:", prediction[0])

Predicted class: 0


In [None]:
import pickle

# Save the trained model and vectorizer to a pickle file

with open('sentiment_model.pkl', 'wb') as model_file:
    pickle.dump(logistic_classifier, model_file)

with open('tfidf_vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(tfidf, vectorizer_file)

In [None]:
import pickle

# Save the trained model and vectorizer to a pickle file

with open('sentiment_model.pkl', 'wb') as model_file:
    pickle.dump(logistic_classifier, model_file)

with open('tfidf_vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(tfidf, vectorizer_file)

with open('tokenizer.pkl', 'wb') as tokenizer_file:
    pickle.dump(token, tokenizer_file)

# Load the model and vectorizer from the pickle files
with open('sentiment_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)

with open('tfidf_vectorizer.pkl', 'rb') as vectorizer_file:
    loaded_vectorizer = pickle.load(vectorizer_file)

with open('model/tokenizer.pkl', 'rb') as tokenizer_file:
    oaded_tokenizer = pickle.load(tokenizer_file)


# Example review for prediction
example_review = 'this is my bad review'

# Transform the example review using the loaded vectorizer
example_review_tfidf = loaded_vectorizer.transform([example_review])

# Make predictions
prediction = loaded_model.predict(example_review_tfidf)

# Display the prediction
if prediction[0] == 1:
    print('Positive sentiment')
else:
    print('Negative sentiment')


