## Plan of Action


1.   We are using the product reviews in `Dataset-SA.csv` (205,049 reviews), which contains: product name, product price, **rating out of 5**, **customer review**, summary, and a sentiment label
2.   First we **generate sentiment labels: positive/negative**, by marking *positive (1) for reviews with rating > 3 and negative (0) for the remaining ones*
3. Then we **turn the review text into numeric features through TF-IDF vectorization**, a popular technique, using a custom spaCy tokenizer
4. After that, we **fit a Logistic Regression classifier** and check model performance with a confusion matrix and a classification report
5. Last, we use our model to **predict the sentiment of a new review**



## Import datasets

In [1]:
import numpy as np
import pandas as pd

In [2]:
#Loading the dataset
df = pd.read_csv('Dataset-SA.csv') 
df

Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive
3,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,1,useless product,very bad product its a only a fan,negative
4,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,ok ok product,neutral
...,...,...,...,...,...,...
205044,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,must buy!,good product,positive
205045,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,super!,nice,positive
205046,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,3,nice,very nice and fast delivery,positive
205047,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,just wow!,awesome product,positive


## Data Pre-Processing

In [3]:
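# Keep only the review text and its rating; .copy() avoids SettingWithCopyWarning on later assignments
dataset = df[['Review','Rate']].copy()
dataset.columns = ['Review', 'Sentiment']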

dataset

Unnamed: 0,Review,Sentiment
0,super!,5
1,awesome,5
2,fair,3
3,useless product,1
4,fair,3
...,...,...
205044,must buy!,5
205045,super!,5
205046,nice,3
205047,just wow!,5


In [4]:
# Map numeric ratings to binary sentiment labels:
# 1 (positive) for ratings above 3, 0 (negative) otherwise
def compute_sentiments(labels):
    return [1 if label > 3.0 else 0 for label in labels]

# The 'Sentiment' column still holds the raw ratings here; replace them with the binary labels
dataset['Sentiment'] = compute_sentiments(dataset['Sentiment'])
dataset


Unnamed: 0,Review,Sentiment
0,super!,1
1,awesome,1
2,fair,0
3,useless product,0
4,fair,0
...,...,...
205044,must buy!,1
205045,super!,1
205046,nice,0
205047,just wow!,1


In [6]:
dataset.head()

Unnamed: 0,Review,Sentiment
0,super!,1
1,awesome,1
2,fair,0
3,useless product,0
4,fair,0


In [7]:
# check distribution of sentiments
dataset['Sentiment'].value_counts()

Sentiment
1    160659
0     44390
Name: count, dtype: int64
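
The classes are imbalanced, so it helps to know what accuracy a model would reach by always predicting the majority class. The sketch below is a small side calculation using only the counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 160659, 0: 44390}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"total reviews: {total}")
print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.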

In [8]:
# check for null values
dataset.isnull().sum()

# the Review column has 24,664 missing values; Sentiment has none

Review       24664
Sentiment        0
dtype: int64

In [9]:
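# Replace missing reviews with a blank string; assigning the result back avoids chained in-place modification
dataset['Review'] = dataset['Review'].fillna(' ')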



### Data Transformation

In [10]:
x = dataset['Review']
y = dataset['Sentiment']


In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')

import string
punct = string.punctuation
# punct

from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS # set of stopwords (set membership tests are faster than list lookups)

class CustomTokenizerExample():
    def __init__(self):
        pass

    def text_data_cleaning(self,sentence):
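        doc = nlp(sentence)                         # run the spaCy pipeline: tokenize, then apply each component in order

        tokens = [] # list of tokens
        for token in doc:
            # spaCy v2 lemmatizes every pronoun to the placeholder "-PRON-";
            # in that case keep the lowercased original word instead
            if token.lemma_ != "-PRON-":
                temp = token.lemma_.lower().strip()
            else:
                temp = token.lower_
            tokens.append(temp)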

        cleaned_tokens = []
        for token in tokens:
            if token not in stopwords and token not in punct:
                cleaned_tokens.append(token)
        return cleaned_tokens

In [12]:
# if the lemma of a word is not the pronoun placeholder "-PRON-", we take the lowercased lemma
# if the lemma is "-PRON-" (how spaCy v2 lemmatizes pronouns), we take the lowercased word itself,
# because the placeholder says nothing about which pronoun it was

# stopwords and punctuation are then removed

In [13]:
# let's do a test
token = CustomTokenizerExample()
token.text_data_cleaning("Those were the best days of my life!")

['good', 'day', 'life']
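
One detail of the filter above is worth seeing in isolation: `punct` is a string, so `token not in punct` is a substring test rather than a per-character test. The sketch below shows the consequence with the standard library only. The helper `is_punctuation` is a hypothetical alternative, not something the notebook defines.

```python
import string

punct = string.punctuation

# `in` on a string checks for a substring.
print("" in punct)    # the empty string is a substring of every string
print(",-" in punct)  # these two characters are adjacent in string.punctuation
print("!?" in punct)  # these are not adjacent, so the token would not be filtered

# Hypothetical alternative: a token counts as punctuation only if it is
# non-empty and every one of its characters is a punctuation character.
punct_chars = set(punct)

def is_punctuation(token):
    return bool(token) and all(ch in punct_chars for ch in token)

print(is_punctuation("!?"), is_punctuation("good"))
```

With a helper like this, empty tokens (for example whitespace that was stripped) would need their own check, since the substring test currently removes them as a side effect.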

### Feature Engineering (TF-IDF)

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
tfidf = TfidfVectorizer(tokenizer=token.text_data_cleaning)
# tokenizer=token.text_data_cleaning: tokenization is done by this method instead of the default tokenizer

In [16]:
tfidf
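
To make the weighting concrete, here is a minimal pure-Python sketch of how a TF-IDF vector can be computed. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which match scikit-learn's documented defaults. The toy corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized reviews (made up for illustration).
docs = [
    ["good", "product"],
    ["bad", "product"],
    ["good", "good", "cooler"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in docs:
    print(doc, tfidf_vector(doc))
```

In the second review, "bad" gets a higher weight than "product" because it appears in fewer documents.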

## Train the model

In [17]:
# Train_Test_Split

In [18]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = dataset.Sentiment, random_state = 0)

In [19]:
x_train.shape, x_test.shape
# 164,039 samples in the training set and 41,010 in the test set

((164039,), (41010,))

In [20]:
print(x.shape, x_train.shape, x_test.shape,y_train.shape, y_test.shape)

(205049,) (164039,) (41010,) (164039,) (41010,)
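
The `stratify` argument keeps the share of positive and negative reviews the same in both splits. The sketch below illustrates the idea in plain Python on toy labels. `stratified_split` and `positive_share` are hypothetical helpers written for this example and are not how scikit-learn implements it.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    # Split each class separately, then combine the pieces.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with roughly the same imbalance as this dataset (about 78% positive).
labels = [1] * 780 + [0] * 220
train_idx, test_idx = stratified_split(labels)

def positive_share(indices):
    return sum(labels[i] for i in indices) / len(indices)

print(len(train_idx), len(test_idx))
print(positive_share(train_idx), positive_share(test_idx))
```

On the real data, the equivalent check would be comparing `y_train.mean()` with `y_test.mean()`.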


In [None]:
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer(tokenizer=token.text_data_cleaning)

# Create and fit a logistic regression model
logistic_classifier = LogisticRegression()
x_train_tfidf = tfidf.fit_transform(x_train)
logistic_classifier.fit(x_train_tfidf, y_train)

# Make predictions on the test set (y_pred is the name the evaluation cells use)
x_test_tfidf = tfidf.transform(x_test)
y_pred = logistic_classifier.predict(x_test_tfidf)


In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
# classification_report
print(classification_report(y_test, y_pred))
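
As a reading aid for these two outputs, the sketch below derives accuracy, precision, recall and F1 for the positive class from the four cells of a binary confusion matrix. The counts are invented for illustration and are not results from this notebook. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive standard positive-class metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
# scikit-learn lays out a binary confusion matrix as [[tn, fp], [fn, tp]].
metrics = binary_metrics(tn=30, fp=20, fn=10, tp=140)
print(metrics)
```

With imbalanced classes, the precision and recall of the minority (negative) class are usually more informative than overall accuracy.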

## Predict Sentiments using Model

In [None]:
review = 'bad'

# Transform the review using the same TF-IDF vectorizer
review_tfidf = tfidf.transform([review])

# Make a prediction for the review
prediction = logistic_classifier.predict(review_tfidf)

# Print the prediction
print("Predicted class:", prediction[0])
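
The printed class is the numeric label created during pre-processing (1 for ratings above 3, 0 otherwise). A small lookup makes the output easier to read. `LABEL_NAMES` and `describe_prediction` are hypothetical names introduced only for this sketch.

```python
# Label encoding used in this notebook: 1 for ratings above 3, 0 otherwise.
LABEL_NAMES = {0: "negative", 1: "positive"}

def describe_prediction(predicted_class):
    # int() also accepts NumPy integer types such as those returned by predict().
    return LABEL_NAMES[int(predicted_class)]

print("Predicted sentiment:", describe_prediction(0))
```

In the notebook this would be called as `describe_prediction(prediction[0])`.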