## Plan of Action


1. We use the **Amazon Alexa Reviews dataset (3,150 reviews)**, which contains **customer reviews, a rating out of 5**, the review date, and the Alexa variant.
2. First, we **generate sentiment labels (positive/negative)** by marking *reviews with a rating > 3 as positive and all remaining reviews as negative*.
3. Then we **prepare the dataset through vectorization-based feature engineering** (TF-IDF, a popular technique; the code below keeps the TF-IDF line commented out and uses `CountVectorizer`).
4. After that, we use a **Support Vector Classifier for model fitting** and check the model performance (*we are getting >90% accuracy*; the code below keeps `LinearSVC` commented out and fits a bagged `XGBClassifier`).
5. Finally, we use our model to make **predictions on real Amazon reviews**, first in a simple way and then in a fancy way.
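
The labelling rule from step 2 can be sketched as follows. This is only an illustration on a hypothetical three-row frame; the column names `review` and `rating` are assumptions and are not taken from the dataset used below.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only
reviews = pd.DataFrame({
    "review": ["Love my Echo!", "It is okay", "Stopped working after a week"],
    "rating": [5, 3, 1],
})

# rating > 3 -> positive, everything else (including 3) -> negative
reviews["sentiment"] = reviews["rating"].gt(3).map({True: "positive", False: "negative"})

print(reviews[["rating", "sentiment"]])
```

Note that a rating of exactly 3 falls on the negative side under this rule.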



## Import datasets

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Load the dataset
dump = pd.read_csv('../input/merged-annotated/Automatic_Annotated_MergedDataset.csv', sep=',')

dump

## Data Pre-Processing

In [None]:
# Keep the review text and its label, and rename the label column to 'Sentiment'
dataset = dump[['Review', 'Suggestion/Complaint']].rename(columns={'Suggestion/Complaint': 'Sentiment'})

dataset.head()

### Data Transformation

In [None]:
x = dataset['Review']
y = dataset['Sentiment']

In [None]:
#!pip install tokenizer

In [None]:
#import tokenizer_input
#from b2_tokenizer_input import CustomTokenizerExample

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

import string
punct = string.punctuation
# punct

from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS) # list of stopwords

class CustomTokenizerExample():
    def __init__(self):
        pass

    def text_data_cleaning(self, sentence):
        doc = nlp(sentence)  # spaCy tokenizes the text and runs the pipeline components in order

        tokens = []  # lemmatized, lower-cased tokens
        for token in doc:
            # spaCy 2.x uses the placeholder lemma "-PRON-" for pronouns;
            # in that case keep the lower-cased surface form instead
            if token.lemma_ != "-PRON-":
                temp = token.lemma_.lower().strip()
            else:
                temp = token.lower_
            tokens.append(temp)

        # drop stop words and punctuation
        cleaned_tokens = []
        for token in tokens:
            if token not in stopwords and token not in punct:
                cleaned_tokens.append(token)
        return cleaned_tokens

In [None]:
# if the lemma of a token is not the pronoun placeholder "-PRON-" (spaCy 2.x), we take the lower-cased lemma
# and if the token is a pronoun, we take its lower-cased surface form directly,
# because "-PRON-" is a placeholder rather than a real lemma

# stop words and punctuation are removed

In [None]:
# let's do a test
token = CustomTokenizerExample()
token.text_data_cleaning("Those were the best days of my life!")
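
One detail of the filter in `text_data_cleaning` is worth checking: `punct` is a string, so `token not in punct` is a substring test rather than a per-character test. The sketch below needs only the standard library and shows what that means for a few hypothetical tokens. The helper name `is_punctuation` is an assumption introduced here as a stricter alternative; it is not part of the notebook.

```python
import string

punct = string.punctuation


def is_punctuation(token):
    """Return True if the token is non-empty and consists only of punctuation characters."""
    return len(token) > 0 and all(ch in punct for ch in token)


for tok in ["!", "...", "", "life"]:
    # `tok in punct` is a substring test against the punctuation string
    print(repr(tok), tok in punct, is_punctuation(tok))
```

With the substring test, a multi-character token such as `...` is not recognised as punctuation, while an empty string is (and is therefore dropped). The character-wise check treats `...` as punctuation.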

### Feature Engineering (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# tfidf = TfidfVectorizer(tokenizer=token.text_data_cleaning)
# with tokenizer=token.text_data_cleaning, tokenization would follow the custom function above
cvec = CountVectorizer()  # raw term counts with the default tokenizer
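
To make the difference between raw counts and TF-IDF weights concrete, the sketch below computes both by hand for a hypothetical three-document corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default. Whitespace tokenization is a simplification, and the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace for simplicity
docs = ["great sound great price", "sound stopped working", "great product"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc):
    """Smoothed TF-IDF weights for one tokenized document, L2-normalized."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


counts0 = Counter(tokenized[0])   # what a count vectorizer would store
weights0 = tfidf(tokenized[0])    # what a TF-IDF vectorizer would store

print(dict(counts0))
print(weights0)
```

In the first document, `price` and `sound` both occur once, but `price` appears in fewer documents overall and therefore receives the larger TF-IDF weight.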

## Train the model

### Train/ Test Split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = dataset.Sentiment, random_state = 0)

In [None]:
x_train.shape, x_test.shape
# 2520 samples in training dataset and 630 in test dataset
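
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical data with an 80/20 class imbalance and the same `test_size` as above; the variable names are chosen here only for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 positive and 20 negative examples
X = [f"review {i}" for i in range(100)]
y = ["positive"] * 80 + ["negative"] * 20

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both parts keep the 80/20 class ratio of the full data
print(Counter(y_train))
print(Counter(y_test))
```

The same arithmetic gives the sizes quoted above: 20% of 3150 reviews is 630 test samples, leaving 2520 for training.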

### Fit x_train and y_train

In [None]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.ensemble import BaggingClassifier

In [None]:
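# classifier = LinearSVC()
xgb = XGBClassifier()
# The base model is passed positionally: its keyword was `base_estimator` before scikit-learn 1.2 and is `estimator` from 1.2 onwards
classifier = BaggingClassifier(xgb, n_estimators=31, random_state=314)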

In [None]:
# pipeline = Pipeline([('tfid', tfidf), ('clf', classifier)])
pipeline = Pipeline([('cvec', cvec), ('clf', classifier)])
# the pipeline first vectorizes the text and then classifies it

In [None]:
pipeline.fit(x_train, y_train)
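
As a minimal sketch of what fitting a text pipeline does, the example below trains on four hypothetical sentences and then predicts directly from a raw string. To keep it light it uses `LinearSVC` (the commented-out alternative above) instead of the bagged XGBoost model; the data and the name `toy_pipeline` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data
train_texts = [
    "good great love",
    "great product love it",
    "bad terrible hate",
    "terrible waste hate it",
]
train_labels = ["positive", "positive", "negative", "negative"]

toy_pipeline = Pipeline([
    ("cvec", CountVectorizer()),          # text -> term-count matrix
    ("clf", LinearSVC(random_state=0)),   # term counts -> class label
])
toy_pipeline.fit(train_texts, train_labels)

# Raw text goes in; the fitted vectorizer is applied automatically before the classifier
prediction = toy_pipeline.predict(["love it great"])
print(prediction)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned from the training texts only and reused unchanged at prediction time.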

## Check Model Performance

In [None]:
y_pred = pipeline.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# With a pipeline, x_test needs no separate preparation: predict() applies the fitted vectorizer before classifying

In [None]:
# confusion_matrix
confusion_matrix(y_test, y_pred)

# we are getting almost 91% accuracy

In [None]:
# classification_report
print(classification_report(y_test, y_pred))

In [None]:
round(accuracy_score(y_test, y_pred)*100,2)
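
The numbers reported above are all derived from the confusion matrix. The sketch below recomputes accuracy, recall, and precision by hand from a hypothetical 2x2 matrix, following scikit-learn's convention of true classes in rows and predicted classes in columns. The counts are made up and are not the notebook's results.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = [
    [50, 10],
    [5, 135],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))

# Accuracy: share of all samples on the diagonal
accuracy = correct / total

# Recall per class: diagonal entry divided by its row sum (all true members of the class)
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision per class: diagonal entry divided by its column sum (all predictions of the class)
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print(round(accuracy * 100, 2))
print(recall)
print(precision)
```

With imbalanced classes, accuracy alone can look good while the recall of the smaller class stays low, which is why the classification report is worth reading alongside it.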

## Model Serialization

In [None]:
import joblib
joblib.dump(pipeline,'notokbagging_xgb_count.pkl')
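
Loading the saved model back is the mirror image of the dump. The sketch below shows the round trip with a small stand-in object and a temporary directory so that it runs on its own; the name `model_stub` and the file name are hypothetical.

```python
import os
import tempfile

import joblib

# Stand-in for the fitted pipeline, so the sketch does not depend on the trained model
model_stub = {"vectorizer": "cvec", "classifier": "clf"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pkl")  # hypothetical file name
    joblib.dump(model_stub, path)    # serialize to disk
    restored = joblib.load(path)     # deserialize into a new object

print(restored == model_stub)
```

For the real pipeline, `joblib.load('notokbagging_xgb_count.pkl')` would be expected to return the fitted pipeline, assuming compatible versions of scikit-learn and xgboost are installed in the loading environment.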