# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 can be considered a positive review, and a rating of 1 or 2 a negative one. A review with a rating of 3 is considered neutral, and such reviews are excluded from our analysis. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




# [1]. Reading Data

## [1.1] Loading the data

The dataset is available in two forms
1. .csv file
2. SQLite Database

To load the data, we use the SQLite database, as it is easier to query and visualise the data efficiently.
<br> 

Since we only want the global sentiment of the recommendations (positive or negative), we purposefully ignore all Scores equal to 3. If the score is above 3, the recommendation will be set to "positive". Otherwise, it will be set to "negative".

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3 
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os



In [2]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite') 

# filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con) 
# for tsne assignment you can take 5k data points

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000""", con) 

# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a negative rating(0).
def partition(x):
    if x < 3:
        return 0
    return 1

#changing reviews with score less than 3 to be negative (0) and those with score greater than 3 to be positive (1)
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

DatabaseError: Execution failed on sql ' SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000': no such table: Reviews

In [None]:
display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", con)

In [None]:
print(display.shape)
display.head()

In [None]:
display[display['UserId']=='AZY10LLTJ71NX']

In [None]:
display['COUNT(*)'].sum()

#  [2] Exploratory Data Analysis

## [2.1] Data Cleaning: Deduplication

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.  Following is an example:

In [None]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()

As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that <br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.<br>

The method used was to first sort the data by ProductId and then keep only the first of each set of similar reviews, deleting the others. For example, in the case above only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product; deduplicating without sorting could leave different representatives for the same product.
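
The sort-then-keep-first idea can be sketched in plain Python on a few made-up rows. The row values and the helper name `dedupe_keep_first` are illustrative assumptions, not part of the notebook, which uses pandas for this step.

```python
def dedupe_keep_first(rows, sort_key, subset):
    """Sort rows by sort_key, then keep the first row for each distinct
    combination of the subset columns (the same idea as sorting and then
    dropping duplicates while keeping the first occurrence)."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r[sort_key]):
        signature = tuple(row[col] for col in subset)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept

# Hypothetical rows: the same user, time and text posted under two product variants.
rows = [
    {"ProductId": "B000HDOPZG", "UserId": "U1", "Time": 1199577600, "Text": "Great wafers"},
    {"ProductId": "B000HDL1RQ", "UserId": "U1", "Time": 1199577600, "Text": "Great wafers"},
    {"ProductId": "B000HDOPZG", "UserId": "U2", "Time": 1300000000, "Text": "Too sweet"},
]

deduped = dedupe_keep_first(rows, "ProductId", ("UserId", "Time", "Text"))
print([(r["ProductId"], r["UserId"]) for r in deduped])
```

Because the rows are sorted first, the surviving copy of a duplicated review is always the one with the smallest ProductId, regardless of the order in which the rows were loaded.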

In [None]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [None]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

In [None]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

<b>Observation:-</b> It was also seen that in the two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible; hence these two rows are also removed from the calculations.

In [None]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductID
""", con)

display.head()

In [None]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [None]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

#  [3] Preprocessing

## [3.1].  Preprocessing Review Text

Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters like , or . or # etc.
3. Check if the word is made up of English letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was found that there are no 2-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After which we collect the words used to describe positive and negative reviews
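
As a rough illustration of steps 1 to 6 above, here is a minimal standard-library sketch. It is an assumption-laden simplification: the function name `clean_review` and the tiny stopword set are hypothetical, HTML tags are stripped with a simple regular expression rather than a real parser, and the stemming step is left out.

```python
import re

# Tiny illustrative stopword set (an assumption; a real list is much longer).
SAMPLE_STOPWORDS = {"the", "this", "and", "was", "are", "for"}

def clean_review(text, stopwords=SAMPLE_STOPWORDS):
    text = re.sub(r"<[^>]+>", " ", text)          # 1. drop HTML tags (simple regex, not a full parser)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. replace punctuation and special characters
    tokens = []
    for word in text.split():
        if not word.isalpha():                    # 3. skip tokens that contain digits
            continue
        if len(word) <= 2:                        # 4. skip words of length 2 or less
            continue
        word = word.lower()                       # 5. lowercase
        if word in stopwords:                     # 6. drop stopwords
            continue
        tokens.append(word)                       # 7. stemming would be applied here
    return " ".join(tokens)

cleaned = clean_review("This coffee was GREAT!<br />Bought 2 packs, the 12oz size.")
print(cleaned)
```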

In [None]:
# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)

In [None]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_1500 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)

print(sent_0)

In [None]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)

In [None]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
sent_1500 = decontracted(sent_1500)
print(sent_1500)
print("="*50)

In [None]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)

In [None]:
#remove special characters: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)

In [None]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# if the tags were <br/> instead of <br />, they would have been removed in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [None]:
# Combining all the above statements
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

In [None]:
preprocessed_reviews[1500]

In [None]:
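# Save the preprocessed reviews to disk so they can be reloaded without re-running the loop above
filename = "preprocessed"
with open(filename, 'wb') as outfile:
    pickle.dump(preprocessed_reviews, outfile)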


<h2><font color='red'>[3.2] Preprocessing Review Summary</font></h2>

In [None]:
## Similarly, you can do preprocessing for the review summary as well.

# [4] Featurization

## [4.1] BAG OF WORDS

In [None]:
#BoW
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(preprocessed_reviews)
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)

final_counts = count_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])
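
To make the bag-of-words idea concrete, here is a small from-scratch sketch using only the standard library. The two documents are made up, and plain whitespace tokenization is an assumption; a library vectorizer may tokenize differently.

```python
from collections import Counter

docs = ["good tea good price", "bad tea"]

# The vocabulary is the sorted set of all tokens; each token becomes one column.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

bow_matrix = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(bow_matrix)
```

Each row has one entry per vocabulary word, so most entries are zero for a realistic corpus, which is why the vectorizer returns a sparse matrix.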

## [4.2] Bi-Grams and n-Grams.

In [None]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# you can choose the values of min_df=10 and max_features=5000 as you like
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of out text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])
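
The comment above about keeping words like "not" can be illustrated with a tiny n-gram helper. The function name `ngrams` and the example sentence are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "taste is not good".split()
with_not = ngrams(tokens, 2)

# Dropping "not" as a stopword before building bigrams loses the negation.
without_not = ngrams([t for t in tokens if t != "not"], 2)

print(with_not)
print(without_not)
```

With "not" kept, the bigram "not good" survives as a feature; with it removed, the review yields "is good", which points in the opposite direction.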

## [4.3] TF-IDF

In [None]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])
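
For intuition about the idf part of TF-IDF, here is a small hand computation. It assumes the smoothed formula that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it leaves out the final row normalization; the three documents are made up.

```python
import math

docs = [["good", "tea"], ["good", "coffee"], ["bad", "coffee", "coffee"]]
n_docs = len(docs)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def smoothed_idf(term):
    """Smoothed inverse document frequency (assumed default formula)."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

idf = {term: smoothed_idf(term) for doc in docs for term in doc}
for term in sorted(idf):
    print(term, round(idf[term], 3))
```

Terms that appear in fewer documents ("tea", "bad") get a larger idf than terms that appear in more documents ("good", "coffee").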

## [4.4] Word2Vec

In [None]:
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sentance=[]
for sentance in preprocessed_reviews:
    list_of_sentance.append(sentance.split())

In [None]:
# Using Google News Word2Vectors

# in this project we are using a pretrained model by google
# it's a 3.3G file; once you load it into memory
# it occupies ~9GB, so please do this step only if you have >12G of RAM
# we will provide a pickle file which contains a dict,
# and it contains all our corpus words as keys and model[word] as values
# To use this code-snippet, download "GoogleNews-vectors-negative300.bin" 
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment this whole cell
# or change these varible according to your need

is_your_ram_gt_16g=False
want_to_use_google_w2v = False
want_to_train_w2v = True

if want_to_train_w2v:
    # min_count = 5 considers only words that occurred at least 5 times
    w2v_model=Word2Vec(list_of_sentance,min_count=5,size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
    
elif want_to_use_google_w2v and is_your_ram_gt_16g:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model.wv.most_similar('great'))
        print(w2v_model.wv.most_similar('worst'))
    else:
        print("you don't have google's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")

In [None]:
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occurred minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

## [4.4.1] Converting text into vectors using Avg W2V, TFIDF-W2V

#### [4.4.1.1] Avg W2v

In [None]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50 here; change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

#### [4.4.1.2] TFIDF weighted W2v

In [None]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
model = TfidfVectorizer()
tf_idf_matrix = model.fit_transform(preprocessed_reviews)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))

In [None]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentance): # for each review/sentence 
    sent_vec = np.zeros(50) # word vectors have length 50 here
    weight_sum =0; # running sum of the tf-idf weights of words with a valid vector
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we compute tf-idf by hand:
            # dictionary[word] = idf value of word in the whole corpus
            # sent.count(word)/len(sent) = tf value of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent)) 
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1
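
The weighting in the loop above can be checked by hand on a toy case. This sketch uses plain lists instead of NumPy; the 2-dimensional word vectors, the idf values and the helper name `weighted_average` are hypothetical.

```python
def weighted_average(tokens, vectors, idf, dim):
    """TF-IDF weighted average of the word vectors of one review."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word in tokens:
        if word in vectors and word in idf:
            # weight = idf * term frequency within this review
            weight = idf[word] * tokens.count(word) / len(tokens)
            total = [t + weight * v for t, v in zip(total, vectors[word])]
            weight_sum += weight
    if weight_sum == 0:
        return total  # no known words: return the zero vector
    return [t / weight_sum for t in total]

vectors = {"good": [1.0, 0.0], "tea": [0.0, 1.0]}  # hypothetical word vectors
idf = {"good": 1.0, "tea": 3.0}                    # hypothetical idf values

review_vector = weighted_average(["good", "tea"], vectors, idf, dim=2)
print(review_vector)
```

The rarer word ("tea", with the larger idf) pulls the review vector towards its own direction, whereas a plain average would weight both words equally.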

# [5] Assignment 7: SVM

<ol>
    <li><strong>Apply SVM on these feature sets</strong>
        <ul>
            <li><font color='red'>SET 1:</font>Review text, preprocessed one converted into vectors using (BOW)</li>
            <li><font color='red'>SET 2:</font>Review text, preprocessed one converted into vectors using (TFIDF)</li>
            <li><font color='red'>SET 3:</font>Review text, preprocessed one converted into vectors using (AVG W2v)</li>
            <li><font color='red'>SET 4:</font>Review text, preprocessed one converted into vectors using (TFIDF W2v)</li>
        </ul>
    </li>
    <br>
    <li><strong>Procedure</strong>
        <ul>
    <li>You need to work with 2 versions of SVM
        <ul><li>Linear kernel</li>
            <li>RBF kernel</li></ul></li>
    <li>When you are working with the linear kernel, use ‘SGDClassifier’ with hinge loss because it is computationally less expensive.</li>
    <li>When you are working with ‘SGDClassifier’ with hinge loss and trying to find the AUC
        score, you would have to use <a href='https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html'>CalibratedClassifierCV</a></li>
    <li>Similarly, like kdtree of knn, when you are working with RBF kernel it's better to reduce
the number of dimensions. You can put min_df = 10, max_features = 500 and consider a sample size of 40k points.</li>                
        </ul>
    </li>
    <br>
    <li><strong>Hyperparameter tuning (find the best alpha in the range [10^-4 to 10^4], and the best penalty among 'l1', 'l2')</strong>
        <ul>
    <li>Find the best hyperparameter, i.e. the one that gives the maximum <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/receiver-operating-characteristic-curve-roc-curve-and-auc-1/'>AUC</a> value</li>
    <li>Find the best hyperparameter using k-fold cross validation or simple cross validation data</li>
    <li>Use GridSearchCV or RandomizedSearchCV, or write your own for loops, to do this hyperparameter tuning</li>          
        </ul>
    </li>
    <br>
    <li><strong>Feature importance</strong>
        <ul>
    <li>When you are working on the linear kernel with BOW or TFIDF please print the top 10 best
features for each of the positive and negative classes.</li>
        </ul>
    </li>
    <br>
    <li><strong>Feature engineering</strong>
        <ul>
    <li>To increase the performance of your model, you can also experiment with feature engineering like:</li>
            <ul>
            <li>Taking length of reviews as another feature.</li>
            <li>Considering some features from review summary as well.</li>
        </ul>
        </ul>
    </li>
    <br>
    <li><strong>Representation of results</strong>
        <ul>
    <li>You need to plot the performance of model both on train data and cross validation data for each hyper parameter, like shown in the figure.
    <img src='train_cv_auc.JPG' width=300px></li>
    <li>Once after you found the best hyper parameter, you need to train your model with it, and find the AUC on test data and plot the ROC curve on both train and test.
    <img src='train_test_auc.JPG' width=300px></li>
    <li>Along with plotting the ROC curve, you need to print the <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/confusion-matrix-tpr-fpr-fnr-tnr-1/'>confusion matrix</a> with predicted and original labels of test data points. Please visualize your confusion matrices using <a href='https://seaborn.pydata.org/generated/seaborn.heatmap.html'>seaborn heatmaps</a>.
    <img src='confusion_matrix.png' width=300px></li>
        </ul>
    </li>
    <br>
    <li><strong>Conclusion</strong>
        <ul>
    <li>You need to summarize the results at the end of the notebook in a table format. To print out a table, please refer to this <a href='http://zetcode.com/python/prettytable/'>prettytable library link</a>.
        <img src='summary.JPG' width=400px>
    </li>
        </ul>
    </li>
</ol>
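
For the feature-importance task in the list above, one possible approach is to rank the features by the weight a linear model assigns to them. The sketch below shows only the ranking step on made-up numbers; the helper name `top_features`, the feature names and the weights are all hypothetical. With a fitted linear model the weights would typically come from its `coef_` attribute and the names from the vectorizer.

```python
def top_features(feature_names, weights, k=10):
    """Return the k features with the largest weights (positive class)
    and the k features with the smallest weights (negative class)."""
    ranked = sorted(zip(weights, feature_names))       # ascending by weight
    negative = [name for _, name in ranked[:k]]        # most negative first
    positive = [name for _, name in ranked[::-1][:k]]  # most positive first
    return positive, negative

# Hypothetical feature names and linear-model weights.
feature_names = ["great", "terrible", "tea", "delicious", "refund"]
weights = [2.1, -3.0, 0.1, 1.7, -1.2]

positive, negative = top_features(feature_names, weights, k=2)
print("positive:", positive)
print("negative:", negative)
```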

<h4><font color='red'>Note: Data Leakage</font></h4>

1. There will be an issue of data leakage if you vectorize the entire data and then split it into train/cv/test.
2. To avoid the issue of data leakage, make sure to split your data first and then vectorize it. 
3. While vectorizing your data, apply the method fit_transform() on your train data, and apply the method transform() on cv/test data.
4. For more details please go through this <a href='https://soundcloud.com/applied-ai-course/leakage-bow-and-tfidf'>link.</a>
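
The split-first, fit-on-train-only rule can be sketched without any library. The helpers `fit_vocabulary` and `transform` are hypothetical stand-ins for a vectorizer's fit and transform steps, and the documents are made up.

```python
def fit_vocabulary(train_docs):
    """Learn the vocabulary from the training documents only."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count vocabulary words per document; unseen words are ignored."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        matrix.append(row)
    return matrix

train_docs = ["good tea", "bad coffee"]
test_docs = ["good espresso"]

vocabulary = fit_vocabulary(train_docs)   # fit on train only
X_train = transform(train_docs, vocabulary)
X_test = transform(test_docs, vocabulary)  # test words never extend the vocabulary
print(vocabulary)
print(X_test)
```

The word "espresso" appears only in the test document, so it gets no column: nothing about the test set influences the features the model is trained on.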

# Applying SVM

## [5.1] Linear SVM

### [5.1.1] Applying Linear SVM on BOW,<font color='red'> SET 1</font>

### L1 Regularization

In [None]:
final['preprocessed']=preprocessed_reviews

#Creating the feature vector X and target class vector y
X=final['preprocessed']
Y=final['Score']

# let's check the count of each class 
Y.value_counts()

### Splitting data

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # this is random splitting


### Vectorisation - BOW

In [None]:
#importing library
from sklearn import preprocessing

print("Shape of train and test set")
print(X_train.shape,y_train.shape)
print(X_test.shape, y_test.shape)

print("="*100)

#Applying Bag of Words on our data
#Fit the vectorizer on the train data only, then transform both train and test data; this is done to prevent data leakage

from sklearn.feature_extraction.text import CountVectorizer

#creating an instance of the count vectoriser to produce unigrams and bigrams, with maximum features being 5000
# countvectorizer takes in the text data and returns the matrix of token counts which is sparse by default

vectorizer = CountVectorizer(ngram_range=(1,2),min_df=10,max_features=5000)
vectorizer.fit(X_train) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_bow = vectorizer.transform(X_train)
X_test_bow = vectorizer.transform(X_test)

#Normalise the data to ensure both are on the same scale
X_train_bown = preprocessing.normalize(X_train_bow)
X_test_bown = preprocessing.normalize(X_test_bow)


### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1")
parameters = {'alpha':[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]}

#Creating an instance of GCV that performs a 3 fold cross validation by applying Linear SVC on the data
#with roc_auc as the performance metric and taking a set of parameters in the form of alpha
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc', n_jobs=-1, return_train_score=True)  # return_train_score is needed for 'mean_train_score' below

# fitting with train data
clf.fit(X_train_bown, y_train)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(CV  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


### Choosing the best hyperparameter

In [None]:

# Choosing the hyperparameter alpha based on the max auc score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.0001, so we train our model with alpha = 0.0001.

### Training the model with hyperparameter alpha = 0.0001

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1",alpha=0.0001,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_bown, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_train, cal.predict_proba(X_train_bown)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_test, cal.predict_proba(X_test_bown)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_train, cal.predict(X_train_bown))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_test, cal.predict(X_test_bown))
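
The comments in the cell above note that the ROC computation should receive scores for the positive class rather than hard predictions. A small sketch of why, using the interpretation of AUC as the probability that a random positive is ranked above a random negative (ties count half). The helper `pairwise_auc` and the numbers are illustrative assumptions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.9]   # hypothetical positive-class scores
hard_labels = [0, 0, 0, 1]      # the same scores thresholded at 0.5

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

The scores rank every positive above every negative, but thresholding throws that ordering away for the third point, so the AUC computed from hard labels is lower.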

### L2 Regularization

In [None]:
final['preprocessed']=preprocessed_reviews

#Creating the feature vector X and target class vector y
X=final['preprocessed']
Y=final['Score']

# let's check the count of each class 
Y.value_counts()

### Splitting data

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # this is random splitting


In [None]:
#importing library
from sklearn import preprocessing

print("Shape of train and test set")
print(X_train.shape,y_train.shape)
print(X_test.shape, y_test.shape)

print("="*100)

#Applying Bag of Words on our data
#Fit the vectorizer on the train data only, then transform both train and test data; this is done to prevent data leakage

from sklearn.feature_extraction.text import CountVectorizer

#creating an instance of the count vectoriser to produce unigrams and bigrams, with maximum features being 5000
# countvectorizer takes in the text data and returns the matrix of token counts which is sparse by default

vectorizer = CountVectorizer(ngram_range=(1,2),min_df=10,max_features=5000)
vectorizer.fit(X_train) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_bow = vectorizer.transform(X_train)
X_test_bow = vectorizer.transform(X_test)

#Normalise the data to ensure both are on the same scale
X_train_bown = preprocessing.normalize(X_train_bow)
X_test_bown = preprocessing.normalize(X_test_bow)


### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1)
parameters = {'alpha':[0.0001,.001,.01,.1,1,10,100,1000,10000]}

#Creating an instance of GCV that performs a 3 fold cross validation by applying Linear SVC on the data
#with roc_auc as the performance metric and taking a set of parameters in the form of alpha
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc', n_jobs=-1, return_train_score=True)  # return_train_score is needed for 'mean_train_score' below

# fitting with train data
clf.fit(X_train_bown, y_train)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(CV  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

###  Observation 
The mean CV AUC (`mean_test_score`) is highest for alpha = 0.0001, so we choose alpha = 0.0001 as our hyperparameter and train the model with it.
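Instead of copying the best alpha by hand, it can be read from the fitted search object. The following is a minimal sketch on synthetic data; `make_classification`, `X_demo`, `y_demo` and the shortened alpha grid are assumptions for illustration and stand in for the vectorised reviews.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorised reviews (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

alphas = [0.0001, 0.001, 0.01, 0.1, 1]
search = GridSearchCV(
    SGDClassifier(loss='hinge', random_state=0),
    {'alpha': alphas},
    cv=3,
    scoring='roc_auc',
    return_train_score=True,  # adds the mean_train_score / std_train_score columns
)
search.fit(X_demo, y_demo)

# Read the selected value instead of hard-coding it
best_alpha = search.best_params_['alpha']
# With refit=True (the default), best_estimator_ is already trained with best_alpha
final_model = search.best_estimator_
print(best_alpha, search.best_score_)
```

Reading `best_params_` keeps the training cell in sync with the search if the data split or the grid changes.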

### Training the model with hyperparameter alpha = 0.0001

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2",alpha=0.0001,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_bown, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_train, cal.predict_proba(X_train_bown)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_test, cal.predict_proba(X_test_bown)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_train, cal.predict(X_train_bown))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_test, cal.predict(X_test_bown))

### Feature Importance
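For a linear model, the sign of a weight shows which class a feature pushes towards and its magnitude shows how strongly, so sorting the weight row gives the most influential features. A small sketch with a made-up vocabulary and weight row (both hypothetical; `demo_wt` only mimics the `(1, n_features)` shape of `cl.coef_`, and the larger class label is assumed to be the positive class):

```python
import numpy as np

# Hypothetical vocabulary and weight row, shaped like cl.coef_ (1, n_features)
demo_features = ['great', 'awful', 'tasty', 'stale', 'okay']
demo_wt = np.array([[2.0, -3.0, 1.5, -1.0, 0.1]])

order = np.argsort(demo_wt[0])  # ascending: most negative weights first
top_negative = [demo_features[i] for i in order[:2]]
top_positive = [demo_features[i] for i in order[::-1][:2]]

print(top_positive)  # ['great', 'tasty']
print(top_negative)  # ['awful', 'stale']
```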

In [None]:
features = vectorizer.get_feature_names()

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2",alpha=0.0001,class_weight="balanced")
cl.fit(X_train_bown, y_train)

### Top 10  Features of Positive Class

In [None]:
wt = cl.coef_
pos=np.argsort(wt)[:,::-1]

In [None]:
# Top 10 imp features from positive class
for i in list(pos[0][0:10]):
    print(features[i])

### Top 10 Features of Negative Class

In [None]:
neg = np.argsort(wt)
for i in list(neg[0][0:10]):
    print(features[i])

### [5.1.2] Applying Linear SVM on TFIDF,<font color='red'> SET 2</font>

### L1 Regularisation

In [None]:
X=final['preprocessed']
Y=final['Score']

# let's check the count of each class 
Y.value_counts()

### Splitting data

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # this is random splitting


In [None]:
#importing library
from sklearn import preprocessing

print("Shape of train and test set")
print(X_train.shape,y_train.shape)
print(X_test.shape, y_test.shape)

#Applying TF-idf on our data
#Fit the vectorizer on the train data only, then transform both train and test data; this is done to prevent data leakage

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
# Fit the train data
tf_idf_vect.fit(X_train)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

# transform train and test data
X_train_tfidfv = tf_idf_vect.transform(X_train)
X_test_tfidfv=tf_idf_vect.transform(X_test)

#Normalise the data to ensure both are on the same scale
X_train_tfidf = preprocessing.normalize(X_train_tfidfv)
X_test_tfidf = preprocessing.normalize(X_test_tfidfv)

print("After vectorizations")

print(X_train_tfidf.shape, y_train.shape)
print(X_test_tfidf.shape, y_test.shape)

### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1")
parameters = {'alpha':[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]}

# Grid search with 3-fold cross-validation over alpha for the linear SVM (SGDClassifier with hinge loss),
# scored by ROC AUC; return_train_score=True is needed for the mean_train_score values used below
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,return_train_score=True)

# fitting with train data
clf.fit(X_train_tfidf, y_train)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(Test  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.00001, so we train our model with alpha = 0.00001.
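One practical difference between the two penalties is that L1 tends to drive many weights to exactly zero, while L2 only shrinks them. The sketch below counts non-zero coefficients for both penalties on synthetic data; the data, the value `alpha=0.01` and all variable names are assumptions for illustration, not results from the review features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data with only a few informative features (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=50, n_informative=5, random_state=0
)

nonzero = {}
for penalty in ('l1', 'l2'):
    model = SGDClassifier(loss='hinge', penalty=penalty, alpha=0.01, random_state=0)
    model.fit(X_demo, y_demo)
    # Number of features that still carry a non-zero weight
    nonzero[penalty] = int(np.count_nonzero(model.coef_))

print(nonzero)
```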

### Training the model with hyperparameter alpha = 0.00001

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1",alpha=0.00001,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_tfidf, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_train, cal.predict_proba(X_train_tfidf)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_test, cal.predict_proba(X_test_tfidf)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_train, cal.predict(X_train_tfidf))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_test, cal.predict(X_test_tfidf))

### L2 Regularisation

In [None]:
X=final['preprocessed']
Y=final['Score']

# let's check the count of each class 
Y.value_counts()

### Splitting data

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # this is random splitting


In [None]:
#importing library
from sklearn import preprocessing

print("Shape of train and test set")
print(X_train.shape,y_train.shape)
print(X_test.shape, y_test.shape)

#Applying TF-idf on our data
#Fit the vectorizer on the train data only, then transform both train and test data; this is done to prevent data leakage

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
# Fit the train data
tf_idf_vect.fit(X_train)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

# transform train and test data
X_train_tfidfv = tf_idf_vect.transform(X_train)
X_test_tfidfv=tf_idf_vect.transform(X_test)

#Normalise the data to ensure both are on the same scale
X_train_tfidf = preprocessing.normalize(X_train_tfidfv)
X_test_tfidf = preprocessing.normalize(X_test_tfidfv)

print("After vectorizations")

print(X_train_tfidf.shape, y_train.shape)
print(X_test_tfidf.shape, y_test.shape)

### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2")  # l2 is the default penalty
parameters = {'alpha':[0.0001,.001,.01,.1,1,10,100,1000,10000]}

# Grid search with 3-fold cross-validation over alpha for the linear SVM (SGDClassifier with hinge loss),
# scored by ROC AUC; return_train_score=True is needed for the mean_train_score values used below
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,return_train_score=True)

# fitting with train data
clf.fit(X_train_tfidf, y_train)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(Test  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.0001, so we train our model with alpha = 0.0001.
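In the tuning cell the TF-IDF vectorizer is fitted once on the whole training set before cross-validation, so every validation fold has already influenced the vocabulary and IDF weights. One way to keep the folds separate is to put the vectorizer and the classifier in a `Pipeline`, so the vectorizer is refitted inside each fold. The sketch below uses a toy corpus; the documents, labels and step names `tfidf` and `svm` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus (assumption for illustration): 1 = positive, 0 = negative
docs = ['great taste', 'love this tea', 'really good snack', 'tasty and fresh',
        'excellent coffee', 'best chips ever', 'awful taste', 'hate this tea',
        'really bad snack', 'stale and bland', 'terrible coffee', 'worst chips ever']
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    # TfidfVectorizer L2-normalises each row by default (norm='l2')
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('svm', SGDClassifier(loss='hinge', penalty='l2', random_state=0)),
])

alpha_grid = [0.0001, 0.01, 1]
search = GridSearchCV(
    pipe,
    {'svm__alpha': alpha_grid},        # step name + '__' + parameter name
    cv=StratifiedKFold(n_splits=3),    # keeps both classes in every fold
    scoring='roc_auc',
)
search.fit(docs, labels)
print(search.best_params_)
```

Because `TfidfVectorizer` already returns L2-normalised rows by default, a separate `preprocessing.normalize` step is not needed in this variant.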

### Training the model with hyperparameter alpha = 0.0001

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2",alpha=0.0001,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_tfidf, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_train, cal.predict_proba(X_train_tfidf)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_test, cal.predict_proba(X_test_tfidf)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_train, cal.predict(X_train_tfidf))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_test, cal.predict(X_test_tfidf))

### Feature Importance

In [None]:
features = tf_idf_vect.get_feature_names()

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2",alpha=0.0001,class_weight="balanced")
cl.fit(X_train_tfidf, y_train)

### Top 10 Features from Positive Class

In [None]:
wt = cl.coef_
pos=np.argsort(wt)[:,::-1]

In [None]:
# Top 10 imp features from positive class
for i in list(pos[0][0:10]):
    print(features[i])

### Top 10 Features from Negative Class

In [None]:
neg = np.argsort(wt)
for i in list(neg[0][0:10]):
    print(features[i])

### [5.1.3]  Applying Linear SVM on AVG W2V,<font color='red'> SET 3</font>

### Splitting data and vectorisation - AVG W2V
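The split below relies on `sent_vectors`, which is built earlier in the notebook. As a reminder of the idea, an average Word2Vec representation is the mean of the word vectors of the tokens in a review. The sketch below uses a hypothetical three-dimensional embedding table (`toy_vectors`) and a helper `average_vector` introduced only for illustration; a trained Word2Vec model would supply the real vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (assumption for illustration)
toy_vectors = {
    'good': np.array([1.0, 0.0, 1.0]),
    'tea': np.array([0.0, 2.0, 1.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Mean of the known word vectors; zeros if no token is in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'unknownword' is skipped because it has no vector
sentence_vector = average_vector(['good', 'tea', 'unknownword'], toy_vectors)
empty_vector = average_vector(['unknownword'], toy_vectors)
print(sentence_vector, empty_vector)
```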

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

#sent_vectors_w=preprocessing.normalize(sent_vectors)
X_train_w, X_test_w, y_trainw, y_testw = train_test_split(sent_vectors, Y, test_size=0.2) # this is random splitting



### L1 Regularisation

### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1")
parameters = {'alpha':[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]}

# Grid search with 3-fold cross-validation over alpha for the linear SVM (SGDClassifier with hinge loss),
# scored by ROC AUC; return_train_score=True is needed for the mean_train_score values used below
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,return_train_score=True)

# fitting with train data
clf.fit(X_train_w, y_trainw)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(Test  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.001, so we train our model with alpha = 0.001.

### Training the model with hyperparameter alpha = 0.001
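The training cells wrap the classifier in `CalibratedClassifierCV` because `SGDClassifier` with hinge loss does not provide `predict_proba`. Since ROC AUC only depends on how the samples are ranked, the raw margins from `decision_function` are an alternative score. The sketch below shows both options on synthetic data; the data and variable names are assumptions for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the sentence vectors (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

svm = SGDClassifier(loss='hinge', alpha=0.001, random_state=0)

# Option 1: sigmoid (Platt) calibration maps margins to probabilities
calibrated = CalibratedClassifierCV(svm, method='sigmoid', cv=3)
calibrated.fit(X_demo, y_demo)
proba = calibrated.predict_proba(X_demo)[:, 1]

# Option 2: use the signed distance to the hyperplane directly as the score
svm.fit(X_demo, y_demo)
margins = svm.decision_function(X_demo)

print(roc_auc_score(y_demo, proba), roc_auc_score(y_demo, margins))
```

Calibration is still the right choice when actual probabilities are needed, for example to pick a decision threshold.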

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1",alpha=0.001,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_w, y_trainw)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_trainw, cal.predict_proba(X_train_w)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_testw, cal.predict_proba(X_test_w)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_trainw, cal.predict(X_train_w))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_testw, cal.predict(X_test_w))

### L2 Regularisation

### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2")  # l2 is the default penalty
parameters = {'alpha':[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]}

# Grid search with 3-fold cross-validation over alpha for the linear SVM (SGDClassifier with hinge loss),
# scored by ROC AUC; return_train_score=True is needed for the mean_train_score values used below
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,return_train_score=True)

# fitting with train data
clf.fit(X_train_w, y_trainw)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(Test  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.01, so we train our model with alpha = 0.01.

### Training the model with hyperparameter alpha = 0.01

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2",alpha=0.01,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_w, y_trainw)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_trainw, cal.predict_proba(X_train_w)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_testw, cal.predict_proba(X_test_w)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_trainw, cal.predict(X_train_w))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_testw, cal.predict(X_test_w))

### [5.1.4]  Applying Linear SVM on TFIDF W2V,<font color='red'> SET 4</font>

### Splitting data and Vectorisation - TFIDF W2V
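The split below relies on `tfidf_sent_vectors`, which is built earlier in the notebook. The idea behind a TF-IDF weighted Word2Vec representation is a weighted mean of the word vectors, with each word's TF-IDF value as its weight. The sketch below uses a hypothetical embedding table, hypothetical weights and a helper `tfidf_weighted_vector` introduced only for illustration.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings and per-word TF-IDF weights
toy_vectors = {'good': np.array([1.0, 0.0]), 'tea': np.array([0.0, 1.0])}
toy_tfidf = {'good': 3.0, 'tea': 1.0}

def tfidf_weighted_vector(tokens, vectors, weights, dim=2):
    """Weighted mean of word vectors; zeros if no token has a vector and a weight."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for t in tokens:
        if t in vectors and t in weights:
            total += weights[t] * vectors[t]
            weight_sum += weights[t]
    return total / weight_sum if weight_sum else total

# (3 * [1, 0] + 1 * [0, 1]) / (3 + 1)
vec = tfidf_weighted_vector(['good', 'tea'], toy_vectors, toy_tfidf)
print(vec)
```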

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
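# An L2-normalised copy (tfidf_sent_vectors_t) is computed here; the split below uses the unnormalised tfidf_sent_vectors
tfidf_sent_vectors_t= preprocessing.normalize(tfidf_sent_vectors)
X_train_tw, X_test_tw, y_traintw, y_testtw = train_test_split(tfidf_sent_vectors, Y, test_size=0.2) # this is random splitting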



### L1 Regularisation

### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1")
parameters = {'alpha':[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]}

# Grid search with 3-fold cross-validation over alpha for the linear SVM (SGDClassifier with hinge loss),
# scored by ROC AUC; return_train_score=True is needed for the mean_train_score values used below
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,return_train_score=True)

# fitting with train data
clf.fit(X_train_tw, y_traintw)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(Test  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.001, so we train our model with alpha = 0.001.

### Training the model with hyperparameter alpha = 0.001

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l1",alpha=0.001,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_tw, y_traintw)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_traintw, cal.predict_proba(X_train_tw)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_testtw, cal.predict_proba(X_test_tw)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_traintw, cal.predict(X_train_tw))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_testtw, cal.predict(X_test_tw))

### L2 Regularisation

### Hyperparameter tuning on alpha 

In [None]:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2")
parameters = {'alpha':[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]}

# Grid search with 3-fold cross-validation over alpha for the linear SVM (SGDClassifier with hinge loss),
# scored by ROC AUC; return_train_score=True is needed for the mean_train_score values used below
clf = GridSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,return_train_score=True)

# fitting with train data
clf.fit(X_train_tw, y_traintw)

#observing the results- mean auc score, std deviation and the alpha values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])

# lets create a list for the mean auc scores and the standard deviation of these scores(Train  data)
train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']

# lets create a list for the mean auc scores and the standard deviation of these scores(Test  data)
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

c=[0.00001,0.0001,.001,.01,.1,1,10,100,1000,10000]
logval = []
for i in c:
    logval.append(math.log(i))

# plotting the train_auc curve
plt.plot(logval, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

# plotting the test_auc curve
plt.plot(logval, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(logval,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')
plt.legend()
plt.xlabel("log(alpha): hyperparameter")
plt.ylabel("AUC")
plt.title("Performance PLOTS")
plt.show()


In [None]:

# Choosing the hyperparameter alpha based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
We can see that the mean_test_score is maximum for alpha = 0.01, so we train our model with alpha = 0.01.

### Training the model with hyperparameter alpha = 0.01

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = linear_model.SGDClassifier(loss='hinge',n_jobs=-1,penalty="l2",alpha=0.01,class_weight="balanced")
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_tw, y_traintw)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs
train_fpr, train_tpr, thresholds = roc_curve(y_traintw, cal.predict_proba(X_train_tw)[:,1])

#print( roc_curve(y_train, neigh.predict_proba(X_train_bow)[:,1]))
test_fpr, test_tpr, thresholds = roc_curve(y_testtw, cal.predict_proba(X_test_tw)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ERROR PLOTS")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_traintw, cal.predict(X_train_tw))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_testtw, cal.predict(X_test_tw))

## [5.2] RBF SVM

### [5.2.1] Applying RBF SVM on BOW,<font color='red'> SET 1</font>

In [None]:
X=final['preprocessed'][0:20000]
Y=final['Score'][0:20000]

# let's check the count of each class 
Y.value_counts()

### Splitting data

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # this is random splitting


In [None]:
#importing library
from sklearn import preprocessing

print("Shape of train and test set")
print(X_train.shape,y_train.shape)
print(X_test.shape, y_test.shape)

print("="*100)

#Applying Bag of Words on our data
#Fit the vectorizer on the train data only, then transform both train and test data; this is done to prevent data leakage

from sklearn.feature_extraction.text import CountVectorizer

# creating an instance of the count vectorizer that produces unigrams and bigrams, with at most 5000 features
# CountVectorizer takes in the text data and returns a matrix of token counts, which is sparse by default

vectorizer = CountVectorizer(ngram_range=(1,2),min_df=10,max_features=5000)
vectorizer.fit(X_train) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_bow = vectorizer.transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Normalise each row to unit L2 norm so that train and test data are on the same scale
X_train_bown = preprocessing.normalize(X_train_bow)
X_test_bown = preprocessing.normalize(X_test_bow)


### Hyperparameter tuning on C and Gamma using RandomizedSearchCV
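The search space in the next cell has 5 values of C and 3 values of gamma, which gives 15 combinations, while `n_iter=10` evaluates only 10 of them. The sketch below uses `ParameterGrid` and `ParameterSampler` to make that visible; `param_space` mirrors the values in the cell, and `random_state=0` is an assumption added for repeatability.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_space = {'C': [0.0001, 0.001, 0.01, 0.1, 1], 'gamma': [0.001, 0.01, 0.1]}

# Every combination an exhaustive grid search would try
all_combos = list(ParameterGrid(param_space))

# What a randomized search with n_iter=10 draws; with list-valued
# parameters the sampling is done without replacement
sampled = list(ParameterSampler(param_space, n_iter=10, random_state=0))

print(len(all_combos), len(sampled))  # 15 10
```

Because five combinations are never evaluated, the best pair reported by the randomized search is the best among the sampled ones, not necessarily the best in the full grid.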

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = SVC(kernel="rbf")
parameters = {'C':[0.0001,.001,.01,.1,1],'gamma':[0.001,0.01,0.1]}

# Creating an instance of RandomizedSearchCV that performs 3-fold cross-validation with an RBF-kernel SVC,
# with roc_auc as the performance metric, sampling 10 combinations of C and gamma
clf = RandomizedSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,n_iter=10)  

# fitting with train data
clf.fit(X_train_bow, y_train)

#observing the results - mean AUC score, std deviation and the sampled (C, gamma) values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])



In [None]:

# Choosing the hyperparameters C and gamma based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)
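
The grid above has 5 x 3 = 15 combinations, while `n_iter=10` evaluates only 10 of them, so the reported best is the best among the sampled candidates rather than the whole grid. The sketch below uses scikit-learn's `ParameterSampler` to show what such a draw looks like; `random_state=0` is an assumption added here for repeatability and is not set in the notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters_demo = {'C': [0.0001, 0.001, 0.01, 0.1, 1], 'gamma': [0.001, 0.01, 0.1]}

# Size of the full grid
n_total = len(ParameterGrid(parameters_demo))

# Ten candidates drawn without replacement from the grid
sampled = list(ParameterSampler(parameters_demo, n_iter=10, random_state=0))

print(n_total, len(sampled))
for candidate in sampled:
    print(candidate)
```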

### Observation 
The mean_test_score is highest for C = 0.1 and gamma = 0.01, so the model is trained with these values.

### Training the model with hyperparameters C = 0.1 and Gamma = 0.01

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = SVC(C=0.1,gamma=0.01)
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_bow, y_train)
# roc_curve(y_true, y_score): the second argument must be the predicted probability
# of the positive class, not the predicted labels
train_fpr, train_tpr, thresholds = roc_curve(y_train, cal.predict_proba(X_train_bow)[:,1])

test_fpr, test_tpr, thresholds = roc_curve(y_test, cal.predict_proba(X_test_bow)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_train, cal.predict(X_train_bow))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_test, cal.predict(X_test_bow))
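
`SVC` with its default settings does not expose `predict_proba`, which is why the model is wrapped in `CalibratedClassifierCV` before the ROC curve is computed. Below is a self-contained sketch of that pattern on synthetic data from `make_classification`; the data, the `cv=3` setting and the hyperparameters are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

# Synthetic binary classification data (not the review data)
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# Wrap the RBF SVC so that sigmoid (Platt) scaling maps its scores to probabilities
svc_demo = SVC(kernel="rbf", C=1, gamma=0.1)
cal_demo = CalibratedClassifierCV(svc_demo, method="sigmoid", cv=3)
cal_demo.fit(X_syn, y_syn)

# Column 1 holds the probability of the positive class
proba_demo = cal_demo.predict_proba(X_syn)
auc_demo = roc_auc_score(y_syn, proba_demo[:, 1])
print(proba_demo[:3], auc_demo)
```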

### [5.2.2] Applying RBF SVM on TFIDF,<font color='red'> SET 2</font>

In [None]:
X=final['preprocessed'][0:40000]
Y=final['Score'][0:40000]

# let's check the count of each class 
Y.value_counts()

### Splitting data

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # this is random splitting


In [None]:
#importing library
from sklearn import preprocessing

print("Shape of train and test set")
print(X_train.shape,y_train.shape)
print(X_test.shape, y_test.shape)

#Applying TF-idf on our data
#Fit on the train data only, then transform both the train and the test data; this is done to prevent data leakage

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), max_features=500,min_df=10)
# Fit the train data
tf_idf_vect.fit(X_train)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

# transform train and test data
X_train_tfidfv = tf_idf_vect.transform(X_train)
X_test_tfidfv=tf_idf_vect.transform(X_test)

#Normalise the data to ensure both are on the same scale
X_train_tfidf = preprocessing.normalize(X_train_tfidfv)
X_test_tfidf = preprocessing.normalize(X_test_tfidfv)

print("After vectorizations")

print(X_train_tfidf.shape, y_train.shape)
print(X_test_tfidf.shape, y_test.shape)
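
One detail worth knowing about the normalisation step above: `TfidfVectorizer` uses `norm='l2'` by default, so its rows already have unit length and a further `preprocessing.normalize` call leaves them unchanged. A quick check on a toy corpus (the documents are made up for this example):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs_demo = ["good tasty coffee", "bad stale coffee", "good good tea"]

# Default TfidfVectorizer output is already L2-normalised per row
X_tfidf_demo = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs_demo)

# Normalising again should give the same matrix
X_again_demo = normalize(X_tfidf_demo)
same = np.allclose(X_tfidf_demo.toarray(), X_again_demo.toarray())
print(same)
```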

### Hyperparameter tuning on C and Gamma using RandomizedSearchCV

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = SVC(kernel="rbf")
parameters = {'C':[0.0001,.001,.01,.1,1],'gamma':[0.001,0.01,0.1]}

#Creating a RandomizedSearchCV instance that runs 3-fold cross validation of the RBF SVC
#with roc_auc as the performance metric, sampling n_iter=10 of the 15 (C, gamma) combinations
clf = RandomizedSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,n_iter=10)  

# fitting with train data
clf.fit(X_train_tfidf, y_train)

#observing the results - mean AUC score, std deviation and the sampled (C, gamma) values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])



In [None]:

# Choosing the hyperparameters C and gamma based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
The mean_test_score is highest for C = 0.1 and gamma = 0.1, so the model is trained with these values.

### Training the model with hyperparameters C = 0.1 and Gamma = 0.1

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = SVC(C=0.1,gamma=0.1)
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_tfidf, y_train)
# roc_curve(y_true, y_score): the second argument must be the predicted probability
# of the positive class, not the predicted labels
train_fpr, train_tpr, thresholds = roc_curve(y_train, cal.predict_proba(X_train_tfidf)[:,1])

test_fpr, test_tpr, thresholds = roc_curve(y_test, cal.predict_proba(X_test_tfidf)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_train, cal.predict(X_train_tfidf))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_test, cal.predict(X_test_tfidf))

### [5.2.3]  Applying RBF SVM on AVG W2V,<font color='red'> SET 3</font>

### Splitting data and Vectorisation 

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split

#sent_vectors_w=preprocessing.normalize(sent_vectors)
X_train_w, X_test_w, y_trainw, y_testw = train_test_split(sent_vectors[0:40000], Y[0:40000], test_size=0.2) # this is random splitting
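
For readers who want a reminder of what an average Word2Vec sentence vector is, here is a minimal sketch. It uses a hypothetical two-word lookup table `toy_w2v` with 3-dimensional vectors in place of a trained Word2Vec model, and the helper `avg_w2v` is written for this example only.

```python
import numpy as np

# Hypothetical word vectors; a real model would supply these
toy_w2v = {
    "good": np.array([1.0, 0.0, 0.0]),
    "coffee": np.array([0.0, 1.0, 0.0]),
}

def avg_w2v(sentence, w2v, dim=3):
    """Average the vectors of the words that exist in the vocabulary."""
    vecs = [w2v[w] for w in sentence.split() if w in w2v]
    # A sentence with no known words maps to the zero vector
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v_demo = avg_w2v("good coffee today", toy_w2v)
print(v_demo)
```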



### Hyperparameter tuning on C and Gamma using RandomizedSearchCV

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = SVC(kernel="rbf")
parameters = {'C':[0.0001,.001,.01,.1,1],'gamma':[0.001,0.01,0.1]}

#Creating a RandomizedSearchCV instance that runs 3-fold cross validation of the RBF SVC
#with roc_auc as the performance metric, sampling n_iter=10 of the 15 (C, gamma) combinations
clf = RandomizedSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,n_iter=10)  

# fitting with train data
clf.fit(X_train_w, y_trainw)

#observing the results - mean AUC score, std deviation and the sampled (C, gamma) values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])



In [None]:

# Choosing the hyperparameters C and gamma based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
The mean_test_score is highest for C = 1 and gamma = 0.1, so the model is trained with these values.
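
Both selected values, C = 1 and gamma = 0.1, are the largest entries in the search grid, so a better setting may lie beyond it. One possible follow-up, sketched here under assumed bounds that are not taken from the notebook, is to sample C and gamma from wider log-uniform ranges using `scipy.stats.loguniform`.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Assumed, wider search ranges (hypothetical bounds chosen for illustration)
wide_space = {
    'C': loguniform(1e-4, 1e2),
    'gamma': loguniform(1e-3, 1e1),
}

# Ten candidate settings drawn from the continuous distributions
wide_candidates = list(ParameterSampler(wide_space, n_iter=10, random_state=0))
for candidate in wide_candidates:
    print(candidate)
```

The same dictionary could be passed to `RandomizedSearchCV` in place of the list-based grid.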

### Training the model with hyperparameters C = 1 and Gamma = 0.1

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = SVC(C=1,gamma=0.1)
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_w, y_trainw)
# roc_curve(y_true, y_score): the second argument must be the predicted probability
# of the positive class, not the predicted labels
train_fpr, train_tpr, thresholds = roc_curve(y_trainw, cal.predict_proba(X_train_w)[:,1])

test_fpr, test_tpr, thresholds = roc_curve(y_testw, cal.predict_proba(X_test_w)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_trainw, cal.predict(X_train_w))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_testw, cal.predict(X_test_w))

### [5.2.4]  Applying RBF SVM on TFIDF W2V,<font color='red'> SET 4</font>

### Splitting data and Vectorisation

In [None]:
# Splitting data as train and test set to fit our model and to calculate the performance of the model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
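# normalised copy of the vectors (not used below: the split is done on the un-normalised tfidf_sent_vectors)
tfidf_sent_vectors_t = preprocessing.normalize(tfidf_sent_vectors)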
X_train_tw, X_test_tw, y_traintw, y_testtw = train_test_split(tfidf_sent_vectors[0:40000], Y[0:40000], test_size=0.2) # this is random splitting
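
As a reminder of how a TFIDF-weighted Word2Vec sentence vector is formed, the sketch below weights each word vector by its term frequency times its IDF and divides by the sum of the weights. The lookup tables `toy_w2v` and `toy_idf` and the helper `tfidf_w2v` are hypothetical stand-ins for a trained Word2Vec model and a fitted `TfidfVectorizer`; this is one common formulation and may differ in detail from the code that produced `tfidf_sent_vectors`.

```python
import numpy as np

# Hypothetical word vectors and IDF values
toy_w2v = {
    "good": np.array([1.0, 0.0, 0.0]),
    "coffee": np.array([0.0, 1.0, 0.0]),
}
toy_idf = {"good": 1.0, "coffee": 3.0}

def tfidf_w2v(sentence, w2v, idf, dim=3):
    """Weighted average of word vectors with weight = tf * idf."""
    words = sentence.split()
    total = np.zeros(dim)
    weight_sum = 0.0
    for w in set(words):
        if w in w2v and w in idf:
            weight = idf[w] * words.count(w) / len(words)
            total += weight * w2v[w]
            weight_sum += weight
    # Sentences with no known words stay at the zero vector
    return total / weight_sum if weight_sum else total

tw_demo = tfidf_w2v("good coffee", toy_w2v, toy_idf)
print(tw_demo)
```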



### Hyperparameter tuning on C and Gamma using RandomizedSearchCV

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score
import math

cl = SVC(kernel="rbf")
parameters = {'C':[0.0001,.001,.01,.1,1],'gamma':[0.001,0.01,0.1]}

#Creating a RandomizedSearchCV instance that runs 3-fold cross validation of the RBF SVC
#with roc_auc as the performance metric, sampling n_iter=10 of the 15 (C, gamma) combinations
clf = RandomizedSearchCV(cl, parameters, cv=3, scoring='roc_auc',n_jobs=-1,n_iter=10)  

# fitting with train data
clf.fit(X_train_tw, y_traintw)

#observing the results - mean AUC score, std deviation and the sampled (C, gamma) values
print("Results for CV data")
print(pd.DataFrame(clf.cv_results_)[['mean_test_score', 'std_test_score', 'params']])



In [None]:

# Choosing the hyperparameters C and gamma based on the max AUC score
print(clf.best_score_)
print(clf.best_params_)
print(clf.best_estimator_)

### Observation 
The mean_test_score is highest for C = 1 and gamma = 0.1, so the model is trained with these values.

### Training the model with hyperparameters C = 1 and Gamma = 0.1

In [None]:
# Code Ref: Applied AI
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc,accuracy_score
import scikitplot as splot
from sklearn.calibration import CalibratedClassifierCV

cl = SVC(C=1,gamma=0.1)
#cl.fit(X_train_bown, y_train)

cal = CalibratedClassifierCV(cl, method='sigmoid')
cal.fit(X_train_tw, y_traintw)
# roc_curve(y_true, y_score): the second argument must be the predicted probability
# of the positive class, not the predicted labels
train_fpr, train_tpr, thresholds = roc_curve(y_traintw, cal.predict_proba(X_train_tw)[:,1])

test_fpr, test_tpr, thresholds = roc_curve(y_testtw, cal.predict_proba(X_test_tw)[:,1])

#plotting the test and train auc curve - TPR v/s FPR
plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve")
plt.show()



from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
splot.metrics.plot_confusion_matrix(y_traintw, cal.predict(X_train_tw))

print("Test confusion matrix")
splot.metrics.plot_confusion_matrix(y_testtw, cal.predict(X_test_tw))

# [6] Conclusions

In [None]:
from prettytable import PrettyTable

x = PrettyTable()

In [None]:
x.field_names = ["Vectoriser", "Model", "Regularization", "Alpha value", "AUC Score"]

x.add_row(["BOW", "SGDClassifier", "l1", 0.00001, 0.95])
x.add_row(["TFIDF", "SGDClassifier", "l1", 0.00001, 0.95])
x.add_row(["Avg W2V", "SGDClassifier", "l1", 0.001, 0.90])
x.add_row(["TFIDF W2V", "SGDClassifier", "l1", 0.001, 0.88])
x.add_row(["BOW", "SGDClassifier", "l2", 0.0001, 0.94])
x.add_row(["TFIDF", "SGDClassifier", "l2", 0.0001, 0.94])
x.add_row(["Avg W2V", "SGDClassifier", "l2", 0.01, 0.90])
x.add_row(["TFIDF W2V", "SGDClassifier", "l2", 0.01, 0.88])

print(x)

In [None]:
from prettytable import PrettyTable

x = PrettyTable()

In [None]:
x.field_names = ["Vectoriser", "Model","Kernel", "C value","Gamma Value", "AUC Score"]

x.add_row(["BOW", "SVC", "RBF", 0.1, 0.01, 0.92])
x.add_row(["TFIDF", "SVC", "RBF", 0.1, 0.1, 0.89])
x.add_row(["Avg W2V", "SVC", "RBF", 1, 0.1, 0.92])
x.add_row(["TFIDF W2V", "SVC", "RBF", 1, 0.1, 0.90])

print(x)

### Observations

1. SGDClassifier, used here as a substitute for a linear SVM, reaches an AUC of 0.88 to 0.95 across the four vectorisers once alpha is tuned. BOW and TFIDF (0.94 to 0.95) score higher than Avg W2V and TFIDF W2V (0.88 to 0.90).
2. SVC with an RBF kernel and tuned C and gamma reaches an AUC of 0.89 to 0.92 on the test data.
3. CalibratedClassifierCV is used to get probability estimates for each class, which are then used for the ROC curves.