# Problem Statement

As a senior ML engineer, you are asked to build a model that improves the product recommendations shown to users, based on their past reviews and ratings.

To do this, you plan to build a sentiment-based product recommendation system, which involves the following tasks:
- Data sourcing and sentiment analysis
- Building a recommendation system
- Improving the recommendations using the sentiment analysis model
- Deploying the end-to-end project with a user interface

## Pipeline that needs to be performed

1. Data loading

2. Data cleaning and EDA

3. Text Preprocessing

4. Feature extraction

5. Model building using supervised learning

6. Building a recommendation System

7. Improving the recommendations using the sentiment analysis model

8. Deployment of the end-to-end project with a user interface, using Flask and Heroku
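
Step 7 of the pipeline can be pictured as a re-ranking pass over the recommender's output. The sketch below is only one plausible way to do it, under the assumption that each candidate product has a list of 0/1 sentiment predictions for its reviews; it is not necessarily the approach taken later in this notebook. The names `rerank_by_sentiment`, `candidates` and `predicted_sentiments` are hypothetical, and the data is made up.

```python
def rerank_by_sentiment(candidates, predicted_sentiments, top_k=5):
    """Order candidate products by their share of positive reviews.

    candidates: product names proposed by the recommender (hypothetical input).
    predicted_sentiments: dict mapping product name -> list of 0/1 predictions,
        where 1 means the sentiment model labelled the review positive.
    """
    def positive_share(product):
        preds = predicted_sentiments.get(product, [])
        # products without any reviews get a share of 0.0
        return sum(preds) / len(preds) if preds else 0.0

    ranked = sorted(candidates, key=positive_share, reverse=True)
    return ranked[:top_k]


# made-up data for illustration only
candidates = ["soap", "shampoo", "lotion"]
predicted_sentiments = {
    "soap": [1, 1, 0, 1],      # 75% positive
    "shampoo": [1, 0, 0, 0],   # 25% positive
    "lotion": [1, 1, 1, 1],    # 100% positive
}
top_two = rerank_by_sentiment(candidates, predicted_sentiments, top_k=2)
print(top_two)
```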

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
!pip install wordcloud
!pip install swifter
!pip install textblob
!pip install imblearn

In [3]:
# import required libraries

import pandas as pd,numpy as np
from numpy import *
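# use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_rows',100)
pd.set_option('display.max_columns',50)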
pd.options.display.max_colwidth = 80

import os
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt,seaborn as sns
%matplotlib inline

# libraries for text processing
import string,re,swifter # swifter speeds up applying functions to a pandas DataFrame
from textblob import TextBlob
from collections import Counter
from imblearn.over_sampling import SMOTE
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords,wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
from nltk import FreqDist
from wordcloud import WordCloud,STOPWORDS

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer , TfidfTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
import time,pickle
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics.pairwise import pairwise_distances,cosine_similarity

# 1. Data loading

In [4]:
#load the dataset
data  = pd.read_csv(r'../input/reviews-ratings/dataset.csv')

#dimension of the dataframe 
print('dataframe has {} rows and {} columns'.format(str(data.shape[0]),str(data.shape[1])))

In [5]:
#print the column names
for i,col in enumerate(data.columns.to_list()):
  print(i, ' --> ',col)

In [6]:
data.sample(n=10)

In [7]:
data.info(verbose=True)

In [8]:
#drop any duplicate rows and check the resulting shape
data.drop_duplicates(inplace=True)
data.shape

# 2. Data cleaning and EDA

In [9]:
#check the NaN values in all columns
def checkNaNvalues(df):
    return round(100*(df.isnull().sum()/len(df.index)), 2)

In [10]:
checkNaNvalues(data)

More than 90% of the records are null in the reviews_userCity and reviews_userProvince columns, so it is better to drop these two columns.

In [11]:
data.drop(columns=['reviews_userCity','reviews_userProvince'],inplace=True)
data.shape

In [12]:
#check the rating distribution in percentage
round(100*data.reviews_rating.value_counts()/len(data.reviews_rating.index),2)

In [13]:
#plot distribution of reviews rating-wise
plt.figure(figsize=(8,5))
ax = sns.countplot(x='reviews_rating', data=data)

In [14]:
#let's check the distribution of the user sentiments for all reviews
#distribution of user sentiments in percentage
sentiment_counts = data.user_sentiment.value_counts()
for sentiment, count in sentiment_counts.items():
    print('{0} user sentiments across all reviews are {1}%'.format(sentiment, round(100*count/len(data.user_sentiment.index),2)))

In [15]:
##plot the distribution of user sentiments using countplot
plt.figure(figsize=(8,5))
ax = sns.countplot(x='user_sentiment', data=data)

About 88% of the reviews carry a positive sentiment, across products of various categories.

In [16]:
#plot the distribution of user sentiments rating wise
plt.figure(figsize=(10,8))
ax = sns.countplot(x="user_sentiment",
                hue="reviews_rating",
                data=data);
ax.legend(loc = 'upper right')

The ratings are distributed differently for positive and negative sentiments, and positive reviews far outnumber negative ones. This is a clear case of class imbalance, which we need to handle.
We can also see that some reviews are labelled Positive even though the rating is 3 or lower, and some are labelled Negative even though the rating is above 3. The data below confirms this.

In [17]:
data.groupby(['user_sentiment','reviews_rating'])['user_sentiment'].count()

In [18]:
#create a crosstable to see the ratings given for both user sentiment
pd.crosstab(data.user_sentiment,data.reviews_rating,margins=True)

We can correct this inconsistency by assigning a Positive sentiment to every review with a rating above 3 and a Negative sentiment to every review with a rating of 3 or below.
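
As a quick illustration of the rule applied in the next cell, the sketch below maps each star rating to the label it implies. It follows the thresholds used in that cell (`reviews_rating < 4` becomes Negative, `reviews_rating > 3` becomes Positive), so a rating of 3 counts as Negative. The helper name `relabel_sentiment` is hypothetical and not part of the notebook.

```python
def relabel_sentiment(rating):
    """Return the sentiment label implied by a 1-5 star rating."""
    # ratings of 4 or 5 count as Positive; 1, 2 and 3 count as Negative
    return "Positive" if rating > 3 else "Negative"


labels = {rating: relabel_sentiment(rating) for rating in range(1, 6)}
print(labels)
```

Note that the pandas version in the next cell only rewrites rows that already have a label, so rows with a missing `user_sentiment` are left untouched.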

In [19]:
data.loc[(data.user_sentiment=='Positive') & (data.reviews_rating < 4),'user_sentiment'] = 'Negative'
data.loc[(data.user_sentiment=='Negative') & (data.reviews_rating > 3),'user_sentiment'] = 'Positive'

In [20]:
pd.crosstab(data.user_sentiment,data.reviews_rating,margins=True)

In [21]:
#plot the distribution of user sentiments rating wise again after correcting sentiments based on ratings
plt.figure(figsize=(10,8))
ax = sns.countplot(x="user_sentiment",
                hue="reviews_rating",
                data=data);
ax.legend(loc = 'upper right')

In [22]:
#get the top 10 brands with positive sentiments
data.loc[data.user_sentiment=='Positive'].groupby(['brand'])['brand'].count().sort_values(ascending=False).head(10).plot(kind='bar',color='g',title='top 10 brands with positive sentiment')

In [23]:
#get the top 10 brands with negative sentiments
data.loc[data.user_sentiment=='Negative'].groupby(['brand'])['brand'].count().sort_values(ascending=False).head(10).plot(kind='bar',color='r',title='top 10 brands with negative sentiment')

In [24]:
#check null value percentage again for all columns
checkNaNvalues(data)

In [25]:
#remove records where username is not available
data = data.loc[~data.reviews_username.isnull()]

In [26]:
#check NaN records
checkNaNvalues(data)

In [27]:
#function for aligning dataframe columns to left
def dframe_alignment(table):
    return table.style.set_properties(**{'text-align': 'left'}).set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])

In [28]:
# not all columns are required for the sentiment analysis; we keep only the ones that are useful
# .copy() avoids a SettingWithCopyWarning when new columns are added to df below
df = data[['id','name','reviews_rating','reviews_username','reviews_text','reviews_title','user_sentiment']].copy()
dframe_alignment(df.sample(n=10))


In [29]:
#dimension of the dataframe
df.shape

In [30]:
#let's merge reviews_text and reviews_title into one column and then remove the original columns
#before that, fill all null records of reviews_title with ' '
df.reviews_title = df.reviews_title.fillna(' ')
df['reviewsText'] = df.reviews_title + '. ' + df.reviews_text
reviewsData  = df.drop(columns=['reviews_text','reviews_title'])
reviewsData.reviewsText = reviewsData.reviewsText.str.lstrip('. ')
dframe_alignment(reviewsData.sample(n=10))
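
The merge above relies on `str.lstrip('. ')` to clean up rows that had no title. A small plain-Python sketch of the same steps shows what that call does: it strips any leading run of `.` and space characters, not the literal prefix `'. '`. The helper `merge_title_and_text` is hypothetical and only mirrors the pandas logic for single strings.

```python
def merge_title_and_text(title, text):
    """Join title and body the way the cell above does (hypothetical helper)."""
    title = " " if title is None else title   # stand-in for fillna(' ')
    merged = title + ". " + text
    # lstrip('. ') removes every leading '.' or ' ' character,
    # not just one occurrence of the two-character prefix '. '
    return merged.lstrip(". ")


with_title = merge_title_and_text("Great", "Works well.")
without_title = merge_title_and_text(None, "Works well.")
dotted = merge_title_and_text(None, "...fine")
print(repr(with_title), repr(without_title), repr(dotted))
```

One consequence is that a review body that itself starts with dots loses them when the title is missing, which is harmless here because punctuation is removed later anyway.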



In [31]:
#dimension of the dataframe 
reviewsData.shape

In [32]:
#check user_sentiments distribution after removing NaN records
reviewsData.user_sentiment.value_counts()

In [33]:
#check the record for which user sentiment is null and fill it based on its review text
reviewsData.loc[reviewsData.user_sentiment.isna()]

In [34]:
#looking at the review text, it looks like a positive sentiment. Filling user_sentiment with 'Positive'
reviewsData.loc[reviewsData.user_sentiment.isna(),'user_sentiment'] = 'Positive'

In [35]:
#check for NaN records in the dataframe again
checkNaNvalues(reviewsData)

In [36]:
#check the distribution of user_sentiments again
reviewsData.user_sentiment.value_counts()

In [37]:
#plot a histogram to check the distribution of review lengths (in characters)
char_length = [len(review) for review in reviewsData.reviewsText]

plt.figure(figsize=(15,8))
sns.histplot(data=char_length,bins=50,stat='count',kde=True)
plt.show()

We can see that the majority of the reviews are between 0 and 1000 characters long.

# 3. Text processing

In [38]:
#create text processing function 
def textProcessing(text):
    '''
        This function parses a text and does the following:
        - Makes the text lowercase
        - Removes whitespace from both ends of the string
        - Removes text in square brackets
        - Removes punctuation
        - Removes words containing numbers
    '''
    text = text.lower() # convert text to lower case
    text = text.strip() # remove whitespace from both ends of the string
    text = re.sub(r'\[.*?\]','',text) # remove text in square brackets
    text = re.sub('[%s]'%re.escape(string.punctuation),'',text) # remove string.punctuation characters (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) from the text
    # for removing punctuation you can also use text.translate(str.maketrans('','',string.punctuation))
    text = re.sub(r'\w*\d\w*','',text) # remove words containing numbers
    return text
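
To make the effect of each step concrete, here is a self-contained sketch that applies the same sequence of operations to one made-up review. The function name `clean_review` and the sample string are hypothetical; the regular expressions are the ones used in `textProcessing`, written as raw strings.

```python
import re
import string


def clean_review(text):
    """Same steps as textProcessing, shown on a single string."""
    text = text.lower().strip()
    text = re.sub(r"\[.*?\]", "", text)                               # drop [bracketed] text
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # drop tokens containing digits
    return text


sample = "  Great Product [verified]!! Bought 2nd one, 5stars.  "
cleaned = clean_review(sample)
tokens = cleaned.split()
print(tokens)
```

The cleaned string can still contain repeated spaces where tokens were removed, which is why the later steps split on whitespace before rejoining the words.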
    

In [39]:
dframe_alignment(reviewsData.head(10))

In [40]:
reviewsData['reviews'] = reviewsData.reviewsText.swifter.apply(lambda x : textProcessing(x))
dframe_alignment(reviewsData.head(10))

In [41]:
#remove stop words from the text
stopWords = set(stopwords.words('english'))
def removeStopwords(text):
    words = [word for word in text.split() if word.isalpha() and word not in stopWords]
    return " ".join(words)

In [42]:
# This is a helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [43]:
#remove stopwords and lemmatize text
lemmatizer = WordNetLemmatizer()
def textLemmatize(text):
    #remove the stopwords from the text
    words = removeStopwords(text)
    #map pos tags of each words
    wordnetPOSTags =  nltk.pos_tag(word_tokenize(words))
    #lemmatize the words according to their POS tag
    lemmatizedWords = [lemmatizer.lemmatize(word,get_wordnet_pos(tag)) for word,tag in wordnetPOSTags]
    return " ".join(lemmatizedWords)
     

In [44]:
reviewsData.reviews = reviewsData.reviews.swifter.apply(lambda x: textLemmatize(x))
dframe_alignment(reviewsData.head(10))

In [45]:
#function to keep only adjectives (JJ, JJR, JJS) and singular nouns (NN), based on POS tags

def pos_Tag(text):
  #TextBlob provides a simple API for common NLP tasks such as part-of-speech tagging
  blob = TextBlob(text)
  return " ".join([word for (word,tag) in blob.tags if tag in ['JJ','JJR','JJS','NN']])

In [46]:
reviewsData['finalReviews'] = reviewsData.reviews.swifter.apply(lambda x : pos_Tag(x))
dframe_alignment(reviewsData.head(10))

### EDA on reviews

### Using a word cloud, find the top 25 words by frequency among all the reviews after processing the text

In [47]:
#create separate dataframes for positive and negative sentiments for EDA
pos_review = pd.DataFrame(reviewsData.loc[reviewsData.user_sentiment=='Positive','finalReviews'])
neg_review = pd.DataFrame(reviewsData.loc[reviewsData.user_sentiment=='Negative','finalReviews'])

In [48]:
def get_top_25Words(df,top_n):
  top_25Words = df.str.split().values.tolist()
  return FreqDist([w for seq in top_25Words for w in seq]).most_common(top_n)
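
NLTK's `FreqDist` is a subclass of `collections.Counter`, so the counting in `get_top_25Words` can be reproduced with the standard library alone. The sketch below uses three made-up reviews; the helper name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(reviews, top_n):
    """Count whitespace-separated tokens across all reviews."""
    counts = Counter(word for review in reviews for word in review.split())
    return counts.most_common(top_n)


# made-up reviews for illustration only
reviews = ["great product great price", "great smell", "bad smell"]
top_two = top_words(reviews, 2)
print(top_two)
```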

In [49]:
##create a function to plot the frequency chart
def plot_word_frequency(df,top_n=10):
  word_freq = get_top_25Words(df,top_n)
  pprint(word_freq)
  labels = [element[0] for element in word_freq]
  counts = [element[1] for element in word_freq]
  ax = sns.barplot(y=labels, x=counts)
  return ax

In [50]:
#create a wordcloud to see the most used words visually
def creatWordCloud(top_25Words):
    ### Using word cloud print top 40 words by frequency
    stopwords_ = set(STOPWORDS)
    wordcloud = WordCloud(stopwords=stopwords_,background_color = 'white', width = 800, height = 400,
                          colormap = 'viridis', max_words = 40, contour_width = 3,
                          max_font_size = 80, contour_color = 'steelblue',
                          random_state = 0)

    plt.figure(figsize=(15,10))
    wordcloud.generate(" ".join([w for (w,c) in top_25Words]))
    plt.imshow(wordcloud)
    return plt.show()

In [51]:
#Find the top 25 words by frequency among positive reviews and their frequency count
plt.figure(figsize=(15,10))
plot_word_frequency(pos_review.finalReviews,25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

In [52]:
#print wordCloud for top 25 keywords for positive reviews
top_pos_25_words = FreqDist([w for seq in pos_review.finalReviews.str.split().values.tolist() for w in seq]).most_common(25)
creatWordCloud(top_pos_25_words)

In [53]:
#Find the top 25 keywords by frequency among negative reviews and their frequency count
plt.figure(figsize=(15,10))
plot_word_frequency(neg_review.finalReviews,25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

In [54]:
#print wordCloud for top 25 keywords for negative reviews
top_neg_25_words = FreqDist([w for seq in neg_review.finalReviews.str.split().values.tolist() for w in seq]).most_common(25)
creatWordCloud(top_neg_25_words)

### Find the top unigrams, bigrams and trigrams by frequency among all the reviews after processing the text.

In [55]:
#create a function for extracting top ngrams from the text
def get_top_ngrams(text, n=None, ngram=(1,1)):
  vec = CountVectorizer(stop_words='english', ngram_range=ngram).fit(text)
  bagofwords = vec.transform(text)
  sum_words = bagofwords.sum(axis=0)
  words_frequency = [(word, sum_words[0, index]) for word, index in vec.vocabulary_.items()]
  words_frequency = sorted(words_frequency, key = lambda x: x[1], reverse=True)
  return words_frequency[:n]
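
For intuition about what `ngram_range=(2,2)` or `(3,3)` counts, the sketch below builds n-grams by hand from whitespace-split words. It is a simplification: `CountVectorizer` also applies its own tokenization and removes English stop words before forming n-grams, which this version skips. The helper `count_ngrams` and the two sample documents are hypothetical.

```python
from collections import Counter


def count_ngrams(docs, n):
    """Count contiguous n-word sequences across documents."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        # zip over n staggered views of the word list to form n-grams
        for gram in zip(*(words[i:] for i in range(n))):
            counts[" ".join(gram)] += 1
    return counts


# made-up documents for illustration only
docs = ["smell great love product", "love product smell great"]
bigrams = count_ngrams(docs, 2)
print(bigrams.most_common(3))
```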

In [56]:
#function to create plot to see top 25 unigrams

def createBarPlot(df,xlabel,ylabel,title):
    plt.figure(figsize=[20,8])
    sns.barplot(x=df['ngram'], y=df['count'])
    plt.xticks(rotation=45,fontsize=12)
    plt.xlabel(xlabel,fontsize=15)
    plt.ylabel(ylabel,fontsize=15)
    plt.title(title,fontsize=15)
    return plt.show()

In [57]:
#find the top 25 unigrams by frequency among the positive reviews
top_25PosUnigram = get_top_ngrams(pos_review.finalReviews.values.astype('U'),n=25,ngram = (1,1))
top_25PosUnigram = pd.DataFrame(top_25PosUnigram,columns=['ngram','count'])
top_25PosUnigram

In [58]:
createBarPlot(top_25PosUnigram,'unigram','count','top 25 positive unigrams')

In [59]:
#top25_unigrams for negative sentiments
top_25NegUnigram = get_top_ngrams(neg_review.finalReviews.values.astype('U'),n=25,ngram = (1,1))
top_25NegUnigram = pd.DataFrame(top_25NegUnigram,columns=['ngram','count'])
top_25NegUnigram

In [60]:
#plotting top 25 unigrams
createBarPlot(top_25NegUnigram,'unigram','count','top 25 negative unigrams')

In [61]:
#find the top 25 bigrams by frequency among the positive reviews
top_25Posbigram = get_top_ngrams(pos_review.finalReviews.values.astype('U'),n=25,ngram = (2,2))
top_25Posbigram = pd.DataFrame(top_25Posbigram,columns=['ngram','count'])
top_25Posbigram

In [62]:
#plotting top 25 bigrams for positive sentiments
createBarPlot(top_25Posbigram,'bigram','count','top 25 positive bigrams')

In [63]:
#find the top 25 bigrams by frequency among the negative reviews
top_25Negbigram = get_top_ngrams(neg_review.finalReviews.values.astype('U'),n=25,ngram = (2,2))
top_25Negbigram = pd.DataFrame(top_25Negbigram,columns=['ngram','count'])
top_25Negbigram

In [64]:
#plotting top 25 bigrams for negative sentiments
createBarPlot(top_25Negbigram,'bigram','count','top 25 negative bigrams')

In [65]:
#find the top 25 trigrams by frequency among the positive reviews
top_25Postrigram = get_top_ngrams(pos_review.finalReviews.values.astype('U'),n=25,ngram = (3,3))
top_25Postrigram = pd.DataFrame(top_25Postrigram,columns=['ngram','count'])
top_25Postrigram

In [66]:
#plotting top 25 trigrams for positive sentiments
createBarPlot(top_25Postrigram,'trigram','count','top 25 positive trigrams')

In [67]:
#find the top 25 trigrams by frequency among the negative reviews
top_25Negtrigram = get_top_ngrams(neg_review.finalReviews.values.astype('U'),n=25,ngram = (3,3))
top_25Negtrigram = pd.DataFrame(top_25Negtrigram,columns=['ngram','count'])
top_25Negtrigram

In [68]:
#plotting top 25 trigrams for negative sentiments
createBarPlot(top_25Negtrigram,'trigram','count','top 25 negative trigrams')

In [69]:
#let's check the distribution of the user sentiments for all reviews again
#distribution of user sentiments in percentage
# compute the percentages from reviewsData (not the original data frame)
sentiment_counts = reviewsData.user_sentiment.value_counts()
for sentiment, count in sentiment_counts.items():
    print('{0} user sentiments across all reviews are {1}%'.format(sentiment, round(100*count/len(reviewsData.user_sentiment.index),2)))

# 4. Feature Extraction

Convert the raw text to a matrix of TF-IDF features.

**max_df** is used for removing terms that appear too frequently, also known as "corpus-specific stop words".
For example, max_df = 0.95 means "ignore terms that appear in more than 95% of the reviews". The vectorizer below uses max_df = 0.90.

**min_df** is used for removing terms that appear too infrequently.
min_df = 2 means "ignore terms that appear in fewer than 2 reviews".
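
The two thresholds act on a term's document frequency, that is, the number of documents containing it. The following plain-Python sketch mimics that filtering on four made-up documents; it is a simplified illustration rather than scikit-learn's implementation, and `filter_vocabulary` is a hypothetical helper.

```python
def filter_vocabulary(docs, max_df=0.95, min_df=2):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


# made-up documents for illustration only
docs = [
    "product great smell",
    "product bad smell",
    "product great price",
    "product ok",
]
kept = filter_vocabulary(docs, max_df=0.95, min_df=2)
print(kept)
```

Here `product` is dropped because it appears in all four documents (more than 95%), while `bad`, `price` and `ok` are dropped because each appears in only one.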

In [70]:
reviewsData = reviewsData[['id','name','reviews_rating', 'reviews_username', 'reviewsText','finalReviews','user_sentiment']]
print(reviewsData.shape)

In [71]:
#convert categorical value to numerical for user_sentiment
reviewsData.user_sentiment = reviewsData.user_sentiment.map({'Positive':1,'Negative':0})

In [72]:
reviewsData.user_sentiment.value_counts()

In [73]:
#initialise the TfidfVectorizer
tfIdf = TfidfVectorizer(stop_words='english',max_df = 0.90,min_df=2,binary=True,ngram_range=(1,2)) # unigrams and bigrams only

In [74]:
#create the document-term matrix by transforming the finalReviews column of reviewsData
X_train_tfidf = tfIdf.fit_transform(reviewsData.finalReviews)
y = reviewsData.user_sentiment

In [75]:
X_train_tfidf.shape

In [76]:
#print(tfIdf.get_feature_names)

In [77]:
# split the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X_train_tfidf,y,test_size=0.3,random_state=42)
print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

In [78]:
#check the class distribution in the training labels; the uneven distribution of user sentiments needs to be handled

print(Counter(y_train))

In [79]:
#we will use SMOTE to handle class imbalance
sm = SMOTE()
X_train,y_train = sm.fit_resample(X_train,y_train)
print(Counter(y_train))
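
The core idea behind SMOTE is to create new minority-class points by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on made-up vectors; the neighbour search and the sampling strategy that `imblearn` performs are left out, and `synthetic_point` is a hypothetical helper.

```python
import random


def synthetic_point(sample, neighbour, rng):
    """Interpolate a new point on the segment between two minority samples."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)
# made-up feature vectors for illustration only
minority_sample = [0.2, 0.0, 0.5]
nearest_neighbour = [0.4, 0.1, 0.5]
new_point = synthetic_point(minority_sample, nearest_neighbour, rng)
print(new_point)
```

Because the synthetic rows are derived from training rows, resampling only the training split, as the cell above does, keeps the test set free of synthetic data.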

In [80]:
print(X_train.shape)

# 5. Model building using supervised learning

We will build the supervised models below and select the one with the highest accuracy:

1. Logistic regression
2. Decision tree
3. Random forest
4. XGBoost
5. Naive Bayes

In [81]:
#create a model class with all relevant method for model evaluation

class SuperWisedModelBuilder():
    
    #initialize the class object
    def __init__(self, model, X_train,X_test,y_train,y_test):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
    
    # train the model
    def train_model(self):
        self.model.fit(self.X_train,self.y_train)
        return self.model.predict(self.X_test) # use the instance's test set, not the global X_test
    
    #evaluate the model
    def evaluate_model(self,y_pred_class):
        y_pred_prob = self.model.predict_proba(self.X_test)[:,1]
        self.printClassification_report(y_pred_class)
        print('######'*20)
        print('')
        self.metricsResult = self.evaluate_metrics(y_pred_class,y_pred_prob)
        print('')
        print('######'*20)
        print('')
        print("Confusion matrix")
        self.confusion_matrix(y_pred_class)
        print('######'*20)
        print('')
        print('ROC Curve')
        self.plot_roc_curve()
#         print('Precision- Recall Curve')
#         self.plot_precision_recall_vs_thresold(y_pred_prob)
#         print('######'*20)
        return self.metricsResult   

    def printClassification_report(self,y_pred_class):
        print("######"*20)
        print('')
        print('Classification report')
        print(metrics.classification_report(self.y_test, y_pred_class))
    
        
        
    def evaluate_metrics(self,y_pred_class,y_pred_prob):
        metricsResult = dict()
        modelAccuracy = metrics.accuracy_score(self.y_test,y_pred_class)
        modelPrecision = metrics.precision_score(self.y_test,y_pred_class)
        modelRecall = metrics.recall_score(self.y_test,y_pred_class)
        modelRoCAuCScore = metrics.roc_auc_score(self.y_test,y_pred_prob)
        modelf1Score = metrics.f1_score(self.y_test,y_pred_class)
        
        print("model accuracy is {0}%".format(round(100*modelAccuracy,2)))
        print("model precision is {0}%".format(round(100*modelPrecision,2)))
        print("model recall is {0}%".format(round(100*modelRecall,2)))
        print("model f1-score  is {0}%".format(round(100*modelf1Score,2)))
        print("model ROC-AUC score is {0}%".format(round(100*modelRoCAuCScore,2)))
    
        metricsResult['accuracy'] = modelAccuracy
        metricsResult['precision'] = modelPrecision
        metricsResult['recall'] = modelRecall
        metricsResult['f1-score'] = modelf1Score
        metricsResult['ROC-AUC score'] = modelRoCAuCScore
        
        return metricsResult
    
    def confusion_matrix(self, y_pred_class):
        confusion_matrix = metrics.confusion_matrix(self.y_test, y_pred_class)
        self.plot_confusion_matrix(confusion_matrix,self.model.classes_)
        
        
    def plot_confusion_matrix(self,data, classes_):
        sns.set(color_codes=True)
        plt.figure(figsize=(5,4))
        plt.title("Confusion Matrix")
        group_names = ['True Neg','False Pos','False Neg','True Pos']
        group_counts = ['{0:0.0f}'.format(value) for value in
                data.flatten()]
        group_percentages = ['{0:.2%}'.format(value) for value in
                     data.flatten()/np.sum(data)]
        labels = [f'{v1}\n{v2}\n{v3}' for v1, v2 ,v3 in zip(group_names,group_counts,group_percentages)]
        labels = np.asarray(labels).reshape(2,2)
        ax = sns.heatmap(data/np.sum(data), annot=labels, fmt='', cmap='Blues')
        ax.set_xticklabels(classes_)
        ax.set_yticklabels(classes_)
        ax.set(ylabel="True Values", xlabel="Predicted Values")
        plt.show()
    
    def plot_roc_curve(self):
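        # RocCurveDisplay.from_estimator replaces plot_roc_curve, which was removed in scikit-learn 1.2
        metrics.RocCurveDisplay.from_estimator(self.model,self.X_test,self.y_test)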
        
    def plot_precision_recall_vs_thresold(self,y_pred_prob):
        pre, rec, thr = metrics.precision_recall_curve(self.y_test, y_pred_prob)
        plt.figure(figsize=(8,4))
        plt.plot(thr, pre[:-1], label='precision')
        plt.plot(thr, rec[:-1], label='recall') # the last recall value has no matching threshold
        plt.xlabel('Threshold')
        plt.title('Precision & Recall vs Threshold', c='r', size=16)
        plt.legend()
        plt.show() 
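
The scores printed by `evaluate_metrics` can all be derived from the four cells of the confusion matrix. For binary labels 0 and 1, flattening scikit-learn's confusion matrix gives the order TN, FP, FN, TP, which is the order of `group_names` in `plot_confusion_matrix`. The sketch below computes the metrics by hand from made-up counts; `classification_metrics` is a hypothetical helper and the numbers are not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1-score": f1}


# made-up counts for illustration only, not results from this notebook
scores = classification_metrics(tn=50, fp=10, fn=20, tp=120)
print(scores)
```

With a majority positive class, accuracy and the positive-class F1 score can look high even when the negative class is poorly predicted, so the per-class rows of the classification report are worth reading alongside these totals.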
        

In [82]:
def save_object(obj, filename):
    filename = "./"+filename+'.pkl'
    # the with-block closes the file even if pickling fails
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
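
A saved object is only useful if it can be loaded back unchanged, for example by the deployed app. The sketch below shows a pickle round trip in a temporary directory; a plain dictionary stands in for a fitted vectorizer or model, and `save_and_reload` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def save_and_reload(obj, path):
    """Pickle obj to path, then load it back to confirm it round-trips."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)


# a plain dict stands in for a fitted vectorizer or model
stand_in = {"vocabulary": {"great": 0, "smell": 1}, "min_df": 2}
with tempfile.TemporaryDirectory() as tmp_dir:
    restored = save_and_reload(stand_in, os.path.join(tmp_dir, "example.pkl"))
print(restored == stand_in)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that wrote them, so it helps to record the versions used at training time.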

In [83]:
#save the fitted TF-IDF vectorizer for future use
save_object(tfIdf,'tfIdf vector')

# Logistic Regression

In [84]:
model = LogisticRegression(random_state=42,solver='liblinear', class_weight="balanced")
lr_model = SuperWisedModelBuilder(model,X_train,X_test,y_train,y_test)
%time y_pred_class = lr_model.train_model()
result_metrics = lr_model.evaluate_model(y_pred_class)

The F1 score is almost 94% with the resampled data. Let's try hyperparameter tuning.

In [85]:
# hyperparameter tuning
# grid = {"C": [100, 10, 5, 4, 3, 2, 1, 0.1, 0.01],
#                 "solver": ["liblinear"],'penalty' : ['l2']}

# lr_hpm = GridSearchCV(LogisticRegression(random_state=42),
#                                 param_grid=grid,
#                                 cv=4,
#                                 verbose=True,
#                                 n_jobs=-1)

# # Fit the grid search model
# lr_hpm.fit(X_train, y_train)
# #find best estimators
# lr_hpm.best_estimator_

In [86]:
#let's train the model with the best estimator and evaluate it
lr_model_hpt = LogisticRegression(C=100, random_state=42, solver='liblinear')
lr_modelBuilder = SuperWisedModelBuilder(lr_model_hpt,X_train,X_test,y_train,y_test)
%time y_pred_class = lr_modelBuilder.train_model()
lr_result_metrics  = lr_modelBuilder.evaluate_model(y_pred_class)

There is not much difference in the F1 score after hyperparameter tuning, and the F1 score is still on the lower side for class 0. Let's try other models.

In [87]:
save_object(lr_model_hpt,'Logistic Regression')

# Decision Tree

In [88]:
dt = DecisionTreeClassifier(random_state=42,criterion="gini", max_depth=10)
dt_model = SuperWisedModelBuilder(dt,X_train,X_test,y_train,y_test)
%time y_pred_class = dt_model.train_model()
dt_result_metrics = dt_model.evaluate_model(y_pred_class)

The F1 score is lower compared to logistic regression. Let's do hyperparameter tuning.

In [89]:
#hyperparameter tuning
# param = { "max_depth": [3, 5, 10,15],
#            "min_samples_split": np.arange(3, 20, 3),
#            "min_samples_leaf": np.arange(1, 20, 4),
#         "criterion":['gini','entropy']}

# gs = GridSearchCV(DecisionTreeClassifier(random_state=42),param_grid=param,n_jobs=-1,verbose=True,cv=4)
# %time gs.fit(X_train,y_train)
# print(gs.best_estimator_)
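
A grid search tries every combination of the listed values, so its cost is easy to estimate up front. The sketch below enumerates the same value lists as the commented-out grid above, with `range()` standing in for `np.arange`, and multiplies by the number of folds.

```python
from itertools import product

# same value lists as the commented-out grid, with range() in place of np.arange
param_grid = {
    "max_depth": [3, 5, 10, 15],
    "min_samples_split": list(range(3, 20, 3)),
    "min_samples_leaf": list(range(1, 20, 4)),
    "criterion": ["gini", "entropy"],
}
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * 4  # cv=4 folds per candidate
print(n_candidates, n_fits)
```

The result, 240 candidates and 960 fits, matches the search log quoted in the next cell.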

In [90]:
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 7.15 µs
# Fitting 4 folds for each of 240 candidates, totalling 960 fits
# best estimator is DecisionTreeClassifier(max_depth=15, min_samples_split=3, random_state=42)
#train dt model with best parameters
dt_model_hpt = DecisionTreeClassifier(max_depth=15, min_samples_split=3, random_state=42)
dt_modelBuilder = SuperWisedModelBuilder(dt_model_hpt,X_train,X_test,y_train,y_test)
%time y_pred_class = dt_modelBuilder.train_model()
dt_result_metrics = dt_modelBuilder.evaluate_model(y_pred_class)

In [91]:
save_object(dt_model_hpt,'DecisionTree Model')

# Random Forest

In [92]:
rf = RandomForestClassifier(criterion = 'gini',random_state=42,n_jobs=-1)
rf_model = SuperWisedModelBuilder(rf,X_train,X_test,y_train,y_test)
%time y_pred_class = rf_model.train_model()
rf_result_metrics = rf_model.evaluate_model(y_pred_class)

In [93]:
#hyperparameter tuning
# params = {"n_estimators": np.arange(25, 125, 25),
#            "max_depth": [3, 5, 10],
#            "min_samples_split": np.arange(2, 20, 2),
#            "min_samples_leaf": np.arange(1, 20, 2),
#           "criterion":['gini','entropy']}

# rf_hpt = GridSearchCV(RandomForestClassifier(oob_score=True,random_state=42),param_grid=params,cv=4,n_jobs=-1,verbose=True) 
# %time rf_hpt.fit(X_train,y_train)
# #find best estimator
# print(rf_hpt.best_estimator_)

In [94]:
rf_model_hpt = RandomForestClassifier(max_depth=10, min_samples_split=8, oob_score=True,random_state=42)
rf_modelBuilder = SuperWisedModelBuilder(rf_model_hpt,X_train,X_test,y_train,y_test)
%time y_pred_class = rf_modelBuilder.train_model()
rf_result_metrics = rf_modelBuilder.evaluate_model(y_pred_class)

In [95]:
save_object(rf_model_hpt,"Random Forest Classifier")

The overall F1 score looks good, but for the negative class it is not above 50%.

# XGBoost

In [96]:
xgb_ = xgb.XGBClassifier(n_jobs=-1,random_state=42,learning_rate=0.1,objective = 'binary:logistic')
xgb_model = SuperWisedModelBuilder(xgb_,X_train,X_test,y_train,y_test)
%time y_pred_class = xgb_model.train_model()
xgb_metrics = xgb_model.evaluate_model(y_pred_class)

In [97]:
#hyperparameter tuning
# params = {
#         'n_estimators' : [100, 200, 300], # no of trees 
#         'learning_rate' : [0.01, 0.02, 0.05, 0.1, 0.25],  # eta
#         'min_child_weight': [1, 5, 7, 10],
#         'gamma': [0.1, 0.5, 1, 1.5, 5],
#         'subsample': [0.6, 0.8, 1.0],
#         'colsample_bytree': [0.6, 0.8, 1.0],
#         'max_depth': [3, 5, 10]
#         }


# random_search = RandomizedSearchCV(xgb.XGBClassifier(random_state=42), param_distributions=params, n_iter=50,scoring='accuracy', n_jobs=-1, cv=4, verbose=3, random_state=42)
# %time random_search.fit(X_train,y_train)
# #find the best estimator
# print(random_search.best_estimator_)
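
Unlike a grid search, a randomized search fits only a sample of the possible settings. The sketch below enumerates the full grid defined above and draws 50 combinations without replacement using the standard library. It illustrates the idea only: the specific combinations drawn here will differ from the ones scikit-learn samples.

```python
import random
from itertools import product

params = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.02, 0.05, 0.1, 0.25],
    "min_child_weight": [1, 5, 7, 10],
    "gamma": [0.1, 0.5, 1, 1.5, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 10],
}
all_combinations = list(product(*params.values()))
rng = random.Random(42)
sampled = rng.sample(all_combinations, 50)   # n_iter=50, drawn without replacement
sampled_settings = [dict(zip(params, combo)) for combo in sampled]
n_fits = len(sampled_settings) * 4           # cv=4 folds per sampled setting
print(len(all_combinations), len(sampled_settings), n_fits)
```

Out of 8100 possible combinations only 50 are tried, which gives the 200 fits reported in the log that follows.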


Fitting 4 folds for each of 50 candidates, totalling 200 fits
[CV 1/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.05, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.830 total time=  51.4s
[CV 4/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.05, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.897 total time=  46.6s
[CV 2/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.01, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.870 total time= 1.8min
[CV 3/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.01, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.870 total time= 1.8min
[CV 1/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.850 total time= 1.1min
[CV 3/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.921 total time= 1.1min
[CV 1/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=10, min_child_weight=7, n_estimators=300, subsample=1.0;, score=0.878 total time= 2.0min
[CV 3/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=10, min_child_weight=7, n_estimators=300, subsample=1.0;, score=0.946 total time= 2.0min
[CV 1/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.864 total time= 1.2min
[CV 3/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.933 total time= 1.2min
[CV 1/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.849 total time=  50.9s
[CV 3/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.920 total time=  51.2s
[CV 1/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.25, max_depth=10, min_child_weight=10, n_estimators=100, subsample=0.8;, score=0.868 total time=  36.5s
[CV 3/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.25, max_depth=10, min_child_weight=10, n_estimators=100, subsample=0.8;, score=0.932 total time=  37.3s
[CV 1/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.1, max_depth=5, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.846 total time=  44.8s
[CV 3/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.1, max_depth=5, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.920 total time=  45.2s
[CV 1/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=100, subsample=0.8;, score=0.821 total time=  17.5s
[CV 3/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=100, subsample=0.8;, score=0.889 total time=  17.4s
[CV 1/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8;, score=0.820 total time=  21.2s
[CV 3/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8;, score=0.887 total time=  21.7s
[CV 1/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.25, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.863 total time=  28.9s
[CV 3/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.25, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.925 total time=  28.7s
[CV 1/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.879 total time= 1.1min
[CV 3/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.948 total time= 1.1min
[CV 1/4] END colsample_bytree=0.8, gamma=1.5, learning_rate=0.01, max_depth=10, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.819 total time= 2.0min
[CV 3/4] END colsample_bytree=0.8, gamma=1.5, learning_rate=0.01, max_depth=10, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.890 total time= 2.0min
[CV 1/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=5, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.820 total time= 1.3min
[CV 3/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=5, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.892 total time= 1.3min
[CV 1/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.864 total time= 1.9min
[CV 3/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.935 total time= 1.9min
[CV 1/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.02, max_depth=10, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.838 total time= 1.8min
[CV 3/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.02, max_depth=10, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.913 total time= 1.8min
[CV 1/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.877 total time=  46.4s
[CV 3/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.945 total time=  45.3s
[CV 1/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.820 total time=  31.4s
[CV 3/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.889 total time=  31.2s
[CV 1/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.8;, score=0.813 total time=  28.3s
[CV 3/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.8;, score=0.884 total time=  27.9s
[CV 1/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.02, max_depth=3, min_child_weight=7, n_estimators=200, subsample=0.8;, score=0.788 total time=  22.4s
[CV 3/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.02, max_depth=3, min_child_weight=7, n_estimators=200, subsample=0.8;, score=0.854 total time=  23.1s
[CV 1/4] END colsample_bytree=0.6, gamma=1.5, learning_rate=0.01, max_depth=5, min_child_weight=10, n_estimators=100, subsample=1.0;, score=0.779 total time=  20.0s
[CV 3/4] END colsample_bytree=0.6, gamma=1.5, learning_rate=0.01, max_depth=5, min_child_weight=10, n_estimators=100, subsample=1.0;, score=0.820 total time=  19.7s
[CV 1/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.25, max_depth=3, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.864 total time=  40.7s
[CV 3/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.25, max_depth=3, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.929 total time=  41.7s
[CV 1/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=200, subsample=0.8;, score=0.820 total time=  22.8s
[CV 3/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=200, subsample=0.8;, score=0.887 total time=  22.8s
[CV 1/4] END colsample_bytree=0.6, gamma=5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=300, subsample=0.6;, score=0.829 total time=  30.5s
[CV 3/4] END colsample_bytree=0.6, gamma=5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=300, subsample=0.6;, score=0.901 total time=  29.8s
[CV 1/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=10, min_child_weight=10, n_estimators=200, subsample=0.8;, score=0.827 total time= 1.6min
[CV 3/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=10, min_child_weight=10, n_estimators=200, subsample=0.8;, score=0.901 total time= 1.6min
[CV 2/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.05, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.900 total time=  50.7s
[CV 3/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.05, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.900 total time=  46.6s
[CV 1/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.01, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.799 total time= 1.9min
[CV 4/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.01, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.871 total time= 1.8min
[CV 2/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.923 total time= 1.1min
[CV 4/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.922 total time= 1.1min
[CV 2/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=10, min_child_weight=7, n_estimators=300, subsample=1.0;, score=0.945 total time= 2.1min
[CV 4/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=10, min_child_weight=7, n_estimators=300, subsample=1.0;, score=0.946 total time= 2.1min
[CV 2/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.935 total time= 1.2min
[CV 4/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.936 total time= 1.2min
[CV 2/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.921 total time=  51.0s
[CV 4/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8;, score=0.916 total time=  51.6s
[CV 2/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.25, max_depth=10, min_child_weight=10, n_estimators=100, subsample=0.8;, score=0.933 total time=  37.4s
[CV 4/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.25, max_depth=10, min_child_weight=10, n_estimators=100, subsample=0.8;, score=0.937 total time=  37.0s
[CV 2/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.1, max_depth=5, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.920 total time=  45.2s
[CV 4/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.1, max_depth=5, min_child_weight=10, n_estimators=200, subsample=1.0;, score=0.920 total time=  45.8s
[CV 2/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=100, subsample=0.8;, score=0.892 total time=  16.9s
[CV 4/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=100, subsample=0.8;, score=0.887 total time=  18.0s
[CV 2/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8;, score=0.891 total time=  21.6s
[CV 4/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8;, score=0.889 total time=  20.9s
[CV 2/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.25, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.927 total time=  28.5s
[CV 4/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.25, max_depth=3, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.932 total time=  29.0s
[CV 2/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.949 total time= 1.1min
[CV 4/4] END colsample_bytree=0.8, gamma=5, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=300, subsample=0.6;, score=0.947 total time= 1.1min
[CV 2/4] END colsample_bytree=0.8, gamma=1.5, learning_rate=0.01, max_depth=10, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.892 total time= 2.0min
[CV 4/4] END colsample_bytree=0.8, gamma=1.5, learning_rate=0.01, max_depth=10, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.891 total time= 2.0min
[CV 2/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=5, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.892 total time= 1.3min
[CV 4/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=5, min_child_weight=10, n_estimators=300, subsample=0.8;, score=0.892 total time= 1.3min
[CV 2/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.936 total time= 1.9min
[CV 4/4] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.938 total time= 1.9min
[CV 2/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.02, max_depth=10, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.913 total time= 1.9min
[CV 4/4] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.02, max_depth=10, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.914 total time= 1.8min
[CV 2/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.946 total time=  45.2s
[CV 4/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.25, max_depth=5, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.946 total time=  46.3s
[CV 2/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.892 total time=  30.9s
[CV 4/4] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=200, subsample=0.6;, score=0.888 total time=  30.8s
[CV 2/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.8;, score=0.888 total time=  27.4s
[CV 4/4] END colsample_bytree=1.0, gamma=5, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.8;, score=0.882 total time=  28.0s
[CV 2/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.02, max_depth=3, min_child_weight=7, n_estimators=200, subsample=0.8;, score=0.858 total time=  23.2s
[CV 4/4] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.02, max_depth=3, min_child_weight=7, n_estimators=200, subsample=0.8;, score=0.861 total time=  22.3s
[CV 2/4] END colsample_bytree=0.6, gamma=1.5, learning_rate=0.01, max_depth=5, min_child_weight=10, n_estimators=100, subsample=1.0;, score=0.850 total time=  19.7s
[CV 4/4] END colsample_bytree=0.6, gamma=1.5, learning_rate=0.01, max_depth=5, min_child_weight=10, n_estimators=100, subsample=1.0;, score=0.850 total time=  19.1s
[CV 2/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.25, max_depth=3, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.931 total time=  41.5s
[CV 4/4] END colsample_bytree=1.0, gamma=1.5, learning_rate=0.25, max_depth=3, min_child_weight=7, n_estimators=300, subsample=0.6;, score=0.930 total time=  40.9s
[CV 2/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=200, subsample=0.8;, score=0.890 total time=  23.1s
[CV 4/4] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=200, subsample=0.8;, score=0.889 total time=  22.2s
[CV 2/4] END colsample_bytree=0.6, gamma=5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=300, subsample=0.6;, score=0.900 total time=  30.5s
[CV 4/4] END colsample_bytree=0.6, gamma=5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=300, subsample=0.6;, score=0.899 total time=  30.1s
[CV 2/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=10, min_child_weight=10, n_estimators=200, subsample=0.8;, score=0.902 total time= 1.6min
[CV 4/4] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.02, max_depth=10, min_child_weight=10, n_estimators=200, subsample=0.8;, score=0.900 total time= 1.6min
CPU times: user 1min 5s, sys: 261 ms, total: 1min 5s
Wall time: 1h 27min 48s
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.6,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0.1, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.25, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=300,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=42,
              reg_alpha=0, reg_lambda=1, ...)

In [98]:
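# Rebuild the model with the tuned values. Note: the best estimator printed above also reports colsample_bytree=0.6, which is left at its default here.
xgb_model_hpt = xgb.XGBClassifier(n_estimators=300,random_state=42, gamma=0.1,learning_rate=0.25,max_depth=5)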
xgb_modelBuilder = SuperWisedModelBuilder(xgb_model_hpt,X_train,X_test,y_train,y_test)
%time y_pred_class = xgb_modelBuilder.train_model()
xgb_result_metrics = xgb_modelBuilder.evaluate_model(y_pred_class)

In [99]:
save_object(xgb_model_hpt,'XGBooster')

# Naive Bayes Model

In [100]:
nb = MultinomialNB()
nb_model = SuperWisedModelBuilder(nb,X_train,X_test,y_train,y_test)
y_pred_class= nb_model.train_model()
nb_result_metrics = nb_model.evaluate_model(y_pred_class)

In [101]:
save_object(nb,'Naive Byes')

In [102]:
# Create a metrics dataframe comparing all models
metricsDic = {'metricsname':['accuracy','precision','recall','f1-score','ROC-AUC score']}
models = ['Logistic Regression','Decision Tree','Random Forest','XGBoost','Naive Bayes']
modelMetrics = ['lr_result_metrics','dt_result_metrics','rf_result_metrics','xgb_result_metrics','nb_result_metrics']
metricResult = pd.DataFrame(metricsDic)
metricResult['Logistic Regression'] = metricResult.metricsname.map(lr_result_metrics).apply(lambda x : str(round(x * 100,2))+'%')
metricResult['Decision Tree'] = metricResult.metricsname.map(dt_result_metrics).apply(lambda x : str(round(x * 100,2))+'%')
metricResult['Random Forest'] = metricResult.metricsname.map(rf_result_metrics).apply(lambda x : str(round(x * 100,2))+'%')
metricResult['XGBoost'] = metricResult.metricsname.map(xgb_result_metrics).apply(lambda x : str(round(x * 100,2))+'%')
metricResult['Naive Bayes'] = metricResult.metricsname.map(nb_result_metrics).apply(lambda x : str(round(x * 100,2))+'%')
metricResult

#### From the table above, XGBoost has the highest F1 score and accuracy, so we will use the XGBoost model for our system.

# 6. Recommendation System
#### There are various methods for building a recommendation system. We consider two:

1. User-based recommendation system
2. Item-based recommendation system

We will build both and find out which one works better for our problem.

In [103]:
reviewsData.shape

In [104]:
#create a new dataframe for recommendation system
recommData = reviewsData[['id','reviews_rating','reviews_username']]
recommData.shape

In [105]:
#check for null values in the dataframe
checkNaNvalues(recommData)

In [106]:
dframe_alignment(reviewsData.sample(n=10))

In [107]:
# Divide the data into train and test sets
train,test = train_test_split(recommData,test_size=0.3,random_state=42)
print(train.shape)
print(test.shape)

In [108]:
# Pivot the train ratings into a matrix with usernames as rows and product ids as columns.
recommData_pv = train.pivot_table(index ='reviews_username',columns='id',values='reviews_rating').fillna(0)
recommData_pv.head(10)

### Creating dummy train and dummy test datasets. These datasets will be used for prediction and evaluation

**Dummy train will be used later to predict ratings for products the user has not rated. To ignore the products the user has already rated, we mark them as 0 during prediction. Products not rated by the user are marked as 1 in the dummy train dataset.**

**Dummy test will be used for evaluation. To evaluate, we only make predictions for the products the user has rated, so these are marked as 1. This is just the opposite of dummy_train.**
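The masking idea can be sketched on a tiny example. The ratings, predicted scores and names below (`ratings`, `predicted`, `user_a`, `p1`, and so on) are hypothetical and only illustrate how the two masks select complementary cells; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: NaN means the user has not rated the product.
ratings = pd.DataFrame(
    {"p1": [5.0, np.nan], "p2": [np.nan, 3.0], "p3": [4.0, np.nan]},
    index=["user_a", "user_b"],
)

# Dummy train mask: 1 where the product is unrated (a candidate to recommend), 0 where rated.
dummy_train_mask = ratings.isna().astype(int)

# Dummy test mask: the opposite, 1 only where a rating exists.
dummy_test_mask = ratings.notna().astype(int)

# Hypothetical predicted scores for every user-product pair.
predicted = pd.DataFrame(
    [[2.0, 1.5, 3.0], [0.5, 2.5, 1.0]],
    index=ratings.index,
    columns=ratings.columns,
)

# Element-wise multiplication zeroes out the cells we want to ignore.
candidates = predicted * dummy_train_mask  # scores for unrated products only
evaluable = predicted * dummy_test_mask    # scores for rated products only

print(candidates)
print(evaluable)
```

Because the two masks are complements, every cell survives in exactly one of `candidates` and `evaluable`.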

In [109]:
# Copy the train dataset into dummy_train
dummy_train = train.copy()
dummy_train.shape

In [110]:
dummy_train.loc[dummy_train.id =='AVpfm8yiLJeJML43AYyu']

In [111]:
# Products already rated by the user are marked as 0; unrated products are filled with 1 when the matrix is built below.
dummy_train.reviews_rating = dummy_train.reviews_rating.swifter.apply(lambda x : 0 if x >= 1 else 1 )

In [112]:
dummy_train.loc[dummy_train.id =='AVpfm8yiLJeJML43AYyu']

In [113]:
# Convert the dummy train dataset into a matrix with usernames as rows and product ids as columns; unrated pairs are filled with 1.
dummy_train_pv = dummy_train.reset_index().pivot_table(index='reviews_username',columns='id',values='reviews_rating').fillna(1)
dummy_train_pv.head(5)

In [114]:
dummy_train_pv.shape

## User-user based similarity

**Cosine Similarity**

Cosine similarity measures how similar two vectors are (here, the rating vectors of two users) as the cosine of the angle between them.

**Adjusted Cosine**

Adjusted cosine similarity is a modified version of vector-based similarity that accounts for the fact that different users have different rating schemes: some users rate items highly in general, while others tend to give lower ratings. To correct for this, we subtract each user's average rating from that user's ratings of the different products.
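As a rough illustration of why the mean-centering matters, the sketch below compares plain and adjusted cosine similarity on three hypothetical users (`generous`, `strict`, `contrary`) with made-up ratings. It assumes scikit-learn's `cosine_similarity`, which is the function used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: "generous" and "strict" share the same taste but use
# different parts of the scale; "contrary" has the opposite taste.
ratings = pd.DataFrame(
    {"p1": [5.0, 2.0, 1.0], "p2": [4.0, 1.0, 5.0], "p3": [5.0, 2.0, 1.0]},
    index=["generous", "strict", "contrary"],
)

# Plain cosine similarity on the raw rating vectors.
plain = cosine_similarity(ratings.fillna(0))

# Adjusted cosine: subtract each user's own mean rating first.
user_means = np.nanmean(ratings, axis=1)
centered = ratings.sub(user_means, axis=0)
adjusted = cosine_similarity(centered.fillna(0))

print(pd.DataFrame(plain, index=ratings.index, columns=ratings.index).round(3))
print(pd.DataFrame(adjusted, index=ratings.index, columns=ratings.index).round(3))
```

On these toy numbers, plain cosine still rates `generous` and `contrary` as fairly similar because all ratings are positive, while the adjusted version separates them once each user's average is removed.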

In [115]:
recommData_pv.index.nunique()

In [116]:
# Creating the User Similarity Matrix using cosine_similarity function.
user_correlation_cs = cosine_similarity(recommData_pv)
user_correlation_cs[np.isnan(user_correlation_cs)] = 0
print(user_correlation_cs.shape)
print()
print(user_correlation_cs)

In [117]:
# using Adjusted Cosine similarity
# Here we keep the NaN values so that the mean is calculated only over the products rated by the user
# Create a user-product matrix.
recommData_adpv = train.reset_index().pivot_table(index='reviews_username', columns='id',values='reviews_rating')

In [118]:
print(recommData_adpv.shape)
print()
recommData_adpv.head(10)

In [119]:
recommData_adpv.index.nunique()

In [120]:
# Normalising the rating of the product for each user around 0 mean
mean = np.nanmean(recommData_adpv, axis=1)
recommData_adpv_subtracted = (recommData_adpv.T-mean).T

In [121]:
recommData_adpv_subtracted.head(10)

In [122]:
# Creating the User Similarity Matrix using cosine_similarity function.
user_correlation_adcs = cosine_similarity(recommData_adpv_subtracted.fillna(0))
user_correlation_adcs[np.isnan(user_correlation_adcs)] = 0
print(user_correlation_adcs.shape)
print()
print(user_correlation_adcs)

### Prediction - User User

*We make predictions only from users who are positively related to the current user, not from users who are negatively related, because we are interested in the users most similar to the current user. Correlation values less than 0 are therefore set to 0.*
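A minimal sketch of this prediction step, using a hypothetical 3-user similarity matrix and a hypothetical rating matrix (0 stands for "not rated", as after `fillna(0)`). Negative similarities are clipped to zero and each predicted score is then a similarity-weighted sum of ratings.

```python
import numpy as np

# Hypothetical user-user similarities (row i compared with column j).
similarity = np.array([
    [1.0, 0.8, -0.4],
    [0.8, 1.0, 0.1],
    [-0.4, 0.1, 1.0],
])

# Hypothetical ratings for two products; 0 means the user has not rated it.
ratings = np.array([
    [5.0, 0.0],
    [4.0, 2.0],
    [1.0, 5.0],
])

# Ignore negatively related users by setting their similarity to 0.
positive_similarity = np.where(similarity < 0, 0.0, similarity)

# Each score is the weighted sum of all users' ratings for that product.
predicted = positive_similarity @ ratings
print(predicted)
```

These scores are unnormalised weighted sums, so they are useful for ranking products but are not on the original 1 to 5 rating scale.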

In [123]:
user_correlation_adcs[user_correlation_adcs<0]=0
user_correlation_adcs

In [124]:
# The predicted rating for a user (for products rated as well as not rated) is the weighted sum of the correlations with the product ratings (as present in the rating dataset).
print(user_correlation_adcs.shape)
print(recommData_adpv.shape)

In [125]:
user_predicted_ratings = np.dot(user_correlation_adcs, recommData_adpv.fillna(0))
user_predicted_ratings

In [126]:
print(user_predicted_ratings.shape)

In [127]:
#Since we are interested only in the products not rated by the user, we will ignore the products rated by the user by making it zero. 
user_final_rating = np.multiply(user_predicted_ratings,dummy_train_pv)
user_final_rating.head()

In [128]:
print(user_final_rating.shape)

In [129]:
save_object(user_final_rating , 'user_user_based_recommendations')

#### Find the top 10 recommendations for a user


In [130]:
userId = '00sab00'
recommendations  = user_final_rating.loc[userId].sort_values(ascending=False)[0:10]
recommendations

In [131]:
# Display the product id, name and similarity score of the top 10 recommendations
final_recommendations = pd.DataFrame({'id': recommendations.index, 'similarity_score': recommendations.values})
pd.merge(final_recommendations, reviewsData, on="id")[["id", "name", "similarity_score"]].drop_duplicates()

### Evaluation user-user

Evaluation follows the same steps as the prediction above. The only difference is that we evaluate on the products already rated by the user instead of predicting for the products not rated by the user.

In [132]:
print(test.shape)

In [133]:
## Find out the common users of test and train dataset.
common = test[test.reviews_username.isin(train.reviews_username)]
common.shape

In [134]:
common.head()

In [135]:
# Convert into the user-product matrix.
common_user_based_matrix = common.pivot_table(index='reviews_username', columns='id', values='reviews_rating')
common_user_based_matrix.head(5)

In [136]:
print(common_user_based_matrix.shape)

In [137]:
# Now filter the correlations down to the users that are common to both the test and train datasets. To do this, first convert the user correlation matrix into a dataframe.
user_correlation_adcs_df = pd.DataFrame(user_correlation_adcs)
user_correlation_adcs_df.head()

In [138]:
recommData_adpv_subtracted.index

In [139]:
#set userid as index
user_correlation_adcs_df['userId'] = recommData_adpv_subtracted.index
user_correlation_adcs_df.set_index('userId',inplace=True)
user_correlation_adcs_df.head()

In [140]:
print(user_correlation_adcs_df.shape)

In [141]:
#filter out the correlation of only those users that are common in both test and train datasets
list_name = common.reviews_username.to_list()
user_correlation_adcs_df.columns = recommData_adpv_subtracted.index.tolist()
user_correlation_adcs_df_1 =  user_correlation_adcs_df[user_correlation_adcs_df.index.isin(list_name)]


In [142]:
print(user_correlation_adcs_df_1.shape)

In [143]:
user_correlation_adcs_df_1.head(5)

In [144]:
user_correlation_adcs_df_2 = user_correlation_adcs_df_1.T[user_correlation_adcs_df_1.T.index.isin(list_name)]
user_correlation_adcs_df_3 = user_correlation_adcs_df_2.T
print(user_correlation_adcs_df_3.shape)

In [145]:
user_correlation_adcs_df_3.head(3)

In [146]:
#replace correlation < 0
user_correlation_adcs_df_3[user_correlation_adcs_df_3<0]  = 0
common_user_predicted_ratings = np.dot(user_correlation_adcs_df_3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings

In [147]:
dummy_test = common.copy()
dummy_test['reviews_rating'] = dummy_test['reviews_rating'].swifter.apply(lambda x: 1 if x>=1 else 0)
dummy_test_pv = dummy_test.pivot_table(index='reviews_username', columns='id', values='reviews_rating').fillna(0)
dummy_test_pv.head(5)

In [148]:
print(dummy_test_pv.shape)

In [149]:
common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test_pv)
common_user_predicted_ratings.head()

### Calculating the RMSE only for the products rated by the user. For RMSE, the predicted ratings are normalised to the (1, 5) range.

In [150]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = common_user_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))
print(y)

In [151]:
common_ = common.pivot_table(index='reviews_username', columns='id', values='reviews_rating')
# Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))
rmse = (sum(sum((common_ - y )**2))/total_non_nan)**0.5
print(rmse)
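The RMSE computation above can be illustrated on a small example. The `actual` and `predicted` arrays are hypothetical; NaN marks user-product pairs without a rating, and only the rated pairs contribute to the error.

```python
import numpy as np

# Hypothetical actual ratings; NaN means the user did not rate the product.
actual = np.array([
    [5.0, np.nan],
    [np.nan, 3.0],
    [4.0, 1.0],
])

# Hypothetical predictions, assumed to be already rescaled to the 1 to 5 range.
predicted = np.array([
    [4.0, 2.0],
    [3.0, 3.0],
    [5.0, 3.0],
])

# Keep only the cells where an actual rating exists.
rated = ~np.isnan(actual)
squared_errors = (actual[rated] - predicted[rated]) ** 2

# Root of the mean squared error over the rated cells.
rmse = np.sqrt(squared_errors.mean())
print(rmse)
```

Dividing by the number of rated cells rather than the full matrix size keeps unrated pairs from diluting the error.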

## Item-item based similarity

In [152]:
#item based matrix is nothing but transpose of rating matrix
recommData_i2_adpv = train.pivot_table(index ='reviews_username',columns='id',values='reviews_rating').T
recommData_i2_adpv.head(10)

In [153]:
recommData_i2_adpv.shape

### **Normalising the rating of each product for the Adjusted Cosine**

In [154]:
mean = np.nanmean(recommData_i2_adpv, axis=1)
recommData_i2_adpv_substracted = (recommData_i2_adpv.T-mean).T

In [155]:
#Finding the cosine similarity using cosine similarity function
# Item Similarity Matrix
item_correlation_adcs = cosine_similarity(recommData_i2_adpv_substracted.fillna(0))
item_correlation_adcs[np.isnan(item_correlation_adcs)] = 0
print(item_correlation_adcs.shape)
print()
print(item_correlation_adcs)


In [156]:
#Filtering the correlation only for which the value is greater than 0. (Positively correlated)
item_correlation_adcs[item_correlation_adcs<0]=0
item_correlation_adcs

### **Prediction item-item**

In [157]:
print(item_correlation_adcs.shape)
print(recommData_i2_adpv.shape)

In [158]:
item_predicted_ratings = np.dot((recommData_i2_adpv.fillna(0).T),item_correlation_adcs)
item_predicted_ratings

In [159]:
print(item_predicted_ratings.shape)
print(dummy_train_pv.shape)

In [160]:
#Filtering the rating only for the products not rated by the user for recommendation
item_final_rating = np.multiply(item_predicted_ratings,dummy_train_pv)
item_final_rating.head()

In [161]:
save_object(item_final_rating,'item_item_based_recommendations')

### Finding the top 10 recommendations for the user

In [162]:
userId = '00sab00'
recommendations_ = item_final_rating.loc[userId].sort_values(ascending=False)[0:10]
recommendations_

In [163]:
# Display the product id, name and similarity score of the top 10 recommendations
final_recommendations = pd.DataFrame({'id': recommendations_.index, 'similarity_score': recommendations_.values})
pd.merge(final_recommendations, reviewsData, on="id")[["id", "name", "similarity_score"]].drop_duplicates()

### Evaluation Item-Item 

In [164]:
test.columns

In [165]:
print(test.shape)

In [166]:
common_ = test[test.id.isin(train.id)]
common_.shape

In [167]:
common_.head(5)

In [168]:
common_item_based_matrix = common_.pivot_table(index='reviews_username',columns='id',values='reviews_rating').T
common_item_based_matrix.head(5)

In [169]:
print(common_item_based_matrix.shape)

In [170]:
item_correlation_df = pd.DataFrame(item_correlation_adcs)
print(item_correlation_df.shape)

In [171]:
item_correlation_df.head(5)

In [172]:
item_correlation_df['id'] = recommData_i2_adpv_substracted.index
item_correlation_df.set_index('id',inplace=True)
item_correlation_df.head()

In [173]:
#filter out the correlation of only those ids that are common in both test and train datasets
list_name = common_.id.to_list()
item_correlation_df.columns = recommData_i2_adpv_substracted.index.tolist()
item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]
print(item_correlation_df.shape)
item_correlation_adcs_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]
item_correlation_adcs_df_3 = item_correlation_adcs_df_2.T
print(item_correlation_adcs_df_3.shape)

In [174]:
print(common_item_based_matrix.shape)

In [175]:
#replace correlation < 0
item_correlation_adcs_df_3[item_correlation_adcs_df_3<0]=0
common_item_predicted_ratings = np.dot(item_correlation_adcs_df_3, common_item_based_matrix.fillna(0))
print(common_item_predicted_ratings.shape)

#### Dummy test will be used for evaluation. To evaluate, we only make predictions for the products rated by the user, so these are marked as 1. This is just the opposite of dummy_train.

In [176]:
dummy_test = common_.copy()

# Products not rated are marked as 0 for evaluation. Then build the item-item matrix representation.
dummy_test['reviews_rating'] = dummy_test['reviews_rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test_pv = dummy_test.pivot_table(index='reviews_username', columns='id', values='reviews_rating').T.fillna(0)

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test_pv)
print(common_item_predicted_ratings.shape)
common_item_predicted_ratings.head(3)

### Calculate RMSE

In [177]:
X  = common_item_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

In [178]:
common_pv = common_.pivot_table(index='reviews_username',columns='id',values='reviews_rating').T
common_pv.head(5)

In [179]:
# Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))
rmse = (sum(sum((common_pv - y )**2))/total_non_nan)**0.5
print(rmse)

## The RMSE for the user-user based recommendation system is lower than for the item-item based recommendation system, so we will use the user-user based recommendation system.

# 7. Tuning of recommendation model

### Top 20 recommendations and filtering by the sentiment model

In [180]:
#userId = '08dallas'
userId = '00sab00'
recommendations  = user_final_rating.loc[userId].sort_values(ascending=False)[0:20]
recommendations = pd.DataFrame({'id': recommendations.index, 'similarity_score' : recommendations})
recommendations = recommendations.reset_index(drop=True)
recommendations

In [181]:
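# Score every review of the recommended products with the tuned sentiment model
temp = reviewsData[reviewsData.id.isin(list(recommendations.id))].copy()  # copy so the new column is not written to a view
X= tfIdf.transform(temp.finalReviews.values.astype(str))
temp['predicted_sentiment'] = xgb_model_hpt.predict(X)
temp.head(10)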

In [182]:
df = pd.merge(recommendations, temp, on="id").drop_duplicates()
temp = df.drop(columns=['id','reviews_username','reviewsText','finalReviews','similarity_score'])
df_grouped  = temp.groupby('name', as_index=False).count()
df_grouped['similarity_score'] = df_grouped.name.apply(lambda x: df[(df.name==x) & (df['predicted_sentiment']==1)]["similarity_score"].median())
df_grouped["pos_review_count"] = df_grouped.name.apply(lambda x: df[(df.name==x) & (df['predicted_sentiment']==1)]["predicted_sentiment"].count())
df_grouped["total_review_count"] = df_grouped['predicted_sentiment']
df_grouped['pos_sentiment_percent'] = np.round(df_grouped["pos_review_count"]/df_grouped["total_review_count"]*100,2)
df_grouped.sort_values('pos_sentiment_percent', ascending=False)
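The filtering step above boils down to computing, per product, the share of reviews predicted as positive and ranking by it. A compact sketch on hypothetical data follows; the product names, sentiment labels and the cutoff of 2 are made up for illustration, and the named aggregation is one possible way to write it rather than the method used in this notebook.

```python
import pandas as pd

# Hypothetical predicted sentiments (1 = positive, 0 = negative) per review.
reviews = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "predicted_sentiment": [1, 1, 0, 1, 0, 1, 1, 1, 1],
})

# Count positive and total reviews per product in one pass.
summary = (
    reviews.groupby("name")["predicted_sentiment"]
    .agg(pos_review_count="sum", total_review_count="count")
    .reset_index()
)

# Percentage of reviews predicted as positive.
summary["pos_sentiment_percent"] = (
    summary["pos_review_count"] / summary["total_review_count"] * 100
).round(2)

# Keep the products with the highest positive share (cutoff is an assumption).
top = summary.sort_values("pos_sentiment_percent", ascending=False).head(2)
print(top)
```

Products with very few reviews can reach a high percentage by chance, so a minimum review count may be worth considering alongside the percentage.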

In [183]:
X_sample = tfIdf.transform(["FInest Product"])
y_pred_sample = xgb_model_hpt.predict(X_sample)
y_pred_sample

In [184]:
X_sample = tfIdf.transform(["Bad content,not worth buying"])
y_pred_sample = xgb_model_hpt.predict(X_sample)
y_pred_sample

In [185]:
df_grouped.name.apply(lambda x: df[(df.name==x) & (df['predicted_sentiment']==1)]["similarity_score"].median())