Ebuss is an e-commerce company that sells products in various categories, such as household essentials, books, personal care products, medicines, cosmetic items, beauty products, electrical appliances, kitchen and dining products, and health care products.

Ebuss wants to compete with other players and expand its footprint in the market, so it has asked for a model that improves the recommendations given to users based on their past reviews and ratings.

Hence, it is recommended to build a sentiment-based product recommendation system, which includes the following tasks.

**1) Data sourcing and sentiment analysis**

**2) Building a recommendation system**

**3) Improving the recommendations using the sentiment analysis model**

**4) Deploying the end-to-end project with a user interface**
 

#### <p style="font-family: Arial; font-size:1.5em;color:DeepPink;">Task 1: Data sourcing and sentiment analysis</p>

**1) Data cleaning**

**2) Text preprocessing**

**3) Exploratory Data Analysis**

**4) Feature extraction**

**5) Model Building and Evaluation**

## <span style="color:Orange">Import and Install useful packages</span>

In [4]:
!pip install textblob
!pip install wordcloud
!pip install catboost
!pip install pycrf
!pip install sklearn-crfsuite
!pip install gensim
!pip install gunicorn

# Libraries required for data reading & data visualization
import json
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
from collections import Counter


# Libraries loadings for EDA
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS


# Libraries loading for input text preprocessing
import re, nltk, spacy, string
nlp = spacy.load("en_core_web_sm")

# from scikit-learn and NLP libraries 
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
import imblearn
from imblearn.over_sampling import SMOTE



# Libraries for machine learning models
import nltk
nltk.download('averaged_perceptron_tagger')
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler


from gensim.models.nmf import Nmf
from gensim.corpora.dictionary import Dictionary
from operator import itemgetter
from gensim.models.coherencemodel import CoherenceModel

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, f1_score, classification_report

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

Collecting gunicorn
  Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)
     -------------------------------------- 79.5/79.5 kB 260.5 kB/s eta 0:00:00
Installing collected packages: gunicorn
Successfully installed gunicorn-20.1.0


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\RGhogare\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
import warnings
warnings.filterwarnings('ignore')

##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 1.1: Reading the input data, data cleaning and pre-processing</p>

In [None]:
# Loading and reading data
df_sbprs = pd.read_csv("sample30.csv")

df_sbprs.head()

In [None]:
# Analyze the columns
df_sbprs.info()

In [None]:
#print number of rows and columns
df_sbprs.shape

**Number of rows are : 30,000**

**Number of Columns are: 15**

In [None]:
df_sbprs.describe()

#### Load the attribute description data file

In [None]:
df_sbprs_description = pd.read_csv("Data+Attribute+Description.csv", encoding='unicode_escape')
df_sbprs_description

In [None]:
# Inspect missing data % in different columns
round(100*df_sbprs.isna().sum()/len(df_sbprs),2)

The following three attributes (features) are needed to build the sentiment analysis model:

**1:reviews_title**

**2:reviews_text**

**3:user_sentiment**


#### The reviews_title column has about 0.63% missing values, which we replace with blanks

In [None]:
# In Reviews_title replace missing values with blank
df_sbprs['reviews_title'] = df_sbprs['reviews_title'].fillna("")

# Drop the missing value in other two columns viz.reviews_text and user_sentiment  if any 
df_sbprs.dropna(subset=['reviews_text'], inplace=True)
df_sbprs.dropna(subset=['user_sentiment'], inplace=True)

In [None]:
# Inspect missing data % in different columns
round(100*df_sbprs.isna().sum()/len(df_sbprs),2)

In [None]:
df_sbprs.shape

**This shows there was 1 missing value in user_sentiment, and that row has been dropped**

#### Inspecting reviews_text and reviews_title

In [None]:
df_sbprs['reviews_text'].head()

In [None]:
df_sbprs['reviews_title'].head()

#### Now let's combine reviews_title and reviews_text

In [None]:
# Combine the two columns 'reviews_text' and 'reviews_title', since both depict the reviewer's sentiment, into a new column named reviews_combine
df_sbprs['reviews_combine'] = df_sbprs['reviews_text'] + ' ' + df_sbprs['reviews_title']
df_sbprs.head(1)

#### Now keep only two columns in the dataframe: user_sentiment and reviews_combine

In [None]:
# .copy() avoids SettingWithCopyWarning when the column is overwritten later
df_sbprs_final = df_sbprs[['user_sentiment', 'reviews_combine']].copy()

In [None]:
df_sbprs_final.head(1)

In [None]:
df_sbprs_final.shape

In [None]:
df_sbprs_final.dtypes

#### Before modelling, check whether the dataset is imbalanced

In [None]:
df_sbprs_final['user_sentiment'].value_counts()

In [None]:
sns.countplot(x=df_sbprs_final['user_sentiment'])

**This is a typical case of an imbalanced dataset. We will handle this during model building.**
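
One common way to account for such imbalance, besides oversampling with SMOTE (imported above), is to weight each class inversely to its frequency. The sketch below is only an illustration: the label counts are made up (not the real dataset's), the function name `balanced_class_weights` is hypothetical, and the formula is the usual `n_samples / (n_classes * class_count)` heuristic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels with a 9:1 imbalance (illustrative only, not the real counts)
toy_labels = ["Positive"] * 90 + ["Negative"] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

The minority class receives the larger weight, so its errors count more during training.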

##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 1.2:Text preprocessing</p>

## Prepare the text for sentiment analysis

Once the rows with missing reviews are removed, the text needs to be cleaned:

* Make the text lowercase
* Remove punctuation

After these cleaning operations, perform the following:
* Lemmatize the texts
* Extract the POS tags of the lemmatized text and remove all words whose tag is not `NN` (`tag == "NN"`).
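
The noun filter in the last step could be sketched as follows. This is a minimal illustration that works on already tagged tokens; the helper name `keep_nouns` and the hand-tagged sentence are hypothetical. With spaCy, the `(lemma, tag)` pairs could be built from `token.lemma_` and `token.tag_`.

```python
def keep_nouns(tagged_tokens):
    """Keep only the lemmas whose fine-grained POS tag is exactly 'NN'."""
    return " ".join(lemma for lemma, tag in tagged_tokens if tag == "NN")

# Hand-tagged toy sentence (hypothetical); in the notebook the pairs could come from
# [(tok.lemma_, tok.tag_) for tok in nlp(text)]
tagged = [("this", "DT"), ("shampoo", "NN"), ("smell", "VBZ"),
          ("great", "JJ"), ("bottle", "NN")]
print(keep_nouns(tagged))
```

Note that filtering on `NN` alone also drops plural nouns (`NNS`) and proper nouns (`NNP`).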

In [None]:
def text_process(inputtext):
    '''
    Clean raw review text: lowercase it and strip punctuation.
    '''
    inputtext = inputtext.lower()  # Make the text lowercase

    # Remove punctuation: build the character class from string.punctuation
    inputtext = re.sub(r'[%s]' % re.escape(string.punctuation), '', inputtext)

    return inputtext

In [None]:
# Applying "text_process" on feature columns 
df_sbprs_final['reviews_combine'] = df_sbprs_final['reviews_combine'].apply(text_process)

In [None]:
df_sbprs_final.head()

In [None]:
#### Follow a similar procedure to lemmatize the texts

In [None]:
def lemma_data(data):
    '''
    Lemmatize a text with spaCy and return the lemmas as one string.
    '''
    store_lemms = []  # list to collect the lemmas

    # Extract the lemma of each token and add it to 'store_lemms'
    document_text = nlp(data)
    for word in document_text:
        store_lemms.append(word.lemma_)

    return " ".join(store_lemms)  # lemmas joined into a single string

In [None]:
df_sbprs_final["reviews_combine"] =  df_sbprs_final.apply(lambda x: lemma_data(x['reviews_combine']), axis=1)

# Check the dataframe
df_sbprs_final.head()

##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 1.3:Exploratory data analysis (EDA)</p>

This task performs the following:

*   Visualise the data according to the review character length
*   Using a word cloud, find the top 40 words by frequency among all the reviews after processing the text

In [None]:
# Get the character length of each review in the reviews_combine column
length_doc = [len(i) for i in df_sbprs_final['reviews_combine']]
length_doc[:15]

In [None]:
## Visualize the data 
from matplotlib.pyplot import figure
figure(num=None, figsize=(30, 30))
font = {'family' : 'Times New Roman',
        'weight' : 'bold',
        'size'   : 50}
plt.rc('font', **font)
sns.set_style("whitegrid")
sns.set(font_scale = 3)

sns.histplot(length_doc,bins=50)
plt.title('Distribution of Reviews Character Length')
plt.ylabel('No. of Reviews')
plt.xlabel('Review character length')
plt.show()

#### The above distribution is right-skewed.

**Now we find the top 40 words by frequency among all the reviews**

In [None]:
stopwords = set(STOPWORDS)
wordcloud = WordCloud(
                          background_color='White',
                          stopwords=stopwords,
                          width=1200, height=1000,
                          max_words=40,
                          max_font_size=40, 
                          random_state=42
                         ).generate(' '.join(df_sbprs_final['reviews_combine']))

fig = plt.figure(figsize=(15,15))
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Saving this data 
pickle.dump(df_sbprs_final, open('pickle/processed_data.pkl', 'wb'))

##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 1.4:Feature Extraction</p>

Convert the raw texts to a matrix of TF-IDF features

In [None]:
# Initiate the TfidfVectorizer 

tfidf=TfidfVectorizer(stop_words='english')

#### Create a document term matrix using fit_transform

The contents of a document-term matrix are tuples of (review_id, token_id) with their tf-idf score.
Tuples that are absent have a tf-idf score of 0.
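
To make the sparse layout concrete, here is a plain-Python sketch on two toy reviews (hypothetical data). It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation, which is the scheme scikit-learn documents for its default settings. It skips the vectorizer's tokenisation and stop-word handling, so the scores will not match the notebook's matrix.

```python
import math
from collections import Counter

def tfidf_sparse(docs):
    """Return ({(doc_id, token_id): score}, vocabulary) with L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = {tok: i for i, tok in enumerate(sorted({t for d in tokenised for t in d}))}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for d in tokenised for tok in set(d))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    matrix = {}
    for doc_id, tokens in enumerate(tokenised):
        raw = {tok: cnt * idf[tok] for tok, cnt in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        if norm == 0:
            continue  # empty document: no entries are stored
        for tok, val in raw.items():
            matrix[(doc_id, vocab[tok])] = val / norm
    return matrix, vocab

toy_reviews = ["good soap good smell", "bad smell"]
matrix, vocab = tfidf_sparse(toy_reviews)
for key in sorted(matrix):
    print(key, round(matrix[key], 3))
```

Only the (review, token) pairs that actually occur are stored; every other cell is implicitly 0.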

In [None]:
df_sbprs_X_train_tfidf=tfidf.fit_transform(df_sbprs_final['reviews_combine'])

In [None]:
print(df_sbprs_X_train_tfidf)

In [None]:
# Convert the tf-idf sparse matrix to a dense array
print(df_sbprs_X_train_tfidf.toarray())

In [None]:
# Saving this data 
pickle.dump(tfidf.vocabulary_, open("pickle/tfidf_vocab.pkl","wb"))

In [None]:
# Get the feature matrix and the target variable
X = df_sbprs_X_train_tfidf
y = df_sbprs_final['user_sentiment']

In [None]:
#  Train-Test split with 70% training set and 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)


print("X_train Shape {0}:".format(X_train.shape))
print("y_train Shape {0}:".format(y_train.shape))
print("X_test Shape {0}:".format(X_test.shape))
print("y_test Shape {0}:".format(y_test.shape))

##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 1.5:Model Building</p>

### Because of the class imbalance, we use the weighted-average F1 score as the evaluation metric when comparing models
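
As a rough illustration of what the weighted average does, the sketch below computes per-class F1 by hand and averages it with weights proportional to each class's support. The labels and predictions are toy values, and the helper names are hypothetical.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support."""
    support = Counter(y_true)
    return sum(f1_per_class(y_true, y_pred, lbl) * n / len(y_true)
               for lbl, n in support.items())

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = set(y_true)
    return sum(f1_per_class(y_true, y_pred, lbl) for lbl in labels) / len(labels)

# Toy case: a classifier that always predicts the majority class
y_true = ["Positive"] * 8 + ["Negative"] * 2
y_pred = ["Positive"] * 10
print(round(weighted_f1(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

The weighted score follows the class proportions of the test set, so it is worth reading the per-class rows of the classification report alongside it.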

In [None]:
def evalution_different_model(y_test, y_pred, model_name):
    '''
    Print the classification report and plot the confusion matrix for a model.
    y_test: actual labels
    y_pred: predicted labels
    model_name: e.g. Logistic Regression, Decision Tree, Random Forest, XGBoost, Naive Bayes
    Returns None.
    '''
    # Take the labels in sorted order so the names line up with scikit-learn's ordering
    labels = np.unique(y_test)
    names = [str(label) for label in labels]

    print(f"CLASSIFICATION REPORT for {model_name}\n")
    print(classification_report(y_test, y_pred, labels=labels, target_names=names))

    # Plot the confusion matrix of the given model
    figure(num=None, figsize=(30, 30))
    font = {'family' : 'Times New Roman',
        'weight' : 'bold',
        'size'   : 50}
    plt.rc('font', **font)
    sns.set_style("whitegrid")
    sns.set(font_scale = 3)
    plt.title(f"CONFUSION MATRIX for {model_name}")
    conf_matrix = confusion_matrix(y_test, y_pred, labels=labels)
    # a custom diverging palette
    cmap = sns.diverging_palette(100, 7, s=75, l=40,
                            n=5, center="light", as_cmap=True)
    sns.heatmap(conf_matrix, center=0, annot=True, fmt='d', square=True, cmap=cmap, xticklabels=names, yticklabels=names)
    plt.show()

    return

### Model 1: Naive Bayes

In [None]:
# Initial run of the Multinomial Naive Bayes with default parameters
model_name = 'NAIVE BAYES'
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)
nb_y_pred = nb_clf.predict(X_test)

In [None]:
# Calculate F1 Score 
nb_f1_score = f1_score(y_test, nb_y_pred, average="weighted")
nb_f1_score

#### Hyperparameter tuning to get best result

In [None]:
nb_param = {
    'alpha': (0.00001,0.0001,0.001, 0.01, 0.1,1),
    'fit_prior':[True, False]
}

nb_grid = GridSearchCV(estimator=nb_clf, 
                       param_grid=nb_param,
                       verbose=1,
                       scoring='f1_weighted',
                       n_jobs=-1,
                       cv=5)
nb_grid.fit(X_train, y_train)
print(nb_grid.best_params_)

In [None]:
# running Naive Bayes with best parameters 

nb_clf_tuned = MultinomialNB(alpha=0.01, fit_prior=True)
nb_clf_tuned.fit(X_train, y_train)
nb_tuned_y_pred = nb_clf_tuned.predict(X_test)

In [None]:
# Calculate F1 Score 
nb_f1_score_tuned = f1_score(y_test, nb_tuned_y_pred, average="weighted")
nb_f1_score_tuned

In [None]:
# Evaluate  Naive Bayes classifier with best parameters
evalution_different_model(y_test, nb_tuned_y_pred, model_name)

Observation: **The F1 score of the Naive Bayes model with tuned parameters is ~0.86**

In [None]:
# A dataframe to insert F1 Scores for all subsequent models

In [None]:
f1_score_summary = pd.DataFrame([{'Model': 'Naive_Bayes','F1_Score': round(nb_f1_score_tuned, 2)}])
f1_score_summary

### Model 2: Logistic Regression

In [None]:
# Initial run of the Logistic Regression
model_name = 'Logistic Regression'
lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)
lr_y_pred = lr_clf.predict(X_test)

In [None]:
# Calculate F1 Score 
lr_f1_score = f1_score(y_test, lr_y_pred, average="weighted")
lr_f1_score

#### Hyperparameter tuning to get best result

In [None]:
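# Note: not every penalty/solver pair is valid (e.g. 'elasticnet' is only supported by 'saga');
# with the default error_score, GridSearchCV scores failed fits as NaN and moves on
lr_param = {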
    'penalty': ['l1', 'l2','elasticnet', 'none'],
    'C': [0.001,0.01,0.1,1,10,100],
    'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

lr_grid = GridSearchCV(estimator=lr_clf, 
                       param_grid=lr_param,
                       verbose=1,
                       scoring='f1_weighted',
                       n_jobs=-1,
                       cv=5)
lr_grid.fit(X_train, y_train)
print(lr_grid.best_params_)

In [None]:
# running Logistic Regression with best parameters 

lr_clf_tuned = LogisticRegression(C=10, penalty='l2',solver='newton-cg')
lr_clf_tuned.fit(X_train, y_train)
lr_tuned_y_pred = lr_clf_tuned.predict(X_test)

In [None]:
# Calculate F1 Score 
lr_f1_score_tuned = f1_score(y_test, lr_tuned_y_pred, average="weighted")
lr_f1_score_tuned

In [None]:
# Evaluate  Logistic Regression classifier with best parameters
evalution_different_model(y_test, lr_tuned_y_pred, model_name)

Observation: **The F1 score of the Logistic Regression model with tuned parameters is ~0.90**

In [None]:
f1_score_summary.loc[len(f1_score_summary.index)] = ['Logistic_Regression', round(lr_f1_score_tuned, 2)]
f1_score_summary

### Model 3: Decision Tree

In [None]:
# Initial run of the Decision Tree
model_name = 'Decision Tree'
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
dt_y_pred = dt_clf.predict(X_test)

In [None]:
# Calculate F1 Score 
dt_f1_score = f1_score(y_test, dt_y_pred, average="weighted")
dt_f1_score

#### Hyperparameter tuning to get best result

In [None]:
dt_param = {
    'criterion': ['gini', 'entropy'],
    'max_depth' : [5, 10, 15, 20, 25, 30],
    'min_samples_leaf':[1,5,10,15, 20, 25],
    'max_features':['auto','log2','sqrt',None],
}

dt_grid = GridSearchCV(estimator=dt_clf, 
                       param_grid=dt_param,
                       verbose=1,
                       scoring='f1_weighted',
                       n_jobs=-1,
                       cv=5)
dt_grid.fit(X_train, y_train)
print(dt_grid.best_params_)

In [None]:
# running Decision Tree with best parameters 

dt_clf_tuned = DecisionTreeClassifier(criterion='gini',max_depth=30,min_samples_leaf=1,max_features=None)
dt_clf_tuned.fit(X_train, y_train)
dt_tuned_y_pred = dt_clf_tuned.predict(X_test)

In [None]:
# Calculate F1 Score 
dt_f1_score_tuned = f1_score(y_test, dt_tuned_y_pred, average="weighted")
dt_f1_score_tuned

In [None]:
# Evaluate  Decision Tree classifier with best parameters
evalution_different_model(y_test, dt_tuned_y_pred, model_name)

Observation: **The F1 score of the Decision Tree model with tuned parameters is ~0.87**

In [None]:
f1_score_summary.loc[len(f1_score_summary.index)] = ['Decision_Tree', round(dt_f1_score_tuned, 2)]
f1_score_summary

### Model 4: Random Forest

In [None]:
# Initial run of the Random Forest

model_name = 'Random Forest'
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
rf_y_pred = rf_clf.predict(X_test)

In [None]:
# Calculate F1 Score 
rf_f1_score = f1_score(y_test, rf_y_pred, average="weighted")
rf_f1_score

#### Hyperparameter tuning to get best result (OPTIONAL)

In [None]:
# Hyperparameter tuning to improve Random Forest performance
rf_param = {
     'n_estimators': [100, 200, 300],
     'criterion':['gini','entropy'],
     'max_depth': [10, 30, 40],
     'min_samples_split': [2, 5, 10],  # an integer value must be at least 2
     'min_samples_leaf': [1, 5, 10],
     'max_features': ['log2', 'sqrt', None]    
 }

rf_grid = RandomizedSearchCV(estimator=rf_clf, 
                        param_distributions=rf_param,
                        scoring='f1_weighted',
                        verbose=1,
                        n_jobs=-1,
                       cv=5)
rf_grid.fit(X_train, y_train)
print(rf_grid.best_params_)

**RandomizedSearchCV**: It tries random combinations drawn from the given ranges of values, so it can cover a wide range cheaply and usually reaches a good combination quickly. It is recommended for large datasets or when there are many parameters to tune.
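
The difference in cost can be sketched in plain Python. The grid below mirrors the shape of the random forest grid above, and 10 is the number of candidates `RandomizedSearchCV` draws by default (`n_iter=10`). The sampling here only imitates the idea and is not how scikit-learn implements it.

```python
import itertools
import random

param_grid = {
    "n_estimators": [100, 200, 300],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["log2", "sqrt", None],
}

# Every combination an exhaustive grid search would have to fit
keys = list(param_grid)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*param_grid.values())]

# A random search fits only a small sample of them
rng = random.Random(42)  # fixed seed so the same candidates are drawn every run
sampled = rng.sample(all_combos, 10)

print(len(all_combos), "combinations in the full grid")
print(len(sampled), "combinations tried by the random search")
```

Fixing the seed is what makes the draw repeatable; passing `random_state` to `RandomizedSearchCV` serves the same purpose for the candidates it samples.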

## <span style="color:RED">PLEASE NOTE</span>: Because we use <span style="color:Orange">RandomizedSearchCV</span> above, the best parameters may change between runs. We ran the search several times and found that, although the parameters vary from run to run, the F1 score varies by no more than 1-2%. We therefore kept the two best parameter sets:

1. SET_1_PARAMETER_NOTE


2. SET_2_PARAMETER_NOTE

#### <span style="color:RED">SET_1_PARAMETER_NOTE</span>: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 10, 'max_features': None, 'max_depth': 40, 'criterion': 'gini'}

In [None]:
# running Random Forest with best parameters :Set 1 
rf_clf_tuned_set_1 = RandomForestClassifier(n_estimators=100, 
                                      min_samples_split=10, 
                                      min_samples_leaf=10, 
                                      max_features=None, 
                                      max_depth=40, 
                                      criterion='gini'
)

rf_clf_tuned_set_1.fit(X_train, y_train)
rf_tuned_y_pred_set_1 = rf_clf_tuned_set_1.predict(X_test)

In [None]:
# Calculate F1 Score 
rf_f1_score_tuned_set_1 = f1_score(y_test, rf_tuned_y_pred_set_1, average="weighted")
rf_f1_score_tuned_set_1

#### <span style="color:RED">SET_2_PARAMETER_NOTE</span>: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_features': None, 'max_depth': 30, 'criterion': 'gini'}

In [None]:
# running Random Forest with best parameters: Set 2 (values as listed in SET_2_PARAMETER_NOTE)
rf_clf_tuned_set_2 = RandomForestClassifier(n_estimators=200, 
                                      min_samples_split=10, 
                                      min_samples_leaf=5, 
                                      max_features=None, 
                                      max_depth=30, 
                                      criterion='gini'
)

rf_clf_tuned_set_2.fit(X_train, y_train)
rf_tuned_y_pred_set_2 = rf_clf_tuned_set_2.predict(X_test)

In [None]:
# Calculate F1 Score 
rf_f1_score_tuned_set_2 = f1_score(y_test, rf_tuned_y_pred_set_2, average="weighted")
rf_f1_score_tuned_set_2

## <span style="color:RED">NOTE</span>: As seen above, the <span style="color:RED">F1 Score</span> of the random forest model differs only minimally between parameter Set_1 and Set_2, hence further evaluation of the random forest model is done with the **Set_1 parameters**

**Observation**: Across multiple runs of the random forest model with RandomizedSearchCV, the F1 score varies by only 1-2%. Hence we choose the Set_1 parameters.

In [None]:
# Evaluate  Random Forest classifier with best parameters (SET_1 )
evalution_different_model(y_test, rf_tuned_y_pred_set_1, model_name) 

In [None]:
f1_score_summary.loc[len(f1_score_summary.index)] = ['Random_Forest', round(rf_f1_score_tuned_set_1, 2)]
f1_score_summary

### Model 5: XGBoost

In [None]:
# Initial run of the XG Boost

model_name = 'XGBoost'
xg_clf = XGBClassifier(tree_method='gpu_hist', 
                        gpu_id=0, 
                        predictor="gpu_predictor")
xg_clf.fit(X_train, y_train)
xg_y_pred = xg_clf.predict(X_test)

In [None]:
# Calculate F1 Score 
xg_f1_score = f1_score(y_test, xg_y_pred, average="weighted")
xg_f1_score

#### Hyperparameter tuning to get best result (OPTIONAL)

In [None]:
xg_param = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [2, 6, 10],
    'min_child_weight': [7, 11, 19],
    'scale_pos_weight': [10, 12],
    'n_estimators': [300, 500] 
}

xg_grid = RandomizedSearchCV(estimator=xg_clf, 
                       param_distributions=xg_param,
                       scoring='f1_weighted',
                       verbose=1,
                       n_jobs=-1,
                       cv=5)
xg_grid.fit(X_train, y_train)
print(xg_grid.best_params_)

**RandomizedSearchCV**: It tries random combinations drawn from the given ranges of values, so it can cover a wide range cheaply and usually reaches a good combination quickly. It is recommended for large datasets or when there are many parameters to tune.

## <span style="color:RED">PLEASE NOTE</span>: Because we use <span style="color:Orange">RandomizedSearchCV</span> above, the best parameters may change between runs. We ran the search several times and found that, although the parameters vary from run to run, the F1 score varies by no more than 1-2%. We therefore kept the two best parameter sets:

1. SET_1_PARAMETER_NOTE


2. SET_2_PARAMETER_NOTE

#### <span style="color:RED">SET_1_PARAMETER_NOTE</span>: {'scale_pos_weight': 12, 'n_estimators': 500, 'min_child_weight': 19, 'max_depth': 2, 'learning_rate': 0.1}

In [None]:
# running XG_Boost Tree with best parameters : Set 1

xgb_clf_tuned_set_1 = XGBClassifier(scale_pos_weight=12,  # values as listed in SET_1_PARAMETER_NOTE
                              n_estimators=500, 
                              min_child_weight=19, 
                              max_depth=2, 
                              learning_rate=0.1, 
                              tree_method='gpu_hist', 
                              gpu_id=0, 
                              predictor="gpu_predictor"
)

xgb_clf_tuned_set_1.fit(X_train, y_train)
xgb_tuned_y_pred_set_1 = xgb_clf_tuned_set_1.predict(X_test)

In [None]:
# Calculate F1 Score : Set 1
xgb_f1_score_tuned_set_1 = f1_score(y_test, xgb_tuned_y_pred_set_1, average="weighted")
xgb_f1_score_tuned_set_1

#### <span style="color:RED">SET_2_PARAMETER_NOTE</span>: {'scale_pos_weight': 12, 'n_estimators': 300, 'min_child_weight': 11, 'max_depth': 2, 'learning_rate': 0.2}

In [None]:
# running XG_Boost with best parameters : Set 2

xgb_clf_tuned_set_2 = XGBClassifier(scale_pos_weight=12,  # values as listed in SET_2_PARAMETER_NOTE
                              n_estimators=300, 
                              min_child_weight=11, 
                              max_depth=2, 
                              learning_rate=0.2, 
                              tree_method='gpu_hist', 
                              gpu_id=0, 
                              predictor="gpu_predictor"
)

xgb_clf_tuned_set_2.fit(X_train, y_train)
xgb_tuned_y_pred_set_2 = xgb_clf_tuned_set_2.predict(X_test)

In [None]:
# Calculate F1 Score Set 2
xgb_f1_score_tuned_set_2 = f1_score(y_test, xgb_tuned_y_pred_set_2, average="weighted")
xgb_f1_score_tuned_set_2

## <span style="color:RED">NOTE</span>: As seen above, the <span style="color:RED">F1 Score</span> of the XGBoost model differs only minimally between parameter Set_1 and Set_2, hence further evaluation of the XGBoost model is done with the **Set_1 parameters**

**Observation**: Across multiple runs of the XGBoost classifier with RandomizedSearchCV, the F1 score varies by only 1-2%. Hence we choose the Set_1 parameters for further evaluation.

In [None]:
# Evaluate  XG BOOST  classifier with best parameters (Set_1)
evalution_different_model(y_test, xgb_tuned_y_pred_set_1, model_name)

In [None]:
f1_score_summary.loc[len(f1_score_summary.index)] = ['XGBOOST', round(xgb_f1_score_tuned_set_1, 2)]
f1_score_summary

**INFERENCES FROM SUPERVISED MODEL TUNING**: Logistic regression performs better than all the other models

## Training the logistic regression model on the complete dataset (X and y)

In [None]:
# running Logistic Regression with best parameters on whole dataset  

lr_clf_tuned = LogisticRegression(C=10, penalty='l2',solver='newton-cg',random_state=42)
lr_clf_tuned.fit(X, y)




In [None]:
# Save tuned Logistic Regression model as pickle file
pickle.dump(lr_clf_tuned, open("pickle/logreg_model.pkl", "wb"))

#### <p style="font-family: Arial; font-size:1.5em;color:DeepPink;">Task 2: Building a recommendation system</p>

The following steps are performed in this task:

 1. User-based recommendation system

 2. Item-based recommendation system

 3. Select best Recommendation System
 
 4. Recommend top-20 products to user



##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 2.1:User Based recommendation system</p>

### User-based Collaborative filter (UBCF): 

-  User-Based Collaborative Filtering (UBCF) is a technique that predicts the products a user may be interested in buying, based on the ratings given to those products by other users (peers) whose tastes are similar to the target user's


-  Steps for User-Based Collaborative Filtering:

      **1. Finding the similarity of users to the target user**
      
      **2. Prediction of missing rating of an item**
      
      **3. recommend the top n products to the target user**
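
The three steps above can be sketched on a toy example in plain Python. The users, products, ratings and the helper names (`centred`, `cosine`, `recommend`) are all hypothetical; the notebook itself performs the same steps with pandas pivot tables and `pairwise_distances`.

```python
import math

# Toy ratings (hypothetical users and products); missing entries are unrated
ratings = {
    "ann": {"p1": 5, "p2": 3, "p3": 4},
    "bob": {"p1": 4, "p2": 2, "p4": 5},
    "cat": {"p2": 5, "p3": 1, "p4": 2},
}
products = sorted({p for r in ratings.values() for p in r})

def centred(user):
    """Subtract the user's mean rating; unrated products become 0."""
    mean = sum(ratings[user].values()) / len(ratings[user])
    return [ratings[user][p] - mean if p in ratings[user] else 0.0 for p in products]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, n=2):
    vec = {u: centred(u) for u in ratings}
    # Step 1: similarity to every other user, negative values clipped to 0
    sims = {u: max(cosine(vec[target], vec[u]), 0.0) for u in ratings if u != target}
    # Step 2: score each unrated product by similarity-weighted centred ratings
    scores = {}
    for j, p in enumerate(products):
        if p not in ratings[target]:
            scores[p] = sum(sims[u] * vec[u][j] for u in sims)
    # Step 3: return the top-n products by predicted score
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("ann"))
```

Here `ann` and `bob` rate the shared products similarly, so the only product `ann` has not rated is recommended on the strength of `bob`'s above-average rating for it.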

In [None]:
df_sbprs.head(5)

In [None]:
# For this UBCF create a dataframe which contain only relevant columns
df_ubcf = df_sbprs[['reviews_username', 'id', 'reviews_rating']]



# Look at first few rows
df_ubcf.head()

In [None]:
# Inspect missing data % in different columns
df_ubcf.isna().sum()

In [None]:
df_ubcf.shape

In [None]:
# Here changing the column names

df_ubcf.columns = ['user_name', 'product_id', 'rating']

In [None]:
#  Train-Test split with 70% training set and 30% test set
train_ubcf, test_ubcf = train_test_split(df_ubcf, test_size=0.30, random_state=42,shuffle=True)


print("train Shape {0}:".format(train_ubcf.shape))
print("test Shape {0}:".format(test_ubcf.shape))


In [None]:
# Generate a pivot table with user names as the index, products as columns and ratings as values.
# fillna(0) assigns a rating of 0 to products that the user has not rated
df_ubcf_pivot = train_ubcf.pivot_table(
    index='user_name',
    columns='product_id',
    values='rating'
).fillna(0)

df_ubcf_pivot.head()

In [None]:
# Check the shape of dataframe
df_ubcf_pivot.shape

As a next step, the following strategy is used:

- Create a dummy train set to exclude the products that have already been rated by the user.

- Dummy train, a copy of the train set, is used when predicting ratings from peers: it holds 0 for products the user has already rated and 1 for non-rated products.

- Dummy test is used for evaluation. Here the opposite holds: 1 for products that have been rated by the user and 0 for non-rated products.
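
A minimal sketch of how the two masks act on predicted scores, using hypothetical values for a single user (the notebook applies the same idea to whole user-product matrices):

```python
# Toy predicted scores for one user (hypothetical values)
predicted = {"p1": 0.9, "p2": 0.4, "p3": 0.7}
rated_in_train = {"p1"}  # products the user rated in the training data
rated_in_test = {"p3"}   # products the user rated in the test data

# Recommendation mask: 0 for products already rated in train, 1 otherwise
dummy_train = {p: 0 if p in rated_in_train else 1 for p in predicted}
# Evaluation mask: 1 for products rated in test, 0 otherwise
dummy_test = {p: 1 if p in rated_in_test else 0 for p in predicted}

# Element-wise multiplication keeps only the scores each phase needs
for_recommendation = {p: predicted[p] * dummy_train[p] for p in predicted}
for_evaluation = {p: predicted[p] * dummy_test[p] for p in predicted}
print(for_recommendation)
print(for_evaluation)
```

For recommendation, the already rated product drops to 0 so it cannot be suggested again. For evaluation, only the product with a known test rating keeps its predicted score.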

In [None]:
# Copy the train dataset 
dummy_train_ubcf = train_ubcf.copy()

# Check the head of dataframe
dummy_train_ubcf.head()

In [None]:
# Check the ratings distribution
dummy_train_ubcf.rating.value_counts()

In [None]:
# Assign 0 to products the user has already rated and 1 to non-rated products
dummy_train_ubcf['rating'] = dummy_train_ubcf['rating'].apply(lambda x: 0 if x>=1 else 1)

In [None]:
# fillna(1) marks products that the user has not rated with 1
dummy_train_ubcf = dummy_train_ubcf.pivot_table(
    index='user_name',
    columns='product_id',
    values='rating'
).fillna(1)

# Check the head
dummy_train_ubcf.head(5)

In [None]:
# Check the shape of dataframe
dummy_train_ubcf.shape

### Find similarity between the users using the adjusted cosine similarity metric

In [None]:
# Generating a user-product matrix (without deleting NaN values)
df_pivot_wo_nan = train_ubcf.pivot_table(
    index='user_name',
    columns='product_id',
    values='rating'
)

# View Head of DataFrame
df_pivot_wo_nan.head()

In [None]:
# Centre the ratings around a mean of 0 by subtracting each user's average rating (across products) from that user's ratings
mean = np.nanmean(df_pivot_wo_nan, axis=1)
df_pivot_wo_nan_subtracted = (df_pivot_wo_nan.T-mean).T
df_pivot_wo_nan_subtracted.head()

In [None]:
from sklearn.metrics.pairwise import pairwise_distances

In [None]:
# similarity Matrix with the help of  pairwise_distance function
user_correlation = 1 - pairwise_distances(df_pivot_wo_nan_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

In [None]:
# Shape of user similarity matrix
user_correlation.shape

## Prediction for user based collaborative filtering 

In [None]:
# insert 0 for negative correlations 
user_correlation[user_correlation<0]=0
user_correlation

In [None]:
# Get predicted ratings of each user corresponding to each product in the given dataset
user_predicted_ratings = np.dot(user_correlation, df_pivot_wo_nan_subtracted.fillna(0))
user_predicted_ratings

In [None]:
# Shape of predicted ratings matrix
user_predicted_ratings.shape

**Now ignore the products the user has already rated by setting their predicted ratings to 0: multiply the user_predicted_ratings matrix element-wise by the dummy_train matrix.**

In [None]:
# Keep predictions only for non-rated products by multiplying user_predicted_ratings with dummy_train
user_final_rating = np.multiply(user_predicted_ratings, dummy_train_ubcf)
user_final_rating.head()

### Top 20 recommendations for the user

In [None]:
# find a non-zero rating
user_final_rating.sample(10)

In [None]:
# Top 20 products (having non-zero rating)
top_20_df = user_final_rating.loc['reggie'].sort_values(ascending=False)[0:20]
top_20_df

## Evaluation user based collaborative filtering

**Evaluation is always done on products that have already been rated by the particular user**

In [None]:
# Use users from test data set which are there in train dataset
df_common = test_ubcf[test_ubcf.user_name.isin(train_ubcf.user_name)]
df_common.shape

In [None]:
df_common.head()

In [None]:
# convert into the user-product matrix (pivot form)
df_common_ubcf_matrix = df_common.pivot_table(index='user_name', columns='product_id', values='rating')
df_common_ubcf_matrix.head()

In [None]:
# shape
df_common_ubcf_matrix.shape

### Keep only the users that are common to both the train and test datasets

In [None]:
# user_correlation matrix into dataframe.
df_user_correlation_ubcf = pd.DataFrame(user_correlation)
df_user_correlation_ubcf.head()

In [None]:
# Set index of user correlation df as index of df_subtracted
df_user_correlation_ubcf['user_name'] = df_pivot_wo_nan_subtracted.index
df_user_correlation_ubcf.set_index('user_name',inplace=True)
df_user_correlation_ubcf.head()

In [None]:
# Fetch user names in a list
eval_list = df_common.user_name.tolist()

# replacing column names 
df_user_correlation_ubcf.columns = df_pivot_wo_nan_subtracted.index.tolist()

# Keep only user correlations common in both
df_user_correlation_ubcf_1 =  df_user_correlation_ubcf[df_user_correlation_ubcf.index.isin(eval_list)]

In [None]:
# Check the shape
df_user_correlation_ubcf_1.shape

In [None]:
# keep only correlations of users that are common in both train and test datasets
df_user_correlation_ubcf_2 = df_user_correlation_ubcf_1.T[df_user_correlation_ubcf_1.T.index.isin(eval_list)]

df_user_correlation_ubcf_3 = df_user_correlation_ubcf_2.T

df_user_correlation_ubcf_3.head()

In [None]:
# Shape
df_user_correlation_ubcf_3.shape

In [None]:
# Set negative correlations to 0
df_user_correlation_ubcf_3[df_user_correlation_ubcf_3<0]=0

# Predicted ratings for the common users: similarity-weighted sum of their test ratings
arr_common_user_predicted_ratings = np.dot(df_user_correlation_ubcf_3, df_common_ubcf_matrix.fillna(0))
arr_common_user_predicted_ratings

In [None]:
# Build a test mask: 1 where the user actually rated the product, 0 otherwise
dummy_test = df_common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='user_name', columns='product_id', values='rating').fillna(0)

In [None]:
# shape
dummy_test.shape

In [None]:
# Multiplication of 'common_user_predicted_ratings' with 'dummy_test'
arr_common_user_predicted_ratings = np.multiply(arr_common_user_predicted_ratings,dummy_test)

arr_common_user_predicted_ratings.head()

### Calculate the RMSE after rescaling the predicted ratings to the 1-5 range

In [None]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *  # NOTE: this wildcard import shadows built-ins such as sum() with their NumPy versions

# Work on a copy of 'arr_common_user_predicted_ratings'
X_arr_common_user_predicted_ratings = arr_common_user_predicted_ratings.copy()

# Keep only the positive predictions (all other cells become NaN)
X_arr_common_user_predicted_ratings = X_arr_common_user_predicted_ratings[X_arr_common_user_predicted_ratings>0]

# Rescale the predictions to the 1-5 rating range
scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X_arr_common_user_predicted_ratings))
y_arr_common_user_predicted_ratings = (scaler.transform(X_arr_common_user_predicted_ratings))

print(y_arr_common_user_predicted_ratings)
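
`MinMaxScaler` works column by column and leaves NaN cells as NaN, so each product column is stretched to the 1 to 5 range on its own. The sketch below reproduces that formula by hand with NumPy on made-up scores; it assumes every column has at least two distinct non-NaN values.

```python
import numpy as np

# Hypothetical predicted scores; NaN marks cells that were filtered out
toy_scores = np.array([
    [0.5, np.nan],
    [1.5, 2.0],
    [2.5, 4.0],
])

low, high = 1.0, 5.0

# Per-column minimum and maximum, ignoring NaN
col_min = np.nanmin(toy_scores, axis=0)
col_max = np.nanmax(toy_scores, axis=0)

# Standard min-max formula, applied to each column independently
toy_scaled = (toy_scores - col_min) / (col_max - col_min) * (high - low) + low
print(toy_scaled)
```

Since the scaling is per column, a product with a single positive prediction carries little information after rescaling, which is worth keeping in mind when reading the RMSE.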

In [None]:
# Total number of non-NaN predicted ratings
all_non_nan = np.count_nonzero(~np.isnan(y_arr_common_user_predicted_ratings))
all_non_nan

In [None]:
y_arr_common_user_predicted_ratings.shape

In [None]:
df_common_ubcf_matrix.shape

In [None]:
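# RMSE over the cells that have both an actual and a predicted rating (NaN cells are skipped by .sum())
rmse_ubcf = (((df_common_ubcf_matrix - y_arr_common_user_predicted_ratings)**2).sum().sum()/all_non_nan)**0.5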
print(rmse_ubcf)
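
For reference, the RMSE can also be written as a small reusable function. This is a sketch, not the exact computation above: `masked_rmse` is a hypothetical helper, and it divides by the number of cells where both values are present rather than by the count of non-NaN predictions.

```python
import numpy as np
import pandas as pd

def masked_rmse(actual, predicted):
    """RMSE over the cells where both actual and predicted are present (hypothetical helper)."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    valid = ~np.isnan(diff)  # NaN in either input makes the difference NaN
    return float(np.sqrt(np.mean(diff[valid] ** 2)))

# Hypothetical actual ratings and rescaled predictions
toy_actual = pd.DataFrame([[5, np.nan], [np.nan, 2]], dtype=float)
toy_predicted = np.array([[4.0, np.nan], [3.0, 4.0]])

toy_rmse = masked_rmse(toy_actual, toy_predicted)
print(round(toy_rmse, 4))
```

Only two cells overlap in this toy case, with errors of 1 and -2, so the result is the square root of 2.5.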

## User based collaborative filtering RMSE: `2.45`

##### <p style="font-family: Arial; font-size:1.5em;color:Orange;">Task 2.2: Item Based Recommendation System</p>

- 1) Here we explore the relationship between pairs of items (for example, users who bought item X also bought item Y). A missing rating can then be estimated from the ratings the same user gave to other, similar items.

- 2) The first step is to build a model that finds the similarity between all item pairs.

- 3) The second step is to run the recommendation system on top of these similarities.

In [None]:
# Generate a pivot table with user names as the index and products as the columns, with ratings as values,
# then transpose it so that each row is a product. Products a user has not rated remain NaN at this stage.
df_ibcf_pivot = train_ubcf.pivot_table(
    index='user_name',
    columns='product_id',
    values='rating'
).T

df_ibcf_pivot.head()

In [None]:
# Mean-centre the ratings of each product (subtract the product's mean rating)
mean = np.nanmean(df_ibcf_pivot, axis=1)
df_ibcf_subtracted = (df_ibcf_pivot.T-mean).T
df_ibcf_subtracted.head()

In [None]:
# Item similarity matrix from the pairwise_distances function (cosine similarity = 1 - cosine distance)
item_correlation = 1 - pairwise_distances(df_ibcf_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)

In [None]:
# 0 to replace negative correlations 
item_correlation[item_correlation<0]=0
item_correlation

## Prediction for item based collaborative filtering 

In [None]:
# Get predicted ratings of each user for each product: user ratings weighted by item-item similarity
item_predicted_ratings = np.dot(df_ibcf_pivot.fillna(0).T, item_correlation)
item_predicted_ratings
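
The whole item based pipeline (transpose, mean-centre per product, cosine similarity, weighted sum) can be traced on a 3 x 3 toy example. All `toy_` names and values below are assumptions, and the cosine similarity is computed by hand with NumPy instead of `pairwise_distances`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows = users, columns = products, NaN = not rated
toy_ratings = pd.DataFrame(
    [[5, 4, np.nan], [3, 2, 1], [1, np.nan, 5]],
    index=["u1", "u2", "u3"],
    columns=["p1", "p2", "p3"],
    dtype=float,
)

# Item view: one row per product, centred on that product's mean rating
toy_items = toy_ratings.T
toy_items_centered = toy_items.sub(toy_items.mean(axis=1), axis=0).fillna(0)

# Cosine similarity between product rows, with negative values clipped to 0
values = toy_items_centered.to_numpy()
unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_item_similarity = pd.DataFrame(
    np.clip(unit_rows @ unit_rows.T, 0, None),
    index=toy_items.index,
    columns=toy_items.index,
)

# Predicted score = the user's own ratings weighted by item-item similarity
toy_item_scores = toy_ratings.fillna(0).dot(toy_item_similarity)
print(toy_item_scores.round(2))
```

User `u1` never rated `p3`, yet gets a score for it because `p3` is similar to `p2`, which `u1` rated highly.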

In [None]:
# shape
item_predicted_ratings.shape

In [None]:
# Check whether the above shape is same as that of 'dummy_train'
dummy_train_ubcf.shape

In [None]:
# Multiplication of 'dummy_train' with 'item_predicted_ratings'
item_final_rating = np.multiply(item_predicted_ratings,dummy_train_ubcf)
item_final_rating.head()

In [None]:
# Top 20 products (having non-zero rating)
top_20_df_ibcf = item_final_rating.loc['00sab00'].sort_values(ascending=False)[0:20]
top_20_df_ibcf

## Evaluation (Item based collaborative filtering)

In [None]:
# Keep the test rows whose product is also present in the train data set
common_ibcf = test_ubcf[test_ubcf['product_id'].isin(train_ubcf['product_id'])]
common_ibcf.shape

In [None]:
# check Head 
common_ibcf.head()

In [None]:
# Matrix form of item based data
common_item_based_matrix = common_ibcf.pivot_table(index='user_name', columns='product_id', values='rating').T
common_item_based_matrix.head()

In [None]:
# shape
common_item_based_matrix.shape

In [None]:
# Matrix form to dataframe
df_item_correlation = pd.DataFrame(item_correlation)
df_item_correlation.head()

In [None]:
# Setting index 
df_item_correlation['product_id'] = df_ibcf_subtracted.index
df_item_correlation.set_index('product_id',inplace=True)
df_item_correlation.head()

In [None]:
# Fetch the product ids in a list
eval_list_ibcf = common_ibcf['product_id'].tolist()

# Replace the column names with the product ids
df_item_correlation.columns = df_ibcf_subtracted.index.tolist()

# Keep only the item correlation rows for products common to both
df_item_correlation_ibcf_1 =  df_item_correlation[df_item_correlation.index.isin(eval_list_ibcf)]



In [None]:
# Keep only the correlations of products that are common to both the train and test datasets
df_item_correlation_ibcf_2 = df_item_correlation_ibcf_1.T[df_item_correlation_ibcf_1.T.index.isin(eval_list_ibcf)]

df_item_correlation_ibcf_3 = df_item_correlation_ibcf_2.T

df_item_correlation_ibcf_3.head()

In [None]:
df_item_correlation_ibcf_3.shape

In [None]:
# Set negative correlations to 0
df_item_correlation_ibcf_3[df_item_correlation_ibcf_3<0]=0

# Predicted ratings for the common products: similarity-weighted sum of their test ratings
arr_common_item_predicted_ratings = np.dot(df_item_correlation_ibcf_3, common_item_based_matrix.fillna(0))
arr_common_item_predicted_ratings

In [None]:
# Build a test mask: 1 where the user actually rated the product, 0 otherwise
dummy_test_ibcf = common_ibcf.copy()

dummy_test_ibcf['rating'] = dummy_test_ibcf['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test_ibcf = dummy_test_ibcf.pivot_table(index='user_name', columns='product_id', values='rating').T.fillna(0)

In [None]:
# Multiplication of 'arr_common_item_predicted_ratings' with 'dummy_test_ibcf'
arr_common_item_predicted_ratings = np.multiply(arr_common_item_predicted_ratings,dummy_test_ibcf)

arr_common_item_predicted_ratings.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *  # NOTE: this wildcard import shadows built-ins such as sum() with their NumPy versions

# Work on a copy of 'arr_common_item_predicted_ratings'
X_arr_common_item_predicted_ratings = arr_common_item_predicted_ratings.copy()

# Keep only the positive predictions (all other cells become NaN)
X_arr_common_item_predicted_ratings = X_arr_common_item_predicted_ratings[X_arr_common_item_predicted_ratings>0]

# Rescale the predictions to the 1-5 rating range
scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X_arr_common_item_predicted_ratings))
y_arr_common_item_predicted_ratings = (scaler.transform(X_arr_common_item_predicted_ratings))

print(y_arr_common_item_predicted_ratings)

In [None]:
# Total number of non-NaN predicted ratings
all_non_nan_ibcf = np.count_nonzero(~np.isnan(y_arr_common_item_predicted_ratings))
all_non_nan_ibcf

In [None]:
y_arr_common_item_predicted_ratings.shape

In [None]:
common_item_based_matrix.shape

In [None]:
# RMSE (divide by the item based count 'all_non_nan_ibcf', not the user based 'all_non_nan')
rmse_ibcf = (((common_item_based_matrix - y_arr_common_item_predicted_ratings)**2).sum().sum()/all_non_nan_ibcf)**0.5
print(rmse_ibcf)

###  1) `User based collaborative filtering (UBCF)` gives a lower RMSE than `Item based collaborative filtering (IBCF)`

###  2) Choosing `UBCF` over `IBCF`


In [None]:
# Calculate final ratings with UBCF
user_final_rating = np.multiply(user_predicted_ratings, dummy_train_ubcf)
user_final_rating.head()

In [None]:
type(user_final_rating)

In [None]:
# Save the final ratings to a pickle file (the 'pickle/' directory must already exist)
with open('pickle/user_final_rating.pkl', 'wb') as f:
    pickle.dump(user_final_rating.astype('float32'), f)

# `Task 3`: Improving the recommendations using the sentiment analysis model
Please check model.py

# `Task 4`: Deploying the end-to-end project with a user interface

Accomplished using Flask and Heroku