# Capstone project - Sentiment Based Product Recommendation System

## Problem Statement
Ebuss, an e-commerce company, has captured a large market share in many fields. It sells products in various categories such as household essentials, books, personal care products, medicines, cosmetic items, beauty products, electrical appliances, kitchen and dining products, and health care products.
<br><br>
As technology advances, Ebuss needs to grow quickly to become a major player in the e-commerce market, where it competes with established leaders such as Amazon and Flipkart. This project builds a sentiment-based product recommendation system that improves the recommendations shown to users based on their past reviews and ratings.

## Table of contents
<ol>
  <li>Task 1 - Data Cleaning and Pre-Processing</li>
  <ul>
      <li>Import libraries</li>
      <li>Reading dataset</li>
      <li>Finding missing values & missing value imputation</li>
      <li>Train-test split</li>
  </ul>
  <li>Task 2 - Text Processing</li>
  <ul>
      <li>Processing training data</li>
      <li>Processing test data</li>
  </ul>
  <li>Task 3 - Feature Extraction using TF-IDF</li>
  <li>Task 4 - Model Building</li>
  <ul>
      <li>Apply Logistic Regression on the dataset, perform training and predictions</li>
      <li>Apply Random Forest Classifier on the dataset, perform training and predictions</li>
      <li>Apply Balanced Random Forest Classifier on the dataset, perform training and predictions</li>
      <li>Apply Bernoulli Bayes Classifier on the dataset, perform training and predictions</li>
      <li>Apply Logistic Regression with SMOTE</li>
      <li>Finalising the model for fine tuning of recommendation system</li>
      <li>Exploring Hyper Parameter tuning for Logistic Regression model</li>
  </ul>
  <li>Task 5 - Building the Recommendation System</li>
  <ul>
      <li>Converting the dataframe into pivot</li>
      <li>User Based Collaborative filtering</li>
      <ul>
          <li>Building - User User</li>
          <li>Prediction - User User</li>
          <li>Finding the top 5 recommendation for the user</li>
          <li>Evaluation - User User</li>
          <li>Fine tuning the recommendation system</li>
      </ul>
      <li>Item Based Collaborative filtering</li>
      <ul>
          <li>Building - Item Item</li>
          <li>Prediction - Item Item</li>
          <li>Finding the top 5 recommendation for the user</li>
          <li>Evaluation - Item Item</li>
          <li>Fine tuning the recommendation system</li>
      </ul>
  </ul>
  <li>Task 6 - Fine Tuning the Recommendation System and Recommendation of Top 5 Products</li>

</ol>



## Task 1 - Data Cleaning and Pre-Processing

### (1.1) Import libraries

In [1]:
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.naive_bayes import CategoricalNB, GaussianNB, BernoulliNB
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix,f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import MinMaxScaler
import pickle
import warnings
warnings.simplefilter("ignore")

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### (1.2) Reading dataset

In [2]:
url = 'https://raw.githubusercontent.com/adwayskirwe/Capstone/main/sample30.csv'
reviews_df = pd.read_csv(url)

print(f"Shape of the pandas loaded dataset = {reviews_df.shape} \n")
print(f"Dataset columns names are as follows = {reviews_df.columns}" )

Shape of the pandas loaded dataset = (30000, 15) 

Dataset columns names are as follows = Index(['id', 'brand', 'categories', 'manufacturer', 'name', 'reviews_date',
       'reviews_didPurchase', 'reviews_doRecommend', 'reviews_rating',
       'reviews_text', 'reviews_title', 'reviews_userCity',
       'reviews_userProvince', 'reviews_username', 'user_sentiment'],
      dtype='object')


In [3]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        29999 non-null  object
dtypes: int64(1), object(14)
memory usage

In [4]:
#Taking a glance at the top 5 reviews from the dataset
pd.set_option('max_colwidth', 200)
reviews_df['reviews_text'].head(5)

0    i love this album. it's very good. more to the hip hop side than her current pop sound.. SO HYPE! i listen to this everyday at the gym! i give it 5star rating all the way. her metaphors are just c...
1                                                                                                                                             Good flavor. This review was collected as part of a promotion.
2                                                                                                                                                                                               Good flavor.
3    I read through the reviews on here before looking in to buying one of the couples lubricants, and was ultimately disappointed that it didn't even live up to the reviews I had read. For starters, n...
4                                                                       My husband bought this gel for us. The gel caused irritation and it felt like it was burning my skin. I woul

In [5]:
# Total number of unique users who have written a review
print(f"Total number of unique users who have written a review = {len(reviews_df['reviews_username'].unique())} ")

# Total number of unique categories
print(f"Total number of unique categories  = {len(reviews_df['categories'].unique())} ")


Total number of unique users who have written a review = 24915 
Total number of unique categories  = 270 


In [6]:
#Exploring the reviews_df['name'] and finding the product names having maximum frequency

print("Exploring the reviews_df['name'] and finding the product names having maximum frequency :- \n")
valuecount = reviews_df['name'].value_counts()
valuecount[valuecount > 500]

Exploring the reviews_df['name'] and finding the product names having maximum frequency :- 



Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total                         8545
Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd                   3325
Clorox Disinfecting Bathroom Cleaner                                              2039
L'or233al Paris Elvive Extraordinary Clay Rebalancing Conditioner - 12.6 Fl Oz    1186
Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)               1143
Burt's Bees Lip Shimmer, Raisin                                                    873
The Resident Evil Collection 5 Discs (blu-Ray)                                     845
Mike Dave Need Wedding Dates (dvd + Digital)                                       757
Nexxus Exxtra Gel Style Creation Sculptor                                          693
Red (special Edition) (dvdvideo)                                                   672
My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital)                               668
Olay Regenerist Deep Hydration Regenerating

In [7]:
#Exploring categories appearing more than 500 times in dataset
valuecount = reviews_df['categories'].value_counts()
valuecount[valuecount > 500]

Household Essentials,Cleaning Supplies,Kitchen Cleaners,Cleaning Wipes,All-Purpose Cleaners,Health & Household,Household Supplies,Household Cleaning,Ways To Shop,Classroom Essentials,Featured Brands,Home And Storage & Org,Clorox,Glass Cleaners,Surface Care & Protection,Business & Industrial,Cleaning & Janitorial Supplies,Cleaners & Disinfectants,Cleaning Wipes & Pads,Cleaning Solutions,Housewares,Target Restock,Food & Grocery,Paper Goods,Wipes,All Purpose Cleaners    8545
Movies, Music & Books,Movies,Action & Adventure,Movies & Music,Movies & TV Shows,Frys                                                                                                                                                                                                                                                                                                                                                                                                   3325
Household Chemicals,Household Cleaners,Bath & 

In [8]:
#Exploring value count for reviews_rating columns
reviews_df['reviews_rating'].value_counts()

5    20831
4     6020
1     1384
3     1345
2      420
Name: reviews_rating, dtype: int64

In [9]:
#Exploring value count for reviews_username columns
reviews_df['reviews_username'].value_counts()

mike                 41
byamazon customer    41
chris                32
lisa                 16
sandy                15
                     ..
nurse32               1
lisa62                1
bigal515              1
mom271                1
kcoopxoxo             1
Name: reviews_username, Length: 24914, dtype: int64

In [10]:
#Creating a new feature 'user_sentiment_bool' - All negative sentiments will be marked as "0" and all positive sentiments will be marked as "1"
reviews_df['user_sentiment_bool'] = reviews_df['user_sentiment'].apply(lambda x: 0 if x == "Negative" else 1)
reviews_df.shape

(30000, 16)
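
One detail of the mapping above is worth noting: because the lambda only tests for the string "Negative", every other value, including a missing one, is encoded as 1. The sketch below uses a hypothetical helper `encode_sentiment` (not part of the notebook) to show this behaviour in plain Python.

```python
def encode_sentiment(label):
    """Mirror of the notebook's lambda: 0 for "Negative", 1 for anything else."""
    return 0 if label == "Negative" else 1

# Illustrative inputs; float("nan") stands in for a missing user_sentiment value.
samples = ["Positive", "Negative", float("nan")]
encoded = [encode_sentiment(s) for s in samples]
print(encoded)
```

This happens to agree with the later decision to impute the single missing `user_sentiment` value as 'Positive'.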

### (1.3) Finding missing values & missing value imputation

In [11]:
# Finding missing values for 'reviews_username' column
print(f" Total number of missing values for 'reviews_username' column before missing value imputation = {reviews_df[reviews_df['reviews_username'].isnull()].shape[0]} ")

# Imputing missing values for 'reviews_username' column by filling null values with 'others'
reviews_df['reviews_username'] = reviews_df['reviews_username'].fillna('others')

# Checking for missing values for 'reviews_username' column post missing value imputation
print(f" Total number of missing values for 'reviews_username' column after missing value imputation= {reviews_df[reviews_df['reviews_username'].isnull()].shape[0]} ")

 Total number of missing values for 'reviews_username' column before missing value imputation = 63 
 Total number of missing values for 'reviews_username' column after missing value imputation= 0 


In [12]:
# Finding missing values for 'user_sentiment' column
print(f" Total number of missing values for 'user_sentiment' column before missing value imputation= {reviews_df[reviews_df['user_sentiment'].isnull()].shape[0]} ")

pd.set_option('max_colwidth', 175)
reviews_df[reviews_df['user_sentiment'].isnull()][['reviews_text','reviews_title','reviews_username','user_sentiment']]

 Total number of missing values for 'user_sentiment' column before missing value imputation= 1 


| | reviews_text | reviews_title | reviews_username | user_sentiment |
|---|---|---|---|---|
| 28354 | my kids absolutely loved this film so much that we watched it twice. Having a digital copy means that every time we get in the car we get to watch it wherever we go. we ev... | a super hit with my children. they loved it!!?? | 787000000000.0 | NaN |


In [13]:
# Imputing missing values for 'user_sentiment' column by filling null values with 'Positive', as a manual look at the respective review text shows a positive sentiment
reviews_df['user_sentiment'] = reviews_df['user_sentiment'].fillna('Positive')

# Checking for missing values for 'user_sentiment' column post missing value imputation
print(f" Total number of missing values for 'user_sentiment' column after missing value imputation= {reviews_df[reviews_df['user_sentiment'].isnull()].shape[0]} ")

 Total number of missing values for 'user_sentiment' column after missing value imputation= 0 


In [14]:
#Checking the count of non-null values for columns 'reviews_username' and 'user_sentiment' after missing value imputation 
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      30000 non-null  object
 14  user_sentiment        30000 non-null  object
 15  user_sentiment_bool   30000 non-null

### (1.4) Train-test split

In [15]:
#Splitting the dataset into train and test sets with 70% training size and 30% test size (no shuffling, so row order is preserved)
X_train, X_test, y_train, y_test = train_test_split(reviews_df,reviews_df['user_sentiment_bool'],test_size=0.3,shuffle=False)
print("X_train shape =", X_train.shape)
print("X_test shape =", X_test.shape)
print("y_train shape =", y_train.shape)
print("y_test shape =", y_test.shape)

X_train shape = (21000, 16)
X_test shape = (9000, 16)
y_train shape = (21000,)
y_test shape = (9000,)
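
With `shuffle=False`, `train_test_split` keeps the original row order, so the leading 70% of rows become the training set and the trailing 30% the test set. A minimal plain-Python sketch of that behaviour follows; the label list is made up, and the rounding in the hypothetical helper `ordered_split` is a simplification rather than scikit-learn's exact rule.

```python
def ordered_split(rows, test_size=0.3):
    """Split a sequence without shuffling: head goes to train, tail to test."""
    n_test = int(round(len(rows) * test_size))  # simplified rounding
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

# Hypothetical labels: the negatives are clustered at the start of the data.
labels = [0] * 3 + [1] * 7
train_labels, test_labels = ordered_split(labels, test_size=0.3)

print(train_labels)
print(test_labels)
```

Because order is preserved, the class mix of the two parts can differ when the data is sorted in some way. Shuffling together with the `stratify` argument of `train_test_split` is one common alternative.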


## Task 2 - Text Processing

In [16]:
#This function takes a document (a review) as input, preprocesses it and returns the preprocessed output

#stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))

def preprocess(document):
    'Lower-cases the document, removes punctuation, digits and stop words, and lemmatizes the remaining words with wordnet_lemmatizer'

    # Make the text lowercase
    document = document.lower()

    # Remove everything except ASCII letters and whitespace (punctuation and digits are dropped)
    document = re.sub(r"[^\sa-zA-Z]", "", document)

    # tokenize into words
    words = word_tokenize(document)

    # remove stop words
    words = [word for word in words if word not in stop_words]

    # Lemmatizing the words
    words = [wordnet_lemmatizer.lemmatize(word) for word in words]

    # join words to make sentence
    document = " ".join(words)

    return document
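
A note on the character class used when stripping non-letters: inside a regex character class, a range written `A-z` spans ASCII codes 65 to 122, which also covers the six symbols that sit between `Z` and `a` (square brackets, backslash, caret, underscore and backtick). The sketch below contrasts that range with the explicit `a-zA-Z`; the sample string is invented for illustration.

```python
import re

sample = "great_value!! 5star [ok] ^_^"

# The range A-z also matches [ \ ] ^ _ ` so those symbols survive the cleanup.
loose = re.sub(r"[^\sA-z]", "", sample)

# Explicit ranges keep only ASCII letters and whitespace.
strict = re.sub(r"[^\sa-zA-Z]", "", sample)

print(repr(loose))
print(repr(strict))
```

Note that both patterns delete digits individually rather than whole tokens, so a token like `5star` becomes `star`.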

### (2.1) Processing training data

In [17]:
#Processing all user reviews from the train dataset
preprocessed_review = [preprocess(review) for review in tqdm(X_train['reviews_text'])]
X_train['preprocessed_review'] = preprocessed_review

100%|██████████| 21000/21000 [01:43<00:00, 201.95it/s]


### (2.2) Processing test data

In [18]:
#Processing all user reviews from the test dataset
preprocessed_review = [preprocess(review) for review in tqdm(X_test['reviews_text'])]
X_test['preprocessed_review'] = preprocessed_review

100%|██████████| 9000/9000 [00:44<00:00, 201.58it/s]


#### Checking shape of X_train and X_test

In [19]:
print(X_train.shape)
print(X_test.shape)

(21000, 17)
(9000, 17)


The dataset contains the following columns:
*    id                   
*    brand                
*    categories           
*    manufacturer         
*    name                 
*    reviews_date         
*    reviews_didPurchase  
*    reviews_doRecommend  
*    reviews_rating       
*    reviews_text         
*    reviews_title        
*    reviews_userCity     
*    reviews_userProvince 
*    reviews_username     
*    user_sentiment 

<br>Only two of these are used for the sentiment model: the review text in 'reviews_text' is the input to the TF-IDF features, and 'user_sentiment' is the target variable. The remaining columns are not used when building or fitting the model.




In [20]:
#Comparing the raw review VS preprocessed review for train dataset
X_train[['reviews_text','preprocessed_review']].head(3)

| | reviews_text | preprocessed_review |
|---|---|---|
| 0 | i love this album. it's very good. more to the hip hop side than her current pop sound.. SO HYPE! i listen to this everyday at the gym! i give it 5star rating all the way.... | love album good hip hop side current pop sound hype listen everyday gym give star rating way metaphor crazy |
| 1 | Good flavor. This review was collected as part of a promotion. | good flavor review collected part promotion |
| 2 | Good flavor. | good flavor |


In [21]:
#Comparing the raw review VS preprocessed review for test dataset
X_test[['reviews_text','preprocessed_review']].head(3)

| | reviews_text | preprocessed_review |
|---|---|---|
| 21000 | Great product | great product |
| 21001 | Crispy chips that are the right size. has the right amount of salt. | crispy chip right size right amount salt |
| 21002 | My family buys these chips at least 2 times a week we love these . | family buy chip least time week love |


## Task 3 - Feature Extraction using TF-IDF

In [22]:
#Default shape of features_df without specifying values of min_df and max_df = (30000,18403). Post specifying = (30000,5179)
#Creating TF-IDF vectorizer, for model building
vectorizer = TfidfVectorizer(max_df=0.95,min_df=3)

#Fitting the TF-IDF vectorizer on the training set reviews
X = vectorizer.fit_transform(X_train['preprocessed_review'])
features_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
features_df

Unnamed: 0,ability,able,abosolutely,abrasive,absence,absolute,absolutely,absolutley,absolutly,absorb,...,zack,zero,zip,ziploc,ziplock,zipper,zojirushi,zombie,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
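
To make the `min_df`/`max_df` filtering and the weighting concrete, here is a small plain-Python sketch of TF-IDF on a toy corpus. It follows the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 with L2 row normalisation, which is scikit-learn's documented default for `TfidfVectorizer`. The corpus and the helper name `tfidf_rows` are made up, and tokenisation is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(docs, min_df=2, max_df=0.95):
    """Toy TF-IDF: document-frequency filtering, smoothed idf, L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))

    # Keep terms seen in at least min_df documents and at most max_df * n_docs documents.
    vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

docs = [
    "good flavor",
    "good product",
    "great product good flavor",
    "bad flavor good",
    "good product product",
]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In this toy corpus "good" appears in every document and is removed by `max_df`, while "great" and "bad" appear only once and are removed by `min_df`. The same mechanism is what shrinks the notebook's vocabulary.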


In [23]:
#Transforming the test set reviews with the TF-IDF vectorizer fitted on the training set
X_test_ = vectorizer.transform(X_test['preprocessed_review'])
features_df_test = pd.DataFrame(X_test_.toarray(), columns = vectorizer.get_feature_names_out())
features_df_test

Unnamed: 0,ability,able,abosolutely,abrasive,absence,absolute,absolutely,absolutley,absolutly,absorb,...,zack,zero,zip,ziploc,ziplock,zipper,zojirushi,zombie,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Task 4 - Model Building

### (4.1) Apply Logistic Regression on the dataset, perform training and predictions

In [24]:
#Building model
logisticRegression = LogisticRegression(random_state=100)
logisticRegression.fit(features_df,y_train)

#Predicting the training set accuracy
y_train_pred = logisticRegression.predict(features_df)
logistic_regression_train_accuracy = accuracy_score(y_true=y_train, y_pred=y_train_pred)
print("logistic_regression_train_accuracy=",logistic_regression_train_accuracy)

#Predicting the test set accuracy
y_test_pred = logisticRegression.predict(features_df_test)
logistic_regression_test_accuracy = accuracy_score(y_true=y_test, y_pred=y_test_pred)
print(f"logistic_regression_test_accuracy = {logistic_regression_test_accuracy} \n")

#Calculating F1 score
print(f"logistic_regression_f1_score = {f1_score(y_test, y_test_pred,average='weighted')} ")
pd.DataFrame(confusion_matrix(y_test,y_test_pred))

logistic_regression_train_accuracy= 0.9371904761904762
logistic_regression_test_accuracy = 0.8966666666666666 

logistic_regression_f1_score = 0.8654008982449348 


| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 194 | 898 |
| Actual 1 | 32 | 7876 |
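
The weighted F1 score printed above can be reproduced from the confusion matrix alone, which makes clear how each class contributes. A minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; `weighted_f1` is an illustrative helper, not a library function.

```python
def weighted_f1(matrix):
    """Support-weighted mean of per-class F1 for a square confusion matrix."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    score = 0.0
    for k in range(n_classes):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                               # true k, predicted as another class
        fp = sum(matrix[r][k] for r in range(n_classes)) - tp  # other classes predicted as k
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * sum(matrix[k]) / total                   # weight by class support
    return score

# Logistic regression confusion matrix from the output above.
lr_matrix = [[194, 898], [32, 7876]]
lr_weighted_f1 = weighted_f1(lr_matrix)
print(round(lr_weighted_f1, 4))
```

The result should agree with the reported 0.8654. The negative class contributes an F1 of roughly 0.29 and the positive class roughly 0.94, so the weighted figure is dominated by the much larger positive class.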


In [25]:
#Saving the model using pickle
with open('model_pkl', 'wb') as files:
    pickle.dump(logisticRegression, files)
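
For completeness, a pickled object is restored with `pickle.load` on a file opened in binary read mode. The sketch below round-trips a stand-in dictionary through a temporary file, so the directory and the object are placeholders rather than the notebook's actual model.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object behaves the same way.
stand_in_model = {"name": "logistic_regression", "random_state": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pkl")

    # Save: binary write mode, as in the cell above.
    with open(path, "wb") as fh:
        pickle.dump(stand_in_model, fh)

    # Load: binary read mode returns an equal object.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == stand_in_model)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.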

### (4.2) Apply Random Forest Classifier on the dataset, perform training and predictions

In [26]:
#Building model
randomForestClassifier = RandomForestClassifier(random_state=100)
randomForestClassifier.fit(features_df,y_train)

#Predicting the training set accuracy
y_train_pred = randomForestClassifier.predict(features_df)
randomForest_train_accuracy = accuracy_score(y_true=y_train, y_pred=y_train_pred)
print(f"randomForest_train_accuracy = {randomForest_train_accuracy} \n")

#Predicting the test set accuracy
y_test_pred = randomForestClassifier.predict(features_df_test)
randomForest_test_accuracy = accuracy_score(y_true=y_test, y_pred=y_test_pred)
print(f"randomForest_test_accuracy = {randomForest_test_accuracy} \n")

#Calculating F1 score
print(f"randomForest_f1_score = {f1_score(y_test, y_test_pred,average='weighted')} \n")
pd.DataFrame(confusion_matrix(y_test,y_test_pred))


randomForest_train_accuracy = 0.9999047619047619 

randomForest_test_accuracy = 0.8962222222222223 

randomForest_f1_score = 0.8738485055112659 



| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 274 | 818 |
| Actual 1 | 116 | 7792 |


### (4.3) Apply Balanced Random Forest Classifier on the dataset, perform training and predictions

In [27]:
#Building model
balancedRandomForestClassifier = BalancedRandomForestClassifier(random_state=100)
balancedRandomForestClassifier.fit(features_df,y_train)

#Predicting the training set accuracy
y_train_pred = balancedRandomForestClassifier.predict(features_df)
balancedRandomForest_train_accuracy = accuracy_score(y_true=y_train, y_pred=y_train_pred)
print(f"balanced_randomForest_train_accuracy = {balancedRandomForest_train_accuracy} \n")

#Predicting the test set accuracy
y_test_pred = balancedRandomForestClassifier.predict(features_df_test)
balancedRandomForest_test_accuracy = accuracy_score(y_true=y_test, y_pred=y_test_pred)
print(f"balanced_randomForest_test_accuracy = {balancedRandomForest_test_accuracy} \n")

#Calculating F1 score
print(f"balanced_randomForest_f1_score = {f1_score(y_test, y_test_pred,average='weighted')} \n")
pd.DataFrame(confusion_matrix(y_test,y_test_pred))


balanced_randomForest_train_accuracy = 0.925 

balanced_randomForest_test_accuracy = 0.8545555555555555 

balanced_randomForest_f1_score = 0.8707868701846336 



| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 862 | 230 |
| Actual 1 | 1079 | 6829 |


### (4.4) Apply Bernoulli Bayes Classifier on the dataset, perform training and predictions

In [28]:
#Building model
bernoulliNB = BernoulliNB()
bernoulliNB.fit(features_df,y_train)

#Predicting the training set accuracy
y_train_pred = bernoulliNB.predict(features_df)
bernoulliNB_train_accuracy = accuracy_score(y_true=y_train, y_pred=y_train_pred)
print(f"bernoulliNB_train_accuracy = {bernoulliNB_train_accuracy} \n")

#Predicting the test set accuracy
y_test_pred = bernoulliNB.predict(features_df_test)
bernoulliNB_test_accuracy = accuracy_score(y_true=y_test, y_pred=y_test_pred)
print(f"bernoulliNB_test_accuracy = {bernoulliNB_test_accuracy} \n")

#Calculating F1 score
print(f"bernoulliNB_f1_score = {f1_score(y_test, y_test_pred,average='weighted')} \n")
pd.DataFrame(confusion_matrix(y_test,y_test_pred))


bernoulliNB_train_accuracy = 0.9188095238095239 

bernoulliNB_test_accuracy = 0.8822222222222222 

bernoulliNB_f1_score = 0.8779897024059564 



| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 479 | 613 |
| Actual 1 | 447 | 7461 |


### (4.5) Apply Logistic Regression with SMOTE

In [29]:
# Value of target variable - count of 0s and 1s before applying SMOTE technique of over-sampling (to identify class imbalance)
y_train.value_counts()

1    18725
0     2275
Name: user_sentiment_bool, dtype: int64

In [30]:
# Perform oversampling using SMOTE to ensure class balance

oversample = SMOTE()
over_X, over_y = oversample.fit_resample(features_df, y_train)

print(over_X.shape)
print(over_y.shape)

(37450, 5179)
(37450,)


In [31]:
# Value of target variable - count of 0s and 1s after applying SMOTE technique of over-sampling
over_y.value_counts()

1    18725
0    18725
Name: user_sentiment_bool, dtype: int64
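
SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and interpolating between the two. A simplified plain-Python sketch of that interpolation step follows. The points, the fixed seed and the helper `smote_like_sample` are illustrative, and the neighbour search is reduced to a brute-force lookup, so this is not imbalanced-learn's implementation.

```python
import random

def smote_like_sample(point, minority, k=2, rng=random):
    """Return one synthetic sample on the segment between `point` and a near neighbour."""
    # Brute-force k nearest minority neighbours (excluding the point itself).
    others = [p for p in minority if p is not point]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(point, p)))
    neighbour = rng.choice(others[:k])

    # Interpolate: x_new = x + lam * (neighbour - x), with lam drawn from [0, 1).
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(point, neighbour)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = [smote_like_sample(minority[0], minority, k=2, rng=rng) for _ in range(5)]
for s in synthetic:
    print([round(v, 3) for v in s])
```

In the notebook's case, 18725 - 2275 = 16450 synthetic negative rows are added to reach the balanced counts shown above.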

In [32]:
#Building model using SMOTE features/target variables
logisticRegression_smote = LogisticRegression(random_state=100)
logisticRegression_smote.fit(over_X,over_y)

#Predicting the training set accuracy
y_train_pred = logisticRegression_smote.predict(over_X)
logistic_regression_smote_train_accuracy = accuracy_score(y_true=over_y, y_pred=y_train_pred)
print(f"logistic_regression_train_accuracy = {logistic_regression_smote_train_accuracy} \n")

#Predicting the test set accuracy
y_test_pred = logisticRegression_smote.predict(features_df_test)
logistic_regression_smote_test_accuracy = accuracy_score(y_true=y_test, y_pred=y_test_pred)
print(f"logistic_regression_test_accuracy = {logistic_regression_smote_test_accuracy} \n")

#Calculating F1 score
print("logistic_regression_f1_score =", f1_score(y_test, y_test_pred,average='weighted'))
pd.DataFrame(confusion_matrix(y_test,y_test_pred))


logistic_regression_train_accuracy = 0.963497997329773 

logistic_regression_test_accuracy = 0.8885555555555555 

logistic_regression_f1_score = 0.8967843796596722


| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 828 | 264 |
| Actual 1 | 739 | 7169 |


### (4.6) Finalising the model for fine tuning of recommendation system






In [33]:
model_comparison_dict = {}
model_comparison_dict['Model name'] = ['Logistic Regression', 'Random Forest Classifier', 'Balanced Random Forest Classifier', 'Bernoulli Bayes Classifier ','Logistic Regression (SMOTE)']
model_comparison_dict['Model train accuracy'] = [logistic_regression_train_accuracy,randomForest_train_accuracy,balancedRandomForest_train_accuracy,bernoulliNB_train_accuracy,logistic_regression_smote_train_accuracy]
model_comparison_dict['Model test accuracy'] = [logistic_regression_test_accuracy,randomForest_test_accuracy,balancedRandomForest_test_accuracy,bernoulliNB_test_accuracy,logistic_regression_smote_test_accuracy]


model_comparison_df = pd.DataFrame(model_comparison_dict)
model_comparison_df

| | Model name | Model train accuracy | Model test accuracy |
|---|---|---|---|
| 0 | Logistic Regression | 0.93719 | 0.896667 |
| 1 | Random Forest Classifier | 0.999905 | 0.896222 |
| 2 | Balanced Random Forest Classifier | 0.925 | 0.854556 |
| 3 | Bernoulli Bayes Classifier | 0.91881 | 0.882222 |
| 4 | Logistic Regression (SMOTE) | 0.963498 | 0.888556 |


The model_comparison_df dataframe above gives a quick summary of the train and test accuracy of the 5 models tried for predicting review sentiment. <br>
Logistic regression has the highest test accuracy and only a modest gap between train and test accuracy (about 0.04), so it does not appear to over-fit much. We therefore use the Logistic Regression model for fine tuning of our recommendation system.
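
Since the classes are imbalanced, it can help to read the accuracy figures alongside the recall of the negative class. The sketch below derives both from the five test-set confusion matrices printed in sections 4.1 to 4.5 (rows are true labels, columns are predictions). The dictionary and helper names are illustrative.

```python
# Test-set confusion matrices copied from the outputs in sections 4.1 to 4.5.
matrices = {
    "Logistic Regression": [[194, 898], [32, 7876]],
    "Random Forest Classifier": [[274, 818], [116, 7792]],
    "Balanced Random Forest Classifier": [[862, 230], [1079, 6829]],
    "Bernoulli Bayes Classifier": [[479, 613], [447, 7461]],
    "Logistic Regression (SMOTE)": [[828, 264], [739, 7169]],
}

def accuracy(m):
    """Share of all test rows on the diagonal of the matrix."""
    return (m[0][0] + m[1][1]) / sum(map(sum, m))

def negative_recall(m):
    """Share of truly negative reviews that were predicted as negative."""
    return m[0][0] / sum(m[0])

summary = {name: (accuracy(m), negative_recall(m)) for name, m in matrices.items()}
for name, (acc, rec) in summary.items():
    print(f"{name:<36} accuracy={acc:.4f}  negative recall={rec:.4f}")
```

On these numbers, plain logistic regression has the highest accuracy but identifies about 18% of the negative reviews, while the SMOTE variant identifies about 76% at a slightly lower accuracy. Which trade-off is preferable depends on how costly a missed negative review is for the recommender.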

### (4.7) Exploring Hyper Parameter tuning for Logistic Regression model

In [34]:
#Perform Hyper-parameter tuning using GridSearchCV
#This piece of code has been commented out as hyper-parameter tuning has been performed earlier and results have been verified
'''
logModel = LogisticRegression()

param_grid = [    
    {'penalty' : ['l2'],
    'solver' : ['lbfgs','liblinear','sag','saga'],
    'max_iter' : [100, 200]
    }
]

clf = GridSearchCV(logModel, param_grid = param_grid, cv = 2, verbose=True, n_jobs=-1)
best_clf = clf.fit(features_df,y_train)
print(f"Best classifier score = {best_clf.best_score_} \n")
print(f"Best estimator provided by GridSearchCV = {best_clf.best_estimator_}")
'''


Here, GridSearchCV returns the default logistic regression model (i.e. the model with the default argument values). The accuracy is decent enough, so we will proceed with this model. The output recorded from the earlier run is shown below.<br><br>

Fitting 2 folds for each of 8 candidates, totalling 16 fits
Best classifier score = 0.9009047619047619 

Best estimator provided by GridSearchCV = **LogisticRegression()**
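
The log line "8 candidates, totalling 16 fits" follows directly from the grid: 1 penalty × 4 solvers × 2 `max_iter` values gives 8 parameter combinations, each fitted on 2 cross-validation folds. A small sketch that enumerates the same grid with `itertools.product`:

```python
from itertools import product

param_grid = {
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear", "sag", "saga"],
    "max_iter": [100, 200],
}
cv_folds = 2

# Cartesian product of all parameter values, one dict per candidate.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```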

## Task 5 - Building the Recommendation System

### (5.1) Converting the dataframe into pivot


In [35]:
sample_df = reviews_df[['reviews_username','name','reviews_rating']]
sample_df_groupby = sample_df.groupby(['reviews_username','name']).mean()
sample_df_groupby = sample_df_groupby.reset_index()

print(f"Shape of sample_df_groupby = {sample_df_groupby.shape} \n")
print(f"Total usernames found = {len(sample_df_groupby['reviews_username'].unique())} ")

Shape of sample_df_groupby = (27605, 3) 

Total usernames found = 24915 


In [36]:
sample_df_groupby[sample_df_groupby['reviews_username'] == "mike"]

| | reviews_username | name | reviews_rating |
|---|---|---|---|
| 17494 | mike | 100:Complete First Season (blu-Ray) | 4.0 |
| 17495 | mike | Banana Boat Sunless Summer Color Self Tanning Lotion, Light To Medium | 5.0 |
| 17496 | mike | Bilbao Nightstand Gray Oak - South Shore | 1.0 |
| 17497 | mike | Chester's Cheese Flavored Puffcorn Snacks | 5.0 |
| 17498 | mike | Clorox Disinfecting Bathroom Cleaner | 5.0 |
| 17499 | mike | Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total | 5.0 |
| 17500 | mike | Feit 60-Watt A19 Gu24 Base Led Light Bulb - Soft White | 1.0 |
| 17501 | mike | Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd | 4.533333 |
| 17502 | mike | Jason Aldean - They Don't Know | 4.666667 |
| 17503 | mike | Meguiar's Deep Crystal Car Wash 64-Oz. | 5.0 |
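
The fractional ratings in the table above (for example 4.533333) arise because `groupby(...).mean()` collapses several reviews with the same username and product into one averaged rating. A plain-Python sketch of that aggregation, with invented review tuples:

```python
from collections import defaultdict

# Hypothetical (username, product, rating) reviews; note the repeated pair.
reviews = [
    ("mike", "Product A", 5),
    ("mike", "Product A", 4),
    ("mike", "Product B", 1),
    ("lisa", "Product A", 3),
]

# Collect ratings per (username, product) pair, then average each group.
grouped = defaultdict(list)
for user, item, rating in reviews:
    grouped[(user, item)].append(rating)

mean_ratings = {pair: sum(vals) / len(vals) for pair, vals in grouped.items()}
for pair, rating in sorted(mean_ratings.items()):
    print(pair, rating)
```

Averaging guarantees one rating per (username, product) pair, which `DataFrame.pivot` needs because it raises an error when an index/column combination is duplicated.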


In [37]:
print(f" Total number of missing values for 'reviews_username' column = {sample_df_groupby[sample_df_groupby['reviews_username'].isnull()].shape[0]} ")
print(f" Total number of missing values for 'name' column = {sample_df_groupby[sample_df_groupby['name'].isnull()].shape[0]} ")
print(f" Total number of missing values for 'reviews_rating' column = {sample_df_groupby[sample_df_groupby['reviews_rating'].isnull()].shape[0]} ")

train, test = train_test_split(sample_df_groupby, test_size=0.30, random_state=31)
print(train.shape)
print(test.shape)

 Total number of missing values for 'reviews_username' column = 0 
 Total number of missing values for 'name' column = 0 
 Total number of missing values for 'reviews_rating' column = 0 
(19323, 3)
(8282, 3)


In [38]:
# Pivot the reviews dataset into matrix format in which columns are product_names and the rows are usernames.

df_pivot = train.pivot(
    index='reviews_username',
    columns='name',
    values='reviews_rating'
).fillna(0)

print(f"Shape of df_pivot = {df_pivot.shape} \n")
df_pivot.head(3)

#df_pivot.loc["02dakota","Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd"]

Shape of df_pivot = (17877, 257) 



name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00sab00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01impala,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
# Copy the train dataset into dummy_train
dummy_train = train.copy()

# Products already rated by the user are marked as 0; products not rated by the user are marked as 1 (candidates for prediction).
dummy_train['reviews_rating'] = dummy_train['reviews_rating'].apply(lambda x: 0 if x>=1 else 1)

# Convert the dummy train dataset into matrix format.
dummy_train = dummy_train.pivot(
    index='reviews_username',
    columns='name',
    values='reviews_rating'
).fillna(1)

print(f" Dummy_train shape = {dummy_train.shape} \n ")
dummy_train.head(3)

#dummy_train.loc["02dakota","Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd"]

 Dummy_train shape = (17877, 257) 
 


name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
00sab00,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
01impala,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Cosine Similarity**

Cosine similarity quantifies the similarity between two vectors as the cosine of the angle between them. Here, each vector holds one user's ratings across all products.

**Adjusted Cosine**

Adjusted cosine similarity is a modified version of vector-based similarity that accounts for the fact that different users have different rating schemes. In other words, some users tend to rate items highly in general, while others tend to give lower ratings. To correct for this bias, we subtract each user's average rating from each of that user's product ratings.
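
The difference between the two measures can be seen on a small example. The following is a minimal sketch with hypothetical users `u1` to `u3` and products `p1` to `p3`; it is not the notebook's data, only an illustration of mean-centring before computing cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Toy user-product ratings (hypothetical data); NaN means "not rated".
ratings = pd.DataFrame(
    {"p1": [5.0, 4.0, 1.0], "p2": [3.0, 2.0, 5.0], "p3": [np.nan, 3.0, 3.0]},
    index=["u1", "u2", "u3"],
)

# Plain cosine: missing ratings are treated as 0.
cosine_sim = 1 - pairwise_distances(ratings.fillna(0), metric="cosine")

# Adjusted cosine: subtract each user's mean over the rated products only,
# then treat missing ratings as 0 (i.e. "average" for that user).
user_mean = ratings.mean(axis=1)  # NaN values are skipped by default
centered = ratings.sub(user_mean, axis=0)
adjusted_sim = 1 - pairwise_distances(centered.fillna(0), metric="cosine")

print(np.round(cosine_sim, 3))
print(np.round(adjusted_sim, 3))
```

In this toy data, `u1` and `u3` look moderately similar under plain cosine because all ratings are positive, but after mean-centring their preferences point in opposite directions and the adjusted similarity becomes negative.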

### (5.2) User Based Collaborative filtering

#### (5.2.1) Building - User User

#### Using Cosine Similarity

In [40]:


# Creating the User Similarity Matrix using the pairwise_distances function.
user_correlation = 1 - pairwise_distances(df_pivot, metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0

print(f"User correlation dataframe shape = {user_correlation.shape} \n ")
print(f"User correlation dataframe is below = \n {user_correlation} ")

User correlation dataframe shape = (17877, 17877) 
 
User correlation dataframe is below = 
 [[1.        0.        0.        ... 0.        0.        0.       ]
 [0.        1.        0.        ... 0.        0.9486833 0.       ]
 [0.        0.        1.        ... 0.        0.        0.       ]
 ...
 [0.        0.        0.        ... 1.        0.        1.       ]
 [0.        0.9486833 0.        ... 0.        1.        0.       ]
 [0.        0.        0.        ... 1.        0.        1.       ]] 


#### Using adjusted Cosine
#### Here, we do not replace the NaN values with 0 before averaging, so the mean is calculated only over the products rated by the user

In [41]:
# Create a user-product matrix.
df_pivot = train.pivot(
   index='reviews_username',
   columns='name',
   values='reviews_rating'
)

print(f" df_pivot shape = {df_pivot.shape} \n ")
df_pivot.head(3)

 df_pivot shape = (17877, 257) 
 


name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,,,,,,,,,,,...,,,,,,,,,,
00sab00,,,,,,,,,,,...,,,,,,,,,,
01impala,,,,,,,,,,,...,,,,,,,,,,


#### Normalising each user's ratings around a mean of 0

In [42]:
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

print(f" df_subtracted shape = {df_subtracted.shape} \n ")
df_subtracted.head(3)

 df_subtracted shape = (17877, 257) 
 


name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,,,,,,,,,,,...,,,,,,,,,,
00sab00,,,,,,,,,,,...,,,,,,,,,,
01impala,,,,,,,,,,,...,,,,,,,,,,


In [43]:
# Creating the User Similarity Matrix using the pairwise_distances function.
user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(f"user_correlation shape = {user_correlation.shape} \n")
print(f"user_correlation dataframe top rows = \n {user_correlation} ")

user_correlation shape = (17877, 17877) 

user_correlation dataframe top rows = 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] 


#### (5.2.2) Prediction - User User

We make predictions using only the users who are positively correlated with the current user, because we are interested in the users who are most similar to them. Correlation values below 0 are therefore set to 0.

In [44]:
user_correlation[user_correlation<0]=0
user_correlation

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

The predicted rating of a user for a product (whether already rated or not) is the sum of the other users' ratings for that product (as present in the rating dataset), weighted by their correlation with the user.

In [45]:
user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
print(f"user_predicted_ratings shape = {user_predicted_ratings.shape} \n")
user_predicted_ratings

user_predicted_ratings shape = (17877, 257) 



array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.77051412, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

Since we are interested only in the products not rated by the user, we ignore the products the user has already rated by setting their predicted ratings to zero.

In [46]:
user_final_rating = np.multiply(user_predicted_ratings,dummy_train)
print(f"user_final_rating shape = {user_final_rating.shape} \n")
user_final_rating.head(3)

user_final_rating shape = (17877, 257) 



name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00sab00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.886751,0.0,0.0,...,0.0,1.154701,0.0,0.0,0.541098,0.0,0.0,0.770514,0.0,0.0
01impala,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
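
The prediction and masking steps above can be traced on a tiny example. This is a minimal sketch with hypothetical users, products and similarity values; only the two operations (weighted sum, then masking of already-rated products) mirror the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 users x 3 products, 0 = not rated.
ratings = pd.DataFrame(
    [[5.0, 0.0, 0.0], [4.0, 2.0, 0.0], [0.0, 0.0, 3.0]],
    index=["u1", "u2", "u3"],
    columns=["p1", "p2", "p3"],
)

# Hypothetical user-user similarity with negative values already clipped to 0.
similarity = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Weighted sum of every user's ratings, weighted by similarity to the target user.
predicted = pd.DataFrame(
    similarity @ ratings.to_numpy(), index=ratings.index, columns=ratings.columns
)

# Mask: 1 where the user has NOT rated the product, 0 where they have.
not_rated_mask = (ratings == 0).astype(int)

# Keep scores only for products the user has not rated yet.
final = predicted * not_rated_mask
print(final)
```

For `u1`, the only surviving score is for `p2`, which comes entirely from the similar user `u2`. The scores are not divided by the sum of the similarities, so they are useful for ranking products rather than as ratings on the 1 to 5 scale.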


#### (5.2.3) Finding the top 20 recommendations for the user

In [47]:
# Take the user ID as input.
user_input = input("Enter your user name")
print(f"User name input = {user_input} ")

Enter your user namedanielle
User name input = danielle 


In [48]:
top20_recommendations_for_user = user_final_rating.loc[user_input].sort_values(ascending=False)[0:20]
top20_recommendations_for_user

name
Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd                    9.753953
Mike Dave Need Wedding Dates (dvd + Digital)                                       8.554198
Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)                5.772955
Pendaflex174 Divide It Up File Folder, Multi Section, Letter, Assorted, 12/pack    3.829053
The Resident Evil Collection 5 Discs (blu-Ray)                                     3.715511
Tostitos Bite Size Tortilla Chips                                                  3.176428
Hormel Chili, No Beans                                                             3.169038
Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total                          3.094629
Red (special Edition) (dvdvideo)                                                   2.536889
Caress Moisturizing Body Bar Natural Silk, 4.75oz                                  2.166667
Coty Airspun Face Powder, Translucent Extra Coverage                       

#### (5.2.4) Evaluation - User User

Evaluation follows the same steps as the prediction above. 
The only difference is that we evaluate on the products already rated by the user instead of predicting for the products not rated by the user.

In [49]:
# Find out the common users of test and train dataset.
common = test[test.reviews_username.isin(train.reviews_username)]
print(f"common shape = {common.shape} \n")
common.head()

common shape = (1103, 3) 



| | reviews_username | name | reviews_rating |
|---|---|---|---|
| 25535 | themcdermitts | Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total | 5.0 |
| 19051 | nami | Burt's Bees Lip Shimmer, Raisin | 5.0 |
| 16377 | manny | Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd) | 5.0 |
| 10150 | heather4 | Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total | 5.0 |
| 17807 | mita | Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total | 5.0 |


In [50]:
# convert into the user-product matrix.
common_user_based_matrix = common.pivot_table(index='reviews_username', columns='name', values='reviews_rating')
print(f" common_user_based_matrix shape = {common_user_based_matrix.shape} ")

 common_user_based_matrix shape = (912, 125) 


In [51]:
# Convert the user_correlation matrix into dataframe.
user_correlation_df = pd.DataFrame(user_correlation)
print(f"user_correlation_df shape = {user_correlation_df.shape} \n")
print(f"df_subtracted shape = {df_subtracted.shape} \n")
df_subtracted.head(1)

user_correlation_df shape = (17877, 17877) 

df_subtracted shape = (17877, 257) 



name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,,,,,,,,,,,...,,,,,,,,,,


In [52]:
user_correlation_df['reviews_username'] = df_subtracted.index
user_correlation_df.set_index('reviews_username',inplace=True)
print(f"user_correlation_df shape = {user_correlation_df.shape} \n ")
user_correlation_df.head()

user_correlation_df shape = (17877, 17877) 
 


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,17867,17868,17869,17870,17871,17872,17873,17874,17875,17876
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00sab00,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01impala,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02dakota,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02deuce,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
list_name = common.reviews_username.tolist()
print(f"Length of list_name list = {len(list_name)} \n ")

user_correlation_df.columns = df_subtracted.index.tolist()
print(f"user_correlation_df shape = {user_correlation_df.shape} \n ")

user_correlation_df_1 =  user_correlation_df[user_correlation_df.index.isin(list_name)]
print(f"user_correlation_df_1 shape = {user_correlation_df_1.shape} \n")

user_correlation_df_2 = user_correlation_df_1.T[user_correlation_df_1.T.index.isin(list_name)]
print(f"user_correlation_df_2 shape = {user_correlation_df_2.shape} \n")

user_correlation_df_3 = user_correlation_df_2.T
print(f"user_correlation_df_3 shape = {user_correlation_df_3.shape} \n")

Length of list_name list = 1103 
 
user_correlation_df shape = (17877, 17877) 
 
user_correlation_df_1 shape = (912, 17877) 

user_correlation_df_2 shape = (912, 912) 

user_correlation_df_3 shape = (912, 912) 



In [54]:
user_correlation_df_3.head()

Unnamed: 0_level_0,1234,1943,1witch,50cal,aaron,acjuarez08,adriana,adriana9999,adrienne,ah78,...,wonderwoman,wonster67,woody,woottos,worm,xavier,yummy,zach,zippy,zitro
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.288675,0.0
1943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1witch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50cal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aaron,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
user_correlation_df_3[user_correlation_df_3<0]=0

common_user_predicted_ratings = np.dot(user_correlation_df_3, common_user_based_matrix.fillna(0))
print(f"common_user_predicted_ratings shape = {common_user_predicted_ratings.shape} \n")
common_user_predicted_ratings

common_user_predicted_ratings shape = (912, 125) 



array([[1.44337567, 0.        , 3.2620217 , ..., 0.82823645, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [2.5       , 0.        , 2.5819889 , ..., 3.9345476 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [56]:
dummy_test = common.copy()

dummy_test['reviews_rating'] = dummy_test['reviews_rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='reviews_username', columns='name', values='reviews_rating').fillna(0)

print(f"dummy_test shape = {dummy_test.shape} \n")

dummy_test shape = (912, 125) 



In [57]:
common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)
print(f"common_user_predicted_ratings shape = {common_user_predicted_ratings.shape} \n")
common_user_predicted_ratings.head(3)

common_user_predicted_ratings shape = (912, 125) 



name,100:Complete First Season (blu-Ray),"42 Dual Drop Leaf Table with 2 Madrid Chairs""",Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,"Aussie Aussome Volume Shampoo, 13.5 Oz","Australian Gold Exotic Blend Lotion, SPF 4","Aveeno Baby Continuous Protection Lotion Sunscreen with Broad Spectrum SPF 55, 4oz","Avery174 Ready Index Contemporary Table Of Contents Divider, 1-8, Multi, Letter",Axe Dry Anti-Perspirant Deodorant Invisible Solid Phoenix,...,Tresemme Kertatin Smooth Infusing Conditioning,Various Artists - Choo Choo Soul (cd),Vaseline Intensive Care Healthy Hands Stronger Nails,Vaseline Intensive Care Lip Therapy Cocoa Butter,"Vicks Vaporub, Regular, 3.53oz",Wagan Smartac 80watt Inverter With Usb,"Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee",Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1witch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Calculating the RMSE for only the products rated by the user. For RMSE, the predicted ratings are normalised to the (1, 5) range.

In [58]:
from numpy import *
X  = common_user_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
scaler.fit(X)
y = (scaler.transform(X))

print(y)

[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]


In [59]:
common_ = common.pivot_table(index='reviews_username', columns='name', values='reviews_rating')

# Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))

rmse = (((common_ - y) ** 2).sum().sum() / total_non_nan) ** 0.5
print(f"RMSE (Root Mean Squared Error) = {rmse} \n ")

RMSE (Root Mean Squared Error) = 2.7595285805425926 
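
The RMSE computation can be checked by hand on a few numbers. The sketch below uses hypothetical arrays and a simplified variant: it rescales with the global minimum and maximum of the predicted scores, whereas `MinMaxScaler` rescales each column independently, so the figures are not comparable with the notebook's result.

```python
import numpy as np

# Hypothetical predicted scores and true ratings; NaN = not rated in the test set.
predicted = np.array([[2.0, np.nan, 8.0], [np.nan, 4.0, 6.0]])
actual = np.array([[1.0, np.nan, 5.0], [np.nan, 3.0, 4.0]])

# Rescale predictions to [1, 5] using the global min/max of the non-NaN scores.
lo, hi = np.nanmin(predicted), np.nanmax(predicted)
scaled = 1 + 4 * (predicted - lo) / (hi - lo)

# RMSE over the entries where both a prediction and a true rating exist.
mask = ~np.isnan(scaled) & ~np.isnan(actual)
rmse = np.sqrt(np.mean((scaled[mask] - actual[mask]) ** 2))
print(round(rmse, 4))
```

Restricting the error to the masked entries is what keeps unrated products from contributing to the RMSE.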
 


#### (5.2.5) Fine tuning the recommendation system for User-User

In [60]:
#url = 'https://raw.githubusercontent.com/adwayskirwe/Capstone/main/sample30.csv'
#reviews_df = pd.read_csv(url)

In [61]:
top20_recommendations_for_user.index.tolist

<bound method IndexOpsMixin.tolist of Index(['Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd',
       'Mike Dave Need Wedding Dates (dvd + Digital)',
       'Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)',
       'Pendaflex174 Divide It Up File Folder, Multi Section, Letter, Assorted, 12/pack',
       'The Resident Evil Collection 5 Discs (blu-Ray)',
       'Tostitos Bite Size Tortilla Chips', 'Hormel Chili, No Beans',
       'Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total',
       'Red (special Edition) (dvdvideo)',
       'Caress Moisturizing Body Bar Natural Silk, 4.75oz',
       'Coty Airspun Face Powder, Translucent Extra Coverage',
       'Windex Original Glass Cleaner Refill 67.6oz (2 Liter)',
       'Jason Aldean - They Don't Know',
       'Just For Men Touch Of Gray Gray Hair Treatment, Black T-55',
       'Neutrogena Wet Skin Sunscreen Spray Broad Spectrum SPF 50, 5oz',
       'Clear Scalp & Hair Therapy Total Care Nourishing 

In [62]:
# load saved model
with open('model_pkl' , 'rb') as f:
    logisticRegression = pickle.load(f)

The approach below finds the positive sentiment ratio for each of the 20 recommended products and selects the top 5 products with the highest ratio. <br><br>
To find the positive sentiment ratio, we need to identify all reviews of each product and classify each review as positive or negative. As the dataset has a fixed number of users and reviews, we can take the more efficient route of reading the sentiment already available in the 'user_sentiment_bool' column and computing the positive sentiment ratio from it. The code for this approach is commented out.
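
A compact way to express the column-based approach is a group-by on the product name. This is a minimal sketch on a hypothetical miniature of `reviews_df` (`reviews_demo` and `recommended` are made-up names and data), and it selects the top 2 rather than the top 5 to keep the example small.

```python
import pandas as pd

# Hypothetical miniature of reviews_df with the two columns used here.
reviews_demo = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "user_sentiment_bool": [1, 1, 0, 1, 1, 0, 0, 1, 0],
})
recommended = ["A", "B", "C"]  # stand-in for the top-20 recommendation list

# The mean of a 0/1 column is the share of positive reviews per product.
ratio = (
    reviews_demo[reviews_demo["name"].isin(recommended)]
    .groupby("name")["user_sentiment_bool"]
    .mean()
    .round(2)
)

# Keep the products with the highest positive sentiment ratio.
top = ratio.sort_values(ascending=False).head(2)
print(top)
```

Because the sentiment is stored as 0 or 1, the mean per product equals positive reviews divided by all reviews, which is the same ratio the commented-out loop computes.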

In [63]:
'''
#List that will store positive-sentiment ratio for each of 20 recommended products
positive_sentiment_ratio_list = []
product_list = []

#For each of the 20 recommended products 
for i in range(0,len(top20_recommendations_for_user.index.tolist())):
  
  #Get product name
  product = top20_recommendations_for_user.index.tolist()[i]

  #Find out all positive + negative reviews about product from training dataset
  all_reviews_for_respective_product_df = reviews_df[reviews_df['name'] == product]['user_sentiment_bool']
   
  #Find Positive sentiment ratio of the product 
  positive_sentiment_ratio = round(np.count_nonzero(all_reviews_for_respective_product_df == 1)/(np.count_nonzero(all_reviews_for_respective_product_df == 0)+np.count_nonzero(all_reviews_for_respective_product_df == 1)),2)

  #Append the Positive sentiment ratio of the product to the list
  positive_sentiment_ratio_list.append(positive_sentiment_ratio)
  product_list.append(product)
'''


In [64]:
'''
#Finding product recommendation index of top 5 products having highest positive_sentiment_ratio
print(positive_sentiment_ratio_list)
index_list = np.argsort(positive_sentiment_ratio_list)[-5:]
print(f"index_list = {index_list}")
'''


The approach below also finds the positive sentiment ratio for each of the 20 recommended products and selects the top 5 products with the highest ratio.<br><br>

To find the positive sentiment ratio, we need to identify all reviews of each product and classify each review as positive or negative. Here we use our saved machine learning model to predict the sentiment of each review. This is useful in a real-time environment where new reviews keep arriving and no pre-defined sentiment is available for them in the dataset. Hence, this approach is used. <br><br>
However, either approach can be used depending on the use case.
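
The model-based approach can be illustrated end to end on a few sentences. The following is a minimal sketch, not the notebook's pipeline: `demo_vectorizer` and `demo_model` are hypothetical stand-ins trained on four made-up reviews, in place of the saved TF-IDF vectorizer and the pickled logistic regression model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set (assumption): 1 = positive, 0 = negative.
train_texts = [
    "great product love it",
    "terrible waste of money",
    "love this works great",
    "awful broke quickly terrible",
]
train_labels = [1, 0, 1, 0]

# Stand-ins for the notebook's fitted vectorizer and saved model.
demo_vectorizer = TfidfVectorizer()
demo_model = LogisticRegression().fit(
    demo_vectorizer.fit_transform(train_texts), train_labels
)

# Hypothetical new reviews, grouped by product.
new_reviews = {
    "A": ["love it great", "great product"],
    "B": ["terrible", "awful waste", "love it"],
}

# Positive sentiment ratio = share of reviews predicted as positive.
ratios = {}
for product, texts in new_reviews.items():
    preds = demo_model.predict(demo_vectorizer.transform(texts))
    ratios[product] = round(float(np.mean(preds == 1)), 2)

print(ratios)
```

The key point is that new reviews are passed through `transform` of the already fitted vectorizer, never `fit_transform`, so that they are encoded with the same vocabulary the model was trained on.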

In [65]:

#List that will store postive-sentiment ratio for each of 20 recommended products
positive_sentiment_ratio_list = []
product_list = []

#For each of the 20 recommended products 
for i in range(0,len(top20_recommendations_for_user.index.tolist())):
  
  #Get product name
  product = top20_recommendations_for_user.index.tolist()[i]

  #Find out all postive + negative reviews about product from training dataset
  all_reviews_for_respective_product_df = reviews_df[reviews_df['name'] == product]['reviews_text']
  
  #Preprocess all the above extracted reviews before predicting user sentiment
  preprocessed_review = [preprocess(review) for review in tqdm(all_reviews_for_respective_product_df)]
  all_reviews_for_respective_product_df = pd.Series(preprocessed_review)

  #Creating TF-IDF structure for all the reviews
  X = vectorizer.transform(all_reviews_for_respective_product_df)
  all_reviews_for_respective_product_features_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

  #Predict the sentiment of all the reviews of the product
  product_review_pred = logisticRegression.predict(all_reviews_for_respective_product_features_df)
  
  #Find Positive sentiment ratio of the product 
  positive_sentiment_ratio = round(np.count_nonzero(product_review_pred == 1)/(np.count_nonzero(product_review_pred == 0)+np.count_nonzero(product_review_pred == 1)),2)

  #Append the Positive sentiment ratio of the product to the list
  positive_sentiment_ratio_list.append(positive_sentiment_ratio)
  product_list.append(product)

100%|██████████| 3325/3325 [00:12<00:00, 275.86it/s]
100%|██████████| 757/757 [00:01<00:00, 388.73it/s]
100%|██████████| 1143/1143 [00:03<00:00, 352.25it/s]
100%|██████████| 310/310 [00:01<00:00, 192.86it/s]
100%|██████████| 845/845 [00:02<00:00, 302.89it/s]
100%|██████████| 264/264 [00:00<00:00, 479.54it/s]
100%|██████████| 196/196 [00:00<00:00, 371.59it/s]
100%|██████████| 8545/8545 [00:35<00:00, 238.51it/s]
100%|██████████| 672/672 [00:01<00:00, 356.64it/s]
100%|██████████| 68/68 [00:00<00:00, 230.50it/s]
100%|██████████| 158/158 [00:00<00:00, 169.29it/s]
100%|██████████| 348/348 [00:01<00:00, 187.33it/s]
100%|██████████| 204/204 [00:00<00:00, 386.64it/s]
100%|██████████| 224/224 [00:01<00:00, 188.11it/s]
100%|██████████| 14/14 [00:00<00:00, 98.59it/s]
100%|██████████| 372/372 [00:03<00:00, 116.04it/s]
100%|██████████| 141/141 [00:00<00:00, 321.29it/s]
100%|██████████| 247/247 [00:00<00:00, 286.36it/s]
100%|██████████| 96/96 [00:00<00:00, 144.29it/s]
100%|██████████| 634/634 [00:03<

In [66]:
#Finding product recommendation index of top 5 products having highest positive_sentiment_ratio
print(positive_sentiment_ratio_list)
index_list = np.argsort(positive_sentiment_ratio_list)[-5:]
print(f"index_list = {index_list}")

[0.98, 0.95, 0.99, 0.94, 0.88, 0.88, 0.89, 0.95, 0.98, 0.99, 0.96, 0.91, 0.97, 0.95, 1.0, 0.98, 0.99, 0.98, 1.0, 0.98]
index_list = [16  2  9 14 18]
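
The selection step above relies on `np.argsort` returning indices in ascending order of value, so the last entries point at the largest ratios. A minimal sketch with hypothetical ratios (all names and values here are illustrative):

```python
import numpy as np

# Hypothetical positive-sentiment ratios for six candidate products
ratios = np.array([0.98, 0.95, 0.99, 0.94, 1.0, 0.88])

# argsort sorts ascending, so the last k positions hold the k largest values
k = 3
top_k_ascending = np.argsort(ratios)[-k:]

# Reverse the slice so the best product comes first
top_k_descending = top_k_ascending[::-1]

print(top_k_ascending)
print(top_k_descending)
```

When several products share the same ratio, which of them make the cut depends on how the sort orders ties. Passing `kind="stable"` to `np.argsort` makes that order follow the original positions.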


In [67]:
recommendation_dict = {}
recommendation_dict['product'] = product_list
recommendation_dict['recommendation_score'] = top20_recommendations_for_user.values
recommendation_dict['positive_sentiment_ratio'] = positive_sentiment_ratio_list

recommendation_df = pd.DataFrame(recommendation_dict)
recommendation_df.sort_values(by=["positive_sentiment_ratio"],ascending=False)

Unnamed: 0,product,recommendation_score,positive_sentiment_ratio
18,Vaseline Intensive Care Healthy Hands Stronger Nails,0.794497,1.0
14,"Neutrogena Wet Skin Sunscreen Spray Broad Spectrum SPF 50, 5oz",1.154701,1.0
9,"Caress Moisturizing Body Bar Natural Silk, 4.75oz",2.166667,0.99
2,Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd),5.772955,0.99
16,Alex Cross (dvdvideo),0.833333,0.99
0,Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd,9.753953,0.98
17,Dark Shadows (includes Digital Copy) (ultraviolet) (dvdvideo),0.819892,0.98
15,Clear Scalp & Hair Therapy Total Care Nourishing Shampoo,0.993382,0.98
19,Olay Regenerist Deep Hydration Regenerating Cream,0.666667,0.98
8,Red (special Edition) (dvdvideo),2.536889,0.98


In [68]:
#Mapping each index to the respective product name

product_name_list = []
manufacturer_list = []
category_list = []

for i in range(0, len(index_list)):
  product_name = top20_recommendations_for_user.index.tolist()[index_list[i]]
  manufacturer = reviews_df[reviews_df["name"] == product_name]["manufacturer"].unique()[0]
  category = reviews_df[reviews_df["name"] == product_name]["categories"].unique()[0]

  product_name_list.append(product_name)
  manufacturer_list.append(manufacturer)
  category_list.append(category)

mapping_dict = {}
mapping_dict["product_name"] = product_name_list
mapping_dict["manufacturer"] = manufacturer_list
mapping_dict["category"] = category_list

recommended_products_df = pd.DataFrame(mapping_dict)
recommended_products_df

Unnamed: 0,product_name,manufacturer,category
0,Alex Cross (dvdvideo),,"Movies & TV Shows,Instawatch Movies By VUDU,Instawatch Movies,Movies, Music & Books,Movies,Action & Adventure,Movies & TV,Shop Instawatch,Movies & Music"
1,Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd),Walt Disney,"Movies, Music & Books,Movies,Kids' & Family,Ways To Shop Entertainment,Movies & Tv On Blu-Ray,Movies & TV,Disney,Blu-ray,Children & Family,Movies & Music,Movies & TV Shows..."
2,"Caress Moisturizing Body Bar Natural Silk, 4.75oz",Caress,"Personal Care,Bath, Shower & Soap,Bar Soap,Bath & Body,Body Wash & Cleansers,Beauty,Bar Soaps,Cleansers,Soaps,Body Cleansers"
3,"Neutrogena Wet Skin Sunscreen Spray Broad Spectrum SPF 50, 5oz",Johnson & Johnson SLC,"Personal Care,Sun Care,Spray-on Sunscreen SPF 15 And Above,Beauty,Skin Care,Featured Brands,Health & Beauty,Neutrogena,Johnson & Johnson Beauty,Johnson & Johnson,Sun & Tan..."
4,Vaseline Intensive Care Healthy Hands Stronger Nails,Vaseline,"Personal Care,Skin Care,Hand Cream,Beauty,Body Lotions & Creams,Featured Brands,Health & Beauty,Unilever,Holiday Shop,Christmas,Bath & Body,Hand Creams & Lotions,Nail Care..."


In [69]:
reviews_df[reviews_df['reviews_username'] ==  user_input][["reviews_username","name","manufacturer","categories"]]

Unnamed: 0,reviews_username,name,manufacturer,categories
36,danielle,Ambi Complexion Cleansing Bar,FLEMING & CO,"Personal Care,Bath, Shower & Soap,Featured Brands,Health & Beauty,Johnson & Johnson,Bath & Body,Body Wash & Cleansers,Beauty,Skin Care,Facial Cleansers,Soaps"
1077,danielle,"Aussie Aussome Volume Shampoo, 13.5 Oz",Aussie,"Personal Care,Hair Care,Shampoo,Beauty,Shampoo & Conditioner,Shampoos,Daily Shampoo"
2913,danielle,My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital),Universal,"Movies, Music & Books,Movies,Comedy,Movies & TV Shows,Instawatch Movies By VUDU,Shop Instawatch,Movies & TV,Ways To Shop Entertainment,Movies & Tv On Blu-Ray,Movies & Musi..."
2914,danielle,My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital),Universal,"Movies, Music & Books,Movies,Comedy,Movies & TV Shows,Instawatch Movies By VUDU,Shop Instawatch,Movies & TV,Ways To Shop Entertainment,Movies & Tv On Blu-Ray,Movies & Musi..."
15437,danielle,Chester's Cheese Flavored Puffcorn Snacks,Frito-Lay,"Food,Packaged Foods,Snacks,Chips & Pretzels,Food & Beverage,Cookies, Chips & Snacks,Chips,Snacks, Cookies & Chips,Food & Beverage Ways To Shop,Tailgating Essentials,Grocer..."
18419,danielle,Trend Lab Park Nursing Cover - Paisley,Trend Lab,"Baby,Nursing & Feeding,Breastfeeding,Nursing Covers,Baby Gear,Nursing,Trend Lab,Feeding,Breastfeeding/Nursing,Nursing Covers/Blankets,Gift Baskets,Gift Sets,Baby Feeding,N..."
18425,danielle,Trend Lab Park Nursing Cover - Paisley,Trend Lab,"Baby,Nursing & Feeding,Breastfeeding,Nursing Covers,Baby Gear,Nursing,Trend Lab,Feeding,Breastfeeding/Nursing,Nursing Covers/Blankets,Gift Baskets,Gift Sets,Baby Feeding,N..."


### (5.3) Item Based Collaborative filtering

We take the transpose of the rating matrix so that ratings are normalised around the mean of each product. In the user-based similarity, we took the mean for each user instead of each product.

#### (5.3.1) Building - Item Item

In [70]:
df_pivot = train.pivot(
    index='reviews_username',
    columns='name',
    values='reviews_rating'
).T

print(f"df_pivot shape = {df_pivot.shape} \n")
df_pivot.head(3)

df_pivot shape = (257, 17877) 



reviews_username,00dog3,00sab00,01impala,02dakota,02deuce,0325home,06stidriver,09mommy11,1085,10ten,...,zoso60,zotox,zowie,zsarah,zulaa118,zwithanx,zxcsdfd,zyiah4,zzdiane,zzz1127
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,,,,,,,,,,,...,,,,,,,,,,
100:Complete First Season (blu-Ray),,,,,,,,,,,...,,,,,,,,,,
2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,,,,,,,,,,,...,,,,,,,,,,


Normalising the ratings of each product for use with the adjusted cosine similarity

In [71]:
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

print(f"df_subtracted shape = {df_subtracted.shape} \n")
df_subtracted.head(3)

df_subtracted shape = (257, 17877) 



reviews_username,00dog3,00sab00,01impala,02dakota,02deuce,0325home,06stidriver,09mommy11,1085,10ten,...,zoso60,zotox,zowie,zsarah,zulaa118,zwithanx,zxcsdfd,zyiah4,zzdiane,zzz1127
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,,,,,,,,,,,...,,,,,,,,,,
100:Complete First Season (blu-Ray),,,,,,,,,,,...,,,,,,,,,,
2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,,,,,,,,,,,...,,,,,,,,,,
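
The mean-centering step can be sketched on a toy item x user matrix. The values below are hypothetical; the broadcasting with `np.newaxis` has the same effect as the double transpose used in the cell above.

```python
import numpy as np

# Hypothetical item x user rating matrix; NaN marks "not rated"
ratings = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 4.0, np.nan],
    [1.0, 2.0, 3.0],
])

# Mean rating of each item, ignoring missing entries
item_mean = np.nanmean(ratings, axis=1)

# Subtract each item's mean from its own row; NaN entries stay NaN
centered = ratings - item_mean[:, np.newaxis]

print(item_mean)
print(centered)
```

Note that an item with a single rating, or with identical ratings from all its users, ends up with only zeros and NaNs after centering.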


Finding the cosine similarity using the pairwise distances approach

In [72]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0

print(f"item_correlation shape = {item_correlation.shape} \n")
print(item_correlation)

item_correlation shape = (257, 257) 

[[ 1.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          1.          0.         ... -0.00597492  0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [ 0.         -0.00597492  0.         ...  1.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          1.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   1.        ]]


Keeping only the correlations greater than 0 (positively correlated); negative values are set to 0.

In [73]:
item_correlation[item_correlation<0]=0
item_correlation

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
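
To make the similarity and filtering steps concrete, here is a minimal sketch in plain NumPy rather than `pairwise_distances`. The function name, the toy matrix, and the choice to give all-zero rows a similarity of 0 are assumptions of this sketch, not the notebook's code.

```python
import numpy as np

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(matrix, axis=1)
    # Avoid division by zero: all-zero rows get similarity 0 with everything
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit_rows = matrix / safe_norms[:, np.newaxis]
    return unit_rows @ unit_rows.T

# Hypothetical mean-centered item x user matrix, missing ratings filled with 0
centered = np.array([
    [1.0, 0.0, -1.0],
    [0.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
])

similarity = cosine_similarity_matrix(centered)

# Keep only positive similarities, as in the filtering step
similarity[similarity < 0] = 0
print(similarity)
```

In this toy data the first and fourth items are rated identically (similarity 1), the first and third are rated in opposite directions (negative similarity, clipped to 0), and the second item carries no information after centering.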

#### (5.3.2) Prediction - Item Item

In [74]:
item_predicted_ratings = np.dot((df_pivot.fillna(0).T),item_correlation)
print(f"item_predicted_ratings shape = {item_predicted_ratings.shape} \n")
print(f"dummy_train shape = {dummy_train.shape} \n")

item_predicted_ratings

item_predicted_ratings shape = (17877, 257) 

dummy_train shape = (17877, 257) 



array([[0.        , 0.        , 0.        , ..., 0.0154398 , 0.        ,
        0.        ],
       [0.        , 0.02163843, 0.        , ..., 0.00204219, 0.        ,
        0.        ],
       [0.        , 0.01172631, 0.        , ..., 0.        , 0.        ,
        0.01075527],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.03191969, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

Filtering the ratings to keep only the products not rated by the user, for recommendation

In [75]:
item_final_rating = np.multiply(item_predicted_ratings,dummy_train)
print(f"item_final_rating shape = {item_final_rating.shape} \n")
item_final_rating.head(3)

item_final_rating shape = (17877, 257) 



name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00dog3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01544,0.0,0.0
00sab00,0.0,0.021638,0.0,0.0,0.0,0.0,0.0,0.008015,0.0,0.0,...,0.0,0.032179,0.0,0.0,0.0,0.0,0.0,0.002042,0.0,0.0
01impala,0.0,0.011726,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.018086,0.0,0.0,0.004527,0.0,0.0,0.0,0.0,0.010755
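
The prediction and filtering steps amount to a matrix product followed by an element-wise mask. A small sketch with hypothetical numbers (`not_rated_mask` plays the role of `dummy_train` here):

```python
import numpy as np

# Hypothetical user x item ratings, with 0 standing for "not rated"
ratings = np.array([
    [5.0, 0.0, 0.0],
    [0.0, 4.0, 2.0],
])

# Hypothetical item x item similarity matrix (symmetric, non-negative)
item_similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Score of an item for a user = similarity-weighted sum of the user's ratings
predicted = ratings @ item_similarity

# Mask: 1 where the user has NOT rated the item, 0 where they have
not_rated_mask = (ratings == 0).astype(float)

# Keep scores only for unrated items, so rated items are never recommended
final_scores = predicted * not_rated_mask
print(final_scores)
```

The scores are weighted sums, not ratings on the 1 to 5 scale, which is why they are used only for ranking at this stage.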


#### (5.3.3) Finding the top 20 recommendations for the user

In [76]:
# Take the user  as input
user_input = input("Enter your user name")
print(f"user input = {user_input} \n")

Enter your user namedanielle
user input = danielle 



In [77]:
# Recommending the top 20 products to the user.
top20_recommendations_for_user_ITEMITEM = item_final_rating.loc[user_input].sort_values(ascending=False)[0:20]
top20_recommendations_for_user_ITEMITEM

name
Windex Original Glass Cleaner Refill 67.6oz (2 Liter)                             0.354194
Clear Scalp & Hair Therapy Total Care Nourishing Shampoo                          0.251427
Mike Dave Need Wedding Dates (dvd + Digital)                                      0.244659
Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd                   0.230864
Bilbao Nightstand Gray Oak - South Shore                                          0.207864
Stargate (ws) (ultimate Edition) (director's Cut) (dvdvideo)                      0.197485
Cheetos Crunchy Flamin' Hot Cheese Flavored Snacks                                0.143983
Feit 60-Watt A19 Gu24 Base Led Light Bulb - Soft White                            0.140142
The Seaweed Bath Co. Argan Conditioner, Smoothing Citrus                          0.130069
Diet Canada Dry Ginger Ale - 12pk/12 Fl Oz Cans                                   0.128804
Various - Red Hot Blue:Tribute To Cole Porter (cd)                                0.0

#### (5.3.4) Evaluation - Item Item

Evaluation is the same as the prediction seen above. The only difference is that we evaluate on the products already rated by the user, instead of predicting for the products not rated by the user.

In [78]:
print(f"test columns = {test.columns} \n")

common = test[test.name.isin(train.name)]
print(f"common shape = {common.shape} \n")
common.head(3)

test columns = Index(['reviews_username', 'name', 'reviews_rating'], dtype='object') 

common shape = (8268, 3) 



Unnamed: 0,reviews_username,name,reviews_rating
12376,jodyl,The Honest Company Laundry Detergent,5.0
3545,buddy23,Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd,4.0
6234,crumextreme,"Burt's Bees Lip Shimmer, Raisin",5.0


In [79]:
common_item_based_matrix = common.pivot_table(index='reviews_username', columns='name', values='reviews_rating').T
print(f"common_item_based_matrix shape = {common_item_based_matrix.shape} \n")

item_correlation_df = pd.DataFrame(item_correlation)
print(f"item_correlation_df shape = {item_correlation_df.shape} \n")
item_correlation_df.head(1)

common_item_based_matrix shape = (197, 7937) 

item_correlation_df shape = (257, 257) 



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,247,248,249,250,251,252,253,254,255,256
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [80]:
item_correlation_df['name'] = df_subtracted.index
item_correlation_df.set_index('name',inplace=True)
print(f"item_correlation_df shape = {item_correlation_df.shape} \n")
item_correlation_df.head()

item_correlation_df shape = (257, 257) 



Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,247,248,249,250,251,252,253,254,255,256
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100:Complete First Season (blu-Ray),0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"2x Ultra Era with Oxi Booster, 50fl oz",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4C Grated Parmesan Cheese 100% Natural 8oz Shaker,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [81]:
list_name = common.name.tolist()
print(f"Length of list_name list = {len(list_name)} \n ")

item_correlation_df.columns = df_subtracted.index.tolist()
print(f"item_correlation_df shape = {item_correlation_df.shape} \n ")

item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]
print(f"item_correlation_df_1 shape = {item_correlation_df_1.shape} \n")

item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]
print(f"item_correlation_df_2 shape = {item_correlation_df_2.shape} \n")

item_correlation_df_3 = item_correlation_df_2.T
print(f"item_correlation_df_3 shape = {item_correlation_df_3.shape} \n")

item_correlation_df_3.head(3)

Length of list_name list = 8268 
 
item_correlation_df shape = (257, 257) 
 
item_correlation_df_1 shape = (197, 257) 

item_correlation_df_2 shape = (197, 197) 

item_correlation_df_3 shape = (197, 197) 



Unnamed: 0_level_0,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,Various Artists - Choo Choo Soul (cd),Vaseline Intensive Care Healthy Hands Stronger Nails,Vaseline Intensive Care Lip Therapy Cocoa Butter,"Vicks Vaporub, Regular, 3.53oz",Wagan Smartac 80watt Inverter With Usb,"Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100:Complete First Season (blu-Ray),0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [82]:
item_correlation_df_3[item_correlation_df_3<0]=0

common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))

print(f"common_item_predicted_ratings shape = {common_item_predicted_ratings.shape} \n")
common_item_predicted_ratings


common_item_predicted_ratings shape = (197, 7937) 



array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        2.59929544e-02, 0.00000000e+00, 0.00000000e+00],
       [5.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.59925756e-03, 1.53391625e-02, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        8.57070593e-03, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.30621827e-02, 0.00000000e+00, 0.00000000e+00]])

Dummy test will be used for evaluation. To evaluate, we only make predictions on the products rated by the user, so these are marked as 1. This is just the opposite of dummy_train.

In [83]:
dummy_test = common.copy()

dummy_test['reviews_rating'] = dummy_test['reviews_rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='reviews_username', columns='name', values='reviews_rating').T.fillna(0)

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)

print(f"common_item_predicted_ratings shape = {common_item_predicted_ratings.shape} \n")

common_item_predicted_ratings shape = (197, 7937) 



The products not rated are marked as 0 for evaluation. We then build the item-item matrix representation.

In [84]:
common_ = common.pivot_table(index='reviews_username', columns='name', values='reviews_rating').T

In [85]:
X  = common_item_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

MinMaxScaler(feature_range=(1, 5))
[[nan nan nan ... nan nan nan]
 [ 1. nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]


In [86]:
# Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))
total_non_nan

8192

In [87]:
rmse = (sum(sum((common_ - y )**2))/total_non_nan)**0.5
print(f"Root Mean Squared Error = {rmse} ")

Root Mean Squared Error = 3.577252161061961 
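
The RMSE above is computed only over positions that hold a value. A minimal sketch of that idea, assuming the predictions have already been rescaled to the 1 to 5 range (the function name and the data are hypothetical):

```python
import numpy as np

def rmse_on_rated(actual, predicted):
    """RMSE over the positions where both arrays hold a value (non-NaN)."""
    valid = ~np.isnan(actual) & ~np.isnan(predicted)
    squared_error = (actual[valid] - predicted[valid]) ** 2
    return float(np.sqrt(squared_error.mean()))

# Hypothetical actual ratings (NaN = not rated) and rescaled predictions
actual = np.array([[5.0, np.nan], [3.0, 4.0]])
predicted = np.array([[4.0, np.nan], [5.0, 4.0]])

error = rmse_on_rated(actual, predicted)
print(error)
```

Unlike the notebook cell, this sketch counts only positions where both the actual rating and the prediction are present, which keeps the numerator and the denominator over the same set of entries.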


In [88]:
#Printing top 20 recommended products for the user, from the Item based collaborative filtering
top20_recommendations_for_user_ITEMITEM.index.tolist()

['Windex Original Glass Cleaner Refill 67.6oz (2 Liter)',
       'Clear Scalp & Hair Therapy Total Care Nourishing Shampoo',
       'Mike Dave Need Wedding Dates (dvd + Digital)',
       'Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd',
       'Bilbao Nightstand Gray Oak - South Shore',
       'Stargate (ws) (ultimate Edition) (director's Cut) (dvdvideo)',
       'Cheetos Crunchy Flamin' Hot Cheese Flavored Snacks',
       'Feit 60-Watt A19 Gu24 Base Led Light Bulb - Soft White',
       'The Seaweed Bath Co. Argan Conditioner, Smoothing Citrus',
       'Diet Canada Dry Ginger Ale - 12pk/12 Fl Oz Cans',
       'Various - Red Hot Blue:Tribute To Cole Porter (cd)',
       'Wagan Smartac 80watt Inverter With Usb',
       'SC Johnson One Step No Buff Wax',
       'Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)',
       'Bi-O-kleen Spray & Wipe All Purpose Cleaner',
       '100:Complete First Season (blu-Ray)',

#### (5.3.5) Fine tuning the recommendation system for Item-Item

In [89]:
positive_sentiment_ratio_list = []
product_list = []

#For each of the 20 recommended products 
for i in range(0,len(top20_recommendations_for_user_ITEMITEM.index.tolist())):
  
  #Get product name
  product = top20_recommendations_for_user_ITEMITEM.index.tolist()[i]

  #Find out all positive + negative reviews about product from training dataset
  all_reviews_for_respective_product_df = reviews_df[reviews_df['name'] == product]['reviews_text']
  
  #Preprocess all the above extracted reviews before predicting user sentiment
  preprocessed_review = [preprocess(review) for review in tqdm(all_reviews_for_respective_product_df)]
  all_reviews_for_respective_product_df = pd.Series(preprocessed_review)

  #Creating TF-IDF structure for all the reviews
  X = vectorizer.transform(all_reviews_for_respective_product_df)
  all_reviews_for_respective_product_features_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

  #Predict the sentiment of all the reviews of the product
  product_review_pred = logisticRegression.predict(all_reviews_for_respective_product_features_df)
  
  #Find Positive sentiment ratio of the product 
  positive_sentiment_ratio = round(np.count_nonzero(product_review_pred == 1)/(np.count_nonzero(product_review_pred == 0)+np.count_nonzero(product_review_pred == 1)),2)

  #Append the Positive sentiment ratio of the product to the list
  positive_sentiment_ratio_list.append(positive_sentiment_ratio)
  product_list.append(product)

100%|██████████| 348/348 [00:01<00:00, 180.16it/s]
100%|██████████| 372/372 [00:03<00:00, 113.55it/s]
100%|██████████| 757/757 [00:01<00:00, 396.65it/s]
100%|██████████| 3325/3325 [00:11<00:00, 293.76it/s]
100%|██████████| 6/6 [00:00<00:00, 33.65it/s]
100%|██████████| 186/186 [00:00<00:00, 240.76it/s]
100%|██████████| 60/60 [00:00<00:00, 312.98it/s]
100%|██████████| 8/8 [00:00<00:00, 77.64it/s]
100%|██████████| 9/9 [00:00<00:00, 133.14it/s]
100%|██████████| 43/43 [00:00<00:00, 197.22it/s]
100%|██████████| 3/3 [00:00<00:00, 350.85it/s]
100%|██████████| 5/5 [00:00<00:00, 209.04it/s]
100%|██████████| 8/8 [00:00<00:00, 75.37it/s]
100%|██████████| 1143/1143 [00:03<00:00, 368.02it/s]
100%|██████████| 8/8 [00:00<00:00, 183.05it/s]
100%|██████████| 139/139 [00:00<00:00, 233.58it/s]
100%|██████████| 15/15 [00:00<00:00, 78.89it/s]
100%|██████████| 264/264 [00:00<00:00, 502.87it/s]
100%|██████████| 57/57 [00:00<00:00, 118.59it/s]
100%|██████████| 53/53 [00:00<00:00, 77.69it/s]


In [90]:
#Finding product recommendation index of top 5 products having highest positive_sentiment_ratio
index_list = np.argsort(positive_sentiment_ratio_list)[-5:]
print(f"index_list = {index_list}")

index_list = [ 8 10 12 14 16]


In [91]:
recommendation_dict = {}
recommendation_dict['product'] = product_list
recommendation_dict['recommendation_score'] = top20_recommendations_for_user_ITEMITEM.values
recommendation_dict['positive_sentiment_ratio'] = positive_sentiment_ratio_list

recommendation_df = pd.DataFrame(recommendation_dict)
recommendation_df.sort_values(by=["positive_sentiment_ratio"],ascending=False)

Unnamed: 0,product,recommendation_score,positive_sentiment_ratio
10,Various - Red Hot Blue:Tribute To Cole Porter (cd),0.062545,1.0
12,SC Johnson One Step No Buff Wax,0.044226,1.0
16,"Iman Second To None Stick Foundation, Clay 1",0.030707,1.0
4,Bilbao Nightstand Gray Oak - South Shore,0.207864,1.0
5,Stargate (ws) (ultimate Edition) (director's Cut) (dvdvideo),0.197485,1.0
14,Bi-O-kleen Spray & Wipe All Purpose Cleaner,0.034258,1.0
7,Feit 60-Watt A19 Gu24 Base Led Light Bulb - Soft White,0.140142,1.0
8,"The Seaweed Bath Co. Argan Conditioner, Smoothing Citrus",0.130069,1.0
13,Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd),0.042373,0.99
15,100:Complete First Season (blu-Ray),0.03192,0.98


In [92]:
#Mapping each index to the respective product name

product_name_list = []
manufacturer_list = []
category_list = []

for i in range(0, len(index_list)):
  product_name = top20_recommendations_for_user_ITEMITEM.index.tolist()[index_list[i]]
  manufacturer = reviews_df[reviews_df["name"] == product_name]["manufacturer"].unique()[0]
  category = reviews_df[reviews_df["name"] == product_name]["categories"].unique()[0]

  product_name_list.append(product_name)
  manufacturer_list.append(manufacturer)
  category_list.append(category)

mapping_dict = {}
mapping_dict["product_name"] = product_name_list
mapping_dict["manufacturer"] = manufacturer_list
mapping_dict["category"] = category_list

recommended_products_df = pd.DataFrame(mapping_dict)
recommended_products_df

Unnamed: 0,product_name,manufacturer,category
0,"The Seaweed Bath Co. Argan Conditioner, Smoothing Citrus",The Seaweed Bath,"Personal Care,Hair Care,Conditioner,Beauty,Shampoo & Conditioner,Salon Hair Care,Conditioners"
1,Various - Red Hot Blue:Tribute To Cole Porter (cd),Capitol,"Movies, Music & Books,Music,R&b,Jazz,Electronic & Dance,Pop,See More Genres,Music on CD or Vinyl,Pop Music on CD or Vinyl,Pop Rock Music on CD or Vinyl,Folk,Progressive Ro..."
2,SC Johnson One Step No Buff Wax,S C JOHNSON WAX,"Household Chemicals,Household Cleaners,Floor Care,Wood Polish,Floor & Carpet Cleaner,Home,Household Supplies,Household Cleaning,Cleaning Products,appliances"
3,Bi-O-kleen Spray & Wipe All Purpose Cleaner,Biokleen Cleaners,"Household Chemicals,Household Cleaners,All Purpose Cleaner,Household Essentials,Bathroom,Bathroom Cleaners,Health & Household,Household Supplies,Household Cleaning,All-Pur..."
4,"Iman Second To None Stick Foundation, Clay 1",IMAN,"Personal Care,Makeup,Concealer & Foundation,Foundation,Beauty,Face"


## Task 6: Fine-Tuning the Recommendation System and selecting the approach for recommendation

In user-based collaborative filtering (UBCF), the idea is this: given a database of product ratings/reviews and the ID of the current user as input, identify other users (often called peer users) whose past preferences were similar to those of the current user. For every product p that the current user has not seen, a rating is predicted from the ratings given to p by the peer users. <br> 
We have observed that the dataset has over 24000 different users who have written reviews. With such a large pool, users with similar tastes/preferences are likely to exist, so user-based collaborative filtering can give good recommendations. Also, the RMSE (Root Mean Squared Error) for user-based filtering is lower than the RMSE for item-based filtering.