# **DATA ANALYSIS FOR RECOMMENDATION SYSTEM (RETAIL ROCKET)**

----------------------------------------------------------------------

### **REPORT REQUIREMENTS**

>**STEPS:**
* Background and objective
* Dataset
* Answers to the research questions
* Limitations
* Reproducibility package
* Take-away messages and potential implications

-----------------------------
>**RECOMMENDATION SYSTEM**
* A recommendation system learns which items might be of interest to a
user and then recommends those items for buying, renting, listening, watching, and so on.
Recommendation systems are broadly classified into the following categories:
 * Content-based filtering
 * Collaborative filtering
 * Hybrid systems
 * Matrix factorization


--------------------------------------
>**DESCRIPTION**
* The data contains values collected from an e-commerce website and has been
anonymized to protect the privacy of the users.
* Since the data contains user-item interactions rather than explicit rankings of items by
users, it falls under the category of implicit feedback information.

# **IMPLEMENTATION**

### **REQUIRED LIBRARIES**

In [None]:
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns                       #visualisation
import warnings
import tensorflow
import random
# Import the Input, Embedding, and Flatten layers from the Keras library:
from keras.layers import Input, Embedding, Flatten
import keras
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
%matplotlib inline

In [None]:
print("Keras:{}".format(keras.__version__))

### **LOADING DATA**

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
category_tree= pd.read_csv('../input/ecommerce-dataset/category_tree.csv')

In [None]:
events = pd.read_csv('../input/ecommerce-dataset/events.csv')

In [None]:
item1=pd.read_csv('../input/ecommerce-dataset/item_properties_part1.csv')
item2=pd.read_csv('../input/ecommerce-dataset/item_properties_part2.csv')
items=pd.concat([item1,item2])

### **SHOW DATASET**

>The category tree data has two columns, categoryid and parentid, as shown here:

In [None]:
# This file contains the category tree
category_tree.head(4)

In [None]:
category_tree.tail(4)

------------------------------------

>The events data has five columns: timestamp, visitorid, event, itemid,
and transactionid, as shown here:

In [None]:
#This file contains the visitor-item interaction data
events.head(4) 

In [None]:
events.tail(4)

------------------------------

>The item data has four columns: timestamp, itemid, property, and value, as shown here:

In [None]:
# This file contains item properties
items.head()

In [None]:
items.tail()

------------------------------------------------------------------------------

# **PRE-PROCESSING**

### **EVENTS**

In [None]:
events

In [None]:
print('Unique counts:',events.nunique())

In [None]:
print('Kind of events:',events.event.unique())

>As we can see, there are three kinds of events: view, addtocart, and transaction.

In [None]:
# Encode the three event types as the integers 1 to 3:
events['event'] = events['event'].map({'view': 1,
                                       'addtocart': 2,
                                       'transaction': 3})

In [None]:
# Drop the transactionid and timestamp columns, which we don't need:
events.drop(['transactionid', 'timestamp'], axis=1, inplace=True)

In [None]:
#  Shuffle the dataset to get random data for training and test datasets:
events = events.reindex(np.random.permutation(events.index))

In [None]:
#  Split the data in train, valid, and test sets, as follows:
split_1 = int(0.8 * len(events))
split_2 = int(0.9 * len(events))
train = events[:split_1]
valid = events[split_1:split_2]
test = events[split_2:]
print(train.head())
print(valid.head())
print(test.head())

### **The matrix factorization model** 

>Matrix factorization is a popular algorithm for implementing recommendation systems and
falls into the collaborative filtering category of algorithms.
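
To make the idea concrete, here is a minimal NumPy sketch, separate from the Keras model built in this notebook. It assumes hypothetical toy sizes and random latent vectors (in a real model they are learned), and shows that the predicted score for a visitor-item pair is the dot product of their latent vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 visitors, 6 items, 2 latent factors.
n_toy_visitors, n_toy_items, n_toy_factors = 4, 6, 2

# One latent vector per visitor and per item (random here; learned in practice).
visitor_factors = rng.normal(size=(n_toy_visitors, n_toy_factors))
item_factors = rng.normal(size=(n_toy_items, n_toy_factors))

# The predicted score for one (visitor, item) pair is the dot product of their vectors.
visitor, item = 1, 3
score = visitor_factors[visitor] @ item_factors[item]

# All pairwise scores at once: a (visitors x items) matrix of rank <= n_toy_factors.
all_scores = visitor_factors @ item_factors.T
```

The Keras model in the next cells follows the same pattern: the two `Embedding` layers play the role of `visitor_factors` and `item_factors`, and the dot-product layer computes `score`.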

In [None]:
#  Store the number of visitors and items in a variable, as follows:
n_visitors = events.visitorid.nunique()
n_items = events.itemid.nunique()
print(n_visitors)
print(n_items)
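
A caveat worth checking before training: a Keras `Embedding` layer with `input_dim=n` only accepts integer indices in the range 0 to n-1. If the raw `visitorid` or `itemid` values are not contiguous, the largest id can exceed the number of unique ids. The sketch below uses hypothetical toy data (`toy_events` and the `*_idx` columns are illustrative names, not part of the dataset) to show one way of remapping raw ids to contiguous indices with `pd.factorize`.

```python
import pandas as pd

# Hypothetical toy events with sparse, non-contiguous raw ids.
toy_events = pd.DataFrame({
    'visitorid': [1001, 42, 1001, 900000],
    'itemid': [466000, 17, 17, 355],
})

# pd.factorize assigns each distinct id a code in 0..n_unique-1,
# in order of first appearance, and returns the distinct ids as well.
toy_events['visitor_idx'], visitor_ids = pd.factorize(toy_events['visitorid'])
toy_events['item_idx'], item_ids = pd.factorize(toy_events['itemid'])

n_toy_visitors = len(visitor_ids)
n_toy_items = len(item_ids)

# Every index is now a valid row of an embedding table with n_unique rows.
print(toy_events[['visitor_idx', 'item_idx']])
```

With such a remapping, the `*_idx` columns rather than the raw ids would be passed to the model, and `visitor_ids[k]` recovers the original id for index `k`.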

In [None]:
# Set the number of latent factors for embedding to 5. You may want to try
# different values to see the impact on the model training:
n_latent_factors = 5

In [None]:
# Start with the items – create an input layer for them as follows:
item_input = Input(shape=[1],name='Items')

In [None]:
# Create an Embedding representation layer and then flatten the Embedding layer
# to get the output in the number of latent dimensions that we set earlier:
item_embed = Embedding(n_items + 1,
 n_latent_factors,
 name='ItemsEmbedding')(item_input)
item_vec = Flatten(name='ItemsFlatten')(item_embed)

In [None]:
#  create the vector space representation for the visitors:
visitor_input = Input(shape=[1],name='Visitors')
visitor_embed = Embedding(n_visitors + 1,
 n_latent_factors,
 name='VisitorsEmbedding')(visitor_input)
visitor_vec = Flatten(name='VisitorsFlatten')(visitor_embed)

In [None]:
#  Create a layer for the dot product of both vector space representations:
dot_prod = keras.layers.dot([item_vec, visitor_vec],axes=[1,1],
 name='DotProduct')

In [None]:
# Build the Keras model from the input layers, and the dot product layer as the
# output layer, and compile it as follows:
model = keras.Model([item_input, visitor_input], dot_prod)
model.compile('adam', 'mse')
model.summary()


In [None]:
# The model can also be drawn as a graph. plot_model needs the pydot package
# and Graphviz to be installed:
from IPython import display
import tensorflow
tensorflow.keras.utils.plot_model(model,
 to_file='model.png',
 show_shapes=True,
 show_layer_names=True)
display.display(display.Image('model.png'))

In [None]:
train.visitorid

In [None]:
train.itemid

In [None]:
train.event

In [None]:
# Now let's train and evaluate the model. The inputs must follow the order of
# the model's input layers, [item_input, visitor_input]:
# model.fit([train.itemid, train.visitorid], train.event, epochs=50, verbose=0)

In [None]:
# score = model.evaluate([test.itemid, test.visitorid], test.event)
# print('mean squared error:', score)

### **The neural network model for Retailrocket**

In [None]:
# In this model, we set two different variables for latent factors for users and items but set
# both of them to 5. The reader is welcome to experiment with different values of latent
# factors:
n_lf_visitor = 5
n_lf_item = 5

In [None]:
# Build the item and visitor embeddings and vector space representations the same
# way as before, using the item and visitor latent-factor sizes respectively:
item_input = Input(shape=[1], name='Items')
item_embed = Embedding(n_items + 1,
                       n_lf_item,
                       name='ItemsEmbedding')(item_input)
item_vec = Flatten(name='ItemsFlatten')(item_embed)
visitor_input = Input(shape=[1], name='Visitors')
visitor_embed = Embedding(n_visitors + 1,
                          n_lf_visitor,
                          name='VisitorsEmbedding')(visitor_input)
visitor_vec = Flatten(name='VisitorsFlatten')(visitor_embed)


In [None]:
# Instead of creating a dot product layer, we concatenate the item and visitor
# representations, and then apply fully connected layers to get the
# recommendation output:
from keras.layers import Activation, Dense
concat = keras.layers.concatenate([item_vec, visitor_vec],
                                  name='Concat')
fc_1 = Dense(80,name='FC-1')(concat)
fc_2 = Dense(40,name='FC-2')(fc_1)
fc_3 = Dense(20,name='FC-3', activation='relu')(fc_2)
output = Dense(1, activation='relu',name='Output')(fc_3)


In [None]:
#  Define and compile the model as follows:
optimizer = tensorflow.keras.optimizers.Adam(learning_rate=0.001)
model = keras.Model([item_input, visitor_input], output)
model.compile(optimizer=optimizer,loss= 'mse')

In [None]:
# Train and evaluate the model. The inputs must follow the order of
# the model's input layers, [item_input, visitor_input]:
# model.fit([train.itemid, train.visitorid], train.event, epochs=20)

In [None]:
# score = model.evaluate([test.itemid, test.visitorid], test.event)
# print('mean squared error:', score)

----------------------

# **EDA**

In [None]:
events = pd.read_csv('../input/ecommerce-dataset/events.csv')

In [None]:
events.head(2)

In [None]:
# Looking at events of the first visitor 

print(events['event'].value_counts())
events.loc[events['visitorid']==257597].sort_values('timestamp')

In [None]:
people_bought = events.loc[events['event']=='transaction', 'visitorid'].unique()

In [None]:
people_bought

In [None]:
people_browsed = events.loc[~(events['visitorid'].isin(people_bought)), 'visitorid'].unique()

In [None]:
people_browsed

In [None]:
# Since the first visitor didn't buy anything, we'll refine our search to those visitors who bought something

people_bought = events.loc[events['event']=='transaction', 'visitorid'].unique().tolist()
people_browsed = events.loc[~(events['visitorid'].isin(people_bought)), 'visitorid'].unique().tolist()

print('Number of people who bought something: ', len(people_bought))
print('Sample of people_bought:', people_bought[:10])

print('\nNumber of people who just browsed: ', len(people_browsed))
print('Sample of people_browsed: ', people_browsed[:10])

In [None]:
all_bought_events = events.loc[events['visitorid'].isin(people_bought)].sort_values('timestamp').reset_index(drop=True)
all_browsed_events = events.loc[events['visitorid'].isin(people_browsed)].sort_values('timestamp').reset_index(drop=True)

In [None]:
all_bought_events 

In [None]:
all_browsed_events

In [None]:
# Now let's look at events of a person who bought something
all_bought_events.loc[all_bought_events['visitorid']==people_bought[0]]
# all_bought_events.loc[all_bought_events['visitorid']==people_bought[1]]
# all_bought_events.loc[all_bought_events['visitorid']==people_bought[2]]
# all_bought_events.loc[all_bought_events['visitorid']==people_bought[3]]

>Since the pattern of events leading to a transaction varies, let's look at the relation between the different events.

Some quantities we want to look at:
1. How many times products were viewed
2. How many products were viewed
3. How many products were added to cart
4. How many products were transacted
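
The four quantities listed above can be computed with a few pandas filters. The sketch below runs on hypothetical toy data with the same `visitorid`, `event`, and `itemid` columns as `events.csv`, and it assumes that items 2 to 4 refer to counts of distinct items; the dictionary keys are illustrative names.

```python
import pandas as pd

# Hypothetical toy events with the same columns as events.csv.
toy_events = pd.DataFrame({
    'visitorid': [1, 1, 1, 2, 2, 2],
    'event': ['view', 'view', 'addtocart', 'view', 'addtocart', 'transaction'],
    'itemid': [10, 10, 10, 20, 20, 20],
})

views = toy_events[toy_events['event'] == 'view']
carted = toy_events.loc[toy_events['event'] == 'addtocart', 'itemid']
bought = toy_events.loc[toy_events['event'] == 'transaction', 'itemid']

summary = {
    'view_events': len(views),                   # 1. total number of view events
    'items_viewed': views['itemid'].nunique(),   # 2. distinct items viewed
    'items_added_to_cart': carted.nunique(),     # 3. distinct items added to cart
    'items_transacted': bought.nunique(),        # 4. distinct items transacted
}
print(summary)
```

Replacing `toy_events` with the full `events` frame gives the corresponding figures for the whole dataset.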

-------------------------

## **LIGHTFM**

In [None]:
import scipy.sparse as sp
from scipy.sparse import vstack
from scipy import sparse
from scipy.sparse.linalg import spsolve
import pandas as pd



from subprocess import check_output
from sklearn.model_selection import train_test_split

In [None]:
events = pd.read_csv('../input/ecommerce-dataset/events.csv')

In [None]:
user_activity_count = dict()
for row in events.itertuples():
    if row.visitorid not in user_activity_count:
        user_activity_count[row.visitorid] = {'view': 0, 'addtocart': 0, 'transaction': 0}
    if row.event == 'addtocart':
        user_activity_count[row.visitorid]['addtocart'] += 1 
    elif row.event == 'transaction':
        user_activity_count[row.visitorid]['transaction'] += 1
    elif row.event == 'view':
        user_activity_count[row.visitorid]['view'] += 1 

d = pd.DataFrame(user_activity_count)
dataframe = d.transpose()
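
The explicit loop above can be slow on millions of rows. As a possible alternative, the same per-visitor counts can be obtained with `pd.crosstab`. The sketch below uses hypothetical toy data (`toy_events` and `toy_counts` are illustrative names) rather than the real events file.

```python
import pandas as pd

# Hypothetical toy events: visitor 1 views and adds to cart,
# visitor 2 views twice and buys, visitor 3 views once.
toy_events = pd.DataFrame({
    'visitorid': [1, 1, 2, 2, 2, 3],
    'event': ['view', 'addtocart', 'view', 'view', 'transaction', 'view'],
})

# Rows are visitors, columns are event types, cells are event counts.
toy_counts = pd.crosstab(toy_events['visitorid'], toy_events['event'])

# Make sure all three columns exist, in the same order as in the loop above.
toy_counts = toy_counts.reindex(columns=['view', 'addtocart', 'transaction'],
                                fill_value=0)
print(toy_counts)
```

The result has one row per visitor and the columns `view`, `addtocart`, and `transaction`, which is the layout the correlation heatmap expects.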

In [None]:
dataframe

In [None]:
corr = dataframe.corr()
plt.figure(figsize=(11,8))
sns.heatmap(corr, cmap="coolwarm",annot=True)
plt.show()

In [None]:
# Same counting as above, but per item instead of per visitor
item_activity_count = dict()
for row in events.itertuples():
    if row.itemid not in item_activity_count:
        item_activity_count[row.itemid] = {'view': 0, 'addtocart': 0, 'transaction': 0}
    if row.event == 'addtocart':
        item_activity_count[row.itemid]['addtocart'] += 1
    elif row.event == 'transaction':
        item_activity_count[row.itemid]['transaction'] += 1
    elif row.event == 'view':
        item_activity_count[row.itemid]['view'] += 1

d = pd.DataFrame(item_activity_count)
itemid_activity = d.transpose()

In [None]:
itemid_activity

In [None]:
corr = itemid_activity.corr()
plt.figure(figsize=(11,8))
sns.heatmap(corr, cmap="coolwarm",annot=True)
plt.show()

In [None]:
# Total number of events (views + add-to-carts + transactions) for each user
dataframe['activity'] = dataframe['view'] + dataframe['addtocart'] + dataframe['transaction']
dataframe

In [None]:
# Remove users with only a single event
cleaned_data = dataframe[dataframe['activity']!=1]
# all_users contains the ids of users with more than one event (about 4 lakh, i.e. 400,000)
all_users = set(cleaned_data.index.values)
all_items = set(events['itemid'])

In [None]:
cleaned_data

In [None]:
# TODO: items that were viewed only once still need to be removed
visitorid_to_index_mapping  = {}
itemid_to_index_mapping  = {}
vid = 0
iid = 0
for row in events.itertuples():
    if row.visitorid in all_users and row.visitorid not in visitorid_to_index_mapping:
        visitorid_to_index_mapping[row.visitorid] = vid
        vid = vid + 1

    if row.itemid in all_items and row.itemid not in itemid_to_index_mapping:
        itemid_to_index_mapping[row.itemid] = iid
        iid = iid + 1

In [None]:
n_users = len(all_users)
n_items = len(all_items)
user_to_item_matrix = sp.dok_matrix((n_users, n_items), dtype=np.int8)
# Open question: should the frequency of view, addtocart and transaction events be added up?
# Currently only a single value, the strongest event, is stored for each (visitor, item) cell.
action_weights = [1,2,3]

for row in events.itertuples():
    if row.visitorid not in all_users:
        continue
    
    
    mapped_visitor_id = visitorid_to_index_mapping[row.visitorid]
    mapped_item_id    = itemid_to_index_mapping[row.itemid]
    
    value = 0
    if row.event == 'view':
        value = action_weights[0]
    elif row.event == 'addtocart':
        value = action_weights[1]        
    elif row.event == 'transaction':
        value = action_weights[2]
        
    current_value = user_to_item_matrix[mapped_visitor_id, mapped_item_id]
    if value>current_value:
        user_to_item_matrix[mapped_visitor_id, mapped_item_id] = value
        
user_to_item_matrix = user_to_item_matrix.tocsr()
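
Filling a `dok_matrix` cell by cell works, but it can take a long time on the full event log. The sketch below shows a possible vectorized construction of the same kind of matrix on hypothetical toy data: each (visitor, item) cell keeps the largest weight among its events. It does not filter out single-event users, and all `toy_*` names are illustrative.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Hypothetical toy events.
toy_events = pd.DataFrame({
    'visitorid': [7, 7, 7, 9, 9],
    'itemid': [100, 100, 200, 100, 300],
    'event': ['view', 'transaction', 'view', 'addtocart', 'view'],
})

# Same weights as action_weights above: view=1, addtocart=2, transaction=3.
weights = {'view': 1, 'addtocart': 2, 'transaction': 3}
toy_events['weight'] = toy_events['event'].map(weights)

# Contiguous row/column indices, assigned in order of first appearance.
toy_events['row'], toy_users = pd.factorize(toy_events['visitorid'])
toy_events['col'], toy_items = pd.factorize(toy_events['itemid'])

# Keep the strongest interaction per (visitor, item) pair.
strongest = toy_events.groupby(['row', 'col'], as_index=False)['weight'].max()

toy_matrix = sp.csr_matrix(
    (strongest['weight'].to_numpy(dtype=np.int8),
     (strongest['row'].to_numpy(), strongest['col'].to_numpy())),
    shape=(len(toy_users), len(toy_items)),
)
print(toy_matrix.toarray())
```

Here visitor 7 both viewed and bought item 100, so that cell holds 3, the transaction weight.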

In [None]:
user_to_item_matrix

In [None]:
user_to_item_matrix.shape

### **Construct item X property matrix**
Remove items for which there are no events associated

Add items which are present in all_items but do NOT have any associated property

Provide the new itemId to these items

In [None]:
filtered_items = items[items.itemid.isin(all_items)]

In [None]:
# adding a fake property to filtered items, which do not have any property

fake_itemid = []
fake_timestamp = []
fake_property = []
fake_value = []
all_items_with_property = set(items.itemid)
for itx in list(all_items):
    if itx not in all_items_with_property:
        fake_itemid.insert(0, itx)
        fake_timestamp.insert(0, 0)
        fake_property.insert(0, 888)
        fake_value.insert(0, 0)
    
fake_property_dict = {'itemid':fake_itemid, 'timestamp':fake_timestamp, 'property':fake_property,
                     'value':fake_value}

fake_df = pd.DataFrame(fake_property_dict, columns=filtered_items.columns.values)
filtered_items = pd.concat([filtered_items, fake_df])

In [None]:
filtered_items['itemid'] = filtered_items['itemid'].apply(lambda x: itemid_to_index_mapping[x])

In [None]:
filtered_items = filtered_items.sort_values('timestamp', ascending=False).drop_duplicates(['itemid','property'])
filtered_items.sort_values(by='itemid', inplace=True)
item_to_property_matrix = filtered_items.pivot(index='itemid', columns='property', values='value')

In [None]:
item_to_property_matrix.shape

### **Filtering properties**
After analysing the property values, I found that some properties take very few distinct values while others take a great many. Since little information is given about what these properties mean, I decided to keep only the properties with fewer than 50 distinct values, and then one-hot encoded them.

In [None]:
useful_cols = list()
cols = item_to_property_matrix.columns
for col in cols:
    value = len(item_to_property_matrix[col].value_counts())
    if value < 50:
        useful_cols.insert(0, col)

In [None]:
item_to_property_matrix = item_to_property_matrix[useful_cols]

In [None]:
item_to_property_matrix_one_hot_sparse = pd.get_dummies(item_to_property_matrix)
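
To illustrate what the filtering and encoding steps above produce, here is a small sketch on a hypothetical item-by-property table. The column names `colour` and `serial`, and the threshold of 3 standing in for 50, are assumptions made for the example.

```python
import pandas as pd

# Hypothetical item x property table; values are strings, None where missing.
toy_props = pd.DataFrame(
    {'colour': ['red', 'blue', 'red', None],
     'serial': ['a1', 'b2', 'c3', 'd4']},
    index=pd.Index([0, 1, 2, 3], name='itemid'),
)

max_distinct = 3  # stand-in for the threshold of 50 used above

# Keep only columns with fewer than max_distinct distinct (non-missing) values.
keep = [c for c in toy_props.columns if toy_props[c].nunique() < max_distinct]

# One indicator column per (property, value) pair; a missing value yields an all-zero row.
toy_one_hot = pd.get_dummies(toy_props[keep])
print(toy_one_hot)
```

Note that `pd.get_dummies` returns a dense DataFrame by default; the conversion to a SciPy sparse matrix happens in a later cell.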

### **Model**
>LightFM is used for the recommendation model because it provides a hybrid model implementation and is very fast. It combines matrix factorization, which reduces dimensionality, with item and user properties on top of collaborative filtering.
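
The hybrid idea can be sketched in plain NumPy and SciPy, without calling the LightFM API. In LightFM, an item's latent representation is the sum of the latent vectors of its features. The example below assumes hypothetical toy sizes and random embeddings (learned in a real model), and it leaves out the bias terms.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_toy_items, n_toy_features, n_toy_factors = 3, 4, 2

# Item feature matrix: an identity block (one indicator per item) plus one
# metadata feature. Items 0 and 1 share the metadata feature; item 2 does not.
identity = sp.identity(n_toy_items, format='csr')
metadata = sp.csr_matrix(np.array([[1.0], [1.0], [0.0]]))
item_features = sp.hstack([identity, metadata], format='csr')

# One latent vector per feature (random here; learned during training).
feature_embeddings = rng.normal(size=(n_toy_features, n_toy_factors))

# An item's latent vector is the sum of the embeddings of its active features.
item_vectors = item_features @ feature_embeddings

# A user's score for each item is the dot product with the user's latent vector.
user_vector = rng.normal(size=n_toy_factors)
scores = item_vectors @ user_vector
```

Because items 0 and 1 share a metadata feature, part of their representation is shared, which is how item properties can help when an item has few interactions.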

In [None]:
from lightfm import LightFM
import scipy.sparse as sp
from scipy.sparse import vstack

In [None]:
item_to_property_matrix_one_hot_sparse.shape

In [None]:
from scipy.sparse import csr_matrix
item_to_property_matrix_sparse = csr_matrix(item_to_property_matrix_one_hot_sparse.values)