Recommender systems are changing from novelties used by a few e-commerce sites to serious business tools that are reshaping e-commerce. Many of the largest commerce websites already use recommender systems to help their customers find products to purchase. A recommender system learns from a customer and recommends the products that she will find most valuable among those available.

On any e-commerce site, visitors do a lot of exploring, clicking, and comparison shopping, yet few of those visits convert into sales. According to E-consultancy’s latest survey, a visitor spends 4-5 minutes on average browsing an e-commerce site, but the online conversion rate still remains at 1.4%. To improve online conversion and sales, large e-commerce companies such as Amazon and Walmart personalize their website content and create offers targeted at specific customer segments.

Fortunately, this can be done using clickstream data analysis. Clickstream data can tell an e-commerce site owner what products the customer has been browsing, the product categories the visitor is exploring, and how prices, ratings, and other relevant information are influencing buying decisions.

In [1]:
import pandas as pd
import numpy as np

import datetime 
import time

%matplotlib inline
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
import random
import sklearn.utils

In [200]:
event_df = pd.read_csv('events.csv')
event_df.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


In [4]:
#Let's get all unique visitor ids as well
all_customers = event_df.visitorid.unique()
all_customers.size

1407580

So, we have a total of 1407580 unique visitors, but not all of them end up purchasing something. How many actually do?

In our dataset, a purchase is an event whose 'transactionid' is not null, so the buyers are the visitors with at least one such row.

In [5]:
#Let's get all the customers who bought something
customer_purchased = event_df[event_df.transactionid.notnull()].visitorid.unique()
customer_purchased.size

11719

Out of 1407580 visitors, only 11719 ended up buying. So, the conversion rate is around 0.83%.
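
The percentage quoted above follows directly from the two counts. A minimal check of the arithmetic, using only the numbers printed by the previous cells (the variable names are illustrative):

```python
# Visitor counts reported by the cells above.
total_visitors = 1407580
purchasing_visitors = 11719

# Share of unique visitors with at least one transaction, as a percentage.
conversion_rate = 100 * purchasing_visitors / total_visitors
print(f"Conversion rate: {conversion_rate:.2f}%")  # Conversion rate: 0.83%
```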

In [6]:
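# Use a set for membership tests; scanning the NumPy array once per visitor is very slow
purchased_set = set(customer_purchased)
customer_browsed = [x for x in all_customers if x not in purchased_set]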
len(customer_browsed)

1395861

The remaining 1395861 visitors left without buying.
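
The browse-only group is simply the set difference between all visitors and the buyers. As a sketch of an alternative to the list comprehension, NumPy's `np.setdiff1d` computes that difference in one call; the ids below are made up, and note that the result comes back sorted rather than in the original order:

```python
import numpy as np

# Toy stand-ins for all_customers and customer_purchased (values are made up).
all_ids = np.array([10, 11, 12, 13, 14, 15])
purchased_ids = np.array([11, 14])

# Unique ids present in all_ids but not in purchased_ids, returned sorted.
browsed_ids = np.setdiff1d(all_ids, purchased_ids)
print(browsed_ids.tolist())  # [10, 12, 13, 15]
```

Calling `.tolist()` on the result gives a plain list, which is what `random.sample` expects later on.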

Let's now see the different kinds of events that are possible:

In [7]:
event_df.event.unique()

array(['view', 'addtocart', 'transaction'], dtype=object)

In [8]:
event_df.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


As 'visitorid' is the only information we have about the user behind a purchase or any other event, we will treat each visitorid as a unique user.

### Intent Prediction

A user’s actions on a system can be representative of a certain intent. The ability to learn this intent from the user’s actions can provide insight into how users behave on the system.

Virtually all e-commerce systems can be thought of as generators of clickstream data: a log of {item, userid, action} tuples that captures user interactions with the system. A chronological set of these tuples grouped by user ID is commonly known as a session. It is intuitively appealing to think that the patterns encoded in clickstream data can be parsed and understood by an LSTM to improve predictive performance.
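
To make the notion of a session concrete, here is a minimal sketch that turns a flat event log into one chronological event list per visitor. The toy rows are made up; only the column names match the dataset.

```python
import pandas as pd

# Toy clickstream log; the ids and timestamps are made up for illustration.
toy_events = pd.DataFrame({
    "timestamp": [3, 1, 2, 5, 4],
    "visitorid": [7, 7, 7, 9, 9],
    "event": ["transaction", "view", "addtocart", "view", "view"],
})

# Sort chronologically, then collect each visitor's events into one list.
sessions = (
    toy_events.sort_values("timestamp")
    .groupby("visitorid")["event"]
    .apply(list)
)
print(sessions.to_dict())
# {7: ['view', 'addtocart', 'transaction'], 9: ['view', 'view']}
```

Each value in `sessions` is the kind of ordered sequence that a sequence model such as an LSTM would consume.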

Let's see the entire journey of 5 customers who ended up buying:

In [9]:
for customer in customer_purchased[:5]:
    path = event_df[event_df.visitorid == customer].sort_values('timestamp').event.tolist()
    print(customer)
    print(*path, sep = "->")
    print(" ")

599528
view->addtocart->transaction->view->view->view->view->view->view->view->view->view->view->view->view->view->view
 
121688
addtocart->view->view->addtocart->addtocart->addtocart->view->view->view->view->addtocart->addtocart->view->view->addtocart->view->addtocart->addtocart->view->view->addtocart->view->view->view->addtocart->view->transaction->transaction->transaction->transaction->transaction->transaction->transaction->transaction->transaction->transaction->transaction
 
552148
view->addtocart->transaction
 
102019
view->addtocart->view->view->view->view->view->transaction->transaction
 
189384
view->view->view->view->view->view->view->addtocart->addtocart->transaction->transaction->view->view->view->view->view->view->view->view->view->view->view->view->view->view->view->view->view->view
 


For simplicity, let's map the events to numeric codes as follows:
* view : 1
* addtocart : 2
* transaction : 3

In [10]:
event_df.event = event_df.event.map({'view' :'1', 'addtocart':'2', 'transaction':'3'})

In [11]:
event_df

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,1,355908,
1,1433224214164,992329,1,248676,
2,1433221999827,111016,1,318965,
3,1433221955914,483717,1,253185,
4,1433221337106,951259,1,367447,
5,1433224086234,972639,1,22556,
6,1433221923240,810725,1,443030,
7,1433223291897,794181,1,439202,
8,1433220899221,824915,1,428805,
9,1433221204592,339335,1,82389,


In [12]:
#for customer in customer_purchased[:5]:
 #   path = event_df[event_df.visitorid == customer].sort_values('timestamp').event.tolist()
  #  print(customer)
    #strng = "->".join(path)
   # print(path)
    #print(" ")

In [13]:
print('Total number of customers/visitors :',len(all_customers))
print('Total number of customers/visitors who ended up purchasing :',len(customer_purchased))
print('Total number of customers/visitors who just browsed :',len(customer_browsed))

Total number of customers/visitors : 1407580
Total number of customers/visitors who ended up purchasing : 11719
Total number of customers/visitors who just browsed : 1395861


Because the customers who just browsed heavily outnumber those who ended up purchasing, we will take a random sample of the browse-only customers, equal in number to the customers who ended up purchasing.

Also, we will only include in our training data those instances whose journey is at least 5 events long:

In [14]:
customer_purchased_refined = []
for customer in customer_purchased:
    path = event_df[event_df.visitorid == customer].sort_values('timestamp').event.tolist()
    if len(path) >= 5:
        customer_purchased_refined.append(customer)

In [15]:
customer_browsed_refined = []
for customer in customer_browsed:
    # Stop once we have three times as many browse-only customers as buyers
    if len(customer_browsed_refined) == 3*len(customer_purchased_refined):
        break
    path = event_df[event_df.visitorid == customer].sort_values('timestamp').event.tolist()
    if len(path) >= 5:
        customer_browsed_refined.append(customer)

In [16]:
print(len(customer_browsed), len(customer_browsed_refined))
print(len(customer_purchased), len(customer_purchased_refined))

1395861 22830
11719 7610
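
The two loops above filter the full event table once per customer, which can be slow on a log of this size. As a sketch of a vectorized alternative (not what the cells above ran), the journey length of every visitor can be computed in a single `groupby` pass. The toy data and the names `journey_lengths` and `long_journey_ids` are illustrative assumptions.

```python
import pandas as pd

# Toy log: visitor 1 has 5 events, visitor 2 has 2, visitor 3 has 6 (made-up ids).
toy_events = pd.DataFrame({
    "visitorid": [1] * 5 + [2] * 2 + [3] * 6,
    "event": ["view"] * 13,
})

min_events = 5  # same threshold as the loops above

# Number of logged events per visitor, computed in one pass.
journey_lengths = toy_events.groupby("visitorid").size()

# Visitor ids whose journey has at least min_events events.
long_journey_ids = journey_lengths[journey_lengths >= min_events].index.tolist()
print(long_journey_ids)  # [1, 3]
```

Intersecting `long_journey_ids` with the buyer and browse-only id lists would then give the two refined groups.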


The data was highly imbalanced, understandably so, because only a handful of visitors end up buying. So, for training we randomly choose as many browse-only customers as there are buyers with journeys of at least 5 events.

In [21]:
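# Note: this samples from customer_browsed, not customer_browsed_refined,
# so the sampled browse-only journeys may be shorter than 5 events
random_browsed = random.sample(customer_browsed,len(customer_purchased_refined))
sample = list(customer_purchased_refined) + list(random_browsed)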

In [44]:
X = np.empty((len(sample),),dtype=object)
y = np.zeros((len(sample),), dtype=int)

In [45]:
for i, customer in enumerate(sample):
    path = event_df[event_df.visitorid == customer].sort_values('timestamp').event.tolist()
    # A '3' (transaction) anywhere in the journey marks the customer as a buyer
    if '3' in path:
        y[i] = 1
    # Drop transaction events so the label does not leak into the input sequence
    X[i] = [int(x) for x in path if x != '3']

In [46]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

### LSTM

In [68]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence

In [69]:
max_click_length = 30
X_train = sequence.pad_sequences(x_train, maxlen=max_click_length)
X_test = sequence.pad_sequences(x_test, maxlen=max_click_length)
X_val = sequence.pad_sequences(x_val, maxlen=max_click_length)

In [67]:
X_train[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 2, 1])
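
The leading zeros in the output above come from `pad_sequences`, which by default pads short sequences on the left with 0 and, for sequences longer than `maxlen`, keeps only the last `maxlen` entries. The NumPy sketch below mimics that default behavior on made-up journeys; `pad_left` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np

def pad_left(seqs, maxlen):
    """Left-pad with 0 and keep the last `maxlen` items (hypothetical helper)."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        tail = seq[-maxlen:]              # drop the oldest events if too long
        if tail:
            out[row, -len(tail):] = tail  # right-align the kept events
    return out

padded = pad_left([[1, 1, 2], [1, 2, 1, 1, 1, 2, 1]], maxlen=5)
print(padded.tolist())  # [[0, 0, 1, 1, 2], [1, 1, 1, 2, 1]]
```

One consequence is that, with `max_click_length = 30`, journeys longer than 30 events are represented only by their most recent 30 events.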

In [52]:
embedding_vector_length = 32
model = Sequential()
# Vocabulary of 3 indices: 0 = padding, 1 = view, 2 = addtocart
model.add(Embedding(3, embedding_vector_length, input_length=max_click_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 30, 32)            96        
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 53,397
Trainable params: 53,397
Non-trainable params: 0
_________________________________________________________________
None
Train on 10654 samples, validate on 2283 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x20805fe6748>

In [53]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 96.45%


In [54]:
# Keep only the first 5 events of each test journey
x_test_5 = [journey[:5] for journey in x_test]

X_test_5 = sequence.pad_sequences(x_test_5, maxlen=max_click_length)

scores = model.evaluate(X_test_5, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [56]:
# Keep only the first 10 events of each test journey
x_test_10 = [journey[:10] for journey in x_test]

X_test_10 = sequence.pad_sequences(x_test_10, maxlen=max_click_length)

In [57]:
scores = model.evaluate(X_test_10, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 90.85%


In [58]:
# Keep only the first 15 events of each test journey
x_test_15 = [journey[:15] for journey in x_test]

X_test_15 = sequence.pad_sequences(x_test_15, maxlen=max_click_length)

In [59]:
scores = model.evaluate(X_test_15, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 96.45%
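
The three evaluations above repeat the same step: cut every test journey down to its first few events, then pad and score it. That step can be sketched as one small function; `truncate_journeys` is a hypothetical helper and the journeys are made up.

```python
def truncate_journeys(journeys, n_events):
    """Keep only the first n_events events of each journey (hypothetical helper)."""
    return [list(journey[:n_events]) for journey in journeys]

# Made-up journeys using the 1 = view, 2 = addtocart encoding.
toy_journeys = [[1, 1, 2, 1, 1, 1, 2], [1, 2], [1, 1, 1, 1, 1, 1]]

for n in (5, 10, 15):
    prefixes = truncate_journeys(toy_journeys, n)
    print(n, [len(p) for p in prefixes])
# 5 [5, 2, 5]
# 10 [7, 2, 6]
# 15 [7, 2, 6]
```

A journey that is already shorter than the cutoff is left unchanged, so raising the cutoff only affects the longer journeys.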


### Second Approach

In the first approach, we saw an accuracy of about 96%. In our second approach, we discard the "addtocart" (add-to-basket) event that is present in the Retail Rocket dataset. Because it is so closely correlated with the buy event (users add items to a basket before purchasing that basket), it renders the buyer prediction task trivial, and an AUC of 0.97 is easily achievable for both our RNN and GBM models.
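
The AUC mentioned above can be read as the probability that a randomly chosen buyer receives a higher predicted score than a randomly chosen non-buyer. The pure-Python sketch below computes it directly from that definition on made-up labels and scores; `pairwise_auc` is an illustrative helper, and in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (buyer, non-buyer) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [1, 0, 1, 0, 0]             # 1 = bought, 0 = browsed only (made up)
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.1]  # hypothetical model probabilities
print(round(pairwise_auc(toy_labels, toy_scores), 3))  # 0.833
```

Unlike accuracy, this measure does not depend on the 0.5 decision threshold, which makes it a useful companion metric when classes are resampled.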

In [72]:
event_df.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


In [159]:
event_df_reformed = event_df[['timestamp','visitorid','event','itemid']]

In [160]:
event_df_reformed.head()

Unnamed: 0,timestamp,visitorid,event,itemid
0,1433221332117,257597,view,355908
1,1433224214164,992329,view,248676
2,1433221999827,111016,view,318965
3,1433221955914,483717,view,253185
4,1433221337106,951259,view,367447


In [161]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
event_df_reformed = event_df_reformed.assign(event=event_df_reformed.event.map({'view' :'1', 'addtocart':'2', 'transaction':'3'}))


As explained at the start of this approach, we now drop the add-to-cart events from the dataset:

In [162]:
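# Events were already mapped to codes, so add-to-cart rows are '2'
# (comparing against 'addtocart' would match nothing and drop no rows)
event_df_reformed = event_df_reformed[event_df_reformed.event != '2']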

In [163]:
event_df_reformed['item_id'] = event_df_reformed['itemid'].astype('category').cat.codes

item_lookup = event_df_reformed[['item_id', 'itemid']].drop_duplicates()

In [164]:
event_df_reformed.drop(['itemid'],axis=1,inplace=True)

In [134]:
item_lookup.head()

Unnamed: 0,item_id,itemid
0,179162,355908
1,125138,248676
2,160499,318965
3,127437,253185
4,184985,367447


In [148]:
random_browsed = random.sample(customer_browsed,len(customer_purchased))
sample = list(customer_purchased) + list(random_browsed)

In [149]:
item_lookup.shape[0]

234844
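
One detail worth keeping in mind: category codes start at 0, and `pad_sequences` also pads with 0, so the item coded 0 cannot be told apart from padding. A possible workaround, shown here only as a sketch and not what the cells in this notebook do, is to shift every code up by one and size the embedding as the number of items plus one. The item ids below are made up.

```python
import pandas as pd

# Made-up raw item ids standing in for the itemid column.
raw_item_ids = pd.Series([355908, 248676, 355908, 318965])

# Category codes start at 0; adding 1 reserves index 0 for padding.
item_index = raw_item_ids.astype("category").cat.codes.astype("int64") + 1
n_items = raw_item_ids.nunique()
embedding_input_dim = n_items + 1  # +1 for the padding index

print(item_index.tolist())   # [3, 1, 3, 2]
print(embedding_input_dim)   # 4
```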

In [150]:
X = np.empty((len(sample),),dtype=object)
y = np.zeros((len(sample),), dtype=int)

In [151]:
for i, customer in enumerate(sample):
    # Sort this customer's events once and reuse the result for both columns
    journey = event_df_reformed[event_df_reformed.visitorid == customer].sort_values('timestamp')
    path = journey.event.tolist()
    item = journey.item_id.tolist()
    # A '3' (transaction) anywhere in the journey marks the customer as a buyer
    if '3' in path:
        y[i] = 1
    # Keep only the items of non-transaction events
    X[i] = [int(b) for a, b in zip(path, item) if a != '3']

In [152]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

In [153]:
max_click_length = 30
X_train = sequence.pad_sequences(x_train, maxlen=max_click_length)
X_test = sequence.pad_sequences(x_test, maxlen=max_click_length)
X_val = sequence.pad_sequences(x_val, maxlen=max_click_length)

In [154]:
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(234844, embedding_vector_length, input_length=max_click_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 30, 32)            7515008   
_________________________________________________________________
lstm_10 (LSTM)               (None, 100)               53200     
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 101       
Total params: 7,568,309
Trainable params: 7,568,309
Non-trainable params: 0
_________________________________________________________________
None
Train on 16406 samples, validate on 3516 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x208519ab9b0>

In [155]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 74.74%


In [156]:
# Keep only the first 5 items of each test journey
x_test_5 = [journey[:5] for journey in x_test]

X_test_5 = sequence.pad_sequences(x_test_5, maxlen=max_click_length)

scores = model.evaluate(X_test_5, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 74.20%


In [158]:
# Keep only the first 10 items of each test journey
x_test_10 = [journey[:10] for journey in x_test]

X_test_10 = sequence.pad_sequences(x_test_10, maxlen=max_click_length)

scores = model.evaluate(X_test_10, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 74.49%
