# Visit to Session
## Introduction
A visit is a continuous sequence of events that a user generates in the logs, typically identified by a unique visit ID. For example, a view immediately followed by a purchase counts as a single visit, and both events are expected to share the same visit/session ID, because a user cannot purchase an item without viewing it first. A view->cart->purchase sequence of events can, however, be split across two visits, as long as the carted item is still accessible in the next visit.

If multiple visits are separated by short spans of time, it may be preferable to analyze their logs together. We therefore define a session as a group of visits that are analyzed together.

## Setup

In [1]:
import glob
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd 
import pyarrow.feather as feather
import seaborn as sns

%matplotlib inline

In [2]:
FEATHER_PATH = "../../data/2019-Oct.feather"
np.random.seed(101)

In [3]:
%%time
data = feather.read_feather(FEATHER_PATH)

CPU times: user 13.2 s, sys: 8.35 s, total: 21.6 s
Wall time: 46.4 s


## Filtering
We'll randomly sample a small subset of users in order to speed up analysis. 

In [4]:
print("Total number of users:",len(data['user_id'].unique()))

sampled_users = np.random.choice(data['user_id'].unique(), size=10000, replace=False)

Total number of users: 3022290


The following cell is optional: deleting it should not break the rest of the code, although everything would run much more slowly.

In [5]:
data = data[data['user_id'].isin(sampled_users)]
print("Total number of users:",len(data['user_id'].unique()))

Total number of users: 10000
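
The sample-then-filter step can be tried on a small frame first. The following is a minimal sketch, assuming a toy `events` table (hypothetical values) and NumPy's `default_rng` generator rather than the global `np.random.seed` used above, so it will not reproduce the same draw as the notebook.

```python
import numpy as np
import pandas as pd

# Toy event log with hypothetical values; only user_id matters here.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3, 4, 5],
    "event_type": ["view"] * 8,
})

rng = np.random.default_rng(101)  # local generator, independent of np.random.seed
unique_users = events["user_id"].unique()

# Draw 3 distinct users, then keep every event belonging to them.
sampled = rng.choice(unique_users, size=3, replace=False)
subset = events[events["user_id"].isin(sampled)]

print("Users kept:", subset["user_id"].nunique())
```

Sampling users rather than rows keeps each sampled user's full event history, which matters later when visits are chained into sessions.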


## Analyzing the visits
In order to stick to the definitions of the source material, let's rename the 'user_session' field to 'visit_id'.

In [6]:
data.rename(columns = {"user_session":"visit_id"}, inplace=True)

In [7]:
print("Average number of events logged per visit:", len(data)/len(data['visit_id'].unique()) )

Average number of events logged per visit: 4.495635660980811


In [8]:
print("Average number of visits per user:", len(data['visit_id'].unique())/len(data['user_id'].unique()) )

Average number of visits per user: 3.0016


In [9]:
print(
    "Maximum number of visits by a single user:", 
    data.groupby('user_id')['visit_id'].nunique().max() 
    )

Maximum number of visits by a single user: 73
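
The three figures above can also be computed with `nunique`, which avoids materializing the array of unique values. A small sketch on a toy `events` frame (hypothetical values, same column names as above):

```python
import pandas as pd

# Toy event log: 6 events, 4 visits, 3 users.
events = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "visit_id": ["a", "a", "b", "c", "c", "d"],
})

n_visits = events["visit_id"].nunique()
n_users = events["user_id"].nunique()

events_per_visit = len(events) / n_visits   # 6 events / 4 visits
visits_per_user = n_visits / n_users        # 4 visits / 3 users
max_visits = events.groupby("user_id")["visit_id"].nunique().max()

print(events_per_visit, visits_per_user, max_visits)
```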


In [10]:
data.head()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,visit_id
17,2019-10-01 00:00:18 UTC,view,10900029,2053013555069845885,appliances.kitchen.mixer,bosch,58.95,519528062,901b9e3c-3f8f-4147-a442-c25d5c5ed332
224,2019-10-01 00:03:07 UTC,view,1004874,2053013555631882655,electronics.smartphone,samsung,383.51,519528062,901b9e3c-3f8f-4147-a442-c25d5c5ed332
312,2019-10-01 00:04:43 UTC,view,1004870,2053013555631882655,electronics.smartphone,samsung,286.86,519528062,901b9e3c-3f8f-4147-a442-c25d5c5ed332
398,2019-10-01 00:06:20 UTC,view,1004874,2053013555631882655,electronics.smartphone,samsung,383.51,519528062,901b9e3c-3f8f-4147-a442-c25d5c5ed332
400,2019-10-01 00:06:25 UTC,view,10900029,2053013555069845885,appliances.kitchen.mixer,bosch,58.95,519528062,901b9e3c-3f8f-4147-a442-c25d5c5ed332


## Creating Sessions out of visits
### Creating Visits
Let's define a session to be a group of visits that are less than 30 minutes apart. First, let's get the start and end times of each visit.

In [11]:
grouped = data.groupby('visit_id')
visits = pd.DataFrame(
    data = [
        grouped['user_id'].min() # max also fine
        ,grouped['event_time'].min()
        ,grouped['event_time'].max()
    ]
    ,index = [
        "user_id"
        ,"start_time"
        ,"end_time"
    ]
).T
del grouped
visits.head()

visit_id,user_id,start_time,end_time
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,2019-10-01 05:29:03 UTC,2019-10-01 05:29:03 UTC
000bf4a0-654a-45a0-853f-eeae0df45d26,536331032,2019-10-06 17:05:12 UTC,2019-10-06 17:06:32 UTC
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,2019-10-31 08:53:21 UTC,2019-10-31 08:53:21 UTC
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,2019-10-06 20:04:02 UTC,2019-10-06 20:04:02 UTC
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,2019-10-13 19:55:15 UTC,2019-10-13 20:30:38 UTC
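
An alternative way to build the same per-visit table is pandas named aggregation, which keeps each column's dtype instead of producing object columns through the transpose. This is only a sketch on a toy `events` frame (hypothetical values), and it assumes `event_time` has already been parsed into datetimes.

```python
import pandas as pd

# Toy event log; event_time is parsed up front (an assumption of this sketch).
events = pd.DataFrame({
    "visit_id": ["a", "a", "b"],
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(
        ["2019-10-01 00:00:18", "2019-10-01 00:03:07", "2019-10-01 05:29:03"],
        utc=True,
    ),
})

# One row per visit: owning user, first event time, last event time.
visits_demo = events.groupby("visit_id").agg(
    user_id=("user_id", "first"),
    start_time=("event_time", "min"),
    end_time=("event_time", "max"),
)

print(visits_demo.dtypes)
```

With this form the start and end columns are already datetimes, so a later conversion step would not be needed.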


In [12]:
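# Parse the timestamp strings into timezone-aware datetimes.
visits['start_time'] = pd.to_datetime(visits['start_time'])
visits['end_time'] = pd.to_datetime(visits['end_time'])
# NOTE: .dt.seconds is only the seconds component (0-86399) and ignores whole days;
# .dt.total_seconds() would be needed for visits lasting 24 hours or more.
visits['duration'] = (visits['end_time'] - visits['start_time']).dt.seconds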
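
The difference between the `.dt.seconds` accessor used here and `.dt.total_seconds()` only shows up once a time difference reaches 24 hours. A short illustration with two hypothetical gaps:

```python
import pandas as pd

# Two hypothetical time differences: 80 seconds, and 1 day plus 5 seconds.
gaps = pd.Series([pd.Timedelta(seconds=80), pd.Timedelta(days=1, seconds=5)])

component = gaps.dt.seconds        # seconds component only, days are dropped
total = gaps.dt.total_seconds()    # full length of the difference in seconds

print(component.tolist())  # [80, 5]
print(total.tolist())      # [80.0, 86405.0]
```

For durations under a day the two agree, so the numbers in this notebook are only affected where a difference exceeds 24 hours.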

In [13]:
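print("Describing the approximate duration in minutes, we get:")
# The whole summary is divided by 60, so the 'count' row below is scaled too
# (it shows the number of visits / 60, not a duration).
round(visits['duration'].describe() / 60)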

Describing the approximate duration in minutes, we get:


count     500.0
mean        7.0
std        49.0
min         0.0
25%         0.0
50%         1.0
75%         4.0
max      1423.0
Name: duration, dtype: float64
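
Note that the `count` row above has been divided by 60 along with the duration statistics. One way to avoid that, sketched here on a toy series of durations (hypothetical values), is to convert only the duration rows and report the count separately:

```python
import pandas as pd

# Hypothetical durations in seconds.
durations = pd.Series([0, 60, 120, 600])

summary = durations.describe()
n = int(summary["count"])                        # number of observations, left unscaled
minutes = (summary.drop("count") / 60).round()   # duration statistics in minutes

print("Number of observations:", n)
print(minutes)
```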

In [14]:
print("Median dwell time of the user in seconds:",visits['duration'].describe().loc['50%'])

Median dwell time of the user in seconds: 63.0


### From visit to session

In [15]:
%%time

next_visit = {}
time_to_next_visit = {}

for index, row in visits.iterrows():
    # Candidate next visits: same user, starting at or after this visit ends
    # (and strictly after this visit starts).
    future_visits = visits[
        (visits['user_id'] == row['user_id'])
        & (visits['start_time'] >= row['end_time'])
        & (visits['start_time'] > row['start_time'])
    ]
    if len(future_visits) == 0:
        next_visit[index] = np.nan
        time_to_next_visit[index] = np.nan
    else:
        # NOTE: .dt.seconds / .seconds ignore whole days, so gaps of 24 hours or more wrap around.
        next_visit[index] = (future_visits['start_time'] - row['end_time']).dt.seconds.idxmin()
        time_to_next_visit[index] = (future_visits.loc[next_visit[index]]['start_time'] - row['end_time']).seconds

CPU times: user 1min 51s, sys: 590 ms, total: 1min 51s
Wall time: 1min 52s
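
The loop above scans the whole `visits` table once per row, which is why it takes minutes. A vectorized approximation is sketched below on a toy `visits_demo` frame (hypothetical values). It is not identical to the loop: it takes the next visit in start-time order for each user, so overlapping visits would yield a negative gap, and it uses `total_seconds()` so gaps longer than a day are not wrapped.

```python
import pandas as pd

# Toy per-visit table with the same columns as `visits` (hypothetical values).
visits_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "start_time": pd.to_datetime(
            ["2019-10-01 10:00", "2019-10-01 10:20", "2019-10-03 09:00", "2019-10-01 12:00"],
            utc=True,
        ),
        "end_time": pd.to_datetime(
            ["2019-10-01 10:05", "2019-10-01 10:30", "2019-10-03 09:10", "2019-10-01 12:01"],
            utc=True,
        ),
    },
    index=pd.Index(["v1", "v2", "v3", "v4"], name="visit_id"),
)

# Sort each user's visits chronologically, then look one row ahead within the user.
ordered = visits_demo.sort_values(["user_id", "start_time"]).reset_index()
ordered["next_visit_id"] = ordered.groupby("user_id")["visit_id"].shift(-1)
next_start = ordered.groupby("user_id")["start_time"].shift(-1)
ordered["time_to_next_visit"] = (next_start - ordered["end_time"]).dt.total_seconds()
ordered = ordered.set_index("visit_id")

print(ordered[["next_visit_id", "time_to_next_visit"]])
```

The last visit of each user has no successor, so both new columns are missing for it, mirroring the `np.nan` values assigned in the loop.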


In [16]:
visits['next_visit_id'] = visits.index.map(next_visit)
visits['time_to_next_visit'] = visits.index.map(time_to_next_visit)
visits.head()

visit_id,user_id,start_time,end_time,duration,next_visit_id,time_to_next_visit
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,0,865de13b-c0ec-4d9b-9ceb-92e7818dbe1f,1900.0
000bf4a0-654a-45a0-853f-eeae0df45d26,536331032,2019-10-06 17:05:12+00:00,2019-10-06 17:06:32+00:00,80,2a349a74-5a33-4bc5-9199-eb184e366389,1799.0
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,0,,
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,0,,
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,2123,97ce1822-fda7-4661-867f-30e4a0eb2bd6,40592.0


In [17]:
TIME_DIFF = 60 * 30 # 30 mins

In [18]:
# %%time

last_visit_in_sessions = {} # key: any visit id; value: the last visit id of the corresponding session

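for index, row in visits.iterrows():
    current_index = index
    current_row = row
    session = []  # visits chained together, starting from this one

    # Follow next-visit links while the gap stays under the threshold
    # (a NaN gap compares False, which ends the chain).
    while current_row['time_to_next_visit'] < TIME_DIFF:
        if current_index in last_visit_in_sessions:
            break  # the rest of this chain has already been resolved
        session.append(current_index)

        current_index = current_row['next_visit_id']
        current_row = visits.loc[current_index]

    session.append(current_index)
    # Map every visit in the chain to the last visit id of its session.
    last_visit_in_sessions.update(dict.fromkeys(session, last_visit_in_sessions.get(current_index, current_index)))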
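
The same grouping idea can be expressed without an explicit loop: flag each visit that starts a new session, then take a cumulative sum of the flags. The sketch below uses a toy `visits_demo` frame (hypothetical values) and labels sessions with a running integer `session_no` instead of the last visit id, so it is an approximation of the loop above rather than a drop-in replacement.

```python
import pandas as pd

TIME_DIFF = 60 * 30  # 30 minutes, in seconds

# Toy per-visit table (hypothetical values).
visits_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "start_time": pd.to_datetime(
            ["2019-10-01 10:00", "2019-10-01 10:20", "2019-10-01 12:00", "2019-10-01 10:10"],
            utc=True,
        ),
        "end_time": pd.to_datetime(
            ["2019-10-01 10:05", "2019-10-01 10:30", "2019-10-01 12:10", "2019-10-01 10:11"],
            utc=True,
        ),
    },
    index=pd.Index(["v1", "v2", "v3", "v4"], name="visit_id"),
)

ordered = visits_demo.sort_values(["user_id", "start_time"]).reset_index()

# Gap between this visit's start and the same user's previous visit's end.
prev_end = ordered.groupby("user_id")["end_time"].shift(1)
gap = (ordered["start_time"] - prev_end).dt.total_seconds()

# A visit opens a new session if it is the user's first visit or the gap reaches the threshold.
starts_session = gap.isna() | (gap >= TIME_DIFF)
ordered["session_no"] = starts_session.cumsum()

sessions_demo = ordered.groupby("session_no").agg(
    user_id=("user_id", "first"),
    start_time=("start_time", "min"),
    end_time=("end_time", "max"),
    visit_id_count=("visit_id", "count"),
)

print(sessions_demo)
```

Here visits `v1` and `v2` are 15 minutes apart and fall into one session, while `v3` starts 90 minutes after `v2` ends and opens a new one.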


In [19]:
visits['last_visit_in_session'] = visits.index.map(last_visit_in_sessions)
visits.head()


visit_id,user_id,start_time,end_time,duration,next_visit_id,time_to_next_visit,last_visit_in_session
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,0,865de13b-c0ec-4d9b-9ceb-92e7818dbe1f,1900.0,0004877e-8ef5-4ffd-90bf-d293b9973991
000bf4a0-654a-45a0-853f-eeae0df45d26,536331032,2019-10-06 17:05:12+00:00,2019-10-06 17:06:32+00:00,80,2a349a74-5a33-4bc5-9199-eb184e366389,1799.0,517ca520-068e-4855-ab69-489aed0b84dd
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,0,,,000eb398-bae9-43a2-a244-8753155ef1f1
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,0,,,00103afe-d940-42d8-be96-d1fdab2a4c53
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,2123,97ce1822-fda7-4661-867f-30e4a0eb2bd6,40592.0,00118c0a-8c6d-483c-84e5-1ca82e173513


In [20]:
grouped = visits.reset_index().groupby('last_visit_in_session')
sessions = pd.concat(
    [
        grouped['user_id'].min().to_frame() # max also fine
        ,grouped['start_time'].min().to_frame()
        ,grouped['end_time'].max().to_frame()
        ,grouped['visit_id'].count()
    ]
    ,axis=1
)
sessions.rename({"visit_id":"visit_id_count"}, axis="columns", inplace=True)
sessions['duration'] = (sessions['end_time'] - sessions['start_time']).dt.seconds
del grouped
sessions.head()

last_visit_in_session,user_id,start_time,end_time,visit_id_count,duration
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,1,0
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,1,0
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,1,0
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,1,2123
00140e7d-0932-41b0-833c-5d41c67f39a1,560827740,2019-10-28 12:37:38+00:00,2019-10-28 12:37:38+00:00,1,0


## Analyzing Sessions

In [21]:
print("Looking at the number of visits in each session, we get:")
sessions['visit_id_count'].describe()

Looking at the number of visits in each session, we get:


count    22942.000000
mean         1.308343
std          0.874666
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         35.000000
Name: visit_id_count, dtype: float64

In [22]:
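print("Describing the approximate duration in minutes, we get:")
# The whole summary is divided by 60, so the 'count' row below is scaled too
# (it shows the number of sessions / 60, not a duration).
round(sessions['duration'].describe() / 60)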

Describing the approximate duration in minutes, we get:


count     382.0
mean       11.0
std        56.0
min         0.0
25%         0.0
50%         2.0
75%         8.0
max      1423.0
Name: duration, dtype: float64

In [23]:
print("Median session time of the user in seconds:",sessions['duration'].describe().loc['50%'])

Median session time of the user in seconds: 116.0
