# Inter-session metrics
As with the intra-session metrics, studying inter-session metrics first requires grouping visits into sessions.
## Visit to Session


### Setup

In [1]:
import glob
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd 
import pyarrow.feather as feather
import seaborn as sns

%matplotlib inline

In [2]:
FEATHER_PATH = "../../data/2019-Oct.feather"
np.random.seed(101)

In [3]:
TIME_DIFF = 60 * 30  # 30 minutes, in seconds: the maximum gap between visits that belong to the same session

In [4]:
%%time
data = feather.read_feather(FEATHER_PATH)

CPU times: user 12.1 s, sys: 5.6 s, total: 17.7 s
Wall time: 25.5 s


### Filtering
We'll randomly sample a small subset of users in order to speed up analysis. 

In [5]:
print("Total number of users:",len(data['user_id'].unique()))

sampled_users = np.random.choice(data['user_id'].unique(), size=10000, replace=False)

Total number of users: 3022290


Deleting the following cell should not break the rest of the code, although things would run much slower.

In [6]:
data = data[data['user_id'].isin(sampled_users)]
print("Total number of users:",len(data['user_id'].unique()))

Total number of users: 10000
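
The sampling step can be tried on a small scale. The following is a minimal sketch on made-up data, assuming NumPy's `Generator` API is an acceptable substitute for the global `np.random.seed`; the `events` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy event log: 3 events for each of 6 users.
events = pd.DataFrame({
    "user_id": np.repeat(np.arange(1, 7), 3),
    "event_type": ["view"] * 18,
})

# A local generator keeps the sample reproducible without touching global NumPy state.
rng = np.random.default_rng(101)
user_ids = events["user_id"].unique()
sampled_users = rng.choice(user_ids, size=3, replace=False)

# Keep every event of each sampled user, so that their visits stay complete.
sample = events[events["user_id"].isin(sampled_users)]
print("Users in sample:", sample["user_id"].nunique())
```

Sampling users rather than rows matters here: a row-level sample would cut visits apart and distort every duration computed later.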


In order to stick to the definitions of the source material, let's rename the 'user_session' field to 'visit_id'.

In [7]:
data.rename(columns = {"user_session":"visit_id"}, inplace=True)

### Creating Sessions out of visits
#### Creating a visits table
Let's define a session as a group of consecutive visits by the same user in which each visit starts less than TIME_DIFF seconds after the previous one ends. First, let's get the start and end times of each visit.

In [8]:
grouped = data.groupby('visit_id')
visits = pd.DataFrame(
    data = [
        grouped['user_id'].min() # a visit belongs to one user, so max would also work
        ,grouped['event_type'].min() # alphabetically first event type seen in the visit
        ,grouped['event_time'].min()
        ,grouped['event_time'].max()
    ]
    ,index = [
        "user_id"
        ,"event_type"
        ,"start_time"
        ,"end_time"
    ]
).T
del grouped
visits.head()

visit_id,user_id,event_type,start_time,end_time
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,view,2019-10-01 05:29:03 UTC,2019-10-01 05:29:03 UTC
000bf4a0-654a-45a0-853f-eeae0df45d26,536331032,view,2019-10-06 17:05:12 UTC,2019-10-06 17:06:32 UTC
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,view,2019-10-31 08:53:21 UTC,2019-10-31 08:53:21 UTC
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,view,2019-10-06 20:04:02 UTC,2019-10-06 20:04:02 UTC
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,view,2019-10-13 19:55:15 UTC,2019-10-13 20:30:38 UTC
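
An alternative way to build the same kind of table is pandas named aggregation, which keeps each column's own dtype instead of going through a transpose. This is a sketch on hypothetical toy events, not the notebook's data; `visits_sketch` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy events; column names follow the notebook.
events = pd.DataFrame({
    "visit_id": ["a", "a", "b"],
    "user_id": [1, 1, 2],
    "event_type": ["view", "cart", "view"],
    "event_time": pd.to_datetime(
        ["2019-10-01 05:00:00", "2019-10-01 05:10:00", "2019-10-02 08:00:00"],
        utc=True,
    ),
})

# One row per visit: its user, one event type, and its first and last event time.
visits_sketch = events.groupby("visit_id").agg(
    user_id=("user_id", "first"),
    event_type=("event_type", "min"),  # alphabetical minimum, as in the cell above
    start_time=("event_time", "min"),
    end_time=("event_time", "max"),
)
print(visits_sketch.dtypes)
```

If the timestamps are already parsed before grouping, as assumed here, the start and end columns come out as datetimes and need no further conversion.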


In [9]:
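visits['start_time'] = pd.to_datetime(visits['start_time'])  # vectorized parse of the "... UTC" strings
visits['end_time'] = pd.to_datetime(visits['end_time'])
# .dt.seconds is the seconds component of each timedelta (0 to 86399), excluding whole days.
visits['duration'] = (visits['end_time'] - visits['start_time']).dt.seconds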
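
The difference between `.dt.seconds` and `.dt.total_seconds()` is easy to miss, so here is a small sketch of it. The two timestamp pairs are taken from session rows that appear further down in this notebook; everything else is illustrative.

```python
import pandas as pd

start = pd.Series(pd.to_datetime(
    ["2019-10-07 04:05:55", "2019-10-20 12:24:52"], utc=True))
end = pd.Series(pd.to_datetime(
    ["2019-10-14 04:57:03", "2019-10-20 12:25:03"], utc=True))
delta = end - start

# Seconds component only: whole days are not included.
seconds_component = delta.dt.seconds
# Full length of the interval in seconds, days included.
total_seconds = delta.dt.total_seconds()

print(pd.DataFrame({"seconds": seconds_component, "total_seconds": total_seconds}))
```

For the first pair, which spans seven days, the seconds component is 3068 while the full length is 607868 seconds. For intervals shorter than a day the two agree.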

#### From visits to sessions

In [10]:
%%time

next_visit = {}
time_to_next_visit = {}

for index, row in visits.iterrows():
    # Candidates: the same user's visits that start at or after the end of this visit.
    future_visits = visits[(visits['user_id']==row['user_id']) & (visits['start_time']>=row['end_time']) & (visits['start_time']>row['start_time'])]
    if len(future_visits) == 0:
        next_visit[index] = np.nan
        time_to_next_visit[index] = np.nan
    else:
        # NOTE: .seconds excludes whole days, so a gap of 1 day and 10 minutes counts as 600 s here.
        next_visit[index] = (future_visits['start_time']-row['end_time']).dt.seconds.idxmin()
        time_to_next_visit[index] = (future_visits.loc[next_visit[index]]['start_time']-row['end_time']).seconds

CPU times: user 1min 25s, sys: 103 ms, total: 1min 25s
Wall time: 1min 25s
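
The row-by-row search above can also be expressed with a sort and a per-user shift. The sketch below is not the notebook's method: it assumes that a user's visits do not overlap in time (the loop instead skips visits that start before the current one ends), and it measures gaps with `total_seconds()`, so its numbers can differ from the loop's for gaps longer than a day. The toy visits are made up.

```python
import pandas as pd

# Hypothetical toy visits, indexed by visit_id as in the notebook.
visits_sketch = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "start_time": pd.to_datetime(
            ["2019-10-01 10:00", "2019-10-01 10:20",
             "2019-10-03 09:00", "2019-10-01 12:00"], utc=True),
        "end_time": pd.to_datetime(
            ["2019-10-01 10:05", "2019-10-01 10:25",
             "2019-10-03 09:30", "2019-10-01 12:01"], utc=True),
    },
    index=pd.Index(["v1", "v2", "v3", "v4"], name="visit_id"),
)

# Order each user's visits chronologically, then look one row ahead per user.
ordered = visits_sketch.sort_values(["user_id", "start_time"]).reset_index()
per_user = ordered.groupby("user_id")
next_id = per_user["visit_id"].shift(-1)
next_start = per_user["start_time"].shift(-1)

ordered["next_visit_id"] = next_id
ordered["time_to_next_visit"] = (next_start - ordered["end_time"]).dt.total_seconds()
ordered = ordered.set_index("visit_id")
print(ordered[["user_id", "next_visit_id", "time_to_next_visit"]])
```

The last visit of each user has no successor, so both new columns are missing for it, just as in the loop.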


In [11]:
visits['next_visit_id'] = visits.index.map(next_visit)
visits['time_to_next_visit'] = visits.index.map(time_to_next_visit)
visits.head()

visit_id,user_id,event_type,start_time,end_time,duration,next_visit_id,time_to_next_visit
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,view,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,0,865de13b-c0ec-4d9b-9ceb-92e7818dbe1f,1900.0
000bf4a0-654a-45a0-853f-eeae0df45d26,536331032,view,2019-10-06 17:05:12+00:00,2019-10-06 17:06:32+00:00,80,2a349a74-5a33-4bc5-9199-eb184e366389,1799.0
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,view,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,0,,
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,view,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,0,,
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,view,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,2123,97ce1822-fda7-4661-867f-30e4a0eb2bd6,40592.0


In [12]:
# %%time

last_visit_in_sessions = {} # key: any visit id; value: the last visit id of the corresponding session

for index, row in visits.iterrows():
    current_index = index
    current_row = row
    session = []

    while current_row['time_to_next_visit'] < TIME_DIFF:
        if current_index in last_visit_in_sessions:
            break
        session.append(current_index)

        current_index = current_row['next_visit_id']
        current_row = visits.loc[current_index]

    session.append(current_index)
    last_visit_in_sessions.update(dict.fromkeys(session, last_visit_in_sessions.get(current_index, current_index)))


In [13]:
visits['last_visit_in_session'] = visits.index.map(last_visit_in_sessions)
visits.head()


visit_id,user_id,event_type,start_time,end_time,duration,next_visit_id,time_to_next_visit,last_visit_in_session
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,view,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,0,865de13b-c0ec-4d9b-9ceb-92e7818dbe1f,1900.0,0004877e-8ef5-4ffd-90bf-d293b9973991
000bf4a0-654a-45a0-853f-eeae0df45d26,536331032,view,2019-10-06 17:05:12+00:00,2019-10-06 17:06:32+00:00,80,2a349a74-5a33-4bc5-9199-eb184e366389,1799.0,517ca520-068e-4855-ab69-489aed0b84dd
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,view,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,0,,,000eb398-bae9-43a2-a244-8753155ef1f1
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,view,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,0,,,00103afe-d940-42d8-be96-d1fdab2a4c53
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,view,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,2123,97ce1822-fda7-4661-867f-30e4a0eb2bd6,40592.0,00118c0a-8c6d-483c-84e5-1ca82e173513
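
The chaining logic above, where visits are linked until a gap of TIME_DIFF seconds or more appears, can be sketched compactly with a cumulative sum over "new session" flags. This is an illustrative alternative on made-up visits, assuming each user's visits are sorted by start time and do not overlap; the column `session_number` is hypothetical.

```python
import pandas as pd

TIME_DIFF = 60 * 30  # same 30-minute threshold as the notebook

# Hypothetical toy visits, sorted by user and start time.
visits_sketch = pd.DataFrame({
    "visit_id": ["v1", "v2", "v3", "v4"],
    "user_id": [1, 1, 1, 2],
    "start_time": pd.to_datetime(
        ["2019-10-01 10:00", "2019-10-01 10:20",
         "2019-10-01 12:00", "2019-10-01 10:00"], utc=True),
    "end_time": pd.to_datetime(
        ["2019-10-01 10:05", "2019-10-01 10:25",
         "2019-10-01 12:10", "2019-10-01 10:01"], utc=True),
}).sort_values(["user_id", "start_time"]).reset_index(drop=True)

# Gap between a visit's start and the end of the same user's previous visit.
prev_end = visits_sketch.groupby("user_id")["end_time"].shift(1)
gap = (visits_sketch["start_time"] - prev_end).dt.total_seconds()

# A visit opens a new session if it is the user's first visit (gap is NaN)
# or if it starts TIME_DIFF seconds or more after the previous visit ended.
new_session = gap.isna() | (gap >= TIME_DIFF)
visits_sketch["session_number"] = new_session.cumsum()

# Label each session by its last visit, mirroring last_visit_in_session.
visits_sketch["last_visit_in_session"] = (
    visits_sketch.groupby("session_number")["visit_id"].transform("last")
)
print(visits_sketch[["visit_id", "session_number", "last_visit_in_session"]])
```

Here v1 and v2 are 15 minutes apart and share a session, while v3 starts 95 minutes after v2 ends and opens a new one.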


In [14]:
grouped = visits.reset_index().groupby('last_visit_in_session')
sessions = pd.concat(
    [
        grouped['user_id'].min().to_frame() # max also fine
        ,grouped['start_time'].min().to_frame()
        ,grouped['end_time'].max().to_frame()
        ,grouped['visit_id'].count()
    ]
    ,axis=1
)
sessions.rename({"visit_id":"visit_id_count"}, axis="columns", inplace=True)
# As with visits, .dt.seconds excludes whole days from the duration.
sessions['duration'] = (sessions['end_time'] - sessions['start_time']).dt.seconds
del grouped
sessions.head()

last_visit_in_session,user_id,start_time,end_time,visit_id_count,duration
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,1,0
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,1,0
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,1,0
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,1,2123
00140e7d-0932-41b0-833c-5d41c67f39a1,560827740,2019-10-28 12:37:38+00:00,2019-10-28 12:37:38+00:00,1,0


# Inter-session metrics

## Absence time
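Absence time measures how long a user stays away before their next session. Note that the definition of a session imposes a sharp positive lower bound on the absence time: visits less than TIME_DIFF seconds apart are merged into one session, so consecutive sessions of the same user should be at least TIME_DIFF seconds apart.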

In [15]:
%%time

next_session = {}
time_to_next_session = {}

for index, row in sessions.iterrows():
    # Candidates: the same user's sessions that start at or after the end of this session.
    future_sessions = sessions[(sessions['user_id']==row['user_id']) & (sessions['start_time']>=row['end_time']) & (sessions['start_time']>row['start_time'])]
    if len(future_sessions) == 0:
        next_session[index] = np.nan
        time_to_next_session[index] = np.nan
    else:
        # NOTE: .seconds excludes whole days, so every absence time computed here is below 24 hours.
        next_session[index] = (future_sessions['start_time']-row['end_time']).dt.seconds.idxmin()
        time_to_next_session[index] = (future_sessions.loc[next_session[index]]['start_time']-row['end_time']).seconds

CPU times: user 29.5 s, sys: 42.8 ms, total: 29.6 s
Wall time: 29.6 s
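
The lower bound that the session definition implies for the absence time can be checked directly. The sketch below uses made-up sessions and measures gaps with `total_seconds()`; `sessions_sketch` and `violations` are illustrative names, and the check assumes each user's sessions are sorted by start time.

```python
import pandas as pd

TIME_DIFF = 60 * 30  # same 30-minute threshold as the notebook

# Hypothetical toy sessions: two for user 1, one for user 2.
sessions_sketch = pd.DataFrame({
    "session_id": ["s1", "s2", "s3"],
    "user_id": [1, 1, 2],
    "start_time": pd.to_datetime(
        ["2019-10-01 10:00", "2019-10-02 12:00", "2019-10-01 09:00"], utc=True),
    "end_time": pd.to_datetime(
        ["2019-10-01 10:25", "2019-10-02 12:10", "2019-10-01 09:05"], utc=True),
}).sort_values(["user_id", "start_time"]).reset_index(drop=True)

# Absence time: start of the user's next session minus the end of this one.
next_start = sessions_sketch.groupby("user_id")["start_time"].shift(-1)
sessions_sketch["time_to_next_session"] = (
    next_start - sessions_sketch["end_time"]
).dt.total_seconds()

# Sessions with a successor should never be closer together than TIME_DIFF.
absence = sessions_sketch["time_to_next_session"].dropna()
violations = int((absence < TIME_DIFF).sum())
print("Absence times below TIME_DIFF:", violations)
```

A non-zero count on real data would point to ties, overlapping visits, or a gap computation that loses whole days.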


In [16]:
sessions['next_session_id'] = sessions.index.map(next_session)
sessions['time_to_next_session'] = sessions.index.map(time_to_next_session)
sessions.head()

last_visit_in_session,user_id,start_time,end_time,visit_id_count,duration,next_session_id,time_to_next_session
0004877e-8ef5-4ffd-90bf-d293b9973991,513325471,2019-10-01 05:29:03+00:00,2019-10-01 05:29:03+00:00,1,0,865de13b-c0ec-4d9b-9ceb-92e7818dbe1f,1900.0
000eb398-bae9-43a2-a244-8753155ef1f1,519491226,2019-10-31 08:53:21+00:00,2019-10-31 08:53:21+00:00,1,0,,
00103afe-d940-42d8-be96-d1fdab2a4c53,556517838,2019-10-06 20:04:02+00:00,2019-10-06 20:04:02+00:00,1,0,,
00118c0a-8c6d-483c-84e5-1ca82e173513,518652214,2019-10-13 19:55:15+00:00,2019-10-13 20:30:38+00:00,1,2123,97ce1822-fda7-4661-867f-30e4a0eb2bd6,40592.0
00140e7d-0932-41b0-833c-5d41c67f39a1,560827740,2019-10-28 12:37:38+00:00,2019-10-28 12:37:38+00:00,1,0,,


In [17]:
print("Describing the absence time in hours, we get:")
round((sessions['time_to_next_session'] / 3600).describe())

Describing the absence time in hours, we get:


count    12564.0
mean         8.0
std          7.0
min          0.0
25%          2.0
50%          5.0
75%         12.0
max         24.0
Name: time_to_next_session, dtype: float64

Interesting. The minimum rounds to zero hours, so let's check whether any absence times are exactly zero.

In [18]:
# The mean of a boolean mask is the fraction of True values; the denominator is every session.
print("Fraction of sessions with an absence time of zero:", (sessions['time_to_next_session']==0).mean())

Fraction of sessions with an absence time of zero: 0.00021794089442943074


In [19]:
print("Number of sessions with an absence time of zero:",sum(sessions['time_to_next_session']==0))

Number of sessions with an absence time of zero: 5
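
The choice of denominator matters for a fraction like this, because sessions without a following session have a missing absence time. A small sketch on made-up values, with illustrative variable names, shows the two options.

```python
import numpy as np
import pandas as pd

# Hypothetical absence times in seconds; NaN means "no later session".
absence = pd.Series([0.0, 1900.0, np.nan, 40592.0, 0.0], name="time_to_next_session")

is_zero = absence == 0  # NaN compares as False, so it is never counted as zero

count_zero = int(is_zero.sum())
fraction_of_all = is_zero.mean()                         # denominator: every session
fraction_of_followed = is_zero[absence.notna()].mean()   # denominator: sessions with a successor

print(count_zero, fraction_of_all, fraction_of_followed)
```

The fraction printed in the cell above uses the first denominator, that is, all sessions including those with no later session.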


Let's list these sessions, then look at all the sessions of one affected user:

In [20]:
sessions[sessions['time_to_next_session']==0]

last_visit_in_session,user_id,start_time,end_time,visit_id_count,duration,next_session_id,time_to_next_session
15c6c61f-f936-4107-9ff7-2aa2aabc04d2,562238663,2019-10-20 12:24:52+00:00,2019-10-20 12:25:03+00:00,2,11,5e59cd3c-0958-493a-a686-ff0a6f05d083,0.0
37f86643-3b2d-43b4-a335-2becb0d1c53c,556625935,2019-10-07 04:05:55+00:00,2019-10-14 04:57:03+00:00,6,3068,da32d6c2-de14-4095-829e-ebde0222ba01,0.0
46aa816e-8f70-48a6-92ef-86057242320a,562424765,2019-10-21 01:05:12+00:00,2019-10-21 01:05:30+00:00,2,18,fbd420d7-5e13-42ce-8f06-9ba29b151ec3,0.0
49591149-be7b-4c2d-89e7-48a846afe042,520290141,2019-10-01 15:32:20+00:00,2019-10-17 16:00:47+00:00,2,1707,7779b920-01a8-478e-b878-b493c4062939,0.0
553ebc85-475a-4d00-b663-c9e21666557d,563809084,2019-10-24 18:51:32+00:00,2019-10-24 18:51:34+00:00,2,2,d412dd43-be46-4bb0-b949-5a3091625f2a,0.0


In [21]:
sessions[sessions['user_id']==562238663]

last_visit_in_session,user_id,start_time,end_time,visit_id_count,duration,next_session_id,time_to_next_session
15c6c61f-f936-4107-9ff7-2aa2aabc04d2,562238663,2019-10-20 12:24:52+00:00,2019-10-20 12:25:03+00:00,2,11,5e59cd3c-0958-493a-a686-ff0a6f05d083,0.0
5e59cd3c-0958-493a-a686-ff0a6f05d083,562238663,2019-10-20 12:25:03+00:00,2019-10-20 12:25:03+00:00,1,0,,


So it seems that the next session started at the very second the previous session ended. In any case, there are so few such cases that it should be safe to ignore them.

## Absence time and number of visits
### Correlation
Correlation is the simplest way to quantify the relationship between two quantities. Sessions without a following session have no absence time, so they are dropped first.

In [22]:
relevant_data = sessions[~sessions['time_to_next_session'].isna()]

In [23]:
np.corrcoef(relevant_data['time_to_next_session'],relevant_data['visit_id_count'])[0,1]

-0.009679237360831774

### Log-log correlation

In [24]:
np.corrcoef(relevant_data['time_to_next_session'].map(np.log1p),relevant_data['visit_id_count'].map(np.log1p))[0,1]

-0.024274108824329278
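
To get a feel for how these coefficients behave, the sketch below computes the raw, log-log, and rank (Spearman) correlations on synthetic data in which the two columns are generated independently. The data, the `pearson` helper, and the rank variant are assumptions added for illustration; they are not part of the analysis above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(101)
n = 1000

# Synthetic, made-up data: heavy-tailed absence times and small visit counts,
# drawn independently of each other.
toy = pd.DataFrame({
    "time_to_next_session": rng.exponential(scale=8 * 3600, size=n),
    "visit_id_count": rng.integers(1, 6, size=n),
})

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

x, y = toy["time_to_next_session"], toy["visit_id_count"]
raw = pearson(x, y)
loglog = pearson(np.log1p(x), np.log1p(y))
rank = pearson(x.rank(), y.rank())  # Spearman correlation is Pearson on ranks

# Rule of thumb: for independent samples, the sample correlation fluctuates
# around zero with a standard deviation of roughly 1 / sqrt(n).
noise_scale = 1 / np.sqrt(n)
print(raw, loglog, rank, noise_scale)
```

Comparing an observed coefficient with `1 / sqrt(n)` for the actual sample size gives a rough sense of whether it stands out from what independent data would produce.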

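As predicted by the slides, the correlation between absence time and the number of visits in a session is negative, although it is very weak in this sample (about -0.01 on the raw values and -0.02 on the log-log scale). The sign is consistent with the idea that users who make more visits during a session come back sooner and are therefore more engaged with the product.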