1️⃣ What is a Session?
A session is a group of events from the same user that occur close together in time.

In [1]:
import json
import pandas as pd

In [2]:
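# Load the raw events file.
with open("../data/raw/events.json", "r") as f:
    events = json.load(f)

# Flatten nested fields into columns.
df = pd.json_normalize(events)

# Parse timestamps so they can be sorted and subtracted.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Chronological order is required for the per-user shift() below.
df = df.sort_values("timestamp")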

### Use Resolved User

In [3]:
# Prefer userId; fall back to anonymousId when userId is missing.
df["resolved_user"] = df["userId"].fillna(df["anonymousId"])

df[["type", "event", "resolved_user", "timestamp"]]

index,type,event,resolved_user,timestamp
0,identify,,user_001,2024-01-01 09:00:00+00:00
1,track,Signup Completed,user_001,2024-01-01 09:02:00+00:00
2,track,Login,user_001,2024-01-02 08:30:00+00:00
3,track,Feature Used,user_001,2024-01-02 08:35:00+00:00
4,track,Feature Used,anon_456,2024-01-02 10:00:00+00:00
5,identify,,user_002,2024-01-02 10:05:00+00:00


### Calculate Time Between Events

In [4]:
# Previous event time for the same user (NaT for each user's first event).
df["prev_timestamp"] = df.groupby("resolved_user")["timestamp"].shift(1)

df["time_since_last_event"] = (
    df["timestamp"] - df["prev_timestamp"]
)

df[["resolved_user", "timestamp", "prev_timestamp", "time_since_last_event"]]

index,resolved_user,timestamp,prev_timestamp,time_since_last_event
0,user_001,2024-01-01 09:00:00+00:00,NaT,NaT
1,user_001,2024-01-01 09:02:00+00:00,2024-01-01 09:00:00+00:00,0 days 00:02:00
2,user_001,2024-01-02 08:30:00+00:00,2024-01-01 09:02:00+00:00,0 days 23:28:00
3,user_001,2024-01-02 08:35:00+00:00,2024-01-02 08:30:00+00:00,0 days 00:05:00
4,anon_456,2024-01-02 10:00:00+00:00,NaT,NaT
5,user_002,2024-01-02 10:05:00+00:00,NaT,NaT
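As a cross-check on the gap calculation above, the same per-user gap can be sketched with `groupby(...).diff()`. This is a minimal example that rebuilds the six rows from the table by hand; the names `gap_diff` and `gap_shift` are illustrative and not part of the notebook.

```python
import pandas as pd

# Rows copied from the output table above (timestamps are UTC).
df = pd.DataFrame({
    "resolved_user": ["user_001", "user_001", "user_001", "user_001",
                      "anon_456", "user_002"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00", "2024-01-01 09:02:00",
        "2024-01-02 08:30:00", "2024-01-02 08:35:00",
        "2024-01-02 10:00:00", "2024-01-02 10:05:00",
    ], utc=True),
})

# Sort by user and time so each row's predecessor within its group
# is that user's previous event.
df = df.sort_values(["resolved_user", "timestamp"])

# diff() within each group: timestamp minus the previous timestamp.
gap_diff = df.groupby("resolved_user")["timestamp"].diff()

# The shift-and-subtract form used in the notebook cell.
gap_shift = df["timestamp"] - df.groupby("resolved_user")["timestamp"].shift(1)

# Both forms leave NaT on each user's first event.
print(pd.DataFrame({"gap_diff": gap_diff, "gap_shift": gap_shift}))
```

Either form works for the steps that follow; `shift` keeps the previous timestamp available as its own column, which can help when inspecting the data.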


### Define a New Session

An event starts a new session when it is the user's first event, or when more than 30 minutes have passed since that user's previous event.

In [6]:
session_threshold = pd.Timedelta(minutes=30)

# True on a user's first event (gap is NaT) or after a gap longer than the threshold.
df["new_session"] = (
    (df["time_since_last_event"].isna()) |
    (df["time_since_last_event"] > session_threshold)
)

df[["resolved_user", "timestamp", "time_since_last_event", "new_session"]]

index,resolved_user,timestamp,time_since_last_event,new_session
0,user_001,2024-01-01 09:00:00+00:00,NaT,True
1,user_001,2024-01-01 09:02:00+00:00,0 days 00:02:00,False
2,user_001,2024-01-02 08:30:00+00:00,0 days 23:28:00,True
3,user_001,2024-01-02 08:35:00+00:00,0 days 00:05:00,False
4,anon_456,2024-01-02 10:00:00+00:00,NaT,True
5,user_002,2024-01-02 10:05:00+00:00,NaT,True


### Assign Session IDs

In [7]:
# A running count of session starts per user gives a 1-based session number.
df["session_id"] = (
    df.groupby("resolved_user")["new_session"]
      .cumsum()
)

df[["resolved_user", "timestamp", "session_id"]]

index,resolved_user,timestamp,session_id
0,user_001,2024-01-01 09:00:00+00:00,1
1,user_001,2024-01-01 09:02:00+00:00,1
2,user_001,2024-01-02 08:30:00+00:00,2
3,user_001,2024-01-02 08:35:00+00:00,2
4,anon_456,2024-01-02 10:00:00+00:00,1
5,user_002,2024-01-02 10:05:00+00:00,1


### View User Timelines

In [8]:
df[[
    "resolved_user",
    "session_id",
    "type",
    "event",
    "timestamp"
]].sort_values(["resolved_user", "timestamp"])

index,resolved_user,session_id,type,event,timestamp
4,anon_456,1,track,Feature Used,2024-01-02 10:00:00+00:00
0,user_001,1,identify,,2024-01-01 09:00:00+00:00
1,user_001,1,track,Signup Completed,2024-01-01 09:02:00+00:00
2,user_001,2,track,Login,2024-01-02 08:30:00+00:00
3,user_001,2,track,Feature Used,2024-01-02 08:35:00+00:00
5,user_002,1,identify,,2024-01-02 10:05:00+00:00
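Once each event carries a `session_id`, the timeline can be rolled up to one row per session. The sketch below is one possible summary, not part of the original notebook; the column names `session_start`, `session_end`, `event_count`, and `duration` are assumptions chosen for illustration, and the rows are copied from the timeline above.

```python
import pandas as pd

# Rows copied from the timeline table above.
timeline = pd.DataFrame({
    "resolved_user": ["anon_456", "user_001", "user_001",
                      "user_001", "user_001", "user_002"],
    "session_id": [1, 1, 1, 2, 2, 1],
    "timestamp": pd.to_datetime([
        "2024-01-02 10:00:00", "2024-01-01 09:00:00",
        "2024-01-01 09:02:00", "2024-01-02 08:30:00",
        "2024-01-02 08:35:00", "2024-01-02 10:05:00",
    ], utc=True),
})

# One row per (user, session): first event, last event, and event count.
summary = (
    timeline.groupby(["resolved_user", "session_id"])
    .agg(
        session_start=("timestamp", "min"),
        session_end=("timestamp", "max"),
        event_count=("timestamp", "count"),
    )
    .reset_index()
)

# Duration is measured from first to last event, so a
# single-event session has a duration of zero.
summary["duration"] = summary["session_end"] - summary["session_start"]

print(summary)
```

Note that session IDs are only unique within a user, which is why the grouping key includes `resolved_user`.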


1️⃣ How many sessions does each user have?
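In this dataset, user_001 has two sessions, because their events on 2024-01-01 and 2024-01-02 are separated by 23 hours 28 minutes, while anon_456 and user_002 have one session each. In general, a resolved user gains a new session each time two consecutive events are more than 30 minutes apart.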
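The per-user count can be read off the `session_id` column directly. A minimal sketch, using the user and session values from the tables above; `sessions_per_user` is an illustrative name.

```python
import pandas as pd

# (resolved_user, session_id) pairs copied from the session ID table.
df = pd.DataFrame({
    "resolved_user": ["user_001", "user_001", "user_001", "user_001",
                      "anon_456", "user_002"],
    "session_id": [1, 1, 2, 2, 1, 1],
})

# Count the distinct session IDs seen for each user.
sessions_per_user = df.groupby("resolved_user")["session_id"].nunique()

print(sessions_per_user.to_dict())
# → {'anon_456': 1, 'user_001': 2, 'user_002': 1}
```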
2️⃣ What causes a new session to start?
A new session starts when the time gap between two consecutive events for the same user exceeds 30 minutes or when it is the user’s first event.
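One detail worth making explicit is the boundary case: because the comparison is a strict "greater than", a gap of exactly 30 minutes stays in the same session. The sketch below wraps the notebook's steps in a hypothetical helper, `assign_sessions`, and runs it on made-up timestamps for a made-up user `u1`.

```python
import pandas as pd

def assign_sessions(events, threshold=pd.Timedelta(minutes=30)):
    """Hypothetical helper: number each user's sessions starting at 1."""
    events = events.sort_values(["resolved_user", "timestamp"]).copy()
    gap = events.groupby("resolved_user")["timestamp"].diff()
    # New session on the first event (gap is NaT) or when the gap
    # is strictly greater than the threshold.
    events["new_session"] = gap.isna() | (gap > threshold)
    events["session_id"] = events.groupby("resolved_user")["new_session"].cumsum()
    return events

# Illustrative data, not from the events file.
demo = pd.DataFrame({
    "resolved_user": ["u1", "u1", "u1"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00",
        "2024-01-01 09:30:00",  # exactly 30 minutes after the first event
        "2024-01-01 10:01:00",  # 31 minutes after the second event
    ], utc=True),
})

result = assign_sessions(demo)
print(result["session_id"].tolist())
# → [1, 1, 2]
```

Switching the comparison to `>=` would move the 09:30 event into a second session, so the choice of operator is part of the session definition.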
3️⃣ Why is sessionization based on time and not events?
Sessionization is time-based because a visit is inferred from inactivity gaps, which every event stream has, rather than from specific actions, whose names and meaning vary across products.
4️⃣ What assumptions are we making with a 30-minute threshold?
We assume that 30 minutes of inactivity indicates the end of a user’s visit, even though actual engagement patterns may differ.
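A quick way to see how much the result depends on this assumption is to recount sessions under different thresholds. The following is a sketch on the six events from this notebook; `count_sessions` is a hypothetical helper, and the alternative thresholds (1 minute and 1 day) are chosen only to show the effect.

```python
import pandas as pd

# Users and timestamps copied from the tables above.
events = pd.DataFrame({
    "resolved_user": ["user_001", "user_001", "user_001", "user_001",
                      "anon_456", "user_002"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00", "2024-01-01 09:02:00",
        "2024-01-02 08:30:00", "2024-01-02 08:35:00",
        "2024-01-02 10:00:00", "2024-01-02 10:05:00",
    ], utc=True),
})

def count_sessions(events, threshold):
    """Hypothetical helper: total sessions across all users."""
    events = events.sort_values(["resolved_user", "timestamp"])
    gap = events.groupby("resolved_user")["timestamp"].diff()
    # Each session start is a first event or a gap above the threshold.
    return int((gap.isna() | (gap > threshold)).sum())

for minutes in (1, 30, 24 * 60):
    total = count_sessions(events, pd.Timedelta(minutes=minutes))
    print(f"{minutes} min threshold: {total} sessions")
# → 1 min threshold: 6 sessions
# → 30 min threshold: 4 sessions
# → 1440 min threshold: 3 sessions
```

With a 1-day threshold, user_001's two visits merge into one session because they are 23 hours 28 minutes apart, which shows that the threshold is a modeling choice rather than a property of the data.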
5️⃣ How could session logic differ for mobile vs SaaS apps?
Mobile apps may use shorter inactivity thresholds or app foreground/background signals, while SaaS web apps typically rely on longer time-based thresholds.
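One way such a difference could be expressed is a per-platform threshold. The sketch below assumes a `platform` column, which does not appear in this notebook's data, and uses illustrative threshold values (5 minutes for mobile, 30 for web) that are not recommendations. It does not model foreground/background signals.

```python
import pandas as pd

# Hypothetical inactivity thresholds in minutes, keyed by platform.
THRESHOLD_MINUTES = {"mobile": 5, "web": 30}

# Illustrative events: two users with identical 10-minute gaps.
events = pd.DataFrame({
    "resolved_user": ["u1", "u1", "u2", "u2"],
    "platform": ["mobile", "mobile", "web", "web"],  # assumed column
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00", "2024-01-01 09:10:00",
        "2024-01-01 09:00:00", "2024-01-01 09:10:00",
    ], utc=True),
}).sort_values(["resolved_user", "timestamp"])

# Per-row threshold looked up from each event's platform.
threshold = pd.to_timedelta(events["platform"].map(THRESHOLD_MINUTES), unit="min")

gap = events.groupby("resolved_user")["timestamp"].diff()
events["new_session"] = gap.isna() | (gap > threshold)
events["session_id"] = events.groupby("resolved_user")["new_session"].cumsum()

print(events[["resolved_user", "platform", "session_id"]])
```

Here the same 10-minute gap splits the mobile user's events into two sessions while the web user's events stay in one.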