# Exploratory Data Analysis

# Intro

This notebook is for exploratory data analysis of the data provided in the OTTO – Multi-Objective Recommender System project. In this notebook, we will be understanding the properties of the dataset. After understanding the data, in other notebooks we will be benchmarking various machine learning approaches for building a recommendation system.

# Objective

The data description on the OTTO competition page on Kaggle mentions that for each `session`, we have the `ts` (time stamps), `aid` (article IDs), and `type` (event type). The `type` can be one of `clicks`, `carts`, and `orders`. The EDA in this notebook investigates the following in the training data:

1. Quantification: # of rows, counts of `session`, count of `aid`
2. Understanding User Behavior: count of unique `aid` per session (i.e. how many unique items the user is interested in), `clicks` per `session` (how much does the user browse), `carts` per `session` (how many items are added to cart), `orders` per `session` (are there multiple orders in a single session)
3. Understanding Articles: `clicks` per `aid` (i.e. most to least clicked articles), `carts` per `aid` (which items are added to cart the most), `orders` per `aid` (are there specific articles which trigger the ordering), co-occurence of `aid` (how likely are two `aid` to co-occur in a session), transition matrix of `aid` (which article will be clicked on next)
5. Understanding Time Spent: calculate difference of last and first `ts` as time elapsed, time elapsed per `session` (how much tiime does a user spend per session), time elapsed per `aid` (how much time is spent per article), time elapsed on an `aid` right before `carts` (how much time is spent on an article before adding to cart), time elapsed on `aid` added to cart and those `aid` not added to cart

# Package Freezing

#### Freezing the virtual environment on Kaggle
By documenting our environment, we want to ensures reproduceability.

In [2]:
!pip freeze > requirements.txt

# Reading Data

In [3]:
import numpy as np
import pandas as pd

data_path = r"/kaggle/input/otto-recommender-system/"

### Reading training data in chunks

Reference: https://www.kaggle.com/code/inversion/read-a-chunk-of-jsonl

In [4]:
num_lines = sum(1 for line in open(data_path+'/train.jsonl'))
print(f'number of lines in train: {num_lines:,}')

number of lines in train: 12,899,779


In [5]:
chunksize = 100_000
num_chunks = int(np.ceil(num_lines / 100_000))
print(f'number of chunks: {num_chunks:,}')

number of chunks: 129


Read the first two chunks

In [19]:
n = 2
train_sessions = pd.DataFrame()
chunks = pd.read_json(data_path + '/train.jsonl', lines=True, chunksize=chunksize)

for e, chunk in enumerate(chunks):
    if e < n:
        train_sessions = pd.concat([train_sessions, chunk])
    else:
        break

In [20]:
train_sessions.head(2)

Unnamed: 0,session,events
0,0,"[{'aid': 1517085, 'ts': 1659304800025, 'type':..."
1,1,"[{'aid': 424964, 'ts': 1659304800025, 'type': ..."


In [21]:
train_sessions.loc[0, "events"][0:5]

[{'aid': 1517085, 'ts': 1659304800025, 'type': 'clicks'},
 {'aid': 1563459, 'ts': 1659304904511, 'type': 'clicks'},
 {'aid': 1309446, 'ts': 1659367439426, 'type': 'clicks'},
 {'aid': 16246, 'ts': 1659367719997, 'type': 'clicks'},
 {'aid': 1781822, 'ts': 1659367871344, 'type': 'clicks'}]

### Counts of types by session

In [22]:
# Expand the 'events' column into individual rows
expanded_df = train_sessions.explode('events')
expanded_df.head(2)

Unnamed: 0,session,events
0,0,"{'aid': 1517085, 'ts': 1659304800025, 'type': ..."
0,0,"{'aid': 1563459, 'ts': 1659304904511, 'type': ..."


In [23]:
# Convert the dictionaries in 'events' column into separate columns
expanded_df[["aid", "ts", "type"]] = pd.json_normalize(expanded_df['events'])
expanded_df.head(2)

Unnamed: 0,session,events,aid,ts,type
0,0,"{'aid': 1517085, 'ts': 1659304800025, 'type': ...",1517085,1659304800025,clicks
0,0,"{'aid': 1563459, 'ts': 1659304904511, 'type': ...",1517085,1659304800025,clicks


In [30]:
# Group by the original index and 'type' to count occurrences
grouped_counts = (
    expanded_df
    .groupby(["session", 'type'])
    .size()
    .reset_index()
    .rename({0:"count"}, axis=1)
)

print(grouped_counts)

        session    type  count
0             0  clicks    276
1             1  clicks     32
2             2  clicks     33
3             3  clicks    226
4             4  clicks     19
...         ...     ...    ...
199995   199995  clicks     17
199996   199996  clicks     25
199997   199997  clicks    154
199998   199998  clicks      2
199999   199999   carts      9

[200000 rows x 3 columns]
