# Exploratory Data Analysis

# Intro

This notebook performs exploratory data analysis (EDA) on the data provided in the OTTO – Multi-Objective Recommender System project. Its goal is to understand the properties of the dataset. Other notebooks then build on this understanding to benchmark various machine learning approaches for building a recommendation system.

# Objective

The data description on the OTTO competition page on Kaggle mentions that for each `session`, we have the `ts` (time stamps), `aid` (article IDs), and `type` (event type). The `type` can be one of `clicks`, `carts`, and `orders`. The EDA in this notebook investigates the following in the training data:

1. Quantification: # of rows, counts of `session`, count of `aid`
2. Understanding User Behavior: count of unique `aid` per session (i.e. how many unique items the user is interested in), `clicks` per `session` (how much does the user browse), `carts` per `session` (how many items are added to cart), `orders` per `session` (are there multiple orders in a single session)
3. Understanding Articles: `clicks` per `aid` (i.e. most to least clicked articles), `carts` per `aid` (which items are added to cart the most), `orders` per `aid` (are there specific articles which trigger the ordering), co-occurrence of `aid` (how likely are two `aid` to co-occur in a session), transition matrix of `aid` (which article will be clicked on next)
4. Understanding Time Spent: calculate difference of last and first `ts` as time elapsed, time elapsed per `session` (how much time does a user spend per session), time elapsed per `aid` (how much time is spent per article), time elapsed on an `aid` right before `carts` (how much time is spent on an article before adding to cart), time elapsed on `aid` added to cart versus `aid` not added to cart

Note: The training data set is very large (~13M sessions and ~217M events), so exploring all of it using only the resources available on Kaggle would be quite challenging. We therefore use a sample of the first 150K sessions. Techniques for performing exploratory data analysis on larger portions of the data include the `multiprocessing` and `dask` packages; `pyspark` is another alternative.
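As a rough illustration of working with a file that does not fit in memory, the sketch below streams a JSONL file in chunks with `pd.read_json(..., chunksize=...)` and keeps only running totals. The three toy sessions, the temporary file, and the names `toy_sessions`, `toy_path`, `type_counts`, and `n_sessions` are hypothetical stand-ins, not part of the OTTO data or of this notebook's pipeline.

```python
import json
import os
import tempfile
from collections import Counter

import pandas as pd

# Hypothetical stand-in for train.jsonl: three tiny sessions, one per line.
toy_sessions = [
    {"session": 0, "events": [{"aid": 1, "ts": 1000, "type": "clicks"},
                              {"aid": 2, "ts": 2000, "type": "carts"}]},
    {"session": 1, "events": [{"aid": 1, "ts": 1500, "type": "clicks"}]},
    {"session": 2, "events": [{"aid": 3, "ts": 1700, "type": "clicks"},
                              {"aid": 3, "ts": 1900, "type": "orders"}]},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for s in toy_sessions:
        f.write(json.dumps(s) + "\n")
    toy_path = f.name

# Stream the file two sessions at a time and accumulate running totals,
# so only one chunk is held in memory at any moment.
type_counts = Counter()
n_sessions = 0
for chunk in pd.read_json(toy_path, lines=True, chunksize=2):
    n_sessions += len(chunk)
    for events in chunk["events"]:
        type_counts.update(e["type"] for e in events)

os.remove(toy_path)  # clean up the temporary file
print(n_sessions, dict(type_counts))
```

On the real file, the chunk size would be far larger than 2; this pattern only suits statistics that can be accumulated chunk by chunk, such as counts.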

# References

1. https://www.kaggle.com/code/edwardcrookenden/otto-getting-started-eda-baseline
2. https://www.kaggle.com/code/inversion/read-a-chunk-of-jsonl

# Virtual Env Freezing

#### Freezing the virtual environment on Kaggle
Documenting the environment helps ensure reproducibility.

In [1]:
!pip freeze > requirements.txt

# Reading Data

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from datetime import timedelta

import os


data_path = r"/kaggle/input/otto-recommender-system/"

In [3]:
# Load in a sample to a pandas df

sample_size = 150000

chunks = pd.read_json(os.path.join(data_path, 'train.jsonl'), lines=True, chunksize=sample_size)

for c in chunks:
    sample_train_df = c
    break

In [4]:
sample_train_df.head(3)

Unnamed: 0,session,events
0,0,"[{'aid': 1517085, 'ts': 1659304800025, 'type':..."
1,1,"[{'aid': 424964, 'ts': 1659304800025, 'type': ..."
2,2,"[{'aid': 763743, 'ts': 1659304800038, 'type': ..."


In [5]:
sample_train_df.loc[0, "events"][0:3]

[{'aid': 1517085, 'ts': 1659304800025, 'type': 'clicks'},
 {'aid': 1563459, 'ts': 1659304904511, 'type': 'clicks'},
 {'aid': 1309446, 'ts': 1659367439426, 'type': 'clicks'}]

In [6]:
# Expand the 'events' column into individual rows
expanded_df = sample_train_df.explode('events').reset_index(drop=True)
expanded_df.head(3)

Unnamed: 0,session,events
0,0,"{'aid': 1517085, 'ts': 1659304800025, 'type': ..."
1,0,"{'aid': 1563459, 'ts': 1659304904511, 'type': ..."
2,0,"{'aid': 1309446, 'ts': 1659367439426, 'type': ..."


In [7]:
# Convert the dictionaries in 'events' column into separate columns
expanded_df[["aid", "ts", "type"]] = pd.json_normalize(expanded_df['events'])
expanded_df = expanded_df.drop('events', axis=1)
expanded_df.head(3)

Unnamed: 0,session,aid,ts,type
0,0,1517085,1659304800025,clicks
1,0,1563459,1659304904511,clicks
2,0,1309446,1659367439426,clicks
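The `ts` values are easier to read as datetimes. The sketch below assumes they are Unix timestamps in milliseconds, which is consistent with how elapsed time is computed later in this notebook; the frame `toy_df` and the column `ts_dt` are hypothetical names, built from the three rows shown above.

```python
import pandas as pd

# Toy frame reproducing the three rows displayed above.
toy_df = pd.DataFrame({
    "session": [0, 0, 0],
    "aid": [1517085, 1563459, 1309446],
    "ts": [1659304800025, 1659304904511, 1659367439426],
    "type": ["clicks", "clicks", "clicks"],
})

# Assumption: ts is milliseconds since the Unix epoch (UTC).
toy_df["ts_dt"] = pd.to_datetime(toy_df["ts"], unit="ms")
print(toy_df[["ts", "ts_dt"]])
```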


In [8]:
expanded_df.shape

(7841827, 4)

# Quantification

### # of rows

In [9]:
%%time 

with open(os.path.join(data_path, 'train.jsonl')) as f:  # close the file handle when done
    num_lines = sum(1 for _ in f)
print(f'number of lines in train: {num_lines:,}')

number of lines in train: 12,899,779
CPU times: user 14.6 s, sys: 10.6 s, total: 25.2 s
Wall time: 2min 54s


In [11]:
print(f'number of sessions in the sample of training data: {expanded_df["session"].nunique():,}')

number of sessions in the sample of training data: 150,000


In [10]:
print(f'number of aid in the sample of training data: {expanded_df["aid"].nunique():,}')

number of aid in the sample of training data: 830,140


# Understanding User Behavior

### Counts of types by session

In [None]:
# Group by session and event type to count events of each type per session
grouped_counts = (
    expanded_df
    .groupby(["session", "type"])
    .size()
    .reset_index(name="count")
)

print(grouped_counts)
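The long-format counts can be hard to scan. One possible follow-up, sketched here on hypothetical toy data (`toy_events` and `wide_counts` are illustrative names), pivots the per-session counts into one column per event type, filling sessions that lack a type with 0.

```python
import pandas as pd

# Hypothetical toy events: session 0 clicks twice and carts once, and so on.
toy_events = pd.DataFrame({
    "session": [0, 0, 0, 1, 1, 2],
    "type": ["clicks", "clicks", "carts", "clicks", "orders", "clicks"],
})

wide_counts = (
    toy_events
    .groupby(["session", "type"])
    .size()
    .unstack(fill_value=0)  # one column per event type, 0 where a type is absent
    .reindex(columns=["clicks", "carts", "orders"], fill_value=0)  # fixed column order
)
print(wide_counts)
print(wide_counts.describe())  # distribution of clicks/carts/orders per session
```

The same chain should apply to `expanded_df` in place of `toy_events`, since it has the same `session` and `type` columns.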

In [None]:
# Exploration of first session in the dataframe

first_session = sample_train_df.loc[0,'events']

# Number of actions
print(f'{len(first_session)} actions')

# Elapsed time of session
elapsed_time = first_session[-1]['ts'] - first_session[0]['ts']
print(str(timedelta(milliseconds=elapsed_time)))

# Frequency of actions by type in first session
first_action_counts = {}
for i in first_session:
    first_action_counts[i['type']] = first_action_counts.get(i['type'], 0) + 1
print(first_action_counts)
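The elapsed-time calculation for a single session can be extended to every session with a group-by. This is a minimal sketch on hypothetical toy data (`toy_events`, `ts_range`, and `elapsed` are illustrative names), again assuming `ts` is in milliseconds. It uses max minus min rather than last minus first, so it does not rely on events being sorted by time.

```python
import pandas as pd

# Hypothetical toy events with millisecond timestamps.
toy_events = pd.DataFrame({
    "session": [0, 0, 0, 1, 1, 2],
    "ts": [1000, 4000, 61000, 5000, 5500, 9000],
})

# Earliest and latest timestamp per session.
ts_range = toy_events.groupby("session")["ts"].agg(["min", "max"])

# Elapsed time per session as a Timedelta; single-event sessions give 0.
elapsed = pd.to_timedelta(ts_range["max"] - ts_range["min"], unit="ms")
print(elapsed)
```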

In [None]:
# Counting frequency of article ID across all sessions
aid_counts = {}
for i, row in sample_train_df.iterrows():
    actions = row['events']
    for action in actions:
        aid_counts[action['aid']] = aid_counts.get(action['aid'], 0) + 1
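The same per-article frequencies can likely be obtained without a Python loop by calling `value_counts` on the `aid` column of the exploded frame. The sketch below uses hypothetical toy data (`toy_events` and `aid_counts_series` are illustrative names).

```python
import pandas as pd

# Hypothetical toy events: article 10 appears three times, 20 twice, 30 once.
toy_events = pd.DataFrame({"aid": [10, 20, 10, 30, 10, 20]})

# Frequency of each article ID, sorted from most to least frequent.
aid_counts_series = toy_events["aid"].value_counts()
print(aid_counts_series.head(2))
```

Because the result is sorted in descending order, `head(n)` gives the `n` most frequent articles directly.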


In [None]:
# Group by session and article ID to count how often each article appears in each session
article_id_counts = (
    expanded_df
    .groupby(["session", "aid"])
    .size()
    .reset_index(name="count")
)

print(article_id_counts)
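For the objective of counting unique `aid` per session, a grouped `nunique` is one straightforward option. The sketch below runs on hypothetical toy data (`toy_events` and `unique_aids` are illustrative names).

```python
import pandas as pd

# Hypothetical toy events: session 0 touches two distinct articles, the others one each.
toy_events = pd.DataFrame({
    "session": [0, 0, 0, 1, 1, 2],
    "aid": [10, 10, 20, 30, 30, 40],
})

# Number of distinct articles seen in each session.
unique_aids = toy_events.groupby("session")["aid"].nunique()
print(unique_aids)
print(unique_aids.describe())  # summary of how many distinct items users look at
```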