# Exploratory Data Analysis

# Intro

This notebook performs exploratory data analysis (EDA) on the data provided for the OTTO – Multi-Objective Recommender System project. The goal here is to understand the properties of the dataset. Other notebooks then benchmark various machine learning approaches for building a recommendation system.

# Objective

The data description on the OTTO competition page on Kaggle states that each `session` contains events with a `ts` (timestamp), an `aid` (article ID), and a `type` (event type). The `type` is one of `clicks`, `carts`, or `orders`. The EDA in this notebook investigates the following in the training data:

1. Quantification: number of rows, count of `session`, count of `aid`
2. Understanding User Behavior: count of unique `aid` per session (i.e. how many unique items the user is interested in), `clicks` per `session` (how much the user browses), `carts` per `session` (how many items are added to cart), `orders` per `session` (whether there are multiple orders in a single session)
3. Understanding Articles: `clicks` per `aid` (i.e. most to least clicked articles), `carts` per `aid` (which items are added to cart the most), `orders` per `aid` (whether specific articles trigger ordering), co-occurrence of `aid` (how likely two `aid` are to co-occur in a session), transition matrix of `aid` (which article will be clicked on next)
4. Understanding Time Spent: time elapsed, computed as the difference between the last and first `ts`; time elapsed per `session` (how much time a user spends per session), time elapsed per `aid` (how much time is spent per article), time elapsed on an `aid` right before `carts` (how much time is spent on an article before adding it to cart), time elapsed on `aid` that were added to cart versus those that were not
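
To make the fields above concrete, here is a minimal sketch of what one parsed line of `train.jsonl` is expected to look like. The values are invented for illustration. Only the field names (`session`, `events`, `aid`, `ts`, `type`) come from the data description.

```python
import json

# One hypothetical JSONL record; all numbers are made up for illustration.
sample_line = (
    '{"session": 0, "events": ['
    '{"aid": 101, "ts": 1659304800000, "type": "clicks"}, '
    '{"aid": 102, "ts": 1659304860000, "type": "carts"}, '
    '{"aid": 102, "ts": 1659304920000, "type": "orders"}]}'
)

record = json.loads(sample_line)

# Event types in the order they occurred, and the distinct articles touched
event_types = [event["type"] for event in record["events"]]
unique_aids = {event["aid"] for event in record["events"]}

print(record["session"], event_types, sorted(unique_aids))
```

Each of the four analysis areas in the list reduces to grouping these events by `session`, by `aid`, or by `type`.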

# Package Freezing

#### Freezing the virtual environment on Kaggle
Documenting the environment helps ensure reproducibility.

In [None]:
!pip freeze > requirements.txt

# Reading Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import timedelta

# No trailing slash: file names are appended below as '/train.jsonl'
data_path = "/kaggle/input/otto-recommender-system"

### Reading training data in chunks

Reference: https://www.kaggle.com/code/inversion/read-a-chunk-of-jsonl

In [None]:
# Count the lines (one session per line); the with-block closes the file afterwards
with open(data_path + '/train.jsonl') as f:
    num_lines = sum(1 for _ in f)
print(f'number of lines in train: {num_lines:,}')

In [None]:
chunksize = 100_000
num_chunks = int(np.ceil(num_lines / chunksize))
print(f'number of chunks: {num_chunks:,}')
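
The line count and chunk count can be sanity-checked on a tiny stand-in file before running them on the full training set. The sketch below is illustrative only: the file name `toy.jsonl`, its contents, and the `toy_` variable names are assumptions, not part of the competition data.

```python
import math
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a 5-line placeholder file shaped like a JSONL file
    toy_path = os.path.join(tmp_dir, "toy.jsonl")
    with open(toy_path, "w") as f:
        for session_id in range(5):
            f.write('{"session": %d, "events": []}\n' % session_id)

    # Count lines lazily so the whole file never sits in memory
    with open(toy_path) as f:
        toy_num_lines = sum(1 for _ in f)

# Ceiling division: the last chunk may be smaller than the others
toy_chunksize = 2
toy_num_chunks = math.ceil(toy_num_lines / toy_chunksize)
print(toy_num_lines, toy_num_chunks)
```

With 5 lines and a chunk size of 2, the last chunk holds a single line, which is why the division is rounded up.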

Read the first two chunks

In [None]:
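n = 2  # number of chunks to load
chunks = pd.read_json(data_path + '/train.jsonl', lines=True, chunksize=chunksize)

# Collect the first n chunks, then concatenate once
# (cheaper than growing a DataFrame inside the loop)
first_chunks = []
for e, chunk in enumerate(chunks):
    if e >= n:
        break
    first_chunks.append(chunk)

train_sessions = pd.concat(first_chunks)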
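
An alternative way to take only the first few chunks is `itertools.islice`, which stops iterating after the requested number of chunks. This is a sketch on a throwaway file, assuming a pandas version in which the chunked JSON reader can be used as a context manager. The file and the `toy_` names are hypothetical.

```python
import os
import tempfile
from itertools import islice

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Five placeholder sessions, one JSON object per line
    toy_path = os.path.join(tmp_dir, "toy.jsonl")
    with open(toy_path, "w") as f:
        for session_id in range(5):
            f.write(
                '{"session": %d, "events": '
                '[{"aid": 1, "ts": 0, "type": "clicks"}]}\n' % session_id
            )

    # Read in chunks of 2 lines and keep only the first 2 chunks
    with pd.read_json(toy_path, lines=True, chunksize=2) as reader:
        toy_chunks = list(islice(reader, 2))

toy_sessions = pd.concat(toy_chunks, ignore_index=True)
print(toy_sessions)
```

Only 4 of the 5 sessions are loaded, because the third chunk is never requested.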

In [None]:
train_sessions.head(2)

In [None]:
train_sessions.loc[0, "events"][0:5]

### Counts of types by session

In [None]:
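# Expand the 'events' column into one row per event.
# ignore_index=True gives every event row its own label; without it each row keeps
# its session's original label, and the json_normalize output in the next cell
# (which is numbered from 0) would be aligned to the wrong rows.
expanded_df = train_sessions.explode('events', ignore_index=True)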
expanded_df.head(2)

In [None]:
# Convert the dictionaries in 'events' column into separate columns
expanded_df[["aid", "ts", "type"]] = pd.json_normalize(expanded_df['events'])
expanded_df.head(2)
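
One detail worth checking when flattening: `explode` by default repeats the original row label for every event, while `json_normalize` numbers its rows from 0, so the two only line up row by row if the exploded frame gets a fresh index. The sketch below shows this on invented toy data. The frame names `toy`, `flat`, and `fields` are hypothetical.

```python
import pandas as pd

# Two toy sessions with made-up events
toy = pd.DataFrame({
    "session": [0, 1],
    "events": [
        [{"aid": 10, "ts": 1000, "type": "clicks"},
         {"aid": 11, "ts": 2000, "type": "carts"}],
        [{"aid": 12, "ts": 3000, "type": "clicks"}],
    ],
})

# One row per event, relabelled 0..n-1 so positions and labels agree
flat = toy.explode("events", ignore_index=True)

# Turn each event dict into columns; rows are numbered 0..n-1 as well
fields = pd.json_normalize(flat["events"].tolist())

# Both frames now share the same index, so a column-wise concat is safe
flat = pd.concat([flat.drop(columns="events"), fields], axis=1)
print(flat)
```

After this step every event row carries its `session` together with its own `aid`, `ts`, and `type`.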

In [None]:
# Group by session and event type to count occurrences
grouped_counts = (
    expanded_df
    .groupby(["session", "type"])
    .size()
    .reset_index(name="count")
)

print(grouped_counts)
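
The long-format counts can be easier to compare across sessions when pivoted into one column per event type. A minimal sketch on invented data, assuming an event table with `session` and `type` columns (the frame name `toy_events` is hypothetical):

```python
import pandas as pd

# Toy event table; values are made up for illustration
toy_events = pd.DataFrame({
    "session": [0, 0, 0, 1, 1],
    "type": ["clicks", "clicks", "carts", "clicks", "orders"],
})

# Count events per (session, type), then spread the types into columns;
# combinations that never occur are filled with 0
wide_counts = (
    toy_events
    .groupby(["session", "type"])
    .size()
    .unstack(fill_value=0)
)
print(wide_counts)
```

Each row is then a session, and sessions with no carts or orders show an explicit 0 instead of a missing row.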

In [None]:
# Exploration of first session in the dataframe

first_session = train_sessions.loc[0,'events']

# Number of actions
print(f'{len(first_session)} actions')

# Elapsed time of session
elapsed_time = first_session[-1]['ts'] - first_session[0]['ts']
print(str(timedelta(milliseconds=elapsed_time)))

# Frequency of actions by type in first session
first_action_counts = {}
for i in first_session:
    first_action_counts[i['type']] = first_action_counts.get(i['type'], 0) + 1
print(first_action_counts)
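
The same elapsed-time calculation can be applied to every session at once by grouping on `session`. This is a sketch on toy data, assuming `ts` is in milliseconds (as the `timedelta(milliseconds=...)` call above does). The frame name `toy_events` is hypothetical.

```python
import pandas as pd

# Toy event table; timestamps are made up and assumed to be in milliseconds
toy_events = pd.DataFrame({
    "session": [0, 0, 0, 1, 1],
    "ts": [1_000, 61_000, 121_000, 5_000, 5_500],
})

# First and last timestamp per session
ts_range = toy_events.groupby("session")["ts"].agg(["min", "max"])

# Elapsed time per session as a readable Timedelta
elapsed = pd.to_timedelta(ts_range["max"] - ts_range["min"], unit="ms")
print(elapsed)
```

Using `min` and `max` rather than the first and last rows avoids relying on the events being sorted by time.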

In [None]:
# Count how often each article ID appears across all loaded sessions
from collections import Counter

aid_counts = Counter(
    event['aid']
    for events in train_sessions['events']
    for event in events
)

# Show the five most frequent article IDs with their counts
print(aid_counts.most_common(5))

In [None]:
# Group by session and article ID to count how often each article appears per session
article_id_counts = (
    expanded_df
    .groupby(["session", "aid"])
    .size()
    .reset_index(name="count")
)

print(article_id_counts)
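
The objective "count of unique `aid` per session" follows directly from the same grouping. A minimal sketch on invented data, assuming an event table with `session` and `aid` columns (the frame name `toy_events` is hypothetical):

```python
import pandas as pd

# Toy event table; values are made up for illustration
toy_events = pd.DataFrame({
    "session": [0, 0, 0, 1, 1],
    "aid": [10, 11, 10, 12, 12],
})

# Number of distinct articles each session interacted with
unique_aids_per_session = toy_events.groupby("session")["aid"].nunique()
print(unique_aids_per_session)

# Summary statistics of that distribution across sessions
print(unique_aids_per_session.describe())
```

Repeated interactions with the same article count once here, which separates breadth of interest from overall activity.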

### Understanding Articles

In [None]:
# Counting the frequency of clicks per aid
clicks_per_aid = expanded_df[expanded_df['type'] == 'clicks']['aid'].value_counts()
print('Top 10 most clicked articles')
print(clicks_per_aid.head(10))
print()
print('10 least clicked articles')
print(clicks_per_aid.tail(10))

In [None]:
# Counting the frequency of carts per aid
carts_per_aid = expanded_df[expanded_df['type'] == 'carts']['aid'].value_counts()
print('Top 10 most carted articles')
print(carts_per_aid.head(10))

In [None]:
# Counting the frequency of orders per aid
orders_per_aid = expanded_df[expanded_df['type'] == 'orders']['aid'].value_counts()
print('Top 10 most ordered articles')
print(orders_per_aid.head(10))
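
The three per-type counts can also be viewed side by side in a single table, which may make it easier to spot articles that are clicked often but rarely carted or ordered. A sketch on invented data using `pd.crosstab` (the frame name `toy_events` is hypothetical):

```python
import pandas as pd

# Toy event table; values are made up for illustration
toy_events = pd.DataFrame({
    "aid": [10, 10, 11, 10, 11],
    "type": ["clicks", "clicks", "clicks", "carts", "orders"],
})

# Rows are articles, columns are event types, cells are event counts
per_aid = pd.crosstab(toy_events["aid"], toy_events["type"])

# Most clicked articles first
per_aid = per_aid.sort_values("clicks", ascending=False)
print(per_aid)
```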

In [None]:
# unique_session_aids = expanded_df.groupby('session')['aid'].unique()

# co_occurrence = pd.crosstab(expanded_df['session'], expanded_df['aid'])
# co_matrix = co_occurrence.T.dot(co_occurrence)
# np.fill_diagonal(co_matrix.values, 0)
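
A dense session-by-article crosstab grows with the number of distinct articles, so counting only the pairs that actually occur may be a lighter-weight alternative. The sketch below counts co-occurring pairs and consecutive transitions with `collections.Counter` on invented data. The dictionary `session_aids` and its contents are hypothetical, and this is not necessarily the approach the notebook will end up using.

```python
from collections import Counter
from itertools import combinations

# Toy data: ordered article IDs per session, made up for illustration
session_aids = {
    0: [10, 11, 10, 12],
    1: [11, 12],
    2: [10],
}

# Co-occurrence: each unordered pair of distinct articles counts once per session
pair_counts = Counter()
for aids in session_aids.values():
    pair_counts.update(combinations(sorted(set(aids)), 2))

# Transitions: ordered (current article, next article) pairs within a session
transition_counts = Counter()
for aids in session_aids.values():
    transition_counts.update(zip(aids, aids[1:]))

print(pair_counts.most_common())
print(transition_counts.most_common())
```

Dividing each transition count by the total number of transitions leaving the same article would give the row-normalised entries of a transition matrix.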