# EDA on Game Progress

## Problem Statement
While exploring the data, I found that samples can't be **uniquely identified** by the assumed composite key (`session_id`, `index`), which should indicate **a specific event in a specific session**. In this [forum thread](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384342#2134312), the competition host states that there might be some data errors, so I wondered whether this issue could be fixed.<br>
Furthermore, as the game progresses, `index` should be **monotonically increasing**, which preserves the **ordering** of events. Nonetheless, this property doesn't always hold.<br>
Finally, `checkpoint` is an important indicator of question prompts and should be followed by a **level-up**. Again, this assumption does not hold in some cases.

## About this Notebook
In this kernel, I explore the properties of the `index` column. Then, I implement a simple workaround for the issues above, which should make the **sequential characteristics** of the data easier to interpret. In addition, I discuss `checkpoint`-related problems, which could help us come up with better ways to clean the data.

<a id="toc"></a>
## Table of Contents
* [1. Duplicated (`session_id`, `index`) Pairs](#dup_pk)
* [2. Reversed Index](#reverse_idx)
* [3. Reversed Level](#reverse_lv)
* [4. Jumped Index](#jump_idx)
* [5. Checkpoint Exploration](#ckpt)
* [6. Abnormal Game Plays](#abnormal_game)
* [7. Conclusion](#conclusion)

## Acknowledgements
Special thanks to @cdeotte for the response [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/388751#2150691). For a detailed discussion of the time series API, please see [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/388479).

## Note
Because the competition host has announced the data update policy [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/396202), I re-ran this notebook so that the exploration results match the newest data. Competitors are also allowed to use supplemental data collected from [this site](https://fielddaylab.wisc.edu/opengamedata/), but I haven't tried it out. Hence, this notebook uses only the updated `train.csv`, without any other open source data.


## Import Packages

In [None]:
import gc
import os
import warnings
from collections import defaultdict
from typing import Any, Dict, Optional, Union
from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.axes import Axes

warnings.simplefilter("ignore")
sns.set_style("darkgrid")
pd.options.display.max_rows = None
pd.options.display.max_columns = None
colors = sns.color_palette("Set2")

In [None]:
INPUT_PATH = "/kaggle/input/predict-student-performance-from-game-play"

## Load Data
Let's load the data first! Given the RAM constraint, we can load the data with **downcast data types**. For instance, loading `train.csv` with the downcast dtypes reduces memory usage from **2.0+ GB** to about **603.4 MB**.
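
To make the effect of downcasting concrete, here is a minimal sketch on a small made-up frame (the column values and the name `toy` are assumptions for illustration, not rows from `train.csv`). It compares `memory_usage(deep=True)` before and after converting to the same kinds of dtypes used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the dtypes mirror the notebook's choices.
n = 10_000
toy = pd.DataFrame({
    "session_id": np.repeat([f"session_{i}" for i in range(10)], n // 10),
    "elapsed_time": np.arange(n, dtype=np.int64),
    "level": np.zeros(n, dtype=np.int64),
})

# Downcast: repeated strings -> category, wide ints -> narrower ints.
toy_small = toy.astype({
    "session_id": "category",
    "elapsed_time": np.int32,
    "level": np.uint8,
})

mem_before = toy.memory_usage(deep=True).sum()
mem_after = toy_small.memory_usage(deep=True).sum()
print(f"{mem_before / 1e6:.2f} MB -> {mem_after / 1e6:.2f} MB")
```

The exact numbers depend on the pandas version and the data, so only the direction of the change should be read from this sketch.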

**Note:** The following code snippet comes from [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384359). All the credits should go to @sakvaua. Furthermore, for the concern about memory reduction, please see [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384475).

In [None]:
dtypes = {
    "session_id": "category",
    "elapsed_time": np.int32,
    "event_name": "category",
    "level": np.uint8,
    "text": "category",
    "level_group": "category",
}

To further constrain memory usage, I only load the columns necessary for the analysis.

In [None]:
%%time

cols_to_use = ["session_id", "index", "elapsed_time", "event_name",
               "level", "text", "level_group"]
train = pd.read_csv(os.path.join(INPUT_PATH, "train.csv"), usecols=cols_to_use, dtype=dtypes)

In [None]:
# Some definitions for global access
N_SESS = train["session_id"].nunique()
LEVELS = range(23)
# plt.cm.get_cmap is deprecated in recent Matplotlib releases; plt.get_cmap takes the same arguments.
LV_COLORS = plt.get_cmap("hsv", len(LEVELS))

<a id="dup_pk"></a>
## 1. Duplicated (`session_id`, `index`) Pairs
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

At first, I assumed that each sample in `train.csv` could be uniquely identified by the composite key (`session_id`, `index`). However, duplicates exist.
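
As a small illustration of the check, the sketch below runs `DataFrame.duplicated` on a made-up frame (the sessions `"A"` and `"B"` are hypothetical). Passing `keep=False` marks every row that shares its key with another row, not only the later copies.

```python
import pandas as pd

# Toy data (made-up): session "A" repeats index 1.
toy = pd.DataFrame({
    "session_id": ["A", "A", "A", "B", "B"],
    "index":      [0, 1, 1, 0, 1],
})
pk = ["session_id", "index"]

dup_mask = toy.duplicated(subset=pk, keep=False)  # True for every copy of a repeated key
sess_dup = toy.loc[dup_mask, "session_id"].unique().tolist()
print(sess_dup)
```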

In [None]:
pk_tmp = ["session_id", "index"]
if train.duplicated(subset=pk_tmp).any():
    print("There are some duplicated (`session_id`, `index`) pairs.")

There are 142 sessions with duplicated (`session_id`, `index`) pairs. Let's see what's going on.

In [None]:
sess_dup = train.loc[train.duplicated(subset=pk_tmp, keep=False)]["session_id"].unique().tolist()
train_dup = train.loc[train["session_id"].isin(sess_dup)].reset_index(drop=True)
print(f"#Sessions with duplicated (`session_id`, `index`) pairs: {len(sess_dup)}")

As can be seen, the `index` series in sessions with duplicated (`session_id`, `index`) pairs show a **lightning shape**. Also, some sessions have no `index == 0` (*e.g.,* `session_id == "20110508425704692"`).

To see the plot, please unfold the cell.

In [None]:
n_rows, n_cols = 29, 5
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(20, 80))
for i, (sess_id, gp) in enumerate(train_dup.groupby("session_id", observed=True)):
    gp = gp.reset_index(drop=True)
    axes[i // n_cols, i % n_cols].plot(gp["index"])
    axes[i // n_cols, i % n_cols].set_title(f"Event Index of Session {sess_id}\n Min Index {gp['index'].min()}")
    axes[i // n_cols, i % n_cols].set_xlabel(f"Event Ordering in DataFrame")
    axes[i // n_cols, i % n_cols].set_ylabel(f"Event Index")
plt.tight_layout()

<a id="reverse_idx"></a>
## 2. Reversed Index
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

After a look at the sessions with duplicated (`session_id`, `index`) pairs, let's explore the **reversed index** phenomenon across all of `train.csv`.

There are 258 sessions with this phenomenon. Assuming all 142 sessions with duplicated pairs are among them, this leaves $258 - 142 = 116$ sessions with a **reversed index** but without duplicated (`session_id`, `index`) pairs.
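
The subtraction above presumes that every session with duplicated pairs also has a reversed index. The sketch below shows, on made-up sessions, how the two sets relate and how that presumption could be checked with a set comparison. Note that `is_monotonic_increasing` is non-strict, so a repeated value alone does not count as a reversal.

```python
import pandas as pd

# Toy data (made-up): A is clean, B restarts and repeats indices, C drops without repeating.
toy = pd.DataFrame({
    "session_id": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "index": [0, 1, 2, 3,   0, 1, 0, 1,   2, 3, 0, 1],
})

reversed_sess = {
    sess_id
    for sess_id, gp in toy.groupby("session_id")
    if not gp["index"].is_monotonic_increasing
}
dup_mask = toy.duplicated(subset=["session_id", "index"], keep=False)
dup_sess = set(toy.loc[dup_mask, "session_id"])

print(sorted(reversed_sess))             # sessions whose index drops somewhere
print(sorted(reversed_sess - dup_sess))  # reversed, but every key is still unique
print(dup_sess <= reversed_sess)         # the subtraction is valid only if this is True
```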

In [None]:
sess_with_reversed_index = []
for sess_id, gp in train.groupby("session_id", observed=True):
    if not gp["index"].is_monotonic_increasing:
        sess_with_reversed_index.append(sess_id)
        
print(f"There are {len(sess_with_reversed_index)} sessions with \"reversed index\" phenomenon.")
print(f"-> {len(sess_dup)} sessions with duplicated (`session_id`, `index`) pairs.")
print(f"-> {len(sess_with_reversed_index) - len(sess_dup)} sessions without duplicated (`session_id`, `index`) pairs.")

To check whether the samples are sorted in line with the progress of the game, I simply verify that every `text` entry before

> Whatcha doing over there, Jo?

is `undefined`. It holds!

Next, let's try to fix this **reversed index** phenomenon.

In [None]:
FIRST_DIALOG_PER_GRAMP = "Whatcha doing over there, Jo?"

for sess_id in sess_with_reversed_index:
    df = train[train["session_id"] == sess_id].reset_index(drop=True)
    
    text_prompts = defaultdict(int)
    for text in df["text"]:
        if text == FIRST_DIALOG_PER_GRAMP:
            assert len(text_prompts) == 1 and list(text_prompts.keys())[0] == "undefined" 
            break
        text_prompts[text] += 1

In the following plot, the original `index` series is the **blue** line and `index_fixed` is the **red** one.

After fixing, the `index_fixed` column, which indicates the **ordering of events**, is monotonically increasing in all sessions. However, some issues remain:
1. The `index` column has some **jumps**, meaning that the ordering of events doesn't always increase by 1.
2. In addition to the **reversed index** phenomenon, there is also a **reversed level** phenomenon.
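
The offset-based fix used in this section can also be written without an explicit loop. The following is a minimal sketch on a made-up `index` series, not the notebook's exact code: it treats the first row's step as 0 instead of back-filling it, so results may differ from the loop when the very first step is a reversal.

```python
import pandas as pd

idx = pd.Series([0, 1, 2, 1, 2, 3])  # toy index that restarts at position 3

step = idx.diff().fillna(0)   # first row has no predecessor; treat its step as 0
drop = (-step).clip(lower=0)  # size of each backward step, 0 elsewhere
index_fixed = (idx - idx.iloc[0] + drop.cumsum()).astype(int)

print(index_fixed.tolist())
print(index_fixed.is_monotonic_increasing, index_fixed.is_unique)
```

On this toy input, the value at the restart equals the value just before it, so the fixed series is non-decreasing rather than strictly increasing. That is enough for `is_monotonic_increasing`, but it does not by itself make (`session_id`, `index`) unique.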

To see the plot, please unfold the cell.

In [None]:
train_with_reversed_index = train[train["session_id"].isin(sess_with_reversed_index)].reset_index(drop=True)

n_rows, n_cols = 52, 5
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(20, 130))
sess_with_reversed_lv = []
for i, (sess_id, gp) in enumerate(train_with_reversed_index.groupby("session_id", observed=True)):
    df = gp.reset_index(drop=True)
    df["index_diff"] = (df["index"] - df["index"].shift(1)).bfill()
    reverse_pts = df[df["index_diff"] < 0]
    
    df["index_fixed"] = df["index"] - df["index"][0]
    for idx, row in reverse_pts.iterrows():
        index_diff_abs = abs(row["index_diff"])
        df.loc[df.index >= idx, "index_fixed"] = df.loc[df.index >= idx, "index_fixed"] + index_diff_abs
    
    axes[i // n_cols, i % n_cols].plot(df["index"], "b")
    axes[i // n_cols, i % n_cols].plot(df["index_fixed"], "r")
    axes[i // n_cols, i % n_cols].set_title(f"Event Index of Session {sess_id}\n"
                                            f"Min Index {df['index'].min()} - "
                                            f"Min Index Fixed {int(df['index_fixed'].min())}")
    axes[i // n_cols, i % n_cols].set_xlabel(f"Event Ordering in DataFrame")
    axes[i // n_cols, i % n_cols].set_ylabel(f"Event Index")
    
    if not df["level"].is_monotonic_increasing:
        sess_with_reversed_lv.append(sess_id)
plt.tight_layout()

In [None]:
print(f"There are {len(sess_with_reversed_lv)} sessions with \"reversed index\" having \"reversed level\" phenomenon.\n"
      f"-> Sessions: {sess_with_reversed_lv}")

<a id="reverse_lv"></a>
## 3. Reversed Level
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

In this section, let's dive into the **reversed level** phenomenon, which indicates that `level` can go **from high to low** within the same session.

Among the sessions with **reversed index** (258 in total), six have a level that **goes from 22 (highest) back to 0 (lowest)** while `elapsed_time` keeps accumulating. We assume this happens because the student plays the game **twice in a row**. To check this assumption, we look at [this file](https://github.com/fielddaylab/jo_wilder/blob/5082da4057f30dd0917c97a65f2aa7be13469f79/src/utils/simplelog.js). The code snippet below appears to be how `session_id` is generated.

```javascript
self.session_id = UUIDint();
self.persistent_session_id = getCookie("persistent_session_id");
if(!self.persistent_session_id)
{
    self.persistent_session_id = self.session_id;
    setCookie("persistent_session_id",self.persistent_session_id,100);
}
```

With this evidence, we think that students can play the game **multiple times in a row** with the same `session_id` if `persistent_session_id` already exists.
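
To illustrate how a replay shows up in the `level` column, here is a minimal sketch on a made-up level sequence. The name `play_id` is a hypothetical label introduced only for this example. It counts level drops, which is a rough proxy: a drop does not necessarily mean a full restart from level 0.

```python
import pandas as pd

# Toy level sequence (made-up): one pass up to level 22, then the level falls back to 0.
level = pd.Series([0, 1, 2, 22, 0, 1, 2])

level_diff = level.diff()
reverse_pts = level_diff[level_diff < 0].index.tolist()  # positions where the level drops
play_id = (level_diff < 0).cumsum()                      # hypothetical label: increases at every drop

print(reverse_pts)
print(play_id.tolist())
```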

In [None]:
train_with_reversed_lv = train_with_reversed_index[train_with_reversed_index["session_id"].isin(sess_with_reversed_lv)].reset_index(drop=True)

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20, 6))
for i, (sess_id, gp) in enumerate(train_with_reversed_lv.groupby("session_id", observed=True)):
    df = gp.reset_index(drop=True)
    df["level_diff"] = (df["level"] - df["level"].shift(1)).bfill()
    reverse_pt = df[df["level_diff"] < 0].index.values[0]
    
    ax = axes[i // 3, i % 3]
    ax.plot(df["index"])
    ax.set_title(sess_id)
    ax.set_xlabel("Event Ordering in DataFrame")
    ax.set_ylabel("Event Index")
    
    ax2 = ax.twinx()
    for j, lv in enumerate(LEVELS):  # `j` avoids shadowing the outer subplot counter `i`
        lv_seq = df[df["level"] == lv]["level"]
        ax2.scatter(lv_seq.index.values, lv_seq.values, 0.8, marker="_", c=LV_COLORS(j), linewidths=2)
    ax2.axvline(x=int(reverse_pt), color='r', label="Reversed Level Point")
    ax2.set_ylabel("Level")
plt.legend()
plt.tight_layout()

Among all of the game play sessions, there exist 486 sessions (about 2.06%) with **reversed level** phenomenon.

In [None]:
sess_with_reversed_lv = []
for sess_id, gp in train.groupby("session_id"):
    if not gp["level"].is_monotonic_increasing:
        sess_with_reversed_lv.append(sess_id)

print(f"There are {len(sess_with_reversed_lv)} sessions (about {len(sess_with_reversed_lv) / N_SESS * 100:.2f}%) with \"reversed level\" phenomenon.")

<a id="jump_idx"></a>
## 4. Jumped Index
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

As pointed out in [section 2](#reverse_idx), `index` doesn't always increase by 1: there are some jump points within the `index` sequence. Before exploring this phenomenon, let's first fix the **reversed index** in the `train` DataFrame, making sure that the `index` sequences of all sessions are **monotonically increasing**. As an alternative to the fix below, **directly re-indexing** each session can be applied.
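
For the re-indexing alternative, one possible sketch (on made-up data, with the hypothetical column name `index_reindexed`) uses `groupby(...).cumcount()`, which numbers the rows of each session in the order they appear in the DataFrame.

```python
import pandas as pd

# Toy data (made-up): session "A" restarts its index, session "B" has a jump.
toy = pd.DataFrame({
    "session_id": ["A", "A", "A", "A", "B", "B", "B"],
    "index":      [0, 1, 0, 1, 5, 6, 9],
})

# Position of each row within its session, following the row order of the frame.
toy["index_reindexed"] = toy.groupby("session_id").cumcount()
print(toy["index_reindexed"].tolist())
```

Unlike the offset-based fix, this also removes the jumps, so it should only be used if the jump information is not needed later.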

In [None]:
dump_fixed_train = False

In [None]:
for sess_id in tqdm(sess_with_reversed_index):
    df = train[train["session_id"] == sess_id].copy()
    target_index = df.index
    df.reset_index(drop=True, inplace=True)
    df["index_diff"] = (df["index"] - df["index"].shift(1)).bfill()
    reverse_pts = df[df["index_diff"] < 0]
    
    df["index_fixed"] = df["index"] - df["index"][0]
    for idx, row in reverse_pts.iterrows():
        index_diff_abs = abs(row["index_diff"])
        df.loc[df.index >= idx, "index_fixed"] = df.loc[df.index >= idx, "index_fixed"] + index_diff_abs
    
    # Fix reversed index in original DataFrame
    train.loc[target_index, "index"] = df["index_fixed"].values

assert train.groupby("session_id")["index"].is_monotonic_increasing.all()
if dump_fixed_train:
    train.to_csv("./train_fixed.csv", index=False)

Nearly all of the **jumped index** cases occur right after a `checkpoint`. Since `index` indicates the ordering of events and a **jumped index** still preserves that ordering, I think it's okay to ignore the jumps. The reason behind them remains unknown, though.
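
A minimal sketch of the jump detection on made-up rows is given below. It uses `groupby(...).diff()` rather than the back-filled difference in this notebook, so the first row of each session has a missing difference and is never flagged. The event names are only placeholders for the toy example.

```python
import pandas as pd

# Toy data (made-up): in both sessions the index jumps right after a checkpoint.
toy = pd.DataFrame({
    "session_id": ["A", "A", "A", "A", "B", "B", "B"],
    "event_name": ["navigate_click", "checkpoint", "navigate_click", "navigate_click",
                   "checkpoint", "navigate_click", "navigate_click"],
    "index":      [0, 1, 5, 6, 0, 4, 5],
})

index_diff = toy.groupby("session_id")["index"].diff()  # NaN on each session's first row
jumped = index_diff > 1                                 # NaN compares as False
event_before_jump = toy["event_name"].shift(1)[jumped]  # event on the row just before each jump

print(jumped.tolist())
print(event_before_jump.tolist())
```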

In [None]:
train["index_diff"] = train.groupby("session_id").apply(lambda x: (x["index"] - x["index"].shift(1)).bfill()).values
train["jumped_index"] = train["index_diff"] > 1
print(f"Number of jumped points: {train['jumped_index'].sum()}")

In [None]:
train_before_jumps = train.iloc[train[train["jumped_index"]].index - 1]
event_cnt_before_jumps = train_before_jumps["event_name"].value_counts()
pct = event_cnt_before_jumps / event_cnt_before_jumps.sum() * 100
labels = [f"{sec} {ratio:.2f}%" for sec, ratio in zip(event_cnt_before_jumps.index, pct)]

fig, ax = plt.subplots(figsize=(10, 5))
patches, texts = ax.pie(event_cnt_before_jumps.values, 
                        colors=sns.color_palette("pastel"), 
                        shadow=True, 
                        startangle=90)
patches, labels, dummy = zip(*sorted(zip(patches, labels, event_cnt_before_jumps.values),
                                     key=lambda x: x[2],
                                     reverse=True))
ax.legend(patches, labels, bbox_to_anchor=(-0.1, 1.), fontsize=8)
ax.set_title("Ratio of Events before Jumps")
plt.show()

After each `checkpoint`, it takes the student a while (*i.e.,* a larger gap in `elapsed_time`) to answer the questions of the corresponding `level_group`. Checkpoints are marked in green below, using the first session as an illustration.

In [None]:
train_demo = train[train["session_id"] == train["session_id"][0]].reset_index()
ckpts = train_demo[train_demo["event_name"] == "checkpoint"].index

fig, ax = plt.subplots(figsize=(10, 3))
# Index
l1 = ax.plot(train_demo["index"], "b", label="Index")
ax.set_title(f"Index versus Elapsed Time Demo")
ax.set_xlabel(f"Event Order")
ax.set_ylabel(f"Index")
# Elapsed time
ax2 = ax.twinx()
l2 = ax2.plot(train_demo["elapsed_time"], "r", label="Elapsed Time")
ax2.set_ylabel(f"Elapsed Time")
# Checkpoints
for ckpt in ckpts:
    ax.axvline(ckpt, linestyle="--", linewidth=.5, color="g")
ax.legend(l1+l2, [l.get_label() for l in l1+l2])

ax2.set_zorder(-1)
ax.patch.set_visible(False)
ax2.patch.set_visible(True)
plt.show()

<a id="ckpt"></a>
## 5. Checkpoint Exploration
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

As the game progresses, I think each session should have **three** `checkpoint`s in total, each occurring right before the questions are prompted. However, some sessions have fewer or more than three `checkpoint`s. One related issue is discussed [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/388479#2149959).
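
To show what the per-session count looks like, here is a minimal sketch on made-up rows (sessions `"A"` and `"B"` are hypothetical). It counts `checkpoint` events per session and `level_group`, then sums them.

```python
import pandas as pd

# Toy data (made-up): "A" has one checkpoint per group; "B" has none in 0-4 and two in 13-22.
toy = pd.DataFrame({
    "session_id":  ["A"] * 3 + ["B"] * 4,
    "level_group": ["0-4", "5-12", "13-22", "0-4", "5-12", "13-22", "13-22"],
    "event_name":  ["checkpoint", "checkpoint", "checkpoint",
                    "navigate_click", "checkpoint", "checkpoint", "checkpoint"],
})

is_ckpt = toy["event_name"].eq("checkpoint")
n_ckpts = (
    is_ckpt.groupby([toy["session_id"], toy["level_group"]]).sum()
    .unstack(fill_value=0)[["0-4", "5-12", "13-22"]]
)
n_ckpts["total"] = n_ckpts.sum(axis=1)
print(n_ckpts)
```

In this toy case both sessions total three checkpoints, yet only `"A"` has one per `level_group`, which is why a per-group check is more informative than the total alone.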

In [None]:
n_ckpts_per_sess = train.groupby(["session_id", "level_group"]).apply(lambda x: (x["event_name"] == "checkpoint").sum()).reset_index()
n_ckpts_per_sess = n_ckpts_per_sess.pivot(index="session_id", columns="level_group", values=0)

n_ckpts_per_sess["total"] = n_ckpts_per_sess["0-4"] + n_ckpts_per_sess["5-12"] + n_ckpts_per_sess["13-22"]
n_ckpts_per_sess_val_cnt = n_ckpts_per_sess["total"].value_counts()

fig, ax = plt.subplots(figsize=(7, 4))
sns.barplot(x=n_ckpts_per_sess_val_cnt.index, y=n_ckpts_per_sess_val_cnt.values, 
            palette=colors, ax=ax)
for container in ax.containers:
    ax.bar_label(container)
ax.set_title("Number of Checkpoints Per Session")
ax.set_xlabel("Number of Checkpoints")
ax.set_ylabel("Session Count")
ax.set_ylim([0, 100])
ax.text(0.98, 95, f"{n_ckpts_per_sess_val_cnt[n_ckpts_per_sess_val_cnt.index == 3].values[0]}\n≈", horizontalalignment="center")
plt.tight_layout()

<a id="sess_with_two_ckpts"></a>
For the only session `22090108192456930` which doesn't have a `checkpoint` when `level_group == "0-4"`, the game still progresses smoothly toward **level 5**.

In [None]:
sess_without_ckpt = n_ckpts_per_sess[(n_ckpts_per_sess == 0).any(axis=1)]
print(f"The only session missing one `checkpoint`: {sess_without_ckpt.index[0]}")
train[train["session_id"] == "22090108192456930"].reset_index(drop=True).iloc[281:283, :6]

After a `checkpoint`, a **level-up** should occur. However, this assumption doesn't always hold. There are three cases:
* `level_group == "0-4"`: There are extra events occurring after `checkpoint` is triggered in level 4.
* `level_group == "5-12"`: There are extra events occurring after `checkpoint` is triggered in level 12.
* `level_group == "13-22"`: There are extra events occurring after `checkpoint` is triggered in level 22, which should be considered **the end of the game**.
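
The three cases above share one pattern: the event right after a `checkpoint` is still on the same level. A minimal sketch of that test on made-up rows follows (the event names are placeholders for the toy example).

```python
import pandas as pd

# Toy data (made-up): in "A" the level goes up right after the checkpoint, in "B" it does not.
toy = pd.DataFrame({
    "session_id": ["A"] * 4 + ["B"] * 4,
    "event_name": ["navigate_click", "checkpoint", "navigate_click", "navigate_click",
                   "navigate_click", "checkpoint", "person_click", "navigate_click"],
    "level":      [4, 4, 5, 5,   4, 4, 4, 5],
})

next_level = toy.groupby("session_id")["level"].shift(-1)  # NaN on each session's last row
stays = next_level.eq(toy["level"])                        # next event is still on the same level
ckpt_without_lvup = toy[toy["event_name"].eq("checkpoint") & stays]
print(ckpt_without_lvup)
```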

In [None]:
train["lv_up_after_ckpt"] = train.groupby("session_id").apply(lambda x: (x["level"].shift(-1) - x["level"]).fillna(-1)).values
not_lvup_mask = (train["event_name"] == "checkpoint") & (train["level"].isin([4, 12, 22])) & (train["lv_up_after_ckpt"] == 0)
train_ckpt_without_lvup = train[not_lvup_mask]
ckpt_without_lvup_per_lv =  train_ckpt_without_lvup["level"].value_counts()

fig, ax = plt.subplots(figsize=(7, 4))
sns.barplot(x=ckpt_without_lvup_per_lv.index, y=ckpt_without_lvup_per_lv.values, 
            palette=colors, ax=ax)
for container in ax.containers:
    ax.bar_label(container)
ax.set_title("Number of Checkpoints Without Level-Up Followed")
ax.set_xlabel("Level")
ax.set_ylabel("Checkpoint Count")
plt.tight_layout()

Now, let's take session `20100012562027690` as an example. The student seems to double-check the clues after level goes up to 12, where the text prompts look different from those at level 11 (as shown below).

[![dialog.png](https://i.postimg.cc/XJdpgJ9w/dialog.png)](https://postimg.cc/tZRqq9ST)

At the beginning of level 12, *Jo* should go back to the capitol and talk to *Mrs.M*. After clicking on *Mrs.M*, the dialogue begins with
> Ooh, nice decorations!

However, what confuses me is that this event occurs **after** the `checkpoint`. So far, I haven't figured out what's going on in these cases.

Maybe, we can drop extra events **after the `checkpoint` and before level up**.

In [None]:
train_demo = train[train["session_id"] == "20100012562027690"]
train_demo[train_demo["level"] == 12].tail()

<a id="abnormal_game"></a>
## 6. Abnormal Game Plays
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

In addition to the level-up problem mentioned above, let's combine `checkpoint` behaviour with issues like **reversed level** to gain deeper insights.

First of all, there are 168 sessions with fewer or more than 3 checkpoints. As discussed [here](#sess_with_two_ckpts), session `22090108192456930` has only two checkpoints. Hence, there are 167 sessions with more than 3 checkpoints. Also, we can confirm that sessions with 3 checkpoints have **exactly 1 checkpoint for each `level_group`**, which we treat as **valid** for now.

In [None]:
ckpt_valid_mask = (n_ckpts_per_sess["0-4"] == 1) & (n_ckpts_per_sess["5-12"] == 1) & (n_ckpts_per_sess["13-22"] == 1)
sess_with_invalid_ckpt = n_ckpts_per_sess.loc[~ckpt_valid_mask].index.astype(int).tolist()
print(f"There are {len(sess_with_invalid_ckpt)} sessions with less than or more than 3 checkpoints.")

Then, let's see how the `level` sequences behave in these so-called **invalid** sessions. **Checkpoints** are marked with green dashed lines.

In [None]:
def plot_level_with_ckpt(sess_id: int, ax: Optional[Axes] = None) -> None:
    """Plot level sequence of the specified session.
    
    Parameters:
        sess_id: session identifier
        ax: axes to draw on; a new figure is created and shown if None
    
    Return:
        None
    """
    plot = False
    sess = train[train["session_id"] == sess_id].reset_index(drop=True)
    ckpts = sess[sess["event_name"] == "checkpoint"].index
        
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 2))
        plot = True
    for i, lv in enumerate(LEVELS):
        lv_seq = sess[sess["level"] == lv]["level"]
        ax.scatter(lv_seq.index.values, lv_seq.values, 0.8, marker="_", c=LV_COLORS(i), linewidths=2)
    for ckpt in ckpts:
        ax.axvline(ckpt, linestyle="--", linewidth=.7, color="g")
    ax.set_title(f"Level Seq of {sess_id}", fontsize=12)

    if plot:
        plt.show()

Session `22090108192456930` has no checkpoint for `level_group == "0-4"`. However, the level sequence looks reasonable.

In [None]:
train["session_id"] = train["session_id"].astype(int)
plot_level_with_ckpt(22090108192456930)

For the remaining 167 sessions with more than 3 checkpoints, it's clear that multiple game plays exist in a single session. Nonetheless, the second game play in a session doesn't always start from `level == 0` (*e.g.,* session `21100409270509124`). This also shows that **reversed level** doesn't always imply **reversed level group**.

To see the plot, please unfold the cell.

In [None]:
sess_with_invalid_ckpt.remove(22090108192456930)

fig, axes = plt.subplots(nrows=34, ncols=5, figsize=(20, 80))
for i, sess_id in enumerate(sess_with_invalid_ckpt):
    plot_level_with_ckpt(sess_id, ax=axes[i // 5, i % 5])
plt.tight_layout()

As for the potential data leakage reported by [@cdeotte](https://www.kaggle.com/cdeotte) [here](https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/388479#2149959), 442 sessions have the **reversed level group** phenomenon, compared with 486 sessions with the **reversed level** phenomenon. If **reversed level group** occurs, **reversed level** must occur too, but not vice versa.
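
The one-way implication follows because the level group is a non-decreasing function of the level: a drop in the group order requires a drop in the level, while a level can drop and still stay in the same group. The sketch below illustrates this on two made-up level sequences; `level_to_group` is a hypothetical helper written for this example.

```python
import pandas as pd

def level_to_group(lv: int) -> int:
    """Map a level to its level-group order: 0 for 0-4, 1 for 5-12, 2 for 13-22."""
    return 0 if lv <= 4 else 1 if lv <= 12 else 2

# Toy level sequences (made-up).
within_group = pd.Series([20, 21, 22, 18, 19])  # drops back, but stays in 13-22
across_group = pd.Series([20, 21, 22, 3, 4])    # drops back into 0-4

results = {}
for name, level in [("within_group", within_group), ("across_group", across_group)]:
    group = level.map(level_to_group)
    # (level is monotonic, level group is monotonic)
    results[name] = (level.is_monotonic_increasing, group.is_monotonic_increasing)
    print(name, results[name])
```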

In [None]:
lvgp_order = {"0-4": 0, "5-12": 1, "13-22": 2}
train["lvgp_order"] = train["level_group"].map(lvgp_order).astype(int)

sess_with_reversed_lv = []
sess_with_reversed_lvgp = []
for sess_id, gp in tqdm(train.groupby("session_id")):
    if not gp["level"].is_monotonic_increasing:
        sess_with_reversed_lv.append(sess_id)
    if not gp["lvgp_order"].is_monotonic_increasing:
        sess_with_reversed_lvgp.append(sess_id)
assert set(sess_with_reversed_lvgp).issubset(set(sess_with_reversed_lv))

print(f"There are {len(sess_with_reversed_lv)} sessions with \"reversed level\" phenomenon,\n"
      f"and {len(sess_with_reversed_lvgp)} with \"reversed level group\" phenomenon.")

All of the 44 ($486 - 442$) sessions with only **reversed level** phenomenon have exactly 3 checkpoints. The level sequences show that after the last checkpoint (*i.e.,* the third one), `level` will fall back to some specific point within `level_group == "13-22"`.

In [None]:
sess_with_reversed_lv_only = set(sess_with_reversed_lv).difference(set(sess_with_reversed_lvgp))

fig, axes = plt.subplots(nrows=9, ncols=5, figsize=(20, 25))
for i, sess_id in enumerate(sess_with_reversed_lv_only):
    plot_level_with_ckpt(sess_id, ax=axes[i // 5, i % 5])
plt.tight_layout()

<a id="conclusion"></a>
## 7. Conclusion
[**<span style="color:#FEF1FE; background-color:#535d70;border-radius: 5px; padding: 2px">Go to Table of Content</span>**](#toc)

By exploring the assumed primary key (`session_id`, `index`) and the properties of the `index` column, we found some issues that can hinder interpretation of the **sequential characteristics** of the data. With a simple fix, the **ordering** of events can be better represented.<br>
Also, the `checkpoint`-related issues could lower data quality. What's worse, problems like **reversed level** and **reversed level group** might lead to data leakage, which can boost the LB score to some extent. The quick exploration here might help us come up with better ways to clean the data at hand. Thanks!