# EdNet cleaning and preprocessing
Define similar interfaces/signatures for pre-processing both EdNet and MOOCCubeX
- Remove all users with more than 50 repetitions of the same video (as with MOOCCubeX)
    1. Select only the video lecture records and store them
    2. Select only `enter` events. Might have to verify that the time between enter events is larger than 10 minutes
- Aggregate events into interaction sessions, where the gap between consecutive events is less than 10 minutes (new, but based on the behaviour papers)
- Aggregation then becomes easier: users with more than THRESH interaction sessions for a video are removed
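
The session aggregation and the threshold filter described above could be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual implementation: the helper names `add_interaction_sessions` and `drop_heavy_repeaters` are hypothetical, the column names are assumed to match `USER_COL`, `ITEM_COL`, `TIME_COL` and `SESSION_COL`, timestamps are assumed to be in milliseconds, and `THRESH = 3` is a toy value chosen only so the small example triggers the filter.

```python
import pandas as pd

GAP = pd.Timedelta(minutes=10)  # session gap taken from the notes above
THRESH = 3  # hypothetical toy value; the notes mention 50 for the real data


def add_interaction_sessions(events, gap=GAP):
    """Number the sessions of each (user, item) pair; a new session starts
    whenever the gap to the previous event is at least `gap`."""
    events = events.sort_values(["user_id", "item_id", "timestamp"]).copy()
    keys = [events["user_id"], events["item_id"]]
    times = pd.to_datetime(events["timestamp"], unit="ms")
    gaps = times.groupby(keys).diff()  # NaT for the first event of each pair
    new_session = (gaps.isna() | (gaps >= gap)).astype(int)
    events["session_id"] = new_session.groupby(keys).cumsum()
    return events


def drop_heavy_repeaters(events, thresh=THRESH):
    """Remove every user with more than `thresh` sessions on any single item."""
    sessions_per_video = events.groupby(["user_id", "item_id"])["session_id"].nunique()
    heavy = sessions_per_video[sessions_per_video > thresh]
    heavy_users = heavy.index.get_level_values("user_id").unique()
    return events[~events["user_id"].isin(heavy_users)]


# Toy data: user 1 has 2 sessions on l1, user 2 has 4 sessions on l1
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 2],
    "item_id": ["l1"] * 7,
    "timestamp": [0, 300_000, 1_200_000, 0, 660_000, 1_320_000, 1_980_000],
})
sessions = add_interaction_sessions(toy)
kept = drop_heavy_repeaters(sessions)
```

Numbering sessions per (user, item) pair keeps the later count a plain `nunique`, which is the "easier aggregation" the last bullet refers to.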

### Statistics
- ~462K user-lecture interactions ("enter" events) in EdNet; for 99% of user-video pairs the view count is <= 4
    - No user has entered the same video twice within the interaction threshold (10 minutes) -> can use only enter events as the blacklist
- 99.31% of the gaps between a user's actions within a consecutive item interaction are less than 10 minutes
- The maximum number of "enter" events for one video by one user is 40; that user had ~2k interactions overall across all platforms
    - Not too unnatural behaviour
- #OLD 3334 records are related to sessions without an explicit start and end event, so they are removed
- #OLD 622,316 base sessions are found, where 15,729 base sessions have one (raw, not adjusted for watch time) gap larger than 10 minutes
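
Statistics of this kind could be computed along the following lines. This is only a sketch on a made-up frame of "enter" events, so the numbers it produces are not the EdNet figures quoted above; the variable names are illustrative and timestamps are assumed to be in milliseconds.

```python
import pandas as pd

# Toy stand-in for the "enter" events of lecture items (not real EdNet data)
enters = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "item_id": ["l1", "l1", "l2", "l1", "l1", "l3"],
    "timestamp": [0, 900_000, 1_000_000, 0, 3_600_000, 50_000],
})

# Number of "enter" events per (user, video) pair
view_counts = enters.groupby(["user_id", "item_id"]).size()
share_le_4 = (view_counts <= 4).mean()  # fraction of pairs with at most 4 views
max_views = view_counts.max()

# Gap between consecutive "enter" events of the same user on the same video
ordered = enters.sort_values(["user_id", "item_id", "timestamp"])
enter_gaps_ms = ordered.groupby(["user_id", "item_id"])["timestamp"].diff().dropna()
min_gap_minutes = enter_gaps_ms.min() / 60_000
```

If `min_gap_minutes` is at least 10, no user re-entered a video within the interaction threshold, which is the condition under which the enter events alone suffice for the blacklist.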

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import dask.dataframe as dd

In [2]:
ITEM_COL = "item_id"
USER_COL = "user_id"
TIME_COL = "timestamp"
SESSION_COL = "session_id"
CONSECUTIVE_ID = "item_consecutive_id"

In [15]:
def get_only_lecture_events(events_ddf, event_col=ITEM_COL):
    """Return only the rows whose item id starts with 'l' (EdNet lecture items)."""
    #logging.info("Fetching only lecture events")
    return events_ddf[events_ddf[event_col].str.startswith("l")]

In [5]:
ednet_path = Path("../EdNet")

In [11]:
%%time
ednet_raw = pd.read_feather(ednet_path / "KT4_merged.feather")
ednet_raw

CPU times: user 14.3 s, sys: 6.31 s, total: 20.6 s
Wall time: 21.1 s


Unnamed: 0,timestamp,action_type,item_id,cursor_time,source,user_answer,platform,user_id
0,1565096151269,enter,b3544,,diagnosis,,mobile,1
1,1565096187972,respond,q5012,,diagnosis,b,mobile,1
2,1565096194904,submit,b3544,,diagnosis,,mobile,1
3,1565096195001,enter,b3238,,diagnosis,,mobile,1
4,1565096218682,respond,q4706,,diagnosis,c,mobile,1
...,...,...,...,...,...,...,...,...
131441533,1574241377745,erase_choice,q7454,,sprint,b,mobile,837094
131441534,1574241382243,respond,q7454,,sprint,d,mobile,837094
131441535,1574241397373,submit,b5352,,sprint,,mobile,837094
131441536,1574241397417,enter,e5352,,sprint,,mobile,837094


In [12]:
ednet_raw.shape, ednet_raw["user_id"].nunique(), ednet_raw["item_id"].nunique()

((131441538, 8), 297915, 29642)

In [16]:
ednet_lectures = get_only_lecture_events(ednet_raw)
ednet_lectures

Unnamed: 0,timestamp,action_type,item_id,cursor_time,source,user_answer,platform,user_id
21,1565096637922,enter,l504,,archive,,mobile,1
22,1565096645773,play_video,l504,0.0,archive,,mobile,1
23,1565096651182,pause_video,l504,4805.0,archive,,mobile,1
24,1565096652123,play_video,l504,4992.0,archive,,mobile,1
25,1565097005408,pause_video,l504,358098.0,archive,,mobile,1
...,...,...,...,...,...,...,...,...
131438712,1574760515360,quit,l546,,adaptive_offer,,mobile,832396
131440043,1574846959021,enter,l357,,archive,,mobile,832452
131440044,1574846966153,play_video,l357,0.0,archive,,mobile,832452
131440045,1574847100157,pause_video,l357,134038.0,archive,,mobile,832452


In [17]:
ednet_lectures.shape, ednet_lectures[USER_COL].nunique(), ednet_lectures["item_id"].nunique()

((5029324, 8), 42828, 971)

In [18]:
lectures_deduped = ednet_lectures.drop_duplicates()

In [19]:
lectures_deduped.shape, lectures_deduped[USER_COL].nunique(), lectures_deduped["item_id"].nunique()

((5009098, 8), 42828, 971)

In [20]:
%%time
lectured_user_index = lectures_deduped.set_index(USER_COL)

CPU times: user 88.3 ms, sys: 98.5 ms, total: 187 ms
Wall time: 188 ms


In [None]:
lectures_ddf = dd.from_pandas(lectured_user_index, npartitions=10)

In [None]:
lectures_ddf.divisions

In [None]:
lectures_ddf.compute().shape

In [56]:
lectures_ddf.to_parquet(ednet_path / "KT4_lectures")

In [None]:
lectures_ddf.memory_usage_per_partition().compute()

#### Verify partitions

In [3]:
import itertools
import piso

In [7]:
# Verify that no user ends up in more than one partition
partitions_path = ednet_path / "KT4_lectures"
# Count the written part files instead of hard-coding the number of partitions
n_partitions = len(list(partitions_path.glob("part.*.parquet")))
part2user_id = {
    i: pd.read_parquet(partitions_path / f"part.{i}.parquet", columns=[USER_COL]).index.unique().values
    for i in range(n_partitions)
}
for i, j in itertools.combinations(part2user_id.keys(), 2):
    shared_users = set(part2user_id[i]).intersection(set(part2user_id[j]))
    if shared_users:
        print(f"SHARED USERS BETWEEN partition {i} and {j},\t{len(shared_users)} users")
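
As an alternative to the pairwise comparison, the same check can be done in a single pass by counting how many partitions each user appears in. The sketch below uses plain Python on made-up id lists; `users_in_multiple_partitions` is a hypothetical helper, and its input is assumed to have the same shape as `part2user_id` (partition number mapped to the user ids found in that partition).

```python
from collections import Counter


def users_in_multiple_partitions(part2user_ids):
    """Return {user_id: number_of_partitions} for users seen in more than one partition."""
    counts = Counter(
        uid
        for ids in part2user_ids.values()
        for uid in set(ids)  # count each user once per partition
    )
    return {uid: n for uid, n in counts.items() if n > 1}


# Toy inputs: one clean partitioning and one where users 3 and 4 leak across partitions
clean = {0: [1, 2, 3], 1: [4, 5], 2: [6]}
leaky = {0: [1, 2, 3], 1: [3, 4], 2: [4, 5, 3]}

clean_result = users_in_multiple_partitions(clean)
leaky_result = users_in_multiple_partitions(leaky)
```

An empty result means the partitions are disjoint by user. The single pass avoids building a set intersection for every pair of partitions, which matters little for a handful of partitions but scales better with many.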