# Opening Notes

For this dataset, we are told that the data is a kind of time-series tabular data. What I mean is that the data is organized by user_id, and within each user the rows follow the order in which that user viewed the questions. So I wanted to split the data by taking features from the first questions each user saw and creating the training data from the later questions.

I found no easy way to do this, but I managed to group user_ids by the number of questions they had seen. I used these groups to set aside roughly 9-12% of the data to create my training dataset. The remaining 88-91% of the data was used to create features.
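To make the idea concrete, here is a minimal sketch on a toy frame. It assumes the `user_id` and `answered_correctly` columns from the dataset; the cutoff `k` and the `user_prior_accuracy` feature are hypothetical, chosen only for illustration, and are not necessarily what the main notebook uses.

```python
import pandas as pd

# Toy interaction log, already ordered by user and by time within each user.
toy = pd.DataFrame({
    "user_id":            [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "answered_correctly": [1, 0, 1, 1, 0, 0, 0, 1, 1],
})

k = 2  # hypothetical: number of most recent rows per user kept for training

# Later questions become training rows; everything earlier feeds the features.
toy_train = toy.groupby("user_id").tail(k)
toy_features = toy.drop(toy_train.index)

# Hypothetical feature: each user's accuracy on the earlier questions only.
user_prior_accuracy = (
    toy_features.groupby("user_id")["answered_correctly"]
    .mean()
    .rename("user_prior_accuracy")
    .reset_index()
)

# Attach the feature to the training rows; it is built without their labels.
toy_train = toy_train.merge(user_prior_accuracy, on="user_id", how="left")
print(toy_train)
```

Because the feature is computed only from rows that come before the training rows, the target of a training row never contributes to its own features.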

In this notebook I also created the lag_time variable, which was the most prominent feature in the LGB model I submitted!



In [1]:
#basic
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
import gc


import os

In [2]:
%%time
train_questions_only_df = pd.read_pickle('../input/riiid-train-df/train_df.pkl.gzip')

used_data_types_dict = {
    'question_id': 'int16',
    'bundle_id': 'int16',
    'correct_answer': 'int8',
    'part': 'int8',
    'tags': 'str',
}

CPU times: user 873 ms, sys: 5.01 s, total: 5.88 s
Wall time: 23.2 s


In [3]:
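def get_lag_time(df):
    """Add a lag_time column: time since the user's previous distinct timestamp.

    Assumes rows are grouped by user_id and ordered by timestamp within each user.
    """
    lag_dict = {}           # user_id -> last timestamp recorded for that user
    lag_list = []           # lag_time for every row, in row order
    prev_timestamp = 123    # placeholder; only compared for users already in lag_dict

    for pair in tqdm(df[['user_id', 'timestamp']].values):
        if pair[0] in lag_dict:
            if prev_timestamp == pair[1]:
                # Same timestamp as the previous row: reuse that row's lag
                lag_list.append(lag_list[-1])
            else:
                lag_list.append(pair[1] - lag_dict[pair[0]])
                lag_dict[pair[0]] = pair[1]
        else:
            # First row of a new user: lag is 0 and the stored timestamp starts at 0
            lag_dict[pair[0]] = 0
            lag_list.append(0)

        prev_timestamp = pair[1]

    df['lag_time'] = lag_list

    return df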
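
The row-by-row loop above has to visit every row in Python. As a hedged alternative, the same quantity can be sketched with vectorized pandas operations. The function name `lag_time_vectorized` is hypothetical, and the sketch assumes a unique index and rows sorted so that equal timestamps of one user are adjacent. It also measures a user's second lag from that user's first timestamp rather than from 0, so it matches the loop only when each user's first timestamp is 0.

```python
import pandas as pd

def lag_time_vectorized(df):
    """Sketch of a loop-free lag_time; returns a copy with the new column."""
    out = df.copy()
    # True on the first row of each (user_id, timestamp) pair.
    is_new_ts = ~out.duplicated(subset=["user_id", "timestamp"])
    # Difference to the user's previous distinct timestamp; 0 on the user's first row.
    diffs = out.loc[is_new_ts].groupby("user_id")["timestamp"].diff().fillna(0)
    # Rows sharing a timestamp inherit the lag of the first row with that timestamp.
    out["lag_time"] = diffs.reindex(out.index).ffill()
    return out

toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "timestamp": [0, 10, 10, 25, 0, 5],
})
print(lag_time_vectorized(toy)["lag_time"].tolist())
# [0.0, 10.0, 10.0, 15.0, 0.0, 5.0]
```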

In [4]:
train_questions_only_df = get_lag_time(train_questions_only_df)


In [5]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') >= 10000]
valid_split1 = b.groupby('user_id').tail(1000)
train_split1 = b[~b.index.isin(valid_split1.index)]

print(valid_split1.shape[0]/train_split1.shape[0])

0.08949360586291752


In [6]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 10000]
c=b[b.groupby('user_id')['user_id'].transform('size') >= 5000]
valid_split2 = c.groupby('user_id').tail(600)
train_split2 = c[~c.index.isin(valid_split2.index)]

print(valid_split2.shape[0]/train_split2.shape[0])

0.10014560088491771


In [7]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 5000]
c=b[b.groupby('user_id')['user_id'].transform('size') >= 1000]
valid_split3 = c.groupby('user_id').tail(185)
train_split3 = c[~c.index.isin(valid_split3.index)]

print(valid_split3.shape[0]/train_split3.shape[0])

0.10266382602036785


In [8]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 1000]
c=b[b.groupby('user_id')['user_id'].transform('size') >= 500]
valid_split4 = c.groupby('user_id').tail(65)
train_split4 = c[~c.index.isin(valid_split4.index)]

print(valid_split4.shape[0]/train_split4.shape[0])

0.1014135185675947


In [9]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 500]
c=b[b.groupby('user_id')['user_id'].transform('size') >= 200]
valid_split5 = c.groupby('user_id').tail(28)
train_split5 = c[~c.index.isin(valid_split5.index)]

print(valid_split5.shape[0]/train_split5.shape[0])

0.09606992576414827


In [10]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 200]
c=b[b.groupby('user_id')['user_id'].transform('size') >= 50]
valid_split6 = c.groupby('user_id').tail(8)
train_split6 = c[~c.index.isin(valid_split6.index)]

print(valid_split6.shape[0]/train_split6.shape[0])

0.08868120729430187


In [11]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 50]
c=b[b.groupby('user_id')['user_id'].transform('size') >= 2]
valid_split7 = c.groupby('user_id').tail(3)
train_split7 = c[~c.index.isin(valid_split7.index)]

print(valid_split7.shape[0]/train_split7.shape[0])

0.12428416860063977
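
The seven cells above repeat the same pattern with different size limits and tail lengths. A minimal sketch of how they could be folded into one helper follows; `BUCKETS` and `split_by_user_size` are hypothetical names, the limits and tail lengths are copied from the cells above, and a unique index is assumed. Users with a single question fall outside every bucket, as they do above.

```python
import numpy as np
import pandas as pd

# (min_size, max_size_exclusive, rows_held_out_per_user), copied from the cells above.
BUCKETS = [
    (10000, np.inf, 1000),
    (5000, 10000, 600),
    (1000, 5000, 185),
    (500, 1000, 65),
    (200, 500, 28),
    (50, 200, 8),
    (2, 50, 3),
]

def split_by_user_size(df, buckets=BUCKETS):
    """Hypothetical helper: hold out the last k rows per user, with k set per bucket."""
    # Number of rows per user, computed once instead of once per cell.
    sizes = df.groupby("user_id")["user_id"].transform("size")
    kept, held_out = [], []
    for lo, hi, k in buckets:
        bucket = df[(sizes >= lo) & (sizes < hi)]
        tail = bucket.groupby("user_id").tail(k)
        held_out.append(tail)
        kept.append(bucket.drop(tail.index))
    return pd.concat(kept), pd.concat(held_out)

# Toy check: 60 rows for user 1, 10 rows for user 2, a single row for user 3.
demo = pd.DataFrame({"user_id": [1] * 60 + [2] * 10 + [3]})
kept, held_out = split_by_user_size(demo)
print(len(kept), len(held_out))  # 59 11
```

The returned frames follow bucket order rather than the original row order, so a `sort_index()` afterwards restores the original ordering.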


In [12]:
b=train_questions_only_df[train_questions_only_df.groupby('user_id')['user_id'].transform('size') < 50]
c=b[b.groupby('user_id')['user_id'].transform('size') < 2]

features_df, train_df = train_test_split(c, test_size=0.09, random_state=314)

print(train_df.shape[0]/features_df.shape[0])

0.10126582278481013


In [13]:
del train_questions_only_df, b, c
gc.collect()

73

I combined all the dataframes in the next cell, and when looping through I made sure to delete each df from memory. This was the only way I could get this done and stay within the CPU limits.

In [14]:
for df in (valid_split1, valid_split2, valid_split3, valid_split4, valid_split5, valid_split6, valid_split7):
    train_df = pd.concat([train_df, df], axis=0)
    del df
    
for df in (train_split1, train_split2, train_split3, train_split4, train_split5, train_split6, train_split7):
    features_df = pd.concat([features_df, df], axis=0)
    del df
    
print(train_df.shape[0]/features_df.shape[0])

0.10068372099987073
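
One caveat about the loop above: in Python, `del df` inside a `for` loop removes only the loop variable's name, so a dataframe that is still bound to another name (such as `valid_split1`) or held in the tuple being iterated stays in memory. Below is a minimal sketch of one way to release each piece as it is consumed, assuming the pieces are referenced only from a list. The names `pieces` and `combined` and the toy data are hypothetical.

```python
import gc
import pandas as pd

# Toy stand-ins for the split dataframes (hypothetical data).
pieces = [pd.DataFrame({"x": range(i, i + 3)}) for i in (0, 3, 6)]
combined = pd.DataFrame({"x": [100]})

# Pop each piece off the list so that no other name still refers to it;
# once concatenated and deleted, the piece is eligible for garbage collection.
while pieces:
    piece = pieces.pop(0)
    combined = pd.concat([combined, piece], axis=0)
    del piece

gc.collect()
print(len(combined))  # 10
```

Calling `pd.concat` once on the whole list is usually simpler, but it keeps every piece alive until the call returns, so the incremental form may suit a tight memory budget better.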


I did not take exactly the same percentage of rows from each group, because when I did, the average rates of correct answers looked a little odd. I wanted both datasets to have roughly the same ratio of questions answered correctly to questions answered incorrectly.

In [15]:
print("features_df - average correct:",features_df.answered_correctly.mean())
print("train_df - average correct:", train_df.answered_correctly.mean())

features_df - average correct: 0.625398251735512
train_df - average correct: 0.6228412482479606


I found it useful to sort the rows by index so that everything is neat once I read the pickle files in the main notebook. This made the two dataframes easier to inspect and read.

In [16]:
train_df = train_df.sort_index()
features_df = features_df.sort_index()

In [17]:
#writing data to use in main notebook
train_df.to_pickle('./train_q_only.pkl.zip')
features_df.to_pickle('./features_q_only.pkl.zip')

# Closing Notes

I do think I would have had a stronger model if I had been able to select a few key features and train on a larger portion of the data. If I had time to do this competition over again, I would attempt to create a model that could be trained on all 100 million rows.