# Opening Notes

This notebook is one of the key pieces of the LGB model I created. By building this dataframe in a separate notebook, I was able to create some pretty awesome features and stay within the CPU limits of the Kaggle notebook environment.

In [1]:
import numpy as np
import pandas as pd

from tqdm.notebook import tqdm

The last file read in the following cell was super helpful. I knew the tags were connected in some way, and I found Alex Bader's method of clustering them to be the best! His notebook taught me about networkx, a framework that provides some awesome visualizations of the connections between tags. Please check out his notebook at this link -> [Link](https://www.kaggle.com/spacelx/2020-r3id-clustering-question-tags)

In [2]:
%%time

used_data_types_dict = {
    'question_id': 'int16',
    'bundle_id': 'int16',
    'correct_answer': 'int8',
    'part': 'int8',
    'tags': 'str',
}

questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv',
                       usecols = used_data_types_dict.keys(), dtype=used_data_types_dict)

lectures_df = pd.read_csv('../input/riiid-test-answer-prediction/lectures.csv')
#ex = pd.read_csv('../input/riiid-test-answer-prediction/example_test.csv')

questions_communities = pd.read_csv('../input/2020-r3id-clustering-question-tags/question_cmnts.csv')

CPU times: user 18 ms, sys: 4.08 ms, total: 22.1 ms
Wall time: 54.3 ms


In [3]:
%%time
features_df = pd.read_pickle('../input/riiid-splitting-train-and-test-data/features_q_only.pkl.zip')
#train_df = pd.read_pickle('../input/riiid-splitting-train-and-test-data/train_q_only.pkl.zip')

CPU times: user 17.8 s, sys: 14.6 s, total: 32.4 s
Wall time: 46.5 s


The following cell shifts the prior_question_elapsed_time column up one position so that each row shows the amount of time it took to answer that row's question. I used this column to find the average amount of time it took to answer each question.

In [4]:
features_df['q_time'] = features_df['prior_question_elapsed_time'].shift(-1)
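
To make the shift in the cell above concrete, here is a minimal sketch on toy data. The `user_id` column and all values are assumptions made up for illustration; the grouped variant is not what the cell above does, it is only shown as an option that keeps values from crossing user boundaries.

```python
import numpy as np
import pandas as pd

# Toy interactions for two users, ordered by time within each user.
# 'user_id' and the values are assumptions for illustration only.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'prior_question_elapsed_time': [np.nan, 20000.0, 15000.0, np.nan, 30000.0],
})

# Plain shift, as in the cell above: row i receives the value stored on row i + 1.
toy['q_time'] = toy['prior_question_elapsed_time'].shift(-1)

# Grouped shift: same idea, but a value is never pulled from another user's rows.
toy['q_time_per_user'] = (
    toy.groupby('user_id')['prior_question_elapsed_time'].shift(-1)
)

print(toy)
```

In this toy data the two columns agree because each user's first row is NaN; the grouped form simply does not rely on that.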

In [5]:
#dropping all lectures
train_questions_only_df = features_df[features_df['answered_correctly']!=-1]

The following function gets the average question time for each question by question_id. One thing to note is that it took me a while to figure out why I kept getting inf values for the average question time. The fix I landed on was to replace all the inf values in the column with nan values, and then use fillna on those cells. There was probably a better fill value than zero, but that's what I used!

In [6]:
def get_avg_q_time(train_questions_only_df):
    
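    # Replace +/-inf with NaN, then fill the missing q_time values with 0.
    train_questions_only_df = train_questions_only_df.replace([np.inf, -np.inf], np.nan)
    train_questions_only_df['q_time'] = train_questions_only_df['q_time'].fillna(0)
    # Keep only rows where answered_correctly is not 0 (lectures were already dropped).
    train_questions_only_df = train_questions_only_df[train_questions_only_df['answered_correctly']!=0]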
    
    grouped_by_content_df = train_questions_only_df.groupby('content_id')
    
    content_avg_q_time_df = grouped_by_content_df.agg({'q_time':['mean']})
    
    # Questions with no rows in the data get the mean of the per-question averages.
    for i in questions[~questions.index.isin(content_avg_q_time_df.index)].index.values:
        content_avg_q_time_df.loc[i] = content_avg_q_time_df.values.mean()
     
    content_avg_q_time_df = content_avg_q_time_df.reset_index()
    
    content_avg_q_time_df.columns = [
    'content_id',
    'avg_q_time', 
    ]
    
    content_avg_q_time_df = content_avg_q_time_df.set_index('content_id').sort_index()
        
    return content_avg_q_time_df

content_avg_q_time_df = get_avg_q_time(train_questions_only_df)
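
The effect of the fill value mentioned above can be seen in a small sketch. The data is made up, and the second option (leaving NaN in place) is an alternative I am showing for comparison, not what the function above does.

```python
import numpy as np
import pandas as pd

# Toy data: content 0 has one inf value among its question times.
toy = pd.DataFrame({
    'content_id': [0, 0, 0, 1, 1],
    'q_time': [10000.0, np.inf, 20000.0, 30000.0, 30000.0],
})

# Same first step as the function above: inf becomes NaN.
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# Option used above: NaN becomes 0 and therefore counts toward the mean.
filled_mean = cleaned.fillna({'q_time': 0}).groupby('content_id')['q_time'].mean()

# Alternative: leave NaN in place; mean() skips missing values by default.
skipped_mean = cleaned.groupby('content_id')['q_time'].mean()

print(filled_mean)
print(skipped_mean)
```

Filling with zero pulls the average for content 0 down, while skipping the missing value averages only the times that were actually observed.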

Next I get each question's accuracy, the number of times it was asked, and the number of times it was answered correctly. I kept all three so that I could loop through and update them as more data points come in.

In [7]:
#grouping by content_id
grouped_by_content_df = train_questions_only_df.groupby('content_id')

#getting the mean, count and sum of answered_correctly for each content_id
content_answers_df = grouped_by_content_df.agg({'answered_correctly': ['mean', 'count', 'sum']}).copy()
content_answers_df.columns = [
    'q_mean_accuracy', 
    'q_question_asked',
    'q_question_correct',
]
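
As a side note, the same three statistics can be produced with pandas named aggregation, which sets the flat column names in the `agg` call itself. This is a sketch on made-up data, not the code used in this notebook.

```python
import pandas as pd

# Toy answers: content 0 was asked three times, content 1 twice.
toy = pd.DataFrame({
    'content_id': [0, 0, 0, 1, 1],
    'answered_correctly': [1, 0, 1, 1, 1],
})

# Named aggregation: output_column=(input_column, function).
stats = toy.groupby('content_id').agg(
    q_mean_accuracy=('answered_correctly', 'mean'),
    q_question_asked=('answered_correctly', 'count'),
    q_question_correct=('answered_correctly', 'sum'),
)

print(stats)
```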

There were actually some question_ids that we did not get to see, and I created a nifty function to fill in these values. I filled the accuracy with roughly the overall average (0.6). I thought setting the number of times the question was asked to 10 was reasonable, because I was planning on looping and updating these counts. If it were 1, one or two incorrect answers would seriously affect the supposed question accuracy.

In [8]:
def add_missing_questions(content_answers_df, questions):
    
    for i in questions[~questions.index.isin(content_answers_df.index)].index.values:
        content_answers_df.loc[i] = [0.6, 10, 6]
        
    content_answers_df = content_answers_df.sort_index()
        
    return content_answers_df

content_answers_df = add_missing_questions(content_answers_df, questions)
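
The reasoning behind starting unseen questions at 10 asked and 6 correct can be checked with a little arithmetic. The helper `update_accuracy` below is hypothetical, written only for this illustration; the actual update loop lives outside this notebook.

```python
def update_accuracy(asked, correct, was_correct):
    """Return (asked, correct, accuracy) after one more answer is observed."""
    asked += 1
    correct += int(was_correct)
    return asked, correct, correct / asked

# Prior used above: 10 asked, 6 correct (accuracy 0.6).
# One incorrect answer moves the accuracy only slightly.
strong_prior = update_accuracy(10, 6, False)

# A weaker hypothetical prior with the same 0.6 accuracy but only 1 asked.
# One incorrect answer halves the accuracy.
weak_prior = update_accuracy(1, 0.6, False)

print(strong_prior)
print(weak_prior)
```

With the 10/6 prior a single incorrect answer gives 6/11, roughly 0.545, whereas the 1-asked prior drops to 0.3.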

This is another nifty pandas one-liner: it counts how many questions share the bundle_id tied to each question. This is a pretty cool feature, and I'm glad I was able to extract it in one line!

In [9]:
questions['num_in_bundle'] = questions.groupby(['bundle_id'])['question_id'].transform('count')
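
Here is what `transform('count')` does on a toy frame with made-up ids: unlike a plain aggregation, it returns one value per original row, so the result can be assigned straight back as a column.

```python
import pandas as pd

# Toy questions: bundle 0 has one question, bundle 1 has two, bundle 3 has three.
toy_questions = pd.DataFrame({
    'question_id': [0, 1, 2, 3, 4, 5],
    'bundle_id': [0, 1, 1, 3, 3, 3],
})

# Each row receives the size of the bundle it belongs to.
toy_questions['num_in_bundle'] = (
    toy_questions.groupby('bundle_id')['question_id'].transform('count')
)

print(toy_questions)
```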

Merging all the dataframes I have created in this notebook into one. 

In [10]:
#adding community
content_answers_df = content_answers_df.merge(questions_communities['community'], left_index=True, right_index=True)

#adding num_in_bundle
content_answers_df = content_answers_df.merge(questions['num_in_bundle'], left_index=True, right_index=True)

#adding avg_q_time
content_answers_df = content_answers_df.merge(content_avg_q_time_df['avg_q_time'], left_index=True, right_index=True)

content_answers_df

| content_id | q_mean_accuracy | q_question_asked | q_question_correct | community | num_in_bundle | avg_q_time |
|---|---|---|---|---|---|---|
| 0 | 0.907535 | 6543.0 | 5938.0 | 2 | 1 | 18880.0 |
| 1 | 0.889810 | 6997.0 | 6226.0 | 2 | 1 | 18272.0 |
| 2 | 0.553755 | 42880.0 | 23745.0 | 2 | 1 | 22576.0 |
| 3 | 0.778388 | 21655.0 | 16856.0 | 2 | 1 | 20224.0 |
| 4 | 0.611676 | 30611.0 | 18724.0 | 2 | 1 | 20672.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 13518 | 0.770115 | 609.0 | 469.0 | 0 | 1 | 13984.0 |
| 13519 | 0.557960 | 647.0 | 361.0 | 1 | 1 | 23552.0 |
| 13520 | 0.668842 | 613.0 | 410.0 | 1 | 1 | 24400.0 |
| 13521 | 0.820513 | 585.0 | 480.0 | 0 | 1 | 18720.0 |
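
One thing worth knowing about the index merges above: `merge` defaults to an inner join, so any content_id missing from the right-hand side is dropped. The sketch below uses made-up frames to show the difference from `how='left'`; it assumes, as the merges above do, that both sides are indexed by the question id.

```python
import pandas as pd

# Left side has three questions; the right side is missing question 1.
left = pd.DataFrame({'q_mean_accuracy': [0.9, 0.5, 0.7]}, index=[0, 1, 2])
right = pd.Series([2, 0], index=[0, 2], name='community')

# Default inner join: question 1 disappears from the result.
inner = left.merge(right, left_index=True, right_index=True)

# Left join: question 1 is kept and its community is NaN.
kept = left.merge(right, left_index=True, right_index=True, how='left')

print(inner)
print(kept)
```

Comparing the row count before and after each merge is a cheap way to confirm nothing was dropped.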


In [11]:
content_answers_df.to_pickle('./content_answers_df.pkl.zip')

# Closing Notes

It's typical for a data scientist working on decision tree models to ask this, but I do wonder what other features I could have built that are tied to each question_id. I definitely could have filled the unseen questions with values that were representative of their tag or part.
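
A minimal sketch of that idea, filling unseen questions with the mean accuracy of their part, could look like the following. I did not do this in the notebook; the frames and values are made up, and both are assumed to be indexed by the question id.

```python
import pandas as pd

# Toy question metadata: two questions in part 1, two in part 2.
toy_questions = pd.DataFrame(
    {'part': [1, 1, 2, 2]},
    index=pd.Index([0, 1, 2, 3], name='content_id'),
)

# Accuracy is only known for questions 0 and 2; questions 1 and 3 are unseen.
seen_stats = pd.DataFrame(
    {'q_mean_accuracy': [0.8, 0.4]},
    index=pd.Index([0, 2], name='content_id'),
)

# Join on the index so unseen questions show up with NaN accuracy.
joined = toy_questions.join(seen_stats)

# Per-part mean of the seen questions, broadcast back to every row.
part_mean = joined.groupby('part')['q_mean_accuracy'].transform('mean')

# Fill only the unseen questions with their part's mean.
joined['q_mean_accuracy'] = joined['q_mean_accuracy'].fillna(part_mean)

print(joined)
```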

I think it's impossible to fully explore all the ways you can create features from the data given, but I do think I could have utilized the timestamp and lag_time features in conjunction with these stats to create some pretty powerful variables. I'm interested to see some of the creative features people used once the competition ends.