# Feature Engineering
#### Purpose:
This notebook performs feature engineering on the training and validation parquet files to produce new 'final' training and validation files, which a separate notebook will use for training. No columns are dropped here; the notebook only generates new features and applies scaling where needed.
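
As a rough sketch of the "scaling where needed" step: the notebook does not say which scaling method is used, so standardization is assumed here purely for illustration. The helper `add_scaled_column`, the `_scaled` suffix, and the toy frames are hypothetical names, not part of the original pipeline. The statistics are taken from the training frame only and reused for validation, and the original column is kept.

```python
import pandas as pd

def add_scaled_column(train, val, col):
    """Standardize `col` with training statistics only; the raw column is kept."""
    mean = train[col].mean()
    std = train[col].std(ddof=0)
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    train = train.copy()
    val = val.copy()
    # new feature is added next to the original, nothing is dropped
    train[f"{col}_scaled"] = (train[col] - mean) / std
    val[f"{col}_scaled"] = (val[col] - mean) / std
    return train, val

# toy frames standing in for train_df / val_df
toy_train = pd.DataFrame({"elapsed_time": [38000, 24000, 68000, 42000]})
toy_val = pd.DataFrame({"elapsed_time": [85000, 28000]})
toy_train, toy_val = add_scaled_column(toy_train, toy_val, "elapsed_time")
```

Reusing the training mean and standard deviation for the validation frame keeps validation statistics from leaking into the features.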

In [2]:
import os
import pandas as pd
import polars as pl # used for reading in parquet files quickly
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm

### Reading in data

In [3]:
train_df = pl.read_parquet("./Data/train_data.parquet").to_pandas() # convert from polars to pandas
val_df = pl.read_parquet("./Data/val_data.parquet").to_pandas()

In [4]:
train_df

Unnamed: 0,student_id,timestamp,question_id,bundle_id,tags,elapsed_time,correct
0,1,1565096190868,5012,3544,74,38000,0.0
1,1,1565096221062,4706,3238,71,24000,1.0
2,1,1565096293432,4366,2898,103,68000,1.0
3,1,1565096339668,4829,3361,83,42000,0.0
4,1,1565096401774,6528,5060,90,59000,0.0
...,...,...,...,...,...,...,...
76078853,840472,1575306027437,9814,7165,136,37000,0.0
76078854,840472,1575306068437,4712,3244,71,37000,0.0
76078855,840472,1575306087437,3793,2325,82,15000,0.0
76078856,840473,1575306037437,3830,2362,106,25000,1.0


In [5]:
val_df

Unnamed: 0,student_id,timestamp,question_id,bundle_id,tags,elapsed_time,correct
0,4,1566782278107,5177,3709,74,85000,1.0
1,4,1566782311854,8104,5575,8;2;182,28000,1.0
2,4,1566782336708,4291,2823,84,22000,0.0
3,4,1566782351705,4020,2552,86,12000,1.0
4,4,1566782382870,5258,3790,87,28000,1.0
...,...,...,...,...,...,...,...
19215063,840471,1575305806437,10300,7651,76,18000,0.0
19215064,840471,1575305834437,8886,6237,83,25000,0.0
19215065,840471,1575305860437,8556,5907,85,24000,1.0
19215066,840471,1575305880437,9901,7252,74,18000,0.0


## Lecture Tags
#### Purpose:
The KT-4 dataset provides information on student interactions. The interaction we take advantage of here is whether a student interacted with a lecture, which is denoted by an item id consisting of the character 'l' followed by a number. For each question, we plan to record whether the student interacted with a lecture associated with one of that question's tags.
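
The idea above could be sketched on toy data as follows. This is only an illustration under several assumptions: the lecture-to-tag mapping `lecture_tags` is a hypothetical table (the notebook does not show where it comes from), the rule "the lecture was viewed before the question timestamp" is assumed, and the names `first_seen` and `lecture_seen` are made up for the example. Tags are treated as semicolon-separated strings, as in the `8;2;182` value in the validation data.

```python
import pandas as pd

# toy interaction log in the shape of the combined KT3 frame
events = pd.DataFrame({
    "timestamp": [100, 150, 300, 120],
    "action_type": ["enter", "respond", "enter", "enter"],
    "item_id": ["l10", "q5012", "l20", "l20"],
    "student_id": [1, 1, 1, 4],
})
# hypothetical mapping from lecture id to tag (assumption)
lecture_tags = pd.DataFrame({"item_id": ["l10", "l20"], "tag": ["74", "182"]})
# toy question rows in the shape of train_df / val_df
questions = pd.DataFrame({
    "student_id": [1, 1, 4],
    "timestamp": [200, 250, 400],
    "question_id": [5012, 4706, 8104],
    "tags": ["74", "182", "8;2;182"],
})

# lecture interactions are the rows whose item_id starts with 'l'
lectures = events[events["item_id"].str.startswith("l")]
# earliest time each student saw a lecture for each tag
first_seen = (
    lectures.merge(lecture_tags, on="item_id")
    .groupby(["student_id", "tag"], as_index=False)["timestamp"].min()
    .rename(columns={"timestamp": "lecture_ts"})
)

# one row per (question row, tag), then look up the matching lecture time
exploded = questions.reset_index().rename(columns={"index": "row"})
exploded["tag"] = exploded["tags"].str.split(";")
exploded = exploded.explode("tag")
exploded["tag"] = exploded["tag"].astype(str)
exploded = exploded.merge(first_seen, on=["student_id", "tag"], how="left")
# a missing lecture_ts (NaN) compares as False, so no lecture means no hit
exploded["hit"] = exploded["lecture_ts"] < exploded["timestamp"]

# a question is flagged if any of its tags had an earlier lecture
questions["lecture_seen"] = exploded.groupby("row")["hit"].any().astype(int)
print(questions["lecture_seen"].tolist())  # [1, 0, 1]
```

The second question is not flagged because student 1 only viewed the matching lecture after answering it.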

In [6]:
def combine_data(file_path, output_file):
    dfs = []
    csv_files = list(Path(file_path).glob("*.csv"))
    
    for csv_file in tqdm(csv_files, desc="Reading CSVs"):
        # derive the student id from the file name
        filename = csv_file.stem # name without the '.csv' extension (e.g. 'u123')
        student_id = filename[1:] # drop the leading 'u' (kept as a string)
        tqdm.write(f"filename: {csv_file}, student_id: {student_id}") # does not break the progress bar
        # scan_csv is lazy, so it is more memory efficient than read_csv:
        # nothing is loaded until the combined frame is written out
        df = pl.scan_csv(csv_file).with_columns(
            pl.lit(student_id).alias("student_id"))

        # collect the lazy frame, now tagged with its student_id
        dfs.append(df)
    
    pl.concat(dfs).sink_parquet(output_file)
    return True # file created successfully

In [7]:
file_path = "./Data/KT3/" 
file = "./Data/combined_kt3.parquet"

if not Path(file).exists():
    result = combine_data(file_path, file)
    if result:
        print("combined dataset file created successfully")
    else:
        print("file failed to create")
else:
    print("combined_kt3.parquet dataset is present in Data folder")

kt3_df = pl.read_parquet("./Data/combined_kt3.parquet").to_pandas()
kt3_df

combined_kt3.parquet dataset is present in Data folder


Unnamed: 0,timestamp,action_type,item_id,source,user_answer,platform,student_id
0,1565096151269,enter,b3544,diagnosis,,mobile,1
1,1565096187972,respond,q5012,diagnosis,b,mobile,1
2,1565096194904,submit,b3544,diagnosis,,mobile,1
3,1565096195001,enter,b3238,diagnosis,,mobile,1
4,1565096218682,respond,q4706,diagnosis,c,mobile,1
...,...,...,...,...,...,...,...
89270649,1568964975390,enter,b3819,sprint,,mobile,9998
89270650,1568964992921,respond,q5287,sprint,c,mobile,9998
89270651,1568964996503,submit,b3819,sprint,,mobile,9998
89270652,1568964996572,enter,e3819,sprint,,mobile,9998


In [None]:
student_id = []