# Restrict Data

This notebook marks the beginning of the data preparation. It only illustrates what `restrict_data.py` does; no data is saved here.

We restrict the data sets to the relevant rows. For example, we drop assignments from `assignment_details.csv` that appear in neither `action_logs.csv` nor `unit_test_scores.csv`.

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd

import sys
import os
sys.path.append(os.path.abspath('../../sources'))

from data_preparation import restrict_data
import utils



### Action Logs

In [2]:
action_logs = utils.read_action_logs()
print(len(action_logs))  # 23932276
action_logs.head()

23932276


Unnamed: 0,assignment_log_id,timestamp,problem_id,max_attempts,available_core_tutoring,score_viewable,continuous_score_viewable,action,hint_id,explanation_id
0,2QV1F2GSBZ,1599151000.0,,,,,,assignment_started,,
1,2QV1F2GSBZ,1599151000.0,I2GX4OQIE,3.0,answer,1.0,1.0,problem_started,,
2,2QV1F2GSBZ,1599151000.0,I2GX4OQIE,,,,,wrong_response,,
3,2QV1F2GSBZ,1599151000.0,I2GX4OQIE,,,,,wrong_response,,
4,2QV1F2GSBZ,1599151000.0,I2GX4OQIE,,,,,answer_requested,,


In [3]:
iu_ids = action_logs["assignment_log_id"].unique()
len(iu_ids)  # 638528

638528

### Unit Test Scores

In [4]:
unit_test_scores = utils.read_unit_test_scores()
print(len(unit_test_scores))  # 452439
unit_test_scores.head()

452439


Unnamed: 0,assignment_log_id,problem_id,score
0,1CEASUAUQJ,18J6436AS5,1
1,2IMKPEIL2Q,9RMI4CZU9,0
2,2IMKPEIL2Q,8F4U5WWTV,0
3,2IMKPEIL2Q,27D3I359NE,1
4,2IMKPEIL2Q,22DY4PFVMV,1


In [5]:
ut_ids = unit_test_scores["assignment_log_id"].unique()
len(ut_ids)  # 42343

42343

### Assignment Relationships

In [6]:
assignment_relationships = utils.read_assignment_relationships()
print(len(assignment_relationships))  # 699839
assignment_relationships.head()

699839


Unnamed: 0,unit_test_assignment_log_id,in_unit_assignment_log_id
0,7FGC8P0F1,V6YXT3UG
1,15KQFID5U5,1TFFYMT814
2,QKDRPCXSG,1N2IFGUASM
3,1JOJIQXU1B,15W4ET3W62
4,2C9YZRVZT0,1WORTY787C


### Assignment Details

We restrict the assignment details to those assignments for which there is data in `action_logs` or `unit_test_scores`. The other rows will be dropped.

In [7]:
assignment_details = utils.read_assignment_details()
print(len(assignment_details))  # 9319676
assignment_details.head()

9319676


assignment_log_id,teacher_id,class_id,student_id,sequence_id,assignment_release_date,assignment_due_date,assignment_start_time,assignment_end_time
2PLEB2KWK9,22OEQXISYV,133F5L5O95,L97DTM607,1FLYIHK4Q4,1539635000.0,1540067000.0,1539634866.476,
8G25XNCXN,2SKA2RTF6,2OL82EC95R,21S35PU5W2,CDLX4UJ84,1539871000.0,,1539871403.267,1539871641.345
266AW7UU1V,1FJ326JFAH,1WJWBO8XL4,IBO6BEHXA,2T42B3UC5,1539885000.0,,1539884690.684,
15SHL0U0E6,129LDU45TT,IBO6BEHXA,1CT2ERTNC7,7ZGYNOHS3,1539896000.0,1540242000.0,1539952545.055,
CQA32TBFI,1FJ326JFAH,1WJWBO8XL4,2JC4HHXU4M,2T42B3UC5,1539885000.0,,1539974068.802,


In [8]:
all_ids = list(ut_ids) + list(iu_ids)
assignment_details_rest = assignment_details.loc[all_ids].copy()
len(all_ids), len(assignment_details_rest)  # 680871, 680871

(680871, 680871)
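
`DataFrame.loc` with a list of labels raises a `KeyError` as soon as one label is missing from the index, so the cell above implicitly confirms that every ID in `all_ids` exists in `assignment_details`. The following minimal sketch illustrates that behaviour next to a more tolerant alternative based on `Index.isin`. The `details` frame and all IDs are made-up toy values, not taken from the data set.

```python
import pandas as pd

# Toy stand-in for assignment_details, indexed by assignment_log_id (hypothetical IDs).
details = pd.DataFrame(
    {"sequence_id": ["S1", "S2", "S3", "S1"]},
    index=pd.Index(["A", "B", "C", "D"], name="assignment_log_id"),
)

ut_ids = ["A"]        # toy IDs seen in the unit test scores
iu_ids = ["C", "D"]   # toy IDs seen in the action logs
all_ids = list(ut_ids) + list(iu_ids)

# Strict selection: one row per requested label, KeyError if a label is missing.
strict = details.loc[all_ids].copy()

# Tolerant selection: a boolean mask silently ignores IDs that do not exist.
tolerant = details.loc[details.index.isin(all_ids + ["MISSING"])].copy()

# Demonstrate that the strict form rejects an unknown ID.
try:
    details.loc[all_ids + ["MISSING"]]
    missing_raises = False
except KeyError:
    missing_raises = True

print(len(strict), len(tolerant), missing_raises)
```

Because `.loc` returns one row per requested label, equal lengths alone would not reveal an ID that occurs in both `ut_ids` and `iu_ids`; the mask-based form would return such a row only once.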

### Sequence Relationships

In [9]:
sequence_relationships = utils.read_sequence_relationships()
print(len(sequence_relationships))  # 12564
sequence_relationships.head()

12564


Unnamed: 0,unit_test_sequence_id,in_unit_sequence_id
0,K1U9M2PVF,1XEPEYCPC3
1,K1U9M2PVF,20SXJMMSRG
2,K1U9M2PVF,1SMS0A4N5G
3,K1U9M2PVF,1BROMSHRRA
4,K1U9M2PVF,520QV3Q8S


### Sequence Details

We restrict the sequence details to those sequences that have been worked on in `action_logs` or `unit_test_scores`, that is, to those sequences that appear in `assignment_details_rest`.

In [10]:
sequence_details = utils.read_sequence_details()
print(len(sequence_details))  # 10774
sequence_details.head()

10774


Unnamed: 0,sequence_id,sequence_folder_path_level_1,sequence_folder_path_level_2,sequence_folder_path_level_3,sequence_folder_path_level_4,sequence_folder_path_level_5,sequence_name,sequence_problem_ids
0,K1U9M2PVF,EngageNY/Eureka Math (© by Great Minds®) *,Algebra I,Module 1 - Relationships Between Quantities an...,Module 1---Assessments,,End-of-Module---Alg 1.1 End-of-Module Assessment,"[AQ0ZKSP6D, 2KTD380L98, 7CPDNFDLD, 2F9VV7RVWU,..."
1,1XEPEYCPC3,EngageNY/Eureka Math (© by Great Minds®) *,Algebra I,Module 1 - Relationships Between Quantities an...,Module 1---Assessments,,Mid-Module---Alg1.1 Mid-Module Assessment,"[WS70M9DP1, 13HDHY5VMI, 24WQMJBRDX, 1IFT888E81..."
2,20SXJMMSRG,EngageNY/Eureka Math (© by Great Minds®) *,Algebra I,Module 1 - Relationships Between Quantities an...,Topic A---Lesson 1: Graphs of Piecewise Linear...,,"Problem Set---Algebra I, M1, Lesson 1 (N.Q.A.1...","[1D3AXDDMQ9, 2HVIXDM2L5, 1I9N9TMSO6, 182WSU48H..."
3,1SMS0A4N5G,EngageNY/Eureka Math (© by Great Minds®) *,Algebra I,Module 1 - Relationships Between Quantities an...,Topic A---Lesson 2: Graphs of Quadratic Functions,,"Classwork---Algebra I, M1, Lesson 2 (N.Q.A.1, ...","[1X69IIUXB1, E083MYD2P]"
4,1BROMSHRRA,EngageNY/Eureka Math (© by Great Minds®) *,Algebra I,Module 1 - Relationships Between Quantities an...,Topic A---Lesson 2: Graphs of Quadratic Functions,,"Exit Ticket---Algebra 1, M1, Lesson 2 (N.Q.1, ...",[2BLJ83JUIM]


In [11]:
seq_ids = assignment_details_rest["sequence_id"].unique()
sequence_details_rest = sequence_details.loc[sequence_details["sequence_id"].isin(seq_ids)].copy()
len(seq_ids), len(sequence_details_rest)  # 5416, 5514

(5416, 5514)
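
The restricted frame has more rows (5514) than there are distinct sequence IDs (5416). One possible explanation, which this notebook does not confirm, is that some `sequence_id` values occur in more than one row of `sequence_details`. A minimal sketch of how this could be checked, using hypothetical toy data in which `S1` is deliberately duplicated:

```python
import pandas as pd

# Toy stand-in for sequence_details; "S1" appears twice on purpose (hypothetical data).
sequence_details = pd.DataFrame({
    "sequence_id": ["S1", "S1", "S2", "S3"],
    "sequence_name": ["Lesson 1", "Lesson 1 (copy)", "Lesson 2", "Lesson 3"],
})
seq_ids = ["S1", "S2"]  # toy IDs of sequences that were actually worked on

# Same restriction pattern as in the cell above.
rest = sequence_details.loc[sequence_details["sequence_id"].isin(seq_ids)].copy()

# All rows whose sequence_id occurs more than once in the restricted frame.
dupes = rest[rest["sequence_id"].duplicated(keep=False)]

n_ids = len(seq_ids)
n_rows = len(rest)
n_unique_in_rest = rest["sequence_id"].nunique()
print(n_ids, n_rows, n_unique_in_rest, len(dupes))
```

If the assumption holds for the real data, `n_unique_in_rest` would not exceed the number of requested IDs while `n_rows` would be larger, as in the toy case.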

### Problem Details

Important note: Some problems appear in `action_logs` and `unit_test_scores` but not in `problem_details`. The data description explains this: "This file contains one row for every problem referenced in the dataset, except for some problems in the action logs, which have been deleted from the database. These problems likely had errors during their original transcription into ASSISTments that were corrected, but no record of the original problems was kept."

We restrict the problem details to those problems that appear in `action_logs` or `unit_test_scores`.

In [12]:
problem_details = utils.read_problem_details()
print(len(problem_details))  # 132738
problem_details.head()

132738


problem_id,problem_multipart_id,problem_multipart_position,problem_type,problem_skill_code,problem_skill_description,problem_contains_image,problem_contains_equation,problem_contains_video,problem_text_bert_pca
10MFND3HAJ,2MHCTW1IIN,1,Multiple Choice,6.RP.A.3b,Unit Rate,0.0,0.0,1.0,"[0.53955209, -0.96322744, 0.49725574, 6.287953..."
IH3MOE7AF,1UEQMXOOFA,1,Multiple Choice,6.RP.A.3b,Unit Rate,0.0,0.0,0.0,"[-1.61147666, -1.50911536, 0.52055446, 6.01118..."
14YC7CEE2N,1UEQMXOOFA,2,Ungraded Open Response,6.RP.A.3b,Unit Rate,0.0,0.0,0.0,"[-8.95361845, 5.26005410, -4.41350451, -2.6751..."
16L5KQWLN7,1W7DRPNEJL,1,Ungraded Open Response,6.RP.A.3b,Unit Rate,0.0,0.0,0.0,"[-2.89295465, 1.73222701, -0.21075635, 0.16314..."
BU0LO0LDD,1Z6MGLD8VK,1,Ungraded Open Response,6.RP.A.3b,Unit Rate,0.0,0.0,0.0,"[-1.53959700, 1.35386494, -1.56874727, 0.89545..."


In [13]:
prob_ids_al = set(action_logs["problem_id"])
prob_ids_uts = set(unit_test_scores["problem_id"])
prob_ids = prob_ids_al | prob_ids_uts
len(prob_ids_al), len(prob_ids_uts), len(prob_ids)  # 57361, 1835, 59171

(57361, 1835, 59171)

Some problems appear both in unit tests and in in-unit assignments, which is why the union is smaller than the sum of the two sets.
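
The size of that overlap follows from the three set sizes printed above by inclusion-exclusion. The short sketch below does the arithmetic and verifies the identity on two toy sets whose IDs are hypothetical.

```python
# Set sizes reported by the cell above.
n_al, n_uts, n_union = 57361, 1835, 59171

# Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
n_shared = n_al + n_uts - n_union
print(n_shared)

# The same identity on toy sets (hypothetical problem IDs).
a = {"P1", "P2", "P3"}
b = {"P3", "P4"}
assert len(a & b) == len(a) + len(b) - len(a | b)
```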

In [14]:
prob_ids_pd = set(problem_details.index)
prob_ids = prob_ids & prob_ids_pd
len(prob_ids)  # 58203

58203

In [15]:
problem_details_rest = problem_details.loc[list(prob_ids)].copy()
len(problem_details_rest)  # 58203

58203
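
Intersecting with the index of `problem_details` removes 59171 - 58203 = 968 entries from the set of referenced problems. Since `problem_id` is empty for some log rows (for example `assignment_started`), a missing-value placeholder may be among them. A minimal sketch of how the dropped IDs could be listed, using a toy `problem_details` frame and hypothetical IDs:

```python
import pandas as pd

# Toy stand-in for problem_details, indexed by problem_id (hypothetical IDs).
problem_details = pd.DataFrame(
    {"problem_type": ["Multiple Choice", "Ungraded Open Response"]},
    index=pd.Index(["P1", "P2"], name="problem_id"),
)
referenced = {"P1", "P2", "P9"}  # toy IDs seen in the logs or scores

known = referenced & set(problem_details.index)    # IDs that have details
missing = referenced - set(problem_details.index)  # IDs without details

# sorted() gives a deterministic row order; iterating a set does not.
rest = problem_details.loc[sorted(known)].copy()
print(len(known), sorted(missing), list(rest.index))
```

Sorting before the lookup is an optional design choice made here for reproducible row order; the notebook itself passes `list(prob_ids)`, whose order is arbitrary.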

### Save restricted files

In [16]:
#utils.save_as_csv(assignment_details_rest, "assignment_details_rest.csv", save_idx=True)
#utils.save_as_csv(sequence_details_rest, "sequence_details_rest.csv", save_idx=False)
#utils.save_as_csv(problem_details_rest, "problem_details_rest.csv", save_idx=True)

### Restriction as Function

In [17]:
(
    action_logs_orig,
    unit_test_scores_orig,
    assignment_relationships_orig,
    assignment_details_orig,
    sequence_relationships_orig,
    sequence_details_orig,
    problem_details_orig
) = utils.load_all_data()

In [18]:
(
    assignment_details,
    sequence_details,
    problem_details,
) = restrict_data.restrict_details_to_available_assignments(
    action_logs_orig,
    unit_test_scores_orig,
    assignment_details_orig,
    sequence_details_orig,
    problem_details_orig,
)

In [19]:
len(assignment_details), len(sequence_details), len(problem_details)  # 680871, 5514, 58203

(680871, 5514, 58203)
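
These counts match the step-by-step results above. Equal lengths do not guarantee equal contents, so a stricter comparison is conceivable. The sketch below uses a hypothetical helper `same_rows` (not part of `restrict_data.py` or `utils`) and toy frames, and assumes each frame has a unique index.

```python
import pandas as pd

def same_rows(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Hypothetical helper: True if both frames hold the same rows, ignoring row order."""
    try:
        # Sorting by index aligns the rows before the strict comparison.
        pd.testing.assert_frame_equal(left.sort_index(), right.sort_index())
        return True
    except AssertionError:
        return False

# Toy frames standing in for a manual result and a function result.
manual = pd.DataFrame({"x": [1, 2]}, index=["A", "B"])
from_function = pd.DataFrame({"x": [2, 1]}, index=["B", "A"])
different = pd.DataFrame({"x": [1, 3]}, index=["A", "B"])

print(same_rows(manual, from_function), same_rows(manual, different))
```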