### Investigate 1566 [unassigned] in response_correctness

In [1]:
%run '../lib/libraries.ipynb'

dython              0.6.1
tinydb              4.2.0


In [2]:
dfRaw = load_df('dfRaw')

###### Association of assignment_attempt_number and assignment_max_attempts vs response_correctness

In [3]:
dfRaw.where(
    (F.col('response_correctness') == '[unassigned]')
    & (F.col('assignment_max_attempts') == 0)
    & (F.col('assignment_attempt_number') == 0)
).count()

1566

- the 1566 rows with response_correctness = "[unassigned]" are the same observations as the 1566 rows where assignment_attempt_number and assignment_max_attempts are both 0
- investigate additional correlations
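
The count above shows that 1566 rows satisfy all three conditions at once. The claim that the two groups are the same observations also requires that neither condition matches any row outside that overlap. A minimal sketch of that check in plain Python, on invented toy rows (the `rows` list and the `same_observations` helper are hypothetical and not part of the notebook's library):

```python
# Toy stand-in for dfRaw: each dict is one observation (values are invented).
rows = [
    {"response_correctness": "[unassigned]", "assignment_max_attempts": 0, "assignment_attempt_number": 0},
    {"response_correctness": "[unassigned]", "assignment_max_attempts": 0, "assignment_attempt_number": 0},
    {"response_correctness": "correct", "assignment_max_attempts": 3, "assignment_attempt_number": 1},
    {"response_correctness": "incorrect", "assignment_max_attempts": 3, "assignment_attempt_number": 2},
]


def same_observations(rows):
    """Return True when the '[unassigned]' rows and the zero-attempt rows are the same set."""
    unassigned = {
        i for i, r in enumerate(rows)
        if r["response_correctness"] == "[unassigned]"
    }
    zero_attempts = {
        i for i, r in enumerate(rows)
        if r["assignment_max_attempts"] == 0 and r["assignment_attempt_number"] == 0
    }
    # Equal index sets mean neither condition matches a row the other misses.
    return unassigned == zero_attempts


print(same_observations(rows))  # True
```

On the real data the same idea amounts to comparing three counts: rows matching the first condition, rows matching the second, and rows matching both.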

###### Single-Valued Variables for response_correctness = "[unassigned]"

In [4]:
dfUnassigned = dfRaw.filter(F.col("response_correctness") == "[unassigned]")
cols = single_val(dfUnassigned.toPandas())

display_sv_cols(dfUnassigned, cols)

| Variable | Single value |
|---|---|
| is_affecting_grade | True |
| assignment_max_attempts | 0 |
| assignment_late_submission | False |
| assignment_final_submission_date | NaT |
| assignment_start_date | NaT |
| assignment_due_date | NaT |
| assignment_attempt_number | 0 |
| is_force_scored | False |
| is_manual_scoring_required | False |
| item_is_offline_scored | False |


- each observation
  - is fully scored (learner_attempt_status = "fully scored")
  - belongs to the same organization
  - has a score (unweighted_final_score varies, so it is not listed above)
  - has assignment_attempt_number = 0 and assignment_max_attempts = 0
  - has empty (NaT) values for
    - assignment_final_submission_date
    - assignment_start_date
    - assignment_due_date
  - everything points to an assignment that was not attempted
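
The `single_val` helper comes from the shared library notebook and its implementation is not shown here. As a rough illustration of the idea only, the sketch below finds columns that hold exactly one distinct value across a list of records. The function name `single_value_columns`, the toy records, and the use of `None` for a missing date are all assumptions made for this example.

```python
def single_value_columns(records):
    """Map each column holding exactly one distinct value to that value."""
    if not records:
        return {}
    result = {}
    for col in records[0]:
        distinct = {rec[col] for rec in records}
        if len(distinct) == 1:
            result[col] = next(iter(distinct))
    return result


# Invented toy records; None stands in for a missing (NaT) date.
records = [
    {"org_id": "org-1", "assignment_max_attempts": 0, "assignment_start_date": None, "unweighted_final_score": 1.0},
    {"org_id": "org-1", "assignment_max_attempts": 0, "assignment_start_date": None, "unweighted_final_score": 0.5},
]

# unweighted_final_score varies, so it is left out of the result.
print(single_value_columns(records))
```

One caveat if this is ported to pandas: `NaT` does not compare equal to itself, so an all-missing date column may need `nunique(dropna=False)` or explicit null handling to be reported as single-valued.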

###### Date Ranges for response_correctness = "[unassigned]"

In [5]:
types = get_var_types()
for f in types['intervalVars']:
  if f not in ["assignment_final_submission_date", "assignment_start_date", "assignment_due_date"]:
    print(f)
    dfUnassigned.agg(
      F.countDistinct(f).alias("unique"),
      F.count(F.when(F.col(f).isNull(), f)).alias("null"),
      F.min(f).alias("min"),
      F.max(f).alias("max")
    ).show(1, False)

max_student_stop_datetime
+------+----+-----------------------+-----------------------+
|unique|null|min                    |max                    |
+------+----+-----------------------+-----------------------+
|14    |0   |2019-09-10 14:44:06.405|2020-04-25 20:44:02.962|
+------+----+-----------------------+-----------------------+

min_student_start_datetime
+------+----+-----------------------+-----------------------+
|unique|null|min                    |max                    |
+------+----+-----------------------+-----------------------+
|14    |0   |2019-09-10 14:36:29.961|2020-04-01 01:22:00.607|
+------+----+-----------------------+-----------------------+

scored_datetime
+------+----+-----------------------+-----------------------+
|unique|null|min                    |max                    |
+------+----+-----------------------+-----------------------+
|161   |0   |2019-09-10 07:44:06.405|2020-04-25 13:44:02.962|
+------+----+-----------------------+-----------------------+

- dates are spread across the school year (9/10/2019 to 4/25/2020)
- min_student_start_datetime and max_student_stop_datetime have the same number of unique values (14)
  - these are entered by teacher
- scored_datetime has 161 unique values

###### Categorical / Identifier Values for response_correctness = "[unassigned]"

In [6]:
for f in types['identifierVars']:
  print(f)
  dfUnassigned.agg(
    F.countDistinct(f).alias("unique"),
    F.count(F.when(F.col(f).isNull(), f)).alias("null")
  ).show()

assessment_id
+------+----+
|unique|null|
+------+----+
|    12|   0|
+------+----+

assessment_instance_attempt_id
+------+----+
|unique|null|
+------+----+
|   161|   0|
+------+----+

assessment_instance_id
+------+----+
|unique|null|
+------+----+
|    14|   0|
+------+----+

assessment_item_response_id
+------+----+
|unique|null|
+------+----+
|  1566|   0|
+------+----+

learner_assigned_item_attempt_id
+------+----+
|unique|null|
+------+----+
|  1566|   0|
+------+----+

learner_assignment_attempt_id
+------+----+
|unique|null|
+------+----+
|   161|   0|
+------+----+

learner_id
+------+----+
|unique|null|
+------+----+
|    64|   0|
+------+----+

org_id
+------+----+
|unique|null|
+------+----+
|     1|   0|
+------+----+

section_id
+------+----+
|unique|null|
+------+----+
|     5|   0|
+------+----+



- all scores for one organization
  - 1 organization
  - 5 sections
  - 12 assessments
  - 14 assessment instances
    - 2 repeated
    - see datetime variables above with 14 values
  - 64 learners
  - 161 attempts
    - see datetime variables above with 161 values
  - 1566 questions

###### Conclusion: 1566 [unassigned] in response_correctness
- isolated to one organization
- no pattern, an outlier
- 1566 is about 2% of observations
  - 1566 / 80548 ≈ 1.9%
- remove the 1566 rows
- binary variables assignment_attempt_number and assignment_max_attempts
  - will be unary (single-valued) after the 1566 rows are deleted
  - remove both variables
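
The share quoted in the conclusion can be checked with a few lines of arithmetic. Both counts are taken from the notes above; the variable names are only illustrative.

```python
total_rows = 80548       # total observations, from the notes above
unassigned_rows = 1566   # rows with response_correctness == "[unassigned]"

share_pct = round(100 * unassigned_rows / total_rows, 2)
remaining_rows = total_rows - unassigned_rows

print(share_pct)       # 1.94
print(remaining_rows)  # 78982
```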

In [7]:
finish_todo("Investigate 1566 [unassigned] in response_correctness")
finish_todo('Investigate assignment_attempt_number and assignment_max_attempts both have 1566 values')

add_todo("Remove 1566 [unassigned] in response_correctness")
add_todo('Remove assignment_attempt_number and assignment_max_attempts')
