In [1]:
%%capture
from functions import *

@register_cell_magic
def markdown(line, cell):
    return md(cell.format(**globals()))

---

# Student Assessment

The Student Assessments dataframe contains information about each student and the assessments they took during the module.

In [6]:
student_assessment.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,1752,11391,18,0,78.0
1,1752,28400,22,0,70.0
2,1752,31604,17,0,72.0
3,1752,32885,26,0,69.0
4,1752,38053,19,0,79.0


---

## Student Assessment Contents

* **id_assessment**: The assessment ID is the unique identifier for the assessment the student took.
* **id_student**: The student ID is the unique identifier for the student who took the assessment.
* **date_submitted**: The date submitted is the day the student submitted the assessment, counted in days relative to the start date of the module.
* **is_banked**: Indicates whether the assessment result was transferred from a previous presentation.
    - A value of 1 for is_banked indicates that the student took the course previously, but since it is their first score that is retained it is not a confounder, so rows with is_banked equal to 1 will be kept.
    - is_banked carries no other relevant information, so the column itself can be removed.
* **score**: The student's score on the assessment, which is the value we are trying to predict.
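
The is_banked decision described above (keep the banked rows, drop the column) can be sketched as follows. This is a minimal illustration on a small hand-made frame, not the notebook's data; `toy` and `toy_no_bank` are hypothetical names and the values are made up.

```python
import pandas as pd

# Toy stand-in for student_assessment (illustrative values only)
toy = pd.DataFrame({
    'id_assessment': [1752, 1752, 1754],
    'id_student': [11391, 28400, 11391],
    'date_submitted': [18, 22, 127],
    'is_banked': [0, 1, 0],
    'score': [78.0, 70.0, 65.0],
})

# Keep every row, banked or not, and drop only the is_banked column
toy_no_bank = toy.drop(columns='is_banked')
```

Dropping by column name leaves the row count unchanged, so banked scores stay in the data.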

---

## Student Assessments Information

**Size**

In [7]:
md(f'''* Number of Rows: {len(student_assessment)}
* Number of Columns: {len(student_assessment.columns)}''')

* Number of Rows: 173912
* Number of Columns: 5

**Data Types**

In [8]:
student_assessment.dtypes

id_assessment       int64
id_student          int64
date_submitted      int64
is_banked           int64
score             float64
dtype: object

* id_student and id_assessment are both categorical values, so they should be converted to objects.

In [9]:
# converting the ID columns to object so they are treated as categorical
# (both are already int64, so no intermediate integer cast is needed here)
student_assessment = student_assessment.astype({'id_assessment': object, 'id_student': object})

**Null Values**

In [10]:
# counts the null values in each column
student_assessment.isnull().sum()

id_assessment       0
id_student          0
date_submitted      0
is_banked           0
score             173
dtype: int64

In [12]:
null_score = student_assessment['score'].isnull().sum()

In [13]:
%%markdown

* We have {null_score} null values for score, which we are trying to predict.


* We have 173 null values for score, which we are trying to predict.
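
To put the null count in context, the share of rows with a missing score can be worked out from the two figures reported above (173 nulls out of 173912 rows). The variable names below are illustrative only.

```python
# Figures taken from the Size and Null Values outputs above
n_rows = 173912
n_null = 173

# Fraction of assessment rows with a missing score (roughly 0.1%)
null_share = n_null / n_rows
```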


In [26]:
# rows where the score is missing (isnull() already returns a boolean mask)
NaN_scores = student_assessment.loc[student_assessment['score'].isnull()]

In [48]:
NaN_scores

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
215,1752,721259,22,0,
937,1754,260355,127,0,
2364,1760,2606802,180,0,
3358,14984,186780,77,0,
3914,14984,531205,26,0,
...,...,...,...,...,...
148929,34903,582670,241,0,
159251,37415,610738,87,0,
166390,37427,631786,221,0,
169725,37435,648110,62,0,


In [52]:
students_w_NaN_scores = pd.DataFrame()

In [59]:
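# collect the student_info rows for each student with a missing score
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
students_w_NaN_scores = pd.concat(
    [student_info.loc[student_info['id_student'] == sid] for sid in NaN_scores['id_student']]
)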

In [60]:
students_w_NaN_scores

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
227,AAA,2013J,721259,F,South Region,Lower Than A Level,50-60%,55<=,0,120,N,Withdrawn
638,AAA,2014J,721259,F,South Region,Lower Than A Level,50-60%,55<=,1,60,N,Withdrawn
108,AAA,2013J,260355,F,London Region,A Level or Equivalent,80-90%,35-55,0,60,N,Withdrawn
466,AAA,2014J,260355,F,London Region,A Level or Equivalent,80-90%,35-55,1,120,N,Withdrawn
733,AAA,2014J,2606802,M,North Region,A Level or Equivalent,60-70%,0-35,0,60,N,Fail
...,...,...,...,...,...,...,...,...,...,...,...,...
28279,FFF,2014J,582670,M,South Region,Lower Than A Level,90-100%,35-55,0,60,Y,Fail
30917,GGG,2013J,610738,F,London Region,Lower Than A Level,10-20,35-55,0,30,N,Fail
31682,GGG,2014B,631786,F,East Anglian Region,A Level or Equivalent,0-10%,0-35,0,30,Y,Pass
32167,GGG,2014J,648110,F,London Region,Lower Than A Level,10-20,0-35,0,60,N,Withdrawn


In [61]:
students_w_NaN_scores['final_result'].value_counts()

Withdrawn      104
Fail            82
Pass            40
Distinction      1
Name: final_result, dtype: int64

In [63]:
student_assessment.loc[student_assessment['id_student'] == 631786]

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
165061,37425,631786,85,0,84.0
165806,37426,631786,138,0,86.0
166390,37427,631786,221,0,
166841,37428,631786,195,0,80.0
167516,37429,631786,199,0,100.0
168078,37430,631786,199,0,100.0
168481,37431,631786,199,0,100.0


In [62]:
NaN_students_all_exams = pd.DataFrame()

In [None]:
# all assessment rows for every student who has at least one missing score
NaN_students_all_exams = student_assessment.loc[student_assessment['id_student'].isin(NaN_scores['id_student'])]

**Merged Assessment/Student_info dataframes**

To drop the assessments of the students we removed for their number of previous attempts, we merge the assessments and student info dataframes and look at the rows that appear in only one of them.

In [None]:
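# merge assessments with student info using a full outer join on their common columns; indicator=True adds a _merge column
# (assumes student_assessment also carries code_module and code_presentation, which the head() output above does not show)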
merged_si_assm = student_assessment.merge(student_info, how='outer', on=['id_student', 'code_module', 'code_presentation'], indicator=True)
merged_si_assm.head()

In the `_merge` column, the left side is the assessments dataframe and the right side is the student info dataframe. A label of right_only marks a student who has no assessments, and a label of left_only marks an assessment that does not match up with a student.
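
The meaning of the `_merge` labels can be seen on a tiny example. The two frames below are hypothetical miniatures with made-up values, and they join on `id_student` alone for brevity rather than on the three keys used in the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
assessments_toy = pd.DataFrame({'id_student': [1, 2, 4], 'score': [70.0, 85.0, 60.0]})
info_toy = pd.DataFrame({'id_student': [1, 2, 3], 'region': ['North', 'South', 'London']})

# Full outer join; indicator=True adds the _merge column
merged_toy = assessments_toy.merge(info_toy, how='outer', on='id_student', indicator=True)

# left_only: an assessment with no matching student (student 4)
left_only_ids = merged_toy.loc[merged_toy['_merge'] == 'left_only', 'id_student'].tolist()
# right_only: a student with no assessments (student 3)
right_only_ids = merged_toy.loc[merged_toy['_merge'] == 'right_only', 'id_student'].tolist()
```

Students 1 and 2 appear in both frames and are labelled `both`.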

In [None]:
# rows found only in assessments (left_only) and rows found only in student_info (right_only)
only_assessments = merged_si_assm.loc[merged_si_assm['_merge']=='left_only']
only_student_info = merged_si_assm.loc[merged_si_assm['_merge']=='right_only']

**Assessments that do not map to students**:

In [None]:
only_assessments.head()

**Students without any test scores**:

In [None]:
only_student_info.head()

In [None]:
md(f'''
We have {len(only_assessments)} rows found only in assessments, which belong to students we eliminated for having made previous attempts, and {len(only_student_info)} rows found only in student_info, which are students for whom we have no test scores.

Both groups can be dropped from this dataframe, since here we are only analyzing test scores.
''')

In [None]:
# merging assessments with the original student data dataframe to make sure that the missing students are the ones we removed.
merged_test = student_assessment.merge(student_info, how='outer', on=['id_student', 'code_module', 'code_presentation'], indicator=True)

# keeping only the entries where num_of_prev_attempts == 0
merged_test = merged_test[merged_test['num_of_prev_attempts'] == 0]

# checking whether any rows found only in assessments (left_only) remain; an empty result means none do.
# note: left_only rows have NaN in num_of_prev_attempts, so the filter above already excludes them
merged_test.loc[merged_test['_merge']=='left_only']

In [None]:
# removing any row with NaN values in id_assessment (student without assessments) or region (assessment without a student)
merged_si_assm = merged_si_assm.dropna(subset=['id_assessment', 'region'])

In [None]:
# reordering dataframe columns to group like data
merged_si_assm = merged_si_assm[[
    'code_module', 'code_presentation', 'id_student',
    'region', 'imd_band', 'age_band', 'gender', 'highest_education', 'disability', 'final_result',
    'id_assessment', 'assessment_type', 'date_submitted', 'date', 'weight', 'score',
]]

In [None]:
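# converting the data types back: the outer merge introduced NaN values, which turned the ID columns into floats,
# so cast to int first and then to object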
merged_si_assm = merged_si_assm.astype({'id_assessment': int, 'id_student': int})
merged_si_assm = merged_si_assm.astype({'id_assessment': object, 'id_student': object})
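
The two-step cast above is needed because an outer merge fills unmatched rows with NaN, which promotes integer columns to float. A minimal sketch on hypothetical toy frames (names and values are illustrative only):

```python
import pandas as pd

left = pd.DataFrame({'id_student': [1, 2], 'id_assessment': [10, 11]})
right = pd.DataFrame({'id_student': [1, 3], 'region': ['North', 'South']})

merged = left.merge(right, how='outer', on='id_student')
# Student 3 has no assessment, so id_assessment now holds NaN and is float64
dtype_after_merge = merged['id_assessment'].dtype

# Drop the unmatched rows, then cast float -> int -> object
cleaned = merged.dropna(subset=['id_assessment', 'region'])
cleaned = cleaned.astype({'id_assessment': int}).astype({'id_assessment': object})
```

Casting straight from float to object would keep values such as `10.0`; going through `int` first restores whole-number IDs.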

In [None]:
# reset the index (assign the result, otherwise the reset is discarded)
merged_si_assm = merged_si_assm.reset_index(drop=True)
merged_si_assm.head()

In [None]:
student_assessment = merged_si_assm

**Unique Counts**

In [None]:
student_assessment.nunique()

**Unique Categorical Values**

In [None]:
unique_vals(student_assessment)

**Duplicate Values**

In [None]:
duplicate_vals(student_assessment)
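
`duplicate_vals` comes from the notebook's `functions` module and is not shown here. As a rough sketch only, one way a comparable check could be done with pandas' built-in `duplicated` method is shown below; `count_duplicate_rows` is a hypothetical helper, not the notebook's function, and the data is made up.

```python
import pandas as pd

def count_duplicate_rows(df: pd.DataFrame) -> int:
    """Return the number of rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())

# Toy frame in which the second row repeats the first
toy_dupes = pd.DataFrame({'id_student': [1, 1, 2], 'score': [70.0, 70.0, 85.0]})
n_dupes = count_duplicate_rows(toy_dupes)
```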

**Statistics**

In [None]:
student_assessment.describe()