In [1]:
from ipynb.fs.full.Student_Info import stud_info
from functions import *


code_module: ['AAA' 'BBB' 'CCC' 'DDD' 'EEE' 'FFF' 'GGG']

code_presentation: ['2013J' '2014J' '2013B' '2014B']

id_student: [11391 28400 30268 ... 2648187 2679821 2684003]

region: ['East Anglian Region' 'Scotland' 'North Western Region'
 'South East Region' 'West Midlands Region' 'Wales' 'North Region'
 'South Region' 'Ireland' 'South West Region' 'East Midlands Region'
 'Yorkshire Region' 'London Region']

imd_band: ['90-100%' '20-30%' '30-40%' '50-60%' '80-90%' '70-80%' nan '60-70%'
 '40-50%' '10-20' '0-10%']

age_band: ['55<=' '35-55' '0-35']

gender: ['M' 'F']

highest_education: ['HE Qualification' 'A Level or Equivalent' 'Lower Than A Level'
 'Post Graduate Qualification' 'No Formal quals']

disability: ['N' 'Y']

final_result: ['Pass' 'Withdrawn' 'Fail' 'Distinction']

['90-100%' '20-30%' '30-40%' '50-60%' '80-90%' '70-80%' nan '60-70%'
 '40-50%' '10-20%' '0-10%']


---

<h2>Assessments and Student Assessments Dataframes</h2>

---

<h3>Assessments</h3>

The assessments dataframe contains information about the unique assessments in each code module and presentation.

In [123]:
assessments.head()

Unnamed: 0,code_module,code_presentation,id_assessment,assessment_type,date,weight
0,AAA,2013J,1752,TMA,19.0,10.0
1,AAA,2013J,1753,TMA,54.0,20.0
2,AAA,2013J,1754,TMA,117.0,20.0
3,AAA,2013J,1755,TMA,166.0,20.0
4,AAA,2013J,1756,TMA,215.0,30.0


---

<h4>Assessments Contents</h4>

* <b>code_module</b>: The code module represents the code name of the course the assessment was held for.
* <b>code_presentation</b>: The code presentation identifies the presentation of the module that the assessment was held for.
* <b>id_assessment</b>: The assessment ID is the unique identifier for each assessment.
* <b>assessment_type</b>: The assessment type represents the kind of assessment it was.
    - There are three assessment types:
        * TMA: Tutor Marked Assessment
        * CMA: Computer Marked Assessment
        * Exam: The Final Exam
* <b>date</b>: The date is the number of days from the start of the course to the day the assessment took place.
* <b>weight</b>: The weight is the weighted value of the assessment. Exams should have a weight of 100, while the rest of the assessments should add up to 100 in total.
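
The weight rule described above can be checked with a grouped sum. The following is a minimal sketch on a toy dataframe, not the notebook's own check: the rows are copied from the AAA 2013J entries shown in this section, and the `kind` column and `weight_totals` name are introduced here only for illustration.

```python
import pandas as pd

# Toy rows for AAA 2013J: five TMAs plus the exam, with the weights shown in this notebook
toy = pd.DataFrame({
    "code_module": ["AAA"] * 6,
    "code_presentation": ["2013J"] * 6,
    "assessment_type": ["TMA", "TMA", "TMA", "TMA", "TMA", "Exam"],
    "weight": [10.0, 20.0, 20.0, 20.0, 30.0, 100.0],
})

# Label each row as exam or coursework (hypothetical helper column)
toy["kind"] = toy["assessment_type"].eq("Exam").map({True: "exam", False: "coursework"})

# Sum the weights per module, presentation and kind, then spread kind into columns
weight_totals = (
    toy.groupby(["code_module", "code_presentation", "kind"])["weight"]
    .sum()
    .unstack("kind")
)
print(weight_totals)
```

Any module presentation whose exam or coursework total differs from 100 would stand out in this table.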

---

<h3>Student Assessments</h3>

The Student Assessments dataframe contains information about each student and the assessments they took during the module.

In [124]:
student_assessment.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,1752,11391,18,0,78.0
1,1752,28400,22,0,70.0
2,1752,31604,17,0,72.0
3,1752,32885,26,0,69.0
4,1752,38053,19,0,79.0


---

<h4>Student Assessment Contents</h4>

* <b>id_assessment</b>: The assessment ID is the unique identifier for the assessment the student took.
* <b>id_student</b>: The student ID is the unique identifier for the student who took the assessment.
* <b>date_submitted</b>: The date submitted is the number of days from the start of the module to the day the student submitted the assessment.
* <b>is_banked</b>: Indicates whether the assessment result was transferred (banked) from a previous presentation.
    - is_banked does indicate that the student took the course previously, but since it is their first score that is retained, it is not a confounder, and entries with a 1 for is_banked will be kept.
    - is_banked carries no other relevant information, so the column can be removed.

In [125]:
# remove is_banked column from dataframe
student_assessment = student_assessment.drop(columns=['is_banked'])

```{note}
Since we are only interested in information that is directly relevant to our students, and since assessments just contains extra information about our student assessments, we will merge the assessments and student_assessment dataframes.
```

---

<h4>Assessments and Student Assessments Merged Dataframe:</h4>

In [126]:
# merges dataframes student_assessment with assessments with a full outer join on their common ID id_assessment
# creates a column _merge which tells you if the id_assessment was found in one or both dataframes
merged_assessments = student_assessment.merge(assessments, how='outer', on=['id_assessment'], indicator=True)
merged_assessments.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,_merge
0,1752,11391.0,18.0,78.0,AAA,2013J,TMA,19.0,10.0,both
1,1752,28400.0,22.0,70.0,AAA,2013J,TMA,19.0,10.0,both
2,1752,31604.0,17.0,72.0,AAA,2013J,TMA,19.0,10.0,both
3,1752,32885.0,26.0,69.0,AAA,2013J,TMA,19.0,10.0,both
4,1752,38053.0,19.0,79.0,AAA,2013J,TMA,19.0,10.0,both


* The new _merge column tells us whether each row was found in both dataframes (both), only in the left dataframe, student_assessment (left_only), or only in the right dataframe, assessments (right_only).
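
As a small illustration of how the indicator column behaves, here is a sketch on toy dataframes. The names `left`, `right` and `toy_merged` are introduced only for this example; the IDs are taken from rows shown in this notebook.

```python
import pandas as pd

# Left side: two student submissions for assessment 1752
left = pd.DataFrame({
    "id_assessment": [1752, 1752],
    "id_student": [11391, 28400],
    "score": [78.0, 70.0],
})
# Right side: assessment 1752 plus exam 1757, which has no submissions
right = pd.DataFrame({
    "id_assessment": [1752, 1757],
    "assessment_type": ["TMA", "Exam"],
})

# Full outer join; indicator=True adds the _merge column
toy_merged = left.merge(right, how="outer", on="id_assessment", indicator=True)
print(toy_merged[["id_assessment", "id_student", "_merge"]])

# Count how many rows fall into each category
print(toy_merged["_merge"].value_counts())
```

The exam row has no match on the left side, so it is labelled right_only and its student columns are filled with NaN.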

<b>Rows that do not map:</b>

In [127]:
merged_assessments.loc[merged_assessments['_merge'] != 'both']

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,_merge
173912,1757,,,,AAA,2013J,Exam,268.0,100.0,right_only
173913,1763,,,,AAA,2014J,Exam,269.0,100.0,right_only
173914,14990,,,,BBB,2013B,Exam,240.0,100.0,right_only
173915,15002,,,,BBB,2013J,Exam,268.0,100.0,right_only
173916,15014,,,,BBB,2014B,Exam,234.0,100.0,right_only
173917,15025,,,,BBB,2014J,Exam,262.0,100.0,right_only
173918,40087,,,,CCC,2014B,Exam,241.0,100.0,right_only
173919,40088,,,,CCC,2014J,Exam,269.0,100.0,right_only
173920,30713,,,,EEE,2013J,Exam,235.0,100.0,right_only
173921,30718,,,,EEE,2014B,Exam,228.0,100.0,right_only


These rows all have entries in the assessments dataframe but have no match in the student_assessment dataframe. This indicates that no students in our data took these exams, and so we will drop them, and then the merge column since it will have no more useful information.

In [128]:
# remove tests that students did not take
assessments = merged_assessments.dropna(subset=['id_student'])

# reset the index to be consecutive again
assessments = assessments.reset_index(drop=True)

In [129]:
# drop the merge column since it is no longer of use
assessments = assessments.drop(columns=['_merge'])

---

<h4>Assessments Information</h4>

<b>Updated Dataframe</b>

In [130]:
assessments.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight
0,1752,11391.0,18.0,78.0,AAA,2013J,TMA,19.0,10.0
1,1752,28400.0,22.0,70.0,AAA,2013J,TMA,19.0,10.0
2,1752,31604.0,17.0,72.0,AAA,2013J,TMA,19.0,10.0
3,1752,32885.0,26.0,69.0,AAA,2013J,TMA,19.0,10.0
4,1752,38053.0,19.0,79.0,AAA,2013J,TMA,19.0,10.0


<b>Size</b>

In [131]:
md(f'''* Number of Rows: {len(assessments)}
* Number of Columns: {len(assessments.columns)}''')

* Number of Rows: 173912
* Number of Columns: 9

<b>Data Types</b>

In [132]:
assessments.dtypes

id_assessment          int64
id_student           float64
date_submitted       float64
score                float64
code_module           object
code_presentation     object
assessment_type       object
date                 float64
weight               float64
dtype: object

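* id_student and id_assessment are both categorical identifiers rather than quantities, so they should be converted to objects. id_student became a float during the outer merge because of the NaN rows, so it is first cast back to int.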

In [133]:
# converting the data types
assessments = assessments.astype({'id_assessment': int, 'id_student': int})
assessments = assessments.astype({'id_assessment': object, 'id_student': object})
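
The reason for casting to int before object can be seen on a toy series. This is only a sketch; the names `ids`, `direct` and `via_int` are introduced here for illustration.

```python
import pandas as pd

# IDs as they look after an outer merge introduced NaN and forced float64
ids = pd.Series([11391.0, 28400.0], name="id_student")

# Casting straight to object keeps the float values (11391.0)
direct = ids.astype(object)

# Casting to int first drops the trailing .0 before storing as object
via_int = ids.astype(int).astype(object)

print(direct.iloc[0], via_int.iloc[0])
```

Without the intermediate int cast, the identifiers would keep their trailing `.0`, which makes them awkward to compare with IDs in other dataframes.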

<b>Null Values</b>

In [134]:
# prints the number of null values in each column
assessments.isnull().sum()

id_assessment          0
id_student             0
date_submitted         0
score                173
code_module            0
code_presentation      0
assessment_type        0
date                   0
weight                 0
dtype: int64

* We have 2,873 null data points for assessment date. The documentation of this dataset states that if the exam date is missing, then the exam takes place at the end of the last presentation week. We can find this information in the courses dataframe.

In [135]:
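# adding the dates for the null test dates
for index, row in assessments[assessments['date'].isna()].iterrows():
    # look up the presentation length for this row's module and presentation
    length = courses.loc[(courses['code_module'] == row['code_module']) & (courses['code_presentation'] == row['code_presentation']), 'module_presentation_length']
    # .iloc[0] extracts the single matching value, since .at expects a scalar rather than a Series
    assessments.at[index, 'date'] = length.iloc[0]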

# reprinting to ensure it worked
assessments.isnull().sum()

id_assessment          0
id_student             0
date_submitted         0
score                173
code_module            0
code_presentation      0
assessment_type        0
date                   0
weight                 0
dtype: int64
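
The row-by-row loop above could also be written without iteration. The following is a sketch of one vectorized alternative using a left merge and `fillna`, run on toy stand-ins; the `module_presentation_length` values and the `toy_` names are made up for illustration, and the approach assumes the dataframe has a default consecutive index.

```python
import pandas as pd

# Toy stand-ins for assessments and courses (values are illustrative only)
toy_assessments = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2013B"],
    "date": [19.0, None, None],
})
toy_courses = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2013B"],
    "module_presentation_length": [268, 240],
})

keys = ["code_module", "code_presentation"]

# A left merge keeps the row order of toy_assessments and attaches each row's presentation length
lengths = toy_assessments.merge(toy_courses, on=keys, how="left")["module_presentation_length"]

# Fill only the missing dates; existing dates are left untouched
toy_assessments["date"] = toy_assessments["date"].fillna(lengths)
print(toy_assessments)
```

Because `fillna` aligns on the index, a dataframe with a non-consecutive index would need `reset_index(drop=True)` first.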

* There are 173 null values for score. Unfortunately, these records are not of much interest to us, since score is the variable whose relationships we are trying to find, so we will discard them. This leaves us with no null data in assessments.

In [136]:
# removes any entry where the score is NaN
assessments = assessments.dropna(subset=['score'])

# reprinting to ensure it worked
assessments.isnull().sum()

id_assessment        0
id_student           0
date_submitted       0
score                0
code_module          0
code_presentation    0
assessment_type      0
date                 0
weight               0
dtype: int64

<b>Merged Assessment/Student_info dataframes</b>

To remove the assessments of students we previously dropped from student info because of their number of previous attempts, we merge assessments with student info and look at the rows that do not match.

In [137]:
# merged 'student info/assessments' with a full outer join on their common columns
merged_sia = assessments.merge(stud_info, how='outer', on=['id_student', 'code_module', 'code_presentation'], indicator=True)
merged_sia.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
0,1752,11391.0,18.0,78.0,AAA,2013J,TMA,19.0,10.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
1,1753,11391.0,53.0,85.0,AAA,2013J,TMA,54.0,20.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
2,1754,11391.0,115.0,80.0,AAA,2013J,TMA,117.0,20.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
3,1755,11391.0,164.0,85.0,AAA,2013J,TMA,166.0,20.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
4,1756,11391.0,212.0,82.0,AAA,2013J,TMA,215.0,30.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both


For this merge column, the right side is the student info dataframe and the left side is assessments. A label of right_only means there is a student who has no assessments; a label of left_only means there is an assessment that does not match up with a student.

In [138]:
# rows found only in assessments (left_only) and rows found only in stud_info (right_only)
only_assessments = merged_sia.loc[merged_sia['_merge']=='left_only']
only_stud_info = merged_sia.loc[merged_sia['_merge']=='right_only']

<b>Assessments that do not map to students</b>:

In [139]:
only_assessments.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
1671,1758,603861.0,-1.0,61.0,AAA,2014J,TMA,19.0,10.0,,,,,,,,,,left_only
1672,1759,603861.0,-1.0,56.0,AAA,2014J,TMA,54.0,20.0,,,,,,,,,,left_only
1673,1760,603861.0,-1.0,58.0,AAA,2014J,TMA,117.0,20.0,,,,,,,,,,left_only
1674,1761,603861.0,-1.0,69.0,AAA,2014J,TMA,166.0,20.0,,,,,,,,,,left_only
1675,1762,603861.0,-1.0,71.0,AAA,2014J,TMA,215.0,30.0,,,,,,,,,,left_only


<b>Students without any test scores</b>:

In [140]:
only_stud_info.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
173739,,30268,,,AAA,2013J,,,,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,right_only
173740,,135335,,,AAA,2013J,,,,East Anglian Region,20-30%,0-35,F,Lower Than A Level,N,Withdrawn,-29.0,30.0,right_only
173741,,281589,,,AAA,2013J,,,,North Western Region,30-40%,0-35,M,HE Qualification,N,Fail,-50.0,,right_only
173742,,346843,,,AAA,2013J,,,,Scotland,50-60%,35-55,F,HE Qualification,N,Fail,-44.0,,right_only
173743,,354858,,,AAA,2013J,,,,South Region,90-100%,35-55,M,HE Qualification,N,Withdrawn,-32.0,5.0,right_only


In [141]:
md(f'''
    We have {len(only_assessments)} values in only assessments, which map to students who had made previous attempts, and {len(only_stud_info)} values in only student_info, which means we have students for whom we have no test scores.
    We can drop both of these which are missing values for the purpose of this dataframe since we are just analyzing test scores
    ''')


    We have 20211 values in only assessments, which map to students who had made previous attempts, and 3162 values in only student_info, which means we have students for whom we have no test scores.
    We can drop both of these which are missing values for the purpose of this dataframe since we are just analyzing test scores
    

In [142]:
# merging assessments with the original student data dataframe to make sure that the missing students are the ones we removed.
merged_test = assessments.merge(student_info, how='outer', on=['id_student', 'code_module', 'code_presentation'], indicator=True)

# keeping only entries where num_of_prev_attempts == 0 (students on their first attempt)
merged_test = merged_test[merged_test['num_of_prev_attempts'] == 0]

# checking if any rows found only in assessments (left_only) remain. An empty result is taken to mean that the assessments without students belong to students with num_of_prev_attempts > 0
merged_test.loc[merged_test['_merge']=='left_only']

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,_merge
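
A more direct way to confirm this is to look up the unmatched assessments in the original student data and inspect their previous attempts. The sketch below does this on toy stand-ins: student 603861 appears in the unmatched rows above, but the `num_of_prev_attempts` values and the `toy_` names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for only_assessments (assessments with no match in stud_info)
toy_only_assessments = pd.DataFrame({
    "id_student": [603861, 603861],
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2014J", "2014J"],
})
# Toy stand-in for the original student_info (attempt counts are illustrative)
toy_student_info = pd.DataFrame({
    "id_student": [603861, 11391],
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2014J", "2013J"],
    "num_of_prev_attempts": [1, 0],
})

keys = ["id_student", "code_module", "code_presentation"]

# Attach each unmatched assessment's student record from the original data
check = toy_only_assessments.merge(toy_student_info, on=keys, how="left")

# Rows that are missing from student_info or belong to a first-attempt student
unexplained = check[check["num_of_prev_attempts"].isna() | (check["num_of_prev_attempts"] == 0)]
print(len(unexplained))
```

If `unexplained` is empty, every unmatched assessment belongs to a student who had at least one previous attempt.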


In [143]:
# removing any row with NaN values in id_assessment (students without assessments) or region (assessments without a student)
assessments = merged_sia.dropna(subset=['id_assessment', 'region'])

In [144]:
# reordering dataframe columns to group like data
assessments = assessments[['code_module', 'code_presentation', 'id_student', 'region', 'imd_band', 'age_band', 'gender', 'highest_education', 'disability', 'final_result', 'id_assessment', 'assessment_type', 'date_submitted', 'date', 'weight', 'score']]

In [145]:
# converting the data types back
assessments = assessments.astype({'id_assessment': int, 'id_student': int})
assessments = assessments.astype({'id_assessment': object, 'id_student': object})

In [146]:
# reset the index and keep the result, since reset_index returns a new dataframe
assessments = assessments.reset_index(drop=True)
assessments.head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,id_assessment,assessment_type,date_submitted,date,weight,score
0,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,1752,TMA,18.0,19.0,10.0,78.0
1,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,1753,TMA,53.0,54.0,20.0,85.0
2,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,1754,TMA,115.0,117.0,20.0,80.0
3,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,1755,TMA,164.0,166.0,20.0,85.0
4,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,1756,TMA,212.0,215.0,30.0,82.0


<b>Unique Counts</b>

In [147]:
assessments.nunique()

code_module              7
code_presentation        4
id_student           21040
region                  13
imd_band                10
age_band                 3
gender                   2
highest_education        5
disability               2
final_result             4
id_assessment          188
assessment_type          3
date_submitted         299
date                    74
weight                  24
score                  101
dtype: int64

<b>Unique Categorical Values</b>

In [148]:
unique_vals(assessments)

code_module: ['AAA' 'BBB' 'CCC' 'DDD' 'EEE' 'FFF' 'GGG']

code_presentation: ['2013J' '2014J' '2013B' '2014B']

id_student: [11391 28400 31604 ... 692171 650630 573320]

region: ['East Anglian Region' 'Scotland' 'South East Region'
 'West Midlands Region' 'Wales' 'North Western Region' 'North Region'
 'South Region' 'Ireland' 'South West Region' 'East Midlands Region'
 'Yorkshire Region' 'London Region']

imd_band: ['90-100%' '20-30%' '50-60%' '80-90%' '30-40%' '70-80%' nan '60-70%'
 '40-50%' '10-20' '0-10%']

age_band: ['55<=' '35-55' '0-35']

gender: ['M' 'F']

highest_education: ['HE Qualification' 'A Level or Equivalent' 'Lower Than A Level'
 'Post Graduate Qualification' 'No Formal quals']

disability: ['N' 'Y']

final_result: ['Pass' 'Withdrawn' 'Fail' 'Distinction']

id_assessment: [1752 1753 1754 1755 1756 1758 1759 1760 1761 1762 14984 14985 14986 14987
 14988 14989 14991 14992 14993 14994 14995 14996 14997 14998 14999 15000
 15001 15003 15004 15005 15006 15007 15008 15009 150

<b>Duplicate Values:</b>

In [149]:
duplicate_vals(assessments)

No Duplicate Values
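
The `duplicate_vals` helper comes from the project's `functions` module and is not shown here. As a rough sketch of what such a check might involve (an assumption, not the helper's actual implementation), pandas' `duplicated` can count fully repeated rows as well as repeated student/assessment pairs; the toy rows are taken from this notebook.

```python
import pandas as pd

# Toy rows: one student with two assessments, another student with one
toy = pd.DataFrame({
    "id_student": [11391, 11391, 28400],
    "id_assessment": [1752, 1753, 1752],
    "score": [78.0, 85.0, 70.0],
})

# Rows that are identical to an earlier row in every column
n_full_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier (student, assessment) pair, even if the score differs
n_key_dupes = int(toy.duplicated(subset=["id_student", "id_assessment"]).sum())

print(n_full_dupes, n_key_dupes)
```

Checking the key pair separately is useful because a student should have at most one score per assessment.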


<b>Statistics</b>

In [150]:
assessments.describe()

Unnamed: 0,date_submitted,date,weight,score
count,153528.0,153528.0,153528.0,153528.0
mean,117.590785,133.821622,12.810627,76.256943
std,70.647748,79.053746,18.040581,18.646354
min,-11.0,12.0,0.0,0.0
25%,52.0,54.0,0.0,66.0
50%,117.0,131.0,9.0,80.0
75%,173.0,222.0,18.0,90.0
max,594.0,269.0,100.0,100.0


In [151]:
assessments_final = assessments