In [72]:
from functions import *

@register_cell_magic
def markdown(line, cell):
    return md(cell.format(**globals()))

<a id='StudentInfo'></a>

<h2>Student Info and Student Registration Dataframes</h2>

---

<h4>Student Info</h4>

The student info dataframe contains information about students including the module and presentation they took, demographic information and the final result of their studies.

In [73]:
# looking at the student_info dataframe
student_info.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass


<h4>Contents</h4>

* <b>code_module</b>: The code module represents the course the student is taking.
* <b>code_presentation</b>: The code presentation is the year and semester in which the student is taking the course.
* <b>id_student</b>: The student ID is a unique identifier for each student.
* <b>gender</b>: Gender is recorded as a binary value: 'M' for students who identify as male and 'F' for students who identify as female.
* <b>region</b>: Region represents the location of the student when they took the module. All regions are in the UK or Ireland.
* <b>highest_education</b>: Highest education is the student's highest level of formal academic achievement.
    - Education levels in order from least to most formal education: 
        - No formal quals (qualifications)
        - Lower than A Level, which is nearly but not quite analogous to below high school level
        - A Level or equivalent, which is nearly analogous to high school level, but closer to college ready
        - HE Qualification, which stands for higher education qualification
        - Post Graduate Qualification
* <b>imd_band</b>: The imd_band represents the Index of Multiple Deprivation (IMD) band, a measure commonly used in the UK to quantify poverty or deprivation in an area. The lower the band, the more deprived the area is.
* <b>age_band</b>: There are only three bins for age: 0-35, 35-55 and 55 or over.
* <b>num_of_prev_attempts</b>: The number of times the student has attempted the course previously.
* <b>studied_credits</b>: The number of credits for the module the student is taking.
* <b>disability</b>: Disability status is a binary value: 'Y' if the student identifies as having a disability and 'N' if the student does not.
* <b>final_result</b>: The final result is the student's overall result in the class.
    - Possible results include:
         - Pass: The student passed the course
         - Fail: The student did not pass the course
         - Withdrawn: The student withdrew before the course term ended
         - Distinction: The student passed the class with distinction
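
Because highest_education has a natural order, it can be useful to encode that order explicitly. The following is a minimal sketch, not part of the original analysis: `toy_info` is a hypothetical stand-in for student_info, and the level names are spelled as they appear in the data.

```python
import pandas as pd

# Education levels from least to most formal, spelled as they appear in the data
education_order = [
    "No Formal quals",
    "Lower Than A Level",
    "A Level or Equivalent",
    "HE Qualification",
    "Post Graduate Qualification",
]

# Hypothetical toy frame standing in for student_info
toy_info = pd.DataFrame({
    "highest_education": [
        "HE Qualification",
        "Lower Than A Level",
        "A Level or Equivalent",
    ]
})

# An ordered categorical makes comparisons and sorting follow the list above
toy_info["highest_education"] = pd.Categorical(
    toy_info["highest_education"], categories=education_order, ordered=True
)

# Students with at least an A Level or equivalent
at_least_a_level = toy_info[toy_info["highest_education"] >= "A Level or Equivalent"]
print(at_least_a_level)
```

With an ordered categorical, sorting by highest_education follows the education order rather than alphabetical order.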

---

<h3>Student Registration</h3>

The student registration dataframe contains the dates on which students registered for and, if applicable, unregistered from the module.

<h4>Contents</h4>

* <b>date_registration</b> is the day the student registered for the module, counted relative to the start of the module. A negative value means the student registered that many days before the module began.
* <b>date_unregistration</b> is the day the student unregistered from the module, counted relative to the start of the module. It is empty for students who did not unregister.
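
As an illustration of how these relative dates can be read, here is a small sketch. The frame `toy_reg` and the derived columns `unregistered` and `days_registered` are hypothetical names introduced for this example; the two rows reuse values from the first records of the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for student_registration
toy_reg = pd.DataFrame({
    "id_student": [11391, 30268],
    "date_registration": [-159.0, -92.0],
    "date_unregistration": [float("nan"), 12.0],
})

# True where the student unregistered at some point
toy_reg["unregistered"] = toy_reg["date_unregistration"].notna()

# Days between registering and unregistering (NaN if the student never unregistered)
toy_reg["days_registered"] = (
    toy_reg["date_unregistration"] - toy_reg["date_registration"]
)
print(toy_reg)
```

Student 30268 registered 92 days before the start and unregistered on day 12, so they were registered for 104 days in total.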


In [3]:
# looking at the student_registration dataframe
student_registration.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


```{note}
* The student registration dataframe matches 1:1 with the student_info dataframe, adding only the date the student registered and, if applicable, the date they unregistered, so we will merge these two dataframes.
* The number of previous attempts may be interesting to analyze on its own, to see how students who took the course multiple times behaved differently on their second or later attempt. Here, however, we are only interested in students on their first attempt, because familiarity with course content is a confounding variable. We will therefore remove students on their second or later attempt, and then remove num_of_prev_attempts, since it will no longer contain any useful information.
* studied_credits will not be part of our analysis, so it may be removed.
* The dataframe columns can then be reordered to keep relevant data together.
```
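
The 1:1 claim can be checked during the merge itself. The sketch below is one possible way to do this, using toy frames (`toy_info`, `toy_reg`) as hypothetical stand-ins for the real dataframes; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

keys = ["code_module", "code_presentation", "id_student"]

# Hypothetical toy frames standing in for student_info and student_registration
toy_info = pd.DataFrame({
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "id_student": [11391, 28400],
    "final_result": ["Pass", "Pass"],
})
toy_reg = pd.DataFrame({
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "id_student": [11391, 28400],
    "date_registration": [-159.0, -53.0],
})

# validate raises pandas.errors.MergeError if the keys are not unique on both sides;
# indicator adds a _merge column showing which frame(s) each row came from
merged = toy_info.merge(
    toy_reg, how="left", on=keys, validate="one_to_one", indicator=True
)

# If the match is truly 1:1, every row is found in both frames
print(merged["_merge"].value_counts())
```

If any row were marked `left_only`, that student would have no registration record.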

In [48]:
# left join and merge student info with student registration
stud_info = student_info.merge(student_registration, how='left', on=['code_module', 'code_presentation', 'id_student'])

# keeping only records where num_of_prev_attempts is 0 (students on their first attempt)
stud_info = stud_info[stud_info['num_of_prev_attempts'] == 0]

# reordering the columns to keep module, region and student data together
# (selecting these columns also drops num_of_prev_attempts and studied_credits)
stud_info = stud_info[['code_module', 'code_presentation', 'id_student', 'region', 'imd_band', 'age_band', 'gender', 'highest_education', 'disability', 'final_result', 'date_registration', 'date_unregistration']]

---

<h4>Student Info Information</h4>

<b>Updated Dataframe</b>

In [49]:
# looking at our now merged dataframe
stud_info.head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,
1,AAA,2013J,28400,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
2,AAA,2013J,30268,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0
3,AAA,2013J,31604,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,
4,AAA,2013J,32885,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,


In [50]:
md(f'''

<b>Size</b>
    
* Number of Rows: {len(stud_info)}
* Number of Columns: {len(stud_info.columns)}

<b>Data Types</b>
''')



<b>Size</b>
    
* Number of Rows: 28421
* Number of Columns: 12

<b>Data Types</b>


In [51]:
# show student info data types
stud_info.dtypes

code_module             object
code_presentation       object
id_student               int64
region                  object
imd_band                object
age_band                object
gender                  object
highest_education       object
disability              object
final_result            object
date_registration      float64
date_unregistration    float64
dtype: object

* id_student is currently an int64 datatype, but would be more appropriate as an object data type since it is categorical.

In [52]:
# changing id_student to the object data type
stud_info['id_student'] = stud_info['id_student'].astype(object)

<b>Null Values:</b>

In [53]:
stud_info.isnull().sum()

code_module                0
code_presentation          0
id_student                 0
region                     0
imd_band                 990
age_band                   0
gender                     0
highest_education          0
disability                 0
final_result               0
date_registration         38
date_unregistration    19809
dtype: int64

* The imd_band variable has 990 null values which we may have to work around. 
* There are 19,809 null values for date_unregistration which represent the students that did not withdraw from the course.
* We have 38 null values for date_registration, and no mention of this in the dataset documentation, so we will treat this as missing data.
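
The reading of null date_unregistration values as "did not withdraw" can be cross-checked against final_result. This is a hedged sketch on made-up rows; `toy` and `has_unreg_date` are hypothetical names, and the real data may contain exceptions that this toy frame does not show.

```python
import pandas as pd

# Hypothetical toy frame standing in for stud_info
toy = pd.DataFrame({
    "final_result": ["Pass", "Withdrawn", "Fail", "Withdrawn"],
    "date_unregistration": [float("nan"), 12.0, float("nan"), 100.0],
})

# True where an unregistration date is recorded
has_unreg_date = toy["date_unregistration"].notna().rename("has_unregistration_date")

# Rows: final result; columns: whether an unregistration date exists
check = pd.crosstab(toy["final_result"], has_unreg_date)
print(check)
```

If the interpretation holds, only the Withdrawn row should have counts in the True column.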

<b>Unique Counts:</b>

In [54]:
stud_info.nunique()

code_module                7
code_presentation          4
id_student             26096
region                    13
imd_band                  10
age_band                   3
gender                     2
highest_education          5
disability                 2
final_result               4
date_registration        311
date_unregistration      406
dtype: int64

<b>Unique Categorical Values</b>

In [55]:
unique_vals(stud_info)

code_module: ['AAA' 'BBB' 'CCC' 'DDD' 'EEE' 'FFF' 'GGG']

code_presentation: ['2013J' '2014J' '2013B' '2014B']

id_student: [11391 28400 30268 ... 2648187 2679821 2684003]

region: ['East Anglian Region' 'Scotland' 'North Western Region'
 'South East Region' 'West Midlands Region' 'Wales' 'North Region'
 'South Region' 'Ireland' 'South West Region' 'East Midlands Region'
 'Yorkshire Region' 'London Region']

imd_band: ['90-100%' '20-30%' '30-40%' '50-60%' '80-90%' '70-80%' nan '60-70%'
 '40-50%' '10-20' '0-10%']

age_band: ['55<=' '35-55' '0-35']

gender: ['M' 'F']

highest_education: ['HE Qualification' 'A Level or Equivalent' 'Lower Than A Level'
 'Post Graduate Qualification' 'No Formal quals']

disability: ['N' 'Y']

final_result: ['Pass' 'Withdrawn' 'Fail' 'Distinction']



In imd_band, the % sign is missing from the 10-20 band. We will add it for consistency and clarity.

In [71]:
# changing all 10-20 values in imd_band to 10-20% for consistency's sake
# (applied to stud_info, the merged dataframe used in the rest of this section)
stud_info.loc[stud_info['imd_band'] == '10-20', 'imd_band'] = '10-20%'
print(stud_info['imd_band'].unique())

['90-100%' '20-30%' '30-40%' '50-60%' '80-90%' '70-80%' nan '60-70%'
 '40-50%' '10-20%' '0-10%']
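
A more general variant of this fix appends the % sign to any band that lacks one, rather than hard-coding the 10-20 value. The sketch below assumes a toy Series, `toy_bands`, as a hypothetical stand-in for the imd_band column.

```python
import pandas as pd

# Hypothetical toy Series standing in for the imd_band column
toy_bands = pd.Series(["90-100%", "10-20", None, "0-10%"], name="imd_band")

# Bands that are present but lack a trailing % sign (missing values are skipped)
missing_pct = toy_bands.notna() & ~toy_bands.str.endswith("%", na=False)

# Append the % sign only where it is missing
toy_bands.loc[missing_pct] = toy_bands.loc[missing_pct] + "%"
print(toy_bands.tolist())
```

Missing values stay missing, so the null count for imd_band is unchanged by this step.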


<b>Duplicate Values</b>

In [56]:
analyze_df(stud_info, dupes=True)

'No Duplicate Values'

In [57]:
md(f'''* The student info dataframe has {len(stud_info)} rows, but only {stud_info['id_student'].nunique()} unique student IDs.
* Since we removed repeat attempts at the same course, this suggests that some students took multiple modules.
        ''')

* The student info dataframe has 28421 rows, but only 26096 unique student IDs.
* Since we removed repeat attempts at the same course, this suggests that some students took multiple modules.
        

In [67]:
stud_info[stud_info['id_student'].duplicated()].head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
11212,CCC,2014J,533512,East Anglian Region,40-50%,0-35,F,Lower Than A Level,N,Fail,-23.0,
12450,CCC,2014J,675116,London Region,30-40%,0-35,F,Lower Than A Level,N,Withdrawn,-98.0,-96.0
13110,DDD,2013B,86047,Wales,20-30%,0-35,F,HE Qualification,N,Pass,-60.0,
13137,DDD,2013B,131145,South West Region,40-50%,0-35,M,A Level or Equivalent,N,Pass,-103.0,
13140,DDD,2013B,134025,London Region,60-70%,0-35,M,A Level or Equivalent,N,Distinction,-58.0,


<b>Duplicate Student IDs</b>

In [61]:
# finding all records that belong to students whose ID appears more than once
pd.concat(x for _, x in stud_info.groupby("id_student") if len(x) > 1).head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
10598,CCC,2014J,29411,East Midlands Region,80-90%,0-35,M,A Level or Equivalent,N,Withdrawn,-135.0,100.0
14399,DDD,2013J,29411,East Midlands Region,80-90%,0-35,M,A Level or Equivalent,N,Pass,-96.0,
10600,CCC,2014J,29639,North Region,,0-35,M,Lower Than A Level,N,Pass,-24.0,
20417,EEE,2014B,29639,North Region,,0-35,M,Lower Than A Level,N,Pass,-26.0,
8659,CCC,2014B,29820,East Anglian Region,40-50%,0-35,M,HE Qualification,N,Pass,-57.0,


In [68]:
duped_sids = stud_info[stud_info['id_student'].duplicated()]
total_sid_dupes = pd.concat(x for _, x in stud_info.groupby("id_student") if len(x) > 1)

In [70]:
md(f'''We have {len(duped_sids)} records that repeat a student ID already seen earlier in the dataframe, and a total of {len(total_sid_dupes)} records belonging to students whose ID appears more than once. These students do seem to be in different courses, so we will leave them in.''')

We have 2325 records that repeat a student ID already seen earlier in the dataframe, and a total of 4636 records belonging to students whose ID appears more than once. These students do seem to be in different courses, so we will leave them in.
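
The two counts above measure different things, which the following sketch tries to make explicit on toy data. The frame `toy` and the variable names are hypothetical; `duplicated(keep=False)` is offered as an alternative to the `groupby` and `pd.concat` approach and selects the same set of rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for stud_info
toy = pd.DataFrame({
    "code_module": ["CCC", "DDD", "CCC", "EEE", "AAA"],
    "code_presentation": ["2014J", "2013J", "2014J", "2014B", "2013J"],
    "id_student": [29411, 29411, 29639, 29639, 11391],
})

# duplicated() flags every occurrence of an ID after the first
repeat_rows = toy[toy["id_student"].duplicated()]

# keep=False flags every row whose ID appears more than once
all_rows_for_repeated_ids = toy[toy["id_student"].duplicated(keep=False)]

# Number of distinct students behind those rows
n_repeated_students = all_rows_for_repeated_ids["id_student"].nunique()

# The same student should never appear twice in the same module presentation
same_presentation_dupes = toy.duplicated(
    subset=["code_module", "code_presentation", "id_student"]
).sum()

print(len(repeat_rows), len(all_rows_for_repeated_ids),
      n_repeated_students, same_presentation_dupes)
```

A zero in the last check supports the observation that repeated IDs belong to students enrolled in different courses.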

<b>Statistics:</b>

In [40]:
stud_info.describe().astype(int)

Unnamed: 0,date_registration,date_unregistration
count,28383,8612
mean,-68,49
std,48,81
min,-321,-274
25%,-100,-2
50%,-56,27
75%,-29,107
max,167,444


* date_unregistration has 8,612 non-null values, which is the number of students who withdrew from the course.
* The earliest date_unregistration is 274 days before the course began, which means some students unregistered before the first day. We are only interested in students who actually took the course, so we must remove students who did not attend.
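
One detail worth keeping in mind when filtering on date_unregistration is how missing values behave. The sketch below uses a hypothetical toy frame, `toy`, and a boolean mask in place of `drop`; it assumes, as above, that a missing date means the student never unregistered.

```python
import pandas as pd

# Hypothetical toy frame standing in for stud_info
toy = pd.DataFrame({
    "id_student": [11391, 30268, 675116],
    "date_unregistration": [float("nan"), 12.0, -96.0],
})

# NaN <= 0 evaluates to False, so students who never unregistered are kept
left_before_start = toy["date_unregistration"] <= 0

# Keep everyone else and renumber the rows
attended = toy[~left_before_start].reset_index(drop=True)
print(attended)
```

Only the student who unregistered 96 days before the start is removed; the student with no unregistration date stays in the data.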

In [45]:
# removing students who withdrew on or before the first day
stud_info = stud_info.drop(stud_info[(stud_info['date_unregistration'] <= 0)].index)
stud_info.reset_index(drop=True).head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,
1,AAA,2013J,28400,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
2,AAA,2013J,30268,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0
3,AAA,2013J,31604,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,
4,AAA,2013J,32885,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,


In [46]:
# finds the longest module length in courses and prints it
longest_course = courses['module_presentation_length'].max()
longest_unreg = stud_info['date_unregistration'].max().astype(int)
md(f'''* The longest course according to module_presentation_length in the courses dataframe ran for {longest_course} days, yet the latest unregistration date here is day {longest_unreg}, which is later than the end of any course.
    ''')

* The longest course according to module_presentation_length in the courses dataframe ran for 269 days, yet the latest unregistration date here is day 444, which is later than the end of any course.
    

<b>All Students with an unregistration point after 269 days:</b>

In [47]:
# finding students whose courses went on for longer than the maximum course length
stud_info.loc[stud_info['date_unregistration'] > 269]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
25249,FFF,2013J,586851,Wales,0-10%,0-35,M,Lower Than A Level,N,Withdrawn,-22.0,444.0


* This one student appears to be the only outlier. A single record should not affect our overall analysis, so we will leave it in.
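
The check above compares every unregistration date with the single longest course. A stricter variant compares each record with the length of its own module presentation. This is a sketch only: `toy_courses` and `toy_students` are hypothetical frames, and the module codes and lengths are illustrative values, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frames; codes and lengths are illustrative, not real data
toy_courses = pd.DataFrame({
    "code_module": ["M1", "M2"],
    "code_presentation": ["P1", "P1"],
    "module_presentation_length": [240, 269],
})
toy_students = pd.DataFrame({
    "code_module": ["M1", "M1", "M2"],
    "code_presentation": ["P1", "P1", "P1"],
    "id_student": [1, 2, 3],
    "date_unregistration": [444.0, 30.0, float("nan")],
})

# Attach each presentation's length to its students
with_length = toy_students.merge(
    toy_courses, how="left", on=["code_module", "code_presentation"]
)

# Records whose unregistration date falls after their own course ended
past_end = with_length[
    with_length["date_unregistration"] > with_length["module_presentation_length"]
]
print(past_end)
```

This would also catch students who unregistered after the end of a shorter course but before day 269, which the global maximum cannot detect.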