In [29]:
from functions import *

# Student Registration Dataframe

---

## General

The student registration dataframe contains the dates on which students registered for and, if applicable, unregistered from a module.

In [30]:
# looking at the student_registration dataframe
student_registration.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


## Student Registration Contents

* **code_module**: The code of the module (course) for which the student registered.
* **code_presentation**: The code of the presentation, which identifies when the module the student registered for began (for example, `2013J`).
* **id_student**: The unique identifier for each student.
* **date_registration**: The day on which the student registered for the module, measured in days relative to the module start. A negative value means the student registered that many days before the module began.
* **date_unregistration**: The day on which the student unregistered from the module, measured in days relative to the module start. It is empty if the student did not unregister.


**Size**

In [31]:
# get the row & column sizes for student registration
get_size(student_registration)

Unnamed: 0,Count
Columns,5
Rows,32593


In [32]:
md(f'''
Student Registration has {len(student_registration.columns)} columns and {"{:,}".format(len(student_registration))} rows
''')


Student Registration has 5 columns and 32,593 rows


**Data Types**

In [33]:
# show student registration data types
get_dtypes(student_registration)

index,Type
code_module,object
code_presentation,object
id_student,object
date_registration,float64
date_unregistration,float64


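* `id_student` is an identifier rather than a quantity, so it will be cast to the string data type
* The remaining `object` columns will be converted to string
* `date_registration` and `date_unregistration` hold whole numbers of days, so they will be converted from float64 to the nullable integer type Int64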
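The conversions listed above can be tried on a small stand-in frame first. This is a minimal sketch, assuming a toy dataframe named `toy` with made-up values; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame mimicking some registration columns (values are illustrative only)
toy = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "id_student": [11391, 28400, 30268],
    "date_registration": [-159.0, -53.0, None],
    "date_unregistration": [None, None, 12.0],
})

# Identifiers are labels, not quantities, so store them as strings
toy["id_student"] = toy["id_student"].astype("str")

# convert_dtypes picks nullable extension types where it can:
# whole-number float columns with gaps become Int64 and keep their missing values
converted = toy.convert_dtypes()
print(converted.dtypes)
```

The nullable `Int64` type is what lets the date columns stay integer-valued even though many rows have no unregistration date.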

In [34]:
# changing id_student to the string data type
student_registration['id_student'] = student_registration['id_student'].astype('str')
# convert the remaining object columns to string and the float date columns to nullable Int64
student_registration = student_registration.convert_dtypes()
# show the result of the conversion
dataframe(student_registration.dtypes, columns=["Data Type"])

Unnamed: 0,Data Type
code_module,string
code_presentation,string
id_student,string
date_registration,Int64
date_unregistration,Int64


**Null Values:**

In [35]:
# get the null values for each column
null_vals(student_registration)

index,Null Values
code_module,0
code_presentation,0
id_student,0
date_registration,45
date_unregistration,22521


In [36]:
# store the sum of null values of date_registration
null_registration = student_registration['date_registration'].isnull().sum()
# store the sum of null values of date_unregistration
null_unregistration = student_registration['date_unregistration'].isnull().sum()

In [37]:
md(f'''
* There are {null_registration} null values for date_registration. The dataset documentation does not mention them, so we will treat them as missing data.
* There are {null_unregistration} null values for date_unregistration, which represent the students who did not withdraw from the course.
''')


* There are 45 null values for date_registration. The dataset documentation does not mention them, so we will treat them as missing data.
* There are 22521 null values for date_unregistration, which represent the students who did not withdraw from the course.
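The two kinds of null described above call for different handling: one is informative, the other is a gap. A minimal sketch on made-up values, where the `withdrew` column and the `missing_registration` name are hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy frame with nullable integer dates (values are illustrative only)
toy = pd.DataFrame({
    "id_student": ["11391", "28400", "30268"],
    "date_registration": pd.array([-159, None, -92], dtype="Int64"),
    "date_unregistration": pd.array([None, None, 12], dtype="Int64"),
})

# Per-column null counts, the same quantity the notebook reports
null_counts = toy.isnull().sum()

# A missing unregistration date is informative: the student did not withdraw
toy["withdrew"] = toy["date_unregistration"].notna()

# A missing registration date is ordinary missing data and can be inspected separately
missing_registration = toy[toy["date_registration"].isna()]
```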


**Duplicate Values**

In [38]:
# get the duplicate values for student registration if any
get_dupes(student_registration)

There are no Duplicate Values

**Unique Counts:**

In [39]:
# get the number of unique values in each column
count_unique(student_registration)

index,Count
code_module,7
code_presentation,4
id_student,28785
date_registration,332
date_unregistration,416


**Unique Categorical Values**

In [40]:
# get the unique categorical values
unique_vals(student_registration)

index,Values
code_module,"['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']"
code_presentation,"['2013J', '2014J', '2013B', '2014B']"


In [41]:
# store the number of rows in student_registration
total_students = len(student_registration)
# store the number of unique student IDs in student_registration
unique_students = student_registration['id_student'].nunique()
md(f'''* The student registration dataframe has {"{:,}".format(total_students)} rows, but only {"{:,}".format(unique_students)} unique student IDs.
* This suggests that some students registered more than once, either for several modules or for the same module in more than one presentation.
''')

* The student registration dataframe has 32,593 rows, but only 28,785 unique student IDs.
* This suggests that some students registered more than once, either for several modules or for the same module in more than one presentation.


In [42]:
# rows whose id_student has already appeared earlier in the dataframe (repeat occurrences only)
duplicate_students = student_registration[student_registration['id_student'].duplicated()]
duplicate_students.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
395,AAA,2014J,65002,-144,
403,AAA,2014J,94961,-150,
415,AAA,2014J,129955,-143,143.0
422,AAA,2014J,135335,-82,24.0
423,AAA,2014J,135400,-51,


In [43]:
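md(f'''
This dataframe contains the repeat records: every row whose student ID has already appeared earlier in the dataframe. 
There are {"{:,}".format(len(duplicate_students))} such records, which equals the gap between the total row count and the number of unique student IDs.''')


This dataframe contains the repeat records: every row whose student ID has already appeared earlier in the dataframe. 
There are 3,808 such records, which equals the gap between the total row count and the number of unique student IDs.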

**Duplicate Student IDs**

In [44]:
# finding all records of students whose ID appears more than once
total_duplicate_records = pd.concat(x for _, x in student_registration.groupby("id_student") if len(x) > 1)
total_duplicate_records.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
10637,CCC,2014J,100788,-103,
23954,FFF,2013J,100788,-82,
23955,FFF,2013J,101217,-121,93.0
27730,FFF,2014J,101217,-44,
12835,CCC,2014J,1031884,-30,


In [45]:
md(f'''
* This dataframe contains all of the records of the students whose ID appears more than once.
* There is a total of {"{:,}".format(len(total_duplicate_records))} records belonging to these students. 
* These records are for different modules or presentations, so they represent students who took several modules or took the same module more than once.''')


* This dataframe contains all of the records of the students whose ID appears more than once.
* There is a total of 7,346 records belonging to these students. 
* These records are for different modules or presentations, so they represent students who took several modules or took the same module more than once.
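The counts 3,808 and 7,346 differ because they answer different questions: the first counts only repeat occurrences of an ID, the second counts every record of a repeated ID. A minimal sketch on made-up rows, using the `keep` parameter of `Series.duplicated` as an alternative to the `groupby` approach (the toy frame and variable names are illustrative only):

```python
import pandas as pd

# Toy registrations: two students appear twice, one appears once
toy = pd.DataFrame({
    "code_module": ["CCC", "FFF", "FFF", "FFF", "AAA"],
    "code_presentation": ["2014J", "2013J", "2013J", "2014J", "2013J"],
    "id_student": ["100788", "100788", "101217", "101217", "11391"],
})

# Default keep='first': flags only the second and later occurrences of each ID
repeats = toy[toy["id_student"].duplicated()]

# keep=False: flags every record whose ID occurs more than once
all_records = toy[toy["id_student"].duplicated(keep=False)]

# Number of distinct students behind those records
n_students = all_records["id_student"].nunique()

print(len(repeats), len(all_records), n_students)
```

The number of repeat occurrences always equals the total row count minus the number of unique IDs, which is a quick consistency check on the figures reported here.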

**Numerical Values**

In [46]:
student_registration.describe().round(2)

Unnamed: 0,date_registration,date_unregistration
count,32548.0,10072.0
mean,-69.41,49.76
std,49.26,82.46
min,-322.0,-365.0
25%,-100.0,-2.0
50%,-57.0,27.0
75%,-29.0,109.0
max,167.0,444.0


In [76]:
unreg_total = student_registration['date_unregistration'].count()
unreg_min = student_registration['date_unregistration'].min()
md(f'''
* There are {"{:,}".format(unreg_total)} non-null values of date_unregistration, which is the number of registrations that ended in withdrawal.
* The earliest date_unregistration is {abs(unreg_min)} days before the course began, which means this student withdrew before the first day. 
''')


* There are 10,072 non-null values of date_unregistration, which is the number of registrations that ended in withdrawal.
* The earliest date_unregistration is 365 days before the course began, which means this student withdrew before the first day. 


In [79]:
# registrations with an unregistration date on or before the module start (day 0)
early_withdraws = student_registration.loc[student_registration['date_unregistration'] <= 0]
early_withdraws.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
125,AAA,2013J,292923,-162,-121
136,AAA,2013J,305539,-54,-3
198,AAA,2013J,405961,-170,-100
256,AAA,2013J,1763015,-58,-2
298,AAA,2013J,2318055,-56,-19


In [85]:
md(f'''
Here we can see the {"{:,}".format(len(early_withdraws))} registrations with a withdrawal date on or before the module start.
It is tempting to remove them, since students who never attended probably don't add much information, 
but first let's find them in the student info dataframe.
''')


Here we can see the 3,097 registrations with a withdrawal date on or before the module start.
It is tempting to remove them, since students who never attended probably don't add much information, 
but first let's find them in the student info dataframe.


In [86]:
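# IDs of the registrations that ended on or before the module start
early_ids = student_registration.loc[student_registration['date_unregistration'] <= 0, 'id_student']
# collect the student_info rows for each of those IDs, matching on id_student only
# (id_student is a string here, and student_info is compared against int(i))
# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat
withdraw_info = pd.concat([student_info.loc[student_info['id_student'] == int(i)] for i in early_ids])

withdraw_info.head()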

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
125,AAA,2013J,292923,F,South East Region,A Level or Equivalent,90-100%,35-55,0,180,N,Withdrawn
136,AAA,2013J,305539,F,Wales,Lower Than A Level,80-90%,0-35,0,120,N,Withdrawn
198,AAA,2013J,405961,M,Scotland,A Level or Equivalent,90-100%,0-35,0,240,Y,Withdrawn
256,AAA,2013J,1763015,F,Scotland,A Level or Equivalent,10-20,35-55,0,60,N,Withdrawn
298,AAA,2013J,2318055,M,Wales,A Level or Equivalent,90-100%,35-55,0,60,N,Withdrawn


In [82]:
dataframe(withdraw_info['final_result'].value_counts())

Unnamed: 0,final_result
Withdrawn,3715
Pass,201
Fail,121
Distinction,40


In [90]:
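md('''
Although most of these records are withdrawals, some show a fail, a pass, or a distinction.
Note that the lookup matches on id_student alone, so it returns 4,077 rows for the 3,097 early withdrawals; some of these results may therefore come from the same students' other registrations.
Because of this we will keep all of the data. The registration and unregistration dates also seem to contain unusual values.
''')


Although most of these records are withdrawals, some show a fail, a pass, or a distinction.
Note that the lookup matches on id_student alone, so it returns 4,077 rows for the 3,097 early withdrawals; some of these results may therefore come from the same students' other registrations.
Because of this we will keep all of the data. The registration and unregistration dates also seem to contain unusual values.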
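One way to check whether a non-withdrawn result really belongs to an early withdrawal is to match on module, presentation, and student together rather than on the student ID alone. The following is a minimal sketch on made-up rows, assuming both frames share the three key columns with matching dtypes; the names `registration`, `info`, `by_id`, and `by_keys` are hypothetical.

```python
import pandas as pd

# Toy data: student 1 withdrew early from AAA but passed BBB
registration = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2014J"],
    "id_student": [1, 1],
    "date_unregistration": pd.array([-10, None], dtype="Int64"),
})
info = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2014J"],
    "id_student": [1, 1],
    "final_result": ["Withdrawn", "Pass"],
})

# Early withdrawals; missing dates count as "not early"
is_early = (registration["date_unregistration"] <= 0).fillna(False)
early = registration[is_early]

# Matching on the ID alone also pulls in the student's other registration
by_id = info[info["id_student"].isin(early["id_student"])]

# Matching on all three keys returns only the registration that was withdrawn
keys = ["code_module", "code_presentation", "id_student"]
by_keys = early.merge(info, on=keys, how="left")

print(by_id["final_result"].tolist())
print(by_keys["final_result"].tolist())
```

In the notebook itself, `id_student` is a string in `student_registration`, so the key would need to be cast to a common type before merging.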


In [91]:
# find the longest presentation length in the courses dataframe
longest_course = courses['module_presentation_length'].max()
# find the latest unregistration date in student_registration
longest_unreg = int(student_registration['date_unregistration'].max())
md(f'''* The longest presentation in the courses dataframe (module_presentation_length) ran for {longest_course} days, yet the latest unregistration date here is day {longest_unreg}, which is after every course had ended.
''')

* The longest presentation in the courses dataframe (module_presentation_length) ran for 269 days, yet the latest unregistration date here is day 444, which is after every course had ended.


**All Students with an unregistration point after 269 days:**

In [92]:
# finding students whose unregistration date is later than the longest course length
student_registration.loc[student_registration['date_unregistration'] > 269]

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
25249,FFF,2013J,586851,-22,444


* Only this one student is an outlier. A single record should not affect our overall analysis, so we will leave it intact.
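The check above compares every unregistration date with the single longest course length. A stricter variant compares each registration with the length of its own presentation. This is a minimal sketch on made-up rows, assuming `courses` carries `code_module` and `code_presentation` columns alongside `module_presentation_length`; the lengths below are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy course lengths (illustrative values only)
courses = pd.DataFrame({
    "code_module": ["AAA", "FFF"],
    "code_presentation": ["2013J", "2013J"],
    "module_presentation_length": [268, 268],
})

# Toy registrations: one plausible withdrawal, one with no withdrawal, one far too late
registration = pd.DataFrame({
    "code_module": ["AAA", "AAA", "FFF"],
    "code_presentation": ["2013J", "2013J", "2013J"],
    "id_student": ["30268", "11391", "586851"],
    "date_unregistration": pd.array([12, None, 444], dtype="Int64"),
})

# Attach each registration's own presentation length
merged = registration.merge(
    courses, on=["code_module", "code_presentation"], how="left"
)

# Flag withdrawals dated after the presentation ended; missing dates are not flagged
too_late = (
    merged["date_unregistration"] > merged["module_presentation_length"]
).fillna(False)
late = merged[too_late]

print(late[["id_student", "date_unregistration", "module_presentation_length"]])
```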