In [1]:
from functions import *

# Student Registration Dataframe

---

## General

The student registration dataframe records the dates on which students registered for and, if applicable, unregistered from a module.

In [2]:
# looking at the student_registration dataframe
student_registration.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


## Student Registration Contents

* **code_module**: The code module identifies the course the student registered for.
* **code_presentation**: The code presentation identifies when the course the student registered for began (for example, `2013J`).
* **id_student**: The student ID is the unique identifier for each student.
* **date_registration**: The registration date is the number of days between the start of the module and the day the student registered. A negative value means the student registered that many days before the module began.
* **date_unregistration**: The unregistration date is the number of days between the start of the module and the day the student unregistered, if applicable.
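To make the relative-day encoding concrete, here is a minimal sketch using three rows copied from the preview above. The derived column names (`days_before_start`, `days_registered`) are illustrative assumptions and are not part of the dataset.

```python
import pandas as pd

# Three rows from the preview; offsets are days relative to the module start
sample = pd.DataFrame({
    "id_student": [11391, 28400, 30268],
    "date_registration": [-159.0, -53.0, -92.0],
    "date_unregistration": [None, None, 12.0],
})

# A negative offset means the student registered that many days before the start
sample["days_before_start"] = -sample["date_registration"]

# Days between registering and unregistering (NaN if the student never unregistered)
sample["days_registered"] = sample["date_unregistration"] - sample["date_registration"]

print(sample)
```

Student 30268 registered 92 days before the start and unregistered on day 12, so the registration lasted 104 days in total.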


**Size**

In [3]:
# get the row & column sizes for student registration
get_size(student_registration)

Unnamed: 0,Count
Columns,5
Rows,32593


In [4]:
md(f'''
Student Registration has {len(student_registration.columns)} columns and {"{:,}".format(len(student_registration))} rows
''')


Student Registration has 5 columns and 32,593 rows


**Data Types**

In [5]:
# show student registration data types
get_dtypes(student_registration)

index,Type
code_module,object
code_presentation,object
id_student,int64
date_registration,float64
date_unregistration,float64


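* `id_student` is currently an int64 datatype, but it is an identifier rather than a quantity, so it is more appropriate to cast it to a string
* `object` datatypes will be converted to string
* `date_registration` and `date_unregistration` hold whole-day offsets stored as float64 (because of the missing values), so `convert_dtypes()` will convert them to the nullable `Int64` type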

In [6]:
# changing id_student to the string data type
student_registration['id_student'] = student_registration['id_student'].astype('str')
# convert other objects datatypes to strings
student_registration = student_registration.convert_dtypes()
# show the result of the conversion
student_registration.dtypes

code_module            string
code_presentation      string
id_student             string
date_registration       Int64
date_unregistration     Int64
dtype: object
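The conversion above can be reproduced on a small frame. This is a sketch on toy data shaped like the preview rows, not the full dataset; it shows that `convert_dtypes()` turns whole-number floats with missing values into the nullable `Int64` type while keeping the missing entries.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the column types of student_registration
sample = pd.DataFrame({
    "code_module": ["AAA", "AAA"],
    "id_student": [11391, 30268],
    "date_registration": [-159.0, -92.0],
    "date_unregistration": [np.nan, 12.0],
})

# Identifiers are labels, not quantities, so store them as text
sample["id_student"] = sample["id_student"].astype("str")

# Let pandas pick the best nullable dtype for every column
converted = sample.convert_dtypes()
print(converted.dtypes)
```

The float columns become `Int64` because every non-missing value is a whole number; the missing unregistration date is kept as `<NA>` rather than being filled in.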

**Null Values:**

In [7]:
# get the null values for each column
null_vals(student_registration)

index,Null Values
code_module,0
code_presentation,0
id_student,0
date_registration,45
date_unregistration,22521


In [8]:
# store the sum of null values of date_registration
null_registration = student_registration['date_registration'].isnull().sum()
# store the sum of null values of date_unregistration
null_unregistration = student_registration['date_unregistration'].isnull().sum()

In [9]:
md(f'''
* We have {null_registration} null values for date_registration, and no mention of this in the dataset documentation, so we will treat this as missing data.
* There are {null_unregistration} null values for date_unregistration which represent the students that did not withdraw from the course.
''')


* We have 45 null values for date_registration, and no mention of this in the dataset documentation, so we will treat this as missing data.
* There are 22521 null values for date_unregistration which represent the students that did not withdraw from the course.
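The `null_vals` helper comes from the project's `functions` module, which is not shown here. As an assumption about what such a helper might do, the sketch below counts nulls per column with plain pandas on toy data and also reports the share of rows affected.

```python
import numpy as np
import pandas as pd

# Toy data: one missing registration date, two missing unregistration dates
sample = pd.DataFrame({
    "date_registration": [-159.0, np.nan, -92.0],
    "date_unregistration": [np.nan, np.nan, 12.0],
})

# Count of nulls per column
null_counts = sample.isnull().sum()
# Fraction of rows that are null per column
null_share = sample.isnull().mean().round(2)

summary = pd.DataFrame({"Null Values": null_counts, "Share": null_share})
print(summary)
```

Looking at the share alongside the count helps separate a handful of genuinely missing values from a column where null carries meaning, as with `date_unregistration`.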


**Duplicate Values**

In [10]:
# get the duplicate values for student registration if any
get_dupes(student_registration)

There are no Duplicate Values

**Unique Counts:**

In [11]:
# get the number of unique values in each column
count_unique(student_registration)

index,Count
code_module,7
code_presentation,4
id_student,28785
date_registration,332
date_unregistration,416


**Unique Categorical Values**

In [12]:
unique_vals(student_registration)

index,Values
code_module,"['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']"
code_presentation,"['2013J', '2014J', '2013B', '2014B']"


In [13]:
total_students = len(student_registration)
unique_students = student_registration['id_student'].nunique()
md(f'''* The student registration dataframe has {"{:,}".format(total_students)} rows, but only {"{:,}".format(unique_students)} unique student IDs.
* This suggests that some students registered for more than one module or presentation.
        ''')

* The student registration dataframe has 32,593 rows, but only 28,785 unique student IDs.
* This suggests that some students registered for more than one module or presentation.
        

In [14]:
duplicate_students = student_registration[student_registration['id_student'].duplicated()]
duplicate_students.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
395,AAA,2014J,65002,-144,
403,AAA,2014J,94961,-150,
415,AAA,2014J,129955,-143,143.0
422,AAA,2014J,135335,-82,24.0
423,AAA,2014J,135400,-51,


In [15]:
md(f'''
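This dataframe contains the repeat records: every row whose student ID has already appeared in an earlier row. 
There are {"{:,}".format(len(duplicate_students))} such rows, which matches the gap between the row count and the number of unique student IDs.''')


This dataframe contains the repeat records: every row whose student ID has already appeared in an earlier row. 
There are 3,808 such rows, which matches the gap between the row count and the number of unique student IDs.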
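The figure of 3,808 follows from how `duplicated()` works: by default it flags every occurrence of a value except the first, so the number of flagged rows equals rows minus unique IDs (32,593 - 28,785 = 3,808). The sketch below checks that relationship on a toy series; the letter IDs are made up for illustration.

```python
import pandas as pd

# Toy IDs: "A" appears three times, "B" twice, "C" once
ids = pd.Series(["A", "B", "A", "C", "A", "B"], name="id_student")

# Rows flagged after the first occurrence of each ID
repeat_rows = ids.duplicated().sum()

# Every row belonging to an ID that occurs more than once
rows_of_repeated_ids = ids.duplicated(keep=False).sum()

# Number of distinct IDs that occur more than once
students_with_repeats = (ids.value_counts() > 1).sum()

print(repeat_rows, rows_of_repeated_ids, students_with_repeats)
```

These are three different quantities: repeat rows, all rows of repeated IDs, and distinct repeated students. Only the last one counts students.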

**Duplicate Student IDs**

In [16]:
# finding all records that belong to students whose ID appears more than once
total_duplicate_records = pd.concat(x for _, x in student_registration.groupby("id_student") if len(x) > 1)
total_duplicate_records.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
10637,CCC,2014J,100788,-103,
23954,FFF,2013J,100788,-82,
23955,FFF,2013J,101217,-121,93.0
27730,FFF,2014J,101217,-44,
12835,CCC,2014J,1031884,-30,


In [18]:
md(f'''
* This dataframe contains all of the records of the students whose ID appears more than once.
* There is a total of {"{:,}".format(len(total_duplicate_records))} records belonging to students with a repeated ID. 
* These records cover different modules or presentations, so they represent students who have taken multiple courses or the same course more than once''')


* This dataframe contains all of the records of the students whose ID appears more than once.
* There is a total of 7,346 records belonging to students with a repeated ID. 
* These records cover different modules or presentations, so they represent students who have taken multiple courses or the same course more than once
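One way to tell the two cases apart is to count distinct modules per repeated student. The sketch below is an illustration rather than the notebook's method; it uses four rows from the output above plus one single-registration row from the first preview.

```python
import pandas as pd

# Rows taken from the outputs above
sample = pd.DataFrame({
    "code_module": ["CCC", "FFF", "FFF", "FFF", "AAA"],
    "code_presentation": ["2014J", "2013J", "2013J", "2014J", "2013J"],
    "id_student": ["100788", "100788", "101217", "101217", "11391"],
})

# keep=False flags every row of an ID that occurs more than once
repeated = sample[sample["id_student"].duplicated(keep=False)]

# Number of distinct modules each repeated student registered for
modules_per_student = repeated.groupby("id_student")["code_module"].nunique()

# One distinct module means the same module in several presentations
retook_same_module = modules_per_student[modules_per_student == 1].index.tolist()
# More than one distinct module means several different modules
took_several_modules = modules_per_student[modules_per_student > 1].index.tolist()

print(retook_same_module, took_several_modules)
```

This split is a simplification: a student with three or more records could fall into both cases and would be listed only under several modules here.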

**Numerical Values**

In [19]:
student_registration.describe().round(2)

Unnamed: 0,date_registration,date_unregistration
count,32548.0,10072.0
mean,-69.41,49.76
std,49.26,82.46
min,-322.0,-365.0
25%,-100.0,-2.0
50%,-57.0,27.0
75%,-29.0,109.0
max,167.0,444.0


In [21]:
unreg_total = student_registration['date_unregistration'].count()
unreg_min = student_registration['date_unregistration'].min()
md(f'''
* There are {"{:,}".format(unreg_total)} non-null values of date_unregistration, which is the number of registrations that ended in a withdrawal.
* The earliest date_unregistration is {abs(unreg_min)} days before the course began, which means this student did not make it to the first day. 
* We are only interested in students who took the course, so we will remove registrations that were withdrawn on or before the first day.
''')


* There are 10,072 non-null values of date_unregistration, which is the number of registrations that ended in a withdrawal.
* The earliest date_unregistration is 365 days before the course began, which means this student did not make it to the first day. 
* We are only interested in students who took the course, so we will remove registrations that were withdrawn on or before the first day.


In [22]:
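# removing students who withdrew on or before the first day
withdrew_early = student_registration['date_unregistration'] <= 0
student_registration = student_registration.drop(student_registration[withdrew_early].index)
# reset_index returns a new dataframe, so assign the result to keep the new index
student_registration = student_registration.reset_index(drop=True)
student_registration.head()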

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159,
1,AAA,2013J,28400,-53,
2,AAA,2013J,30268,-92,12.0
3,AAA,2013J,31604,-52,
4,AAA,2013J,32885,-176,
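Because `date_unregistration` is a nullable `Int64` column, comparing it with 0 yields `<NA>` for students who never unregistered. The sketch below makes that handling explicit with `fillna(False)`, so those students are always kept. The data is a toy frame, and the ID `99999` is a made-up example of an early withdrawal.

```python
import pandas as pd

# Toy frame: one student never unregistered, one left on day 12, one left before the start
sample = pd.DataFrame({
    "id_student": ["11391", "30268", "99999"],
    "date_unregistration": pd.array([pd.NA, 12, -5], dtype="Int64"),
})

# <NA> <= 0 evaluates to <NA>; treat it as "did not withdraw early"
withdrew_early = (sample["date_unregistration"] <= 0).fillna(False)

# Keep everyone else and rebuild a clean 0..n-1 index
kept = sample[~withdrew_early].reset_index(drop=True)
print(kept)
```

Filtering with an inverted mask avoids the intermediate index lookup that `drop` needs and states directly which rows survive.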


In [24]:
# find the longest module length in courses and the latest unregistration date
longest_course = courses['module_presentation_length'].max()
longest_unreg = int(student_registration['date_unregistration'].max())
md(f'''* The longest course, according to module_presentation_length in the courses dataframe, ran for {longest_course} days, yet the latest unregistration date is day {longest_unreg}, which is after any course had ended.
    ''')

* The longest course, according to module_presentation_length in the courses dataframe, ran for 269 days, yet the latest unregistration date is day 444, which is after any course had ended.
    

**All Students with an unregistration point after 269 days:**

In [None]:
# finding students who unregistered after the longest course had ended
student_registration.loc[student_registration['date_unregistration'] > 269]

* Only one student appears to be an outlier here. This should not affect our overall analysis, so we will leave the record intact.
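Comparing against the global maximum of 269 days only catches unregistrations that fall after the longest course. A stricter check, sketched below, compares each record with the length of its own presentation. It assumes that `courses` shares the `code_module` and `code_presentation` columns with `student_registration`; all values and student IDs in the toy frames are invented for illustration.

```python
import pandas as pd

# Hypothetical course lengths (values invented for this sketch)
courses = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2013J"],
    "module_presentation_length": [260, 240],
})

# Hypothetical registrations
registrations = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2013J"],
    "id_student": ["s1", "s2", "s3"],
    "date_unregistration": pd.array([12, pd.NA, 444], dtype="Int64"),
})

# Attach each registration's own course length
merged = registrations.merge(
    courses, on=["code_module", "code_presentation"], how="left"
)

# Flag unregistrations after the end of that specific course; <NA> counts as not flagged
after_end = (
    merged["date_unregistration"] > merged["module_presentation_length"]
).fillna(False)
past_end = merged[after_end]
print(past_end)
```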