In [1]:
from functions import *

# Students

---

The student info dataframe contains information about students including the module and presentation they took, demographic information and the final result of their studies.

In [2]:
# looking at the student_info dataframe
student_info.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass


## Student Info Contents

* **code_module**: The code module represents the course the student is taking.
* **code_presentation**: The code presentations are the year and semester the student is taking the course.
* **id_student**: The student ID is a unique identifier for each student.
* **gender**: The student's gender, recorded as a binary value: 'M' for students who identify as male and 'F' for students who identify as female.
* **region**: The region the student lived in when they took the module. Regions cover England, Scotland, Wales and Ireland.
* **highest_education**: The student's highest level of formal academic achievement.
    - Education levels in order from least to most formal education: 
        - No formal quals (qualifications)
        - Lower than A Level, which is roughly, though not exactly, analogous to below high school level
        - A Level or equivalent, which is roughly analogous to high school level, but closer to college-ready
        - HE Qualification which stands for higher education qualification
        - Post Graduate Qualification
* **imd_band**: The imd_band represents the Indices of multiple deprivation (IMD) score which is a commonly used method in the UK to measure poverty or deprivation in an area. The lower the score, the more 'deprived' the area is.
* **age_band**: There are only three bins for age: 0-35, 35-55 and over 55.
* **num_of_prev_attempts**: The number of times the student has attempted the course previously.
* **studied_credits**: The number of credits for the module the student is taking.
* **disability**: Disability status is represented by a binary 'Y', yes a student does identify as having a disability and 'N', no a student does not identify as having a disability.
* **final_result**: The final result is the student's overall result in the class.
    - Possible results include:
         - Pass: The student passed the course
         - Fail: The student did not pass the course
         - Withdrawn: The student withdrew before the course term ended
         - Distinction: The student passed the class with distinction
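
The education levels listed above have a natural order, which plain strings do not preserve (they sort alphabetically). A minimal sketch of one way to encode that order, assuming the level names match the values in `highest_education`; the `toy` frame and the `education_order` list are illustrative names, not part of the notebook:

```python
import pandas as pd

# Order taken from the list above, least to most formal education.
education_order = [
    "No Formal quals",
    "Lower Than A Level",
    "A Level or Equivalent",
    "HE Qualification",
    "Post Graduate Qualification",
]

# Toy data standing in for student_info['highest_education'].
toy = pd.DataFrame(
    {"highest_education": ["HE Qualification", "No Formal quals", "A Level or Equivalent"]}
)

# An ordered categorical sorts and compares by rank rather than alphabetically.
toy["highest_education"] = pd.Categorical(
    toy["highest_education"], categories=education_order, ordered=True
)

sorted_levels = toy.sort_values("highest_education")["highest_education"].tolist()
at_least_a_level = (toy["highest_education"] >= "A Level or Equivalent").tolist()
```

With this encoding, filters such as "A Level or above" become a simple comparison instead of a hand-written list of labels.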

* num_of_prev_attempts will be changed to prev_attempts to save space
* 'Y' and 'N' in disability can be changed to boolean to make it easier to work with

In [57]:
# rename num_of_prev_attempts column to prev_attempts to save space
student_info = student_info.rename(columns={'num_of_prev_attempts':'prev_attempts'})
# convert the 'Y'/'N' disability flags to booleans
student_info = student_info.replace({'disability':{'Y':True, 'N':False}})

---

## Student Info Information

In [4]:
# get size counts of student_info
get_size(student_info)

Unnamed: 0,Count
Columns,12
Rows,32593


In [5]:
md(f'''
Student Info has {len(student_info.columns)} columns and {"{:,}".format(len(student_info))} rows
''')


Student Info has 12 columns and 32,593 rows


In [6]:
# show student info data types
get_dtypes(student_info)

code_module          object
code_presentation    object
id_student            int64
gender               object
region               object
highest_education    object
imd_band             object
age_band             object
prev_attempts         int64
studied_credits       int64
disability           object
final_result         object
dtype: object

* `id_student` is currently stored as `int64`, but it is an identifier rather than a quantity, so it is more appropriate to recast it as a string.
* `object` columns can behave unexpectedly and should be recast to `string`.


In [7]:
# changing id_student to the string data type
student_info['id_student'] = student_info['id_student'].astype(str)
student_info = student_info.convert_dtypes()
student_info.dtypes

code_module          string
code_presentation    string
id_student           string
gender               string
region               string
highest_education    string
imd_band             string
age_band             string
prev_attempts         Int64
studied_credits       Int64
disability           string
final_result         string
dtype: object

**Null Values**

In [8]:
null_vals(student_info)

index,Null Values
code_module,0
code_presentation,0
id_student,0
gender,0
region,0
highest_education,0
imd_band,1111
age_band,0
prev_attempts,0
studied_credits,0


In [9]:
# store sum of imd null values
imd_null = student_info['imd_band'].isnull().sum()
md(f'''The imd_band variable has {imd_null} null values which we may have to work around.''')

The imd_band variable has 1111 null values which we may have to work around.
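
One possible way to work around the missing `imd_band` values is to keep them as an explicit category instead of dropping the rows. This is only a sketch on toy data; the `"Unknown"` label and the `imd_band_filled` column are assumptions, not something the notebook defines:

```python
import pandas as pd

# Toy stand-in for student_info['imd_band'] with two missing values.
toy = pd.DataFrame(
    {"imd_band": pd.array(["90-100%", None, "20-30%", None], dtype="string")}
)

# Share of rows with a missing band.
null_share = toy["imd_band"].isna().mean()

# Keep the rows and mark the missing band explicitly (label is an assumption).
toy["imd_band_filled"] = toy["imd_band"].fillna("Unknown")
counts = toy["imd_band_filled"].value_counts()
```

Keeping the original column untouched makes it easy to compare results with and without the filled values later on.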

**Duplicate Values**

In [10]:
# show duplicate values in student info if any
get_dupes(student_info)

There are no Duplicate Values

**Unique Counts**

In [11]:
# Get number of unique values per variable in student info
count_unique(student_info)

index,Count
code_module,7
code_presentation,4
id_student,28785
gender,2
region,13
highest_education,5
imd_band,10
age_band,3
prev_attempts,7
studied_credits,61


In [12]:
# store count of total student ids
total_students = student_info['id_student'].count()
# store count of unique student ids
unique_students = student_info['id_student'].nunique()

In [19]:
md(f'''
* There are {"{:,}".format(total_students)} entries for students but only {"{:,}".format(unique_students)} unique student IDs.
* This may represent students who have taken the course more than once or who are taking multiple modules
''')


* There are 32,593 entries for students but only 28,785 unique student IDs.
* This may represent students who have taken the course more than once or who are taking multiple modules
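
To tell the two cases apart, one option is to count records and distinct modules per student. The sketch below uses toy data, and the names `per_student`, `repeated_module` and `multiple_modules` are illustrative only:

```python
import pandas as pd

# Toy stand-in for student_info: student 1 repeats AAA, student 2 takes two modules.
toy = pd.DataFrame(
    {
        "id_student": ["1", "1", "2", "2", "3"],
        "code_module": ["AAA", "AAA", "BBB", "CCC", "DDD"],
        "code_presentation": ["2013J", "2014J", "2013J", "2013J", "2014B"],
    }
)

# Number of records and number of distinct modules for each student.
per_student = toy.groupby("id_student").agg(
    records=("code_module", "size"),
    modules=("code_module", "nunique"),
)

# More records than distinct modules: at least one module was taken more than once.
repeated_module = per_student[per_student["records"] > per_student["modules"]]
# More than one distinct module: the student took several different modules.
multiple_modules = per_student[per_student["modules"] > 1]
```

A student can fall into both groups, so the two filters are not mutually exclusive.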


**Unique Categorical Values**

In [20]:
unique_vals(student_info)

index,Values
code_module,"['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']"
code_presentation,"['2013J', '2014J', '2013B', '2014B']"
gender,"['M', 'F']"
region,"['East Anglian Region', 'Scotland', 'North Western Region', 'South East Region', 'West Midlands Region', 'Wales', 'North Region', 'South Region', 'Ireland', 'South West Region', 'East Midlands Region', 'Yorkshire Region', 'London Region']"
highest_education,"['HE Qualification', 'A Level or Equivalent', 'Lower Than A Level', 'Post Graduate Qualification', 'No Formal quals']"
imd_band,"['90-100%', '20-30%', '30-40%', '50-60%', '80-90%', '70-80%', <NA>, '60-70%', '40-50%', '10-20%', '0-10%']"
age_band,"['55<=', '35-55', '0-35']"
disability,"['N', 'Y']"
final_result,"['Pass', 'Withdrawn', 'Fail', 'Distinction']"


In `imd_band`, the % sign is missing from the `10-20` band. We will add it for consistency and clarity.

In [30]:
# changing all 10-20 values in student_info imd_band to 10-20% for consistency's sake
student_info.loc[student_info['imd_band'] == '10-20', 'imd_band'] = '10-20%'
# making sure it updated
dataframe(student_info['imd_band'].explode().unique(), columns=['imd_band']).sort_values(by='imd_band').reset_index(drop=True)

Unnamed: 0,imd_band
0,0-10%
1,10-20%
2,20-30%
3,30-40%
4,40-50%
5,50-60%
6,60-70%
7,70-80%
8,80-90%
9,90-100%


In [59]:
dataframe(student_info['final_result'].value_counts())

Unnamed: 0,final_result
Pass,12361
Withdrawn,10156
Fail,7052
Distinction,3024


**Numerical Values**

In [35]:
# show statistical breakdown of numerical values in student info
student_info.describe().round(1)

Unnamed: 0,prev_attempts,studied_credits
count,32593.0,32593.0
mean,0.2,79.8
std,0.5,41.1
min,0.0,30.0
25%,0.0,60.0
50%,0.0,60.0
75%,0.0,120.0
max,6.0,655.0


In [49]:
# store the highest number of previous module attempts by a student
max_attempts = student_info['prev_attempts'].max()
# store the largest and smallest number of credits studied
max_credits = student_info['studied_credits'].max()
min_credits = student_info['studied_credits'].min()

In [55]:
md(f'''
* Most students do not have a previous attempt, but there is a high of {max_attempts} attempts.
    * We can only have data for up to two of a student's attempts since we only have two years' worth of data.
* The maximum number of credits a student took during the module was {max_credits}.
    * This is over twenty times the minimum of {min_credits} credits.
* It is unknown how these courses were weighted, but taking this many credits at the same time may have influenced student success.
''')


* Most students do not have a previous attempt, but there is a high of 6 attempts.
    * We can only have data for up to two of a student's attempts since we only have two years' worth of data.
* The maximum number of credits a student took during the module was 655.
    * This is over twenty times the minimum of 30 credits.
* It is unknown how these courses were weighted, but taking this many credits at the same time may have influenced student success.


# Student Registration Dataframe

---

## General

The student registration dataframe contains information about the dates that students registered for and, if applicable, unregistered from the module.

In [30]:
# looking at the student_registration dataframe
student_registration.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


## Student Registration Contents

* **code_module**: The code module represents the course which the student registered for.
* **code_presentation**: The code presentation represents the time of year at which the course the student registered for began.
* **id_student**: The student ID is the unique identifier for each student.
* **date_registration**: The registration date is the date that the student registered for the module relative to the start of the module. A negative value indicates the number of days before the module began.
* **date_unregistration**: The unregistration date is the date that the student unregistered from the module, relative to the start of the module, if applicable.
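
Because both dates are day offsets from the module start, their difference gives the number of days a student stayed registered. A minimal sketch on toy rows shaped like the preview above; the `days_enrolled` column is an illustrative name, not part of the dataset:

```python
import pandas as pd

# Toy rows: one student who never unregistered, one who left on day 12.
toy = pd.DataFrame(
    {
        "id_student": ["11391", "30268"],
        "date_registration": pd.array([-159, -92], dtype="Int64"),
        "date_unregistration": pd.array([pd.NA, 12], dtype="Int64"),
    }
)

# Days between registering and unregistering; stays missing when the
# student never unregistered.
toy["days_enrolled"] = toy["date_unregistration"] - toy["date_registration"]
```

The nullable `Int64` type keeps the result as whole days while still allowing missing values.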


**Size**

In [31]:
# get the row & column sizes for student registration
get_size(student_registration)

Unnamed: 0,Count
Columns,5
Rows,32593


In [32]:
md(f'''
Student Registration has {len(student_registration.columns)} columns and {"{:,}".format(len(student_registration))} rows
''')


Student Registration has 5 columns and 32,593 rows


**Data Types**

In [33]:
# show student registration data types
get_dtypes(student_registration)

index,Type
code_module,object
code_presentation,object
id_student,object
date_registration,float64
date_unregistration,float64


* `id_student` is an identifier rather than a quantity, so it will be cast to string
* `object` datatypes will be converted to string
* `date_registration` and `date_unregistration` are `float64` but hold whole numbers of days, so they will be converted to the nullable `Int64` type

In [34]:
# changing id_student to the string data type
student_registration['id_student'] = student_registration['id_student'].astype('str')
# convert other objects datatypes to strings
student_registration = student_registration.convert_dtypes()
# show the result of the conversion
dataframe(student_registration.dtypes, columns=["Data Type"])

Unnamed: 0,Data Type
code_module,string
code_presentation,string
id_student,string
date_registration,Int64
date_unregistration,Int64


**Null Values:**

In [35]:
# get the null values for each column
null_vals(student_registration)

index,Null Values
code_module,0
code_presentation,0
id_student,0
date_registration,45
date_unregistration,22521


In [36]:
# store the sum of null values of date_registration
null_registration = student_registration['date_registration'].isnull().sum()
# store the sum of null values of date_unregistration
null_unregistration = student_registration['date_unregistration'].isnull().sum()

In [37]:
md(f'''
* We have {null_registration} null values for date_registration, and no mention of this in the dataset documentation, so we will treat this as missing data.
* There are {null_unregistration} null values for date_unregistration which represent the students that did not withdraw from the course.
''')


* We have 45 null values for date_registration, and no mention of this in the dataset documentation, so we will treat this as missing data.
* There are 22521 null values for date_unregistration which represent the students that did not withdraw from the course.
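
If a missing `date_unregistration` is read as "did not withdraw", that reading can be made explicit with a boolean flag. A sketch on toy data, where the `withdrew` column is an assumed name:

```python
import pandas as pd

# Toy stand-in for student_registration with one withdrawal.
toy = pd.DataFrame(
    {
        "id_student": ["11391", "28400", "30268"],
        "date_unregistration": pd.array([pd.NA, pd.NA, 12], dtype="Int64"),
    }
)

# True where an unregistration day was recorded, False where it is missing.
toy["withdrew"] = toy["date_unregistration"].notna()
n_withdrew = int(toy["withdrew"].sum())
```

The flag can then be compared with `final_result` in the student info dataframe to check that the two sources agree.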


**Duplicate Values**

In [38]:
# get the duplicate values for student registration if any
get_dupes(student_registration)

There are no Duplicate Values

**Unique Counts:**

In [39]:
# get the sum of unique values in columns
count_unique(student_registration)

index,Count
code_module,7
code_presentation,4
id_student,28785
date_registration,332
date_unregistration,416


**Unique Categorical Values**

In [40]:
# get the unique categorical values
unique_vals(student_registration)

index,Values
code_module,"['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']"
code_presentation,"['2013J', '2014J', '2013B', '2014B']"


In [41]:
# store the number of rows in student_registration
total_students = len(student_registration)
# store the number of unique student ids in student_registration
unique_students = student_registration['id_student'].nunique()
md(f'''* The student registration dataframe has {"{:,}".format(total_students)} rows, but there are only {"{:,}".format(unique_students)} unique student IDs.
* This suggests that some students appear in more than one module or presentation, since the duplicate check above found no fully identical rows.
        ''')

* The student registration dataframe has 32,593 rows, but there are only 28,785 unique student IDs.
* This suggests that some students appear in more than one module or presentation, since the duplicate check above found no fully identical rows.
        

In [42]:
duplicate_students = student_registration[student_registration['id_student'].duplicated()]
duplicate_students.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
395,AAA,2014J,65002,-144,
403,AAA,2014J,94961,-150,
415,AAA,2014J,129955,-143,143.0
422,AAA,2014J,135335,-82,24.0
423,AAA,2014J,135400,-51,


In [43]:
md(f'''
This dataframe contains the repeat records: rows whose student ID has already appeared earlier in the dataframe. 
There is a total of {"{:,}".format(len(duplicate_students))} such repeat records''')


This dataframe contains the repeat records: rows whose student ID has already appeared earlier in the dataframe. 
There is a total of 3,808 such repeat records

**Duplicate Student IDs**

In [44]:
# finding student records with duplicate IDs
total_duplicate_records = pd.concat(x for _, x in student_registration.groupby("id_student") if len(x) > 1)
total_duplicate_records.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
10637,CCC,2014J,100788,-103,
23954,FFF,2013J,100788,-82,
23955,FFF,2013J,101217,-121,93.0
27730,FFF,2014J,101217,-44,
12835,CCC,2014J,1031884,-30,


In [45]:
md(f'''
* This dataframe contains all of the records of the students whose ID appears more than once.
* There is a total of {"{:,}".format(len(total_duplicate_records))} records belonging to students whose ID appears more than once. 
* These students do seem to be in different courses, and represent students who have taken multiple courses or the same course more than once''')


* This dataframe contains all of the records of the students whose ID appears more than once.
* There is a total of 7,346 records belonging to students whose ID appears more than once. 
* These students do seem to be in different courses, and represent students who have taken multiple courses or the same course more than once
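
The two counts above differ because they measure different things. `duplicated()` with its default `keep='first'` marks only the second and later occurrences of an ID, while `keep=False` marks every row of a repeated ID, which is what the `groupby` filter selects. A small sketch on toy IDs shows the relationship; all variable names are illustrative:

```python
import pandas as pd

# Toy IDs: one student with 2 records, one with 3, one with a single record.
ids = pd.Series(
    ["100788", "100788", "101217", "101217", "101217", "11391"], name="id_student"
)

# Default keep='first': only the 2nd and later occurrences are flagged.
repeat_rows = ids.duplicated()
# keep=False: every row belonging to a repeated ID is flagged.
all_rows = ids.duplicated(keep=False)

n_repeat_rows = int(repeat_rows.sum())
n_all_rows = int(all_rows.sum())
# Distinct students with more than one record.
n_students = ids[all_rows].nunique()
```

The difference between the two row counts equals the number of distinct students with repeated IDs. Applied to the figures above, 7,346 - 3,808 would give 3,538 such students.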

**Numerical Values**

In [46]:
student_registration.describe().round(2)

Unnamed: 0,date_registration,date_unregistration
count,32548.0,10072.0
mean,-69.41,49.76
std,49.26,82.46
min,-322.0,-365.0
25%,-100.0,-2.0
50%,-57.0,27.0
75%,-29.0,109.0
max,167.0,444.0


In [76]:
unreg_total = student_registration['date_unregistration'].count()
unreg_min = student_registration['date_unregistration'].min()
md(f'''
* There are {"{:,}".format(unreg_total)} values for date_unregistration, which represents the number of students who withdrew from the course.
* The earliest date_unregistration is {abs(unreg_min)} days before the course began, which means this student did not make it to the first day. 
''')


* There are 10,072 values for date_unregistration, which represents the number of students who withdrew from the course.
* The earliest date_unregistration is 365 days before the course began, which means this student did not make it to the first day. 


In [79]:
early_withdraws = student_registration.loc[student_registration['date_unregistration'] <= 0]
early_withdraws.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
125,AAA,2013J,292923,-162,-121
136,AAA,2013J,305539,-54,-3
198,AAA,2013J,405961,-170,-100
256,AAA,2013J,1763015,-58,-2
298,AAA,2013J,2318055,-56,-19


In [85]:
md(f'''
Here we can see the {"{:,}".format(len(early_withdraws))} students with a withdrawal date on or before the first day.
It is tempting to remove them, since students who never attended probably don't add much information, 
but first let's find them in the student info dataframe.
''')


Here we can see the 3,097 students with a withdrawal date on or before the first day.
It is tempting to remove them, since students who never attended probably don't add much information, 
but first let's find them in the student info dataframe.


In [86]:
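# compare IDs as strings so the lookup works whether they are stored as int or string
info_ids = student_info['id_student'].astype(str)
early_ids = student_registration.loc[student_registration['date_unregistration'] <= 0, 'id_student'].astype(str)
# DataFrame.append was removed in pandas 2.0, so collect the matches and concatenate once
# note: matching on id_student alone also returns a student's records for other modules and presentations
withdraw_info = pd.concat([student_info.loc[info_ids == i] for i in early_ids])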
    
withdraw_info.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
125,AAA,2013J,292923,F,South East Region,A Level or Equivalent,90-100%,35-55,0,180,N,Withdrawn
136,AAA,2013J,305539,F,Wales,Lower Than A Level,80-90%,0-35,0,120,N,Withdrawn
198,AAA,2013J,405961,M,Scotland,A Level or Equivalent,90-100%,0-35,0,240,Y,Withdrawn
256,AAA,2013J,1763015,F,Scotland,A Level or Equivalent,10-20,35-55,0,60,N,Withdrawn
298,AAA,2013J,2318055,M,Wales,A Level or Equivalent,90-100%,35-55,0,60,N,Withdrawn


In [82]:
dataframe(withdraw_info['final_result'].value_counts())

Unnamed: 0,final_result
Withdrawn,3715
Pass,201
Fail,121
Distinction,40


In [90]:
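md('''
It seems that while most of these students withdrew, some still have a final result of fail, pass, or distinction.
The lookup above matches on id_student only, so some of these results may belong to other modules or presentations taken by the same student.
Either way, we will keep all of the data. The registration and unregistration dates seem to contain some unusual values.
''')


It seems that while most of these students withdrew, some still have a final result of fail, pass, or distinction.
The lookup above matches on id_student only, so some of these results may belong to other modules or presentations taken by the same student.
Either way, we will keep all of the data. The registration and unregistration dates seem to contain some unusual values.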
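
One way to check whether the non-withdrawn results really belong to the early withdrawals is to match on module, presentation and student together rather than on the student ID alone. The sketch below uses toy frames named `registration` and `info`, which are assumptions standing in for the notebook's dataframes:

```python
import pandas as pd

# Toy data: the same student withdrew early from AAA but passed BBB.
registration = pd.DataFrame(
    {
        "code_module": ["AAA", "BBB"],
        "code_presentation": ["2013J", "2014J"],
        "id_student": ["1", "1"],
        "date_unregistration": pd.array([-3, pd.NA], dtype="Int64"),
    }
)
info = pd.DataFrame(
    {
        "code_module": ["AAA", "BBB"],
        "code_presentation": ["2013J", "2014J"],
        "id_student": ["1", "1"],
        "final_result": ["Withdrawn", "Pass"],
    }
)

keys = ["code_module", "code_presentation", "id_student"]
# Missing unregistration dates count as "did not withdraw early".
is_early = (registration["date_unregistration"] <= 0).fillna(False)
early = registration.loc[is_early, keys]

# Matching on the ID alone also pulls in the student's other module.
by_id_only = info[info["id_student"].isin(early["id_student"])]
# Matching on the full key returns only the record that was withdrawn early.
by_full_key = early.merge(info, on=keys, how="left")
```

On the toy data, the ID-only lookup reports a "Pass" that belongs to a different module, while the full-key merge does not.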


In [91]:
# store the longest module length in courses and the latest unregistration day
longest_course = courses['module_presentation_length'].max()
longest_unreg = int(student_registration['date_unregistration'].max())
md(f'''* The longest course according to module_presentation_length in the courses dataframe was {longest_course} days, yet the latest unregistration date is day {longest_unreg}, which is later than any course ran.
    ''')

* The longest course according to module_presentation_length in the courses dataframe was 269 days, yet the latest unregistration date is day 444, which is later than any course ran.
    

**All Students with an unregistration point after 269 days:**

In [92]:
# finding students whose courses went on for longer than the maximum course length
student_registration.loc[student_registration['date_unregistration'] > 269]

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
25249,FFF,2013J,586851,-22,444


* Only this one student is an outlier, which should not affect our overall analysis, so we will leave the record intact.
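
The check above compares every record against the single longest course. A stricter variant compares each unregistration day with the length of that student's own module presentation. This is only a sketch: the frames are toy data and the course lengths in it are invented for illustration:

```python
import pandas as pd

# Toy course lengths (invented values, not taken from the dataset).
courses = pd.DataFrame(
    {
        "code_module": ["FFF", "AAA"],
        "code_presentation": ["2013J", "2013J"],
        "module_presentation_length": [200, 250],
    }
)
registration = pd.DataFrame(
    {
        "code_module": ["FFF", "AAA", "AAA"],
        "code_presentation": ["2013J", "2013J", "2013J"],
        "id_student": ["1", "2", "3"],
        "date_unregistration": pd.array([444, 12, pd.NA], dtype="Int64"),
    }
)

# Attach each record's own course length.
merged = registration.merge(
    courses, on=["code_module", "code_presentation"], how="left"
)

# Unregistered after the course ended; missing dates count as False.
is_past_end = (
    merged["date_unregistration"] > merged["module_presentation_length"]
).fillna(False)
past_end = merged[is_past_end]
```

This variant would also catch records that end before the longest course length but after the end of their own, shorter presentation.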