# Student Success Analysis and Prediction

---

In this notebook we analyze the Open University Learning Analytics dataset. The dataset contains information about seven online courses, referred to as modules, the students taking these courses, and their interactions with the courses. There were four separate presentations of these courses, offered in February and October of 2013 and 2014. The goal of the analysis is to discover relationships between student features, course features, student grades and the overall student outcome. We begin by exploring the data, which is spread across seven CSV files, and then apply machine learning algorithms to see what relationships we can tease out.
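
As a rough sketch of the loading step, the seven tables could be read into a dictionary of dataframes. The file names follow the public download of the dataset as far as we know; the `load_oulad` helper and the `data_dir` argument are assumptions made for illustration and are not the loader this notebook actually uses (that one lives in `functions`).

```python
from pathlib import Path

import pandas as pd

# File names as distributed with the public dataset (assumed; adjust if yours differ).
OULAD_FILES = {
    "courses": "courses.csv",
    "assessments": "assessments.csv",
    "vle": "vle.csv",
    "student_info": "studentInfo.csv",
    "student_registration": "studentRegistration.csv",
    "student_assessment": "studentAssessment.csv",
    "student_vle": "studentVle.csv",
}


def load_oulad(data_dir):
    """Read every CSV that exists in data_dir into a dict of dataframes."""
    data_dir = Path(data_dir)
    frames = {}
    for name, filename in OULAD_FILES.items():
        path = data_dir / filename
        if path.exists():  # skip missing files instead of failing
            frames[name] = pd.read_csv(path)
    return frames
```

Calling `load_oulad("path/to/csvs")` would return only the tables that are present, which makes it easy to work with a partial download.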

In [1]:
from functions import *

@register_cell_magic
def markdown(line, cell):
    return md(cell.format(**globals()))


Navigation:

* Cleaning:
    * [Student Info](#StudentInfo) 
    * [Student Registration](#StudentRegistration) 
    * [Courses](#Courses) 
    * [Assessments](#Assessments)
    * [Student Assessment](#StudentAssessment)
    * [Student VLE](#StudentVLE)
    * [VLE](#VLE) 


<h1>Cleaning and Analysis</h1>

---

Let's get to know our data!

Step by step we will clean and explore the student data here.
For each dataframe we will first take a general look at its data types, null values, duplicate values, and unique values, and clean it based on what we find. Then we will explore the information visually.
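
The general look described above can be sketched as a small helper. `summarize_df` is a hypothetical stand-in, not the `analyze_df` function this notebook imports from `functions`; it only illustrates the four checks named in the text.

```python
import pandas as pd


def summarize_df(df):
    """Collect the basic checks: length, dtypes, nulls, duplicates, unique counts."""
    return {
        "length": len(df),
        "dtypes": df.dtypes,
        "nulls": df.isna().sum(),
        "duplicates": int(df.duplicated().sum()),
        "unique": df.nunique(),
    }


# Tiny made-up frame, only to show the shape of the result.
example = pd.DataFrame({"id": [1, 2, 2, 3], "score": [70.0, 80.0, 80.0, None]})
summary = summarize_df(example)
print(summary["nulls"])
```

Returning a dictionary rather than printing keeps the individual checks available for later cleaning steps.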

---

<h2>Observations and Cleaning</h2>

<h3>General</h3>

In [2]:
# pd.concat(x for _, x in vle.groupby(['id_student',"date"]) if len(x) > 1)[0:50]

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge


In [3]:
merged_vle_ass.loc[merged_vle_ass['_merge'] == 'right_only']

NameError: name 'merged_vle_ass' is not defined

In [248]:
# merged_vle_ass.loc[merged_vle_ass['_merge'] == 'right_only']

In [249]:
# merged_vle_ass.loc[merged_vle_ass['_merge'] == 'left_only']

In [250]:
# merged_vle_ass2 = vle.merge(assessments, how='left', on=['code_module', 'code_presentation'],indicator=True).head()

In [251]:
# merged_vle_ass2 

In [252]:
# merged_vle_ass2.loc[merged_vle_ass2['_merge'] == 'right_only']

In [253]:
# merged_vle_ass2.loc[merged_vle_ass2['_merge'] == 'left_only']

In [254]:
# merged_vle_ass3 = assessments.merge(vle, how='right', on=['code_module', 'code_presentation'],indicator=True).head()

In [255]:
# merged_vle_ass3 

In [256]:
# merged_vle_ass3.loc[merged_vle_ass3['_merge'] == 'right_only']

In [257]:
# merged_vle_ass3.loc[merged_vle_ass3['_merge'] == 'left_only']

In [258]:
# merged_vle_ass4 = assessments.merge(vle, how='left', on=['code_module', 'code_presentation'],indicator=True).head()

In [259]:
# merged_vle_ass4

In [260]:
# merged_vle_ass4.loc[merged_vle_ass4['_merge'] == 'right_only']

In [261]:
# merged_vle_ass4.loc[merged_vle_ass4['_merge'] == 'left_only']

In [262]:
# merged_vle_si

In [263]:
# merged_vle_si.loc[merged_vle_si['_merge'] == 'left_only']

In [264]:
merged_vle_si = merged_vle_si.dropna(subset=['final_result'])

In [265]:
vle = merged_vle_si

In [266]:
vle

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2014J,6516.0,Scotland,80-90%,55<=,M,HE Qualification,N,Pass,-52.0,,2791.0,both
1,DDD,2013J,8462.0,London Region,30-40%,55<=,M,HE Qualification,N,Withdrawn,-137.0,119.0,656.0,both
2,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
4,CCC,2014J,23698.0,East Anglian Region,50-60%,0-35,F,A Level or Equivalent,N,Pass,-110.0,,910.0,both
5,BBB,2013J,23798.0,Wales,50-60%,0-35,M,A Level or Equivalent,N,Distinction,-27.0,,590.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26069,DDD,2014B,2698251.0,South West Region,50-60%,0-35,F,A Level or Equivalent,N,Fail,-23.0,,1511.0,both
26070,AAA,2013J,2698257.0,East Midlands Region,60-70%,0-35,M,Lower Than A Level,N,Pass,-58.0,,758.0,both
26071,CCC,2014B,2698535.0,Wales,50-60%,0-35,M,Lower Than A Level,N,Withdrawn,-156.0,180.0,4241.0,both
26072,BBB,2014J,2698577.0,Wales,50-60%,35-55,F,Lower Than A Level,N,Fail,16.0,,717.0,both


* The data types are acceptable.
* There are 11 null values in date. The dataset documentation states that when the final exam date is missing, the exam takes place at the end of the last presentation week.
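
The rule quoted above could be applied with a merge against the courses table. This is a minimal sketch on made-up rows. It assumes that `courses` holds the presentation length in days in `module_presentation_length` (the column referenced later in this notebook) and that this length is an acceptable approximation of the end of the last presentation week.

```python
import pandas as pd

# Made-up miniature tables with the column names used in this notebook.
courses = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2013J"],
    "module_presentation_length": [268, 240],
})
assessments = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2013J"],
    "id_assessment": [1, 2, 3],
    "assessment_type": ["TMA", "Exam", "Exam"],
    "date": [19.0, None, 230.0],
})

# Look up the presentation length for every assessment row.
keys = ["code_module", "code_presentation"]
lengths = assessments[keys].merge(courses, on=keys, how="left")["module_presentation_length"]
lengths.index = assessments.index  # align with the original rows

# Replace only the missing dates.
assessments["date"] = assessments["date"].fillna(lengths)
print(assessments[["id_assessment", "date"]])
```

A merge avoids a row-by-row loop and leaves dates that are already present untouched.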

We will again look at the possible values for our categorical variables:

---

<h4>Assessment Type</h4>


In [299]:
# nunique() counts distinct assessment ids; count() would only count non-null rows
unique_assessments = assessments['id_assessment'].nunique()
# count each assessment type once and reuse the result
type_counts = assessments['assessment_type'].value_counts()
tma, cma, exams = type_counts['TMA'], type_counts['CMA'], type_counts['Exam']
print(f"There are {unique_assessments} unique assessments\n{tma} are Tutor Marked Assessments (TMA)\n{cma} are Computer Marked Assessments (CMA)\n{exams} are Final Exams (Exam)")

There are 206 unique assessments
106 are Tutor Marked Assessments (TMA)
76 are Computer Marked Assessments (CMA)
24 are Final Exams (Exam)


In [300]:
print(assessments.loc[assessments['assessment_type'] == 'TMA', 'code_presentation'].value_counts())
print()
print(assessments.loc[assessments['assessment_type'] == 'CMA', 'code_presentation'].value_counts())
print()
print(assessments.loc[assessments['assessment_type'] == 'Exam', 'code_presentation'].value_counts())

2014J    32
2013J    29
2014B    28
2013B    17
Name: code_presentation, dtype: int64

2014B    22
2013B    19
2013J    18
2014J    17
Name: code_presentation, dtype: int64

2014J    8
2014B    7
2013J    6
2013B    3
Name: code_presentation, dtype: int64


<a id='VLE'></a>

---

<h2>VLE Dataframe</h2>

---

<h3>Cleaning</h3>

---

<h4>1. Look at the dataframe</h4>

---

---

<h4>2. Remove unnecessary variables</h4>

---

In [118]:
vle = vle[['id_site', 'code_module', 'code_presentation', 'activity_type']]

In [119]:
vle.head()

Unnamed: 0,id_site,code_module,code_presentation,activity_type
0,546943,AAA,2013J,resource
1,546712,AAA,2013J,oucontent
2,546998,AAA,2013J,resource
3,546888,AAA,2013J,url
4,547035,AAA,2013J,resource


---

<h4>3. Explore the dataframe</h4>

---

<h4>Basic Information</h4>

In [309]:
analyze_df(vle)

Dataframe Length:

6364


Data Types:

id_site                int64
code_module           object
code_presentation     object
activity_type         object
week_from            float64
week_to              float64
dtype: object


Null Data:

id_site                 0
code_module             0
code_presentation       0
activity_type           0
week_from            5243
week_to              5243
dtype: int64





---

<h4>Activity Type</h4>

In [68]:
print(vle['activity_type'].unique())

['resource' 'oucontent' 'url' 'homepage' 'subpage' 'glossary' 'forumng'
 'oucollaborate' 'dataplus' 'quiz' 'ouelluminate' 'sharedsubpage'
 'questionnaire' 'page' 'externalquiz' 'ouwiki' 'dualpane'
 'repeatactivity' 'folder' 'htmlactivity']


<a id='StudentAssessment'></a>

---

<h2>Student Assessment Dataframe</h2>

---

<h3>Cleaning</h3>

---

<h4>1. Look at the dataframe</h4>



In [141]:
student_info_cm = student_info[['code_module', 'code_presentation', 'id_student']]

In [142]:
student_info_cm

Unnamed: 0,code_module,code_presentation,id_student
0,AAA,2013J,11391
1,AAA,2013J,28400
2,AAA,2013J,30268
3,AAA,2013J,31604
4,AAA,2013J,32885
...,...,...,...
28416,GGG,2014J,2640965
28417,GGG,2014J,2645731
28418,GGG,2014J,2648187
28419,GGG,2014J,2679821


In [132]:
merged = student_assessment.merge(student_info, how='right', on='id_student', indicator=True)

In [133]:
merged.loc[merged['_merge'] == 'right_only']

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,code_module,code_presentation,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
10,,30268,,,,AAA,2013J,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,right_only
212,,135335,,,,AAA,2013J,East Anglian Region,20-30%,0-35,F,Lower Than A Level,N,Withdrawn,-29.0,30.0,right_only
567,,281589,,,,AAA,2013J,North Western Region,30-40%,0-35,M,HE Qualification,N,Fail,-50.0,,right_only
810,,346843,,,,AAA,2013J,Scotland,50-60%,35-55,F,HE Qualification,N,Fail,-44.0,,right_only
816,,354858,,,,AAA,2013J,South Region,90-100%,35-55,M,HE Qualification,N,Withdrawn,-32.0,5.0,right_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181546,,2282141,,,,GGG,2014J,Wales,0-10%,35-55,M,A Level or Equivalent,N,Withdrawn,-32.0,62.0,right_only
181556,,2338614,,,,GGG,2014J,Scotland,0-10%,35-55,F,A Level or Equivalent,Y,Withdrawn,-23.0,58.0,right_only
181575,,2475886,,,,GGG,2014J,East Anglian Region,40-50%,35-55,F,Lower Than A Level,N,Fail,-31.0,,right_only
181603,,2608143,,,,GGG,2014J,East Midlands Region,60-70%,35-55,M,HE Qualification,N,Withdrawn,-45.0,48.0,right_only


In [136]:
merged2 = student_assessment.merge(student_info, how='left', indicator=True)

In [140]:
merged2.loc[merged2['_merge'] == 'left_only']

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,code_module,code_presentation,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
1710,1758,2318055,19,0,75.0,,,,,,,,,,,,left_only
1725,1758,2474849,19,0,35.0,,,,,,,,,,,,left_only
1755,1758,2654628,19,0,69.0,,,,,,,,,,,,left_only
1790,1758,121349,19,0,73.0,,,,,,,,,,,,left_only
1875,1758,303985,19,0,67.0,,,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194353,37443,629258,230,0,80.0,,,,,,,,,,,,left_only
194359,37443,633561,227,0,60.0,,,,,,,,,,,,left_only
194490,37443,470900,219,0,80.0,,,,,,,,,,,,left_only
194493,37443,505216,214,0,80.0,,,,,,,,,,,,left_only


---

<h4>2. Remove unnecessary variables</h4>

---

---

<h4>3. Explore the dataframe</h4>

---

<h4>Basic Information</h4>

In [117]:
analyze_df(student_assessment)

Dataframe Length:

173912


Data Types:

id_assessment       int64
id_student          int64
date_submitted      int64
is_banked           int64
score             float64
dtype: object


Null Data:

id_assessment       0
id_student          0
date_submitted      0
is_banked           0
score             173
dtype: int64




---

<h4>Assessment ID</h4>

---

<h4>Student ID</h4>

In [320]:
student_assessment['id_student'].value_counts()

537811     28
554881     26
632074     25
591581     24
570213     24
           ..
2586026     1
500279      1
497872      1
495324      1
2675393     1
Name: id_student, Length: 23369, dtype: int64

---

<h4>Date Submitted</h4>

---

<h4>Score</h4>

In [None]:
# fill missing assessment dates with the length of the matching presentation
for index, row in assessments[assessments['date'].isna()].iterrows():
    length = courses.loc[(courses['code_module'] == row['code_module']) & (courses['code_presentation'] == row['code_presentation']), 'module_presentation_length']
    assessments.at[index, 'date'] = length.iloc[0]

In [319]:
assessment_score_nas = pd.DataFrame()
for i, row in student_assessment[student_assessment['score'].isna()].iterrows():
    print(student_info.loc[(student_info['id_student'] == row['id_student']), 'final_result'])

227    Withdrawn
638    Withdrawn
Name: final_result, dtype: object
108    Withdrawn
Name: final_result, dtype: object
733    Fail
Name: final_result, dtype: object
843    Withdrawn
Name: final_result, dtype: object
1574    Withdrawn
Name: final_result, dtype: object
1616    Withdrawn
Name: final_result, dtype: object
1981    Withdrawn
Name: final_result, dtype: object
2112    Fail
Name: final_result, dtype: object
843    Withdrawn
Name: final_result, dtype: object
753    Withdrawn
Name: final_result, dtype: object
1423    Fail
Name: final_result, dtype: object
2122    Withdrawn
Name: final_result, dtype: object
1256    Fail
Name: final_result, dtype: object
2122    Withdrawn
Name: final_result, dtype: object
886     Withdrawn
4837         Fail
Name: final_result, dtype: object
2122    Withdrawn
Name: final_result, dtype: object
1221    Withdrawn
5055         Fail
Name: final_result, dtype: object
1361    Pass
Name: final_result, dtype: object
1458    Pass
Name: final_result, dtype: ob

In [317]:
assessment_score_nas

In [123]:
student_assessment[student_assessment['score'].isna()]

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
215,1752,721259,22,0,
937,1754,260355,127,0,
2364,1760,2606802,180,0,
3358,14984,186780,77,0,
3914,14984,531205,26,0,
...,...,...,...,...,...
148929,34903,582670,241,0,
159251,37415,610738,87,0,
166390,37427,631786,221,0,
169725,37435,648110,62,0,


In [125]:
student_info.loc[student_info['id_student'] == 721259]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result
227,AAA,2013J,721259,South Region,50-60%,55<=,F,Lower Than A Level,N,Withdrawn
638,AAA,2014J,721259,South Region,50-60%,55<=,F,Lower Than A Level,N,Withdrawn


In [126]:
student_info.loc[student_info['id_student'] == 260355]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result
108,AAA,2013J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,N,Withdrawn
466,AAA,2014J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,N,Withdrawn


<a id='MachineLearning'></a>

<h1>Machine Learning</h1>

In [267]:
change_col_val(col_dict, student_info)

In [270]:
student_info = student_info.drop(columns=['date_registration', 'date_unregistration'])

In [276]:
vle

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2014J,6516.0,Scotland,80-90%,55<=,M,HE Qualification,N,Pass,-52.0,,2791.0,both
1,DDD,2013J,8462.0,London Region,30-40%,55<=,M,HE Qualification,N,Withdrawn,-137.0,119.0,656.0,both
2,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
4,CCC,2014J,23698.0,East Anglian Region,50-60%,0-35,F,A Level or Equivalent,N,Pass,-110.0,,910.0,both
5,BBB,2013J,23798.0,Wales,50-60%,0-35,M,A Level or Equivalent,N,Distinction,-27.0,,590.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26069,DDD,2014B,2698251.0,South West Region,50-60%,0-35,F,A Level or Equivalent,N,Fail,-23.0,,1511.0,both
26070,AAA,2013J,2698257.0,East Midlands Region,60-70%,0-35,M,Lower Than A Level,N,Pass,-58.0,,758.0,both
26071,CCC,2014B,2698535.0,Wales,50-60%,0-35,M,Lower Than A Level,N,Withdrawn,-156.0,180.0,4241.0,both
26072,BBB,2014J,2698577.0,Wales,50-60%,35-55,F,Lower Than A Level,N,Fail,16.0,,717.0,both


In [271]:
student_info

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result
0,AAA,2013J,11391,0,9,2,0,3,0,1
1,AAA,2013J,28400,11,2,1,1,3,0,1
2,AAA,2013J,30268,1,3,1,1,2,1,2
3,AAA,2013J,31604,2,5,1,1,2,0,1
4,AAA,2013J,32885,3,5,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...
28416,GGG,2014J,2640965,10,1,0,1,1,0,0
28417,GGG,2014J,2645731,0,4,1,1,1,0,3
28418,GGG,2014J,2648187,5,2,0,1,2,1,1
28419,GGG,2014J,2679821,2,9,1,1,1,0,2


In [273]:
from sklearn.linear_model import LinearRegression

# create linear regression object
mlr = LinearRegression()

# fit a multiple linear regression of the encoded final_result on gender and region
mlr.fit(student_info[['gender', 'region']], student_info['final_result'])

# print the intercept and the coefficients of the fitted model
print(mlr.intercept_)

print(mlr.coef_)

1.277053963629578
[-0.00537576 -0.00709779]


In [274]:
assessments.head()

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,
1,AAA,2013J,28400.0,1752,TMA,10.0,19.0,22.0,70.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
2,AAA,2013J,31604.0,1752,TMA,10.0,19.0,17.0,72.0,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,
3,AAA,2013J,32885.0,1752,TMA,10.0,19.0,26.0,69.0,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,
4,AAA,2013J,38053.0,1752,TMA,10.0,19.0,19.0,79.0,Wales,80-90%,35-55,M,A Level or Equivalent,N,Pass,-110.0,


In [275]:
assessments[assessments['region']=='Scotland']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
1,AAA,2013J,28400.0,1752,TMA,10.0,19.0,22.0,70.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
5,AAA,2013J,45462.0,1752,TMA,10.0,19.0,20.0,70.0,Scotland,30-40%,0-35,M,HE Qualification,N,Pass,-67.0,
13,AAA,2013J,63400.0,1752,TMA,10.0,19.0,19.0,83.0,Scotland,40-50%,35-55,M,Lower Than A Level,N,Pass,-67.0,
60,AAA,2013J,164259.0,1752,TMA,10.0,19.0,18.0,82.0,Scotland,70-80%,0-35,M,A Level or Equivalent,N,Pass,-64.0,
75,AAA,2013J,186149.0,1752,TMA,10.0,19.0,33.0,85.0,Scotland,30-40%,35-55,M,HE Qualification,N,Pass,-109.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173622,GGG,2014J,640505.0,37437,TMA,0.0,173.0,177.0,85.0,Scotland,90-100%,0-35,F,HE Qualification,N,Distinction,-109.0,
173640,GGG,2014J,642968.0,37437,TMA,0.0,173.0,169.0,80.0,Scotland,40-50%,0-35,F,Lower Than A Level,N,Pass,-113.0,
173659,GGG,2014J,644743.0,37437,TMA,0.0,173.0,170.0,65.0,Scotland,60-70%,0-35,F,Lower Than A Level,N,Pass,-85.0,
173666,GGG,2014J,645377.0,37437,TMA,0.0,173.0,172.0,80.0,Scotland,30-40%,0-35,F,Lower Than A Level,N,Distinction,-86.0,


In [26]:
# list of final_result possibilities
final_results = ['Fail', 'Pass', 'Withdrawn', 'Distinction']

# list of disability possibilities
disability = ['N', 'Y']

# list of region possibilities
regions = ['East Anglian Region', 'North Western Region',
 'South East Region', 'West Midlands Region', 'North Region',
 'South Region', 'South West Region', 'East Midlands Region',
 'Yorkshire Region', 'London Region', 'Wales', 'Scotland', 'Ireland']

# list of highest_education possibilities
highest_ed = ['No Formal quals', 'Lower Than A Level', 'A Level or Equivalent', 'HE Qualification', 'Post Graduate Qualification' ]

# list of imd_band possibilities
imd_bands = ['0-10%', '10-20%', '20-30%', '30-40%', '40-50%', '50-60%', '60-70%', '70-80%', '80-90%', '90-100%']

# list of age_band possibilities
age_bands = ['0-35', '35-55', '55<=']

# list of code_presentation possibilities
code_mods = ['2013B', '2013J', '2014B', '2014J']

# list of gender possibilities
genders = ['M', 'F']

# dictionary mapping column string names to the above lists to pass to the change_col_val function
col_dict = {'imd_band':imd_bands, 'region':regions, 'disability':disability, 'age_band':age_bands, 'highest_education':highest_ed, 'gender':genders, 'final_result':final_results}
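
`change_col_val` is defined in `functions` and is not shown in this notebook. Judging by the encoded `student_info` table (for example, `Pass` becomes 1 and `Scotland` becomes 11), it appears to replace each category with its position in the corresponding list. A hypothetical equivalent, here called `encode_by_position`, might look like this:

```python
import pandas as pd


def encode_by_position(mapping, df):
    """Return a copy of df where each listed column holds the index of its value in the list."""
    encoded = df.copy()
    for column, categories in mapping.items():
        lookup = {value: position for position, value in enumerate(categories)}
        # Values missing from the list become NaN rather than raising.
        encoded[column] = encoded[column].map(lookup)
    return encoded


# Made-up rows using two of the lists defined above.
demo = pd.DataFrame({"gender": ["M", "F"], "final_result": ["Pass", "Withdrawn"]})
demo_dict = {
    "gender": ["M", "F"],
    "final_result": ["Fail", "Pass", "Withdrawn", "Distinction"],
}
print(encode_by_position(demo_dict, demo))
```

Unlike the call in this notebook, which seems to modify `student_info` in place, the sketch returns a copy. Note that position-based codes impose an order on columns such as region that have no natural ranking, which matters when the codes are fed to a linear model.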