<h1>Open University Learning Analytics: Data Cleaning and Exploration</h1>

---

In this notebook we analyze the Open University Learning Analytics dataset. It contains information about seven online courses (referred to as modules), the students taking these courses, and their interactions with the courses. The courses were offered in four separate presentations, in February and October of 2013 and 2014. The goal of the analysis is to discover relationships between student features, course features, student grades, and the overall student outcome. We begin by exploring the data, which is distributed across seven CSV files, and then apply machine learning algorithms to see what relationships we can tease out.

In [1]:
from functions import *

@register_cell_magic
def markdown(line, cell):
    return md(cell.format(**globals()))


Navigation:

* Cleaning:
    * [Student Info](#StudentInfo) 
    * [Student Registration](#StudentRegistration) 
    * [Courses](#Courses) 
    * [Assessments](#Assessments)
    * [Student Assessment](#StudentAssessment)
    * [Student VLE](#StudentVLE)
    * [VLE](#VLE) 


<h1>Cleaning and Analysis</h1>

---

Let's get to know our data!

Step by step, we will clean and explore the student data here.
For each dataframe we will first take a general look at its datatypes, null values, duplicate values, and unique values, and clean it based on what we find. Then we will explore the information visually.
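
The first-pass checks listed above can be gathered into one small helper. The sketch below is only an illustration: `overview` is a hypothetical name (it is not the `analyze_df` helper imported from `functions`, whose implementation is not shown in this notebook), and the toy dataframe is invented.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count and unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'nulls': df.isnull().sum(),
        'unique': df.nunique(),   # NaN is not counted as a unique value
    })

# toy data, invented for illustration only
toy = pd.DataFrame({
    'id_student': [1, 2, 2, 3],
    'imd_band': ['0-10%', None, '10-20%', '0-10%'],
})

summary = overview(toy)
n_duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

print(summary)
print(f"fully duplicated rows: {n_duplicate_rows}")
```

Returning a dataframe rather than printing keeps the summary easy to filter, for example to list only the columns that contain nulls.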

---

<h2>Observations and Cleaning</h2>

<h3>General</h3>

* The number of previous attempts may be interesting to analyze on its own, for example to see how students who retake a course change their behavior on the second or higher attempt. Here, however, we are only interested in students on their first attempt, because familiarity with course content is a confounding variable. We will therefore remove students on their second or higher attempt, and then remove num_of_prev_attempts, since it will no longer contain any interesting data.
* studied_credits will not be a part of our analysis, and so may be removed.
* The dataframe columns can be reordered to keep relevant data together.

In [2]:
# changing the student info dataframe to include only records where num_of_prev_attempts is 0
student_info = student_info[student_info['num_of_prev_attempts'] == 0]

In [3]:
# reordering the student_info columns to keep module, region and student data together;
# selecting columns this way also drops num_of_prev_attempts and studied_credits
student_info = student_info[['code_module', 'code_presentation', 'id_student', 'region', 'imd_band', 'age_band', 'gender', 'highest_education', 'disability', 'final_result']]

* The student registration dataframe matches the student_info dataframe 1:1, adding only the date the student registered and, if applicable, the date they unregistered, so we will merge the two dataframes.

In [4]:
# left join and merge student info with student registration
student_info = student_info.merge(student_registration, how='left', on=['code_module', 'code_presentation', 'id_student'])
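
The 1:1 assumption can be checked by pandas itself: `DataFrame.merge` accepts `validate='one_to_one'` and raises `MergeError` when the join keys are not unique on both sides. The sketch below uses small invented stand-ins for the two dataframes, so the values are illustrative only.

```python
import pandas as pd

keys = ['code_module', 'code_presentation', 'id_student']

# toy stand-ins for student_info and student_registration (invented values)
info = pd.DataFrame({
    'code_module': ['AAA', 'AAA'],
    'code_presentation': ['2013J', '2013J'],
    'id_student': [1, 2],
    'final_result': ['Pass', 'Fail'],
})
registration = pd.DataFrame({
    'code_module': ['AAA', 'AAA'],
    'code_presentation': ['2013J', '2013J'],
    'id_student': [1, 2],
    'date_registration': [-30.0, -12.0],
})

# raises pandas.errors.MergeError if the keys are not unique on both sides
merged = info.merge(registration, how='left', on=keys, validate='one_to_one')

# a repeated key on the right side should be rejected
bad_registration = pd.concat([registration, registration.iloc[[0]]], ignore_index=True)
try:
    info.merge(bad_registration, how='left', on=keys, validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print(f"duplicate key detected: {duplicate_detected}")
```

Without `validate`, a duplicated key would silently multiply rows in the merged dataframe instead of raising an error.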

<h3>Datatypes</h3>

In [5]:
# show student info data types
student_info.dtypes

code_module             object
code_presentation       object
id_student               int64
region                  object
imd_band                object
age_band                object
gender                  object
highest_education       object
disability              object
final_result            object
date_registration      float64
date_unregistration    float64
dtype: object

* id_student is currently an int64 datatype, but would be more appropriate as an object data type since it is categorical.

In [6]:
# changing id_student to the object data type
student_info['id_student'] = student_info['id_student'].astype(object)

<h3>Null Values</h3>

In [7]:
# print the sum of null values in each column
student_info.isnull().sum()

code_module                0
code_presentation          0
id_student                 0
region                     0
imd_band                 990
age_band                   0
gender                     0
highest_education          0
disability                 0
final_result               0
date_registration         38
date_unregistration    19809
dtype: int64

* The imd_band variable has 990 null values which we may have to work around. 
* There are 19,809 null values for date_unregistration which represent the students that did not withdraw from the course.
* We have 38 null values for date_registration, and no mention of this in the dataset documentation, so we will treat this as missing data.

<h3>Numerical Analysis</h3>

In [8]:
len(student_info.index)

28421

In [9]:
student_info.describe().astype(int)

Unnamed: 0,date_registration,date_unregistration
count,28383,8612
mean,-68,49
std,48,81
min,-321,-274
25%,-100,-2
50%,-56,27
75%,-29,107
max,167,444


* date_unregistration has 8,612 non-null values, which is the number of students who withdrew from the course.
* The earliest date_unregistration is 274 days before the course began, so some students withdrew without making it to the first day. We are only interested in students who took the course, so we must eliminate those who unregistered on or before day 0.

In [10]:
# removing students who withdrew on or before the first day
student_info = student_info.drop(student_info[(student_info['date_unregistration'] <= 0)].index)
student_info.reset_index(drop=True).head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,
1,AAA,2013J,28400,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
2,AAA,2013J,30268,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0
3,AAA,2013J,31604,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,
4,AAA,2013J,32885,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,
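
It may help to see why this filter keeps students who never unregistered: a comparison against NaN evaluates to False, so `date_unregistration <= 0` flags only actual withdrawals. The sketch below uses an invented toy dataframe.

```python
import numpy as np
import pandas as pd

# toy data: NaN means the student never unregistered
toy = pd.DataFrame({
    'id_student': [1, 2, 3, 4],
    'date_unregistration': [np.nan, -5.0, 0.0, 12.0],
})

# NaN <= 0 is False, so student 1 is not flagged
left_before_start = toy['date_unregistration'] <= 0

# keep everyone who was still registered after day 0
kept = toy[~left_before_start].reset_index(drop=True)
print(kept)
```

Note that `reset_index` returns a new dataframe, so its result has to be assigned if the renumbered index should persist.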


* Also notable is that the latest unregistration date falls well after the end of even the longest course.

In [11]:
# finds the longest module length in courses and prints it
longest_course = courses['module_presentation_length'].max()
md(f"Longest Course: {longest_course} days")

Longest Course: 269 days

In [12]:
# finding students who unregistered after the longest course had ended
student_info.loc[student_info['date_unregistration'] > longest_course]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
21812,FFF,2013J,586851,Wales,0-10%,0-35,M,Lower Than A Level,N,Withdrawn,-22.0,444.0


This seems to be an outlier, but it should not affect our overall analysis, so we will leave it intact.

In [13]:
student_info.nunique()

code_module                7
code_presentation          4
id_student             23804
region                    13
imd_band                  10
age_band                   3
gender                     2
highest_education          5
disability                 2
final_result               4
date_registration        302
date_unregistration      241
dtype: int64

The dataframe has 25,760 rows but only 23,804 unique student IDs. There are no duplicate records, so these students are likely enrolled in more than one course, at the same or different times.

In [14]:
student_info[student_info['id_student'].duplicated()]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
11605,DDD,2013B,86047,Wales,20-30%,0-35,F,HE Qualification,N,Pass,-60.0,
11624,DDD,2013B,131145,South West Region,40-50%,0-35,M,A Level or Equivalent,N,Pass,-103.0,
11627,DDD,2013B,134025,London Region,60-70%,0-35,M,A Level or Equivalent,N,Distinction,-58.0,
11639,DDD,2013B,163067,South East Region,40-50%,0-35,F,Lower Than A Level,N,Pass,-72.0,
11641,DDD,2013B,165733,Scotland,20-30%,0-35,M,A Level or Equivalent,Y,Fail,-99.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
27020,GGG,2014B,501755,South Region,80-90%,0-35,M,Lower Than A Level,N,Pass,-60.0,
27116,GGG,2014B,603921,East Midlands Region,60-70%,0-35,F,Lower Than A Level,N,Pass,-8.0,
27368,GGG,2014B,624795,North Region,10-20,0-35,F,No Formal quals,N,Fail,-22.0,
27400,GGG,2014B,626159,South West Region,60-70%,35-55,F,Lower Than A Level,Y,Pass,-45.0,


In [15]:
# finding student records with duplicate ID's
pd.concat(x for _, x in student_info.groupby("id_student") if len(x) > 1)

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
9336,CCC,2014J,29411,East Midlands Region,80-90%,0-35,M,A Level or Equivalent,N,Withdrawn,-135.0,100.0
12626,DDD,2013J,29411,East Midlands Region,80-90%,0-35,M,A Level or Equivalent,N,Pass,-96.0,
9338,CCC,2014J,29639,North Region,,0-35,M,Lower Than A Level,N,Pass,-24.0,
17721,EEE,2014B,29639,North Region,,0-35,M,Lower Than A Level,N,Pass,-26.0,
7397,CCC,2014B,29820,East Anglian Region,40-50%,0-35,M,HE Qualification,N,Pass,-57.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
25963,FFF,2014J,2681198,East Anglian Region,70-80%,35-55,M,Lower Than A Level,N,Pass,-87.0,
9325,CCC,2014B,2686578,Scotland,60-70%,0-35,M,A Level or Equivalent,N,Distinction,-23.0,
14250,DDD,2013J,2686578,Scotland,60-70%,0-35,M,A Level or Equivalent,N,Distinction,-39.0,
9330,CCC,2014B,2698535,Wales,50-60%,0-35,M,Lower Than A Level,N,Withdrawn,-156.0,180.0


We have 1,956 records that repeat a student ID already seen earlier in the dataframe, and a total of 3,906 records belonging to students whose ID is listed more than once. These do seem to be in different courses, so we will leave them in place.
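
Three different counts are easy to confuse when looking at repeated IDs: the rows that repeat an earlier ID, all rows that belong to a repeated ID, and the number of distinct students involved. The sketch below shows how each might be computed on an invented toy dataframe.

```python
import pandas as pd

# toy enrolments (invented): student 10 appears three times, student 20 twice
toy = pd.DataFrame({
    'id_student': [10, 10, 10, 20, 20, 30],
    'code_module': ['AAA', 'BBB', 'CCC', 'AAA', 'DDD', 'AAA'],
})

# rows that repeat an ID already seen earlier (first occurrences are not counted)
extra_rows = int(toy['id_student'].duplicated().sum())

# every row belonging to an ID that occurs more than once
rows_with_repeated_id = int(toy['id_student'].duplicated(keep=False).sum())

# number of distinct students that occur more than once
counts = toy['id_student'].value_counts()
repeated_students = int((counts > 1).sum())

print(extra_rows, rows_with_repeated_id, repeated_students)
```

The first count always equals the number of rows minus the number of unique IDs, which gives a quick consistency check against `nunique()`.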

In imd_band, the value 10-20 is missing its % sign. We will add it for consistency and clarity.

In [16]:
# changing all 10-20 values in student_info imd_band to 10-20% for consistency's sake
student_info.loc[student_info['imd_band'] == '10-20', 'imd_band'] = '10-20%'
print(student_info['imd_band'].explode().unique())

['90-100%' '20-30%' '30-40%' '50-60%' '80-90%' '70-80%' nan '60-70%'
 '40-50%' '10-20%' '0-10%']


After this cleaning we are down to 25,760 relevant records.

---

Let's take a look at the possible values for our categorical variables:

In [17]:
unique_vals(student_info)

code_module: ['AAA' 'BBB' 'CCC' 'DDD' 'EEE' 'FFF' 'GGG']

code_presentation: ['2013J' '2014J' '2013B' '2014B']

id_student: [11391 28400 30268 ... 2648187 2679821 2684003]

region: ['East Anglian Region' 'Scotland' 'North Western Region'
 'South East Region' 'West Midlands Region' 'Wales' 'North Region'
 'South Region' 'Ireland' 'South West Region' 'East Midlands Region'
 'Yorkshire Region' 'London Region']

imd_band: ['90-100%' '20-30%' '30-40%' '50-60%' '80-90%' '70-80%' nan '60-70%'
 '40-50%' '10-20%' '0-10%']

age_band: ['55<=' '35-55' '0-35']

gender: ['M' 'F']

highest_education: ['HE Qualification' 'A Level or Equivalent' 'Lower Than A Level'
 'Post Graduate Qualification' 'No Formal quals']

disability: ['N' 'Y']

final_result: ['Pass' 'Withdrawn' 'Fail' 'Distinction']

date_registration: [-159.  -53.  -92.  -52. -176. -110.  -67.  -29.  -33. -179. -103.  -47.
  -59.  -68. -180.  -95. -130.  -50. -107.  -27.  -31. -170.  -62. -100.
 -109.    5.  -43.  -26.  -32.  -99.  -82. -19

In [18]:
# list of final_result possibilities
final_results = ['Fail', 'Pass', 'Withdrawn', 'Distinction']

# list of disability possibilities
disability = ['N', 'Y']

# list of region possibilities
regions = ['East Anglian Region', 'North Western Region',
 'South East Region', 'West Midlands Region', 'North Region',
 'South Region', 'South West Region', 'East Midlands Region',
 'Yorkshire Region', 'London Region', 'Wales', 'Scotland', 'Ireland']

# list of highest_education possibilities
highest_ed = ['No Formal quals', 'Lower Than A Level', 'A Level or Equivalent', 'HE Qualification', 'Post Graduate Qualification' ]

# list of imd_band possibilities
imd_bands = ['0-10%', '10-20%', '20-30%', '30-40%', '40-50%', '50-60%', '60-70%', '70-80%', '80-90%', '90-100%']

# list of age_band possibilities
age_bands = ['0-35', '35-55', '55<=']

# list of code_presentation possibilities (despite its name, code_mods holds presentation codes)
code_mods = ['2013B', '2013J', '2014B', '2014J']

# list of gender possibilities
genders = ['M', 'F']

# dictionary mapping column string names to the above lists to pass to the change_col_val function
col_dict = {'imd_band':imd_bands, 'region':regions, 'disability':disability, 'age_band':age_bands, 'highest_education':highest_ed, 'gender':genders, 'final_result':final_results}
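
The lists above fix an explicit order for each categorical variable. The `change_col_val` function they are passed to is not shown in this notebook, so the following is only a sketch of one way such an ordering might be applied, using an ordered pandas categorical on an invented toy column.

```python
import pandas as pd

imd_bands = ['0-10%', '10-20%', '20-30%', '30-40%', '40-50%',
             '50-60%', '60-70%', '70-80%', '80-90%', '90-100%']

# toy column, invented for illustration
toy = pd.DataFrame({'imd_band': ['90-100%', '20-30%', None, '0-10%']})

# an ordered categorical keeps the declared order for sorting and plotting
toy['imd_band'] = pd.Categorical(toy['imd_band'], categories=imd_bands, ordered=True)

# missing values are placed last by sort_values
sorted_bands = toy['imd_band'].sort_values().tolist()
lowest = toy['imd_band'].dropna().min()

print(sorted_bands)
print(lowest)
```

Plain string sorting happens to work for these bands, but an explicit order matters for values such as the `highest_education` levels, where alphabetical order is not meaningful.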

<a id='Assessments'></a>

---

<h2>Assessments Dataframe</h2>

---

<h3>Cleaning</h3>

---

<h4>1. Look at the dataframe</h4>

---

In [19]:
assessments.head()

Unnamed: 0,code_module,code_presentation,id_assessment,assessment_type,date,weight
0,AAA,2013J,1752,TMA,19.0,10.0
1,AAA,2013J,1753,TMA,54.0,20.0
2,AAA,2013J,1754,TMA,117.0,20.0
3,AAA,2013J,1755,TMA,166.0,20.0
4,AAA,2013J,1756,TMA,215.0,30.0


In [20]:
student_assessment.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,1752,11391,18,0,78.0
1,1752,28400,22,0,70.0
2,1752,31604,17,0,72.0
3,1752,32885,26,0,69.0
4,1752,38053,19,0,79.0


In [21]:
student_assessment = student_assessment.drop(columns='is_banked')

---

<h4>2. Remove unnecessary variables</h4>

---


We will merge the student_assessment and assessments dataframes, matching the records by the assessment ID to have one dataframe with the assessment information.

In [22]:
# merges student_assessment with assessments using an outer join on their common ID, id_assessment
# indicator=True creates a column _merge which tells you if the id_assessment was found in one or both dataframes
merged_assessments = student_assessment.merge(assessments, how='outer', on=['id_assessment'],indicator=True)
merged_assessments.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,_merge
0,1752,11391.0,18.0,78.0,AAA,2013J,TMA,19.0,10.0,both
1,1752,28400.0,22.0,70.0,AAA,2013J,TMA,19.0,10.0,both
2,1752,31604.0,17.0,72.0,AAA,2013J,TMA,19.0,10.0,both
3,1752,32885.0,26.0,69.0,AAA,2013J,TMA,19.0,10.0,both
4,1752,38053.0,19.0,79.0,AAA,2013J,TMA,19.0,10.0,both


In [23]:
merged_assessments.loc[merged_assessments['_merge'] == 'right_only']

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,_merge
173912,1757,,,,AAA,2013J,Exam,,100.0,right_only
173913,1763,,,,AAA,2014J,Exam,,100.0,right_only
173914,14990,,,,BBB,2013B,Exam,,100.0,right_only
173915,15002,,,,BBB,2013J,Exam,,100.0,right_only
173916,15014,,,,BBB,2014B,Exam,,100.0,right_only
173917,15025,,,,BBB,2014J,Exam,,100.0,right_only
173918,40087,,,,CCC,2014B,Exam,,100.0,right_only
173919,40088,,,,CCC,2014J,Exam,,100.0,right_only
173920,30713,,,,EEE,2013J,Exam,235.0,100.0,right_only
173921,30718,,,,EEE,2014B,Exam,228.0,100.0,right_only


This subset consists of exams that exist in assessments but have no submissions in student_assessment. Since no student data is mapped to these exams, we will drop them.

In [24]:
# remove tests that students did not take
assessments = merged_assessments.dropna(subset=['id_student'])
# reset the index to be consecutive again
assessments = assessments.reset_index(drop=True)
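
The outer merge with `indicator=True` used above is a general way to find rows that exist on only one side of a join. The sketch below reproduces the pattern on invented toy data; `submissions` and `catalogue` are hypothetical names standing in for student_assessment and assessments.

```python
import pandas as pd

# toy stand-ins (invented values): assessment 3 has no submissions
submissions = pd.DataFrame({'id_assessment': [1, 1, 2], 'id_student': [100, 200, 100]})
catalogue = pd.DataFrame({'id_assessment': [1, 2, 3], 'assessment_type': ['TMA', 'CMA', 'Exam']})

merged = submissions.merge(catalogue, how='outer', on='id_assessment', indicator=True)

# assessments present in the catalogue but never submitted
unmatched = merged.loc[merged['_merge'] == 'right_only', 'id_assessment'].tolist()

# keeping only matched rows has the same effect as dropping rows with no student
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(unmatched)
print(matched)
```

Filtering on `_merge` states the intent directly, whereas `dropna` on a key column relies on that column never being null in the source data.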

In [25]:
assessments.head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,_merge
0,1752,11391.0,18.0,78.0,AAA,2013J,TMA,19.0,10.0,both
1,1752,28400.0,22.0,70.0,AAA,2013J,TMA,19.0,10.0,both
2,1752,31604.0,17.0,72.0,AAA,2013J,TMA,19.0,10.0,both
3,1752,32885.0,26.0,69.0,AAA,2013J,TMA,19.0,10.0,both
4,1752,38053.0,19.0,79.0,AAA,2013J,TMA,19.0,10.0,both


Now we have a dataframe in which each student submission is mapped to the assessment's type, date, and weight.

In [26]:
assessments[assessments['date'].isna()].head()

Unnamed: 0,id_assessment,id_student,date_submitted,score,code_module,code_presentation,assessment_type,date,weight,_merge
52923,24290,558914.0,230.0,32.0,CCC,2014B,Exam,,100.0,both
52924,24290,559706.0,234.0,78.0,CCC,2014B,Exam,,100.0,both
52925,24290,559770.0,230.0,54.0,CCC,2014B,Exam,,100.0,both
52926,24290,560114.0,230.0,64.0,CCC,2014B,Exam,,100.0,both
52927,24290,560311.0,234.0,100.0,CCC,2014B,Exam,,100.0,both


We have 2,873 null data points for assessment date. The dataset documentation states that if the exam date is missing, the exam takes place at the end of the last presentation week. We can find this information in the courses dataframe, in the module_presentation_length column.

In [27]:
# filling each null test date with the length of the matching module presentation
for index, row in assessments[assessments['date'].isna()].iterrows():
    match = (courses['code_module'] == row['code_module']) & (courses['code_presentation'] == row['code_presentation'])
    # .iloc[0] extracts the single matching value so a scalar, not a Series, is assigned
    assessments.at[index, 'date'] = courses.loc[match, 'module_presentation_length'].iloc[0]
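
The row-by-row loop above can also be written as a single lookup, which tends to be faster on large dataframes. The sketch below is an alternative under the assumption that each (code_module, code_presentation) pair appears exactly once in courses; the toy dataframes and their values are invented.

```python
import numpy as np
import pandas as pd

keys = ['code_module', 'code_presentation']

# toy stand-ins for the assessments and courses dataframes (invented values)
toy_assessments = pd.DataFrame({
    'code_module': ['AAA', 'AAA', 'BBB'],
    'code_presentation': ['2013J', '2013J', '2014B'],
    'assessment_type': ['TMA', 'Exam', 'Exam'],
    'date': [19.0, np.nan, np.nan],
})
toy_courses = pd.DataFrame({
    'code_module': ['AAA', 'BBB'],
    'code_presentation': ['2013J', '2014B'],
    'module_presentation_length': [268, 234],
})

# look up each row's presentation length; a left merge keeps the row order of the left side
lengths = toy_assessments[keys].merge(toy_courses, how='left', on=keys)['module_presentation_length']
lengths.index = toy_assessments.index  # realign with the original index

# only missing dates are replaced; known dates are left untouched
toy_assessments['date'] = toy_assessments['date'].fillna(lengths)
print(toy_assessments)
```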

In [28]:
assessments = assessments[['code_module', 'code_presentation', 'id_student', 'id_assessment', 'assessment_type', 'weight', 'date', 'date_submitted', 'score']]

There are 173 records with missing scores. Since score is the variable we are trying to relate to the other features, these records are of little use to us and we will drop them.

In [29]:
assessments = assessments.dropna(subset=['score'])

In [30]:
# converting the data types back
assessments = assessments.astype({'id_assessment': object, 'id_student': object})

In [31]:
assessments.head()

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0
1,AAA,2013J,28400.0,1752,TMA,10.0,19.0,22.0,70.0
2,AAA,2013J,31604.0,1752,TMA,10.0,19.0,17.0,72.0
3,AAA,2013J,32885.0,1752,TMA,10.0,19.0,26.0,69.0
4,AAA,2013J,38053.0,1752,TMA,10.0,19.0,19.0,79.0


In [32]:
analyze_df(assessments)

To remove the assessment records of students we dropped from student_info earlier (for example, those with previous attempts), we merge assessments with student_info and keep only the records found in both.

In [33]:
merged_sia = assessments.merge(student_info, how='outer', on=['id_student', 'code_module', 'code_presentation'], indicator=True)
merged_sia.head()

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
1,AAA,2013J,11391.0,1753,TMA,20.0,54.0,53.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
2,AAA,2013J,11391.0,1754,TMA,20.0,117.0,115.0,80.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
3,AAA,2013J,11391.0,1755,TMA,20.0,166.0,164.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
4,AAA,2013J,11391.0,1756,TMA,30.0,215.0,212.0,82.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both


In [34]:
merged_sia.loc[merged_sia['_merge'] == 'right_only']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
173739,AAA,2013J,30268,,,,,,,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,right_only
173740,AAA,2013J,135335,,,,,,,East Anglian Region,20-30%,0-35,F,Lower Than A Level,N,Withdrawn,-29.0,30.0,right_only
173741,AAA,2013J,281589,,,,,,,North Western Region,30-40%,0-35,M,HE Qualification,N,Fail,-50.0,,right_only
173742,AAA,2013J,346843,,,,,,,Scotland,50-60%,35-55,F,HE Qualification,N,Fail,-44.0,,right_only
173743,AAA,2013J,354858,,,,,,,South Region,90-100%,35-55,M,HE Qualification,N,Withdrawn,-32.0,5.0,right_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
176896,GGG,2014J,2282141,,,,,,,Wales,0-10%,35-55,M,A Level or Equivalent,N,Withdrawn,-32.0,62.0,right_only
176897,GGG,2014J,2338614,,,,,,,Scotland,0-10%,35-55,F,A Level or Equivalent,Y,Withdrawn,-23.0,58.0,right_only
176898,GGG,2014J,2475886,,,,,,,East Anglian Region,40-50%,35-55,F,Lower Than A Level,N,Fail,-31.0,,right_only
176899,GGG,2014J,2608143,,,,,,,East Midlands Region,60-70%,35-55,M,HE Qualification,N,Withdrawn,-45.0,48.0,right_only


In [35]:
assessments = merged_sia.dropna(subset=['final_result', 'id_assessment'])
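
Dropping rows that lack either a final_result or an id_assessment after an outer merge keeps only the keys present in both dataframes. Assuming neither column contains nulls in the source data, an inner merge selects the same rows. The sketch below compares the two approaches on invented toy data.

```python
import pandas as pd

keys = ['id_student', 'code_module', 'code_presentation']

# toy stand-ins (invented): student 2 is absent from the student table,
# student 3 has no assessment records
toy_assess = pd.DataFrame({
    'id_student': [1, 1, 2],
    'code_module': ['AAA'] * 3,
    'code_presentation': ['2013J'] * 3,
    'id_assessment': [1752, 1753, 1752],
})
toy_info = pd.DataFrame({
    'id_student': [1, 3],
    'code_module': ['AAA'] * 2,
    'code_presentation': ['2013J'] * 2,
    'final_result': ['Pass', 'Fail'],
})

# outer merge followed by dropping rows that are missing either side
outer = toy_assess.merge(toy_info, how='outer', on=keys)
via_dropna = outer.dropna(subset=['final_result', 'id_assessment'])

# inner merge keeps only keys found in both dataframes
via_inner = toy_assess.merge(toy_info, how='inner', on=keys)

same_students = set(via_dropna['id_student']) == set(via_inner['id_student'])
print(via_inner)
```

The outer merge remains useful while exploring, since its `_merge` column shows which side each unmatched row came from.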

In [36]:
assessments.reset_index(drop=True)

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
1,AAA,2013J,11391.0,1753,TMA,20.0,54.0,53.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
2,AAA,2013J,11391.0,1754,TMA,20.0,117.0,115.0,80.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
3,AAA,2013J,11391.0,1755,TMA,20.0,166.0,164.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
4,AAA,2013J,11391.0,1756,TMA,30.0,215.0,212.0,82.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153523,GGG,2014J,573320.0,37439,CMA,0.0,229.0,227.0,80.0,South East Region,80-90%,35-55,F,Lower Than A Level,N,Fail,-4.0,,both
153524,GGG,2014J,573320.0,37440,CMA,0.0,229.0,227.0,100.0,South East Region,80-90%,35-55,F,Lower Than A Level,N,Fail,-4.0,,both
153525,GGG,2014J,573320.0,37441,CMA,0.0,229.0,227.0,100.0,South East Region,80-90%,35-55,F,Lower Than A Level,N,Fail,-4.0,,both
153526,GGG,2014J,573320.0,37442,CMA,0.0,229.0,227.0,20.0,South East Region,80-90%,35-55,F,Lower Than A Level,N,Fail,-4.0,,both


In [37]:
assessments.loc[assessments['_merge'] == 'left_only']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge


In [38]:
assessments = assessments.drop(columns='_merge')

In [39]:
analyze_df(assessments)

In [40]:
vle = vle.drop(columns=['week_from', 'week_to'])

In [41]:
# merging vle & student vle
merged_vle = student_vle.merge(vle, how='outer', on=['id_site', 'code_module', 'code_presentation'],indicator=True)

In [42]:
merged_vle.loc[merged_vle['_merge'] == 'right_only']

Unnamed: 0,code_module,code_presentation,id_student,id_site,date,sum_click,activity_type,_merge
10655280,AAA,2013J,,546897,,,url,right_only
10655281,AAA,2013J,,546872,,,subpage,right_only
10655282,AAA,2014J,,1032910,,,url,right_only
10655283,AAA,2014J,,1072237,,,url,right_only
10655284,AAA,2014J,,1027118,,,url,right_only
...,...,...,...,...,...,...,...,...
10655371,FFF,2014B,,779622,,,subpage,right_only
10655372,FFF,2014B,,924222,,,forumng,right_only
10655373,FFF,2014J,,1072239,,,forumng,right_only
10655374,FFF,2014J,,883074,,,subpage,right_only


These are materials for which we have no associated student activity.

In [43]:
merged_vle = merged_vle.dropna(subset=['id_student'])

In [44]:
merged_vle = merged_vle.drop(columns=['_merge'])

In [45]:
vle = merged_vle

In [46]:
vle.head()

Unnamed: 0,code_module,code_presentation,id_student,id_site,date,sum_click,activity_type
0,AAA,2013J,28400.0,546652,-10.0,4.0,forumng
1,AAA,2013J,28400.0,546652,-10.0,1.0,forumng
2,AAA,2013J,28400.0,546652,-10.0,1.0,forumng
3,AAA,2013J,28400.0,546652,-10.0,8.0,forumng
4,AAA,2013J,30268.0,546652,-10.0,3.0,forumng


In [47]:
analyze_df(vle)

In [48]:
vle = vle.reset_index(drop=True)

In [49]:
# pd.concat(x for _, x in vle.groupby(['id_student',"date"]) if len(x) > 1)[0:50]

In [50]:
aggregates = {'sum_click':'sum', 'code_module':'first', 'code_presentation':'first'}

In [51]:
vle = vle.groupby(['id_student']).aggregate(aggregates).reset_index()

In [52]:
vle.head()

Unnamed: 0,id_student,sum_click,code_module,code_presentation
0,6516.0,2791.0,AAA,2014J
1,8462.0,656.0,DDD,2013J
2,11391.0,934.0,AAA,2013J
3,23629.0,161.0,BBB,2013B
4,23698.0,910.0,CCC,2014J
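
Because some students are enrolled in more than one module, grouping by id_student alone pools their clicks across all of their modules and labels the total with the first module encountered. The sketch below shows an alternative that groups by the full enrolment key; it is not what this notebook ran, and the toy click log is invented.

```python
import pandas as pd

# toy click log (invented): student 1 is enrolled in two modules
toy_vle = pd.DataFrame({
    'id_student': [1, 1, 1, 2],
    'code_module': ['AAA', 'AAA', 'BBB', 'AAA'],
    'code_presentation': ['2013J', '2013J', '2014B', '2013J'],
    'sum_click': [4, 6, 10, 3],
})

# grouping by student only: clicks from AAA and BBB are pooled under the first module
per_student = (toy_vle.groupby('id_student')
               .aggregate({'sum_click': 'sum', 'code_module': 'first', 'code_presentation': 'first'})
               .reset_index())

# grouping by the full enrolment key keeps one total per student per module presentation
enrolment_keys = ['id_student', 'code_module', 'code_presentation']
per_enrolment = toy_vle.groupby(enrolment_keys, as_index=False)['sum_click'].sum()

print(per_student)
print(per_enrolment)
```

With per-enrolment totals, a later merge on id_student, code_module and code_presentation can match every enrolment rather than only the first one seen for each student.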


In [53]:
merged_vle_si = student_info.merge(vle, how='outer', on=['id_student', 'code_module', 'code_presentation'],indicator=True)

In [54]:
merged_vle_si

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
1,AAA,2013J,28400.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,,1435.0,both
2,AAA,2013J,30268.0,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,281.0,both
3,AAA,2013J,31604.0,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,,2158.0,both
4,AAA,2013J,32885.0,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,,1034.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28878,FFF,2013J,2694680.0,,,,,,,,,,48.0,right_only
28879,DDD,2014B,2696376.0,,,,,,,,,,282.0,right_only
28880,FFF,2013J,2697608.0,,,,,,,,,,26.0,right_only
28881,FFF,2014B,2697630.0,,,,,,,,,,1109.0,right_only


In [55]:
merged_vle_si.loc[merged_vle_si['_merge'] == 'left_only']

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
701,BBB,2013B,72070.0,South East Region,60-70%,35-55,M,A Level or Equivalent,N,Withdrawn,-24.0,10.0,,left_only
730,BBB,2013B,133531.0,Wales,30-40%,0-35,F,Lower Than A Level,N,Fail,-24.0,,,left_only
735,BBB,2013B,143854.0,West Midlands Region,10-20%,35-55,F,Lower Than A Level,N,Withdrawn,-23.0,27.0,,left_only
800,BBB,2013B,322745.0,Scotland,90-100%,0-35,F,A Level or Equivalent,N,Fail,-85.0,,,left_only
802,BBB,2013B,323914.0,West Midlands Region,10-20%,0-35,F,A Level or Equivalent,N,Fail,-136.0,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25584,GGG,2014J,685028.0,Ireland,60-70%,35-55,F,Lower Than A Level,N,Withdrawn,-9.0,,,left_only
25629,GGG,2014J,688663.0,Wales,10-20%,0-35,F,Lower Than A Level,N,Withdrawn,19.0,123.0,,left_only
25704,GGG,2014J,696711.0,Wales,40-50%,0-35,F,A Level or Equivalent,N,Fail,-21.0,,,left_only
25712,GGG,2014J,697456.0,North Western Region,10-20%,0-35,M,Lower Than A Level,N,Fail,-16.0,,,left_only


In [56]:
merged_vle_si = merged_vle_si.dropna(subset=['region'])
merged_vle_si = merged_vle_si.dropna(subset=['sum_click'])

In [57]:
merged_vle_si

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
1,AAA,2013J,28400.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,,1435.0,both
2,AAA,2013J,30268.0,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,281.0,both
3,AAA,2013J,31604.0,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,,2158.0,both
4,AAA,2013J,32885.0,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,,1034.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25755,GGG,2014J,2640965.0,Wales,10-20%,0-35,F,Lower Than A Level,N,Fail,-4.0,,41.0,both
25756,GGG,2014J,2645731.0,East Anglian Region,40-50%,35-55,F,Lower Than A Level,N,Distinction,-23.0,,893.0,both
25757,GGG,2014J,2648187.0,South Region,20-30%,0-35,F,A Level or Equivalent,Y,Pass,-129.0,,312.0,both
25758,GGG,2014J,2679821.0,South East Region,90-100%,35-55,F,Lower Than A Level,N,Withdrawn,-49.0,101.0,275.0,both


In [58]:
vle = merged_vle_si[['code_module', 'code_presentation', 'region', 'imd_band', 'id_student', 'age_band','gender','highest_education', 'disability', 'sum_click', 'final_result']]

In [59]:
vle

Unnamed: 0,code_module,code_presentation,region,imd_band,id_student,age_band,gender,highest_education,disability,sum_click,final_result
0,AAA,2013J,East Anglian Region,90-100%,11391.0,55<=,M,HE Qualification,N,934.0,Pass
1,AAA,2013J,Scotland,20-30%,28400.0,35-55,F,HE Qualification,N,1435.0,Pass
2,AAA,2013J,North Western Region,30-40%,30268.0,35-55,F,A Level or Equivalent,Y,281.0,Withdrawn
3,AAA,2013J,South East Region,50-60%,31604.0,35-55,F,A Level or Equivalent,N,2158.0,Pass
4,AAA,2013J,West Midlands Region,50-60%,32885.0,0-35,F,Lower Than A Level,N,1034.0,Pass
...,...,...,...,...,...,...,...,...,...,...,...
25755,GGG,2014J,Wales,10-20%,2640965.0,0-35,F,Lower Than A Level,N,41.0,Fail
25756,GGG,2014J,East Anglian Region,40-50%,2645731.0,35-55,F,Lower Than A Level,N,893.0,Distinction
25757,GGG,2014J,South Region,20-30%,2648187.0,0-35,F,A Level or Equivalent,Y,312.0,Pass
25758,GGG,2014J,South East Region,90-100%,2679821.0,35-55,F,Lower Than A Level,N,275.0,Withdrawn


In [60]:
# cast clicks and student IDs to int, then store IDs as object since they are categorical;
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
vle = vle.assign(
    sum_click=vle['sum_click'].astype(int),
    id_student=vle['id_student'].astype(int).astype(object),
)


In [61]:
analyze_df(vle)

In [62]:
vle

Unnamed: 0,code_module,code_presentation,region,imd_band,id_student,age_band,gender,highest_education,disability,sum_click,final_result
0,AAA,2013J,East Anglian Region,90-100%,11391,55<=,M,HE Qualification,N,934,Pass
1,AAA,2013J,Scotland,20-30%,28400,35-55,F,HE Qualification,N,1435,Pass
2,AAA,2013J,North Western Region,30-40%,30268,35-55,F,A Level or Equivalent,Y,281,Withdrawn
3,AAA,2013J,South East Region,50-60%,31604,35-55,F,A Level or Equivalent,N,2158,Pass
4,AAA,2013J,West Midlands Region,50-60%,32885,0-35,F,Lower Than A Level,N,1034,Pass
...,...,...,...,...,...,...,...,...,...,...,...
25755,GGG,2014J,Wales,10-20%,2640965,0-35,F,Lower Than A Level,N,41,Fail
25756,GGG,2014J,East Anglian Region,40-50%,2645731,35-55,F,Lower Than A Level,N,893,Distinction
25757,GGG,2014J,South Region,20-30%,2648187,0-35,F,A Level or Equivalent,Y,312,Pass
25758,GGG,2014J,South East Region,90-100%,2679821,35-55,F,Lower Than A Level,N,275,Withdrawn


In [63]:
merged_vle_ass = assessments.merge(vle, how='outer', on=['code_module', 'code_presentation', 'id_student', 'region', 'imd_band', 'age_band', 'gender', 'highest_education', 'disability', 'final_result'],indicator=True)

In [64]:
merged_vle_ass

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
1,AAA,2013J,11391.0,1753,TMA,20.0,54.0,53.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
2,AAA,2013J,11391.0,1754,TMA,20.0,117.0,115.0,80.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
3,AAA,2013J,11391.0,1755,TMA,20.0,166.0,164.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
4,AAA,2013J,11391.0,1756,TMA,30.0,215.0,212.0,82.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155747,GGG,2014J,2282141,,,,,,,Wales,0-10%,35-55,M,A Level or Equivalent,N,Withdrawn,,,208.0,right_only
155748,GGG,2014J,2338614,,,,,,,Scotland,0-10%,35-55,F,A Level or Equivalent,Y,Withdrawn,,,51.0,right_only
155749,GGG,2014J,2475886,,,,,,,East Anglian Region,40-50%,35-55,F,Lower Than A Level,N,Fail,,,9.0,right_only
155750,GGG,2014J,2608143,,,,,,,East Midlands Region,60-70%,35-55,M,HE Qualification,N,Withdrawn,,,37.0,right_only


In [65]:
merged_vle_ass.loc[merged_vle_ass['_merge'] == 'left_only']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
15939,BBB,2013J,546195.0,14996,TMA,5.0,19.0,19.0,55.0,London Region,0-10%,0-35,M,Lower Than A Level,N,Fail,-10.0,,,left_only
17835,BBB,2013J,574810.0,14996,TMA,5.0,19.0,24.0,64.0,Wales,90-100%,0-35,F,A Level or Equivalent,N,Withdrawn,-29.0,48.0,,left_only
30669,BBB,2014B,1969081.0,15008,TMA,5.0,12.0,0.0,72.0,South Region,,35-55,F,Lower Than A Level,N,Withdrawn,-95.0,3.0,,left_only
31240,BBB,2014B,38941.0,15008,TMA,5.0,12.0,13.0,77.0,East Midlands Region,10-20%,0-35,F,A Level or Equivalent,N,Fail,-24.0,,,left_only
33831,BBB,2014J,674777.0,15020,TMA,0.0,19.0,12.0,100.0,Scotland,10-20%,0-35,F,Lower Than A Level,N,Pass,-101.0,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149165,GGG,2014B,542562.0,37429,CMA,0.0,222.0,126.0,60.0,South East Region,60-70%,0-35,F,A Level or Equivalent,Y,Pass,-68.0,,,left_only
149166,GGG,2014B,542562.0,37430,CMA,0.0,222.0,143.0,68.0,South East Region,60-70%,0-35,F,A Level or Equivalent,Y,Pass,-68.0,,,left_only
149167,GGG,2014B,542562.0,37431,CMA,0.0,222.0,178.0,80.0,South East Region,60-70%,0-35,F,A Level or Equivalent,Y,Pass,-68.0,,,left_only
149168,GGG,2014B,542562.0,37432,CMA,0.0,222.0,191.0,60.0,South East Region,60-70%,0-35,F,A Level or Equivalent,Y,Pass,-68.0,,,left_only
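
Before filtering on `_merge`, it can help to tally how many rows fall into each category. The following is a minimal sketch on two made-up frames (the names `left`, `right` and `merged_demo` and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frames standing in for the assessment and VLE tables (made-up values).
left = pd.DataFrame({"id_student": [1, 2, 3], "score": [78.0, 85.0, 55.0]})
right = pd.DataFrame({"id_student": [2, 3, 4], "sum_click": [934.0, 281.0, 51.0]})

# indicator=True adds a `_merge` column with the values
# 'both', 'left_only' or 'right_only' for every row.
merged_demo = left.merge(right, how="outer", on="id_student", indicator=True)

# Tally the rows by origin.
merge_counts = merged_demo["_merge"].value_counts()
print(merge_counts)

# Students that appear only in the left frame.
left_only_ids = merged_demo.loc[
    merged_demo["_merge"] == "left_only", "id_student"
].tolist()
print(left_only_ids)
```

A large `left_only` or `right_only` count is a hint that the join keys do not line up as expected.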


In [66]:
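# Keep only the first five rows for a quick look; all five are matched rows, so the `_merge` filters below come back empty.
merged_vle_ass = assessments.merge(vle, how='outer', on=['code_module', 'code_presentation', 'id_student', 'region', 'imd_band', 'age_band', 'gender', 'highest_education', 'disability', 'final_result'], indicator=True).head()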

In [67]:
analyze_df(merged_vle_ass)

In [68]:
merged_vle_ass

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
1,AAA,2013J,11391.0,1753,TMA,20.0,54.0,53.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
2,AAA,2013J,11391.0,1754,TMA,20.0,117.0,115.0,80.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
3,AAA,2013J,11391.0,1755,TMA,20.0,166.0,164.0,85.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
4,AAA,2013J,11391.0,1756,TMA,30.0,215.0,212.0,82.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both


In [69]:
merged_vle_ass.loc[merged_vle_ass['_merge'] == 'left_only']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge


In [70]:
merged_vle_ass.loc[merged_vle_ass['_merge'] == 'right_only']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge


In [71]:
# merged_vle_ass.loc[merged_vle_ass['_merge'] == 'right_only']

In [72]:
# merged_vle_ass.loc[merged_vle_ass['_merge'] == 'left_only']

In [73]:
# merged_vle_ass2 = vle.merge(assessments, how='left', on=['code_module', 'code_presentation'],indicator=True).head()

In [74]:
# merged_vle_ass2 

In [75]:
# merged_vle_ass2.loc[merged_vle_ass2['_merge'] == 'right_only']

In [76]:
# merged_vle_ass2.loc[merged_vle_ass2['_merge'] == 'left_only']

In [77]:
# merged_vle_ass3 = assessments.merge(vle, how='right', on=['code_module', 'code_presentation'],indicator=True).head()

In [78]:
# merged_vle_ass3 

In [79]:
# merged_vle_ass3.loc[merged_vle_ass3['_merge'] == 'right_only']

In [80]:
# merged_vle_ass3.loc[merged_vle_ass3['_merge'] == 'left_only']

In [81]:
# merged_vle_ass4 = assessments.merge(vle, how='left', on=['code_module', 'code_presentation'],indicator=True).head()

In [82]:
# merged_vle_ass4

In [83]:
# merged_vle_ass4.loc[merged_vle_ass4['_merge'] == 'right_only']

In [84]:
# merged_vle_ass4.loc[merged_vle_ass4['_merge'] == 'left_only']

In [85]:
# merged_vle_si

In [86]:
# merged_vle_si.loc[merged_vle_si['_merge'] == 'left_only']

In [87]:
merged_vle_si = merged_vle_si.dropna(subset=['final_result'])

In [88]:
vle = merged_vle_si

In [89]:
vle

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
1,AAA,2013J,28400.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,,1435.0,both
2,AAA,2013J,30268.0,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,281.0,both
3,AAA,2013J,31604.0,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,,2158.0,both
4,AAA,2013J,32885.0,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,,1034.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25755,GGG,2014J,2640965.0,Wales,10-20%,0-35,F,Lower Than A Level,N,Fail,-4.0,,41.0,both
25756,GGG,2014J,2645731.0,East Anglian Region,40-50%,35-55,F,Lower Than A Level,N,Distinction,-23.0,,893.0,both
25757,GGG,2014J,2648187.0,South Region,20-30%,0-35,F,A Level or Equivalent,Y,Pass,-129.0,,312.0,both
25758,GGG,2014J,2679821.0,South East Region,90-100%,35-55,F,Lower Than A Level,N,Withdrawn,-49.0,101.0,275.0,both


* The data types are acceptable.
* There are 11 null values in `date`. According to the dataset documentation, a missing final exam date means the exam takes place at the end of the last presentation week.
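
The documented rule for missing exam dates can be applied as a vectorized fill. The sketch below is one possible approach on small made-up frames: the column names `date` and `module_presentation_length` follow the dataset, while `courses_demo`, `assessments_demo` and all values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the courses and assessments tables (made-up values).
courses_demo = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2013J"],
    "module_presentation_length": [268, 240],
})
assessments_demo = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2013J"],
    "assessment_type": ["TMA", "Exam", "Exam"],
    "date": [19.0, None, None],
})

# Attach each presentation's length to its assessments. A left merge keeps
# the row order of assessments_demo and returns a fresh 0..n-1 index.
with_length = assessments_demo.merge(
    courses_demo, on=["code_module", "code_presentation"], how="left"
)

# Use the presentation length wherever the date is missing.
assessments_demo["date"] = with_length["date"].fillna(
    with_length["module_presentation_length"]
)
print(assessments_demo)
```

This relies on `assessments_demo` having a default integer index so that the assignment aligns row by row; a frame with a different index would need `reset_index(drop=True)` first.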

We will again look at the possible values for our categorical variables:

---

<h4>Code Module</h4>


---

<h4>Code Presentation</h4>


---

<h4>Assessment ID</h4>


---

<h4>Assessment Type</h4>


In [90]:
# Each row of `assessments` is one submitted assessment, so these are record counts, not distinct assessment IDs.
assessment_records = assessments['id_assessment'].count()
type_counts = assessments['assessment_type'].value_counts()
tma, cma, exams = type_counts['TMA'], type_counts['CMA'], type_counts['Exam']
print(f"There are {assessment_records} assessment records\n{tma} are Tutor Marked Assessments (TMA)\n{cma} are Computer Marked Assessments (CMA)\n{exams} are Final Exams (Exam)")

There are 153528 assessment records
86495 are Tutor Marked Assessments (TMA)
62561 are Computer Marked Assessments (CMA)
4472 are Final Exams (Exam)
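
Note that `count()` returns the number of non-null rows, whereas `nunique()` returns the number of distinct values. Since the merged table holds one row per submission, the two can differ considerably. A minimal sketch on made-up data (the frame `records` is hypothetical):

```python
import pandas as pd

# Five submissions spread over three distinct assessments (made-up values).
records = pd.DataFrame({
    "id_assessment": [1752, 1752, 1753, 1753, 1754],
    "assessment_type": ["TMA", "TMA", "TMA", "TMA", "Exam"],
})

n_rows = records["id_assessment"].count()      # non-null rows
n_unique = records["id_assessment"].nunique()  # distinct assessment IDs

# Count assessment types once per assessment rather than once per submission.
type_per_assessment = (
    records.drop_duplicates("id_assessment")["assessment_type"].value_counts()
)
print(n_rows, n_unique)
print(type_per_assessment)
```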


In [91]:
print(assessments.loc[assessments['assessment_type'] == 'TMA', 'code_presentation'].value_counts())
print()
print(assessments.loc[assessments['assessment_type'] == 'CMA', 'code_presentation'].value_counts())
print()
print(assessments.loc[assessments['assessment_type'] == 'Exam', 'code_presentation'].value_counts())

2014J    28870
2013J    26214
2014B    18087
2013B    13324
Name: code_presentation, dtype: int64

2013J    17322
2014J    15851
2014B    15428
2013B    13960
Name: code_presentation, dtype: int64

2014J    1914
2014B    1192
2013J     854
2013B     512
Name: code_presentation, dtype: int64


<a id='VLE'></a>

---

<h2>VLE Dataframe</h2>

---

<h3>Cleaning</h3>

---

<h4>1. Look at the dataframe</h4>

---

In [92]:
vle.head()

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
1,AAA,2013J,28400.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,,1435.0,both
2,AAA,2013J,30268.0,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,281.0,both
3,AAA,2013J,31604.0,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,,2158.0,both
4,AAA,2013J,32885.0,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,,1034.0,both


---

<h4>2. Remove unnecessary variables</h4>

---

In [93]:
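# Requires the raw VLE table: `vle` was reassigned to the merged student frame in In [88], hence the KeyError below.
vle = vle[['id_site', 'code_module', 'code_presentation', 'activity_type']]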

KeyError: "['id_site', 'activity_type'] not in index"

In [119]:
vle.head()

Unnamed: 0,id_site,code_module,code_presentation,activity_type
0,546943,AAA,2013J,resource
1,546712,AAA,2013J,oucontent
2,546998,AAA,2013J,resource
3,546888,AAA,2013J,url
4,547035,AAA,2013J,resource


---

<h4>3. Explore the dataframe</h4>

---

<h4>Basic Information</h4>

In [309]:
analyze_df(vle)

Dataframe Length:

6364


Data Types:

id_site                int64
code_module           object
code_presentation     object
activity_type         object
week_from            float64
week_to              float64
dtype: object


Null Data:

id_site                 0
code_module             0
code_presentation       0
activity_type           0
week_from            5243
week_to              5243
dtype: int64




In [314]:
print(vle['activity_type'].unique())

['resource' 'oucontent' 'url' 'homepage' 'subpage' 'glossary' 'forumng'
 'oucollaborate' 'dataplus' 'quiz' 'ouelluminate' 'sharedsubpage'
 'questionnaire' 'page' 'externalquiz' 'ouwiki' 'dualpane'
 'repeatactivity' 'folder' 'htmlactivity']


---

<h4>Code Module</h4>

---

<h4>Code Presentation</h4>

---

<h4>Student ID</h4>

---

<h4>Site ID</h4>

---

<h4>Date</h4>

---

<h4>Sum Click</h4>

---

<h4>Site ID</h4>

---

<h4>Code Module</h4>

---

<h4>Code Presentation</h4>

---

<h4>Activity Type</h4>

In [68]:
print(vle['activity_type'].unique())

['resource' 'oucontent' 'url' 'homepage' 'subpage' 'glossary' 'forumng'
 'oucollaborate' 'dataplus' 'quiz' 'ouelluminate' 'sharedsubpage'
 'questionnaire' 'page' 'externalquiz' 'ouwiki' 'dualpane'
 'repeatactivity' 'folder' 'htmlactivity']


<a id='StudentAssessment'></a>

---

<h2>Student Assessment Dataframe</h2>

---

<h3>Cleaning</h3>

---

<h4>1. Look at the dataframe</h4>



In [141]:
student_info_cm = student_info[['code_module', 'code_presentation', 'id_student']]

In [142]:
student_info_cm

Unnamed: 0,code_module,code_presentation,id_student
0,AAA,2013J,11391
1,AAA,2013J,28400
2,AAA,2013J,30268
3,AAA,2013J,31604
4,AAA,2013J,32885
...,...,...,...
28416,GGG,2014J,2640965
28417,GGG,2014J,2645731
28418,GGG,2014J,2648187
28419,GGG,2014J,2679821


In [132]:
merged = student_assessment.merge(student_info, how='right', on=['id_student'], indicator=True)

In [133]:
merged.loc[merged['_merge'] == 'right_only']

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,code_module,code_presentation,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
10,,30268,,,,AAA,2013J,North Western Region,30-40%,35-55,F,A Level or Equivalent,Y,Withdrawn,-92.0,12.0,right_only
212,,135335,,,,AAA,2013J,East Anglian Region,20-30%,0-35,F,Lower Than A Level,N,Withdrawn,-29.0,30.0,right_only
567,,281589,,,,AAA,2013J,North Western Region,30-40%,0-35,M,HE Qualification,N,Fail,-50.0,,right_only
810,,346843,,,,AAA,2013J,Scotland,50-60%,35-55,F,HE Qualification,N,Fail,-44.0,,right_only
816,,354858,,,,AAA,2013J,South Region,90-100%,35-55,M,HE Qualification,N,Withdrawn,-32.0,5.0,right_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181546,,2282141,,,,GGG,2014J,Wales,0-10%,35-55,M,A Level or Equivalent,N,Withdrawn,-32.0,62.0,right_only
181556,,2338614,,,,GGG,2014J,Scotland,0-10%,35-55,F,A Level or Equivalent,Y,Withdrawn,-23.0,58.0,right_only
181575,,2475886,,,,GGG,2014J,East Anglian Region,40-50%,35-55,F,Lower Than A Level,N,Fail,-31.0,,right_only
181603,,2608143,,,,GGG,2014J,East Midlands Region,60-70%,35-55,M,HE Qualification,N,Withdrawn,-45.0,48.0,right_only


In [136]:
merged2 = student_assessment.merge(student_info, how='left', on='id_student', indicator=True)

In [140]:
merged2.loc[merged2['_merge'] == 'left_only']

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,code_module,code_presentation,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,_merge
1710,1758,2318055,19,0,75.0,,,,,,,,,,,,left_only
1725,1758,2474849,19,0,35.0,,,,,,,,,,,,left_only
1755,1758,2654628,19,0,69.0,,,,,,,,,,,,left_only
1790,1758,121349,19,0,73.0,,,,,,,,,,,,left_only
1875,1758,303985,19,0,67.0,,,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194353,37443,629258,230,0,80.0,,,,,,,,,,,,left_only
194359,37443,633561,227,0,60.0,,,,,,,,,,,,left_only
194490,37443,470900,219,0,80.0,,,,,,,,,,,,left_only
194493,37443,505216,214,0,80.0,,,,,,,,,,,,left_only


---

<h4>2. Remove unnecessary variables</h4>

---

---

<h4>3. Explore the dataframe</h4>

---

<h4>Basic Information</h4>

In [117]:
analyze_df(student_assessment)

Dataframe Length:

173912


Data Types:

id_assessment       int64
id_student          int64
date_submitted      int64
is_banked           int64
score             float64
dtype: object


Null Data:

id_assessment       0
id_student          0
date_submitted      0
is_banked           0
score             173
dtype: int64




---

<h4>Assessment ID</h4>

---

<h4>Student ID</h4>

In [320]:
student_assessment['id_student'].value_counts()

537811     28
554881     26
632074     25
591581     24
570213     24
           ..
2586026     1
500279      1
497872      1
495324      1
2675393     1
Name: id_student, Length: 23369, dtype: int64

---

<h4>Date Submitted</h4>

---

<h4>Score</h4>

In [None]:
# Fill each missing assessment date with the length of the matching module presentation.
for index, row in assessments[assessments['date'].isna()].iterrows():
    match = (courses['code_module'] == row['code_module']) & (courses['code_presentation'] == row['code_presentation'])
    assessments.at[index, 'date'] = courses.loc[match, 'module_presentation_length'].iloc[0]

In [319]:
assessment_score_nas = pd.DataFrame()
for i, row in student_assessment[student_assessment['score'].isna()].iterrows():
    print(student_info.loc[(student_info['id_student'] == row['id_student']), 'final_result'])

227    Withdrawn
638    Withdrawn
Name: final_result, dtype: object
108    Withdrawn
Name: final_result, dtype: object
733    Fail
Name: final_result, dtype: object
843    Withdrawn
Name: final_result, dtype: object
1574    Withdrawn
Name: final_result, dtype: object
1616    Withdrawn
Name: final_result, dtype: object
1981    Withdrawn
Name: final_result, dtype: object
2112    Fail
Name: final_result, dtype: object
843    Withdrawn
Name: final_result, dtype: object
753    Withdrawn
Name: final_result, dtype: object
1423    Fail
Name: final_result, dtype: object
2122    Withdrawn
Name: final_result, dtype: object
1256    Fail
Name: final_result, dtype: object
2122    Withdrawn
Name: final_result, dtype: object
886     Withdrawn
4837         Fail
Name: final_result, dtype: object
2122    Withdrawn
Name: final_result, dtype: object
1221    Withdrawn
5055         Fail
Name: final_result, dtype: object
1361    Pass
Name: final_result, dtype: object
1458    Pass
Name: final_result, dtype: ob

In [317]:
assessment_score_nas

In [123]:
student_assessment[student_assessment['score'].isna()]

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
215,1752,721259,22,0,
937,1754,260355,127,0,
2364,1760,2606802,180,0,
3358,14984,186780,77,0,
3914,14984,531205,26,0,
...,...,...,...,...,...
148929,34903,582670,241,0,
159251,37415,610738,87,0,
166390,37427,631786,221,0,
169725,37435,648110,62,0,


In [125]:
student_info.loc[student_info['id_student'] == 721259]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result
227,AAA,2013J,721259,South Region,50-60%,55<=,F,Lower Than A Level,N,Withdrawn
638,AAA,2014J,721259,South Region,50-60%,55<=,F,Lower Than A Level,N,Withdrawn


In [126]:
student_info.loc[student_info['id_student'] == 260355]

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result
108,AAA,2013J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,N,Withdrawn
466,AAA,2014J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,N,Withdrawn
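
Looking up students one at a time does not scale to all 173 missing scores. A merge can fetch the final results for every affected student at once. This is a minimal sketch on made-up frames; the `_demo` names and all values are illustrative assumptions, not the notebook's actual data:

```python
import pandas as pd

# Toy stand-ins for student_assessment and student_info (made-up values).
student_assessment_demo = pd.DataFrame({
    "id_assessment": [1752, 1752, 1754],
    "id_student": [11391, 721259, 260355],
    "score": [78.0, None, None],
})
student_info_demo = pd.DataFrame({
    "id_student": [11391, 721259, 721259, 260355],
    "code_presentation": ["2013J", "2013J", "2014J", "2013J"],
    "final_result": ["Pass", "Withdrawn", "Withdrawn", "Withdrawn"],
})

# Select the submissions without a score, then attach the student records.
missing_scores = student_assessment_demo[student_assessment_demo["score"].isna()]
assessment_score_nas_demo = missing_scores.merge(
    student_info_demo, on="id_student", how="left"
)

# Summarize the outcomes of students with a missing score.
result_counts = assessment_score_nas_demo["final_result"].value_counts()
print(result_counts)
```

Matching on `id_student` alone repeats a submission once for every presentation the student registered for, so the counts are per registration rather than per student.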


<a id='MachineLearning'></a>

<h1>Machine Learning</h1>

In [267]:
change_col_val(col_dict, student_info)

In [270]:
student_info = student_info.drop(columns=['date_registration', 'date_unregistration'])

In [276]:
vle

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration,sum_click,_merge
0,AAA,2014J,6516.0,Scotland,80-90%,55<=,M,HE Qualification,N,Pass,-52.0,,2791.0,both
1,DDD,2013J,8462.0,London Region,30-40%,55<=,M,HE Qualification,N,Withdrawn,-137.0,119.0,656.0,both
2,AAA,2013J,11391.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,,934.0,both
4,CCC,2014J,23698.0,East Anglian Region,50-60%,0-35,F,A Level or Equivalent,N,Pass,-110.0,,910.0,both
5,BBB,2013J,23798.0,Wales,50-60%,0-35,M,A Level or Equivalent,N,Distinction,-27.0,,590.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26069,DDD,2014B,2698251.0,South West Region,50-60%,0-35,F,A Level or Equivalent,N,Fail,-23.0,,1511.0,both
26070,AAA,2013J,2698257.0,East Midlands Region,60-70%,0-35,M,Lower Than A Level,N,Pass,-58.0,,758.0,both
26071,CCC,2014B,2698535.0,Wales,50-60%,0-35,M,Lower Than A Level,N,Withdrawn,-156.0,180.0,4241.0,both
26072,BBB,2014J,2698577.0,Wales,50-60%,35-55,F,Lower Than A Level,N,Fail,16.0,,717.0,both


In [271]:
student_info

Unnamed: 0,code_module,code_presentation,id_student,region,imd_band,age_band,gender,highest_education,disability,final_result
0,AAA,2013J,11391,0,9,2,0,3,0,1
1,AAA,2013J,28400,11,2,1,1,3,0,1
2,AAA,2013J,30268,1,3,1,1,2,1,2
3,AAA,2013J,31604,2,5,1,1,2,0,1
4,AAA,2013J,32885,3,5,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...
28416,GGG,2014J,2640965,10,1,0,1,1,0,0
28417,GGG,2014J,2645731,0,4,1,1,1,0,3
28418,GGG,2014J,2648187,5,2,0,1,2,1,1
28419,GGG,2014J,2679821,2,9,1,1,1,0,2


In [273]:
from sklearn.linear_model import LinearRegression

# create a linear regression object
mlr = LinearRegression()

# fit the integer-encoded final_result on the integer-encoded gender and region
mlr.fit(student_info[['gender', 'region']], student_info['final_result'])

# print the intercept and the coefficients of the fitted model
print(mlr.intercept_)
print(mlr.coef_)

1.277053963629578
[-0.00537576 -0.00709779]
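
The integer codes for `region` are nominal labels, so a linear model treats them as if they were ordered quantities. One common alternative is one-hot encoding, which gives every region its own indicator column. The sketch below uses `pd.get_dummies` on a made-up frame (`features_demo` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy feature frame with integer-coded categories (made-up values).
features_demo = pd.DataFrame({
    "gender": [0, 1, 1, 0],
    "region": [0, 11, 1, 11],  # integer codes standing in for region names
})

# Replace the single `region` column with one indicator column per code.
encoded = pd.get_dummies(features_demo, columns=["region"], prefix="region")
print(encoded.columns.tolist())
```

The encoded frame could then be passed to `mlr.fit` in place of the raw `gender` and `region` columns.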


In [274]:
assessments.head()

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391.0,1752,TMA,10.0,19.0,18.0,78.0,East Anglian Region,90-100%,55<=,M,HE Qualification,N,Pass,-159.0,
1,AAA,2013J,28400.0,1752,TMA,10.0,19.0,22.0,70.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
2,AAA,2013J,31604.0,1752,TMA,10.0,19.0,17.0,72.0,South East Region,50-60%,35-55,F,A Level or Equivalent,N,Pass,-52.0,
3,AAA,2013J,32885.0,1752,TMA,10.0,19.0,26.0,69.0,West Midlands Region,50-60%,0-35,F,Lower Than A Level,N,Pass,-176.0,
4,AAA,2013J,38053.0,1752,TMA,10.0,19.0,19.0,79.0,Wales,80-90%,35-55,M,A Level or Equivalent,N,Pass,-110.0,


In [275]:
assessments[assessments['region']=='Scotland']

Unnamed: 0,code_module,code_presentation,id_student,id_assessment,assessment_type,weight,date,date_submitted,score,region,imd_band,age_band,gender,highest_education,disability,final_result,date_registration,date_unregistration
1,AAA,2013J,28400.0,1752,TMA,10.0,19.0,22.0,70.0,Scotland,20-30%,35-55,F,HE Qualification,N,Pass,-53.0,
5,AAA,2013J,45462.0,1752,TMA,10.0,19.0,20.0,70.0,Scotland,30-40%,0-35,M,HE Qualification,N,Pass,-67.0,
13,AAA,2013J,63400.0,1752,TMA,10.0,19.0,19.0,83.0,Scotland,40-50%,35-55,M,Lower Than A Level,N,Pass,-67.0,
60,AAA,2013J,164259.0,1752,TMA,10.0,19.0,18.0,82.0,Scotland,70-80%,0-35,M,A Level or Equivalent,N,Pass,-64.0,
75,AAA,2013J,186149.0,1752,TMA,10.0,19.0,33.0,85.0,Scotland,30-40%,35-55,M,HE Qualification,N,Pass,-109.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173622,GGG,2014J,640505.0,37437,TMA,0.0,173.0,177.0,85.0,Scotland,90-100%,0-35,F,HE Qualification,N,Distinction,-109.0,
173640,GGG,2014J,642968.0,37437,TMA,0.0,173.0,169.0,80.0,Scotland,40-50%,0-35,F,Lower Than A Level,N,Pass,-113.0,
173659,GGG,2014J,644743.0,37437,TMA,0.0,173.0,170.0,65.0,Scotland,60-70%,0-35,F,Lower Than A Level,N,Pass,-85.0,
173666,GGG,2014J,645377.0,37437,TMA,0.0,173.0,172.0,80.0,Scotland,30-40%,0-35,F,Lower Than A Level,N,Distinction,-86.0,
