# Assessing student performance in an online course

In this example, we will assess the risk of a student failing a course module based on student characteristics (gender, age, etc.) and information about their activity (studied credits, number of previous attempts to pass the course). To do that, we will train a student model using logistic regression.

Then, we will try to improve the model's performance in terms of accuracy by using the assignments' grades as an additional factor.

In [4]:
#import all the python libraries that we will need for our analysis

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


Read the needed data into dataframes.
Here we assume that the data files are in the same directory as the python script.

In [7]:
#read student information
studentInfo = pd.read_csv("studentInfo.csv") 
#print out the 10 first rows of the data
studentInfo.head(10)

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass
5,AAA,2013J,38053,M,Wales,A Level or Equivalent,80-90%,35-55,0,60,N,Pass
6,AAA,2013J,45462,M,Scotland,HE Qualification,30-40%,0-35,0,60,N,Pass
7,AAA,2013J,45642,F,North Western Region,A Level or Equivalent,90-100%,0-35,0,120,N,Pass
8,AAA,2013J,52130,F,East Anglian Region,A Level or Equivalent,70-80%,0-35,0,90,N,Pass
9,AAA,2013J,53025,M,North Region,Post Graduate Qualification,,55<=,0,60,N,Pass


In [8]:
# let's see all potential final results
studentInfo["final_result"].unique()

array(['Pass', 'Withdrawn', 'Fail', 'Distinction'], dtype=object)

In [9]:
#create a new column to classify final results: students with a pass or distinction get "1", the rest get "0"
studentInfo["result.class"] = 1

#select the rows with "Withdrawn" or "Fail" and set them to 0 in a single .loc call
#(chained indexing such as df[col].loc[mask] = 0 triggers pandas' SettingWithCopyWarning)
studentInfo.loc[studentInfo["final_result"].isin(["Withdrawn", "Fail"]), "result.class"] = 0

#and look at the dataset again
studentInfo.head()


Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,result.class
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,1
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,1
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,0
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,1
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,1
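
The same labelling can also be written in one step with `isin`, and `value_counts` then shows how many students land in each class. This is a minimal sketch on a hypothetical four-row frame: the `toy` dataframe is made up for illustration, and only the column names come from the data above.

```python
import pandas as pd

# hypothetical mini dataset with the four outcomes seen above
toy = pd.DataFrame({"final_result": ["Pass", "Withdrawn", "Fail", "Distinction"]})

# True for Pass/Distinction, False otherwise; astype(int) turns that into 1/0
toy["result.class"] = toy["final_result"].isin(["Pass", "Distinction"]).astype(int)

# number of students in each class
class_counts = toy["result.class"].value_counts()

print(toy)
print(class_counts)
```

Running `value_counts` on the real `studentInfo["result.class"]` column is a quick way to see whether the two classes are roughly balanced before any model is trained.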


We create one dataframe (Xfactors) with all the factors (variables) that we will use to assess whether a student will pass or fail the course.

In [10]:
Xfactors = studentInfo[["gender", "region", "highest_education", "imd_band", "age_band", "num_of_prev_attempts", "studied_credits", "disability"]]
X_noncat = pd.get_dummies(Xfactors)

X_noncat.head(5)

Unnamed: 0,num_of_prev_attempts,studied_credits,gender_F,gender_M,region_East Anglian Region,region_East Midlands Region,region_Ireland,region_London Region,region_North Region,region_North Western Region,...,imd_band_50-60%,imd_band_60-70%,imd_band_70-80%,imd_band_80-90%,imd_band_90-100%,age_band_0-35,age_band_35-55,age_band_55<=,disability_N,disability_Y
0,0,240,0,1,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
1,0,60,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
2,0,60,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
3,0,60,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,1,0
4,0,60,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,1,0
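
To see what `pd.get_dummies` did to the table above, here is a small sketch on a hypothetical three-row frame (the `toy` data is invented; the column names mirror the real ones). Numeric columns pass through unchanged, each text column becomes one indicator column per category, and a missing value (like the empty `imd_band` in row 9 of the student table) ends up with zeros in all of its indicator columns. Recent pandas versions show the indicators as True/False rather than 1/0.

```python
import pandas as pd

# hypothetical data: two categorical columns and one numeric column
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "imd_band": ["90-100%", None, "20-30%"],  # one missing value
    "studied_credits": [240, 60, 60],
})

# one indicator column per category; numeric columns are kept as they are
toy_dummies = pd.get_dummies(toy)

print(toy_dummies.columns.tolist())
print(toy_dummies.astype(int))
```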


Then we create another variable (Youtcome) that represents the outcome, i.e. what we want to assess.
Here, we want to assess whether a student will pass the course successfully or not, which is represented by the variable "result.class".
Please remember, 1 means the student passes the course (Pass or Distinction), 0 means the student does not (Fail or Withdrawn).

In [12]:
Youtcome = studentInfo["result.class"].values
Youtcome

array([1, 1, 0, ..., 1, 0, 1])

Now it's time to fit our model! This means that we will use "old" data, where we already know the outcome, to train the model. We will also keep a part of the old data to test our model's performance, that is, whether the model learned to assess student performance to an acceptable degree.
The datasets used for training have the suffix "_train" while the datasets saved for testing have the suffix "_test".
The model is trained as a logistic regression binary classifier.

In [14]:
#fit the model
X_train, X_test, y_train, y_test = train_test_split(X_noncat, Youtcome, test_size=0.3, random_state=0)
OurModel = LogisticRegression()
OurModel.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
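
For some intuition about what the fitted classifier computes, here is a minimal sketch on made-up data (`X_toy`, `y_toy` and `toy_model` are hypothetical names, not part of the analysis). For a binary problem, logistic regression calculates a linear score for each row and passes it through the sigmoid function to get the probability of class 1; the predicted class is 1 when that probability is above 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data: one feature, the outcome tends to be 1 for larger values
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# linear score (weights * features + intercept) for each row
linear_score = toy_model.decision_function(X_toy)

# the sigmoid maps the score to a probability of class 1
prob_manual = 1.0 / (1.0 + np.exp(-linear_score))
prob_sklearn = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(prob_manual, prob_sklearn))
print(toy_model.predict(X_toy))
```

On the student data, `OurModel.predict_proba(X_test)[:, 1]` would give the estimated probability of passing for each student in the test set, which can be more informative than the 0/1 prediction alone.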

Now we will use our model (OurModel) to assess student performance using the test dataset.

In [15]:
#predict on the test set
y_pred = OurModel.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(OurModel.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.61
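
An accuracy figure is easier to judge next to a simple baseline, such as always predicting the most frequent class. The sketch below uses hypothetical label arrays (`y_true_toy` and `y_pred_toy` are invented for illustration, they are not the results of our model) and also prints a confusion matrix and the `classification_report` that we imported at the top.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# hypothetical true labels and predictions, for illustration only
y_true_toy = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 1, 1, 0, 1, 1, 0])

accuracy = accuracy_score(y_true_toy, y_pred_toy)

# baseline: always predict the most frequent class
majority_class = np.bincount(y_true_toy).argmax()
baseline_pred = np.full_like(y_true_toy, majority_class)
baseline_accuracy = accuracy_score(y_true_toy, baseline_pred)

print("model accuracy:   ", accuracy)
print("baseline accuracy:", baseline_accuracy)

# rows are the true classes (0, 1), columns are the predicted classes (0, 1)
print(confusion_matrix(y_true_toy, y_pred_toy))
print(classification_report(y_true_toy, y_pred_toy))
```

In the notebook, the same calls can be made with `y_test` and `y_pred` in place of the toy arrays.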


As you can see from the results, our model assesses student performance correctly with 61% accuracy. 
This is not really good, is it? 
Let's try to improve the accuracy by adding one more variable to the predictive features (the Xfactors dataframe): the student's average grade on the Tutor Marked Assessments (TMA) of the course.

To do that, we will need the data contained in the tables assessments and studentAssessment.
The analysis follows.

In [19]:
#read additional data
studentAssessments = pd.read_csv("studentAssessment.csv")
assessments = pd.read_csv("assessments.csv")

# studentAssessments.head(10)
assessments.head(10)

Unnamed: 0,code_module,code_presentation,id_assessment,assessment_type,date,weight
0,AAA,2013J,1752,TMA,19.0,10.0
1,AAA,2013J,1753,TMA,54.0,20.0
2,AAA,2013J,1754,TMA,117.0,20.0
3,AAA,2013J,1755,TMA,166.0,20.0
4,AAA,2013J,1756,TMA,215.0,30.0
5,AAA,2013J,1757,Exam,,100.0
6,AAA,2014J,1758,TMA,19.0,10.0
7,AAA,2014J,1759,TMA,54.0,20.0
8,AAA,2014J,1760,TMA,117.0,20.0
9,AAA,2014J,1761,TMA,166.0,20.0


In [20]:
#retrieve only the tutor-marked assessments (TMA)
TMA = assessments.loc[assessments['assessment_type'] == "TMA"]


#then keep only the student grades for those assessments
TMA_student_grades = studentAssessments.loc[studentAssessments.id_assessment.isin(TMA["id_assessment"])]
#convert scores to numbers; unknown entries ("?") become NaN and are dropped
TMA_student_grades = TMA_student_grades.assign(score=pd.to_numeric(TMA_student_grades['score'], errors='coerce')).dropna(subset=['score'])


In [21]:
#create an empty list where we will save the average grade for each and every student
avg_grades = [] 

In [23]:
#for each student, find all TMA scores for the module they are enrolled in and sum them, weighted by the assessment weight

#make sure scores are numeric; unknown entries ("?") become NaN
numeric_scores = pd.to_numeric(studentAssessments['score'], errors='coerce')

for i in range(len(studentInfo)):

    #all known scores of this student
    is_this_student = (studentAssessments['id_student'] == studentInfo['id_student'][i]) & numeric_scores.notna()

    assmt = list(studentAssessments.loc[is_this_student, 'id_assessment'])
    score = list(numeric_scores[is_this_student])

    final_score = 0
    for j in range(len(assmt)):
        idx = assessments.loc[assessments.id_assessment == assmt[j]].index[0]
        #count only TMAs that belong to the student's module
        if (assessments.code_module[idx] == studentInfo['code_module'][i]) and (assessments.assessment_type[idx] == "TMA"):
            final_score = final_score + (float(assessments.weight[idx]) * score[j]) / 100

    avg_grades.append(final_score)

#add the new information about average TMA grades to the student information dataframe

studentInfo['avg_TMA_assessment'] = avg_grades
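
The loop above visits every student and every assessment one by one, which can take a long time on the full dataset. The same weighted TMA score could probably be computed much faster with `merge` and `groupby`. The sketch below shows the idea on hypothetical miniature tables (`students`, `assessments_toy` and `grades` are invented stand-ins for `studentInfo`, `assessments` and `studentAssessments`); treat it as an alternative under these assumptions, not as the method used in this notebook.

```python
import pandas as pd

# hypothetical miniature versions of the three tables
students = pd.DataFrame({
    "id_student": [1, 2, 3],
    "code_module": ["AAA", "AAA", "BBB"],
})
assessments_toy = pd.DataFrame({
    "id_assessment": [10, 11, 12, 20],
    "code_module": ["AAA", "AAA", "AAA", "BBB"],
    "assessment_type": ["TMA", "TMA", "Exam", "TMA"],
    "weight": [40.0, 60.0, 100.0, 100.0],
})
grades = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 3],
    "id_assessment": [10, 11, 12, 10, 20],
    "score": ["80", "50", "70", "?", "90"],
})

# unknown scores ("?") become NaN and are dropped
grades["score"] = pd.to_numeric(grades["score"], errors="coerce")
grades = grades.dropna(subset=["score"])

# attach the assessment details and keep only the TMAs
tma_only = assessments_toy[assessments_toy["assessment_type"] == "TMA"]
tma = grades.merge(tma_only, on="id_assessment")
tma["weighted"] = tma["weight"] * tma["score"] / 100

# sum per student and module, then align with the student table
weighted_sum = tma.groupby(["id_student", "code_module"], as_index=False)["weighted"].sum()
students = students.merge(weighted_sum, on=["id_student", "code_module"], how="left")

# students without any known TMA score get 0, as in the loop
students["avg_TMA_assessment"] = students["weighted"].fillna(0)
students = students.drop(columns="weighted")

print(students)
```

As in the loop, exam scores are ignored and a student with no known TMA score ends up with 0.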

In [None]:
#add the new information about average TMA grades to our model

Xfactors_updated = studentInfo[["gender", "region", "highest_education", "imd_band", "age_band", "num_of_prev_attempts", "studied_credits", "disability", "avg_TMA_assessment"]]
X_noncat_updated = pd.get_dummies(Xfactors_updated)

X_noncat_updated.head(5)

In [None]:
#fit the model again
X_train, X_test, y_train, y_test = train_test_split(X_noncat_updated, Youtcome, test_size=0.3, random_state=0)
OurModelUpdated = LogisticRegression()
OurModelUpdated.fit(X_train, y_train)

Now we will use the updated model (OurModelUpdated) to assess student performance using the test dataset.

In [None]:
#predict on the test set
y_pred = OurModelUpdated.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(OurModelUpdated.score(X_test, y_test)))

Well, what about that! The model can now assess students' performance with 83% accuracy!
Looks like the grades of the Tutor Marked Assessments really helped us improve the performance of our model :)
I wonder what else could help... ;)

# Next time.....

How to evaluate the models and choose "THE BEST"?