# Assessing student performance in an online course

In this example, we will assess the risk of a student failing a course module based on student characteristics (gender, age, etc.) and information about their activity (studied credits, number of previous attempts to pass the course). To do that, we will train a student model using logistic regression.

Then, we will try to improve the model's performance in terms of accuracy by using the assignments' grades as an additional factor.

In [None]:
#import all the python libraries that we will need for our analysis

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


In [None]:
# remove Future Warning message
# Reference:
# https://stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Read the needed data into dataframes.
Here we assume that the data files are in the same directory as the python script.

In [None]:
#read student information
studentInfo = pd.read_csv("studentInfo.csv") 
#print out the 10 first rows of the data
studentInfo.head(10)

In [None]:
# let's see all possible final results
studentInfo["final_result"].unique()

In [None]:
#create a new column to classify final results: students with a pass or distinction are "1", the rest are "0"
studentInfo["result.class"] = 1

#set the class to 0 with a single .loc call on the dataframe (chained assignment may silently fail to update it)
studentInfo.loc[(studentInfo["final_result"] == "Withdrawn") | (studentInfo["final_result"] == "Fail"), "result.class"] = 0

#and look at the dataset again
studentInfo.head()
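
The same labelling can also be written in one vectorised step. This is a minimal sketch on a small made-up dataframe (`toy` and its rows are illustrative, not taken from studentInfo.csv), assuming the only outcomes are Pass, Distinction, Fail and Withdrawn.

```python
import pandas as pd

# Toy data standing in for studentInfo; the rows are made up for illustration.
toy = pd.DataFrame({"final_result": ["Pass", "Fail", "Withdrawn", "Distinction"]})

# isin() is True for Pass/Distinction; astype(int) turns True/False into 1/0.
toy["result.class"] = toy["final_result"].isin(["Pass", "Distinction"]).astype(int)

print(toy)
```

Listing the passing outcomes explicitly means any unexpected value ends up in class 0 instead of being counted as a pass by default.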

We create one dataframe (Xfactors) with all the factors (variables) that we will use to assess whether a student will pass or fail the course.

In [None]:
Xfactors = studentInfo[["gender", "region", "highest_education", "imd_band", "age_band", "num_of_prev_attempts", "studied_credits", "disability"]]
X_noncat = pd.get_dummies(Xfactors)

X_noncat.head(5)
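
To see what `pd.get_dummies` does to the factors, here is a small sketch on made-up data (the `toy` dataframe and its values are illustrative only). Text columns are expanded into one indicator column per category, while numeric columns are passed through unchanged.

```python
import pandas as pd

# Toy data with one categorical and one numeric column (values are made up).
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "studied_credits": [60, 120, 60],
})

# Each category of "gender" becomes its own indicator column;
# "studied_credits" is numeric and is kept as it is.
encoded = pd.get_dummies(toy)

print(encoded)
```

This is why X_noncat has many more columns than Xfactors: a column such as "region" turns into one column per region.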

Then we create another variable (Youtcome) which represents the outcome, that is what we want to assess.
Here, we want to assess whether a student will pass the course successfully or not - which is represented by the variable "result.class".
Please remember, 1 means the student passes the course, 0 means the student fails the course or withdraws from it.

In [None]:
Youtcome = studentInfo["result.class"].values
Youtcome

Now it's time to fit our model! This means that we will use "old" data, where we already know the outcome, to train the model. We will also keep a part of the old data to test our model's performance, that is, whether the model has learned to assess student performance to an acceptable degree.
The datasets used for training have the suffix "_train" while the datasets saved for testing have the suffix "_test".
The model is trained as a logistic regression binary classifier.

In [None]:
#fit the model
X_train, X_test, y_train, y_test = train_test_split(X_noncat, Youtcome, test_size=0.3, random_state=0)
OurModel = LogisticRegression()
OurModel.fit(X_train, y_train)

Now we will use our model (OurModel) to assess student performance using the test dataset.

In [None]:
#predict on a testset
y_pred = OurModel.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(OurModel.score(X_test, y_test)))
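
An accuracy figure is easier to interpret next to a baseline that always predicts the most frequent class. The sketch below shows one way to get that comparison, together with per-class precision and recall, using scikit-learn's `DummyClassifier` and `classification_report`. It runs on synthetic data generated only for illustration, so the numbers it prints say nothing about the student dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): the label depends on the
# first feature plus some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print("Baseline accuracy: {:.2f}".format(baseline_acc))
print("Model accuracy:    {:.2f}".format(model_acc))

# Precision, recall and F1 for each class.
print(classification_report(y_test, model.predict(X_test)))
```

The same calls could be applied to X_test and y_test from the cell above to check how far the model is from simply guessing the majority class.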

Our model assesses student performance correctly with 61% accuracy on the test set.
This is not really good, is it?
Let's try to improve the accuracy by adding one more variable to the predictive features (the Xfactors dataframe): the students' average grade on the Teacher Marked Assessments (TMA) of the course.

To do that, we will need the data contained in the tables assessments and studentAssessment.
The analysis follows.

In [None]:
#read additional data
studentAssessments = pd.read_csv("studentAssessment.csv")
assessments = pd.read_csv("assessments.csv")

# studentAssessments.head(10)
# assessments.head(10)

In [None]:
#retrieve only the teacher marked assessments (TMA)
TAM = assessments.loc[assessments['assessment_type'] == "TMA"]


#then keep only the student grades for those teacher marked assessments (TMA) and remove unknown entries ("?")
TAM_student_grades = studentAssessments.loc[studentAssessments.id_assessment.isin(TAM["id_assessment"])]
TAM_student_grades = TAM_student_grades.loc[TAM_student_grades['score'] != '?']
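
Another way to handle the unknown "?" scores is to convert the column to numbers and let pandas mark anything unparseable as missing. This is a sketch on made-up rows (the `toy_scores` dataframe is illustrative), assuming the scores are read in as text.

```python
import pandas as pd

# Toy data: scores stored as text, with "?" marking an unknown score.
toy_scores = pd.DataFrame({
    "id_student": [1, 1, 2],
    "score": ["80", "?", "65"],
})

# errors="coerce" turns anything that is not a number (here "?") into NaN.
toy_scores["score"] = pd.to_numeric(toy_scores["score"], errors="coerce")

# Drop the rows whose score is unknown.
clean = toy_scores.dropna(subset=["score"])

print(clean)
```

After this step the score column is already numeric, so no later conversion to float is needed.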

In [None]:
#create an empty list where we will save the average grade for each and every student
avg_grades = [] 

In [None]:
# studentInfo['id_student'] = studentInfo['id_student'].fillna(0)
studentInfo["id_student"].un

In [None]:
#for each student, find all TMA scores for the module they are enrolled in and add them up, weighted by each assessment's weight

for i in range (0, len(studentInfo['id_student'])):
    
    this_student = studentAssessments.loc[(studentAssessments['id_student'] == studentInfo['id_student'][i]) &
                                          (studentAssessments['score'] != '?')]
    
    assmt = list(this_student['id_assessment'])
    score = list(this_student['score'].astype(float)) 
    # must be converted to 'float' instead of 'int'
    # reference: 
    # https://stackoverflow.com/questions/41550746/error-using-astype-when-nan-exists-in-a-dataframe/41550787
    
    final_score = 0
    for j in range(0, len(assmt)):
        idx = assessments.loc[assessments.id_assessment == assmt[j]].index[0]
        if assessments.code_module[idx] == studentInfo['code_module'][i] and assessments.assessment_type[idx] == "TMA":
            final_score = final_score + (float(assessments.weight[idx])*score[j])/100
            
    avg_grades.append(final_score)
    
#add the new information about weighted TMA grades to the student information dataframe

studentInfo['avg_TMA_assessment'] = avg_grades
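
The loop above can be slow on a large table because it filters the dataframe once per student. A possible alternative is to compute the same weighted TMA score with a merge and a groupby. The sketch below uses small made-up tables (`toy_info`, `toy_assessments` and `toy_student_assessment` are illustrative names and values) and assumes the scores are already numeric. Like the loop, it matches on student and module only and ignores the presentation.

```python
import pandas as pd

# Toy tables mirroring the columns used in the loop (all values are made up).
toy_info = pd.DataFrame({"id_student": [1, 2, 3], "code_module": ["AAA", "AAA", "BBB"]})
toy_assessments = pd.DataFrame({
    "id_assessment": [10, 11, 12, 20],
    "code_module": ["AAA", "AAA", "AAA", "BBB"],
    "assessment_type": ["TMA", "TMA", "Exam", "TMA"],
    "weight": [50.0, 50.0, 100.0, 100.0],
})
toy_student_assessment = pd.DataFrame({
    "id_student": [1, 1, 1, 2],
    "id_assessment": [10, 11, 12, 10],
    "score": [80.0, 60.0, 90.0, 70.0],
})

# Keep only teacher marked assessments, then attach module and weight to each score.
tma = toy_assessments[toy_assessments["assessment_type"] == "TMA"]
scored = toy_student_assessment.merge(tma, on="id_assessment", how="inner")
scored["weighted"] = scored["weight"] * scored["score"] / 100

# Sum the weighted scores per student and module.
totals = (
    scored.groupby(["id_student", "code_module"])["weighted"]
    .sum()
    .rename("avg_TMA_assessment")
    .reset_index()
)

# Students without any TMA score get 0, as in the loop.
result = toy_info.merge(totals, on=["id_student", "code_module"], how="left")
result["avg_TMA_assessment"] = result["avg_TMA_assessment"].fillna(0.0)

print(result)
```

In this toy example student 1 gets 50*80/100 + 50*60/100 = 70, student 2 gets 35, and student 3, who has no scores, gets 0.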

In [None]:
#add the new information about weighted TMA grades to our model

Xfactors_updated = studentInfo[["gender", "region", "highest_education", "imd_band", "age_band", "num_of_prev_attempts", "studied_credits", "disability", "avg_TMA_assessment"]]
X_noncat_updated = pd.get_dummies(Xfactors_updated)

X_noncat_updated.head(5)

In [None]:
#fit the model again
X_train, X_test, y_train, y_test = train_test_split(X_noncat_updated, Youtcome, test_size=0.3, random_state=0)
OurModelUpdated = LogisticRegression()
OurModelUpdated.fit(X_train, y_train)

Now we will use the updated model (OurModelUpdated) to assess student performance using the test dataset.

In [None]:
#predict on a testset
y_pred = OurModelUpdated.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(OurModelUpdated.score(X_test, y_test)))

Well, what about that! The model can now assess students' performance with 83% accuracy!
Looks like the grades of Teacher Marked Assignments really helped us to improve the performance of our model :)
I wonder what else could help.... ;)

# Next time.....

How to evaluate the models and choose "THE BEST"?