# Predicting Student Success in Virtual Learning Environments

## Introduction:
   The purpose of our research project is to investigate the relationship between students’ use of Virtual Learning Environments (VLE) and their learning performance. The research questions we set out to answer are as follows:
* Do student demographics affect the likelihood of a student passing an online course?
* How does student interaction with the online courses (e.g. how often they log in and which resources they access) affect their grades?
* What are the factors that indicate success (pass/fail) among students using a VLE?

Our hypothesis is that a student’s online learning behavior (interaction with the VLE) and demographics are both related to their performance in the VLE.

   
   To test this hypothesis, we used the [Open University Learning Analytics dataset](https://analyse.kmi.open.ac.uk/open_dataset). The dataset features demographic information about 28,000 students who, in 2013 and 2014, enrolled in any of seven particular distance learning courses at the UK’s Open University; their final results (distinction, pass, fail, or withdrawn); 173,000 graded assignments; and 10 million rows describing each student’s interactions with the courses, measured by total number of clicks.

### Predictor variables (student demographics and VLE activity)


| Feature                                   | Values                                                                                                                                                                                                        |
|------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Gender                                   | Male/Female                                                                                                                                                                                                       |
| Region                                   | East Anglian Region,  Scotland,  Yorkshire Region,  South East Region,  East Midlands Region,  Wales,  North Region,  South Region,  South West Region,  London Region,  Ireland,  West Midlands Region,  North Western Region |
| Highest education                        | A Level or Equivalent,  HE Qualification,  Lower Than A Level,  Post Graduate Qualification,  No Formal quals |
| Index of Multiple Deprivation (IMD) Band | 0-10%,  10-20%,  20-30%,  30-40%,  40-50%,  50-60%,  60-70%,  70-80%,  80-90%,  90-100% |
| Age Band                                 | 0-35,  35-55,  >= 55 |
| Number of previous attempts              | Number of times the student has attempted this module (eg: 0 or 1)                                                                                                                                                 |
| Studied credits                          | Total number of credits for the modules the student is currently studying (eg: 30, 60, 90)                                                                                                                         |
| Disability                               | Yes/No                                                                                                                                                                                                             |
| Date registration                        | Number of days measured relative to the start of the module-presentation |
| forumng                                  | Number of clicks |
| homepage                                 | Number of clicks |
| oucollaborate                            | Number of clicks |
| oucontent                                | Number of clicks |
| page                                     | Number of clicks |
| quiz                                     | Number of clicks |
| resource                                 | Number of clicks |
| subpage                                  | Number of clicks |
| url                                      | Number of clicks |

### Outcome Variables to Predict

| Feature |  Values                                    |
|------------------|--------------------------------------|
| Final result     | `Pass`,   `Fail`,   `Distinction`,   `Withdrawn` |


Many students enroll in prestigious universities in pursuit of higher education, and with the adoption of [e-learning in higher education](http://itdl.org/Journal/Jan_15/Jan15.pdf#page=33), our team was interested in how well students perform in these virtual environments and which factors might influence their learning. Because there is still debate as to whether VLEs guarantee any significant [pedagogical effects](https://telearn.archives-ouvertes.fr/hal-00190701/document), our team took a statistics-driven approach to measuring student success. That said, we did not set out to determine whether VLEs are more effective than traditional universities: our dataset contains no data on traditional counterparts of these courses, so it cannot support such a comparison. This research is still an important undertaking, as it allows us to examine the influence and effectiveness of VLEs as an educational, web-based platform for universities and students.

***

# Exploratory data analysis:

### Trimming the data

Before we could begin measuring success, we first needed to decide which files and features from our dataset to use. Because the dataset is so large, we had to filter out unnecessary data to reach a size that our scripts could process at a reasonable pace. We started by selecting CCC (`code_module`) as the module to analyze and October 2014 (`code_presentation`) as our target time period. Doing this both trimmed the data down to a more manageable size and helped mitigate inconsistencies between course features. Because we were now working with a smaller dataset and only 4 outcomes, we also had to check that the data was not [imbalanced](https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a) (explored further in the _mitigating imbalance_ section).
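
The filtering step described above can be sketched as follows. This is a minimal illustration, not our actual script: it uses only the Python standard library, the sample rows are made up, and it assumes the October 2014 presentation is coded as `2014J` in `code_presentation`.

```python
import csv
import io

# Made-up stand-in for a few rows of the student_info file.
# The real file has many more columns and rows.
SAMPLE = """code_module,code_presentation,id_student,final_result
AAA,2013J,1,Pass
CCC,2014B,2,Fail
CCC,2014J,3,Distinction
CCC,2014J,4,Withdrawn
"""


def filter_presentation(rows, module="CCC", presentation="2014J"):
    """Keep only the rows for one module and one presentation."""
    return [
        row for row in rows
        if row["code_module"] == module
        and row["code_presentation"] == presentation
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
subset = filter_presentation(rows)
print([row["id_student"] for row in subset])
```

Applying the same filter to every file before merging keeps the later steps small enough to run quickly.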

As mentioned previously, our measure of student success is based on a student’s final result (whether they passed, failed, or received distinction). We removed the columns related to scores to reduce bias, since every final result except withdrawn depends on the scores. Students who withdraw from a course have “withdrawn” recorded as their final result. A student who withdraws cannot receive a final grade, so they cannot be categorized as someone who passed, failed, or received distinction. We therefore dropped all students who withdrew from the course from our analysis.

While looking through the dataset, we found many NaN values in the date unregistration column. Students without a date unregistration usually had a final result of pass, fail, or distinction, so we concluded that those with NaN values are the ones who completed the course. At first, we filled the NaN date unregistration values with 269 (the total number of days for the course), but since we ignored all students who withdrew from the course, we later excluded the date unregistration column from our final model.
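
A rough sketch of this cleaning step is shown below. The records are invented, and the column names `score` and `date_unregistration` are assumptions about how the merged table is laid out, not a copy of our script.

```python
def drop_withdrawn(rows, dropped_columns=("score", "date_unregistration")):
    """Remove withdrawn students, then remove columns we do not model on."""
    cleaned = []
    for row in rows:
        if row["final_result"] == "Withdrawn":
            continue  # withdrawn students never receive a final grade
        cleaned.append(
            {key: value for key, value in row.items() if key not in dropped_columns}
        )
    return cleaned


# Invented example records (None stands in for NaN).
students = [
    {"id_student": 1, "final_result": "Pass", "score": 71, "date_unregistration": None},
    {"id_student": 2, "final_result": "Withdrawn", "score": None, "date_unregistration": 40},
    {"id_student": 3, "final_result": "Fail", "score": 32, "date_unregistration": None},
]

cleaned = drop_withdrawn(students)
print(cleaned)
```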

## Exploratory Visualizations:

### Age vs Final Result

The distributions of the four outcomes across the three age groups were mostly the same. The only notable difference was that the 55+ age group had the highest percentage of students who passed with distinction and the lowest percentage who withdrew. While this could be a potential indicator for predicting distinction, the 55+ group contains only 31 students, so the difference may not be reliable.
![alt text](images/age.png "Title")

### Distribution of Exam Score vs Final Result

There is an obvious relationship between exam score and the final result of the student. Generally, students who scored above 80 earned a distinction, students who failed scored 50 or less, and students who withdrew did not take the final exam. In order to better answer the question of which demographics and factors indicate success, we decided in the end to omit the exam scores from our data.
![alt text](images/exambox.png "Title")

### Highest Education vs Final Result

![alt text](images/education.png "Title")

### Sum of Clicks in Activities vs Final Result


![alt text](images/clicks.png "Title")

### Final Results Outcome Distribution

The numbers of students in the dataset who passed or withdrew are much higher than the numbers who earned a distinction or failed. While we would expect fewer students to fail than to pass or earn a distinction, leaving the data this imbalanced could lead to issues.
![alt text](images/finalresults.png "Title")

### Creating and combining features

* We used the data from the `student_info` file as the base to build our student entities. This file included features like `gender`, `region`, and `highest education`. We opted to include these features because we believed that they would help with our [predictions for student success](https://pdfs.semanticscholar.org/e48e/ba98bde33586c20442d46ab9a59c411196e5.pdf).

* There were multiple assessments that a student could take, each with a different weight. We decided that multiplying each assessment score by its weight to create weighted scores would give more insight into the overall quality of a student’s work and speak more to their level of engagement. But upon further investigation of our topic, we realized that the exam scores were effectively a proxy for the `final_result` outcome that we were trying to predict. So in the end, we dropped this feature.

* Beyond the basic student info, we also decided to merge data from the `student_vle` file to add the number of clicks a student had in each module activity. We believed that this would help us further represent a student’s engagement.
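
The click features from the last bullet can be built roughly as follows. This is an illustrative sketch rather than our actual code: the rows are invented, and it assumes each click record has already been joined with its activity type and carries the columns `id_student`, `activity_type`, and `sum_click`.

```python
from collections import defaultdict


def clicks_per_activity(click_rows):
    """Sum clicks per student and activity type: one feature per activity."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in click_rows:
        totals[row["id_student"]][row["activity_type"]] += row["sum_click"]
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {student: dict(by_type) for student, by_type in totals.items()}


# Invented click records.
click_rows = [
    {"id_student": 1, "activity_type": "quiz", "sum_click": 4},
    {"id_student": 1, "activity_type": "quiz", "sum_click": 2},
    {"id_student": 1, "activity_type": "forumng", "sum_click": 7},
    {"id_student": 2, "activity_type": "homepage", "sum_click": 3},
]

features = clicks_per_activity(click_rows)
print(features)
```

Each inner dictionary then becomes a set of columns (`quiz`, `forumng`, `homepage`, and so on) on the student's row, with missing activity types treated as zero clicks.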

### Mitigating imbalance

When running the initial models and examining their prediction performance, we realized that some classes (Withdrawn and Pass) heavily outweighed the others (Distinction and Fail). To improve prediction performance on these imbalanced classes, we incorporated a weighting method into our model, distributing weight equally across the four outcomes.
![alt text](imbalance.png "Title")
_The imbalance heavily affected our true positive rate for the Fail outcome._
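
One common way to give every outcome equal total weight is inverse-frequency weighting, sketched below. The formula `n_samples / (n_classes * class_count)` is an assumption about what equal weighting means here, and the label counts are invented; it is not necessarily the exact scheme our models used.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    With these weights, every class contributes the same total weight
    (count * weight) no matter how many samples it has.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Invented, deliberately imbalanced labels.
labels = ["Pass"] * 6 + ["Withdrawn"] * 4 + ["Fail"] * 1 + ["Distinction"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes such as Fail and Distinction receive the largest weights, so misclassifying them costs the model more during training.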

***

# Building a model

## 1st iteration
We began exploring potential models for our data by comparing several of them side by side in a matrix. The **Random forest** and the **Gradient boosting** models gave us the best results, but we still weren't satisfied with their overall accuracy. [*Cohen's Kappa*](https://en.wikipedia.org/wiki/Cohen's_kappa) is essentially a measure of how well the classifier performed compared to how well it would have performed simply by chance. A Kappa of 0 means the classifier does no better than chance, and the higher the Kappa, the further its agreement with the true labels exceeds chance. Our Kappa scores fall within the _Fair agreement_ (0.20 to 0.40) range, which is sufficient for our purposes.

![alt text](images/allmodels.png "Title")
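
To make the Kappa statistic concrete, here is a small standard-library sketch that computes it from actual and predicted labels. The labels are invented for illustration and are not taken from our models.

```python
from collections import Counter


def cohens_kappa(actual, predicted):
    """Cohen's kappa: (observed - expected) / (1 - expected).

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(actual)
    # Observed agreement: fraction of predictions that match the truth.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Expected agreement if predictions were independent of the truth,
    # given how often each label occurs on each side.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels: 6 of 8 predictions are correct.
actual = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Fail"]
predicted = ["Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Fail", "Pass"]

kappa = cohens_kappa(actual, predicted)
print(round(kappa, 3))
```

In this toy case the accuracy is 0.75 but chance alone would give 0.5, so the Kappa is lower than the raw accuracy suggests.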


## 2nd iteration
After doing some more research into our data, we realized that keeping students who withdrew from the program would throw off the models. We wanted to treat withdrawn students separately from students who failed because there could be various reasons for withdrawal, and just because a student withdraws doesn't mean that they would have failed. So we omitted them from our data and ran the matrix again. While this reduced the amount of data available to us, this time we got much better results.
![alt text](images/nowithdrawmodels.png "Title")

## 3rd iteration
We knew we wanted to implement Multiclass Classification and Multinomial Logistic Regression. We narrowed our models down to ***Gradient Boosting*** (GBM) and ***Random Forest*** (RFM). Normally, one would also normalize the data to a common scale, but we found that our models were actually less accurate after normalization, so we opted not to include it.
![alt text](images/normmodels.png "Title")
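
For reference, normalizing a feature to a common scale can look like the min-max sketch below. This is only one possible scheme, chosen here as an assumption for illustration; the click counts are invented.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Invented quiz click counts for four students.
quiz_clicks = [0, 5, 10, 20]
scaled = min_max_scale(quiz_clicks)
print(scaled)
```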

## Final iteration
After further deliberation, we decided to treat the three remaining outcomes (_pass, fail, distinction_) as **two** separate binary classification problems. Because everyone who earns distinction also passes, we merged those two outcomes together when analyzing a student's _pass vs. fail_. We still wanted to analyze distinction, so we also trained the model on _distinction vs. non-distinction_, where non-distinction included both fail and pass.
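
The two relabelings can be expressed as simple lookup tables, as in this sketch. The dictionary and function names are hypothetical and used only for illustration.

```python
# Target 1: distinction counts as a pass.
PASS_VS_FAIL = {"Pass": "Pass", "Distinction": "Pass", "Fail": "Fail"}

# Target 2: pass and fail are both non-distinction.
DISTINCTION_VS_REST = {
    "Pass": "Non-distinction",
    "Fail": "Non-distinction",
    "Distinction": "Distinction",
}


def binary_targets(final_results):
    """Map each final result to both binary targets.

    Raises KeyError for unexpected values such as "Withdrawn",
    which should already have been removed.
    """
    pass_fail = [PASS_VS_FAIL[result] for result in final_results]
    distinction = [DISTINCTION_VS_REST[result] for result in final_results]
    return pass_fail, distinction


pass_fail, distinction = binary_targets(["Pass", "Fail", "Distinction"])
print(pass_fail)
print(distinction)
```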

### Pass vs. Fail
Starting with Pass and Fail, we ran our two best models and found that ***gradient boosting*** gave us the best results.
![alt text](images/binarymodels.png "Title")

***

We computed a confusion matrix for the pass/fail outcomes, with fail as the positive class. Overall, the **GBM** model had an accuracy of 0.8291, meaning that it is **correct 83% of the time**. The sensitivity (true positive rate) was 0.4932, meaning that it **correctly identifies 49% of the students who failed**. The specificity (true negative rate) was 0.9505, meaning that it **correctly identifies 95% of the students who passed**.
![alt text](images/passfailmatrix.png "Title")
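
The three metrics above all come from the four cells of a binary confusion matrix. The sketch below shows the arithmetic with invented counts; these are not the counts behind our reported figures.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp/fn: positive-class students predicted correctly / incorrectly.
    tn/fp: negative-class students predicted correctly / incorrectly.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Invented counts, with fail as the positive class:
# 20 students failed (10 caught), 80 passed (75 recognized).
metrics = binary_metrics(tp=10, fn=10, fp=5, tn=75)
print(metrics)
```

As in our results, a model can have high accuracy and specificity while still missing half of the positive class, because the negative class dominates the data.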

***

Quiz, Subpage, Homepage, Page, Resource, and Forumng were the 6 most important features in the GBM model. Surprisingly, the number of clicks a student made on these activity types carried more weight in the model than demographic features like age, gender, and even highest education.
![alt text](images/passfailimp.png "Title")

***

The ROC curve for pass vs fail gave an AUC value of 0.722. This means that, given one randomly chosen student who failed and one who passed, our model ranks the failing student as more likely to fail about 72% of the time.
![alt text](images/passfailroc.png "Title")
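
The AUC can be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below computes it that way by brute force; the scores and labels are invented, and a real analysis would normally take the AUC from the ROC tooling of the modelling library.

```python
def auc_by_pairs(scores, labels, positive="Fail"):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == positive]
    negatives = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented predicted probabilities of failing, with the true outcomes.
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
labels = ["Fail", "Fail", "Pass", "Pass", "Pass"]

auc = auc_by_pairs(scores, labels)
print(round(auc, 3))
```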

***

### Distinction vs. Non-Distinction
Moving on to Distinction vs Non-distinction, we ran our two best models and found that ***random forest*** gave us the best results.
![alt text](images/binarymodelsdistinct.png "Title")

***

We computed a confusion matrix for the distinction and non-distinction outcomes, with distinction as the positive class. Overall, the **RFM** model had an accuracy of 0.7862, meaning that it is **correct 79% of the time**. The sensitivity (true positive rate) was 0.5738, meaning that it **correctly identifies 57% of the students who earned a distinction**. The specificity (true negative rate) was 0.8465, meaning that it **correctly identifies 85% of the students who did not earn a distinction**.
![alt text](images/distmatrix.png "Title")

***

Looking at the top 6 features, we see Page, Homepage, Oucontent, Quiz, Forumng, and Resource as the most important features in the RFM model; all but Oucontent also appeared in the GBM top 6. Unlike GBM, the Random Forest model takes more features into consideration when building its model.
![alt text](images/distimp.png "Title")

***

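The ROC curve for distinction vs non-distinction gave an AUC value of 0.71. This means that, given one randomly chosen student who earned a distinction and one who did not, our model ranks the distinction student as more likely to earn a distinction about 71% of the time.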
![alt text](images/distroc.png "Title")

# Conclusion:

- There is a weak relationship between students’ demographics and their final result in the VLE. A student’s final result cannot be well explained by demographics alone.
    - We were still able to observe some trends in our exploratory data analysis (e.g. across levels of highest education).
- There is a moderate relationship between students’ interaction with the VLE and their final result. There may be other significant variables that are not captured in this dataset (e.g. time spent on each activity), so the number of clicks per activity type alone is not sufficient to predict a student’s performance.
- Limitations:
    - Sample size: our analysis used only a small subset of the full dataset.
    - Case specific: given our time constraints, we only looked at CCC 2014J, so our results are not generalisable. Future work could compare different courses across different periods of time.

![alt text](images/didourbest.png "Title")