# Determining Student Knowledge Status

### Introduction

Understanding how students' study methods and examination results relate to knowledge retention is essential for designing better teaching and learning methodologies. At a university in Turkey, Ph.D. students collected data on undergraduate students' understanding of Electrical DC Machines. They used five standardized characteristics to determine each student's knowledge level, from very low to high: the amount of study time, the number of repetitions, and the exam performance for the goal object materials, as well as the amount of study time and the exam performance for objects related to the goal object.

Through this project, we propose to answer the following question: 

> *Given the degree of preparation and examination results of a student, what will be the knowledge retention level of said student?* 

We aim to achieve this by training a model that, given the five aforementioned characteristics, will classify the knowledge level of a student. 

### Preliminary Exploratory Data Analysis

In [1]:
## Run this cell before continuing
library(tidyverse)
library(readxl)
library(repr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



The dataset we will be using is the **User Knowledge Modeling Data Set** provided by the *UCI Machine Learning Repository*, linked [here](https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling).

This dataset has the following variables:
- `STG`: The degree of study time for goal object materials
- `SCG`: The degree of repetition number of user for goal object materials
- `STR`: The degree of study time of user for related objects with goal object
- `LPR`: The exam performance of user for related objects with goal object
- `PEG`: The exam performance of user for goal objects
- `UNS`: The knowledge level of user

Utilizing the first five variables, we aim to predict the sixth variable, `UNS`, which is a student's knowledge level. The knowledge level variable has one of four possible labels: `High`, `Middle`, `Low` and `Very Low`.

First, let's read in the training data:

In [32]:
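url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls"
# Create the target directory if needed; download.file() does not create it
dir.create("data", showWarnings = FALSE)
# mode = "wb" keeps the binary .xls file intact on Windows
download.file(url, destfile = "data/Data_User_Modeling_Dataset.xls", mode = "wb")
# Read the second sheet and keep only the six variables, dropping the unnamed extra columns
knowledge <- read_excel("data/Data_User_Modeling_Dataset.xls", sheet = 2) %>%
    select(STG:UNS)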

head(knowledge, n = 5)

New names:
* `` -> ...7
* `` -> ...8



STG,SCG,STR,LPR,PEG,UNS
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0,0.0,0.0,0.0,0.0,very_low
0.08,0.08,0.1,0.24,0.9,High
0.06,0.06,0.05,0.25,0.33,Low
0.1,0.1,0.15,0.65,0.3,Middle
0.08,0.08,0.08,0.98,0.24,Low


Note that the first five variables appear to be normalized to the range [0, 1]. Because they already share a common scale, no further rescaling should be needed before training our classification model, which makes things easier for us!
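One way to check this observation is to look at the minimum and maximum of each numeric column. The following is a minimal sketch only: `sample_rows` is a hypothetical stand-in holding the five rows printed above, not the full dataset, so it illustrates the check rather than proving the claim for all rows.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in: the five rows shown in the preview above
sample_rows <- tribble(
  ~STG, ~SCG, ~STR, ~LPR, ~PEG, ~UNS,
  0.00, 0.00, 0.00, 0.00, 0.00, "very_low",
  0.08, 0.08, 0.10, 0.24, 0.90, "High",
  0.06, 0.06, 0.05, 0.25, 0.33, "Low",
  0.10, 0.10, 0.15, 0.65, 0.30, "Middle",
  0.08, 0.08, 0.08, 0.98, 0.24, "Low"
)

# Minimum and maximum of every numeric column (columns named e.g. STG_min, STG_max)
feature_ranges <- sample_rows %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))

# TRUE only if every numeric value lies in [0, 1]
values <- unlist(select(sample_rows, where(is.numeric)))
all_in_unit_interval <- all(values >= 0 & values <= 1)
```

Running the same two steps on the full `knowledge` tibble instead of `sample_rows` would confirm whether the range holds for every row.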

Additionally, note that of all the column names, only `STG` and `STR` clearly communicate what they represent. Let's rename the columns as follows so that each name reflects its meaning:
- `STG`: Study Time degree for Goal object materials
- `RNG`: Repetition Number degree for Goal object materials
- `STR`: Study Time degree for Related objects with goal object materials
- `EPR`: Exam Performance for Related objects with goal objects
- `EPG`: Exam Performance for Goal object
- `SKL`: Student Knowledge Level

In [33]:
knowledge <- knowledge %>%
    rename(RNG = SCG,
           EPR = LPR, 
           EPG = PEG,
           SKL = UNS)
head(knowledge, n = 5)

STG,RNG,STR,EPR,EPG,SKL
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0,0.0,0.0,0.0,0.0,very_low
0.08,0.08,0.1,0.24,0.9,High
0.06,0.06,0.05,0.25,0.33,Low
0.1,0.1,0.15,0.65,0.3,Middle
0.08,0.08,0.08,0.98,0.24,Low


That's better. Let's continue tidying up the data!

A few changes will make this data easier to work with:
- Firstly, transform all knowledge level (SKL) labels to lower case
- Then transform the knowledge level (SKL) column to be a factor column instead of the current character column.

In [34]:
knowledge <- knowledge %>%
    mutate(SKL = as_factor(tolower(SKL)))
head(knowledge, n = 10)

STG,RNG,STR,EPR,EPG,SKL
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
0.0,0.0,0.0,0.0,0.0,very_low
0.08,0.08,0.1,0.24,0.9,high
0.06,0.06,0.05,0.25,0.33,low
0.1,0.1,0.15,0.65,0.3,middle
0.08,0.08,0.08,0.98,0.24,low
0.09,0.15,0.4,0.1,0.66,middle
0.1,0.1,0.43,0.29,0.56,middle
0.15,0.02,0.34,0.4,0.01,very_low
0.2,0.14,0.35,0.72,0.25,low
0.0,0.0,0.5,0.2,0.85,high


There, that's more like it.
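One detail worth keeping in mind: `as_factor()` from forcats orders the levels by first appearance in the data, so here the levels come out as `very_low`, `high`, `low`, `middle`. If a lowest-to-highest ordering is preferred (for example, for plot legends), the levels can be rearranged explicitly. A minimal sketch, assuming a small hypothetical vector `skl` in place of the real `SKL` column:

```r
library(forcats)

# Hypothetical stand-in for the SKL column, in the order the labels first appear above
skl <- as_factor(c("very_low", "high", "low", "middle", "low"))
levels(skl)  # levels follow order of first appearance

# Rearrange the levels from lowest to highest knowledge level
skl_ordered <- fct_relevel(skl, "very_low", "low", "middle", "high")
levels(skl_ordered)
```

This step is optional; it changes only the order of the levels, not the values stored in the column.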

Now, the wide format in which this data is currently represented is not exactly tidy. For one thing, how does one know what the numbers represent? Without contextual knowledge, this format makes it impossible to know. To solve this problem, we reshape the data set to a tidy data format by adding two columns:
1. a 'Preparation Degree' column, which averages the study times and repetition numbers for both goal and related object materials
2. a 'Examination Result' column, which averages the normalized exam results for both goal and related object materials
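The two columns described above could be computed along these lines. This is a sketch under stated assumptions: the column names `preparation_degree` and `examination_result` are hypothetical, the averages are taken as unweighted means, and `sample_rows` is a small stand-in for the cleaned `knowledge` tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in: three rows of the renamed, cleaned data shown earlier
sample_rows <- tribble(
  ~STG, ~RNG, ~STR, ~EPR, ~EPG, ~SKL,
  0.08, 0.08, 0.10, 0.24, 0.90, "high",
  0.06, 0.06, 0.05, 0.25, 0.33, "low",
  0.10, 0.10, 0.15, 0.65, 0.30, "middle"
)

with_summaries <- sample_rows %>%
  mutate(
    # Assumed: unweighted mean of the study-time and repetition measures
    preparation_degree = (STG + RNG + STR) / 3,
    # Assumed: unweighted mean of the two normalized exam results
    examination_result = (EPR + EPG) / 2
  )
```

Because all five inputs appear to lie in [0, 1], both averages stay in that range as well.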