# Project Proposal: Predicting User Knowledge from Study Habits and Exam Performance


**Group 23** <br>
Allison Fellhauer (38395166) <br>
Agastya Kaul (78851433) <br>
Grace Li (STUDENT_NUMBER) <br>
Xiangyuan Li (STUDENT_NUMBER) <br>

## Introduction: Data set and background information

### Background Information

#### User Modeling and User Knowledge

User modeling can be used to personalize a user's experience by tracking user interactions with a web page as a way to optimize their future interactions ([Kahraman et al. 2013](https://doi.org/10.1016/j.knosys.2012.08.009)). Some examples of user interactions that are assessed include pages the user has visited, the time spent on pages, and keystrokes ([Kahraman et al. 2013](https://doi.org/10.1016/j.knosys.2012.08.009)). User knowledge models can be used to evaluate and dynamically shape the learning experience of an individual, such as for an online learning environment ([Kahraman et al. 2013](https://doi.org/10.1016/j.knosys.2012.08.009)). 

### About the data set

**Our data set:** [User Knowledge Modeling](https://doi.org/10.24432/C5231X)

We have chosen to explore the User Knowledge Modeling data set ([Kahraman et al. 2013](https://doi.org/10.1016/j.knosys.2012.08.009)), which classifies users' knowledge of a topic (electrical DC machines).

There are 6 total variables in the data set: 5 features and 1 target.

Features: <br>
A) Goal topics (learning objects):
- The degree of time spent studying the material [STG]
- The degree of repetition of the material [SCG]
- The performance in exams [PEG]

B) Prerequisite topics
- The degree of study time corresponding to the prerequisite objects [STR]
- The knowledge level of the prerequisite objects [LPR]

Target:
- user knowledge [UNS].

UNS has four levels:
- very low (beginner)
- low (intermediate)
- middle (expert) 
- high (advanced)

### Our Question

**Can we predict the knowledge level of a user given their study habits and their performance on the exam?**

We define study habits as the time spent studying and the degree of repetition, both of which contribute to learning. Exam performance is a further measure of learning.

## Preliminary exploratory data analysis

### Loading necessary libraries and reading in the data

In [1]:
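#load all necessary libraries
library(tidyverse)
library(repr)
library(tidymodels)
#install kknn and RColorBrewer only if they are not already available
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn") #needed later for KNN classification
library(kknn)
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")
library(RColorBrewer)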

#read data from the web (GitHub raw file)
dc_machines <- read_csv("https://raw.githubusercontent.com/afellhauer/DSCI_Group_Project/main/data/Data_User_Modeling_Dataset_Hamdi.csv")
head(dc_machines)


### Make data usable and readable

Looking at the values, the format is tidy. To make the data usable, UNS (user_knowledge) needs to be converted from a character to a factor. To increase readability, we renamed all variables. We [checked for any missing data](https://www.tutorialspoint.com/dealing-with-missing-data-in-r) and [printed the result](https://www.geeksforgeeks.org/printing-output-of-an-r-program/). There are no missing values.

In [None]:
dc_machines_mutate <- dc_machines |>
    mutate(UNS = as_factor(UNS)) |> #change the class from a chr to a factor
    rename("study_time_goal" = STG, #relabel all variables to understand them better
           "repetition" = SCG,
           "study_time_related" = STR,
           "performance_related" = LPR, 
           "performance_goal" = PEG,
           "user_knowledge" = UNS)
missing <- sum(is.na(dc_machines_mutate)) # check for missing values
print(paste("Number of missing values:", missing)) # paste() already inserts a space

In [None]:
head(dc_machines_mutate) # view data

### Split the data into training and testing sets

To ensure reproducibility, we set the seed. We assigned 75% of the observations to the training set and the remaining 25% to the testing set. We also stratified the split by user_knowledge, the class we are trying to predict, so that both sets contain similar proportions of each knowledge level.

In [None]:
set.seed(200) #set seed to be reproducible
#create the initial split of the data
#stratify based on user_knowledge
dc_machines_split <- initial_split(dc_machines_mutate, prop = 0.75, strata = user_knowledge)

#collect the training and testing portions
dc_machines_training <- training(dc_machines_split)
dc_machines_testing <- testing(dc_machines_split)

glimpse(dc_machines_training)

### Summarizing the data

We examined how balanced the data set is by creating a table that counts the observations in each level of the user_knowledge variable.

The very_low (beginner) knowledge level is underrepresented relative to the other levels.

In [None]:
summary_counts <- dc_machines_training |>
    group_by(user_knowledge) |> #group based on the class
    summarize(count = n()) #gets the count (number of observations of each)
summary_counts

**Table 1**: Number of observations classified as each user knowledge level (very_low, low, middle, high)
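
The counts above can also be expressed as percentages, which makes the imbalance easier to compare across classes. The following is a minimal sketch, assuming a data frame with a `user_knowledge` column; the helper name `class_percentages` is hypothetical and not part of our analysis so far.

```r
library(dplyr)

# Hypothetical helper: count the observations in each user_knowledge level
# and add each level's share of the total as a percentage.
class_percentages <- function(data) {
  data |>
    count(user_knowledge, name = "count") |>
    mutate(percent = 100 * count / sum(count))
}
```

In the notebook this could be called as `class_percentages(dc_machines_training)`.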

We then summarized the means of our predictors for each class.

In [None]:
summary <- dc_machines_training |>
    select(user_knowledge, study_time_goal, repetition, performance_goal) |> 
    #select only the class and the predictors
    group_by(user_knowledge) |>
    summarize("mean_study_time" = mean(study_time_goal), "mean_repetition" = mean(repetition),
              "mean_exam_score" = mean(performance_goal)) 
    #get the mean for each predictor for each group
summary

**Table 2**: Mean study time, mean repetition, and mean exam score for each user knowledge level (very_low, low, middle, high)

### Visualize the data

We visualized the distribution of the different groups according to their counts. Again, this highlights that the data is not completely balanced.

*To left align the caption, we used [this code](https://stackoverflow.com/questions/64701500/left-align-ggplot-caption)*.

In [None]:
dc_machines_plot_distribution <- dc_machines_training |>
    ggplot(aes(x = fct_recode(user_knowledge, "Very Low" = "very_low"), fill = user_knowledge)) + #change the label of very_low to Very Low
    geom_bar() + #use the default stat = "count"
    xlab("Category of user knowledge") +
    ylab("Count")

#make the plot look nicer
dc_machines_plot_distribution <- dc_machines_plot_distribution +
    theme(text = element_text(size = 15), legend.position = "none", #remove the legend
         plot.caption = element_text(hjust = 0)) + #set the text to left align
    ggtitle("Distribution of User Knowledge Groups") +
    labs(caption = "
Figure 1: Number of observations for each user knowledge group. Very low represents 
beginners, low represents intermediate, middle represents expert, and high 
represents advanced user knowledge") +
    scale_fill_brewer(palette = "Set2") #set the palette for the fill aesthetic used by the bars
    
dc_machines_plot_distribution

Then, we plotted the data according to study time and exam performance of each of the user knowledge groups. We start to see some distinct groups form.

*To left align the caption, we used [this code](https://stackoverflow.com/questions/64701500/left-align-ggplot-caption)*.

In [None]:
dc_machines_plot_study_vs_goal <- dc_machines_training |>
    ggplot(aes(x = study_time_goal, y = performance_goal, 
               color = fct_recode(user_knowledge, "Very Low" = "very_low"))) + #change the label of very_low to Very Low
    geom_point(alpha = 0.5) +
    xlab("Degree of study time on learning objects (goal)") +
    ylab("Performance in exams on learning objects (goal)") +
    labs(color = "User Knowledge", 
        caption = "
Figure 2: The performance and study time of users according to their learning group.
Very low represents beginners, low represents intermediate, middle represents expert, 
and high represents advanced user knowledge") +
    theme(text = element_text(size = 12), plot.caption = element_text(hjust = 0)) +
    ggtitle("Performance vs Study Time of User Knowledge Groups")
dc_machines_plot_study_vs_goal

## Methods

### Classification System

We will be using the k-nearest neighbors (KNN) algorithm in our project to determine if the user knowledge level of an individual can be predicted using their study habits and exam performance. 

We will conduct our data analysis using **KNN classification** since we are predicting a categorical variable. 

We will be using the following variables: 
- user_knowledge (class)
- study_time_goal (predictor)
- repetition (predictor)
- performance_goal (predictor)

### Steps
1.  Build the recipe using user_knowledge as the class, study_time_goal, repetition, and performance_goal as the predictors, and the training set as the data.  
2.  Choose the appropriate K value for the training set using 5-fold cross-validation, comparing the accuracy for each candidate value of K (tune the model).
3.  Create the KNN model using the selected K value.
4.  Train the classifier using the training set.
5.  Predict the labels for the unseen testing set.
6.  Evaluate accuracy and create a confusion matrix to assess precision and recall.
7.  Analyze performance using precision, recall, and accuracy.
8.  Discuss outcomes and provide suggestions for improving the model.
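
The steps above can be sketched with tidymodels as follows. This is a sketch under stated assumptions rather than our final implementation: the helper names (`make_knn_recipe`, `tune_knn`, `fit_and_evaluate_knn`) and the grid of candidate K values are hypothetical, the recipe uses the three predictors from the variable list above, and scaling and centering the predictors is an assumption that the steps do not state.

```r
library(tidymodels) # the "kknn" engine also requires the kknn package to be installed

# Step 1 (hypothetical helper): recipe with the class and the three predictors.
# Scaling and centering are an assumption; KNN is distance-based, so predictors
# on different scales would otherwise contribute unequally.
make_knn_recipe <- function(training) {
  recipe(user_knowledge ~ study_time_goal + repetition + performance_goal,
         data = training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
}

# Step 2 (hypothetical helper): 5-fold cross-validation over candidate K values.
# Returns one row per K with its mean cross-validated accuracy.
tune_knn <- function(training, k_values = seq(1, 15, by = 2), seed = 200) {
  set.seed(seed)
  tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
  folds <- vfold_cv(training, v = 5, strata = user_knowledge)

  workflow() |>
    add_recipe(make_knn_recipe(training)) |>
    add_model(tune_spec) |>
    tune_grid(resamples = folds, grid = tibble(neighbors = k_values)) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
}

# Steps 3 to 6 (hypothetical helper): fit with the chosen K, predict the
# testing set, and return the metrics and the confusion matrix.
fit_and_evaluate_knn <- function(training, testing, best_k) {
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

  knn_fit <- workflow() |>
    add_recipe(make_knn_recipe(training)) |>
    add_model(knn_spec) |>
    fit(data = training)

  predictions <- predict(knn_fit, testing) |>
    bind_cols(testing)

  list(
    metrics = metrics(predictions, truth = user_knowledge, estimate = .pred_class),
    confusion = conf_mat(predictions, truth = user_knowledge, estimate = .pred_class)
  )
}

# Possible usage in the notebook (left commented out because it relies on
# objects created in earlier cells):
# accuracies <- tune_knn(dc_machines_training)
# best_k <- accuracies |> slice_max(mean, n = 1, with_ties = FALSE) |> pull(neighbors)
# results <- fit_and_evaluate_knn(dc_machines_training, dc_machines_testing, best_k)
```

Keeping the recipe in one helper means the same preprocessing is used for both tuning and the final fit, and the testing set is only touched once, in the last step.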

### Visualization


We will visualize our data by:
1. Plotting neighbors vs. accuracy of the cross-validation
2. Displaying the confusion matrix
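
Alongside the confusion matrix, per-class precision and recall can be read off its rows and columns. The sketch below assumes a matrix whose rows are predicted classes and whose columns are true classes (the layout of the `table` element of a yardstick `conf_mat` object). The helper name `class_precision_recall` is hypothetical, and the labels are made up purely for illustration; they are not results from our data.

```r
# Hypothetical helper: per-class precision and recall from a confusion matrix
# with predicted classes in rows and true classes in columns.
class_precision_recall <- function(confusion) {
  true_positives <- as.numeric(diag(confusion))
  data.frame(
    class = rownames(confusion),
    precision = true_positives / as.numeric(rowSums(confusion)), # TP / all predicted as the class
    recall = true_positives / as.numeric(colSums(confusion))     # TP / all truly in the class
  )
}

# Made-up labels for illustration only (not from the data set)
knowledge_levels <- c("low", "middle", "high")
truth <- factor(c("low", "low", "high", "high", "high", "middle"),
                levels = knowledge_levels)
predicted <- factor(c("low", "high", "high", "high", "middle", "middle"),
                    levels = knowledge_levels)

confusion <- table(predicted = predicted, truth = truth)
precision_recall <- class_precision_recall(confusion)
precision_recall
```

For these made-up labels, the "low" class has a precision of 1 (the single "low" prediction is correct) but a recall of 0.5 (only one of the two true "low" observations is found), which shows why accuracy alone can hide per-class weaknesses.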

## Expected Outcomes and Significance

We expect to see that exam performance and study time are good predictors of user knowledge.

We believe that these findings could contribute to predicting user knowledge for web-based learning applications ([Kahraman et al. 2013](https://doi.org/10.1016/j.knosys.2012.08.009)).

These findings could lead to future questions, such as how tasks can be adjusted for students learning a topic to create a dynamic and adaptive learning experience. One example is an intelligent artificial tutor that tailors content based on user knowledge.

# References

[1](https://doi.org/10.1016/j.knosys.2012.08.009) Kahraman, H. T., Sagiroglu, S., & Colak, I. (2013). The development of intuitive knowledge classifier and the modeling of domain dependent data. *Knowledge-Based Systems, 37*, 283–295. https://doi.org/10.1016/j.knosys.2012.08.009

[2](https://doi.org/10.24432/C5231X) Kahraman, H. T., Colak, I., & Sagiroglu, S. (2013). User Knowledge Modeling. *UCI Machine Learning Repository*. https://doi.org/10.24432/C5231X

[3](https://datasciencebook.ca/) Timbers, T., Campbell, T., & Lee, M. (2023). Data Science: A First Introduction. CRC Press, Taylor & Francis Group. https://datasciencebook.ca/