## DSCI 100 Group Project 

By Tony Fu, Hao Jiang, and Aimee Garcia Castro

## Introduction: 

Many elements play a role in how well a student performs on a test. The objective of this group project is to predict the knowledge level of an individual based on such elements. The knowledge levels are `high`, `middle`, `low`, and `very_low`. We will be looking at the "User Knowledge" dataset obtained from the UCI Machine Learning Repository. The dataset records `STG` (the degree of study time for goal object materials), `SCG` (the degree of repetition number of user for goal object materials), `STR` (the degree of study time of user for related objects with goal object), `LPR` (the exam performance of user for related objects with goal object), and `PEG` (the exam performance of user for goal objects). It also records `UNS` (the knowledge level of the user), which is related to `PEG`. Some of these variables will be used as predictors for `UNS`.
 
The predictive question we will try to answer is: “Can we predict the `UNS` of an individual based on factors such as `STG`, `SCG`, and `STR`?”
 


## Methods: 

We begin by loading the libraries required to perform exploratory analysis. 

In [9]:
library(tidyverse)
library(ggplot2)
library(tidymodels)
library(repr)
library(GGally)
library(readxl)
options(repr.matrix.max.rows = 6)

The data is hosted on a website as an Excel spreadsheet, so we first need to bring it into R. We store the file's web address in an object called `url` and pass it to the `download.file` function, which saves the spreadsheet as a local file.

Once the file has been downloaded, we use the `read_excel` function to read the data from the spreadsheet. The `sheet` argument specifies which sheet of the Excel file contains the data we are going to use. We set `sheet = 2` because the second sheet is listed as **training data**, which will be crucial once we create our classifier.

The last 3 columns of the sheet are not part of the data, so we keep only the columns we want with the `select` function.

Furthermore, although not required, we chose to lowercase all the `UNS` labels, using the `mutate` function together with the `recode` function, to keep the labeling consistent.

In [19]:
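url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls"

# Download the spreadsheet into the existing `data/` directory.
# mode = "wb" writes the file in binary mode so the .xls is not corrupted on Windows.
download.file(url, destfile = "data/Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls", mode = "wb")

# Read the training sheet, keep only the data columns,
# and lowercase the capitalised `UNS` labels.
user_knowledge <- read_excel(path = "data/Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls", sheet = 2) %>%
    select(STG:UNS) %>%
    mutate(UNS = recode(UNS, High = 'high', Middle = 'middle', Low = 'low'))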

user_knowledge

New names:
* `` -> ...7
* `` -> ...8



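| STG | SCG | STR | LPR | PEG | UNS |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | very_low |
| 0.08 | 0.08 | 0.10 | 0.24 | 0.90 | high |
| 0.06 | 0.06 | 0.05 | 0.25 | 0.33 | low |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0.54 | 0.82 | 0.71 | 0.29 | 0.77 | high |
| 0.50 | 0.75 | 0.81 | 0.61 | 0.26 | middle |
| 0.66 | 0.90 | 0.76 | 0.87 | 0.74 | high |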


*Table 1*: Tidied "User Knowledge" dataset obtained from UCI Machine Learning Repository

## Searching for NAs:


We will now look for NA values in our data frame and handle them appropriately, since each NA represents a missing or incomplete value in our data.

In [20]:
sum(is.na(user_knowledge))

There are no NAs in our data frame, so no adjustments are needed.
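
A single total does not show where missing values sit, so a per-column count can be a useful complement. The following is a minimal sketch, not part of the original analysis: `toy` is a hypothetical stand-in for `user_knowledge` built from the first rows of Table 1, with one `NA` injected on purpose so the check has something to find.

```r
library(tidyverse)

# Hypothetical stand-in for `user_knowledge` (first rows of Table 1).
# One SCG value is replaced by NA purely for illustration.
toy <- tibble(
    STG = c(0.00, 0.08, 0.06),
    SCG = c(0.00, 0.08, NA),
    UNS = c("very_low", "high", "low")
)

# Count the missing values in each column separately.
na_per_column <- toy %>%
    summarise(across(everything(), ~ sum(is.na(.x))))

na_per_column
```

Running the same `summarise` call on `user_knowledge` would give one count per variable instead of a single sum.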

## Defining Variables:

We have already described what each variable's acronym means in the introduction. To reiterate, `STG` represents the study time for main topics, `SCG` represents the repetition in studying, `STR` represents the study time for related topics, `LPR` represents the exam performance on related topics, `PEG` represents the exam performance on main topics, and `UNS` represents the knowledge level. All variables from `STG` to `PEG` range from 0 to 1. `UNS` has the labels `very_low`, `low`, `middle`, and `high`, where `very_low` corresponds to a low `PEG` and `high` corresponds to a high `PEG`.
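
The stated 0 to 1 range can be checked directly from the data. The sketch below is an illustration under assumptions: `toy` is a hypothetical stand-in holding only the first three rows of Table 1, and `ranges` is a name introduced here.

```r
library(tidyverse)

# Hypothetical stand-in for `user_knowledge`: the first three rows of Table 1.
toy <- tibble(
    STG = c(0.00, 0.08, 0.06),
    SCG = c(0.00, 0.08, 0.06),
    STR = c(0.00, 0.10, 0.05),
    LPR = c(0.00, 0.24, 0.25),
    PEG = c(0.00, 0.90, 0.33),
    UNS = c("very_low", "high", "low")
)

# Reshape to one row per (variable, value), then take min and max per variable.
ranges <- toy %>%
    pivot_longer(STG:PEG, names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(min = min(value), max = max(value))

ranges
```

Applied to the full data frame, every `min` should be at least 0 and every `max` at most 1 if the description above holds.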

## Determining Predictors:

To begin, we have decided to remove `LPR`, since we expect a high `LPR` to result from a high `STR`. If we included both as predictors, we would essentially be using the same information twice, giving study time for related topics a larger influence than intended. We have also decided to remove `PEG`, since `UNS` is based on `PEG`.
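
One way to examine the assumed overlap between `STR` and `LPR` before dropping a column is to compute their correlation. This is only a sketch: `toy` is a hypothetical stand-in containing the six rows visible in Table 1, and six rows say nothing reliable about the full dataset, so no result is claimed here.

```r
library(tidyverse)

# Hypothetical stand-in: STR and LPR from the six rows shown in Table 1.
toy <- tibble(
    STR = c(0.00, 0.10, 0.05, 0.71, 0.81, 0.76),
    LPR = c(0.00, 0.24, 0.25, 0.29, 0.61, 0.87)
)

# Pearson correlation between the two variables.
str_lpr_cor <- toy %>%
    summarise(r = cor(STR, LPR)) %>%
    pull(r)

str_lpr_cor
```

On the full `user_knowledge` data frame, a value close to 1 would support treating the two variables as redundant, while a value near 0 would suggest they carry different information.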

In [25]:
user_knowledge <- user_knowledge %>%
    select(-LPR, -PEG)

user_knowledge

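| STG | SCG | STR | UNS |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 0.00 | 0.00 | 0.00 | very_low |
| 0.08 | 0.08 | 0.10 | high |
| 0.06 | 0.06 | 0.05 | low |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 0.54 | 0.82 | 0.71 | high |
| 0.50 | 0.75 | 0.81 | middle |
| 0.66 | 0.90 | 0.76 | high |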


*Table 2*: Potential predictors that may be used in the classification

Now, we are left with four variables: `STG`, `SCG`, `STR`, and `UNS`. Since the categorical class we are trying to predict is `UNS`, we must turn it into a factor. This is done by using the `mutate` and `as.factor` functions.

In [26]:
user_knowledge <- user_knowledge %>%
    mutate(UNS = as.factor(UNS))

user_knowledge

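| STG | SCG | STR | UNS |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 0.00 | 0.00 | 0.00 | very_low |
| 0.08 | 0.08 | 0.10 | high |
| 0.06 | 0.06 | 0.05 | low |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 0.54 | 0.82 | 0.71 | high |
| 0.50 | 0.75 | 0.81 | middle |
| 0.66 | 0.90 | 0.76 | high |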


*Table 3*: Potential predictors with `UNS` converted to a factor
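
After the conversion, it can help to inspect the factor levels and how many observations fall into each class. The sketch below is illustrative only: `toy` is a hypothetical stand-in built from the six rows visible in Table 3, and reordering the levels with `fct_relevel` is an optional step that the analysis above does not perform.

```r
library(tidyverse)

# Hypothetical stand-in: STG and UNS from the six rows shown in Table 3.
toy <- tibble(
    STG = c(0.00, 0.08, 0.06, 0.54, 0.50, 0.66),
    UNS = as.factor(c("very_low", "high", "low", "high", "middle", "high"))
)

# as.factor() sorts the levels alphabetically, not by knowledge level.
levels(toy$UNS)

# Number of observations in each class.
class_counts <- toy %>%
    count(UNS)

class_counts

# Optional: put the levels in their natural order, from lowest to highest.
toy_ordered <- toy %>%
    mutate(UNS = fct_relevel(UNS, "very_low", "low", "middle", "high"))

levels(toy_ordered$UNS)
```

Class counts on the real data show whether some knowledge levels are much rarer than others, which matters when judging a classifier's accuracy.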