In [28]:
library(tidyverse)
library(repr)
library(tidymodels)

Proposal for Data Science 100 Group Project


Accumulating knowledge is an important part of education. This raises two questions: what is the best way to gain knowledge, and what affects a person's knowledge level in a particular subject?

Our question is:
How can we classify the knowledge level of a student from their study time, degree of repetition, and exam performance? Is there a more accurate way to classify the knowledge level using other parameters?

For our project, we will be using the User Knowledge Modeling Data Set, retrieved from https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling#. 

This data set contains data about the knowledge level of students on the subject of electrical DC machines, along with each student's study time, degree of repetition on the subject, study time on related subjects, exam performance for related subjects, and exam performance for this subject.

The data can be downloaded by going to the website and clicking "Data Folder". The data was downloaded, converted into a csv file, and uploaded into the data folder for this project.

## Preliminary Exploratory Data Analysis:

### Reading in file from the web into R

First, obtain the xlsx file from the Data Folder at https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling. This Excel file contains multiple sheets, so to make it easier to read, convert each sheet to a separate csv file with https://cloudconvert.com/xlsx-to-csv. For this project, the training and testing sheets were each converted to their own csv file in the data folder; the training data is stored in data/training_data_user_knowledge.csv.

Next, the data can be read into a tibble with read_csv. As the file contains no metadata lines, no extra arguments are needed. For instance, to read in the training data:

In [29]:
# reading the training data into the training_data object:
training_data <- read_csv("data/training_data_user_knowledge.csv")

Rows: 258 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): UNS
dbl (5): STG, SCG, STR, LPR, PEG

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Cleaning and wrangling into tidy format

Here is what the first 5 rows of our data look like:

In [30]:
slice(training_data, 1:5)

STG,SCG,STR,LPR,PEG,UNS
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0,0.0,0.0,0.0,0.0,very_low
0.08,0.08,0.1,0.24,0.9,High
0.06,0.06,0.05,0.25,0.33,Low
0.1,0.1,0.15,0.65,0.3,Middle
0.08,0.08,0.08,0.98,0.24,Low


From the source website, here is what the column labels mean: 

For each student,
STG: The degree of study time for subject
SCG: The degree of repetition for subject materials
STR: The degree of study time for related subjects
LPR: The exam performance for related subjects
PEG: The exam performance for subject
UNS: The knowledge level

Let's rename the columns so that they are easier to understand:

In [31]:
training_data <- rename(training_data, 
                        direct_study_time = STG,
                        direct_repetition_degree = SCG,
                        related_study_time = STR,
                        related_exam_performance = LPR,
                        direct_exam_performance = PEG,
                        direct_knowledge_level = UNS)

slice(training_data, 1:5)

direct_study_time,direct_repetition_degree,related_study_time,related_exam_performance,direct_exam_performance,direct_knowledge_level
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0,0.0,0.0,0.0,0.0,very_low
0.08,0.08,0.1,0.24,0.9,High
0.06,0.06,0.05,0.25,0.33,Low
0.1,0.1,0.15,0.65,0.3,Middle
0.08,0.08,0.08,0.98,0.24,Low


As we can see from the table above, each row is a single observation, each column is a single variable, and each value is a single cell. Therefore, our data is tidy. Furthermore, the new column names are clearer and easier to understand.
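
Since the knowledge level is the label we want to classify, it may be convenient to store it as a factor instead of a character column before modelling (tidymodels classification models expect a factor outcome). The sketch below is an illustration only: `example_rows` is a hypothetical hand-typed tibble that copies two of the columns from the five rows shown above, so the code runs without the csv file.

```r
library(tidyverse)

# Hand-typed copy of two columns from the five rows shown above (illustration only).
example_rows <- tibble(
    direct_exam_performance = c(0.0, 0.9, 0.33, 0.3, 0.24),
    direct_knowledge_level = c("very_low", "High", "Low", "Middle", "Low")
)

# as_factor() converts the character labels to a factor,
# keeping the levels in order of first appearance.
example_rows <- example_rows %>%
    mutate(direct_knowledge_level = as_factor(direct_knowledge_level))

levels(example_rows$direct_knowledge_level)
```

The same `mutate` call could be applied to `training_data` in this notebook.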

## Exploratory tables
### Minimum and Maximum of each column

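For our numerical columns (labelled <dbl>), we can find the range of values by applying the min and max functions to each column with a map function:

In [None]:
# Minimum and maximum of each numeric column (the character label column is excluded)
numeric_columns <- select(training_data, where(is.numeric))
column_min <- map_df(numeric_columns, min)
column_max <- map_df(numeric_columns, max)
max_and_min <- bind_rows(min = column_min, max = column_max, .id = "statistic")
max_and_min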
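
Another exploratory table that may be useful before classification is the number of observations in each knowledge level, since very uneven classes can affect a classifier. This is a minimal sketch on a hypothetical hand-typed tibble, `example_rows`, that copies only the knowledge levels of the five rows displayed in this notebook; the counts it produces describe those five rows, not the full training set.

```r
library(tidyverse)

# Knowledge levels hand-copied from the five displayed rows (illustration only).
example_rows <- tibble(
    direct_knowledge_level = c("very_low", "High", "Low", "Middle", "Low")
)

# Count how many observations fall into each knowledge level.
class_counts <- example_rows %>%
    group_by(direct_knowledge_level) %>%
    summarize(count = n())

class_counts
```

Running the same `group_by` and `summarize` steps on `training_data` would give the class counts for the training set.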

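In [1]:
# Relationship between study time and exam performance for the subject
# (uses the renamed columns; library(tidyverse) must be loaded in the current session)
ggplot(training_data, aes(x = direct_study_time, y = direct_exam_performance)) +
    geom_point() +
    labs(x = "Degree of study time for subject", y = "Exam performance for subject")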


## Methods
For our data analysis, we will use K-nearest neighbours classification with the columns PEG (exam performance for the subject) and STG (study time for the subject) to classify knowledge levels. We will try to find the best value of k and then check how accurate our model is on the testing data. One way we will visualize the results is with a plot of estimated accuracy against k, which should show clearly which value of k performs best.
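
The tuning step described above could be sketched as follows. This is only an illustration under several assumptions that are not part of the proposal: `toy_data` is randomly generated stand-in data with two classes (the real data has four knowledge levels), the predictors are standardized, the "kknn" engine is used with rectangular weights, cross-validation uses 5 folds, and the candidate values of k are the odd numbers from 1 to 15.

```r
library(tidyverse)
library(tidymodels)

set.seed(2023)

# Hypothetical stand-in data with the renamed column layout (NOT the real data set).
toy_data <- tibble(
    direct_study_time = runif(120),
    direct_exam_performance = runif(120)
) %>%
    mutate(direct_knowledge_level = as_factor(
        if_else(direct_exam_performance > 0.5, "High", "Low")))

# Standardize both predictors so that neither dominates the distance calculation.
knn_recipe <- recipe(direct_knowledge_level ~ direct_exam_performance + direct_study_time,
                     data = toy_data) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

# K-nearest neighbours model in which the number of neighbours is left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

# 5-fold cross-validation and the candidate values of k (assumed grid).
folds <- vfold_cv(toy_data, v = 5, strata = direct_knowledge_level)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Estimate the accuracy for every candidate k.
knn_results <- workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    tune_grid(resamples = folds, grid = k_grid, metrics = metric_set(accuracy)) %>%
    collect_metrics()

# Plot of estimated accuracy against k.
accuracy_vs_k <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of neighbours (k)", y = "Cross-validated accuracy estimate")

# The k with the highest estimated accuracy (first one if there is a tie).
best_k <- knn_results %>%
    slice_max(mean, n = 1, with_ties = FALSE) %>%
    pull(neighbors)
```

In the project itself, `toy_data` would be replaced by the training data, and the chosen k would then be used to fit a final model that is evaluated on the testing data.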

## Expected outcomes and significance
We expect higher levels of study time and exam performance to be associated with higher levels of knowledge. We also do not expect the classification to be very accurate, because the way these variables were measured seems unreliable.

Our findings may show that common beliefs about how study time reflects knowledge are not supported by the data. They may even suggest that exams do not test knowledge well. Alternatively, they may confirm these beliefs.

A future question could be: what other variables may affect user knowledge?

notes (ignore):
STG (The degree of study time for goal object materials), (input value)
SCG (The degree of repetition number of user for goal object materials) (input value)
STR (The degree of study time of user for related objects with goal object) (input value)
LPR (The exam performance of user for related objects with goal object) (input value)
PEG (The exam performance of user for goal objects) (input value)
UNS (The knowledge level of user) (target value)
Very Low: 50
Low: 129
Middle: 122
High: 130