## Group Project (#15)
#### Members: 
- Bhoomika Yadav
- Yubin Kim
- Karen Rianika Tanuwijaya
- Yasin Mir 

## Introduction 



A predictive question asks for an estimate of the class or numerical value of a future observation based on data that already exists. To predict the usage of a video game server, specifically a player’s hours playing the game, we use regression with the k-nearest neighbors (KNN) model. Regression is a method for predicting a quantitative value for future observations. In KNN regression, the prediction for a new observation is the average response of the k training observations whose predictor values are closest to it. Repeating this across the range of the predictor gives a curve of predicted values, which we use to estimate the numerical value of a future observation.
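The averaging step described above can be sketched in a few lines of base R. This is only an illustration: the toy data and the helper `knn_predict` are hypothetical and are not part of the analysis or of any library.

```r
# Hypothetical toy data (not from players.csv): one numeric predictor and a response.
train_x <- c(1, 2, 3, 4, 5, 6)
train_y <- c(2, 4, 6, 8, 10, 12)

# Hypothetical helper: predict the response at new_x by averaging the responses
# of the k training points whose predictor values are closest to new_x.
knn_predict <- function(new_x, train_x, train_y, k) {
  distances <- abs(train_x - new_x)        # distance along the single predictor
  nearest <- order(distances)[seq_len(k)]  # indices of the k closest points
  mean(train_y[nearest])                   # unweighted average of their responses
}

prediction <- knn_predict(new_x = 3.4, train_x, train_y, k = 3)
print(prediction)
```

Here the three closest training points to 3.4 are x = 3, 4 and 2, so the prediction is the mean of their responses.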


#### Question 
The broad question we chose to answer is: Which kinds of players are most likely to contribute a large amount of data, so that we can target those players in our recruiting efforts?
The specific question we explore is: Can a player's `experience` be used to predict their total play time (`played_hours`) in the players dataset?

### Data Description of the players.csv dataset 
To address this question, the players.csv dataset was used. It contains the necessary information: each player's experience level and the total number of hours they have spent playing the game, both of which are needed to make the prediction.

##### Overview of the variables 
|Name        |Type                |Description| 
|:--------   |:---------          |:----------|
|`experience`  |Character (Categorical)| Player's skill level (Amateur, Beginner, Regular, Veteran and Pro)|         
|`subscribe`   |Logical (a boolean)| Whether or not the player has a subscription| 
|`hashedEmail` |Character| Unique identifier for each player |
|`played_hours`|Double| Total hours spent on the server by player |
|`name`        |Character| Player's name |
|`gender`     |Character (Categorical)|  Player's gender (7 categories)|
|`Age`         |Double| Player's age (8 - 50 years)|

##### Summary of the data
- Most players are `Amateur`
- Most players have a subscription
- Most players are `Male`
- Average play time: around 5.85 hours
- Average age: 20-21 years old

### Analysis 

The necessary libraries are loaded, and the players.csv dataset is imported reproducibly by reading the raw data from a direct link.

In [None]:
#Loading the libraries needed
library(tidyverse) 
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
library(ggtext)

In [None]:
#Loading the players.csv data
players_data <- read_csv("https://raw.githubusercontent.com/bhxxmika/group_project_files/refs/heads/main/players.csv") 
players_data |>
    head()

### Figure 1: Analysing the relationship between `experience` and `played_hours` 


To determine whether player experience level is a meaningful predictor of playtime and to identify trends within the dataset, a bar graph of average hours played by experience level is generated. Appropriate graph dimensions are set, and rows where `played_hours` = 0 are removed, as they represent players who did not engage with the game and could skew the analysis. The data is grouped by experience level, and the mean played_hours is calculated for each group. This information is then visualized using a bar chart, with experience level on the x-axis and average hours played on the y-axis, to highlight potential relationships between skill level and playtime.

In [None]:

#setting the height and width for the graph 
options(repr.plot.width = 11, repr.plot.height = 8)

#grouping the dataset by experience 
players_by_lvl <- players_data |> 
                filter(played_hours > 0) |>
                group_by(experience) |>
                summarize(played_hours_mean = mean(played_hours)) 
players_by_lvl 


# Plotting the relationship between `experience` and `played_hours`
experience_vs_hours <- players_by_lvl |> 
                    ggplot(aes(x = experience, y = played_hours_mean, fill = experience)) + 
                    geom_bar(stat = "identity") + 
                    labs(x = "Experience Level", y = "Average Hours Played", title = "**Figure 1:** Average Hours Vs. Player's Skill level", fill = "Experience Level") + 
                    theme(text = element_text(size = 14),plot.title=element_markdown()) 
                    
experience_vs_hours 


The bar plot in **Figure 1** shows that players classified as `Regular` have the highest average playtime by a wide margin relative to `Amateur`, `Beginner`, `Pro`, and `Veteran` players. Because average playtime differs across experience levels, the plot supports using experience level as a predictor when modelling player behaviour.

Next, the dataset is wrangled to prepare for modelling. The relevant columns, `experience` and `played_hours`, are first selected. As before, rows with `played_hours` = 0 are removed so that only active players are included in the analysis. The `experience` variable is then recoded into numeric values to be compatible with the k-nearest neighbors algorithm, which relies on numerical distance calculations. Finally, the dataset is split into training and testing sets using a 75/25 ratio.
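One detail worth checking when recoding a categorical variable to numbers in R is that `as.numeric()` applied to a factor returns the position of each level rather than its label. The sketch below, in base R, contrasts an explicit lookup table with that behaviour. The example vector and the `codes` lookup are hypothetical and only reuse the five level names from the variable overview.

```r
# Hypothetical example vector using the five levels listed in the data description.
experience <- c("Amateur", "Beginner", "Regular", "Veteran", "Pro")

# Explicit lookup table: the intended numeric code for each level.
codes <- c(Amateur = 1, Beginner = 2, Regular = 3, Veteran = 4, Pro = 5)
experience_num <- unname(codes[experience])
print(experience_num)

# Pitfall: as.numeric() on a factor returns the position of each level
# (alphabetical by default), not the intended code.
level_index <- as.numeric(factor(experience))
print(level_index)
```

The lookup gives 1 to 5 in the intended order, while the factor route assigns `Pro` the code 3 because its level sorts alphabetically before `Regular` and `Veteran`.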

In [None]:
# Wrangling the data 
set.seed(2000)
players_tidy <- players_data |>  
                select(experience, played_hours) |> 
                filter(played_hours > 0) |>
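                # fct_recode() relabels the levels; convert to character before as.numeric(),
                # otherwise as.numeric() returns the alphabetical level index, not the label
                mutate(experience = fct_recode(experience, "1" = "Amateur", "2" = "Beginner", "3" = "Regular", "4" = "Veteran", "5" = "Pro")) |> 
                mutate(experience = as.numeric(as.character(experience))) 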
            
head(players_tidy) 
tail(players_tidy)
players_split <- players_tidy |> 
                    initial_split(prop = 0.75, strata = played_hours) 
players_training <- training(players_split) 
players_testing <- testing(players_split) 
head(players_training) 
head(players_testing) 

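Now the wrangled data can be used for KNN regression.
First, a k-nearest neighbors regression model is specified using the kknn engine. The recipe defines the model formula and includes scaling and centering steps, although with a single predictor standardizing does not change which neighbors are closest. The number of neighbors is tuned with 5-fold cross-validation over k = 1 to 50, and the model with the lowest RMSE is refit on the training set and evaluated on the testing set.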
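To illustrate why standardizing matters little with only one predictor, the base R sketch below ranks training points by distance to a new point before and after scaling and centering. The values are hypothetical, chosen so that no two distances tie, and are not taken from players.csv.

```r
# Hypothetical values for a single numeric predictor (not taken from players.csv).
x <- c(1, 2, 4, 7, 11)
new_x <- 5

# Standardize with the training mean and standard deviation,
# which is what scaling and centering amount to together.
x_mean <- mean(x)
x_sd <- sd(x)
x_std <- (x - x_mean) / x_sd
new_x_std <- (new_x - x_mean) / x_sd

# Rank the training points by distance to the new point, before and after.
order_raw <- order(abs(x - new_x))
order_std <- order(abs(x_std - new_x_std))
same_neighbors <- identical(order_raw, order_std)
print(same_neighbors)
```

Every distance is divided by the same constant, so the ranking of neighbors is unchanged. Standardizing becomes important once several predictors on different scales contribute to the distance.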


In [None]:
#knn regression 
set.seed(2000) 
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
            set_engine("kknn") |> 
            set_mode("regression") 

knn_recipe <- recipe(played_hours ~ experience, data = players_training) |> 
                step_scale(all_predictors()) |> 
                step_center(all_predictors()) 
knn_vfold <- vfold_cv(players_training, v = 5, strata = played_hours) 

knn_workflow <- workflow() |> 
             add_recipe(knn_recipe) |> 
             add_model(knn_spec) 
k_values <- tibble(neighbors = seq(from = 1, to = 50, by = 1)) 

knn_results <- knn_workflow |> 
                tune_grid(resamples = knn_vfold, grid = k_values) |> 
                collect_metrics() 

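# keep the single row with the lowest mean RMSE; with_ties = FALSE guards against
# several k values tying, which would pass more than one value to neighbors
knn_min <- knn_results |> 
            filter(.metric == "rmse") |> 
            slice_min(mean, n = 1, with_ties = FALSE) 
knn_min # the original run found k = 10 to be the best value 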
knn_min_val <- knn_min |> 
                pull(neighbors) 
knn_min_val

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = knn_min_val) |> 
                set_engine("kknn") |> 
                set_mode("regression") 
players_fit <- workflow() |> 
                add_recipe(knn_recipe) |> 
                add_model(players_spec) |> 
                fit(data = players_training) 
knn_almost_rmspe <- players_fit |> 
                        predict(players_testing) |>
                        bind_cols(players_testing) |> 
                        metrics(truth = played_hours, estimate = .pred) 
knn_almost_rmspe
knn_rmspe <- knn_almost_rmspe |> 
                filter(.metric == "rmse") |> 
                select(.estimate) |> 
                pull() 
knn_rmspe 
