**Investigating the relationship between playing hours and age in the player_csv dataset**

(1) **Data description**

- This data set contains 196 observations.

- There are 7 variables. `hashed_email` and `name` have been excluded **from the analysis** because they are unique to each individual and cannot serve as predictors, leaving 5 variables for the analysis.

`experience`: Indicates the category of skill level, a categorical value

`subscribe`: Indicates whether a player is subscribed (TRUE) or not (FALSE), a logical value

`hashed_email`: Indicates the player's unique hashed email, character data

`played_hours`: Indicates the number of hours spent playing, a numeric value

`name`: Indicates the player's name, character data

`gender`: Indicates the player's gender identity, a categorical value

`age`: Indicates the player's age in years, a numeric value

- There are two main issues in this data set:
1. Players can report "prefer not to say" for gender. This could cause problems when using gender as a predictive variable because it introduces uncertainty and what is effectively missing data.
2. NA (missing data) values appear throughout the columns, which could lead to less accurate predictions.
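
One way to gauge the second issue is to count the missing values in each column and, if there are only a few, drop the affected rows before modeling. The sketch below uses a small made-up tibble (`toy_players` is hypothetical and stands in for `player_data`) and assumes that dropping rows is acceptable; it is not necessarily the approach taken later in this report.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for player_data (values are made up for illustration)
toy_players <- tibble(
  played_hours = c(0.5, 12.0, NA, 3.2),
  age          = c(17, NA, 21, 19),
  gender       = c("Male", "Female", "Prefer not to say", "Male")
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows where both modeling variables are present
toy_complete <- toy_players |>
  drop_na(age, played_hours)

na_counts
nrow(toy_complete)
```

Note that "Prefer not to say" is a recorded answer rather than an `NA`, so it is not counted here and would need to be handled separately if `gender` were used as a predictor.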

   
   






In [None]:
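# Count the observations. Run this after the data-loading cell in section (3),
# which is where player_data is defined.
nrow(player_data)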

(2) **Question**

My question investigates whether age can help predict the playing time of users in the `player_csv` dataset.


This question connects to the broad question of what "kinds" of players contribute the most data to the server. It looks at the age range of users who spend the most hours playing, specifying the "kind" of player by their age and interpreting a player's contribution of data as the number of hours they spend playing.

The `player_csv` dataset can help answer this predictive question because it contains both `age` and `played_hours`. A wide age range helps identify trends and correlations, whereas a narrow range would force the model to predict from limited data, making it less accurate.




(3) **Exploratory Data Analysis and Visualization**

Attach needed packages

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
library(GGally)
library(ISLR)
set.seed(5)
options(repr.matrix.max.rows = 10)
source("cleanup.R")

Load the player data

In [None]:
player_data <- read_csv("https://raw.githubusercontent.com/bellamartens/Individual_Project/refs/heads/main/players.csv")

player_data

Wrangle data as needed

In [None]:
player_data <- rename(player_data,
                      hashed_email = hashedEmail,
                      age = Age)
player_data                 

Compute average of **quantitative** values

In [None]:
avg_played_hours <- player_data |>
             summarize(avg_played_hours = mean(played_hours, na.rm = TRUE))

avg_age <- player_data |>
                summarize(avg_age = mean(age, na.rm = TRUE))
                       
avg_played_hours
avg_age

Report values in a Table

| Average hours played (hours) | Average age (years) |
|---|---|
| 6 | 21 |
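
As a sketch of how both averages could be produced in a single call and rounded for reporting, the example below uses a made-up tibble (`toy_players` is hypothetical, not the real data), so its output will not match the table above.

```r
library(dplyr)

# Hypothetical stand-in for player_data (made-up values)
toy_players <- tibble(
  played_hours = c(0.0, 1.5, 30.0, NA),
  age          = c(17, 22, NA, 24)
)

# Compute both averages in one call, ignoring missing values, then round
toy_summary <- toy_players |>
  summarize(
    avg_played_hours = mean(played_hours, na.rm = TRUE),
    avg_age          = mean(age, na.rm = TRUE)
  ) |>
  mutate(across(everything(), ~ round(.x, 1)))

toy_summary
```

Keeping both summaries in one tibble makes it easier to copy the values into a report table without mixing them up.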

Visualize the data

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7) 

# Scatter plot of playing hours against age (one point per player)
plot_1 <- player_data |>
          ggplot(aes(x = age, y = played_hours)) +
          geom_point() +
          labs(x = "User Age", y = "Time spent playing (hours)") +
          ggtitle("How age influences playing hours") +
          theme(text = element_text(size = 13))

plot_1


This scatter plot very loosely suggests that players aged roughly 16 to 22 may have higher playing hours. These are the only ages with outliers of very high playing time (150+ hours). However, the data are sparse for many ages, so the relationship cannot be confirmed from this plot alone.

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7) 


# geom_col() stacks every player's hours, so each bar is the total for that age
plot_2 <- player_data |>
        ggplot(aes(x = age, y = played_hours)) +
        geom_col() +
        labs(x = "User Age", y = "Total time spent playing (hours)") +
        ggtitle("How age influences playing hours") +
        theme(text = element_text(size = 13))

plot_2
        

From this bar plot, the ages with the highest total playing hours fall between roughly 15 and 20 years. Because each bar sums the hours of every player at that age, a tall bar reflects both how many players are that age and how much each of them plays. Even so, this basic visualization gives some insight into the potential relationship between the two variables by suggesting that younger players account for more playing time.
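
To separate "many players at this age" from "each player plays a lot", one option is to compare the total and the mean hours per age. The sketch below uses made-up values (`toy_players` is hypothetical) purely to show how the two summaries can disagree.

```r
library(dplyr)

# Hypothetical data: three light players aged 17, one heavier player aged 30
toy_players <- tibble(
  age          = c(17, 17, 17, 30),
  played_hours = c(1, 2, 3, 5)
)

# Summarize each age by player count, total hours, and mean hours
hours_by_age <- toy_players |>
  group_by(age) |>
  summarize(
    n_players   = n(),
    total_hours = sum(played_hours, na.rm = TRUE),
    mean_hours  = mean(played_hours, na.rm = TRUE)
  )

hours_by_age
```

In this toy example age 17 has the larger total (6 hours against 5) while age 30 has the larger mean (5 hours against 2), which is why a plot of totals should be read with the player counts in mind.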

(4) **Methods and Plan**


The question will be addressed with K-nearest neighbors (KNN) regression. The question asks how a player's age can predict their playing hours, and predicting a numerical value requires regression.

This is an appropriate model because the data set is not overly large, there is only one predictor, and the range of values in the training data is relatively large. However, a potential limitation is sensitivity to outliers, which could distort the distance calculations and the neighbor averages used for prediction.
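
To make the idea concrete, here is a minimal hand-written sketch of unweighted KNN regression with one predictor. The function `knn_predict` and the toy vectors are hypothetical illustrations, not part of tidymodels or kknn, and the sketch skips standardization because with a single predictor rescaling does not change which neighbors are nearest.

```r
# Predict a numeric outcome for one new x by averaging the outcomes of
# its k nearest training points (unweighted, i.e. "rectangular" weighting)
knn_predict <- function(train_x, train_y, new_x, k) {
  distances <- abs(train_x - new_x)          # one predictor: distance is |x - new_x|
  nearest   <- order(distances)[seq_len(k)]  # indices of the k closest points
  mean(train_y[nearest])
}

# Made-up ages and playing hours
toy_age   <- c(15, 17, 18, 25, 40)
toy_hours <- c(10, 20, 30, 2, 1)

pred_k2 <- knn_predict(toy_age, toy_hours, new_x = 16, k = 2)
pred_k5 <- knn_predict(toy_age, toy_hours, new_x = 16, k = 5)

pred_k2
pred_k5
```

With a small K the prediction follows the closest players only, while K equal to the number of training points simply returns the overall average, which is why K is tuned rather than fixed in advance.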

The analysis will follow the standard process. First, the data will be split into training and testing sets, with about 75% of the data used for training and the remaining 25% held out for testing. The predictor will be standardized, the model will be trained, and the best K value will be chosen through cross-validation: the training data will be split into 5 folds, and the model will be trained and evaluated on each fold for every candidate K. Once the best K has been selected, the model will be retrained on the full training set with that K and evaluated on the testing data.
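
The code in this section compares K values, and scores the final model, using root mean squared error (RMSE; often called RMSPE when computed on held-out data). As a minimal sketch of what that metric measures, the hypothetical helper `rmse` below computes it by hand on made-up numbers; in the actual analysis the value comes from tidymodels.

```r
# Root mean squared error: the square root of the average squared
# difference between observed and predicted values
rmse <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted playing hours
toy_truth <- c(1, 2, 3, 4)
toy_pred  <- c(1, 2, 3, 8)

toy_rmse <- rmse(toy_truth, toy_pred)
toy_rmse
```

Because the errors are squared before averaging, a single badly predicted player (here the last one) dominates the score, which ties back to the model's sensitivity to outliers.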

In [None]:
#Split the data into training (75%) and testing (25%) sets, stratified by the outcome

age_split <- initial_split(player_data, prop = 0.75, strata = played_hours)
age_training <- training(age_split)
age_testing <- testing(age_split)

head(age_training)

nrow(age_training)
nrow(age_testing)

#Create a preprocessing recipe that standardizes the data

age_recipe <- recipe(played_hours ~ age, data = age_training) |>
                        step_center(all_predictors()) |>
                        step_scale(all_predictors())

#Create a KNN model

age_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                      set_engine("kknn") |>
                      set_mode("regression")

#Choose best K value

#Stratify the folds by the outcome, matching the initial split
age_vfold <- vfold_cv(age_training, v = 5, strata = played_hours)

age_workflow <- workflow() |>
                        add_recipe(age_recipe) |>
                        add_model(age_spec)

gridvals <- tibble(neighbors = seq(1,10))

age_results <- age_workflow |>
                       tune_grid(resamples = age_vfold, grid = gridvals) |>
                       collect_metrics() 

#select the value of k resulting in the lowest cross-validated RMSE

age_min <- age_results |>
               filter(.metric == 'rmse') |>
               slice_min(mean, n = 1, with_ties = FALSE) |>  # exactly one k, even if RMSEs tie
               pull(neighbors)

#retrain the model using that final k, predict on held-out data

age_spec_2 <- nearest_neighbor(weight_func = "rectangular", neighbors = age_min) |>
  set_engine("kknn") |>
  set_mode("regression")

age_fit <- workflow() |>
  add_recipe(age_recipe) |>
  add_model(age_spec_2) |>
  fit(data = age_training)

knn_rmspe <- age_fit |>
  predict(age_testing) |>
  bind_cols(age_testing) |>
  metrics(truth = played_hours, estimate = .pred)|>
  filter(.metric == 'rmse') |>
  pull(.estimate) 
knn_rmspe
