# Project Final Report 

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.rows = 6)

── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mdplyr    [39m 1.1.4     [32m✔[39m [34mreadr    [39m 2.1.5
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mstringr  [39m 1.5.1
[32m✔[39m [34mggplot2  [39m 3.5.1     [32m✔[39m [34mtibble   [39m 3.2.1
[32m✔[39m [34mlubridate[39m 1.9.3     [32m✔[39m [34mtidyr    [39m 1.3.1
[32m✔[39m [34mpurrr    [39m 1.0.2     
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to become errors
── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 1.1.1 ──

[32m✔[39m [34mbroom       [39m 1.0.6     [32m✔[39m [34mrsample     [39

In [3]:
players_data <-read_csv("https://raw.githubusercontent.com/amberer60s/DSCI-100---Group-Project/refs/heads/main/players%20(1).csv?token=GHSAT0AAAAAADBONSJYRLGZMVJEG4UWIOM6Z7QMGZQ")
print(players_data)

[1mRows: [22m[34m196[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): experience, hashedEmail, name, gender
[32mdbl[39m (2): played_hours, Age
[33mlgl[39m (1): subscribe

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


[90m# A tibble: 196 × 7[39m
   experience subscribe hashedEmail              played_hours name  gender   Age
   [3m[90m<chr>[39m[23m      [3m[90m<lgl>[39m[23m     [3m[90m<chr>[39m[23m                           [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<dbl>[39m[23m
[90m 1[39m Pro        TRUE      f6daba428a5e19a3d475748…         30.3 Morg… Male       9
[90m 2[39m Veteran    TRUE      f3c813577c458ba0dfef809…          3.8 Chri… Male      17
[90m 3[39m Veteran    FALSE     b674dd7ee0d24096d1c0196…          0   Blake Male      17
[90m 4[39m Amateur    TRUE      23fe711e0e3b77f1da7aa22…          0.7 Flora Female    21
[90m 5[39m Regular    TRUE      7dc01f10bf20671ecfccdac…          0.1 Kylie Male      21
[90m 6[39m Amateur    TRUE      f58aad5996a435f16b0284a…          0   Adri… Female    17
[90m 7[39m Regular    TRUE      8e594b8953193b26f498db9…          0   Luna  Female    19
[90m 8[39m Amateur    FALSE     1d23

## Introduction ##

In the world of gaming, game developers and companies want to keep players engaged and attract new ones. One way to do this is by figuring out which players are most likely to play a lot, as this gives them a better idea of where to focus their marketing and recruitment efforts. The more time a player spends, the better the developers can understand how to improve the game and 

A big question for game developers is whether certain types of players are more likely to play for longer periods. 

For our project, we tried to answer the question : 

**"Can a player’s experience and age predict how much time they will spend playing the game?"**

Our aim is to see if there is a relationship between how old a player is and how much they play. This could help game developers understand which age groups are more likely to be active players.

#### **players.csv**
This dataset contains 196 player records with various variables describing their characteristics and behavior.

| Column Name   | Data Type | Description |
|--------------|----------|-------------|
| `experience` | character (chr) | Player's experience level (`Pro`, `Veteran`, `Regular`, and `Amateur`). |
| `subscribe`  | logical (lgl) | Indicates whether the player is a subscriber to the server (`True` or `False`). |
| `hashedEmail` | character (chr) | Hashed representation of the player's email. |
| `played_hours` | dbl | Total hours the player has played. |
| `name` | character (chr) | Player's name. |
| `gender` | character (chr) | Player's gender (e.g., Male, Female, Non-binary, etc.). |
| `Age` | double (dbl) | Player’s age (years). |

For our exploration in this project, we will focus mainly on the columns **experience**, **Age**, and **played_hours**. Based on this, we can draw the conclusions, 
- **Response Variable** : What we want to predict. In this case, the response variable is **played_hours**, which represennts the total time a players spends playing the game.
- **Exploratory Variable** : What we use to predict the response variable. For our project, the explanatory variables are **experience** and **Age**, as we are looking to see if a player's age can help predict how manny hours they will play. 

## Explortary and Visualization

Using summary function and is.na argument, we can check if there's any NA value in our dataset could interfere our calculations.

In [4]:
summary_players <- players_data |>
                    summarize(across(everything(), ~sum(is.na(.))))
summary_players

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,2


In the next step, we use functions like nrow and group_by to the dataset to find the number and percentage distribution of each experience level.Then summarize and across function can help us to summarize statistics in every column.

In [5]:
num_obs <- nrow(players_data)
players_data |>
  group_by(experience) |>
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )

#reports the number of observations in each variable
num_observations <- players_data |>
  summarise(across(everything(), ~sum(!is.na(.))))
num_observations

experience,count,percentage
<chr>,<int>,<dbl>
Amateur,63,32.142857
Beginner,35,17.857143
Pro,14,7.142857
Regular,36,18.367347
Veteran,48,24.489796


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<int>,<int>,<int>,<int>,<int>,<int>,<int>
196,196,196,196,196,196,194


Here since we're using experience as one predictor, we want to scale the experience level from 0 to 5, corresponding with "Beiginner","Regular","Amateur","Pro", and "Veteran". Therefore, we can apply recode function to make the experience from chr to dbl. In addition, using drop_na can help us to eliminate potential NA value that we don't want.

In [6]:
players_d <- players_data|>
            mutate(experience_numeric = recode(experience,
                                      "Beginner" = 1,
                                      "Regular" = 2,
                                      "Amateur" = 3,
                                      "Pro" = 4,
                                      "Veteran" = 5))|>
            drop_na()


Here we can use select function to select only columns we will use in the following steps, making our data frame clearer.

In [7]:
players_data <- players_d |>
                select(experience_numeric, Age, played_hours)

players_data |>
head(6)

experience_numeric,Age,played_hours
<dbl>,<dbl>,<dbl>
4,9,30.3
5,17,3.8
5,17,0.0
3,21,0.7
2,21,0.1
3,17,0.0


For visulization, 

In [None]:
players_graph <- players_data |>
    ggplot(aes(x=experience_numeric,fill = factor(Age))) +
    geom_histogram(binwidth=0.5) 


players_graph

In [None]:
players_graph <- players_data |>
    ggplot(aes(x=experience_numeric,y=played_hours,color=Age)) +
    geom_point(alpha=0.4)+
    theme(text = element_text(size = 12))


players_graph

For our knn-regression model:

In [None]:
players_split <- initial_split(players_data, prop=0.75, strat=played_hours) 
players_training <-training(players_split)
players_testing <-testing(players_split)

In [None]:
players_recipe <-recipe(played_hours ~ experience_numeric + Age, data = players_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

In [None]:
players_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

In [2]:
gridvals <- tibble(neighbors = seq(1, 200))
players_vfold <- vfold_cv(players_training, v = 5, strata = played_hours)

ERROR: Error in tibble(neighbors = seq(1, 200)): could not find function "tibble"


In [None]:
players_multi <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(players_spec) |>
  tune_grid(players_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  filter(mean == min(mean))

players_k <- players_multi |>
              pull(neighbors)
players_multi


There were issues with some computations   [1m[33mA[39m[22m: x1

→ [31m[1mB[22m[39m | [31merror[39m:   [1m[22m[36mℹ[39m In index: 115.
               [1mCaused by error in `cl[C]`:[22m
               [33m![39m only 0's may be mixed with negative subscripts

There were issues with some computations   [1m[33mA[39m[22m: x1
There were issues with some computations   [1m[33mA[39m[22m: x1   [1m[31mB[39m[22m: x1


There were issues with some computations   [1m[33mA[39m[22m: x1   [1m[31mB[39m[22m: x1


In [None]:
players_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = players_k) |>
  set_engine("kknn") |>
  set_mode("regression")

In [None]:
knn_players_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(players_spec) |>
  fit(data = players_training)

In [None]:
knn_players_preds <- knn_players_fit |>
  predict(players_testing) |>
  bind_cols(players_testing)

players_metrics <- metrics(knn_players_preds, truth = playing_hours, estimate = .pred) |>
                     filter(.metric == 'rmse')

players_metrics