**INDIVIDUAL PLANNING REPORT**  
Aisyah Sudarmaji  
**Problem**: Predicting Usage of a Video Game Research Server


**Broad question:**  
Which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?

**Specific question:**   
Can a player's experience level, age, and gender predict the number of hours they play on the server?

In [None]:
library(tidyverse)
players <- read_csv("players.csv")
players
nrow(players)
ncol(players)


**DATA DESCRIPTION**  
The data set "players.csv" has 196 players and 7 variables, and contains information about players who participated in UBC's Minecraft Research Project. Each row represents an individual player. 

**Column Description**  
1. **experience**: The player's experience level in Minecraft (character) 
2. **subscribe**: Indicates whether or not the player subscribed to the game (logical)
3. **hashedEmail**: The player's anonymous identity (character)
4. **played_hours**: The number of hours the player spent playing on the server (double)
5. **name**: The player's name (character)
6. **gender**: The player's gender (character)
7. **Age**: The player's age (double)

We won't be using the columns "hashedEmail" and "name" as they don't provide useful information for the prediction.

In [None]:
players |> count(experience)
players |> count(subscribe)
players |> count(gender)

The three tables above show the distribution of the categorical variables in the data set. 

In [None]:
players_summary <- players |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary

players_summary_based_on_age <- players |>
    group_by(Age) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary_based_on_age

players_summary_based_on_gender <- players |>
    group_by(gender) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary_based_on_gender

players_summary_based_on_experience <- players |>
    group_by(experience) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary_based_on_experience

One issue with the data is that the Age column contains at least one missing value (NA), which must be handled before Age can be used as a predictor.
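
One way to count the missing ages and drop those rows is sketched below. The `toy_players` tibble and its values are made up for illustration only; in this notebook the same steps would be applied to `players`.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; these values are NOT from players.csv.
toy_players <- tibble(
    Age = c(17, 21, NA, 25),
    played_hours = c(0.5, 3.2, 1.0, 0.0)
)

# Count how many rows are missing Age.
n_missing_age <- toy_players |>
    summarize(n_missing = sum(is.na(Age))) |>
    pull(n_missing)
n_missing_age

# Drop rows with a missing Age so that Age can be used in distance calculations.
toy_players_complete <- toy_players |>
    drop_na(Age)
nrow(toy_players_complete)
```

Dropping rows is only one option; it is reasonable when few values are missing, since each removed row also discards that player's other information.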

In [None]:
age_plot <- ggplot(players_summary_based_on_age, aes(x = Age, y = mean_played_hours)) +
    geom_point() +
    labs(title = "Average Played Hours by Age",
         x = "Player's Age",
         y = "Average Played Hours")
age_plot

- The scatter plot shows no clear positive or negative relationship between a player's age and the average number of hours played on the server.
- Age on its own may therefore be a weak predictor of the number of hours a player spends on the server.

In [None]:
gender_plot <- ggplot(players_summary_based_on_gender, aes(x = gender, y = mean_played_hours, fill = gender)) +
    geom_bar(stat = "identity") +
    labs(title = "Average Played Hours by Gender",
         x = "Gender",
         y = "Average Played Hours")
gender_plot

The graph shows that players who identify as non-binary have the highest average number of hours played on the server, followed by female players. Players who identify as Two-Spirited have the lowest average.

In [None]:
experience_plot <- ggplot(players_summary_based_on_experience, aes(x = experience, y = mean_played_hours, fill = experience)) +
    geom_bar(stat = "identity") +
    labs(title = "Average Played Hours by Experience Level",
         x = "Experience Level",
         y = "Average Played Hours")
experience_plot

The graph shows that players with a "Regular" experience level have the highest average number of hours played on the server. This suggests that Regular players are more active on the server than Amateur, Beginner, Pro, and Veteran players.

In [None]:
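# Keep the predictors named in the specific question (experience, gender, Age)
# together with the response variable (played_hours).
wrangled_players <- players |>
    select(experience, gender, Age, played_hours)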
wrangled_players
nrow(wrangled_players)
ncol(wrangled_players)

- The table above includes only the variables needed to answer our question.
- This lets us focus on the factors we expect to be related to the number of hours played on the server.

**METHODS AND PLAN** 
- To identify the "kinds" of players who would contribute a large amount of data, we will predict the number of hours each player plays on the server, which is a numerical variable. KNN regression is therefore a suitable method, especially since the relationship between hours played and the players' characteristics may not be linear.

- KNN regression assumes that the distance between observations reflects how similar they are. To predict a player's hours, it finds the k most similar players based on their characteristics (age, gender, experience) and averages their played hours. A limitation of KNN regression is that it may not predict well beyond the range of values in the training data.

- The data will be split into a training set (75%) and a test set (25%).

- I will tune the KNN regression model by testing different values of k through cross-validation on the training set. The value of k with the lowest RMSE will be used for the final model.
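
The plan above could be sketched roughly as follows. This is a minimal sketch, not the final analysis: it assumes the `tidymodels` and `kknn` packages are installed, uses 5 folds and odd values of k from 1 to 15 as arbitrary choices, and runs on a simulated `toy_players` tibble (not the real data) so that it is self-contained. Because gender and experience are categorical, the sketch converts them to dummy variables and standardizes all predictors before distances are computed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Simulated stand-in for the wrangled players data (NOT the real data).
toy_players <- tibble(
    experience = factor(sample(c("Amateur", "Beginner", "Pro", "Regular", "Veteran"),
                               120, replace = TRUE)),
    gender = factor(sample(c("Female", "Male", "Non-binary"), 120, replace = TRUE)),
    Age = sample(10:50, 120, replace = TRUE),
    played_hours = round(rexp(120, rate = 0.5), 1)
)

# 75% training / 25% test split.
players_split <- initial_split(toy_players, prop = 0.75)
players_train <- training(players_split)
players_test <- testing(players_split)

# Encode categorical predictors as dummy variables, drop any constant
# columns, then center and scale so every predictor contributes equally.
players_recipe <- recipe(played_hours ~ experience + gender + Age, data = players_train) |>
    step_dummy(all_nominal_predictors()) |>
    step_zv(all_predictors()) |>
    step_normalize(all_numeric_predictors())

# KNN regression specification with k left to be tuned.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# 5-fold cross-validation over a grid of candidate k values.
players_vfold <- vfold_cv(players_train, v = 5)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

knn_results <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = players_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "rmse")

# Pick the k with the lowest cross-validated RMSE.
best_k <- knn_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k

# Refit on the full training set with the chosen k, then evaluate on the test set.
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("regression")

final_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_final_spec) |>
    fit(data = players_train)

test_rmse <- final_fit |>
    predict(players_test) |>
    bind_cols(players_test) |>
    metrics(truth = played_hours, estimate = .pred) |>
    filter(.metric == "rmse")
test_rmse
```

With the real data, `toy_players` would be replaced by the wrangled players table after rows with a missing Age have been handled. The test set is used only once, at the end, so that the reported RMSE is not influenced by the choice of k.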

**Another data set: "sessions.csv"**  
We also have a second data set called "sessions.csv", which contains information about each player's play sessions, such as start and end times, from which session length and frequency can be derived.

In [None]:
sessions <- read_csv("sessions.csv")
sessions
nrow(sessions)
ncol(sessions)

The sessions data set has 1535 observations and 5 variables, where each row represents an individual play session.

**Column Description**  
1. **hashedEmail**: The player's anonymous identity (character)
2. **start_time**: The session start time
3. **end_time**: The session end time
4. **original_start_time**: The session start time stored as a numeric timestamp
5. **original_end_time**: The session end time stored as a numeric timestamp
    
While we won't focus on the sessions data set in this report, it might still be useful later, for example to understand how often players play on the server.
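
As a rough illustration of how the sessions data could be used later, the sketch below computes the number of sessions and the average session length per player. The `toy_sessions` tibble is made up, and the day/month/year hour:minute text format for `start_time` and `end_time` is an assumption; the parsing function would need to match the actual format in "sessions.csv".

```r
library(tidyverse)
library(lubridate)

# Hypothetical stand-in for `sessions`; values and time format are assumptions.
toy_sessions <- tibble(
    hashedEmail = c("player_a", "player_a", "player_b"),
    start_time = c("30/06/2024 18:12", "01/07/2024 10:00", "02/07/2024 23:50"),
    end_time = c("30/06/2024 18:24", "01/07/2024 10:30", "03/07/2024 00:10")
)

# Parse the text timestamps and compute each session's length in minutes.
toy_sessions_timed <- toy_sessions |>
    mutate(start = dmy_hm(start_time),
           end = dmy_hm(end_time),
           session_mins = as.numeric(difftime(end, start, units = "mins")))

# Summarize how often and how long each player plays.
session_summary <- toy_sessions_timed |>
    group_by(hashedEmail) |>
    summarize(n_sessions = n(),
              mean_session_mins = mean(session_mins))
session_summary
```

A summary like this could be joined to the players table by `hashedEmail` if session frequency were ever needed as an additional variable.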

GitHub link: https://github.com/aisyahsudarmaji-web/Predicting-Usage-of-a-Video-Game-Research-Server