**Individual Planning Report**  
Aisyah Sudarmaji  
**Problem**: Predicting Usage of a Video Game Research Server


**DATA DESCRIPTION**  
The data set "players.csv" contains information about players who participated in UBC's Minecraft Research Project. Each row represents an individual player.

**The broad question we want to answer:**  
Which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?

**Our more specific question:**  
Can a player's experience level, age, and gender predict the number of hours they would play on the server?

**Column Description**  
1. experience: The player's experience level in Minecraft 
2. subscribe: Indicates whether or not the player subscribed to the game
3. hashedEmail: The player's anonymous identity (a hashed email address)
4. played_hours: The number of hours the player spent playing on the server
5. name: The player's name
6. gender: The player's gender
7. Age: The player's age

The data set contains 196 players and 7 variables.   
We won't be using the columns "hashedEmail" and "name" as they don't provide useful information for the prediction.

In [None]:
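# Load the tidyverse for reading and wrangling the data
library(tidyverse)

# Read the player data and preview it
players <- read_csv("players.csv")
players

# Number of observations (players) and variables
nrow(players)
ncol(players)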

In [None]:
players_summary <- players |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary

players_summary_based_on_age <- players |>
    group_by(Age) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary_based_on_age

players_summary_based_on_gender <- players |>
    group_by(gender) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary_based_on_gender

players_summary_based_on_experience <- players |>
    group_by(experience) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
players_summary_based_on_experience

One issue in the data is that the Age column contains a missing value (NA).
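
As a minimal sketch of how the missing ages could be counted and handled, the cell below uses a small made-up table rather than the real data. The names `toy_players`, `na_counts`, and `toy_complete` are hypothetical; dropping the incomplete rows is only one possible choice and has not been decided for this project.

```r
library(tidyverse)

# Made-up stand-in for players.csv (hypothetical values, two columns only)
toy_players <- tibble(
    Age = c(17, NA, 21, 19),
    played_hours = c(1.2, 0.0, 3.5, 0.4)
)

# Count the missing values in every column
na_counts <- toy_players |>
    summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# One option: keep only the rows where Age is known
toy_complete <- toy_players |>
    drop_na(Age)
nrow(toy_complete)
```

Running the same `summarize()` call on `players` would show how many values are missing in each of the real columns.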

In [None]:
age_plot <- ggplot(players_summary_based_on_age, aes(x = Age, y = mean_played_hours)) +
    geom_point() +
    labs(title = "Average Played Hours by Age",
         x = "Player's Age",
         y = "Average Played Hours")
age_plot

The graph shows no clear relationship between a player's age and the average number of hours played on the server: the trend is neither positive nor negative.

In [None]:
gender_plot <- ggplot(players_summary_based_on_gender, aes(x = gender, y = mean_played_hours, fill = gender)) +
    geom_bar(stat = "identity") +
    labs(title = "Average Played Hours by Gender",
         x = "Gender",
         y = "Average Played Hours")
gender_plot

The graph shows that players who identify as non-binary have the highest average number of hours played on the server, followed by female players. Two-Spirited players have the lowest average number of hours played on the server.

In [None]:
experience_plot <- ggplot(players_summary_based_on_experience, aes(x = experience, y = mean_played_hours, fill = experience)) +
    geom_bar(stat = "identity") +
    labs(title = "Average Played Hours by Experience Level",
         x = "Experience Level",
         y = "Average Played Hours")
experience_plot

The graph shows that players with a "Regular" experience level have the highest average number of hours played on the server. This suggests that regular players are more active on the server than amateur, beginner, pro, and veteran players.

**METHODS AND PLAN** 

Since we want to identify the "kinds" of players who would contribute a large amount of data, our response variable is the number of hours a player spends on the server, which is numerical. KNN regression is therefore a suitable method, especially since the relationship between hours played and the players' characteristics may not be linear.

To apply KNN regression, we need to assume that the distance between points reflects how similar the players are. KNN looks at the k most similar (nearest) players based on their characteristics (age, gender, experience) and averages their played hours to make a prediction. One limitation is that the model may not predict well beyond the range of values in the training data (e.g., a player whose age falls outside the range seen in training).
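
To make the neighbour-averaging idea concrete, here is a minimal base R sketch on made-up data that uses Age as the only predictor. The data frame `toy` and the function `knn_predict` are hypothetical and only for illustration. A full model would also need gender and experience encoded as numbers, and all predictors put on a comparable scale, before distances are meaningful.

```r
# Made-up ages and played hours for six hypothetical players
toy <- data.frame(
    Age = c(17, 18, 19, 21, 25, 40),
    played_hours = c(2.0, 0.5, 1.5, 3.0, 0.0, 1.0)
)

# Predict played hours for a new player by averaging the k nearest players
knn_predict <- function(train, new_age, k = 3) {
    distances <- abs(train$Age - new_age)      # distance on one numeric predictor
    nearest <- order(distances)[seq_len(k)]    # row indices of the k closest players
    mean(train$played_hours[nearest])
}

# A new player whose age is inside the training range
knn_predict(toy, new_age = 20, k = 3)

# Ages far outside the range share the same three neighbours,
# so the prediction stops changing
knn_predict(toy, new_age = 60, k = 3)
knn_predict(toy, new_age = 80, k = 3)
```

The last two calls return the same value, which illustrates the limitation above: beyond the range of the training data, KNN can only repeat the average of the outermost neighbours.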

Before applying the model, I will split the data into a training set (75%) and a test set (25%). The training set will be used to train the model, and I will set the test set aside to evaluate how well the model performs on unseen data at the end.
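
A 75/25 split could look roughly like the sketch below, assuming the tidymodels package is used (the report does not say which package will do the split). The data are made up, the seed is arbitrary, and the object names are hypothetical.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the random split is reproducible

# Made-up data with 40 hypothetical players
toy_players <- tibble(
    Age = rep(17:26, times = 4),
    played_hours = (0:39) / 10
)

# Put 75% of the rows in the training set and the rest in the test set
players_split <- initial_split(toy_players, prop = 0.75)
players_train <- training(players_split)
players_test <- testing(players_split)

nrow(players_train)
nrow(players_test)
```

`initial_split()` also accepts a `strata` argument, which might help keep the distribution of played hours similar in both sets.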

I will tune the KNN regression model by comparing different values of k using cross-validation on the training set. The model's performance will be evaluated with RMSE (root mean squared error), which measures the typical difference between the predicted and actual number of hours played. The value of k with the lowest RMSE will be used for the final model.
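
The tuning step might be sketched as follows, assuming tidymodels with the kknn engine. The training data are made up, and the choices of 5 folds, k from 1 to 10, and the object names are illustrative assumptions rather than decisions made in this plan.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the folds are reproducible

# Made-up training data for 60 hypothetical players
toy_train <- tibble(
    Age = rep(17:26, times = 6),
    experience = rep(c("Amateur", "Regular", "Pro"), times = 20),
    played_hours = (Age - 16) / 5 + rep(c(0, 1, 0.5), times = 20)
)

# Encode the categorical predictor as dummy variables, then standardize
knn_recipe <- recipe(played_hours ~ Age + experience, data = toy_train) |>
    step_dummy(all_nominal_predictors()) |>
    step_normalize(all_numeric_predictors())

# KNN regression with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# 5-fold cross-validation on the training data only
folds <- vfold_cv(toy_train, v = 5)

# Estimate the cross-validated RMSE for k = 1, ..., 10
knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = folds, grid = tibble(neighbors = 1:10)) |>
    collect_metrics() |>
    filter(.metric == "rmse")

# Pick the k with the lowest mean RMSE across folds
best_k <- knn_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k
```

The chosen k would then be used to refit the model on the full training set before computing the RMSE once on the test set.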

**Another data set: "sessions.csv"**  
We also have a second data set called "sessions.csv", which contains information about each player's play sessions, such as session length and frequency.

In [None]:
# Store the session data under its own name so it does not overwrite `players`
sessions <- read_csv("sessions.csv")
sessions
nrow(sessions)
ncol(sessions)