In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")
players
sessions

**Question used: Question 2: We would like to know which kinds of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.**

**1. Data description**

We are given two datasets: players.csv, with one row per player, and sessions.csv, with one row per play session. Together they are used to predict which kinds of players are most likely to contribute a large amount of data.

**players.csv has 196 observations, 7 variables:**

- *experience: Pro, Veteran, Amateur, Regular, Beginner* [chr]
------ player's experience level
- *subscribe: TRUE, FALSE* [lgl]
------ whether a player is subscribed
- *hashedEmail* [chr]
------ who is playing, player ID 
- *played_hours* [dbl]
------ total number of hours played (decimal values)
- *name* [chr]
------ player's name
- *gender: Male, Female, Non-Binary, Prefer not to say, Agender, Two-Spirited, Other* [chr]
------ player's gender 
- *Age* [dbl]
------ player's age, **2 NAs**

*played_hours summary statistics:* 
- mean: 5.85

*age summary statistics*
- mean: 21.14
  
players.csv is tidy: each row is a single player, each column is a single variable, and each cell holds a single value.

**sessions.csv has 1535 observations, 5 variables:**
- *hashedEmail* [chr]
------ who is playing, player ID
- *start_time (DD/MM/YYYY HH:MM)* [chr]
------ time at which the player starts playing the game
- *end_time (DD/MM/YYYY HH:MM)* [chr]
------ time at which the player stops playing the game
- *original_start_time (milliseconds)* [dbl]
------ raw timestamp at which the player starts playing the game
- *original_end_time (milliseconds)* [dbl]
------ raw timestamp at which the player stops playing the game, **2 NAs**

*original_start_time (milliseconds) summary statistics*
- mean: 1.719e+12


*original_end_time (milliseconds) summary statistics*
- mean: 1.719e+12

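sessions.csv is not fully tidy: start_time and end_time each store a date and a time together in one text column, so they must be parsed before they can be used in calculations. original_start_time and original_end_time hold the same moments as raw millisecond timestamps, which are hard for a human to interpret and duplicate the information in the text columns.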
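
As a rough illustration of how the text timestamps could be made usable, the sketch below parses day/month/year hour:minute strings into date-time values and derives a session length. The tibble `toy_sessions`, its player IDs, and its times are made up for the example and are not taken from sessions.csv; `session_minutes` is a hypothetical column name.

```r
library(tidyverse)
library(lubridate)

# Illustrative rows only; the real data would come from read_csv("sessions.csv")
toy_sessions <- tibble(
  hashedEmail = c("player_a", "player_b"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_parsed <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),  # chr -> date-time (UTC unless tz is given)
    end_time   = dmy_hm(end_time),
    # Session length in minutes, computed from the parsed columns
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_parsed
```

With parsed columns like these, the raw millisecond columns would no longer be needed for computing durations.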


**2. Questions:**

***Broad question:*** 
We would like to know which kinds of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. 

***Specific question:***
Can player information such as experience level, gender, and subscriber status predict the total number of hours played in the game?


I chose K-nearest neighbors (KNN) regression, fit with the kknn engine, because it does not assume a linear relationship between played_hours and the predictors experience, gender, and subscriber status. The goal is to test whether these variables can predict how long a player spends in the game, since players who play longer will contribute more data.

For wrangling, I will convert the experience, gender, and subscribe variables to factors, drop rows with missing values (na.rm = TRUE is an argument to summary functions such as mean(), not a function, so rows are removed with drop_na() instead), split the data 80/20 into training and testing sets, and use cross-validation on the training set to tune k.
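
The wrangling steps above can be sketched as follows. This is a minimal example under stated assumptions: `toy_players` is synthetic stand-in data (not players.csv), the seed is arbitrary, and `players_clean`, `players_split`, `players_train`, and `players_test` are illustrative names.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the random split is reproducible

# Synthetic stand-in for players.csv; all values are made up for illustration
toy_players <- tibble(
  experience   = sample(c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
                        100, replace = TRUE),
  gender       = sample(c("Male", "Female", "Non-binary"), 100, replace = TRUE),
  subscribe    = sample(c(TRUE, FALSE), 100, replace = TRUE),
  played_hours = c(NA, round(rexp(99, rate = 0.2), 1))  # one missing value
)

players_clean <- toy_players |>
  mutate(across(c(experience, gender, subscribe), as_factor)) |>
  drop_na(played_hours)  # removes rows; na.rm = TRUE only applies inside mean() etc.

# 80/20 split into training and testing sets
players_split <- initial_split(players_clean, prop = 0.8)
players_train <- training(players_split)
players_test  <- testing(players_split)

nrow(players_train)
nrow(players_test)
```

Splitting before any tuning keeps the testing set untouched until the final evaluation.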

**3. Exploratory Data Analysis and Visualization**

In [None]:
# Convert categorical columns to factors; the CSV column `Age` is copied to lowercase `age`
players <- players |>
    mutate(experience = as_factor(experience),
           gender = as_factor(gender),
           subscribe = as_factor(subscribe),
           age = as.numeric(Age))

# Mean hours played and mean age, ignoring missing values
players_mean_hrs_age <- players |>
    summarize(mean_hours = mean(played_hours, na.rm = TRUE),
              mean_age = mean(age, na.rm = TRUE))

# Make sure the raw timestamps are numeric before averaging them
sessions <- sessions |>
    mutate(original_start_time = as.numeric(original_start_time),
           original_end_time = as.numeric(original_end_time))

sessions_mean_start_end_times <- sessions |>
    summarize(mean_original_start_time = mean(original_start_time, na.rm = TRUE),
              mean_original_end_time = mean(original_end_time, na.rm = TRUE))


players_mean_hrs_age
sessions_mean_start_end_times

In [None]:

# Mean hours played for each experience level
avg_exp <- players |>
    group_by(experience) |>
    summarize(mean_hours = mean(played_hours, na.rm = TRUE))

options(repr.plot.height = 6, repr.plot.width = 7)
hrs_vs_exp <- ggplot(avg_exp, aes(x = experience, y = mean_hours)) +
    geom_col(fill = "steelblue") +
    labs(title = "Average Hours Played by Experience Level",
         x = "Experience level",
         y = "Average Hours Played (hrs)")

hrs_vs_exp


# Mean hours played for subscribers and non-subscribers
avg_sub <- players |>
    group_by(subscribe) |>
    summarize(mean_hours = mean(played_hours, na.rm = TRUE))

hrs_vs_sub <- ggplot(avg_sub, aes(x = subscribe, y = mean_hours)) +
    geom_col(fill = "steelblue") +
    labs(title = "Average Hours of Minecraft Played by Subscribers and Non-Subscribers",
         x = "Subscriber (where TRUE = Subscribed, FALSE = Not Subscribed)",
         y = "Average Hours Played (hrs)")
hrs_vs_sub


# Mean hours played for each gender
avg_gender <- players |>
    group_by(gender) |>
    summarize(mean_hours = mean(played_hours, na.rm = TRUE))

hrs_vs_gender <- ggplot(avg_gender, aes(x = gender, y = mean_hours)) +
    geom_col(fill = "steelblue") +
    labs(title = "Average Hours Played For Each Gender",
         x = "Gender Identified",
         y = "Average Hours Played (hrs)")
hrs_vs_gender



Experience level: Players with Regular experience have the highest average hours played.

Subscribed: Subscribed players have much higher average hours played than non-subscribers.

Gender: Non-binary players have the highest average hours played.
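
Because group means can be driven by a few very large values, it may help to read each average next to its group size and median. The sketch below shows the idea on made-up numbers; `toy_players` and its values are illustrative and do not come from players.csv.

```r
library(tidyverse)

# Made-up data: one extreme value in the "Regular" group
toy_players <- tibble(
  experience   = c("Regular", "Regular", "Regular", "Pro", "Pro",
                   "Amateur", "Amateur", "Amateur"),
  played_hours = c(0.1, 0.5, 150.0, 1.0, 3.0, 0.0, 0.2, 0.4)
)

# Group size, mean, and median side by side for each experience level
exp_summary <- toy_players |>
  group_by(experience) |>
  summarize(
    n            = n(),
    mean_hours   = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )

exp_summary
```

In this toy example a single 150-hour player lifts the Regular mean far above its median, which is the kind of pattern worth checking for before drawing conclusions from the bar charts.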

**4. Methods and Plan**

I will use KNN regression because it does not assume a linear relationship between the categorical predictors and the numeric response, played_hours. Its assumptions include independent observations and similar distributions in the training and testing sets. A limitation is that outliers in played_hours made it necessary to plot average hours rather than raw values, and averages are themselves sensitive to those outliers. Before training, I will convert the categorical variables to factors and split the data into 80% training and 20% testing. I will then tune k with cross-validation on the training set and use RMSE on the testing set to assess prediction error.
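
A minimal sketch of this plan with tidymodels is given below. It rests on several assumptions that are not part of the analysis above: the data are synthetic (`toy_players`), the seed, the 5 folds, and the grid of candidate k values are arbitrary choices, and the categorical predictors are turned into dummy variables with `step_dummy()` so that distances between players can be computed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed for reproducibility

# Synthetic stand-in for the cleaned players data; values are made up
toy_players <- tibble(
  experience   = factor(sample(c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
                               200, replace = TRUE)),
  gender       = factor(sample(c("Male", "Female", "Non-binary"), 200, replace = TRUE)),
  subscribe    = factor(sample(c("TRUE", "FALSE"), 200, replace = TRUE)),
  played_hours = rexp(200, rate = 0.2)
)

players_split <- initial_split(toy_players, prop = 0.8)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Dummy-encode the factors so KNN can measure distance between players
knn_recipe <- recipe(played_hours ~ experience + gender + subscribe,
                     data = players_train) |>
  step_dummy(all_nominal_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_wf <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Cross-validate on the training set only to choose k
knn_tuned <- tune_grid(
  knn_wf,
  resamples = vfold_cv(players_train, v = 5),
  grid = tibble(neighbors = seq(1, 25, by = 4))
)
best_k <- select_best(knn_tuned, metric = "rmse")

# Refit with the chosen k, then measure RMSE once on the held-out test set
final_fit <- knn_wf |>
  finalize_workflow(best_k) |>
  fit(data = players_train)

test_rmse <- predict(final_fit, players_test) |>
  bind_cols(players_test) |>
  rmse(truth = played_hours, estimate = .pred)

best_k
test_rmse
```

Because the toy predictors are random, the RMSE here says nothing about the real data; the point is only the order of operations: split, tune on training folds, then evaluate once on the testing set.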