In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")
players
sessions

**Question used: Question 2: We would like to know which kinds of players are most likely to contribute a large amount of data so that we can target 
those players in our recruiting efforts.**

1. Data description

We are given two datasets. players.csv describes each player (experience, subscription status, hashed email, number of hours played, name, gender, and age), and sessions.csv records each play session (hashed email, start time, and end time). These data are used to predict which kinds of players are most likely to contribute a large amount of data, so that those players can be targeted.

players.csv has 196 observations and 7 variables:
- ***experience: Pro, Veteran, Amateur, Regular, Beginner*** [chr]
------ the player's experience level
- ***subscribe: TRUE, FALSE*** [lgl]
------ whether the player is subscribed
- ***hashedEmail*** [chr]
------ who is playing, player ID
- ***played_hours*** [dbl]
------ number of hours played
- ***name*** [chr]
------ player's name
- ***gender: Male, Female*** [chr]
------ player's gender 
- ***age*** [int]
------ player's age, **has 2 NAs**

***played_hours*** summary statistics: 
- mean: 5.85

***age*** summary statistics
- mean: 21.14
  
players.csv is tidy: each row is one player, each column is one variable, and each cell holds a single value. The hashed email is hard to read, but it serves as an identifier, so that is acceptable.
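
The column types, missing-value counts, and means listed above can be checked with a short sketch like the one below. It runs on a small made-up tibble (`toy_players` is hypothetical, and its values are not taken from players.csv) so that it is self-contained; the `Age` column name follows the wrangling code in this notebook.

```r
library(tidyverse)

# Hypothetical rows in the same layout as players.csv (values are made up)
toy_players <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Beginner"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  gender = c("Male", "Male", "Female", "Female"),
  Age = c(9, 17, NA, 21)
)

# Column names and types, one line per variable
glimpse(toy_players)

# Number of missing values in every column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Mean of each numeric column, ignoring missing values
numeric_means <- toy_players |>
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
numeric_means
```

Running the same three steps on the real `players` and `sessions` tibbles should reproduce the types, NA counts, and means reported in this section.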

sessions.csv has 1535 observations and 5 variables:
- ***hashedEmail*** [chr]
------ who is playing, player ID
- ***start_time (DD/MM/YYYY Hours: Minutes)*** [chr]
------ time at which the player starts playing the game 
- ***end_time (DD/MM/YYYY Hours: Minutes)*** [chr]
------ time at which the player stops playing the game 
- ***original_start_time (milliseconds)*** [dbl]
------ raw timestamp at which the player starts playing the game
- ***original_end_time (milliseconds)*** [dbl]
------ raw timestamp at which the player stops playing the game, **there are 2 NAs**

***original_start_time (milliseconds)*** summary statistics 
- mean: 1.719e+12


***original_end_time (milliseconds)*** summary statistics 
- mean: 1.719e+12

sessions.csv is not tidy in a practical sense: start_time and end_time are stored as character strings that each combine a date and a time, and the raw millisecond timestamps are hard for a human to interpret. Storing both versions of each time is also redundant.
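
One way to make the session times usable is to parse the character columns into date-times and compute each session's length. The sketch below assumes the `DD/MM/YYYY Hours:Minutes` layout described above and uses a made-up tibble (`toy_sessions` and its values are hypothetical, not rows from sessions.csv).

```r
library(tidyverse)
library(lubridate)

# Hypothetical rows in the same layout as sessions.csv (values are made up)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

# Parse day/month/year hour:minute strings and compute session length in minutes
toy_sessions <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# Number of sessions and total minutes played per player
session_totals <- toy_sessions |>
  group_by(hashedEmail) |>
  summarize(n_sessions = n(), total_minutes = sum(session_minutes))
session_totals
```

A per-player total like `total_minutes` could later be joined to the players table by `hashedEmail` if session-level information is needed.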


2. Questions:

***Broad question:*** 
We would like to know which kinds of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. 

***Specific question:***
Can player information such as experience level, gender, and subscriber status predict the total number of hours played in the game?


I believe that a K-nearest neighbors (KNN) regression model, fit with the kknn engine, is appropriate because it can capture non-linear relationships between played_hours and the categorical variables experience, gender, and subscriber status. The goal is to test whether these explanatory variables can predict how long a player spends in the game, since players who play longer will contribute more data. One limitation of KNN is that distance metrics are less meaningful for categorical variables, which makes the predictions less reliable.

For wrangling, the experience, gender, and subscribe variables should be converted to factors so that R treats them as categorical. Missing values also need to be handled, either by passing na.rm = TRUE to summary functions such as mean() or by dropping the affected rows. I will then split the data into training and testing sets (around 80/20 or 70/30) and use cross-validation on the training set to tune k.
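
The plan above (factor predictors, a train/test split, and cross-validation to tune k) could be sketched with tidymodels roughly as follows. This is an illustration under stated assumptions, not the final analysis: `toy_players` is simulated data, and the seed, the 80/20 split, the 5 folds, the grid of k values, and the choice to dummy-encode and scale the predictors are all assumptions made for the example.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # assumed seed, only so the sketch is reproducible

# Simulated stand-in for players.csv (values are made up for illustration)
toy_players <- tibble(
  experience = factor(sample(c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
                             120, replace = TRUE)),
  gender = factor(sample(c("Male", "Female"), 120, replace = TRUE)),
  subscribe = factor(sample(c(TRUE, FALSE), 120, replace = TRUE)),
  played_hours = round(rexp(120, rate = 1 / 6), 1)
)

# 80/20 split into training and testing sets
players_split <- initial_split(toy_players, prop = 0.8, strata = played_hours)
players_train <- training(players_split)
players_test <- testing(players_split)

# Turn the factors into 0/1 dummy columns so that distances can be computed,
# drop any constant columns, then put all predictors on the same scale
knn_recipe <- recipe(played_hours ~ experience + gender + subscribe,
                     data = players_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())

# KNN regression with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation on the training set only
players_vfold <- vfold_cv(players_train, v = 5)

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold,
            grid = tibble(neighbors = seq(1, 21, by = 4))) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# k with the lowest cross-validated RMSE
best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

The testing set is left untouched here; it would only be used once, after refitting the model with `best_k`, to estimate the prediction error on unseen players.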

3. Exploratory Data Analysis and Visualization

In [None]:
# Convert the categorical columns to factors; the raw Age column becomes numeric `age`
players <- players |>
    mutate(
        experience = as_factor(experience),
        gender = as_factor(gender),
        subscribe = as_factor(subscribe),
        age = as.numeric(Age)
    )

# Mean hours played and mean age, ignoring missing values
players_mean_hrs_age <- players |>
    summarize(mean_hours = mean(played_hours, na.rm = TRUE), mean_age = mean(age, na.rm = TRUE))

# Make sure the raw timestamps are numeric before averaging
sessions <- sessions |>
    mutate(
        original_start_time = as.numeric(original_start_time),
        original_end_time = as.numeric(original_end_time)
    )

# Mean raw start and end timestamps, ignoring missing values
sessions_mean_start_end_times <- sessions |>
    summarize(
        mean_original_start_time = mean(original_start_time, na.rm = TRUE),
        mean_original_end_time = mean(original_end_time, na.rm = TRUE)
    )


players_mean_hrs_age
sessions_mean_start_end_times

In [None]:
options(repr.plot.height = 6, repr.plot.width = 7)
# Bars stack every player's hours, so each bar shows the total hours per experience level
hrs_vs_exp <- ggplot(players, aes(x = experience, y = played_hours)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    labs(title = "Hours Played by Experience Level",
         x = "Experience level",
         y = "Hours Played (hrs)")

hrs_vs_exp

options(repr.plot.height = 6, repr.plot.width = 7)
hrs_vs_sub <- ggplot(players, aes(x = subscribe, y = played_hours)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    labs(title = "Hours Played by Subscription Status",
         x = "Subscriber (where TRUE = Subscribed, FALSE = Not Subscribed)",
         y = "Hours Played (hrs)")
hrs_vs_sub

options(repr.plot.height = 6, repr.plot.width = 7)
hrs_vs_gender <- ggplot(players, aes(x = gender, y = played_hours)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    labs(title = "Hours Played for Each Gender",
         x = "Gender Identified",
         y = "Hours Played (hrs)")
hrs_vs_gender
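
Because `geom_bar(stat = "identity")` stacks one segment per player, the bars in these plots show total hours per group, so a group with more players can look larger even if its typical player plays less. A possible complement is to plot the mean hours per group. The sketch below uses a made-up tibble (`toy_players` and its values are hypothetical) to show the idea.

```r
library(tidyverse)

# Hypothetical players (values are made up for illustration)
toy_players <- tibble(
  experience = factor(c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Veteran")),
  played_hours = c(2.0, 4.0, 1.0, 0.0, 5.0, 3.0)
)

# Group size, total hours, and mean hours for each experience level
hours_by_exp <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players = n(),
    total_hours = sum(played_hours),
    mean_hours = mean(played_hours)
  )
hours_by_exp

# One bar per group, with height equal to the mean hours played
mean_hrs_vs_exp <- ggplot(hours_by_exp, aes(x = experience, y = mean_hours)) +
  geom_col(fill = "steelblue") +
  labs(title = "Mean Hours Played by Experience Level",
       x = "Experience level",
       y = "Mean Hours Played (hrs)")
mean_hrs_vs_exp
```

In this toy example the Amateur and Pro groups have the same total (6 hours) but different means (2 and 3 hours), which is the kind of difference a total-hours bar chart hides.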