In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

In [None]:
write_csv(players, "data/players.csv")
players <- read_csv("data/players.csv")


In [None]:
players <- read_csv("data/players.csv")
players

In [None]:
sessions <- read_csv("data/sessions.csv")
sessions

1. Data Description: Players Dataset:
- Number of observations: 196
- Number of variables: 7
- 4 chr variables outlining experience, email, name, gender
- 1 lgl variable for subscription status
- 2 dbl variables outlining age and hours played
- The average age of players is 21.14 and the average play time (in hours) is 5.85

In [None]:
players |>
    summarise(players_mean_age = mean(Age, na.rm = TRUE), players_mean_hours = mean(played_hours, na.rm = TRUE))

1. Data Description: Sessions Dataset:

- Number of observations: 1535
- Number of variables: 5
- 3 chr variables outlining email, start time, end time
- 2 dbl variables outlining original start time and original end time
- The mean original start time is 1.72 x 10^12, and the mean original end time is 1.72 x 10^12
- I still need to work out how to convert the original start and end times into standard date-times.
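The size of the mean values (about 1.72 x 10^12) is consistent with Unix time in milliseconds, although that is an assumption and not something the dataset documents. Under that assumption, a minimal sketch of the conversion could look like this. The values below are made up for illustration and are not taken from sessions.csv.

```r
# Hypothetical example values, NOT taken from sessions.csv
original_start_time <- c(1.72e12, 1.72e12 + 60 * 60 * 1000)

# Assumption: the values are milliseconds since 1970-01-01 UTC,
# so divide by 1000 to get seconds before converting
start_datetime <- as.POSIXct(original_start_time / 1000,
                             origin = "1970-01-01", tz = "UTC")
start_datetime
```

If the converted values line up with the text start time and end time columns, the millisecond assumption is probably right; if they do not, the unit or time zone needs another look.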

In [None]:
sessions |>
    summarise(sessions_mean_start_time = mean(original_start_time, na.rm = TRUE), sessions_mean_end_time = mean(original_end_time, na.rm = TRUE))

2. Questions
- Broad Question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.
- Specific Question: Can age be used to predict hours played? This question is asked using the "players" dataset.
- I am going to wrangle the data down to the age and hours played variables and predict hours played using either linear or KNN regression. With linear regression, I would fit a line to the data and use it to predict hours played for each new observation. Other data such as email, gender, experience, subscription status, or name should not be needed here.

3. Exploratory Data Analysis and Visualization
- Loading the dataset: See above

In [None]:
# Finding mean values for Players dataset

players |>
    summarise(players_mean_age = mean(Age, na.rm = TRUE), players_mean_hours = mean(played_hours, na.rm = TRUE))

The players.csv dataset is tidy. It is tidy because each row is a single observation (a different player), each column is a single variable, and each cell holds a single value, without any overlapping measurements.

In [None]:
# Plotting the relationship between age and hours played, color coded by gender, filtered to players with more than 0 and at most 12 hours of playing time
options(repr.plot.height = 8, repr.plot.width = 10)
players_filtered <- players |>
    filter(played_hours <= 12, played_hours > 0)
age_hours_plot <- players_filtered |>
    ggplot(aes(x = Age, y = played_hours, color = gender)) +
    geom_point() +
    labs(x = "Age in Years", y = "Hours Played", color = "Player Gender", title = "Age vs Hours Played, Color Coded by Gender, Filtered to Up to 12 Hours of Play Time")
age_hours_plot

This graph helps me get a rough idea of the time spent playing vs age, and helps me map out the tendencies of playing time between different age groups.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)
# Comparing the Number of Players at each Experience Level
exp_plot <- players |>
    ggplot(aes(x = experience, fill = experience)) +
    geom_bar() +
    labs(x = "Player Experience", title = "Comparing the Number of Players at each Experience Level")
exp_plot

This graph is just for me to visualize the number of people at each experience level, as experience could influence play time or willingness to play.

4. Methods and Plan
- The one method that I am going to use to answer my question of whether age can be used to predict the hours played is linear regression.
- My main reason is that I have some sparse, isolated data points, so KNN regression would be very inaccurate, especially for predictions away from the main clump of data points. However, I can compare RMSE and RMSPE to truly figure out which one is better.
- The good thing about this method is that, apart from linearity, there aren't really any huge assumptions that I need to make.
- One potential weakness is that linear regression assumes a straight-line relationship between age and hours played. As I explored different visualizations of my data, I failed to find any linearity, so the fit may be weak.
- I will only use rows of data with both an age and an hours played value. Since na.rm = TRUE only works inside summary functions such as mean, I will filter out the rows with NA's themselves.
- For my data, I am going to isolate the Age and played_hours variables. I will split it with 75% going to training and 25% to testing. Overall, I feel like these steps will set me up for success.
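The plan above can be sketched with tidymodels roughly as follows. This is only an illustration under stated assumptions: `toy_players` is synthetic data invented for the example (not the real players.csv), the seed is arbitrary, and every object name is hypothetical.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the players data (NOT the real dataset)
toy_players <- tibble(
    Age = sample(c(17:40, NA), size = 200, replace = TRUE),
    played_hours = round(rexp(200, rate = 1 / 6), 1)
)

# Keep only the two variables of interest and drop rows with a missing value
toy_clean <- toy_players |>
    select(Age, played_hours) |>
    drop_na()

# 75% of rows go to training, 25% to testing
toy_split <- initial_split(toy_clean, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Linear regression specification and recipe
lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

lm_recipe <- recipe(played_hours ~ Age, data = toy_train)

# Fit on the training data only
lm_fit <- workflow() |>
    add_recipe(lm_recipe) |>
    add_model(lm_spec) |>
    fit(data = toy_train)

# RMSPE: the rmse metric evaluated on the held-out testing data
lm_rmspe <- lm_fit |>
    predict(toy_test) |>
    bind_cols(toy_test) |>
    metrics(truth = played_hours, estimate = .pred) |>
    filter(.metric == "rmse") |>
    pull(.estimate)
lm_rmspe
```

A KNN model could be compared by swapping `linear_reg()` for `nearest_neighbor()` with the "kknn" engine and computing the same test-set metric, which is one way to check whether linear regression really is the better choice here.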