In [2]:
library(tidyverse)

players_url <- "https://raw.githubusercontent.com/aketineni/DSCI100_final_planning/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/aketineni/DSCI100_final_planning/refs/heads/main/sessions.csv"

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# DSCI 100 Final Project Planning Proposal - Arnav Ketineni #

## Data Description ##

The **players.csv** dataset contains 196 observations (one per player) and 7 variables. These variables are:
1. **experience (chr)**: The player's experience level: Pro, Veteran, Regular, Amateur, or Beginner.
    1. There are 14 Pros, 48 Veterans, 36 Regulars, 63 Amateurs, and 35 Beginners.
2. **subscribe (lgl)**: Indicates whether the player is subscribed.
    1. 144 of the players were subscribed and 52 were not.
3. **hashedEmail (chr)**: Contains the player's email address in hashed form.
4. **played_hours (dbl)**: Contains the total number of hours the player has played on the server.
    1. The mean playtime was 5.85 hrs, ranging from 0.00 to 223.10 hrs, with a median of 0.10 hrs.
5. **name (chr)**: Contains the player's name.
6. **Age (dbl)**: Contains the player's age in years.
    1. The mean age of the players was 21.14 years, ranging from 9 to 58 years, with a median of 19 years.
7. **gender (chr)**: Contains the player's gender, chosen from Male, Female, Non-binary, Agender, Two-Spirited, Other, and Prefer not to say.
    1. There were 124 male, 37 female, 15 non-binary, 2 agender, and 6 two-spirited players. 11 players chose not to indicate a gender and 1 player chose Other.

One issue in the data is the presence of NAs for two observations in the `Age` column. These observations will need to either have their age imputed or be removed before `Age` can be used as a predictor in a model. Similarly, the `gender` column includes the options "Prefer not to say" and "Other". These categories will cause issues for predictive questions involving the gender of players, since the observations do not fall into a specific category and cannot necessarily be imputed.
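
The two options for the missing `Age` values can be sketched as follows. This is a minimal illustration, not part of the analysis: `players_demo` is a hypothetical stand-in for the real data with made-up values, and median imputation is an assumption rather than a choice the proposal has made.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv; values are illustrative only
players_demo <- tibble(
  name = c("A", "B", "C", "D"),
  Age  = c(17, NA, 21, NA)
)

# Count the observations with a missing Age
n_missing <- players_demo |>
  summarize(n_missing = sum(is.na(Age))) |>
  pull(n_missing)

# Option 1: remove the observations with a missing Age
players_dropped <- players_demo |>
  drop_na(Age)

# Option 2 (assumed strategy): fill missing ages with the median observed age
players_imputed <- players_demo |>
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age))
```

Dropping rows is simpler but shrinks an already small dataset, while imputation keeps every player at the cost of inventing values for them.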

Regarding the `played_hours` variable, this number partially depends on how the player uses PLAICraft. Each player starts with 30 minutes of playtime and receives more for actions such as talking to people in-game and inviting friends to join the server, and passively for the time they spend away from the server. This means that `played_hours` is likely influenced by factors beyond the data given in the dataset, such as whether the player is comfortable talking to strangers or how many people they want to invite to play PLAICraft.

(1) Data Description:
Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics (report values to 2 decimal places), number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

In [30]:
players <- read_csv(players_url) |>
    mutate(experience = as_factor(experience),
          gender = as_factor(gender))

summary(players, digits = 4)
head(players)
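# Grouped summary kept for later use; the column is `Age` (capitalized) and contains NAs
# players_summary_grouped <- players |>
#     group_by(experience) |>
#     summarize(count = n(), avg_playtime = mean(played_hours), avg_age = mean(Age, na.rm = TRUE))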

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


    experience subscribe       hashedEmail         played_hours    
 Pro     :14   Mode :logical   Length:196         Min.   :  0.000  
 Veteran :48   FALSE:52        Class :character   1st Qu.:  0.000  
 Amateur :63   TRUE :144       Mode  :character   Median :  0.100  
 Regular :36                                      Mean   :  5.846  
 Beginner:35                                      3rd Qu.:  0.600  
                                                  Max.   :223.100  
                                                                   
     name                         gender         Age       
 Length:196         Male             :124   Min.   : 9.00  
 Class :character   Female           : 37   1st Qu.:17.00  
 Mode  :character   Non-binary       : 15   Median :19.00  
                    Prefer not to say: 11   Mean   :21.14  
                    Agender          :  2   3rd Qu.:22.75  
                    Two-Spirited     :  6   Max.   :58.00  
                    Other           

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<fct>,<lgl>,<chr>,<dbl>,<chr>,<fct>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [16]:
sessions <- read_csv(sessions_url) |>
    separate(col = start_time, into = c("start_date", "start_time"), sep = " ") |>
    separate(col = end_time, into = c("start_data", "start_time"), sep = " ") |>
    mutate(start_date = date(start_date),
           end_date = date(start_date))

head(sessions)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_date,start_data,start_time,original_start_time,original_end_time,end_date
<chr>,<date>,<chr>,<chr>,<dbl>,<dbl>,<date>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30-06-20,30/06/2024,18:24,1719770000000.0,1719770000000.0,30-06-20
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17-06-20,17/06/2024,23:46,1718670000000.0,1718670000000.0,17-06-20
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25-07-20,25/07/2024,17:57,1721930000000.0,1721930000000.0,25-07-20
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25-07-20,25/07/2024,03:58,1721880000000.0,1721880000000.0,25-07-20
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25-05-20,25/05/2024,16:12,1716650000000.0,1716650000000.0,25-05-20
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23-06-20,23/06/2024,17:10,1719160000000.0,1719160000000.0,23-06-20
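
In the output above, the end date and end time landed in columns named `start_data` and `start_time`, which replaced the original start time. In addition, `date()` read the day/month/year strings in the wrong order (giving values such as `30-06-20`), and `end_date` is a copy of `start_date`. The sketch below shows one way to avoid these problems, assuming every timestamp follows the `dd/mm/yyyy HH:MM` pattern seen above. `sessions_demo` is a hypothetical stand-in for the real file, and its emails and times are made up.

```r
library(tidyverse)
library(lubridate)  # already attached by tidyverse 2.0.0; loaded explicitly for older versions

# Hypothetical stand-in for sessions.csv; values are illustrative only
sessions_demo <- tibble(
  hashedEmail = c("abc", "def"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_tidy <- sessions_demo |>
  # Split each timestamp into its own date and time columns
  separate(col = start_time, into = c("start_date", "start_time"), sep = " ") |>
  separate(col = end_time, into = c("end_date", "end_time"), sep = " ") |>
  # dmy() parses day/month/year strings into Date values
  mutate(start_date = dmy(start_date),
         end_date = dmy(end_date))
```

Giving the end columns their own names keeps the start time intact, and `dmy()` makes the expected date order explicit instead of relying on a default guess.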


## Questions ##
(2) Questions:
Clearly state one broad question that you will address, and the specific question that you have formulated. Your question should involve one response variable of interest and one or more explanatory variables, and should be stated as a question. One common question format is: “Can [explanatory variable(s)] predict [response variable] in [dataset]?”, but you are free to format your question as you choose so long as it is clear. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Question 2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Question 3: We are interested in demand forecasting, namely, what time windows are most likely to have a large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability.

## Exploratory Data Analysis and Visualization ##
(3) Exploratory Data Analysis and Visualization
In this assignment, you will:

Demonstrate that the dataset can be loaded into R.
Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
Compute the mean value for each quantitative variable in the players.csv data set. Report the mean values in a table format.
Make a few exploratory visualizations of the data to help you understand it.
Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
Explain any insights you gain from these plots that are relevant to address your question
Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.
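
A sketch of how the mean of each quantitative variable could be reported as a table, assuming the column names `played_hours` and `Age` shown in the `read_csv()` output. `players_demo` is a hypothetical stand-in with illustrative values; `na.rm = TRUE` is needed because `Age` contains NAs.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv; values are illustrative only
players_demo <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  Age          = c(9, 17, NA, 21)
)

players_means <- players_demo |>
  # Mean of each quantitative column, ignoring missing values
  summarize(across(c(played_hours, Age), ~ mean(.x, na.rm = TRUE))) |>
  # Reshape to one row per variable so the result reads as a table
  pivot_longer(everything(), names_to = "variable", values_to = "mean") |>
  # Report values to 2 decimal places
  mutate(mean = round(mean, 2))

players_means
```

The same pipeline should work on the real `players` tibble, since it refers only to the two numeric columns by name.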

## Methods and Plan ##
(4) Methods and Plan
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

Why is this method appropriate?
Which assumptions are required, if any, to apply the method selected?
What are the potential limitations or weaknesses of the method selected?
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

## GitHub Repository ##
https://github.com/aketineni/DSCI100_final_planning