# DSCI100 Project: Planning Stage (Individual)

### Predicting Usage of a Video Game Research Server

- Name: Eva Yarantseva, Section: 109 
- Student Number: 32172173

In [23]:
library(tidyverse)
library(tidymodels)
library(dplyr)

## (1) Data Description 

Provide a full descriptive summary of the dataset, including information such as the ~number of observations~, ~**summary statistics (report values to 2 decimal places)**~, ~number of variables~, ~name and type of variables, what the variables mean~, ~any issues you see in the data, any other potential issues related to things you cannot directly see,~ how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

#### Dataset (a): players.csv

This dataset contains 196 observations of 7 variables: 
- **experience**: a character variable that describes the player's skill level. Stored as a factor, it groups players into Beginner, Amateur, Regular, Pro, or Veteran.
- **subscribe**: a TRUE/FALSE logical variable that describes whether or not a player is subscribed to a game-related newspaper.
- **hashedEmail**: a character variable that stores the user's hashed email as a string of letters and numbers (possibly to protect privacy).
- **played_hours**: a numerical variable that gives the number of hours played on the server, to one decimal place.
- **name**: character variable that stores the first name of the player.
- **gender**: a character variable that records the player's gender: Male, Female, Non-binary, Two-Spirited, Agender, Other, or "Prefer not to say" for players who did not share their gender.
- **Age**: integer variable that stores the age of the player.

Summary statistics: 
- A majority of the players are male (124 players).
- The most common experience level is Amateur (63 players).
- More players are subscribed to a gaming-related newspaper than not (144 subscribed, 52 not).
- The mean age of players is 21.14 years old, with the median being 19.00 years old.
- The mean of played_hours is 5.85 hours, but the distribution is heavily right-skewed: the median is only 0.10 hours, the third quartile is 0.60 hours, and the maximum is 223.10 hours. Many players have 0 hours of play time.
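
The mean and median values above can be reproduced in one step. The following is a minimal sketch, assuming the column names shown in the `str()` output; `players_toy` and its values are made up for illustration and are not the real data.

```r
library(dplyr)

# Toy stand-in for players.csv (hypothetical values, not the real data).
players_toy <- data.frame(
  played_hours = c(0, 0, 0.1, 0.6, 30.3),
  Age = c(17, 19, 21, 22, 47)
)

# Mean and median of each numeric column, rounded to 2 decimal places.
players_summary <- players_toy |>
  summarise(across(
    c(played_hours, Age),
    list(
      mean = ~ round(mean(.x, na.rm = TRUE), 2),
      median = ~ round(median(.x, na.rm = TRUE), 2)
    )
  ))

players_summary
```

`across()` names the result columns `played_hours_mean`, `played_hours_median`, `Age_mean`, and `Age_median`. Comparing each mean with its median is a quick way to spot the skew in played_hours.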

Problems with the dataset: 
- Type of variable: the character variables (e.g. gender) have to be converted to factors before the data can be wrangled.
- Unnecessary variables: the names and hashed emails have no particular use in answering any of the research questions.
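
Both problems can be handled in a single wrangling step. The sketch below is one possible approach, not necessarily the one used later in the project; `players_toy` is a hypothetical stand-in with placeholder hashes. If players.csv is later linked to sessions.csv, hashedEmail would need to stay, since it is the column the two files share.

```r
library(dplyr)

# Toy stand-in for players.csv (hypothetical rows with placeholder hashes).
players_toy <- data.frame(
  experience = c("Pro", "Veteran", "Amateur", "Amateur"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE),
  hashedEmail = c("aaa", "bbb", "ccc", "ddd"),
  played_hours = c(30.3, 3.8, 0, 0.7),
  name = c("Morgan", "Christian", "Blake", "Flora"),
  gender = c("Male", "Male", "Male", "Female"),
  Age = c(9L, 17L, 17L, 21L)
)

players_clean <- players_toy |>
  # Convert the categorical character columns to factors.
  mutate(across(c(experience, gender), as.factor)) |>
  # Drop the identifying columns described as unnecessary above.
  select(-name, -hashedEmail)

str(players_clean)
```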

In [71]:
players <- read.csv("players.csv")
str(players)

'data.frame':	196 obs. of  7 variables:
 $ experience  : chr  "Pro" "Veteran" "Veteran" "Amateur" ...
 $ subscribe   : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
 $ hashedEmail : chr  "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d" "f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9" "b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28" "23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5" ...
 $ played_hours: num  30.3 3.8 0 0.7 0.1 0 0 0 0.1 0 ...
 $ name        : chr  "Morgan" "Christian" "Blake" "Flora" ...
 $ gender      : chr  "Male" "Male" "Male" "Female" ...
 $ Age         : int  9 17 17 21 21 17 19 21 47 22 ...


In [72]:
summary(players)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

In [76]:
players <- players |>
    mutate(gender = as.factor(gender))
players |>
    count(gender)
players |>
    count(experience)

| gender | n |
|---|---|
| `<fct>` | `<int>` |
| Agender | 2 |
| Female | 37 |
| Male | 124 |
| Non-binary | 15 |
| Other | 1 |
| Prefer not to say | 11 |
| Two-Spirited | 6 |


| experience | n |
|---|---|
| `<chr>` | `<int>` |
| Amateur | 63 |
| Beginner | 35 |
| Pro | 14 |
| Regular | 36 |
| Veteran | 48 |


#### Dataset (b): sessions.csv

This dataset contains 1535 observations of 5 variables:
- **hashedEmail**: a character variable that stores the user's hashed email as a string of letters and numbers (possibly to protect privacy).
- **start_time**: a character variable that gives the start date and 24-hour time of the playing session, in DD/MM/YYYY HH:MM format.
- **end_time**: a character variable that gives the end date and 24-hour time of the playing session, in DD/MM/YYYY HH:MM format.
- **original_start_time**: a numerical variable that stores the start of the playing session as a UNIX timestamp (in milliseconds).
- **original_end_time**: a numerical variable that stores the end of the playing session as a UNIX timestamp (in milliseconds).

Summary statistics: 
- The median start time in UNIX timestamp is 1.719e+12.
- The median end time in UNIX timestamp is 1.719e+12.

Problems with the dataset: 
- The start_time and end_time variables each store the date and the 24-hour time together in one character value instead of as separate date and time variables.
- The UNIX timestamps are in milliseconds instead of seconds.
- Missing values: original_end_time has 2 NA values.
- Unnecessary variables: hashed emails have no particular use in answering any of the research questions.
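
The first two problems can be addressed by parsing the strings into date-times and rescaling the timestamps. The following is a minimal sketch using lubridate; `sessions_toy` is a hypothetical stand-in with made-up timestamps, and the UTC time zone is an assumption because the dataset does not state one.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sessions.csv (timestamps are made up for illustration).
sessions_toy <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

sessions_parsed <- sessions_toy |>
  mutate(
    # Parse day/month/year hour:minute strings (time zone assumed to be UTC).
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt = dmy_hm(end_time, tz = "UTC"),
    # Split the date and the hour of day into separate columns.
    start_date = as_date(start_dt),
    start_hour = hour(start_dt),
    # Session length in minutes.
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Convert milliseconds to seconds before building a date-time.
    original_start_dt = as_datetime(original_start_time / 1000, tz = "UTC")
  )

sessions_parsed
```

Rows with a missing end time would give NA for `session_minutes`, so they would need to be dropped or handled before summarizing session lengths.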

In [12]:
sessions <- read.csv("sessions.csv")
str(sessions)

'data.frame':	1535 obs. of  5 variables:
 $ hashedEmail        : chr  "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf" "36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686" "f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc" "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf" ...
 $ start_time         : chr  "30/06/2024 18:12" "17/06/2024 23:33" "25/07/2024 17:34" "25/07/2024 03:22" ...
 $ end_time           : chr  "30/06/2024 18:24" "17/06/2024 23:46" "25/07/2024 17:57" "25/07/2024 03:58" ...
 $ original_start_time: num  1.72e+12 1.72e+12 1.72e+12 1.72e+12 1.72e+12 ...
 $ original_end_time  : num  1.72e+12 1.72e+12 1.72e+12 1.72e+12 1.72e+12 ...


In [21]:
summary(sessions)

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          