# Data Science Project: Planning Stage (Individual Portion)

### Data Description:

A UBC research group is collecting data on how people play video games. Below is a summary of the datasets:

##### Sessions Data: sessions.csv

Observations (rows): 1535  
Variables (columns): 5

##### Variable Information:
|Variable Name| Type    | Description|
|------------|----------|--------|
|hashedEmail |Character |Player's unique hashed anonymous email (same as in players.csv)|
|start_time  |Character |Game session start for player (human-readable)|
|end_time    |Character |Game session end for player (human-readable)|
|original_start_time|Double |Original session start time (server time stamp)| 
|original_end_time|Double |Original session end time (server time stamp)|

##### Additional Details:
- No session duration is included (it must be computed).
- The dates and times in start_time and end_time are stored as character strings, so they must be converted to date-time objects before the session length can be calculated.
- Inconsistencies may arise from time zone variations.
- Sessions may include missing or inconsistent data.
- Potential errors (mismatched or incorrect values) in email inputs.
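
The conversion described above can be sketched as follows. This is a minimal illustration on two made-up rows, assuming the start_time and end_time strings use a day/month/year hour:minute layout (the format `dmy_hm()` parses) and that all times can be read as UTC. The tibble `example_sessions` and its values are hypothetical and are not taken from sessions.csv.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows for illustration only (not real data from sessions.csv)
example_sessions <- tibble(
  hashedEmail = c("abc123", "abc123"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse the character strings into date-times, then take the difference in minutes
example_durations <- example_sessions |>
  mutate(
    start = dmy_hm(start_time, tz = "UTC"),
    end = dmy_hm(end_time, tz = "UTC"),
    duration_mins = as.numeric(difftime(end, start, units = "mins"))
  )

example_durations$duration_mins
```

Reading every timestamp as UTC is a simplifying assumption; if the sessions were recorded in different time zones, the durations within a session are still valid as long as both ends of the session share the same zone.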

  
##### Player Data: players.csv

Observations (rows): 196  
Variables (columns): 7

##### Variable Information:
|Variable Name| Type    | Description|
|------------|----------|--------|
|experience  |Character |Player's self-reported experience level|
|subscribe   |Logical   |Whether the player subscribed to a game-related newsletter (True/False)|
|hashedEmail |Character |Player's unique hashed anonymous email|
|played_hours|Double    |Player's total hours in the game|
|name        |Character |Player's in-game name|
|gender      |Character |Player's self-reported gender|
|age         |Double    |Player's age in years (no decimals)|

##### Additional Details:
- Experience level is self-reported and may be biased, resulting in an inaccurate representation.
- Sampling bias may affect results (the players may not represent the general population).
- Other self-reported variables (such as gender) may also be biased.
- Missing values may be present, affecting the analysis.
- Age and played_hours may have outliers that skew the results.
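
The missing-value and outlier concerns above can be checked with a few lines of R. The sketch below is only an illustration: `example_players` and its values are hypothetical (not rows from players.csv), and the 1.5 × IQR fence is a common rule of thumb chosen here as an assumption, not a method prescribed by the project.

```r
library(dplyr)

# Hypothetical player rows for illustration only (not real data from players.csv)
example_players <- tibble(
  played_hours = c(0.5, 1.2, 0.0, 2.1, 150.0, 0.8),
  age = c(17, 21, NA, 19, 22, 20)
)

# Count missing values in each column
na_counts <- colSums(is.na(example_players))

# Flag played_hours values more than 1.5 * IQR above the third quartile
quartiles <- quantile(example_players$played_hours, c(0.25, 0.75))
upper_fence <- quartiles[[2]] + 1.5 * (quartiles[[2]] - quartiles[[1]])
flagged <- example_players |>
  filter(played_hours > upper_fence)

na_counts
flagged
```

Flagged rows are candidates for a closer look rather than automatic removal, since a very high played_hours value may be a genuine heavy player.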


### Questions:  
##### Broad question:  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  

##### Specific question:  
Can a player's total playtime (played_hours), experience level (experience), number of sessions (number_of_sessions), and mean session length (mean_session_length) predict whether they will subscribe to the game-related newsletter (subscribe)?  
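
As a first look at this question, one could compare subscription rates across groups before fitting any model. The sketch below uses made-up rows: `example_features` and its values are hypothetical, and the experience labels are placeholders rather than the levels found in players.csv.

```r
library(dplyr)

# Hypothetical rows for illustration only (not real data from players.csv)
example_features <- tibble(
  experience = c("Beginner", "Beginner", "Veteran", "Veteran", "Veteran"),
  played_hours = c(0.3, 1.5, 4.0, 0.2, 10.0),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Proportion of subscribers and mean playtime within each experience level
subscribe_by_experience <- example_features |>
  group_by(experience) |>
  summarize(
    players = n(),
    subscribe_rate = mean(subscribe),
    mean_played_hours = mean(played_hours)
  )

subscribe_by_experience
```

Because subscribe is logical, `mean(subscribe)` gives the proportion of TRUE values, so the summary shows directly whether subscription differs between groups.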

### Exploratory Data Analysis and Visualization:

In [None]:
# Loading Libraries
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(repr)
library(lubridate)
library(readr)
#limiting dataframe outputs to 6 rows
options(repr.matrix.max.rows = 6)

In [None]:
#Load the Data

players<- read_csv(
        "https://raw.githubusercontent.com/ajones200/dsci100_individual/refs/heads/main/players.csv") 
players

sessions<- read_csv(
        "https://raw.githubusercontent.com/ajones200/dsci100_individual/refs/heads/main/sessions.csv")
sessions

In [None]:
#Observe Datasets
glimpse(players)
glimpse(sessions)

In [None]:
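#Remove NA values from the players dataset
#(NA session durations are skipped later with na.rm = TRUE when averaging)
players_clean <- players |>
    drop_na()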


In [None]:
#Convert the start and end times (character strings) to date-times
#so the duration of each session can be computed in minutes
sessions_time_changed <- sessions |>
    mutate(
        start = dmy_hm(start_time, tz = "UTC"),
        end = dmy_hm(end_time, tz = "UTC"),
        duration_mins = as.numeric(difftime(end, start, units = "mins")))

#Calculate the number of sessions and the mean session length for each player
sessions_organized <- sessions_time_changed |>
    group_by(hashedEmail) |>
    summarize(
        number_of_sessions = n(),
        mean_session_length = mean(duration_mins, na.rm = TRUE))

In [None]:
# Merge the datasets
players_merged <- players_clean |>
    left_join(sessions_organized, by = "hashedEmail") |>
    filter(!is.na(number_of_sessions), !is.na(mean_session_length)) #Drop players with no session data
#Select only the needed columns (the predictors plus the subscribe response)
players_final_data <- players_merged |>
    select(subscribe, mean_session_length, played_hours, number_of_sessions, experience)
players_final_data

In [None]:
players_na<- colSums(is.na(players_final_data))
players_na

In [None]:
str(players)

In [None]:
#Number of sessions per player
player_sessions <- sessions |>
    group_by(hashedEmail) |>
    summarize(number_of_sessions = n())




In [None]:

#Data Wrangling:

#Libraries:
library(repr)
library(ggplot2)
source("cleanup.R")

#Tidy data:

#Mean value for each quantitative variable:
mean_hours_played <- mean(players$played_hours, na.rm = TRUE)
mean_age <- mean(players$age, na.rm = TRUE)

#Visualizations:


In [None]:
#Merge data
joined <- inner_join(players, sessions, by = "hashedEmail")