# Data Science Project: Planning Stage (Individual Portion)

### Data Description:

A UBC research group is collecting data on how people play videogames. Below is a summary of the datasets:

##### Sessions Data: sessions.csv

Observations (rows): 1535  
Variables (columns): 5

##### Variable Information:
|Variable Name| Type    | Description|
|------------|----------|--------|
|hashedEmail |Character |Player's unique hashed anonymous email (same as players.csv)|
|start_time  |Character |Game session start for player (human-readable)|
|end_time    |Character |Game session end for player (human-readable)|
|original_start_time|Double |Original session start time (server time stamp)| 
|original_end_time|Double |Original session end time (server time stamp)|

##### Additional Details:
- No session duration is included (it must be computed).
- The start and end times are stored as character strings and must be converted to date-time values before session length can be calculated.
- Possible inconsistencies due to time zone variations.
- Sessions may include missing or inconsistent data.
- Potential errors (mismatched or incorrect values) in email inputs.
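
One way to check for mismatched emails is to compare the keys of the two tables with an anti-join. The sketch below uses small made-up tables (`toy_players` and `toy_sessions` are hypothetical stand-ins, not the real files) and assumes `hashedEmail` is the shared key.

```r
library(dplyr)

# Hypothetical toy tables standing in for players.csv and sessions.csv
toy_players  <- tibble(hashedEmail = c("a1", "b2", "c3"))
toy_sessions <- tibble(hashedEmail = c("a1", "a1", "b2", "zz"))

# Sessions whose email has no match in the players table
unmatched_sessions <- anti_join(toy_sessions, toy_players, by = "hashedEmail")

# Players with no recorded sessions
players_without_sessions <- anti_join(toy_players, toy_sessions, by = "hashedEmail")

unmatched_sessions
players_without_sessions
```

An empty `unmatched_sessions` result on the real data would suggest every session can be linked to a player.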

  
##### Player Data: players.csv

Observations (rows): 196  
Variables (columns): 7

##### Variable Information:
|Variable Name| Type    | Description|
|------------|----------|--------|
|experience  |Character |Player's self-reported experience level (from amateur to veteran)|
|subscribe   |Logical   |Whether the player subscribed to a game-related newsletter (True/False)|
|hashedEmail |Character |Player's unique hashed anonymous email|
|played_hours|Double    |Player's total hours in the game|
|name        |Character |Player's in-game name|
|gender      |Character |Player's self-reported gender| 
|age         |Double    |Player age in years (no decimals)|

##### Additional Details:
- Experience level is self-reported and may be biased, resulting in inaccurate representation.
- Sampling bias may affect results (the sample may not represent the general population).
- Other self-reported variables (such as gender) may also be biased.
- Missing values may be present, which affects the analysis.
- age and played_hours may contain outliers that skew the results.
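
A quick way to screen for outliers is the 1.5 × IQR rule. The sketch below applies it to a made-up vector (`toy_hours` is hypothetical); the same steps could be applied to the real `played_hours` or `age` columns. The 1.5 multiplier is a common convention, not something specified by this project.

```r
# Hypothetical played_hours values; the real column comes from players.csv
toy_hours <- c(0, 0.1, 0.3, 0.5, 1.2, 2.0, 3.5, 150)

# 1.5 * IQR rule: flag values far outside the middle 50% of the data
q <- quantile(toy_hours, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- toy_hours[toy_hours < lower | toy_hours > upper]
outliers
```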


### Questions:  
##### Broad question:  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  

##### Specific question:  
Can a player's total playtime (played_hours), experience level (experience), number of sessions (number_of_sessions), and mean session length (mean_session_length_mins) predict whether they will subscribe to the game-related newsletter (subscribe)?  

### Exploratory Data Analysis:

In [None]:
# Loading Libraries
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(repr)
library(lubridate)
library(readr)

#limiting dataframe outputs to 6 rows
options(repr.matrix.max.rows = 6)

In [None]:
#Load the Data

players<- read_csv(
        "https://raw.githubusercontent.com/ajones200/dsci100_individual/refs/heads/main/players.csv") 
players

sessions<- read_csv(
        "https://raw.githubusercontent.com/ajones200/dsci100_individual/refs/heads/main/sessions.csv")
sessions

In [None]:
#View Players Dataset information
cat("Players Dataset:\n","Rows:", nrow(players),"\n","Columns:", ncol(players))
cat("\n")
cat("\nPlayers Dataset Variable types:\n")
sapply(players, typeof)

#Check number of missing (NA) values per column
cat("\nPlayers dataset missing values:\n")
print(colSums(is.na(players)))

#Look at all unique outcomes for the categorical variables with limited results (gender and experience)
#These are self assigned variables
cat("Possible unique outcomes for some categorical variables\n")
players|>
    select(experience, gender)|>
    map(unique)

In [None]:
#View Sessions Dataset information
cat("Sessions Dataset:\n","Rows:", nrow(sessions),"\n","Columns:", ncol(sessions))
cat("\n")
cat("\nSessions Dataset Variable types:\n")
sapply(sessions, typeof)

#Check number of missing (NA) values per column
cat("\nSessions dataset missing values:\n")
print(colSums(is.na(sessions)))

In [None]:
#Calculate the mean of each quantitative variable from the players dataset
cat("Mean of Quantitative Variables in Players Dataset:\n")

numeric_players<- players[sapply(players, is.numeric)]

players_table_mean<-colMeans(numeric_players, na.rm= TRUE)
    
players_table_mean

In [None]:
#Data Wrangling
#Remove NA values from the datasets
players_clean<- na.omit(players)
sessions_clean<- na.omit(sessions)

#convert start and end times (character strings) to an object that can be used to compute the duration of each session in minutes.
sessions_time_mutate<- sessions_clean|>
    mutate(
        start= dmy_hm(start_time, tz = "UTC"), 
        end= dmy_hm(end_time, tz = "UTC"),
        duration_mins = as.numeric(difftime(end, start, units= "mins")))
#Note: mean session length is in minutes while played_hours is in hours, so the units should be made consistent (or the predictors standardized) before modelling.

#calculate mean session time and number of sessions
sessions_organized<- sessions_time_mutate|>
    group_by(hashedEmail)|>
    summarize(
        number_of_sessions = n(),
        mean_session_length_mins = mean(duration_mins, na.rm=TRUE)
    )

In [None]:
# Merge the datasets
#Remove players with no recorded sessions (NA values after the join)
players_merged<- players_clean|>
    left_join(sessions_organized, by = "hashedEmail")|>
    filter(!is.na(number_of_sessions), !is.na(mean_session_length_mins))

#Select only needed columns for final dataset
#Include mean_session_length, played_hours, number_of_sessions, experience
players_final_data<- players_merged|>
select(mean_session_length_mins, played_hours, number_of_sessions, experience, subscribe)
players_final_data 


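Note: experience is still stored as a character column. It may need to be converted to a factor before modelling, although this may be more manipulation than is needed at the planning stage.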
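
If the conversion is done, one possible approach is sketched below on made-up rows (`toy_players` is hypothetical). The level names and their order are assumptions and should be checked against the unique values printed earlier.

```r
library(dplyr)

# Hypothetical rows; the real values come from players_final_data
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Veteran", "Beginner", "Regular"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

toy_players <- toy_players |>
  mutate(
    # Level names and order are an assumption (roughly least to most experienced)
    experience = factor(
      experience,
      levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran")
    ),
    # Logical response becomes a factor with levels "FALSE" and "TRUE"
    subscribe = factor(subscribe)
  )

levels(toy_players$experience)
```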

### Visualizations

In [None]:
#Scatterplot showing the relationships between the number of sessions, mean session time and subscription. 
options(repr.plot.width = 15, repr.plot.height = 8)
sessions_info_plot<- players_final_data|>
ggplot(aes(x= number_of_sessions, y= mean_session_length_mins, color= subscribe, shape=subscribe))+
    geom_point(size=3)+ 
    labs(x= "Number of Sessions Played", 
         y= "Mean Session Time Played (in Minutes)", 
         title = "Relationship between Number of Sessions and Mean Session Time \nand their potential effect on Subscription",
         color="Subscription to Game Newsletter",
         shape="Subscription to Game Newsletter")+
    scale_color_brewer(palette = "Set2")+
    theme(text = element_text(size = 18))

sessions_info_plot

Most players have short sessions and play relatively few times. Players with more sessions appear to have shorter mean session times. There is no clear relationship between mean session time and subscription, but players with more sessions appear somewhat less likely to subscribe.

In [None]:
#Histogram showing players' time spent playing, split by subscription to the game-related newsletter.
options(repr.plot.width = 11, repr.plot.height = 9)
player_time_plot<- players_final_data|>
    ggplot(aes(x= played_hours, fill= as_factor(subscribe)))+
    geom_histogram(binwidth= 40, alpha = 0.8)+
    facet_grid(rows = vars(subscribe)) +
    labs(x="Number of Hours Played by Players",
         y="Player Count",
        title= "Distribution of Players' Playing Time\nand Subscription to the Game Newsletter",
        fill="Subscription Status")+
    theme(text = element_text(size = 18))
player_time_plot

A large number of players have played for only a short amount of time, and only a few have played for significantly more hours, which suggests a strongly right-skewed distribution. Players with more time spent playing appear more likely to subscribe.

In [None]:
#Bar graph of the ratio of subscriptions between self reported experience levels.
options(repr.plot.width = 11, repr.plot.height = 9)
experience_plot<- players_final_data|>
    ggplot(aes(x=experience, fill= subscribe))+
    geom_bar(position="fill")+
    labs(x="Players' Self-Reported Experience Level",
         y="Proportion of Players",
        title="Self-Reported Experience and Proportion Subscribed to Game Newsletter",
        fill = "Subscription to Game Newsletter")+
    theme(text = element_text(size = 18))
experience_plot

All categories appear to have a subscription rate of at least 50%. The Veteran and Pro categories have the largest share of non-subscribed players, while Amateur, Beginner and Regular players have higher subscription rates. This suggests that less experienced players may be more likely to subscribe.
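
Because the bar chart shows proportions only, a table of counts alongside the rates can help judge how much weight each category deserves. A minimal sketch on made-up rows (`toy_data` is hypothetical and its values are invented) follows.

```r
library(dplyr)

# Hypothetical rows standing in for players_final_data
toy_data <- tibble(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Veteran"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Count of players and proportion subscribed within each experience level
subscription_rates <- toy_data |>
  group_by(experience) |>
  summarize(
    n_players = n(),
    prop_subscribed = mean(subscribe)
  )

subscription_rates
```

A category with very few players can show an extreme proportion by chance, so the `n_players` column is worth reading together with the rate.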

### Methods and Plan

K Nearest Neighbours Classification:

Why?
- This is a binary classification problem because the response has only two possible outcomes: true or false for subscription.
- KNN classification can classify players using the multiple predictor variables selected.
- It adapts well to new data and makes few assumptions about the underlying distribution of the data.

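Assumptions?
- Players with similar predictor values (after scaling) tend to have the same subscription status.
- No linear relationship between the predictors and the response is required.
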

Limitations?
- A KNN classification model can be sensitive to noise, imbalanced data and outliers.
- Can overfit or underfit data. 
- Data needs to be scaled.
- Can be slow for large datasets.

Plan:
- Convert all categorical variables to factors.
- Split the data into a training (75%) and testing (25%) set.
- Standardize the predictors, using scaling parameters estimated from the training set only.
- Train the model on the training set.
- Use cross-validation to tune the model and choose the best value of K.
- Assess the model on the testing set using accuracy, recall, and precision.
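
The plan could be sketched with tidymodels roughly as follows. This is only an illustration under stated assumptions: `toy_data` is simulated and stands in for `players_final_data`, the 5 folds and the grid of K values are arbitrary choices, dummy-encoding `experience` is one possible way to use it in a distance calculation, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(2024)

# Simulated stand-in for players_final_data (all values are made up)
n <- 200
toy_data <- tibble(
  played_hours = rexp(n, rate = 0.2),
  number_of_sessions = rpois(n, lambda = 5) + 1,
  mean_session_length_mins = runif(n, 5, 120),
  experience = factor(sample(
    c("Amateur", "Beginner", "Regular", "Pro", "Veteran"), n, replace = TRUE
  )),
  # "TRUE" is listed first so yardstick treats it as the event of interest
  subscribe = factor(
    sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("TRUE", "FALSE")
  )
)

# 75/25 split, stratified on the response
data_split <- initial_split(toy_data, prop = 0.75, strata = subscribe)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Preprocessing steps are estimated from the training set only
knn_recipe <- recipe(subscribe ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# KNN model with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# 5-fold cross-validation over an illustrative grid of K values
folds <- vfold_cv(train_data, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

tuning_results <- tune_grid(
  knn_workflow,
  resamples = folds,
  grid = k_grid,
  metrics = metric_set(accuracy)
)

best_k <- select_best(tuning_results, metric = "accuracy")

# Refit on the full training set with the chosen K, then predict the test set
final_fit <- knn_workflow |>
  finalize_workflow(best_k) |>
  fit(data = train_data)

test_predictions <- predict(final_fit, test_data) |>
  bind_cols(test_data)

# Accuracy, precision and recall on the held-out test set
eval_metrics <- metric_set(accuracy, precision, recall)
test_metrics <- eval_metrics(test_predictions, truth = subscribe, estimate = .pred_class)
test_metrics
```

Because the simulated response is random, the metrics here carry no meaning. The sketch only shows the order of the steps.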