# Data Science Project: Individual Planning Stage

### Anya Jones  
### 86102779  

### Data Description:

A UBC research group is collecting data on how people play videogames. Below is a summary of the datasets:

##### Sessions Data: sessions.csv

Observations (rows): 1535  
Variables (columns): 5

##### Variable Information:
|Variable Name| Type    | Description|
|------------|----------|--------|
|hashedEmail |Character |Player's unique hashed anonymous email (same as players.csv)|
|start_time  |Character |Game session start time for player (human-readable)|
|end_time    |Character |Game session end time for player (human-readable)|
|original_start_time|Double |Original session start time (server time stamp)| 
|original_end_time|Double |Original session end time (server time stamp)|

##### Additional Details:
- Start/end times are stored as character strings and must be parsed into date-time objects before session durations can be computed.
- Time zone differences may cause inconsistencies in the recorded times.
- Sessions may include missing, inconsistent, or outlying values.
- Hashed emails may contain errors (mismatched or incorrect entries).
  
##### Player Data: players.csv

Observations (rows): 196  
Variables (columns): 7

##### Variable Information:
|Variable Name| Type    | Description|
|------------|----------|--------|
|experience  |Character |Player's self-reported experience level (from amateur to veteran)|
|subscribe   |Logical   |Whether the player subscribed to a game-related newsletter (True/False)|
|hashedEmail |Character |Player's unique hashed anonymous email|
|played_hours|Double    |Player's total hours in the game|
|name        |Character |Player's in-game name|
|gender      |Character |Player's self-reported gender|
|age         |Double    |Player age in years (no decimals)|

##### Additional Details:
- Experience level and gender are self-reported and may be biased or inaccurate.
- Sampling bias may affect results (the players may not represent the general population).
- Missing values may be present and must be handled before analysis.
- Age and played_hours may contain outliers that skew the results.


### Questions:  
##### Broad question:  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  

##### Specific question:  
Can a player's total playtime (played_hours), experience level (experience), number of sessions (number_of_sessions), and mean session length (mean_session_length_mins) predict whether they will subscribe to the game-related newsletter (subscribe)?  

These predictor variables will help address the question as they reveal key behaviors of the players and their potential connection to the probability of players subscribing. I will need to compute the number of sessions and their mean duration from the sessions dataset then combine it with the players data to get the required variables.

### Exploratory Data Analysis:

In [None]:
# Loading necessary libraries.
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(repr)
library(lubridate)
library(readr)

#limiting dataframe outputs to 6 rows.
options(repr.matrix.max.rows = 6)

In [None]:
#Load the Data.
players<- read_csv(
        "https://raw.githubusercontent.com/ajones200/dsci100_individual/refs/heads/main/players.csv") 
players

sessions<- read_csv(
        "https://raw.githubusercontent.com/ajones200/dsci100_individual/refs/heads/main/sessions.csv")
sessions

In [None]:
#View Players Dataset information.
cat("Players Dataset:\n","Rows:", nrow(players),"\n","Columns:", ncol(players))
cat("\n")
cat("\nPlayers Dataset Variable types:\n")
sapply(players, typeof)

#Check number of missing (NA) values in each column.
cat("\nPlayers dataset missing values:\n") 
print(colSums(is.na(players)))

#Look at all unique values of the categorical variables (gender and experience).
#These are self-reported variables.
cat("Possible unique values for the categorical variables\n")
players|>
select(experience, gender)|>
map(unique)

In [None]:
#View Sessions dataset information.
cat("Sessions Dataset:\n","Rows:", nrow(sessions),"\n","Columns:", ncol(sessions))
cat("\n")
cat("\nSessions Dataset Variable types:\n")
sapply(sessions, typeof)

#Check number of missing (NA) values in each column.
cat("\nSessions dataset missing values:\n") 
print(colSums(is.na(sessions)))

In [None]:
#Calculate the mean of each quantitative variable from the players dataset.
cat("Mean of Quantitative Variables in Players Dataset:\n")

numeric_players<- players[sapply(players, is.numeric)]

players_mean<-colMeans(numeric_players, na.rm= TRUE)
    
mean_table<- data.frame(
                       Variable = names(players_mean),
                        Mean = (players_mean))
                        
mean_table

In [None]:
#Data Wrangling.
#Remove NA values from the datasets.
players_clean<- drop_na(players)
sessions_clean<- drop_na(sessions)

#Convert start and end times (character strings) to date-time objects so the duration of each session can be computed in minutes.
sessions_time_mutate<- sessions_clean|>
    mutate(
        start= dmy_hm(start_time, tz = "UTC"), 
        end= dmy_hm(end_time, tz = "UTC"),
        duration_mins = as.numeric(difftime(end, start, units= "mins")))
#Note: mean session length is in minutes while played_hours is in hours; convert one of them or standardize both before modelling.

#Calculate mean session time and number of sessions.
sessions_organized<- sessions_time_mutate|>
    group_by(hashedEmail)|>
    summarize(
        number_of_sessions = n(),
        mean_session_length_mins = mean(duration_mins, na.rm=TRUE))
sessions_organized #view new data table.

In [None]:
# Merge the datasets.
#Keep only players with session data (drops players who had no recorded sessions).
players_merged<- players_clean|>
    left_join(sessions_organized, by = "hashedEmail")|>
    filter(!is.na(number_of_sessions), !is.na(mean_session_length_mins))

#Select only needed columns for final dataset.
#Include mean_session_length, played_hours, number_of_sessions, experience.
players_final_data<- players_merged|>
select(mean_session_length_mins, played_hours, number_of_sessions, experience, subscribe)
players_final_data 

### Visualizations

In [None]:
#Scatterplot showing the relationships between the number of sessions, mean session time and subscription. 
options(repr.plot.width = 15, repr.plot.height = 8)
sessions_info_plot<- players_final_data|>
ggplot(aes(x= number_of_sessions, y= mean_session_length_mins, color= subscribe, shape=subscribe))+
    geom_point(size=3)+ 
    labs(x= "Number of Sessions Played", 
         y= "Mean Session Time Played (in Minutes)", 
         title = "Relationship between Number of Sessions and Mean Session Time \nand their potential effect on Subscription",
         color="Subscription to Game Newsletter",
         shape="Subscription to Game Newsletter")+
    scale_color_brewer(palette = "Set2")+
    theme(text = element_text(size = 18))

sessions_info_plot

Most players have short sessions and play infrequently. Players with more sessions tend to have shorter mean session times and appear less likely to subscribe. Among players with fewer sessions, there is no discernible relationship between these variables and subscription.

In [None]:
#Bar graph of the ratio of subscriptions between self reported experience levels.
options(repr.plot.width = 11, repr.plot.height = 9)
experience_plot<- players_final_data|>
    ggplot(aes(x=experience, fill= subscribe))+
    geom_bar(position="fill")+
    labs(x="Players Self Reported Experience Level",
         y="Proportion of Players",
        title="Self Reported Experience and Ratio of Subscription to Game Newsletter",
        fill = "Subscription to Game Newsletter")+
    theme(text = element_text(size = 18))
experience_plot

All experience levels have subscribers. The veteran and pro categories have the largest proportion of non-subscribed players, while amateur, beginner, and regular players have higher subscription proportions, suggesting that newer or less experienced players are more likely to subscribe. 

In [None]:
#Histogram showing the players time spent playing and the subscriptions to the game-related newsletter.  
options(repr.plot.width = 11, repr.plot.height = 9)
player_time_plot<- players_final_data|>
    ggplot(aes(x= played_hours, fill= as_factor(subscribe)))+
    geom_histogram(binwidth= 40, alpha = 0.8)+
    facet_grid(rows = vars(subscribe)) +
    labs(x="Number of Hours Played by Players",
         y="Player Count",
        title= "Distribution of Players Playing Time\nand Subscription to the Game Newsletter",
        fill="Subscription Status")+
    theme(text = element_text(size = 18))
player_time_plot

A large number of players have short play times, and a few outliers have very large play times, giving a strongly right-skewed distribution. Players with more time spent playing appear more likely to subscribe. 

### Methods and Plan

Proposed method: K-Nearest Neighbours (KNN) classification.

#### Why this method is appropriate:

The goal of this project is to predict whether a player will subscribe to a game-related newsletter, which is a binary classification problem with two outcomes: true or false. KNN classification predicts the class of a new observation from the majority class among its K nearest data points. It adapts well to new data and works with multiple predictor variables without assuming a particular form for the relationship between the predictors and the outcome.
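
As a rough illustration of the majority-vote idea, the sketch below implements a bare-bones KNN prediction in base R. The toy data frame, the `knn_predict` helper, and the new player's values are all hypothetical and are not taken from the project data; the predictors are also left unscaled here to keep the example short.

```r
# Toy training data (made-up values, for illustration only).
train <- data.frame(
  played_hours       = c(0.5, 1.0, 30.0, 45.0, 2.0),
  number_of_sessions = c(1, 2, 25, 40, 3),
  subscribe          = c(FALSE, FALSE, TRUE, TRUE, FALSE)
)

# A hypothetical new player to classify.
new_player <- c(played_hours = 35, number_of_sessions = 30)

# Hypothetical helper: predict subscribe for one new point by majority vote.
knn_predict <- function(train, new_point, k) {
  predictors <- as.matrix(train[, names(new_point)])
  # Euclidean distance from the new point to every training row.
  distances <- sqrt(rowSums(sweep(predictors, 2, new_point)^2))
  # Labels of the k closest training rows.
  nearest_labels <- train$subscribe[order(distances)[seq_len(k)]]
  # Majority vote: TRUE wins if more than half of the neighbours are TRUE.
  mean(nearest_labels) > 0.5
}

prediction <- knn_predict(train, new_player, k = 3)
prediction
```

With `k = 3`, the three closest toy players are two subscribers and one non-subscriber, so the vote comes out as `TRUE`.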

#### Assumptions if there are any:
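- Once the numeric predictors are standardized, no single variable should dominate the distance calculation simply because of its units. 
- There is no missing data (rows with NA values are removed during wrangling). 
- Players with similar characteristics (experience, play time, etc.) will follow similar subscription patterns.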
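
To show why scaling matters for a distance-based method, the sketch below compares nearest neighbours before and after standardization in base R. The four players and the `nearest_to_first` helper are hypothetical; the point is only that a variable measured in larger units (minutes) can dominate the raw distances.

```r
# Hypothetical players: session length in minutes, total play time in hours.
players_toy <- data.frame(
  mean_session_length_mins = c(20, 30, 45, 120),
  played_hours             = c(0.5, 9.0, 0.6, 5.0)
)

# Pairwise Euclidean distances on the raw and on the standardized values.
raw_dist    <- as.matrix(dist(players_toy))
scaled_dist <- as.matrix(dist(scale(players_toy)))

# Hypothetical helper: row index of the nearest neighbour of player 1.
nearest_to_first <- function(d) unname(which.min(d[1, -1])) + 1

nearest_raw    <- nearest_to_first(raw_dist)     # driven mostly by minutes
nearest_scaled <- nearest_to_first(scaled_dist)  # both variables contribute
c(raw = nearest_raw, scaled = nearest_scaled)
```

In this toy example the nearest neighbour of player 1 changes from player 2 to player 3 once both variables are standardized, because the large difference in played hours is no longer swamped by the minutes column.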

#### Limitations of this model:
- KNN classification can be sensitive to noise, outliers, class imbalance, and differences in variable scale (so the predictors must be standardized).
- It can overfit or underfit the data depending on the value of K. 
- It can be slow for large datasets.
  
#### Comparing and Selecting Model:
- Cross-validation (10 folds) on the training data will be used to choose the value of K with the highest estimated accuracy.
- The final model will be run on the testing data and evaluated using accuracy, precision, recall, and a confusion matrix.
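
For reference, the evaluation metrics named above can be computed by hand from a confusion matrix. The sketch below uses base R with made-up labels for eight test-set players and assumes that `subscribe == TRUE` is treated as the positive class.

```r
# Hypothetical true and predicted labels (not real results).
truth     <- c(TRUE, TRUE, TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE)

# Confusion matrix: rows are predictions, columns are true labels.
confusion <- table(predicted = predicted, truth = truth)
confusion

# Counts, treating TRUE (subscribed) as the positive class.
tp <- sum(predicted & truth)    # true positives
fp <- sum(predicted & !truth)   # false positives
fn <- sum(!predicted & truth)   # false negatives
tn <- sum(!predicted & !truth)  # true negatives

accuracy  <- (tp + tn) / length(truth)  # share of all predictions that are correct
precision <- tp / (tp + fp)             # share of predicted subscribers who subscribed
recall    <- tp / (tp + fn)             # share of actual subscribers who were found
c(accuracy = accuracy, precision = precision, recall = recall)
```

Reporting precision and recall alongside accuracy is useful when one class is much more common than the other, since accuracy alone can look good for a model that mostly predicts the majority class.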

#### Processing Data to apply to the model:

1. Set the seed so the random split and folds are reproducible, then split the data into training (70%) and testing (30%) sets using the initial_split function.
2. Preprocess the data:
   - Standardize the numeric predictors so that each contributes comparably to the distance calculation.
   - Convert all categorical variables to factors using as_factor.
3. Use 10-fold cross-validation on the training data to tune the model. Because the dataset is small, averaging over 10 folds gives a more stable accuracy estimate than a single validation split, which guards against overfitting when choosing K and helps the model generalize to the unseen data later.
   - Train the classifier using a workflow and tune_grid(), trying values of K from 1 to 20 in steps of 1. 
   - Predict the labels within each fold and calculate the accuracy for each value of K. Use this to choose the best value of K. 
4. Assess the model on the unseen testing dataset using accuracy, precision, recall, and a confusion matrix. 
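
The steps above could be sketched with tidymodels roughly as follows. This is a minimal outline under several assumptions, not the final analysis: `toy_players` is a synthetic stand-in for `players_final_data` with made-up values, the seed is arbitrary, the `kknn` package is assumed to be installed, and dummy-encoding `experience` with `step_dummy()` is one possible way to handle the categorical predictor.

```r
library(tidymodels)  # the kknn package must also be installed for the "kknn" engine

set.seed(2024)  # arbitrary seed, set before any random step

# Synthetic stand-in for players_final_data (values are made up).
n <- 120
toy_players <- tibble(
  played_hours             = rexp(n, rate = 0.1),
  number_of_sessions       = rpois(n, lambda = 8) + 1,
  mean_session_length_mins = runif(n, 5, 120),
  experience = factor(sample(c("Amateur", "Beginner", "Regular", "Pro", "Veteran"),
                             n, replace = TRUE)),
  subscribe  = factor(sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3)))
)

# 1. 70/30 split, stratified on the response.
player_split <- initial_split(toy_players, prop = 0.70, strata = subscribe)
player_train <- training(player_split)
player_test  <- testing(player_split)

# 2. Standardize numeric predictors; dummy-encode experience (an assumption).
knn_recipe <- recipe(subscribe ~ ., data = player_train) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# 3. Tune K from 1 to 20 with 10-fold cross-validation on the training data.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

player_vfold <- vfold_cv(player_train, v = 10, strata = subscribe)

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = player_vfold,
            grid = tibble(neighbors = 1:20),
            metrics = metric_set(accuracy)) |>
  collect_metrics()

best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# 4. Refit with the chosen K and assess on the held-out test set.
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = player_train)

test_predictions <- predict(final_fit, player_test) |>
  bind_cols(player_test)

# "TRUE" is the second factor level, so it is set as the positive class.
test_predictions |> conf_mat(truth = subscribe, estimate = .pred_class)
test_predictions |> accuracy(truth = subscribe, estimate = .pred_class)
test_predictions |> precision(truth = subscribe, estimate = .pred_class, event_level = "second")
test_predictions |> recall(truth = subscribe, estimate = .pred_class, event_level = "second")
```

Because the toy labels are assigned at random, the resulting metrics carry no meaning; the sketch only shows how the pieces of the plan fit together.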