# DSCI 100 - Group Project: Predictive Modeling of Gaming Newsletter Subscriptions!
Group 14, Section 009

GitHub Repository Link: https://github.com/anasakbar-05/DSCI_100_Group_Project_009_14

### Introduction

For this project, we are working with a real dataset from a UBC Computer Science research group led by Frank Wood. The group studies how people play video games by running its own Minecraft server, where players' in-game actions are automatically recorded as they move around and interact with the world. Because this is an ongoing research project, the team needs help targeting the kinds of players who will contribute large amounts of data, and allocating limited resources (server hardware and software licenses) to support the research. To guide these decisions, the team outlined three broad questions about player behaviour, player types, and server usage patterns. From the chosen broad question, we formulate a more specific question that can be answered within the scope of this project and course (DSCI 100).

Our group decided to focus on the first broad question: 

**What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?** 

We then narrowed it to one specific research question for the project:

**“Can a player’s age, their total hours played, experience level, and gender be used to accurately predict their subscription status to the game-related newsletter?”**

To answer this, we used the "players.csv" dataset provided by the research group. It includes player demographics (such as gender and age), gameplay behaviours (such as experience level and total hours played), and whether or not each player subscribed to the newsletter. With these variables, we can explore patterns across different kinds of players and build a model that predicts subscription status from gameplay and demographic features. Such a model would help the research group understand what drives player engagement and how to target future recruitment efforts.

*Note: a secondary dataset, sessions.csv, is also available, but this analysis uses only players.csv, which contains the variables named in the research question.*

In [5]:
library(tidyverse)
library(repr)
library(tidymodels)
library(dplyr)

# Read the players data from the shared link
url <- "http://drive.google.com/uc?rxport-download&id=19dtTv9I4hUdTKPBrM1QgI3A0ru68ssds"
players <- read_csv(url)
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [6]:
players_summary <- summary(players)
players_summary

distinct(players, experience)
distinct(players, subscribe)
distinct(players, gender)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

experience
<chr>
Pro
Veteran
Amateur
Regular
Beginner


subscribe
<lgl>
True
False


gender
<chr>
Male
Female
Non-binary
Prefer not to say
Agender
Two-Spirited
Other


### Description of the Players dataset — players.csv

The players.csv file contains 196 observations and 7 variables describing each player's demographics and gameplay. Each row represents one unique player.

| Variable name    | Type      | Meaning                                                                                                                                  |
| ---------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **experience**   | Character | Self-reported experience level. Categories include: *Beginner, Regular, Amateur, Pro, Veteran*.                                          |
| **subscribe**    | Logical   | Whether a player subscribed to the game-related newsletter (*TRUE/FALSE*).                                                               |
| **hashedEmail**  | Character | A hashed version of each player’s email address (used as an anonymized identifier).                                                      |
| **played_hours** | Numeric   | Total number of hours the player spent in the game.                                                                                      |
| **name**         | Character | Player’s display name.                                                                                                                   |
| **gender**       | Character | The player’s gender identity. Categories include: *Female, Male, Non-binary, Two-Spirited, Agender, Prefer not to say, Other*.           |
| **Age**          | Numeric   | Player’s self-reported age in years.                                                                                                     |

### Summary Statistics:

`played_hours`: Mean = 5.85, Median = 0.10, Min = 0.00, Max = 223.10

`Age`: Mean = 21.14, Median = 19.00, Min = 9.00, Max = 58.00 (2 missing values)

`subscribe`: 144 subscribed (TRUE), 52 did not (FALSE)


### Observations & Potential Issues With the Data:

`played_hours` is highly right-skewed: most players have very low values (the median is 0.10 hours, while the maximum is 223.10 hours). This may limit the conclusions we can draw from any analysis. The variable may also not reflect true gameplay, because players can leave the game running while AFK (away from keyboard), which would inflate their hours.
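
As a rough illustration of this skew, the sketch below compares the mean and median of `played_hours` for the six players printed by `head(players)` earlier in the notebook. The vector name `hours_head` is ours, and six rows are only a small illustration, not a summary of the full dataset.

```r
# played_hours for the six players shown by head(players)
hours_head <- c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0)

# A mean far above the median signals a right-skewed variable
mean(hours_head)
median(hours_head)
```

A single 30.3-hour player pulls the mean well above the median here, which is the same pattern the full summary shows (mean 5.85 versus median 0.10).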

The dataset only includes players who talked in the game, since data collection depends on player communication (source: https://plaicraft.ai/faq/gameplay). The sample may therefore not represent quieter or less social players. This restriction also limits the sample size, so the dataset may be less diverse in player types and could underrepresent certain playstyles or demographics.

`hashedEmail` is not useful as a predictor. Because it is hashed, we cannot recover the original email address; it serves only as an anonymized identifier.

`Age` may not be fully reliable. It is self-reported, so players can easily enter an inaccurate age, which introduces measurement error.

`experience` and `gender` are stored as character variables and will have to be converted to factors for further analysis.
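
The conversion to factors could look like the sketch below. It is one possible way to build the `clean_players` object used later, not necessarily the one the final analysis will use. To stay self-contained it runs on the six rows printed by `head(players)` (the name `players_head` is ours). Dropping rows with a missing `Age` and converting `subscribe` to a factor are assumptions about how the data might be prepared.

```r
library(tidyverse)

# First six rows of players.csv as printed by head(players)
players_head <- tibble(
  experience = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender = c("Male", "Male", "Male", "Female", "Male", "Female"),
  Age = c(9, 17, 17, 21, 21, 17)
)

clean_players <- players_head |>
  drop_na(Age) |>                      # assumption: drop players with missing Age
  mutate(
    subscribe = factor(subscribe),     # classification needs a factor response
    experience = factor(experience),
    gender = factor(gender)
  )

glimpse(clean_players)
```

On the full dataset the same `mutate()` call would be applied to `players` instead of `players_head`.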

## Methods and Results

To answer our question, we plan to use **K-nearest neighbours (K-NN) classification** with the predictors `gender`, `Age`, `experience`, and `played_hours` to predict `subscribe`. We chose K-NN classification because it predicts a categorical response variable, and `subscribe` has two categories, `TRUE` and `FALSE`.

This plan has limitations. One is the imbalance in `gender`, where the `Male` category forms a large majority, so predictions for the smaller categories rest on very few observations. Another is that K-NN is biased towards variables with a larger numerical range. To address this, we will standardize the predictors, estimating the centring and scaling from the training set only, after the split, so that no information from the testing set leaks into the model. Because K-NN relies on distances between observations, `experience` and `gender` must also be converted to a numeric form before modelling.
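
Class imbalance also affects the response: the summary above reports 144 subscribers and 52 non-subscribers. The short calculation below, offered as a sketch, gives the accuracy of a classifier that always predicts the majority class. This is a useful floor when judging the K-NN model. The variable names are ours.

```r
# Subscription counts reported in the summary of players.csv
n_subscribed <- 144
n_not_subscribed <- 52

# Accuracy obtained by always predicting "subscribed"
baseline_accuracy <- n_subscribed / (n_subscribed + n_not_subscribed)
round(baseline_accuracy, 3)  # 0.735
```

A tuned model whose test accuracy is not clearly above this value would add little over simply guessing that every player subscribes.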

We will **split the data into 75% training and 25% testing** sets. The analysis will proceed as follows:

1. split the data into training and testing sets
2. standardize the predictors using a recipe built from the training set
3. cross-validate on the training set and tune K, choosing the value with the highest accuracy
4. apply the final model to the testing set to estimate its accuracy and precision
5. evaluate the model's performance and consider whether further revisions are needed.
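
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the final analysis. It runs on simulated data (`toy_players`, which is NOT the real players.csv), uses only two numeric predictors, and picks 5 folds and a grid of odd K values from 1 to 15 arbitrarily. All object names are ours. It assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(2024)

# Simulated stand-in for the cleaned data (NOT the real players.csv)
n <- 200
toy_players <- tibble(
  Age = round(runif(n, 9, 58)),
  played_hours = round(rexp(n, rate = 0.2), 1),
  subscribe = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE,
                            prob = c(0.7, 0.3)))
)

# Step 1: split into 75% training and 25% testing, stratified on the response
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Step 2: standardize predictors, with the recipe built from the training set
toy_recipe <- recipe(subscribe ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Step 3: tune K with cross-validation on the training set
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

tune_accuracy <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- tune_accuracy |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Step 4: refit with the chosen K and evaluate on the testing set
knn_final <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_final) |>
  fit(data = toy_train)

test_predictions <- predict(final_fit, toy_test) |>
  bind_cols(toy_test)

test_accuracy <- test_predictions |>
  accuracy(truth = subscribe, estimate = .pred_class)

# "TRUE" is the second factor level, so it is named as the positive class
test_precision <- test_predictions |>
  precision(truth = subscribe, estimate = .pred_class, event_level = "second")

bind_rows(test_accuracy, test_precision)
```

Because the simulated response is unrelated to the predictors, the resulting metrics carry no meaning; the sketch only shows how the pieces fit together. With the real data, categorical predictors would first need a numeric encoding, which is omitted here.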

In [None]:
clean_players 

In [3]:
# Stratify on the response so both sets keep a similar share of subscribers
players_split <- initial_split(clean_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)


In [None]:
players_recipe <- recipe(subscribe ~ ., data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

In [5]:
# K-NN specification with the number of neighbours left to be tuned
players_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")


In [4]:
players_tune_accuracy <- players_tune_fit |> collect_metrics() |>
  filter(.metric == "accuracy")  # keep only the accuracy estimates


## Discussion

summarize what you found  
discuss whether this is what you expected to find  
discuss what impact could such findings have  
discuss what future questions could this lead to

## References