# DSCI 100 - Group Project: Predictive Modeling of Gaming Newsletter Subscriptions!
Group 14, Section 009

GitHub Repository Link: https://github.com/anasakbar-05/DSCI_100_Group_Project_009_14

### Introduction

For this project, we are working with a real dataset from a UBC Computer Science research group led by Frank Wood. The group studies how people play video games by running its own Minecraft server, where players' in-game actions are automatically recorded as they move around and interact with the world. Since this is an ongoing research project, the team needs help figuring out how to target the kinds of players who will contribute the most data, and how to allocate enough resources (server hardware and software licenses) to support the research. To guide these decisions, they outlined three broad questions about player behaviour, player types, and server usage patterns. Each project group chooses one broad question and narrows it into a more specific question that can be answered within the scope of this project and course (DSCI 100).

Our group decided to focus on the first broad question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? We then narrowed it to one specific research question for the project:

“Can a player’s age, their total hours played, experience level, and gender be used to accurately predict their subscription status to the game-related newsletter?”

To answer this, we used the "players.csv" dataset provided by the research group. It includes player demographics (such as gender and age), gameplay behaviours (such as experience level and total hours played), and whether or not each player subscribed to the newsletter. With these variables, we can explore patterns across different kinds of players and build a model that predicts subscription status from gameplay and demographic features. Such a model could help the research group better understand what drives player engagement and how to target future recruitment efforts.

While a secondary dataset, sessions.csv, is available, this analysis will focus on players.csv. This decision ensures a direct and focused approach to answering the specific research question with the most relevant data.

In [None]:
library(tidyverse)  # includes dplyr and readr
library(repr)
library(tidymodels)

# Read the players dataset directly from Google Drive
url <- "https://drive.google.com/uc?export=download&id=19dtTv9I4hUdTKPBrM1QgI3A0ru68ssds"
players <- read_csv(url)
head(players)

In [None]:
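# Summary statistics for every column (min, median, mean, max, and NA counts for numeric columns)
players_summary <- summary(players)
players_summary

# Unique values of each categorical variable
distinct(players, experience)
distinct(players, subscribe)
distinct(players, gender)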

### Description of the Players dataset — players.csv

The players.csv file contains 196 observations and 7 variables. Each row represents one unique player, and each variable describes a characteristic of that player.

| Variable name    | Type      | Meaning                                                                                                                        |
| ---------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------ |
| **experience**   | Character | Self-reported experience level. Categories include: *Beginner, Regular, Amateur, Pro, Veteran*.                                |
| **subscribe**    | Logical   | Whether a player subscribed to the game-related newsletter (*TRUE/FALSE*).                                                     |
| **hashedEmail**  | Character | A hashed version of each player’s email address (used as an anonymized identifier).                                            |
| **played_hours** | Numeric   | Total number of hours the player has spent in the game.                                                                        |
| **name**         | Character | Player’s display name.                                                                                                         |
| **gender**       | Character | The player’s gender identity. Categories include: *Female, Male, Non-binary, Two-Spirited, Agender, Prefer not to say, Other*. |
| **Age**          | Numeric   | Player’s self-reported age in years.                                                                                           |

### Summary Statistics:

Played hours: Mean = 5.85, Median = 0.10, Min = 0.00, Max = 223.10

Age: Mean = 21.14, Median = 19.00, Min = 9, Max = 58 (2 missing values)

Subscription count: 144 subscribed, 52 did not
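
Statistics like the ones above can be computed with a short dplyr summary. The sketch below is a minimal illustration that assumes the column names from the table (`played_hours`, `Age`, `subscribe`). `toy_players` is a small made-up tibble standing in for `players`, so its output will not match the numbers reported here.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `players`; these values are invented for illustration.
toy_players <- tibble(
  played_hours = c(0.0, 0.1, 0.3, 1.5, 30.2),
  Age          = c(17, 19, NA, 21, 25),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Mean, median, min, and max for each numeric variable.
# na.rm = TRUE drops missing ages instead of returning NA.
numeric_summary <- toy_players |>
  summarize(across(
    c(played_hours, Age),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      min    = ~ min(.x, na.rm = TRUE),
      max    = ~ max(.x, na.rm = TRUE)
    )
  ))

# Number of missing ages, and how many players did or did not subscribe.
missing_age <- sum(is.na(toy_players$Age))
subscribe_counts <- count(toy_players, subscribe)

numeric_summary
missing_age
subscribe_counts
```

Running the same pipeline on `players` instead of `toy_players` should give the project's actual values, provided the column names match.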


### Observations & Potential Issues With the Data:

The "played_hours" of players is a highly skewed variable as most players have very low values. This might affect the conclusions we can draw from any analysis. It also (played_hours) may not reflect true gameplay. Players can leave the game running while AFK (away from keyboard), which could have inflated their hours.

The dataset only includes players who talked in the game, since data collection depends on player communication. The sample may therefore underrepresent quieter or less social players. This limits the size of the dataset and may reduce the diversity of playstyles and demographics it captures.
(Source: https://plaicraft.ai/faq/gameplay).

The "hashedEmail" isn’t very helpful analytically. Since it’s hashed, we can’t decode it or use it for linking across datasets.

"Age" may not be fully reliable. Players can easily enter an inaccurate age, introducing systematic measurement error.

"experiance" and "gender" will have to be converted to factor-type variables for further analysis.

