# Final Project 

**Course:** DSCI 100

**Authors:** Ning Hu, Michael Alexander Gunardi, Gavin Lei, Michael Leung

**Group:** 30

**Date:** Due Dec 6th

**Chosen question:** 1. What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
library(readr)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## 1. Introduction
A research group in Computer Science at UBC, led by Frank Wood, is collecting data about how people play video games. They have set up a Minecraft server, and players' actions are recorded as they navigate through the world.

In this project, the broad question our group addresses is: **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?** More specifically: **Can a player's total playtime and age predict whether they subscribe to the newsletter in the player database?**

In [None]:
players <- read.csv("https://raw.githubusercontent.com/gavinlei060322-cmd/GroupProject_DSCI-100-group-30-section-003-/refs/heads/main/players.csv")
head(players)

## Summary
This data frame contains information about individual players, including their experience level, subscription status, playtime, name, gender, and age. Each record represents a unique player identified by a hashed email address.

Number of observations: 196 players

Number of Variables: 7

## Data Description

- **Number of Observations**:
  - `players.csv`: 196 observations (rows)

- **Summary Statistics**:
  - **Age**: Min = 9, Max = 58, Mean = 21.14, Median = 19.00
  - **Played Hours**: Min = 0.00, Max = 223.10, Mean = 5.85, Median = 0.10

- **Logical and Numeric Variables**:
  - **`players.csv`**:
    - **subscribe**: Logical (subscription status, TRUE or FALSE)
    - **played_hours**: Numeric (total number of hours played)
    - **Age**: Integer (player's age in years)

- **Categorical and Identifier Variables (Stored as Characters)**:
  - **`players.csv`**:
    - **experience**: Character (level of experience; values seen in the data preview include Pro, Veteran, Regular, and Amateur)
    - **hashedEmail**: Character (unique identifier for each player)
    - **name**: Character (player's name)
    - **gender**: Character (player's gender)
 
- **Potential Issues**:
  - **Missing Data**: The `Age` and `gender` variables have a few missing values (`NA`) in the dataset.
  - **Outliers**: The `played_hours` variable is heavily right-skewed (median 0.10 hours versus a maximum of 223.10 hours), so a few extreme values may dominate distance-based methods unless the predictors are scaled.
  - **Data Types**: The `start_time` and `end_time` columns in `sessions.csv` are currently stored as character strings and need conversion to a date-time type.
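
The summary statistics and missing-value check described above could be computed along the following lines. This is a minimal sketch, not the code used to produce the numbers in this report: to stay self-contained it uses only the six rows displayed by `head(players)` (with the `hashedEmail` and `name` columns omitted), so its results will differ from the full 196-row statistics. The names `players_head`, `player_stats`, and `na_counts` are illustrative.

```r
library(dplyr)

# Six rows of players.csv as displayed by head(players) in this report.
# This is NOT the full dataset, so the statistics differ from those listed above.
players_head <- data.frame(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender       = c("Male", "Male", "Male", "Female", "Male", "Female"),
  Age          = c(9L, 17L, 17L, 21L, 21L, 17L)
)

# Min, max, mean, and median for the two numeric variables.
# na.rm = TRUE skips missing values, which the full dataset contains.
player_stats <- players_head |>
  summarise(
    age_min      = min(Age, na.rm = TRUE),
    age_max      = max(Age, na.rm = TRUE),
    age_mean     = mean(Age, na.rm = TRUE),
    age_median   = median(Age, na.rm = TRUE),
    hours_min    = min(played_hours, na.rm = TRUE),
    hours_max    = max(played_hours, na.rm = TRUE),
    hours_median = median(played_hours, na.rm = TRUE)
  )

# Number of missing values in every column
na_counts <- players_head |>
  summarise(across(everything(), ~ sum(is.na(.x))))

player_stats
na_counts
```

Running the same two `summarise()` calls on the full `players` data frame should give the figures reported in the list above.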

## 2. Methods & Results

#### Method
To address the question **Can the player's total playtime and age predict whether they subscribe to the newsletter in the player database?**, we will use a K-nearest neighbors (KNN) classification model. This will allow us to classify players as "subscribers" or "non-subscribers" based on their playtime and age.
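
As a rough illustration of what such a KNN workflow might look like in `tidymodels`, consider the sketch below. It rests on several assumptions that are not part of this report: the data are synthetic stand-ins generated at random (`toy_players` is not the real players data), the 75/25 split and the seed are arbitrary, and `neighbors = 5` is a placeholder rather than a tuned value. It also assumes the `kknn` package is installed, since that is the engine used here.

```r
library(tidymodels)

set.seed(30)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in data (NOT the real players data): 80 made-up players.
# The label rule below is invented purely so the classifier has a pattern to learn.
toy_players <- tibble(
  played_hours = round(rexp(80, rate = 0.2), 1),
  Age          = sample(10:50, 80, replace = TRUE)
) |>
  mutate(subscribe = factor(if_else(played_hours > 3, "TRUE", "FALSE")))

# Hold out 25% of the rows for testing, keeping class proportions similar
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# KNN is distance-based, so both predictors are scaled and centered
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K = 5 is a placeholder; in practice K would be chosen by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and compute accuracy
toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

toy_accuracy <- toy_predictions |>
  accuracy(truth = subscribe, estimate = .pred_class)

toy_accuracy
```

For the real analysis, `toy_players` would be replaced by the cleaned players data with `subscribe` converted to a factor, and the accuracy obtained on synthetic data here says nothing about how well the model performs on actual players.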

<h3> Loading data</h3>

In [13]:
head(players)

| | experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` |
| 1 | Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| 2 | Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| 3 | Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| 4 | Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| 5 | Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| 6 | Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


<h3> Data wrangling and cleaning</h3>

Before exploring the data, we first clean the `players.csv` dataset to keep only the variables that are relevant to our predictive question.
Since our goal is to understand how player characteristics relate to newsletter subscription, we focus on variables that describe experience, demographics, and engagement.

We remove the following columns:

`hashedEmail`: an identifier used to link players with the sessions dataset. It carries no information useful for predicting subscription, so we exclude it to avoid data leakage.

`name`: not a meaningful feature, and it may contain personal information.

We now drop these identifying variables from the data frame and keep the remaining ones: `experience`, `played_hours`, `Age`, `gender`, and `subscribe`.

In [1]:
# select() comes from dplyr, so the tidyverse must be attached in this session first
library(tidyverse)
players_clean <- select(players, experience, played_hours, Age, gender, subscribe)
players_clean
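
Before fitting a classifier, the cleaned data would likely also need the response converted to a factor and rows with missing predictor values dropped. The sketch below shows one way this might be done. It is an assumption-laden illustration, not the report's actual pipeline: `players_demo` contains the six rows shown by `head(players)` plus one made-up row with a missing `Age`, added only to demonstrate the `NA` handling, and `players_model` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Six rows shown by head(players), plus ONE HYPOTHETICAL row (the last one)
# with a missing Age, included only to demonstrate NA handling.
players_demo <- data.frame(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur", "Amateur"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 1.0),
  Age          = c(9L, 17L, 17L, 21L, 21L, 17L, NA),
  gender       = c("Male", "Male", "Male", "Female", "Male", "Female", "Male"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

players_model <- players_demo |>
  # keep only the two predictors and the response used in the specific question
  select(played_hours, Age, subscribe) |>
  # drop rows where either predictor is missing
  drop_na(played_hours, Age) |>
  # classification models in tidymodels expect a factor response
  mutate(subscribe = factor(subscribe))

players_model
```

Dropping rows is only one option for handling missing values; whether it is appropriate depends on how many rows of the full dataset are affected.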


## 3. Discussion