# DSCI 100: Project Final Report
Names of Group Members: Ahmed Alkarkhi, Annie Wu, Mishka Mitchell, William Kizell 

Course: DSCI 100 – 2025W1

## Introduction

A research group led by Frank Wood has established a Minecraft server to study how people engage with video games. The data collected includes a range of player characteristics and gameplay metrics. In this report, we focus on a predictive question that explores whether certain traits are associated with interest in the game’s online community.

Specifically, we ask: *“Can player experience level and hours played predict newsletter subscription in the players dataset?”*

To answer this, we analyze the [players.csv](players.csv) dataset, which contains 196 observations and 7 variables:
- experience (chr): Player’s level of in-game experience
- subscribe (lgl): Whether the player is subscribed to the newsletter
- hashedEmail (chr): Player’s hashed email
- played_hours (dbl): Number of hours the player has played
- name (chr): Player’s name
- gender (chr): Player’s gender
- Age (dbl): Player’s age

We expect experience and played_hours to be meaningfully associated with the categorical response subscribe. Players with higher experience levels tend to be more engaged and familiar with the game’s ecosystem, and players who have accumulated more played_hours have demonstrated sustained involvement over time. Such players are likely more invested in the game and its community, making them more inclined to subscribe to the newsletter to stay connected and informed.

Together, these factors suggest that both experience and played_hours should serve as effective predictors of the response variable subscribe.

## Methods & Results


The data for this analysis was imported from a GitHub repository as a .csv file. To investigate whether player experience level and played_hours can predict newsletter subscription, the relevant variables (experience, played_hours, and the response variable subscribe) were selected and prepared for classification. 

The subscribe variable was converted to a factor type to reflect its categorical nature. Summary statistics were computed for the two predictors, and exploratory scatterplots were created to visualize how experience and played_hours relate to subscription status, using colour to highlight the subscribed and non-subscribed groups.
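The factor conversion and exploratory plot described above could be sketched roughly as follows. This is an illustration only: it uses the six rows printed by head(players) in this report rather than the full dataset, and the `log1p()` transform of played_hours and the figure title are our assumptions, not steps confirmed by the analysis.

```r
library(tidyverse)

# The six rows shown by head(players) in this report (not the full dataset)
toy_players <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

eda_data <- toy_players |>
  mutate(subscribe = as_factor(subscribe),    # categorical response
         log_hours = log1p(played_hours))     # assumption: log(1 + x) tames the skew and keeps zero-hour players

# Jitter horizontally only, so the plotted hours are not distorted
eda_plot <- ggplot(eda_data, aes(x = experience, y = log_hours, colour = subscribe)) +
  geom_jitter(width = 0.15, height = 0, size = 3, alpha = 0.7) +
  labs(x = "Experience level",
       y = "log(1 + hours played)",
       colour = "Subscribed",
       title = "Figure 1: Hours played by experience level and subscription status")

eda_plot
```

Because experience is categorical, the points are jittered within each level rather than placed on a continuous axis.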

Since the response variable has two possible outcomes, TRUE or FALSE, we used K-nearest neighbours (KNN) classification to address the predictive question. The dataset was split into a training set containing 75% of the observations and a testing set containing the remaining 25%. We applied five-fold cross-validation so that the accuracy estimates depend less on any single split of the data, and we fixed the random seed so that the results are reproducible.
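A minimal sketch of the 75/25 split and five-fold cross-validation, assuming the tidymodels packages are available. The data here are randomly generated stand-ins (`toy_players` is hypothetical), the seed value is arbitrary, and stratifying on subscribe is our assumption rather than something the report states.

```r
library(tidyverse)
library(tidymodels)

set.seed(2025)  # arbitrary value; any fixed seed makes the split and folds reproducible

# Synthetic stand-in for the players data (random values, for illustration only)
toy_players <- tibble(
  played_hours = rexp(200, rate = 0.2),
  experience   = sample(c("Amateur", "Regular", "Veteran", "Pro"), 200, replace = TRUE),
  subscribe    = factor(sample(c("FALSE", "TRUE"), 200, replace = TRUE, prob = c(0.27, 0.73)))
)

# 75% training / 25% testing; strata keeps the class balance similar in both sets (assumption)
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Five cross-validation folds built from the training set only
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

nrow(players_train)
nrow(players_test)
```

Keeping the test set out of the folds means it is only used once, for the final evaluation.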

To tune the model, we tested values of K from 1 to 10 and chose the value that produced the highest estimated accuracy. Before training the model, we scaled and centered the predictor variables so that differences in measurement units would not cause one variable to have more influence than the other during the distance calculations in the KNN algorithm.
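The tuning procedure above might look like the sketch below, assuming tidymodels with the kknn engine installed. The data are synthetic, and two details are our assumptions: the experience levels and their ordering (`exp_levels`), and the encoding of experience as an integer rank (`experience_num`) so that it can enter the distance calculation.

```r
library(tidyverse)
library(tidymodels)  # the kknn package must also be installed for the "kknn" engine

set.seed(2025)  # arbitrary seed for reproducibility

# Assumed set and ordering of experience levels (not confirmed by the report)
exp_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Synthetic stand-in for the players data (random values, for illustration only)
toy_players <- tibble(
  played_hours = rexp(200, rate = 0.2),
  experience   = sample(exp_levels, 200, replace = TRUE),
  subscribe    = factor(sample(c("FALSE", "TRUE"), 200, replace = TRUE, prob = c(0.27, 0.73)))
) |>
  mutate(experience_num = as.numeric(factor(experience, levels = exp_levels)))

players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Scale and center both predictors so neither dominates the distance calculation
knn_recipe <- recipe(subscribe ~ played_hours + experience_num, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = tune() marks K as the parameter to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

k_grid <- tibble(neighbors = 1:10)

# Cross-validated accuracy for each candidate K
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest mean accuracy across folds (first one wins on ties)
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Plotting `mean` against `neighbors` from `knn_results` is a common way to check whether the chosen K sits on a stable plateau rather than an isolated peak.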

##TODO: add the training, and after training!! (...the K-NN classification was performed. Then, the quality of the model was tested by its metrics on the test data and the effectiveness of the model was visualised with plots...)
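The training and evaluation steps outlined in the note above could be sketched as follows, again assuming tidymodels with the kknn engine. Everything specific here is a placeholder: the data are synthetic, `neighbors = 5` stands in for whatever K the tuning step selects, and the integer encoding of experience (`experience_num`, with the assumed ordering in `exp_levels`) is our assumption.

```r
library(tidyverse)
library(tidymodels)  # the kknn package must also be installed for the "kknn" engine

set.seed(2025)  # arbitrary seed for reproducibility

exp_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")  # assumed ordering

# Synthetic stand-in for the players data (random values, for illustration only)
toy_players <- tibble(
  played_hours = rexp(200, rate = 0.2),
  experience   = sample(exp_levels, 200, replace = TRUE),
  subscribe    = factor(sample(c("FALSE", "TRUE"), 200, replace = TRUE, prob = c(0.27, 0.73)))
) |>
  mutate(experience_num = as.numeric(factor(experience, levels = exp_levels)))

players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

knn_recipe <- recipe(subscribe ~ played_hours + experience_num, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Placeholder K; in the real analysis this would be the tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit on the training set only
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Predict on the held-out test set and attach the true labels
test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

test_accuracy <- test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_conf_mat <- test_predictions |>
  conf_mat(truth = subscribe, estimate = .pred_class)

test_accuracy
test_conf_mat
```

The confusion matrix shows which class the errors fall in, which accuracy alone does not reveal.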

Experience level and played_hours were chosen as the predictors because they represent a player’s involvement and commitment to the game. Experience reflects how familiar a player is with the gameplay, while played_hours captures the amount of time the player has spent in the game. These characteristics are important to explore because players who are more engaged may also be more likely to subscribe to the game’s newsletter and stay connected with the community.


**Loading necessary data**

In [1]:
# tidyverse attaches ggplot2, dplyr, and readr (which provides read_csv)
library(tidyverse)

# Load the player-level data
players <- read_csv("players.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


**Wrangling the data to the format necessary for the planned analysis**

In [2]:
# Keep only the variables used in the analysis and make the response a factor
players <- players |>
        select(played_hours, experience, subscribe) |>
        mutate(subscribe = as_factor(subscribe))

head(players)

played_hours,experience,subscribe
<dbl>,<chr>,<fct>
30.3,Pro,True
3.8,Veteran,True
0.0,Veteran,False
0.7,Amateur,True
0.1,Regular,True
0.0,Amateur,True


**Performing a summary of the data set that is relevant for exploratory data analysis related to the planned analysis**

In [5]:
total <- nrow(players)

# Count and percentage of players in each subscription class
subscribe_counts <- players |>
    group_by(subscribe) |>
    summarize(count = n(),
              "distribution (%)" = n() / total * 100)

subscribe_counts

subscribe,count,distribution (%)
<fct>,<int>,<dbl>
False,52,26.53061
True,144,73.46939
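
Because the two classes are imbalanced, a useful reference point for any classifier is the accuracy obtained by always predicting the majority class. The short calculation below uses only the counts from the table above; the variable names are ours.

```r
# Class counts taken from the summary table above
counts <- c("FALSE" = 52, "TRUE" = 144)

# Accuracy of a classifier that always predicts the most common class
majority_baseline <- max(counts) / sum(counts)

round(majority_baseline, 4)
# → 0.7347
```

A KNN model is only informative here if its test accuracy clearly exceeds this value.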


In [6]:
# Statistical mode is not included: base R's mode() returns the storage type, not the most frequent value
summary_played_hours <- players |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE),
              med_played_hours = median(played_hours, na.rm = TRUE),
              sd_played_hours = sd(played_hours, na.rm = TRUE),
              min_played_hours = min(played_hours, na.rm = TRUE),
              max_played_hours = max(played_hours, na.rm = TRUE))

summary_played_hours

mean_played_hours,med_played_hours,sd_played_hours,min_played_hours,max_played_hours
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.845918,0.1,28.35734,0,223.1
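
Base R's `mode()` reports an object's storage type (for a vector of doubles, `"numeric"`) rather than its most frequent value. If a statistical mode is wanted, one option is a small helper like the hypothetical `stat_mode()` below, shown here on the six played_hours values printed by head(players) rather than the full dataset.

```r
# Hypothetical helper: most frequent non-missing value (first one wins on ties)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  values[which.max(tabulate(match(x, values)))]
}

# played_hours values from the head(players) output in this report
hours <- c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0)

stat_mode(hours)
# → 0

mode(hours)
# → "numeric"
```

Inside `summarize()`, this would be called as `stat_mode(played_hours)`.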


In [None]:
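# Number of players and mean hours played within each experience level
summary_exp <- players |>
    group_by(experience) |>
    summarize(count = n(), mean_played_hours = mean(played_hours, na.rm = TRUE))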

**Creating a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis**

**Performing the data analysis**

**Creating a visualization of the analysis**

## Discussion
- Summarize what you found
- Discuss whether this is what you expected to find
- Discuss what impact such findings could have
- Discuss what future questions this could lead to

## References
You may include references if necessary, as long as they all have a consistent citation style.