# Term Project (Group 34)

---

**Anis, Emilia, Eric, Peter**

## Introduction

### Background

The Pacific Laboratory for Artificial Intelligence is a research group in the Department of Computer Science at the University of British Columbia, led by Frank Wood. As part of their work, they have set up a Minecraft server to collect data about players' actions within the game. To run this project, they need to know how to target their recruitment efforts and make sure they have enough resources. This report supports the research group with a data analysis that investigates which player characteristics and behaviours are most predictive of subscribing to a game-related newspaper, and how these features differ between player types. More specifically, the following research question is investigated:

> *Can a player's age, experience, gender, and number of hours played predict subscription to a game-related newspaper?*

### Data description

This analysis uses a dataset with information about 196 players on the server. The data were collected on the server with the players' consent. The dataset has dimensions 196 x 7 and contains missing values. For each player, the following variables are recorded.

| Variable name    | Type | Description |
| -------- | ------- | ------- |
| experience  | Character    | The player's experience level (Amateur/ Beginner/ Regular/ Veteran/ Pro)  |
| subscribe | Logical     | Does the player subscribe to a game-related newspaper (TRUE/ FALSE) |
| hashedEmail    | Character    | Unique hash-code representing the player's email |
| played_hours    | Double    | Total number of hours played |
| name    | Character    | Player's name |
| gender    | Character    | Player's gender (Female/ Male) |
| Age    | Double    | Player's age |
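
The description above notes that the dataset contains missing values. A minimal sketch of how the number of missing values per variable could be counted is shown here. The helper name `count_missing` and the toy data frame `toy_players` are hypothetical and used only for illustration; the toy values are not taken from `players.csv`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: count the missing values in every column of a data frame
# and return one row per variable.
count_missing <- function(df) {
  df |>
    summarize(across(everything(), ~ sum(is.na(.x)))) |>
    pivot_longer(everything(), names_to = "variable", values_to = "n_missing")
}

# Toy data with made-up values (NOT from players.csv), used only to show the idea.
toy_players <- tibble(
  played_hours = c(0.5, NA, 3.2),
  Age = c(17, 21, NA),
  subscribe = c(TRUE, FALSE, TRUE)
)

missing_counts <- count_missing(toy_players)
missing_counts
```

Once the real dataset has been loaded, the same helper could be applied as `count_missing(players)` to see which of the seven variables are affected.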

#### Load libraries


In [1]:
library(tidyverse)
library(tidymodels)

set.seed(10)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

#### Load dataset

In [2]:
players <- read_csv("https://raw.githubusercontent.com/emiliaosterlund/dsci-100-project/refs/heads/main/players.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


#### Summary statistics

The numerical variables in the dataset, the players' age and number of hours played, are summarized in the table below.

In [3]:
# Compute min, max, mean, and standard deviation for both numeric variables,
# ignoring missing values and rounding to two decimals.
# (Named numeric_summary to avoid masking base R's summary() function.)
numeric_summary <- summarize(players, 
         min_played_hours = round(min(played_hours, na.rm = TRUE), 2),
         max_played_hours = round(max(played_hours, na.rm = TRUE), 2),
         mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
         sd_played_hours = round(sd(played_hours, na.rm = TRUE), 2),
         min_age = round(min(Age, na.rm = TRUE), 2),
         max_age = round(max(Age, na.rm = TRUE), 2),
         mean_age = round(mean(Age, na.rm = TRUE), 2),
         sd_age = round(sd(Age, na.rm = TRUE), 2))

# Reshape the one-row summary into a table with one row per variable.
summary_table <- tibble(
  variable = c("played_hours", "Age"),
  min  = c(numeric_summary$min_played_hours, numeric_summary$min_age),
  max  = c(numeric_summary$max_played_hours, numeric_summary$max_age),
  mean = c(numeric_summary$mean_played_hours, numeric_summary$mean_age),
  std  = c(numeric_summary$sd_played_hours, numeric_summary$sd_age)
)

summary_table

| variable (chr) | min (dbl) | max (dbl) | mean (dbl) | std (dbl) |
| -------- | ------- | ------- | ------- | ------- |
| played_hours | 0 | 223.1 | 5.85 | 28.36 |
| Age | 9 | 58.0 | 21.14 | 7.39 |
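
The table above covers only the numerical variables. Since the research question concerns subscription, the categorical variables could be summarized in a similar way, for example as the share of subscribers within each group. The following is a minimal sketch under stated assumptions: the helper name `subscription_rate_by` and the toy data frame `toy_players` are hypothetical, and the toy values are made up rather than taken from `players.csv`.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: for each level of a grouping column, count the players
# and compute the proportion whose `subscribe` value is TRUE.
subscription_rate_by <- function(df, group_col) {
  df |>
    group_by({{ group_col }}) |>
    summarize(
      n_players = n(),
      subscribe_rate = round(mean(subscribe, na.rm = TRUE), 2),
      .groups = "drop"
    )
}

# Toy data with made-up values (NOT from players.csv), used only to show the idea.
toy_players <- tibble(
  experience = c("Amateur", "Amateur", "Pro", "Pro", "Pro", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

rates <- subscription_rate_by(toy_players, experience)
rates
```

With the real data, the same helper could be called as `subscription_rate_by(players, experience)` or `subscription_rate_by(players, gender)`. Reporting `n_players` next to the rate makes it easier to spot groups that are too small for their rate to be reliable.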


## Methods & Results

## Discussion

## References