# Part 1: Data Description

In [1]:
# Set up
library(tidyverse)
library(repr)
library(tidymodels)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample     

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [None]:
# Load Data
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
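# Preview the players table
print(players)

# List the distinct experience levels
unique_experiences_vector <- players |>
    pull(experience) |>
    unique()
print(unique_experiences_vector)

# List the distinct gender categories
unique_gender_vector <- players |>
    pull(gender) |>
    unique()
print(unique_gender_vector)

# Summary statistics for every column
summary(players)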

### Description of players.csv

- Number of variables: 7
- Number of rows: 196
- Each row corresponds to a player

7 variables:
- experience (categorical)
- subscribe (boolean)
- hashedEmail (string)
- played_hours (double)
- name (string)
- gender (categorical)
- Age (double)

Categorical variables:
- experience ("Pro", "Veteran", "Amateur", "Regular", "Beginner")
- gender ("Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-Spirited")

Numerical variables:
- played_hours mean: 5.85
- Age mean: 21.14 (computed after excluding 2 NAs)

Summary table:
  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                                       NA's   :2      


## Part 1: Data Description (players.csv)

This report analyzes the `players.csv` dataset, which contains information about video game players.

### 1.1 Data Source and Collection

The `players.csv` dataset was provided as part of the DSCI 100 course materials. It was collected by a research group in Computer Science at UBC that is investigating how people play video games.

### 1.2 Data Structure

The dataset is a single table containing **196 observations** (rows) and **7 variables** (columns). Each row corresponds to a single player.

### 1.3 Variable Summary

The following table summarizes all variables in the dataset:

| Variable Name | Data Type | Description |
| :--- | :--- | :--- |
| `experience` | `chr` | Player's self-reported experience level (Categorical: "Pro", "Veteran", "Amateur", "Regular", "Beginner") |
| `subscribe` | `logi` | Whether the player is subscribed (Boolean: `TRUE` / `FALSE`) |
| `hashedEmail` | `chr` | A hashed email address for the player (Identifier) |
| `played_hours` | `num` | Total hours played by the player (Numeric) |
| `name` | `chr` | The player's name or username (Identifier) |
| `gender` | `chr` | The player's self-reported gender (Categorical: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-Spirited") |
| `Age` | `num` | The player's age in years (Numeric) |

### 1.4 Summary Statistics

The table below provides summary statistics for the quantitative variables in the dataset:

| Variable | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `played_hours` | 0.00 | 0.00 | 0.10 | 5.85 | 0.60 | 223.10 |
| `Age` | 9.00 | 17.00 | 19.00 | 21.14 | 22.75 | 58.00 |

*Note: All statistics for `Age` were calculated after excluding 2 missing values.*
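As a minimal sketch of how missing values affect these statistics, the example below uses a small made-up table (`toy_players` and its values are illustrative, not taken from `players.csv`). The age column is written `Age`, matching the `summary()` output. Without `na.rm = TRUE`, `mean()` would return `NA` for a column that contains missing values.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv (values are illustrative only)
toy_players <- tibble::tibble(
    played_hours = c(0, 0.1, 0.6, 30.3),
    Age = c(17, 19, NA, 22)
)

toy_summary <- toy_players |>
    summarize(
        mean_played_hours = mean(played_hours),
        mean_age = mean(Age, na.rm = TRUE),   # drop NAs before averaging
        n_missing_age = sum(is.na(Age))       # count how many ages are missing
    )
print(toy_summary)
```

Reporting the number of missing values alongside the mean keeps it clear how many rows the statistic is based on.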

### 1.5 Data Issues and Limitations

Based on the initial exploration, the following issues and potential limitations have been identified:

**Observed Issues:**
- Missing Values: The `Age` variable contains 2 `NA` values. These will need to be handled (e.g., removed or imputed) before modeling.
- Irrelevant Variables: The `hashedEmail` and `name` variables are identifiers. They hold no predictive value and should be removed to protect privacy and to keep arbitrary identifiers out of the model.
- Data Types: The `experience` and `gender` variables are stored as character (`chr`) types. They will need to be converted to factors (`factor`) before being used in most R models.
- Skewed Data: The `played_hours` variable is highly right-skewed. The mean (5.85) is much larger than the median (0.10), and the maximum (223.10) lies far above the 3rd quartile (0.60).

**Potential Limitations (not directly visible):**
- Self-Reported Data: Variables like `experience` and `gender` are likely self-reported, which could introduce bias or inaccuracies.
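The observed issues above could be handled along the following lines. This is only a sketch on a made-up table (`toy_players`, its hashes, names, and values are hypothetical), and it assumes that rows with a missing age are removed rather than imputed. Column names follow the `summary()` output (`experience`, `gender`, `Age`).

```r
library(dplyr)

# Hypothetical toy data with the same columns as players.csv
toy_players <- tibble::tibble(
    experience = c("Pro", "Amateur", "Veteran"),
    subscribe = c(TRUE, FALSE, TRUE),
    hashedEmail = c("a1", "b2", "c3"),   # made-up identifiers
    played_hours = c(30.3, 0.1, 0),
    name = c("P1", "P2", "P3"),          # made-up names
    gender = c("Male", "Female", "Non-binary"),
    Age = c(9, NA, 21)
)

players_clean <- toy_players |>
    select(-hashedEmail, -name) |>               # drop identifier columns
    mutate(
        experience = as.factor(experience),      # chr -> factor
        gender = as.factor(gender)
    ) |>
    filter(!is.na(Age))                          # assumption: drop rows with missing age
print(players_clean)
```

If `hashedEmail` is still needed to join with `sessions.csv`, it would have to be dropped after the join instead of here.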

In [None]:
print(sessions)
summary(sessions)

## Part 1: Data Description (sessions.csv)


### 1.1 Data Source and Collection

This dataset was collected by the same research group that collected `players.csv`. It is linked to `players.csv` via the `hashedEmail` variable.
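A minimal sketch of how the two tables could be linked through `hashedEmail` is shown below. The toy tables and their values are hypothetical, and the per-player session count is only one illustration of what the join makes possible.

```r
library(dplyr)

# Hypothetical toy versions of the two tables (values are illustrative only)
toy_players <- tibble::tibble(
    hashedEmail = c("a1", "b2"),
    experience = c("Pro", "Amateur")
)
toy_sessions <- tibble::tibble(
    hashedEmail = c("a1", "a1", "b2"),
    original_start_time = c(1.712e12, 1.716e12, 1.719e12)
)

# Attach player attributes to every session row
sessions_with_players <- toy_sessions |>
    left_join(toy_players, by = "hashedEmail")

# Count sessions per player
sessions_per_player <- sessions_with_players |>
    count(hashedEmail, experience, name = "n_sessions")
print(sessions_per_player)
```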

### 1.2 Data Structure

The dataset is a single table containing **1,535 observations** (rows) and **5 variables** (columns). Each row corresponds to a single play session.

### 1.3 Variable Summary

The following table summarizes all variables in the dataset:

| Variable Name | Data Type | Description |
| :--- | :--- | :--- |
| `hashedEmail` | `chr` | A hashed email address for the player (Identifier, links to `players.csv`) |
| `start_time` | `chr` | The session start time, formatted as a character string (e.g., "30/06/202...") |
| `end_time` | `chr` | The session end time, formatted as a character string (e.g., "30/06/202...") |
| `original_start_time` | `num` | The session start time as a numeric timestamp (likely Unix time in milliseconds) |
| `original_end_time` | `num` | The session end time as a numeric timestamp (likely Unix time in milliseconds) |

### 1.4 Summary Statistics

The table below provides summary statistics for the quantitative variables in the dataset.

| Variable | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `original_start_time` | `1.712e+12` | `1.716e+12` | `1.719e+12` | `1.719e+12` | `1.722e+12` | `1.727e+12` |
| `original_end_time` | `1.712e+12` | `1.716e+12` | `1.719e+12` | `1.719e+12` | `1.722e+12` | `1.727e+12` |

*Note: The statistics for `original_end_time` were calculated after ignoring 2 missing values.*
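If the numeric columns are indeed Unix time in milliseconds (an assumption, as noted in the variable table), they can be made readable by dividing by 1000 and converting with `lubridate::as_datetime()`. The sketch below uses the rounded minimum and maximum from the table above, so the resulting dates are approximate.

```r
library(lubridate)

# Rounded min and max from the summary table; assumed to be milliseconds since 1970-01-01 UTC
ms_timestamps <- c(1.712e12, 1.727e12)

# as_datetime() expects seconds, so convert from milliseconds first
approx_range <- as_datetime(ms_timestamps / 1000, tz = "UTC")
print(approx_range)
```

Under this assumption both values fall in 2024, which is a quick plausibility check on the millisecond interpretation.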

### 1.5 Data Issues and Limitations

Based on the initial exploration, the following issues and potential limitations have been identified:

**Observed Issues:**
- Missing Values: The `original_end_time` variable contains 2 `NA` (missing) values. These rows will likely need to be removed as we cannot calculate session duration.
- Incorrect Data Types: The `start_time` and `end_time` variables are stored as character strings (`chr`) instead of proper `datetime` objects. They will need to be parsed and converted to be useful for calculating session lengths.
- Redundant Information: The dataset contains two sets of timestamps: one set as character strings (`start_time`, `end_time`) and one as numeric values (`original_start_time`, `original_end_time`). It will be necessary to check if these columns contain identical information after conversion. The numeric columns are more likely to be used for calculations.
- Irrelevant Variables: The `hashedEmail` variable is an identifier. While crucial for joining with `players.csv`, it should not be used as a predictive feature in a model.

**Potential Limitations:**
- Timestamp-Only Data: This dataset only contains session start and end times. It provides no information about *what* the player did during that session, limiting the scope of questions we can answer.
- Session Definition: It is unclear how a "session" is defined. For example, does a 5-second disconnect and reconnect count as a new session? This could impact calculations of "total time played."
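The parsing and missing-value issues above could be addressed roughly as follows. This sketch assumes the character timestamps follow a day/month/year hour:minute layout. The full format is not visible in the truncated preview, so the toy strings below are hypothetical and the parser may need to change for the real data.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy sessions; the "dd/mm/yyyy HH:MM" layout is an assumption
toy_sessions <- tibble::tibble(
    hashedEmail = c("a1", "a1", "b2"),
    start_time = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
    end_time = c("30/06/2024 18:24", "01/07/2024 09:45", NA)
)

session_lengths <- toy_sessions |>
    mutate(
        start_time = dmy_hm(start_time),   # chr -> datetime
        end_time = dmy_hm(end_time),
        duration_mins = as.numeric(difftime(end_time, start_time, units = "mins"))
    ) |>
    filter(!is.na(duration_mins))          # drop sessions without an end time
print(session_lengths)
```

Comparing the parsed strings with the converted numeric columns would then show whether the two sets of timestamps carry the same information.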

# Part 2: Questions


### 2.1 Broad Question

We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

### 2.2 Specific Question

**Specific Question:** Can a player's **age** be used to predict the **total amount of time** that player spends in the game?

### 2.3 Addressing the Question with Data

We will use the **`players.csv`** dataset to answer this question.

* The **explanatory variable** will be `Age`, a numeric variable from `players.csv`.
* The **response variable** will be `played_hours`, a numeric variable from `players.csv` that represents the total time a player has spent in the game.

### 2.4 Data Wrangling Plan

To prepare the data for analysis, we will use the `players.csv` data. The `sessions.csv` data is not required for this specific question, as `played_hours` is already a pre-aggregated total.

The minimal wrangling steps are:

1.  **Select Variables:** We will select the `Age` and `played_hours` columns from the `players.csv` data.
2.  **Handle Missing Values:** The `Age` variable has two `NA` (missing) values. Since our explanatory variable cannot be missing, we will remove these two rows from the dataset for this analysis.
3.  **Assess Skew:** As noted in Part 1, `played_hours` is highly right-skewed. Our exploratory data analysis will investigate this skew (e.g., with a histogram) to determine if a transformation is necessary for modeling.
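The three steps above can be sketched as follows. The example runs on a small made-up table (`toy_players` and its values are hypothetical) so that it is self-contained. In the notebook the same pipeline would start from `players`. The bin count and axis labels are arbitrary choices.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for players.csv
toy_players <- tibble::tibble(
    Age = c(9, 17, 19, NA, 22, 58),
    played_hours = c(30.3, 0, 0.1, 0.6, 0, 223.1),
    gender = c("Male", "Female", "Male", "Agender", "Female", "Male")
)

# Steps 1 and 2: keep the two variables of interest and drop rows with missing age
players_model <- toy_players |>
    select(Age, played_hours) |>
    filter(!is.na(Age))

# Step 3: histogram to inspect the skew of played_hours
played_hours_hist <- ggplot(players_model, aes(x = played_hours)) +
    geom_histogram(bins = 30) +
    labs(x = "Total hours played", y = "Number of players")
played_hours_hist
```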