# Title

#### Do a player's gender and total play time correlate with the player's experience level?

In [2]:
# Load and inspect the data
library(tidyverse)
library(tidymodels)

players <- read_csv("https://raw.githubusercontent.com/achan919/dsci-final-project/refs/heads/main/players.csv")

head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


| Variable     | Description |
|--------------|-------------|
| experience   | Level of experience (Pro, Veteran, Amateur, Regular, Beginner) |
| subscribe    | Whether the player subscribed to the newsletter |
| hashedEmail  | Hashed/hidden email of the player |
| played_hours | Total hours played (mean = 5.846) |
| name         | Player's first name |
| gender       | Gender (Male, Female, Non-binary, Prefer not to say, Agender, Two-Spirited, Other) |
| Age          | Player age in years (mean = 21.14) |

# Introduction

A research group is collecting data about how people play video games. They have set up a Minecraft server, and players' actions are recorded as they navigate through the world. To run this project, they need to target their recruitment efforts and make sure they have enough resources to handle the number of players they attract. The purpose of our project is to propose one way to target recruitment and to use data analysis to assess whether it can effectively assist their efforts.

Understanding how different player characteristics relate to gameplay experience is essential for designing engaging game environments and tailoring content to different player groups. In behavioral game analytics, experience level is often used as a proxy for proficiency, engagement, or familiarity with game mechanics. Identifying factors that correlate with experience level can help inform game balancing, player retention strategies, and the design of personalized player experiences.

In this project, we analyze player information and gameplay behaviour using one data source, `players.csv`, a player-level dataset containing demographic and gameplay variables. Our research question is:

**Do a player’s gender and total play time relate to their experience level?**

To answer this question, we examine whether total accumulated play time and gender distribution differ across experience groups. By summarizing and visualizing the relationships between these variables, we aim to uncover meaningful differences across player types.

#### Data Description
The file `players.csv` contains data collected by a research group in Computer Science at UBC, which we use to complete our project. It holds information on 196 players in 7 variable columns:
- `experience` - Experience level
- `subscribe` - The subscription state of the player
- `hashedEmail` - The hashed email address of the player
- `played_hours` - Time the player spent on the game (in hours)
- `name` - Name
- `gender` - Gender
- `Age` - Age

# Methods & Results

We first read the CSV file into a data frame. The load summary confirms 196 players and 7 columns. The 1,535 raw sessions refer to session-level data, which is not loaded in this notebook.

**1. Wrangling + Cleaning**

The players table contains demographic and experience-related attributes describing each user. Variables such as `experience` and `gender` are categorical. To ensure they are treated correctly in grouping, plotting, and modeling, we convert these fields to factors with `as_factor()`.

In [18]:
#Tidy players data
players_tidy <- players |>
  mutate(
    experience = as_factor(experience),
    gender = as_factor(gender)
  )
head(players_tidy)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<fct>,<lgl>,<chr>,<dbl>,<chr>,<fct>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


For the player data set, we also compute the mean of each numeric variable, which summarizes overall trends in player attributes such as age and play time. A natural follow-up is to explore play-time differences by subscription status, since subscription may correlate with play intensity or session frequency.

In [19]:
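#Calculate the mean of each numeric player variable
#(an anonymous function replaces the deprecated `across(..., mean, na.rm = TRUE)` form)
player_means <- players_tidy |>
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) |>
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Mean")
player_means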

Variable,Mean
<chr>,<dbl>
played_hours,5.845918
Age,21.139175
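
The comparison of play time by subscription status mentioned above is not computed in this notebook. A minimal sketch of how it might look is below. It uses only the six rows printed by `head(players)`, so the numbers illustrate the method and say nothing about the full data set; the name `players_head` is ours.

```r
library(dplyr)

# Toy data: subscribe and played_hours from the six rows shown by head(players)
players_head <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0)
)

# One row per subscription status with group size, mean and median hours
hours_by_subscription <- players_head |>
  group_by(subscribe) |>
  summarise(
    n_players    = n(),
    mean_hours   = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )
hours_by_subscription
```

Replacing `players_head` with `players_tidy` would give the summary for all 196 players.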


**2. Summary of the data set relevant to the planned analysis**

Our research question focuses on whether a player’s **gender** and **total play time** are related to their **experience level**. Because all three variables (gender, experience, total play time) are player-level characteristics, we need a dataset in which **each row represents exactly one player**. The players dataset already has this shape, but the session data contains one row per game session, so it must be aggregated: all sessions belonging to the same player are combined into a single summarized record. Session-level data cannot be used directly for comparing players because:

1. **Multiple observations per player** cause statistical dependence and would artificially inflate sample size.
2. **Experience level and gender** are properties of the player, not individual sessions.
3. To compare total play time across experience groups, we need each player to have:
   - total time spent playing,
   - number of sessions,
   - average session duration,
   - and median session duration (a robust measure unaffected by extreme values).

Creating player-level summaries ensures one row per player, which is consistent with our analysis unit.

For each player, we compute:

- **sessions_num** — how many sessions the player has played  
  (a measure of activity)
- **total_time** — total time spent playing across all sessions  
  (the variable directly related to our research question)
- **mean_time** — average session duration  
  (helps describe typical session length)
- **median_time** — median session duration  
  (less sensitive to extreme session lengths)

These variables describe a player’s overall engagement with the game.

Although players have a `name` variable, multiple players may share the same name. `hashedEmail` is a unique and anonymized identifier for each player, so using it ensures sessions are grouped correctly, and data from different players is never mixed. Thus, all summarization and joining operations use `hashedEmail` as the primary key.
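
The choice of key can be checked rather than assumed. The sketch below uses a hypothetical three-player table (the IDs `id_a`, `id_b`, `id_c` are made up) to show how duplicate names and duplicate IDs could be detected with `count()`.

```r
library(dplyr)

# Hypothetical toy players: two share a name but have different IDs
toy_players <- tibble(
  hashedEmail = c("id_a", "id_b", "id_c"),
  name        = c("Morgan", "Morgan", "Blake")
)

# Names that occur more than once (non-empty here, so name is not a safe key)
duplicated_names <- toy_players |>
  count(name) |>
  filter(n > 1)

# IDs that occur more than once (empty when hashedEmail is unique)
duplicated_ids <- toy_players |>
  count(hashedEmail) |>
  filter(n > 1)

duplicated_names
duplicated_ids
```

Running the same two checks on `players` would confirm whether `hashedEmail` is in fact unique there.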

In [16]:
#Keep only relevant variables for aggregation
#NOTE: `merged_sessions` is not defined anywhere in this notebook; it must be
#created from the session data (one row per session, with a `diff_time`
#duration column) before this cell can run.
sessions_names_only <- merged_sessions |>
  select(hashedEmail, name, diff_time)

#Count sessions and compute total, mean & median session length per player
sessions_by_player <- sessions_names_only |>
  group_by(hashedEmail, name) |>
  summarize(
    sessions_num = n(),
    total_time = sum(diff_time, na.rm = TRUE),
    mean_time = mean(diff_time, na.rm = TRUE),
    median_time = median(diff_time, na.rm = TRUE),
    .groups = "drop"
  )

#Attach the summaries to the tidied player table, keeping every player
player_level <- left_join(players_tidy, sessions_by_player,
                          by = c("hashedEmail", "name"))

head(player_level)

ERROR: Error in eval(expr, envir, enclos): object 'merged_sessions' not found
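
Since the cell above cannot run without the session data, here is a self-contained sketch of the same aggregation on made-up sessions. Everything in it is hypothetical: the IDs, the durations, and the assumption that `diff_time` holds one numeric duration per session.

```r
library(dplyr)

# Hypothetical sessions: one row per session; diff_time is an assumed duration column
toy_sessions <- tibble(
  hashedEmail = c("id_a", "id_a", "id_a", "id_b"),
  diff_time   = c(10, 20, 60, 5)
)

# Hypothetical players: id_c has no recorded sessions
toy_players <- tibble(
  hashedEmail = c("id_a", "id_b", "id_c"),
  experience  = c("Pro", "Amateur", "Veteran")
)

# Collapse sessions to one row per player
per_player <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(
    sessions_num = n(),
    total_time   = sum(diff_time),
    mean_time    = mean(diff_time),
    median_time  = median(diff_time),
    .groups = "drop"
  )

# Left join keeps every player; players without sessions get NA summaries
toy_player_level <- left_join(toy_players, per_player, by = "hashedEmail")
toy_player_level
```

Note that a player with no sessions ends up with `NA` rather than 0 for `total_time`, which matters when the result is plotted or averaged.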


**3. Exploratory Data Analysis and Visualizations**

In [28]:
# Gender distribution
players_tidy |>
  count(gender)

# Experience distribution
players_tidy |>
  count(experience)

# Summary of total play time by experience group
players_tidy |>
  group_by(experience) |>
  summarise(total_play_time = sum(played_hours, na.rm = TRUE))

# Summary of total play time by gender
players_tidy |>
  group_by(gender) |>
  summarise(total_play_time = sum(played_hours, na.rm = TRUE))

# Summary of total play time by gender and experience
players_tidy |>
  group_by(experience, gender) |>
  summarise(total_play_time = sum(played_hours, na.rm = TRUE)) |>
  arrange(experience, gender)

gender,n
<fct>,<int>
Male,124
Female,37
Non-binary,15
Prefer not to say,11
Agender,2
Two-Spirited,6
Other,1


experience,n
<fct>,<int>
Pro,14
Veteran,48
Amateur,63
Regular,36
Beginner,35


experience,total_play_time
<fct>,<dbl>
Pro,36.4
Veteran,31.1
Amateur,379.1
Regular,655.5
Beginner,43.7


gender,total_play_time
<fct>,<dbl>
Male,511.8
Female,393.5
Non-binary,223.2
Prefer not to say,4.1
Agender,12.5
Two-Spirited,0.5
Other,0.2


`summarise()` has grouped output by 'experience'. You can override using the
`.groups` argument.


experience,gender,total_play_time
<fct>,<fct>,<dbl>
Pro,Male,36.2
Pro,Non-binary,0.0
Pro,Other,0.2
Veteran,Male,8.9
Veteran,Female,4.4
Veteran,Non-binary,3.9
Veteran,Prefer not to say,1.4
Veteran,Agender,12.5
Amateur,Male,173.2
Amateur,Female,204.3
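
The group totals above depend on group size, which ranges from 14 to 63 players. As a rough check, the sketch below divides each printed total by the printed player count to get mean hours per player. The numbers are copied from the two experience tables above; the column names `n_players` and `mean_hours_per_player` are ours.

```r
library(dplyr)

# Counts and totals copied from the printed experience summaries
experience_summary <- tibble(
  experience      = c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
  n_players       = c(14, 48, 63, 36, 35),
  total_play_time = c(36.4, 31.1, 379.1, 655.5, 43.7)
) |>
  # Mean hours per player removes the effect of unequal group sizes
  mutate(mean_hours_per_player = total_play_time / n_players) |>
  arrange(desc(mean_hours_per_player))

experience_summary
```

The same figures could be obtained directly from `players_tidy` by replacing `sum()` with `mean()` in the grouped summary.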


**Visualization 1: Total play time by experience level**. This boxplot allows us to compare whether more experienced players accumulate more play time.
Differences in medians and spread reflect behavior patterns across groups.

In [None]:
player_level |> ggplot(aes(x = experience, y = total_time)) +
  geom_boxplot() +
  labs(
    title = "Figure 1. Total Play Time by Experience Level",
    x = "Experience Level",
    y = "Total Play Time (hours)"
  )

head(player_level)
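
This plot depends on `player_level`, which requires the session data. A similar comparison can be sketched from `played_hours`, which is already in the players table. Treating `played_hours` as a stand-in for `total_time` is our assumption, and the toy data frame below holds only the six rows printed by `head(players)`, so the resulting boxes are illustrative only.

```r
library(ggplot2)

# Toy data: experience and played_hours from the six rows shown by head(players)
players_head <- data.frame(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0)
)

# Boxplot of played hours per experience level
hours_plot <- ggplot(players_head, aes(x = experience, y = played_hours)) +
  geom_boxplot() +
  labs(
    x = "Experience Level",
    y = "Played Hours"
  )
hours_plot
```

Swapping `players_head` for `players_tidy` would draw the plot for all players.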

**Visualization 2: Gender distribution across experience levels**. This bar plot indicates whether certain genders are more represented in higher experience groups.

In [None]:
player_level |>
  ggplot(aes(x = experience, fill = gender)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Figure 2. Gender Distribution by Experience Level",
    x = "Experience Level",
    y = "Count of Players",
    fill = "Gender"
  )

head(player_level)

In [None]:
#Visualization 3: Total play time by gender and experience
player_level |>
  ggplot(aes(x = gender, y = total_time, fill = experience)) +
  geom_boxplot() +
  labs(
    title = "Figure 3. Total Time by Gender and Experience Level",
    x = "Gender",
    y = "Total Play Time"
  )

head(player_level)

# Discussion

Our analysis examined whether a player’s gender and total play time relate to their experience level. From the summary statistics and visualizations, we can see several patterns: 


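First, Figure 1 compares total play time across experience levels. We expected more experienced players to accumulate more play time, since experience is built over longer engagement with the game. The summed `played_hours` in the summary table do not show this pattern, however: Regular (655.5 hours) and Amateur (379.1 hours) players account for most of the recorded play time, while Pro (36.4 hours) and Veteran (31.1 hours) players account for the least. Because group sizes range from 14 to 63 players, these totals should be compared with care.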


Second, genders are not evenly represented: 124 of the 196 players are Male and 37 are Female, and Figure 2 is meant to show how this imbalance plays out across experience levels. While this does not establish causation, it highlights demographic differences that may contribute to differences in experience development.


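Third, combining gender and experience (Figure 3) is meant to show whether the relationship between gender and play time differs across experience groups. In the summary table, for example, Female Amateur players logged 204.3 hours in total compared with 173.2 hours for Male Amateur players, whereas Male players account for nearly all Pro play time (36.2 of 36.4 hours).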


Overall, our summaries only partly match behavioral expectations: **recorded play time is concentrated in the Regular and Amateur groups rather than increasing steadily with experience level. Gender is unevenly represented among players and may be associated with participation patterns, although more detailed modeling would be required to investigate this relationship more precisely.**

# References