# Video Game Research Project

## 1. Introduction
Video games are a multibillion-dollar industry. As a major part of the world's pop culture and entertainment landscape, they are staples of childhood, iconography, and leisure, enjoyed across a wide range of demographics.

In this report, we investigate video game data provided by a research group in Computer Science at UBC. The group collected the data from a Minecraft server that recorded players' actions as they navigated the world. To support the group's recruitment efforts, and to help ensure it has enough resources (such as licenses and hardware) for the players, we address the following research question:

Question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.


## 2. Method

To determine which kinds of players are most likely to contribute the largest amount of data, we must first define what it means for a player to “contribute data” in the context of this study. Because the dataset was collected from a Minecraft server that records player actions over time, players who spend more time on the server naturally generate more recorded events. In other words, the more hours a participant plays, the more information they output to the system. 

For this reason, we operationalize data contribution as the total number of hours a player has spent in the game. Our analytical goal is to identify which demographic or experiential characteristics are associated with higher played-hour totals, allowing us to infer which types of players are most likely to provide large amounts of data in future recruitment efforts.
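
As a rough illustration of this idea, the sketch below summarizes played hours per experience level. It uses a hypothetical `toy_players` table built from only the first five player rows displayed in this report, so the numbers are not findings. The real comparison would use the full dataset.

```python
import pandas as pd

# Toy rows copied from the first five players displayed in this report;
# `toy_players` is a hypothetical stand-in for the full players.csv.
toy_players = pd.DataFrame(
    {
        "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
        "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    }
)

# Summarize played hours per experience level. A higher mean or median
# suggests that the group tends to contribute more data.
hours_by_experience = (
    toy_players.groupby("experience")["played_hours"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(hours_by_experience)
```

The same pattern applies to `gender`, `subscribe`, or a binned `age` column by changing the grouping key.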

## 3. The Data
Our analysis examines the [*players.csv*](https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit) dataset. With 197 observations and 9 columns, it lists player profiles: each row pairs a player's total play time and self-reported experience level with demographic attributes. This data allows us to analyze and predict, for example, which age group is likely to contribute the most hours, linking play tendencies to groups within a single sample. It was collected via survey.

The columns in the *Players* dataset are:
* `experience` (`str`): the player's self-assessed experience with the game, one of *Beginner*, *Amateur*, *Regular*, *Veteran*, or *Pro*.
* `subscribe` (`bool`): whether the player subscribes to a game-related newsletter.
* `hashedEmail` (`str`): the player's email address, encoded as a hash.
* `played_hours` (`float`): total hours played.
* `name` (`str`): the player's name.
* `gender` (`str`): *male*, *female*, *non-binary*, *agender*, *two-spirit*, *other*, or *prefer not to say*.
* `age` (`int`): the player's age.
* `individualId`: presumably a player ID; the values are missing (`NaN`).
* `organizationName`: presumably the player's organization; the values are missing (`NaN`).
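
One way to confirm which columns carry no information is to check for columns in which every value is missing. The sketch below does this on a hypothetical two-row table (`toy`) that mirrors part of the column layout above. It is an illustration under that assumption, not the check used in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table mirroring part of the players.csv layout
toy = pd.DataFrame(
    {
        "experience": ["Pro", "Veteran"],
        "subscribe": [True, True],
        "played_hours": [30.3, 3.8],
        "age": [9, 17],
        "individualId": [np.nan, np.nan],
        "organizationName": [np.nan, np.nan],
    }
)

# Columns where every value is missing cannot help a model and can be dropped
all_missing = [col for col in toy.columns if toy[col].isna().all()]

print(toy.dtypes)
print(all_missing)
```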

In [1]:
# run this before continuing
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
players = pd.read_csv("data/players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
# kept hash in case we want to merge this dataset with sessions.csv

players_drop = players.drop(columns=["individualId", "organizationName"])
players_use = players_drop.dropna(subset=["played_hours", "age", "gender", "experience", "subscribe"])

players_use.head()

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21


In [4]:
# Remove all players with 0.0 played hours

mask = players_use['played_hours'].isin([0.0])
players_hrs = players_use[~mask]
players_hrs

# Optional: inspect the most active players. We put the 'upper percentile' at about
# 2 hours because this dataset already has few active players; in theory these players
# make up the core of our predictive analysis, given how we interpret "most data".
# top_data = players_hrs.sort_values('played_hours', ascending=False)
# top_data.head(26)

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
8,Amateur,True,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,0.1,Natalie,Male,17
...,...,...,...,...,...,...,...
185,Regular,False,8e98b6db2053af0bc0e62cd55bcea5a08f23986dec3d02...,0.1,Sam,Male,18
186,Veteran,True,ba24bebe588a34ac546f8559850c65bc90cd9d51b82158...,0.1,Gabriela,Female,44
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


In [5]:
players_cleaned = players[['gender','age', 'played_hours']].copy()

# Right-closed bins, [0, 10], (10, 20], (20, 30], ..., so each label matches the ages it contains
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']

players_cleaned['age_group'] = pd.cut(players_cleaned['age'], bins=bins, labels=labels, include_lowest=True)

players_cleaned

Unnamed: 0,gender,age,played_hours,age_group
0,Male,9,30.3,0-10
1,Male,17,3.8,11-20
2,Male,17,0.0,11-20
3,Female,21,0.7,21-30
4,Male,21,0.1,21-30
...,...,...,...,...
191,Female,17,0.0,11-20
192,Male,22,0.3,21-30
193,Prefer not to say,17,0.0,11-20
194,Male,17,2.3,11-20
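
How `pd.cut` treats ages that fall exactly on a bin edge depends on its `right` argument. The sketch below compares both settings on a handful of made-up ages (`ages` is illustrative, not taken from the dataset) to show which label each boundary age receives.

```python
import pandas as pd

# Made-up ages chosen to include the bin edges 10 and 20
ages = pd.Series([9, 10, 17, 20, 21])
bins = [0, 10, 20, 30]
labels = ["0-10", "11-20", "21-30"]

# right=True (the default): intervals are (0, 10], (10, 20], (20, 30]
right_closed = pd.cut(ages, bins=bins, labels=labels).astype(str)

# right=False: intervals are [0, 10), [10, 20), [20, 30)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str)

comparison = pd.DataFrame(
    {"age": ages, "right_closed": right_closed, "left_closed": left_closed}
)
print(comparison)
```

With these labels, only the right-closed setting places ages 10 and 20 in the groups whose names include them.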


In [6]:
count_plot = alt.Chart(players_cleaned).mark_bar().encode(
    x=alt.X('age_group').title('Age Range'),
    y=alt.Y('count()').title('Number of Players'),
    color=alt.Color('gender').title('Gender'),
    xOffset='gender'
).properties(
    width=400,
    height=300,
    title='Number of Players by Age Range and Gender'
)
    
count_plot

In [7]:
# Bars show the total hours per age range and gender (made explicit with sum())
hours_plot = alt.Chart(players_cleaned).mark_bar().encode(
    x=alt.X('age_group').title('Age Range'),
    y=alt.Y('sum(played_hours)').title('Total Hours Played'),
    color=alt.Color('gender').title('Gender'),
    xOffset='gender'
).properties(
    width=400,
    height=300,
    title='Total Hours Played by Age Range and Gender'
)
    
hours_plot

### Wrangling & Cleaning the Dataset

In [15]:
# Drop identifier-like columns that are not used as predictors
players_drop = players.drop(columns=["individualId", "organizationName", "hashedEmail", "name"])

# Drop rows with missing values for the variables needed in analysis
players_use = players_drop.dropna(subset=["played_hours", "age", "gender", "experience", "subscribe"])

# Remove players whose played_hours equals 0
mask = players_use["played_hours"] == 0
players_hrs = players_use[~mask]

players_hrs

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
8,Amateur,True,0.1,Male,17
...,...,...,...,...,...
185,Regular,False,0.1,Male,18
186,Veteran,True,0.1,Female,44
192,Veteran,False,0.3,Male,22
194,Amateur,False,2.3,Male,17


### Visualizing the Training Data

We visualize the training portion of the dataset to examine the relationship between age, gameplay hours, and player experience. This helps us understand whether the relationship appears linear or displays clustered, nonlinear patterns, which would justify the use of a K-Nearest Neighbors regression model.
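
For context, a K-Nearest Neighbors regression of the kind mentioned above could be tuned roughly as follows. This is a minimal sketch on synthetic stand-in data (`toy` is generated at random and is not the players data), and the grid of 1 to 10 neighbors and the 5-fold cross-validation are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real players data): a nonlinear
# relationship between age and hours played, plus random noise.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"age": rng.integers(9, 50, size=60)})
toy["played_hours"] = np.abs(20 - toy["age"]) * 0.1 + rng.random(60)

# Standardize the predictor, then fit KNN regression
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Assumed search range: 1 to 10 neighbors, scored by cross-validated RMSE
param_grid = {"kneighborsregressor__n_neighbors": list(range(1, 11))}
knn_search = GridSearchCV(
    knn_pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
knn_search.fit(toy[["age"]], toy["played_hours"])

best_k = knn_search.best_params_["kneighborsregressor__n_neighbors"]
cv_rmse = -knn_search.best_score_  # flip the sign back to a positive error
print(best_k, cv_rmse)
```

KNN makes no assumption about the shape of the relationship, which is why clustered or nonlinear patterns in the scatter plot would favor it over a straight-line fit.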


In [13]:
# Have scikit-learn transformers output DataFrames instead of arrays
set_config(transform_output="pandas")

# Set the seed
np.random.seed(1)

# Splitting the data into training and testing sets (75% / 25%)
players_train, players_test = train_test_split(
    players_hrs,
    train_size=0.75,
    random_state=1
)

# Create a scatter plot of hours played versus age,
# with points colored by experience level
players_visualization_training = (
    alt.Chart(players_train)
    .mark_circle(opacity=0.6, size=49)
    .encode(
        x=alt.X("age:Q").title("Age of Player"),
        y=alt.Y("played_hours:Q")
            .title("Hours Played")
            .scale(zero=False, type="sqrt"),
        color=alt.Color("experience:N").title("Experience Level")
    )
    .properties(
        title="Training Data Visualization Relating to Player Age, Experience, and Hours Played"
    )
)

players_visualization_training

#### Observations