Title (tbd)
-

Introduction
-

The UBC Minecraft Research server is an ongoing research project that studies how people play video games. As part of this project, player demographics and in-game activity data are collected so that the research group can ensure it has enough resources to handle the number of current and incoming players. An important challenge for the research group is understanding which types of players are most likely to remain engaged, specifically through actions such as subscribing to the server's newsletter. Newsletter subscribers generally appear to be more connected to the game and are often more likely to participate in other studies.

The question our group will answer is: “Can we predict whether a player subscribes to the newsletter using their experience level and playtime?”

Of the two available datasets, we will use players.csv to conduct our analysis. It has 196 observations and 9 variables that describe each player who has logged onto the UBC Minecraft Research server. It was loaded from a Google Drive link to ensure full reproducibility within the Jupyter environment.

In [2]:
import pandas as pd
players = pd.read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


Each row describes a single player, with variables relating to their identity (such as age and gender) and their server statistics (such as hours played).

In [3]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


In [6]:
variable_table = pd.DataFrame({
    "Variable": [
        "experience", "subscribe", "hashedEmail", "played_hours",
        "name", "gender", "age", "individualId", "organizationName"
    ],
    "Type": [
        "Categorical", "Boolean", "String", "Numeric",
        "String", "Categorical", "Integer", "NaN", "NaN"
    ],
    "Description": [
        "The reported experience of the player",
        "If the player subscribed to the newsletter",
        "The hashed email of the player",
        "The playtime of the player",
        "The username of the player",
        "The reported gender of the player",
        "The age of the player",
        "Empty",
        "Empty"
    ]
})

variable_table

Unnamed: 0,Variable,Type,Description
0,experience,Categorical,The reported experience of the player
1,subscribe,Boolean,If the player subscribed to the newsletter
2,hashedEmail,String,The hashed email of the player
3,played_hours,Numeric,The playtime of the player
4,name,String,The username of the player
5,gender,Categorical,The reported gender of the player
6,age,Integer,The age of the player
7,individualId,,Empty
8,organizationName,,Empty


The variables individualId and organizationName are empty and need to be dropped. The other variables are mostly usable, but one potential issue is that all of the self-reported fields may be inconsistent or biased. This applies especially to the experience field, whose categories (such as Amateur, Regular, Veteran, and Pro) seem rather vague and open to interpretation. Even with these issues, the dataset is good enough for making player-level predictions without needing session-level data.
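
The two concerns above can be checked directly before modelling. The following is a minimal sketch, assuming a small hypothetical stand-in frame (`demo`) rather than the real `players` data. It lists the columns in which every value is missing and counts how often each self-reported experience label appears.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `players`; the values are illustrative only.
demo = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "individualId": [np.nan] * 5,
    "organizationName": [np.nan] * 5,
})

# Columns in which every value is missing are candidates for dropping.
empty_columns = demo.columns[demo.isna().all()].tolist()

# Frequency of each self-reported experience label.
experience_counts = demo["experience"].value_counts()

print(empty_columns)
print(experience_counts)
```

Running the same two checks on `players` would confirm which columns are safe to drop and show how the players are spread across the experience labels.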

In [4]:
players_tidy = players.drop(columns=["individualId", "organizationName"])
players_tidy

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

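# Predictors (experience level and playtime) and the target (newsletter subscription)
X = players_tidy[["experience", "played_hours"]]
y = players_tidy["subscribe"]

# Hold out 20% of the players for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=1234,
)

# Standardize the numeric predictor and one-hot encode the categorical one
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["played_hours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["experience"]),
    ]
)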

preprocessor
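
The imports above include `Pipeline` and `KNeighborsClassifier`, which suggests the preprocessor is meant to feed a k-nearest-neighbours classifier. The following is a minimal sketch of how the pieces could fit together. It assumes a small synthetic stand-in data set (`toy`), an arbitrary `n_neighbors=3`, and a stratified split. None of these choices are taken from the analysis itself.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for players_tidy (20 rows); the values are illustrative only.
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Regular"] * 5,
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 2.3, 0.3, 1.5, 12.0, 0.0] * 2,
    "subscribe": [True, True, False, True, False] * 4,
})

X = toy[["experience", "played_hours"]]
y = toy["subscribe"]

# stratify=y (an assumption, not in the original split) keeps the class
# proportions similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

# Same preprocessing idea as above: scale the numeric column, encode the categorical one.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["played_hours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["experience"]),
    ]
)

# Chain preprocessing and the classifier so both are fit on the training data only.
knn_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("knn", KNeighborsClassifier(n_neighbors=3)),  # k=3 is an arbitrary placeholder
    ]
)

knn_pipeline.fit(X_train, y_train)
predictions = knn_pipeline.predict(X_test)
accuracy = knn_pipeline.score(X_test, y_test)
print(f"Test accuracy on the toy data: {accuracy:.2f}")
```

Wrapping the preprocessor in a pipeline means the scaler and encoder learn their parameters from the training rows alone, which avoids leaking information from the test set. In practice the number of neighbours would be chosen by cross-validation rather than fixed in advance.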

Methods and Results
-

Discussion
-

References
-