# Played Hours based on Demographic Attributes: Viable or Not?

## 1. Introduction
Video games are a multi-billion-dollar industry. They make up a large share of the world's pop-culture and entertainment landscape and are staples of childhood, iconography, and leisure across many demographics.

In this report we investigate video game data provided by a research group in Computer Science at UBC. The group collected the data from a Minecraft server that recorded player actions as they navigated the world. To support their recruitment efforts and to help ensure they have enough resources (such as licenses and hardware) for the players, we address the following research question: *We would like to know which "kinds" of players are most likely to contribute a large amount of data, so that those players can be targeted in recruitment efforts.*

Our **specific predictive question** is: *Can we predict how many hours a player will spend on the server (`played_hours`) using their age, gender, experience level, and newsletter subscription status?*


## 2. Method

To determine which kinds of players are most likely to contribute the largest amount of data, we must first define what it means for a player to “contribute data” in the context of this study. Because the dataset was collected from a Minecraft server that records player actions over time, players who spend more time on the server naturally generate more recorded events. In other words, the more hours a participant plays, the more information they output to the system. 

For this reason, we operationalize data contribution as the total number of hours a player has spent in the game. Our analytical goal is to identify which demographic or experiential characteristics are associated with higher played-hour totals, allowing us to infer which types of players are most likely to provide large amounts of data in future recruitment efforts.

**Planned method**

A regression approach will be applied, comparing a simple linear regression model with a K-Nearest Neighbors (KNN) regression model.
Categorical predictors (gender, experience, subscribe) will be one-hot encoded, and age will remain numeric.
All features will be standardized when using KNN.
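
The planned comparison could be set up roughly as follows. This is a minimal sketch on synthetic stand-in data (not the real `players.csv`); the toy column values, the 5-fold setting, and the fixed `n_neighbors=5` are illustrative assumptions rather than choices made in this report.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "age": rng.integers(9, 50, size=n),
    "gender": rng.choice(["Male", "Female", "Non-binary"], size=n),
    "experience": rng.choice(["Amateur", "Regular", "Veteran"], size=n),
    "subscribe": rng.choice([True, False], size=n),
})
y_log = np.log1p(rng.exponential(5.0, size=n))  # skewed target on the log scale

# Standardize the numeric column and one-hot encode the categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["gender", "experience", "subscribe"]),
)

models = {
    "linear": make_pipeline(preprocess, LinearRegression()),
    "knn": make_pipeline(preprocess, KNeighborsRegressor(n_neighbors=5)),
}

# Compare the two models by cross-validated RMSE on the log scale
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, toy, y_log, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

print(cv_rmse)
```

Because both pipelines share the same preprocessing, any difference in cross-validated RMSE comes from the model itself.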

**Model evaluation**

Data will be split into 75% for training and 25% for testing.
Model performance will be measured using RMSE on the log-transformed scale and back-transformed scale for interpretability.
Cross-validation will be used on the training set to tune hyperparameters.
The test set will only be used once at the end for final evaluation.
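
To make the two error scales concrete, here is a small sketch using made-up hour values (not taken from the dataset). It assumes a `log1p`/`expm1` transform pair, and the helper name `rmse` is ours for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true hours and model predictions on the log1p scale
hours_true = np.array([0.5, 2.0, 10.0, 150.0])
pred_log = np.log1p(np.array([1.0, 3.0, 8.0, 20.0]))

# Error on the log scale (the scale the model is trained on)
rmse_log = rmse(np.log1p(hours_true), pred_log)

# Error in hours, after back-transforming predictions with expm1
rmse_hours = rmse(hours_true, np.expm1(pred_log))

print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"RMSE (hours):     {rmse_hours:.3f}")
```

In this toy case a single badly underestimated heavy player dominates the error in hours, while the log-scale error stays much smaller.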

**Limitations**

The dataset is right-skewed, and most players spend very little time playing.
This imbalance may make predictions less accurate for players with unusually long play times.
Categorical imbalance (few players in some experience levels) may also affect the stability of coefficient estimates.

## 3. The Data
Our analysis examines the [*players.csv*](https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit) dataset. With 196 observations and 9 columns, it lists player profiles that pair each player's played hours and self-reported experience level with demographic attributes. This data allows us to analyze and make predictions about, for example, which age group might contribute the most hours to the game, linking playing tendencies to groups within a single sample. It was collected via survey.

The columns in the *Players* dataset are:
* `experience`: sorted into *Beginner*, *Amateur*, *Regular*, *Veteran*, or *Pro*, this (`str`) category defines the self-assessed experience a player has with a game.
* `subscribe`: (`bool`) subscription to a game-related newsletter.
* `hashedEmail`: encoded email (`str`).
* `played_hours`: total hours played (`float`).
* `name`: name (`str`).
* `gender`: self-reported gender (`str`): *male*, *female*, *non-binary*, *agender*, *two-spirit*, *other*, or *prefer not to say*.
* `age`: age (`int`).
* `individualId`: (supposed) ID (`NaN`).
* `organizationName`: (supposed) organization of the player (`NaN`).

In [1]:
# run this before continuing
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error

In [2]:
players = pd.read_csv("data/players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


## Wrangling & Cleaning the Dataset

Since the individual ID and organization name columns are empty (NaN) in every row, we removed them from the dataframe, as they provide no data that can be analyzed. We also drop rows with missing values in the variables used for modeling.

In [3]:
# kept hash in case we want to merge this dataset with sessions.csv

players_drop = players.drop(columns=["individualId", "organizationName"])
players_use = players_drop.dropna(subset=["played_hours", "age", "gender", "experience", "subscribe"])

players_use.head()

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21


A brief look at the dataset shows that some players have 0 played hours. To ensure accurate modeling, we need to determine whether these data points will affect our results. To measure their extent, we counted the players with 0 played hours and divided by the total number of players to obtain a percentage.

In [4]:
hrs0 = int(np.sum(players_use['played_hours'] == 0))
total_players = (players['played_hours']).size
pc_hrs0 = (hrs0/total_players)*100
print(f'The total number of players with 0 played hours is {hrs0}. This makes up {pc_hrs0:.2f}% of the total number of players ({total_players}).')

The total number of players with 0 played hours is 85. This makes up 43.37% of the total number of players (196).


As the calculation above shows, nearly half of the players in the dataset have 0 played hours, which means that any model fit to the full dataset would be heavily skewed toward zero. Since we want to find the "type" of player who plays the longest (and hence provides the most data), we exclude anyone who has not played from our modeling.

In [5]:
# Remove all players with 0 played hours
players_hrs = players_use[players_use['played_hours'] != 0]
players_hrs

# Optional: sort to inspect the most active players. We put the "upper percentile"
# at >= 2 hours because the dataset already contains few active players.
# top_data = players_hrs.sort_values('played_hours', ascending=False)
# top_data.head(26)

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
8,Amateur,True,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,0.1,Natalie,Male,17
...,...,...,...,...,...,...,...
185,Regular,False,8e98b6db2053af0bc0e62cd55bcea5a08f23986dec3d02...,0.1,Sam,Male,18
186,Veteran,True,ba24bebe588a34ac546f8559850c65bc90cd9d51b82158...,0.1,Gabriela,Female,44
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


As we are looking to identify which type of player plays the most, the player's name and email are not relevant, since they cannot be grouped or used for prediction, and they are removed. Furthermore, exact age would create too many possible categories, as it ranges widely (for example, from 9 to 91 years old). We therefore binned ages into age ranges, allowing for easier grouping.

In [6]:
players_cleaned = players_hrs[['gender','age', 'played_hours', 'experience', 'subscribe']].copy()

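# Bin edges and display labels for the age groups
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']

# Note: right=False makes each bin left-closed, e.g. [10, 20), so an age of exactly
# 10 or 20 falls into the group labelled '11-20' or '21-30', respectively.
players_cleaned['age_group'] = pd.cut(players_cleaned['age'], bins=bins, labels=labels, right=False)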
players_cleaned

Unnamed: 0,gender,age,played_hours,experience,subscribe,age_group
0,Male,9,30.3,Pro,True,0-10
1,Male,17,3.8,Veteran,True,11-20
3,Female,21,0.7,Amateur,True,21-30
4,Male,21,0.1,Regular,True,21-30
8,Male,17,0.1,Amateur,True,11-20
...,...,...,...,...,...,...
185,Male,18,0.1,Regular,False,11-20
186,Female,44,0.1,Veteran,True,41-50
192,Male,22,0.3,Veteran,False,21-30
194,Male,17,2.3,Amateur,False,11-20
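
How `pd.cut` assigns ages that sit exactly on a bin edge depends on its `right` argument. The following sketch, using a handful of hypothetical ages, shows the difference between `right=False` and the default `right=True`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([9, 10, 19, 20, 21])
bins = [0, 10, 20, 30]

# right=False: intervals are [0, 10), [10, 20), [20, 30)
left_closed = pd.cut(ages, bins=bins, right=False)

# right=True (the default): intervals are (0, 10], (10, 20], (20, 30]
right_closed = pd.cut(ages, bins=bins, right=True)

edges = pd.DataFrame({
    "age": ages,
    "right=False": left_closed.astype(str),
    "right=True": right_closed.astype(str),
})
print(edges)
```

With `right=False`, ages 10 and 20 land in the second and third bins; with `right=True`, they land in the first and second. Display labels such as `'11-20'` describe the `right=True` intervals more closely.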


## Visualizing the Data

We performed some initial exploratory data analysis and visualizations to understand the distribution of played hours in the dataset and inform a hypothesis about which type of player has the highest played hours. The first figure shows the overall distribution of the players in the dataset by age and gender. Each figure thereafter compares the total played hours of players based on gender, age, experience, and subscription status. This was done by grouping the played hours by each category and finding the sum.

In [7]:
fig1 = alt.Chart(players_cleaned).mark_bar().encode(
    x=alt.X('age_group').title('Age Range'),
    y=alt.Y('count()').title('Number of Players'),
    color=alt.Color('gender').title('Gender (Fig 1)'),
    xOffset='gender'
).properties(
    width=400,
    height=300,
    title='Fig 1: Age and Gender Distribution of Players'
)

gender_hrs = players_cleaned.groupby('gender', observed=True)[['played_hours']].sum().reset_index()

fig2 = alt.Chart(gender_hrs).mark_bar().encode(
    x=alt.X('gender').title('Gender'),
    y=alt.Y('played_hours').title('Hours Played'),
).properties(
    width=400,
    height=300,
    title='Fig 2: Gender of Players vs Hours Played'
)

age_hrs = players_cleaned.groupby('age_group', observed=True)[['played_hours']].sum().reset_index()

fig3 = alt.Chart(age_hrs).mark_bar().encode(
    x=alt.X('age_group').title('Age Range'),
    y=alt.Y('played_hours').title('Hours Played'),
).properties(
    width=400,
    height=300,
    title='Fig 3: Age of Players vs Hours Played'
)

experience_hrs = players_cleaned.groupby('experience', observed=True)[['played_hours']].sum().reset_index()

fig4 = alt.Chart(experience_hrs).mark_bar().encode(
    x=alt.X('experience').title('Experience'),
    y=alt.Y('played_hours').title('Hours played')
).properties(
    width=100,
    height=300,
    title='Fig 4: Hours Played by Experience'
)

subscription_hrs = players_cleaned.groupby('subscribe', observed=True)[['played_hours']].sum().reset_index()

fig5 = alt.Chart(subscription_hrs).mark_bar().encode(
    x=alt.X('subscribe').title('Subscription'),
    y=alt.Y('played_hours').title('Hours played')
).properties(
    width=100,
    height=300,
    title='Fig 5: Hours Played by Subscription'
)

In [8]:
fig1

In [9]:
fig2

In [10]:
fig3

In [11]:
fig4

In [12]:
fig5

Based on figs. 2, 3, 4, and 5, we can hypothesize that the players with the most total hours are males aged 11–20 who have regular experience and are subscribed. However, we cannot immediately assume that this “type” of player contributes the most data overall because we do not yet know how these variables interact with one another. For example, while males play the most and the 11–20 age group plays the most, this does not necessarily mean that males aged 11–20 play the most. Because the per-variable totals cannot simply be combined, we check by grouping on all of the variables at once.

It is also important to note the distribution of players in the dataset from fig. 1. The player diversity is heavily biased towards males ages 11-30. The high played hours of males in this age group reflects the fact that a large amount of data points came from people in this category. This visualization can be used to observe the contribution of individuals in each category to their category's played hours. For example, although the total hours played is highest by males 11-20, the contribution of each individual player in this category may be low, and the total played hours is the result of a collective.
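
The difference between a group's total hours and the contribution of its individual members can be checked by aggregating several statistics at once. Below is a minimal sketch on hypothetical toy data; the group names and hour values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: a large group of light players (A)
# and a small group containing one heavy player (B)
toy = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 2,
    "played_hours": [1.0, 2.0, 1.5, 0.5, 2.5, 1.5, 0.5, 7.5],
})

# Report group size, total, mean, and median side by side
summary = (
    toy.groupby("group")["played_hours"]
       .agg(players="count", total="sum", mean="mean", median="median")
       .reset_index()
)
print(summary)
```

Here group A has the larger total (9.0 vs. 8.0 hours) only because it has more members; its mean per player (1.5 hours) is well below group B's (4.0 hours).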

In [13]:
grouped = (
    players_cleaned
        .groupby(['gender', 'experience', 'subscribe', 'age_group'], observed=True)[['played_hours']]
        .sum()
        .reset_index()
        .sort_values('played_hours', ascending=False)
)
grouped.head()

Unnamed: 0,gender,experience,subscribe,age_group,played_hours
29,Male,Regular,True,11-20,233.2
38,Non-binary,Regular,True,21-30,218.1
3,Female,Amateur,True,11-20,198.4
9,Female,Regular,True,11-20,178.2
18,Male,Amateur,True,21-30,91.4


The top result of the grouping matches what we hypothesized from the separate figures. To further test whether subscribed males aged 11-20 with regular experience are the type of player that contributes the most data, we will use regression methods.

The grouped results also confirm, however, that conclusions cannot be drawn from the exploratory visualizations alone. The next few categories with the highest hours do not match what might be expected from the graphs (for example, we might expect the second highest to be females aged 11-20 with regular experience, rather than non-binary people aged 21-30).

### Visualizing the Training Data

We visualize the training portion of the dataset to examine the relationship between age, gameplay hours, and player experience. This helps us understand whether the relationship appears linear or displays clustered, nonlinear patterns, which would justify the use of a K-Nearest Neighbors regression model.


In [14]:
# Output dataframes instead of arrays (same as example)
set_config(transform_output="pandas")

# Set the seed
np.random.seed(1)

# Splitting the data into training and testing sets (75% / 25%)
players_train, players_test = train_test_split(
    players_hrs,
    train_size=0.75,
    random_state=1
)

# Create scatter plot of hours played versus age,
# label the points by experience level (mimicking example's color-coding)
players_visualization_training = (
    alt.Chart(players_train)
    .mark_circle(opacity=0.6, size=49)
    .encode(
        x=alt.X("age:Q").title("Age of Player"),
        y=alt.Y("played_hours:Q")
            .title("Hours Played")
            .scale(zero=False, type="sqrt"),
        color=alt.Color("experience:N").title("Experience Level")
    )
    .properties(
        title="Training Data Visualization Relating to Player Age, Experience, and Hours Played"
    )
)

players_visualization_training

#### Observations  
From the visualization of the training data, we can observe several general tendencies. Most data points concentrate within a moderate range, but a number of observations appear far above or below this cluster, indicating the presence of notable outliers. The coloring also reveals a clear separation among different groups, suggesting that the categories exhibit distinct behavioral patterns.

Rather than forming a clear linear trend, the points display a more irregular arrangement that suggests a complex underlying structure. Because of this, models assuming linear relationships may not capture the variability present in the data. In contrast, a K-Nearest Neighbors approach is better suited for this type of pattern, as it relies on the similarity of local neighborhoods rather than a global linear form.

In [15]:
# Create a dataset of active players (played_hours ≥ 2)
players_active = players_use[players_use["played_hours"] >= 2].copy()

players_active

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
17,Amateur,True,a175d4741dc84e6baf77901f6e8e0a06f54809a34e6b52...,48.4,Xander,Female,17
40,Regular,True,0d4d71be33e2bc7266ee4983002bd930f69d304288a866...,5.6,Winslow,Male,17
44,Veteran,True,8d2eed1f399e0d77cebb8fcc48ed19ad2fa8e3bb3fa683...,2.2,Cyrus,Male,24
48,Veteran,True,b3510c708bd50bf9f75e6e02bb6fe14edb705e0ea671ee...,12.5,Isidore,Agender,27
51,Regular,True,b622593d2ef8b337dc554acb307d04a88114f2bf453b18...,218.1,Akio,Non-binary,20
56,Amateur,True,8e0aac3020b3fd9cdef4840b533b4b105aaf1ce1f6f2df...,2.9,Rafael,Male,11
58,Regular,True,f2826fb8dbce4d450348f99cb27ade184b713998d96797...,3.6,Zane,Male,10
67,Amateur,True,18936844e06b6c7871dce06384e2d142dd86756941641e...,17.2,Kyrie,Male,14


In [16]:
viz_scatter_active = (
    alt.Chart(players_active)
    .mark_circle(opacity=0.7, size=60)
    .encode(
        x=alt.X("age:Q").title("Age of Player"),
        y=alt.Y("played_hours:Q").title("Hours Played"),
        color=alt.Color("experience:N").title("Experience Level"),
        tooltip=["age", "played_hours", "experience", "gender"]
    )
    .properties(
        width=500,
        height=350,
        title="Scatter Plot of Active Players (Players with ≥ 2 Hours)"
    )
)

viz_scatter_active

In [25]:
sessions = pd.read_csv('data/sessions.csv')

# Split each timestamp into a date part and a time-of-day part
sessions[['start_date', 'start_time_only']] = sessions['start_time'].str.split(' ', expand=True)
sessions[['end_date', 'end_time_only']] = sessions['end_time'].str.split(' ', expand=True)

# Split the time of day into numeric hour and minute columns
sessions[['start_hour', 'start_minute']] = sessions['start_time_only'].str.split(':', expand=True).astype(float)
sessions[['end_hour', 'end_minute']] = sessions['end_time_only'].str.split(':', expand=True).astype(float)

# Convert to minutes since midnight and take the difference
sessions['start_total_minutes'] = sessions['start_hour'] * 60 + sessions['start_minute']
sessions['end_total_minutes'] = sessions['end_hour'] * 60 + sessions['end_minute']
sessions['duration_minutes'] = sessions['end_total_minutes'] - sessions['start_total_minutes']

# A negative duration means the session crossed midnight, so add one day.
# Sessions lasting 24 hours or more are not handled by this approach.
sessions.loc[sessions['duration_minutes'] < 0, 'duration_minutes'] += 24 * 60
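
The cell above works from the time of day only. An alternative is to parse the full timestamps and subtract them, sketched below on hypothetical rows. The `"%d/%m/%Y %H:%M"` format string and the `toy_sessions` values are assumptions for illustration and would need to be checked against the actual `sessions.csv`.

```python
import pandas as pd

# Hypothetical rows; the day-first timestamp format is an assumption
toy_sessions = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "01/07/2024 22:00"],
    "end_time":   ["30/06/2024 18:24", "18/06/2024 00:05", "03/07/2024 01:30"],
})

fmt = "%d/%m/%Y %H:%M"
start = pd.to_datetime(toy_sessions["start_time"], format=fmt)
end = pd.to_datetime(toy_sessions["end_time"], format=fmt)

# Subtracting full timestamps handles midnight crossings and
# multi-day sessions without a separate correction step
toy_sessions["duration_minutes"] = (end - start).dt.total_seconds() / 60
print(toy_sessions)
```

The third toy row spans more than a day, which a time-of-day difference alone would not capture.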

## KNN Modeling Section

To train the prediction model, we restrict the data to active players (`played_hours` ≥ 2), extract the predictors (age, gender, experience), and define the target variable as `played_hours`. Because the hours played are highly right-skewed, we follow our Methods plan and apply a log transform (`log1p`) to stabilize variance and improve model performance.

In [18]:
# Prepare features (X) and target (y)
X = players_active[['age', 'gender', 'experience']]
y = players_active['played_hours']

# Log-transform the target to reduce skewness
y_log = np.log1p(y)

We split the data into a 75% training set and a 25% test set (the same proportions and random seed as the earlier split used for the training visualization).
Since KNN relies on distance calculations, numerical variables (age) must be standardized, while categorical variables (gender, experience) must be one-hot encoded.

We combine preprocessing and the KNN estimator into a unified sklearn Pipeline, ensuring that transformations are applied consistently to both training and test data.

In [19]:
# Train-test split (75/25)
X_train, X_test, y_train_log, y_test_log = train_test_split(
    X, y_log, test_size=0.25, random_state=1
)

# Preprocessing: scale numeric features and one-hot encode categorical features
numeric_features = ["age"]
categorical_features = ["gender", "experience"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
    ]
)

# Build KNN regression pipeline
knn_model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("knn", KNeighborsRegressor())
    ]
)

We tune the number of neighbors k using 3-fold cross-validation on the training set, following the plan specified in our Methods section.
After selecting the best-performing model, we evaluate it on the held-out test set using:

* RMSE (log scale)
* RMSE (original scale)

Finally, we visualize the model output by comparing predicted vs. true gameplay hours, a required analysis visualization in the assignment rubric.

Because our dataset of active players is relatively small, using larger values of k would violate the requirement that the number of neighbors cannot exceed the number of training samples in each cross-validation fold. Therefore, we tune k over a limited range (1–8), which ensures valid KNN configurations.
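
The upper limit on k can be worked out from the fold sizes. The sketch below uses a hypothetical training-set size of 19 (an illustrative assumption, not a figure reported here) and lists how many samples are available for fitting in each of the 3 folds.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 19  # hypothetical training-set size, for illustration only
X_dummy = np.zeros((n_train, 1))

# For each fold, count the samples left for fitting after
# one third is held out for validation
fit_sizes = [len(train_idx) for train_idx, _ in KFold(n_splits=3).split(X_dummy)]

# n_neighbors cannot exceed the smallest fitting set
max_valid_k = min(fit_sizes)
print(fit_sizes, max_valid_k)
```

With 19 training rows the fitting sets contain 12, 13, and 13 samples, so any k up to 12 would be valid and a grid of 1 to 8 stays safely inside that limit.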

In [20]:
# Grid search to tune k
param_grid = {"knn__n_neighbors": list(range(1, 9))}

grid_knn = GridSearchCV(
    knn_model,
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
    n_jobs=-1
)

grid_knn.fit(X_train, y_train_log)

best_k = grid_knn.best_params_["knn__n_neighbors"]
best_k

5

After selecting the optimal value of k using cross-validation, we evaluate the best-performing model on the held-out test set.
We compute prediction error on both the log-transformed scale (the scale used during training) and on the original scale, after back-transforming the predictions.

In [21]:
# Retrieve the best-performing model from grid search
best_knn = grid_knn.best_estimator_

# Predict on the test set (log scale)
y_pred_log = best_knn.predict(X_test)

# RMSE in log-transformed space
rmse_log = np.sqrt(mean_squared_error(y_test_log, y_pred_log))

# Convert predictions back to original hours
y_test_original = np.expm1(y_test_log)
y_pred_original = np.expm1(y_pred_log)

# RMSE in original hours
rmse_original = np.sqrt(mean_squared_error(y_test_original, y_pred_original))

rmse_log, rmse_original

(np.float64(1.4887048059815753), np.float64(56.65774928551042))

To visualize model performance, we compare predicted gameplay hours to the true values in a scatter plot.
Points close to the diagonal line indicate accurate predictions.
This plot is required as the “visualization of the analysis” in the assignment rubric.

In [22]:
# Prepare dataframe for visualization
pred_df = pd.DataFrame({
    "True Hours": y_test_original,
    "Predicted Hours": y_pred_original
})

# Altair scatterplot
viz_knn_results = (
    alt.Chart(pred_df)
    .mark_circle(size=60, opacity=0.6)
    .encode(
        x=alt.X("True Hours", title="True Played Hours"),
        y=alt.Y("Predicted Hours", title="Predicted Played Hours")
    )
    .properties(
        width=500,
        height=350,
        title=f"KNN Regression Results (k = {best_k})"
    )
)

viz_knn_results

## 4. Results: KNN Regression Model

The KNN regression model with the tuned value of k = 5 shows limited predictive accuracy when estimating total played hours. As seen in the scatter plot comparing predicted and true values, the predictions cluster within a narrow band (approximately 5–18 hours), whereas the actual played hours vary dramatically across the test set, including several high-activity outliers.

This pattern reflects a common behavior of KNN regression: predictions tend to shrink toward the average of nearby observations. As a result, the model systematically underestimates players with very high activity and overestimates players with very low activity.
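
This shrinkage can be reproduced on a tiny example. The sketch below uses hypothetical one-dimensional data with a single extreme target; the values are invented and `n_neighbors=3` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D data with one extreme target value at x = 20
X_toy = np.array([[10], [12], [14], [16], [18], [20]])
y_toy = np.array([1.0, 2.0, 1.5, 2.5, 2.0, 100.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)

# Each prediction is the mean of the 3 nearest training targets
pred_at_outlier = knn.predict([[20]])[0]   # neighbors at x = 16, 18, 20
pred_next_to_it = knn.predict([[18]])[0]   # neighbors at x = 16, 18, 20
print(pred_at_outlier, pred_next_to_it)
```

Both predictions equal the mean of 2.5, 2.0, and 100.0 (about 34.8): the heavy player with a true value of 100 is underestimated, while the neighboring light player with a true value of 2.0 is overestimated.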

Model evaluation supports this observation. The error on the log-transformed scale is RMSE_log = 1.4887, which corresponds to a typical multiplicative error of about e^1.49 ≈ 4.4 on (hours + 1). When converted back to the original scale, the predictive error becomes much larger (RMSE_original ≈ 56.66 hours), indicating substantial deviation from true values. Given the small sample size and the highly skewed distribution of played hours, these results are expected.

Overall, while the model captures broad structure in the log scale, it struggles to provide accurate hour-level predictions for individual players on the original scale.

In [32]:
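# Refit the base pipeline; its default n_neighbors=5 matches best_k here.
# Predictions are on the log1p scale.
knn_model.fit(X_train, y_train_log)
knn_model.predict(X_test)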

array([1.84053163, 2.93757488, 2.85657112, 2.262025  , 2.4064635 ,
       2.01684606, 2.03835724])

In [40]:
new_player = pd.DataFrame([{
    "age": 25,
    "gender": "Male",
    "experience": "Veteran"
}])

knn_model.predict(new_player)

array([1.88247573])
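
The prediction above is on the `log1p` scale used for training. To read it in hours, it can be back-transformed with `np.expm1`, as in this short sketch that reuses the printed value.

```python
import numpy as np

# Value printed above, on the log1p scale
pred_log = np.array([1.88247573])

# np.expm1 is the inverse of np.log1p, giving hours on the original scale
pred_hours = np.expm1(pred_log)
print(pred_hours.round(2))
```

This works out to roughly 5.6 predicted hours for the hypothetical 25-year-old male veteran player.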

In [41]:
results = X_test.copy()
results["played_hours"] = y_test_original.values
# y_pred_log is on the log1p scale, so the column is named accordingly
results["predicted_log_hours"] = y_pred_log

results.head()

Unnamed: 0,age,gender,experience,played_hours,predicted_log_hours
125,49,Male,Regular,18.5,1.840532
90,16,Female,Amateur,150.0,2.937575
123,17,Male,Beginner,7.1,2.856571
40,17,Male,Regular,5.6,2.262025
177,21,Non-binary,Veteran,2.7,2.406463
