## DSCI 100 Group Project - PLAICRAFT Data Study
### Members: Justin Galimpin (59053306), Alexis Kuerbig (15606007), Arjun Sharma (61155750), Ahmad Khattab (90009473)

**Introduction**

The `players.csv` dataset lists every individual who has taken part in the **PLAICRAFT** research study, which collects data on how people play video games and specifically aims to recruit new players. The dataset contains 196 observations and the following variables:

* `experience`: Object value that stores the player's experience level with this Minecraft-type game. The levels are Beginner, Amateur, Regular, Veteran, and Pro.
* `subscribe`: Boolean value that indicates whether the player is subscribed to the mailing list.
* `hashedEmail`: Object value that indicates the hashed email of the player.
* `played_hours`: Float value that indicates the total number of hours a player has spent playing the game on this server.
* `name`: Object value that indicates the player's name.
* `gender`: Object value that indicates the player's gender.
* `age`: Integer value that indicates the player's age.
* `individualId`: Unique Player ID (left NaN)
* `organizationName`: Player Organization (left NaN)
  
**Our Question: Can we predict the total number of hours a player will play in the study based on their age and/or their experience level?**

**Response Variable:** `played_hours`

**Predictor Variables:** `experience`, `age`

Answering this question will give the research group a clearer idea of which demographic may contribute the most to their study. For example, the results may show that players with one experience level contribute far more hours than players with another, or that players with a particular experience level rarely contribute at all. The key variables for answering this question are `age`, `experience`, and `played_hours` in `players.csv`.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')
# Output dataframes instead of arrays
set_config(transform_output="pandas")

**Method and Results**

In [2]:
# Load the data given to us; only players.csv is used in the analysis below
players = pd.read_csv("players.csv")
sessions = pd.read_csv("sessions.csv")

In [3]:
# Visualizing the data

# Plot age against played hours, color-coded by experience level, to look for
# possible relationships between the variables
player_chart = alt.Chart(players, title="Comparison of Total Time Played vs Player Age (Figure 1)").mark_point().encode(
    x=alt.X("age").title("Age of Players").scale(zero=False),
    y=alt.Y('played_hours', title='Total time played (hours)'),
    color=alt.Color("experience").title("Experience Level"),
)

player_chart

In [4]:
# Zoomed-in version of Figure 1, restricted to the region where most points fall
zoomed_chart = alt.Chart(players, title="Zoomed Chart of Figure 1 (Figure 2)").mark_point(clip=True).encode(
    x=alt.X("age").title("Age of Players").scale(domain=[0, 50]),
    y=alt.Y('played_hours', title='Total time played (hours)').scale(domain=[0, 4]),
    color=alt.Color("experience").title("Experience Level"),
)

zoomed_chart

Because the research question asks us to predict a numerical value, we use **regression**. From the visualizations above, it is unclear whether played hours has a linear relationship with age or experience. For this reason, we chose **KNN regression** for its ability to model non-linear relationships: this flexibility makes it suitable for predicting `played_hours` from age and experience level, which may not follow a linear pattern. However, because we cannot rule out a linear relationship, we also build a **linear regression** model in order to determine which of the two models is more effective.

In [18]:
# Remap the experience categories to ordinal numerical values so they can be used as a predictor
players["experience"] = players["experience"].replace({
    "Beginner" : 1.0,
    "Amateur" : 2.0,
    "Regular" : 3.0,
    "Veteran" : 4.0,
    "Pro" : 5.0
})

# Split the data into training and testing sets, so the data used to build the
# models is separate from the data used to test them
players_training, players_testing = train_test_split(
    players,
    test_size=0.20,
    random_state=2000,
)
X_train = players_training[["experience","age"]]
y_train = players_training["played_hours"]

X_test = players_testing[["experience","age"]]
y_test = players_testing["played_hours"]

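# KNN model building: standardize the predictors, then fit a KNN regressor
# (default n_neighbors=5) and estimate its error with 5-fold cross-validation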
players_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

players_cv = pd.DataFrame(
    cross_validate(
        players_pipe,
        X_train,
        y_train,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
        cv=5
    )
)
players_cv



| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.003629 | 0.002362 | -38.359258 | -24.460191 |
| 1 | 0.003225 | 0.002238 | -35.315021 | -27.24268 |
| 2 | 0.00308 | 0.002104 | -13.584086 | -31.058668 |
| 3 | 0.003074 | 0.002118 | -40.870674 | -24.536563 |
| 4 | 0.003049 | 0.002081 | -17.023712 | -31.839362 |


The `experience` column's categorical values were converted to numerical values (1 for Beginner, 2 for Amateur, and so on) for compatibility with the regression models.
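As a rough illustration of this encoding step, the sketch below applies the same mapping with a plain dictionary lookup instead of pandas. The names `EXPERIENCE_ORDER` and `encode_experience` are hypothetical and do not appear in the notebook; the ordering (Beginner lowest, Pro highest) follows the mapping used in the code cell above.

```python
# Hypothetical helper, for illustration only: ordinal encoding of experience labels.
EXPERIENCE_ORDER = {
    "Beginner": 1.0,
    "Amateur": 2.0,
    "Regular": 3.0,
    "Veteran": 4.0,
    "Pro": 5.0,
}


def encode_experience(labels):
    """Map each label to its ordinal code; an unknown label raises KeyError."""
    return [EXPERIENCE_ORDER[label] for label in labels]


encoded = encode_experience(["Pro", "Beginner", "Regular"])
print(encoded)
```

Note that an ordinal encoding like this treats the gap between adjacent levels as equal, which is a modelling assumption rather than something the data guarantees.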

The dataset was split into two subsets: a training set (80%) and a testing set (20%). We used the training data to build our models and then evaluated each model on the testing data, which played no part in fitting, to avoid data leakage.
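The idea behind the split can be sketched with the standard library alone. This is a simplified stand-in, not a reimplementation of `train_test_split`: the helper name `split_rows` is hypothetical, and the sketch will not select the same rows as scikit-learn even with the same seed value.

```python
import math
import random


def split_rows(rows, test_fraction=0.20, seed=2000):
    """Shuffle a copy of the rows and return (training_rows, testing_rows)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Round the test size up so a fractional count still yields a whole row.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 196 placeholder row indices, matching the number of observations in players.csv.
rows = list(range(196))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two subsets, which is what keeps the test-set error an honest estimate of performance on unseen players.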

For the KNN regression model, we created a pipeline that standardizes the predictors before fitting. We first ran a 5-fold cross-validation on the training data with the default number of neighbors, then used `GridSearchCV` (also with 5 folds) to find the number of neighbors with the lowest cross-validated RMSE. Lastly, we predicted played hours on the testing data and calculated the RMSPE to measure the model's accuracy.
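To make the mechanics concrete, here is a from-scratch sketch of standardization followed by KNN regression. It is not the scikit-learn implementation used in the notebook; the helper names (`fit_scaler`, `scale`, `knn_predict`) are hypothetical and the data values are made up for illustration.

```python
import math


def fit_scaler(rows):
    """Return per-column (mean, population std) computed from the training rows."""
    n = len(rows)
    stats = []
    for column in zip(*rows):
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
        stats.append((mean, std))
    return stats


def scale(row, stats):
    """Standardize one row with statistics learned from the training rows."""
    return [(v - mean) / std for v, (mean, std) in zip(row, stats)]


def knn_predict(X_train, y_train, x_new, k):
    """Average the responses of the k training rows closest to x_new."""
    by_distance = sorted(
        (math.dist(x, x_new), y) for x, y in zip(X_train, y_train)
    )
    return sum(y for _, y in by_distance[:k]) / k


# Made-up (experience code, age) rows and played hours, NOT taken from players.csv.
X_raw = [[1.0, 17.0], [2.0, 21.0], [3.0, 19.0], [5.0, 30.0]]
y = [0.5, 1.0, 40.0, 0.0]

stats = fit_scaler(X_raw)
X_scaled = [scale(row, stats) for row in X_raw]
x_new = scale([3.0, 20.0], stats)

prediction = knn_predict(X_scaled, y, x_new, k=2)
print(prediction)
```

The new point is scaled with the training statistics rather than its own, which mirrors why the scaler sits inside the pipeline: the held-out data should not influence the preprocessing.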

In [6]:
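# Tune the number of neighbors (k = 1 to 99) with 5-fold cross-validation,
# scoring each candidate by negative RMSE
np.random.seed(101)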
param_grid = {'kneighborsregressor__n_neighbors': range(1, 100, 1)}
players_tuned = GridSearchCV(players_pipe, param_grid, cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error")
players_results = pd.DataFrame(players_tuned.fit(X_train, y_train).cv_results_)
players_results



| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | `param_kneighborsregressor__n_neighbors` | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.003199 | 0.000195 | 0.002160 | 0.000116 | 1 | `{'kneighborsregressor__n_neighbors': 1}` | -59.516718 | -36.570118 | -4.455080 | -40.293108 | -15.168580 | -31.200721 | 19.419868 | 98 |
| 1 | 0.003008 | 0.000021 | 0.003454 | 0.002833 | 2 | `{'kneighborsregressor__n_neighbors': 2}` | -45.321621 | -38.638925 | -19.942162 | -40.330880 | -15.169454 | -31.880609 | 11.996099 | 99 |
| 2 | 0.002973 | 0.000010 | 0.002028 | 0.000011 | 3 | `{'kneighborsregressor__n_neighbors': 3}` | -42.524837 | -35.225325 | -13.542270 | -40.314107 | -19.983454 | -30.317999 | 11.499903 | 96 |
| 3 | 0.002970 | 0.000030 | 0.002194 | 0.000366 | 4 | `{'kneighborsregressor__n_neighbors': 4}` | -44.607392 | -33.973574 | -16.927260 | -41.488809 | -18.075288 | -31.014465 | 11.568124 | 97 |
| 4 | 0.002960 | 0.000011 | 0.002031 | 0.000033 | 5 | `{'kneighborsregressor__n_neighbors': 5}` | -38.359258 | -35.315021 | -13.584086 | -40.870674 | -17.023712 | -29.030550 | 11.397064 | 95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 0.003054 | 0.000195 | 0.002267 | 0.000100 | 95 | `{'kneighborsregressor__n_neighbors': 95}` | -38.633742 | -31.484883 | -8.493834 | -39.460906 | -14.503360 | -26.515345 | 12.713788 | 39 |
| 95 | 0.002911 | 0.000012 | 0.007464 | 0.010518 | 96 | `{'kneighborsregressor__n_neighbors': 96}` | -38.637628 | -31.484967 | -8.379666 | -39.465659 | -14.490842 | -26.491753 | 12.750265 | 36 |
| 96 | 0.002892 | 0.000017 | 0.002206 | 0.000020 | 97 | `{'kneighborsregressor__n_neighbors': 97}` | -38.642477 | -31.484661 | -8.327222 | -39.468979 | -14.478423 | -26.480352 | 12.769082 | 34 |
| 97 | 0.002898 | 0.000013 | 0.002189 | 0.000012 | 98 | `{'kneighborsregressor__n_neighbors': 98}` | -38.646221 | -31.486729 | -8.225785 | -39.473703 | -14.456759 | -26.457840 | 12.803850 | 33 |


In [7]:
# Best k value found by the grid search, read from best_params_
players_min = players_tuned.best_params_

# Cross-validated RMSE of the best model; best_score_ holds the negative RMSE,
# so we flip its sign
playersbest_RMSE = -players_tuned.best_score_

playersbest_RMSE

np.float64(25.75707928712318)

In [8]:
# See how our model (players_tuned) predicts our test data (X_test).
np.random.seed(1234)
players_prediction = pd.DataFrame(players_tuned.predict(X_test)).rename(columns={0: "Played Hours"})
players_prediction

| | Played Hours |
|---|---|
| 0 | 33.133333 |
| 1 | 0.65 |
| 2 | 0.475 |
| 3 | 4.15 |
| 4 | 0.433333 |
| 5 | 0.325 |
| 6 | 33.141667 |
| 7 | 2.783333 |
| 8 | 0.658333 |
| 9 | 0.658333 |


In [9]:
# Use the test set to calculate RMSPE. Use y_test to compare to the PREDICTIONS our model makes
players_summary_RMSPE = mean_squared_error(y_test, players_prediction)**(1/2)

players_summary_RMSPE

np.float64(25.362170401998327)

In [10]:
# Linear Regression model building

lm = LinearRegression()
lm_fit = lm.fit(X_train, y_train)
# Attach the model's predictions to the training and testing data
players_preds = players_training.assign(
    predictions=lm.predict(X_train)
)
test_preds = players_testing.assign(
    predictions=lm.predict(X_test)
)
# RMSPE of the linear model on the test set
lm_rmspe = mean_squared_error(y_test, test_preds["predictions"])**(1/2)
lm_rmspe

np.float64(23.667497702128255)

**Discussion**

Interestingly, the linear regression model produced a slightly lower RMSPE on the test set, suggesting it was at least as well suited as KNN regression for capturing the relationship between the variables in this case. This highlights the importance of comparing different models, even when an initial analysis suggests a lack of a clear relationship. By comparing models, we can proceed with the one that yields the least error, thus improving our prediction.

In [11]:
# Map the numerical experience codes back to their category names
players["experience"] = players["experience"].replace({
    1.0: "Beginner",
    2.0: "Amateur",
    3.0: "Regular",
    4.0: "Veteran",
    5.0: "Pro"
})

In [12]:
lm_rmspe

np.float64(23.667497702128255)

In [13]:
players_summary_RMSPE

np.float64(25.362170401998327)

For our linear regression model, we used the same training and testing split to fit the model and make predictions; unlike the KNN pipeline, the predictors were not standardized before fitting. The RMSPE of the test-set predictions was calculated to measure the model's accuracy.

Finally, we compared the RMSPE values of the KNN regression and linear regression models to determine which performed better at predicting our response variable, `played_hours`.
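For reference, the quantity being compared is the square root of the mean squared difference between observed and predicted values. The sketch below computes it by hand; the function name `rmspe` is ours and the example numbers are made up rather than taken from the study data.

```python
import math


def rmspe(y_true, y_pred):
    """Root mean squared prediction error for two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up observed and predicted hours, for illustration only.
example_rmspe = rmspe([0.0, 2.0, 10.0], [1.0, 2.0, 6.0])
print(example_rmspe)
```

Because the errors are squared before averaging, a single badly missed prediction contributes far more to the result than several small misses.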

As the RMSPE values above show, the linear model's predictions are closer to the true values than those of the KNN model: **lm_rmspe** is 23.67, compared with 25.36 for **players_summary_RMSPE**. However, the difference between the two models' performance is minuscule; neither clearly outperforms the other.

In [14]:
lm.coef_

array([ 0.26554675, -0.10904625])

In [15]:
lm.intercept_

np.float64(7.545981735319522)
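The intercept and coefficients can be read as a prediction equation. Since the predictors were passed in the order `["experience", "age"]`, the first coefficient belongs to the experience code and the second to age. The sketch below plugs in the values as displayed in the outputs above; the function name `predict_played_hours` and the example player are hypothetical.

```python
# Values copied from the lm.intercept_ and lm.coef_ outputs (as displayed).
INTERCEPT = 7.545981735319522
COEF_EXPERIENCE = 0.26554675
COEF_AGE = -0.10904625


def predict_played_hours(experience_code, age):
    """Predicted hours = intercept + b_experience * code + b_age * age."""
    return INTERCEPT + COEF_EXPERIENCE * experience_code + COEF_AGE * age


# Hypothetical player: a 20-year-old "Regular" (experience code 3.0).
example = predict_played_hours(experience_code=3.0, age=20)
print(round(example, 2))  # prints 6.16
```

Read this way, each additional experience level raises the predicted total by about 0.27 hours and each additional year of age lowers it by about 0.11 hours, both small effects relative to an RMSPE of roughly 24 hours.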

In [16]:
# Share of players with zero vs. non-zero played hours, as a percentage of all players
zero_hours_percentage = (players['played_hours'] == 0).mean() * 100
non_zero_hours_percentage = (players['played_hours'] > 0).mean() * 100

summary_df = pd.DataFrame({
    'Category': ['Played 0 Hours', 'Played > 0 Hours'],
    '% of Total Players': [zero_hours_percentage, non_zero_hours_percentage]
})

summary_df

| | Category | % of Total Players |
|---|---|---|
| 0 | Played 0 Hours | 43.367347 |
| 1 | Played > 0 Hours | 56.632653 |


With these calculations in mind, note that `players.csv` contains a substantial share of players (about 43%) who did not log any hours during the study period (i.e. their `played_hours` is **0**). While these entries might initially appear irrelevant, we chose to retain them because they provide insight into player inactivity patterns, which we found to be crucial for understanding overall player behavior. Removing these observations would compromise the completeness of our analysis.

In [17]:
experience_plot = alt.Chart(players, title='Plot of Player Experience vs Total Hours Played (Figure 3)').mark_bar().encode(
    x=alt.X('experience', title='Player Experience', sort='-y', axis=alt.Axis(labelAngle=25)),
    y=alt.Y('played_hours', title='Total time played (hours)'),
    color=alt.Color('experience', title='Experience')
).properties(
    width=300
)
experience_plot

Briefly analysing the given data revealed that Regulars accounted for the majority of total hours played, far surpassing all other groups; Amateurs contributed the next highest number of hours. In contrast, Beginners, Veterans, and Pros displayed significantly lower and roughly equal levels of playtime, suggesting a notable drop in engagement within these categories. 

However, we are also aware that this data is skewed due to the presence of outliers. For example, a single "Regular" player logged over 200 hours during the study, far exceeding the average for this category. This disproportionate contribution inflates the perceived engagement of "Regulars" and could misrepresent typical player behavior within the group. As such, while the trends are informative, they should be interpreted with caution, considering the potential influence of extreme values or outliers.
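A small numeric sketch shows how strongly one extreme value can move a group average while leaving the median almost untouched. The hours below are made up for illustration and are not values from `players.csv`.

```python
from statistics import mean, median

# Made-up hours for a small group of players, NOT taken from players.csv.
typical_hours = [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
# The same group plus one player with an extreme 200-hour total.
with_outlier = typical_hours + [200.0]

print(mean(typical_hours), median(typical_hours))
print(mean(with_outlier), median(with_outlier))
```

In this toy group the mean rises from under 1 hour to about 29 hours once the extreme player is added, while the median only moves from 0.75 to 1.0.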

There is a substantial number of outliers, which calls for particular caution with the linear regression model. Because that model is defined by the intercept and slopes of the fitted line, outliers can pull the fitted coefficients strongly toward themselves and thus distort the model's predictions to a large extent.
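The sensitivity of a fitted slope to one extreme point can be illustrated with a one-predictor least-squares fit written by hand. This is a simplified sketch with made-up data, not the two-predictor model fitted in the notebook; `ols_slope` is a hypothetical helper.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Made-up ages and hours, NOT taken from players.csv.
ages = [15, 18, 21, 24, 27]
flat_hours = [1.0, 1.0, 1.0, 1.0, 1.0]
# Same data, except the oldest player now has an extreme 201-hour total.
outlier_hours = [1.0, 1.0, 1.0, 1.0, 201.0]

slope_flat = ols_slope(ages, flat_hours)
slope_outlier = ols_slope(ages, outlier_hours)
print(slope_flat, slope_outlier)
```

With identical hours for everyone the slope is zero; changing a single value at the edge of the age range produces a slope of more than 13 hours per year, even though four of the five points are unchanged.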

**Impact of our findings:** By understanding the effect of variables such as experience and age, game developers can design projects that cater to audiences expected to produce more engagement (played hours). For future use of this data, note that a large share of the observations are players with 0 total hours played. Future studies using this type of data should account for the fact that a substantial percentage of players did not participate at all.

**References (If Any)**

**None**