## DSCI 100 Group Project - PLAICRAFT Data Study
### Members: Justin Galimpin (59053306), Alexis Kuerbig (15606007), Arjun Sharma (61155750), Ahmad Khattab (90009473)

**Introduction**

The `players.csv` dataset lists every individual who has taken part in the **PLAI** research study, which collects data on how people play video games. The dataset contains 196 observations and the following variables:

* `experience`: Object value giving the player's experience level with Minecraft: Beginner, Amateur, Regular, Veteran, or Pro.
* `subscribe`: Boolean value that indicates whether the player is subscribed to the mailing list.
* `hashedEmail`: Object value that holds the player's hashed email.
* `played_hours`: Float value that gives the total number of hours the player has spent playing the game on this server.
* `name`: Object value that holds the player's name.
* `gender`: Object value that holds the player's gender.
* `age`: Int value that gives the player's age.
* `individualId`: Unique player ID (left as NaN).
* `organizationName`: Player organization (left as NaN).
  
**Our Question: Can we predict the total number of hours a player will play during the study based on their age and/or their experience level?**

**Response Variable:** `played_hours`

**Predictor Variables:** `experience`, `age`

Answering this question will show the research group which kinds of players are likely to contribute the most data to the study. For example, the results may show that players at one experience level contribute far more hours than players at another, or that players at some level rarely contribute at all. The key variables for answering this question are `age`, `experience`, and `played_hours` in `players.csv`.

In [1]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

**Method and Results**

In [2]:
players = pd.read_csv("players.csv")
sessions = pd.read_csv("sessions.csv")

In [3]:
# Visualizing the data

# We plot age vs. played hours and color-code the points by experience level
# to better visualize possible relationships between the variables

player_chart = alt.Chart(players, title = "Comparison of Total Time Played vs Player Age (Figure 1)").mark_point().encode(
    x = alt.X("age").title("Age of Players").scale(zero = False),
    y=alt.Y('played_hours', title='Total time played (hours)'),
    color=alt.Color("experience").title("Experience Level"),
)

player_chart

In [4]:
# Restrict both axes to "zoom in" on the region where most points fall; the data may be more relevant here
zoomed_chart = alt.Chart(players, title="Zoomed Chart of Figure 1 (Figure 2)").mark_point(clip=True).encode(
    x = alt.X("age").title("Age of Players").scale(domain=[0, 50]),
    y=alt.Y('played_hours', title='Total time played (hours)').scale(domain=[0, 4]),
    color=alt.Color("experience").title("Experience Level"),
)

zoomed_chart

Because the research question asks us to predict a numerical value, we use **regression**. From the visualizations above (Figures 1 and 2), it is unclear whether played hours has a linear relationship with age or experience. For this reason, we chose **KNN regression**, which can model non-linear relationships and so suits predicting `played_hours` from `age` and `experience` even if the pattern is not linear. However, because we cannot rule out a linear relationship, we also fit a **linear regression** model to determine which of the two models is more effective.
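To illustrate why KNN regression can help when a relationship is not linear, here is a minimal sketch on synthetic data. The sine-shaped response, the sample size, and all `*_demo` names are assumptions made for illustration; none of it comes from the PLAICRAFT data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): the response follows a sine curve,
# which a single straight line cannot track.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit both model types on the same training split
knn_demo = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_tr, y_tr)
lm_demo = LinearRegression().fit(X_tr, y_tr)

# Compare test-set RMSPE for the two models
knn_rmspe_demo = mean_squared_error(y_te, knn_demo.predict(X_te)) ** 0.5
lm_rmspe_demo = mean_squared_error(y_te, lm_demo.predict(X_te)) ** 0.5
print(f"KNN RMSPE: {knn_rmspe_demo:.3f}, linear RMSPE: {lm_rmspe_demo:.3f}")
```

On data like this the KNN model should come out well ahead. When the true relationship is close to linear, or mostly noise, the advantage can vanish, which is why both models are fit in this project.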

In [5]:
# Remap the experience categories to numerical values so they can be used as a predictor in the models
players["experience"] = players["experience"].replace({
    "Beginner" : 1.0,
    "Amateur" : 2.0,
    "Regular" : 3.0,
    "Veteran" : 4.0,
    "Pro" : 5.0
})

# Split our data
players_training, players_testing = train_test_split(
    players,
    test_size=0.20,
    random_state=2000,
)
X_train = players_training[["experience","age"]]
y_train = players_training["played_hours"]

X_test = players_testing[["experience","age"]]
y_test = players_testing["played_hours"]

# knn model building
players_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

players_cv = pd.DataFrame(
    cross_validate(
        players_pipe,
        X_train,
        y_train,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
        cv=5
    )
)
players_cv



| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.003157 | 0.002243 | -38.359258 | -24.460191 |
| 1 | 0.002565 | 0.001513 | -35.315021 | -27.24268 |
| 2 | 0.00258 | 0.001481 | -13.584086 | -31.058668 |
| 3 | 0.002521 | 0.001524 | -40.870674 | -24.536563 |
| 4 | 0.002429 | 0.00145 | -17.023712 | -31.839362 |
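The numeric recoding of `experience` in the cell above treats the five levels as equally spaced, which is a modelling assumption rather than a property of the data. As a hedged alternative to `replace`, the sketch below uses `Series.map` on a hypothetical five-row sample (the `demo` frame and the `experience_num` column are illustrative names), which makes any unmapped label show up as NaN.

```python
import pandas as pd

# Hypothetical mini-sample; the level names match the mapping used in this notebook
demo = pd.DataFrame(
    {"experience": ["Pro", "Beginner", "Regular", "Amateur", "Veteran"]}
)

experience_levels = {
    "Beginner": 1.0,
    "Amateur": 2.0,
    "Regular": 3.0,
    "Veteran": 4.0,
    "Pro": 5.0,
}

# Series.map returns NaN for any label missing from the dictionary,
# so a typo or an unexpected level is easy to detect.
demo["experience_num"] = demo["experience"].map(experience_levels)
assert demo["experience_num"].notna().all(), "found an unmapped experience level"
print(demo)
```

Keeping the numeric codes in a separate column also leaves the original labels available for plotting.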


In [6]:
np.random.seed(101)
param_grid = {'kneighborsregressor__n_neighbors': range(1, 100, 1)}
players_tuned = GridSearchCV(players_pipe, param_grid, cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error")
players_results = pd.DataFrame(players_tuned.fit(X_train, y_train).cv_results_)
players_results



| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsregressor__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002554 | 0.000187 | 0.001507 | 0.000058 | 1 | {'kneighborsregressor__n_neighbors': 1} | -59.516718 | -36.570118 | -4.455080 | -40.293108 | -15.168580 | -31.200721 | 19.419868 | 98 |
| 1 | 0.002578 | 0.000312 | 0.001598 | 0.000239 | 2 | {'kneighborsregressor__n_neighbors': 2} | -45.321621 | -38.638925 | -19.942162 | -40.330880 | -15.169454 | -31.880609 | 11.996099 | 99 |
| 2 | 0.002455 | 0.000078 | 0.004303 | 0.005333 | 3 | {'kneighborsregressor__n_neighbors': 3} | -42.524837 | -35.225325 | -13.542270 | -40.314107 | -19.983454 | -30.317999 | 11.499903 | 96 |
| 3 | 0.002356 | 0.000009 | 0.001457 | 0.000057 | 4 | {'kneighborsregressor__n_neighbors': 4} | -44.607392 | -33.973574 | -16.927260 | -41.488809 | -18.075288 | -31.014465 | 11.568124 | 97 |
| 4 | 0.002347 | 0.000008 | 0.001439 | 0.000014 | 5 | {'kneighborsregressor__n_neighbors': 5} | -38.359258 | -35.315021 | -13.584086 | -40.870674 | -17.023712 | -29.030550 | 11.397064 | 95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 0.002249 | 0.000027 | 0.001566 | 0.000009 | 95 | {'kneighborsregressor__n_neighbors': 95} | -38.633742 | -31.484883 | -8.493834 | -39.460906 | -14.503360 | -26.515345 | 12.713788 | 39 |
| 95 | 0.002242 | 0.000011 | 0.001562 | 0.000015 | 96 | {'kneighborsregressor__n_neighbors': 96} | -38.637628 | -31.484967 | -8.379666 | -39.465659 | -14.490842 | -26.491753 | 12.750265 | 36 |
| 96 | 0.002276 | 0.000034 | 0.001575 | 0.000013 | 97 | {'kneighborsregressor__n_neighbors': 97} | -38.642477 | -31.484661 | -8.327222 | -39.468979 | -14.478423 | -26.480352 | 12.769082 | 34 |
| 97 | 0.002235 | 0.000018 | 0.001588 | 0.000027 | 98 | {'kneighborsregressor__n_neighbors': 98} | -38.646221 | -31.486729 | -8.225785 | -39.473703 | -14.456759 | -26.457840 | 12.803850 | 33 |


In [7]:
# Extract the "best" k value from the grid search via the best_params_ attribute of players_tuned
players_min = players_tuned.best_params_

# best_score_ holds the negated cross-validated RMSE for the best k, so we flip its sign
playersbest_RMSE = -players_tuned.best_score_

playersbest_RMSE

np.float64(25.75707928712318)
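The `scoring="neg_root_mean_squared_error"` option makes scikit-learn report RMSE as a negative number so that a larger score is always better, which is why the cell above negates `best_score_`. A minimal sketch on synthetic data shows the convention; the arrays and the `demo_*` names are illustrative assumptions, not part of the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the scoring convention
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(120, 2))
y_demo = X_demo[:, 0] + rng.normal(0, 1, size=120)

demo_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
demo_grid = GridSearchCV(
    demo_pipe,
    {"kneighborsregressor__n_neighbors": range(1, 21)},
    cv=5,
    scoring="neg_root_mean_squared_error",
).fit(X_demo, y_demo)

# best_params_ is a dictionary keyed by the parameter name from the grid
best_k = demo_grid.best_params_["kneighborsregressor__n_neighbors"]

# best_score_ is the highest (least negative) mean score; negate it to get RMSE
best_rmse = -demo_grid.best_score_
print(best_k, round(best_rmse, 3))
```

The selected k is the one whose mean score across the five folds is the largest, meaning the smallest RMSE after the sign flip.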

In [8]:
# See how our model (players_tuned) predicts our test data (X_test).
np.random.seed(1234)
players_prediction = pd.DataFrame(players_tuned.predict(X_test)).rename(columns={0: "Played Hours"})
players_prediction

| | Played Hours |
|---|---|
| 0 | 33.133333 |
| 1 | 0.65 |
| 2 | 0.475 |
| 3 | 4.15 |
| 4 | 0.433333 |
| 5 | 0.325 |
| 6 | 33.141667 |
| 7 | 2.783333 |
| 8 | 0.658333 |
| 9 | 0.658333 |


In [9]:
# Compute the RMSPE on the test set by comparing y_test with the model's predictions
players_summary_RMSPE = mean_squared_error(y_test, players_prediction)**(1/2)

players_summary_RMSPE

np.float64(25.362170401998327)

In [10]:
# Linear regression model building

lm = LinearRegression()
lm_fit = lm.fit(X_train, y_train)

# Attach the model's predictions to the training and testing sets
players_preds = players_training.assign(
    predictions=lm.predict(X_train)
)
test_preds = players_testing.assign(
    predictions=lm.predict(X_test)
)

# RMSPE of the linear model on the test set
lm_rmspe = mean_squared_error(y_test, test_preds["predictions"])**(1/2)
lm_rmspe

np.float64(23.667497702128255)

**Discussion**

Interestingly, the linear regression model produced a slightly lower RMSPE on the test set (23.67 versus 25.36 hours for KNN), suggesting that the extra flexibility of KNN regression did not help in this case. This process highlights the importance of comparing different models, even when an initial analysis suggests there is no clear relationship. By comparing models, we can proceed with the one that yields the least error, thus improving our predictions.
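One way to make such a comparison systematic is to score every candidate model with the same cross-validation setup. The sketch below is an illustration on synthetic data, not the analysis performed in this project: the skewed response, the `candidates` dictionary, and the mean-only baseline (`DummyRegressor`) are all assumptions added to give the error values some context.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for (experience, age) and a skewed response
rng = np.random.default_rng(2)
X_demo = np.column_stack(
    [rng.integers(1, 6, size=150), rng.integers(10, 50, size=150)]
).astype(float)
y_demo = rng.exponential(5.0, size=150)  # generated independently of the predictors

candidates = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)),
}

# Average 5-fold RMSE per model (scores are negated RMSE, so flip the sign)
cv_rmse = {
    name: -cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}
for name, rmse in cv_rmse.items():
    print(f"{name}: {rmse:.2f}")
```

If neither model beats the mean-only baseline by much, that suggests the predictors carry little information about the response.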

In [11]:
# Map the numerical experience codes back to their category names for plotting
players["experience"] = players["experience"].replace({
1.0 : "Beginner",
2.0 : "Amateur",
3.0 : "Regular",
4.0 : "Veteran",
5.0 : "Pro"
})

In [12]:
experience_plot = alt.Chart(players, title='Plot of Player Experience vs Total Hours Played (Figure 3)').mark_bar().encode(
    x=alt.X('experience', title='Player Experience', sort='-y', axis=alt.Axis(labelAngle=25)),
    y=alt.Y('played_hours', title='Total time played (hours)'),
    color=alt.Color('experience', title='Experience')
).properties(
    width=300
)
experience_plot

In [13]:
lm_rmspe

np.float64(23.667497702128255)

In [14]:
players_summary_RMSPE

np.float64(25.362170401998327)

As the RMSPE values above show, the linear model's predictions are closer to the true values than those of the KNN model: **lm_rmspe** is 23.67, compared with 25.36 for **players_summary_RMSPE**. However, the difference between the two models' performance is minuscule; neither clearly outperforms the other.
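For reference, RMSPE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as `played_hours`. The sketch below computes it by hand and checks it against the `mean_squared_error(...)**(1/2)` form used in this notebook. The `rmspe` helper and the four example values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmspe(y_true, y_pred):
    """Root mean squared prediction error, computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical values: three small errors and one large miss
y_true_demo = [0.0, 1.5, 30.0, 0.2]
y_pred_demo = [2.0, 2.0, 5.0, 1.0]

manual = rmspe(y_true_demo, y_pred_demo)
via_sklearn = mean_squared_error(y_true_demo, y_pred_demo) ** 0.5
print(manual, via_sklearn)
```

Because the errors are squared before averaging, a single large miss dominates the result, which is worth keeping in mind when a few players have far more hours than the rest.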

In [15]:
lm.coef_

array([ 0.26554675, -0.10904625])

In [16]:
lm.intercept_

np.float64(7.545981735319522)
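The coefficients follow the column order of `X_train`, so the first applies to `experience` and the second to `age`. As a worked example, the sketch below plugs the values printed above into the fitted equation for a hypothetical player; the `predict_hours` helper and the chosen player (experience code 3.0, age 20) are assumptions for illustration.

```python
# Coefficients and intercept copied from the outputs above;
# the order follows X_train's columns: ["experience", "age"].
coef_experience, coef_age = 0.26554675, -0.10904625
intercept = 7.545981735319522


def predict_hours(experience_level, age):
    """Evaluate the fitted linear equation for one player."""
    return intercept + coef_experience * experience_level + coef_age * age


# Hypothetical player: "Regular" (coded 3.0) and 20 years old
example = predict_hours(experience_level=3.0, age=20)
print(round(example, 2))
```

Read this way, each additional experience level raises the predicted hours by about 0.27, and each additional year of age lowers it by about 0.11.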

