## Data Science Project: Predicting Usage of a Video Game Research Server

### Introduction

The purpose of this project is to help a research group at UBC understand how people play video games by analyzing the collected players.csv data to answer a predictive question.

> #### Question:
> Can we predict whether a player will subscribe to a game-related newsletter based on age and total hours played?
> 
> **predictor variables**: age and total hours played
> <br>**response variable**: subscribe

#### Data Description

The players.csv dataset contains 196 observations (one per player) and 9 variables describing each player.

| Variable |  Type  |  Description  |
|----------|--------|---------------|
| experience | categorical | Player experience level (beginner, amateur, regular, veteran, pro) |
| subscribe | binary | Subscription to game-related newsletter (True = subscribed, False = not subscribed) |
| hashedEmail | categorical | Identifier for the players (used to link individuals with their play session times in the sessions.csv file) |
| played_hours | numeric | Total number of play hours |
| name | categorical | Player name |
| gender | categorical | Player gender |
| age | numeric | Player age |
| individualId | categorical | Identifier for players (all entries missing/NaN) |
| organizationName | categorical | Player organization name (all entries missing/NaN) |

**Issues**:
<br>The dataset contains two variables whose values are all missing (individualId, organizationName); these should be dropped to avoid confusion and errors in the preprocessing steps. With only 196 observations, there may also not be enough data to train the model effectively on a training set and still evaluate its accuracy reliably on a testing set.
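As a minimal sketch of how all-missing columns can be detected and dropped, the example below uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not the real data) with the same two empty columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the players table; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [17, 21, 25],
    "played_hours": [0.5, 30.2, 1.0],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every entry is missing.
all_missing = toy.columns[toy.isna().all()].tolist()

# dropna(axis="columns", how="all") removes exactly those columns.
toy_clean = toy.dropna(axis="columns", how="all")

print(all_missing)
print(toy_clean.columns.tolist())
```

Checking for all-missing columns programmatically avoids hard-coding column names that might change if the dataset is updated.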

In [98]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score, precision_score

In [99]:
url = 'https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
player_data = pd.read_csv(url)

In [100]:
player_data['subscribe'].value_counts()

subscribe
True     144
False     52
Name: count, dtype: int64

**Issue**: The classes are imbalanced, with many more True labels (144) than False labels (52). Because a KNN model predicts the label of a new observation from the labels of nearby points, the imbalance will bias predictions toward True, the label carried by most points in the dataset.
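One way to put the imbalance in perspective is to compute the accuracy of a classifier that always predicts the majority label. The short sketch below uses only the counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {True: 144, False: 52}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 3))  # True 0.735
```

Any accuracy reported for a model on this data is most informative when compared against this baseline of roughly 73%.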

In [101]:
age_plot = alt.Chart(player_data, title='Distribution of player age').mark_bar().encode(
    x=alt.X('age')
        .bin(maxbins=20)
        .title('Player Age'),
    y=alt.Y('count()')
        .title('Number of Players'),
).configure_axis(titleFontSize=12)

hours_plot = alt.Chart(player_data, title='Distribution of Total Played Hours').mark_bar().encode(
    x=alt.X('played_hours')
        .bin(maxbins=20)
        .title('Total Played Hours'),
    y=alt.Y('count()')
        .title('Number of Players'),
).configure_axis(titleFontSize=12)

age_plot

**Figure 1**: Distribution of player age

In [102]:
hours_plot

**Figure 2**: Distribution of Total Played Hours

From Figures 1 and 2, we can see that player age and total played hours are both right-skewed: most data points fall within a narrow range (ages 15-25, or 0-20 played hours), with a long tail toward larger values. Both predictor variables therefore span a wide range because of a few outliers, which may have an outsized influence on the model's predictions.
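A quick numerical check on the shape of a distribution is to compare the mean with the median and to compute a skewness coefficient. The sketch below uses made-up played-hours values (not the real data) with many small values and a few very large ones; `sample_skewness` is a hypothetical helper written for this illustration.

```python
from statistics import mean, median

# Hypothetical played-hours values: many small values and a few large ones.
hours = [0.0, 0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 48.0, 150.0]

def sample_skewness(values):
    """Moment-based skewness: third central moment over the second central moment to the power 1.5."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# A long tail toward large values pulls the mean above the median
# and gives a positive skewness coefficient.
print(mean(hours), median(hours))
print(sample_skewness(hours) > 0)
```

On the real data, the same comparison could be run on the age and played_hours columns to confirm what the histograms suggest.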

### Methods and Results

#### 1. Load and wrangle the players.csv dataset
- drop the columns whose values are all missing (individualId, organizationName) and the columns not used in this model (experience, hashedEmail, name, gender)

#### 2. Standardize the data by making a preprocessor

#### 3. Split the players.csv dataset into training and testing datasets (80:20)
- a fixed random_state makes the split identical on every run, which keeps the results reproducible
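The notebook standardizes the full dataset before splitting it. A common alternative, sketched below on made-up data, is to split first (optionally with `stratify` so both parts keep the same True/False ratio) and fit the scaler on the training rows only. This is an illustration under those assumptions, not the procedure used in this analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 players, 2 features, 15 True and 5 False labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([True] * 15 + [False] * 5)

# stratify=y keeps the True/False ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```

Fitting the scaler on the training rows alone keeps the test set untouched until evaluation, so the test accuracy is a cleaner estimate of performance on unseen players.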

In [103]:
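# Return DataFrames (not NumPy arrays) from transformers so column names survive.
# Assumed here because the later cells select columns by name and call .head().
set_config(transform_output="pandas")

# Keep only the two predictors and the response.
player_drop = player_data.drop(
    columns=['individualId', 'organizationName', 'hashedEmail', 'name', 'gender', 'experience']
)

# Standardize the predictors; pass subscribe through unchanged.
player_preprocessor = make_column_transformer(
    (StandardScaler(), ['age', 'played_hours']),
    remainder='passthrough',
    verbose_feature_names_out=False
)

# Note: the scaler is fit on all rows before the split,
# so test-set statistics contribute to the scaling.
player_scaled = player_preprocessor.fit_transform(player_drop)

player_train, player_test = train_test_split(
    player_scaled,
    test_size=0.2,
    random_state=42
)
player_train.head()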

|     | age | played_hours | subscribe |
|-----|-----|--------------|-----------|
| 5 | -0.442141 | -0.20668 | True |
| 65 | -0.028984 | -0.203144 | True |
| 136 | -0.132273 | -0.20668 | True |
| 97 | -0.338852 | -0.203144 | False |
| 168 | -0.442141 | -0.203144 | True |


#### 3. Create a scatterplot to visualize the relationship between age and played hours with subscription status

In [104]:
player_plot = alt.Chart(player_train).mark_point().encode(
    x=alt.X('age').title('Player age (Standardized)'),
    y=alt.Y('played_hours').title('Total Played Hours (Standardized)'),
    color=alt.Color('subscribe').title('Subscription Status')
)
player_plot

**Figure 3**: Scatter plot of total played hours versus player age colored by subscribe label.

#### 4. Create a pipeline and use 5-fold GridSearchCV to select the value of K that gives the best accuracy

In [105]:
np.random.seed(1234)

# Candidate values of K: 2 through 14.
param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 15, 1),
}

X_train = player_train[['age', 'played_hours']]
y_train = player_train['subscribe']

# The pipeline standardizes again inside each fold; X_train was already standardized above.
player_pipe = make_pipeline(player_preprocessor, KNeighborsClassifier())

# 5-fold cross-validation over the grid, scored by accuracy (the classifier default).
knn_tune_grid = GridSearchCV(
    estimator=player_pipe,
    param_grid=param_grid,
    cv=5
)

knn_model_grid = knn_tune_grid.fit(X_train, y_train)

# One row per candidate K, including the mean accuracy across folds.
cross_val = pd.DataFrame(knn_model_grid.cv_results_)

cross_val_plot = alt.Chart(cross_val).mark_line(point=True).encode(
    x=alt.X('param_kneighborsclassifier__n_neighbors')
        .title('Number of Neighbors')
        .scale(zero=False),
    y=alt.Y('mean_test_score')
        .title('Average Accuracy Estimate')
        .scale(zero=False)
)
cross_val_plot

**Figure 4**: Plot of estimated accuracy versus the number of neighbors

From Figure 4, we can see that K=3 gives the highest estimated accuracy.
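Besides reading the best K off a plot, a fitted `GridSearchCV` object also exposes it through `best_params_` and `best_score_`. The sketch below shows this on made-up, well-separated two-class data (a stand-in for the training set, not the real data), so the K it selects says nothing about the players dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-class data standing in for the training set.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(3, 1, size=(30, 2))])
y = np.array([False] * 30 + [True] * 30)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    estimator=pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(2, 15)},
    cv=5,
)
grid.fit(X, y)

# best_params_ holds the grid entry with the highest mean cross-validated score.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(grid.best_score_, 3))
```

Reading the value programmatically avoids misjudging the plot when several values of K have nearly identical accuracy.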

#### 5. Build the KNN model

- use K=3, the value selected by cross-validation
- predict on the testing dataset
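To make the prediction step concrete, here is a minimal pure-Python sketch of how a K-nearest-neighbors classifier labels one new point: find the K closest training points by Euclidean distance and take a majority vote. The `knn_predict` helper and the data points are hypothetical illustrations, not scikit-learn's implementation.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical standardized (age, played_hours) points; the values are made up.
points = [(-0.4, -0.2), (-0.3, -0.2), (0.0, -0.2), (1.5, 2.0), (1.8, 2.2)]
labels = [True, False, True, False, False]

print(knn_predict(points, labels, query=(-0.35, -0.2), k=3))  # True
```

With an odd K and two classes, the vote can never tie, which is one practical reason to prefer odd values such as 3.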

In [106]:
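np.random.seed(1234)

# Fit KNN with K=3 directly on the training predictors (already standardized above).
KNN = KNeighborsClassifier(n_neighbors=3)
KNN_fit = KNN.fit(X_train, y_train)

# Attach the true and predicted labels to the test rows.
test_predictions = player_test.assign(
    true=player_test['subscribe'],
    predicted=KNN_fit.predict(player_test[['age', 'played_hours']])
)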

#### 6. Evaluate Accuracy of KNN model

In [107]:
X_test = player_test[['age', 'played_hours']]
y_test = player_test['subscribe']

# Accuracy on the held-out test set.
player_prediction_accuracy = KNN_fit.score(X_test, y_test)

# Confusion matrix: rows are the true labels, columns are the predicted labels.
player_mat = pd.crosstab(
    test_predictions['subscribe'],
    test_predictions['predicted'],
)
player_mat

| subscribe (true) | predicted False | predicted True |
|------------------|-----------------|----------------|
| False | 4 | 8 |
| True | 4 | 24 |


In [108]:
player_prediction_accuracy

0.7

In [109]:
precision_score(
    y_true=test_predictions['subscribe'],
    y_pred=test_predictions['predicted'],
    pos_label= True
)

np.float64(0.75)

In [110]:
recall_score(
    y_true=test_predictions['subscribe'],
    y_pred=test_predictions['predicted'],
    pos_label= True
)

np.float64(0.8571428571428571)

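Confusion Matrix:
- 4 observations were correctly predicted as False
- 24 observations were correctly predicted as True
- 8 observations were incorrectly predicted as True (actually False)
- 4 observations were incorrectly predicted as False (actually True)

The estimated accuracy on the test data is 70%, with a precision of 75% and a recall of about 86% for the True class.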
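The three metrics can be reproduced by hand from the confusion matrix counts, which is a useful sanity check on the scikit-learn outputs above. The sketch below treats True as the positive class; the variable names are illustrative.

```python
# Counts read from the confusion matrix above (positive class = True).
tn, fp = 4, 8    # actual False: predicted False, predicted True
fn, tp = 4, 24   # actual True:  predicted False, predicted True

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted True that were actually True
recall = tp / (tp + fn)      # share of actual True that were predicted True

print(accuracy, precision, round(recall, 4))  # 0.7 0.75 0.8571
```

The 8 false positives are twice the 4 true negatives, so most of the model's errors come from labeling non-subscribers as subscribers.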
....