# GROUP 31

## TITLE: 
Predicting the experience levels of Plaicraft participants under 60 years old (based on their age and hours played) to determine which experience levels are most likely to log the most played hours.

## INTRODUCTION
A UBC Computer Science research group is looking to predict the usage of Plaicraft, a mobile game similar to Minecraft. To support their research, this report explores the types of people who play the game as well as their behaviours in the game. Specifically, this analysis uses age and played hours as the predictor variables in a KNN classification to determine the experience level of players under 60 years old.
Our question is: what experience level will players under 60 be predicted to have, based on their age and number of played hours?
The dataset used was players.csv, which contains information about 196 observations of players and 9 different variables. These variables include:
- level of experience
- subscription status
- hashed email
- played hours
- name
- gender
- age
- individual ID
- organization name

By treating experience level as the categorical class and using age and number of played hours as numerical predictors, we are able to use their relationships to predict experience.

The second part of our analysis will answer the question of which experience level(s) of players are most likely to play a large number of hours of Plaicraft. This will help the researchers market and design their game for the people who express the greatest interest in playing. It will also allow the researchers to see which markets they may be missing out on and potentially improve some of their game features to appeal to those types of people.


## METHODS:

Before building the KNN classification model, we must load the required packages into Python:

In [1]:
### Loading packages
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [2]:
url2 = "https://raw.githubusercontent.com/agallagh/DSCI-Project/refs/heads/main/players.csv"
players_data = pd.read_csv(url2)
players_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


As our question targets players aged under 60, we need to filter the data for players under that age and drop the columns that are irrelevant to our analysis. This means dropping every column except experience, played hours, and age.

In [3]:
# tidying the data by dropping the unnecessary columns

tidy_players = players_data[["experience", "played_hours", "age"]]
tidy_players

# filtering data in age column for our demographic

filtered_age_df = tidy_players[tidy_players["age"] < 60]
filtered_age_df

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
1,Veteran,3.8,17
2,Veteran,0.0,17
3,Amateur,0.7,21
4,Regular,0.1,21
...,...,...,...
190,Amateur,0.0,20
191,Amateur,0.0,17
192,Veteran,0.3,22
193,Amateur,0.0,17


In [4]:
# creating a scatterplot for our variables and colouring by experience
age_chart = alt.Chart(filtered_age_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience")
).properties(width=700)

age_chart

#### Figure 1.
_This scatterplot shows hours played against age for individuals under age 60 who participated in the study. The individual points are colour coded by experience level._

In [5]:
# creating a bar plot of experience vs played hours

experience_chart = alt.Chart(filtered_age_df).mark_bar().encode(
    x=alt.X("experience").title("Experience"),
    y=alt.Y("played_hours").title("Played Hours")
).properties(width = 500).configure_axisX(labelAngle = -45)

experience_chart

#### Figure 2.
_This bar graph represents the total played hours for each experience level, summed over all the individuals in that level._

Both Figure 1 and Figure 2 are needed to fully understand the data. On its own, the bar graph would tell us that amateurs and regulars contribute the largest amounts of played hours, which targets our specific question. However, the scatterplot shows that the bar plot is heavily influenced by a few outlier individuals: the amateurs and regulars who have played over 150 hours are largely responsible for the prediction we would be making. To make the dataset more representative of the overall demographic, we filter out those outliers by keeping only rows where played hours is less than 100.
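To illustrate how a single extreme value can dominate a per-group total, the following is a minimal sketch on made-up toy data (not the study data). The frame `toy` and its values are assumptions chosen only to show the contrast between the sum, mean and median.

```python
import pandas as pd

# Hypothetical toy data: one extreme Amateur value stands in for an outlier
toy = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Amateur", "Pro", "Pro", "Pro"],
    "played_hours": [0.5, 1.0, 150.0, 2.0, 3.0, 4.0],
})

# Total, mean and median played hours per experience level
summary = toy.groupby("experience")["played_hours"].agg(["sum", "mean", "median"])
print(summary)
```

In this toy example the Amateur total is far larger than the Pro total, yet the Amateur median is lower, which is the kind of pattern that a bar chart of sums can hide.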

In [6]:
# filtering the played hours to be less than 100

filtered_hrs_df = filtered_age_df[filtered_age_df["played_hours"] < 100] 

In [7]:
filtered_chart = alt.Chart(filtered_hrs_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience Level")
).properties(width = 500)



# Facet by experience to make the visualization more clear.
facetted_chart = alt.Chart(filtered_hrs_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience Level")
).properties(width = 200).facet("experience", columns = 5)

facetted_chart

#### Figure 3.
_This facetted plot shows the filtered data for the individuals in each experience level. Each coloured point represents an individual._

Figure 3 demonstrates that each experience level covers roughly the same range of ages, with most individual points clustering between ages 15 and 30. We can also note a few outliers with high played hours in each experience level, with the amateur category having the most. Each graph shows that a typical player has under 10 played hours, regardless of experience level. Even before doing the analysis, we can see that there is no clear separation or relationship between age, experience level and played hours, and each group has its own outliers. Therefore, we expect that the model may have low accuracy and may not perform well when predicting the test data or any new real data.

#### Training The Model

To create a KNN classification model, the first step is to split the dataset into two subgroups: one to train the model, and another to test it. For this analysis, we decided to use 75% of the data points to train our model to make it as accurate as possible. After splitting the data, we created X and y objects to select the predictor (played hours and age) and class (experience level) columns, respectively.

In [26]:
np.random.seed(5000) 

#Train model

# split the train/test data: 75% train, 25% test
player_train, player_test = train_test_split(filtered_hrs_df, train_size = 0.75, random_state = 5000)

# make X/y objects
X_train = player_train[['played_hours', 'age']]
y_train = player_train['experience']

# make X/y testing objects
X_test = player_test[['played_hours', 'age']]
y_test = player_test['experience']
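
When class sizes are uneven, a plain random split can leave a small class under-represented in one of the subgroups. As a hedged aside, the sketch below shows the `stratify` argument of `train_test_split` on made-up toy data; the frame `toy` and its 80/20 class balance are assumptions for illustration, not properties of the study data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with an 80/20 class imbalance
toy = pd.DataFrame({
    "played_hours": [float(i) for i in range(20)],
    "age": [15 + i for i in range(20)],
    "experience": ["Amateur"] * 16 + ["Pro"] * 4,
})

# stratify keeps the class proportions roughly equal in both splits
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["experience"], random_state=5000
)

print(toy_train["experience"].value_counts())
print(toy_test["experience"].value_counts())
```

Whether stratifying would change the results reported here is not something this sketch can show; it only demonstrates the option.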

#### Finding K: Cross-Validation

The next step is to find the best k-value to use in our model, based on the training data. To find this, we create a KNN classifier without fixing the number of neighbours and select a range of k-values we want to evaluate. In this case, we chose a range of neighbours between k = 2 and k = 15. Larger numbers are more computationally expensive, so it's best to look at a lower range for the k-value.

In [27]:
knn_spec = KNeighborsClassifier()

param_grid = {
    "n_neighbors": range(2, 16, 1),
}

Now that we have a classifier object and a grid of candidate k values, we can perform five-fold cross-validation and fit the tuning grid to our training X (predictors) and y (class) objects. Using the ```cv_results_``` attribute of our fitted model grid, we can see which k values (```params```) have the highest cross-validation accuracy (as seen in the ```mean_test_score``` column).

In [28]:
knn_tune_grid = GridSearchCV(
    knn_spec, param_grid, return_train_score=True, n_jobs=-1, cv=5
)

knn_model_grid = knn_tune_grid.fit(X_train, y_train)

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)

accuracies_grid[['params', 'mean_test_score']]

Unnamed: 0,params,mean_test_score
0,{'n_neighbors': 2},0.246059
1,{'n_neighbors': 3},0.267734
2,{'n_neighbors': 4},0.253448
3,{'n_neighbors': 5},0.26798
4,{'n_neighbors': 6},0.261084
5,{'n_neighbors': 7},0.261084
6,{'n_neighbors': 8},0.225369
7,{'n_neighbors': 9},0.246305
8,{'n_neighbors': 10},0.190148
9,{'n_neighbors': 11},0.21798


This information can be represented graphically (as shown in Figure 4) to make the preferable k value easier to interpret.

In [11]:
cross_val_plot = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_n_neighbors").title("Number of neighbours").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean test score").scale(zero=False)
)

cross_val_plot

#### Figure 4.
_The graph above shows the cross-validation accuracy (via the mean test score) from the training data (```player_train```) for different numbers of neighbours (k). The highest peak lies at k = 5, so we can conclude that five neighbours is the best value to use for the KNN classification model for this data._

Because it is difficult to read the best number of neighbours from Figure 4, due to the two peaks of nearly equal height (at k = 3 and k = 5), we can use the ```best_params_``` attribute of the tuned grid to confirm which k value is best.

In [12]:
# confirm the best k value found by the grid search
knn_tune_grid.best_params_['n_neighbors']

5
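
KNN relies on distances, so the scale of each predictor can affect which k is chosen. As a hedged variation on the search above, the sketch below tunes k with a `StandardScaler` inside a pipeline, so that scaling is re-fit on each cross-validation training fold. It runs on made-up synthetic data; `X_demo`, `y_demo`, `demo_pipe` and `demo_grid` are hypothetical names and the resulting k says nothing about the study data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for a training split
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "played_hours": rng.uniform(0, 100, 60),
    "age": rng.integers(9, 60, 60),
})
y_demo = pd.Series(["Amateur", "Pro", "Veteran"] * 20)

# Scaling sits inside the pipeline, so it is re-fit on each training fold
demo_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
demo_grid = GridSearchCV(
    demo_pipe,
    {"kneighborsclassifier__n_neighbors": range(2, 16)},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
```

The parameter key uses the `step__parameter` naming that scikit-learn pipelines expect when they are tuned through `GridSearchCV`.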

#### Calculating Model Accuracy

Now that we have determined the best number of neighbours to use for predicting experience class in this dataset (k = 5), it is good practice to evaluate the accuracy of the model on the testing data.

In [13]:
#testing accuracy
model_accuracy = knn_model_grid.score(X_test, y_test)
model_accuracy

0.3125

An accuracy of 31.25% is low: for comparison, guessing uniformly at random among the five experience levels would be correct about 20% of the time. This is something to consider in our discussion of how suitable the data is for this model, although real-life datasets are rarely perfect. It tells us that our classifier will not predict the experience levels of players well based on their age and the number of hours they played.
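
One way to put an accuracy figure in context is to compare it with a trivial baseline that always predicts the most common class. The sketch below uses scikit-learn's `DummyClassifier` on made-up toy labels; `X_toy`, `y_toy` and the class proportions are assumptions for illustration, so the printed value is not a baseline for the study data.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels; the real class proportions may differ
X_toy = pd.DataFrame({"played_hours": [0.0] * 10, "age": [17] * 10})
y_toy = pd.Series(["Amateur"] * 4 + ["Veteran"] * 3 + ["Pro"] * 2 + ["Regular"] * 1)

# Always predicts the most common class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

A model is only informative to the extent that it beats a baseline of this kind on the same test split.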

#### Using The Model To Predict Experience Level

Continuing with our analysis, we will create a model with the k = 5 value we found above, retrieved through the ```best_params_``` attribute. We can then make a preprocessor for our model to standardize the ```played_hours``` and ```age``` variables.

In [14]:
np.random.seed(3131)

# Creating model with 5 neighbours
knn_best = KNeighborsClassifier(knn_tune_grid.best_params_['n_neighbors'])

# making a preprocessor and scaling our variables of interest
player_preprocessor = make_column_transformer(
    (StandardScaler(), ['played_hours', 'age']),
    remainder='passthrough',
    verbose_feature_names_out=False
)
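
To show why standardizing matters for a distance-based model, the sketch below compares Euclidean distances before and after scaling on three made-up players. The frame `toy` and its values are assumptions picked so that the arithmetic is easy to follow.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical players: rows 0 and 1 differ only in hours, rows 0 and 2 only in age
toy = pd.DataFrame({
    "played_hours": [0.0, 90.0, 0.0],
    "age": [17, 17, 47],
})

# Distances on the raw values are dominated by the wider-ranging column
raw_dist = euclidean_distances(toy)

# After standardizing, each column contributes on a comparable scale
scaled_dist = euclidean_distances(StandardScaler().fit_transform(toy))

print(raw_dist[0, 1], raw_dist[0, 2])
print(scaled_dist[0, 1], scaled_dist[0, 2])
```

On the raw values the hours gap makes player 1 look three times farther from player 0 than player 2 is; after scaling, the two distances are equal in this toy example.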

Next, we can make a pipeline with the preprocessor and the k = 5 model, then fit it to the testing data. Using the ```predict()``` method, we can predict the experience level of players from their hours and age. Finally, adding a new column to the dataframe for the predicted experience levels allows us to compare the true and predicted levels more clearly and prepares us to create a graph of the data.

In [15]:
# fitting the test data to the model
players_test_fit = make_pipeline(player_preprocessor, knn_best).fit(X_test, y_test)

# making a new column for predicted experience level
player_test_predictions = player_test.assign(
    predicted=players_test_fit.predict(X_test)
)

player_test_predictions

Unnamed: 0,experience,played_hours,age,predicted
34,Beginner,0.6,26,Amateur
98,Amateur,0.0,17,Amateur
2,Veteran,0.0,17,Amateur
25,Regular,0.6,28,Amateur
184,Pro,1.7,17,Pro
148,Veteran,0.0,18,Amateur
103,Beginner,2.0,27,Amateur
1,Veteran,3.8,17,Pro
182,Pro,0.2,17,Amateur
94,Beginner,0.8,22,Amateur
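
The pipeline above is fit on the testing split itself. A more conventional arrangement fits the scaler and classifier on the training split only and uses the testing split purely for prediction. The following is a minimal sketch of that arrangement on made-up toy data, with a cross-tabulation of true against predicted labels; `toy_train`, `toy_test`, `toy_fit` and the choice of three neighbours are all assumptions for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy splits; values are made up for illustration
toy_train = pd.DataFrame({
    "played_hours": [0.0, 0.5, 1.0, 20.0, 25.0, 30.0],
    "age": [17, 18, 19, 25, 26, 27],
    "experience": ["Amateur"] * 3 + ["Pro"] * 3,
})
toy_test = pd.DataFrame({
    "played_hours": [0.2, 28.0],
    "age": [17, 26],
    "experience": ["Amateur", "Pro"],
})

features = ["played_hours", "age"]

# Fit scaler and classifier on the training split only
toy_fit = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(
    toy_train[features], toy_train["experience"]
)

# Predict on the held-out split and tabulate true vs predicted labels
toy_pred = toy_test.assign(predicted=toy_fit.predict(toy_test[features]))
print(pd.crosstab(toy_pred["experience"], toy_pred["predicted"]))
```

Refitting the study pipeline this way would produce different predictions from the table above, so the two should not be mixed.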


To get a better sense of the distribution of the predicted experience levels, we can plot the hours played vs. age variables and use color to differentiate the predicted experience levels.

In [16]:
# Making a plot of the predicted experience levels against played hours
predicted_plot = alt.Chart(player_test_predictions).mark_point().encode(
    x= alt.X('age').title("Age of Player").scale(zero=False),
    y= alt.Y('played_hours').title("Number of Hours Played").scale(zero=False),
    color=alt.Color('predicted:N').title("Predicted Experience Level")
)

predicted_plot

#### Figure 5.
_This graph shows the _predicted_ experience level of users from the testing data based on their age and the number of hours they played Plaicraft. Data was filtered to users under 60 years old who played less than 100 hours of Plaicraft, in order to restrict the analysis to younger demographics and reduce the effect of outliers. Experience level was determined using a KNN classification model with 5 neighbours on data collected by Pacific Laboratory for Artificial Intelligence (PLAI) at UBC._

We can also look at the chart of the true test data points to see how the model compares:

In [17]:
# A plot of the actual test data, and their classifications
test_plot = alt.Chart(player_test).mark_point().encode(
    x= alt.X('age').title("Age of Player").scale(zero=False),
    y= alt.Y('played_hours').title("Number of Hours Played").scale(zero=False),
    color=alt.Color('experience:N').title("Experience Level")
)

test_plot

#### Figure 6.
_This graph shows the observed experience levels of users in the testing data based on their age and the number of hours they played Plaicraft. Data was filtered to users under 60 years old who played less than 100 hours of Plaicraft, in order to restrict the analysis to younger demographics and reduce the effect of outliers. The experience levels shown are the recorded values from data collected by Pacific Laboratory for Artificial Intelligence (PLAI) at UBC._

#### Testing The Model on Random Points of High Played Hours

Since the goal of our analysis is to establish which experience levels are more likely to play Plaicraft for large amounts of time, the next part of this analysis involves generating random points that represent individuals between 1 and 30 years old who play 5 to 30 hours of Plaicraft. We limited the upper bound of both ranges to 30, instead of our initial 60 for age and 100 for played hours, because we wanted to remove the effect of outliers on the predictions and make the points more realistic to how most participants tend to behave.

We also restricted the new points to have at least five hours of playtime. We found that this is a reasonable base value for what we can consider as "large amounts of time."

In [25]:
# Creating random points with high played hours to see what their experience would be predicted as

# generating random points
np.random.seed(3131)

num_rows=20
var1 = np.random.randint(5,31, num_rows)
var2 = np.random.randint(1,31, num_rows)

new_obs=pd.DataFrame({
    "played_hours":var1,
    "age":var2
})

new_obs_df = pd.DataFrame(new_obs)
new_obs_df

Unnamed: 0,played_hours,age
0,20,22
1,18,19
2,15,5
3,11,14
4,24,1
5,6,14
6,13,6
7,27,26
8,12,11
9,29,16


Next, we can pass these new high-played-hours points to our fitted classification model and build a data frame that pairs the predicted experience level of each point with its played hours.

In [23]:
# predicting the experience level of the new points with the fitted model
player_predicted = pd.DataFrame(players_test_fit.predict(new_obs_df)).assign(
    played_hours = new_obs_df['played_hours'])

#dropping NaN points and renaming columns
high_hours_predictions = player_predicted.dropna().rename(columns = {0:"predicted_experience"})

#take a look at the dataframe
high_hours_predictions.sort_values('predicted_experience')

Unnamed: 0,predicted_experience,played_hours
0,Amateur,20
1,Amateur,18
15,Amateur,11
14,Amateur,29
13,Amateur,21
7,Amateur,27
11,Amateur,27
17,Pro,30
16,Pro,24
12,Pro,25


#### Mean Played Hours by Predicted Experience Level

For reference, we can calculate the mean number of hours played by the different experience levels contained in these randomly generated points to see which is the greatest.

From these, we can see that amateurs play Plaicraft for longer hours on average, based on our model.

In [24]:
#finding the mean played hours for each experience level from the new points
high_hours_predictions.groupby('predicted_experience').agg("mean").reset_index()

Unnamed: 0,predicted_experience,played_hours
0,Amateur,21.857143
1,Pro,18.461538


From this analysis, we can say that Amateurs are the experience level most likely to spend a lot of time playing the game, with Pros playing about three hours less on average. This result could be used to target marketing to Amateur players, but we must weigh it against the low accuracy (31%) of our model, which could make these predictions unreliable.
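
A group mean is easier to interpret alongside the number of points behind it. The sketch below uses pandas named aggregation to report both in one table, on made-up toy predictions; `toy_predictions`, `mean_hours` and `n_points` are hypothetical names and the values are not the study results.

```python
import pandas as pd

# Hypothetical predictions; labels and hours are made up for illustration
toy_predictions = pd.DataFrame({
    "predicted_experience": ["Amateur", "Amateur", "Pro", "Pro", "Pro"],
    "played_hours": [20, 30, 10, 15, 20],
})

# Named aggregation: one column for the mean, one for the group size
toy_summary = (
    toy_predictions.groupby("predicted_experience")["played_hours"]
    .agg(mean_hours="mean", n_points="count")
    .reset_index()
)
print(toy_summary)
```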

## DISCUSSION:

The first part of our analysis, which involved creating our model, found the accuracy to be quite low, indicating that the raw data was not ideal for making predictions. The second part of our analysis involved generating random points using the ```np.random.randint``` function and creating a data frame. To address which players contribute large amounts of played hours, we restricted the randomly generated points to have ```played_hours``` of at least 5. We passed the new points to our model and used the ```predict``` method. We then calculated the mean number of played hours for each of the experience levels predicted (amateur and pro) to understand the data better. This led us to the conclusion that amateur players are more likely to contribute large amounts of played hours, followed closely by pros. Unfortunately, these results have to be taken with a grain of salt given the low accuracy of our model.

This data analysis was made difficult by the outliers in the age and played hours variables, as they heavily skewed the results of our predictions. Because there was little distinction between the range of ages in each experience level, and little difference in played hours between experience levels, it was clear even before the analysis was done that the model would struggle. It was therefore expected that the model would have low accuracy and not predict the test set well, as demonstrated by the best k value (5) producing an accuracy of only 0.3125 when the model was evaluated on the test set. This means that only about 31% of the model's predictions match the actual experience level.

This model would not be very good for future use; however, if more data points were collected for the study, perhaps there would be clearer relationships. This would also allow the model to be trained on more data, which could possibly increase its overall accuracy. Unfortunately, whenever real data is used there will be outliers, especially in the context of played hours for an online game, and this caused a heavy skew in the data. This becomes an issue when creating a model because the few very high played-hours values lie far from where the majority of the data is, so those large values influence the model too much. The best we could do was standardize the data while making the preprocessor, ensuring that age and played hours have an evenly weighted influence on the model despite their different ranges of values. That being said, with a more accurate model the researchers could identify demographics that aren't playing large hours, thus finding gaps in their marketing. They could then make alterations to their game or marketing that appeal to these demographics, making the game more universal by design.

Future questions:
- How could less-represented demographics, such as beginners, be targeted? One possible method is a beginner tutorial shown when a person enters the game, including specific instructions on which keys to press and on the objectives and dangers of the virtual world.
- How do players who binge the game for extended hours differ from players who play shorter, consistent sessions over longer periods of time? This could likely connect back to age and how much free time people in different age groups have.
- Are people who participate in multiplayer mode more likely to contribute more played hours than people who play by themselves?
