# <u> **Predicting Newsletter Subscription of Plays on PLAI Minecraft Server**


By Xuan Tung Duong, Sara Garcia Rubiera, Daniel Samari and Demelza Awogu


### <u> **Introduction**

The Pacific Laboratory for Artificial Intelligence (PLAI) is an Artificial Intelligence research group from the Department of Computer Science at UBC, whose current research involves work in generative modelling, probabilistic programming, generative modeling and Bayesian interference to advance AI (Pacific Laboratory for Artificial Intelligence, n.d.). For this data analysis project, the data used will be from PLAI's leading program for video game AI learning _Plaicraft_, where players' actions were recorded as they engaged with the game (Pacific Laboratory for Artificial Intelligence, n.d.). The problem brought up for us to examine in this coding challenge was to aid in ensuring PLAI have sufficient resources to provide optimal experiences for players. The research question we chose to approach this issue is:

_Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?_

In this project, we attempt to answer a more specific research question from the one above, and hence our research question is as follows:

**Can we predict the whether a player will subscribe to the game newsletter based on their hours played, age and experience level?**

To do this, we focus the data into three variables, age and experience level, then use a predictive algorithm to find what conditions make a player more likely to subscribe to a game newsletter.


In the table below, there are 9 columns and 196 rows.
The variables describe the following: 

1.`experience` is a categorical variable, with levels of experience, where `Veteran` is the highest, followed by `Pro`, `Regular` and `Amateur`.

2. `subscribe` is a boolean variable, indicating the whether players have subscribed to the newsletter of the game.

3. `hashedEmail` is an ID variable that refers to the players email.

4. `played_hours` is a quantitative variable - refers to the number of hours played on the game.

5. `name` refers to the name of the player.

6. `gender` indicates that of the player, and `age` is their age.

Some of the columns seem to not provide the information we want:
* `individualId` and `organizationName`has no values of data (NaN columns), and so would not be useful in answering our question.
* `hashedEmail` and `name` have information about the players , but is not relevant nor useful to answer a possible questions because of the unclear values of data in the column.
* `experience` is unclear with the rankings of each player, as we do not know whether "Veteran" deems as better experiences that a "Pro" in this game, so we might have to assume.
* The `played_hours` variable is incredibly skewed with many players near 0 hours, and `age` contains an outlier (age 91), which may affect distance-based models like KNN.

In [1]:
#Input libraries 
import pandas as pd 
import numpy as np 
import altair as alt 

In [2]:
#Import urls
players_url = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(players_url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


### <u> **Methods and Results**

First, we load all the necessary wrangling and libraries needed for our investigation, including setting our random numpy seed to 100.

In [3]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn import set_config
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

set_config(transform_output="pandas")

np.random.seed(100)

Then, we drop the columns `individualId`, `organizationName`, `hashedEmail`, `name` and `gender` as these are not important or relevant varibales in our exploration of the data.

In [4]:
players.drop(columns=['individualId','organizationName', 'hashedEmail', 'name', 'gender'],inplace = True)
players

Unnamed: 0,experience,subscribe,played_hours,age
0,Pro,True,30.3,9
1,Veteran,True,3.8,17
2,Veteran,False,0.0,17
3,Amateur,True,0.7,21
4,Regular,True,0.1,21
...,...,...,...,...
191,Amateur,True,0.0,17
192,Veteran,False,0.3,22
193,Amateur,False,0.0,17
194,Amateur,False,2.3,17


Next, we visualise variables that might be of interest.
* `KNearestNeighbors` will be used in classification.
* `experience` will be used to rank players into different categories/rankings, but because it is ordinal, a new column called `experience_value` will be changed so that the following will be used in Kneighbours Classification:
    1. `Amateur` will be labelled as 1
    2. `Regular` will be labelled as 2
    3. `Pro` will be labelled as 3
    4. `Veteran` will be labelled as 4
* `subscribe` will be used to show whether the players have subscribed or not.
* `played_hours` will be used to compare the amount of hours spent on the game by each player.
* `age` is used to to compare the age of the players.

In [5]:
experience_value = {'Amateur':1,'Regular':2,'Pro':3,'Veteran':4}

experience_to_val = players['experience'].to_numpy().copy()

for old_val, new_val in experience_value.items():
    experience_to_val[experience_to_val == old_val] = new_val

players = players.assign(experience_value=experience_to_val)
players

Unnamed: 0,experience,subscribe,played_hours,age,experience_value
0,Pro,True,30.3,9,3
1,Veteran,True,3.8,17,4
2,Veteran,False,0.0,17,4
3,Amateur,True,0.7,21,1
4,Regular,True,0.1,21,2
...,...,...,...,...,...
191,Amateur,True,0.0,17,1
192,Veteran,False,0.3,22,4
193,Amateur,False,0.0,17,1
194,Amateur,False,2.3,17,1


The data is split into training and testing data set using an 80/20 split. Because our dataset is quite small, the 80% training portion (which is 156 observations) ensures that the model has sufficient data to be able to classify the 20% testing set (40 values). This provides us with enough data to evaluate how well the fitted model performs on new data while still keeping the training set large enough to properly train the model.

In [6]:
players_train, players_test = train_test_split(
    players, train_size=0.80, stratify=players['subscribe']
)
players_train

Unnamed: 0,experience,subscribe,played_hours,age,experience_value
163,Regular,True,0.5,20,2
19,Regular,True,0.6,19,2
121,Beginner,True,0.1,24,Beginner
10,Veteran,True,1.6,23,4
153,Beginner,True,0.1,17,Beginner
...,...,...,...,...,...
195,Pro,True,0.2,91,3
53,Amateur,True,0.2,27,1
130,Amateur,True,56.1,23,1
72,Veteran,True,0.0,17,4


Shown below is Figure 1: A Visualisation of Player Observations in the Training Set Bars Split by Subscription Status, split by subscription status.

Orange = subscribed to the newsletter, Blue = not subscribed to the newsletter

* Panel A shows the distributions of the players' ages.
* Panel B shows the distributions of the players' experience.
* Panel C shows the distributions of the players' hours played in hours.

In [7]:
age_dist = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('age').bin(step=5).title('Age'),
    y = alt.Y('count()').title('Number of Players'),
    color = alt.Color('subscribe').title('Subscribed to newsletter?')
).properties(
    title=['Panel A: Player Age Distribution']
)

time_dist = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('played_hours').bin(step=10).title('Time Played (hrs)'),
    y = alt.Y('count()').title('Number of Players'),
    color = alt.Color('subscribe').title('Subscribed to newsletter?')
).properties(
    title=['Panel B: Player Play Time Distribution']
)

exp_bar = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('experience').title('Experience Level'),
    y = alt.Y('count()').title('Number of Players'),
    color = alt.Color('subscribe').title('Subscribed to newsletter?')
).properties(
    title=['Panel C: Player Experience Distribution']
)


alt.vconcat(alt.hconcat(age_dist, time_dist), alt.hconcat(exp_bar)).properties(title=['Fig 1. Visualisation of Player Observations in the Training Set Bars Split by Subscription Status'])

The counterplots above indicate the number of players that in each age, category of hours played, and exprience level, with the proportion of players subscribed to those who have not subscribed also indicated above. The distribution of `played_hours` is skewed, suggesting the need for feature scaling. A majority of the players in the dataset have agres between 15–25, indicating limited variability. These patterns justify using `StandardScaler` before `KNearestNeighbors`.

Figure 2. Scatterplots are used to explore the relationship between hours played and predictors `age` and `experience`. different variables once again.
Legend: Orange = subscribed to the newsletter, Blue = not subscribed to the newsletter

* `Panel A` shows the relationship between the `Hours Played` vs. `Age` of players.
* `Panel B` shows the relationship between the the `Hours Played` vs. `Experience Level` (`experience_value`) of the players.

In [13]:
age_v_time = alt.Chart(players_train).mark_point().encode(
    y = alt.Y('played_hours').title('Hours Played'),
    x = alt.X('age').title('Age'),
    color = alt.Color('subscribe').title('Subscribed to newsletter?')
).properties(
    title=['Panel A: Plot of Hours Played vs. Age of Players']
)

exp_v_time = alt.Chart(players_train).mark_point().encode(
    y = alt.Y('played_hours').title('Hours Played'),
    x = alt.X('experience_value').title('Level of Experience'),
    color = alt.Color('subscribe').title('Subscribed to newsletter?')
).properties(
    title=['Panel B: Plot of Hours Played vs. Experience Level']
)

alt.hconcat(age_v_time, exp_v_time).properties(title=['Fig 2. Exploratory Plots of the Relationship between Selected Variables'])

The scatterplots show a considerable overlap between players who subscribed and those who had not subscribed, suggesting the use of `KNearestNeighbors`. There is a wide range of `played_hours` compared to that of `age` and `experience_value` encouraging the use of `StandardScaler`.

Next, we assess whether the classes are imbalanced.

In [14]:
players_train['subscribe'].value_counts()

subscribe
True     115
False     41
Name: count, dtype: int64

We then increase the representation of the `False` class through oversampling to balance the dataset. Oversampling the minority class (in this case players that have not subscribed) reduces bias during model training. Without it, KNN would favor predicting the majority (True) class, harming recall for non-subscribers.

In [15]:
not_sub = players_train[players_train['subscribe'] == False]
subbed = players_train[players_train['subscribe'] == True]
not_sub_upsample = not_sub.sample(
    n=subbed.shape[0], replace=True
)
players_train_upsample = pd.concat((not_sub_upsample, subbed))
players_train_upsample["subscribe"].value_counts()

subscribe
False    115
True     115
Name: count, dtype: int64

First, `played_hours`, `age`, and `experience_value` are preprocessed through `StandardScaler` ensuring a mean of 0 and a standard deviation of 1. The preprocessor is then put into the pipeline using `KNeighborsClassifier`.

* KNN Classification because we are trying to find classify whether players will be more likely to subscribe or not based on experience level and hours played.
* No assumptions needed as KNN classification requires few assumptions on what the data should look like.
* A grid search is used to test the values of n_neighbors from 1 to 30 in order for the optimal number of neighbors (k) to be determined for the KNN model.
* A 5-fold cross validation is used to evaluate the model, chosen because the limited dataset. Using more folds would reduce variance but would be more subjuct to bias due to the small data set whereas using fewer folds doees the opposite.
* `GridSearchCV` is used to run the paramater and is then fitted on the unsampled training data, ensuring that the class imbalance had been noticed before training the model.
* Finally, the cross-validation results are saved in a DataFrame, and the top-performing results are displayed using the best five results.

In [16]:
players_preprocessor = make_column_transformer(
    (StandardScaler(), ['played_hours', 'age','experience_value']),
    remainder='passthrough',
    verbose_feature_names_out=False
)

players_pipe = make_pipeline(players_preprocessor, KNeighborsClassifier())

param_grid = {'kneighborsclassifier__n_neighbors': range(1,31,1) }

players_search = GridSearchCV(
    estimator=players_pipe,
    param_grid=param_grid,
    cv=5,
    return_train_score=True,
    n_jobs=-1
)

players_search.fit(
    players_train_upsample[['played_hours', 'age','experience_value']],
    players_train_upsample['subscribe']
)

cv_results = pd.DataFrame(players_search.cv_results_)
cv_results.sort_values(by='rank_test_score').head(5).reset_index()

ValueError: 
All the 150 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
150 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 469, in fit
    Xt = self._fit(X, y, routed_params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 406, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/joblib/memory.py", line 312, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 1310, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 313, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/compose/_column_transformer.py", line 976, in fit_transform
    result = self._call_func_on_transformers(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/compose/_column_transformer.py", line 885, in _call_func_on_transformers
    return Parallel(n_jobs=self.n_jobs)(jobs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 74, in __call__
    return super().__call__(iterable_with_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 136, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/pipeline.py", line 1310, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 313, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1101, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/preprocessing/_data.py", line 878, in fit
    return self.partial_fit(X, y, sample_weight)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/preprocessing/_data.py", line 914, in partial_fit
    X = self._validate_data(
        ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1012, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/_array_api.py", line 751, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/core/generic.py", line 2153, in __array__
    arr = np.asarray(values, dtype=dtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'Beginner'


Figure 3. A Visualization Representing the Cross-validation accuracy of KNN models across values of k from 1 to 30.
Each point represents the calculated mean cross-validation accuracy for a given number of neighbors (k).
The highlighted red point marks the best-performing k value chosen for the final model.

In [None]:
k_point = alt.Chart(cv_results.sort_values(by='rank_test_score').head(1).reset_index()).mark_point(filled = True).encode(
    x = alt.X('param_kneighborsclassifier__n_neighbors'),
    y = alt.Y('mean_test_score'),
    color = alt.Color('param_kneighborsclassifier__n_neighbors:N',title='Best K-Value').scale(scheme="set1")
)

cross_val_plot= alt.Chart(cv_results).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors")
        .title('Number of Neighbors (k)'),
    y=alt.Y('mean_test_score')
        .title('Accuracy Score')
        .scale(zero=False)
).properties(title=['Fig 3. Cross Validation of K-Values from 1-30','Best K Marked in Red'])

cross_val_plot + k_point

Figure 3 indicates a peak of the accuracy score at k = 1 and declines for values of larger k. This demonstrates that less neighbours used here is better than that higher k values could inhibit the performance of the model. The pattern aligns with the scatterplots, reinforcing the choice of KNN.

The confusion matrix was displayed, assessing the effectiveness of our model.

In [17]:
best_k = cv_results.sort_values(by='rank_test_score').reset_index().loc[0,'param_kneighborsclassifier__n_neighbors']
players_spec = make_pipeline(players_preprocessor,KNeighborsClassifier(n_neighbors = best_k))

players_fit = players_spec.fit(players_train_upsample[['played_hours', 'age','experience_value']],players_train_upsample['subscribe'])
players_pred = players_fit.predict(players_test[['played_hours', 'age','experience_value']])
players_eval = players_test.assign(actual=players_test['subscribe'],predicted=players_pred)
players_conf_mat = pd.crosstab(players_eval['actual'], players_eval['predicted'])
players_conf_mat

NameError: name 'cv_results' is not defined

Finally, the accuracy, precision and recall of our model were displayed.

In [18]:
print(classification_report(players_test['subscribe'], players_pred))

NameError: name 'players_pred' is not defined

The model achieved a 68% accuracwhere it performed better on subscribed players compared to non-subscribed players. The low precision for the `False`, the non-subscribed players (0.31) indicates many false positives even after upsampling.

### <u> **Discussion**

Referring back to our research question: 

<u>**Can we predict the whether a player will subscribe to the game newsletter based on their hours played, age and experience level?**

**1. Does age affect whether Players are likely to be subscribed or not?**

The findings from `Figure 1` Panel A (Player Age Distribution) indicate that the most concentrated age group is between 15 and 25 years old. Additionally, `Figure 2` Panel A (Plot of Hours Played vs. Age of Players) shows that younger players generally spend more time playing than older players. Players aged 35 and under make up the majority both in the total number of players and in the total time played. This age range includes more orange bars (indicating a player had subscribed) compared to blue bars (indicating a player had not subscribed), visually suggesting a possible correlation although the model does not demonstrate a strong predictive capability.

suggesting that younger players have a higher subscription rate overall.

**2. Does player experience affect whether they are subscribed or not?**

`Figure 1` Panel C (Player Experience Distribution) suggests that players are fairly evenly distributed across the different experience levels. `Figure 2` Panel B (Plot of Hours Played vs. Experience Level) also shows that experience does not have much impact on the amount of time players spend in the game. Overall, the accuracy (68%) suggests age, hours played, and experience level are not strong predictors of subscription. More revealing variables may not be shown in the data set. The orange bars (indicating a player had subscribed) and blue bars (indicating a player had not subscribed) are also very similar across all experience levels, suggesting that experience does not indicate an association in regards to subscription in this dataset.

For `age`, younger players were hypothesized to have both a higher player count and more hours played compared to older players, and both predictions were correct. We also expected younger players to show a higher subscription rate. These results make sense, as younger individuals typically spend more time gaming and tend to be more invested in it than older players. For `experience`, players with higher ranks were hypothesized to have more hours played and a higher subscription percentage. However, hours played and subscription rates were spread fairly evenly across all experience levels. A potential limitation is that model comparison was not done before and after oversampling, which could potentially reduce scores of the model. 

These findings could help the developers understand what changes might make the game better. Since younger players indicate increased hours played and are more likely to subscribe to the game newsletter, the developers should tailor updates or features to fit them as their target audience. The developers could also use this information to make adjustments, like trying different subscription prices to see what works best, or making the game easier and more enjoyable for older players so they might play more.

These findings can aid the newsletter in marketing its product towards an audience that houses these traits of age and experience, leading to a greater rate of attention from players of the game and, therefore, more substantial rates of subscription. Future research stemming from our findings could include developers might want to look into why younger players subscribe more and what parts of the game are attracting them and how the newsletter might promote its product to players outside of this demographic to recruit subscriptions from players in other age groups and experience levels. Testing other models (like linear regression, or KNeighborsregression) may reveal other patterns overlooked by KNN.

### <u> **References**

Pacific Laboratory for Artificial Intelligence. (n.d.). PLAI group website. Retrieved November 30, 2025, from https://plai.cs.ubc.ca/

VanderPlas, J., Grus, J., & others. (n.d.). Python Data Science Handbook. Retrieved December 4, 2025, from https://python.datasciencebook.ca/index.html