# Step 3: Machine Learning

As before, we'll read the `cleaned.csv` file and import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("cleaned.csv")

## Numerical Machine Learning

Before we go too crazy, we will start by doing regressions based on the numerical figures available in the dataset.

We start by retrieving the numerical columns, taking care to drop the incremented ID column and to put `owners` on a base-10 logarithmic scale.

In [2]:
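data = data.drop(columns=["Unnamed: 0"])
# Take an explicit copy so the assignment below cannot trigger pandas' SettingWithCopyWarning.
data_num = data.select_dtypes(include=np.number).copy()
data_num["owners"] = np.log10(data_num["owners"])
data_num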

Unnamed: 0,achievements,average_playtime,median_playtime,owners,price,platforms_windows,platforms_mac,platforms_linux,categories_Multi-player,categories_Online Multi-Player,...,genres_Software Training,genres_Sexual Content,genres_Audio Production,genres_Game Development,genres_Photo Editing,genres_Accounting,genres_Documentary,genres_Tutorial,age,positive_ratio
0,0,17612,317,7.176091,7.19,1,1,1,1,1,...,0,0,0,0,0,0,0,0,7000,0.973888
1,0,277,62,6.875061,3.99,1,1,1,1,1,...,0,0,0,0,0,0,0,0,7580,0.839787
2,0,187,34,6.875061,3.99,1,1,1,1,0,...,0,0,0,0,0,0,0,0,6089,0.895648
3,0,258,184,6.875061,3.99,1,1,1,1,1,...,0,0,0,0,0,0,0,0,6788,0.826623
4,0,624,415,6.875061,3.99,1,1,1,1,0,...,0,0,0,0,0,0,0,0,7366,0.947996
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15035,12,0,0,4.000000,7.19,1,0,0,0,0,...,0,0,0,0,0,0,0,0,260,0.714286
15036,7,0,0,4.000000,0.00,1,0,0,0,0,...,0,0,0,0,0,0,0,0,260,0.846939
15037,0,0,0,4.544068,0.00,1,0,0,0,0,...,0,0,0,0,0,0,0,0,265,0.776923
15038,23,0,0,4.000000,6.10,1,0,0,0,0,...,0,0,0,0,0,0,0,0,257,0.733333
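The `owners` values in the table above are base-10 logarithms. As a quick sanity check, here is a minimal sketch using only the standard library. The raw owner counts below are assumptions: they are not shown in this notebook and were inferred by inverting the logged values in the table.

```python
import math

# Assumed raw owner counts (inferred from the logged values, not taken from the data file).
assumed_owners = [15_000_000, 7_500_000, 35_000, 10_000]

# log10 compresses a range spanning three orders of magnitude into roughly 4 to 7.2.
logged = [round(math.log10(n), 6) for n in assumed_owners]
print(logged)
```

If the assumption holds, the printed values line up with the `owners` column of the first and last rows shown above.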


Next, we split the dataset into train and test sets as usual. The random state is fixed to make the results deterministic.

In [3]:
from sklearn.model_selection import train_test_split

data_x = data_num.drop(columns="price")
data_y = data_num[["price"]]
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, random_state=1234567)
print("Train input shape:", x_train.shape)
print("Test input shape:", x_test.shape)

Train input shape: (11280, 67)
Test input shape: (3760, 67)
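The 11280/3760 row counts follow from the default behaviour of `train_test_split`, which holds out 25% of the rows when no `test_size` is given. A small sketch on a placeholder array (the zeros are dummy data, only the row count matches the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array with the same number of rows as the notebook's dataset (11280 + 3760).
dummy = np.zeros((15040, 1))

# With no test_size given, scikit-learn holds out 25% of the rows.
train_part, test_part = train_test_split(dummy, random_state=1234567)
print(train_part.shape, test_part.shape)
```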


Finally, we throw different regressors at the job and observe how well they perform:

In [4]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import permutation_test_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
y_train_arr = y_train_scaled.ravel()
y_test_arr = y_test_scaled.ravel()

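def evaluate(name, model, perm=20):
    """Print the cross-validated R^2 of `model` on the training split and its permutation p-value."""
    print("===", name, "===")
    # permutation_test_score returns (score, permutation_scores, p-value).
    score = permutation_test_score(model, x_train_scaled, y_train_arr, n_permutations=perm)
    print(f"Score: {score[0]:.2f} (p-value {score[2]:.3f})")
    # Smallest p-value that can be reached with `perm` permutations.
    print(f"               Best: {1/(perm+1):.3f}")
    print()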

print()

evaluate("Dummy Regressor (Baseline)", DummyRegressor())
evaluate("Linear Regression", LinearRegression())
evaluate("Decision Tree", DecisionTreeRegressor(random_state=0))
evaluate("Random Forest", RandomForestRegressor(random_state=0, max_samples=0.5), perm=10)
evaluate("Boosting Regressor", HistGradientBoostingRegressor(random_state=0), perm=5)
evaluate("Multi-layer Perceptron", MLPRegressor(random_state=0, solver="adam"), perm=3)


=== Dummy Regressor (Baseline) ===
Score: -0.00 (p-value 0.714)
               Best: 0.048

=== Linear Regression ===
Score: -188456601492828258304.00 (p-value 0.095)
               Best: 0.048

=== Decision Tree ===
Score: -0.04 (p-value 0.048)
               Best: 0.048

=== Random Forest ===
Score: 0.46 (p-value 0.091)
               Best: 0.091

=== Boosting Regressor ===
Score: 0.47 (p-value 0.167)
               Best: 0.167

=== Multi-layer Perceptron ===




Score: 0.26 (p-value 0.250)
               Best: 0.250



1. As a baseline, a dummy regressor and a linear regression were added.

2. As expected, the decision tree immediately overfitted, ending up with a negative cross-validated R^2 (`-0.04`). Random forest, the cousin of the decision tree that is less prone to overfitting, performed somewhat decently with an R^2 of `0.46`.

3. The boosting regressor performed better than random forest with a slightly higher R^2.
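A note on the `p-value` and `Best:` lines in the output: a permutation test refits the model on shuffled targets and counts how often the shuffled runs score at least as well as the real one. The sketch below reproduces that arithmetic with made-up scores (the score lists are hypothetical, only the formula `(C + 1) / (n_permutations + 1)` is the standard one):

```python
def permutation_p_value(true_score, permuted_scores):
    """p-value of a permutation test: (C + 1) / (n_permutations + 1)."""
    at_least_as_good = sum(1 for s in permuted_scores if s >= true_score)
    return (at_least_as_good + 1) / (len(permuted_scores) + 1)

# Hypothetical case: the real model beats every one of 20 shuffled-label runs.
best_case = permutation_p_value(0.46, [-0.01] * 20)

# Hypothetical case: 14 of the 20 shuffled-label runs match or beat the model.
baseline_like = permutation_p_value(0.0, [0.0] * 14 + [-0.1] * 6)

print(f"{best_case:.3f} {baseline_like:.3f}")
```

This is why the runs with only 3 to 10 permutations cannot report a p-value below 0.25 to 0.091, no matter how good the model is: the `Best:` line is the floor set by the number of permutations.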

Let's see if we can do better with a different set of hyperparameters for the MLP.

In [5]:
from sklearn.model_selection import GridSearchCV
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import r2_score

parameter_space = {
    "hidden_layer_sizes": [(50,), (50,50), (25,), (25,25,)], # Have some smaller sizes as we overfitted earlier.
    "activation": ["tanh", "relu"],
    "solver": ["sgd", "adam"],
    "alpha": [0.0001, 0.05],
    "learning_rate": ["constant","adaptive"],
}

# Ignore ConvergenceWarning to avoid spamming the output.
# SGD does not converge with 100 iterations in this dataset.
warnings.filterwarnings(action="ignore", category=ConvergenceWarning)

mlp = MLPRegressor(random_state=0, max_iter=100)
gridSearch = GridSearchCV(mlp, parameter_space, cv=5)
gridSearch.fit(x_train_scaled, y_train_arr)

train_pred = gridSearch.predict(x_train_scaled)
test_pred = gridSearch.predict(x_test_scaled)
print(f"The best parameters were:\n{gridSearch.best_params_}")
print("Score: ", r2_score(y_train_scaled, train_pred))

The best parameters were:
{'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
Score:  0.5151563296318206
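To get a feel for how much work this grid search does, the number of candidates can be counted with `ParameterGrid`. This is a small sketch that reuses the same parameter space as the cell above; the fit count ignores the final refit on the full training set.

```python
from sklearn.model_selection import ParameterGrid

parameter_space = {
    "hidden_layer_sizes": [(50,), (50, 50), (25,), (25, 25)],
    "activation": ["tanh", "relu"],
    "solver": ["sgd", "adam"],
    "alpha": [0.0001, 0.05],
    "learning_rate": ["constant", "adaptive"],
}

# Every combination of values is one candidate model.
n_candidates = len(ParameterGrid(parameter_space))
# Each candidate is fitted once per cross-validation fold (cv=5 in the notebook).
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

Note also that the `Score` printed above is the R^2 on the training data itself, which is not the same thing as the cross-validated score that the grid search uses to pick the best parameters.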


We have `{'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,50), 'learning_rate': 'constant', 'solver': 'sgd'}` as the best parameters for the MLP. Let's run it with extra iterations to make sure it performs the best it can.

In [10]:
warnings.resetwarnings()
mlp = MLPRegressor(random_state=0, activation="relu", solver="sgd", learning_rate="constant", alpha=0.05, hidden_layer_sizes=(50,50), max_iter=1000)
evaluate("MLP Fine Tuned", mlp, perm=2)

=== MLP Fine Tuned ===




Score: 0.27 (p-value 0.333)
               Best: 0.333



As it turns out, overfitting is a major issue in this case: the tuned model reached an R^2 of about 0.52 on the data it was trained on, but its cross-validated score of 0.27 is barely above the 0.26 of the default MLP.
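The gap between a score measured on the training data and a cross-validated score can be illustrated on synthetic data. The sketch below uses made-up noisy data and an unrestricted decision tree rather than the notebook's MLP, purely because the effect is easy to reproduce that way:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, four noise features, noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=300)

# Scoring on the same rows the tree was fitted on rewards memorisation.
tree = DecisionTreeRegressor(random_state=0)
train_r2 = tree.fit(X, y).score(X, y)

# Cross-validation scores each fold with a model that never saw it.
cv_r2 = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5).mean()

print(f"train R^2: {train_r2:.2f}, cross-validated R^2: {cv_r2:.2f}")
```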

## Natural Language Processing

Thus far, we have been ignoring a variable that we kept when we cleaned the data: the detailed description.

Let's try to vectorize it using `TfidfVectorizer`.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
transformed = vectorizer.fit_transform(data["detailed_description"])
print(transformed.shape)
print(vectorizer.get_feature_names_out())

(15040, 54806)
['00' '000' '0000' ... 'zyorzia' 'zypheria' 'zytron']


OK. Maybe not that. The data definitely needs a lot more cleaning if it needs 54806 tokens.

Let's take a page out of the GPT-2 and GPT-3 playbook, attempt byte pair encoding and see how it works out. As a bonus, we replace all the game titles with a special `<|name|>` token, just to make it more generalized.

In [13]:
del vectorizer
del transformed

from bpe import BytePairEncoding

# Fit the BPE to the dataset and show the longest tokens.
enc = BytePairEncoding().fit(data, size=800)
t = enc.token_to_text.copy()
t.sort(key=len, reverse=True)
print(t)

['multiplayer', 'experience', 'characters', 'different', 'adventure', 'character', 'challenge', '<|name|>', 'features', 'gameplay', 'challeng', 'discover', 'complete', 'yourself', 'powerful', 'platform', 'difficul', 'original', 'through', 'players', 'enemies', 'control', 'weapons', 'explore', 'friends', 'against', 'collect', 'develop', 'special', 'puzzles', 'between', 'support', 'environ', 'survive', 'journey', 'destroy', 'player', 'experi', 'differ', 'contro', 'advent', 'friend', 'unique', 'ations', 'battle', 'action', 'levels', 'system', 'comple', 'develo', 'myster', 'acters', 'strate', 'create', "you'll", 'combat', 'skills', 'around', 'design', 'choose', 'become', 'person', 'before', 'unlock', 'attack', 'people', 'custom', 'origin', 'online', 'master', 'puzzle', 'ground', 'search', 'ation', 'world', 'their', 'level', 'every', 'thing', 'other', 'there', 'story', 'power', 'ments', 'where', 'which', 'build', 'fight', 'chall', 'games', 'inter', 'survi', 'cover', 'explo', 'again', 'ition
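For readers unfamiliar with byte pair encoding, the idea is to start from single characters and repeatedly merge the most frequent adjacent pair into a new token. The sketch below is a toy illustration on a made-up word list; it is not the implementation of the `bpe` module imported above, whose source is not shown here, and `learn_bpe` is a hypothetical helper name.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges, corpus

# Made-up mini corpus: "level" occurs 5 times, "levels" twice.
merges, corpus = learn_bpe(["level"] * 5 + ["levels"] * 2, num_merges=4)
print(merges)
print(dict(corpus))
```

After four merges the frequent word has become a single token while the rarer plural is still split into `level` + `s`, which mirrors how the token list above mixes whole words such as `multiplayer` with fragments such as `challeng`.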

Let's transform all the descriptions into token counts and rename the columns so they are slightly more helpful when looking at the data later.

In [14]:
def name(x):
    return "desc_" + enc.token_to_text[x]

count = data.apply(enc.transform_count, axis=1)
count = count.rename(name, axis=1)

count

Unnamed: 0,desc_a,desc_b,desc_c,desc_d,desc_e,desc_f,desc_g,desc_h,desc_i,desc_j,...,desc_ju,desc_ali,desc_tle,desc_destroy,desc_ground,desc_ouse,desc_fac,desc_learn,desc_search,desc_kill
0,0,1,3,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,1,0,1,2,2,0,3,0,...,0,0,0,0,0,0,0,0,0,0
3,3,1,3,2,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,1,1,2,1,0,1,0,...,0,1,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15035,5,1,1,0,2,3,2,1,1,0,...,0,0,0,0,0,1,0,0,0,0
15036,1,0,1,0,3,0,0,2,5,0,...,0,0,0,0,0,0,0,0,0,0
15037,17,8,8,10,6,11,8,3,4,2,...,0,0,0,0,0,0,0,1,1,0
15038,4,1,1,1,4,1,1,0,2,0,...,0,0,0,0,1,0,0,0,0,0


After that, we will be using the `TfidfTransformer` to normalize the data a little bit.

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer

# toarray() returns a plain ndarray; todense() would return the discouraged np.matrix type.
count_tfidf = pd.DataFrame(TfidfTransformer().fit_transform(count).toarray())
count_tfidf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,790,791,792,793,794,795,796,797,798,799
0,0.000000,0.027207,0.075679,0.000000,0.000000,0.025551,0.027315,0.000000,0.000000,0.00000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
1,0.041318,0.000000,0.000000,0.000000,0.025640,0.025075,0.000000,0.000000,0.000000,0.00000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
2,0.066529,0.000000,0.019931,0.000000,0.020643,0.040376,0.043162,0.000000,0.073114,0.00000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
3,0.083627,0.036027,0.100212,0.067177,0.034597,0.067669,0.000000,0.000000,0.000000,0.00000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
4,0.042289,0.000000,0.000000,0.025477,0.026243,0.051328,0.027436,0.000000,0.030983,0.00000,...,0.0,0.075071,0.0,0.0,0.000000,0.000000,0.071898,0.000000,0.071020,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15035,0.037155,0.009604,0.008905,0.000000,0.018446,0.027058,0.019284,0.010242,0.010889,0.00000,...,0.0,0.000000,0.0,0.0,0.000000,0.026145,0.000000,0.000000,0.000000,0.0
15036,0.019560,0.000000,0.023439,0.000000,0.072828,0.000000,0.000000,0.053919,0.143304,0.00000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
15037,0.033940,0.020642,0.019139,0.024056,0.014867,0.026656,0.020724,0.008255,0.011702,0.01047,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.006678,0.006706,0.0
15038,0.031722,0.010250,0.009503,0.009556,0.039371,0.009626,0.010290,0.000000,0.023241,0.00000,...,0.0,0.000000,0.0,0.0,0.026837,0.000000,0.000000,0.000000,0.000000,0.0
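What `TfidfTransformer` does to a count matrix can be seen on a tiny example. The counts below are made up; the sketch relies on the transformer's default settings (smoothed IDF and L2 row normalisation).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up token counts: 3 descriptions (rows) x 3 tokens (columns).
# Token 0 appears in every description, token 1 in only one of them.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]])

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()

# With the default norm="l2", every non-empty row is scaled to unit length.
row_norms = np.linalg.norm(tfidf, axis=1)
print(np.round(tfidf, 3))
print(row_norms)
# Tokens that occur in every document receive the lowest IDF weight.
print(transformer.idf_)
```

The row normalisation is what makes long and short descriptions comparable: row 15037 in the count table has far larger raw counts than its neighbours, yet its TF-IDF values are on the same scale.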


Now, we can finally combine all of this with our numerical data to do some regression.

In [16]:
data_in = pd.concat([data_num, count_tfidf.rename(name, axis=1)], axis=1)
data_in

Unnamed: 0,achievements,average_playtime,median_playtime,owners,price,platforms_windows,platforms_mac,platforms_linux,categories_Multi-player,categories_Online Multi-Player,...,desc_ju,desc_ali,desc_tle,desc_destroy,desc_ground,desc_ouse,desc_fac,desc_learn,desc_search,desc_kill
0,0,17612,317,7.176091,7.19,1,1,1,1,1,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
1,0,277,62,6.875061,3.99,1,1,1,1,1,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
2,0,187,34,6.875061,3.99,1,1,1,1,0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
3,0,258,184,6.875061,3.99,1,1,1,1,1,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
4,0,624,415,6.875061,3.99,1,1,1,1,0,...,0.0,0.075071,0.0,0.0,0.000000,0.000000,0.071898,0.000000,0.071020,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15035,12,0,0,4.000000,7.19,1,0,0,0,0,...,0.0,0.000000,0.0,0.0,0.000000,0.026145,0.000000,0.000000,0.000000,0.0
15036,7,0,0,4.000000,0.00,1,0,0,0,0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
15037,0,0,0,4.544068,0.00,1,0,0,0,0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.006678,0.006706,0.0
15038,23,0,0,4.000000,6.10,1,0,0,0,0,...,0.0,0.000000,0.0,0.0,0.026837,0.000000,0.000000,0.000000,0.000000,0.0
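One thing worth keeping in mind with `pd.concat(..., axis=1)` is that rows are matched by index label, not by position. The sketch below uses made-up frames to show what happens when the indexes do not line up; in the notebook both frames presumably carry the default `0..n-1` index, so the concatenation is safe.

```python
import pandas as pd

left = pd.DataFrame({"price": [7.19, 3.99, 0.00]}, index=[0, 1, 2])
right_aligned = pd.DataFrame({"desc_game": [0.1, 0.0, 0.2]}, index=[0, 1, 2])
right_shifted = pd.DataFrame({"desc_game": [0.1, 0.0, 0.2]}, index=[1, 2, 3])

# Matching indexes: rows are paired up one to one.
ok = pd.concat([left, right_aligned], axis=1)

# Mismatched indexes: the union of labels is kept and the gaps become NaN.
misaligned = pd.concat([left, right_shifted], axis=1)

print(ok.shape, misaligned.shape)
print(int(misaligned.isna().sum().sum()))
```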


Just to get some intuition, let's check the correlation between price and all the variables.

In [17]:
corr = data_in.corr()["price"]
corr[corr.abs() > 0.1].sort_values(key=lambda x: abs(x))

desc_levels                          -0.108471
desc_you                             -0.109229
categories_Co-op                      0.111115
positive_ratio                        0.112798
categories_Steam Achievements         0.116947
desc_game                            -0.117270
genres_Animation & Modeling           0.120740
categories_Multi-player               0.120763
desc_ition                            0.122776
desc_new                              0.124421
categories_Steam Workshop             0.124569
genres_Design & Illustration          0.130832
categories_Full controller support    0.168108
categories_Steam Cloud                0.216740
genres_Casual                        -0.237632
genres_Indie                         -0.255534
genres_Free to Play                  -0.282129
price                                 1.000000
Name: price, dtype: float64

Not looking too good :/

Let's push on with machine learning regardless, but it is likely that the scores will be similar, if not lower.
Just to keep the runtime saner, we will only be choosing columns that have a (relatively) high correlation with price.

The choice of `0.07` as the threshold is totally arbitrary.
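To see how sensitive the feature count is to the threshold, the correlations printed above can be counted against a few cut-offs. This is a standard-library sketch; only columns with an absolute correlation above 0.1 are listed in the output, so nothing below that can be counted from it.

```python
# Absolute correlations with price, copied from the output above (price itself excluded).
abs_corr = {
    "desc_levels": 0.108471,
    "desc_you": 0.109229,
    "categories_Co-op": 0.111115,
    "positive_ratio": 0.112798,
    "categories_Steam Achievements": 0.116947,
    "desc_game": 0.117270,
    "genres_Animation & Modeling": 0.120740,
    "categories_Multi-player": 0.120763,
    "desc_ition": 0.122776,
    "desc_new": 0.124421,
    "categories_Steam Workshop": 0.124569,
    "genres_Design & Illustration": 0.130832,
    "categories_Full controller support": 0.168108,
    "categories_Steam Cloud": 0.216740,
    "genres_Casual": 0.237632,
    "genres_Indie": 0.255534,
    "genres_Free to Play": 0.282129,
}

def n_selected(threshold):
    """Number of listed columns whose |correlation| exceeds the threshold."""
    return sum(1 for value in abs_corr.values() if value > threshold)

for threshold in (0.10, 0.121, 0.20):
    print(threshold, n_selected(threshold))
```

A small change in the cut-off roughly halves the number of surviving columns, which is worth remembering when comparing runs with different thresholds.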

In [18]:
from sklearn.model_selection import train_test_split

data_x = data_in[data_in.columns[corr.abs() > 0.07]].drop(columns="price")
data_y = data_in[["price"]]
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, random_state=1234567)
print("Train input shape:", x_train.shape)
print("Test input shape:", x_test.shape)

Train input shape: (11280, 52)
Test input shape: (3760, 52)


In [19]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
import warnings
from sklearn.exceptions import ConvergenceWarning

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
y_train_arr = y_train_scaled.ravel()
y_test_arr = y_test_scaled.ravel()

print()

evaluate("Dummy Regressor (Baseline)", DummyRegressor())
evaluate("Linear Regression", LinearRegression())
evaluate("Decision Tree", DecisionTreeRegressor(random_state=0))
evaluate("Random Forest", RandomForestRegressor(random_state=0, max_samples=0.5), perm=10)
evaluate("Boosting Regressor", HistGradientBoostingRegressor(random_state=0), perm=5)

# Ignore ConvergenceWarning to avoid spamming the output.
# SGD does not converge in this dataset.
warnings.filterwarnings(action="ignore", category=ConvergenceWarning)
evaluate("Multi-layer Perceptron", MLPRegressor(random_state=0, solver="adam"), perm=3)
evaluate("MLP Prev Optimization", MLPRegressor(
    random_state=0, activation="relu", solver="sgd", learning_rate="constant", alpha=0.05, hidden_layer_sizes=(50, 50), max_iter=100), perm=2)
warnings.resetwarnings()


=== Dummy Regressor (Baseline) ===
Score: -0.00 (p-value 0.714)
               Best: 0.048

=== Linear Regression ===
Score: 0.30 (p-value 0.048)
               Best: 0.048

=== Decision Tree ===
Score: -0.26 (p-value 0.048)
               Best: 0.048

=== Random Forest ===
Score: 0.38 (p-value 0.091)
               Best: 0.091

=== Boosting Regressor ===
Score: 0.39 (p-value 0.167)
               Best: 0.167

=== Multi-layer Perceptron ===
Score: 0.09 (p-value 0.250)
               Best: 0.250

=== MLP Prev Optimization ===
Score: 0.33 (p-value 0.333)
               Best: 0.333



And it seems like the tree-based models have scored lower than before. This may be because the large number of input features causes the models to overfit the training data.

Let's try again with a smaller input.

In [20]:
data_x = data_in[data_in.columns[corr.abs() > 0.121]].drop(columns="price")
data_y = data_in[["price"]]
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, random_state=1234567)
print("Train input shape:", x_train.shape)
print("Test input shape:", x_test.shape)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
y_train_arr = y_train_scaled.ravel()
y_test_arr = y_test_scaled.ravel()

print()

evaluate("Dummy Regressor (Baseline)", DummyRegressor())
evaluate("Linear Regression", LinearRegression())
evaluate("Decision Tree", DecisionTreeRegressor(random_state=0))
evaluate("Random Forest", RandomForestRegressor(random_state=0, max_samples=0.5), perm=10)
evaluate("Boosting Regressor", HistGradientBoostingRegressor(random_state=0), perm=5)

# Ignore ConvergenceWarning to avoid spamming the output.
# SGD does not converge in this dataset.
warnings.filterwarnings(action="ignore", category=ConvergenceWarning)
evaluate("Multi-layer Perceptron", MLPRegressor(random_state=0, solver="adam"), perm=3)
evaluate("MLP Prev Optimization", MLPRegressor(
    random_state=0, activation="relu", solver="sgd", learning_rate="constant", alpha=0.05, hidden_layer_sizes=(50, 50), max_iter=100), perm=2)
warnings.resetwarnings()

Train input shape: (11280, 9)
Test input shape: (3760, 9)

=== Dummy Regressor (Baseline) ===
Score: -0.00 (p-value 0.714)
               Best: 0.048

=== Linear Regression ===
Score: 0.25 (p-value 0.048)
               Best: 0.048

=== Decision Tree ===
Score: -0.04 (p-value 0.048)
               Best: 0.048

=== Random Forest ===
Score: 0.24 (p-value 0.091)
               Best: 0.091

=== Boosting Regressor ===
Score: 0.27 (p-value 0.167)
               Best: 0.167

=== Multi-layer Perceptron ===
Score: 0.30 (p-value 0.250)
               Best: 0.250

=== MLP Prev Optimization ===
Score: 0.30 (p-value 0.333)
               Best: 0.333



Based on the results, that was not strictly the case.

While the score of the default MLP did improve, it still performed worse than the boosting regressor did in the previous run, when it had more input features.

It is likely that the MLP did overfit earlier, but it did not perform too well in this case either. As such, we will be using the boosting regressor from this point onwards, tweaking the hyperparameters and hopefully getting a decent score on the test set.

In [31]:
# Back to the 1st and 2nd feature sets (numerical only, and numerical plus trimmed text).
def try_with(dataset, data_x, data_y):
    x_train, _, y_train, _ = train_test_split(data_x, data_y, random_state=1234567)
    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    y_scaler = StandardScaler()
    y_train_scaled = y_scaler.fit_transform(y_train)
    y_train_arr = y_train_scaled.ravel()
    parameter_space = {
        "loss": ["squared_error", "absolute_error"],
        "learning_rate": [0.05, 0.1, 0.2],
        "max_iter": [50, 100, 200],
    }
    hgbr = HistGradientBoostingRegressor(random_state=0)
    grid_search = GridSearchCV(hgbr, parameter_space, cv=5)
    grid_search.fit(x_train_scaled, y_train_arr)
    print(f"=== {dataset} ===")
    print(f"Best parameters were {grid_search.best_params_}")
    print(f"Score: {grid_search.best_score_:.2f}")
    print()
    return grid_search

data_1_x = data_num.drop(columns="price")
data_1_y = data_num[["price"]]
data_2_x = data_in[data_in.columns[corr.abs() > 0.07]].drop(columns="price")
data_2_y = data_in[["price"]]
grid_1 = try_with("Initial numerical dataset", data_1_x, data_1_y)
grid_2 = try_with("Trimmed text dataset", data_2_x, data_2_y)

=== Initial numerical dataset ===
Best parameters were {'learning_rate': 0.05, 'loss': 'squared_error', 'max_iter': 200}
Score: 0.47

=== Trimmed text dataset ===
Best parameters were {'learning_rate': 0.05, 'loss': 'squared_error', 'max_iter': 100}
Score: 0.40



In the end, the best model (in the scope of this project) was in front of our eyes all along: the gradient boosting regressor on the initial numerical dataset, with half the default learning rate and double the default number of iterations to compensate.

Let's try and run this against our test set which we have been avoiding up until now.

In [23]:
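# Same random_state as before, so this reproduces the held-out split used nowhere so far.
_, x_test, _, y_test = train_test_split(data_1_x, data_1_y, random_state=1234567)
# Caveat: these scalers are fitted on the test split itself rather than reused from the
# training split inside try_with(), so the scaling differs slightly from training.
scaler = StandardScaler()
x_test_scaled = scaler.fit_transform(x_test)
y_scaler = StandardScaler()
y_test_scaled = y_scaler.fit_transform(y_test)
y_test_arr = y_test_scaled.ravel()
print(f"And the final R^2 on the test data set is: {grid_1.score(x_test_scaled, y_test_arr):.2f}")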

And the final R^2 on the test data set is: 0.46


---

## Conclusion

In the end, we went full circle back to the start with the gradient boosting regressor.

It is unfortunate, but our current approach to text processing produced features that did not correlate well with the price; they only added noise and lowered the score.

With `achievements`, `average_playtime`, `median_playtime`, `owners`, `age`, `positive_ratio`, and the one-hot encoded genres and categories, we can only guess the ballpark of the price: the final model reached an R^2 of 0.46 on the test set. One likely reason is that we ignored information on the graphics of the game, which probably contributes significantly to its price.