## PS4 Game Rating Prediction

Given *data about PS4 games*, let's try to predict the **rating** of a given game.

We will train eleven regression models, from plain linear regression to gradient-boosted trees, and compare them by RMSE and R² on a held-out 30% test split.

Data source: https://www.kaggle.com/datasets/ww1234/ps4-games

### Getting Started

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('games_data.csv')
data

,Unnamed: 0,game,score,leaderbord,gamers,comp_perc,rating,url,min_comp_time,max_comp_time
0,0,A Boy and His Blob,638,2.02,2194,16.5,3.2,https://www.truetrophies.com/game/A-Boy-and-Hi...,15,20
1,1,A Hat in Time,1992,1.53,7062,35.9,4.2,https://www.truetrophies.com/game/A-Hat-in-Tim...,15,20
2,2,A Hero and a Garden,1364,1.01,503,97.6,5.0,https://www.truetrophies.com/game/A-Hero-and-a...,0,1
3,3,A Hero and a Garden (EU),1363,1.01,581,97.8,2.9,https://www.truetrophies.com/game/A-Hero-and-a...,0,1
4,4,A King's Tale: Final Fantasy XV,637,2.02,21914,14.1,3.3,https://www.truetrophies.com/game/A-Kings-Tale...,4,5
...,...,...,...,...,...,...,...,...,...,...
1579,1579,36 Fragments of Midnight,1367,1.06,8472,82.3,2.5,https://www.truetrophies.com/game/36-Fragments...,0,1
1580,1580,36 Fragments of Midnight (Asia),1335,1.03,2131,88.9,2.4,https://www.truetrophies.com/game/36-Fragments...,0,1
1581,1581,36 Fragments of Midnight (EU),1382,1.07,12273,79.2,2.4,https://www.truetrophies.com/game/36-Fragments...,0,1
1582,1582,428: Shibuya Scramble,1943,1.47,916,41.5,4.2,https://www.truetrophies.com/game/428-Shibuya-...,40,50


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1584 entries, 0 to 1583
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     1584 non-null   int64  
 1   game           1584 non-null   object 
 2   score          1584 non-null   int64  
 3   leaderbord     1584 non-null   float64
 4   gamers         1584 non-null   int64  
 5   comp_perc      1584 non-null   float64
 6   rating         1584 non-null   float64
 7   url            1584 non-null   object 
 8   min_comp_time  1584 non-null   int64  
 9   max_comp_time  1584 non-null   int64  
dtypes: float64(3), int64(5), object(2)
memory usage: 123.9+ KB


### Preprocessing

In [20]:
def preprocess_inputs(df):
    df = df.copy()

    # Drop unused columns
    df = df.drop(['Unnamed: 0', 'game', 'url'], axis=1)

    # Split df into X and y
    y = df['rating']
    X = df.drop('rating', axis=1)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

    # Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

    return X_train, X_test, y_train, y_test

In [21]:
X_train, X_test, y_train, y_test = preprocess_inputs(data)

In [22]:
X_train

,score,leaderbord,gamers,comp_perc,min_comp_time,max_comp_time
866,-0.444493,-0.782111,-0.478065,1.350900,-0.579943,-0.349371
1569,0.540038,0.532651,0.159943,-0.989571,0.462750,0.126358
943,0.002955,-0.220598,-0.342023,-0.220680,-0.162866,-0.153483
593,0.672321,0.382001,0.548047,-1.045900,0.810314,0.336238
1348,1.823110,0.902428,5.415034,-1.183906,1.157878,0.476158
...,...,...,...,...,...,...
715,0.939778,1.244814,-0.110192,-1.150108,0.636532,0.196318
905,0.591361,0.354610,1.991005,-1.378241,0.115185,-0.013563
1096,-0.985191,0.245047,-0.382605,-1.200805,-0.301891,-0.223443
235,0.019581,-0.124730,0.152840,-0.305173,0.115185,-0.013563


In [24]:
X_train.describe()

,score,leaderbord,gamers,comp_perc,min_comp_time,max_comp_time
count,1108.0,1108.0,1108.0,1108.0,1108.0,1108.0
mean,-7.695409e-17,-1.667339e-16,-3.2064200000000005e-17,-9.298619e-17,3.2064200000000005e-17,-4.80963e-18
std,1.000452,1.000452,1.000452,1.000452,1.000452,1.000452
min,-1.249756,-0.7958062,-0.559318,-1.38669,-0.5799426,-0.3493712
25%,-0.4302169,-0.727329,-0.5278402,-0.9930913,-0.5799426,-0.3493712
50%,-0.3328119,-0.3849431,-0.4397008,-0.1023887,-0.4061606,-0.279411
75%,0.4328748,0.4230876,0.01541214,1.067143,0.1151855,-0.01356266
max,6.19206,7.613192,6.788548,1.415678,6.371338,9.431051
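
The `std` row reads 1.000452 rather than exactly 1 because `StandardScaler` normalizes with the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With 1108 training rows the ratio between the two is sqrt(1108/1107) ≈ 1.000452. A minimal sketch on a synthetic column (the random data is a stand-in, not the PS4 dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one training column (1108 rows, like X_train)
rng = np.random.default_rng(0)
n = 1108
col = pd.DataFrame({"x": rng.normal(loc=50.0, scale=12.0, size=n)})

# Scale the column the same way preprocess_inputs does
scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=col.columns)

pop_std = scaled["x"].std(ddof=0)     # the quantity StandardScaler sets to 1
sample_std = scaled["x"].std(ddof=1)  # the quantity describe() reports
expected_ratio = np.sqrt(n / (n - 1))

print(round(pop_std, 6), round(sample_std, 6), round(expected_ratio, 6))
```

The small gap is a reporting convention, not a sign that the scaling went wrong.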


### Training

In [25]:
models = {
    "                    Linear Regression": LinearRegression(),
    "Linear Regression (L2 Regularization)": Ridge(),
    "Linear Regression (L1 Regularization)": Lasso(),
    "                  K-Nearest Neighbors": KNeighborsRegressor(),
    "                       Neural Network": MLPRegressor(),
    "                        Decision Tree": DecisionTreeRegressor(),
    "                        Random Forest": RandomForestRegressor(),
    "                    Gradient Boosting": GradientBoostingRegressor(),
    "                              XGBoost": XGBRegressor(),
    "                             LightGBM": LGBMRegressor(),
    "                             CatBoost": CatBoostRegressor(verbose=0)
}

In [26]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                    Linear Regression trained.
Linear Regression (L2 Regularization) trained.
Linear Regression (L1 Regularization) trained.
                  K-Nearest Neighbors trained.
                       Neural Network trained.
                        Decision Tree trained.
                        Random Forest trained.
                    Gradient Boosting trained.
                              XGBoost trained.
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000409 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 968
[LightGBM] [Info] Number of data points in the train set: 1108, number of used features: 6
[LightGBM] [Info] Start training from score 3.087816
                             LightGBM trained.
                             CatBoost trained.


### Results

In [39]:
# RMSE Calculation
for name, model in models.items():
    y_pred = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    print(name + " RMSE: {:.4f}".format(rmse))

                    Linear Regression RMSE: 0.6761
Linear Regression (L2 Regularization) RMSE: 0.6759
Linear Regression (L1 Regularization) RMSE: 0.9657
                  K-Nearest Neighbors RMSE: 0.6672
                       Neural Network RMSE: 0.6346
                        Decision Tree RMSE: 0.8269
                        Random Forest RMSE: 0.6262
                    Gradient Boosting RMSE: 0.6250
                              XGBoost RMSE: 0.6775
                             LightGBM RMSE: 0.6457
                             CatBoost RMSE: 0.6454


In [40]:
# R Squared Calculation
for name, model in models.items():
    y_pred = model.predict(X_test)
    r2 = 1 - np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2)
    print(name + " R^2: {:.5f}".format(r2))

                    Linear Regression R^2: 0.50952
Linear Regression (L2 Regularization) R^2: 0.50971
Linear Regression (L1 Regularization) R^2: -0.00069
                  K-Nearest Neighbors R^2: 0.52226
                       Neural Network R^2: 0.56789
                        Decision Tree R^2: 0.26628
                        Random Forest R^2: 0.57922
                    Gradient Boosting R^2: 0.58082
                              XGBoost R^2: 0.50738
                             LightGBM R^2: 0.55260
                             CatBoost R^2: 0.55300
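
The near-zero R² of the L1 model is what one would expect if `Lasso()` at its default `alpha=1.0` shrank every coefficient to zero, leaving only the intercept (the training mean) as the prediction. That reading is an inference from the scores; the notebook does not inspect the coefficients. The sketch below reproduces the effect on synthetic standardized data. The features, target, and the smaller `alpha=0.01` are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical standardized features and a target with spread below 1,
# loosely mimicking ratings on a 1-5 scale
rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(500, 6)))
y = 3.0 + 0.5 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.6, size=500)

strong = Lasso(alpha=1.0).fit(X, y)   # scikit-learn's default penalty
weak = Lasso(alpha=0.01).fit(X, y)    # arbitrary smaller penalty for comparison

n_strong = np.count_nonzero(strong.coef_)
n_weak = np.count_nonzero(weak.coef_)

print("non-zero coefficients at alpha=1.0: ", n_strong)
print("non-zero coefficients at alpha=0.01:", n_weak)
print("intercept vs. mean of y:", strong.intercept_, y.mean())
```

If the same thing happens on the PS4 data, tuning `alpha` (for example with `LassoCV`) would be a natural next step before judging the L1 model.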


In [27]:
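# `model` is still bound to the last estimator from the loops above (CatBoost),
# so the cells below inspect the CatBoost predictions only.
y_pred = model.predict(X_test)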

y_pred

array([1.77875672, 3.26568723, 2.67256314, 3.68228096, 2.30975216,
       3.66893862, 2.67075812, 3.55132405, 3.60397413, 3.51815395,
       2.62644978, 4.28891891, 2.01399266, 2.11433647, 2.13124308,
       3.27235732, 3.76190423, 4.16685042, 2.28339549, 2.503985  ,
       4.15038152, 3.65169709, 3.27964046, 2.07814323, 3.38782096,
       2.84627584, 3.58829319, 2.92743615, 2.17521461, 3.16364641,
       3.32998075, 3.31140388, 1.61737246, 3.45149446, 3.38829417,
       3.87852305, 3.35838039, 3.6552449 , 4.19640563, 2.69904998,
       2.9329629 , 1.90171377, 3.75761984, 1.55295823, 2.89246373,
       2.02987864, 4.38240761, 2.82041462, 3.20422459, 3.24314073,
       1.61163425, 3.69229073, 3.78351342, 1.60629698, 3.81331547,
       4.35870836, 3.65357621, 3.70788248, 2.49159303, 2.57474375,
       3.637475  , 2.77071183, 2.40817996, 3.83455433, 1.96037906,
       3.94237536, 4.04643822, 3.02258552, 2.77597396, 1.35798279,
       3.24299814, 4.30719327, 3.94637755, 3.44212356, 2.13219

In [29]:
# RMSE of the CatBoost predictions (matches the 0.6454 reported above)
np.sqrt(np.mean((y_test - y_pred)**2))

0.6454163787782786

In [31]:
# Total sum of squares: the denominator of R^2
np.sum((y_test - y_test.mean())**2)

443.5866176470588

In [36]:
r2 = 1 - np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2)

In [37]:
r2

0.5529990133572698

In [38]:
# For regressors, score() returns R^2, so it matches the manual value
model.score(X_test, y_test)

0.5529990133572698
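
Since `score()` and the manual formula agree, the RMSE and R² loops can be folded into one pass built on `sklearn.metrics`. The sketch below is one way to do that. It runs on synthetic data with two stand-in models, and `evaluate` is a hypothetical helper name, not something defined in the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(models, X_test, y_test):
    """Return RMSE and R^2 for each fitted model, sorted by ascending RMSE."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(X_test)
        rows.append({
            "model": name.strip(),  # drop the padding used for aligned printing
            "rmse": np.sqrt(mean_squared_error(y_test, y_pred)),
            "r2": r2_score(y_test, y_pred),
        })
    return pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)


# Synthetic regression problem standing in for the PS4 data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)

demo_models = {
    "    Linear Regression": LinearRegression().fit(X_tr, y_tr),
    "        Decision Tree": DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr),
}

results = evaluate(demo_models, X_te, y_te)
print(results)
```

With the notebook's own objects, the call would be `evaluate(models, X_test, y_test)`, which returns a single table that is easier to sort and compare than two printed lists.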