# Steam Game Popularity Modeling

In [529]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

## Data Cleaning

We will clean the table by selecting only the columns that we need for our model.

In [530]:
df = pd.read_csv("../data/a_steam_data_2021_2025.csv")
df = df[['release_year', 'genres', 'categories', 'price', 'developer', 'publisher', 'recommendations']]
df

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations
0,2024,Action;Adventure;Indie;RPG;Strategy,Single-player;Family Sharing,3.99,Bouncy Rocket Studios,Bouncy Rocket Studios,0
1,2025,Casual;Indie;Simulation;Strategy,Single-player;Save Anytime;Family Sharing,7.99,Caramelo Studios,Caramelo Studios,0
2,2025,Adventure;Indie;Strategy,Single-player;Steam Achievements;Full controll...,12.99,Libragames,Libragames,0
3,2023,Action;Adventure;Indie,Single-player;Steam Achievements;Full controll...,24.99,YS Net,ININ,0
4,2021,Action;Adventure;Casual;Indie;RPG;Early Access,Single-player;Steam Achievements;Full controll...,3.99,Wagner Rodrigues,Wagner Rodrigues,0
...,...,...,...,...,...,...,...
65516,2024,Casual;Indie;Simulation,Single-player;Family Sharing,14.99,Inonia Games,Inonia Games,0
65517,2025,Action;Indie;Early Access,Single-player;Family Sharing,4.99,Hard Shark Games,Hard Shark Games,0
65518,2021,Simulation;Education,Single-player;Tracked Controller Support;VR On...,14.99,Xennial Digital,Xennial Digital,0
65519,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,15.99,PoisonGames,PoisonGames,0


`df.info()` below shows no missing values in `recommendations`, so we do not drop any rows. Some predictor fields do have missing values, which we handle in the model rather than by dropping rows. The multi-label fields `genres` and `categories` may be empty, and a missing `developer` or `publisher` is treated as its own category. For now, we only clean the `developer` and `publisher` columns, filling N/A values with a `__MISSING__` placeholder that the model can learn from.

In [531]:
df['developer'] = df['developer'].fillna('__MISSING__')
df['publisher'] = df['publisher'].fillna('__MISSING__')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65521 entries, 0 to 65520
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   release_year     65521 non-null  int64  
 1   genres           65455 non-null  object 
 2   categories       65514 non-null  object 
 3   price            65521 non-null  float64
 4   developer        65521 non-null  object 
 5   publisher        65521 non-null  object 
 6   recommendations  65521 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 3.5+ MB


## Target Construction

First, we create a Series containing all non-zero recommendation counts and find its 0.5 and 0.8 quantiles, which serve as the cutoffs between popularity tiers.

In [532]:
non_zero = df.loc[df["recommendations"] > 0, "recommendations"]
q50 = float(non_zero.quantile(0.50))
q80 = float(non_zero.quantile(0.80))


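We then define our new `popularity` field, with numerical values 0, 1, 2 and 3 corresponding to the tier categories: 0 for games with no recommendations, 1 for non-zero counts below the median, 2 for counts from the median up to the 0.8 quantile, and 3 for counts at or above the 0.8 quantile.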

In [533]:
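# Tier boundaries are half-open so every non-zero count falls in exactly one tier
conditions = [
    (df['recommendations'] > 0) & (df['recommendations'] < q50),     # Tier 1
    (df['recommendations'] >= q50) & (df['recommendations'] < q80),  # Tier 2
    df['recommendations'] >= q80                                     # Tier 3
]

choices = [1, 2, 3]

# Games with zero recommendations match no condition and fall back to Tier 0
df['popularity'] = np.select(conditions, choices, default=0)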
df

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
0,2024,Action;Adventure;Indie;RPG;Strategy,Single-player;Family Sharing,3.99,Bouncy Rocket Studios,Bouncy Rocket Studios,0,0
1,2025,Casual;Indie;Simulation;Strategy,Single-player;Save Anytime;Family Sharing,7.99,Caramelo Studios,Caramelo Studios,0,0
2,2025,Adventure;Indie;Strategy,Single-player;Steam Achievements;Full controll...,12.99,Libragames,Libragames,0,0
3,2023,Action;Adventure;Indie,Single-player;Steam Achievements;Full controll...,24.99,YS Net,ININ,0,0
4,2021,Action;Adventure;Casual;Indie;RPG;Early Access,Single-player;Steam Achievements;Full controll...,3.99,Wagner Rodrigues,Wagner Rodrigues,0,0
...,...,...,...,...,...,...,...,...
65516,2024,Casual;Indie;Simulation,Single-player;Family Sharing,14.99,Inonia Games,Inonia Games,0,0
65517,2025,Action;Indie;Early Access,Single-player;Family Sharing,4.99,Hard Shark Games,Hard Shark Games,0,0
65518,2021,Simulation;Education,Single-player;Tracked Controller Support;VR On...,14.99,Xennial Digital,Xennial Digital,0,0
65519,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,15.99,PoisonGames,PoisonGames,0,0
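To make the tier boundaries concrete, here is a small sketch that applies the same quantile logic to a toy series. The values in `toy_recs` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical recommendation counts (illustrative only, not from the dataset)
toy_recs = pd.Series([0, 0, 5, 10, 20, 40, 80])

# Quantiles are computed over non-zero counts only, as in the cells above
toy_non_zero = toy_recs[toy_recs > 0]
toy_q50 = float(toy_non_zero.quantile(0.50))
toy_q80 = float(toy_non_zero.quantile(0.80))

# Same half-open tier boundaries as the notebook's np.select call
toy_tiers = np.select(
    [
        (toy_recs > 0) & (toy_recs < toy_q50),
        (toy_recs >= toy_q50) & (toy_recs < toy_q80),
        toy_recs >= toy_q80,
    ],
    [1, 2, 3],
    default=0,
)
print(toy_q50, toy_q80, toy_tiers.tolist())
```

Because the quantiles ignore zero counts, Tier 0 never influences where the cutoffs between Tiers 1, 2 and 3 fall.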


## Data Splitting
The data will be split chronologically, with older games used for training and newer ones for testing. A time-based split mirrors how the model would be used in practice, predicting the popularity of games released after the ones it was trained on, and it keeps future information out of the training set. We can inspect the proportion of games in each release year to decide the cutoff.

In [534]:
df['release_year'].value_counts(normalize=True)

release_year
2025    0.306863
2024    0.242029
2023    0.176386
2022    0.146777
2021    0.127944
Name: proportion, dtype: float64

Since games released in 2025 make up roughly 30.7% of all observations, we will use all games from 2021-2024 for training and those from 2025 for testing, giving an approximately 70-30 split.

In [535]:
training = df[df['release_year'] != 2025].copy()
testing = df[df['release_year'] == 2025].copy()
training

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
0,2024,Action;Adventure;Indie;RPG;Strategy,Single-player;Family Sharing,3.99,Bouncy Rocket Studios,Bouncy Rocket Studios,0,0
3,2023,Action;Adventure;Indie,Single-player;Steam Achievements;Full controll...,24.99,YS Net,ININ,0,0
4,2021,Action;Adventure;Casual;Indie;RPG;Early Access,Single-player;Steam Achievements;Full controll...,3.99,Wagner Rodrigues,Wagner Rodrigues,0,0
5,2023,Action;Adventure;Indie,Single-player;Full controller support;Family S...,1.99,Opia Games,Opia Games;Plug In Digital,0,0
11,2024,Action;Adventure;Indie;RPG,Single-player;Steam Achievements;Full controll...,14.99,Sam Enright,United Label,240,1
...,...,...,...,...,...,...,...,...
65515,2023,Casual,Single-player;Family Sharing,6.99,Seven Sails Games,Seven Sails Games,0,0
65516,2024,Casual;Indie;Simulation,Single-player;Family Sharing,14.99,Inonia Games,Inonia Games,0,0
65518,2021,Simulation;Education,Single-player;Tracked Controller Support;VR On...,14.99,Xennial Digital,Xennial Digital,0,0
65519,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,15.99,PoisonGames,PoisonGames,0,0


In [536]:
testing

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
1,2025,Casual;Indie;Simulation;Strategy,Single-player;Save Anytime;Family Sharing,7.99,Caramelo Studios,Caramelo Studios,0,0
2,2025,Adventure;Indie;Strategy,Single-player;Steam Achievements;Full controll...,12.99,Libragames,Libragames,0,0
6,2025,Action;Indie;Free To Play,Single-player;Full controller support;Custom V...,0.00,Taylor Conolley,Taylor Conolley,0,0
7,2025,Action;Indie,Single-player;Steam Achievements;Full controll...,2.99,halvardo13,halvardo13,0,0
8,2025,Adventure;RPG;Strategy,Single-player;Family Sharing,3.99,Oba Games,Oba Games,0,0
...,...,...,...,...,...,...,...,...
65492,2025,Adventure;Casual;Massively Multiplayer;RPG;Sim...,Multi-player;MMO;PvP;Online PvP;Co-op;Online C...,0.00,Vitali Kirpu,Vitali Kirpu,0,0
65494,2025,Strategy,Single-player;Steam Achievements;In-App Purcha...,4.99,SHN Games,SHN Games,0,0
65500,2025,Action;Adventure;Casual;Indie;Simulation,Single-player;Steam Achievements;Full controll...,4.99,Jon Warner Studios,Jon Warner Studios,0,0
65511,2025,Adventure;Casual;Indie,Single-player;Family Sharing,1.99,Total,Total,0,0
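The split can also be wrapped in a small helper that takes the cutoff year as a parameter and reports the resulting test share. This is a sketch on toy data; `chronological_split` and `toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def chronological_split(frame, year_col, test_year):
    """Rows released before `test_year` go to training; rows from `test_year` on go to testing."""
    train = frame[frame[year_col] < test_year].copy()
    test = frame[frame[year_col] >= test_year].copy()
    return train, test

# Toy frame with made-up rows (illustrative only)
toy = pd.DataFrame({
    "release_year": [2021, 2022, 2023, 2024, 2024, 2025, 2025, 2025, 2023, 2022],
    "price": [3.99, 7.99, 12.99, 24.99, 0.0, 4.99, 1.99, 14.99, 9.99, 2.99],
})

toy_train, toy_test = chronological_split(toy, "release_year", test_year=2025)
test_share = len(toy_test) / len(toy)
print(f"train rows: {len(toy_train)}, test rows: {len(toy_test)}, test share: {test_share:.2f}")
```

Using `<` and `>=` instead of `!=` and `==` gives the same result for years 2021-2025 and keeps the split time-ordered if later years were ever added.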


## Feature Engineering and Encoding

In [537]:
training['popularity'].value_counts()

popularity
0    39477
1     2928
2     1772
3     1238
Name: count, dtype: int64

The popularity tiers are clearly imbalanced, with Tier 0 dominating the dataset. This is expected, since many Steam games receive little or no user engagement. Among games that do receive recommendations, the remaining tiers are reasonably balanced due to the quantile-based split.

Rather than increasing the representation of the higher-popularity tiers, this project reduces the number of Tier 0 observations in the training set. Oversampling the higher tiers could introduce noise, particularly given the high-cardinality nature of features such as developer and publisher, which may lead to unstable patterns. Undersampling Tier 0 helps mitigate the imbalance while preserving the relative structure among engaged games.

We will be keeping Tier 0 the same size as all other tiers combined.

In [538]:
tier_0 = training[training['popularity'] == 0]
tier_rest = training[training['popularity'] != 0]

n_keep = len(tier_rest)

tier_0_sampled = tier_0.sample(
    n=n_keep,
    random_state=17
)

# Combining the undersampled tier 0 dataframe and the original dataframe for all other tiers
undersampled_training = pd.concat([tier_0_sampled, tier_rest])

# Shuffle all observations
undersampled_training = undersampled_training.sample(frac=1, random_state=17).reset_index(drop=True)
undersampled_training

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
0,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,9.99,Little Crows Studio,Little Crows Studio,0,0
1,2024,Adventure;Casual;Indie;Free To Play,Single-player,0.00,Ebananas,ebanapury,0,0
2,2023,Casual;Indie,Single-player;Family Sharing,0.99,Sarni Studio,Sarni Studio,0,0
3,2023,Adventure;Casual;Indie,Single-player;Steam Achievements;Steam Trading...,19.99,Bad Seed,Maximum Entertainment,356,2
4,2023,Casual;Indie;Simulation;Strategy,Single-player;Steam Achievements;Steam Trading...,0.99,Watashi Games,Watashi Games,1715,3
...,...,...,...,...,...,...,...,...
11871,2024,Action;Adventure;Casual;RPG;Strategy,Single-player;Steam Achievements;Full controll...,2.99,Soda Game Studio,Soda Game Studio,372,2
11872,2021,Casual;Indie,Single-player;Steam Achievements;Steam Trading...,9.99,Business Goose Studios,Business Goose Studios,2502,3
11873,2023,Indie,Single-player;Family Sharing,1.75,Lewd Leech,Lewd Leech,0,0
11874,2021,Action;Casual;Indie,Single-player;Multi-player;PvP;Shared/Split Sc...,4.99,One-Hand-Free Studios,One-Hand-Free Studios,0,0
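The undersampling step can be expressed as a reusable function. The sketch below is one possible formulation on toy data; `undersample_majority` is a hypothetical helper name, and the `min` guard is an added assumption for the case where the majority class is smaller than the rest.

```python
import pandas as pd

def undersample_majority(frame, label_col, majority_label, random_state=17):
    """Downsample `majority_label` rows to match the combined count of all other labels."""
    majority = frame[frame[label_col] == majority_label]
    rest = frame[frame[label_col] != majority_label]
    # Guard (assumption): never request more rows than the majority class has
    n_keep = min(len(majority), len(rest))
    sampled = majority.sample(n=n_keep, random_state=random_state)
    combined = pd.concat([sampled, rest])
    # Shuffle so that the classes are not grouped in blocks
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy labels (illustrative only): 10 rows in Tier 0, 6 rows across Tiers 1-3
toy = pd.DataFrame({"popularity": [0] * 10 + [1] * 3 + [2] * 2 + [3] * 1})
balanced = undersample_majority(toy, "popularity", majority_label=0)
print(balanced["popularity"].value_counts().sort_index())
```

After this step Tier 0 makes up exactly half of the rows, which is worth keeping in mind when reading baseline accuracies.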


We then split the `genres` and `categories` fields into lists of labels and fit a separate `MultiLabelBinarizer` (from scikit-learn) on each.

In [557]:
mlb_genres = MultiLabelBinarizer()
mlb_categories = MultiLabelBinarizer()

# Create new columns for genres and categories separated into lists
undersampled_training['genres_separated'] = (
    undersampled_training['genres']
    .fillna('')
    .apply(lambda x: [g.strip() for g in x.split(';')] if x else [])
)

undersampled_training['categories_separated'] = (
    undersampled_training['categories']
    .fillna('')
    .apply(lambda x: [g.strip() for g in x.split(';')] if x else [])
)

# Create list of all classes for fitting
genres_list = undersampled_training['genres_separated'].tolist()
categories_list = undersampled_training['categories_separated'].tolist()

mlb_genres.fit(genres_list)
mlb_categories.fit(categories_list)



In [558]:
mlb_genres.classes_

array(['Accounting', 'Action', 'Adventure', 'Animation & Modeling',
       'Audio Production', 'Casual', 'Design & Illustration',
       'Early Access', 'Education', 'Free To Play', 'Game Development',
       'Indie', 'Massively Multiplayer', 'RPG', 'Racing', 'Simulation',
       'Software Training', 'Sports', 'Strategy', 'Utilities'],
      dtype=object)

In [559]:
mlb_categories.classes_

array(['Adjustable Difficulty', 'Adjustable Text Size', 'Camera Comfort',
       'Captions available', 'Chat Speech-to-text', 'Chat Text-to-speech',
       'Co-op', 'Color Alternatives', 'Commentary available',
       'Cross-Platform Multiplayer', 'Custom Volume Controls',
       'Family Sharing', 'Full controller support', 'HDR available',
       'In-App Purchases', 'Includes Source SDK', 'Includes level editor',
       'Keyboard Only Option', 'LAN Co-op', 'LAN PvP', 'MMO',
       'Mouse Only Option', 'Multi-player', 'Narrated Game Menus',
       'Online Co-op', 'Online PvP', 'Partial Controller Support',
       'Playable without Timed Input', 'PvP', 'Remote Play Together',
       'Remote Play on Phone', 'Remote Play on TV',
       'Remote Play on Tablet', 'Save Anytime', 'Shared/Split Screen',
       'Shared/Split Screen Co-op', 'Shared/Split Screen PvP',
       'Single-player', 'Stats', 'Steam Achievements', 'Steam Cloud',
       'Steam Leaderboards', 'Steam Timeline', 'Steam Tradin

We can now use the fitted binarizers to transform the label lists into indicator (0/1) dataframes.

In [560]:
genre_ohe = mlb_genres.transform(
    undersampled_training['genres_separated']
)

genre_df = pd.DataFrame(
    genre_ohe,
    columns=[f'genre_{g}' for g in mlb_genres.classes_],
    index=undersampled_training.index
)

cat_ohe = mlb_categories.transform(
    undersampled_training['categories_separated']
)

cat_df = pd.DataFrame(
    cat_ohe,
    columns=[f'cat_{c}' for c in mlb_categories.classes_],
    index=undersampled_training.index
)
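A fitted binarizer only knows the labels it saw during fitting, which matters when it is later applied to newer games. The sketch below, on made-up labels, shows what happens with an unseen label and with an empty list; `toy_mlb` is a hypothetical name.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

# Fit on toy training labels (illustrative only)
toy_mlb = MultiLabelBinarizer()
toy_mlb.fit([["Action", "Indie"], ["RPG"]])

# "Puzzle" was never seen during fitting; the second row has no labels at all
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    encoded = toy_mlb.transform([["Action", "Puzzle"], []])

print(toy_mlb.classes_)  # learned label order
print(encoded)           # one indicator row per input list
print(f"warnings raised: {len(caught)}")
```

scikit-learn ignores the unknown label with a warning rather than raising, so any genre or category that appears only in later data contributes nothing to the encoded features.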

To encode high-cardinality fields like `developer` and `publisher`, we apply frequency encoding to reduce dimensionality. While this means sacrificing some identity for these categories, it preserves a useful popularity signal, where a more common developer/publisher may receive more engagement (correlating with higher popularity). Frequency encoding is computed independently of the target variable, which aligns well with the project's focus of avoiding data leakage.

In [562]:
dev_counts = undersampled_training['developer'].value_counts(normalize=True)
undersampled_training['developer_freq'] = undersampled_training['developer'].map(dev_counts)

publisher_counts = undersampled_training['publisher'].value_counts(normalize=True)
undersampled_training['publisher_freq'] = undersampled_training['publisher'].map(publisher_counts)

undersampled_training

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity,genres_separated,categories_separated,developer_freq,publisher_freq
0,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,9.99,Little Crows Studio,Little Crows Studio,0,0,"[Action, Indie]","[Single-player, Partial Controller Support, Fa...",0.000084,0.000084
1,2024,Adventure;Casual;Indie;Free To Play,Single-player,0.00,Ebananas,ebanapury,0,0,"[Adventure, Casual, Indie, Free To Play]",[Single-player],0.000084,0.000084
2,2023,Casual;Indie,Single-player;Family Sharing,0.99,Sarni Studio,Sarni Studio,0,0,"[Casual, Indie]","[Single-player, Family Sharing]",0.000084,0.000084
3,2023,Adventure;Casual;Indie,Single-player;Steam Achievements;Steam Trading...,19.99,Bad Seed,Maximum Entertainment,356,2,"[Adventure, Casual, Indie]","[Single-player, Steam Achievements, Steam Trad...",0.000084,0.001010
4,2023,Casual;Indie;Simulation;Strategy,Single-player;Steam Achievements;Steam Trading...,0.99,Watashi Games,Watashi Games,1715,3,"[Casual, Indie, Simulation, Strategy]","[Single-player, Steam Achievements, Steam Trad...",0.000084,0.000084
...,...,...,...,...,...,...,...,...,...,...,...,...
11871,2024,Action;Adventure;Casual;RPG;Strategy,Single-player;Steam Achievements;Full controll...,2.99,Soda Game Studio,Soda Game Studio,372,2,"[Action, Adventure, Casual, RPG, Strategy]","[Single-player, Steam Achievements, Full contr...",0.000084,0.000084
11872,2021,Casual;Indie,Single-player;Steam Achievements;Steam Trading...,9.99,Business Goose Studios,Business Goose Studios,2502,3,"[Casual, Indie]","[Single-player, Steam Achievements, Steam Trad...",0.000084,0.000084
11873,2023,Indie,Single-player;Family Sharing,1.75,Lewd Leech,Lewd Leech,0,0,[Indie],"[Single-player, Family Sharing]",0.000084,0.000084
11874,2021,Action;Casual;Indie,Single-player;Multi-player;PvP;Shared/Split Sc...,4.99,One-Hand-Free Studios,One-Hand-Free Studios,0,0,"[Action, Casual, Indie]","[Single-player, Multi-player, PvP, Shared/Spli...",0.000084,0.000084
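As a minimal illustration of frequency encoding, the sketch below builds the frequency map on a toy training frame and applies it to a toy test frame. The names `toy_train`, `toy_test` and `freq_map` are hypothetical; mapping unseen values to 0 is the same choice the notebook makes later for the test set.

```python
import pandas as pd

# Toy frames (illustrative only)
toy_train = pd.DataFrame({"developer": ["A", "A", "B", "C"]})
toy_test = pd.DataFrame({"developer": ["A", "Z"]})

# Relative frequency of each developer, learned from training rows only
freq_map = toy_train["developer"].value_counts(normalize=True)

toy_train["developer_freq"] = toy_train["developer"].map(freq_map)
# "Z" never appears in training, so map() yields NaN, which we replace with 0
toy_test["developer_freq"] = toy_test["developer"].map(freq_map).fillna(0)

print(toy_train)
print(toy_test)
```

The map is built without looking at the target, so the encoding itself cannot leak label information.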


## Model Selection

The purpose of this section is exploratory: to test several reasonable models and observe how different modeling assumptions affect performance. These experiments help inform the selection of a final model while ensuring that the feature engineering and evaluation setup behave as expected.

In [563]:
# Predictors
X = pd.concat(
    [
        undersampled_training[
            ['release_year', 'price', 'developer_freq', 'publisher_freq']
        ],
        genre_df,
        cat_df
    ],
    axis=1
)
# Target variable
y = undersampled_training['popularity']
X

Unnamed: 0,release_year,price,developer_freq,publisher_freq,genre_Accounting,genre_Action,genre_Adventure,genre_Animation & Modeling,genre_Audio Production,genre_Casual,...,cat_Steam Turn Notifications,cat_Steam Workshop,cat_Stereo Sound,cat_Subtitle Options,cat_Surround Sound,cat_Touch Only Option,cat_Tracked Controller Support,cat_VR Only,cat_VR Support,cat_VR Supported
0,2024,9.99,0.000084,0.000084,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2024,0.00,0.000084,0.000084,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,2023,0.99,0.000084,0.000084,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,2023,19.99,0.000084,0.001010,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,2023,0.99,0.000084,0.000084,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11871,2024,2.99,0.000084,0.000084,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
11872,2021,9.99,0.000084,0.000084,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
11873,2023,1.75,0.000084,0.000084,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11874,2021,4.99,0.000084,0.000084,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


We use a `DummyClassifier` as a baseline to establish naive accuracy and macro-F1 benchmarks for the rest of our classification models.

In [564]:
dummy_accuracy = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, cv = 5, scoring='accuracy')
dummy_f1 = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, cv = 5, scoring='f1_macro')
print(dummy_accuracy)

[0.5        0.50021053 0.50021053 0.49978947 0.49978947]


In [565]:
print(dummy_accuracy.mean())

0.4999999999999999


In [566]:
print(dummy_f1)

[0.16666667 0.16671344 0.16671344 0.16661988 0.16661988]


In [567]:
print(dummy_f1.mean())

0.16666666141376824


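The model's average accuracy is roughly 0.5. This is expected: after undersampling, Tier 0 makes up exactly half of the training data, so always predicting Tier 0 is correct half the time. The average macro-F1 score is very low (roughly 0.17) because the model never predicts Tiers 1-3; their F1 scores are 0, and only Tier 0 contributes to the four-class average.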
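The baseline numbers can be reproduced by hand. When Tier 0 is half the data and is always predicted, its precision is 0.5 and its recall is 1, giving an F1 of 2/3; averaging that with three zeros yields about 0.167. The sketch below checks this on a toy label vector (the class counts are made up, only the one-half share of Tier 0 matters).

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (illustrative only): Tier 0 is exactly half of the rows
y_true = [0] * 6 + [1] * 3 + [2] * 2 + [3] * 1
# A most-frequent baseline always predicts Tier 0
y_pred = [0] * len(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores never-predicted classes as 0 without raising a warning
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(acc, per_class_f1, macro_f1)
```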

Linear Regression was chosen as another baseline model due to its simplicity and the ordinal nature of the popularity labels. 

In [568]:
linear_regression_score = cross_val_score(LinearRegression(), X, y, cv = 5)
print(linear_regression_score)

[0.38077278 0.38079463 0.32529011 0.4069955  0.35868467]


In [569]:
print(linear_regression_score.mean())

0.37050753826382204


As we can see, our Linear Regression model achieved an average R² score of 0.37. The performance is moderate, which shows that there is some linear trend in engagement. However, linear regression is not a proper classifier: it outputs continuous values rather than tier labels, so it cannot be compared directly on accuracy or macro-F1 and does not address the imbalanced classes in `popularity`.
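If one wanted to score a regression baseline with the same metrics as the classifiers, one option is to round its predictions to the nearest tier and clip them to the valid range. This is a sketch of that idea, not something the notebook does; it uses synthetic data from `make_classification`, whose labels have no real ordering, so only the mechanics are meaningful here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# Synthetic 4-class data standing in for the real features (assumption)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=17
)

# Out-of-fold continuous predictions from the regression model
raw = cross_val_predict(LinearRegression(), X_toy, y_toy, cv=5)

# Round to the nearest tier and clip to the valid label range 0..3
tiers = np.clip(np.rint(raw), 0, 3).astype(int)

print(f1_score(y_toy, tiers, average="macro", zero_division=0))
```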

We next tried Logistic Regression, which requires scaling the numeric features for the model to converge properly.

In [574]:
scaler = StandardScaler()
X_scaled = X.copy()

X_scaled[['release_year', 'price', 'developer_freq', 'publisher_freq']] = (
    scaler.fit_transform(
        X[['release_year', 'price', 'developer_freq', 'publisher_freq']]
    )
)

logistic_regression_f1 = cross_val_score(LogisticRegression(max_iter=1000, class_weight='balanced'), X_scaled, y, cv = 5, scoring='f1_macro')
logistic_regression_accuracy = cross_val_score(LogisticRegression(max_iter=1000, class_weight='balanced'), X_scaled, y, cv = 5, scoring='accuracy')
print(logistic_regression_f1)

[0.43538977 0.47469575 0.43817769 0.44757746 0.43798755]


In [575]:
print(logistic_regression_f1.mean())

0.44676564521021545


In [576]:
print(logistic_regression_accuracy)

[0.53914141 0.56968421 0.53305263 0.54610526 0.54273684]


In [577]:
print(logistic_regression_accuracy.mean())

0.546144072301967


Compared to the dummy classifier, logistic regression improved accuracy only modestly (roughly 0.55 versus 0.50). However, it achieved a macro-averaged F1 score of approximately 0.45, far above the baseline's 0.17, without any wild swings across folds. This indicates that the model is able to distinguish between different popularity levels reasonably well, including less frequent but important classes.
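In the cell above, the scaler is fit on all rows before cross-validation, so each validation fold has influenced the scaling statistics. A possible variant, sketched below on synthetic data, places the scaler inside a pipeline so that it is refit on the training portion of every fold. The toy frame and the names `preprocess` and `pipeline` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

numeric_cols = ["release_year", "price", "developer_freq", "publisher_freq"]

# Synthetic stand-in for the real feature matrix (illustrative only)
rng = np.random.default_rng(17)
n = 200
X_toy = pd.DataFrame({
    "release_year": rng.integers(2021, 2025, size=n),
    "price": rng.uniform(0, 30, size=n),
    "developer_freq": rng.uniform(0, 0.01, size=n),
    "publisher_freq": rng.uniform(0, 0.01, size=n),
    "genre_Action": rng.integers(0, 2, size=n),
})
y_toy = np.arange(n) % 4  # four evenly sized toy classes

# Scale only the numeric columns; indicator columns pass through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
pipeline = make_pipeline(
    preprocess,
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# The scaler is now fit separately inside each training fold
scores = cross_val_score(pipeline, X_toy, y_toy, cv=5, scoring="f1_macro")
print(scores.mean())
```

For standardization the effect on scores is usually small, but the pipeline form also makes it harder to forget the scaling step when predicting on new data.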

We also tried two tree-based models: Random Forest and Gradient Boosting.

In [None]:
random_forest_f1 = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=17, class_weight='balanced'), X, y, cv = 5, scoring='f1_macro')
random_forest_accuracy = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=17, class_weight='balanced'), X, y, cv = 5, scoring='accuracy')
print(random_forest_f1)

[0.47833625 0.47733263 0.45671754 0.46035968 0.47017862]


In [506]:
print(random_forest_f1.mean())

0.4685849446385214


In [507]:
print(random_forest_accuracy)

[0.60563973 0.60252632 0.58821053 0.59747368 0.60042105]


In [508]:
print(random_forest_accuracy.mean())

0.5988542619174198


In [509]:
gradient_boosting_f1 = cross_val_score(GradientBoostingClassifier(n_estimators=300, random_state=17), X, y, cv = 5, scoring='f1_macro')
gradient_boosting_accuracy = cross_val_score(GradientBoostingClassifier(n_estimators=300, random_state=17), X, y, cv = 5, scoring='accuracy')
print(gradient_boosting_f1)

[0.47161519 0.48235842 0.47266494 0.46388459 0.47938301]


In [510]:
print(gradient_boosting_f1.mean())

0.4739812294527013


In [511]:
print(gradient_boosting_accuracy)

[0.61531987 0.62063158 0.61178947 0.61305263 0.62231579]


In [512]:
print(gradient_boosting_accuracy.mean())

0.6166218678008152


For both accuracy and macro-F1, gradient boosting is slightly better than random forest, and random forest is slightly better than logistic regression. These gains are small, so they do not guarantee better performance on new data; part of the improvement may be noise. When the differences are this modest, we also need to weigh the cost of the more expensive models. In general, gradient boosting is more prone to overfitting than random forest, which averages many decorrelated trees to reduce variance. <br/> <br/>
Since logistic regression performs comparably and is stable across cross-validation folds, we choose it as our main model. Compared to the two tree-based models, its simpler structure is expected to generalize more reliably to future data, which aligns better with the practical goal of predicting future games' popularity.
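One rough way to judge whether the gaps between models exceed fold-to-fold variation is to compare each model's mean macro-F1 with its spread across folds. The sketch below uses the fold scores printed above; the dictionary name `fold_f1` is hypothetical.

```python
import numpy as np

# Macro-F1 per fold, copied from the cross-validation outputs above
fold_f1 = {
    "logistic_regression": [0.43538977, 0.47469575, 0.43817769, 0.44757746, 0.43798755],
    "random_forest": [0.47833625, 0.47733263, 0.45671754, 0.46035968, 0.47017862],
    "gradient_boosting": [0.47161519, 0.48235842, 0.47266494, 0.46388459, 0.47938301],
}

# Mean and sample standard deviation across the five folds
summary = {
    name: (float(np.mean(scores)), float(np.std(scores, ddof=1)))
    for name, scores in fold_f1.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f}, std={std:.4f}")
```

With only five folds that share training data, these standard deviations are rough estimates, so this is a sanity check rather than a significance test.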

## Model Tuning and Final Evaluation

We will be tuning the hyperparameter `C`, the inverse of the regularization strength, for our model.

In [None]:
# Tune C
for c in [0.1, 1, 10]:
    model = LogisticRegression(
        C=c,
        max_iter=3000,
        class_weight='balanced'
    )
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='f1_macro')
    print(c, scores.mean())

0.1 0.4482256121162232
1 0.44676564521021545
10 0.44643196230517745
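The manual loop above can also be written with scikit-learn's `GridSearchCV`, which runs the same cross-validated comparison and records the best setting. This is a sketch on synthetic data from `make_classification`, so the selected value of `C` here says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for X_scaled and y (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=4, random_state=17
)

search = GridSearchCV(
    LogisticRegression(max_iter=3000, class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10]},  # same candidate values as the loop above
    scoring="f1_macro",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
print(search.cv_results_["mean_test_score"])  # one mean score per candidate C
```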


We can see that the model performs slightly worse with each increase of C. While the difference is very small, it means that the model performs better with less flexibility (stronger regularization). This suggests that the model already captures the available signal at C = 0.1, and that weaker regularization mostly lets it fit noise. Therefore, we choose C = 0.1 in favor of robustness and regularization. <br/> <br/>
We will now train our final model on the entire undersampled training set, use it to predict the test set, and evaluate its performance.

In [None]:
# Encode the test set using the binarizers and frequency maps fitted on the training data
testing['genres_separated'] = (
    testing['genres']
    .fillna('')
    .apply(lambda x: [g.strip() for g in x.split(';')] if x else [])
)

testing['categories_separated'] = (
    testing['categories']
    .fillna('')
    .apply(lambda x: [g.strip() for g in x.split(';')] if x else [])
)

test_genre_ohe = mlb_genres.transform(
    testing['genres_separated']
)

test_genre_df = pd.DataFrame(
    test_genre_ohe,
    columns=[f'genre_{g}' for g in mlb_genres.classes_],
    index=testing.index
)

test_cat_ohe = mlb_categories.transform(
    testing['categories_separated']
)

test_cat_df = pd.DataFrame(
    test_cat_ohe,
    columns=[f'cat_{c}' for c in mlb_categories.classes_],
    index=testing.index
)

# Developers and publishers not seen in training get a frequency of 0
testing['developer_freq'] = testing['developer'].map(dev_counts).fillna(0)
testing['publisher_freq'] = testing['publisher'].map(publisher_counts).fillna(0)

X_test = pd.concat(
    [
        testing[
            ['release_year', 'price', 'developer_freq', 'publisher_freq']
        ],
        test_genre_df,
        test_cat_df
    ],
    axis=1
)
y_test = testing['popularity']

X_test[['release_year', 'price', 'developer_freq', 'publisher_freq']] = (
    scaler.transform(
        X_test[['release_year', 'price', 'developer_freq', 'publisher_freq']]
    )
)



In [None]:
model = LogisticRegression(
    C=0.1,
    max_iter=3000,
    class_weight='balanced'
)
model.fit(X_scaled, y)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.6567193872475878


In [525]:
f1 = f1_score(y_test, y_pred, average = "macro")
print(f1)

0.3245550652562088


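Test accuracy is higher than in cross-validation (0.66 versus roughly 0.55), while macro-F1 is lower (0.32 versus roughly 0.45). Both shifts are expected: the model was trained on an undersampled set in which Tier 0 made up half the data, whereas the test set keeps its original, imbalanced distribution. Accuracy is dominated by the most common tier, while macro-F1 weights every tier equally and is pulled down by the rarer ones.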

## Discussions 


In this context, predicting a game as highly popular when it ultimately performs poorly is considered more costly than the opposite error. A false positive may lead to overestimating success and misallocating effort, whereas a false negative still allows room for iteration and improvement after release. For this reason, the modeling approach is intentionally conservative when identifying high-popularity games.
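Since false positives for the top tier are the costlier error, precision on Tier 3 is a natural number to inspect alongside accuracy and macro-F1. The sketch below derives it from a confusion matrix on made-up labels; `y_true_toy` and `y_pred_toy` are hypothetical and do not come from the model above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (illustrative only)
y_true_toy = [0, 0, 0, 1, 2, 3, 3, 0]
y_pred_toy = [0, 0, 3, 1, 2, 3, 2, 0]

# Rows are true tiers, columns are predicted tiers
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2, 3])

top = 3
predicted_top = cm[:, top].sum()  # games the model called Tier 3
actual_top = cm[top, :].sum()     # games that really are Tier 3
false_positives = predicted_top - cm[top, top]

precision_top = cm[top, top] / predicted_top
recall_top = cm[top, top] / actual_top

print(f"Tier 3 precision={precision_top:.2f}, recall={recall_top:.2f}, "
      f"false positives={false_positives}")
```

A conservative model in the sense described above would aim for high Tier 3 precision, accepting lower recall in exchange.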