# Steam Game Popularity Modeling

In [177]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression

## Data Cleaning

We will clean the table by selecting only the columns that we need for our model.

In [178]:
df = pd.read_csv("../data/a_steam_data_2021_2025.csv")
df = df[['release_year', 'genres', 'categories', 'price', 'developer', 'publisher', 'recommendations']]
df

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations
0,2024,Action;Adventure;Indie;RPG;Strategy,Single-player;Family Sharing,3.99,Bouncy Rocket Studios,Bouncy Rocket Studios,0
1,2025,Casual;Indie;Simulation;Strategy,Single-player;Save Anytime;Family Sharing,7.99,Caramelo Studios,Caramelo Studios,0
2,2025,Adventure;Indie;Strategy,Single-player;Steam Achievements;Full controll...,12.99,Libragames,Libragames,0
3,2023,Action;Adventure;Indie,Single-player;Steam Achievements;Full controll...,24.99,YS Net,ININ,0
4,2021,Action;Adventure;Casual;Indie;RPG;Early Access,Single-player;Steam Achievements;Full controll...,3.99,Wagner Rodrigues,Wagner Rodrigues,0
...,...,...,...,...,...,...,...
65516,2024,Casual;Indie;Simulation,Single-player;Family Sharing,14.99,Inonia Games,Inonia Games,0
65517,2025,Action;Indie;Early Access,Single-player;Family Sharing,4.99,Hard Shark Games,Hard Shark Games,0
65518,2021,Simulation;Education,Single-player;Tracked Controller Support;VR On...,14.99,Xennial Digital,Xennial Digital,0
65519,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,15.99,PoisonGames,PoisonGames,0


The `recommendations` column has no missing values, so we do not drop any rows. A few predictor fields do contain missing values, and we handle these in the model rather than discarding the rows: the multi-label fields `genres` and `categories` can simply be empty, and a missing `developer` or `publisher` is treated as its own category. For now, the only cleaning step is to fill the missing `developer` and `publisher` values with a placeholder class, `__MISSING__`, that the model can learn from.

In [179]:
df['developer'] = df['developer'].fillna('__MISSING__')
df['publisher'] = df['publisher'].fillna('__MISSING__')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65521 entries, 0 to 65520
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   release_year     65521 non-null  int64  
 1   genres           65455 non-null  object 
 2   categories       65514 non-null  object 
 3   price            65521 non-null  float64
 4   developer        65521 non-null  object 
 5   publisher        65521 non-null  object 
 6   recommendations  65521 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 3.5+ MB


## Target Construction

First, we collect the non-zero recommendation counts into a new Series and compute its 0.5 and 0.8 quantiles, which serve as the cutoffs between popularity tiers.

In [180]:
non_zero = df.loc[df["recommendations"] > 0, "recommendations"]
q50 = float(non_zero.quantile(0.50))
q80 = float(non_zero.quantile(0.80))


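We then define a new `popularity` field with values 0, 1, 2 and 3: Tier 0 for games with no recommendations, Tier 1 for non-zero counts below the median, Tier 2 for counts from the median up to the 0.8 quantile, and Tier 3 for counts at or above the 0.8 quantile.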

In [181]:
conditions = [
    (df['recommendations'] > 0) & (df['recommendations'] < q50),
    (df['recommendations'] >= q50) & (df['recommendations'] < q80),
    df['recommendations'] >= q80
]

choices = [1, 2, 3]

df['popularity'] = np.select(conditions, choices, default=0)
df

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
0,2024,Action;Adventure;Indie;RPG;Strategy,Single-player;Family Sharing,3.99,Bouncy Rocket Studios,Bouncy Rocket Studios,0,0
1,2025,Casual;Indie;Simulation;Strategy,Single-player;Save Anytime;Family Sharing,7.99,Caramelo Studios,Caramelo Studios,0,0
2,2025,Adventure;Indie;Strategy,Single-player;Steam Achievements;Full controll...,12.99,Libragames,Libragames,0,0
3,2023,Action;Adventure;Indie,Single-player;Steam Achievements;Full controll...,24.99,YS Net,ININ,0,0
4,2021,Action;Adventure;Casual;Indie;RPG;Early Access,Single-player;Steam Achievements;Full controll...,3.99,Wagner Rodrigues,Wagner Rodrigues,0,0
...,...,...,...,...,...,...,...,...
65516,2024,Casual;Indie;Simulation,Single-player;Family Sharing,14.99,Inonia Games,Inonia Games,0,0
65517,2025,Action;Indie;Early Access,Single-player;Family Sharing,4.99,Hard Shark Games,Hard Shark Games,0,0
65518,2021,Simulation;Education,Single-player;Tracked Controller Support;VR On...,14.99,Xennial Digital,Xennial Digital,0,0
65519,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,15.99,PoisonGames,PoisonGames,0,0
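
As a quick sanity check of the tiering rule, the sketch below applies the same quantile and `np.select` logic to a small made-up table. The `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy recommendation counts (illustrative values, not taken from the dataset)
toy = pd.DataFrame({"recommendations": [0, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Quantiles are computed over the non-zero counts only
toy_non_zero = toy.loc[toy["recommendations"] > 0, "recommendations"]
toy_q50 = float(toy_non_zero.quantile(0.50))
toy_q80 = float(toy_non_zero.quantile(0.80))

toy_conditions = [
    (toy["recommendations"] > 0) & (toy["recommendations"] < toy_q50),
    (toy["recommendations"] >= toy_q50) & (toy["recommendations"] < toy_q80),
    toy["recommendations"] >= toy_q80,
]
# Rows matching none of the conditions (zero recommendations) fall into Tier 0
toy["popularity"] = np.select(toy_conditions, [1, 2, 3], default=0)

print(toy_q50, toy_q80)
print(toy["popularity"].value_counts().sort_index())
```

Because the cutoffs come from the non-zero counts only, Tiers 1 to 3 always split the engaged games roughly 50/30/20, regardless of how many games sit in Tier 0.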


## Data Splitting
We split the data chronologically: older games form the training set and newer games the testing set. A time-based split better reflects how the model would be used, since it is trained on past releases and evaluated on later ones. We inspect the proportion of games per release year to choose the cutoff.

In [182]:
df['release_year'].value_counts(normalize=True)

release_year
2025    0.306863
2024    0.242029
2023    0.176386
2022    0.146777
2021    0.127944
Name: proportion, dtype: float64

Since games released in 2025 make up roughly 30.7% of all observations, we use games released in 2021-2024 for training and games released in 2025 for testing, which gives an approximately 70-30 split.

In [183]:
training = df[df['release_year'] != 2025]
testing = df[df['release_year'] == 2025]
training

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
0,2024,Action;Adventure;Indie;RPG;Strategy,Single-player;Family Sharing,3.99,Bouncy Rocket Studios,Bouncy Rocket Studios,0,0
3,2023,Action;Adventure;Indie,Single-player;Steam Achievements;Full controll...,24.99,YS Net,ININ,0,0
4,2021,Action;Adventure;Casual;Indie;RPG;Early Access,Single-player;Steam Achievements;Full controll...,3.99,Wagner Rodrigues,Wagner Rodrigues,0,0
5,2023,Action;Adventure;Indie,Single-player;Full controller support;Family S...,1.99,Opia Games,Opia Games;Plug In Digital,0,0
11,2024,Action;Adventure;Indie;RPG,Single-player;Steam Achievements;Full controll...,14.99,Sam Enright,United Label,240,1
...,...,...,...,...,...,...,...,...
65515,2023,Casual,Single-player;Family Sharing,6.99,Seven Sails Games,Seven Sails Games,0,0
65516,2024,Casual;Indie;Simulation,Single-player;Family Sharing,14.99,Inonia Games,Inonia Games,0,0
65518,2021,Simulation;Education,Single-player;Tracked Controller Support;VR On...,14.99,Xennial Digital,Xennial Digital,0,0
65519,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,15.99,PoisonGames,PoisonGames,0,0


In [184]:
testing

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity
1,2025,Casual;Indie;Simulation;Strategy,Single-player;Save Anytime;Family Sharing,7.99,Caramelo Studios,Caramelo Studios,0,0
2,2025,Adventure;Indie;Strategy,Single-player;Steam Achievements;Full controll...,12.99,Libragames,Libragames,0,0
6,2025,Action;Indie;Free To Play,Single-player;Full controller support;Custom V...,0.00,Taylor Conolley,Taylor Conolley,0,0
7,2025,Action;Indie,Single-player;Steam Achievements;Full controll...,2.99,halvardo13,halvardo13,0,0
8,2025,Adventure;RPG;Strategy,Single-player;Family Sharing,3.99,Oba Games,Oba Games,0,0
...,...,...,...,...,...,...,...,...
65492,2025,Adventure;Casual;Massively Multiplayer;RPG;Sim...,Multi-player;MMO;PvP;Online PvP;Co-op;Online C...,0.00,Vitali Kirpu,Vitali Kirpu,0,0
65494,2025,Strategy,Single-player;Steam Achievements;In-App Purcha...,4.99,SHN Games,SHN Games,0,0
65500,2025,Action;Adventure;Casual;Indie;Simulation,Single-player;Steam Achievements;Full controll...,4.99,Jon Warner Studios,Jon Warner Studios,0,0
65511,2025,Adventure;Casual;Indie,Single-player;Family Sharing,1.99,Total,Total,0,0
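
The year-based split above can also be wrapped in a small helper. This is a minimal sketch: `chronological_split` is a hypothetical name, the toy rows are made up, and it uses `<` and `>=` comparisons, which match the `!= 2025` / `== 2025` filters only because 2025 is the latest year in the data. Returning copies avoids pandas chained-assignment warnings if columns are added to either part later.

```python
import pandas as pd

def chronological_split(frame, year_col, test_year):
    """Return (train, test): rows released before `test_year`, and rows in or after it."""
    train = frame[frame[year_col] < test_year].copy()
    test = frame[frame[year_col] >= test_year].copy()
    return train, test

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "release_year": [2021, 2022, 2023, 2024, 2025, 2025],
    "price": [3.99, 7.99, 12.99, 24.99, 0.0, 4.99],
})
toy_train, toy_test = chronological_split(toy, "release_year", 2025)

# Share of rows in each part
print(len(toy_train) / len(toy), len(toy_test) / len(toy))
```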


## Feature Engineering and Encoding

In [185]:
training['popularity'].value_counts()

popularity
0    39477
1     2928
2     1772
3     1238
Name: count, dtype: int64

The popularity tiers are clearly imbalanced, with Tier 0 dominating the dataset. This is expected, since many Steam games receive little or no user engagement. Among games that do receive recommendations, the remaining tiers are reasonably balanced due to the quantile-based split.

Rather than oversampling the higher-popularity tiers, we undersample Tier 0 in the training set. Oversampling the higher tiers could introduce noise, particularly given high-cardinality features such as `developer` and `publisher`, where duplicated rows may produce unstable patterns. Undersampling Tier 0 mitigates the imbalance while preserving the relative structure among engaged games.

We keep Tier 0 the same size as all other tiers combined (2,928 + 1,772 + 1,238 = 5,938 rows).

In [191]:
tier_0 = training[training['popularity'] == 0]
tier_rest = training[training['popularity'] != 0]

n_keep = len(tier_rest)

tier_0_sampled = tier_0.sample(
    n=n_keep,
    random_state=17
)

# Combining the undersampled tier 0 dataframe and the original dataframe for all other tiers
undersampled_training = pd.concat([tier_0_sampled, tier_rest])

# Shuffle all observations
undersampled_training = undersampled_training.sample(frac=1, random_state=17).reset_index(drop=True)
undersampled_training

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity,developer_freq,publisher_freq
0,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,9.99,Little Crows Studio,Little Crows Studio,0,0,0.000022,0.000022
1,2024,Adventure;Casual;Indie;Free To Play,Single-player,0.00,Ebananas,ebanapury,0,0,0.000022,0.000022
2,2023,Casual;Indie,Single-player;Family Sharing,0.99,Sarni Studio,Sarni Studio,0,0,0.000022,0.000022
3,2023,Adventure;Casual;Indie,Single-player;Steam Achievements;Steam Trading...,19.99,Bad Seed,Maximum Entertainment,356,2,0.000022,0.000418
4,2023,Casual;Indie;Simulation;Strategy,Single-player;Steam Achievements;Steam Trading...,0.99,Watashi Games,Watashi Games,1715,3,0.000022,0.000022
...,...,...,...,...,...,...,...,...,...,...
11871,2024,Action;Adventure;Casual;RPG;Strategy,Single-player;Steam Achievements;Full controll...,2.99,Soda Game Studio,Soda Game Studio,372,2,0.000022,0.000022
11872,2021,Casual;Indie,Single-player;Steam Achievements;Steam Trading...,9.99,Business Goose Studios,Business Goose Studios,2502,3,0.000022,0.000022
11873,2023,Indie,Single-player;Family Sharing,1.75,Lewd Leech,Lewd Leech,0,0,0.000022,0.000022
11874,2021,Action;Casual;Indie,Single-player;Multi-player;PvP;Shared/Split Sc...,4.99,One-Hand-Free Studios,One-Hand-Free Studios,0,0,0.000044,0.000044


We then split the `genres` and `categories` strings into lists of labels and fit a separate `MultiLabelBinarizer` (from scikit-learn) to each field.

In [192]:
mlb_genres = MultiLabelBinarizer()
mlb_categories = MultiLabelBinarizer()

# Create new columns for genres and categories separated into lists
undersampled_training['genres_separated'] = (
    undersampled_training['genres']
    .fillna('')
    .apply(lambda x: [g.strip() for g in x.split(';')] if x else [])
)

undersampled_training['categories_separated'] = (
    undersampled_training['categories']
    .fillna('')
    .apply(lambda x: [g.strip() for g in x.split(';')] if x else [])
)

# Create list of all classes for fitting
genres_list = undersampled_training['genres_separated'].tolist()
categories_list = undersampled_training['categories_separated'].tolist()

mlb_genres.fit(genres_list)
mlb_categories.fit(categories_list)

MultiLabelBinarizer()


In [188]:
mlb_genres.classes_

array(['Accounting', 'Action', 'Adventure', 'Animation & Modeling',
       'Audio Production', 'Casual', 'Design & Illustration',
       'Early Access', 'Education', 'Free To Play', 'Game Development',
       'Indie', 'Massively Multiplayer', 'RPG', 'Racing', 'Simulation',
       'Software Training', 'Sports', 'Strategy', 'Utilities'],
      dtype=object)
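
Once a binarizer is fitted, its `transform` method turns each label list into a row of 0/1 indicators. The sketch below shows this on made-up data, including how a label that was not seen during fitting is handled. The helper `split_labels`, the toy frames, and the `genre_` column prefix are assumptions for illustration, not part of the notebook.

```python
import warnings
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def split_labels(series):
    """Turn 'A;B;C' strings into lists; missing or empty values become []."""
    return series.fillna("").apply(
        lambda x: [label.strip() for label in x.split(";")] if x else []
    )

# Toy data (illustrative only); 'Racing' appears only in the test rows
toy_train = pd.DataFrame({"genres": ["Action;Indie", "Casual", None]})
toy_test = pd.DataFrame({"genres": ["Indie;Racing", "Action"]})

toy_mlb = MultiLabelBinarizer()
train_matrix = toy_mlb.fit_transform(split_labels(toy_train["genres"]))

# Labels unseen during fitting are ignored by transform, with a UserWarning
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    test_matrix = toy_mlb.transform(split_labels(toy_test["genres"]))

columns = [f"genre_{c}" for c in toy_mlb.classes_]
train_genres = pd.DataFrame(train_matrix, columns=columns)
test_genres = pd.DataFrame(test_matrix, columns=columns)
print(train_genres)
print(test_genres)
```

Fitting on the training rows only and reusing the fitted binarizer on the testing rows keeps the feature columns identical across both sets.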

In [189]:
mlb_categories.classes_

array(['Adjustable Difficulty', 'Adjustable Text Size', 'Camera Comfort',
       'Captions available', 'Chat Speech-to-text', 'Chat Text-to-speech',
       'Co-op', 'Color Alternatives', 'Commentary available',
       'Cross-Platform Multiplayer', 'Custom Volume Controls',
       'Family Sharing', 'Full controller support', 'HDR available',
       'In-App Purchases', 'Includes Source SDK', 'Includes level editor',
       'Keyboard Only Option', 'LAN Co-op', 'LAN PvP', 'MMO',
       'Mouse Only Option', 'Multi-player', 'Narrated Game Menus',
       'Online Co-op', 'Online PvP', 'Partial Controller Support',
       'Playable without Timed Input', 'PvP', 'Remote Play Together',
       'Remote Play on Phone', 'Remote Play on TV',
       'Remote Play on Tablet', 'Save Anytime', 'Shared/Split Screen',
       'Shared/Split Screen Co-op', 'Shared/Split Screen PvP',
       'Single-player', 'Stats', 'Steam Achievements', 'Steam Cloud',
       'Steam Leaderboards', 'Steam Timeline', 'Steam Tradin

To encode high-cardinality fields like `developer` and `publisher`, we apply frequency encoding, which replaces each category with its relative frequency in the training data and so yields a single numeric column per field. This sacrifices the identity of individual developers and publishers, but it preserves a potentially useful signal: a developer or publisher with more releases may receive more engagement and therefore reach a higher popularity tier. Because frequency encoding is computed without reference to the target variable, it also fits the project's goal of avoiding data leakage.

In [200]:
dev_counts = undersampled_training['developer'].value_counts(normalize=True)
undersampled_training['developer_freq'] = undersampled_training['developer'].map(dev_counts)

publisher_counts = undersampled_training['publisher'].value_counts(normalize=True)
undersampled_training['publisher_freq'] = undersampled_training['publisher'].map(publisher_counts)

undersampled_training

Unnamed: 0,release_year,genres,categories,price,developer,publisher,recommendations,popularity,developer_freq,publisher_freq,genres_separated,categories_separated
0,2024,Action;Indie,Single-player;Partial Controller Support;Famil...,9.99,Little Crows Studio,Little Crows Studio,0,0,0.000084,0.000084,"[Action, Indie]","[Single-player, Partial Controller Support, Fa..."
1,2024,Adventure;Casual;Indie;Free To Play,Single-player,0.00,Ebananas,ebanapury,0,0,0.000084,0.000084,"[Adventure, Casual, Indie, Free To Play]",[Single-player]
2,2023,Casual;Indie,Single-player;Family Sharing,0.99,Sarni Studio,Sarni Studio,0,0,0.000084,0.000084,"[Casual, Indie]","[Single-player, Family Sharing]"
3,2023,Adventure;Casual;Indie,Single-player;Steam Achievements;Steam Trading...,19.99,Bad Seed,Maximum Entertainment,356,2,0.000084,0.001010,"[Adventure, Casual, Indie]","[Single-player, Steam Achievements, Steam Trad..."
4,2023,Casual;Indie;Simulation;Strategy,Single-player;Steam Achievements;Steam Trading...,0.99,Watashi Games,Watashi Games,1715,3,0.000084,0.000084,"[Casual, Indie, Simulation, Strategy]","[Single-player, Steam Achievements, Steam Trad..."
...,...,...,...,...,...,...,...,...,...,...,...,...
11871,2024,Action;Adventure;Casual;RPG;Strategy,Single-player;Steam Achievements;Full controll...,2.99,Soda Game Studio,Soda Game Studio,372,2,0.000084,0.000084,"[Action, Adventure, Casual, RPG, Strategy]","[Single-player, Steam Achievements, Full contr..."
11872,2021,Casual;Indie,Single-player;Steam Achievements;Steam Trading...,9.99,Business Goose Studios,Business Goose Studios,2502,3,0.000084,0.000084,"[Casual, Indie]","[Single-player, Steam Achievements, Steam Trad..."
11873,2023,Indie,Single-player;Family Sharing,1.75,Lewd Leech,Lewd Leech,0,0,0.000084,0.000084,[Indie],"[Single-player, Family Sharing]"
11874,2021,Action;Casual;Indie,Single-player;Multi-player;PvP;Shared/Split Sc...,4.99,One-Hand-Free Studios,One-Hand-Free Studios,0,0,0.000084,0.000084,"[Action, Casual, Indie]","[Single-player, Multi-player, PvP, Shared/Spli..."
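
When the same encoding is applied to the testing set, the frequencies should come from the training rows, and some developers or publishers will not have been seen before. The sketch below illustrates one way to handle that on made-up data. Mapping unseen names to a frequency of 0 is an assumption of this sketch, not a choice stated in the notebook.

```python
import pandas as pd

# Toy data (illustrative only); developer "C" never appears in training
toy_train = pd.DataFrame({"developer": ["A", "A", "B", "__MISSING__"]})
toy_test = pd.DataFrame({"developer": ["A", "C"]})

# Relative frequencies are learned from the training rows only
toy_dev_counts = toy_train["developer"].value_counts(normalize=True)

toy_train["developer_freq"] = toy_train["developer"].map(toy_dev_counts)

# Unseen developers map to NaN; filling with 0.0 is an assumption of this sketch
toy_test["developer_freq"] = toy_test["developer"].map(toy_dev_counts).fillna(0.0)

print(toy_train)
print(toy_test)
```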


## Logistic Regression


In this context, predicting that a game will be highly popular when it ultimately performs poorly is considered more costly than the opposite error. A false positive may lead to overestimating success and misallocating effort, whereas a false negative still leaves room for iteration and improvement after release. For this reason, the modeling approach is intentionally conservative when identifying high-popularity games.
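
One possible way to make a classifier conservative about the top tier is to require a minimum predicted probability before assigning Tier 3. The sketch below is an assumption-laden illustration rather than the notebook's method: the data is synthetic, and the helper `conservative_predict` and its `threshold=0.7` are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(17)

# Synthetic stand-in for the encoded feature matrix and the tier labels (0-3)
X = rng.normal(size=(400, 5))
y = np.clip(np.round(X[:, 0] + X[:, 1] + 1.5), 0, 3).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

def conservative_predict(model, X, top_class=3, threshold=0.7):
    """Predict `top_class` only when its probability reaches `threshold`.

    Otherwise fall back to the most probable of the remaining classes.
    """
    proba = model.predict_proba(X)
    top_idx = list(model.classes_).index(top_class)
    confident_top = proba[:, top_idx] >= threshold
    rest = proba.copy()
    rest[:, top_idx] = -1.0  # exclude the top class from the fallback argmax
    fallback = model.classes_[rest.argmax(axis=1)]
    return np.where(confident_top, top_class, fallback)

default_pred = model.predict(X)
conservative_pred = conservative_predict(model, X)

# Number of Tier 3 predictions under each rule
print((default_pred == 3).sum(), (conservative_pred == 3).sum())
```

With a threshold above 0.5, every game labelled Tier 3 by this rule would also be labelled Tier 3 by the default `predict`, so the rule can only remove top-tier predictions. That trades some recall on Tier 3 for fewer false positives.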