
**<h1>Introduction</h1>**



**<h3> Motivation </h3>**

The video game industry is a multi-billion dollar market, and Steam stands as its most dominant PC storefront. As a lifelong gamer, I have always been fascinated by the sheer volume of titles released each year, ranging from solo-developed indie gems to massive AAA titles.

However, the marketplace is crowded. For every masterpiece, thousands of games go unnoticed. This project aims to move beyond subjective opinion and use data to understand what drives success on the platform. By analyzing pricing strategies, genre trends, and review sentiment, I aim to uncover the quantitative patterns that separate a hit from a flop.

**<h3> Key Objectives: </h3>**
- Identify correlations between Genre and User Ratings.
- Analyse the pricing strategy differences between Indie and AAA titles.
- Determine if lower price points correlate with higher ratings.
- Predict whether a game will be a "hit" or a "flop" using XGBoost.

**<h3> Data Source: </h3>**
The dataset was created by Martin Bustos and is published on Kaggle. The data is collected from both the Steam API and Steam Spy, a Steam stats service based on the Web API provided by Valve. <br>
https://www.kaggle.com/datasets/fronkongames/steam-games-dataset/data

**<h3> Attributes / Features: </h3>**
**AppID**: Unique identifier for each game on the Steam store.

**Name**: The game's title.

**Release date**: The date the game was released on Steam.

**Estimated owners**: A range estimating the number of people who own the game (e.g., "0 - 20000", "20000 - 50000").

**Peak CCU**: The peak number of Concurrent Users (players) in the game on the previous day.

**Required age**: The minimum age required to play the game (0 indicates all ages).

**Price**: The current price of the game in US Dollars.

**Discount**: The number of discounts the game has experienced.

**DLC count**: The number of downloadable content (DLC) packs available for the game. 

**About the game**: A brief text description or summary of the game's content.

**Supported languages**: A list of all languages supported by the game (interface, audio, or subtitles).

**Full audio languages**: A list of languages that have full audio support (voice acting).

**Reviews**: A text summary of the user review score (e.g., "Overwhelmingly Positive", "Mixed").

**Header image**: The URL to the game's main header image/banner on the store.

**Website**: The URL to the game's official website.

**Support url**: The URL for the game's customer support page.

**Support email**: The contact email address for game support.

**Windows**: True if the game is compatible with Windows, False otherwise.

**Mac**: True if the game is compatible with macOS, False otherwise.

**Linux**: True if the game is compatible with Linux, False otherwise.

**Metacritic score**: The game's critic score from Metacritic (0 if not available).

**Metacritic url**: The URL to the game's Metacritic page.

**User score**: A score calculated based on user reviews (0 if not available or insufficient data).

**Positive**: The total count of positive user reviews.

**Negative**: The total count of negative user reviews.

**Score rank**: A ranking of the game based on its user score compared to others (often null for most games).

**Achievements**: The total number of in-game achievements available.

**Recommendations**: The number of times users have recommended this game.

**Notes**: Extra information or warnings about the game content.

**Average playtime forever**: The average total playtime for all owners of the game (in minutes).

**Average playtime two weeks**: The average playtime in the last two weeks (in minutes).

**Median playtime forever**: The median total playtime for owners of the game (in minutes).

**Median playtime two weeks**: The median playtime in the last two weeks (in minutes).

**Developers**: The name of the studio or individual who created the game.

**Publishers**: The name of the company that published the game.

**Categories**: A list of game categories (e.g., "Single-player", "Multi-player", "Co-op").

**Genres**: A list of genres the game belongs to (e.g., "Action", "Indie", "RPG").

**Tags**: User-defined tags that describe the game (e.g., "Pixel Art", "Difficult").

**Screenshots**: URLs to screenshots of the gameplay.

**Movies**: URLs to trailers or promotional videos.
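
The 'Estimated owners' attribute above is stored as a text range rather than a number, so it usually needs parsing before it can be used numerically. The snippet below is a minimal sketch of one way to do that on a toy frame, assuming every value follows the "low - high" format quoted in the attribute list. The column names `owners_low`, `owners_high`, and `owners_mid` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy frame; the range strings follow the format quoted in the attribute list.
owners = pd.DataFrame({"Estimated owners": ["0 - 20000", "20000 - 50000"]})

# Split "low - high" into two integer columns (hypothetical column names).
bounds = owners["Estimated owners"].str.split(" - ", expand=True).astype(int)
owners["owners_low"] = bounds[0]
owners["owners_high"] = bounds[1]

# The midpoint is one rough way to get a single numeric proxy for ownership.
owners["owners_mid"] = (owners["owners_low"] + owners["owners_high"]) / 2
print(owners)
```

The midpoint is only a convenience; the lower bound or an ordinal encoding of the ranges would be equally defensible choices.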

**<h2> 1. Data Preprocessing </h2>**

First, let's install XGBoost and import the libraries that we'll need for the Exploratory Data Analysis (EDA) and the model.

In [110]:
%pip install xgboost




In [111]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For our model later on.
import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

Let's read the dataset with Pandas.

In [None]:
steam_games = pd.read_csv('../data/games.csv')

Now, let's figure out how many rows and columns this dataset has.

In [113]:
shape = steam_games.shape
print(f"The shape of the dataset is: {shape}")

The shape of the dataset is: (111452, 40)


Taking a peek at the top 5 rows:

In [114]:
steam_games.head()

| | AppID | Name | Release date | Estimated owners | Peak CCU | Required age | Price | Discount | DLC count | About the game | ... | Average playtime two weeks | Median playtime forever | Median playtime two weeks | Developers | Publishers | Categories | Genres | Tags | Screenshots | Movies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20200 | Galactic Bowling | Oct 21, 2008 | 0 - 20000 | 0 | 0 | 19.99 | 0 | 0 | Galactic Bowling is an exaggerated and stylize... | ... | 0 | 0 | 0 | Perpetual FX Creative | Perpetual FX Creative | Single-player,Multi-player,Steam Achievements,... | Casual,Indie,Sports | Indie,Casual,Sports,Bowling | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 1 | 655370 | Train Bandit | Oct 12, 2017 | 0 - 20000 | 0 | 0 | 0.99 | 0 | 0 | THE LAW!! Looks to be a showdown atop a train.... | ... | 0 | 0 | 0 | Rusty Moyher | Wild Rooster | Single-player,Steam Achievements,Full controll... | Action,Indie | Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 2 | 1732930 | Jolt Project | Nov 17, 2021 | 0 - 20000 | 0 | 0 | 4.99 | 0 | 0 | Jolt Project: The army now has a new robotics ... | ... | 0 | 0 | 0 | Campião Games | Campião Games | Single-player | Action,Adventure,Indie,Strategy | NaN | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 3 | 1355720 | Henosis™ | Jul 23, 2020 | 0 - 20000 | 0 | 0 | 5.99 | 0 | 0 | HENOSIS™ is a mysterious 2D Platform Puzzler w... | ... | 0 | 0 | 0 | Odd Critter Games | Odd Critter Games | Single-player,Full controller support | Adventure,Casual,Indie | 2D Platformer,Atmospheric,Surreal,Mystery,Puzz... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 4 | 1139950 | Two Weeks in Painland | Feb 3, 2020 | 0 - 20000 | 0 | 0 | 0.0 | 0 | 0 | ABOUT THE GAME Play as a hacker who has arrang... | ... | 0 | 0 | 0 | Unusual Games | Unusual Games | Single-player,Steam Achievements | Adventure,Indie | Indie,Adventure,Nudity,Violent,Sexual Content,... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |


Next, let's list all the columns present in the dataset to see the attributes that we're working with.

In [115]:
steam_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111452 entries, 0 to 111451
Data columns (total 40 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111452 non-null  int64  
 1   Name                        111446 non-null  object 
 2   Release date                111452 non-null  object 
 3   Estimated owners            111452 non-null  object 
 4   Peak CCU                    111452 non-null  int64  
 5   Required age                111452 non-null  int64  
 6   Price                       111452 non-null  float64
 7   Discount                    111452 non-null  int64  
 8   DLC count                   111452 non-null  int64  
 9   About the game              104969 non-null  object 
 10  Supported languages         111452 non-null  object 
 11  Full audio languages        111452 non-null  object 
 12  Reviews                     10624 non-null   object 
 13  Header image  
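
The `info()` output lists 'Release date' with dtype `object`, meaning the dates are stored as plain strings. If date arithmetic is needed later, they could be converted along the lines of the sketch below. It assumes the "Oct 21, 2008" format seen in the preview rows; the third value is a made-up bad entry to show how `errors="coerce"` turns unparseable strings into `NaT` instead of raising.

```python
import pandas as pd

# Two dates copied from the preview rows plus one made-up invalid entry.
dates = pd.Series(["Oct 21, 2008", "Oct 12, 2017", "not a date"])

# Parse with an explicit format; anything that does not match becomes NaT.
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

print(parsed)
print(parsed.dt.year)
```

Whether the full dataset contains other date formats is not shown here, so counting the `NaT` values after conversion would be a sensible check.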

Let's check the number of missing values in each column.

In [116]:
steam_games.isnull().sum()

AppID                              0
Name                               6
Release date                       0
Estimated owners                   0
Peak CCU                           0
Required age                       0
Price                              0
Discount                           0
DLC count                          0
About the game                  6483
Supported languages                0
Full audio languages               0
Reviews                       100828
Header image                       0
Website                        64994
Support url                    60693
Support email                  19025
Windows                            0
Mac                                0
Linux                              0
Metacritic score                   0
Metacritic url                107447
User score                         0
Positive                           0
Negative                           0
Score rank                    111408
Achievements                       0
R

Right off the bat, we can see some columns that have enormous amounts of missing values. Let's check the percentages for those missing values per column.

In [117]:
missing_percentage = (steam_games.isnull().sum() * 100 / len(steam_games)).sort_values(ascending = False)
print(missing_percentage)

Score rank                    99.960521
Metacritic url                96.406525
Reviews                       90.467645
Notes                         83.446686
Website                       58.315687
Support url                   54.456627
Tags                          33.577684
Support email                 17.070129
Movies                         8.631519
Categories                     6.788573
Publishers                     6.081542
About the game                 5.816854
Developers                     5.809676
Genres                         5.778272
Screenshots                    3.761260
Name                           0.005383
Median playtime forever        0.000000
Negative                       0.000000
Median playtime two weeks      0.000000
Average playtime forever       0.000000
Recommendations                0.000000
Achievements                   0.000000
Average playtime two weeks     0.000000
AppID                          0.000000
Positive                       0.000000


To get the obvious ones out of the way, I decided to drop 'Score rank', 'Metacritic url', 'Reviews', and 'Notes', since each of them is missing more than 80% of its values across the 111452 rows of the dataset.

In addition, attempting to impute columns that have more than 80% of their values missing could introduce significant bias and inaccuracy, as there is **an insufficient amount of data to reliably estimate the missing values.**
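
The 80% rule of thumb can also be applied programmatically rather than by reading the printed percentages. The sketch below shows the idea on a toy frame with made-up values; `THRESHOLD` and `sparse_columns` are hypothetical names, and the cut-off itself is the judgment call described above rather than a fixed standard.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: 10 rows, varying amounts of missing data.
toy = pd.DataFrame({
    "Name": list("ABCDEFGHIJ"),                      # 0% missing
    "Score rank": [np.nan] * 9 + [98.0],             # 90% missing
    "Notes": [np.nan] * 10,                          # 100% missing
    "Tags": ["Indie"] * 7 + [np.nan] * 3,            # 30% missing
})

THRESHOLD = 0.80  # hypothetical cut-off, mirroring the 80% rule of thumb

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()
sparse_columns = missing_share[missing_share > THRESHOLD].index.tolist()

print(sparse_columns)
```

The resulting list could then be passed to `drop(columns=...)`, which keeps the selection reproducible if the dataset is refreshed.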

I will also drop the columns that are irrelevant to our key objectives: 'Website', 'Support url', 'Support email', 'Screenshots', and 'Movies'. The questions we are about to answer do not need the links and contact details that these columns provide.

In [118]:
steam_games.drop(['Score rank', 'Metacritic url', 'Reviews', 'Support url', 'Notes', 'Website', 'Support email', 'Screenshots', 'Movies'], axis = 1, inplace = True)

In [119]:
steam_games.shape

(111452, 31)
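
Comparing shapes confirms how many columns were removed, but not which ones. A slightly stricter check is sketched below on a toy frame with made-up values; `to_drop` and `trimmed` are hypothetical names. It uses the non-inplace form of `drop`, which returns a new frame and raises a `KeyError` by default if a listed column does not exist.

```python
import pandas as pd

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "AppID": [20200],
    "Price": [19.99],
    "Website": [None],
    "Movies": [None],
})
to_drop = ["Website", "Movies"]

# drop() returns a new frame; a misspelled column name raises KeyError.
trimmed = toy.drop(columns=to_drop)

# Verify both the identity and the number of the removed columns.
assert set(trimmed.columns).isdisjoint(to_drop)
assert trimmed.shape[1] == toy.shape[1] - len(to_drop)
print(trimmed.columns.tolist())
```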

This confirms that the 9 columns were dropped (40 - 9 = 31). Let's see which columns are left for us to work with.

In [120]:
steam_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111452 entries, 0 to 111451
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111452 non-null  int64  
 1   Name                        111446 non-null  object 
 2   Release date                111452 non-null  object 
 3   Estimated owners            111452 non-null  object 
 4   Peak CCU                    111452 non-null  int64  
 5   Required age                111452 non-null  int64  
 6   Price                       111452 non-null  float64
 7   Discount                    111452 non-null  int64  
 8   DLC count                   111452 non-null  int64  
 9   About the game              104969 non-null  object 
 10  Supported languages         111452 non-null  object 
 11  Full audio languages        111452 non-null  object 
 12  Header image                111452 non-null  object 
 13  Windows       

In [121]:
steam_games.isnull().sum()

AppID                             0
Name                              6
Release date                      0
Estimated owners                  0
Peak CCU                          0
Required age                      0
Price                             0
Discount                          0
DLC count                         0
About the game                 6483
Supported languages               0
Full audio languages              0
Header image                      0
Windows                           0
Mac                               0
Linux                             0
Metacritic score                  0
User score                        0
Positive                          0
Negative                          0
Achievements                      0
Recommendations                   0
Average playtime forever          0
Average playtime two weeks        0
Median playtime forever           0
Median playtime two weeks         0
Developers                     6475
Publishers                  

In [124]:
print(steam_games['About the game'].value_counts())

About the game
Help the beautiful girls to immerse themselves in their memories and remember the wonderful moments of their lives. This game is a classic puzzle game with several levels of difficulty and beautiful arts. All characters appearing in this game are over 18 years of age. Enjoy ;-) - Puzzle Game - Artworks - Nice music - Steam achievements Keyboard H: Complete Level

Since we don't plan on using NLP (Natural Language Processing) to analyse the description text in this project, I will drop the 'About the game' column as well.

In [125]:
steam_games.drop(['About the game'], axis = 1, inplace = True)

In [127]:
steam_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111452 entries, 0 to 111451
Data columns (total 30 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111452 non-null  int64  
 1   Name                        111446 non-null  object 
 2   Release date                111452 non-null  object 
 3   Estimated owners            111452 non-null  object 
 4   Peak CCU                    111452 non-null  int64  
 5   Required age                111452 non-null  int64  
 6   Price                       111452 non-null  float64
 7   Discount                    111452 non-null  int64  
 8   DLC count                   111452 non-null  int64  
 9   Supported languages         111452 non-null  object 
 10  Full audio languages        111452 non-null  object 
 11  Header image                111452 non-null  object 
 12  Windows                     111452 non-null  bool   
 13  Mac           

This leaves us with 30 columns, a sizeable reduction from the initial count of 40.

In [128]:
steam_games.isnull().sum()

AppID                             0
Name                              6
Release date                      0
Estimated owners                  0
Peak CCU                          0
Required age                      0
Price                             0
Discount                          0
DLC count                         0
Supported languages               0
Full audio languages              0
Header image                      0
Windows                           0
Mac                               0
Linux                             0
Metacritic score                  0
User score                        0
Positive                          0
Negative                          0
Achievements                      0
Recommendations                   0
Average playtime forever          0
Average playtime two weeks        0
Median playtime forever           0
Median playtime two weeks         0
Developers                     6475
Publishers                     6778
Categories                  

To handle the missing developers and publishers, we can fill the gaps with the placeholder value "Unknown". The names of the developer and publisher could influence the success of a game, so it is better to keep these columns and mark the gaps explicitly than to drop them.

The same goes for the 'Categories', 'Genres', and 'Tags' columns. I initially thought of using the mode (for example, if the leading genre was Horror, all the missing values would be imputed as Horror), but this could artificially skew the class balance and increase bias. It would also risk plainly miscategorising a game.
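
The trade-off between the two strategies is easy to see on a small example. The sketch below uses a made-up 'Genres' series (not taken from the dataset) in which half of the values are missing, and compares the category shares after mode imputation with those after filling with the "Unknown" placeholder.

```python
import pandas as pd

# Made-up series: 2 Horror, 1 Action, 3 missing.
genres = pd.Series(["Horror", "Horror", "Action", None, None, None], name="Genres")

# Strategy 1: fill with the most frequent value (the mode).
mode_value = genres.mode()[0]
mode_filled = genres.fillna(mode_value)

# Strategy 2: fill with an explicit placeholder.
sentinel_filled = genres.fillna("Unknown")

# Compare the resulting category shares.
print(mode_filled.value_counts(normalize=True))
print(sentinel_filled.value_counts(normalize=True))
```

With the mode, Horror grows from 2 of the 3 known values to 5 of all 6 rows, while the placeholder leaves the observed counts untouched and keeps the missingness visible as its own category.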

In [129]:
# Replace the missing values with the placeholder 'Unknown'.
steam_games['Developers'] = steam_games['Developers'].fillna('Unknown')
steam_games['Publishers'] = steam_games['Publishers'].fillna('Unknown')
steam_games['Categories'] = steam_games['Categories'].fillna('Unknown')
steam_games['Genres'] = steam_games['Genres'].fillna('Unknown')
steam_games['Tags'] = steam_games['Tags'].fillna('Unknown')
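
As an aside, the same fill can be written as a single call by passing `fillna` a dictionary that maps column names to fill values. The sketch below shows this on a toy frame with made-up gaps; `text_columns` is a hypothetical name. Columns that are not listed in the dictionary, such as 'Price' here, are left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up gaps in both text and numeric columns.
toy = pd.DataFrame({
    "Developers": ["Rusty Moyher", np.nan],
    "Tags": [np.nan, "Indie,Action"],
    "Price": [0.99, np.nan],
})

# Only the listed text columns receive the placeholder.
text_columns = ["Developers", "Tags"]
filled = toy.fillna({column: "Unknown" for column in text_columns})

print(filled)
```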

In [130]:
steam_games.isnull().sum()

AppID                         0
Name                          6
Release date                  0
Estimated owners              0
Peak CCU                      0
Required age                  0
Price                         0
Discount                      0
DLC count                     0
Supported languages           0
Full audio languages          0
Header image                  0
Windows                       0
Mac                           0
Linux                         0
Metacritic score              0
User score                    0
Positive                      0
Negative                      0
Achievements                  0
Recommendations               0
Average playtime forever      0
Average playtime two weeks    0
Median playtime forever       0
Median playtime two weeks     0
Developers                    0
Publishers                    0
Categories                    0
Genres                        0
Tags                          0
dtype: int64

In [131]:
steam_games.head()

| | AppID | Name | Release date | Estimated owners | Peak CCU | Required age | Price | Discount | DLC count | Supported languages | ... | Recommendations | Average playtime forever | Average playtime two weeks | Median playtime forever | Median playtime two weeks | Developers | Publishers | Categories | Genres | Tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20200 | Galactic Bowling | Oct 21, 2008 | 0 - 20000 | 0 | 0 | 19.99 | 0 | 0 | ['English'] | ... | 0 | 0 | 0 | 0 | 0 | Perpetual FX Creative | Perpetual FX Creative | Single-player,Multi-player,Steam Achievements,... | Casual,Indie,Sports | Indie,Casual,Sports,Bowling |
| 1 | 655370 | Train Bandit | Oct 12, 2017 | 0 - 20000 | 0 | 0 | 0.99 | 0 | 0 | ['English', 'French', 'Italian', 'German', 'Sp... | ... | 0 | 0 | 0 | 0 | 0 | Rusty Moyher | Wild Rooster | Single-player,Steam Achievements,Full controll... | Action,Indie | Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc... |
| 2 | 1732930 | Jolt Project | Nov 17, 2021 | 0 - 20000 | 0 | 0 | 4.99 | 0 | 0 | ['English', 'Portuguese - Brazil'] | ... | 0 | 0 | 0 | 0 | 0 | Campião Games | Campião Games | Single-player | Action,Adventure,Indie,Strategy | Unknown |
| 3 | 1355720 | Henosis™ | Jul 23, 2020 | 0 - 20000 | 0 | 0 | 5.99 | 0 | 0 | ['English', 'French', 'Italian', 'German', 'Sp... | ... | 0 | 0 | 0 | 0 | 0 | Odd Critter Games | Odd Critter Games | Single-player,Full controller support | Adventure,Casual,Indie | 2D Platformer,Atmospheric,Surreal,Mystery,Puzz... |
| 4 | 1139950 | Two Weeks in Painland | Feb 3, 2020 | 0 - 20000 | 0 | 0 | 0.0 | 0 | 0 | ['English', 'Spanish - Spain'] | ... | 0 | 0 | 0 | 0 | 0 | Unusual Games | Unusual Games | Single-player,Steam Achievements | Adventure,Indie | Indie,Adventure,Nudity,Violent,Sexual Content,... |
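
The 'Genres', 'Categories', and 'Tags' columns in the preview hold several comma-separated labels per game, so counting games per genre requires splitting them first. The snippet below is a minimal sketch of one possible approach, using three of the preview rows; `long_form` and `genre_counts` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the preview above.
toy = pd.DataFrame({
    "Name": ["Galactic Bowling", "Train Bandit", "Jolt Project"],
    "Genres": [
        "Casual,Indie,Sports",
        "Action,Indie",
        "Action,Adventure,Indie,Strategy",
    ],
})

# Split each string into a list, then give every genre its own row.
long_form = toy.assign(Genres=toy["Genres"].str.split(",")).explode("Genres")

# Count how many games carry each genre label.
genre_counts = long_form["Genres"].value_counts()
print(genre_counts)
```

Rows filled with the "Unknown" placeholder would show up as their own label in such a count, which makes them easy to filter out or report separately.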
