
**<h1>Introduction</h1>**



**<h3> Motivation </h3>**

The video game industry is a multi-billion dollar market, and Steam is its dominant PC storefront. As a lifelong gamer, I have always been fascinated by the sheer volume of titles released each year, ranging from solo-developed indie gems to massive AAA titles.

However, the marketplace is crowded. For every masterpiece, thousands of games go unnoticed. This project aims to move beyond subjective opinion and use data to understand what drives success on the platform. By analyzing pricing strategies, genre trends, and review sentiment, I aim to uncover the quantitative patterns that separate a successful game from a flop.

**<h3> Key Objectives: </h3>**
- Identify correlations between Genre and User Ratings.
- Analyze the differences in pricing strategy between Indie and AAA titles.
- Determine if lower price points correlate with higher ratings.
- Predict whether a game will be a "hit" or a "flop" using XGBoost.

**<h3> Data Source: </h3>**
Data is collected from Steam Spy, a Steam statistics service built on the Web API provided by Valve. <br>
https://steamspy.com/

**<h3> Attributes / Features: </h3>**
**AppID**: Unique identifier for each game on the Steam store.

**Name**: The game's title.

**Release date**: The date the game was released on Steam.

**Estimated owners**: A range estimating the number of people who own the game (e.g., "0 - 20000", "20000 - 50000").

**Peak CCU**: The peak number of Concurrent Users (players) in the game on the previous day.

**Required age**: The minimum age required to play the game (0 indicates all ages).

**Price**: The current price of the game in US Dollars.

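**DiscountDLC count**: The number of downloadable content (DLC) packs available for the game. The name is reproduced as it appears in the CSV header, where it looks like two names (`Discount` and `DLC count`) run together.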

**About the game**: A brief text description or summary of the game's content.

**Supported languages**: A list of all languages supported by the game (interface, audio, or subtitles).

**Full audio languages**: A list of languages that have full audio support (voice acting).

**Reviews**: A text summary of the user review score (e.g., "Overwhelmingly Positive", "Mixed").

**Header image**: The URL to the game's main header image/banner on the store.

**Website**: The URL to the game's official website.

**Support url**: The URL for the game's customer support page.

**Support email**: The contact email address for game support.

**Windows**: True if the game is compatible with Windows, False otherwise.

**Mac**: True if the game is compatible with macOS, False otherwise.

**Linux**: True if the game is compatible with Linux, False otherwise.

**Metacritic score**: The game's critic score from Metacritic (0 if not available).

**Metacritic url**: The URL to the game's Metacritic page.

**User score**: A score calculated based on user reviews (0 if not available or insufficient data).

**Positive**: The total count of positive user reviews.

**Negative**: The total count of negative user reviews.

**Score rank**: A ranking of the game based on its user score compared to others (often null for most games).

**Achievements**: The total number of in-game achievements available.

**Recommendations**: The number of times users have recommended this game.

**Notes**: Extra information or warnings about the game content.

**Average playtime forever**: The average total playtime for all owners of the game (in minutes).

**Average playtime two weeks**: The average playtime in the last two weeks (in minutes).

**Median playtime forever**: The median total playtime for owners of the game (in minutes).

**Median playtime two weeks**: The median playtime in the last two weeks (in minutes).

**Developers**: The name of the studio or individual who created the game.

**Publishers**: The name of the company that published the game.

**Categories**: A list of game categories (e.g., "Single-player", "Multi-player", "Co-op").

**Genres**: A list of genres the game belongs to (e.g., "Action", "Indie", "RPG").

**Tags**: User-defined tags that describe the game (e.g., "Pixel Art", "Difficult").

**Screenshots**: URLs to screenshots of the gameplay.

**Movies**: URLs to trailers or promotional videos.
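
Since `Estimated owners` is stored as a text range rather than a number, it usually has to be parsed before it can be used numerically. The following is a minimal sketch, assuming the ranges always follow the `"low - high"` pattern shown in the examples above. The helper names `parse_owner_range` and `owner_midpoint` are hypothetical, and using the midpoint as a point estimate is an assumption, not something Steam Spy prescribes.

```python
def parse_owner_range(text):
    """Split a range such as "20000 - 50000" into integer (low, high) bounds."""
    # Strip whitespace and any thousands separators before converting.
    low, high = (int(part.strip().replace(",", "")) for part in text.split("-"))
    return low, high


def owner_midpoint(text):
    """Return the midpoint of the range as a rough single-number estimate."""
    low, high = parse_owner_range(text)
    return (low + high) / 2


print(parse_owner_range("0 - 20000"))    # (0, 20000)
print(owner_midpoint("20000 - 50000"))   # 35000.0
```

On a DataFrame, the same helper could be applied to the column with `.map(owner_midpoint)`.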

**<h2> 1. Data Wrangling </h2>**

First, let's import the libraries we'll need for the exploratory data analysis (EDA).

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [4]:
steam_games = pd.read_csv('../data/games.csv')
print(steam_games.head())

                         AppID          Name Release date  Estimated owners  \
20200         Galactic Bowling  Oct 21, 2008    0 - 20000                 0   
655370            Train Bandit  Oct 12, 2017    0 - 20000                 0   
1732930           Jolt Project  Nov 17, 2021    0 - 20000                 0   
1355720               Henosis™  Jul 23, 2020    0 - 20000                 0   
1139950  Two Weeks in Painland   Feb 3, 2020    0 - 20000                 0   

         Peak CCU  Required age  Price  DiscountDLC count  \
20200           0         19.99      0                  0   
655370          0          0.99      0                  0   
1732930         0          4.99      0                  0   
1355720         0          5.99      0                  0   
1139950         0          0.00      0                  0   

                                            About the game  \
20200    Galactic Bowling is an exaggerated and stylize...   
655370   THE LAW!! Looks to be a s
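
One thing worth noticing in the preview above: the column labeled `AppID` holds game titles, `Name` holds release dates, and the numeric IDs appear as the index. pandas produces this pattern when each data row has one more field than the header, because it then treats the first field as the index. The sketch below reproduces the effect on a tiny in-memory CSV and shows one possible repair. It assumes the cause is that `DiscountDLC count` is really two header names (`Discount` and `DLC count`) fused together. That assumption, the helper `split_fused_name`, and the sample description text are illustrative rather than confirmed details of the dataset.

```python
import io

import pandas as pd

# Tiny stand-in for games.csv with the same defect:
# the header has 5 names but the data row has 6 fields.
raw = (
    "AppID,Name,Price,DiscountDLC count,About the game\n"
    "20200,Galactic Bowling,19.99,0,0,Placeholder description\n"
)

# With one extra field per row, pandas uses the first field as the index,
# so every label up to the fused name ends up one position off.
shifted = pd.read_csv(io.StringIO(raw))
print(shifted["AppID"].iloc[0])  # prints the title, not the ID


def split_fused_name(columns, fused="DiscountDLC count",
                     parts=("Discount", "DLC count")):
    """Return the column names with the fused header split in two."""
    fixed = []
    for name in columns:
        if name == fused:
            fixed.extend(parts)
        else:
            fixed.append(name)
    return fixed


# Re-read, replacing the header row (header=0) with the corrected names.
names = split_fused_name(shifted.columns)
fixed = pd.read_csv(io.StringIO(raw), header=0, names=names)
print(fixed[["AppID", "Name", "Discount", "DLC count"]])
```

For the real file, the same two-step read would use the path `'../data/games.csv'` in place of the `io.StringIO` buffers.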

Next, list every column in the dataset along with its non-null count and dtype.

In [42]:
steam_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 111452 entries, 20200 to 3183790
Data columns (total 39 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111446 non-null  object 
 1   Name                        111452 non-null  object 
 2   Release date                111452 non-null  object 
 3   Estimated owners            111452 non-null  int64  
 4   Peak CCU                    111452 non-null  int64  
 5   Required age                111452 non-null  float64
 6   Price                       111452 non-null  int64  
 7   DiscountDLC count           111452 non-null  int64  
 8   About the game              104969 non-null  object 
 9   Supported languages         111452 non-null  object 
 10  Full audio languages        111452 non-null  object 
 11  Reviews                     10624 non-null   object 
 12  Header image                111452 non-null  object 
 13  Website       

Let's check for missing values first.

In [43]:
print(steam_games.isnull().sum())

AppID                              6
Name                               0
Release date                       0
Estimated owners                   0
Peak CCU                           0
Required age                       0
Price                              0
DiscountDLC count                  0
About the game                  6483
Supported languages                0
Full audio languages               0
Reviews                       100828
Header image                       0
Website                        64994
Support url                    60693
Support email                  19025
Windows                            0
Mac                                0
Linux                              0
Metacritic score                   0
Metacritic url                107447
User score                         0
Positive                           0
Negative                           0
Score rank                    111408
Achievements                       0
Recommendations                    0
N
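
Raw counts are easier to judge as a share of the total: `Reviews`, for instance, is missing in 100828 of the 111452 rows, roughly 90%. The following is a small sketch of a summary that reports both the count and the percentage per column. The helper name `missing_summary` and the demo frame are made up for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Made-up demo data; on the real dataset this would be missing_summary(steam_games).
demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Reviews": [None, None, None, "Mixed"],
    "Website": ["x", None, "y", "z"],
})
result = missing_summary(demo)
print(result)
```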

Let's see which columns contain missing values.

In [44]:
# Collect the names of the columns that contain at least one missing value.
null_columns = steam_games.columns[steam_games.isnull().any()]
print(null_columns)

Index(['AppID', 'About the game', 'Reviews', 'Website', 'Support url',
       'Support email', 'Metacritic url', 'Score rank', 'Notes', 'Developers',
       'Publishers', 'Categories', 'Genres', 'Tags', 'Screenshots', 'Movies'],
      dtype='object')


Before removing rows, note that dropping every row with any missing value would be too aggressive. The 'Movies' column, for example, contributes little to our analyses, so its missing values do not matter. We'll therefore drop only the rows that are missing a value in one of the critical columns: 'AppID', 'Name', 'Genres', 'About the game', and 'Reviews'.

In [45]:
# Drop a row only when one of the critical columns is missing.
cleaned_steam_games = steam_games.dropna(subset=['AppID', 'Name', 'Genres', 'About the game', 'Reviews'])
print(cleaned_steam_games.shape)
print(cleaned_steam_games.isnull().sum())

(10599, 39)
AppID                             0
Name                              0
Release date                      0
Estimated owners                  0
Peak CCU                          0
Required age                      0
Price                             0
DiscountDLC count                 0
About the game                    0
Supported languages               0
Full audio languages              0
Reviews                           0
Header image                      0
Website                        2319
Support url                    3301
Support email                  1475
Windows                           0
Mac                               0
Linux                             0
Metacritic score                  0
Metacritic url                 8493
User score                        0
Positive                          0
Negative                          0
Score rank                    10591
Achievements                      0
Recommendations                   0
Notes           
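
The filter above shrinks the dataset from 111452 rows to 10599, so it can help to see which critical column accounts for most of the loss. Here is a minimal sketch that applies `dropna` one column at a time and records how many rows survive each step. The helper `dropna_trace` and the demo data are hypothetical.

```python
import pandas as pd


def dropna_trace(df, subset):
    """Apply dropna column by column, recording the rows left after each step."""
    remaining = {"start": len(df)}
    current = df
    for col in subset:
        current = current.dropna(subset=[col])
        remaining[col] = len(current)
    return pd.Series(remaining)


# Made-up demo data; on the real dataset this would be
# dropna_trace(steam_games, ['AppID', 'Name', 'Genres', 'About the game', 'Reviews']).
demo = pd.DataFrame({
    "Name": ["A", None, "C", "D", "E"],
    "Genres": ["Indie", "RPG", None, "Action", "Indie"],
    "Reviews": [None, "Mixed", "Positive", "Positive", None],
})
trace = dropna_trace(demo, ["Name", "Genres", "Reviews"])
print(trace)
```

The counts are cumulative, so the order of `subset` affects which column appears responsible for a given drop.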

In [46]:
cleaned_steam_games.head()

| Unnamed: 0 | AppID | Name | Release date | Estimated owners | Peak CCU | Required age | Price | DiscountDLC count | About the game | Supported languages | ... | Average playtime two weeks | Median playtime forever | Median playtime two weeks | Developers | Publishers | Categories | Genres | Tags | Screenshots | Movies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1026420 | WARSAW | Oct 2, 2019 | 20000 - 50000 | 5 | 0 | 23.99 | 0 | 0 | Use everything at your disposal to help a team... | ['English', 'French', 'German', 'Polish', 'Rus... | ... | 0 | 93 | 0 | Pixelated Milk | Pixelated Milk,gaming company | Single-player,Steam Achievements,Steam Trading... | Indie,RPG | Tactical RPG,Turn-Based Strategy,Wargame,Histo... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 485000 | Cthulhu Realms | Jul 1, 2016 | 50000 - 100000 | 0 | 0 | 0.0 | 0 | 1 | Star Realms has gone insane! Introducing Cthul... | ['English'] | ... | 0 | 257 | 0 | Wise Wizard Games, LLC | Wise Wizard Games, LLC | Single-player,Multi-player,Shared/Split Screen... | Strategy | Card Game,Strategy,Deckbuilding,Lovecraftian,B... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 346560 | Hero of the Kingdom II | Feb 20, 2015 | 100000 - 200000 | 5 | 0 | 7.99 | 0 | 0 | Sail to the farthest islands to save your sist... | ['English', 'French', 'German', 'Spanish - Spa... | ... | 0 | 303 | 0 | Lonely Troops | Lonely Troops | Single-player,Steam Achievements,Steam Trading... | Adventure,Casual,Indie,RPG | Adventure,Casual,Point & Click,RPG,Indie,Isome... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 552350 | Unhappy Ever After | Jan 20, 2017 | 20000 - 50000 | 0 | 0 | 0.99 | 0 | 2 | 'In the infinitesimal spaces between the pages... | ['English'] | ... | 0 | 218 | 0 | Warped Tales | Warped Tales | Single-player,Steam Trading Cards,Partial Cont... | Adventure,Indie,RPG | Adventure,Indie,RPG,RPGMaker,Dark Fantasy,Fant... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |
| 897820 | Reigns: Game of Thrones | Oct 18, 2018 | 50000 - 100000 | 2 | 0 | 3.99 | 0 | 0 | Reigns: Game of Thrones is the heir to the awa... | ['English', 'French', 'German', 'Spanish - Spa... | ... | 0 | 76 | 0 | Nerial | Devolver Digital | Single-player,Steam Achievements,Full controll... | Adventure,Indie,RPG | RPG,Indie,Card Game,Adventure,Choices Matter,2... | https://cdn.akamai.steamstatic.com/steam/apps/... | http://cdn.akamai.steamstatic.com/steam/apps/2... |


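Now let's check for duplicates. This assumes that the AppID column is the primary key that uniquely identifies each row (each game, in this case). One caveat: in the previews above, the column labeled AppID holds game titles while the numeric IDs sit in the index, so the count below reflects repeated titles rather than repeated IDs.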

In [49]:
# Count rows whose AppID value already appeared in an earlier row.
print(cleaned_steam_games['AppID'].duplicated().sum())

37
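
A count alone does not show which rows collide, so it is worth looking at them before deciding what to remove. The sketch below uses made-up data to show one way to list every member of a duplicated group and then keep a single row per key. Keeping the first occurrence is an assumption; the right choice depends on how the duplicate rows actually differ.

```python
import pandas as pd

# Made-up demo data with one repeated key.
demo = pd.DataFrame({
    "AppID": [10, 20, 20, 30],
    "Name": ["Alpha", "Beta", "Beta", "Gamma"],
})

# keep=False flags every row in a duplicated group, not just the repeats.
dupes = demo[demo["AppID"].duplicated(keep=False)].sort_values("AppID")
print(dupes)

# Keep the first row for each key (assumption: the first occurrence is the one to trust).
deduped = demo.drop_duplicates(subset="AppID", keep="first")
print(len(demo), "->", len(deduped))
```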
