# Games Dataset (Using Pandas)

## Dataset Inspection (Bronze)

This notebook processes the Games dataset with Pandas. Pandas is single-threaded and holds the whole dataset in memory, so it does not scale well to very large files.
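
If the file ever became too large to load comfortably, one common workaround is to read it in chunks. The sketch below is only an illustration: the in-memory CSV is a hypothetical stand-in for 'bronze/games.csv', and the chunk size of 2 is arbitrary.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file
csv_text = "AppID,Name\n1,Alpha\n2,Beta\n3,Gamma\n4,Delta\n5,Epsilon\n"

row_count = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)

print(row_count)
```

Each chunk is an ordinary DataFrame, so per-chunk filtering or aggregation can run before the next chunk is read.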

In [None]:
import pandas as pd

games_bronze_df = pd.read_csv('bronze/games.csv', delimiter=',', index_col=False)
games_bronze_df.head().T


On closer inspection, the column names do not line up with the values beneath them.
The next cell lists every column together with its unique values.

In [None]:
# List all columns and their unique values
for column in games_bronze_df.columns:
    print(f"\n{column}:\n")
    print(games_bronze_df[column].unique())


The values appear to be shifted by one column starting at 'DiscountDLC count': the 'About the game' column holds values that belong to 'DiscountDLC count'.
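
To make the repair concrete, here is a minimal sketch of a one-column left shift on a toy frame. The column names 'A' to 'D' and the values are hypothetical; only the technique matches what is applied to the real data.

```python
import pandas as pd

# Hypothetical toy frame: from column 'B' onward, every value sits one column too far right
toy = pd.DataFrame({
    'A': [1, 2],
    'B': ['junk', 'junk'],
    'C': ['b1', 'b2'],
    'D': ['c1', 'c2'],
})

idx = toy.columns.get_loc('B')
cols_to_fix = toy.columns[idx:]

# Copy each column's values into its left neighbor
toy[cols_to_fix[:-1]] = toy[cols_to_fix[1:]].values

# The last column is now a duplicate of its neighbor, so drop it
toy = toy.drop(columns=cols_to_fix[-1])
print(toy)
```

Note that whatever was originally stored in the first shifted column ('junk' here) is overwritten and lost.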

## Bronze -> Silver

In [None]:
import pandas as pd

# Load the dataset
games_bronze_df = pd.read_csv('bronze/games.csv',index_col=False)

# Get the index of the 'DiscountDLC count' column
idx = games_bronze_df.columns.get_loc('DiscountDLC count')

# Get the list of columns starting from 'DiscountDLC count' to the end
cols_to_fix = games_bronze_df.columns[idx:]

# Shift the values of these columns to the left by one
games_bronze_df[cols_to_fix[:-1]] = games_bronze_df[cols_to_fix[1:]].values

# Drop the last column, which is now a duplicate of its left neighbor
games_bronze_df.drop(columns=cols_to_fix[-1], inplace=True)

# Inspect the DataFrame
for column in games_bronze_df.columns:
    print(f"\n{column}:\n")
    print(games_bronze_df[column].unique())

A cursory look at the data shows a few columns with a significant number of null values.
The column types are bool, float64, int64 and object.
Strings are stored as the object type, a generic Pandas dtype that holds pointers to the actual content.

Focusing on the null values, we do a quick count per column. The goal is to identify any columns that are completely null and therefore safe to remove.

In [None]:
missing_values = games_bronze_df.isnull().sum()
total_rows = games_bronze_df.shape[0]

# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(missing_values.apply(lambda x: f"{'Empty' if x == total_rows else ('High Null Rate' if x/total_rows > 0.5 else '')}      {x}/{total_rows}   {x/total_rows:.2%}"))

No column is completely empty, although a few have much higher null rates than the rest.
'Reviews', 'Website', 'Support url', 'Metacritic url', 'Score rank' and 'Notes' all have a significant number of null values.
'Score rank' would be useful, but it has the highest null rate at 99.95%.

'Notes' could indicate whether a title depicts violence, mature themes and so on, but the high null rate and the non-standard wording of the notes would make that effort unproductive. The same goes for 'Reviews': it has a high null rate, and we would need a sentiment analysis tool to extract useful information from the field.
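
As a cross-check on the hand-picked list, the high-null columns can also be selected programmatically. This is a sketch on a hypothetical toy frame; the 0.5 threshold simply mirrors the "High Null Rate" label used in the cell above and is not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for games_bronze_df
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Notes': [np.nan, np.nan, np.nan, 'violence'],
    'Price': [1.0, np.nan, 3.0, 4.0],
})

# isnull().mean() gives the fraction of nulls per column directly
null_rate = toy_df.isnull().mean()

# Keep only the columns whose null rate exceeds the threshold
high_null_columns = null_rate[null_rate > 0.5].index.tolist()
print(high_null_columns)
```

A null-rate threshold alone would not flag columns dropped for other reasons, so the final list still needs a manual review.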

In [None]:
# Drop Unnecessary or High Null Rate Columns
columns_to_drop = ['Reviews', 'Website', 'Support url', 'Metacritic url', 'Score rank', 'Notes']

# Drop the columns
games_silver_df = games_bronze_df.drop(columns=columns_to_drop)

Six entries are missing a name, which is a critical piece of information, so we drop those rows.

In [None]:
# Drop the rows with missing values in the 'Name' column
games_silver_df = games_silver_df.dropna(subset=['Name'])

Fill null entries with descriptive placeholders.
We are purposefully keeping some NaN values (notably in the genres, categories and tags columns) because they are easier to parse out when building the gold datasets.

Dropping or filling them at this stage would be premature: it would either lose information we need or make parsing harder in the gold datasets.
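
One compact way to express "fill these columns, leave the rest alone" is to pass a dict to `fillna`. The sketch below uses a hypothetical toy frame; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with gaps in three columns
toy_df = pd.DataFrame({
    'About the game': ['A bowling game', np.nan],
    'Support email': [np.nan, 'help@example.com'],
    'Genres': ['Casual,Indie', np.nan],
})

# Only the columns named in the dict are filled; 'Genres' keeps its NaN
placeholders = {'About the game': 'Not Provided', 'Support email': 'Not Provided'}
filled_df = toy_df.fillna(placeholders)
print(filled_df)
```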

In [None]:

games_silver_df['Release date'] = games_silver_df['Release date'].fillna(pd.to_datetime('1950-01-01'))
games_silver_df['About the game'] = games_silver_df['About the game'].fillna('Not Provided')
games_silver_df['Support email'] = games_silver_df['Support email'].fillna('Not Provided')
games_silver_df['Screenshots'] = games_silver_df['Screenshots'].fillna('Not Provided') 

print(games_silver_df.info())

Coerce object columns to explicit Pandas types.

In [None]:
numeric_columns = ['AppID', 'Peak CCU', 'Required age', 'Price', 'Achievements', 'Positive', 'Negative', 'Average playtime forever', 'Average playtime two weeks', 'Median playtime forever', 'Median playtime two weeks', 'Recommendations']

for col in numeric_columns:
    games_silver_df[col] = pd.to_numeric(games_silver_df[col], errors='coerce')

boolean_columns = ['Windows', 'Mac', 'Linux']

for col in boolean_columns:
    games_silver_df[col] = games_silver_df[col].astype(bool)
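
A side effect of `errors='coerce'` is worth keeping in mind: values that cannot be parsed silently become NaN. The sketch below shows this on a hypothetical series of raw price strings.

```python
import pandas as pd

# Hypothetical raw values; 'free' cannot be parsed as a number
raw = pd.Series(['19.99', '0', 'free'])

# errors='coerce' turns unparseable entries into NaN instead of raising
prices = pd.to_numeric(raw, errors='coerce')

# Counting nulls after coercion shows how many values were lost
unparseable = int(prices.isnull().sum())
print(prices.tolist(), unparseable)
```

Comparing the null count before and after coercion is a cheap way to see whether a column contained unexpected text.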


In [None]:
# Count the 0 values in numeric columns
numeric_columns = games_silver_df.select_dtypes(include=['int64', 'float64']).columns

# Check object columns too for 0 values
numeric_columns = numeric_columns.append(games_silver_df.select_dtypes(include=['object']).columns)

zero_values = games_silver_df[numeric_columns].eq(0).sum()

# Recompute the row count, since rows were dropped after the null check
total_rows = games_silver_df.shape[0]

# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(zero_values.apply(lambda x: f"{'Empty' if x == total_rows else ('High Zero Rate' if x/total_rows > 0.5 else '')}      {x}/{total_rows}   {x/total_rows:.2%}"))

- A required age of 0 implies an 'E' (everyone) age rating.
- A Peak CCU of 0 is plausible, since the dataset scrapes every game available on Steam and a significant number of those are unpopular or never played.
- Playtimes of 0 are also reasonable, because they imply the game was bought but never played. The number of games not played in the past two weeks is, as expected, higher.
- User score being almost entirely 0 is surprising. It is unclear whether this reflects missing data or whether many games listed on Steam have no user reviews at all.
- Metacritic being mostly 0 makes sense, as Metacritic only scores high-profile video game releases (Steam also includes non-game software in its listing).
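
One way to probe the User score question is to compare the zero scores against the review counts. The sketch below is a hypothetical illustration: the rows are made up, and 'User score' is assumed to be the column name behind the figure discussed above.

```python
import pandas as pd

# Hypothetical toy rows; 'User score' is an assumed column name
toy_df = pd.DataFrame({
    'User score': [0, 0, 0, 84],
    'Positive': [0, 120, 0, 900],
    'Negative': [0, 30, 0, 100],
})

zero_score = toy_df['User score'].eq(0)
no_reviews = (toy_df['Positive'] + toy_df['Negative']).eq(0)

# A zero score on a game that does have reviews points to missing data
zero_score_with_reviews = int((zero_score & ~no_reviews).sum())
print(zero_score_with_reviews)
```

If most zero scores belong to games that have reviews, the zeros are more likely missing values than genuine scores.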

Drop duplicates, considering only the AppID column.

In [None]:
if not games_silver_df['AppID'].duplicated().any():
    print('No duplicates found')
else:
    print('Duplicates found')
    games_silver_df = games_silver_df.drop_duplicates(subset='AppID')
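
For reference, a small sketch of how `duplicated` and `drop_duplicates` behave when restricted to one column. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: AppID 20 appears twice with different names
toy_df = pd.DataFrame({'AppID': [10, 20, 20], 'Name': ['A', 'B', 'B (re-listed)']})

# duplicated() marks every repeat after the first occurrence
duplicate_count = int(toy_df.duplicated(subset='AppID').sum())

# keep='first' is the default: the first row per AppID survives
deduped_df = toy_df.drop_duplicates(subset='AppID', keep='first')
print(duplicate_count, deduped_df['Name'].tolist())
```

Because only AppID is compared, rows that differ in other columns are still treated as duplicates, and the later ones are discarded.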


Categories, Tags and Genres contain comma-separated strings.
As part of processing, we split these into lists.

In [None]:
def split_to_unique_list(value):
    """Split a comma-separated string into a list of unique, lowercased items."""
    # Non-string values (such as NaN) become an empty list
    if not isinstance(value, str):
        return []
    # A set removes duplicates; the resulting order is not guaranteed
    return list({item.strip().lower() for item in value.split(',')})

# Convert to lists, lowercase and remove duplicates
for col in ['Categories', 'Genres', 'Tags']:
    games_silver_df[col] = games_silver_df[col].apply(split_to_unique_list)


In [None]:
# See if this works
print(games_silver_df['Categories'])
print(games_silver_df['Genres'])
print(games_silver_df['Tags'])

In [None]:
games_silver_df.head().T

In [None]:
# Save the silver DataFrame to a CSV file
games_silver_df.to_csv('silver/games.csv', index=False)
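
One consequence of saving to CSV is that list columns do not survive the round trip as lists: they are written as their string representation. The sketch below demonstrates this with an in-memory buffer (a hypothetical stand-in for 'silver/games.csv') and shows `ast.literal_eval` as one way to recover the list.

```python
import ast
import io
import pandas as pd

# Hypothetical one-row frame with a list-valued column
toy_df = pd.DataFrame({'AppID': [1], 'Genres': [['action', 'indie']]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)

# After reloading, the cell is the text "['action', 'indie']", not a list
value_after_reload = reloaded_df['Genres'].iloc[0]

# literal_eval safely parses that text back into a Python list
parsed = ast.literal_eval(value_after_reload)
print(type(value_after_reload).__name__, parsed)
```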

## Silver -> Gold: Metrics for Player Retention and Reputation by Genre

The first gold dataset provides information on player retention and reputation by genre.
Playtime and the positive review rate serve as the main metrics, supplemented by the estimated number of players per genre and the total revenue.

In [1]:
import pandas as pd
from pprint import pprint
try:
    gold_genre_df = games_silver_df.copy()
    print('gold_genre_df copied from games_silver_df')
except NameError:
    gold_genre_df = pd.read_csv('silver/games.csv', index_col=False)
    print('gold_genre_df loaded from silver/games.csv')
    


gold_genre_df loaded from silver/games.csv


### Most Common Genre

Drop rows with missing genres

In [2]:
# Flag rows with empty genres: an empty list, a null, or the string '[]' when loaded from CSV
empty_genres = gold_genre_df['Genres'].apply(lambda x: len(x) == 0 if isinstance(x, list) else (pd.isnull(x) or x == '[]'))
gold_genre_df = gold_genre_df[~empty_genres].copy()

# Print out what this results in
pprint(gold_genre_df["Genres"].apply(lambda x: str(x)).unique())


array(["['indie', 'sports', 'casual']", "['action', 'indie']",
       "['action', 'adventure', 'indie', 'strategy']", ...,
       "['action', 'adventure', 'indie', 'game development']",
       "['action', 'rpg', 'indie', 'free to play', 'racing', 'strategy']",
       "['action', 'indie', 'free to play', 'casual', 'massively multiplayer', 'strategy']"],
      dtype=object)


In [3]:
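import ast
# Parse stringified lists (from the CSV) back into lists; leave real lists untouched
gold_genre_df["Genres"] = gold_genre_df["Genres"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Explode the Genres column so that each row holds a single genre
gold_genre_df = gold_genre_df.explode('Genres')

# Rows whose genre list was empty explode to NaN; drop them
gold_genre_df = gold_genre_df.dropna(subset=['Genres'])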

# See the result
pprint(gold_genre_df.head().T)

                                                                            0  \
AppID                                                                   20200   
Name                                                         Galactic Bowling   
Release date                                                     Oct 21, 2008   
Estimated owners                                                    0 - 20000   
Peak CCU                                                                    0   
Required age                                                                0   
Price                                                                   19.99   
DiscountDLC count                                                           0   
About the game              Galactic Bowling is an exaggerated and stylize...   
Supported languages                                               ['English']   
Full audio languages                                                       []   
Header image                
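
The effect of `explode` is easier to see on a small frame. The sketch below uses hypothetical rows, including one game with an empty genre list.

```python
import pandas as pd

# Hypothetical games; the third one has no genres at all
toy_df = pd.DataFrame({
    'AppID': [1, 2, 3],
    'Genres': [['action', 'indie'], ['rpg'], []],
})

# Each list element gets its own row; an empty list becomes a single NaN row
exploded_df = toy_df.explode('Genres')

# Dropping the NaN rows removes the games that had no genre
exploded_df = exploded_df.dropna(subset=['Genres'])
print(exploded_df)
```

The game with AppID 1 now appears twice, once per genre, which matters for any later aggregation.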

### Group by Genre

In [4]:
import numpy as np

gold_genre_df['Total_Reviews'] = gold_genre_df['Positive'] + gold_genre_df['Negative']
gold_genre_df['Total_Reviews'] = gold_genre_df['Total_Reviews'].replace(0, np.nan)
gold_genre_df['Positive_Review_Rate'] = gold_genre_df['Positive'] / gold_genre_df['Total_Reviews']
gold_genre_df['Positive_Review_Rate'] = gold_genre_df['Positive_Review_Rate'].fillna(0)
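
A small worked example of the rate calculation, on hypothetical review counts. It also keeps a mask of games without reviews, since filling their rate with 0 makes them indistinguishable from games with only negative reviews; whether that matters depends on how the metric is used.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: a well-reviewed game, an all-negative game, and a game with no reviews
toy_df = pd.DataFrame({'Positive': [90, 0, 0], 'Negative': [10, 5, 0]})

# Replace a zero total with NaN to avoid dividing by zero
total = (toy_df['Positive'] + toy_df['Negative']).replace(0, np.nan)
rate = toy_df['Positive'] / total

# Remember which games had no reviews before filling the gaps with 0
no_reviews = rate.isnull()
rate_filled = rate.fillna(0)
print(rate_filled.tolist(), no_reviews.tolist())
```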

In [5]:
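def estimate_owners_range(owners_range):
    """Return the midpoint of a range string such as '0 - 20000', or NaN if it cannot be parsed."""
    if pd.isnull(owners_range):
        return np.nan
    try:
        low, high = owners_range.split(' - ')
        low = int(low.replace(',', '').strip())
        high = int(high.replace(',', '').strip())
        return (low + high) / 2  # Average of the range
    except (AttributeError, ValueError):
        # Not a string, not exactly two parts, or non-numeric bounds
        return np.nan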

gold_genre_df['Estimated_Owners'] = gold_genre_df['Estimated owners'].apply(estimate_owners_range)
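
The same midpoint can also be computed without a row-wise `apply`. This is a sketch of a vectorized alternative on hypothetical values; unlike the function above, it assumes the bounds contain no thousands separators.

```python
import numpy as np
import pandas as pd

# Hypothetical range strings in the 'low - high' format, plus one missing value
owners = pd.Series(['0 - 20000', '20000 - 50000', np.nan])

# Split into two columns (low, high) and convert both to float
bounds = owners.str.split(' - ', expand=True).astype(float)

# The row-wise mean of the two bounds is the midpoint; missing rows stay NaN
midpoints = bounds.mean(axis=1)
print(midpoints.tolist())
```

For example, '0 - 20000' becomes 10000.0.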


In [6]:
gold_genre_df['Price'] = pd.to_numeric(gold_genre_df['Price'], errors='coerce').fillna(0)
gold_genre_df['Estimated_Owners'] = gold_genre_df['Estimated_Owners'].fillna(0)

gold_genre_df['Total_Revenue'] = gold_genre_df['Estimated_Owners'] * gold_genre_df['Price']


In [7]:
# Group by 'Genres' and calculate metrics
genre_metrics = gold_genre_df.groupby('Genres').agg(
    Average_Playtime_Forever=('Average playtime forever', 'mean'),
    Average_Positive_Review_Rate=('Positive_Review_Rate', 'mean'),
    Total_Estimated_Players=('Estimated_Owners', 'sum'),
    Total_Revenue=('Total_Revenue', 'sum'),
    Number_of_Games=('AppID', 'count')
).reset_index()

# Round only columns that exist in genre_metrics
genre_metrics = genre_metrics.round({
    'Average_Positive_Review_Rate': 4,
    'Total_Estimated_Players': 0,
    'Total_Revenue': 2
})

# Filter out genres with fewer than 10 games, an average playtime of 0, or 0 total estimated players
genre_metrics = genre_metrics[(genre_metrics['Number_of_Games'] >= 10) & (genre_metrics['Average_Playtime_Forever'] > 0) & (genre_metrics['Total_Estimated_Players'] > 0)]

genre_metrics = genre_metrics.sort_values('Total_Revenue', ascending=False)
genre_metrics.head(10)

|    | Genres | Average_Playtime_Forever | Average_Positive_Review_Rate | Total_Estimated_Players | Total_Revenue | Number_of_Games |
|----|--------|--------------------------|------------------------------|-------------------------|---------------|-----------------|
| 2  | action | 89.770182 | 0.5693 | 4749815000.0 | 63895630000.0 | 38004 |
| 3  | adventure | 84.13409 | 0.5624 | 2765970000.0 | 41376170000.0 | 35864 |
| 21 | rpg | 150.509576 | 0.5387 | 1877345000.0 | 35245060000.0 | 16657 |
| 15 | indie | 59.240322 | 0.5677 | 3273505000.0 | 32416130000.0 | 64501 |
| 27 | strategy | 138.93902 | 0.556 | 1449030000.0 | 19285880000.0 | 17760 |
| 24 | simulation | 130.053794 | 0.5177 | 1328075000.0 | 18024610000.0 | 18608 |
| 6  | casual | 47.189351 | 0.5205 | 1356245000.0 | 8449390000.0 | 39308 |
| 16 | massively multiplayer | 286.138315 | 0.4346 | 921485000.0 | 7725984000.0 | 2422 |
| 9  | early access | 64.999421 | 0.523 | 529805000.0 | 7590049000.0 | 12097 |
| 26 | sports | 135.245425 | 0.5361 | 297595000.0 | 2977051000.0 | 4262 |
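
When reading these totals, keep in mind that the genre column was exploded, so a game with several genres contributes its full revenue to each of them. The sketch below shows the effect on hypothetical numbers.

```python
import pandas as pd

# Hypothetical games: the first belongs to two genres
toy_df = pd.DataFrame({
    'AppID': [1, 2],
    'Genres': [['action', 'indie'], ['rpg']],
    'Total_Revenue': [100.0, 50.0],
})

# After exploding, game 1 is counted under both 'action' and 'indie'
per_genre = toy_df.explode('Genres').groupby('Genres')['Total_Revenue'].sum()

sum_over_genres = per_genre.sum()               # counts game 1 twice
sum_over_games = toy_df['Total_Revenue'].sum()  # counts every game once
print(sum_over_genres, sum_over_games)
```

The per-genre figures are therefore comparable with each other, but they should not be added up into a grand total.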


In [8]:
# Save the gold DataFrame to a CSV file
genre_metrics.to_csv('gold/games_genre_metrics.csv', index=False)

## Silver -> Gold: Total Revenue by Year Released

In [None]:
import numpy as np  # needed by estimate_owners_range below
import pandas as pd
from pprint import pprint
try:
    revenue_df = games_silver_df.copy()
    print('revenue_df copied from games_silver_df')
except NameError:
    revenue_df = pd.read_csv('silver/games.csv', index_col=False)
    print('revenue_df loaded from silver/games.csv')

In [None]:
# Convert 'Release date' to datetime
revenue_df['Release date'] = pd.to_datetime(revenue_df['Release date'], errors='coerce')

revenue_df['Release_Year'] = revenue_df['Release date'].dt.year
revenue_df['Release_Month'] = revenue_df['Release date'].dt.month

revenue_df = revenue_df.dropna(subset=['Release_Year', 'Release_Month']) # Most likely not needed
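
A minimal sketch of how the coercion behaves, on hypothetical values. The explicit format string is an assumption based on the sample value 'Oct 21, 2008' seen earlier; the cell above lets Pandas infer the format instead.

```python
import pandas as pd

# Hypothetical raw values: one valid date and one unparseable entry
raw_dates = pd.Series(['Oct 21, 2008', 'not a date'])

# Assumed format: abbreviated month, day, four-digit year
parsed = pd.to_datetime(raw_dates, format='%b %d, %Y', errors='coerce')

# Entries that do not match become NaT, so their year is NaN
years = parsed.dt.year
print(years.tolist())
```

With an explicit format like this, any row written differently (for example an ISO-style date) is also coerced to NaT and would be removed by the `dropna` step, so that step can be doing real work.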

In [None]:
def estimate_owners_range(owners_range):
    if pd.isnull(owners_range):
        return np.nan
    try:
        low, high = owners_range.split(' - ')
        low = int(low.replace(',', '').strip())
        high = int(high.replace(',', '').strip())
        return (low + high) / 2 
    except (AttributeError, ValueError):
        return np.nan

revenue_df['Estimated_Owners'] = revenue_df['Estimated owners'].apply(estimate_owners_range)
revenue_df['Estimated_Revenue'] = revenue_df['Estimated_Owners'] * revenue_df['Price']

In [None]:
# Aggregate per release year
yearly_revenue = revenue_df.groupby(['Release_Year']).agg(
    Total_Estimated_Revenue=('Estimated_Revenue', 'sum'),
    Total_Estimated_Owners=('Estimated_Owners', 'sum'),
    Number_of_Games=('AppID', 'count')
).reset_index()

# Drop years with 0 total estimated owners
yearly_revenue = yearly_revenue[yearly_revenue['Total_Estimated_Owners'] > 0]

yearly_revenue


In [None]:
# Save the gold DataFrame to a CSV file
yearly_revenue.to_csv('gold/games_yearly_revenue.csv', index=False)