Note: Most code is copied from: https://www.kaggle.com/code/terencicp/steam-games-data-transformation

In [None]:
import pandas as pd
import json
import numpy as np
from datetime import datetime

Load the JSON version of the games dataset

In [None]:
first_file = 'Data/games.json'
with open(first_file, 'r', encoding="utf8") as file:
    json_data = json.load(file)

Drop unneeded variables

In [None]:
dropped = [
    'packages', 'screenshots', 'movies', 'score_rank', 'header_image',
    'reviews', 'website', 'support_url', 'notes', 'support_email',
    'median_playtime_2weeks', 'required_age',
    'metacritic_url', 'detailed_description', 'about_the_game', 
    'average_playtime_2weeks'
]

In [None]:
# Process each game's information and store in a list
games = [{
    **{k: v for k, v in game_info.items() if k not in dropped},
    'tags': list(tags.keys()) if isinstance((tags := game_info.get('tags', {})), dict) else [],
    'tag_frequencies': list(tags.values()) if isinstance(tags, dict) else [],
    'app_id': app_id
} for app_id, game_info in json_data.items()]

# Create a DataFrame from the processed list
df = pd.DataFrame(games)
df
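
The comprehension above is dense, so here is a minimal sketch of the same logic as an explicit loop. It runs on a made-up two-game sample: `sample_json`, `sample_dropped` and the game entries are hypothetical and only mimic the structure assumed for `games.json`. It also assumes that a game without tags stores something other than a dict, such as an empty list.

```python
import pandas as pd

# Hypothetical sample mimicking the assumed structure of games.json
sample_json = {
    "10": {"name": "Game A", "price": 9.99, "website": "x",
           "tags": {"FPS": 120, "Action": 80}},
    "20": {"name": "Game B", "price": 0.0, "website": "y", "tags": []},
}
sample_dropped = ["website"]

sample_games = []
for app_id, game_info in sample_json.items():
    tags = game_info.get("tags", {})
    has_tags = isinstance(tags, dict)
    # Keep every field that is not in the drop list
    record = {k: v for k, v in game_info.items() if k not in sample_dropped}
    # Split the tag dict into two parallel lists: names and vote counts
    record["tags"] = list(tags.keys()) if has_tags else []
    record["tag_frequencies"] = list(tags.values()) if has_tags else []
    # JSON object keys are always strings, so app_id is a string here
    record["app_id"] = app_id
    sample_games.append(record)

sample_df = pd.DataFrame(sample_games)
print(sample_df)
```

Note that `app_id` ends up as a string column, which matters for any later comparison against an ID.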

## Data Cleaning

Remove games with no sales:

In [None]:
count = (df['estimated_owners'] == "0 - 0").sum()
print("Number of games with estimated owners '0 - 0':", count)

In [None]:
df[df['estimated_owners'] == "0 - 0"]

Some games just seem to be developer tests. Let's remove them. We'll also remove games with no reviews or no categories:

In [None]:
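# Filter games without sales, reviews or categories.
# .copy() makes df2 an independent DataFrame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
df2 = df[~((df['estimated_owners'] == "0 - 0") | (df['positive'] + df['negative'] == 0) | (df['categories'].str.len() == 0))].copy()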

Let's check the shape of the DataFrame again. It seems we removed more than 20,000 irrelevant games:

In [None]:
df2.shape
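
To make the three filter conditions easier to check, here is a small sketch on invented data. The `toy` DataFrame and its values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows: only game "4" has sales, reviews and categories
toy = pd.DataFrame({
    "app_id": ["1", "2", "3", "4"],
    "estimated_owners": ["0 - 0", "0 - 20000", "20000 - 50000", "0 - 20000"],
    "positive": [5, 0, 10, 3],
    "negative": [1, 0, 2, 0],
    "categories": [["Single-player"], ["Single-player"], [], ["Multi-player"]],
})

# One named mask per exclusion rule
no_sales = toy["estimated_owners"] == "0 - 0"
no_reviews = (toy["positive"] + toy["negative"]) == 0
no_categories = toy["categories"].str.len() == 0

# Keep the rows that match none of the rules
kept = toy[~(no_sales | no_reviews | no_categories)].copy()
print(kept["app_id"].tolist())
```

Naming each mask also makes it easy to count how many games each rule removes, for example with `no_reviews.sum()`.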

I'll also split the 'estimated_owners' column into two separate variables, so that we can use them for aggregation in Tableau:

In [None]:
# Split estimated_owners into two: min_owners and max_owners
df2[['min_owners', 'max_owners']] = df2['estimated_owners'].str.split(' - ', expand=True)

# Remove the original field
df2 = df2.drop('estimated_owners', axis=1)

In [None]:
df2[['min_owners', 'max_owners']]
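
The split produces string columns. That is fine for a CSV export, but aggregating inside pandas needs numbers. A minimal sketch of the conversion, on a hypothetical `owners` DataFrame and assuming every range has the form `"<min> - <max>"` with plain integers:

```python
import pandas as pd

# Hypothetical ranges in the same format as the estimated_owners column
owners = pd.DataFrame({"estimated_owners": ["0 - 20000", "20000 - 50000"]})

# Split on the separator and cast both parts to integers
bounds = owners["estimated_owners"].str.split(" - ", expand=True).astype(int)
owners["min_owners"] = bounds[0]
owners["max_owners"] = bounds[1]

# Numeric columns can now be aggregated directly
print(owners["max_owners"].sum())
```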

Let's have a look at the distribution of prices:

In [None]:
# Box plot of price
df2.boxplot(column=['price'])

In [None]:
# Games priced above $200
df2[df2['price'] > 200]

We can see that the game priced at $999 is basically a cash-grab without any actual sales, and being an extreme outlier it can distort our analysis. Let's remove it:

In [None]:
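# Delete the game with id 26936.
# app_id comes from the JSON keys, which are strings, so compare as a string;
# comparing against the integer 26936 would match nothing.
df2 = df2[df2['app_id'].astype(str) != '26936']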

In [None]:
df2

## Normalizing data

The DataFrame contains fields such as 'categories' or 'tags' that consist of lists of values. To improve the performance of the visualization we'll build in Tableau, we must convert these fields into separate tables, which will be linked to the main table through the 'app_id' column.

In [None]:
# Create a separate DataFrame for each list-type column
cols = ["supported_languages", "full_audio_languages", "categories", "genres"]
for col in cols:
    new_df = df2.explode(col)[['app_id', col]]
    new_df.to_csv(f'{col}.csv', index=False)

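# Tags and their frequencies are parallel lists, so exploding each one
# separately yields rows in the same order and they can be aligned by position
df_tags = df2.explode('tags')[['app_id', 'tags']]
df_frequencies = df2.explode('tag_frequencies')['tag_frequencies']
df_tags['tag_frequencies'] = df_frequencies.values
# Save the tags table too (file name follows the pattern used in the loop above)
df_tags.to_csv('tags.csv', index=False)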

# Remove the list columns from the main DataFrame
columns_to_remove = cols + ['tags', 'tag_frequencies']
df_imploded = df2.drop(columns=columns_to_remove)

In [None]:
df_imploded
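
As an illustration of the normalization step, here is a sketch on a hypothetical two-game table. The `toy_games` data is invented. The sketch also uses multi-column `explode`, which is a swapped-in alternative to the two separate explodes above; it requires pandas 1.3 or later and assumes both lists in a row have the same length.

```python
import pandas as pd

# Hypothetical main table with list-type columns
toy_games = pd.DataFrame({
    "app_id": ["10", "20"],
    "name": ["Game A", "Game B"],
    "genres": [["Action", "Indie"], ["RPG"]],
    "tags": [["FPS", "Action"], ["RPG"]],
    "tag_frequencies": [[120, 80], [45]],
})

# One row per (app_id, genre) pair
genres_table = toy_games.explode("genres")[["app_id", "genres"]]

# Explode both parallel lists in a single call so they stay aligned
tags_table = toy_games.explode(["tags", "tag_frequencies"])[
    ["app_id", "tags", "tag_frequencies"]
]

# Main table without the list columns; app_id links it to the others
main_table = toy_games.drop(columns=["genres", "tags", "tag_frequencies"])

# Joining back on app_id recovers one row per game and genre
joined = main_table.merge(genres_table, on="app_id")
print(joined)
```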

## Save results as CSV

Dataset that doesn't include list-like columns

In [None]:
df_imploded.to_csv('cleaned_games.csv', index=False)