# Imports

In [1]:
# System
import os
from pathlib import Path
from dotenv import load_dotenv
import importlib
import sys

# Data Management
from typing import Optional, Dict, List, Any
from pydantic import BaseModel, Field
import time
import random
import json
from bson import json_util
import bson
import re
import difflib
from bs4 import BeautifulSoup
import unicodedata

# Data Science
import numpy as np
import pandas as pd
import polars as pl
import missingno as msno
import flatten_json

# API Interactions
import requests
from tqdm.notebook import tqdm


# Ignore warnings
from bs4 import MarkupResemblesLocatorWarning
import warnings

warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

In [2]:
# At the top of your notebook
import importlib

# Add project root to path
sys.path.append(str(Path.cwd().parent))

# Import your modules
from src import steam_api_manager
from src import mongo_manager

# Configuration

In [3]:
# Load Environment Variables
load_dotenv()

# When you need to reload after changes
importlib.reload(steam_api_manager)
importlib.reload(mongo_manager)

# Get fresh instances of your classes
steam_api = steam_api_manager.SteamAPIManager()
mongo_manager = mongo_manager.MongoManager()

# EDA for Silver Layer

## Read Data from MongoDB

In [None]:
# For exploring smaller sizes

# Fetch documents
cursor = mongo_manager.database.details.find()    # append `.limit(10)` to explore a smaller sample

# Convert BSON to JSON-compatible (handles ObjectId, etc.)
parsed_docs = json.loads(json_util.dumps(list(cursor)))

df = pd.json_normalize(parsed_docs)
df

## Check for missing data

In [None]:
msno.bar(df, sort='ascending')

There is a column called `alternate_appid`, which could be a problem when normalizing because `appid` is the primary key.

However, as seen below, the affected rows are all of the `advertising` type, and most of the `alternate_appid` values return unsuccessful API responses.

In [None]:
df.loc[df['alternate_appid'].notna(), ['appid', 'alternate_appid', 'name', 'type']]

## Super Category: Type

In [None]:
df['type'].value_counts(dropna=False)

Sample from each category

In [None]:
df_sample_type = df.groupby('type', group_keys=False).apply(lambda x: x.sample(n=1, random_state=42))
df_sample_type

Using:

- `game`: In Steam, a "game type" or "genre" refers to **a classification system that describes the core gameplay mechanics of a game**. These classifications help users browse and discover games based on their preferred style of play. Steam uses a range of tags and categories to describe games, going beyond simple genres like "Action" or "RPG" to include subgenres and additional features.
- `dlc`: DLC stands for downloadable content. It refers to **additional digital content that players can download and add to a video game after its initial release**. Game developers use DLC to expand and enhance the gaming experience by providing new storylines, challenges, characters, weapons, or cosmetic items.
- `demo`: Demos can range in size and scope, but generally are a small playable portion of your game that show some of the core mechanics and leave the player excited for more.
- `episode`: In Steam, an "episode type" typically refers to **a game that is released in individual chapters or installments**, often referred to as episodes. These games are sold and downloaded separately, usually representing a single segment of the overall story or gameplay experience.
- `series`: In Steam, a "series" refers to **a type of game grouping or collection, allowing users to organize their library by thematic categories like "series" or "franchises"**. These groupings can be used to easily find all the games within a particular series, like the "Grand Theft Auto" series.
- `music`: In the context of Steam, "music type" refers to **the category or genre of music available for purchase as standalone products or as DLC for games**. Steam allows users to buy and download music albums, individual songs, or soundtracks that may accompany games or be available separately as standalone music apps.
- `mod`: In the context of Steam, a "mod" refers to **a modification or alteration made to a video game by a user or fan, typically to change how it looks, plays, or behaves**. These modifications are often shared and distributed through the [**Steam Workshop**](https://store.steampowered.com/about/communitymods/) platform.

Not using:

- `video`: In the context of Steam, "video type" refers to **the format in which a video file is stored, which can be used for game recording, streaming, or other features**. Steam typically uses the m4s format for background recordings and clips, but users can export these recordings as MP4 files. For streaming, Steam can use various video codecs like VP8, VP9, or AV1, and recommends using VP9 or AV1 for optimal quality.
- `advertising`: There are only three, and they all redirect to the same appid, 402590, which comes from the column `alternate_appid`. (Update: they are not useful, because the website returns nothing for the `alternate_appid` values and the data is not important.)
- `hardware`: appid 1696780 is the store page for the Steam Deck dock (the Steam Deck being a handheld comparable to a Nintendo Switch or PSP).

## Filter by Type

In [None]:
df_filtered = df.loc[
    (df['type'].isin(['game', 'dlc', 'demo', 'series', 'episode', 'music', 'mod']))
].copy()
df_filtered

In [None]:
msno.bar(df_filtered, sort='ascending')

## EDA for 100% available columns

Looks like the following are common for all rows:

`appid`, `required_age`, `release_date.date`, `support_info.url`, `platforms.linux`, `platforms.mac`, `platforms.windows`, `support_info.email`, `content_descriptors.ids`, `release_date.coming_soon`, `type`, `steam_appid`, `short_description`, `_id.$oid`, `background`, `detailed_description`, `package_groups`, `background_raw`, `capsule_image`, `capsule_imagev5`, `about_the_game`, `header_image`, `is_free`, `name`

In [None]:
df_filtered

## Descriptions about the game

check for differences between `about_the_game` and `detailed_description`

In [None]:
import pandas as pd
import difflib
from bs4 import BeautifulSoup


def remove_html_tags(text):
    soup = BeautifulSoup(text or "", "html.parser")
    return soup.get_text(separator=" ").strip()

# Define your function
def get_only_in_a_vs_b(a, b):
    a_lines = (a or "").splitlines()
    b_lines = (b or "").splitlines()
    diff = list(difflib.ndiff(a_lines, b_lines))

    only_in_a = "\n".join([line[2:] for line in diff if line.startswith("- ")])
    only_in_b = "\n".join([line[2:] for line in diff if line.startswith("+ ")])
    has_diff = bool(only_in_a or only_in_b)
    
    return pd.Series([has_diff, only_in_a, only_in_b])

df_diff_check = df_filtered.loc[:,['appid', 'about_the_game', 'detailed_description']].copy()
df_diff_check['about_the_game'] = df_diff_check['about_the_game'].apply(remove_html_tags)
df_diff_check['detailed_description'] = df_diff_check['detailed_description'].apply(remove_html_tags)

# Apply to your specific DataFrame
df_diff_check[['has_diff', 'only_in_about', 'only_in_detailed']] = df_diff_check.apply(
    lambda row: get_only_in_a_vs_b(row["about_the_game"], row["detailed_description"]),
    axis=1
)

In [None]:
df_diff_check.loc[df_diff_check['has_diff'] == True]

## Required Age & Release Date Filtering

check unique values for required_age and sample for cleaning

In [None]:
df_filtered['required_age'].unique()

sample different age values

In [None]:
df_sample_age = df_filtered.groupby('required_age', group_keys=False).apply(lambda x: x.sample(n=1, random_state=42))
df_sample_age.loc[:, ['appid', 'name', 'short_description', 'required_age']]

In [None]:
df_filtered.loc[df_filtered['required_age'] == '6', ['appid', 'name', 'short_description', 'required_age']]

check for the `ratings` nests to see if the esrb and other age ratings include useful information

In [None]:
cols_ratings = [col for col in df_filtered.columns if 'ratings' in col]
df_ratings = df_filtered[['appid', 'required_age', *cols_ratings]]
df_ratings.loc[df_ratings['ratings.esrb.rating'].notna()]

Many of the `ratings.*.descriptors` columns mention violence, sex, drugs, etc., but on inspection most of those games do not have that kind of content when columns like `ratings.*.required_age` are low. For that reason, of the columns that start with `ratings`, only the ones that end with `required_age` will be taken into consideration.

For example, `appid` 3586970 shows Adult Only content but its `required_age` is set to 0. It is also not a published game yet, so a filter on the `release_date` columns may be needed too.

In [None]:
print(df_filtered['release_date.coming_soon'].unique())
print(df_filtered['release_date.date'].unique())

In [None]:
release_date_values = list(
    df_filtered['release_date.date']
    .drop_duplicates(keep='first')
    .sort_values(ascending=False)
    .reset_index(drop=True)
)
pd.DataFrame(release_date_values)

Try to make a boolean column that flags whether each value is ISO 8601 compatible (in practice, whether it parses with the expected date format), and inspect the values that are not, because some dates look like:

- 'November 2027', which could be truncated to Nov 01, 2027
- just text, like 'Coming Soon', 'maybe', etc.
- other formats that still carry date info, like '9/mai./2017', '9. Sep. 2019', '9 maj, 2016', '9 listopada 2015'
- quarters, including dates that have not arrived yet, like 'Q4 2030'

Try to clean the dates so as to capture as much data as possible.

Either use ISO 8601 or a custom date format like '%d %b, %Y' to convert to YYYY-MM-DD.
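A minimal sketch of how several formats could be tried in order, using only the standard library. The function name `parse_release_date` and the format list are assumptions for illustration; only '%d %b, %Y' comes from this notebook. Localized month names ('maj', 'listopada') and quarters ('Q4 2030') are not handled here and simply come back as `None`.

```python
from datetime import date, datetime
from typing import Optional

# Candidate formats, tried in order. Only the first one is taken from this
# notebook; the others are assumptions to extend after inspecting the data.
CANDIDATE_FORMATS = ("%d %b, %Y", "%b %d, %Y", "%B %Y", "%b %Y")

def parse_release_date(raw: object) -> Optional[date]:
    """Return the first successful parse as a date, or None if nothing matches."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            # Month-only formats default to day 1 of that month
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

It could be applied with something like `df_filtered['release_date.date'].map(parse_release_date)`; note that `%b` and `%B` depend on the active locale, so this assumes English month names.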

In [None]:
df_clean_dates = df_filtered.copy()

df_clean_dates['is_iso8601'] = pd.to_datetime(df_clean_dates['release_date.date'], format='%d %b, %Y', errors='coerce').notna()
df_clean_dates['release_date'] = pd.to_datetime(df_clean_dates['release_date.date'], format='%d %b, %Y', errors='coerce').dt.date
df_clean_dates = df_clean_dates.loc[:, ['appid', 'name', 'type', 'release_date.date', 'release_date', 'release_date.coming_soon', 'is_iso8601']]

df_clean_dates.loc[df_clean_dates['is_iso8601'] == False]

Another filter will be added for `release_date.coming_soon`, because the project analyzes games that have already been released and how they perform.

In [None]:
df_clean_dates.loc[
    (df_clean_dates['is_iso8601'] == False) & 
    (df_clean_dates['release_date.coming_soon'] == False)
]

There are a lot of apps that have been released but whose `release_date.date` is an empty string.  
When checking the website, though, these games do appear to have a release date.  
One option is to simply exclude them, because they are a small proportion of the data, but for a precise analysis, web scraping the release dates for these apps would be the better option.

One thing is certain: another filter will be
`release_date.coming_soon` == False

In [None]:
df_filtered = df_filtered.loc[df_filtered['release_date.coming_soon'] == False]

now go back to check for the `ratings` nests to see if the esrb and other age ratings include useful information

this time the columns will only be included if it has `*.required_age` and not all `ratings` columns

In [None]:
cols_required_age = [col for col in df_filtered.columns if 'required_age' in col]
df_filtered.loc[:, ['appid', 'name', 'release_date.date', *cols_required_age]]

It can be seen from `appid` 2166370 that even when `release_date.date` is ISO 8601 compatible and `required_age` is 0, the app can still be for adults, so the `ratings.*.required_age` columns will be very useful for catching adult games that have `required_age` set to 0.

In [None]:
df_filtered.loc[df_filtered['required_age'] == 0, ['appid', 'name', 'release_date.date', *cols_required_age]]

Try replacing `required_age` with the maximum of the `ratings.*.required_age` columns when `required_age` is 0. However, there are cases like appid 2165210, where `ratings.steam_germany.required_age` is 2147483647, so the maximum and minimum values should be checked, along with `describe`.

In [None]:
df_filtered.loc[:, ['appid', 'name', *cols_required_age]].describe()

change all the `required_age` columns to integers first

In [None]:
df_required_age = df_filtered.loc[:, ['appid', 'name', *cols_required_age]]

values_in_ages_columns = []

for col in df_required_age.columns[2:]:
    values_in_ages_columns.extend(df_required_age[col].unique())

set(values_in_ages_columns)

In [None]:
df_required_age[df_required_age.isin([
    '35',
    'Project Winter Item Store', 
    "They're mp3s", 
    "javascript:ToggleCheckbox('checkbox_app_game_ratings_csrr_use_age_gate_');18",
    'Any sound card',
    'M',
    'Free',
    '１８',
    '13+',
    "javascript:ToggleCheckbox('checkbox_app_game_ratings_esrb_use_age_gate_');",
    "javascript:ToggleCheckbox('checkbox_app_game_ratings_oflc_use_age_gate_')12",
    "17javascript:ToggleCheckbox('checkbox_app_game_ratings_kgrb_use_age_gate_');",
    '99'
    ]).any(axis=1)]

After checking all age values, it can be seen that any value that is not a number, or that is an empty string, can be converted to 0. The strings that contain `javascript:ToggleCheckbox` include some digits, like:
- "javascript:ToggleCheckbox('checkbox_app_game_ratings_oflc_use_age_gate_')12"
- "17javascript:ToggleCheckbox('checkbox_app_game_ratings_kgrb_use_age_gate_');"
- "javascript:ToggleCheckbox('checkbox_app_game_ratings_csrr_use_age_gate_');18"

However, after checking the apps, these digits do not seem to carry any meaning.

There are also some very large values, which carry no meaning either, so the valid range will be 0~21.

the steps to clean up the ages are:
1. normalize with unicodedata (appid=1587290)
2. remove '+' (appid=1328670)
3. use `pd.to_numeric` to convert to numbers
4. convert nan to 0
5. convert to integers
6. remove all values less than 0 or greater than 21
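
Step 6, together with the earlier idea of falling back to the highest `ratings.*.required_age` when `required_age` is 0, could be sketched as follows. This is only a sketch under assumptions: the helper names and `MAX_VALID_AGE` are hypothetical, "remove" is interpreted as "reset to 0", and the columns are assumed to be integers already (steps 1 to 5).

```python
import pandas as pd

MAX_VALID_AGE = 21  # upper bound of the 0~21 range chosen above

def clip_invalid_ages(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Reset ages outside 0..MAX_VALID_AGE to 0 (assumed to mean 'no rating')."""
    out = df.copy()
    for col in cols:
        # between() is inclusive on both ends by default
        out[col] = out[col].where(out[col].between(0, MAX_VALID_AGE), 0)
    return out

def fill_required_age_from_ratings(
    df: pd.DataFrame, cols: list, target: str = "required_age"
) -> pd.DataFrame:
    """Where `target` is 0, use the row-wise maximum over all age columns."""
    out = df.copy()
    row_max = out[cols].max(axis=1)
    out[target] = out[target].where(out[target] != 0, row_max)
    return out
```

Clipping first matters: otherwise a sentinel such as 2147483647 would win the row-wise maximum.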

In [None]:
import unicodedata

def normalize_strings(value: Any) -> Any:
    if isinstance(value, str):
        return unicodedata.normalize('NFKC', value).strip()
    return value

def normalize_strings_df(df: pd.DataFrame) -> pd.DataFrame:
    return df.map(normalize_strings)

def remove_plus_sign(value: Any) -> Any:
    if isinstance(value, str):
        return value.replace('+', '')
    return value

def remove_plus_sign_df(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    for col in cols:
        df[col] = df[col].map(remove_plus_sign)
    return df

def convert_age_to_int_df(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    for col in cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df[col] = df[col].fillna(0).astype('int64')
    return df

df_cleaned_required_ages = (
    df_filtered
    .pipe(normalize_strings_df)
    .pipe(remove_plus_sign_df, cols=[col for col in df_filtered.columns if 'required_age' in col])
    .pipe(convert_age_to_int_df, cols=[col for col in df_filtered.columns if 'required_age' in col])
)

df_cleaned_required_ages

In [None]:
df_check_required_ages = df_cleaned_required_ages.loc[:, ['appid', 'name', *cols_required_age]]

values_in_ages_columns = []

for col in df_check_required_ages.columns[2:]:
    values_in_ages_columns.extend(df_check_required_ages[col].unique())

set(values_in_ages_columns)

In [None]:
df_cleaned_required_ages

In [None]:
msno.bar(df_cleaned_required_ages, sort='ascending')

There are some games whose highest age is 18 but that do not actually seem to be for adults. After looking at the reviews, they may be rated that way because of in-game risks, so classifying each game by the highest age across the `required_age` columns is good enough.

In [None]:
df_filtered.loc[df_filtered['appid'] == 8500]

Also, some apps just happen to have `required_age` 0 despite notes about violence or sexual content. This is because they are a demo or DLC, while the base game does have the correct age, possibly in a `ratings.*.required_age` column.

In [None]:
df_filtered.loc[df_filtered['appid'] == 2126680]

# EDA for Silver Layer 2

## Columns to drop

Before continuing to drill down, this is a good moment to drop columns that are not necessary.
After analyzing some columns, there is now some understanding of which columns are important.

Columns to remove so far:
- `_id.$oid`
- `steam_appid`
- `alternate_appid`
- `capsule_imagev5`
- `capsule_image`
- `metacritic.score`
- `metacritic.url`
- `drm_notice`
- every `ratings.*` column that is not a `required_age` column


In [None]:
df_dropped_columns = df.drop(columns=[
    '_id.$oid',
    'steam_appid',
    'alternate_appid',
    'capsule_imagev5',
    'capsule_image',
    'metacritic.score',
    'metacritic.url',
    'drm_notice',
    *[col for col in df.columns if 'ratings.' in col and 'required_age' not in col]
])
df_filtered = df_dropped_columns.loc[
    (df_dropped_columns['type'].isin(['game', 'dlc', 'demo', 'series', 'episode', 'music', 'mod'])) &
    (df_dropped_columns['release_date.coming_soon'] == False)
]
df_filtered

## Fullgame columns

There are columns that start with `fullgame` and hold another appid and name, so these may reference the base game when the type of the app is not `game` (dlc, demo, etc.).

In [None]:
df_filtered.loc[
    (df_filtered['type'] == 'game') &
    ((df_filtered['fullgame.name'].notna()) | (df_filtered['fullgame.appid'].notna()))
]

Many rows of types other than `game` have values for `fullgame.name` or `fullgame.appid`, but the code above shows that when the type is `game` these columns are null, which supports the point.

## Prices and Recurring Prices

After looking at some values in the `price_overview` columns (which have many null values), it felt a little odd to include them; the recurring ones look more like subscription types.

In [None]:
print(f"{df_filtered['price_overview.recurring_sub'].unique() = }")
print(f"{df_filtered['price_overview.recurring_sub_desc'].unique() = }")

In [None]:
df_filtered.loc[df_filtered['price_overview.recurring_sub'].notnull()]

After checking about five of the apps with `price_overview.recurring_sub*` values, only one (235883) seems to have a subscription, so it is safe to remove the subscription-related columns; even with a subscription, the prices are the same.

There are other price-related columns, so collect them and check which ones should be removed or merged.

In [None]:
cols_prices = [
    'is_free', 
    # use `col` as the loop variable so it does not read like the DataFrame `df`
    *[col for col in df_filtered.columns if 'price_overview.' in col]
]

df_filtered.loc[:, ['appid', 'name', *cols_prices]]

the `is_free` column may be helpful for filtering in the dashboard.

Any price column that has to do with discounts will be removed, because this project is nowhere near real-time and does not gather historical data, only the most recent data. The columns that will be removed are:  
`price_overview.final`, `price_overview.discount`, `price_overview.final_formatted`

The columns that are left are `price_overview.currency`, `price_overview.initial`, and `price_overview.initial_formatted`. However, the Korean currency seems to have an error in which there are two extra zeros at the end, so in the future the location of the account should maybe be changed to the US or EU. It turns out this is not only KRW: all currencies need the decimal point repositioned, i.e. divided by 100 (move the decimal two places to the left).

In [None]:
df_filtered.loc[df_filtered['price_overview.currency'] != 'KRW', ['appid', 'name', *cols_prices]]

In [None]:
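def extract_numeric_amounts(value: Any) -> float:
    # Missing values (None, NaN) stay missing
    if pd.isna(value):
        return np.nan
    # Keep only digits and separators, e.g. '12,99€' -> '12,99'
    cleaned = re.sub(r"[^\d,\.]", "", str(value))
    # Treat a comma as the decimal separator
    cleaned = cleaned.replace(',', '.')
    try:
        return float(cleaned)
    except ValueError:
        # Empty strings or amounts with several separators (e.g. '1.299,99')
        return np.nan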

df_prices = df_filtered.loc[
    (df_filtered['price_overview.currency'] != 'KRW') &
    (df_filtered['is_free'] == False)
    ,
    ['appid', 'name', *cols_prices]
]
df_prices['price_overview.initial_formatter'] = df_prices['price_overview.initial_formatted'].apply(extract_numeric_amounts)
df_prices.loc[
    :, 
    [
        'appid',
        'name',
        'price_overview.currency',
        'price_overview.initial', 
        'price_overview.initial_formatted', 
        'price_overview.initial_formatter'
    ]
]


Even though `is_free` is false, there are some games where the price is not shown, like 2166440, and there are also games like 2166240 where `release_date.coming_soon` is false but the price is not shown because the game is in fact coming soon.

There are also games like 2166270 where neither the name nor the release date has been updated (it seems a prototype was released earlier and the game is being developed again), so the price is not shown.

There are also games like 2154690 that do not show a price and whose name no longer contains 'Playtest' (similar apps used to have 'playtest' or 'beta' in their names but no longer do), so it would be worth looking into web scraping these prices, although they come with the currency symbol and have to be cleaned.

All in all, many values in the column `price_overview.initial_formatted` are null, so a combination of `price_overview.currency` and `price_overview.initial` / 100 will be used for the pricing of the game.
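
The conversion described above could be sketched like this. The helper name `add_price_column` and the output column name `price` are hypothetical; the division by 100 is the rule stated in the text, and `price_overview.currency` is left untouched next to it.

```python
import pandas as pd

def add_price_column(df: pd.DataFrame) -> pd.DataFrame:
    """Add a `price` column holding `price_overview.initial` divided by 100."""
    out = df.copy()
    # Coerce to numbers first so stray strings or None become NaN
    minor_units = pd.to_numeric(out["price_overview.initial"], errors="coerce")
    out["price"] = minor_units / 100
    return out
```

Rows without a price stay NaN, so they can still be identified later if the missing prices are scraped from the website.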

In [None]:
df_prices.loc[
    (df_prices['price_overview.initial'].notna()) &
    (df_prices['price_overview.initial_formatter'].notna())
    , 
    [
        'appid',
        'name',
        'price_overview.currency',
        'price_overview.initial', 
        'price_overview.initial_formatted', 
        'price_overview.initial_formatter'
    ]
]

From the above, it can be seen that dividing `price_overview.initial` by 100 is needed to get the correct price.

## Images

In [None]:
df_filtered

Looking at the images, `capsule_imagev5` and `capsule_image` are the same image as `header_image` but smaller, so these will be removed.

`background` will also be removed, because `background_raw` is the same image but raw, meaning it keeps its colors, while `background` has a Steam-like blue-gray filter.

## Computer Specs (OS & Requirements)

In [None]:
df_filtered.columns

In [None]:
cols_computer_specs = [
    'platforms.windows',
    'pc_requirements',
    'pc_requirements.minimum',
    'pc_requirements.recommended',
    'platforms.mac',
    'mac_requirements',
    'mac_requirements.minimum',
    'mac_requirements.recommended',
    'platforms.linux',
    'linux_requirements',
    'linux_requirements.minimum',
    'linux_requirements.recommended',
]
df_filtered.loc[
    :, 
    [
        'appid',
        'name',
        *cols_computer_specs
    ]
]

It seems that `platforms.windows`, `platforms.mac`, and `platforms.linux` decide whether the game is supported on Windows, Mac, or Linux. Even when the value is false, there are sometimes strings in the columns related to that OS, which on inspection contain only HTML boilerplate and no actual content.

Even when the platform is true, the requirements seem to be empty for the most part, maybe because they are displayed only for apps of `type` game. It feels like there is a strong need to partition the data, do EDA only for `type` game, and then add the chosen non-game types as additional dimensional data, but first check the remaining specs columns.

In [None]:
df_filtered.loc[
    (df_filtered['type'] == 'game') &
    (df_filtered['pc_requirements'].notnull()) &
    (df_filtered['mac_requirements'].notnull()) &
    (df_filtered['linux_requirements'].notnull())
    , 
    [
        'appid',
        'name',
        *cols_computer_specs
    ]
]

The columns `pc_requirements`, `mac_requirements`, and `linux_requirements` are all empty or hold just an empty string, so these will definitely be dropped.

The minimum and recommended columns look like they are mostly filled only for games that are actually published; if the bare requirements columns are not null, the app is likely still in development, because its minimum and recommended values are null.

Also, even when the minimum and recommended values are null, the website does show the minimum and recommended requirements, so these might also be web scraped.

In [None]:
df_filtered.loc[
    (df_filtered['type'] == 'game') &
    (
        # `&` binds tighter than `|`, so the null checks are grouped explicitly
        (df_filtered['pc_requirements'].isnull()) |
        (df_filtered['mac_requirements'].isnull()) |
        (df_filtered['linux_requirements'].isnull())
    )
    , 
    [
        'appid',
        'name',
        *cols_computer_specs
    ]
]

Only the rows where `pc_requirements`, `mac_requirements`, or `linux_requirements` are null apparently have the minimum and recommended requirements, and they do not have 'playtest' or 'beta' in their names. This could be used as a filter, but to play it safe it will remain just an option for now.

In [None]:
df_computer_specs = (
    df_filtered
    .loc[:, ['appid', 'name', *cols_computer_specs]]
)

for col in cols_computer_specs:
    df_computer_specs[col] = df_computer_specs[col].apply(lambda x: remove_html_tags(x) if isinstance(x, str) else x)

df_computer_specs = df_computer_specs.drop(columns=['pc_requirements', 'mac_requirements', 'linux_requirements'])
df_computer_specs

In [None]:
df_computer_specs.loc[df_computer_specs['platforms.mac'] == False]

From games like 2167220 and 2166420, it seems that even when, for example, `platforms.mac` is False, `mac_requirements.minimum` and `mac_requirements.recommended` are not empty. However, after checking the website, these games are not Mac compatible, so the `platforms.*` columns are the ones to use as filters, and the requirements columns should be set to null when `platforms.*` is False.
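
A minimal sketch of that rule, assuming the column naming seen in `cols_computer_specs` (`platforms.<os>` flags and `<prefix>_requirements.minimum` / `.recommended`). The function name and the `PLATFORM_PREFIXES` mapping are hypothetical.

```python
import pandas as pd

# Maps the platform flag suffix to the prefix of its requirements columns
PLATFORM_PREFIXES = {"windows": "pc", "mac": "mac", "linux": "linux"}

def null_unsupported_requirements(df: pd.DataFrame) -> pd.DataFrame:
    """Set requirements to null wherever the matching platform flag is not True."""
    out = df.copy()
    for platform, prefix in PLATFORM_PREFIXES.items():
        flag_col = f"platforms.{platform}"
        if flag_col not in out.columns:
            continue
        # Missing flags compare as not-True, so they count as unsupported
        supported = out[flag_col] == True
        for level in ("minimum", "recommended"):
            col = f"{prefix}_requirements.{level}"
            if col in out.columns:
                out.loc[~supported, col] = None
    return out
```

It could be applied to `df_computer_specs` after the HTML tags have been stripped, so that leftover boilerplate for unsupported platforms does not survive as text.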

## Website and Contacts & Unneeded info

In [None]:
df_filtered

`package_groups` is not needed, because the chosen non-game types can already be traced back to the game they belong to.

`website` will not provide any useful insights for our dashboard. Maybe in the future it could show that games with an official website are less likely to be indie games, but it is unclear whether that tells us much at the moment.

`content_descriptors.ids` is more like internal metadata that does not tell us much, because we do not know what the ids mean (they are probably linked to `content_descriptors.notes`, but the notes are enough if NLP is applied to them).

`support_info.url` and `support_info.email` do not provide any insights for our dashboard either, for the same reasons the `website` column is being dropped.

`legal_notice` will be dropped for the same reasons as `website`.

`packages` will be removed for the same reasons as `content_descriptors.ids`. Searching the website in case it was an appid reference suggests it is internal metadata.

`dlc` and `demos` will be dropped because these can be backtracked with `fullgame.appid` and `fullgame.name` from the non-game types.

`fullgame.name` will be dropped because `fullgame.appid` is enough to refer to the game from non-game types.

`reviews` and `ratings` have too many nulls, and actual reviews and ratings are collected separately, so these will be dropped too.

`ext_user_account_notice` has too many nulls and carries no additional useful information.

`recommendations.total` will be dropped for the same reason as `reviews` and `ratings`: this column has too many nulls, and the popularity of the game can be obtained from other sources.
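
The decisions above could be collected into a single drop step, sketched below. The names `COLS_TO_DROP` and `drop_unneeded_columns` are hypothetical. `ratings` is left out of the list on purpose, since the `ratings.*.required_age` columns are still used for the age cleaning earlier in this notebook.

```python
import pandas as pd

# Column names taken from the notes above
COLS_TO_DROP = [
    "package_groups", "website", "content_descriptors.ids",
    "support_info.url", "support_info.email", "legal_notice", "packages",
    "dlc", "demos", "fullgame.name", "reviews",
    "ext_user_account_notice", "recommendations.total",
]

def drop_unneeded_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the listed columns; errors='ignore' skips any that are absent."""
    return df.drop(columns=COLS_TO_DROP, errors="ignore")
```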

## Unnesting list[dicts]

In [None]:
df_unnesting = df_filtered.copy()
df_unnesting

In [None]:
df_unnesting = df_unnesting.loc[
    :,
    [
        'appid',
        'name',
        'categories',
        'genres',
        'screenshots',
        'movies',
        'achievements.highlighted',
    ]
]
df_unnesting

There are columns that hold a list[dicts] that need to be unnested. These were not unnested during the initial `pd.json_normalize` step because the lists have varying lengths; unnesting them was tried and would produce around 4000 columns, which is too many for EDA.

double click to see the markdown formatting

`categories` (appid: 2167380)
keys to be extracted: 'description'
looks something like: 
[
    {'description': 'Multi-player', 'id': 1},
    {'description': 'PvP', 'id': 49},
    {'description': 'Online PvP', 'id': 36},
    {'description': 'Shared/Split Screen PvP', 'id': 37},
    {'description': 'Shared/Split Screen', 'id': 24},
    {'description': 'Full controller support', 'id': 28},
    {'description': 'Remote Play Together', 'id': 44},
    {'description': 'Family Sharing', 'id': 62}
]
will be unnested to look something like:
[
    'Multi-player', 'PvP', 'Online PvP', 
    'Shared/Split Screen PvP', 'Shared/Split Screen', 
    'Full controller support', 'Remote Play Together', 
    'Family Sharing'
]

`genres` (appid: 2167370)
keys to be extracted: 'description'
looks something like:
[
    {'description': 'Action', 'id': '1'},
    {'description': 'Adventure', 'id': '25'},
    {'description': 'Indie', 'id': '23'},
    {'description': 'RPG', 'id': '3'},
    {'description': 'Strategy', 'id': '2'}
]
will be unnested to look something like:
[
    'Action', 'Adventure', 'Indie', 'RPG', 'Strategy'
]

`screenshots` (appid: 2167330)
keys to be extracted: 'path_full'
looks something like:
[
    {
        'id': 0, 
        'path_full': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/2167330/ss_3070f64060195ef026f35290beda1b9423143ddf.1920x1080.jpg?t=1711660624', 
        'path_thumbnail': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/2167330/ss_3070f64060195ef026f35290beda1b9423143ddf.600x338.jpg?t=1711660624'
    },
    {
        'id': 1, 
        'path_full': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/2167330/ss_a78ad1478ff83176bb1eac17753909e550e750eb.1920x1080.jpg?t=1711660624', 
        'path_thumbnail': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/2167330/ss_a78ad1478ff83176bb1eac17753909e550e750eb.600x338.jpg?t=1711660624'
    }
]
will be unnested to look something like:
[
    'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/2167330/ss_3070f64060195ef026f35290beda1b9423143ddf.1920x1080.jpg?t=1711660624',
    'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/2167330/ss_a78ad1478ff83176bb1eac17753909e550e750eb.1920x1080.jpg?t=1711660624'
]
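
For the three flat cases so far (`categories`, `genres`, `screenshots`), the extraction could be sketched with one small helper. The name `pluck` is hypothetical, and returning an empty list for missing values is an assumption, not something decided in this notebook.

```python
from typing import Any, List

def pluck(items: Any, key: str) -> List[Any]:
    """Collect `key` from each dict in a list; anything that is not a list gives []."""
    if not isinstance(items, list):
        # NaN / None cells (apps without the field) end up here
        return []
    return [item[key] for item in items if isinstance(item, dict) and key in item]
```

It could be applied per column, e.g. `df_unnesting['genres'].map(lambda v: pluck(v, 'description'))` or with `'path_full'` for `screenshots`.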

`movies` (appid: 2167380)
keys to be extracted: 'mp4'.'max'
looks something like:
[
  {
    'highlight': True,
    'id': 257072610,
    'mp4': {
      '480': 'http://video.akamai.steamstatic.com/store_trailers/257072610/movie480.mp4?t=1731398896',
      'max': 'http://video.akamai.steamstatic.com/store_trailers/257072610/movie_max.mp4?t=1731398896'
    },
    'name': 'Steam Launch',
    'thumbnail': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/257072610/0d5fe86b8452e81fd1f2ed13fc2a5d91ba9166f8/movie_600x337.jpg?t=1731398896',
    'webm': {
      '480': 'http://video.akamai.steamstatic.com/store_trailers/257072610/movie480_vp9.webm?t=1731398896',
      'max': 'http://video.akamai.steamstatic.com/store_trailers/257072610/movie_max_vp9.webm?t=1731398896'
    }
  },
  {
    'highlight': True,
    'id': 257063140,
    'mp4': {
      '480': 'http://video.akamai.steamstatic.com/store_trailers/257063140/movie480.mp4?t=1728446372',
      'max': 'http://video.akamai.steamstatic.com/store_trailers/257063140/movie_max.mp4?t=1728446372'
    },
    'name': 'Ricochet Rodeo - Steam Demo',
    'thumbnail': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/257063140/d06d8e94408e902ff0f225748d5158c33dc595b6/movie_600x337.jpg?t=1728446372',
    'webm': {
      '480': 'http://video.akamai.steamstatic.com/store_trailers/257063140/movie480_vp9.webm?t=1728446372',
      'max': 'http://video.akamai.steamstatic.com/store_trailers/257063140/movie_max_vp9.webm?t=1728446372'
    }
  },
  {
    'highlight': False,
    'id': 256972749,
    'mp4': {
      '480': 'http://video.akamai.steamstatic.com/store_trailers/256972749/movie480.mp4?t=1696137443',
      'max': 'http://video.akamai.steamstatic.com/store_trailers/256972749/movie_max.mp4?t=1696137443'
    },
    'name': 'Trailer',
    'thumbnail': 'https://shared.akamai.steamstatic.com/store_item_assets/steam/apps/256972749/movie.293x165.jpg?t=1696137443',
    'webm': {
      '480': 'http://video.akamai.steamstatic.com/store_trailers/256972749/movie480_vp9.webm?t=1696137443',
      'max': 'http://video.akamai.steamstatic.com/store_trailers/256972749/movie_max_vp9.webm?t=1696137443'
    }
  }
]
will be extracted to look something like:
[
    'http://video.akamai.steamstatic.com/store_trailers/257072610/movie_max.mp4?t=1731398896',
    'http://video.akamai.steamstatic.com/store_trailers/257063140/movie_max.mp4?t=1728446372',
    'http://video.akamai.steamstatic.com/store_trailers/256972749/movie_max.mp4?t=1696137443'
]
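
A minimal sketch of how the `'mp4'.'max'` extraction described above could be done. The helper name `extract_movie_urls` and its `fmt`/`quality` parameters are assumptions made for illustration, not part of the existing pipeline, and the third sample entry (without an `mp4` key) is hypothetical.

```python
from typing import Any, List


def extract_movie_urls(movies: Any, fmt: str = "mp4", quality: str = "max") -> List[str]:
    """Return the fmt/quality URL of every movie entry that has one.

    Anything that is not a list (None, NaN, ...) yields an empty list,
    so the function can be applied to a column with missing values.
    """
    if not isinstance(movies, list):
        return []
    urls = []
    for movie in movies:
        if not isinstance(movie, dict):
            continue
        variants = movie.get(fmt)
        # Skip entries where the format or the quality key is absent.
        if isinstance(variants, dict) and variants.get(quality):
            urls.append(variants[quality])
    return urls


# Two entries trimmed from the sample above, plus one hypothetical entry without 'mp4'.
sample_movies = [
    {
        "id": 257072610,
        "mp4": {
            "480": "http://video.akamai.steamstatic.com/store_trailers/257072610/movie480.mp4?t=1731398896",
            "max": "http://video.akamai.steamstatic.com/store_trailers/257072610/movie_max.mp4?t=1731398896",
        },
    },
    {
        "id": 257063140,
        "mp4": {
            "480": "http://video.akamai.steamstatic.com/store_trailers/257063140/movie480.mp4?t=1728446372",
            "max": "http://video.akamai.steamstatic.com/store_trailers/257063140/movie_max.mp4?t=1728446372",
        },
    },
    {"id": 0, "name": "hypothetical entry without mp4"},
]

movie_urls = extract_movie_urls(sample_movies)
```

On the full frame this could be applied as `df['movies'].apply(extract_movie_urls)`, assuming the column is named `movies` as in the description above.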

(This will not be parsed)
<!-- `achievements.highlighted` (appid: 2167390)
keys to be extracted: 'name', 'path'
that's why this column will be separated into two columns:
  `achievements_name`, `achievements_img`
looks something like:
[
  {
    'name': 'Welcome',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/97d780ecbf4460758d692611b4f112bb54974b28.jpg'
  }
  {
    'name': 'Into Next level',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/b0eff68ff0b35e279703dad7332b593bdd5caead.jpg'
  }
  {
    'name': '20 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/0e1a9228793eff7d869ddc18e09ac95ae5616168.jpg'
  }
  {
    'name': '50 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/13a92b6d9856f80cef744316250599c82d0a6108.jpg'
  }
  {
    'name': '100 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/4561a9b4dea07482c45692cfc874ec3eb3426552.jpg'
  }
  {
    'name': '200 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/06365352fbaed975e0705444894d2d79cd2e4b1d.jpg'
  }
  {
    'name': '400 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/907c332197c74a255308975efc56236a84d2ec61.jpg'
  }
  {
    'name': '666 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/31ba9c09299ed7547d646878398298a973639412.jpg'
  }
  {
    'name': '800 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/0f9ff6ad2c761df4fe55af708fecd5cb05d3db37.jpg'
  }
  {
    'name': '999 Kills',
    'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/15b3cd618b29dd4b0c43840e558bb145f7fe3818.jpg'
  }
]
will be extracted to look something like: 
`achievements_name`
[
  'Welcome', 'Into Next level', '20 Kills', ...
]
`achievements_img`
[
  'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/97d780ecbf4460758d692611b4f112bb54974b28.jpg',
  'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/b0eff68ff0b35e279703dad7332b593bdd5caead.jpg',
  'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/2167390/0e1a9228793eff7d869ddc18e09ac95ae5616168.jpg',
  ...
] -->

## Achievements

`achievements.highlighted` will no longer be parsed.

In [None]:
df_filtered

It looks like `achievements.total` might be the count of the entries in `achievements.highlighted`, so this should be checked.

In [None]:
df_achievements = df_filtered.copy()
df_achievements['count_achievements'] = (
    df_achievements['achievements.highlighted']
    .apply(lambda x: len(x) if isinstance(x, list) else 0)
)
df_achievements.loc[
    df_achievements['count_achievements'] > 0,
    [
        'appid',
        'name',
        'achievements.highlighted',
        'achievements.total',
        'count_achievements',
    ]
]

`achievements.total` could also be read as the number of achievements that users have completed, but the values are too low for that: the maximum is 1384, and there are far more users than that. Since it is unclear what this column represents, it will be dropped.

Looking at the `count_achievements` column created for this exploration, the API appears to return at most 10 highlighted achievements, even when a game has more. The highlighted achievements will still be used to show what kind of achievements a game has, because that also gives a gist of what the game is about.
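
The two observations above can be checked with a small comparison. This is only a sketch on a toy frame with made-up values; the helper `compare_achievement_counts`, the `cap` parameter, and the `matches_total` and `capped` columns are assumptions for illustration. The cap of 10 is what the exploration suggests, not a documented API limit.

```python
import pandas as pd

# Toy data with made-up values, shaped like the normalized columns used above.
toy = pd.DataFrame(
    {
        "appid": [1, 2, 3],
        "achievements.total": [3, 25, None],
        "achievements.highlighted": [
            [{"name": "a"}, {"name": "b"}, {"name": "c"}],
            [{"name": f"n{i}"} for i in range(10)],
            None,
        ],
    }
)


def compare_achievement_counts(frame: pd.DataFrame, cap: int = 10) -> pd.DataFrame:
    """Compare the length of `achievements.highlighted` with `achievements.total`."""
    out = frame[["appid", "achievements.total"]].copy()
    # Same counting rule as in the cell above: non-list values count as 0.
    out["count_achievements"] = frame["achievements.highlighted"].apply(
        lambda x: len(x) if isinstance(x, list) else 0
    )
    total = out["achievements.total"].fillna(0)
    # True where the highlighted list accounts for every achievement.
    out["matches_total"] = out["count_achievements"] == total
    # True where the list stops at the assumed cap although the total is larger.
    out["capped"] = (out["count_achievements"] == cap) & (total > cap)
    return out


summary = compare_achievement_counts(toy)
max_highlighted = int(summary["count_achievements"].max())
```

Running `compare_achievement_counts(df_filtered)` on the real data and checking that `count_achievements` never exceeds 10 would support the reading that the API truncates the highlighted list.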