# All Printings Table

## Introduction

The purpose of this notebook is to process all card data from MTGJSON and upload it to the PostgreSQL database mtg_db. This is done through the following steps:
- Download the JSON file from MTGJSON's file server
- Check the version and date of the JSON file
- Pre-process the dictionary and convert it into a dataframe
- Push the boosters dataframe to the database "raw_data" schema

## Schemas

Set List Schema - Main table

| Column            | Renamed         | Datatype   | Description                                                                |
| ---               | ---             | ---        | ---                                                                        |
| code              | SET_CODE        | STRING     | The set code                                                               |
| name              | SET_NAME        | STRING     | Name of the set                                                            |
| baseSetSize       | BASE_SET_SIZE   | INTEGER    | The number of cards in the base set without promos or supplements          |
| booster           | BOOSTERS        | DICTIONARY | Booster configurations for the set, keyed by booster type                  |
| cardsphereSetId   | CS_SET_ID       | FLOAT      | ID for set in Cardsphere                                                   |
| mcmId             | CM_ID           | FLOAT      | Card Market set ID                                                         |
| mcmIdExtras       | CM_ID_ADD       | FLOAT      | If the set is split into two sets this is the additional Card Market ID    |
| mcmName           | CM_NAME         | STRING     | Name of the set on Card Market                                             |
| isFoilOnly        | FOIL_FLAG       | BOOLEAN    | Flag whether the set is only available as foils                            |
| isForeignOnly     | FOREIGN_FLAG    | BOOLEAN    | Flag whether the set is only available outside the US                      |
| keyruneCode       | KEYRUNE_CODE    | STRING     | ID for the keyrune database of set icons                                   |
| languages         | LANGUAGES       | LIST       | List of languages the set was printed in                                   |
| mtgoCode          | MTGO_SET_CODE   | STRING     | Set code on Magic The Gathering Online                                     |
| isNonFoilOnly     | NON_FOIL_FLAG   | BOOLEAN    | Flag whether the set is only available as non-foils                        |
| isOnlineOnly      | ONLINE_FLAG     | BOOLEAN    | Flag whether the set is only available in online formats                   |
| isPartialPreview  | PREVIEW_FLAG    | BOOLEAN    | Flag whether the set is still in preview and not complete                  |
| sealedProduct     | PRODUCT_INFO    | LIST       | Information about the purchasable sealed product                           |
| releaseDate       | RELEASE_DATE    | STRING     | Date the set was released, in format YYYY-MM-DD                            |
| block             | SET_BLOCK_NAME  | STRING     | Block the set is in, e.g. Kaladesh                                         |
| decks             | SET_DECKS       | LIST       | All decks associated with the set                                          |
| parentCode        | SET_PARENT_CODE | STRING     | Code of the parent set for set variations, e.g. promotions, guild kits etc |
| tokenSetCode      | SET_TOKEN_CODE  | STRING     | Code for the set's tokens                                                  |
| type              | SET_TYPE        | STRING     | The type of set, e.g. alchemy, commander, funny                            |
| tcgplayerGroupId  | TCGPG_ID        | INTEGER    | ID for the set on TCGplayer                                                |
| totalSetSize      | TOTAL_SET_SIZE  | INTEGER    | The number of cards in the set with promos and supplements                 |
| translations      | TRANSLATIONS    | DICTIONARY | The translated name of the set                                             |

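The Column-to-Renamed mapping in the table above can be applied mechanically. The following is a minimal sketch rather than the notebook's own implementation: `RENAME_MAP` and `rename_keys` are hypothetical names, only a few of the columns are listed, and a plain dictionary stands in for a DataFrame row. With a DataFrame, the same dictionary could be passed to `DataFrame.rename(columns=...)`.

```python
# Hypothetical subset of the Column -> Renamed mapping from the schema table
RENAME_MAP = {
    "code": "SET_CODE",
    "name": "SET_NAME",
    "baseSetSize": "BASE_SET_SIZE",
    "releaseDate": "RELEASE_DATE",
    "totalSetSize": "TOTAL_SET_SIZE",
}


def rename_keys(record, rename_map):
    """Return a copy of record with keys renamed; unmapped keys are kept as-is."""
    return {rename_map.get(key, key): value for key, value in record.items()}


# A few values from the Tenth Edition set
example_set = {"code": "10E", "name": "Tenth Edition", "baseSetSize": 383, "type": "core"}
renamed_set = rename_keys(example_set, RENAME_MAP)
print(renamed_set)
```

Keys that are missing from the mapping (here `type`) pass through unchanged, which makes gaps in the mapping easy to spot.
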
## Python Libraries

In [1]:
import json
import requests
import lzma
from   tqdm       import tqdm
import numpy      as     np
import pandas     as     pd
from   sqlalchemy import create_engine, Table, Column, MetaData, Text, Date, text
from   sqlalchemy.dialects.postgresql import insert

In [2]:
# Show all columns instead of truncating with "..."
pd.set_option("display.max_columns", None)

# (Optional) also show all rows
pd.set_option("display.max_rows", None)

# (Optional) widen the display area so columns don’t wrap badly
pd.set_option("display.width", None)

## Functions

In [3]:
# Function for showing the date and version of the MTGJSON data
def data_recency_check(data, json_type):

    """
    Extract the version and date metadata from an MTGJSON dataset and return
    it as a single-row DataFrame together with the JSON type.

    Parameters
    ----------
    data : dict
        MTGJSON data loaded from a JSON file, expected to contain a 'meta' key
        with 'date' and 'version' fields.

    json_type : str
        A string indicating the type or name of the JSON dataset being processed.
        This will be included in the output DataFrame.

    Returns
    -------
    pd.DataFrame
        A DataFrame with a single row and columns:
        - 'json_type': The provided JSON dataset type/name.
        - 'latest_date': The date the MTGJSON data was last updated.
        - 'latest_version': The MTGJSON model version.
    """

    # Create a DataFrame for the output
    df = pd.DataFrame({'json_type'      : [json_type]
                      ,'latest_date'    : [data['meta']['date']]
                      ,'latest_version' : [data['meta']['version']]})

    # Return the single-row DataFrame
    return df

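`data_recency_check` passes the metadata values through as strings, while the target column `latest_date` is a DATE. The sketch below shows one way the values could be parsed first. It assumes the date is ISO formatted (YYYY-MM-DD) and that the version has the form `<model version>+<build tag>`, as in the `5.2.2+20250908` value this notebook prints; `parse_meta` is a hypothetical helper, not part of MTGJSON or of this notebook.

```python
from datetime import date


def parse_meta(meta):
    """Split MTGJSON-style metadata into a date object, model version, and build tag."""
    latest_date = date.fromisoformat(meta["date"])
    # Assumption: the version looks like "<model version>+<build tag>";
    # if there is no "+", the build tag comes back as an empty string
    model_version, _, build_tag = meta["version"].partition("+")
    return latest_date, model_version, build_tag


example_meta = {"date": "2025-09-08", "version": "5.2.2+20250908"}
parsed = parse_meta(example_meta)
print(parsed)
```
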
In [4]:
# Function for uploading the recency check

def recency_check_upload(schema_name, table_name, dataframe):
    
    """
    Uploads recency check data from a Pandas DataFrame into a PostgreSQL table 
    with upsert (insert or update) logic.

    Each row from the DataFrame is inserted into the target table. If a row with the 
    same `json_type` (primary key) already exists, the corresponding `latest_date` 
    and `latest_version` values are updated instead.

    Parameters
    ----------
    schema_name : str
        Name of the PostgreSQL schema where the table resides.
    table_name : str
        Name of the PostgreSQL table to update or insert into.
    dataframe : pandas.DataFrame
        DataFrame containing the recency check data with columns:
        - "json_type" (str): Identifier for the JSON file type.
        - "latest_date" (str or datetime.date): Date of the latest file, in format YYYY-MM-DD.
        - "latest_version" (str): Version string of the latest file.

    Notes
    -----
    - Requires a global SQLAlchemy `engine` object to be defined.
    - Uses PostgreSQL's ON CONFLICT clause for upsert behavior.
    """

    # Create a MetaData object
    metadata = MetaData(schema=schema_name)
    
    # Define the Table object matching your PostgreSQL table
    json_recency_table = Table(table_name
                              ,metadata
                              ,Column("json_type" ,Text ,primary_key = True)
                              ,Column("latest_date" ,Date)
                              ,Column("latest_version" ,Text))
    
    # Upsert each row from your DataFrame
    with engine.begin() as conn:
        
        # Iterate through rows of the DataFrame
        for _, row in dataframe.iterrows():
            
            # Create an insert statement for the current row
            stmt = insert(json_recency_table).values(json_type      = row["json_type"]
                                                    ,latest_date    = row["latest_date"]
                                                    ,latest_version = row["latest_version"])
            
            # Add upsert logic to update on conflict
            stmt = stmt.on_conflict_do_update(index_elements = ["json_type"]
                                             ,set_           = {"latest_date"    : row["latest_date"]
                                                               ,"latest_version" : row["latest_version"]})
            
            # Execute the statement
            conn.execute(stmt)

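The insert-or-update behaviour described in the docstring can be tried out without a PostgreSQL server. The sketch below deliberately swaps in SQLite (Python's built-in `sqlite3` module), which supports a closely related `ON CONFLICT ... DO UPDATE` clause from SQLite 3.24 onwards; it is an illustration of the upsert idea, not the function above. The first date and version are illustrative values chosen to show the update.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_recency ("
    " json_type TEXT PRIMARY KEY,"
    " latest_date TEXT,"
    " latest_version TEXT)"
)

UPSERT_SQL = """
    INSERT INTO data_recency (json_type, latest_date, latest_version)
    VALUES (?, ?, ?)
    ON CONFLICT (json_type) DO UPDATE SET
        latest_date    = excluded.latest_date,
        latest_version = excluded.latest_version
"""

# The first call inserts the row; the second hits the primary key and updates it
conn.execute(UPSERT_SQL, ("all printings", "2025-08-20", "5.2.2+20250820"))
conn.execute(UPSERT_SQL, ("all printings", "2025-09-08", "5.2.2+20250908"))

rows = conn.execute("SELECT * FROM data_recency").fetchall()
print(rows)
```
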
In [5]:
# Function for showing the hierarchy of a dictionary or the schema of a single level
def print_dict_structure(data, max_depth=None, _indent=0):

    """
    Recursively prints the hierarchical structure of a dictionary or list,
    including the length of each element where applicable.
    
    If max_depth=1, returns a DataFrame with columns: KEY_NAME, DATA_TYPE, LENGTH.
    
    Args:
        data: The dictionary or list to explore.
        max_depth: Limit how deep to traverse (None for full depth).
    
    Returns:
        pd.DataFrame if max_depth=1, otherwise None (prints output).
    """
    
    # Check if we are at the top level and max_depth=1 to return DataFrame instead of printing
    if max_depth == 1 and _indent == 0:
        # Initialize list to collect rows for the DataFrame
        rows = []

        # If data is a dictionary, iterate over its keys and values
        if isinstance(data, dict):
            for key, value in data.items():
                # Determine length if possible, otherwise set to 0
                length = len(value) if hasattr(value, "__len__") and not isinstance(value, (str, bytes)) else 0
                # Append a tuple of key name, data type, and length to rows
                rows.append((key, type(value).__name__, length))

        # If data is a list, take the first element (assumed dict) and do the same
        elif isinstance(data, list) and data:
            for key, value in data[0].items():
                # Determine length if possible, otherwise set to 0
                length = len(value) if hasattr(value, "__len__") and not isinstance(value, (str, bytes)) else 0
                # Append a tuple of key name, data type, and length to rows
                rows.append((key, type(value).__name__, length))

        # Convert collected rows into a DataFrame with specific column names
        return pd.DataFrame(rows, columns=["KEY_NAME", "DATA_TYPE", "LENGTH"])
    
    # Create a prefix for indentation when printing nested structures
    prefix = "  " * _indent

    # If data is a dictionary, iterate recursively
    if isinstance(data, dict):
        for key, value in data.items():
            # Prepare a string showing type and length for printing
            length_info = f", len={len(value)}" if hasattr(value, "__len__") and not isinstance(value, (str, bytes)) else ""
            # Print the key name, type, and length with proper indentation
            print(f"{prefix}{key} ({type(value).__name__}{length_info})")
            # Recurse into value if max_depth is not reached
            if max_depth is None or _indent < max_depth - 1:
                print_dict_structure(value, max_depth, _indent + 1)

    # If data is a list, recurse into the first element (assuming homogeneous elements)
    elif isinstance(data, list):
        if data and (max_depth is None or _indent < max_depth - 1):
            print_dict_structure(data[0], max_depth, _indent + 1)

    # For non-dict and non-list elements, print their type and length
    else:
        # Prepare a string showing type and length for printing
        length_info = f", len={len(data)}" if hasattr(data, "__len__") and not isinstance(data, (str, bytes)) else ""
        # Print the type and length with proper indentation
        print(f"{prefix}{type(data).__name__}{length_info}")

## Input

### Database Connection

In [6]:
## Setting up credentials for accessing postgresql "mtg_db" database

# Credentials for setting up connection to postgresql
user     = "postgres"
password = "as:123bpostgresql"
host     = "localhost"
port     = "5432"
database = "mtg_db"

# Engine connection to postgresql
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}")

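The cell above keeps the password in the notebook itself. One common alternative is to read the credentials from environment variables and percent-encode the password before building the URL. This is a sketch under assumptions: the `MTG_DB_*` variable names and `build_connection_url` are hypothetical, and the fallback values simply mirror the cell above. The resulting string could be passed to `create_engine` in place of the f-string.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(env):
    """Build a PostgreSQL connection URL from a mapping of environment variables."""
    # Hypothetical variable names; fallbacks mirror the values used in the notebook
    user     = env.get("MTG_DB_USER", "postgres")
    host     = env.get("MTG_DB_HOST", "localhost")
    port     = env.get("MTG_DB_PORT", "5432")
    database = env.get("MTG_DB_NAME", "mtg_db")

    # Percent-encode the password so characters such as ":" or "@"
    # cannot be mistaken for URL delimiters
    password = quote_plus(env.get("MTG_DB_PASSWORD", ""))

    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Read from the real environment; the URL is not printed because it contains the password
connection_url = build_connection_url(os.environ)
```
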
In [7]:
## Creating the empty data_recency table if not exists
query = """
        CREATE TABLE IF NOT EXISTS raw_data.data_recency (
         json_type      TEXT PRIMARY KEY
        ,latest_date    DATE
        ,latest_version TEXT);
        """
with engine.begin() as conn:
    conn.execute(text(query))

### Input Data

In [8]:
# URL for the MTGJSON file (example: AllPrintings)
url = "https://mtgjson.com/api/v5/AllPrintings.json.xz"

# Stream download the file to track progress
response = requests.get(url, stream=True)
response.raise_for_status()

# Prepare to track total size and read in chunks
total_size = int(response.headers.get('content-length', 0))  # total bytes; 0 if the header is missing
chunk_size = 1024 * 1024  # 1 MB per chunk
compressed_data = bytearray()  # store the downloaded bytes

# Iterate over response chunks, updating progress bar
with tqdm(total=total_size, unit='B', unit_scale=True, desc="Downloading") as pbar:
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:  # filter out keep-alive chunks
            compressed_data.extend(chunk)
            pbar.update(len(chunk))

# Decompress the .xz file from the bytes you collected
decompressed_bytes = lzma.decompress(compressed_data)

# Parse JSON into a dictionary
dict__all_printings = json.loads(decompressed_bytes)

Downloading: 100%|██████████| 64.1M/64.1M [00:20<00:00, 3.07MB/s]

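The download cell collects the whole compressed file in memory before decompressing it. As a possible variation, each chunk can be fed to an `lzma.LZMADecompressor` as it arrives, so the compressed bytes never need to be kept. The sketch below uses a tiny in-memory payload instead of the real download; `decompress_chunks` is a hypothetical helper and the fixed 16-byte chunks only imitate `response.iter_content(...)`.

```python
import json
import lzma


def decompress_chunks(chunks):
    """Decompress an iterable of .xz-compressed byte chunks incrementally."""
    decompressor = lzma.LZMADecompressor()
    output = bytearray()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            output.extend(decompressor.decompress(chunk))
    return bytes(output)


# Stand-in for the downloaded file: a tiny payload with the same top-level keys
payload = {"meta": {"date": "2025-09-08", "version": "5.2.2+20250908"}, "data": {}}
compressed = lzma.compress(json.dumps(payload).encode("utf-8"))

# Split into small pieces to imitate chunks arriving from the network
chunks = [compressed[i:i + 16] for i in range(0, len(compressed), 16)]
restored = json.loads(decompress_chunks(chunks))
print(restored["meta"])
```

The decompressed JSON is still held in memory in full, so this only saves the space taken by the compressed copy.
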

## Pre-processing

In [9]:
# Checking the latest version of the input data
df__data_recency = data_recency_check(dict__all_printings, 'all printings')
df__data_recency

|     | json_type     | latest_date | latest_version |
| --- | ---           | ---         | ---            |
| 0   | all printings | 2025-09-08  | 5.2.2+20250908 |


## Main Code

In [10]:
## Converting the first layer of JSON dictionary into dataframe
# Empty list for storing dataframes
list__set_data = []

# Listing the set codes
list__set_codes = list(dict__all_printings['data'].keys())

# Looping through the set codes making individual dataframes
for set_code in tqdm(list__set_codes, desc="Processing sets"):
    df__set = pd.json_normalize(dict__all_printings['data'][set_code], max_level=0)
    list__set_data.append(df__set)

# Concatenate sets into single DataFrame
df__sets = pd.concat(list__set_data, ignore_index=True)

Processing sets: 100%|██████████| 826/826 [00:08<00:00, 96.62it/s] 


In [None]:
# Dropping columns in the set 

In [12]:
# Reviewing columns in main dataframe
print(df__sets.columns)
df__sets.head(3)

Index(['baseSetSize', 'block', 'booster', 'cards', 'cardsphereSetId', 'code',
       'decks', 'isFoilOnly', 'isOnlineOnly', 'keyruneCode', 'languages',
       'mcmId', 'mcmName', 'mtgoCode', 'name', 'releaseDate', 'sealedProduct',
       'tcgplayerGroupId', 'tokenSetCode', 'tokens', 'totalSetSize',
       'translations', 'type', 'isNonFoilOnly', 'mcmIdExtras', 'isForeignOnly',
       'parentCode', 'isPartialPreview'],
      dtype='object')


Unnamed: 0,baseSetSize,block,booster,cards,cardsphereSetId,code,decks,isFoilOnly,isOnlineOnly,keyruneCode,languages,mcmId,mcmName,mtgoCode,name,releaseDate,sealedProduct,tcgplayerGroupId,tokenSetCode,tokens,totalSetSize,translations,type,isNonFoilOnly,mcmIdExtras,isForeignOnly,parentCode,isPartialPreview
0,383,Core Set,{'draft': {'boosters': [{'contents': {'basic':...,"[{'artist': 'Pete Venters', 'artistIds': ['d54...",755.0,10E,"[{'code': '10E', 'commander': [], 'displayComm...",False,False,10E,"[Chinese Simplified, English, French, German, ...",74.0,Tenth Edition,10E,Tenth Edition,2007-07-13,"[{'category': 'booster_box', 'contents': {'sea...",1.0,T10E,"[{'artist': 'Paolo Parente', 'artistIds': ['d4...",510,"{'Chinese Simplified': None, 'Chinese Traditio...",core,,,,,
1,302,Core Set,{'default': {'boosters': [{'contents': {'commo...,"[{'artist': 'Dan Frazier', 'artistIds': ['059b...",938.0,2ED,,False,False,2ED,[English],,,,Unlimited Edition,1993-12-01,"[{'category': 'booster_box', 'contents': {'sea...",115.0,,[],302,{},core,True,,,,
2,331,,{'collector': {'boosters': [{'contents': {'com...,"[{'artist': 'Mark Tedin', 'artistIds': ['9ee9a...",1462.0,2X2,,False,False,2X2,"[Chinese Simplified, English, French, German, ...",5070.0,Double Masters 2022,,Double Masters 2022,2022-07-08,"[{'category': 'limited_aid_tool', 'contents': ...",3070.0,T2X2,"[{'artist': 'Izzy', 'artistIds': ['2c3d2473-ff...",579,"{'Chinese Simplified': None, 'Chinese Traditio...",masters,,5071.0,,,


In [None]:
# Collect results
rows = []

# Loop through sets
for mtg_set, set_data in dict__all_printings["data"].items():
    booster_data = set_data.get("booster", None)

    if isinstance(booster_data, dict):
        # Expand each booster type into its own row
        for booster_type, booster_content in booster_data.items():
            rows.append({
                 'SET_CODE'        : mtg_set
                ,'BOOSTER_TYPE'    : booster_type
                ,'BOOSTER_CONTENT' : booster_content
            })
    else:
        # Add a single row with None values
        rows.append({
             'SET_CODE'        : mtg_set
            ,'BOOSTER_TYPE'    : None
            ,'BOOSTER_CONTENT' : None
        })

# Convert to DataFrame and preview the first rows
df__boosters = pd.DataFrame(rows)
df__boosters.head()

|     | SET_CODE | BOOSTER_TYPE | BOOSTER_CONTENT                                    |
| --- | ---      | ---          | ---                                                |
| 0   | 10E      | draft        | {'boosters': [{'contents': {'basic': 1, 'commo...  |
| 1   | 2ED      | default      | {'boosters': [{'contents': {'common': 11, 'rar...  |
| 2   | 2ED      | starter      | {'boosters': [{'contents': {'commonWithDuplica...  |
| 3   | 2X2      | collector    | {'boosters': [{'contents': {'commonUncommonSho...  |
| 4   | 2X2      | draft        | {'boosters': [{'contents': {'commonWithShowcas...  |

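The expansion logic in the cell above can be checked on a toy input. The sketch below wraps the same loop in a hypothetical function, `expand_boosters`, and returns tuples instead of building a DataFrame. The booster contents are placeholders and `XYZ` is a made-up set code used to show the no-booster case.

```python
def expand_boosters(sets):
    """Return one row per (set, booster type); sets without boosters get one empty row."""
    rows = []
    for set_code, set_data in sets.items():
        booster_data = set_data.get("booster")
        if isinstance(booster_data, dict):
            # One row per booster type
            for booster_type, booster_content in booster_data.items():
                rows.append((set_code, booster_type, booster_content))
        else:
            # Keep the set in the output even though it has no boosters
            rows.append((set_code, None, None))
    return rows


# Toy input: placeholder contents, and a made-up set without a "booster" key
toy_sets = {
    "2ED": {"booster": {"default": {"boosters": []}, "starter": {"boosters": []}}},
    "XYZ": {"name": "Hypothetical set without boosters"},
}
toy_rows = expand_boosters(toy_sets)
for row in toy_rows:
    print(row)
```

Because every set without boosters still contributes one row, the row count of the result is larger than the number of rows with a booster type.
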

In [24]:
df__boosters["BOOSTER_CONTENT"] = df__boosters["BOOSTER_CONTENT"].astype(str)
df__boosters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   SET_CODE         1187 non-null   object
 1   BOOSTER_TYPE     547 non-null    object
 2   BOOSTER_CONTENT  1187 non-null   object
dtypes: object(3)
memory usage: 27.9+ KB

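Note that `astype(str)` stores the Python representation of each dictionary (single quotes) and turns missing values into the string `'None'`, which is consistent with BOOSTER_CONTENT reporting 1187 non-null values while BOOSTER_TYPE reports 547. If valid JSON text is preferred in the database, a possible alternative is to serialise with `json.dumps` and leave missing values untouched. This is a sketch, not what the notebook currently does: `to_json_text` is a hypothetical helper that could be applied per value, for example with `Series.map`, and the toy content is illustrative.

```python
import json


def to_json_text(content):
    """Serialise booster content as JSON text, leaving missing values as None."""
    return None if content is None else json.dumps(content, sort_keys=True)


# Toy booster content using keys seen in the output above (values are illustrative)
content = {"boosters": [{"contents": {"basic": 1}}]}

print(str(content))           # Python repr: single quotes, not valid JSON
print(to_json_text(content))  # JSON text: double quotes, parseable later
print(repr(str(None)), repr(to_json_text(None)))
```
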

## Output

In [28]:
recency_check_upload(schema_name = "raw_data"
                    ,table_name  = "data_recency"
                    ,dataframe   = df__data_recency)

In [25]:
# Uploading the boosters dataframe to postgresql
df__boosters.to_sql(name      = 'boosters'
                   ,con       = engine
                   ,schema    = 'raw_data'
                   ,if_exists = 'replace'
                   ,index     = False)

187

## Checks

### Boosters

In [30]:
# Check the json file date and version
query = """
        SELECT *
        FROM raw_data.data_recency
        """
pd.read_sql_query(query, con=engine)

|     | json_type     | latest_date | latest_version |
| --- | ---           | ---         | ---            |
| 0   | keyword       | 2025-08-20  | 5.2.2+20250820 |
| 1   | all printings | 2025-09-08  | 5.2.2+20250908 |


In [31]:
# Check the first 10 rows of the boosters table
query = """
        SELECT *
        FROM raw_data.boosters
        LIMIT 10
        """
pd.read_sql_query(query, con=engine)

|     | SET_CODE | BOOSTER_TYPE | BOOSTER_CONTENT                                    |
| --- | ---      | ---          | ---                                                |
| 0   | 10E      | draft        | {'boosters': [{'contents': {'basic': 1, 'commo...  |
| 1   | 2ED      | default      | {'boosters': [{'contents': {'common': 11, 'rar...  |
| 2   | 2ED      | starter      | {'boosters': [{'contents': {'commonWithDuplica...  |
| 3   | 2X2      | collector    | {'boosters': [{'contents': {'commonUncommonSho...  |
| 4   | 2X2      | draft        | {'boosters': [{'contents': {'commonWithShowcas...  |
| 5   | 2XM      | box-topper   | {'boosters': [{'contents': {'boxtopper': 1}, '...  |
| 6   | 2XM      | draft        | {'boosters': [{'contents': {'common': 8, 'dedi...  |
| 7   | 2XM      | vip          | {'boosters': [{'contents': {'foilBasic': 2, 'f...  |
| 8   | 30A      | draft        | {'boosters': [{'contents': {'a30Basic': 2, 'a3...  |
| 9   | 3ED      | default      | {'boosters': [{'contents': {'common': 11, 'rar...  |


### Cards