# Keywords Table

## Introduction

This notebook processes keyword data from MTGJSON and uploads it to the PostgreSQL database mtg_db, in the following steps:
- Download the compressed JSON file from MTGJSON's file server
- Check the version and date of the JSON file
- Pre-process the dictionary and convert it into a dataframe
- Push the keywords dataframe to the "raw_data" schema of the database

## Schemas

### Keywords Schema

| Column           | Renamed   | Datatype | Description                                              |
| ---              | ---       | ---      | ---                                                      |
| abilityWords     | abilities | STRING   | A list of ability words found in rules text on cards     |
| keywordAbilities | keywords  | STRING   | A list of keyword abilities found in rules text on cards |
| keywordActions   | actions   | STRING   | A list of keyword actions found in rules text on cards   |

## Python Libraries

In [56]:
import json
import requests
import lzma
from   tqdm       import tqdm
import numpy      as     np
import pandas     as     pd
from   sqlalchemy import create_engine, text

## Modular functions
# Setting the root path for finding the modules directory
import sys, os
sys.path.append(os.path.abspath(".."))
# Loading Modular functions
from   modules.data_recency import data_recency_check, recency_check_upload

# Clean-Up
del sys, os

In [57]:
# Show all columns instead of truncating with "..."
pd.set_option("display.max_columns", None)

# (Optional) also show all rows
pd.set_option("display.max_rows", None)

# (Optional) widen the display area so columns don’t wrap badly
pd.set_option("display.width", None)

## Input

### Database Connection

In [58]:
## Setting up credentials for accessing postgresql "mtg_db" database

# Credentials for setting up connection to postgresql
user     = "postgres"
password = "as:123bpostgresql"
host     = "localhost"
port     = "5432"
database = "mtg_db"

# Engine connection to postgresql
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}")

# Clean-Up
del user, password, host, port, database, create_engine
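
The cell above hard-codes the database password in the notebook. A minimal sketch of one alternative, assuming the credentials are supplied through environment variables: the variable names (`MTG_DB_USER` and so on) and the helper `build_db_url` are hypothetical and not part of the original notebook. The password is percent-encoded with `urllib.parse.quote_plus` so that characters such as `:` or `@` are not misread as URL delimiters.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a SQLAlchemy-style PostgreSQL URL from environment variables.

    The MTG_DB_* variable names are illustrative assumptions.
    """
    user     = env.get("MTG_DB_USER", "postgres")
    password = env["MTG_DB_PASSWORD"]            # no default: fail loudly if unset
    host     = env.get("MTG_DB_HOST", "localhost")
    port     = env.get("MTG_DB_PORT", "5432")
    database = env.get("MTG_DB_NAME", "mtg_db")

    # Percent-encode user and password so reserved characters survive URL parsing
    return (f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


# Example with an in-memory mapping instead of the real environment
example_url = build_db_url({"MTG_DB_PASSWORD": "p@ss:word"})
print(example_url)
```

The resulting string could then be passed to `create_engine` in place of the f-string used in the cell above.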

In [59]:
## Creating the empty data_recency table if not exists
query = """
        CREATE TABLE IF NOT EXISTS raw_data.data_recency (
             json_type      TEXT PRIMARY KEY
            ,latest_date    DATE
            ,latest_version TEXT);
        """
with engine.begin() as conn:
    conn.execute(text(query))

# Clean-Up
del query, conn, text

### Input Data

In [60]:
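# URL for the compressed MTGJSON keywords file (Keywords.json.xz)
url = "https://mtgjson.com/api/v5/Keywords.json.xz"

# Request the compressed file; stream=True defers the body download so that
# iter_content below fetches it chunk by chunk and the progress bar is meaningful
response = requests.get(url, stream=True)
response.raise_for_status()

# Prepare to track total size and read in chunks
total_size = int(response.headers.get('content-length', 0))  # total bytes, 0 if the header is missing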
chunk_size = 1024 * 1024  # 1 MB per chunk
compressed_data = bytearray()  # store the downloaded bytes

# Iterate over response chunks, updating progress bar
with tqdm(total=total_size, unit='B', unit_scale=True, desc="Downloading") as pbar:
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:  # filter out keep-alive chunks
            compressed_data.extend(chunk)
            pbar.update(len(chunk))

# Decompress the .xz file
decompressed_bytes = lzma.decompress(compressed_data)

# Parse JSON into a dictionary
dict__keywords = json.loads(decompressed_bytes)

# Clean-Up
del url, tqdm, total_size, chunk_size, lzma, chunk, json, requests
del response, compressed_data, decompressed_bytes, pbar

Downloading: 100%|██████████| 2.02k/2.02k [00:00<00:00, 25.8MB/s]
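
The decompress-and-parse step can be exercised without network access. A small sketch, assuming a payload with a `data` key like the one the later cells read; the `meta` block, the helper name `load_xz_json` and the sample values are assumptions made up for illustration.

```python
import json
import lzma


def load_xz_json(compressed: bytes) -> dict:
    """Decompress an .xz payload and parse the result as JSON."""
    return json.loads(lzma.decompress(compressed))


# Build a tiny stand-in payload (illustrative values, not real MTGJSON content)
sample = {"meta": {"date": "2025-01-01", "version": "0.0.0+example"},
          "data": {"abilityWords": ["landfall"],
                   "keywordAbilities": ["flying", "haste"],
                   "keywordActions": ["scry"]}}
compressed = lzma.compress(json.dumps(sample).encode("utf-8"))

parsed = load_xz_json(compressed)
print(sorted(parsed["data"]))
```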


## Pre-processing

In [61]:
# Checking the latest version of the input data
df__data_recency = data_recency_check(dict__keywords, 'keyword')

# Clean-Up
del data_recency_check
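
`data_recency_check` lives in the project's `modules` package and its implementation is not shown here. As a rough illustration of what such a check might extract, the sketch below builds one record shaped like a row of `raw_data.data_recency`. The function `recency_record` is hypothetical, and it assumes the date and version sit under a `meta` key of the parsed JSON.

```python
from datetime import date


def recency_record(payload: dict, json_type: str) -> dict:
    """Return one record matching the data_recency columns.

    Hypothetical stand-in, not the project's data_recency_check.
    Assumes payload["meta"] holds ISO-formatted "date" and "version" strings.
    """
    meta = payload["meta"]
    return {"json_type":      json_type,
            "latest_date":    date.fromisoformat(meta["date"]),
            "latest_version": meta["version"]}


record = recency_record({"meta": {"date": "2025-09-28", "version": "5.2.2+20250928"}},
                        "keyword")
print(record)
```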

## Main Code

In [62]:
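## Converting the dictionary to a dataframe, renaming the columns and replacing missing values with empty strings

# Converting the json dictionary to a dataframe.
# The three lists have different lengths, so each key is loaded as a row
# (orient = 'index', shorter rows are padded with missing values) and then transposed into columns
df__keywords = pd.DataFrame.from_dict(dict__keywords['data']
                                     ,orient = 'index').transpose()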

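# Renaming the columns by name, so the result does not depend on the key order in the json file
df__keywords = df__keywords.rename(columns = {'abilityWords'     : 'abilities'
                                             ,'keywordAbilities' : 'keywords'
                                             ,'keywordActions'   : 'actions'})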

# Sort each column independently, pushing NaNs and empty strings to the bottom
df__keywords = df__keywords.apply(lambda col: col.replace('', np.nan)             # Treat empty strings as NaN
                                                 .sort_values(na_position='last') # Sort values, NaNs last
                                                 .fillna('')                      # Replace NaNs with empty strings
                                                 .values)                         # Drop the index so sorted values realign from row 0

# Clean-Up
del dict__keywords, np
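
To make the effect of the reshaping above concrete, here is a pandas-free sketch of the same idea: sort each list on its own, then pad the shorter ones with empty strings so all columns have equal length. The helper `to_padded_rows` and the sample keywords are illustrative assumptions; the notebook itself does this with `DataFrame.apply`.

```python
from itertools import zip_longest


def to_padded_rows(data: dict, renames: dict) -> list:
    """Sort each list independently and zip them into equal-length rows."""
    columns = [renames[key] for key in data]
    # Drop empty strings before sorting; padding re-adds them at the bottom
    sorted_lists = [sorted(v for v in values if v) for values in data.values()]
    # zip_longest pads the shorter lists, like the empty strings in the dataframe
    rows = zip_longest(*sorted_lists, fillvalue="")
    return [dict(zip(columns, row)) for row in rows]


renames = {"abilityWords": "abilities",
           "keywordAbilities": "keywords",
           "keywordActions": "actions"}
sample = {"abilityWords": ["landfall"],
          "keywordAbilities": ["haste", "flying"],
          "keywordActions": ["scry"]}

for row in to_padded_rows(sample, renames):
    print(row)
```

Each output row corresponds to one row of the uploaded table: values within a column are sorted, but values in the same row are not related to each other.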

## Output

In [63]:
# Appending/replacing the metadata of the json download in a central table
recency_check_upload(schema_name = "raw_data"
                    ,table_name  = "data_recency"
                    ,dataframe   = df__data_recency
                    ,engine      = engine)

# Clean-Up
del df__data_recency, recency_check_upload
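
The implementation of `recency_check_upload` is not shown in this notebook. One common way to keep a single row per `json_type` is an upsert on the primary key. The sketch below illustrates that pattern with Python's built-in `sqlite3` so it runs without a PostgreSQL server. This is an assumption about the approach, not the module's actual code, and the inserted values are illustrative. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3

# In-memory database standing in for the raw_data schema
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_recency (
         json_type      TEXT PRIMARY KEY
        ,latest_date    DATE
        ,latest_version TEXT)
""")

upsert = """
    INSERT INTO data_recency (json_type, latest_date, latest_version)
    VALUES (?, ?, ?)
    ON CONFLICT (json_type) DO UPDATE
    SET latest_date    = excluded.latest_date
       ,latest_version = excluded.latest_version
"""

# Two loads of the same json_type: the second overwrites the first
conn.execute(upsert, ("keyword", "2025-09-24", "5.2.2+20250924"))
conn.execute(upsert, ("keyword", "2025-09-28", "5.2.2+20250928"))
conn.commit()

rows = conn.execute("SELECT * FROM data_recency").fetchall()
conn.close()
print(rows)
```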

In [64]:
# Uploading the keywords dataframe to postgresql
df__keywords.to_sql(name      = "keywords"
                   ,con       = engine
                   ,schema    = "raw_data"
                   ,if_exists = "replace"
                   ,index     = False)

# Clean-Up
del df__keywords

## Checks

In [65]:
# Check the json file date and version
query = """
        SELECT *
        FROM raw_data.data_recency
        """
pd.read_sql_query(query, con=engine)

|     | json_type     | latest_date | latest_version |
| --- | ---           | ---         | ---            |
| 0   | all printings | 2025-09-08  | 5.2.2+20250908 |
| 1   | set list      | 2025-09-24  | 5.2.2+20250924 |
| 2   | keyword       | 2025-09-28  | 5.2.2+20250928 |


In [66]:
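# Check the first 10 rows of the keywords table
query = """
        SELECT *
        FROM raw_data.keywords
        LIMIT 10
        """
# Explicit display() call (available in Jupyter): the clean-up statement below ends
# the cell, so a bare expression here would not be rendered as output
display(pd.read_sql_query(query, con=engine))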

# Clean-Up
del engine, pd, query