# 1. API Data Structuring

In this notebook I'll walk through the initial steps of importing MTG card data from an API, followed by the discovery process and the structuring, cleaning, and export of the data so that it can be reused in other data projects.

## 1.1. Resources and Setup

To start, let's set up our project imports. We'll need some components from the SciPy stack, some general utility packages, and a custom module I wrote to isolate the process of querying the API.


In [1]:
# Project imports
import os   # To access environment variables
import sys  # To allow access to my local modules from this notebook

from pathlib import Path    # To ease working with filesystem paths
from datetime import date   # To manage product versions based on date

# Scipy Stack 
import numpy as np
import pandas as pd

# Setting up custom modules
project_folder = Path.cwd().parent

if str(project_folder) not in sys.path:
    sys.path.append(str(project_folder))

from modules.api_client import get_set_data

Next, we define the project constants. These include general details such as a project identifier, the current date, and the paths where the resulting files will be stored. In this case we'll store two files: (1) the raw data as obtained from the API, to avoid repeated queries, and (2) the cleaned data, so that it can be used in other projects.

In [2]:
# Project Constants
PROJECT_CODE = 'mtg_demo' # A short identifier for this project.
PROJECT_DATE = str(date.today())

# Identifiers for the files we intend to produce in this notebook
RAW_PRODUCT = 'scryfall_raw'
PROCESSED_PRODUCT = 'processed_data'

RAW_DATA_OUT_DIR = Path(os.environ['RAW_DATA_DIR_PATH'])
RAW_DATA_PATH = RAW_DATA_OUT_DIR / ('_'.join([PROJECT_CODE,
                                             RAW_PRODUCT,
                                             PROJECT_DATE]))

PROCESSED_DATA_OUT_DIR = Path(os.environ['PROCESSED_DATA_DIR_PATH'])
PROCESSED_DATA_PATH = PROCESSED_DATA_OUT_DIR / ('_'.join([PROJECT_CODE,
                                                         PROCESSED_PRODUCT,
                                                         PROJECT_DATE]))
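
To make the naming scheme concrete, here is a minimal sketch of how one of these paths is composed and what it looks like once a suffix is added. The `build_product_path` helper and the `'data/raw'` fallback directory are hypothetical names used only for illustration; the notebook itself reads the directory straight from the environment and raises a `KeyError` if the variable is unset.

```python
import os
from datetime import date
from pathlib import Path


def build_product_path(out_dir, project_code, product, on_date):
    """Join the identifiers with underscores, as the constants above do."""
    stem = '_'.join([project_code, product, str(on_date)])
    return Path(out_dir) / stem


# Hypothetical fallback so the sketch runs without the environment variable.
out_dir = os.environ.get('RAW_DATA_DIR_PATH', 'data/raw')

example_path = build_product_path(out_dir, 'mtg_demo', 'scryfall_raw',
                                  date(2023, 5, 5))

# The stem contains no dot, so with_suffix() simply appends the extension.
json_path = example_path.with_suffix('.json')
print(json_path.name)
```

Keeping the date in the file name means each day's run produces its own file, so an older export is never silently overwritten.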

## 1.2. Obtaining the Card Data

There are multiple sources we could import the data from. I chose [Scryfall's API](https://scryfall.com/docs/api) because I'm familiar with their platform and documentation. As mentioned earlier, the specifics of the request process are kept in the `api_client.py` module in the `modules` folder of this project.

In [3]:
raw_set_data = get_set_data(
    RAW_DATA_PATH.with_suffix('.json'),
    query='(s:mom or s:stx) -is:rebalanced'
)

2023-05-05 12:08:27.341 | INFO     | modules.api_client:get_set_data:10 - 
 - previously stored data found. skipping request.
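
The log line above shows that the module reuses stored data when it finds it. The module itself is not reproduced in this notebook, so the following is only a sketch of that cache-then-request pattern, not the actual `get_set_data` code. It assumes Scryfall's `/cards/search` endpoint, whose paged responses carry `data`, `has_more` and `next_page` fields; the function names, the header values and the 0.1 second pause are my own assumptions, and it returns a plain list of card dictionaries rather than a DataFrame.

```python
import json
import tempfile
import time
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SEARCH_URL = 'https://api.scryfall.com/cards/search'


def fetch_json(url):
    """Download one page of results and decode it (assumed header values)."""
    request = Request(url, headers={'User-Agent': 'mtg-demo-notebook/0.1',
                                    'Accept': 'application/json'})
    with urlopen(request) as response:
        return json.load(response)


def get_cards_cached(cache_path, query, fetch=fetch_json):
    """Hypothetical stand-in for get_set_data: cache first, request second."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Previously stored data found: skip the request.
        return json.loads(cache_path.read_text(encoding='utf-8'))

    cards = []
    url = SEARCH_URL + '?' + urlencode({'q': query})
    while url:
        page = fetch(url)
        cards.extend(page['data'])
        url = page.get('next_page') if page.get('has_more') else None
        if url:
            time.sleep(0.1)  # Be polite between pages.

    cache_path.write_text(json.dumps(cards), encoding='utf-8')
    return cards


# Usage with a fake fetcher, so the sketch runs without network access.
def fake_fetch(url):
    return {'data': [{'name': 'Example Card'}], 'has_more': False}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / 'cards.json'
    first_call = get_cards_cached(cache_file, 's:mom', fetch=fake_fetch)
    second_call = get_cards_cached(cache_file, 's:mom', fetch=None)
```

The second call never touches the fetcher, which is the behaviour the log message describes.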



## 1.3. Processing the Data

To familiarize ourselves with the data, we first run the **`pandas.DataFrame.info()`** method. This displays the size of the dataframe along with the name, dtype, and non-null count of every column. Passing **`verbose=True`** might be needed to display all the details.



In [4]:
raw_set_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 571 entries, 0 to 570
Data columns (total 72 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   object             571 non-null    object        
 1   id                 571 non-null    object        
 2   oracle_id          571 non-null    object        
 3   multiverse_ids     571 non-null    object        
 4   mtgo_id            571 non-null    int64         
 5   arena_id           571 non-null    int64         
 6   tcgplayer_id       571 non-null    int64         
 7   cardmarket_id      556 non-null    float64       
 8   name               571 non-null    object        
 9   lang               571 non-null    object        
 10  released_at        571 non-null    datetime64[ns]
 11  uri                571 non-null    object        
 12  scryfall_uri       571 non-null    object        
 13  layout             571 non-null    object        
 14  highres_im
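
For reference, here is a small sketch of the same call on a toy frame, with the options that force the per-column listing spelled out. The toy data is made up; `verbose`, `show_counts` and `buf` are regular parameters of `DataFrame.info()` in recent pandas versions.

```python
import io

import pandas as pd

toy_cards = pd.DataFrame({
    'name': ['Card A', 'Card B', 'Card C'],
    'power': [1.0, None, 3.0],
})

# Write the report to a buffer so it can be inspected as a string.
buffer = io.StringIO()
toy_cards.info(verbose=True, show_counts=True, buf=buffer)
report = buffer.getvalue()
print(report)
```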

The card data we are requesting belongs to the sets March of the Machine (MOM) and Strixhaven (STX). In my experience, exploring the contents of a set with data tools accelerates the process of becoming deeply familiar with the themes and composition of the set. 

Say we are now interested in the columns that describe gameplay features of the cards, to ease the process of playing with new cards. We can start by filtering the data with a list of those features. To verify which features are relevant to gameplay, we can review the descriptions in [Scryfall's API documentation](https://scryfall.com/docs/api/cards).

In [5]:
gameplay_features = ['id', 'name', 'layout', 'mana_cost', 'cmc', 
                     'type_line','oracle_text','colors', 'card_faces',
                     'rarity', 'power', 'toughness', 'loyalty', 
                     'image_uris']

set_data = raw_set_data[gameplay_features]
set_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 571 entries, 0 to 570
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           571 non-null    object 
 1   name         571 non-null    object 
 2   layout       571 non-null    object 
 3   mana_cost    494 non-null    object 
 4   cmc          571 non-null    int64  
 5   type_line    571 non-null    object 
 6   oracle_text  494 non-null    object 
 7   colors       494 non-null    object 
 8   card_faces   77 non-null     object 
 9   rarity       571 non-null    object 
 10  power        250 non-null    float64
 11  toughness    250 non-null    float64
 12  loyalty      5 non-null      float64
 13  image_uris   494 non-null    object 
dtypes: float64(3), int64(1), object(10)
memory usage: 66.9+ KB


Also, in the documentation we can find the following key details about the data:

1. Any row corresponding to a double-faced card (DFC) has an array of JSON data with the features of each face in the `card_faces` column.

2. We can tell if a card is a DFC by looking at the `layout` column.

3. `cmc`, `colors`, `image_uris`, `layout`, `loyalty`, `mana_cost`, `name`, `power`, `toughness`, and `type_line` are the properties of each card face in the `card_faces` array, if any. I'll also include the recently released `defense` feature, which has not been included in the documentation yet.
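
The three points above can be pictured with a toy card object. Every value here is invented for illustration and is not a real Scryfall record; the point is only that a double-faced card keeps per-face details inside `card_faces` while the card-level entry can lack them.

```python
# A made-up double-faced card: per-face details live inside `card_faces`.
toy_card = {
    'name': 'Front Name // Back Name',
    'layout': 'transform',
    'rarity': 'common',
    'card_faces': [
        {'name': 'Front Name', 'mana_cost': '{1}{B}',
         'power': '1', 'toughness': '1'},
        {'name': 'Back Name', 'mana_cost': '',
         'power': '3', 'toughness': '3'},
    ],
}

DFC_LAYOUTS = ('transform', 'modal_dfc')

# Point 2: the layout tells us whether the card is double-faced.
is_dfc = toy_card['layout'] in DFC_LAYOUTS

# Points 1 and 3: features missing at card level are found on each face.
missing_at_card_level = 'mana_cost' not in toy_card
face_costs = [face['mana_cost'] for face in toy_card['card_faces']]

print(is_dfc, missing_at_card_level, face_costs)
```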


Let's verify the contents of the layout column.

In [6]:
set_data.layout.value_counts()

normal       494
transform     61
modal_dfc     16
Name: layout, dtype: int64

### 1.3.1. Extracting the properties of each card face
The `card_faces` column of the 77 non-normal layout cards should help fill the null values in the columns that have only 494 non-null values. Let's write a function that extracts the features from the card face array.

In [7]:
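def extract_card_face_features(row, features):
    """Copy the given features from a row's card face onto the row itself."""
    # Only double-faced layouts carry per-face data in `card_faces`.
    if row['layout'] in ('transform', 'modal_dfc'):
        for feature in features:
            try:
                row[feature] = row['card_faces'][feature]
            except KeyError:
                # The face lacks this feature: mark it as missing.
                row[feature] = np.nan
    return row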

The following chain of operations should deliver the desired data.

In [8]:
card_face_features = ['cmc', 'image_uris','layout','loyalty','mana_cost',
                      'colors','power','name','toughness','type_line',
                      'defense', 'oracle_text']

dfc_set_data = (set_data
    .query('layout == "transform" or layout == "modal_dfc"')
    .explode('card_faces')
    .apply(extract_card_face_features,
           axis='columns',
           features= card_face_features)
    .drop(columns='card_faces'))

dfc_set_data.head(4)

Unnamed: 0,id,name,layout,mana_cost,cmc,type_line,oracle_text,colors,rarity,power,toughness,loyalty,image_uris,defense
5,dad34ae5-56b4-4394-be02-e043dc1cc23d,Aetherblade Agent,,{1}{B},,Creature — Human Rogue,Deathtouch\n{4}{U/P}: Transform Aetherblade Ag...,[B],common,1.0,1.0,,{'small': 'https://cards.scryfall.io/small/fro...,
5,dad34ae5-56b4-4394-be02-e043dc1cc23d,Gitaxian Mindstinger,,,,Creature — Phyrexian Rogue,Deathtouch\nWhenever Gitaxian Mindstinger deal...,"[B, U]",common,3.0,3.0,,{'small': 'https://cards.scryfall.io/small/bac...,
26,d9131fc3-018a-4975-8795-47be3956160d,Augmenter Pugilist,,{1}{G}{G},,Creature — Troll Druid,Trample\nAs long as you control eight or more ...,[G],rare,3.0,3.0,,{'small': 'https://cards.scryfall.io/small/fro...,
26,d9131fc3-018a-4975-8795-47be3956160d,Echoing Equation,,{3}{U}{U},,Sorcery,Choose target creature you control. Each other...,[U],rare,,,,{'small': 'https://cards.scryfall.io/small/bac...,
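
Notice that each index label (5, 26) appears twice in the output above. That comes from `explode`, which turns every element of a list into its own row and repeats the original index for each of them. A toy sketch of that behaviour, with made-up face names:

```python
import pandas as pd

toy_cards = pd.DataFrame(
    {'id': ['a', 'b'],
     'card_faces': [[{'name': 'A front'}, {'name': 'A back'}],
                    [{'name': 'B front'}, {'name': 'B back'}]]},
    index=[5, 26])

# Default behaviour: one row per face, original index labels repeated.
kept_index = toy_cards.explode('card_faces')

# With ignore_index=True the result gets a fresh 0..n-1 index instead.
fresh_index = toy_cards.explode('card_faces', ignore_index=True)

print(list(kept_index.index))
print(list(fresh_index.index))
```

After exploding, each cell of `card_faces` holds a single dictionary, which is why the extraction function can index it directly by feature name.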


It looks mostly fine, but there are some columns with null values. Let's look deeper.

In [9]:
dfc_set_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154 entries, 5 to 546
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           154 non-null    object 
 1   name         154 non-null    object 
 2   layout       0 non-null      float64
 3   mana_cost    154 non-null    object 
 4   cmc          0 non-null      float64
 5   type_line    154 non-null    object 
 6   oracle_text  154 non-null    object 
 7   colors       154 non-null    object 
 8   rarity       154 non-null    object 
 9   power        88 non-null     object 
 10  toughness    88 non-null     object 
 11  loyalty      4 non-null      object 
 12  image_uris   154 non-null    object 
 13  defense      36 non-null     object 
dtypes: float64(2), object(12)
memory usage: 18.0+ KB


Indeed, the `layout` and `cmc` columns are now always null. We should remove them from the `card_face_features` list.
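
The nulls appear because the card faces in this data don't carry those two keys, so the `KeyError` branch overwrites the card-level value with `NaN`. Trimming the feature list is the route taken here. Purely as an alternative sketch, a variant could leave the card-level value untouched whenever the face lacks the key. The `fill_from_face` name is hypothetical, the sketch works on plain dictionaries rather than DataFrame rows, and it assumes the card-level value is an acceptable fallback.

```python
def fill_from_face(row, features):
    """Copy only the features the face actually has; keep the rest as-is."""
    face = row.get('card_faces')
    if not isinstance(face, dict):
        return row  # Not an exploded double-faced row: nothing to copy.

    updated = dict(row)
    for feature in features:
        if feature in face:
            updated[feature] = face[feature]
    return updated


# Made-up row: the face has a name and a mana cost, but no layout or cmc.
toy_row = {
    'name': 'Front Name // Back Name',
    'layout': 'transform',
    'cmc': 2,
    'card_faces': {'name': 'Front Name', 'mana_cost': '{1}{B}'},
}

filled = fill_from_face(toy_row, ['name', 'mana_cost', 'layout', 'cmc'])
print(filled['name'], filled['layout'], filled['cmc'])
```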

In [10]:
card_face_features = ['image_uris', 'loyalty','mana_cost',
                      'colors','power','name','toughness','type_line',
                      'defense', 'oracle_text']

dfc_set_data = (set_data
    .query('layout == "transform" or layout == "modal_dfc"')
    .explode('card_faces', ignore_index=True)
    .apply(extract_card_face_features,
           axis='columns',
           features= card_face_features)
    .drop(columns='card_faces'))

dfc_set_data.head(4)

Unnamed: 0,id,name,layout,mana_cost,cmc,type_line,oracle_text,colors,rarity,power,toughness,loyalty,image_uris,defense
0,dad34ae5-56b4-4394-be02-e043dc1cc23d,Aetherblade Agent,transform,{1}{B},2,Creature — Human Rogue,Deathtouch\n{4}{U/P}: Transform Aetherblade Ag...,[B],common,1.0,1.0,,{'small': 'https://cards.scryfall.io/small/fro...,
1,dad34ae5-56b4-4394-be02-e043dc1cc23d,Gitaxian Mindstinger,transform,,2,Creature — Phyrexian Rogue,Deathtouch\nWhenever Gitaxian Mindstinger deal...,"[B, U]",common,3.0,3.0,,{'small': 'https://cards.scryfall.io/small/bac...,
2,d9131fc3-018a-4975-8795-47be3956160d,Augmenter Pugilist,modal_dfc,{1}{G}{G},3,Creature — Troll Druid,Trample\nAs long as you control eight or more ...,[G],rare,3.0,3.0,,{'small': 'https://cards.scryfall.io/small/fro...,
3,d9131fc3-018a-4975-8795-47be3956160d,Echoing Equation,modal_dfc,{3}{U}{U},3,Sorcery,Choose target creature you control. Each other...,[U],rare,,,,{'small': 'https://cards.scryfall.io/small/bac...,


In [11]:
dfc_set_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           154 non-null    object
 1   name         154 non-null    object
 2   layout       154 non-null    object
 3   mana_cost    154 non-null    object
 4   cmc          154 non-null    int64 
 5   type_line    154 non-null    object
 6   oracle_text  154 non-null    object
 7   colors       154 non-null    object
 8   rarity       154 non-null    object
 9   power        88 non-null     object
 10  toughness    88 non-null     object
 11  loyalty      4 non-null      object
 12  image_uris   154 non-null    object
 13  defense      36 non-null     object
dtypes: int64(1), object(13)
memory usage: 17.0+ KB


Although we no longer have null values in the `layout` and `cmc` columns, there is still an issue with the `cmc` column. In case you are not familiar with MTG, cmc stands for converted mana cost, and it represents the numeric magnitude of the mana cost. The problem comes up with modal DFCs which, unlike cards with the transform layout, don't necessarily share their cmc between faces. If the two faces of a modal DFC have different `mana_cost` values, their cmc is also likely to differ, but in our data it's always the same. To resolve this, I wrote a script (in the `modules` folder) that parses the mana cost and calculates the cmc. We'll apply it to every row containing a card with the `modal_dfc` layout.

In [12]:
from modules.mana_cost_parser import get_cmc

def set_mdfc_cmc(row):
    if row['layout'] == 'modal_dfc':
        row['cmc'] = get_cmc(row['mana_cost'])
    
    return row
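
The `get_cmc` implementation lives in the project's modules and is not shown in this notebook. To give a rough idea of what such a parser involves, here is a minimal sketch under my own assumptions: symbols are wrapped in braces, numeric symbols count for their value, variable symbols such as `X` count as 0, hybrid symbols count for their largest component, and every other symbol counts as 1. The `sketch_cmc` name is hypothetical, and rarer symbols (half mana, for instance) are not handled.

```python
import re

# Matches the content of every {...} symbol in a mana cost string.
SYMBOL_PATTERN = re.compile(r'\{([^}]+)\}')


def part_value(part):
    """Value of one component of a mana symbol."""
    if part.isdigit():
        return int(part)   # Generic mana such as {3}
    if part in ('X', 'Y', 'Z'):
        return 0           # Variable costs count as zero here
    return 1               # Colored, colorless or Phyrexian component


def sketch_cmc(mana_cost):
    """Hypothetical stand-in for get_cmc: sum the value of every symbol."""
    if not isinstance(mana_cost, str):
        return 0  # Missing costs (NaN) contribute nothing.
    total = 0
    for symbol in SYMBOL_PATTERN.findall(mana_cost):
        # Hybrid symbols like {2/W} count for their largest component.
        total += max(part_value(part) for part in symbol.split('/'))
    return total


for cost in ['{1}{G}{G}', '{3}{U}{U}', '']:
    print(repr(cost), sketch_cmc(cost))
```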

In [13]:

(dfc_set_data
    .apply(set_mdfc_cmc, axis='columns')).head(4)


Unnamed: 0,id,name,layout,mana_cost,cmc,type_line,oracle_text,colors,rarity,power,toughness,loyalty,image_uris,defense
0,dad34ae5-56b4-4394-be02-e043dc1cc23d,Aetherblade Agent,transform,{1}{B},2,Creature — Human Rogue,Deathtouch\n{4}{U/P}: Transform Aetherblade Ag...,[B],common,1.0,1.0,,{'small': 'https://cards.scryfall.io/small/fro...,
1,dad34ae5-56b4-4394-be02-e043dc1cc23d,Gitaxian Mindstinger,transform,,2,Creature — Phyrexian Rogue,Deathtouch\nWhenever Gitaxian Mindstinger deal...,"[B, U]",common,3.0,3.0,,{'small': 'https://cards.scryfall.io/small/bac...,
2,d9131fc3-018a-4975-8795-47be3956160d,Augmenter Pugilist,modal_dfc,{1}{G}{G},3,Creature — Troll Druid,Trample\nAs long as you control eight or more ...,[G],rare,3.0,3.0,,{'small': 'https://cards.scryfall.io/small/fro...,
3,d9131fc3-018a-4975-8795-47be3956160d,Echoing Equation,modal_dfc,{3}{U}{U},5,Sorcery,Choose target creature you control. Each other...,[U],rare,,,,{'small': 'https://cards.scryfall.io/small/bac...,


Now that we know everything is working correctly, let's apply it to the main dataset.

In [14]:
set_data = (set_data
    .explode('card_faces')
    .apply(extract_card_face_features,
           axis='columns',
           features=card_face_features)
    .apply(set_mdfc_cmc,
           axis='columns')
    .drop(columns='card_faces')
    .reset_index(drop=True))

In [15]:
set_data

Unnamed: 0,cmc,colors,defense,id,image_uris,layout,loyalty,mana_cost,name,oracle_text,power,rarity,toughness,type_line
0,1,[R],,4620cc3b-e401-4096-b310-fed080806344,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{R},Academic Dispute,Target creature blocks this turn if able. You ...,,uncommon,,Instant
1,2,[W],,05521edf-f47f-4e7a-aec5-cdc4ae7368c2,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{1}{W},Academic Probation,Choose one —\n• Choose a nonland card name. Op...,,rare,,Sorcery — Lesson
2,0,[],,edf8eb51-9643-4c54-b38e-e7abea92bbe1,{'small': 'https://cards.scryfall.io/small/fro...,normal,,,Access Tunnel,"{T}: Add {C}.\n{3}, {T}: Target creature with ...",,uncommon,,Land
3,4,[G],,0d7b7830-b65e-4c53-98e8-152026764e4b,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{3}{G},Accomplished Alchemist,{T}: Add one mana of any color.\n{T}: Add X ma...,2.0,rare,5.0,Creature — Elf Druid
4,2,[W],,f7017afb-4c7c-4c8d-9c9d-3f056a55561e,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{1}{W},Aerial Boost,Convoke (Your creatures can help cast this spe...,,common,,Instant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
643,3,[W],,277e5b49-c53f-4bf7-aac0-950d8708b957,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{2}{W},Zhalfirin Lancer,Whenever another Knight enters the battlefield...,3.0,uncommon,3.0,Creature — Human Knight
644,2,[U],,e446a380-0316-46f7-8ac9-22bce773b35f,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{1}{U},Zhalfirin Shapecraft,Target creature has base power and toughness 4...,,common,,Instant
645,3,"[B, G, U]",,bf2af874-1052-4cad-90ed-d80e49d4c68c,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{B}{G}{U},Zimone and Dina,"Whenever you draw your second card each turn, ...",3.0,mythic,4.0,Legendary Creature — Human Dryad
646,2,"[G, U]",,0ca14c17-dc72-4f68-92f2-14a6c4019f4e,{'small': 'https://cards.scryfall.io/small/fro...,normal,,{G}{U},"Zimone, Quandrix Prodigy","{1}, {T}: You may put a land card from your ha...",1.0,uncommon,2.0,Legendary Creature — Human Wizard


In [16]:
set_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   cmc          648 non-null    int64 
 1   colors       648 non-null    object
 2   defense      36 non-null     object
 3   id           648 non-null    object
 4   image_uris   648 non-null    object
 5   layout       648 non-null    object
 6   loyalty      9 non-null      object
 7   mana_cost    648 non-null    object
 8   name         648 non-null    object
 9   oracle_text  648 non-null    object
 10  power        338 non-null    object
 11  rarity       648 non-null    object
 12  toughness    338 non-null    object
 13  type_line    648 non-null    object
dtypes: int64(1), object(13)
memory usage: 71.0+ KB
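
As a quick sanity check, the 648 rows can be derived from the layout counts seen earlier, assuming every double-faced card contributes exactly two faces (which matches the 154 rows obtained from the 77 double-faced cards):

```python
# Layout counts taken from the value_counts() output earlier in the notebook.
layout_counts = {'normal': 494, 'transform': 61, 'modal_dfc': 16}
two_faced_layouts = ('transform', 'modal_dfc')

# Normal cards keep one row; double-faced cards become one row per face.
expected_rows = sum(
    count * 2 if layout in two_faced_layouts else count
    for layout, count in layout_counts.items()
)

print(expected_rows)  # 494 + 2 * (61 + 16) = 648
```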


### 1.3.2. Extracting the image uris

Since we are keeping the `image_uris` data only as a simple reference to the card, we don't need to store all the image versions for every card. We can write a function that keeps just the `normal` version for every row of the dataset.

In [17]:
# Defining a function to extract normal image uris
def get_normal_uri(row):
    row['normal_image_uri'] = row['image_uris']['normal']
    return row

In [18]:
set_data = (
    set_data
        .apply(
            get_normal_uri,
            axis='columns')
        .drop(
            columns='image_uris'))

set_data.normal_image_uri.head(3)

0    https://cards.scryfall.io/normal/front/4/6/462...
1    https://cards.scryfall.io/normal/front/0/5/055...
2    https://cards.scryfall.io/normal/front/e/d/edf...
Name: normal_image_uri, dtype: object
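
The same result can also be reached by working on the column directly instead of row by row. This is only an alternative sketch on a toy frame with placeholder URLs, not a change to the approach used above:

```python
import pandas as pd

toy_cards = pd.DataFrame({
    'name': ['Card A', 'Card B'],
    'image_uris': [
        {'small': 'https://example.invalid/small/a.jpg',
         'normal': 'https://example.invalid/normal/a.jpg'},
        {'small': 'https://example.invalid/small/b.jpg',
         'normal': 'https://example.invalid/normal/b.jpg'},
    ],
})

# Pull the 'normal' entry out of every dictionary in the column.
toy_cards['normal_image_uri'] = toy_cards['image_uris'].map(
    lambda uris: uris['normal'])
toy_cards = toy_cards.drop(columns='image_uris')

print(toy_cards)
```

Mapping over a single column avoids rebuilding each row, which tends to matter more as the dataset grows.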

### 1.3.3. Encoding the type line and color categories

The `type_line` feature describes the supertypes, types, and subtypes of a card through a string of text. We'd like to break it down into distinct features to ease insight extraction. To achieve this, we'll split the type line into a list and encode any relevant types found in it.

In [19]:
import re

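def parse_type_line(row, relevant_types):
    """Flag each relevant type found in the row's type line as is_<type> = 1."""
    type_line = row['type_line'].split(' ')

    for segment in type_line:
        # Skip segments that don't start with a letter, such as the dash separator.
        if re.match(r'[^a-zA-Z]', segment):
            continue

        if segment in relevant_types:
            category = 'is_' + segment.lower()
            row[category] = 1

    return row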
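
To see what the type line contains before encoding it, here is a small sketch that splits it into the words before and after the long dash seen in the data (for example in the `Sorcery` / `Lesson` type line shown earlier). The `split_type_line` helper is hypothetical and is not part of the processing above.

```python
DASH = '\u2014'  # The long dash that separates types from subtypes


def split_type_line(type_line):
    """Return (words before the dash, words after the dash)."""
    main_part, _, sub_part = type_line.partition(DASH)
    return main_part.split(), sub_part.split()


samples = [
    'Instant',
    f'Sorcery {DASH} Lesson',
    f'Legendary Creature {DASH} Human Wizard',
]

for sample in samples:
    print(split_type_line(sample))
```

The function above takes a simpler route: it checks every word against the list of relevant types, so the position relative to the dash does not matter.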

In [20]:
relevant_types = ['Basic', 'Snow', 'Legendary', 'Instant', 'Creature',
                  'Battle', 'Enchantment', 'Artifact', 'Land', 'Sorcery',
                  'Planeswalker']

set_data = (
    set_data
        .apply(
            parse_type_line,
            relevant_types=relevant_types,
            axis='columns'))

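for card_type in relevant_types:
    col_flag = 'is_' + card_type.lower()
    try:
        # Turn the 1/NaN markers into proper booleans.
        set_data[col_flag] = set_data[col_flag].notna()
    except KeyError:
        # No card in the data has this type, so the column was never created.
        continue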

In [21]:
set_data.head(2)

Unnamed: 0,cmc,colors,defense,id,is_artifact,is_basic,is_battle,is_creature,is_enchantment,is_instant,...,layout,loyalty,mana_cost,name,normal_image_uri,oracle_text,power,rarity,toughness,type_line
0,1,[R],,4620cc3b-e401-4096-b310-fed080806344,False,False,False,False,False,True,...,normal,,{R},Academic Dispute,https://cards.scryfall.io/normal/front/4/6/462...,Target creature blocks this turn if able. You ...,,uncommon,,Instant
1,2,[W],,05521edf-f47f-4e7a-aec5-cdc4ae7368c2,False,False,False,False,False,False,...,normal,,{1}{W},Academic Probation,https://cards.scryfall.io/normal/front/0/5/055...,Choose one —\n• Choose a nonland card name. Op...,,rare,,Sorcery — Lesson
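
With the row-wise approach, a type that never appears in the data gets no column at all. If a fixed set of flag columns is preferred, one option is to build each flag directly from the `type_line` column. This is only a sketch on made-up rows, not the method used above:

```python
import pandas as pd

toy_cards = pd.DataFrame({
    'type_line': ['Instant',
                  'Legendary Creature \u2014 Human Wizard',
                  'Land'],
})
wanted_types = ['Legendary', 'Creature', 'Instant', 'Battle']

# Split each type line into words once, then test membership per type.
words = toy_cards['type_line'].str.split()
for card_type in wanted_types:
    toy_cards['is_' + card_type.lower()] = words.map(
        lambda line_words, t=card_type: t in line_words)

print(toy_cards)
```

Here `is_battle` exists even though no toy row is a Battle, so downstream code can rely on every flag being present.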


A similar process can be done to separate the color features.
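
As a sketch of what that could look like, the snippet below turns the `colors` lists into one boolean column per color. The column names (`is_white`, `is_colorless`, and so on) and the toy rows are my own choices for illustration; the letter codes are the ones that appear in the `colors` column above.

```python
import pandas as pd

COLOR_NAMES = {'W': 'white', 'U': 'blue', 'B': 'black',
               'R': 'red', 'G': 'green'}

toy_cards = pd.DataFrame({
    'name': ['Mono red card', 'Three color card', 'Colorless card'],
    'colors': [['R'], ['B', 'G', 'U'], []],
})

# One boolean flag per color, based on membership in the colors list.
for code, color in COLOR_NAMES.items():
    toy_cards['is_' + color] = toy_cards['colors'].map(
        lambda card_colors, c=code: c in card_colors)

# Cards with an empty colors list are colorless.
toy_cards['is_colorless'] = toy_cards['colors'].map(len) == 0

print(toy_cards.drop(columns='colors'))
```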