# 1. API Data Structuring

In this notebook I'll walk through the initial steps of importing MTG card data from an API, followed by exploring, structuring, cleaning, and exporting it so that it can be used in other data projects later on.

## 1.1. Resources and Setup

To start, let's set up our project imports. We'll need some components from the SciPy stack, some general utility packages, and a custom module I wrote to isolate the process of querying the API.


In [1]:
# Project imports
import os   # To access environment variables
import sys  # To allow access to my local modules from this notebook

from pathlib import Path    # To ease working with filesystem paths
from datetime import date   # To manage product versions based on date

# SciPy stack
import numpy as np
import pandas as pd

# Setting up custom modules
project_folder = Path.cwd().parent

if str(project_folder) not in sys.path:
    sys.path.append(str(project_folder))

from modules.api_client import get_set_data

Next, we define the project constants. These include general details such as a project identifier, the current date, and the paths where any resulting files should be stored. In this case we'll store two files: (1) the raw data as obtained from the API, to avoid repeated queries, and (2) the cleaned data, so that it can be used in other projects.

In [2]:
# Project Constants
PROJECT_CODE = 'mtg_demo' # A short identifier for this project.
PROJECT_DATE = str(date.today())

# Identifiers for the files we intend to produce in this notebook
RAW_PRODUCT = 'scryfall_raw'
PROCESSED_PRODUCT = 'processed_data'

RAW_DATA_OUT_DIR = Path(os.environ['RAW_DATA_DIR_PATH'])
RAW_DATA_PATH = RAW_DATA_OUT_DIR / ('_'.join([PROJECT_CODE,
                                             RAW_PRODUCT,
                                             PROJECT_DATE]))

PROCESSED_DATA_OUT_DIR = Path(os.environ['PROCESSED_DATA_DIR_PATH'])
PROCESSED_DATA_PATH = PROCESSED_DATA_OUT_DIR / ('_'.join([PROJECT_CODE,
                                                         PROCESSED_PRODUCT,
                                                         PROJECT_DATE]))
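
To make the naming scheme concrete, the sketch below rebuilds the same kind of path from its parts. The helper `build_product_path` and the temporary directory are hypothetical stand-ins (the notebook itself reads its directories from environment variables); only standard `pathlib` behaviour is relied on.

```python
from datetime import date
from pathlib import Path
import tempfile


def build_product_path(out_dir, project_code, product, version):
    """Join the identifying parts with underscores under out_dir (hypothetical helper)."""
    return Path(out_dir) / "_".join([project_code, product, version])


# A temporary directory stands in for RAW_DATA_DIR_PATH in this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = build_product_path(tmp_dir, "mtg_demo", "scryfall_raw",
                                  str(date(2023, 5, 1)))
    # The file extension is added only when the file is actually written.
    json_path = raw_path.with_suffix(".json")
    json_name = json_path.name
    parent_exists = json_path.parent.is_dir()

print(json_name)
```

Because the stem contains no dots, `with_suffix` simply appends the extension; a project code containing a dot would have its trailing part replaced instead.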

## 1.2. Obtaining the Card Data

There are multiple sources we could import the data from. I chose [Scryfall's API](https://scryfall.com/docs/api) because I'm familiar with their platform and documentation. As I mentioned earlier, the specifics of the request process are kept in the `api_client.py` module in the `modules` folder of this project.

In [3]:
raw_set_data = get_set_data(
    RAW_DATA_PATH.with_suffix('.json'),
    query='(s:mom or s:stx) -is:rebalanced'
)

2023-05-01 17:15:57.668 | INFO     | modules.api_client:get_set_data:16 - 
 - requesting api for data...

2023-05-01 17:16:01.059 | INFO     | modules.api_client:get_set_data:20 - 
 - storing json data for future reference...
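
The module itself is not shown in this notebook, so the following is only a guess at its general shape, based on the two log messages above: reuse a cached JSON file when one exists, otherwise page through the search results and store them. The function name `fetch_cards_cached`, the injectable `fetch_json` argument, and returning a plain list of dicts (rather than a DataFrame) are all assumptions made for illustration; the `data`, `has_more`, and `next_page` fields come from Scryfall's documented list object.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

SEARCH_URL = "https://api.scryfall.com/cards/search"


def _http_get_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "mtg-demo-sketch/0.1", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_cards_cached(cache_path, query, fetch_json=_http_get_json):
    """Return a list of card dicts, reading from cache_path when it exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A previous run already stored the result, so skip the API entirely.
        return json.loads(cache_path.read_text(encoding="utf-8"))

    cards = []
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})
    while url:
        page = fetch_json(url)
        cards.extend(page["data"])
        # List objects carry the next page URL while has_more is true.
        url = page.get("next_page") if page.get("has_more") else None

    cache_path.write_text(json.dumps(cards), encoding="utf-8")
    return cards
```

Passing the fetcher as an argument keeps the caching and paging logic testable without network access. A real client should also pause briefly between pages, since Scryfall asks API users to rate-limit their requests.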



## 1.3. Processing the Data

To first familiarize ourselves with the data, we run the **`pandas.DataFrame.info()`** method. This should display the size of the dataframe along with the name, dtype, and non-null count of every column. Adding **`verbose=True`** as a parameter might be needed to display all the details.

In [4]:
raw_set_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 571 entries, 0 to 570
Data columns (total 72 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   object             571 non-null    object 
 1   id                 571 non-null    object 
 2   oracle_id          571 non-null    object 
 3   multiverse_ids     571 non-null    object 
 4   mtgo_id            571 non-null    int64  
 5   arena_id           571 non-null    int64  
 6   tcgplayer_id       571 non-null    int64  
 7   cardmarket_id      556 non-null    float64
 8   name               571 non-null    object 
 9   lang               571 non-null    object 
 10  released_at        571 non-null    object 
 11  uri                571 non-null    object 
 12  scryfall_uri       571 non-null    object 
 13  layout             571 non-null    object 
 14  highres_image      571 non-null    bool   
 15  image_status       571 non-null    object 
 16  image_uris         494 non
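
As a small illustration on made-up data (the toy frame below is not part of the card data), `info()` can be asked to list every column with its non-null count, and `isna().sum()` gives the complementary null count per column. The `show_counts` parameter assumes a reasonably recent pandas version.

```python
import io

import pandas as pd

# Toy frame with one missing value, standing in for the real card data.
toy = pd.DataFrame({
    "name": ["Card A", "Card B", "Card C"],
    "cardmarket_id": [101.0, None, 103.0],
})

# info() prints to stdout by default; a buffer lets us keep the text.
buffer = io.StringIO()
toy.info(verbose=True, show_counts=True, buf=buffer)
summary = buffer.getvalue()
print(summary)

# Null values per column, the complement of the non-null counts above.
null_counts = toy.isna().sum()
print(null_counts)
```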

The card data we requested belongs to the sets March of the Machine (MOM) and Strixhaven (STX). In my experience, exploring the contents of a set with data tools accelerates the process of becoming deeply familiar with the themes and composition of the set.

Say we are now interested in the columns that describe gameplay features of the cards, to ease the process of playing with new cards. We can start by filtering the data with a list of the features relevant to gameplay. We can verify which features are relevant by reviewing the descriptions in [Scryfall's API documentation](https://scryfall.com/docs/api/cards).

In [5]:
gameplay_features = ['name', 'layout', 'mana_cost', 'cmc', 'type_line',
                     'oracle_text', 'colors', 'color_identity', 'card_faces',
                     'rarity', 'power', 'toughness', 'loyalty', 'id', 'image_uris']

set_data = raw_set_data[gameplay_features]
set_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 571 entries, 0 to 570
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            571 non-null    object 
 1   layout          571 non-null    object 
 2   mana_cost       494 non-null    object 
 3   cmc             571 non-null    float64
 4   type_line       571 non-null    object 
 5   oracle_text     494 non-null    object 
 6   colors          494 non-null    object 
 7   color_identity  571 non-null    object 
 8   card_faces      77 non-null     object 
 9   rarity          571 non-null    object 
 10  power           250 non-null    object 
 11  toughness       250 non-null    object 
 12  loyalty         5 non-null      object 
 13  id              571 non-null    object 
 14  image_uris      494 non-null    object 
dtypes: float64(1), object(14)
memory usage: 67.0+ KB
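
Selecting with a list raises a `KeyError` if any requested column is absent, which may happen if a query returns no card carrying a given field. A defensive variant, sketched here on a toy frame with a hypothetical `select_features` helper, reports the missing names first:

```python
import pandas as pd


def select_features(frame, features):
    """Return a copy of frame restricted to features, failing early on missing names."""
    missing = [column for column in features if column not in frame.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # .copy() gives an independent frame, so later edits don't touch the raw data.
    return frame[features].copy()


# Toy frame standing in for the raw card data.
toy = pd.DataFrame({"name": ["Card A"], "cmc": [2.0], "lang": ["en"]})
subset = select_features(toy, ["name", "cmc"])
print(list(subset.columns))
```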


Also, in the documentation we can find four key details about the data:

1. Any row corresponding to a double-face card (dfc) has an array of json data with the features of each face in the `card_faces` column.

2. We can tell if a card is a dfc by looking at the `layout` column.

3. A `transform` layout indicates that the card starts on a given face (usually the front) and then flips (transforms) into the other face.

4. A `modal_dfc` layout indicates that the card can be played as either the front mode or the back mode.

Let's verify the contents of the layout column.

In [7]:
set_data.layout.value_counts()

normal       494
transform     61
modal_dfc     16
Name: layout, dtype: int64
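
One way to check that `layout` and `card_faces` agree, sketched on toy rows (the real frame is not reproduced here), is to compare a layout-based mask with the non-null mask of `card_faces`. The name `DFC_LAYOUTS` is introduced only for this example.

```python
import pandas as pd

# Toy rows: one single-faced card and one card for each dfc layout.
toy = pd.DataFrame({
    "name": ["Normal Card", "Transform Card", "Modal Card"],
    "layout": ["normal", "transform", "modal_dfc"],
    "card_faces": [
        None,
        [{"name": "Front"}, {"name": "Back"}],
        [{"name": "Mode A"}, {"name": "Mode B"}],
    ],
})

DFC_LAYOUTS = ["transform", "modal_dfc"]

is_dfc = toy["layout"].isin(DFC_LAYOUTS)   # True where the layout is double-faced
has_faces = toy["card_faces"].notna()      # True where face data is present
layouts_match_faces = bool((is_dfc == has_faces).all())
print(layouts_match_faces)
```

In the real data the counts line up the same way: 61 transform plus 16 modal_dfc cards give the 77 non-null `card_faces` entries reported earlier.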

The `card_faces` column of the 77 cards with a non-normal layout should help fill the null values in the columns that have only 494 non-null values.
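
How such a fill might look is sketched below on two made-up rows. Taking values from the first (front) face is an assumption made for illustration, not necessarily the approach this notebook goes on to use, and `front_face_value` is a hypothetical helper.

```python
import pandas as pd

# Two toy rows: a single-faced card and a double-faced card whose
# top-level fields are missing, as in the real data.
toy = pd.DataFrame([
    {"name": "Normal Card", "layout": "normal",
     "mana_cost": "{1}{G}", "oracle_text": "Trample", "card_faces": None},
    {"name": "Front // Back", "layout": "transform",
     "mana_cost": None, "oracle_text": None,
     "card_faces": [
         {"name": "Front", "mana_cost": "{2}{W}", "oracle_text": "Flying"},
         {"name": "Back", "mana_cost": "", "oracle_text": "Vigilance"},
     ]},
])


def front_face_value(faces, key):
    """Return the given field of the first face, or None for single-faced cards."""
    if isinstance(faces, list) and faces:
        return faces[0].get(key)
    return None


filled = toy.copy()
for column in ["mana_cost", "oracle_text"]:
    from_faces = filled["card_faces"].apply(
        lambda faces: front_face_value(faces, column)
    )
    # Only rows that are null at the top level receive the face value.
    filled[column] = filled[column].fillna(from_faces)

print(filled[["name", "mana_cost", "oracle_text"]])
```

Whether the front face alone is the right source depends on the analysis; for modal cards, for instance, both faces may matter.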