# 1. API Data Structuring

In this notebook I'll walk through the first steps of importing MTG card data from an API, followed by exploring, structuring, cleaning and exporting it so that it can be used in other data projects later on.

## 1.1. Resources and Setup

To start, let's set up our project imports. We'll need some components from the SciPy stack, some general utility packages, and a custom module I wrote to isolate the API request logic and keep this notebook focused.


In [1]:
# Project imports
import os   # To access environment variables
import sys  # To allow access to my local modules from this notebook

from pathlib import Path    # To ease working with filesystem paths
from datetime import date   # To manage product versions based on date

# Scipy Stack 
import numpy as np
import pandas as pd

# Setting up custom modules
project_folder = Path.cwd().parent

if str(project_folder) not in sys.path:
    sys.path.append(str(project_folder))

from modules.api_client import get_set_data

Next, we define the project constants. These include general details such as a project identifier, the current date, and the paths where the resulting files will be stored. In this case we'll store two files: (1) the raw data as obtained from the API, to avoid repeated queries, and (2) the cleaned data, so that it can be used in other projects.

In [2]:
# Project Constants
PROJECT_CODE = 'mtg_demo' # A short identifier for this project.
PROJECT_DATE = str(date.today())

# Identifiers for the files we intend to produce in this notebook
RAW_PRODUCT = 'scryfall_raw'
PROCESSED_PRODUCT = 'processed_data'

RAW_DATA_OUT_DIR = Path(os.environ['RAW_DATA_DIR_PATH'])
RAW_DATA_PATH = RAW_DATA_OUT_DIR / ('_'.join([PROJECT_CODE,
                                             RAW_PRODUCT,
                                             PROJECT_DATE]))

PROCESSED_DATA_OUT_DIR = Path(os.environ['PROCESSED_DATA_DIR_PATH'])
PROCESSED_DATA_PATH = PROCESSED_DATA_OUT_DIR / ('_'.join([PROJECT_CODE,
                                                         PROCESSED_PRODUCT,
                                                         PROJECT_DATE]))
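
The same path-building pattern is used for both products, so it could be wrapped in a small helper. The following is only a minimal sketch and not part of the original notebook: the function name `build_product_path` and the fallback to a local `data` folder when the environment variable is unset are assumptions made for illustration.

```python
import os
from datetime import date
from pathlib import Path


def build_product_path(env_var, project_code, product, on_date=None, default_dir="data"):
    """Return <dir>/<project_code>_<product>_<YYYY-MM-DD>, without a suffix."""
    # Hypothetical fallback: use a local folder when the variable is unset
    out_dir = Path(os.environ.get(env_var, default_dir))
    stamp = str(on_date or date.today())  # ISO format, e.g. 2023-05-01
    return out_dir / "_".join([project_code, product, stamp])


raw_path = build_product_path("RAW_DATA_DIR_PATH", "mtg_demo", "scryfall_raw",
                              on_date=date(2023, 5, 1))
print(raw_path.with_suffix(".json").name)  # mtg_demo_scryfall_raw_2023-05-01.json
```

Leaving the suffix off the constant keeps the format decision at the point of use, which is why `with_suffix('.json')` is only applied when the file is actually read or written.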

## 1.2. Obtaining the Card Data

There are multiple sources we could import the data from. I chose [Scryfall's API](https://scryfall.com/docs/api) because I'm familiar with their platform and documentation. As I mentioned earlier, the specifics of the request process are kept in the `api_client.py` module, in the `modules` folder of this project.

In [4]:
raw_set_data = get_set_data(
    RAW_DATA_PATH.with_suffix('.json'),
    query='s:mom -is:rebalanced'
)

2023-05-01 14:33:00.677 | INFO     | modules.api_client:get_set_data:10 - 
 - older data found. skipping request.
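
The log line above shows that the request was skipped because a cached file already existed. The actual `get_set_data` implementation lives in `api_client.py` and is not shown here, so the following is only a hedged sketch of how such a cache-or-request step might look. The names `load_or_request` and `fetch_page` are hypothetical; the only API detail assumed is that Scryfall list responses carry their cards under `data` and signal further pages with `has_more` and `next_page`.

```python
import json
from pathlib import Path

import pandas as pd


def load_or_request(cache_path, first_url, fetch_page):
    """Return card data as a DataFrame, reading the cache when it exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Older data found: skip the request entirely
        cards = json.loads(cache_path.read_text(encoding="utf-8"))
        return pd.DataFrame(cards)

    cards, url = [], first_url
    while url:
        page = fetch_page(url)  # assumed to return one decoded JSON page (a dict)
        cards.extend(page["data"])
        # Follow the pagination link only while the API reports more pages
        url = page.get("next_page") if page.get("has_more") else None

    # Store the raw cards so later runs do not query the API again
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cards), encoding="utf-8")
    return pd.DataFrame(cards)
```

Passing `fetch_page` in as a callable keeps the sketch independent of any HTTP library and makes it easy to try out with canned pages; in practice it would wrap whatever client the project uses to call the search endpoint.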



## 1.3. Processing the Data

To familiarize ourselves with the data, we first run the **`pandas.DataFrame.info()`** method. It displays the size of the dataframe, along with the name, dtype, and non-null count of every column.

In [5]:
raw_set_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296 entries, 0 to 295
Data columns (total 71 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   object             296 non-null    object        
 1   id                 296 non-null    object        
 2   oracle_id          296 non-null    object        
 3   multiverse_ids     296 non-null    object        
 4   mtgo_id            296 non-null    int64         
 5   arena_id           296 non-null    int64         
 6   tcgplayer_id       296 non-null    int64         
 7   cardmarket_id      281 non-null    float64       
 8   name               296 non-null    object        
 9   lang               296 non-null    object        
 10  released_at        296 non-null    datetime64[ns]
 11  uri                296 non-null    object        
 12  scryfall_uri       296 non-null    object        
 13  layout             296 non-null    object        
 14  highres_im

The card data we are requesting belongs to the latest card set released by Wizards of the Coast: March of the Machine (aka MOM). In my experience, exploring the contents of a newly released set with data tools accelerates the process of becoming deeply familiar with the themes and composition of the set. 

Say we are now interested in the columns that describe the gameplay features of the cards, to make it easier to start playing with the new cards. Let's filter the data with a collection of the gameplay-relevant features. The id and image_uris columns are also kept to make it easier to refer back to the data later.

In [15]:
gameplay_features = {'name', 'layout', 'mana_cost', 'cmc', 'type_line',
                     'oracle_text','colors','color_identity', 'card_faces',
                     'rarity', 'power', 'toughness', 'loyalty', 'id', 'image_uris'}

# Recent pandas versions reject a set as an indexer, so convert it to a list first
raw_set_data[list(gameplay_features)].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296 entries, 0 to 295
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   oracle_text     235 non-null    object 
 1   power           135 non-null    float64
 2   toughness       135 non-null    float64
 3   layout          296 non-null    object 
 4   loyalty         3 non-null      float64
 5   name            296 non-null    object 
 6   card_faces      61 non-null     object 
 7   colors          235 non-null    object 
 8   color_identity  296 non-null    object 
 9   rarity          296 non-null    object 
 10  type_line       296 non-null    object 
 11  mana_cost       235 non-null    object 
 12  cmc             296 non-null    int64  
dtypes: float64(3), int64(1), object(9)
memory usage: 32.4+ KB
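
Note that `DataFrame.info()` prints its report and returns `None`, so the overview cannot be sorted, filtered or chained with further methods. When that is needed, a similar summary can be built as a regular DataFrame. The following is a minimal sketch: the helper name `summarize_columns` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column with its non-null count, null share and dtype."""
    summary = pd.DataFrame({
        "non_null": df.notna().sum(),     # same figure info() reports
        "null_share": df.isna().mean(),   # fraction of missing values
        "dtype": df.dtypes.astype(str),
    })
    # Sparsest columns first, so gaps in the data stand out
    return summary.sort_values("non_null")


# Toy data standing in for the card frame
toy = pd.DataFrame({
    "name": ["Card A", "Card B", "Card C"],
    "power": [2.0, None, None],
    "loyalty": [None, None, None],
})
print(summarize_columns(toy))
```

Calling `summarize_columns(raw_set_data)` would give the same kind of table for the full card data, which can then be filtered like any other DataFrame.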


