## Taxonomy data extraction

**Format:**
- *kingdom*: broadest classification (e.g., *Animalia* for animals)
- *phylum*: groups organisms by body structure (e.g., *Chordata* for animals with a spinal cord)
- *class*: groups organisms by major shared traits (e.g., *Aves* for birds, *Mammalia* for mammals)
- *family*: groups related genera, reflecting evolutionary relationships
- *genus* = *parent*: groups closely related species (i.e., the first part of the scientific name)
- *scientific_name*: the species' scientific name, corresponding to a single species *id*
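
Because *genus* is the first part of the scientific name, it can be derived from the name itself and used as a cross-check on the scraped *genus* and *parent* columns. A minimal sketch, assuming names follow the usual two-part "Genus species" form; `genus_from_name` is a hypothetical helper, not part of the original notebook:

```python
def genus_from_name(scientific_name):
    """Return the first whitespace-separated token of a scientific name."""
    parts = str(scientific_name).split()
    # An empty or whitespace-only name has no genus to report
    return parts[0] if parts else None

print(genus_from_name("Turdus merula"))
```

Rows where this value disagrees with the scraped *genus* are likely candidates for a wrong or ambiguous match and may be worth inspecting by hand.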

### Preliminary code

In [None]:
# Importing useful packages
from pathlib import Path
import requests
import numpy as np
import pandas as pd

In [None]:
# Loading species data
species_train = np.load(Path('../species/species_train.npz'))
species_names = dict(zip(species_train['taxon_ids'].astype(str), species_train['taxon_names']))

### Scraping data

In [None]:
def get_gbif_data(scientific_name, species_id):
    """Look up a scientific name on GBIF and return its taxonomy as a dict."""
    fields = ['family', 'kingdom', 'phylum', 'parent', 'class', 'genus']

    # Defaulting every taxonomic field to NaN (kept if no results or error)
    record = {'id': species_id, 'scientific_name': scientific_name}
    record.update({field: np.nan for field in fields})

    try:
        # Passing the name via params so that it is URL-encoded properly
        response = requests.get(
            "https://api.gbif.org/v1/species",
            params={'name': scientific_name},
            timeout=30,
        )
    except requests.RequestException:
        # Network failure or timeout: returning the NaN record
        return record

    if response.status_code == 200:
        results = response.json().get('results', [])
        if results:
            # Keeping only the first match returned by GBIF
            for field in fields:
                record[field] = results[0].get(field, np.nan)

    return record

# Fetching data into dataframe
results = []
for species_id, scientific_name in species_names.items():
    species_data = get_gbif_data(scientific_name, species_id)
    results.append(species_data)
animals = pd.DataFrame(results)

# Saving dataframe
animals.to_csv('taxonomy.csv', index=False)
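
The field extraction in `get_gbif_data` can also be written as a pure function, which makes it possible to check the row format without calling the API. The following is only a sketch under stated assumptions: `parse_gbif_payload` and `TAXON_FIELDS` are hypothetical names, the sample payload and the id `'12345'` are hand-written for illustration (not real API output), and `math.nan` stands in for `np.nan` to keep the example free of dependencies.

```python
import math

# Same taxonomic fields, in the same order, as the scraping function
TAXON_FIELDS = ('family', 'kingdom', 'phylum', 'parent', 'class', 'genus')

def parse_gbif_payload(payload, scientific_name, species_id):
    """Build one taxonomy row from an already-decoded GBIF JSON payload."""
    row = {'id': species_id, 'scientific_name': scientific_name}
    results = payload.get('results') or []
    # Using the first match if there is one, otherwise an empty record
    first = results[0] if results else {}
    for field in TAXON_FIELDS:
        row[field] = first.get(field, math.nan)
    return row

# Hypothetical payload trimmed to the fields used here
sample = {'results': [{'kingdom': 'Animalia', 'phylum': 'Chordata',
                       'class': 'Aves', 'family': 'Turdidae',
                       'genus': 'Turdus', 'parent': 'Turdus'}]}

row = parse_gbif_payload(sample, 'Turdus merula', '12345')
empty = parse_gbif_payload({'results': []}, 'Unknown name', '0')
```

Here `row` carries the six taxonomic fields from the first result, while `empty` has NaN in each of them, mirroring the fallback used when GBIF returns no match.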