# Get Files from FiscalData

This notebook demonstrates how to use Python and the AWS SDK for Python (boto3) to download public data from the FiscalData REST API, package it into a ZIP archive, and upload it to S3.


## Set up Tables in Glue Catalog
Create Glue Tables using Metadata from [FiscalData Data Dictionaries](data/metadata/readme.md).

**Why not Glue Crawler?**   
A common alternative is to [use a crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run and, upon completion, creates or updates one or more tables in your Data Catalog.

In our experience, crawlers are 'fiddly' to configure and maintain, and they infer data types inconsistently. Since we generally have good knowledge of the data we are working with, including metadata and/or schemas from which to specify accurate data types, our preferred approach is to create the Glue Tables explicitly.
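
As a minimal sketch of the explicit approach, the snippet below turns a hand-written schema into the `Columns` list that a Glue `TableInput` expects. The `SCHEMA` dictionary, its column names and types, and the `to_glue_columns` helper are illustrative assumptions; they are not part of this notebook's `glue_functions_2212` module.

```python
# Hypothetical schema: column names and Glue types are illustrative only.
SCHEMA = {
    "record_date": "date",
    "security_desc": "string",
    "avg_interest_rate_amt": "decimal(10,3)",
}


def to_glue_columns(schema):
    """Convert {column_name: glue_type} into the list-of-dicts form used by
    TableInput['StorageDescriptor']['Columns']."""
    return [
        {"Name": name.lower(), "Type": glue_type}
        for name, glue_type in schema.items()
    ]


columns = to_glue_columns(SCHEMA)
```

Because the types are stated up front, the resulting table definition does not depend on what a crawler happens to infer from a sample of the data.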


In [1]:
# Initialize Persistent %store & Notebook Globals
import json
import boto3
import os

%store -r 

S_sys_abbrev = "FSDATA"
S_landing_pad_bucket = f"{S_stack}-landing-pad"
S_datalake_bucket = f"{S_stack}-datalake"

S_glue_database = f"{S_stack}-{S_sys_abbrev}".lower()

%store S_sys_abbrev S_landing_pad_bucket S_glue_database S_datalake_bucket

%store 

os.environ['AWS_DEFAULT_REGION'] = S_region


Stored 'S_sys_abbrev' (str)
Stored 'S_landing_pad_bucket' (str)
Stored 'S_glue_database' (str)
Stored 'S_datalake_bucket' (str)
Stored variables and their in-db values:
S_account_id                     -> '{aws_acct}'
S_arn_template                   -> 'arn:aws:$Service:us-east-2:{aws_acct}:$Resource
S_code_bucket                    -> 'daab-lab-code-us-east-2'
S_datalake_bucket                -> 'daab-lab-smpl-main-datalake'
S_file_prefix                    -> 'FDMD.FSDATA'
S_glue_database                  -> 'daab-lab-smpl-main-fsdata'
S_glue_dbname                    -> 'daab-lab-smpl-main-fsdata'
S_landing_pad_bucket             -> 'daab-lab-smpl-main-landing-pad'
S_partition                      -> 'aws'
S_qualifier                      -> 'FDMD'
S_region                         -> 'us-east-2'
S_rootdir                        -> '/home/ec2-user/SageMaker/daab-simple'
S_stack                          -> 'daab-lab-smpl-main'
S_sys_abbrev                     -> 'FSDATA'


In [2]:
# Create a Glue Database using the Python SDK
# (it should already have been created by the CloudFormation stack)
glue_client = boto3.client('glue')

try:
    response = glue_client.create_database(
        DatabaseInput={
            'Name': S_glue_database,
            'Description': 'Created from boto3 script in glue_workbook.ipynb',
            'LocationUri': f's3://{S_datalake_bucket}/{S_sys_abbrev}'
        }
    )
    print(f"Created Glue Database '{S_glue_database}'")
except glue_client.exceptions.AlreadyExistsException:
    print(f"Using Existing Glue Database '{S_glue_database}'")

  

Using Existing Glue Database 'daab-lab-smpl-main-fsdata'


In [None]:
# create Glue Tables Using Python SDK
import boto3
glue_client = boto3.client("glue")

import sys
python_path = f"{S_rootdir}/python"
if python_path not in sys.path: sys.path.append(python_path)
import glue_functions_2212 as glu

glue_table_names = [
  'avg_interest_rates',
  'top_federal',
  'top_state'
]

glue_table_input = json.loads(glu.get_file('file://../python/glue_table_input_template_parquet.json'))

for glue_table_name in glue_table_names:

    # Read the RAML data dictionary and convert it to Glue column definitions
    with open(f"metadata/{glue_table_name}.raml", "r") as f:
        raml = f.read()
    column_metadata = glu.import_raml_to_glue("", "", raml)
    s3_url_table_location = f's3://{S_datalake_bucket}/{S_sys_abbrev}/PARQUET/{glue_table_name}'

    # Fill in the table-specific parts of the template
    glue_table_input['TableInput']['Name'] = glue_table_name
    glue_table_input['TableInput']['StorageDescriptor']['Columns'] = column_metadata
    glue_table_input['TableInput']['StorageDescriptor']['Location'] = s3_url_table_location

    # Create the table, or update it if it already exists
    try:
        response = glue_client.create_table(
            DatabaseName=S_glue_database,
            TableInput=glue_table_input['TableInput']
        )
        status = 'Created'
    except glue_client.exceptions.AlreadyExistsException:
        response = glue_client.update_table(
            DatabaseName=S_glue_database,
            TableInput=glue_table_input['TableInput']
        )
        status = 'Updated'

    print(f"Glue Table '{glue_table_name}' {status} in Database '{S_glue_database}'")
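
The cell above fills in one shared template dictionary on every pass through the loop. A possible variation, sketched below, copies the template for each table so that no settings carry over from one table to the next. The contents of `PARQUET_TEMPLATE` are an assumption, since `glue_table_input_template_parquet.json` is not shown in this notebook; the format and SerDe class names are the standard Hive Parquet classes. The `build_table_input` helper is hypothetical.

```python
import copy

# Assumed shape of a Parquet table template; the real JSON template
# used by this notebook may differ.
PARQUET_TEMPLATE = {
    "TableInput": {
        "Name": "",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [],
            "Location": "",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }
}


def build_table_input(template, name, columns, location):
    """Return a fresh TableInput dict for one table, leaving the template untouched."""
    table_input = copy.deepcopy(template["TableInput"])
    table_input["Name"] = name
    table_input["StorageDescriptor"]["Columns"] = columns
    table_input["StorageDescriptor"]["Location"] = location
    return table_input


example_input = build_table_input(
    PARQUET_TEMPLATE,
    "top_state",
    [{"Name": "record_date", "Type": "date"}],
    "s3://example-datalake/FSDATA/PARQUET/top_state",  # placeholder bucket name
)
```

The returned dictionary could then be passed as `TableInput` to `glue_client.create_table` or `glue_client.update_table`, as in the loop above.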


## Fetch & Package Data 
Fetch files from FiscalData via its REST API:
-   [aws_setup.ipynb](aws_setup.ipynb)
-   Package the downloads into a ZIP archive containing a set of CSV files (the typical format for Bureau data exchanges)
    -   {dataset}.D{yymmdd}.full.csv
-   This example uploads the ZIP file to S3 using the AWS SDK for Python. Alternatively, an MFT process might place the ZIP file in the designated folder.
-   The process for obtaining data files from the Enterprise Anypoint REST API will be similar, but the calls to `requests.get(url)` will need to include authentication headers.
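
As a rough sketch of the last point, authentication headers can be passed to `requests.get` through its `headers` argument. The bearer-token scheme and the `build_request_kwargs` helper below are assumptions for illustration; the Enterprise Anypoint API may require a different header scheme.

```python
def build_request_kwargs(token, timeout_seconds=30):
    """Return keyword arguments for requests.get().

    Assumes a bearer-token scheme; replace the headers with whatever
    the target API actually requires.
    """
    return {
        "headers": {
            "Authorization": f"Bearer {token}",
            "Accept": "text/csv",
        },
        "timeout": timeout_seconds,
    }


request_kwargs = build_request_kwargs("example-token")  # placeholder token

# With the requests library available, the call would then look like:
#   response = requests.get(url, **request_kwargs)
#   response.raise_for_status()
```

Keeping the headers in one helper means the download loop itself would not need to change when switching between the public and the authenticated API.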


In [5]:
# Download Several FiscalData Data Sets, package into .ZIP
from urllib.parse import urlparse
import datetime
import os
import boto3
import requests
import zipfile

base_url = "https://api.fiscaldata.treasury.gov/services/api/fiscal_service"

api_data_urls = [
    f"{base_url}/v1/debt/top/top_federal?page[number]=1&page[size]=100&format=csv&filter=record_date:eq:2022-10-01",
    f"{base_url}/v1/debt/top/top_state?page[number]=1&page[size]=100&format=csv&filter=record_date:eq:2022-10-01",
    f"{base_url}/v2/accounting/od/avg_interest_rates?sort=-record_date&format=csv&page[number]=1&page[size]=216"
]

yymmdd = datetime.datetime.now().strftime("%y%m%d")
zipfile_name = f"FDMD.{S_sys_abbrev}.D{yymmdd}.FULL.ZIP"
outfolder = f"{S_rootdir}/data"
os.makedirs(outfolder, exist_ok=True)  # create the output folder if it is missing

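# Open the archive once in write mode so that re-running the cell replaces it
# instead of appending duplicate members.
with zipfile.ZipFile(f"{outfolder}/{zipfile_name}", 'w') as outzip:
    for url in api_data_urls:
        # The dataset name is the last segment of the URL path
        dataset_name = urlparse(url).path.split('/')[-1]
        member_filename = f"{dataset_name}.D{yymmdd}.full.csv"

        response = requests.get(url, timeout=60)
        response.raise_for_status()  # fail fast rather than archive an error page

        outzip.writestr(member_filename, response.text)
        print(f"'{member_filename}' written to '{zipfile_name}'")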


'top_federal.D230113.full.csv' written to 'FDMD.FSDATA.D230113.FULL.ZIP'
'top_state.D230113.full.csv' written to 'FDMD.FSDATA.D230113.FULL.ZIP'
'avg_interest_rates.D230113.full.csv' written to 'FDMD.FSDATA.D230113.FULL.ZIP'
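
The request URLs in the cell above are written out by hand. One way to assemble them from an endpoint and a dictionary of query parameters is sketched below. The `build_api_url` helper is hypothetical; passing `safe="[]:"` to `urlencode` keeps the brackets and colons unescaped so that the result matches the hand-written form.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fiscaldata.treasury.gov/services/api/fiscal_service"


def build_api_url(endpoint, params, base_url=BASE_URL):
    """Join the base URL, an endpoint path, and query parameters."""
    query = urlencode(params, safe="[]:")
    return f"{base_url}/{endpoint}?{query}"


example_url = build_api_url(
    "v1/debt/top/top_federal",
    {
        "page[number]": 1,
        "page[size]": 100,
        "format": "csv",
        "filter": "record_date:eq:2022-10-01",
    },
)
```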


In [6]:
# Uploading the ZIP file to the S3 'landing-pad' bucket initiates processing through the EventBridge Rule ...
s3_client = boto3.client('s3')
s3_bucket = S_landing_pad_bucket
s3_folder = f"{S_sys_abbrev}/Inbound"

s3_client.upload_file( f"{outfolder}/{zipfile_name}", s3_bucket, f'{s3_folder}/{zipfile_name}' )
s3_url_zipfile = f"s3://{s3_bucket}/{s3_folder}/{zipfile_name}"
print(f"'{zipfile_name}' uploaded to '{s3_url_zipfile}'")

'FDMD.FSDATA.D230113.FULL.ZIP' uploaded to 's3://daab-lab-smpl-main-landing-pad/FSDATA/Inbound/FDMD.FSDATA.D230113.FULL.ZIP'
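
The processing triggered by the upload is not shown in this notebook. As an illustration only, a downstream step might recover the qualifier, system abbreviation, and file date from the object key by relying on the naming convention used above. The `parse_archive_key` helper and its regular expression are assumptions based on that convention.

```python
import re
from datetime import datetime

# Matches names such as FDMD.FSDATA.D230113.FULL.ZIP
ARCHIVE_NAME_PATTERN = re.compile(
    r"^(?P<qualifier>[A-Z0-9]+)\.(?P<system>[A-Z0-9]+)\.D(?P<yymmdd>\d{6})\.FULL\.ZIP$"
)


def parse_archive_key(s3_key):
    """Split an S3 object key into the parts encoded in the archive file name."""
    file_name = s3_key.rsplit("/", 1)[-1]
    match = ARCHIVE_NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected archive name: {file_name!r}")
    return {
        "qualifier": match["qualifier"],
        "system": match["system"],
        "file_date": datetime.strptime(match["yymmdd"], "%y%m%d").date(),
    }


archive_info = parse_archive_key("FSDATA/Inbound/FDMD.FSDATA.D230113.FULL.ZIP")
```

Rejecting names that do not match the pattern gives an early, explicit failure if a file with an unexpected name lands in the inbound folder.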
