# Demo 3: Glue

## Other formats

Because pandas can create a DataFrame from many formats, you can reuse the same munging code regardless of the data source. The pandas IO documentation lists the supported readers and writers:
https://pandas.pydata.org/pandas-docs/stable/io.html

![Pandas IO](assets/pandas_io.png)
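
As a minimal sketch of that idea, the example below loads the same made-up records from CSV text and from JSON text, then runs one cleaning function on both. The sample data and the `total_by_agency` helper are illustrative assumptions, not part of the demo's dataset.

```python
import io

import pandas as pd

# Made-up sample data: the same three records in two formats.
csv_text = "agency,amount\nNSF,100\nNIH,250\nNSF,50\n"
json_text = (
    '[{"agency": "NSF", "amount": 100},'
    ' {"agency": "NIH", "amount": 250},'
    ' {"agency": "NSF", "amount": 50}]'
)


def total_by_agency(df):
    """Hypothetical munging step: sum amounts per agency."""
    return df.groupby("agency")["amount"].sum().sort_index()


from_csv = total_by_agency(pd.read_csv(io.StringIO(csv_text)))
from_json = total_by_agency(pd.read_json(io.StringIO(json_text)))

# The munging code does not care where the DataFrame came from.
print(from_csv.equals(from_json))
```

Only the `read_*` call changes between sources; everything after it operates on a plain DataFrame.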

### JSON and APIs

In [None]:
import json
import requests
import pandas as pd
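try:
    from pandas import json_normalize  # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize  # older pandas releases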

pd.options.display.float_format = '{:,.0f}'.format

# documentation of the usaspending API: api.usaspending.gov
uri = 'https://api.usaspending.gov/api/v2/search/spending_by_award/'
headers = {'content-type': 'application/json'}
awards_json = []
page = 1
while page:
    payload = {
        "filters":{"time_period":[{"start_date":"2015-10-01","end_date":"2018-09-30"}],
        "award_type_codes":["02","03","04","05"],
        "place_of_performance_locations":[{"country":"USA","state":"MA","county":"015"}]}, # Hampshire County
        "fields":["Award ID","Recipient Name","Start Date","End Date","Award Amount","Awarding Agency","Awarding Sub Agency","Award Type", "Description"],
        "page":page,
        "limit":100
    }
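    r = requests.post(uri, data=json.dumps(payload), headers=headers)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    response = r.json()  # parse the response body once per page
    awards_json.extend(response['results'])
    # keep paging until the API reports that there is no next page
    if response['page_metadata']['hasNext']:
        page = page + 1
    else:
        page = None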

awards_json[0:2]

In [None]:
# json_normalize already returns a DataFrame, so no extra wrapping is needed
awards_df = json_normalize(awards_json)
awards_df.head(10)
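
To show what `json_normalize` does with nested records, here is a small sketch on made-up data shaped like an API response. The field names are hypothetical, and the example assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Made-up records with a nested "recipient" object (hypothetical fields).
records = [
    {"Award ID": "A-1", "recipient": {"name": "Acme", "state": "MA"}},
    {"Award ID": "A-2", "recipient": {"name": "Globex", "state": "MA"}},
]

# Keys of nested dicts are flattened into dotted column names.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Each nested key ends up as its own column, so the result can be filtered and grouped like any other DataFrame.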

### SQL

In [None]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://becky@localhost:5432/usaspending')
recipients = pd.read_sql('select * from legal_entity limit 10000;', engine)
recipients.head()
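
If no PostgreSQL server is available, the same `pd.read_sql` call can be tried against an in-memory SQLite database. This is only a sketch: the table contents and the two-column schema below are made up and do not reflect the real `legal_entity` table.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for Postgres (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("create table legal_entity (id integer, name text)")
conn.executemany(
    "insert into legal_entity values (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.commit()

# read_sql accepts a sqlite3 connection as well as a SQLAlchemy engine.
sample = pd.read_sql("select * from legal_entity order by id limit 10;", conn)
conn.close()
print(sample)
```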

### Parquet

[Apache Parquet](https://parquet.apache.org/) is a columnar storage format, commonly used with Spark and the Hadoop ecosystem. It's not limited to that ecosystem, however. Another use case is sending Parquet files to AWS S3 for use with Athena.

In [None]:
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('humble-dataframe')
# Write the DataFrame to a local Parquet file first
recipients.to_parquet('data/recipients.parquet', compression='gzip')
# upload_file reads the local file and sends its contents as the object body
bucket.upload_file('data/recipients.parquet', 'recipients.parquet')

# With the optional s3fs package installed, pandas can also write Parquet directly to S3:
# recipients.to_parquet('s3://humble-dataframe/recipients.parquet', compression='gzip')

## Other tools

Many tools understand DataFrames. A few examples:

* scikit-learn
* statsmodels
* seaborn
* bokeh
* plotly
* geopandas
* Jupyter Notebook
* Apache Arrow


In [None]:
import seaborn as sns
%matplotlib inline

In [None]:
awards_df.head()

In [None]:
# Mean award amount per awarding agency, drawn without error bars
charty = sns.barplot(x="Award Amount", y="Awarding Agency", ci=None, data=awards_df)