# Amplify Data API: Downloading Product Data

This notebook walks you through downloading your data product's data via the Amplify API.

To follow this example, clone this repo and open this notebook: 
```
git clone git@github.com:amplifydata/amplifydata-public.git
```

## Auth

First, you must obtain an `AMPLIFY_API_KEY` by following the [Generating the API Key](https://github.com/amplifydata/amplifydata-public/tree/main#generating-an-api-key) directions on this repo's readme. 

Next, make that value available as an environment variable named `AMPLIFY_API_KEY`.

We suggest using `python-dotenv` to load it (but you can do this in any fashion you like). Create a `.env` file in the same directory as this notebook with this structure:
```bash
AMPLIFY_API_KEY=your-key-from-amplify
```

The key is used to authenticate you against our API endpoints.

In [1]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os 
from dotenv import load_dotenv

load_dotenv()
    
AMPLIFY_API_KEY = os.environ["AMPLIFY_API_KEY"]
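
If the variable is missing, `os.environ["AMPLIFY_API_KEY"]` raises a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper name used for illustration and is not part of any Amplify tooling.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise with a hint if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value
```

After `load_dotenv()`, this could be used as `AMPLIFY_API_KEY = require_env("AMPLIFY_API_KEY")`.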


## Create API Product Subscription

**insert steps on subscription creation**

1. go to UI
2. get stuff


## Request Your Download Links: `/external-api/v2/products/{product_id}/files`

Next, we are going to download the underlying files of your data product via the Amplify API.


**The API URL for your product download should be available on the subscription you created for this project.**


Amplify's API works by returning a list of presigned download URLs that can be used to download your data.




In [12]:
import requests

# ID of the product you subscribed to
product_id = "b16f31fd-41e1-4dd9-839d-f67d06af95c0"

## REPLACE WITH THE URL FROM YOUR SUBSCRIPTION!!
product_url = f"https://dev.amplifydata.io/external-api/v2/products/{product_id}/files"

res = requests.get(
    product_url,
    headers={
        "X-API-KEY": AMPLIFY_API_KEY,
        "accept": "application/json",
    },
)

if res.status_code != 200:
    # res.text is used because an error body is not guaranteed to be JSON
    raise Exception(f"Request failed with status {res.status_code}: {res.text}")

data = res.json()  # {"download_links": [...], "metadata": {...}}
print(data)

{'download_links': ['https://amplifydata-development-dev.s3.amazonaws.com/snowflake%20orders%20100k--b16f31fd-41e1-4dd9-839d-f67d06af95c0/file.csv?AWSAccessKeyId=AKIASC5E62QHCPMXWWSX&Signature=IejlOW3Lct4R67f1%2B1ju%2BX7yP8c%3D&Expires=1688061370'], 'metadata': {'num_files': 1, 'total_size_mb': 13.7, 'avg_file_size_mb': 13.7, 'expires_at': '2023-06-29 13:56:10'}}
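
The request in the cell above can also be wrapped in a small function so it is easy to re-run when you need fresh links. This is a sketch under assumptions: `fetch_download_links` and its `http_get` parameter are hypothetical names, the 30-second timeout is an arbitrary choice, and the response is assumed to have the `download_links` and `metadata` keys shown in the output above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def fetch_download_links(
    product_url: str,
    api_key: str,
    timeout: float = 30.0,
    http_get: Optional[Callable[..., Any]] = None,
) -> Tuple[List[str], Dict[str, Any]]:
    """Request the files endpoint and return (download_links, metadata)."""
    if http_get is None:
        import requests  # imported lazily so a stub can be passed in instead

        http_get = requests.get
    res = http_get(
        product_url,
        headers={"X-API-KEY": api_key, "accept": "application/json"},
        timeout=timeout,
    )
    res.raise_for_status()  # raises on 4xx/5xx responses
    payload = res.json()
    return payload["download_links"], payload.get("metadata", {})
```

Usage would look like `links, metadata = fetch_download_links(product_url, AMPLIFY_API_KEY)`.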


## Download your data

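The response has the following structure (values taken from the example output above):
```json
{
    "download_links": [
        "<presigned download URL>"
    ],
    "metadata": {
        "num_files": 1,
        "total_size_mb": 13.7,
        "avg_file_size_mb": 13.7,
        "expires_at": "2023-06-29 13:56:10"
    }
}
```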

With that request, you now have access to download your data! 

**insert section about expires at**
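
The sample download link in the output above carries an `Expires` query parameter (`Expires=1688061370`). A minimal sketch for checking whether a link is still valid follows, assuming your links carry the same parameter and that it is a Unix timestamp in seconds, as it is for this style of S3 presigned URL. `link_expiry` and `is_expired` are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def link_expiry(url: str) -> Optional[datetime]:
    """Return the expiry encoded in the URL's `Expires` parameter (UTC), or None if absent."""
    values = parse_qs(urlsplit(url).query).get("Expires")
    if not values:
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """True if the link's `Expires` timestamp has passed; False if the URL has no timestamp."""
    expiry = link_expiry(url)
    if expiry is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now >= expiry
```

An expired presigned link is rejected by S3, so if `is_expired(link)` returns `True`, re-running the request to the files endpoint to obtain fresh links is the likely remedy.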

## Download Directly to file

If you are downloading data for a large product, this process can take a long time (it largely depends on your network connection).

The following cells define a function, `download_file`, that we will use to download your data to a local file.

**Make sure the directory you are storing the file in already exists.** You can use the `DATA_DIR` variable in the next cell to manage where the data is saved.

In [27]:
DATA_DIR = "data_dir" 

In [28]:
!mkdir -p $DATA_DIR # creates the directory relative to this notebook (no error if it already exists)


In [29]:
def download_file(url: str, filepath: str):
    """
    Download the data available at `url` and write it to the file `filepath`.
    Args:
        url (str): download link provided by the Amplify API response
        filepath (str): path of the file to write to (the directory must already exist!)
    Returns:
        None
    """
    # stream=True fetches the body in chunks instead of loading it all into memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filepath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Downloaded {url} to {filepath}")




for index, i_link in enumerate(data["download_links"]):
    download_file(i_link, f"{DATA_DIR}/file-{index}.csv")

Downloaded https://amplifydata-development-dev.s3.amazonaws.com/snowflake%20orders%20100k--b16f31fd-41e1-4dd9-839d-f67d06af95c0/file.csv?AWSAccessKeyId=AKIASC5E62QHCPMXWWSX&Signature=IejlOW3Lct4R67f1%2B1ju%2BX7yP8c%3D&Expires=1688061370 to data_dir/file-0.csv
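
The loop above names every file `file-{index}.csv` and relies on the directory already existing. A small sketch of an alternative follows: it derives the file name from the link itself and creates the directory if needed. `local_path_for` is a hypothetical helper name, and it assumes the last path segment of each link is a usable file name (as with `file.csv` in the sample link).

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def local_path_for(url: str, data_dir: str, index: int) -> Path:
    """Build a local path from the URL's file name, prefixed with its index to avoid clashes."""
    # Take the last segment of the URL path, ignoring the query string
    name = PurePosixPath(unquote(urlsplit(url).path)).name or "file"
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return target_dir / f"{index}-{name}"
```

Inside the loop this could be used as `download_file(i_link, str(local_path_for(i_link, DATA_DIR, index)))`.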


### Direct to file with progress bar! 

An extra feature that is nice for longer downloads is a progress bar for each file download.
The code is very similar, except that we need to install `tqdm` first and update the progress bar as each chunk is written.


In [30]:
!pip install tqdm



In [31]:
from tqdm import tqdm

def download_file(url: str, filepath: str):
    """
    Download the data available at `url` and write it to the file `filepath`,
    showing a progress bar while the download runs.
    Args:
        url (str): download link provided by the Amplify API response
        filepath (str): path of the file to write to (the directory must already exist!)
    Returns:
        None
    """
    # stream=True fetches the body in chunks instead of loading it all into memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Total size in bytes from the content-length header (0 if the header is absent)
        total_size = int(r.headers.get('content-length', 0))
        progress_bar = tqdm(total=total_size, unit='B', unit_scale=True, position=0, leave=True)
        with open(filepath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
                progress_bar.update(len(chunk))  # advance by the number of bytes written
        progress_bar.close()
        print(f"Downloaded {url} to {filepath}")


for index, i_link in enumerate(data["download_links"]):
    download_file(i_link, f"{DATA_DIR}/file-{index}.csv")

100%|███████████████████████████████████████████████████████████████████████████| 14.4M/14.4M [00:00<00:00, 21.4MB/s]

Downloaded https://amplifydata-development-dev.s3.amazonaws.com/snowflake%20orders%20100k--b16f31fd-41e1-4dd9-839d-f67d06af95c0/file.csv?AWSAccessKeyId=AKIASC5E62QHCPMXWWSX&Signature=IejlOW3Lct4R67f1%2B1ju%2BX7yP8c%3D&Expires=1688061370 to data_dir/file-0.csv





## Use Pandas to Download 

You can also load your data directly into a pandas DataFrame.

`pd.read_csv` accepts a download link directly, so no intermediate file is needed.

In [18]:
import pandas as pd

## Single file example
df = pd.read_csv(data["download_links"][0]) # can only pass a single download url
df.head()

Unnamed: 0,O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT,CREATED_AT
0,4200001.0,13726.0,F,99406.41,1994-02-21,3-MEDIUM,Clerk#000000128,0.0,eep. final deposits are after t,2020-07-26 18:37:56.448
1,4200002.0,129376.0,O,256838.41,1997-04-14,4-NOT SPECIFIED,Clerk#000000281,0.0,ke carefully. blithely regular epitaphs are am...,2023-03-09 19:37:56.448
2,4200003.0,141613.0,O,150849.49,1997-11-24,4-NOT SPECIFIED,Clerk#000000585,0.0,cial accounts. theodolites are carefully. pend...,2022-12-01 19:37:56.448
3,4200004.0,23515.0,O,178688.27,1996-12-09,2-HIGH,Clerk#000000632,0.0,sual requests against the always special packa...,2021-07-20 18:37:56.448
4,4200005.0,97687.0,O,261742.31,1997-02-01,2-HIGH,Clerk#000000562,0.0,"t slyly above the pending, final accounts? reg...",2020-12-09 19:37:56.448


In [19]:
## Read all download links into a dataframe

## WARNING: for large products, this can take a long time
## be sure to save this data locally so you don't have to wait for the download process each time
dfs = []

for i_download_link in data["download_links"]:
    i_df = pd.read_csv(i_download_link)
    dfs.append(i_df)

combined_df = pd.concat(dfs)
combined_df.head()

Unnamed: 0,O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT,CREATED_AT
0,4200001.0,13726.0,F,99406.41,1994-02-21,3-MEDIUM,Clerk#000000128,0.0,eep. final deposits are after t,2020-07-26 18:37:56.448
1,4200002.0,129376.0,O,256838.41,1997-04-14,4-NOT SPECIFIED,Clerk#000000281,0.0,ke carefully. blithely regular epitaphs are am...,2023-03-09 19:37:56.448
2,4200003.0,141613.0,O,150849.49,1997-11-24,4-NOT SPECIFIED,Clerk#000000585,0.0,cial accounts. theodolites are carefully. pend...,2022-12-01 19:37:56.448
3,4200004.0,23515.0,O,178688.27,1996-12-09,2-HIGH,Clerk#000000632,0.0,sual requests against the always special packa...,2021-07-20 18:37:56.448
4,4200005.0,97687.0,O,261742.31,1997-02-01,2-HIGH,Clerk#000000562,0.0,"t slyly above the pending, final accounts? reg...",2020-12-09 19:37:56.448
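
The warning in the cell above suggests saving the data locally so you do not have to wait for the download each time. A minimal sketch of that idea follows: read from a local cache file when it exists, otherwise download, combine, and save. `load_product` and the cache path are hypothetical names chosen for illustration, and the sketch assumes every link points to a CSV file with the same columns.

```python
from pathlib import Path
from typing import List

import pandas as pd


def load_product(links: List[str], cache_path: str) -> pd.DataFrame:
    """Return the combined DataFrame, reading from `cache_path` when it already exists."""
    cache = Path(cache_path)
    if cache.exists():
        return pd.read_csv(cache)
    # Download each file and stack the rows under a fresh 0..n-1 index
    frames = [pd.read_csv(link) for link in links]
    combined = pd.concat(frames, ignore_index=True)
    cache.parent.mkdir(parents=True, exist_ok=True)
    combined.to_csv(cache, index=False)
    return combined
```

For example, `combined_df = load_product(data["download_links"], "data_dir/combined.csv")` downloads on the first call and reads the local file afterwards. Delete the cache file when you want to pull fresh data.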
