# Amplify Data API: Downloading Product Data

This notebook walks you through downloading your data product's data via the Amplify API.

To follow this example, clone this repo and open this notebook: 
```
git clone git@github.com:amplifydata/amplifydata-public.git
```

## Setup
Follow the setup directions in [data-product-download/README.md]() before running this notebook.

In [67]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [33]:
import os 
from dotenv import load_dotenv #install with requirements

load_dotenv()
    
AMPLIFY_API_KEY = os.environ["AMPLIFY_API_KEY"]
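
If `AMPLIFY_API_KEY` is missing, the cell above fails with a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper for illustration, not part of the Amplify tooling.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of `name` from `env`, or raise a readable error.

    `require_env` is a hypothetical helper; `env` is injectable so the
    function can be exercised without touching the real environment.
    """
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it "
            "in your shell before running this notebook."
        )
    return value


# Usage in the notebook, after load_dotenv():
# AMPLIFY_API_KEY = require_env("AMPLIFY_API_KEY")
```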


## Getting Product Data

Getting your product's data is a two-step process. This notebook walks you through the following steps:

1. Call the API to get a list of `download_links`. These links are [AWS S3 presigned URLs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html), which can be used to download data.
2. Use the download links to download the data to your machine.


## Step 1: Request Your Download Links


The first step is to make a request to the Amplify Data API for the product with data you intend to download. 


We will use the `/external-api/v2/products/{product_id}/files` endpoint to obtain the `download_links` for your product; these are the links used to download the product data.
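As a rough illustration of how the endpoint path fits together, the sketch below fills in the `{product_id}` placeholder. `build_files_url` is a hypothetical helper, and the base URL is the development host that appears in this notebook's examples; the URL shown on your Subscription Card remains the one to use.

```python
from urllib.parse import quote


def build_files_url(base_url: str, product_id: str) -> str:
    """Build the files endpoint URL for a product (hypothetical helper)."""
    # quote() guards against characters that are not valid in a URL path segment
    safe_id = quote(product_id, safe="")
    return f"{base_url.rstrip('/')}/external-api/v2/products/{safe_id}/files"


# Example with the development host and product id used in this notebook
url = build_files_url(
    "https://dev.amplifydata.io",
    "b16f31fd-41e1-4dd9-839d-f67d06af95c0",
)
print(url)
```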

Here is an example response, with the fields explained below:


**Example Response**
```json
{
    "download_links": [
        "https://amplifydata-development-dev.s3.amazonaws.com/..."
    ],
    "metadata": {
        "num_files": 1,
        "total_size_mb": 13.7,
        "avg_file_size_mb": 13.7,
        "expires_at": "2023-06-29 13:56:10"
    }
}
```

* download_links: a list of [AWS S3 presigned URLs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html) that can be used to download the data. Examples appear later in this notebook.
* metadata: a dictionary of information about the product you are downloading
    * num_files: the number of files (and, correspondingly, download_links) included in the response
    * total_size_mb: the total size, in MB, of all files available for download
    * avg_file_size_mb: the average size, in MB, of the files available for download
    * **expires_at**: the date and time after which the returned download_links stop working. You will have to request new links after that.
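
To make the field descriptions concrete, here is a small sketch that reads these fields from a parsed response and checks that `num_files` matches the number of links. `summarize_response` is a hypothetical helper, and the dictionary simply mirrors the example response above (including its elided URL).

```python
def summarize_response(data: dict) -> str:
    """Return a one-line summary of a files response (hypothetical helper)."""
    links = data["download_links"]
    meta = data["metadata"]
    # num_files is described as matching the number of download_links
    if meta["num_files"] != len(links):
        raise ValueError(
            f"num_files is {meta['num_files']} but {len(links)} links were returned"
        )
    return (
        f"{meta['num_files']} file(s), {meta['total_size_mb']} MB total, "
        f"links expire at {meta['expires_at']}"
    )


# Mirrors the example response shown above
example_response = {
    "download_links": ["https://amplifydata-development-dev.s3.amazonaws.com/..."],
    "metadata": {
        "num_files": 1,
        "total_size_mb": 13.7,
        "avg_file_size_mb": 13.7,
        "expires_at": "2023-06-29 13:56:10",
    },
}
print(summarize_response(example_response))
```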


## Your API URL

The Amplify Subscription Card (in the frontend) should provide you with a URL to use to download a product's data.

**Replace the `PRODUCT_API_URL` variable below with the provided api url for your request.** 

After replacing the `PRODUCT_API_URL`, you can run the following code to get the download_links for your product. 

In [51]:
import requests 
import json


# REPLACE WITH THE URL FROM YOUR SUBSCRIPTION!
PRODUCT_API_URL = "https://dev.amplifydata.io/external-api/v2/products/b16f31fd-41e1-4dd9-839d-f67d06af95c0/files"


res = requests.get(
    PRODUCT_API_URL,
    headers = {
        "X-API-KEY": AMPLIFY_API_KEY,
        'accept': 'application/json'
    }
)

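if res.status_code != 200:
    # use res.text rather than res.json() so a non-JSON error body does not mask the failure
    raise RuntimeError(f"Request failed with status {res.status_code}: {res.text}")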
    
data = res.json()

# json used to "prettify" notebook output, only shows up to 10 download links
print(f'metadata = {json.dumps(data["metadata"], indent=4)}')
print(f'download_links = {json.dumps(data["download_links"][:10], indent=4)}')

metadata = {
    "num_files": 1,
    "total_size_mb": 13.7,
    "avg_file_size_mb": 13.7,
    "expires_at": "2023-06-29 15:08:18"
}
download_links = [
    "https://amplifydata-development-dev.s3.amazonaws.com/snowflake%20orders%20100k--b16f31fd-41e1-4dd9-839d-f67d06af95c0/file.csv?AWSAccessKeyId=AKIASC5E62QHCPMXWWSX&Signature=wgzXD0ethF9oEQGu%2FMX3fKDqWe4%3D&Expires=1688065698"
]
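
The download link in the output above carries its own expiry in the `Expires` query parameter, which for this style of presigned URL is a Unix timestamp. The sketch below decodes it; `link_expiry` and `is_expired` are hypothetical helpers, and the sample URL is a stand-in that keeps only the `Expires` value from the output above. That value decodes to 19:08:18 UTC, four hours later than the `expires_at` string of 15:08:18 in the same output, so `expires_at` may not be expressed in UTC. URLs signed with the newer `X-Amz-Date`/`X-Amz-Expires` parameters are not handled here.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def link_expiry(url: str) -> Optional[datetime]:
    """Return the expiry encoded in the URL's Expires parameter, if present."""
    values = parse_qs(urlparse(url).query).get("Expires")
    if not values:
        return None
    # Expires is seconds since the Unix epoch
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """True if the link's Expires time has passed; False if it is unknown."""
    expiry = link_expiry(url)
    if expiry is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now >= expiry


# Stand-in URL: only the Expires value is taken from the notebook output
sample_url = "https://example-bucket.s3.amazonaws.com/file.csv?Expires=1688065698"
print(link_expiry(sample_url))
```

If `is_expired` returns `True` for a link, request a fresh set of `download_links` before downloading.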


## Download your data

Now that we have a successful response from the API, we can use the `download_links` to do what we came here for, and download the data itself to your machine. 



### Download Directly to files

If you are downloading data for a large product, this process can take a long time (it is largely dependent on your network connection).

The following code has a function `download_file` that we will use to download your data to a local file. 

**Make sure the directory you are storing the file in already exists.** You can use the `DATA_DIR` variable in the next cell to control where the data is saved.
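If you prefer to create the directory from Python rather than a shell command, a minimal sketch with the standard library looks like this. `ensure_dir` is a hypothetical helper name; `exist_ok=True` means re-running the cell does not raise an error when the directory is already there.

```python
from pathlib import Path


def ensure_dir(path: str) -> Path:
    """Create `path` (and any missing parents) if needed and return it."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Usage in the notebook:
# ensure_dir(DATA_DIR)
```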

In [60]:
DATA_DIR = "data_dir" 

In [61]:
!mkdir -p $DATA_DIR # creates a directory relative to this notebook; -p avoids an error if it already exists


In [62]:
def download_file(url: str, filepath: str):
    """
    Download the data available at `url` and write it to the file at `filepath`.
    Args:
        url (str): download link provided by the Amplify API response
        filepath (str): path of the file to write to (its directory must already exist!)
    Returns:
        None
    """
    # stream=True avoids loading the whole file into memory at once
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filepath, 'wb') as f:
            # write the response body to disk in 8 KiB chunks
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Downloaded {url} to {filepath}")




for index, i_link in enumerate(data["download_links"]):
    download_file(i_link, f"{DATA_DIR}/file-{index}.csv")

Downloaded https://amplifydata-development-dev.s3.amazonaws.com/snowflake%20orders%20100k--b16f31fd-41e1-4dd9-839d-f67d06af95c0/file.csv?AWSAccessKeyId=AKIASC5E62QHCPMXWWSX&Signature=wgzXD0ethF9oEQGu%2FMX3fKDqWe4%3D&Expires=1688065698 to data_dir/file-0.csv
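
The loop above hard-codes a `.csv` extension for every file. As a sketch under the assumption that some products could be delivered in other formats, the hypothetical helper `local_filename` below takes the extension from the download link's path and falls back to `.csv` when the path has none. The sample URL is a stand-in.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def local_filename(url: str, index: int, default_suffix: str = ".csv") -> str:
    """Name a local file after its index, keeping the remote file's extension."""
    # the path component excludes the query string that carries the signature
    remote_name = PurePosixPath(unquote(urlparse(url).path)).name
    suffix = PurePosixPath(remote_name).suffix or default_suffix
    return f"file-{index}{suffix}"


# Stand-in URL shaped like the download links in this notebook
sample_url = "https://example-bucket.s3.amazonaws.com/some%20product/file.csv?Expires=1688065698"
print(local_filename(sample_url, 0))
```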


### Direct to file with progress bar! 

An extra feature that is nice for longer downloads is a progress bar for each file download. 
The code is very similar, except that we first need to install tqdm to handle the progress bar.


In [63]:
!pip install tqdm



In [64]:
from tqdm import tqdm

def download_file(url: str, filepath: str):
    """
    Download the data available at `url` and write it to the file at `filepath`, showing a progress bar.
    Args:
        url (str): download link provided by the Amplify API response
        filepath (str): path of the file to write to (its directory must already exist!)
    Returns: 
        None
    """
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total_size = int(r.headers.get('content-length', 0))
        progress_bar = tqdm(total=total_size, unit='B', unit_scale=True, position=0, leave=True)
        with open(filepath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk)
                progress_bar.update(len(chunk))
        progress_bar.close()
        print(f"Downloaded {url} to {filepath}")


for index, i_link in enumerate(data["download_links"]):
    download_file(i_link, f"{DATA_DIR}/file-{index}.csv")

100%|███████████████████████████████████████████████████████████████████████████| 14.4M/14.4M [00:00<00:00, 25.8MB/s]

Downloaded https://amplifydata-development-dev.s3.amazonaws.com/snowflake%20orders%20100k--b16f31fd-41e1-4dd9-839d-f67d06af95c0/file.csv?AWSAccessKeyId=AKIASC5E62QHCPMXWWSX&Signature=wgzXD0ethF9oEQGu%2FMX3fKDqWe4%3D&Expires=1688065698 to data_dir/file-0.csv





## Use Pandas to Download 

You can also download your data directly into a pandas dataframe. 

You can pass a download link directly to `pd.read_csv`! 

Here's an example that reads a single download link into a dataframe:

In [65]:
import pandas as pd

## Single file example
df = pd.read_csv(data["download_links"][0]) # read_csv takes one download URL at a time
df.head()

Unnamed: 0,O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT,CREATED_AT
0,4200001.0,13726.0,F,99406.41,1994-02-21,3-MEDIUM,Clerk#000000128,0.0,eep. final deposits are after t,2020-07-26 18:37:56.448
1,4200002.0,129376.0,O,256838.41,1997-04-14,4-NOT SPECIFIED,Clerk#000000281,0.0,ke carefully. blithely regular epitaphs are am...,2023-03-09 19:37:56.448
2,4200003.0,141613.0,O,150849.49,1997-11-24,4-NOT SPECIFIED,Clerk#000000585,0.0,cial accounts. theodolites are carefully. pend...,2022-12-01 19:37:56.448
3,4200004.0,23515.0,O,178688.27,1996-12-09,2-HIGH,Clerk#000000632,0.0,sual requests against the always special packa...,2021-07-20 18:37:56.448
4,4200005.0,97687.0,O,261742.31,1997-02-01,2-HIGH,Clerk#000000562,0.0,"t slyly above the pending, final accounts? reg...",2020-12-09 19:37:56.448


#### Read all download links into a dataframe

Here is an example of how you can use pandas to read multiple download_links into a single dataframe. 

**WARNING**: for large products, this can take a long time, with no indication of progress. For larger downloads, we suggest using the download-to-file example with a progress bar above.
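One detail worth knowing when combining several files: by default `pd.concat` keeps each file's own row index, so index values repeat across files. The sketch below passes `ignore_index=True` to get one continuous index. `read_all` is a hypothetical helper, and the two in-memory CSVs are stand-ins for download links, with column names borrowed from the sample output.

```python
import io

import pandas as pd


def read_all(sources) -> pd.DataFrame:
    """Read every CSV source and stack them into one dataframe."""
    frames = [pd.read_csv(source) for source in sources]
    # ignore_index=True renumbers rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two download links
part_1 = io.StringIO("O_ORDERKEY,O_ORDERSTATUS\n4200001,F\n4200002,O\n")
part_2 = io.StringIO("O_ORDERKEY,O_ORDERSTATUS\n4200003,O\n4200004,O\n")

combined_df = read_all([part_1, part_2])
print(combined_df)
```

In the notebook, `read_all(data["download_links"])` would do the same over the real links, with the same caveat about download time.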


In [66]:
dfs = []

for i_download_link in data["download_links"]:
    i_df = pd.read_csv(i_download_link)
    dfs.append(i_df)

combined_df = pd.concat(dfs)
combined_df.head()

Unnamed: 0,O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT,CREATED_AT
0,4200001.0,13726.0,F,99406.41,1994-02-21,3-MEDIUM,Clerk#000000128,0.0,eep. final deposits are after t,2020-07-26 18:37:56.448
1,4200002.0,129376.0,O,256838.41,1997-04-14,4-NOT SPECIFIED,Clerk#000000281,0.0,ke carefully. blithely regular epitaphs are am...,2023-03-09 19:37:56.448
2,4200003.0,141613.0,O,150849.49,1997-11-24,4-NOT SPECIFIED,Clerk#000000585,0.0,cial accounts. theodolites are carefully. pend...,2022-12-01 19:37:56.448
3,4200004.0,23515.0,O,178688.27,1996-12-09,2-HIGH,Clerk#000000632,0.0,sual requests against the always special packa...,2021-07-20 18:37:56.448
4,4200005.0,97687.0,O,261742.31,1997-02-01,2-HIGH,Clerk#000000562,0.0,"t slyly above the pending, final accounts? reg...",2020-12-09 19:37:56.448
