# How to obtain cloud URIs for TESS images?

*Prepared by Geert Barentsen on Feb 17, 2021.*

## Summary

Many use cases for analyzing TESS data in the cloud would benefit from a friendly and efficient way to query the S3 URIs of data products. This notebook evaluates four methods for obtaining the URIs of the FITS images for a given TESS sector, camera, and CCD.

In brief, the findings are:

| Method | Time | Comments | 
| :----- | ---: | :------- |
| `s3fs.glob()`             | ~15s  | Restricted to querying metadata encoded in the URI. 😟 | 
| `s3://manifest.txt.gz`    | ~15s  | Restricted to querying metadata encoded in the URI. 😟 |
| `astroquery.mast`         | ~200s | Call to `get_cloud_uris()` does not scale. 🐌 | 
| `TAP` via `vao.stsci.edu` | TBD   | Could not figure out how to query cloud URIs via TAP. Help needed! 🚑 |

The remainder of this notebook shows the code used to evaluate each method.  I warmly welcome tips for improvements!

## Code used

In [1]:
# Select a data set to retrieve
SECTOR = 12
CAMERA = 3
CCD = 4

In [2]:
# Install dependencies
!pip install s3fs boto3 pandas -q
!pip install git+https://github.com/astropy/astroquery/ -q  # Need bleeding edge for public AWS data

### Method 1: using `s3fs.glob()`

In [3]:
%%time
import s3fs
fs = s3fs.S3FileSystem(anon=True)
# Use a single `*` inside the file name; `**` is only meaningful as a whole path component
uris = fs.glob(f"stpubdata/tess/public/ffi/s{SECTOR:04d}/*/*/{CAMERA}-{CCD}/*_ffic.fits")
f"Found {len(uris)} image URIs."

CPU times: user 6.23 s, sys: 136 ms, total: 6.36 s
Wall time: 19.1 s


'Found 1289 image URIs.'
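
Note that `fs.glob()` returns paths of the form `bucket/key`, without the `s3://` scheme. A minimal sketch for turning such paths into full URIs follows; `to_s3_uri` is a hypothetical helper, and the example path is made up for illustration.

```python
def to_s3_uri(path: str) -> str:
    """Prefix a `bucket/key` path with the s3:// scheme unless it is already present."""
    return path if path.startswith("s3://") else f"s3://{path}"


# Made-up path in the same layout as the glob pattern above
example = "stpubdata/tess/public/ffi/s0012/2019/140/3-4/example_ffic.fits"
print(to_s3_uri(example))
```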

### Method 2: using `s3://manifest.txt.gz`

In [4]:
%%time
import pandas as pd  # will use `s3fs` in a seamless way
df = pd.read_fwf('s3://stpubdata/tess/public/manifest.txt.gz',
                 compression='gzip',
                 names=['date', 'time', 'size', 'path'])
# Raw string: escape the dot in ".fits" and anchor the end so only FFIC files match
mask = df.path.str.match(rf'tess/public/ffi/s{SECTOR:04d}/.*/.*/{CAMERA}-{CCD}/.*_ffic\.fits$')
uris = df.path[mask]
f"Found {len(uris)} image URIs."

CPU times: user 14.4 s, sys: 872 ms, total: 15.3 s
Wall time: 16.6 s


'Found 1289 image URIs.'
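
To illustrate what the path filter selects, the sketch below applies a stricter variant of the pattern (dot escaped, end anchored, wildcards confined to a single path component) to a few made-up manifest paths, using only the standard library. `ffi_pattern` is a hypothetical helper and the file and directory names are invented.

```python
import re


def ffi_pattern(sector: int, camera: int, ccd: int) -> re.Pattern:
    """Build a regex for FFIC paths of one sector/camera/CCD in the manifest."""
    return re.compile(
        rf"tess/public/ffi/s{sector:04d}/[^/]+/[^/]+/{camera}-{ccd}/[^/]+_ffic\.fits$"
    )


pattern = ffi_pattern(12, 3, 4)

# Invented paths: only the first belongs to sector 12, camera 3, CCD 4
sample_paths = [
    "tess/public/ffi/s0012/2019/140/3-4/example-a_ffic.fits",
    "tess/public/ffi/s0012/2019/140/3-3/example-b_ffic.fits",      # other CCD
    "tess/public/ffi/s0013/2019/170/3-4/example-c_ffic.fits",      # other sector
    "tess/public/ffi/s0012/2019/140/3-4/example-d_ffic.fits.bak",  # trailing suffix
]
matches = [path for path in sample_paths if pattern.match(path)]
print(matches)
```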

### Method 3: using `astroquery.mast`

In [5]:
%%time
from astroquery.mast import Observations
obsTable = Observations.query_criteria(obs_id=f"tess-s{SECTOR:04d}-{CAMERA}-{CCD}")
products = Observations.get_product_list(obsTable)
filtered = Observations.filter_products(products, 
                                        productSubGroupDescription="FFIC",
                                        mrp_only=False)
f"Found {len(filtered)} products"

CPU times: user 609 ms, sys: 41.8 ms, total: 651 ms
Wall time: 1min 11s


'Found 1289 products'

The cell above only obtained MAST URIs. To obtain cloud URIs, we need to call `Observations.get_cloud_uris()`, which queries `https://mast.stsci.edu/api/v0.1/path_lookup` once per product (1289 times for this data set). This step does not scale and frequently fails with an `HTTPError` because of the large number of requests.

In [6]:
%%time
Observations.enable_cloud_dataset()
# The line below queries `https://mast.stsci.edu/api/v0.1/path_lookup` for each product
# This is extremely slow and ~always yields an HTTPError before completing :-(
uris = Observations.get_cloud_uris(filtered)
f"Found {len(uris)} image URIs."

INFO: Using the S3 STScI public dataset [astroquery.mast.cloud]
CPU times: user 36.7 s, sys: 1.84 s, total: 38.6 s
Wall time: 2min 17s


'Found 1289 image URIs.'
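
One way to make this step more robust against transient errors is to process the products in smaller batches and retry a failed batch with exponential backoff. The sketch below is generic: `call_in_batches` is a hypothetical helper, not part of `astroquery`, and the batch size, retry count, and delays are arbitrary assumptions. It does not reduce the number of `path_lookup` requests.

```python
import time


def call_in_batches(func, items, batch_size=100, max_retries=5,
                    base_delay=1.0, retry_on=(Exception,)):
    """Call `func` on successive slices of `items`, retrying failed slices.

    `func` must return a list for each slice; the lists are concatenated.
    """
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                results.extend(func(batch))
                break
            except retry_on:
                if attempt == max_retries:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return results


# Simulated flaky service: every odd-numbered call fails
calls = {"n": 0}

def flaky_double(batch):
    calls["n"] += 1
    if calls["n"] % 2 == 1:
        raise RuntimeError("simulated transient failure")
    return [2 * x for x in batch]

out = call_in_batches(flaky_double, list(range(5)), batch_size=2, base_delay=0)
print(out)
```

In this notebook the helper might be used as `uris = call_in_batches(Observations.get_cloud_uris, filtered)`, assuming that a slice of the product table is accepted by `get_cloud_uris()`; this has not been tested against MAST.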

### Method 4: using MAST TAP?

In [7]:
%%time
from astroquery.utils.tap.core import TapPlus
mast_tap = TapPlus(url="https://vao.stsci.edu/caomtap/tapservice.aspx")
adql = f"""SELECT access_url FROM obscore
           WHERE obs_collection='TESS' AND dataproduct_type = 'image'
           AND obs_id LIKE 'tess%-s{SECTOR:04d}-{CAMERA}-{CCD}%'"""
job = mast_tap.launch_job_async(adql)
uris = job.get_results()

Created TAP+ (v20200428.1) - Connection:
	Host: vao.stsci.edu
	Use HTTPS: True
	Port: 443
	SSL Port: 443
INFO: Query finished. [astroquery.utils.tap.core]
CPU times: user 56.4 ms, sys: 7.03 ms, total: 63.4 ms
Wall time: 5.39 s


In [8]:
f"Found {len(uris)} URIs."

'Found 1289 URIs.'

⚠️ Problem: the query above returns MAST URIs. How can S3 URIs be obtained without making thousands of queries to `/api/v0.1/path_lookup` again?
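
One possible workaround, sketched below, is to match the MAST URIs against the paths in the S3 manifest from Method 2 by file name. It assumes that each product has the same file name in MAST and on S3, and that the manifest paths are keys in the `stpubdata` bucket; neither assumption is verified here. `match_by_filename` is a hypothetical helper and the URIs are made up.

```python
import posixpath


def match_by_filename(mast_uris, s3_keys, bucket="stpubdata"):
    """Map each MAST URI to an S3 URI whose key ends in the same file name.

    MAST URIs without a counterpart in `s3_keys` are omitted from the result.
    """
    key_by_name = {posixpath.basename(key): key for key in s3_keys}
    matched = {}
    for uri in mast_uris:
        key = key_by_name.get(posixpath.basename(uri))
        if key is not None:
            matched[uri] = f"s3://{bucket}/{key}"
    return matched


# Invented examples; the `mast:TESS/product/...` form is an assumption
mast_uris = [
    "mast:TESS/product/example-a_ffic.fits",
    "mast:TESS/product/example-z_ffic.fits",  # not present in the manifest
]
s3_keys = ["tess/public/ffi/s0012/2019/140/3-4/example-a_ffic.fits"]

matched = match_by_filename(mast_uris, s3_keys)
print(matched)
```

The lookup is a dictionary keyed by file name, so it needs a single pass over the manifest and no per-product web requests.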