<div>
<img src="images/Logo_Gaia_may_23_2022.png" width="300" align="right"/>
</div>


# Tutorial:  Download DataLink products for >5000 sources

<br />
<br />
<br />
<br />
<br />


**Release number:** 
v1.0 (2022-07-06)


**Applicable Gaia Data Releases:**
Gaia EDR3, Gaia DR3

**Author:**
Héctor Cánovas Cabrera; hector.canovas@esa.int

**Summary:** 

This Jupyter Notebook shows how to work around the Gaia Archive DataLink download threshold: the input source list is first split into multiple chunks of $\leq$ 5000 sources each, the chunks are then downloaded sequentially, and the outputs are finally merged. As explained in the [DataLink: products serialisation](https://www.cosmos.esa.int/web/gaia-users/archive/datalink-products#datalink_serialisation) tutorial, DataLink products can be retrieved in various data structures and formats. We suggest retrieving them in the COMBINED data structure (as done in all the examples below) because our tests indicate that it is the most efficient data structure for downloading large amounts of products. For simplicity, all the products in the following examples are downloaded in [VOTable](https://www.ivoa.net/documents/VOTable/) format, which can easily be exported to several other formats using the tools available in the [Astropy.table](https://docs.astropy.org/en/stable/table/index.html) module. This complementary [tutorial](https://www.cosmos.esa.int/web/gaia-users/archive/datalink-products#datalink_jntb_get_all_prods) shows how to download and inspect all the different DataLink products via [Astroquery.Gaia](https://astroquery.readthedocs.io/en/latest/gaia/gaia.html) for a small sample of sources. Finally, a few warnings about the units included in the product metadata may appear while executing this notebook. These are known issues and we are working on them.



**Useful URLs:**

* [Questions or suggestions](https://www.cosmos.esa.int/web/gaia/questions)
* [Tutorials, documentation, and more](https://www.cosmos.esa.int/web/gaia-users/archive)
* [Known issues in the Gaia data](https://www.cosmos.esa.int/web/gaia-users/known-issues)
* [Gaia data credits and acknowledgements](https://www.cosmos.esa.int/web/gaia-users/credits)

In [1]:
from astropy.table import Table, vstack
from astroquery.gaia import Gaia
import numpy as np

In [2]:
def chunks(lst, n):
    """
    Split an input list into multiple chunks of size <= n
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
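
As a quick sanity check, the sketch below applies the same `chunks` generator to a small list. The generator is repeated only to keep the example self-contained; the toy list and the chunk size are illustrative values and are not part of the download workflow.

```python
def chunks(lst, n):
    """Split an input list into multiple chunks of size <= n."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Illustrative input: 12 elements split into chunks of at most 5
toy_ids = list(range(12))
toy_chunks = list(chunks(toy_ids, 5))

for chunk in toy_chunks:
    print(len(chunk), chunk)
```

Only the last chunk can be shorter than `n`; all the others contain exactly `n` elements.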

## Connect to the Gaia Archive

The DataLink products are available to both registered and anonymous users. However, we recommend logging in as a registered user, since registered users have extra benefits when executing long queries (as explained in this [FAQ](https://www.cosmos.esa.int/web/gaia-users/archive/faq#account-limits-2020)).

In [3]:
Gaia.login()

INFO: Login to gaia TAP server [astroquery.gaia.core]
User: hcanovas
Password: ········
OK
INFO: Login to gaia data server [astroquery.gaia.core]
OK


##  Execute ADQL Query

The query below retrieves data for 5100 sources that have all the DataLink products offered in Gaia DR3 associated with them.

In [4]:
query = "SELECT TOP 5100 source_id, ra, dec, parallax from gaiadr3.gaia_source \
WHERE has_epoch_photometry = 'True' AND \
has_mcmc_gspphot = 'True' AND \
has_mcmc_msc = 'True' AND \
has_xp_sampled = 'True' AND \
has_rvs = 'True'"

job     = Gaia.launch_job_async(query)
results = job.get_results()
results[0:5]

INFO: Query finished. [astroquery.utils.tap.core]


| source_id | ra | dec | parallax |
| --- | --- | --- | --- |
|  | deg | deg | mas |
| int64 | float64 | float64 | float64 |
| 2263166706630078848 | 295.13035167754015 | 70.28624696426813 | 17.357227526090668 |
| 2263178457660566784 | 294.86955515586925 | 70.52640371163079 | 5.99456673538563 |
| 2268372099615724288 | 285.6366359200697 | 75.41851051257491 | 23.857068308325488 |
| 5912901375001820288 | 263.99225124991324 | -58.82661905857226 | 6.476061657906406 |
| 2266609140096698112 | 275.7457014457717 | 72.17444369607303 | 7.253739784978569 |
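
The `WHERE` clause of the query repeats the same pattern for every `has_*` flag, so it can also be assembled from a list of column names. The following is a minimal sketch under that assumption: `build_query`, `datalink_flags`, and `sketch_query` are hypothetical names introduced here for illustration, while the table and column names are the ones used in the query above. The resulting string would be passed to `Gaia.launch_job_async` in the same way.

```python
def build_query(flags, top=5100, table="gaiadr3.gaia_source"):
    """Assemble an ADQL query that requires every flag in `flags` to be 'True'.

    Hypothetical helper; the column and table names come from the query above.
    """
    conditions = " AND ".join(f"{flag} = 'True'" for flag in flags)
    return (f"SELECT TOP {top} source_id, ra, dec, parallax "
            f"FROM {table} WHERE {conditions}")

datalink_flags = ["has_epoch_photometry", "has_mcmc_gspphot", "has_mcmc_msc",
                  "has_xp_sampled", "has_rvs"]
sketch_query = build_query(datalink_flags)
print(sketch_query)
```

Dropping a flag from the list relaxes the selection to sources that lack that particular product.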


## Download Datalink Products

**Warning**: The ```load_data``` method can retrieve all types of DataLink products (epoch photometry, MCMCs, and spectra) in a single call (see below). However, doing so for a large number (>1000) of sources can severely delay the dataset preparation on the server side, and can even result in a download error. Therefore, we strongly recommend selecting one product at a time in this case.

### Split the input list into several chunks containing <= 5000 elements each


In [5]:
dl_threshold = 5000               # DataLink server threshold. It is not possible to download products for more than 5000 sources in one single call.
ids          = results['source_id']
ids_chunks   = list(chunks(ids, dl_threshold))
datalink_all = []


print(f'* Input list contains {len(ids)} source_IDs')
print(f'* This list is split into {len(ids_chunks)} chunks of <= {dl_threshold} elements each')

* Input list contains 5100 source_IDs
* This list is split into 2 chunks of <= 5000 elements each
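
The number of chunks and the size of the last one follow directly from the list length and the threshold. A short sketch of that arithmetic, using the values printed above (it assumes a non-empty input list):

```python
import math

n_sources    = 5100   # length of the input list in this example
dl_threshold = 5000   # DataLink server threshold

# Number of chunks needed to cover all the sources
n_chunks = math.ceil(n_sources / dl_threshold)

# Size of the final chunk: the remainder, or a full chunk if the division is exact
last_chunk_size = n_sources % dl_threshold or dl_threshold

print(f'{n_chunks} chunks; the last one contains {last_chunk_size} sources')
```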


In [6]:
retrieval_type = 'RVS'        # Options are: 'EPOCH_PHOTOMETRY', 'MCMC_GSPPHOT', 'MCMC_MSC', 'XP_SAMPLED', 'XP_CONTINUOUS', 'RVS' 
data_structure = 'COMBINED'   # Options are: 'INDIVIDUAL', 'COMBINED', 'RAW'. As explained above, we strongly recommend using COMBINED for massive downloads.
data_release   = 'Gaia DR3'   # Options are: 'Gaia DR3' (default), 'Gaia DR2'
dl_key         = f'{retrieval_type}_{data_structure}.xml'


# Download the chunks one after another and store each output
for ii, chunk in enumerate(ids_chunks, start=1):
    print(f'Downloading Chunk #{ii}; N_files = {len(chunk)}')
    datalink  = Gaia.load_data(ids=chunk, data_release = data_release, retrieval_type=retrieval_type, format = 'votable', data_structure = data_structure)
    datalink_all.append(datalink)

Downloading Chunk #1; N_files = 5000




Downloading Chunk #2; N_files = 100
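
Since a long sequential download can be interrupted by a download error, it may be convenient to retry a failed chunk instead of restarting the whole loop. The sketch below is one possible way to do this and is not part of the Astroquery API: `fetch_with_retry`, `max_attempts`, and `wait_seconds` are hypothetical names, and `fetch` stands for any callable that downloads one chunk (for example, a small function wrapping the `Gaia.load_data` call above). A stand-in `flaky_fetch` is used so that the example runs without a connection to the Archive.

```python
import time

def fetch_with_retry(fetch, chunk, max_attempts=3, wait_seconds=0.0):
    """Call fetch(chunk), retrying up to max_attempts times if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(chunk)
        except Exception as err:   # deliberately broad in this sketch
            print(f'Attempt {attempt} failed: {err}')
            if attempt == max_attempts:
                raise              # give up after the last attempt
            time.sleep(wait_seconds)

# Stand-in for the real download: fails on the first call, then succeeds
calls = {'n': 0}

def flaky_fetch(chunk):
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('simulated download error')
    return {'n_files': len(chunk)}

result = fetch_with_retry(flaky_fetch, [1, 2, 3])
print(result)
```

In the download loop, `fetch` would be a function that calls `Gaia.load_data` with the same keyword arguments as in the cell above, and a non-zero `wait_seconds` would give the server some time before the next attempt.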


### Concatenate the DataLink outputs into one single table

The sampled spectra (XP and RVS) are serialised following the [IVOA Spectrum Data Model](https://www.ivoa.net/documents/SpectrumDM/) and as a result a number of parameters, including the associated source_id, are stored in the table metadata. This is taken into account in the cells below. 

### Epoch Photometry, MCMC, or XP Continuous

In this case, the merged product is one single table that includes the source_id in one of the table fields. The code below includes an example showing how to write the entire table using the [Astropy.table](https://docs.astropy.org/en/stable/table/index.html) module. 

**Warning**: the written table can have a size >1 GB.

In [7]:
if 'RVS' not in dl_key and 'XP_SAMPLED' not in dl_key:
    temp       = [inp[dl_key][0].to_table() for inp in datalink_all]
    merged     = vstack(temp)
    file_name  = f"{dl_key}_{data_release.replace(' ','_')}.vot"

    print(f'Writing table as: {file_name}')
    merged.write(file_name, format = 'votable', overwrite = True)

    display(merged)

### XP sampled or RVS

In this case, the merged product is a single Python list whose elements are the individual products. The code below includes an example showing how to write an individual table using the [Astropy.table](https://docs.astropy.org/en/stable/table/index.html) module.

In [9]:
if 'RVS' in dl_key or 'XP_SAMPLED'  in dl_key:
    product_list_tb  = [item                                    for sublist in datalink_all for item in sublist[dl_key]]
    product_list_ids = [item.get_field_by_id("source_id").value for sublist in datalink_all for item in sublist[dl_key]]
    
    
    ii          = 12     # Try different values to display the content of the individual products.
    source_id   = product_list_ids[ii]
    product_tab = product_list_tb[ii].to_table()
    file_name   = f"{dl_key.replace('_COMBINED.xml', '')}_{data_release.replace(' ','_')}_{source_id}.vot"
    
    print(f'Writing table as: {file_name}')
    product_tab.write(file_name, format = 'votable', overwrite = True)
    print()
    print(f'Showing {retrieval_type} for source_id = {source_id}')
    display(product_tab[:5])

Writing table as: RVS_Gaia_DR3_5912768368455408000.vot

Showing RVS for source_id = 5912768368455408000


| wavelength | flux | flux_error |
| --- | --- | --- |
| nm |  |  |
| float64 | float32 | float32 |
| 846.0 | 0.961344 | 0.03171042 |
| 846.01 | 0.9489333 | 0.020776467 |
| 846.02 | 0.9774552 | 0.017536303 |
| 846.03 | 0.9911668 | 0.014816107 |
| 846.04 | 0.9947043 | 0.013110418 |
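
Because `product_list_ids` and `product_list_tb` are built in the same order, they can be paired into a dictionary to look up a product by its source_id rather than by its position in the list. A minimal sketch, using placeholder strings in place of the real VOTable objects (the two source_ids are taken from the query output above; the placeholder products and the name `products_by_id` are purely illustrative):

```python
# Placeholder data standing in for the lists built in the cell above
product_list_ids = [2263166706630078848, 2263178457660566784]
product_list_tb  = ['<RVS product 0>', '<RVS product 1>']

# Map each source_id to its product; zip pairs the elements by position
products_by_id = dict(zip(product_list_ids, product_list_tb))

print(products_by_id[2263178457660566784])
```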
