In [1]:
__author__ = 'Mike Fitzpatrick, NOAO Data Lab Team'
__version__ = '20190103'
__keywords__ = ['query']

# How to use the Data Lab *Query Client* Service

### Table of contents
* [Summary](#summary)
* [Disclaimer & attribution](#attribution)
* [Imports & setup](#imports)
* [Example Query](#query)
* [Save to virtual storage](#save)

<a class="anchor" id="summary"></a>
# Summary

This notebook documents how to query the Data Lab via the query client service. For full documentation see the <a href="https://datalab.noao.edu/docs/api/queryClient.html">API documentation</a>.


### The query client service interface

The query client service simplifies access to the Data Lab databases. This section describes the query client service interface in case we want to write our own code against that rather than using one of the provided tools.
The query client service accepts an HTTP GET call to the <i>query</i> endpoint with the following parameters:

| Name | Function | Optional | Supported values |
|------|----------|----------|------------------|
| adql | The query string to run against the db (ADQL format)| No | string |
| out | The location to save any results | Yes | 'vos://...', 'mydb://...' |
| fmt | The output format of any results | Yes | ascii, csv, fits, hdf5, votable |
| async_ | Run the query asynchronously | Yes | true/false |

For example: `/query?adql=<query>&fmt=csv`

#### Saving results
If no save location is specified (no <i>out</i> param) then the results are returned directly. A save location beginning with the 'vos://' identifier indicates a location in the user's virtual storage to save the result. A save location beginning with the 'mydb://' identifier indicates the results are to be saved to a table in the user's remote database (MyDB). 
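The prefix rules above can be sketched as a small helper. This is only an illustration of the convention; the function name `classify_destination` is hypothetical and not part of the Data Lab API, and raising an error for any other prefix is our own choice.

```python
def classify_destination(out=None):
    """Describe where a result ends up, based on the prefix of 'out'."""
    if out is None:
        return "returned directly"
    if out.startswith("vos://"):
        return "virtual storage"
    if out.startswith("mydb://"):
        return "remote database (MyDB)"
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError("unrecognised save location: %r" % out)


for location in (None, "vos://zxmags.csv", "mydb://mags"):
    print(location, "->", classify_destination(location))
```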

#### Output formats
The results can be returned as whitespace delimited (<i>ascii</i>), CSV (<i>csv</i>), FITS object (<i>fits</i>), HDF5 (<i>hdf5</i>), or in VOTable format (<i>votable</i>). Note that if the results are saved to the user's database then the output format is ignored.

#### Asynchronous queries
Long queries should be run asynchronously; the service may refuse a synchronous query if its projected run time is too long. A query is submitted asynchronously by setting the <i>async_</i> parameter to <i>true</i>, in which case a job id is returned.

The status of an asynchronous query can be checked by submitting an HTTP GET call to the query manager service <i>status</i> endpoint with the relevant job id as an argument: `/status?jobid=<jobid>`. A return value of 'COMPLETED' indicates the query has terminated. A return value of 'ERROR' indicates that there was a problem with the query.

The results of an asynchronous query (assuming that they were not saved to either the user's virtual storage or remote database) can be retrieved once the query has completed with an HTTP GET call to the query manager service <i>results</i> endpoint with the relevant job id as argument: `/results?jobid=<jobid>`
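As a rough sketch of what writing our own code against this interface might look like, the snippet below only builds the request URLs for the <i>query</i>, <i>status</i> and <i>results</i> endpoints. `SERVICE_URL` is a placeholder (not the real service address), and the helper names `endpoint_url` and `query_url` are hypothetical. No request is actually sent.

```python
from urllib.parse import urlencode

# Placeholder base URL for illustration only; NOT the real service address.
SERVICE_URL = "https://example.org/query-service"

# Output formats listed in the parameter table above.
SUPPORTED_FORMATS = {"ascii", "csv", "fits", "hdf5", "votable"}


def endpoint_url(endpoint, **params):
    """Build a GET URL for an endpoint, URL-encoding the parameters."""
    return "%s/%s?%s" % (SERVICE_URL, endpoint, urlencode(params))


def query_url(adql, fmt=None, out=None, async_=None):
    """Build a URL for the query endpoint; only 'adql' is required."""
    if fmt is not None and fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported format: %r" % fmt)
    params = {"adql": adql}
    if fmt is not None:
        params["fmt"] = fmt
    if out is not None:
        params["out"] = out
    if async_ is not None:
        params["async_"] = "true" if async_ else "false"
    return endpoint_url("query", **params)


print(query_url("select gmag from smash_dr1.object limit 10", fmt="csv"))
print(endpoint_url("status", jobid="remersiaydhhq9g1"))
print(endpoint_url("results", jobid="remersiaydhhq9g1"))
```

Encoding the parameters with `urlencode` matters because an ADQL string contains spaces and punctuation that cannot appear literally in a URL.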

### From Python code

The query client service can be called from Python code using the <i>datalab</i> module. This provides methods to access the various query manager functions in the <i>queryMgr</i> subpackage. See the information [here](https://github.com/noaodatalab/datalab/blob/master/README.md).


<a class="anchor" id="imports"></a>
# Imports and setup

This is the setup that is required to use the query client. The first thing to do is import the relevant Python modules. To save results to virtual storage, we need to retrieve our Data Lab security token.

In [2]:
# Standard lib
from getpass import getpass

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc

# Authentication

In [4]:
# Get your Data Lab security token
token = ac.login(input("Enter user name: (+ENTER) "), getpass("Enter password: (+ENTER) "))
if not ac.isValidToken(token):
    raise Exception('Token is not valid. Please check your username/password and execute this cell again.')

Enter user name: (+ENTER) demo00
Enter password: (+ENTER) ········


#### The *queryClient* class

All queries are executed through the <i>query()</i> method of the <i>queryClient</i> class. This takes as arguments:

| Argument | Description | Default  value | Allowed Values |
|----------|-------------|----------------|----------------|
| adql | The query to be submitted to the TAP service | None | |
| sql | The query to be submitted to the DB directly | None | |
| fmt | The requested format (if any) | ascii | ascii,csv,votable,fits |
| out | The saved location (if any) | None | local filename, *vos://filename*, *mydb://tablename* |
| async_ | Indicates if the query is asynchronous | False | |

All arguments are optional, except that one of *adql* or *sql* must be supplied.  The distinction between these two parameters is in how the *queryClient* executes the query: if *adql* is provided the query is sent to the TAP (Table Access Protocol) service; if *sql* is provided the query is sent directly to the database.  The choice of execution depends on whether the query string contains ADQL-specific functions, or SQL constructs or DB extensions not understood by the TAP service.  For large queries there can also be a performance difference depending on where and how the results are saved.
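One way to keep the *adql*/*sql* choice explicit in our own code is a small helper that builds the keyword arguments for <i>query()</i>. This is only a sketch; the name `query_arguments` and its `use_tap` flag are hypothetical and not part of the client.

```python
def query_arguments(query, use_tap=True, **options):
    """Route 'query' to the TAP service (adql) or directly to the database (sql)."""
    key = "adql" if use_tap else "sql"
    kwargs = {key: query}
    kwargs.update(options)   # e.g. fmt, out, async_
    return kwargs


# A query for the TAP service, returned as CSV
print(query_arguments("select gmag from smash_dr1.object limit 10", fmt="csv"))

# A query that must go directly to the database
print(query_arguments("select table_name from information_schema.tables", use_tap=False))
```

The resulting dictionary could then be passed on as `qc.query(**kwargs)`.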

<a class="anchor" id="query"></a>
### A quick query

Let's say we want to return the $gri$ magnitudes of the first 10 objects in the SMASH DR1 dataset and get them back in CSV format:

In [5]:
query = 'select gmag, rmag, imag from smash_dr1.object limit 10'
response = qc.query(adql = query, fmt = 'csv')
print (response)

gmag,rmag,imag
24.859207,24.14867,23.768522
25.097267,24.357933,24.406269
25.083416,24.611797,24.010031
25.379248,99.989998,24.756306
24.923378,24.037075,23.779806
24.816929,24.496265,24.57375
25.039248,99.989998,99.989998
24.665981,24.336454,24.532278
25.134247,99.989998,99.989998
24.831894,24.246521,23.679804
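Since the response is plain CSV text, it can be turned into Python values with the standard library. The sketch below uses a shortened copy of the output above rather than a live query, and assumes every column is numeric.

```python
import csv
import io

# A shortened copy of the CSV text printed above.
response = """gmag,rmag,imag
24.859207,24.14867,23.768522
25.097267,24.357933,24.406269
25.379248,99.989998,24.756306
"""

# Read each row into a dict keyed by column name, converting values to float.
reader = csv.DictReader(io.StringIO(response))
rows = [{name: float(value) for name, value in row.items()} for row in reader]

print(len(rows), "rows")
print(rows[0])
```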



<a class="anchor" id="save"></a>
### Saving results to virtual storage

Now we want to save the results from the same query to our virtual storage space instead.  By putting the query in a try-block we are able to trap errors when executing the query.  Note that saving to an output file that already exists will trigger an error, so we use the Storage Manager client to remove the file once we are done; this lets the cell be run multiple times.

In [6]:
try:
    response = qc.query (adql=query, fmt='csv', 
                                  out='vos://zxmags.csv')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output file, or by using a bogus SQL statement, you can view various error messages.
    print (getattr(e, 'message', e))   # not every exception has a .message attribute
else:
    if response is not None: 
        print (response)           # print the response
    else:
        print ("OK")

# Remove the file we just created, but list it first to show it exists
listing = sc.ls (name='vos://zxmags.csv')
print (listing)
sc.rm (name='vos://zxmags.csv')

OK
zxmags.csv


''

### Saving results to remote database

Alternatively we may want to store the results in a table called <i>mags</i> in our remote database.

In [7]:
query = "select * from usno.b1 limit 1000"
try:
    response = qc.query (adql=query, fmt='csv',
                                  out='mydb://mags')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output table, or by using a bogus SQL statement, you can view various error messages.
    print (getattr(e, 'message', e))   # not every exception has a .message attribute
else:
    if response is not None: 
        print (response)           # print the response
    else:
        print ("OK")

Error: relation "mags" already exists



### An asynchronous query

We now want to run a longer query, say, counting the total number of objects in USNO-B1, and need to do it asynchronously. The first thing we do is submit the query as normal but with the <i>async_</i> argument set to <i>True</i>; this will return the id of the asynchronous job. All the previous arguments can also be used to specify where and in what format we want the query results.

In [8]:
query = 'select count(*) from usno.b1'
jobId = qc.query(adql = query, async_ = True)
print (jobId)

remersiaydhhq9g1


We can check on the status of the job at any time:

In [9]:
status = qc.status(jobId = jobId)
print (status)

COMPLETED


While running, the status value will be "EXECUTING".  If the status value is "QUEUED" then the job is waiting to be executed. If it is "ERROR" then there was a problem with the execution. When the status value is "COMPLETED", we can get our results (assuming we did not save them to our virtual storage or remote database).

In [10]:
results = qc.results(jobId = jobId)
print (results)

COUNT
1045175762
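The status check and result retrieval above can be combined into a simple polling loop. The sketch below is one possible approach, not something provided by the client: `wait_for_job` is a hypothetical helper, the status function is passed in as an argument so the loop can be demonstrated with a fake status sequence, and it assumes the status strings are exactly the four values described above.

```python
import time


def wait_for_job(get_status, job_id, poll_interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll get_status(job_id) until the job completes; return the final status."""
    waited = 0.0
    while True:
        status = get_status(job_id)
        if status == "COMPLETED":
            return status
        if status == "ERROR":
            raise RuntimeError("job %s failed" % job_id)
        if waited >= timeout:
            raise TimeoutError("job %s still %s after %.0f s" % (job_id, status, waited))
        # QUEUED or EXECUTING: wait and ask again
        sleep(poll_interval)
        waited += poll_interval


# Demonstration with a fake status sequence instead of a live service.
fake_states = iter(["QUEUED", "EXECUTING", "COMPLETED"])
final = wait_for_job(lambda job_id: next(fake_states), "demo-job", poll_interval=0.0)
print(final)
```

With the client imported as in this notebook, the status function could be `lambda j: qc.status(jobId=j)`, followed by `qc.results(jobId=jobId)` once the loop returns.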



#### Using profiles

Different datasets (or versions of the same dataset) may reside on different backend servers and a user may want to work explicitly with a (typically older) dataset. In some cases these servers will be used only by developers or those with restricted access.

The first thing to do is see what profiles are available.

In [11]:
profilelist = qc.list_profiles()
print (profilelist)

       GALEX-DR6   GALEX DR6 TAP service (29 Tables, TAP Only)
            GAVO   GAVO Data Center TAP service (149 Tables, TAP Only)
         HEASARC   HEASARC Xamin TAP Service (921 Tables, TAP Only)
            IRSA   IRSA TAP Service (478 Tables, TAP Only)
        SDSS-DR9   SDSS DR9 TAP service (92 Tables, TAP Only)
          SIMBAD   SIMBAD TAP service (47 Tables, TAP Only)
    STScI-RegTAP   STScI Registry TAP service (18 Tables, TAP Only)
          Vizier   TAP Vizier query engine (34381 Tables, TAP Only)
         default   Default Public NOAO Data Lab TAP Service / Database



Note the names in this output such as '*GAVO*', '*SDSS-DR9*', etc.  These profiles refer to external TAP services that can be accessed using the Query Manager interface.  Only a few are configured at the moment as we work on ways to automatically discover the >100 such services and provide useful listings of what they contain, but let's see how we can query one and save the result to our Virtual Storage:

In [12]:
qc.set_profile('GAVO')
query = 'select top 10 * from sdssdr7.sources'
response = qc.query(adql = query, fmt = 'csv', out="vos://gavo_out.csv")
print (response)

OK


In this case we queried an SDSS DR7 table at the TAP service run by GAVO in Heidelberg.  
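Before switching to a profile it can be handy to check that its name is actually available. The sketch below parses a shortened copy of the profile listing printed earlier; it assumes the listing is plain text with one profile per line (a name, whitespace, then a description), and `parse_profiles` is a hypothetical helper, not part of the client.

```python
# A shortened copy of the profile listing printed earlier.
listing = """\
       GALEX-DR6   GALEX DR6 TAP service (29 Tables, TAP Only)
            GAVO   GAVO Data Center TAP service (149 Tables, TAP Only)
         default   Default Public NOAO Data Lab TAP Service / Database
"""


def parse_profiles(text):
    """Map each profile name to its description."""
    profiles = {}
    for line in text.splitlines():
        parts = line.split(None, 1)      # name, then the rest of the line
        if len(parts) == 2:
            name, description = parts
            profiles[name] = description.strip()
    return profiles


profiles = parse_profiles(listing)
print(sorted(profiles))
print("GAVO" in profiles)
```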

We can get the details of a particular profile by including the name of the profile as an argument in the <i>list_profiles</i> method:

In [13]:
qc.list_profiles("default")

{'accessURL': 'http://gp01.datalab.noao.edu:8080/ivoa-dal/tap',
 'resultStorePath': '/net/dl1/tap_data/resultStoreImpl',
 'description': 'Default Public NOAO Data Lab TAP Service / Database',
 'database': 'tapdb',
 'tempfilePath': '/net/dl1/temp',
 'vosRoot': 'vos://datalab.noao!vospace',
 'type': 'public',
 'mydb_database': 'mydb'}
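The profile details print as a Python dictionary, so individual fields can be picked out by key. The sketch below works on a copy of the output above (assuming the call returns a regular `dict`) and splits the TAP access URL into its parts with the standard library.

```python
from urllib.parse import urlparse

# A copy of the profile details shown above.
profile = {
    'accessURL': 'http://gp01.datalab.noao.edu:8080/ivoa-dal/tap',
    'description': 'Default Public NOAO Data Lab TAP Service / Database',
    'database': 'tapdb',
    'type': 'public',
    'mydb_database': 'mydb',
}

tap = urlparse(profile['accessURL'])
print(tap.hostname, tap.port, tap.path)
print(profile['type'], profile['database'])
```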

Now let's try a query against the default profile and get a list of all tables in the default database.  Note that in this case we are accessing the *information_schema* schema of the database. It is not included in the TAP service, so we <b>must</b> use the *sql* argument to talk directly to the database.

In [14]:
sql = 'select table_catalog, table_schema, table_name from information_schema.tables'
try:
    qc.set_profile('default')
    default = qc.query(token, sql=sql)
except Exception as e:
    print (getattr(e, 'message', e))   # not every exception has a .message attribute
else:
    print (default)

table_catalog,table_schema,table_name
tapdb,cp_calib,ps1
tapdb,sdss_dr14,x_specobj_ls_dr6_1p5
tapdb,pg_catalog,pg_statistic
tapdb,pg_catalog,pg_type
tapdb,des_dr1,x_gaia_dr2_2p5
tapdb,pg_catalog,pg_policy
tapdb,pg_catalog,pg_authid
tapdb,pg_catalog,pg_shadow
tapdb,pg_catalog,pg_settings
tapdb,pg_catalog,pg_hba_file_rules
tapdb,pg_catalog,pg_file_settings
tapdb,pg_catalog,pg_config
tapdb,sdss_dr14,x_specobj_ls_dr7_1p5
tapdb,mydb,cttargets_20181227_1191
tapdb,mydb,test_1549
tapdb,pg_catalog,pg_user_mapping
tapdb,pg_catalog,pg_replication_origin_status
tapdb,pg_catalog,pg_subscription
tapdb,pg_catalog,pg_stat_user_tables
tapdb,pg_catalog,pg_stat_xact_user_tables
tapdb,pg_catalog,pg_attribute
tapdb,pg_catalog,pg_proc
tapdb,pg_catalog,pg_class
tapdb,pg_catalog,pg_attrdef
tapdb,pg_catalog,pg_constraint
tapdb,pg_catalog,pg_statio_all_tables
tapdb,pg_catalog,pg_statio_sys_tables
tapdb,pg_catalog,pg_statio_user_tables
tapdb,pg_catalog,pg_stat_all_indexes
tapdb,pg_catalog,pg_inherits
tapdb,pg_ca