In [None]:
__author__ = 'M. Fitzpatrick <fitz@noao.edu>, M. Graham <graham@noao.edu>' # single string; emails in <>
__version__ = '20180604' # yyyymmdd; version datestamp of this notebook
__datasets__ = ['']      # datasets used

## How to use the Data Lab *Query Manager* Service

*Revised: June 4, 2018*

This notebook documents how to query the Data Lab via the Query Manager service. This can be done from a notebook, a Python script or from the command line using the <i>datalab</i> command-line tool.  The Query Manager provides a simpler interface to the data access services offered by Data Lab, although direct access to the underlying services is still available to clients.

### The _Query Manager_ Web Service Interface

The Query Manager service simplifies access to the Data Lab databases and other data-access services. This section describes the Query Manager service interface, which allows us to write our own code when not using the <i>datalab</i> command-line client or another third-party application (e.g. _[TOPCAT](http://www.star.bris.ac.uk/~mbt/topcat/)_, _[STILTS](http://www.star.bris.ac.uk/~mbt/stilts/)_ or _[Aladin](http://aladin.u-strasbg.fr/)_).

The Query Manager service accepts an HTTP GET or POST call to the requested endpoint and then interacts with the backend database or access services:

| Endpoint | Description | Parameters | Client Interface |
|----------|-------------|------------|:----------------:|
| /query | Main catalog query interface | adql,sql,async,out,ofmt,profile | query (adql\|sql) |
| /status | Get status of async query | jobid | status (jobid) |
| /results | Get results from async query | jobid | results (jobid) |
| /abort | Abort an async query | jobid | abort (jobid) |
| /schema | Get table schema info | value | schema (value,profile) |
| /list | List tables in MyDB | table | list (table) |
| /create | Create a table in MyDB | table | create (table) |
| /delete | Delete a MyDB table | table | delete (table) |
| /copy | Copy a MyDB table | src,target | copy (src, target) |
| /rename | Rename a MyDB table | src,dest | rename (src, dest) |
| /sia| Generic SIA interface | svc,pos,size | sia (svc, pos, size) |
| /scs | Generic SCS interface | svc,pos,size | scs (svc, pos, size) |
| | | | |
| /available | Service availability endpoint | N/A | N/A |
| /profiles | List available service profiles | profile, format | list_profiles (profile) |
| /debug | Toggle service debug flag | N/A | N/A |


#### Authentication
The Query Manager service requires a Data Lab security token when output is to be saved to either virtual storage or MyDB. This token needs to be passed as the value of the header keyword "_X-DL-AuthToken_" in any HTTP GET call to the service.  If no token is passed, the 'anonymous' user token is assumed.
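
As a minimal sketch of what such a call looks like, the snippet below builds (but does not send) a request that carries the token header, using only the standard library. The base URL `QM_BASE_URL` and the helper name `build_request` are placeholders invented for illustration, not the real service address or part of the Data Lab client; the anonymous token string is the example value used later in this notebook.

```python
from urllib.parse import urlencode
from urllib.request import Request

QM_BASE_URL = 'https://querymanager.example.org'   # hypothetical placeholder URL
ANON_TOKEN = 'anonymous.0.0.anon_access'           # example anonymous token

def build_request(endpoint, params=None, token=None):
    """Build (without sending) a GET request for a Query Manager endpoint."""
    url = QM_BASE_URL + endpoint
    if params:
        url += '?' + urlencode(params)
    # Fall back to the anonymous token when the caller supplies none
    return Request(url, headers={'X-DL-AuthToken': token or ANON_TOKEN})

req = build_request('/schema', {'value': 'usno.a2'})
print(req.full_url)
print(req.get_header('X-dl-authtoken'))   # urllib stores header names capitalized
```

The request object could then be passed to `urllib.request.urlopen()` once `QM_BASE_URL` points at a real service.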

#### Profiles
There are several different backend machines serving data, etc., and it is possible that a user might want to specify a particular set to run their query on and save results to. A <i>profile</i> defines such a set and a particular profile can be specified in a query call (see below for details on profile usage).


## From Python code

The query manager service can be called from Python code using the <i>datalab</i> module. This provides methods to access the various Query Manager functions in the <i>queryClient</i> subpackage. 

### Initialization
This is the setup required to use the Query Manager. First import the relevant Python modules, then retrieve a Data Lab security token (needed by any call that accesses protected resources such as virtual storage or MyDB).
***

### Standard Notebook Setup

In [None]:
# Standard notebook imports
from __future__ import print_function   # Py2/Py3 compatibility
import getpass                          # for Data Lab login
from dl import authClient               # for Data Lab login
from dl import storeClient              # Storage Manager client interface
from dl import queryClient              # Query Manager client interface

#### Data Lab Login

Logging into the Data Lab is required only when a user identity is needed, i.e. when you wish to access protected resources such as virtual storage or MyDB.  If no token is provided, the _anonymous_ user is assumed.  Once you log in, the token is saved and automatically passed to interface methods that require it.  If you wish to supply a different token, each interface method accepts a _token_ argument.

The first time this notebook is executed, log in as the example _demo00_ user (passwd: _balatad_) in the cell below. Subsequent runs may skip this cell.  Note that this token will persist for other notebooks as well.

In [None]:
# Get the login token for the 'demo00' scratch user (password: 'balatad')
token = authClient.login ('demo00',getpass.getpass('Account password: '))
if not authClient.isValidToken (token):
    print ('Error: invalid user login (%s)' % token)
else:
    print ("Login token:   %s" % token)

anon_tok = 'anonymous.0.0.anon_access'                            # EXAMPLES ONLY

#### The *queryClient* class

All queries are executed through the <i>query()</i> method of the <i>queryClient</i> class. This takes as arguments:

| Argument | Description | Default  value | Allowed Values |
|----------|-------------|----------------|----------------|
| token | The login identity token | None | any valid token |
| adql | The query to be submitted to the TAP service | None | |
| sql | The query to be submitted to the DB directly | None | |
| fmt | The requested output format (if any) | csv | csv,ascii,votable,fits,pandas,array,structarray,table |
| out | The saved location (if any) | None | local filename, *vos://filename*, *mydb://tablename* |
| async | Indicates if the query is asynchronous | False | |
| profile | Indicates which service profile to query | default | |

All arguments are optional except that one of *adql* or *sql* must be supplied; if neither is given explicitly but a query string is passed as the first argument, it is treated as the _sql_ parameter.  The two parameters differ in how the *Query Manager* executes the query: an *adql* query is sent to the TAP (Table Access Protocol) service, while an *sql* query is sent directly to the database.  Which one to use depends on whether the query string contains ADQL-specific functions, or SQL constructs or DB extensions not understood by the TAP service.  For large queries there can also be a performance difference depending on where and how the results are saved.  If no *profile* is specified the default profile is assumed; a *profile* may be specified to use an external or development service.  External services may only be accessed using the TAP protocol (i.e. an *adql* query).
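
The rule for choosing between the two parameters can be sketched as a small stand-alone function. This is an illustration of the behaviour described above, not the client's actual code: the name `resolve_query` is invented here, and rejecting a call that passes both *adql* and *sql* is an assumption.

```python
def resolve_query(query=None, adql=None, sql=None):
    """Return (mode, text): 'adql' goes to TAP, 'sql' goes directly to the DB."""
    if adql is not None and sql is not None:
        # Assumption: supplying both is treated as an error in this sketch
        raise ValueError("supply only one of 'adql' or 'sql'")
    if adql is not None:
        return ('adql', adql)
    if sql is not None:
        return ('sql', sql)
    if query is not None:
        return ('sql', query)      # a bare query string is treated as SQL
    raise ValueError("one of 'adql' or 'sql' must be supplied")

print(resolve_query('select * from usno.a2 limit 10'))
print(resolve_query(adql='select top 10 * from usno.a2'))
```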

#### Saving Results
If no save location is specified (i.e. no <i>out</i> param specified) then the results are returned directly to the caller. A save location beginning with the '_vos://_' identifier indicates a location in the user's virtual storage to save the result. Similarly, a save location beginning with the '_mydb://_' identifier indicates the results are to be saved to a named table in the user's remote database (known as _MyDB_). 
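
A short sketch of how a save location maps to a destination, following the prefixes described above. The function name `classify_output` and the returned labels are made up for this example; treating any other string as a local filename follows the argument table earlier in this notebook.

```python
def classify_output(out=None):
    """Describe where query results end up for a given 'out' value."""
    if out is None:
        return 'returned to caller'
    if out.startswith('vos://'):
        return 'virtual storage'
    if out.startswith('mydb://'):
        return 'MyDB table'
    return 'local file'            # anything else is taken as a local filename

for out in [None, 'vos://zzmags.csv', 'mydb://zzmags', 'zzmags.csv']:
    print(repr(out), '->', classify_output(out))
```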

#### Output formats
The results can be returned as whitespace delimited (<i>ascii</i>), CSV (<i>csv</i>, the default), or in VOTable format (<i>votable</i>) depending on the value of the _fmt_ parameter. Note that if the results are saved to the user's database then the output format is ignored.

#### Asynchronous queries
Long queries should be run asynchronously; the service may refuse a synchronous query if its projected run time is too long. A query is submitted asynchronously by setting the <i>async</i> parameter to <i>True</i>, in which case a job id is returned.

The status of an asynchronous query can be checked by submitting an HTTP GET call to the query manager service <i>status</i> endpoint with the relevant job id as an argument: `/status?jobid=<jobid>`. A return value of 'COMPLETED' indicates the query has terminated. A return value of 'ERROR' indicates that there was a problem with the query.

The results of an asynchronous query (assuming that they were not saved to either the user's virtual storage or remote database) can be retrieved once the query has completed with an HTTP GET call to the query manager service <i>results</i> endpoint with the relevant job id as argument: `/results?jobid=<jobid>`
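
The two endpoint URLs and the test for a terminated job can be sketched as follows. The base URL and the helper names `job_url` and `job_finished` are placeholders chosen for this example; only the endpoint names, the `jobid` argument and the two return values come from the description above.

```python
from urllib.parse import urlencode

QM_BASE_URL = 'https://querymanager.example.org'   # hypothetical placeholder URL

def job_url(endpoint, jobid):
    """Build the URL for the 'status' or 'results' endpoint of a given job."""
    return '%s/%s?%s' % (QM_BASE_URL, endpoint, urlencode({'jobid': jobid}))

def job_finished(status):
    """True once the job has terminated, whether successfully or with an error."""
    return status.strip() in ('COMPLETED', 'ERROR')

print(job_url('status', 'abc123'))
print(job_url('results', 'abc123'))
print(job_finished('EXECUTING'), job_finished('COMPLETED'))
```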

## Client Interface Summary

### Discovering Datasets in Data Lab

Data available from the Query Manager service can be discovered using the _**queryClient.schema()**_ method:

        result = queryClient.schema ([token])
        result = queryClient.schema (<value>)
        result = queryClient.schema ([token], value)
        result = queryClient.schema ([token], value=<value>)
where:

        token - Data Lab auth token to use when overriding login token
        value - Schema value to list

The schema _value_ argument has the following forms:

            None or ''                      # List all available datasets
            <schema>                        # List tables in named <schema>
            <schema>.<table>                # List columns in table <schema>.<table>
            <schema>.<table>.<column>       # List details of column <schema>.<table>.<column>

For example:

In [None]:
try:
    # 'result' is used rather than 'list' to avoid shadowing the Python builtin
    result = queryClient.schema()                    # List all available datasets
    #result = queryClient.schema('')                  # List all available datasets
    #result = queryClient.schema('des_dr1')           # List tables in DES DR1
    #result = queryClient.schema('des_dr1.main')      # List columns within the 'main' table of DES DR1
    #result = queryClient.schema('des_dr1.main.ra')   # List details of the 'ra' column in 'main'
    #result = queryClient.schema(profile='IRSA')      # List datasets in external service (see profiles below)
except Exception as e:
    print (str(e))
else:
    print (result)
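
The forms of the schema _value_ argument can also be summarized in code. The helper below is a stand-alone sketch that only inspects the string; the name `describe_schema_value` is invented here and the function does not contact the service.

```python
def describe_schema_value(value=None):
    """Say what a schema() call with this value lists, based on its dotted form."""
    if not value:                           # None or ''
        return 'all available datasets'
    parts = value.split('.')
    if len(parts) == 1:
        return 'tables in schema %s' % value
    if len(parts) == 2:
        return 'columns in table %s' % value
    if len(parts) == 3:
        return 'details of column %s' % value
    raise ValueError('expected at most <schema>.<table>.<column>: %r' % value)

for v in ['', 'des_dr1', 'des_dr1.main', 'des_dr1.main.ra']:
    print(repr(v), '->', describe_schema_value(v))
```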

### Using Profiles

Users may sometimes wish to access different datasets (or older versions of some datasets) that may reside on alternate backend servers (e.g. they were infrequently accessed and so have been moved to offline hardware). In other cases, there will be servers that are used only by developers or those with restricted access (e.g. as we prepare a new data release or work with a science collaboration).  Lastly, we can use the Data Lab _Query Manager_ to access an external TAP service.  Users access these different data-service collections by means of _**profiles**_ within Data Lab.

The first thing to do is see what profiles are available:

In [None]:
profiles = queryClient.list_profiles()
print (profiles)

The things to note in the output here are names such as '*GAVO*', '*SDSS-DR9*', etc.  These profiles refer to external TAP services that can be accessed using the Query Manager interface.  Only a few are currently configured as we work on ways to automatically discover the >100 such services and provide useful listings of what they contain (we are also planning to add the ability for users to define their own set of external profiles).

Let's see how we can query one of these external services and save the result to our virtual storage.

For large queries, external services can be queried with the results saved to virtual storage for later use in a workflow.  In this case, we explicitly set a profile to be the new default:

In [None]:
queryClient.set_profile('GAVO')
query = 'select top 10 * from sdssdr7.sources'                  # Note the use of ADQL syntax 
fname = 'vos://gavo_test.csv'                                   # set the saved result filename
if storeClient.access(fname): storeClient.rm(fname)             # remove existing file

# Submitting the query is just as before, but we're 
# using the GAVO TAP service (and so 'adql' is required)
response = queryClient.query(token, adql=query, fmt='csv',      # FIXME - file not created?
                             out=fname)
print ('Query response: ' + response)

# Use the Storage Manager to verify the file was saved, then clean up
print (storeClient.ls ())
resp = storeClient.rm (fname)

In this case we queried an SDSS DR7 table at the TAP service run by GAVO in Heidelberg.  The name of the currently active default profile can be checked at any time using the _**queryClient.get_profile()**_ method:

In [None]:
cur_profile = queryClient.get_profile()
print ("Current profile:  '%s'" % cur_profile)

We can get the details of a particular profile by including the name of the profile as an argument in the <i>list_profiles</i> method:

In [None]:
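# 'profile_info' is used rather than 'list' to avoid shadowing the Python builtin
profile_info = queryClient.list_profiles("default")
#profile_info = queryClient.list_profiles(profile="default")
print (profile_info)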

Using the _TAP_SCHEMA_ tables available in a TAP service (see the [TAP specification](http://ivoa.net/documents/tap)) we can programmatically discover datasets for each profile.  For example, even though the Data Lab profiles don't explicitly store the available _IRSA_ tables, we can extract these from the _TAP_SCHEMA_ tables.  First, let's look at what we get from the default profile so we can compare with the output of the <i><b>list_profiles()</b></i> method above:

In [None]:
query = 'select table_name, description from tap_schema.tables'
try:
    results = queryClient.query(adql=query, profile='default')
except Exception as e:
    print (str(e))
else:
    print (results[:512])    # print snippet of result

And now we'll run the same query against the '*IRSA*' profile:

In [None]:
try:
    results = queryClient.query(adql=query, profile='IRSA')
except Exception as e:
    print (str(e))
else:
    print (results[:512])    # print snippet of result

Comparing the two outputs, we can see that there are differences in which tables are available. **NOTE:** _Queries of a TAP_SCHEMA table can sometimes fail or behave somewhat differently; use at your own risk._

Lastly, reset the default profile for the remainder of this notebook.

In [None]:
queryClient.set_profile('default')
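
Because forgetting this reset leaves later queries pointed at the wrong service, one option is to wrap the switch in a context manager. The sketch below is not part of the Data Lab client: `use_profile` and `FakeClient` are names invented for this example, and the fake client exists only so the code runs without a Data Lab connection. It relies on nothing more than the `get_profile()` and `set_profile()` calls shown above.

```python
from contextlib import contextmanager

@contextmanager
def use_profile(client, name):
    """Temporarily switch 'client' to profile 'name', restoring the old one on exit."""
    previous = client.get_profile()
    client.set_profile(name)
    try:
        yield client
    finally:
        client.set_profile(previous)    # restored even if the query raised


class FakeClient:
    """Stand-in for queryClient so this sketch runs offline."""
    def __init__(self):
        self._profile = 'default'
    def get_profile(self):
        return self._profile
    def set_profile(self, name):
        self._profile = name


client = FakeClient()
with use_profile(client, 'GAVO'):
    print(client.get_profile())     # profile active inside the block
print(client.get_profile())         # profile after the block
```

With the real module, passing `queryClient` itself as the `client` argument should behave the same way, assuming its `get_profile()` and `set_profile()` functions work as shown in the cells above.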

### Querying Catalogs

Querying a Data Lab catalog is done using the _**queryClient.query()**_ method:

        result = queryClient.query (token, query, adql=None, sql=None, fmt='csv', out=None, async=False, **kw)
where:

        query - An SQL query string
        token - Data Lab auth token to use when overriding login token
        adql - An ADQL query string, submitted to TAP service
        sql - An SQL query string, submitted directly to the database
        fmt - Desired output format (csv|pandas|array|structarray|table|ascii|fits|votable)
        out - A pathname, 'vos://' or 'mydb://' URI to save the file.  If None results returned directly
        async - If True, query executed asynchronously and the method result is a job-id string

#### A quick query

Let's say we want to return the first 10 objects in the USNO A2 catalog and get them back as a CSV string:

In [None]:
# Example 1:  SQL syntax, use default CSV return format
query = 'select * from usno.a2 limit 10'

# Note that when passing in a literal string or query variable
# without explicitly setting the 'sql' parameter, 'sql' will be
# assumed.
#response = queryClient.query('select * from usno.a2 limit 10')  # pass query string directly
response = queryClient.query(query)                             # pass variable containing query
#response = queryClient.query(sql=query, fmt='csv')              # explicitly set 'sql' param and format
#response = queryClient.query(query, token=anon_tok)             # override default user token

print (response)
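
A CSV string like the one returned above can be processed with the standard library alone. The sketch below parses a made-up stand-in for the response; the column names and values are invented, and it assumes the first line of the CSV is a header row.

```python
import csv
import io

# Made-up stand-in for a CSV response string; real columns and values will differ
sample_response = "id,ra,dec\n1,10.5,-3.2\n2,11.0,-3.4\n"

# A comprehension is used instead of list(...) in case 'list' was rebound earlier
rows = [row for row in csv.DictReader(io.StringIO(sample_response))]

print(len(rows))                 # number of data rows
print(rows[0]['ra'])             # values are strings until converted
print(float(rows[1]['dec']))
```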

In [None]:
# Example 2:  Same query but request a Pandas Data Frame as the return
query = 'select * from usno.a2 limit 10'
df = queryClient.query(sql=query, fmt='pandas')
df.head()

In [None]:
# Example 3:  ADQL syntax, explicitly set 'adql' param and format
query = 'select top 10 * from usno.a2'

response = queryClient.query(adql=query, fmt='votable')
print (response[:1024])

#### Saving results to virtual storage

Now we want to save the results from the same query to our virtual storage space instead.  Putting the query in a try-block lets us trap errors raised while executing it.  Running the query again while the output file still exists triggers an error, so we use the Storage Manager client to remove the file once we are done.

In [None]:
try:
    query = 'select top 10 * from usno.a2'
    response = queryClient.query (adql=query, fmt='csv', 
                                  out='vos://zzmags.csv')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output file, or by using a bogus SQL statement, you can view various error messages.
    print (str(e))
else:
    if response is not None: 
        print ("query() response: '%s'\n" % response)  # print the response
    else:
        print ("OK")

# Remove the file we just created, but list it first to show it exists
print (storeClient.ls ('vos://',format='short'))
resp = storeClient.rm ('vos://zzmags*')

#### Saving Results to MyDB

Alternatively we may want to store the results in a table called <i>zzmags</i> in our remote database.

In [None]:
try:
    query = "select * from usno.a2 limit 1000"
    response = queryClient.query (token, sql=query, fmt='csv', 
                                  out='mydb://zzmags')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output file, or by using a bogus SQL statement, you can view various error messages.
    print (str(e))
else:
    if response is not None: 
        print ("query() response: '%s'\n" % response)           # print the response
    else:
        print ("OK")

#### Listing Available MyDB Tables

To list what tables are available in your MyDB storage, you can use the _**queryClient.list()**_ method:

        result = queryClient.list ()                     # Lists names of all MyDB tables
        result = queryClient.list (token)                #   "     "    "  "   "     "
        result = queryClient.list (table)                # List columns of a specific table
        result = queryClient.list (token, <table>)       #   "    "      " "    "       "
        result = queryClient.list (token, table=<table>) #   "    "      " "    "       "

where:

        table - Name of a specific MyDB table -- lists column descriptions
        token - Data Lab auth token to use when overriding login token

In [None]:
# List all available tables, or example table created above
#result = queryClient.list ()
#result = queryClient.list (token)
#result = queryClient.list ('zzmags')
#result = queryClient.list (token, 'zzmags')
result = queryClient.list (token, table='zzmags')

print (result)

#### Dropping MyDB Tables

Removing tables from MyDB is done using the _**queryClient.drop()**_ method:

        result = queryClient.drop (<table>)                       # Drop the named table
        result = queryClient.drop (table=<table>)                 #   "   "    "     "
        result = queryClient.drop (token, table=<table>)          #   "   "    "     "
        result = queryClient.drop (token=<token>, table=<table>)  #   "   "    "     "
where:

        table - Name of the MyDB table to drop
        token - Data Lab auth token to use when overriding login token
        
The _drop()_ method returns "OK" on success or an error message.

In [None]:
# Drop the example table created above
print ('MyDB tables before drop():\n' + queryClient.list ())
       
result = queryClient.drop ('zzmags')
#result = queryClient.drop (token, 'zzmags')
#result = queryClient.drop (token, table='zzmags')
       
print ('MyDB tables after drop():\n' + queryClient.list ())
print (result[:256])

#### Submitting an Asynchronous Query

We now want to run a longer query, say, a count of objects in USNO-A2 (typically ~1 min to execute), and need to do it asynchronously. The first thing we do is submit the query as normal but with the <i>async</i> argument set to _True_ - this will return the id of the asynchronous job. All the previous arguments can also be used to specify where and in what format we want the query results.

In [None]:
# A simple long-running query
jobId = queryClient.query(adql='SELECT COUNT(*) FROM usno.a2', async=True)
#jobId = queryClient.query(adql='SELECT foo,bar FROM usno.a2', async=True)   # to generate error
print ('Job ID: ' + jobId)

####  Checking Asynchronous Job Status

We can check on the status of an asynchronous job at any time using the _**queryClient.status()**_ method:

        status = queryClient.status (<jobID>)
        status = queryClient.status (token, <jobID>)
        status = queryClient.status (token, jobId=<jobID>)
where:

        jobID - The job ID string returned by query() in async mode
        token - Data Lab auth token to use when overriding login token
        
The _status()_ method will return one of 'QUEUED', 'EXECUTING', 'COMPLETED', 'ABORTED' or 'ERROR' depending on the phase of the job identified by the `<jobID>` string.  Typically this would be done with a loop containing a pause to wait until the query is completed.

In [None]:
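import time     # for the sleep() method
status = ''
while (status not in ['COMPLETED','ERROR','ABORTED']):   # stop on any terminal phase
    status = queryClient.status(jobId)
    if status == 'ERROR':
        print ('Error: %s' % queryClient.error(jobId))   # retrieve the failure message
    else:
        print (status)
    time.sleep(5)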

The _**queryClient.error()**_ method, described below, retrieves the error message for a failed job.

If the status value is "_QUEUED_" then the job is waiting to be executed. If it is "_ERROR_" then there was a problem with the execution. When the status value is "_COMPLETED_", we can get our results (assuming we did not save them to our virtual storage or remote database).
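
A polling loop can also be wrapped in a reusable function with a timeout, so that a stuck job does not block a workflow forever. The sketch below is one possible shape, not part of the Data Lab client: `wait_for_job`, its `poll` and `timeout` parameters and the fake status source are all assumptions made for illustration. The status function is passed in as an argument so the example runs without a live service.

```python
import time

TERMINAL = ('COMPLETED', 'ERROR', 'ABORTED')

def wait_for_job(get_status, job_id, poll=5.0, timeout=300.0):
    """Poll get_status(job_id) until a terminal phase is seen or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('job %s still %s after %.0f s' % (job_id, status, timeout))
        time.sleep(poll)

# Demonstration with a fake status source instead of the live service
phases = iter(['QUEUED', 'EXECUTING', 'COMPLETED'])
print(wait_for_job(lambda job_id: next(phases), 'job-1', poll=0.01))
```

With the real client this would be called as `wait_for_job(queryClient.status, jobId)`, given that _status()_ accepts the job ID as its only positional argument as shown above.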

#### Retrieving Asynchronous Query Results

We can retrieve the results of an asynchronous job once completed using the _**queryClient.results()**_ method:

        results = queryClient.results (<jobID>, delete=True)
        results = queryClient.results (token, <jobID>, delete=True)
        results = queryClient.results (token, jobId=<jobID>, delete=True)
where:

        jobID - The job ID string returned by query() in async mode
        token - Data Lab auth token to use when overriding login token
        delete - If True (the default), remove the results from the server once retrieved
        
By default results of an asynchronous query can only be retrieved *once* as they are removed from the server following a successful retrieval.  However, results will typically be available for up to 7 days once a query is complete.

In [None]:
results = queryClient.results(jobId)
print (results)

The _delete_ option can be disabled when results may need to be retrieved more than once in a workflow.  For example:

In [None]:
for i in range(3):
    print (queryClient.results(jobId, delete=False))
junk = queryClient.results(jobId, delete=True)                # delete the results from the server
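
An alternative for workflows that reuse results is to retrieve them once and keep a local copy. The wrapper below is a sketch of that idea, not a Data Lab feature: `cached_results` and the fake fetch function are invented for this example, and the result text is made up.

```python
def cached_results(fetch):
    """Wrap fetch(job_id) so each job's results are retrieved only once."""
    cache = {}
    def results(job_id):
        if job_id not in cache:
            cache[job_id] = fetch(job_id)    # the only call that reaches the server
        return cache[job_id]
    return results

# Demonstration with a fake fetch function that records how often it is called
calls = []
def fake_fetch(job_id):
    calls.append(job_id)
    return 'count\n42\n'                     # made-up result text

get_results = cached_results(fake_fetch)
for _ in range(3):
    text = get_results('job-1')
print(len(calls))                            # fetches actually performed
```

With the real client, `cached_results(queryClient.results)` would leave the default _delete_ behaviour in place while still allowing repeated local access.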

#### Retrieving Error Messages from Asynchronous Queries

If an asynchronous query exits with an 'ERROR' status, the _**queryClient.error()**_ method may be used to retrieve the error message.

        errmsg = queryClient.error (<jobID>)
        errmsg = queryClient.error (token, <jobID>)
        errmsg = queryClient.error (token, jobId=<jobID>)
where: 

        jobID - The job ID string returned by query() in async mode
        token - Data Lab auth token to use when overriding login token
        
For example:

In [None]:
# Submit a bogus ADQL query
jobId = queryClient.query(adql='SELECT foo,bar FROM usno.a2', async=True)   # to generate error
print ('Job ID: ' + jobId)

status = ''
while (status not in ['COMPLETED','ERROR']):
    status = queryClient.status(jobId)
    print ((status if status != 'ERROR' else ('Error: %s' % (queryClient.error(jobId)))))
    time.sleep(1)

#### Aborting an Asynchronous Query

An asynchronous query may be aborted using the _**queryClient.abort()**_ method.

        msg = queryClient.abort (<jobID>)
        msg = queryClient.abort (token, <jobID>)
        msg = queryClient.abort (token, jobId=<jobID>)
where: 

        jobID - The job ID string returned by query() in async mode
        token - Data Lab auth token to use when overriding login token
        
For example:

In [None]:
# Submit an async ADQL query
jobId = queryClient.query(adql='SELECT count(*) FROM usno.a2', async=True)   # a long-running query
print ('Job ID: ' + jobId)

time.sleep(10)                    # let it run for a little bit ....
msg = queryClient.abort (jobId)   # kill the job

In [None]:
# Wait for job to be killed on the server and then get status
time.sleep(5)
msg = queryClient.abort (jobId)
status = queryClient.status(jobId)
print (jobId)
print ('Status:  %s   Message:  %s' % (status,msg))     # FIXME - final status should be 'ABORTED'

### Using Multiple Clients

Because methods like _set_profile()_ change the default services available, applications that query a mix of Data Lab and external services may find it convenient to instantiate multiple Query Manager clients by name.  This can be done using the _**queryClient.getClient()**_ method:

        client = queryClient.getClient (profile=<profile_name>, svc_url=<service_url>)   
where

        profile - The name of a valid Query Manager profile
        svc_url - The service URL of the Query Manager instance to use
        
For example:

In [None]:
qc = queryClient.getClient()                   # get a default client
gavo = queryClient.getClient (profile='GAVO')  # get a client for the GAVO profile

print ('Default schema:\n' + qc.schema(''))
print ('GAVO schema:\n' + gavo.schema(''))

***

## Using the *datalab* command

The <i>datalab</i> command provides an alternative, command-line way to work with the Query Manager through the <i>query</i> subcommand.  Similarly, other commands exist to view or access virtual storage or manage the user's login.  For a summary of available commands, type:

In [None]:
!datalab help

#### Logging into the Data Lab

We need to be logged into the Data Lab to use the Query Manager if we also plan to use the Virtual Storage resources of our user account.  If you simply wish to access the data and return it for local use, you can skip this step and query for data anonymously.  Here, we'll log in using a demo account.  Users may log in under several identities (e.g. a shared collaboration account as well as a personal account), but only one identity is active at a time (the '*whoami*' command prints which one).

In [None]:
!datalab login user=demo00 password=balatad
!datalab whoami

Similarly, the '*logout*' command can be used to log out of the Data Lab.

In [None]:
!datalab logout
!datalab whoami

#### A quick data query

This time we will return the magnitudes of the top 10 objects in the SMASH DR1 object table and get them back as a CSV string (this is the default, so specifying the '*fmt*' parameter is optional):

In [None]:
!datalab query fmt='csv' \
    adql='select top 10 id,gmag,rmag,imag from smash_dr1.object'

#### Saving results to virtual storage

Now we'll save the results to our virtual storage space as a file.  Note that the initial login above has saved our identity token, which will be passed automatically with the query.  Afterward, we'll list the virtual storage space to confirm the file was saved.  Notice that in this query, we use the '*sql*' parameter to submit the query directly to the database since we are using the SQL 'limit 10000' syntax instead of the ADQL 'top 10000' syntax.

In [None]:
# Login again to access Virtual Storage
!datalab login user=demo00 password=balatad

!datalab query out='vos://smash_mags.csv'  \
    sql='select id,gmag,rmag,imag from smash_dr1.object limit 10000'
!datalab ls name=vos://

#### Saving results to remote database

We can also save data to our remote database.  Here we use the '*listdb*' command to list the tables in our MyDB; specifying a table name gives a more descriptive listing of that table's structure.

In [None]:
!datalab query out='mydb://smash_mags34' \
    adql='select top 1000 id,gmag,rmag,imag from smash_dr1.object' 
!datalab listdb
!datalab listdb table=smash_mags34

#### An asynchronous query

Alternatively, we can run a longer asynchronous query:

In [None]:
%%script bash
export jobId=`datalab query adql='select count(*) from des_sva1.gold_catalog' async=True`
echo "Job ID = "$jobId

status=`datalab qstatus jobId=$jobId`
echo "Init status:  "$status
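# Stop polling on either terminal status so a failed job cannot loop forever
while [ "$status" != "COMPLETED" ] && [ "$status" != "ERROR" ]
do
    echo `date`"  "$status
    sleep 1
    status=`datalab qstatus jobId=$jobId`
done

echo "Query finished with status $status, result is:"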
datalab qresults jobId=$jobId