In [1]:
__author__ = 'Mike Fitzpatrick <michael.fitzpatrick@noirlab.edu>, Alice Jacques <alice.jacques@noirlab.edu>, NOIRLab Astro Data Lab Team <datalab@noirlab.edu>'
__version__ = '20201216'
__datasets__ = ['usno','smash_dr1','gaia_dr1']
__keywords__ = ['query','vospace','mydb']

# How to use the Data Lab *Query Client* Service

### Table of contents
* [Summary](#summary)
* [Disclaimer & attribution](#attribution)
* [Imports & setup](#imports)
* [Review: General template for a simple query in SQL](#review)
* [Quick Example Query](#query)
* [Save to virtual storage VOSpace](#savevospace)
* [Save to remote database MyDB](#savemydb)
* [Import data into a MyDB table with QueryClient](#importmydb)
* [An asynchronous query example](#async)
* [Using profiles](#profiles)


<a class="anchor" id="summary"></a>
# Summary

This notebook documents how to query the Data Lab via the query client service. For full documentation see the <a href="https://datalab.noao.edu/docs/api/queryClient.html">API documentation</a>.


### The *queryClient* class


All queries are executed through the `queryClient.query()` method of the *queryClient* class. This takes as arguments:

| Argument | Description | Optional | Default value | Supported values |
|----------|-------------|----------|---------------|------------------|
| adql | The query to be submitted to the TAP service | Only if `sql` is given | None | string |
| sql | The query to be submitted to the DB directly | Only if `adql` is given | None | string |
| fmt | The requested format (if any) | Yes | csv | ascii,csv,fits,hdf5,votable |
| out | The saved location (if any) | Yes | None | local filename, `vos://filename`, `mydb://tablename` |
| async_ | Indicates if the query is asynchronous | Yes | False | True/False |


All arguments are optional, except that one of `adql` or `sql` must be supplied. The two parameters differ in how the *QueryClient* executes the query: if `adql` is provided, the query is sent to the TAP (Table Access Protocol) service; if `sql` is provided, the query is sent directly to the database. Which one to use depends on whether the query string contains ADQL-specific functions, or SQL constructs or DB extensions not understood by the TAP service. For large queries there can also be a performance difference depending on where and how the results are saved.
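As an illustration of the argument rules above, the following sketch assembles and checks the keyword arguments before they are handed to the query method. The helper `build_query_kwargs` and its `use_tap` flag are hypothetical names introduced here, not part of the *datalab* package; the list of formats is copied from the table.

```python
SUPPORTED_FORMATS = ('ascii', 'csv', 'fits', 'hdf5', 'votable')  # copied from the table above

def build_query_kwargs(query, use_tap=False, fmt=None, out=None, async_=False):
    """Hypothetical helper: assemble keyword arguments for qc.query()."""
    if not query or not query.strip():
        raise ValueError("a query string is required")
    if fmt is not None and fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported format: %r" % (fmt,))
    # Exactly one of `adql` (sent to TAP) or `sql` (sent straight to the database).
    kwargs = {('adql' if use_tap else 'sql'): query, 'async_': async_}
    if fmt is not None:
        kwargs['fmt'] = fmt    # left out entirely when None, so the service default applies
    if out is not None:
        kwargs['out'] = out
    return kwargs

kwargs = build_query_kwargs("SELECT gmag FROM smash_dr1.object LIMIT 10", fmt='csv')
print(kwargs)
# With the client imported as `qc` (see "Imports and setup"), the call would be:
#   qc.query(**kwargs)
```

Building the arguments separately from the call makes it easy to inspect what will be sent before any request goes to the service.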

#### Output formats
The results can be returned as whitespace delimited (*ascii*), CSV (*csv*), FITS object (*fits*), HDF5 (*hdf5*), or in VOTable format (*votable*). Note that if the results are saved to the user's database then the output format is ignored.

#### Saving results
If no save location is specified (no `out` param) then the results are returned directly. A save location beginning with the `'vos://'` identifier indicates a location in the user's virtual storage to save the result. A save location beginning with the `'mydb://'` identifier indicates the results are to be saved to a table in the user's remote database (MyDB). 
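The prefix rules above can be summarized in a few lines of code. This is only a sketch of the convention as described in the text; `classify_destination` is a hypothetical helper, not a *datalab* function.

```python
def classify_destination(out=None):
    """Hypothetical helper: describe where a query result would end up."""
    if out is None:
        return 'returned directly'
    if out.startswith('vos://'):
        return 'virtual storage (VOSpace)'
    if out.startswith('mydb://'):
        return 'MyDB table'
    return 'local file'

for target in (None, 'vos://examplemags.csv', 'mydb://examplemags', './gaiaresult.csv'):
    print(target, '->', classify_destination(target))
```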

#### Asynchronous queries
Long queries should be run asynchronously and the service may refuse any synchronous query attempted if the projected query time is too long. A query can be submitted asynchronously by setting the `async_` parameter to `True`. A job id will then be returned.
    
The status of an asynchronous query can be checked with `queryClient.status(jobid)`. A return value of 'COMPLETED' indicates the query has terminated. A return value of 'ERROR' indicates that there was a problem with the query.    

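The results of an asynchronous query (assuming that they were not saved to either the user's virtual storage or remote database) can be retrieved once the query has completed with:

`result = queryClient.results(jobid)`

Alternatively, passing `wait=True` at submission makes the call block, polling every `poll` seconds, and return the results directly:

`result = qc.query(adql=query,async_=True,wait=True,poll=1,verbose=1)`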

### From Python code

The query client service can be called from Python code using the *datalab* module. This provides methods to access the various query client functions in the *QueryClient* subpackage. See the information [here](https://github.com/noaodatalab/datalab/blob/master/README.md).

Queries can also be run from the command line, e.g. on your local machine, using the datalab command line utility. Read about it in our GitHub repo [here](https://github.com/noaodatalab/datalab).


<a class="anchor" id="attribution"></a>
# Disclaimer & attribution
If you use this notebook for your published science, please acknowledge the following:

* Data Lab concept paper: Fitzpatrick et al., "The NOAO Data Laboratory: a conceptual overview", SPIE, 9149, 2014, http://dx.doi.org/10.1117/12.2057445

* Data Lab disclaimer: https://datalab.noirlab.edu/disclaimers.php

<a class="anchor" id="imports"></a>
# Imports and setup

This is the setup that is required to use the query client. The first thing to do is import the relevant Python modules.

In [2]:
# Standard lib
from getpass import getpass

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert

# Authentication
Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capabilities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).

If you need to log in to Data Lab, issue this command, and respond according to the instructions:

In [3]:
#ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()

'demo00'

<a class="anchor" id="review"></a>
# Review: General template for a simple query in SQL

### SQL is a way to describe to a database what you want from it
General template for a simple query written in SQL
```
SELECT something
FROM database.table
WHERE constraints
LIMIT 100
```
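As a rough illustration, the template can be filled in programmatically. `build_simple_query` is a hypothetical helper written for this notebook; it uses plain string concatenation, so it is only suitable for values you typed yourself.

```python
def build_simple_query(columns, table, where=None, limit=None):
    """Hypothetical helper: fill in the SELECT/FROM/WHERE/LIMIT template."""
    parts = ["SELECT " + ", ".join(columns), "FROM " + table]
    if where:
        parts.append("WHERE " + where)
    if limit is not None:
        parts.append("LIMIT %d" % int(limit))
    return "\n".join(parts)

query = build_simple_query(['gmag', 'rmag', 'imag'], 'smash_dr1.object',
                           where='gmag<99 AND rmag<99 AND imag<99', limit=10)
print(query)
```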
### Please see our intro notebook [JupyterPythonSQL101](https://github.com/noaodatalab/notebooks-latest/blob/master/01_GettingStartedWithDataLab/01_JupyterPythonSQL101.ipynb) for more info on this.

<a class="anchor" id="query"></a>
# A quick query

Let's say we want to fetch the g, r, and i magnitudes of 10 objects in the SMASH DR1 data set, and retrieve the results as a CSV-formatted string:

In [4]:
query = """SELECT gmag, rmag, imag FROM smash_dr1.object 
            WHERE gmag<99 AND rmag<99 AND imag<99 LIMIT 10"""
response = qc.query(sql = query, fmt = 'csv')
print (response)

gmag,rmag,imag
11.9154,11.7978,11.7892
11.9376,11.5092,11.3253
12.0699,11.6934,11.5577
12.0747,16.5371,17.0187
12.1244,11.8028,11.6169
12.1355,11.7621,11.5469
12.1841,11.8446,11.7191
12.2102,12.189,12.318
12.2275,11.7632,11.5062
12.25,11.8397,11.7034
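Since the response is a plain CSV string, it can be parsed with the standard library alone. The sketch below pastes the first three rows of the output above into a string so that it runs offline, then computes a g-r color per object; in a live session `response` would come from `qc.query` instead.

```python
import csv
import io

# First three rows of the response shown above, pasted here so the example runs offline.
response = """gmag,rmag,imag
11.9154,11.7978,11.7892
11.9376,11.5092,11.3253
12.0699,11.6934,11.5577
"""

# One dict per row, with the magnitudes converted from text to float
rows = [{name: float(value) for name, value in row.items()}
        for row in csv.DictReader(io.StringIO(response))]

# g-r color for each object, rounded to the precision of the input
colors = [round(row['gmag'] - row['rmag'], 4) for row in rows]
print(colors)
```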



<a class="anchor" id="savevospace"></a>
# Saving results to virtual storage VOSpace

VOSpace is a convenient storage space for users to save their work. It can store any data or file type.  

Now we want to save the results of the same query to our virtual storage space instead. Putting the query in a try-block lets us trap any errors raised while executing it. Note that running this cell more than once with the same output file will trigger an error, so we use the Storage Manager Client to remove the file once we are done.

In [5]:
try:
    response = qc.query (sql=query, fmt='csv', out='vos://examplemags.csv')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output file, or by using a bogus SQL statement, you can view various error messages.
    print (getattr(e, 'message', str(e)))  # not every exception has a .message attribute
else:
    if response is not None: 
        print (response)        
    else:
        print ("OK")

OK


Let's ensure the file was created in VOSpace:

In [6]:
sc.ls(name='vos://examplemags.csv')

'examplemags.csv'

Now let's remove the file we just created:

In [7]:
sc.rm (name='vos://examplemags.csv')

'OK'

Let's confirm the file is gone from VOSpace. Trying to remove it a second time reports that the node no longer exists:

In [8]:
sc.rm (name='vos://examplemags.csv')

'A Node does not exist with the requested URI.'

<a class="anchor" id="savemydb"></a>
# Saving results to remote database MyDB
MyDB is a remote, per-user relational database that can store data tables. The results of queries can be saved directly to MyDB, as we show in the following example:

In [9]:
try:
    response = qc.query (sql=query, fmt='csv', out='mydb://examplemags')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output file, or by using a bogus SQL statement, you can view various error messages.
    print (getattr(e, 'message', str(e)))  # not every exception has a .message attribute
else:
    if response is not None: 
        print (response)         
    else:
        print ("OK")

OK


Ensure the table has been saved to MyDB by calling the `mydb_list()` function:

In [10]:
print(qc.mydb_list(),"\n")

examplemags,created:2020-12-16 14:11:16 MST
 



Now let's drop the table from our MyDB.

In [11]:
qc.mydb_drop('examplemags')

'OK'

Ensure it has been removed by calling the `mydb_list()` function again:

In [12]:
print(qc.mydb_list(),"\n")

No tables 



<a class="anchor" id="importmydb"></a>
# Import data into a MyDB table with QueryClient

Users can use the `mydb_import` function to import data saved on a local computer, or data from VOSpace, into a MyDB table. The data must be either a CSV file or a pandas DataFrame object in order to be loaded into MyDB.
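The CSV file does not have to come from a query. As a minimal sketch, the code below writes a small table to a local CSV file with the standard library. The rows, column names, and file name are made up for illustration, and it assumes that a header row of column names is appropriate, as in the CSV output of the queries in this notebook.

```python
import csv
import os
import tempfile

# Made-up rows used purely for illustration
rows = [
    {'id': 1, 'ra': 10.5, 'dec': -30.25},
    {'id': 2, 'ra': 11.0, 'dec': -29.75},
]

csv_path = os.path.join(tempfile.mkdtemp(), 'mytable.csv')
with open(csv_path, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'ra', 'dec'])
    writer.writeheader()     # header row carries the column names
    writer.writerows(rows)

with open(csv_path) as f:
    print(f.read())
# With the client imported as `qc`, the file could then be passed to
# qc.mydb_import('mytable', csv_path), in the same way as the cells below.
```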

First let's query some data from the Data Lab database and save it locally as a CSV file:

In [13]:
query = "select * from gaia_dr1.gaia_source limit 10"
qc.query (sql=query, fmt='csv', out='./gaiaresult.csv')

'OK'

Next we will use the `mydb_import` function to import the locally stored CSV file into a MyDB table:

In [14]:
qc.mydb_import('testresult','./gaiaresult.csv')

'OK'

Let's ensure it's there by calling the `mydb_list()` function:

In [15]:
print(qc.mydb_list(),"\n")

testresult,created:2020-12-16 14:11:17 MST
 



Similarly, we can use the `mydb_import` function to import data from VOSpace into a MyDB table:

In [16]:
qc.mydb_import('testresult2','vos://newmags.csv')

'OK'

In [17]:
print(qc.mydb_list(),"\n")

testresult,created:2020-12-16 14:11:17 MST
testresult2,created:2020-12-16 14:11:18 MST
 



Finally, for clean-up purposes, let's remove the two tables we just imported into MyDB by using the `mydb_drop` function:

In [18]:
qc.mydb_drop('testresult')
qc.mydb_drop('testresult2')

'OK'

And let's make sure the two tables were removed from MyDB by using the `mydb_list()` function:

In [19]:
print(qc.mydb_list(),"\n")

No tables 



<a class="anchor" id="async"></a>
# An asynchronous query

We now want to run a longer query, say, counting the total number of objects in USNO-B1, and need to do it asynchronously. To do this we submit the query as usual but with the `async_` argument set to `True`.

An asynchronous query returns a job ID upon submission, e.g.:

```
jobid = qc.query(adql=query,async_=True)
print(jobid)
q5qy6ujsu9cnygcp
```

You can then check periodically for the status of the query, by submitting the `jobid` string to the `status` method:

```
qc.status(jobid)
'EXECUTING'
```

If you repeat the `status` request a bit later, it will eventually change to `'COMPLETED'`, at which point you can retrieve the result:

```
result = qc.results(jobid)
```
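The manual check-and-retry steps above can be wrapped in a small loop. The following is a sketch only: `wait_for_job` and `fake_status` are hypothetical names, the status function is passed in as an argument so the example runs without a Data Lab connection, and only the two end states named in this notebook (`'COMPLETED'` and `'ERROR'`) are treated as final.

```python
import time

def wait_for_job(status_fn, jobid, poll=5, timeout=300):
    """Hypothetical helper: poll status_fn(jobid) until the job finishes.

    Returns the final status string; raises TimeoutError after `timeout` seconds.
    """
    elapsed = 0
    while True:
        status = status_fn(jobid)
        if status in ('COMPLETED', 'ERROR'):
            return status
        if elapsed >= timeout:
            raise TimeoutError("job %s still %s after %s s" % (jobid, status, elapsed))
        time.sleep(poll)
        elapsed += poll

# Stand-in for qc.status so the sketch runs offline:
# two 'EXECUTING' replies, then 'COMPLETED'.
_replies = iter(['EXECUTING', 'EXECUTING', 'COMPLETED'])

def fake_status(jobid):
    return next(_replies)

final = wait_for_job(fake_status, 'q5qy6ujsu9cnygcp', poll=0.01)
print(final)
```

In a live session one would pass `qc.status` as `status_fn` and call `qc.results(jobid)` once the returned status is `'COMPLETED'`.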

**Luckily, there is a built-in mechanism to do the periodic checking for you.** It also retrieves the results of your query for you at the end. The call looks like this:

```
result = qc.query(sql=query,async_=True,wait=True,poll=5,verbose=1)
```

where `poll` sets the polling period (in seconds).

In [20]:
query = 'SELECT count(*) FROM usno.b1'

In [21]:
result = qc.query(sql=query,async_=True,wait=True,poll=10,verbose=1)

EXECUTING
Status = EXECUTING; elapsed time: 10, timeout in 290
EXECUTING
Status = EXECUTING; elapsed time: 20, timeout in 280
EXECUTING
Status = EXECUTING; elapsed time: 30, timeout in 270
EXECUTING
Status = EXECUTING; elapsed time: 40, timeout in 260
EXECUTING
Status = EXECUTING; elapsed time: 50, timeout in 250
EXECUTING
Status = EXECUTING; elapsed time: 60, timeout in 240
EXECUTING
Status = EXECUTING; elapsed time: 70, timeout in 230
EXECUTING
Status = EXECUTING; elapsed time: 80, timeout in 220
EXECUTING
Status = EXECUTING; elapsed time: 90, timeout in 210
EXECUTING
Status = EXECUTING; elapsed time: 100, timeout in 200
EXECUTING
Status = EXECUTING; elapsed time: 110, timeout in 190
EXECUTING
Status = EXECUTING; elapsed time: 120, timeout in 180
EXECUTING
Status = EXECUTING; elapsed time: 130, timeout in 170
EXECUTING
Status = EXECUTING; elapsed time: 140, timeout in 160
EXECUTING
Status = EXECUTING; elapsed time: 150, timeout in 150
EXECUTING
Status = COMPLETED; elapsed time: 160, 

In [22]:
print(result)

COUNT
1045175762



<a class="anchor" id="profiles"></a>
# Using profiles

Sometimes different datasets (or versions of the same dataset) reside on different backend servers, and a user may want to work explicitly with a particular (typically older) dataset. In some cases these servers will be used only by developers or those with restricted access. External TAP services are also accessible from within Data Lab. In both use cases, the `qc.list_profiles()` and `qc.set_profile()` functions come in handy.

The first thing to do is see what profiles are available:

In [23]:
profilelist = qc.list_profiles()
print (profilelist)

       GALEX-DR6   GALEX DR6 TAP service (29 Tables, TAP Only)
            GAVO   GAVO Data Center TAP service (149 Tables, TAP Only)
         HEASARC   HEASARC Xamin TAP Service (921 Tables, TAP Only)
            IRSA   IRSA TAP Service (478 Tables, TAP Only)
        SDSS-DR9   SDSS DR9 TAP service (92 Tables, TAP Only)
          SIMBAD   SIMBAD TAP service (47 Tables, TAP Only)
    STScI-RegTAP   STScI Registry TAP service (18 Tables, TAP Only)
          Vizier   TAP Vizier query engine (34381 Tables, TAP Only)



The things to note in this output are the names, such as '*GAVO*', '*SDSS-DR9*', etc. **These profiles refer to external TAP services that can be accessed using the Query Manager interface.** Only a few are configured at the moment as we work on ways to automatically discover the >100 such services and provide useful listings of what they contain.
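If the listing is needed in a program rather than on screen, it can be split into names and descriptions. This sketch assumes the text layout printed above (a name, a run of spaces, then a description); `parse_profiles` is a hypothetical helper, and the sample string is an excerpt of that output.

```python
import re

# Excerpt of the listing printed above
listing = """\
       GALEX-DR6   GALEX DR6 TAP service (29 Tables, TAP Only)
            GAVO   GAVO Data Center TAP service (149 Tables, TAP Only)
          SIMBAD   SIMBAD TAP service (47 Tables, TAP Only)
"""

def parse_profiles(text):
    """Hypothetical helper: map profile name -> description."""
    profiles = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Name and description are separated by a run of two or more spaces
        parts = re.split(r'\s{2,}', line, maxsplit=1)
        if len(parts) == 2:
            profiles[parts[0]] = parts[1]
    return profiles

profiles = parse_profiles(listing)
print(sorted(profiles))
print('GAVO' in profiles)
```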

Let's see how we can query one and save the result to our VOSpace:

In [24]:
qc.set_profile('GAVO')
query = 'SELECT top 10 * FROM sdssdr7.sources'
response = qc.query(adql = query, fmt = 'csv', out="vos://gavo_out.csv")
print (response)

OK


In this case we queried an SDSS DR7 table at the TAP service run by GAVO in Heidelberg. 
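Because this notebook switches back to the `'default'` profile explicitly further down, one option is to scope a profile change with a context manager so the switch back cannot be forgotten. This is a sketch under assumptions: `use_profile` and `FakeClient` are hypothetical, and the only client call relied on is `set_profile`, which is used above.

```python
from contextlib import contextmanager

@contextmanager
def use_profile(client, name, restore_to='default'):
    """Hypothetical helper: switch profile for a block, then switch back."""
    client.set_profile(name)
    try:
        yield client
    finally:
        # Runs even if the query inside the block raises
        client.set_profile(restore_to)

class FakeClient:
    """Stand-in for the query client so the sketch runs offline."""
    def __init__(self):
        self.profile = 'default'
        self.history = []

    def set_profile(self, name):
        self.profile = name
        self.history.append(name)

client = FakeClient()
with use_profile(client, 'GAVO') as c:
    print('inside:', c.profile)
print('after:', client.profile)
```

In a live session the same pattern would be `with use_profile(qc, 'GAVO'): ...`, with the query call placed inside the block.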

Let's ensure it has been saved to VOSpace:

In [25]:
listing = sc.ls (name='vos://',format='long')
print (listing)

-rw-rw-r-x  demo00       0  05 Sep 2018 10:56  a2_small.csv
-rw-rw-r-x  demo00     202  06 Jun 2019 10:57  canaryfiletwo.csv
-rw-rw-r-x  demo00  7162560  22 Sep 2020 13:14  cutout.fits
drwxrwxr-x  demo00       0  04 Jun 2019 20:13  directory1/
-rw-rw-r-x  demo00       0  28 Nov 2018 15:07  fooa.csv
-rw-rw-r-x  demo00   26912  06 Nov 2020 15:53  gaia_sample
-rw-rw-r-x  demo00       0  16 Feb 2018 23:24  gavo1.csv
-rw-rw-r-x  demo00       0  16 Feb 2018 23:24  gavo26.csv
-rw-rw-r-x  demo00       0  16 Feb 2018 23:24  gavo27.csv
-rw-rw-r-x  demo00       0  16 Feb 2018 23:24  gavo28.csv
-rw-rw-r-x  demo00       0  04 Jan 2019 11:49  gavo_out.csv
-rw-rw-r-x  demo00  12027327  19 Nov 2020 09:41  hipplx_glen.csv
-rw-rw-r-x  demo00       0  16 Feb 2018 23:24  lsdr2.csv
-rw-rw-r-x  demo00    5888  05 Feb 2019 16:05  mags.csv
-rw-rw-r-x  demo00       0  14 Sep 2018 22:10  newmags.csv
-rw-rw-r-x  demo00       0  16 Feb 2018 23:24  newmags2.csv
drwxrwxr-x  demo00       0  08 Apr 2018 22:19  public

We can load the data set and, for example, convert it to a pandas DataFrame like this:

In [26]:
data = sc.get(fr = 'vos://gavo_out.csv', to = '')
df = convert(data)
df

Unnamed: 0,objid,run,rerun,camcol,fieldid,obj,ra,dec,raerr,decerr,...,offsetra_i,offsetdec_i,i,err_i,epoch_i,offsetra_z,offsetdec_z,z,err_z,epoch_z
0,758883089249337439,6182,648,1,758883089249337344,95,331.786804,41.387806,0.000117,0.000113,...,-6.4e-05,6.5e-05,19.9785,0.092936,2006.41,4.66667e-07,-2.38889e-07,22.8831,2.08631,2006.41
1,758883089249337433,6182,648,1,758883089249337344,89,331.780687,41.389283,3.5e-05,3e-05,...,-0.000111,-0.000117,20.7977,0.097531,2006.41,-0.000136364,3.20194e-05,20.1367,0.202043,2006.41
2,758883089249337434,6182,648,1,758883089249337344,90,331.784089,41.389143,3.3e-05,3.9e-05,...,-0.000402,0.000139,21.473,0.153602,2006.41,-3.63889e-07,3.53056e-06,20.9629,0.351456,2006.41
3,758883089249337438,6182,648,1,758883089249337344,94,331.793057,41.390098,3.6e-05,3.5e-05,...,8.7e-05,5.2e-05,24.345,1.94074,2006.41,8.76278e-05,5.14444e-05,22.5182,1.63563,2006.41
4,758883089249339591,6182,648,1,758883089249337344,2247,331.801543,41.390926,1.6e-05,1.5e-05,...,1e-06,1.1e-05,20.6834,0.059923,2006.41,-5.13583e-05,-6.525e-06,20.2915,0.163408,2006.41
5,758883089249337441,6182,648,1,758883089249337344,97,331.797318,41.390873,7e-05,6.8e-05,...,-8.5e-05,-4.5e-05,20.5697,0.132147,2006.41,5e-07,-1.94444e-07,24.7128,0.622408,2006.41
6,758883089249337437,6182,648,1,758883089249337344,93,331.798537,41.392072,3.3e-05,3.2e-05,...,5e-06,-6e-06,20.1749,0.120331,2006.41,-4.35306e-05,0.000361214,24.0276,1.43299,2006.41
7,758883089249337419,6182,648,1,758883089249337344,75,331.792733,41.392017,1.4e-05,1.3e-05,...,-4e-06,-5e-06,15.8837,0.003775,2006.41,-4.59167e-06,-1.69722e-06,15.5639,0.005426,2006.41
8,758883089249337423,6182,648,1,758883089249337344,79,331.796938,41.394596,1.5e-05,1.4e-05,...,-3e-06,-2.6e-05,19.8373,0.032767,2006.41,-6.10833e-05,-6.42139e-05,19.3057,0.074711,2006.41
9,758883089249339570,6182,648,1,758883089249337344,2226,331.775376,41.392066,1.8e-05,1.8e-05,...,3.1e-05,1.4e-05,,,2006.41,-4.47222e-07,7.00833e-06,,,2006.41


We can get the details of a particular profile by including the name of the profile as an argument in the `list_profiles` method:

In [27]:
qc.list_profiles('default')

{'mydb_user': 'datalab',
 'description': 'db01',
 'tempfilePath': '/net/dl1/temp',
 'accessURL': 'http://gp01.datalab.noao.edu:8080/ivoa-dal/tap',
 'resultStorePath': '/net/dl1/tap_data/resultStoreImpl',
 'database': 'tapdb',
 'type': 'hidden',
 'vosRoot': 'vos://datalab.noao!vospace'}

So let's try a query against the default profile and get a list of all tables in the default database. Note that in this case we are accessing the *information_schema* tables of the database. These are not included in the TAP service, so we **must** use the `sql` argument to talk directly to the database.

In [28]:
sql = 'SELECT table_catalog, table_schema, table_name FROM information_schema.tables'
try:
    qc.set_profile('default')
    default = qc.query(sql=sql)
except Exception as e:
    print (getattr(e, 'message', str(e)))  # not every exception has a .message attribute
else:
    print (default)

table_catalog,table_schema,table_name
tapdb,pg_catalog,pg_aggregate
tapdb,pg_catalog,pg_am
tapdb,pg_catalog,pg_amop
tapdb,pg_catalog,pg_amproc
tapdb,pg_catalog,pg_attrdef
tapdb,pg_catalog,pg_attribute
tapdb,pg_catalog,pg_auth_members
tapdb,pg_catalog,pg_authid
tapdb,pg_catalog,pg_available_extension_versions
tapdb,pg_catalog,pg_available_extensions
tapdb,pg_catalog,pg_cast
tapdb,pg_catalog,pg_class
tapdb,pg_catalog,pg_collation
tapdb,pg_catalog,pg_config
tapdb,pg_catalog,pg_constraint
tapdb,pg_catalog,pg_conversion
tapdb,pg_catalog,pg_cursors
tapdb,pg_catalog,pg_database
tapdb,pg_catalog,pg_db_role_setting
tapdb,pg_catalog,pg_default_acl
tapdb,pg_catalog,pg_depend
tapdb,pg_catalog,pg_description
tapdb,pg_catalog,pg_enum
tapdb,pg_catalog,pg_event_trigger
tapdb,pg_catalog,pg_extension
tapdb,pg_catalog,pg_file_settings
tapdb,pg_catalog,pg_foreign_data_wrapper
tapdb,pg_catalog,pg_foreign_server
tapdb,pg_catalog,pg_foreign_table
tapdb,pg_catalog,pg_group
tapdb,pg_catalog,pg_hba_file_rules
t