In [None]:
__author__ = 'Mike Fitzpatrick <mfitzpatrick@noao.edu>, Alice Jacques <alice.jacques@noao.edu>, NOIRLab Astro Data Lab Team <datalab@noao.edu>'
__version__ = '20200908'
__keywords__ = ['query','vospace','mydb']

# How to use the Data Lab *Query Client* Service

### Table of contents
* [Summary](#summary)
* [Disclaimer & attribution](#attribution)
* [Imports & setup](#imports)
* [Review: General template for a simple query in SQL](#review)
* [Quick Example Query](#query)
* [Save to virtual storage VOSpace](#savevospace)
* [Save to remote database MyDB](#savemydb)
* [Import data into a MyDB table with QueryClient](#importmydb)
* [An asynchronous query example](#async)
* [Using profiles](#profiles)


<a class="anchor" id="summary"></a>
# Summary

This notebook documents how to query the Data Lab via the query client service. For full documentation see the <a href="https://datalab.noao.edu/docs/api/queryClient.html">API documentation</a>.


### The *queryClient* class


All queries are executed through the *queryClient.query()* method of the *queryClient* class. This takes as arguments:

| Argument | Description | Optional | Default  value | Supported Values |
|----------|-------------|----------|----------------|----------------|
| adql | The query to be submitted to the TAP service | Yes, if *sql* is given | None | string |
| sql | The query to be submitted to the DB directly | Yes, if *adql* is given | None | string |
| fmt | The requested format (if any) | Yes | ascii | ascii,csv,fits,hdf5,votable |
| out | The saved location (if any) | Yes | None | local filename, *vos://filename*, *mydb://tablename* |
| async_ | Indicates if the query is asynchronous | Yes | False | True/False |


All arguments are optional, except that one of *adql* or *sql* must be supplied. The two differ in how the *queryClient* executes the query: if *adql* is provided, the query is sent to the TAP (Table Access Protocol) service; if *sql* is provided, the query is sent directly to the database. Which one to use depends on whether the query string contains ADQL-specific functions (use *adql*), or SQL constructs or database extensions that the TAP service does not understand (use *sql*). For large queries there can also be a performance difference, depending on where and how the results are saved.
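The argument rules above can be checked before a query is sent. The following is a minimal sketch, not part of the *datalab* package: `check_query_args` is a hypothetical helper, and rejecting the case where both *adql* and *sql* are given is our assumption (the text only says that one of them must be supplied).

```python
# Hypothetical pre-flight check; this helper is NOT part of the datalab package.
SUPPORTED_FORMATS = ('ascii', 'csv', 'fits', 'hdf5', 'votable')

def check_query_args(adql=None, sql=None, fmt='ascii'):
    """Return keyword arguments suitable for qc.query(), or raise ValueError."""
    # Assumption: exactly one of adql/sql should be given.
    if (adql is None) == (sql is None):
        raise ValueError("supply exactly one of 'adql' or 'sql'")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported format: %r" % (fmt,))
    key = 'adql' if adql is not None else 'sql'
    value = adql if adql is not None else sql
    return {key: value, 'fmt': fmt}

kwargs = check_query_args(sql='SELECT gmag FROM smash_dr1.object LIMIT 10', fmt='csv')
print(kwargs)
```

With the Data Lab client imported as `qc`, the returned dictionary could be passed on as `qc.query(**kwargs)`.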

#### Output formats
The results can be returned as whitespace-delimited text (*ascii*), CSV (*csv*), a FITS object (*fits*), HDF5 (*hdf5*), or VOTable (*votable*). Note that if the results are saved to the user's database, the output format is ignored.

#### Saving results
If no save location is specified (no *out* param) then the results are returned directly. A save location beginning with the 'vos://' identifier indicates a location in the user's virtual storage to save the result. A save location beginning with the 'mydb://' identifier indicates the results are to be saved to a table in the user's remote database (MyDB). 
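As an illustration of these prefix rules, the sketch below classifies a value of *out* into the destinations described above. `describe_out` is a hypothetical helper written for this notebook, not a Data Lab function.

```python
def describe_out(out=None):
    """Classify a save location according to the prefix rules described above."""
    if out is None:
        return 'returned directly'
    if out.startswith('vos://'):
        return 'virtual storage (VOSpace)'
    if out.startswith('mydb://'):
        return 'MyDB table'
    # Anything else is treated as a local filename.
    return 'local file'

for location in (None, 'vos://examplemags.csv', 'mydb://examplemags', './gaiaresult.csv'):
    print(location, '->', describe_out(location))
```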

#### Asynchronous queries
Long queries should be run asynchronously; the service may refuse a synchronous query if its projected run time is too long. A query is submitted asynchronously by setting the *async_* parameter to *True*, in which case a job id is returned instead of the results.

    
The status of an asynchronous query can be checked with *queryClient.status(jobid)*. A return value of 'COMPLETED' indicates the query has terminated. A return value of 'ERROR' indicates that there was a problem with the query.    

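The results of an asynchronous query (assuming they were not saved to either the user's virtual storage or remote database) can be retrieved once the query has completed with *queryClient.results(jobid)*. Alternatively, a single call such as *result = qc.query(adql=query,async_=True,wait=True,poll=1,verbose=1)* submits the query asynchronously, polls its status, and returns the results once it has completed.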
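The status check described above amounts to a polling loop. Here is a minimal sketch of such a loop; `wait_for_job` and `fake_status` are hypothetical names, the intermediate states `'QUEUED'` and `'EXECUTING'` are illustrative assumptions, and the stand-in status function exists only so the example runs without a Data Lab connection.

```python
import time

def wait_for_job(status_fn, jobid, poll=1.0, timeout=60.0):
    """Call status_fn(jobid) repeatedly until it returns 'COMPLETED' or 'ERROR'."""
    deadline = time.monotonic() + timeout
    while True:
        state = status_fn(jobid)
        if state in ('COMPLETED', 'ERROR'):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("job %s still %s after %s s" % (jobid, state, timeout))
        time.sleep(poll)

# Stand-in for qc.status: walks through a fixed sequence of states.
_states = iter(['QUEUED', 'EXECUTING', 'COMPLETED'])

def fake_status(jobid):
    return next(_states)

final_state = wait_for_job(fake_status, 'job-123', poll=0.01)
print(final_state)
```

With the real client, the same loop could be driven by passing *qc.status* as `status_fn` together with the job id returned by an asynchronous *qc.query()* call.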

### From Python code

The query client service can be called from Python code using the *datalab* module. This provides methods to access the various query client functions in the *QueryClient* subpackage. See the information [here](https://github.com/noaodatalab/datalab/blob/master/README.md).

Queries can also be run from the command line, e.g. on your local machine, using the datalab command line utility. Read about it in our GitHub repo [here](https://github.com/noaodatalab/datalab).


<a class="anchor" id="attribution"></a>
# Disclaimer & attribution
If you use this notebook for your published science, please acknowledge the following:

* Data Lab concept paper: Fitzpatrick et al., "The NOAO Data Laboratory: a conceptual overview", SPIE, 9149, 2014, http://dx.doi.org/10.1117/12.2057445

* Data Lab disclaimer: https://datalab.noao.edu/disclaimers.php

<a class="anchor" id="imports"></a>
# Imports and setup

This is the setup that is required to use the query client. The first thing to do is import the relevant Python modules.

In [None]:
# Standard lib
from getpass import getpass

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert

# Authentication
Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capabilities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).

If you need to log in to Data Lab, uncomment the *ac.login()* line in the next cell, run it, and respond to the prompts. The *ac.whoAmI()* call then reports which user you are logged in as:

In [None]:
#ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()

<a class="anchor" id="review"></a>
# Review: General template for a simple query in SQL

### SQL is a way to describe to a database what you want from it
General template for a simple query written in SQL
```
SELECT something
FROM database.table
WHERE constraints
LIMIT 100
```
### Please see our intro notebook [JupyterPythonSQL101](https://github.com/noaodatalab/notebooks-latest/blob/master/01_GettingStartedWithDataLab/01_JupyterPythonSQL101.ipynb) for more info on this.
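The template can also be filled in programmatically. The sketch below assembles a query string from its parts; `build_query` is a hypothetical helper, and it does no escaping or validation, so it should only be used with trusted column names and constraints.

```python
def build_query(columns, table, constraints=None, limit=None):
    """Assemble a simple SELECT statement from its parts (no escaping is done)."""
    parts = ['SELECT ' + ', '.join(columns), 'FROM ' + table]
    if constraints:
        parts.append('WHERE ' + ' AND '.join(constraints))
    if limit is not None:
        parts.append('LIMIT %d' % limit)
    return ' '.join(parts)

query = build_query(['gmag', 'rmag', 'imag'], 'smash_dr1.object',
                    constraints=['gmag<99', 'rmag<99', 'imag<99'], limit=10)
print(query)
```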

<a class="anchor" id="query"></a>
# A quick query

Let's say we want to fetch the g,r,i magnitudes from 10 objects in the SMASH DR1 data set, and retrieve the results as a CSV-formatted string:

In [None]:
query = 'SELECT gmag, rmag, imag FROM smash_dr1.object WHERE gmag<99 AND rmag<99 AND imag<99 LIMIT 10'
response = qc.query(sql = query, fmt = 'csv')
print (response)
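Because the response is a plain string, it can be parsed with the standard library alone. The sketch below assumes the CSV text begins with a header line naming the columns; the sample text is a placeholder with invented numbers, not real SMASH output, and `csv_to_rows` is a hypothetical helper.

```python
import csv
import io

# Placeholder text in the shape of a CSV response: a header line, then rows.
# The numbers are invented for illustration and are not real SMASH values.
sample_response = "gmag,rmag,imag\n20.1,19.8,19.6\n21.4,20.9,20.7\n"

def csv_to_rows(text):
    """Parse CSV text into a list of dicts keyed by column name, with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]

rows = csv_to_rows(sample_response)
for row in rows:
    print(row)
```

For larger results, converting to a pandas DataFrame with the *convert* helper imported earlier is likely more convenient.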

<a class="anchor" id="savevospace"></a>
# Saving results to virtual storage VOSpace

VOSpace is a convenient storage space for users to save their work. It can store any data or file type.  

Now we want to save the results of the same query to our virtual storage space instead. Putting the query in a try block lets us trap any errors raised while executing it. Note that running this cell more than once with the same output file will trigger an error, so we use the storage client (*sc*) to remove the file once we are done.

In [None]:
try:
    response = qc.query (sql=query, fmt='csv', out='vos://examplemags.csv')
except Exception as e:
    # Handle any errors in the query.  By running this cell multiple times with the same
    # output file, or by using a bogus SQL statement, you can view various error messages.
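    # Not every exception has a .message attribute, so fall back to the exception itself.
    print (getattr(e, 'message', e))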
else:
    if response is not None: 
        print (response)        
    else:
        print ("OK")

Let's ensure the file was created in VOSpace:

In [None]:
sc.ls(name='vos://examplemags.csv')

Now let's remove the file we just created:

In [None]:
sc.rm (name='vos://examplemags.csv')

Let's ensure the file was removed from VOSpace:

In [None]:
sc.ls (name='vos://examplemags.csv')

<a class="anchor" id="savemydb"></a>
# Saving results to remote database MyDB
MyDB is a remote, per-user relational database that can store data tables. The results of queries can be saved directly to MyDB, as we show in the following example:

In [None]:
try:
    response = qc.query (sql=query, fmt='csv', out='mydb://examplemags')
except Exception as e:
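    # Handle any errors in the query.  By running this cell multiple times with the same
    # output table, or by using a bogus SQL statement, you can view various error messages.
    # Not every exception has a .message attribute, so fall back to the exception itself.
    print (getattr(e, 'message', e))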
else:
    if response is not None: 
        print (response)         
    else:
        print ("OK")

Ensure the table has been saved to MyDB by calling the *mydb_list()* function:

In [None]:
print(qc.mydb_list(),"\n")

Now let's drop the table from our MyDB.

In [None]:
qc.mydb_drop('examplemags')

Ensure it has been removed by calling the *mydb_list()* function again:

In [None]:
print(qc.mydb_list(),"\n")

<a class="anchor" id="importmydb"></a>
# Import data into a MyDB table with QueryClient

The *mydb_import* function imports data into a MyDB table, either from a file on your local computer or from VOSpace. The data must be in the form of either a CSV file or a pandas DataFrame object.
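Any CSV file with a header row can serve as input. As a minimal sketch, the following writes such a file with the standard library; the column names and values are invented placeholders, and the table name in the final comment is hypothetical.

```python
import csv
import os
import tempfile

# Invented example data; the column names are placeholders.
header = ['id', 'ra', 'dec']
rows = [(1, 10.5, -20.25), (2, 11.0, -21.5)]

# Write the table to a CSV file in a fresh temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), 'mytable.csv')
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

with open(csv_path) as f:
    print(f.read())
# With the Data Lab client imported as qc, the file could then be loaded with:
#   qc.mydb_import('mytable', csv_path)
```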

First let's query some data from the Data Lab database and save it locally as a CSV file:

In [None]:
query = "select * from gaia_dr1.gaia_source limit 10"
qc.query (sql=query, fmt='csv', out='./gaiaresult.csv')

Next we will use the *mydb_import* function to import the locally stored CSV file into a MyDB table:

In [None]:
qc.mydb_import('testresult','./gaiaresult.csv')

Let's ensure it's there by calling the *mydb_list()* function:

In [None]:
print(qc.mydb_list(),"\n")

Similarly, we can use the *mydb_import* function to import data from VOSpace into a MyDB table. This assumes that a file named *newmags.csv* already exists in your VOSpace:

In [None]:
qc.mydb_import('testresult2','vos://newmags.csv')

In [None]:
print(qc.mydb_list(),"\n")

Finally, for clean-up purposes, let's remove the two tables we just imported into MyDB by using the *mydb_drop* function:

In [None]:
qc.mydb_drop('testresult')
qc.mydb_drop('testresult2')

And let's make sure the two tables were removed from MyDB by using the *mydb_list()* function:

In [None]:
print(qc.mydb_list(),"\n")

<a class="anchor" id="async"></a>
# An asynchronous query

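We now want to run a longer query, say, counting the total number of objects in USNO-B1, and need to do it asynchronously. To do this we submit the query as usual but with the *async_* argument set to *True*. Setting *wait=True* as well makes the call poll the job and return the result once the query has completed.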

In [None]:
query = 'SELECT count(*) FROM usno.b1'

In [None]:
result = qc.query(sql=query,async_=True,wait=True,poll=5,verbose=1)

In [None]:
print(result)

### Please see our [JupyterPythonSQL101](https://github.com/noaodatalab/notebooks-latest/blob/master/01_GettingStartedWithDataLab/01_JupyterPythonSQL101.ipynb) notebook for more information on getting jobid, checking status until 'COMPLETED', and retrieving results with qc.results(). 

<a class="anchor" id="profiles"></a>
# Using profiles

Sometimes, different datasets (or versions of the same dataset) may reside on different backend servers, and a user may want to work explicitly with a particular (typically older) dataset. In some cases these servers will be used only by developers or those with restricted access. External TAP services are also accessible from within Data Lab. In both use cases, the *qc.list_profiles()* and *qc.set_profile()* functions come in handy.

The first thing to do is see what profiles are available:

In [None]:
profilelist = qc.list_profiles()
print (profilelist)

The things to note in the output here are names such as '*GAVO*', '*SDSS-DR9*', etc. **These profiles refer to external TAP services that can be accessed using the Query Manager interface.** Only a few are configured at the moment as we work on ways to automatically discover the >100 such services and provide useful listings of what they contain.

Let's see how we can query one and save the result to our Virtual Storage VOSpace:

In [None]:
qc.set_profile('GAVO')
query = 'SELECT top 10 * FROM sdssdr7.sources'
response = qc.query(adql = query, fmt = 'csv', out="vos://gavo_out.csv")
print (response)

In this case we queried an SDSS DR7 table at the TAP service run by GAVO in Heidelberg. 

Let's ensure it has been saved to VOSpace:

In [None]:
listing = sc.ls (name='vos://',format='long')
print (listing)

We can load the data set and, for example, convert it to a pandas DataFrame like this:

In [None]:
data = sc.get(fr = 'vos://gavo_out.csv', to = '')
df = convert(data)
df

We can get the details of a particular profile by including the name of the profile as an argument in the *list_profiles* method:

In [None]:
qc.list_profiles('default')

So let's try a query against the default profile: let's get a list of all tables in the default database. Note that in this case we are accessing the database's *information_schema*. It is not exposed through the TAP service, so we **must** use the *sql* argument to talk directly to the database.

In [None]:
sql = 'SELECT table_catalog, table_schema, table_name FROM information_schema.tables'
try:
    qc.set_profile('default')
    default = qc.query(sql=sql)
except Exception as e:
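    # Not every exception has a .message attribute, so fall back to the exception itself.
    print (getattr(e, 'message', e))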
else:
    print (default)
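Since a profile stays active until it is changed again, it can be convenient to switch profiles only temporarily. The sketch below shows one way to do that with a context manager. It assumes the client offers a *get_profile()* function alongside *set_profile()*, which is not shown in this notebook and should be checked against the API documentation; `use_profile` and `FakeClient` are hypothetical names, and the stand-in client exists only so the example runs offline.

```python
from contextlib import contextmanager

@contextmanager
def use_profile(client, name):
    """Switch client to profile `name`, restoring the previous profile afterwards."""
    previous = client.get_profile()   # assumed to exist on the real client
    client.set_profile(name)
    try:
        yield client
    finally:
        # Runs even if the query inside the with-block raises an exception.
        client.set_profile(previous)

class FakeClient:
    """Stand-in exposing only the two profile methods used above."""
    def __init__(self):
        self.profile = 'default'
    def get_profile(self):
        return self.profile
    def set_profile(self, name):
        self.profile = name

client = FakeClient()
with use_profile(client, 'GAVO'):
    inside = client.get_profile()
after = client.get_profile()
print(inside, after)
```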