In [1]:
__author__ = 'Alice Jacques <alice.jacques@noirlab.edu>, Astro Data Lab Team <datalab@noirlab.edu>' 
__version__ = '20220713' #yyyymmdd 
__datasets__ = ['ls_dr9','sdss_dr16','gaia_dr3','des_dr2'] 
__keywords__ = ['crossmatch','joint query','mydb','vospace']

# Examples using the pre-crossmatched tables at Astro Data Lab

by Alice Jacques and the Astro Data Lab Team

### Table of contents
* [Goals](#goals)
* [Disclaimer & attribution](#attribution)
* [Imports & setup](#import)
* [Authentication](#auth)
* [Writing a query with a JOIN statement](#joinquery)
* [Examples of using a JOIN statement in a query with LS and SDSS catalogs](#lssdss)
* [Saving results to VOSpace](#savetovospace)
* [Saving results to MyDB](#savetomydb)
* [Example: a User Table and a pre-crossmatched Table](#usertable)
* [Resources & references](#refs)

<a class="anchor" id="goals"></a>
# Goals

* Learn how to write a query with a JOIN statement to retrieve information from a pre-crossmatched table and another Data Lab table
* Crossmatch a user-provided table against a Data Lab table

For a more in-depth explanation of the pre-crossmatched tables hosted at Astro Data Lab, see our [How-to use pre-crossmatched tables notebook](https://github.com/astro-datalab/notebooks-latest/blob/master/04_HowTos/CrossmatchTables/How_to_use_pre_crossmatched_tables.ipynb). 

<a class="anchor" id="attribution"></a>
# Disclaimer & attribution
If you use this notebook for your published science, please acknowledge the following:

* Data Lab concept paper: Fitzpatrick et al., "The NOAO Data Laboratory: a conceptual overview", SPIE, 9149, 2014, http://dx.doi.org/10.1117/12.2057445

* Data Lab disclaimer: https://datalab.noirlab.edu/disclaimers.php

<a class="anchor" id="import"></a>
# Imports & setup

In [2]:
# std lib
from getpass import getpass

# 3rd party
from astropy.utils.data import download_file  #import file from URL
from matplotlib.ticker import NullFormatter
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert # converts table to Pandas dataframe object

<a class="anchor" id="auth"></a>
# Authentication
Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capabilities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).

If you need to log in to Data Lab, issue this command, and respond according to the instructions:

In [3]:
#ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()

'demo00'

<a class="anchor" id="joinquery"></a>
# Writing a query with a JOIN statement
To extract only the columns relevant to our science question from multiple tables, we can write a query that uses a JOIN statement. There are four main types of JOIN, and which one to choose depends on which rows we want returned:
1. **(INNER) JOIN**: Returns only the rows that have matching values in both tables
2. **LEFT (OUTER) JOIN**: Returns all rows from the left table, plus the matched rows from the right table
3. **RIGHT (OUTER) JOIN**: Returns all rows from the right table, plus the matched rows from the left table
4. **FULL (OUTER) JOIN**: Returns all rows from both tables, whether or not they have a match in the other table

Take a moment to look over the figure below outlining the various JOIN statement types.  
NOTE: the default JOIN is an `INNER JOIN`.
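
The four JOIN types listed above can be compared offline on two tiny tables. The sketch below uses pandas `DataFrame.merge`, whose `how` argument plays the role of the JOIN type; the table contents and column values are made up for illustration and do not come from any Data Lab table.

```python
import pandas as pd

# Two toy tables sharing an "id" key (all values are invented).
left = pd.DataFrame({"id": [1, 2, 3], "z": [0.05, 0.52, 0.73]})
right = pd.DataFrame({"id": [2, 3, 4], "mag_r": [17.7, 18.9, 21.3]})

# DataFrame.merge's `how` argument selects the join type;
# "outer" corresponds to SQL's FULL OUTER JOIN.
joins = {
    how: left.merge(right, on="id", how=how)
    for how in ("inner", "left", "right", "outer")
}

for how, table in joins.items():
    print(f"{how:>5}: ids = {sorted(table['id'].tolist())}")
```

Only ids 2 and 3 exist in both tables, so the inner join keeps just those two rows, while the left, right, and outer joins additionally keep id 1, id 4, or both, filling the missing columns with NaN.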

<img src='join.png'></img>

### `JOIN LATERAL`
In nearest neighbor crossmatch queries, we use `JOIN LATERAL`, which is like a SQL foreach loop that will iterate over each row in a result set and evaluate a subquery using that row as a parameter.
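
To make the "foreach" analogy concrete, here is a minimal pure-Python sketch of what a `LEFT JOIN LATERAL (... ORDER BY distance LIMIT 1) ON true` does conceptually. The function names, the toy coordinates, and the brute-force search are assumptions made for illustration; the database uses spatial indexing rather than a loop over every candidate.

```python
import math


def sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))


def nearest_neighbor_left_join(outer_rows, inner_rows, radius_deg):
    """Mimic LEFT JOIN LATERAL with ORDER BY distance LIMIT 1."""
    result = []
    for outer in outer_rows:  # the "foreach" over the outer result set
        # The "subquery": evaluated once per outer row, using that row's values.
        candidates = []
        for inner in inner_rows:
            dist = sep_deg(outer["ra"], outer["dec"], inner["ra"], inner["dec"])
            if dist <= radius_deg:
                candidates.append((dist, inner["id"]))
        if candidates:
            dist, inner_id = min(candidates)  # ORDER BY distance ASC LIMIT 1
            result.append((outer["id"], inner_id, dist * 3600.0))
        else:
            result.append((outer["id"], None, None))  # LEFT join keeps the row
    return result


# Toy data (invented): "A" has two neighbors within the radius, "B" has none.
user_rows = [{"id": "A", "ra": 150.0, "dec": 2.0},
             {"id": "B", "ra": 10.0, "dec": -5.0}]
catalog_rows = [{"id": "g1", "ra": 150.001, "dec": 2.0},
                {"id": "g2", "ra": 150.003, "dec": 2.0},
                {"id": "g3", "ra": 200.0, "dec": 40.0}]

matches = nearest_neighbor_left_join(user_rows, catalog_rows, radius_deg=0.01)
for row in matches:
    print(row)
```

Row "A" is paired with its closest neighbor only, and row "B" is still returned with empty match columns, which is the behavior that distinguishes a `LEFT JOIN LATERAL` from an inner one.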

<a class="anchor" id="lssdss"></a>
# Examples of using a JOIN statement in a query with LS and SDSS catalogs
## Example 1: A Single JOIN
First we will examine the spectroscopic redshift of objects that are found in both the SDSS DR16 catalog and the LS DR9 catalog by writing a query with a single JOIN statement between their pre-crossmatched table and the SDSS DR16 table. The two crossmatch tables related to these two catalogs are:

`ls_dr9.x1p5__tractor__sdss_dr16__specobj`  
`sdss_dr16.x1p5__specobj__ls_dr9__tractor`

The choice between these two crossmatch tables should be based on the science question being posed. For instance, the question *'how does a galaxy's structure change with redshift?'* depends on the redshift values obtained from SDSS DR16, so we should use the crossmatch table that has SDSS DR16 as the first table. The columns we want to select from our two tables of interest for this example are:

1. "X" = `sdss_dr16.x1p5__specobj__ls_dr9__tractor`
    - **ra1** (RA of SDSS object)
    - **dec1** (Dec of SDSS object)
2. "S" = `sdss_dr16.specobj`
    - **z** (redshift)

### Write the single JOIN statement query
Now that we know what we want and where we want it from, let's write the query and then print the results on screen. Here we use one JOIN statement: it will search in the SDSS DR16 `specobj` table for rows that have the same SDSS id value (`specobjid`) as in the pre-crossmatched table (`id1`) and retrieve the desired columns from the SDSS DR16 `specobj` table within the specified RA and Dec region.

In [4]:
query_single = ("""
SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z
FROM
    sdss_dr16.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr16.specobj AS S ON X.id1 = S.specobjid 
WHERE
    X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
LIMIT 10000
"""
) %(110,200,7.,40.)  #large region
print(query_single) # print the query statement to screen


SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z
FROM
    sdss_dr16.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr16.specobj AS S ON X.id1 = S.specobjid 
WHERE
    X.ra1 BETWEEN 110 and 200 and X.dec1 BETWEEN 7.0 and 40.0
LIMIT 10000
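
The `%`-formatting used above can also be wrapped in a small function so that the same query is easy to rerun on a different sky region. The helper below, `build_box_query`, is a hypothetical convenience and not part of the Data Lab client; it casts its arguments to numbers and checks the bounds before inserting them into the query text.

```python
def build_box_query(ra_min, ra_max, dec_min, dec_max, limit=10000):
    """Return the single-JOIN query text for an RA/Dec box (hypothetical helper)."""
    # Casting to float/int rejects anything that is not a plain number.
    ra_min, ra_max = float(ra_min), float(ra_max)
    dec_min, dec_max = float(dec_min), float(dec_max)
    limit = int(limit)
    if not 0.0 <= ra_min < ra_max <= 360.0:
        raise ValueError("need 0 <= ra_min < ra_max <= 360")
    if not -90.0 <= dec_min < dec_max <= 90.0:
        raise ValueError("need -90 <= dec_min < dec_max <= 90")
    return f"""
SELECT
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z
FROM
    sdss_dr16.x1p5__specobj__ls_dr9__tractor AS X
JOIN
    sdss_dr16.specobj AS S ON X.id1 = S.specobjid
WHERE
    X.ra1 BETWEEN {ra_min} AND {ra_max}
    AND X.dec1 BETWEEN {dec_min} AND {dec_max}
LIMIT {limit}
"""


query_text = build_box_query(110, 200, 7.0, 40.0)
print(query_text)
```

The returned string can be passed to `qc.query(sql=...)` in the same way as `query_single`.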



### Execute the single JOIN statement query and print results

In [5]:
%%time
df_single = qc.query(sql=query_single,fmt='pandas')
df_single

CPU times: user 42.8 ms, sys: 8.39 ms, total: 51.2 ms
Wall time: 230 ms


| | ra_sdss | dec_sdss | z |
|---|---|---|---|
| 0 | 110.00027 | 39.699210 | 0.516598 |
| 1 | 110.00065 | 39.613622 | 0.053511 |
| 2 | 110.00140 | 30.987794 | -0.000062 |
| 3 | 110.00143 | 37.534445 | -0.000253 |
| 4 | 110.00192 | 39.751928 | 0.000305 |
| ... | ... | ... | ... |
| 9995 | 113.28545 | 39.046797 | 0.217406 |
| 9996 | 113.28586 | 33.386140 | 0.731382 |
| 9997 | 113.28597 | 32.749380 | 0.951027 |
| 9998 | 113.28601 | 33.420892 | 5.299447 |


## Example 2: A Double JOIN
Now we will examine both the spectroscopic redshifts from SDSS DR16 and the photometry from LS DR9 by writing a query with two JOIN statements. The columns we want to select from our three tables of interest for this example are:

1. "X" = `sdss_dr16.x1p5__specobj__ls_dr9__tractor`
    - **ra1** (RA of SDSS object)
    - **dec1** (Dec of SDSS object)
2. "S" = `sdss_dr16.specobj`
    - **z** (redshift)
3. "L" = `ls_dr9.tractor`
    - **mag_g** (converted g magnitude)
    - **mag_r** (converted r magnitude)

### Write the double JOIN statement query
In this example we use two JOIN statements: the first will search in the SDSS DR16 `specobj` table for rows that have the same SDSS id value (`specobjid`) as in the pre-crossmatched table (`id1`) and retrieve the desired columns from the SDSS DR16 `specobj` table. The second will search in the LS DR9 `tractor` table for rows that have the same LS id value (`ls_id`) as in the pre-crossmatched table (`id2`) and retrieve the desired columns from the LS DR9 `tractor` table within the specified RA and Dec region.

In [6]:
query_double = ("""
SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z,
    L.mag_g, L.mag_r
FROM
    sdss_dr16.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr16.specobj AS S ON X.id1 = S.specobjid 
JOIN
    ls_dr9.tractor AS L ON X.id2 = L.ls_id
WHERE
    X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
LIMIT 10000
"""
) %(110,200,7.,40.)  #large region
print(query_double) # print the query statement to screen


SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z,
    L.mag_g, L.mag_r
FROM
    sdss_dr16.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr16.specobj AS S ON X.id1 = S.specobjid 
JOIN
    ls_dr9.tractor AS L ON X.id2 = L.ls_id
WHERE
    X.ra1 BETWEEN 110 and 200 and X.dec1 BETWEEN 7.0 and 40.0
LIMIT 10000



### Execute the double JOIN statement query and print results

In [7]:
%%time
df_double = qc.query(sql=query_double,fmt='pandas')
df_double

CPU times: user 37.3 ms, sys: 7.13 ms, total: 44.5 ms
Wall time: 949 ms


| | ra_sdss | dec_sdss | z | mag_g | mag_r |
|---|---|---|---|---|---|
| 0 | 124.93368 | 39.862601 | 0.174130 | 18.598770 | 17.707087 |
| 1 | 124.91581 | 39.849342 | 1.784543 | 21.515312 | 21.335560 |
| 2 | 124.88087 | 39.829225 | 0.362639 | 20.689663 | 18.882698 |
| 3 | 124.52069 | 39.986561 | 0.031352 | 18.176840 | 17.582064 |
| 4 | 124.50651 | 39.977437 | 0.000255 | 17.876215 | 17.854273 |
| ... | ... | ... | ... | ... | ... |
| 9995 | 138.39385 | 39.620674 | 0.122326 | 17.721390 | 16.696081 |
| 9996 | 140.39495 | 39.445870 | 0.224668 | 18.444952 | 17.616922 |
| 9997 | 140.98451 | 39.800896 | 0.095002 | 17.533873 | 16.555084 |
| 9998 | 140.45823 | 39.444055 | 0.589554 | 21.350521 | 20.022564 |
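
The double-JOIN pattern can be tried without a Data Lab connection using Python's built-in `sqlite3` module. In the sketch below, the three in-memory tables (`xmatch`, `specobj`, `tractor`) are simplified stand-ins for the real tables, and every row in them is invented; only the shape of the query mirrors the one above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE xmatch  (id1 INTEGER, ra1 REAL, dec1 REAL, id2 INTEGER);
CREATE TABLE specobj (specobjid INTEGER, z REAL);
CREATE TABLE tractor (ls_id INTEGER, mag_g REAL, mag_r REAL);

-- Third xmatch row lies outside the RA/Dec box used below.
INSERT INTO xmatch  VALUES (1, 124.93, 39.86, 101),
                           (2, 124.91, 39.85, 102),
                           (3, 300.00, 10.00, 103);
-- specobjid 4 has no entry in xmatch, so the inner JOIN drops it.
INSERT INTO specobj VALUES (1, 0.174), (2, 1.785), (3, 0.050), (4, 0.900);
INSERT INTO tractor VALUES (101, 18.60, 17.71),
                           (102, 21.52, 21.34),
                           (103, 19.00, 18.50);
""")

rows = con.execute("""
SELECT X.ra1 AS ra_sdss, X.dec1 AS dec_sdss, S.z, L.mag_g, L.mag_r
FROM xmatch AS X
JOIN specobj AS S ON X.id1 = S.specobjid
JOIN tractor AS L ON X.id2 = L.ls_id
WHERE X.ra1 BETWEEN ? AND ? AND X.dec1 BETWEEN ? AND ?
ORDER BY X.id1
""", (110, 200, 7.0, 40.0)).fetchall()

for row in rows:
    print(row)
con.close()
```

The crossmatch table acts as the bridge: `id1` links each row to the spectroscopic table and `id2` links the same row to the photometric table, so each returned row combines columns from all three.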


<a class="anchor" id="savetovospace"></a>
# Saving results to VOSpace
VOSpace is a convenient storage space for users to save their work. It can store any data or file type. We can save the results from a query to our virtual storage space. First, we write a basic query extracting up to 10,000 rows of the `specobjid`, `ra`, and `dec` columns from the SDSS DR16 `specobj` table:

In [8]:
basic_query = "SELECT specobjid, ra, dec FROM sdss_dr16.specobj LIMIT 10000"
print(basic_query)

SELECT specobjid, ra, dec FROM sdss_dr16.specobj LIMIT 10000


##### Submit the query, format the output as a CSV, and save it to VOSpace:

In [9]:
%%time
response = qc.query(sql=basic_query,fmt='csv',out='vos://basic_result.csv')

CPU times: user 22.4 ms, sys: 1.84 ms, total: 24.2 ms
Wall time: 2.04 s


##### Let's ensure the file was saved in VOSpace:

In [10]:
sc.ls(name='vos://basic_result.csv')

'basic_result.csv'

##### We will then remove the file from VOSpace:

In [11]:
sc.rm(name='vos://basic_result.csv')

'OK'

##### And ensure it was removed:

In [12]:
sc.ls(name='vos://basic_result.csv')

'A Node does not exist with the requested URI.'
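
The two `sc.ls` responses shown above (the file name when the file exists, an error message when it does not) can be turned into a simple existence check. The sketch below is an assumption-laden illustration: `vos_file_exists` and `FakeStoreClient` are hypothetical names, the fake client only imitates the two responses seen in this notebook, and the real service may return other strings in other situations.

```python
MISSING_MESSAGE = "A Node does not exist with the requested URI."


def vos_file_exists(store_client, uri):
    """Return True if listing `uri` returns the file's own name."""
    basename = uri.rsplit("/", 1)[-1]
    listing = store_client.ls(name=uri)
    return listing.strip() == basename


class FakeStoreClient:
    """Offline stand-in that imitates the two `ls` responses shown above."""

    def __init__(self, files=()):
        self.files = set(files)

    def ls(self, name):
        basename = name.rsplit("/", 1)[-1]
        return basename if basename in self.files else MISSING_MESSAGE

    def rm(self, name):
        self.files.discard(name.rsplit("/", 1)[-1])
        return "OK"


fake = FakeStoreClient(files=["basic_result.csv"])
print(vos_file_exists(fake, "vos://basic_result.csv"))  # file is present
fake.rm(name="vos://basic_result.csv")
print(vos_file_exists(fake, "vos://basic_result.csv"))  # file was removed
```

In a live session the same function could be called with the real `sc` module in place of the fake client, assuming `sc.ls` behaves as in the cells above.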

<a class="anchor" id="savetomydb"></a>
# Saving results to MyDB
MyDB is a useful remote per-user relational database that can store data tables. Furthermore, the results of queries can be directly saved to MyDB, as we show in the following example:

In [13]:
%%time
response = qc.query(sql=basic_query, fmt='csv', out='mydb://basic_result', drop=True)

CPU times: user 24.6 ms, sys: 1.4 ms, total: 26 ms
Wall time: 295 ms


##### Ensure the table has been saved to MyDB by calling the `mydb_list()` function, which will list all tables currently in a user's MyDB:

In [14]:
print(qc.mydb_list(),"\n")

basic_result,created:2022-07-13 13:01:12 MST
bgsfaint_dlnotebook,created:2021-08-03 16:40:27 MST
desi_tile,created:2021-08-10 14:31:49 MST
df_xmatch,created:2021-12-27 14:04:08 MST
fastspec_everest_z_lt_0p6,created:2021-09-14 12:41:40 MST
gaia_sample,created:2021-11-12 12:34:07 MST
gaia_sample_xmatch,created:2021-11-12 12:34:08 MST
gals,created:2021-12-27 14:05:56 MST
lowmassagn_dlnotebook,created:2021-08-03 16:40:22 MST
secondary_dark_subset,created:2021-08-11 12:08:30 MST
sv1targets_bright_secondary,created:2021-08-10 14:08:01 MST
sv1targets_dark_secondary,created:2021-08-10 14:41:39 MST
test1,created:2021-11-22 17:32:45 MST
twomasspsc,created:2021-11-23 11:47:37 MST
usno_objects,created:2021-11-22 12:35:32 MST
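
Because `qc.mydb_list()` returns plain text, checking for one particular table by eye gets tedious as the list grows. The sketch below parses text in the `name,created:timestamp` layout printed above into a dictionary; `parse_mydb_list` is a hypothetical helper, and it assumes the layout stays exactly as shown in this output.

```python
def parse_mydb_list(listing):
    """Map table name -> creation timestamp from `name,created:...` lines."""
    tables = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines at the end of the listing
        name, _, created = line.partition(",created:")
        tables[name] = created
    return tables


# Two lines copied from the listing above stand in for qc.mydb_list().
sample_listing = (
    "basic_result,created:2022-07-13 13:01:12 MST\n"
    "df_xmatch,created:2021-12-27 14:04:08 MST\n"
)

tables = parse_mydb_list(sample_listing)
print("basic_result" in tables)
print(tables["basic_result"])
```

In a live session, `parse_mydb_list(qc.mydb_list())` would give the same kind of dictionary for the full listing.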
 



<a class="anchor" id="usertable"></a>
# Example: a User Table and a pre-crossmatched Table
Suppose a user's own table has only ID, RA, and Dec columns, and they crossmatch it against `gaia_dr3.gaia_source`. Because the pre-crossmatched table `gaia_dr3.x1p5__gaia_source__ls_dr9__tractor` is keyed on the `gaia_dr3.gaia_source` ID, they can then look up the matching `ls_dr9.tractor` ID, RA, and Dec columns with a simple JOIN instead of running a second positional crossmatch. We will use the `basic_result` table we stored in MyDB in the previous section as our "user-provided" table.

##### Write the nearest-neighbor crossmatch query

In [15]:
query_xmatch = """
SELECT
    b.specobjid AS sdss_id, gg.source_id AS gaia_id, 
    (q3c_dist(b.ra,b.dec,gg.ra,gg.dec)*3600.0) AS dist_arcsec 
FROM
    mydb://basic_result AS b
LEFT JOIN LATERAL (
    SELECT g.* 
    FROM 
        gaia_dr3.gaia_source AS g
    WHERE
        q3c_join(b.ra, b.dec, g.ra, g.dec, 0.01)
    ORDER BY
        q3c_dist(b.ra,b.dec,g.ra,g.dec)
    ASC LIMIT 1
) AS gg ON true
"""
print(query_xmatch)


SELECT
    b.specobjid AS sdss_id, gg.source_id AS gaia_id, 
    (q3c_dist(b.ra,b.dec,gg.ra,gg.dec)*3600.0) AS dist_arcsec 
FROM
    mydb://basic_result AS b
LEFT JOIN LATERAL (
    SELECT g.* 
    FROM 
        gaia_dr3.gaia_source AS g
    WHERE
        q3c_join(b.ra, b.dec, g.ra, g.dec, 0.01)
    ORDER BY
        q3c_dist(b.ra,b.dec,g.ra,g.dec)
    ASC LIMIT 1
) AS gg ON true



##### Submit the crossmatch query and output to MyDB

In [16]:
%%time
df_xmatch = qc.query(sql=query_xmatch,out="mydb://df_xmatch",drop=True) # set drop=True to remove an already existing table from MyDB with the same name

CPU times: user 24.1 ms, sys: 3.19 ms, total: 27.3 ms
Wall time: 1.12 s


##### We can print the table by writing a query to MyDB

In [17]:
q = "SELECT * FROM mydb://df_xmatch"
df_result = qc.query(sql=q,fmt='pandas') # named df_result rather than `re` to avoid clashing with Python's regex module name
df_result

| | sdss_id | gaia_id | dist_arcsec |
|---|---|---|---|
| 0 | 407606586140289024 | 1.417758e+18 | 0.059682 |
| 1 | 407739077291436032 | 1.417761e+18 | 0.109204 |
| 2 | 407738802413529088 | 1.417774e+18 | 0.038132 |
| 3 | 405488373919148032 | 1.417965e+18 | 0.112825 |
| 4 | 407728631930972160 | 1.417941e+18 | 0.166325 |
| ... | ... | ... | ... |
| 9995 | 7114652879051575296 | 1.407530e+18 | 30.430855 |
| 9996 | -8832658246329937920 | 1.407531e+18 | 33.463994 |
| 9997 | 7114652329295761408 | 1.407532e+18 | 35.740673 |
| 9998 | -8832657971452030976 | 1.407530e+18 | 15.182376 |
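
Note that `gaia_id` is displayed above in floating-point notation (`1.417758e+18`), most likely because rows without a Gaia match carry a missing value and pandas then stores the whole column as 64-bit floats. A 64-bit float cannot represent every integer of this size exactly, so an ID read back from such a DataFrame column may differ from the true ID. The short check below illustrates this with one Gaia `source_id` of the same magnitude; it is a sketch of the general pitfall, not a statement about how Data Lab stores the table.

```python
# A 19-digit Gaia source_id, stored as an exact Python integer.
gaia_id = 1417757713589461120

# Round-trip through a 64-bit float, as happens in a float64 DataFrame column.
as_float = float(gaia_id)
round_tripped = int(as_float)

print(round_tripped == gaia_id)      # False: the ID is not exactly representable
print(abs(round_tripped - gaia_id))  # size of the rounding error
```

For this reason it is safer to join on the IDs inside the database, as the next query does with the MyDB copy of the table, than to join on a float-typed DataFrame column.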


##### We can now use the gaia_id column from our `df_xmatch` table above to get `ls_dr9.tractor` object information using the `gaia_dr3.x1p5__gaia_source__ls_dr9__tractor` pre-crossmatched table

In [18]:
%%time
query_1 = """
SELECT
    id2 AS ls_id, ra2 AS ls_ra, dec2 AS ls_dec,
    df_xmatch.gaia_id, df_xmatch.sdss_id
FROM
    gaia_dr3.x1p5__gaia_source__ls_dr9__tractor AS gxl
JOIN
    mydb://df_xmatch AS df_xmatch ON gxl.id1 = df_xmatch.gaia_id
"""
print(query_1)


SELECT
    id2 AS ls_id, ra2 AS ls_ra, dec2 AS ls_dec,
    df_xmatch.gaia_id, df_xmatch.sdss_id
FROM
    gaia_dr3.x1p5__gaia_source__ls_dr9__tractor AS gxl
JOIN
    mydb://df_xmatch AS df_xmatch ON gxl.id1 = df_xmatch.gaia_id

CPU times: user 117 µs, sys: 0 ns, total: 117 µs
Wall time: 108 µs


##### Submit the query and print the resulting table

In [19]:
%%time
df = qc.query(sql=query_1,fmt='pandas')
df

CPU times: user 49.2 ms, sys: 8.81 ms, total: 58 ms
Wall time: 494 ms


| | ls_id | ls_ra | ls_dec | gaia_id | sdss_id |
|---|---|---|---|---|---|
| 0 | 9907738631015714 | 265.585571 | 54.518153 | 1417757713589461120 | 407606586140289024 |
| 1 | 9907738631015095 | 265.533827 | 54.562355 | 1417760943404869760 | 407739077291436032 |
| 2 | 9907738685800625 | 265.479042 | 54.688460 | 1417773759587719552 | 407738802413529088 |
| 3 | 9907738685737398 | 265.266888 | 54.727856 | 1417965383848150784 | 405488373919148032 |
| 4 | 9907738685735446 | 265.085212 | 54.658224 | 1417941228952064128 | 407728631930972160 |
| ... | ... | ... | ... | ... | ... |
| 9374 | 9907736639441127 | 253.623459 | 46.181360 | 1407529571095903360 | 7114652879051575296 |
| 9375 | 9907736639440624 | 253.581190 | 46.196592 | 1407530503104597888 | -8832658246329937920 |
| 9376 | 9907736639441313 | 253.636990 | 46.245412 | 1407532182436417280 | 7114652329295761408 |
| 9377 | 9907736639440201 | 253.543293 | 46.217305 | 1407530430089747200 | -8832657971452030976 |


<a class="anchor" id="refs"></a>
# Resources & references

W3Schools: SQL Joins https://www.w3schools.com/sql/sql_join.asp  