In [1]:
__author__ = 'Alice Jacques <alice.jacques@noirlab.edu>, Astro Data Lab Team <datalab@noirlab.edu>' 
__version__ = '20240607' #yyyymmdd 
__datasets__ = ['ls_dr9', 'sdss_dr17', 'gaia_dr3', 'des_dr2'] 
__keywords__ = ['crossmatch', 'joint query', 'mydb', 'vospace']

# Examples using the pre-crossmatched tables at Astro Data Lab

by Alice Jacques and the Astro Data Lab Team

### Table of contents
* [Goals](#goals)
* [Disclaimer & attribution](#attribution)
* [Imports and setup](#import)
* [Authentication](#auth)
* [Writing a query with a JOIN statement](#joinquery)
* [Examples of using a JOIN statement in a query with LS and SDSS catalogs](#lssdss)
* [Saving results to VOSpace](#savetovospace)
* [Saving results to MyDB](#savetomydb)
* [Example: a User Table and a pre-crossmatched Table](#usertable)
* [Resources & references](#refs)

<a class="anchor" id="goals"></a>
# Goals

* Learn how to write a query with a JOIN statement to retrieve information from a pre-crossmatched table and another Data Lab table
* Use a user-provided table to crossmatch a Data Lab table

For a more in-depth explanation of the pre-crossmatched tables hosted at Astro Data Lab, see our [How-to use pre-crossmatched tables notebook](https://github.com/astro-datalab/notebooks-latest/blob/master/04_HowTos/CrossmatchTables/How_to_use_pre_crossmatched_tables.ipynb). 

<a class="anchor" id="attribution"></a>
# Disclaimer & attribution

Disclaimers
-----------
Note that using the Astro Data Lab constitutes your agreement with our minimal [Disclaimers](https://datalab.noirlab.edu/disclaimers.php).

Acknowledgments
---------------
If you use **Astro Data Lab** in your published research, please include the following text in your paper's Acknowledgments section:

_This research uses services or data provided by the Astro Data Lab, which is part of the Community Science and Data Center (CSDC) Program of NSF NOIRLab. NOIRLab is operated by the Association of Universities for Research in Astronomy (AURA), Inc. under a cooperative agreement with the U.S. National Science Foundation._

If you use **SPARCL jointly with the Astro Data Lab platform** (via JupyterLab, command-line, or web interface) in your published research, please include the text below in your paper's Acknowledgments section:

_This research uses services or data provided by the SPectra Analysis and Retrievable Catalog Lab (SPARCL) and the Astro Data Lab, which are both part of the Community Science and Data Center (CSDC) Program of NSF NOIRLab. NOIRLab is operated by the Association of Universities for Research in Astronomy (AURA), Inc. under a cooperative agreement with the U.S. National Science Foundation._

In either case **please cite the following papers**:

* Data Lab concept paper: Fitzpatrick et al., "The NOAO Data Laboratory: a conceptual overview", SPIE, 9149, 2014, https://doi.org/10.1117/12.2057445

* Astro Data Lab overview: Nikutta et al., "Data Lab - A Community Science Platform", Astronomy and Computing, 33, 2020, https://doi.org/10.1016/j.ascom.2020.100411

If you are referring to the Data Lab JupyterLab / Jupyter Notebooks, cite:

* Juneau et al., "Jupyter-Enabled Astrophysical Analysis Using Data-Proximate Computing Platforms", CiSE, 23, 15, 2021, https://doi.org/10.1109/MCSE.2021.3057097

If publishing in an AAS journal, also add the keyword: `\facility{Astro Data Lab}`

And if you are using SPARCL, please also add `\software{SPARCL}` and cite:

* Juneau et al., "SPARCL: SPectra Analysis and Retrievable Catalog Lab", Conference Proceedings for ADASS XXXIII, 2024
https://doi.org/10.48550/arXiv.2401.05576

The NOIRLab Library maintains [lists of proper acknowledgments](https://noirlab.edu/science/about/scientific-acknowledgments) to use when publishing papers using the Lab's facilities, data, or services.

<a class="anchor" id="import"></a>
# Imports and setup

In [2]:
# std lib
from getpass import getpass

# 3rd party
from astropy.utils.data import download_file  #import file from URL
from matplotlib.ticker import NullFormatter
import pylab as plt
import matplotlib
%matplotlib inline

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert # converts table to Pandas dataframe object

<a class="anchor" id="auth"></a>
# Authentication
Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capabilities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).

If you need to log in to Data Lab, un-comment the cell below and execute it:

In [3]:
#token = ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()

'demo00'

<a class="anchor" id="joinquery"></a>
# Writing a query with a JOIN statement
To extract only the columns relevant to our science question from multiple data tables, we can write a query that uses a JOIN statement. There are four main types of JOIN statement, and which one to choose depends on which rows we want returned.
1. **(INNER) JOIN**: Returns rows that have matching values in both tables
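2. **LEFT (OUTER) JOIN**: Returns all rows from the left table, plus the matched rows from the right table (right-table columns are NULL where there is no match)
3. **RIGHT (OUTER) JOIN**: Returns all rows from the right table, plus the matched rows from the left table (left-table columns are NULL where there is no match)
4. **FULL (OUTER) JOIN**: Returns all rows from both tables, paired where they match (columns from the other table are NULL where there is no match)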

Take a moment to look over the figure below outlining the various JOIN statement types.  
NOTE: the default JOIN is an `INNER JOIN`.
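
The difference between an `INNER JOIN` and a `LEFT JOIN` can be tried out locally before querying Data Lab. The following is a minimal sketch using Python's built-in `sqlite3` module rather than the Data Lab database; the tables `x` and `s` and their values are made up for illustration.

```python
import sqlite3

# Two tiny, made-up tables: a crossmatch-like table x and a catalog s.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE x (id1 INTEGER, ra1 REAL);
CREATE TABLE s (specobjid INTEGER, z REAL);
INSERT INTO x VALUES (1, 110.0), (2, 111.0), (3, 112.0);
INSERT INTO s VALUES (2, 0.05), (3, 0.52), (4, 1.70);
""")

# INNER JOIN (the default JOIN): only ids present in both tables.
inner_rows = conn.execute(
    "SELECT x.id1, s.z FROM x JOIN s ON x.id1 = s.specobjid ORDER BY x.id1"
).fetchall()

# LEFT JOIN: every row of x; z is NULL (None in Python) where s has no match.
left_rows = conn.execute(
    "SELECT x.id1, s.z FROM x LEFT JOIN s ON x.id1 = s.specobjid ORDER BY x.id1"
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

A `RIGHT JOIN` gives the same rows as a `LEFT JOIN` with the two tables swapped, so it is not shown separately here.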

<img src='join.png'></img>

### `JOIN LATERAL`
In nearest neighbor crossmatch queries, we use `JOIN LATERAL`, which is like a SQL foreach loop that will iterate over each row in a result set and evaluate a subquery using that row as a parameter.
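
To make the "foreach" analogy concrete, here is a plain-Python sketch of what a nearest-neighbor `LEFT JOIN LATERAL` does conceptually. It is not how the database executes the query (no spatial index is used), and the helper names `angular_distance_deg` and `nearest_neighbors` as well as the sample coordinates are hypothetical.

```python
import math

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def nearest_neighbors(left_rows, right_rows, radius_deg):
    """For each left row, return (left_id, right_id or None, distance or None)."""
    matches = []
    for left_id, ra, dec in left_rows:               # the "foreach" over the outer table
        candidates = [
            (angular_distance_deg(ra, dec, r_ra, r_dec), right_id)
            for right_id, r_ra, r_dec in right_rows
        ]
        within = [c for c in candidates if c[0] <= radius_deg]  # WHERE: inside the radius
        if within:
            dist, right_id = min(within)             # ORDER BY distance ASC LIMIT 1
            matches.append((left_id, right_id, dist))
        else:
            matches.append((left_id, None, None))    # LEFT JOIN keeps unmatched rows
    return matches

# Made-up (id, ra, dec) rows in degrees
left = [("a", 10.0, 0.0), ("b", 50.0, 20.0)]
right = [(101, 10.001, 0.0), (102, 10.005, 0.0), (103, 200.0, -30.0)]
result = nearest_neighbors(left, right, radius_deg=0.01)
print(result)
```

Row `"a"` has two candidates inside the radius and keeps only the closer one, while row `"b"` has none and is returned with empty match columns, as a `LEFT JOIN` would.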

<a class="anchor" id="lssdss"></a>
# Examples of using a JOIN statement in a query with LS and SDSS catalogs
## Example 1: A Single JOIN
First we will examine the spectroscopic redshift of objects that are found in both the SDSS DR17 catalog and the LS DR9 catalog by writing a query with a single JOIN statement between their pre-crossmatched table and the SDSS DR17 table. The two crossmatch tables related to these two catalogs are:

`ls_dr9.x1p5__tractor__sdss_dr17__specobj`  
`sdss_dr17.x1p5__specobj__ls_dr9__tractor`

The choice of which of these two crossmatch tables to use should be based on the science question being posed. For instance, the question *'how does a galaxy's structure change with redshift?'* depends on the redshift values obtained from SDSS DR17, so we should use the crossmatch table that has SDSS DR17 as the first table. The relevant information we want to select from our two tables of interest for this example is:

1. "X" = `sdss_dr17.x1p5__specobj__ls_dr9__tractor`
    - **ra1** (RA of SDSS object)
    - **dec1** (Dec of SDSS object)
2. "S" = `sdss_dr17.specobj`
    - **z** (redshift)

### Write the single JOIN statement query
Now that we know what we want and where we want it from, let's write the query and then print the results on screen. Here we use one JOIN statement: it will search in the SDSS DR17 `specobj` table for rows that have the same SDSS id value (`specobjid`) as in the pre-crossmatched table (`id1`) and retrieve the desired columns from the SDSS DR17 `specobj` table within the specified RA and Dec region.

In [4]:
query_single = ("""
SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z
FROM
    sdss_dr17.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr17.specobj AS S ON X.id1 = S.specobjid 
WHERE
    X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
LIMIT 10000
"""
) %(110,200,7.,40.)  #large region
print(query_single) # print the query statement to screen


SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z
FROM
    sdss_dr17.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr17.specobj AS S ON X.id1 = S.specobjid 
WHERE
    X.ra1 BETWEEN 110 and 200 and X.dec1 BETWEEN 7.0 and 40.0
LIMIT 10000
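
The `%s` substitution used above depends on the order of the values in the tuple. As an alternative, the sketch below fills the same query through named placeholders with `str.format`. The helper `build_box_query` and the names `QUERY_TEMPLATE` and `query_box` are hypothetical and not part of the Data Lab client.

```python
QUERY_TEMPLATE = """
SELECT
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z
FROM
    sdss_dr17.x1p5__specobj__ls_dr9__tractor AS X
JOIN
    sdss_dr17.specobj AS S ON X.id1 = S.specobjid
WHERE
    X.ra1 BETWEEN {ra_min} AND {ra_max} AND X.dec1 BETWEEN {dec_min} AND {dec_max}
LIMIT {limit}
"""

def build_box_query(ra_min, ra_max, dec_min, dec_max, limit=10000):
    """Fill the template after checking the box limits (hypothetical helper)."""
    if not (ra_min < ra_max and dec_min < dec_max):
        raise ValueError("each minimum must be smaller than its maximum")
    # The float()/int() casts ensure only numbers end up in the SQL text.
    return QUERY_TEMPLATE.format(
        ra_min=float(ra_min), ra_max=float(ra_max),
        dec_min=float(dec_min), dec_max=float(dec_max),
        limit=int(limit),
    )

query_box = build_box_query(110, 200, 7, 40)
print(query_box)
```

The resulting string can be passed to `qc.query(sql=...)` in the same way as `query_single`.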



### Execute the single JOIN statement query and print results

In [5]:
df_single = qc.query(sql=query_single,fmt='pandas')
df_single

Unnamed: 0,ra_sdss,dec_sdss,z
0,110.00027,39.699210,0.516598
1,110.00065,39.613622,0.053511
2,110.00140,30.987794,-0.000062
3,110.00143,37.534445,-0.000253
4,110.00192,39.751928,0.000305
...,...,...,...
9995,113.28194,25.709454,0.000075
9996,113.28196,39.540578,-0.000212
9997,113.28202,31.683362,0.377256
9998,113.28211,26.552400,1.718549


## Example 2: A Double JOIN
Now we will examine both the spectroscopic redshifts from SDSS DR17 and the photometry from LS DR9 by writing a query with two JOIN statements. The relevant information we want to select from our three tables of interest for this example is:

1. "X" = `sdss_dr17.x1p5__specobj__ls_dr9__tractor`
    - **ra1** (RA of SDSS object)
    - **dec1** (Dec of SDSS object)
2. "S" = `sdss_dr17.specobj`
    - **z** (redshift)
3. "L" = `ls_dr9.tractor`
    - **mag_g** (converted g magnitude)
    - **mag_r** (converted r magnitude)

### Write the double JOIN statement query
In this example we use two JOIN statements: the first will search in the SDSS DR17 `specobj` table for rows that have the same SDSS id value (`specobjid`) as in the pre-crossmatched table (`id1`) and retrieve the desired columns from the SDSS DR17 `specobj` table. The second will search in the LS DR9 `tractor` table for rows that have the same LS id value (`ls_id`) as in the pre-crossmatched table (`id2`) and retrieve the desired columns from the LS DR9 `tractor` table within the specified RA and Dec region.

In [6]:
query_double = ("""
SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z,
    L.mag_g, L.mag_r
FROM
    sdss_dr17.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr17.specobj AS S ON X.id1 = S.specobjid 
JOIN
    ls_dr9.tractor AS L ON X.id2 = L.ls_id
WHERE
    X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
LIMIT 10000
"""
) %(110,200,7.,40.)  #large region
print(query_double) # print the query statement to screen


SELECT 
    X.ra1 AS ra_sdss, X.dec1 AS dec_sdss,
    S.z,
    L.mag_g, L.mag_r
FROM
    sdss_dr17.x1p5__specobj__ls_dr9__tractor AS X 
JOIN
    sdss_dr17.specobj AS S ON X.id1 = S.specobjid 
JOIN
    ls_dr9.tractor AS L ON X.id2 = L.ls_id
WHERE
    X.ra1 BETWEEN 110 and 200 and X.dec1 BETWEEN 7.0 and 40.0
LIMIT 10000



### Execute the double JOIN statement query and print results

In [7]:
df_double = qc.query(sql=query_double,fmt='pandas')
df_double

Unnamed: 0,ra_sdss,dec_sdss,z,mag_g,mag_r
0,123.12650,39.993317,0.124905,18.130915,16.995148
1,123.36125,39.996015,-0.000117,21.921595,20.344702
2,123.23940,39.990355,0.067814,17.442272,16.721584
3,123.18048,39.935782,0.067945,18.585884,17.680323
4,123.25379,39.959822,0.000056,22.013874,20.268965
...,...,...,...,...,...
9995,139.99228,39.700074,0.570250,23.129639,21.435806
9996,139.88228,39.660420,0.458871,20.758060,20.142900
9997,140.03065,39.671779,1.867199,21.640137,21.532164
9998,140.12081,39.656766,0.092700,15.941207,14.906138
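
With redshift and photometry in one table, a natural next step is to combine them, for example by computing a g-r color per object. The sketch below does this in plain Python on the first three rows of the output above, so that it runs without a Data Lab connection. The redshift cut `Z_MIN` is a hypothetical value, chosen only to set aside objects with z close to zero (which are likely stars).

```python
# First three rows of the double-JOIN output shown above: (z, mag_g, mag_r)
rows = [
    (0.124905, 18.130915, 16.995148),
    (-0.000117, 21.921595, 20.344702),
    (0.067814, 17.442272, 16.721584),
]

Z_MIN = 0.01  # hypothetical cut; objects below it are skipped

colors = [
    {"z": z, "g_r": round(mag_g - mag_r, 6)}   # g-r color in magnitudes
    for z, mag_g, mag_r in rows
    if z > Z_MIN
]
print(colors)
```

On the full DataFrame, the same subtraction can be written as `df_double['mag_g'] - df_double['mag_r']`.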


<a class="anchor" id="savetovospace"></a>
# Saving results to VOSpace
VOSpace is a convenient storage space for users to save their work. It can store any data or file type. We can save the results from a query to our virtual storage space. First, we write a basic query that extracts up to 10,000 rows of the `specobjid`, `ra`, and `dec` columns from the SDSS DR17 `specobj` table:

In [8]:
basic_query = "SELECT specobjid, ra, dec FROM sdss_dr17.specobj LIMIT 10000"
print(basic_query)

SELECT specobjid, ra, dec FROM sdss_dr17.specobj LIMIT 10000


##### Submit the query, format the output as a CSV, and save it to VOSpace:

In [9]:
response = qc.query(sql=basic_query,fmt='csv',out='vos://basic_result.csv')

##### Let's ensure the file was saved in VOSpace:

In [10]:
sc.ls(name='vos://basic_result.csv')

'basic_result.csv'

##### We will then remove the file from VOSpace:

In [11]:
sc.rm(name='vos://basic_result.csv')

'OK'

##### And ensure it was removed:

In [12]:
sc.ls(name='vos://basic_result.csv')

'Error 404: "vos://basic_result.csv" NOT FOUND'

<a class="anchor" id="savetomydb"></a>
# Saving results to MyDB
MyDB is a useful remote per-user relational database that can store data tables. Furthermore, the results of queries can be directly saved to MyDB, as we show in the following example:

In [13]:
response = qc.query(sql=basic_query, fmt='csv', out='mydb://basic_result', drop=True)

##### Ensure the table has been saved to MyDB by calling the `mydb_list()` function, which will list all tables currently in a user's MyDB:

In [14]:
print(qc.mydb_list(),"\n")

aaa
aab
basic_result
bgsfaint_dlnotebook
cmtestoutput
cmxmatchtest
desi_tile
df_xmatch
fastspec_everest_z_lt_0p6
fiberassign
gaia_rc
gaia_sample
gaia_sample_xmatch
gals
lowmassagn_dlnotebook
secondary_dark_subset
sv1targets_bright_secondary
sv1targets_dark_secondary
tbl_stat
testcm
testingx
testresult2
testx2
testxmatchqueryout
tile
twomass_gaia1
twomass_pt1
twomasspsc
usno_objects
xmatchasyncout
xmatchasyncout2
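
Scanning a long listing by eye is error-prone, so the check can also be done in code. The sketch below assumes that `qc.mydb_list()` returns the table names as one newline-separated string, as the printed output above suggests. The helper `table_in_listing` is hypothetical, and the sample string is a shortened copy of that output.

```python
def table_in_listing(listing, table_name):
    """Return True if table_name appears as a whole line in the listing text."""
    names = [line.strip() for line in listing.splitlines() if line.strip()]
    return table_name in names

# Shortened copy of the listing printed above, used as sample input.
sample_listing = "aaa\naab\nbasic_result\nbgsfaint_dlnotebook\n"

found_exact = table_in_listing(sample_listing, "basic_result")
found_partial = table_in_listing(sample_listing, "basic")
print(found_exact, found_partial)
```

Under that assumption, `table_in_listing(qc.mydb_list(), 'basic_result')` would perform the check against the live listing. Matching whole lines rather than substrings avoids a false positive for partial names such as `basic`.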
 



<a class="anchor" id="usertable"></a>
# Example: a User Table and a pre-crossmatched Table
Suppose a user's own table has only ID, RA, and Dec columns. They first perform a crossmatch between their table and `gaia_dr3.gaia_source`. With the corresponding `gaia_dr3.gaia_source` ID of each match in hand, they can then use the pre-crossmatched table `gaia_dr3.x1p5__gaia_source__ls_dr9__tractor` to get the `ls_dr9.tractor` ID, RA, and Dec columns for free, without running a second positional crossmatch. We will use the `basic_result` table we stored to our MyDB in the previous section as our "user-provided" table.

##### Write the nearest-neighbor crossmatch query

In [15]:
query_xmatch = """
SELECT
    b.specobjid AS sdss_id, gg.source_id AS gaia_id, 
    (q3c_dist(b.ra,b.dec,gg.ra,gg.dec)*3600.0) AS dist_arcsec 
FROM
    mydb://basic_result AS b
LEFT JOIN LATERAL (
    SELECT g.* 
    FROM 
        gaia_dr3.gaia_source AS g
    WHERE
        q3c_join(b.ra, b.dec, g.ra, g.dec, 0.01)
    ORDER BY
        q3c_dist(b.ra,b.dec,g.ra,g.dec)
    ASC LIMIT 1
) AS gg ON true
"""
print(query_xmatch)


SELECT
    b.specobjid AS sdss_id, gg.source_id AS gaia_id, 
    (q3c_dist(b.ra,b.dec,gg.ra,gg.dec)*3600.0) AS dist_arcsec 
FROM
    mydb://basic_result AS b
LEFT JOIN LATERAL (
    SELECT g.* 
    FROM 
        gaia_dr3.gaia_source AS g
    WHERE
        q3c_join(b.ra, b.dec, g.ra, g.dec, 0.01)
    ORDER BY
        q3c_dist(b.ra,b.dec,g.ra,g.dec)
    ASC LIMIT 1
) AS gg ON true



##### Submit the crossmatch query and output to MyDB

In [16]:
df_xmatch = qc.query(sql=query_xmatch,out="mydb://df_xmatch",drop=True) # set drop=True to remove an already existing table from MyDB with the same name

##### We can print the table by writing a query to MyDB

In [17]:
q = "select * from mydb://df_xmatch"
result = qc.query(sql=q,fmt='pandas')  # named `result` to avoid shadowing the standard-library module `re`
result

Unnamed: 0,sdss_id,gaia_id,dist_arcsec
0,2889072702671316992,1.962451e+18,0.122596
1,2877815347283519488,1.962452e+18,0.135446
2,2877824143376541696,1.961010e+18,0.193541
3,2877823318742820864,1.960965e+18,0.133224
4,2889071878037596160,1.960965e+18,11.562321
...,...,...,...
9995,2857534859855816704,2.272961e+18,0.049671
9996,2866543439439226880,2.272962e+18,0.060718
9997,2857536234245351424,2.272973e+18,0.080895
9998,2866556633578760192,2.272977e+18,0.043344
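
Note that the search radius passed to `q3c_join` in the query above is in degrees, while `dist_arcsec` has been converted to arcseconds, so some nearest neighbors (row 4 above, at about 11.6 arcsec) are fairly distant. The sketch below converts the radius and applies a tighter cut to the first five rows of the output. The 1.5 arcsec value is an assumption, chosen on the reading that `x1p5` in the pre-crossmatched table names denotes a 1.5 arcsec matching radius.

```python
ARCSEC_PER_DEG = 3600.0

search_radius_deg = 0.01                                   # radius used in q3c_join above
search_radius_arcsec = search_radius_deg * ARCSEC_PER_DEG  # same radius in arcseconds

# (sdss_id, dist_arcsec) for the first five rows of the output above
matches = [
    (2889072702671316992, 0.122596),
    (2877815347283519488, 0.135446),
    (2877824143376541696, 0.193541),
    (2877823318742820864, 0.133224),
    (2889071878037596160, 11.562321),
]

MAX_SEP_ARCSEC = 1.5  # assumed cut, see the note above

close_matches = [m for m in matches if m[1] <= MAX_SEP_ARCSEC]
print(search_radius_arcsec)
print(len(close_matches), "of", len(matches), "matches within", MAX_SEP_ARCSEC, "arcsec")
```

The same cut could instead be applied in SQL, for example with a `WHERE` clause on `dist_arcsec` when reading the table back from MyDB.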


##### We can now use the gaia_id column from our `df_xmatch` table above to get `ls_dr9.tractor` object information using the `gaia_dr3.x1p5__gaia_source__ls_dr9__tractor` pre-crossmatched table

In [18]:
query_1 = """
SELECT
    id2 AS ls_id, ra2 AS ls_ra, dec2 AS ls_dec,
    df_xmatch.gaia_id, df_xmatch.sdss_id
FROM
    gaia_dr3.x1p5__gaia_source__ls_dr9__tractor AS gxl
JOIN
    mydb://df_xmatch AS df_xmatch ON gxl.id1 = df_xmatch.gaia_id
"""
print(query_1)


SELECT
    id2 AS ls_id, ra2 AS ls_ra, dec2 AS ls_dec,
    df_xmatch.gaia_id, df_xmatch.sdss_id
FROM
    gaia_dr3.x1p5__gaia_source__ls_dr9__tractor AS gxl
JOIN
    mydb://df_xmatch AS df_xmatch ON gxl.id1 = df_xmatch.gaia_id



##### Submit the query and print the resulting table

In [19]:
df = qc.query(sql=query_1,fmt='pandas')
df

Unnamed: 0,ls_id,ls_ra,ls_dec,gaia_id,sdss_id
0,9907737095837650,287.228165,48.064735,2131102058617608064,3384465917919389696
1,9907737159009377,287.448870,48.229697,2131098455144685056,3384466192797296640
2,9907737158950166,287.387517,48.168933,2131097218194075904,3384462344506599424
3,9907737221862367,287.697861,48.382752,2131196930154414592,3384463718896134144
4,9907737221859120,287.547174,48.407548,2131195006009050240,3384465093285668864
...,...,...,...,...,...
840,9907740343599562,276.670899,63.135065,2160347319866203008,2873374430422132736
841,9907740343599573,276.671625,63.208500,2160347899685623680,2873378278712829952
842,9907740343599244,276.634148,63.135310,2160347972701229440,2873379103346550784
843,9907740343535831,276.356517,63.279633,2160352916207452416,2873378828468643840


<a class="anchor" id="refs"></a>
# Resources & references

W3Schools: SQL Joins https://www.w3schools.com/sql/sql_join.asp  