# Directly Querying the ZTF (version 2) database
```Author: Eden Girma, Last updated 20210406```

# Table of contents:
* [Motivation](#motivation)
* [Connecting to ZTF Database](#connecting)
* [Available database tables](#tables)
* [Available class keys](#classes)
* [Example: Querying objects from past 12 hr](#ex1)
* [Exporting to VOTable](#exportVOTable)
* [Future considerations](#future)

## Motivation <a class="anchor" id="motivation"></a>

Since about July 2020, the ALeRCE team has been developing an experimental ALeRCE API, which can be cloned and installed from https://github.com/alercebroker/alerce_client_new. However, _it appears that this new pipeline database is not yet in production mode, which means that it does not contain the latest alert information._ Specifically, the latest object MJDs I've been able to retrieve are at or around 59010, i.e. 2020-06-10. In addition, my own testing revealed that the Python client does not handle heavy querying well: I received repeated 504 timeout errors whenever I tried to use it to make very large queries.

Directly accessing the database circumvents both of these problems. Hence, I suggest working with the ZTF database separately from the in-development ALeRCE client to query the most recent object classifications.

You can clone and install the ZTF API from https://github.com/alercebroker/ztf_api. In this notebook I'll walk through what data is available from the database, and how to create a direct query for a specific timeframe.

In [1]:
import sys

# Packages for direct database access
# %pip install psycopg2
import psycopg2
import json

# Packages for data and number handling
import numpy as np
import pandas as pd
import math

# Packages for calculating current time and extracting ZTF data to VOTable
from astropy.time import Time
from astropy.table import Table
from astropy.io.votable import from_table, writeto
from datetime import datetime

# Packages for display and data plotting, if desired
from IPython.display import HTML
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline


## Connecting to ZTF Database <a class="anchor" id="connecting"></a>

First, we'll use the read-only credentials available in the repository. Note that the database we're accessing is ```ztf_v2```, hosted at ```db.alerce.online```.

In [2]:
# Open and load credentials
credentials_file = "../alercereaduser_v2.json"
with open(credentials_file) as jsonfile:
    params = json.load(jsonfile)["params"]
    
# Open a connection to the database
conn = psycopg2.connect(dbname=params['dbname'], 
                        user=params['user'], 
                        host=params['host'], 
                        password=params['password'])
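
As a minimal sketch, the credential loading above could be wrapped in a small helper that checks the expected keys are present before a connection is attempted. The helper name `load_db_params` and the `REQUIRED_KEYS` tuple are my own (hypothetical) additions; the JSON layout is assumed to be the one used in the cell above, i.e. a top-level `"params"` entry holding `dbname`, `user`, `host` and `password`.

```python
import json

# Keys that the connection cell above reads from the credentials file
REQUIRED_KEYS = ("dbname", "user", "host", "password")


def load_db_params(path):
    """Hypothetical helper: read connection parameters from a credentials JSON file."""
    with open(path) as jsonfile:
        params = json.load(jsonfile)["params"]

    # Fail early, with a readable message, if the file is incomplete
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"credentials file is missing: {missing}")

    # Return only the keys the connection call needs
    return {key: params[key] for key in REQUIRED_KEYS}
```

The result could then be passed on as `psycopg2.connect(**load_db_params("../alercereaduser_v2.json"))`, followed by `conn.close()` once all queries are done.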

## Available database tables <a class="anchor" id="tables"></a>

The following cell shows all of the tables available in the database, which can be used for querying.

In [3]:
# Show all the available tables, sorted by tablename
query = "select tablename from pg_tables where schemaname='public';"
tables = pd.read_sql_query(query, conn)
tables.sort_values(by="tablename")

Unnamed: 0,tablename
9,asassn
18,class
10,crtsnorth
12,crtssouth
4,detections
23,early_classification_v2
16,features
22,features_v2
2,features_v3
24,ingestion_timestamp


The tables used in this older version are somewhat different from the tables available in the new version. Yet there is still some overlap in the most relevant tables. Moving from less to more aggregation, they are:

* `non_detections`: one row per non-detection per object, the limiting magnitudes
* `detections`: one row per detection, light curves and other relevant time-dependent information
* `object` (named `objects` in this database, as in the example query further down): one row per object, basic filter and time-aggregated statistics
* `stamp_classification`: one row per object, the probabilities of every object pertaining to the five classes SNe, AGN, variable stars, asteroids, and bogus alerts. This is the first early classification step (as described in Carrasco-Davis 2020)
* `late_probabilities_v2`: one row per object, the probabilities of every object pertaining to a light curve classifier class (as described in Sánchez-Sáez 2020)
* `features`, `features_v2`, `features_v3`: one row per object per feature, object light curves including their difference and corrected magnitudes and associated errors separated by filter
* `xmatch`: one row per object per external catalog, the table that points to the detailed xmatch tables

Below, we'll look at all of the columns that are available in all of the tables:

In [4]:
# Show all columns available in all tables
alltabs = []

for tab in sorted(tables.tablename):
    # Pass the table name as a query parameter rather than formatting it into the SQL string
    query = "select column_name, data_type from information_schema.columns where table_name = %s;"
    results = pd.read_sql_query(query, conn, params=(tab,))
    results["table"] = tab
    alltabs.append(results)

dftab = pd.concat(alltabs)
# Temporarily raise the row limit so that the full listing is displayed
with pd.option_context("display.max_rows", 999):
    display(dftab[["table", "column_name", "data_type"]])

Unnamed: 0,table,column_name,data_type
0,asassn,ASAS-SN Name,text
1,asassn,Other Names,text
2,asassn,LCID,integer
3,asassn,ra,double precision
4,asassn,dec,double precision
5,asassn,Mean VMag,double precision
6,asassn,Amplitude,double precision
7,asassn,Period,double precision
8,asassn,Type,text
9,asassn,Url,text


## Available class keys <a class="anchor" id="classes"></a>

In this database, classes are represented by integers. The following is a class mapper that indicates the relation:

In [5]:
# Creating classmapper, sorted by ID value in ascending order
classquery = "select * from class"
classes = pd.read_sql_query(classquery, conn).sort_values(by=['id'],ascending=True)

classmapper = dict(zip(classes.name.tolist(), classes.id.tolist()))
classmapper

{'Other': 0,
 'Ceph': 1,
 'DSCT': 2,
 'EB': 3,
 'LPV': 4,
 'RRL': 5,
 'SNe': 6,
 'AGN-I': 7,
 'Blazar': 8,
 'CV/Nova': 9,
 'SNIa': 10,
 'SNIbc': 11,
 'SNII': 12,
 'SNIIn': 13,
 'SLSN': 14,
 'EBSD/D': 15,
 'EBC': 16,
 'Periodic-Other': 17,
 'AGN': 18,
 'SN': 19,
 'Variable Star': 20,
 'Asteroid': 21,
 'Bogus': 22,
 'RS-CVn': 23,
 'QSO-I': 24}

## Example: Querying objects from past 12 hr <a class="anchor" id="ex1"></a>

Let's pull all of the objects whose last detection occurred within the past 12 hr. Our object data will include the mean RA, mean DEC, $\sigma_{RA}$, $\sigma_{DEC}$, number of observations (nobs), first and last dates of detection in MJD, the time span between them (deltajd), and the early classification (**classearly**) along with its probability (**pclassearly**).

In [6]:
# MJD cutoff 12 hr (0.5 day) before now. Time.now() is in UTC, whereas
# datetime.today() returns local time and would shift the cutoff by the UTC offset.
mjd_first = float(Time.now().mjd) - 0.5

query = '''
SELECT
objects.oid, objects.meanra, objects.meandec, objects.sigmara, 
objects.sigmadec, objects.nobs, objects.firstmjd, objects.lastmjd, 
objects.deltajd, objects.classearly, objects.pclassearly

FROM objects

WHERE 
objects.lastmjd > %(mjd_first)s
'''

# Outputs as a pd.DataFrame; the cutoff is passed as a query parameter
detections = pd.read_sql_query(query, conn, params={"mjd_first": mjd_first})

# Prints the DataFrame shape: (number of selected objects, number of selected columns)
print(detections.shape)
detections.head()

(166597, 11)


Unnamed: 0,oid,meanra,meandec,sigmara,sigmadec,nobs,firstmjd,lastmjd,deltajd,classearly,pclassearly
0,ZTF21aasweyp,221.897616,-25.966204,,,1,59310.376076,59310.376076,0.0,20.0,0.585908
1,ZTF21aasweyn,221.646129,-26.015138,,,1,59310.376076,59310.376076,0.0,20.0,0.535293
2,ZTF20aaqfqbh,222.342882,-25.915615,8.4e-05,3e-05,3,58900.542963,59310.376076,409.833113,20.0,0.588102
3,ZTF21aasweyl,222.091631,-25.915851,,,1,59310.376076,59310.376076,0.0,20.0,0.568464
4,ZTF21aanlaag,222.052252,-25.816462,0.000146,0.000115,2,59271.502118,59310.376076,38.873958,20.0,0.573425


## Exporting to VOTable <a class="anchor" id="exportVOTable"></a>

Saving this data as a VOTable requires converting it from its current form (a ```pd.DataFrame```). This is possible with the ```Table``` object from ```astropy.table``` and the functions we imported earlier from ```astropy.io.votable```. Essentially, we'll convert our ```pd.DataFrame``` to an ```astropy.table.Table```, and then to an ```astropy.io.votable.VOTableFile```, which can then be exported.

_A buggy caveat, however_ -- ```astropy.io.votable.VOTableFile``` objects throw an error when they are passed masked/```NaN``` values. I've gotten around this, for now, by replacing the masked values with the _string_ ```"NaN"``` before the ```pd.DataFrame``` is converted to a ```Table```.

In [7]:
# Sorting detections by lastMJD, then firstMJD, in descending order
detections_sorted = detections.sort_values(by=['lastmjd','firstmjd'],ascending=False)

# Filling the masked values with the string 'NaN'
detections_filled = detections_sorted.fillna('NaN')

# Converting filled dataframe to astropy Table, then astropy VOTableFile, then exporting into .xml
full_dt = Table.from_pandas(detections_filled)
votable = from_table(full_dt)
writeto(votable, "ztf_v2_output.xml")

## Future considerations <a class='anchor' id='future'></a>

In terms of refining this querying process:
* We probably do not want to include entries that are classified as 'Bogus'. If so, this filter can easily be added to the SQL query passed to ```read_sql_query```.
* There is a lot of probabilistic data available for each object (whether from the stamp or light curve classifiers) -- would we want to pull all of this from the database to save in our own documentation databases?
* To get a sense of program runtime, we can measure how long it takes to query the past 12 hr vs. the past 24 hr (or whatever time scale we're interested in), to see what timeframe might work best.
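
To illustrate the first point, here is a rough sketch of how the 'Bogus' filter could be added to the query. The function name `build_recent_objects_query` is hypothetical, the column list is shortened for readability, and the class ID 22 for 'Bogus' is taken from the class mapper shown earlier. I'm assuming the `%(name)s` placeholder style that psycopg2 accepts, and I use `IS DISTINCT FROM` so that objects without an early classification (NULL) are kept rather than silently dropped; whether that is the behaviour we want is an open choice.

```python
def build_recent_objects_query(mjd_first, bogus_id=None):
    """Hypothetical helper: build the SQL text and parameters for recent objects.

    If bogus_id is given, objects whose early class equals that ID are excluded.
    """
    query = """
    SELECT oid, meanra, meandec, nobs, firstmjd, lastmjd, classearly, pclassearly
    FROM objects
    WHERE lastmjd > %(mjd_first)s
    """
    params = {"mjd_first": float(mjd_first)}

    if bogus_id is not None:
        # IS DISTINCT FROM treats NULL as a comparable value, so rows with
        # no early class survive the filter (a plain <> would drop them)
        query += "  AND classearly IS DISTINCT FROM %(bogus_id)s\n"
        params["bogus_id"] = int(bogus_id)

    return query, params


# Example: objects seen since MJD 59310.0, excluding the 'Bogus' class (ID 22)
query, params = build_recent_objects_query(59310.0, bogus_id=22)
```

The pair could then be handed to pandas as `pd.read_sql_query(query, conn, params=params)`.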

To make the extracted data more legible:
* We might want to replace any integer values (e.g. the ```classearly``` attribute) with the actual key text, so it's clearer what information the object is associated with.
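
One possible way to do this, sketched under a few assumptions: the class mapper is inverted into an ID-to-name dictionary, and a new column (which I've called `classearly_name`, a made-up name) holds the text label. The small DataFrame below is only a stand-in for the query result; the first two rows reuse values from the output above, while the third row is an invented example of an object with no early class.

```python
import pandas as pd

# Subset of the class mapper shown earlier (name -> integer ID)
classmapper = {"AGN": 18, "SN": 19, "Variable Star": 20, "Asteroid": 21, "Bogus": 22}

# Invert it so that integer IDs can be looked up
id_to_name = {class_id: name for name, class_id in classmapper.items()}

# Stand-in for the query result; the last row is a made-up unclassified object
objects = pd.DataFrame({
    "oid": ["ZTF21aasweyp", "ZTF21aanlaag", "example_unclassified"],
    "classearly": [20.0, 20.0, None],
})

# classearly arrives as float because of missing values, so cast each
# non-missing entry to int before the lookup and leave the rest empty
objects["classearly_name"] = objects["classearly"].map(
    lambda value: id_to_name.get(int(value)) if pd.notna(value) else None
)
```

Keeping the integer column alongside the text column means the original database values are not lost.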

Eventually, to search coordinates on ChaSeR:
* The ZTF RA/Dec can be entered as they are, since they're in degrees (equatorial J2000 coordinate system).
* For the 'start date' field, we must convert the first or last MJD into YYYY-MM-DD format.
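
The MJD-to-date conversion in the last point can be sketched with the standard library alone, using the fact that MJD 0 corresponds to 1858-11-17 00:00. The helper name `mjd_to_date_string` is hypothetical, and this simple version ignores leap seconds, which should not matter at day precision. `astropy.time.Time` with `format='mjd'` would be the more complete route.

```python
from datetime import datetime, timedelta

# MJD 0 corresponds to midnight at the start of 1858-11-17
MJD_EPOCH = datetime(1858, 11, 17)


def mjd_to_date_string(mjd):
    """Hypothetical helper: convert an MJD value to a YYYY-MM-DD string."""
    return (MJD_EPOCH + timedelta(days=float(mjd))).strftime("%Y-%m-%d")


# lastmjd of the first object in the query output above
print(mjd_to_date_string(59310.376076))  # 2021-04-06
```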