# Querying ALeRCE - LC Classifier
```Author: Eden Girma, Last updated 20210426```

# Table of contents:
* [Querying the ALeRCE API Python Client](#api)
* [Replicating process with direct database query](#db)
* [Comparing API and DB outputs](#compare)
* [Understanding retrieved object data](#data)
* [Exporting to VOTable](#export)

**Goal:**
 
1) To query the ALeRCE database for objects with the following attributes:
* last detected 24 to 48 hours before the current time
* classified by the LC classifier

2) To return a table consisting of ALeRCE alert objects that includes, per row:
* aggregated detection properties per object (e.g. mean RA/Dec, number of detections)
* probability of the highest ranking class assigned by the LC classifier

We will do this by querying the ALeRCE API first, and then by querying the database directly.
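As a quick sanity check on the 24 to 48 hour window, here is a minimal standard-library sketch of how the MJD bounds can be computed. The notebook itself uses ```astropy.time.Time``` for this; the helper names ```to_mjd``` and ```lastmjd_window``` are hypothetical, and the conversion ignores leap seconds.

```python
from datetime import datetime, timedelta, timezone

# MJD 0 corresponds to 1858-11-17 00:00 UTC
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)

def to_mjd(moment):
    """Convert a timezone-aware datetime to a Modified Julian Date (leap seconds ignored)."""
    return (moment - MJD_EPOCH) / timedelta(days=1)

def lastmjd_window(now=None, start_days_ago=2.0, end_days_ago=1.0):
    """Return (min_lastmjd, max_lastmjd) for the window 48 to 24 hours before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)  # always UTC, never local time
    now_mjd = to_mjd(now)
    return now_mjd - start_days_ago, now_mjd - end_days_ago

# Fixed reference time so the result is reproducible
reference = datetime(2021, 4, 26, tzinfo=timezone.utc)
window = lastmjd_window(reference)
print(window)  # (59328.0, 59329.0)
```

Passing an explicit UTC time keeps the window independent of the machine's local timezone.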

In [1]:
import sys

# Packages for direct database access
# %pip install psycopg2
import psycopg2
import json

# Packages for data and number handling
import numpy as np
import pandas as pd
import math

# Packages for calculating current time and extracting ZTF data to VOTable
from astropy.time import Time
from astropy.table import Table, unique, vstack
from astropy.io.votable import from_table, writeto
from datetime import datetime

# Packages for display and data plotting, if desired
from IPython.display import HTML
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Set up ALeRCE python client
from alerce.core import Alerce
client = Alerce()

## Querying the ALeRCE API Python Client<a class="anchor" id="api"></a>

We will retrieve these objects one class at a time, first building a function that uses the ALeRCE client to query objects according to the LC classifier's predictions.

Note that, according to the ZTF API source (```ztf-api/api/sql/astro_object/astro_object.py```), ```query_objects``` defaults to a ranking of 1 when no ranking is specified.

In [3]:
# Define function that queries objects according to class
def query_class_objects(cn, min_lastmjd, max_lastmjd):
    objects = client.query_objects(classifier = 'lc_classifier',
                                   classifier_version = 'hierarchical_random_forest_1.0.0',
                                   class_name = cn,
                                   lastmjd = [min_lastmjd, max_lastmjd],
                                   page_size = int(1e6),
                                   format='pandas')
    return objects

In [4]:
# Querying the ALeRCE client for objects last detected 24 - 48 hours before the current time, over a range of classes

# Time.now() is in UTC; datetime.today() returns local time and would shift the window
now_mjd = Time.now().mjd
min_lastmjd = now_mjd - 2
max_lastmjd = now_mjd - 1
classes = client.query_classifiers()[0]['classes']

# Collect one dataframe per class, then concatenate them once at the end
class_frames = []
for class_name in classes:
    class_frames.append(query_class_objects(class_name, min_lastmjd, max_lastmjd))
    print('Class queried: %s' % (class_name))

apiobjects = pd.concat(class_frames)
print('Done.')

Class queried: SNIa
Class queried: SNIbc
Class queried: SNII
Class queried: SLSN
Class queried: QSO
Class queried: AGN
Class queried: Blazar
Class queried: CV/Nova
Class queried: YSO
Class queried: LPV
Class queried: E
Class queried: DSCT
Class queried: RRL
Class queried: CEP
Class queried: Periodic-Other
Done.


In [5]:
# Prints the dataframe shape: (number of rows, number of columns)
print(apiobjects.shape)

# Sorting objects by lastmjd, firstmjd, and oid in descending order
apiobjects = apiobjects.sort_values(by=['lastmjd', 'firstmjd', 'oid'], ascending=False)
apiobjects.head()

(111703, 23)


|  | oid | ndethist | ncovhist | mjdstarthist | mjdendhist | corrected | stellar | ndet | g_r_max | g_r_max_corr | ... | lastmjd | deltajd | meanra | meandec | sigmara | sigmadec | class | classifier | probability | step_id_corr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9248 | ZTF20acsrocc | 241 | 423 | 58363.294734 | 59329.477836 | True | True | 12 | 0.3249 | 0.573077 | ... | 59329.477836 | 182.31441 | 300.205054 | -10.684575 | 8.7e-05 | 8.6e-05 | E | lc_classifier | 0.283536 | correction_0.0.1 |
| 10207 | ZTF20acpbpcw | 316 | 418 | 58360.359444 | 59329.477836 | True | True | 17 | 0.720458 | 0.527469 | ... | 59329.477836 | 194.30831 | 298.546769 | -11.335479 | 9.5e-05 | 8.6e-05 | E | lc_classifier | 0.298764 | correction_0.0.1 |
| 7901 | ZTF20acnwoci | 147 | 425 | 58344.337407 | 59329.477836 | True | True | 22 | 0.1714 | 0.808239 | ... | 59329.477836 | 204.309884 | 300.991775 | -6.379007 | 6.9e-05 | 8.3e-05 | Periodic-Other | lc_classifier | 0.507144 | correction_0.0.1 |
| 1311 | ZTF20abznzmv | 15 | 153 | 59108.251991 | 59329.477836 | True | True | 15 |  |  | ... | 59329.477836 | 221.225845 | 299.582945 | -8.019932 | 3.6e-05 | 5e-05 | DSCT | lc_classifier | 0.153996 | correction_0.0.1 |
| 14396 | ZTF20abjasrm | 36 | 423 | 58830.087106 | 59329.477836 | True | True | 21 | 0.597153 | 0.650312 | ... | 59329.477836 | 300.203472 | 301.4596 | -13.0593 | 9.4e-05 | 9.4e-05 | Periodic-Other | lc_classifier | 0.39298 | correction_0.0.1 |


## Replicating process with direct database query<a class="anchor" id="db"></a>

In [6]:
# Open and load credentials
credentials_file = "../alercereaduser_v4.json"
with open(credentials_file) as jsonfile:
    params = json.load(jsonfile)["params"]
    
# Open a connection to the database
conn = psycopg2.connect(dbname=params['dbname'], 
                        user=params['user'], 
                        host=params['host'], 
                        password=params['password'])

In [7]:
query='''
SELECT
    object.oid, object.meanra, object.meandec, object.sigmara, object.sigmadec,
    object.firstmjd, object.lastmjd, object.ndet, 
    pr.classifier_name, pr.classifier_version, pr.class_name, 
    pr.ranking, pr.probability

FROM 
    object INNER JOIN (
        SELECT 
            probability.oid, probability.classifier_name, probability.classifier_version,
            probability.class_name, probability.ranking, probability.probability
        FROM
            probability
        WHERE
            probability.classifier_name = 'lc_classifier'
            AND probability.classifier_version = 'hierarchical_random_forest_1.0.0'
            AND probability.ranking = 1
    ) AS pr
    ON object.oid = pr.oid

WHERE 
    object.lastMJD >= %s
    AND object.lastMJD <= %s
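'''

# Pass the MJD bounds as query parameters (psycopg2 fills in the %s placeholders)
# rather than formatting them into the SQL string. Outputs as a pd.DataFrame
dbobjects = pd.read_sql_query(query, conn, params=(float(min_lastmjd), float(max_lastmjd)))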
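To experiment with the shape of this join without database credentials, here is a minimal offline sketch that uses Python's built-in ```sqlite3``` module as a stand-in for the PostgreSQL database. The two tables are cut down to a few columns, the ```ZTF_demo_*``` rows are made up for illustration, and ```sqlite3``` uses ```?``` placeholders where ```psycopg2``` uses ```%s```.

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical, reduced schema)
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE object (oid TEXT PRIMARY KEY, lastmjd REAL, ndet INTEGER);
CREATE TABLE probability (
    oid TEXT, classifier_name TEXT, classifier_version TEXT,
    class_name TEXT, ranking INTEGER, probability REAL
);
""")

# Made-up rows: one object inside the MJD window, one outside it
conn_demo.executemany("INSERT INTO object VALUES (?, ?, ?)", [
    ("ZTF_demo_a", 59328.5, 12),
    ("ZTF_demo_b", 59329.5, 17),
])
version = "hierarchical_random_forest_1.0.0"
conn_demo.executemany("INSERT INTO probability VALUES (?, ?, ?, ?, ?, ?)", [
    ("ZTF_demo_a", "lc_classifier", version, "E", 1, 0.6),
    ("ZTF_demo_a", "lc_classifier", version, "RRL", 2, 0.3),
    ("ZTF_demo_b", "lc_classifier", version, "LPV", 1, 0.8),
])

# Same idea as the notebook query: keep only the top-ranked class per object
# and restrict objects to the lastmjd window, using bound parameters
demo_query = """
SELECT object.oid, object.lastmjd, pr.class_name, pr.probability
FROM object
INNER JOIN probability AS pr ON object.oid = pr.oid
WHERE pr.classifier_name = ?
  AND pr.ranking = 1
  AND object.lastmjd BETWEEN ? AND ?
"""
demo_rows = conn_demo.execute(demo_query, ("lc_classifier", 59328.0, 59329.0)).fetchall()
print(demo_rows)  # [('ZTF_demo_a', 59328.5, 'E', 0.6)]
```

Only the rank-1 row of the in-window object survives, which is the behaviour we expect from the real query.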

In [8]:
# Prints the dataframe shape: (number of rows, number of columns)
print(dbobjects.shape)

# Sorting objects by lastmjd, firstmjd, and oid in descending order
dbobjects = dbobjects.sort_values(by=['lastmjd', 'firstmjd', 'oid'], ascending=False)
dbobjects.head()

(111703, 13)


|  | oid | meanra | meandec | sigmara | sigmadec | firstmjd | lastmjd | ndet | classifier_name | classifier_version | class_name | ranking | probability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95364 | ZTF20acsrocc | 300.205054 | -10.684575 | 8.7e-05 | 8.6e-05 | 59147.163426 | 59329.477836 | 12 | lc_classifier | hierarchical_random_forest_1.0.0 | E | 1 | 0.283536 |
| 108118 | ZTF20acpbpcw | 298.546769 | -11.335479 | 9.5e-05 | 8.6e-05 | 59135.169525 | 59329.477836 | 17 | lc_classifier | hierarchical_random_forest_1.0.0 | E | 1 | 0.298764 |
| 9068 | ZTF20acnwoci | 300.991775 | -6.379007 | 6.9e-05 | 8.3e-05 | 59125.167951 | 59329.477836 | 22 | lc_classifier | hierarchical_random_forest_1.0.0 | Periodic-Other | 1 | 0.507144 |
| 94859 | ZTF20abznzmv | 299.582945 | -8.019932 | 3.6e-05 | 5e-05 | 59108.251991 | 59329.477836 | 15 | lc_classifier | hierarchical_random_forest_1.0.0 | DSCT | 1 | 0.153996 |
| 3100 | ZTF20abjasrm | 301.4596 | -13.0593 | 9.4e-05 | 9.4e-05 | 59029.274363 | 59329.477836 | 21 | lc_classifier | hierarchical_random_forest_1.0.0 | Periodic-Other | 1 | 0.39298 |


## Comparing API and DB outputs<a class="anchor" id="compare"></a>

In [9]:
# Check that the OID in each row of the API table matches the OID in the corresponding row of the DB table
# (this relies on both tables having been sorted the same way above)
print(set(dbobjects['oid'].values==apiobjects['oid'].values))

{True}
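The row-by-row check above relies on both tables having the same length and the same sort order. As a complementary sketch, the comparison can be made independent of order by counting ```(oid, class)``` pairs on each side. The function name ```compare_keys``` and the ```ZTF_demo_*``` rows are hypothetical; with the real dataframes, the pairs could be built with something like ```list(zip(apiobjects['oid'], apiobjects['class']))``` and ```list(zip(dbobjects['oid'], dbobjects['class_name']))```.

```python
from collections import Counter

def compare_keys(api_rows, db_rows):
    """Return the (oid, class) pairs found only in the API rows and only in the DB rows.

    Counters are used instead of sets so that repeated pairs are also compared.
    """
    api_counts = Counter(api_rows)
    db_counts = Counter(db_rows)
    only_api = api_counts - db_counts  # Counter subtraction keeps positive counts only
    only_db = db_counts - api_counts
    return only_api, only_db

# Made-up rows: same OIDs in a different order, with one class disagreement
api_rows = [("ZTF_demo_a", "E"), ("ZTF_demo_b", "LPV"), ("ZTF_demo_c", "CEP")]
db_rows = [("ZTF_demo_c", "CEP"), ("ZTF_demo_a", "E"), ("ZTF_demo_b", "RRL")]

only_api, only_db = compare_keys(api_rows, db_rows)
print(dict(only_api))  # {('ZTF_demo_b', 'LPV'): 1}
print(dict(only_db))   # {('ZTF_demo_b', 'RRL'): 1}
```

Two empty results would mean that both sources returned exactly the same set of classified objects.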


## Understanding retrieved object data<a class="anchor" id="data"></a>

For this, we'll only look at the dataframe retrieved from the API client, which is fine because the ```dbobjects``` and ```apiobjects``` dataframes contain the same OIDs.

The following prints out the number of OIDs that correspond to each class name:

In [24]:
# Count number of OIDs that correspond to each class name
print('Total rows : %i' % (len(apiobjects.index)))
obj_classes = apiobjects.groupby('class')
for key in obj_classes.groups.keys():
    l = obj_classes.groups[key].size
    print('%s : %i' % (key, l))

Total rows : 111703
AGN : 738
Blazar : 1662
CEP : 5465
CV/Nova : 3824
DSCT : 2318
E : 23222
LPV : 34045
Periodic-Other : 16349
QSO : 865
RRL : 13549
SLSN : 49
SNII : 34
SNIa : 116
SNIbc : 160
YSO : 9307


In [29]:
# Identify duplicate OID entries - rows with the same OID but different classes
obj_oid = apiobjects.groupby(['oid'])
duplicates = []
for key in obj_oid.groups.keys():
    l = obj_oid.groups[key].size
    if l > 1:
        oid = key
        duplicates.append(oid)

print('Number of OIDs with more than one row : %i' % (len(duplicates)))
print('Number of unique OIDs : %i' % len(obj_oid))
# Print example rows with duplicate OIDs
if len(duplicates) > 0:
    display(apiobjects[(apiobjects['oid']==duplicates[0])])

Number of OIDs with more than one row : 174
Number of unique OIDs : 111529


|  | oid | ndethist | ncovhist | mjdstarthist | mjdendhist | corrected | stellar | ndet | g_r_max | g_r_max_corr | ... | lastmjd | deltajd | meanra | meandec | sigmara | sigmadec | class | classifier | probability | step_id_corr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23132 | ZTF17aabmpqk | 130 | 456 | 58450.481898 | 59329.170984 | True | True | 88 | 0.45611 | 0.360766 | ... | 59329.170984 | 878.689086 | 148.695664 | -2.339137 | 0.000109 | 0.000106 | E | lc_classifier | 0.196184 | correction_0.0.1 |
| 5431 | ZTF17aabmpqk | 130 | 456 | 58450.481898 | 59329.170984 | True | True | 88 | 0.45611 | 0.360766 | ... | 59329.170984 | 878.689086 | 148.695664 | -2.339137 | 0.000109 | 0.000106 | CEP | lc_classifier | 0.196184 | correction_0.0.1 |
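In the example above, both rows for ```ZTF17aabmpqk``` carry the same probability (0.196184), which suggests that at least some duplicates are ties between two classes. That is an inference from this one example, not something we have confirmed for all 174 OIDs. A minimal sketch for checking it is below; the function name ```find_tied_oids``` and the ```ZTF_demo_x``` row are hypothetical.

```python
from collections import defaultdict

def find_tied_oids(rows):
    """Map each OID that has several rows with one shared probability to its sorted classes.

    `rows` is an iterable of (oid, class_name, probability) tuples.
    """
    by_oid = defaultdict(list)
    for oid, class_name, probability in rows:
        by_oid[oid].append((class_name, probability))

    tied = {}
    for oid, entries in by_oid.items():
        distinct_probabilities = {probability for _, probability in entries}
        if len(entries) > 1 and len(distinct_probabilities) == 1:
            tied[oid] = sorted(class_name for class_name, _ in entries)
    return tied

rows = [
    ("ZTF17aabmpqk", "E", 0.196184),    # from the example output above
    ("ZTF17aabmpqk", "CEP", 0.196184),  # from the example output above
    ("ZTF_demo_x", "LPV", 0.75),        # made-up single-row object
]
tied = find_tied_oids(rows)
print(tied)  # {'ZTF17aabmpqk': ['CEP', 'E']}
```

If every duplicated OID shows up in the result, the duplicates are all exact ties and any one row per OID could be kept without losing probability information.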


## Exporting to VOTable <a class="anchor" id="export"></a>

Saving this data as a VOTable requires converting it from its current form (a ```pd.DataFrame```). This is possible with the ```Table``` object from ```astropy.table``` and the functions we imported at the start from ```astropy.io.votable```. Essentially, we'll convert our ```pd.DataFrame``` to an ```astropy.table.Table```, then to an ```astropy.io.votable.VOTableFile```, which can then be exported.

_A buggy caveat, however:_ ```astropy.io.votable.VOTableFile``` objects throw an error when they are passed masked/```NaN``` values. I've gotten around this, for now, by filling in the masked values with the _string_ ```"None"``` before the ```pd.DataFrame``` is converted to a ```Table```.

In [30]:
# Defining a function that allows you to export the dataframe into a VOTable
def export_object_data(objects, filename):
    # Filling the masked values with the string 'None'
    objects_filled = objects.fillna('None')

    # Converting filled dataframe to astropy Table, then astropy VOTableFile, then exporting into .xml
    full_dt = Table.from_pandas(objects_filled)
    votable = from_table(full_dt)
    writeto(votable, filename)
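One side effect of the workaround above is that a numeric column with missing values (such as ```g_r_max```) ends up holding a mix of numbers and the string ```"None"```. A possible alternative, sketched here on plain Python lists rather than on the dataframe, is to choose the fill value per column so that numeric columns stay numeric. The helper names and the sentinel ```-999.0``` are assumptions made for illustration, not values used by ALeRCE or astropy.

```python
import math

def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fill_column(values, numeric_sentinel=-999.0, text_sentinel="None"):
    """Fill missing entries with a numeric sentinel for numeric columns, a string otherwise."""
    present = [v for v in values if not is_missing(v)]
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    fill = numeric_sentinel if numeric else text_sentinel
    return [fill if is_missing(v) else v for v in values]

g_r_max = [0.3249, float("nan"), 0.1714]  # numeric column with one missing value
classes = ["E", None, "DSCT"]             # text column with one missing value

print(fill_column(g_r_max))  # [0.3249, -999.0, 0.1714]
print(fill_column(classes))  # ['E', 'None', 'DSCT']
```

The trade-off is that anyone reading the exported file has to know that the sentinel stands for "no value".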