# Querying ALeRCE - Stamp Classifier
```Author: Eden Girma, Last updated 20210426```

# Table of contents:
* [Querying the ALeRCE API Python Client](#api)
* [Replicating process with direct database query](#db)
* [Comparing API and DB outputs](#compare)
* [Understanding retrieved object data](#data)
* [Exporting to VOTable](#export)

**Goal:**
 
1) To query the ALeRCE database for objects with the following attributes:
* last detected 24 - 48 hours before the current time (the queries below filter on ```lastmjd```)
* classified by the stamp classifier (version 1.0.4)

2) To return a table consisting of ALeRCE alert objects that includes, per row:
* aggregated detection properties per object (e.g. mean RA/Dec, number of detections)
* probability of the highest ranking class assigned by the stamp classifier (v1.0.4)

We will do this first through the ALeRCE API, and then by querying the database directly.

In [1]:
import sys

# Packages for direct database access
# %pip install psycopg2
import psycopg2
import json

# Packages for data and number handling
import numpy as np
import pandas as pd
import math

# Packages for calculating current time and extracting ZTF data to VOTable
from astropy.time import Time
from astropy.table import Table, unique, vstack
from astropy.io.votable import from_table, writeto
from datetime import datetime

# Packages for display and data plotting, if desired
from IPython.display import HTML
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Set up ALeRCE python client
from alerce.core import Alerce
client = Alerce()

## Querying the ALeRCE API Python Client<a class="anchor" id="api"></a>

We will retrieve these objects per class, first by building a function that uses the ALeRCE client to query objects according to stamp classifier predictions.

Note that, according to the ZTF API source (```ztf-api/api/sql/astro_object/astro_object.py```), ```query_objects``` defaults to a ranking of 1 when no ranking is specified, so each object is matched against its highest-ranked class only.
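To avoid relying on that default, the ranking can be stated explicitly. The following is a minimal sketch, not the client's documented usage: it assumes ```query_objects``` accepts a ```ranking``` keyword and forwards it to the API, and the helper name ```query_top_ranked``` is hypothetical. The client is passed in as an argument so the helper can be exercised without a network connection.

```python
def query_top_ranked(client, class_name, min_lastmjd, max_lastmjd, ranking=1):
    """Query objects whose class at the given ranking is `class_name`.

    `client` is expected to be an alerce.core.Alerce instance (or any object
    exposing a compatible query_objects method).
    """
    return client.query_objects(
        classifier="stamp_classifier",
        classifier_version="stamp_classifier_1.0.4",
        class_name=class_name,
        ranking=ranking,  # assumed keyword: stated explicitly instead of relying on the default
        lastmjd=[min_lastmjd, max_lastmjd],
        format="pandas",
    )
```

With the client defined above, a call would look like ```query_top_ranked(client, "SN", min_lastmjd, max_lastmjd)```.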

In [7]:
# Define function that queries objects according to class
def query_class_objects(cn, min_lastmjd, max_lastmjd):
    objects = client.query_objects(classifier = 'stamp_classifier',
                                   classifier_version = 'stamp_classifier_1.0.4',
                                   class_name = cn,
                                   lastmjd = [min_lastmjd, max_lastmjd],
                                   page_size = int(1e6),
                                   format='pandas')
    return objects

In [8]:
# Querying the ALeRCE client for objects last detected 24 - 48 hours before the current time, over a range of classes

# Time.now() is always in UTC; datetime.today() returns local time, which would shift the window
now_mjd = Time.now().mjd
min_lastmjd = now_mjd - 2
max_lastmjd = now_mjd - 1
classes = ["AGN", "SN", "VS", "asteroid", "bogus"]
apiobjects = pd.DataFrame()

for class_name in classes:
    class_objects = query_class_objects(class_name, min_lastmjd, max_lastmjd)
    if class_name == classes[0]:
        apiobjects = class_objects
    else:
        apiobjects = pd.concat([apiobjects, class_objects])
    
    print('Class queried: %s' % (class_name))
    
    if class_name == classes[-1]:
        print('Done.')

Class queried: AGN
Class queried: SN
Class queried: VS
Class queried: asteroid
Class queried: bogus
Done.
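For reference, the 24 - 48 hour window translates to MJD bounds as follows. MJD counts days, so "48 hours ago" is the current MJD minus 2 and "24 hours ago" is the current MJD minus 1. This is a sketch using only the standard library; the helper names ```to_mjd``` and ```lastmjd_window``` are hypothetical, and leap seconds are ignored.

```python
from datetime import datetime, timedelta, timezone

# MJD 0 corresponds to 1858-11-17 00:00 UTC
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def to_mjd(moment):
    """Convert a timezone-aware datetime to MJD (leap seconds ignored)."""
    return (moment - MJD_EPOCH) / timedelta(days=1)


def lastmjd_window(now=None, start_hours=48, end_hours=24):
    """Return (min_lastmjd, max_lastmjd) for the window start_hours to end_hours before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (to_mjd(now - timedelta(hours=start_hours)),
            to_mjd(now - timedelta(hours=end_hours)))


# Fixed reference time so the result is reproducible
window = lastmjd_window(datetime(2021, 4, 26, 12, 0, tzinfo=timezone.utc))
print(window)  # (59328.5, 59329.5)
```

Working from an explicitly UTC-aware timestamp keeps the window independent of the local timezone of the machine running the notebook.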


In [9]:
# Prints the dataframe shape: (number of selected objects, number of columns)
print(apiobjects.shape)

# Sorting objects by lastmjd, firstmjd, and oid in descending order
apiobjects = apiobjects.sort_values(by=['lastmjd', 'firstmjd', 'oid'], ascending=False)
apiobjects.head()

(139298, 23)


Unnamed: 0,oid,ndethist,ncovhist,mjdstarthist,mjdendhist,corrected,stellar,ndet,g_r_max,g_r_max_corr,...,lastmjd,deltajd,meanra,meandec,sigmara,sigmadec,class,classifier,probability,step_id_corr
8386,ZTF21aaxlhvt,1,507,59329.461065,59329.461065,False,False,1,,,...,59329.461065,0.0,256.327491,-24.100443,,,asteroid,stamp_classifier,1.0,correction_0.0.1
14509,ZTF21aaxlhvs,1,509,59329.461065,59329.461065,False,False,1,,,...,59329.461065,0.0,256.375443,-23.632052,,,bogus,stamp_classifier,0.434058,correction_0.0.1
14680,ZTF21aaxlhvr,1,509,59329.461065,59329.461065,False,False,1,,,...,59329.461065,0.0,256.683072,-23.790888,,,bogus,stamp_classifier,0.804503,correction_0.0.1
8413,ZTF21aaxlhvh,1,457,59329.461065,59329.461065,False,False,1,,,...,59329.461065,0.0,260.72405,-23.414924,,,asteroid,stamp_classifier,1.0,correction_0.0.1
8424,ZTF21aaxlhvg,1,457,59329.461065,59329.461065,False,False,1,,,...,59329.461065,0.0,260.448838,-23.853284,,,asteroid,stamp_classifier,1.0,correction_0.0.1


## Replicating process with direct database query<a class="anchor" id="db"></a>

In [10]:
# Open and load credentials
credentials_file = "../alercereaduser_v4.json"
with open(credentials_file) as jsonfile:
    params = json.load(jsonfile)["params"]
    
# Open a connection to the database
conn = psycopg2.connect(dbname=params['dbname'], 
                        user=params['user'], 
                        host=params['host'], 
                        password=params['password'])

In [11]:
query='''
SELECT
    object.oid, object.meanra, object.meandec, object.sigmara, object.sigmadec,
    object.firstmjd, object.lastmjd, object.ndet, 
    pr.classifier_name, pr.classifier_version, pr.class_name, 
    pr.ranking, pr.probability

FROM 
    object INNER JOIN (
        SELECT 
            probability.oid, probability.classifier_name, probability.classifier_version,
            probability.class_name, probability.ranking, probability.probability
        FROM
            probability
        WHERE
            probability.classifier_name = 'stamp_classifier'
            AND probability.classifier_version = 'stamp_classifier_1.0.4'
            AND probability.ranking = 1
    ) AS pr
    ON object.oid = pr.oid

WHERE 
    object.lastMJD >= %s
    AND object.lastMJD <= %s
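'''

# Pass the MJD bounds as bound parameters rather than interpolating them into the SQL string;
# float() converts NumPy scalars to plain Python floats, which psycopg2 can adapt
# Outputs as a pd.DataFrame
dbobjects = pd.read_sql_query(query, conn, params=(float(min_lastmjd), float(max_lastmjd)))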
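The shape of this query (join each object to its rank-1 probability row, then filter on ```lastmjd```) can be tried offline on a toy database. The sketch below uses an in-memory SQLite database from the standard library in place of the PostgreSQL server. The tables are cut down to a few columns and every row is made up for illustration. Note that SQLite uses ```?``` as its placeholder, whereas psycopg2 uses ```%s```.

```python
import sqlite3

toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
CREATE TABLE "object" (oid TEXT PRIMARY KEY, lastmjd REAL);
CREATE TABLE probability (oid TEXT, classifier_name TEXT, class_name TEXT,
                          ranking INTEGER, probability REAL);
""")

# Made-up rows, for illustration only
toy_conn.executemany('INSERT INTO "object" VALUES (?, ?)',
                     [("obj_a", 100.2), ("obj_b", 100.7), ("obj_c", 105.0)])
toy_conn.executemany("INSERT INTO probability VALUES (?, ?, ?, ?, ?)", [
    ("obj_a", "stamp_classifier", "SN", 1, 0.6),
    ("obj_a", "stamp_classifier", "AGN", 2, 0.3),
    ("obj_b", "stamp_classifier", "VS", 1, 0.9),
    ("obj_b", "stamp_classifier", "bogus", 2, 0.1),
    ("obj_c", "stamp_classifier", "asteroid", 1, 1.0),
])

toy_query = """
SELECT o.oid, o.lastmjd, pr.class_name, pr.probability
FROM "object" AS o
INNER JOIN probability AS pr ON o.oid = pr.oid
WHERE pr.classifier_name = ?
  AND pr.ranking = 1
  AND o.lastmjd >= ? AND o.lastmjd <= ?
ORDER BY o.oid
"""

# Only rank-1 rows of objects inside the window survive the join and filter
rows = toy_conn.execute(toy_query, ("stamp_classifier", 100.0, 101.0)).fetchall()
print(rows)  # [('obj_a', 100.2, 'SN', 0.6), ('obj_b', 100.7, 'VS', 0.9)]
```

Because of the ```ranking = 1``` condition, each object contributes at most one row per classifier, which is why the result is expected to have one row per OID.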

In [12]:
# Prints the dataframe shape: (number of selected objects, number of columns)
print(dbobjects.shape)

# Sorting objects by lastmjd, firstmjd, and oid in descending order
dbobjects = dbobjects.sort_values(by=['lastmjd', 'firstmjd', 'oid'], ascending=False)
dbobjects.head()

(139298, 13)


Unnamed: 0,oid,meanra,meandec,sigmara,sigmadec,firstmjd,lastmjd,ndet,classifier_name,classifier_version,class_name,ranking,probability
132917,ZTF21aaxlhvt,256.327491,-24.100443,,,59329.461065,59329.461065,1,stamp_classifier,stamp_classifier_1.0.4,asteroid,1,1.0
132933,ZTF21aaxlhvs,256.375443,-23.632052,,,59329.461065,59329.461065,1,stamp_classifier,stamp_classifier_1.0.4,bogus,1,0.434058
133617,ZTF21aaxlhvr,256.683072,-23.790888,,,59329.461065,59329.461065,1,stamp_classifier,stamp_classifier_1.0.4,bogus,1,0.804503
133158,ZTF21aaxlhvh,260.72405,-23.414924,,,59329.461065,59329.461065,1,stamp_classifier,stamp_classifier_1.0.4,asteroid,1,1.0
133616,ZTF21aaxlhvg,260.448838,-23.853284,,,59329.461065,59329.461065,1,stamp_classifier,stamp_classifier_1.0.4,asteroid,1,1.0


## Comparing API and DB outputs<a class="anchor" id="compare"></a>

In [20]:
# Check that the OID of each row in the API table is identical to that of the corresponding row in the DB table
# (this row-by-row comparison relies on both tables having been sorted the same way above)
print(set(dbobjects['oid'].values==apiobjects['oid'].values))

{True}


In [21]:
# Check if each row in the DB table corresponds to a unique OID
print(dbobjects['oid'].is_unique)

True
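The check above compares OIDs position by position, so it depends on both tables having the same length and sort order. A comparison that does not depend on row order, and that also covers the class and probability columns, could look like the sketch below. The helper name ```compare_tables``` and the tolerance ```atol``` are hypothetical. The column names (```class``` in the API table, ```class_name``` in the DB table) follow the outputs shown earlier, and the toy rows are made up.

```python
import numpy as np
import pandas as pd


def compare_tables(api_df, db_df, atol=1e-6):
    """Compare API and DB results by OID, independent of row order."""
    merged = api_df[["oid", "class", "probability"]].merge(
        db_df[["oid", "class_name", "probability"]],
        on="oid", how="outer", suffixes=("_api", "_db"), indicator=True)

    # OIDs present in only one of the two tables
    only_api = merged.loc[merged["_merge"] == "left_only", "oid"].tolist()
    only_db = merged.loc[merged["_merge"] == "right_only", "oid"].tolist()

    # For OIDs present in both, compare class labels and probabilities
    both = merged[merged["_merge"] == "both"]
    class_mismatch = both.loc[both["class"] != both["class_name"], "oid"].tolist()
    prob_close = np.isclose(both["probability_api"], both["probability_db"], atol=atol)
    prob_mismatch = both.loc[~prob_close, "oid"].tolist()

    return {"only_api": only_api, "only_db": only_db,
            "class_mismatch": class_mismatch, "prob_mismatch": prob_mismatch}


# Made-up rows, for illustration only
api_toy = pd.DataFrame({"oid": ["obj_a", "obj_b", "obj_c"],
                        "class": ["SN", "VS", "AGN"],
                        "probability": [0.6, 0.9, 0.5]})
db_toy = pd.DataFrame({"oid": ["obj_b", "obj_a", "obj_d"],
                       "class_name": ["VS", "SN", "bogus"],
                       "probability": [0.9, 0.6, 0.7]})
report = compare_tables(api_toy, db_toy)
print(report)
```

On the notebook's data, ```compare_tables(apiobjects, dbobjects)``` would be expected to return four empty lists if the two sources agree.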


## Understanding retrieved object data<a class="anchor" id="data"></a>

For this, we'll only look at the dataframe retrieved from the API client, which is fine because the ```dbobjects``` and ```apiobjects``` dataframes contain the same OIDs.

The following prints out the number of OIDs that correspond to each class name:

In [22]:
# Count number of OIDs that correspond to each class name
print('Total rows : %i' % (len(apiobjects.index)))
obj_classes = apiobjects.groupby('class')
for key in obj_classes.groups.keys():
    l = obj_classes.groups[key].size
    print('%s : %i' % (key, l))

Total rows : 139298
AGN : 10477
SN : 789
VS : 104360
asteroid : 8479
bogus : 15193


In [24]:
# Identify duplicate OID entries - rows with same OID but different classes and probabilities
obj_oid = apiobjects.groupby(['oid'])
duplicates = []
for key in obj_oid.groups.keys():
    l = obj_oid.groups[key].size
    if l > 1:
        oid = key
        duplicates.append(oid)

print('Number of duplicate OIDs: %i' % (len(duplicates)))
print('Number of unique OIDs : %i' % len(obj_oid))
# Print example rows with duplicate OIDs
if len(duplicates) > 0:
    display(apiobjects[(apiobjects['oid']==duplicates[0])])

Number of duplicate OIDs: 0
Number of unique OIDs : 139298
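The same two summaries (objects per class, and OIDs that appear more than once) can also be obtained with built-in pandas methods instead of explicit loops over groups. This is a sketch on a made-up toy dataframe; on the notebook's data, ```toy``` would be replaced by ```apiobjects```.

```python
import pandas as pd

# Made-up rows, for illustration only; obj_b deliberately appears twice
toy = pd.DataFrame({"oid": ["obj_a", "obj_b", "obj_b", "obj_c"],
                    "class": ["SN", "VS", "bogus", "VS"]})

# Number of rows per class
class_counts = toy["class"].value_counts()

# All rows whose OID occurs more than once (keep=False marks every occurrence)
dup_rows = toy[toy["oid"].duplicated(keep=False)]
dup_oids = dup_rows["oid"].unique().tolist()

print(class_counts.to_dict())
print(dup_oids)
```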


## Exporting to VOTable <a class="anchor" id="export"></a>

To save this data as a VOTable requires converting it from its current form (a ```pd.DataFrame```). This is possible with the ```Table``` object from ```astropy.table```, and the functions we initially imported from ```astropy.io.votable```. Essentially, we'll convert our ```pd.DataFrame``` to an ```astropy.table.Table```, and then to an ```astropy.io.votable.VOTableFile```, which can then be exported.

_A buggy caveat, however_ -- ```astropy.io.votable.VOTableFile``` objects throw an error when you attempt to pass them masked/```NaN``` values. I've gotten around this, for now, by filling in the masked values with the _string_ ```"None"``` before the ```pd.DataFrame``` is converted to a ```Table```.

In [25]:
# Defining a function that allows you to export the dataframe into a VOTable
def export_object_data(objects, filename):
    # Filling the masked values with the string 'None'
    objects_filled = objects.fillna('None')

    # Converting filled dataframe to astropy Table, then astropy VOTableFile, then exporting into .xml
    full_dt = Table.from_pandas(objects_filled)
    votable = from_table(full_dt)
    writeto(votable, filename)
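
One side effect of filling every missing value with a string is that numeric columns containing ```NaN``` (such as ```sigmara```) typically end up as string/object columns. A possible variant, sketched below, fills missing values column by column so that numeric columns stay numeric. The helper name ```fill_for_export``` and the sentinel ```-999.0``` are hypothetical choices, anyone reading the exported file would need to know the sentinel, and this has not been checked against the VOTable writer itself.

```python
import numpy as np
import pandas as pd


def fill_for_export(objects, text_fill="None", numeric_fill=-999.0):
    """Fill missing values per column so that numeric columns keep a numeric dtype."""
    filled = objects.copy()  # leave the caller's dataframe untouched
    for name in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[name]):
            filled[name] = filled[name].fillna(numeric_fill)
        else:
            filled[name] = filled[name].fillna(text_fill)
    return filled


# Made-up rows, for illustration only
toy = pd.DataFrame({"oid": ["obj_a", "obj_b"],
                    "sigmara": [np.nan, 0.02],
                    "class": ["SN", None]})
filled = fill_for_export(toy)
print(filled.dtypes)
```

The filled dataframe could then be passed to ```Table.from_pandas``` in place of ```objects_filled```.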