# The Local Group Galaxy Database: An Open, Multi-Source, Curated Catalog

This notebook walks through the basic API we've designed.

There are two ways to start the database: by providing a directory of JSON files to load, or by supplying a MongoDB connection string. Here we will use the directory method. The same methods exist regardless of how you start the database, though the MongoDB route offers more extensive query language options.

In [1]:
import json  # used later to pretty-print query results

from galcat.core import *

db = Database(directory='galcat/test_data', references_file='references.json')

The database is a NoSQL document store that organizes the data as a collection of JSON/dictionary objects. Each galaxy is its own JSON document containing all relevant parameters. For example, here is the structure for a test galaxy, Gal 2:

In [2]:
doc = {
    "name": "Gal 2",
    "ra": [
        {
            "value": 9.14542,
            "best": 1,
            "reference": "",
            "unit": "deg"
        }
    ],
    "dec": [
        {
            "value": 49.64667,
            "best": 1,
            "reference": "",
            "unit": "deg"
        }
    ],
    "ebv": [
        {
            "value": 0.166,
            "best": 1,
            "reference": "Bellazzini_2006_1"
        }
    ],
    "v_mag": [
        {
            "value": 16.2,
            "error_upper": 0.3,
            "error_lower": 0.3,
            "best": 1,
            "unit": "mag",
            "reference": ""
        }
    ]
}
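
To make the structure concrete, here is a minimal sketch of reading such a document as a plain Python dictionary. Each parameter is a list of entries, and the entry flagged `best: 1` is the preferred one. The helper `best_entry` is hypothetical and not part of the galcat API; it only illustrates the layout shown above.

```python
def best_entry(doc, parameter):
    """Return the entry flagged best == 1 for a parameter, or None."""
    for entry in doc.get(parameter, []):
        if entry.get("best") == 1:
            return entry
    return None


# A trimmed copy of the example document above
example = {
    "name": "Gal 2",
    "ra": [
        {"value": 9.14542, "best": 1, "reference": "", "unit": "deg"},
    ],
    "v_mag": [
        {"value": 16.2, "error_upper": 0.3, "error_lower": 0.3,
         "best": 1, "unit": "mag", "reference": ""},
    ],
}

ra = best_entry(example, "ra")
print(ra["value"], ra["unit"])  # 9.14542 deg
```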

## Queries

Data can be queried from the database using MongoDB's query language, which expresses queries in a JSON/dictionary-like structure. Here are some example queries you can make with the API.

In [3]:
# Query on a single field, such as a name
doc = db.query_db({'name': 'Gal 2'})
print(doc)

[{'name': 'Gal 2', 'ra': array([{'value': 10.4, 'best': 1, 'reference': '', 'unit': 'deg'}],
      dtype=object), 'dec': array([{'value': -32.4, 'best': 1, 'reference': '', 'unit': 'deg'}],
      dtype=object), 'ebv': array([{'value': 0.2, 'best': 1, 'reference': 'Bellazzini_2006_1'}],
      dtype=object), 'v_mag': array([{'value': 20.2, 'error_upper': 0.5, 'error_lower': 0.5, 'best': 1, 'unit': 'mag', 'reference': ''}],
      dtype=object)}]


In [4]:
# Tabular results of query
db.query_table({})

dec,ebv,half-light_radius,name,ra,radial_velocity,v_mag
object,float64,object,str5,object,float64,object
-32.4 deg,0.2,--,Gal 2,10.4 deg,--,20.2 mag
49.64667 deg,0.166,1.35 arcmin,Gal 1,9.14542 deg,-139.8,16.2 mag


A variety of query operators are implemented, including:

 - `$gt` greater than
 - `$gte` greater than or equal to
 - `$lt` less than
 - `$lte` less than or equal to
 - `$exists` does this parameter exist in the document
 - `$or` OR operator (AND is implicit in the query)
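
To illustrate what these operators mean, the sketch below evaluates a query dictionary against plain Python documents. It is not galcat's implementation; `resolve`, `matches`, and `find` are hypothetical names. It assumes MongoDB's array behaviour: a condition on a dotted path such as `ra.value` is satisfied if any entry stored for that parameter satisfies it.

```python
import operator

COMPARATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def resolve(doc, path):
    """Collect every value reachable at a dotted path, descending into lists."""
    current = [doc]
    for key in path.split("."):
        found = []
        for item in current:
            # A parameter holds a list of entries; look inside each one.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    found.append(candidate[key])
        current = found
    return current


def matches(doc, query):
    """Return True if doc satisfies every condition in query (implicit AND)."""
    for key, condition in query.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
            continue
        values = resolve(doc, key)
        if isinstance(condition, dict):
            for op, target in condition.items():
                if op == "$exists":
                    if bool(values) != bool(target):
                        return False
                elif op in COMPARATORS:
                    if not any(COMPARATORS[op](v, target) for v in values):
                        return False
                else:
                    raise ValueError(f"Unsupported operator: {op}")
        elif condition not in values:
            # Plain value: equality against any stored value
            return False
    return True


def find(docs, query):
    """Names of all documents matching the query."""
    return [d["name"] for d in docs if matches(d, query)]


docs = [
    {"name": "Gal 2", "ra": [{"value": 10.4}], "v_mag": [{"value": 20.2}]},
    {"name": "Gal 1",
     "ra": [{"value": 9.14542, "best": 1}, {"value": 999.14542, "best": 0}],
     "v_mag": [{"value": 16.2}],
     "radial_velocity": [{"value": -139.8}]},
]

print(find(docs, {"v_mag.value": {"$lt": 17}}))
print(find(docs, {"ra.value": {"$gt": 10}}))
```

Under this any-entry assumption, the second query returns both galaxies: Gal 1 qualifies through its non-best RA entry even though its best RA is below 10.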

In [5]:
# Example query with operators ($)
query = {'v_mag.value': {'$lt': 17}}
df = db.query_table(query=query)
df[['name', 'v_mag']]

name,v_mag
str5,object
Gal 1,16.2 mag


In [6]:
# Example with $exists
query = {'radial_velocity.value': {'$exists': True}}
db.query_table(query=query)[['name', 'radial_velocity']]

name,radial_velocity
str5,float64
Gal 1,-139.8


In [7]:
# Example AND query
query = {'v_mag.value': {'$gt': 10}, 'radial_velocity.value': {'$lte': -100}}
db.query_table(query=query)[['name', 'v_mag', 'radial_velocity']]

name,v_mag,radial_velocity
str5,object,float64
Gal 1,16.2 mag,-139.8


In [8]:
# Example OR query
query = {'$or': [{'v_mag.value': 16.2}, {'dec.value': -32.4}]}
df = db.query_table(query=query)
df[['name', 'ra', 'dec', 'v_mag']]

name,ra,dec,v_mag
str5,object,object,object
Gal 2,10.4 deg,-32.4 deg,20.2 mag
Gal 1,9.14542 deg,49.64667 deg,16.2 mag


In [9]:
# Example AND and OR query
query = {'ra.value': {'$gt': 10}, '$or': [{'v_mag.value': {'$lte': 21}},
                                          {'v_mag.value': {'$gte': 16}}]}
db.query_table(query=query)[['name', 'ra', 'v_mag']]

name,ra,v_mag
str5,object,object
Gal 2,10.4 deg,20.2 mag
Gal 1,9.14542 deg,16.2 mag


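Because each parameter is stored as a list, a document can hold multiple values for it. Queries appear to match against every stored value, not only the best one, which would explain why Gal 1 shows up in the previous result even though its best RA is below 10 deg. For example, consider Gal 1's values of RA: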

In [10]:
doc = db.query_db({'name': 'Gal 1'})[0]['ra']
print(doc)

[{'value': 9.14542, 'best': 1, 'reference': '', 'unit': 'deg'}
 {'value': 999.14542, 'best': 0, 'reference': 'FakeRef2019', 'unit': 'deg'}]


The `selection` parameter of the `db.query_table` method controls which reference is used for a given parameter in the table result. If it is not set, the 'best' value is used. (This behavior is anticipated to change.)

In [11]:
query = {'name': 'Gal 1'}
tab1 = db.query_table(query=query)[['name', 'ra', 'dec']]
tab2 = db.query_table(query=query, selection={'ra': 'FakeRef2019'})[['name', 'ra', 'dec']]
print(tab1) # default (best) results
print(tab2) # user-selected results

 name      ra         dec     
----- ----------- ------------
Gal 1 9.14542 deg 49.64667 deg
 name       ra          dec     
----- ------------- ------------
Gal 1 999.14542 deg 49.64667 deg
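
The selection logic can be sketched on plain dictionaries as follows. This is an illustration rather than galcat's actual code: `select_entry` is a hypothetical helper, and falling back to the best entry when the requested reference is absent is an assumption.

```python
def select_entry(entries, reference=None):
    """Pick the entry with the requested reference, else the one flagged best."""
    if reference is not None:
        for entry in entries:
            if entry.get("reference") == reference:
                return entry
    # Assumed fallback: use the entry flagged best == 1
    for entry in entries:
        if entry.get("best") == 1:
            return entry
    return None


# Gal 1's RA entries as printed above
ra_entries = [
    {"value": 9.14542, "best": 1, "reference": "", "unit": "deg"},
    {"value": 999.14542, "best": 0, "reference": "FakeRef2019", "unit": "deg"},
]

print(select_entry(ra_entries)["value"])                 # 9.14542
print(select_entry(ra_entries, "FakeRef2019")["value"])  # 999.14542
```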


## References

A reference JSON file can be provided that lists basic information (authors, journal, title, and so on) for each reference used in the database. This data can be queried in a similar fashion, and the results can be embedded in the data query results.

In [12]:
# Query references information
doc = db.query_reference({'key': 'Martin_2005_1'})[0]
print(json.dumps(doc, indent=4, sort_keys=False))

{
    "key": "Martin_2005_1",
    "id": 2,
    "year": 2005,
    "doi": "10.1111/j.1365-2966.2005.09339.x",
    "bibcode": "2005MNRAS.362..906M",
    "authors": [
        "Martin, N. F.",
        "Ibata, R. A.",
        "Conn, B. C.",
        "Lewis, G. F.",
        "Bellazzini, M.",
        "Irwin, M. J."
    ],
    "journal": "MNRAS",
    "title": "A radial velocity survey of low Galactic latitude structures - I. Kinematics of the Canis Major dwarf galaxy"
}


In [13]:
# Query with references embedded
doc = db.query_db({'name': 'Gal 1'}, embed_ref=True)[0]['ebv'][0]
print(json.dumps(doc, indent=4, sort_keys=False))

{
    "value": 0.166,
    "best": 1,
    "reference": {
        "key": "Bellazzini_2006_1",
        "id": 1,
        "year": 2006,
        "doi": "10.1111/j.1365-2966.2005.09973.x",
        "bibcode": "2006MNRAS.366..865B",
        "authors": [
            "Bellazzini, M.",
            "Ibata, R.",
            "Martin, N.",
            "Lewis, G. F.",
            "Conn, B.",
            "Irwin, M. J."
        ],
        "journal": "MNRAS",
        "title": "The core of the Canis Major galaxy as traced by red clump stars"
    }
}


**NOTE**: Embedding a reference temporarily modifies the database value when running locally (i.e., not in MongoDB). You will need to re-load the JSON file to reset this (use `db.load_file_to_db`).
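
If you post-process documents in your own code, one way to avoid this kind of in-place modification is to embed references into a copy. The sketch below is not how galcat does it; `embed_references` is a hypothetical helper that assumes references are dictionaries with a `key` field, as in the output above.

```python
import copy


def embed_references(doc, references):
    """Return a copy of doc with reference keys replaced by reference records."""
    by_key = {ref["key"]: ref for ref in references}
    result = copy.deepcopy(doc)  # leave the caller's document untouched
    for entries in result.values():
        if not isinstance(entries, list):
            continue  # e.g. the "name" field is a plain string
        for entry in entries:
            key = entry.get("reference")
            if isinstance(key, str) and key in by_key:
                entry["reference"] = by_key[key]
    return result


references = [
    {"key": "Bellazzini_2006_1", "year": 2006, "journal": "MNRAS"},
]
gal = {
    "name": "Gal 1",
    "ebv": [{"value": 0.166, "best": 1, "reference": "Bellazzini_2006_1"}],
}

embedded = embed_references(gal, references)
print(embedded["ebv"][0]["reference"]["year"])  # 2006
print(gal["ebv"][0]["reference"])               # Bellazzini_2006_1
```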

## Loading data

New objects (e.g., galaxies) can be loaded into the database using the `db.load_file_to_db` method. This is also useful for resetting data in memory to what is stored on disk.

To add new parameters to existing objects, use `db.add_data`. Any new data added to the database can be explicitly written to disk with `db.save_all(out_dir)`.

In [14]:
# Re-loading Gal 1
db.load_file_to_db('galcat/test_data/Gal_1.json')

Let's create some extra data that we want to load to Gal 1. We'll use `db.add_data` to add this to the database, but will not use `db.save_all` as we don't want to permanently store this. Note that `db.add_data` can either take a filename or a dict-like object to load.

In [15]:
doc = {
  "name": "Gal 1",
  "ra": [
    {
      "value": 42,
      "reference": "Penguin_2020_1",
      "unit": "deg"
    }
  ],
  "ebv": [
    {
      "value": 99,
      "reference": "Penguin_2020_1"
    }
  ],
  "fake_quantity": [
    {
      "value": 27,
      "error": 2,
      "reference": "Penguin_2020_1"
    }
  ]
}
db.add_data(doc, validate=False)

Data for Gal 1 has been updated. Consider running save_all() to update JSON on disk.


In [16]:
doc = db.query_db({'name': 'Gal 1'})[0]
print(doc['ra'])
print(doc['ebv'])
print(doc['fake_quantity'])

[{'value': 9.14542, 'best': 1, 'reference': '', 'unit': 'deg'}
 {'value': 999.14542, 'best': 0, 'reference': 'FakeRef2019', 'unit': 'deg'}
 {'value': 42, 'reference': 'Penguin_2020_1', 'unit': 'deg'}]
[{'value': 0.166, 'best': 1, 'reference': 'Bellazzini_2006_1'}
 {'value': 99, 'reference': 'Penguin_2020_1'}]
[{'value': 27, 'error': 2, 'reference': 'Penguin_2020_1'}]
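
As the output shows, new entries are appended to each parameter's list and unknown parameters are created. A minimal sketch of that merge on plain dictionaries follows. `merge_data` is a hypothetical helper, not the body of `db.add_data`, and the name check is an assumption.

```python
def merge_data(existing, new):
    """Append the entries in `new` to the matching parameters of `existing`."""
    if existing.get("name") != new.get("name"):
        raise ValueError("Documents describe different objects")
    for parameter, entries in new.items():
        if parameter == "name":
            continue
        # Create the parameter if it does not exist yet, then append
        existing.setdefault(parameter, []).extend(entries)
    return existing


gal1 = {
    "name": "Gal 1",
    "ra": [
        {"value": 9.14542, "best": 1, "reference": "", "unit": "deg"},
        {"value": 999.14542, "best": 0, "reference": "FakeRef2019", "unit": "deg"},
    ],
}
extra = {
    "name": "Gal 1",
    "ra": [{"value": 42, "reference": "Penguin_2020_1", "unit": "deg"}],
    "fake_quantity": [{"value": 27, "error": 2, "reference": "Penguin_2020_1"}],
}

merge_data(gal1, extra)
print(len(gal1["ra"]))        # 3
print(gal1["fake_quantity"])
```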


In [18]:
# Re-loading Gal 1
db.load_file_to_db('galcat/test_data/Gal_1.json')

## Validation

A validator exists to ensure that JSON added to the database meets some minimum criteria. It can be run against a JSON file, against a dict-like object, or against the full database.

In [19]:
from galcat.validator import Validator

# Validate full database, including checking that references are set
Validator(database=db, is_data=True, ref_check=True, verbose=True).run()

Checking Gal 2
ERROR: ra has missing references or it does not exist: {'value': 10.4, 'best': 1, 'reference': '', 'unit': 'deg'}
ERROR: dec has missing references or it does not exist: {'value': -32.4, 'best': 1, 'reference': '', 'unit': 'deg'}
ERROR: v_mag has missing references or it does not exist: {'value': 20.2, 'error_upper': 0.5, 'error_lower': 0.5, 'best': 1, 'unit': 'mag', 'reference': ''}
Checking Gal 1
ERROR: ra has missing references or it does not exist: {'value': 9.14542, 'best': 1, 'reference': '', 'unit': 'deg'}
ERROR: ra has missing references or it does not exist: {'value': 999.14542, 'best': 0, 'reference': 'FakeRef2019', 'unit': 'deg'}
ERROR: dec has missing references or it does not exist: {'value': 49.64667, 'best': 1, 'reference': '', 'unit': 'deg'}
ERROR: radial_velocity has missing references or it does not exist: {'value': -139.8, 'error_upper': 6.0, 'error_lower': 6.6, 'best': 1, 'reference': ''}
ERROR: v_mag has missing references or it does not exist: {'val

In [23]:
# Validate against dict-like object. For example, this may be a new galaxy you intend to add.
doc = {"name": "Gal 3",
       "ra": [{"value": 9.14542, "best": 1, "reference": "", "unit": "deg"}],
       "dec": [{"value": 49.64667, "best": 1, "reference": "", "unit": "FAKE UNIT"}],
       "ebv": [{"error_upper": 0.1, "best": 1, "reference": "Bellazzini_2006_1"}]}
#print(json.dumps(doc, indent=4, sort_keys=False))

Validator(database=db, db_object=doc, is_data=True, ref_check=True).run()  

ERROR: ra has missing references or it does not exist: {'value': 9.14542, 'best': 1, 'reference': '', 'unit': 'deg'}
ERROR: dec has missing references or it does not exist: {'value': 49.64667, 'best': 1, 'reference': '', 'unit': 'FAKE UNIT'}
ERROR: dec has invalid units: FAKE UNIT
ERROR: ebv has missing values/distribution: {'error_upper': 0.1, 'best': 1, 'reference': 'Bellazzini_2006_1'}
Validation complete.
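
The kinds of checks reported above can be sketched in a few lines. This is a simplified illustration, not the galcat `Validator`: `check_document` and `KNOWN_UNITS` are hypothetical, the unit whitelist is an assumption, and only a missing `value` is checked (the real validator also accepts a distribution).

```python
KNOWN_UNITS = {"deg", "mag", "arcmin"}  # assumed whitelist for illustration


def check_document(doc, require_reference=True):
    """Return a list of human-readable problems found in a document."""
    errors = []
    for parameter, entries in doc.items():
        if parameter == "name":
            continue
        for entry in entries:
            if "value" not in entry:
                errors.append(f"{parameter} has no value: {entry}")
            if require_reference and not entry.get("reference"):
                errors.append(f"{parameter} has no reference: {entry}")
            unit = entry.get("unit")
            if unit is not None and unit not in KNOWN_UNITS:
                errors.append(f"{parameter} has unknown unit: {unit}")
    return errors


gal3 = {
    "name": "Gal 3",
    "ra": [{"value": 9.14542, "best": 1, "reference": "", "unit": "deg"}],
    "dec": [{"value": 49.64667, "best": 1, "reference": "", "unit": "FAKE UNIT"}],
    "ebv": [{"error_upper": 0.1, "best": 1, "reference": "Bellazzini_2006_1"}],
}

for problem in check_document(gal3):
    print("ERROR:", problem)
```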
