## Indexing

Indexing is used to make searches more efficient. When there are no indices, any search has to go through every document in the collection in order to find the specified results. If a search uses indexed fields, it can perform more efficiently by going only through the indexed field(s), and then retrieving only the matching documents without having to go through the full collection. However, not everything needs to be indexed, since the storage usage can grow substantially when too many indices are created. It is generally recommended to limit the number of indices to the most commonly used fields.

For the examples in this tutorial, neither the gain in search speed nor the extra storage required will be noticeable, but for large collections this has to be taken into account.
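To make the trade-off more concrete, here is a minimal pure-Python sketch of the idea. It is only a toy model (a sorted list searched with `bisect`), not how MongoDB implements its indices, and the documents and the helper names `full_scan`, `build_index` and `indexed_search` are made up for illustration:

```python
from bisect import bisect_left, bisect_right

# Toy "collection": a list of documents (values are made up for illustration)
collection = [
    {'_id': 'a', 'firstmjd': 58366.4},
    {'_id': 'b', 'firstmjd': 58450.4},
    {'_id': 'c', 'firstmjd': 58288.4},
    {'_id': 'd', 'firstmjd': 58363.5},
]


def full_scan(docs, field, low, high):
    """Without an index: every document has to be inspected."""
    return [doc['_id'] for doc in docs if low <= doc[field] <= high]


def build_index(docs, field):
    """Toy index: (value, position) pairs sorted by value (this is the extra storage)."""
    return sorted((doc[field], pos) for pos, doc in enumerate(docs))


def indexed_search(docs, index, low, high):
    """With an index: binary search for the range, then fetch only the matches."""
    start = bisect_left(index, (low, -1))
    stop = bisect_right(index, (high, len(docs)))
    return [docs[pos]['_id'] for _, pos in index[start:stop]]


index = build_index(collection, 'firstmjd')
print(full_scan(collection, 'firstmjd', 58300, 58400))
print(indexed_search(collection, index, 58300, 58400))
```

Both searches find the same documents, but the indexed one only touches the entries inside the requested range, at the cost of keeping the additional sorted structure around.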

Every collection in MongoDB comes indexed by `_id` by default, but new indices can be created at any time and there are different kinds of indices.

In this section we'll be using a sample of ALeRCE-like objects that should be loaded into the database if the instructions in `README.md` have been followed. An example document in this collection looks like this:

In [2]:
from pprint import pprint
from pymongo import MongoClient

client = MongoClient(host='localhost', port=27017, username='mongo', password='mongo')

objects = client.alerce.objects  # This is the collection we'll be using

pprint(objects.find_one())

{'_id': 'AL17kydexvudyfzwq',
 'e_dec': 6.98509819165943e-05,
 'e_ra': 0.000167809096646232,
 'firstmjd': 58366.4399768999,
 'lastmjd': 59542.2826156998,
 'loc': {'coordinates': [-112.883127892771, 52.2932219572289], 'type': 'Point'},
 'meandec': 52.2932219572289,
 'meanra': 67.1168721072289,
 'ndet': 166.0,
 'oid': ['ZTF17aaaacji'],
 'probabilities': [{'class_name': 'AGN',
                    'classifier_name': 'stamp_classifier',
                    'classifier_version': 'stamp_classifier_1.0.0',
                    'probability': 0.086826704,
                    'ranking': 2.0},
                   {'class_name': 'asteroid',
                    'classifier_name': 'stamp_classifier',
                    'classifier_version': 'stamp_classifier_1.0.0',
                    'probability': 0.059574142,
                    'ranking': 4.0},
                   {'class_name': 'bogus',
                    'classifier_name': 'stamp_classifier',
                    'classifier_version': 'stamp_cla

**Note:** The objects in the actual ALeRCE database do not exactly follow this structure. We're using only a simplified version here to show these concepts.

### Creating basic indices

To create a simple index over a field, the structure is similar to what we saw for the `sort` option in the `find` method. It should be a list of pairs with a direction given by `1` or `-1` (ascending or descending, respectively).

In [None]:
from pymongo import IndexModel

objects.create_index([('firstmjd', 1)])

The return value corresponds to the name given to the index in the collection.
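When no name is given, the default name follows a simple pattern: each field is joined to its direction (or index type) with an underscore, and the pairs are joined together the same way. The helper below is a hypothetical re-implementation of that naming pattern, written only for illustration (it is not part of the PyMongo API):

```python
def default_index_name(keys):
    """Build the default-style index name from a list of (field, direction) pairs."""
    return '_'.join(f'{field}_{direction}' for field, direction in keys)


print(default_index_name([('firstmjd', 1)]))
print(default_index_name([('probabilities.probability', -1)]))
```

This is handy for knowing which name to pass to `drop_index` without having to look it up first.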

In [None]:
objects.index_information()

The indices here are given by index name and have information about the field, order and type of index (the `v` field refers to the version).

One can also drop an index based on the index name:

In [None]:
objects.drop_index('firstmjd_1')
objects.index_information()

At creation, an index can also be given a custom name, instead of relying on MongoDB for the name:

In [None]:
objects.create_index([('firstmjd', 1)], name='first_detection_date')

The field `probabilities` is an array of nested documents. If we want to create an index for a field inside it, we can use dot notation:

In [None]:
objects.create_index([('probabilities.probability', -1)], name='probs')

As for other common options, it is possible to demand that the elements of the index are unique:

In [None]:
objects.create_index([('oid', 1)], unique=True)

It is also possible to create a partial index, which only indexes documents that fulfill a given condition:

In [None]:
objects.create_index([('lastmjd', 1)], partialFilterExpression={'ndet': {'$gt': 100}})

The above creates an index over `lastmjd`, but only for documents with `ndet` greater than 100.
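As a rough mental model, a partial index is a regular index built over a filtered subset of the collection. The following pure-Python sketch illustrates this with made-up documents. The helper `build_partial_index` is hypothetical and does not reflect MongoDB internals:

```python
# Toy documents (values made up for illustration)
docs = [
    {'_id': 'a', 'lastmjd': 59542.3, 'ndet': 166},
    {'_id': 'b', 'lastmjd': 59530.4, 'ndet': 12},
    {'_id': 'c', 'lastmjd': 59520.2, 'ndet': 345},
    {'_id': 'd', 'lastmjd': 59540.3, 'ndet': 100},
]


def build_partial_index(docs, field, condition):
    """Toy index: sorted (value, _id) pairs, only for documents passing the condition."""
    return sorted((doc[field], doc['_id']) for doc in docs if condition(doc))


full_index = build_partial_index(docs, 'lastmjd', lambda doc: True)
partial_index = build_partial_index(docs, 'lastmjd', lambda doc: doc['ndet'] > 100)

print(len(full_index), len(partial_index))
print(partial_index)
```

The partial index is smaller, but it can only help with searches restricted to documents that satisfy the condition. To the best of our knowledge, MongoDB only considers a partial index when the query filter itself guarantees that condition (here, something like `{'ndet': {'$gt': 100}}` as part of the query).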

For now, we'll just remove all indices (note that this will never remove the index over `_id`):

In [None]:
objects.drop_indexes()

#### Digression: Special ordering and coordinates

Some fields can use a special kind of index, such as one for locations on a sphere. This is why the field `loc` has the form it does:

In [4]:
doc = objects.find_one()

print(f'loc: {doc["loc"]}')
print(f'ra: {doc["meanra"]}')
print(f'dec: {doc["meandec"]}')

loc: {'type': 'Point', 'coordinates': [-112.883127892771, 52.2932219572289]}
ra: 67.1168721072289
dec: 52.2932219572289


The first value of `coordinates` inside `loc` corresponds to RA minus 180 degrees, while Dec remains the same. This is because the format with `type` and `coordinates` is defined by GeoJSON, which was first established for terrestrial coordinates, where longitude runs from -180 to 180 degrees. This shift allows indexing over the sphere and performing cone searches over the coordinates (which we'll see later on). For this, the index cannot just be ascending or descending and has to use the value `2dsphere`:

In [6]:
objects.create_index([('loc', '2dsphere')])

'loc_2dsphere'
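The conversion between RA/Dec and the GeoJSON point stored in `loc` can be sketched as follows. The helper names `to_geojson_point` and `from_geojson_point` are our own, chosen for illustration, and RA is assumed to be given in degrees in the range from 0 to 360:

```python
def to_geojson_point(ra, dec):
    """Shift RA from [0, 360) to the GeoJSON longitude range [-180, 180)."""
    return {'type': 'Point', 'coordinates': [ra - 180.0, dec]}


def from_geojson_point(loc):
    """Recover (RA, Dec) in degrees from a GeoJSON point built as above."""
    lon, lat = loc['coordinates']
    return lon + 180.0, lat


point = to_geojson_point(67.1168721072289, 52.2932219572289)
print(point)
print(from_geojson_point(point))
```

Up to floating point rounding, this reproduces the `loc` value of the example document shown above.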

### Compound indices

It is also possible to create an index over multiple fields at once. The entries are sorted by the first field, with each subsequent field breaking ties in the previous ones. This can be very useful depending on the type of search. For instance, we'll create an index over the classifier name and version, so that entries are sorted first by classifier and then by version within each classifier:

In [5]:
objects.create_index([('probabilities.classifier_name', 1), ('probabilities.classifier_version', 1)])  # The list now has a second element (the secondary index)

'probabilities.classifier_name_1_probabilities.classifier_version_1'
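The ordering of a compound index with both fields ascending is analogous to sorting tuples in Python: the first element decides the order and the second one only matters when the first is equal. A small sketch, where the last pair is a hypothetical older version added only to show the tie-breaking:

```python
# (classifier_name, classifier_version) pairs
pairs = [
    ('stamp_classifier', 'stamp_classifier_1.0.0'),
    ('lc_classifier', 'hierarchical_rf_1.1.0'),
    ('lc_classifier_top', 'hierarchical_rf_1.1.0'),
    ('lc_classifier', 'hierarchical_rf_1.0.0'),  # hypothetical, for illustration only
]

# Tuples compare element by element, like the fields of a compound index
ordered = sorted(pairs)

for name, version in ordered:
    print(name, version)
```

Because entries sharing the same `classifier_name` end up next to each other, a search for a given classifier (optionally narrowed down by version) only has to look at one contiguous block of the index.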

Again we'll clean up all the indices:

In [6]:
objects.drop_indexes()

## Projections (selecting fields for output)

The `find` and `find_one` methods have some additional functionality that we'll discuss now. Besides the dictionary with the query filters, a second argument can be passed with another dictionary for "projection". These projections allow for different manipulations of the output documents, without modifying them on the database. For instance, if only a couple of fields are of interest in the output:

In [7]:
docs = objects.find({'ndet': {'$gte': 100}}, {'firstmjd': True, 'lastmjd': True})

for doc in docs:
    print(doc)

{'_id': 'AL17kydexvudyfzwq', 'lastmjd': 59542.2826156998, 'firstmjd': 58366.4399768999}
{'_id': 'AL17kyhickkapibwi', 'lastmjd': 59530.3737846999, 'firstmjd': 58450.4007060002}
{'_id': 'AL17msbqengdzwbtk', 'lastmjd': 59520.1848958, 'firstmjd': 58288.4348148}
{'_id': 'AL17kvgayignexklo', 'lastmjd': 59540.3450694, 'firstmjd': 58363.4715161999}
{'_id': 'AL17kykzgxdgqsntc', 'lastmjd': 59538.2855556002, 'firstmjd': 58348.4711342999}
{'_id': 'AL17ktitbgrfqqhkq', 'lastmjd': 59540.4023263999, 'firstmjd': 58336.4893518998}
{'_id': 'AL17kvzjrikyiplhk', 'lastmjd': 59540.4116898002, 'firstmjd': 58338.4519213}
{'_id': 'AL17lasirdapyanxs', 'lastmjd': 59542.3199421, 'firstmjd': 58343.4893749999}
{'_id': 'AL17ldcrbgfdyppbw', 'lastmjd': 59550.375, 'firstmjd': 58423.4193402999}
{'_id': 'AL17laogqkaumzppg', 'lastmjd': 59542.3199421, 'firstmjd': 58342.4907986}
{'_id': 'AL17ldheheiulmpbw', 'lastmjd': 59550.3538079001, 'firstmjd': 58423.3785068998}
{'_id': 'AL17ldghniumrlnpo', 'lastmjd': 59530.4050925998, 'f

Here we've selected only objects with at least 100 detections, but are only interested in the first and last MJD. As you can see, even if not explicitly selected, the `_id` field will be included by default. This behaviour can be changed:

In [8]:
docs = objects.find({'ndet': {'$gte': 100}}, {'firstmjd': True, 'lastmjd': True, '_id': False})

for doc in docs:
    print(doc)

{'lastmjd': 59542.2826156998, 'firstmjd': 58366.4399768999}
{'lastmjd': 59530.3737846999, 'firstmjd': 58450.4007060002}
{'lastmjd': 59520.1848958, 'firstmjd': 58288.4348148}
{'lastmjd': 59540.3450694, 'firstmjd': 58363.4715161999}
{'lastmjd': 59538.2855556002, 'firstmjd': 58348.4711342999}
{'lastmjd': 59540.4023263999, 'firstmjd': 58336.4893518998}
{'lastmjd': 59540.4116898002, 'firstmjd': 58338.4519213}
{'lastmjd': 59542.3199421, 'firstmjd': 58343.4893749999}
{'lastmjd': 59550.375, 'firstmjd': 58423.4193402999}
{'lastmjd': 59542.3199421, 'firstmjd': 58342.4907986}
{'lastmjd': 59550.3538079001, 'firstmjd': 58423.3785068998}
{'lastmjd': 59530.4050925998, 'firstmjd': 58357.4957869998}
{'lastmjd': 59498.3893634002, 'firstmjd': 58343.4893749999}
{'lastmjd': 59550.2775809998, 'firstmjd': 58443.3156249998}
{'lastmjd': 59540.4386343001, 'firstmjd': 58443.2944676001}
{'lastmjd': 59532.4202777999, 'firstmjd': 58370.515625}
{'lastmjd': 59542.1187153002, 'firstmjd': 58346.3351968001}
{'lastmjd': 

If at least one of the projected fields is explicitly selected, all the others will be implicitly removed. The opposite is also true:

In [10]:
docs = objects.find({'ndet': {'$gte': 100}}, {'probabilities': False, '_id': False})

for doc in docs:
    print(doc)

{'oid': ['ZTF17aaaacji'], 'lastmjd': 59542.2826156998, 'firstmjd': 58366.4399768999, 'ndet': 166.0, 'loc': {'type': 'Point', 'coordinates': [-112.883127892771, 52.2932219572289]}, 'meanra': 67.1168721072289, 'meandec': 52.2932219572289, 'e_ra': 0.000167809096646232, 'e_dec': 6.98509819165943e-05, 'tid': ['ZTF']}
{'oid': ['ZTF17aaaacpo'], 'lastmjd': 59530.3737846999, 'firstmjd': 58450.4007060002, 'ndet': 226.0, 'loc': {'type': 'Point', 'coordinates': [-111.458553317257, 7.32192741061947]}, 'meanra': 68.5414466827434, 'meandec': 7.32192741061947, 'e_ra': 4.63841947555995e-05, 'e_dec': 5.73385615644401e-05, 'tid': ['ZTF']}
{'oid': ['ZTF17aaaajgn'], 'lastmjd': 59520.1848958, 'firstmjd': 58288.4348148, 'ndet': 345.0, 'loc': {'type': 'Point', 'coordinates': [138.677740238551, 48.5400606710145]}, 'meanra': 318.677740238551, 'meandec': 48.5400606710145, 'e_ra': 8.30365847738501e-05, 'e_dec': 8.06191525756549e-05, 'tid': ['ZTF']}
{'oid': ['ZTF17aaaampi'], 'lastmjd': 59540.3450694, 'firstmjd': 5

The above includes every field, except for `probabilities` and `_id`. Note that it is not possible to mix inclusion and exclusion of fields, except for the case of `_id`:

In [12]:
# Will fail due to mixing inclusion and exclusion
docs = objects.find({'ndet': {'$gte': 100}}, {'probabilities': False, 'firstmjd': True})
# However, the error is only raised here, when the cursor is first iterated
for doc in docs:
    print(doc)

OperationFailure: Cannot do inclusion on field firstmjd in exclusion projection, full error: {'ok': 0.0, 'errmsg': 'Cannot do inclusion on field firstmjd in exclusion projection', 'code': 31253, 'codeName': 'Location31253'}
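The inclusion and exclusion rules described so far can be summarized in a small pure-Python sketch. This is a simplified model written for illustration: it handles only top-level fields with plain `True`/`False` values (no dot notation, no projection operators), and the function names `is_inclusion` and `project` are our own:

```python
def is_inclusion(projection):
    """Decide the projection mode, refusing to mix modes (except for `_id`)."""
    flags = {bool(value) for field, value in projection.items() if field != '_id'}
    if len(flags) > 1:
        raise ValueError('Cannot mix inclusion and exclusion in a projection')
    if flags:
        return flags.pop()
    # Only `_id` (or nothing) was specified
    return bool(projection.get('_id', False))


def project(doc, projection):
    """Apply a top-level projection to a single document (a dict)."""
    keep_id = bool(projection.get('_id', True))
    if is_inclusion(projection):
        # Inclusion: keep only the listed fields
        fields = [f for f in doc if f != '_id' and projection.get(f)]
    else:
        # Exclusion: keep everything that was not switched off
        fields = [f for f in doc if f != '_id' and projection.get(f, True)]
    if keep_id and '_id' in doc:
        fields.insert(0, '_id')
    return {f: doc[f] for f in fields}


doc = {'_id': 'AL17kydexvudyfzwq', 'firstmjd': 58366.4399768999,
       'lastmjd': 59542.2826156998, 'ndet': 166.0}

print(project(doc, {'firstmjd': True, 'lastmjd': True}))
print(project(doc, {'ndet': False, '_id': False}))
```

Unlike the cursor above, this sketch complains about a mixed projection as soon as `project` is called, since there is no server round trip involved.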

The projections can also limit the number of elements returned from an array:

In [17]:
docs = objects.find({'probabilities.ranking': 1}, {'probabilities.$': True, '_id': False})

for doc in docs:
    print(doc)

{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.7186312, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.7214016, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'lc_classifier_top', 'classifier_version': 'hierarchical_rf_1.1.0', 'class_name': 'Periodic', 'probability': 0.926, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'lc_classifier_transient', 'classifier_version': 'hierarchical_rf_1.1.0', 'class_name': 'SNII', 'probability': 0.298, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.77150106, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'lc_classifier', 'classifier_version': 'hierarchical_rf_1.1.0', 'class_name': 'CV,Nova', 'probabili

The `$` operator seen above will only return the first element of the array that matches the query, even if more than one element does. For this reason, it requires the array to actually be used within the query.

For more control over the returned element, there is also the `$elemMatch` projection operator:

In [19]:
docs = objects.find(
    {
        'ndet': {'$gte': 100}
    }, 
    {
        '_id': False,
        'probabilities': {
            '$elemMatch': {  # The value for $elemMatch has the form of a query, over the fields inside the elements
                'classifier_name': 'stamp_classifier',
                'ranking': 1
            }
        },
    })

for doc in docs:
    print(doc)

{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.7186312, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.7214016, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.67333895, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.77150106, 'ranking': 1.0}]}
{}
{}
{}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probability': 0.7479756, 'ranking': 1.0}]}
{'probabilities': [{'classifier_name': 'stamp_classifier', 'classifier_version': 'stamp_classifier_1.0.0', 'class_name': 'VS', 'probab

Note that the objects where no element matches the requirements for the projection are still returned, but are now empty. This is because they still fulfill the main query in the find command. Also, in this case there is no requirement for the main query to include the array used in the projection.
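The behaviour of the `$elemMatch` projection seen above can be mimicked with a short pure-Python sketch. It assumes that only equality conditions are used and that nothing else is projected. The helper names `elem_match` and `project_elem_match` are hypothetical, and the sample documents are abridged versions of the outputs shown earlier:

```python
def elem_match(array, conditions):
    """Return a list holding the first element that meets all conditions, if any."""
    for element in array:
        if all(element.get(key) == value for key, value in conditions.items()):
            return [element]
    return []


def project_elem_match(doc, field, conditions):
    """Keep only the first matching element of `field`; drop the field if none match."""
    matched = elem_match(doc.get(field, []), conditions)
    return {field: matched} if matched else {}


with_match = {'probabilities': [
    {'classifier_name': 'stamp_classifier', 'class_name': 'AGN', 'ranking': 2.0},
    {'classifier_name': 'stamp_classifier', 'class_name': 'VS', 'ranking': 1.0},
]}
without_match = {'probabilities': [
    {'classifier_name': 'lc_classifier', 'class_name': 'Periodic', 'ranking': 1.0},
]}

conditions = {'classifier_name': 'stamp_classifier', 'ranking': 1}
print(project_elem_match(with_match, 'probabilities', conditions))
print(project_elem_match(without_match, 'probabilities', conditions))
```

The second document produces an empty dictionary, which mirrors the `{}` entries in the output above: the document passed the main query, but none of its array elements met the projection conditions.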