## Indexing

Indexing is used to make searches more efficient. Without an index, a search has to go through every document in the collection to find the matching results. If a search uses indexed fields, it can go through the index alone and then retrieve only the matching documents, without scanning the full collection. However, not everything needs to be indexed, since storage usage can grow substantially when too many indices are created. It is generally recommended to limit the indices to the most commonly queried fields.
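To make the difference concrete, here is a minimal pure-Python sketch contrasting a full scan with a lookup through a sorted list of keys. It is only an illustration of the idea, not how MongoDB implements its indices, and the documents and the helpers `full_scan`, `build_index` and `index_scan` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical documents; only the field names mirror the tutorial's collection
docs = [
    {'oid': 'obj_a', 'firstmjd': 58350.0},
    {'oid': 'obj_b', 'firstmjd': 58420.5},
    {'oid': 'obj_c', 'firstmjd': 58100.2},
    {'oid': 'obj_d', 'firstmjd': 58600.9},
]


def full_scan(documents, field, minimum):
    """Check every document; returns (matches, number of documents examined)."""
    matches = [doc for doc in documents if doc[field] >= minimum]
    return matches, len(documents)


def build_index(documents, field):
    """Sorted (value, position) pairs, standing in for an ascending index."""
    return sorted((doc[field], pos) for pos, doc in enumerate(documents))


def index_scan(documents, index, minimum):
    """Jump to the first key >= minimum and fetch only the matching documents."""
    start = bisect_left(index, (minimum,))
    matches = [documents[pos] for _, pos in index[start:]]
    return matches, len(matches)


index = build_index(docs, 'firstmjd')
scan_matches, scan_examined = full_scan(docs, 'firstmjd', 58400)
index_matches, index_examined = index_scan(docs, index, 58400)

print(f'Full scan examined : {scan_examined} documents')
print(f'Index scan examined: {index_examined} documents')
```

Both approaches return the same two documents, but the sorted list lets the search skip everything below the threshold instead of checking each document.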

For the examples in this tutorial, neither the gain in search speed nor the extra storage required will be noticeable, but for large collections this has to be taken into account.

Every collection in MongoDB comes indexed by `_id` by default, but new indices can be created at any time and there are different kinds of indices.

**Note:** If your user doesn't have write permission, you won't be able to create your own indices, but the gains when doing queries over existing indices will still be noticeable.

In this section we'll be using a sample of ALeRCE-like objects that should be loaded into the database if the instructions in `README.md` have been followed. Here is an example document from this collection:

In [None]:
from pprint import pprint
from pymongo import MongoClient

client = MongoClient(host='localhost', port=27017, username='mongo', password='mongo')

objects = client.alerce.objects  # This is the collection we'll be using

pprint(objects.find_one())

**Note:** The objects in the actual ALeRCE database do not exactly follow this structure. We're using only a simplified version here to show these concepts.

### Creating basic indices

To create a simple index over a field, the structure is similar to what we saw for the `sort` option in the `find` method. It should be a list of pairs with a direction given by `1` or `-1` (ascending or descending, respectively).

In [None]:
from pymongo import IndexModel

objects.create_index([('firstmjd', 1)])

The return value corresponds to the name given to the index in the collection.

In [None]:
objects.index_information()

The indices here are given by index name and have information about the field, order and type of index (the `v` field refers to the version of indexing).

When performing queries, it is possible to check if indices are being used. For this, there is the `explain` method for `Cursor`, which returns a dictionary with information regarding the query:

In [None]:
info = objects.find({'firstmjd': {'$gte': 58400}}).explain()
exec_stats = info['executionStats']

print(f'Docs returned: {exec_stats["nReturned"]}')
print(f'Keys checked : {exec_stats["totalKeysExamined"]}')
print(f'Docs checked : {exec_stats["totalDocsExamined"]}')
print(f'Time (ms)    : {exec_stats["executionTimeMillis"]}')

There is a lot more information in the output of `explain`, but we'll focus on the fields relevant for this section (all within `executionStats`):

* `nReturned`: Number of documents returned
* `totalKeysExamined`: Number of index entries (keys) checked during the query
* `totalDocsExamined`: Number of documents checked during the query
* `executionTimeMillis`: Execution time in milliseconds. This is always an integer and is rounded
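As a convenience, the four values listed above can be condensed by a small helper. The sketch below is an assumption-laden illustration: `summarize_explain` is a hypothetical function, the input is a hand-written stand-in for the dictionary returned by `explain` (with made-up numbers), and the full-scan flag is only a rough heuristic.

```python
def summarize_explain(info):
    """Condense the `executionStats` of an explain output into a small dictionary."""
    stats = info['executionStats']
    keys = stats['totalKeysExamined']
    examined = stats['totalDocsExamined']
    return {
        'returned': stats['nReturned'],
        'keys_examined': keys,
        'docs_examined': examined,
        'time_ms': stats['executionTimeMillis'],
        # Heuristic only: documents were read without touching any index key
        'looks_like_full_scan': keys == 0 and examined > 0,
    }


# Hand-written stand-in for the output of Cursor.explain(); numbers are made up
fake_info = {
    'executionStats': {
        'nReturned': 12,
        'totalKeysExamined': 0,
        'totalDocsExamined': 500,
        'executionTimeMillis': 0,
    }
}

summary = summarize_explain(fake_info)
print(summary)
```

With a real query, the same helper could be applied to the result of `objects.find(...).explain()`.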

For comparison, let's do a query over a non-indexed field:

In [None]:
info = objects.find({'lastmjd': {'$gte': 59550}}).explain()
exec_stats = info['executionStats']

print(f'Docs returned: {exec_stats["nReturned"]}')
print(f'Keys checked : {exec_stats["totalKeysExamined"]}')
print(f'Docs checked : {exec_stats["totalDocsExamined"]}')
print(f'Time (ms)    : {exec_stats["executionTimeMillis"]}')

Even though it actually returns fewer documents, the query checks all documents in the collection without checking any key (as there is no index over `lastmjd`). In this case the execution time is still below a millisecond, but for large collections such full scans can make a big difference.

When using indices, the number of keys checked is usually of the same order as the number of returned documents, although this can vary depending on the query and the ordering. Normally, the query also checks only the documents that it needs to return, which makes an index the most efficient option most of the time. The only exception is for queries that return most of the collection, as then it will scan the index and then the documents themselves, but the losses here are generally minor compared with the gains over more restricted queries.

One can also drop an index based on the index name:

In [None]:
objects.drop_index('firstmjd_1')
objects.index_information()

At creation, an index can also be given a custom name, instead of relying on MongoDB for the name:

In [None]:
objects.create_index([('firstmjd', 1)], name='first_detection_date')

The field `probabilities` is an array of nested documents. To create an index over a field inside it, we can use dot notation:

In [None]:
objects.create_index([('probabilities.probability', -1)], name='probs')

As for other common options, it is possible to demand that the elements of the index are unique:

In [None]:
objects.create_index([('oid', 1)], unique=True)

It is also possible to create a partial index, which only indexes documents that fulfill a given condition:

In [None]:
objects.create_index([('lastmjd', 1)], partialFilterExpression={'ndet': {'$gt': 100}})

The above creates an index over `lastmjd`, but only for documents with `ndet` greater than 100.
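The effect of the filter expression can be pictured with a pure-Python sketch, in which a sorted list of keys is built only from the documents that pass the condition. The documents and the `build_partial_index` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical documents; only the field names follow the tutorial
docs = [
    {'oid': 'obj_a', 'lastmjd': 59500.0, 'ndet': 250},
    {'oid': 'obj_b', 'lastmjd': 59610.3, 'ndet': 12},
    {'oid': 'obj_c', 'lastmjd': 59580.7, 'ndet': 101},
    {'oid': 'obj_d', 'lastmjd': 59400.1, 'ndet': 100},
]


def build_partial_index(documents, field, condition):
    """Sorted (value, position) pairs, only for documents passing `condition`."""
    return sorted(
        (doc[field], pos)
        for pos, doc in enumerate(documents)
        if condition(doc)
    )


# Mirrors partialFilterExpression={'ndet': {'$gt': 100}}
partial = build_partial_index(docs, 'lastmjd', lambda doc: doc['ndet'] > 100)
indexed_oids = [docs[pos]['oid'] for _, pos in partial]
print(indexed_oids)
```

Only two of the four documents end up in the index, which keeps it small. The trade-off is that a query can only benefit from such an index when its own filter is restricted to documents that satisfy the condition.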

For now, we'll just remove all indices (note that this will never remove the index over `_id`):

In [None]:
objects.drop_indexes()

### Geospatial indexing

Some fields can use special kinds of index, such as one for locations on a sphere. This is why the field `loc` has the form it does:

In [None]:
doc = objects.find_one()

print(f'loc: {doc["loc"]}')
print(f'RA: {doc["meanra"]}; RA - 180: {doc["meanra"] - 180}')
print(f'Dec: {doc["meandec"]}')

The first value of `coordinates` inside `loc` corresponds to RA minus 180, while Dec remains the same. This is because the format with `type` and `coordinates` is defined by [GeoJSON](https://www.mongodb.com/docs/manual/reference/geojson/) and is normally used for longitude/latitude coordinates, with longitude between -180 and 180 degrees (which is why we need to use RA minus 180). Using the GeoJSON notation allows us to index over the sphere and perform cone-searches over the coordinates (which we'll see later on). For this, the index cannot just be ascending or descending and has to use the value `2dsphere`:

In [None]:
objects.create_index([('loc', '2dsphere')])
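The conversion from RA and Dec to the GeoJSON form can be sketched as a small helper. The function name `to_geojson_point` and the range checks are assumptions made for this illustration, not part of the tutorial's loading scripts.

```python
def to_geojson_point(ra, dec):
    """Build a GeoJSON point from RA and Dec, both in degrees.

    RA in [0, 360] is shifted to a longitude in [-180, 180].
    """
    if not 0 <= ra <= 360:
        raise ValueError(f'RA must be within [0, 360] degrees, got {ra}')
    if not -90 <= dec <= 90:
        raise ValueError(f'Dec must be within [-90, 90] degrees, got {dec}')
    return {'type': 'Point', 'coordinates': [ra - 180, dec]}


loc = to_geojson_point(67.5, 52.25)
print(loc)
```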

### Compound indices

It is also possible to create indices over multiple fields at a time. The entries are sorted by the first field, with the following fields used in order to break ties. This can be very useful depending on the type of search. For instance, we'll create an index over the classifier name and version, so that entries are sorted first by classifier and then by version within each classifier (for each document):

In [None]:
objects.create_index([('probabilities.classifier_name', 1), ('probabilities.classifier_version', 1)])  # The list now has a second element (the secondary index)
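The ordering of a compound ascending index behaves much like sorting tuples in Python. The following sketch uses invented classifier names and versions purely to show that ordering; note that the versions are compared as plain strings here.

```python
# Hypothetical (classifier_name, classifier_version) pairs
entries = [
    ('stamp_classifier', '1.0.4'),
    ('lc_classifier', '2.0.0'),
    ('stamp_classifier', '1.0.0'),
    ('lc_classifier', '1.1.0'),
]

# Tuples compare element by element, like a compound ascending index:
# first by name, then by version among entries sharing the same name
ordered = sorted(entries)

for name, version in ordered:
    print(name, version)
```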

Again we'll clean up all the indices:

In [None]:
objects.drop_indexes()

## Projections (selecting fields for output)

The `find` and `find_one` methods have some additional functionality that we'll discuss now. Besides the dictionary with the query filters, a second argument can be passed with another dictionary for "projection". These projections allow for different manipulations of the output documents, without modifying them in the database. For instance, if only a couple of fields are of interest in the output:

In [None]:
docs = objects.find({'ndet': {'$gte': 100}}, {'firstmjd': True, 'lastmjd': True})

for doc in docs:
    print(doc)

Here we've selected only objects with at least 100 detections, but we are only interested in the first and last MJD. As you can see, even if not explicitly selected, the `_id` field is included by default. This behaviour can be changed:

In [None]:
docs = objects.find({'ndet': {'$gte': 100}}, {'firstmjd': True, 'lastmjd': True, '_id': False})

for doc in docs:
    print(doc)

If at least one of the projected fields is explicitly selected, all the others will be implicitly removed. The opposite is also true:

In [None]:
docs = objects.find({'ndet': {'$gte': 100}}, {'probabilities': False, '_id': False})

for doc in docs:
    print(doc)

The above includes every field, except for `probabilities` and `_id`. Note that it is not possible to mix inclusion and exclusion of fields, except for the case of `_id`:

In [None]:
# Will fail due to mixing inclusion and exclusion
docs = objects.find({'ndet': {'$gte': 100}}, {'probabilities': False, 'firstmjd': True})
# However, it will only fail at this stage
for doc in docs:
    print(doc)
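Since the error only shows up when the cursor is consumed, it can be handy to check a projection beforehand. The helper below is a hypothetical, simplified sketch: it only understands boolean (or `0`/`1`) flags and does not handle projection operators or expressions.

```python
def projection_mode(projection):
    """Return 'inclusion' or 'exclusion' for a projection made of boolean flags.

    Raises ValueError when both are mixed. The `_id` field is ignored in the
    check, since it is the one field allowed to break the rule.
    """
    flags = {bool(value) for field, value in projection.items() if field != '_id'}
    if flags == {True, False}:
        raise ValueError('Cannot mix inclusion and exclusion in a projection')
    if not flags:
        # Only `_id` (or nothing) was given: decide from the `_id` flag itself
        return 'inclusion' if projection.get('_id') else 'exclusion'
    return 'inclusion' if flags == {True} else 'exclusion'


print(projection_mode({'firstmjd': True, 'lastmjd': True, '_id': False}))
print(projection_mode({'probabilities': False, '_id': False}))

try:
    projection_mode({'probabilities': False, 'firstmjd': True})
except ValueError as error:
    print(f'Invalid projection: {error}')
```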

It is also possible to rename fields, using the new name as key and the old name as value (preceded by `$`):

In [None]:
docs = objects.find({'ndet': {'$gte': 100}}, {'detections': '$ndet', 'firstmjd': True})

# Renamed ndet to detections
for doc in docs:
    print(doc)

It is also possible to project embedded documents:

In [None]:
docs = objects.find({'ndet': {'$gte': 100}}, {'probability': '$probabilities.probability', '_id': False})

# The probabilities array is reduced to the probability field of its elements
for doc in docs:
    print(doc)

The projections can also limit the number of elements returned from an array:

In [None]:
docs = objects.find({'probabilities.ranking': 1}, {'probabilities.$': True, '_id': False})

for doc in docs:
    print(doc)

The `$` operator seen above will only return the first element of the array that matches the query, even if more than one element does. For this reason, the array must actually be used within the query.

For more control over the returned element, there is also the `$elemMatch` projection operator:

In [None]:
docs = objects.find(
    {
        'ndet': {'$gte': 100}
    }, 
    {
        '_id': False,
        'probabilities': {
            '$elemMatch': {  # The value for $elemMatch has the form of a query, over the fields inside the elements
                'classifier_name': 'stamp_classifier',
                'ranking': 1
            }
        },
    })

for doc in docs:
    print(doc)

**Note:** If more than one element of the array meets the `$elemMatch` criteria, only the first match will be returned.

The objects where no element matches the requirements of the projection are still returned, but as empty documents. This is because they still fulfill the main query in the `find` call. Also, in this case the main query does not need to include the array used in the projection.

In order to retrieve all matching elements, it is better to use `$filter`:

In [None]:
docs = objects.find(
    {
        'ndet': {'$gte': 100}
    }, 
    {
        '_id': False,
        'probs': {  # This is the name of the output array (can be anything)
            '$filter': {
                'input': '$probabilities',  # This is the name of the input array
                'cond': {  # The condition that needs to match
                    '$and': [
                        {'$eq': ['$$this.ranking', 1]},
                        {'$eq': ['$$this.classifier_name', 'stamp_classifier']}
                    ]
                }
            }
        },
    })

for doc in docs:
    print(doc)

**Note:** The `$filter` operator was introduced in MongoDB 3.2, but using it (or any other aggregation expression) inside a `find` projection requires MongoDB 4.4 or above.

The projection is different this time. The condition (`cond`) for `$filter` must be a single expression, with an operator as the key, which is why we are explicitly using `$and`. Additionally, the conditions within `$and` are still passed as a list of dictionaries, but now the operator is the key and the value is a two-element list, with the field in the first position and the value to compare against in the second.
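To show how such a condition is read, here is a tiny pure-Python evaluator that understands only `$and`, `$eq` and `$$this.<field>`. It is a hypothetical sketch for illustration (the helpers `resolve`, `evaluate` and `filter_array` and the sample values are invented) and is not how MongoDB evaluates expressions internally.

```python
def resolve(operand, element):
    """Turn '$$this.field' into the element's value; return literals untouched."""
    prefix = '$$this.'
    if isinstance(operand, str) and operand.startswith(prefix):
        return element.get(operand[len(prefix):])
    return operand


def evaluate(cond, element):
    """Evaluate a condition built only from $and and $eq against one element."""
    (operator, arguments), = cond.items()  # exactly one operator per dictionary
    if operator == '$and':
        return all(evaluate(sub, element) for sub in arguments)
    if operator == '$eq':
        left, right = (resolve(arg, element) for arg in arguments)
        return left == right
    raise NotImplementedError(f'Operator {operator} is not part of this sketch')


def filter_array(array, cond):
    """Keep every element for which the condition holds, as $filter does."""
    return [element for element in array if evaluate(cond, element)]


# Hypothetical array; field names follow the tutorial, values are made up
probabilities = [
    {'classifier_name': 'stamp_classifier', 'class_name': 'AGN', 'ranking': 1},
    {'classifier_name': 'stamp_classifier', 'class_name': 'SN', 'ranking': 2},
    {'classifier_name': 'lc_classifier', 'class_name': 'QSO', 'ranking': 1},
]

cond = {
    '$and': [
        {'$eq': ['$$this.ranking', 1]},
        {'$eq': ['$$this.classifier_name', 'stamp_classifier']},
    ]
}

probs = filter_array(probabilities, cond)
print(probs)
```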

The name `$$this` refers to the elements of the array defined in `input`. When passing the name of the array to `input`, it must be preceded by `$`. It is possible to change the name from `this` to something else using the option `as`:

In [None]:
docs = objects.find(
    {
        'ndet': {'$gte': 100}
    }, 
    {
        '_id': False,
        'probs': {
            '$filter': {
                'input': '$probabilities',
                'as': 'element',  # New name for items
                'cond': {
                    '$and': [
                        {'$eq': ['$$element.ranking', 1]},  # now using 'element' instead of 'this'
                        {'$eq': ['$$element.classifier_name', 'stamp_classifier']}
                    ]
                }
            }
        },
    })

for doc in docs:
    print(doc)

## Expression queries

Another type of query allows for the use of what are called expressions. These are typically some sort of operation over the fields, rather than a direct search over the values. For instance, to retrieve objects with a difference between last and first detection dates of at most a thousand days, we would use:

In [None]:
docs = objects.find(
    {
        '$expr': {
            '$lte': [{'$subtract': ['$lastmjd', '$firstmjd']}, 1000]
        }
    }, 
    {'firstmjd': True, 'lastmjd': True, '_id': False}
)

for doc in docs:
    print(doc)

Within `$expr` we can use the standard comparison operators (`$lte` in this case), but now the value is a list: the first element is the field, which can itself be an expression (in this case `$subtract`), and the second is the value used for the comparison. Note that the fields used within the expression *must* be preceded by `$`. A list of all expression operators can be found [here](https://www.mongodb.com/docs/v6.0/meta/aggregation-quick-reference/#operator-expressions).
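For readers more comfortable with plain Python, the logic of the `$expr` filter above is equivalent to the following comparison, shown here over a few hypothetical documents with made-up dates.

```python
# Hypothetical documents; only the field names follow the tutorial
docs = [
    {'oid': 'obj_a', 'firstmjd': 58300.0, 'lastmjd': 59500.0},  # span: 1200 days
    {'oid': 'obj_b', 'firstmjd': 58900.0, 'lastmjd': 59400.0},  # span: 500 days
    {'oid': 'obj_c', 'firstmjd': 58500.0, 'lastmjd': 59500.0},  # span: 1000 days
]

# Same logic as {'$expr': {'$lte': [{'$subtract': ['$lastmjd', '$firstmjd']}, 1000]}}
selected = [doc['oid'] for doc in docs if doc['lastmjd'] - doc['firstmjd'] <= 1000]
print(selected)
```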

With the concept of expressions, we can now explain the `$filter` operator: `input` must be an expression that returns an array, which is why we use the `$` at the beginning of the field name. We could also have used a more involved expression, as long as it always returns an array. The same applies to `cond`, which explains the use of arrays to define the `$eq` conditions.

Expressions are an important concept, separate from query operators. Some operators can receive expressions, and some expressions share their name with a query operator while having a somewhat different syntax. *Always check the documentation.*

## Array queries

So far we've seen some queries over simple fields. Queries over arrays sometimes work in unexpected ways. First off, a simple query (without projection) will always return the full object, not just the matching elements:

In [None]:
docs = objects.find({'probabilities.classifier_name': 'stamp_classifier'})

print(docs[1])

A projection is needed to limit the results within an array. Furthermore, in the following query one might expect to select only objects classed as AGN with the highest probability on the stamp classifier:

In [None]:
docs = objects.find(
    {
        'probabilities.classifier_name': 'stamp_classifier',
        'probabilities.ranking': 1,
        'probabilities.class_name': 'AGN'
    }, 
    {  # We're using the projection to get only the elements that actually match the query
        '_id': False,
        'probs': {
            '$filter': {
                'input': '$probabilities',
                'cond': {
                    '$and': [
                        {'$eq': ['$$this.ranking', 1]},
                        {'$eq': ['$$this.classifier_name', 'stamp_classifier']},
                        {'$eq': ['$$this.class_name', 'AGN']}
                    ]
                }
            }
        },
    }
)

for doc in docs:
    print(doc)

Why are we getting empty arrays?

The answer is that, when several conditions over array fields are combined, the query returns documents where at least one element of the array fulfills each condition *independently*. In other words, the above will match as long as some element of the array has the stamp classifier as classifier name, some element has a ranking of one and some element has a class name of AGN, *even if each condition is fulfilled by a different element*.

To make sure that a given element matches the condition simultaneously we can use the `$elemMatch` operator for queries:

In [None]:
docs = objects.find(
    {
        'probabilities': {
            '$elemMatch': {
                'classifier_name': 'stamp_classifier',
                'ranking': 1,
                'class_name': 'AGN'
            }
        }
    }, 
    {  # We're using the projection to get only the elements that actually match the query
        '_id': False,
        'probs': {
            '$filter': {
                'input': '$probabilities',
                'cond': {
                    '$and': [
                        {'$eq': ['$$this.ranking', 1]},
                        {'$eq': ['$$this.classifier_name', 'stamp_classifier']},
                        {'$eq': ['$$this.class_name', 'AGN']}
                    ]
                }
            }
        },
    }
)

for doc in docs:
    print(doc)
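The difference between the two behaviours can be reproduced in pure Python. In this sketch (hypothetical helper names and made-up values), the same array passes the independent check used by dot-notation conditions but fails the single-element check that `$elemMatch` performs.

```python
conditions = {
    'classifier_name': 'stamp_classifier',
    'ranking': 1,
    'class_name': 'AGN',
}

# Hypothetical array: no single element is a top-ranked stamp classifier AGN
probabilities = [
    {'classifier_name': 'stamp_classifier', 'class_name': 'SN', 'ranking': 1},
    {'classifier_name': 'stamp_classifier', 'class_name': 'AGN', 'ranking': 2},
]


def matches_independently(array, conditions):
    """Each condition may be satisfied by a different element (dot notation)."""
    return all(
        any(element.get(field) == value for element in array)
        for field, value in conditions.items()
    )


def matches_single_element(array, conditions):
    """One element must satisfy every condition at once ($elemMatch)."""
    return any(
        all(element.get(field) == value for field, value in conditions.items())
        for element in array
    )


print(matches_independently(probabilities, conditions))
print(matches_single_element(probabilities, conditions))
```

A document like this one is what produced the empty arrays earlier: it satisfies the dot-notation query, yet the `$filter` projection finds no element meeting all three conditions.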

## Geospatial queries

If a geospatial index is being used on a field, it allows us to do geospatial queries. While multiple types of searches are possible depending on the geometries defined, we will focus only on cone-searches, which are the most relevant for usage within ALeRCE. For other types of geospatial queries, see [here](https://www.mongodb.com/docs/manual/reference/operator/query-geospatial/).

First we need to create a geospatial index:

In [None]:
objects.create_index([('loc', '2dsphere')])

Now, to search for elements within a circle over the sphere we use the operator `$geoWithin`. Inside the operator there are multiple options that can be used, but in the case of the circle we use `$centerSphere`, which takes an array as its value. The first element is an array of coordinates (longitude and latitude or, in our case, RA minus 180 and Dec, always in degrees) and the second element is the radius in radians (roughly one degree in the example below):

In [None]:
docs = objects.find(
    {
        'loc': {
            '$geoWithin': {
                '$centerSphere': [[67 - 180, 52], 3.14 / 180]
            }
        }
    },
    {
        'meanra': True,
        'meandec': True,
        '_id': False
    }
)

for doc in docs:
    print(doc)
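Because the centre is given in degrees and the radius in radians, it is easy to mix up the units. A small helper can take care of both the RA shift and the conversion. The function `cone_search_filter` below is a hypothetical convenience written for this tutorial, not part of PyMongo.

```python
import math


def cone_search_filter(ra, dec, radius_deg, field='loc'):
    """Build a $geoWithin/$centerSphere filter from RA, Dec and radius in degrees."""
    return {
        field: {
            '$geoWithin': {
                # Centre as [RA - 180, Dec] in degrees, radius in radians
                '$centerSphere': [[ra - 180, dec], math.radians(radius_deg)]
            }
        }
    }


query = cone_search_filter(67, 52, 1)
print(query)
```

The resulting dictionary has the same structure as the filter used above and could be passed as the first argument of `objects.find`.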

It is also possible to use other geometries, for instance a `$box`. This uses an array with two arrays, representing opposite corners of the box (bottom-left first, then upper-right):

In [None]:
docs = objects.find(
    {
        'loc': {
            '$geoWithin': {
                '$box': [[67 - 180, 52], [68 - 180, 53]]
            }
        }
    },
    {
        'meanra': True,
        'meandec': True,
        '_id': False
    }
)

for doc in docs:
    print(doc)

For other geospatial query operators, check [here](https://www.mongodb.com/docs/manual/reference/operator/query-geospatial/).

## Summary

New concepts seen here:

* Indices allow for better performance over queries that involve indexed fields, preventing a full collection scan
* Projections allow us to limit the fields returned or rename them
* Expressions allow for more complex queries, not just direct comparisons with the values, but also over operations involving one or more fields
  
Things to keep in mind:

* Some operators might ignore certain types of indices. If you see a loss of performance, check the `explain` method and documentation to make sure things are being used as expected
* Queries over arrays are sometimes unintuitive. To ensure that the query matches for a single element, use `$elemMatch`. Otherwise the match might be over different elements
* Geospatial indices and searches work somewhat differently from the others, and the latter cannot be done without the former