## Aggregation pipelines

MongoDB lets you chain multiple operations within a single command by using aggregation pipelines. A pipeline can include operations we've already seen, such as queries, projections, sorting and pagination, as well as additional operations that are not available through the previous commands.

An aggregation pipeline in MongoDB consists of one or more stages, each with its own operators. Most stages can appear at any position in the pipeline and any number of times, but some have restrictions on where and how often they can be used. We'll only cover the most common ones here; the full list of stages is available [here](https://www.mongodb.com/docs/v6.0/reference/operator/aggregation-pipeline/#std-label-aggregation-pipeline-operator-reference).

To run an aggregation pipeline in pymongo, we use the `aggregate` method of a collection. Its main parameter is a list of dictionaries. Each dictionary must have a single key, which names the pipeline stage, and a value that defines what the stage does. The output of each stage is used as the input of the next one, until the list is exhausted.
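
To make that structure concrete, here is a minimal sketch of a helper that checks that every stage is a dictionary with exactly one key starting with `$`. The name `check_pipeline` is hypothetical (pymongo does not provide it), and it only inspects plain Python objects, so it runs without a database connection:

```python
def check_pipeline(pipeline):
    """Return the stage names of a pipeline, raising ValueError on a malformed stage.

    Hypothetical helper for illustration only; it is not part of pymongo.
    """
    names = []
    for position, stage in enumerate(pipeline):
        if not isinstance(stage, dict) or len(stage) != 1:
            raise ValueError(f'Stage {position} must be a dictionary with a single key')
        (name,) = stage  # The only key is the name of the stage
        if not name.startswith('$'):
            raise ValueError(f'Stage {position} has an invalid name: {name!r}')
        names.append(name)
    return names


pipeline = [
    {'$match': {'ndet': {'$gte': 400}}},
    {'$project': {'ndet': True, '_id': False}},
]
print(check_pipeline(pipeline))  # ['$match', '$project']
```

This does not validate the contents of each stage; the server still does that when the pipeline is executed.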

In [None]:
from pymongo import MongoClient

client = MongoClient(host='localhost', port=27017, username='mongo', password='mongo')

objects = client.alerce.objects  # This is the collection we'll be using

### `$match`

The `$match` stage is equivalent to performing a query and uses the same operators we've seen for the first parameter of the `find` method. This stage cannot apply a projection (there is a separate stage for that).

In [None]:
docs = objects.aggregate([{'$match': {'ndet': {'$gte': 400}}}])
docs

The output of `aggregate` is a `CommandCursor`. This is different from the `Cursor` returned by `find`, but it is still iterable.

Unfortunately, this type does not have the `explain` method.
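
One possible workaround, sketched here under the assumption that the server accepts the `explain` database command wrapping an `aggregate` command, is to build that command by hand and send it through the database's `command` method. The helper name `build_explain_command` is hypothetical, and the actual call is left as a comment because it needs a running server:

```python
def build_explain_command(collection_name, pipeline, verbosity='queryPlanner'):
    """Build the document for an `explain` command wrapping an aggregation.

    Hypothetical helper; the command name must be the first key of the document.
    """
    return {
        'explain': {'aggregate': collection_name, 'pipeline': pipeline, 'cursor': {}},
        'verbosity': verbosity,
    }


command = build_explain_command('objects', [{'$match': {'ndet': {'$gte': 400}}}])

# With a running server, the plan could then be requested with something like:
# plan = client.alerce.command(command)
```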

In [None]:
for doc in docs:
    print(doc)

### `$project`

As the name implies, the `$project` stage is equivalent to the projection we've seen in the previous module:

In [None]:
docs = objects.aggregate([{'$project': {'ndet': True, '_id': False}}])

for doc in docs:
    print(doc)

Stages can generally be chained in any order. Keep in mind that the name of a field might change along the pipeline due to renaming, so the name to use depends on the order of the stages. The following two blocks give the same result:

In [None]:
docs = objects.aggregate([
    {'$project': {'detections': '$ndet', '_id': False}},  # Renaming the field
    {'$match': {'detections': {'$gte': 400}}}  # We need to use the new name
])

for doc in docs:
    print(doc)

In [None]:
docs = objects.aggregate([
    {'$match': {'ndet': {'$gte': 400}}},  # Using ndet    
    {'$project': {'detections': '$ndet', '_id': False}},  # Renaming the field
])

for doc in docs:
    print(doc)

However, the two are very different in terms of performance. By starting with the match, we only need to rename the field for the 3 matched documents. In the reverse order, we rename the field for the whole collection and only then select the relevant documents.

**It is recommended to start a pipeline with a `$match` that limits as much as possible the number of results.**
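
The difference in work can be illustrated with a plain-Python emulation of both orderings. This is only a sketch: the sample documents are made up, the renaming is done by a hypothetical `rename` function, and a real server may apply optimizations of its own that this emulation ignores.

```python
# Hypothetical sample data standing in for the collection
sample = [{'_id': i, 'ndet': n} for i, n in enumerate([12, 450, 7, 800, 35, 401])]


def rename(doc, counter):
    """Emulate the renaming projection, counting how many documents it touches."""
    counter['renamed'] += 1
    return {'detections': doc['ndet']}


# Project first, then match: every document is renamed before filtering
project_first = {'renamed': 0}
renamed_docs = [rename(doc, project_first) for doc in sample]
result_a = [doc for doc in renamed_docs if doc['detections'] >= 400]

# Match first, then project: only the matched documents are renamed
match_first = {'renamed': 0}
result_b = [rename(doc, match_first) for doc in sample if doc['ndet'] >= 400]

print(result_a == result_b)  # True
print(project_first['renamed'], match_first['renamed'])  # 6 3
```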

### `$set`/`$addFields`

These two stages do the same thing, although `$set` is only available starting with MongoDB 4.2. Their behaviour is similar to that of `$project` when creating a new field. The difference is that the new fields are added to the existing ones, so there is no need to select which fields go in the output:

In [None]:
docs = objects.aggregate([
    {'$match': {'ndet': {'$gte': 400}}},
    {'$set': {'deltamjd': {'$subtract': ['$lastmjd', '$firstmjd']}}},
])

for doc in docs:
    print(doc)

New fields are always added at the end of the document, while a field that already exists is overwritten. More than one field can be added in a single stage:

In [None]:
docs = objects.aggregate([
    {'$match': {'ndet': {'$gte': 400}}},
    {'$set': {
        'deltamjd': {
            '$subtract': ['$lastmjd', '$firstmjd']
        },
        'stamp_classified': {  # Checks if at least one of the classifier names is 'stamp_classifier'
            '$in': ['stamp_classifier', '$probabilities.classifier_name']
        }
    }},
])

for doc in docs:
    print(doc)

### `$unwind`

The `$unwind` stage is used on array fields. It deconstructs the array, producing one output document for each array element of every input document:

In [None]:
docs = objects.aggregate([
    {'$match': {'ndet': {'$gte': 400}}},
    {'$unwind': '$probabilities'},
])

for doc in docs:
    print(doc)

As you can see, the `_id`s are now repeated, and each output document corresponds to one element of the original `probabilities` array. The `probabilities` field now holds a single element of the array we began with.
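
The shape of that output can be reproduced with a small plain-Python emulation. This is a sketch rather than the server's implementation: the `unwind` function and the sample documents are hypothetical, and it assumes the field is a top-level array (or is missing, `None` or empty, in which case the document is dropped, as `$unwind` does by default).

```python
def unwind(docs, field):
    """Emulate `{'$unwind': '$<field>'}` for a top-level array field."""
    for doc in docs:
        # Documents where the field is missing, None or an empty list yield nothing
        for element in doc.get(field) or []:
            yield {**doc, field: element}


# Hypothetical documents with the same layout as the ones in the collection
sample = [
    {'_id': 'obj1', 'probabilities': [
        {'class_name': 'VS', 'probability': 0.8},
        {'class_name': 'SN', 'probability': 0.2},
    ]},
    {'_id': 'obj2', 'probabilities': []},
]

for doc in unwind(sample, 'probabilities'):
    print(doc)
```

The first document produces two outputs sharing the same `_id`, while the second, having an empty array, produces none.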

This allows us to get a single entry when searching by, for instance, class and probability:

In [None]:
classifier = 'stamp_classifier'
class_ = 'VS'
min_prob = 0.7

docs = objects.aggregate([
    {
        '$match': {
            'probabilities': {
                '$elemMatch': {
                    'classifier_name': classifier,
                    'class_name': class_,
                    'probability': {'$gte': min_prob}
                }
            }
        }
    },
    {  # Remember that after the previous stage we still have the full 'probabilities' array
        '$set': {
            'probabilities': {
                '$filter': {
                    'input': '$probabilities',
                    'cond': {
                        '$and': [
                            {'$eq': ['$$this.classifier_name', classifier]},
                            {'$eq': ['$$this.class_name', class_]},
                            {'$gte': ['$$this.probability', min_prob]}
                        ]
                    }
                }
            }
        }
    },
    {
        '$unwind': '$probabilities'
    },
])

for doc in docs:
    print(doc)

**Warning:** Unfortunately, due to how the selection of array elements works, the conditions in the `$match` and `$set` stages must agree for a query like the one above, but nothing enforces this. It is very easy, while testing queries, to update a value or field in the `$elemMatch` operator and forget to do so in the `$filter` operator (or vice versa), resulting in a valid but meaningless query. *Pay close attention when creating these types of queries.*
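
One way to reduce that risk, sketched below, is to build both stages from the same arguments so the conditions cannot drift apart. The function name `probability_pipeline` is hypothetical; it only assembles the same pipeline used above, and the call to `aggregate` is left as a comment because it needs a running server.

```python
def probability_pipeline(classifier, class_, min_prob):
    """Build the match/filter/unwind pipeline from a single set of values."""
    return [
        {'$match': {'probabilities': {'$elemMatch': {
            'classifier_name': classifier,
            'class_name': class_,
            'probability': {'$gte': min_prob},
        }}}},
        # Same three conditions as above, written as aggregation expressions
        {'$set': {'probabilities': {'$filter': {
            'input': '$probabilities',
            'cond': {'$and': [
                {'$eq': ['$$this.classifier_name', classifier]},
                {'$eq': ['$$this.class_name', class_]},
                {'$gte': ['$$this.probability', min_prob]},
            ]},
        }}}},
        {'$unwind': '$probabilities'},
    ]


pipeline = probability_pipeline('stamp_classifier', 'VS', 0.7)

# With the collection defined earlier, this could be run as:
# docs = objects.aggregate(pipeline)
```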