# MongoDB Childcare Database
This database is a collection of data about children, staff, and their events at a daycare.

You will notice that the first way we have it set up is very relational. We have a collection (a table, in relational terms) for children, one for staff, and one for events. Each event document holds references to the children and staff collections. This is how you would typically see this data modeled in a relational database.
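To make the relational layout concrete, here is a minimal sketch of how one event might reference a child and a staff member. The field names `childId`, `staffId`, `eventType`, `timestamp`, `details`, `firstName` and `lastName` come from the queries in this notebook. The string ids and every value are invented for illustration, and the real collections presumably use ObjectIds.

```python
from datetime import datetime

# Hypothetical sample documents; ids and values are invented for illustration.
child = {"_id": "child-1", "firstName": "Ada", "lastName": "Lovelace"}
staff_member = {"_id": "staff-1", "firstName": "Grace", "lastName": "Hopper"}

# An event stores only references (childId, staffId), like foreign keys in SQL.
event = {
    "_id": "event-1",
    "childId": child["_id"],
    "staffId": staff_member["_id"],
    "eventType": "nap",
    "timestamp": datetime(2024, 12, 10, 13, 30),
    "details": {"durationMinutes": 45},
}

# Resolving the reference by hand mirrors what $lookup does on the server.
children_by_id = {child["_id"]: child}
event_child = children_by_id[event["childId"]]
print(event_child["firstName"], event_child["lastName"])
```

Because the event carries only ids, every read that needs the child's name has to resolve that reference, which is what the `$lookup` stage in the next section does.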

# Get the daily events and child info for each event in aggregation pipeline

## Why is this bad

In the query below we join the child info onto the daily events collection. We are also fetching every single event, and odds are we don't need all of them.

**Stats:**
- This returns 2,000,000 documents (2 million).
- This takes roughly 23 seconds to run.

In [1]:
from datetime import datetime, tzinfo, timezone
import time
from pprint import pprint
import json


from pymongo import MongoClient
from pymongo.synchronous.command_cursor import CommandCursor

client = MongoClient('mongodb://root:password@localhost:27017/')

db = client['daycare_db']

# used in queries to get all events since a certain date
events_since_time = datetime.strptime("12-08-2024 00:00:00", "%m-%d-%Y %H:%M:%S")



In [10]:

result = client['daycare_db']['dailyEvents'].aggregate([
    {
        '$lookup': {
            'from': 'children', 
            'localField': 'childId', 
            'foreignField': '_id', 
            'as': 'child_info'
        }
    }, {
        '$unwind': '$child_info'
    }, {
        '$project': {
            'eventType': 1, 
            'childId': 1, 
            'timestamp': 1, 
            'details': 1, 
            'firstName': '$child_info.firstName', 
            'lastName': '$child_info.lastName'
        }
    }
])

data = list(result)
num_of_docs = len(data)
print(f"Number of documents: {num_of_docs}")

Number of documents: 2000000


In [2]:
def get_events_and_children_since_time() -> CommandCursor:
    """
    Retrieves daily events and associated child information since a specified time.

    This function performs an aggregation pipeline on the 'dailyEvents' collection:
    1. Matches events with timestamps greater than or equal to 'events_since_time'.
    2. Looks up corresponding child information from the 'children' collection.
    3. Projects specific fields from both events and child information.

    The function also measures and prints the execution time and number of documents returned.

    Returns:
        CommandCursor: A cursor to iterate over the matching events with child information.
    """

    start_time = time.time()
    result = db['dailyEvents'].aggregate([
        {
            '$match': {
                'timestamp': {
                    '$gte': events_since_time
                }
            }
        }, {
            '$lookup': {
                'from': 'children', 
                'localField': 'childId', 
                'foreignField': '_id', 
                'as': 'childInfo'
            }
        }, {
            '$project': {
                'notes': 1, 
                'details': 1, 
                'eventType': 1, 
                'childId': 1, 
                'staffId': 1, 
                'timestamp': 1, 
                'childInfo.firstName': 1, 
                'childInfo.lastName': 1
            }
        }
    ])
    
    # aggregate() returns a cursor; print_num_events() iterates it to pull every document,
    # so the cursor returned below has already been exhausted
    print_num_events(result)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")

    return result

def print_num_events(result: CommandCursor):
    data = list(result) # this actually runs the query and returns the data
    num_of_docs = len(data)
    print(f"Number of documents: {num_of_docs}")


def run_explain_on_pipeline(pipeline, collection_name: str):
    # Wrap the aggregate command in an explain command to get execution statistics
    explain_command = {
        "explain": {
            "aggregate": collection_name,
            "pipeline": pipeline,
            "cursor": {}
        },
        "verbosity": "executionStats"
    }
    explain_results = db.command(explain_command)

    pprint(explain_results)

def get_events_since_time(explain: bool = False) -> None:
    """
    Retrieves daily events from the database since a specified time.

    This function queries the 'dailyEvents' collection for events with timestamps
    greater than or equal to the 'events_since_time'. It measures and prints the
    execution time and the number of documents returned.

    Args:
        explain: If True, print the explain plan for the pipeline instead of
            running the query.

    Returns:
        None. Results are printed rather than returned.
    """
    
    pipeline = [
        {
            '$match': {
                'timestamp': {
                    '$gte': events_since_time
                }
            }
        }
    ]
    
    start_time = time.time()
    
    if explain:
        run_explain_on_pipeline(pipeline, "dailyEvents")
        return
    
    result = db['dailyEvents'].aggregate(pipeline)

    
    # result is just a cursor and doesn't return any data till you iterate over it
    print_num_events(result)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")

    # return result

# Limit the number of documents returned

Limit the events to a certain time range, for example the last 3 days. This dataset was created on 12-11-2024, so to get the last 3 days we use 12-08-2024.

## Why is this bad

We are not using any index here. If we were to add an index I bet this would be even faster.

**Stats:**
- This returns 714306 documents
- Uses a COLLSCAN (a full collection scan)
- Has an 'executionTimeMillisEstimate': 486

In [None]:
# By setting explain=True we can get the explain plan for the query. 
get_events_since_time(explain=True)


In [57]:
# Now let's run the query without explain to see how long it takes
get_events_since_time()

Number of documents: 714306
Execution time: 3.2660820484161377 seconds
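
The timings in this notebook come from `time.time()` calls wrapped around the query. As a hedged alternative, the sketch below uses `time.perf_counter()`, the standard-library clock intended for measuring elapsed time. The `timed` helper and its label are hypothetical and not part of the notebook's code, and the workload is a stand-in for `list(cursor)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.3f} seconds")


timings = {}
with timed("sum of squares", timings):
    # Stand-in workload; in the notebook this would be list(cursor).
    total = sum(i * i for i in range(100_000))
```

Keeping the measurements in a dictionary makes it easier to compare the no-index, ascending, and descending runs side by side.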


# Adding an index

Anyone who knows SQL knows that creating an index can help speed up queries. What should we add an index to in the query above?
When adding an index, we should index the fields used in a WHERE clause (in MongoDB's aggregation pipeline, the `$match` stage) or anything we are sorting on.

In the query above we are matching on the timestamp field, so we should add an index on timestamp. In MongoDB we can add an index on a field with the `create_index` method. You also provide a parameter for ascending or descending order, but which do we use? My gut says we care about the most recent events, which suggests descending, but there is a note in the docs that states indexes in descending order can cause performance issues and that we should only use ascending order: https://www.mongodb.com/docs/manual/core/indexes/create-index/#example. So we will start with ascending, test it, remove it, and then try descending order to see if there is a difference.
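
The rule of thumb above (index what you filter or sort on) can be sketched as a small helper that reads the leading `$match` and `$sort` stages of a pipeline and lists the fields they touch. `candidate_index_fields` is a hypothetical helper written for this notebook, not a PyMongo API. It is deliberately simplistic: it ignores operators such as `$and` and `$or` and stops at the first stage that is neither `$match` nor `$sort`.

```python
def candidate_index_fields(pipeline: list) -> list:
    """Collect field names from the leading $match and $sort stages."""
    fields = []
    for stage in pipeline:
        if "$match" in stage:
            keys = stage["$match"].keys()
        elif "$sort" in stage:
            keys = stage["$sort"].keys()
        else:
            # Assumption: an index on the source collection mainly helps
            # stages at the start of the pipeline, so stop here.
            break
        for key in keys:
            # Skip operators like $and / $or; this sketch does not descend into them.
            if not key.startswith("$") and key not in fields:
                fields.append(key)
    return fields


pipeline = [
    {"$match": {"timestamp": {"$gte": "2024-12-08"}}},
    {"$lookup": {"from": "children", "localField": "childId",
                 "foreignField": "_id", "as": "childInfo"}},
]
print(candidate_index_fields(pipeline))  # ['timestamp']
```

A candidate field is only a starting point. Whether the index actually helps still has to be checked with an explain plan.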


**Stats:**
With ascending order index on timestamp
- This returns 714306 documents
- This takes roughly  seconds to run.

In [58]:
# Add index to the timestamp field in ascending order
db['dailyEvents'].create_index([('timestamp', 1)])



'timestamp_1'

# Running the query with the index ascending

Run the query in a separate cell so that the index creation does not influence the run time.

## Getting the explain plan with explain=True
First, let's run with an explain to make sure we are using the index.

This output should show `'indexName': 'timestamp_1'`, so we are in fact using the index.
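
The explain output is deeply nested, so spotting the stage names and the index name by eye can be tedious. The sketch below walks any nested dict or list and collects the values stored under a given key. `find_values` is a hypothetical helper, and `sample_explain` is a hand-written, heavily abbreviated stand-in. Real explain output has many more fields, and its shape varies by server version.

```python
def find_values(node, key):
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Hand-written stand-in for explain output (not real server output).
sample_explain = {
    "stages": [
        {
            "$cursor": {
                "queryPlanner": {
                    "winningPlan": {
                        "stage": "FETCH",
                        "inputStage": {"stage": "IXSCAN", "indexName": "timestamp_1"},
                    }
                }
            }
        }
    ]
}

print(find_values(sample_explain, "stage"))      # ['FETCH', 'IXSCAN']
print(find_values(sample_explain, "indexName"))  # ['timestamp_1']
```

In the notebook, this could be applied to the dictionary returned by `db.command(explain_command)` instead of pretty-printing the whole thing.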

## Wait, it's slower using the index
Looking at the explain plan we see that the query now has an **'executionTimeMillisEstimate': 1018**! What happened?

When the query would return a large portion of the collection (as a rough rule of thumb, more than about 30% of documents; here 714,306 of 2,000,000 is roughly 36%):
- The index scan plus document lookup becomes more expensive than a simple collection scan
- MongoDB has to find each matching entry in the index and then fetch the actual document
- You can see the two stages in the image below from Compass

## So what does this mean

This is why we should always test our queries after adding an index. Just because we added an index doesn't mean that we actually improved anything.

Always follow these steps:
- Run the query with an explain. Document how long the query plan states it will run via the executionTimeMillisEstimate.
- Calculate whether the number of documents that will be returned is greater than 30% of the collection. If so, an index may not make sense.
- Add the index you plan to use.
- Run the query explain again. Make sure you document the total executionTimeMillisEstimate, not just the executionTimeMillisEstimate for each stage.
- If the index makes it worse, remove the index. Feel free to try another index, just as we do below for descending order, to make sure.
- If you have trouble reading the explain plan, recreate the query in MongoDB Compass and run the explain there. The visual is much more readable.
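
The checklist above boils down to two comparisons: what fraction of the collection the query matches, and whether the explain estimate went up or down after adding the index. Here is a minimal sketch using the figures reported in this notebook. `index_report` is a hypothetical helper, and the 30% threshold is the rule of thumb from the text, not a MongoDB setting.

```python
def index_report(total_docs, matched_docs, ms_before, ms_after, threshold=0.30):
    """Summarize whether an index looks worthwhile for one query."""
    fraction = matched_docs / total_docs
    return {
        "fraction_matched": fraction,
        # A large fraction suggests a collection scan may be cheaper.
        "over_threshold": fraction > threshold,
        # Keep the index only if the measured estimate actually improved.
        "keep_index": ms_after < ms_before,
    }


# Figures from this notebook: 2,000,000 events in total, 714,306 matched,
# 486 ms estimated without the index, 1018 ms with the ascending index.
report = index_report(2_000_000, 714_306, 486, 1018)
print(f"{report['fraction_matched']:.1%}", report["over_threshold"], report["keep_index"])
```

With these numbers the query matches about 35.7% of the collection and the estimate got worse, so both checks point toward dropping the index.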



In [None]:

get_events_since_time(explain=True)

In [None]:
get_events_since_time()

# Adding a descending index

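Just for fun, let's add a descending index on the timestamp field to see what happens.
This actually ran in ~900 ms, which is faster than the ascending index and seems to go against the docs?! For a single-field index, MongoDB can walk the index in either direction, so a difference this small may just be run-to-run variation.

However, it is still slower than no index, so we will remove this index as well.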

In [69]:
# Remove the index on the timestamp field
try:
    db['dailyEvents'].drop_index('timestamp_1')
except Exception:
    print("Index does not exist")
# For fun maybe we clear the query plan cache
db.command({"planCacheClear": "dailyEvents"})

# Recreate the index on the timestamp field in descending order
db['dailyEvents'].create_index([('timestamp', -1)])


Index does not exist


'timestamp_-1'

In [None]:
get_events_since_time(explain=True)

In [71]:
# Drop the descending order index
try:
    db['dailyEvents'].drop_index('timestamp_-1')
except Exception:
    print("Index does not exist")
