# Lab 5 - NoSQL with MongoDB

In this lab, you will learn how to use MongoDB, a popular NoSQL database, through its Python driver PyMongo. We will cover basic CRUD operations, querying, and aggregation.

You will learn how to:

- Use `pymongo` to interact with a MongoDB database
- Perform CRUD operations on MongoDB collections
- Perform aggregation operations on MongoDB collections

## Setup Instructions

We will use MongoDB Atlas, which removes the need to install MongoDB locally.

### Step 1: Create an Atlas Account
1. Create an account on [MongoDB Atlas's official website](https://www.mongodb.com/cloud/atlas/register).
2. Verify your email and set up multi-factor authentication (you can use your email again). You can use your ESSEC or CS email address.
2. You will be asked a series of questions for personnalization, you can either answer them, or `Skip personalization`.

### Step 2: Create a Free Cluster
1. When asked how you want to deply your cluster, select the `Free` solution.
2. Leave the cluster name to default (`Cluster0`) and click `Create Deployment`.

### Step 3: Secure Your Cluster
You now need to secure your MongoDB Atlas cluster before you can use it. 


![Security](img/clusterSecurity.png)

In particular, you need to specify

- which IP addresses can access the cluster
- which users can access the cluster

Store your usename and password and click `Create Database User`.

### Step 4: Connect to Your Cluster
1. Click `Choose a connection method` to open the following page:

![Connection](img/connect.png)

2. Click `Drivers` and select `Python` as your driver. The installation instructions tell you to use pip to install PyMango. If you use anaconda, PyMango is also available.

Uncomment and run the command corresponding to your case (note that you might have to restart the kernel):

In [None]:
import sys
!{sys.executable} -m pip install pymongo

In [None]:
# %conda install --channel conda-forge pymongo

3. Toggle `View full code sample`, copy-paste the connection code sample from the Atlas UI into the following cell and run it to test your connection. If you run into a problem, ask for help.

![Driver](img/driver.png)

In [3]:
import certifi
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

uri = "mongodb+srv://b00811533_db_user:ZoWNazijrzG7VrwS@datastoragecluster.ek2k0s2.mongodb.net/?retryWrites=true&w=majority&appName=DataStorageCluster"

# Create a new client and connect to the server
client = MongoClient(uri, 
                    server_api=ServerApi('1'),
                    tlsCAFile=certifi.where(),  # ensure trusted CA bundle
                    tls=True)

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


Note the script imports `MongoClient` from `pymongo.mongo_client` and `ServerApi` from `pymongo.server_api`. As you can deduce from the names, the server is handled directly by MongoDB Atlas and we interact with the database through an API. If you need MongoDB, it is also possible to set it up locally with MongoDB Community Server from [MongoDB's official site](https://www.mongodb.com/try/download/community).

### Step 5: Configure Your Connection (optional)
To avoid storing your credentials directly in the notebook, you can create a `config.py` file with the following content:

```python
# config.py
username = "your_username"
password = "your_password"
hostname = "your_hostname"
```

The `your_username` and `your_password` are the ones provided during the setup, `your_hostname` can be retrieved from the URI in the code sample (it is the part between `@` and the API parameters, i.e., up to `/`).

Then you use the following code to connect:

In [1]:
import certifi
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

from config import username, password, hostname

uri = f"mongodb+srv://{username}:{password}@{hostname}/?retryWrites=true&w=majority"


In [None]:
uri

In [2]:

# Create a new client and connect to the server
client = MongoClient(uri, 
                    server_api=ServerApi('1'),
                    tlsCAFile=certifi.where(),  # ensure trusted CA bundle
                    tls=True)

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
     print(e)

ConfigurationError: The DNS query name does not exist: _mongodb._tcp.datastoragecluster.

You can now click `Done`. Everything should now be set up properly.

### Step 6: Solving IP issues

Since the server will only be used for experimenting during the lab, you can lift the IP restrtiction (especially if you encounter issues with the wifi and your IP gets changed): 

1. To that end, go to the `Security Quickstart` tab on the left.
2. Navigate to `Where would you like to connect from?` and `Add entries to your IP Access List`.
3. Add `0.0.0.0/0` to whitelist all IP addresses.

![IP setup](img/ipmanagement.png)

## Working with MongoDB

Now that you are connected, we can start working with MongoDB through PyMango (see the [documentation](https://pymongo.readthedocs.io/en/stable/) if you need additional information).

**Note:** We have already created a MongoDB client in the previous step.

```python
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))
```

This client is our connection to the MongoDB Atlas cluster (see the documentation[https://pymongo.readthedocs.io/en/stable/api/pymongo/mongo_client.html#pymongo.mongo_client.MongoClient]). This means that all calls will be done from the client object.

The first entry in the [MongoClient documentation] tells us that we can get a `Database` instance from a `MongoClient` via either a dictionary-style (`client["<database name>"]`) or an attribute-style (`client.<database name>`) access. If the database does not exist yet, it will be created.

Create a database `student_projects`.

In [None]:
# Create or access a database
db = # TODO

The database is currently empty, we would like to add some data to it. Remember that we first need to create a `collection` and add it to the database. The syntax is similar to the creation of a database from the client, but from the perspective of the database: `db["collection_name"]` or `db.collection_name`.

Create a collection called `projects`.

In [None]:
# Create or access a collection
collection = # TODO

In [None]:
## Additional setup
from bson import ObjectId
from pymongo.errors import BulkWriteError

### Basic CRUD Operations

We will now cover the CRUD operations: Create, Read, Update, Delete. The documentation here will refer to `PyMango`, you can also have a look at [MondoDB's documentation for collection](https://www.mongodb.com/docs/manual/reference/method/js-collection/) as it provides more concrete examples. You then just need to adapat the syntax slightly.

#### Create: Insert Documents

The database is currently empty, so we can add a sample dataset of student projects. Here is the dataset that we want to add.

In [None]:
# Sample dataset of student projects
sample_projects = [
    {
        "_id": ObjectId("507f1f77bcf86cd799439011"),
        "name": "Data Collection Tool",
        "description": "A Python script to collect data from APIs.",
        "tags": ["python", "api", "data"],
        "completed": True,
        "collaborators": ["Alice", "Bob"],
        "created_at": "2025-10-15",
        "finished_at": "2025-12-10"
    },
    {
        "_id": ObjectId("507f1f77bcf86cd799439012"),
        "name": "NoSQL Database Comparison",
        "description": "A comparison of MongoDB, Redis, and Cassandra.",
        "tags": ["nosql", "database", "visualization"],
        "completed": False,
        "collaborators": ["Charlie", "Alice"],
        "created_at": "2025-09-20",
        "finished_at": "2025-11-27"
    },
    {
        "_id": ObjectId("507f1f77bcf86cd799439013"),
        "name": "Web Scraping Project",
        "description": "A tool to scrape and store data from websites.",
        "tags": ["python", "data", "scraping"],
        "completed": True,
        "collaborators": ["Bob", "Dave"],
        "created_at": "2025-11-01",
        "finished_at": "2025-12-04"
    }
]

Do add the dataset to the collection, we can use the method `insert_many` on the collection (see the [documentation]). It expects an iterable of documents and returns an instance of `InsertManyResult`, from which the attribute `inserted_ids` contains list of the ID added to the collection.

Use `insert_many` on `collection` with `sample_projects` and check that the result contains three entries. Note that as each entry has a value for the `_id` field, a `BulkWriteError` will be thrown if you try to add the same documents several times.

In [None]:
# Insert multiple documents
try : 
    result = # TODO
    print(f"Inserted {len(result.inserted_ids)} documents with IDs: {result.inserted_ids}")
except BulkWriteError:
    print("Already in the collection")

We can also use `insert_one` to add one document (see the [documentation](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.insert_one)).

Use `insert_many` on `collection` with `new_project` and check that the result contains one entry.

In [None]:
new_project = {
    "name": "Data Visualization Dashboard",
    "description": "A dashboard to visualize project data.",
    "tags": ["python", "visualization", "dashboard"],
    "completed": False,
    "collaborators": ["Eve", "Alice", "Charlie"],
    "created_at": "2025-10-08",
    "finished_at": "2025-12-01"
}

In [None]:
# Insert a single document
try : 
    insert_result = # TODO
    print(f"Inserted document with ID: {insert_result.inserted_id}")
except BulkWriteError:
    print("Already in the collection")

#### Viewing Your Database in MongoDB Atlas

As you perform operations on your MongoDB database, you can directly observe the changes in the MongoDB Atlas web interface:

1. Go to your MongoDB Atlas dashboard (`Database` on the left tab, then `Clusters`)
2. Navigate to your cluster by clicking on its name (likely `Cluster0`)
3. Click on the `Browse Collections` button

Here you can:
- Browse all documents in your collection
- View the structure of each document
- See real-time updates as you modify the data
- Manually add, edit, or delete documents if needed

This visual interface can help you understand how your operations affect the database and verify that your code is working as expected.


#### Read: Query Documents

Now that the database contains some data, we can query to retrieve data out of it. There are two methods that you can use on the collection to retrieve data: `find_one` (see the [documentation](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find_one)) and `find` (see the [documentation](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find)).

The first method, `find_one` returns a single document from the database, according to the `read_preference` of the collection, while `find` retrieve all documents.

Use `find_one` to retrieve a project from the collection.

In [None]:
# Find one document
# TODO

Use `find` to retrieve all projects from the collection. The return type is an instance of the `Cursor` class, which can be iterated over as any other collection.

In [None]:
# Find all documents
# TODO

We can query documents using filters to find specific data. By default, MongoDB uses exact matching for filters. When you specify a filter like `{"completed": True}`, MongoDB will return all documents where the `completed` field exactly matches `True`.

Write a query to find the name of all projects that are not completed.

In [None]:
# Find the name of all uncompleted projects
# TODO

Filters can also be used to find documents where an array contains a specific value. We use the field name with the value we are looking for. MongoDB will return documents where the specified array contains that value.

Write a query to find the name of all projects done with Python, which have `'python'` as one of their value in the array `tags`.

In [None]:
# Find projects with 'python' tag
# TODO

MongoDB provides comparison operators for more complex queries. For example, `$gt` (greater than) can be used to find documents where a field is greater than a specified value. In that way, `collection.find({"created_at": {"\$gt": "2025-10-01"}})` finds all projects created after October 1st, 2025.

Write a query to find projects finished before December 02, 2025.

In [None]:
# Find projects finished before a specific date
# TODO

By default, multiple conditions in a filter are combined with a logical AND. You can also explicitly use $and, $or, $nor, and $not for more complex logical operations.

This means that the following
```python
collection.find({
    "$and": [
        {"completed": True},
        {"tags": "python"}
    ]
})
```
is equilavent to
```python
collection.find({"completed": True, "tags": "python"})
```

Compare the result of the two queries.

In [None]:
# Compare the result of the two queries.
# TODO

#### Update: Modify Documents

To update documents in MongoDB, we can use the `update_one` and `update_many` methods (see the [documentation for update_one](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.update_one) and [documentation for update_many](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.update_many)). The `update_one` method updates a single document that matches the filter, while `update_many` updates all documents that match the filter. Besides the filter, the actual update operation also need to be specified.

For instance 

```python
collection.update_one(
    # Filter to find the document
    {"name": "NoSQL Database Comparison"},
    # Update operation
    {"$set": {"description": "A comparison of MongoDB, Redis, Cassandra and Neo4j"}}
)
```
changes the description of the NoSQL database comparison project.

Use `update_one` to modify the `completed` status of the 'Data Collection Tool' project to `False`.

In [None]:
# Update one document
# TODO

The `$set` operator updates the value of a field. There are other update operators available, such as:
- `$unset`: Removes the specified field from a document
- `$inc`: Increments the value of the field by the specified amount
- `$push`: Adds an item to an array
- `$pull`: Removes all instances of a value from an array

Use `update_many` to remove the `finished_at` field for all projects that are not completed.

In [None]:
# Update multiple documents
# TODO

#### Delete: Remove Documents

To delete documents from a MongoDB collection, we can use the `delete_one` and `delete_many` methods (see the [documentation for delete_one](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.delete_one) and [documentation for delete_many](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.delete_many)). The `delete_one` method deletes a single document that matches the filter, while `delete_many` deletes all documents that match the filter.

Use `delete_one` to remove the 'Data Visualization Dashboard' project from the collection.

In [None]:
# Delete one document
# TODO

Deleting documents is a permanent operation and cannot be undone. Always double-check your filters before executing a delete operation.

Use `delete_many` to remove all projects that are completed.

In [None]:
# Delete multiple documents
# TODO

**Important Notes:**
- Operations return a result object that contains information about the operation, such as how many documents were matched and modified (for updates) or deleted.
- Always verify the results of update and delete operations to ensure they had the intended effect.

#### Adding Back All Project

To perform additional queries, let us add back all projects to the database.

In [None]:
for document in sample_projects+[new_project]:
    collection.update_one({"name": document["name"]}, {"$setOnInsert": document}, upsert=True)

### Aggregation Operations

Aggregation operations in MongoDB allow us to process data records and return computed results. The aggregation framework uses a pipeline concept where documents pass through a series of processing stages (see the [aggregation documentation](https://www.mongodb.com/docs/manual/aggregation/) and [PyMongo aggregation documentation](https://pymongo.readthedocs.io/en/stable/aggregation.html)).

Each stage in the pipeline transforms the documents as they pass through. Common stages include:
- `$match`: Filters documents (like the `find` operation)
- `$group`: Groups documents by specified identifier
- `$sort`: Sorts documents
- `$project`: Reshapes documents (selects which fields to include/exclude)
- `$addFields`: Adds new fields to documents
- `$limit`: Limits the number of documents

The simplest aggregation would be to count the number of documents in a collection. We can use the `count_documents` method (see the [documentation](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents)).

Count the total number of projects in the collection.

In [None]:
# Count documents
# TODO

For more advanced operations, we use the aggregation pipeline framework in MongoDB. An aggregation pipeline consists of one or more stages that process documents sequentially. Each stage transforms the documents as they pass through the pipeline.

To run an aggregation pipeline, we:
1. Create a list of stages (each stage is a dictionary specifying the operation)
2. Pass this list to the `aggregate` method on our collection
3. Process the results

An example of stage is `$group`, which allows us to:
- Categorize documents into groups based on specified criteria
- Perform calculations on each group (count, sum, average, etc.)
- Reshape the data structure

For example, to group projects by their completion status and count how many projects are in each group, we need to build a pipeline with a single `$group` stage:

```python
pipeline = [
    {
        "$group": {
            "_id": "$completed",  # Group by the 'completed' field
            "count": {"$sum": 1}  # Count documents in each group
        }
    }
]
```

In this pipeline:

- We create a list containing one dictionary (our single stage)
- The dictionary has one key: `"$group"` (the operation we want to perform)
- The `"$group"` value is another dictionary with:
    - `"_id"`: Specifies the field to group by (`"$completed"` means use the `'completed'` field)
    - `"count"`: Uses the `"$sum"` operator to count documents (adding 1 for each document)


When we run this pipeline with `collection.aggregate(pipeline)`, MongoDB will:

1. Take all documents from the collection
2. Group them by their 'completed' status (True/False)
3. Count how many documents are in each group
4. Return results showing the count for each completion status

The result will look something like this:

```python
[
    {"_id": True, "count": 1},
    {"_id": False, "count": 3}
]
```

This tells us that there are 1 completed projects and 3 incomplete project in our collection.

In [None]:
# Group by completion status
pipeline = [
    {
        "$group": {
            "_id": "$completed",  # Group by the completed field
            "count": {"$sum": 1}  # Count documents in each group
        }
    }
]

print("Projects by completion status:")
for group in list(collection.aggregate(pipeline)):
    status = "Completed" if group['_id'] else "Not Completed"
    print(f"{status}: {group['count']}")

Write a query to find the project with the most collaborators. *Hint: add a field that counts the number of collaborators for each project, sort by this count in descending order, and return only the project with the most collaborators.*

In [None]:
# Find projects with most collaborators
# TODO

To analyze array fields across documents, we can use the `$unwind` stage to deconstruct an array field, followed by `$group` to count occurrences (see the [$unwind documentation](https://www.mongodb.com/docs/manual/reference/operator/aggregation/unwind/) and [$group documentation](https://www.mongodb.com/docs/manual/reference/operator/aggregation/group/)).

Find the three most common tags across all projects.

In [None]:
# Find most common tags
# TODO

### Cleanup

We can clean up the database to start from scratch if we run the notebook again.

In [None]:
# Delete all documents from the collection
delete_result = collection.delete_many({})

# Optionally, drop the collection
collection.drop()

# Optionally, drop the database
client.drop_database('student_projects')

### Close the Connection

When you are done editing the database, always remember to close your MongoDB connection to free up resources. It ensures that all resources are properly released and prevents potential connection leaks.

In [None]:
# Close the connection
client.close()
print("\nDisconnected from MongoDB.")

## Conclusion

As a recap, here is a comparison of MongoDB operations with their SQL equivalents.

| MongoDB Operation | SQL Equivalent |
|--------------------|---------------|
| `insert_one()`     | `INSERT INTO table VALUES (...)` |
| `find()`           | `SELECT * FROM table` |
| `find_one()`       | `SELECT * FROM table LIMIT 1` |
| `update_one()`     | `UPDATE table SET field=value WHERE condition` |
| `delete_one()`     | `DELETE FROM table WHERE condition` |
| Aggregation Pipeline | Complex `SELECT` with `GROUP BY`, `JOIN`, etc. |
