# MongoDB with Python: A Comprehensive Tutorial

## Table of Contents

1.  **Introduction to MongoDB**
    *   What is MongoDB and Why Use It?
    *   Key Concepts: Documents, Collections, Databases
    *   MongoDB vs. Relational Databases
    *   Use Cases for MongoDB
2.  **Setting Up MongoDB and Python Environment**
    *   Installing MongoDB
    *   Installing PyMongo (MongoDB Python Driver)
    *   Connecting to MongoDB
    *   Working with Databases
    *   Authentication (if needed)

3.  **Basic Operations with MongoDB and PyMongo**
    *   Creating and Inserting Documents
    *   Finding and Querying Documents
    *   Updating Documents
    *   Deleting Documents
    *   Working with Indexes

4.  **Advanced Operations and Features**
    *   Working with Aggregations
    *   Using Projections for Results
    *   Working with Different Data Types
    *   Handling Errors and Exceptions
    *   Transactions and Atomicity

5.  **Hands-On Lab: Building a Data Pipeline with MongoDB**
    *   Creating a Data Loading Script
    *   Implementing Data Transformation
    *   Developing a Retrieval System
    *   Testing and Evaluating Performance

6.  **Challenges, Best Practices, and Future Directions**
    *   Scalability and Performance Optimization
    *   Security Considerations
    *   Schema Design and Data Modeling
    *   Future Trends and Integration with AI/ML



## 1. Introduction to MongoDB

### 1.1 What is MongoDB and Why Use It?

*   **Definition:** MongoDB is a popular document-oriented database system. The Community Server is source-available under the Server Side Public License (SSPL).
*   **Key Characteristics:**
    *   **Document-Oriented:** It stores data in flexible, JSON-like documents made up of key-value pairs.
    *   **NoSQL Database:** It does not use the traditional relational model and does not use SQL for querying.
    *   **Scalable:** Designed to scale out and to handle large volumes of data.
    *   **Flexible:** Schemas are dynamic, so collections can easily adapt to new types of data.
    *   **Performance:** Optimized for read and write performance on large datasets.
    *  **Schema-less:** No predefined schema is required, although validation rules can be added per collection when a stricter structure is needed.
    *   **Easy to Use:** Its API works with documents that map directly to native data structures such as Python dictionaries.
    *   **Community:** The source code is publicly available, and MongoDB has a very large and active community.
*   **Why Use MongoDB?**
    *   **Handling Unstructured Data:** Well suited to unstructured or semi-structured data such as text and nested records. Large binary files (images, videos) that exceed the 16 MB document size limit are usually stored through GridFS or as references to external storage.
    *   **Scalability:** Designed to handle large datasets and high traffic loads, which suits modern applications with large data volumes.
    *  **Agile Development:** Flexible documents simplify data management in projects where the data requirements change over time.
    *   **Performance:** Provides good performance for read and write operations, and for indexed searches over large collections.
    *  **JSON Data Format:** Uses JSON-like documents that map easily to the data types used in web applications.
    *   **Schema Flexibility:** The schema-less nature of documents makes it easier to accommodate change.
    *   **Developer Productivity:** Simplifies data operations and reduces the up-front database design effort that relational systems usually require.

### 1.2 Key Concepts: Documents, Collections, Databases

*   **Documents:**
    *   **Definition:** The basic unit of data in MongoDB, stored in a JSON-like format (BSON) as key-value pairs whose values can be of any supported data type (strings, numbers, arrays, nested objects, etc.).
    *  **Properties:** Documents are schema-less: they need no predefined structure, and different documents in the same collection can have different fields.
*   **Collections:**
    *  **Definition:** A group of related documents, comparable to a table in a relational database.
    *  **Properties:** A collection does not enforce a structure by default, so its documents are not required to follow a rigid format.
*   **Databases:**
    *  **Definition:** A container for collections and their associated indexes and configuration.
    *   **Properties:** A single MongoDB instance can host multiple databases.
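
As a minimal sketch of these concepts, the snippet below models two documents as plain Python dictionaries, which is how PyMongo represents them. The field names and values are illustrative assumptions, not data from a real collection, and the list only stands in for a collection conceptually.

```python
# Two documents that could live in the same collection.
# All field names and values here are made up for illustration.
book = {
    "title": "Dune",
    "authors": ["Frank Herbert"],      # array value
    "published": {"year": 1965},       # nested (embedded) document
}
article = {
    "title": "Indexing basics",
    "tags": ["mongodb", "performance"],
    "word_count": 1200,
}

# Conceptually, a collection is a group of such documents.
library = [book, article]

# Documents in one collection may have different fields.
shared_fields = set(book) & set(article)
print(shared_fields)                  # only "title" is common to both
print(book["published"]["year"])      # nested values are reached key by key
```

Nothing in either dictionary declares a schema; the structure is whatever each document happens to contain.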

### 1.3 MongoDB vs. Relational Databases

*  **MongoDB (NoSQL):**
   *   **Data Model:** Document-oriented, schema-less.
   *   **Query Language:** Uses the MongoDB query language, in which queries are JSON-like documents, instead of SQL.
   *   **Scalability:** Designed to scale horizontally by adding more servers.
   *   **Flexibility:** Designed for applications that store unstructured or semi-structured data and that need fast development cycles.
   *   **Data Structure:** Data is stored as JSON-like documents, which can hold nested objects and arrays.
*   **Relational Databases (SQL):**
    *   **Data Model:** Relational, uses structured tables with rows and columns.
    *   **Query Language:** Uses SQL (Structured Query Language) to query and manage data.
    *   **Scalability:** Typically scales vertically, by using more powerful machines.
    *   **Flexibility:** Designed for structured data; the schema must be defined in advance, which makes it less suitable for unstructured data.
    *   **Data Structure:** Data is stored in tables with predefined columns, each with a predefined data type.

*  **Key Differences:**
    *   **Data Structure:** MongoDB stores data as flexible JSON-like documents; relational databases store it in tables with fixed columns.
    *   **Schema:** MongoDB is schema-less by default; relational databases require predefined schemas.
    *   **Query Language:** MongoDB queries are JSON-like documents, while relational databases use SQL.
    *   **Scalability:** MongoDB is built to scale horizontally through sharding, while relational databases usually scale vertically.
    *   **Use Cases:** MongoDB is better suited to unstructured data and agile development cycles, while SQL databases are better for applications with highly structured data.

### 1.4 Use Cases for MongoDB

MongoDB is used in many different types of applications, such as:

*   **Content Management Systems (CMS):** For managing large amounts of dynamic and unstructured content.
*   **E-commerce Platforms:** For storing product catalogs, customer data, and order information.
*   **Mobile Applications:** For storing data for mobile apps that need fast access to different types of information.
*   **Social Media Platforms:** For storing unstructured text, image and video data, that can grow very quickly.
*   **Internet of Things (IoT):** For capturing and managing the data from sensors, and other smart devices.
*   **Gaming Platforms:** For storing player profiles, game state, and other dynamic information.
*   **Big Data Analytics:** For capturing, storing, and processing large amounts of data in analytics tasks.
*   **Real-time Data Processing:** For real-time analytics and for processing live user data and events that need high write throughput.

This introduction should give you a basic understanding of what MongoDB is, how it works, and why it is used in many modern applications. In the next sections, you will learn how to set up and use MongoDB with Python.



## 2. Setting Up MongoDB and Python Environment

### 2.1 Installing MongoDB

*   **Installation Guide:**
    *   You’ll need to install MongoDB on your system. Download the Community Server version from the official MongoDB website and follow the installation instructions for your operating system: [https://www.mongodb.com/try/download/community](https://www.mongodb.com/try/download/community).
    *   Alternatively, you can use MongoDB Atlas, the hosted cloud version of MongoDB, which offers a free tier and is accessed from your code through a connection string: [https://www.mongodb.com/cloud/atlas](https://www.mongodb.com/cloud/atlas)
    *   Make sure that MongoDB is running before trying the code snippets in this tutorial.
*   **MongoDB Shell:**
    *   The MongoDB shell (`mongosh`) is a command-line interface for interacting directly with a MongoDB server. It replaces the legacy `mongo` shell, which was removed in MongoDB 6.0.
    *   Open the shell by running `mongosh` in your console or terminal, and try some basic commands to familiarize yourself with the system.
    *   The `show dbs` command lists the databases, and `use <database_name>` switches the context to another database.

### 2.2 Installing PyMongo (MongoDB Python Driver)

*   **PyMongo:** The official MongoDB driver for Python, which lets you connect to and work with MongoDB from Python code.
*   **Installation:** Install PyMongo using pip:


In [49]:
# Installation of pymongo driver
!pip install pymongo

Defaulting to user installation because normal site-packages is not writeable


### 2.3 Connecting to MongoDB

Let's start by connecting to a MongoDB server from Python. PyMongo takes a connection string and returns a client object that is used for all database operations:


In [50]:
from pymongo import MongoClient

# Replace with your connection string if you are using MongoDB Atlas, or a server on a different host or port.
connection_string = "mongodb://localhost:27017/"

# Create a client object. MongoClient connects lazily, so send a ping
# to confirm that the server is actually reachable.
try:
    client = MongoClient(connection_string)
    client.admin.command("ping")
    print("Connection successful")
except Exception as e:
    print(f"Error connecting to MongoDB: {e}")

Connection successful


### 2.4 Working with Databases

Now, let's get a handle to a specific database by creating a database object, which you will use in the following sections.


In [51]:
# Define a database name; replace it with a database of your choice.
db_name = "mydatabase"

# Get a handle to the database. MongoDB creates it lazily on the first write.
db = client[db_name]
print(f"Database: {db.name}")

# List the existing databases. A new database appears here only after data has been written to it.
print(f"List of available databases: {client.list_database_names()}")

Database: mydatabase
List of available databases: ['admin', 'config', 'local', 'mydatabase']


### 2.5 Authentication (if needed)

If your database needs authentication, you will need to provide credentials on the connection string:


In [52]:
# Example for authentication:
# Replace with your real username, password and cluster name (if using MongoDB Atlas)
#connection_string_with_auth = "mongodb+srv://<username>:<password>@<clustername>.mongodb.net/?retryWrites=true&w=majority"
#client = MongoClient(connection_string_with_auth)

*   **Security:** Keep sensitive data such as usernames and passwords out of your code, especially if it will be committed to a version control system. Use environment variables or another secure method for managing credentials.
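
One possible way to follow this advice is sketched below: the connection string is assembled from environment variables instead of being written into the notebook. The variable names `MONGO_USER` and `MONGO_PASSWORD`, the helper `build_mongo_uri`, and the host `cluster0.example.mongodb.net` are assumptions made for illustration, not names required by MongoDB or PyMongo.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(host, user_var="MONGO_USER", password_var="MONGO_PASSWORD"):
    """Build a mongodb+srv URI from credentials held in environment variables."""
    user = os.environ[user_var]
    password = os.environ[password_var]
    # Percent-encode the credentials so characters such as '@' or ':'
    # cannot break the structure of the URI.
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# Demo values only; in practice the variables are set outside the code
# (for example in the shell or by a secrets manager).
os.environ.setdefault("MONGO_USER", "app_user")
os.environ.setdefault("MONGO_PASSWORD", "p@ss:word")

uri = build_mongo_uri("cluster0.example.mongodb.net")  # hypothetical host
print(uri)
```

The resulting string can be passed to `MongoClient(uri)` in the same way as the connection strings used earlier.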

This section covered the basic steps for setting up your environment, and for creating a connection to your database using PyMongo, which will be essential for the next sections, where you will learn how to add data, query data, and other types of operations.




## 3. Basic Operations with MongoDB and PyMongo

### 3.1 Creating and Inserting Documents

Now, let's create some documents and add them to a collection in our MongoDB database:


In [53]:
from pymongo import MongoClient

# Set up connection (replace with your details)
connection_string = "mongodb://localhost:27017/"
client = MongoClient(connection_string)
db_name = "mydatabase"
db = client[db_name]

# Create a collection (like a table in SQL) or get an existing one
collection_name = "mycollection"
collection = db[collection_name]

# Sample data
documents_to_insert = [
    {"name": "Apple", "color": "red", "price": 1.2, "category":"fruit"},
    {"name": "Banana", "color": "yellow", "price": 0.8, "category":"fruit"},
    {"name": "Carrot", "color": "orange", "price": 0.5, "category":"vegetable"},
    {"name": "Laptop", "brand": "Dell", "price": 1200, "category":"electronics"},
    {"name": "Headphones", "brand": "Sony", "price": 150, "category":"electronics"}
]

# Insert multiple documents into the collection
try:
    result = collection.insert_many(documents_to_insert)
    inserted_ids = result.inserted_ids
    print(f"Successfully inserted {len(inserted_ids)} documents.")
    print(f"Inserted IDs: {inserted_ids}")
except Exception as e:
    print(f"Error inserting documents: {e}")

Successfully inserted 5 documents.
Inserted IDs: [ObjectId('679652230e71066b1e313b1b'), ObjectId('679652230e71066b1e313b1c'), ObjectId('679652230e71066b1e313b1d'), ObjectId('679652230e71066b1e313b1e'), ObjectId('679652230e71066b1e313b1f')]


### 3.2 Finding and Querying Documents

Let's explore how to find documents in a MongoDB collection by using different types of queries:


In [None]:
# Find all documents in the collection
all_documents = list(collection.find())
print("All Documents:")
for doc in all_documents:
    print(doc)

# Find documents that match a specific condition
query = {"category": "fruit"}
filtered_documents = list(collection.find(query))
print ("\nFruits:")
for doc in filtered_documents:
   print (doc)

# Find documents with a price above a specific value:
query = {"price": {"$gt": 100}}
filtered_documents = list(collection.find(query))
print("\nItems with price above 100:")
for doc in filtered_documents:
    print(doc)

# Find a specific document by its _id.
# ObjectId("...") is only needed when the id arrives as a string (for example from a URL).
from bson.objectid import ObjectId
first_apple = collection.find_one({"name": "Apple"})
if first_apple is not None:
    # The stored _id is already an ObjectId, so it can be used directly in the query
    doc = collection.find_one({"_id": first_apple["_id"]})
    print("\nDocument with a specific ID:")
    print(doc)

All Documents:
{'_id': ObjectId('67964dfc0e71066b1e313af9'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1310, 'category': 'electronics'}
{'_id': ObjectId('67964dfc0e71066b1e313afa'), 'name': 'Headphones', 'brand': 'Sony', 'price': 170, 'category': 'electronics'}
{'_id': ObjectId('679650ef0e71066b1e313b01'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'release_date': '2024-06-10', 'in_stock': True, 'dimensions': {'height': 15, 'width': 7, 'depth': 0.8}}
{'_id': ObjectId('679651f30e71066b1e313b0a'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1220, 'category': 'electronics'}
{'_id': ObjectId('679651f30e71066b1e313b0b'), 'name': 'Headphones', 'brand': 'Sony', 'price': 170, 'category': 'electronics'}
{'_id': ObjectId('679651f40e71066b1e313b0d'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'release_date': '2024-06-10', 'in_stock': True, 'dimensions': {'height': 15, 'width':
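
Because a query is just a Python dictionary, filters can be assembled programmatically. The helper below is a small sketch of that idea; `price_filter` is a hypothetical name, and it only builds the filter document without contacting a server.

```python
def price_filter(category, min_price=None, max_price=None):
    """Return a MongoDB filter document for one category and an optional price range."""
    query = {"category": category}
    price = {}
    if min_price is not None:
        price["$gte"] = min_price   # greater than or equal
    if max_price is not None:
        price["$lte"] = max_price   # less than or equal
    if price:
        query["price"] = price      # several operators on one field are combined with AND
    return query


# Roughly equivalent to the SQL:
#   SELECT * FROM mycollection
#   WHERE category = 'electronics' AND price BETWEEN 100 AND 1000
electronics_query = price_filter("electronics", 100, 1000)
print(electronics_query)
```

The result can be passed straight to `collection.find(electronics_query)`.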

### 3.3 Updating Documents

To update a document we use the `update_one` or `update_many` methods. Let’s change the price of an item.


In [None]:
# Find a specific document by its name
query = {"name": "Laptop"}
# Update the price
update_value = {"$set": {"price": 1300}}
result = collection.update_one(query, update_value)
print(f"Updated document, updated: {result.modified_count} documents.")

# check the new value
updated_documents = list(collection.find(query))
print("\nUpdated laptop document:")
for doc in updated_documents:
    print(doc)


# Update multiple documents
query = {"category": "electronics"}
update_value = {"$inc": {"price": 10}}
result = collection.update_many(query, update_value)
print(f"Updated multiple documents, modified: {result.modified_count} documents.")

#check the new values
updated_documents = list(collection.find(query))
print("\nUpdated electronic items:")
for doc in updated_documents:
    print(doc)

Updated document, updated: 1 documents.

Updated laptop document:
{'_id': ObjectId('67964dfc0e71066b1e313af9'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1300, 'category': 'electronics'}
{'_id': ObjectId('679651f30e71066b1e313b0a'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1220, 'category': 'electronics'}
{'_id': ObjectId('679652120e71066b1e313b13'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1210, 'category': 'electronics'}
{'_id': ObjectId('679652230e71066b1e313b1e'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1200, 'category': 'electronics'}
Updated multiple documents, modified: 8 documents.

Updated electronic items:
{'_id': ObjectId('67964dfc0e71066b1e313af9'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1310, 'category': 'electronics'}
{'_id': ObjectId('67964dfc0e71066b1e313afa'), 'name': 'Headphones', 'brand': 'Sony', 'price': 180, 'category': 'electronics'}
{'_id': ObjectId('679651f30e71066b1e313b0a'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1230, 'category': 'electr
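
To make the effect of the `$set` and `$inc` operators concrete, the sketch below applies them to a dictionary in memory. It is an illustration of the semantics only: `apply_update` is a hypothetical helper, it covers just these two operators, and it is not how the server implements updates.

```python
import copy


def apply_update(doc, update):
    """Return a copy of doc with the $set and $inc parts of an update applied."""
    result = copy.deepcopy(doc)
    for field, value in update.get("$set", {}).items():
        result[field] = value                            # $set overwrites or creates the field
    for field, amount in update.get("$inc", {}).items():
        result[field] = result.get(field, 0) + amount    # $inc creates a missing field with the amount
    return result


laptop = {"name": "Laptop", "brand": "Dell", "price": 1200, "category": "electronics"}

after_set = apply_update(laptop, {"$set": {"price": 1300}})
after_inc = apply_update(after_set, {"$inc": {"price": 10}})
print(after_inc["price"])  # 1310
```

This mirrors the two updates in the cell above: the price is first set to 1300 and then incremented by 10.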

### 3.4 Deleting Documents

Use the `delete_one` or `delete_many` methods to remove specific documents from a collection:


In [56]:
# Delete one document that matches the specific criteria:
query = {"name": "Carrot"}
result = collection.delete_one(query)
print(f"Deleted one document, deleted count: {result.deleted_count}.")

# Delete all documents that match the specific criteria:
query = {"category": "fruit"}
result = collection.delete_many(query)
print(f"Deleted multiple documents, deleted count: {result.deleted_count}.")

# Check how many documents are left
all_documents = list(collection.find())
print("\nDocuments after deletion:")
for doc in all_documents:
    print(doc)

Deleted one document, deleted count: 1.
Deleted multiple documents, deleted count: 2.

Documents after deletion:
{'_id': ObjectId('67964dfc0e71066b1e313af9'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1310, 'category': 'electronics'}
{'_id': ObjectId('67964dfc0e71066b1e313afa'), 'name': 'Headphones', 'brand': 'Sony', 'price': 180, 'category': 'electronics'}
{'_id': ObjectId('679650ef0e71066b1e313b01'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'release_date': '2024-06-10', 'in_stock': True, 'dimensions': {'height': 15, 'width': 7, 'depth': 0.8}}
{'_id': ObjectId('679651f30e71066b1e313b0a'), 'name': 'Laptop', 'brand': 'Dell', 'price': 1230, 'category': 'electronics'}
{'_id': ObjectId('679651f30e71066b1e313b0b'), 'name': 'Headphones', 'brand': 'Sony', 'price': 180, 'category': 'electronics'}
{'_id': ObjectId('679651f40e71066b1e313b0d'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscr

### 3.5 Working with Indexes

Let's create an index on the `name` field of the collection. An index lets MongoDB answer queries that search by name without scanning every document, which matters especially for large datasets.


In [None]:
# Create an index for the "name" field.
index_name = collection.create_index("name")
print (f"Created index with name: {index_name}")

# Check if the index exists
index_info = list(collection.list_indexes())
print (f"\nIndexes in {collection_name}: {index_info}")

# Remove the index
collection.drop_index(index_name)
print ("\nIndex was dropped")

# Check if the index is present
index_info = list(collection.list_indexes())
print (f"\nIndexes in {collection_name}: {index_info}")

Created index with name: name_1



Indexes in mycollection: [SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')]), SON([('v', 2), ('key', SON([('name', 1)])), ('name', 'name_1')])]

Index was dropped

Indexes in mycollection: [SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])]
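
The name `name_1` in the output follows MongoDB's default naming pattern: each indexed field is joined with its direction. The sketch below reproduces that pattern for ascending and descending keys; `default_index_name` is a hypothetical helper written for illustration and does not cover special index types.

```python
ASCENDING, DESCENDING = 1, -1   # same values as pymongo.ASCENDING and pymongo.DESCENDING


def default_index_name(keys):
    """Derive the default index name from a list of (field, direction) pairs."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)


single = default_index_name([("name", ASCENDING)])
compound = default_index_name([("category", ASCENDING), ("price", DESCENDING)])
print(single)     # name_1
print(compound)   # category_1_price_-1
```

The same list of pairs is what a compound index takes, for example `collection.create_index([("category", ASCENDING), ("price", DESCENDING)])`, which can serve queries that filter by category and sort by price.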


This section has covered the basic operations of MongoDB using PyMongo, such as inserting, finding, updating, and deleting documents, and also working with indexes to improve performance. In the next sections, you will learn more advanced topics like aggregations, projections, and other interesting features of this database.






## 4. Advanced Operations and Features

### 4.1 Working with Aggregations

MongoDB's aggregation framework allows you to perform complex data transformations, grouping, and calculations within the database. Let's explore how to use it:


In [None]:
from pymongo import MongoClient
from bson.son import SON

# Set up connection (replace with your details)
connection_string = "mongodb://localhost:27017/"
client = MongoClient(connection_string)
db_name = "mydatabase"
db = client[db_name]

# Get the existing collection. You must have previously created the data
collection_name = "mycollection"
collection = db[collection_name]

# Use an aggregation pipeline for calculating the average price by category
# pipeline = [
#     {"$group": {"_id": "$category", "average_price": {"$avg": "$price"}}},
#     {"$sort": {"average_price":-1}}
# ]
# try:
#     results = list(collection.aggregate(pipeline))
#     print("Average price per category:")
#     for item in results:
#         print(item)
# except Exception as e:
#     print(f"Error performing the aggregation: {e}")

# Find the total number of items in every category.
# The $count accumulator requires MongoDB 5.0 or later; on older servers use {"$sum": 1} instead.
pipeline = [
    {"$group": {"_id": "$category", "total_items": {"$count": {}}}},
    {"$sort": {"total_items": -1}}
]
try:
    results = list(collection.aggregate(pipeline))
    print("\nTotal number of items in each category:")
    for item in results:
        print(item)
except Exception as e:
    print(f"Error performing the aggregation: {e}")


Total number of items in each category:
{'_id': 'electronics', 'total_items': 8}
{'_id': None, 'total_items': 3}
{'_id': 'vegetable', 'total_items': 1}
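
To show what a `$group` stage with `$avg` followed by `$sort` computes, the sketch below performs the same calculation in plain Python on the five sample documents from section 3.1. It is an in-memory illustration under the assumption that every document has a numeric `price`; the function name `average_price_by_category` is made up for this example.

```python
from collections import defaultdict

# The sample documents inserted in section 3.1
documents = [
    {"name": "Apple", "color": "red", "price": 1.2, "category": "fruit"},
    {"name": "Banana", "color": "yellow", "price": 0.8, "category": "fruit"},
    {"name": "Carrot", "color": "orange", "price": 0.5, "category": "vegetable"},
    {"name": "Laptop", "brand": "Dell", "price": 1200, "category": "electronics"},
    {"name": "Headphones", "brand": "Sony", "price": 150, "category": "electronics"},
]


def average_price_by_category(docs):
    """Group documents by category and average their prices, highest average first."""
    prices = defaultdict(list)
    for doc in docs:
        # Like {"$group": {"_id": "$category", ...}}; a missing category groups under None
        prices[doc.get("category")].append(doc["price"])
    averages = [
        {"_id": category, "average_price": sum(values) / len(values)}
        for category, values in prices.items()
    ]
    # Like {"$sort": {"average_price": -1}}
    return sorted(averages, key=lambda row: row["average_price"], reverse=True)


results = average_price_by_category(documents)
for row in results:
    print(row)
```

Documents without a `category` field fall into a group whose `_id` is `None`, which is also why the recorded aggregation output above contains an `'_id': None` row.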


### 4.2 Using Projections for Results

Projections are used to select which fields should be included in the results from your queries, which reduces the size of the data being transferred from the database, and may improve overall speed and performance.


In [59]:
# Using projection to select only the name and the price for items of the electronic category
query = {"category": "electronics"}
projection = {"name": 1, "price": 1, "_id":0}
filtered_documents = list(collection.find(query, projection))
print ("\nElectronic Items (using projection):")
for doc in filtered_documents:
    print(doc)


Electronic Items (using projection):
{'name': 'Laptop', 'price': 1310}
{'name': 'Headphones', 'price': 180}
{'name': 'Laptop', 'price': 1230}
{'name': 'Headphones', 'price': 180}
{'name': 'Laptop', 'price': 1220}
{'name': 'Headphones', 'price': 170}
{'name': 'Laptop', 'price': 1210}
{'name': 'Headphones', 'price': 160}
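
The behaviour of an inclusion projection can be sketched in plain Python. The helper `project` below is hypothetical and handles only top-level fields; it illustrates that listed fields are kept and that `_id` is returned unless it is explicitly set to `0`.

```python
def project(doc, projection):
    """Apply an inclusion-style projection to one document in memory."""
    include_id = projection.get("_id", 1)      # _id is included by default
    wanted = [field for field, flag in projection.items() if flag and field != "_id"]
    result = {}
    if include_id and "_id" in doc:
        result["_id"] = doc["_id"]
    for field in wanted:
        if field in doc:                       # fields missing from the document are skipped
            result[field] = doc[field]
    return result


laptop = {"_id": 1, "name": "Laptop", "brand": "Dell", "price": 1310, "category": "electronics"}

without_id = project(laptop, {"name": 1, "price": 1, "_id": 0})
with_id = project(laptop, {"name": 1})
print(without_id)   # {'name': 'Laptop', 'price': 1310}
print(with_id)      # {'_id': 1, 'name': 'Laptop'}
```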


### 4.3 Working with Different Data Types

MongoDB supports a wide variety of data types, including strings, integers, floats, booleans, dates, arrays, nested objects, and other types, which provide more flexibility when managing and processing different types of data:


In [60]:
# Adding a new document with different types of data
new_doc = {
    "name": "Smartphone",
    "brand": "Samsung",
     "price": 900.0,
    "features": ["camera", "touchscreen", "5G"],
    "release_date":  "2024-06-10",
     "in_stock": True,
      "dimensions": {"height": 15, "width": 7, "depth": 0.8}
}
try:
  result = collection.insert_one(new_doc)
  print(f"Successfully inserted document with id: {result.inserted_id}")
except Exception as e:
  print(f"Error inserting document: {e}")

# Find the new document:
query = {"name": "Smartphone"}
filtered_documents = list(collection.find(query))
print ("\nSmartphone Document:")
for doc in filtered_documents:
    print(doc)

Successfully inserted document with id: 679652240e71066b1e313b21

Smartphone Document:
{'_id': ObjectId('679650ef0e71066b1e313b01'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'release_date': '2024-06-10', 'in_stock': True, 'dimensions': {'height': 15, 'width': 7, 'depth': 0.8}}
{'_id': ObjectId('679651f40e71066b1e313b0d'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'release_date': '2024-06-10', 'in_stock': True, 'dimensions': {'height': 15, 'width': 7, 'depth': 0.8}}
{'_id': ObjectId('679652120e71066b1e313b16'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'release_date': '2024-06-10', 'in_stock': True, 'dimensions': {'height': 15, 'width': 7, 'depth': 0.8}}
{'_id': ObjectId('679652240e71066b1e313b21'), 'name': 'Smartphone', 'brand': 'Samsung', 'price': 900.0, 'features': ['camera', 'touchscreen', '5G'], 'rele
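
In the document above, `release_date` is stored as a string, so it does not behave as a date on the server. PyMongo stores `datetime.datetime` values as BSON dates (plain `datetime.date` objects are not accepted), which allows date comparisons in queries. The sketch below converts the string before insertion; `with_real_date` is a hypothetical helper, and treating the date as midnight UTC is an assumption.

```python
from datetime import datetime, timezone


def with_real_date(doc, field="release_date"):
    """Return a copy of doc whose ISO date string is replaced by a datetime (midnight UTC)."""
    converted = dict(doc)
    converted[field] = datetime.strptime(doc[field], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return converted


phone = {"name": "Smartphone", "brand": "Samsung", "release_date": "2024-06-10"}
typed_phone = with_real_date(phone)
print(typed_phone["release_date"].isoformat())   # 2024-06-10T00:00:00+00:00
```

A document prepared this way can be inserted with `collection.insert_one(typed_phone)` and later matched with a filter such as `{"release_date": {"$gte": datetime(2024, 1, 1, tzinfo=timezone.utc)}}`.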

### 4.4 Handling Errors and Exceptions

It is very important to implement good error handling when accessing external services, because they may not always work as expected, or there may be problems with network connections, authorization, or other external factors. The use of exceptions will prevent the programs from crashing when problems occur.


In [66]:
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

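# Set up connection with a wrong connection string to force an error
connection_string = "mongodb://wrongaddress:27017/"

# Create a mongo client object, with error handling
try:
    # MongoClient connects lazily, so use a short timeout and a ping to surface the failure
    client = MongoClient(connection_string, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")
    print("Connection successful")
    db = client["wrongdatabase"]
    print(f"Database: {db.name}")
except ConnectionFailure as e:
    # ServerSelectionTimeoutError is a subclass of ConnectionFailure
    print(f"Error connecting to MongoDB: {e}")
except Exception as e:
    print(f"Error: {e}")

Without the `ping`, this cell prints "Connection successful" and "Database: wrongdatabase" even though the host does not exist, because `MongoClient` does not contact the server until the first operation.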


### 4.5 Transactions and Atomicity

MongoDB supports multi-document ACID transactions, which keep data consistent when a single logical change touches several documents or collections. Transactions require a replica set or a sharded cluster; on a standalone server the example below fails with the error shown in its output.

In [67]:
# Example of atomic transactions.
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from bson import ObjectId

# Set up connection (replace with your details)
connection_string = "mongodb://localhost:27017/"
client = MongoClient(connection_string)
db_name = "mydatabase"
db = client[db_name]
collection_name = "mycollection"
collection = db[collection_name]

# Look up the _id values of the documents to update
doc1_id = collection.find_one({"name": "Laptop"})["_id"]
doc2_id = collection.find_one({"name": "Headphones"})["_id"]


try:
    with client.start_session() as session:
        # Used as a context manager, start_transaction() commits when the block
        # succeeds and aborts automatically when it raises.
        with session.start_transaction():
            # Update the Laptop price
            collection.update_one({"_id": doc1_id}, {"$set": {"price": 1500}}, session=session)
            # Update Headphones price
            collection.update_one({"_id": doc2_id}, {"$set": {"price": 160}}, session=session)
            # Insert a new transaction event
            transaction_event = {"type": "transaction", "description": "updated prices of laptop and headphones"}
            db["transactions"].insert_one(transaction_event, session=session)
        print("Transaction completed")

except Exception as e:
    # The session has already ended here, so no explicit abort is needed (or possible)
    print(f"Transaction aborted: {e}")

Transaction aborted: Transaction numbers are only allowed on a replica set member or mongos, full error: {'ok': 0.0, 'errmsg': 'Transaction numbers are only allowed on a replica set member or mongos', 'code': 20, 'codeName': 'IllegalOperation'}


This section concludes the basics and some intermediate operations for using MongoDB with Python, such as inserting, querying, updating, deleting documents, using aggregations, projections, performing error handling, and implementing transactions. In the next sections, you will combine these concepts to create more complex and useful applications.




## 5. Hands-On Lab: Building a Data Pipeline with MongoDB

### 5.1 Creating a Data Loading Script

Let's create a script to load data from a CSV file into a MongoDB collection. We'll use the pandas library to read the CSV, and then PyMongo to add the data to the database.

In [68]:
import pandas as pd
from pymongo import MongoClient

# Load the sample data from a csv file
data_url = "https://raw.githubusercontent.com/plotly/datasets/master/2014_apple_stock.csv"
df = pd.read_csv(data_url)

# Set up connection to MongoDB (replace with your details)
connection_string = "mongodb://localhost:27017/"
client = MongoClient(connection_string)
db_name = "mydatabase"
db = client[db_name]

# Create or access an existing collection
collection_name = "apple_stock_data"
collection = db[collection_name]

# Convert the DataFrame to a list of dictionaries and add it to the database
try:
    documents_to_insert = df.to_dict("records")
    result = collection.insert_many(documents_to_insert)
    inserted_ids = result.inserted_ids
    print(f"Successfully inserted {len(inserted_ids)} documents.")
    print(f"Inserted IDs: {inserted_ids}")
except Exception as e:
    print(f"Error inserting documents: {e}")

Successfully inserted 240 documents.
Inserted IDs: [ObjectId('679658240e71066b1e313b28'), ObjectId('679658240e71066b1e313b29'), ObjectId('679658240e71066b1e313b2a'), ObjectId('679658240e71066b1e313b2b'), ObjectId('679658240e71066b1e313b2c'), ObjectId('679658240e71066b1e313b2d'), ObjectId('679658240e71066b1e313b2e'), ObjectId('679658240e71066b1e313b2f'), ObjectId('679658240e71066b1e313b30'), ObjectId('679658240e71066b1e313b31'), ObjectId('679658240e71066b1e313b32'), ObjectId('679658240e71066b1e313b33'), ObjectId('679658240e71066b1e313b34'), ObjectId('679658240e71066b1e313b35'), ObjectId('679658240e71066b1e313b36'), ObjectId('679658240e71066b1e313b37'), ObjectId('679658240e71066b1e313b38'), ObjectId('679658240e71066b1e313b39'), ObjectId('679658240e71066b1e313b3a'), ObjectId('679658240e71066b1e313b3b'), ObjectId('679658240e71066b1e313b3c'), ObjectId('679658240e71066b1e313b3d'), ObjectId('679658240e71066b1e313b3e'), ObjectId('679658240e71066b1e313b3f'), ObjectId('679658240e71066b1e313b40')
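
For larger files it can help to load the records in batches rather than as one list. The generator below is a minimal sketch of that idea using only the standard library; `chunked` and the batch size are assumptions for illustration, and the loop that would call `insert_many` is shown only in the note that follows.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in records; in the lab these would come from df.to_dict("records")
records = [{"row": i} for i in range(7)]
batch_sizes = [len(batch) for batch in chunked(records, 3)]
print(batch_sizes)   # [3, 3, 1]
```

With a real collection, the loading step becomes `for batch in chunked(documents_to_insert, 1000): collection.insert_many(batch)`.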

### 5.2 Implementing Data Transformation

Now, let's implement a data transformation step that adds a new field to every document in the collection using custom logic.


In [70]:
from pymongo import MongoClient

# Set up connection (replace with your details)
connection_string = "mongodb://localhost:27017/"
client = MongoClient(connection_string)
db_name = "mydatabase"
db = client[db_name]
collection_name = "apple_stock_data"
collection = db[collection_name]

# Define a processing function to transform the data.
# Subtracting a constant is only a placeholder for real transformation logic.
def transform_document(doc):
    doc["daily_change"] = doc["AAPL_y"] - 10
    return doc

# Apply the transformation and add the new field
try:
    # Get all the documents in the collection
    documents = list(collection.find())
    # Process all the documents
    updated_documents = [transform_document(doc) for doc in documents]

    # Replace the existing documents with the updated ones.
    # Note: delete-then-insert is not atomic; it is acceptable only for a small demo collection.
    collection.delete_many({})
    collection.insert_many(updated_documents)

    print(f"Updated {len(updated_documents)} documents.")
    # Check how the results look
    for doc in collection.find().limit(3):
        print(doc)
except Exception as e:
    print(f"Error updating documents: {e}")

Updated 240 documents.
{'_id': ObjectId('679658240e71066b1e313b28'), 'AAPL_x': '2014-01-02', 'AAPL_y': 77.44539475, 'daily_change': 67.44539475}
{'_id': ObjectId('679658240e71066b1e313b29'), 'AAPL_x': '2014-01-03', 'AAPL_y': 77.04557544, 'daily_change': 67.04557544}
{'_id': ObjectId('679658240e71066b1e313b2a'), 'AAPL_x': '2014-01-06', 'AAPL_y': 74.89697204, 'daily_change': 64.89697204}
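
Deleting and re-inserting every document works for a small lab dataset, but the same field can usually be added in place. The sketch below assumes MongoDB 4.2 or later, where `update_many` accepts an aggregation pipeline as the update argument, and computes the same `AAPL_y - 10` value on the server. The helper name `add_daily_change` and its `offset` parameter are illustrative and not part of the original lab.

```python
def add_daily_change(collection, offset=10):
    """Add a `daily_change` field in place to every document with a numeric AAPL_y.

    Assumes MongoDB 4.2+, where the update argument may be an aggregation
    pipeline, so the new value can be computed from an existing field.
    """
    # Only touch documents whose AAPL_y is numeric, so $subtract cannot fail.
    numeric_only = {"AAPL_y": {"$type": "number"}}
    pipeline = [
        {"$set": {"daily_change": {"$subtract": ["$AAPL_y", offset]}}}
    ]
    result = collection.update_many(numeric_only, pipeline)
    return result.modified_count
```

With the `collection` object from the cell above, `add_daily_change(collection)` would return the number of documents that were modified, without the collection ever being emptied.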


### 5.3 Developing a Retrieval System

Now, let’s create a retrieval helper that queries the collection and returns the documents matching a filter, with optional projection, sorting, and limiting.


In [77]:
# Implement data retrieval from database with different filters
from pymongo import MongoClient

# Set up connection (replace with your details)
connection_string = "mongodb://localhost:27017/"
client = MongoClient(connection_string)
db_name = "mydatabase"
db = client[db_name]
collection_name = "apple_stock_data"
collection = db[collection_name]

# Helper to retrieve data with an optional projection, sort, and limit
def retrieve_data(collection, query, projection=None, sort_field=None, sort_direction=-1, limit=0):
    # In PyMongo, limit=0 means "no limit"; passing None would raise a TypeError
    cursor = collection.find(query, projection)
    if sort_field:
        cursor = cursor.sort(sort_field, sort_direction)
    return list(cursor.limit(limit))


# Test retrieval system
query = {"AAPL_y": {"$gt": 10}}  # price greater than 10
projection = {"AAPL_x": 1, "AAPL_y":1, "_id":0}
sort_field = "AAPL_x"
sort_direction = -1 #descending order
limit = 5
results = retrieve_data(collection, query, projection, sort_field, sort_direction, limit)
print ("Results of the search with filter, projection, sort and limit:")
for item in results:
    print (item)

# Test with a filter and projection, but no sort
query = {"AAPL_y": {"$lt": 20000}}  # price less than 20000
results = retrieve_data(collection, query, projection, limit=limit)
print ("\nResults of the search with filter only:")
for item in results:
    print (item)


# Test with just a limit
results = retrieve_data(collection, {},  limit = 3)
print ("\nFirst 3 elements:")
for item in results:
    print (item)

Results of the search with filter, projection, sort and limit:
{'AAPL_x': '2014-12-12', 'AAPL_y': 110.0271393}
{'AAPL_x': '2014-12-11', 'AAPL_y': 111.8174772}
{'AAPL_x': '2014-12-10', 'AAPL_y': 113.9603314}
{'AAPL_x': '2014-12-09', 'AAPL_y': 109.7554968}
{'AAPL_x': '2014-12-08', 'AAPL_y': 113.6533452}

Results of the search with filter only:
{'AAPL_x': '2014-01-02', 'AAPL_y': 77.44539475}
{'AAPL_x': '2014-01-03', 'AAPL_y': 77.04557544}
{'AAPL_x': '2014-01-06', 'AAPL_y': 74.89697204}
{'AAPL_x': '2014-01-07', 'AAPL_y': 75.856461}
{'AAPL_x': '2014-01-08', 'AAPL_y': 75.09194679}

First 3 elements:
{'_id': ObjectId('679658240e71066b1e313b28'), 'AAPL_x': '2014-01-02', 'AAPL_y': 77.44539475, 'daily_change': 67.44539475}
{'_id': ObjectId('679658240e71066b1e313b29'), 'AAPL_x': '2014-01-03', 'AAPL_y': 77.04557544, 'daily_change': 67.04557544}
{'_id': ObjectId('679658240e71066b1e313b2a'), 'AAPL_x': '2014-01-06', 'AAPL_y': 74.89697204, 'daily_change': 64.89697204}
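
Because the `AAPL_x` values in this dataset are ISO-formatted date strings (`YYYY-MM-DD`), they sort lexicographically in chronological order, so a date range can be expressed with plain `$gte`/`$lte` string comparisons. The helper below is a hypothetical addition (the name `date_range_query` does not come from the lab) that builds such a filter for use with `retrieve_data`.

```python
from datetime import date


def date_range_query(start, end, field="AAPL_x"):
    """Build a filter matching documents whose date string lies in [start, end]."""
    # date.fromisoformat raises ValueError on malformed input such as "2014-13-01".
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError("start must not be after end")
    # Re-serialize so both bounds are in canonical YYYY-MM-DD form.
    return {field: {"$gte": start_date.isoformat(), "$lte": end_date.isoformat()}}


query = date_range_query("2014-03-01", "2014-03-31")
print(query)
```

The resulting dictionary can be passed as the `query` argument, for example `retrieve_data(collection, query, projection, "AAPL_x", 1, 5)`. This only works while dates are stored as zero-padded ISO strings; if they were stored as BSON dates, the bounds would need to be `datetime` objects instead.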


### 5.4 Testing and Evaluating Performance

A complete evaluation should cover several tests and metrics, such as query execution time, system resource usage, and whether the retrieved data is correct.


In [81]:
import time
# Time how long it takes to retrieve a small result set.
# retrieve_data already materializes the cursor into a list, so no extra loop is needed.
start = time.perf_counter()
results = retrieve_data(collection, {}, projection={"AAPL_x": 1, "AAPL_y": 1, "_id": 0}, limit=5)
end = time.perf_counter()
print(f"Time taken to retrieve {len(results)} documents: {end - start:.4f} seconds")

Time taken to retrieve 5 documents: 0.0030 seconds
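
A single timing of a five-document query is easily dominated by noise such as network round-trips or cold caches. One slightly more robust approach, sketched here, is to repeat the call several times and report the minimum and median. The helper name `time_call` and the default repeat count are illustrative choices, and the workload shown is a stand-in so the snippet runs without a database.

```python
import statistics
import time


def time_call(func, repeats=20):
    """Run func() `repeats` times and return (min, median) wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations), statistics.median(durations)


# Stand-in workload; swap in a real query to measure the database.
fastest, median = time_call(lambda: sum(range(10_000)))
print(f"min: {fastest:.6f} s, median: {median:.6f} s")
```

To time the lab's retrieval helper instead, pass something like `lambda: retrieve_data(collection, {}, limit=5)` as `func`.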


This lab walked through a simple data processing pipeline: loading data from an external file, transforming it, and retrieving it from MongoDB. You can use it as a base for building more complex applications.





## 6. Challenges, Best Practices, and Future Directions

### 6.1 Scalability and Performance Optimization

Designing for scalability and performance is crucial for any real-world application that uses databases.

*   **Scalability Challenges in MongoDB:**
    *   **Data Volume:** Handling extremely large datasets.
    *   **Read/Write Loads:** Handling a large number of users performing read and write operations simultaneously.
    *   **Query Complexity:** Processing complex queries that require aggregations and joins.
    *  **Real-time Data:** Providing fast access to real-time data and live data streams.
*  **Scalability and Performance Optimization Techniques:**
    1.  **Sharding:**
        *   **Mechanism:** Partition the database into smaller, manageable segments (shards) across multiple servers or nodes.
        *   **Benefits:** Distributes data and workload, enabling higher throughput and handling very large datasets.
        *   **Considerations:** Requires choosing a shard key and a partitioning strategy (hashed or ranged). MongoDB's `mongos` routers direct queries to the correct shards, but a poorly chosen shard key leads to unevenly distributed data and to queries that must contact every shard.
    2.  **Replication:**
        *   **Mechanism:** Maintain multiple copies of the data on different servers (a replica set) for availability and, optionally, to serve reads from secondary members.
        *   **Benefits:** Improves data availability, protects against data loss, and can increase read throughput when reads are directed to secondaries.
        *   **Considerations:** Replication is asynchronous, so secondaries can lag behind the primary and return stale data. Every replica also needs its own storage.
    3.  **Indexing Strategies:**
        *   **Mechanism:** Use indexes to speed up data retrieval on frequently queried fields.
        *   **Benefits:** Improves search performance and reduces query processing time.
        *   **Considerations:** Create indexes that match the queries the application actually runs, and be aware that each index adds write latency and memory usage.
    4.  **Query Optimization:**
        *   **Mechanism:** Keep queries fast by filtering on indexed fields, projecting only the fields you need, and limiting the number of documents returned.
        *   **Benefits:** Reduce query time, and minimize resource consumption.
    5.  **Caching:**
        *   **Mechanism:** Implement caching mechanisms to store results from frequently performed queries, or commonly accessed data in faster memory.
        *  **Benefits:** Reduces access time for data that is accessed very often, and reduces the workload on the database.
    6.  **Connection Pooling:**
        *  **Mechanism:** Reuse connections from a pool of already established connections instead of opening and closing one for each query. PyMongo's `MongoClient` maintains such a pool internally, so create one client per process and share it.
        *  **Benefits:** Avoids connection setup overhead on each operation, which lowers query latency.
    7.  **Read and Write Optimization:**
        *   **Mechanism:** In a replica set, all writes go to the primary; read operations can be directed to secondary members through a read preference so that they compete less with writes.
        *   **Benefits:** Balances the workload and can improve read performance, at the cost of possibly reading slightly stale data.
    8.  **Hardware Optimization:**
        *  **Mechanism:** Use specialized hardware that is better at handling database operations (CPUs with more cores, fast memory, fast disks, optimized network).
        *   **Benefits:** Improves processing speed, and reduces latency.
    9.  **Monitoring:**
        *  **Mechanism:** Monitor database performance, identify bottlenecks, and find opportunities for improvements.
        *  **Benefits:** Allows you to act proactively and make changes before a problem occurs.
    10. **Data Archiving:**
        *   **Mechanism:** Move old or rarely used data to a less expensive storage solution, so that it does not use resources from more important areas.
        *   **Benefits:** Improves the performance of the system by reducing the database size.
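
To make the indexing and query-optimization points above more concrete, the sketch below creates an index on the date field and checks whether a query's winning plan contains an index scan. `create_index` and `Cursor.explain()` are PyMongo calls; the helper names `plan_stages`, `query_uses_index`, and `ensure_date_index` are illustrative. The exact shape of the explain output varies across MongoDB versions, so the plan is walked generically rather than through fixed keys.

```python
def plan_stages(plan):
    """Recursively collect every "stage" name found in an explain plan."""
    stages = []
    if isinstance(plan, dict):
        if "stage" in plan:
            stages.append(plan["stage"])
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages


def query_uses_index(collection, query):
    """Return True if the winning plan for `query` contains an index scan stage."""
    explanation = collection.find(query).explain()
    winning_plan = explanation.get("queryPlanner", {}).get("winningPlan", {})
    return any("IXSCAN" in str(stage) for stage in plan_stages(winning_plan))


def ensure_date_index(collection, field="AAPL_x"):
    """Create an ascending index on `field`; returns the index name."""
    return collection.create_index([(field, 1)])
```

With the lab's `collection`, calling `ensure_date_index(collection)` and then `query_uses_index(collection, {"AAPL_x": "2014-01-02"})` would be expected to report an index scan, whereas a filter on an unindexed field would typically fall back to a collection scan.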

### 6.2 Security Considerations

Security is very important when creating applications that handle sensitive data. It’s important to consider security in all phases of design, development and deployment.

*   **Key Security Considerations:**
    *   **Authentication and Authorization:**
        *   **Best Practice:** Enable authentication to verify user identity, and use role-based authorization to define who can read, write, and modify data in the system.
    *   **Data Encryption:**
        *   **Best Practice:** Encrypt sensitive data both in transit (between the application and the database), and at rest (when it is stored), to prevent unauthorized access to sensitive information.
    *   **Input Validation and Sanitization:**
        *  **Best Practice:** Filter data at the input level to prevent injection attacks (in MongoDB, typically query-operator injection rather than SQL injection) and other attacks based on sending malicious data to the system.
        *  **Technique:** Use methods to validate and sanitize data before using it to query or update the database.
    *   **Access Control:**
        *  **Best Practice:** Limit each application's database access to only the resources it requires, and avoid granting access that is not necessary.
    *   **Regular Auditing and Monitoring:**
        *   **Best Practice:** Monitor the database activity for suspicious behavior, and also implement security audits to make sure that there are not any security flaws or exploits.
    *   **Secure Configuration:**
        *  **Best Practice:** Harden the database configuration: enable access control, bind the server only to the network interfaces that need it, and restrict network exposure with firewalls.
    *  **Data Backup and Recovery:**
         *  **Best Practice:** Make backups of all data, and create a solid procedure for recovering data when a security breach or a problem is identified.
    *   **Use of Secure Protocols:**
        *   **Best Practice:** Use secure protocols for client-server communication (TLS/SSL), and implement security measures when interacting with external APIs or external tools.
    *   **Credential Management:**
        *   **Best Practice:** Store connection strings, usernames and passwords using environment variables, key management systems or other tools for storing sensitive information, to prevent them from being exposed.
    *   **Vulnerability Scanning:**
        *   **Best Practice:** Use tools to scan your code for vulnerabilities, to identify and correct security flaws.
    *   **Regular Software Updates:**
        *   **Best Practice:** Keep all your software components up to date, by implementing a patch system for all software components.
    *   **Principle of Least Privilege:**
        *   **Best Practice:** Apply the principle of least privilege, which means that only the bare minimum required resources should be granted to each user, component, or API.
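
Two of the practices above, credential management and input validation, can be sketched in a few lines. The environment variable names (`MONGODB_USER`, `MONGODB_PASSWORD`, `MONGODB_HOST`) and both helper names are assumptions made for this example; adapt them to your deployment. The second helper illustrates one simple defence against query-operator injection: accept only scalar values from user input so that a payload such as `{"$gt": ""}` cannot be smuggled into a filter.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble a MongoDB URI from environment variables instead of hard-coding it."""
    host = env.get("MONGODB_HOST", "localhost:27017")  # hypothetical variable name
    user = env.get("MONGODB_USER")                     # hypothetical variable name
    password = env.get("MONGODB_PASSWORD")             # hypothetical variable name
    if user and password:
        # Percent-encode credentials so characters such as '@' or ':' are URI-safe.
        return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}/"
    return f"mongodb://{host}/"


def safe_equality_filter(field, value):
    """Build an equality filter, rejecting input that could carry query operators."""
    if not isinstance(field, str) or field.startswith("$"):
        raise ValueError("field name must be a string that does not start with '$'")
    if isinstance(value, (dict, list)):
        raise ValueError("only scalar values are accepted for equality filters")
    return {field: {"$eq": value}}
```

The connection string returned by `build_connection_string()` can be passed to `MongoClient` in place of the hard-coded `connection_string` used in the lab, and `safe_equality_filter("AAPL_x", user_input)` can replace filters built directly from untrusted input.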

### 6.3 Schema Design and Data Modeling

Designing a database schema is a crucial step in creating efficient, scalable, and reliable applications.

*   **Key Concepts in MongoDB Schema Design:**
    *  **Flexible Documents:** Because MongoDB stores data as JSON-like (BSON) documents, documents in the same collection can have different fields and data types, and it is not necessary to define a schema for each collection in advance.
    *   **Embedded Documents:** Nested objects and arrays can be stored inside of a single document, which can be useful when dealing with complex data types.
    *   **Denormalization:** Data can be duplicated across documents to avoid joins and to optimize queries for specific use cases, at the cost of keeping the copies in sync on updates.
*  **Data Modeling Strategies:**
    *   **Understand the Data:** Clearly understand your data, its size, its complexity, and the type of queries that will be used.
    *  **Choose the Right Data Types:** Select data types that are appropriate to the information you are going to store, and understand their memory implications.
    *  **Optimize for Queries:** The schema should be optimized for the main types of queries that will be performed to ensure that data can be retrieved with the minimum amount of resources and in the shortest possible time.
    *   **Use Indexes:** Create indexes for frequently queried fields to improve search and retrieval times.
    *   **Use Aggregations:** If you need to perform complex aggregations, you may need to optimize your schema to make that process more efficient.
    *   **Performance Testing:** Test database performance and adjust your schema accordingly.
    *   **Schema Evolution:** Plan for future changes in the data structure, and for how to handle new fields and new types of information.
    *   **Data Validation:** Use techniques for validating data and for ensuring that the data is correct and consistent with your design.
    *   **Scalability:** Design a schema that can scale with an increase in the amount of data, or the number of users.
    *   **Security:** Implement security techniques that protect your data.
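
As one way to apply the data validation point above, MongoDB can enforce a `$jsonSchema` validator on a collection. The sketch below describes the two fields used in the lab (`AAPL_x` as a `YYYY-MM-DD` string, `AAPL_y` as a number) and attaches the rule to an existing collection with the `collMod` command. The names `STOCK_SCHEMA` and `apply_validator` are illustrative, and the `moderate` validation level is an assumption chosen so that documents that already fail validation can still be updated.

```python
STOCK_SCHEMA = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["AAPL_x", "AAPL_y"],
        "properties": {
            "AAPL_x": {
                "bsonType": "string",
                "pattern": r"^\d{4}-\d{2}-\d{2}$",
                "description": "date as a YYYY-MM-DD string",
            },
            "AAPL_y": {
                "bsonType": ["double", "int", "long", "decimal"],
                "description": "price as a numeric value",
            },
        },
    }
}


def apply_validator(db, collection_name, schema=STOCK_SCHEMA):
    """Attach the validator to an existing collection via the collMod command."""
    return db.command(
        "collMod",
        collection_name,
        validator=schema,
        validationLevel="moderate",
    )
```

With the lab's `db` object, `apply_validator(db, "apple_stock_data")` would make later inserts that lack a price, or that store the date in another format, fail with a write error instead of silently entering the collection.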

### 6.4 Future Trends and Integration with AI/ML

The future of MongoDB is related to emerging trends such as the use of AI and Machine Learning.

*   **Future Trends:**
    *   **AI and ML Integration:** Tighter integrations between MongoDB and AI/ML tools, for data preprocessing, feature engineering, model deployment, and other AI related tasks.
    *   **Vector Search:** Improved support for vector search, embeddings, and other techniques used for building AI applications.
    *   **Serverless and Cloud Native:** Increasing adoption of cloud-based and serverless databases, to provide higher availability and improved scalability.
    *   **Real-Time Data:** Improved real-time data capabilities, including streaming, change data capture (CDC), and event handling.
    *  **Edge Computing:** Support for use in edge computing environments, where data is stored and processed closer to its sources.
    *  **Multi-Cloud Support:** Providing better support for use across different cloud platforms and vendors.
    *   **Data Governance and Security:** Stronger security controls and data governance techniques.
    *   **Improved Data Visualization:** Improved tools to visualize and analyze MongoDB data.
    *   **Document Understanding:** New tools and APIs to perform better semantic analysis of data.
    *   **Low Code and No Code:** More support for no-code and low-code development environments.
    *   **Schema Evolution:** Automated methods for schema evolution, to allow new fields to be easily added or updated.
    *   **Federated Databases:** Use federated databases and distributed technologies to improve access and processing of data from multiple different locations.
    *   **Data Interoperability:** Improve interoperability, and the possibility to export and import data in different formats.

This concludes the comprehensive tutorial on MongoDB. You now have the knowledge and tools to effectively work with MongoDB using Python, from basic concepts to advanced applications and considerations.

