Theoretical Questions

1. What are the key differences between SQL and NoSQL databases?


SQL and NoSQL databases differ primarily in their data models, scalability, and flexibility. SQL databases, also known as relational databases, store data in structured tables with predefined schemas, making them highly suitable for applications requiring consistency, complex queries, and strict data integrity. They use Structured Query Language (SQL) for defining and manipulating data, and they typically provide strong ACID (Atomicity, Consistency, Isolation, Durability) guarantees for transactions.

In contrast, NoSQL databases are non-relational and provide more flexible data models, such as document, key-value, columnar, or graph formats, which allow storage of unstructured, semi-structured, or rapidly changing data without requiring a fixed schema. They are designed for horizontal scalability, high performance, and handling large volumes of diverse data across distributed systems, often favoring eventual consistency over strict ACID compliance. While SQL databases excel in transactional systems like banking or enterprise applications, NoSQL databases are better suited for big data, real-time analytics, social networks, and applications with dynamic or varied data structures.

2. What makes MongoDB a good choice for modern applications?

MongoDB is a good choice for modern applications because it offers flexibility, scalability, and speed, which align well with today’s dynamic data needs. Unlike traditional relational databases, MongoDB uses a document-oriented model where data is stored in JSON-like structures, allowing developers to work with complex, nested, and rapidly changing data without rigid schemas. This schema-less approach accelerates development and makes it easier to adapt as application requirements evolve.

3. Explain the concept of collections in MongoDB.

In MongoDB, a collection is a grouping of documents, similar to a table in relational databases but without the rigid structure of rows and columns. Each collection holds multiple documents, which are JSON-like objects consisting of key-value pairs. Unlike relational tables, collections do not require a predefined schema, meaning each document within the same collection can have a different structure and set of fields. This flexibility allows developers to store and manage unstructured or semi-structured data efficiently. Collections are created automatically when the first document is inserted, and they support indexing, queries, and aggregation operations for fast data access.

4. How does MongoDB ensure high availability using replication?

MongoDB ensures high availability through a mechanism called replication, which is implemented using replica sets. A replica set is a group of MongoDB servers that maintain the same dataset, providing redundancy and fault tolerance. Within a replica set, one node is designated as the primary, and the others act as secondaries. All write operations, and by default all reads, are handled by the primary, while the secondaries continuously replicate the primary’s data through an oplog (operations log).

If the primary server fails or becomes unreachable, the replica set automatically triggers an election process among the secondaries to select a new primary. This failover process happens without manual intervention, ensuring that the database remains available to applications. Secondaries can also serve read requests (if configured), which helps distribute workload and improve performance. By maintaining multiple copies of the data across servers, MongoDB’s replication not only ensures high availability but also safeguards against data loss, making it highly reliable for modern, distributed applications.
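
As a rough illustration of how an application targets a replica set rather than a single server, the sketch below assembles a connection string. The host names, the set name rs0, and the choice of secondaryPreferred are placeholders and not values from these notes; replicaSet and readPreference are standard MongoDB connection-string options.

```python
from urllib.parse import urlencode

# Hypothetical members of a three-node replica set (placeholder host names).
hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]

# Standard connection-string options: the set name and where reads may go.
options = {
    "replicaSet": "rs0",                     # assumed replica set name
    "readPreference": "secondaryPreferred",  # read from a secondary when one is available
}

uri = "mongodb://" + ",".join(hosts) + "/?" + urlencode(options)
print(uri)
```

A string of this shape can be passed to MongoClient, which then discovers the current primary and reconnects after a failover.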

5. What are the main benefits of MongoDB Atlas?

Key Benefits of MongoDB Atlas

Fully Managed Service:
Atlas handles much of the operational overhead: provisioning, patching, upgrades, backups, and monitoring. This means less effort on ops/DBA work, letting development teams focus more on building application features.

High Availability & Resilience:
Atlas clusters are built with replica sets and can be deployed across different cloud regions. If there’s a hardware failure or region outage, automatic failover ensures the database remains available.

Scalability & Global Distribution:
Scale up or out easily: You can scale resources (CPU, RAM, storage) vertically (bigger machines) or horizontally (sharding) with minimal friction.
Multi-region / multi-cloud deployments: Clusters can span regions and cloud providers to lower latency (by putting data close to users), meet data residency requirements, and add redundancy for disaster recovery. This also helps in regulatory or compliance scenarios.

6. What is the role of indexes in MongoDB, and how do they improve performance?

In MongoDB, indexes play a crucial role in improving the efficiency of query operations. An index is a special data structure that stores a small portion of a collection’s data in an easily searchable form, much like an index in a book. Without indexes, MongoDB would need to scan every document in a collection to fulfill a query—a process called a collection scan—which is slow and inefficient, especially with large datasets.

Indexes improve performance by allowing MongoDB to quickly locate documents that match query conditions without scanning the entire collection. For example, if you frequently query a "users" collection by the email field, creating an index on email enables the database to jump directly to the matching entries. MongoDB supports various types of indexes such as single-field, compound, multikey (for arrays), text, and geospatial indexes, each designed for specific use cases.
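
The difference between a collection scan and an index lookup can be sketched in plain Python. This is only an analogy: a dict stands in for the index (MongoDB's real indexes are B-tree structures), and the sample users are made up. With PyMongo, a real index on this field would typically be created with create_index("email").

```python
# Toy "collection": a list of user documents (made-up sample data).
users = [{"_id": i, "email": f"user{i}@example.com"} for i in range(100_000)]

def collection_scan(docs, email):
    """Check every document until a match is found, like a collection scan."""
    examined = 0
    for doc in docs:
        examined += 1
        if doc["email"] == email:
            return doc, examined
    return None, examined

# Toy "index": maps each email value to the position of its document.
email_index = {doc["email"]: pos for pos, doc in enumerate(users)}

def index_lookup(docs, index, email):
    """Jump straight to the matching document via the index."""
    pos = index.get(email)
    return (docs[pos], 1) if pos is not None else (None, 0)

target = "user99999@example.com"
print(collection_scan(users, target)[1])            # documents examined by the scan
print(index_lookup(users, email_index, target)[1])  # documents examined via the index
```

The scan examines all 100,000 documents to reach the last one, while the lookup touches a single document. The trade-off is that the index itself takes memory and must be updated on every write.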

7. Describe the stages of the MongoDB aggregation pipeline.

The MongoDB aggregation pipeline is a framework for processing data in stages, where each stage transforms the documents and passes them to the next stage. It’s similar to a data processing pipeline, allowing complex data analysis directly within the database. Here are the main stages:

$match – Filters documents based on specified criteria, similar to a WHERE clause in SQL. This stage reduces the dataset early, improving performance.
Example: Select only orders with status: "completed".

$project – Reshapes documents by including, excluding, or adding computed fields. It works like a SELECT statement that chooses which fields to return.
Example: Return only customerName and totalAmount, or compute a new field like discountedPrice.

$group – Groups documents by a field (or fields) and applies accumulator expressions (such as $sum, $avg, $max, $min). Comparable to GROUP BY in SQL.
Example: Group sales by region and calculate total revenue per region.

$sort – Sorts documents by one or more fields in ascending or descending order.
Example: Sort users by age in descending order.

$limit – Restricts the number of documents that pass through the pipeline.
Example: Show only the top 10 highest-paying customers.

$skip – Skips a specified number of documents, often used with $limit for pagination.
Example: Skip the first 20 results and return the next 10.

$unwind – Deconstructs an array field so that each element generates a separate document.
Example: Break down a document with an array of items into individual documents, one per item.

$lookup – Performs a left outer join with another collection to merge related data.
Example: Join orders with customers to show customer details alongside each order.

$out – Writes the results of the aggregation to a collection, replacing that collection’s contents if it already exists. It must be the last stage in the pipeline.
Example: Store processed data into a reporting collection.

$count – Returns the number of documents at a given stage in the pipeline.
Example: Count how many completed orders exist after filtering.
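
To make the stage-by-stage flow concrete, the sketch below writes a four-stage pipeline as a Python list and then reproduces the same steps in plain Python. The field names (region, status, totalAmount) and the sample orders are assumptions for illustration. The plain-Python part only mimics what the server would do; it does not run against MongoDB.

```python
from collections import defaultdict

# Made-up sample orders standing in for a collection.
orders = [
    {"region": "West", "status": "completed", "totalAmount": 120},
    {"region": "East", "status": "completed", "totalAmount": 80},
    {"region": "West", "status": "pending", "totalAmount": 300},
    {"region": "West", "status": "completed", "totalAmount": 50},
    {"region": "South", "status": "completed", "totalAmount": 200},
]

# The pipeline as it would be written for MongoDB: filter, group, sort, limit.
pipeline = [
    {"$match": {"status": "completed"}},
    {"$group": {"_id": "$region", "revenue": {"$sum": "$totalAmount"}}},
    {"$sort": {"revenue": -1}},
    {"$limit": 2},
]

# The same four steps in plain Python, to show what each stage does.
matched = [o for o in orders if o["status"] == "completed"]         # $match
totals = defaultdict(int)
for o in matched:                                                    # $group with $sum
    totals[o["region"]] += o["totalAmount"]
grouped = [{"_id": region, "revenue": total} for region, total in totals.items()]
ranked = sorted(grouped, key=lambda d: d["revenue"], reverse=True)   # $sort descending
top = ranked[:2]                                                     # $limit
print(top)
```

With PyMongo, the pipeline list would be passed to a collection's aggregate() method. Placing $match first means the later stages only see the completed orders.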

8. What is sharding in MongoDB? How does it differ from replication?

Sharding in MongoDB is a method of distributing data across multiple servers (called shards) to handle very large datasets and high throughput operations. Each shard holds a subset of the data, and together they form a complete dataset. A shard key is used to determine how documents are distributed across shards, ensuring queries are directed efficiently. Sharding enables horizontal scaling, meaning you can add more machines to handle increased data and workload, which is critical for applications with massive growth in users or data volume.

On the other hand, replication in MongoDB is about creating multiple copies of the same dataset on different servers (replica sets) to ensure high availability and fault tolerance. Replication does not split data but duplicates it, so if the primary node fails, another replica can take over, keeping the system online.
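
The contrast can be sketched with a toy model. The shard names and the MD5-based routing below are illustrative assumptions: a real cluster splits data into ranges of shard key values and routes requests through mongos, so this is not MongoDB's actual placement algorithm.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names

def route(shard_key_value, shards=SHARDS):
    """Pick a shard from a hash of the shard key (toy stand-in for hashed sharding)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Sharding: each document lives on exactly one shard.
placement = {shard: [] for shard in SHARDS}
for user_id in range(1, 1001):
    placement[route(user_id)].append(user_id)

# Replication: every member of a replica set holds the full dataset.
members = ["primary", "secondary1", "secondary2"]
replica_set = {member: list(range(1, 1001)) for member in members}

print({shard: len(ids) for shard, ids in placement.items()})
print({member: len(ids) for member, ids in replica_set.items()})
```

The shard counts add up to 1,000 because the data is partitioned, while each replica set member reports 1,000 because the data is copied.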

9. What is PyMongo, and why is it used?

PyMongo is the official Python driver for MongoDB that allows Python applications to interact with MongoDB databases. It provides the tools and APIs needed to connect to a MongoDB instance, perform operations such as inserting, querying, updating, and deleting documents, as well as more advanced tasks like running aggregation pipelines, managing indexes, and handling bulk operations.

It is used because it acts as the bridge between Python code and MongoDB, enabling developers to seamlessly work with MongoDB’s document-oriented model using familiar Python syntax. PyMongo simplifies database connectivity, supports authentication and secure connections, and integrates well with frameworks like Flask or Django, making it a common choice for building data-driven applications.

10. What are the ACID properties in the context of MongoDB transactions?

In the context of MongoDB, ACID properties define the guarantees provided by transactions to ensure reliable and consistent database operations. MongoDB supports multi-document transactions (on replica sets starting from version 4.0, and on sharded clusters starting from version 4.2), which allow multiple operations to be executed atomically, similar to relational databases. The ACID properties are:

Atomicity – A transaction is treated as a single unit of work. Either all operations within the transaction succeed, or none are applied. This prevents partial updates that could leave the database in an inconsistent state.
Example: If transferring money between two accounts, both the debit and credit must succeed together, or both are rolled back.

Consistency – Transactions ensure that the database moves from one valid state to another, maintaining all defined rules, constraints, and data integrity. Any transaction that violates these rules will be aborted.

Isolation – Operations in a transaction are isolated from other concurrent operations, so intermediate states are not visible to other transactions. This ensures that concurrent transactions do not interfere with each other. MongoDB provides snapshot isolation for multi-document transactions.

Durability – Once a transaction is committed, its changes are permanently stored in the database, even in the event of a system crash. MongoDB uses write-ahead logging (the journal) to guarantee durability.
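
Atomicity, the first property above, can be illustrated with a toy in-memory transfer. This is plain Python, not MongoDB's transaction API, and the account names and balances are made up. With PyMongo, a real multi-document transaction runs inside a session (client.start_session() and session.start_transaction()) and requires a replica set or sharded cluster.

```python
import copy

class InsufficientFunds(Exception):
    """Raised when a debit would make a balance negative."""

def transfer(accounts, src, dst, amount):
    """Apply a debit and a credit as one unit: both happen or neither does."""
    working = copy.deepcopy(accounts)   # changes are staged on a private copy
    working[src] -= amount              # debit
    if working[src] < 0:
        raise InsufficientFunds(src)    # abort: the staged changes are discarded
    working[dst] += amount              # credit
    accounts.clear()
    accounts.update(working)            # commit: publish both changes together

accounts = {"alice": 100, "bob": 20}
transfer(accounts, "alice", "bob", 30)
print(accounts)                         # both balances changed

try:
    transfer(accounts, "bob", "alice", 500)
except InsufficientFunds:
    pass
print(accounts)                         # unchanged by the failed transfer
```

The failed transfer leaves no trace because nothing is published until every step has succeeded.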

11. What is the purpose of MongoDB’s explain() function?

The explain() function in MongoDB is used to provide detailed information about how the database executes a query. Its main purpose is to help developers and database administrators analyze and optimize query performance. When you append explain() to a query, MongoDB returns execution statistics, including the query plan, index usage, number of documents scanned, and whether a collection scan or index scan was performed.
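
As a sketch of how that output can be read, the helper below walks a winning plan and lists its stages. The explain_output dict is an abbreviated, hand-written stand-in rather than real server output, and plan_stages is a hypothetical helper. The exact field layout varies between server versions.

```python
# Abbreviated, hand-written stand-in for the document explain() returns.
explain_output = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "email_1"},
        }
    },
    "executionStats": {"nReturned": 1, "totalKeysExamined": 1, "totalDocsExamined": 1},
}

def plan_stages(plan):
    """Collect stage names from the winning plan, outermost first."""
    stages = []
    while plan is not None:
        stages.append(plan["stage"])
        plan = plan.get("inputStage")
    return stages

stages = plan_stages(explain_output["queryPlanner"]["winningPlan"])
print(stages)
print("collection scan" if "COLLSCAN" in stages else "index used")
```

An IXSCAN stage means an index was used, while COLLSCAN means every document was examined. Comparing totalDocsExamined with nReturned shows how much work the query did per returned document.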

12. How does MongoDB handle schema validation?

MongoDB handles schema validation by allowing developers to define rules that enforce constraints on the structure and content of documents in a collection. While MongoDB is inherently schema-less, schema validation enables control over data quality without losing flexibility. Validation rules can be specified using a validator when creating or modifying a collection, often expressed in MongoDB query expressions or JSON Schema format.
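
A validator in JSON Schema form might look like the sketch below. The users collection and its fields (name, email, age) are assumptions chosen for illustration.

```python
# A $jsonSchema validator for a hypothetical "users" collection.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "email"],
        "properties": {
            "name": {"bsonType": "string", "description": "must be a string"},
            "email": {"bsonType": "string", "pattern": "^.+@.+$"},
            "age": {"bsonType": "int", "minimum": 0},
        },
    }
}

# With PyMongo this would typically be supplied when the collection is created:
#   db.create_collection("users", validator=validator)
print(validator["$jsonSchema"]["required"])
```

Documents that omit name or email, or that store age as a negative number, would be rejected on insert or update under the default validation settings.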

13. What is the difference between a primary and a secondary node in a replica set?

In a MongoDB replica set, nodes are categorized as primary and secondary, each serving distinct roles to ensure high availability and data redundancy:

Primary Node

The primary node is the main writable node in the replica set. All write operations (inserts, updates, deletes) are directed to the primary.

It records all changes in an operation log (oplog), which is then used by secondary nodes to replicate data.

If the primary fails, the replica set automatically holds an election to select a new primary from the secondaries.

Secondary Node

Secondary nodes are read-only replicas of the primary by default. They continuously replicate data from the primary using the oplog.

They provide redundancy and high availability, ensuring the dataset is preserved even if the primary fails.

Optionally, applications can be configured to read from secondaries to distribute read load and improve performance.

14. What security mechanisms does MongoDB provide for data protection?

MongoDB provides a comprehensive set of security mechanisms to protect data at rest, in transit, and from unauthorized access. Key mechanisms include:

Authentication – Ensures only authorized users can access the database. MongoDB supports multiple authentication methods such as SCRAM (Salted Challenge Response Authentication Mechanism) and x.509 certificates, with LDAP and Kerberos available in MongoDB Enterprise.

Authorization & Role-Based Access Control (RBAC) – Controls what authenticated users can do. Permissions are managed through roles, which define access to databases, collections, and operations (read, write, or administrative tasks).

Encryption in Transit – MongoDB supports TLS/SSL to encrypt data being transmitted between clients and servers, protecting it from eavesdropping or tampering.

Encryption at Rest – Data stored on disk can be encrypted using storage-level encryption, including the encrypted storage engine available in MongoDB Enterprise. This ensures that even if disks are compromised, the data remains unreadable.

Auditing – MongoDB Enterprise provides an audit log to track database activity, including authentication attempts, commands executed, and changes to roles or permissions. This is critical for compliance and forensic analysis.

15. Explain the concept of embedded documents and when they should be used.

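In MongoDB, embedded documents are documents stored as a field within another document, allowing related data to be kept together in a single structure. Instead of spreading information across multiple collections, embedding nests it directly inside the parent document. For example, a user document might include an embedded address field that contains street, city, and postal code details as a subdocument. Embedding works best when the nested data is usually read together with its parent, when the relationship is one-to-one or one-to-few, and when the nested data stays small enough that the document remains well under the 16 MB size limit. Data that grows without bound or is shared by many parents is usually better stored in a separate collection and referenced.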
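
The address example can be written out as a Python dict, which is how PyMongo represents documents. The names and values are made up for illustration.

```python
# A user document with an embedded address subdocument (made-up values).
user = {
    "name": "Asha",
    "email": "asha@example.com",
    "address": {"street": "12 MG Road", "city": "Pune", "postalCode": "411001"},
}

# In application code the nested fields are reached like any nested dict.
print(user["address"]["city"])

# In a MongoDB query filter, embedded fields are addressed with dot notation.
query = {"address.city": "Pune"}
print(query)
```

Because the address travels with the user document, a single read returns both, with no join or second query.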

16. What is the purpose of MongoDB’s $lookup stage in aggregation?

The $lookup stage in MongoDB’s aggregation pipeline is used to perform a left outer join between documents in one collection and documents in another collection. Its main purpose is to combine related data that is stored in different collections, similar to how a JOIN works in SQL.

With $lookup, we can match documents from the input collection to documents in the "joined" collection based on specified fields, and the matching documents are returned as an embedded array in the result.
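
A minimal sketch of the stage and the shape of its result follows. The collection names (orders, customers), the field names, and the sample data are assumptions. The list comprehension only emulates the join in plain Python and does not run on the server.

```python
# The stage as it would appear in a pipeline run on a hypothetical "orders" collection.
lookup_stage = {
    "$lookup": {
        "from": "customers",         # collection to join with
        "localField": "customerId",  # field in the orders documents
        "foreignField": "_id",       # field in the customers documents
        "as": "customer",            # name of the output array field
    }
}

# Plain-Python emulation on made-up data, to show the shape of the result.
orders = [{"_id": 1, "customerId": "c1"}, {"_id": 2, "customerId": "c9"}]
customers = [{"_id": "c1", "name": "Asha"}]

joined = [
    {**order, "customer": [c for c in customers if c["_id"] == order["customerId"]]}
    for order in orders
]
print(joined)
```

Order 2 has no matching customer, so it is still returned with an empty customer array. That is the left outer join behavior described above.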

17. What are some common use cases for MongoDB?

MongoDB is widely used in modern applications because of its flexibility, scalability, and ability to handle diverse data types. Some common use cases include:

Content Management Systems (CMS) – MongoDB’s document model makes it easy to store and manage varied content like articles, blogs, videos, and product catalogs without needing a rigid schema.

E-commerce Applications – Product catalogs often contain diverse attributes (size, color, brand, reviews) that vary by category. MongoDB’s schema flexibility and support for embedded documents are well-suited for this.

Real-Time Analytics – MongoDB’s aggregation pipeline and scalability allow organizations to process and analyze high-velocity data streams, such as user activity, IoT sensor data, or financial transactions.

Mobile and Social Media Apps – MongoDB’s ability to handle unstructured and semi-structured data is ideal for storing user profiles, posts, likes, comments, and real-time interactions.

IoT (Internet of Things) – Devices generate huge volumes of diverse data. MongoDB handles high ingestion rates and stores time-series or event-driven data efficiently.

18. What are the advantages of using MongoDB for horizontal scaling?

The main advantage of using MongoDB for horizontal scaling is that it can distribute very large datasets and high-throughput workloads across multiple servers, making it well-suited for modern, data-intensive applications. Here are the key benefits:

Sharding for Data Distribution – MongoDB uses sharding to split data into smaller chunks and distribute them across multiple machines (shards). This allows the database to handle more data than a single server could manage.

Improved Performance – Since queries can be processed in parallel across shards, read and write operations scale out as more servers are added, reducing the load on any single node.

Elastic Growth – Horizontal scaling lets you add more nodes as your data or traffic grows, avoiding the limits of vertical scaling (upgrading a single machine’s CPU/RAM).

High Availability with Replica Sets – Each shard is typically backed by a replica set, ensuring that even as the system scales out, it maintains redundancy and fault tolerance.

Global Distribution – MongoDB allows sharded clusters to be deployed across regions, so data can be placed closer to users, reducing latency and supporting compliance with data residency requirements.

19. How do MongoDB transactions differ from SQL transactions?


In SQL databases, transactions have been a core feature from the beginning. They follow strict ACID properties across multiple tables and rows, making them ideal for complex, multi-table relational operations. SQL transactions are deeply optimized for these use cases, ensuring strong consistency at all times.

In MongoDB, transactions were introduced later (multi-document transactions came in version 4.0). By default, MongoDB operations on a single document are already atomic, so full transactions are usually needed only when working with multiple documents across collections or shards. MongoDB transactions also support ACID guarantees, but because MongoDB is designed for scalability and distributed workloads, transactions may come with higher performance costs compared to SQL systems.

20. What are the main differences between capped collections and regular collections?

In MongoDB, capped collections and regular collections differ mainly in how they store and manage data:

Size Limitation

Capped collections have a fixed size defined at creation. Once the limit is reached, the oldest documents are automatically overwritten by new ones in insertion order.

Regular collections have no fixed size limit (except the general 16 MB per-document limit) and grow dynamically as new documents are added.

Insertion Order

Capped collections preserve the order of document insertion and do not allow document deletion (except by dropping the collection).

Regular collections do not guarantee insertion order, and documents can be freely inserted, updated, or deleted.

Use Cases

Capped collections are ideal for scenarios like logging, caching, real-time analytics, or sensor data storage—where only the most recent data matters and older data can be discarded.

Regular collections are used for general-purpose data storage, where data must be retained and managed over time.
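
The overwrite-oldest behavior can be imitated with a bounded deque from the standard library. This is an analogy only: a real capped collection is limited by size in bytes, with an optional maximum document count, and the log entries here are made up.

```python
from collections import deque

# A deque with maxlen behaves like a capped collection limited to 3 documents.
recent_logs = deque(maxlen=3)

for n in range(1, 6):
    recent_logs.append({"seq": n, "msg": f"event {n}"})

# Only the three newest entries remain, still in insertion order.
print([doc["seq"] for doc in recent_logs])

# With PyMongo, a real capped collection would typically be created like this:
#   db.create_collection("logs", capped=True, size=1_000_000, max=3)
```

Entries 1 and 2 are gone without any explicit delete, which is why capped collections suit rolling logs.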

21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

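The $match stage filters the documents in an aggregation pipeline using the same syntax as a regular query. Its purpose is to allow only the documents that meet certain criteria to pass through to the next stage of the pipeline. By applying $match early in the pipeline, MongoDB reduces the number of documents being processed in later stages, which improves performance. A $match at the start of the pipeline can also make use of indexes.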

22. How can you secure access to a MongoDB database?

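Securing access to a MongoDB database is critical because a self-managed MongoDB instance does not enforce access control until it is explicitly enabled, so an instance exposed to the network can be read or modified by anyone who can reach it. MongoDB provides several mechanisms to protect data and restrict unauthorized access: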

Enable Authentication

Require users to authenticate with a username and password (or a certificate) before they can run any operation.

Enable Authorization

Use role-based access control (RBAC) to assign only the necessary permissions (e.g., read, readWrite, dbAdmin).

Apply the principle of least privilege so no account has more access than its tasks require.

Network Access Control

Bind MongoDB to specific IP addresses instead of 0.0.0.0 (which makes it listen on all IPv4 network interfaces).

Use firewalls or cloud security groups to restrict which hosts can connect.

Encryption

Encryption in transit: Use TLS/SSL to secure communication between clients and the database.

Encryption at rest: Enable storage-level encryption so that data files remain secure if compromised.
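
Several of these measures show up in the connection string an application uses. The sketch below builds one with authentication and TLS enabled. The user name, password, and host are placeholders, and real credentials should come from a secret store rather than source code. authSource and tls are standard connection-string options.

```python
from urllib.parse import quote_plus

# Placeholder credentials and host; real values should come from a secret store.
username = quote_plus("app_user")
password = quote_plus("p@ss:word/1")  # special characters must be percent-encoded

uri = (
    f"mongodb://{username}:{password}@db.example.net:27017/"
    "?authSource=admin&tls=true"
)
print(uri)
```

Percent-encoding matters because characters such as @, : and / otherwise change how the connection string is parsed.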

23. What is MongoDB’s WiredTiger storage engine, and why is it important?

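WiredTiger is MongoDB’s default storage engine (since version 3.2). It controls how data is stored on disk and cached in memory, so it largely determines the database’s concurrency, storage footprint, and crash-recovery behavior.

Key Features of WiredTiger: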

Document-Level Concurrency

Instead of locking entire collections or databases, WiredTiger uses document-level locking, which allows multiple clients to read/write different documents in parallel.

This improves throughput in high-concurrency applications.

Compression

Supports snappy (the default), zlib, and zstd block compression for collection data, and prefix compression for indexes.

Reduces disk space usage and improves performance by minimizing I/O.

Checkpointing & Journaling

Uses checkpoints to persist data consistently on disk.

Journaling ensures durability, so data can be recovered even after crashes.

Caching & Memory Management

WiredTiger maintains an in-memory cache for fast data access.

Uses a write-ahead log (WAL) and checkpoints to ensure ACID guarantees.

Scalability

Designed for modern hardware and multi-core CPUs, making it suitable for large-scale, high-traffic applications.

Practical Questions

1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.

In [None]:
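import csv
from pymongo import MongoClient

# Connect to the local MongoDB server and select the target collection.
client = MongoClient("mongodb://localhost:27017/")
db = client["superstoreDB"]
collection = db["Orders"]  # the later queries read from "Orders"

csv_file = "Superstore.csv"
numeric_fields = ("Sales", "Profit")  # stored as numbers so $gt, $sum and sort work


def to_number(value):
    """Convert a CSV string to float, leaving it unchanged if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


with open(csv_file, mode='r', encoding='utf-8-sig', newline='') as file:
    data = list(csv.DictReader(file))

# csv.DictReader yields every value as a string, so convert the numeric columns.
for row in data:
    for field in numeric_fields:
        if field in row:
            row[field] = to_number(row[field])

if data:
    collection.insert_many(data)
    print(f"Inserted {len(data)} records into MongoDB collection 'Orders'.")
else:
    print("No data found in CSV file.")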


2. Retrieve and print all documents from the Orders collection.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

all_orders = orders_collection.find()

for order in all_orders:
    pprint.pprint(order)


3. Count and display the total number of documents in the Orders collection.

In [None]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

total_orders = orders_collection.count_documents({})

print(f"Total number of documents in the Orders collection: {total_orders}")


4. Write a query to fetch all orders from the "West" region.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

query = { "Region": "West" }

west_orders = orders_collection.find(query)

for order in west_orders:
    pprint.pprint(order)


5. Write a query to find orders where Sales is greater than 500.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

query = { "Sales": { "$gt": 500 } }

high_sales_orders = orders_collection.find(query)

for order in high_sales_orders:
    pprint.pprint(order)


6. Fetch the top 3 orders with the highest Profit.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

top_profit_orders = orders_collection.find().sort("Profit", -1).limit(3)

for order in top_profit_orders:
    pprint.pprint(order)


7. Update all orders with Ship Mode as "First Class" to "Premium Class".

In [None]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

filter_query = { "Ship Mode": "First Class" }
update_query = { "$set": { "Ship Mode": "Premium Class" } }

result = orders_collection.update_many(filter_query, update_query)

print(f"Number of orders updated: {result.modified_count}")


8. Delete all orders where Sales is less than 50.




In [None]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]
filter_query = { "Sales": { "$lt": 50 } }

result = orders_collection.delete_many(filter_query)

print(f"Number of orders deleted: {result.deleted_count}")


9. Use aggregation to group orders by Region and calculate total sales per region.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

pipeline = [
    {
        "$group": {
            "_id": "$Region",
            "totalSales": { "$sum": "$Sales" }
        }
    },
    {
        "$sort": { "totalSales" : -1 }
    }
]

result = orders_collection.aggregate(pipeline)


for region in result:
    pprint.pprint(region)


10. Fetch all distinct values for Ship Mode from the collection.

In [None]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

distinct_ship_modes = orders_collection.distinct("Ship Mode")

print("Distinct Ship Modes in the Orders collection:")
for mode in distinct_ship_modes:
    print(mode)


11. Count the number of orders for each category.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017/")

db = client["superstoreDB"]

orders_collection = db["Orders"]

pipeline = [
    {
        "$group": {
            "_id": "$Category",
            "orderCount": { "$sum": 1 }
        }
    },
    {
        "$sort": { "orderCount": -1 }
    }
]

result = orders_collection.aggregate(pipeline)


for category in result:
    pprint.pprint(category)
