# **MongoDB - Assignment Questions**

# **Theoretical Questions**

1. What are the key differences between SQL and NoSQL databases?

- Data model: SQL databases, or relational databases, organize data into structured tables with predefined schemas. Each table consists of rows and columns, and relationships between tables are established using foreign keys. This rigid schema enforces consistency and data integrity, making SQL well suited to use cases that require structured and uniform data. In contrast, NoSQL databases use a flexible, non-relational data model. They can store data as key-value pairs, documents, wide columns, or graphs, allowing developers to manage unstructured or semi-structured data without a fixed schema.

- Query language and consistency: SQL databases use Structured Query Language (SQL) for defining, manipulating, and querying data. This language is powerful for executing complex joins, aggregations, and transactions, and SQL systems are ACID-compliant, meaning they guarantee data reliability and consistency. NoSQL databases, on the other hand, often use their own query languages or APIs. Many of them relax full ACID guarantees and follow the BASE model (Basically Available, Soft state, Eventually consistent), trading immediate consistency for high availability and performance in distributed environments, although some, including MongoDB, also support ACID transactions.

2. What makes MongoDB a good choice for modern applications?

- MongoDB is a strong choice for modern application development due to its flexible data model, high scalability, and robust performance in handling unstructured or semi-structured data. Unlike traditional relational databases, MongoDB uses a document-oriented model where data is stored in BSON (Binary JSON) format, allowing developers to store complex, hierarchical relationships in a single document without relying on joins. This flexibility accelerates development cycles, especially in agile and DevOps environments where application requirements evolve rapidly. MongoDB’s dynamic schema also enables easier and faster changes to the data structure without downtime.

3. Explain the concept of collections in MongoDB.

- In MongoDB, a collection is a grouping of documents, analogous to a table in a relational database system. However, unlike traditional tables, MongoDB collections do not enforce a fixed schema, meaning each document (which is equivalent to a row in SQL) can have a different structure, fields, or data types. This schema-less nature offers high flexibility and allows developers to iterate rapidly as application requirements evolve.
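As a minimal sketch of this flexibility, the two documents below have different fields yet could be stored in the same collection. The field names and the `insert_products` helper are hypothetical, and `collection` is assumed to be a PyMongo `Collection` object.

```python
# Two documents with different shapes; both are valid members of one collection.
book = {"name": "MongoDB Basics", "price": 25.0, "author": "A. Writer"}
laptop = {
    "name": "Laptop",
    "price": 900.0,
    "specs": {"ram_gb": 16, "ports": ["USB-C", "HDMI"]},
}


def insert_products(collection):
    """Insert both documents; `collection` is assumed to be a PyMongo Collection."""
    result = collection.insert_many([book, laptop])
    return result.inserted_ids
```

With a running server this might be called as `insert_products(client["shop"]["products"])`, where the database and collection names are again only placeholders.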

4. How does MongoDB ensure high availability using replication?
- MongoDB ensures high availability through a feature called replica sets, which are self-healing clusters of nodes that maintain copies of the same data. A typical replica set consists of one primary node and one or more secondary nodes. The primary node handles all write operations, while the secondary nodes replicate the data from the primary in near real-time. If the primary node fails or becomes unreachable due to maintenance or unexpected downtime, the system initiates an automatic election process to promote one of the secondary nodes to become the new primary, thereby ensuring continuous availability of the database service without manual intervention.

5. What are the main benefits of MongoDB Atlas?
- MongoDB Atlas is a fully managed cloud database service that provides several key benefits for modern application development. One of the primary advantages is its automated infrastructure management, which includes provisioning, scaling, backup, and patching, allowing development teams to focus on building applications rather than maintaining databases. Atlas ensures high availability through built-in replication and automatic failover across multiple cloud regions and availability zones, significantly reducing downtime risk. It also offers robust security features such as end-to-end encryption, IP whitelisting, role-based access control, and compliance with standards like GDPR, HIPAA, and SOC 2.

6. What is the role of indexes in MongoDB, and how do they improve performance?
- In MongoDB, indexes play a critical role in enhancing the performance and efficiency of database queries. An index is a data structure that stores a small portion of the collection’s data in a way that makes it faster to search and retrieve records. Without indexes, MongoDB must perform a collection scan, examining every document to fulfill a query, which can be extremely inefficient for large datasets. By contrast, when a query uses an index, MongoDB can locate the desired documents quickly, significantly reducing the amount of data scanned and improving overall response times.
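A short sketch of creating indexes with PyMongo follows. The `create_order_indexes` helper is hypothetical, the field names are taken from the Superstore dataset used later in this assignment, and `collection` is assumed to be a PyMongo `Collection`.

```python
def create_order_indexes(collection):
    """Create a single-field and a compound index; returns the index names."""
    # Ascending index on Region supports queries such as {"Region": "West"}.
    region_index = collection.create_index([("Region", 1)])
    # Compound index: equality match on Category, then Sales in descending order.
    category_sales_index = collection.create_index([("Category", 1), ("Sales", -1)])
    return [region_index, category_sales_index]
```

Whether a given query actually uses one of these indexes depends on the query shape and the planner, so the result is worth checking with explain().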

7. Describe the stages of the MongoDB aggregation pipeline.
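- The MongoDB aggregation pipeline is a framework for processing and transforming the data in a collection through a series of stages, each performing a specific operation. The data flows through these stages sequentially, with the output of one stage becoming the input for the next. This modular architecture allows for flexible and efficient data transformation, filtering, grouping, and analysis. Commonly used stages include $match (filter documents), $project (select or reshape fields), $group (aggregate by a key using accumulators such as $sum and $avg), $sort, $limit and $skip, $unwind (flatten array fields), and $lookup (join with another collection).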
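The sequential flow of stages can be sketched as a plain Python list, which is the form PyMongo's `aggregate()` accepts. The field names assume the Superstore dataset, and `stage_names` is a hypothetical helper added only to make the stage order visible.

```python
# Each dict is one stage; documents flow through the list from top to bottom.
top_west_categories = [
    {"$match": {"Region": "West"}},                                       # filter
    {"$group": {"_id": "$Category", "total_sales": {"$sum": "$Sales"}}},  # aggregate
    {"$sort": {"total_sales": -1}},                                       # order
    {"$limit": 3},                                                        # truncate
]


def stage_names(pipeline):
    """Return the operator name of every stage, in execution order."""
    return [next(iter(stage)) for stage in pipeline]
```

Given a PyMongo collection, the pipeline would be run with `collection.aggregate(top_west_categories)`.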

8. What is sharding in MongoDB? How does it differ from replication?
- Sharding in MongoDB is a method used to achieve horizontal scalability by distributing data across multiple servers or clusters, known as shards. Each shard contains a subset of the dataset, and collectively, all shards represent the entire data set. MongoDB uses a shard key to determine how data is partitioned across the shards, so that read and write operations can be routed to the appropriate servers. This approach is particularly useful for large datasets and high-throughput applications, as it enables the database to manage more data and serve more concurrent operations by using the resources of multiple machines. Replication differs in purpose: a replica set keeps full copies of the same data on several nodes to provide redundancy and high availability, whereas sharding splits the data so that each shard holds only part of it. The two are complementary, and in production each shard is typically deployed as a replica set.
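The idea of routing documents by shard key can be illustrated with a toy simulation. This is not MongoDB's actual chunk-based distribution; `pick_shard` and `distribute` are hypothetical helpers that only show how hashing a key value spreads documents across a fixed number of shards.

```python
import hashlib


def pick_shard(shard_key_value, shard_count):
    """Toy routing rule: hash the shard key value and map it to a shard number."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, shard_key, shard_count):
    """Group documents by the shard that the toy routing rule assigns them to."""
    shards = {number: [] for number in range(shard_count)}
    for doc in documents:
        shards[pick_shard(doc[shard_key], shard_count)].append(doc)
    return shards
```

The same key value always maps to the same shard, which is why a query that includes the shard key can be sent to a single shard instead of all of them.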

9. What is PyMongo, and why is it used?
- PyMongo is the official Python driver for interacting with MongoDB databases. It provides a comprehensive set of tools that allow Python developers to connect to a MongoDB server, perform database operations, and manage collections and documents using intuitive Python syntax. PyMongo enables seamless execution of CRUD (Create, Read, Update, Delete) operations, as well as support for advanced functionalities such as indexing, aggregation pipelines, and bulk writes.
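The CRUD operations mentioned above might look like the sketch below. The `crud_demo` helper and the `DEMO-1` order are hypothetical, and `collection` is assumed to be a PyMongo `Collection` connected to a running server.

```python
def crud_demo(collection):
    """Run one create, read, update, and delete against a PyMongo collection."""
    # Create: insert a single document.
    inserted = collection.insert_one(
        {"Order ID": "DEMO-1", "Region": "West", "Sales": 120.5}
    )
    # Read: fetch it back by its order id.
    found = collection.find_one({"Order ID": "DEMO-1"})
    # Update: change one field with the $set operator.
    updated = collection.update_one(
        {"Order ID": "DEMO-1"}, {"$set": {"Sales": 150.0}}
    )
    # Delete: remove the demo document again.
    deleted = collection.delete_one({"Order ID": "DEMO-1"})
    return inserted.inserted_id, found, updated.modified_count, deleted.deleted_count
```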

10. What are the ACID properties in the context of MongoDB transactions?
- In the context of MongoDB transactions, the ACID properties (Atomicity, Consistency, Isolation, and Durability) ensure that database operations are reliable, predictable, and adhere to strict data integrity standards. Atomicity means that a transaction is treated as a single unit of work, so either all the operations within the transaction succeed or none of them are applied, preventing partial updates. Consistency ensures that a transaction takes the database from one valid state to another, maintaining all defined rules, constraints, and relationships. Isolation guarantees that concurrent transactions do not interfere with each other, meaning intermediate states of a transaction are not visible to others until the transaction is committed. Durability ensures that once a transaction is committed, the changes are permanently stored in the database, even in the event of a system failure. MongoDB has supported ACID-compliant multi-document transactions since version 4.0 for replica sets and version 4.2 for sharded clusters, making it suitable for complex, mission-critical applications that require strong data integrity and consistency.
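A multi-document transaction in PyMongo can be sketched as follows. The `reassign_region` helper and the `order_audit` collection are hypothetical, and the code assumes `client` is a `MongoClient` connected to a replica set or sharded cluster, since transactions are not available on a standalone server.

```python
def reassign_region(client, order_id, new_region):
    """Update an order and write an audit record as one atomic unit."""
    db = client["superstore_db"]
    with client.start_session() as session:
        # The transaction commits when the block exits normally and
        # aborts if an exception is raised inside it.
        with session.start_transaction():
            db["orders"].update_one(
                {"Order ID": order_id},
                {"$set": {"Region": new_region}},
                session=session,
            )
            db["order_audit"].insert_one(
                {"Order ID": order_id, "new_region": new_region},
                session=session,
            )
```

Passing `session=session` to each operation is what ties it to the transaction; an operation called without it would run outside the transaction.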

11. What is the purpose of MongoDB’s explain() function?
- The purpose of MongoDB’s explain() function is to provide detailed insights into how the database executes a query, helping developers and database administrators analyze and optimize query performance. When applied to a query operation, explain() returns execution statistics, including information such as whether indexes were used, the number of documents scanned, the query plan chosen by the optimizer, and the execution time. This diagnostic tool is invaluable for identifying performance bottlenecks, inefficient queries, or missing indexes.
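As a rough sketch, the explain output of a query can be reduced to a few headline numbers. The `summarize_plan` helper is hypothetical, `collection` is assumed to be a PyMongo `Collection`, and the exact layout of the explain document varies between server versions, so every lookup below falls back to `None` when a field is missing.

```python
def summarize_plan(collection, query):
    """Return a few key figures from the explain output of a find() query."""
    plan = collection.find(query).explain()
    stats = plan.get("executionStats", {})
    winning_plan = plan.get("queryPlanner", {}).get("winningPlan", {})
    return {
        # COLLSCAN means a full collection scan; IXSCAN means an index was used.
        "winning_stage": winning_plan.get("stage"),
        "docs_examined": stats.get("totalDocsExamined"),
        "keys_examined": stats.get("totalKeysExamined"),
        "millis": stats.get("executionTimeMillis"),
    }
```

A large gap between documents examined and documents returned is a common hint that an index is missing.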

12. How does MongoDB handle schema validation?
- MongoDB handles schema validation through validation rules attached to a collection. Although collections are schema-flexible by default, a validator can be specified when the collection is created or added later with the collMod command, most commonly using the $jsonSchema operator to declare required fields, BSON types, allowed values, and numeric ranges. MongoDB checks these rules on inserts and updates. The validationLevel option (strict, moderate, or off) controls which documents the rules apply to, and the validationAction option (error or warn) controls whether a violating write is rejected or only logged.
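For schema validation, a minimal sketch of a `$jsonSchema` validator passed to PyMongo's `create_collection` is shown below. The `validated_orders` collection name, the chosen rules, and the `create_validated_orders` helper are assumptions made for illustration; `db` is assumed to be a PyMongo `Database`.

```python
# Validation rules: required fields, allowed types, and value constraints.
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["Order ID", "Region", "Sales"],
        "properties": {
            "Order ID": {"bsonType": "string"},
            "Region": {"enum": ["West", "East", "Central", "South"]},
            "Sales": {"bsonType": ["double", "int", "long"], "minimum": 0},
        },
    }
}


def create_validated_orders(db):
    """Create a collection that rejects documents violating order_validator."""
    return db.create_collection(
        "validated_orders",
        validator=order_validator,
        validationLevel="strict",   # apply the rules to all inserts and updates
        validationAction="error",   # reject violating writes instead of warning
    )
```

With these rules in place, inserting a document without a `Sales` field or with a negative `Sales` value would be rejected by the server.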

13. What is the difference between a primary and a secondary node in a replica set?
- In a MongoDB replica set, the primary and secondary nodes serve distinct but complementary roles to ensure high availability and data redundancy.

- The primary node is the only node that receives write operations. All data modifications such as inserts, updates, and deletes must go through the primary. It then propagates these changes to the secondary nodes through replication, ensuring that all members of the replica set maintain the same dataset. Applications typically connect to the primary by default for both read and write operations, unless otherwise specified.

- In contrast, secondary nodes are read-only by default and exist to maintain copies of the primary's data. They replicate changes asynchronously from the primary node. While they do not handle writes directly, they can be configured to serve read requests using read preferences, which helps offload traffic from the primary and improve query performance. If the primary node becomes unavailable due to failure or maintenance, the replica set automatically initiates an election process, during which one of the secondary nodes is promoted to primary to maintain uninterrupted availability.
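Read preferences can be set in the connection string. The sketch below builds such a URI; the host names, the replica set name `rs0`, and the `replica_set_uri` helper are hypothetical, while `replicaSet` and `readPreference` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, read_preference="secondaryPreferred"):
    """Build a connection string that lists all members and sets a read preference."""
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/?{options}"


uri = replica_set_uri(["db1.example.net:27017", "db2.example.net:27017"], "rs0")
```

Passing this URI to `MongoClient` would send writes to the primary and route reads to a secondary when one is available. Because replication is asynchronous, such reads may return slightly stale data.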

14. What security mechanisms does MongoDB provide for data protection?
- MongoDB provides a comprehensive suite of security mechanisms designed to protect data both in transit and at rest, in line with industry best practices and regulatory standards. One of the foundational features is authentication, which verifies the identity of users or applications using methods such as SCRAM, LDAP, Kerberos, and x.509 certificates. Once authenticated, authorization controls are enforced through role-based access control (RBAC), allowing administrators to assign fine-grained permissions so that each user has access only to the data and operations necessary for their role. Data in transit is protected with TLS/SSL encryption, and data at rest can be encrypted through the encrypted storage engine available in MongoDB Enterprise and MongoDB Atlas.

15. Explain the concept of embedded documents and when they should be used.
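- In MongoDB, embedded documents refer to the practice of nesting one document inside another as a sub-document, rather than storing related data in separate collections and linking them. This structure takes advantage of MongoDB’s document-oriented data model, allowing for a more natural and efficient representation of complex, hierarchical relationships within a single document. Embedded documents are stored as part of the parent document and can include arrays or even other nested documents, making them highly flexible for modeling real-world data. Embedding works best for one-to-one and one-to-few relationships in which the related data is usually read and updated together with its parent, for example an order and its line items. Referencing separate collections is usually the better choice when the nested data grows without bound, is shared by many parent documents, or would push a document toward MongoDB's 16 MB size limit.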
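An embedded document can be sketched as a nested Python dict. The order, customer, and item values below are made up for illustration.

```python
# One order document with an embedded customer and an embedded array of items.
order = {
    "Order ID": "DEMO-1",
    "customer": {"name": "Asha Rao", "segment": "Consumer"},
    "items": [
        {"product": "Stapler", "quantity": 2, "price": 7.5},
        {"product": "Desk Lamp", "quantity": 1, "price": 30.0},
    ],
}

# Everything needed to total the order is already inside the one document.
order_total = sum(item["quantity"] * item["price"] for item in order["items"])

# Dot notation reaches into embedded documents in a MongoDB query filter.
consumer_query = {"customer.segment": "Consumer"}
```

A single `find_one` call would return the order together with its customer and items, with no join required.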

16. What is the purpose of MongoDB’s $lookup stage in aggregation?
- The purpose of the $lookup stage in the aggregation pipeline is to enable join-like operations between documents in different collections. It allows one collection to include related data from another collection by performing a left outer join based on a shared field or condition. This stage is particularly valuable in a NoSQL environment, where traditional relational joins are not inherently supported.
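A `$lookup` stage might be written as in the sketch below. It assumes a separate, hypothetical `customers` collection whose `_id` matches the `Customer ID` field of each order; the Superstore CSV itself stores everything in one collection.

```python
# Join each order with its matching customer document (left outer join).
orders_with_customer = [
    {
        "$lookup": {
            "from": "customers",          # collection to join with (assumed)
            "localField": "Customer ID",  # field in the orders documents
            "foreignField": "_id",        # field in the customers documents
            "as": "customer",             # output array holding the matches
        }
    },
    # Turn the one-element array into a sub-document while keeping
    # orders that have no matching customer.
    {"$unwind": {"path": "$customer", "preserveNullAndEmptyArrays": True}},
]
```

The pipeline would be run with `collection.aggregate(orders_with_customer)` on the orders collection.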

17. What are some common use cases for MongoDB?
- MongoDB is widely leveraged in various modern application scenarios due to its flexible schema design, scalability, and high-performance capabilities. One of the most common use cases is content management systems (CMS), where MongoDB’s document-oriented structure supports diverse content formats and dynamic schemas. It is also heavily used in real-time analytics applications, such as monitoring user activity, IoT device data, or financial transactions, where low-latency reads and writes are essential.

18. What are the advantages of using MongoDB for horizontal scaling?
- MongoDB offers significant advantages when it comes to horizontal scaling, making it a preferred choice for large-scale, data-intensive applications. One key advantage is its built-in sharding mechanism, which enables automatic distribution of data across multiple servers or clusters. As data volume or user load increases, the system can scale out by adding more nodes rather than vertically upgrading existing hardware.

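19. How do MongoDB transactions differ from SQL transactions?
- MongoDB transactions differ from traditional SQL transactions primarily in terms of data model, consistency behavior, and use cases. SQL databases operate on a strictly relational model and have supported ACID-compliant multi-statement transactions for decades, ensuring atomicity, consistency, isolation, and durability across multiple tables and rows by default. These transactions are deeply embedded in the SQL architecture and are used extensively in systems requiring complex, interrelated data integrity, such as banking or ERP systems. MongoDB, by contrast, makes every single-document write atomic, so related data embedded in one document can be updated safely without a transaction. Multi-document ACID transactions were added in version 4.0 for replica sets and version 4.2 for sharded clusters. They must be started explicitly within a client session and are intended for the cases where embedding alone cannot keep data consistent, rather than as the default for every write.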

20. What are the main differences between capped collections and regular collections?
- The primary differences between capped collections and regular collections in MongoDB lie in their storage behavior, data management, and use cases.
- A capped collection is a fixed-size collection that maintains documents in the order they were inserted and automatically overwrites the oldest documents once the allocated space is filled. This ensures a constant storage footprint and consistent write performance. Capped collections traditionally do not allow individual documents to be deleted or to grow through updates: insertions must fit the pre-allocated space. This makes them well suited to logging, caching, and real-time analytics, where retaining only the most recent data is sufficient.
- A regular collection has no fixed size limit. It grows as documents are inserted, supports updating and deleting any document, does not guarantee insertion order unless the query sorts explicitly, and can be sharded, which a capped collection cannot.
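Creating a capped collection with PyMongo could look like the sketch below. The `recent_events` name, the size limits, and the `create_capped_log` helper are hypothetical; `db` is assumed to be a PyMongo `Database`.

```python
def create_capped_log(db, max_bytes=1024 * 1024, max_docs=1000):
    """Create a capped collection limited by total bytes and document count."""
    return db.create_collection(
        "recent_events",
        capped=True,     # fixed-size, insertion-ordered collection
        size=max_bytes,  # maximum size in bytes (required for capped collections)
        max=max_docs,    # optional cap on the number of documents
    )
```

Once either limit is reached, each new insert displaces the oldest document.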

21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?
- The $match stage in MongoDB’s aggregation pipeline is designed to filter documents based on specified criteria, similar to the WHERE clause in SQL. Its primary purpose is to narrow down the dataset early in the pipeline, allowing only documents that meet certain conditions to proceed to subsequent stages. This improves the efficiency and performance of the aggregation operation by minimizing the volume of data processed downstream.

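22. How can you secure access to a MongoDB database?
- Securing access to a MongoDB database involves implementing a combination of authentication, authorization, encryption, and network-level safeguards to ensure data confidentiality, integrity, and availability. MongoDB supports robust authentication mechanisms such as SCRAM, x.509 certificates, LDAP, and Kerberos to verify the identity of users and services attempting to access the database. Access control should be enabled so that every client must authenticate, and role-based access control (RBAC) should grant each user only the privileges it needs. TLS/SSL encrypts traffic between clients and servers, and network exposure can be limited by binding mongod to specific IP addresses and restricting access with firewalls or IP access lists.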
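On the client side, an authenticated and encrypted connection can be requested through the connection string. The sketch below is a minimal example; the user name, password, host, and the `secure_uri` helper are hypothetical, while `authSource` and `tls` are standard connection string options.

```python
from urllib.parse import quote_plus


def secure_uri(username, password, host, auth_db="admin"):
    """Build a connection string with escaped credentials and TLS enabled."""
    # Credentials must be percent-encoded so characters like '@' or '/'
    # do not break the URI.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"mongodb://{user}:{secret}@{host}/?authSource={auth_db}&tls=true"


uri = secure_uri("app_user", "p@ss/word", "db.example.net:27017")
```

In practice the password would come from an environment variable or a secrets manager rather than being written in the script.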

23. What is MongoDB’s WiredTiger storage engine, and why is it important?
- WiredTiger has been MongoDB’s default storage engine since version 3.2 and is designed to deliver high performance, scalability, and efficiency for modern data workloads. It plays a critical role in how MongoDB handles data storage, retrieval, and management at the disk level. WiredTiger offers document-level concurrency control, which allows multiple operations to read and write different documents simultaneously without blocking each other, significantly improving throughput in write-intensive applications.

# **Practical Questions**

In [None]:
# Q1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.
import io

import pandas as pd
from pymongo import MongoClient

from google.colab import files

# files.upload() returns a dict that maps each uploaded file name to its bytes.
uploaded = files.upload()

# Read the CSV from the uploaded bytes. A local Windows path does not exist
# on the Colab runtime, which is what caused the FileNotFoundError.
file_name = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[file_name]))

# Convert each row to a dict so that it is stored as one document.
data_dict = df.to_dict(orient='records')

# A MongoDB server must be reachable at this URI; Colab does not run one
# on localhost by default.
client = MongoClient("mongodb://localhost:27017/")
db = client["superstore_db"]
collection = db["orders"]

if data_dict:
    collection.insert_many(data_dict)
    print(f"Inserted {len(data_dict)} documents into MongoDB.")
else:
    print("No data to insert.")
client.close()

Saving superstore.csv to superstore (2).csv

In [None]:
# Q2. Retrieve and print all documents from the Orders collection.

from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

documents = collection.find()

for doc in documents:
    print(doc)

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 6883248911e1fdcf96ae1890, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

In [None]:
# Q3. Count and display the total number of documents in the Orders collection.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

total_documents = collection.count_documents({})

print(f"📊 Total number of documents in the 'orders' collection: {total_documents}")

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 688324cf11e1fdcf96ae1891, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

In [None]:
# Q4. Write a query to fetch all orders from the "West" region.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]
west_region_orders = collection.find({"Region": "West"})

for order in west_region_orders:
    print(order)

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 688324fd11e1fdcf96ae1892, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

In [None]:
# Q5. Write a query to find orders where Sales is greater than 500.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["superstore_db"]
collection = db["orders"]

high_sales_orders = collection.find({"Sales": {"$gt": 500}})

for order in high_sales_orders:
    print(order)

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 6883252911e1fdcf96ae1893, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

In [None]:
# Q6. Fetch the top 3 orders with the highest Profit.

from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

top_profit_orders = collection.find().sort("Profit", -1).limit(3)

for order in top_profit_orders:
    print(order)

In [None]:
# Q7. Update all orders with Ship Mode as "First Class" to "Premium Class."

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]
result = collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)
print(f"✅ Documents matched: {result.matched_count}")
print(f"✅ Documents modified: {result.modified_count}")

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 6883256e11e1fdcf96ae1894, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

In [None]:
# Q8. Delete all orders where Sales is less than 50.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

result = collection.delete_many({"Sales": {"$lt": 50}})
print(f"🗑️ Total documents deleted: {result.deleted_count}")

In [None]:
# Q9. Use aggregation to group orders by Region and calculate total sales per region.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

pipeline = [
    {
        "$group": {
            "_id": "$Region",
            "total_sales": {"$sum": "$Sales"}
        }
    },
    {
        "$sort": {"total_sales": -1}
    }
]
results = collection.aggregate(pipeline)

print("📊 Total Sales by Region:")
for region in results:
    print(f"➡️ Region: {region['_id']}, Total Sales: ₹{region['total_sales']:.2f}")

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 688325b011e1fdcf96ae1895, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

In [None]:
# Q10. Fetch all distinct values for Ship Mode from the collection.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

distinct_ship_modes = collection.distinct("Ship Mode")

print("🚚 Distinct Ship Modes:")
for mode in distinct_ship_modes:
    print(f"• {mode}")

In [None]:
# Q11. Count the number of orders for each category.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["superstore_db"]
collection = db["orders"]

pipeline = [
    {
        "$group": {
            "_id": "$Category",
            "order_count": {"$sum": 1}
        }
    },
    {
        "$sort": {"order_count": -1}
    }
]

results = collection.aggregate(pipeline)

print("📦 Order Count by Category:")
for category in results:
    print(f"• {category['_id']}: {category['order_count']} orders")

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 6883261c11e1fdcf96ae1897, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

# Thank you for your time and consideration.
Yours sincerely,
Darshan Panchal