1. What are the key differences between SQL and NoSQL databases?

Data Structure:

SQL: Uses a structured schema with tables, rows, and columns. Data is organized in a relational format, and relationships between tables are defined using foreign keys.
NoSQL: Supports various data models, including document, key-value, column-family, and graph formats. It is more flexible and can handle unstructured or semi-structured data.

Schema:

SQL: Requires a predefined schema. Changes to the schema can be complex and may require downtime.
NoSQL: Typically schema-less or has a dynamic schema, allowing for easier modifications and the ability to store different types of data in the same database.

Query Language:

SQL: Uses SQL for querying and managing data. It provides powerful querying capabilities with JOIN operations.
NoSQL: Does not have a standard query language. Each NoSQL database may have its own query language or API, which can vary significantly.

Transactions:

SQL: Supports ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring reliable transactions.
NoSQL: Often follows BASE (Basically Available, Soft state, Eventually consistent) principles, trading strict consistency for availability and partition tolerance. Some NoSQL databases, MongoDB among them, also offer ACID transactions.

Scalability:

SQL: Generally scales vertically (adding more power to a single server), which can be limiting and expensive.
NoSQL: Designed to scale horizontally (adding more servers), making it easier to handle large volumes of data and high traffic.

Use Cases:

SQL: Best suited for applications requiring complex queries, transactions, and structured data, such as financial systems and enterprise applications.
NoSQL: Ideal for applications with large volumes of unstructured data, real-time web applications, big data analytics, and content management systems.

Examples:

SQL: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.
NoSQL: MongoDB, Cassandra, Redis, Couchbase, Neo4j.
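
To make the structural difference concrete, the sketch below stores the same customer and orders in both shapes. It uses Python's built-in `sqlite3` module for the relational side and a plain dictionary as a stand-in for a document; the table names, field names, and values are invented for illustration.

```python
import sqlite3

# Relational shape: two tables linked by a foreign key, read back with a JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 75.5);
""")
rows = conn.execute("""
    SELECT c.name, o.id, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
conn.close()

# Document shape: the orders are embedded in the customer record, so no join is needed.
customer_doc = {
    "_id": 1,
    "name": "Asha",
    "orders": [
        {"order_id": 10, "total": 250.0},
        {"order_id": 11, "total": 75.5},
    ],
}

# Flatten the document to show that both shapes carry the same information.
flattened = [
    (customer_doc["name"], order["order_id"], order["total"])
    for order in customer_doc["orders"]
]
```

The relational version needs a schema and a join to reassemble the record, while the document version keeps related data together at the cost of some duplication if orders must also be queried on their own.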

2. What makes MongoDB a good choice for modern applications?

Flexible Data Model:
MongoDB uses a document-oriented data model, allowing developers to store unstructured and semi-structured data easily.

The JSON-like format (BSON) enables a natural mapping to objects in programming languages, simplifying data handling.

Scalability:
MongoDB's horizontal scaling capabilities allow it to manage large volumes of data and high traffic efficiently.

Sharding distributes data across multiple servers, ensuring performance remains strong even as data grows.

Rapid Development:
Developers can quickly get started with MongoDB, as it requires minimal setup and allows for iterative development.

The dynamic schema means changes can be made without significant overhead, facilitating agile development practices.

High Performance:
MongoDB is optimized for high read and write throughput, making it suitable for real-time applications.

Its indexing capabilities enhance query performance, allowing for fast data retrieval.

Rich Query Language:
MongoDB supports complex queries, including ad hoc queries and aggregations, which are essential for modern applications that require data analysis.

Community and Ecosystem:
A large and active community provides extensive support and resources, making it easier for developers to find solutions and best practices.

MongoDB Atlas, the cloud-based service, simplifies deployment and management, offering features like automated backups and scaling.

Versatile Use Cases:
MongoDB is suitable for a wide range of applications, from content management systems and e-commerce platforms to IoT applications and real-time analytics.


3. Explain the concept of collections in MongoDB?

Document Storage:

A collection stores documents, which are individual records represented in a flexible, JSON-like format (BSON, short for Binary JSON). Each document can have a different structure, allowing for a variety of data types and fields within the same collection.

Schema-less Design:

Collections in MongoDB are schema-less by default, meaning that there is no predefined structure for the documents they contain. This flexibility allows developers to modify the data model as application requirements evolve without needing to alter the entire collection.

Dynamic Fields:

Each document within a collection can have its own unique set of fields. For example, one document might have fields for "name" and "age," while another document in the same collection might include "name," "age," and "address." This allows for a diverse range of data to be stored together.

Indexing:

Collections can be indexed to improve query performance. MongoDB supports various types of indexes, including single-field, compound, and geospatial indexes, which can be applied to fields within documents in a collection.

Operations:

Collections support a variety of operations, including inserting, updating, deleting, and querying documents. MongoDB provides a rich query language that allows for complex queries and aggregations on the data within collections.

Naming Conventions:

Collection names must be unique within a database and should begin with a letter or an underscore. They cannot be empty, contain a dollar sign ($) or the null character, or begin with the reserved `system.` prefix. Dots are allowed and are often used to group related collections (for example, `fs.files` in GridFS).

Relationship Handling:

While collections can store related data, MongoDB encourages embedding related data within documents when appropriate (denormalization) rather than using joins, which are common in relational databases. This approach can improve performance and simplify data retrieval.

Example:

For instance, in a database for a library, you might have a collection named "books" that contains documents representing individual books. Each document could include fields like "title," "author," "published_year," and "genres," with some documents having additional fields like "ISBN" or "summary."

4. How does MongoDB ensure high availability using replication?

MongoDB ensures high availability through a feature called replication, which is implemented using replica sets.
A replica set consists of a group of MongoDB servers that maintain the same dataset. It typically includes one primary node and multiple secondary nodes.
The primary node receives all write operations, while secondary nodes replicate the data from the primary asynchronously.
If the primary node fails, the remaining members automatically hold an election to select a new primary, so the primary is not a single point of failure.
Writes pause briefly while the election runs, after which applications continue against the new primary, thus providing high availability.
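
On the client side, failover works best when the driver knows about every member of the replica set. The following sketch builds such a connection string in plain Python. The host names and the replica set name `rs0` are hypothetical; `replicaSet`, `w`, and `readPreference` are standard connection string options.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, **options):
    """Build a MongoDB connection string that lists several replica set members."""
    params = {"replicaSet": replica_set, **options}
    return f"mongodb://{','.join(hosts)}/?{urlencode(params)}"


uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    "rs0",
    w="majority",                         # wait for a majority of members to acknowledge writes
    readPreference="secondaryPreferred",  # allow reads from secondaries when available
)
print(uri)
```

Passing a string like this to `MongoClient` lets the driver discover the current primary and switch to the new one after an election.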

5. What are the main benefits of MongoDB Atlas?

MongoDB Atlas is a fully managed cloud database service offered by MongoDB Inc. It provides the following benefits:
- **Automated Operations**: Handles backup, patching, monitoring, and scaling automatically.
- **Scalability**: Easily scale up or down as per your application's needs.
- **Global Distribution**: Deploy databases in multiple cloud regions to reduce latency and meet compliance requirements.
- **Built-in Security**: Includes features like encryption at rest and in transit, IP whitelisting, and role-based access control.
- **Integrated Monitoring and Alerts**: Provides real-time performance metrics and alerts to help manage the system effectively.
- **Data Tools**: Includes features like full-text search, charts, and BI connector out of the box.


6. What is the role of indexes in MongoDB, and how do they improve performance?

Indexes in MongoDB are special data structures that store a portion of the data set in an easy-to-traverse form.
They help improve query performance by allowing the database engine to quickly locate data without scanning every document in a collection.
Without indexes, MongoDB must perform a collection scan, which is slow for large datasets.
Indexes can be created on one or more fields in a document. For example, creating an index on the 'email' field makes searching for users by email much faster.
They also support operations like sorting and can enforce uniqueness.
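
The difference between a collection scan and an index lookup can be illustrated with a toy model in plain Python. This is only an analogy: a dictionary stands in for the index here, whereas MongoDB uses B-tree structures, and the user records are made up.

```python
# 1,000 made-up user documents.
users = [{"_id": i, "email": f"user{i}@example.com"} for i in range(1000)]


def collection_scan(docs, email):
    """Check every document in turn, as a query without an index must."""
    examined = 0
    for doc in docs:
        examined += 1
        if doc["email"] == email:
            return doc, examined
    return None, examined


# A mapping from email to position plays the role of a single-field index.
email_index = {doc["email"]: pos for pos, doc in enumerate(users)}


def index_lookup(docs, index, email):
    """Jump straight to the matching document through the index."""
    pos = index.get(email)
    if pos is None:
        return None, 0
    return docs[pos], 1


scan_doc, scan_examined = collection_scan(users, "user999@example.com")
idx_doc, idx_examined = index_lookup(users, email_index, "user999@example.com")
print(scan_examined, idx_examined)
```

In PyMongo the corresponding index would be created with `collection.create_index("email", unique=True)`, which also enforces uniqueness.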

7. Describe the stages of the MongoDB aggregation pipeline?

The MongoDB aggregation pipeline is a framework for performing data aggregation operations. It processes documents through a sequence of stages. Commonly used stages include:
- **$match**: Filters documents based on specified criteria (like WHERE in SQL).
- **$project**: Reshapes each document, adding/removing fields or computing new values.
- **$group**: Groups documents by a specified key and performs aggregation operations like sum, avg, min, max.
- **$sort**: Sorts documents by a specified field(s).
- **$skip**: Skips a specified number of documents.
- **$limit**: Limits the number of documents passed to the next stage.
- **$lookup**: Performs a left outer join with another collection.

Each stage processes input and passes results to the next, allowing for powerful data transformation.
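
How stages chain together can be sketched without a running server. In the example below, `pipeline` is the document list that would be passed to `collection.aggregate()` in PyMongo, and `run_stages` is a hypothetical plain-Python equivalent of those four stages. The sample orders and field names are invented for illustration.

```python
from collections import defaultdict

orders = [
    {"region": "West", "status": "shipped", "sales": 120.0},
    {"region": "East", "status": "shipped", "sales": 80.0},
    {"region": "West", "status": "cancelled", "sales": 500.0},
    {"region": "West", "status": "shipped", "sales": 30.0},
    {"region": "South", "status": "shipped", "sales": 200.0},
]

# The pipeline as it would be passed to collection.aggregate().
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$region", "total_sales": {"$sum": "$sales"}}},
    {"$sort": {"total_sales": -1}},
    {"$limit": 2},
]


def run_stages(docs):
    """Plain-Python equivalent of the four stages above, applied in order."""
    # $match: keep only shipped orders
    matched = [d for d in docs if d["status"] == "shipped"]
    # $group: sum sales per region
    totals = defaultdict(float)
    for d in matched:
        totals[d["region"]] += d["sales"]
    grouped = [{"_id": region, "total_sales": total} for region, total in totals.items()]
    # $sort: highest total first
    ordered = sorted(grouped, key=lambda d: d["total_sales"], reverse=True)
    # $limit: keep the top two
    return ordered[:2]


result = run_stages(orders)
print(result)
```

Because `$match` runs first, the cancelled order never reaches the grouping step, which is the same reason filtering early helps in a real pipeline.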


8. What is sharding in MongoDB? How does it differ from replication?

Sharding is a method used in MongoDB to support horizontal scaling by distributing data across multiple servers, or shards.
Each shard contains a portion of the data, and together they form a complete data set. MongoDB uses a shard key to determine the distribution.
Replication, on the other hand, is about redundancy and high availability. A replica set contains copies of the same data on different nodes.
So while sharding spreads data to handle large-scale workloads, replication ensures data availability and fault tolerance.
In production, each shard is itself deployed as a replica set, so the two are used together to build high-performance, resilient systems.
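
The contrast between partitioning and copying can be shown with a toy model. The sketch below is not MongoDB's algorithm: real hashed sharding assigns ranges of hash values to shards rather than taking a modulo, and the shard names and documents here are invented.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]


def shard_for(shard_key_value):
    """Map a shard key value to one shard using a stable hash."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def distribute(docs, shard_key):
    """Sharding: place each document on exactly one shard."""
    placement = {name: [] for name in SHARDS}
    for doc in docs:
        placement[shard_for(doc[shard_key])].append(doc)
    return placement


def replicate(docs, members=3):
    """Replication: every member keeps a full copy of the data."""
    return [list(docs) for _ in range(members)]


docs = [{"customer_id": i, "total": i * 10} for i in range(12)]
sharded = distribute(docs, "customer_id")
replicated = replicate(docs)
```

After this runs, the shards together hold each document once, while every replica holds all twelve.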

9. What is PyMongo, and why is it used?

PyMongo is the official Python driver for MongoDB. It allows Python developers to interact with MongoDB databases using Python code.
It provides functions to perform database operations like inserting, updating, querying, and deleting documents.
PyMongo also supports advanced features like indexing, aggregation pipelines, and transactions.
It is commonly used in web development, data analysis, and automation projects where MongoDB is the chosen database.

10. What are the ACID properties in the context of MongoDB transactions?

ACID stands for Atomicity, Consistency, Isolation, and Durability. These properties ensure reliable transaction processing:
- **Atomicity**: All operations in a transaction are completed successfully, or none are.
- **Consistency**: The database remains in a valid state before and after the transaction.
- **Isolation**: Transactions are isolated from each other to prevent conflicts.
- **Durability**: Once a transaction is committed, changes are permanent.

MongoDB has supported multi-document transactions on replica sets since version 4.0 and on sharded clusters since version 4.2, allowing ACID-compliant operations across multiple documents and collections.

11. What is the purpose of MongoDB’s explain() function?

The `explain()` function in MongoDB provides information on how a query will be executed.
It returns the query plan, including whether an index is used. In `executionStats` mode it also runs the query and reports how many documents and index keys were examined and how long execution took.
This is helpful for optimizing query performance, understanding query behavior, and diagnosing slow queries.
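
Reading a plan usually comes down to checking which stages appear in it. The helper below is a hypothetical sketch that walks the `winningPlan` of an explain result and reports whether an index scan (`IXSCAN`) is present. The two sample dictionaries are hand-written and heavily trimmed; they are not captured server output.

```python
def summarize_plan(explain_output):
    """Collect the plan stages and report whether an index scan is among them."""
    stages = []
    node = explain_output["queryPlanner"]["winningPlan"]
    while node is not None:
        stages.append(node["stage"])
        node = node.get("inputStage")
    stats = explain_output.get("executionStats", {})
    return {
        "stages": stages,
        "uses_index": "IXSCAN" in stages,
        "docs_examined": stats.get("totalDocsExamined"),
    }


# Hand-written samples in the classic explain layout (illustrative only).
without_index = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {"nReturned": 1, "totalDocsExamined": 1000},
}
with_index = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "email_1"},
        }
    },
    "executionStats": {"nReturned": 1, "totalDocsExamined": 1},
}

print(summarize_plan(without_index))
print(summarize_plan(with_index))
```

With PyMongo, a dictionary of this kind comes back from `collection.find(query).explain()`. Newer server versions may nest the plan differently, so treat the traversal as a starting point.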

12. How does MongoDB handle schema validation?

Although MongoDB does not require a fixed schema, it supports optional schema validation using JSON Schema.
You can define rules that documents must follow, such as field types, required fields, value ranges, or regex patterns.
Validation is enforced at the collection level and helps maintain data integrity while retaining flexibility.
Example:
```json
{
  "$jsonSchema": {
    "bsonType": "object",
    "required": ["name", "age"],
    "properties": {
      "name": {"bsonType": "string"},
      "age": {"bsonType": "int", "minimum": 18}
    }
  }
}
```
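
To show what these rules accept and reject, here is a small plain-Python checker that mirrors the validator above. It is a sketch only: MongoDB performs validation on the server, and the `violations` helper and its type mapping are hypothetical and cover just the keywords used in this schema.

```python
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "age"],
        "properties": {
            "name": {"bsonType": "string"},
            "age": {"bsonType": "int", "minimum": 18},
        },
    }
}

# Python types standing in for the two BSON types used above.
PYTHON_TYPES = {"string": str, "int": int}


def violations(doc, schema=validator["$jsonSchema"]):
    """Return a list of problems; an empty list means the document passes."""
    problems = [
        f"missing required field: {field}"
        for field in schema["required"]
        if field not in doc
    ]
    for field, rule in schema["properties"].items():
        if field not in doc:
            continue
        value = doc[field]
        expected = PYTHON_TYPES[rule["bsonType"]]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{field} must be of type {rule['bsonType']}")
        elif "minimum" in rule and value < rule["minimum"]:
            problems.append(f"{field} must be >= {rule['minimum']}")
    return problems


print(violations({"name": "Asha", "age": 30}))
print(violations({"name": "Ravi", "age": 16}))
```

In PyMongo the same `validator` dictionary can be attached when creating a collection, for example `db.create_collection("users", validator=validator)`, where the collection name `users` is an assumption.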

13. What is the difference between a primary and a secondary node in a replica set?

In a MongoDB replica set:
- **Primary Node**: Receives all write operations. There is only one primary node at any time.
- **Secondary Node(s)**: Maintain copies of the primary's data by replicating its oplog. They can serve read requests if the client's read preference allows it.

If the primary fails, the remaining members automatically elect one of the secondaries as the new primary. This ensures availability and fault tolerance.

14. What security mechanisms does MongoDB provide for data protection?

MongoDB offers several built-in security features:
- **Authentication**: Verifies user identity using credentials or external systems like LDAP.
- **Authorization (RBAC)**: Controls access with roles and privileges.
- **Encryption**: Data is encrypted in transit using TLS/SSL. Encryption at rest is provided by WiredTiger's encrypted storage engine, which is available in MongoDB Enterprise and MongoDB Atlas.
- **IP Whitelisting**: Restricts access to trusted IP addresses.
- **Auditing**: Tracks access and changes for compliance and monitoring (MongoDB Enterprise and MongoDB Atlas).

These features help secure sensitive data and prevent unauthorized access.

15. Explain the concept of embedded documents and when they should be used?

Embedded documents are documents stored within other documents. They help model related data in a single structure.
Use embedded documents when:
- The embedded data is tightly coupled and usually accessed together (e.g., user and address).
- You want to avoid expensive joins or `$lookup` operations.

Example:
```json
{
  "name": "John",
  "address": {
    "city": "Mumbai",
    "zip": "400001"
  }
}
```
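
The trade-off between embedding and referencing can be sketched in plain Python. The `get_path` helper below is hypothetical and imitates the dot notation MongoDB uses to reach into embedded documents, as in the query filter `{"address.city": "Mumbai"}`. The referenced variant and its `addr1` key are invented for comparison.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.city' through nested dictionaries."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Embedded: the address lives inside the user document and is read in one step.
embedded_user = {"name": "John", "address": {"city": "Mumbai", "zip": "400001"}}

# Referenced: the address is a separate document that needs a second lookup.
addresses = {"addr1": {"city": "Mumbai", "zip": "400001"}}
referenced_user = {"name": "John", "address_id": "addr1"}

city_embedded = get_path(embedded_user, "address.city")
city_referenced = addresses[referenced_user["address_id"]]["city"]
print(city_embedded, city_referenced)
```

Both variants return the same city, but the embedded one needs a single read, which is why embedding suits data that is accessed together.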

16. What is the purpose of MongoDB’s $lookup stage in aggregation?

The `$lookup` stage performs a left outer join between two collections in the same database.
It allows combining documents from different collections based on a matching field.
Example: Joining `orders` with `customers` based on customer ID.
It’s useful for relating data in a way similar to relational databases.

17. What are some common use cases for MongoDB?

- Content Management Systems (CMS)
- Real-time analytics dashboards
- Internet of Things (IoT) applications
- Mobile and web apps needing flexible schemas
- Catalogs, inventories, and product listings
- Social networking applications

MongoDB’s flexible document model and scalability make it ideal for a wide range of modern applications.

18. What are the advantages of using MongoDB for horizontal scaling?

MongoDB supports horizontal scaling through sharding, which offers:
- **Increased Capacity**: Spreads data across multiple servers, avoiding bottlenecks.
- **Improved Performance**: Reduces read/write load per server.
- **Elastic Scalability**: Easily add more shards as your data grows.
- **Cost Efficiency**: Scale using commodity hardware instead of vertical scaling.

19. How do MongoDB transactions differ from SQL transactions?

MongoDB traditionally focused on single-document atomic operations. Multi-document transactions were introduced in v4.0.
Key differences:
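- In most SQL databases every statement runs inside a transaction, either implicitly or explicitly. In MongoDB, single-document writes are already atomic, and multi-document transactions are opt-in through a session.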
- MongoDB's document model often avoids the need for multi-document transactions.
- MongoDB transactions are ACID-compliant but may add overhead if overused.

20. What are the main differences between capped collections and regular collections?

| Feature          | Capped Collection              | Regular Collection         |
|------------------|--------------------------------|----------------------------|
| Size Limit       | Fixed size (in bytes)           | Grows dynamically          |
| Deletion         | Oldest data auto-removed        | Manual deletion required   |
| Use Case         | Logging, metrics, caching       | General-purpose storage    |
| Insertion Order  | Maintained                      | Not guaranteed             |

Capped collections are optimized for high-throughput inserts and for reading documents back in insertion order.
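
The "oldest data auto-removed" behavior resembles a fixed-length buffer. The sketch below uses Python's `collections.deque` as an analogy only: a real capped collection is limited by size in bytes (with an optional document count), not by a Python-side length, and the event documents are made up.

```python
from collections import deque

# Keep only the three most recent entries; older ones are dropped automatically.
recent_events = deque(maxlen=3)
for n in range(1, 6):
    recent_events.append({"event": n})

print(list(recent_events))
```

In PyMongo, a capped collection would be created with something like `db.create_collection("log", capped=True, size=1048576, max=1000)`, where the collection name and limits are assumptions.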

21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

The `$match` stage filters documents based on specific criteria before they move to the next stage of the pipeline.
It’s similar to a SQL `WHERE` clause and is often used early in the pipeline to reduce the number of documents processed in later stages, improving performance.

22. How can you secure access to a MongoDB database?

- **Enable Authentication**: Use internal or external authentication (e.g., LDAP).
- **Role-Based Access Control (RBAC)**: Assign roles with least privilege.
- **Use TLS/SSL**: Encrypt communication between client and server.
- **Network Restrictions**: Use firewalls and IP whitelisting.
- **Encrypt Data at Rest**: Use WiredTiger's encryption at rest, available in MongoDB Enterprise and MongoDB Atlas.
- **Keep MongoDB Updated**: Apply patches to fix security vulnerabilities.
- **Audit Logging**: Monitor access and changes.

23. What is MongoDB’s WiredTiger storage engine, and why is it important?

WiredTiger has been MongoDB’s default storage engine since version 3.2.
It provides features like:
- **Document-level Concurrency**: Improves performance for concurrent operations.
- **Compression**: Reduces storage requirements.
- **Checkpointing and Journaling**: Ensures data durability.

WiredTiger’s efficient use of memory and concurrency control makes it ideal for high-throughput applications.




1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB

```python
from pymongo import MongoClient
import pandas as pd

# Connect to a local MongoDB instance and select the target collection
client = MongoClient("mongodb://localhost:27017/")
db = client["superstore_db"]
orders_collection = db["Orders"]

# Load the CSV file into a DataFrame
csv_file = "Superstore.csv"  # Replace with your file path
df = pd.read_csv(csv_file)

# Clear the existing collection so re-running the script does not create duplicates
orders_collection.delete_many({})

# Insert every CSV row as one document
orders_collection.insert_many(df.to_dict(orient="records"))
print("✅ Superstore dataset loaded into MongoDB.")
```

2. Retrieve and print all documents from the Orders collection


```python
# find() with no filter returns a cursor over every document
print("\n📄 All documents in Orders collection:")
for doc in orders_collection.find():
    print(doc)
```

3. Count and display the total number of documents in the Orders collection


```python
# The empty filter {} matches every document in the collection
total_docs = orders_collection.count_documents({})
print(f"\n📊 Total number of documents: {total_docs}")
```

4. Write a query to fetch all orders from the "West" region


```python
# Equality filter on the Region field
print("\n📍 Orders from West region:")
for doc in orders_collection.find({"Region": "West"}):
    print(doc)
```

5. Write a query to find orders where Sales is greater than 500


```python
# $gt selects documents whose Sales value is strictly greater than 500
print("\n💰 Orders with Sales > 500:")
for doc in orders_collection.find({"Sales": {"$gt": 500}}):
    print(doc)
```

6. Fetch the top 3 orders with the highest Profit

```python
# Sort by Profit in descending order (-1) and keep the first three results
print("\n🏆 Top 3 orders with highest Profit:")
for doc in orders_collection.find().sort("Profit", -1).limit(3):
    print(doc)
```

7. Update all orders with Ship Mode as "First Class" to "Premium Class"


```python
# $set rewrites the Ship Mode field on every matching document
update_result = orders_collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)
print(f"\nUpdated {update_result.modified_count} documents to 'Premium Class'.")
```

8. Delete all orders where Sales is less than 50

```python
# $lt selects documents whose Sales value is strictly less than 50
delete_result = orders_collection.delete_many({"Sales": {"$lt": 50}})
print(f"\nDeleted {delete_result.deleted_count} documents where Sales < 50.")
```

9. Use aggregation to group orders by Region and calculate total sales per region


```python
# Group by Region and sum the Sales field within each group
print("\nTotal Sales per Region:")
pipeline = [
    {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}}
]
for result in orders_collection.aggregate(pipeline):
    print(result)
```

10. Fetch all distinct values for Ship Mode from the collection


```python
# distinct() returns the unique values of a field as a list
ship_modes = orders_collection.distinct("Ship Mode")
print("\nDistinct Ship Modes:", ship_modes)
```

11. Count the number of orders for each category


```python
# Group by Category and add 1 per document to count the orders in each group
print("\nNumber of orders per Category:")
pipeline = [
    {"$group": {"_id": "$Category", "order_count": {"$sum": 1}}}
]
for result in orders_collection.aggregate(pipeline):
    print(result)
```
