# NoSQL Databases and MongoDB

## Learning Objectives
* Recognise the need for NoSQL databases for handling unstructured and non-relational data
* Distinguish between different NoSQL database types for managing unstructured data
* Demonstrate querying, aggregation, indexing, and error handling skills in MongoDB using PyMongo

## Part 1: NoSQL Fundamentals (45 minutes)

### The Data Explosion Challenge

Consider these statistics:
* An often-quoted estimate holds that 90% of the world's data was created in the previous two years
* Most of this data is unstructured (social media posts, emails, images, sensor data)
* Traditional SQL databases struggle with this kind of data

### Limitations of Traditional SQL Databases

1. **Schema Rigidity**
   * Fixed table structure
   * All records must follow the same schema
   * Schema changes are difficult

2. **Scaling Challenges**
   * Vertical scaling (bigger servers) is expensive
   * Horizontal scaling is complex
   * Join operations become costly at scale

3. **Complex Data Structures**
   * Nested data requires multiple tables
   * Many-to-many relationships are cumbersome
   * Array and hierarchical data is difficult to model
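
To make schema rigidity concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a traditional SQL database. The table, the `tags` column, and the sample rows are illustrative assumptions, not part of the lesson's dataset.

```python
import sqlite3

# In-memory SQLite database standing in for a "traditional" SQL system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO posts (post_id, content) VALUES (1, 'First post')")

# A record with a field the schema does not know about is rejected
try:
    conn.execute(
        "INSERT INTO posts (post_id, content, tags) VALUES (2, 'Second post', 'nosql')"
    )
    insert_rejected = False
except sqlite3.OperationalError as error:
    insert_rejected = True
    print(f"Insert rejected: {error}")

# The schema has to be migrated before the new field can be stored
conn.execute("ALTER TABLE posts ADD COLUMN tags TEXT")
conn.execute(
    "INSERT INTO posts (post_id, content, tags) VALUES (2, 'Second post', 'nosql')"
)

rows = conn.execute(
    "SELECT post_id, content, tags FROM posts ORDER BY post_id"
).fetchall()
print(rows)
conn.close()
```

After the migration, existing rows carry an empty `tags` value, whereas a document store would let the second record include the extra field without touching the first.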

### Example: Social Media Post in SQL vs NoSQL

Let's look at how we might store a social media post with comments:

In [None]:
# SQL Example (multiple tables needed)
posts_sql = """
CREATE TABLE posts (
    post_id INT PRIMARY KEY,
    user_id INT,
    content TEXT,
    timestamp DATETIME
);

CREATE TABLE comments (
    comment_id INT PRIMARY KEY,
    post_id INT REFERENCES posts(post_id),
    user_id INT,
    content TEXT,
    timestamp DATETIME
);
"""

# NoSQL Example (single document)
post_nosql = {
    "_id": "post123",
    "user": {
        "id": "user456",
        "name": "Fatima Khan"
    },
    "content": "Check out this great data science article!",
    "timestamp": "2024-02-14T10:00:00Z",
    "comments": [
        {
            "user": {"id": "user789", "name": "Li Wei"},
            "content": "Great find!",
            "timestamp": "2024-02-14T10:05:00Z"
        }
    ]
}

print("SQL requires multiple tables and joins")
print("NoSQL stores everything in a single, intuitive document:")

print(post_nosql)

In [None]:
from pprint import pprint

pprint(post_nosql)
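
The join mentioned in the printed message can be tried out with Python's built-in `sqlite3` module. This is a rough sketch with simplified tables and made-up rows (the second comment is invented for illustration); it shows that the relational version has to be reassembled into the nested shape that the document already has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE comments (
    comment_id INTEGER PRIMARY KEY,
    post_id INTEGER REFERENCES posts(post_id),
    content TEXT
);
INSERT INTO posts VALUES (1, 'Check out this great data science article!');
INSERT INTO comments VALUES (10, 1, 'Great find!');
INSERT INTO comments VALUES (11, 1, 'Thanks for sharing');
""")

# Relational version: a JOIN brings the post and its comments together
joined_rows = conn.execute("""
    SELECT p.content, c.content
    FROM posts AS p
    JOIN comments AS c ON c.post_id = p.post_id
    WHERE p.post_id = 1
    ORDER BY c.comment_id
""").fetchall()
conn.close()

# Reassemble the flat rows into the nested shape a document store keeps as-is
post_document = {
    "content": joined_rows[0][0],
    "comments": [{"content": comment} for _, comment in joined_rows],
}
print(post_document)
```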

### Types of NoSQL Databases

#### 1. Document Stores (e.g., MongoDB)
* Store data in flexible, JSON-like documents
* Each document can have its own structure
* Perfect for:
  - Content management systems
  - User profiles
  - Event logging

In [None]:
# Example Document Store data
document_store_example = {
    "user_profile": {
        "_id": "user123",
        "name": "Aisha Johnson",
        "interests": ["data science", "machine learning", "python"],
        "work_history": [
            {"company": "Tech Corp", "years": 3},
            {"company": "Data Inc", "years": 2}
        ]
    }
}

print("Document Store Example:")
pprint(document_store_example)

#### 2. Key-Value Stores (e.g., Redis)
* The simplest NoSQL model
* Each value is stored and retrieved by a unique key
* Perfect for:
  - Caching
  - Session management
  - Real-time data

In [None]:
# Example Key-Value Store data
key_value_example = {
    "session:123": "user_id=456;last_access=2024-02-14T10:00:00Z",
    "cache:popular_posts": "[1, 23, 45, 67, 89]",
    "rate_limit:user456": "10"
}

print("Key-Value Store Example:")
pprint(key_value_example)
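
Caching works well with key-value stores because entries can be given an expiry time. The class below is a toy, in-memory sketch of that idea; `TinyKeyValueStore` and its methods are hypothetical names and do not reflect the Redis API.

```python
import time


class TinyKeyValueStore:
    """Toy in-memory key-value store with optional expiry (not the Redis API)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value


store = TinyKeyValueStore()
store.set("session:123", "user_id=456")
store.set("cache:popular_posts", "[1, 23, 45, 67, 89]", ttl_seconds=0.05)

print(store.get("session:123"))
print(store.get("cache:popular_posts"))
time.sleep(0.1)
print(store.get("cache:popular_posts"))  # the cached entry has expired by now
```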

#### 3. Column-Family Stores (e.g., Cassandra)
* Optimised for queries over large datasets
* Rows are located by a key; columns are grouped into families, and each row can hold a different set of columns
* Perfect for:
  - Time-series data
  - Weather data
  - IoT sensor data

In [None]:
# Example Column-Family Store data
column_family_example = {
    "sensor_data": {
        "row_key": "sensor123:2024-02-14",
        "column_families": {
            "temperature": {
                "10:00": "22.5",
                "10:01": "22.6",
                "10:02": "22.4"
            },
            "humidity": {
                "10:00": "45%",
                "10:01": "46%",
                "10:02": "44%"
            }
        }
    }
}

print("Column-Family Store Example:")
pprint(column_family_example)
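
A typical read against this kind of layout fetches one row by its key and then takes a range of columns, for example all readings between two timestamps. The following is a plain-Python sketch of that access pattern; `slice_columns` is a hypothetical helper and the readings are sample values, not a Cassandra query.

```python
# One wide row: the row key locates the row, column names are sortable timestamps
temperature_row = {
    "row_key": "sensor123:2024-02-14",
    "columns": {"10:00": 22.5, "10:01": 22.6, "10:02": 22.4, "10:03": 22.8},
}


def slice_columns(row, start, end):
    """Return the columns whose names fall in the inclusive range [start, end]."""
    return {
        name: value
        for name, value in sorted(row["columns"].items())
        if start <= name <= end
    }


readings = slice_columns(temperature_row, "10:01", "10:02")
average = sum(readings.values()) / len(readings)
print(readings)
print(f"Average temperature: {average:.1f}")
```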

#### 4. Graph Databases (e.g., Neo4j)
* Store data in nodes and edges
* Optimised for connected data
* Perfect for:
  - Social networks
  - Recommendation engines
  - Fraud detection

In [None]:
# Example Graph Database data
graph_example = {
    "nodes": {
        "1": {"type": "person", "name": "Aisha"},
        "2": {"type": "person", "name": "Ravi"},
        "3": {"type": "product", "name": "Data Science Course"}
    },
    "relationships": [
        {"start": "1", "type": "FRIENDS_WITH", "end": "2"},
        {"start": "1", "type": "PURCHASED", "end": "3"},
        {"start": "2", "type": "INTERESTED_IN", "end": "3"}
    ]
}

print("Graph Database Example:")
pprint(graph_example)
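
The value of a graph model shows up when following relationships. As a small sketch in plain Python (not Neo4j's query language), the hypothetical helpers below walk the same nodes and relationships to find products that a person's friends are interested in, which is the core of a simple recommendation.

```python
graph = {
    "nodes": {
        "1": {"type": "person", "name": "Aisha"},
        "2": {"type": "person", "name": "Ravi"},
        "3": {"type": "product", "name": "Data Science Course"},
    },
    "relationships": [
        {"start": "1", "type": "FRIENDS_WITH", "end": "2"},
        {"start": "1", "type": "PURCHASED", "end": "3"},
        {"start": "2", "type": "INTERESTED_IN", "end": "3"},
    ],
}


def neighbours(graph, node_id, relationship_type):
    """IDs reachable from node_id by following one outgoing edge of the given type."""
    return [
        edge["end"]
        for edge in graph["relationships"]
        if edge["start"] == node_id and edge["type"] == relationship_type
    ]


def products_friends_like(graph, person_id):
    """Names of products that the person's friends are interested in."""
    names = []
    for friend_id in neighbours(graph, person_id, "FRIENDS_WITH"):
        for product_id in neighbours(graph, friend_id, "INTERESTED_IN"):
            names.append(graph["nodes"][product_id]["name"])
    return names


print(products_friends_like(graph, "1"))
```

Edges are treated as directed here, so the same call for Ravi (node `"2"`) returns an empty list.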

### Discussion Point (10 minutes)

Think about your organisation's data:
1. What types of unstructured data do you work with?
2. Which NoSQL database type might be most appropriate?
3. What challenges might you face in migrating from SQL to NoSQL?

### 10-Minute Break

---

## Part 2: Hands-on MongoDB with PyMongo (45 minutes)

Let's get practical with MongoDB, a popular document store database.

First, let's install and import our required libraries:

In [None]:
# Install required packages if not already installed
%pip install pymongo pandas

# Import required libraries
from pymongo import MongoClient
import pandas as pd
from pprint import pprint

### Connecting to MongoDB

In [None]:
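# Connect to MongoDB (adjust the credentials and host to match your setup)
client = MongoClient('mongodb://root:example@localhost:27017/')

# MongoClient connects lazily, so ping the server to confirm it is reachable
client.admin.command('ping')

db = client['data_science_db']
collection = db['projects']

print("Successfully connected to MongoDB!")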

### Working with Data

Let's create some sample data science project data:

In [None]:
# Sample data science projects
projects = [
    {
        "name": "Customer Churn Prediction",
        "team": ["Aisha", "Ravi", "Mei"],
        "technologies": {
            "languages": ["Python", "R"],
            "frameworks": ["scikit-learn", "pandas"],
            "databases": ["MongoDB", "PostgreSQL"]
        },
        "metrics": {
            "accuracy": 0.85,
            "f1_score": 0.83
        },
        "status": "completed"
    },
    {
        "name": "Sentiment Analysis",
        "team": ["Ahmed", "Priya"],
        "technologies": {
            "languages": ["Python"],
            "frameworks": ["tensorflow", "keras"],
            "databases": ["MongoDB"]
        },
        "metrics": {
            "accuracy": 0.78,
            "f1_score": 0.76
        },
        "status": "in_progress"
    }
]

# Insert the data
result = collection.insert_many(projects)

print(f"Inserted {len(result.inserted_ids)} documents")

### Basic Queries

Let's learn how to query our data:

In [None]:
# Find all projects
print("All projects:")

for project in collection.find():
    print(f"\nProject: {project['name']}")
    pprint(project)

In [None]:
# Find completed projects
print("\nCompleted projects:")

completed = collection.find({"status": "completed"})

for project in completed:
    print(f"- {project['name']}")
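
Filters can also use query operators, and `find` accepts a projection to limit the fields returned. The sketch below separates building the query from running it so the query documents can be inspected without a server; `status_query` and `list_projects_by_status` are hypothetical helper names, and the final call assumes the `collection` object from the connection cell.

```python
def status_query(statuses):
    """Build the filter and projection for listing projects in any of the given statuses."""
    query_filter = {"status": {"$in": list(statuses)}}
    projection = {"_id": 0, "name": 1, "status": 1}  # return only these two fields
    return query_filter, projection


def list_projects_by_status(collection, statuses):
    """Run the query against a PyMongo collection, sorted by project name (ascending)."""
    query_filter, projection = status_query(statuses)
    cursor = collection.find(query_filter, projection).sort("name", 1)
    return list(cursor)


query_filter, projection = status_query(["completed", "in_progress"])
print(query_filter)
print(projection)

# With the notebook's connection available:
# list_projects_by_status(collection, ["completed", "in_progress"])
```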

In [None]:
# Find projects using Python
print("\nProjects using Python:")

python_projects = collection.find({"technologies.languages": "Python"})

for project in python_projects:
    print(f"- {project['name']}")
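
On a collection this small, every query simply scans all documents. On larger collections, an index on the filtered field usually helps, and indexing is one of this lesson's learning objectives. A minimal sketch, assuming the `collection` object from the connection cell; the helper names are hypothetical, while `create_index` and `index_information` are PyMongo collection methods.

```python
def ensure_language_index(collection):
    """Create (or reuse) an ascending index on the nested languages array."""
    # Calling create_index again with the same keys leaves the existing index in place
    return collection.create_index([("technologies.languages", 1)])


def describe_indexes(collection):
    """Map each index name to its key specification."""
    return {
        name: info["key"]
        for name, info in collection.index_information().items()
    }


# With the notebook's connection available:
# print(ensure_language_index(collection))
# print(describe_indexes(collection))
```

Because `technologies.languages` holds an array, MongoDB indexes each element of the array, so the "Projects using Python" query above can use this index.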

### Advanced Queries and Aggregation

MongoDB's aggregation framework passes documents through a pipeline of stages, such as grouping and averaging, which makes it well suited to data analysis:

In [None]:
# Calculate average accuracy by project status
pipeline = [
    {
        "$group": {
            "_id": "$status",
            "avg_accuracy": {"$avg": "$metrics.accuracy"},
            "project_count": {"$sum": 1}
        }
    }
]

print("Average accuracy by project status:")

results = collection.aggregate(pipeline)

for result in results:
    print(f"\nStatus: {result['_id']}")
    print(f"Average accuracy: {result['avg_accuracy']:0.2f}")
    print(f"Number of projects: {result['project_count']}")
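
Pipelines often chain several stages. As a further sketch, the pipeline below uses `$unwind` to split the `team` array into one document per member, then groups, sorts, and limits the result. `team_member_counts_pipeline` is a hypothetical helper; the commented call assumes the `collection` object from the connection cell.

```python
def team_member_counts_pipeline(limit=5):
    """Pipeline counting how many projects each team member appears in."""
    return [
        {"$unwind": "$team"},                                   # one document per team member
        {"$group": {"_id": "$team", "projects": {"$sum": 1}}},  # count per member
        {"$sort": {"projects": -1, "_id": 1}},                  # busiest first, ties by name
        {"$limit": limit},
    ]


member_pipeline = team_member_counts_pipeline(limit=3)
for stage in member_pipeline:
    print(stage)

# With the notebook's connection available:
# for row in collection.aggregate(member_pipeline):
#     print(row["_id"], row["projects"])
```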

### Updating Documents

Let's learn how to update our data:

In [None]:
# Update a project's status
result = collection.update_one(
    {"name": "Sentiment Analysis"},
    {"$set": {"status": "completed"}}
)

print(f"Modified {result.modified_count} document")

In [None]:
# Add a new team member to all projects
result = collection.update_many(
    {},
    {"$push": {"team": "Juan"}}
)

print(f"Modified {result.modified_count} documents")

In [None]:
# Verify the changes
print("\nUpdated projects:")

for project in collection.find():
    print(f"\nProject: {project['name']}")
    print(f"Status: {project['status']}")
    print(f"Team: {', '.join(project['team'])}")
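
Note that `$push` appends a value every time it runs, so re-running the cell above would add "Juan" to each team again. Where duplicates are unwanted, `$addToSet` is a common alternative. A minimal sketch, with hypothetical helper names and assuming the `collection` object from the connection cell:

```python
def add_team_member_update(member):
    """Update document that adds a member only if they are not already on the team."""
    return {"$addToSet": {"team": member}}


def add_member_everywhere(collection, member):
    """Apply the update to every project; returns how many documents changed."""
    result = collection.update_many({}, add_team_member_update(member))
    return result.modified_count


print(add_team_member_update("Juan"))

# With the notebook's connection available, running this twice changes
# documents only the first time:
# add_member_everywhere(collection, "Juan")
```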

### Error Handling

Let's wrap database operations so that failures are reported instead of stopping the notebook:

In [None]:
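from pymongo.errors import PyMongoError

def safe_mongodb_operation(operation_func):
    """Run a MongoDB operation and return (success, result_or_error_message)."""
    try:
        return True, operation_func()
    except PyMongoError as e:
        # Covers connection, write, and query failures raised by PyMongo
        return False, str(e)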

# Example usage with a new project
new_project = {
    "name": "Image Classification",
    "team": ["Mei", "Priya"],
    "technologies": {
        "languages": ["Python"],
        "frameworks": ["pytorch", "opencv"],
        "databases": ["MongoDB"]
    },
    "status": "planning"
}

# Try to insert the new project
success, result = safe_mongodb_operation(
    lambda: collection.insert_one(new_project)
)

if success:
    print(f"Successfully inserted new project with ID: {result.inserted_id}")
else:
    print(f"Error inserting project: {result}")

# Try an invalid operation
success, result = safe_mongodb_operation(
    lambda: collection.find_one({"invalid_field": {"$invalid_operator": 1}})
)

if success:
    print("Found document:", result)
else:
    print(f"Error in query: {result}")
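
Some failures, such as a brief network interruption, are worth retrying rather than just reporting. The sketch below is a generic retry helper with exponential backoff, demonstrated with a stand-in function so it runs without a database; `retry` and `flaky_insert` are hypothetical names. With PyMongo, one option would be to pass `retriable=(pymongo.errors.AutoReconnect,)`.

```python
import time


def retry(operation, retriable=(ConnectionError,), attempts=3, base_delay=0.01):
    """Call operation(); on a retriable error wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Stand-in for a flaky database call: fails twice, then succeeds
calls = {"count": 0}


def flaky_insert():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return "inserted"


print(retry(flaky_insert))
print(f"Attempts used: {calls['count']}")
```

Errors that are not listed in `retriable`, such as an invalid query, are raised immediately, since repeating them would not help.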

### Practical Exercise (15 minutes)

Try these tasks:

1. Insert a new data science project with at least three team members and two technologies
2. Write a query to find all projects using a specific technology
3. Create an aggregation pipeline to find the most common programming language across projects
4. Update a project's metrics and handle potential errors

Here's a template to get you started:

In [None]:
# Your solution here

# 1. Insert new project
new_project = {
    # Add project details here
}

# 2. Query by technology
# Your query here

# 3. Aggregation pipeline
pipeline = [
    # Your aggregation stages here
]

# 4. Update with error handling
def update_project_metrics(project_name, new_metrics):
    # Your code here
    pass

### Cleanup

Let's clean up our database:

In [None]:
# Drop the collection
collection.drop()

# Close the connection
client.close()
print("Cleaned up resources")

### Summary

In this lesson, we've covered:
1. Why NoSQL databases are needed
2. Different types of NoSQL databases
3. Basic and advanced MongoDB operations
4. Error handling best practices

### Next Steps

- Explore more advanced MongoDB features
- Learn about data modelling in MongoDB
- Practice building complete ETL pipelines
- Consider how NoSQL databases could benefit your organisation

### Additional Resources

- [MongoDB Documentation](https://docs.mongodb.com/)
- [PyMongo Documentation](https://pymongo.readthedocs.io/)
- [MongoDB University](https://university.mongodb.com/)