# Test scenarios for comparing database performance

### 1. CREATE operation

- Add a new teacher
- Create a new class
- Add a new subject
- Register a new student
- Assign the student to a class (**Added: enrol the student in a class (enrolment)**)
- Create a class schedule
- Record a grade

### 2. READ operation

Retrieve a comprehensive report containing:
- The student's personal data
- Class information (**Added: class enrolment information**)
- Data on the teacher in charge of the class
- The list of grades with subject descriptions
- The detailed class schedule
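The report can be sketched against an in-memory SQLite database, used here only as a stand-in so the example runs without a server. Table and column names follow the CSV files loaded later in this notebook; the sample rows and the `student_report` helper are assumptions made for illustration, not the benchmark's actual query.

```python
import sqlite3

# Stand-in schema: column names mirror the CSV files used by the loaders.
SCHEMA = """
CREATE TABLE teachers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE classes (id INTEGER PRIMARY KEY, name TEXT, teacher_id INTEGER);
CREATE TABLE subjects (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE students (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE enrollments (student_id INTEGER, class_id INTEGER, enrolled_at TEXT);
CREATE TABLE grades (student_id INTEGER, subject_id INTEGER, grade REAL);
CREATE TABLE schedules (class_id INTEGER, subject_id INTEGER, day_of_week TEXT,
                        time_start TEXT, time_end TEXT);
"""


def student_report(conn, student_id):
    """Collect the report with a few narrow queries instead of one wide join."""
    cur = conn.cursor()
    student = cur.execute(
        "SELECT first_name, last_name FROM students WHERE id = ?", (student_id,)
    ).fetchone()
    classes = cur.execute(
        """
        SELECT c.name, t.first_name || ' ' || t.last_name
        FROM enrollments e
        JOIN classes c ON c.id = e.class_id
        JOIN teachers t ON t.id = c.teacher_id
        WHERE e.student_id = ?
        ORDER BY c.id
        """,
        (student_id,),
    ).fetchall()
    grades = cur.execute(
        """
        SELECT s.name, s.description, g.grade
        FROM grades g
        JOIN subjects s ON s.id = g.subject_id
        WHERE g.student_id = ?
        ORDER BY s.id
        """,
        (student_id,),
    ).fetchall()
    schedule = cur.execute(
        """
        SELECT c.name, s.name, sch.day_of_week, sch.time_start, sch.time_end
        FROM enrollments e
        JOIN classes c ON c.id = e.class_id
        JOIN schedules sch ON sch.class_id = c.id
        JOIN subjects s ON s.id = sch.subject_id
        WHERE e.student_id = ?
        ORDER BY sch.day_of_week, sch.time_start
        """,
        (student_id,),
    ).fetchall()
    return {"student": student, "classes": classes,
            "grades": grades, "schedule": schedule}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical sample rows, one per table.
conn.executescript("""
INSERT INTO teachers VALUES (1, 'Anna', 'Nowak');
INSERT INTO classes VALUES (1, '1A', 1);
INSERT INTO subjects VALUES (1, 'Math', 'Algebra and geometry');
INSERT INTO students VALUES (1, 'Jan', 'Kowalski');
INSERT INTO enrollments VALUES (1, 1, '2024-09-01 08:00:00');
INSERT INTO grades VALUES (1, 1, 4.5);
INSERT INTO schedules VALUES (1, 1, 'Monday', '08:00', '08:45');
""")
report = student_report(conn, 1)
print(report)
```

Splitting the report into several queries avoids the row multiplication that a single join across grades and schedules would produce; whether that is faster than one wide join is exactly the kind of question the benchmark can answer.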

### 3. UPDATE operation

- Update the student's data
- Change the class assignment (**Added: update the class enrolment**)
- Modify the class name
- Update the teacher's data
- Change a grade
- Update the subject description
- Modify the class schedule

### 4. DELETE operation

- Delete the student's grades
- Remove the student from the class (**Added: delete the class enrolment**)
- Delete the class schedule
- Delete the class
- Optionally delete subjects
- Optionally delete the teacher
- Delete the student record
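The order of the steps above matters when the schema declares foreign keys: dependent rows have to go before the rows they reference. A minimal sketch, again with in-memory SQLite as a stand-in and under the assumption that `grades` and `enrollments` reference `students` (the actual schema files are not shown in this notebook); `delete_student` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE classes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(id),
    class_id INTEGER REFERENCES classes(id),
    PRIMARY KEY (student_id, class_id)
);
CREATE TABLE grades (
    id INTEGER PRIMARY KEY,
    student_id INTEGER REFERENCES students(id),
    grade REAL
);
INSERT INTO students VALUES (1, 'Jan');
INSERT INTO classes VALUES (1, '1A');
INSERT INTO enrollments VALUES (1, 1);
INSERT INTO grades VALUES (1, 1, 4.5);
""")


def delete_student(conn, student_id):
    """Delete dependent rows first, then the student, in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM grades WHERE student_id = ?", (student_id,))
        conn.execute("DELETE FROM enrollments WHERE student_id = ?", (student_id,))
        conn.execute("DELETE FROM students WHERE id = ?", (student_id,))


# Deleting the parent row first is rejected by the foreign keys.
try:
    conn.execute("DELETE FROM students WHERE id = 1")
    parent_first_failed = False
except sqlite3.IntegrityError:
    conn.rollback()
    parent_first_failed = True

delete_student(conn, 1)
remaining_students = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
remaining_grades = conn.execute("SELECT COUNT(*) FROM grades").fetchone()[0]
print(parent_first_failed, remaining_students, remaining_grades)
```

Running the whole scenario in a single transaction also keeps the timing comparable across systems, since each engine then pays for one commit per scenario rather than one per statement.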

## Record counts for the tests

The tests will be run for the following record counts:

1. 10,000 records
2. 100,000 records
3. 1,000,000 records
4. 10,000,000 records

## Performance metrics

For each scenario and record count we will measure:

1. Execution time of the whole scenario
2. Average time of individual operations
3. Number of operations per second (throughput)
4. System resource usage (CPU, RAM, disk I/O)
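The three time-based metrics can be collected with a small harness such as the one below. This is a sketch under stated assumptions: `measure` is a hypothetical helper, each operation is passed in as a zero-argument callable, and resource usage (CPU, RAM, disk I/O) would need a separate tool and is not covered here.

```python
import time
from statistics import mean


def measure(operations):
    """Run zero-argument callables and return time-based metrics."""
    durations = []
    scenario_start = time.perf_counter()  # monotonic, high-resolution clock
    for op in operations:
        op_start = time.perf_counter()
        op()
        durations.append(time.perf_counter() - op_start)
    total = time.perf_counter() - scenario_start
    return {
        "operations": len(durations),
        "total_s": total,
        "avg_op_s": mean(durations) if durations else 0.0,
        "ops_per_s": len(durations) / total if total > 0 else 0.0,
    }


# Placeholder workload standing in for real database calls.
metrics = measure([lambda: sum(range(1000)) for _ in range(50)])
print(metrics)
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and not affected by system clock adjustments, which matters for operations that finish in well under a millisecond.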

# Test tools and technologies

### Built-in database instruments

Each system offers specialized diagnostic tools:

| System | Tool | Capabilities |
| :-- | :-- | :-- |
| PostgreSQL | pgbench | TPC-B-like tests, custom SQL scripts |
| MariaDB | sysbench | OLTP tests, vertical scaling |
| MongoDB | mongoperf | Disk I/O performance checks |
| Cassandra | cassandra-stress | Data distribution tests |
| Redis | redis-benchmark | Latency measurement for key-value operations |

Using the native tools makes it possible to examine mechanisms specific to each storage engine precisely.
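To drive two of these tools from the notebook, the command lines can be assembled in Python and handed to `subprocess.run`. The sketch below only builds the argument lists, using options that `pgbench` and `redis-benchmark` document; the helper names, the database name `benchmark`, and the script name `report.sql` are assumptions for illustration.

```python
import shlex


def pgbench_command(database, clients=10, threads=2, duration_s=60,
                    script=None, host="localhost", port=5432, user=None):
    """Build a pgbench invocation: -c clients, -j worker threads, -T seconds."""
    cmd = ["pgbench", "-h", host, "-p", str(port),
           "-c", str(clients), "-j", str(threads), "-T", str(duration_s)]
    if user:
        cmd += ["-U", user]
    if script:
        cmd += ["-f", script]  # custom SQL script instead of the built-in scenario
    cmd.append(database)
    return cmd


def redis_benchmark_command(host="localhost", port=6379,
                            requests=100000, clients=50):
    """Build a redis-benchmark invocation: -n total requests, -c parallel clients."""
    return ["redis-benchmark", "-h", host, "-p", str(port),
            "-n", str(requests), "-c", str(clients), "-q"]


print(shlex.join(pgbench_command("benchmark", script="report.sql")))
print(shlex.join(redis_benchmark_command()))
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` would execute it. Note that the built-in pgbench scenario expects its own tables, created beforehand with `pgbench -i`, whereas a custom script runs against whatever schema it references.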

### Automation in Python

Key libraries supporting the tests:

- **SQLAlchemy** for relational databases
- **PyMongo** for MongoDB
- **cassandra-driver** for Cassandra
- **redis-py** for Redis

In [None]:
# Import required libraries
import psycopg2
import psycopg2.errors
from pymongo import MongoClient
from cassandra.cluster import Cluster
import redis
import mysql.connector
import yaml
import pandas as pd
import os
import time
import sys
from pathlib import Path

# Load database configuration
print("Setting up database connections...")
with open('docker-compose.yml', 'r') as file:
    docker_config = yaml.safe_load(file)

# PostgreSQL connection
postgres_config = docker_config['services']['postgresql']
postgres_client = psycopg2.connect(
    host='localhost',
    database=postgres_config['environment']['POSTGRES_DB'],
    user=postgres_config['environment']['POSTGRES_USER'],
    password=postgres_config['environment']['POSTGRES_PASSWORD'],
    port=int(postgres_config['ports'][0].split(':')[0])
)

# MariaDB connection
mariadb_config = docker_config['services']['mariadb']
mariadb_client = mysql.connector.connect(
    host='localhost',
    database=mariadb_config['environment']['MYSQL_DATABASE'],
    user=mariadb_config['environment']['MYSQL_USER'],
    password=mariadb_config['environment']['MYSQL_PASSWORD'],
    port=int(mariadb_config['ports'][0].split(':')[0]),
    allow_local_infile=True
)

# MongoDB connection
mongo_config = docker_config['services']['mongodb']
mongo_client = MongoClient(
    host='localhost',
    port=int(mongo_config['ports'][0].split(':')[0])
)

# Cassandra connection
cassandra_config = docker_config['services']['cassandra']
cassandra_client = Cluster(['localhost'], port=int(cassandra_config['ports'][0].split(':')[0]))
cassandra_session = cassandra_client.connect()

# Redis connection
redis_config = docker_config['services']['redis']
redis_client = redis.Redis(
    host='localhost',
    port=int(redis_config['ports'][0].split(':')[0])
)

# Test connections
try:
    postgres_client.cursor().execute("SELECT 1")
    print("INFO: PostgreSQL connection successful")
    
    mariadb_client.cursor(buffered=True).execute("SELECT 1")
    print("INFO: MariaDB connection successful")
    
    cassandra_session.execute("SELECT release_version FROM system.local")
    print("INFO: Cassandra connection successful")
    
    mongo_client.admin.command('ping')
    print("INFO: MongoDB connection successful")
    
    redis_client.ping()
    print("INFO: Redis connection successful")
except Exception as e:
    print(f"ERROR: Connection test failed: {e}")

2025-04-23 20:46:48,680 - INFO - Using datacenter 'datacenter1' for DCAwareRoundRobinPolicy (via host '::1:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
2025-04-23 20:46:48,681 - INFO - Cassandra host 127.0.0.1:9042 removed


Setting up database connections...
INFO: PostgreSQL connection successful
INFO: MariaDB connection successful
INFO: Cassandra connection successful
INFO: MongoDB connection successful
INFO: Redis connection successful
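The connection cell reads the host port with `ports[0].split(':')[0]`, which works for entries of the form `HOST:CONTAINER` but returns the IP address for `IP:HOST:CONTAINER`. A more defensive parser could look like the sketch below; `host_port` is a hypothetical helper and handles only the short port syntax, not port ranges or the long mapping form.

```python
def host_port(mapping):
    """Return the published (host) port from a docker-compose short-syntax entry."""
    text = str(mapping).split("/")[0]  # drop a protocol suffix such as "/tcp"
    parts = text.split(":")
    if len(parts) < 2:
        # A bare "5432" publishes the container port on a random host port.
        raise ValueError(f"no host port published in {mapping!r}")
    return int(parts[-2])  # HOST:CONTAINER or IP:HOST:CONTAINER


print(host_port("5432:5432"))
print(host_port("127.0.0.1:15432:5432"))
print(host_port("6379:6379/tcp"))
```

Returning an `int` also removes the need to convert the port separately for each driver.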




In [426]:
STOP = ''

In [None]:
# Data generation functions
sys.path.append(str(Path.cwd()))
from generator import generate_school_data

def generate_files(output_dir='./data', scale=1000, batch_size=10000, **kwargs):
    """
    Generate synthetic school data files for benchmarking.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    print(f"INFO: Generating data with scale {scale} and batch size {batch_size}...")

    result = generate_school_data(
        output_dir=output_dir,
        scale=scale,
        batch_size=batch_size,
        **kwargs
    )

    print(f"INFO: Generated {len(result['students'])} students, {len(result['teachers'])} teachers, " + 
          f"{len(result['classes'])} classes, {len(result['subjects'])} subjects")
    print("="*50)
    return result

# Generate test data sets
scale_100_dir = './data/scale_100'
scale_1000_dir = './data/scale_1000'

generate_files(output_dir=scale_100_dir, scale=100, batch_size=5000)
generate_files(output_dir=scale_1000_dir, scale=1000, batch_size=5000)
STOP


INFO: Generating data with scale 100 and batch size 5000...
Processing students.csv...
  Batch 1/1 processed for students.csv
Processing teachers.csv...
  Batch 1/1 processed for teachers.csv
Processing classes.csv...
  Batch 1/1 processed for classes.csv
Processing subjects.csv...
  Batch 1/1 processed for subjects.csv
Processing grades.csv...
  Batch 1/1 processed for grades.csv
Processing schedules.csv...
  Batch 1/1 processed for schedules.csv
Processing enrollments.csv...
  Batch 1/1 processed for enrollments.csv
INFO: Generated 1000 students, 100 teachers, 200 classes, 100 subjects
INFO: Generating data with scale 1000 and batch size 5000...
Processing students.csv...
  Batch 1/2 processed for students.csv
  Batch 2/2 processed for students.csv
Processing teachers.csv...
  Batch 1/1 processed for teachers.csv
Processing classes.csv...
  Batch 1/1 processed for classes.csv
Processing subjects.csv...
  Batch 1/1 processed for subjects.csv
Processing grades.csv...
  Batch 1/6 proces

''

Traceback (most recent call last):
  File "/opt/miniconda3/envs/db-benchmark/lib/python3.10/site-packages/cassandra/cluster.py", line 3577, in _reconnect_internal
    return self._try_connect(host)
  File "/opt/miniconda3/envs/db-benchmark/lib/python3.10/site-packages/cassandra/cluster.py", line 3599, in _try_connect
    connection = self._cluster.connection_factory(host.endpoint, is_control_connection=True)
  File "/opt/miniconda3/envs/db-benchmark/lib/python3.10/site-packages/cassandra/cluster.py", line 1670, in connection_factory
    return self.connection_class.factory(endpoint, self.connect_timeout, *args, **kwargs)
  File "/opt/miniconda3/envs/db-benchmark/lib/python3.10/site-packages/cassandra/connection.py", line 846, in factory
    conn = cls(endpoint, *args, **kwargs)
  File "/opt/miniconda3/envs/db-benchmark/lib/python3.10/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
    self._connect_socket()
  File "/opt/miniconda3/envs/db-benchmark/lib/python3.10/sit

# PostgreSQL Operations

In [428]:
# PostgreSQL Methods

def initialize_postgres_schema(conn, schema_sql):
    """
    Initializes the PostgreSQL database schema using the provided SQL script.
    """
    if not schema_sql:
        print("ERROR: Schema SQL content is empty.")
        return

    try:
        with conn.cursor() as cur:
            cur.execute(schema_sql)
        conn.commit()
        print("INFO: PostgreSQL schema initialized.")
    except Exception as e:
        conn.rollback()
        print(f"ERROR: Error initializing PostgreSQL schema: {e}")

def verify_postgres_tables(conn, expected_tables):
    """
    Verifies if the expected tables exist in PostgreSQL.
    """
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT table_name
                FROM information_schema.tables
                WHERE table_schema = 'public' AND table_name = ANY(%s);
            """, (expected_tables,))
            existing_tables = {row[0] for row in cur.fetchall()}

        missing_tables = set(expected_tables) - existing_tables
        if not missing_tables:
            print(f"INFO: All PostgreSQL tables exist: {', '.join(expected_tables)}")
            return True
        else:
            print(f"WARNING: Missing PostgreSQL tables: {', '.join(missing_tables)}")
            return False
    except Exception as e:
        print(f"ERROR: Error verifying PostgreSQL tables: {e}")
        return False

def load_postgres_data(conn, data_dir):
    """
    Loads data from CSV files into PostgreSQL tables.
    """
    data_path = Path(data_dir)
    table_csv_map = {
        'teachers': 'teachers.csv',
        'subjects': 'subjects.csv',
        'classes': 'classes.csv',
        'students': 'students.csv',
        'grades': 'grades.csv',
        'schedules': 'schedules.csv',
        'enrollments': 'enrollments.csv'
    }

    start_time = time.time()
    print(f"INFO: Loading PostgreSQL data from {data_dir}")

    try:
        with conn.cursor() as cur:
            for table_name, csv_file in table_csv_map.items():
                file_path = data_path / csv_file
                if not file_path.exists():
                    print(f"WARNING: CSV file not found: {file_path}")
                    continue

                load_start = time.time()

                if table_name == 'enrollments':
                    # Special handling for enrollments using temp table
                    print(f"INFO: Loading enrollments with duplicate handling...")
                    temp_table_name = f"temp_{table_name}"
                    try:
                        cur.execute(f"""
                            CREATE TEMP TABLE {temp_table_name} (
                                student_id INT,
                                class_id INT,
                                enrolled_at TIMESTAMP
                            ) ON COMMIT DROP;
                        """)
                        copy_sql = f"COPY {temp_table_name} FROM STDIN WITH (FORMAT CSV, HEADER)"
                        with open(file_path, 'r') as f:
                            cur.copy_expert(sql=copy_sql, file=f)

                        insert_sql = f"""
                            INSERT INTO {table_name} (student_id, class_id, enrolled_at)
                            SELECT student_id, class_id, enrolled_at FROM {temp_table_name}
                            ON CONFLICT (student_id, class_id) DO NOTHING;
                        """
                        cur.execute(insert_sql)
                        inserted_count = cur.rowcount
                        conn.commit()
                        load_end = time.time()
                        print(f"INFO: Loaded {inserted_count} enrollments in {load_end - load_start:.2f} seconds.")
                    except Exception as enroll_error:
                        conn.rollback()
                        print(f"ERROR: {enroll_error}")
                else:
                    # Standard COPY for other tables
                    print(f"INFO: Loading {table_name}...")
                    copy_sql = f"COPY {table_name} FROM STDIN WITH (FORMAT CSV, HEADER)"
                    try:
                        with open(file_path, 'r') as f:
                            cur.copy_expert(sql=copy_sql, file=f)
                        conn.commit()
                        load_end = time.time()
                        print(f"INFO: Loaded {table_name} in {load_end - load_start:.2f} seconds.")
                    except Exception as copy_error:
                        conn.rollback()
                        print(f"ERROR: {copy_error}")

    except Exception as e:
        conn.rollback()
        print(f"ERROR: {e}")
    finally:
        end_time = time.time()
        print(f"INFO: PostgreSQL loading complete in {end_time - start_time:.2f} seconds.")

def verify_postgres_counts(conn, tables):
    """
    Counts rows in PostgreSQL tables.
    """
    counts = {}
    max_len = max(len(t) for t in tables) if tables else 0
    print(f"INFO: Counting rows in PostgreSQL tables")
    try:
        with conn.cursor() as cur:
            for table_name in tables:
                try:
                    cur.execute(f"SELECT COUNT(*) FROM {table_name};")
                    count = cur.fetchone()[0]
                    counts[table_name] = count
                except Exception as count_error:
                    print(f"ERROR: {count_error}")
                    counts[table_name] = 'Error'

        print("--- PostgreSQL Table Row Counts ---")
        for table, count in counts.items():
            print(f"{table:<{max_len}} : {count}")
        print("-----------------------------------")
        return counts

    except Exception as e:
        print(f"ERROR: {e}")
        return None

In [429]:
# PostgreSQL Operations Execution

# Schema initialization
with open('schemas/postgres_schema.sql', 'r') as f:
    sql_schema = f.read()

initialize_postgres_schema(postgres_client, sql_schema)

# Table verification 
required_tables = ['teachers', 'subjects', 'classes', 'students', 'enrollments', 'grades', 'schedules']
verify_postgres_tables(postgres_client, required_tables)

# Data loading
load_postgres_data(postgres_client, scale_100_dir)

# Count verification
verify_postgres_counts(postgres_client, required_tables)
STOP

INFO: PostgreSQL schema initialized.
INFO: All PostgreSQL tables exist: teachers, subjects, classes, students, enrollments, grades, schedules
INFO: Loading PostgreSQL data from ./data/scale_100
INFO: Loading teachers...
INFO: Loaded teachers in 0.00 seconds.
INFO: Loading subjects...
INFO: Loaded subjects in 0.00 seconds.
INFO: Loading classes...
INFO: Loaded classes in 0.00 seconds.
INFO: Loading students...
INFO: Loaded students in 0.00 seconds.
INFO: Loading grades...
INFO: Loaded grades in 0.02 seconds.
INFO: Loading schedules...
INFO: Loaded schedules in 0.01 seconds.
INFO: Loading enrollments with duplicate handling...
INFO: Loaded 1993 enrollments in 0.02 seconds.
INFO: PostgreSQL loading complete in 0.06 seconds.
INFO: Counting rows in PostgreSQL tables
--- PostgreSQL Table Row Counts ---
teachers    : 100
subjects    : 100
classes     : 200
students    : 1000
enrollments : 1993
grades      : 3000
schedules   : 500
-----------------------------------


''
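The enrollments loader relies on `ON CONFLICT (student_id, class_id) DO NOTHING`, so repeated pairs in the CSV are dropped silently. To check that the loaded count matches the number of distinct pairs in the file, the duplicates can be counted directly. A minimal sketch, assuming the header names used by the loaders (`student_id`, `class_id`, `enrolled_at`); the sample data and the helper name are made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_enrollment_pairs(csv_file):
    """Return {(student_id, class_id): occurrences} for pairs seen more than once."""
    reader = csv.DictReader(csv_file)
    pairs = Counter((row["student_id"], row["class_id"]) for row in reader)
    return {pair: n for pair, n in pairs.items() if n > 1}


# Hypothetical sample in the same layout as enrollments.csv.
sample = io.StringIO(
    "student_id,class_id,enrolled_at\n"
    "1,10,2024-09-01 08:00:00\n"
    "1,10,2024-09-02 08:00:00\n"
    "2,10,2024-09-01 08:00:00\n"
)
dupes = duplicate_enrollment_pairs(sample)
skipped = sum(n - 1 for n in dupes.values())  # rows the loader would skip
print(dupes, skipped)
```

For a real file, open it with `open(path, newline='')` and pass the file object; the number of data rows minus `skipped` should then equal the row count reported for `enrollments`.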

# MariaDB Operations

In [430]:
# MariaDB Methods

def initialize_mariadb_schema(conn, schema_sql):
    """
    Initializes the MariaDB database schema using the provided SQL script.
    """
    if not schema_sql:
        print("ERROR: Schema SQL content is empty.")
        return
    try:
        with conn.cursor() as cur:
            for statement in schema_sql.split(';'):
                stmt = statement.strip()
                if stmt:
                    cur.execute(stmt)
        conn.commit()
        print("INFO: MariaDB schema initialized.")
    except Exception as e:
        conn.rollback()
        print(f"ERROR: Error initializing MariaDB schema: {e}")

def verify_mariadb_tables(conn, expected_tables):
    """
    Verifies if the expected tables exist in MariaDB.
    """
    try:
        with conn.cursor() as cur:
            format_strings = ','.join(['%s'] * len(expected_tables))
            cur.execute(f"""
                SELECT table_name
                FROM information_schema.tables
                WHERE table_schema = DATABASE() AND table_name IN ({format_strings});
            """, tuple(expected_tables))
            existing_tables = {row[0] for row in cur.fetchall()}

        missing_tables = set(expected_tables) - existing_tables
        if not missing_tables:
            print(f"INFO: All MariaDB tables exist: {', '.join(expected_tables)}")
            return True
        else:
            print(f"WARNING: Missing MariaDB tables: {', '.join(missing_tables)}")
            return False
    except Exception as e:
        print(f"ERROR: Error verifying MariaDB tables: {e}")
        return False

def load_mariadb_data(conn, data_dir):
    """
    Loads data from CSV files into MariaDB tables.
    """
    data_path = Path(data_dir)
    table_csv_map = {
        'teachers': 'teachers.csv',
        'subjects': 'subjects.csv',
        'classes': 'classes.csv',
        'students': 'students.csv',
        'grades': 'grades.csv',
        'schedules': 'schedules.csv',
        'enrollments': 'enrollments.csv'
    }
    
    start_time = time.time()
    print(f"INFO: Loading MariaDB data from {data_dir}")
    
    try:
        with conn.cursor() as cur:
            for table_name, csv_file in table_csv_map.items():
                file_path = data_path / csv_file
                if not file_path.exists():
                    print(f"WARNING: CSV file not found: {file_path}")
                    continue
                    
                load_start = time.time()
                
                try:
                    if table_name == 'enrollments':
                        # Handle enrollments with INSERT IGNORE to skip duplicates
                        print(f"INFO: Loading enrollments with duplicate handling...")
                        with open(file_path, 'r') as f:
                            next(f)  # skip header
                            for line in f:
                                student_id, class_id, enrolled_at = line.strip().split(',')
                                cur.execute(
                                    """
                                    INSERT IGNORE INTO enrollments (student_id, class_id, enrolled_at)
                                    VALUES (%s, %s, %s)
                                    """,
                                    (student_id, class_id, enrolled_at)
                                )
                        conn.commit()
                    else:
                        # Use LOAD DATA LOCAL INFILE for other tables
                        print(f"INFO: Loading {table_name}...")
                        load_sql = f"""
                        LOAD DATA LOCAL INFILE '{file_path.resolve().as_posix()}'
                        INTO TABLE {table_name}
                        FIELDS TERMINATED BY ','
                        OPTIONALLY ENCLOSED BY '"'
                        LINES TERMINATED BY '\n'
                        IGNORE 1 LINES;
                        """
                        cur.execute(load_sql)
                        conn.commit()
                        
                    load_end = time.time()
                    print(f"INFO: Loaded {table_name} in {load_end - load_start:.2f} seconds.")
                except Exception as load_error:
                    conn.rollback()
                    print(f"ERROR: {load_error}")
                    
    except Exception as e:
        conn.rollback()
        print(f"ERROR: {e}")
    finally:
        end_time = time.time()
        print(f"INFO: MariaDB loading complete in {end_time - start_time:.2f} seconds.")

def verify_mariadb_counts(conn, tables):
    """
    Counts rows in MariaDB tables.
    """
    counts = {}
    max_len = max(len(t) for t in tables) if tables else 0
    print(f"INFO: Counting rows in MariaDB tables")
    
    try:
        with conn.cursor() as cur:
            for table_name in tables:
                try:
                    cur.execute(f"SELECT COUNT(*) FROM {table_name};")
                    count = cur.fetchone()[0]
                    counts[table_name] = count
                except Exception as count_error:
                    print(f"ERROR: {count_error}")
                    counts[table_name] = 'Error'

        print("--- MariaDB Table Row Counts ---")
        for table, count in counts.items():
            print(f"{table:<{max_len}} : {count}")
        print("---------------------------------")
        return counts

    except Exception as e:
        print(f"ERROR: {e}")
        return None

In [431]:
# MariaDB Operations Execution

# Schema initialization
with open('schemas/mariadb_schema.sql', 'r') as f:
    mariadb_schema = f.read()

initialize_mariadb_schema(mariadb_client, mariadb_schema)

# Table verification
required_tables = ['teachers', 'subjects', 'classes', 'students', 'enrollments', 'grades', 'schedules']
verify_mariadb_tables(mariadb_client, required_tables)

# Data loading
load_mariadb_data(mariadb_client, scale_100_dir)

# Count verification
verify_mariadb_counts(mariadb_client, required_tables)
STOP

INFO: MariaDB schema initialized.
INFO: All MariaDB tables exist: teachers, subjects, classes, students, enrollments, grades, schedules
INFO: Loading MariaDB data from ./data/scale_100
INFO: Loading teachers...
INFO: Loaded teachers in 0.00 seconds.
INFO: Loading subjects...
INFO: Loaded subjects in 0.00 seconds.
INFO: Loading classes...
INFO: Loaded classes in 0.00 seconds.
INFO: Loading students...
INFO: Loaded students in 0.00 seconds.
INFO: Loading grades...
INFO: Loaded grades in 0.01 seconds.
INFO: Loading schedules...
INFO: Loaded schedules in 0.00 seconds.
INFO: Loading enrollments with duplicate handling...
INFO: Loaded enrollments in 0.28 seconds.
INFO: MariaDB loading complete in 0.31 seconds.
INFO: Counting rows in MariaDB tables
--- MariaDB Table Row Counts ---
teachers    : 100
subjects    : 100
classes     : 200
students    : 1000
enrollments : 1993
grades      : 3000
schedules   : 500
---------------------------------


''
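In the run above, enrollments take 0.28 seconds in MariaDB against 0.02 seconds in PostgreSQL, most likely because the MariaDB loader issues one `INSERT IGNORE` per CSV line. One option is to parse the file with the `csv` module and pass batches to the cursor's `executemany`. The sketch below covers only the batching part; `enrollment_batches` is a hypothetical helper and the sample data is made up.

```python
import csv
import io


def enrollment_batches(csv_file, batch_size=1000):
    """Yield lists of (student_id, class_id, enrolled_at) tuples, skipping the header."""
    reader = csv.reader(csv_file)  # handles quoted fields, unlike str.split(',')
    next(reader, None)             # skip the header row if present
    batch = []
    for row in reader:
        if not row:                # ignore blank lines
            continue
        batch.append(tuple(row))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:                      # remaining rows that did not fill a batch
        yield batch


# Hypothetical sample in the same layout as enrollments.csv.
sample = io.StringIO(
    "student_id,class_id,enrolled_at\n"
    "1,10,2024-09-01 08:00:00\n"
    "2,10,2024-09-01 08:00:00\n"
    "3,11,2024-09-01 08:00:00\n"
    "4,11,2024-09-01 08:00:00\n"
    "5,12,2024-09-01 08:00:00\n"
)
batches = list(enrollment_batches(sample, batch_size=2))
print([len(b) for b in batches])
```

Inside the loader, each batch would be passed as `cur.executemany(insert_sql, batch)` with the same `INSERT IGNORE` statement. Switching to batches changes what the load time measures, so it should be applied consistently if the load times are compared across systems.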

# MongoDB Operations

In [432]:
# MongoDB Methods

def initialize_mongo_schema(client, db_name='benchmark'):
    """
    Initializes the MongoDB schema by creating necessary collections.
    """
    try:
        db = client[db_name]
        
        # List of collections to create based on no_sql_design.txt
        collections = ['students', 'teachers', 'classes', 'subjects']
        
        # Drop existing collections if they exist
        for collection in collections:
            if collection in db.list_collection_names():
                db[collection].drop()
                print(f"INFO: Dropped MongoDB collection: {collection}")
        
        # Create collections with indexes
        for collection in collections:
            db.create_collection(collection)
            print(f"INFO: Created MongoDB collection: {collection}")
            
            # Create indexes for performance
            if collection == 'students':
                db[collection].create_index([("last_name", 1), ("first_name", 1)])
            elif collection == 'classes':
                db[collection].create_index([("name", 1)])
                
        print("INFO: MongoDB schema initialized.")
    except Exception as e:
        print(f"ERROR: {e}")

def verify_mongo_collections(client, db_name='benchmark', expected_collections=None):
    """
    Verifies if the expected collections exist in MongoDB.
    """
    if expected_collections is None:
        expected_collections = ['students', 'teachers', 'classes', 'subjects']
    
    try:
        db = client[db_name]
        existing_collections = db.list_collection_names()
        
        missing_collections = set(expected_collections) - set(existing_collections)
        if not missing_collections:
            print(f"INFO: All MongoDB collections exist: {', '.join(expected_collections)}")
            return True
        else:
            print(f"WARNING: Missing MongoDB collections: {', '.join(missing_collections)}")
            return False
    except Exception as e:
        print(f"ERROR: {e}")
        return False

def load_mongo_data(client, data_dir, db_name='benchmark'):
    """
    Loads data from CSV files into MongoDB collections with document-oriented structure.
    """
    data_path = Path(data_dir)
    db = client[db_name]
    
    start_time = time.time()
    print(f"INFO: Loading MongoDB data from {data_dir}")
    
    # Clear previous data
    for collection in ['students', 'teachers', 'classes', 'subjects']:
        db[collection].delete_many({})
    
    try:
        # Step 1: Load teachers
        teachers_file = data_path / 'teachers.csv'
        teachers_map = {}
        
        if teachers_file.exists():
            print(f"INFO: Loading teachers...")
            load_start = time.time()
            
            teachers = []
            with open(teachers_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    teacher_doc = {
                        "_id": row['id'],
                        "first_name": row['first_name'],
                        "last_name": row['last_name'],
                        "subject": row['subject'],
                        "hire_date": row['hire_date']
                    }
                    teachers.append(teacher_doc)
                    teachers_map[row['id']] = f"{row['first_name']} {row['last_name']}"
                    
            if teachers:
                db.teachers.insert_many(teachers)
                load_end = time.time()
                print(f"INFO: Loaded {len(teachers)} teachers in {load_end - load_start:.2f} seconds.")
        
        # Step 2: Load subjects
        subjects_file = data_path / 'subjects.csv'
        subjects_map = {}
        
        if subjects_file.exists():
            print(f"INFO: Loading subjects...")
            load_start = time.time()
            
            subjects = []
            with open(subjects_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    subject_doc = {
                        "_id": row['id'],
                        "name": row['name'],
                        "description": row['description']
                    }
                    subjects.append(subject_doc)
                    subjects_map[row['id']] = row['name']
                    
            if subjects:
                db.subjects.insert_many(subjects)
                load_end = time.time()
                print(f"INFO: Loaded {len(subjects)} subjects in {load_end - load_start:.2f} seconds.")
        
        # Step 3: Process schedules for embedding in classes
        schedules_file = data_path / 'schedules.csv'
        schedules_map = {}
        
        if schedules_file.exists():
            print(f"INFO: Processing schedules...")
            with open(schedules_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    class_id = row['class_id']
                    if class_id not in schedules_map:
                        schedules_map[class_id] = []
                        
                    schedules_map[class_id].append({
                        "subject_id": row['subject_id'],
                        "day_of_week": row['day_of_week'],
                        "time_start": row['time_start'],
                        "time_end": row['time_end']
                    })
        
        # Step 4: Load classes with embedded schedules
        classes_file = data_path / 'classes.csv'
        classes_map = {}
        
        if classes_file.exists():
            print(f"INFO: Loading classes with embedded schedules...")
            load_start = time.time()
            
            classes = []
            with open(classes_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    class_id = row['id']
                    teacher_id = row['teacher_id']
                    
                    class_doc = {
                        "_id": class_id,
                        "name": row['name'],
                        "teacher": {
                            "teacher_id": teacher_id,
                            "name": teachers_map.get(teacher_id, "Unknown")
                        },
                        "schedule": schedules_map.get(class_id, [])
                    }
                    
                    classes.append(class_doc)
                    classes_map[class_id] = row['name']
                    
            if classes:
                db.classes.insert_many(classes)
                load_end = time.time()
                print(f"INFO: Loaded {len(classes)} classes in {load_end - load_start:.2f} seconds.")
        
        # Step 5: Process enrollments and grades for embedding in students
        enrollments_file = data_path / 'enrollments.csv'
        enrollments_map = {}
        
        if enrollments_file.exists():
            print(f"INFO: Processing enrollments...")
            with open(enrollments_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    student_id = row['student_id']
                    if student_id not in enrollments_map:
                        enrollments_map[student_id] = []
                        
                    enrollments_map[student_id].append({
                        "class_id": row['class_id'],
                        "enrolled_at": row['enrolled_at']
                    })
        
        grades_file = data_path / 'grades.csv'
        grades_map = {}
        
        if grades_file.exists():
            print(f"INFO: Processing grades...")
            with open(grades_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    student_id = row['student_id']
                    if student_id not in grades_map:
                        grades_map[student_id] = []
                        
                    grades_map[student_id].append({
                        "subject_id": row['subject_id'],
                        "grade": row['grade'],
                        "created_at": row['created_at']
                    })
        
        # Step 6: Load students with embedded enrollments and grades
        students_file = data_path / 'students.csv'
        
        if students_file.exists():
            print(f"INFO: Loading students with embedded enrollments and grades...")
            load_start = time.time()
            
            # Process in batches
            batch_size = 5000
            batch = []
            student_count = 0
            
            with open(students_file, 'r') as f:
                reader = pd.read_csv(f)
                for _, row in reader.iterrows():
                    student_id = row['id']
                    student_doc = {
                        "_id": student_id,
                        "first_name": row['first_name'],
                        "last_name": row['last_name'],
                        "birth_date": row['birth_date'],
                        "enrollments": enrollments_map.get(student_id, []),
                        "grades": grades_map.get(student_id, [])
                    }
                    
                    batch.append(student_doc)
                    student_count += 1
                    
                    # Insert batch when it reaches batch_size
                    if len(batch) >= batch_size:
                        db.students.insert_many(batch)
                        batch = []
                
                # Insert any remaining documents
                if batch:
                    db.students.insert_many(batch)
                    
                load_end = time.time()
                print(f"INFO: Loaded {student_count} students in {load_end - load_start:.2f} seconds.")
        
    except Exception as e:
        print(f"ERROR: {e}")
    finally:
        end_time = time.time()
        print(f"INFO: MongoDB loading complete in {end_time - start_time:.2f} seconds.")

def verify_mongo_counts(client, db_name='benchmark'):
    """
    Counts documents in MongoDB collections.
    """
    collections = ['students', 'teachers', 'classes', 'subjects']
    max_len = max(len(c) for c in collections)
    
    try:
        db = client[db_name]
        counts = {}
        
        for collection in collections:
            try:
                count = db[collection].count_documents({})
                counts[collection] = count
            except Exception as e:
                print(f"ERROR: {e}")
                counts[collection] = 'Error'
                
        print("--- MongoDB Collection Document Counts ---")
        for collection, count in counts.items():
            print(f"{collection:<{max_len}} : {count}")
        print("-----------------------------------------")

        # Additional checks for embedded documents
        try:
            students_with_enrollments = db.students.count_documents({"enrollments": {"$exists": True, "$ne": []}})
            students_with_grades = db.students.count_documents({"grades": {"$exists": True, "$ne": []}})
            classes_with_schedules = db.classes.count_documents({"schedule": {"$exists": True, "$ne": []}})
            
            print("\n--- MongoDB Embedded Document Counts ---")
            print(f"Students with enrollments : {students_with_enrollments}")
            print(f"Students with grades      : {students_with_grades}")
            print(f"Classes with schedules    : {classes_with_schedules}")
            print("-----------------------------------------")
        except Exception as e:
            print(f"ERROR: {e}")
        
        return counts
    except Exception as e:
        print(f"ERROR: {e}")
        return None

In [433]:
# MongoDB Operations Execution

# Schema initialization
initialize_mongo_schema(mongo_client)

# Collection verification
verify_mongo_collections(mongo_client)

# Data loading
load_mongo_data(mongo_client, scale_100_dir)

# Document count verification
verify_mongo_counts(mongo_client)
STOP

INFO: Dropped MongoDB collection: students
INFO: Dropped MongoDB collection: teachers
INFO: Dropped MongoDB collection: classes
INFO: Dropped MongoDB collection: subjects
INFO: Created MongoDB collection: students
INFO: Created MongoDB collection: teachers
INFO: Created MongoDB collection: classes
INFO: Created MongoDB collection: subjects
INFO: MongoDB schema initialized.
INFO: All MongoDB collections exist: students, teachers, classes, subjects
INFO: Loading MongoDB data from ./data/scale_100
INFO: Loading teachers...
INFO: Loaded 100 teachers in 0.00 seconds.
INFO: Loading subjects...
INFO: Loaded 100 subjects in 0.00 seconds.
INFO: Processing schedules...
INFO: Loading classes with embedded schedules...
INFO: Loaded 200 classes in 0.01 seconds.
INFO: Processing enrollments...
INFO: Processing grades...
INFO: Loading students with embedded enrollments and grades...
INFO: Loaded 1000 students in 0.03 seconds.
INFO: MongoDB loading complete in 0.13 seconds.
--- MongoDB Collection Docu

''
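The MongoDB loader follows an embedding pattern: child rows (enrollments, grades) are grouped by parent key first and then attached to each student document. The same idea can be written more compactly with `collections.defaultdict` and `csv.DictReader`. This is a sketch with made-up sample data and hypothetical helper names; note that `csv` yields strings, so identifiers stay strings unless converted explicitly.

```python
import csv
import io
from collections import defaultdict


def group_rows(csv_file, key, fields):
    """Group CSV rows by `key`, keeping only `fields` in each embedded document."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[key]].append({f: row[f] for f in fields})
    return grouped


def build_student_docs(students_csv, enrollments, grades):
    """Yield one document per student with embedded enrollments and grades."""
    for row in csv.DictReader(students_csv):
        yield {
            "_id": row["id"],
            "first_name": row["first_name"],
            "last_name": row["last_name"],
            "birth_date": row["birth_date"],
            "enrollments": enrollments.get(row["id"], []),
            "grades": grades.get(row["id"], []),
        }


# Hypothetical samples in the same layout as the generated CSV files.
students_csv = io.StringIO(
    "id,first_name,last_name,birth_date\n"
    "1,Jan,Kowalski,2008-05-01\n"
    "2,Ewa,Nowak,2008-07-15\n"
)
enrollments_csv = io.StringIO(
    "student_id,class_id,enrolled_at\n"
    "1,10,2024-09-01 08:00:00\n"
)
grades_csv = io.StringIO(
    "student_id,subject_id,grade,created_at\n"
    "1,3,4.5,2024-10-01 10:00:00\n"
)

enrollments = group_rows(enrollments_csv, "student_id", ["class_id", "enrolled_at"])
grades = group_rows(grades_csv, "student_id", ["subject_id", "grade", "created_at"])
docs = list(build_student_docs(students_csv, enrollments, grades))
print(docs)
```

The resulting list can be passed to `db.students.insert_many(docs)` in batches, as the loader above does. Embedding makes the READ scenario a single-document lookup, at the cost of rewriting the student document whenever a grade or enrolment changes.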