# ScyllaDB - Advanced Queries and Indexing

This notebook covers advanced query patterns, secondary indexes, and performance optimization.

## Topics Covered
- WHERE clauses and filtering
- Secondary indexes
- ALLOW FILTERING
- Time series data patterns
- Batch operations
- Query performance tips

## Prerequisites
- Completed the getting-started notebook
- ScyllaDB running locally

In [1]:
# Import required libraries
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement, BatchStatement, BatchType
import uuid
from datetime import datetime, timedelta
import random
import time

## 1. Setup Connection and Schema

Let's connect and create a more complex schema for demonstration.

In [2]:
# Connect to ScyllaDB
SCYLLA_HOST = "127.0.0.1"
SCYLLA_PORT = 9042

cluster = Cluster([SCYLLA_HOST], port=SCYLLA_PORT)
session = cluster.connect()

print(f"[OK] Connected to ScyllaDB at {SCYLLA_HOST}:{SCYLLA_PORT}")

# Create keyspace
KEYSPACE = "advanced_demo"
session.execute(f"""
    CREATE KEYSPACE IF NOT EXISTS {KEYSPACE}
    WITH replication = {{'class': 'SimpleStrategy', 'replication_factor': 1}}
""")
session.set_keyspace(KEYSPACE)
print(f"[OK] Using keyspace '{KEYSPACE}'")

[OK] Connected to ScyllaDB at 127.0.0.1:9042
[OK] Using keyspace 'advanced_demo'


## 2. Time Series Data with Clustering

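Create and query time series data efficiently using clustering columns. The partition key (`sensor_id`) groups each sensor's readings together, and the clustering column (`timestamp`) keeps them sorted within the partition, newest first.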

In [3]:
# Create time series table with clustering order
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data (
        sensor_id text,
        timestamp timestamp,
        temperature double,
        humidity double,
        PRIMARY KEY (sensor_id, timestamp)
    ) WITH CLUSTERING ORDER BY (timestamp DESC)
""")

print("[OK] Created sensor_data table with time-based clustering")

# Insert sample sensor readings
sensors = ["sensor_001", "sensor_002", "sensor_003"]
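# Note: the Python driver treats naive datetimes as UTC, so datetime.now()
# stores the local wall-clock time as if it were UTC.
base_time = datetime.now()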

insert_sensor = session.prepare("""
    INSERT INTO sensor_data (sensor_id, timestamp, temperature, humidity)
    VALUES (?, ?, ?, ?)
""")

for sensor in sensors:
    for i in range(50):
        timestamp = base_time - timedelta(minutes=i)
        temperature = 20 + random.uniform(-5, 5)
        humidity = 50 + random.uniform(-10, 10)
        
        session.execute(insert_sensor, (sensor, timestamp, temperature, humidity))
    
    print(f"  [OK] Inserted 50 readings for {sensor}")

print(f"[OK] Total: {len(sensors) * 50} sensor readings inserted")

[OK] Created sensor_data table with time-based clustering
  [OK] Inserted 50 readings for sensor_001
  [OK] Inserted 50 readings for sensor_002
  [OK] Inserted 50 readings for sensor_003
[OK] Total: 150 sensor readings inserted


## 3. Query Time Series Data

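Query recent readings efficiently using partition and clustering keys. Because the table is clustered by `timestamp DESC`, a `LIMIT` query against a single partition returns the newest readings first without any extra sorting.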

In [4]:
# Query latest readings for a sensor
sensor_id = "sensor_001"
query = """
    SELECT timestamp, temperature, humidity
    FROM sensor_data
    WHERE sensor_id = ?
    LIMIT 10
"""

print(f"Latest 10 readings for {sensor_id}:")
print("-" * 70)
print(f"{'Timestamp':<25} {'Temperature (°C)':<20} {'Humidity (%)':<15}")
print("-" * 70)

prepared = session.prepare(query)
result = session.execute(prepared, (sensor_id,))
for row in result:
    print(f"{str(row.timestamp):<25} {row.temperature:<20.2f} {row.humidity:<15.2f}")

print("-" * 70)

Latest 10 readings for sensor_001:
----------------------------------------------------------------------
Timestamp                 Temperature (°C)     Humidity (%)   
----------------------------------------------------------------------
2026-01-02 11:54:59.949000 21.18                58.37          
2026-01-02 11:53:59.949000 19.96                45.03          
2026-01-02 11:52:59.949000 19.18                43.29          
2026-01-02 11:52:18.473000 20.78                53.01          
2026-01-02 11:51:59.949000 22.38                49.42          
2026-01-02 11:51:18.473000 24.15                58.66          
2026-01-02 11:50:59.949000 20.25                42.13          
2026-01-02 11:50:18.473000 15.49                53.18          
2026-01-02 11:49:59.949000 17.77                49.48          
2026-01-02 11:49:18.473000 17.47                48.59          
----------------------------------------------------------------------
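The same partition can also be read by time window, since the clustering column supports range restrictions. The following is a minimal sketch rather than part of the original notebook: `RANGE_QUERY`, `window_ending_at`, and `readings_between` are hypothetical names, and the `session` and `prepared` arguments are assumed to come from the cells above.

```python
from datetime import datetime, timedelta

# CQL for a bounded time window inside one partition. Table and column
# names match the sensor_data table created earlier in this notebook.
RANGE_QUERY = """
    SELECT timestamp, temperature, humidity
    FROM sensor_data
    WHERE sensor_id = ?
      AND timestamp >= ?
      AND timestamp < ?
"""


def window_ending_at(end, minutes):
    """Return (start, end) covering the `minutes` before `end`."""
    return end - timedelta(minutes=minutes), end


def readings_between(session, prepared, sensor_id, start, end):
    """Return rows for one sensor with start <= timestamp < end.

    `prepared` is assumed to be session.prepare(RANGE_QUERY).
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    return list(session.execute(prepared, (sensor_id, start, end)))


# Usage against a live session (commented out so the sketch runs standalone):
# range_stmt = session.prepare(RANGE_QUERY)
# start, end = window_ending_at(datetime.now(), minutes=15)
# rows = readings_between(session, range_stmt, "sensor_001", start, end)
```

Both bounds restrict the clustering column within a single partition, so no index or `ALLOW FILTERING` should be needed for this query shape.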


## 4. Secondary Indexes

Create and use secondary indexes to query by columns that are not part of the primary key, such as `customer_id` or `status`.

In [5]:
# Create orders table
session.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id uuid PRIMARY KEY,
        customer_id uuid,
        status text,
        total_amount decimal,
        order_date timestamp
    )
""")

# Create secondary indexes
session.execute("CREATE INDEX IF NOT EXISTS orders_customer_idx ON orders (customer_id)")
session.execute("CREATE INDEX IF NOT EXISTS orders_status_idx ON orders (status)")

print("[OK] Created orders table with secondary indexes")

# Insert sample orders
customers = [uuid.uuid4() for _ in range(3)]
statuses = ["pending", "processing", "shipped", "delivered"]

prepared = session.prepare(
    """
        INSERT INTO orders (order_id, customer_id, status, total_amount, order_date)
        VALUES (?, ?, ?, ?, ?)
    """
)
for _ in range(20):
    session.execute(
        prepared,
        (
            uuid.uuid4(),
            random.choice(customers),
            random.choice(statuses),
            round(random.uniform(10, 500), 2),
            datetime.now() - timedelta(days=random.randint(0, 30))
        )
    )

print("[OK] Inserted 20 sample orders")

[OK] Created orders table with secondary indexes
[OK] Inserted 20 sample orders


In [6]:
# Query using secondary index
customer_id = customers[0]
prepared = session.prepare("SELECT * FROM orders WHERE customer_id = ?")
result = session.execute(prepared, (customer_id,))
customer_orders = list(result)

print(f"\nOrders for customer {customer_id}: {len(customer_orders)}")

# Query by status
status_result = session.execute("SELECT * FROM orders WHERE status = 'shipped'")
shipped_orders = list(status_result)

print(f"Shipped orders: {len(shipped_orders)}")


Orders for customer e281f572-dffb-4c9e-8ada-400f815ab67d: 4
Shipped orders: 10
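Restricting on both indexed columns in a single `WHERE` clause typically requires `ALLOW FILTERING`. One alternative, sketched here under the assumption that a single customer has only a modest number of orders, is to use the `customer_id` index and filter by status on the client. `orders_for_customer` is a hypothetical helper, not part of the driver.

```python
def orders_for_customer(session, prepared, customer_id, status=None):
    """Fetch a customer's orders via the customer_id index.

    `prepared` is assumed to be
    session.prepare("SELECT * FROM orders WHERE customer_id = ?").
    If `status` is given, rows are filtered in Python rather than in CQL.
    """
    rows = session.execute(prepared, (customer_id,))
    if status is None:
        return list(rows)
    return [row for row in rows if row.status == status]


# Usage against a live session (commented out so the sketch runs standalone):
# by_customer = session.prepare("SELECT * FROM orders WHERE customer_id = ?")
# shipped = orders_for_customer(session, by_customer, customers[0], "shipped")
```

Client-side filtering transfers every matching row over the network, so it only makes sense when the indexed lookup already narrows the result to a small set.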


## 5. Batch Operations

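Use batch statements to group related writes into a single request. A `LOGGED` batch guarantees that all of its statements are eventually applied. The `UNLOGGED` batch used below skips the batch log, so it is atomic only when every statement targets the same partition. Each order here has its own `order_id` and therefore its own partition, so this batch saves round trips but is not all-or-nothing.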

In [7]:
# Create batch and insert multiple orders
batch = BatchStatement(batch_type=BatchType.UNLOGGED)

insert_stmt = session.prepare("""
    INSERT INTO orders (order_id, customer_id, status, total_amount, order_date)
    VALUES (?, ?, ?, ?, ?)
""")

for i in range(5):
    batch.add(insert_stmt, (
        uuid.uuid4(),
        random.choice(customers),
        "pending",
        round(random.uniform(50, 200), 2),
        datetime.now()
    ))

session.execute(batch)
print("[OK] Batch insert completed: 5 orders added")

[OK] Batch insert completed: 5 orders added
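When atomicity or efficiency matters, a common approach is to batch only rows that share a partition key. The sketch below groups rows by partition and caps the batch size. `group_by_partition` and `readings` are hypothetical names introduced for illustration, and the default cap of 50 is an arbitrary assumption.

```python
from collections import defaultdict


def group_by_partition(rows, partition_index=0, max_batch_size=50):
    """Split rows into chunks that each share one partition key.

    `rows` is a sequence of parameter tuples; `partition_index` is the
    position of the partition key inside each tuple. Returns a list of
    chunks, each at most `max_batch_size` rows long.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_index]].append(row)
    chunks = []
    for partition_rows in groups.values():
        for start in range(0, len(partition_rows), max_batch_size):
            chunks.append(partition_rows[start:start + max_batch_size])
    return chunks


# Usage with the sensor_data insert prepared earlier (commented out so the
# sketch runs standalone; `readings` is a hypothetical list of
# (sensor_id, timestamp, temperature, humidity) tuples):
# for chunk in group_by_partition(readings):
#     batch = BatchStatement(batch_type=BatchType.UNLOGGED)
#     for row in chunk:
#         batch.add(insert_sensor, row)
#     session.execute(batch)
```

Grouping this way suits `sensor_data`, where many rows share a `sensor_id`. For `orders`, every row is its own partition, so individual prepared inserts are usually the simpler choice.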


## Summary

Key learnings from this notebook:
- [OK] Time series patterns with clustering columns
- [OK] Secondary indexes for alternative queries
- [OK] Batch operations for grouping related writes
- [OK] Query optimization techniques

**Best Practices:**
- Design tables for your queries (query-first approach)
- Use clustering columns for sorting and range queries
- Secondary indexes impact write performance - use sparingly
- Batch operations should be small (< 100 statements) and ideally target a single partition
- Avoid ALLOW FILTERING in production
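
To compare query shapes such as the partition reads and index lookups in this notebook, a rough client-side timer can be enough for a first look. This is a minimal sketch: `timed_execute` is a hypothetical helper, and the measurement includes network latency and result paging, not just server-side work.

```python
import time


def timed_execute(session, statement, parameters=None):
    """Run a statement and return (rows, elapsed_milliseconds).

    Rows are materialized into a list inside the timed region, so the
    measurement covers fetching every page of the result.
    """
    start = time.perf_counter()
    rows = list(session.execute(statement, parameters))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return rows, elapsed_ms


# Usage against a live session (commented out so the sketch runs standalone):
# stmt = session.prepare("SELECT * FROM orders WHERE customer_id = ?")
# rows, ms = timed_execute(session, stmt, (customers[0],))
# print(f"{len(rows)} rows in {ms:.1f} ms")
```

Single measurements are noisy, so repeating each query several times and comparing medians gives a fairer picture.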

## Next Steps
- Explore `03-data-modeling.ipynb` for advanced data modeling patterns

In [8]:
# Cleanup and close connection
# session.execute(f"DROP KEYSPACE IF EXISTS {KEYSPACE}")
cluster.shutdown()
print("[OK] Connection closed")

[OK] Connection closed
