# Elasticsearch Index Downloader

## Overview

This notebook implements the **first stage** of our cybersecurity dataset creation pipeline. It connects to an Elasticsearch cluster, discovers relevant indices, and exports time-ranged data to JSONL (JSON Lines) files for downstream processing.

## Pipeline Context

This is **Notebook 1 of 7** in the dataset generation workflow:
1. **[Current]** Elasticsearch Index Downloader - Raw data extraction
2. Sysmon Dataset CSV Creator - Windows event log processing  
3. Network Traffic Flow CSV Creator - Network data processing
4. Caldera Report Analyzer - Attack timeline analysis
5. Sysmon Event Tracking - Advanced event correlation
6. Event Timeline Plotter - Temporal visualization
7. Network Event Tracking - Network behavior analysis

## Process Workflow

```mermaid
graph TB
    A[Connect to Elasticsearch] --> B[Discover Available Indices]
    B --> C[Filter by Keywords: 'sysmon', 'network_traffic']
    C --> D[Display Index Information<br/>Size, Creation Date]
    D --> E[Interactive Index Selection]
    E --> F[Define Time Range<br/>Start & End DateTime]
    F --> G[Export Query Execution<br/>@timestamp filtering]
    G --> H[JSONL File Generation<br/>One file per index]
    H --> I[Pipeline Ready for Stage 2]
```

## Key Technical Features

- **Authenticated Connection**: HTTPS with basic authentication; certificate verification is disabled (`verify_certs=False`), which is suitable only for trusted internal network environments
- **Index Discovery**: Automatic identification of relevant indices using keyword filtering
- **Time Range Filtering**: Precise timestamp-based data extraction using Elasticsearch range queries
- **Efficient Scanning**: Uses `elasticsearch.helpers.scan()` for memory-efficient large dataset processing
- **JSONL Output**: Structured line-delimited JSON format optimized for data pipeline integration

## Output Artifacts

Generated JSONL files follow the naming convention `{sanitized-index-name}.jsonl`, where sanitizing replaces `:` with `_` and `.` with `-`:
- Example: `.ds-logs-windows.sysmon_operational-default-2025.05.04-000001` → `-ds-logs-windows-sysmon_operational-default-2025-05-04-000001.jsonl`
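
The renaming rule can be sketched as a small standalone helper. This is a minimal illustration of the two replacements the export step applies; the function name `sanitize_index_name` is hypothetical and does not appear in the notebook.

```python
def sanitize_index_name(index_name: str) -> str:
    """Hypothetical helper: map an index name to a filesystem-friendly file stem."""
    # Same two replacements as the export step: ':' -> '_' and '.' -> '-'
    return index_name.replace(":", "_").replace(".", "-")


example = ".ds-logs-windows.sysmon_operational-default-2025.05.04-000001"
print(sanitize_index_name(example) + ".jsonl")
```

Note that the leading `.` of a data stream backing index becomes a leading `-`, so the resulting file is no longer hidden on Unix-like systems.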

In [ ]:
# Core Elasticsearch connectivity
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan  # Memory-efficient large dataset iteration

# Time handling with timezone support
from datetime import datetime, timezone

# Data serialization and file system operations
import json
import os

In [ ]:
# Elasticsearch cluster connection settings
es_host = "https://10.2.0.20:9200"  # Internal network Elasticsearch endpoint
username = "elastic"                # Cluster admin username
password = "hiYqiU21LVg0F8krD=XN"   # Cluster admin password (do not hard-code in production)

# Index filtering patterns for cybersecurity data sources
keywords = ['sysmon', 'network_traffic']  # Target data types for analysis

# Output configuration
output_dir = "./"  # Current directory for JSONL file generation

# Timestamp parsing format (matches the way Kibana displays dates)
TIMESTAMP_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"  # Example: "May 04, 2025 @ 11:30:00.000"
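
One way to avoid hard-coding credentials is to read them from environment variables. The sketch below is only an illustration under stated assumptions: the helper name `load_es_settings`, the variable names `ES_HOST`, `ES_USERNAME`, and `ES_PASSWORD`, and the default values are all hypothetical and are not used elsewhere in this notebook.

```python
import os


def load_es_settings(env=os.environ):
    """Hypothetical helper: read connection settings from environment variables."""
    # ES_HOST, ES_USERNAME and ES_PASSWORD are assumed names for this sketch
    password = env.get("ES_PASSWORD")
    if not password:
        raise RuntimeError("ES_PASSWORD is not set")
    return {
        "es_host": env.get("ES_HOST", "https://localhost:9200"),  # assumed default
        "username": env.get("ES_USERNAME", "elastic"),            # assumed default
        "password": password,
    }
```

The returned dictionary could then populate `es_host`, `username`, and `password` before the connection is created. Failing early when the password is missing avoids a less obvious authentication error later.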

In [ ]:
def connect_elasticsearch():
    """Create an authenticated connection to the Elasticsearch cluster (certificate verification disabled)"""
    return Elasticsearch(
        hosts=[es_host],                    # Target cluster endpoint
        basic_auth=(username, password),   # Username/password authentication
        verify_certs=False,                # Skip TLS certificate verification (trusted internal networks only)
        ssl_show_warn=False                # Suppress the resulting insecure-connection warnings
    )

In [ ]:
def test_connection(es):
    """Check that the cluster is reachable using a lightweight ping"""
    try:
        return es.ping()  # Lightweight cluster connectivity test
    except Exception as e:
        print(f"🔥 Connection failed: {e}")  # User-friendly error reporting
        return False

In [ ]:
def list_relevant_indices(es, keywords):
    """Retrieve indices containing keywords with storage size and creation metadata"""
    try:
        # Fetch index metadata: name, size, creation timestamp
        response = es.cat.indices(format="json", h="index,store.size,creation.date")
        
        return [
            {
                "name": idx["index"],                                                    # Full index name
                "size": idx.get("store.size", "0b"),                                   # Storage size (human-readable)
                "created": datetime.fromtimestamp(int(idx["creation.date"])/1000, tz=timezone.utc)  # UTC creation time
            }
            for idx in response
            if any(kw in idx["index"] for kw in keywords)  # Filter by keyword patterns
        ]
    except Exception as e:
        print(f"🚨 Error listing indices: {e}")  # Handle API failures gracefully
        return []
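
The filtering and timestamp conversion can be tried without a cluster. The sketch below runs the same logic over illustrative rows shaped like the fields the function reads (`index`, `store.size`, `creation.date`). The rows are sample values for demonstration: the `creation.date` values and the `.kibana_task_manager` entry are assumptions, and `filter_rows` is a hypothetical name.

```python
from datetime import datetime, timezone

# Illustrative rows only; creation.date is epoch milliseconds encoded as a string
sample_rows = [
    {"index": ".ds-logs-windows.sysmon_operational-default-2025.03.10-000001",
     "store.size": "8.4gb", "creation.date": "1741564800000"},
    {"index": ".kibana_task_manager",
     "store.size": "1mb", "creation.date": "1741564800000"},
]


def filter_rows(rows, keywords):
    """Keep rows whose index name contains any keyword; convert creation time to UTC."""
    return [
        {
            "name": row["index"],
            "size": row.get("store.size", "0b"),
            # Epoch milliseconds -> seconds -> timezone-aware datetime
            "created": datetime.fromtimestamp(int(row["creation.date"]) / 1000, tz=timezone.utc),
        }
        for row in rows
        if any(kw in row["index"] for kw in keywords)
    ]


for entry in filter_rows(sample_rows, ["sysmon", "network_traffic"]):
    print(entry["name"], entry["size"], entry["created"].strftime("%Y-%m-%d"))
```

Matching is a plain substring test, so a keyword such as `sysmon` also matches any other index whose name happens to contain that string.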

In [ ]:
def display_indices_selector(indices):
    """Interactive index selection with user-friendly interface and validation"""
    print(f"\n📂 Found {len(indices)} relevant indices:")
    
    # Display numbered list with metadata for informed selection
    for i, idx in enumerate(indices, 1):
        print(f"{i:>3}. {idx['name']} ({idx['size']}) [Created: {idx['created'].strftime('%Y-%m-%d')}]")

    while True:  # Continue until valid selection or exit
        selection = input("\n🔢 Select indices (comma-separated numbers, 'all', or 'exit'): ").strip().lower()
        
        # Handle exit condition
        if selection == "exit":
            return []
            
        # Handle bulk selection
        if selection == "all":
            return [idx["name"] for idx in indices]
        
        try:
            # Parse comma-separated numbers (1-based), skipping non-numeric tokens
            positions = [
                int(num)
                for num in selection.split(",")
                if num.strip().isdigit()    # Validate numeric input
            ]

            # Reject out-of-range numbers; 0 would otherwise wrap around to the last index
            if any(pos < 1 or pos > len(indices) for pos in positions):
                raise IndexError

            if positions:
                # Remove duplicates while keeping the order the user typed
                return list(dict.fromkeys(indices[pos - 1]["name"] for pos in positions))

            print("⚠️ No valid selection. Please try again.")

        except (IndexError, ValueError):
            print("⛔ Invalid input. Use in-range numbers separated by commas.")
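
The parsing rules are easier to check when separated from the `input()` loop. The following is a minimal sketch of such a pure function; `parse_selection` is a hypothetical name, and its behavior (skip non-numeric tokens, reject out-of-range numbers, keep first occurrences in order) is one reasonable reading of the selector above rather than the notebook's exact code.

```python
def parse_selection(selection: str, total: int) -> list[int]:
    """Hypothetical helper: turn a string like '1, 3' into zero-based positions [0, 2]."""
    positions = []
    for token in selection.split(","):
        token = token.strip()
        if not token.isdecimal():
            continue  # ignore blanks and non-numeric tokens
        number = int(token)
        if not 1 <= number <= total:
            raise IndexError(f"{number} is outside 1..{total}")
        if number - 1 not in positions:
            positions.append(number - 1)  # keep first occurrence, preserve order
    return positions


print(parse_selection("1, 3,3", total=5))
```

Rejecting `0` explicitly matters because Python would otherwise treat the resulting position `-1` as a valid reference to the last list element.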

In [ ]:
def parse_utc_time(time_str):
    """Parse time string into UTC datetime object for Elasticsearch compatibility"""
    try:
        # Clean input: remove timezone suffix if present (assume UTC)
        time_str = time_str.split(" (UTC)")[0].strip()
        
        # Parse using predefined format and enforce UTC timezone
        return datetime.strptime(time_str, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
        
    except ValueError as e:
        # Provide user-friendly error feedback with format example
        print(f"⏰ Time parsing error: {e}")
        print(f"📅 Expected format: {TIMESTAMP_FORMAT} (UTC)")
        return None
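
To see how a parsed value ends up in the export query, the sketch below parses two Kibana-style strings and builds the `@timestamp` range clause from them. `build_range_query` and `KIBANA_FORMAT` are hypothetical names, and the ordering check is an addition that the notebook itself does not perform. Parsing `%b` assumes an English locale for month abbreviations.

```python
from datetime import datetime, timezone

KIBANA_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"  # same pattern as TIMESTAMP_FORMAT


def build_range_query(start_str: str, end_str: str) -> dict:
    """Hypothetical helper: parse two UTC timestamps and build a range query body."""
    start = datetime.strptime(start_str, KIBANA_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_str, KIBANA_FORMAT).replace(tzinfo=timezone.utc)
    if start > end:
        raise ValueError("start time is after end time")  # extra check, not in the notebook
    return {
        "query": {
            "range": {
                "@timestamp": {"gte": start.isoformat(), "lte": end.isoformat()}
            }
        }
    }


query = build_range_query("May 04, 2025 @ 11:30:00.000", "May 04, 2025 @ 12:00:00.500")
print(query["query"]["range"]["@timestamp"]["gte"])  # 2025-05-04T11:30:00+00:00
print(query["query"]["range"]["@timestamp"]["lte"])  # 2025-05-04T12:00:00.500000+00:00
```

Because the datetimes are timezone-aware, `isoformat()` includes the `+00:00` offset, so the bounds are unambiguous regardless of the cluster's default time zone.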

In [ ]:
def export_index_data(es, index_name, start_time, end_time):
    """Export index data within time range to JSONL file with memory-efficient scanning"""
    
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Sanitize index name for filesystem compatibility
    safe_name = index_name.replace(":", "_").replace(".", "-")
    filename = os.path.join(output_dir, f"{safe_name}.jsonl")
    
    # Construct Elasticsearch range query for temporal filtering
    query = {
        "query": {
            "range": {
                "@timestamp": {                         # Standard Elasticsearch timestamp field
                    "gte": start_time.isoformat(),      # Greater than or equal (start)
                    "lte": end_time.isoformat(),        # Less than or equal (end)
                    "format": "strict_date_optional_time"  # ISO datetime format
                }
            }
        }
    }

    try:
        with open(filename, "w") as f:
            count = 0
            
            # Memory-efficient iteration through large result sets
            for hit in scan(es, index=index_name, query=query):
                # Write each document as a single JSON line
                f.write(json.dumps(hit["_source"]) + "\n")  # Extract document source only
                count += 1
                
        print(f"✅ Success: {count} documents from {index_name} -> {filename}")
        return True
        
    except Exception as e:
        print(f"❌ Failed to export {index_name}: {e}")  # Handle export failures gracefully
        return False
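
The JSONL layout can be verified without a cluster by feeding the same write loop a stand-in for the scan results and reading the file back. In this sketch, `fake_hits`, `write_jsonl`, and `read_jsonl` are hypothetical names, and the documents are illustrative sample values rather than real events.

```python
import json
import os
import tempfile

# Stand-in for the hits a scan would yield; illustrative documents only
fake_hits = [
    {"_source": {"@timestamp": "2025-05-04T11:30:00Z", "event": {"code": "1"}}},
    {"_source": {"@timestamp": "2025-05-04T11:30:01Z", "event": {"code": "3"}}},
]


def write_jsonl(hits, path):
    """Write each hit's _source as one JSON document per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for hit in hits:
            f.write(json.dumps(hit["_source"]) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Lazily yield one parsed document per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
written = write_jsonl(fake_hits, path)
docs = list(read_jsonl(path))
print(written, len(docs))
```

Reading line by line mirrors the memory behavior of the export: a downstream stage never needs to hold the whole file in memory.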

In [ ]:
def main():
    """Main pipeline orchestrator for Elasticsearch data extraction workflow"""
    
    # Stage 1: Establish cluster connection
    print("\n🔗 Connecting to Elasticsearch...")
    es = connect_elasticsearch()
    
    # Stage 2: Validate connection before proceeding
    if not test_connection(es):
        print("🚨 Could not establish connection to Elasticsearch")
        return

    # Stage 3: Discover available indices matching keywords
    print("\n🔍 Searching for relevant indices...")
    indices = list_relevant_indices(es, keywords)
    
    if not indices:
        print("🤷 No matching indices found")
        return

    # Stage 4: Interactive index selection
    selected_indices = display_indices_selector(indices)
    if not selected_indices:
        print("🚪 Exiting without download")
        return

    # Stage 5: Time range specification for temporal filtering
    print("\n🕒 Time Range Selection (UTC)")
    print("💡 Example format: 'Jan 29, 2025 @ 04:24:54.863'")
    start_time = parse_utc_time(input("⏱️  Start time: "))
    end_time = parse_utc_time(input("⏰ End time: "))
    
    # Validate time parsing results
    if not all([start_time, end_time]):
        print("⛔ Invalid time parameters")
        return

    # Stage 6: Execute data export for each selected index
    print("\n⏳ Starting data export...")
    for index in selected_indices:
        export_index_data(es, index, start_time, end_time)

    # Display summary for verification and downstream reference
    print(f'start time: {start_time}')
    print(f'end time: {end_time}')

In [ ]:
# Execute the complete Elasticsearch data extraction pipeline
if __name__ == "__main__":
    main()  # Run interactive data extraction workflow

🔗 Connecting to Elasticsearch...

🔍 Searching for relevant indices...

📂 Found 13 relevant indices:
  1. .ds-logs-network_traffic.dns-default-2025.03.10-000001 (114.2mb) [Created: 2025-03-10]
  2. .ds-logs-network_traffic.tls-default-2025.03.10-000001 (52.7mb) [Created: 2025-03-10]
  3. .ds-logs-network_traffic.icmp-default-2025.03.10-000001 (2mb) [Created: 2025-03-10]
  4. .ds-logs-windows.sysmon_operational-default-2025.04.15-000002 (17.2gb) [Created: 2025-04-15]
  5. .ds-logs-network_traffic.dhcpv4-default-2025.04.17-000001 (86.2kb) [Created: 2025-04-17]
  6. .ds-logs-network_traffic.tls-default-2025.04.15-000002 (45.3mb) [Created: 2025-04-15]
  7. .ds-logs-network_traffic.flow-default-2025.04.15-000002 (25.3gb) [Created: 2025-04-15]
  8. .ds-logs-network_traffic.dns-default-2025.04.15-000002 (255.4mb) [Created: 2025-04-15]
  9. .ds-logs-windows.sysmon_operational-default-2025.03.10-000001 (8.4gb) [Created: 2025-03-10]
 10. .ds-logs-network_traffic.icmp-default-2025.04.15-000002 (7