<!-- # Elasticsearch Index Downloader

## 📖 Overview

This notebook extracts cybersecurity data from an Elasticsearch cluster containing Windows event logs and network traffic. It provides an interactive interface for users to select specific indices and time ranges for data extraction.

### 🎯 Purpose

- **Connect** to Elasticsearch cluster securely
- **Discover** available indices containing security data  
- **Select** relevant indices through user interaction
- **Extract** data within specified time ranges
- **Output** structured JSONL files for further processing

### 📊 Target Data Types

- **Windows Sysmon Events**: Process creation, network connections, file operations
- **Network Traffic**: DNS queries, HTTP requests, TLS handshakes, flow data

--- -->


## 🛠️ Required Libraries

Import essential libraries for Elasticsearch integration and data processing:

## ⚙️ Configuration Parameters

Set up connection details and extraction settings:

## 🔌 Connection Functions

Functions to establish and validate Elasticsearch connections:

In [None]:
# Core libraries for Elasticsearch integration and data handling
from elasticsearch import Elasticsearch          # Elasticsearch client for cluster communication
from elasticsearch.helpers import scan           # Efficient scrolling through large result sets
from datetime import datetime, timezone          # Timestamp parsing and timezone handling
import json                                      # JSON serialization for JSONL output format
import os                                        # File system operations for output directory

## 🔍 Index Discovery Functions

Functions to find and filter relevant security indices in the cluster:

In [None]:
# Elasticsearch cluster connection configuration
es_host = "https://10.2.0.20:9200"               # HTTPS endpoint for secure connection
username = "elastic"                             # Authentication username
password = "hiYqiU21LVg0F8krD=XN"               # Authentication password

# Data discovery and filtering settings  
keywords = ['sysmon', 'network_traffic']         # Keywords to identify relevant security indices
output_dir = "./"                                # Local directory for extracted JSONL files

# Time format for user input (human-readable format)
TIMESTAMP_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"     # Example: "Jan 29, 2025 @ 04:24:54.863"

## 📅 Temporal Processing & Data Extraction

### ⏰ Time Range Management

The temporal processing system handles time-based filtering for cybersecurity event extraction. This is critical for focusing on specific attack simulation windows or investigation timeframes.

#### Time Format Design
- **Human-Readable Input**: Format like `'Jan 29, 2025 @ 04:24:54.863'` for intuitive user interaction
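- **UTC Standardization**: Input is interpreted as UTC, and every parsed timestamp is tagged with the UTC timezone for consistent processing
- **Precision Support**: Fractional seconds down to microseconds (`%f` accepts one to six digits) for high-resolution security event correlation
- **Timezone Handling**: A trailing ` (UTC)` label is stripped before parsing; timestamps given in other timezones are not converted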
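
As a minimal sketch of the format described above, the snippet below parses a sample timestamp and labels it as UTC. The helper name `to_utc` is hypothetical, and the input is assumed to already be expressed in UTC, since no offset conversion takes place.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"  # same format string as the configuration cell

def to_utc(time_str):
    """Parse a human-readable timestamp and tag it as UTC (hypothetical helper, no conversion)."""
    naive = datetime.strptime(time_str.strip(), TIMESTAMP_FORMAT)
    return naive.replace(tzinfo=timezone.utc)

parsed = to_utc("Jan 29, 2025 @ 04:24:54.863")
print(parsed.isoformat())  # 2025-01-29T04:24:54.863000+00:00
```

The three-digit fraction `.863` is read as 863 milliseconds, which is why the ISO output shows `863000` microseconds.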

#### Data Extraction Architecture
The extraction system implements **streaming processing** for memory-efficient handling of large security datasets:

##### Extraction Features
- **Time-Range Queries**: Elasticsearch range queries on `@timestamp` field
- **Streaming Output**: Direct write to JSONL files without loading entire datasets in memory
- **Progress Tracking**: Document count reported for each index once its export completes
- **Error Resilience**: Individual index failures don't stop the entire process
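
The time-range query can be sketched as a plain dictionary built from two timezone-aware datetimes. The function name `build_range_query` and the sample times are illustrative assumptions; the body mirrors the range filter on `@timestamp` used by `export_index_data`.

```python
from datetime import datetime, timezone

def build_range_query(start_time, end_time):
    """Return a search body matching documents with @timestamp in [start_time, end_time]."""
    return {
        "query": {
            "range": {
                "@timestamp": {
                    "gte": start_time.isoformat(),          # inclusive lower bound
                    "lte": end_time.isoformat(),            # inclusive upper bound
                    "format": "strict_date_optional_time",  # ISO 8601 date format
                }
            }
        }
    }

# Illustrative one-hour window
start = datetime(2025, 1, 29, 4, 0, 0, tzinfo=timezone.utc)
end = datetime(2025, 1, 29, 5, 0, 0, tzinfo=timezone.utc)
body = build_range_query(start, end)
print(body["query"]["range"]["@timestamp"]["gte"])  # 2025-01-29T04:00:00+00:00
```

Building the body separately from the export loop makes it possible to inspect the query before any data is pulled from the cluster.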

##### JSONL Format Benefits
- **Line-by-Line Processing**: Each security event on a separate line for easy streaming
- **Fault Tolerance**: Lines written before an interruption remain valid; at most the final line may be truncated
- **ML Pipeline Ready**: Direct compatibility with pandas, ML frameworks, and data processing tools
- **Human Readable**: JSON format allows manual inspection and debugging
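
To illustrate the line-by-line and fault-tolerance points, here is a hedged sketch of a reader that skips any line that fails to parse, such as a truncated final line. The helper `read_jsonl`, the file name, and the sample events are assumptions made for this example, not part of the notebook.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one dict per line; skip blank lines and lines that are not valid JSON."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a final line cut off by an interrupted export

# Illustrative file: two complete events followed by a truncated third line
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write('{"event": {"code": "1"}}\n{"event": {"code": "3"}}\n{"event": {"co')

events = list(read_jsonl(sample_path))
print(len(events))  # 2
```

For files known to be complete, `pandas.read_json(path, lines=True)` is a common alternative for loading the same data into a DataFrame.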

#### Processing Functions

1. **`parse_utc_time(time_str)`**: Converts human-readable timestamps to UTC datetime objects
2. **`export_index_data(es, index_name, start_time, end_time)`**: Streams security events to JSONL files

This design prioritizes **usability** for researchers while maintaining **efficiency** for large-scale cybersecurity data processing.

---

In [None]:
def connect_elasticsearch():
    """
    Create secure connection to Elasticsearch cluster.
    
    Returns:
        Elasticsearch: Configured client instance with authentication
    """
    return Elasticsearch(
        hosts=[es_host],                         # Cluster endpoint
        basic_auth=(username, password),         # Username/password authentication  
        verify_certs=False,                      # Disable SSL cert verification (lab environment)
        ssl_show_warn=False                      # Suppress SSL warnings
    )

In [None]:
def test_connection(es):
    """
    Validate Elasticsearch connection with ping test.
    
    Args:
        es (Elasticsearch): Elasticsearch client instance
        
    Returns:
        bool: True if connection successful, False otherwise
    """
    try:
        return es.ping()                         # Test basic connectivity
    except Exception as e:
        print(f"🔥 Connection failed: {e}")      # Display connection error
        return False

## 🚀 Main Orchestration Function

Main function that coordinates the entire data extraction workflow:

In [None]:
def list_relevant_indices(es, keywords):
    """
    Discover indices containing security-related keywords.
    
    Args:
        es (Elasticsearch): Elasticsearch client instance
        keywords (list): Keywords to filter indices (e.g., ['sysmon', 'network_traffic'])
        
    Returns:
        list: List of dictionaries containing index metadata (name, size, creation date)
    """
    try:
        # Query cluster for all indices with metadata
        response = es.cat.indices(format="json", h="index,store.size,creation.date")
        
        # Filter indices containing security keywords and extract metadata
        return [
            {
                "name": idx["index"],                                                    # Index name
                "size": idx.get("store.size", "0b"),                                   # Storage size
                "created": datetime.fromtimestamp(int(idx["creation.date"])/1000, tz=timezone.utc)  # Creation timestamp
            }
            for idx in response
            if any(kw in idx["index"] for kw in keywords)                              # Filter by keywords
        ]
    except Exception as e:
        print(f"🚨 Error listing indices: {e}")
        return []

## 📝 Usage Summary

### 🎯 What This Notebook Does

1. **Connects** to Elasticsearch cluster using configured credentials
2. **Discovers** indices containing 'sysmon' or 'network_traffic' data
3. **Presents** available indices with size and creation date information
4. **Allows** interactive selection of specific indices to process
5. **Accepts** human-readable time range input for data filtering
6. **Extracts** matching documents and saves as JSONL files

### 📊 Output Files

- **Format**: JSONL (JSON Lines) - one JSON object per line
- **Naming**: Index name with special characters replaced (`:` → `_`, `.` → `-`)
- **Content**: Raw document sources from Elasticsearch (`_source` field)
- **Location**: Current directory (`./`)
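
The naming rule can be checked in isolation with a small sketch. The function name `jsonl_path` is hypothetical; the replacement logic is the same one used in `export_index_data`, and the index name is taken from the sample output of this notebook.

```python
import os

def jsonl_path(index_name, output_dir="./"):
    """Map an index name to its output file path (':' becomes '_', '.' becomes '-')."""
    safe_name = index_name.replace(":", "_").replace(".", "-")
    return os.path.join(output_dir, f"{safe_name}.jsonl")

path = jsonl_path(".ds-logs-network_traffic.dns-default-2025.03.10-000001")
print(os.path.basename(path))  # -ds-logs-network_traffic-dns-default-2025-03-10-000001.jsonl
```

Note that the leading dot of data stream backing indices also becomes a hyphen, so output file names start with `-ds-`.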

### 🔧 Configuration Notes

- **SSL Verification**: Disabled for lab environments
- **Authentication**: Basic username/password
- **Time Format**: `"Jan 29, 2025 @ 04:24:54.863"` (UTC, fractional seconds with up to microsecond precision)
- **Keywords**: Filters for `sysmon` and `network_traffic` indices only
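
Outside a lab setting, one possible alternative to hardcoding credentials in the notebook is to read them from environment variables. The sketch below is only an assumption about how that could look: the variable names `ES_HOST`, `ES_USERNAME`, and `ES_PASSWORD`, the defaults, and the helper `load_es_settings` are all hypothetical.

```python
import os

def load_es_settings(env=os.environ):
    """Read connection settings from a mapping of environment variables (hypothetical names)."""
    return {
        "es_host": env.get("ES_HOST", "https://localhost:9200"),
        "username": env.get("ES_USERNAME", "elastic"),
        "password": env.get("ES_PASSWORD", ""),
    }

# A plain dict stands in for the real environment in this example
settings = load_es_settings({"ES_HOST": "https://10.2.0.20:9200", "ES_PASSWORD": "example-only"})
print(settings["username"])  # elastic
```

The returned values could then be passed to `Elasticsearch(hosts=[...], basic_auth=(...))` in place of the module-level constants.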

---

In [None]:
def display_indices_selector(indices):
    """
    Interactive interface for user to select which indices to process.
    
    Args:
        indices (list): List of index dictionaries from list_relevant_indices()
        
    Returns:
        list: List of selected index names, empty list if user exits
    """
    # Display available indices with metadata
    print(f"\n📂 Found {len(indices)} relevant indices:")
    for i, idx in enumerate(indices, 1):
        print(f"{i:>3}. {idx['name']} ({idx['size']}) [Created: {idx['created'].strftime('%Y-%m-%d')}]")

    # Interactive selection loop
    while True:
        selection = input("\n🔢 Select indices (comma-separated numbers, 'all', or 'exit'): ").strip().lower()
        
        # Handle exit condition
        if selection == "exit":
            return []
        
        # Handle select all condition  
        if selection == "all":
            return [idx["name"] for idx in indices]
        
        # Parse individual number selections
        try:
            selected_indices = [
                indices[int(num) - 1]["name"]                  # Convert 1-based index to 0-based
                for num in selection.split(",")               # Split comma-separated input
                if num.strip().isdigit() and int(num) >= 1    # Skip non-numeric input and 0 (which would wrap to the last index)
            ]
            if selected_indices:
                return list(dict.fromkeys(selected_indices))  # Remove duplicates, keep selection order
            print("⚠️ No valid selection. Please try again.")
        except (IndexError, ValueError):
            print("⛔ Invalid input format. Use numbers separated by commas.")

def parse_utc_time(time_str):
    """
    Parse human-readable time string into UTC datetime object.
    
    Args:
        time_str (str): Time string in format "Jan 29, 2025 @ 04:24:54.863"
        
    Returns:
        datetime: UTC datetime object, None if parsing fails
    """
    try:
        # Drop a trailing " (UTC)" label if present; the input is assumed to be UTC
        time_str = time_str.split(" (UTC)")[0].strip()
        
        # Parse using predefined format and set UTC timezone
        return datetime.strptime(time_str, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError as e:
        print(f"⏰ Time parsing error: {e}")
        print(f"📅 Expected format: {TIMESTAMP_FORMAT} (UTC)")
        return None

def export_index_data(es, index_name, start_time, end_time):
    """
    Export data from specific index within time range to JSONL file.
    
    Args:
        es (Elasticsearch): Elasticsearch client instance
        index_name (str): Name of index to export
        start_time (datetime): Start of time range (UTC)
        end_time (datetime): End of time range (UTC)
        
    Returns:
        bool: True if export successful, False otherwise
    """
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Create safe filename from index name
    safe_name = index_name.replace(":", "_").replace(".", "-")
    filename = os.path.join(output_dir, f"{safe_name}.jsonl")
    
    # Build Elasticsearch query for time range filtering
    query = {
        "query": {
            "range": {
                "@timestamp": {                                # Filter by timestamp field
                    "gte": start_time.isoformat(),            # Greater than or equal to start
                    "lte": end_time.isoformat(),              # Less than or equal to end
                    "format": "strict_date_optional_time"     # ISO 8601 format
                }
            }
        }
    }

    try:
        # Open output file and stream data
        with open(filename, "w", encoding="utf-8") as f:   # Explicit encoding keeps output consistent across platforms
            count = 0
            # Use scan helper for efficient scrolling through large result sets
            for hit in scan(es, index=index_name, query=query):
                # Write each document as single JSON line
                f.write(json.dumps(hit["_source"]) + "\n")
                count += 1
                
        print(f"✅ Success: {count} documents from {index_name} -> {filename}")
        return True
    except Exception as e:
        print(f"❌ Failed to export {index_name}: {e}")
        return False

In [None]:
def main():
    """
    Main orchestration function that coordinates the data extraction workflow.
    
    Workflow steps:
    1. Connect to Elasticsearch cluster
    2. Discover relevant indices
    3. Present indices for user selection
    4. Get time range from user input
    5. Extract data from selected indices
    """
    # Step 1: Establish connection
    print("\n🔗 Connecting to Elasticsearch...")
    es = connect_elasticsearch()
    
    if not test_connection(es):
        print("🚨 Could not establish connection to Elasticsearch")
        return

    # Step 2: Discover relevant indices
    print("\n🔍 Searching for relevant indices...")
    indices = list_relevant_indices(es, keywords)
    
    if not indices:
        print("🤷 No matching indices found")
        return

    # Step 3: Interactive index selection
    selected_indices = display_indices_selector(indices)
    if not selected_indices:
        print("🚪 Exiting without download")
        return

    # Step 4: Get time range from user
    print("\n🕒 Time Range Selection (UTC)")
    print("💡 Example format: 'Jan 29, 2025 @ 04:24:54.863'")
    start_time = parse_utc_time(input("⏱️  Start time: "))
    end_time = parse_utc_time(input("⏰ End time: "))
    
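    # Validate time parameters
    if not all([start_time, end_time]):
        print("⛔ Invalid time parameters")
        return
    if start_time > end_time:                    # A reversed range would match no documents
        print("⛔ Start time must not be after end time")
        return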

    # Step 5: Extract data from each selected index
    print("\n⏳ Starting data export...")
    for index in selected_indices:
        export_index_data(es, index, start_time, end_time)

    # Display final time range for confirmation
    print(f'start time: {start_time}')
    print(f'end time: {end_time}')

# Execute main workflow when script is run directly
if __name__ == "__main__":
    main()


🔗 Connecting to Elasticsearch...

🔍 Searching for relevant indices...

📂 Found 13 relevant indices:
  1. .ds-logs-network_traffic.dns-default-2025.03.10-000001 (114.2mb) [Created: 2025-03-10]
  2. .ds-logs-network_traffic.tls-default-2025.03.10-000001 (52.7mb) [Created: 2025-03-10]
  3. .ds-logs-network_traffic.icmp-default-2025.03.10-000001 (2mb) [Created: 2025-03-10]
  4. .ds-logs-windows.sysmon_operational-default-2025.04.15-000002 (17.2gb) [Created: 2025-04-15]
  5. .ds-logs-network_traffic.dhcpv4-default-2025.04.17-000001 (86.2kb) [Created: 2025-04-17]
  6. .ds-logs-network_traffic.tls-default-2025.04.15-000002 (45.3mb) [Created: 2025-04-15]
  7. .ds-logs-network_traffic.flow-default-2025.04.15-000002 (25.3gb) [Created: 2025-04-15]
  8. .ds-logs-network_traffic.dns-default-2025.04.15-000002 (255.4mb) [Created: 2025-04-15]
  9. .ds-logs-windows.sysmon_operational-default-2025.03.10-000001 (8.4gb) [Created: 2025-03-10]
 10. .ds-logs-network_traffic.icmp-default-2025.04.15-000002 (7