# Elasticsearch Index Downloader for Cybersecurity Data Collection

## 📖 Academic Overview

This notebook implements the **first stage** of a comprehensive cybersecurity dataset creation pipeline designed for machine learning research in anomaly detection and threat hunting. The pipeline extracts real-world cybersecurity event data from an Elasticsearch cluster that captures live network traffic and host-based security events during Advanced Persistent Threat (APT) attack simulations.

### 🎯 Research Context

- **Domain**: Cybersecurity Machine Learning, Anomaly Detection
- **Application**: APT Attack Detection, Security Information and Event Management (SIEM)
- **Data Source**: Elasticsearch cluster with live cybersecurity telemetry
- **Pipeline Stage**: Stage 1 of 7 (Data Extraction → Feature Engineering → Model Training)

### 📊 Data Collection Architecture

The Elasticsearch cluster collects multi-modal cybersecurity data:

1. **Windows Security Events** (Sysmon): Process creation, network connections, file modifications
2. **Network Traffic Flows**: DNS queries, HTTP requests, TLS handshakes, ICMP packets
3. **Host-based Logs**: Authentication events, system calls, file access patterns
4. **Attack Simulation Data**: Caldera framework APT emulation with ground truth labels

### 🔄 Pipeline Architecture

This notebook is part of a **7-stage cybersecurity dataset creation pipeline**:

```mermaid
graph TD
    A["🔍 Stage 1: Elasticsearch Data Extraction<br/>(This Notebook)"] --> B["📊 Stage 2: Sysmon Dataset Creation"]
    B --> C["🌐 Stage 3: Network Flow Dataset Creation"]
    C --> D["📋 Stage 4: Caldera Report Analysis"]
    D --> E["🎯 Stage 5: Event Tracking & Labeling"]
    E --> F["📈 Stage 6: Timeline Visualization"]
    F --> G["🔗 Stage 7: Network Event Correlation"]
    
    subgraph "Data Sources"
        H["Elasticsearch Cluster"]
        I["Sysmon Logs"]
        J["Network Flows"]
        K["Caldera Reports"]
    end
    
    H --> A
    I --> B
    J --> C
    K --> D
    
    subgraph "Output Formats"
        L["JSONL Files"]
        M["CSV Datasets"]
        N["Labeled Data"]
        O["Visualizations"]
    end
    
    A --> L
    B --> M
    E --> N
    F --> O
```

### 🎯 Stage 1 Objectives (This Notebook)

1. **Connect** to Elasticsearch cluster with security telemetry
2. **Discover** available indices containing cybersecurity data
3. **Select** relevant indices through interactive user interface
4. **Extract** structured security events within specified time ranges
5. **Serialize** data in JSONL format for efficient downstream processing

---

## 🛠️ Environment Setup & Dependencies

### Required Libraries

This notebook requires several Python libraries for Elasticsearch integration, data processing, and file I/O operations. Each library serves a specific purpose in the data extraction pipeline:

- **elasticsearch**: Official Python client for Elasticsearch API interactions and search operations
- **json**: Built-in JSON serialization for data handling and JSONL output format
- **datetime**: Timestamp processing and time-based query filtering for temporal data extraction
- **os**: File system operations and path management for output directory handling

### 🔧 Technical Configuration

The notebook implements an **interactive approach** for:
- **Manual index selection**: User chooses which security indices to process
- **Custom time range input**: User specifies exact temporal boundaries for data extraction
- **Real-time feedback**: Progress updates and status messages during processing
- **Flexible output**: JSONL format for efficient downstream ML pipeline integration

### 🎓 Educational Value

This implementation demonstrates:
- **Production Elasticsearch integration** patterns for cybersecurity data
- **Interactive data exploration** techniques for security researchers
- **JSONL streaming** for memory-efficient large dataset processing
- **Error handling** best practices for distributed system interactions

---

## ⚙️ Configuration Management

### 🔧 Elasticsearch Connection Parameters

The configuration section defines critical parameters for connecting to the Elasticsearch cluster and extracting cybersecurity data. These parameters are optimized for interactive data collection from APT simulation environments:

#### Connection Settings
- **Host**: Elasticsearch cluster endpoint with HTTPS encryption
- **Authentication**: Username/password authentication for secure access
- **SSL Configuration**: Disabled certificate verification for lab environments
- **Keywords**: Filter criteria for discovering relevant security indices

#### Data Discovery
- **Target Keywords**: `['sysmon', 'network_traffic']` to identify cybersecurity-relevant indices
- **Output Directory**: Local directory for storing extracted JSONL files
- **Timestamp Format**: Human-readable format for interactive time range specification

#### Security Considerations
- **Lab Environment**: Configuration optimized for research/testing environments
- **Production Adaptation**: For production use, enable SSL verification and use secure credential management
- **Access Control**: Ensure appropriate Elasticsearch user permissions for read-only data access
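
For a production adaptation, one possible shape is sketched below. The helper `build_client_kwargs` and the environment variable names (`ES_HOST`, `ES_USERNAME`, `ES_PASSWORD`, `ES_CA_CERTS`) are assumptions made for illustration and do not appear in the notebook. The sketch only assembles keyword arguments; the resulting dictionary is meant to be passed to the client as `Elasticsearch(**kwargs)`, using the same `hosts`, `basic_auth`, and `verify_certs` arguments the notebook already uses, plus `ca_certs` for a CA bundle.

```python
import os


def build_client_kwargs(env=None):
    """Build keyword arguments for Elasticsearch(**kwargs) from environment variables.

    The variable names (ES_HOST, ES_USERNAME, ES_PASSWORD, ES_CA_CERTS) are
    illustrative assumptions, not names used by the notebook.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("ES_HOST", "ES_PASSWORD") if not env.get(name)]
    if missing:
        raise KeyError(f"Missing required settings: {', '.join(missing)}")

    kwargs = {
        "hosts": [env["ES_HOST"]],
        "basic_auth": (env.get("ES_USERNAME", "elastic"), env["ES_PASSWORD"]),
        "verify_certs": True,  # keep certificate verification on outside the lab
    }
    ca_path = env.get("ES_CA_CERTS")
    if ca_path:
        kwargs["ca_certs"] = ca_path  # CA bundle that signed the cluster certificate
    return kwargs


# Hypothetical values, passed explicitly so the example does not depend on the real environment
example = build_client_kwargs({"ES_HOST": "https://es.example.internal:9200", "ES_PASSWORD": "change-me"})
print(sorted(example))
```

Keeping credentials out of the notebook source also means the file can be shared or committed without leaking the cluster password.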

### 🎯 Interactive Workflow Design

This configuration supports an **interactive data extraction workflow**:
1. **Automatic Discovery**: Find relevant indices based on keywords
2. **Manual Selection**: User chooses specific indices for processing
3. **Time Range Input**: User specifies extraction window with human-readable format
4. **Batch Processing**: Sequential extraction from selected indices

---

## 🔌 Elasticsearch Connection Management

### 🛡️ Connection Architecture

The connection management functions implement a simple yet robust approach to Elasticsearch cluster interaction for cybersecurity data extraction:

#### Connection Features
- **Basic Authentication**: Username/password authentication for lab environments
- **SSL Handling**: Disabled certificate verification for development clusters
- **Health Validation**: Connection testing with ping functionality
- **Error Reporting**: Clear error messages for troubleshooting connection issues

#### Design Philosophy
This implementation prioritizes **simplicity and clarity** over complex enterprise features:
- **Direct Connection**: Single-node connection suitable for research environments
- **Minimal Configuration**: Straightforward setup without complex pooling or load balancing
- **Interactive Feedback**: Clear status messages for user understanding
- **Research Focus**: Optimized for data extraction rather than production monitoring

#### Connection Functions

1. **`connect_elasticsearch()`**: Creates the client for the HTTPS endpoint (certificate verification is disabled in this lab setup)
2. **`test_connection(es)`**: Validates connectivity with basic ping test

These functions provide the foundation for all subsequent data discovery and extraction operations, ensuring reliable access to the cybersecurity telemetry stored in the Elasticsearch cluster.

---

In [1]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
from datetime import datetime, timezone
import json
import os

## 🔍 Index Discovery & Interactive Selection

### 📊 Intelligent Index Discovery

The index discovery system identifies cybersecurity-relevant data stores within the Elasticsearch cluster using keyword-based filtering. This approach efficiently narrows down the search space from potentially hundreds of indices to those containing security telemetry.

#### Discovery Process
1. **Comprehensive Scan**: Query all indices in the cluster using `cat.indices` API
2. **Keyword Filtering**: Keep index names that contain at least one of the keywords (`'sysmon'` or `'network_traffic'`)
3. **Metadata Collection**: Extract storage size and creation timestamps for each relevant index
4. **User Presentation**: Display indices with human-readable information for selection
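
The filtering and metadata steps above can be tried offline against rows shaped like the JSON output of `cat.indices`. This is a minimal sketch: `sample_rows` and the helper name `filter_indices` are hypothetical, and the `creation.date` values are made up for illustration (the API returns epoch milliseconds as strings).

```python
from datetime import datetime, timezone

# Hypothetical rows shaped like `cat.indices` JSON output (all values are strings).
sample_rows = [
    {"index": ".ds-logs-windows.sysmon_operational-default-2025.03.10-000001",
     "store.size": "8.4gb", "creation.date": "1741564800000"},
    {"index": ".kibana_task_manager", "store.size": "1mb", "creation.date": "1741564800000"},
]


def filter_indices(rows, keywords):
    """Keep rows whose index name contains at least one keyword."""
    return [
        {
            "name": row["index"],
            "size": row.get("store.size") or "0b",
            # creation.date is epoch milliseconds encoded as a string
            "created": datetime.fromtimestamp(int(row["creation.date"]) / 1000, tz=timezone.utc),
        }
        for row in rows
        if any(kw in row["index"] for kw in keywords)
    ]


for entry in filter_indices(sample_rows, ["sysmon", "network_traffic"]):
    print(entry["name"], entry["size"], entry["created"].date())
```

Because the match is a plain substring test, a keyword such as `'sysmon'` also matches any unrelated index whose name happens to contain it, which is one reason the manual selection step follows.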

#### Interactive Selection Interface
The system implements a **user-friendly selection mechanism**:
- **Numbered List**: Clear enumeration of available indices with metadata
- **Flexible Input**: Support for individual numbers, comma-separated lists, or 'all' selection
- **Size Information**: Display storage size to help users understand data volume
- **Creation Dates**: Timestamp information for temporal context
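
The accepted input forms can be illustrated with a small stand-alone parser. This is a sketch, not the notebook's function: `parse_selection` is a hypothetical helper that, unlike the notebook's selector, rejects non-numeric or out-of-range entries with a `ValueError` instead of silently skipping them.

```python
def parse_selection(text, total):
    """Turn '1, 3,3' or 'all' into zero-based positions, preserving order."""
    text = text.strip().lower()
    if text == "all":
        return list(range(total))

    positions = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas
        number = int(token)  # raises ValueError on non-numeric input
        if not 1 <= number <= total:
            raise ValueError(f"{number} is outside 1..{total}")
        if number - 1 not in positions:
            positions.append(number - 1)  # drop duplicates, keep first occurrence
    return positions


print(parse_selection("1, 3,3", total=5))  # [0, 2]
print(parse_selection("all", total=3))     # [0, 1, 2]
```

Checking the lower bound matters because Python accepts negative list positions: an entry of `0` would otherwise map to position `-1` and pick the last index in the list.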

#### Index Functions

1. **`list_relevant_indices(es, keywords)`**: Discovers and filters indices based on security keywords
2. **`display_indices_selector(indices)`**: Interactive user interface for index selection

This approach balances **automation** (keyword-based discovery) with **user control** (manual selection), allowing researchers to focus on specific data sources relevant to their analysis objectives.

---

In [2]:
# Configuration
es_host = "https://10.2.0.20:9200"
username = "elastic"
password = os.getenv("ES_PASSWORD", "hiYqiU21LVg0F8krD=XN")  # lab default; set ES_PASSWORD to override
keywords = ['sysmon', 'network_traffic']
output_dir = "./"
TIMESTAMP_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"

## 📅 Temporal Processing & Data Extraction

### ⏰ Time Range Management

The temporal processing system handles time-based filtering for cybersecurity event extraction. This is critical for focusing on specific attack simulation windows or investigation timeframes.

#### Time Format Design
- **Human-Readable Input**: Format like `'Jan 29, 2025 @ 04:24:54.863'` for intuitive user interaction
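- **UTC Standardization**: Parsed timestamps are tagged as UTC so every query uses one reference timezone
- **Precision Support**: Fractional seconds are accepted up to microsecond precision (the `%f` directive), as in the millisecond example above
- **Timezone Handling**: Input is assumed to already be in UTC; a trailing ` (UTC)` suffix is stripped, and no conversion from other timezones is performed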
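
As a quick check of the format, the snippet below parses the example timestamp the same way `parse_utc_time` does. It assumes an English locale, since the `%b` month abbreviation is locale-dependent.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"

raw = "Jan 29, 2025 @ 04:24:54.863 (UTC)"
cleaned = raw.split(" (UTC)")[0].strip()

# strptime returns a naive datetime; replace() attaches UTC without shifting the clock time
parsed = datetime.strptime(cleaned, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)

print(parsed.isoformat())  # 2025-01-29T04:24:54.863000+00:00
```

Note that `replace(tzinfo=...)` only labels the value as UTC. A timestamp typed in local time would therefore be queried as if it were UTC, so times copied from a UI should be confirmed to be in UTC first.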

#### Data Extraction Architecture
The extraction system implements **streaming processing** for memory-efficient handling of large security datasets:

##### Extraction Features
- **Time-Range Queries**: Elasticsearch range queries on `@timestamp` field
- **Streaming Output**: Direct write to JSONL files without loading entire datasets in memory
- **Progress Tracking**: The exported document count is reported when each index finishes
- **Error Resilience**: Individual index failures don't stop the entire process
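
The query-plus-streaming pattern can be exercised without a cluster. In this sketch, `build_range_query` and `stream_to_jsonl` are hypothetical helper names, and `fake_hits` is a stand-in generator for the iterator that `elasticsearch.helpers.scan` would return; the events themselves are invented.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def build_range_query(start, end):
    """Return a search body restricting @timestamp to [start, end]."""
    return {"query": {"range": {"@timestamp": {"gte": start.isoformat(), "lte": end.isoformat()}}}}


def stream_to_jsonl(hits, path):
    """Write each hit's _source as one JSON line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for hit in hits:  # `hits` can be any iterator, so nothing is held in memory
            handle.write(json.dumps(hit["_source"]) + "\n")
            count += 1
    return count


# Stand-in for scan(es, index=..., query=...): a generator of hypothetical hits
fake_hits = ({"_source": {"@timestamp": f"2025-01-29T04:24:5{i}Z", "event": {"code": "1"}}} for i in range(3))

body = build_range_query(
    datetime(2025, 1, 29, 4, 0, tzinfo=timezone.utc),
    datetime(2025, 1, 29, 5, 0, tzinfo=timezone.utc),
)
out_path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
written = stream_to_jsonl(fake_hits, out_path)
print(written, body["query"]["range"]["@timestamp"]["gte"])
```

Because each document is written as soon as it is yielded, memory use stays roughly constant regardless of how many events fall inside the time range.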

##### JSONL Format Benefits
- **Line-by-Line Processing**: Each security event on a separate line for easy streaming
- **Fault Tolerance**: If extraction is interrupted, every fully written line remains parseable; only a truncated final line is lost
- **ML Pipeline Ready**: Direct compatibility with pandas, ML frameworks, and data processing tools
- **Human Readable**: JSON format allows manual inspection and debugging
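
A reader that takes advantage of the line-by-line layout might look like the following sketch. `iter_jsonl` is a hypothetical helper, and the demo file simulates an interrupted export with an invented, cut-off final line.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed event per line, skipping blank lines and lines that fail to parse."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Typically the final line of an interrupted export
                print(f"Skipping unparseable line {line_number}")


# Simulate an interrupted export: two complete events and one cut-off line
demo_path = os.path.join(tempfile.mkdtemp(), "partial.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write('{"event": {"code": "1"}}\n{"event": {"code": "3"}}\n{"event": {"co')

events = list(iter_jsonl(demo_path))
print(len(events))  # 2
```

Since the function is a generator, downstream code can also consume events one at a time instead of materializing the whole file with `list(...)`.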

#### Processing Functions

1. **`parse_utc_time(time_str)`**: Converts human-readable timestamps to UTC datetime objects
2. **`export_index_data(es, index_name, start_time, end_time)`**: Streams security events to JSONL files

This design prioritizes **usability** for researchers while maintaining **efficiency** for large-scale cybersecurity data processing.

---

In [3]:
def connect_elasticsearch():
    """Create secure connection to Elasticsearch (with SSL verification disabled)"""
    return Elasticsearch(
        hosts=[es_host],
        basic_auth=(username, password),
        verify_certs=False,
        ssl_show_warn=False
    )

In [4]:
def test_connection(es):
    """Validate the ES connection with a ping (returns True if the cluster responds)"""
    try:
        return es.ping()
    except Exception as e:
        print(f"🔥 Connection failed: {e}")
        return False


## 🚀 Orchestration & Execution Workflow

### 🎯 Interactive Data Extraction Pipeline

The main orchestration function coordinates the entire data extraction workflow, implementing a **user-guided approach** for cybersecurity data collection. This design empowers researchers to make informed decisions about which data to extract while maintaining automated efficiency for the actual processing.

#### Workflow Stages

1. **🔗 Connection Establishment**
   - Initialize Elasticsearch client with configured parameters
   - Validate connectivity to ensure cluster accessibility
   - Provide clear feedback on connection status

2. **🔍 Index Discovery**
   - Scan cluster for cybersecurity-relevant indices
   - Present findings to user with metadata (size, creation date)
   - Handle cases where no relevant indices are found

3. **📋 Interactive Selection**
   - Display numbered list of available indices
   - Accept user input for index selection (individual, multiple, or all)
   - Validate selections and handle user choices

4. **📅 Time Range Configuration**
   - Prompt user for start and end timestamps
   - Parse human-readable time format into UTC datetime objects
   - Validate time range parameters for logical consistency

5. **📥 Data Extraction**
   - Process each selected index sequentially
   - Stream data directly to JSONL files for memory efficiency
   - Report the exported document count as each index completes

#### Design Principles

- **User Control**: Researchers decide which data to extract and when
- **Transparency**: Clear feedback at each stage of the process
- **Flexibility**: Support for various selection patterns and time ranges
- **Reliability**: Graceful error handling and process continuation
- **Efficiency**: Streaming processing for large datasets

This interactive approach balances **automation** with **research flexibility**, allowing security practitioners to focus on their analysis objectives while the system handles the technical complexities of large-scale data extraction.

---

In [5]:
def list_relevant_indices(es, keywords):
    """Retrieve indices containing keywords with storage size"""
    try:
        response = es.cat.indices(format="json", h="index,store.size,creation.date")
        return [
            {
                "name": idx["index"],
                "size": idx.get("store.size", "0b"),
                "created": datetime.fromtimestamp(int(idx["creation.date"])/1000, tz=timezone.utc)
            }
            for idx in response
            if any(kw in idx["index"] for kw in keywords)
        ]
    except Exception as e:
        print(f"🚨 Error listing indices: {e}")
        return []

## 📝 Summary & Next Steps

### 🎯 Stage 1 Completion Summary

This notebook successfully implements the **first stage** of the cybersecurity dataset creation pipeline, providing an interactive and user-friendly approach to extracting security telemetry from Elasticsearch clusters for machine learning research.

#### ✅ Key Accomplishments

1. **🔌 Robust Elasticsearch Integration**
   - Simple yet reliable connection management for research environments
   - Clear error handling and user feedback for troubleshooting
   - Optimized for APT simulation data collection workflows

2. **🔍 Intelligent Data Discovery**
   - Keyword-based filtering to identify cybersecurity-relevant indices
   - Metadata presentation (size, creation date) for informed decision-making
   - Scalable approach for clusters with hundreds of indices

3. **🎮 Interactive User Experience**
   - Intuitive index selection with flexible input options
   - Human-readable timestamp format for easy time range specification
   - Per-index status messages during data extraction

4. **📥 Efficient Data Extraction**
   - Streaming JSONL output for memory-efficient processing
   - Individual index error isolation to prevent workflow interruption
   - Direct compatibility with downstream ML pipeline tools

#### 📊 Output Format

The extracted JSONL files contain structured cybersecurity events ready for the next pipeline stages:
- **Windows Security Events**: Sysmon process creation, network connections, file operations
- **Network Traffic Data**: DNS queries, HTTP requests, TLS handshakes, flow metadata
- **Temporal Precision**: Microsecond-level timestamps for accurate event correlation
- **Structured Fields**: JSON format preserving all original event metadata

---

### 🔄 Pipeline Continuity

The extracted JSONL files serve as input for the remaining pipeline stages:

#### 📋 Stage 2: Sysmon Dataset Creation (`2_elastic_sysmon-ds_csv_creator.ipynb`)
- **Purpose**: Transform Windows Sysmon events into structured CSV format for ML training
- **Input**: Sysmon JSONL files from this extraction
- **Output**: Labeled CSV datasets with process behavior features

#### 🌐 Stage 3: Network Flow Dataset Creation (`3_elastic_network-traffic-flow-ds_csv_creator.ipynb`)
- **Purpose**: Process network traffic into flow-based statistical features
- **Input**: Network traffic JSONL files from this extraction  
- **Output**: Network flow CSV datasets with connection patterns

#### 📋 Stage 4: Caldera Report Analysis (`4_caldera-report-analyzer.ipynb`)
- **Purpose**: Extract attack ground truth and TTPs mapping from simulation reports
- **Input**: Caldera JSON reports + extracted security events
- **Output**: Attack timeline correlation and ground truth labels

---

### 🚀 Usage Instructions

#### For Security Researchers:
1. **Configure** the Elasticsearch connection parameters for your environment
2. **Run** the notebook cells sequentially to establish connection and discover indices
3. **Select** relevant indices based on your research focus (Sysmon, network traffic, etc.)
4. **Specify** time ranges corresponding to your attack simulation or investigation window
5. **Monitor** extraction progress and verify output JSONL files

#### For ML Practitioners:
- The extracted JSONL files are immediately usable with pandas: `pd.read_json('file.jsonl', lines=True)`
- Each line represents a single security event with standardized timestamp and metadata fields
- Files can be processed in streaming fashion for large datasets that exceed memory capacity

#### For Students & Educators:
- This notebook demonstrates practical techniques for cybersecurity data collection in a lab environment
- Interactive design allows hands-on learning about Elasticsearch, security data structures, and ETL processes
- Code serves as reference implementation for distributed security data processing

---

### 📚 Educational Value

This implementation showcases several important concepts:

- **🔐 Cybersecurity Data Engineering**: Practical techniques for handling real-world security telemetry
- **🔍 Distributed Search Systems**: Elasticsearch integration patterns for large-scale data retrieval  
- **🎮 Interactive Data Science**: User-guided workflows that balance automation with human decision-making
- **📊 Streaming Data Processing**: Memory-efficient techniques for handling datasets larger than available RAM

The simple, well-commented code serves as an excellent foundation for understanding how to build robust data collection systems for cybersecurity machine learning research.

---

**🎉 Stage 1 Complete - Ready for Stage 2 Processing!**

*The extracted JSONL files are now available for transformation into structured ML datasets in the next pipeline stage.*

In [6]:
def display_indices_selector(indices):
    """Interactive index selection from a numbered list"""
    print(f"\n📂 Found {len(indices)} relevant indices:")
    for i, idx in enumerate(indices, 1):
        print(f"{i:>3}. {idx['name']} ({idx['size']}) [Created: {idx['created'].strftime('%Y-%m-%d')}]")

    while True:
        selection = input("\n🔢 Select indices (comma-separated numbers, 'all', or 'exit'): ").strip().lower()
        
        if selection == "exit":
            return []
        if selection == "all":
            return [idx["name"] for idx in indices]
        
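        try:
            positions = [int(num) for num in selection.split(",") if num.strip().isdigit()]
            # Reject 0 and out-of-range numbers; indices[-1] would otherwise pick the last entry
            if any(pos < 1 or pos > len(indices) for pos in positions):
                raise IndexError
            selected_indices = [indices[pos - 1]["name"] for pos in positions]
            if selected_indices:
                return list(dict.fromkeys(selected_indices))  # Remove duplicates, keep order
            print("⚠️ No valid selection. Please try again.")
        except (IndexError, ValueError):
            print("⛔ Invalid input format. Use numbers separated by commas.")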

In [7]:
def parse_utc_time(time_str):
    """Parse time string into UTC datetime object"""
    try:
        # Strip any trailing timezone identifiers (we assume UTC)
        time_str = time_str.split(" (UTC)")[0].strip()
        return datetime.strptime(time_str, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError as e:
        print(f"⏰ Time parsing error: {e}")
        print(f"📅 Expected format: {TIMESTAMP_FORMAT} (UTC)")
        return None

In [8]:
def export_index_data(es, index_name, start_time, end_time):
    """Export index data within time range to JSONL file"""
    os.makedirs(output_dir, exist_ok=True)
    safe_name = index_name.replace(":", "_").replace(".", "-")
    filename = os.path.join(output_dir, f"{safe_name}.jsonl")
    
    query = {
        "query": {
            "range": {
                "@timestamp": {
                    "gte": start_time.isoformat(),
                    "lte": end_time.isoformat(),
                    "format": "strict_date_optional_time"
                }
            }
        }
    }

    try:
        with open(filename, "w") as f:
            count = 0
            for hit in scan(es, index=index_name, query=query):
                f.write(json.dumps(hit["_source"]) + "\n")
                count += 1
        print(f"✅ Success: {count} documents from {index_name} -> {filename}")
        return True
    except Exception as e:
        print(f"❌ Failed to export {index_name}: {e}")
        return False

In [9]:
def main():
    print("\n🔗 Connecting to Elasticsearch...")
    es = connect_elasticsearch()
    
    if not test_connection(es):
        print("🚨 Could not establish connection to Elasticsearch")
        return

    print("\n🔍 Searching for relevant indices...")
    indices = list_relevant_indices(es, keywords)
    
    if not indices:
        print("🤷 No matching indices found")
        return

    selected_indices = display_indices_selector(indices)
    if not selected_indices:
        print("🚪 Exiting without download")
        return

    print("\n🕒 Time Range Selection (UTC)")
    print("💡 Example format: 'Jan 29, 2025 @ 04:24:54.863'")
    start_time = parse_utc_time(input("⏱️  Start time: "))
    end_time = parse_utc_time(input("⏰ End time: "))
    
    if start_time is None or end_time is None:
        print("⛔ Invalid time parameters")
        return
    if start_time > end_time:
        print("⛔ Start time must not be later than end time")
        return

    print("\n⏳ Starting data export...")
    for index in selected_indices:
        export_index_data(es, index, start_time, end_time)

    print(f'start time: {start_time}')
    print(f'end time: {end_time}')

In [10]:
if __name__ == "__main__":
    main()


🔗 Connecting to Elasticsearch...

🔍 Searching for relevant indices...

📂 Found 13 relevant indices:
  1. .ds-logs-network_traffic.dns-default-2025.03.10-000001 (114.2mb) [Created: 2025-03-10]
  2. .ds-logs-network_traffic.tls-default-2025.03.10-000001 (52.7mb) [Created: 2025-03-10]
  3. .ds-logs-network_traffic.icmp-default-2025.03.10-000001 (2mb) [Created: 2025-03-10]
  4. .ds-logs-windows.sysmon_operational-default-2025.04.15-000002 (17.2gb) [Created: 2025-04-15]
  5. .ds-logs-network_traffic.dhcpv4-default-2025.04.17-000001 (86.2kb) [Created: 2025-04-17]
  6. .ds-logs-network_traffic.tls-default-2025.04.15-000002 (45.3mb) [Created: 2025-04-15]
  7. .ds-logs-network_traffic.flow-default-2025.04.15-000002 (25.3gb) [Created: 2025-04-15]
  8. .ds-logs-network_traffic.dns-default-2025.04.15-000002 (255.4mb) [Created: 2025-04-15]
  9. .ds-logs-windows.sysmon_operational-default-2025.03.10-000001 (8.4gb) [Created: 2025-03-10]
 10. .ds-logs-network_traffic.icmp-default-2025.04.15-000002 (7