# üîç SYSMON EVENTS EXPLORATORY ANALYSIS

This notebook performs comprehensive exploratory analysis on Windows Sysmon events stored in JSONL format from Elasticsearch. The analysis focuses on understanding event distribution, XML structure patterns, field availability, and data quality characteristics.

**Target File**: `-ds-logs-windows-sysmon_operational-default-2025-05-04-000001.jsonl`  
**Analysis Type**: 2A-SYSMON  
**Purpose**: Understand Sysmon event structure for optimal CSV conversion strategy

**Key Analysis Areas**:
- Sysmon EventID distribution and frequency patterns
- XML structure analysis and field extraction patterns
- Computer/host distribution analysis
- Temporal patterns and event timeline analysis
- Field availability and completeness assessment
- Data quality and parsing success rates

## 1. Import Required Libraries

In [1]:
import json
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import random
import os
from datetime import datetime
from collections import defaultdict, Counter
import re

## 2. Analysis Configuration and Logging Setup

In [2]:
# Analysis Configuration
ANALYSIS_TYPE = "2a-sysmon"
SAMPLE_SIZE = 200_000  # Number of samples to analyze
TARGET_FILE = "-ds-logs-windows-sysmon_operational-default-2025-05-04-000001.jsonl"

# Create organized output directory structure
outputs_base_dir = "outputs"
analysis_outputs_dir = f"{outputs_base_dir}/{ANALYSIS_TYPE}"
os.makedirs(analysis_outputs_dir, exist_ok=True)

# Setup logging
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_filename = f"{analysis_outputs_dir}/{ANALYSIS_TYPE}_exploratory_analysis_{timestamp}.log"

def log_print(message):
    """Print and log messages"""
    print(message)
    with open(log_filename, 'a', encoding='utf-8') as f:
        f.write(message + '\n')

# Initialize log file
log_print("SYSMON EVENTS EXPLORATORY ANALYSIS")
log_print(f"Analysis Type: {ANALYSIS_TYPE.upper()}")
log_print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
log_print(f"Target File: {TARGET_FILE}")
log_print("=" * 80)
log_print("")

SYSMON EVENTS EXPLORATORY ANALYSIS
Analysis Type: 2A-SYSMON
Generated: 2025-06-29 11:23:56
Target File: -ds-logs-windows-sysmon_operational-default-2025-05-04-000001.jsonl



## 3. XML Parsing Utilities

In [3]:
def sanitize_xml(xml_str):
    """Clean invalid characters and repair XML structure"""
    try:
        # Remove non-printable characters
        cleaned = ''.join(c for c in xml_str if 31 < ord(c) < 127 or c in '\t\n\r')
        # Fix common XML issues using BeautifulSoup's parser
        return BeautifulSoup(cleaned, "xml").prettify()
    except:
        return xml_str  # Return original if cleaning fails

def parse_sysmon_event_basic(xml_str):
    """Parse XML to extract basic event information"""
    try:
        # Clean XML first
        clean_xml = sanitize_xml(xml_str)
        
        # Parse with explicit namespace
        namespaces = {'ns': 'http://schemas.microsoft.com/win/2004/08/events/event'}
        root = ET.fromstring(clean_xml)
        
        # System section
        system = root.find('ns:System', namespaces)
        if not system:
            return None, None, None, {}

        event_id_elem = system.find('ns:EventID', namespaces)
        computer_elem = system.find('ns:Computer', namespaces)
        
        event_id = int(event_id_elem.text) if event_id_elem is not None and event_id_elem.text else None
        computer = computer_elem.text if computer_elem is not None and computer_elem.text else None

        # EventData section - extract all fields
        event_data = root.find('ns:EventData', namespaces)
        fields = {}
        if event_data:
            for data in event_data.findall('ns:Data', namespaces):
                name = data.get('Name')
                if name:
                    fields[name] = data.text if data.text else None

        return event_id, computer, len(fields), fields

    except Exception as e:
        return None, None, None, {}

print("‚úÖ XML parsing utilities loaded")

‚úÖ XML parsing utilities loaded


## 4. Data Loading and Sampling

In [4]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 3: INITIAL DATA LOADING AND RECORD COUNT")
log_print("=" * 80)
log_print("")

# Count total records
log_print(f"üîÑ Counting total records in: {TARGET_FILE}")
total_records = 0
with open(TARGET_FILE, 'r') as f:
    for line in f:
        total_records += 1

log_print(f"üìà Total records in dataset: {total_records:,}")
log_print(f"üéØ Sampling strategy: Will analyze {SAMPLE_SIZE:,} random samples")

# Random sampling strategy
random.seed()  # Use truly random seed for different results each run
sample_indices = set(random.sample(range(total_records), min(SAMPLE_SIZE, total_records)))

log_print(f"üìä Selected {len(sample_indices):,} random indices for analysis")

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")


SECTION 3: INITIAL DATA LOADING AND RECORD COUNT

üîÑ Counting total records in: -ds-logs-windows-sysmon_operational-default-2025-05-04-000001.jsonl
üìà Total records in dataset: 570,078
üéØ Sampling strategy: Will analyze 200,000 random samples
üìä Selected 200,000 random indices for analysis

------------------------------------------------------------ END SECTION ------------------------------------------------------------



## 5. Basic Event Structure Analysis

In [5]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 4: BASIC EVENT STRUCTURE ANALYSIS")
log_print("=" * 80)
log_print("")

# Load and analyze sample data
log_print("üìã BASIC EVENT STRUCTURE ANALYSIS")
log_print("=" * 50)

# Analyze first record structure
sample_data = []
current_index = 0

with open(TARGET_FILE, 'r') as f:
    for line_number, line in enumerate(f):
        if line_number in sample_indices:
            try:
                event = json.loads(line)
                sample_data.append(event)
                if len(sample_data) >= 10:  # Get first 10 for structure analysis
                    break
            except json.JSONDecodeError:
                continue

if sample_data:
    first_sample = sample_data[0]
    log_print(f"üîç Data type: {type(first_sample)}")
    log_print(f"üìè Number of top-level fields: {len(first_sample)}")
    log_print(f"üóùÔ∏è  Top-level fields:")
    
    for idx, key in enumerate(first_sample.keys(), 1):
        value_type = type(first_sample[key]).__name__
        log_print(f"   {idx:2d}. {key:30s} ({value_type})")
    
    # Analyze the XML structure
    if 'event' in first_sample and 'original' in first_sample['event']:
        xml_content = first_sample['event']['original']
        log_print(f"\nüìÑ XML content sample (first 500 chars):")
        log_print("-" * 50)
        log_print(xml_content[:500] + "..." if len(xml_content) > 500 else xml_content)
        
        # Parse the XML to understand structure
        event_id, computer, field_count, fields = parse_sysmon_event_basic(xml_content)
        log_print(f"\nüéØ XML parsing results:")
        log_print(f"   ‚Ä¢ EventID: {event_id}")
        log_print(f"   ‚Ä¢ Computer: {computer}")
        log_print(f"   ‚Ä¢ Field count: {field_count}")
        if fields:
            log_print(f"   ‚Ä¢ Available fields: {list(fields.keys())[:10]}{'...' if len(fields) > 10 else ''}")
    
    log_print(f"\nüìä Complete first sample structure:")
    log_print("-" * 50)
    log_print(str(first_sample)[:1000] + "..." if len(str(first_sample)) > 1000 else str(first_sample))

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")


SECTION 4: BASIC EVENT STRUCTURE ANALYSIS

üìã BASIC EVENT STRUCTURE ANALYSIS
üîç Data type: <class 'dict'>
üìè Number of top-level fields: 18
üóùÔ∏è  Top-level fields:
    1. agent                          (dict)
    2. process                        (dict)
    3. winlog                         (dict)
    4. log                            (dict)
    5. elastic_agent                  (dict)
    6. destination                    (dict)
    7. source                         (dict)
    8. message                        (str)
    9. tags                           (list)
   10. network                        (dict)
   11. input                          (dict)
   12. @timestamp                     (str)
   13. ecs                            (dict)
   14. related                        (dict)
   15. data_stream                    (dict)
   16. host                           (dict)
   17. event                          (dict)
   18. user                           (dict)

üìÑ XML content 

## 6. EventID Distribution Analysis

In [6]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 5: EVENTID DISTRIBUTION ANALYSIS")
log_print("=" * 80)
log_print("")

log_print("üìä EVENTID DISTRIBUTION ANALYSIS")
log_print("=" * 50)

# Analyze EventID distribution
eventid_counts = Counter()
computer_counts = Counter()
parsing_success = 0
parsing_errors = 0
samples_processed = 0

log_print(f"üîÑ Processing {len(sample_indices):,} samples for EventID analysis...")

with open(TARGET_FILE, 'r') as f:
    for line_number, line in enumerate(f):
        if line_number in sample_indices:
            samples_processed += 1
            try:
                event = json.loads(line)
                if 'event' in event and 'original' in event['event']:
                    xml_content = event['event']['original']
                    event_id, computer, field_count, fields = parse_sysmon_event_basic(xml_content)
                    
                    if event_id is not None:
                        eventid_counts[event_id] += 1
                        parsing_success += 1
                        
                        if computer:
                            computer_counts[computer] += 1
                    else:
                        parsing_errors += 1
                else:
                    parsing_errors += 1
                    
            except (json.JSONDecodeError, Exception):
                parsing_errors += 1
            
            # Progress indicator
            if samples_processed % 10000 == 0:
                log_print(f"   Processed {samples_processed:,} samples...")

log_print(f"\n‚úÖ Analysis complete!")
log_print(f"üìà Samples processed: {samples_processed:,}")
log_print(f"üìä Parsing success: {parsing_success:,} ({(parsing_success/samples_processed)*100:.1f}%)")
log_print(f"‚ö†Ô∏è  Parsing errors: {parsing_errors:,} ({(parsing_errors/samples_processed)*100:.1f}%)")

# EventID distribution
log_print(f"\nüéØ EVENTID FREQUENCY DISTRIBUTION:")
log_print("Event ID | Count    | Percentage | Description")
log_print("-" * 60)

# Sysmon EventID descriptions for context
eventid_descriptions = {
    1: "Process Creation",
    2: "File Creation Time Changed", 
    3: "Network Connection",
    4: "Sysmon Service State Changed",
    5: "Process Terminated",
    6: "Driver Loaded",
    7: "Image/Library Loaded",
    8: "Create Remote Thread",
    9: "Raw Access Read",
    10: "Process Access",
    11: "File Create",
    12: "Registry Event (Object create/delete)",
    13: "Registry Event (Value Set)",
    14: "Registry Event (Key/Value Rename)",
    15: "File Create Stream Hash",
    16: "Sysmon Config State Changed",
    17: "Pipe Event (Pipe Created)",
    18: "Pipe Event (Pipe Connected)",
    19: "WMI Event (WmiEventFilter activity)",
    20: "WMI Event (WmiEventConsumer activity)",
    21: "WMI Event (WmiEventConsumerToFilter activity)",
    22: "DNS Event (DNS query)",
    23: "File Delete (File Delete archived)",
    24: "Clipboard Change (New content in clipboard)",
    25: "Process Tampering (Process image change)",
    26: "File Delete (File Delete logged)"
}

for event_id, count in eventid_counts.most_common():
    percentage = (count / parsing_success) * 100
    description = eventid_descriptions.get(event_id, "Unknown EventID")
    log_print(f"{event_id:8d} | {count:8,} | {percentage:9.2f}% | {description}")

# Computer distribution
log_print(f"\nüñ•Ô∏è COMPUTER/HOST DISTRIBUTION:")
log_print(f"üìä Unique computers: {len(computer_counts)}")
log_print("Computer Name        | Count    | Percentage")
log_print("-" * 50)

for computer, count in computer_counts.most_common(10):  # Top 10 computers
    percentage = (count / parsing_success) * 100
    log_print(f"{computer:20s} | {count:8,} | {percentage:9.2f}%")

if len(computer_counts) > 10:
    log_print(f"... and {len(computer_counts) - 10} more computers")

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")


SECTION 5: EVENTID DISTRIBUTION ANALYSIS

üìä EVENTID DISTRIBUTION ANALYSIS
üîÑ Processing 200,000 samples for EventID analysis...
   Processed 10,000 samples...
   Processed 20,000 samples...
   Processed 30,000 samples...
   Processed 40,000 samples...
   Processed 50,000 samples...
   Processed 60,000 samples...
   Processed 70,000 samples...
   Processed 80,000 samples...
   Processed 90,000 samples...
   Processed 100,000 samples...
   Processed 110,000 samples...
   Processed 120,000 samples...
   Processed 130,000 samples...
   Processed 140,000 samples...
   Processed 150,000 samples...
   Processed 160,000 samples...
   Processed 170,000 samples...
   Processed 180,000 samples...
   Processed 190,000 samples...
   Processed 200,000 samples...

‚úÖ Analysis complete!
üìà Samples processed: 200,000
üìä Parsing success: 200,000 (100.0%)
‚ö†Ô∏è  Parsing errors: 0 (0.0%)

üéØ EVENTID FREQUENCY DISTRIBUTION:
Event ID | Count    | Percentage | Description
----------------------

## 7. Field Availability Analysis by EventID

In [7]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 6: FIELD AVAILABILITY ANALYSIS BY EVENTID")
log_print("=" * 80)
log_print("")

log_print("üìä FIELD AVAILABILITY ANALYSIS BY EVENTID")
log_print("=" * 50)

# Sysmon field schemas from notebook #2
fields_per_eventid = {
    1: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'CommandLine', 'CurrentDirectory', 'User', 'Hashes', 'ParentProcessGuid', 'ParentProcessId', 'ParentImage', 'ParentCommandLine'],
    2: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'TargetFilename', 'CreationUtcTime', 'PreviousCreationUtcTime', 'User'],
    3: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'User', 'Protocol', 'SourceIsIpv6', 'SourceIp', 'SourceHostname', 'SourcePort', 'SourcePortName', 'DestinationIsIpv6', 'DestinationIp', 'DestinationHostname', 'DestinationPort', 'DestinationPortName'],
    5: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'User'],
    6: ['UtcTime', 'ImageLoaded', 'Hashes', 'User'],
    7: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'ImageLoaded', 'OriginalFileName', 'Hashes', 'User'],
    8: ['UtcTime', 'SourceProcessGuid', 'SourceProcessId', 'SourceImage', 'TargetProcessGuid', 'TargetProcessId', 'TargetImage', 'NewThreadId', 'SourceUser', 'TargetUser'],
    9: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'Device', 'User'],
    10: ['UtcTime', 'SourceProcessGUID', 'SourceProcessId', 'SourceImage', 'TargetProcessGUID', 'TargetProcessId', 'TargetImage', 'SourceThreadId', 'SourceUser', 'TargetUser'],
    11: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'TargetFilename', 'CreationUtcTime', 'User'],
    12: ['EventType', 'UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'TargetObject', 'User'],
    13: ['EventType', 'UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'TargetObject', 'User'],
    15: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'TargetFilename', 'CreationUtcTime', 'Hash', 'User'],
    17: ['EventType', 'UtcTime', 'ProcessGuid', 'ProcessId', 'PipeName', 'Image', 'User'],
    18: ['EventType', 'UtcTime', 'ProcessGuid', 'ProcessId', 'PipeName', 'Image', 'User'],
    22: ['UtcTime', 'ProcessGuid', 'ProcessId', 'Image', 'QueryName', 'QueryStatus', 'QueryResults', 'User'],
    23: ['UtcTime', 'ProcessGuid', 'ProcessId', 'User', 'Image', 'TargetFilename', 'Hashes'],
    24: ['UtcTime', 'ProcessGuid', 'ProcessId', 'User', 'Image', 'Hashes'],
    25: ['UtcTime', 'ProcessGuid', 'ProcessId', 'User', 'Image']
}

# Analyze field availability for each EventID
field_analysis = {}
samples_per_eventid = {}

log_print(f"üîÑ Analyzing field availability across {len(sample_indices):,} samples...")

with open(TARGET_FILE, 'r') as f:
    for line_number, line in enumerate(f):
        if line_number in sample_indices:
            try:
                event = json.loads(line)
                if 'event' in event and 'original' in event['event']:
                    xml_content = event['event']['original']
                    event_id, computer, field_count, fields = parse_sysmon_event_basic(xml_content)
                    
                    if event_id is not None and event_id in fields_per_eventid:
                        if event_id not in field_analysis:
                            field_analysis[event_id] = {}
                            samples_per_eventid[event_id] = 0
                        
                        samples_per_eventid[event_id] += 1
                        
                        # Check each expected field
                        for expected_field in fields_per_eventid[event_id]:
                            if expected_field not in field_analysis[event_id]:
                                field_analysis[event_id][expected_field] = 0
                            
                            if expected_field in fields and fields[expected_field] is not None:
                                field_analysis[event_id][expected_field] += 1
                    
            except (json.JSONDecodeError, Exception):
                continue

# Report field availability
log_print(f"\nüìã FIELD AVAILABILITY REPORT:")
log_print("=" * 60)

for event_id in sorted(field_analysis.keys()):
    total_samples = samples_per_eventid[event_id]
    if total_samples > 0:
        log_print(f"\nüéØ EventID {event_id} ({eventid_descriptions.get(event_id, 'Unknown')})")
        log_print(f"   üìä Analyzed samples: {total_samples:,}")
        log_print(f"   üìã Expected fields: {len(fields_per_eventid[event_id])}")
        log_print("   Field Availability:")
        
        for field in fields_per_eventid[event_id]:
            available_count = field_analysis[event_id].get(field, 0)
            percentage = (available_count / total_samples) * 100
            status = "‚úÖ" if percentage > 95 else "‚ö†Ô∏è" if percentage > 50 else "‚ùå"
            log_print(f"   {status} {field:25s}: {available_count:6,}/{total_samples:,} ({percentage:5.1f}%)")

# Summary statistics
log_print(f"\nüìä FIELD AVAILABILITY SUMMARY:")
log_print("=" * 40)

total_eventids_analyzed = len(field_analysis)
total_fields_analyzed = sum(len(fields_per_eventid[eid]) for eid in field_analysis.keys())

log_print(f"‚Ä¢ Total EventIDs analyzed: {total_eventids_analyzed}")
log_print(f"‚Ä¢ Total fields analyzed: {total_fields_analyzed}")
log_print(f"‚Ä¢ Field availability varies significantly by EventID")
log_print(f"‚Ä¢ Some fields may be conditionally present based on event context")

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")


SECTION 6: FIELD AVAILABILITY ANALYSIS BY EVENTID

üìä FIELD AVAILABILITY ANALYSIS BY EVENTID
üîÑ Analyzing field availability across 200,000 samples...

üìã FIELD AVAILABILITY REPORT:

üéØ EventID 1 (Process Creation)
   üìä Analyzed samples: 527
   üìã Expected fields: 12
   Field Availability:
   ‚úÖ UtcTime                  :    527/527 (100.0%)
   ‚úÖ ProcessGuid              :    527/527 (100.0%)
   ‚úÖ ProcessId                :    527/527 (100.0%)
   ‚úÖ Image                    :    527/527 (100.0%)
   ‚úÖ CommandLine              :    527/527 (100.0%)
   ‚úÖ CurrentDirectory         :    527/527 (100.0%)
   ‚úÖ User                     :    527/527 (100.0%)
   ‚úÖ Hashes                   :    527/527 (100.0%)
   ‚úÖ ParentProcessGuid        :    527/527 (100.0%)
   ‚úÖ ParentProcessId          :    527/527 (100.0%)
   ‚úÖ ParentImage              :    527/527 (100.0%)
   ‚úÖ ParentCommandLine        :    527/527 (100.0%)

üéØ EventID 2 (File Creation Time Changed)
  

## 8. Temporal Pattern Analysis

In [8]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 7: TEMPORAL PATTERN ANALYSIS")
log_print("=" * 80)
log_print("")

log_print("‚è∞ TEMPORAL PATTERN ANALYSIS")
log_print("=" * 50)

# Analyze timestamps
timestamps = []
utc_times = []
samples_analyzed = 0

log_print(f"üîÑ Extracting temporal data from {len(sample_indices):,} samples...")

with open(TARGET_FILE, 'r') as f:
    for line_number, line in enumerate(f):
        if line_number in sample_indices:
            try:
                event = json.loads(line)
                samples_analyzed += 1
                
                # Extract @timestamp
                if '@timestamp' in event:
                    timestamps.append(event['@timestamp'])
                
                # Extract UtcTime from XML if available
                if 'event' in event and 'original' in event['event']:
                    xml_content = event['event']['original']
                    event_id, computer, field_count, fields = parse_sysmon_event_basic(xml_content)
                    
                    if 'UtcTime' in fields and fields['UtcTime']:
                        utc_times.append(fields['UtcTime'])
                    
            except (json.JSONDecodeError, Exception):
                continue
            
            if samples_analyzed % 50000 == 0:
                log_print(f"   Processed {samples_analyzed:,} samples...")

log_print(f"\nüìÖ TIMESTAMP ANALYSIS:")
log_print(f"‚Ä¢ @timestamp fields found: {len(timestamps):,}")
log_print(f"‚Ä¢ UtcTime fields found: {len(utc_times):,}")

if timestamps:
    # Sort timestamps to find range
    sorted_timestamps = sorted(timestamps)
    log_print(f"\nüìä @TIMESTAMP RANGE:")
    log_print(f"‚Ä¢ Earliest: {sorted_timestamps[0]}")
    log_print(f"‚Ä¢ Latest: {sorted_timestamps[-1]}")
    
    # Sample some timestamps
    log_print(f"\nüìÑ SAMPLE @TIMESTAMPS:")
    sample_count = min(10, len(sorted_timestamps))
    for i in range(sample_count):
        idx = i * len(sorted_timestamps) // sample_count
        log_print(f"   {i+1:2d}. {sorted_timestamps[idx]}")

if utc_times:
    # Sort UTC times
    sorted_utc = sorted(utc_times)
    log_print(f"\nüìä UTCTIME RANGE:")
    log_print(f"‚Ä¢ Earliest: {sorted_utc[0]}")
    log_print(f"‚Ä¢ Latest: {sorted_utc[-1]}")
    
    # Sample some UTC times
    log_print(f"\nüìÑ SAMPLE UTCTIMES:")
    sample_count = min(10, len(sorted_utc))
    for i in range(sample_count):
        idx = i * len(sorted_utc) // sample_count
        log_print(f"   {i+1:2d}. {sorted_utc[idx]}")

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")


SECTION 7: TEMPORAL PATTERN ANALYSIS

‚è∞ TEMPORAL PATTERN ANALYSIS
üîÑ Extracting temporal data from 200,000 samples...
   Processed 50,000 samples...
   Processed 100,000 samples...
   Processed 150,000 samples...
   Processed 200,000 samples...

üìÖ TIMESTAMP ANALYSIS:
‚Ä¢ @timestamp fields found: 200,000
‚Ä¢ UtcTime fields found: 200,000

üìä @TIMESTAMP RANGE:
‚Ä¢ Earliest: 2025-05-04T11:30:00.040Z
‚Ä¢ Latest: 2025-05-04T12:40:00.980Z

üìÑ SAMPLE @TIMESTAMPS:
    1. 2025-05-04T11:30:00.040Z
    2. 2025-05-04T11:35:44.202Z
    3. 2025-05-04T11:53:13.689Z
    4. 2025-05-04T12:31:13.388Z
    5. 2025-05-04T12:34:13.744Z
    6. 2025-05-04T12:34:23.261Z
    7. 2025-05-04T12:34:23.375Z
    8. 2025-05-04T12:34:23.488Z
    9. 2025-05-04T12:34:23.584Z
   10. 2025-05-04T12:34:56.862Z

üìä UTCTIME RANGE:
‚Ä¢ Earliest: 
   2025-05-04 11:30:00.040
  
‚Ä¢ Latest: 
   2025-05-04 12:40:00.980
  

üìÑ SAMPLE UTCTIMES:
    1. 
   2025-05-04 11:30:00.040
  
    2. 
   2025-05-04 11:35:44.202
  

## 9. Sample Event Display

In [9]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 8: SAMPLE EVENT DISPLAY")
log_print("=" * 80)
log_print("")

log_print("üìÑ SAMPLE EVENT DISPLAY")
log_print("=" * 50)

# Display sample events for different EventIDs
sample_events = {}
target_eventids = [1, 3, 7, 12, 13]  # Key Sysmon events to showcase

log_print(f"üîç Collecting sample events for EventIDs: {target_eventids}")

with open(TARGET_FILE, 'r') as f:
    for line_number, line in enumerate(f):
        if line_number in sample_indices and len(sample_events) < len(target_eventids):
            try:
                event = json.loads(line)
                if 'event' in event and 'original' in event['event']:
                    xml_content = event['event']['original']
                    event_id, computer, field_count, fields = parse_sysmon_event_basic(xml_content)
                    
                    if event_id in target_eventids and event_id not in sample_events:
                        sample_events[event_id] = {
                            'full_event': event,
                            'parsed_fields': fields,
                            'computer': computer,
                            'field_count': field_count
                        }
                        
            except (json.JSONDecodeError, Exception):
                continue

# Display samples
for event_id in sorted(sample_events.keys()):
    sample = sample_events[event_id]
    description = eventid_descriptions.get(event_id, "Unknown")
    
    log_print(f"\nüéØ SAMPLE EVENT - EventID {event_id} ({description})")
    log_print("-" * 60)
    log_print(f"Computer: {sample['computer']}")
    log_print(f"Field Count: {sample['field_count']}")
    log_print(f"@timestamp: {sample['full_event'].get('@timestamp', 'N/A')}")
    
    log_print(f"\nParsed Fields:")
    for field_name, field_value in sample['parsed_fields'].items():
        # Truncate long values for readability
        display_value = str(field_value)[:100] + "..." if field_value and len(str(field_value)) > 100 else field_value
        log_print(f"   ‚Ä¢ {field_name:20s}: {display_value}")
    
    log_print(f"\nJSON Structure (top-level keys):")
    for key in sample['full_event'].keys():
        value_type = type(sample['full_event'][key]).__name__
        log_print(f"   ‚Ä¢ {key:20s}: {value_type}")

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")


SECTION 8: SAMPLE EVENT DISPLAY

üìÑ SAMPLE EVENT DISPLAY
üîç Collecting sample events for EventIDs: [1, 3, 7, 12, 13]

üéØ SAMPLE EVENT - EventID 1 (Process Creation)
------------------------------------------------------------
Computer: 
   diskjockey.boombox.local
  
Field Count: 23
@timestamp: 2025-05-04T11:30:15.356Z

Parsed Fields:
   ‚Ä¢ RuleName            : 
   -
  
   ‚Ä¢ UtcTime             : 
   2025-05-04 11:30:15.356
  
   ‚Ä¢ ProcessGuid         : 
   {acb80d05-4fc7-6817-5600-000000001600}
  
   ‚Ä¢ ProcessId           : 
   4988
  
   ‚Ä¢ Image               : 
   C:\Windows\System32\cmd.exe
  
   ‚Ä¢ FileVersion         : 
   10.0.17763.1697 (WinBuild.160101.0800)
  
   ‚Ä¢ Description         : 
   Windows Command Processor
  
   ‚Ä¢ Product             : 
   Microsoft Windows Operating System
  
   ‚Ä¢ Company             : 
   Microsoft Corporation
  
   ‚Ä¢ OriginalFileName    : 
   Cmd.Exe
  
   ‚Ä¢ CommandLine         : 
   C:\Windows\system32\cmd.exe /c C:\W

## 10. Analysis Summary and Recommendations

In [10]:
# Start section logging
log_print("\n" + "=" * 80)
log_print("SECTION 9: ANALYSIS SUMMARY AND RECOMMENDATIONS")
log_print("=" * 80)
log_print("")

log_print("üìä ANALYSIS SUMMARY AND RECOMMENDATIONS")
log_print("=" * 60)

# Calculate summary statistics
total_eventids_found = len(eventid_counts)
most_common_eventid = eventid_counts.most_common(1)[0] if eventid_counts else (None, 0)
parsing_success_rate = (parsing_success / samples_processed) * 100 if samples_processed > 0 else 0

log_print(f"üîç DATASET CHARACTERISTICS:")
log_print(f"   ‚Ä¢ Total records in file: ~{total_records:,}")
log_print(f"   ‚Ä¢ Samples analyzed: {samples_processed:,}")
log_print(f"   ‚Ä¢ XML parsing success rate: {parsing_success_rate:.1f}%")
log_print(f"   ‚Ä¢ Unique EventIDs found: {total_eventids_found}")
log_print(f"   ‚Ä¢ Most common EventID: {most_common_eventid[0]} ({most_common_eventid[1]:,} occurrences)")
log_print(f"   ‚Ä¢ Unique computers: {len(computer_counts)}")

log_print(f"\nüí° PROCESSING RECOMMENDATIONS:")
log_print(f"   üß† Memory Management:")
log_print(f"      - Use batch processing for large dataset ({total_records:,} records)")
log_print(f"      - Consider chunked reading for memory efficiency")
log_print(f"      - Implement progress tracking for long operations")

log_print(f"   üõ†Ô∏è  XML Processing:")
log_print(f"      - XML sanitization is crucial for {parsing_errors:,} problematic records")
log_print(f"      - BeautifulSoup XML parser handles malformed XML well")
log_print(f"      - Namespace handling required for proper field extraction")

log_print(f"   üìä EventID-Specific Handling:")
log_print(f"      - Different EventIDs have different field schemas")
log_print(f"      - Field availability varies significantly by EventID")
log_print(f"      - Consider separate processing pipelines per EventID")

log_print(f"   üìÑ CSV Conversion Strategy:")
log_print(f"      - Use EventID-specific field mappings from schema analysis")
log_print(f"      - Handle missing fields with appropriate default values")
log_print(f"      - Implement robust error logging for malformed XML")
log_print(f"      - Consider unified vs EventID-specific CSV outputs")

log_print(f"\n‚úÖ NEXT STEPS:")
log_print(f"   1. Review findings above")
log_print(f"   2. Update notebook #2 with optimized processing logic")
log_print(f"   3. Implement EventID-specific field validation")
log_print(f"   4. Add comprehensive error handling and logging")
log_print(f"   5. Test with full dataset after validation")

log_print(f"\nüéØ Ready to proceed to notebook #2 optimization!")

# End section logging
log_print("\n" + "-" * 60 + " END SECTION " + "-" * 60)
log_print("")

print(f"\nüìã Analysis complete! Results saved to: {log_filename}")
print(f"üìÅ Output directory: {analysis_outputs_dir}")
print(f"üéâ Sysmon exploratory analysis complete!")


SECTION 9: ANALYSIS SUMMARY AND RECOMMENDATIONS

üìä ANALYSIS SUMMARY AND RECOMMENDATIONS
üîç DATASET CHARACTERISTICS:
   ‚Ä¢ Total records in file: ~570,078
   ‚Ä¢ Samples analyzed: 200,000
   ‚Ä¢ XML parsing success rate: 100.0%
   ‚Ä¢ Unique EventIDs found: 17
   ‚Ä¢ Most common EventID: 12 (101,377 occurrences)
   ‚Ä¢ Unique computers: 4

üí° PROCESSING RECOMMENDATIONS:
   üß† Memory Management:
      - Use batch processing for large dataset (570,078 records)
      - Consider chunked reading for memory efficiency
      - Implement progress tracking for long operations
   üõ†Ô∏è  XML Processing:
      - XML sanitization is crucial for 0 problematic records
      - BeautifulSoup XML parser handles malformed XML well
      - Namespace handling required for proper field extraction
   üìä EventID-Specific Handling:
      - Different EventIDs have different field schemas
      - Field availability varies significantly by EventID
      - Consider separate processing pipelines per E