# Network Traffic Flow Data Exploratory Analysis

This notebook explores the network traffic JSONL file to understand its structure, data distribution, and characteristics before the main processing logic is implemented in notebook #3.

**Objective**: Examine the network traffic flow data structure through random sampling and statistical analysis to inform optimal processing strategies.

**Target File**: `-ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl`

## 0. Initial Setup: Logging System and Output Organization

**What this section does**: Sets up a comprehensive logging system to capture all analysis outputs in organized files for easy inspection and review.

**Simple explanation**: 
- Creates output directories specifically for 3a network-flows analysis
- Generates timestamped log files so you can track when analysis was run
- Sets up automatic output capture so everything gets saved to files
- Organizes outputs in `outputs/3a-network-flows/` directory structure

**Paraphrasing tips**:
- "What type of analysis am I running?" → Sets `ANALYSIS_TYPE = "3a-network-flows"`
- "Where will my results be saved?" → Creates organized folder structure in `outputs/3a-network-flows/`
- "How can I review what happened?" → All print statements get saved to timestamped log files

**Key insight**: Saving every section's output to timestamped files makes it easy to review the results later and to compare them with the 3b structure consistency analysis.

In [1]:
import os
import sys
import json
from datetime import datetime
from contextlib import contextmanager

# Define analysis type and create organized output structure
ANALYSIS_TYPE = "3a-network-flows"  # This will be "sysmon" for sysmon analysis
outputs_base_dir = "outputs"
analysis_outputs_dir = f"{outputs_base_dir}/{ANALYSIS_TYPE}"

# Create output directories
if not os.path.exists(outputs_base_dir):
    os.makedirs(outputs_base_dir)
    print(f"✅ Created base outputs directory: {outputs_base_dir}")

if not os.path.exists(analysis_outputs_dir):
    os.makedirs(analysis_outputs_dir)
    print(f"✅ Created analysis outputs directory: {analysis_outputs_dir}")

# Generate timestamped filenames with descriptive names
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_filename = f"{analysis_outputs_dir}/{ANALYSIS_TYPE}_exploratory_analysis_{timestamp}.log"
data_filename = f"{analysis_outputs_dir}/{ANALYSIS_TYPE}_sample_data_{timestamp}.json"

print(f"📝 Analysis type: {ANALYSIS_TYPE}")
print(f"📁 Output directory: {analysis_outputs_dir}")
print(f"📊 Log file: {log_filename}")
print(f"💾 Data file: {data_filename}")

# Initialize log file with header
with open(log_filename, 'w', encoding='utf-8') as log_file:
    log_file.write(f"NETWORK TRAFFIC FLOW EXPLORATORY ANALYSIS\n")
    log_file.write(f"Analysis Type: {ANALYSIS_TYPE.upper()}\n")
    log_file.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    log_file.write(f"Target File: -ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl\n")
    log_file.write(f"{'='*80}\n\n")

# Initialize data file with metadata structure
with open(data_filename, 'w', encoding='utf-8') as data_file:
    json.dump({"analysis_metadata": {"timestamp": timestamp, "analysis_type": ANALYSIS_TYPE, "samples": []}}, data_file, indent=2)

class LogCapture:
    def __init__(self, log_filename):
        self.log_filename = log_filename
        self.original_stdout = sys.stdout
    
    def write(self, text):
        self.original_stdout.write(text)
        self.original_stdout.flush()
        
        with open(self.log_filename, 'a', encoding='utf-8') as log_file:
            log_file.write(text)
            log_file.flush()
    
    def flush(self):
        self.original_stdout.flush()

log_capture = LogCapture(log_filename)

def log_section(section_name, section_number=None):
    header = f"\n{'='*80}\n"
    if section_number:
        header += f"SECTION {section_number}: {section_name.upper()}\n"
    else:
        header += f"{section_name.upper()}\n"
    header += f"{'='*80}\n\n"
    
    with open(log_filename, 'a', encoding='utf-8') as log_file:
        log_file.write(header)
    
    print(header.strip())

def save_data_structure(data, description, section_name):
    """
    Save complex data structures to organized JSON file for detailed inspection.
    """
    try:
        with open(data_filename, 'r', encoding='utf-8') as f:
            existing_data = json.load(f)
        
        new_entry = {
            "section": section_name,
            "description": description,
            "timestamp": datetime.now().isoformat(),
            "data": data
        }
        existing_data["analysis_metadata"]["samples"].append(new_entry)
        
        with open(data_filename, 'w', encoding='utf-8') as f:
            json.dump(existing_data, f, indent=2, default=str)
        
        print(f"💾 Saved data structure: {description}")
        
    except Exception as e:
        print(f"⚠️  Warning: Could not save data structure - {str(e)}")

@contextmanager
def capture_output(section_name, section_number=None):
    """
    Context manager to capture both console output and logging with organized section headers.
    """
    log_section(section_name, section_number)
    
    original_stdout = sys.stdout
    sys.stdout = log_capture
    
    try:
        yield
    finally:
        sys.stdout = original_stdout
        
        with open(log_filename, 'a', encoding='utf-8') as log_file:
            log_file.write(f"\n{'-'*60} END SECTION {'-'*60}\n\n")

print("✅ Logging system initialized for 3a network-flows exploratory analysis!")
print(f"📂 All outputs will be organized in: {analysis_outputs_dir}/")
print(f"🔍 Use 'with capture_output(\"Section Name\", section_number):' to log outputs")
print(f"💾 Use 'save_data_structure(data, \"description\", \"section\")' for complex data")

📝 Analysis type: 3a-network-flows
📁 Output directory: outputs/3a-network-flows
📊 Log file: outputs/3a-network-flows/3a-network-flows_exploratory_analysis_20250629_064647.log
💾 Data file: outputs/3a-network-flows/3a-network-flows_sample_data_20250629_064647.json
✅ Logging system initialized for 3a network-flows exploratory analysis!
📂 All outputs will be organized in: outputs/3a-network-flows/
🔍 Use 'with capture_output("Section Name", section_number):' to log outputs
💾 Use 'save_data_structure(data, "description", "section")' for complex data
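
The capture pattern used in this cell can be tried in isolation. The following is a minimal, self-contained sketch rather than the notebook's own implementation: `TeeWriter`, `tee_stdout`, and `demo_tee` are hypothetical names, and a temporary directory stands in for `outputs/3a-network-flows/`.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


class TeeWriter:
    """Write text to a console stream and append it to a log file (hypothetical helper)."""

    def __init__(self, console, log_path):
        self.console = console
        self.log_path = log_path

    def write(self, text):
        self.console.write(text)
        with open(self.log_path, "a", encoding="utf-8") as log_file:
            log_file.write(text)
        return len(text)

    def flush(self):
        self.console.flush()


@contextmanager
def tee_stdout(log_path):
    """Temporarily route print() output to both the console and log_path."""
    original = sys.stdout
    sys.stdout = TeeWriter(original, log_path)
    try:
        yield
    finally:
        sys.stdout = original  # Always restore, even if the body raises


def demo_tee():
    """Run a tiny capture in a temp directory and return the logged text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        log_path = os.path.join(tmp_dir, "demo.log")
        with tee_stdout(log_path):
            print("captured line")
        print("not captured")  # Printed after stdout was restored
        with open(log_path, encoding="utf-8") as log_file:
            return log_file.read()
```

Restoring `sys.stdout` in a `finally` block matters in a notebook: if a cell raises inside the context manager, later cells would otherwise keep writing to the log.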


## 1. Import Required Libraries

Import essential Python libraries for data exploration, JSON processing, and statistical analysis.

In [2]:
import json  # For parsing JSONL records
import random  # For random sampling of large datasets
import os  # For file system operations
from collections import Counter, defaultdict  # For frequency analysis and data aggregation
import pandas as pd  # For data manipulation and analysis
from pprint import pprint  # For pretty-printing complex data structures
import sys  # For system-specific parameters

## 2. File Path Setup and Validation

Define the target JSONL file path and verify its existence before proceeding with the analysis.

In [3]:
# Define the path to the network traffic JSONL file
jsonl_file_path = '-ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl'

# Verify that the file exists in the current directory
if os.path.exists(jsonl_file_path):
    # Get file size in MB for initial assessment
    file_size_bytes = os.path.getsize(jsonl_file_path)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"✅ File found: {jsonl_file_path}")
    print(f"📊 File size: {file_size_mb:.2f} MB ({file_size_bytes:,} bytes)")
else:
    print(f"❌ File not found: {jsonl_file_path}")
    print("📁 Available files in current directory:")
    # List all JSONL files in current directory for debugging
    for file in os.listdir('.'):
        if file.endswith('.jsonl'):
            print(f"   - {file}")

✅ File found: -ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl
📊 File size: 2309.17 MB (2,421,341,652 bytes)


## 3. Initial Data Loading and Line Count

Count the total number of records in the JSONL file to understand the dataset size and plan the sampling strategy.

In [4]:
num_samples_to_analyze = 200_000

In [5]:
if 'capture_output' in globals():
    with capture_output("Initial Data Loading and Line Count", 3):
        total_records = 0
        try:
            with open(jsonl_file_path, 'r', encoding='utf-8') as f:
                for line_num, line in enumerate(f, 1):
                    if line.strip():
                        total_records += 1
                
            print(f"📈 Total records in dataset: {total_records:,}")
            print(f"🎯 Sampling strategy: Will analyze {min(num_samples_to_analyze, total_records)} random samples")
            
        except Exception as e:
            print(f"❌ Error reading file: {str(e)}")
            total_records = 0
else:
    total_records = 0
    try:
        with open(jsonl_file_path, 'r', encoding='utf-8') as f:
            for line_num, line in enumerate(f, 1):
                if line.strip():
                    total_records += 1
            
        print(f"📈 Total records in dataset: {total_records:,}")
        print(f"🎯 Sampling strategy: Will analyze {min(num_samples_to_analyze, total_records)} random samples")
        
    except Exception as e:
        print(f"❌ Error reading file: {str(e)}")
        total_records = 0

SECTION 3: INITIAL DATA LOADING AND LINE COUNT
📈 Total records in dataset: 1,090,212
🎯 Sampling strategy: Will analyze 200000 random samples
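
The file size and record count reported above allow a quick back-of-the-envelope check of what the sample will cost. This sketch uses only the figures printed by the earlier cells. The average is approximate because the byte count also includes newlines and any blank lines, and parsed Python dicts occupy more memory than the raw JSON text.

```python
# Figures copied from the cell outputs above
file_size_bytes = 2_421_341_652
total_records = 1_090_212
num_samples_to_analyze = 200_000

avg_record_bytes = file_size_bytes / total_records
sampling_fraction = num_samples_to_analyze / total_records
# Rough size of the raw JSON text behind the sample (parsed dicts take more)
approx_sample_text_mb = num_samples_to_analyze * avg_record_bytes / (1024 * 1024)

print(f"Average record size: {avg_record_bytes:,.0f} bytes")
print(f"Sampling fraction: {sampling_fraction:.1%}")
print(f"Approx. raw text behind the sample: {approx_sample_text_mb:,.0f} MB")
```

Each record averages a little over 2 KB, and the 200,000-record sample covers roughly 18% of the dataset.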


## 4. Random Sampling Function

Implement efficient random sampling to select representative records from the large JSONL dataset.

In [6]:
def get_random_samples(file_path, sample_size=num_samples_to_analyze, max_records=None):
    """
    Extract random samples from JSONL file using reservoir sampling algorithm.
    
    Args:
        file_path (str): Path to the JSONL file
        sample_size (int): Number of samples to collect
        max_records (int): Maximum records to read (None for all)
    
    Returns:
        list: List of parsed JSON objects
    """
    samples = []  # Reservoir of selected samples
    records_seen = 0  # Valid records parsed so far (blank and malformed lines excluded)
    
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line_num, line in enumerate(f, 1):
                # Skip empty lines
                if not line.strip():
                    continue
                
                # Stop if we've reached the maximum record limit
                if max_records and records_seen >= max_records:
                    break
                
                try:
                    # Parse the JSON record
                    record = json.loads(line)
                    records_seen += 1
                    
                    # Reservoir sampling: Always add first 'sample_size' records
                    if len(samples) < sample_size:
                        samples.append(record)
                    else:
                        # Replace an existing sample with probability sample_size / records_seen
                        replace_index = random.randint(0, records_seen - 1)
                        if replace_index < sample_size:
                            samples[replace_index] = record
                            
                except json.JSONDecodeError as e:
                    print(f"⚠️  JSON decode error on line {line_num}: {str(e)}")
                    continue
                    
        print(f"✅ Successfully collected {len(samples)} random samples")
        return samples
        
    except Exception as e:
        print(f"❌ Error during sampling: {str(e)}")
        return []
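
Reservoir sampling (Algorithm R) keeps each of the n items with probability k/n, provided the index used for the random draw is the count of items actually considered. The sketch below is a self-contained illustration on an in-memory range, not the notebook's function. `reservoir_sample` and the inclusion-rate check are hypothetical additions.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for seen, item in enumerate(items, 1):  # seen = number of items consumed so far
        if len(reservoir) < k:
            reservoir.append(item)  # Fill phase: keep the first k items
        else:
            j = rng.randint(0, seen - 1)  # Uniform over all items seen so far
            if j < k:  # Happens with probability k / seen
                reservoir[j] = item
    return reservoir


# Empirical check: each of 10 items should be included at a rate near k / n = 3 / 10
rng = random.Random(0)
trials = 20_000
hits = [0] * 10
for _ in range(trials):
    for value in reservoir_sample(range(10), 3, rng):
        hits[value] += 1

inclusion_rates = [count / trials for count in hits]
print([round(rate, 2) for rate in inclusion_rates])
```

The algorithm needs only one pass and O(k) memory, which is why it suits a multi-gigabyte JSONL file that cannot be indexed by line number without reading it first.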

In [ ]:
# Optional quick run: unseeded sampling, so results vary between runs
# random.seed(42)  # Uncomment for a fixed seed that always gives the same samples
random.seed()  # No argument: seeds from OS entropy (or the system time as a fallback)

# Collect random samples from the dataset
sample_size = min(100, total_records)  # Sample up to 100 records or total if smaller
random_samples = get_random_samples(jsonl_file_path, sample_size=sample_size)

if random_samples:
    print(f"📊 Analysis ready with {len(random_samples)} samples")
    print(f"🎲 First RNG state word: {random.getstate()[1][0]} (for debugging; this is not the seed)")
    print(f"🔄 Note: Unseeded run - results will vary between runs")
else:
    print("❌ No samples collected - cannot proceed with analysis")

In [7]:
# Set a fixed random seed for reproducible sampling
random.seed(42)

# Collect random samples from the dataset
sample_size = min(num_samples_to_analyze, total_records)  # Sample up to num_samples_to_analyze records, or all records if fewer
random_samples = get_random_samples(jsonl_file_path, sample_size=sample_size)

if random_samples:
    print(f"📊 Analysis ready with {len(random_samples)} samples")
    print(f"🎲 Random seed: 42 (for reproducible results)")
else:
    print("❌ No samples collected - cannot proceed with analysis")

✅ Successfully collected 200000 random samples
📊 Analysis ready with 200000 samples
🎲 Random seed: 42 (for reproducible results)
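
As a general illustration of seeding in Python's `random` module (independent of this dataset), the sketch below contrasts an explicit seed with an unseeded generator. The toy `population` is an assumption made for the example.

```python
import random

population = list(range(1_000))  # Hypothetical stand-in for record indices

# Same explicit seed -> identical draws
first = random.Random(42).sample(population, 5)
second = random.Random(42).sample(population, 5)

# No seed -> initialised from OS entropy (or the clock), so draws normally differ
unseeded = random.Random().sample(population, 5)

print(first == second)
print(first, unseeded)
```

A reported seed is only meaningful if that value was actually passed to `random.seed()` or `random.Random()` before sampling.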


## 6. Basic Data Structure Analysis

Examine the fundamental structure of network traffic records to understand data organization and hierarchy.

In [8]:
if random_samples:
    if 'capture_output' in globals() and 'save_data_structure' in globals():
        with capture_output("Basic Data Structure Analysis", 6):
            first_sample = random_samples[0]
            
            print("📋 BASIC DATA STRUCTURE ANALYSIS")
            print("=" * 50)
            
            print(f"🔍 Data type: {type(first_sample)}")
            print(f"📏 Number of top-level fields: {len(first_sample)}")
            
            print(f"🗝️  Top-level fields:")
            for i, key in enumerate(first_sample.keys(), 1):
                field_type = type(first_sample[key]).__name__
                print(f"   {i:2d}. {key:<30} ({field_type})")
            
            print(f"📄 Complete first sample structure:")
            print("-" * 30)
            pprint(first_sample, depth=3, width=100)
            
            save_data_structure(first_sample, "Complete First Sample Structure", "Basic Data Structure Analysis")
    else:
        first_sample = random_samples[0]
        
        print("📋 BASIC DATA STRUCTURE ANALYSIS")
        print("=" * 50)
        
        print(f"🔍 Data type: {type(first_sample)}")
        print(f"📏 Number of top-level fields: {len(first_sample)}")
        
        print(f"🗝️  Top-level fields:")
        for i, key in enumerate(first_sample.keys(), 1):
            field_type = type(first_sample[key]).__name__
            print(f"   {i:2d}. {key:<30} ({field_type})")
        
        print(f"📄 Complete first sample structure:")
        print("-" * 30)
        pprint(first_sample, depth=3, width=100)

SECTION 6: BASIC DATA STRUCTURE ANALYSIS
📋 BASIC DATA STRUCTURE ANALYSIS
🔍 Data type: <class 'dict'>
📏 Number of top-level fields: 11
🗝️  Top-level fields:
    1. agent                          (dict)
    2. destination                    (dict)
    3. elastic_agent                  (dict)
    4. network_traffic                (dict)
    5. source                         (dict)
    6. network                        (dict)
    7. @timestamp                     (str)
    8. ecs                            (dict)
    9. data_stream                    (dict)
   10. host                           (dict)
   11. event                          (dict)
📄 Complete first sample structure:
------------------------------
{'@timestamp': '2025-05-04T12:39:15.847Z',
 'agent': {'ephemeral_id': 'fdaf0b65-7a0d-49a7-90a6-5f011b283f17',
           'id': 'dea09884-b943-42b8-a03b-2e87109e5297',
           'name': 'theblock',
           'type': 'packetbeat',
           'version': '8.18.0'},
 'data_stream': {'da

## 7. Field Distribution Analysis

Analyze the frequency and distribution of fields across all samples to identify consistent vs. optional fields.

In [9]:
if random_samples:
    if 'capture_output' in globals() and 'save_data_structure' in globals():
        with capture_output("Field Distribution Analysis", 7):
            field_counter = Counter()
            field_types = defaultdict(Counter)
            
            print("📊 FIELD DISTRIBUTION ANALYSIS")
            print("=" * 50)
            
            for sample_idx, sample in enumerate(random_samples):
                for field_name, field_value in sample.items():
                    field_counter[field_name] += 1
                    field_type = type(field_value).__name__
                    field_types[field_name][field_type] += 1
            
            total_samples = len(random_samples)
            
            print(f"📈 Field frequency analysis ({total_samples} samples):")
            print(f"{'Field Name':<30} {'Count':<8} {'Percentage':<12} {'Primary Type':<15}")
            print("-" * 70)
            
            for field_name, count in field_counter.most_common():
                percentage = (count / total_samples) * 100
                primary_type = field_types[field_name].most_common(1)[0][0]
                print(f"{field_name:<30} {count:<8} {percentage:>7.1f}%     {primary_type:<15}")
            
            always_present = [field for field, count in field_counter.items() if count == total_samples]
            optional_fields = [field for field, count in field_counter.items() if count < total_samples]
            
            print(f"✅ Always present fields ({len(always_present)}):")
            for field in always_present:
                print(f"   - {field}")
            
            print(f"❓ Optional fields ({len(optional_fields)}):")
            for field in optional_fields:
                presence = (field_counter[field] / total_samples) * 100
                print(f"   - {field:<30} (present in {presence:.1f}% of samples)")
            
            save_data_structure({
                "field_frequency": dict(field_counter.most_common()),
                "always_present_fields": always_present,
                "optional_fields": optional_fields
            }, "Field Distribution Analysis Results", "Field Distribution Analysis")
    else:
        field_counter = Counter()
        field_types = defaultdict(Counter)
        
        print("📊 FIELD DISTRIBUTION ANALYSIS")
        print("=" * 50)
        
        for sample_idx, sample in enumerate(random_samples):
            for field_name, field_value in sample.items():
                field_counter[field_name] += 1
                field_type = type(field_value).__name__
                field_types[field_name][field_type] += 1
        
        total_samples = len(random_samples)
        
        print(f"📈 Field frequency analysis ({total_samples} samples):")
        print(f"{'Field Name':<30} {'Count':<8} {'Percentage':<12} {'Primary Type':<15}")
        print("-" * 70)
        
        for field_name, count in field_counter.most_common():
            percentage = (count / total_samples) * 100
            primary_type = field_types[field_name].most_common(1)[0][0]
            print(f"{field_name:<30} {count:<8} {percentage:>7.1f}%     {primary_type:<15}")
        
        always_present = [field for field, count in field_counter.items() if count == total_samples]
        optional_fields = [field for field, count in field_counter.items() if count < total_samples]
        
        print(f"✅ Always present fields ({len(always_present)}):")
        for field in always_present:
            print(f"   - {field}")
        
        print(f"❓ Optional fields ({len(optional_fields)}):")
        for field in optional_fields:
            presence = (field_counter[field] / total_samples) * 100
            print(f"   - {field:<30} (present in {presence:.1f}% of samples)")

SECTION 7: FIELD DISTRIBUTION ANALYSIS
📊 FIELD DISTRIBUTION ANALYSIS
📈 Field frequency analysis (200000 samples):
Field Name                     Count    Percentage   Primary Type   
----------------------------------------------------------------------
agent                          200000     100.0%     dict           
destination                    200000     100.0%     dict           
elastic_agent                  200000     100.0%     dict           
network_traffic                200000     100.0%     dict           
source                         200000     100.0%     dict           
network                        200000     100.0%     dict           
@timestamp                     200000     100.0%     str            
ecs                            200000     100.0%     dict           
data_stream                    200000     100.0%     dict           
host                           200000     100.0%     dict           
event                          200000     100.0%     dic

## 8. Data Type Analysis

Examine data types in detail to understand field characteristics and identify potential processing requirements.

In [10]:
if random_samples:
    print("🔬 DETAILED DATA TYPE ANALYSIS")
    print("=" * 50)
    
    # Analyze complex fields (nested objects, arrays)
    complex_fields = {}  # Store analysis of nested structures
    
    for field_name in field_counter.keys():
        print(f"\n📝 Field: {field_name}")
        print("-" * 30)
        
        # Get type distribution for this field
        type_dist = field_types[field_name]
        print(f"Type distribution:")
        for data_type, count in type_dist.most_common():
            percentage = (count / field_counter[field_name]) * 100
            print(f"   {data_type:<15} {count:>4} ({percentage:5.1f}%)")
        
        # Analyze sample values for each field
        sample_values = []
        for sample in random_samples[:5]:  # Show first 5 samples
            if field_name in sample:
                value = sample[field_name]
                
                # Handle different data types appropriately
                if isinstance(value, dict):
                    # For dictionaries, show structure
                    sample_values.append(f"dict with keys: {list(value.keys())[:3]}...")
                    complex_fields[field_name] = 'nested_object'
                elif isinstance(value, list):
                    # For lists, show length and sample items
                    sample_values.append(f"list[{len(value)}]: {value[:2]}...")
                    complex_fields[field_name] = 'array'
                elif isinstance(value, str):
                    # For strings, show truncated value
                    truncated = value[:50] + "..." if len(value) > 50 else value
                    sample_values.append(f'({len(value)} chars): "{truncated}"')
                else:
                    # For primitives, show actual value
                    sample_values.append(str(value))
        
        print(f"Sample values:")
        for i, val in enumerate(sample_values, 1):
            print(f"   {i}. {val}")
    
    # Summary of complex fields
    if complex_fields:
        print(f"\n🏗️  Complex fields requiring special processing:")
        for field, complexity in complex_fields.items():
            print(f"   - {field:<30} ({complexity})")

🔬 DETAILED DATA TYPE ANALYSIS

📝 Field: agent
------------------------------
Type distribution:
   dict            200000 (100.0%)
Sample values:
   1. dict with keys: ['name', 'id', 'ephemeral_id']...
   2. dict with keys: ['name', 'id', 'ephemeral_id']...
   3. dict with keys: ['name', 'id', 'ephemeral_id']...
   4. dict with keys: ['name', 'id', 'ephemeral_id']...
   5. dict with keys: ['name', 'id', 'type']...

📝 Field: destination
------------------------------
Type distribution:
   dict            200000 (100.0%)
Sample values:
   1. dict with keys: ['port', 'bytes', 'ip']...
   2. dict with keys: ['port', 'ip', 'mac']...
   3. dict with keys: ['port', 'ip', 'mac']...
   4. dict with keys: ['port', 'bytes', 'ip']...
   5. dict with keys: ['port', 'bytes', 'ip']...

📝 Field: elastic_agent
------------------------------
Type distribution:
   dict            200000 (100.0%)
Sample values:
   1. dict with keys: ['id', 'version', 'snapshot']...
   2. dict with keys: ['id', 'version', 
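
The sample values above show `destination` dicts whose leading keys differ between records, which suggests that some nested keys are optional even though every top-level field is always present. A minimal sketch of counting presence one level down follows. The toy records reuse key names from the output, but all values are invented.

```python
from collections import Counter

# Hypothetical toy records; key names mirror the outputs, values are made up
records = [
    {"destination": {"port": 443, "bytes": 1200, "ip": "10.0.0.5"}},
    {"destination": {"port": 53, "ip": "10.0.0.9", "mac": "00-00-00-00-00-01"}},
    {"destination": {"port": 443, "bytes": 300, "ip": "10.0.0.5"}},
]

# Count how many records contain each parent.child key
nested_counter = Counter()
for record in records:
    for parent, value in record.items():
        if isinstance(value, dict):
            for child in value:
                nested_counter[f"{parent}.{child}"] += 1

total = len(records)
presence = {path: count / total for path, count in nested_counter.items()}
optional_paths = sorted(path for path, rate in presence.items() if rate < 1.0)
print(optional_paths)
```

Running the same count over `random_samples` would show which nested keys the main processing logic cannot assume to exist.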

## 9. Nested Structure Deep Dive

Examine nested objects and arrays to understand the full data hierarchy and identify all available fields.

In [11]:
def analyze_nested_structure(obj, path="", max_depth=3, current_depth=0):
    paths = set()
    
    if current_depth >= max_depth:
        return paths
    
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_path = f"{path}.{key}" if path else key
            paths.add(f"{new_path} ({type(value).__name__})")
            paths.update(analyze_nested_structure(value, new_path, max_depth, current_depth + 1))
            
    elif isinstance(obj, list) and obj:
        first_item = obj[0]
        array_path = f"{path}[0]"
        paths.add(f"{array_path} ({type(first_item).__name__})")
        paths.update(analyze_nested_structure(first_item, array_path, max_depth, current_depth + 1))
    
    return paths

if random_samples:
    if 'capture_output' in globals() and 'save_data_structure' in globals():
        with capture_output("Nested Structure Deep Dive", 9):
            print("🗂️  NESTED STRUCTURE DEEP DIVE")
            print("=" * 50)
            
            all_paths = set()
            
            for i, sample in enumerate(random_samples[:10]):
                sample_paths = analyze_nested_structure(sample, max_depth=4)
                all_paths.update(sample_paths)
                print(f"✓ Analyzed sample {i+1}/10")
            
            print(f"📋 Complete field hierarchy ({len(all_paths)} unique paths):")
            print("-" * 60)
            
            sorted_paths = sorted(all_paths)
            
            grouped_paths = defaultdict(list)
            for path in sorted_paths:
                top_level = path.split('.')[0].split(' ')[0]
                grouped_paths[top_level].append(path)
            
            for top_field, paths in grouped_paths.items():
                print(f"🌳 {top_field}:")
                for path in paths:
                    indent = "  " * (path.count('.') + 1)
                    field_name = path.split(' ')[0]
                    field_type = path.split('(')[1].replace(')', '') if '(' in path else 'unknown'
                    print(f"{indent}├─ {field_name.split('.')[-1]} ({field_type})")
            
            save_data_structure({
                "total_unique_paths": len(all_paths),
                "grouped_field_hierarchy": {k: v for k, v in grouped_paths.items()},
                "all_paths": sorted_paths
            }, "Complete Field Hierarchy Analysis", "Nested Structure Deep Dive")
    else:
        print("🗂️  NESTED STRUCTURE DEEP DIVE")
        print("=" * 50)
        
        all_paths = set()
        
        for i, sample in enumerate(random_samples[:10]):
            sample_paths = analyze_nested_structure(sample, max_depth=4)
            all_paths.update(sample_paths)
            print(f"✓ Analyzed sample {i+1}/10")
        
        print(f"📋 Complete field hierarchy ({len(all_paths)} unique paths):")
        print("-" * 60)
        
        sorted_paths = sorted(all_paths)
        
        grouped_paths = defaultdict(list)
        for path in sorted_paths:
            top_level = path.split('.')[0].split(' ')[0]
            grouped_paths[top_level].append(path)
        
        for top_field, paths in grouped_paths.items():
            print(f"🌳 {top_field}:")
            for path in paths:
                indent = "  " * (path.count('.') + 1)
                field_name = path.split(' ')[0]
                field_type = path.split('(')[1].replace(')', '') if '(' in path else 'unknown'
                print(f"{indent}├─ {field_name.split('.')[-1]} ({field_type})")

SECTION 9: NESTED STRUCTURE DEEP DIVE
🗂️  NESTED STRUCTURE DEEP DIVE
✓ Analyzed sample 1/10
✓ Analyzed sample 2/10
✓ Analyzed sample 3/10
✓ Analyzed sample 4/10
✓ Analyzed sample 5/10
✓ Analyzed sample 6/10
✓ Analyzed sample 7/10
✓ Analyzed sample 8/10
✓ Analyzed sample 9/10
✓ Analyzed sample 10/10
📋 Complete field hierarchy (87 unique paths):
------------------------------------------------------------
🌳 @timestamp:
  ├─ @timestamp (str)
🌳 agent:
  ├─ agent (dict)
    ├─ ephemeral_id (str)
    ├─ id (str)
    ├─ name (str)
    ├─ type (str)
    ├─ version (str)
🌳 data_stream:
  ├─ data_stream (dict)
    ├─ dataset (str)
    ├─ namespace (str)
    ├─ type (str)
🌳 destination:
  ├─ destination (dict)
    ├─ bytes (int)
    ├─ ip (str)
    ├─ mac (str)
    ├─ packets (int)
    ├─ port (int)
🌳 ecs:
  ├─ ecs (dict)
    ├─ version (str)
🌳 elastic_agent:
  ├─ elastic_agent (dict)
    ├─ id (str)
    ├─ snapshot (bool)
    ├─ version (str)
🌳 event:
  ├─ event (dict)
    ├─ action (str)
    ├─
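
Once the hierarchy is known, a common next step is to flatten each record into dotted paths so that nested values can be tabulated like flat columns. The sketch below is one possible approach under that assumption. `flatten` is a hypothetical helper, and the record's values are invented.

```python
def flatten(obj, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested dict/list structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj  # Leaf value: str, int, bool, None, ...


# Hypothetical record: keys follow the hierarchy printed above, values are invented
record = {
    "@timestamp": "2025-05-04T00:00:00.000Z",
    "agent": {"type": "packetbeat", "version": "8.18.0"},
    "destination": {"port": 443, "bytes": 1200},
}

flat = dict(flatten(record))
print(flat)
```

Unlike `analyze_nested_structure`, this sketch has no depth limit and walks every list element, so it is better suited to individual records than to summarising a large sample. Empty dicts and lists produce no entries.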

## 10. Data Quality Assessment

Evaluate data quality by checking for missing values, empty fields, and potential data inconsistencies.

In [12]:
if random_samples:
    print("🔍 DATA QUALITY ASSESSMENT")
    print("=" * 50)
    
    # Initialize quality metrics
    quality_stats = {
        'empty_values': defaultdict(int),  # Count of empty/null values per field
        'field_lengths': defaultdict(list),  # String/array lengths per field
        'unique_values': defaultdict(set),  # Unique values for categorical analysis
        'total_samples': len(random_samples)
    }
    
    # Analyze each sample for quality metrics
    for sample_idx, sample in enumerate(random_samples):
        for field_name, field_value in sample.items():
            
            # Check for empty/null values
            if field_value is None or field_value == "" or field_value == []:
                quality_stats['empty_values'][field_name] += 1
            
            # Track field lengths for strings and arrays
            if isinstance(field_value, (str, list)):
                quality_stats['field_lengths'][field_name].append(len(field_value))
            
            # Collect unique values for potential categorical fields (booleans excluded)
            if isinstance(field_value, (str, int, float)) and not isinstance(field_value, bool):
                # Limit unique value tracking to avoid memory issues
                if len(quality_stats['unique_values'][field_name]) < 50:
                    quality_stats['unique_values'][field_name].add(str(field_value))
    
    # Report data quality findings
    total_samples = quality_stats['total_samples']
    
    print(f"📊 Data completeness analysis:")
    print(f"{'Field Name':<30} {'Empty Count':<12} {'Completeness':<15} {'Notes':<20}")
    print("-" * 80)
    
    for field_name in field_counter.keys():
        empty_count = quality_stats['empty_values'][field_name]
        present_count = field_counter[field_name]
        completeness = ((present_count - empty_count) / present_count) * 100 if present_count > 0 else 0
        
        # Generate quality notes
        notes = []
        if empty_count > 0:
            notes.append(f"{empty_count} empty")
        if field_name in quality_stats['field_lengths']:
            lengths = quality_stats['field_lengths'][field_name]
            if lengths:
                avg_len = sum(lengths) / len(lengths)
                notes.append(f"avg len: {avg_len:.1f}")
        
        notes_str = ", ".join(notes) if notes else "good"
        
        print(f"{field_name:<30} {empty_count:<12} {completeness:>7.1f}%       {notes_str:<20}")
    
    # Identify potential categorical fields (unique-value tracking above is capped at 50 per field)
    print(f"\n🏷️  Potential categorical fields (≤20 unique values):")
    for field_name, unique_vals in quality_stats['unique_values'].items():
        if len(unique_vals) <= 20:  # Show fields with few unique values
            unique_count = len(unique_vals)
            sample_values = list(unique_vals)[:5]  # Show first 5 unique values
            print(f"   - {field_name:<25} ({unique_count:2d} unique): {sample_values}")
            if len(unique_vals) > 5:
                print(f"     {' ' * 27} ... and {len(unique_vals) - 5} more")

🔍 DATA QUALITY ASSESSMENT
📊 Data completeness analysis:
Field Name                     Empty Count  Completeness    Notes               
--------------------------------------------------------------------------------
agent                          0              100.0%       good                
destination                    0              100.0%       good                
elastic_agent                  0              100.0%       good                
network_traffic                0              100.0%       good                
source                         0              100.0%       good                
network                        0              100.0%       good                
@timestamp                     0              100.0%       avg len: 24.0       
ecs                            0              100.0%       good                
data_stream                    0              100.0%       good                
host                           0              100.0%       goo
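
The completeness column above is the share of present values that are non-empty. A minimal sketch of that calculation follows. The helper name `completeness_pct` is hypothetical, and the emptiness rule (`None`, empty strings, and empty containers count as empty) is an assumption, since the notebook's own emptiness test is not shown in this section.

```python
def completeness_pct(values):
    """Return the percentage of non-empty entries among the values present."""
    present = len(values)
    if present == 0:
        return 0.0
    # Assumption: None, "", [], and {} count as empty; 0 and False do not.
    empty = sum(1 for v in values if v is None or v in ("", [], {}))
    return (present - empty) / present * 100
```

Guarding the zero-present case mirrors the `if present_count > 0 else 0` check in the cell above.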

## 11. Sample Record Display

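Display up to three sample records with formatted output for manual inspection and verification. Nested values longer than 100 characters and top-level values longer than 150 characters are truncated on screen; when the logging system is available, each complete record is also saved to the data file.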
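
When the full nested structure of a single record matters more than a compact view, indented JSON is a possible alternative. The sketch below is not part of the notebook's logging helpers; `pretty_record` is a hypothetical name, and `default=str` is an assumed fallback for values that `json` cannot serialize natively.

```python
import json


def pretty_record(record):
    """Return the record as indented JSON text, keeping non-ASCII characters readable."""
    # default=str is a fallback for values json cannot serialize natively.
    return json.dumps(record, indent=2, ensure_ascii=False, default=str)
```

Calling `print(pretty_record(random_samples[0]))` would show every nested level of the first sample without truncation.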

In [13]:
def _truncate(value, limit):
    """Return str(value), shortened to `limit` characters plus '...' if longer."""
    text = str(value)
    return text[:limit] + "..." if len(text) > limit else text


def display_sample_records(samples, max_records=3, save=False):
    """Print up to `max_records` records; with save=True, also store each full record."""
    print("📄 SAMPLE RECORD DISPLAY")
    print("=" * 50)

    for i, sample in enumerate(samples[:max_records], start=1):
        print(f"Sample Record #{i}:")
        print("-" * 40)

        if save:
            save_data_structure(sample, f"Sample Record #{i}", "Sample Record Display")

        for field_name, field_value in sample.items():
            if isinstance(field_value, dict):
                # Nested object: list each key on its own indented line
                print(f"🏗️  {field_name}:")
                for nested_key, nested_val in field_value.items():
                    print(f"      {nested_key}: {_truncate(nested_val, 100)}")
            elif isinstance(field_value, list):
                # Array: show the first three items and summarize the rest
                print(f"📋 {field_name} (array[{len(field_value)}]):")
                for j, item in enumerate(field_value[:3]):
                    print(f"      [{j}]: {_truncate(item, 100)}")
                if len(field_value) > 3:
                    print(f"      ... and {len(field_value) - 3} more items")
            else:
                print(f"📝 {field_name}: {_truncate(field_value, 150)}")

        print()


if random_samples:
    if 'capture_output' in globals() and 'save_data_structure' in globals():
        with capture_output("Sample Record Display", 11):
            display_sample_records(random_samples, save=True)

        print(f"💾 Complete sample records saved to: {data_filename}")
        print(f"📝 All output logged to: {log_filename}")

    else:
        display_sample_records(random_samples)

        print("ℹ️  Note: Logging system not available - outputs not saved to files")

SECTION 11: SAMPLE RECORD DISPLAY
📄 SAMPLE RECORD DISPLAY
Sample Record #1:
----------------------------------------
💾 Saved data structure: Sample Record #1
🏗️  agent:
      name: theblock
      id: dea09884-b943-42b8-a03b-2e87109e5297
      ephemeral_id: fdaf0b65-7a0d-49a7-90a6-5f011b283f17
      type: packetbeat
      version: 8.18.0
🏗️  destination:
      port: 9200
      bytes: 7087
      ip: 10.2.0.20
      packets: 26
      mac: 08-00-27-26-33-A5
🏗️  elastic_agent:
      id: dea09884-b943-42b8-a03b-2e87109e5297
      version: 8.18.0
      snapshot: False
🏗️  network_traffic:
      flow: {'final': False, 'id': 'EQQA////DP//////FP8BAAEIACcmM6UIACf5E3UKAgAUCgEABfAj3vg'}
🏗️  source:
      port: 63710
      bytes: 60945
      ip: 10.1.0.5
      packets: 43
      mac: 08-00-27-F9-13-75
🏗️  network:
      community_id: 1:p/eP4xM4zgi1RF3mnJg2GvG310c=
      bytes: 68032
      transport: tcp
      type: ipv4
      packets: 69
📝 @timestamp: 2025-05-04T12:39:15.847Z
🏗️  ecs:
      version: 

## 12. Analysis Summary and Recommendations

Provide a comprehensive summary of findings and recommendations for processing the network traffic data in notebook #3.
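
One recommendation this section prints for large files is batch processing. A minimal sketch of chunked JSONL reading follows, assuming one JSON object per line. The function name `iter_jsonl_batches`, the default batch size, and the choice to silently skip malformed lines are illustrative assumptions, not the logic of notebook #3.

```python
import json


def iter_jsonl_batches(path, batch_size=10_000):
    """Yield lists of parsed records, at most `batch_size` records per list."""
    batch = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                batch.append(json.loads(line))
            except json.JSONDecodeError:
                # Assumption: skip malformed lines; real processing should log them.
                continue
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch
```

Because the function is a generator, only one batch is held in memory at a time, which keeps memory use bounded regardless of file size.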

In [14]:
if random_samples:
    if 'capture_output' in globals() and 'save_data_structure' in globals():
        with capture_output("Analysis Summary and Recommendations", 12):
            print("📊 ANALYSIS SUMMARY AND RECOMMENDATIONS")
            print("=" * 60)
            
            total_fields = len(field_counter)
            always_present_fields = [f for f, c in field_counter.items() if c == len(random_samples)]
            optional_fields = [f for f, c in field_counter.items() if c < len(random_samples)]
            
            print(f"🔍 DATASET CHARACTERISTICS:")
            print(f"   • Total records in file: ~{total_records:,}")
            print(f"   • Samples analyzed: {len(random_samples)}")
            print(f"   • Total unique fields: {total_fields}")
            print(f"   • Always present fields: {len(always_present_fields)}")
            print(f"   • Optional fields: {len(optional_fields)}")
            
            print(f"💡 PROCESSING RECOMMENDATIONS:")
            
            if total_records > 100000:
                print(f"   🧠 Memory Management:")
                print(f"      - Use batch processing for large dataset ({total_records:,} records)")
                print(f"      - Consider chunked reading with pandas or custom iterator")
                print(f"      - Implement progress tracking for long operations")
            
            if 'complex_fields' in globals() and complex_fields:
                print(f"   🏗️  Complex Field Handling:")
                for field, complexity in complex_fields.items():
                    if complexity == 'nested_object':
                        print(f"      - {field}: Flatten nested structure or extract key fields")
                    elif complexity == 'array':
                        print(f"      - {field}: Consider array length, join elements, or extract features")
            
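            # Fall back to an empty list if the data quality cell was skipped
            high_empty_fields = (
                [f for f, c in quality_stats['empty_values'].items() if c > len(random_samples) * 0.1]
                if 'quality_stats' in globals() else []
            )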
            if high_empty_fields:
                print(f"   🚨 Data Quality:")
                for field in high_empty_fields:
                    empty_pct = (quality_stats['empty_values'][field] / field_counter[field]) * 100
                    print(f"      - {field}: {empty_pct:.1f}% empty values - consider exclusion or imputation")
            
            print(f"   📄 CSV Conversion Strategy:")
            print(f"      - Include all {len(always_present_fields)} always-present fields")
            print(f"      - Handle {len(optional_fields)} optional fields with default values")
            print(f"      - Implement error logging for malformed records")
            print(f"      - Use appropriate data types for each field")
            
            print(f"✅ NEXT STEPS:")
            print(f"   1. Review findings above")
            print(f"   2. Update notebook #3 with optimized processing logic")
            print(f"   3. Implement batch processing for memory efficiency")
            print(f"   4. Add comprehensive error handling and logging")
            print(f"   5. Test with full dataset after validation")
            
            print(f"🎯 Ready to proceed to notebook #3 optimization!")
            
            save_data_structure({
                "dataset_characteristics": {
                    "total_records": total_records,
                    "samples_analyzed": len(random_samples),
                    "total_unique_fields": total_fields,
                    "always_present_fields": always_present_fields,
                    "optional_fields": optional_fields
                },
                "processing_recommendations": {
                    "memory_management_needed": total_records > 100000,
                    "complex_fields_count": len(complex_fields) if 'complex_fields' in globals() else 0,
                    "high_empty_fields": high_empty_fields if 'quality_stats' in globals() else []
                }
            }, "Complete Analysis Summary", "Analysis Summary and Recommendations")
    else:
        print("📊 ANALYSIS SUMMARY AND RECOMMENDATIONS")
        print("=" * 60)
        
        total_fields = len(field_counter)
        always_present_fields = [f for f, c in field_counter.items() if c == len(random_samples)]
        optional_fields = [f for f, c in field_counter.items() if c < len(random_samples)]
        
        print(f"🔍 DATASET CHARACTERISTICS:")
        print(f"   • Total records in file: ~{total_records:,}")
        print(f"   • Samples analyzed: {len(random_samples)}")
        print(f"   • Total unique fields: {total_fields}")
        print(f"   • Always present fields: {len(always_present_fields)}")
        print(f"   • Optional fields: {len(optional_fields)}")
        
        print(f"💡 PROCESSING RECOMMENDATIONS:")
        
        if total_records > 100000:
            print(f"   🧠 Memory Management:")
            print(f"      - Use batch processing for large dataset ({total_records:,} records)")
            print(f"      - Consider chunked reading with pandas or custom iterator")
            print(f"      - Implement progress tracking for long operations")
        
        if 'complex_fields' in globals() and complex_fields:
            print(f"   🏗️  Complex Field Handling:")
            for field, complexity in complex_fields.items():
                if complexity == 'nested_object':
                    print(f"      - {field}: Flatten nested structure or extract key fields")
                elif complexity == 'array':
                    print(f"      - {field}: Consider array length, join elements, or extract features")
        
        if 'quality_stats' in globals():
            high_empty_fields = [f for f, c in quality_stats['empty_values'].items() if c > len(random_samples) * 0.1]
            if high_empty_fields:
                print(f"   🚨 Data Quality:")
                for field in high_empty_fields:
                    empty_pct = (quality_stats['empty_values'][field] / field_counter[field]) * 100
                    print(f"      - {field}: {empty_pct:.1f}% empty values - consider exclusion or imputation")
        
        print(f"   📄 CSV Conversion Strategy:")
        print(f"      - Include all {len(always_present_fields)} always-present fields")
        print(f"      - Handle {len(optional_fields)} optional fields with default values")
        print(f"      - Implement error logging for malformed records")
        print(f"      - Use appropriate data types for each field")
        
        print(f"✅ NEXT STEPS:")
        print(f"   1. Review findings above")
        print(f"   2. Update notebook #3 with optimized processing logic")
        print(f"   3. Implement batch processing for memory efficiency")
        print(f"   4. Add comprehensive error handling and logging")
        print(f"   5. Test with full dataset after validation")
        
        print(f"🎯 Ready to proceed to notebook #3 optimization!")
else:
    print("❌ No data available for analysis - check file path and permissions")

SECTION 12: ANALYSIS SUMMARY AND RECOMMENDATIONS
📊 ANALYSIS SUMMARY AND RECOMMENDATIONS
🔍 DATASET CHARACTERISTICS:
   • Total records in file: ~1,090,212
   • Samples analyzed: 200000
   • Total unique fields: 12
   • Always present fields: 11
   • Optional fields: 1
💡 PROCESSING RECOMMENDATIONS:
   🧠 Memory Management:
      - Use batch processing for large dataset (1,090,212 records)
      - Consider chunked reading with pandas or custom iterator
      - Implement progress tracking for long operations
   🏗️  Complex Field Handling:
      - agent: Flatten nested structure or extract key fields
      - destination: Flatten nested structure or extract key fields
      - elastic_agent: Flatten nested structure or extract key fields
      - network_traffic: Flatten nested structure or extract key fields
      - source: Flatten nested structure or extract key fields
      - network: Flatten nested structure or extract key fields
      - ecs: Flatten nested structure or extract key fields
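
The "flatten nested structure" recommendation above can be sketched as a small recursive helper. The name `flatten_record` and the dot separator are assumptions chosen for illustration; lists are left untouched and empty nested objects produce no columns.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along
            flat.update(flatten_record(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is under the joined key
            flat[full_key] = value
    return flat
```

Applied to the sample record shown in section 11, `source` would become columns such as `source.ip` and `source.port`, and `network_traffic` would become `network_traffic.flow.final` and `network_traffic.flow.id`.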
 

## 13. Output Files Summary

Summary of all generated output files for easy access and inspection.

In [15]:
print("📁 OUTPUT FILES GENERATED")
print("=" * 50)

if os.path.exists(analysis_outputs_dir):
    output_files = os.listdir(analysis_outputs_dir)
    
    print(f"📂 Analysis directory: {analysis_outputs_dir}")
    print(f"🏷️  Analysis type: {ANALYSIS_TYPE}")
    print()
    
    for file in sorted(output_files):
        file_path = os.path.join(analysis_outputs_dir, file)
        file_size = os.path.getsize(file_path)
        
        if file.endswith('.log'):
            print(f"📝 Log file: {file}")
            print(f"   Size: {file_size:,} bytes")
            print(f"   Purpose: Contains all console output from analysis sections")
            print(f"   Usage: View complete analysis results and debugging info")
            
        elif file.endswith('.json'):
            print(f"📊 Data file: {file}")
            print(f"   Size: {file_size:,} bytes")
            print(f"   Purpose: Contains complete sample data structures")
            print(f"   Usage: Detailed inspection of individual records")
        
        print()
    
    print("🔍 HOW TO USE THE OUTPUT FILES:")
    print(f"   1. Copy log file contents to view complete analysis output")
    print(f"   2. Copy JSON file contents to examine detailed sample structures")
    print(f"   3. Use these files to provide complete analysis results for review")
    print(f"   4. Compare with outputs from 3b structure consistency analysis")
    
    print(f"📋 TO ADD LOGGING TO OTHER CELLS:")
    print(f"   • Wrap code with: with capture_output(\"Section Name\", section_number):")
    print(f"   • Save data with: save_data_structure(data, \"description\", \"section\")")
    print(f"   • Both console output and files will be updated automatically")
    
else:
    print("❌ No analysis output directory found - logging system may not have initialized")
    print(f"Expected directory: {analysis_outputs_dir}")

print(f"✅ 3a exploratory analysis complete! Check {analysis_outputs_dir}/ directory for all output files.")
print(f"🔗 Related: Check outputs/network-flows/ for 3b structure consistency analysis outputs.")

📁 OUTPUT FILES GENERATED
📂 Analysis directory: outputs/3a-network-flows
🏷️  Analysis type: 3a-network-flows

📝 Log file: 3a-network-flows_exploratory_analysis_20250629_055400.log
   Size: 18,069 bytes
   Purpose: Contains all console output from analysis sections
   Usage: View complete analysis results and debugging info

📝 Log file: 3a-network-flows_exploratory_analysis_20250629_063341.log
   Size: 18,069 bytes
   Purpose: Contains all console output from analysis sections
   Usage: View complete analysis results and debugging info

📝 Log file: 3a-network-flows_exploratory_analysis_20250629_064000.log
   Size: 18,170 bytes
   Purpose: Contains all console output from analysis sections
   Usage: View complete analysis results and debugging info

📝 Log file: 3a-network-flows_exploratory_analysis_20250629_064223.log
   Size: 20,233 bytes
   Purpose: Contains all console output from analysis sections
   Usage: View complete analysis results and debugging info

📝 Log file: 3a-network-flow