## 1. Initial Setup: Logging System and Output Organization

**What this section does**: Sets up a comprehensive logging system to capture all analysis outputs in organized files for easy inspection and review.

**Simple explanation**: 
- Creates output directories for different analysis types (network-flows vs sysmon)
- Generates timestamped log files so you can track when analysis was run
- Sets up output capture so that anything printed inside a `capture_output(...)` block goes to the log file as well as the screen

**Paraphrasing tips**:
- "What type of data am I analyzing?" → Sets `ANALYSIS_TYPE` variable
- "Where will my results be saved?" → Creates organized folder structure in `outputs/`
- "How can I review what happened?" → Print statements inside a `capture_output(...)` block are saved to a timestamped log file

**Key insight**: This creates a professional logging infrastructure that makes it easy to review analysis results later and compare different runs.
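
The capture mechanism in the cell below can be illustrated in isolation. This is a minimal sketch, not the notebook's own code: `Tee` and `tee_stdout` are hypothetical names, and an in-memory `io.StringIO` stands in for the log file. It shows that only output printed inside the `with` block reaches the sink.

```python
import io
import sys
from contextlib import contextmanager

class Tee:
    """File-like object that forwards every write to two streams."""
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, text):
        self.primary.write(text)
        self.secondary.write(text)
        return len(text)

    def flush(self):
        self.primary.flush()
        self.secondary.flush()

@contextmanager
def tee_stdout(sink):
    """Temporarily mirror everything printed to stdout into `sink`."""
    original = sys.stdout
    sys.stdout = Tee(original, sink)
    try:
        yield
    finally:
        sys.stdout = original  # always restore, even if the block raises

sink = io.StringIO()  # stand-in for the log file
print("before the block: screen only")
with tee_stdout(sink):
    print("inside the block: screen and sink")
print("after the block: screen only")

captured_text = sink.getvalue()
```

Restoring `sys.stdout` in a `finally` clause matters in a notebook: without it, an exception inside the block would leave later cells printing into the log.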

In [1]:
import os
import sys
import json
from datetime import datetime
from contextlib import contextmanager

# Define analysis type and create organized output structure
ANALYSIS_TYPE = "3b-network-flows"  # This will be "sysmon" for sysmon analysis
outputs_base_dir = "outputs"
analysis_outputs_dir = f"{outputs_base_dir}/{ANALYSIS_TYPE}"

# Create output directories
if not os.path.exists(outputs_base_dir):
    os.makedirs(outputs_base_dir)
    print(f"✅ Created base outputs directory: {outputs_base_dir}")

if not os.path.exists(analysis_outputs_dir):
    os.makedirs(analysis_outputs_dir)
    print(f"✅ Created analysis outputs directory: {analysis_outputs_dir}")

# Generate timestamped filenames with descriptive names
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_filename = f"{analysis_outputs_dir}/{ANALYSIS_TYPE}_structure_analysis_{timestamp}.log"
results_filename = f"{analysis_outputs_dir}/{ANALYSIS_TYPE}_structure_results_{timestamp}.json"

print(f"📝 Analysis type: {ANALYSIS_TYPE}")
print(f"📁 Output directory: {analysis_outputs_dir}")
print(f"📊 Log file: {log_filename}")
print(f"💾 Results file: {results_filename}")

# Initialize log file with header
with open(log_filename, 'w', encoding='utf-8') as log_file:
    log_file.write("NETWORK FLOWS STRUCTURE CONSISTENCY ANALYSIS\n")
    log_file.write(f"Analysis Type: {ANALYSIS_TYPE.upper()}\n")
    log_file.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    log_file.write("Target File: -ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl\n")
    log_file.write(f"{'='*80}\n\n")

class LogCapture:
    def __init__(self, log_filename):
        self.log_filename = log_filename
        self.original_stdout = sys.stdout
    
    def write(self, text):
        self.original_stdout.write(text)
        self.original_stdout.flush()
        
        with open(self.log_filename, 'a', encoding='utf-8') as log_file:
            log_file.write(text)
            log_file.flush()
    
    def flush(self):
        self.original_stdout.flush()

log_capture = LogCapture(log_filename)

def log_section(section_name, section_number=None):
    header = f"\n{'='*80}\n"
    if section_number is not None:
        header += f"SECTION {section_number}: {section_name.upper()}\n"
    else:
        header += f"{section_name.upper()}\n"
    header += f"{'='*80}\n\n"
    
    with open(log_filename, 'a', encoding='utf-8') as log_file:
        log_file.write(header)
    
    print(header.strip())

@contextmanager
def capture_output(section_name, section_number=None):
    log_section(section_name, section_number)
    
    original_stdout = sys.stdout
    sys.stdout = log_capture
    
    try:
        yield
    finally:
        sys.stdout = original_stdout
        
        with open(log_filename, 'a', encoding='utf-8') as log_file:
            log_file.write(f"\n{'-'*60} END SECTION {'-'*60}\n\n")

print("✅ Logging system initialized for network-flows structure analysis!")
print(f"📂 All outputs will be organized in: {analysis_outputs_dir}/")
print(f"🔍 Use 'with capture_output(\"Section Name\", section_number):' to log outputs")

📝 Analysis type: 3b-network-flows
📁 Output directory: outputs/3b-network-flows
📊 Log file: outputs/3b-network-flows/3b-network-flows_structure_analysis_20250629_065016.log
💾 Results file: outputs/3b-network-flows/3b-network-flows_structure_results_20250629_065016.json
✅ Logging system initialized for network-flows structure analysis!
📂 All outputs will be organized in: outputs/3b-network-flows/
🔍 Use 'with capture_output("Section Name", section_number):' to log outputs


## 2. Schema Fingerprinting Functions: The Core Analysis Engine

**What this section does**: Defines the fundamental functions that detect and classify different data structure patterns in JSON records.

**Simple explanation**: 
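- `generate_schema_fingerprint()`: Creates a structural "fingerprint" for each record by MD5-hashing its field names and value types, so records with the same structure share a fingerprint regardless of their values (only the first element of each list is inspected)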
- `classify_structure_variations()`: Groups similar structures and labels them by frequency (PRIMARY, SECONDARY, VARIANT, etc.)
- `analyze_field_presence_patterns()`: Tracks which fields appear together and how often

**Paraphrasing tips**:
- "What patterns exist in my data?" → The fingerprinting function identifies unique structural patterns
- "How important is each pattern?" → Classification function ranks patterns by frequency and importance
- "Which fields always appear together?" → Field analysis reveals co-occurrence relationships

**Key insight**: These functions work together to transform chaotic JSON data into organized patterns that can be systematically processed.
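
To make the fingerprinting idea concrete, here is a condensed sketch of the same technique. It omits the `max_depth` guard used in the cell below, and the helper names `structure_of` and `fingerprint` as well as the three records are invented for illustration.

```python
import hashlib
import json

def structure_of(obj):
    """Reduce a JSON value to its shape: keys and type names, no values."""
    if isinstance(obj, dict):
        return {key: structure_of(obj[key]) for key in sorted(obj)}
    if isinstance(obj, list):
        # As in the notebook's function, only the first element is inspected
        return {"list_of": structure_of(obj[0])} if obj else "empty_list"
    return type(obj).__name__

def fingerprint(record):
    """Hash the shape so structurally identical records compare equal."""
    shape = json.dumps(structure_of(record), sort_keys=True)
    return hashlib.md5(shape.encode("utf-8")).hexdigest()

# Hypothetical records, invented for illustration
a = {"source": {"ip": "10.0.0.1", "port": 443}, "event": {"kind": "event"}}
b = {"event": {"kind": "alert"}, "source": {"port": 80, "ip": "10.0.0.2"}}
c = {"source": {"ip": "10.0.0.3"}, "event": {"kind": "event"}}

same_shape = fingerprint(a) == fingerprint(b)  # values and key order differ
diff_shape = fingerprint(a) != fingerprint(c)  # c has no source.port
print(same_shape, diff_shape)
```

Because the shape records type names, a field that is an integer in one record and `null` or a float in another yields two different fingerprints, which can inflate the pattern count.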

In [2]:
import json
import random
import os
import hashlib
from collections import Counter, defaultdict
from datetime import datetime
import pandas as pd
from pprint import pprint

def generate_schema_fingerprint(record, max_depth=10):
    """
    Generate a structural fingerprint for a JSON record using MD5 hashing.
    Returns both hash and detailed structure for analysis.
    """
    def extract_structure(obj, current_depth=0):
        if current_depth >= max_depth:
            return "max_depth_reached"
        
        if isinstance(obj, dict):
            # Sort keys for consistent hashing
            structure = {}
            for key in sorted(obj.keys()):
                structure[key] = extract_structure(obj[key], current_depth + 1)
            return structure
        elif isinstance(obj, list):
            if not obj:
                return "empty_list"
            # For lists, analyze first element structure and indicate it's a list
            return {"list_of": extract_structure(obj[0], current_depth + 1)}
        else:
            # Return type information for primitive values
            return type(obj).__name__
    
    # Extract the complete structure
    structure_detail = extract_structure(record)
    
    # Create a stable string representation for hashing
    structure_str = json.dumps(structure_detail, sort_keys=True)
    
    # Generate MD5 hash for the structure
    structure_hash = hashlib.md5(structure_str.encode('utf-8')).hexdigest()
    
    return structure_hash, structure_detail

def classify_structure_variations(structure_counts, total_samples):
    """
    Classify structure patterns based on frequency and characteristics.
    """
    classifications = {}
    
    for structure_hash, count in structure_counts.items():
        percentage = (count / total_samples) * 100
        
        if percentage >= 50:
            classification = "PRIMARY_SCHEMA"
        elif percentage >= 20:
            classification = "SECONDARY_SCHEMA"
        elif percentage >= 5:
            classification = "VARIANT"
        elif percentage >= 1:
            classification = "RARE_VARIANT"
        else:
            classification = "OUTLIER"
        
        classifications[structure_hash] = {
            'classification': classification,
            'count': count,
            'percentage': percentage
        }
    
    return classifications

def analyze_field_presence_patterns(records):
    """
    Analyze which fields appear together and count field frequency.
    """
    def extract_all_field_paths(obj, prefix=""):
        """Recursively extract all field paths from a nested object."""
        paths = []
        
        if isinstance(obj, dict):
            for key, value in obj.items():
                current_path = f"{prefix}.{key}" if prefix else key
                paths.append(current_path)
                
                if isinstance(value, (dict, list)):
                    paths.extend(extract_all_field_paths(value, current_path))
        elif isinstance(obj, list) and obj:
            # For lists, analyze the first element
            paths.extend(extract_all_field_paths(obj[0], prefix))
        
        return paths
    
    # Track field combinations and individual field counts
    field_combinations = Counter()
    field_counts = Counter()
    
    for record in records:
        # Get all field paths in this record
        field_paths = extract_all_field_paths(record)
        
        # Count individual fields
        for path in field_paths:
            field_counts[path] += 1
        
        # Count this specific combination of fields
        field_combo = tuple(sorted(field_paths))
        field_combinations[field_combo] += 1
    
    return field_combinations, field_counts

print("✅ Libraries imported successfully")
print("✅ All analysis functions defined")
print(f"📊 Analysis started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Libraries imported successfully
✅ All analysis functions defined
📊 Analysis started at: 2025-06-29 06:50:16


## 3. Data Loading and Sampling: Smart Data Collection Strategy

**What this section does**: Loads a sample of records from the large JSONL file. The notebook calls its approach stratified sampling; in practice it takes every nth record (plus a few random extras) so that the sample is spread across the file instead of coming only from the beginning.

**Simple explanation**: 
- Counts total records in the file first (1,090,212 in this run)
- Picks every nth record, plus a small random supplement, so records come from throughout the file, not just the beginning
- Collects up to `num_samples_to_analyze` records (200,000 in this run, about 18% of the file); sampling stops once that count is reached, so the tail of the file can go unsampled

**Paraphrasing tips**:
- "How big is my dataset?" → Counts total records to understand scope
- "How do I avoid bias?" → Interval-based sampling spreads the sample across the file
- "What's a good sample size?" → This run uses 200,000 records; a larger sample makes rare patterns more likely to show up

**Key insight**: Smart sampling is crucial for large datasets - taking records from throughout the file prevents bias from temporal changes or data collection patterns.
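
As a point of comparison, reservoir sampling (Algorithm R) draws a uniform random sample in a single pass and gives every record, including those at the end of the file, the same chance of selection. This is an alternative technique, not the method used in the cell below; the function name and the in-memory test data are assumptions made for the sketch.

```python
import io
import json
import random

def reservoir_sample_jsonl(lines, sample_size, seed=42):
    """Uniformly sample up to `sample_size` JSON records in one pass."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of valid records encountered so far
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines, as the notebook does
        seen += 1
        if len(reservoir) < sample_size:
            reservoir.append(record)
        else:
            # Keep the new record with probability sample_size / seen
            j = rng.randrange(seen)
            if j < sample_size:
                reservoir[j] = record
    return reservoir, seen

# Hypothetical in-memory "file" with 1,000 records
fake_file = io.StringIO("".join(json.dumps({"id": i}) + "\n" for i in range(1000)))
sample, total = reservoir_sample_jsonl(fake_file, sample_size=50)
print(len(sample), total)
```

The trade-off is that a reservoir sample is not ordered by position in the file, whereas interval-based sampling preserves file order and needs no random state beyond the seed.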

In [3]:
num_samples_to_analyze = 200_000

In [4]:
# Configuration
jsonl_file_path = '-ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl'
SAMPLE_SIZE = num_samples_to_analyze  # Number of records to sample (200,000 in this run)
random.seed(42)  # For reproducible results

def stratified_sample_jsonl(file_path, sample_size):
    """
    Perform stratified sampling across the file to get better representation.
    """
    # First pass: count total records
    print("📊 Counting total records...")
    total_records = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                total_records += 1
    
    print(f"📈 Found {total_records:,} total records")
    
    # Calculate sampling interval for even distribution
    interval = max(1, total_records // sample_size)
    print(f"🎯 Sampling every {interval} records for stratified coverage")
    
    # Second pass: collect stratified samples
    samples = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            if not line.strip():
                continue
            
            # Take every nth record plus some random samples
            if (line_num % interval == 0) or (random.random() < 0.0001):
                try:
                    record = json.loads(line)
                    samples.append(record)
                    
                    if len(samples) >= sample_size:
                        break
                        
                except json.JSONDecodeError:
                    continue
    
    print(f"✅ Collected {len(samples)} samples for analysis")
    return samples, total_records

# Load samples with logging
with capture_output("Data Loading and Sampling", 3):
    print(f"🔄 Loading samples from: {jsonl_file_path}")
    sample_records, total_file_records = stratified_sample_jsonl(jsonl_file_path, SAMPLE_SIZE)
    
    print(f"📊 SAMPLING SUMMARY:")
    print(f"   • Total file records: {total_file_records:,}")
    print(f"   • Samples collected: {len(sample_records):,}")
    print(f"   • Coverage ratio: {(len(sample_records)/total_file_records)*100:.2f}%")

SECTION 3: DATA LOADING AND SAMPLING
🔄 Loading samples from: -ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl
📊 Counting total records...
📈 Found 1,090,212 total records
🎯 Sampling every 5 records for stratified coverage
✅ Collected 200000 samples for analysis
📊 SAMPLING SUMMARY:
   • Total file records: 1,090,212
   • Samples collected: 200,000
   • Coverage ratio: 18.35%


## 4. Structure Pattern Detection: Discovering Hidden Data Patterns

**What this section does**: Processes all sample records to generate structural fingerprints and discovers how many unique patterns exist in the dataset.

**Simple explanation**: 
- Runs each record through the fingerprinting function to identify its structure
- Counts how many times each unique pattern appears
- Classifies patterns by frequency (OUTLIER, RARE_VARIANT, SECONDARY_SCHEMA, etc.)

**Paraphrasing tips**:
- "What patterns exist and how frequent are they?" → Discovers unique structural patterns and their occurrence rates
- "Is my data consistent or chaotic?" → Pattern count reveals dataset complexity
- "Which patterns matter most?" → Classification shows which patterns to focus on

**Key insight**: This analysis reveals whether your dataset has consistent structure (few patterns) or high variability (many patterns), which determines your processing strategy.
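
The frequency thresholds can be checked by hand. The sketch below restates the thresholds from `classify_structure_variations()` as a standalone helper (the name `classify` is an assumption) and applies them to the three pattern counts reported in Section 5's output.

```python
def classify(count, total):
    """Map a pattern's share of the sample to the notebook's frequency labels."""
    pct = count / total * 100
    if pct >= 50:
        return "PRIMARY_SCHEMA"
    if pct >= 20:
        return "SECONDARY_SCHEMA"
    if pct >= 5:
        return "VARIANT"
    if pct >= 1:
        return "RARE_VARIANT"
    return "OUTLIER"

total = 200_000
# Counts of the three most frequent patterns, as reported in Section 5's output
for count in (67_260, 67_235, 54_406):
    print(f"{count:,} -> {count / total * 100:.2f}% -> {classify(count, total)}")
```

All three land between 20% and 50%, so each is labeled `SECONDARY_SCHEMA` and no pattern in this run qualifies as `PRIMARY_SCHEMA`.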

In [5]:
with capture_output("Structure Pattern Detection", 4):
    print("🔍 GENERATING STRUCTURE FINGERPRINTS")
    print("=" * 50)
    
    # Generate fingerprints for all samples
    structure_fingerprints = {}
    structure_examples = {}
    structure_counts = Counter()
    
    print("📋 Processing samples...")
    for i, record in enumerate(sample_records):
        if i % 1000 == 0:
            print(f"   Processed {i:,} / {len(sample_records):,} records")
        
        # Generate both hash and full structure
        structure_hash, structure_detail = generate_schema_fingerprint(record, max_depth=5)
        
        # Store first example of each structure type
        if structure_hash not in structure_examples:
            structure_examples[structure_hash] = record
            structure_fingerprints[structure_hash] = structure_detail
        
        # Count occurrences
        structure_counts[structure_hash] += 1
    
    print("✅ Fingerprinting complete!")
    print("📊 STRUCTURE ANALYSIS RESULTS:")
    print(f"   • Unique structure patterns found: {len(structure_counts)}")
    print(f"   • Records analyzed: {len(sample_records):,}")
    print(f"   • Most common pattern frequency: {structure_counts.most_common(1)[0][1]:,} records")
    
    # Classify patterns by frequency
    classifications = classify_structure_variations(structure_counts, len(sample_records))
    
    print("")
    print("🏷️  PATTERN CLASSIFICATION:")
    classification_summary = Counter()
    for _, info in classifications.items():
        classification_summary[info['classification']] += 1
    
    for class_type, count in classification_summary.most_common():
        print(f"   • {class_type}: {count} patterns")

SECTION 4: STRUCTURE PATTERN DETECTION
🔍 GENERATING STRUCTURE FINGERPRINTS
📋 Processing samples...
   Processed 0 / 200,000 records
   Processed 1,000 / 200,000 records
   Processed 2,000 / 200,000 records
   ... (progress lines continue; output truncated in source)
   

## 5. Detailed Pattern Analysis: Understanding Each Structure Type

**What this section does**: Examines each discovered pattern in detail, showing frequency, characteristics, and example records to understand what makes each pattern unique.

**Simple explanation**: 
- Sorts patterns from most common to least common
- Shows detailed characteristics for each pattern (top-level fields, optional elements)
- Counts how many structural elements each pattern is missing or adds relative to the most common pattern
- Provides compact structure examples for visual inspection

**Paraphrasing tips**:
- "What are the characteristics and differences of each pattern?" → Detailed breakdown of pattern features
- "Which patterns are most important?" → Frequency-based ranking shows priority
- "How do patterns differ from each other?" → Comparative analysis reveals structural variations

**Key insight**: This deep dive helps you understand not just THAT patterns exist, but WHY they're different and what those differences mean for processing.
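
The cell below reports only how many elements differ between patterns. If the names of the differing fields are needed, one option is to flatten each structure into dotted paths and compare the resulting sets. This is a sketch of that alternative, not the notebook's approach; `leaf_paths`, `diff_shapes`, and both example shapes are hypothetical.

```python
def leaf_paths(shape, prefix=""):
    """Flatten a nested shape dict into a set of 'dotted.path:type' strings."""
    if isinstance(shape, dict):
        paths = set()
        for key, value in shape.items():
            path = f"{prefix}.{key}" if prefix else key
            paths |= leaf_paths(value, path)
        # Keep empty dicts visible instead of dropping them
        return paths or {f"{prefix}:empty_dict"}
    return {f"{prefix}:{shape}"}

def diff_shapes(primary, other):
    """Return (missing, extra) paths of `other` relative to `primary`."""
    p, o = leaf_paths(primary), leaf_paths(other)
    return sorted(p - o), sorted(o - p)

# Hypothetical shapes in the form produced by the fingerprinting step
primary = {"source": {"ip": "str", "process": {"pid": "int"}}, "process": {"pid": "int"}}
other = {"source": {"ip": "str", "bytes": "int"}}

missing, extra = diff_shapes(primary, other)
print("Missing:", missing)
print("Extra:", extra)
```

Including the type name in each path means a field whose type changes between patterns shows up as both missing and extra, which makes type drift visible.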

In [6]:
with capture_output("Detailed Pattern Analysis", 5):
    print("📋 DETAILED PATTERN ANALYSIS")
    print("=" * 60)
    
    # Sort patterns by frequency for analysis
    sorted_patterns = structure_counts.most_common()
    
    for i, (structure_hash, count) in enumerate(sorted_patterns[:10], 1):
        classification_info = classifications[structure_hash]
        
        print(f"🔍 PATTERN #{i} - {classification_info['classification']}")
        print("-" * 40)
        print(f"📊 Frequency: {count:,} records ({classification_info['percentage']:.2f}%)")
        print(f"🔑 Structure Hash: {structure_hash[:16]}...")
        
        # Get example record
        example_record = structure_examples[structure_hash]
        
        # Analyze top-level structure
        top_level_fields = list(example_record.keys())
        print(f"📝 Top-level fields ({len(top_level_fields)}): {', '.join(sorted(top_level_fields))}")
        
        # Check for optional fields
        optional_indicators = []
        for field in top_level_fields:
            if field == 'process':
                optional_indicators.append(f"process: {type(example_record[field]).__name__}")
            elif isinstance(example_record[field], dict) and not example_record[field]:
                optional_indicators.append(f"{field}: empty_dict")
            elif field == 'source' and isinstance(example_record[field], dict) and 'process' in example_record[field]:
                optional_indicators.append("source.process: present")
        
        if optional_indicators:
            print(f"🔧 Notable characteristics: {', '.join(optional_indicators)}")
        
        # Show structure differences for secondary patterns
        if i > 1:
            primary_structure = structure_fingerprints[sorted_patterns[0][0]]
            current_structure = structure_fingerprints[structure_hash]
            
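            # Crude difference detection: compares whitespace-separated tokens of each
            # structure's string form, so the counts below are token counts, not field counts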
            primary_fields = set(str(primary_structure).split())
            current_fields = set(str(current_structure).split())
            
            missing_fields = primary_fields - current_fields
            extra_fields = current_fields - primary_fields
            
            if missing_fields or extra_fields:
                print("🔄 Differences from Pattern #1:")
                if missing_fields:
                    print(f"   Missing: {len(missing_fields)} structural elements")
                if extra_fields:
                    print(f"   Extra: {len(extra_fields)} structural elements")
        
        print("")  # Add spacing between patterns
    
    # Show complete structure for top 3 patterns (compact view)
    print("")
    print("📄 COMPLETE STRUCTURE EXAMPLES")
    print("=" * 60)
    
    def show_compact_structure(obj, indent=0, max_depth=2):
        """Show a compact view of structure for logging"""
        prefix = "  " * indent
        if isinstance(obj, dict):
            for key, value in obj.items():
                if isinstance(value, dict):
                    print(f"{prefix}{key}: {{dict with {len(value)} fields}}")
                    if indent < max_depth:  # Limit depth for readability
                        show_compact_structure(value, indent + 1, max_depth)
                elif isinstance(value, list):
                    if value:
                        print(f"{prefix}{key}: [array with {len(value)} items, first: {type(value[0]).__name__}]")
                    else:
                        print(f"{prefix}{key}: [empty array]")
                else:
                    value_preview = str(value)[:30] + "..." if len(str(value)) > 30 else str(value)
                    print(f"{prefix}{key}: {type(value).__name__} = {value_preview}")
        elif isinstance(obj, list):
            print(f"{prefix}[{len(obj)} items]")
    
    for i, (structure_hash, count) in enumerate(sorted_patterns[:3], 1):
        print(f"🏗️  PATTERN #{i} EXAMPLE RECORD:")
        print("-" * 30)
        example = structure_examples[structure_hash]
        show_compact_structure(example)
        print("")

SECTION 5: DETAILED PATTERN ANALYSIS
📋 DETAILED PATTERN ANALYSIS
🔍 PATTERN #1 - SECONDARY_SCHEMA
----------------------------------------
📊 Frequency: 67,260 records (33.63%)
🔑 Structure Hash: 936748b80e5c31b9...
📝 Top-level fields (12): @timestamp, agent, data_stream, destination, ecs, elastic_agent, event, host, network, network_traffic, process, source
🔧 Notable characteristics: process: dict, source.process: present

🔍 PATTERN #2 - SECONDARY_SCHEMA
----------------------------------------
📊 Frequency: 67,235 records (33.62%)
🔑 Structure Hash: 0e116a9cdc0848e1...
📝 Top-level fields (11): @timestamp, agent, data_stream, destination, ecs, elastic_agent, event, host, network, network_traffic, source
🔄 Differences from Pattern #1:
   Missing: 9 structural elements
   Extra: 1 structural elements

🔍 PATTERN #3 - SECONDARY_SCHEMA
----------------------------------------
📊 Frequency: 54,406 records (27.20%)
🔑 Structure Hash: 955584995c34862b...
📝 Top-level fields (12): @timestamp, agent, d

## 6. Field Co-occurrence Analysis: Discovering Field Relationships

**What this section does**: Analyzes which fields appear together in records and identifies conditional field relationships to understand data dependencies.

**Simple explanation**: 
- Extracts all possible field paths from nested JSON structures
- Counts how often each field appears across all records
- Identifies which fields are always present vs. optional/conditional
- Discovers field combinations that appear together

**Paraphrasing tips**:
- "Which fields always appear together?" → Field co-occurrence reveals data relationships
- "What fields are optional vs. required?" → Presence analysis shows field reliability
- "Are there conditional dependencies?" → Some fields only appear when others are present

**Key insight**: Understanding field relationships is crucial for robust data processing - you need to know which fields you can count on and which require null handling.
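
A conditional dependency can be quantified as the share of records containing one field among those that contain another. The sketch below estimates P(`source.process` present | `process` present) on a handful of invented records; `has_path` and `conditional_presence` are hypothetical helpers and, unlike the notebook's path extractor, they do not descend into lists.

```python
def has_path(record, dotted_path):
    """Return True if every key along `dotted_path` exists in nested dicts."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

def conditional_presence(records, field, given):
    """Estimate P(field present | given present); None if `given` never occurs."""
    with_given = [r for r in records if has_path(r, given)]
    if not with_given:
        return None
    both = sum(1 for r in with_given if has_path(r, field))
    return both / len(with_given)

# Hypothetical records, invented for illustration
records = [
    {"process": {"pid": 1}, "source": {"process": {"pid": 1}}},
    {"process": {"pid": 2}, "source": {"ip": "10.0.0.1"}},
    {"source": {"ip": "10.0.0.2"}},
    {"process": {"pid": 3}, "source": {"process": {"pid": 3}}},
]
p = conditional_presence(records, field="source.process", given="process")
print(f"P(source.process | process) = {p:.2f}")
```

A value near 1.0 would suggest that `source.process` can be treated as present whenever `process` is; a lower value means null handling is still needed.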

In [7]:
with capture_output("Field Co-occurrence Analysis", 6):
    print("🔗 FIELD CO-OCCURRENCE ANALYSIS")
    print("=" * 50)
    
    # Analyze field presence patterns
    field_combinations, field_counts = analyze_field_presence_patterns(sample_records)
    
    print("📊 Field presence analysis:")
    print(f"   • Total unique field paths: {len(field_counts)}")
    print(f"   • Unique field combinations: {len(field_combinations)}")
    
    # Show most common fields
    print("")
    print("📈 Most common fields (top 20):")
    total_samples = len(sample_records)
    for field, count in field_counts.most_common(20):
        percentage = (count / total_samples) * 100
        print(f"   {field:<40} {count:>6,} ({percentage:>5.1f}%)")
    
    # Identify conditional fields
    print("")
    print("🔧 Conditional/Optional fields (present in <95% of records):")
    conditional_fields = []
    for field, count in field_counts.items():
        percentage = (count / total_samples) * 100
        if percentage < 95:
            conditional_fields.append((field, count, percentage))
    
    # Sort conditional fields by rarity
    conditional_fields.sort(key=lambda x: x[2])
    
    for field, count, percentage in conditional_fields:
        print(f"   {field:<40} {count:>6,} ({percentage:>5.1f}%)")
    
    # Analyze field combinations for insights
    print("")
    print("🎯 Most common field combinations (top 10):")
    for combination, count in field_combinations.most_common(10):
        percentage = (count / total_samples) * 100
        combination_size = len(combination)
        print(f"   Combination with {combination_size:2d} fields: {count:>6,} records ({percentage:>5.1f}%)")
    
    # Process field relationships
    print("")
    print("🔍 Key insights:")
    
    # Check process field correlation
    process_present = field_counts.get('process', 0)
    process_percentage = (process_present / total_samples) * 100
    print(f"   • 'process' field appears in {process_percentage:.1f}% of records")
    
    # Check source.process correlation  
    source_process_present = field_counts.get('source.process', 0)
    source_process_percentage = (source_process_present / total_samples) * 100
    print(f"   • 'source.process' appears in {source_process_percentage:.1f}% of records")
    
    # Note when both process fields occur; no conditional probability is computed here
    if process_present > 0 and source_process_present > 0:
        print("   • Both process fields are present - structural relationship worth investigating")

SECTION 6: FIELD CO-OCCURRENCE ANALYSIS
🔗 FIELD CO-OCCURRENCE ANALYSIS
📊 Field presence analysis:
   • Total unique field paths: 89
   • Unique field combinations: 11

📈 Most common fields (top 20):
   agent                                    200,000 (100.0%)
   agent.name                               200,000 (100.0%)
   agent.id                                 200,000 (100.0%)
   agent.type                               200,000 (100.0%)
   agent.ephemeral_id                       200,000 (100.0%)
   agent.version                            200,000 (100.0%)
   destination                              200,000 (100.0%)
   destination.mac                          200,000 (100.0%)
   elastic_agent                            200,000 (100.0%)
   elastic_agent.id                         200,000 (100.0%)
   elastic_agent.version                    200,000 (100.0%)
   elastic_agent.snapshot                   200,000 (100.0%)
   network_traffic                          200,000 (100.0%)
   netwo

## 7. Schema Consistency Report: Overall Data Quality Assessment

**What this section does**: Generates a comprehensive report that evaluates overall data consistency and provides specific recommendations for processing strategies.

**Simple explanation**: 
- Calculates consistency metrics (how uniform is the data structure?)
- Determines coverage percentages (how much data do top patterns represent?)
- Assigns an overall consistency rating (HIGHLY CONSISTENT → HIGHLY VARIABLE)
- Provides specific processing recommendations based on findings

**Paraphrasing tips**:
- "How consistent is my data overall?" → Consistency metrics and coverage analysis
- "Should I use one pipeline or multiple?" → Processing strategy recommendations
- "What's the best approach for this dataset?" → Executive summary with actionable insights

**Key insight**: This section translates technical analysis into business decisions - telling you exactly how to approach processing this specific dataset based on its structural characteristics.
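
The coverage metric is a simple ratio, so it can be reproduced from the counts already shown. The sketch below uses the three pattern counts from Section 5's output; `top_n_coverage` is a hypothetical helper, and the remaining patterns are lumped under a placeholder key because their individual counts are not shown.

```python
from collections import Counter

def top_n_coverage(pattern_counts, total, n):
    """Share of the sample covered by the n most frequent patterns."""
    covered = sum(count for _, count in pattern_counts.most_common(n))
    return covered / total

# Counts of the three most frequent patterns from Section 5's output;
# everything else is lumped together under a placeholder key.
total = 200_000
counts = Counter({"pattern_1": 67_260, "pattern_2": 67_235, "pattern_3": 54_406})
counts["all_other_patterns"] = total - sum(counts.values())

top3 = top_n_coverage(counts, total, 3)
print(f"Top 3 coverage: {top3 * 100:.2f}%")
print("Exceeds 95% threshold:", top3 > 0.95)
```

On these figures the top three patterns cover just under 95% of the sample, so the "HIGHLY CONSISTENT" branch in the cell below would not be taken for this run.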

In [8]:
with capture_output("Schema Consistency Report", 7):
    print("📊 SCHEMA CONSISTENCY REPORT")
    print("=" * 60)
    
    # Calculate consistency metrics
    total_patterns = len(structure_counts)
    total_samples = len(sample_records)
    primary_patterns = sum(1 for _, info in classifications.items() 
                          if info['classification'] in ['PRIMARY_SCHEMA', 'SECONDARY_SCHEMA'])
    
    # Calculate coverage of top patterns
    sorted_patterns = structure_counts.most_common()
    top_3_coverage = sum(count for _, count in sorted_patterns[:3])
    top_5_coverage = sum(count for _, count in sorted_patterns[:5])
    top_10_coverage = sum(count for _, count in sorted_patterns[:10])
    
    print("🔍 CONSISTENCY METRICS:")
    print(f"   • Total unique structure patterns: {total_patterns}")
    print(f"   • Primary/Secondary patterns: {primary_patterns}")
    print(f"   • Structure diversity ratio: {(total_patterns/total_samples)*100:.3f}%")
    print("")
    print("📈 COVERAGE ANALYSIS:")
    print(f"   • Top 3 patterns cover: {(top_3_coverage/total_samples)*100:.1f}% of data")
    print(f"   • Top 5 patterns cover: {(top_5_coverage/total_samples)*100:.1f}% of data")
    print(f"   • Top 10 patterns cover: {(top_10_coverage/total_samples)*100:.1f}% of data")
    
    # Determine overall consistency level
    if (top_3_coverage/total_samples) > 0.95:
        consistency_level = "HIGHLY CONSISTENT"
        consistency_color = "🟢"
    elif (top_5_coverage/total_samples) > 0.90:
        consistency_level = "MODERATELY CONSISTENT"
        consistency_color = "🟡"
    elif (top_10_coverage/total_samples) > 0.80:
        consistency_level = "SOMEWHAT VARIABLE"
        consistency_color = "🟠"
    else:
        consistency_level = "HIGHLY VARIABLE"
        consistency_color = "🔴"
    
    print("")
    print(f"{consistency_color} OVERALL ASSESSMENT: {consistency_level}")
    
    # Generate recommendations based on findings
    print("")
    print("💡 PROCESSING RECOMMENDATIONS:")
    
    if (top_3_coverage/total_samples) > 0.95:
        print("   ✅ Excellent consistency - can use single processing pipeline")
        print("   ✅ Focus on top 3 patterns for 95%+ coverage")
        print("   ✅ Simple fallback handling for rare variants")
    elif (top_5_coverage/total_samples) > 0.90:
        print("   🟡 Good consistency - recommend dual processing pipelines")
        print("   🟡 Primary pipeline for top 3 patterns")
        print("   🟡 Secondary pipeline for patterns 4-5")
        print("   🟡 Error handling for remaining variants")
    else:
        print("   🟠 Variable structure - requires flexible processing")
        print("   🟠 Implement schema-agnostic field extraction")
        print("   🟠 Use pattern matching for different record types")
        print("   🟠 Extensive error handling and validation needed")
    
    # Specific field handling recommendations
    print("")
    print("🛠️  FIELD HANDLING STRATEGIES:")
    
    # Fields present in at least 95% of records
    always_present_fields = [field for field, count in field_counts.items() 
                            if (count / total_samples) >= 0.95]
    print(f"   • Near-universal fields, >=95% of records ({len(always_present_fields)}): Standard extraction")
    
    # Conditional fields  
    conditional_field_count = len([field for field, count in field_counts.items() 
                                  if 0.05 < (count / total_samples) < 0.95])
    print(f"   • Conditional fields ({conditional_field_count}): Null handling required")
    
    # Rare fields
    rare_field_count = len([field for field, count in field_counts.items() 
                           if (count / total_samples) <= 0.05])
    print(f"   • Rare fields ({rare_field_count}): Consider exclusion or special handling")
    
    # Generate final summary
    print("")
    print("🎯 EXECUTIVE SUMMARY:")
    print(f"   Dataset shows {consistency_level.lower()} structure with {total_patterns} unique patterns.")
    print(f"   Top {min(3, total_patterns)} patterns handle {(top_3_coverage/total_samples)*100:.1f}% of records.")
    print(f"   Recommended approach: {'Single pipeline' if total_patterns <= 3 else 'Multi-pattern pipeline'} processing.")
    
    # Save detailed results for reference
    print("")
    print("💾 Saving detailed analysis results...")
    analysis_results = {
        'analysis_metadata': {
            'analysis_type': ANALYSIS_TYPE,
            'timestamp': timestamp,
            'total_file_records': total_file_records,
            'target_file': jsonl_file_path
        },
        'consistency_metrics': {
            'total_patterns': total_patterns,
            'total_samples': total_samples,
            'consistency_level': consistency_level,
            'diversity_ratio': (total_patterns/total_samples)*100
        },
        'coverage_analysis': {
            'top_3_coverage_pct': (top_3_coverage/total_samples)*100,
            'top_5_coverage_pct': (top_5_coverage/total_samples)*100,
            'top_10_coverage_pct': (top_10_coverage/total_samples)*100
        },
        'top_patterns': [
            {
                'pattern_hash': hash_val,
                'count': count,
                'percentage': classifications[hash_val]['percentage'],
                'classification': classifications[hash_val]['classification']
            } for hash_val, count in sorted_patterns[:10]
        ],
        'field_statistics': dict(list(field_counts.most_common(50))),
        'processing_recommendations': {
            'pipeline_strategy': 'Single pipeline' if total_patterns <= 3 else 'Multi-pattern pipeline',
            'always_present_field_count': len(always_present_fields),
            'conditional_field_count': conditional_field_count,
            'rare_field_count': rare_field_count
        }
    }
    
    # Save to organized output directory
    with open(results_filename, 'w') as f:
        json.dump(analysis_results, f, indent=2, default=str)
    
    print(f"✅ Results saved to: {results_filename}")
    print(f"📁 Output directory: {analysis_outputs_dir}")
    print(f"🎉 {ANALYSIS_TYPE} structure consistency analysis complete!")

SECTION 7: SCHEMA CONSISTENCY REPORT
📊 SCHEMA CONSISTENCY REPORT
🔍 CONSISTENCY METRICS:
   • Total unique structure patterns: 14
   • Primary/Secondary patterns: 3
   • Structure diversity ratio: 0.007%

📈 COVERAGE ANALYSIS:
   • Top 3 patterns cover: 94.5% of data
   • Top 5 patterns cover: 98.6% of data
   • Top 10 patterns cover: 99.8% of data

🟡 OVERALL ASSESSMENT: MODERATELY CONSISTENT

💡 PROCESSING RECOMMENDATIONS:
   🟡 Good consistency - recommend dual processing pipelines
   🟡 Primary pipeline for top 3 patterns
   🟡 Secondary pipeline for patterns 4-5
   🟡 Error handling for remaining variants

🛠️  FIELD HANDLING STRATEGIES:
   • Always present fields (64): Standard extraction
   • Conditional fields (17): Null handling required
   • Rare fields (8): Consider exclusion or special handling

🎯 EXECUTIVE SUMMARY:
   Dataset shows moderately consistent structure with 14 unique patterns.
   Top 3 patterns handle 94.5% of records.
   Recommended approach: Multi-pattern pipeline processing.
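
The field-handling buckets and coverage percentages in this report come down to two small calculations. The following is a minimal sketch of them, assuming the field and pattern counts are held in `collections.Counter` objects (the notebook's `most_common` call suggests this for `field_counts`). The function names `classify_fields` and `top_n_coverage_pct` and the toy counts are illustrative and do not appear in the notebook; the 95% and 5% cut-offs mirror the ones used in the cell above.

```python
from collections import Counter


def classify_fields(field_counts, total_samples, high=0.95, low=0.05):
    """Split fields into always-present, conditional, and rare buckets.

    A field is "always present" at >= high, "rare" at <= low, and
    "conditional" in between (same boundaries as the report cell).
    """
    buckets = {"always_present": [], "conditional": [], "rare": []}
    if total_samples <= 0:
        return buckets  # nothing sampled, nothing to classify
    for field, count in field_counts.items():
        ratio = count / total_samples
        if ratio >= high:
            buckets["always_present"].append(field)
        elif ratio > low:
            buckets["conditional"].append(field)
        else:
            buckets["rare"].append(field)
    return buckets


def top_n_coverage_pct(pattern_counts, n):
    """Return the percentage of records covered by the n most common patterns."""
    total = sum(pattern_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in pattern_counts.most_common(n))
    return 100 * covered / total


# Toy data for illustration only (not taken from the dataset)
toy_fields = Counter({"@timestamp": 100, "process": 40, "dns": 3})
toy_patterns = Counter({"hash_a": 60, "hash_b": 30, "hash_c": 10})

print(classify_fields(toy_fields, total_samples=100))
print(f"Top 2 coverage: {top_n_coverage_pct(toy_patterns, 2):.1f}%")
```

Keeping the thresholds as parameters makes it easy to check how sensitive the bucket sizes are to the chosen cut-offs before committing to a pipeline design.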

## 8. Pattern-Specific Processing Strategies: Implementation Roadmap

**What this section does**: Generates a field extraction strategy and a processing priority for each of the five most common patterns, then prints an implementation roadmap for building the data processing pipelines.

**Simple explanation**: 
- Creates specific extraction strategies for each pattern type
- Identifies core fields vs. optional fields for each pattern
- Assigns processing priorities (HIGH, MEDIUM, LOW, MINIMAL)
- Provides a step-by-step implementation roadmap

**Paraphrasing tips**:
- "How should I process each pattern type?" → Pattern-specific extraction strategies
- "What fields are important for each pattern?" → Core vs. optional field identification
- "In what order should I implement support?" → Priority-based implementation roadmap

**Key insight**: This section turns the analysis results into engineering tasks: a prioritized plan for building pipelines that process the dominant patterns directly and route the remaining patterns to fallback handling.
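
As a rough illustration of how these strategies could translate into pipeline code, the sketch below pairs the priority tiers with a null-safe lookup for optional and nested fields. It is a sketch under stated assumptions, not the notebook's implementation: `assign_priority` and `get_nested` are hypothetical helper names, the sample record is invented, and the 50/20/5 percent thresholds are copied from the cell that follows.

```python
def assign_priority(percentage):
    """Map a pattern's share of records (in percent) to a priority tier."""
    if percentage >= 50:
        return "HIGH"
    if percentage >= 20:
        return "MEDIUM"
    if percentage >= 5:
        return "LOW"
    return "MINIMAL"


def get_nested(record, dotted_path, default=None):
    """Walk a dotted path such as 'source.process' through nested dicts.

    Returns `default` when any step is missing or is not a dict, which
    gives optional fields the null handling they need.
    """
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Invented record for illustration only (not taken from the dataset)
sample_record = {
    "@timestamp": "2024-01-01T00:00:00Z",
    "source": {"ip": "10.0.0.5", "process": {"name": "curl"}},
    "host": {"os": {"family": "linux"}},
}

for path in ("@timestamp", "source.process.name", "host.os.family", "process.pid"):
    print(f"{path}: {get_nested(sample_record, path)}")

print(f"Priority for a 33.63% pattern: {assign_priority(33.63)}")
```

Returning a default instead of raising keeps one extraction routine usable across patterns that differ only in which optional fields they carry.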

In [9]:
print("🛠️  PATTERN-SPECIFIC PROCESSING STRATEGIES")
print("=" * 60)

# Generate processing strategies for top patterns
for i, (structure_hash, count) in enumerate(sorted_patterns[:5], 1):
    classification_info = classifications[structure_hash]
    percentage = classification_info['percentage']
    
    print(f"\n📋 PATTERN #{i} PROCESSING STRATEGY")
    print("-" * 40)
    print(f"📊 Coverage: {count:,} records ({percentage:.2f}%)")
    print(f"🏷️  Classification: {classification_info['classification']}")
    
    # Get example for analysis
    example = structure_examples[structure_hash]
    
    # Generate field extraction strategy
    print("🔧 Recommended extraction strategy:")
    
    # Core fields that should always be extracted
    core_fields = ['@timestamp', 'agent', 'destination', 'source', 'network', 'event']
    present_core = [field for field in core_fields if field in example]
    print(f"   • Core fields ({len(present_core)}): {', '.join(present_core)}")
    
    # Optional fields with null handling
    optional_fields = []
    if 'process' in example:
        optional_fields.append('process')
    if 'source' in example and isinstance(example['source'], dict) and 'process' in example['source']:
        optional_fields.append('source.process')
    
    if optional_fields:
        print(f"   • Optional fields: {', '.join(optional_fields)} (null handling required)")
    
    # Nested extraction recommendations (check for dicts so a string value is not searched by substring)
    nested_extractions = []
    if isinstance(example.get('host'), dict) and 'os' in example['host']:
        nested_extractions.append('host.os.*')
    if isinstance(example.get('network_traffic'), dict) and 'flow' in example['network_traffic']:
        nested_extractions.append('network_traffic.flow.*')
    
    if nested_extractions:
        print(f"   • Nested extractions: {', '.join(nested_extractions)}")
    
    # Priority level for processing
    if percentage >= 50:
        priority = "HIGH - Primary processing pipeline"
    elif percentage >= 20:
        priority = "MEDIUM - Secondary processing pipeline"
    elif percentage >= 5:
        priority = "LOW - Specialized handling"
    else:
        priority = "MINIMAL - Error handling only"
    
    print(f"⭐ Processing priority: {priority}")

print("\n🚀 IMPLEMENTATION ROADMAP:")
# Guard each step so the roadmap also works when fewer than three patterns exist
if len(sorted_patterns) > 0:
    print(f"   1. Implement primary pipeline for Pattern #1 ({sorted_patterns[0][1]:,} records)")
if len(sorted_patterns) > 1:
    print(f"   2. Add secondary handling for Pattern #2 ({sorted_patterns[1][1]:,} records)")
if len(sorted_patterns) > 2:
    print(f"   3. Consider Pattern #3 support ({sorted_patterns[2][1]:,} records)")
# Never report a negative number of remaining patterns
remaining_patterns = max(total_patterns - 3, 0)
print(f"   4. Implement fallback processing for remaining {remaining_patterns} patterns")
print("   5. Add comprehensive error logging and schema validation")

print("\n✅ Analysis complete! Use these insights to design your processing pipeline.")

🛠️  PATTERN-SPECIFIC PROCESSING STRATEGIES

📋 PATTERN #1 PROCESSING STRATEGY
----------------------------------------
📊 Coverage: 67,260 records (33.63%)
🏷️  Classification: SECONDARY_SCHEMA
🔧 Recommended extraction strategy:
   • Core fields (6): @timestamp, agent, destination, source, network, event
   • Optional fields: process, source.process (null handling required)
   • Nested extractions: host.os.*, network_traffic.flow.*
⭐ Processing priority: MEDIUM - Secondary processing pipeline

📋 PATTERN #2 PROCESSING STRATEGY
----------------------------------------
📊 Coverage: 67,235 records (33.62%)
🏷️  Classification: SECONDARY_SCHEMA
🔧 Recommended extraction strategy:
   • Core fields (6): @timestamp, agent, destination, source, network, event
   • Nested extractions: host.os.*, network_traffic.flow.*
⭐ Processing priority: MEDIUM - Secondary processing pipeline

📋 PATTERN #3 PROCESSING STRATEGY
----------------------------------------
📊 Coverage: 54,406 records (27.20%)
🏷️  Classific