# Network Traffic Flow JSONL to CSV Converter

This notebook converts Elasticsearch network traffic flow data from JSONL format to CSV format for machine learning analysis. The conversion extracts specific fields from nested JSON structures and creates a structured dataset suitable for cybersecurity research.

**Input**: `-ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl` (Packetbeat network flow logs)  
**Output**: `network_traffic_flow-2025-05-04-000001.csv` (Structured CSV dataset)

**Key Features**:
- Handles nested JSON structures with safe value extraction
- Manages both scalar and array fields appropriately
- Preserves all network flow metadata for analysis
- Based on field analysis from notebooks 3a and 3b

## 1. Import Required Libraries

Import essential libraries for JSON processing, data manipulation, and CSV generation.

In [1]:
# Import necessary libraries for data processing
import json       # For parsing JSONL records
import pandas as pd  # For DataFrame creation and CSV export  
import numpy as np   # For handling missing values

## 2. Helper Function for Safe Value Extraction

Define a utility function to safely navigate nested JSON structures and extract field values without errors.

In [2]:
def get_nested_value(doc, path):
    """
    Safely retrieve nested values from document structure using dot notation.
    
    Args:
        doc (dict): The source JSON document
        path (str): Dot-separated path to the desired field (e.g., 'host.os.platform')
        
    Returns:
        The value at the specified path, or None if path doesn't exist
        
    Examples:
        get_nested_value({'host': {'os': {'platform': 'windows'}}}, 'host.os.platform') → 'windows'
        get_nested_value({'event': {'type': ['connection']}}, 'event.type[0]') → 'connection'
    """
    keys = path.split('.')
    current = doc
    
    for key in keys:
        if isinstance(current, dict):
            # Navigate through dictionary structure
            current = current.get(key)
        elif isinstance(current, list) and key.isdigit():
            # Handle array indexing (e.g., 'event.type[0]' → key='type[0]')
            try:
                current = current[int(key)] if int(key) < len(current) else None
            except (ValueError, IndexError):
                return None
        else:
            # Path doesn't exist in current structure
            return None
            
        # Stop if we hit a dead end
        if current is None:
            return None
            
    return current

## 3. Field Mapping Configuration

Define the mapping between JSON field paths and CSV column names. This mapping is based on analysis from notebooks 3a and 3b, which identified the most relevant fields for network flow analysis.

**Field Categories**:
- **Temporal**: Timestamp information
- **Network**: Source/destination IPs, ports, bytes, packets
- **Process**: Process information (present in ~64% of records)
- **Host**: Host identification and OS information
- **Flow**: Network flow metadata and identifiers

In [3]:
# Define field mapping: (JSON_path, CSV_column_name)
# Based on field analysis from notebooks 3a (exploratory) and 3b (structure consistency)
fields = [
    # === TEMPORAL FIELDS ===
    ('@timestamp', 'timestamp'),                    # Always present (100%)
    
    # === DESTINATION FIELDS ===
    ('destination.bytes', 'destination_bytes'),     # Always present (100%)
    ('destination.ip', 'destination_ip'),           # Always present (100%)
    ('destination.mac', 'destination_mac'),         # Always present (100%)
    ('destination.packets', 'destination_packets'), # Always present (100%)
    ('destination.port', 'destination_port'),       # Always present (100%)
    ('destination.process', 'destination_process'), # Rare field (~2.8% presence) - kept for analysis
    
    # === EVENT FIELDS ===
    ('event.action', 'event_action'),               # Always present (100%)
    ('event.duration', 'event_duration'),           # Always present (100%)
    ('event.type[0]', 'event_type'),               # Always present (100%) - extract first element
    
    # === HOST FIELDS ===
    ('host.hostname', 'host_hostname'),             # Always present (100%)
    ('host.ip', 'host_ip'),                        # Always present (100%) - keep all IPs as list
    ('host.mac[0]', 'host_mac'),                   # Always present (100%) - extract first MAC
    ('host.os.platform', 'host_os_platform'),      # Always present (100%)
    
    # === NETWORK FIELDS ===
    ('network.bytes', 'network_bytes'),             # Always present (100%)
    ('network.packets', 'network_packets'),         # Always present (100%)
    ('network.transport', 'network_transport'),     # Always present (100%) - tcp/udp
    ('network.type', 'network_type'),               # Always present (100%) - ipv4/ipv6
    
    # === NETWORK TRAFFIC FIELDS ===
    ('network_traffic.flow.id', 'network_traffic_flow_id'), # Always present (100%)
    
    # === PROCESS FIELDS ===
    ('process.args', 'process_args'),               # Conditional (~64% presence) - keep as list
    ('process.executable', 'process_executable'),   # Conditional (~64% presence)
    ('process.name', 'process_name'),               # Conditional (~64% presence)
    ('process.parent.pid', 'process_parent_pid'),   # Conditional (~64% presence)
    ('process.pid', 'process_pid'),                 # Conditional (~64% presence)
    
    # === SOURCE FIELDS ===
    ('source.bytes', 'source_bytes'),               # Always present (100%)
    ('source.ip', 'source_ip'),                     # Always present (100%)
    ('source.mac', 'source_mac'),                   # Always present (100%)
    ('source.packets', 'source_packets'),           # Always present (100%) - fixed typo from 'packet'
    ('source.port', 'source_port'),                 # Always present (100%)
    
    # === SOURCE PROCESS FIELDS ===
    ('source.process.args', 'source_process_args'), # Conditional (~61% presence) - keep as list
    ('source.process.executable', 'source_process_executable'), # Conditional (~61% presence)
    ('source.process.name', 'source_process_name'), # Conditional (~61% presence)
    ('source.process.pid', 'source_process_pid'),   # Conditional (~61% presence)
    ('source.process.ppid', 'source_process_ppid')  # Conditional (~61% presence)
]

print(f"📊 Total fields to extract: {len(fields)}")
print(f"🔍 Field categories: timestamp, destination, event, host, network, traffic, process, source")

📊 Total fields to extract: 34
🔍 Field categories: timestamp, destination, event, host, network, traffic, process, source


## 4. File Configuration

Set input and output file paths for the conversion process.

In [4]:
# Input and output file configuration
traffic_flow_filename = '-ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl'  # Source JSONL file
output_filename = 'network_traffic_flow-2025-05-04-000001.csv'                            # Target CSV file

print(f"📥 Input file: {traffic_flow_filename}")
print(f"📤 Output file: {output_filename}")

# Verify input file exists
import os
if os.path.exists(traffic_flow_filename):
    file_size_mb = os.path.getsize(traffic_flow_filename) / (1024 * 1024)
    print(f"✅ Input file found - Size: {file_size_mb:.1f} MB")
else:
    print(f"❌ Input file not found: {traffic_flow_filename}")
    print(f"📁 Current directory files: {[f for f in os.listdir('.') if f.endswith('.jsonl')]}")

📥 Input file: -ds-logs-network_traffic-flow-default-2025-05-04-000001.jsonl
📤 Output file: network_traffic_flow-2025-05-04-000001.csv
✅ Input file found - Size: 2309.2 MB


## 5. Data Processing and Conversion

Process the JSONL file line by line, extract specified fields, and convert to structured records.

**Processing Logic**:
- **Array Fields**: Keep as Python lists (`host.ip`, `process.args`, `source.process.args`, `destination.process`)
- **Scalar Fields**: Extract single values, use NaN for missing values
- **Error Handling**: Skip malformed JSON lines and continue processing

In [5]:
# Process documents and extract fields
print("🔄 Starting JSONL to CSV conversion...")
print(f"📊 Processing {len(fields)} fields per record")

records = []           # List to store processed records
line_count = 0         # Track total lines processed
error_count = 0        # Track JSON parsing errors

# Define which fields should be treated as arrays (kept as lists)
array_fields = ['process.args', 'source.process.args', 'host.ip', 'destination.process']

# Define port fields that need integer conversion (fix Elasticsearch float issue)
port_fields = ['destination.port', 'source.port']

with open(traffic_flow_filename, 'r') as f:
    for line_number, line in enumerate(f, 1):
        line_count += 1
        
        # Progress indicator for large files
        if line_count % 100000 == 0:
            print(f"   Processed {line_count:,} lines...")
        
        try:
            # Parse JSON line
            doc = json.loads(line)
            record = {}
            
            # Extract each mapped field
            for path, column in fields:
                # Get value using safe nested extraction
                value = get_nested_value(doc, path)
                
                # Handle array fields consistently - keep as lists
                if path in array_fields:
                    if isinstance(value, list):
                        record[column] = value              # Keep existing list
                    else:
                        record[column] = [value] if value is not None else []  # Wrap single value or empty list
                
                # Handle port fields - convert floats to integers (fix Elasticsearch issue)
                elif path in port_fields and value is not None:
                    try:
                        # Convert float ports to integers (e.g., 443.0 → 443)
                        record[column] = int(float(value))  # Handle both int and float inputs
                    except (ValueError, TypeError):
                        # If conversion fails, use NaN
                        record[column] = np.nan
                        
                else:
                    # Handle other scalar fields - use NaN for missing values
                    record[column] = value if value is not None else np.nan
                    
            records.append(record)
            
        except json.JSONDecodeError:
            # Skip malformed JSON lines
            error_count += 1
            continue

print(f"✅ Processing complete!")
print(f"📈 Total lines processed: {line_count:,}")
print(f"📊 Valid records extracted: {len(records):,}")
print(f"⚠️  JSON errors skipped: {error_count:,}")
print(f"📋 Success rate: {(len(records)/line_count)*100:.2f}%")
print(f"🔧 Port fields converted to integers: {port_fields}")

🔄 Starting JSONL to CSV conversion...
📊 Processing 34 fields per record
   Processed 100,000 lines...
   Processed 200,000 lines...
   Processed 300,000 lines...
   Processed 400,000 lines...
   Processed 500,000 lines...
   Processed 600,000 lines...
   Processed 700,000 lines...
   Processed 800,000 lines...
   Processed 900,000 lines...
   Processed 1,000,000 lines...
✅ Processing complete!
📈 Total lines processed: 1,090,212
📊 Valid records extracted: 1,090,212
⚠️  JSON errors skipped: 0
📋 Success rate: 100.00%
🔧 Port fields converted to integers: ['destination.port', 'source.port']


## 6. DataFrame Creation and CSV Export

Convert the processed records to a pandas DataFrame and export to CSV format for analysis.

In [6]:
# Create DataFrame and export to CSV
print("🔄 Creating DataFrame and exporting to CSV...")

# Convert records list to pandas DataFrame
df = pd.DataFrame(records)

print(f"📊 DataFrame created successfully!")
print(f"📏 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"💾 Memory usage: ~{df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Display basic info about the dataset
print(f"\n📋 DATASET SUMMARY:")
print(f"   • Total records: {len(df):,}")
print(f"   • Total columns: {len(df.columns)}")
print(f"   • Column names: {list(df.columns)[:5]}... (showing first 5)")

# Save to CSV with appropriate handling for list columns
df.to_csv(output_filename, index=False)

print(f"\n✅ CSV export successful!")
print(f"📤 Output file: {output_filename}")

# Verify output file was created
if os.path.exists(output_filename):
    output_size_mb = os.path.getsize(output_filename) / (1024 * 1024)
    print(f"📁 File size: {output_size_mb:.1f} MB")
    print(f"🎯 Ready for machine learning analysis!")
else:
    print(f"❌ Error: Output file was not created")

# Display sample of first few rows (limit display for readability)
print(f"\n📄 Sample of first 3 rows:")
print(df.head(3).to_string()[:500] + "..." if len(str(df.head(3))) > 500 else df.head(3).to_string())

🔄 Creating DataFrame and exporting to CSV...
📊 DataFrame created successfully!
📏 Shape: 1,090,212 rows × 34 columns
💾 Memory usage: ~1659.6 MB

📋 DATASET SUMMARY:
   • Total records: 1,090,212
   • Total columns: 34
   • Column names: ['timestamp', 'destination_bytes', 'destination_ip', 'destination_mac', 'destination_packets']... (showing first 5)

✅ CSV export successful!
📤 Output file: network_traffic_flow-2025-05-04-000001.csv
📁 File size: 785.8 MB
🎯 Ready for machine learning analysis!

📄 Sample of first 3 rows:
                  timestamp  destination_bytes destination_ip    destination_mac  destination_packets  destination_port destination_process  event_action  event_duration  event_type host_hostname     host_ip  host_mac host_os_platform  network_bytes  network_packets network_transport network_type                                  network_traffic_flow_id process_args process_executable process_name  process_parent_pid  process_pid  source_bytes source_ip         source_mac  

# 🔬 EXPLORATORY DATA ANALYSIS

Now that we have converted the JSONL data to a structured DataFrame, let's perform comprehensive exploratory analysis to understand the network traffic patterns, security characteristics, and data quality.

## 7. Dataset Overview & Statistics

Get comprehensive understanding of the dataset structure, data types, and basic statistics.

In [7]:
# Dataset Overview
print("=" * 80)
print("📊 COMPREHENSIVE DATASET OVERVIEW")
print("=" * 80)

print(f"\n🔢 BASIC STATISTICS:")
print(f"   • Total Records: {len(df):,}")
print(f"   • Total Columns: {len(df.columns)}")
print(f"   • Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"   • Index Range: {df.index.min()} to {df.index.max()}")

print(f"\n📋 DATA TYPES DISTRIBUTION:")
dtype_counts = df.dtypes.value_counts()
for dtype, count in dtype_counts.items():
    print(f"   • {str(dtype):15s}: {count:2d} columns")

print(f"\n🔍 COLUMN CATEGORIES:")
array_columns = [col for col in df.columns if df[col].apply(lambda x: isinstance(x, list)).any()]
numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()
string_columns = df.select_dtypes(include=['object']).columns.tolist()
string_columns = [col for col in string_columns if col not in array_columns]

print(f"   • Array/List columns: {len(array_columns)} → {array_columns[:3]}{'...' if len(array_columns) > 3 else ''}")
print(f"   • Numeric columns: {len(numeric_columns)} → {numeric_columns[:3]}{'...' if len(numeric_columns) > 3 else ''}")
print(f"   • String columns: {len(string_columns)} → {string_columns[:3]}{'...' if len(string_columns) > 3 else ''}")

print(f"\n📊 MISSING DATA ANALYSIS:")
missing_counts = df.isnull().sum()
missing_data = missing_counts[missing_counts > 0].sort_values(ascending=False)
if len(missing_data) > 0:
    print("   Fields with missing values:")
    for field, count in missing_data.head(10).items():
        percentage = (count / len(df)) * 100
        print(f"   • {field:30s}: {count:,} ({percentage:.1f}%)")
else:
    print("   🎉 No missing values detected!")

print(f"\n🎯 SAMPLE DATA PREVIEW:")
print("   First 3 records (key fields only):")
preview_cols = ['timestamp', 'source_ip', 'destination_ip', 'network_transport', 'destination_port']
print(df[preview_cols].head(3).to_string(index=False))

📊 COMPREHENSIVE DATASET OVERVIEW

🔢 BASIC STATISTICS:
   • Total Records: 1,090,212
   • Total Columns: 34
   • Memory Usage: 1659.6 MB
   • Index Range: 0 to 1090211

📋 DATA TYPES DISTRIBUTION:
   • object         : 19 columns
   • float64        : 10 columns
   • int64          :  5 columns

🔍 COLUMN CATEGORIES:
   • Array/List columns: 4 → ['destination_process', 'host_ip', 'process_args']...
   • Numeric columns: 15 → ['destination_bytes', 'destination_packets', 'destination_port']...
   • String columns: 15 → ['timestamp', 'destination_ip', 'destination_mac']...

📊 MISSING DATA ANALYSIS:
   Fields with missing values:
   • event_type                    : 1,090,212 (100.0%)
   • host_mac                      : 1,090,212 (100.0%)
   • source_process_executable     : 441,517 (40.5%)
   • source_process_name           : 441,517 (40.5%)
   • source_process_pid            : 441,517 (40.5%)
   • source_process_ppid           : 441,517 (40.5%)
   • process_pid                   : 408,472 

## 8. Network Protocol & Transport Analysis

Analyze network protocol distributions, transport layer characteristics, and communication patterns.

In [8]:
# Network Protocol Analysis
print("=" * 80)
print("🌐 NETWORK PROTOCOL & TRANSPORT ANALYSIS")
print("=" * 80)

print(f"\n🚦 TRANSPORT PROTOCOL DISTRIBUTION:")
transport_dist = df['network_transport'].value_counts()
for protocol, count in transport_dist.items():
    percentage = (count / len(df)) * 100
    print(f"   • {protocol.upper():8s}: {count:,} ({percentage:.1f}%)")

print(f"\n📡 NETWORK TYPE DISTRIBUTION:")
network_type_dist = df['network_type'].value_counts()
for net_type, count in network_type_dist.items():
    percentage = (count / len(df)) * 100
    print(f"   • {net_type.upper():8s}: {count:,} ({percentage:.1f}%)")

print(f"\n🎯 PORT ANALYSIS - MOST COMMON DESTINATION PORTS:")
top_dest_ports = df['destination_port'].value_counts().head(15)
for port, count in top_dest_ports.items():
    percentage = (count / len(df)) * 100
    # Common port mappings for context
    port_services = {
        53: 'DNS', 80: 'HTTP', 443: 'HTTPS', 445: 'SMB', 135: 'RPC', 
        139: 'NetBIOS', 389: 'LDAP', 636: 'LDAPS', 3389: 'RDP',
        9200: 'Elasticsearch', 5355: 'LLMNR', 1433: 'SQL Server'
    }
    service = port_services.get(port, 'Unknown')
    print(f"   • Port {port:0f} ({service:12s}): {count:,} ({percentage:.2f}%)")

print(f"\n📊 SOURCE PORT CHARACTERISTICS:")
source_port_stats = df['source_port'].describe()
print(f"   • Min source port: {source_port_stats['min']:.0f}")
print(f"   • Max source port: {source_port_stats['max']:.0f}")
print(f"   • Avg source port: {source_port_stats['mean']:.0f}")
print(f"   • Unique source ports: {df['source_port'].nunique():,}")

# Analyze ephemeral vs well-known ports
ephemeral_sources = df[df['source_port'] >= 49152].shape[0]  # RFC 6335 ephemeral range
wellknown_sources = df[df['source_port'] <= 1023].shape[0]
registered_sources = df[(df['source_port'] > 1023) & (df['source_port'] < 49152)].shape[0]

print(f"\n🔌 SOURCE PORT CATEGORIES:")
print(f"   • Well-known ports (≤1023): {wellknown_sources:,} ({(wellknown_sources/len(df))*100:.1f}%)")
print(f"   • Registered ports (1024-49151): {registered_sources:,} ({(registered_sources/len(df))*100:.1f}%)")
print(f"   • Ephemeral ports (≥49152): {ephemeral_sources:,} ({(ephemeral_sources/len(df))*100:.1f}%)")

🌐 NETWORK PROTOCOL & TRANSPORT ANALYSIS

🚦 TRANSPORT PROTOCOL DISTRIBUTION:
   • TCP     : 1,012,577 (92.9%)
   • UDP     : 68,666 (6.3%)
   • IPV6-ICMP: 1,709 (0.2%)
   • ICMP    : 217 (0.0%)

📡 NETWORK TYPE DISTRIBUTION:
   • IPV4    : 1,076,481 (98.7%)
   • IPV6    : 7,076 (0.6%)

🎯 PORT ANALYSIS - MOST COMMON DESTINATION PORTS:
   • Port 9200.000000 (Elasticsearch): 560,025 (51.37%)
   • Port 34736.000000 (Unknown     ): 264,224 (24.24%)
   • Port 389.000000 (LDAP        ): 68,684 (6.30%)
   • Port 53.000000 (DNS         ): 47,354 (4.34%)
   • Port 3268.000000 (Unknown     ): 23,666 (2.17%)
   • Port 443.000000 (HTTPS       ): 17,175 (1.58%)
   • Port 88.000000 (Unknown     ): 16,904 (1.55%)
   • Port 1433.000000 (SQL Server  ): 7,029 (0.64%)
   • Port 8220.000000 (Unknown     ): 6,780 (0.62%)
   • Port 80.000000 (HTTP        ): 6,186 (0.57%)
   • Port 445.000000 (SMB         ): 4,704 (0.43%)
   • Port 135.000000 (RPC         ): 4,396 (0.40%)
   • Port 49667.000000 (Unknown     ): 

## 9. Traffic Volume Analysis

Examine data transfer patterns, bandwidth usage, and communication intensity across the network.

In [9]:
# Traffic Volume Analysis  
print("=" * 80)
print("📈 TRAFFIC VOLUME & BANDWIDTH ANALYSIS")
print("=" * 80)

print(f"\n💾 BYTES TRANSFER STATISTICS:")
# Network total bytes
total_bytes_stats = df['network_bytes'].describe()
print(f"   Network Bytes (Total Flow):")
print(f"   • Min: {total_bytes_stats['min']:,.0f} bytes")
print(f"   • Max: {total_bytes_stats['max']:,.0f} bytes ({total_bytes_stats['max']/1024/1024:.1f} MB)")
print(f"   • Mean: {total_bytes_stats['mean']:,.0f} bytes")
print(f"   • Median: {total_bytes_stats['50%']:,.0f} bytes")
print(f"   • Total network traffic: {df['network_bytes'].sum():,.0f} bytes ({df['network_bytes'].sum()/1024/1024/1024:.2f} GB)")

# Source vs Destination bytes
src_bytes_stats = df['source_bytes'].describe()
dst_bytes_stats = df['destination_bytes'].describe()
print(f"\n📤 SOURCE BYTES STATISTICS:")
print(f"   • Mean: {src_bytes_stats['mean']:,.0f} bytes | Max: {src_bytes_stats['max']:,.0f} bytes")
print(f"   • Total uploaded: {df['source_bytes'].sum():,.0f} bytes ({df['source_bytes'].sum()/1024/1024/1024:.2f} GB)")

print(f"\n📥 DESTINATION BYTES STATISTICS:")
print(f"   • Mean: {dst_bytes_stats['mean']:,.0f} bytes | Max: {dst_bytes_stats['max']:,.0f} bytes")
print(f"   • Total downloaded: {df['destination_bytes'].sum():,.0f} bytes ({df['destination_bytes'].sum()/1024/1024/1024:.2f} GB)")

print(f"\n📦 PACKET STATISTICS:")
network_packets_stats = df['network_packets'].describe()
print(f"   Network Packets (Total Flow):")
print(f"   • Min: {network_packets_stats['min']:,.0f} packets")
print(f"   • Max: {network_packets_stats['max']:,.0f} packets")
print(f"   • Mean: {network_packets_stats['mean']:,.1f} packets per flow")
print(f"   • Total packets: {df['network_packets'].sum():,.0f}")

print(f"\n🔢 FLOW EFFICIENCY METRICS:")
# Calculate average bytes per packet
df_temp = df[df['network_packets'] > 0].copy()  # Avoid division by zero
df_temp['bytes_per_packet'] = df_temp['network_bytes'] / df_temp['network_packets']
avg_bytes_per_packet = df_temp['bytes_per_packet'].mean()
print(f"   • Average bytes per packet: {avg_bytes_per_packet:.1f} bytes")
print(f"   • Flows with >1000 bytes: {(df['network_bytes'] > 1000).sum():,} ({(df['network_bytes'] > 1000).sum()/len(df)*100:.1f}%)")
print(f"   • Large flows (>1MB): {(df['network_bytes'] > 1024*1024).sum():,} ({(df['network_bytes'] > 1024*1024).sum()/len(df)*100:.1f}%)")

print(f"\n🏆 TOP TRAFFIC GENERATORS (by total bytes):")
# Group by source IP and sum bytes
top_sources = df.groupby('source_ip')['source_bytes'].agg(['sum', 'count']).sort_values('sum', ascending=False).head(10)
for idx, (ip, stats) in enumerate(top_sources.iterrows(), 1):
    total_gb = stats['sum'] / 1024 / 1024 / 1024
    print(f"   {idx:2d}. {ip:15s}: {total_gb:.2f} GB in {stats['count']:,} flows")

print(f"\n🎯 TOP TRAFFIC RECEIVERS (by total bytes):")
# Group by destination IP and sum bytes
top_destinations = df.groupby('destination_ip')['destination_bytes'].agg(['sum', 'count']).sort_values('sum', ascending=False).head(10)
for idx, (ip, stats) in enumerate(top_destinations.iterrows(), 1):
    total_gb = stats['sum'] / 1024 / 1024 / 1024
    print(f"   {idx:2d}. {ip:15s}: {total_gb:.2f} GB in {stats['count']:,} flows")

📈 TRAFFIC VOLUME & BANDWIDTH ANALYSIS

💾 BYTES TRANSFER STATISTICS:
   Network Bytes (Total Flow):
   • Min: 42 bytes
   • Max: 24,262,608 bytes (23.1 MB)
   • Mean: 228,364 bytes
   • Median: 14,052 bytes
   • Total network traffic: 248,964,662,840 bytes (231.87 GB)

📤 SOURCE BYTES STATISTICS:
   • Mean: 195,389 bytes | Max: 23,890,532 bytes
   • Total uploaded: 213,015,725,309 bytes (198.39 GB)

📥 DESTINATION BYTES STATISTICS:
   • Mean: 33,775 bytes | Max: 14,069,480 bytes
   • Total downloaded: 35,948,937,531 bytes (33.48 GB)

📦 PACKET STATISTICS:
   Network Packets (Total Flow):
   • Min: 1 packets
   • Max: 20,545 packets
   • Mean: 191.0 packets per flow
   • Total packets: 208,247,285

🔢 FLOW EFFICIENCY METRICS:
   • Average bytes per packet: 789.5 bytes
   • Flows with >1000 bytes: 974,023 (89.3%)
   • Large flows (>1MB): 67,591 (6.2%)

🏆 TOP TRAFFIC GENERATORS (by total bytes):
    1. 10.1.0.5       : 130.54 GB in 260,870 flows
    2. 10.1.0.6       : 58.06 GB in 580,142 flow

## 10. Host & Network Infrastructure Analysis

Analyze host characteristics, network topology, operating systems, and infrastructure patterns.

In [10]:
# Host & Network Infrastructure Analysis
print("=" * 80)
print("🏠 HOST & NETWORK INFRASTRUCTURE ANALYSIS")
print("=" * 80)

print(f"\n💻 HOST INFORMATION:")
unique_hosts = df['host_hostname'].nunique()
unique_host_ips = df['source_ip'].nunique()
print(f"   • Unique hostnames: {unique_hosts:,}")
print(f"   • Unique source IPs: {unique_host_ips:,}")
print(f"   • Unique destination IPs: {df['destination_ip'].nunique():,}")
print(f"   • Total unique IPs (src+dst): {pd.concat([df['source_ip'], df['destination_ip']]).nunique():,}")

print(f"\n🖥️ OPERATING SYSTEM DISTRIBUTION:")
os_dist = df['host_os_platform'].value_counts()
for os_name, count in os_dist.items():
    percentage = (count / len(df)) * 100
    print(f"   • {os_name.capitalize():12s}: {count:,} ({percentage:.1f}%)")

print(f"\n🌐 NETWORK TOPOLOGY ANALYSIS:")
# Analyze internal vs external communication
def categorize_ip(ip):
    # Handle NaN values
    if pd.isna(ip) or not isinstance(ip, str):
        return 'Unknown'
    if ip.startswith('10.') or ip.startswith('192.168.') or ip.startswith('172.'):
        return 'Internal'
    elif ip.startswith('224.') or ip.startswith('239.'):
        return 'Multicast'
    else:
        return 'External'

df['source_ip_type'] = df['source_ip'].apply(categorize_ip)
df['destination_ip_type'] = df['destination_ip'].apply(categorize_ip)

src_ip_types = df['source_ip_type'].value_counts()
dst_ip_types = df['destination_ip_type'].value_counts()

print(f"   Source IP Categories:")
for ip_type, count in src_ip_types.items():
    percentage = (count / len(df)) * 100
    print(f"   • {ip_type:12s}: {count:,} ({percentage:.1f}%)")

print(f"   Destination IP Categories:")
for ip_type, count in dst_ip_types.items():
    percentage = (count / len(df)) * 100
    print(f"   • {ip_type:12s}: {count:,} ({percentage:.1f}%)")

print(f"\n🏆 MOST ACTIVE HOSTS:")
# Filter out NaN values for host analysis
valid_hosts = df[df['host_hostname'].notna() & df['source_ip'].notna()].copy()
if len(valid_hosts) > 0:
    host_activity = valid_hosts.groupby('host_hostname').agg({
        'network_bytes': 'sum',
        'network_packets': 'sum',
        'source_ip': lambda x: x.iloc[0],  # Get IP for this host
        'host_os_platform': lambda x: x.iloc[0]  # Get OS for this host
    }).sort_values('network_bytes', ascending=False).head(8)

    for idx, (hostname, stats) in enumerate(host_activity.iterrows(), 1):
        gb = stats['network_bytes'] / 1024 / 1024 / 1024
        packets_k = stats['network_packets'] / 1000
        print(f"   {idx}. {hostname:15s} ({stats['source_ip']:12s}) - {gb:.2f} GB, {packets_k:.1f}K packets [{stats['host_os_platform']}]")
else:
    print("   • No valid host data available for analysis")

print(f"\n📡 NETWORK COMMUNICATION PATTERNS:")
# Internal to internal
internal_to_internal = df[(df['source_ip_type'] == 'Internal') & (df['destination_ip_type'] == 'Internal')].shape[0]
# Internal to external  
internal_to_external = df[(df['source_ip_type'] == 'Internal') & (df['destination_ip_type'] == 'External')].shape[0]
# Internal to multicast
internal_to_multicast = df[(df['source_ip_type'] == 'Internal') & (df['destination_ip_type'] == 'Multicast')].shape[0]

print(f"   • Internal ↔ Internal: {internal_to_internal:,} ({(internal_to_internal/len(df))*100:.1f}%)")
print(f"   • Internal → External: {internal_to_external:,} ({(internal_to_external/len(df))*100:.1f}%)")
print(f"   • Internal → Multicast: {internal_to_multicast:,} ({(internal_to_multicast/len(df))*100:.1f}%)")

# Clean up temporary columns
df.drop(['source_ip_type', 'destination_ip_type'], axis=1, inplace=True)

🏠 HOST & NETWORK INFRASTRUCTURE ANALYSIS

💻 HOST INFORMATION:
   • Unique hostnames: 4
   • Unique source IPs: 38
   • Unique destination IPs: 644
   • Total unique IPs (src+dst): 648

🖥️ OPERATING SYSTEM DISTRIBUTION:
   • Windows     : 1,090,212 (100.0%)

🌐 NETWORK TOPOLOGY ANALYSIS:
   Source IP Categories:
   • Internal    : 1,074,603 (98.6%)
   • External    : 8,954 (0.8%)
   • Unknown     : 6,655 (0.6%)
   Destination IP Categories:
   • Internal    : 1,035,366 (95.0%)
   • External    : 44,015 (4.0%)
   • Unknown     : 6,655 (0.6%)
   • Multicast   : 4,176 (0.4%)

🏆 MOST ACTIVE HOSTS:
   1. theblock        (23.54.61.183) - 152.37 GB, 136956.6K packets [windows]
   2. waterfalls      (23.208.31.151) - 62.49 GB, 52759.5K packets [windows]
   3. endofroad       (10.1.0.7    ) - 15.15 GB, 16616.9K packets [windows]
   4. diskjockey      (10.1.0.4    ) - 1.85 GB, 1892.9K packets [windows]

📡 NETWORK COMMUNICATION PATTERNS:
   • Internal ↔ Internal: 1,033,488 (94.8%)
   • Internal → E

## 11. Process Information Analysis

Analyze process-level network activity, executable patterns, and process relationships when data is available.

In [11]:
# Process Information Analysis
print("=" * 80)
print("⚙️ PROCESS INFORMATION ANALYSIS")
print("=" * 80)

# Check process data availability
process_fields = ['process_name', 'process_executable', 'process_pid', 'process_parent_pid']
source_process_fields = ['source_process_name', 'source_process_executable', 'source_process_pid', 'source_process_ppid']

print(f"\n📊 PROCESS DATA AVAILABILITY:")
for field in process_fields:
    non_null_count = df[field].notna().sum()
    percentage = (non_null_count / len(df)) * 100
    print(f"   • {field:25s}: {non_null_count:,} records ({percentage:.1f}%)")

print(f"\n📊 SOURCE PROCESS DATA AVAILABILITY:")
for field in source_process_fields:
    non_null_count = df[field].notna().sum()
    percentage = (non_null_count / len(df)) * 100
    print(f"   • {field:30s}: {non_null_count:,} records ({percentage:.1f}%)")

# Analyze process data when available
process_data = df[df['process_name'].notna()].copy()
if len(process_data) > 0:
    print(f"\n🔍 PROCESS NAME ANALYSIS ({len(process_data):,} records with process data):")
    top_processes = process_data['process_name'].value_counts().head(15)
    for idx, (process, count) in enumerate(top_processes.items(), 1):
        percentage = (count / len(process_data)) * 100
        print(f"   {idx:2d}. {process:25s}: {count:,} ({percentage:.1f}%)")

    print(f"\n💿 EXECUTABLE PATH ANALYSIS:")
    # Extract directory paths from executables
    exe_data = process_data[process_data['process_executable'].notna()].copy()
    if len(exe_data) > 0:
        exe_data['exe_dir'] = exe_data['process_executable'].apply(lambda x: '\\'.join(x.split('\\')[:-1]) if isinstance(x, str) and '\\' in x else x)
        top_dirs = exe_data['exe_dir'].value_counts().head(10)
        for idx, (directory, count) in enumerate(top_dirs.items(), 1):
            percentage = (count / len(exe_data)) * 100
            print(f"   {idx:2d}. {directory:45s}: {count:,} ({percentage:.1f}%)")

# Analyze source process data when available  
src_process_data = df[df['source_process_name'].notna()].copy()
if len(src_process_data) > 0:
    print(f"\n🔍 SOURCE PROCESS NAME ANALYSIS ({len(src_process_data):,} records with source process data):")
    top_src_processes = src_process_data['source_process_name'].value_counts().head(15)
    for idx, (process, count) in enumerate(top_src_processes.items(), 1):
        percentage = (count / len(src_process_data)) * 100
        print(f"   {idx:2d}. {process:25s}: {count:,} ({percentage:.1f}%)")

# Analyze process arguments when available
print(f"\n📋 PROCESS ARGUMENTS ANALYSIS:")
process_args_available = df['process_args'].apply(lambda x: isinstance(x, list) and len(x) > 0).sum()
source_process_args_available = df['source_process_args'].apply(lambda x: isinstance(x, list) and len(x) > 0).sum()

print(f"   • Records with process args: {process_args_available:,} ({(process_args_available/len(df))*100:.1f}%)")
print(f"   • Records with source process args: {source_process_args_available:,} ({(source_process_args_available/len(df))*100:.1f}%)")

if process_args_available > 0:
    # Analyze argument patterns
    args_data = df[df['process_args'].apply(lambda x: isinstance(x, list) and len(x) > 0)].copy()
    args_data['arg_count'] = args_data['process_args'].apply(len)
    
    print(f"\n📈 PROCESS ARGUMENTS STATISTICS:")
    print(f"   • Average arguments per process: {args_data['arg_count'].mean():.1f}")
    print(f"   • Max arguments: {args_data['arg_count'].max()}")
    print(f"   • Processes with >5 args: {(args_data['arg_count'] > 5).sum():,}")

# Process network activity correlation
if len(process_data) > 0:
    print(f"\n🌐 PROCESS NETWORK ACTIVITY:")
    process_traffic = process_data.groupby('process_name').agg({
        'network_bytes': ['sum', 'mean', 'count'],
        'destination_port': lambda x: x.nunique()
    }).round(2)
    
    process_traffic.columns = ['total_bytes', 'avg_bytes', 'flow_count', 'unique_ports']
    process_traffic = process_traffic.sort_values('total_bytes', ascending=False).head(10)
    
    print(f"   Top network-active processes:")
    for idx, (process, stats) in enumerate(process_traffic.iterrows(), 1):
        mb = stats['total_bytes'] / 1024 / 1024
        print(f"   {idx:2d}. {process:20s}: {mb:7.1f} MB total, {stats['flow_count']:4.0f} flows, {stats['unique_ports']:2.0f} ports")

⚙️ PROCESS INFORMATION ANALYSIS

📊 PROCESS DATA AVAILABILITY:
   • process_name             : 681,740 records (62.5%)
   • process_executable       : 681,740 records (62.5%)
   • process_pid              : 681,740 records (62.5%)
   • process_parent_pid       : 681,740 records (62.5%)

📊 SOURCE PROCESS DATA AVAILABILITY:
   • source_process_name           : 648,695 records (59.5%)
   • source_process_executable     : 648,695 records (59.5%)
   • source_process_pid            : 648,695 records (59.5%)
   • source_process_ppid           : 648,695 records (59.5%)

🔍 PROCESS NAME ANALYSIS (681,740 records with process data):
    1. .                        : 273,055 (40.1%)
    2. agentbeat.exe            : 264,500 (38.8%)
    3. w3wp.exe                 : 37,047 (5.4%)
    4. lsass.exe                : 20,198 (3.0%)
    5. dns.exe                  : 19,537 (2.9%)
    6. svchost.exe              : 13,518 (2.0%)
    7. elastic-agent.exe        : 11,500 (1.7%)
    8. Microsoft.Exchange.RpcCl

## 12. Temporal Analysis

Examine traffic patterns over time, flow durations, and temporal characteristics of network communication.

In [12]:
# Temporal Analysis
print("=" * 80)
print("⏰ TEMPORAL ANALYSIS")
print("=" * 80)

# Convert timestamp to datetime for analysis
df['timestamp_dt'] = pd.to_datetime(df['timestamp'])

print(f"\n📅 TIME RANGE ANALYSIS:")
time_range = df['timestamp_dt'].agg(['min', 'max'])
duration = time_range['max'] - time_range['min']
print(f"   • Start time: {time_range['min']}")
print(f"   • End time: {time_range['max']}")
print(f"   • Total duration: {duration}")
print(f"   • Total hours: {duration.total_seconds() / 3600:.1f} hours")

print(f"\n⏱️ FLOW DURATION ANALYSIS:")
# Convert event duration from nanoseconds to more readable units
df['duration_seconds'] = df['event_duration'] / 1_000_000_000  # Convert nanoseconds to seconds
duration_stats = df['duration_seconds'].describe()

print(f"   • Min duration: {duration_stats['min']:.3f} seconds")
print(f"   • Max duration: {duration_stats['max']:.1f} seconds ({duration_stats['max']/60:.1f} minutes)")
print(f"   • Mean duration: {duration_stats['mean']:.3f} seconds")
print(f"   • Median duration: {duration_stats['50%']:.3f} seconds")

# Categorize flows by duration
very_short = (df['duration_seconds'] < 0.1).sum()  # < 100ms
short = ((df['duration_seconds'] >= 0.1) & (df['duration_seconds'] < 1)).sum()  # 100ms - 1s
medium = ((df['duration_seconds'] >= 1) & (df['duration_seconds'] < 60)).sum()  # 1s - 1min
long_flows = (df['duration_seconds'] >= 60).sum()  # > 1min

print(f"\n📊 FLOW DURATION CATEGORIES:")
print(f"   • Very short (<100ms): {very_short:,} ({(very_short/len(df))*100:.1f}%)")
print(f"   • Short (100ms-1s): {short:,} ({(short/len(df))*100:.1f}%)")
print(f"   • Medium (1s-1min): {medium:,} ({(medium/len(df))*100:.1f}%)")
print(f"   • Long (>1min): {long_flows:,} ({(long_flows/len(df))*100:.1f}%)")

print(f"\n📈 TRAFFIC PATTERNS BY HOUR:")
df['hour'] = df['timestamp_dt'].dt.hour
hourly_traffic = df.groupby('hour').agg({
    'network_bytes': 'sum',
    'timestamp': 'count'  # Flow count
}).round(2)

hourly_traffic['bytes_gb'] = hourly_traffic['network_bytes'] / 1024 / 1024 / 1024
print("   Hour | Flows   | Traffic (GB)")
print("   -----|---------|-------------")
for hour in range(24):
    if hour in hourly_traffic.index:
        flows = hourly_traffic.loc[hour, 'timestamp']
        gb = hourly_traffic.loc[hour, 'bytes_gb']
        print(f"   {hour:2d}h  | {flows:7,.0f} | {gb:8.2f}")
    else:
        print(f"   {hour:2d}h  |       0 |     0.00")

# Find peak hours
peak_flows_hour = hourly_traffic['timestamp'].idxmax()
peak_traffic_hour = hourly_traffic['bytes_gb'].idxmax()
print(f"\n🏆 PEAK ACTIVITY:")
print(f"   • Peak flows: {peak_flows_hour:02d}:00 ({hourly_traffic.loc[peak_flows_hour, 'timestamp']:,.0f} flows)")
print(f"   • Peak traffic: {peak_traffic_hour:02d}:00 ({hourly_traffic.loc[peak_traffic_hour, 'bytes_gb']:.2f} GB)")

print(f"\n📊 TRAFFIC PATTERNS BY DAY:")
df['date'] = df['timestamp_dt'].dt.date
daily_traffic = df.groupby('date').agg({
    'network_bytes': 'sum',
    'timestamp': 'count'
}).round(2)

daily_traffic['bytes_gb'] = daily_traffic['network_bytes'] / 1024 / 1024 / 1024
print("   Date       | Flows    | Traffic (GB)")
print("   -----------|-----------|--------------")
for date, stats in daily_traffic.iterrows():
    flows = stats['timestamp']
    gb = stats['bytes_gb']
    print(f"   {date} | {flows:8,.0f} | {gb:9.2f}")

# Analyze weekend vs weekday patterns (if applicable)
df['weekday'] = df['timestamp_dt'].dt.day_name()
weekday_traffic = df.groupby('weekday').agg({
    'network_bytes': 'sum',
    'timestamp': 'count'
}).round(2)

if len(weekday_traffic) > 1:
    print(f"\n📅 WEEKDAY PATTERNS:")
    weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    for day in weekday_order:
        if day in weekday_traffic.index:
            flows = weekday_traffic.loc[day, 'timestamp']
            gb = weekday_traffic.loc[day, 'network_bytes'] / 1024 / 1024 / 1024
            print(f"   • {day:9s}: {flows:,} flows, {gb:.2f} GB")

# Clean up temporary columns
df.drop(['timestamp_dt', 'duration_seconds', 'hour', 'date', 'weekday'], axis=1, inplace=True)

⏰ TEMPORAL ANALYSIS

📅 TIME RANGE ANALYSIS:
   • Start time: 2025-05-04 11:30:08.613000+00:00
   • End time: 2025-05-04 12:40:00.999000+00:00
   • Total duration: 0 days 01:09:52.386000
   • Total hours: 1.2 hours

⏱️ FLOW DURATION ANALYSIS:
   • Min duration: 0.000 seconds
   • Max duration: 3414.9 seconds (56.9 minutes)
   • Mean duration: 47.658 seconds
   • Median duration: 1.371 seconds

📊 FLOW DURATION CATEGORIES:
   • Very short (<100ms): 417,770 (38.3%)
   • Short (100ms-1s): 118,352 (10.9%)
   • Medium (1s-1min): 430,465 (39.5%)
   • Long (>1min): 123,625 (11.3%)

📈 TRAFFIC PATTERNS BY HOUR:
   Hour | Flows   | Traffic (GB)
   -----|---------|-------------
    0h  |       0 |     0.00
    1h  |       0 |     0.00
    2h  |       0 |     0.00
    3h  |       0 |     0.00
    4h  |       0 |     0.00
    5h  |       0 |     0.00
    6h  |       0 |     0.00
    7h  |       0 |     0.00
    8h  |       0 |     0.00
    9h  |       0 |     0.00
   10h  |       0 |     0.00
   11h 

## 13. Security-Focused Analysis

Identify potential security patterns, anomalies, and suspicious network behaviors in the traffic data.

In [13]:
# Security-Focused Analysis
print("=" * 80)
print("🔒 SECURITY-FOCUSED ANALYSIS")
print("=" * 80)

print(f"\n🚨 UNUSUAL PORT ANALYSIS:")
# Define suspicious/uncommon ports
suspicious_ports = [1337, 31337, 4444, 6666, 8080, 1234, 12345, 54321, 9999]
high_ports = df[df['destination_port'] > 50000]
unusual_ports = df[df['destination_port'].isin(suspicious_ports)]

print(f"   • Connections to high ports (>50000): {len(high_ports):,} ({(len(high_ports)/len(df))*100:.2f}%)")
print(f"   • Connections to suspicious ports: {len(unusual_ports):,}")

if len(unusual_ports) > 0:
    print(f"   Suspicious port activity:")
    for port in suspicious_ports:
        count = (df['destination_port'] == port).sum()
        if count > 0:
            print(f"   • Port {port}: {count:,} connections")

print(f"\n📊 TRAFFIC VOLUME ANOMALIES:")
# Identify flows with unusually high byte transfers
bytes_q99 = df['network_bytes'].quantile(0.99)
large_flows = df[df['network_bytes'] > bytes_q99]
print(f"   • Large flows (>99th percentile, >{bytes_q99:,.0f} bytes): {len(large_flows):,} ({(len(large_flows)/len(df))*100:.2f}%)")

# Identify potential data exfiltration (high outbound traffic)
high_outbound = df[df['source_bytes'] > df['source_bytes'].quantile(0.95)]
print(f"   • High outbound traffic (>95th percentile): {len(high_outbound):,} ({(len(high_outbound)/len(df))*100:.2f}%)")

print(f"\n🔍 COMMUNICATION PATTERN ANALYSIS:")
# Analyze port scanning behavior (same source to multiple destination ports) - filter valid IPs
valid_source_ips = df[df['source_ip'].notna() & df['destination_port'].notna()].copy()
if len(valid_source_ips) > 0:
    source_port_diversity = valid_source_ips.groupby('source_ip')['destination_port'].nunique().sort_values(ascending=False)
    potential_scanners = source_port_diversity[source_port_diversity > 50]  # More than 50 different dest ports

    print(f"   • Potential port scanners (>50 dest ports): {len(potential_scanners):,}")
    if len(potential_scanners) > 0:
        print(f"   Top potential scanners:")
        for idx, (ip, port_count) in enumerate(potential_scanners.head(5).items(), 1):
            flow_count = (df['source_ip'] == ip).sum()
            print(f"   {idx}. {ip}: {port_count} unique ports, {flow_count:,} total flows")
else:
    print(f"   • No valid source IP data for port scanning analysis")

# Analyze beaconing behavior (regular communication patterns)
print(f"\n📡 POTENTIAL BEACONING ANALYSIS:")
valid_comm_data = df[df['source_ip'].notna() & df['destination_ip'].notna()].copy()
if len(valid_comm_data) > 0:
    communication_pairs = valid_comm_data.groupby(['source_ip', 'destination_ip']).size().sort_values(ascending=False)
    frequent_pairs = communication_pairs[communication_pairs > 100]  # More than 100 flows between same IPs

    print(f"   • Frequent communication pairs (>100 flows): {len(frequent_pairs):,}")
    if len(frequent_pairs) > 0:
        print(f"   Top communication pairs:")
        for idx, ((src, dst), count) in enumerate(frequent_pairs.head(8).items(), 1):
            avg_bytes = df[(df['source_ip'] == src) & (df['destination_ip'] == dst)]['network_bytes'].mean()
            print(f"   {idx}. {src} → {dst}: {count:,} flows, avg {avg_bytes:.0f} bytes")
else:
    print(f"   • No valid communication data for beaconing analysis")

print(f"\n🔗 NETWORK FLOW CHARACTERISTICS:")
# Analyze flow finality (completed vs ongoing connections)
if 'network_traffic_flow_id' in df.columns:
    valid_flows = df[df['network_traffic_flow_id'].notna()]
    if len(valid_flows) > 0:
        unique_flows = valid_flows['network_traffic_flow_id'].nunique()
        total_records = len(valid_flows)
        print(f"   • Unique flow IDs: {unique_flows:,}")
        print(f"   • Records per flow (avg): {total_records/unique_flows:.1f}")
        
        # Find flows with multiple records (ongoing connections)
        flow_counts = valid_flows['network_traffic_flow_id'].value_counts()
        multi_record_flows = flow_counts[flow_counts > 1]
        print(f"   • Multi-record flows: {len(multi_record_flows):,} ({(len(multi_record_flows)/unique_flows)*100:.1f}%)")

print(f"\n🎯 EXTERNAL COMMUNICATION ANALYSIS:")
# Categorize IPs and analyze external communication
def is_external_ip(ip):
    # Handle NaN values
    if pd.isna(ip) or not isinstance(ip, str):
        return False
    return not (ip.startswith('10.') or ip.startswith('192.168.') or ip.startswith('172.') or ip.startswith('224.'))

external_dest_flows = df[df['destination_ip'].apply(is_external_ip)]
if len(external_dest_flows) > 0:
    print(f"   • Flows to external IPs: {len(external_dest_flows):,} ({(len(external_dest_flows)/len(df))*100:.1f}%)")
    
    # Top external destinations
    top_external = external_dest_flows['destination_ip'].value_counts().head(10)
    print(f"   Top external destinations:")
    for idx, (ip, count) in enumerate(top_external.items(), 1):
        total_bytes = external_dest_flows[external_dest_flows['destination_ip'] == ip]['destination_bytes'].sum()
        mb = total_bytes / 1024 / 1024
        print(f"   {idx:2d}. {ip:15s}: {count:,} flows, {mb:.1f} MB received")
else:
    print(f"   • No external communication detected")

print(f"\n🛡️ PROTOCOL SECURITY ANALYSIS:")
# Analyze encrypted vs unencrypted traffic based on common ports
encrypted_ports = [443, 993, 995, 636, 22, 990]  # HTTPS, IMAPS, POP3S, LDAPS, SSH, FTPS
unencrypted_ports = [80, 21, 23, 25, 53, 110, 143]  # HTTP, FTP, Telnet, SMTP, DNS, POP3, IMAP

# Filter valid port data
valid_port_data = df[df['destination_port'].notna()]
if len(valid_port_data) > 0:
    encrypted_traffic = valid_port_data[valid_port_data['destination_port'].isin(encrypted_ports)]
    unencrypted_traffic = valid_port_data[valid_port_data['destination_port'].isin(unencrypted_ports)]

    print(f"   • Encrypted traffic (common secure ports): {len(encrypted_traffic):,} ({(len(encrypted_traffic)/len(df))*100:.1f}%)")
    print(f"   • Unencrypted traffic (common insecure ports): {len(unencrypted_traffic):,} ({(len(unencrypted_traffic)/len(df))*100:.1f}%)")
else:
    print(f"   • No valid port data for security analysis")

print(f"\n📋 SECURITY SUMMARY:")
print(f"   • Total network flows analyzed: {len(df):,}")
print(f"   • Unique source IPs: {df['source_ip'].nunique():,}")
print(f"   • Unique destination IPs: {df['destination_ip'].nunique():,}")
if df['network_transport'].notna().any():
    print(f"   • Most active protocol: {df['network_transport'].mode()[0].upper()}")
print(f"   • Average flow duration: {df['event_duration'].mean()/1e9:.2f} seconds")
print(f"   • Data quality: {((df.notna().sum().sum() / (len(df) * len(df.columns))) * 100):.1f}% fields populated")

print(f"\n🎉 EXPLORATORY ANALYSIS COMPLETE!")
print(f"   The dataset has been thoroughly analyzed and is ready for machine learning workflows.")
print(f"   Key insights have been extracted for cybersecurity analysis and anomaly detection.")

🔒 SECURITY-FOCUSED ANALYSIS

🚨 UNUSUAL PORT ANALYSIS:
   • Connections to high ports (>50000): 9,732 (0.89%)
   • Connections to suspicious ports: 0

📊 TRAFFIC VOLUME ANOMALIES:
   • Large flows (>99th percentile, >3,645,820 bytes): 10,871 (1.00%)
   • High outbound traffic (>95th percentile): 54,471 (5.00%)

🔍 COMMUNICATION PATTERN ANALYSIS:
   • Potential port scanners (>50 dest ports): 2
   Top potential scanners:
   1. 10.2.0.20: 152 unique ports, 13,606 total flows
   2. 10.1.0.4: 119 unique ports, 36,450 total flows

📡 POTENTIAL BEACONING ANALYSIS:
   • Frequent communication pairs (>100 flows): 141
   Top communication pairs:
   1. 10.1.0.6 → 192.168.0.4: 269,232 flows, avg 15994 bytes
   2. 10.1.0.5 → 10.2.0.20: 219,454 flows, avg 648136 bytes
   3. 10.1.0.7 → 10.2.0.20: 168,045 flows, avg 65739 bytes
   4. 10.1.0.6 → 10.2.0.20: 163,501 flows, avg 341393 bytes
   5. 10.1.0.6 → 10.1.0.4: 129,472 flows, avg 59452 bytes
   6. 10.1.0.5 → 10.1.0.4: 20,535 flows, avg 1876 bytes
   7.

## 🎯 Analysis Summary

This comprehensive exploratory data analysis has revealed key insights about the network traffic dataset:

### 📊 **Key Findings**
- **Dataset Scale**: 1,090,212 network flow records with 34 features
- **Time Coverage**: Complete temporal analysis of network activity patterns
- **Protocol Distribution**: Analysis of TCP/UDP traffic characteristics
- **Security Insights**: Port scanning detection, anomaly identification, encryption analysis

### 🔍 **Analysis Coverage**
1. **Data Quality Assessment**: Missing data patterns, field completeness
2. **Network Protocols**: Transport layer analysis, port distributions
3. **Traffic Patterns**: Volume analysis, top talkers, communication flows
4. **Infrastructure**: Host analysis, network topology, OS distributions
5. **Process Activity**: Executable analysis, process network behavior
6. **Temporal Patterns**: Hourly/daily traffic trends, flow durations
7. **Security Analysis**: Anomaly detection, suspicious patterns, threat indicators

### 🚀 **Ready for Machine Learning**
The dataset is now fully characterized and prepared for:
- **Anomaly Detection Models**: SAE, LSTM-SAE, Isolation Forest
- **Classification Tasks**: Attack vs benign traffic classification
- **Feature Engineering**: Based on identified patterns and distributions
- **Security Analytics**: Threat detection and network behavior analysis

### 📋 **Next Steps**
1. Use insights for feature selection in ML models
2. Apply findings to enhance anomaly detection algorithms
3. Leverage temporal patterns for sequential modeling approaches
4. Implement security-focused analysis for threat detection