Simple Flow Log Loader and Summarizer

This notebook loads VPC flow logs from a folder into a pandas DataFrame for analysis. After loading, several subsets are created.
To look for C2 or exfiltration activity, a derivative DataFrame is created containing north / south traffic. Definitions:

North / south traffic: flows with a source or destination that is remote on the Internet - traffic to or from the Internet.
East / west traffic: flows with private source and destination IPs - local traffic that does not leave the VPC.

North / south is a good place to start because much of what we look for during threat hunting (initial access, credentialed access, C2, and exfiltration) requires north / south traffic. East / west traffic can be a place to hunt for lateral movement, but that requires more complex analytics and different approaches, because most local traffic is benign and voluminous.

The notebook generates a subset of flows containing north / south traffic and then ranks the top 1000 flow combinations to Internet destination IPs by flow count. I'll do counts by data volume in a forthcoming notebook. After that, the destination IPs in the top 200 rows are labeled with their AS (autonomous system) number, network name, and country. Cloud providers and OS vendors are excluded here because most of that traffic is benign most of the time, and hunting intra-cloud threat traffic also requires different analytics. The lookups are limited to the top 200 because the lookups are rate limited, and trying to do all 1000 at once tends to exceed the limits.

Finally, the pandas drop_duplicates method can be used to summarize the activity for an IP address. When you have an IP / ASN / country combination that cannot be reconciled to normal business activity, or is otherwise believed to be threat traffic, deduplication can be used to summarize the traffic in order to ask these questions:
Is the activity inbound, focused on a single port, with similar byte / packet counts? This is often scanning and discovery activity, which may not have much impact unless a vulnerability was exploited and a host started doing what looks like C2.
Is the activity inbound, with large flow, packet, and byte counts, to an RDP or SSH port? That could be credentialed access if it cannot be accounted for as admin activity.
Is the activity outbound, with many small packets destined for one port? That could be C2, or some sort of telemetry or auto-update mechanism, depending on where it is going.
Is the activity outbound, to a single port, with large flow, packet, and byte counts? This can be exfiltration if it cannot be reconciled as normal data movement.
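
As a minimal sketch of what deduplication does here, the toy frame below (made-up addresses and counts, not real flow data) holds three flows between the same pair of hosts on port 443 and one flow on port 22. `drop_duplicates` keeps one row per unique combination of the listed columns, so the result shows which source / port combinations exist, not how often each occurred.

```python
import pandas as pd

# Toy data for illustration only; addresses and counts are made up.
toy = pd.DataFrame({
    'srcaddr': ['10.0.0.5', '10.0.0.5', '10.0.0.5', '10.0.0.9'],
    'dstaddr': ['1.2.3.4'] * 4,
    'dstport': [443, 443, 443, 22],
    'bytes':   [120, 130, 125, 90000],
})

# One row per unique (srcaddr, dstaddr, dstport); keep='last' keeps the final occurrence.
summary = toy.drop_duplicates(subset=['srcaddr', 'dstaddr', 'dstport'], keep='last')
print(summary)
```

The three port 443 flows collapse to a single row, which makes it quick to see whether the traffic is concentrated on one port or spread across many.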



In [None]:
import ipaddress
from ipwhois import IPWhois
from ipwhois.exceptions import IPDefinedError, HTTPLookupError
import pandas as pd
import glob
import os

In [None]:
# For loading flow logs in folders downloaded from S3 (unzip them first)
# Recurse through a folder and ingest flow logs (path goes in the first param)

file_pattern = os.path.join('06', '**', '*.log')
files = glob.glob(file_pattern, recursive=True)
df_list = []

for file in files:
    try:
        if os.path.getsize(file) > 0:
            # sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
            df = pd.read_csv(file, sep=r'\s+')  # or sep='\t' for tab-separated
            df_list.append(df)
        else:
            print(f"Skipping empty file: {file}")
    except Exception as e:
        print(f"Error reading {file}: {e}")

if df_list:
    flows = pd.concat(df_list, ignore_index=True)
    print(flows)
else:
    print("No data frames were read successfully.")
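
For orientation, the sketch below parses two hand-written records in the default (version 2) VPC flow log format, header line included, as whitespace-separated text like the loader above. The field values are invented for illustration, and logs written with a custom format will have different columns.

```python
import io
import pandas as pd

# Two made-up records in the default (version 2) flow log format, with header line.
sample = """version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0a1b2c3d 10.0.1.5 8.8.8.8 44321 443 6 10 840 1718000000 1718000060 ACCEPT OK
2 123456789012 eni-0a1b2c3d - - - - - - - 1718000060 1718000120 - NODATA
"""

# Whitespace-separated, first line used as column names.
sample_flows = pd.read_csv(io.StringIO(sample), sep=r'\s+')
print(sample_flows[['srcaddr', 'dstaddr', 'dstport', 'bytes', 'log-status']])
```

The second record shows why some columns load as strings: when no traffic was recorded for an interval, most fields hold a '-' placeholder instead of a number.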


In [None]:
# for when you have a saved set of previously ingested flows
# flows = pd.read_csv('flows.csv', low_memory = False)

In [None]:
# Find rows where a field holds the '-' placeholder instead of a real value
# Columns are cast to string so the search also works on columns that loaded as numbers
invalid_rows = flows[
    flows['dstaddr'].astype(str).str.contains('-') |
    flows['srcaddr'].astype(str).str.contains('-') |
    flows['dstport'].astype(str).str.contains('-') |
    flows['protocol'].astype(str).str.contains('-') |
    flows['bytes'].astype(str).str.contains('-') |
    flows['packets'].astype(str).str.contains('-')
]
# Display the rows where '-' appears in one of the columns
invalid_rows

In [None]:
# Delete the problem row(s) with invalid values, or fix them and re-ingest
# 55745 is the row index for this data set; substitute the index(es) reported by the check above
flows = flows.drop(index=55745)
flows.reset_index(drop=True, inplace=True)
flows
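
The row index above is specific to one data set. As a more general sketch, assuming the '-' placeholders only appear in rows that can be discarded, the numeric columns can be coerced with `pd.to_numeric(..., errors='coerce')` and any row that fails to convert dropped. The helper name `drop_unparseable` is hypothetical, and the toy frame stands in for `flows`.

```python
import pandas as pd

def drop_unparseable(df, numeric_cols):
    """Return a copy of df without rows whose numeric_cols contain non-numeric values."""
    out = df.copy()
    for col in numeric_cols:
        # Non-numeric values such as '-' become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors='coerce')
    out = out.dropna(subset=numeric_cols).reset_index(drop=True)
    # Restore integer dtypes now that the NaN rows are gone.
    return out.astype({col: 'int64' for col in numeric_cols})

# Toy stand-in for `flows`; values are made up.
toy = pd.DataFrame({
    'dstaddr': ['8.8.8.8', '-', '1.1.1.1'],
    'dstport': ['443', '-', '53'],
    'bytes':   ['840', '-', '120'],
})
clean = drop_unparseable(toy, ['dstport', 'bytes'])
print(clean)
```

It is still worth looking at the dropped rows first, as the check above does, so that nothing unexpected is discarded silently.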

In [None]:
# Convert the timestamps to readable datetimes, and the port, protocol, packet, and byte columns to integers
flows['end'] = pd.to_datetime(flows['end'], unit='s')
flows['start'] = pd.to_datetime(flows['start'], unit='s')

flows['dstport'] = flows['dstport'].astype(int)
flows['bytes'] = flows['bytes'].astype(int)
flows['packets'] = flows['packets'].astype(int)
flows['protocol'] = flows['protocol'].astype(int)
print(flows['dstport'].dtype)
print(flows['protocol'].dtype)
print(flows['packets'].dtype)
print(flows['bytes'].dtype)


In [None]:
flows

In [None]:
# create a subset dataframe (ns) for north / south traffic flows
# north / south traffic transits the Internet vs. east / west inside the VPC

def is_private_ip(ip):
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return False

ns = flows.copy()
ns = ns[ns['action'] == 'ACCEPT']
ns = ns[~(ns['srcaddr'].apply(is_private_ip) & ns['dstaddr'].apply(is_private_ip))]
ns
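
The questions in the introduction depend on whether a flow is inbound or outbound. One way to sketch that, assuming direction can be inferred from which side of the flow is private, is shown below. The function `label_direction` and the `direction` column are hypothetical additions, and the toy frame stands in for `ns`. Flow logs record each direction of a connection separately, so reply traffic is labeled by its own source and destination; treat the label as a starting point. The private check runs once per unique address and is mapped back to the rows, which is usually faster than calling it on every row.

```python
import ipaddress
import pandas as pd

def is_private_ip(ip):
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return False

def label_direction(df):
    """Add a 'direction' column: outbound, inbound, east-west, or other."""
    # Evaluate each distinct address once, then map the result back to the rows.
    addrs = pd.unique(pd.concat([df['srcaddr'], df['dstaddr']]))
    private = {a: is_private_ip(a) for a in addrs}
    src_priv = df['srcaddr'].map(private).astype(bool)
    dst_priv = df['dstaddr'].map(private).astype(bool)

    out = df.copy()
    out['direction'] = 'other'  # e.g. public-to-public
    out.loc[src_priv & ~dst_priv, 'direction'] = 'outbound'
    out.loc[~src_priv & dst_priv, 'direction'] = 'inbound'
    out.loc[src_priv & dst_priv, 'direction'] = 'east-west'
    return out

# Toy stand-in for `ns`; addresses are for illustration only.
toy = pd.DataFrame({
    'srcaddr': ['10.0.1.5', '8.8.8.8', '10.0.1.5'],
    'dstaddr': ['8.8.8.8', '10.0.1.5', '10.0.2.7'],
})
labeled = label_direction(toy)
print(labeled)
```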


In [None]:
# Remove NTP traffic; there will be a lot of benign NTP activity in most fleets.
# A list of known NTP servers would be better, but matching single-packet, 76-byte UDP flows to port 123 works for now

filtered_ns = ns[
    ~(
        (ns['dstport'] == 123) &
        (ns['protocol'] == 17) &
        (ns['packets'] == 1) &
        (ns['bytes'] == 76)
    )
]

In [None]:
# spot check that the majority of ntp flows are indeed gone

ntp = filtered_ns[(filtered_ns['dstport'] == 123) ]
ntp


In [None]:
# Filter out rows where dstaddr is a private IP address, to focus on extrusion
# Group by the specified fields and count the top combinations by number of flows

filtered_ns = filtered_ns[~filtered_ns['dstaddr'].apply(is_private_ip)]
top_combinations = filtered_ns.groupby(['account-id', 'interface-id', 'srcaddr', 'dstaddr', 'dstport', 'protocol', 'action']).size().reset_index(name='flow_count')
top_combinations = top_combinations.sort_values(by='flow_count', ascending=False).head(1000)
top_combinations.reset_index(drop=True, inplace=True)

# ns was already limited to ACCEPT; assigning the result here guards against rejects if that step was skipped
top_combinations = top_combinations[top_combinations['action'] == 'ACCEPT']
pd.options.display.max_rows = 200
top_combinations


In [None]:
# Function to get the ASN, network name, and country for an IP address via an RDAP lookup
def get_whois_info(ip):
    try:
        obj = IPWhois(ip)
        res = obj.lookup_rdap()
        asn = res.get('asn', 'N/A')
        org = res.get('network', {}).get('name', 'N/A')
        country = res.get('network', {}).get('country', 'N/A')
        return pd.Series([asn, org, country])
    except (IPDefinedError, HTTPLookupError, ValueError) as e:
        print(f"Error looking up {ip}: {e}")
        return pd.Series(['N/A', 'N/A', 'N/A'])

In [None]:
# List of network names to be excluded - cloud-to-cloud traffic requires different analytics,
# so cloud providers and OS vendors are excluded in order to focus on extrusion
# Don't try to do lookups on thousands of IPs; you will run into rate limiting

asn_exclude_list = ['AMAZON-02', 'AMAZON-IAD', 'AT-88-Z', 'AMAZON-2011L', 'AMAZO-CF', 'GOOGLE', 'GOOGLE-CLOUD', 'GOOGL-2',
                    'UK-CANONICAL-20151111', 'CANONICAL-CORE',
                    'MICROSOFT', 'MSFT']
top_combinations = top_combinations.head(200)
lookup_results_dst = top_combinations['dstaddr'].apply(get_whois_info)
lookup_df_dst = pd.DataFrame(lookup_results_dst.values.tolist(), index=lookup_results_dst.index, columns=['ASN_dst', 'Organization_dst', 'Country_dst'])
lookup_df = pd.concat([top_combinations, lookup_df_dst], axis=1)

lookup_df = lookup_df[~lookup_df['Organization_dst'].isin(asn_exclude_list)]
lookup_df
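
Because the same destination often appears in many rows, one way to stay under the rate limits is to look up each unique address once and join the result back. The sketch below assumes a lookup function with the same return shape as `get_whois_info` (a three-element Series). The names `lookup_unique`, `fake_lookup`, and `delay_s` are hypothetical, and the fake lookup stands in for the real network call so the example runs offline.

```python
import time
import pandas as pd

def lookup_unique(ips, lookup_fn, delay_s=0.0):
    """Call lookup_fn once per distinct IP and return a DataFrame indexed by IP."""
    rows = {}
    for ip in pd.unique(ips):
        rows[ip] = list(lookup_fn(ip))
        if delay_s:
            time.sleep(delay_s)  # crude pause between requests
    return pd.DataFrame.from_dict(
        rows, orient='index',
        columns=['ASN_dst', 'Organization_dst', 'Country_dst'],
    )

# Stand-in for get_whois_info; records its calls and returns placeholder values.
calls = []
def fake_lookup(ip):
    calls.append(ip)
    return pd.Series(['N/A', 'EXAMPLE-NET', 'N/A'])

# Toy stand-in for top_combinations; values are made up.
toy = pd.DataFrame({'dstaddr': ['8.8.8.8', '1.1.1.1', '8.8.8.8'], 'flow_count': [9, 5, 2]})
info = lookup_unique(toy['dstaddr'], fake_lookup)
labeled = toy.join(info, on='dstaddr')
print(labeled)
```

With the real function, this would be called as `lookup_unique(top_combinations['dstaddr'], get_whois_info, delay_s=1.0)`, where the one-second pause is an arbitrary starting value rather than a documented limit.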


In [None]:
# Summarize traffic for a destination IP by source and destination port using deduplication
# This allows for quick analysis of the nature of the traffic
# Try removing either srcport or dstport from the subset depending on whether this is inbound or outbound traffic
# Set target_ip to the address under investigation

target_ip = '1.2.3.4'  

search = filtered_ns[(filtered_ns['dstaddr'] == target_ip)]
result = search.drop_duplicates(subset=['account-id', 'srcaddr', 'srcport', 'dstaddr', 'dstport',  'protocol', 'action'], keep='last')
result
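
Deduplication shows which combinations occurred but drops the counts. As a complementary sketch (not part of the original workflow), a grouped aggregation keeps flow, packet, and byte totals per source and destination port, which helps with the volume questions from the introduction. The toy frame stands in for `search`, and the output column names are made up for this example.

```python
import pandas as pd

# Toy stand-in for `search`; values are made up.
toy = pd.DataFrame({
    'srcaddr': ['10.0.1.5', '10.0.1.5', '10.0.1.9'],
    'dstaddr': ['1.2.3.4'] * 3,
    'dstport': [443, 443, 22],
    'packets': [10, 12, 400],
    'bytes':   [840, 900, 520000],
})

summary = (
    toy.groupby(['srcaddr', 'dstaddr', 'dstport'])
       .agg(flows=('bytes', 'size'),
            total_packets=('packets', 'sum'),
            total_bytes=('bytes', 'sum'))
       .reset_index()
       .sort_values('total_bytes', ascending=False)
)
# Average packet size helps separate many small packets from bulk transfer.
summary['bytes_per_packet'] = summary['total_bytes'] / summary['total_packets']
print(summary)
```

Many flows with a small average packet size to one port look more like beaconing or telemetry, while a few flows with large byte totals look more like bulk data movement.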
