# Cyber Threat Analyzer (CTA) - Part 1: Finding the Signal in the Noise

**Goal:** Ingest raw system log data, parse it into a structured format using Pandas and Regex, enrich it with threat intelligence from the AbuseIPDB API, and prepare it for exploratory data analysis (EDA).

### The "Why"
Raw log files are overwhelmingly routine system chatter, which we'll call "static." Hidden inside that noise is the small fraction that truly matters: the "signal." This signal is our actionable intelligence: the failed logins, the unusual connections, the real threats. The purpose of this notebook is to build an automatic noise filter. We'll use data engineering to turn that "wall of noise" into a clean, structured dataset, building the foundation for our ML model.

In [1]:
# This is the core library for building
# our data table (DataFrame).
import pandas as pd

# Pathlib makes it easy to build paths that work
# on multiple operating systems.
from pathlib import Path

# The Regular Expression library. This "pattern-matcher"
# will extract actionable data from the noise.
import re

print("Import Complete")

Import Complete


## Step 1: Parse Raw Log File

### The "Why"
Our `system.log` file is just a wall of unstructured text. Until each line is split into named fields, we can't filter, sort, or count anything in it.

### The "How"
Our goal is to use a **Regular Expression (regex)** to tell Python exactly how to read each line and find three specific pieces:

1.  The `TIMESTAMP`
2.  The `[LOG_LEVEL]`
3.  The `MESSAGE`

We'll extract these pieces, put them into a list, and then load that list into a clean, structured Pandas DataFrame. This is the "Extract" and "Transform" part of our ETL pipeline.
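
Before pointing a pattern at a whole file, it helps to try it on a line or two. The sketch below uses the same pattern structure with named groups added for readability. The `parse_line` helper and the sample lines are assumptions for illustration; the real layout of `system.log` may differ.

```python
import re

# Same structure as the three-part pattern described above; the named groups
# are an addition for readability, not part of the original pattern.
LOG_PATTERN = re.compile(
    r'^(?P<timestamp>\S+)\s+\[(?P<log_level>\w+)\]\s+(?P<message>.*)$'
)

def parse_line(line):
    """Return a dict of fields for a matching line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields['message'] = fields['message'].strip()
    return fields

# Hypothetical sample lines: one in the expected layout, one that is not.
sample_lines = [
    "2023-10-27T14:01:03 [INFO] System startup complete.\n",
    "this line has no timestamp or level\n",
]

parsed = [parse_line(line) for line in sample_lines]
unmatched = [line for line, row in zip(sample_lines, parsed) if row is None]

print(parsed[0])
print(f"{len(unmatched)} line(s) did not match")
```

Counting the unmatched lines is a cheap safeguard: a parser that silently drops lines can hide exactly the entries we care about.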

In [13]:
# Define the 'regex' for a single log line telling Python *exactly* what pattern to look for.
#
#    ^         # Start of the line
#    (\S+)     # Capture Group 1: One or more non-space characters (the timestamp)
#    \s+       # Match one or more spaces (the gap after the timestamp)
#    \[        # Match a literal opening bracket '['
#    (\w+)     # Capture Group 2: One or more "word" characters (the log level)
#    \]        # Match a literal closing bracket ']'
#    \s+       # Match one or more spaces (the gap after the log level)
#    (.*)      # Capture Group 3: "Capture everything else" (the message)
#    $         # End of the line
#
log_pattern = re.compile(r'^(\S+)\s+\[(\w+)\]\s+(.*)$')

# Define the path to the log file using the 'pathlib' library.
log_file_path = Path.cwd().parent / "data" / "system.log"

# Create empty list to store structured data
data = []

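# Open the log file and loop through it, line by line.
# An explicit encoding keeps behavior the same across operating systems;
# UTF-8 is assumed here, and undecodable bytes are replaced rather than raising.
with open(log_file_path, 'r', encoding='utf-8', errors='replace') as f: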
    for line in f:
        # Try to match regex "formula" to the current line
        match = log_pattern.match(line)
        
        # If the line matched our pattern, extract the captured groups
        if match:
            data.append({
                'timestamp': match.group(1),
                'log_level': match.group(2),
                'message': match.group(3).strip()
            })

# Create Pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)

print(f"--- CTA Parser finished. Found {len(data)} log entries. ---")

--- CTA Parser finished. Found 8 log entries. ---


### Verify Initial Parsing

Let's display the first few rows using `df.head()` to ensure the regex worked correctly and we have our initial `timestamp`, `log_level`, and `message` columns.

In [3]:
df.head()

## Step 2: Load API Key Securely

We need our AbuseIPDB API key to enrich the data. We load it from a `.env` file using the `python-dotenv` library. This keeps the key out of our code and, as long as `.env` is listed in `.gitignore`, off GitHub.
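
One way to make a missing key fail loudly is a small lookup helper that raises instead of returning `None`. This is a minimal sketch using only the standard library. `require_env` is a hypothetical helper, not part of `python-dotenv`, and the key value shown is a placeholder.

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Demonstration with plain dicts standing in for os.environ.
fake_env = {"ABUSEIPDB_KEY": "example-key-not-real"}
print(require_env("ABUSEIPDB_KEY", fake_env))

try:
    require_env("ABUSEIPDB_KEY", {})
except RuntimeError as err:
    print(f"Stopped early: {err}")
```

Raising early means a later cell never sends a request with an empty key.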

In [4]:
import os
from dotenv import load_dotenv # Import the new library

# load_dotenv() looks for a .env file, starting in the current directory
# and walking up through parent folders, then loads the variables it finds
# into the environment for this notebook session.
load_dotenv()

# Now, the standard os.environ.get() should find the key!
api_key = os.environ.get('ABUSEIPDB_KEY')

if api_key:
    print("✅ Success! API Key loaded successfully from .env file.")
    # Optional: Verify the key looks right
    # print(f"   Key starts with: {api_key[:5]}... and ends with: {api_key[-5:]}")
else:
    print("❌ Error: API Key not found.")
    print("   1. Did you create the `.env` file in the 'cta' root folder?")
    print("   2. Does the `.env` file contain 'ABUSEIPDB_KEY=your-key'?")
    print("   3. Did you install 'python-dotenv' (`pip install python-dotenv`)?")
    print("   4. Did you run the `load_dotenv()` command in this cell?")

✅ Success! API Key loaded successfully from .env file.


## Step 3: Define and Test API Check Function

First, we define the `check_ip` function. This function takes an IP address and our API key, calls the AbuseIPDB `/check` endpoint using the `requests` library, handles potential errors, and returns the relevant 'data' portion of the JSON response (or `None`).

Immediately after defining it, we test the function with a commonly reported IP (`1.2.3.4`) to confirm that our API key loading, the function logic, and the connection to AbuseIPDB all work before we apply it to our actual DataFrame.

In [5]:
import os
import requests
import json # To pretty-print the result
from dotenv import load_dotenv

# --- 1. Load the API Key (using dotenv) ---
load_dotenv()
api_key = os.environ.get('ABUSEIPDB_KEY')

# --- 2. Define the Function to Check an IP ---
def check_ip(ip_address, key):
    """
    Calls the AbuseIPDB API to check a given IP address.
    Returns the 'data' portion of the JSON response, or None on failure.
    """
    if not key:
        print("API Key not loaded. Cannot check IP.")
        return None # Return nothing if the key isn't loaded

    # Define the API endpoint and parameters
    url = 'https://api.abuseipdb.com/api/v2/check'
    params = {
        'ipAddress': ip_address,
        'maxAgeInDays': '90', # How far back to look for reports
        'verbose': True # Ask for more details if available
    }
    headers = {
        'Accept': 'application/json',
        'Key': key
    }

    print(f"Checking IP: {ip_address}...")
    try:
        # A timeout stops the notebook from hanging if the API is unreachable
        response = requests.get(url=url, headers=headers, params=params, timeout=10)
        response.raise_for_status() # Raise an error for bad status codes (4xx or 5xx)

        # If successful, parse the JSON response
        report = response.json()
        return report['data'] # Return just the 'data' part of the response

    except requests.exceptions.RequestException as e:
        print(f"  Error during API request: {e}")
        return None
    except Exception as e:
        print(f"  An unexpected error occurred: {e}")
        return None

# --- 3. Test the function with a known bad IP ---
test_ip = '1.2.3.4' # A commonly reported IP for testing
if api_key:
    ip_data = check_ip(test_ip, api_key)

    # --- 4. Print the results nicely ---
    if ip_data:
        print("\n--- API Report Received ---")
        print(f"  IP Address: {ip_data.get('ipAddress')}")
        print(f"  Country: {ip_data.get('countryCode')}")
        print(f"  Abuse Score: {ip_data.get('abuseConfidenceScore')}%")
        print(f"  Total Reports: {ip_data.get('totalReports')}")
        print(f"  ISP: {ip_data.get('isp')}")
        # print("\nFull Report:") # Uncomment to see everything
        # print(json.dumps(ip_data, indent=2))
    else:
        print("\n--- Failed to get API report ---")
else:
    print("API Key not loaded, skipping test.")

Checking IP: 1.2.3.4...

--- API Report Received ---
  IP Address: 1.2.3.4
  Country: AU
  Abuse Score: 43%
  Total Reports: 39
  ISP: APNIC Debogon Project


## Step 4: Extract IP Addresses from Messages

Not all log messages contain IP addresses. We need a function (`extract_ip`) that uses a regex (`ip_pattern`) specifically designed to find IPv4 addresses within the `message` string. It will return the IP if found, otherwise `None`.

In [6]:
import re

def extract_ip(message):
    """
    Uses regex to find the first IPv4 address in a string.
    Returns the IP address string if found, otherwise None.
    """
    # Regex pattern for an IPv4-shaped string
    # \b matches word boundaries to avoid partial matches
    # (?:\d{1,3}\.){3} matches three groups of (1-3 digits followed by a dot)
    # \d{1,3} matches the final 1-3 digits
    # Note: the pattern does not range-check octets, so '999.999.999.999' also matches
    ip_pattern = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

    match = re.search(ip_pattern, message) # Find the first match in the message

    if match:
        return match.group(0) # Return the matched IP string
    else:
        return None # Return None if no IP was found

In [7]:
# Test cases
test_message_with_ip = "Failed login for user 'root' from IP 1.2.3.4"
test_message_no_ip = "System startup complete."

# Run the function
found_ip = extract_ip(test_message_with_ip)
no_ip = extract_ip(test_message_no_ip)

print(f"IP found in first message: {found_ip}")
print(f"IP found in second message: {no_ip}")

IP found in first message: 1.2.3.4
IP found in second message: None
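
The regex checks only the shape of an address (four groups of digits), so a string like `999.999.999.999` would also match. A minimal sketch of one way to tighten this is to pass each candidate through the standard library's `ipaddress` module. `extract_valid_ip` is a hypothetical variant of `extract_ip`, and the second test message is invented for illustration.

```python
import ipaddress
import re

# Same shape-only pattern as extract_ip, compiled once for reuse.
IP_CANDIDATE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def extract_valid_ip(message):
    """Return the first well-formed IPv4 address in message, or None."""
    for candidate in IP_CANDIDATE.findall(message):
        try:
            # Raises ValueError when an octet is out of range (e.g. 999).
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        return candidate
    return None

print(extract_valid_ip("Failed login for user 'root' from IP 1.2.3.4"))
print(extract_valid_ip("Build 999.999.999.999 deployed"))
print(extract_valid_ip("System startup complete."))
```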


### Apply IP Extraction to DataFrame

Now, we use the Pandas `.apply()` method to run our `extract_ip` function on every row in the `message` column, creating a new `ip_address` column in our DataFrame.

In [8]:
# Create the new 'ip_address' column
# .apply() calls extract_ip once for each message in the column.
df['ip_address'] = df['message'].apply(extract_ip)

print("Created 'ip_address' column.")

# Display the DataFrame to see the new column!
df.head(10) # Show more rows to potentially see IPs and None values

Created 'ip_address' column.


| | timestamp | log_level | message | ip_address |
|---|---|---|---|---|
| 0 | 2023-10-27T14:01:03 | INFO | System startup complete. | |
| 1 | 2023-10-27T14:02:15 | INFO | User 'admin' logged in from 192.168.1.100 | 192.168.1.100 |
| 2 | 2023-10-27T14:02:45 | WARNING | Disk space low on /var. 85% used. | |
| 3 | 2023-10-27T14:03:10 | ERROR | Failed to connect to database [db-01]. IP 10.0... | 10.0.0.5 |
| 4 | 2023-10-27T14:03:12 | INFO | Retrying connection... | |
| 5 | 2023-10-27T14:03:42 | ERROR | Connection to [db-01] timed out. IP 10.0.0.5 | 10.0.0.5 |
| 6 | 2023-10-27T14:05:01 | ERROR | Failed login for user 'root' from IP 1.2.3.4 | 1.2.3.4 |
| 7 | 2023-10-27T14:05:02 | ERROR | Failed login for user 'guest' from IP 99.88.77.66 | 99.88.77.66 |


## Step 5: Enrich Data with API (Efficiently)

To avoid hitting API rate limits, we first get a list of *unique* IP addresses found in our `ip_address` column (`ips_to_check`), filtering out any `None` values.

In [9]:
# Get all unique values from the 'ip_address' column
unique_ips = df['ip_address'].unique()

# Filter out any 'None' values (which pandas often represents as float 'nan')
# We only want actual IP strings
ips_to_check = [ip for ip in unique_ips if isinstance(ip, str)]

# How many unique IPs do we actually need to check?
print(f"Found {len(ips_to_check)} unique IP addresses to check.")
print("Unique IPs:", ips_to_check)

Found 4 unique IP addresses to check.
Unique IPs: ['192.168.1.100', '10.0.0.5', '1.2.3.4', '99.88.77.66']
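
Two of these four addresses (`192.168.1.100` and `10.0.0.5`) are private-range addresses, which a public reputation service has no meaningful history for. A minimal sketch of how they could be filtered out before spending API calls on them uses the standard library's `ipaddress` module. `is_public_ip` is a hypothetical helper, and whether to skip private addresses is a design choice rather than something the pipeline requires.

```python
import ipaddress

def is_public_ip(ip):
    """True if ip parses as an address outside the private and loopback ranges."""
    try:
        # is_private covers ranges such as 10.0.0.0/8, 192.168.0.0/16, and 127.0.0.0/8.
        return not ipaddress.ip_address(ip).is_private
    except ValueError:
        # Not a valid IP address at all.
        return False

ips_to_check = ['192.168.1.100', '10.0.0.5', '1.2.3.4', '99.88.77.66']
public_ips = [ip for ip in ips_to_check if is_public_ip(ip)]

print(public_ips)  # ['1.2.3.4', '99.88.77.66']
```

On this sample, that halves the number of lookups.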


### Check Unique IPs and Cache Results

We loop *only* through the unique IPs, call our `check_ip` function for each, and store the results (abuse score, country) in a dictionary (`ip_report_cache`). This acts as a temporary lookup table to minimize API calls.
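
The caching idea can be seen in isolation with a stand-in for the API call. In this sketch, `fake_lookup` and `build_cache` are hypothetical names, and the stand-in returns a fixed report rather than real AbuseIPDB data. The point is only to count how many lookups actually happen.

```python
def fake_lookup(ip):
    """Stand-in for the real API call; records each invocation."""
    fake_lookup.calls.append(ip)
    return {'abuseConfidenceScore': 0, 'countryCode': None}

fake_lookup.calls = []

def build_cache(ips, lookup):
    """Call lookup once per distinct IP and keep only the fields we need."""
    cache = {}
    for ip in ips:
        if ip in cache:
            continue  # Already looked up; no second call.
        report = lookup(ip) or {}  # Treat a failed lookup (None) as an empty report.
        cache[ip] = {
            'score': report.get('abuseConfidenceScore'),
            'country': report.get('countryCode'),
        }
    return cache

# Five log rows, but only two distinct IPs.
log_ips = ['10.0.0.5', '10.0.0.5', '1.2.3.4', None, '1.2.3.4']
cache = build_cache([ip for ip in log_ips if ip is not None], fake_lookup)

print(f"{len(log_ips)} rows, {len(fake_lookup.calls)} lookups")
```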

In [10]:
# Make sure your API key is loaded (run this if you restarted the notebook)
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.environ.get('ABUSEIPDB_KEY')

# Dictionary to store the results: {ip: {'score': score, 'country': country}, ...}
ip_report_cache = {}

if api_key:
    print("--- Starting API checks for unique IPs ---")
    # Loop through only the unique IPs found
    for ip in ips_to_check:
        print(f"Checking {ip}...")
        report_data = check_ip(ip, api_key) # Call the function we wrote earlier

        if report_data:
            # Store the relevant info (score and country) in our cache
            ip_report_cache[ip] = {
                'score': report_data.get('abuseConfidenceScore'),
                'country': report_data.get('countryCode')
                # Add any other fields you want here! e.g., 'isp': report_data.get('isp')
            }
        else:
            # Handle cases where the API might fail for a specific IP
            ip_report_cache[ip] = {'score': None, 'country': None}
            print(f"  -> Failed to get report for {ip}")

    print("--- Finished API checks ---")
    print("\nIP Report Cache:")
    print(ip_report_cache)

else:
    print("API Key not found. Cannot perform checks.")

--- Starting API checks for unique IPs ---
Checking 192.168.1.100...
Checking IP: 192.168.1.100...
Checking 10.0.0.5...
Checking IP: 10.0.0.5...
Checking 1.2.3.4...
Checking IP: 1.2.3.4...
Checking 99.88.77.66...
Checking IP: 99.88.77.66...
--- Finished API checks ---

IP Report Cache:
{'192.168.1.100': {'score': 0, 'country': None}, '10.0.0.5': {'score': 0, 'country': None}, '1.2.3.4': {'score': 43, 'country': 'AU'}, '99.88.77.66': {'score': 0, 'country': 'US'}}


### Map API Results Back to DataFrame

Using the `ip_report_cache`, we create the new `abuse_score` and `country` columns in the main DataFrame. The `.map()` method looks up each row's `ip_address` in our cache and adds the corresponding score and country. Rows without an IP get a missing value (`None`, which Pandas stores as `NaN` in the numeric `abuse_score` column).
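
The lookup relies on chaining two `.get()` calls so that a missing IP never raises a `KeyError`. A minimal sketch of that behavior on a plain dictionary follows. The `lookup` helper is hypothetical, and the one-entry cache is sample data.

```python
# A one-entry cache in the same nested shape as ip_report_cache.
sample_cache = {'1.2.3.4': {'score': 43, 'country': 'AU'}}

def lookup(ip, field):
    """Return cache[ip][field], or None when the IP or the field is absent."""
    # The first .get() falls back to an empty dict, so the second .get() is always safe.
    return sample_cache.get(ip, {}).get(field)

print(lookup('1.2.3.4', 'score'))    # 43
print(lookup(None, 'score'))         # None (row with no IP)
print(lookup('8.8.8.8', 'country'))  # None (IP never cached)
```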

In [11]:
# Use the .map() method - it's like a VLOOKUP in Excel
# For each IP in 'ip_address', it looks up that IP in our cache
# and then retrieves the 'score' value from the nested dictionary.
df['abuse_score'] = df['ip_address'].map(lambda ip: ip_report_cache.get(ip, {}).get('score'))

# Do the same for the country code
df['country'] = df['ip_address'].map(lambda ip: ip_report_cache.get(ip, {}).get('country'))

print("Added 'abuse_score' and 'country' columns to the DataFrame.")

# Let's see the final enriched DataFrame!
df.head(10)

Added 'abuse_score' and 'country' columns to the DataFrame.


| | timestamp | log_level | message | ip_address | abuse_score | country |
|---|---|---|---|---|---|---|
| 0 | 2023-10-27T14:01:03 | INFO | System startup complete. | | | |
| 1 | 2023-10-27T14:02:15 | INFO | User 'admin' logged in from 192.168.1.100 | 192.168.1.100 | 0.0 | |
| 2 | 2023-10-27T14:02:45 | WARNING | Disk space low on /var. 85% used. | | | |
| 3 | 2023-10-27T14:03:10 | ERROR | Failed to connect to database [db-01]. IP 10.0... | 10.0.0.5 | 0.0 | |
| 4 | 2023-10-27T14:03:12 | INFO | Retrying connection... | | | |
| 5 | 2023-10-27T14:03:42 | ERROR | Connection to [db-01] timed out. IP 10.0.0.5 | 10.0.0.5 | 0.0 | |
| 6 | 2023-10-27T14:05:01 | ERROR | Failed login for user 'root' from IP 1.2.3.4 | 1.2.3.4 | 43.0 | AU |
| 7 | 2023-10-27T14:05:02 | ERROR | Failed login for user 'guest' from IP 99.88.77.66 | 99.88.77.66 | 0.0 | US |


## Step 6: Clean Timestamp Data

The `timestamp` column is currently text (dtype `object`). We convert it to Pandas' `datetime64[ns]` dtype using `pd.to_datetime()` so that Pandas treats it as time data. We verify the conversion using `df.info()`.
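
What the conversion buys us can be shown with the standard library alone: strings cannot be subtracted, while parsed datetimes can. This sketch uses `datetime.fromisoformat` as a stand-in for what `pd.to_datetime()` does per value; the two timestamps are taken from the sample log.

```python
from datetime import datetime

raw = ['2023-10-27T14:05:01', '2023-10-27T14:01:03']

# As text, subtraction is not defined.
try:
    raw[0] - raw[1]
    text_subtraction_works = True
except TypeError:
    text_subtraction_works = False

# As datetimes, the difference is a timedelta we can measure.
parsed = [datetime.fromisoformat(ts) for ts in raw]
gap = parsed[0] - parsed[1]

print(text_subtraction_works)  # False
print(gap.total_seconds())     # 238.0
```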

In [12]:
# Convert the 'timestamp' column from text strings to datetime objects
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Check the data types of our columns to confirm the change
print("DataFrame info after converting timestamp:")
df.info()

DataFrame info after converting timestamp:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   timestamp    8 non-null      datetime64[ns]
 1   log_level    8 non-null      object        
 2   message      8 non-null      object        
 3   ip_address   5 non-null      object        
 4   abuse_score  5 non-null      float64       
 5   country      2 non-null      object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 512.0+ bytes


### Extract Hour of Day Feature

Now that `timestamp` is a datetime object, we can easily extract time components. We use the `.dt.hour` accessor to create a new `hour_of_day` column, which might be a useful feature for our model later. We display `df.head()` again to see the result.
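
To illustrate why an hour column can be useful, the sketch below counts events per hour using only the standard library. The first two timestamps come from the sample log; the third (a 03:12 event) is invented to show a second bucket.

```python
from collections import Counter
from datetime import datetime

timestamps = [
    '2023-10-27T14:01:03',
    '2023-10-27T14:05:01',
    '2023-10-27T03:12:40',  # hypothetical off-hours event
]

# Equivalent of .dt.hour for a plain list of ISO-format strings.
hours = [datetime.fromisoformat(ts).hour for ts in timestamps]
events_per_hour = Counter(hours)

print(hours)                # [14, 14, 3]
print(events_per_hour[14])  # 2
print(events_per_hour[3])   # 1
```

Grouping by hour like this makes activity at unusual times easy to spot.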

In [13]:
# Create a new column 'hour_of_day' by extracting the hour (0-23)
# The '.dt' accessor only works on datetime columns
df['hour_of_day'] = df['timestamp'].dt.hour

print("Added 'hour_of_day' column.")

# Let's see the DataFrame with the new hour column
df.head()

Added 'hour_of_day' column.


| | timestamp | log_level | message | ip_address | abuse_score | country | hour_of_day |
|---|---|---|---|---|---|---|---|
| 0 | 2023-10-27 14:01:03 | INFO | System startup complete. | | | | 14 |
| 1 | 2023-10-27 14:02:15 | INFO | User 'admin' logged in from 192.168.1.100 | 192.168.1.100 | 0.0 | | 14 |
| 2 | 2023-10-27 14:02:45 | WARNING | Disk space low on /var. 85% used. | | | | 14 |
| 3 | 2023-10-27 14:03:10 | ERROR | Failed to connect to database [db-01]. IP 10.0... | 10.0.0.5 | 0.0 | | 14 |
| 4 | 2023-10-27 14:03:12 | INFO | Retrying connection... | | | | 14 |
