# Cyber Threat Analyzer (CTA) - Part 1: Parsing

This notebook walks through ingesting a raw system log and parsing it into a clean pandas DataFrame.

In [1]:
import pandas as pd
from pathlib import Path
import re

print("Import Complete")

Import Complete


## Step 1: Parsing the Log File

Each log line follows a custom format: TIMESTAMP [LOG_LEVEL] MESSAGE. A plain pd.read_csv call is a poor fit because the lines have no consistent delimiter: the message itself contains spaces. Instead, we build a custom regular expression (regex) that matches this pattern and extracts the three fields.
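As a quick illustration of the three capture groups, the sketch below applies the pattern built in the next cell to a single line. The sample line is an assumption, reconstructed from the parsed output shown in Step 2; the raw log file itself is not reproduced in this notebook.

```python
import re

# Same pattern as the parser cell: timestamp, bracketed level, message.
log_pattern = re.compile(r'^(\S+)\s+\[(\w+)\]\s+(.*)$')

# Assumed sample line, rebuilt from the parsed DataFrame output.
sample_line = "2023-10-27T14:02:45 [WARNING] Disk space low on /var. 85% used.\n"

match = log_pattern.match(sample_line)
if match:
    # groups() returns the three captured fields as a tuple.
    timestamp, log_level, message = match.groups()
    print(timestamp)
    print(log_level)
    print(message)
```

The trailing newline is not captured: the dot in (.*) does not match a newline, and $ matches just before a newline at the end of the string.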

In [2]:
# Define the regex for a single log line. Each piece of the pattern is explained below.
#
#    ^         # Start of the line
#    (\S+)     # Capture Group 1: One or more non-whitespace characters (the timestamp)
#    \s+       # Match one or more whitespace characters (the gap after the timestamp)
#    \[        # Match a literal opening bracket '['
#    (\w+)     # Capture Group 2: One or more "word" characters (the log level)
#    \]        # Match a literal closing bracket ']'
#    \s+       # Match one or more whitespace characters (the gap after the log level)
#    (.*)      # Capture Group 3: Everything up to the end of the line (the message)
#    $         # End of the line
#
log_pattern = re.compile(r'^(\S+)\s+\[(\w+)\]\s+(.*)$')

# Define the path to the log file using the 'pathlib' library.
log_file_path = Path.cwd().parent / "data" / "system.log"

# Create empty list to store structured data
data = []

# Open the log file and loop through it, line by line
with open(log_file_path, 'r') as f:
    for line in f:
        # Try to match the compiled pattern against the current line
        match = log_pattern.match(line)
        
        # If the line matched our pattern, extract the captured groups
        if match:
            data.append({
                'timestamp': match.group(1),
                'log_level': match.group(2),
                'message': match.group(3).strip()
            })

# Create Pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)

print(f"--- CTA Parser finished. Found {len(data)} log entries. ---")

--- CTA Parser finished. Found 8 log entries. ---
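Note that the loop above silently drops any line that does not match the pattern. A minimal sketch of one way to keep track of those lines follows. The helper name parse_lines, the in-memory sample_lines, and the indented continuation line are all hypothetical and only serve to make the example self-contained.

```python
import re

log_pattern = re.compile(r'^(\S+)\s+\[(\w+)\]\s+(.*)$')

def parse_lines(lines):
    """Return (parsed, skipped): parsed rows and non-blank lines that did not match."""
    parsed, skipped = [], []
    for line_number, line in enumerate(lines, start=1):
        match = log_pattern.match(line)
        if match:
            timestamp, log_level, message = match.groups()
            parsed.append({
                'timestamp': timestamp,
                'log_level': log_level,
                'message': message.strip(),
            })
        elif line.strip():
            # Keep the line number so the raw file can be inspected later.
            skipped.append((line_number, line.rstrip("\n")))
    return parsed, skipped

# Hypothetical input: two well-formed lines, one continuation line, one blank line.
sample_lines = [
    "2023-10-27T14:01:03 [INFO] System startup complete.\n",
    "    at connect (hypothetical continuation line)\n",
    "\n",
    "2023-10-27T14:03:12 [INFO] Retrying connection...\n",
]

parsed, skipped = parse_lines(sample_lines)
print(f"Parsed {len(parsed)} entries, skipped {len(skipped)} lines.")
for line_number, text in skipped:
    print(f"  line {line_number}: {text!r}")
```

Comparing the number of skipped lines against the parsed count is a cheap way to notice when the log format differs from what the pattern expects.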


## Step 2: Verify the DataFrame

Let's use df.head() to check the first 5 rows and confirm the parser worked as expected.

In [3]:
df.head()

| | timestamp | log_level | message |
|---|---|---|---|
| 0 | 2023-10-27T14:01:03 | INFO | System startup complete. |
| 1 | 2023-10-27T14:02:15 | INFO | User 'admin' logged in from 192.168.1.100 |
| 2 | 2023-10-27T14:02:45 | WARNING | Disk space low on /var. 85% used. |
| 3 | 2023-10-27T14:03:10 | ERROR | Failed to connect to database [db-01]. IP 10.0... |
| 4 | 2023-10-27T14:03:12 | INFO | Retrying connection... |
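Beyond eyeballing the first rows, a few extra checks can help confirm the parse. The sketch below is one possible approach, not part of the original notebook: it rebuilds a small DataFrame from three of the rows shown above (so it runs on its own), converts the timestamp strings to datetimes, and counts entries per log level.

```python
import pandas as pd

# Three rows copied from the df.head() output, so the example is self-contained.
df = pd.DataFrame({
    'timestamp': ['2023-10-27T14:01:03', '2023-10-27T14:02:45', '2023-10-27T14:03:12'],
    'log_level': ['INFO', 'WARNING', 'INFO'],
    'message': [
        'System startup complete.',
        'Disk space low on /var. 85% used.',
        'Retrying connection...',
    ],
})

# The parser stores timestamps as strings; convert them to datetime64.
# errors='coerce' turns unparseable values into NaT instead of raising.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

print(df.dtypes)
print(df['log_level'].value_counts())
print("Unparseable timestamps:", df['timestamp'].isna().sum())
```

With a datetime column in place, sorting and time-window filtering work directly on the timestamp values rather than on their string form.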


## Next Steps

Now that we have clean data, the next step is to enrich it by calling the AbuseIPDB API...
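Any IP-based enrichment first needs the addresses pulled out of the message text. The following is a rough sketch of that preparatory step only, using the standard library; it does not call the AbuseIPDB API. The helper name extract_ips and the third sample message are hypothetical.

```python
import ipaddress
import re

# Loose candidate pattern: four dot-separated groups of 1 to 3 digits.
ipv4_candidate = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def extract_ips(message):
    """Return the valid IPv4 addresses found in a log message."""
    found = []
    for candidate in ipv4_candidate.findall(message):
        try:
            # ip_address() rejects out-of-range octets such as 999.
            found.append(ipaddress.ip_address(candidate))
        except ValueError:
            continue
    return found

messages = [
    "User 'admin' logged in from 192.168.1.100",
    "Disk space low on /var. 85% used.",
    "Hypothetical message with an invalid address 999.1.1.1",
]

for message in messages:
    for ip in extract_ips(message):
        # Private addresses are not publicly routable, so a reputation
        # lookup on them is unlikely to be meaningful.
        print(ip, "private" if ip.is_private else "public")
```

Flagging private ranges up front avoids spending lookups on addresses like 192.168.1.100 that only exist on the local network.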