In [1]:
import pandas as pd
log_path = "../data/raw/HDFS_2k.log"

with open(log_path, "r") as f:
    lines = f.readlines()

len(lines)

2000

In [4]:
lines[10]

'081109 204722 567 INFO dfs.DataNode$PacketResponder: Received block blk_5402003568334525940 of size 67108864 from /10.251.214.112\n'

### Regex for HDFS Logs

In [7]:
import re

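log_pattern = re.compile(
    r'(?P<date>\d{6})\s+'        # date as yymmdd, e.g. 081109
    r'(?P<time>\d{6})\s+'        # time as HHMMSS, e.g. 204722
    r'(?P<ms>\d+)\s+'            # numeric third field, e.g. 567 (named `ms` here; Loghub's structured HDFS files label it Pid)
    r'(?P<level>\w+)\s+'         # log level, e.g. INFO
    r'(?P<component>[^:]+):\s+'  # component: everything up to the first colon
    r'(?P<message>.*)'           # remainder of the line (the newline is not captured)
)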
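As a quick sanity check, the pattern can be applied to the sample line printed by `lines[10]` above. This is a minimal sketch: the pattern is repeated so the block runs on its own, and the names `sample` and `fields` are illustrative rather than part of the notebook.

```python
import re

# Same pattern as above, repeated so this block is self-contained.
log_pattern = re.compile(
    r'(?P<date>\d{6})\s+'
    r'(?P<time>\d{6})\s+'
    r'(?P<ms>\d+)\s+'
    r'(?P<level>\w+)\s+'
    r'(?P<component>[^:]+):\s+'
    r'(?P<message>.*)'
)

# The raw line shown by lines[10] earlier in the notebook.
sample = (
    "081109 204722 567 INFO dfs.DataNode$PacketResponder: "
    "Received block blk_5402003568334525940 of size 67108864 "
    "from /10.251.214.112\n"
)

match = log_pattern.match(sample)
fields = match.groupdict() if match else {}

# Print one captured group per line to eyeball the split.
for name, value in fields.items():
    print(f"{name:>9}: {value}")
```

Because `[^:]+` stops at the first colon, the component ends before `: Received ...`, and any later colons stay in `message`.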

### Parse Again (Slowly)

In [8]:
parsed_logs = []

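for line in lines:
    # match() anchors at the start of the line; lines that do not fit
    # the pattern are skipped silently.
    match = log_pattern.match(line)
    if match:
        parsed_logs.append(match.groupdict())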

len(parsed_logs)

2000
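All 2000 lines matched here, but the loop above would drop a non-matching line without any notice. A hedged sketch of one way to keep track of such lines follows. The second entry in `sample_lines` is a made-up malformed line for illustration only, and `parsed` and `unmatched` are hypothetical names.

```python
import re

# Same pattern as the notebook, repeated so this block is self-contained.
log_pattern = re.compile(
    r'(?P<date>\d{6})\s+'
    r'(?P<time>\d{6})\s+'
    r'(?P<ms>\d+)\s+'
    r'(?P<level>\w+)\s+'
    r'(?P<component>[^:]+):\s+'
    r'(?P<message>.*)'
)

sample_lines = [
    # Real line from the dataset (lines[10] above).
    "081109 204722 567 INFO dfs.DataNode$PacketResponder: "
    "Received block blk_5402003568334525940 of size 67108864 "
    "from /10.251.214.112\n",
    # Invented malformed line, only to exercise the failure path.
    "this line does not follow the HDFS layout\n",
]

parsed, unmatched = [], []
for line_no, line in enumerate(sample_lines):
    match = log_pattern.match(line)
    if match:
        parsed.append(match.groupdict())
    else:
        # Keep the index and raw text so the line can be inspected later.
        unmatched.append((line_no, line.rstrip("\n")))

print(len(parsed), "parsed,", len(unmatched), "unmatched")
```

Comparing `len(parsed) + len(unmatched)` with the input length is a cheap check that no line was lost on the way.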

In [9]:
# pandas is already imported as pd in the first cell.
df = pd.DataFrame(parsed_logs)
df.head()

     date    time   ms  level                     component                                            message
0  081109  203615  148   INFO  dfs.DataNode$PacketResponder  PacketResponder 1 for block blk_38865049064139...
1  081109  203807  222   INFO  dfs.DataNode$PacketResponder  PacketResponder 0 for block blk_-6952295868487...
2  081109  204005   35   INFO              dfs.FSNamesystem  BLOCK* NameSystem.addStoredBlock: blockMap upd...
3  081109  204015  308   INFO  dfs.DataNode$PacketResponder  PacketResponder 2 for block blk_82291938032499...
4  081109  204106  329   INFO  dfs.DataNode$PacketResponder  PacketResponder 2 for block blk_-6670958622368...


### Build a Proper Timestamp (One Last Fix)

In [10]:
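# date and time are still strings, so "+" concatenates them
# (e.g. "081109" + "203615") before parsing as yymmddHHMMSS.
df["timestamp"] = pd.to_datetime(
    df["date"] + df["time"],
    format="%y%m%d%H%M%S"
)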

df.dtypes

date                 object
time                 object
ms                   object
level                object
component            object
message              object
timestamp    datetime64[ns]
dtype: object
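The output above shows that everything except `timestamp` is still `object`, because regex groups are always captured as strings. A minimal sketch of one possible follow-up is to cast the numeric field and the low-cardinality `level` column. The single-row frame below is built from the `lines[10]` sample so the block runs on its own; it is not an extra step in the notebook.

```python
import pandas as pd

# One row shaped like the parsed output above (values from lines[10]).
df = pd.DataFrame([{
    "date": "081109",
    "time": "204722",
    "ms": "567",
    "level": "INFO",
    "component": "dfs.DataNode$PacketResponder",
    "message": "Received block blk_5402003568334525940 of size 67108864 "
               "from /10.251.214.112",
}])

# Same timestamp construction as in the notebook.
df["timestamp"] = pd.to_datetime(df["date"] + df["time"], format="%y%m%d%H%M%S")

# Digits captured by the regex are strings; convert them to integers.
df["ms"] = pd.to_numeric(df["ms"])

# Log levels take few distinct values, so a categorical dtype fits well.
df["level"] = df["level"].astype("category")

print(df.dtypes)
```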

### Observations

- Real-world event data rarely matches initial assumptions.
- Inspecting raw data before parsing is critical.
- Event schemas must be derived, not assumed.
- Once parsed, system logs behave like event-driven analytics data.
- Foundations from earlier days made this dataset manageable.
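
To illustrate the point about logs behaving like event data, here is a small sketch that treats parsed rows as events and aggregates them by component and by time window. It uses only the five rows shown by `df.head()` above (the truncated `message` column is left out). The names `events`, `per_component`, and `per_window`, and the 10-minute window, are illustrative choices.

```python
import pandas as pd

# The five rows from df.head(), without the truncated message column.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["081109203615", "081109203807", "081109204005",
         "081109204015", "081109204106"],
        format="%y%m%d%H%M%S",
    ),
    "level": ["INFO"] * 5,
    "component": [
        "dfs.DataNode$PacketResponder",
        "dfs.DataNode$PacketResponder",
        "dfs.FSNamesystem",
        "dfs.DataNode$PacketResponder",
        "dfs.DataNode$PacketResponder",
    ],
})

# Events per component, most frequent first.
per_component = events.groupby("component").size().sort_values(ascending=False)

# Events per 10-minute window, using the timestamp as the index.
per_window = events.set_index("timestamp").resample("10min").size()

print(per_component)
print(per_window)
```

The same two calls apply unchanged to the full 2000-row frame once its `timestamp` column has been built.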