# Log Parsing Doesn't Need to be Difficult

In light of auth.log (an SSH log) destroying everyone's accuracy in National Cyber League for the third semester in a row, I have written a guide to log parsing. The log data in this guide is generated by a Python script I wrote, which you can find in the same folder as this notebook (you should go through the guide before looking at it!).

## Overview
It's important to remember when doing log parsing that logs are structured data. That's good news: structured data can be easily captured by patterns, which allow us to restructure the data however we want. Trying to understand a log using ad-hoc scripts for each task may work in the short term, but if you need to understand anything remotely advanced, it'll come back to bite you. Let's look at how to do things better:

## Exploratory Analysis
We'll start by taking a peek at the data:

In [1]:
with open("logd.log") as log:
    log_str = log.read()

log_list = log_str.split("\n")
for line in log_list[:25]:
    print(line)

1504695719 [logd] Failed password for account admin from 192.168.165.25
1504695721 [logd] Failed password for account root from 192.168.39.58
1504695722 [logd] Failed password for account root from 192.168.165.25
1504695728 [logd] Failed password for account guest from 192.168.109.206
1504695728 [logd] Failed password for account root from 192.168.177.236
1504695734 [logd] Failed password for account test from 192.168.47.192
1504695742 [logd] Failed password for account root from 192.168.85.154
1504695749 [logd] Failed password for account root from 192.168.199.224
1504695756 [logd] Failed password for account adm from 192.168.206.216
1504695764 [logd] Failed password for account root from 192.168.165.25
1504695770 [logd] Failed password for account info from 192.168.54.206
1504695777 [logd] Failed password for account admin from 192.168.165.25
1504695781 [logd] Failed password for account admin from 192.168.238.227
1504695783 [logd] Failed password for account ftp from 192.168.112.18


Looks pretty simple: a timestamp, a program name, then a description of a failed password attempt. The only real cause for concern is the "repeated 3 times" message that turns up a little further into the log (the first one is at timestamp 1504695820). Don't worry, I'll come back to that. Let's see if there's anything else of interest:

In [2]:
print(log_list[194])

1504696545 [logd] Successful password for account root from 192.168.206.216


Okay, so logins can be successful too. That's good, it wouldn't be a very interesting service if logins could only fail.

In [3]:
print(log_list[114])

1504696206 [logd] Error: Failed to do a thing


Uh oh! So there are errors every now and then. Good thing we weren't planning on doing any parsing based solely on "Failed" showing up in the line! That sure would have caused some issues.

In [4]:
print(log_list[187])

1504696527 [logd] Failed ssh key for account admin from 192.168.109.206


This tells us that logins can fail based on ssh keys as well as passwords. You may also guess, based on the fact that this took until the 188th entry to show up, that this kind of failure is fairly rare.

Now that I've shown you all the gotchas in this particular log, let's look at how to find them without a walkthrough.

## Loading the Data
Let's say, to start with, that all I've seen so far are the failed password messages we looked at first. In order to find log messages that don't follow that structure, I'm going to write a regular expression to capture the structure I've seen and then see what shows up in the log not following that structure. The important thing here is to capture all the variation I've seen with as few repeat symbols (`+` and `*`, in particular) in the regular expression as possible. This avoids accidentally capturing log events that don't follow the structure I've seen (which, for now, we don't want to do).
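
To make the "few repeat symbols" advice concrete, here is a minimal sketch comparing a loose pattern with a strict one. The three sample lines are copied from the log excerpts in this guide; the names `sample`, `loose`, and `strict` are just for illustration.

```python
import re

# Three lines in the format of this log; the second is an error event.
sample = [
    "1504695719 [logd] Failed password for account admin from 192.168.165.25",
    "1504696206 [logd] Error: Failed to do a thing",
    "1504696527 [logd] Failed ssh key for account admin from 192.168.109.206",
]

# Loose: built almost entirely from repeat symbols, so it accepts
# any line that happens to mention "Failed".
loose = re.compile(r".*Failed.*")

# Strict: spells out the one structure seen so far (failed passwords).
strict = re.compile(
    r"\d+ \[logd\] Failed password for account \w+ from \d{1,3}(?:\.\d{1,3}){3}"
)

loose_hits = [line for line in sample if loose.fullmatch(line)]
strict_hits = [line for line in sample if strict.fullmatch(line)]

print(len(loose_hits))   # 3: the error and the ssh key failure slip through
print(len(strict_hits))  # 1: only the failed password line
```

The strict pattern leaves the other two lines unmatched, which is exactly what we want at this stage: unmatched lines are how we discover structure we haven't seen yet.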

As an aside: if you do any kind of text parsing, you need to know regex. A great cheat sheet / general resource is available here: https://www.rexegg.com/regex-quickstart.html Additionally, the second chapter of Jurafsky and Martin's <i>Speech and Language Processing</i> provides a good introduction to writing regex. A draft of the third edition is currently available online for free: https://web.stanford.edu/~jurafsky/slp3/2.pdf

In [5]:
import re

# Raw strings (r"...") keep Python from treating \d, \[ and friends as string escapes
ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
passwd_matches = re.findall(r"\d+ \[logd\] Failed password for account \w+ from " + ip, log_str)

for match in passwd_matches[:10]:
    print(match)

# I subtract 1 from len(log_list) because there's an empty line at the bottom of the file
print("\nTotal events:", len(log_list) - 1)
print("Captured here:", len(passwd_matches))

1504695719 [logd] Failed password for account admin from 192.168.165.25
1504695721 [logd] Failed password for account root from 192.168.39.58
1504695722 [logd] Failed password for account root from 192.168.165.25
1504695728 [logd] Failed password for account guest from 192.168.109.206
1504695728 [logd] Failed password for account root from 192.168.177.236
1504695734 [logd] Failed password for account test from 192.168.47.192
1504695742 [logd] Failed password for account root from 192.168.85.154
1504695749 [logd] Failed password for account root from 192.168.199.224
1504695756 [logd] Failed password for account adm from 192.168.206.216
1504695764 [logd] Failed password for account root from 192.168.165.25

Total events: 10000
Captured here: 9441


It looks like the "Failed password" pattern only accounts for about 94% of the data. Our goal is to write a pattern that accounts for 100% of the data, so it's clear just from this that we're missing something. Before moving on though, I want to show how we can modify what we already have to get some useful information.

First, I'm going to add a few capture groups to extract the timestamp, user, and ip from the regex pattern above. If you don't know what a capture group is, refer to the regex links above. All I'm doing is surrounding the stuff I want in parentheses for easy access later.

In [6]:
fail_pass_data = re.findall(r"(\d+) \[logd\] Failed password for account (\w+) from (" + ip + ")", log_str)
for match in fail_pass_data[:10]:
    print(match)

('1504695719', 'admin', '192.168.165.25')
('1504695721', 'root', '192.168.39.58')
('1504695722', 'root', '192.168.165.25')
('1504695728', 'guest', '192.168.109.206')
('1504695728', 'root', '192.168.177.236')
('1504695734', 'test', '192.168.47.192')
('1504695742', 'root', '192.168.85.154')
('1504695749', 'root', '192.168.199.224')
('1504695756', 'adm', '192.168.206.216')
('1504695764', 'root', '192.168.165.25')


`fail_pass_data` now contains timestamps, users, and ips, all nicely structured in tuples. If, for example, I wanted to know how many unique ips failed a password attempt, all I would need is the following line:

In [7]:
print(len(set([data[2] for data in fail_pass_data])))

36


The third element of each tuple in `fail_pass_data` is the ip, so I create a list of all the third elements using a list comprehension. If you're not very familiar with Python and don't know what a list comprehension is, check out the <a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions">Python docs</a>. After getting the list, I remove duplicate ips by converting the list to a set. Then I simply print the length of that set.
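
As an optional variation (not what the rest of this guide uses), named capture groups let you refer to fields by name instead of by tuple index. This is a minimal sketch on three lines copied from the log; the group names `time`, `user`, and `ip` are my own choice.

```python
import re

sample = (
    "1504695719 [logd] Failed password for account admin from 192.168.165.25\n"
    "1504695721 [logd] Failed password for account root from 192.168.39.58\n"
    "1504695722 [logd] Failed password for account root from 192.168.165.25\n"
)

# (?P<name>...) is a capture group that can be looked up by name.
pattern = re.compile(
    r"(?P<time>\d+) \[logd\] Failed password for account (?P<user>\w+) "
    r"from (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
)

# finditer yields match objects; groupdict() turns each into a dict.
records = [m.groupdict() for m in pattern.finditer(sample)]

# Same "unique ips" query as above, written as a set comprehension.
unique_ips = {r["ip"] for r in records}
print(len(unique_ips))  # 2
```

The trade-off is a longer pattern in exchange for queries like `r["ip"]` that don't depend on remembering which position each field is in.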

Now that we've done something useful with the data we have, let's improve the pattern by looking at what it doesn't cover.

In [8]:
non_matches = re.findall(r"(?!\d+ \[logd\] Failed password for account \w+ from " + ip + r")^.*$", log_str, flags=re.M)
for match in non_matches[:10]:
    print(match)

1504695820 [logd] repeated 3 times: [ Failed password for account root from 192.168.177.236 ]
1504696021 [logd] repeated 4 times: [ Failed password for account test from 192.168.137.64 ]
1504696049 [logd] repeated 4 times: [ Failed password for account admin from 192.168.238.227 ]
1504696206 [logd] Error: Failed to do a thing
1504696276 [logd] repeated 2 times: [ Failed password for account root from 192.168.177.236 ]
1504696301 [logd] repeated 2 times: [ Failed password for account root from 192.168.228.10 ]
1504696343 [logd] Error: Failed to do a thing
1504696527 [logd] Failed ssh key for account admin from 192.168.109.206
1504696533 [logd] repeated 5 times: [ Failed password for account test from 192.168.109.206 ]
1504696545 [logd] Successful password for account root from 192.168.206.216


All I'm doing is placing my original pattern inside a negative lookahead, then adding the most basic match for a line afterwards: `^.*$`. This matches every line that my failed password pattern doesn't match. Note that I had to add `flags=re.M` to make this work. Without that flag, `^` and `$` only match at the start and end of the whole string rather than at the start and end of each line.
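
If the lookahead feels opaque, an equivalent way to get the same information is to test each line on its own. This is only a sketch of that alternative on a few lines copied from the log; `known` and `unmatched` are illustrative names.

```python
import re

sample = (
    "1504695719 [logd] Failed password for account admin from 192.168.165.25\n"
    "1504696206 [logd] Error: Failed to do a thing\n"
    "1504696545 [logd] Successful password for account root from 192.168.206.216\n"
)

# The structure seen so far: failed password events only.
known = re.compile(
    r"\d+ \[logd\] Failed password for account \w+ from \d{1,3}(?:\.\d{1,3}){3}"
)

# Keep every non-empty line that the known pattern does not fully match.
unmatched = [
    line for line in sample.splitlines()
    if line and not known.fullmatch(line)
]

for line in unmatched:
    print(line)
```

This prints the error line and the successful login. One difference from the lookahead version: `fullmatch` requires the pattern to cover the entire line, so a line with unexpected trailing text would also be reported.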

The first thing I notice about the output is the "repeated" events. This can be a bit of a pain to parse correctly, so I'm just going to replace all the repetitions with whatever message they're repeating, multiplied by the number of times it's repeated. Here's the code to do it:

In [9]:
# (\d+) rather than (\d) so repeat counts of 10 or more are expanded too
fixed_log = re.sub(r"^(.*)repeated (\d+) times: \[ (.*?) \]\n",
                   lambda m: (m.group(1) + m.group(3) + "\n") * int(m.group(2)),
                   log_str, flags=re.M)
print(len(fixed_log.split("\n")) - 1)

10532


This is a bit more complicated than the last regex code: I'm capturing the front part of the log message (the timestamp and program name), the repeat count, and the message that's repeated. Then I'm using a lambda function to build the replacement: the front part followed by the repeated message, repeated as many times as the log says. It looks like this added 532 records to our log, which seems about right.
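
To see what the substitution does on a small scale, here is a sketch that runs the same idea on two lines copied from the log. I've written the replacement as a named function (`expand`, my own name) instead of a lambda so each piece is easier to read.

```python
import re

sample = (
    "1504695820 [logd] repeated 3 times: [ Failed password for account root from 192.168.177.236 ]\n"
    "1504696206 [logd] Error: Failed to do a thing\n"
)

def expand(m):
    # group(1): "1504695820 [logd] "   (timestamp and program name)
    # group(2): "3"                    (repeat count)
    # group(3): the repeated message, without the surrounding "[ ... ]"
    return (m.group(1) + m.group(3) + "\n") * int(m.group(2))

expanded = re.sub(
    r"^(.*)repeated (\d+) times: \[ (.*?) \]\n", expand, sample, flags=re.M
)
print(expanded, end="")
```

The single "repeated 3 times" line becomes three identical failed password lines, and the error line passes through untouched, so two input lines turn into four.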

Now that we've fixed the repeat messages, let's look at what our match isn't accounting for again.

In [10]:
non_matches = re.findall(r"(?!\d+ \[logd\] Failed password for account \w+ from " + ip + r")^.*$", fixed_log, flags=re.M)
for match in non_matches[:10]:
    print(match)

1504696206 [logd] Error: Failed to do a thing
1504696343 [logd] Error: Failed to do a thing
1504696527 [logd] Failed ssh key for account admin from 192.168.109.206
1504696545 [logd] Successful password for account root from 192.168.206.216
1504696601 [logd] Failed ssh key for account admin from 192.168.199.224
1504696912 [logd] Error: Failed to do a thing
1504697104 [logd] Failed ssh key for account guest from 192.168.165.25
1504697110 [logd] Error: Failed to do a thing
1504697212 [logd] Successful password for account test from 192.168.109.206
1504697223 [logd] Successful password for account admin from 192.168.238.227


So, it looks like 3 things are missing: errors, successes, and ssh key events. I'll update the expression to account for these:

In [11]:
matches = re.findall(r"\d+ \[logd\] (?:\w+ .+? for account \w+ from " + ip + r")|(?:Error: .*)", fixed_log)

for match in matches[:10]:
    print(match)

print("\nTotal events:", len(fixed_log.split("\n")) - 1)
print("Captured here:", len(matches))

1504695719 [logd] Failed password for account admin from 192.168.165.25
1504695721 [logd] Failed password for account root from 192.168.39.58
1504695722 [logd] Failed password for account root from 192.168.165.25
1504695728 [logd] Failed password for account guest from 192.168.109.206
1504695728 [logd] Failed password for account root from 192.168.177.236
1504695734 [logd] Failed password for account test from 192.168.47.192
1504695742 [logd] Failed password for account root from 192.168.85.154
1504695749 [logd] Failed password for account root from 192.168.199.224
1504695756 [logd] Failed password for account adm from 192.168.206.216
1504695764 [logd] Failed password for account root from 192.168.165.25

Total events: 10532
Captured here: 10532


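10532 out of 10532! Perfect. Notice that, for events that have an entirely different structure (Failed/Successful vs. Error), I separate them with an or block of non-capturing groups: `(?:event1)|(?:event2)`. One thing to be aware of: `|` has the lowest precedence in a regex, so the `\d+ \[logd\] ` prefix belongs only to the first alternative, and Error events are matched starting at `Error:` without their timestamp. You will see shortly that this leads to some nice structure.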
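
If you would rather have the timestamp prefix apply to both kinds of event, one option is to wrap the whole alternation in a single non-capturing group. This is a sketch of that variation on two lines copied from the log, not the pattern used in the rest of this guide; `ip_re`, `ungrouped`, and `grouped` are illustrative names.

```python
import re

sample = (
    "1504695719 [logd] Failed password for account admin from 192.168.165.25\n"
    "1504696206 [logd] Error: Failed to do a thing\n"
)

ip_re = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# As written in this guide: "|" splits the WHOLE pattern in two.
ungrouped = r"\d+ \[logd\] (?:\w+ .+? for account \w+ from " + ip_re + r")|(?:Error: .*)"

# Variation: the prefix sits outside one group that holds both alternatives.
grouped = r"\d+ \[logd\] (?:\w+ .+? for account \w+ from " + ip_re + r"|Error: .*)"

ungrouped_matches = re.findall(ungrouped, sample)
grouped_matches = re.findall(grouped, sample)

print(ungrouped_matches[1])  # Error: Failed to do a thing
print(grouped_matches[1])    # 1504696206 [logd] Error: Failed to do a thing
```

Both versions match the same number of events, so the coverage count doesn't change. The difference only matters once you want the timestamp of an error.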

## Information Extraction
Now, let's add some capture groups and start extracting information.

In [12]:
matches = re.findall(r"(\d+) \[logd\] (?:(\w+) (.+?) for account (\w+) from (" + ip + r"))|(?:(Error): (.*))", fixed_log)

for match in matches[:10]:
    print(match)
# example of an error message, for reference
print(matches[157])

('1504695719', 'Failed', 'password', 'admin', '192.168.165.25', '', '')
('1504695721', 'Failed', 'password', 'root', '192.168.39.58', '', '')
('1504695722', 'Failed', 'password', 'root', '192.168.165.25', '', '')
('1504695728', 'Failed', 'password', 'guest', '192.168.109.206', '', '')
('1504695728', 'Failed', 'password', 'root', '192.168.177.236', '', '')
('1504695734', 'Failed', 'password', 'test', '192.168.47.192', '', '')
('1504695742', 'Failed', 'password', 'root', '192.168.85.154', '', '')
('1504695749', 'Failed', 'password', 'root', '192.168.199.224', '', '')
('1504695756', 'Failed', 'password', 'adm', '192.168.206.216', '', '')
('1504695764', 'Failed', 'password', 'root', '192.168.165.25', '', '')
('', '', '', '', '', 'Error', 'Failed to do a thing')


Our new `matches` list gives us everything we need to know about the data. For example, if we want to look at all the error messages, we just have to look at the seventh capture group wherever the sixth capture group is "Error":

In [13]:
print(set(m[6] for m in matches if m[5] == "Error"))

{'Failed to do a thing'}


In this case, there was only one type of error message, but for more complicated logs this can give some useful information. Next, let's see how many times there was a failed ssh key login for root.

In [14]:
print(len([m for m in matches if m[1] == "Failed" and m[2] == "ssh key" and m[3] == "root"]))

46


Let's say instead we want to know how many unique ips failed an ssh key login for root. No problem! We just need a slight adjustment to the code.

In [15]:
print(len(set([m[4] for m in matches if m[1] == "Failed" and m[2] == "ssh key" and m[3] == "root"])))

12


Looks like there's a lot of repetition, so let's see which ip was the most common of those 12. Figuring out which ip made the most attempts will be a bit more work, but not much.

In [16]:
fail_ssh_ips = [m[4] for m in matches if m[1] == "Failed" and m[2] == "ssh key" and m[3] == "root"]
ip_dict = {}
# Use a name other than `ip` so the regex string defined earlier isn't overwritten
for addr in fail_ssh_ips:
    if addr not in ip_dict:
        ip_dict[addr] = 1
    else:
        ip_dict[addr] += 1

print(ip_dict)
print("Most common ip:", max(ip_dict, key=lambda x: ip_dict[x]))

{'192.168.167.26': 2, '192.168.137.64': 4, '192.168.39.58': 8, '192.168.206.216': 2, '192.168.109.206': 6, '192.168.199.224': 1, '192.168.85.154': 5, '192.168.238.227': 2, '192.168.165.25': 7, '192.168.177.236': 2, '192.168.158.70': 2, '192.168.170.8': 5}
Most common ip: 192.168.39.58
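
The counting loop above can also be written with `collections.Counter` from the standard library. This is a small sketch of that alternative; the list of ips here is a short made-up sample, not the real query result.

```python
from collections import Counter

# Illustrative stand-in for fail_ssh_ips (not the real data).
sample_ips = [
    "192.168.39.58",
    "192.168.165.25",
    "192.168.39.58",
    "192.168.170.8",
]

# Counter builds the same {ip: count} mapping as the manual loop.
counts = Counter(sample_ips)

# most_common(1) returns a list containing the single top (ip, count) pair.
top_ip, top_count = counts.most_common(1)[0]
print("Most common ip:", top_ip, "with", top_count, "attempts")
```

`most_common(n)` also makes it easy to look at the top few ips rather than just the winner.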


Next, let's see what the most common username was for successful logins (both ssh key and password logins).

In [17]:
success_unames = [m[3] for m in matches if m[1] == "Successful"]
uname_dict = {}
for uname in success_unames:
    if uname not in uname_dict:
        uname_dict[uname] = 1
    else:
        uname_dict[uname] += 1

print(uname_dict)
print("Most common username:", max(uname_dict, key=lambda x: uname_dict[x]))

{'root': 15, 'test': 8, 'admin': 13, 'administrator': 1, 'guest': 6, 'adm': 1, 'info': 2, 'pi': 1}
Most common username: root


Finally, let's extract unique ips from a certain timeframe (selected arbitrarily).

In [18]:
time_ips = [m[4] for m in matches if m[5] != "Error" and int(m[0]) > 1504690000 and int(m[0]) < 1504700000]
print("Unique ips in this timespan:", len(set(time_ips)))
print("Number of events in this timespan:", len(time_ips))

Unique ips in this timespan: 27
Number of events in this timespan: 1025
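
In practice, a timeframe usually starts out as a human-readable time rather than a raw number. Here is a sketch of converting between the two with `datetime`, assuming the timestamps are Unix epoch seconds in UTC (the log itself doesn't say, so treat that as my assumption). The rows are tuples copied from the outputs above.

```python
from datetime import datetime, timezone

# Tuples in the same shape as `matches`; the last one is an error event
# whose timestamp field is empty.
rows = [
    ("1504695719", "Failed", "password", "admin", "192.168.165.25", "", ""),
    ("1504696545", "Successful", "password", "root", "192.168.206.216", "", ""),
    ("1504697223", "Successful", "password", "admin", "192.168.238.227", "", ""),
    ("", "", "", "", "", "Error", "Failed to do a thing"),
]

# Epoch seconds -> readable time (assuming UTC).
first_seen = datetime.fromtimestamp(1504695719, tz=timezone.utc)
print(first_seen.isoformat())

# Readable window -> epoch bounds: 11:00 to 11:15 UTC on 2017-09-06.
start = int(datetime(2017, 9, 6, 11, 0, tzinfo=timezone.utc).timestamp())
end = int(datetime(2017, 9, 6, 11, 15, tzinfo=timezone.utc).timestamp())

# `m[0] and ...` skips rows with an empty timestamp before calling int().
window_ips = {m[4] for m in rows if m[0] and start <= int(m[0]) < end}
print(window_ips)
```

Checking that `m[0]` is non-empty does the same job as the `m[5] != "Error"` test in the cell above: it keeps `int()` from being called on an empty string.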


I hope that it's clear at this point that, using this data structure, we can extract almost any statistic of interest from the log. Even better, everything is consistent. If we do make an error at some point (which is bound to occur), fixing the parser for one query will fix it for all the queries. This makes it much easier to identify what went wrong if a statistic is calculated incorrectly.

## Wrap-up
To conclude, I would like to reiterate the steps of the process:
1. Observe some of the data.
2. Write a regular expression to capture that data, using as few `*` and `+` symbols as possible.
3. Observe the data not matched by the expression.
4. If captured data is less than 100%, revise the expression following the same guideline as before and return to step 3.
5. Surround all variables of interest with capture groups.

This procedure provides a straightforward, comprehensive approach to extracting information from simple log files. 
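
The loop in steps 2 through 4 can be sketched as a small helper that reports how much of the log a pattern covers and which lines it misses. The function name `coverage` and the sample lines (copied from the log above) are only for illustration.

```python
import re

def coverage(pattern, lines):
    """Return (fraction of lines fully matched, list of unmatched lines)."""
    compiled = re.compile(pattern)
    unmatched = [line for line in lines if not compiled.fullmatch(line)]
    matched = len(lines) - len(unmatched)
    return matched / len(lines), unmatched

lines = [
    "1504695719 [logd] Failed password for account admin from 192.168.165.25",
    "1504696206 [logd] Error: Failed to do a thing",
    "1504696527 [logd] Failed ssh key for account admin from 192.168.109.206",
    "1504696545 [logd] Successful password for account root from 192.168.206.216",
]

ip_re = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# Step 2: first attempt, based only on failed password events.
first = r"\d+ \[logd\] Failed password for account \w+ from " + ip_re
first_fraction, first_missed = coverage(first, lines)
print(first_fraction)      # 0.25
for line in first_missed:  # Step 3: look at what was missed
    print(line)

# Step 4: revise the pattern and check coverage again.
revised = r"\d+ \[logd\] (?:\w+ .+? for account \w+ from " + ip_re + r"|Error: .*)"
revised_fraction, revised_missed = coverage(revised, lines)
print(revised_fraction)    # 1.0
```

Once coverage reaches 100%, step 5 is just a matter of adding parentheses around the fields you care about.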