# Real-world logfile parsing

The real world is a bit more complicated and does not match the training examples. Sysadmins often use `grep`, `awk` and `sed` to extract parts of a logfile and pipe the output of one tool into the next. However, these one-liners quickly become unreadable line noise. Here is an example of how you would extract information from a logfile using Python.

Open [sfbios.log](data/sfbios.log) in the file browser to get an idea of what kind of logfile we are dealing with.

Now somebody (i.e. your boss 🧐) would like a list of all SIP usernames. They can be found in a structure like this:

```xml
<property name="uri">sip:rolands@tdl.lv</property>
```

The logfile is encoded in UTF-8.

## regex best practices: `re.X`

**Always use the extended regular expression syntax with `re.X`, unless the regex is really trivial.**

Let's say the super-genius 🤓 has just left your team and left you with a regular expression that does some very clever data extraction:

In [None]:
import re
regex = re.compile('^(?P<alias_alternative>(?P<requested_entity>sample|object)(\.(?P<attribute>\w+))?)(\s+(?i)AS\s+(?P<alias>\w+))?\s*$')

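❓ Can anyone here explain what this regex is supposed to do? It is _somehow_ broken! (On Python 3.11 and later it does not even compile, because the inline flag `(?i)` appears in the middle of the pattern instead of at its start.)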

With the `re.X` flag (also spelled `re.VERBOSE`), you can use the extended syntax and comment every piece of your regular expression separately. The same regex as above is now much easier to read and comprehend, and the original intention is preserved. Whitespace inside the pattern is ignored in this mode, which is what lets you spread the regex over many lines. As a consequence, you need to specify all whitespace you want to match explicitly with `\s` or `\s+`:

In [None]:
import re
regex = re.compile(
    r"""^                                              # beginning of the string
        (?P<alias_alternative>                         # use first part as alias, if no alias is defined
          (?P<requested_entity>sample|object)          # string starts with sample or object
          (\.(?P<attribute>\w+))?                      # capture an optional .attribute
        )
        (                                              # capture an optional alias: entity.attribute AS alias
          \s+(?i:AS)\s+                                # whitespace, 'AS' in any case, whitespace
          (?P<alias>\w+)                               # capture the alias
        )?                                             # the whole alias part is optional
        \s*                                            # ignore any trailing whitespace
        $                                              # end of string
    """,
    re.X
)
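As a quick check, the sketch below runs a commented regex of this kind against a few inputs. The sample strings are made up for illustration, and the scoped `(?i:AS)` form (available since Python 3.6) is an assumption about the original intent, namely that only the keyword `AS` should be case-insensitive.

```python
import re

regex = re.compile(
    r"""^                                     # beginning of the string
        (?P<alias_alternative>                # fallback alias: the entity itself
          (?P<requested_entity>sample|object) # starts with sample or object
          (\.(?P<attribute>\w+))?             # optional .attribute
        )
        (
          \s+(?i:AS)\s+                       # 'AS' in any case, surrounded by whitespace
          (?P<alias>\w+)                      # the alias
        )?                                    # the alias part is optional
        \s*                                   # trailing whitespace
        $                                     # end of string
    """,
    re.X,
)

# Hypothetical inputs, chosen to exercise the optional parts
examples = ["sample.name as label", "object", "Sample.name AS label"]
results = {}
for text in examples:
    match = regex.search(text)
    results[text] = match.groupdict() if match else None
    print(repr(text), "->", results[text])
```

The last input does not match, because only `AS` is case-insensitive here while `sample` and `object` are not.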

## regex best practices: Do not use `re.match`, always use `re.search`

The regular expression below does **not match anything**:

In [None]:
import re
line = "Cats are smarter than dogs"
re.match("dogs$", line)

but this **does**:

In [None]:
import re
line = "Cats are smarter than dogs"
re.search("dogs$", line)

**Why?** The difference between `re.match()` and `re.search()` is that `re.match()` behaves as if every pattern had `\A` prepended (or `^` if you don't use multiline mode), so it only ever tries to match at the very beginning of the string. Anyone accustomed to regular expression matching in Perl, grep, or sed is misled by `re.match()`.
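The difference can be made visible side by side. The following is a small sketch using the same sentence as above; the variable names are chosen for illustration only.

```python
import re

line = "Cats are smarter than dogs"

anchored_match = re.match(r"dogs$", line)      # only tries position 0, so it fails
anchored_search = re.search(r"\Adogs$", line)  # the same restriction, written explicitly
free_search = re.search(r"dogs$", line)        # scans the string for a match
start_match = re.match(r"Cats", line)          # succeeds, "Cats" is at position 0

print(anchored_match, anchored_search)
print(free_search.group(0), free_search.span())
print(start_match.group(0))
```

If you really want anchoring, writing `\A` or `^` into the pattern and calling `re.search()` keeps that decision visible in the regex itself.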

There is actually a reason why `re.match()` exists at all: **speed**. When no match is possible, `re.search()` takes considerably [more time](https://stackoverflow.com/questions/29007197/why-have-re-match) than `re.match()` to fail, because it retries the pattern at every position of the string. I am inclined to say that Python has an implementation problem here. I think `re.match()` should rather be *deprecated*, because it leads to unnecessary problems, despite the speed gain one might observe.

## regex best practices: named capture groups

A very common practice is to group elements in a regular expression:

In [None]:
import re

url = '/some/url/our_first_parameter/our_second_parameter'
match = re.search("^/some/url/((.*?)/(.*?))$", url)
match.groups()

However, this makes the captured parameters positional. If you have nested groups, you have to count the opening round brackets `(` to get the position of every parameter right. And if you decide to remove a group later, you will have to check every position again.


Instead, give your groups names, so you can easily rearrange them without having to worry about their positions:

In [None]:
import re

url = '/some/url/our_first_parameter/our_second_parameter'
match = re.search(r"""
    ^                       # beginning of the string
    /some/url/              # match the base-url
    (?P<the_whole_thing>    # capture both parameters
      (?P<param1>.*)        # capture the first parameter only
      /                     # ... followed by a /
      (?P<param2>.*)        # capture the second parameter only
    )
    $                       # end of the string
    """, url, re.X)
if match:
    print(match.groupdict())

This leads to much more robust regular expressions, especially when we are adding new or removing existing captures.
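To illustrate the robustness claim, the sketch below compares two variants of the URL pattern, one without and one with an extra outer group. With positional access, the meaning of index 1 changes; with named access, `param1` stays the same. The group name `both` is made up for this example.

```python
import re

url = "/some/url/our_first_parameter/our_second_parameter"

# Positional groups: adding an outer group shifts every index by one
flat = re.search(r"^/some/url/(.*?)/(.*?)$", url)
nested = re.search(r"^/some/url/((.*?)/(.*?))$", url)
print(flat.group(1))    # first parameter
print(nested.group(1))  # now both parameters, index 1 changed its meaning

# Named groups: the same restructuring leaves param1 and param2 untouched
named_flat = re.search(
    r"^/some/url/(?P<param1>.*?)/(?P<param2>.*?)$", url
)
named_nested = re.search(
    r"^/some/url/(?P<both>(?P<param1>.*?)/(?P<param2>.*?))$", url
)
print(named_flat.group("param1"), named_nested.group("param1"))
```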

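Within the pattern itself, a named capture group is back-referenced with `(?P=the_name_of_the_captured_group)`. In **substitutions**, i.e. in the replacement string of `re.sub()`, named capture groups are back-referenced by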

```
\g<the_name_of_the_captured_group>
```
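Both kinds of back-reference can be tried out in a few lines. This is a minimal sketch: the patterns and the sample sentence are invented for illustration, and `[^/]+` is used instead of `.*` so that each parameter cannot contain a slash.

```python
import re

# \g<name> in a replacement string: swap the two path parameters
url = "/some/url/our_first_parameter/our_second_parameter"
swapped = re.sub(
    r"^/some/url/(?P<param1>[^/]+)/(?P<param2>[^/]+)$",
    r"/some/url/\g<param2>/\g<param1>",
    url,
)
print(swapped)

# (?P=name) inside a pattern: find a word that is immediately repeated
doubled = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a typo")
print(doubled.group("word"))
```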

## regex best practices in a real-world parsing problem

Now back to our real problem. We will use the extended regex syntax and a named capture group:

In [None]:
import re

regex = re.compile(r'''
    <property\s name="uri">  # beginning of the property element
    (?P<sip>.*?)             # fetch content, put it into the named capture group «sip»
    <\/property>             # end of element
    ''', re.X)

In [None]:
import re
 
regex = re.compile(r'''
    <property\s name="uri">  # beginning of element
    (?P<sip>.*?)             # fetch content, put it into the named capture group «sip»
    <\/property>             # end of element
    ''', re.X)

log_file_path = "data/sfbios.log"

with open(log_file_path, "r", encoding="utf-8") as log_file, \
     open("sip_list", "w", encoding="utf-8") as out:   # write the output as UTF-8, too
    for line in log_file:
        match = regex.search(line)    # BE AWARE: always use re.search, NEVER re.match!
        if match:
            out.write(match.groupdict()['sip'] + "\n")

Voilà!

In [None]:
!cat sip_list
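The list above may contain the same URI many times, and it still carries the `sip:` prefix and the domain. The following sketch shows one possible way to collect only the unique user parts. It works on a few in-memory sample lines instead of the real logfile; apart from the first URI, which is the example from the beginning of this section, the sample data and the group names `user` and `domain` are assumptions made for illustration.

```python
import re

regex = re.compile(r'''
    <property\s name="uri">  # beginning of element
    sip:(?P<user>[^@<]+)     # user part of the SIP URI, after the scheme
    @(?P<domain>[^<]+)       # domain part
    </property>              # end of element
    ''', re.X)

# Hypothetical sample lines standing in for the logfile
sample_lines = [
    '<property name="uri">sip:rolands@tdl.lv</property>',
    'some unrelated log line',
    '<property name="uri">sip:rolands@tdl.lv</property>',
    '<property name="uri">sip:alice@example.org</property>',
]

users = set()                            # a set drops duplicates automatically
for line in sample_lines:
    for match in regex.finditer(line):   # finditer also finds several hits per line
        users.add(match.group("user"))

print(sorted(users))
```

To apply this to the real file, the loop over `sample_lines` would be replaced by a loop over the opened logfile, as in the cell above.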

## Things never go right from the beginning: the Python debugger

To start the debugger in the most typical way, enter this line of code somewhere in the code above:

```py
import pdb; pdb.set_trace()
```

Then, run the cell again.

Use the following commands to control your debugger:

* `h` show help
* `n` execute the next line
* `s` step into (follow into a function call)
* `r` continue until the current function returns
* `l` list the source code around the current line
* `ll` list the whole source code of the current function
* `p expr` evaluate and print an expression, e.g. a variable
* `c` continue until the next breakpoint
* `q` quit the debugger
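In a loop over thousands of log lines, stopping on every iteration quickly gets tedious. One common approach is a conditional breakpoint, sketched below. The helper `extract_sip` and its `debug` parameter are hypothetical names introduced for this example; `breakpoint()` is the built-in available since Python 3.7, which calls `pdb.set_trace()` by default.

```python
import re

regex = re.compile(r'<property\s+name="uri">(?P<sip>.*?)</property>')

def extract_sip(line, debug=False):
    """Return the captured SIP URI of a line, or None if there is none."""
    match = regex.search(line)
    if debug and match is None:
        # Stop only on lines that did not match, so they can be inspected
        breakpoint()
    return match.group("sip") if match else None

print(extract_sip('<property name="uri">sip:rolands@tdl.lv</property>'))
print(extract_sip("no property here"))
```

Calling `extract_sip(line, debug=True)` drops into the debugger on the first non-matching line, where `p line` shows what the regex failed on.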

**Your task:**
    
Set breakpoints at various places in your code and use the commands above or this [cheatsheet](https://www.nnja.io/2019/python-debugging-cheatsheet.pdf) to get a feel for how the debugger works.