# open and close files

### always use the `with` statement to open a file

Until Python 2.5, the usual way to open a file and write something into it was like this:

In [None]:
fh = open("say_hi.txt", "w")
print("hi", file=fh)
print("ho", file=fh)
not_allowed = 1/0     # simulate the real world: an error happens during the write process

fh.close()

In [None]:
!cat say_hi.txt

What has been written to the file? Nothing! The file is empty. This is because the content is still in a memory buffer which has not been _flushed_ to the file. We can force a flush after every write by passing `flush=True` to the `print` function:

In [None]:
fh = open("say_hi.txt", "w")
print("hi", file=fh, flush=True)
print("ho", file=fh, flush=True)
not_allowed = 1/0     # simulate the real world: an error happens during the write process

fh.close()

**However, this is error-prone, a lot to type and easy to forget.**

The `with` statement is a safe way to open a file and write content. If anything happens during the writing process, the memory buffer gets automatically flushed and written to the file, and the file gets closed properly:

In [None]:
with open("say_hi.txt", "w", encoding="utf-8") as file_handle_1:
    print("I ❤︎ ♚ and ♛", file=file_handle_1)
    not_allowed = 1/0     # still creates an error, but now the content is already saved!


We still receive the error, but at least our content has now reached its destination:

In [None]:
!cat say_hi.txt
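
What the `with` statement does for a file can be approximated with an explicit `try`/`finally` block. The sketch below is only an illustration of that idea: the helper name `write_and_fail` and the temporary scratch file are assumptions made for the example and are not part of the notebook.

```python
import os
import tempfile

def write_and_fail(path):
    """Write one line, then fail; the finally block still closes the file."""
    fh = open(path, "w", encoding="utf-8")
    try:
        print("hi", file=fh)
        not_allowed = 1 / 0   # simulate an error during the write process
    finally:
        fh.close()            # closing also flushes the buffer to disk

# hypothetical scratch location, so the notebook's say_hi.txt is left alone
path = os.path.join(tempfile.mkdtemp(), "say_hi_demo.txt")

try:
    write_and_fail(path)
except ZeroDivisionError:
    pass                      # the error still reaches the caller

with open(path, "r", encoding="utf-8") as fh:
    saved = fh.read()
print(repr(saved))
```

The `with` form expresses the same guarantee in one line and cannot be forgotten halfway through.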

### read from one file, write to another

The `with` statement can also open multiple files at the same time, which makes it possible to copy content safely. **Note:** The backslash `\` at the end of the first line is needed to break the statement into two separate lines:

In [None]:
with open("say_hi.txt", "r", encoding="utf-8") as file_handle_1, \
     open("say_out.txt", "w", encoding="utf-8") as file_handle_2:
    
    content = file_handle_1.read()   # read in all content
    content = content.rstrip("\n")
    
    for i in range(1,11):
        print(f"{i}:\t{content}", file=file_handle_2)


In [None]:
!cat say_out.txt
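
On Python 3.10 or newer, the backslash can be avoided by wrapping the context managers in parentheses. The following is a minimal sketch under that version assumption; the scratch directory and the shortened loop are chosen for illustration only.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()                 # hypothetical scratch directory
src = os.path.join(workdir, "say_hi.txt")
dst = os.path.join(workdir, "say_out.txt")

# create a small input file for the demonstration
with open(src, "w", encoding="utf-8") as fh:
    print("hi", file=fh)

# Python 3.10+: parentheses group the context managers, no backslash needed
with (
    open(src, "r", encoding="utf-8") as file_handle_1,
    open(dst, "w", encoding="utf-8") as file_handle_2,
):
    content = file_handle_1.read().rstrip("\n")
    for i in range(1, 4):
        print(f"{i}:\t{content}", file=file_handle_2)

with open(dst, "r", encoding="utf-8") as fh:
    copied = fh.read()
print(copied, end="")
```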

### Read line by line

There is a `readline()` method available which does what it says on the tin: it reads a line!

In [None]:
with open("say_out.txt", "r", encoding="utf-8") as file_handle_2:
    myline = file_handle_2.readline()
    while myline:
        print(myline, end="")  # the line already contains a newline, so we set end="" to avoid double newlines
        myline = file_handle_2.readline()

This is not really convenient. Why not use **a for loop** instead?

In [None]:
with open("say_out.txt", "r", encoding="utf-8") as file_handle_2:
    for line in file_handle_2:
        print(line, end="")   # the line already contains a newline, so we set end="" to avoid double newlines

### get all lines of a file as a list

For this task we could use the `readlines()` method:

In [None]:
with open("say_out.txt", "r", encoding="utf-8") as file_handle_2:
    all_lines = file_handle_2.readlines()

In [None]:
all_lines

Almost. We still have the unnecessary newline in every item, which we want to get rid of. And we might want to get rid of the numbers and the tabs, too.

In [None]:
with open("say_out.txt", "r", encoding="utf-8") as file_handle_2:
    all_lines = [line.rstrip('\n').split("\t")[1] for line in file_handle_2]

The line above is rather compact. It contains:

1. A list comprehension that iterates over the file line by line: `for line in file_handle_2`
2. For every `line`, we remove the trailing newline using the `line.rstrip("\n")` method
3. The remaining string is split at the tab character: `split("\t")`
4. The `split` method returns a list, and because we are only interested in the second column, we add `[1]`

Voilà!

In [None]:
all_lines
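
For comparison, the comprehension can be unrolled into an ordinary loop with one step per line. This sketch uses a small in-memory list, `lines`, as an assumed stand-in for the content of `say_out.txt`, so it runs without the file.

```python
# stand-in for the lines of say_out.txt (assumed content, for illustration)
lines = ["1:\thi\n", "2:\thi\n", "3:\thi\n"]

all_lines = []
for line in lines:
    without_newline = line.rstrip("\n")      # e.g. "1:\thi"
    columns = without_newline.split("\t")    # e.g. ["1:", "hi"]
    all_lines.append(columns[1])             # keep the second column only

# the compact form produces the same list
compact = [line.rstrip("\n").split("\t")[1] for line in lines]

print(all_lines)
```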

## Real-world logfile parsing using regular expressions

The real world is a bit more complicated and does not match the training examples. Sysadmins use `grep`, `awk` and `sed` to extract parts of a logfile and pipe the output into another command. However, these one-liners quickly become unreadable line noise. Here is an example of how you would extract information from a logfile using Python.

Open `sfbios.log` in the file browser to get an idea of what kind of logfile we are dealing with.

Now somebody (i.e. your boss) would like to extract a list of all sip usernames. The sip usernames can be found in a structure like this:

`<property name="uri">sip:rolands@tdl.lv</property>`

The logfile is encoded in utf-8.

### REGEX best practices

**Always use the extended regular expression syntax with `re.X`, unless the regex is really trivial.**

Let's say, someone built a super regular expression which does some very smart extraction:

In [None]:
import re
# raw string, and the global (?i) flag at the very start: Python 3.11+ rejects it anywhere else
regex = re.compile(r'(?i)^(?P<alias_alternative>(?P<requested_entity>sample|object)(\.(?P<attribute>\w+))?)(\s+AS\s+(?P<alias>\w+))?\s*$')

With the `re.X` flag, you can use the extended syntax and comment every piece of your regular expression separately. The same regex as above is now much easier to read and comprehend, and the original intention is preserved. Because whitespace inside the pattern is ignored in this mode, which is what lets you spread the regex over many lines, you need to specify all significant whitespace explicitly with `\s` or `\s+`:

In [None]:
import re
regex = re.compile(
    r"""^                                              # beginning of the string
        (?P<alias_alternative>                         # use first part as alias, if no alias is defined
          (?P<requested_entity>sample|object)          # string starts with sample or object
          (\.(?P<attribute>\w+))?                      # capture an optional .attribute
        )
        (                                              # capture an optional alias: entity.attribute AS alias
          \s+AS\s+                                     # whitespace, 'AS' (case is ignored via re.I), whitespace
          (?P<alias>\w+)                               # capture the alias
        )?                                             # 
        \s*                                            # ignore any trailing whitespace
        $                                              # end of string
    """,
    re.X + re.I
)
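
To see what the named groups capture, the pattern can be tried on a few inputs. This is a small sketch: the three test strings are made up for illustration, and the pattern is repeated here (with case-insensitivity supplied only by the `re.I` flag) so that the example is self-contained.

```python
import re

regex = re.compile(
    r"""^
        (?P<alias_alternative>                 # used as alias if none is given
          (?P<requested_entity>sample|object)  # sample or object
          (\.(?P<attribute>\w+))?              # optional .attribute
        )
        (
          \s+AS\s+                             # the keyword AS, surrounded by whitespace
          (?P<alias>\w+)                       # the alias
        )?
        \s*$                                   # optional trailing whitespace, end of string
    """,
    re.X | re.I,
)

# hypothetical inputs: full form, bare entity, and a string that should not match
results = {}
for text in ["sample.code AS my_alias", "object", "sample code"]:
    match = regex.search(text)
    results[text] = match.groupdict() if match else None
    print(text, "->", results[text])
```

Groups that did not take part in the match, such as `alias` for the input `object`, show up as `None` in `groupdict()`.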

**Do not use `re.match`, always use `re.search`**

This regular expression below does **not match anything**:

In [None]:
import re
line = "Cats are smarter than dogs"
re.match("dogs$", line)

but this **does**:

In [None]:
import re
line = "Cats are smarter than dogs"
re.search("dogs$", line)

**Why?** The difference between `re.match()` and `re.search()` is that `re.match()` behaves as if every pattern had `\A` prepended (or `^` if you don't use multiline mode). Anyone accustomed to Perl, grep, or sed regular expression matching is misled by `re.match()`.

There is actually a reason why `re.match()` exists at all: **speed**. When `re.search()` is used and no match is possible, it takes considerably [more time](https://stackoverflow.com/questions/29007197/why-have-re-match) than `re.match()` until the matching fails. I am inclined to say: Python has an implementation problem here. I think `re.match()` should be *deprecated*, because it leads to unnecessary problems, despite the speed gain one might observe.
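
The anchoring behaviour can be checked directly. In this short sketch, the explicit `\A` in the second call is added only to mimic what `re.match()` does implicitly.

```python
import re

line = "Cats are smarter than dogs"

anchored_implicitly = re.match("dogs$", line)       # None: "dogs" is not at the start
anchored_explicitly = re.search(r"\Adogs$", line)   # None: \A has the same effect
found_anywhere = re.search("dogs$", line)           # matches at the end of the line
found_at_start = re.match("Cats", line)             # matches, because the line starts with "Cats"

print(anchored_implicitly, anchored_explicitly)
print(found_anywhere.group(), found_at_start.group())
```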

### Make use of **named capture groups**

A very common practice is to group elements in a regular expression:

```python
import re

url = '/some/url/our_first_parameter/our_second_parameter'
match = re.search("^/some/url/((.*?)/(.*?))$", url)
match.groups()

# returns
('our_first_parameter/our_second_parameter',
 'our_first_parameter',
 'our_second_parameter')
```

However, this leads to the problem that the parameters fetched are positional.  If you have nested group captures, you have to count the number of the opening round brackets `(` to get the position of every parameter right. And if you decide to remove a grouping later, you will have to check every position again.


Instead, give your groups a name so you can easily rearrange your groupings without having to worry about their positions:
<strong>

```python
import re

url = '/some/url/our_first_parameter/our_second_parameter'
match = re.search(r"""
    ^                       # beginning of the string
    /some/url/              # match the base-url
    (?P<the_whole_thing>    # capture both parameters; "(" and "?P<" must not be separated
      (?P<param1>.*?)       # capture the first parameter only
      /                     # ... followed by a /
      (?P<param2>.*?)       # capture the second parameter only
    )
    $                       # end of the string
    """, url, re.X)
if match:
    print(match.groupdict())

# returns
{
    'the_whole_thing': 'our_first_parameter/our_second_parameter',
    'param1': 'our_first_parameter',
    'param2': 'our_second_parameter'
}
```
</strong>

This leads to much more robust regular expressions, especially when we are adding new or removing existing captures.

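In **substitutions** (the replacement string of `re.sub()`), named capture groups are back-referenced with the syntax shown below. Within the regular expression itself, the back-reference is written `(?P=the_name_of_the_captured_group)` instead.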

```
\g<the_name_of_the_captured_group>
```
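
A short sketch of both kinds of back-reference in Python: `\g<name>` in the replacement string of `re.sub()`, and `(?P=name)` inside a pattern. The group names and sample strings are made up for illustration.

```python
import re

# in a replacement string: \g<name> refers to the named group
swapped = re.sub(
    r"(?P<first>\w+)/(?P<second>\w+)",
    r"\g<second>/\g<first>",
    "our_first_parameter/our_second_parameter",
)
print(swapped)

# inside the pattern itself: (?P=name) refers back to the named group
doubled = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a typo")
print(doubled.group("word"))
```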

### back to the real-world problem...

Now back to our real problem, we will use this extended regex syntax:

In [None]:
import re

regex = re.compile(r'''
    <property\s name="uri">  # beginning of the property element
    (?P<sip>.*?)             # fetch content, put it into the named capture group «sip»
    <\/property>             # end of element
    ''', re.X)
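
Before running the pattern over the whole logfile, it can be checked against the single example line quoted earlier. This is only a quick sanity check; `sample_line` is a variable introduced for the illustration.

```python
import re

regex = re.compile(r'''
    <property\s name="uri">  # beginning of the property element
    (?P<sip>.*?)             # content of the element
    <\/property>             # end of element
    ''', re.X)

# the example structure from the text, used as a one-line test input
sample_line = '<property name="uri">sip:rolands@tdl.lv</property>'

match = regex.search(sample_line)
sip = match.group("sip") if match else None
print(sip)
```

Note that the captured value is the complete URI, including the `sip:` prefix.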

In [None]:
import re
 
regex = re.compile(r'''
    <property\s name="uri">  # beginning of element
    (?P<sip>.*?)             # fetch content, put it into the named capture group «sip»
    <\/property>             # end of element
    ''', re.X)

log_file_path = "sfbios.log"

with open(log_file_path, "r", encoding="utf-8") as logfile_handle, \
     open("sip_list", "w", encoding="utf-8") as output_handle:
    for line in logfile_handle:
        match = regex.search(line)    # BE AWARE: always use re.search, NEVER re.match!
        if match:
            print(match.groupdict()['sip'], file=output_handle)

Voilà!

In [None]:
!cat sip_list