The sample dataset [apache](https://github.com/gdv/foundationsCS/tree/master/students/ex-data/apache) contains the files *access.log* and *error.log*, which hold the log of the accesses to a web server and the log of its errors, respectively.
The *access.log* is in [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format).
The entries in *error.log* usually have a corresponding entry in *access.log*.

1.  Read the file *access.log*
1.  Count the number of accesses (number of lines) made by an IP number
1.  Count the number of successful accesses (status 200) made by an IP number
1.  Count the number of accesses for each directory served
1.  For each origin, count the number of successful accesses
1.  For each origin, count the number of unsuccessful accesses, split according to the
    status code
1.  From the results of the previous point, add a column with the error class (the first
    digit of the status code)
1.  Cluster the accesses in 5-minute time slices (e.g. from 14:00 to 14:05, from 14:05 to
    14:10, etc.). Count the number of accesses for each time slice
1.  Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

### Extra points

1.  For each `[info]` entry of *error.log*, find the next entry of *access.log*. For
    example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the
    entry at `[07/Mar/2004:18:02:10 -0800]`
1.  Count the number of times that the two accesses of the previous point have the same origin.


## Read the file *access.log*

```python
import pandas as pd
import re
```

Since the first row of the file *access.log* does not contain the names of the columns, we use the `names` option. Moreover, we use a custom separator, otherwise the fields `type`, `url`, and `prot` would be combined together.
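As a rough sketch of this step, the snippet below reads three made-up log lines from a string instead of the real file. The regex separator and the column names `ident`, `user`, `timestamp`, `tz` and `size` are assumptions made for illustration; only `origin`, `type`, `url`, `prot` and `status` are named in the text.

```python
import io
import pandas as pd

# Three made-up lines in Common Log Format (not taken from the dataset).
sample = "\n".join([
    '10.0.0.1 - - [07/Mar/2004:16:05:49 -0800] "GET /docs/index.html HTTP/1.1" 200 1024',
    'example.org - - [07/Mar/2004:16:06:51 -0800] "GET /docs/faq.html HTTP/1.1" 404 512',
    '10.0.0.2 - - [07/Mar/2004:16:10:02 -0800] "POST /cgi/form HTTP/1.0" 200 2048',
])

# Hypothetical column names: only origin, type, url, prot and status
# appear in the text, the others are placeholders.
columns = ["origin", "ident", "user", "timestamp", "tz",
           "type", "url", "prot", "status", "size"]

# Assumed separator: any run of whitespace, double quotes or square brackets.
# This splits the quoted request into type, url and prot.
access = pd.read_csv(io.StringIO(sample), sep=r'[\s"\[\]]+',
                     names=columns, engine="python")
print(access[["origin", "type", "url", "prot", "status"]])
```

With the real file, the `io.StringIO(sample)` argument would be replaced by the path to *access.log*.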

## Count the number of accesses (number of lines) made by an IP number

We use fancy indexing to filter from `access` only the rows where `origin` consists of an IP address. While an IP address consists of 4 numbers in the interval `[0,255]` separated by dots, a simpler regex suffices.
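A minimal sketch of such a filter, on a made-up dataframe. The pattern below is one possible simplification (four groups of digits separated by dots) and is not necessarily the regex used originally; the name `ip_access` is hypothetical.

```python
import pandas as pd

# Made-up data: some origins are IP addresses, one is a host name.
access = pd.DataFrame({
    "origin": ["10.0.0.1", "example.org", "10.0.0.2", "10.0.0.1"],
    "status": [200, 404, 200, 401],
})

# Simplified pattern: it does not check that each number lies in [0, 255].
is_ip = access["origin"].str.match(r"^\d+\.\d+\.\d+\.\d+$")

# Boolean mask used as an index: keep only the rows where the mask is True.
ip_access = access[is_ip]
print(ip_access)
```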

Then we can group the rows with the same origin and count the size of each group
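For instance, on a small made-up dataframe (the name `ip_access` is hypothetical), the grouping could be sketched as:

```python
import pandas as pd

# Made-up rows that already contain only IP origins.
ip_access = pd.DataFrame({
    "origin": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "status": [200, 200, 401],
})

# One group per origin; size() counts the rows in each group.
per_ip = ip_access.groupby("origin").size()
print(per_ip)
```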

## Count the number of successful accesses (status 200) made by an IP number

We only have to filter the rows with status equal to 200

An alternative version uses the `len` function.
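Both variants can be sketched on made-up data as follows. How `len` was used originally is not shown in the text; passing it to `agg` is one plausible reading, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows with IP origins only.
ip_access = pd.DataFrame({
    "origin": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"],
    "status": [200, 200, 401, 200],
})

# Keep only the successful accesses.
ok = ip_access[ip_access["status"] == 200]

# Variant 1: size of each group.
by_size = ok.groupby("origin").size()

# Variant 2 (assumed): apply len to one column of each group.
by_len = ok.groupby("origin")["status"].agg(len)
print(by_size)
```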

## Count the number of accesses for each directory served

First we add a column `dir` to each row

The first step is to build a function, called `extract_dir`, that computes the directory from a url.
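One possible shape for `extract_dir` is sketched below. The regex and the decision to return `None` when nothing matches are assumptions; the original function may differ.

```python
import re

def extract_dir(url):
    """Return the directory part of a URL path, or None if the regex fails."""
    # Greedy match up to the last "/" that comes before any query string.
    found = re.match(r"([^?]*)/", url)
    if found is None:
        return None
    # An empty capture means the resource sits in the root directory.
    return found.group(1) or "/"

print(extract_dir("/docs/guide/index.html?ref=/home"))
print(extract_dir("/robots.txt"))
print(extract_dir("-"))
```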

Since a regex can be a brittle solution, we have to check that it is actually correct. More precisely, we are going to check when the regex is not found.

Those two rows are problematic. Moreover, we cannot make any sense of them, so we decide to drop them.
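The check and the drop could look like the sketch below, on made-up data. It assumes an `extract_dir` that returns `None` when the regex is not found, which is an illustrative choice rather than the original implementation.

```python
import re
import pandas as pd

def extract_dir(url):
    # Assumed helper: directory of the URL, or None if the regex fails.
    found = re.match(r"([^?]*)/", url)
    return None if found is None else (found.group(1) or "/")

# Made-up data with two malformed urls.
access = pd.DataFrame({
    "url": ["/docs/index.html", "-", "/cgi/form", "garbage"],
    "status": [200, 408, 200, 400],
})

# True for the rows where no directory could be extracted.
no_match = access["url"].map(extract_dir).isna()
print(access[no_match])

# Drop the problematic rows.
access = access[~no_match]
```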

Then we can use `apply`
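A sketch of the row-wise `apply`, again with an assumed `extract_dir` and made-up data:

```python
import re
import pandas as pd

def extract_dir(url):
    # Assumed helper: directory of the URL, or None if the regex fails.
    found = re.match(r"([^?]*)/", url)
    return None if found is None else (found.group(1) or "/")

access = pd.DataFrame({
    "url": ["/docs/index.html", "/docs/faq.html", "/cgi/form"],
})

# axis=1 passes each row (not each column) to the function.
access["dir"] = access.apply(lambda row: extract_dir(row["url"]), axis=1)
print(access)
```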

Since using the `axis` option of `apply` can be confusing, an alternative solution is to build a list corresponding to the new column.
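The list-based alternative, followed by the count for each directory, might be sketched as follows (assumed `extract_dir`, made-up data, hypothetical name `per_dir`):

```python
import re
import pandas as pd

def extract_dir(url):
    # Assumed helper: directory of the URL, or None if the regex fails.
    found = re.match(r"([^?]*)/", url)
    return None if found is None else (found.group(1) or "/")

access = pd.DataFrame({
    "url": ["/docs/index.html", "/docs/faq.html", "/cgi/form"],
})

# Build the new column as a plain list, one element per row.
access["dir"] = [extract_dir(url) for url in access["url"]]

# Count the accesses for each directory.
per_dir = access.groupby("dir").size()
print(per_dir)
```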

## For each origin, count the number of successful accesses
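A minimal sketch on made-up data, assuming that "successful" means status 200 as in the earlier point. This time all origins are kept, not only IP addresses.

```python
import pandas as pd

access = pd.DataFrame({
    "origin": ["10.0.0.1", "example.org", "example.org", "10.0.0.1"],
    "status": [200, 200, 404, 200],
})

# Filter the successful accesses, then count the rows of each origin.
ok_per_origin = access[access["status"] == 200].groupby("origin").size()
print(ok_per_origin)
```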

## For each origin, count the number of unsuccessful accesses, split according to the status code

The `groupby` can receive a list of column names
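For example, on made-up data and assuming that "unsuccessful" means any status other than 200:

```python
import pandas as pd

access = pd.DataFrame({
    "origin": ["10.0.0.1", "example.org", "example.org", "10.0.0.1", "example.org"],
    "status": [401, 404, 404, 200, 500],
})

failed = access[access["status"] != 200]

# Grouping by a list of columns gives one group per (origin, status) pair.
failed_counts = failed.groupby(["origin", "status"]).size()
print(failed_counts)
```

The result is a series whose index has two levels, `origin` and `status`.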

## From the results of the previous point, add a column with the error class (the first digit of the status code)

Since the `status` field is part of the index, we have to move it to a column name, via `reset_index`
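A sketch of this step on a made-up series of counts. The column name `count` is a hypothetical choice.

```python
import pandas as pd

# Made-up counts indexed by (origin, status), as produced by a groupby.
index = pd.MultiIndex.from_tuples(
    [("10.0.0.1", 401), ("example.org", 404)], names=["origin", "status"]
)
failed_counts = pd.Series([1, 2], index=index)

# Move both index levels to ordinary columns.
table = failed_counts.reset_index(name="count")
print(table)
```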

Now we can add the desired column

If we want the class to be an integer, we also need to apply the `int` function.
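Both variants can be sketched as follows. The column names `class` and `class_int` are hypothetical, and the data is made up.

```python
import pandas as pd

table = pd.DataFrame({
    "origin": ["10.0.0.1", "example.org", "example.org"],
    "status": [401, 404, 500],
    "count": [1, 2, 1],
})

# First digit of the status code, kept as a string.
table["class"] = table["status"].map(lambda status: str(status)[0])

# Same digit, converted back to an integer.
table["class_int"] = table["status"].map(lambda status: int(str(status)[0]))
print(table)
```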

## Cluster the accesses in 5-minute time slices (e.g. from 14:00 to 14:05, from 14:05 to 14:10, etc.). Count the number of accesses for each time slice

We use a procedure similar to the previous point. Notice that we need only the hour and the minute (not the full timestamp) to build the clusters.
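One way to sketch this, assuming the timestamp is stored as a string in a hypothetical `timestamp` column and labelling each slice by its starting time:

```python
import re
import pandas as pd

# Made-up timestamps in the format used by the Common Log Format.
access = pd.DataFrame({
    "timestamp": [
        "07/Mar/2004:14:00:12",
        "07/Mar/2004:14:04:59",
        "07/Mar/2004:14:05:00",
        "07/Mar/2004:14:12:30",
    ],
})

def time_slice(timestamp):
    # Only the hour and the minute are needed.
    hour, minute = re.search(r":(\d{2}):(\d{2}):", timestamp).groups()
    # Round the minute down to a multiple of 5.
    start = int(minute) // 5 * 5
    return f"{hour}:{start:02d}"

access["slice"] = access["timestamp"].map(time_slice)
per_slice = access.groupby("slice").size()
print(per_slice)
```

Since the date is ignored, accesses made on different days at the same time of day fall into the same slice.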

## Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

Read the `error.log` file

The first step is to extract the field corresponding to the date/time.

Then we extract the type of the error

Then we parse the date/time
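The three steps above could be sketched as follows, on made-up lines that mimic the layout of *error.log*. The column names `line`, `when`, `type` and `datetime` and the regexes are assumptions.

```python
from datetime import datetime
import pandas as pd

# Made-up lines; the real ones would be read from error.log.
lines = [
    "[Sun Mar  7 16:02:00 2004] [notice] server configured",
    "[Sun Mar  7 16:05:49 2004] [info] [client 10.0.0.1] request handled",
    "[Sun Mar  7 17:21:44 2004] [error] [client 10.0.0.2] file not found",
]
errors = pd.DataFrame({"line": lines})

# Date/time: the content of the first pair of square brackets.
errors["when"] = errors["line"].str.extract(r"^\[([^\]]+)\]", expand=False)

# Type: the content of the second pair of square brackets.
errors["type"] = errors["line"].str.extract(r"^\[[^\]]+\] \[(\w+)\]", expand=False)

# Parse the date/time string into a proper timestamp.
errors["datetime"] = [
    datetime.strptime(when, "%a %b %d %H:%M:%S %Y") for when in errors["when"]
]
print(errors[["datetime", "type"]])
```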

Since the entries of *error.log* usually have a corresponding entry in *access.log*, we merge the two dataframes.
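The text does not show how the merge is performed. The sketch below assumes that both dataframes carry a parsed timestamp in a hypothetical `datetime` column and that corresponding entries share exactly the same timestamp; the real data may need a different key.

```python
import pandas as pd

# Made-up dataframes with a shared, already parsed timestamp column.
access = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2004-03-07 16:05:49", "2004-03-07 16:06:51", "2004-03-07 17:21:44"]
    ),
    "origin": ["10.0.0.1", "example.org", "10.0.0.2"],
})
errors = pd.DataFrame({
    "datetime": pd.to_datetime(["2004-03-07 16:05:49", "2004-03-07 17:21:44"]),
    "type": ["info", "error"],
})

# reset_index() stores the position of each access in a column named "index".
# A left merge keeps every access; type is missing where no error matches.
merged = access.reset_index().merge(errors, on="datetime", how="left")
print(merged)
```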

We are going to exploit the fact that we have a column `index` of `merged` that contains the position inside the `access` dataframe. So we have to compute the difference of the index between two consecutive entries that are errors or info.

Let us start by isolating such entries.

We exploit the structure of the implicit index

Dataframe methods are mostly designed to process each row independently of the others. Since we need to compare consecutive rows, we prefer to transform the series into a list.
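A sketch of the isolation and of the conversion to a list, on a made-up `merged` dataframe (the name `info_errors` is hypothetical):

```python
import pandas as pd

# Made-up merged data: "index" is the position inside the access dataframe.
merged = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5],
    "type": ["info", None, None, "error", "notice", "info"],
})

# Keep only the entries whose type is info or error.
info_errors = merged[merged["type"].isin(["info", "error"])]

# Plain Python list of the positions.
info_errors_list = info_errors["index"].tolist()
print(info_errors_list)
```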

Now we scan the list, except for the first element, and we compute the difference between the current and the previous element.
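With a made-up list of positions, the scan could be written as:

```python
# Made-up positions of the info/error entries.
info_errors_list = [0, 3, 5, 9]

differences = []
# Start from 1, so that the element at i - 1 always exists.
for i in range(1, len(info_errors_list)):
    differences.append(info_errors_list[i] - info_errors_list[i - 1])
print(differences)
```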

This approach requires managing the index of the list.

An easier way is to extract two sublists of `info_errors_list`: the first removing the first element, and the second removing the last element. Those two sublists have the same length and are coordinated: the elements in position `i` of the two sublists are related (they are the two operands of the difference).
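A sketch with the same made-up list (the names `current` and `previous` are hypothetical):

```python
# Made-up positions of the info/error entries.
info_errors_list = [0, 3, 5, 9]

current = info_errors_list[1:]    # without the first element
previous = info_errors_list[:-1]  # without the last element

differences = []
# Both sublists have the same length, so one index serves both.
for i in range(len(current)):
    differences.append(current[i] - previous[i])
print(differences)
```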

An even easier way is to exploit the fact that the two sublists are coordinated. This allows us to use `zip` to couple each pair of related elements, and a list comprehension to obtain the desired differences.
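On the same made-up list, the `zip` version might look like this:

```python
# Made-up positions of the info/error entries.
info_errors_list = [0, 3, 5, 9]

# zip couples each element with the one that precedes it.
differences = [
    current - previous
    for current, previous in zip(info_errors_list[1:], info_errors_list[:-1])
]
print(differences)
```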