The Apache dataset contains the files [access.log](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/apache/access.log) and [error.log](https://github.com/gdv/foundationsCS/raw/main/students/ex-data/apache/error.log), which contain the log of the accesses to a web server and the log of its errors, respectively.
The *access.log* is in [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format).
The entries in *error.log* usually have a corresponding entry in *access.log*.

## Read the file *access.log*. This file is not trivial to read correctly.

In [1]:
import pandas as pd
import re

Hint: The first row of the file *access.log* does not contain the names of the columns: we can use the `names` option.

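Hint: We use a regular expression as the separator, so that the double quotes around the request are not treated as field delimiters; otherwise the fields `type`, `url`, and `prot` would be combined together.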
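To see why splitting on whitespace yields ten fields, here is a minimal sketch on a single line. The line is reconstructed from one of the rows displayed in this notebook, and the standard `re` module stands in for the pandas parser, so this only illustrates the idea.

```python
import re

# One line in Common Log Format, reconstructed from a row shown in this notebook.
line = ('64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] '
        '"GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291')

# Splitting on runs of whitespace ignores the double quotes, so the request
# is broken into three separate fields: method, URL, and protocol.
fields = re.split(r'\s+', line)

print(len(fields))   # 10
print(fields[5:8])   # ['"GET', '/mailman/listinfo/hsdivision', 'HTTP/1.1"']
```

The leftover `[`, `]`, and `"` characters are the reason for the cleanup step that follows the read.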

In [2]:
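access = pd.read_csv('https://github.com/gdv/foundationsCS/raw/main/students/ex-data/apache/access.log',
                     sep=r'[\s\t]+',   # raw string: the separator is a regular expression
                     engine='python',  # regex separators are handled by the Python parsing engine
                     names=['origin', 'identity', 'user', 'time', 'tz', 'type', 'url', 'prot', 'status', 'size'])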
access.head()



Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],"""GET",/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],"""GET",/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],"""GET",/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],"""GET",/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],"""GET",/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253


In [3]:
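# regex=True is required: recent pandas versions treat the pattern as a literal string by default
access['time'] = access['time'].str.replace(r'^\[', '', regex=True)
access['tz'] = access['tz'].str.replace(r'\]$', '', regex=True)
access['type'] = access['type'].str.replace(r'^"', '', regex=True)
access['prot'] = access['prot'].str.replace(r'"$', '', regex=True)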
access



Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-0800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-0800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-0800,GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-0800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-0800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253
...,...,...,...,...,...,...,...,...,...,...
1541,10.0.0.153,-,-,12/Mar/2004:12:23:41,-0800,GET,/dccstats/stats-spam-ratio.1year.png,HTTP/1.1,200.0,1906
1542,10.0.0.153,-,-,12/Mar/2004:12:23:41,-0800,GET,/dccstats/stats-hashes.1year.png,HTTP/1.1,200.0,1582
1543,216.139.185.45,-,-,12/Mar/2004:13:04:01,-0800,GET,/mailman/listinfo/webber,HTTP/1.1,200.0,6051
1544,pd95f99f2.dip.t-dialin.net,-,-,12/Mar/2004:13:18:57,-0800,GET,/razor.html,HTTP/1.1,200.0,2869


In [4]:
access['datetime'] = pd.to_datetime(access['time'], format="%d/%b/%Y:%H:%M:%S", errors="coerce")
access.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,2004-03-07 16:05:49
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,2004-03-07 16:06:51
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,2004-03-07 16:10:02
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,2004-03-07 16:11:58
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,2004-03-07 16:20:55


## Count the number of accesses (number of lines) made by an IP address

We use boolean indexing to keep only the rows of `access` where `origin` is an IP address. While an IP address consists of 4 numbers in the interval `[0,255]` separated by dots, a simpler regex that only checks for four dot-separated groups of digits suffices here.
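The trade-off between the simple regex and a strict check can be sketched with the standard `ipaddress` module. The helper name `is_valid_ipv4` and the candidate `999.1.1.1` are made up for this illustration; the other two candidates appear in the log.

```python
import ipaddress
import re

# The simple pattern: four groups of digits separated by dots.
SIMPLE_IP = re.compile(r'^\d+\.\d+\.\d+\.\d+$')

def is_valid_ipv4(text):
    """Return True only if text is a well-formed IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

# Compare the two checks on an address, an out-of-range value, and a hostname.
for candidate in ['64.242.88.10', '999.1.1.1', 'wwwcache.lanl.gov']:
    print(candidate, bool(SIMPLE_IP.match(candidate)), is_valid_ipv4(candidate))
```

The simple regex accepts `999.1.1.1`, which the strict check rejects. For a server log, where the origin is either a real address or a hostname, that difference is unlikely to matter.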

In [5]:
iponly = access[access['origin'].str.contains(r"^\d+\.\d+\.\d+\.\d+$")]
iponly.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,2004-03-07 16:05:49
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,2004-03-07 16:06:51
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,2004-03-07 16:10:02
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,2004-03-07 16:11:58
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,2004-03-07 16:20:55


Then we can group the rows with the same origin and count the size of each group.

In [6]:
iponly.groupby('origin').size()

origin
10.0.0.153         270
12.22.207.235        1
128.227.88.79       14
142.27.64.35         7
145.253.208.9        7
194.151.73.43        4
195.11.231.210       1
195.230.181.122      1
195.246.13.119      12
200.222.33.33        1
203.147.138.233     13
207.195.59.160      20
208.247.148.12       4
212.21.228.26        1
212.92.37.62        14
213.181.81.4         1
216.139.185.45       1
219.95.17.51         1
4.37.97.186          1
61.165.64.6          4
61.9.4.61            3
64.242.88.10       452
64.246.94.141        1
64.246.94.152        1
66.213.206.2         1
67.131.107.5         3
dtype: int64
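As a side note, the same counts can be obtained with `value_counts`, which also sorts the result by decreasing frequency. The sketch below uses a tiny made-up frame rather than the real log.

```python
import pandas as pd

# Toy data: two origins taken from the log, with made-up multiplicities.
toy = pd.DataFrame({'origin': ['10.0.0.153', '64.242.88.10', '10.0.0.153',
                               '64.242.88.10', '64.242.88.10']})

by_group = toy.groupby('origin').size()        # sorted by origin
by_counts = toy['origin'].value_counts()       # sorted by decreasing count

print(by_group)
print(by_counts)
```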

## Count the number of successful accesses (status 200) made by an IP address

We only have to filter the rows with status equal to 200.

In [7]:
iponly[iponly['status'] == 200].count()

origin      627
identity    627
user        627
time        627
tz          627
type        627
url         627
prot        627
status      627
size        627
datetime    627
dtype: int64

An alternative version uses the `len` function.

In [8]:
len(iponly[iponly['status'] == 200])

627
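A third option, sketched here on made-up status codes, is to sum the boolean mask directly: `True` counts as 1, so no filtered copy of the frame is built.

```python
import pandas as pd

# Toy status column; the values are illustrative only.
toy = pd.DataFrame({'status': [200.0, 401.0, 200.0, 304.0]})

mask = toy['status'] == 200
print(int(mask.sum()))   # 2
print(len(toy[mask]))    # 2
```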

## Count the number of accesses for each directory served

First we add a column `dir` to each row.

The first step is to build a function, called `extract_dir`, that computes the directory from a url.

In [9]:
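def extract_dir(url):
    # Greedy match: everything up to and including the last '/' in the URL.
    match = re.match(r'.*/', url)
    # Return None when the URL contains no '/' at all.
    return match.group() if match else None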

Since a regex can be a brittle solution, we have to check that it is actually correct. More precisely, we look at the rows where the regex does not match.

In [10]:
access[~ access['url'].str.contains(r".*/")]

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
95,80-219-148-207.dclient.hispeed.ch,-,-,07/Mar/2004:19:47:36,-800,OPTIONS,*,HTTP/1.0,200.0,-,2004-03-07 19:47:36
906,h194n2fls308o1033.telia.com,-,-,09/Mar/2004:13:49:05,-800,"-""",408,-,,,2004-03-09 13:49:05


Those two rows are problematic. Moreover, we cannot make any sense of them, so we decide to drop them.

In [11]:
access.drop([95,906], inplace = True)
access[~ access['url'].str.contains(r".*/")]

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
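A second way the greedy regex can be brittle is a URL whose query string contains a slash. The sketch below compares the regex with a variant that first isolates the path using `urllib.parse.urlsplit`. Both function names and the sample URL are hypothetical, and this is not a claim that such URLs occur in this log.

```python
import re
from urllib.parse import urlsplit

def extract_dir_regex(url):
    # Same idea as extract_dir: everything up to the last '/'.
    match = re.match(r'.*/', url)
    return match.group() if match else None

def extract_dir_path_only(url):
    # Drop the query string first, then cut the path at its last '/'.
    path = urlsplit(url).path
    head, sep, _ = path.rpartition('/')
    return head + sep if sep else None

url = '/twiki/bin/view/Main?next=/some/page'   # made-up URL
print(extract_dir_regex(url))       # /twiki/bin/view/Main?next=/some/
print(extract_dir_path_only(url))   # /twiki/bin/view/
```

On URLs without a slash in the query string the two functions agree, so the choice only matters if such requests are expected.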


Then we can use `apply`.

In [12]:
access['dir'] = access.apply(lambda row: extract_dir(row['url']), axis=1)
access.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime,dir
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,2004-03-07 16:05:49,/twiki/bin/edit/Main/
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,2004-03-07 16:06:51,/twiki/bin/rdiff/TWiki/
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,2004-03-07 16:10:02,/mailman/listinfo/
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,2004-03-07 16:11:58,/twiki/bin/view/TWiki/
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,2004-03-07 16:20:55,/twiki/bin/view/Main/


Since using the `axis` option of `apply` can be confusing, an alternative solution is to compute the new column directly from the `url` column with the vectorized `str.extract` method.

In [13]:
access['dir2'] = access['url'].str.extract(r"(.*/)", expand=False)  # expand=False returns a Series
access.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime,dir,dir2
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,2004-03-07 16:05:49,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,2004-03-07 16:06:51,/twiki/bin/rdiff/TWiki/,/twiki/bin/rdiff/TWiki/
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,2004-03-07 16:10:02,/mailman/listinfo/,/mailman/listinfo/
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,2004-03-07 16:11:58,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,2004-03-07 16:20:55,/twiki/bin/view/Main/,/twiki/bin/view/Main/
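With the directory column in place, the count asked for in this section is one more `groupby`. The following minimal sketch uses five URLs taken from the rows displayed above instead of the full log, so the numbers are not those of the real data.

```python
import pandas as pd

# A handful of URLs that appear in the output shown in this notebook.
toy = pd.DataFrame({'url': ['/twiki/bin/view/TWiki/WikiSyntax',
                            '/twiki/bin/view/Main/DCCAndPostFix',
                            '/mailman/listinfo/hsdivision',
                            '/mailman/listinfo/webber',
                            '/razor.html']})

# Directory = everything up to and including the last '/'.
toy['dir'] = toy['url'].str.extract(r'(.*/)', expand=False)

# Number of accesses per directory, most requested first.
per_dir = toy.groupby('dir').size().sort_values(ascending=False)
print(per_dir)
```

On the real data the same two steps would be applied to `access` after the problematic rows have been dropped.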


## For each origin, count the number of successful accesses

In [14]:
access[access['status'] == 200].groupby('origin').size()

origin
0x503e4fce.virnxx2.adsl-dhcp.tele.dk      2
1-320.cnc.bc.ca                           4
1-729.cnc.bc.ca                           6
10.0.0.153                              187
12.22.207.235                             1
                                       ... 
watchguard.cgmatane.qc.ca                 2
wc03.mtnk.rnc.net.cable.rogers.com        1
wc09.mtnk.rnc.net.cable.rogers.com        3
wwwcache.lanl.gov                         1
yongsan-cache.korea.army.mil              4
Length: 167, dtype: int64

## For each origin, count the number of unsuccessful accesses, split according to the status code

The `groupby` method can receive a list of column names.

In [15]:
access[access['status'] != 200].groupby(['origin', 'status']).size()

origin                                 status
0x503e4fce.virnxx2.adsl-dhcp.tele.dk   304.0       1
1-729.cnc.bc.ca                        302.0       1
10.0.0.153                             302.0       1
                                       304.0      82
128.227.88.79                          304.0       2
142.27.64.35                           302.0       1
                                       304.0       4
145.253.208.9                          304.0       1
1513.cps.virtua.com.br                 404.0       1
195.246.13.119                         401.0       1
2-110.cnc.bc.ca                        304.0       3
207.195.59.160                         304.0       5
                                       401.0       1
61.9.4.61                              404.0       2
64.242.88.10                           401.0     112
68-174-110-154.nyc.rr.com              304.0       1
92-moc-6.acn.waw.pl                    304.0       1
cpe-203-51-137-224.vic.bigpond.net.au  302.0       1
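If a rectangular table is easier to read than the two-level index, `unstack` moves the status codes to the columns. This is only a sketch on made-up counts, with missing combinations filled with 0.

```python
import pandas as pd

# Toy data: origins and codes appear in the log, the multiplicities are made up.
toy = pd.DataFrame({
    'origin': ['10.0.0.153', '10.0.0.153', '10.0.0.153', '61.9.4.61'],
    'status': [302, 304, 304, 404],
})

counts = toy.groupby(['origin', 'status']).size()

# One row per origin, one column per status code.
wide = counts.unstack(fill_value=0)
print(wide)
```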


## From the results of the previous point, add a column with the error class (the first digit of the status code)

In [16]:
grouped = access[access['status'] != 200].groupby(['origin', 'status']).size()
grouped.index

MultiIndex([( '0x503e4fce.virnxx2.adsl-dhcp.tele.dk', 304.0),
            (                      '1-729.cnc.bc.ca', 302.0),
            (                           '10.0.0.153', 302.0),
            (                           '10.0.0.153', 304.0),
            (                        '128.227.88.79', 304.0),
            (                         '142.27.64.35', 302.0),
            (                         '142.27.64.35', 304.0),
            (                        '145.253.208.9', 304.0),
            (               '1513.cps.virtua.com.br', 404.0),
            (                       '195.246.13.119', 401.0),
            (                      '2-110.cnc.bc.ca', 304.0),
            (                       '207.195.59.160', 304.0),
            (                       '207.195.59.160', 401.0),
            (                            '61.9.4.61', 404.0),
            (                         '64.242.88.10', 401.0),
            (            '68-174-110-154.nyc.rr.com', 304.0),
        

Since the `status` field is part of the index, we have to turn it into a regular column, via `reset_index`.

In [17]:
table = grouped.reset_index()
table.head()

Unnamed: 0,origin,status,0
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304.0,1
1,1-729.cnc.bc.ca,302.0,1
2,10.0.0.153,302.0,1
3,10.0.0.153,304.0,82
4,128.227.88.79,304.0,2
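The column of counts ends up with the uninformative name `0`. As a small sketch on toy data, `reset_index` accepts a `name` argument for the values column; the name `count` is just a choice made here.

```python
import pandas as pd

# Toy data with made-up multiplicities.
toy = pd.DataFrame({
    'origin': ['10.0.0.153', '10.0.0.153', '10.0.0.153', '61.9.4.61', '61.9.4.61'],
    'status': [302, 304, 304, 404, 404],
})

grouped_toy = toy.groupby(['origin', 'status']).size()

# name= labels the column holding the group sizes.
table_toy = grouped_toy.reset_index(name='count')
print(table_toy)
```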


Now we can add the desired column.

In [18]:
table['class'] = table['status'] // 100
table

Unnamed: 0,origin,status,0,class
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304.0,1,3.0
1,1-729.cnc.bc.ca,302.0,1,3.0
2,10.0.0.153,302.0,1,3.0
3,10.0.0.153,304.0,82,3.0
4,128.227.88.79,304.0,2,3.0
5,142.27.64.35,302.0,1,3.0
6,142.27.64.35,304.0,4,3.0
7,145.253.208.9,304.0,1,3.0
8,1513.cps.virtua.com.br,404.0,1,4.0
9,195.246.13.119,401.0,1,4.0


If we want the class to be an integer, we also need to apply the `int` function.

In [19]:
table['class'] = table['status'].apply(lambda status: int(status // 100))  # each element is a status code, not a row
table

Unnamed: 0,origin,status,0,class
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304.0,1,3
1,1-729.cnc.bc.ca,302.0,1,3
2,10.0.0.153,302.0,1,3
3,10.0.0.153,304.0,82,3
4,128.227.88.79,304.0,2,3
5,142.27.64.35,302.0,1,3
6,142.27.64.35,304.0,4,3
7,145.253.208.9,304.0,1,3
8,1513.cps.virtua.com.br,404.0,1,4
9,195.246.13.119,401.0,1,4
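A vectorized alternative to `apply` is `astype(int)`, sketched here on a few status codes taken from the table above. It assumes that no status is missing, since a NaN cannot be converted to an integer this way.

```python
import pandas as pd

# Status codes as floats, as they appear in the table.
status = pd.Series([304.0, 302.0, 404.0, 401.0])

# Integer division gives the class as a float; astype(int) converts the whole column.
status_class = (status // 100).astype(int)
print(status_class.tolist())   # [3, 3, 4, 4]
```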


## Cluster the accesses in 5-minute time slices (e.g. from 14:00 to 14:05, from 14:05 to 14:10, etc). Count the number of accesses for each time slice

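We use a procedure similar to the previous point. Notice that we need only the hour and the minute (not the full timestamp) to build the clusters; as a consequence, accesses made on different days at the same time of day fall into the same slice.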
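As a toy illustration of this arithmetic, the sketch below numbers the 5-minute slices of the day and counts the accesses in each. The first three timestamps appear in the log, the fourth is made up to show a second day. For comparison it also uses `dt.floor('5min')`, an alternative not taken from this notebook, which keeps the date in the slice label.

```python
import pandas as pd

times = pd.to_datetime(pd.Series([
    '2004-03-07 16:05:49',
    '2004-03-07 16:06:51',
    '2004-03-07 16:10:02',
    '2004-03-08 16:07:00',   # made-up access on the following day
]))

# Slice number within the day: minutes since midnight, in blocks of 5.
slice_of_day = (times.dt.hour * 60 + times.dt.minute) // 5
per_slice = slice_of_day.value_counts().sort_index()
print(per_slice)    # slice 193 has 3 accesses, slice 194 has 1

# Alternative: round each timestamp down to its 5-minute window, date included.
per_window = times.dt.floor('5min').value_counts().sort_index()
print(per_window)
```

The first count merges the two days, the second keeps them apart; which one is wanted depends on how the exercise is read.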

In [20]:
access['cluster'] = (access['datetime'].dt.hour * 60 + access['datetime'].dt.minute) // 5