The sample dataset *apache* contains the files *access.log* and *error.log* that contains the logfile of the accesses to a web server and the errors.
The *access.log* is in [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format).
The entries in *error.log* usually have a corresponding entry in *access.log*

1.  Read the file *access.log*
1.  Count the number of accesses (number of lines) made by an IP number
1.  Count the number of successful accesses (status 200) made by an IP number
1.  Count the number of accesses for each directory served
1.  For each origin, count the number of successful accesses
1.  For each origin, count the number of unsuccessful accesses, split according to the
    status code
1.  From the results of the previous point, add a column with the error class (the first
    digit of the status code)
1.  Cluster the accesses in 5-minutes time slices (e.g. from 14:00 to 14:05, from 14:05 to
    14:10, etc). Count the number of accesses for each time slice
1.  Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

### Extra points

1.  For `[info]` entry of *error.log*, find the next entry of *access.log*. For
    example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the
    entry at `[07/Mar/2004:18:02:10 -0800]`
1.  Count the number of times that the two accesses of the previous point have the same origin.


## Read the file *access.log*

In [145]:
import pandas as pd
import re

Since the first row of the file *access.log* does not contain the names of the columns, we use the `names` option. Moreover, we use a custom separator, otherwise the fields `type`, `url`, and `prot` would be combined together.

In [146]:
access = pd.read_csv('https://github.com/gdv/foundationsCS-2018/raw/master/ex-data/apache/access.log', 
                     sep='[\s\t]+',
                     names = ['origin', 'identity', 'user', 'time', 'tz', 'type', 'url', 'prot', 'status', 'size'])
access.head()

The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.


  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],"""GET",/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],"""GET",/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],"""GET",/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],"""GET",/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],"""GET",/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253


In [147]:
access['time'] = access['time'].str.replace(r'^\[', '')
access['tz'] = access['tz'].str.replace(r'\]$', '')
access['type'] = access['type'].str.replace(r'^\"', '')
access['prot'] = access['prot'].str.replace(r'\$"', '')
access

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-0800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-0800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-0800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-0800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-0800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253
...,...,...,...,...,...,...,...,...,...,...
1541,10.0.0.153,-,-,12/Mar/2004:12:23:41,-0800,GET,/dccstats/stats-spam-ratio.1year.png,"HTTP/1.1""",200.0,1906
1542,10.0.0.153,-,-,12/Mar/2004:12:23:41,-0800,GET,/dccstats/stats-hashes.1year.png,"HTTP/1.1""",200.0,1582
1543,216.139.185.45,-,-,12/Mar/2004:13:04:01,-0800,GET,/mailman/listinfo/webber,"HTTP/1.1""",200.0,6051
1544,pd95f99f2.dip.t-dialin.net,-,-,12/Mar/2004:13:18:57,-0800,GET,/razor.html,"HTTP/1.1""",200.0,2869


In [148]:
access['datetime'] = pd.to_datetime(access['time'], format="%d/%b/%Y:%H:%M:%S", errors="coerce")
access.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291,2004-03-07 16:10:02
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352,2004-03-07 16:11:58
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253,2004-03-07 16:20:55


## Count the number of accesses (number of lines) made by an IP number

We use fancy indexing to filter from `access` only the rows where `origin` consists of an IP address. While an IP address consists of 4 numbers in the interval `[0,255]` separated by dots, a simpler regex suffices.

In [149]:
iponly = access[access['origin'].str.contains("^\d+\.\d+\.\d+\.\d+$")]
iponly.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291,2004-03-07 16:10:02
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352,2004-03-07 16:11:58
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253,2004-03-07 16:20:55


If I really want a tighter regex, I can force the fact that numbers have at most three digits.

In [150]:
iponly = access[access['origin'].str.contains("^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")]
iponly.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291,2004-03-07 16:10:02
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352,2004-03-07 16:11:58
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253,2004-03-07 16:20:55


Then we can group the rows with the same origin and count the size of each group

In [151]:
iponly.groupby('origin').size()

origin
10.0.0.153         270
12.22.207.235        1
128.227.88.79       14
142.27.64.35         7
145.253.208.9        7
194.151.73.43        4
195.11.231.210       1
195.230.181.122      1
195.246.13.119      12
200.222.33.33        1
203.147.138.233     13
207.195.59.160      20
208.247.148.12       4
212.21.228.26        1
212.92.37.62        14
213.181.81.4         1
216.139.185.45       1
219.95.17.51         1
4.37.97.186          1
61.165.64.6          4
61.9.4.61            3
64.242.88.10       452
64.246.94.141        1
64.246.94.152        1
66.213.206.2         1
67.131.107.5         3
dtype: int64

## Count the number of successful accesses (status 200) made by an IP number

We only have to filter the rows with status equal to 200

In [153]:
iponly[iponly['status'] == 200].count()

origin      627
identity    627
user        627
time        627
tz          627
type        627
url         627
prot        627
status      627
size        627
datetime    627
dtype: int64

An alternative version uses the `len` function.

In [154]:
len(iponly[iponly['status'] == 200])

627

## Count the number of accesses for each directory served

First we add a column `dir` to each row

The first step is to build a function, called `extract_dir`, that computes the directory from a url.

In [155]:
def extract_dir(url):
    if re.search('/', url):
        return re.match('.*\/', url).group()
    else:
        return None

Since a regex can be a brittle solution, we have to check that it is actually correct. More precisely, we are going to check when the regex is not fond.

In [156]:
access[~ access['url'].str.contains(".*\/")]

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime
95,80-219-148-207.dclient.hispeed.ch,-,-,07/Mar/2004:19:47:36,-800,OPTIONS,*,"HTTP/1.0""",200.0,-,2004-03-07 19:47:36
906,h194n2fls308o1033.telia.com,-,-,09/Mar/2004:13:49:05,-800,"-""",408,-,,,2004-03-09 13:49:05


Those two rows are problematic. Moreover, we cannot make any sense of them, so we decide to drop them.

In [157]:
access.drop([95,906], inplace = True)
access[~ access['url'].str.contains(".*\/")]

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime


Then we can use `apply`

In [158]:
access['dir'] = access.apply(lambda row: extract_dir(row['url']), axis=1)
access.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime,dir
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49,/twiki/bin/edit/Main/
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51,/twiki/bin/rdiff/TWiki/
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291,2004-03-07 16:10:02,/mailman/listinfo/
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352,2004-03-07 16:11:58,/twiki/bin/view/TWiki/
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253,2004-03-07 16:20:55,/twiki/bin/view/Main/


Since using the `axis` option of `apply` can be confusing, an alternative solution is to build a list correponding to the new column

In [159]:
access['dir2'] = access['url'].str.extract("(.*\/)")
access.head()

Unnamed: 0,origin,identity,user,time,tz,type,url,prot,status,size,datetime,dir,dir2
0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/
1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51,/twiki/bin/rdiff/TWiki/,/twiki/bin/rdiff/TWiki/
2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291,2004-03-07 16:10:02,/mailman/listinfo/,/mailman/listinfo/
3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352,2004-03-07 16:11:58,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/
4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253,2004-03-07 16:20:55,/twiki/bin/view/Main/,/twiki/bin/view/Main/


## For each origin, count the number of successful accesses

In [160]:
access[access['status'] == 200].groupby('origin').size()

origin
0x503e4fce.virnxx2.adsl-dhcp.tele.dk      2
1-320.cnc.bc.ca                           4
1-729.cnc.bc.ca                           6
10.0.0.153                              187
12.22.207.235                             1
                                       ... 
watchguard.cgmatane.qc.ca                 2
wc03.mtnk.rnc.net.cable.rogers.com        1
wc09.mtnk.rnc.net.cable.rogers.com        3
wwwcache.lanl.gov                         1
yongsan-cache.korea.army.mil              4
Length: 167, dtype: int64

## For each origin, count the number of unsuccessful accesses, split according to the status code

The `groupby` can receive a list of column names

In [161]:
access[access['status'] != 200].groupby(['origin', 'status']).size()

origin                                 status
0x503e4fce.virnxx2.adsl-dhcp.tele.dk   304.0       1
1-729.cnc.bc.ca                        302.0       1
10.0.0.153                             302.0       1
                                       304.0      82
128.227.88.79                          304.0       2
142.27.64.35                           302.0       1
                                       304.0       4
145.253.208.9                          304.0       1
1513.cps.virtua.com.br                 404.0       1
195.246.13.119                         401.0       1
2-110.cnc.bc.ca                        304.0       3
207.195.59.160                         304.0       5
                                       401.0       1
61.9.4.61                              404.0       2
64.242.88.10                           401.0     112
68-174-110-154.nyc.rr.com              304.0       1
92-moc-6.acn.waw.pl                    304.0       1
cpe-203-51-137-224.vic.bigpond.net.au  302.0       1


## From the results of the previous point, add a column with the error class (the first digit of the status code)

In [162]:
grouped = access[access['status'] != 200].groupby(['origin', 'status']).count()
grouped.index

MultiIndex([( '0x503e4fce.virnxx2.adsl-dhcp.tele.dk', 304.0),
            (                      '1-729.cnc.bc.ca', 302.0),
            (                           '10.0.0.153', 302.0),
            (                           '10.0.0.153', 304.0),
            (                        '128.227.88.79', 304.0),
            (                         '142.27.64.35', 302.0),
            (                         '142.27.64.35', 304.0),
            (                        '145.253.208.9', 304.0),
            (               '1513.cps.virtua.com.br', 404.0),
            (                       '195.246.13.119', 401.0),
            (                      '2-110.cnc.bc.ca', 304.0),
            (                       '207.195.59.160', 304.0),
            (                       '207.195.59.160', 401.0),
            (                            '61.9.4.61', 404.0),
            (                         '64.242.88.10', 401.0),
            (            '68-174-110-154.nyc.rr.com', 304.0),
        

Since the `status` field is part of the index, we have to move it to a column name, via `reset_index`

In [163]:
table = grouped.reset_index()
table.head()

Unnamed: 0,origin,status,identity,user,time,tz,type,url,prot,size,datetime,dir,dir2
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304.0,1,1,1,1,1,1,1,1,1,1,1
1,1-729.cnc.bc.ca,302.0,1,1,1,1,1,1,1,1,1,1,1
2,10.0.0.153,302.0,1,1,1,1,1,1,1,1,1,1,1
3,10.0.0.153,304.0,82,82,82,82,82,82,82,82,82,82,82
4,128.227.88.79,304.0,2,2,2,2,2,2,2,2,2,2,2


Now we can add the desired column

In [164]:
table['class'] = table['status'] // 100
table

Unnamed: 0,origin,status,identity,user,time,tz,type,url,prot,size,datetime,dir,dir2,class
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304.0,1,1,1,1,1,1,1,1,1,1,1,3.0
1,1-729.cnc.bc.ca,302.0,1,1,1,1,1,1,1,1,1,1,1,3.0
2,10.0.0.153,302.0,1,1,1,1,1,1,1,1,1,1,1,3.0
3,10.0.0.153,304.0,82,82,82,82,82,82,82,82,82,82,82,3.0
4,128.227.88.79,304.0,2,2,2,2,2,2,2,2,2,2,2,3.0
5,142.27.64.35,302.0,1,1,1,1,1,1,1,1,1,1,1,3.0
6,142.27.64.35,304.0,4,4,4,4,4,4,4,4,4,4,4,3.0
7,145.253.208.9,304.0,1,1,1,1,1,1,1,1,1,1,1,3.0
8,1513.cps.virtua.com.br,404.0,1,1,1,1,1,1,1,1,1,1,1,4.0
9,195.246.13.119,401.0,1,1,1,1,1,1,1,1,1,1,1,4.0


## Cluster the accesses in 5-minutes time slices (e.g. from 14:00 to 14:05, from 14:05 to 14:10, etc). Count the number of accesses for each time slice

We use a procedure similar to the previous point. Notice that we need only the hour and the minute (not the full timestamp) to build the clusters.

In [165]:
access['cluster'] = (access['datetime'].dt.hour * 60 + access['datetime'].dt.minute) // 5

In [166]:
access.groupby('cluster').size()

cluster
1      6
2      1
3      3
4      3
5      7
      ..
283    5
284    2
285    1
286    2
287    3
Length: 267, dtype: int64

## For `[info]` entry of *error.log*, find the next entry of *access.log*. 

*For example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the entry at `[07/Mar/2004:18:02:10 -0800]`*

Each error has a corresponding (i.e. same date, time, origin) entry in *access.log*

In [167]:
error = pd.read_csv("https://github.com/gdv/foundationsCS-2018/raw/master/ex-data/apache/error.log",
                   names = ["text"])
error.head()

Unnamed: 0,text
0,[Sun Mar 7 16:02:00 2004] [notice] Apache/1.3...
1,[Sun Mar 7 16:02:00 2004] [info] Server built...
2,[Sun Mar 7 16:02:00 2004] [notice] Accept mut...
3,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...
4,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...


The first step is to extract the field corresponding to the date/time.

In [168]:
error['datetime_raw'] = error['text'].str.extract("^\[(.*?)\]")
error.head()

Unnamed: 0,text,datetime_raw
0,[Sun Mar 7 16:02:00 2004] [notice] Apache/1.3...,Sun Mar 7 16:02:00 2004
1,[Sun Mar 7 16:02:00 2004] [info] Server built...,Sun Mar 7 16:02:00 2004
2,[Sun Mar 7 16:02:00 2004] [notice] Accept mut...,Sun Mar 7 16:02:00 2004
3,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,Sun Mar 7 16:05:49 2004
4,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,Sun Mar 7 16:45:56 2004


Then we extract the type of the error

In [169]:
error['type'] = error['text'].str.extract("^\[.*?\]\s\[(.*?)\]")
error.head()

Unnamed: 0,text,datetime_raw,type
0,[Sun Mar 7 16:02:00 2004] [notice] Apache/1.3...,Sun Mar 7 16:02:00 2004,notice
1,[Sun Mar 7 16:02:00 2004] [info] Server built...,Sun Mar 7 16:02:00 2004,info
2,[Sun Mar 7 16:02:00 2004] [notice] Accept mut...,Sun Mar 7 16:02:00 2004,notice
3,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,Sun Mar 7 16:05:49 2004,info
4,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,Sun Mar 7 16:45:56 2004,info


Then we parse the date/time

In [170]:
error['datetime'] = pd.to_datetime(error['datetime_raw'])
error.head()

Unnamed: 0,text,datetime_raw,type,datetime
0,[Sun Mar 7 16:02:00 2004] [notice] Apache/1.3...,Sun Mar 7 16:02:00 2004,notice,2004-03-07 16:02:00
1,[Sun Mar 7 16:02:00 2004] [info] Server built...,Sun Mar 7 16:02:00 2004,info,2004-03-07 16:02:00
2,[Sun Mar 7 16:02:00 2004] [notice] Accept mut...,Sun Mar 7 16:02:00 2004,notice,2004-03-07 16:02:00
3,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,Sun Mar 7 16:05:49 2004,info,2004-03-07 16:05:49
4,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,Sun Mar 7 16:45:56 2004,info,2004-03-07 16:45:56


We add a field `next` which is the index of the next row. We exploit the fact that, once we reset the index, the index is a sequence of consecutive integers starting from zero and that we can build a column from a list of values.
To the purpose, the `access` dataframe has to be sorted by increasing `datetime`.

In [171]:
access.sort_values(by = ['datetime'], inplace = True)
access.reset_index(inplace = True)
access['next'] = list(range(1, len(access) + 1))
access[['datetime', 'next']].head()

Unnamed: 0,datetime,next
0,2004-03-07 16:05:49,1
1,2004-03-07 16:06:51,2
2,2004-03-07 16:10:02,3
3,2004-03-07 16:11:58,4
4,2004-03-07 16:20:55,5


Since each error has a corresponding entry in the `access.log` file, we merge the two dataframes.

In [172]:
merged = pd.merge(access, error, on = "datetime", how = 'left')
merged.head()

Unnamed: 0,index,origin,identity,user,time,tz,type_x,url,prot,status,size,datetime,dir,dir2,cluster,next,text,datetime_raw,type_y
0,0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,193,1,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,Sun Mar 7 16:05:49 2004,info
1,1,64.242.88.10,-,-,07/Mar/2004:16:06:51,-800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51,/twiki/bin/rdiff/TWiki/,/twiki/bin/rdiff/TWiki/,193,2,,,
2,2,64.242.88.10,-,-,07/Mar/2004:16:10:02,-800,GET,/mailman/listinfo/hsdivision,"HTTP/1.1""",200.0,6291,2004-03-07 16:10:02,/mailman/listinfo/,/mailman/listinfo/,194,3,,,
3,3,64.242.88.10,-,-,07/Mar/2004:16:11:58,-800,GET,/twiki/bin/view/TWiki/WikiSyntax,"HTTP/1.1""",200.0,7352,2004-03-07 16:11:58,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/,194,4,,,
4,4,64.242.88.10,-,-,07/Mar/2004:16:20:55,-800,GET,/twiki/bin/view/Main/DCCAndPostFix,"HTTP/1.1""",200.0,5253,2004-03-07 16:20:55,/twiki/bin/view/Main/,/twiki/bin/view/Main/,196,5,,,


Check if the rows of `error` are in `merged`. The following query cannot return any row.

In [173]:
len(merged[merged['origin'].isnull()])

0

In [174]:
found = merged[merged['type_y'] == 'info']
found

Unnamed: 0,index,origin,identity,user,time,tz,type_x,url,prot,status,size,datetime,dir,dir2,cluster,next,text,datetime_raw,type_y
0,0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-0800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",401.0,12846,2004-03-07 16:05:49,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,193,1,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,Sun Mar 7 16:05:49 2004,info
17,17,64.242.88.10,-,-,07/Mar/2004:16:45:56,-0800,GET,/twiki/bin/attach/Main/PostfixCommands,"HTTP/1.1""",401.0,12846,2004-03-07 16:45:56,/twiki/bin/attach/Main/,/twiki/bin/attach/Main/,201,18,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,Sun Mar 7 16:45:56 2004,info
30,30,64.242.88.10,-,-,07/Mar/2004:17:13:50,-0800,GET,/twiki/bin/edit/TWiki/DefaultPlugin?t=1078688936,"HTTP/1.1""",401.0,12846,2004-03-07 17:13:50,/twiki/bin/edit/TWiki/,/twiki/bin/edit/TWiki/,206,31,[Sun Mar 7 17:13:50 2004] [info] [client 64.2...,Sun Mar 7 17:13:50 2004,info
35,35,64.242.88.10,-,-,07/Mar/2004:17:21:44,-0800,GET,/twiki/bin/attach/TWiki/TablePlugin,"HTTP/1.1""",401.0,12846,2004-03-07 17:21:44,/twiki/bin/attach/TWiki/,/twiki/bin/attach/TWiki/,208,36,[Sun Mar 7 17:21:44 2004] [info] [client 64.2...,Sun Mar 7 17:21:44 2004,info
39,39,64.242.88.10,-,-,07/Mar/2004:17:27:37,-0800,GET,/twiki/bin/edit/Main/WebSearch?t=1078669682,"HTTP/1.1""",401.0,12846,2004-03-07 17:27:37,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,209,40,[Sun Mar 7 17:27:37 2004] [info] [client 64.2...,Sun Mar 7 17:27:37 2004,info
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
661,662,64.242.88.10,-,-,08/Mar/2004:14:07:26,-0800,GET,/twiki/bin/edit/Main/Strict_8bitmime?topicpare...,"HTTP/1.1""",401.0,12846,2004-03-08 14:07:26,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,169,662,[Mon Mar 8 14:07:26 2004] [info] [client 64.2...,Mon Mar 8 14:07:26 2004,info
671,672,64.242.88.10,-,-,08/Mar/2004:14:27:46,-0800,GET,/twiki/bin/edit/Main/Virtual_gid_maps?topicpar...,"HTTP/1.1""",401.0,12846,2004-03-08 14:27:46,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,173,672,[Mon Mar 8 14:27:46 2004] [info] [client 64.2...,Mon Mar 8 14:27:46 2004,info
679,680,64.242.88.10,-,-,08/Mar/2004:14:54:56,-0800,GET,/twiki/bin/edit/Main/TokyoOffice?t=1078706364,"HTTP/1.1""",401.0,12846,2004-03-08 14:54:56,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,178,680,[Mon Mar 8 14:54:56 2004] [info] [client 64.2...,Mon Mar 8 14:54:56 2004,info
1080,1082,h24-71-236-129.ca.shawcable.net,-,-,10/Mar/2004:11:45:51,-0800,GET,/mailman/admin/ppwc/gateway,"HTTP/1.1""",200.0,0,2004-03-10 11:45:51,/mailman/admin/ppwc/,/mailman/admin/ppwc/,141,1081,[Wed Mar 10 11:45:51 2004] [info] [client 24.7...,Wed Mar 10 11:45:51 2004,info


Finally, use the `next` field to merge `merged` and `found`.

In [175]:
paired = pd.merge(found, access, left_on='next', right_index = True)
paired

Unnamed: 0,next,index_x,origin_x,identity_x,user_x,time_x,tz_x,type_x,url_x,prot_x,...,type,url_y,prot_y,status_y,size_y,datetime_y,dir_y,dir2_y,cluster_y,next_y
0,1,0,64.242.88.10,-,-,07/Mar/2004:16:05:49,-0800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,"HTTP/1.1""",...,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,"HTTP/1.1""",200.0,4523,2004-03-07 16:06:51,/twiki/bin/rdiff/TWiki/,/twiki/bin/rdiff/TWiki/,193,2
17,18,17,64.242.88.10,-,-,07/Mar/2004:16:45:56,-0800,GET,/twiki/bin/attach/Main/PostfixCommands,"HTTP/1.1""",...,GET,/robots.txt,"HTTP/1.1""",200.0,68,2004-03-07 16:47:12,/,/,201,19
30,31,30,64.242.88.10,-,-,07/Mar/2004:17:13:50,-0800,GET,/twiki/bin/edit/TWiki/DefaultPlugin?t=1078688936,"HTTP/1.1""",...,GET,/twiki/bin/search/Main/?scope=topic&regex=on&s...,"HTTP/1.1""",200.0,3675,2004-03-07 17:16:00,/twiki/bin/search/Main/,/twiki/bin/search/Main/,207,32
35,36,35,64.242.88.10,-,-,07/Mar/2004:17:21:44,-0800,GET,/twiki/bin/attach/TWiki/TablePlugin,"HTTP/1.1""",...,GET,/twiki/bin/view/TWiki/ManagingWebs?rev=1.22,"HTTP/1.1""",200.0,9310,2004-03-07 17:22:49,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/,208,37
39,40,39,64.242.88.10,-,-,07/Mar/2004:17:27:37,-0800,GET,/twiki/bin/edit/Main/WebSearch?t=1078669682,"HTTP/1.1""",...,GET,/twiki/bin/oops/TWiki/ResetPassword?template=o...,"HTTP/1.1""",200.0,11281,2004-03-07 17:28:45,/twiki/bin/oops/TWiki/,/twiki/bin/oops/TWiki/,209,41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
661,662,662,64.242.88.10,-,-,08/Mar/2004:14:07:26,-0800,GET,/twiki/bin/edit/Main/Strict_8bitmime?topicpare...,"HTTP/1.1""",...,GET,/twiki/bin/view/TWiki/WelcomeGuest?rev=r1.19,"HTTP/1.1""",200.0,13997,2004-03-08 14:11:28,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/,170,663
671,672,672,64.242.88.10,-,-,08/Mar/2004:14:27:46,-0800,GET,/twiki/bin/edit/Main/Virtual_gid_maps?topicpar...,"HTTP/1.1""",...,GET,/twiki/bin/view/TWiki/NewUserTemplate?skin=print,"HTTP/1.1""",200.0,2449,2004-03-08 14:28:46,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/,173,673
679,680,680,64.242.88.10,-,-,08/Mar/2004:14:54:56,-0800,GET,/twiki/bin/edit/Main/TokyoOffice?t=1078706364,"HTTP/1.1""",...,GET,/twiki/bin/rdiff/Main/SpamAssassinAndPostFix?r...,"HTTP/1.1""",200.0,5794,2004-03-08 14:57:19,/twiki/bin/rdiff/Main/,/twiki/bin/rdiff/Main/,179,681
1080,1081,1082,h24-71-236-129.ca.shawcable.net,-,-,10/Mar/2004:11:45:51,-0800,GET,/mailman/admin/ppwc/gateway,"HTTP/1.1""",...,GET,/mailman/admin/ppwc/gateway,"HTTP/1.1""",200.0,8692,2004-03-10 11:45:51,/mailman/admin/ppwc/,/mailman/admin/ppwc/,141,1082


## Count the number of times that the two accesses of the previous point have the same origin.

In [176]:
len(paired[paired['origin_x'] == paired['origin_y']])

84

## Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

We are going to exploit the fact that we have a column `index` of `merged` that contains the position inside the `access` dataframe. So we have to compute the difference of the index between two consecutive entries that are errors or info.

Let us start by isolating such entries.

In [177]:
info_errors = merged[(merged['type_y'] == 'info') | (merged['type_y'] == 'error')]['index']
info_errors.head()

0      0
17    17
30    30
35    35
39    39
Name: index, dtype: int64

Methods on dataframe are mostly designed to process each row independently from each other. Hence we prefer to transform the series into a list.

In [178]:
info_errors_list = list(info_errors)
info_errors_list

[0,
 17,
 30,
 35,
 39,
 42,
 51,
 52,
 57,
 61,
 67,
 71,
 74,
 77,
 78,
 85,
 89,
 92,
 93,
 94,
 103,
 106,
 109,
 110,
 112,
 118,
 121,
 126,
 133,
 136,
 139,
 142,
 147,
 149,
 150,
 152,
 155,
 167,
 184,
 190,
 200,
 208,
 211,
 220,
 225,
 231,
 234,
 237,
 245,
 263,
 279,
 280,
 283,
 286,
 291,
 292,
 308,
 311,
 312,
 327,
 330,
 338,
 354,
 359,
 362,
 368,
 375,
 379,
 380,
 381,
 384,
 386,
 389,
 396,
 401,
 402,
 436,
 458,
 499,
 502,
 508,
 530,
 548,
 555,
 558,
 572,
 574,
 575,
 596,
 638,
 640,
 643,
 650,
 651,
 653,
 656,
 657,
 662,
 672,
 680,
 1082,
 1083]

Now we scan the list, except for the first element, and we compute the difference between the current and the previous element.

This approach requires managing the index of the list.

In [179]:
where = []
for i in range(1, len(info_errors_list)):
    where.append(info_errors_list[i] - info_errors_list[i - 1] - 1)
where

[16,
 12,
 4,
 3,
 2,
 8,
 0,
 4,
 3,
 5,
 3,
 2,
 2,
 0,
 6,
 3,
 2,
 0,
 0,
 8,
 2,
 2,
 0,
 1,
 5,
 2,
 4,
 6,
 2,
 2,
 2,
 4,
 1,
 0,
 1,
 2,
 11,
 16,
 5,
 9,
 7,
 2,
 8,
 4,
 5,
 2,
 2,
 7,
 17,
 15,
 0,
 2,
 2,
 4,
 0,
 15,
 2,
 0,
 14,
 2,
 7,
 15,
 4,
 2,
 5,
 6,
 3,
 0,
 0,
 2,
 1,
 2,
 6,
 4,
 0,
 33,
 21,
 40,
 2,
 5,
 21,
 17,
 6,
 2,
 13,
 1,
 0,
 20,
 41,
 1,
 2,
 6,
 0,
 1,
 2,
 0,
 4,
 9,
 7,
 401,
 0]

An easier way is to extract two sublists of `info_errors_list`: the first removing the first element, and the second removing the last element. Those two sublists have the same length and are coordinated, that is in position `i` there are two elements that are related (actually, they are the two operands of the difference).

In [180]:
where = []
sublist1 = info_errors_list[1:]
sublist2 = info_errors_list[:-1]
for i in range(len(sublist1)):
    where.append(sublist1[i] - sublist2[i] - 1)
where

[16,
 12,
 4,
 3,
 2,
 8,
 0,
 4,
 3,
 5,
 3,
 2,
 2,
 0,
 6,
 3,
 2,
 0,
 0,
 8,
 2,
 2,
 0,
 1,
 5,
 2,
 4,
 6,
 2,
 2,
 2,
 4,
 1,
 0,
 1,
 2,
 11,
 16,
 5,
 9,
 7,
 2,
 8,
 4,
 5,
 2,
 2,
 7,
 17,
 15,
 0,
 2,
 2,
 4,
 0,
 15,
 2,
 0,
 14,
 2,
 7,
 15,
 4,
 2,
 5,
 6,
 3,
 0,
 0,
 2,
 1,
 2,
 6,
 4,
 0,
 33,
 21,
 40,
 2,
 5,
 21,
 17,
 6,
 2,
 13,
 1,
 0,
 20,
 41,
 1,
 2,
 6,
 0,
 1,
 2,
 0,
 4,
 9,
 7,
 401,
 0]

An even easier way is to exploit the fact that the two sublists are coordinated. This allows to use a zip to couple each pair of related elements, and a list comprehension to obtained the desired difference.

In [181]:
pairs = zip(info_errors_list[:-1], info_errors_list[1:])
list(pairs)

[(0, 17),
 (17, 30),
 (30, 35),
 (35, 39),
 (39, 42),
 (42, 51),
 (51, 52),
 (52, 57),
 (57, 61),
 (61, 67),
 (67, 71),
 (71, 74),
 (74, 77),
 (77, 78),
 (78, 85),
 (85, 89),
 (89, 92),
 (92, 93),
 (93, 94),
 (94, 103),
 (103, 106),
 (106, 109),
 (109, 110),
 (110, 112),
 (112, 118),
 (118, 121),
 (121, 126),
 (126, 133),
 (133, 136),
 (136, 139),
 (139, 142),
 (142, 147),
 (147, 149),
 (149, 150),
 (150, 152),
 (152, 155),
 (155, 167),
 (167, 184),
 (184, 190),
 (190, 200),
 (200, 208),
 (208, 211),
 (211, 220),
 (220, 225),
 (225, 231),
 (231, 234),
 (234, 237),
 (237, 245),
 (245, 263),
 (263, 279),
 (279, 280),
 (280, 283),
 (283, 286),
 (286, 291),
 (291, 292),
 (292, 308),
 (308, 311),
 (311, 312),
 (312, 327),
 (327, 330),
 (330, 338),
 (338, 354),
 (354, 359),
 (359, 362),
 (362, 368),
 (368, 375),
 (375, 379),
 (379, 380),
 (380, 381),
 (381, 384),
 (384, 386),
 (386, 389),
 (389, 396),
 (396, 401),
 (401, 402),
 (402, 436),
 (436, 458),
 (458, 499),
 (499, 502),
 (502, 508),


In [182]:
[ b - a - 1 for (a,b) in zip(info_errors_list[:-1], info_errors_list[1:])]

[16,
 12,
 4,
 3,
 2,
 8,
 0,
 4,
 3,
 5,
 3,
 2,
 2,
 0,
 6,
 3,
 2,
 0,
 0,
 8,
 2,
 2,
 0,
 1,
 5,
 2,
 4,
 6,
 2,
 2,
 2,
 4,
 1,
 0,
 1,
 2,
 11,
 16,
 5,
 9,
 7,
 2,
 8,
 4,
 5,
 2,
 2,
 7,
 17,
 15,
 0,
 2,
 2,
 4,
 0,
 15,
 2,
 0,
 14,
 2,
 7,
 15,
 4,
 2,
 5,
 6,
 3,
 0,
 0,
 2,
 1,
 2,
 6,
 4,
 0,
 33,
 21,
 40,
 2,
 5,
 21,
 17,
 6,
 2,
 13,
 1,
 0,
 20,
 41,
 1,
 2,
 6,
 0,
 1,
 2,
 0,
 4,
 9,
 7,
 401,
 0]