# Converting 'access.log' to tabular data

In this notebook, we clone the repository that contains the access.log file, parse the log line by line, and load the extracted fields into a pandas dataframe. We then save the dataframe as a tab-separated file named "access.tsv".

### Steps
1. Clone the repository
2. Open the file and parse it line by line
3. Extract important information from the request and create a dataset
4. Establish a dataframe with pandas
5. Save the data as "access.tsv"


In [19]:
# First, we need to clone the repository so that we have access to the log file.

!git clone https://github.com/brain-image-library/py-brain-logs.git

fatal: destination path 'py-brain-logs' already exists and is not an empty directory.


In [15]:
# Confirm that 'py-brain-logs' is in the current directory
!ls


py-brain-logs  sample_data


A sample line of access.log looks like this:

```
51.222.253.19 - - [15/May/2022:03:30:09 -0400] "GET /56/77/567794f41ad2dccd/mouseID_394528-18867/1059286962_18867_4330-X29413-Y7746.swc HTTP/2.0" 404 146 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" "-"
```

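We will use a regular expression to split each line into its fields: IP address, timestamp, request line, status code, response size, referrer, and user agent.

Finally, we flag a request as coming from a bot when its user agent contains the substring 'bot' (compared case-insensitively).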
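To make the field extraction concrete, here is a minimal sketch that parses only the sample line shown above. The named groups and the variable names (`sample`, `sample_pattern`, `fields`) are illustrative assumptions, not part of the notebook's own code, which uses numbered groups.

```python
import re
from datetime import datetime

# The sample line from above, copied verbatim.
sample = '51.222.253.19 - - [15/May/2022:03:30:09 -0400] "GET /56/77/567794f41ad2dccd/mouseID_394528-18867/1059286962_18867_4330-X29413-Y7746.swc HTTP/2.0" 404 146 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" "-"'

# Hypothetical named-group variant of the log pattern (names are assumptions).
sample_pattern = re.compile(
    r'^(?P<ip>[\d.]+) - - \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d+) (?P<size>\d+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)" "-"$'
)

fields = sample_pattern.match(sample).groupdict()

# This request line has three space-separated parts: method, URL, protocol.
method, url, protocol = fields["request"].split(" ")

# Parse the timestamp, including its UTC offset, into a timezone-aware datetime.
parsed_date = datetime.strptime(fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z")

# Case-insensitive substring check on the user agent.
is_bot = "bot" in fields["user_agent"].lower()

print(fields["ip"], method, fields["status"], parsed_date.isoformat(), is_bot)
```

Note that the substring check is a heuristic: it catches agents such as AhrefsBot, but a crawler whose user agent does not contain "bot" would go unflagged.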

In [1]:
import re
from datetime import datetime

import pandas as pd

methods = ['GET', 'HEAD', 'POST', 'PUT', 'DELETE', 'OPTIONS', 'TRACE', 'PATCH']
pattern = r'^([\d.]+) - - \[([^]]+)\] "([^"]*)" (\d+) (\d+) "([^"]*)" "([^"]*)" "-"$'

with open('data/access.log', 'r') as file:
    data = []
    # Iterate over the file lazily instead of loading every line into memory
    for line in file:
        matches = re.match(pattern, line)  # Match the line against the log pattern

        if matches:
            ip_address = matches.group(1)
            timestamp = matches.group(2)
            request_line = matches.group(3)
            status_code = matches.group(4)
            size = matches.group(5)
            referrer = matches.group(6)
            user_agent = matches.group(7)

            split_request = request_line.split(' ')

            # Some request lines omit the HTTP method. In that case we leave
            # the method empty and treat the first token as the URL.
            if split_request[0] in methods:
                method = split_request[0]
                url = split_request[1] if len(split_request) > 1 else ""
            else:
                method = ""
                url = split_request[0]

            # Assemble the dataset
            dataset = {
                "ip": ip_address,
                "date": datetime.strptime(timestamp, '%d/%b/%Y:%H:%M:%S %z'),
                "method": method,
                "url": url,
                "status_code": status_code,
                "size": size,
                "referrer": referrer,
                "user_agent": user_agent,
                "is_bot": 'bot' in user_agent.lower()
            }

            data.append(dataset)
        else:
            print("No match found.")

# Create the dataframe
df = pd.DataFrame(data)
df

| | ip | date | method | url | status_code | size | referrer | user_agent | is_bot |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 51.222.253.19 | 2022-05-15 03:30:09-04:00 | GET | /56/77/567794f41ad2dccd/mouseID_394528-18867/1... | 404 | 146 | - | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http:... | True |
| 1 | 51.222.253.19 | 2022-05-15 03:30:17-04:00 | GET | /56/77/567794f41ad2dccd/mouseID_394528-18867/1... | 404 | 146 | - | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http:... | True |
| 2 | 51.222.253.12 | 2022-05-15 03:30:24-04:00 | GET | /56/77/567794f41ad2dccd/mouseID_394528-18867/1... | 404 | 146 | - | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http:... | True |
| 3 | 51.222.253.14 | 2022-05-15 03:30:33-04:00 | GET | /56/77/567794f41ad2dccd/mouseID_394528-18867/1... | 404 | 146 | - | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http:... | True |
| 4 | 51.222.253.1 | 2022-05-15 03:30:40-04:00 | GET | /56/77/567794f41ad2dccd/mouseID_394528-18867/1... | 404 | 146 | - | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http:... | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38505 | 185.191.171.17 | 2022-05-17 12:23:57-04:00 | GET | /biccn/mueller/mouselight/2019-09-06/3/5/5/4/1/ | 200 | 1331 | - | Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt... | True |
| 38506 | 185.191.171.14 | 2022-05-17 12:23:58-04:00 | GET | /biccn/mueller/mouselight/2018-12-01/4/1/2/8/6/7/ | 200 | 479 | - | Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt... | True |
| 38507 | 157.90.181.151 | 2022-05-17 12:23:59-04:00 | GET | /bf/fb/ | 200 | 158 | - | Mozilla/5.0 (compatible; BLEXBot/1.0; +http://... | True |
| 38508 | 185.191.171.14 | 2022-05-17 12:23:59-04:00 | GET | /2f/27/2f27b45f2590ec86/2019-09-06/ktx/6/3/3/5/1/ | 200 | 1224 | - | Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt... | True |
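The parsing cell takes `status_code` and `size` straight from the regex groups, so they are stored as strings. If numeric operations are needed, one option is to convert them with `pd.to_numeric`. The sketch below uses a small hypothetical frame (`typed_demo`) with values borrowed from the rows above, rather than the full dataframe.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the parsed log data.
typed_demo = pd.DataFrame({
    "status_code": ["404", "200"],
    "size": ["146", "1331"],
    "is_bot": [True, True],
})

# Convert the string columns to integers so they can be summed and compared.
typed_demo["status_code"] = pd.to_numeric(typed_demo["status_code"])
typed_demo["size"] = pd.to_numeric(typed_demo["size"])

# Example aggregations that rely on numeric dtypes.
total_bytes = typed_demo["size"].sum()
status_counts = typed_demo["status_code"].value_counts()

print(typed_demo.dtypes)
print(total_bytes)
```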


In [3]:
# Save the dataframe as a tab-separated (TSV) file, without the index column.
df.to_csv('data/access.tsv', sep='\t', index=False)
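As a quick sanity check, the saved file can be read back with `pd.read_csv` using the same tab separator. The sketch below round-trips a small hypothetical frame (`roundtrip_demo`) through an in-memory buffer instead of touching 'data/access.tsv'; reading the real file would presumably work the same way with the path in place of the buffer.

```python
import io

import pandas as pd

# Hypothetical frame with a subset of the columns produced above.
roundtrip_demo = pd.DataFrame({
    "ip": ["51.222.253.19", "157.90.181.151"],
    "date": ["2022-05-15 03:30:09-04:00", "2022-05-17 12:23:59-04:00"],
    "status_code": [404, 200],
    "is_bot": [True, False],
})

# Write tab-separated text to an in-memory buffer, mirroring the to_csv call above.
buffer = io.StringIO()
roundtrip_demo.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# Read it back; the separator must match the one used when writing.
restored = pd.read_csv(buffer, sep="\t")

# Dates come back as plain strings, so parse them explicitly (normalized to UTC here).
restored["date"] = pd.to_datetime(restored["date"], utc=True)

print(restored)
```

Because `index=False` was used when writing, no extra unnamed index column appears when the file is read back.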