# Web Data Analysis

## Data Source 

- Raw web log data available as part of [this dataset](http://old.honeynet.org/scans/scan34/) is used for this notebook.
- The weblogs are stored in the S3 bucket `/essentia-playground/Weblog_Dataset/http/`.

Here, I plan to demonstrate:
1. dividing the log data into columns (converting it to a table structure)
2. performing some EDA and visualization
3. picking features for anomaly detection
4. formatting the data for use with the SageMaker Random Cut Forest (RCF) algorithm

The data is then written back to a different S3 bucket. Steps 3 and 4 are not finalized, though; they depend on the results of step 2.

## Setup
First we'll select the bucket as our datastore, then create a category called `weblogs` that includes all the raw Apache web log files in the bucket.

In [3]:
# select datastore, create category and take a peek at the data
ess select s3://essentia-playground
ess category add weblogs "/Weblog_Dataset/http/access_log.*"
ess summary weblogs

2019-11-27 22:23:32 ip-10-10-1-118 ess[3121]: Fetching file list from datastore.
2019-11-27 22:23:32 ip-10-10-1-118 ess[3121]: Examining largest matched file to determine compression type: /Weblog_Dataset/http/access_log.1
2019-11-27 22:23:32 ip-10-10-1-118 ess[3121]: Probing largest matched file to determine data configuration: /Weblog_Dataset/http/access_log.1
Name:        weblogs
Pattern:     /Weblog_Dataset/http/access_log.*
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   Space
# of files:  6
Total size:  439.4KB
File range:  2019-11-01 - 2019-11-06
# columns:   10
Column Spec: IP:col_1 S:col_2 S:col_3 S:col_4 S:col_5 S:col_6 I:col_7 I:col_8 S:col_9 S:col_10
Pkey: 
Schema: IP:col_1 S:col_2 S:col_3 S:col_4 S:col_5 S:col_6 I:col_7 I:col_8 S:col_9 S:col_10
Preprocess:  
usecache:    False
Comment:    

First few lines:
24.196.254.170 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
192.195.225.

## Data Cleaning

It looks like each row is a server access log entry in [combined log format](https://httpd.apache.org/docs/2.4/logs.html#combine_log), which is one long string. <br>
The data cannot be analyzed in this format, so let's break it down into detailed columns. The following is the list of columns we can get from the file, from left to right in the raw log format.

```24.196.254.170 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"```

* `24.196.254.170` - client IP address.
* `-` - identity of the client (RFC 1413); a hyphen indicates that the requested piece of information is not available.
* `-` - user ID of the person requesting the document; also unavailable here.
* `[06/Mar/2005:05:28:52 -0500]` - day, month, year, and time of the request, followed by the time zone offset.
* `"GET / HTTP/1.1"` - request line sent by the client. This can be broken down into smaller columns.
* `403` - status code that the server sent back to the client.
* `2898` - size in bytes of the object returned to the client.
* `"-"` - the referer, which can contain a URL or domain name.
* `"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"` - User-Agent HTTP request header.

Of all the columns, the request line and the time column will be split into more detailed columns.
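As a quick sanity check of the field layout listed above, the sample line can be taken apart with standard shell tools alone. This is only a minimal sketch that is independent of `aq_pp`; the variable names (`line`, `ip`, `timestamp`, and so on) are hypothetical and chosen purely for illustration.

```shell
# Sample line copied from the log preview above.
line='24.196.254.170 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"'

# The client IP is the first space-delimited field.
ip=$(printf '%s\n' "$line" | awk '{print $1}')

# The timestamp is whatever sits between the square brackets.
timestamp=$(printf '%s\n' "$line" | cut -d'[' -f2 | cut -d']' -f1)

# Splitting on double quotes gives: $2 = request line,
# $3 = status and size, $4 = referer, $6 = user agent.
request=$(printf '%s\n' "$line" | awk -F'"' '{print $2}')
status=$(printf '%s\n' "$line" | awk -F'"' '{split($3, a, " "); print a[1]}')
size=$(printf '%s\n' "$line" | awk -F'"' '{split($3, a, " "); print a[2]}')
agent=$(printf '%s\n' "$line" | awk -F'"' '{print $6}')

# The request line itself breaks down further.
method=${request%% *}     # text before the first space
protocol=${request##* }   # text after the last space

printf '%s | %s | %s %s | %s %s\n' "$ip" "$timestamp" "$method" "$protocol" "$status" "$size"
```

Splitting on the double quote first keeps the spaces inside the request line and the user agent from being treated as column boundaries, which is the same problem the per-column delimiters in the column spec address.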

We can do this using the `aq_pp` command. But first, let us create a column spec. It looks like `ess category add` was unable to infer the column spec we wanted, so we will write it ourselves.
Below is the column spec that will be used for the rest of the notebook.

```IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' S:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' ' S:time_dif sep:'] "' S:request_method sep:' ' S:requested_resource sep:' ' S:protocol sep:'" ' I:return_code sep:' ' I:obj_size sep:' ' S:referrer sep:' ' S:user_agent```

**Attributes**<br>

* `sep:` attributes were used together with `div` to specify a separate delimiter for each column.
* `eok` was used to skip lines that contain invalid data, so it works as a data cleaning function.
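The idea of skipping invalid lines can be approximated outside of `aq_pp` to get a feel for how many lines would be dropped. The sketch below is an assumption-laden analogue, not `aq_pp`'s own validation: the `pattern` regular expression is a rough, hypothetical description of a combined-format line, and the two sample lines use a documentation-only IP address rather than data from the bucket.

```shell
# Hypothetical sample: one well-formed line and one truncated line.
sample='192.0.2.10 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "-"
192.0.2.'

# Rough shape of a combined-format line (an assumption for illustration only).
pattern='^[0-9.]+ [^ ]+ [^ ]+ \[[^]]+] "[^"]*" [0-9]+ [0-9-]+ "[^"]*" "[^"]*"$'

# Count the lines that match the pattern and the lines that do not.
good=$(printf '%s\n' "$sample" | grep -Ec "$pattern")
bad=$(printf '%s\n' "$sample" | grep -Evc "$pattern")

printf 'well-formed: %s, skipped: %s\n' "$good" "$bad"
```

Comparing the two counts before and after structuring the data is one way to check that skipping bad records is not silently discarding a large share of the log.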

In [31]:
# Create a column spec
# NOTE: cols is not referenced by the command below, which passes its own spec inline via -d
cols="IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' I:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' 'S:sign sep:'' I:diff sep:'] \"' S:request sep:'\" ' I:status sep:' ' I:size sep:' ' S:f"
ess stream weblogs "*" "*" | aq_pp -f,+1,div,eok - \
-d IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' S:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' ' S:time_dif sep:'] "' S:request_method sep:' ' S:requested_resource sep:' ' S:protocol sep:'" ' I:return_code sep:' ' I:obj_size sep:' ' S:referrer sep:' ' S:user_agent | \
head -n 10

"ip","hypen","user_id","date","month","year","hour","minute","second","time_dif","request_method","requested_resource","protocol","return_code","obj_size","referrer","user_agent"
192.195.225.6,"-","-",6,"Mar",2005,5,31,37,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
218.75.19.178,"-","-",6,"Mar",2005,6,37,20,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
65.115.46.225,"-","-",6,"Mar",2005,7,45,21,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
219.137.79.110,"-","-",6,"Mar",2005,9,5,38,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
219.61.8.78,"-","-",6,"Mar",2005,9,12,31,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
208.57.191.112,"-","-",6,"Mar",2005,10,31,42,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compat

## EDA

Now that we've cleaned and structured the data, we can move on to exploratory data analysis.

Let's start by analyzing the client IP column.

First we'll look at the frequency count of each client IP address to identify where we're being accessed from.
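For reference, the same kind of frequency count can be sketched with standard shell tools. This is not the `aq_cnt` approach the notebook is working toward, only a minimal illustration of the idea; the three sample lines are hypothetical and use documentation-only IP addresses (RFC 5737), not values from the dataset.

```shell
# Hypothetical sample input, NOT taken from the dataset.
sample='192.0.2.10 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "-"
198.51.100.7 - - [06/Mar/2005:05:29:10 -0500] "GET / HTTP/1.1" 403 2898 "-" "-"
192.0.2.10 - - [06/Mar/2005:05:30:01 -0500] "GET / HTTP/1.1" 403 2898 "-" "-"'

# Take the first space-delimited field (the client IP), group identical
# values, count each group, and list the most frequent clients first.
counts=$(printf '%s\n' "$sample" \
  | awk '{print $1}' \
  | sort \
  | uniq -c \
  | sort -rn \
  | awk '{print $2 "," $1}')

printf '%s\n' "$counts"
```

The output is one `ip,count` pair per line, sorted by descending count, which is a convenient shape for a quick look at the heaviest clients.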

**TODO:** start by fixing the cell below, which currently fails with a column spec error.

In [46]:
# stream with ess, structure the data with aq_pp, then (eventually) count with aq_cnt
# note that the first aq_pp outputs the data in aq_tools' binary format, so the column spec does not need to be specified again downstream
ess stream weblogs "*" "*" | aq_pp -f,+1,div,eok - \
-d IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' S:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' ' S:time_dif sep:'] "' S:request_method sep:' ' S:requested_resource sep:' ' S:protocol sep:'" ' I:return_code sep:' ' I:obj_size sep:' ' S:referrer sep:' ' S:user_agent -o,aq - | \
aq_pp -f,+1,aq - -o,csv - 

<stdin>: Syntax error: byte=1+0 rec=1 field=#1
<stdin>: Column spec not found or invalid
Invalid parameter "-": ... -f,+1,aq - -o,csv -
aq_pp: Input processing error
2019-11-28 00:19:45 ip-10-10-1-118 ess[4130]: ***Error*** cat: write error: Broken pipe

