
# Web Data Analysis

## Data Source 

- The raw web log data used in this notebook is available as part of [this dataset](http://old.honeynet.org/scans/scan34/).
- The web logs are stored in the S3 bucket path `/essentia-playground/Weblog_Dataset/http/`.

In this notebook, I plan to demonstrate how to:
1. divide the log data into columns (convert it to a table structure)
2. perform some EDA and visualization
3. pick features for anomaly detection
4. format the data for use with the SageMaker Random Cut Forest (RCF) algorithm

The data is then written back to a different S3 bucket. Steps 3 and 4 are not final, though; they depend on the results of step 2.

## Setup
First we'll select the bucket as our datastore, then create a category called `weblogs` that includes all the raw Apache web log files in the bucket.

In [3]:
# select datastore, create category and take a peek at the data
ess select s3://essentia-playground
ess category add weblogs "/Weblog_Dataset/http/access_log.*"
ess summary weblogs

2019-11-27 22:23:32 ip-10-10-1-118 ess[3121]: Fetching file list from datastore.
2019-11-27 22:23:32 ip-10-10-1-118 ess[3121]: Examining largest matched file to determine compression type: /Weblog_Dataset/http/access_log.1
2019-11-27 22:23:32 ip-10-10-1-118 ess[3121]: Probing largest matched file to determine data configuration: /Weblog_Dataset/http/access_log.1
Name:        weblogs
Pattern:     /Weblog_Dataset/http/access_log.*
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   Space
# of files:  6
Total size:  439.4KB
File range:  2019-11-01 - 2019-11-06
# columns:   10
Column Spec: IP:col_1 S:col_2 S:col_3 S:col_4 S:col_5 S:col_6 I:col_7 I:col_8 S:col_9 S:col_10
Pkey: 
Schema: IP:col_1 S:col_2 S:col_3 S:col_4 S:col_5 S:col_6 I:col_7 I:col_8 S:col_9 S:col_10
Preprocess:  
usecache:    False
Comment:    

First few lines:
24.196.254.170 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
192.195.225.

## Data Cleaning

It looks like each row is a server access log entry in [combined log format](https://httpd.apache.org/docs/2.4/logs.html#combine_log), stored as one long string. <br>
The data cannot be analyzed in this form, so let's break it down into separate columns. The following list shows the columns we can extract from the file, from left to right in the raw log format.

```24.196.254.170 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"```

* `24.196.254.170` - client IP address
* `-` - identity of the client; a hyphen indicates that the requested piece of information is not available
* `-` - user ID of the person requesting the document; also unavailable here
* `[06/Mar/2005:05:28:52 -0500]` - day, month, year, time, and time zone offset of the request
* `"GET / HTTP/1.1"` - request line sent by the client; this can be broken down into smaller columns
* `403` - status code that the server sent back to the client
* `2898` - size of the object returned to the client
* `"-"` - the referer, which can contain a URL or domain name
* `"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"` - User-Agent HTTP request header

Of these columns, the request line and the timestamp will be split into more detailed columns.

We can do this with the `aq_pp` command, but first let us create a column spec. The spec that `ess category add` inferred automatically is not the one we want, so we will write our own. 
Below is the column spec that will be used for the rest of the notebook.

```IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' S:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' ' S:time_dif sep:'] "' S:request_method sep:' ' S:requested_resource sep:' ' S:protocol sep:'" ' I:return_code sep:' ' I:obj_size sep:' ' S:referrer sep:' ' S:user_agent```

**Attributes**<br>

* `sep:` attributes are used together with `div` to specify a separate delimiter for each column.
* `eok` skips lines that contain invalid data, so it also works as a data cleaning step.

In [31]:
# Apply the column spec with aq_pp and preview the first rows
ess stream weblogs "*" "*" | aq_pp -f,+1,div,eok - \
-d IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' S:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' ' S:time_dif sep:'] "' S:request_method sep:' ' S:requested_resource sep:' ' S:protocol sep:'" ' I:return_code sep:' ' I:obj_size sep:' ' S:referrer sep:' ' S:user_agent | \
head -n 10

"ip","hypen","user_id","date","month","year","hour","minute","second","time_dif","request_method","requested_resource","protocol","return_code","obj_size","referrer","user_agent"
192.195.225.6,"-","-",6,"Mar",2005,5,31,37,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
218.75.19.178,"-","-",6,"Mar",2005,6,37,20,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
65.115.46.225,"-","-",6,"Mar",2005,7,45,21,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
219.137.79.110,"-","-",6,"Mar",2005,9,5,38,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
219.61.8.78,"-","-",6,"Mar",2005,9,12,31,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"""
208.57.191.112,"-","-",6,"Mar",2005,10,31,42,"-0500","GET","/","HTTP/1.1",403,2898,"""-""","""Mozilla/4.0 (compat
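As a cross-check of the column layout, the same split can be sketched in Python with a regular expression. This is not part of the Essentia pipeline; the pattern, the `parse_line` helper, and the decision to treat a `-` size as 0 bytes are illustrative assumptions. The group names simply mirror the column names in the `aq_pp` spec.

```python
import re

# Regular expression for one Apache combined-log line. The group names mirror
# the aq_pp column spec; the pattern itself is an illustrative assumption.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<hypen>\S+) (?P<user_id>\S+) '
    r'\[(?P<date>\d{2})/(?P<month>[A-Za-z]{3})/(?P<year>\d{4}):'
    r'(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) '
    r'(?P<time_dif>[+-]\d{4})\] '
    r'"(?P<request_method>\S+) (?P<requested_resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<return_code>\d{3}) (?P<obj_size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Columns declared as integers (I:) in the column spec, except obj_size,
# which is converted separately below.
INT_COLUMNS = ("date", "year", "hour", "minute", "second", "return_code")


def parse_line(line):
    """Return a dict of columns, or None when the line does not match
    (similar in spirit to skipping invalid records with `eok`)."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    # Assumption: a "-" size means no body was returned, so store 0 bytes.
    row["obj_size"] = 0 if row["obj_size"] == "-" else int(row["obj_size"])
    return row


sample = ('24.196.254.170 - - [06/Mar/2005:05:28:52 -0500] "GET / HTTP/1.1" '
          '403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')
row = parse_line(sample)
print(row["ip"], row["hour"], row["request_method"], row["return_code"])
```

Unlike the `aq_pp` output, this sketch strips the surrounding double quotes from the referrer and user agent values.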

## EDA

Now that we've cleaned and structured the data, we can move on to exploratory data analysis. 

Let's start by analyzing the client IP column.

First we'll look at the frequency count of each client IP address to identify where the requests are coming from.

### Client IP addresses

We'll achieve this with `aq_cnt`. Note that `aq_pp` outputs the stream in the `aq` internal binary format, so we do not need to provide a column spec again to the `aq_cnt` command. <br>
Finally, the result is sorted with the `aq_ord` command.

In [23]:
# stream with ess, structure with aq_pp, then count with aq_cnt
# note that aq_pp outputs the data in the aq tools' binary format, so the column spec does not need to be repeated for aq_cnt
ess stream weblogs "*" "*" | aq_pp -f,+1,div,eok,qui - \
-d IP:ip sep:' ' S:hypen sep:' ' S:user_id sep:' [' I:date sep:'/' S:month sep:'/' I:year sep:':' I:hour sep:':' I:minute sep:':' I:second sep:' ' S:time_dif sep:'] "' S:request_method sep:' ' S:requested_resource sep:' ' S:protocol sep:'" ' I:return_code sep:' ' I:obj_size sep:' ' S:referrer sep:' ' S:user_agent -o,aq - | \
aq_cnt -f,aq,eok,qui - -kX,aq - key IP | \
aq_ord -f,aq - -sort,dec count 

"ip","count"
64.62.145.98,414
210.118.169.20,410
210.116.59.164,179
4.152.207.238,94
210.51.12.238,92
64.122.238.114,91
220.170.88.36,63
222.95.35.200,51
4.152.207.126,46
210.127.248.52,46
216.105.210.91,45
66.99.250.98,42
66.173.230.27,40
222.95.32.114,38
61.166.155.162,38
61.178.79.227,34
63.202.179.73,32
67.162.217.35,32
216.171.174.124,28
211.229.45.67,23
82.229.169.73,23
69.86.160.253,23
65.117.45.251,23
220.95.232.60,23
217.162.121.21,23
4.249.111.162,23
220.95.231.3,23
61.81.96.175,23
219.245.156.12,23
148.245.28.39,21
24.18.186.248,16
172.169.6.104,16
220.28.212.176,16
60.248.26.50,16
69.86.164.233,16
221.184.100.13,16
67.181.19.171,16
69.233.130.234,16
211.59.0.204,16
211.211.4.51,16
81.193.64.246,16
61.235.189.106,16
24.19.201.156,16
69.153.13.148,16
201.129.65.141,16
172.155.203.99,16
24.5.169.95,16
69.210.209.141,16
68.55.90.67,16
24.107.54.128,16
221.234.48.249,16
24.99.140.64,16
69.229.28.252,16
211.59.0.40,16
67.162.253.25,16
60.248.32.153,16
203.112.195.156,16
61.219.99

You can see that the majority of requests come from the top few addresses. We can also plot the result file (`data/ip_distr.csv`), which gives the plot below.

[Image placeholder: plot of `data/ip_distr.csv`]
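The counting and sorting step can also be sketched in plain Python, which makes it easy to quantify how concentrated the traffic is. This is a minimal sketch, not the `aq_cnt`/`aq_ord` implementation; `count_by_ip` and `top_share` are hypothetical helper names, and the addresses are made-up values from a documentation range.

```python
from collections import Counter


def count_by_ip(ips):
    """Count occurrences of each client IP, most frequent first."""
    return Counter(ips).most_common()


def top_share(ranked, n):
    """Fraction of all requests made by the n most frequent addresses."""
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    return sum(count for _, count in ranked[:n]) / total


# Made-up addresses for illustration only, not taken from the weblogs.
ips = ["192.0.2.1"] * 5 + ["192.0.2.2"] * 3 + ["192.0.2.3"] * 2
ranked = count_by_ip(ips)
print(ranked)                # [('192.0.2.1', 5), ('192.0.2.2', 3), ('192.0.2.3', 2)]
print(top_share(ranked, 1))  # 0.5
```

Running `top_share` on the real counts would turn the observation about the top few addresses into a concrete percentage.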

All these IP addresses are a bit hard to interpret on their own, though. What we can do is map the addresses to their corresponding geolocations, so that we can interpret them more easily.

**Replacing IP addresses with Regions**<br>
We've prepared a lookup table file that lists the IP addresses above with their corresponding regions and countries. We can look at the lookup file and the result file from the prior cell using the `head` command.

In [2]:
lookup="data/ip_table.csv"
client_ips="data/ip_distr.csv"
head $lookup
echo 
head $client_ips

ip,count,region,country
64.62.145.98,414,California,United States
210.118.169.20,410,Gangwon-do,Korea
210.116.59.164,179,Seoul,Korea
4.152.207.238,94,New Jersey,United States
210.51.12.238,92,Beijing,China
64.122.238.114,91,Minnesota,United States
220.170.88.36,63,Hunan,China
222.95.35.200,51,Jiangsu,China
4.152.207.126,46,New Jersey,United States

"ip","count"
64.62.145.98,414
210.118.169.20,410
210.116.59.164,179
4.152.207.238,94
210.51.12.238,92
64.122.238.114,91
220.170.88.36,63
222.95.35.200,51
4.152.207.126,46


What we'd like to do is the following:
1. match the `ip` values in `ip_distr.csv` against the `ip` values in the lookup file, then add the corresponding country and region columns to the `ip_distr.csv` data.
2. aggregate the `ip_distr.csv` data by region and country, and recount the frequency of each. 

For the first step, we can use the `-cmb` option of the `aq_pp` command to match records on the IP column. 

In [9]:
# matching by IP address
aq_pp -f,+1 $client_ips -d ip:ip i:count -cmb,+1,all $lookup ip,key:ip i,cmb:count S:region S:country

"ip","count","region","country"
64.62.145.98,414,"California","United States"
210.118.169.20,410,"Gangwon-do","Korea"
210.116.59.164,179,"Seoul","Korea"
4.152.207.238,94,"New Jersey","United States"
210.51.12.238,92,"Beijing","China"
64.122.238.114,91,"Minnesota","United States"
220.170.88.36,63,"Hunan","China"
222.95.35.200,51,"Jiangsu","China"
4.152.207.126,46,"New Jersey","United States"
210.127.248.52,46,"Seoul","Korea"
216.105.210.91,45,"Wisconsin","United States"
66.99.250.98,42,"Illinois","United States"
66.173.230.27,40,"Virginia","United States"
222.95.32.114,38,"Jiangsu","China"
61.166.155.162,38,"Yunnan","China"
61.178.79.227,34,"Gansu","China"
63.202.179.73,32,"New York","United States"
67.162.217.35,32,"Arkansas","United States"
216.171.174.124,28,"California","United States"
211.229.45.67,23,"Gyeonggi-do","Korea"
82.229.169.73,23,"Île-de-France","France"
69.86.160.253,23,"New York","United States"
65.117.45.251,23,"Minnesota","United States"
220.95.232.60,23,"Gyeonggi-do"

70.80.223.173,1,,
218.64.24.205,1,,
210.82.34.25,1,,
220.170.233.137,1,,
216.74.229.19,1,,
172.155.134.244,1,,
82.52.98.248,1,,
68.201.200.65,1,,
195.166.22.231,1,,
193.109.122.11,1,,
66.249.194.119,1,,
216.73.162.91,1,,
222.166.160.157,1,,
220.170.159.6,1,,
193.109.122.23,1,,
222.33.38.94,1,,
24.232.140.34,1,,
68.94.125.162,1,,
24.13.26.115,1,,
67.121.94.193,1,,
216.104.137.150,1,,
61.10.7.101,1,,
211.179.10.212,1,,
219.133.240.116,1,,
60.213.104.126,1,,
138.73.71.118,1,,
218.61.87.218,1,,
193.109.122.47,1,,
207.71.220.100,1,,
218.74.38.64,1,,
83.131.116.49,1,,
61.10.7.47,1,,
68.42.30.57,1,,
66.76.233.44,1,,
211.9.146.6,1,,
222.166.160.161,1,,
198.54.202.4,1,,
195.199.231.234,1,,
208.183.225.11,1,,
222.240.84.179,1,,
204.49.32.16,1,,
61.234.121.2,1,,
208.185.174.208,1,,
218.59.22.246,1,,
63.85.149.252,1,,
220.174.102.79,1,,
61.222.0.242,1,,
222.166.160.60,1,,
218.90.219.110,1,,
61.10.7.82,1,,
221.217.50.33,1,,
216.150.210.194,1,,
67.172.189.99,1,,
216.252.70.106,1,,
63.26.231.71,1,,
1
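The matching step behaves like a left join: every row of the count file is kept, and addresses missing from the lookup end up with empty region and country values. A minimal Python sketch of that behavior follows, assuming both files fit in memory. `left_join` is a hypothetical helper, not an Essentia function, and the sample rows are copied from the outputs shown in this notebook.

```python
import csv
import io


def left_join(counts_csv, lookup_csv):
    """Left-join count rows with region/country from the lookup on `ip`.
    Addresses missing from the lookup keep empty region and country."""
    lookup = {r["ip"]: r for r in csv.DictReader(io.StringIO(lookup_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(counts_csv)):
        match = lookup.get(row["ip"], {})
        joined.append({
            "ip": row["ip"],
            "count": int(row["count"]),
            "region": match.get("region", ""),
            "country": match.get("country", ""),
        })
    return joined


# A few rows in the same layout as ip_distr.csv and ip_table.csv.
counts_csv = '"ip","count"\n64.62.145.98,414\n210.118.169.20,410\n70.80.223.173,1\n'
lookup_csv = ("ip,count,region,country\n"
              "64.62.145.98,414,California,United States\n"
              "210.118.169.20,410,Gangwon-do,Korea\n")

for row in left_join(counts_csv, lookup_csv):
    print(row)
```

In the sketch, the third address has no lookup entry, so it comes back with empty `region` and `country`, just like the rows ending in `,,` in the output.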

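Step 2 counts how often each country occurs. The command below uses `country` alone as the key column, so regions are not separated. Because each input row is one client IP address, the resulting count is the number of addresses per country, and the row with an empty country covers the addresses that have no entry in the lookup table.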


In [12]:
# count occurrences of each country
aq_pp -f,+1 $client_ips -d ip:ip i:count -cmb,+1,all $lookup ip,key:ip i,cmb:count S:region S:country -o,aq - | \
aq_cnt -f,aq - -kX - key country

"country","count"
,338
"Bolivia",1
"United Kingdom",1
"Denmark",1
"Iran",1
"Canada",1
"Bangladesh",1
"Portugal",2
"Taiwan",5
"Japan",4
"Mexico",3
"Switzerland",2
"France",1
"China",22
"Korea",11
"United States",44

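The counts above appear to be the number of client addresses per country rather than the number of requests. If request totals are also of interest, both figures can be computed in one pass. The sketch below assumes rows shaped like the matched output (`ip`, `count`, `country`); `aggregate_by_country` and the `(unknown)` label are hypothetical choices, and the sample rows are taken from the matched output shown earlier.

```python
from collections import defaultdict


def aggregate_by_country(rows):
    """Return {country: (number_of_addresses, total_requests)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        # Assumption: group addresses without a lookup match under one label.
        country = row["country"] or "(unknown)"
        totals[country][0] += 1             # one more address in this country
        totals[country][1] += row["count"]  # add this address's request count
    return {country: tuple(pair) for country, pair in totals.items()}


rows = [
    {"ip": "64.62.145.98", "count": 414, "country": "United States"},
    {"ip": "4.152.207.238", "count": 94, "country": "United States"},
    {"ip": "210.118.169.20", "count": 410, "country": "Korea"},
    {"ip": "70.80.223.173", "count": 1, "country": ""},
]
summary = aggregate_by_country(rows)
for country, (addresses, requests) in sorted(summary.items()):
    print(country, addresses, requests)
```

Keeping both numbers side by side helps distinguish a country with many low-volume clients from one with a few very active addresses.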