# Raw HDFS Data Review and Train/Test Split

The HDFS data used in this project is provided by the [Loghub collection](https://github.com/logpai/loghub):
- Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. [Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics](https://arxiv.org/abs/2008.06448). *Arxiv*, 2020.

The HDFS data are provided at: https://github.com/logpai/loghub/tree/master/HDFS

The downloaded archive `HDFS_1.tar.gz`, which contains the `HDFS.log` data used in this project, is from: https://zenodo.org/record/3227177#.YHH_tOhKhPY

The downloaded `HDFS_1.tar.gz` also contains a file, `anomaly_label.csv`, which labels each HDFS block in the `HDFS.log` file as normal or anomalous.

Additional details of the `HDFS.log` file are provided in the paper:
- Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan. [Detecting Large-Scale System Problems by Mining Console Logs](https://people.eecs.berkeley.edu/~jordan/papers/xu-etal-sosp09.pdf), in Proc. of the 22nd ACM Symposium on Operating Systems Principles (SOSP), 2009. 

In [3]:
import pandas as pd
import numpy as np
import logging

## Explore raw data in HDFS.log

In [4]:
# count rows
with open('project_raw/HDFS.log', "r") as file:
    totaln=0
    for line in file:
        totaln += 1

In [5]:
print('There are a total of {} lines'.format(totaln))

There are a total of 11175629 lines


In [6]:
# Take a quick look at the first 200 lines of the data
data = []
with open('project_raw/HDFS.log', "r") as file:
    for n, line in enumerate(file):
        # stop once 200 lines have been collected
        if n >= 200:
            break
        data.append(line)

In [7]:
df = pd.DataFrame(data)

In [8]:
df.head()

Unnamed: 0,0
0,081109 203518 143 INFO dfs.DataNode$DataXceive...
1,081109 203518 35 INFO dfs.FSNamesystem: BLOCK*...
2,081109 203519 143 INFO dfs.DataNode$DataXceive...
3,081109 203519 145 INFO dfs.DataNode$DataXceive...
4,081109 203519 145 INFO dfs.DataNode$PacketResp...


In [9]:
df.iloc[100:120].values

array([['081109 203527 154 INFO dfs.DataNode$DataXceiver: 10.251.197.226:50010 Served block blk_-3544583377289625738 to /10.251.203.4\n'],
       ['081109 203527 154 INFO dfs.DataNode$DataXceiver: 10.251.215.16:50010 Served block blk_-1608999687919862906 to /10.250.19.227\n'],
       ['081109 203527 155 INFO dfs.DataNode$DataXceiver: 10.250.11.100:50010 Served block blk_-3544583377289625738 to /10.250.19.227\n'],
       ['081109 203527 155 INFO dfs.DataNode$DataXceiver: 10.251.197.226:50010 Served block blk_-3544583377289625738 to /10.251.215.16\n'],
       ['081109 203527 156 INFO dfs.DataNode$DataXceiver: 10.250.11.100:50010 Served block blk_-3544583377289625738 to /10.251.65.203\n'],
       ['081109 203527 156 INFO dfs.DataNode$DataXceiver: 10.251.197.226:50010 Served block blk_-3544583377289625738 to /10.250.17.177\n'],
       ['081109 203527 157 INFO dfs.DataNode$DataXceiver: 10.250.11.100:50010 Served block blk_-3544583377289625738 to /10.251.66.63\n'],
       ['081109 203527 157

The data are unstructured log files. A review of several anomaly detection papers identified Drain as the most accurate parser in the evaluation referenced below. Logparser provides an implementation of Drain, which is described in:
- [**ICWS'17**] [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.

An implementation of the Drain log parser is available through the [Logparser toolkit](https://github.com/logpai/logparser). The Logparser toolkit provides multiple automated log parsing methods to create structured logs (also referred to as message template extraction). Logparser was created as part of an evaluation of various parsers:
- [**ICSE'19**] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. [Tools and Benchmarks for Automated Log Parsing](https://arxiv.org/pdf/1811.03509.pdf). *International Conference on Software Engineering (ICSE)*, 2019.
- [**DSN'16**] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](https://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf). *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)*, 2016.

However, Drain as provided in the Logparser package is implemented in Python 2.7, so Drain is run in a separate Python 2.7 environment rather than in this notebook. Documentation for Logparser can be found [here](https://logparser.readthedocs.io/en/latest/README.html). The `project_parser.py` script uses Drain to parse both the `HDFS.log` and `HDFS_train.log` files described below.
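To make "structured log" concrete, the following is a minimal sketch of splitting one raw line into header fields and a crude message template. It is not Drain and not the Logparser API: the field names (`Date`, `Time`, `Pid`, `Level`, `Component`, `Content`), the regular expressions, and the helper `structure_line` are assumptions chosen for illustration, based only on the sample lines shown above.

```python
import re

# Assumed header layout, inferred from the sample lines:
# date, time, pid, level, component, then the free-text message.
LINE_RE = re.compile(
    r"^(?P<Date>\d{6}) (?P<Time>\d{6}) (?P<Pid>\d+) (?P<Level>\w+) "
    r"(?P<Component>[^:]+): (?P<Content>.*)$"
)
# Variable parts masked in this sketch: block IDs and IP addresses (optional port).
BLOCK_RE = re.compile(r"blk_-?\d+")
IP_RE = re.compile(r"/?\d+\.\d+\.\d+\.\d+(?::\d+)?")


def structure_line(line):
    """Hypothetical helper: return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    content = fields["Content"]
    # Keep the block IDs, since the anomaly labels are given per block.
    fields["BlockIds"] = BLOCK_RE.findall(content)
    # Replace variable parts with a wildcard to get a crude template.
    template = BLOCK_RE.sub("<*>", content)
    fields["Template"] = IP_RE.sub("<*>", template)
    return fields


sample = ("081109 203527 154 INFO dfs.DataNode$DataXceiver: 10.251.197.226:50010 "
          "Served block blk_-3544583377289625738 to /10.251.203.4\n")
row = structure_line(sample)
print(row["Component"], "|", row["Template"], "|", row["BlockIds"])
```

Drain learns its templates from the data with a fixed-depth tree instead of relying on hand-written masks like these, so its output may differ from this sketch.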

## Training and Testing .log Files

Split the data into training and testing sets using an 80/20 split. The data are kept in their original order (no shuffling or random selection) because they form a log series history: the first 80% of lines are used for training and the last 20% for testing.

Note that only a train file is created here, because the Drain log parser is then run as follows:
- Run Drain on the train file. This creates log templates based only on the training set.
- Run Drain on the complete file. This creates the same log templates from the training set and adds any new templates seen only in the testing data. It may also modify the original training set templates (Drain updates templates as it learns new patterns), but these updated templates are not copied back to the training set because they would not have been seen at that point.
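As a memory-friendly alternative, the ordered split can also be sketched as a streaming copy that never holds the training lines in a list. The helper names `count_lines` and `write_head` are hypothetical and not part of this project. Unlike the `'x'` mode used in this notebook, this sketch opens the destination with `'w'` and so overwrites an existing file.

```python
from itertools import islice


def count_lines(path):
    """Count the lines in a text file without loading it into memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


def write_head(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path to dst_path, preserving order.

    Returns the number of lines actually written.
    """
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Example usage with the paths from this notebook (left commented out
# because it depends on the raw file being present):
# total = count_lines("project_raw/HDFS.log")
# write_head("project_raw/HDFS.log", "project_raw/HDFS_train.log", int(total * 0.8))
```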

In [10]:
train_idx = int(totaln*.8)
train_idx

8940503

In [59]:
# read the training lines only
train_data = []
with open('project_raw/HDFS.log', "r") as file:
    n=0
    for line in file:
        if n < train_idx:
            train_data.append(line)
            n += 1
        else: break 

In [61]:
# write to the training file `HDFS_train.log`
with open('project_raw/HDFS_train.log', 'x') as file:
    for i in train_data:
        file.write(i)

## Drain Output Files

The `project_parser.py` script is run on both the `HDFS.log` and `HDFS_train.log` files as discussed above. The outputs, which are then used for feature extraction, are:

- `HDFS_train.log_templates.csv`
- `HDFS_train.log_structured.csv`
- `HDFS.log_templates.csv`
- `HDFS.log_structured.csv`

`HDFS.log_templates.csv` can be used directly for the testing feature extraction. To create a testing file containing only the structured log data, the training rows are removed from `HDFS.log_structured.csv` and the remaining test-only structured logs are saved as `HDFS_test.log_structured.csv`.

In [12]:
all_parsed = pd.read_csv('project_parsed/HDFS.log_structured.csv')

In [18]:
test_parsed = all_parsed.iloc[train_idx:]

In [20]:
test_parsed.to_csv('project_parsed/HDFS_test.log_structured.csv', index = False)
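If loading the full structured file is too heavy, the same row selection can be sketched with the `skiprows` argument of `pd.read_csv`, which accepts a callable evaluated on each row index. The helper `read_rows_after` is hypothetical, and the column names in the small in-memory demo are placeholders, not the real Drain output schema.

```python
import io

import pandas as pd


def read_rows_after(csv_source, n_skip):
    """Read a CSV, dropping the first n_skip data rows but keeping the header."""
    # Row index 0 is the header line; indices 1..n_skip are the data rows to drop.
    return pd.read_csv(csv_source, skiprows=lambda i: 1 <= i <= n_skip)


# Tiny in-memory stand-in for a structured log file (placeholder columns).
demo = io.StringIO("LineId,Content\n1,a\n2,b\n3,c\n4,d\n5,e\n")
tail = read_rows_after(demo, 4)
print(tail)
```

With the real file, the call would be `read_rows_after('project_parsed/HDFS.log_structured.csv', train_idx)`, assuming each log line occupies exactly one CSV row.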