# Experiments - BGP offline dataset preparation  
Generate a dataset that includes anomaly events for training models, using existing functions from the Cyberdefence project.

### 1. Set date range for messages to download.

In [None]:

# Slammer event data.
site = 'RIPE'
collector_ripe = 'rrc04'
start_date, end_date = ('20030123', '20030128')
start_date_anomaly, end_date_anomaly = ('20030125', '20030125')
start_time_anomaly, end_time_anomaly = ('0531', '1959')
    

### 2. Download the message files
One folder is created for each date.  
```
Input variables: start_date, end_date, site, collector_ripe  
Output files: src/data_ripe/yyyymmdd/DUMP_yyyymmdd  
```
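As an illustration of the one-folder-per-date layout above, the following minimal sketch lists the dump paths expected for a date range. The helper name `expected_dump_paths` is hypothetical (it is not part of the project), and treating both `start_date` and `end_date` as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def expected_dump_paths(start_date, end_date, root="src/data_ripe"):
    """List one DUMP path per calendar day (hypothetical helper).

    Assumption: both start_date and end_date are inclusive.
    """
    day = datetime.strptime(start_date, "%Y%m%d")
    last = datetime.strptime(end_date, "%Y%m%d")
    paths = []
    while day <= last:
        stamp = day.strftime("%Y%m%d")
        paths.append(f"{root}/{stamp}/DUMP_{stamp}")
        day += timedelta(days=1)
    return paths


paths = expected_dump_paths("20030123", "20030128")
print(len(paths), paths[0], paths[-1])
```

A listing like this can be handy for checking that every expected file exists before feature extraction starts.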
The original BGP message files are downloaded to the folder src/data_ripe and then converted to ASCII format.  
Each output file contains BGP messages in plain text.  
This is a message example:  
```
TIME: 2024-8-24 18:05:00
TYPE: BGP4MP/BGP4MP_MESSAGE_AS4 AFI_IP
FROM: 192.65.185.3
TO: 192.65.185.40 
BGP PACKET TYPE: UPDATE
ORIGIN: IGP
AS_PATH: 513 21320 9002 57363 57363 57363
NEXT_HOP: 192.65.185.3
COMMUNITIES: 20965:3 20965:4 21320:64622 21320:64698
ANNOUNCED: 151.236.111.0/24
```
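To make the plain-text layout concrete, here is a minimal sketch that parses one message block like the example above into a dictionary. The function `parse_message` is hypothetical and not part of the project. It assumes one `KEY: value` pair per line, as in the example, and splits only on the first colon so that values such as timestamps and communities stay intact.

```python
def parse_message(text):
    """Turn one 'KEY: value' message block into a dict (hypothetical helper)."""
    record = {}
    for line in text.strip().splitlines():
        # Split on the first colon only; values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


sample = """\
TIME: 2024-8-24 18:05:00
TYPE: BGP4MP/BGP4MP_MESSAGE_AS4 AFI_IP
FROM: 192.65.185.3
TO: 192.65.185.40
BGP PACKET TYPE: UPDATE
ORIGIN: IGP
AS_PATH: 513 21320 9002 57363 57363 57363
NEXT_HOP: 192.65.185.3
COMMUNITIES: 20965:3 20965:4 21320:64622 21320:64698
ANNOUNCED: 151.236.111.0/24
"""

msg = parse_message(sample)
as_path = msg["AS_PATH"].split()  # list of AS numbers as strings
print(msg["TIME"], len(as_path), msg["ANNOUNCED"])
```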

In [None]:
from src.dataDownload import data_downloader_multi

data_downloader_multi(start_date, end_date, site, collector_ripe)


### 3. Extract features from BGP messages
Call a C# executable to extract 37 features and 4 timestamp columns from the dump files. The extraction summarizes message statistics per minute. For example, feature 1 (number of announcements) is the count of announcement messages in one minute, and feature 5 (average AS-path length) is the average length of the AS-path strings contained in the messages in that minute.
```
Input variables: start_date, end_date, site
Input files: src/data_ripe/yyyymmdd/DUMP_yyyymmdd  
Output files: src/data_split/DUMP_yyyymmdd_out.txt  
```
The following is the field definition for DUMP_yyyymmdd_out.txt:  
```
Columns 1-4: time (column 1: hour+minute; column 2: hour; column 3: minute; column 4: second)
Columns 5-41: features

List of features extracted from BGP update messages:
1 Number of announcements
2 Number of withdrawals
3 Number of announced NLRI prefixes
4 Number of withdrawn NLRI prefixes
5 Average AS-path length
6 Maximum AS-path length
7 Average unique AS-path length
8 Number of duplicate announcements
9 Number of duplicate withdrawals
10 Number of implicit withdrawals
11 Average edit distance
12 Maximum edit distance
13 Inter-arrival time
14–24 Maximum edit distance = n, n = 7, ..., 17
25–33 Maximum AS-path length = n, n = 7, ..., 15
34 Number of Interior Gateway Protocol (IGP) packets
35 Number of Exterior Gateway Protocol (EGP) packets
36 Number of incomplete packets
37 Packet size (B)
```
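The per-minute summarization can be sketched in Python for features 1 and 5. This is only an illustration, not the C# extractor: the function `per_minute_features` and the `(hhmm, kind, as_path)` tuple layout are hypothetical, and "AS-path length" is assumed here to mean the number of AS entries in the path.

```python
from collections import defaultdict


def per_minute_features(messages):
    """Aggregate features 1 and 5 per minute (hypothetical sketch).

    messages: iterable of (hhmm, kind, as_path) tuples, an assumed layout.
    Returns {hhmm: (number_of_announcements, average_as_path_length)}.
    """
    buckets = defaultdict(list)
    for hhmm, kind, as_path in messages:
        if kind == "announce":
            # Assumption: path length = number of AS entries in the path.
            buckets[hhmm].append(len(as_path.split()))
    return {
        hhmm: (len(lengths), sum(lengths) / len(lengths))
        for hhmm, lengths in sorted(buckets.items())
    }


messages = [
    ("1805", "announce", "513 21320 9002 57363 57363 57363"),
    ("1805", "announce", "513 21320 9002"),
    ("1805", "withdraw", ""),
    ("1806", "announce", "513 3356"),
]
rows = per_minute_features(messages)
print(rows)
```

In this toy input, minute 1805 has two announcements with path lengths 6 and 3, so its average AS-path length is 4.5.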

In [None]:
from src.featureExtraction import feature_extractor_multi

output_file_list = feature_extractor_multi(start_date, end_date, site)
print("Feature extraction done for:", output_file_list)


### 4. Label data points.
Label every data point within the anomaly period as 1 and all others as 0. The labels are written to a separate file.  
```
Input variables: start_date_anomaly, end_date_anomaly, start_time_anomaly, end_time_anomaly, site, output_file_list  
Input files: src/data_split/DUMP_yyyymmdd_out.txt  
Output files: src/STAT/labels_RIPE.csv
```
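The labeling rule can be sketched as follows. This is not the project's `label_generator`; the function `label_minutes` is hypothetical. It assumes that each data point is identified by zero-padded `yyyymmdd` and `hhmm` strings and that both ends of the anomaly window are inclusive.

```python
def label_minutes(rows, start_date, end_date, start_time, end_time):
    """Label each (yyyymmdd, hhmm) data point: 1 inside the window, else 0.

    Assumptions: zero-padded fixed-width strings, inclusive window bounds.
    Fixed-width digit strings compare in chronological order.
    """
    lo = start_date + start_time
    hi = end_date + end_time
    return [1 if lo <= date + hhmm <= hi else 0 for date, hhmm in rows]


rows = [
    ("20030125", "0530"),
    ("20030125", "0531"),
    ("20030125", "1959"),
    ("20030125", "2000"),
    ("20030126", "0600"),
]
demo_labels = label_minutes(rows, "20030125", "20030125", "0531", "1959")
print(demo_labels)
```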

In [None]:
from src.label_generation import label_generator

labels = label_generator(start_date_anomaly, end_date_anomaly, start_time_anomaly, end_time_anomaly, site, output_file_list)
print(len(labels), "labels generated.")
    

### Partition data
Merge the data points and labels from all dates into one dataset, then split it into a training set and a test set at the specified proportion.  
The splitting rule is:  
1. Assume that the anomaly labels form exactly one continuous segment of points.
2. Split the anomaly segment into two parts according to the cut_pct parameter. This gives the cutting point index.  
3. Split the dataset at the cutting point index. The left part becomes the training set and the right part becomes the test set.  
4. Since RNN algorithms require a sequence of data points as input, the cutting position is rounded to an integer multiple of the sequence length (e.g. 10).


Here is an example (the rounding in step 4 is not applied):  
labels:  000001111111111000000000  
portion: 60% train, 40% test  
cutting point (\*): 00000111111\*1111000000000  
result train labels: 00000111111  
result test labels: 1111000000000  
```
Input files: src/data_split/DUMP_yyyymmdd_out.txt, src/STAT/labels_RIPE.csv  
Input variables: labels, site, output_file_list
Output files: src/data_split/train_64_RIPE.csv, test_64_RIPE.csv, src/STAT/train_test_stat.txt
```
Each output CSV file is a matrix of float values with 37 feature columns followed by one label column.
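The cutting rule can be sketched as below. This is not the project's `data_partition`; the function `cut_index` and its `train_pct` parameter are hypothetical, and rounding down to a multiple of the sequence length is an assumption, since the rounding direction is not stated.

```python
def cut_index(labels, train_pct, seq_len=1):
    """Return the index that splits labels into train and test parts.

    Assumptions: the anomaly labels (1s) form one contiguous segment, and the
    cut is rounded down to a multiple of seq_len.
    """
    first = labels.index(1)           # start of the anomaly segment
    n_anomaly = sum(labels)           # segment length, given contiguity
    cut = first + n_anomaly * train_pct // 100
    return cut - cut % seq_len


labels_demo = [int(c) for c in "000001111111111000000000"]
cut = cut_index(labels_demo, 60)
train_labels, test_labels = labels_demo[:cut], labels_demo[cut:]
print("".join(map(str, train_labels)), "".join(map(str, test_labels)))

cut_rnn = cut_index(labels_demo, 60, seq_len=10)
print(cut_rnn)
```

With `seq_len=1` this reproduces the example above (cut at index 11). With `seq_len=10` the cut moves down to index 10 under the round-down assumption.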

In [None]:
from src.data_partition import data_partition

cut_pct = '64' # Train: 60%, Test: 40%
rnn_seq = 10 # Sequence length of the RNN input (10 data points)
data_partition(cut_pct, site, output_file_list, labels, rnn_seq)
    

### 5. Normalize data
```
Input variables: cut_pct, site  
Input files: src/data_split/train_64_RIPE.csv, test_64_RIPE.csv  
Output files: src/data_split/train_64_RIPE_n.csv, test_64_RIPE_n.csv  
```
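The notebook does not state which normalization `normTrainTest` applies. As one common possibility, the sketch below shows z-score normalization where the mean and standard deviation are computed on the training set only and then reused for the test set. The functions `fit_zscore` and `apply_zscore` are hypothetical, and the rows are assumed to hold feature columns only (the label column would be left out).

```python
from statistics import mean, pstdev


def fit_zscore(train_rows):
    """Compute per-column mean and standard deviation from training rows."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; use 1.0 to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale rows with statistics that were fitted on the training set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0]]
test = [[2.0, 12.0]]
means, stds = fit_zscore(train)
train_n = apply_zscore(train, means, stds)
test_n = apply_zscore(test, means, stds)
print(train_n, test_n)
```

Fitting the statistics on the training set alone keeps information from the test set out of the training data.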

In [16]:
from src.data_process import normTrainTest

normTrainTest(cut_pct, site)
print("Data normalization done.")

Data normalization done.


The dataset is ready. train_64_RIPE_n.csv can now be used to train the model, and test_64_RIPE_n.csv to evaluate the trained model.