# Experiments - BGP real-time dataset preparation  
Generate a dataset for detecting anomalies in real-time BGP messages, using existing functions from the Cyberdefence project. The dataset is used only for classification (prediction) with existing models (e.g. VFBLS), not for training models.

### 1. Get the latest message file name from the source website

In [None]:
from src.time_tracker import time_tracker_single
from src.dataDownload import updateMessageName

site='RIPE'
year, month, day, hour, minute = time_tracker_single(site)
print("Current time:", year, month, day, hour, minute)
update_message_file, data_date = updateMessageName(year, month, day, hour, minute)
print("Will process an update message file:", update_message_file)


### 2. Download the message file
```
Output files: src/data_ripe/DUMP  
```
The original BGP MRT file is downloaded to the folder src/data_ripe, then converted to ASCII format.  
The output file includes BGP messages in plain text.  
This is a message example:  
```
TIME: 2024-8-24 18:05:00
TYPE: BGP4MP/BGP4MP_MESSAGE_AS4 AFI_IP
FROM: 192.65.185.3
TO: 192.65.185.40 
BGP PACKET TYPE: UPDATE
ORIGIN: IGP
AS_PATH: 513 21320 9002 57363 57363 57363
NEXT_HOP: 192.65.185.3
COMMUNITIES: 20965:3 20965:4 21320:64622 21320:64698
ANNOUNCED: 151.236.111.0/24
```
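To work with these plain-text messages in Python, each one can be read into a dictionary keyed by field name. The following is a minimal sketch, not part of the project code: `parse_bgp_message` is a hypothetical helper, and it assumes the single-line `KEY: value` layout shown in the example above (a key that repeats within one message would keep only its last value).

```python
def parse_bgp_message(block: str) -> dict:
    """Parse one plain-text BGP message into a dict keyed by field name."""
    fields = {}
    for line in block.strip().splitlines():
        # Split on the first ": " only, because values such as the
        # timestamp and the communities contain colons of their own.
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that are not in "KEY: value" form
        fields[key.strip()] = value.strip()
    return fields


# The message example from this section, copied verbatim.
example = """\
TIME: 2024-8-24 18:05:00
TYPE: BGP4MP/BGP4MP_MESSAGE_AS4 AFI_IP
FROM: 192.65.185.3
TO: 192.65.185.40
BGP PACKET TYPE: UPDATE
ORIGIN: IGP
AS_PATH: 513 21320 9002 57363 57363 57363
NEXT_HOP: 192.65.185.3
COMMUNITIES: 20965:3 20965:4 21320:64622 21320:64698
ANNOUNCED: 151.236.111.0/24
"""

message = parse_bgp_message(example)
as_path = message["AS_PATH"].split()
print(message["BGP PACKET TYPE"], len(as_path))  # prints: UPDATE 6
```
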

In [None]:
from src.dataDownload import data_downloader_single

data_downloader_single(update_message_file, data_date, site)


### 3. Extract features from BGP messages
Call a C# executable to extract 37 features and 4 timestamp columns from the dump files. The extraction summarizes the messages minute by minute. For example, feature No. 1 (number of announcements) counts the announcement messages in one minute, and feature No. 5 (average AS-path length) is the average length of all the AS-path strings contained in the messages in one minute.  
```
Input files: src/data_ripe/DUMP  
Output files: src/data_test/DUMP_out.txt  
```
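The per-minute summarization can be illustrated in Python for two of the features. This is only a sketch of the idea, not the logic of the C# executable: `summarize_by_minute` is a hypothetical function, the input messages are made-up dictionaries that reuse the field names of the plain-text dump, and the AS-path length is assumed to be the number of AS numbers in the path (prepended duplicates included).

```python
from collections import defaultdict


def summarize_by_minute(messages):
    """Return {(hour, minute): (n_announcements, avg_as_path_length)}."""
    counts = defaultdict(int)
    path_lengths = defaultdict(list)
    for msg in messages:
        # TIME looks like "2024-8-24 18:05:00"; group by hour and minute.
        hour, minute, _second = msg["TIME"].split()[1].split(":")
        key = (int(hour), int(minute))
        if "ANNOUNCED" in msg:
            counts[key] += 1
        if "AS_PATH" in msg:
            path_lengths[key].append(len(msg["AS_PATH"].split()))
    summary = {}
    for key in sorted(set(counts) | set(path_lengths)):
        lengths = path_lengths.get(key, [])
        average = sum(lengths) / len(lengths) if lengths else 0.0
        summary[key] = (counts.get(key, 0), average)
    return summary


# Made-up sample messages, for illustration only.
messages = [
    {"TIME": "2024-8-24 18:05:00", "AS_PATH": "513 21320 9002 57363",
     "ANNOUNCED": "151.236.111.0/24"},
    {"TIME": "2024-8-24 18:05:30", "AS_PATH": "513 21320",
     "ANNOUNCED": "10.0.0.0/8"},
    {"TIME": "2024-8-24 18:06:10", "AS_PATH": "513 3356 174",
     "ANNOUNCED": "192.0.2.0/24"},
]

summary = summarize_by_minute(messages)
print(summary)  # {(18, 5): (2, 3.0), (18, 6): (1, 3.0)}
```
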
The following is the field definition for DUMP_out.txt:  
```
Columns 1-4: time (column 1: hour+minute; column 2: hour; column 3: minute; column 4: second)
Columns 5-41: features

List of features extracted from BGP update messages:
1 Number of announcements
2 Number of withdrawals
3 Number of announced NLRI prefixes
4 Number of withdrawn NLRI prefixes
5 Average AS-path length
6 Maximum AS-path length
7 Average unique AS-path length
8 Number of duplicate announcements
9 Number of duplicate withdrawals
10 Number of implicit withdrawals
11 Average edit distance
12 Maximum edit distance
13 Inter-arrival time
14–24 Maximum edit distance = n, n = 7, ..., 17
25–33 Maximum AS-path length = n, n = 7, ..., 15
34 Number of Interior Gateway Protocol (IGP) packets
35 Number of Exterior Gateway Protocol (EGP) packets
36 Number of incomplete packets
37 Packet size (B)
```
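Given this layout, one row of DUMP_out.txt can be mapped to named columns as sketched below. The helper names and the column labels (`hhmm`, `f1` to `f37`) are hypothetical, and the delimiter is an assumption: the sketch splits on whitespace by default, so pass `sep=","` if the file turns out to be comma-separated. The sample row is synthetic, not real extractor output.

```python
def build_column_names():
    """Return the 41 column labels: 4 time columns, then features f1..f37."""
    names = ["hhmm", "hour", "minute", "second"]
    names += [f"f{i}" for i in range(1, 38)]
    return names


def parse_feature_row(line, sep=None):
    """Map one text row to {column label: float value}.

    sep=None splits on any run of whitespace (an assumption about the file).
    """
    values = line.strip().split(sep)
    names = build_column_names()
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} columns, got {len(values)}")
    return dict(zip(names, (float(v) for v in values)))


# Synthetic row: time columns for 18:05:00, then placeholder values 1..37.
sample = " ".join(["1805", "18", "5", "0"] + [str(i) for i in range(1, 38)])
row = parse_feature_row(sample)
print(row["hour"], row["minute"], row["f5"])  # prints: 18.0 5.0 5.0
```

Checking the column count up front makes a wrong delimiter guess fail loudly instead of silently shifting features.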

In [None]:
from src.featureExtraction import feature_extractor_single

file_name = feature_extractor_single(site)
print("Feature extraction done for:", file_name)


It's done! Use src/data_test/DUMP_out.txt for classification.