## Los Alamos National Lab Data Preparation
*Source:* [LANL dataset](https://csr.lanl.gov/data/cyber1/)  

This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network.

Only the auth.txt file is used in our current work, as all red team activity appearing in the data correspond exclusively to authentication events. Future work includes utilizing additional data streams (namely; proc.txt.gz). We perform a pre-processing step on the file redteam.txt.gz so that its log lines are expanded to match the full log line which they correspond to in the auth.txt.gz file. This adds general convenience, and speeds up the process of querying to find out if a given log line is malicious.

This notebook outlines methods used for translating log lines into integer vectors which can be acted on by event level models. Note that the scripts in **/safekit/features/lanl** can also be used standalone to accomplish the same translation, but only operate on the auth.txt data file.

### Character Level
----
*note: /safekit/features/lanl/char_feats.py*  
At the character level, the ascii value for each character in a log line is used as a token in the input sequence for the model.

The translation used is

In [4]:
def translate_line(string, pad_len):
    return "0 " + " ".join([str(ord(c) - 30) for c in string]) + " 1 " + " ".join(["0"] * pad_len) + "\n"

Where **string** is a log line to be translated, and **pad_len** is the number of 0's to append so that the length of the translated string has the same number of characters as the longest log line in the dataset (character-wise). **0** and **1** are used to describe start and end of the translated sentence. 

The length of the max log line can be obtained using

In [28]:
max_len = 0
with open("auth_proc.txt", "r") as infile:
    for line in infile:
        tmp = line.strip().split(',')
        line_minus_time = tmp[0] + ',' + ','.join(tmp[2:])
        if len(line_minus_time) > max_len:
            max_len = len(line)
print (max_len)

129


We chose to filter out weekends for this dataset, as they capture little activity. These days are

In [None]:
weekend_days = [3, 4, 10, 11, 17, 18, 24, 25, 31, 32, 38, 39, 45, 46, 47, 52, 53]

It is also convenient to keep track of which lines are in fact red events. Note that labels are not used during unsupervised training, this step is included to simplify the evaluation process later on.

In [30]:
with open("redevents.txt", 'r') as red:
    redevents = set(red.readlines())

**redevents.txt** contains all of the red team log lines verbatim from auth.txt.   
It is now possible to parse the data file, reading in (raw) and writing out (translated) log lines. 

In [11]:
import math
with open("auth_proc.txt", 'r') as infile, open("ap_char_feats.txt", 'w') as outfile:
    outfile.write('line_number second day user red seq_len start_sentence\n') # header
    infile.readline()
    for line_num, line in enumerate(infile):
        tmp = line.strip().split(',')
        line_minus_time = tmpline_minus_time = tmp[0] + ',' + ','.join(tmp[2:])[0] + ',' + ','.join(tmp[2:])
        diff = max_len - len(line_minus_time)
        raw_line = line.split(",")
        sec = raw_line[1]
        user = raw_line[2].strip().split('@')[0]
        day = math.floor(int(sec)/86400)
        red = 0
        line_minus_event = ",".join(raw_line[1:])
        red += int(line_minus_event in redevents) # 1 if line is red event
        if user.startswith('U') and day not in weekend_days:
            translated_line = translate_line(line_minus_time, 120) # diff
            outfile.write("%s %s %s %s %s %s %s" % (line_num, sec, day, 
                                                    user.replace("U", ""), 
                                                    red, len(line_minus_time) + 1, translated_line))

The final preprocessing step is to split the translated data into multiple files; one for each day. 

In [None]:
import os
os.mkdir('./char_feats')
with open('./ap_char_feats.txt', 'r') as data:
    current_day = '0'
    outfile = open('./char_feats/' + current_day + '.txt', 'w')
    for line in data:
        larray = line.strip().split(' ')
        if int(larray[2]) == int(current_day):
            outfile.write(line)
        else:
            outfile.close()
            current_day = larray[2]
            outfile = open('./char_feats/' + current_day + '.txt', 'w')
            outfile.write(line)
    outfile.close()


The char_feats folder can now be passed to the tiered or simple language model.  
The config (data spec) or this experiment is shown below (write as string to json file).

>{**"sentence_length"**: 129,  
    **"token_set_size"**: 96,  
    **"num_days"**: 30,  
    **"test_files"**: ["0head.txt", "1head.txt", "2head.txt"],  
    **"weekend_days"**: [3, 4, 10, 11, 17, 18, 24, 25]}  

### Word Level
----
Instead of using the ascii values of individual characters, this approach operates on the features of a log line as if they were a sequence. In other words, each log line is split on "," and each index in the resulting array corresponds to a timestep in the input sequence of the event level model.

To map the token strings to integer values, a vocabulary is constructed for the dataset. Any tokens encountered during evaluation which were not present in the initial data are mapped to a common "out of vocabulary" (OOV) value during translation. If the number of unique tokens within the data is known to be prohibitively large, a count dictionary can be used to infer an arbitrary cutoff which maps the least likely tokens in the data to the OOV integer. Eg; all tokens which appear less than 5 times in the data map to the OOV token during translation.

Note that in our AICS paper we have seperate OOV tokens for user, pc, and domain tokens.  
In the following code, every unique token is included in the vocabulary.

In [1]:
index = 4 # <sos> : 1, <eos>: 2, auth_event: 3, proc_event: 4
vocab = {"OOV": "0", "<sos>": "1", "<eos>": "2","auth_event": "3", "proc_event": "4"}

def lookup(key):
    global index, vocab
    if key in vocab:
        return vocab[key]
    else:
        index += 1
        vocab[key] = str(index)
        return str(index)

Translated log lines should be padded out to the maximum length fields_list. In the case of LANL, the proc events are longer and contain 11 tokens. 

In [2]:
def translate(fields_list):
    translated_list = list(map(lookup, fields_list))
    while len(translated_list) < 11:
        translated_list.append("0")
    translated_list.insert(0, "1")  # <sos>
    translated_list.append("2")  # <eos>
    translated_list.insert(0, str(len(translated_list)))  # sentence len
    return translated_list

We chose to filter out weekends for this dataset, as they capture little activity. It is also convenient to keep track of which lines are in fact red events. Note that labels are not used during unsupervised training, this step is included to simplify the evaluation process later on.

In [None]:
weekend_days = [3, 4, 10, 11, 17, 18, 24, 25, 31, 32, 38, 39, 45, 46, 47, 52, 53]

with open("redevents.txt", 'r') as red:
    redevents = set(red.readlines())

Since OOV cutoffs are not being considered, the translation can be done in a single pass over the data.

In [25]:
import math
with open("auth_proc.txt", 'r') as infile, open("ap_word_feats.txt", 'w') as outfile:
    outfile.write('line_number second day user red sentence_len translated_line padding \n')
    for line_num, line in enumerate(infile):
        line = line.strip()
        fields = line.replace("@", ",").replace("$", "").split(",")
        sec = fields[1]
        translated = translate(fields[0:1] + fields[2:])
        user = fields[2]
        day = math.floor(int(sec)/86400)
        red = 0
        red += int(line in redevents)
        if user.startswith('U') and day not in weekend_days:
            outfile.write("%s %s %s %s %s %s" % (line_num, sec, day, user.replace("U", ""), 
                                                 red, " ".join(translated)))
print(len(vocab))

The final preprocessing step is to split the translated data into multiple files; one for each day. 

In [None]:
import os
os.mkdir('./word_feats')
with open('./ap_word_feats.txt', 'r') as data:
    current_day = '0'
    outfile = open('./word_feats/' + current_day + '.txt', 'w')
    for line in data:
        larray = line.strip().split(' ')
        if int(larray[2]) == int(current_day):
            outfile.write(line)
        else:
            outfile.close()
            current_day = larray[2]
            outfile = open('./word_feats/' + current_day + '.txt', 'w')
            outfile.write(line)
    outfile.close()

The word_feats directory can now be passed to the tiered or simple language model.   
The config json (data spec) for the expiriment is shown below (write as string to json file).

>{**"sentence_length"**: 12,   
    **"token_set_size"**: len(vocab),  
    **"num_days"**: 30,  
    **"test_files"**: ["0head.txt", "1head.txt", "2head.txt"],  
    **"weekend_days"**: [3, 4, 10, 11, 17, 18, 24, 25]}  