## Los Alamos National Lab Data Preparation
*Source:* [LANL dataset](https://csr.lanl.gov/data/cyber1/)  

This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network. 

Only the auth.txt file is used in our current work, as all red team activity appearing in the data corresponds exclusively to authentication events. Future work includes utilizing additional data streams (namely, proc.txt.gz). We perform a pre-processing step on the file redteam.txt.gz so that each of its log lines is expanded to match the full log line it corresponds to in the auth.txt.gz file. This is more convenient and speeds up checking whether a given log line is malicious.
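
The expansion step described above could look roughly like the following. This is a minimal sketch, not the pre-processing script actually used: it assumes red team records have the layout `time,user@domain,source computer,destination computer` and that these match fields 0, 1, 3 and 4 of an auth line. The function name `expand_red_events` and the sample records are hypothetical.

```python
def expand_red_events(red_lines, auth_lines):
    """Return the auth lines whose key fields match a red team record."""
    # Assumed red team layout: time,user@domain,source computer,destination computer
    red_keys = {tuple(r.strip().split(",")) for r in red_lines if r.strip()}
    matches = []
    for line in auth_lines:
        f = line.strip().split(",")
        # Assumed auth layout: time, source user, destination user,
        # source computer, destination computer, followed by further fields
        if len(f) >= 5 and (f[0], f[1], f[3], f[4]) in red_keys:
            matches.append(line)
    return matches


# Made-up sample records, for illustration only
red = ["150885,U620@DOM1,C17693,C1003\n"]
auth = [
    "150885,U620@DOM1,U620@DOM1,C17693,C1003,NTLM,Network,LogOn,Success\n",
    "150885,U7@DOM1,U7@DOM1,C5,C6,Kerberos,Network,LogOn,Success\n",
]
expanded = expand_red_events(red, auth)
```

Both arguments can be any iterables of lines, so open file handles work as well as lists. Writing the returned lines to a file gives a set of full log lines that can be tested for membership directly.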


### Character Level
At the character level, the dataset is parsed line by line, and the ASCII value of each character in the log line, offset by 30, is written to an output file, which can then be fed into the model. 

In Python, that translation is:

In [1]:
def translate_line(string, pad_len):
    return "0 " + " ".join([str(ord(c) - 30) for c in string]) + " 1 " + " ".join(["0"] * pad_len) + "\n"


Here **string** is the log line to be translated, and **pad_len** is the number of 0's to append so that every translated line has the same length as the longest log line in the dataset (character-wise). The start of sentence token used here is **0** and the end of sentence token is **1**. 
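
As a worked example of the encoding, the sketch below applies `translate_line` to a short made-up string and then inverts it. The helper `decode_line` is hypothetical and is included only to show that the mapping is reversible. It assumes log lines contain only printable ASCII (code 32 and above), so every character token is at least 2 and never collides with the start, end, or padding tokens.

```python
def translate_line(string, pad_len):
    return "0 " + " ".join([str(ord(c) - 30) for c in string]) + " 1 " + " ".join(["0"] * pad_len) + "\n"


def decode_line(translated):
    """Hypothetical inverse of translate_line (assumes printable ASCII input)."""
    tokens = translated.split()[1:]      # skip the leading start token 0
    chars = []
    for t in tokens:
        if t == "1":                     # end of sentence token: stop decoding
            break
        chars.append(chr(int(t) + 30))   # undo the offset of 30
    return "".join(chars)


encoded = translate_line("U1,C1", 3)
print(encoded)               # 0 55 19 14 37 19 1 0 0 0
print(decode_line(encoded))  # U1,C1
```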

The length of the longest log line can be obtained using:

In [12]:
max_len = 0
with open("auth.txt", "r") as infile:
    for line in infile:
        if len(line) > max_len:
            max_len = len(line)
print(max_len)



94


We filter out the weekends for this dataset, since very little activity occurs on those days. These days are:


In [3]:
weekend_days = [3, 4, 10, 11, 17, 18, 24, 25, 31, 32, 38, 39, 45, 46, 47, 52, 53]
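
To make the filter concrete, the sketch below maps an event timestamp to a day index and checks it against the list above. It assumes timestamps are whole seconds counted from the start of day 0. The names `SECONDS_PER_DAY`, `day_index` and `is_filtered` are hypothetical.

```python
SECONDS_PER_DAY = 86400

weekend_days = [3, 4, 10, 11, 17, 18, 24, 25, 31, 32, 38, 39, 45, 46, 47, 52, 53]


def day_index(sec):
    """Zero-based day on which an event at `sec` seconds occurred."""
    return int(sec) // SECONDS_PER_DAY


def is_filtered(sec):
    """True if the event falls on one of the days listed in weekend_days."""
    return day_index(sec) in weekend_days


print(day_index(259200), is_filtered(259200))  # 3 True
print(day_index(86399), is_filtered(86399))    # 0 False
```

If the filter is applied to many millions of lines, converting `weekend_days` to a `set` would make each membership test constant time.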

It is also convenient to keep track of which lines are in fact red events. Note that labels are not used during unsupervised training; this step is included only to simplify the evaluation process later on.

In [4]:
with open("redevents.txt", 'r') as red:
    redevents = set(red.readlines())


We then loop through auth.txt line by line, simultaneously reading in raw log lines and writing out translated ones. 

In [13]:
import math
with open("auth.txt", 'r') as infile, open("char_feats.txt", 'w') as outfile:
    outfile.write('line_number second day user red seq_len start_sentence\n') # header
    infile.readline() # skip the first line of auth.txt
    for line_num, line in enumerate(infile):
        line_minus_time = ','.join(line.strip().split(',')[1:]) # log line without the timestamp field
        diff = max_len - len(line_minus_time) # per-line padding needed to reach max_len
        raw_line = line.split(",")
        sec = raw_line[0]
        user = raw_line[1].strip().split('@')[0] # source user without the domain
        day = math.floor(int(sec)/86400) # 86400 seconds per day
        red = 0
        red += int(line in redevents) # 1 if line is red event
        if user.startswith('U') and day not in weekend_days: # keep only 'U' users on unfiltered days
            index_rep = translate_line(line_minus_time, 120) # pads with a fixed 120 zeros; diff is computed above but not used here
            outfile.write("%s %s %s %s %s %s %s" % (line_num, 
                                                    sec, 
                                                    day, 
                                                    user.replace("U", ""), 
                                                    red, 
                                                    len(line_minus_time) + 1,
                                                    index_rep))

We can now use char_feats.txt as input data for the character level models.
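
Each data row of char_feats.txt holds six space-separated metadata fields followed by the token sequence. A minimal sketch of reading one row back follows. The function name `parse_char_row` and the sample row are hypothetical, and the user field is kept as a string because it is not certain that every user id is purely numeric.

```python
def parse_char_row(row):
    """Split one data row of char_feats.txt into metadata and token ids."""
    fields = row.split()
    line_num, sec, day = (int(x) for x in fields[:3])
    user = fields[3]                      # kept as a string (assumption: may not be numeric)
    red = int(fields[4])                  # 1 if the line is a red event, else 0
    seq_len = int(fields[5])
    tokens = [int(x) for x in fields[6:]] # start token, characters, end token, padding
    return {
        "line_number": line_num,
        "second": sec,
        "day": day,
        "user": user,
        "red": red,
        "seq_len": seq_len,
        "tokens": tokens,
    }


# Made-up row encoding the string "U1,C1" with two padding zeros
sample = "0 1 0 22 0 6 0 55 19 14 37 19 1 0 0\n"
parsed = parse_char_row(sample)
```

The header line written at the top of the file has to be skipped before rows are passed to this function.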


### Word Level
For word level, our approach is to build a vocab based on each unique feature token that appears in the dataset. 