# RNN DGA Classifier

This notebook contains a barebone example workflow how to work on custom containerized code that seamlessly interfaces with the Deep Learning Toolkit for Splunk.

Note: By default every time you save this notebook the cells are exported into a python module which is then invoked by Splunk MLTK commands like <code> | fit ... | apply ... | summary </code>. Please read the Model Development Guide in the Deep Learning Toolkit app for more information.


## Stage 0 - import libraries
At stage 0 we define all imports necessary to run our subsequent code depending on various libraries.

In [3]:
# this definition exposes all python module imports that should be available in all subsequent commands
import json
from io import open
import glob
import numpy as np
import pandas as pd
import os
import unicodedata
import time
import math
import string
import torch
import torch.utils.data
from torch.utils.data import Dataset
import torch.nn as nn
from sklearn.preprocessing import LabelEncoder
    
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from torch.nn.utils.rnn import pad_sequence, pad_packed_sequence, pack_padded_sequence


# from __future__ import unicode_literals, print_function, division
# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"

def seed_everything(seed, cuda=False):
    # Set the random seed manually for reproducibility.
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

In [2]:
!pip install --upgrade torch

Collecting torch
[?25l  Downloading https://files.pythonhosted.org/packages/13/70/54e9fb010fe1547bc4774716f11ececb81ae5b306c05f090f4461ee13205/torch-1.5.0-cp36-cp36m-manylinux1_x86_64.whl (752.0MB)
[K    100% |████████████████████████████████| 752.0MB 53kB/s  eta 0:00:01| 110.0MB 31.3MB/s eta 0:00:2114       | 243.3MB 21.3MB/s eta 0:00:24ta 0:00:28MB 59.8MB/s eta 0:00:06|█████████████████▊              | 416.6MB 30.8MB/s eta 0:00:11B 16.3MB/s eta 0:00:09
Collecting future (from torch)
[?25l  Downloading https://files.pythonhosted.org/packages/45/0b/38b06fd9b92dc2b68d58b75f900e97884c45bedd2ff83203d933cf5851c9/future-0.18.2.tar.gz (829kB)
[K    100% |████████████████████████████████| 829kB 7.2MB/s eta 0:00:01
[?25hBuilding wheels for collected packages: future
  Building wheel for future (setup.py) ... [?25ldone
[?25h  Stored in directory: /root/.cache/pip/wheels/8b/99/a0/81daf51dcd359a9377b110a8a886b3895921802d2fc1b2397e
Successfully built future
Installing collected packages: fu

In [5]:
!ls -la "/srv/app/model/data"

total 80
drwxr-xr-x. 2 root root    70 May 21 23:06 .
drwxr-xr-x. 6 root root  4096 May 27 22:26 ..
-rw-r--r--. 1 root root  6148 Jul 26  2019 .DS_Store
-rw-r--r--. 1 root root 33095 May 21 22:55 DGA_App.pt
-rw-r--r--. 1 root root 15004 May 21 23:15 dga.pt
-rw-r--r--. 1 root root 15004 May 27 22:34 dga.pth


In [4]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes
print("numpy version: " + np.__version__)
print("pandas version: " + pd.__version__)
print("PyTorch: " + torch.__version__)
if torch.cuda.is_available():
    print(f"There are {torch.cuda.device_count()} CUDA devices available")
    for i in range(0,torch.cuda.device_count()):
        print(f"Device {i:0}: {torch.cuda.get_device_name(i)} ")
else:
    print("No GPU found")

numpy version: 1.15.4
pandas version: 0.25.1
PyTorch: 1.0.1.post2
No GPU found


## Stage 1 - get a data sample from Splunk
In Splunk run a search to pipe a dataset into your notebook environment. Note: mode=stage is used in the | fit command to do this.

| makeresults count=10<br>
| streamstats c as i<br>
| eval s = i%3<br>
| eval feature_{s}=0<br>
| foreach feature_* [eval &lt;&lt;FIELD&gt;&gt;=random()/pow(2,31)]<br>
| fit MLTKContainer mode=stage algo=barebone epochs=10 batch_size=1 s from feature_* into app:barebone_model

After you run this search your data set sample is available as a csv inside the container to develop your model. The name is taken from the into keyword ("barebone_model" in the example above) or set to "default" if no into keyword is present. This step is intended to work with a subset of your data to create your custom model.

In [5]:
# should be the name of directory you created to save your features data
data_dir = 'data'

In [6]:
# TO:DO Figure out splunk query
# this cell is not executed from MLTK and should only be used for staging data into the notebook environment
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param

In [7]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes
df, param = stage("dga")

print(df)

                                  domain  label
0       dustindicatefaultpressgarden.com      1
1          parkschedulebuildpencatch.com      1
2         newsattemptexperiencedtake.com      1
3           wingconditionpasslecture.com      1
4      fooddishwitnessboxtankoperate.com      1
...                                  ...    ...
19995                      wikipedia.org      0
19996                          baidu.com      0
19997                       facebook.com      0
19998                        youtube.com      0
19999                         google.com      0

[20000 rows x 2 columns]


## Stage 2 - create and initialize a model

In [29]:
class DomainDataset(Dataset):
    def __init__(self, df, train=True):
        """
        Args:
            data_dir (string): directory name
            csv_filename (string): csv filename
        """
        
        self.data_df = df
        
        self.all_chars =  self.__build__chars__()
        self.inputs = self.data_df.iloc[:, 0]
                                   
        self.train = train
                                   
        if self.train:
            self.labels = self.data_df.iloc[:, -1]
        
        self.data_len = len(self.data_df.index)

    def __build__chars__(self):
        """Build dictionary of chars."""
        all_letters = string.ascii_letters + string.digits + " .'-"
        return {all_letters[i]:i+1 for i in range(0, len(all_letters))}
    
    def char_to_ix(self, char):
        """Character to index lookup."""
        return self.all_chars[char]

    def ix_to_char(self, char):
        """Index to character lookup."""
        for i, val in self.all_chars.items():
            if val == char:
                return i

    def domain_to_ix(self, domain):
        """Domain to sequence of indexes."""
        return torch.LongTensor([self.char_to_ix(i) for i in domain])

    def __getitem__(self, index):
        domain = self.domain_to_ix(self.inputs[index])
        if self.train:
            target = torch.Tensor([self.labels[index]])
            return domain, target
        else:
            return domain

    def __len__(self):
        return self.data_len
    
def pad_collate(batch):
    (xx, yy) = zip(*batch)
    x_lens = [len(x) for x in xx]
    y_lens = [len(y) for y in yy]

    xx_pad = pad_sequence(xx, batch_first=True, padding_value=0)

    return xx_pad, yy, x_lens, y_lens

def _get_train_test_data_loader(batch_size, df):
    print("Getting test and train data loaders.")
    
    dataset =  DomainDataset(df, train=True)
    
    train_size = int(0.8 * len(dataset))
    
    test_size = len(dataset) - train_size
    
    train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
    
    train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=pad_collate)
    
    test_dl = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, collate_fn=pad_collate)

    return train_dl, test_dl

def pad_collate_pred(batch):

    x_lens = [len(x) for x in batch]

    xx_pad = pad_sequence(batch, batch_first=True, padding_value=0)

    return xx_pad, x_lens

def _get_predict_loader(batch_size, df):
    print("Getting test and train data loaders.")
    
    dataset =  DomainDataset(df, train=False)
    
    predict_dl = DataLoader(dataset, batch_size=batch_size, shuffle=False, collate_fn=pad_collate_pred)
    
    return predict_dl


class DGAClassifier(nn.Module):
    """
    RNN Estimator for generating sequences of target variables.
    """
    
    def __init__(self, input_features=65, hidden_dim=12, n_layers=2, output_dim=1, embedding_dim=5, batch_size=10):
        super(DGAClassifier, self).__init__()

        # Variables
        self.hidden_dim = hidden_dim
        self.hidden_layers = n_layers
        self.batch_size = batch_size
        self.embedding_dim = embedding_dim
        
        # Embedding
        self.embedding = nn.Embedding(input_features, embedding_dim)
        
        # RNN Layer
        self.rnn = nn.RNN(embedding_dim, hidden_dim, n_layers, dropout=0.3, batch_first=True)
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, x_lens):
        """
        Perform a forward pass of our model on batch of tracks.
        """
        
        # x: (batch_size, longest_sequence, embedding) i.e. 10, 32, 5
        # hidden size: (hidden_layers, batch_size, hidden_dim) i.e. 2, 10, 30
        batch_size, seq_len = x.size()

        # x_embed: (batch_size, longest_sequence, 1?, embedding_size)
        embed_x = self.embedding(x)
        
        x_packed = pack_padded_sequence(embed_x, x_lens, batch_first=True, enforce_sorted=False)

        # Passing in the input and hidden state into the model and obtaining outputs
        output_packed, hidden_state = self.rnn(x_packed)
        
        output_padded, lengths = pad_packed_sequence(output_packed, batch_first=False)
        
        output = output_padded.view(batch_size*seq_len, self.hidden_dim)
        
        adjusted_lengths = [(l-1)*batch_size + i for i, l in enumerate(lengths)]
        
        lengthTensor = torch.tensor(adjusted_lengths, dtype=torch.int64)
        
        output = output.index_select(0, lengthTensor)
        
        output = output.view(batch_size, self.hidden_dim)
        
        output = self.sigmoid(self.fc(output))
        
        return output

def init(df, param):

    mapping = {0: 'benign', 1: 'dga'}
    
    model = {
        "input_size": 67,
        "hidden_dim": 30,
        "embedding_dim": 5,
        "num_classes": 1,
        "n_layers": 2,
        "learning_rate": 0.001,
        "mapping": mapping,
        "num_epochs": 2,
        "batch_size": 32,
    }
    
    model["train_dl"], model["test_dl"] = _get_train_test_data_loader(int(model['batch_size']), df)

    print("FIT build RNN model with input size " + str(model["train_dl"].dataset.__len__()))

    # Initialize DGA Classifier
    model['model'] = DGAClassifier(
        input_features=model["input_size"], 
        hidden_dim=model["hidden_dim"], 
        n_layers=model["n_layers"], 
        output_dim=model["num_classes"],
        embedding_dim=model["embedding_dim"], 
        batch_size=model["batch_size"]
    )
    
    # Define loss and optimizer
    model['criterion'] = torch.nn.BCELoss()
    
    model['optimizer'] = torch.optim.Adam(model['model'].parameters())
    
    return model

In [30]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
model = init(df, param)
print(model)

Getting test and train data loaders.
FIT build RNN model with input size 16000
{'input_size': 67, 'hidden_dim': 30, 'embedding_dim': 5, 'num_classes': 1, 'n_layers': 2, 'learning_rate': 0.001, 'mapping': {0: 'benign', 1: 'dga'}, 'num_epochs': 2, 'batch_size': 32, 'train_dl': <torch.utils.data.dataloader.DataLoader object at 0x7f15b5bfbba8>, 'test_dl': <torch.utils.data.dataloader.DataLoader object at 0x7f15b5bfbc18>, 'model': DGAClassifier(
  (embedding): Embedding(67, 5)
  (rnn): RNN(5, 30, num_layers=2, batch_first=True, dropout=0.3)
  (fc): Linear(in_features=30, out_features=1, bias=True)
  (sigmoid): Sigmoid()
), 'criterion': BCELoss(), 'optimizer': Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)}


## Stage 3 - fit the model

In [8]:
# train your model
# returns a fit info json object and may modify the model object
def update_stats(accuracy, confusion_matrix, output, y):
    output = torch.round(output).flatten()
    equal = torch.eq(output, y)
    correct = int(torch.sum(equal))
    for j, i in zip(output, y):
        confusion_matrix[int(i),int(j)]+=1

    return accuracy + correct, confusion_matrix

def fit(model, df, param):
    
    returns = {"message": "model trained"}

    cuda = torch.cuda.is_available()
    device = torch.device("cpu") if not cuda else torch.device("cuda:0")
    seed_everything(seed=1337, cuda=cuda)
    
    accuracy, confusion_matrix = 0, np.zeros((2, 2), dtype=int)
    
    for epoch in range(1, model['num_epochs'] + 1):

        # Train
        model['model'].train()
        print("Train ({})".format(epoch))
        print("-"*20)
        t = time.time()

        accuracy, confusion_matrix = 0, np.zeros((2, 2), dtype=int)
        
        total_loss = 0
        
        # Iterate over dataset
        for batch_num, (x_padded, y_padded, x_lens, y_lens) in enumerate(model["train_dl"]):
            
            # Clear stored gradient
            model["optimizer"].zero_grad()

            # Forward pass
            output =  model['model'](x_padded, x_lens)
            loss = model['criterion'](output, torch.Tensor(y_padded).unsqueeze(1))

            total_loss += float(loss)
            
            accuracy, confusion_matrix = update_stats(
                accuracy, 
                confusion_matrix, 
                torch.Tensor(output), 
                torch.Tensor(y_padded)
            )
            
            # Backward pass
            loss.backward()

            # Update parameters
            model["optimizer"].step()
            if batch_num % 100 == 0:
                print("[Batch]: {}/{} in {:.5f} seconds".format(batch_num, len(model["train_dl"]), time.time() - t), end='\r', flush=True)
            t = time.time()
            
        print("")
        print("[Loss]: {:.5f}".format(total_loss / len(model["train_dl"])))
        print("[Accuracy]: {}/{} : {:.3f}%".format(
            accuracy, len(model["train_dl"].dataset), accuracy / len(model["train_dl"].dataset) * 100))
        print(confusion_matrix, "\n")
        
        # Evaluate
        model['model'].eval()
        accuracy, confusion_matrix = 0, np.zeros((2, 2), dtype=int)
        t = time.time()
        total_loss = 0
        print("Validation ({})".format(epoch))
        print("-"*20)
        with torch.no_grad():
            for batch_num, (x_padded, y_padded, x_lens, y_lens) in enumerate(model["test_dl"]):
                output =  model['model'](x_padded, x_lens)
                total_loss += float(model['criterion'](output, torch.Tensor(y_padded).unsqueeze(1)))
                accuracy, confusion_matrix = update_stats(
                    accuracy, 
                    confusion_matrix, 
                    torch.Tensor(output), 
                    torch.Tensor(y_padded)
                )
                if batch_num % 50 == 0:
                    print("[Batch]: {}/{} in {:.5f} seconds".format(batch_num, len(model["test_dl"]), time.time() - t), end='\r', flush=True)
                t = time.time()
                
        print("")
        print("[Loss]: {:.5f}".format(total_loss / len(model["test_dl"])))
        print("[Accuracy]: {}/{} : {:.3f}%".format(
            accuracy, len(model["test_dl"].dataset), accuracy / len(model["test_dl"].dataset) * 100))
        print(confusion_matrix, "\n")
        
    # memorize parameters
    returns['model_epochs'] = model['num_epochs']
    returns['model_batch_size'] = model['batch_size']
    returns['model_loss_acc'] = total_loss / len(model["train_dl"])
            
    return returns

In [77]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print(fit(model,df,param))

Train (1)
--------------------
[Batch]: 400/500 in 0.02491 seconds
[Loss]: 0.01541
[Accuracy]: 15940/16000 : 99.625%
[[7938   48]
 [  12 8002]] 

Validation (1)
--------------------
[Batch]: 100/125 in 0.00725 seconds
[Loss]: 0.01332
[Accuracy]: 3989/4000 : 99.725%
[[2003   11]
 [   0 1986]] 

Train (2)
--------------------
[Batch]: 400/500 in 0.02603 seconds
[Loss]: 0.01240
[Accuracy]: 15952/16000 : 99.700%
[[7945   41]
 [   7 8007]] 

Validation (2)
--------------------
[Batch]: 100/125 in 0.01019 seconds
[Loss]: 0.01338
[Accuracy]: 3988/4000 : 99.700%
[[2002   12]
 [   0 1986]] 

{'message': 'model trained', 'model_epochs': 2, 'model_batch_size': 32, 'model_loss_acc': 0.0033458294686861336}


In [74]:
def apply(model, df, param):
    predict_dl = _get_predict_loader(32, df)    
    classes = mapping = {0: 'benign', 1: 'dga'}
    model['model'].eval()
    predictions = []
    with torch.no_grad():
        for batch_num, (x_padded,  x_lens) in enumerate(predict_dl):
            output =  model['model'](x_padded, x_lens)
            y_hat = torch.round(output.data)
            predictions += [classes[int(key)] for key in y_hat.flatten().numpy()]
           
    return predictions

In [75]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print(apply(test,df,param)[:10])

Getting test and train data loaders.
['dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga']


## Stage 5 - save the model

In [78]:
# save model to name in expected convention "<algo_name>_<model_name>.h5"
def save(model, name):
    torch.save(model['model'].state_dict(), MODEL_DIRECTORY + "dga" + ".pth")
    return model

In [79]:
!ls -la /srv/app/model/data/

total 80
drwxr-xr-x. 2 root root    70 May 21 23:06 .
drwxr-xr-x. 6 root root  4096 May 27 22:26 ..
-rw-r--r--. 1 root root  6148 Jul 26  2019 .DS_Store
-rw-r--r--. 1 root root 33095 May 21 22:55 DGA_App.pt
-rw-r--r--. 1 root root 15004 May 21 23:15 dga.pt
-rw-r--r--. 1 root root 15004 May 27 22:31 dga.pth


In [80]:
save(model, "dga")

{'input_size': 67,
 'hidden_dim': 30,
 'embedding_dim': 5,
 'num_classes': 1,
 'n_layers': 2,
 'learning_rate': 0.001,
 'mapping': {0: 'benign', 1: 'dga'},
 'num_epochs': 2,
 'batch_size': 32,
 'train_dl': <torch.utils.data.dataloader.DataLoader at 0x7f15b5bfbba8>,
 'test_dl': <torch.utils.data.dataloader.DataLoader at 0x7f15b5bfbc18>,
 'model': DGAClassifier(
   (embedding): Embedding(67, 5)
   (rnn): RNN(5, 30, num_layers=2, batch_first=True, dropout=0.3)
   (fc): Linear(in_features=30, out_features=1, bias=True)
   (sigmoid): Sigmoid()
 ),
 'criterion': BCELoss(),
 'optimizer': Adam (
 Parameter Group 0
     amsgrad: False
     betas: (0.9, 0.999)
     eps: 1e-08
     lr: 0.001
     weight_decay: 0
 )}

## Stage 6 - load the model

In [81]:
# load model from name in expected convention "<algo_name>_<model_name>.h5"
def load(name):
    model = {}
    dga_model = DGAClassifier(67, 30, 2, 1, 5, 32)
    dga_model.load_state_dict(torch.load(MODEL_DIRECTORY + "dga" + ".pth"))
    model['model'] = dga_model
    return model 

In [82]:
test = load('dga')

In [83]:
print(apply(test,df,param)[:10])

Getting test and train data loaders.
['dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga', 'dga']


## Stage 7 - provide a summary of the model

In [22]:
# return model summary
def summary(model=None):
    returns = {"version": {"pytorch": torch.__version__} }
    if model is not None:
        if 'model' in model:
            returns["summary"] = str(model)
    return returns

In [24]:
summary(model)

{'version': {'pytorch': '1.5.0'},
 'summary': "{'input_size': 67, 'hidden_dim': 30, 'embedding_dim': 5, 'num_classes': 1, 'n_layers': 2, 'learning_rate': 0.001, 'mapping': {0: 'benign', 1: 'dga'}, 'num_epochs': 2, 'batch_size': 32, 'train_dl': <torch.utils.data.dataloader.DataLoader object at 0x7fb7e6014a20>, 'test_dl': <torch.utils.data.dataloader.DataLoader object at 0x7fb7e6014e48>, 'model': DGAClassifier(\n  (embedding): Embedding(67, 5)\n  (rnn): RNN(5, 30, num_layers=2, batch_first=True, dropout=0.3)\n  (fc): Linear(in_features=30, out_features=1, bias=True)\n  (sigmoid): Sigmoid()\n), 'criterion': BCELoss(), 'optimizer': Adam (\nParameter Group 0\n    amsgrad: False\n    betas: (0.9, 0.999)\n    eps: 1e-08\n    lr: 0.001\n    weight_decay: 0\n)}"}

After implementing your fit, apply, save and load you can train your model:<br>
| makeresults count=10<br>
| streamstats c as i<br>
| eval s = i%3<br>
| eval feature_{s}=0<br>
| foreach feature_* [eval &lt;&lt;FIELD&gt;&gt;=random()/pow(2,31)]<br>
| fit MLTKContainer algo=barebone s from feature_* into app:barebone_model<br>

Or apply your model:<br>
| makeresults count=10<br>
| streamstats c as i<br>
| eval s = i%3<br>
| eval feature_{s}=0<br>
| foreach feature_* [eval &lt;&lt;FIELD&gt;&gt;=random()/pow(2,31)]<br>
| apply barebone_model as the_meaning_of_life

## End of Stages
All subsequent cells are not tagged and can be used for further freeform code