# Domain Generation Algorithm (DGA) Detection

## Authors
 - Gorkem Batmaz (NVIDIA) [gbatmaz@nvidia.com]
 - Bhargav Suryadevara (NVIDIA) [bsuryadevara@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.12.0 and CLX v0.12
* Last tested using: RAPIDS v0.12.0 and CLX v0.12 on Jan 28, 2020

## Table of Contents
* Introduction
* Data Importing
* Data Preprocessing
* Training and Evaluation
* Inference
* Conclusion

## Introduction
[Domain Generation Algorithms](https://en.wikipedia.org/wiki/Domain_generation_algorithm) (DGAs) generate domain names that malware can use to communicate with its command and control servers. IP addresses and static domain names are easy to block, whereas a DGA provides an easy way to generate a large number of domain names and rotate through them to circumvent traditional block lists. We will use a type of recurrent neural network called the [Gated Recurrent Unit](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21) (GRU) for this example. The [CLX](https://github.com/rapidsai/clx) and [RAPIDS](https://rapids.ai) libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production. This notebook walks through the data science workflow for building a DGA detection implementation.
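
To make the idea concrete, here is a minimal sketch of how a DGA might derive domain names from a shared seed and the current date. It is a toy illustration, not the algorithm of any real malware family; the function name `toy_dga`, the seed string, and the hash-to-letters mapping are all hypothetical.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, tld=".com"):
    """Generate `count` pseudo-random domain names for a given seed and day."""
    domains = []
    for i in range(count):
        # Hash the seed, the date, and a counter so that anyone who knows
        # the seed can derive the same list of names for that day.
        material = f"{seed}-{day.isoformat()}-{i}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the letters a-p to form a label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


# Hypothetical seed; the date only serves as a reproducible input.
example_domains = toy_dga("hypothetical-seed", date(2020, 1, 28))
for name in example_domains:
    print(name)
```

Because the output depends only on the seed and the date, a block list built from yesterday's names does not cover today's, which is why a learned classifier over the character patterns is attractive.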

In [1]:
import os
import wget
import time
import cudf
import torch
import shutil
import zipfile
import numpy as np
from datetime import datetime
from sklearn.metrics import accuracy_score, average_precision_score
from clx.analytics.detector_dataset import DetectorDataset
from clx.analytics.dga_detector import DGADetector
from cuml.preprocessing.model_selection import train_test_split

## Data Importing
The example lists of DGA-generated and benign domains come from the links below. Change these locations if you have a preferred alternative list.
- DGA : http://osint.bambenekconsulting.com/feeds/dga-feed.txt
- Benign : http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

In [2]:
URL_META_LIST = [
    {
        "source": "DGA",
        "url": "http://osint.bambenekconsulting.com/feeds/dga-feed.txt",
        "compression": None,
        "storage_path": "/input/dga_feed",
    },
    {
        "source": "Benign",
        "url": "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip",
        "compression": "zip",
        "storage_path": "/input/top-1m",
    },
]

In [3]:
def download_files(url_meta_list):
    for entry in url_meta_list:
        output_dir = entry['storage_path']
        if os.path.exists(output_dir):
            shutil.rmtree(output_dir)
        os.makedirs(output_dir)
        filepath = wget.download(entry['url'], out=output_dir)
        unpack(entry['compression'], filepath, output_dir)
        print('%s data is stored to location %s' %(entry['source'], output_dir))

In this scenario, one of the files is compressed. We'll define a function that can decompress it. `download_files` calls this function, so it must be defined before the download cell runs.

In [4]:
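def unpack(compression_type, filepath, output_dir):
    # Only zip archives need unpacking; other files are left as downloaded.
    if compression_type == 'zip':
        with zipfile.ZipFile(filepath, 'r') as f:
            f.extractall(output_dir)
        # Remove the archive once its contents have been extracted.
        os.remove(filepath)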
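
The extract-then-delete behavior can be checked offline with a small round trip. The sketch below uses only the standard library; the helper name `unpack_zip` and the file names are hypothetical stand-ins for the downloaded archive.

```python
import os
import tempfile
import zipfile


def unpack_zip(filepath, output_dir):
    """Extract a zip archive into output_dir, then delete the archive."""
    with zipfile.ZipFile(filepath, "r") as archive:
        archive.extractall(output_dir)
    os.remove(filepath)


with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "sample.zip")
    # Build a tiny archive holding one CSV, standing in for the real download.
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("sample.csv", "1,example.com\n")
    unpack_zip(archive_path, workdir)
    # Only the extracted CSV should remain in the directory.
    remaining_files = sorted(os.listdir(workdir))
    print(remaining_files)
```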

Now we download the example domain name lists.

In [5]:
download_files(URL_META_LIST)

DGA data is stored to location /input/dga_feed
Benign data is stored to location /input/top-1m


## Data Preprocessing
We need to preprocess the downloaded data to get it ready for downstream modeling. We can do this with a small function built on [cuDF](https://github.com/rapidsai/cudf).

In [6]:
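def load_input_data(url_meta_list):
    # DGA feed (first entry): skip the first 14 rows and label domains as type 0.
    dga_df = cudf.read_csv(url_meta_list[0]['storage_path'] + '/*', names=["domain"], skiprows=14)
    dga_df['type'] = 0
    # Benign list (second entry): drop the rank column and label domains as type 1.
    benign_df = cudf.read_csv(url_meta_list[1]['storage_path'] + '/*', names=["line_num","domain"])
    benign_df = benign_df.drop('line_num')
    benign_df['type'] = 1
    # Combine both sources into a single labeled dataframe.
    input_df = cudf.concat([benign_df, dga_df])
    return input_df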
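
The labeling logic can be illustrated without a GPU. The following sketch mirrors the steps above with the standard `csv` module; the sample text, the helper name `load_labeled_rows`, and the two-line header are hypothetical and do not reproduce the real feeds.

```python
import csv
import io

# Hypothetical stand-ins for the two downloaded files.
DGA_FEED = """# hypothetical header line 1
# hypothetical header line 2
abcdxyzqwerty.com,hypothetical description
qpwoeirutyalsk.net,hypothetical description
"""
BENIGN_CSV = """1,google.com
2,youtube.com
"""


def load_labeled_rows(dga_text, benign_text, dga_header_lines):
    """Return one dict per domain, labeled 1 for benign and 0 for DGA."""
    rows = []
    # Benign file: the first column is a rank, the second is the domain.
    for record in csv.reader(io.StringIO(benign_text)):
        rows.append({"domain": record[1], "type": 1})
    # DGA feed: skip the header lines, then take the first column.
    dga_lines = dga_text.splitlines()[dga_header_lines:]
    for record in csv.reader(dga_lines):
        rows.append({"domain": record[0], "type": 0})
    return rows


labeled_rows = load_labeled_rows(DGA_FEED, BENIGN_CSV, dga_header_lines=2)
for row in labeled_rows:
    print(row)
```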

We apply the function and load the data into a cuDF DataFrame.

In [7]:
input_df = load_input_data(URL_META_LIST)
input_df.head()

|   | domain | type |
|---|--------|------|
| 0 | google.com | 1 |
| 1 | youtube.com | 1 |
| 2 | tmall.com | 1 |
| 3 | facebook.com | 1 |
| 4 | baidu.com | 1 |


#### Create Train and Test Dataset
We utilize the [`train_test_split` function](https://docs.rapids.ai/api/cuml/0.10/api.html#model-selection-and-data-splitting) from [cuML](https://github.com/rapidsai/cuml) and create a shuffled dataset for training and testing.

In [8]:
domain_train, domain_test, type_train, type_test = train_test_split(input_df, 'type', train_size=0.7)
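
Conceptually, a shuffled 70/30 split works as in the sketch below. This is a plain-Python illustration under simple assumptions (a fixed seed and rounding at the cut point); it is not how cuML implements `train_test_split`, and the helper name `shuffled_split` is hypothetical.

```python
import random


def shuffled_split(rows, train_size=0.7, seed=0):
    """Shuffle a copy of rows and split it into train and test portions."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical (domain, type) pairs.
rows = [(f"domain{i}.com", i % 2) for i in range(10)]
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before the cut matters here because the input dataframe is a concatenation of all benign rows followed by all DGA rows.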

In [9]:
def create_df(domain_df, type_series):
    df = cudf.DataFrame()
    df['domain'] = domain_df['domain']
    df['type'] = type_series
    return df

In [10]:
test_df = create_df(domain_test, type_test)
train_df = create_df(domain_train, type_train)

Because we have only benign and DGA (malicious) categories, the number of domain types needs to be set to 2 (`N_DOMAIN_TYPE=2`). The vocabulary size (`CHAR_VOCAB`) is set to 128, covering the ASCII character set. The values below for `HIDDEN_SIZE`, `N_LAYERS`, and `LR` (learning rate) gave a good balance between network size and performance on this dataset. They may need to be tuned experimentally when working with other datasets.

In [11]:
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2
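
A vocabulary of 128 follows from treating each character as its ASCII code. The sketch below shows one way a domain could be turned into a fixed-length integer sequence; the helper name `encode_domain`, the zero padding, and the truncation rule are assumptions for illustration, since the notebook does not show how CLX encodes domains internally.

```python
def encode_domain(domain, max_len):
    """Map a domain to a list of ASCII codes, padded or truncated to max_len."""
    codes = [ord(ch) for ch in domain]
    # Codes must stay below 128 to fit a 128-entry character vocabulary.
    if any(code >= 128 for code in codes):
        raise ValueError("non-ASCII character in domain: %r" % domain)
    # Truncate long domains and right-pad short ones with zeros (assumed scheme).
    return codes[:max_len] + [0] * max(0, max_len - len(codes))


encoded = encode_domain("nvidia.com", max_len=12)
print(encoded)
# [110, 118, 105, 100, 105, 97, 46, 99, 111, 109, 0, 0]
```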

#### Instantiate DGA Detector
Now that the data is ready, the datasets are created, and we've set the parameters for the model, we can use the `DGADetector` class built into CLX to create and train the model.

In [12]:
dd = DGADetector(lr=LR)
dd.init_model(n_layers=N_LAYERS, char_vocab=CHAR_VOCAB, hidden_size=HIDDEN_SIZE, n_domain_type=N_DOMAIN_TYPE)
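
For reference, the recurrence a GRU layer applies at each character position can be written as below. This is the formulation given in the PyTorch `torch.nn.GRU` documentation; that the CLX model uses exactly this layer is an assumption, since its internals are not shown in this notebook. Here $x_t$ is the input at position $t$, $h_t$ is the hidden state, $\sigma$ is the logistic sigmoid, and $\odot$ is the element-wise product.

```latex
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\bigl(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\bigr) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
```

Under that assumption, `HIDDEN_SIZE = 100` makes each $h_t$ a 100-dimensional vector, and `N_LAYERS = 3` stacks three such layers.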

#### Create Batches
We need to partition the input dataframe into one or more smaller dataframes per the given batch size for training and testing of a model.

In [13]:
batch_size = 10000
train_dataset = DetectorDataset(train_df, batch_size)
test_dataset = DetectorDataset(test_df, batch_size)
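
The partitioning idea is sketched below on a plain list. The helper name `partition` is hypothetical, and this is not the `DetectorDataset` implementation; it only shows how a batch size divides the rows, including a smaller final batch.

```python
def partition(rows, batch_size):
    """Split rows into consecutive chunks of at most batch_size items."""
    return [rows[start:start + batch_size] for start in range(0, len(rows), batch_size)]


batches = partition(list(range(25)), batch_size=10)
print([len(batch) for batch in batches])
# [10, 10, 5]
```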

In [14]:
def create_dir(dir_path):
    print("Verify if directory `%s` is already exists." % (dir_path))
    if not os.path.exists(dir_path):
        print("Directory `%s` does not exists." % (dir_path))
        print("Creating directory `%s` to store trained models." % (dir_path))
        os.makedirs(dir_path)

In [15]:
def cleanup_cache():
    # release memory.
    torch.cuda.empty_cache()

In [16]:
def train_and_eval(dd, train_dataset, test_dataset, epoch, model_dir):
    print("Initiating model training")
    create_dir(model_dir)
    max_accuracy = 0
    prev_model_file_path = ""
    for i in range(1, epoch + 1):
        print("---------")
        print("Epoch: %s" % (i))
        print("---------")
        dd.train_model(train_dataset)
        accuracy = dd.evaluate_model(test_dataset)
        now = datetime.now()
        output_filepath = (
            model_dir
            + "/"
            + "rnn_classifier_{}.pth".format(now.strftime("%Y-%m-%d_%H_%M_%S"))
        )
        if accuracy > max_accuracy:
            dd.save_model(output_filepath)
            max_accuracy = accuracy
            if prev_model_file_path:
                os.remove(prev_model_file_path)
            prev_model_file_path = output_filepath
    print("Model with highest accuracy (%s) is stored to location %s" % (max_accuracy, prev_model_file_path))
    return prev_model_file_path
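
The checkpoint rule in `train_and_eval` keeps a model only when its accuracy strictly exceeds the best seen so far. The sketch below isolates that rule; the helper name `select_best_epoch` and the accuracy values are hypothetical.

```python
def select_best_epoch(accuracies):
    """Return (epoch, accuracy) of the first epoch reaching the highest accuracy."""
    best_accuracy = 0
    best_epoch = None
    for epoch, accuracy in enumerate(accuracies, start=1):
        # Strict comparison: ties and regressions do not replace the saved model.
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_epoch = epoch
    return best_epoch, best_accuracy


# Hypothetical per-epoch accuracies with one regression and one tie.
best_epoch, best_accuracy = select_best_epoch([0.67, 0.89, 0.88, 0.93, 0.93])
print(best_epoch, best_accuracy)
# 4 0.93
```

With this rule, only one model file exists on disk at any time, and it always corresponds to the best evaluation accuracy observed so far.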

## Training and Evaluation
Using the function we created above, we now train and evaluate the model.
*NOTE: You may see warnings when you run the training due to a [bug in PyTorch](https://github.com/pytorch/pytorch/issues/27972) which is being actively investigated.*

In [17]:
%%time
epoch = 30
model_dir='/trained_models'
model_filepath = train_and_eval(dd, train_dataset, test_dataset, epoch, model_dir)
cleanup_cache()

Initiating model training
Verify if directory `/trained_models` is already exists.
Directory `/trained_models` does not exists.
Creating directory `/trained_models` to store trained models.
---------
Epoch: 1
---------
Test set: Accuracy: 327976/492096 (0.6664878397711016)

---------
Epoch: 2
---------
Test set: Accuracy: 438766/492096 (0.8916268370399272)

---------
Epoch: 3
---------
Test set: Accuracy: 455557/492096 (0.9257482279880348)

---------
Epoch: 4
---------
Test set: Accuracy: 458492/492096 (0.9317125113798933)

---------
Epoch: 5
---------
Test set: Accuracy: 462267/492096 (0.9393837787748732)

---------
Epoch: 6
---------
Test set: Accuracy: 466771/492096 (0.9485364644297047)

---------
Epoch: 7
---------
Test set: Accuracy: 473506/492096 (0.9622228183118741)

---------
Epoch: 8
---------
Test set: Accuracy: 472628/492096 (0.9604386136038496)

---------
Epoch: 9
---------
Test set: Accuracy: 470964/492096 (0.9570571595786188)

---------
Epoch: 10
---------
Test set: Accuracy: 477719/492096 (0.9707841559370529)

---------
Epoch: 11
---------
Test set: Accuracy: 475893/492096 (0.9670734978540773)

---------
Epoch: 12
---------
Test set: Accuracy: 477986/492096 (0.9713267329951879)

---------
Epoch: 13
---------
Test set: Accuracy: 483859/492096 (0.9832613961503447)

-----

## Inference

Using the model generated above, we now score the test dataset to determine whether each domain is likely DGA-generated or benign.

In [18]:
dd = DGADetector()
dd.load_model(model_filepath)

pred_results = []
true_results = []
for partition in test_dataset.partitioned_dfs:
    pred_results.append(list(dd.predict(partition['domain']).values_host))
    true_results.append(list(partition['type'].values_host))
pred_results = np.concatenate(pred_results)
true_results = np.concatenate(true_results)
# Use a new name so the imported accuracy_score function is not overwritten.
accuracy = accuracy_score(true_results, pred_results)
print('Model accuracy: %s' % (accuracy))
cleanup_cache()

Model accuracy: 0.9921397450903888


In [19]:
average_precision = average_precision_score(true_results, pred_results)

print('Average precision score: {0:0.3f}'.format(average_precision))

Average precision score: 0.986
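
To see what sits behind these scores, accuracy, precision, and recall can be computed directly from the predicted and true labels. The sketch below is a plain-Python illustration with hypothetical labels; the helper name `binary_metrics` is not part of scikit-learn or CLX. In this notebook label 1 marks benign domains, so the default `positive=1` describes the benign class; pass `positive=0` to describe the DGA class instead.

```python
def binary_metrics(true_labels, predicted_labels, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(metrics)
```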


## Conclusion

The DGA detector in CLX enables users to train their own detection models and to use existing ones. This capability could also be used in conjunction with log parsing efforts if the logs contain domain names. DGA detection with CLX and RAPIDS keeps data in GPU memory, removing unnecessary copies and conversions and providing a 4X speed advantage over CPU-only implementations. This is especially true with large batch sizes.