# Creating a hacking detection engine

The goal of this tutorial is to combine data science techniques with machine learning to create models that can detect and categorize adversaries and the techniques they use when hacking into computer networks.

The first portion of this tutorial will attempt to detect "commercial" malware. This is malware developed by criminals with the goal of either stealing money or joining the computer to a larger botnet. Access to the computing resources of such a botnet is usually sold and used to perform large-scale distributed denial of service (DDoS) attacks, as in the case of [mirai](https://en.wikipedia.org/wiki/Mirai_(malware)). Commercial malware does not usually attempt to evade detection, and it leaves a very large footprint on the system. I will attempt to train a model to detect the footprint left by this malware.

The second portion of this tutorial will attempt to detect and classify the activity of nation-state actors. These are known as [advanced persistent threats](https://en.wikipedia.org/wiki/Advanced_persistent_threat) (APTs). These groups of highly skilled and efficient attackers usually gain access to networks using highly targeted e-mails (spear phishing). Once a victim opens the e-mail and the attached file, or clicks the included link, a remote access trojan is installed on the machine, which gives the APT access to the network. Once access has been gained, these groups attempt to evade detection as much as possible by leveraging existing applications that come pre-installed with Windows. This is the exact type of activity I will attempt to detect.

# Environment Setup

Before we start, we need a __safe environment__ in which to execute our malicious code and emulate the adversaries we are trying to detect. This step is extremely important: when analyzing malicious code, we need to ensure that we do not put ourselves or anyone on our network in danger.

I highly recommend this [guide](https://blog.christophetd.fr/malware-analysis-lab-with-virtualbox-inetsim-and-burp/) for setting up a complete malware analysis virtual environment. This tutorial will only set up a single Windows virtual machine, disconnected from the outside internet, with a few tools for generating data that we can analyze. The environment setup below is in __no way a suitable malware analysis environment__.

### Virtual Machine Setup

These instructions assume the use of [VMWare Fusion 10](https://my.vmware.com/web/vmware/info?slug=desktop_end_user_computing/vmware_fusion/10_0).

To get started, we will use a Windows 7 guest machine. As of 2017/12/07, this is the [most used](https://en.wikipedia.org/wiki/Usage_share_of_operating_systems#Desktop_and_laptop_computers) desktop operating system in the world. An ISO image of the operating system is available from [here](https://www.microsoft.com/en-us/software-download/windows7).

The following [tutorial](https://kb.vmware.com/s/article/1011677) can be used to set up the initial virtual machine. 

Once that tutorial is completed, one change should be made to ensure the machine cannot communicate with the internet: go to Virtual Machine -> Settings -> Network Adapter and uncheck the "connect network adapter" setting.


<img src="static/imgs/uncheck_network_adapter.png" width=400 height=400>

Again, this is the only virtual machine configuration step we will do in this tutorial, but it is important enough to show explicitly.
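The checkbox state can also be double-checked from the host by inspecting the virtual machine's `.vmx` file. The sketch below is only an illustration and rests on assumptions: that the GUI setting is stored as `ethernetN.startConnected = "FALSE"` for each adapter marked `ethernetN.present = "TRUE"`, and that you know where your `.vmx` file lives. The function name `nics_start_disconnected` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every virtual NIC declared as present
# in the given .vmx file is explicitly configured to start disconnected.
# Assumption: the adapter state is stored in ethernetN.startConnected.
nics_start_disconnected() {
    local vmx="$1" nic
    # Collect the names (ethernet0, ethernet1, ...) of all present adapters.
    for nic in $(grep -Eio '^ethernet[0-9]+\.present *= *"TRUE"' "$vmx" | cut -d. -f1); do
        # Fail as soon as one adapter lacks an explicit "FALSE" setting.
        grep -Eiq "^${nic}\.startConnected *= *\"FALSE\"" "$vmx" || return 1
    done
    return 0
}
```

For example, `nics_start_disconnected "/path/to/your.vmx" && echo "all adapters start disconnected"` prints the message only when no present adapter is set to connect at power-on. A check like this complements the GUI step and does not replace it; it is still worth confirming inside the guest that no network is reachable.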

### Windows Image Configuration

In order to log the various actions malware performs on our system, we will use [sysmon](https://docs.microsoft.com/en-us/sysinternals/downloads/sysmon). The reasoning for this is further explained in the Data Collection section.

We will be using the [ion-storm](https://github.com/ion-storm/sysmon-config) sysmon configuration file. This configuration file whitelists a considerable amount of normal activity on the machine, which will reduce the amount of cleaning we have to do in the long run. The copy used at the time of writing this tutorial is available under `etc/sysmon-config.xml` in the GitHub repository.

To install


# Data Collection

We need to be able to represent the actions performed by computers in some way. The first step is to gather all possible logs from a computer.


## Baseline dataset

In order to generate 

The following command pulls down 7 days of data:

```bash
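# Download days 01 through 07 of the wls logs.
# -c lets wget resume a partially downloaded file instead of starting over.
for i in $(seq 1 7); do
    wget -c "https://s3-us-gov-west-1.amazonaws.com/unified-host-network-dataset/2017/wls/wls_day-0$i.bz2"
done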
```
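Once the archives are downloaded, it can help to look at a few records before writing any parsing code. The following is a minimal sketch that assumes each line of a decompressed file is one event record and that `bzip2` is installed. The helper names `peek_bz2` and `count_records` are hypothetical and not part of the dataset or of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first N records (default 3) of a
# bzip2-compressed log file without writing the decompressed data to disk.
peek_bz2() {
    local file="$1" count="${2:-3}"
    bzip2 -dc "$file" | head -n "$count"
}

# Hypothetical helper: count the records in a compressed file.
# Assumption: one record per line. This streams the whole archive,
# so it can take a while on large files.
count_records() {
    local file="$1"
    bzip2 -dc "$file" | wc -l | tr -d ' '
}
```

For example, `peek_bz2 wls_day-01.bz2 5` shows the first five records of the first day, and `count_records wls_day-01.bz2` reports how many records that day contains. Streaming through `bzip2 -dc` avoids keeping an uncompressed copy of every day on disk.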