# Creating a hacking detection engine

The goal of this tutorial is to use various data science techniques coupled with machine learning in order to create models that can detect and categorize adversaries and the techniques they use when hacking into computer networks.

The first portion of this tutorial will attempt to detect "commercial" malware: malware developed by criminals with the goal of either stealing money or joining the computer to a larger botnet. Access to the computing resources of such a botnet is usually sold and used to perform large-scale distributed denial of service (DDoS) attacks, as in the case of [Mirai](https://en.wikipedia.org/wiki/Mirai_(malware)). Commercial malware does not usually attempt to evade detection, and it leaves a very large footprint on the system. I will attempt to train a model to detect the footprint left by this malware.

The second portion of this tutorial will attempt to detect and classify the activity of nation state actors. These are known as [advanced persistent threats](https://en.wikipedia.org/wiki/Advanced_persistent_threat) (APTs). These groups of highly skilled and efficient attackers usually gain access to networks using highly targeted e-mails (spear phishing). Once a victim opens the e-mail and the attached file or clicks the included link, a remote access trojan is installed on the machine, which gives the APT access to the network. Once access has been gained, these groups attempt to evade detection as much as possible by leveraging existing applications pre-installed with Windows. This is the exact type of activity I will attempt to detect.

This tutorial works entirely in a Windows environment.

# Environment Setup

Before we start, we need a __safe environment__ in which to execute our malicious code and emulate the adversaries we are trying to detect. This step is extremely important: when analyzing malicious code, we need to ensure that we do not put ourselves or anyone on our network in danger.

I highly recommend this [guide](https://blog.christophetd.fr/malware-analysis-lab-with-virtualbox-inetsim-and-burp/) for setting up a complete malware analysis virtual environment. This tutorial will only set up a single Windows virtual machine, disconnected from the internet, with a few tools for generating data we can analyze. The environment set up below is in __no way a suitable malware analysis environment__.

It is important to follow the sections below __in order__ so that the virtual machine is configured properly.

### Virtual Machine Setup

This set of instructions assumes the use of [VMware Fusion 10](https://my.vmware.com/web/vmware/info?slug=desktop_end_user_computing/vmware_fusion/10_0).

To get started, we will be using a Windows 7 guest machine. As of 2017/12/07, this is the [most used](https://en.wikipedia.org/wiki/Usage_share_of_operating_systems#Desktop_and_laptop_computers) desktop operating system in the world. An ISO image of the operating system is available [here](https://www.microsoft.com/en-us/software-download/windows7).

The following [tutorial](https://kb.vmware.com/s/article/1011677) can be used to set up the initial virtual machine. 

Once that tutorial is completed, make the following change to ensure the machine cannot communicate with the internet: go to Virtual Machine -> Settings -> Network Adapter and uncheck the "connect network adapter" setting.


<img src="static/imgs/uncheck_network_adapter.png" width=400 height=400>

Again, this is the only virtual machine configuration step we will perform in this tutorial, but it is important enough to show explicitly.
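
One way to double-check the isolation from inside the guest is to attempt an outbound TCP connection and confirm that it fails. The snippet below is only a minimal sketch: the helper name `can_reach` is hypothetical, and it assumes Python is available on the guest, which is not part of the setup described here.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks
        # and failed DNS lookups.
        return False
```

With the network adapter disconnected, a call such as `can_reach("example.com", 80)` (an arbitrary probe target) would be expected to return `False`. If it returns `True`, the VM can still reach the outside world and should not be used to run malware.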

### Windows Image Configuration

In order to log the various actions malware performs on our system, we will use [sysmon](https://docs.microsoft.com/en-us/sysinternals/downloads/sysmon). The reasoning for this is further explained in the Data Collection section.

We will be using the [ion-storm](https://github.com/ion-storm/sysmon-config) sysmon configuration file. This configuration file whitelists a considerable amount of normal activity on the machine, which reduces the amount of cleaning we will have to do in the long run. The copy used at the time of writing this tutorial is available under `etc/sysmon-config.xml` in the GitHub repository.

To set up the sysmon agent, download sysmon and the configuration file from their respective links. Copy them over to the virtual machine (VM), open an administrative command prompt (Start -> type cmd.exe -> right click, Run as administrator), and, in the folder where sysmon and the configuration file are located, run `sysmon.exe -accepteula -i sysmonconfig-export.xml`. This installs and starts the sysmon agent. You will be presented with the following messages if everything was executed correctly:

<img src="static/imgs/sysmon_installed.png" width=400 height=400>

### Snapshot

The primary reason we are using a VM to generate data (other than the ability to isolate it from the internet) is that we can take snapshots of the VM's state. After a snapshot is created, we can easily revert the machine to the state captured in the snapshot. This is important when dealing with malware, since we want to be able to revert our analysis machine to a clean state before re-executing the malware. It is even more important for data science, since the snapshot gives us a baseline for all our logs: we can extract only the logs generated after the snapshot was taken, which leaves only the events relevant to the actions or malware we are currently studying.
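
The baseline idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes the events have already been parsed into dictionaries, and the `timestamp` and `event_id` keys, like the sample records, are made up for the example.

```python
from datetime import datetime


def events_after(events, snapshot_time):
    """Keep only the events recorded strictly after the snapshot was taken."""
    return [event for event in events if event["timestamp"] > snapshot_time]


# Illustrative records; real ones would come from the parsed event logs.
snapshot_time = datetime(2017, 12, 7, 12, 0, 0)
events = [
    {"timestamp": datetime(2017, 12, 7, 11, 59, 0), "event_id": 1},  # baseline
    {"timestamp": datetime(2017, 12, 7, 12, 5, 0), "event_id": 3},   # after snapshot
]

new_events = events_after(events, snapshot_time)
print(new_events)
```

Only the second record survives the filter, since it is the only one logged after the snapshot time.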

To take a snapshot, go to Virtual Machine -> Take Snapshot.

<img src="static/imgs/take_snapshot.png" width=400 height=400>

I renamed my snapshot from the snapshot manager (Virtual Machine -> Take Snapshot) to "Sysmon installed" for clarity.

<img src="static/imgs/renamed_snapshot.png" width=200 height=200>



That's it! Our environment is set up for running some bad code and emulating some elite hackers!

# Data Collection

We need to be able to represent the actions performed on our virtual machine in some way. To accomplish this, we will collect logs from the machine every time we run some malware. We will use four different logs available on a Windows machine, known as Windows event logs:
 
1. Sysmon logs $\rightarrow$ This is the agent we installed during the environment setup. Its logs are available at:
    * `C:\Windows\System32\winevt\Logs\Microsoft-Windows-Sysmon%4Operational.evtx`
2. Windows Event Logs $\rightarrow$ We will use a combination of the security, application and system logs to identify specific events. These are available on every Windows machine at the following locations:
    * Application $\rightarrow$ `C:\Windows\System32\winevt\Logs\Application.evtx`
    * System $\rightarrow$ `C:\Windows\System32\winevt\Logs\System.evtx`
    * Security $\rightarrow$ `C:\Windows\System32\winevt\Logs\Security.evtx`
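
Since these four files need to be gathered after every run, a small helper can copy them out in one step. The following is a minimal sketch rather than the tutorial's actual tooling: the function name `collect_logs` and the destination directory are assumptions, and reading these files on a real machine typically requires administrative rights.

```python
import shutil
from pathlib import Path

# File names taken from the list above; the directory is the location given there.
LOG_DIR = Path(r"C:\Windows\System32\winevt\Logs")
LOG_FILES = [
    "Microsoft-Windows-Sysmon%4Operational.evtx",
    "Application.evtx",
    "System.evtx",
    "Security.evtx",
]


def collect_logs(dest_dir, log_dir=LOG_DIR):
    """Copy the four event logs into dest_dir and return the copied paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in LOG_FILES:
        source = Path(log_dir) / name
        if source.exists():  # skip logs that are missing on this machine
            copied.append(Path(shutil.copy2(source, dest / name)))
    return copied
```

Calling `collect_logs("samples/run-01")` (a hypothetical folder name) after each malware execution keeps the logs of every run in their own directory.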


These `.evtx` files are not standard text files; they use a binary format and require some parsing. The output below shows the file header of the sysmon event log.

```shell
| (master) => file Microsoft-Windows-Sysmon%4Operational.evtx
Microsoft-Windows-Sysmon%4Operational.evtx: MS Windows Vista Event Log, 1 chunks (no. 0 in use), empty, DIRTY

| (master) => xxd -l 32 Microsoft-Windows-Sysmon%4Operational.evtx
00000000: 456c 6646 696c 6500 0000 0000 0000 0000  ElfFile.........
00000010: 0000 0000 0000 0000 0100 0000 0000 0000  ................
```
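
The first eight bytes in the hex dump above (`456c 6646 696c 6500`, i.e. `ElfFile` followed by a null byte) can serve as a quick sanity check before parsing. The helper below is a small sketch of that check; the name `looks_like_evtx` is made up for this example.

```python
EVTX_MAGIC = b"ElfFile\x00"  # first 8 bytes, as seen in the xxd output above


def looks_like_evtx(path):
    """Return True if the file starts with the EVTX signature bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(EVTX_MAGIC)) == EVTX_MAGIC
```

This only inspects the signature, so a `True` result does not guarantee that the rest of the file is intact.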


We will be using a parsing library [python-evtx](https://github.com/williballenthin/python-evtx) to work with these files.

Let's try to convert a basic Sysmon event log into something serializable. 

```python
import Evtx.Evtx as evtx
```
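
As a rough sketch of what such a conversion could look like, the code below reads each record's XML through python-evtx (`Evtx.Evtx.Evtx`, `records()` and `xml()`) and flattens it into a dictionary that `json` can serialize. The helper names, the flattened key layout and the sample event are all assumptions made for illustration, not the tutorial's final format.

```python
import json
import xml.etree.ElementTree as ET


def iter_record_xml(path):
    """Yield the XML text of every record in an .evtx file (needs python-evtx)."""
    import Evtx.Evtx as evtx  # imported lazily so the helpers below work without it

    with evtx.Evtx(path) as log:
        for record in log.records():
            yield record.xml()


def _local(tag):
    """Strip the XML namespace from a tag name, e.g. '{ns}EventID' -> 'EventID'."""
    return tag.rsplit("}", 1)[-1]


def event_to_dict(xml_text):
    """Flatten one event's XML into a JSON-serializable dictionary."""
    root = ET.fromstring(xml_text)
    event = {}
    for section in root:
        if _local(section.tag) == "System":
            for child in section:
                name = _local(child.tag)
                if child.text and child.text.strip():
                    event[name] = child.text.strip()
                # Some fields (e.g. TimeCreated) carry their value in attributes.
                for key, value in child.attrib.items():
                    event[f"{name}.{key}"] = value
        elif _local(section.tag) == "EventData":
            for data in section:
                event[data.attrib.get("Name", _local(data.tag))] = data.text
    return event


# Made-up event used only to demonstrate the flattening step.
sample = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System><EventID>1</EventID><TimeCreated SystemTime="2017-12-07 12:05:00.000000"/>
<Computer>analysis-vm</Computer></System>
<EventData><Data Name="Image">C:\\Windows\\System32\\cmd.exe</Data></EventData>
</Event>"""

print(json.dumps(event_to_dict(sample), indent=2))
```

On a real log, the two pieces combine as `[event_to_dict(xml) for xml in iter_record_xml(path)]`. Keeping the XML flattening separate from the file reading makes it possible to test the conversion without an `.evtx` file at hand.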

## Baseline dataset

In order to generate 

The following command pulls down 7 days of data:

```bash
for i in $(seq -w 1 7);
    do wget -c https://s3-us-gov-west-1.amazonaws.com/unified-host-network-dataset/2017/wls/wls_day-0$i.bz2;
done
```
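
Once downloaded, the compressed files can be streamed without unpacking them first. The sketch below assumes that each line of a `wls_day-*.bz2` file holds one JSON object; that is an assumption about the dataset's layout and not something established above. The helper names are hypothetical.

```python
import bz2
import json
from itertools import islice


def iter_events(path):
    """Yield one decoded event per non-empty line of a bz2-compressed file."""
    # "rt" decompresses on the fly and decodes to text, so the file is
    # never fully loaded into memory.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def head(path, count=5):
    """Return the first `count` events, handy for a quick look at the schema."""
    return list(islice(iter_events(path), count))
```

For example, `head("wls_day-01.bz2")` would return the first five events of the first day, which is usually enough to see which fields are present.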