<img src="../img/logo_white_bkg_small.png" align="right" />


# Worksheet 3:  Detecting Domain Generation Algorithm (DGA) Domains against DNS ANSWERS
This worksheet covers concepts from the second half of Module 6 - Hunting with Data Science. It should take 20-30 minutes to complete. Please raise your hand if you get stuck.

Your objective is to reduce a dataset of thousands of domain names to the few that were created by a domain generation algorithm (DGA).

## Import the Libraries
For this exercise, we will be using:
* Pandas (http://pandas.pydata.org/pandas-docs/stable/)
* Flare (https://github.com/austin-taylor/flare)
* JSON (https://docs.python.org/3/library/json.html)
* WHOIS (https://pypi.python.org/pypi/whois)

Beacon writeup: <a href="http://www.austintaylor.io/detect/beaconing/intrusion/detection/system/command/control/flare/elastic/stack/2017/06/10/detect-beaconing-with-flare-elasticsearch-and-intrusion-detection-systems/"> Detect Beaconing with Flare, Elastic Stack, and Intrusion Detection Systems</a>

In [None]:
from flare.data_science.features import entropy
from flare.data_science.features import dga_classifier
from flare.data_science.features import domain_tld_extract
from flare.tools.alexa import Alexa
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
from whois import whois
import pandas as pd
import json
import warnings

warnings.filterwarnings('ignore')

## This is an example of a domain generation algorithm.

In [27]:
def generate_domain(year, month, day):
    """Generates a domain name for the given date."""
    domain = ""

    for i in range(16):
        year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
        month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
        day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
        domain += chr(((year ^ month ^ day) % 25) + 97)

    return domain + '.com'

In [28]:
generate_domain(2017, 6, 23)

'vtlfccmfxlkgifuf.com'
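Because the generator is deterministic, anyone who knows the algorithm can precompute the domains for a range of dates. The sketch below repeats `generate_domain` so that it runs on its own; `domains_for_range` is a hypothetical helper added for illustration and is not part of the worksheet.

```python
from datetime import date, timedelta


def generate_domain(year, month, day):
    """Same generator as above, repeated so this cell runs on its own."""
    domain = ""
    for _ in range(16):
        year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
        month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
        day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
        domain += chr(((year ^ month ^ day) % 25) + 97)
    return domain + '.com'


def domains_for_range(start, days):
    """Hypothetical helper: one candidate domain per day, starting at `start`."""
    return [
        generate_domain(d.year, d.month, d.day)
        for d in (start + timedelta(days=offset) for offset in range(days))
    ]


candidates = domains_for_range(date(2017, 6, 23), 7)
for name in candidates:
    print(name)
```

Each name is 16 lowercase letters followed by `.com`. Since the character index is taken modulo 25, the letter `z` never appears.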

### A large portion of data science is data preparation. In this exercise, we'll take output from Suricata's eve.json file and extract the DNS records so we can find anything using DGA. 

First you'll need to **unzip the large_eve_json.zip file** in the data directory and specify the path.

In [29]:
eve_json = '../data/large_eve.json'

### Next read the data in and build a list

In [30]:
with open(eve_json) as f:  # close the file once every line has been parsed
    all_suricata_data = [json.loads(record) for record in f]

In [31]:
len(all_suricata_data)

746909

### Our output from Suricata has 746909 records, but we are only interested in DNS records. Let's narrow our data down to records that contain dns.

### Read in the Suricata data again and parse a line as JSON only if the string `dns` appears anywhere in it. This gives pandas' json_normalize feature far less data to process.

In [32]:
with open(eve_json) as f:  # 'dns' in record is a substring test on the raw line
    dns_records = [json.loads(record) for record in f if 'dns' in record]

In [33]:
len(dns_records)

21484

### Down to 21484 -- much better.

### Somewhere in our _21484_ records is communication from infected computers. It's up to you to narrow the results down and find the malicious DNS request. 

In [34]:
dns_records[2]

{'dest_ip': '192.168.207.4',
 'dest_port': 53,
 'dns': {'id': 54724,
  'rrname': 'hpca-tier2.office.aol.com.ad.aol.aoltw.net',
  'rrtype': 'A',
  'tx_id': 0,
  'type': 'query'},
 'event_type': 'dns',
 'flow_id': 578544790391795,
 'pcap_cnt': 54519,
 'proto': 'UDP',
 'src_ip': '192.168.205.170',
 'src_port': 31393,
 'timestamp': '2017-07-22T17:33:27.379891-0500',
 'vlan': 150}
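The substring test used above keeps any line that mentions `dns` anywhere, not only DNS events. The record shown here carries an `event_type` field, so a stricter filter can check that key after parsing. The following is a minimal sketch on hypothetical sample lines; `iter_dns_events` is an illustrative name, and whether the stricter filter is preferable depends on which records you want to keep.

```python
import json

# Hypothetical eve.json lines; the real file holds one JSON object per line.
sample_lines = [
    '{"event_type": "dns", "dns": {"type": "query", "rrname": "example.com", "rrtype": "A"}}',
    '{"event_type": "http", "http": {"hostname": "dns.example.com"}}',
    '{"event_type": "flow", "app_proto": "dns"}',
]


def iter_dns_events(lines):
    """Yield parsed records whose event_type is exactly 'dns'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get('event_type') == 'dns':
            yield record


# Substring test on the raw text versus a check on the parsed key.
substring_matches = [json.loads(line) for line in sample_lines if 'dns' in line]
key_matches = list(iter_dns_events(sample_lines))

print(len(substring_matches), len(key_matches))
```

With a real file, the same generator can consume an open file handle, for example `with open(eve_json) as f: dns_records = list(iter_dns_events(f))`.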

### The data is nested JSON and the records do not all have the same fields, so you will need to use the json_normalize feature to flatten it into a table

In [35]:
suricata_df = json_normalize(dns_records)

In [36]:
suricata_df.shape

(21484, 163)
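To see where column names such as `dns.rrname` come from, the flattening step can be approximated in plain Python. This is only a rough sketch of the nested-dict part of what `json_normalize` does; the real function also handles lists of records and fills fields that a record lacks with NaN. The `flatten` helper is hypothetical.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# A trimmed copy of the record printed earlier in the worksheet.
record = {
    'dest_ip': '192.168.207.4',
    'dest_port': 53,
    'dns': {'id': 54724,
            'rrname': 'hpca-tier2.office.aol.com.ad.aol.aoltw.net',
            'rrtype': 'A',
            'tx_id': 0,
            'type': 'query'},
    'event_type': 'dns',
}

flat = flatten(record)
print(sorted(flat))
```

Records with different fields produce different keys, which is why the combined table ends up with many mostly empty columns.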

In [37]:
suricata_df.head(2)

| | app_proto | dest_ip | dest_port | dns.id | dns.rcode | dns.rdata | dns.rrname | dns.rrtype | dns.ttl | dns.tx_id | ... | tcp.psh | tcp.rst | tcp.state | tcp.syn | tcp.tcp_flags | tcp.tcp_flags_tc | tcp.tcp_flags_ts | timestamp | tx_id | vlan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 2001:0500:0001:0000:0000:0000:803f:0235 | 53.0 | 15529.0 | | | api.wunderground.com | A | | 0.0 | ... | | | | | | | | 2017-07-22T17:33:16.661646-0500 | | 110.0 |
| 1 | | 2001:0500:0003:0000:0000:0000:0000:0042 | 53.0 | 58278.0 | | | stork79.dropbox.com | A | | 0.0 | ... | | | | | | | | 2017-07-22T17:33:24.990320-0500 | | 110.0 |


### Next we need to keep only the A records

In [38]:
a_records = suricata_df[suricata_df['dns.rrtype'] == 'A'][['dns.rrname', 'dns.rrtype']]

In [39]:
a_records.shape

(2849, 2)

### By keeping only the A records, our dataset is down to 2849.

In [40]:
a_records['dns.rrname'].value_counts().head()

stork79.dropbox.com                312
versioncheck.addons.mozilla.org    120
safebrowsing.clients.google.com     95
centos.mirror.facebook.net          84
mirror.team-cymru.org               84
Name: dns.rrname, dtype: int64
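The frequency count above can be mirrored with the standard library, which is handy when pandas is not available. This is a minimal sketch on hypothetical names standing in for the `dns.rrname` column.

```python
from collections import Counter

# Hypothetical query names standing in for the 'dns.rrname' column.
names = (
    ['stork79.dropbox.com'] * 3
    + ['mirror.team-cymru.org'] * 2
    + ['api.wunderground.com']
)

counts = Counter(names)

# Most frequently queried names, similar to value_counts().head().
print(counts.most_common(2))

# Number of distinct names, similar to len(series.unique()).
print(len(counts))
```

Names that are queried only once or twice are often the more interesting ones when hunting, so sorting the counts in ascending order can be just as useful.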

### Next we can figure out how many unique DNS names there are.

In [41]:
a_records_unique = pd.DataFrame(a_records['dns.rrname'].unique(), columns=['dns_rrname'])

In [42]:
len(a_records_unique)

177

### Down to 177 records to process!

In [43]:
a_records_unique.head()

| | dns_rrname |
|---|---|
| 0 | api.wunderground.com |
| 1 | stork79.dropbox.com |
| 2 | hpca-tier2.office.aol.com.ad.aol.aoltw.net |
| 3 | safebrowsing.clients.google.com.home |
| 4 | fxfeeds.mozilla.com |


### Next we need to extract the domain and top-level domain (removing subdomains) using flare so we can feed the result to our classifier

In [44]:
a_records_unique['domain_tld'] = a_records_unique.dns_rrname.apply(domain_tld_extract)

In [45]:
a_records_unique.head()

| | dns_rrname | domain_tld |
|---|---|---|
| 0 | api.wunderground.com | wunderground.com |
| 1 | stork79.dropbox.com | dropbox.com |
| 2 | hpca-tier2.office.aol.com.ad.aol.aoltw.net | aoltw.net |
| 3 | safebrowsing.clients.google.com.home | com.home |
| 4 | fxfeeds.mozilla.com | mozilla.com |
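How `domain_tld_extract` works internally is not shown in this worksheet. As a rough illustration of the idea, the sketch below keeps the last two dot-separated labels of a hostname. The function name `naive_registered_domain` is hypothetical, and the approach is deliberately naive: it mishandles multi-label suffixes such as `co.uk`, which is why real tools usually consult the Public Suffix List.

```python
def naive_registered_domain(hostname):
    """Return the last two dot-separated labels of a hostname."""
    labels = hostname.strip('.').split('.')
    return '.'.join(labels[-2:])


hostnames = [
    'api.wunderground.com',
    'stork79.dropbox.com',
    'hpca-tier2.office.aol.com.ad.aol.aoltw.net',
    'safebrowsing.clients.google.com.home',
    'fxfeeds.mozilla.com',
]

for name in hostnames:
    print(name, '->', naive_registered_domain(name))

# Known weakness of the naive approach: the suffix itself is returned.
print(naive_registered_domain('www.bbc.co.uk'))
```

Note that a name with a local search suffix appended, such as `safebrowsing.clients.google.com.home`, reduces to `com.home` rather than `google.com`, so such rows deserve a second look before classification.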


### Train DGA Classifier with dictionary words, n-grams and DGA Domains

In [46]:
dga_predictor = dga_classifier()

[*] Initializing... training classifier - Please wait.
[+] Classifier Ready
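The features flare's classifier computes are not shown in this worksheet. As a hedged illustration of one signal commonly used for DGA detection, the sketch below computes the Shannon entropy of a domain label: randomly generated names tend to spread their characters more evenly than dictionary-based names. `shannon_entropy` is a hypothetical helper, not flare's `entropy` function.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for label in ['google', 'msftncsi', 'vtlfccmfxlkgifuf']:
    print(label, round(shannon_entropy(label), 3))
```

Entropy alone is a weak classifier. Short consonant-heavy but legitimate labels can also score fairly high, which is one plausible reason a legitimate domain might be flagged.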


In [47]:
a_records_unique['dga_predict'] = a_records_unique.domain_tld.apply(lambda x: dga_predictor.predict(x))

### A quick sampling of the data shows our prediction has labelled our data. 

In [48]:
a_records_unique.sample(10)

| | dns_rrname | domain_tld | dga_predict |
|---|---|---|---|
| 107 | mirror.hmc.edu | hmc.edu | legit |
| 76 | download.windowsupdate.com | windowsupdate.com | legit |
| 93 | mirror.its.uidaho.edu | uidaho.edu | legit |
| 106 | mirrors.kernel.org | kernel.org | legit |
| 176 | client-software.real.com | real.com | legit |
| 129 | cloud.xmarks.com | xmarks.com | legit |
| 126 | FL | FL | legit |
| 91 | google.com | google.com | legit |
| 83 | api.facebook.com | facebook.com | legit |
| 82 | www.malwarecity.com | malwarecity.com | legit |


In [49]:
dga_df = a_records_unique[a_records_unique.dga_predict == 'dga'].copy()  # copy so new columns can be added safely

In [50]:
dga_df

| | dns_rrname | domain_tld | dga_predict |
|---|---|---|---|
| 144 | dns.msftncsi.com | msftncsi.com | dga |
| 147 | www.msftncsi.com | msftncsi.com | dga |
| 160 | vtlfccmfxlkgifuf.com | vtlfccmfxlkgifuf.com | dga |
| 167 | ejfodfmfxlkgifuf.xyz | ejfodfmfxlkgifuf.xyz | dga |


### Our dataset is down to 4 results! Let's run the domains through Alexa to see if any are in the top 1 million

In [51]:
alexa = Alexa()

In [52]:
dga_df['in_alexa'] = dga_df.dns_rrname.apply(alexa.domain_in_alexa)

In [53]:
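def get_creation_date(domain):
    """Return the WHOIS creation date for a domain, or a placeholder string."""
    try:
        lookup = whois(domain)
        output = lookup.get('creation_date', 'No results')
    except Exception:
        # The lookup failed, for example because the domain is not registered.
        output = 'No Creation Date!'
    if output is None:
        output = 'No Creation Date!'
    return output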

In [54]:
get_creation_date('google.com')

datetime.datetime(1997, 9, 15, 0, 0)

In [55]:
dga_df

| | dns_rrname | domain_tld | dga_predict | in_alexa |
|---|---|---|---|---|
| 144 | dns.msftncsi.com | msftncsi.com | dga | False |
| 147 | www.msftncsi.com | msftncsi.com | dga | False |
| 160 | vtlfccmfxlkgifuf.com | vtlfccmfxlkgifuf.com | dga | False |
| 167 | ejfodfmfxlkgifuf.xyz | ejfodfmfxlkgifuf.xyz | dga | False |
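How `alexa.domain_in_alexa` matches names is not shown here. One thing worth keeping in mind with any top-sites lookup is that such lists mostly contain registered domains, so a full hostname with a subdomain can miss even when its parent domain is listed. The sketch below uses a tiny hypothetical list, not real Alexa data, and `in_top_sites` is an illustrative name.

```python
# Hypothetical stand-in for a top-sites list; not the real Alexa data.
top_sites = {'google.com', 'dropbox.com', 'mozilla.org'}


def in_top_sites(hostname, top_sites):
    """True if the hostname or its last two labels appear in the list."""
    registered = '.'.join(hostname.split('.')[-2:])
    return hostname in top_sites or registered in top_sites


# An exact match on the full hostname misses the listed parent domain.
print('stork79.dropbox.com' in top_sites)
print(in_top_sites('stork79.dropbox.com', top_sites))
print(in_top_sites('vtlfccmfxlkgifuf.com', top_sites))
```

If the results look surprising, repeating the check on the `domain_tld` column instead of `dns_rrname` is a cheap way to rule this effect out.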


### It appears none of our domains are in Alexa, but let's check creation dates.

In [56]:
dga_df['creation_date'] = dga_df.dns_rrname.apply(get_creation_date)

In [57]:
dga_df

| | dns_rrname | domain_tld | dga_predict | in_alexa | creation_date |
|---|---|---|---|---|---|
| 144 | dns.msftncsi.com | msftncsi.com | dga | False | [2005-11-10 00:00:00, 2005-11-10 22:06:51] |
| 147 | www.msftncsi.com | msftncsi.com | dga | False | [2005-11-10 00:00:00, 2005-11-10 22:06:51] |
| 160 | vtlfccmfxlkgifuf.com | vtlfccmfxlkgifuf.com | dga | False | No Creation Date! |
| 167 | ejfodfmfxlkgifuf.xyz | ejfodfmfxlkgifuf.xyz | dga | False | No Creation Date! |
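The `creation_date` column mixes lists of datetimes with a placeholder string, which makes it awkward to sort or filter. A minimal sketch of one way to normalize it follows; `earliest_creation_date` and `age_in_days` are hypothetical helpers, and the sketch assumes all datetimes are timezone-naive.

```python
from datetime import datetime


def earliest_creation_date(value):
    """Return the earliest datetime in `value`, or None if there is none."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, (list, tuple)):
        dates = [v for v in value if isinstance(v, datetime)]
        return min(dates) if dates else None
    return None  # None and strings such as 'No Creation Date!' land here


def age_in_days(value, today):
    """Domain age in whole days, or None when no creation date is known."""
    created = earliest_creation_date(value)
    return (today - created).days if created else None


today = datetime(2017, 7, 22)
established = [datetime(2005, 11, 10), datetime(2005, 11, 10, 22, 6, 51)]

print(age_in_days(established, today))
print(age_in_days('No Creation Date!', today))
```

A missing or very recent creation date is a hint rather than proof, so it works best combined with the classifier and top-sites checks.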


### Ah-ha! We have identified 2 domains with no creation_date. The other domains appear to be fairly well established. You have successfully identified the 2 domains created with DGA!