In [6]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
import datetime 
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import seaborn as sns
from yellowbrick.features.rankd import Rank2D
from yellowbrick.features.radviz import RadViz
from yellowbrick.features.pcoords import ParallelCoordinates
from yellowbrick.features import JointPlotVisualizer
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassificationReport
import matplotlib.pyplot as plt
import matplotlib
import entropy
%matplotlib inline

# Useful cyber libraries
import whois   # pip install python-whois
import tldextract  # pip install tldextract 
import ipaddress  # standard library in Python 3 (pip install ipaddress is only needed on Python 2)
import dns.query  # pip install dnspython
import dns.resolver
from dns.exception import DNSException

<img src="../img/logo_white_bkg_small.png" align="right" />   


## Worksheet 5.1 - Feature Engineering: Malicious URL Detection using Machine Learning

This worksheet is a step-by-step guide on how to train a Machine Learning model that can detect malicious URLs. We will walk you through the process of transforming raw URL strings into Machine Learning features and creating a Decision Tree Classifier, which you will use to determine whether a given URL is malicious or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.  

### Overview: 2 main steps

1. **Feature Engineering** - from raw URL strings to features using [pandas](http://pandas.pydata.org/pandas-docs/stable/) DataFrame, datetime and [numpy](http://www.numpy.org/) manipulations.
2. **Machine Learning Classification** - predict whether a URL is malicious or not using a Decision Tree Classifier in [sklearn](http://scikit-learn.org/stable/) and evaluate model performance

We provide an additional notebook where you can see how to use "Featureless Deep Learning" to build such a classifier.


### Data

The dataset was built from various open source data sources. Computationally intensive tasks, such as retrieving the creation time for each unique domain in the data set via [whois](https://pypi.python.org/pypi/python-whois), have already been performed. Some of the open source URLs came with the zone apex only and others didn't include the protocol; therefore, we uniformly removed the protocol (http:// or https://) and subdomain (e.g. www) from the URL string where applicable.
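
The normalization described above can be sketched as follows. The original preprocessing code is not part of this worksheet, so this is only an illustration under stated assumptions: the helper name `strip_protocol_and_www` is hypothetical, and it only removes a leading `www.` label (removing arbitrary subdomains would require a public suffix list, e.g. via `tldextract`).

```python
import re

def strip_protocol_and_www(url):
    """Remove a leading http:// or https:// and a leading 'www.' label (illustrative helper)."""
    # Drop the protocol, case-insensitively, if present
    url = re.sub(r'^https?://', '', url.strip(), flags=re.IGNORECASE)
    # Drop a leading 'www.' label, if present
    return re.sub(r'^www\.', '', url, flags=re.IGNORECASE)

for raw in ['http://www.example.com/login.php', 'https://example.org', 'example.net/a/b']:
    print(raw, '->', strip_protocol_and_www(raw))
```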

Benign
- Custom automated web scraping of [Alexa Top 1M](https://blog.majestic.com/development/majestic-million-csv-daily/) with a recursive scraping depth of 1.

Malicious
- Various blacklists
- [openphish](https://openphish.com/)
- [phishtank](https://www.phishtank.com/)
- [public GitHub faizann24](https://github.com/faizann24/Using-machine-learning-to-detect-malicious-URLs)
- some more sources

The dataset is perfectly balanced (50% benign and 50% malicious). We put emphasis on collecting benign URLs with paths and not just the domain. Furthermore, depending on your environment, you can choose between a smaller subset (```url_data_small.csv```, containing 4000 URLs, balanced) or the full data set (```url_data_full.csv```, containing 87380 URLs, balanced).


In [None]:
## Load data
DATA_HOME = '../data/'
df = pd.read_csv(DATA_HOME + 'url_data_full.csv')
# df = pd.read_csv(DATA_HOME + 'url_data_small.csv')
df.isIP = df.isIP.astype(int)
print(df.shape)
df.sample(n=5).head() # print a random sample of the DataFrame

In [None]:
df.isMalicious.value_counts()

## Part 1 - Feature Engineering


The traditional approach is to hand-craft Machine Learning features. This can be the most tedious part and often requires extensive domain expertise and data wrangling skills.

Previous academic research on identifying malicious or suspicious URLs has focused on studying the usefulness of an exhaustive list of candidate features. Here, we cover only a selection of some basic and most widely used features.

There are 4 main "URL Features" families:
1. **BlackList Features**: Check whether the URL appears in any BlackList. BlackLists suffer from a high false negative rate, but can still be useful as a feature.
2. **Lexical Features**: Using methods from Natural Language Processing. They capture the property of malicious URLs tending to "look different" from benign URLs. Therefore, lexical features quantify contextual information such as the length of the URL.
3. **Host-based Features**: They quantify properties of the web site host and answer "where" the site is hosted, "who" owns it and "how" it is managed. API queries are needed for this type of features (WHOIS, DNS records). Some example features can be the date of registration, geolocation, autonomous system (AS) number, connection speed or time-to-live (TTL).
4. **Content-based Features**: This is one of the less commonly used feature families, as it requires downloading the entire web page and hence executing the potentially malicious site, which is not only unsafe but also increases the computational cost of deriving features. Features here can be HTML or JavaScript based.

Source: Sahoo et al. 2017: [Malicious URL Detection using Machine Learning: A Survey](https://arxiv.org/pdf/1701.07179.pdf)

In this notebook, we focus on a selection of **lexical features** and **host-based features**, starting with the lexical ones in the subsequent code cell. The host-based features instructions will follow in the next markdown cell.

### Feature Engineering Sub-Section A - Lexical Features


**Selection of lexical features**:

1. Length of URL ["Length"]
2. Length of hostname/domain ["LengthDomain"]
3. Count of digits ["DigitsCount"]
4. Entropy of hostname/domain ["EntropyDomain"] - use ```H_entropy``` function provided 
5. Position (or index) of the first digit ["FirstDigitIndex"] - use ```firstDigitIndex``` function provided 
6. Bag-of-words - more details later

We provide a couple of helper functions. Please run the following function cell and then continue reading the next markdown cell with more details on how to derive those features. Have fun!
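
For intuition about the entropy feature, here is a minimal standalone sketch of Shannon entropy in bits per character. It is an assumption that the provided ```entropy.shannon_entropy``` helper follows the same definition; it may use a different normalization. The function name `shannon_entropy_sketch` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy_sketch(text):
    """Shannon entropy of a string in bits per character (illustrative, not the worksheet helper)."""
    if not text:
        return 0.0
    counts = Counter(text)   # how often each character occurs
    total = len(text)
    # H = sum over characters of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())

for domain in ['aaaa', 'google', 'x7f3kq9z']:
    print(domain, round(shannon_entropy_sketch(domain), 3))
```

Strings that repeat a few characters score low, while strings made of many distinct characters score higher, which is why entropy is often used as a signal for randomly generated domain names.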



In [None]:
def H_entropy (x):
    # Calculate Shannon Entropy
    return entropy.shannon_entropy(x)

def firstDigitIndex( s ):
    for i, c in enumerate(s):
        if c.isdigit():
            return i + 1
    return 0

### Tasks - Sub-Section A - Lexical Features

Append features to the pandas 2D DataFrame ```df``` with a new column for each feature. Later, simply drop the columns that are not features. Please focus on ```["Length"]```, ```["LengthDomain"]```, ```["DigitsCount"]```, ```["EntropyDomain"]``` and ```["FirstDigitIndex"]``` here. [pandas.Series.str](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.html), [pandas.Series.replace](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.replace.html), [pandas.Series.apply](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) and [tldextract](https://pypi.python.org/pypi/tldextract) can be very helpful to quickly derive those features. The functions you need to apply here are provided in the cell above.

For the ```Bag-of-words``` see next instructions in next markdown cell...
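
As a hint for how the string accessors and ```apply``` fit together, here is a small sketch on made-up URLs. The `toy` DataFrame and the `first_digit_index` helper (a copy of the provided ```firstDigitIndex``` convention) are only for illustration; the domain-based features are left out because they depend on how you extract the domain.

```python
import pandas as pd

def first_digit_index(s):
    # Same convention as the provided helper: 1-based position of the first digit, 0 if none
    for i, c in enumerate(s):
        if c.isdigit():
            return i + 1
    return 0

# Made-up URLs, already stripped of protocol and "www"
toy = pd.DataFrame({'url': ['example.com/index.html',
                            'paypa1-login.example.ru/a1b2',
                            '192.168.0.1/x']})

toy['Length'] = toy['url'].str.len()                       # vectorized string length
toy['DigitsCount'] = toy['url'].str.count(r'\d')           # number of digit characters
toy['FirstDigitIndex'] = toy['url'].apply(first_digit_index)  # element-wise custom function
print(toy)
```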


In [None]:
# derive simple lexical features

### YOUR CODE ###

### Tasks - Sub-Section A - Lexical Features (continued)

There are many different approaches to applying ```bag-of-words``` to URLs. Here we suggest the following approach:

1. Extract the different portions of the URL (host names (domains), top-level-domains (tlds) [what is TLD](https://en.wikipedia.org/wiki/Top-level_domain), paths) and create separate pandas Series (or Python lists) using the [tldextract](https://pypi.python.org/pypi/tldextract) library.
2. (Code for step 2 is provided) Find the top 20 tlds (e.g. ```com```, ```de```, ```ru``` etc) from the data. Then use [sklearn CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to create ```bag-of-words``` with a custom vocabulary, here the top 20 tlds as parameter input. Use the ```.fit()``` method to train the CountVectorizer model (save the model in a variable, you will need this model later for real-time transformations of new URLs). After this, the ```.transform()``` function is applied to the pandas Series or list of tlds. The result is a SciPy sparse matrix, therefore ```.toarray()``` is needed to get a regular (dense) numpy array. You will notice that this array contains a lot of zeros. The ```get_feature_names()``` method (```get_feature_names_out()``` in scikit-learn 1.0 and later) is useful to get not only the vocabulary, but also to know which column of the matrix corresponds to which ```word```.
3. Knowing procedures for step 2, please try to do something similar for the domains. However, choose an ```ngram``` approach via setting the following parameters for the CountVectorizer: ```analyzer='char', ngram_range=(3, 4), max_features=30```.
4. (Code for step 4 is provided) Again, we provide you with the solution to applying a different CountVectorizer approach to the path using ```analyzer='word', tokenizer=custom_path_tokenizer, max_features=100``` as parameters.
5. Feel free to try different approaches.
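
The fixed-vocabulary idea from step 2 can be tried on a tiny made-up example first. This is only a sketch: `toy_tlds` and `vocab` are invented stand-ins for the real tld Series and the top 20 tlds.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

toy_tlds = ['com', 'ru', 'com', 'de', 'org']   # made-up sample, one tld per URL
vocab = ['com', 'de', 'ru']                    # stand-in for the top-N tlds

vectorizer = CountVectorizer(analyzer='word', vocabulary=vocab)
counts = vectorizer.fit(toy_tlds).transform(toy_tlds)

print(issparse(counts))     # the transform output is a SciPy sparse matrix
dense = counts.toarray()    # dense numpy array, one column per vocabulary entry
print(dense)                # 'org' is not in the vocabulary, so its row is all zeros
```

Because the vocabulary is passed as a list, the columns follow the order of that list, which makes it easy to label them when building a DataFrame.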

At each step the numpy matrix is converted to a pandas DataFrame and then concatenated to the previous one and so on. That way you can run one cell multiple times without re-concatenating to the original df which would throw errors. At the end simply replace the original df with the df that contains all bag-of-words features.
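
To see what the character n-gram setting in step 3 produces, the following plain-Python sketch counts 3- and 4-character n-grams over a few made-up domains and keeps the most frequent ones. It only approximates what ```CountVectorizer(analyzer='char', ngram_range=(3, 4), max_features=...)``` does; details such as tie-breaking between equally frequent n-grams may differ. The names `char_ngrams` and `toy_domains` are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n_min=3, n_max=4):
    """All character n-grams of length n_min..n_max, shortest first."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

toy_domains = ['paypal', 'paypal-secure', 'google']   # made-up examples

# Count every n-gram across the whole corpus
corpus_counts = Counter(g for d in toy_domains for g in char_ngrams(d))

# Keep only the most frequent n-grams, similar in spirit to max_features
top = [gram for gram, _ in corpus_counts.most_common(5)]
print(char_ngrams('paypal'))
print(top)
```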


In [None]:
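def extract_path(url):
    # Remove the registered domain (domain + suffix) from the URL; what remains is treated as the path
    ext = tldextract.extract(url)
    # re.escape keeps '.' and other special characters in the domain from being read as regex syntax
    return re.sub(re.escape('.'.join([ext.domain, ext.suffix])), '', url)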

In [None]:
#Task1: Create separate Series from df.url for the different portions of the URL

### YOUR CODE ###






# domains = 
# tlds = 
# paths =

In [None]:
#Task2: Code provided as example

n_tlds = 20
top_tlds = list(tlds.value_counts().head(n_tlds).keys())
top_tlds = [tld if tld != '' else 'nan' for tld in top_tlds]  # encode empty/missing tld as 'nan'

In [None]:
#Task2: Code provided as example
CountVectorizer_tlds = CountVectorizer(analyzer='word', vocabulary=top_tlds)
CountVectorizer_tlds = CountVectorizer_tlds.fit(tlds)
matrix_dense_tlds = CountVectorizer_tlds.transform(tlds)

print(CountVectorizer_tlds.get_feature_names())
print(matrix_dense_tlds.shape)
print(sum(matrix_dense_tlds.toarray()))

In [None]:
#Task2: Code provided as example (see answers notebook to regenerate the sample output)
df_tlds = pd.DataFrame(matrix_dense_tlds.toarray(), columns=CountVectorizer_tlds.get_feature_names())
# matrix_dense_tlds is a SciPy sparse matrix despite its name; .toarray() converts it to a regular (dense) numpy array that contains a lot of zeros
df1 = pd.concat([df, df_tlds],axis=1)
print(len(df1.columns))
df1.head()

In [None]:
#Task3: Knowing procedures for step 2, please try to do something similar for the domains. 
#However, choose an ngram approach via setting the following parameters for the 
#CountVectorizer: analyzer='char', ngram_range=(3, 4), max_features=30.



### YOUR CODE ###






#df_domains=

# df2 = pd.concat([df1, df_domains],axis=1)
# print(len(df2.columns))
# df2.head()

In [None]:
def custom_path_tokenizer(path):  # input is a string for one path from one URL
    # Split on ? = / . _ - and drop empty tokens (raw string avoids invalid escape sequences)
    return list(filter(None, re.split(r'[?=/._-]', path.lower())))

In [None]:
#Task4: CountVectorizer approach for the path

### YOUR CODE ###

# modify or come up with a new solution!!!



CountVectorizer_paths = CountVectorizer(analyzer='word', tokenizer=custom_path_tokenizer, max_features=100)
CountVectorizer_paths = CountVectorizer_paths.fit(paths)

matrix_dense_paths = CountVectorizer_paths.transform(paths)

df_paths = pd.DataFrame(matrix_dense_paths.toarray(), columns=CountVectorizer_paths.get_feature_names())
df3 = pd.concat([df2, df_paths],axis=1)
print(len(df3.columns))
df3.head()

### Feature Engineering Sub-Section B - Host-based Features


Derivation of host-based features often requires the use of APIs or querying information from some authoritative source. It took us 2 days to get all whois data for all of our unique domains (see ```domains_created_db.csv``` file). 

**Selection of host-based features**:

1. Time delta between today's date and creation date ['DurationCreated'] (original whois code included at the end of the notebook)
2. Check if it is an IP address ['isIP'] - already provided, no feature engineering needed   
3. (Time-to-live ['ttl'] - code to query an authoritative nameserver included at the end of the notebook, but not included in preprocessed data set)
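
The ```isIP``` flag is already part of the data set, and how it was computed is not shown in this worksheet. One way such a flag could be derived with the standard library is sketched below, assuming the URL has already been stripped of its protocol and carries no port; the helper name `is_ip_host` is hypothetical.

```python
import ipaddress

def is_ip_host(url):
    """Return 1 if the host part of a protocol-less URL is an IP address, else 0."""
    host = url.split('/', 1)[0]   # everything before the first slash
    try:
        ipaddress.ip_address(host)   # raises ValueError for anything that is not an IP
        return 1
    except ValueError:
        return 0

for u in ['192.168.0.1/x', 'example.com/1.2.3.4', '999.1.1.1']:
    print(u, is_ip_host(u))
```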


### Tasks - Sub-Section B - Host-based Features

Append features to the pandas 2D DataFrame ```df``` with a new column for each feature. Later, simply drop the columns that are not features. [pandas.to_datetime](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) with ```errors='coerce'``` is an easy way to convert the ```WHOIS``` info ```["created"]``` to a datetime data type. Make sure to also fillna with zeros! You can then simply subtract the creation date from today's date to derive the ```["DurationCreated"]``` feature. To express the time delta in days, use the ```.dt.days``` accessor of the resulting timedelta Series (note the plural: [pandas.Series.dt.day](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.day.html) returns the day of the month of a datetime, not the length of a time delta). 

After all features have been added to the pandas 2D DataFrame, please drop all columns that are not features; here, drop ```['url', 'created', 'domain']```.
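
The date handling can be tried on a few made-up values before applying it to the real ```created``` column. This is a sketch under assumptions: the `toy` values are invented, a fixed reference date replaces today's date so the result is reproducible, and filling missing durations with zero is one reading of the fillna hint above.

```python
import pandas as pd

# Made-up WHOIS creation dates: one valid, one unparseable, one missing
toy = pd.DataFrame({'created': ['2017-01-20', 'not a date', None]})

# Fixed reference date for reproducibility; in the worksheet you would use today's date
today = pd.Timestamp('2018-06-01')

created = pd.to_datetime(toy['created'], errors='coerce')   # unparseable or missing values become NaT
duration = (today - created).dt.days                        # timedelta in whole days, NaN where created is NaT
toy['DurationCreated'] = duration.fillna(0).astype(int)     # fill missing durations with 0
print(toy)
```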


In [None]:
df=df3
### YOUR CODE ###

In [None]:
df_final = df
### YOUR CODE ###
#drop ['url', 'created', 'domain']





df_final.sample(n=5).head()

#### Breakpoint: Load Features and Labels

If you got stuck in Part 1, please simply load the feature matrix we prepared for you, so you can move on to Part 2 and train a Decision Tree Classifier.

In [None]:
df_final = pd.read_csv(DATA_HOME + 'url_features_final_df.csv')
print(df_final.isMalicious.value_counts())
print(len(df_final.columns))
df_final.sample(n=5).head()

In [7]:
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype
import dns.resolver
from dns.exception import DNSException

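def query_authoritative_ns(domain, log=lambda msg: None, ttl_only=True):
    # Walk the DNS hierarchy from the TLD down to `domain`, following NS referrals.
    # Returns the TTL of the last NS RRset seen (or the RRset itself if ttl_only=False).

    default = dns.resolver.get_default_resolver()
    ns = default.nameservers[0]
    result = None  # stays None if no NS record is found

    n = domain.split('.')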

    for i in range(len(n), 0, -1):
        sub = '.'.join(n[i-1:])

        log('Looking up %s on %s' % (sub, ns))
        query = dns.message.make_query(sub, dns.rdatatype.NS)
        response = dns.query.udp(query, ns)

        rcode = response.rcode()
        if rcode != dns.rcode.NOERROR:
            if rcode == dns.rcode.NXDOMAIN:
                raise Exception('%s does not exist.' % (sub))
            else:
                raise Exception('Error %s' % (dns.rcode.to_text(rcode)))

        if len(response.authority) > 0:
            rrsets = response.authority
        elif len(response.additional) > 0:
            rrsets = [response.additional]
        else:
            rrsets = response.answer

        # Handle all RRsets, not just the first one
        for rrset in rrsets:
            for rr in rrset:
                if rr.rdtype == dns.rdatatype.SOA:
                    print('Same server is authoritative for %s' % (sub))
                elif rr.rdtype == dns.rdatatype.A:
                    ns = rr.items[0].address
                    print('Glue record for %s: %s' % (rr.name, ns))
                elif rr.rdtype == dns.rdatatype.NS:
                    authority = rr.target
                    ns = default.query(authority).rrset[0].to_text()
                    print('%s [%s] is authoritative for %s; ttl %i' % (authority, ns, sub, rrset.ttl))
                    result = rrset
                    if ttl_only:
                        print(rrset)
                        result = rrset.ttl
                else:
                    # IPv6 glue records etc
                    #log('Ignoring %s' % (rr))
                    pass

    return result

In [8]:
query_authoritative_ns('www.gtkcyber.com')

Glue record for a.gtld-servers.net.: 192.5.6.30
Glue record for d.gtld-servers.net.: 192.31.80.30
Glue record for g.gtld-servers.net.: 192.42.93.30
Glue record for j.gtld-servers.net.: 192.48.79.30
Glue record for m.gtld-servers.net.: 192.55.83.30
ns-cloud-b1.googledomains.com. [216.239.32.107] is authoritative for gtkcyber.com; ttl 172800
gtkcyber.com. 172800 IN NS ns-cloud-b1.googledomains.com.
gtkcyber.com. 172800 IN NS ns-cloud-b2.googledomains.com.
gtkcyber.com. 172800 IN NS ns-cloud-b3.googledomains.com.
gtkcyber.com. 172800 IN NS ns-cloud-b4.googledomains.com.
ns-cloud-b2.googledomains.com. [216.239.34.107] is authoritative for gtkcyber.com; ttl 172800
gtkcyber.com. 172800 IN NS ns-cloud-b1.googledomains.com.
gtkcyber.com. 172800 IN NS ns-cloud-b2.googledomains.com.
gtkcyber.com. 172800 IN NS ns-cloud-b3.googledomains.com.
gtkcyber.com. 172800 IN NS ns-cloud-b4.googledomains.com.
ns-cloud-b3.googledomains.com. [216.239.36.107] is authoritative for gtkcyber.com; ttl 172800
gtkcyb

172800

In [5]:
whois.whois('www.gtkcyber.com')

{'domain_name': ['GTKCYBER.COM', 'gtkcyber.com'],
 'registrar': 'Google LLC',
 'whois_server': 'whois.google.com',
 'referral_url': None,
 'updated_date': datetime.datetime(2018, 1, 20, 9, 7, 56),
 'creation_date': datetime.datetime(2017, 1, 20, 3, 32, 35),
 'expiration_date': datetime.datetime(2019, 1, 20, 3, 32, 35),
 'name_servers': ['NS-CLOUD-B1.GOOGLEDOMAINS.COM',
  'NS-CLOUD-B2.GOOGLEDOMAINS.COM',
  'NS-CLOUD-B3.GOOGLEDOMAINS.COM',
  'NS-CLOUD-B4.GOOGLEDOMAINS.COM'],
 'status': ['ok https://icann.org/epp#ok', 'ok https://www.icann.org/epp#ok'],
 'emails': ['registrar-abuse@google.com', 'lxhotryq6rne@contactprivacy.email'],
 'dnssec': 'unsigned',
 'name': 'Contact Privacy Inc. Customer 1241054753',
 'org': 'Contact Privacy Inc. Customer 1241054753',
 'address': '96 Mowat Ave',
 'city': 'Toronto',
 'state': 'ON',
 'zipcode': 'M4K 3K1',
 'country': 'CA'}

In [None]:
whois.whois('www.gtkcyber.com')['creation_date']

##  Visualizing the Features
In the last step, you're going to explore the feature space to see which features are potentially useful or not and of course whether there is too much noise to make predictions.  

First, using [Yellowbrick](http://pythonhosted.org/yellowbrick/examples/examples.html), create a Covariance ranking of the features.  Since this section is about visualizing this information and not deriving it, please execute the cell below so that everyone will have the same data and get the same results.

In [None]:
## Load data
DATA_HOME = '../data/'
df_final = pd.read_csv(DATA_HOME + 'url_features_final_df.csv')
features = df_final.loc[:,'isIP':]
target = df_final['isMalicious']

In [None]:
# Your code for the covariance ranking here...

### What did you see?
If you did this correctly, you should see that most of the features are nearly useless. Next, pick 7 features yourself, either using the `feature_selection` functions in `scikit-learn` or by just picking them yourself, and create a pair plot using Seaborn to determine whether there are clear class boundaries between the classes in these features. 

In [None]:
#Gets an array of the best features in 1 step.
best_features = SelectKBest( score_func=chi2, k=7).fit_transform(features,target)

#Get the feature names and indexes
best = SelectKBest( score_func=chi2, k=7).fit(features,target)
feature_names = pd.Series(features.columns)
feature_names[best.get_support()]

In [None]:
#Use Seaborn to create a pairplot here