In [None]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
import datetime  
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import matplotlib
from yellowbrick.classifier import ConfusionMatrix, ROCAUC
from yellowbrick.classifier import ClassificationReport
import scikitplot as skplt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 8]

# Useful cyber libraries
import whois   # pip install python-whois
import tldextract  # pip install tldextract 
import ipaddress  # part of the Python 3 standard library
import dns.query  # pip install dnspython
import dns.resolver
import entropy
from dns.exception import DNSException

<img src="../img/logo_white_bkg_small.png" align="right" />   


## Worksheet 5.2 Malicious URL Classification

This worksheet is a step-by-step guide on how to train a Machine Learning model that can detect malicious URLs. We will walk you through the process of transforming raw URL strings into Machine Learning features and creating a Decision Tree Classifier, which you will use to determine whether a given URL is malicious or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.  

### Overview: 2 main steps

1. **Feature Engineering** - from raw URL strings to features using [pandas](http://pandas.pydata.org/pandas-docs/stable/) DataFrame, datetime and [numpy](http://www.numpy.org/) manipulations.
2. **Machine Learning Classification** - predict whether a URL is malicious or not using a Decision Tree Classifier in [sklearn](http://scikit-learn.org/stable/) and evaluate model performance

We provide an additional notebook where you can see how to use "Featureless Deep Learning" to build such a classifier.


### Data

The dataset was built from various open source data sources. Computationally intensive tasks, such as retrieving the creation time for each unique domain in the data set via [whois](https://pypi.python.org/pypi/python-whois), have already been performed beforehand. Some of the open source URLs came with the zone apex only and others didn't include the protocol, so we uniformly removed the protocol (http:// or https://) and subdomain (e.g. www) from the URL string where applicable.

Benign
- Custom automated webscraping of [Alexa Top 1M](https://blog.majestic.com/development/majestic-million-csv-daily/) with recursive depth of scraping of level 1.

Malicious
- Various blacklists
- [openphish](https://openphish.com/)
- [phishtank](https://www.phishtank.com/)
- [public GitHub faizann24](https://github.com/faizann24/Using-machine-learning-to-detect-malicious-URLs)
- some more sources

The dataset is perfectly balanced (50% benign and 50% malicious). We made a point of collecting benign URLs that include paths and not just the domain. Depending on your environment, you can choose between a smaller subset (```url_data_small.csv```, containing 4000 balanced URLs) or the full data set (```url_data_full.csv```, containing 87380 balanced URLs).

In [None]:
## Load data
DATA_HOME = '../data/'
df = pd.read_csv(DATA_HOME + 'url_data_full.csv')
# df = pd.read_csv('../../Data/URL/url_data_small.csv')
df.isIP = df.isIP.astype(int)
print(df.shape)
df.sample(n=5).head() # print a random sample of the DataFrame

#### Breakpoint: Load Features and Labels

If you got stuck in Part 1, please simply load the feature matrix we prepared for you, so you can move on to Part 2 and train a Decision Tree Classifier.

In [None]:
df_final = pd.read_csv(DATA_HOME + 'url_features_final_df.csv')
print(df_final.isMalicious.value_counts())
print(len(df_final.columns))
df_final.sample(n=5)

## Part 2 - Machine Learning

To learn simple classification procedures using [sklearn](http://scikit-learn.org/stable/) we have split the workflow into 6 steps.

### Step 1: Prepare Feature matrix ```X``` and ```target``` vector containing the URL labels

- X is the feature matrix
- target is a vector containing the labels for each URL (often also called *y* in statistics)
- for sklearn input, X and target should be array-like: here a pandas DataFrame/Series or a numpy array/vector, respectively

Tasks:
- assign ```'isMalicious'``` column to a pandas Series named 'target'
- drop ```'isMalicious'``` column from DataFrame and name the resulting pandas DataFrame ```X```
- save the column names of the ```X``` in a variable called ```feature_names```
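
The three tasks above can be sketched on a small stand-in DataFrame. The `toy_df` frame and its values are hypothetical and only mimic the layout of `df_final`; the real feature matrix has many more columns.

```python
import pandas as pd

# Hypothetical toy stand-in for df_final (values are made up for illustration).
toy_df = pd.DataFrame({
    'Length': [22, 57, 14, 80],
    'DigitsCount': [0, 6, 0, 11],
    'isIP': [0, 0, 0, 1],
    'isMalicious': [0, 1, 0, 1],
})

target = toy_df['isMalicious']            # pandas Series holding the labels
X = toy_df.drop(columns=['isMalicious'])  # DataFrame with the remaining feature columns
feature_names = X.columns.tolist()        # plain list, so .index() lookups work later

print(X.shape, target.shape)
print(feature_names)
```

Storing `feature_names` as a plain list is a choice made here because later cells call `feature_names.index(...)`.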

In [None]:
# Your Code here...

### Step 2: Simple Cross-Validation

Tasks:
- Split your feature matrix X and target vector into train and test subsets using sklearn [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The sklearn documentation also uses ```X``` for the feature matrix of shape [n_samples, n_features], but uses ```y``` for the labels of shape [n_samples] or [n_samples, n_outputs], which we call ```target``` here.
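
As a rough illustration of the call, here is a minimal sketch on made-up data. The arrays `X_toy` and `target_toy`, as well as the `test_size`, `random_state` and `stratify` settings, are assumptions chosen for the example and are not values prescribed by this worksheet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, balanced labels.
X_toy = np.arange(20).reshape(10, 2)
target_toy = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; stratify keeps the class balance in both parts.
X_train, X_test, target_train, target_test = train_test_split(
    X_toy, target_toy, test_size=0.2, random_state=42, stratify=target_toy)

print(X_train.shape, X_test.shape)
print(target_train.shape, target_test.shape)
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing models.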

In [None]:
# Simple Cross-Validation: Split the data set into training and test data
X_train, X_test, target_train, target_test = #Your code here...

### Step 3: Train the model and make a prediction

Finally, we have prepared and segmented the data. Let's start classifying!!   

Tasks:

-  Use the sklearn [tree.DecisionTreeClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), create a decision tree with standard parameters, and train it using the ```.fit()``` function with ```X_train``` and ```target_train``` data.
-  Next, pull a random row from the ```X_test``` array and ```target_test``` vector and see if your classifier got it correct by using the ```.predict()``` function on your classifier with ```test_feature``` as input to this method. When you extract only one random row from the numpy array you'll have to apply ```.reshape(1, -1)``` before passing it into the ```.predict()``` method.

If you are interested in trying a real unknown domain, you'll have to create a function to generate the features for that domain before you run it through the classifier (see function ```is_malicious``` a few cells below). 
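
The fit, single-row extraction and `.reshape(1, -1)` steps might look roughly like the sketch below. The data is a hypothetical one-feature toy set, and `clf_toy` is named differently from your real `clf` on purpose; `criterion='entropy'` is one possible setting, not a requirement.

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data: a single feature that separates the two classes cleanly.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
target_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[2.5], [10.5]])
target_test = np.array([0, 1])

clf_toy = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf_toy.fit(X_train, target_train)

# Pick one random test row; a single row is 1-D, so reshape it to (1, n_features).
i = np.random.randint(0, len(X_test))
test_feature = X_test[i].reshape(1, -1)
test_target = target_test[i]

pred = clf_toy.predict(test_feature)
print('Predicted class:', pred)
print('Accurate prediction?', pred[0] == test_target)
```

If `X_test` is a pandas DataFrame rather than a numpy array, selecting a row with `X_test.iloc[[i]]` keeps it two-dimensional, so no reshape is needed.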

In [None]:
# Train the decision tree based on the entropy criterion


# Extract a row from the test data


# Make the prediction



print('Predicted class:', pred)
print('Accurate prediction?', pred[0] == test_target)

In [None]:
# Additional helper functions for is_malicious function
def ip_matcher(address):
    # Used to validate if string is an ipaddress, currently only IPv4 supported
    ip = re.match(
        r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$', address)
    if ip:
        return 1
    else:
        return 0

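def strip_http(s):
    # Remove the protocol, a trailing slash and the subdomain (e.g. www) from a URL string
    pattern = re.compile(r'(http[s]?://)')
    clean_url = pattern.sub('', s)
    
    try:
        if clean_url[-1] in ['/']:
            clean_url = clean_url[:-1]
        subdomain = tldextract.extract(clean_url).subdomain
        if subdomain:
            # escape the subdomain and the dot so they are matched literally, and only once
            clean_url = re.sub(re.escape(subdomain + '.'), '', clean_url, count=1)
    except Exception:
        pass
    return clean_url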

In [None]:
# Load database of creation dates of various unique domains from our data set
domains_created_db = pd.read_csv(DATA_HOME + 'domains_created_db.csv')
domains_created_db.created = pd.to_datetime(domains_created_db.created, errors='coerce')
domains_created_db.head()


In [None]:
# Load Vectorizer models from part 1
import pickle

# Note: all functions for feature calculations need to be available here again to make predictions on new URLs

def custom_path_tokenizer(path):  # input one path from one URL
    return list(filter(None, re.compile(r'[\?\=/\._-]').split(path.lower())))

def extract_path(url):
    return re.sub('.'.join([tldextract.extract(url).domain, tldextract.extract(url).suffix]), '', url)

def H_entropy (x):
    # Calculate Shannon Entropy
    return entropy.shannon_entropy(x)

def firstDigitIndex( s ):
    for i, c in enumerate(s):
        if c.isdigit():
            return i + 1
    return 0

models = ["feature_names","CountVectorizer_tlds","CountVectorizer_domains","CountVectorizer_paths"]

tmp_models = []
for key in models:

    with open(DATA_HOME + key + '.pickle', 'rb') as f:
            tmp_models.append(pickle.load(f))
        
feature_names = tmp_models[0]
CountVectorizer_tlds = tmp_models[1]
CountVectorizer_domains = tmp_models[2]
CountVectorizer_paths = tmp_models[3]

In [None]:
def is_malicious(url, clf, feature_names, domains_created_db, CountVectorizer_tlds, CountVectorizer_domains, CountVectorizer_paths):
    """
    INPUT:
    - new raw URL string
    - trained model/classifiers 'clf'
    - feature_names to correctly identify index in numpy array
    - domains_created_db: custom database (from .csv file) containing creation date of unique domains of our data set
    - trained model CountVectorizer_tlds
    - trained model CountVectorizer_domains
    - trained model CountVectorizer_paths
    
    OUTPUT:
    - binary prediction (int 0 or 1, where 1 means URL is most likely to be malicious)
    
    """
    
    # clean raw URL
    url = strip_http(url)  # make sure strip_http function is available

    # extract portions of URL and retrieve create date from database
    domain = tldextract.extract(url).domain
    tld = tldextract.extract(url).suffix
    path = extract_path(url)  # make sure extract_path function is available
    created = domains_created_db.created[domains_created_db.url == '.'.join([domain, tld])]

    try:
        delta = (pd.to_datetime(datetime.date.today()) - created).dt.days.values[0]
    except:
        delta = 0

    # initialize numpy array
    n_features = len(feature_names)
    url_features = np.zeros(shape=(1, n_features))  # zeros, so any feature not set below defaults to 0
    
    # lexical features simple
    url_features[0, feature_names.index('Length')] = len(url)
    url_features[0, feature_names.index('LengthDomain')] = len(domain)
    pattern = re.compile('([0-9])')
    url_features[0, feature_names.index('DigitsCount')] = len(re.findall(pattern, url))
    url_features[0, feature_names.index('EntropyDomain')] = H_entropy(domain)  # make sure H_entropy function is available
    url_features[0, feature_names.index('FirstDigitIndex')] = firstDigitIndex(url)  # make sure firstDigitIndex function is available
   
    # lexical features bag-of-words
    new_row_tlds = CountVectorizer_tlds.transform([tld]).toarray()
    start = feature_names.index(CountVectorizer_tlds.get_feature_names()[0])
    stop = feature_names.index(CountVectorizer_tlds.get_feature_names()[-1])                            
    url_features[0, start:stop+1] = new_row_tlds
    
    new_row_domains = CountVectorizer_domains.transform([domain]).toarray()
    start = feature_names.index(CountVectorizer_domains.get_feature_names()[0])
    stop = feature_names.index(CountVectorizer_domains.get_feature_names()[-1])                            
    url_features[0, start:stop+1] = new_row_domains
    
    new_row_paths = CountVectorizer_paths.transform([path]).toarray()
    start = feature_names.index(CountVectorizer_paths.get_feature_names()[0])
    stop = feature_names.index(CountVectorizer_paths.get_feature_names()[-1])                            
    url_features[0, start:stop+1] = new_row_paths
    
    # host-based features
    url_features[0, feature_names.index('DurationCreated')] = delta
    url_features[0, feature_names.index('isIP')] = ip_matcher(url)
    
    
    pred = clf.predict(url_features)
    return pred[0]


test_URL = 'https://www.google.com'

print('Prediction of domain %s\nis %s [0 means \"benign\" and 1 \"is malicious\"]: ' \
%(test_URL, is_malicious(test_URL, clf, feature_names, domains_created_db, CountVectorizer_tlds,\
                         CountVectorizer_domains, CountVectorizer_paths)))  

# print('\nTiming result of new prediction:\n')
# result = %timeit -o is_malicious(test_URL, clf, feature_names, domains_created_db, CountVectorizer_tlds,\
#                          CountVectorizer_domains, CountVectorizer_paths)

### Step 4: Assess model accuracy with simple cross-validation

Tasks:
- Use sklearn [metrics.accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to determine your model's accuracy. Detailed instructions:
    - Use your trained model to predict the labels of your test data ```X_test```. Run the ```.predict()``` method on the clf with your test data ```X_test``` and store the results in a variable called ```target_pred```.
    - Then calculate the accuracy using ```target_test``` (the true labels/ground truth) and your model's predictions on the test portion ```target_pred``` as inputs (in that order: ```target_test, target_pred```). The advantage here is that you see how your model performs on new data it has not seen during the training phase.
    
- Finally print out the confusion matrix using [metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
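
To make the argument order and the layout of the confusion matrix concrete, here is a small sketch with hypothetical labels and predictions (eight made-up test URLs, where 1 means malicious).

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for eight test URLs (1 = malicious).
target_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
target_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Both functions take the true labels first, then the predictions.
accuracy = metrics.accuracy_score(target_test, target_pred)
cm = metrics.confusion_matrix(target_test, target_pred)

print('Accuracy:', accuracy)  # 6 of 8 predictions are correct -> 0.75
print(cm)                     # rows are true labels, columns are predicted labels
```

In this toy case the matrix is `[[3, 1], [1, 3]]`: one benign URL was flagged as malicious (false positive) and one malicious URL was missed (false negative).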

In [None]:
# fair approach: make prediction on test data portion


In [None]:
# Visualize this using Yellowbrick


In [None]:
# Classification Report...neat summary


In [None]:
# Use the Yellowbrick classification report



In [None]:
#Use Yellowbrick to visualize the ROC/AUC



### Step 5: Assess model accuracy with k-fold cross-validation

Tasks:
- Partition the dataset into *k* different subsets
- Create *k* different models by training on *k-1* subsets and testing on the remaining subset
- Measure the performance on each of the models and take the average measure.

*Short-Cut*
All of these steps can be easily achieved by simply using sklearn's [model_selection.KFold()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) and [model_selection.cross_val_score()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) functions.
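
A minimal sketch of that shortcut on made-up data follows. The arrays `X_toy` and `target_toy`, the classifier `clf_toy`, and the choice of `n_splits=5` are assumptions for illustration only.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy data: 40 samples, 3 features, label depends on the first feature only.
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 3)
target_toy = (X_toy[:, 0] > 0.5).astype(int)

# Shuffle before splitting so the folds do not depend on the row order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
clf_toy = tree.DecisionTreeClassifier(random_state=0)

# cross_val_score fits a fresh copy of the model on each training fold
# and returns one accuracy value per held-out fold.
scores = cross_val_score(clf_toy, X_toy, target_toy, cv=cv, scoring='accuracy')

print(scores)
print(scores.mean(), scores.std())
```

The resulting `scores` array has one entry per fold, which is the shape the `mean_score` helper in this worksheet expects.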

In [None]:
# Your code here...


In [None]:
# Get average score +- Standard Error (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html)
from scipy.stats import sem
def mean_score( scores ):
    return "Mean score: {0:.3f} (+/- {1:.3f})".format( np.mean(scores), sem( scores ))
print( mean_score( scores))

### Step 6: Get the best predictors

Tasks:
- Using your fitted clf (here DecisionTreeClassifier) get the attribute [feature\_importances\_](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
- Create a new df ```f_imp``` with the column names of your features (not including isMalicious column of original df) and transpose ```f_imp```.
- Sort the new df ```f_imp``` using [pd.DataFrame.sort\_values](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html).
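
One possible variant of these tasks uses named columns instead of a transpose, which can make the sorted table easier to read. The sketch below runs on hypothetical toy data; `X_toy`, `clf_toy` and `f_imp_toy` are illustrative names and not part of the worksheet.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Hypothetical toy data: only 'Length' carries signal, 'Noise' is constant.
X_toy = pd.DataFrame({'Length': [1, 2, 3, 10, 11, 12],
                      'Noise': [5, 5, 5, 5, 5, 5]})
target_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, target_toy)

# Pair each feature name with its importance and sort, most important first.
f_imp_toy = (pd.DataFrame({'feature': X_toy.columns,
                           'importance': clf_toy.feature_importances_})
             .sort_values('importance', ascending=False)
             .reset_index(drop=True))
print(f_imp_toy)
```

Because a constant column can never be used for a split, all of the importance lands on `Length` in this toy case.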

In [None]:
f_imp = pd.DataFrame([clf.feature_importances_, feature_names]).T.sort_values(0, ascending=False)
f_imp.head(20)

## Visualizing Feature Importances
You can also visualize feature importances using scikit-plot.  The code for this is:

```python
skplt.estimators.plot_feature_importances(clf, feature_names=feature_names)
plt.show()
```

From this you really can tell which features are contributing and which are not.

In [1]:
#Your code here...


#### (Optional) Visualizing your Tree
As an optional step, you can visualize your tree. The following code will generate a graph of your decision tree. You will need graphviz (http://www.graphviz.org) and pydotplus (or pydot) installed for this to work.
The Griffon VM has this installed already, but if you try this on a Mac or Linux machine you will need to install graphviz.

**NOTE: This might time out with a large tree.**
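
If graphviz is not available, a plain-text rendering may be enough for a quick look. The sketch below uses sklearn's `tree.export_text`, which is a different technique from the graph rendering in this section, on a hypothetical toy tree; the data and the names `Length` and `isIP` are used only for illustration.

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data with two features.
X_toy = np.array([[1.0, 0.0], [2.0, 1.0], [10.0, 0.0], [11.0, 1.0]])
target_toy = np.array([0, 0, 1, 1])

clf_toy = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, target_toy)

# export_text returns the split rules as an indented string, no graphviz needed.
rules = tree.export_text(clf_toy, feature_names=['Length', 'isIP'])
print(rules)
```

For a large tree, passing `max_depth` to `export_text` limits how many levels are printed.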

In [None]:
# These libraries are used to visualize the decision tree and require that you have GraphViz
# and pydot or pydotplus installed on your computer.

from io import StringIO
from IPython.display import Image
import pydotplus as pydot


dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data, 
                     feature_names=feature_names,
                    filled=True, rounded=True,  
                    special_characters=True) 

graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())