# Static Analysis and Classification of Zeus Malware
An attempt at reproducing the work in ["Unveiling Zeus: Automated Classification of Malware Samples"](https://alrawi.github.io/static/papers/unzeus_www13.pdf).

In [None]:
%matplotlib notebook

import numpy
import pandas
import seaborn

from sklearn import metrics
from sklearn import neighbors
from sklearn import preprocessing

In [None]:
random_seed = 0

## Data Import
The aim of this task is to distinguish the Zeus malware from other types of malware. Load in samples describing both Zeus and other classes of malware:

In [None]:
zeus = pandas.read_csv("resources/Zeus.csv")  # sourced from http://www.secrepo.com/Datasets%20Description/PE_malware/Zeus.html
zeus['Source'] = 'zeus'
zeus.describe().transpose()

In [None]:
op_cleaver = pandas.read_csv("resources/OPCleaver.csv")  # sourced from http://www.secrepo.com/Datasets%20Description/PE_malware/OPCleaver.html
op_cleaver['Source'] = 'op_cleaver'
op_cleaver.describe().transpose()

In [None]:
virus_share = pandas.read_csv("resources/VirusShare.csv")  # sourced from http://www.secrepo.com/Datasets%20Description/PE_malware/VirusShare.html
virus_share['Source'] = 'virus_share'
virus_share.describe().transpose()

The Zeus and OPCleaver data sets both have an extra column "SizeOfHeaders.1". Ensure these are just duplicates of "SizeOfHeaders", then delete them:

In [None]:
if all(zeus['SizeOfHeaders'] == zeus['SizeOfHeaders.1']):
    del zeus['SizeOfHeaders.1']

In [None]:
if all(op_cleaver['SizeOfHeaders'] == op_cleaver['SizeOfHeaders.1']):
    del op_cleaver['SizeOfHeaders.1']

There is a large size disparity between the three data sets. Combine the OPCleaver and VirusShare data sets to create a single data set with 392 non-Zeus samples.

In [None]:
# DataFrame.append was removed in pandas 2.0; pandas.concat is the replacement.
non_zeus = pandas.concat(
    [op_cleaver, virus_share],
    ignore_index=True,  # generate new indexes for the combined set
)
non_zeus['Source'] = 'non-zeus'

## Preparation
*Strategy:* to mirror the work in the paper we will use an equal number of Zeus and non-Zeus samples in the learning data. 

In [None]:
len(non_zeus)

In [None]:
zeus = zeus.sample(n=len(non_zeus), random_state=random_seed)  # match the non-Zeus sample count
len(zeus)

Now we need to split the data into training and testing sets. Initially, use 10% of the data for testing.

In [None]:
zeus_training = zeus.sample(frac=0.9, random_state=random_seed)
zeus_testing = zeus.drop(index=zeus_training.index)

len(zeus_training), len(zeus_testing)

In [None]:
non_zeus_training = non_zeus.sample(frac=0.9, random_state=random_seed)
non_zeus_testing = non_zeus.drop(index=non_zeus_training.index)

len(non_zeus_training), len(non_zeus_testing)

Combine the Zeus and non-Zeus samples into a single training set and a single testing set. Extract the 'Source' columns so they are excluded from modelling, but available for comparison:

In [None]:
training_set = pandas.concat([zeus_training, non_zeus_training], ignore_index=True)
training_source = training_set['Source']
del training_set['Source']

In [None]:
testing_set = pandas.concat([zeus_testing, non_zeus_testing], ignore_index=True)
testing_source = testing_set['Source']
del testing_set['Source']
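
As an aside, the same kind of 90/10 split can be sketched with scikit-learn's `train_test_split`, which can also keep the class ratio equal in both parts via `stratify`. The table below is made-up stand-in data (the `toy_` names are hypothetical), not the data sets loaded above:

```python
import pandas
from sklearn import model_selection

# Made-up stand-in table: 20 'zeus' rows and 20 'non-zeus' rows.
toy_samples = pandas.DataFrame({
    'SizeOfHeaders': range(40),
    'Source': ['zeus'] * 20 + ['non-zeus'] * 20,
})

toy_training, toy_testing = model_selection.train_test_split(
    toy_samples,
    test_size=0.1,
    stratify=toy_samples['Source'],  # keep the class ratio equal in both parts
    random_state=0,
)

print(toy_training['Source'].value_counts())
print(toy_testing['Source'].value_counts())
```

This is an alternative to the per-class `sample`/`drop` approach used in this notebook, not a description of what the paper does.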

The paper evaluates three methods of classifying binaries:
  - Support Vector Classification/Machines
  - Classification/Decision Trees
  - K-Nearest Neighbor (KNN)
  
We have chosen to implement the KNN approach described in the paper because it is easy to understand (and therefore to analyse).

## k-Nearest-Neighbours
Resources:
  - http://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/tutorial/plot_knn_iris.html
  - https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
  - https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
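
To make the idea concrete, here is a minimal from-scratch sketch of k-NN classification: measure the distance from a query point to every training point, keep the `k` closest, and take a majority vote over their labels. The function name `knn_predict` and the toy points are illustrative assumptions; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(training_points, training_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
toy_labels = ['non-zeus', 'non-zeus', 'non-zeus', 'zeus', 'zeus', 'zeus']

print(knn_predict(toy_points, toy_labels, query=(0.5, 0.5), k=3))
print(knn_predict(toy_points, toy_labels, query=(5.5, 5.5), k=3))
```

Because every feature contributes directly to the distance, the sketch also hints at why categorical columns need converting and why numeric columns benefit from being put on a common scale.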

### Categorical values

The k-NN algorithm can only be used on numeric values, since it needs to compute a distance between samples. Each categorical column in the data set must therefore either be removed or mapped into a numeric space:
  - `TimeDateStamp` :: This can easily be mapped to a UNIX timestamp (seconds since the epoch), which allows sane distances between times to be computed.
  - `FileName` :: These values don't seem to be the natural filenames the binaries were distributed as; instead they are a concatenation of the malware type (e.g. 'zeusbin') and a hash. This column should be removed, since it gives the game away slightly.

#### FileName

In [None]:
del training_set['FileName']
training_set.head()

In [None]:
del testing_set['FileName']
testing_set.head()

#### TimeDateStamp

In [None]:
def parse_time_date_stamp(time_date_stamp):
    """
    Reads in a TimeDateStamp string and parses it, 
    returning a POSIX timestamp.
    
    Example TimeDateStamp:
        0x50FDE944 [Tue Jan 22 01:20:04 2013 UTC]
        
    where 0x50FDE944 is a hex string representation
    of the number of seconds from the UNIX epoch (in
    UTC).
    """
    hex_string = time_date_stamp.split()[0]
    posix_timestamp = int(hex_string, 16)
    return posix_timestamp

In [None]:
# Example:
import datetime

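# Select the column by name rather than by position, so column order does not matter.
example = zeus['TimeDateStamp'].iloc[0]
print(example)

# Parse timestamp and convert that to a timezone-aware (UTC) Python datetime, then print it. Does it match the above?
print(datetime.datetime.fromtimestamp(parse_time_date_stamp(example), tz=datetime.timezone.utc))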

It works! Apply this conversion function to the data set:

In [None]:
training_set['TimeDateStamp'] = training_set['TimeDateStamp'].apply(func=parse_time_date_stamp)
training_set.head()

In [None]:
testing_set['TimeDateStamp'] = testing_set['TimeDateStamp'].apply(func=parse_time_date_stamp)
testing_set.head()

### Normalising numeric values
From https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/:
``` py
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  
```

(docs: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [None]:
scaler = preprocessing.StandardScaler()  
scaler.fit(training_set)

training_set = scaler.transform(training_set)
testing_set = scaler.transform(testing_set)
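
To illustrate what the scaler does, the sketch below standardises a made-up feature matrix whose two columns sit on very different scales. The `toy_` names and values are assumptions for illustration only. Note that the scaler is fitted on the training rows alone and then reused for the testing rows, as in the cell above.

```python
import numpy
from sklearn import preprocessing

# Made-up feature matrix: column 0 is on a much larger scale than column 1.
toy_training = numpy.array([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]])
toy_testing = numpy.array([[4000.0, 4.0]])

toy_scaler = preprocessing.StandardScaler()
toy_scaler.fit(toy_training)  # learns the per-column mean and standard deviation

scaled_training = toy_scaler.transform(toy_training)
scaled_testing = toy_scaler.transform(toy_testing)  # reuses the training statistics

print(scaled_training.mean(axis=0))  # approximately 0 for each column
print(scaled_training.std(axis=0))   # approximately 1 for each column
print(scaled_testing)
```

After scaling, both columns contribute comparably to a Euclidean distance, whereas before scaling the first column would dominate it.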

### Training the model

In [None]:
binary_classifier = neighbors.KNeighborsClassifier(n_neighbors=35)
binary_classifier.fit(training_set, training_source)

Run the classifier on the whole testing set, and use the model's `.score()` method for a quick sanity check of the model's accuracy:

In [None]:
testing_predictions = binary_classifier.predict(testing_set)
binary_classifier.score(testing_set, testing_source)

### Analysis
The effectiveness of this model can be analysed using a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). From the scikit-learn [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix):

>  By definition a confusion matrix $C$ is such that $C_{i, j}$
> is equal to the number of observations known to be in group $i$ but
> predicted to be in group $j$.
>
> Thus in binary classification, the count of true negatives is
> $C_{0,0}$, false negatives is $C_{1,0}$, true positives is
> $C_{1,1}$ and false positives is $C_{0,1}$.
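
As a small worked illustration of that layout, the sketch below builds a confusion matrix from six made-up labels (not taken from the data sets above). Passing `labels` explicitly fixes which class is row/column 0 and which is row/column 1.

```python
from sklearn import metrics

# Made-up labels for illustration; not taken from the data sets above.
toy_truth = ['zeus', 'zeus', 'zeus', 'non-zeus', 'non-zeus', 'non-zeus']
toy_predicted = ['zeus', 'zeus', 'non-zeus', 'non-zeus', 'non-zeus', 'zeus']

toy_matrix = metrics.confusion_matrix(
    toy_truth,
    toy_predicted,
    labels=['non-zeus', 'zeus'],  # row/column 0 is 'non-zeus', row/column 1 is 'zeus'
)

# Flattening row by row gives C[0,0], C[0,1], C[1,0], C[1,1].
toy_tn, toy_fp, toy_fn, toy_tp = toy_matrix.ravel()
print(toy_matrix)
print(toy_tn, toy_fp, toy_fn, toy_tp)
```

Here one Zeus sample is missed (a false negative) and one non-Zeus sample is wrongly flagged (a false positive), so each off-diagonal cell holds 1.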

In [None]:
confusion_matrix = metrics.confusion_matrix(
    testing_source,
    testing_predictions,
    labels=['non-zeus', 'zeus'],  # fix the row/column order to match the names below
)
confusion_matrix = pandas.DataFrame(
    data=confusion_matrix, 
    index=[
        'Non-Zeus', 
        'Zeus',
    ], 
    columns=[
        'Predicted Non-Zeus',
        'Predicted Zeus'
    ],
)

seaborn.heatmap(
    confusion_matrix,
    annot=True,
    fmt="d",
    cmap=seaborn.color_palette("Blues"),
    vmin=0,
    vmax=len(testing_set),
)

The specificity and sensitivity of the model can be easily computed from the confusion matrix above:

In [None]:
true_positives = confusion_matrix['Predicted Zeus']['Zeus']
false_positives = confusion_matrix['Predicted Zeus']['Non-Zeus']
true_negatives = confusion_matrix['Predicted Non-Zeus']['Non-Zeus']
false_negatives = confusion_matrix['Predicted Non-Zeus']['Zeus']

In [None]:
sensitivity = true_positives/(true_positives+false_negatives)
print('Sensitivity: {:.2f}%'.format(sensitivity*100))

In [None]:
specificity = true_negatives/(true_negatives+false_positives)
print('Specificity: {:.2f}%'.format(specificity*100))

## K-Fold Validation
  - TODO: Move the above into a repeatable unit.
  - TODO: Use K-fold validation to more accurately score our predictions?
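
One possible shape for this, sketched under assumptions: wrap the scaler and classifier in a scikit-learn pipeline and score it with `cross_val_score` over stratified folds. The data here is synthetic (`make_classification`) and the `toy_` names are hypothetical; in the notebook the inputs would be the unscaled feature table and the matching 'Source' labels.

```python
from sklearn import datasets, model_selection, neighbors, pipeline, preprocessing

# Synthetic stand-in data (an assumption), not the malware data sets.
toy_features, toy_labels = datasets.make_classification(
    n_samples=200, n_features=10, random_state=0,
)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training part of every fold, so no test-fold statistics leak into it.
toy_model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(n_neighbors=35),
)

folds = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = model_selection.cross_val_score(
    toy_model, toy_features, toy_labels, cv=folds,
)

print('Mean accuracy: {:.2f} (std {:.2f})'.format(fold_scores.mean(), fold_scores.std()))
```

Averaging over ten folds gives a steadier estimate than the single 10% hold-out used above, at the cost of fitting the model ten times.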


## Optimise number of neighbors in k-NN model
  - TODO: Plot `num_neighbors` vs `score` (use k-fold validation for accuracy) and find the best possible value.
  - TODO: Write up a bit about how the different distance metrics work.  
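
A possible starting point for both items, again on synthetic stand-in data: a cross-validated grid search over the number of neighbours and a few distance metrics that `KNeighborsClassifier` accepts. The parameter ranges and `toy_` names are assumptions chosen for illustration, not values from the paper.

```python
from sklearn import datasets, model_selection, neighbors, pipeline, preprocessing

# Synthetic stand-in data (an assumption), not the malware data sets.
toy_features, toy_labels = datasets.make_classification(
    n_samples=200, n_features=10, random_state=0,
)

toy_model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)

# make_pipeline names each step after its lower-cased class name, so the
# classifier's parameters are addressed as 'kneighborsclassifier__<name>'.
search = model_selection.GridSearchCV(
    toy_model,
    param_grid={
        'kneighborsclassifier__n_neighbors': list(range(1, 40, 2)),
        'kneighborsclassifier__metric': ['euclidean', 'manhattan', 'chebyshev'],
    },
    cv=model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(toy_features, toy_labels)

print(search.best_params_)
print('Best mean accuracy: {:.2f}'.format(search.best_score_))
```

The per-combination scores are available in `search.cv_results_['mean_test_score']`, which could be plotted against the number of neighbours for each metric.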