# Static Analysis and Classification of Zeus Malware

The paper "Unveiling Zeus: automated classification of malware samples" [0] surveys a number of classification methods for differentiating binaries of the [Zeus malware family](https://en.wikipedia.org/wiki/Zeus_%28malware%29) from other types of malware. The authors used the `auto-mal` tool to perform [dynamic analysis](https://en.wikipedia.org/wiki/Dynamic_program_analysis) of the binaries as they ran, creating a set of sixty-five features for each binary sample. They then applied a number of classification methods and compared their accuracy.

Here, the techniques used above are applied to a data set generated via [static analysis](https://en.wikipedia.org/wiki/Static_program_analysis) of malware binaries.

In [None]:
%matplotlib notebook

import numpy
import pandas
import seaborn
import matplotlib.pyplot

from sklearn import metrics
from sklearn import neighbors
from sklearn import preprocessing

In [None]:
random_seed = 0

## Data Import

Source the data set from static analysis of three different families of malware:
  1. [Zeus](https://en.wikipedia.org/wiki/Zeus_%28malware%29)
  1. [Operation Cleaver](https://en.wikipedia.org/wiki/Operation_Cleaver)
  1. [APT-1](https://www.fireeye.com/content/dam/fireeye-www/services/pdfs/mandiant-apt1-report.pdf)
  
The following uses pre-analysed files, sourced from Mike Sconzo at [SecRepo](https://secrepo.com) under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

In [None]:
zeus = pandas.read_csv("resources/Zeus.csv")  # sourced from http://www.secrepo.com/Datasets%20Description/PE_malware/Zeus.html
zeus['Source'] = 'zeus'
zeus.describe().transpose()

In [None]:
zeus.describe(include=object).transpose()

In [None]:
op_cleaver = pandas.read_csv("resources/OPCleaver.csv")  # sourced from http://www.secrepo.com/Datasets%20Description/PE_malware/OPCleaver.html
op_cleaver['Source'] = 'op_cleaver'
op_cleaver.describe().transpose()

In [None]:
op_cleaver.describe(include=object).transpose()

In [None]:
virus_share = pandas.read_csv("resources/VirusShare.csv")  # sourced from http://www.secrepo.com/Datasets%20Description/PE_malware/VirusShare.html
virus_share['Source'] = 'virus_share'
virus_share.describe().transpose()

In [None]:
virus_share.describe(include=object).transpose()

The above data sets show 11 numerical features and 2 categorical features (plus the `Source` column added at import time). There is little documentation on how these features were generated. They are a combination of PE file header fields (e.g. `SectionAlignment`) and the results of further analysis (e.g. `HighEntropy`).

### Data Cleanup

The Zeus and Operation Cleaver data sets both have the column "SizeOfHeaders.1", which is missing from the APT-1/VirusShare data set. Check that these are duplicates of the "SizeOfHeaders" column, then delete them:

In [None]:
if all(zeus['SizeOfHeaders'] == zeus['SizeOfHeaders.1']):
    del zeus['SizeOfHeaders.1']

In [None]:
if all(op_cleaver['SizeOfHeaders'] == op_cleaver['SizeOfHeaders.1']):
    del op_cleaver['SizeOfHeaders.1']
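
The `.1` suffix is how `pandas.read_csv` renames a repeated column header, which suggests the source CSV files list `SizeOfHeaders` twice. The following is a minimal sketch of the same check-then-drop step on a hypothetical two-row CSV (the file contents are invented for illustration and are not taken from the real data sets):

```python
import io

import pandas

# Hypothetical CSV with a repeated header, standing in for Zeus.csv.
csv_text = (
    "FileName,SizeOfHeaders,SizeOfHeaders\n"
    "a.bin,1024,1024\n"
    "b.bin,4096,4096\n"
)
frame = pandas.read_csv(io.StringIO(csv_text))

# read_csv keeps the first header as-is and appends ".1" to the repeat.
columns_after_read = list(frame.columns)

# Only drop the second column if every value matches the first.
is_duplicate = (frame["SizeOfHeaders"] == frame["SizeOfHeaders.1"]).all()
if is_duplicate:
    frame = frame.drop(columns="SizeOfHeaders.1")
```

Note that an element-wise `==` treats missing values as unequal, so a column pair containing `NaN` in the same rows would not be dropped by this check.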

There is a large size disparity between the three data sets. Combine the Operation Cleaver and VirusShare data sets to create a single data set with 392 non-Zeus samples.

In [None]:
non_zeus = pandas.concat(  # DataFrame.append was removed in pandas 2.0
    [op_cleaver, virus_share],
    ignore_index=True,  # generate a fresh index for the combined set
)
non_zeus['Source'] = 'non-zeus'
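
The effect of `ignore_index=True` when combining two frames can be seen on a toy example. This sketch uses `pandas.concat` on two hypothetical miniature frames (the values are invented) and then relabels the combined rows, as above:

```python
import pandas

# Hypothetical miniature stand-ins for the op_cleaver and virus_share frames.
first = pandas.DataFrame({"SizeOfHeaders": [1024, 4096], "Source": "op_cleaver"})
second = pandas.DataFrame({"SizeOfHeaders": [512], "Source": "virus_share"})

# Without ignore_index the result would carry the row labels 0, 1, 0.
combined = pandas.concat([first, second], ignore_index=True)

# Collapse both origins into a single "non-zeus" class label.
combined["Source"] = "non-zeus"
```

A unique index matters later on, because the training/testing split drops rows by index label.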

## Preparation of the data sets

To mirror the work in the paper we will use an equal number of Zeus and non-Zeus samples in the learning data. Since the Zeus data set is much larger than the non-Zeus data set, take a random sample of it that matches the non-Zeus data set in size.

In [None]:
len(non_zeus)

In [None]:
zeus = zeus.sample(n=392, random_state=random_seed)
len(zeus)

Now we need to split the data into training and testing sets. In line with the paper, keep 10% of the data for testing. We sample separately from the Zeus and non-Zeus data, which ensures that equal numbers of Zeus and non-Zeus data points are used during both training and testing.

In [None]:
zeus_training = zeus.sample(frac=0.9, random_state=random_seed)
zeus_testing = zeus.drop(index=zeus_training.index)

len(zeus_training), len(zeus_testing)

In [None]:
non_zeus_training = non_zeus.sample(frac=0.9, random_state=random_seed)
non_zeus_testing = non_zeus.drop(index=non_zeus_training.index)

len(non_zeus_training), len(non_zeus_testing)
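
The per-class sampling above is a form of stratified splitting. As an alternative (not what this notebook uses), scikit-learn's `train_test_split` can stratify on the label column in one call. A minimal sketch on hypothetical toy data:

```python
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical balanced data set: 20 rows per class, one dummy feature.
frame = pandas.DataFrame({
    "feature": range(40),
    "Source": ["zeus"] * 20 + ["non-zeus"] * 20,
})

# stratify keeps the class proportions identical in both partitions.
training, testing = train_test_split(
    frame,
    test_size=0.1,
    stratify=frame["Source"],
    random_state=0,
)

training_counts = training["Source"].value_counts().to_dict()
testing_counts = testing["Source"].value_counts().to_dict()
```

Either approach keeps the testing set balanced, so the accuracy score is not skewed by one class being over-represented.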

Combine the Zeus and non-Zeus data sets, extracting the 'Source' column so that it is excluded from modelling. Save these known classifications in a separate variable: it is used during the training stage to map each entry to its group, and during the testing stage to determine the accuracy of the model.

In [None]:
training_set = pandas.concat([zeus_training, non_zeus_training], ignore_index=True)
training_source = training_set['Source']
del training_set['Source']

In [None]:
testing_set = pandas.concat([zeus_testing, non_zeus_testing], ignore_index=True)
testing_source = testing_set['Source']
del testing_set['Source']

The paper evaluates four methods of classifying binaries:
  1. Support Vector Classification/Machines
  2. Logistic Regression
  3. Classification/Decision Trees
  4. k-Nearest Neighbors (k-NN)
  
We have chosen to implement the k-NN approach described in the paper because of its ease of implementation using modern machine learning toolkits. In this case, we will use the `KNeighborsClassifier` from [scikit-learn](http://scikit-learn.org) [1].

## k-Nearest-Neighbours
Resources:
  - http://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/tutorial/plot_knn_iris.html
  - https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
  - https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
  
  
The book "Introduction to machine learning" [2]  ... TODO


### Categorical values

The k-NN algorithm can only be used on numeric values (since it needs to compute a distance between samples). For each categorical column in the data set we should either remove it or map it into a numeric space:
  - `TimeDateStamp` :: This can easily be mapped onto a UNIX epoch timestamp, which should allow computing sane distances between times.
  - `FileName` :: These values don't seem to be the natural filenames the binaries were distributed under, and are instead a concatenation of the malware type (e.g. 'zeusbin') and a hash. This column should be removed, since it leaks the classification we are trying to predict.
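
To make the role of the distance concrete, here is a minimal pure-Python sketch of k-NN classification: measure the Euclidean distance from a query point to every training point, keep the `k` closest, and take a majority vote. The function `knn_predict` and the sample points are hypothetical; scikit-learn's implementation is more sophisticated (for example in how it searches for neighbours and breaks ties).

```python
import math
from collections import Counter


def knn_predict(training_points, training_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Pair every training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(training_points, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature samples forming two well-separated clusters.
training_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
training_labels = ["non-zeus", "non-zeus", "non-zeus", "zeus", "zeus", "zeus"]

prediction_near_origin = knn_predict(training_points, training_labels, (0.15, 0.15), k=3)
prediction_near_one = knn_predict(training_points, training_labels, (0.95, 1.0), k=3)
```

A string such as a file name has no meaningful position in this space, which is why each categorical column must be dropped or converted first.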

#### FileName

In [None]:
del training_set['FileName']
training_set.head()

In [None]:
del testing_set['FileName']
testing_set.head()

#### TimeDateStamp

In [None]:
def parse_time_date_stamp(time_date_stamp):
    """
    Reads in a TimeDateStamp string and parses it, 
    returning a POSIX timestamp.
    
    Example TimeDateStamp:
        0x50FDE944 [Tue Jan 22 01:20:04 2013 UTC]
        
    where 0x50FDE944 is a hex string representation
    of the number of seconds from the UNIX epoch (in
    UTC).
    """
    hex_string = time_date_stamp.split()[0]
    posix_timestamp = int(hex_string, 16)
    return posix_timestamp

In [None]:
# Example:
import datetime

example = zeus['TimeDateStamp'].iloc[0]  # select by column name rather than position
print(example)

# Parse timestamp and convert that to a Python datetime, then print it. Does it match the above?
print(datetime.datetime.fromtimestamp(parse_time_date_stamp(example), tz=datetime.timezone.utc))

It works! Apply this conversion function to the data set:

In [None]:
training_set['TimeDateStamp'] = training_set['TimeDateStamp'].apply(func=parse_time_date_stamp)
training_set.head()

In [None]:
testing_set['TimeDateStamp'] = testing_set['TimeDateStamp'].apply(func=parse_time_date_stamp)
testing_set.head()
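
As a cross-check of the `TimeDateStamp` format, the hex field can be decoded and compared against the bracketed human-readable date. This is a standalone sketch using only the example string from the docstring above; the helper name `split_time_date_stamp` is hypothetical, and the `strptime` pattern assumes English day and month names.

```python
import datetime


def split_time_date_stamp(time_date_stamp):
    """Return (posix_seconds, bracketed_datetime) for one TimeDateStamp string."""
    hex_part, bracketed = time_date_stamp.split(" ", 1)
    posix_seconds = int(hex_part, 16)
    # "[Tue Jan 22 01:20:04 2013 UTC]" -> "Tue Jan 22 01:20:04 2013"
    readable = bracketed.strip("[]").replace(" UTC", "")
    bracketed_datetime = datetime.datetime.strptime(
        readable, "%a %b %d %H:%M:%S %Y"
    ).replace(tzinfo=datetime.timezone.utc)
    return posix_seconds, bracketed_datetime


example = "0x50FDE944 [Tue Jan 22 01:20:04 2013 UTC]"
posix_seconds, bracketed_datetime = split_time_date_stamp(example)

# Decode the hex field independently and compare the two representations.
decoded_datetime = datetime.datetime.fromtimestamp(posix_seconds, tz=datetime.timezone.utc)
fields_agree = decoded_datetime == bracketed_datetime
```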

### Normalising numeric values
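k-NN classifies by distance, so features with large numeric ranges (such as `TimeDateStamp`) would otherwise dominate features with small ranges. Standardise every feature, following the approach from https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/: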
``` py
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  
```

(docs: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [None]:
scaler = preprocessing.StandardScaler()  
scaler.fit(training_set)

training_set = scaler.transform(training_set)
testing_set = scaler.transform(testing_set)
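
A small sketch of what `StandardScaler` does, on hypothetical data with one very large feature (a POSIX timestamp) and one 0/1 flag. The scaler learns the mean and standard deviation from the training rows only, and the same transformation is then applied to the testing rows:

```python
import numpy
from sklearn import preprocessing

# Hypothetical features: a POSIX timestamp and a binary flag.
training = numpy.array([
    [1.30e9, 0.0],
    [1.31e9, 1.0],
    [1.32e9, 0.0],
    [1.33e9, 1.0],
])
testing = numpy.array([[1.315e9, 1.0]])

# Fit on the training data only, so no testing information leaks in.
scaler = preprocessing.StandardScaler().fit(training)
training_scaled = scaler.transform(training)
testing_scaled = scaler.transform(testing)

# After scaling, each training column has mean 0 and standard deviation 1.
training_means = training_scaled.mean(axis=0)
training_stds = training_scaled.std(axis=0)
```

Before scaling, the timestamps differ by around ten million while the flags differ by at most one, so an unscaled Euclidean distance would be decided almost entirely by the timestamp.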

### Training the model

In [None]:
binary_classifier = neighbors.KNeighborsClassifier(n_neighbors=35)
binary_classifier.fit(training_set, training_source)

Run the classifier on the whole testing set, and use the model's `.score()` method for a quick sanity check of the model's accuracy:

In [None]:
testing_predictions = binary_classifier.predict(testing_set)
binary_classifier.score(testing_set, testing_source)
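
For a classifier, `.score()` reports the mean accuracy, i.e. the fraction of predictions that match the known labels. The sketch below illustrates this on synthetic data generated with `make_classification` (an assumption for illustration; it is unrelated to the malware data set):

```python
from sklearn import datasets, metrics, neighbors

# Synthetic two-class data standing in for the real feature table.
features, labels = datasets.make_classification(
    n_samples=200, n_features=5, random_state=0
)
training_features, testing_features = features[:180], features[180:]
training_labels, testing_labels = labels[:180], labels[180:]

classifier = neighbors.KNeighborsClassifier(n_neighbors=35)
classifier.fit(training_features, training_labels)

predictions = classifier.predict(testing_features)
score = classifier.score(testing_features, testing_labels)

# The same number, computed explicitly from the predictions.
accuracy = metrics.accuracy_score(testing_labels, predictions)
```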

### Analysis
The effectiveness of this model can be analysed using a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). From the scikit-learn [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix):

>  By definition a confusion matrix $C$ is such that $C_{i, j}$
> is equal to the number of observations known to be in group $i$ but
> predicted to be in group $j$.
>
> Thus in binary classification, the count of true negatives is
> $C_{0,0}$, false negatives is $C_{1,0}$, true positives is
> $C_{1,1}$ and false positives is $C_{0,1}$.

In [None]:
confusion_matrix = metrics.confusion_matrix(
    testing_source,
    testing_predictions,
    labels=['non-zeus', 'zeus'],  # fix the row/column order to match the names below
)
confusion_matrix = pandas.DataFrame(
    data=confusion_matrix, 
    index=[
        'Non-Zeus', 
        'Zeus',
    ], 
    columns=[
        'Predicted Non-Zeus',
        'Predicted Zeus'
    ],
)

confusion_figure, confusion_axes = matplotlib.pyplot.subplots()
seaborn.heatmap(
    confusion_matrix,
    annot=True,
    fmt="d",
    cmap=seaborn.color_palette("Blues"),
    vmin=0,
    vmax=len(testing_set),
    ax=confusion_axes,
)

The specificity and sensitivity of the model can be easily computed from the confusion matrix above:

In [None]:
true_positives = confusion_matrix['Predicted Zeus']['Zeus']
false_positives = confusion_matrix['Predicted Zeus']['Non-Zeus']
true_negatives = confusion_matrix['Predicted Non-Zeus']['Non-Zeus']
false_negatives = confusion_matrix['Predicted Non-Zeus']['Zeus']

In [None]:
sensitivity = true_positives/(true_positives+false_negatives)
print('Sensitivity: {:.2f}%'.format(sensitivity*100))

In [None]:
specificity = true_negatives/(true_negatives+false_positives)
print('Specificity: {:.2f}%'.format(specificity*100))
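
As a worked example of these two formulas, the sketch below builds a confusion matrix from eight hypothetical predictions (invented for illustration) and unpacks the four counts with `ravel()`, treating `zeus` as the positive class:

```python
from sklearn import metrics

# Hypothetical known labels and predictions, four samples per class.
known = ["zeus", "zeus", "zeus", "zeus", "non-zeus", "non-zeus", "non-zeus", "non-zeus"]
predicted = ["zeus", "zeus", "zeus", "non-zeus", "non-zeus", "non-zeus", "zeus", "zeus"]

# Rows are known classes, columns are predicted classes, in the order given.
matrix = metrics.confusion_matrix(known, predicted, labels=["non-zeus", "zeus"])
true_negatives, false_positives, false_negatives, true_positives = matrix.ravel()

# Sensitivity: share of Zeus samples that were detected (3 of 4).
sensitivity = true_positives / (true_positives + false_negatives)
# Specificity: share of non-Zeus samples that were correctly rejected (2 of 4).
specificity = true_negatives / (true_negatives + false_positives)
```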

## K-Fold Validation
  - TODO: Move the above into a repeatable unit.
  - TODO: Use K-fold validation to more accurately score our predictions? https://en.wikipedia.org/wiki/Cross-validation_(statistics)
     - This should decrease the variance in our accuracy score:
     > To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance. 

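One possible shape for this, sketched on synthetic data from `make_classification` rather than the malware features: wrap the scaler and classifier in a pipeline so that scaling is re-fitted inside every fold, then score with stratified 10-fold cross-validation. The fold count and data are assumptions for illustration.

```python
from sklearn import datasets, model_selection, neighbors, pipeline, preprocessing

# Synthetic two-class data standing in for the real feature table.
features, labels = datasets.make_classification(
    n_samples=400, n_features=10, random_state=0
)

# The pipeline re-fits the scaler on each fold's training portion only.
model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(n_neighbors=35),
)

# Stratified folds keep the class balance the same in every partition.
folds = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = model_selection.cross_val_score(model, features, labels, cv=folds)
mean_score = fold_scores.mean()
```

Reporting `fold_scores.std()` alongside the mean gives a rough idea of how much a single 90/10 split could vary.
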

## Optimise number of neighbors in k-NN model
  - TODO: Plot `num_neighbors` vs `score` (use k-fold validation for accuracy) and find the best possible value.
  - TODO: Write up a bit about how the different distance metrics work.  
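
A possible starting point for the neighbour sweep, again on synthetic data from `make_classification` (an assumption, not the malware features): `GridSearchCV` cross-validates every candidate value of `n_neighbors` and records the mean score for each. The candidate range is arbitrary.

```python
from sklearn import datasets, model_selection, neighbors, pipeline, preprocessing

# Synthetic two-class data standing in for the real feature table.
features, labels = datasets.make_classification(
    n_samples=400, n_features=10, random_state=0
)

model = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("knn", neighbors.KNeighborsClassifier()),
])

# Arbitrary candidate values: 1, 6, 11, ..., 51.
candidate_neighbors = list(range(1, 52, 5))
search = model_selection.GridSearchCV(
    model,
    param_grid={"knn__n_neighbors": candidate_neighbors},
    cv=model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(features, labels)

best_neighbors = search.best_params_["knn__n_neighbors"]
# One mean cross-validated accuracy per candidate, in candidate order.
mean_scores = search.cv_results_["mean_test_score"]
```

Plotting `candidate_neighbors` against `mean_scores` would give the `num_neighbors` vs `score` curve described above. The distance metric could be swept in the same way by adding the classifier's `metric` parameter to the grid.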
  
  

## References

[0]: Abedelaziz Mohaisen and Omar Alrawi. 2013. Unveiling Zeus: automated classification of malware samples. In Proceedings of the 22nd International Conference on World Wide Web (WWW '13 Companion). ACM, New York, NY, USA, 829-832. DOI: https://doi.org/10.1145/2487788.2488056 PDF: https://alrawi.github.io/static/papers/unzeus_www13.pdf

[1]: [Scikit-learn: Machine Learning in Python](http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html), Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

[2]: E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.