# Botnet Classification

## Pre-processing

This function reads the given scenario from the [CTU-13 dataset](http://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html) and loads it into a Pandas dataframe.
It performs the following pre-processing steps:
1. Convert labels starting with "flow=From-Botnet" to 1 and labels starting with "flow=From-Normal" to 0.
    - This follows a suggestion in the dataset's README:
        - "Please note that the labels of the flows generated by the malware start with "From-Botnet". The labels "To-Botnet" are flows sent to the botnet by unknown computers, so they should not be considered malicious perse. Also for the normal computers, the counts are for the labels "From-Normal". The labels "To-Normal" are flows sent to the botnet by unknown computers, so they should not be considered malicious perse."
1. Drop rows that contain a null value in at least one of these columns: "DstAddr", "SrcAddr", "Dport", "Sport", "Label".
    - These attributes are used to build the classifier.
1. Remove the background flows.
    - This was suggested by the professor.

In [1]:
import os

import pandas as pd


def read_from_file(scenario):
    print("Reading from file. Scenario: %s" % scenario)

    # Get the path of the file
    dir_path = os.path.join("..", "data", "CTU-13-Dataset", str(scenario))
    # Build a list so it can be indexed (filter() returns an iterator in Python 3)
    file_name = [x for x in os.listdir(dir_path) if x.endswith(".binetflow")][0]
    file_path = os.path.join(dir_path, file_name)

    # Read the csv file into a pandas dataframe
    # Convert label: "flow=From-Botnet" to 1, label: "flow=From-Normal" to 0 and the rest to 2
    converters = {"Label": lambda x: 1 if x.startswith("flow=From-Botnet") else (0 if x.startswith("flow=From-Normal") else 2)}
    df = pd.read_csv(file_path, skip_blank_lines=True, delimiter=",", converters=converters)

    # Drop rows that contain a null value in at least one of these columns: "DstAddr", "SrcAddr", "Dport", "Sport", "Label"
    df.dropna(subset=["DstAddr", "SrcAddr", "Dport", "Sport", "Label"], inplace=True, how="any")

    # Remove the background flows (everything that was mapped to label 2)
    df = df[df.Label != 2]

    print("\tDone!!")
    return df
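
The three pre-processing steps can also be illustrated without the dataset on disk. The following is a minimal sketch using only the standard library; the sample rows, the addresses, the label suffixes and the helper names `label_to_class` and `preprocess` are made up for illustration and are not part of the notebook.

```python
import csv
import io

# Hypothetical miniature excerpt; real .binetflow files have more columns.
SAMPLE = """SrcAddr,Sport,DstAddr,Dport,Label
192.0.2.10,1025,198.51.100.1,53,flow=From-Botnet-Example-UDP
192.0.2.20,4455,198.51.100.2,80,flow=From-Normal-Example
192.0.2.10,1030,198.51.100.3,,flow=From-Botnet-Example-TCP
203.0.113.5,80,192.0.2.10,1025,flow=To-Botnet-Example-TCP
203.0.113.9,443,203.0.113.7,51000,flow=Background-Example
"""

REQUIRED = ["DstAddr", "SrcAddr", "Dport", "Sport", "Label"]


def label_to_class(label):
    """Map a raw label to 1 (botnet), 0 (normal) or 2 (everything else)."""
    if label.startswith("flow=From-Botnet"):
        return 1
    if label.startswith("flow=From-Normal"):
        return 0
    return 2


def preprocess(text):
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Drop rows with a missing value in one of the required columns
        if any(row[col] == "" for col in REQUIRED):
            continue
        # Convert the textual label to a class number
        row["Label"] = label_to_class(row["Label"])
        # Drop everything that is neither From-Botnet nor From-Normal
        if row["Label"] == 2:
            continue
        rows.append(row)
    return rows


flows = preprocess(SAMPLE)
print([(f["SrcAddr"], f["Label"]) for f in flows])
```

Only the first two sample rows survive: the third has an empty "Dport", and the last two carry labels that map to class 2.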

## Pickle functions

Many of the scenarios contain a large amount of data. The pickle library is used to save intermediate results to file, to avoid recomputing them every time. Here are some helper functions.

In [2]:
import os
import pickle


def read_pickle_file(file_name):
    print("Reading pickle file: %s" % file_name)
    path = os.path.join("..", "data", "pickle", file_name)
    # Pickle data is binary, so the file must be opened in "rb" mode
    with open(path, "rb") as pickle_file:
        result = pickle.load(pickle_file)
    print("\tDone!!")
    return result


def save_using_pickle(python_object, file_name):
    print("Writing python object to file: %s" % file_name)
    # Pickle data is binary, so the file must be opened in "wb" mode
    with open(os.path.join("..", "data", "pickle", file_name), "wb") as pickle_file:
        pickle.dump(python_object, pickle_file)
    print("\tDone!!")


def save_to_file(content, file_name):
    # Write plain text (used for the evaluation metrics)
    with open(os.path.join("..", "data", "pickle", file_name), "w") as file_:
        file_.write(content)


def read_txt_file(file_name):
    # Read plain text back from file
    with open(os.path.join("..", "data", "pickle", file_name), "r") as file_:
        return file_.read()
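
As a quick sanity check of the save-and-reload idea, the sketch below round-trips a small object through pickle in a temporary directory. The sample dictionary and the file name `grouping.p` are hypothetical. Note that pickle files are binary, so under Python 3 they are opened with "wb" and "rb".

```python
import os
import pickle
import tempfile

# Hypothetical intermediate result: value -> (count, {other value -> count})
grouping = {"192.0.2.1": (3, {"80": 2, "443": 1})}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "grouping.p")

    # Write the object in binary mode
    with open(path, "wb") as handle:
        pickle.dump(grouping, handle)

    # Read it back in binary mode
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored object should compare equal to the original
print(restored == grouping)
```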

## MINDS feature extraction

The [MINDS](https://www.scopus.com/record/display.uri?eid=2-s2.0-85008699758&origin=inward&txGid=32D807FF8E8677E35E2A45FE0AAED286.wsnAw8kcdt7IPYLO0V48gA%3a2) (Minnesota Intrusion Detection System) feature set will be used to build a classifier.

The following aggregate features are computed for each NetFlow:
1. The number of NetFlows from the same source IP address as the evaluated NetFlow
2. The number of NetFlows towards the same destination host
3. The number of NetFlows towards the same destination host from the same source port
4. The number of NetFlows from the same source host towards the same destination port
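
The four counts listed above can be sketched on a handful of flows with `collections.Counter`. This is only an illustration under assumed toy data (the addresses and ports are invented); the notebook itself builds nested dictionaries with pandas instead.

```python
from collections import Counter

# Hypothetical flows as (SrcAddr, Sport, DstAddr, Dport)
flows = [
    ("10.0.0.1", "1025", "10.0.0.9", "80"),
    ("10.0.0.1", "1026", "10.0.0.9", "80"),
    ("10.0.0.1", "1025", "10.0.0.8", "53"),
    ("10.0.0.2", "1025", "10.0.0.9", "80"),
]

# 1. flows per source address, 2. flows per destination address
src_count = Counter(src for src, _, _, _ in flows)
dst_count = Counter(dst for _, _, dst, _ in flows)
# 3. flows per (destination address, source port)
dst_sport_count = Counter((dst, sport) for _, sport, dst, _ in flows)
# 4. flows per (source address, destination port)
src_dport_count = Counter((src, dport) for src, _, _, dport in flows)

# One 4-tuple of counts per flow, in the order of the list above
features = [
    (
        src_count[src],
        dst_count[dst],
        dst_sport_count[(dst, sport)],
        src_dport_count[(src, dport)],
    )
    for src, sport, dst, dport in flows
]

for flow, feature in zip(flows, features):
    print(flow, feature)
```

For example, the first flow shares its source address with three flows, its destination with three flows, its destination and source port with two flows, and its source and destination port with two flows.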

This function computes the groupings needed to extract the new feature vector for each of the given scenarios and saves them to file.

In [3]:
def save_groupings_scenarios(scenarios):
    # This function groups the dataframe by the given first_column and saves the count of each value.
    # Furthermore, it groups the NetFlows of each of these groups by the second_column and saves the count of each value.
    def nested_grouping(df, first_column, second_column):
        print("Performing nested grouping with columns: %s, %s" % (first_column, second_column))
        groupby_fc = df.groupby(first_column)
        result = {}
        fc_values = groupby_fc.groups.keys()
        i = 0
        for fc_value in fc_values:
            # Save the number of occurrences of the current first column value
            groupby_fc_current = groupby_fc.get_group(fc_value)
            result[fc_value] = (len(groupby_fc_current.index), {})

            # Loop over the values of the second column and save their number of occurrences
            groupby_fc_sc = groupby_fc_current.groupby(second_column)
            for sc_value in groupby_fc_sc.groups.keys():
                groupby_fc_sc_current = groupby_fc_sc.get_group(sc_value)
                result[fc_value][1][sc_value] = len(groupby_fc_sc_current.index)
            i += 1
            if i % 200 == 0:
                print("%s/%s, percentage: %.1f%%" % (i, len(fc_values), 100.0 * i / len(fc_values)))
        print("\tDone!!")
        return result
    
    # This function saves the nested grouping to a pickle file
    def save_groupings_df(df, first_column, second_column, name):
        # Get the groupings
        output = nested_grouping(df, first_column, second_column)

        # Write the groupings to file
        save_using_pickle(output, name)

    for scenario in scenarios:
        df = read_from_file(scenario)
        save_groupings_df(df, "SrcAddr", "Dport", os.path.join("grouping", "src_dport_%s.p" % scenario))
        save_groupings_df(df, "DstAddr", "Sport", os.path.join("grouping", "dst_sport_%s.p" % scenario))


This function reads the groupings from file and uses them to extract the new MINDS features. Finally, the feature vectors are saved to file.

In [4]:
def save_feature_vectors(scenarios):

    def build_feature_vector(scenario):
        df = read_from_file(scenario)

        src_dport = read_pickle_file(os.path.join("grouping", "src_dport_%s.p" % scenario))
        dst_sport = read_pickle_file(os.path.join("grouping", "dst_sport_%s.p" % scenario))
        feature_vector = []

        for index, row in df.iterrows():
            nr_src = src_dport[row["SrcAddr"]][0]
            nr_dst = dst_sport[row["DstAddr"]][0]

            nr_sport = dst_sport[row["DstAddr"]][1][row["Sport"]]
            nr_dport = src_dport[row["SrcAddr"]][1][row["Dport"]]

            feature_vector.append([row["SrcAddr"], nr_src, nr_dst, nr_sport, nr_dport, row["Label"]])
        return feature_vector

    for scenario in scenarios:
        feature_vector = build_feature_vector(scenario)
        save_using_pickle(feature_vector, os.path.join("feature_vector","fv_%s.p" % scenario))


## Classification

The following functions read the computed feature vectors of the given scenarios. An SVM is used for classification.

- Evaluation strategy:
   - The classifier is evaluated per scenario (as suggested on Slack) using 3-fold cross-validation.
   - Packet (NetFlow) level
       - Each NetFlow is classified as normal or attack.
   - Host level
       - A host is classified as an attacker if it has sent at least one NetFlow that is labeled as attack.
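
The host-level rule can be sketched as follows. This is a simplified illustration, not the notebook's own function: it assumes integer labels and returns one label per unique host, and the helper name `aggregate_per_host` and the sample data are hypothetical.

```python
from collections import defaultdict


def aggregate_per_host(host_ips, labels):
    """Return {host: 1 if any of its flows is labeled 1, else 0}."""
    host_labels = defaultdict(int)
    for host, label in zip(host_ips, labels):
        # A single attack flow is enough to mark the host as an attacker
        host_labels[host] = max(host_labels[host], label)
    return dict(host_labels)


# Hypothetical flows: source host of each flow with its real and predicted label
host_ips = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3"]
real = [1, 0, 0, 0, 0]
predicted = [0, 0, 0, 1, 0]

real_hosts = aggregate_per_host(host_ips, real)
predicted_hosts = aggregate_per_host(host_ips, predicted)

print(real_hosts)
print(predicted_hosts)
```

Because one flagged flow marks the whole host, a single false positive at flow level turns a normal host into a predicted attacker, which is worth keeping in mind when reading the host-level precision.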

In [5]:
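import os

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC


# Type can be "packet_level" or "host_level"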
def save_classification_error(scenarios, type):

    def get_evaluation_metrics(real_labels, predicted_labels, pos_label):
        print("Calculating evaluation metrics.")
        accuracy = accuracy_score(real_labels, predicted_labels)
        f1 = f1_score(real_labels, predicted_labels, pos_label=pos_label)
        precision = precision_score(real_labels, predicted_labels,pos_label=pos_label)
        recall = recall_score(real_labels, predicted_labels, pos_label=pos_label)
        confusion_matrix_ = confusion_matrix(real_labels, predicted_labels)

        resulting_metrics = {"accuracy": accuracy, "f1": f1,
                             "precision":precision, "recall": recall, "confusion_matrix": confusion_matrix_}
        print("\tDone!!")
        return resulting_metrics

    def evaluate_host_level(real_labels, predicted_labels, host_ips):
        # Map each host IP to the indices of its flows
        index_host_ips = {}
        for index in range(len(host_ips)):
            if host_ips[index] not in index_host_ips:
                index_host_ips[host_ips[index]] = []
            index_host_ips[host_ips[index]].append(index)

        # A host counts as an attacker if at least one of its flows is labeled "1".
        # Note: one host-level label is emitted per flow (not per unique host),
        # so hosts with many flows weigh more heavily in the metrics.
        new_real_labels = []
        new_predicted_labels = []
        for host_ip in host_ips:
            new_real_labels.append("1" if "1" in real_labels[index_host_ips[host_ip]] else "0")
            new_predicted_labels.append("1" if "1" in predicted_labels[index_host_ips[host_ip]] else "0")

        return new_real_labels, new_predicted_labels

    def evaluate(scenario):
        # Read the feature vector for the given scenario
        feature_vector = np.array(read_pickle_file(os.path.join("feature_vector", "fv_%s.p" % scenario)))

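        # Get the training set and the labels
        # The IP column makes numpy store every value as a string, so cast the features to float
        X = feature_vector[:, 1:5].astype(float)
        y = feature_vector[:, 5:6].flatten()
        host_ips = feature_vector[:, 0:1].flatten()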

        # Perform three fold cross validation
        real_labels = []
        predicted_labels = []
        test_host_ips = []
        k_fold = StratifiedKFold(n_splits=3)
        for train, test in k_fold.split(X, y):

            # Build a svm classifier
            classifier = SVC()
            classifier.fit(X[train], y[train])

            real_labels.extend(y[test])
            predicted_labels.extend(classifier.predict(X[test]))
            test_host_ips.extend(host_ips[test])

        pos_label = "1"
        if type == "host_level":
            real_labels, predicted_labels = evaluate_host_level(np.array(real_labels), np.array(predicted_labels), test_host_ips)

        return get_evaluation_metrics(real_labels, predicted_labels, pos_label)

    for scenario in scenarios:
        evaluation_metrics = evaluate(scenario)
        # Format one "metric = value" line per metric (reduce() is not a builtin in Python 3)
        evaluation_metrics_str = "\n".join("%s = %s" % (metric, value) for metric, value in evaluation_metrics.items())
        save_to_file(evaluation_metrics_str, os.path.join("evaluation_metric", type, "result_%s.txt" % scenario))


## Results

All scenarios were computed offline: we left the computer running overnight and saved the results to disk.
Scenarios 3 and 10 could not be computed because of their size and the memory limits of the laptop; Python raised a memory error.

The class distribution (attack versus normal) differs per scenario. To cope with this, multiple evaluation metrics are considered. The F1 score, precision and recall are calculated with respect to the attack class.
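
To make the relation between these metrics concrete, the sketch below recomputes them from the packet-level confusion matrix printed for Scenario 1 in this section. It assumes scikit-learn's layout, where rows are the true classes and columns the predicted classes, with the attack class as the positive class.

```python
# Packet-level confusion matrix of Scenario 1: [[TN, FP], [FN, TP]]
tn, fp = 29405, 840
fn, tp = 0, 40961

# Precision: fraction of flows predicted as attack that really are attacks
precision = tp / (tp + fp)
# Recall: fraction of real attack flows that were detected
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: fraction of all flows classified correctly (not class specific)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision = %.6f" % precision)
print("recall = %.6f" % recall)
print("f1 = %.6f" % f1)
print("accuracy = %.6f" % accuracy)
```

The values should agree with the ones printed for Scenario 1 at packet level, up to rounding.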

In [6]:
import os

def read_evaluation_metrics(scenario):
    result = "Packet level:\n"
    result += read_txt_file(os.path.join("evaluation_metric", "packet_level", "result_%s.txt" % scenario))
    result += "\n\nHost level:\n"
    result += read_txt_file(os.path.join("evaluation_metric", "host_level", "result_%s.txt" % scenario ))
    return result

### Scenario 1

In [7]:
print(read_evaluation_metrics(1))

Packet level:
f1 = 0.989850414441
recall = 1.0
confusion_matrix = [[29405   840]
 [    0 40961]]
precision = 0.979904786967
accuracy = 0.9882032413

Host level:
f1 = 0.732054295084
recall = 1.0
confusion_matrix = [[  260 29985]
 [    0 40961]]
precision = 0.57735460773
accuracy = 0.578897845687


### Scenario 2

In [8]:
print(read_evaluation_metrics(2))

Packet level:
f1 = 0.994774595031
recall = 1.0
confusion_matrix = [[ 8853   220]
 [    0 20941]]
precision = 0.989603515902
accuracy = 0.992670087293

Host level:
f1 = 0.82336288753
recall = 1.0
confusion_matrix = [[   88  8985]
 [    0 20941]]
precision = 0.699759406536
accuracy = 0.700639701473


### Scenario 4

In [9]:
print(read_evaluation_metrics(4))

Packet level:
f1 = 0.912087912088
recall = 0.838383838384
confusion_matrix = [[25184     0]
 [  240  1245]]
precision = 1.0
accuracy = 0.991000787431

Host level:
f1 = 1.0
recall = 1.0
confusion_matrix = [[25184     0]
 [    0  1485]]
precision = 1.0
accuracy = 1.0


### Scenario 5

In [10]:
print(read_evaluation_metrics(5))

Packet level:
f1 = 0.975554292211
recall = 0.952275249723
confusion_matrix = [[4657    0]
 [  43  858]]
precision = 1.0
accuracy = 0.992263404102

Host level:
f1 = 1.0
recall = 1.0
confusion_matrix = [[4657    0]
 [   0  901]]
precision = 1.0
accuracy = 1.0


You can get the evaluation metrics of the other scenarios using print(read_evaluation_metrics(scenario)).

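Among the scenarios shown, the performance of the classifier varies considerably: at host level the F1 score ranges from 0.73 (scenario 1) to 1.0 (scenarios 4 and 5), while at packet level it ranges from 0.91 (scenario 4) to 0.99 (scenario 2).
This suggests that the behavior of some botnets is easier to model than that of others.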