# Lab 4, Exercise 3

In [1]:
import numpy as np
from pathlib import Path

## Load data 

The data is separated into three folders: Attack_Data_Master, Training_Data_Master, and Validation_Data_Master.
These can be found here:
data/exercise3/Training_Data_Master
data/exercise3/Validation_Data_Master
data/exercise3/Attack_Data_Master

All of the data in Training_Data_Master and Validation_Data_Master is normal,
and all of the data in Attack_Data_Master is malicious.

For the purpose of this exercise, you will ignore the predefined training/validation splits and simply use Training_Data_Master
and Validation_Data_Master as a single pool of normal data.

As mentioned, each system call trace is stored as a single file.  Treat each system call trace as a separate datapoint.

In [2]:
# Load all the normal system call traces (i.e., everything in Training_Data_Master and Validation_Data_Master)
# Load all the malicious system call traces (i.e., everything in Attack_Data_Master)

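def load_traces(dir_str):
    """Return the first line of every .txt file under dir_str (one trace per file)."""
    traces = []
    directory = Path(dir_str)

    # Sort the paths so the trace order is the same on every run and platform
    for file in sorted(directory.rglob('*.txt')):
        # The context manager closes each file once its trace has been read
        with file.open() as f:
            traces.append(f.readline())

    return traces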

train_traces = load_traces('data/exercise3/Training_Data_Master')
valid_traces = load_traces('data/exercise3/Validation_Data_Master')
attk_traces = load_traces('data/exercise3/Attack_Data_Master')

norm_traces = train_traces + valid_traces
mal_traces = attk_traces
all_traces = norm_traces + mal_traces

# Hint: A useful way to load this is as one or two Python lists, where each entry in the list corresponds to the text string
#       of system calls ids; feel free to use a single list for all the data, or separate lists for malicious versus normal
#       data

## Feature extraction

Tokenize and create a dataset where each datapoint corresponds to (normalized) counts of 
system call n-grams. Try various sizes of ngrams.

Reminder: A sequence of system call IDs that looks like this:
'6 6 63 6 42'

contains the following 3-grams:
'6 6 63'
'6 63 6'
'63 6 42'

Note: There are a number of ways you could code this up, but if you loaded the data
as lists of strings, you could consider using some of the feature extraction methods in 
sklearn.feature_extraction.text
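
As a plain-Python illustration of the reminder above, the sketch below lists the n-grams of a whitespace-separated trace. The helper name `ngrams` is hypothetical and is not part of sklearn; it is only meant to show what an n-gram extractor produces.

```python
def ngrams(trace, n):
    """Return the word n-grams of a whitespace-separated trace, in order."""
    tokens = trace.split()
    # Slide a window of n tokens across the trace, one token at a time
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams('6 6 63 6 42', 3))  # ['6 6 63', '6 63 6', '63 6 42']
```

A trace with fewer than n tokens yields an empty list.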

In [3]:
# Look at the classdemo notebook for an example of doing this
# CODE HERE

# Build feature extractor
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

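ngram_size = 5
# Note: the default token_pattern only keeps tokens of two or more characters, so
# single-digit system call IDs such as '6' are dropped before the n-grams are built.
# Passing token_pattern=r'(?u)\b\w+\b' would keep every ID.
count_vect = CountVectorizer(analyzer='word', ngram_range=(ngram_size, ngram_size))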

# Extract feature counts
raw_cnts = count_vect.fit_transform(all_traces)

# Collect the n-gram feature names (get_feature_names() was removed in scikit-learn 1.2)
features = count_vect.get_feature_names_out()

# Normalize counts
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False)
all_data = tf_transformer.fit_transform(raw_cnts)

all_labels = [0] * len(norm_traces) + [1] * len(mal_traces)
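
With `use_idf=False` and its default `norm='l2'`, `TfidfTransformer` rescales each row of counts to unit Euclidean length. The sketch below shows that normalization for a single trace in plain Python. The helper name `normalized_counts` and the example n-grams are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def normalized_counts(ngram_list):
    """Map each n-gram to its count divided by the Euclidean norm of all counts."""
    counts = Counter(ngram_list)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}

# Hypothetical trace in which one 2-gram occurs 3 times and another 4 times
example = ['6 6'] * 3 + ['6 63'] * 4
print(normalized_counts(example))
```

Because every row ends up with the same length, long and short traces become comparable: only the relative frequency of each n-gram matters.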


## Create train/test split

In [4]:
# Use 50% of the data for the training set and the rest for the test set
# CODE HERE

from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(all_data, all_labels, test_size=0.5, random_state=0)


## Train a classifier

In [5]:
# Please use Logistic Regression for this exercise
# Feel free to experiment with the various hyperparameters available to you in sklearn
# CODE HERE

from sklearn.linear_model import SGDClassifier

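# loss='log_loss' is the logistic-regression loss (named 'log' before scikit-learn 1.1);
# penalty=None disables regularization (older releases also accepted the string 'none')
classifier = SGDClassifier(loss='log_loss', penalty=None, random_state=0)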


## Inference and results

In [6]:
# Run inference on the test data and predict labels for each data point in the test data
# CODE HERE

model = classifier.fit(train_data, train_labels)
pred_labels = model.predict(test_data)

# Calculate and print the following metrics: precision, recall, f1-measure, and accuracy
# CODE HERE

from sklearn import metrics

precision = metrics.precision_score(test_labels, pred_labels)
recall = metrics.recall_score(test_labels, pred_labels)
f1measure = metrics.f1_score(test_labels, pred_labels)
accuracy = metrics.accuracy_score(test_labels, pred_labels)

print('Metrics:')
print(' precision = ' + str(precision)) # true positives / predicted positives
print('    recall = ' + str(recall))    # true positives / actual positives
print('F1-measure = ' + str(f1measure)) # harmonic mean of the precision and recall
print('  accuracy = ' + str(accuracy))  # correctly pred / sample size
print('\n')


Metrics:
 precision = 0.7867435158501441
    recall = 0.773371104815864
F1-measure = 0.78
  accuracy = 0.948252688172043




# Part 2: Varying class priors

Create several new test datasets where you have randomly subsampled the number of 
attack datapoints.

In particular, create the following datasets:
- 10 datasets where 25% of the attack datapoints are removed from the original test set
- 10 datasets where 50% of the attack datapoints are removed from the original test set
- 10 datasets where 75% of the attack datapoints are removed from the original test set
- 10 datasets where 90% of the attack datapoints are removed from the original test set
- 10 datasets where 95% of the attack datapoints are removed from the original test set

Report five sets of precision, recall, f1-measure, and accuracy corresponding to the following:
- Average precision, recall, f1-measure, accuracy for datasets where 25% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 50% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 75% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 90% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 95% of attack datapoints removed

Note: You will use the same model trained in part 1 for all of these datasets.  
All you are varying is the class priors during the inference stage.
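
To see why the class prior matters at inference time, the sketch below recomputes the four metrics from confusion-matrix counts before and after removing attack points. All counts are hypothetical and are not taken from this dataset; the helper name `metrics_from_counts` is also an assumption.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical test set: 100 attack points (80 detected), 900 normal points (20 flagged)
full = metrics_from_counts(tp=80, fp=20, fn=20, tn=880)

# Remove 90% of the attack points: tp and fn shrink together, fp and tn are untouched
reduced = metrics_from_counts(tp=8, fp=20, fn=2, tn=880)

for name, values in (('full', full), ('90% removed', reduced)):
    print(name, ['%.3f' % v for v in values])
```

In this toy case recall is unchanged, because it only involves attack points, while precision falls, because the same false positives are now weighed against fewer true positives.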

In [7]:
# Create subsets of the test set by randomly discarding X% of points with label +1
# CODE HERE

import random

percentages = [0.25, 0.50, 0.75, 0.90, 0.95]
num_datasets = 10

for percentage in percentages:
    precision = recall = f1measure = accuracy = 0
    for i in range(num_datasets):
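        # Keep every normal point; drop each attack point independently with
        # probability `percentage`, so the removed fraction is X% only on average
        indices_to_keep = []
        for j in range(len(test_labels)):
            if test_labels[j] == 0 or random.random() > percentage:
                indices_to_keep.append(j)
        
        new_test_data = test_data[indices_to_keep,:]
        # Use k here so the outer loop variable i is not visually shadowed
        new_test_labels = [test_labels[k] for k in indices_to_keep]
        pred_labels = model.predict(new_test_data)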

        precision += metrics.precision_score(new_test_labels, pred_labels)
        recall += metrics.recall_score(new_test_labels, pred_labels)
        f1measure += metrics.f1_score(new_test_labels, pred_labels)
        accuracy += metrics.accuracy_score(new_test_labels, pred_labels)

    print('Metrics:')
    print(' precision = ' + str(precision/num_datasets))
    print('    recall = ' + str(recall/num_datasets))    
    print('F1-measure = ' + str(f1measure/num_datasets)) 
    print('  accuracy = ' + str(accuracy/num_datasets))  
    print('\n')


Metrics:
 precision = 0.7352941440529622
    recall = 0.7675948239082071
F1-measure = 0.7510223024458988
  accuracy = 0.9528293411811075


Metrics:
 precision = 0.6494875824522242
    recall = 0.7811591647316893
F1-measure = 0.7091678371606347
  accuracy = 0.9598077891352963


Metrics:
 precision = 0.47985206345743947
    recall = 0.7841491598640965
F1-measure = 0.5949015257055825
  accuracy = 0.965728636648101


Metrics:
 precision = 0.2745454185686528
    recall = 0.7905918770062208
F1-measure = 0.4061598366552195
  accuracy = 0.9693108223096264


Metrics:
 precision = 0.14296853468277218
    recall = 0.7454417877444193
F1-measure = 0.23919249579558383
  accuracy = 0.9703012772710146
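
The loop in Part 2 drops each attack point independently with probability X, so the fraction actually removed matches X% only on average. If an exact count is wanted, one option is `random.sample`. The sketch below assumes the labels are a list of 0/1 values; the helper name `subsample_attacks` and the toy labels are hypothetical.

```python
import random

def subsample_attacks(labels, fraction_removed, rng):
    """Return the indices kept after removing round(fraction * n_attacks) attack points."""
    attack_idx = [i for i, y in enumerate(labels) if y == 1]
    n_remove = round(fraction_removed * len(attack_idx))
    # Sample without replacement so exactly n_remove distinct attack points are dropped
    removed = set(rng.sample(attack_idx, n_remove))
    return [i for i in range(len(labels)) if i not in removed]

rng = random.Random(0)       # seeded so the subsets are reproducible
labels = [0] * 12 + [1] * 8  # hypothetical labels: 12 normal, 8 attack
kept = subsample_attacks(labels, 0.75, rng)
print(len(kept), sum(labels[i] for i in kept))  # 14 2
```

The returned indices can be used to select rows of the test matrix and entries of the label list, in the same way as `indices_to_keep` in the cell above.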




# Questions

1) In Part 1, what size of ngrams gives the best performance? What are the tradeoffs as you change the size?

5-grams give the highest accuracy. As the n-gram size increases, recall and F1-measure decrease.

2) In Part 1, how does performance change if we use simple counts as features (i.e., 1-grams) as opposed to counts of 2-grams? What does this tell you about the role of sequences in prediction for this dataset?

The model performs better with 2-grams than with 1-grams. This suggests that the order of system calls matters: certain sequences of syscalls tend to appear in malicious traces, so sequences identify malicious activity better than counts of single syscalls alone.

3) How does performance change as a function of class prior in Part 2?

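As the number of attack datapoints decreases, accuracy increases slightly (from about 0.953 to 0.970), while precision falls sharply (from about 0.74 to 0.14) and F1-measure falls with it (from about 0.75 to 0.24). Recall stays roughly constant (between about 0.75 and 0.79) because it is computed only over attack datapoints and so does not depend on the class prior. Precision drops because the number of false positives among the normal data stays the same while the number of true positives shrinks. Accuracy rises because normal data, which the model classifies correctly more often than attack data, makes up a larger share of the test set. This could also reflect the training set containing a smaller percentage of attack datapoints than normal ones.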