## Real Time Defense
### A Supervised Learning Approach to Detecting Malicious Network Traffic

In this simple example, we use a log of API calls made from IP addresses labeled as malicious or benign to train a model that predicts, from its API call pattern, whether an IP address is malicious or benign.

Let's start with the imports:


In [1]:
from datetime import time, date, datetime, timedelta
import csv
import random
from collections import Counter
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### A log generator
The following cell defines a function that generates log entries with IP addresses labeled as benign or malicious.

The function parameters have default values that we can override to try different scenarios.
- values - the number of entries to create
- benignIP - a vector of benign IP addresses
- hackerIP - a vector of hacker IP addresses
- apiEntries - a vector of API entries
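- bias - a weighting factor between 0 and 1; higher values lower the threshold and so produce more benign entries
- outlier - a hacker entry is forced whenever the running count of hacker entries is a multiple of this value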

The general idea of the function is that hackers tend to focus more on obscure APIs as they are more likely to contain bugs that can be exploited.

Review and execute the following cell to define the log generator function.


In [2]:
def CreateAPITraffic(
    values = 5000,
    benignIP = ['172:144:0:22', '172:144:0:23', 
                '172:144:0:24', '172:144:0:25',
                '172:144:0:26', '172:144:0:27'],
    hackerIP = ['175:144:22:2', '175:144:22:3',
                '175:144:22:4', '175:144:22:5',
                '175:144:22:6', '175:144:22:7'],
    apiEntries = ['Rarely', 'Sometimes', 'Regularly'],
    bias = .8, 
    outlier = 50):
    
    # Define the variables needed to perform tasks within
    # the function. You use data to hold the actual log entries
    # for return to the caller. The currTime and updateTime 
    # variables help create the log’s time entries. The selectedIP
    # variable holds one of the IP addresses provided as part of
    # benignIP or hackerIP arguments and is the IP address added to
    # the current log entry. The threshold determines the split
    # between benign and hacker log entries. The hackerCount and 
    # benignCount variables specify how many of each entry type
    # appears in the log.
    data = []
    currTime = time(0, 0, 0)
    updateTime = timedelta(seconds = 1)
    selectedIP = ""
    threshold = (len(apiEntries) * 2) - \
        (len(apiEntries) * 2 * bias)
    hackerCount = 0
    benignCount = 0

    # A loop for generating entries comes next. This code begins
    # by defining the time element of an individual log entry.
    for x in range(values):
        currTime = (datetime.combine(date.today(), 
                                     currTime)
                    + updateTime).time()
        
        # Selecting an API entry comes next.
        apiChoice = random.choice(apiEntries)
        
        # Determine which IP address to use for the data entry.
        # The CreateAPITraffic() function uses a combination of
        # approaches to make the determination based on the assumption
        # that the hacker will select less commonly used API calls to 
        # attack because these calls are more likely to contain bugs,
        # which is where threshold comes into play. However, it’s also
        # important to include a certain amount of noise in the form of
        # outliers as part of the dataset. This example uses hackerCount
        # as a means of determining when to create an outlier.
        choiceIndex = apiEntries.index(apiChoice) + 1
        randSelect = choiceIndex * \
            random.randint(1, len(apiEntries)) * bias
        if hackerCount % outlier == 0:
            selectedIP = random.choice(hackerIP)
        else:
            if randSelect >= threshold:
                selectedIP = random.choice(benignIP)
            else:
                selectedIP = random.choice(hackerIP)
        
        # Each entry is appended to data in turn. In addition, the code
        # also tracks whether the entry is a hacker or a benign entry.
        data.append([currTime.strftime("%H:%M:%S"), 
                     selectedIP, apiChoice])
        if selectedIP in hackerIP:
            hackerCount += 1
        else:
            benignCount += 1
    
    return (threshold, benignCount, hackerCount, data)

---
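To see how `bias` and `threshold` interact in the generator above, the following minimal sketch enumerates every combination of API position and random roll and lists the ones that fall below the threshold (and so would be assigned a hacker IP). The function name `hacker_combinations` is hypothetical and not part of the notebook, and the sketch ignores the outlier rule.

```python
def hacker_combinations(n_entries, bias):
    """List the (choiceIndex, roll) pairs that yield a hacker entry.

    Mirrors the threshold rule in CreateAPITraffic(), assuming the
    outlier branch is not taken.
    """
    threshold = (n_entries * 2) - (n_entries * 2 * bias)
    combos = []
    for choice_index in range(1, n_entries + 1):
        for roll in range(1, n_entries + 1):
            rand_select = choice_index * roll * bias
            # Values below the threshold select a hacker IP.
            if rand_select < threshold:
                combos.append((choice_index, roll))
    return threshold, combos


# Default settings: three API entries and a bias of .8.
threshold, combos = hacker_combinations(3, 0.8)
share = len(combos) / (3 * 3)
print(combos, share)
```

With the defaults, only the first API entry ('Rarely') combined with a roll of 1 falls below the threshold, so roughly one entry in nine is a hacker entry before outliers are added.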

Next, we will persist the generated log to a CSV file. Review and execute the following cell to define a function that writes a header row followed by the log entries.

In [3]:
def SaveDataToCSV(data = [], fields = [], 
                  filename = "test.csv"):
    with open(filename, 'w', newline='') as file:
        write = csv.writer(file, delimiter=',')
        write.writerow(fields)
        write.writerows(data)

---
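As a quick check of the file layout, the sketch below writes a few rows in the same way and reads them back. The helper names `save_rows` and `load_rows`, the sample rows, and the temporary file name are assumptions made for illustration only.

```python
import csv
import os
import tempfile


def save_rows(rows, fields, filename):
    """Write a header row followed by the data rows."""
    with open(filename, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=",")
        writer.writerow(fields)
        writer.writerows(rows)


def load_rows(filename):
    """Return the header and the data rows as lists of strings."""
    with open(filename, newline="") as handle:
        reader = csv.reader(handle, delimiter=",")
        header = next(reader)
        return header, list(reader)


fields = ["Time", "IP_Address", "API_Call"]
rows = [["00:00:01", "172:144:0:22", "Often1"],
        ["00:00:02", "175:144:22:2", "Rarely"]]

# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    save_rows(rows, fields, path)
    header, loaded = load_rows(path)

print(header)
print(loaded)
```

Opening the file with `newline=''` lets the `csv` module control line endings itself, which avoids blank lines between rows on Windows.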

### Generate Data
Now, let's create some API entries and IP addresses to use in our simulation.  We will use the following vectors to generate the logs.

Review and execute the following cell to define the vectors.

In [4]:
callNames = ['Rarely', 
             'Sometimes1', 'Sometimes2',
             'Regularly1', 'Regularly2', 'Regularly3',
             'Often1', 'Often2', 'Often3', 'Often4', 
             'Often5', 'Often6', 'Often7', 'Often8']
benignIPs = ['172:144:0:22', '172:144:0:23', 
             '172:144:0:24', '172:144:0:25', 
             '172:144:0:26', '172:144:0:27',
             '172:144:0:28', '172:144:0:29', 
             '172:144:0:30', '172:144:0:31',
             '172:144:0:32', '172:144:0:33',
             '172:144:0:34', '172:144:0:35',
             '172:144:0:36', '172:144:0:37']

### Generate a log
Now, let's generate a log with 10,000 entries using the vectors above.

Review and execute the following cell to generate the log.

In [None]:
random.seed(52)
threshold, benignCount, hackerCount, data = \
    CreateAPITraffic(values=10000, 
                     benignIP=benignIPs, 
                     apiEntries=callNames)
print(f"There are {benignCount} benign entries " \
      f"and {hackerCount} hacker entries " \
      f"with a threshold of {threshold}.")
fields = ['Time', 'IP_Address', 'API_Call']
SaveDataToCSV(data, fields, "CallData.csv")

### Preparing the data
Now that we have the log, let's prepare the data for training the model.

First, review the `CallData.csv` file to understand the data structure.

Now, let's load the data and label it as malicious or benign.
Review and execute the code in the cell below to load the data and label it.

In [6]:
def ReadDataFromCSV(filename="test.csv"):
    logData = pd.read_csv(filename)
    
    # Obtain a listing of the unique API calls found in the file.
    calls = np.unique(np.array(logData['API_Call']))
    
    # Aggregate the data using the IP_Address as the means
    # for determining how to group the entries and API_Call
    # as the means to determine which column to use for aggregation.
    aggData = logData.groupby(
        'IP_Address')['API_Call'].agg(list)
    
    # Label each IP address by its prefix: 0 for a benign address
    # (prefix 172) and 1 for a hacker address. The label row is named
    # 'Benign' even though a value of 1 marks a hacker.
    analysisEntries = {}
    for ipIndex, ipEntry in zip(aggData.index, aggData):
        if ipIndex[0:3] == '172':
            values = [0]
        else:
            values = [1]
        
        # Add one entry per API call holding the number of times
        # this IP address made that call.
        keys = ['Benign']
        for callType in calls:
            keys.append(callType)
            values.append(ipEntry.count(callType))
        
        # Store the label and counts as a Series. Each Series becomes
        # one column of the DataFrame, so the rows are the label
        # followed by the API calls.
        analysisEntries[ipIndex] = pd.Series(values,
                                             index=keys)
    
    # Create the DataFrame (one column per IP address) and return it
    # to the caller.
    analysisData = pd.DataFrame(analysisEntries)
    return (analysisData, calls)

--- 
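The aggregation in `ReadDataFromCSV()` boils down to counting, for each IP address, how often it made each API call. A minimal standard-library sketch of the same idea is shown below, using `Counter` instead of pandas. The helper names and the four sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def count_calls_per_ip(rows):
    """Map each IP address to a Counter of its API calls."""
    counts = defaultdict(Counter)
    for _time, ip, call in rows:
        counts[ip][call] += 1
    return counts


def to_feature_rows(counts, calls):
    """Build [label, count_1, ..., count_n] for every IP address.

    The label follows the notebook's convention: 0 for the benign
    172 prefix and 1 for anything else.
    """
    features = {}
    for ip, counter in counts.items():
        label = 0 if ip.startswith("172") else 1
        features[ip] = [label] + [counter[call] for call in calls]
    return features


rows = [
    ("00:00:01", "172:144:0:22", "Often1"),
    ("00:00:02", "175:144:22:2", "Rarely"),
    ("00:00:03", "172:144:0:22", "Often1"),
    ("00:00:04", "175:144:22:2", "Often1"),
]
calls = sorted({call for _, _, call in rows})
features = to_feature_rows(count_calls_per_ip(rows), calls)
print(calls)
print(features)
```

Each IP address ends up as a single training sample, so the size of the training set is the number of distinct IP addresses, not the number of log entries.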

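Now, load the CSV file and review the DataFrame returned from our function by running the cell below. Each column is an IP address; the first row holds the label and the remaining rows hold the call counts.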

In [None]:
analysisData, calls = ReadDataFromCSV("CallData.csv")
print(analysisData)

In [None]:
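# Rows 1..len(calls) hold the per-call counts. Transpose them so
# that each row of X describes one IP address.
X = np.array(analysisData[1:len(calls)+1]).T
print(X)
# Row 0 holds the labels (0 = benign, 1 = hacker).
y = analysisData[0:1]
print(y)
# Flatten the labels into the 1-D array that scikit-learn expects.
y = y.values.ravel()
print(y)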

---

The data preparation for this example took only a few steps. Real-world data usually needs more work, such as handling missing values, normalizing numeric features, and encoding categorical features.
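To make those three extra steps concrete, here is a minimal standard-library sketch of median imputation, z-score normalization, and one-hot encoding. The sample columns and helper names are hypothetical. In practice, scikit-learn's `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` would be the more usual choice.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted categories."""
    categories = sorted(set(labels))
    encoded = [[int(label == c) for c in categories]
               for label in labels]
    return categories, encoded


# Hypothetical columns: call counts with a gap, and a region tag.
call_counts = [12, None, 30, 18]
regions = ["eu", "us", "eu", "apac"]

filled = impute_median(call_counts)
scaled = standardize(filled)
categories, encoded = one_hot(regions)
print(filled)
print(categories, encoded)
```

Tree-based models such as random forests are not sensitive to feature scaling, so normalization matters more if you later swap in a distance-based or linear model.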

### Train the model
Let's use a RandomForestClassifier to train the model.

In [None]:
clf = RandomForestClassifier()
clf.fit(X, y)


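Now, let's generate some test data with our CreateAPITraffic function, using a different seed, a higher bias (.95), and a smaller outlier value (15) so that the test log differs from the training log.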

Review and execute the following cell to generate the test data.

In [None]:
random.seed(19)
threshold, benignCount, hackerCount, data = \
    CreateAPITraffic(benignIP=benignIPs, 
                     apiEntries=callNames, 
                     bias=.95, outlier=15)
print(f"There are {benignCount} benign entries " \
      f"and {hackerCount} hacker entries " \
      f"with a threshold of {threshold}.")
fields = ['Time', 'IP_Address', 'API_Call']
SaveDataToCSV(data, fields, "TestData.csv")

---

Now, load the test data, predict a label for each IP address, and measure the model's accuracy by running the cell below:


In [None]:
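testData, testCalls = ReadDataFromCSV("TestData.csv")
# Select the feature rows by name, in the training order, so the
# columns of X_test line up with those of X. A call that is missing
# from the test log gets a count of 0.
X_test = np.array(testData.reindex(index=calls, fill_value=0)).T
y_test = testData[0:1].values.ravel()
y_pred = clf.predict(X_test)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))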
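Accuracy alone can hide poor detection when the classes are unbalanced, as they are here with 16 benign and 6 hacker IP addresses. The sketch below computes precision and recall by hand for a hypothetical classifier that labels every address as benign. The helper names and the predictions are assumptions for illustration, not output from the model above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) with `positive` as the hacker label."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall); 0.0 when a ratio is undefined."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical case: 16 benign (0) and 6 hacker (1) addresses,
# and a classifier that predicts benign for all of them.
y_true = [0] * 16 + [1] * 6
y_guess = [0] * 22

counts = confusion_counts(y_true, y_guess)
accuracy = (counts[0] + counts[2]) / len(y_true)
precision, recall = precision_recall(y_true, y_guess)
print(f"accuracy={accuracy:.3f} precision={precision} recall={recall}")
```

This do-nothing classifier is right about 73% of the time yet catches no hackers, which is why recall on the hacker class is worth reporting next to accuracy. scikit-learn provides `confusion_matrix`, `precision_score`, and `recall_score` in `sklearn.metrics` for the same purpose.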