# Predicting Malicious Cyber Connections
<p style="margin:30px">
    <img style="display:inline; margin-right:50px" width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

The general setup for the problem is a common one: we have a single table of log lines recording Internet traffic between various sources and destinations. Each connection between a source and a destination is labeled as malicious or clean in the dataset, and we'd like to predict ahead of time whether a future connection between a given source and destination will be malicious.

This notebook walks through an end-to-end workflow on this [Cybersecurity Dataset](), showing a rapid way to predict whether a connection (defined as a source name/destination name pair) is malicious.


## Highlights
* Quickly build an end-to-end workflow from log-line cybersecurity data
* Find interesting, automatically generated features

Note: this is an extremely imbalanced dataset and would benefit tremendously from additional positive (malicious) labels.

In [1]:
import featuretools as ft
from featuretools.selection import remove_low_information_features
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
import utils

# Step 1: Understanding the Data
Here we load the data and do a bit of preprocessing: we map the raw labels to booleans and downsample the negative examples.

In [2]:
cyber_df = pd.read_csv("data/CyberFLTenDays.csv")
cyber_df.index.name = "log_id"
cyber_df.reset_index(inplace=True, drop=False)
cyber_df['label'] = cyber_df['label'].map({'N': False, 'A': True}, na_action='ignore')

# Downsample the negative examples because there are very few positives.
# This could also be done after the feature engineering step, but doing it here reduces computation time.
cyber_df_pos = cyber_df[cyber_df['label']]
cyber_df_neg = cyber_df[~cyber_df['label']].sample(100000)
cyber_df = pd.concat([cyber_df_pos, cyber_df_neg]).sort_values(['secs'])
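
The downsampling step above can be tried on a small synthetic frame. This is a minimal sketch: the toy rows, the `n_neg` value, and the `random_state` argument (added so the sample is reproducible) are assumptions for illustration and are not part of the original cell.

```python
import pandas as pd

# Toy log using the same raw label encoding as the dataset ('N' and 'A').
toy = pd.DataFrame({
    "secs": range(10),
    "label": ["N", "N", "A", "N", "N", "N", "N", "A", "N", "N"],
})
toy["label"] = toy["label"].map({"N": False, "A": True})

n_neg = 4  # hypothetical sample size; the notebook samples 100000 negatives
toy_pos = toy[toy["label"]]                                  # keep every positive
toy_neg = toy[~toy["label"]].sample(n_neg, random_state=0)   # subsample negatives
toy_small = pd.concat([toy_pos, toy_neg]).sort_values("secs")

print(toy_small["label"].value_counts())
```

Every positive row survives, while the negatives shrink to `n_neg` rows, and the final sort restores time order.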

In [3]:
cyber_df.head()

| | log_id | secs | src_name | dest_name | src_host | dest_host | auth_type | login_type | stage | result | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | C297$@DOM1 | SYSTEM@C297 | C297 | C297 | Negotiate | Service | LogOn | Success | False |
| 4 | 4 | 4 | C1034$@DOM1 | SYSTEM@C1034 | C1034 | C1034 | Negotiate | Service | LogOn | Success | False |
| 9 | 9 | 21 | C4589$@DOM1 | C4589$@DOM1 | C4589 | C625 | ? | ? | TGS | Success | False |
| 13 | 13 | 45 | C4893$@DOM1 | C4893$@DOM1 | C4893 | TGT | ? | ? | TGS | Success | False |
| 16 | 16 | 74 | C1697$@DOM1 | C1697$@DOM1 | C529 | C529 | ? | Network | LogOff | Success | False |


## Create an EntitySet
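To apply Deep Feature Synthesis we need to establish an `EntitySet` structure for our data. Since we're interested in predicting for combinations of "src_name" and "dest_name" (we call this pair a "session"), we need to create a separate normalized entity for "sessions". We also create an intermediate "name_host_pairs" entity, with one row per unique combination of source name, destination name, source host, and destination host, which sits between "log" and "sessions".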

In [4]:
es = ft.EntitySet("CyberLL")

# build the string keys that index the two parent entities (name/host pairs and sessions)
cyber_df["name_host_pair"] = cyber_df["src_name"].str.cat(
                                [cyber_df["dest_name"],
                                 cyber_df["src_host"],
                                 cyber_df["dest_host"]],
                                sep=' / ')
cyber_df["session_id"] = cyber_df["src_name"].str.cat(
                                 cyber_df["dest_name"],
                                 sep=' / ')

es.entity_from_dataframe("log",
                         cyber_df,
                         index="log_id",
                         time_index="secs")
es.normalize_entity(base_entity_id="log",
                    new_entity_id="name_host_pairs",
                    index="name_host_pair",
                    additional_variables=["src_name", "dest_name",
                                          "src_host", "dest_host",
                                          #"src_pair",
                                          #"dest_pair",
                                          "session_id",
                                          "label"])
es.normalize_entity(base_entity_id="name_host_pairs",
                    new_entity_id="sessions",
                    index="session_id",
                    additional_variables=["dest_name", "src_name"])

Entityset: CyberLL
  Entities:
    log [Rows: 100329, Columns: 7]
    name_host_pairs [Rows: 61890, Columns: 6]
    sessions [Rows: 19261, Columns: 4]
  Relationships:
    log.name_host_pair -> name_host_pairs.name_host_pair
    name_host_pairs.session_id -> sessions.session_id
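
As a rough illustration of what the two `normalize_entity` calls produce, the same parent tables can be derived by hand in pandas. This is only a sketch on made-up rows, not how Featuretools implements normalization; the toy values are assumptions.

```python
import pandas as pd

log = pd.DataFrame({
    "log_id": [0, 1, 2, 3],
    "src_name": ["A", "A", "A", "B"],
    "dest_name": ["X", "X", "X", "Y"],
    "src_host": ["h1", "h1", "h2", "h3"],
    "dest_host": ["h9", "h9", "h9", "h9"],
})
log["session_id"] = log["src_name"].str.cat(log["dest_name"], sep=" / ")
log["name_host_pair"] = log["src_name"].str.cat(
    [log["dest_name"], log["src_host"], log["dest_host"]], sep=" / ")

# One row per unique name/host pair, carrying its parent session_id.
name_host_pairs = log[["name_host_pair", "session_id"]].drop_duplicates()
# One row per unique session.
sessions = name_host_pairs[["session_id"]].drop_duplicates()

print(len(log), len(name_host_pairs), len(sessions))
```

Each level has fewer rows than the one below it, which mirrors the shrinking row counts in the `EntitySet` summary above.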

In [5]:
cyber_df.head()

| | log_id | secs | src_name | dest_name | src_host | dest_host | auth_type | login_type | stage | result | label | name_host_pair | session_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | C297$@DOM1 | SYSTEM@C297 | C297 | C297 | Negotiate | Service | LogOn | Success | False | C297$@DOM1 / SYSTEM@C297 / C297 / C297 | C297$@DOM1 / SYSTEM@C297 |
| 4 | 4 | 4 | C1034$@DOM1 | SYSTEM@C1034 | C1034 | C1034 | Negotiate | Service | LogOn | Success | False | C1034$@DOM1 / SYSTEM@C1034 / C1034 / C1034 | C1034$@DOM1 / SYSTEM@C1034 |
| 9 | 9 | 21 | C4589$@DOM1 | C4589$@DOM1 | C4589 | C625 | ? | ? | TGS | Success | False | C4589$@DOM1 / C4589$@DOM1 / C4589 / C625 | C4589$@DOM1 / C4589$@DOM1 |
| 13 | 13 | 45 | C4893$@DOM1 | C4893$@DOM1 | C4893 | TGT | ? | ? | TGS | Success | False | C4893$@DOM1 / C4893$@DOM1 / C4893 / TGT | C4893$@DOM1 / C4893$@DOM1 |
| 16 | 16 | 74 | C1697$@DOM1 | C1697$@DOM1 | C529 | C529 | ? | Network | LogOff | Success | False | C1697$@DOM1 / C1697$@DOM1 / C529 / C529 | C1697$@DOM1 / C1697$@DOM1 |


# Generate labels and associated cutoff times

Featuretools can generate features for each session strictly before an associated cutoff time. We find these cutoff times in the process of computing labels. Labels are defined as follows:

For a given session:
 * After seeing the session N times, take the time of its next observation as the cutoff time
 * Skip ahead L observations of this same session from the cutoff (the lead)
 * Label the session malicious if any of its W connections starting at that point is malicious

We will set N = 2 (number of observations to wait for), L = 2 (lead, in observations), and W = 10 (prediction window).
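
The index arithmetic implied by these settings can be checked on a toy session. The sketch below mirrors the zero-based slicing used in `generate_cutoffs`; the 16-row session and the position of the malicious connection are made up for illustration.

```python
N, L, W = 2, 2, 10  # observations to wait for, lead, prediction window

cutoff_pos = N                 # row whose timestamp becomes the cutoff time
window_start = N + L           # first row inside the prediction window
window_end = window_start + W  # exclusive upper bound of the window

# A toy session with 16 connections; only connection 6 is malicious.
labels = [False] * 16
labels[6] = True

session_label = any(labels[window_start:window_end])
print(cutoff_pos, window_start, window_end, session_label)
```

With these values the cutoff comes from row 2 and the label looks at rows 4 through 13, so a malicious connection at row 6 makes the session positive.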

In [6]:
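def generate_cutoffs(cyber_df, index_col, after_n_obs, lead, prediction_window):
    """Return one cutoff row (index_col, secs, label) per group with enough observations.

    Assumes cyber_df is already sorted by "secs".
    """
    window_start = after_n_obs + lead
    window_end = window_start + prediction_window
    # count the observations in each group
    grouped = cyber_df.groupby(index_col)[index_col].count()
    grouped.name = "count"
    # keep only the groups with more than min_obs observations
    min_obs = after_n_obs + lead + 1
    enough_examples = grouped[grouped > min_obs].to_frame().reset_index()
    enough_examples = cyber_df[cyber_df[index_col].isin(enough_examples[index_col])]
    def get_label_and_cutoff(df):
        # the row at position after_n_obs supplies the cutoff time (copy to avoid mutating a view)
        cutoff = df.iloc[after_n_obs].copy()
        # positive if any connection in the prediction window is malicious
        cutoff['label'] = df.iloc[window_start: window_end]["label"].any()
        return cutoff
    cutoffs = enough_examples.groupby(index_col)[[index_col, "secs", "label"]].apply(get_label_and_cutoff)
    return cutoffs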

In [7]:
cutoffs = generate_cutoffs(cyber_df, "session_id", 2, 2, 10)

In [8]:
cutoffs['label'].value_counts()

False    4106
True       36
Name: label, dtype: int64
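
To put the imbalance in numbers, the counts printed above can be turned into a positive rate. This is plain arithmetic on the two values from the output, nothing more.

```python
n_neg, n_pos = 4106, 36  # counts taken from the value_counts() output above

positive_rate = n_pos / (n_pos + n_neg)
neg_per_pos = n_neg / n_pos
print("positive rate: {:.2%}".format(positive_rate))
print("negatives per positive: {:.0f}".format(neg_per_pos))
```

Fewer than 1% of the labeled sessions are malicious, so a model that always predicts `False` would already be about 99% accurate. Accuracy is therefore a poor metric for this problem.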

# Compute features using DFS

In [9]:
fm, fl = ft.dfs(entityset=es, target_entity="sessions", cutoff_time=cutoffs,
                cutoff_time_in_index=True,
                verbose=True, max_depth=3)

Built 98 features
Elapsed: 30:47 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks
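
The effect of a cutoff time can be imitated in plain pandas. The sketch below is not Featuretools code: it uses made-up rows and follows this notebook's "strictly before" wording. Whether a row falling exactly on the cutoff is included in Featuretools may depend on the version and settings, so treat the `<` comparison as an assumption.

```python
import pandas as pd

log = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "secs":       [10,   20,   30,   40,   15,   50],
})
cutoffs_toy = pd.DataFrame({"session_id": ["s1", "s2"], "secs": [30, 50]})

# Pair each log line with its session's cutoff, then drop lines at or after it.
merged = log.merge(cutoffs_toy, on="session_id", suffixes=("", "_cutoff"))
visible = merged[merged["secs"] < merged["secs_cutoff"]]

# A COUNT(log)-style feature computed only from the visible rows.
count_before_cutoff = visible.groupby("session_id")["secs"].count()
print(count_before_cutoff)
```

Rows that occur after a session's cutoff never contribute to its features, which is what keeps future information out of the feature matrix.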


In [10]:
fm.head()

| session_id | time | dest_name | src_name | first_name_host_pairs_time | SUM(name_host_pairs.first_log_time) | STD(name_host_pairs.first_log_time) | MAX(name_host_pairs.first_log_time) | SKEW(name_host_pairs.first_log_time) | MIN(name_host_pairs.first_log_time) | MEAN(name_host_pairs.first_log_time) | COUNT(name_host_pairs) | ... | MEAN(name_host_pairs.NUM_UNIQUE(log.result)) | NUM_UNIQUE(name_host_pairs.MODE(log.auth_type)) | NUM_UNIQUE(name_host_pairs.MODE(log.login_type)) | NUM_UNIQUE(name_host_pairs.MODE(log.stage)) | NUM_UNIQUE(name_host_pairs.MODE(log.result)) | MODE(name_host_pairs.MODE(log.auth_type)) | MODE(name_host_pairs.MODE(log.login_type)) | MODE(name_host_pairs.MODE(log.stage)) | MODE(name_host_pairs.MODE(log.result)) | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U22@DOM1 / U22@DOM1 | 832 | U22@DOM1 | U22@DOM1 | 355 | 1855 | 242.347547 | 832 | -0.883495 | 355 | 618.333333 | 3 | ... | 1 | 1 | 1 | 1 | 1 | ? | Network | LogOff | Success | False |
| C567$@DOM1 / C567$@DOM1 | 877 | C567$@DOM1 | C567$@DOM1 | 223 | 1764 | 333.558091 | 877 | -0.972081 | 223 | 588.0 | 3 | ... | 1 | 2 | 1 | 2 | 1 | Kerberos | Network | LogOn | Success | False |
| C1114$@DOM1 / C1114$@DOM1 | 1447 | C1114$@DOM1 | C1114$@DOM1 | 428 | 2518 | 537.12227 | 1447 | 1.425103 | 428 | 839.333333 | 3 | ... | 1 | 2 | 2 | 3 | 1 | ? | Network | LogOff | Success | False |
| C1766$@DOM1 / C1766$@DOM1 | 1823 | C1766$@DOM1 | C1766$@DOM1 | 253 | 3001 | 787.706375 | 1823 | 0.426427 | 253 | 1000.333333 | 3 | ... | 1 | 2 | 1 | 2 | 1 | Kerberos | Network | LogOn | Success | False |
| C123$@DOM1 / C123$@DOM1 | 1939 | C123$@DOM1 | C123$@DOM1 | 146 | 2252 | 1029.180418 | 1939 | 1.73124 | 146 | 750.666667 | 3 | ... | 1 | 2 | 1 | 2 | 1 | Kerberos | Network | LogOn | Success | False |


### Sort indexes to line up cutoffs with feature matrix

In [11]:
fm = fm.reorder_levels(['time', 'session_id']).sort_index()
cutoffs = cutoffs.set_index('secs', append=True).reorder_levels(['secs', 'session_id']).sort_index()
fm['label'] = cutoffs['label'].values
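
Because the labels are assigned by position (`.values`), both frames must end up in exactly the same row order. The toy sketch below repeats the same reordering on made-up frames and adds an explicit check before the assignment; the check is a suggested safeguard, not part of the original cell.

```python
import pandas as pd

# Feature matrix indexed by (session_id, time), as with cutoff_time_in_index=True.
fm_toy = pd.DataFrame(
    {"COUNT(log)": [3, 5, 2]},
    index=pd.MultiIndex.from_tuples(
        [("b", 20), ("a", 30), ("c", 10)], names=["session_id", "time"]),
)
# Cutoffs indexed by session_id, deliberately in a different row order.
cutoffs_toy = pd.DataFrame(
    {"secs": [30, 10, 20], "label": [False, True, False]},
    index=pd.Index(["a", "c", "b"], name="session_id"),
)

fm_toy = fm_toy.reorder_levels(["time", "session_id"]).sort_index()
cutoffs_toy = (cutoffs_toy.set_index("secs", append=True)
               .reorder_levels(["secs", "session_id"]).sort_index())

# Both indexes must hold the same keys in the same order before assigning by position.
assert list(fm_toy.index) == list(cutoffs_toy.index)
fm_toy["label"] = cutoffs_toy["label"].values
print(fm_toy)
```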

### One-Hot-Encode categorical features and remove features with low information

In [12]:
fm_encoded, fl_encoded = ft.encode_features(fm, fl)
fm_encoded, fl_encoded = remove_low_information_features(fm_encoded, fl_encoded)
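
A rough pandas analogue of these two steps is shown below on a made-up frame. It assumes that "low information" means a column with a single distinct value; the criteria Featuretools actually applies may differ, and `pd.get_dummies` stands in for `ft.encode_features` only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "MODE(log.stage)": ["LogOn", "LogOff", "LogOn", "TGS"],
    "COUNT(log)": [3, 5, 2, 4],
    "NUM_UNIQUE(log.result)": [1, 1, 1, 1],  # constant, so it carries no information
})

# One-hot encode the categorical column.
encoded = pd.get_dummies(toy, columns=["MODE(log.stage)"], prefix_sep=" = ")

# Drop columns that take a single value across all rows.
informative = encoded.loc[:, encoded.nunique(dropna=False) > 1]
print(list(informative.columns))
```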

# Machine Learning

Now that we have a feature matrix and associated labels, we can build a standard machine learning pipeline with a `RandomForestClassifier`.

First, split the data into train and test sets:

In [13]:
train, test = train_test_split(fm_encoded, test_size=0.2, shuffle=True)

In [14]:
X_train = train
y_train = X_train.pop('label')
X_test = test
y_test = X_test.pop('label')
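
With so few positives, a purely random split can leave the test set with very few malicious sessions. One option, sketched here on a toy frame, is the `stratify` argument of `train_test_split`, which keeps the positive rate the same in both splits. The toy data and the fixed `random_state` are assumptions; the original cell uses neither.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 1% positives, mimicking the imbalance in this dataset.
toy = pd.DataFrame({"feature": np.arange(1000),
                    "label": [True] * 10 + [False] * 990})

train_toy, test_toy = train_test_split(
    toy, test_size=0.2, shuffle=True,
    stratify=toy["label"],  # keep the positive rate equal in both splits
    random_state=0,         # assumption: fixed seed for reproducibility
)
print(train_toy["label"].sum(), test_toy["label"].sum())
```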

### Construct the model

In [15]:
imputer = Imputer(missing_values='NaN', strategy="mean", axis=0)
scaler = StandardScaler()
clf = RandomForestClassifier(n_jobs=-1)
model = Pipeline([("imputer", imputer),
                  ("scaler", scaler),
                  ("rf", clf)])
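
The cell above relies on `Imputer`, which was removed from `sklearn.preprocessing` in scikit-learn 0.22. On newer versions, `sklearn.impute.SimpleImputer` fills the same role. The sketch below shows an equivalent pipeline under that assumption; the tiny matrix, `n_estimators`, and `random_state` are made up so the example runs quickly and reproducibly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model_sketch = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Tiny made-up matrix with missing values, just to show the pipeline runs end to end.
X = np.array([[1.0, 2.0], [np.nan, 0.5], [3.0, 1.5], [4.0, np.nan]])
y = np.array([0, 1, 0, 1])
model_sketch.fit(X, y)
probs_sketch = model_sketch.predict_proba(X)
print(probs_sketch.shape)
```

`SimpleImputer` always imputes column-wise, so the old `axis=0` argument has no counterpart.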



### Fit the model, then score it

In [16]:
model.fit(X_train, y_train)
    
probs = model.predict_proba(X_test)
score = roc_auc_score(y_test, probs[:,1])
print('ROC AUC Score: {:.2f}'.format(score))



ROC AUC Score: 0.60
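
To interpret this number: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance. The pure-Python sketch below computes it from that definition on four made-up scores; `pairwise_auc` is a hypothetical helper written for this illustration.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true_toy, scores_toy))
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. By the same reading, the 0.60 above means the model ranks a malicious session above a clean one about 60% of the time.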


## View the most important features
These are the ten features that the Random Forest ranks as most important:

In [17]:
high_imp_feats = utils.feature_importances(X_train, clf, feats=10)

1: SUM(log.secs) [0.095]
2: MEAN(name_host_pairs.MIN(log.secs)) [0.035]
3: MEAN(name_host_pairs.MAX(log.secs)) [0.033]
4: SKEW(name_host_pairs.MEAN(log.secs)) [0.033]
5: MIN(name_host_pairs.MAX(log.secs)) [0.033]
6: STD(name_host_pairs.MAX(log.secs)) [0.030]
7: MIN(name_host_pairs.first_log_time) [0.029]
8: STD(log.secs) [0.028]
9: MEAN(log.secs) [0.027]
10: PERCENT_TRUE(name_host_pairs.label) [0.026]
-----
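
`utils.feature_importances` is a local helper whose source is not shown in this notebook. A helper along the following lines would print a similar ranking. `top_features` is a hypothetical name, and the sketch assumes the importances come from the fitted forest's `feature_importances_` attribute.

```python
import numpy as np

def top_features(feature_names, importances, feats=10):
    """Print and return the `feats` highest-importance feature names."""
    order = np.argsort(importances)[::-1][:feats]
    for rank, idx in enumerate(order, start=1):
        print("{}: {} [{:.3f}]".format(rank, feature_names[idx], importances[idx]))
    return [feature_names[idx] for idx in order]

names_toy = ["STD(log.secs)", "SUM(log.secs)", "MEAN(log.secs)"]
importances_toy = np.array([0.028, 0.095, 0.027])
top = top_features(names_toy, importances_toy, feats=2)
```

With the fitted model from this notebook, the call would look like `top_features(list(X_train.columns), clf.feature_importances_)`.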



## Save output files

In [18]:
import os

# create the output directory if it does not exist yet
os.makedirs("output", exist_ok=True)

fm_encoded.to_csv('output/feature_matrix.csv')
cutoffs.to_csv('output/cutoffs.csv')
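
When the saved feature matrix is loaded again, its two index levels come back as ordinary columns unless `index_col` is given. The sketch below shows the round trip on a toy frame; the temporary directory stands in for `output/` and the data is made up.

```python
import os
import tempfile

import pandas as pd

fm_toy = pd.DataFrame(
    {"COUNT(log)": [2, 3], "label": [True, False]},
    index=pd.MultiIndex.from_tuples(
        [(10, "c"), (20, "b")], names=["time", "session_id"]),
)

out_dir = tempfile.mkdtemp()  # stand-in for the "output" directory
path = os.path.join(out_dir, "feature_matrix.csv")
fm_toy.to_csv(path)

# index_col=[0, 1] rebuilds the (time, session_id) MultiIndex from the first two columns.
restored = pd.read_csv(path, index_col=[0, 1])
print(restored.index.names)
```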


Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/).