# Model Training with Scikit-learn

Train a model that predicts whether a file transfer is suspicious or benign, based on its features (attributes).

### 1. Import the required libraries and packages.

In [1]:
from typing import List, Dict

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### 2. Load the data into a pandas DataFrame.

In [8]:
data = pd.read_csv('./data/mftinput4.csv')

### 3. Preprocess the data.

Split the data into the features (`X`, a DataFrame) and the target variable (`y`, a Series).

In [9]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

Inspect the first rows of each.

In [10]:
X.head()

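|   | Entropy | FileAge | CompressionRatio | FileSize | TransferTime | PacketsSize | TransferRate | Age |
|---|---------|---------|------------------|----------|--------------|-------------|--------------|-----|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |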


In [11]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

Divide the data into training and test data sets. 

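Scikit-learn's `train_test_split` function splits the data set into random training and test subsets. Here, `test_size=0.2` holds out 20% of the samples for testing, and `random_state=0` makes the split reproducible.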

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=0
)

print(f"Number of samples in training set: {X_train.shape[0]}")
print(f"Number of samples in test set: {X_test.shape[0]}")

Number of samples in training set: 614
Number of samples in test set: 154
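
When one class is rarer than the other, a purely random split can leave the two subsets with different class ratios. The sketch below shows the optional `stratify` argument of `train_test_split`, which keeps the ratio the same in both subsets. It runs on synthetic stand-in data (`X_demo`, `y_demo` are invented names, not the notebook's data), so treat it as an illustration rather than part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 samples, 30% in the positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=0,
    stratify=y_demo,
)

print(f"Positive share in training set: {y_tr.mean():.2f}")
print(f"Positive share in test set: {y_te.mean():.2f}")
```

In the notebook itself, the equivalent change would be to pass `stratify=y` to the existing call.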


### 4. Create and train the model.

Create an instance of one of the models available in Scikit-learn and train it on the training set. This example uses a logistic regression model, which is a simple approach to classification problems.

In [13]:
# Instantiate the model with hyperparameters
model = LogisticRegression(penalty="l2", C=1.0, max_iter=300)

# Train the model
model.fit(X_train, y_train)
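
After fitting, a logistic regression model exposes its learned weights through the `coef_` and `intercept_` attributes, which can help show which features drive the predictions. The following is a minimal sketch on synthetic data; the feature names `f0` to `f3` and the variable `demo_model` are hypothetical and do not come from the file-transfer data set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 samples, 4 features.
X_arr, y_demo = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Same hyperparameters as the notebook; L2 is the default penalty.
demo_model = LogisticRegression(C=1.0, max_iter=300)
demo_model.fit(X_demo, y_demo)

# For a binary problem, coef_ has shape (1, n_features).
coefficients = pd.Series(demo_model.coef_[0], index=feature_names)
print(coefficients.sort_values(key=abs, ascending=False))
print(f"Intercept: {demo_model.intercept_[0]:.3f}")
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so read them with care on unscaled data.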

### 5. Evaluate the model metrics.

After the model is trained, evaluate the model against the test set.

In [14]:
# Compute the model predictions for the test data: y_predicted
y_predicted = model.predict(X_test)

# Compare the predicted values for the test set (y_predicted)
# against the expected values (y_test)
print("Classification report:")
print(classification_report(y_test, y_predicted))

Classification report:
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       107
           1       0.76      0.62      0.68        47

    accuracy                           0.82       154
   macro avg       0.80      0.77      0.78       154
weighted avg       0.82      0.82      0.82       154
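
To make the columns of the report concrete, the sketch below computes precision, recall, F1, and accuracy by hand for the positive class. The labels are toy values invented for illustration; they are not the notebook's test set.

```python
# Toy labels (assumption): 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / len(y_true)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The `support` column in the report is the number of true samples of each class, which corresponds to `tp + fn` for the positive class here.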



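The trained model has an accuracy of about 82% on the test set. Recall for the suspicious class (`1`) is lower, at 0.62, so the model misses more than a third of the suspicious transfers.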

You can improve the score by retraining the model after more sophisticated feature engineering or by tuning the model hyperparameters.
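
One possible way to combine both ideas is a pipeline that standardizes the features and a cross-validated grid search over the regularization strength `C`. The sketch below is an assumption about how such tuning could look, not the method used in this notebook; it runs on synthetic data, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 8 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0
)

# Scale the features, then fit the classifier.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=300))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(f"Best C: {search.best_params_['logisticregression__C']}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

With the real data, the search would be fitted on `X_train` and `y_train` only, keeping the test set for the final evaluation.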

### 6. Test the model with sample cases.
Test the model with data from two file transfers: one is suspicious and one is benign.

In [16]:
# Tuple for textual display of the prediction (index 0 = benign, 1 = suspicious)
classes = ('Benign', 'Suspicious')


def predict(transfers: List[Dict]) -> List[str]:
    """Return the predicted class name for each file transfer."""
    inputs = pd.DataFrame(transfers)
    predictions = model.predict(inputs)
    return [classes[p] for p in predictions]

# Feature order: Entropy, FileAge, CompressionRatio, FileSize, TransferTime, PacketsSize, TransferRate, Age
suspicious_filetransfer = {
    "Entropy": 6.0,
    "FileAge": 110.0,
    "CompressionRatio": 65.0,
    "FileSize": 15.0,
    "TransferTime": 1.0,
    "PacketsSize": 45.7,
    "TransferRate": 0.627,
    "Age": 50
}

benign_filetransfer = {
    "Entropy": 0,
    "FileAge": 88.0,
    "CompressionRatio": 60.0,
    "FileSize": 35.0,
    "TransferTime": 1.0,
    "PacketsSize": 45.7,
    "TransferRate": 0.27,
    "Age": 20
}


predictions = predict([suspicious_filetransfer, benign_filetransfer])
print(predictions)

['Suspicious', 'Benign']
