# Linear Learner

This notebook uses Linear Learner, one of Amazon SageMaker's built-in algorithms.
It was selected because it has options for handling imbalanced datasets, and this notebook explores whether they are effective here.

The dataset has two labels, 'normal' and 'abnormal'. It is imbalanced because 'abnormal' sessions are far less common than 'normal' ones.

In [1]:
!pip install loglizer

Collecting loglizer
  Downloading loglizer-1.0-py3-none-any.whl (21 kB)
Installing collected packages: loglizer
Successfully installed loglizer-1.0


## Set up the environment

In [2]:
import io
import os
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd

import sys
sys.path.append('../')
from loglizer.models import LogClustering
from loglizer import dataloader, preprocessing

import boto3
import sagemaker
from sagemaker import get_execution_role

%matplotlib inline

## Create the session

The session stores the connection parameters for SageMaker and is used to perform all SageMaker operations.

In [3]:
# sagemaker session, role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 bucket name
bucket = sagemaker_session.default_bucket()

In [4]:
label_path = 'anomaly_label.csv'
feature_path = 'HDFS_100k.log_structured.csv'

In [5]:
struct_log = feature_path # The structured log file
label_file = label_path # The anomaly label file
max_dist = 0.3 # the threshold to stop the clustering process
anomaly_threshold = 0.3 # the threshold for anomaly detection

## Data preparation

In [6]:
(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(feature_path,
                                                            label_file=label_path,
                                                            window='session', 
                                                            train_ratio=0.7,
                                                            split_type='uniform')

Loading HDFS_100k.log_structured.csv
219 94
Total: 7940 instances, 313 anomaly, 7627 normal
Train: 5557 instances, 219 anomaly, 5338 normal
Test: 2383 instances, 94 anomaly, 2289 normal
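The counts above are consistent with a split that divides each class separately at `train_ratio`, so both sets keep roughly the same share of anomalies (about 3.9%). The sketch below reproduces the counts under that assumption. `uniform_split` is a hypothetical helper written for illustration, not loglizer's implementation, and it skips any shuffling.

```python
import numpy as np

def uniform_split(labels, train_ratio=0.7):
    """Hypothetical per-class split: take train_ratio of each class for training."""
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n_pos_train = int(train_ratio * len(pos_idx))
    n_neg_train = int(train_ratio * len(neg_idx))
    train_idx = np.concatenate([pos_idx[:n_pos_train], neg_idx[:n_neg_train]])
    test_idx = np.concatenate([pos_idx[n_pos_train:], neg_idx[n_neg_train:]])
    return train_idx, test_idx

# Synthetic labels with the same class counts as the output above
labels = np.array([1] * 313 + [0] * 7627)
train_idx, test_idx = uniform_split(labels, train_ratio=0.7)

for name, idx in [('Train', train_idx), ('Test', test_idx)]:
    n_anomaly = int(labels[idx].sum())
    print('{}: {} instances, {} anomaly, {:.1%} anomalous'.format(
        name, len(idx), n_anomaly, n_anomaly / len(idx)))
```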



### Observe prepared data

In [7]:
x_train.shape, y_train.shape

((5557,), (5557,))

In [8]:
x_train[0]

['E22',
 'E5',
 'E5',
 'E5',
 'E26',
 'E26',
 'E11',
 'E9',
 'E11',
 'E9',
 'E26',
 'E11',
 'E9']
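Each training example is a sequence of event IDs for one session. Before any weighting, a sequence like this can be summarized as a vector of event counts. The snippet below is a minimal sketch of that step; the alphabetical column order is an assumption made for illustration and may differ from the order the feature extractor uses.

```python
from collections import Counter

# The session shown above
session = ['E22', 'E5', 'E5', 'E5', 'E26', 'E26', 'E11',
           'E9', 'E11', 'E9', 'E26', 'E11', 'E9']

counts = Counter(session)

# Assumed column order: event IDs sorted as strings
vocabulary = sorted(counts)
count_vector = [counts[event] for event in vocabulary]

print(vocabulary)
print(count_vector)
```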

## Vectorize the prepared data

In [9]:
feature_extractor = preprocessing.FeatureExtractor()
x_train_transformed = feature_extractor.fit_transform(x_train, term_weighting='tf-idf')
x_test_transformed = feature_extractor.transform(x_test)

Train data shape: 5557-by-16

Test data shape: 2383-by-16



In [10]:
x_train_transformed[0]

array([-1.79956050e-12, -5.39868150e-12,  4.78878298e-02,  4.73393336e-02,
        4.78878298e-02,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00])
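The tiny negative values in the first two entries are what one would expect if the inverse document frequency is computed as `log(N / (df + eps))` with a small `eps`: an event that occurs in every session gets an idf just below zero. With N = 5557 and `eps = 1e-8`, that idf is about -1.8e-12, which matches the first entry. The sketch below illustrates this weighting on a toy count matrix. The formula and the value of `eps` are assumptions inferred from the output, not taken from the notebook.

```python
import numpy as np

def tfidf_weight(counts, eps=1e-8):
    """Weight an event-count matrix by idf = log(N / (df + eps)) (assumed formula)."""
    counts = np.asarray(counts, dtype=float)
    n_sessions = counts.shape[0]
    # df: number of sessions in which each event appears at least once
    df = np.sum(counts > 0, axis=0)
    idf = np.log(n_sessions / (df + eps))
    return counts * idf

# Toy data: events 0 and 1 occur in every session, event 2 in only one
toy_counts = np.array([[1, 3, 3],
                       [1, 3, 0],
                       [1, 2, 0],
                       [1, 4, 0]])
weighted = tfidf_weight(toy_counts)
print(weighted)

# idf of an event present in all 5557 training sessions
print(np.log(5557 / (5557 + 1e-8)))
```

Events that appear in every session carry almost no weight, so the rarer events dominate the feature vector.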

In [11]:
x_test_transformed.shape

(2383, 16)

# Modeling

## Create a LinearLearner Estimator

In [12]:
# import LinearLearner
from sagemaker import LinearLearner

prefix = 'abnormalDetect_linear'
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate LinearLearner
# (train_instance_count and train_instance_type are renamed to
# instance_count and instance_type in sagemaker>=2)
linear = LinearLearner(role=role,
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       predictor_type='binary_classifier',
                       binary_classifier_model_selection_criteria='f1',
                       output_path=output_path,
                       sagemaker_session=sagemaker_session,
                       epochs=15)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


## Convert data into a RecordSet format

In [13]:

# create RecordSet of training data
formatted_train_data = linear.record_set(train=x_train_transformed.astype('float32'), labels=y_train.astype('float32'))

## Train the Estimator

In [14]:
%%time 
# train the estimator on formatted training data
linear.fit(formatted_train_data)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-11-03 00:32:57 Starting - Starting the training job...
2021-11-03 00:32:59 Starting - Launching requested ML instancesProfilerReport-1635899577: InProgress
......
2021-11-03 00:34:15 Starting - Preparing the instances for training.........
2021-11-03 00:35:56 Downloading - Downloading input data
2021-11-03 00:35:56 Training - Downloading the training image.....
Docker entrypoint called with argument(s): train
Running default environment configuration script
[11/03/2021 00:36:37 INFO 140430469535552] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bia

## Deploy the trained model

In [15]:
%%time 
# deploy and evaluate a predictor
prec_predictor = linear.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


----------!
CPU times: user 185 ms, sys: 12.5 ms, total: 197 ms
Wall time: 5min 3s


# Model Evaluation

In [19]:
# code to evaluate the endpoint on test data
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # We have a lot of test data, so split it into 100 roughly equal batches
    # and send each batch to the prediction endpoint
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 100)]
    
    # LinearLearner produces a `predicted_label` for each data point in a batch;
    # collect the labels from all batches into a single array
    test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) 
                                 for batch in prediction_batches])
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2*(recall * precision) / (recall + precision)
    
    # printing a table of metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('F1-measure:', f1))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'F1-measure': f1}
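Note that `np.array_split(test_features, 100)` produces 100 batches, not batches of 100 rows. The sketch below shows the resulting batch sizes for an array with the same shape as the test set, and one way to get fixed-size batches instead. The zero-filled array is only a stand-in for `x_test_transformed`.

```python
import numpy as np

# Stand-in with the same shape as the transformed test set
test_features = np.zeros((2383, 16), dtype='float32')

# Passing an integer splits into that many roughly equal sections
batches = np.array_split(test_features, 100)
sizes = [len(batch) for batch in batches]
print(len(batches), min(sizes), max(sizes))

# Passing split indices instead gives batches of a fixed size (here 100 rows)
batch_size = 100
fixed = np.array_split(test_features, np.arange(batch_size, len(test_features), batch_size))
print(len(fixed), len(fixed[0]), len(fixed[-1]))
```

Either approach covers every row exactly once. The difference only matters for the size of each request sent to the endpoint.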

## Test Results

In [20]:
# from sagemaker.tensorflow import TensorFlowPredictor

# predictor = TensorFlowPredictor('linear-learner-2021-10-09-06-54-21-866')

In [21]:
metrics = evaluate(prec_predictor, 
                   x_test_transformed.astype('float32'), 
                   y_test, 
                   verbose=True)

prediction (col)   0.0  1.0
actual (row)               
0                 2288    1
1                   52   42

Recall:     0.447
Precision:  0.977
F1-measure: 0.613
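The reported metrics follow directly from the confusion matrix: the model raises almost no false alarms (1 false positive) but misses 52 of the 94 anomalies. The sketch below recomputes them from the four counts. `metrics_from_counts` is a hypothetical helper, and its zero-denominator guards are an addition that the `evaluate` function above does not have.

```python
def metrics_from_counts(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts, guarding against 0/0."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Counts read from the table above: rows are actual labels, columns are predictions
precision, recall, f1 = metrics_from_counts(tp=42, fp=1, fn=52)

print('{:<11} {:.3f}'.format('Recall:', recall))
print('{:<11} {:.3f}'.format('Precision:', precision))
print('{:<11} {:.3f}'.format('F1-measure:', f1))
```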



In [None]:
# metrics = evaluate(predictor, 
#                    x_test_transformed.astype('float32'), 
#                    y_test, 
#                    verbose=True)

### Delete the Endpoint

In [22]:
# Deletes a predictor's endpoint
def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        # the call failed, most likely because the endpoint no longer exists
        print('Already deleted: {}'.format(predictor.endpoint))

In [23]:
# delete the predictor endpoint after evaluation 
delete_endpoint(prec_predictor)

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Deleted linear-learner-2021-11-03-00-37-17-409


# Improvement: Model Tuning

## Create a LinearLearner tuned for precision at a target recall, with positive_example_weight_mult='balanced'

In [24]:
# instantiate a LinearLearner
# select the model with the highest precision at a target recall of 0.5,
# and weight positive examples to balance the classes
linear_balanced = LinearLearner(role=role,
                                train_instance_count=1,
                                train_instance_type='ml.c4.xlarge',
                                predictor_type='binary_classifier',
                                output_path=output_path,
                                sagemaker_session=sagemaker_session,
                                epochs=15,
                                binary_classifier_model_selection_criteria='precision_at_target_recall',
                                target_recall=0.5,  # 50% recall
                                positive_example_weight_mult='balanced')  # balance class weights

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
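Setting `positive_example_weight_mult='balanced'` asks Linear Learner to weight positive examples so that errors on both classes have a comparable impact on the training loss. As a rough illustration only, assuming the weight corresponds to the ratio of negative to positive training examples (the value actually used is not shown in this notebook), it would be about 24 for this training set. The `weighted_log_loss` function is a hypothetical stand-in for the algorithm's loss, included to show the effect of such a weight.

```python
import numpy as np

# Training-set class counts from the data loading output
n_pos, n_neg = 219, 5338

# Assumed meaning of 'balanced': negatives-to-positives ratio
pos_weight = n_neg / n_pos
print('positive example weight: {:.2f}'.format(pos_weight))

def weighted_log_loss(y_true, p_pred, pos_weight):
    """Binary log loss in which positive examples count pos_weight times."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)
    weights = np.where(y_true == 1, pos_weight, 1.0)
    losses = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return float(np.sum(weights * losses) / np.sum(weights))

# One missed anomaly and one correctly scored normal session
y_true, p_pred = [1, 0], [0.1, 0.1]
unweighted = weighted_log_loss(y_true, p_pred, pos_weight=1.0)
weighted = weighted_log_loss(y_true, p_pred, pos_weight=pos_weight)
print(unweighted, weighted)
```

With a weight this large, a missed anomaly dominates the loss, which pushes the model toward predicting the positive class more often.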


## Train the Estimator

In [25]:
%%time 
# train the estimator on formatted training data
linear_balanced.fit(formatted_train_data)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-11-03 00:52:35 Starting - Starting the training job...
2021-11-03 00:52:37 Starting - Launching requested ML instancesProfilerReport-1635900754: InProgress
......
2021-11-03 00:53:55 Starting - Preparing the instances for training.........
2021-11-03 00:55:35 Downloading - Downloading input data...
2021-11-03 00:55:55 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[11/03/2021 00:56:14 INFO 140100530812736] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bia

## Deploy and evaluate the balanced estimator

In [26]:
%%time 
# deploy and create a predictor
balanced_predictor = linear_balanced.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


----------!
CPU times: user 193 ms, sys: 650 µs, total: 193 ms
Wall time: 5min 3s


In [27]:
print('Metrics for balanced, LinearLearner.\n')

# get metrics for balanced predictor
metrics = evaluate(balanced_predictor, 
                   x_test_transformed.astype('float32'), 
                   y_test, 
                   verbose=True)

Metrics for balanced, LinearLearner.

prediction (col)  0.0   1.0
actual (row)               
0                   1  2288
1                   0    94

Recall:     1.000
Precision:  0.039
F1-measure: 0.076
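For context, these numbers are almost identical to what a trivial classifier that labels every session abnormal would achieve on this test set. The sketch below computes that baseline from the test counts (94 anomalies, 2289 normal sessions) and compares it with the balanced model's confusion matrix. The baseline is a hypothetical reference point, not something the notebook trains.

```python
# Test-set class counts from the data loading output
n_pos, n_neg = 94, 2289

# Baseline that predicts 'abnormal' for every session:
# tp = n_pos, fp = n_neg, fn = 0
baseline_precision = n_pos / (n_pos + n_neg)
baseline_recall = 1.0
baseline_f1 = (2 * baseline_precision * baseline_recall
               / (baseline_precision + baseline_recall))

# Balanced model, from the table above: tp = 94, fp = 2288, fn = 0
model_precision = 94 / (94 + 2288)
model_recall = 94 / (94 + 0)
model_f1 = 2 * model_precision * model_recall / (model_precision + model_recall)

print('baseline: precision {:.3f}, recall {:.3f}, F1 {:.3f}'.format(
    baseline_precision, baseline_recall, baseline_f1))
print('balanced: precision {:.3f}, recall {:.3f}, F1 {:.3f}'.format(
    model_precision, model_recall, model_f1))
```

The two rows agree to three decimal places, which suggests that in this run the balanced weighting over-corrected rather than improving on the first model.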



In [28]:
# delete the predictor endpoint 
delete_endpoint(balanced_predictor)

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Deleted linear-learner-2021-11-03-00-56-47-745
