# Unsupervised Anomaly Detection on Security Logs using <u>Levenshtein Similarity Score</u>
This is a continuation of the prior research. In this notebook, we use the Isolation Forest algorithm to train an unsupervised anomaly detection model. However, instead of text vector embeddings, the input here is built from Levenshtein distances. I expect this approach to be much less accurate: a Levenshtein distance is a single scalar, the minimum number of single-character edits (insertions, deletions, substitutions) needed to make two text passages identical, whereas the earlier approach used a 384-dimension vector embedding produced by a language model that tries to capture the meaning of the text. This is a no-AI approach to this problem, but it's worth investigating. Will it work? Will it come close to the known actual 950/50 split of good/bad log entries? Let us see...

In [1]:
### Uncomment and run the lines below if this is the first time executing this notebook. Package installs are in requirements.txt.
#! python -m venv venv
#! powershell venv\Scripts\Activate.ps1
#! pip install pandas
#! pip install matplotlib
#! pip install scikit-learn  # the PyPI package is scikit-learn; it is imported as sklearn
#! pip install levenshtein

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.metrics import pairwise_distances
import Levenshtein as lev

In [None]:
# Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions
# required to change one text sequence into the other. By default each operation costs 1; custom costs
# for insertion, deletion, and substitution are also supported.
# Example...should return a score of 2 (substitute w -> v, insert h):
lev.distance("lewenstein", "levenshtein")
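
To make the metric concrete, here is a minimal pure-Python sketch of the classic dynamic-programming calculation with unit costs. The function name `levenshtein` is illustrative only; the notebook itself relies on the `Levenshtein` library, which is far faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, and substitutions each cost 1."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b (starting with the empty prefix of a).
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein("lewenstein", "levenshtein"))  # 2
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.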

In [None]:
# We'll use the fabricated mixed proxy logs with known 950 benign log entries and 50 malicious log entries 
# Our goal is to see if we can get them accurately classified into that 950 benign + 50 malicious
# Read the log data into a dataframe, and drop the columns we don't need for this exercise.
df = pd.read_csv('proxy_logs_mixed.csv')
df = df.drop(['IP Address', 'Timestamp'], axis=1)  # We don't need IP nor timestamp for this task
df.sample(3)

In [None]:
# Compute the pairwise distance matrix
# This creates a matrix of every row x every row and its distance, related back to the original 
# dataframe by the ID and using Levenshtein distance as the metric. This matrix looks something like: 
#       ABC   BCD   DEF
# ABC    0     2     3
# BCD    2     0     3
# DEF    3     3     0
# For instance, the distance between ABC and BCD in edits is 1 delete and 1 insertion, for a total of 2.
# ABC and DEF share no characters, so the cheapest path is 3 substitutions.

distance_matrix = pairwise_distances(df['Log Entry'].values.reshape(-1, 1), 
                                     metric=lambda x, y: lev.distance(x[0], y[0]))
distance_matrix.shape # Check the shape of the matrix
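
As a rough illustration of what the matrix holds, the sketch below builds it by hand for the three toy strings from the comment above. It is an alternative pure-Python construction, not the notebook's method: the helper names `levenshtein` and `build_distance_matrix` are made up for this example, and it fills both halves of the matrix from a single calculation per pair because the distance is symmetric.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (same two-row dynamic programme as a reference implementation)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def build_distance_matrix(entries):
    """Return an n x n list of lists; the diagonal stays 0 and each pair is computed once."""
    n = len(entries)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = levenshtein(entries[i], entries[j])
        matrix[i][j] = d
        matrix[j][i] = d  # distance is symmetric, so mirror it
    return matrix


toy_entries = ["ABC", "BCD", "DEF"]
toy_matrix = build_distance_matrix(toy_entries)
for label, row in zip(toy_entries, toy_matrix):
    print(label, row)
```

Each row of the matrix then serves as the feature vector for one log entry: its distance to every other entry in the data set.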

In [None]:
# Define a model object with the Isolation Forest algorithm
# Let's favor a large number of samples and let the algorithm figure out the contamination ratio
model = IsolationForest(n_estimators=100, max_samples=1000, contamination='auto', random_state=96)

# Fit the data to the model
model.fit(distance_matrix)

# Display parameter values that were used
model.get_params()

In [26]:
# Add a column to the DF for raw scores from the model's decision_function
df['raw_score'] = model.decision_function(distance_matrix)

# Add a column to the DF for the anomaly flag from the model's predict function...-1 indicates anomaly
df['anomaly_score'] = model.predict(distance_matrix)
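
The two columns are closely related: in scikit-learn, `predict` returns -1 exactly where `decision_function` is negative, and with `contamination='auto'` the threshold offset is fixed at -0.5. The sketch below checks this on small synthetic data; the array `X` and its two clusters are invented for illustration and have nothing to do with the proxy logs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption for this sketch): 95 points near the
# origin plus 5 points far away that should look isolated.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(95, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(5, 2)),
])

model = IsolationForest(n_estimators=100, contamination='auto', random_state=96)
model.fit(X)

scores = model.decision_function(X)  # negative values mean "more anomalous"
flags = model.predict(X)             # -1 for anomaly, 1 for normal

# Rebuild the flags from the raw scores to show how the two relate.
derived = np.where(scores < 0, -1, 1)
print(np.array_equal(flags, derived))
```

Keeping the raw score alongside the flag is useful because it lets you try a stricter or looser cutoff later without refitting the model.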

In [None]:
# Display the data with the score columns added
df.sample(3)

In [None]:
# Display just the anomalies
df[df['anomaly_score']==-1].sample(3)

In [None]:
# The outliers have anomaly_score = -1
# This is pretty darned good, given that we know the real split is 950/50
df['anomaly_score'].value_counts()

In [None]:
# Scatter plot of the anomaly_score results...it's what we'd expect to see given the counts above
plt.figure(figsize=(10, 7))
scatter = plt.scatter(df.raw_score, df.anomaly_score)
plt.title('Anomaly Scatter Plot')
plt.xlabel('Raw Score')
plt.ylabel('Anomaly Score')
plt.show()

Our unsupervised ML model, trained using a very simple distance metric that was merely the number of edits needed to make two text passages the same, did surprisingly well. It flagged 70 log entries as anomalies, with 930 seen as typical. We know the real answer from the fabricated test log entry data is 950 benign and 50 malicious. Note that the counts alone do not show how many of the 70 flagged entries are among the 50 truly malicious ones; confirming that requires comparing the flags against the known labels. This may be worth testing on real data and/or larger data sets, as it might be at least usable as a coarse-grained pre-filter.
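
Since the true labels of the fabricated data are known, a natural follow-up is to score the flags against them. The sketch below is a minimal precision/recall calculation on a tiny made-up example; the helper `precision_recall`, the toy lists, and the label encoding (1 = malicious, 0 = benign) are all assumptions, and the notebook's CSV is not assumed to contain a label column. The final line is plain arithmetic on the counts reported above: with 70 entries flagged and 50 truly malicious, precision cannot exceed 50/70.

```python
def precision_recall(true_labels, predicted_flags):
    """true_labels: 1 = malicious, 0 = benign. predicted_flags: -1 = anomaly, 1 = normal."""
    true_positives = sum(
        1 for truth, flag in zip(true_labels, predicted_flags) if truth == 1 and flag == -1
    )
    flagged = sum(1 for flag in predicted_flags if flag == -1)
    actual_malicious = sum(true_labels)
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_malicious if actual_malicious else 0.0
    return precision, recall


# Hypothetical toy data: 10 entries, 2 of them malicious, 3 flagged by the model.
toy_truth = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
toy_flags = [1, 1, 1, -1, 1, 1, 1, -1, -1, 1]

precision, recall = precision_recall(toy_truth, toy_flags)
print(f"precision={precision:.3f} recall={recall:.3f}")

# Upper bound for the run described above: 70 flagged, 50 truly malicious.
best_case_precision = 50 / 70
print(f"best-case precision={best_case_precision:.3f}")
```

For a pre-filter, recall is usually the number to watch, since a missed malicious entry never reaches the later, more expensive stage.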