# <u>Neural/AI Search applied to Proxy Logs</u>

The intention of this Jupyter notebook is to examine how we might apply AI search to security logs. We'll be using fabricated proxy logs that (in theory) show benign behavior, malicious behavior, and a mix of the two (950 benign entries plus 50 malicious entries). I hope to demonstrate the following three examples:
- Semantic matching: Compare incoming logs to a vector DB of known malicious logs, flagging matches for alert
- Anomaly detection: Compare incoming logs to a vector DB of known benign logs, flagging possible anomalies for alert
- Full-text Clustering: Show clustering done with full text by leveraging vector embeddings

We'll be using a simple development-grade vector datastore called ChromaDB, and a locally running HuggingFace sentence-transformer model (all-MiniLM-L6-v2) that is commonly used to calculate general-purpose vector embeddings of text.

Note that there are many synonyms for <b>vector search</b>...here's a list:
- Vector similarity search
- Semantic similarity search
- Neural search (because of its basis in neural network models for the embeddings)
- AI search
- KNN search (K-Nearest Neighbors...this is the old and/or conceptual name)
- ANN search (Approximate Nearest Neighbors) 
- HNSW (Hierarchical Navigable Small World) search (the most popular approximation algorithm)
- Probably a few others I've forgotten

All of these terms refer to the same family of concepts and technology, although strictly speaking KNN, ANN, and HNSW name the underlying search algorithms. It's worth noting that vector search technology has been used in large-scale recommendation systems, fraud detection systems, and other "big data" applications for nearly a decade. In fact, handling very large volumes of full-text data is exactly what this technology was invented for. It's also worth noting that vector embeddings are not just relevant to text...given a suitable ML model, vector embeddings can be captured for images, audio, video, even binary executables and encrypted data.
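
To make the idea concrete, here is a minimal sketch of what an exact (brute-force) K-nearest-neighbors search does with cosine similarity. The function name `knn_cosine` and the tiny 3-dimensional vectors are made up for illustration; real embeddings from the model used later have 384 dimensions, and ANN indexes such as HNSW approximate this same ranking without scanning every vector.

```python
import numpy as np

def knn_cosine(query, corpus, k=3):
    """Return (indices, similarities) of the k corpus rows most similar to query."""
    corpus = np.asarray(corpus, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalize to unit length so a dot product equals cosine similarity
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = corpus_unit @ query_unit
    top = np.argsort(-sims)[:k]  # highest similarity first
    return top, sims[top]

# Toy "embeddings" (illustrative values only, not real model output)
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
idx, sims = knn_cosine([1.0, 0.02, 0.0], corpus, k=2)
print(idx, sims)
```

The scan above costs one dot product per stored vector, which is why approximate indexes become attractive as the collection grows.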

In [None]:
import os, sys
import pandas as pd 
import chromadb
#from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

# <u>Semantic Matching: Match incoming logs against known malicious logs</u>
## Step 1: Hydrate vector DB of malicious logs
We can use this vector store of known malicious logs as reference data: incoming logs are compared against it to determine whether any of them are similar enough to raise an alert.

In [None]:
# This CSV file contains fabricated proxy logs that are examples of malicious activity attempts
df = pd.read_csv('proxy_logs_malicious.csv')
df.columns

In [None]:
df.sample(3) # I've purposefully generated these to look like something that might be extracted from Splunk

In [None]:
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()
#client = chromadb.PersistentClient(path='./chromadb_proxy_logs')  # A bug prevents this from working

In [None]:
# Create collection. get_collection, get_or_create_collection, delete_collection also available
# ChromaDB uses L2 (Euclidean distance) by default...we want the cosine metric.
# With "cosine", ChromaDB reports cosine DISTANCE (1 - cosine similarity) -> lower = better
collection = client.get_or_create_collection(name='malicious_proxy_logs', metadata={"hnsw:space": "cosine"})

In [None]:
# Add docs to the collection. Can also update and delete.

# Create lists of the necessary data from the dataframe
ID_list = df['ID'].astype(str).tolist()  # ID list, converted to string
LogEntry_list = df['Log Entry'].tolist()   # List of documents (log content)

# We are letting ChromaDB automatically calculate the vector embedding, instead of explicitly handling it
# By default, ChromaDB uses all-MiniLM-L6-v2 sentence transformer model to calculate vector embeddings
# This all-MiniLM-L6-v2 model provides a 384 dimension vector that can be used for embedding and clustering
collection.add(
    documents=LogEntry_list, # ChromaDB handles tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    #metadatas=[{"source": "notion"}, {"source": "google-docs"}], # metadata filters
    ids=ID_list, # unique ID for each doc
)

In [None]:
collection.count()  # Check our count to make sure it looks right

In [None]:
# Examine a record by ID...tell it to show the vector embedding so we can see what it looks like
# The vector embedding is the 384 dimension numeric representation of what that text "means"...
collection.get('1', include=['embeddings', 'documents', 'metadatas'])
#collection.get(['1','2'], include=['embeddings', 'documents', 'metadatas'])

In [None]:
# Let's examine simple calcs around vector distance to foster understanding
import numpy
import math
from scipy import spatial

# Grab data from two of the records we've put into the vector DB to examine
vec1 = collection.get('1', include=['embeddings', 'documents'])['embeddings'][0]
vec2 = collection.get('2', include=['embeddings', 'documents'])['embeddings'][0]

# Calc the euclidean (or L2) distance between those two vectors
# This is the same thing you'd do with straight-line distance between 2 cities on a map
# Euclidean distance is sensitive to vector magnitude...a lower distance is a better match
euclidean_dist = spatial.distance.euclidean(vec1, vec2)
print('euclidean distance (smaller value = more similar): ', euclidean_dist)

# Cosine similarity is the cosine of the angle between two vectors...it captures direction but not magnitude
# Cosine is often a good metric for text comparison...note that a higher value is a better match
# Cosine example values:
# 180 degrees = -1.0
# 120 degrees = -0.5
#  90 degrees =  0.0
#  60 degrees = +0.5
#   0 degrees = +1.0
# scipy's distance.cosine() returns cosine DISTANCE (1 - similarity), so convert it back to similarity
cosine_sim = 1 - spatial.distance.cosine(vec1, vec2)
print('cosine similarity (larger value = more similar): ', cosine_sim)
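
As a quick sanity check on the angle table in the cell above, the following sketch computes cosine similarity and cosine distance (1 - similarity) for a few angles using plain numpy. The helper `cosine_similarity` is written here for illustration and is not part of scipy or ChromaDB. It may help to keep the two quantities apart: similarity grows as vectors align, while distance shrinks.

```python
import math
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = np.array([1.0, 0.0])
for degrees in (0, 60, 90, 120, 180):
    theta = math.radians(degrees)
    other = np.array([math.cos(theta), math.sin(theta)])
    sim = cosine_similarity(reference, other)
    dist = 1.0 - sim  # cosine distance: 0 = same direction, 2 = opposite direction
    print(f"{degrees:3d} degrees: similarity={sim:+.2f}  distance={dist:.2f}")
```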


In [None]:
# Execute an ANN query/search for K most similar results.
results = collection.query(
    query_texts=["http://www.example.com/../../etc/passwd"], # This gets vectorized and used for vector query
    n_results=3,
    # where_document={"$contains":"Macintosh"}  # optional keyword filter
    # where={"metadata_field": "is_equal_to_this"}, # optional metadata filter
)

results

In [None]:
# Ask a similar question, but this time in plain English!
results = collection.query(
    query_texts=["Can you show me possible attempts to change a password?"],  # Vectorize and search
    n_results=3,
    # where_document={"$contains":"script"}  # optional keyword filter
    # where={"metadata_field": "is_equal_to_this"}, # optional metadata filter
)

results

We've built a vector datastore of known malicious logs...we can now use this to compare with logs of an unknown risk.

## Step 2: Semantic matching of malicious attempts
Using our vector DB of malicious logs to execute neural searches against, let's feed it some fresh logs to see what matches we get. We want to use the cosine metric for this; ChromaDB reports it as cosine distance (1 - cosine similarity), so smaller scores mean closer matches. We'll have to experiment a bit to set a reasonable threshold for when to return an alert vs not.
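
The matching logic can be sketched independently of any vector database. In the block below, the function names, the toy vectors, and the `max_distance` value of 0.05 are all illustrative assumptions rather than values taken from this notebook's data; the point is only that an alert fires when the nearest known-malicious vector lies within a chosen cosine distance.

```python
import numpy as np

def cosine_distance_matrix(queries, references):
    """Pairwise cosine distance (1 - cosine similarity) between rows."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    return 1.0 - q @ r.T

def flag_matches(queries, references, max_distance):
    """Return (query_idx, ref_idx, distance) where the nearest reference is within max_distance."""
    dists = cosine_distance_matrix(queries, references)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(queries)), nearest]
    return [(int(i), int(nearest[i]), float(nearest_dist[i]))
            for i in np.flatnonzero(nearest_dist <= max_distance)]

# Toy vectors standing in for embeddings (hypothetical values)
known_malicious = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
incoming = np.array([[0.99, 0.05, 0.0],   # close to malicious #0
                     [0.0, 0.1, 1.0],     # far from both
                     [0.02, 1.0, 0.01]])  # close to malicious #1
alerts = flag_matches(incoming, known_malicious, max_distance=0.05)
for query_idx, ref_idx, dist in alerts:
    print(f"incoming {query_idx} matches malicious {ref_idx} at distance {dist:.4f}")
```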

In [None]:
# Some fabricated proxy logs that contain 950 benign log entries and 50 malicious log entries
df = pd.read_csv('proxy_logs_mixed.csv')

In [None]:
df.sample(3)

In [None]:
# Find incoming log entries that match known malicious activity at a pre-determined threshold
incoming_proxy_logs = df['Log Entry']

for log_entry in incoming_proxy_logs:
    results = collection.query(
        query_texts=[log_entry],
        n_results=1,
        # where={"metadata_field": "is_equal_to_this"}, # optional metadata filter
        where_document={"$contains": "Macintosh"},  # optional keyword filter: only stored logs containing "Macintosh" are candidates
    )
    # ChromaDB returns cosine DISTANCE (1 - cosine similarity) for a "cosine" collection,
    # so the best matches are LOW values...a distance near 0 means nearly identical text
    if results['distances'][0][0] <= 0.001:  # Threshold for a match
        print(f'Score {results["distances"][0][0]}: {log_entry} \n -match ID {results["ids"][0][0]} content: {results["documents"][0][0]}')

We demonstrated here that we can do a vector search comparing incoming log records with a vector DB of known malicious log records, get the top reasonable matches, and do so without any drastic data manipulation. (Note that ChromaDB reports cosine distance, i.e. 1 - cosine similarity, so the best matches are the lowest scores rather than the highest.) In fact, we <b><i>did not even parse any of the log content</i></b>, we just used the vector embeddings of that log content. We are doing semantic matching using the vector embeddings of the full text...this is the important item to take note of. The data in this case are sets of fabricated proxy logs...this exercise needs to be repeated with real logs.

# <u>Anomaly Detection by AI search on Benign Logs</u>
## Step 1: Build vector DB of Benign Proxy log data
Now, let's build a vector DB from log data that is good rather than bad.

In [None]:
# Re-use the `collection` variable for a new collection of benign logs (the malicious collection still exists in the client)
collection = client.get_or_create_collection(name='benign_proxy_logs', metadata={"hnsw:space": "cosine"})

In [None]:
# This CSV file contains fabricated proxy logs that are examples of benign activity
df = pd.read_csv('proxy_logs_good.csv')
df.sample(3)

In [None]:
# Create lists of the necessary data from the dataframe
ID_list = df['ID'].astype(str).tolist()  # ID list, converted to string
LogEntry_list = df['Log Entry'].tolist()   # List of documents (log content)

# Add log records to vector store, allowing ChromaDB to calculate vector embeddings
collection.add(
    documents=LogEntry_list, # ChromaDB handles tokenization, embedding, and indexing automatically.
    #metadatas=[{"source": "notion"}, {"source": "google-docs"}], # metadata filters
    ids=ID_list, # unique ID for each doc
)

In [None]:
collection.count()  # Check our count to make sure it looks right

In [None]:
# Check a few records...
collection.get(['1','2','3'])

## Step 2: Vector search for Anomaly Detection
With a vector DB of known benign/good proxy log entries, we can do semantic comparison to flag incoming log entries that are <b><i>too different</i></b> from what we know to be benign logs. 
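
The inverse of matching can be sketched the same way: score each incoming vector by its cosine distance to the nearest benign vector, and flag it when even that nearest neighbor is far away. The helper name, the toy 2-dimensional vectors, and the 0.3 cutoff used here are illustrative assumptions.

```python
import numpy as np

def nearest_cosine_distance(queries, references):
    """Cosine distance from each query row to its closest reference row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    return (1.0 - q @ r.T).min(axis=1)

# Toy vectors standing in for embeddings (hypothetical values)
benign = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]])
incoming = np.array([[0.95, 0.15],   # resembles the benign set
                     [-0.2, 1.0]])   # unlike anything in the benign set
scores = nearest_cosine_distance(incoming, benign)
anomalies = np.flatnonzero(scores >= 0.3)  # high distance = too different from known-good logs
print(scores, anomalies)
```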

In [None]:
# We'll use fabricated proxy logs that contain 950 benign log entries and 50 malicious log entries again
df = pd.read_csv('proxy_logs_mixed.csv')
df.sample(3)

In [None]:
# Let's identify anomalies in the incoming logs by using vector search against DB of known benign logs
incoming_proxy_logs = df['Log Entry']

for log_entry in incoming_proxy_logs:
    results = collection.query(
        query_texts=[log_entry],
        n_results=1,
        # where={"metadata_field": "is_equal_to_this"}, # optional filter
        # where_document={"$contains":"search_string"}  # optional filter
    )
    # ChromaDB returns cosine DISTANCE (1 - cosine similarity), so a HIGH value means that even
    # the nearest benign log is not very similar to the incoming entry
    if results['distances'][0][0] >= 0.3:  # Threshold for a non-match
        print(f'Anomaly score {results["distances"][0][0]}: {log_entry} \n -nearest benign ID {results["ids"][0][0]} content: {results["documents"][0][0]}')

We've gotten some possible anomalies that appear to be drastic non-matches (keeping in mind that ChromaDB reports cosine distance, so a high score means even the closest benign log is dissimilar). Just like the previous matching exercise, this was done with no parsing or data manipulation of the log data...only the vector embeddings of the log data were used. In practice, we'd likely do some pre-filtering with metadata and keyword matches to apply the vector search to the smallest subset reasonably possible. For anomaly detection, however, that could still be a pretty large data set. Regardless, good next steps would be to test this against a sample of real logs and compare what we get.
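
One possible way to pick an anomaly threshold, offered here as a hedged sketch rather than a recommendation, is to look at how far benign logs typically sit from their own nearest benign neighbor and use a high percentile of that distribution as a starting point. The synthetic vectors, the helper name, and the 99th percentile are all arbitrary assumptions for illustration.

```python
import numpy as np

def leave_one_out_nn_distances(vectors):
    """Cosine distance from each vector to its nearest OTHER vector in the same set."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    dists = 1.0 - unit @ unit.T
    np.fill_diagonal(dists, np.inf)  # ignore each vector's distance to itself
    return dists.min(axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for benign embeddings: a tight cluster around one direction
benign = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(200, 3))
nn = leave_one_out_nn_distances(benign)
threshold = float(np.percentile(nn, 99))  # arbitrary starting point, to be tuned on real logs
print(f"suggested starting threshold: {threshold:.5f}")
```

With real logs, the resulting value would still need to be checked against known malicious samples before being trusted.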

# <u>Full-text Clustering with Vector Embeddings</u>
We will use the mixed proxy log data, which contains 950 benign logs + 50 malicious logs, to see if we can segregate the two using clustering. This is a notable exercise...only recently has this technology advanced to the point that we can actually do clustering with full text such as security logs!
We are going to feed the clustering algorithms the mixed data and (at least for K-means) tell them we want two cluster groups. What we'd like to see is around 950 members landing in one of those groups and around 50 members landing in the other. Let's see how this looks...
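
Cluster sizes alone cannot confirm that the split is the right one, since a 950/50 split could still mix the two classes. If ground-truth labels are available, a small contingency table makes that visible. The sketch below assumes a hypothetical label for each entry; this notebook does not confirm that the mixed CSV carries such a column.

```python
from collections import Counter

def cluster_label_table(cluster_ids, true_labels):
    """Count how many entries of each true label landed in each cluster."""
    return Counter(zip(cluster_ids, true_labels))

# Toy example: the labels are hypothetical ground truth, not taken from the CSV
clusters = [0, 0, 0, 1, 1, 0]
labels = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]
table = cluster_label_table(clusters, labels)
for (cluster, label), count in sorted(table.items()):
    print(f"cluster {cluster} / {label}: {count}")
```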


In [None]:
import pandas as pd

# We'll use the mixed proxy logs to see if we can get them clustered into 950 benign + 50 malicious
df = pd.read_csv('proxy_logs_mixed.csv')
df = df.drop(['IP Address', 'Timestamp'], axis=1)  # We don't need IP nor timestamp for this task
df.sample(3)


In [None]:
# It's easier to just calc the vector embeddings explicitly, vs inserting into ChromaDB and reading the embeddings back
# ChromaDB can't do clustering queries (yet), but we will use the same embedding model that ChromaDB is using

# Load the embedding model so we can use it to easily populate the dataframe
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Use like this: embeddings = model.encode(whatever_text)

# Encode all log entries in one batch (typically faster than row-by-row), then store one vector per row
df['embedding'] = list(model.encode(df['Log Entry'].tolist()))
df.sample(3)

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Extract the embeddings from the DataFrame
embeddings = df['embedding'].tolist()

# Convert the list of embeddings into a numpy array
# import numpy as np
X = np.array(embeddings)

# Perform K-means clustering with K=2
# We know we have good logs and bad logs, so let's try to cluster them into good and bad
kmeans = KMeans(n_clusters=2, n_init=100, max_iter=3000, random_state=0)
#kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)  # These are defaults
#kmeans = KMeans(n_clusters=2, n_init=10000, max_iter=10000, random_state=0)
df['cluster_kmeans'] = kmeans.fit_predict(X)

# Now `df` will have a new column 'cluster_kmeans' indicating the cluster each entry belongs to
df.columns


In [None]:
df.sample(3)

In [None]:
# We hope to see cluster group counts in the ballpark of 950 and 50
df['cluster_kmeans'].value_counts()

In [None]:
# Let's try dimensionality reduction to 2 dimensions from 384, then try re-clustering
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Extract the embeddings from the DataFrame
embeddings = df['embedding'].tolist()

# Convert the list of embeddings into a numpy array
import numpy as np
X = np.array(embeddings)

# Perform PCA for dimensionality reduction
n_components = 2  # Number of dimensions to reduce to
pca = PCA(n_components=n_components)
principal_components = pca.fit_transform(X)

# Add the principal components to the DataFrame
df_pca = pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(n_components)])
df = pd.concat([df, df_pca], axis=1)

# Perform K-means clustering using the PCA components, and add a 'cluster_pca_kmeans' column from that exercise
kmeans2 = KMeans(n_clusters=2, random_state=0)
df['cluster_pca_kmeans'] = kmeans2.fit_predict(df[['PC1', 'PC2']])


In [None]:
df.sample(3)

In [None]:
# We hope to see cluster group counts in the ballpark of 950 and 50
df['cluster_pca_kmeans'].value_counts()

In [None]:
# Visualize the clusters
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.scatter(df['PC1'], df['PC2'], c=df['cluster_pca_kmeans'], cmap='viridis', marker='o')
plt.title('KMeans Clustering on PCA-Reduced Data')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='Cluster_pca_kmeans')
plt.show()
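
When reducing 384 dimensions to 2, it is worth checking how much of the variance the two components actually keep, because a 2D plot can hide most of the structure. The sketch below computes that ratio with plain numpy on synthetic data (the data and the helper name are illustrative assumptions); for a fitted scikit-learn `PCA` object, the `explained_variance_ratio_` attribute reports the same quantity.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Fraction of total variance captured by each of the first n principal components."""
    Xc = X - X.mean(axis=0)                  # PCA operates on mean-centered data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
    var = s ** 2
    return var[:n_components] / var.sum()

rng = np.random.default_rng(0)
# Synthetic 10-dimensional data in which two directions dominate
X = rng.normal(size=(500, 10)) * np.array([5.0, 3.0] + [0.5] * 8)
ratio = explained_variance_ratio(X)
print(ratio, ratio.sum())
```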

In [None]:
# Let's try using DBSCAN clustering instead of Kmeans...we'll use cosine distance
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Convert the embeddings column to a numpy array
embeddings = np.array(df['embedding'].tolist())

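# Standardize the data (optional...note that centering each dimension changes the angles that cosine distance measures, so compare results with and without this step)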
scaler = StandardScaler()
embeddings_scaled = scaler.fit_transform(embeddings)

# Calculate the cosine distance matrix
cosine_dist_matrix = cosine_distances(embeddings_scaled)

# Perform DBSCAN clustering with cosine distance
# Parameters to adjust: `eps` (radius of neighborhood), `min_samples` (min points to form a cluster)
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='precomputed')
df['cluster_dbscan'] = dbscan.fit_predict(cosine_dist_matrix)

# Analyzing the results
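# Use .values for the membership test: `-1 in series` checks the index labels, not the cluster values
n_clusters = len(set(df['cluster_dbscan'])) - (1 if -1 in df['cluster_dbscan'].values else 0)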
n_noise = list(df['cluster_dbscan']).count(-1)

print(f'Estimated number of DBSCAN clusters: {n_clusters}')
print(f'Estimated number of noise points: {n_noise}')

# Confirm that the new 'cluster_dbscan' column was added
df.columns


In [None]:
df.sample(3)

In [None]:
# We hope to see cluster group counts in the ballpark of 950 and 50
df['cluster_dbscan'].value_counts()

# It appears that perhaps cluster 2 with 53 points might be our malicious logs, but...
# I'm not sure (yet) how we would determine that if we didn't already know there were 50 bad logs

In [None]:
# Plot the results after applying PCA

# Reduce the embeddings to 2 dimensions using PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 7))
scatter = plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=df['cluster_dbscan'], cmap='viridis', marker='o')
plt.title('DBSCAN Clustering with Cosine Distance (PCA Reduced to 2D)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(scatter, label='Cluster_dbscan')
plt.show()

In [None]:
# Let's re-try Kmeans clustering, but instead of clustering the raw vectors with the default euclidean metric
# we will L2-normalize the embeddings so that euclidean distance ranks pairs the same way cosine distance does
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Extract the embeddings from the DataFrame
embeddings = df['embedding'].tolist()

# Normalize the embeddings to unit length: for unit vectors, squared Euclidean distance = 2 * cosine distance
embeddings_normalized = normalize(embeddings)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=0)  # Adjust the number of clusters as needed
df['cluster_kmeans_cosine'] = kmeans.fit_predict(embeddings_normalized)

df.columns
# Visualize the clusters with PCA or t-SNE as discussed earlier
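
The normalization trick in the cell above relies on a simple identity: for unit-length vectors, the squared Euclidean distance equals twice the cosine distance, so both metrics rank neighbors identically. The sketch below checks this numerically on two random 384-dimensional vectors (random stand-ins, not real embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Scale both vectors to unit length
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine_distance = 1.0 - a_unit @ b_unit
squared_euclidean = np.sum((a_unit - b_unit) ** 2)

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(theta) for unit vectors
print(squared_euclidean, 2 * cosine_distance)
```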


In [None]:
# We hope to see cluster group counts in the ballpark of 950 and 50
df['cluster_kmeans_cosine'].value_counts()


In [None]:
# Plot the results after applying PCA

# Reduce the embeddings to 2 dimensions using PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 7))
scatter = plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=df['cluster_kmeans_cosine'], cmap='viridis', marker='o')
plt.title('KMeans Clustering on Normalized Embeddings (PCA Reduced to 2D)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(scatter, label='Cluster_Kmeans_Cosine')
plt.show()

We've successfully demonstrated how we can do clustering analysis with full-text data.  However, there's no value in analyzing further since this is all fabricated data. 

# Where do we go from here?
Some areas for further research:
- Try different embedding models to determine which one performs best with proxy logs. The one we used is very generic; one fine-tuned on event logs or specifically on cybersecurity content might produce much better results.
- Dig deeper into the data to determine how we would be able to name clusters from DBSCAN (with real data, not worthwhile with fake data).
- Examine other clustering algorithms that might be more meaningful (with real data, not worthwhile with fake data).
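
As a starting point for the cluster-naming question, one simple heuristic is to list the tokens that are frequent in one cluster but not present in every cluster. The sketch below is only an illustration of that idea: the function name, the tokenizing regex, and the four toy log lines are assumptions, and real proxy logs would likely need a more careful tokenizer.

```python
import re
from collections import Counter, defaultdict

def top_tokens_per_cluster(texts, cluster_ids, n=3):
    """For each cluster, list its most frequent tokens that are not shared by all clusters."""
    per_cluster = defaultdict(Counter)
    for text, cid in zip(texts, cluster_ids):
        # Count each token once per log entry
        per_cluster[cid].update(set(re.findall(r"[a-z0-9_]+", text.lower())))
    # Tokens seen in every cluster say little about what makes a cluster distinct
    shared = set.intersection(*(set(c) for c in per_cluster.values()))
    return {cid: [tok for tok, _ in counts.most_common() if tok not in shared][:n]
            for cid, counts in per_cluster.items()}

# Toy log lines and cluster assignments (hypothetical, not from the CSV files)
logs = [
    "GET /index.html 200",
    "GET /about.html 200",
    "GET /../../etc/passwd 403",
    "GET /../../etc/shadow 403",
]
clusters = [0, 0, 1, 1]
summary = top_tokens_per_cluster(logs, clusters, n=2)
print(summary)
```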
