# <u>Neural/AI Search applied to Proxy Logs</u>

The intention with this Jupyter notebook is to examine how we might apply AI search to security logs.  We'll be using fabricated proxy logs that (in theory) show benign behavior, malicious behavior, and a mix of those two. I want to demonstrate the following three examples:
- Compare incoming logs to a vector DB of known malicious logs, flagging matches for alert
- Compare incoming logs to a vector DB of known benign logs, flagging possible anomalies for alert
- Cluster full-text log entries using their vector embeddings

We'll be using a simple development-grade vector datastore called ChromaDB, and a locally running Hugging Face sentence-transformer model that is commonly used for calculating general-purpose vector embeddings of text.

In [None]:
import os, sys
import pandas as pd 
import chromadb
#from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

# <u>Flag incoming logs that match known malicious logs</u>
## Create vector DB of malicious logs
We can use this vectorstore of known malicious logs as reference data to compare incoming logs against to determine if any of those events are similar enough to raise an alert.  

In [None]:
# This CSV file contains fabricated proxy logs that are examples of malicious activity attempts
df = pd.read_csv('proxy_logs_malicious.csv')
df.columns

In [None]:
df.sample(3)

In [None]:
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()
#client = chromadb.PersistentClient(path='./chromadb_proxy_logs')

In [None]:
# Create collection. get_collection, get_or_create_collection, delete_collection also available
# ChromaDB uses L2 (Euclidean distance) by default...we want the cosine metric.
# Note: ChromaDB reports cosine DISTANCE (1 - cosine similarity) -> lower = better match
collection = client.get_or_create_collection(name='malicious_proxy_logs', metadata={"hnsw:space": "cosine"})



In [None]:
# Create lists of the necessary data from the dataframe
ID_list = df['ID'].astype(str).tolist()  # ID list, converted to string
LogEntry_list = df['Log Entry'].tolist()   # List of documents (log content)


In [None]:

# Add docs to the collection. Can also update and delete.
# We are letting ChromaDB automatically calculate the vector embedding, instead of explicitly handling it
# By default, ChromaDB uses all-MiniLM-L6-v2 sentence transformer model to calculate vector embeddings
# This all-MiniLM-L6-v2 model provides a 384 dimension vector that can be used for embedding and clustering
collection.add(
    documents=LogEntry_list, # ChromaDB handles tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    #metadatas=[{"source": "notion"}, {"source": "google-docs"}], # metadata filters
    ids=ID_list, # unique ID for each doc
)

In [None]:
# Examine a record by ID...tell it to show the vector embedding so we can see what it looks like
# The vector embedding is the 384 dimension numeric representation of what that text "means"...
collection.get('1', include=['embeddings', 'documents', 'metadatas'])
#collection.get(['1','2'], include=['embeddings', 'documents', 'metadatas'])

In [None]:
# Let's examine simple calcs around vector distance to foster understanding
import numpy
import math
from scipy import spatial

# Grab data from two of the records we've put into the vector DB to examine
vec1 = collection.get('1', include=['embeddings', 'documents'])['embeddings'][0]
vec2 = collection.get('2', include=['embeddings', 'documents'])['embeddings'][0]

# Calc the euclidean (or L2) distance between those two vectors
# This is the same thing you'd do with straight-line distance between 2 cities on a map
# Euclidean distance is affected by vector length as well as direction...a lower distance is a better match
euclidean_dist = spatial.distance.euclidean(vec1, vec2)
print('euclidean_dist: ', euclidean_dist)

# Cosine compares the angle between the vectors...it captures direction but ignores magnitude
# Cosine is often a good metric for text comparison
# Note that SciPy returns cosine DISTANCE (1 - cosine similarity), so a lower value is a better match
cosine_dist = spatial.distance.cosine(vec1, vec2)
print('cosine dist: ', cosine_dist)
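
To make the two metrics concrete, here is a minimal pure-Python sketch of what the SciPy calls compute. The toy vectors `a` and `b` are illustrative assumptions, not real embeddings. The point to notice is that cosine distance is 1 minus cosine similarity, so 0 means the two vectors point in the same direction.

```python
import math

def euclidean_distance(u, v):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance as SciPy and ChromaDB report it: 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy 2-D vectors (illustrative only): b points the same way as a but is twice as long
a = [1.0, 2.0]
b = [2.0, 4.0]

print('euclidean:', euclidean_distance(a, b))      # non-zero, because the lengths differ
print('cosine distance:', cosine_distance(a, b))   # approximately 0, because the directions match
```

The same pair of vectors is "far apart" by Euclidean distance and "identical" by cosine distance, which is why the choice of metric matters when setting alert thresholds.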


In [None]:
# Execute an ANN Query/search for K most similar results.
results = collection.query(
    query_texts=["http://www.example.com/../../etc/passwd"], # This gets vectorized and used for vector query
    n_results=3,
    # where_document={"$contains":"Macintosh"}  # optional keyword filter
    # where={"metadata_field": "is_equal_to_this"}, # optional metadata filter
)

results

In [None]:
# Ask a related question, but this time in plain English!
results = collection.query(
    query_texts=["Can you show me possible attempts to change a password?"],  # Vectorize and search
    n_results=3,
    # where_document={"$contains":"script"}  # optional keyword filter
    # where={"metadata_field": "is_equal_to_this"}, # optional metadata filter
)

results

# Semantic matching of malicious attempts
We have a vector DB of malicious logs to execute neural searches against.  Now, let's feed it some fresh logs to see what matches we get.  The collection was configured for the cosine metric, which ChromaDB reports as a distance (lower means more similar).  We'll have to experiment a bit to set a reasonable threshold for when to return an alert vs not.
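
One way to run that experiment, sketched under assumptions: collect the nearest-neighbor distance for a handful of logs whose true label is known, then watch how precision and recall move as the threshold changes. The helper name `sweep_thresholds` and the toy distances and labels below are hypothetical, not values from this notebook's data. The sketch assumes cosine distance, where a lower value means a closer match.

```python
def sweep_thresholds(distances, is_malicious, thresholds):
    """For each threshold, alert when distance <= threshold and score the alerts.

    Returns a list of (threshold, precision, recall) tuples.
    """
    rows = []
    for t in thresholds:
        alerts = [d <= t for d in distances]
        tp = sum(a and m for a, m in zip(alerts, is_malicious))        # true alerts
        fp = sum(a and not m for a, m in zip(alerts, is_malicious))    # false alarms
        fn = sum((not a) and m for a, m in zip(alerts, is_malicious))  # missed attacks
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall))
    return rows

# Hypothetical distances to the nearest known-malicious log, with known labels
toy_distances = [0.05, 0.10, 0.30, 0.60, 0.80]
toy_is_malicious = [True, True, False, True, False]

for t, p, r in sweep_thresholds(toy_distances, toy_is_malicious, [0.2, 0.5, 0.7]):
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')
```

A loose threshold catches more attacks but raises more false alarms; a tight one does the opposite. Where to settle depends on how costly each kind of mistake is for the team handling the alerts.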

In [None]:
# Some fabricated proxy logs that contain 950 benign log entries and 50 malicious log entries
df = pd.read_csv('proxy_logs_mixed.csv')

In [None]:
df.sample(3)

In [None]:
incoming_proxy_logs = df['Log Entry']

In [None]:
incoming_proxy_logs

In [None]:
# Find incoming log entries that match known malicious activity at a pre-determined threshold
# ChromaDB returns cosine distance (1 - cosine similarity), so a LOWER value is a closer match
for log_entry in incoming_proxy_logs:
    results = collection.query(
        query_texts=[log_entry],
        n_results=1,
        # where={"metadata_field": "is_equal_to_this"}, # optional filter
        # where_document={"$contains":"search_string"}  # optional filter
    )
    if results['distances'][0][0] <= 0.485:  # Threshold for a match; tune this for your data
        print(f'Possible alert on: {log_entry} \n  -- Match within threshold on {results}')
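
`collection.query` also accepts a list of query texts, which avoids one call per log line. The sketch below shows how alerts could be pulled out of such a batched result. The helper name `alerts_from_results` is hypothetical, and the hand-built `fake_results` dictionary only mimics the list-of-lists shape that ChromaDB returns, so the example runs without a database.

```python
def alerts_from_results(log_entries, results, max_distance):
    """Pair each query with its nearest match and keep those within max_distance.

    `results` is expected to have the shape ChromaDB returns: one inner list
    per query text, holding n_results values each.
    """
    alerts = []
    for entry, dists, docs in zip(log_entries, results['distances'], results['documents']):
        if dists and dists[0] <= max_distance:  # cosine distance: lower = closer
            alerts.append({'log_entry': entry, 'matched': docs[0], 'distance': dists[0]})
    return alerts

# In the notebook this would come from something like:
#   results = collection.query(query_texts=incoming_proxy_logs.tolist(), n_results=1)
# Here a stand-in with made-up values is used instead.
incoming = ['GET /index.html', 'GET /../../etc/passwd']
fake_results = {
    'distances': [[0.91], [0.12]],
    'documents': [['GET /admin.php?cmd=ls'], ['GET /../../../etc/passwd']],
}

for alert in alerts_from_results(incoming, fake_results, max_distance=0.485):
    print(alert)
```

For very large log files it may be necessary to send the queries in chunks rather than all at once.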

# <u>Anomaly Detection by AI search on Benign Logs</u>
## Build vector DB of Benign Proxy log data

In [None]:
# Point the `collection` variable at a new collection for the benign data
# (the malicious collection still exists in the client under its own name)
collection = client.get_or_create_collection(name='benign_proxy_logs', metadata={"hnsw:space": "cosine"})


In [None]:
# This CSV file contains fabricated proxy logs that are examples of benign activity
df = pd.read_csv('proxy_logs_good.csv')
df.sample(3)

In [None]:
# Create lists of the necessary data from the dataframe
ID_list = df['ID'].astype(str).tolist()  # ID list, converted to string
LogEntry_list = df['Log Entry'].tolist()   # List of documents (log content)

In [None]:
# Add log records to vector store, allowing ChromaDB to calculate vector embeddings
collection.add(
    documents=LogEntry_list, # ChromaDB handles tokenization, embedding, and indexing automatically.
    #metadatas=[{"source": "notion"}, {"source": "google-docs"}], # metadata filters
    ids=ID_list, # unique ID for each doc
)

In [None]:
# Check a few records...
collection.get(['1','2','3'])


# Vector search for Anomaly Detection
With a vector DB of known benign/good proxy log entries, we can do semantic comparison to flag incoming log entries that are too different from what we know to be benign logs. 
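
The idea can be sketched without a vector database: an incoming entry is anomalous when even its nearest benign neighbor is far away. The function name `flag_anomalies`, the toy 2-D vectors, and the 0.2 threshold are illustrative assumptions. Real embeddings from all-MiniLM-L6-v2 have 384 dimensions, and ChromaDB uses an approximate index rather than this brute-force comparison.

```python
import numpy as np

def flag_anomalies(benign_vecs, incoming_vecs, min_distance):
    """Return (mask, nearest): mask is True where the nearest benign
    vector is at least min_distance away in cosine distance."""
    benign = np.asarray(benign_vecs, dtype=float)
    incoming = np.asarray(incoming_vecs, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity
    benign = benign / np.linalg.norm(benign, axis=1, keepdims=True)
    incoming = incoming / np.linalg.norm(incoming, axis=1, keepdims=True)
    cosine_dist = 1.0 - incoming @ benign.T   # shape: (n_incoming, n_benign)
    nearest = cosine_dist.min(axis=1)         # distance to the closest benign vector
    return nearest >= min_distance, nearest

# Toy vectors (illustrative only)
benign = [[1.0, 0.0], [0.9, 0.1]]
incoming = [[1.0, 0.05], [0.0, 1.0]]

mask, nearest = flag_anomalies(benign, incoming, min_distance=0.2)
print(mask)      # first entry sits close to the benign set, second does not
print(nearest)
```

Note that the comparison runs in the opposite direction from the malicious-match case: there, a small distance raises the alert; here, a large one does.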

In [None]:
# We'll use fabricated proxy logs that contain 950 benign log entries and 50 malicious log entries again
df = pd.read_csv('proxy_logs_mixed.csv')
df.sample(3)

In [None]:
# Grab just the log entry itself
incoming_proxy_logs = df['Log Entry']
incoming_proxy_logs

In [None]:
# Let's identify anomalies in the incoming logs by using vector search against DB of known benign logs
# ChromaDB returns cosine distance (0 = identical), so an anomaly is an entry whose
# nearest benign neighbor is FAR away, i.e. its distance is above the threshold
for log_entry in incoming_proxy_logs:
    results = collection.query(
        query_texts=[log_entry],
        n_results=1,
        # where={"metadata_field": "is_equal_to_this"}, # optional filter
        # where_document={"$contains":"search_string"}  # optional filter
    )
    if results['distances'][0][0] > 0.0001:  # Threshold for a non-match; tune this for your data
        print(f'Possible anomaly on: {log_entry} \n  -- Nearest benign match is beyond threshold: {results}')

# <u>Full-text Clustering with Vector Embeddings</u>
We will use the mixed proxy log data, which contains 950 benign logs + 50 malicious logs, to see if we can separate the two using clustering. This is a notable exercise...only recently has this technology advanced to the point that we can actually do clustering with full text such as security logs!


In [None]:
# We'll use the mixed proxy logs to see if we can get them clustered into 950 benign + 50 malicious
df = pd.read_csv('proxy_logs_mixed.csv')
df = df.drop(['IP Address', 'Timestamp'], axis=1)  # We don't need IP nor timestamp for this task
df.sample(3)


In [None]:
# It's easier to just calc the vector embeddings explicitly, vs inserting into ChromaDB to get embeddings
# We will use the same embedding model that ChromaDB is using

# Load the embedding model so we can use it to easily populate the dataframe
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Use like this: embeddings = model.encode(whatever_text)

# Use a lambda function to encode the text in each row and apply it to a new column
df['embedding'] = df['Log Entry'].apply(lambda text:model.encode(text))
df.sample(3)

In [None]:
import numpy as np
from sklearn.cluster import KMeans

# At this point `df` has (at least) the columns 'Log Entry' and 'embedding'
# Here 'embedding' contains the 384-dimensional vectors

# Stack the per-row embeddings into one 2-D numpy array (rows = logs, columns = dimensions)
X = np.array(df['embedding'].tolist())

# Perform K-means clustering with K=2
kmeans = KMeans(n_clusters=2, random_state=0)
df['cluster'] = kmeans.fit_predict(X)

# Now `df` will have a new column 'cluster' indicating the cluster each entry belongs to
df.columns


In [None]:
df.sample(3)

In [None]:
# Ideally this shows 950 in one cluster and 50 in the other, but K-means does not guarantee that split
df['cluster'].value_counts()
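
Cluster sizes alone do not show whether the two groups line up with benign and malicious. If a ground-truth label were available for each row (an assumption; no such column is shown in the mixed CSV in this notebook), a cross-count of cluster against label would show how clean the separation is. The function names `cluster_label_counts` and `purity` and the toy data are hypothetical.

```python
from collections import Counter

def cluster_label_counts(cluster_ids, labels):
    """Count how many rows of each label landed in each cluster."""
    counts = {}
    for cluster_id, label in zip(cluster_ids, labels):
        counts.setdefault(cluster_id, Counter())[label] += 1
    return counts

def purity(cluster_ids, labels):
    """Fraction of rows that carry their cluster's majority label (1.0 = clean split)."""
    counts = cluster_label_counts(cluster_ids, labels)
    majority_total = sum(c.most_common(1)[0][1] for c in counts.values())
    return majority_total / len(labels)

# Toy assignments (illustrative only): one malicious row ended up in the benign cluster
toy_clusters = [0, 0, 0, 1, 1, 0]
toy_labels = ['benign', 'benign', 'benign', 'malicious', 'malicious', 'malicious']

print(cluster_label_counts(toy_clusters, toy_labels))
print('purity:', purity(toy_clusters, toy_labels))
```

In the notebook, the inputs would be something like `df['cluster'].tolist()` and the list of labels, if such a column exists in the data.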