# Build doc2vec embedding model w/ public-domain DNS log data
This example builds a vector embedding model from sample DNS log data using Doc2Vec.
Source data: http://www.secrepo.com/maccdc2012/dns.log.gz

Building a vector embedding model using Doc2Vec (from the Gensim library) involves converting text documents into dense vector representations. These vector embeddings capture semantic information about the text, allowing you to use them for tasks such as document similarity, clustering, or classification. Here’s a step-by-step explanation of the process:
1. Understanding Doc2Vec

Doc2Vec is an extension of Word2Vec, which generates vector embeddings for entire documents or sentences rather than individual words. It introduces the concept of document vectors, which represent each document in a continuous vector space. Doc2Vec uses two primary models:

    Distributed Memory (DM): Predicts a target word from its surrounding words together with a document vector. It is analogous to CBOW in Word2Vec, with the document vector acting as extra context.
    Distributed Bag of Words (DBOW): Predicts words sampled from a document using only the document vector, ignoring word order. It is analogous to skip-gram in Word2Vec.

2. Preparing Data

First, you need to organize your text data for training. The data must be preprocessed and tokenized into words. Each document is labeled with a unique ID so that the model can associate each document with its corresponding vector.
Steps for preparing data:

    Tokenization: Split your text into tokens (words).
    Remove stop words: Optionally, remove common stop words.
    Label your documents: Doc2Vec requires each document to have a unique label, which can be done using TaggedDocument in Gensim.

In this example, each document is tokenized and tagged with a unique ID (e.g., '0', '1', '2').

In [4]:
#! pip install gensim
#! pip install nltk

In [2]:
# Do once to get NLTK tokenizers
#import nltk
#nltk.download('punkt_tab')

In [5]:
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize

In [6]:
# Load the DNS log data into a list
# (assumes dns.log.gz from the source URL has been downloaded and decompressed)
filename = 'dns.log'

with open(filename, 'r') as file:
    # Treat each line as a separate document, stripping the trailing newline
    documents = [line.strip() for line in file]

# Tokenize each lowercased line and tag it with its line index as a string
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]

len(documents)  # How many log entries do we have?  Should be 427K...

427935

In [7]:
documents[0]  # Take a look at the raw log data...\t are tab characters. First row...

'1331901005.510000\tCWGtK431H9XuaTN4fi\t192.168.202.100\t45658\t192.168.27.203\t137\tudp\t33008\t*\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\t1\tC_INTERNET\t33\tSRV\t0\tNOERROR\tF\tF\tF\tF\t1\t-\t-\tF'

In [8]:
tagged_data[0] # Take a look at the tagged data...first row

TaggedDocument(words=['1331901005.510000', 'cwgtk431h9xuatn4fi', '192.168.202.100', '45658', '192.168.27.203', '137', 'udp', '33008', '*', '\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00', '1', 'c_internet', '33', 'srv', '0', 'noerror', 'f', 'f', 'f', 'f', '1', '-', '-', 'f'], tags=['0'])

3. Defining the Model Parameters


In [9]:
from gensim.models import Doc2Vec

# Parameters:
# vector_size: size of the document vectors
# window: context window size (how many words before and after to look)
# min_count: minimum word frequency
# workers: number of CPU cores to use
# dm: 1 for Distributed Memory (DM), 0 for Distributed Bag of Words (DBOW)

model = Doc2Vec(vector_size=64, window=5, min_count=2, workers=4, dm=1)

# Build vocabulary
model.build_vocab(tagged_data)


4. Training the Model

Training involves optimizing the model weights by predicting words in context and adjusting the document vectors accordingly. The number of epochs controls how many passes over the training data the model makes. Because no epochs value was passed to the constructor above, model.epochs here is the Gensim default.

In [10]:
# Train the model
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Save the model
model.save("doc2vec-embedding-model_DNSv1.model")



5. Using the Model

Once the model is trained, you can use it to infer document vectors, calculate similarity between documents, or perform other downstream tasks.
Inferring vectors for new documents:

You can infer the vector for new, unseen documents (important for testing and evaluating the model).

In [11]:
# Doing inference

# If the model is not already loaded, load it like so:
# from gensim.models import Doc2Vec
# model = Doc2Vec.load("doc2vec-embedding-model_DNSv1.model")

#new_doc = word_tokenize("This is a new document")
# First line from dns.log file...we need something to visually examine
new_doc = word_tokenize('1331901005.510000	CWGtK431H9XuaTN4fi	192.168.202.100	45658	192.168.27.203	137	udp	33008	*\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00	1	C_INTERNET	33	SRV	0	NOERROR	F	F	F	F	1	-	-	F')
vector = model.infer_vector(new_doc)
vector


array([-0.3171621 ,  0.20640056, -0.0300635 , -0.0098218 , -0.19444545,
        0.28547275, -0.0464296 , -0.0525442 ,  0.26177865, -0.25753337,
        0.02328326,  0.04312877, -0.09204759, -0.12764314,  0.01225038,
       -0.04808943,  0.05217761,  0.18822627, -0.20082216,  0.15671954,
       -0.1441735 , -0.00588884, -0.37063843,  0.10306489, -0.13889146,
        0.01294738, -0.17230865, -0.07669706,  0.08608611, -0.11367495,
       -0.06796396,  0.09768035,  0.12308938, -0.10865992, -0.00776571,
       -0.03252527,  0.05046219, -0.13647558,  0.06918532,  0.2669966 ,
        0.2061837 , -0.01954957,  0.12396771,  0.1897618 , -0.05829672,
        0.05223336, -0.21107773,  0.1119943 , -0.15251513, -0.21341778,
       -0.02543332, -0.02141355, -0.18571535, -0.09996499, -0.02004255,
        0.02308452, -0.00938944,  0.11421383, -0.02856918,  0.0830159 ,
        0.19229126,  0.23906521, -0.17293878,  0.08165473], dtype=float32)

Finding similar documents:


In [12]:
# Use the vector embeddings to find similar documents
similar_docs = model.dv.most_similar([vector], topn=5)
similar_docs
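# Document '0' is not the top hit even though this log entry was in the training set. Likely causes:
# - infer_vector() runs a fresh, randomized inference, so it does not return the stored training vector
# - the query was not lowercased, so tokens such as 'C_INTERNET' and 'NOERROR' are missing from the
#   lowercase training vocabulary and are ignored
# - '\x00' in a regular string literal is a NUL character, not the literal backslash text found in the log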

[('156239', 0.681245744228363),
 ('179800', 0.6660510897636414),
 ('108707', 0.6547045111656189),
 ('85702', 0.6476038694381714),
 ('256034', 0.6371015906333923)]

6. Evaluating the Model

Evaluation involves measuring how well the model’s document vectors capture meaningful semantic relationships. You can test the similarity of document vectors or use them in downstream tasks like clustering or classification.
For clustering:

    Use clustering algorithms like KMeans, DBSCAN, or others to group documents based on their vector representations.

In [13]:
# Rework this...we need a better examination of the document vector embeddings
from sklearn.cluster import KMeans

# Extract document vectors (tags are the strings '0', '1', ... assigned during tagging)
doc_vectors = [model.dv[str(i)] for i in range(len(documents))]

# Perform KMeans clustering; n_init is set explicitly to avoid the scikit-learn default-change warning
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(doc_vectors)
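
A possible next step for examining a KMeans fit is to look at the cluster sizes and pull one representative document per cluster. The sketch below uses synthetic vectors in place of the real document vectors; `toy_vectors` and `toy_kmeans` are hypothetical names, and the three well-separated groups are an assumption chosen so the example is easy to read.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for doc_vectors: three well-separated groups of 64-d vectors
toy_vectors = np.vstack([
    rng.normal(loc=offset, scale=0.1, size=(100, 64))
    for offset in (-1.0, 0.0, 1.0)
])

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy_vectors)

# How many vectors landed in each cluster
cluster_sizes = np.bincount(toy_kmeans.labels_, minlength=3)
print(cluster_sizes)

# transform() returns each vector's distance to every centroid, shape (n_samples, n_clusters);
# the row index with the smallest distance per column is that cluster's most central vector
distances = toy_kmeans.transform(toy_vectors)
representatives = distances.argmin(axis=0)
print(representatives)
```

Applied to the notebook's data, `[documents[i] for i in representatives]` would show the log line closest to each centroid, which gives a quick feel for what each cluster contains.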


In [14]:
documents[0]

'1331901005.510000\tCWGtK431H9XuaTN4fi\t192.168.202.100\t45658\t192.168.27.203\t137\tudp\t33008\t*\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\t1\tC_INTERNET\t33\tSRV\t0\tNOERROR\tF\tF\tF\tF\t1\t-\t-\tF'

In [18]:
doc_vectors[0]

array([ 0.00586598, -0.0374762 ,  0.01549683, -0.00364431,  0.01255742,
        0.02020624, -0.04240043, -0.01447181,  0.00996931, -0.04043568,
        0.04119173,  0.02079903, -0.0119105 , -0.01311227, -0.02433136,
       -0.0172599 ,  0.03064309,  0.01647692, -0.0172838 ,  0.00704636,
        0.03837198,  0.02055551,  0.02231336,  0.02196497, -0.0138323 ,
       -0.00686659, -0.06895209, -0.03626766,  0.0231456 , -0.04226487,
        0.00496994,  0.04478219,  0.02419034,  0.01738112,  0.02189103,
       -0.00944712,  0.00675522, -0.00328304,  0.00272201,  0.01345214,
        0.02938774,  0.02063978,  0.02461391, -0.04983344, -0.00899704,
       -0.0431048 ,  0.00083163, -0.01817878, -0.00527828, -0.01752057,
        0.03710725,  0.02567088, -0.02727177, -0.02965645, -0.02437101,
        0.02469743, -0.01802357,  0.08237184,  0.00124555,  0.03246439,
        0.03550833,  0.00885791, -0.04207022, -0.00595579], dtype=float32)

In [20]:
# Let's pull the documents and the document vectors together into a dataframe so the data is easier to work with
import pandas as pd

df = pd.DataFrame({'document': documents, 'embedding': doc_vectors})

In [21]:
df.shape

(427935, 2)

In [22]:
df.sample(3)

                                                 document                                          embedding
6038    1331902418.160000\tClRAb643sjAeTGHFJh\t192.168...  [0.008520887, -0.051922653, 0.024452504, -0.02...
195114  1331922181.350000\tC52Rpy4TPuSW2nQU6\t192.168....  [0.06223684, -0.05806742, 0.07472073, 0.047282...
177664  1331921224.900000\tCzNRck2zqMl2K4BvIh\t10.10.1...  [0.016794674, -0.009601517, -0.0038929433, -0....


In [23]:
# Let's try clustering using DBSCAN to determine the number of cluster groups/etc
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Convert the embeddings column to a numpy array
embeddings = np.array(df['embedding'].tolist())

# Standardize the data (optional but recommended)
scaler = StandardScaler()
embeddings_scaled = scaler.fit_transform(embeddings)

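# Calculate the cosine distance matrix
# (this materializes a dense n x n array, which is the step that raises the MemoryError for 427,935 rows)
cosine_dist_matrix = cosine_distances(embeddings_scaled)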

# Perform DBSCAN clustering with cosine distance
# Parameters to adjust: `eps` (radius of neighborhood), `min_samples` (min points to form a cluster)
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='precomputed')
df['cluster_group'] = dbscan.fit_predict(cosine_dist_matrix)

# Analyzing the results
# Note: `-1 in series` tests the index labels, so check the values explicitly
labels = df['cluster_group'].to_numpy()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print(f'Estimated number of DBSCAN clusters: {n_clusters}')
print(f'Estimated number of noise points: {n_noise}')

# If you want to see the first few entries and their clusters
df.head()


MemoryError: Unable to allocate 682. GiB for an array with shape (427935, 427935) and data type float32
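
The size in the error message follows directly from the matrix shape: one float32 per pair of documents. A quick back-of-the-envelope check (variable names are only for illustration):

```python
n_docs = 427_935          # number of log lines, from len(documents)
bytes_per_float32 = 4

# A dense pairwise distance matrix stores one value for every (row, column) pair
matrix_bytes = n_docs * n_docs * bytes_per_float32
matrix_gib = matrix_bytes / 2**30

print(f"{matrix_gib:.0f} GiB")  # prints "682 GiB"
```

Because the cost grows with the square of the number of documents, halving the sample only cuts the requirement to roughly a quarter, so a precomputed matrix is practical only for a much smaller subsample.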

In [24]:
embeddings = df['embedding'].tolist()

# Train the DBSCAN clustering model
dbscan_model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
clusters = dbscan_model.fit_predict(embeddings)

# Add the cluster labels back to the DataFrame
df['cluster_group'] = clusters
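
If cosine distance is the goal, one possible way to avoid a precomputed matrix is to L2-normalize the vectors and run DBSCAN with the Euclidean metric. For unit vectors, the squared Euclidean distance equals 2 × (1 − cosine similarity), so a cosine threshold can be converted to a Euclidean `eps`. The sketch below demonstrates this on synthetic data; `toy_embeddings`, `cosine_eps`, and the chosen values are assumptions for illustration, not tuned settings for the DNS embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Hypothetical stand-in for the embeddings: two tight groups of 64-d vectors
centers = rng.normal(size=(2, 64))
toy_embeddings = np.vstack([
    center + 0.01 * rng.normal(size=(50, 64)) for center in centers
])

# Scale every row to unit length so Euclidean distance tracks cosine distance
unit_vectors = normalize(toy_embeddings)

# For unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v)) = 2 * cosine_distance
cosine_eps = 0.05
euclidean_eps = np.sqrt(2 * cosine_eps)

labels = DBSCAN(eps=euclidean_eps, min_samples=5, metric="euclidean").fit_predict(unit_vectors)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

This avoids building the n × n matrix up front, but DBSCAN can still be slow or memory-hungry on hundreds of thousands of points when `eps` is large enough that neighbourhoods contain many points, so testing on a subsample first is a reasonable precaution.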




Summary:

    Prepare the data: Tokenize and tag the documents.
    Build the model: Initialize a Doc2Vec model with appropriate hyperparameters.
    Train the model: Train the model on the tagged document data.
    Infer vectors: Use the trained model to infer vectors for new documents.
    Use the vectors: Perform similarity search, clustering, or other tasks using the document vectors.

