# Build doc2vec embedding model w/ public-domain DNS log data
This example builds a vector embedding model from sample dns data using doc2vec.
Source data: http://www.secrepo.com/maccdc2012/dns.log.gz 

Building a vector embedding model using Doc2Vec (from the Gensim library) involves converting text documents into dense vector representations. These vector embeddings capture semantic information about the text, allowing you to use them for tasks such as document similarity, clustering, or classification. Here’s a step-by-step explanation of the process:
1. Understanding Doc2Vec

Doc2Vec is an extension of Word2Vec, which generates vector embeddings for entire documents or sentences rather than individual words. It introduces the concept of document vectors, which represent each document in a continuous vector space. Doc2Vec uses two primary models:

    Distributed Memory (DM): Focuses on predicting a word using both surrounding words and a document vector. It's somewhat similar to how Word2Vec works but also includes the document vector.
    Distributed Bag of Words (DBOW): Focuses on predicting the words in a document using the document vector. It's similar to skip-gram in Word2Vec.

2. Preparing Data

First, you need to organize your text data for training. The data must be preprocessed and tokenized into words. Each document is labeled with a unique ID so that the model can associate each document with its corresponding vector.
Steps for preparing data:

    Tokenization: Split your text into tokens (words).
    Remove stop words: Optionally, remove common stop words.
    Label your documents: Doc2Vec requires each document to have a unique label, which can be done using TaggedDocument in Gensim.

In this example, each document is tokenized and tagged with a unique ID (e.g., '0', '1', '2').

In [1]:
#! pip install gensim
#! pip install nltk

In [2]:
# Do once to get NLTK tokenizers
#import nltk
#nltk.download('punkt_tab')

In [2]:
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize

In [7]:
# Load the DNS log data into a list
filename = 'dns.log'

with open(filename, 'r') as file:
    # Read lines into a list, removing the newline characters
    documents = [line.strip() for line in file.readlines()]  # Treat each line as a separate document

tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]
documents

['1331901005.510000\tCWGtK431H9XuaTN4fi\t192.168.202.100\t45658\t192.168.27.203\t137\tudp\t33008\t*\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\t1\tC_INTERNET\t33\tSRV\t0\tNOERROR\tF\tF\tF\tF\t1\t-\t-\tF',
 '1331901015.070000\tC36a282Jljz7BsbGH\t192.168.202.76\t137\t192.168.202.255\t137\tudp\t57402\tHPE8AA67\t1\tC_INTERNET\t32\tNB\t-\t-\tF\tF\tT\tF\t1\t-\t-\tF',
 '1331901015.820000\tC36a282Jljz7BsbGH\t192.168.202.76\t137\t192.168.202.255\t137\tudp\t57402\tHPE8AA67\t1\tC_INTERNET\t32\tNB\t-\t-\tF\tF\tT\tF\t1\t-\t-\tF',
 '1331901016.570000\tC36a282Jljz7BsbGH\t192.168.202.76\t137\t192.168.202.255\t137\tudp\t57402\tHPE8AA67\t1\tC_INTERNET\t32\tNB\t-\t-\tF\tF\tT\tF\t1\t-\t-\tF',
 '1331901005.860000\tC36a282Jljz7BsbGH\t192.168.202.76\t137\t192.168.202.255\t137\tudp\t57398\tWPAD\t1\tC_INTERNET\t32\tNB\t-\t-\tF\tF\tT\tF\t1\t-\t-\tF',
 '1331901006.610000\tC36a282Jljz7BsbGH\t192.168.202.76\t137\t192.168.202.255\t137\tudp\t57398\tWPAD\t1\tC_INTERNET\t32\tNB\t-\t-\tF\tF\t

In [8]:
len(documents)  # How many log entries do we have?

427935

3. Building the Model


In [9]:
from gensim.models import Doc2Vec

# Parameters:
# vector_size: size of the document vectors
# window: context window size (how many words before and after to look)
# min_count: minimum word frequency
# workers: number of CPU cores to use
# dm: 1 for Distributed Memory (DM), 0 for Distributed Bag of Words (DBOW)

model = Doc2Vec(vector_size=64, window=5, min_count=2, workers=4, dm=1)

# Build vocabulary
model.build_vocab(tagged_data)


4. Training the Model

Training involves optimizing the model weights by predicting words in context and adjusting the document vectors accordingly. The number of epochs controls how many passes over the training data the model makes.

In [10]:
# Train the model
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Save the model
model.save("doc2vec-embedding-model_DNSv1.model")

# Can load the model like so:
# from gensim.models import Doc2Vec
# model = Doc2Vec.load("model_name")

5. Using the Model

Once the model is trained, you can use it to infer document vectors, calculate similarity between documents, or perform other downstream tasks.
Inferring vectors for new documents:

You can infer the vector for new, unseen documents (important for testing and evaluating the model).

In [99]:
# Doing inference
#new_doc = word_tokenize("This is a new document".lower())  # Why do we need to lower() it first?
#new_doc = word_tokenize("This is a new document")
# First line from dns.log file...we need something to visually examine
new_doc = word_tokenize('1331901005.510000	CWGtK431H9XuaTN4fi	192.168.202.100	45658	192.168.27.203	137	udp	33008	*\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00	1	C_INTERNET	33	SRV	0	NOERROR	F	F	F	F	1	-	-	F')
vector = model.infer_vector(new_doc)
vector


array([-0.21231006,  0.26722193, -0.08867374,  0.00931171, -0.26318645,
        0.1256276 , -0.12953615, -0.11728455,  0.22105853, -0.19559929,
        0.11268631, -0.2080051 ,  0.07386941, -0.23471183, -0.08659401,
       -0.05732593,  0.10695031,  0.00213445, -0.05043063, -0.0972134 ,
       -0.08471301,  0.11421717, -0.4291398 ,  0.12945926, -0.17658453,
       -0.20815568, -0.23497787,  0.02407804,  0.05578978, -0.16567206,
       -0.00701154,  0.19533543,  0.17519332,  0.04145779,  0.02184435,
       -0.03102021,  0.15984164, -0.06846754,  0.01803243,  0.2400159 ,
        0.17609592, -0.02572192,  0.1434301 ,  0.26405668, -0.11265863,
       -0.0057001 , -0.26837862,  0.12170205, -0.20012821,  0.01285687,
        0.10941636, -0.06398063, -0.10644042, -0.1700753 , -0.0937715 ,
       -0.05989009, -0.22819111,  0.02488633,  0.07734212,  0.22307202,
        0.05801981,  0.1772641 , -0.20819472,  0.09418017], dtype=float32)

Finding similar documents:


In [98]:
# Use the vector embeddings to find similar documents
similar_docs = model.dv.most_similar([vector], topn=5)
similar_docs
# We should get an exact match since we used a log entry this model was trained on...investigate why not.

[('36024', 0.7518634796142578),
 ('108707', 0.6876424551010132),
 ('330827', 0.6535166501998901),
 ('186663', 0.6457772254943848),
 ('36053', 0.6409603357315063)]

6. Evaluating the Model

Evaluation involves measuring how well the model’s document vectors capture meaningful semantic relationships. You can test the similarity of document vectors or use them in downstream tasks like clustering or classification.
For clustering:

    Use clustering algorithms like KMeans, DBSCAN, or others to group documents based on their vector representations.

In [95]:
from sklearn.cluster import KMeans

# Extract document vectors
doc_vectors = [model.dv[i] for i in range(len(documents))]

# Perform KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(doc_vectors)


Summary:

    Prepare the data: Tokenize and tag the documents.
    Build the model: Initialize a Doc2Vec model with appropriate hyperparameters.
    Train the model: Train the model on the tagged document data.
    Infer vectors: Use the trained model to infer vectors for new documents.
    Use the vectors: Perform similarity search, clustering, or other tasks using the document vectors.

