# Example: Build a bare-bones Doc2Vec embedding model
This example builds a vector embedding model from a few text strings using Doc2Vec.

Building a vector embedding model with Doc2Vec (from the Gensim library) means converting text documents into dense vector representations. These embeddings capture semantic information about the text, so you can use them for tasks such as document similarity, clustering, or classification. Here’s a step-by-step explanation of the process:

1. Understanding Doc2Vec

Doc2Vec extends Word2Vec to generate vector embeddings for entire documents or sentences rather than individual words. It introduces document vectors, which represent each document in a continuous vector space. Doc2Vec offers two training algorithms:

- Distributed Memory (DM): Predicts a word from both its surrounding words and the document vector. It is analogous to CBOW in Word2Vec, with the document vector added to the context.
- Distributed Bag of Words (DBOW): Predicts the words in a document from the document vector alone. It is analogous to skip-gram in Word2Vec.
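
To make the difference between the two algorithms concrete, here is a minimal sketch of the kind of (input, target) prediction tasks each one sets up. It is a simplification for illustration only, not how Gensim implements training: the function names `dm_examples` and `dbow_examples` are made up for this sketch, and details such as negative sampling and window randomization are left out.

```python
def dm_examples(tag, words, window=2):
    """Yield (inputs, target) pairs in the spirit of Distributed Memory.

    The inputs are the document tag plus the words within `window`
    positions of the target; the target is the word to predict.
    """
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        yield (tag, tuple(left + right)), target


def dbow_examples(tag, words):
    """Yield (input, target) pairs in the spirit of Distributed Bag of Words.

    The only input is the document tag; every word in the document is a target.
    """
    for target in words:
        yield tag, target


words = ["text", "of", "document", "1"]
dm_pairs = list(dm_examples("0", words, window=1))
dbow_pairs = list(dbow_examples("0", words))

for inputs, target in dm_pairs:
    print("DM  ", inputs, "->", target)
for tag, target in dbow_pairs:
    print("DBOW", tag, "->", target)
```

In the DM pairs the document tag always appears next to context words, while in the DBOW pairs the tag is the only input.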

2. Preparing Data

First, organize your text data for training. Each document must be tokenized into words and tagged with a unique ID so that the model can associate the document with its vector.

Steps for preparing data:

- Tokenization: Split your text into tokens (words).
- Remove stop words: Optionally, remove common stop words.
- Tag your documents: Doc2Vec in Gensim expects each document to carry a tag. Wrapping each document in TaggedDocument with a unique tag gives it its own vector.

In this example, each document is tokenized and tagged with a unique ID (e.g., '0', '1', '2').

In [1]:
#! pip install gensim
#! pip install nltk

In [2]:
# Do once to get NLTK tokenizers
#import nltk
#nltk.download('punkt_tab')

In [3]:
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize

documents = ["Text of document 1", "Text of document 2", "Another document"]
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]
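
The cell above skips the optional stop-word step. The sketch below shows one way it could look. To keep it runnable without any downloads, it makes a few assumptions: a regular expression stands in for `word_tokenize`, the `STOP_WORDS` set is a tiny hand-picked list rather than a curated one, and the `Tagged` namedtuple is a stand-in with the same two fields (`words`, `tags`) as Gensim's TaggedDocument.

```python
import re
from collections import namedtuple

# Stand-in with the same two fields as Gensim's TaggedDocument,
# so this sketch runs without Gensim installed.
Tagged = namedtuple("Tagged", ["words", "tags"])

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "is", "this"}


def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


documents = ["Text of document 1", "Text of document 2", "Another document"]
tagged = [Tagged(words=preprocess(doc), tags=[str(i)])
          for i, doc in enumerate(documents)]

for item in tagged:
    print(item.tags, item.words)
```

In a real pipeline you would likely swap the stand-ins back for `word_tokenize`, a fuller stop-word list, and TaggedDocument.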


3. Building the Model

Once the data is ready, you can build and train the Doc2Vec model. You can specify which algorithm to use (DM or DBOW) and several hyperparameters such as vector size, window size, and the number of epochs.

- vector_size: The dimensionality of the document vectors.
- window: The maximum distance between the current word and the predicted word within a document.
- min_count: Words that occur fewer than this many times in the corpus are ignored.
- workers: The number of worker threads used for training.
- dm: Set to 1 for Distributed Memory and 0 for Distributed Bag of Words.
- epochs: The number of passes over the corpus during training.
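
On a corpus as small as the one in this example, min_count has a large effect. The following sketch counts word frequencies with the standard library to show which words a threshold of 2 would keep. Whitespace splitting is used here as a stand-in for `word_tokenize`; for these three strings it produces the same tokens.

```python
from collections import Counter

documents = ["Text of document 1", "Text of document 2", "Another document"]

# Whitespace splitting stands in for word_tokenize on these simple strings.
tokens = [word for doc in documents for word in doc.lower().split()]
counts = Counter(tokens)

min_count = 2
kept = sorted(w for w, c in counts.items() if c >= min_count)
dropped = sorted(w for w, c in counts.items() if c < min_count)

print("kept:   ", kept)
print("dropped:", dropped)
```

Only three distinct words survive the threshold, so on toy data a lower value such as min_count=1 may be worth trying.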

In [4]:
from gensim.models import Doc2Vec

# Parameters:
# vector_size: size of the document vectors
# window: context window size (how many words before and after to look)
# min_count: minimum word frequency
# workers: number of worker threads to use
# dm: 1 for Distributed Memory (DM), 0 for Distributed Bag of Words (DBOW)
# epochs: number of training passes; stored on the model as model.epochs

model = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, dm=1, epochs=20)

# Build the vocabulary from the tagged documents.
# Training happens in the next step, so the model is trained only once.
model.build_vocab(tagged_data)


4. Training the Model

Training optimizes the model weights by predicting words in context and adjusting the word and document vectors accordingly. The number of epochs controls how many passes the model makes over the training data.

Train the model by calling the train() method, passing total_examples (the number of documents, available as model.corpus_count after build_vocab) and epochs (the number of passes over the data).

In [5]:
# Train the model
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)


5. Using the Model

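Once the model is trained, you can use it to infer document vectors, calculate similarity between documents, or perform other downstream tasks.

Inferring vectors for new documents:

You can infer a vector for a new, unseen document, which is important for testing and evaluating the model. Tokenize the new text the same way as the training data. Inference is itself a short training process, so repeated calls can return slightly different vectors.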

In [6]:
# Doing inference
new_doc = word_tokenize("This is a new document".lower())
vector = model.infer_vector(new_doc)
vector


array([-6.9503306e-04,  2.2336466e-03, -4.8587485e-03, -3.8589155e-03,
        3.2359166e-03, -1.1491886e-03,  2.7911800e-03, -4.5735021e-03,
       -3.4503937e-05, -1.0539826e-03, -5.7602377e-04, -2.3606466e-03,
        2.7031361e-04,  8.0543759e-05, -1.4229298e-04, -3.2289792e-03,
       -2.4669169e-04, -4.2843614e-03, -4.8718508e-03, -1.5475699e-03,
       -4.1027009e-04,  3.5659373e-03, -8.3692820e-04, -3.6034442e-03,
        2.1089865e-03,  4.5406464e-03, -2.3125492e-03,  1.7837441e-03,
        3.3302926e-03,  3.3474518e-03,  2.6024072e-03,  1.7890168e-03,
       -4.5362976e-03,  5.7850778e-04,  3.7251920e-03, -3.0369079e-03,
        1.6201055e-03,  4.6499134e-03, -2.9909620e-03, -8.9684012e-04,
        2.9917841e-03, -1.4215005e-03,  1.3429522e-05, -4.8169652e-03,
        1.5783489e-03,  2.1076298e-03,  5.9415045e-04,  2.4248653e-03,
       -5.4720609e-04, -9.7109226e-04,  4.2268797e-03, -3.0706383e-03,
       -4.8724581e-03,  1.1321121e-03,  1.1363316e-03,  2.1320807e-03,
      

Finding similar documents:

You can find the training documents most similar to a given vector with the most_similar() method of model.dv. It returns (tag, cosine similarity) pairs, sorted from most to least similar:

In [7]:
similar_docs = model.dv.most_similar([vector], topn=5)
similar_docs


[('0', 0.06741232424974442),
 ('1', 0.008448691107332706),
 ('2', -0.024614667519927025)]
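
The scores in this output are cosine similarities, which range from -1 to 1. As a rough illustration of what the ranking does, the sketch below computes cosine similarity by hand and sorts a few vectors against a query. The three-dimensional vectors and the helper names `cosine` and `rank_by_similarity` are invented for this example and are not part of Gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, tagged_vectors, topn=5):
    """Return (tag, score) pairs sorted from most to least similar."""
    scored = [(tag, cosine(query, vec)) for tag, vec in tagged_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical 3-dimensional vectors, chosen only for illustration.
doc_vecs = {"0": [1.0, 0.0, 0.0], "1": [0.0, 1.0, 0.0], "2": [-1.0, 0.0, 0.0]}
query = [2.0, 1.0, 0.0]

ranking = rank_by_similarity(query, doc_vecs)
print(ranking)
```

Scores near zero, like those in the notebook output above, suggest the vectors are nearly unrelated, which is expected when the training corpus is only three short strings.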

6. Evaluating the Model

Evaluation involves measuring how well the model’s document vectors capture meaningful semantic relationships. You can test the similarity of document vectors or use them in downstream tasks like clustering or classification.
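
One simple similarity test is a self-retrieval check: infer a vector for each training document and see whether the closest trained vector belongs to that same document. The sketch below shows the idea with made-up two-dimensional vectors; the name `self_retrieval_rate` and all the numbers are assumptions for illustration. With a real model, `trained[tag]` would come from `model.dv[tag]` and `inferred[tag]` from `model.infer_vector(...)`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def self_retrieval_rate(trained, inferred):
    """Fraction of documents whose inferred vector is closest to its own trained vector."""
    hits = 0
    for tag, query in inferred.items():
        best = max(trained, key=lambda t: cosine(query, trained[t]))
        if best == tag:
            hits += 1
    return hits / len(inferred)


# Hypothetical vectors: documents "0" and "1" retrieve themselves, "2" does not.
trained = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [-1.0, 0.0]}
inferred = {"0": [0.9, 0.1], "1": [0.2, 0.8], "2": [0.1, 0.9]}

rate = self_retrieval_rate(trained, inferred)
print(f"self-retrieval rate: {rate:.2f}")
```

A rate well below 1.0 on the training data is a hint that the model needs more data or more training passes.
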
For clustering:

- Use a clustering algorithm such as KMeans or DBSCAN to group documents based on their vector representations.

In [8]:
from sklearn.cluster import KMeans

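# Extract the trained document vectors by tag (the tags are the strings '0', '1', '2')
doc_vectors = [model.dv[str(i)] for i in range(len(documents))]

# Perform KMeans clustering; use fewer clusters than documents,
# otherwise every document ends up in its own cluster
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(doc_vectors)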
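
After fitting, scikit-learn exposes one cluster label per input vector in `kmeans.labels_`. The sketch below shows one way to group the original documents by label. The `labels` list here is made up to stand in for `kmeans.labels_`, so the grouping shown is illustrative rather than a real result.

```python
from collections import defaultdict

documents = ["Text of document 1", "Text of document 2", "Another document"]

# Hypothetical labels standing in for kmeans.labels_; one integer per document.
labels = [0, 0, 1]

clusters = defaultdict(list)
for doc, label in zip(documents, labels):
    clusters[int(label)].append(doc)

for label in sorted(clusters):
    print(label, clusters[label])
```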


Summary:

- Prepare the data: Tokenize and tag the documents.
- Build the model: Initialize a Doc2Vec model with appropriate hyperparameters.
- Train the model: Train the model on the tagged document data.
- Infer vectors: Use the trained model to infer vectors for new documents.
- Use the vectors: Perform similarity search, clustering, or other tasks using the document vectors.

Doc2Vec is a powerful tool for representing documents in a vector space, which can then be used for a variety of natural language processing (NLP) tasks.