# Training Doc2Vec on the full content of Wikipedia articles
1/13/2025, Dave Sisk, https://github.com/davidcsisk, https://www.linkedin.com/in/davesisk-doctordatabase/ 

This notebook replicates the **Document Embedding with Paragraph Vectors** paper, http://arxiv.org/abs/1507.07998, and it also adds on to this notebook from Gensim: https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

Per Gensim: In that paper, the authors only showed results from the DBOW ("distributed bag of words") mode, trained on the English Wikipedia. Here we replicate this experiment using not only DBOW, but also the DM ("distributed memory") mode of the Paragraph Vector algorithm aka Doc2Vec.

Per me: After working through this once with both DBOW and DM modes, I determined that DM mode delivered the better results when used on security log data, both in general purpose form and after I fine-tuned the base model with some security log data. The DM model mode also trains considerably faster.  I've commented out the code around the DBOW model mode, but left it in the notebook so you can try it for yourself if you choose to do so. In running the training process on both Windows and Linux hosts, Linux was close to 2X faster on the long-running processes, but they still ran on Windows.

I've also uploaded a copy of the base model that is trained on the full contents of Wikipedia...you can choose to download and use or fine-tune that copy, versus building it from scratch with this notebook. See the download link below. 

## Basic setup

Python 3.12 appears to introduce changes that break this notebook, but it works correctly with Python 3.11. There are a few different ways to pin the Python version; I used **pyenv** as documented here: https://forums.linuxmint.com/viewtopic.php?t=362499

In [13]:
# Install if not already present
#!pip install gensim


In [5]:
import logging
import multiprocessing
from pprint import pprint

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Preparing the corpus

First, download the dump of all Wikipedia articles from [http://download.wikimedia.org/enwiki/latest](http://download.wikimedia.org/enwiki/latest). You want the file named `enwiki-latest-pages-articles.xml.bz2`.

Second, convert that Wikipedia article dump from the Wikimedia XML format into a plain text file. This will make the subsequent training faster and also allow easy inspection of the data ("input eyeballing").

We'll preprocess each article at the same time, normalizing its text to lowercase, splitting into tokens, etc. Below I use a regexp tokenizer that simply looks for alphabetic sequences as tokens. But feel free to adapt the text preprocessing to your own domain. High quality preprocessing is often critical for the final pipeline accuracy – garbage in, garbage out!
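
As a rough illustration of plugging in your own preprocessing, the sketch below defines a tokenizer with the call signature that `WikiCorpus` expects from `tokenizer_func` (text, minimum token length, maximum token length, lowercase flag). The function name `alpha_tokenize`, its regular expression, and the default lengths are my own choices for illustration, not Gensim's implementation.

```python
import re

# Hypothetical pattern: runs of Unicode letters only (no digits, no underscores).
_ALPHA_RUN = re.compile(r"[^\W\d_]+")


def alpha_tokenize(text, token_min_len=2, token_max_len=15, lower=True):
    """Split text into alphabetic tokens, optionally lowercased and length-filtered."""
    if lower:
        text = text.lower()
    return [
        token
        for token in _ALPHA_RUN.findall(text)
        if token_min_len <= len(token) <= token_max_len
    ]


sample = "Anarchism is a political philosophy, dating from 1840!"
tokens = alpha_tokenize(sample)
print(tokens)
```

Passing `tokenizer_func=alpha_tokenize` to `WikiCorpus` would swap it in. With a minimum token length of 2, single-letter words such as "a" are dropped, and digits never match the pattern.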

In [14]:
# Uncomment and run this cell to download the most recent Wikipedia backup/dump
#!wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

In [6]:
wiki = WikiCorpus(
    "enwiki-latest-pages-articles.xml.bz2",  # path to the file you downloaded above
    tokenizer_func=tokenize,  # simple regexp; plug in your own tokenizer here
    metadata=True,  # also return the article titles and ids when parsing
    dictionary={},  # don't start processing the data yet
)

with smart_open.open("training-data_wikipedia-full-content.txt.gz", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")  # title_of_article [TAB] words of the article

2025-02-12 14:19:10,750 : INFO : processing article #0: 'Anarchism' (6790 tokens)
2025-02-12 14:34:07,652 : INFO : processing article #500000: 'Brian Lee (wrestler)' (3057 tokens)
2025-02-12 14:44:03,189 : INFO : processing article #1000000: 'Hay Festival' (1364 tokens)
2025-02-12 14:52:48,948 : INFO : processing article #1500000: 'Conquered lorikeet' (128 tokens)
2025-02-12 15:01:26,447 : INFO : processing article #2000000: 'Poverty Valley Aerodrome' (54 tokens)
2025-02-12 15:10:07,276 : INFO : processing article #2500000: 'Get Ready (Mase song)' (136 tokens)
2025-02-12 15:19:18,563 : INFO : processing article #3000000: 'Hans IV Jordaens' (201 tokens)
2025-02-12 15:29:02,436 : INFO : processing article #3500000: 'Anthony Hawken' (97 tokens)
2025-02-12 15:38:20,518 : INFO : processing article #4000000: 'The Peak Scaler' (91 tokens)
2025-02-12 15:47:22,235 : INFO : processing article #4500000: 'Isobel Lilian Gloag' (208 tokens)
2025-02-12 15:57:22,788 : INFO : processing article #500000

The above took about 2 hours and created a new ~7 GB file named `training-data_wikipedia-full-content.txt.gz`. Note the output text was transparently compressed into `.gz` (GZIP) right away, using the [smart_open](https://github.com/RaRe-Technologies/smart_open) library, to save on disk space.

Next we'll set up a document stream to load the preprocessed articles from the training data one by one, in the format expected by Doc2Vec, ready for training. We don't want to load everything into RAM at once, because that would blow up the memory. And it is not necessary – Gensim can handle streamed input training data.
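
To make the streaming idea concrete, here is a minimal sketch using only the standard library. `gzip` stands in for `smart_open`, and the file name, sample lines, and class name `TabSeparatedCorpus` are made up for illustration. It shows why the stream is wrapped in a class with `__iter__` rather than a plain generator: training reads the corpus several times (once to build the vocabulary, then once per epoch), and a generator would be exhausted after the first pass.

```python
import gzip
import os
import tempfile

# Write a tiny stand-in corpus in the same "title<TAB>words" layout, gzip-compressed.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample-corpus.txt.gz")  # hypothetical file name
with gzip.open(sample_path, "wt", encoding="utf8") as fout:
    fout.write("Anarchism\tanarchism is political philosophy\n")
    fout.write("Hay Festival\tthe hay festival is an annual literature festival\n")


class TabSeparatedCorpus:
    """Restartable stream: every call to __iter__ reopens the file from the start."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf8") as fin:
            for line in fin:
                title, words = line.rstrip("\n").split("\t")
                yield title, words.split()


corpus = TabSeparatedCorpus(sample_path)
first_pass = [title for title, _ in corpus]
second_pass = [title for title, _ in corpus]  # works again: the file is reopened

one_shot = (pair for pair in corpus)  # a plain generator, for contrast
consumed = list(one_shot)
leftover = list(one_shot)  # empty: the generator is already exhausted
print(first_pass, second_pass, len(consumed), len(leftover))
```

Only one line is held in memory at a time, so the same pattern scales from this two-line file to the full Wikipedia dump.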

In [7]:
class TaggedWikiCorpus:
    def __init__(self, wiki_text_path):
        self.wiki_text_path = wiki_text_path
        
    def __iter__(self):
        for line in smart_open.open(self.wiki_text_path, encoding='utf8'):
            title, words = line.split('\t')
            yield TaggedDocument(words=words.split(), tags=[title])

documents = TaggedWikiCorpus('training-data_wikipedia-full-content.txt.gz')  # A streamed iterable; nothing in RAM yet.

In [8]:
# Load and print the first preprocessed Wikipedia document, as a sanity check = "input eyeballing".
first_doc = next(iter(documents))
print(first_doc.tags, ': ', ' '.join(first_doc.words[:50] + ['………'] + first_doc.words[-50:]))

['Anarchism'] :  anarchism is political philosophy and movement that is against all forms of authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy typically including the state and capitalism anarchism advocates for the replacement of the state with stateless societies and voluntary free associations historically left wing ……… sources further reading criticism of philosophical anarchism defence of philosophical anarchism stating that both kinds of anarchism philosophical and political anarchism are philosophical and political claims anarchistic popular fiction novel an argument for philosophical anarchism external links anarchy archives an online research center on the history and theory of anarchism


The document seems legit so let's move on to finally training some Doc2vec models.

## Training Doc2Vec

The original paper had a vocabulary size of 915,715 word types, so we'll try to match it by setting `max_final_vocab` to 1,000,000 in the Doc2Vec constructor.

Other critical parameters were left unspecified in the paper, so we'll go with a window size of eight (a prediction window of 8 tokens to either side). It looks like the authors tried vector dimensionalities of 100, 300, 1,000 & 10,000 in the paper (with 10k dims performing the best), but I'll only train the DM model with 256 dimensions here, to keep the RAM in check on my laptop (the commented-out DBOW configuration still uses 200).
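
To get a feel for how vector dimensionality drives memory use, the following back-of-the-envelope sketch estimates the size of the main float32 arrays in a DM model trained with negative sampling: one document vector per article, plus word vectors and output-layer weights for each vocabulary word. The function name is mine, the 5,000,000 article count is a round-number assumption, and the estimate ignores the vocabulary dictionary and other overhead, so treat the result as a rough lower bound rather than a measurement.

```python
BYTES_PER_FLOAT32 = 4


def estimate_model_bytes(n_docs, vocab_size, vector_size):
    """Rough size of the three big float32 arrays; ignores all other overhead."""
    doc_vectors = n_docs * vector_size * BYTES_PER_FLOAT32
    word_vectors = vocab_size * vector_size * BYTES_PER_FLOAT32
    output_weights = vocab_size * vector_size * BYTES_PER_FLOAT32
    return doc_vectors + word_vectors + output_weights


N_DOCS = 5_000_000      # assumption: round number of Wikipedia articles
VOCAB_SIZE = 1_000_000  # matches the max_final_vocab setting

for dims in (100, 256, 1_000, 10_000):
    gigabytes = estimate_model_bytes(N_DOCS, VOCAB_SIZE, dims) / 1e9
    print(f"{dims:>6} dimensions: ~{gigabytes:,.1f} GB")
```

Under these assumptions, 256 dimensions comes to roughly 7 GB, while 10,000 dimensions would need hundreds of gigabytes.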

Feel free to tinker with these values yourself if you like:

In [9]:
workers = 12  # multiprocessing.cpu_count() - 1  # leave one core for the OS & other stuff

# PV-DBOW: paragraph vector in distributed bag of words mode
#model_dbow = Doc2Vec(
#    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
#    vector_size=200, window=8, epochs=10, workers=workers, max_final_vocab=1000000,
#)

# PV-DM: paragraph vector in distributed memory mode
model_dm = Doc2Vec(
    dm=1, dm_mean=1,  # use average of context word vectors to train DM
    vector_size=256, window=8, epochs=10, workers=workers, max_final_vocab=1000000,
)

2025-02-12 16:15:43,317 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d256,n5,w8,mc5,s0.001,t12>', 'datetime': '2025-02-12T16:15:43.316999', 'gensim': '4.3.3', 'python': '3.11.11 (main, Feb 12 2025, 14:14:40) [GCC 13.3.0]', 'platform': 'Linux-6.8.0-51-generic-x86_64-with-glibc2.39', 'event': 'created'}


Run one pass through the Wikipedia corpus, to collect the 1M vocabulary and initialize the doc2vec models:

In [10]:
# DBOW is disabled in this notebook. To train it as well, uncomment the two lines below.
#model_dbow.build_vocab(documents, progress_per=500000)
#print(model_dbow)

# With DBOW enabled, you can save some time by copying its vocabulary structures to the DM model
# with reset_from() instead of calling build_vocab() a second time.
# Both models are built on top of exactly the same data, so there's no need to repeat the vocab-building step.
#model_dm.reset_from(model_dbow)
model_dm.build_vocab(documents, progress_per=500000)
print(model_dm)

2025-02-12 16:17:20,953 : INFO : collecting all words and their counts
2025-02-12 16:17:20,956 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2025-02-12 16:20:04,662 : INFO : PROGRESS: at example #500000, processed 691945216 words (4226775 words/s), 3302056 word types, 500000 tags
2025-02-12 16:21:39,356 : INFO : PROGRESS: at example #1000000, processed 1076732994 words (4063659 words/s), 4589545 word types, 1000000 tags
2025-02-12 16:22:55,513 : INFO : PROGRESS: at example #1500000, processed 1378006617 words (3955982 words/s), 5550052 word types, 1500000 tags
2025-02-12 16:24:01,459 : INFO : PROGRESS: at example #2000000, processed 1635595979 words (3906086 words/s), 6332484 word types, 2000000 tags
2025-02-12 16:25:05,278 : INFO : PROGRESS: at example #2500000, processed 1887726556 words (3950759 words/s), 7100806 word types, 2500000 tags
2025-02-12 16:26:08,999 : INFO : PROGRESS: at example #3000000, processed 2136847999 words (3909570 words/s

Doc2Vec<dm/m,d256,n5,w8,mc5,s0.001,t12>


Now we’re ready to train Doc2Vec on the entirety of the English Wikipedia. **Warning!** Training is slow: in my runs, DM took about 8.5 hours on Linux and DBOW about 20 hours on Windows, and your times will depend on your compute resources.

In [15]:
# Train DBOW doc2vec incl. word vectors.
# Report progress every ½ hour.
# NOTE: This ran for ~20 hours on a Windows 10 laptop with 12 cores, 128 GB RAM, and a 1 TB SSD
#model_dbow.train(documents, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs, report_delay=30*60)
#model_dbow.save('doc2vec_wikipedia_dbow.model')


In [None]:
# Train DM doc2vec.
# NOTE: This ran for ~8.5 hours on an Intel NUC w/ 16 cores, 64 GB RAM, a 1 TB SSD, and Linux Mint 22
model_dm.train(documents, total_examples=model_dm.corpus_count, epochs=model_dm.epochs, report_delay=30*60)
model_dm.save('doc2vec_wikipedia_dm.model')


2025-02-12 16:34:49,618 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 12 workers on 981089 vocabulary and 256 features, using sg=0 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2025-02-12T16:34:49.618448', 'gensim': '4.3.3', 'python': '3.11.11 (main, Feb 12 2025, 14:14:40) [GCC 13.3.0]', 'platform': 'Linux-6.8.0-51-generic-x86_64-with-glibc2.39', 'event': 'train'}
2025-02-12 16:34:50,624 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 921087 words/s, in_qsize 0, out_qsize 0
2025-02-12 17:04:50,638 : INFO : EPOCH 0 - PROGRESS: at 52.14% examples, 943748 words/s, in_qsize 23, out_qsize 0
2025-02-12 17:25:38,921 : INFO : EPOCH 0: training on 3420342527 raw words (2743634368 effective words) took 3049.3s, 899759 effective words/s
2025-02-12 17:25:39,929 : INFO : EPOCH 1 - PROGRESS: at 0.01% examples, 1158951 words/s, in_qsize 0, out_qsize 1
2025-02-12 17:55:39,935 : INFO : EPOCH 1 - PROGRESS: at 52.68% examples, 950137 words/s, in_qsize 23, out_qs
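
The epoch summary in the log above is enough for a rough estimate of total training time: divide the effective words per epoch by the reported throughput, then multiply by the number of epochs. The sketch below just redoes that arithmetic with the numbers from the EPOCH 0 line; it assumes later epochs run at about the same speed.

```python
# Figures copied from the "EPOCH 0: training on ..." log line above.
EFFECTIVE_WORDS_PER_EPOCH = 2_743_634_368
EFFECTIVE_WORDS_PER_SECOND = 899_759
EPOCHS = 10  # the epochs value passed to the Doc2Vec constructor

seconds_per_epoch = EFFECTIVE_WORDS_PER_EPOCH / EFFECTIVE_WORDS_PER_SECOND
total_hours = seconds_per_epoch * EPOCHS / 3600
print(f"~{seconds_per_epoch:.0f} s per epoch, ~{total_hours:.1f} hours for {EPOCHS} epochs")
```

That comes out at about 8.5 hours, in line with the note in the DM training cell.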

The trained DM model has been uploaded to the URL below, so you can skip the hours of preprocessing and training compute if you are just looking to experiment. The model is around 7 GB in size.
- doc2vec_wikipedia_dm-model.256.20250212.zip (256 dimension, wikipedia data as of 2025-02-12): 
https://mega.nz/file/m6ICnQxb#tUY8hCGhScyAOf3Y7HONNk7GsGrftcpYNFLZw2QZHrU

## Finding similar documents

If you have already trained or downloaded/unzipped the models and you are picking up here, run the first cell with the imports and then load the models below. 

In [None]:
#model_dbow = Doc2Vec.load('doc2vec_wikipedia_dbow.model')
model_dm = Doc2Vec.load('doc2vec_wikipedia_dm.model')

First, calculate the most similar Wikipedia articles to the "Machine learning" article. The calculated word vectors and document vectors are stored separately, in `model.wv` and `model.dv` respectively:

In [16]:
#for model in [model_dbow, model_dm]:
for model in [model_dm]:
    print(model)
    pprint(model.dv.most_similar(positive=["Machine learning"], topn=20))

Doc2Vec<dm/m,d256,n5,w8,mc5,s0.001,t12>
[('Pattern recognition', 0.6913762092590332),
 ('Supervised learning', 0.6727138757705688),
 ('Neural network (machine learning)', 0.6502088904380798),
 ('Meta-learning (computer science)', 0.6317244172096252),
 ('Feature learning', 0.6298242807388306),
 ('Anomaly detection', 0.6272202730178833),
 ('Feature selection', 0.6252041459083557),
 ('Linear classifier', 0.616140604019165),
 ('Ensemble learning', 0.615498423576355),
 ('Boosting (machine learning)', 0.6137340664863586),
 ('Naive Bayes classifier', 0.610593855381012),
 ('Automatic image annotation', 0.6056917309761047),
 ('Multiclass classification', 0.6055539846420288),
 ('Multi-task learning', 0.6039445400238037),
 ('Statistical classification', 0.603486955165863),
 ('Regularization (mathematics)', 0.6019155979156494),
 ('Artificial intelligence', 0.6012045741081238),
 ('Random subspace method', 0.5989224314689636),
 ('Early stopping', 0.5979453325271606),
 ('Latent space', 0.595871329307

In [21]:
#for model in [model_dbow, model_dm]:
for model in [model_dm]:
    print(model)
    pprint(model.dv.most_similar(positive=["Chris Cornell"], topn=10))

Doc2Vec<dm/m,d256,n5,w8,mc5,s0.001,t12>
[('Soundgarden', 0.6760900020599365),
 ('Temple of the Dog', 0.6714499592781067),
 ('Alice in Chains', 0.6464263200759888),
 ('Scott Weiland', 0.6344491839408875),
 ('Layne Staley', 0.6146446466445923),
 ('Audioslave', 0.6069024205207825),
 ('Chester Bennington', 0.6001834869384766),
 ('Euphoria Morning', 0.5890706777572632),
 ('Louder Than Love', 0.5855295062065125),
 ('Hunger Strike (song)', 0.5804468989372253)]


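I've kept the commentary from the original Gensim notebook below. Note that it describes a 'Lady Gaga' query run against both models, whereas my query above was 'Chris Cornell' against the DM model only:
The DBOW results are in line with what the paper shows in Table 2a), revealing similar singers in the U.S. Interestingly, the DM results seem to capture more "facts about Lady Gaga" (her albums, trivia), whereas DBOW recovered "similar artists".

The DM results for 'Chris Cornell' show a mix of both: his bands and releases (Soundgarden, Temple of the Dog, Audioslave, Euphoria Morning) alongside similar artists (Scott Weiland, Layne Staley, Chester Bennington).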

**Finally, let's do some of the wilder arithmetic that vector embeddings are famous for**. Table 2b) in the paper asks which entries are most similar to "Lady Gaga" - "American" + "Japanese". In the cell below I switched the signs and query "Lady Gaga" + "American" - "Japanese" instead.
Note that "American" and "Japanese" are word vectors, but they live in the same space as the document vectors so we can add / subtract them at will, for some interesting results. All words were already lowercased by our tokenizer above, so we look up the lowercased versions here:

In [26]:
#for model in [model_dbow, model_dm]:
for model in [model_dm]:
    print(model)
    vec = [model.dv["Lady Gaga"] + model.wv["american"] - model.wv["japanese"]]
    # I switched the search math here...+ american and - japanese
    pprint([m for m in model.dv.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])

Doc2Vec<dm/m,d256,n5,w8,mc5,s0.001,t12>
[('Jennifer Lopez', 0.5050081014633179),
 ('Born This Way (album)', 0.49613016843795776),
 ('Taylor Swift', 0.4820432662963867),
 ('Lady Gaga videography', 0.4794251024723053),
 ('Bad Romance', 0.4734669327735901),
 ('Selena Gomez', 0.4711049199104309),
 ('Adele', 0.46933799982070923),
 ('Lizzo', 0.46371859312057495),
 ('Normani', 0.4565371870994568),
 ('Beyoncé', 0.4494146406650543)]


The results above suggest that DM-mode Doc2Vec vectors work well for calculating similarity between document vectors, word vectors, or a combination of both. In my runs DM also trained much faster than DBOW (~8.5 hours versus ~20 hours), although those runs used different machines and operating systems, so the comparison is only indicative.
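
For intuition about what `most_similar` and the vector arithmetic above are doing, here is a toy sketch in plain Python: it adds and subtracts vectors, then ranks candidates by cosine similarity. The 3-dimensional vectors and their labels are invented for illustration and have nothing to do with the trained model; Gensim's own implementation is vectorized and differs in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates, topn=3):
    """Return the topn (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Invented toy vectors: axis 0 ~ "pop singer", axis 1 ~ "american", axis 2 ~ "japanese".
doc_vectors = {
    "singer_us": [1.0, 0.9, 0.0],
    "singer_jp": [1.0, 0.0, 0.9],
    "city_us": [0.0, 1.0, 0.0],
}
word_vectors = {"american": [0.0, 1.0, 0.0], "japanese": [0.0, 0.0, 1.0]}

# query = document - "american" + "japanese", computed element by element.
doc = doc_vectors["singer_us"]
minus = word_vectors["american"]
plus = word_vectors["japanese"]
query = [d - m + p for d, m, p in zip(doc, minus, plus)]

ranking = rank_by_similarity(query, doc_vectors)
print(ranking)
```

In this toy setup the query lands closest to `singer_jp`, which is the kind of shift the "Lady Gaga" arithmetic is probing for on real vectors.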

To continue your doc2vec explorations, refer to the official API documentation in Gensim: https://radimrehurek.com/gensim/models/doc2vec.html