# Training Doc2Vec on Wikipedia articles full content
1/13/2025, Dave Sisk, https://github.com/davidcsisk, https://www.linkedin.com/in/davesisk-doctordatabase/ 

This notebook replicates the **Document Embedding with Paragraph Vectors** paper, http://arxiv.org/abs/1507.07998, and it also adds on to this notebook from Gensim: https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

In that paper, the authors only showed results from the DBOW ("distributed bag of words") mode, trained on the English Wikipedia. Here we replicate this experiment using not only DBOW, but also the DM ("distributed memory") mode of the Paragraph Vector algorithm aka Doc2Vec.

## Basic setup

Let's import the necessary modules and set up logging. The code below assumes Python 3.7+ and Gensim 4.0+.

In [2]:
import logging
import multiprocessing
from pprint import pprint

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Preparing the corpus

First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/latest). You want the file named `enwiki-latest-pages-articles.xml.bz2`.

Second, convert that Wikipedia article dump from the arcane Wikimedia XML format into a plain text file. This makes the subsequent training faster and also allows easy inspection of the data ("input eyeballing").

We'll preprocess each article at the same time, normalizing its text to lowercase, splitting into tokens, etc. Below I use a regexp tokenizer that simply looks for alphabetic sequences as tokens. But feel free to adapt the text preprocessing to your own domain. High quality preprocessing is often critical for the final pipeline accuracy – garbage in, garbage out!
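To make "a regexp tokenizer that looks for alphabetic sequences" concrete, here is a minimal stdlib-only sketch of the idea. It is not Gensim's actual `tokenize` implementation; the name `simple_tokenize` and the `min_len` / `max_len` limits are my assumptions, chosen because the sample output further down contains no one-letter words.

```python
import re

# One or more Unicode letters (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def simple_tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic runs within the length limits."""
    return [
        token
        for token in TOKEN_RE.findall(text.lower())
        if min_len <= len(token) <= max_len
    ]


tokens = simple_tokenize("Anarchism is a political philosophy, est. 1840!")
print(tokens)
```

A function like this could stand in for `tokenize` if your domain needs different rules (keeping digits, stemming, and so on), as long as it accepts whatever arguments `WikiCorpus` passes to its `tokenizer_func`.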

In [2]:
wiki = WikiCorpus(
    "enwiki-latest-pages-articles.xml.bz2",  # path to the file you downloaded above
    tokenizer_func=tokenize,  # simple regexp; plug in your own tokenizer here
    metadata=True,  # also return the article titles and ids when parsing
    dictionary={},  # don't start processing the data yet
)

with smart_open.open("wiki.txt.gz", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")  # title_of_article [TAB] words of the article

2025-01-08 14:12:20,778 : INFO : processing article #0: 'Anarchism' (6790 tokens)
2025-01-08 14:40:58,152 : INFO : processing article #500000: 'Dora Riparia' (364 tokens)
2025-01-08 15:00:30,204 : INFO : processing article #1000000: 'Nashville School of Law' (852 tokens)
2025-01-08 15:18:02,203 : INFO : processing article #1500000: 'Leetonia, Hibbing, Minnesota' (58 tokens)
2025-01-08 15:36:12,284 : INFO : processing article #2000000: 'The Next Food Network Star season 4' (2408 tokens)
2025-01-08 15:54:37,440 : INFO : processing article #2500000: 'Star Trek: Nero' (53 tokens)
2025-01-08 16:13:53,773 : INFO : processing article #3000000: 'David Berni' (534 tokens)
2025-01-08 16:34:01,876 : INFO : processing article #3500000: 'Ophichthus mystacinus' (59 tokens)
2025-01-08 16:53:28,328 : INFO : processing article #4000000: 'Robert Harington, 3rd Baron Harington' (268 tokens)
2025-01-08 17:13:54,948 : INFO : processing article #4500000: 'Melantho (1812 ship)' (333 tokens)
2025-01-08 17:36:

The above took several hours on my machine (the log timestamps above span more than three hours) and created a new ~5.8 GB file named `wiki.txt.gz`. The output text was transparently compressed into `.gz` (GZIP) right away, using the [smart_open](https://github.com/RaRe-Technologies/smart_open) library, to save on disk space.

Note that `wiki.txt.gz` has been saved here so you can skip that whole step if desired: 
https://mega.nz/file/L2YzUQAD#b_2pkaWHkFUdWoTa-EypJmVXWetVeqsi1fVlMeMeAzk

Next we'll set up a document stream to load the preprocessed articles from `wiki.txt.gz` one by one, in the format expected by Doc2Vec, ready for training. We don't want to load everything into RAM at once, because that would blow up the memory. And it is not necessary – Gensim can handle streamed input training data:

In [3]:
class TaggedWikiCorpus:
    def __init__(self, wiki_text_path):
        self.wiki_text_path = wiki_text_path
        
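    def __iter__(self):
        # Reopen the file on every pass, and close it again when the pass ends.
        with smart_open.open(self.wiki_text_path, encoding='utf8') as fin:
            for line in fin:
                title, words = line.split('\t', 1)  # title [TAB] space-separated tokens
                yield TaggedDocument(words=words.split(), tags=[title])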

documents = TaggedWikiCorpus('wiki.txt.gz')  # A streamed iterable; nothing in RAM yet.
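Why a class with `__iter__` rather than a plain generator? Training makes several passes over the data (one for the vocabulary, then one per epoch), so the corpus has to be restartable. The stdlib-only sketch below illustrates the difference on a tiny file; `LineCorpus` and the two toy "articles" are made up for the demo and are not part of the pipeline above.

```python
import gzip
import os
import tempfile


class LineCorpus:
    """Restartable iterable: every call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf8") as fin:
            for line in fin:
                title, words = line.rstrip("\n").split("\t", 1)
                yield title, words.split()


# Write a toy file in the same "title<TAB>tokens" layout as wiki.txt.gz.
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.txt.gz")
with gzip.open(tmp_path, "wt", encoding="utf8") as fout:
    fout.write("Toy article A\tfirst toy document\n")
    fout.write("Toy article B\tsecond toy document\n")

corpus = LineCorpus(tmp_path)
first_pass = [title for title, _ in corpus]
second_pass = [title for title, _ in corpus]  # works: the file is reopened

one_shot = iter(corpus)       # a generator object
list(one_shot)                # consumes it
exhausted = list(one_shot)    # nothing left: a generator cannot restart

print(first_pass, second_pass, exhausted)
```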

In [5]:
# Load and print the first preprocessed Wikipedia document, as a sanity check ("input eyeballing").
first_doc = next(iter(documents))
print(first_doc.tags, ': ', ' '.join(first_doc.words[:50] + ['………'] + first_doc.words[-50:]))

['Anarchism'] :  anarchism is political philosophy and movement that is against all forms of authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy typically including the state and capitalism anarchism advocates for the replacement of the state with stateless societies and voluntary free associations historically left wing ……… sources further reading criticism of philosophical anarchism defence of philosophical anarchism stating that both kinds of anarchism philosophical and political anarchism are philosophical and political claims anarchistic popular fiction novel an argument for philosophical anarchism external links anarchy archives an online research center on the history and theory of anarchism


The document seems legit, so let's move on to finally training some Doc2Vec models.

## Training Doc2Vec

The original paper had a vocabulary size of 915,715 word types, so we'll try to match it by setting `max_final_vocab` to 1,000,000 in the Doc2Vec constructor.
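As a rough picture of what a vocabulary cap does: as far as I understand it, the cap is enforced by raising the minimum-frequency threshold until the surviving vocabulary fits, which is why the final size can land below the cap (the training log further down reports 977793 words, not exactly 1,000,000). The sketch below is a simplified stdlib illustration of that idea, not Gensim's code; `cap_vocab` is a hypothetical helper.

```python
from collections import Counter


def cap_vocab(counts, max_final_vocab, min_count=1):
    """Keep words whose frequency clears a threshold chosen so that
    at most `max_final_vocab` words survive."""
    threshold = min_count
    ordered = sorted(counts.values(), reverse=True)
    if len(ordered) > max_final_vocab:
        # Frequency of the first word that does not fit; words tied
        # with it are dropped as well, so the result can undershoot.
        threshold = max(min_count, ordered[max_final_vocab] + 1)
    return {word: n for word, n in counts.items() if n >= threshold}


counts = Counter("a a a a a b b b b c c c c d".split())
kept = cap_vocab(counts, max_final_vocab=2)
print(kept)
```

In this toy case `b` and `c` are tied, so both are dropped and only one word survives even though the cap is two.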

Other critical parameters were left unspecified in the paper, so we'll go with a window size of eight (a prediction window of 8 tokens to either side). It looks like the authors tried vector dimensionality of 100, 300, 1,000 & 10,000 in the paper (with 10k dims performing the best), but I'll only train with 200 dimensions here, to keep the RAM in check on my laptop.
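To see why dimensionality matters for RAM, here is a back-of-the-envelope estimate. It assumes the model is dominated by three float32 arrays (document vectors, word vectors, and the negative-sampling output weights, which show up as `dv.vectors`, `wv.vectors` and `syn1neg` in the load logs further down) and ignores everything else. `ASSUMED_DOCS` is a hypothetical round number, since the vocabulary-scan log below is cut off before the final document count.

```python
BYTES_PER_FLOAT32 = 4


def estimate_model_bytes(n_docs, vocab_size, vector_size):
    """Rough size of the three big weight arrays, in bytes."""
    doc_vectors = n_docs * vector_size * BYTES_PER_FLOAT32
    word_vectors = vocab_size * vector_size * BYTES_PER_FLOAT32
    output_weights = vocab_size * vector_size * BYTES_PER_FLOAT32
    return doc_vectors + word_vectors + output_weights


ASSUMED_DOCS = 5_000_000   # hypothetical; not taken from the logs
VOCAB_SIZE = 1_000_000     # the max_final_vocab cap used here

for dims in (200, 1_000, 10_000):
    gib = estimate_model_bytes(ASSUMED_DOCS, VOCAB_SIZE, dims) / 2**30
    print(f"{dims:>6} dims: ~{gib:,.1f} GiB")
```

The estimate scales linearly with `vector_size`, so going from 200 to 10,000 dimensions multiplies the footprint by 50.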

Feel free to tinker with these values yourself if you like:

In [6]:
workers = 20  # hard-coded for my machine; alternatively use multiprocessing.cpu_count() - 1 to leave one core for the OS & other stuff

# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=10, workers=workers, max_final_vocab=1000000,
)

# PV-DM: paragraph vector in distributed memory mode
model_dm = Doc2Vec(
    dm=1, dm_mean=1,  # use average of context word vectors to train DM
    vector_size=200, window=8, epochs=10, workers=workers, max_final_vocab=1000000,
)

2025-01-09 07:58:52,387 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>', 'datetime': '2025-01-09T07:58:52.387111', 'gensim': '4.3.3', 'python': '3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'created'}
2025-01-09 07:58:52,389 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20>', 'datetime': '2025-01-09T07:58:52.389132', 'gensim': '4.3.3', 'python': '3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'created'}


Run one pass through the Wikipedia corpus, to collect the 1M vocabulary and initialize the doc2vec models:

In [7]:
model_dbow.build_vocab(documents, progress_per=500000)
print(model_dbow)

# Save some time by copying the vocabulary structures from the DBOW model to the DM model.
# Both models are built on top of exactly the same data, so there's no need to repeat the vocab-building step.
model_dm.reset_from(model_dbow)
print(model_dm)

2025-01-09 07:59:02,357 : INFO : collecting all words and their counts
2025-01-09 07:59:02,371 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2025-01-09 08:02:38,158 : INFO : PROGRESS: at example #500000, processed 690662809 words (3200658 words/s), 3299120 word types, 500000 tags
2025-01-09 08:04:49,139 : INFO : PROGRESS: at example #1000000, processed 1074431339 words (2929984 words/s), 4585092 word types, 1000000 tags
2025-01-09 08:06:34,302 : INFO : PROGRESS: at example #1500000, processed 1375135884 words (2859418 words/s), 5544766 word types, 1500000 tags
2025-01-09 08:08:02,967 : INFO : PROGRESS: at example #2000000, processed 1632207736 words (2899398 words/s), 6327024 word types, 2000000 tags
2025-01-09 08:09:29,755 : INFO : PROGRESS: at example #2500000, processed 1884101404 words (2902444 words/s), 7096120 word types, 2500000 tags
2025-01-09 08:10:56,557 : INFO : PROGRESS: at example #3000000, processed 2132714040 words (2864139 words/s

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20>


Now we’re ready to train Doc2Vec on the entirety of the English Wikipedia. **Warning!** Training the DBOW model takes ~20 hours, and DM ~12 hours, on my machine (Windows 10 with 20 worker threads, per the logs).

In [8]:
# Train DBOW doc2vec incl. word vectors.
# Report progress every ½ hour.
# NOTE: This runs for ~20 hours
model_dbow.train(documents, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs, report_delay=30*60)
model_dbow.save('doc2vec_wikipedia_dbow.model')


2025-01-09 08:33:05,561 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 20 workers on 977793 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2025-01-09T08:33:05.561594', 'gensim': '4.3.3', 'python': '3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'train'}
2025-01-09 08:33:06,601 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 227430 words/s, in_qsize 39, out_qsize 0
2025-01-09 09:03:06,646 : INFO : EPOCH 0 - PROGRESS: at 10.23% examples, 336596 words/s, in_qsize 39, out_qsize 0
2025-01-09 09:33:06,660 : INFO : EPOCH 0 - PROGRESS: at 31.01% examples, 335250 words/s, in_qsize 39, out_qsize 1
2025-01-09 10:03:06,671 : INFO : EPOCH 0 - PROGRESS: at 57.43% examples, 334903 words/s, in_qsize 40, out_qsize 0
2025-01-09 10:33:06,674 : INFO : EPOCH 0 - PROGRESS: at 85.43% examples, 334998 words/s, in_qsize 39, out_qsize 0
2025-01-09

In [9]:
# Train DM doc2vec.
# NOTE: This runs for ~12 hours
model_dm.train(documents, total_examples=model_dm.corpus_count, epochs=model_dm.epochs, report_delay=30*60)
model_dm.save('doc2vec_wikipedia_dm.model')


2025-01-10 07:04:51,333 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 20 workers on 977793 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2025-01-10T07:04:51.333385', 'gensim': '4.3.3', 'python': '3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'train'}
2025-01-10 07:04:52,355 : INFO : EPOCH 0 - PROGRESS: at 0.01% examples, 1110553 words/s, in_qsize 0, out_qsize 0
2025-01-10 07:34:52,357 : INFO : EPOCH 0 - PROGRESS: at 51.41% examples, 929821 words/s, in_qsize 38, out_qsize 1
2025-01-10 07:56:59,321 : INFO : EPOCH 0: training on 3402712129 raw words (2729432921 effective words) took 3128.0s, 872588 effective words/s
2025-01-10 07:57:00,340 : INFO : EPOCH 1 - PROGRESS: at 0.01% examples, 1063460 words/s, in_qsize 0, out_qsize 1
2025-01-10 08:27:00,358 : INFO : EPOCH 1 - PROGRESS: at 51.65% examples, 932958 words/s, in_qsiz
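The log lines above are enough for a rough training-time estimate. The sketch below plugs in the figures from the logs (3128.0 s for DM epoch 0, about 872,588 effective words/s for DM, and about 334,998 words/s for DBOW). It assumes every epoch takes as long as the first and that the two words/s figures are comparable, so treat the result as a ballpark only; my actual wall-clock times were longer, as noted in the cell comments.

```python
def estimate_total_hours(seconds_per_epoch, epochs):
    """Total training time, assuming all epochs take equally long."""
    return seconds_per_epoch * epochs / 3600


EPOCHS = 10                 # the epochs= value passed to both models
DM_EPOCH_SECONDS = 3128.0   # from the "EPOCH 0: training on ..." line
DM_WORDS_PER_SEC = 872_588
DBOW_WORDS_PER_SEC = 334_998

dm_hours = estimate_total_hours(DM_EPOCH_SECONDS, EPOCHS)
speed_ratio = DM_WORDS_PER_SEC / DBOW_WORDS_PER_SEC
dbow_hours = dm_hours * speed_ratio  # same corpus, lower throughput

print(f"DM:   ~{dm_hours:.1f} h")
print(f"DBOW: ~{dbow_hours:.1f} h (DM is ~{speed_ratio:.1f}x faster)")
```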

The models have been saved to the URLs below, so you can avoid the 1-2 days of training compute time if you are just looking to experiment. The models are around 5.4 GB each.
- doc2vec_wikipedia_dbow-model.zip: https://mega.nz/file/i6pVWQRD#gIdgXKJG5gEjRBZ2BDW_XlklBUtJb81a9EnlUrltnro
- doc2vec_wikipedia_dm-model.zip: https://mega.nz/file/ivwDmQYL#G1vmS8jJNpoDWf09mCspNsRzNzGFmqk1UPtTGjN7gBo 

## Finding similar documents

If you have already trained or downloaded/unzipped the models and you are picking up here, run the first cell with the imports and then load the models below. 

In [7]:
model_dbow = Doc2Vec.load('doc2vec_wikipedia_dbow.model')
model_dm = Doc2Vec.load('doc2vec_wikipedia_dm.model')

2025-01-13 14:36:32,557 : INFO : loading Doc2Vec object from doc2vec_wikipedia_dbow.model
2025-01-13 14:36:36,399 : INFO : loading dv recursively from doc2vec_wikipedia_dbow.model.dv.* with mmap=None
2025-01-13 14:36:36,400 : INFO : loading vectors from doc2vec_wikipedia_dbow.model.dv.vectors.npy with mmap=None
2025-01-13 14:36:37,770 : INFO : loading wv recursively from doc2vec_wikipedia_dbow.model.wv.* with mmap=None
2025-01-13 14:36:37,770 : INFO : loading vectors from doc2vec_wikipedia_dbow.model.wv.vectors.npy with mmap=None
2025-01-13 14:36:37,995 : INFO : loading syn1neg from doc2vec_wikipedia_dbow.model.syn1neg.npy with mmap=None
2025-01-13 14:36:38,227 : INFO : setting ignored attribute cum_table to None
2025-01-13 14:36:44,360 : INFO : Doc2Vec lifecycle event {'fname': 'doc2vec_wikipedia_dbow.model', 'datetime': '2025-01-13T14:36:44.360213', 'gensim': '4.3.3', 'python': '3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]', 'platform': 'Windows-10

After that, let's test both models! The DBOW model shows results similar to those in the original paper.

First, calculate the most similar Wikipedia articles to the "Machine learning" article. The calculated word vectors and document vectors are stored separately, in `model.wv` and `model.dv` respectively:

In [4]:
for model in [model_dbow, model_dm]:
    print(model)
    pprint(model.dv.most_similar(positive=["Machine learning"], topn=20))

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>
[('Supervised learning', 0.7331087589263916),
 ('Neural network (machine learning)', 0.7137694358825684),
 ('Boosting (machine learning)', 0.7093158960342407),
 ('Pattern recognition', 0.703530490398407),
 ('Symbolic artificial intelligence', 0.6911135315895081),
 ('Liang Zhao', 0.6848805546760559),
 ('Feature selection', 0.6822749376296997),
 ('Data mining', 0.6801027655601501),
 ('Linear classifier', 0.6765809059143066),
 ('Deep learning', 0.6754170060157776),
 ('Neural network software', 0.6663535833358765),
 ('Support vector machine', 0.6654434204101562),
 ('Multi-task learning', 0.6646270751953125),
 ('Outline of computer science', 0.6641519069671631),
 ('Statistical assumption', 0.6632416248321533),
 ('Bayesian network', 0.6610164642333984),
 ('Computer scientist', 0.657122015953064),
 ('Image segmentation', 0.6535505056381226),
 ('Training, validation, and test data sets', 0.6525720357894897),
 ('Early stopping', 0.6518106460571289)]
Doc

Both results seem similar and match the results from the paper's Table 1, although not exactly. This is because we don't know the exact parameters of the original implementation (see above), and also because we're training the model roughly 10 years after the paper was published, and the Wikipedia content has changed in the meantime.
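The scores printed above are cosine similarities between document vectors. As a minimal stdlib illustration of what such a ranking involves (a toy re-implementation, not Gensim's code), the sketch below ranks a handful of hand-made 3-dimensional vectors. The vectors and the helper names are invented for the demo and have nothing to do with the trained model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=10):
    """Rank every other entry by cosine similarity to `key`."""
    query = vectors[key]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != key
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors, NOT taken from the trained model.
toy_vectors = {
    "Machine learning": [1.0, 1.0, 0.0],
    "Supervised learning": [1.0, 0.9, 0.1],
    "Pattern recognition": [0.8, 1.0, 0.3],
    "Lady Gaga": [0.0, 0.1, 1.0],
}

ranking = most_similar(toy_vectors, "Machine learning", topn=3)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```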

Now following the paper's Table 2a), let's calculate the most similar Wikipedia entries to "Lady Gaga" using Paragraph Vector:

In [5]:
for model in [model_dbow, model_dm]:
    print(model)
    pprint(model.dv.most_similar(positive=["Lady Gaga"], topn=10))

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>
[('Katy Perry', 0.7451617121696472),
 ('Ariana Grande', 0.730435848236084),
 ('Adele', 0.7103729248046875),
 ('Nicki Minaj', 0.7018465399742126),
 ('Miley Cyrus', 0.6926740407943726),
 ('Taylor Swift', 0.6914724111557007),
 ('Demi Lovato', 0.676815927028656),
 ('Selena Gomez', 0.6723569631576538),
 ('Ellie Goulding', 0.670768141746521),
 ('Harry Styles', 0.6642683148384094)]
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20>
[('Born This Way (album)', 0.6678953766822815),
 ('Artpop', 0.649540364742279),
 ('Beautiful, Dirty, Rich', 0.6354315876960754),
 ('Lady Gaga videography', 0.6259204745292664),
 ('Jennifer Lopez', 0.6139587759971619),
 ('Lady Gaga discography', 0.6118119359016418),
 ('Madonna', 0.6114965677261353),
 ('Katy Perry', 0.6079288721084595),
 ('Selena Gomez', 0.5989758968353271),
 ('Nicki Minaj', 0.5984588861465454)]


The DBOW results are in line with what the paper shows in Table 2a), revealing similar singers in the U.S.

Interestingly, the DM results seem to capture more "facts about Lady Gaga" (her albums, trivia), whereas DBOW recovered "similar artists".

**Finally, let's do some of the wilder arithmetic that vector embeddings are famous for**. What are the entries most similar to "Lady Gaga" - "American" + "Japanese"? This corresponds to Table 2b) in the paper.

Note that "American" and "Japanese" are word vectors, but they live in the same space as the document vectors so we can add / subtract them at will, for some interesting results. All words were already lowercased by our tokenizer above, so we look up the lowercased versions here:

In [6]:
for model in [model_dbow, model_dm]:
    print(model)
    vec = [model.dv["Lady Gaga"] - model.wv["american"] + model.wv["japanese"]]
    pprint([m for m in model.dv.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>
[('Ayumi Hamasaki', 0.6177269816398621),
 ('Dango 3 Kyodai', 0.6052653789520264),
 ('X -Cross-', 0.5981104373931885),
 ('Katy Perry', 0.5980274081230164),
 ('We Are "Lonely Girl"', 0.5946057438850403),
 ("D' no Junjō", 0.5916231274604797),
 ('Jidai (Miyuki Nakajima song)', 0.5841305255889893),
 ('Ring a Ding Dong', 0.5810612440109253),
 ('Aitakute Aitakute', 0.5777259469032288),
 ('Seventeen (South Korean band)', 0.5761528015136719)]
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20>
[('Chisato Moritaka', 0.5491870641708374),
 ('Kaela Kimura', 0.5438317060470581),
 ('D&D (band)', 0.5341318845748901),
 ('Mari Amachi', 0.53079754114151),
 ('Rei Yasuda', 0.5302826166152954),
 ('Beautiful, Dirty, Rich', 0.5257192850112915),
 ('Radwimps', 0.5251664519309998),
 ('Pink Lady (duo)', 0.5212596654891968),
 ('Miliyah Kato', 0.5167880654335022),
 ('Koda Kumi', 0.5148667693138123)]


As a result, the DBOW model surfaced artists similar to Lady Gaga in Japan, such as **Ayumi Hamasaki** whose Wiki bio says:

> Ayumi Hamasaki is a Japanese singer, songwriter, record producer, actress, model, spokesperson, and entrepreneur.

So that sounds like a success. It's also the top hit in the paper we're replicating.

The DM model results are opaque to me, but seem art & Japan related as well. The score deltas between these DM results are marginal, so it's likely they would change if retrained on a different version of Wikipedia. Or even when simply re-run on the same version – the doc2vec training algorithm is stochastic.
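For intuition about why "Lady Gaga" - "american" + "japanese" can land near a Japanese singer, here is a toy stdlib sketch with hand-made vectors whose three axes loosely mean (American, Japanese, pop singer). All numbers are invented for illustration; real embeddings are nowhere near this tidy, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy axes: (american, japanese, pop singer). NOT real model output.
word_vecs = {
    "american": [1.0, 0.0, 0.0],
    "japanese": [0.0, 1.0, 0.0],
}
doc_vecs = {
    "Lady Gaga": [1.0, 0.0, 1.0],
    "Katy Perry": [0.9, 0.0, 1.0],
    "Ayumi Hamasaki": [0.0, 1.0, 1.0],
    "Mount Fuji": [0.0, 1.0, 0.0],
}

# doc - word + word, computed component by component.
query = [
    d - a + j
    for d, a, j in zip(
        doc_vecs["Lady Gaga"], word_vecs["american"], word_vecs["japanese"]
    )
]

ranked = sorted(
    ((title, cosine(query, vec)) for title, vec in doc_vecs.items()
     if title != "Lady Gaga"),
    key=lambda pair: pair[1],
    reverse=True,
)
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

Swapping the nationality component while keeping the "pop singer" component is what moves the query away from "Katy Perry" and towards "Ayumi Hamasaki" in this toy space.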

These results suggest that both training modes of the Paragraph Vector algorithm work well for calculating similarity between document vectors, word vectors, or a combination of both. The DM mode has the added advantage of being faster to train: in this run it processed roughly 2.6x as many words per second as DBOW (about 870k vs. 335k words/s in the training logs above).

To continue your doc2vec explorations, refer to the official API documentation in Gensim: https://radimrehurek.com/gensim/models/doc2vec.html