# Document embedding (doc2vec)

The general idea here is to train a neural network model to create text embeddings.
Text embedding vectors can then be used to obtain similarity metrics between texts, e.g. to find neighbouring texts.

In this example we use the `doc2vec` model
([Le & Mikolov, ICML 2014](https://cs.stanford.edu/~quocle/paragraph_vector.pdf))
to create text embeddings from PubMed data.
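
As a rough illustration of what a "similarity metric" between embeddings means, the sketch below computes cosine similarity, the measure typically used to compare such vectors. The three vectors are made-up toy values, not output of any model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (illustrative values only)
doc_a = [0.9, 0.1, 0.3]
doc_b = [0.8, 0.2, 0.4]   # points in roughly the same direction as doc_a
doc_c = [-0.5, 0.9, 0.0]  # points in a different direction

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Texts whose vectors point in similar directions score close to 1, while unrelated texts score near or below 0.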

In [13]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import re
import os
import collections

import pandas as pd
import numpy as np

import gensim
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

import config



## Theories

- word2vec -> doc2vec
  - continuous-bag-of-words -> distributed memory (PV-DM)
  - skip-gram -> distributed bag-of-words (PV-DBOW)

![Distributed Memory model (PV-DM)](https://adriancolyer.files.wordpress.com/2016/05/paragraph-vectors-fig-2.png?w=600)

![Distributed Bag of Words model (PV-DBOW)](https://adriancolyer.files.wordpress.com/2016/05/paragraph-vectors-fig-3.png?w=600)
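
To make the difference between the two architectures concrete, the sketch below lists the prediction tasks each one sets up for a single document. It is a simplification: the function names are hypothetical, the symmetric context window is an assumption, and no actual training happens here.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM: predict a word from the document tag plus its context words."""
    examples = []
    for pos, target in enumerate(tokens):
        left = tokens[max(0, pos - window):pos]
        right = tokens[pos + 1:pos + 1 + window]
        examples.append(((doc_id, tuple(left + right)), target))
    return examples

def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW: predict each word of the document from the document tag alone."""
    return [(doc_id, target) for target in tokens]

tokens = ["rho", "proteins", "and", "cancer"]

for example in pv_dm_examples(0, tokens, window=1):
    print("PV-DM  ", example)
for example in pv_dbow_examples(0, tokens):
    print("PV-DBOW", example)
```

In PV-DM the document vector acts as an extra context item that "remembers" the topic, whereas PV-DBOW ignores word order and context entirely.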

## Pre-processing data

We use the PubMed data from earlier sessions.

In [14]:
pubmed_data = pd.read_csv(config.demoPubmedFile, sep="\t")

pubmed_data.head()

Unnamed: 0,pmid,year,title,abstract
0,25475436,2015,Sixty-five common genetic variants and predict...,We developed a 65 type 2 diabetes (T2D) varian...
1,25011450,2014,Association between alcohol and cardiovascular...,To use the rs1229984 variant in the alcohol de...
2,28968714,2018,FATHMM-XF: accurate prediction of pathogenic p...,"We present FATHMM-XF, a method for predicting ..."
3,21965548,2012,Four genetic loci influencing electrocardiogra...,Presence of left ventricular hypertrophy on an...
4,26930047,2016,Diagnosis of Coronary Heart Diseases Using Gen...,Cardiovascular disease (including coronary art...


Here are some of the pre-processing steps to be done (subjective):

- Make sure `abstract`s are strings.
- Keep only abstracts that are long enough to be more than fragments.

In [4]:
def long_enough(text, word_length=40, num_sentences=2):
    return (len(text.split(" ")) >= word_length 
            and text.count(".") >= num_sentences)

pubmed_data = pubmed_data \
    .assign(abstract=lambda df: df.abstract.astype(str)) \
    .assign(keep=lambda df: df.abstract.apply(long_enough))

pubmed_data_keep = pubmed_data.query("keep")

In [5]:
# Abstracts to be discarded from the corpus
for text in pubmed_data.query("not keep").abstract[:5]:
    print(text, "\n")

To use the rs1229984 variant in the alcohol dehydrogenase 1B gene (ADH1B) as an instrument to investigate the causal role of alcohol in cardiovascular disease. 

Conclusions. A candidate functional variant, rs28451064, was identified. Future work should focus on identifying the pathway(s) involved. 

Haptoglobin acts as an antioxidant by limiting peroxidative tissue damage by free hemoglobin. The haptoglobin gene allele Hp2 comprises a 1.7 kb partial duplication. Relative to allele Hp1, Hp2 carriers form protein multimers, suboptimal for hemoglobin scavenging. 

To establish whether the association between milk intake and prostate cancer operates via the insulin-like growth factor (IGF) pathway (including IGF-I, IGF-II, IGFBP-1, IGFBP-2, and IGFBP-3). 

Prenatal exposure to maternal cigarette smoking (prenatal smoke exposure) had been associated with altered DNA methylation (DNAm) at birth. 
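
One caveat of the `long_enough` filter is that `text.count(".")` also counts decimal points (`2.70`) and abbreviations (`U.K.`) as sentence ends. A minimal sketch of a stricter variant follows; the names `count_sentences` and `long_enough_strict` are hypothetical, and the regular expression is only a heuristic, not a proper sentence splitter.

```python
import re

def count_sentences(text):
    """Count '.', '!' or '?' only when followed by whitespace or end of text."""
    return len(re.findall(r"[.!?](?:\s|$)", text))

def long_enough_strict(text, min_words=40, min_sentences=2):
    """Same idea as `long_enough`, with the heuristic sentence count."""
    return (len(text.split()) >= min_words
            and count_sentences(text) >= min_sentences)

sample = "The odds ratio was 2.70 (95% CI 2.12-3.43). Done."
print(sample.count("."))        # naive count, includes decimal points
print(count_sentences(sample))  # heuristic count
```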



We create two corpora (training and test), in which each text element consists of both the `title` and the `abstract`.

In [15]:
split = 0.95
split_idx = int(np.floor(pubmed_data_keep.shape[0] * split))

pubmed_train = pubmed_data_keep[:split_idx]
pubmed_test = pubmed_data_keep[split_idx:]
train_corpus = []
test_corpus = []

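# NOTE: `TaggedDocument` expects `words` to be a list of tokens.
# A raw string is passed here, so gensim treats every character as a "word";
# tokenizing first (e.g. with `gensim.utils.simple_preprocess`) is usually preferable.
for i, (title, abstract) in enumerate(zip(pubmed_train.title, 
                                          pubmed_train.abstract)):
    train_corpus.append(TaggedDocument(title + " " + abstract, [i]))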
    
for i, (title, abstract) in enumerate(zip(pubmed_test.title,
                                          pubmed_test.abstract)):
    test_corpus.append(TaggedDocument(title + " " + abstract, [i]))
    
print(f"Number of texts in training set: {len(train_corpus)}")
print(f"Number of texts in test set: {len(test_corpus)}")

Number of texts in training set: 8379
Number of texts in test set: 441
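
The corpus above stores each text as a single string. The sketch below shows why that matters and what a tokenized alternative could look like. `TaggedDoc` is a stand-in namedtuple with the same two fields as gensim's `TaggedDocument`, and `tokenize` is a hypothetical helper; in practice `gensim.utils.simple_preprocess` is an off-the-shelf option for this step.

```python
import re
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument,
# used so that this sketch runs without gensim installed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def tokenize(text):
    """Lowercase and split into alphanumeric tokens, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

title = "Rho proteins and cancer."
abstract = "The Rho family of GTPases has been intensively studied."
text = title + " " + abstract

raw = TaggedDoc(text, [0])                  # `words` is a str
tokenized = TaggedDoc(tokenize(text), [0])  # `words` is a list of tokens

# Iterating over a str yields characters, not words
first_raw = list(raw.words)[:3]
first_tokens = tokenized.words[:3]
print(first_raw)
print(first_tokens)
```

With the raw string the model's "vocabulary" consists of single characters, so tokenizing before building the corpus is likely to give more meaningful embeddings.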


In [16]:
for i in range(3):
    print(train_corpus[i], "\n")

TaggedDocument(Sixty-five common genetic variants and prediction of type 2 diabetes. We developed a 65 type 2 diabetes (T2D) variant-weighted gene score to examine the impact on T2D risk assessment in a U.K.-based consortium of prospective studies, with subjects initially free from T2D (N = 13,294; 37.3% women; mean age 58.5 [38-99] years). We compared the performance of the gene score with the phenotypically derived Framingham Offspring Study T2D risk model and then the two in combination. Over the median 10 years of follow-up, 804 participants developed T2D. The odds ratio for T2D (top vs. bottom quintiles of gene score) was 2.70 (95% CI 2.12-3.43). With a 10% false-positive rate, the genetic score alone detected 19.9% incident cases, the Framingham risk model 30.7%, and together 37.3%. The respective area under the receiver operator characteristic curves were 0.60 (95% CI 0.58-0.62), 0.75 (95% CI 0.73 to 0.77), and 0.76 (95% CI 0.75 to 0.78). The combined risk score net reclassifica

## Train a Doc2Vec model

Here we demonstrate a simple usage of paragraph / sentence embedding using a Doc2Vec model.

Refer to [gensim's documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
for the specific usage of the Doc2Vec model and its APIs.

In [19]:
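# Passing the corpus to the constructor would already build the vocabulary
# and train the model, so calling `train` afterwards would train it twice.
# Instead, create an empty model, build the vocabulary, then train once.
d2v_model = Doc2Vec(epochs=40)
d2v_model.build_vocab(train_corpus)

%time d2v_model.train(train_corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)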

CPU times: user 3min 28s, sys: 8.01 s, total: 3min 36s
Wall time: 1min 33s


In [9]:
# This is how you can check the API usage.
# You can do this in the notebook kernel session or 
# in a jupyter lab console session associated to this kernel

?Doc2Vec

Init signature:
Doc2Vec(
    documents=None,
    dm_mean=None,
    dm=1,
    dbow_words=0,
    dm_concat=0,
    dm_tag_count=1,
    docvecs=None,
    docvecs_mapfile=None,
    comment=None,
    trim_rule=None,
    callbacks=(),
    **

## Assess accuracy

As a sanity check of our model fit,
we take the first 1000 abstracts from the training corpus
and check whether, for each text, the most similar abstract is itself.

In [21]:
sample_n = 1000
sample_corpus = train_corpus[:sample_n]

In [22]:
ranks = []
for doc_id in range(len(sample_corpus)):
    # The inferred_vector of this document
    inferred_vector = d2v_model.infer_vector(sample_corpus[doc_id].words)
    # Get the most similar document rankings across entire `train_corpus`
    # in the form of
    # [(499, 0.7327660322189331),
    #  (8835, 0.6161227822303772),
    #  (981, 0.5684570074081421),
    #  ...]
    sims = d2v_model.docvecs.most_similar([inferred_vector], 
                                          topn=len(d2v_model.docvecs))
    # The index position for `doc_id`.
    # If this abstract is most similar to itself, rank should be 0
    rank = [docid for docid, sim in sims].index(doc_id)
    
    ranks.append(rank)

In [23]:
print(collections.Counter(ranks))

accuracy = np.sum(np.array(ranks) == 0) / sample_n * 100
print(f"Accuracy: {accuracy}%")

Counter({0: 996, 1: 2, 2: 1, 3: 1})
Accuracy: 99.6%
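
The logic of this self-similarity check can be reproduced without gensim. The sketch below uses randomly generated stand-in vectors (not real embeddings): each "inferred" vector is a slightly perturbed copy of a stored vector, mimicking the fact that re-inferring a training document gives a vector close to, but not identical to, the trained one. The dimensions and noise level are arbitrary assumptions.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def self_similarity_ranks(doc_vectors, query_vectors):
    """For each query i, return the rank of document i among all documents (0 = best)."""
    ranks = []
    for doc_id, query in enumerate(query_vectors):
        sims = [cosine(query, vec) for vec in doc_vectors]
        order = sorted(range(len(doc_vectors)), key=lambda j: sims[j], reverse=True)
        ranks.append(order.index(doc_id))
    return ranks

rng = random.Random(0)
# Stand-in "trained" document vectors
doc_vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]
# Stand-in "inferred" vectors: the same vectors plus small noise
query_vectors = [[x + rng.gauss(0, 0.01) for x in vec] for vec in doc_vectors]

ranks = self_similarity_ranks(doc_vectors, query_vectors)
accuracy = sum(rank == 0 for rank in ranks) / len(ranks) * 100
print(f"Accuracy: {accuracy}%")
```

A high share of rank-0 results only shows that the model is internally consistent; it says little about how well it generalizes to unseen texts.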


## Recommendation engine based on abstract embeddings

Suppose documents from `test_corpus` are from authors who are interested in finding collaborators for future work.

For the documents in `test_corpus`, we find the most similar documents in `train_corpus`.

In [25]:
for doc_id in range(3):
    print(test_corpus[doc_id], "\n")

TaggedDocument(Adaptation and validation of antibody-ELISA using dried blood spots on filter paper for epidemiological surveys of tsetse-transmitted trypanosomosis in cattle. The indirect enzyme-linked immunosorbent assay (ELISA) for the detection of anti-trypanosomal antibodies in bovine serum was adapted for use with dried blood spots on filter paper. Absorbance (450 nm) results for samples were expressed as percent positivity, i.e. percentage of the median absorbance result of four replicates of the strong positive control serum. The antibody-ELISA was evaluated in Zambia for use in epidemiological surveys of the prevalence of tsetse-transmitted bovine trypanosomosis. Known negative samples (sera, n = 209; blood spots, n = 466) were obtained from cattle from closed herds in tsetse-free areas close to Lusaka. Known positive samples (sera, n = 367; blood spots, n = 278) were obtained from cattle in Zambia's Central, Lusaka and Eastern Provinces, diagnosed as being infected with Trypan

In [26]:
top_similar_n = 3
similar = []

for test_doc_id in range(len(test_corpus)):
    words = test_corpus[test_doc_id].words
    inferred_vector = d2v_model.infer_vector(words, steps=20)
    top_sim = d2v_model.docvecs.most_similar(
        [inferred_vector], topn = top_similar_n)
    similar.append(top_sim)

In [27]:
# Rank test abstracts by the similarity score of their best match
most_similar = []
for test_doc_id in range(len(test_corpus)):
    item = {
        "test_doc_id": test_doc_id,
        "train_doc_id": similar[test_doc_id][0][0],
        "score": similar[test_doc_id][0][1],
    }
    most_similar.append(item)
    
most_similar = pd.DataFrame(most_similar).sort_values(by="score", ascending=False)

In [28]:
most_similar.head()

Unnamed: 0,score,test_doc_id,train_doc_id
253,0.729217,253,8185
368,0.712785,368,5598
401,0.707592,401,5598
293,0.682189,293,8185
138,0.673601,138,4084
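
For an actual recommendation step it may be useful to discard weak matches before presenting them. A minimal sketch, assuming the same shape as the `similar` list built above (one list of `(train_doc_id, score)` pairs per test document); `filter_matches` and the cut-off `min_score=0.7` are hypothetical and untuned, and the input values are toy data.

```python
def filter_matches(similar, min_score=0.7):
    """Keep, per test document, only matches scoring at least `min_score`."""
    recommendations = {}
    for test_doc_id, matches in enumerate(similar):
        kept = [(train_id, score) for train_id, score in matches
                if score >= min_score]
        if kept:
            recommendations[test_doc_id] = kept
    return recommendations

# Toy input in the same shape as `similar` (values are illustrative only)
toy_similar = [
    [(12, 0.73), (40, 0.71), (7, 0.55)],
    [(3, 0.66), (12, 0.64), (9, 0.61)],
    [(40, 0.81), (5, 0.69), (1, 0.42)],
]

recommendations = filter_matches(toy_similar)
for test_doc_id, matches in recommendations.items():
    print(test_doc_id, matches)
```

Test documents without any match above the threshold are simply left out, which avoids recommending collaborators on the basis of a poor match.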


Let's select the top 2 test abstracts that have the best match in `train_corpus`.

In [29]:
top_n = 2
preview_n = 3

for test_doc_id in most_similar.test_doc_id[:top_n]:
    print(f"# Abstract {test_doc_id}:")
    words = test_corpus[test_doc_id].words
    print(f"content: {words}")
    print("\n")
    
    for i in range(preview_n):
        train_doc_id = similar[test_doc_id][i][0]
        similarity_score = similar[test_doc_id][i][1]
        print(f"\t ## Matched abstract {i}: id {train_doc_id}, similarity {similarity_score}")
        words = train_corpus[train_doc_id].words
        print(f"\t {words}")
        print("\n")
    print("\n\n")

# Abstract 253:
content: Rho proteins and cancer. The Rho family of GTPases has been intensively studied for their roles in signal transduction processes leading to cytoskeletal-dependent responses, including cell migration and phagocytosis. In addition, they are important regulators of cell cycle progression and affect the expression of a number of genes, including those for matrix-degrading proteases implicated in cancer invasion. So far, the expression of some Rho family members has been found to be increased in some human cancers, and some cancer-associated mutations in Rho family regulators have been characterized. This makes Rho protein signalling pathways attractive targets for cancer therapy. However, there is little evidence so far from animal studies to define if and how Rho proteins contribute to cancer cell proliferation, survival, invasion and metastasis.


	 ## Matched abstract 0: id 8185, similarity 0.7292169332504272
	 Regulation of endocytic traffic by Rho GTPases. The

Feeling lucky with some random text?

In [30]:
text = "Searching for the causal effects of body mass index in over 300 000 participants in UK Biobank, using Mendelian randomization."

def feeling_lucky(text, model=d2v_model, train_corpus=train_corpus):
    inferred_vector = model.infer_vector(text, steps=20)
    top_sim = model.docvecs.most_similar(
        [inferred_vector], topn = 1)
    matched_abstract = train_corpus[top_sim[0][0]].words
    similarity_score = top_sim[0][1]
    return matched_abstract, similarity_score

feeling_lucky(text)

('Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. We set out a generalized linear model framework for the synthesis of data from randomized controlled trials. A common model is described, taking the form of a linear regression for both fixed and random effects synthesis, which can be implemented with normal, binomial, Poisson, and multinomial data. The familiar logistic model for meta-analysis with binomial data is a generalized linear model with a logit link function, which is appropriate for probability outcomes. The same linear regression framework can be applied to continuous outcomes, rate models, competing risks, or ordered category outcomes by using other link functions, such as identity, log, complementary log-log, and probit link functions. The common core model for the linear predictor can be applied to pairwise meta-analysis, indirect comparisons, synthesis of multiarm t

Note that this is a very crude demo (small training set, little pre-processing, and no model tuning).

Below is an overview of the top similarity scores across the test set.

In [17]:
most_similar

Unnamed: 0,score,test_doc_id,train_doc_id
256,0.687133,256,7012
154,0.681590,154,6335
138,0.679621,138,4084
260,0.678780,260,2168
300,0.676889,300,8202
403,0.671718,403,5784
293,0.665003,293,8185
368,0.661996,368,5598
440,0.656170,440,1988
286,0.653303,286,8185
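
To look at the scores as a distribution rather than a table, a quick summary and text histogram can help. The sketch below uses only the ten scores displayed above; in the notebook one would presumably pass the full `most_similar.score` column instead. The bin width of 0.01 is an arbitrary choice.

```python
import statistics
from collections import Counter

# The ten scores shown in the table above
scores = [0.687133, 0.681590, 0.679621, 0.678780, 0.676889,
          0.671718, 0.665003, 0.661996, 0.656170, 0.653303]

summary = {
    "n": len(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": statistics.mean(scores),
    "stdev": statistics.stdev(scores),
}
print(summary)

# Text histogram with bins of width 0.01 (keyed by the lower bound * 100)
bins = Counter(int(score * 100) for score in scores)
for lower in sorted(bins):
    print(f"{lower / 100:.2f}-{(lower + 1) / 100:.2f}: {'#' * bins[lower]}")
```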


## Useful sources

- [introduction to doc2vec](https://blog.acolyer.org/2016/06/01/distributed-representations-of-sentences-and-documents/)
- [sentiment analysis](http://linanqiu.github.io/2015/10/07/word2vec-sentiment/)
- [text clusterization using Python and doc2vec](https://medium.com/@ermolushka/text-clusterization-using-python-and-doc2vec-8c499668fa61)
- [gensim doc2vec-lee notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)