# Introduction

As the background corpus for building the GloVe embeddings, I used the complete PubMed Central® (PMC) Open Access non-commercial subset. The files were downloaded on July 30, 2019. See the [PMC FTP](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) and [PMC About](https://www.ncbi.nlm.nih.gov/pmc/about/intro/) pages for information on accessing the data over FTP.

(GloVe: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.)

Here are some statistics about the downloaded articles:

* 958,634 articles (one .nxml file per article).
* The XML files total about 11.1 GB compressed and 51.5 GB uncompressed (!).
* I use the XML format so that tables can be excluded: collocation data from tables isn't very meaningful, especially when the tables are numeric.
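To illustrate the table exclusion mentioned above, here is a minimal sketch using only the standard library. It assumes that tables in the `.nxml` files are wrapped in `<table-wrap>` elements (the JATS convention); the function name `text_without_tables` and the sample string are hypothetical, and this is not necessarily how `corpus.py` does it. Real PMC files may also need extra handling for entities and namespaces.

```python
import xml.etree.ElementTree as ET


def text_without_tables(xml_string):
    """Return the article text with every <table-wrap> subtree dropped.

    Assumption: tables are marked up as <table-wrap> elements.
    """
    root = ET.fromstring(xml_string)
    # ElementTree has no parent pointers, so visit each element and
    # remove matching children from it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "table-wrap":
                continue
            # Text following the table lives in child.tail; keep it by
            # moving it onto the previous sibling or the parent.
            tail = child.tail or ""
            index = list(parent).index(child)
            if index > 0:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            else:
                parent.text = (parent.text or "") + tail
            parent.remove(child)
    # Join text fragments with spaces and collapse runs of whitespace.
    return " ".join(" ".join(root.itertext()).split())


sample = (
    "<article><body><p>Aspirin reduces fever.</p>"
    "<table-wrap><table><tr><td>1.2</td></tr></table></table-wrap>"
    "<p>Dose matters.</p></body></article>"
)
print(text_without_tables(sample))
```

Joining fragments with a space is a simplification: it keeps paragraphs apart but also splits words that contain inline markup (for example subscripts).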

# Preprocessing the Corpus

I have included a simple Python module, `corpus.py`, in this repository that simplifies iterating over the entire corpus. At each step, it returns a list of tokens from the next article, with punctuation removed.
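As a rough idea of what "a list of tokens with punctuation removed" can look like, here is a hypothetical tokenizer. The regular expression, the lowercasing, and the choice to keep internal hyphens are all assumptions made for illustration; the actual rules in `corpus.py` may differ.

```python
import re

# Assumption: a token is a run of letters/digits, optionally joined by
# internal hyphens (so "IL-6" stays one token). Everything else is dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase tokens, discarding punctuation."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


print(tokenize("IL-6 levels rose (p < 0.05)."))
```

Note that this simple pattern splits decimal numbers at the point, which may or may not be desirable for a biomedical corpus.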

Each article's token list is written to the file `corpus.txt` as a space-delimited string, with documents separated by a newline character:

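```python
from corpus import DocumentCorpus

# Root directory of the extracted PMC .nxml files
PMC = r'/media/ryan/ExtraDrive1/PMC/XML/'

docs = DocumentCorpus(root=PMC)
with open('corpus.txt', 'w') as output:
    for doc in docs:
        # One article per line: tokens joined by single spaces
        text = ' '.join(doc)
        output.write(text + "\n")  # newline marks the end of the article
```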
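Before training, it can be worth checking that `corpus.txt` has the expected shape (one article per line, tokens separated by spaces). The helper below is a small sketch for that purpose; `corpus_stats` is a hypothetical name and not part of `corpus.py`.

```python
from collections import Counter


def corpus_stats(path):
    """Count non-empty documents (lines) and tokens in a corpus file."""
    n_docs = 0
    counts = Counter()
    with open(path, encoding="utf-8") as corpus_file:
        for line in corpus_file:
            tokens = line.split()
            if tokens:  # skip blank lines
                n_docs += 1
                counts.update(tokens)
    return n_docs, sum(counts.values()), counts


# Example usage (assumes corpus.txt exists in the working directory):
# n_docs, n_tokens, counts = corpus_stats("corpus.txt")
# print(n_docs, n_tokens, counts.most_common(10))
```

The document count should be close to the number of articles in the corpus; a much smaller number suggests that articles were skipped or written without their trailing newline.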

# Building the GloVe model

The first step is to download and build GloVe from the [StanfordNLP GitHub](https://github.com/stanfordnlp/GloVe) repository. Note that I was only able to run the Python evaluation script using Python 2.7 (with numpy).

Here is a simple shell script that takes the text corpus generated in the previous step and outputs a GloVe model. I mostly use the GloVe defaults:

```bash
#!/bin/bash
# Example using mostly GloVe defaults, with a few exceptions

MODEL_DIR=/home/ryan/Development/deep-learn-bio-nlp
GLOVE_DIR=/home/ryan/Development/GloVe-1.2/build
MAX_VOCAB=100000
MODEL_NAME=vectors_100K
GLOVE_ITER=7
# Vector dimensionality passed to -vector-size
OUTPUT_DIM=256

# 1. Build the vocabulary: at most MAX_VOCAB words, each seen at least 10 times
"$GLOVE_DIR/vocab_count" -max-vocab $MAX_VOCAB -min-count 10 < "$MODEL_DIR/corpus.txt" > "$MODEL_DIR/vocab.txt"

# 2. Count word co-occurrences within a 10-token window
"$GLOVE_DIR/cooccur" -window-size 10 -vocab-file "$MODEL_DIR/vocab.txt" < "$MODEL_DIR/corpus.txt" > "$MODEL_DIR/cooccurrences.bin"

# 3. Shuffle the co-occurrence records before training
"$GLOVE_DIR/shuffle" -verbose 0 < "$MODEL_DIR/cooccurrences.bin" > "$MODEL_DIR/cooccurrences.shuf.bin"

# 4. Train the vectors; -binary 2 saves both text and binary output
"$GLOVE_DIR/glove" -iter $GLOVE_ITER -binary 2 -vector-size $OUTPUT_DIM -input-file "$MODEL_DIR/cooccurrences.shuf.bin" -vocab-file "$MODEL_DIR/vocab.txt" -save-file $MODEL_NAME

```
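Because the script passes `-binary 2`, GloVe writes the vectors as a text file alongside the binary one (here `vectors_100K.txt`, one word per line followed by its components). The following is a minimal, dependency-free sketch for loading that text file and comparing two words; the function names are hypothetical, and for the full 100K vocabulary a numpy array would be more memory-efficient than Python lists.

```python
import math


def load_glove_text(path):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as vector_file:
        for line in vector_file:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Example usage (assumes the training script above has been run and that
# both words made it into the vocabulary):
# vectors = load_glove_text("vectors_100K.txt")
# print(cosine(vectors["protein"], vectors["gene"]))
```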