Here are the scripts, code and vectors for the ACL BioNLP 2016 workshop paper:

Chiu et al. How to Train Good Word Embeddings for Biomedical NLP

API Package

word2vec: the original word2vec implementation by Mikolov
wvlib: library for reading word2vec files
geniass: library for segmenting biomedical text
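
For readers who want to inspect a word2vec.bin file without installing anything, the binary layout can be read with the Python standard library alone. The sketch below is not wvlib's API; the function name `read_word2vec_bin` is hypothetical, and it assumes the layout written by the original word2vec tool (a text header with vocabulary size and dimension, then each word followed by a space and its float32 values) with little-endian floats.

```python
import struct


def read_word2vec_bin(stream):
    """Read word vectors from a binary stream in word2vec .bin layout.

    Assumes little-endian float32 values; returns a dict mapping each
    word to a tuple of floats.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            # The original tool writes a newline after each vector,
            # so it shows up in front of the next word; skip it.
            if ch != b"\n":
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        data = stream.read(4 * dim)
        if len(data) < 4 * dim:
            raise ValueError("file ended before all vectors were read")
        vectors[word] = struct.unpack("<%df" % dim, data)
    return vectors
```

To use it on a file, open the file in binary mode (`open(path, "rb")`) and pass the file object. The whole vocabulary is held in memory, so this is only practical for small files or quick checks.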

Scripts

Segment and tokenize input text (e.g. raw PubMed or PMC text)
Create lowercased and sentence-shuffled text (input: tokenized text)
Create word2vec.bin files with different parameters
Run intrinsic evaluation on the UMNSRS and Mayo datasets (input: directory containing the vectors to test)
Run extrinsic evaluation
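
The lowercasing and sentence-shuffling step can be sketched in a few lines. This is a minimal illustration, not the script shipped here: the function name `lowercase_and_shuffle` and the `seed` parameter are hypothetical, and the input is assumed to be tokenized text with one sentence per item.

```python
import random


def lowercase_and_shuffle(sentences, seed=0):
    """Return the sentences lowercased and in shuffled order.

    A fixed seed makes the shuffled order reproducible across runs.
    """
    lowered = [sentence.lower() for sentence in sentences]
    rng = random.Random(seed)
    rng.shuffle(lowered)
    return lowered
```

For a corpus stored one sentence per line, the lines of the file can be passed in directly and the result written back out line by line. This version loads everything into memory, which may not suit a full PubMed or PMC dump.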


Pre-processing: tokenize text (requires NLTK)
geniass: segment sentences

Intrinsic evaluation: perform intrinsic evaluation
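
As a rough idea of what an intrinsic evaluation on word-pair datasets such as UMNSRS and Mayo involves, the sketch below scores each rated pair by the cosine similarity of its two vectors and reports Spearman's rank correlation with the human ratings. It assumes that setup rather than reproducing the script in this repository; the function names and the `(word1, word2, score)` input layout are hypothetical, and pairs with an out-of-vocabulary word are simply skipped.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """Return (rho, number of pairs used).

    vectors: dict mapping word -> sequence of floats.
    rated_pairs: iterable of (word1, word2, human_score).
    """
    model_scores, human_scores = [], []
    for word1, word2, score in rated_pairs:
        if word1 in vectors and word2 in vectors:
            model_scores.append(cosine(vectors[word1], vectors[word2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores), len(human_scores)
```

Reporting the number of pairs actually used alongside the correlation matters when comparing vector sets, since vocabularies of different sizes cover different portions of the dataset.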

Extrinsic evaluation (Keras folder; requires either TensorFlow or Theano to be installed): a simple feed-forward neural network, plus the parameters for the neural network

Word vectors


All data on this page is made available under the Creative Commons Attribution (CC BY) license

