BioNLP-2016

Here are the scripts, code and vectors for the ACL BioNLP 2016 workshop paper:

Chiu et al. How to Train Good Word Embeddings for Biomedical NLP

API Package

word2vec: the original word2vec implementation from Mikolov: https://code.google.com/archive/p/word2vec/
wvlib: library for reading word2vec files: https://github.com/spyysalo/wvlib
geniass: library for sentence segmentation of biomedical text: http://www.nactem.ac.uk/y-matsu/geniass/
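
For orientation, the word2vec binary format that these packages write and read can be parsed with the Python standard library alone. The sketch below is not wvlib's implementation; the function name `read_word2vec_bin` is hypothetical, and it assumes the usual layout of a text header (`vocab_size dim`) followed by entries made of a space-terminated word and `dim` little-endian 32-bit floats.

```python
import struct


def read_word2vec_bin(stream):
    """Read a word2vec binary stream into a dict of word -> list of floats.

    Hypothetical helper for illustration; assumes little-endian float32 values.
    """
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(field) for field in header.split())
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that may follow the previous entry
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        raw = stream.read(4 * dim)
        vectors[word] = list(struct.unpack("<%df" % dim, raw))
    return vectors
```

A file would be opened in binary mode and passed in directly, for example `read_word2vec_bin(open(path, "rb"))`, where `path` is a placeholder. For real work wvlib is the better choice, since it handles large vocabularies and the text format as well.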

Scripts

pre-process.sh: segments and tokenizes input text (e.g. raw PubMed or PMC text)
create_shf_low_text.sh: creates lowercased and sentence-shuffled text (input: tokenized text)
createModel.sh: creates word2vec.bin files with different parameters
intrinsicEva.sh: runs intrinsic evaluation on the UMNSRS and Mayo datasets (input: directory of vectors to test)
ExtrinsicEva.sh: runs extrinsic evaluation
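
The lowercasing and sentence-shuffling step can be pictured with a few lines of Python. This is only a sketch of the transformation, not the contents of create_shf_low_text.sh; the function name and the `seed` parameter are assumptions, and the input is assumed to hold one tokenized sentence per string.

```python
import random


def lowercase_and_shuffle(sentences, seed=0):
    """Return the sentences lowercased and in a shuffled order.

    Hypothetical helper: a fixed seed makes the shuffle reproducible.
    """
    lowered = [sentence.lower() for sentence in sentences]
    rng = random.Random(seed)
    rng.shuffle(lowered)
    return lowered


if __name__ == "__main__":
    corpus = ["The BRCA1 gene is expressed .", "Patients received Aspirin daily ."]
    for line in lowercase_and_shuffle(corpus):
        print(line)
```

This in-memory version suits small samples only. A corpus the size of PubMed would more likely be shuffled with a streaming tool.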

Code

Pre-processing:
tokenize_text.py: tokenizes text (requires NLTK)
geniass: segments text into sentences

Intrinsic evaluation:
evaluate.py: performs intrinsic evaluation
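
Intrinsic evaluation on word-pair datasets such as UMNSRS is commonly scored as the Spearman rank correlation between the cosine similarity of two word vectors and the human rating for that pair. The sketch below assumes that protocol and is not taken from evaluate.py; all function names, and the choice to skip out-of-vocabulary pairs, are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_pairs(vectors, gold_pairs):
    """Score (word1, word2, rating) triples; pairs with unknown words are skipped.

    Returns the Spearman correlation and the number of pairs actually scored.
    """
    predicted, gold = [], []
    for word1, word2, rating in gold_pairs:
        if word1 in vectors and word2 in vectors:
            predicted.append(cosine(vectors[word1], vectors[word2]))
            gold.append(rating)
    return spearman(predicted, gold), len(gold)
```

Reporting the number of scored pairs alongside the correlation matters, because vector sets with different vocabularies may cover different subsets of the dataset.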

Extrinsic evaluation (Keras folder; requires either TensorFlow or Theano):
mlp.py: simple feed-forward neural network
setting.py: parameters for the neural network
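
To make the idea of a feed-forward network over word vectors concrete, the following sketch computes one forward pass in plain Python. It does not reproduce mlp.py or the settings in setting.py: the single tanh hidden layer, the concatenated window of embeddings as input, and every name below are assumptions for illustration.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(window, embeddings, w_hidden, b_hidden, w_out, b_out):
    """Concatenate the embeddings of the window words, then apply
    a tanh hidden layer and a softmax output layer."""
    x = [value for word in window for value in embeddings[word]]
    hidden = [math.tanh(v) for v in dense(x, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))
```

In a Keras model the same structure would correspond to stacked dense layers, with the pre-trained vectors supplying the input representation.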

Word vectors

https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8
