# Deep learning in biomedical NLP

### Background
Named entity recognition (NER) is an important technique that promises to improve information classification and retrieval in biomedical natural language processing (NLP). However, existing approaches rely primarily on either laborious manual curation or feature engineering. Here we apply deep learning techniques from NLP and repurpose the large number of entity/free-text pairs available in the Sequence Read Archive (SRA) to train a scalable NER model.


### Notebooks

|Code|Usage|
|:--------------:|------:|
|downloadFromPMC.ipynb|Download the PubMed text|
|train_pmc_word2vec.ipynb|Train a word2vec model on the PubMed text|
|keras_on_sra_data.ipynb|Train an entity recognition model using SRA metadata|

|Data| Usage|
|:--------------:|------:|
|https://www.synapse.org/#!Synapse:syn11421651|All SRS annotations|
|https://www.synapse.org/#!Synapse:syn11421649|All SRX annotations|
|ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz|PubMed ID conversions|

### Dependencies
If you have Anaconda, install the required packages from the command line:



    pip install keras gensim nltk spacy tensorflow
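
After installing, it can be handy to confirm that the packages are visible to the current Python environment. The snippet below is a minimal sketch using only the standard library. The `missing_packages` helper is a hypothetical name, not part of this repository, and it only checks that each package can be located, not that its version is compatible.

```python
from importlib.util import find_spec

# Package names taken from the pip command above.
REQUIRED = ("keras", "gensim", "nltk", "spacy", "tensorflow")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    # find_spec looks a top-level package up without importing it.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All packages found.")
```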




### License
This work is released under a Creative Commons Attribution license. It is currently unpublished; please attribute it by citing the GitHub page.


automatically update the notebook with data

# Scratch
Please ignore everything below; it is kept only for my convenience.

In [6]:
!jupyter nbconvert --to markdown README.ipynb


[NbConvertApp] Converting notebook README.ipynb to markdown
[NbConvertApp] Writing 2486 bytes to README.md


In [24]:
!ls 

Data			      model
NCBI_harmonized_names.ipynb   nGramClassification_simple.ipynb
NCIT_parsing.ipynb	      pmc_word2_vec
README.ipynb		      pubmed
README.md		      read_embedding_matrix.ipynb
Results			      read_word_vector_matrix.ipynb
Thesaurus.txt		      semantic_count.csv
Untitled.ipynb		      testPhraseMatcher.ipynb
Untitled9.ipynb		      tmp.tsv
analyzeBioNLPEmbedding.ipynb  tmp.txt
analyzeSRAEntities.ipynb      tmpResults
downloadFromPMC.ipynb	      tmpResults.xlsx
downloadPubmed.ipynb	      track_word_count_and_embedding_size.ipynb
download_wordvectors.ipynb    train_pmc_word2vec.ipynb
keras_on_sra_data.ipynb       train_pmc_word2vec.py
keras_on_sra_data_old.ipynb   wikipedia-pubmed-and-PMC-w2v
merge.pmc		      wikipedia-pubmed-and-PMC-w2v.bin
mergeEntities.ipynb


In [1]:
#README.ipynb README.md keras_on_sra_data.ipynb
!git add keras_on_sra_data.ipynb nGramClassification_simple.ipynb

In [2]:
!git commit -m "updated n gram classification to use max"

[master d4f1614] updated n gram classification to use max
 2 files changed, 505 insertions(+), 120 deletions(-)
 create mode 100644 nGramClassification_simple.ipynb


In [3]:
!git push 

Counting objects: 7, done.
Delta compression using up to 96 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 25.31 KiB | 0 bytes/s, done.
Total 7 (delta 3), reused 0 

### Status
Retraining the word2vec models.

Currently training word2vec on the entire PMC corpus; see train_pmc_word2vec.ipynb.
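
For orientation, the training step might look roughly like the sketch below. It is not the code in train_pmc_word2vec.ipynb: the `CorpusSentences` and `train_word2vec` names, the one-sentence-per-line `.txt` layout, the tokenizer, and all hyperparameters are assumptions made for illustration. The `Word2Vec` call assumes gensim 4.0 or later, where the dimensionality argument is named `vector_size`.

```python
import re
from pathlib import Path

# Simple tokenizer (an assumption): alphanumeric runs, keeping inner hyphens and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


class CorpusSentences:
    """Re-iterable stream of tokenized lines from every .txt file under a directory.

    word2vec makes several passes over the corpus, so this is a class with
    __iter__ rather than a one-shot generator.
    """

    def __init__(self, corpus_dir):
        self.corpus_dir = Path(corpus_dir)

    def __iter__(self):
        for path in sorted(self.corpus_dir.rglob("*.txt")):
            with path.open(encoding="utf-8", errors="ignore") as handle:
                for line in handle:
                    tokens = TOKEN_RE.findall(line.lower())
                    if tokens:  # skip blank lines
                        yield tokens


def train_word2vec(corpus_dir, model_path, vector_size=200):
    """Train and save a word2vec model (hyperparameters are illustrative only)."""
    # Imported lazily so the sentence stream can be used without gensim installed.
    from gensim.models import Word2Vec  # assumes gensim >= 4.0

    model = Word2Vec(
        sentences=CorpusSentences(corpus_dir),
        vector_size=vector_size,
        window=5,
        min_count=5,
        workers=4,
    )
    model.save(str(model_path))
    return model
```

Streaming the corpus from disk keeps memory use flat, which matters when the input is the whole of PMC rather than a sample.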


In [None]:
%%bash
cd ./Data/
#wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

In [22]:
!gunzip -c ./Data/PMC-ids.csv.gz | head

Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date
Breast Cancer Res,1465-5411,1465-542X,2000,3,1,55,,PMC13900,11250746,,live
Breast Cancer Res,1465-5411,1465-542X,2000,3,1,61,,PMC13901,11250747,,live
Breast Cancer Res,1465-5411,1465-542X,2000,3,1,66,,PMC13902,11250748,,live
Breast Cancer Res,1465-5411,1465-542X,1999,2,1,59,10.1186/bcr29,PMC13911,11056684,,live
Breast Cancer Res,1465-5411,1465-542X,1999,2,1,64,,PMC13912,11400682,,live
Breast Cancer Res,1465-5411,1465-542X,1999,1,1,73,10.1186/bcr16,PMC13913,11056681,,live
Breast Cancer Res,1465-5411,1465-542X,1999,1,1,81,10.1186/bcr17,PMC13914,11056682,,live
Breast Cancer Res,1465-5411,1465-542X,1999,1,1,88,10.1186/bcr18,PMC13915,11056683,,live
Breast Cancer Res,1465-5411,1465-542X,2000,2,2,139,10.1186/bcr45,PMC13916,11056686,,live

gzip: stdout: Broken pipe
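
The column names in the header above are enough to build a PMCID-to-PMID lookup. The following is a minimal sketch, assuming the file sits at `./Data/PMC-ids.csv.gz` as in the cell above. `load_pmcid_to_pmid` is a hypothetical helper name, and rows with an empty PMID are skipped.

```python
import csv
import gzip


def load_pmcid_to_pmid(path="./Data/PMC-ids.csv.gz"):
    """Return a dict mapping PMCID (e.g. 'PMC13900') to PMID as strings."""
    mapping = {}
    # Text mode with newline="" is what the csv module expects.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            pmcid, pmid = row["PMCID"], row["PMID"]
            if pmcid and pmid:  # some articles have no PubMed ID
                mapping[pmcid] = pmid
    return mapping
```

With the rows shown above, `load_pmcid_to_pmid()["PMC13900"]` would return `"11250746"`. Reading through `gzip.open` avoids decompressing the file to disk first.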
