# Chemical-Disease Relation (CDR) Tutorial

In this example, we'll write an application to extract *mentions of* **chemical-induced-disease relationships** from PubMed abstracts, as per the [BioCreative CDR Challenge](http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/). This tutorial shows off some of Snorkel's more advanced features, so we assume you've already worked through the Intro tutorial.

## Part 0: Initial Prep

In your shell, download the raw data by running:
```bash
cd tutorials/cdr
./download_data.sh
```

If you've previously run this tutorial using SQLite, you can delete the old database by running the following in the same directory as above:
```bash
rm snorkel.db
```

## Part I: Corpus Preprocessing

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()

In [2]:
import os
from snorkel.parser import XMLMultiDocPreprocessor

# The following line is for testing only. Feel free to ignore it.
file_path = 'data/CDR.BioC.small.xml' if 'CI' in os.environ else 'data/CDR.BioC.xml'

doc_preprocessor = XMLMultiDocPreprocessor(
    path=file_path,
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()'
)
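To make the three queries passed to `XMLMultiDocPreprocessor` more concrete, here is a minimal standalone sketch of what they select. It is not Snorkel's implementation: it uses the standard library's `xml.etree.ElementTree`, which does not support the `text()` step, so the sketch reads each element's `.text` attribute instead. The sample XML is hand-written in a BioC-like layout and is not taken from the CDR corpus, and joining passages with a space is a choice made only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in a BioC-like layout (not real corpus data).
SAMPLE = """
<collection>
  <document>
    <id>doc-1</id>
    <passage><text>Example title one.</text></passage>
    <passage><text>Example abstract one.</text></passage>
  </document>
  <document>
    <id>doc-2</id>
    <passage><text>Example title two.</text></passage>
  </document>
</collection>
"""


def split_documents(xml_string):
    """Return one (doc_id, text) pair per <document> element."""
    root = ET.fromstring(xml_string)
    pairs = []
    # Analogue of doc='.//document': one match per document in the file.
    for doc in root.findall('.//document'):
        # Analogue of id='.//id/text()': the text content of the <id> element.
        doc_id = doc.find('.//id').text
        # Analogue of text='.//passage/text/text()': text of every passage.
        passages = [node.text for node in doc.findall('.//passage/text')]
        # Joining with a space is an assumption made for this sketch.
        pairs.append((doc_id, ' '.join(passages)))
    return pairs


for doc_id, text in split_documents(SAMPLE):
    print(doc_id, '->', text)
```

Each `<document>` element yields one ID and the combined text of its passages, which is why a single XML file can produce many documents.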

In [3]:
from snorkel.parser import CorpusParser
from utils import TaggerOneTagger

tagger_one = TaggerOneTagger()
corpus_parser = CorpusParser(fn=tagger_one.tag)
corpus_parser.apply(list(doc_preprocessor))

Clearing existing...
Running UDF...
100%|██████████| 1500/1500 [00:36<00:00, 41.00it/s]


In [4]:
from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 1500
Sentences: 14593


## Part II: Candidate Extraction

With the TaggerOne entity tags, candidate extraction is pretty easy! We first split the documents into preset training, development, and test sets (splits 0, 1, and 2, respectively). Then we use `CrossSentencePretaggedCandidateExtractor` to extract candidates from the TaggerOne entity tags.

In [5]:
from snorkel.models import Candidate, candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

In [6]:
from six.moves.cPickle import load

with open('data/doc_ids.pkl', 'rb') as f:
    train_ids, dev_ids, test_ids = load(f)
train_ids, dev_ids, test_ids = set(train_ids), set(dev_ids), set(test_ids)

# Each of these lists holds one list of sentences per document
train_sents, dev_sents, test_sents = [], [], []
docs = session.query(Document).order_by(Document.name).all()
for doc in docs:
    if doc.name in train_ids:
        train_sents.append(list(doc.sentences))
    elif doc.name in dev_ids:
        dev_sents.append(list(doc.sentences))
    elif doc.name in test_ids:
        test_sents.append(list(doc.sentences))
    else:
        raise Exception('ID <{0}> not found in any id set'.format(doc.name))
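Because the `if`/`elif` chain above assigns each document to the first set that contains its name, a document ID listed in two sets would silently land in only one split. The following standalone sketch shows one way to check for that before splitting. The helper `find_overlaps` and the example IDs are hypothetical and exist only for illustration; in the tutorial you would pass `train_ids`, `dev_ids`, and `test_ids` as loaded from `data/doc_ids.pkl`.

```python
from itertools import combinations


def find_overlaps(named_id_sets):
    """Return {(name_a, name_b): shared_ids} for every pair of overlapping sets."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(named_id_sets.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical IDs for illustration only.
example = {
    'train': {'101', '102', '103'},
    'dev': {'201', '202'},
    'test': {'301', '102'},  # '102' is deliberately duplicated
}
print(find_overlaps(example))
```

An empty result means the three sets are pairwise disjoint, so every document belongs to exactly one split.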

In [7]:
from snorkel.candidates import CrossSentencePretaggedCandidateExtractor

candidate_extractor = CrossSentencePretaggedCandidateExtractor(ChemicalDisease, ['Chemical', 'Disease'])
for k, sents in enumerate([train_sents, dev_sents, test_sents]):
    candidate_extractor.apply(sents, window_size=2, split=k)
    print("Number of candidates:", session.query(ChemicalDisease).filter(ChemicalDisease.split == k).count())

Clearing existing...
Running UDF...
100%|██████████| 900/900 [00:47<00:00, 19.01it/s]

Number of candidates: 21312
Clearing existing...
Running UDF...
100%|██████████| 100/100 [00:05<00:00, 19.13it/s]

Number of candidates: 2486
Clearing existing...
Running UDF...
100%|██████████| 500/500 [00:29<00:00, 16.68it/s]

Number of candidates: 11975
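As a rough intuition for cross-sentence candidate extraction from pretagged entities, the sketch below pairs every chemical mention with every disease mention in a document, provided their sentences are close enough together. This is not Snorkel's implementation. In particular, it assumes that `window_size` bounds the distance in sentences between the two mentions; the actual semantics of that parameter in `CrossSentencePretaggedCandidateExtractor` may differ. The function `pair_mentions` and the placeholder mentions are hypothetical.

```python
from itertools import product


def pair_mentions(mentions, window_size):
    """Pair chemical and disease mentions within one document.

    mentions: list of (sentence_index, entity_type, text) tuples.
    Returns (chemical_text, disease_text) pairs whose sentence indices
    differ by at most window_size (an assumed interpretation).
    """
    chemicals = [m for m in mentions if m[1] == 'Chemical']
    diseases = [m for m in mentions if m[1] == 'Disease']
    return [
        (chem[2], dis[2])
        for chem, dis in product(chemicals, diseases)
        if abs(chem[0] - dis[0]) <= window_size
    ]


# Placeholder mentions for a single hypothetical document.
doc_mentions = [
    (0, 'Chemical', 'drug A'),
    (1, 'Disease', 'condition X'),
    (4, 'Disease', 'condition Y'),
]
print(pair_mentions(doc_mentions, window_size=2))
```

Under this assumed interpretation, a larger window yields more candidates per document, which would trade recall for a noisier candidate set.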



