# Biomedical NLP Preprocessing

I looked at some existing biomedical Python libraries. None of the following could read the BC5CDR PubTator source files without error, because of the relation annotations:
- PubTatorCorpusReader
- PubTatorCorpus
- kindred
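
For reference, relation lines can be told apart from mention lines by their field count. Below is a minimal sketch of a tolerant reader, assuming the usual PubTator layout: `PMID|t|title`, `PMID|a|abstract`, tab-separated mention lines, and four-field relation lines. `parse_pubtator` is a hypothetical helper, not part of any library listed above, and the sample's document id, abstract excerpt, and relation line are placeholders made up to exercise the parser.

```python
import re

# Matches "PMID|t|title text" and "PMID|a|abstract text" lines.
TEXT_LINE = re.compile(r"^(\d+)\|([ta])\|(.*)$")


def parse_pubtator(text):
    """Parse PubTator-style text into a list of document dicts (sketch)."""
    docs = []
    # Documents are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        doc = {"id": None, "title": "", "abstract": "",
               "entities": [], "relations": []}
        for line in block.splitlines():
            match = TEXT_LINE.match(line)
            if match:
                doc["id"] = match.group(1)
                key = "title" if match.group(2) == "t" else "abstract"
                doc[key] = match.group(3)
                continue
            fields = line.split("\t")
            if len(fields) == 4:
                # Relation line: PMID, relation type, argument 1, argument 2.
                _, rel_type, arg1, arg2 = fields
                doc["relations"].append(
                    {"type": rel_type, "arg1": arg1, "arg2": arg2})
            elif len(fields) >= 5:
                # Mention line: PMID, start, end, text, type, [concept id].
                doc["entities"].append({
                    "start": int(fields[1]),
                    "end": int(fields[2]),
                    "text": fields[3],
                    "type": fields[4],
                    "concept_id": fields[5] if len(fields) > 5 else None,
                })
        docs.append(doc)
    return docs


# Toy input: the id "0" and the relation line are invented placeholders.
sample = (
    "0|t|Naloxone reverses the antihypertensive effect of clonidine.\n"
    "0|a|Naloxone alone did not affect either blood pressure or heart rate.\n"
    "0\t0\t8\tNaloxone\tChemical\tD009270\n"
    "0\t49\t58\tclonidine\tChemical\tD003000\n"
    "0\tCID\tD003000\tD000000\n"
)
docs = parse_pubtator(sample)
print(docs[0]["entities"])
print(docs[0]["relations"])
```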

Support for BioC XML seems better.

HuggingFace Datasets is flexible about how records are represented. I prefer to preserve document/abstract information when generating tagging datasets, so a record format like this might make sense:

```
{
"id": "0",
"sentences": [
    {"text": "xxx",
     "entities": [...],
     "tokens": [...],
     "ner_tags": [...]}
],
"relations": []
}
```
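
As a rough illustration, one such record could be assembled and sanity-checked in plain Python before it is handed to any dataset loader. `make_record` is a hypothetical helper. The tokenization, the BIO tag scheme, and the shape of the `entities` entries are assumptions made for this sketch; only the title text and the two mention offsets come from the corpus.

```python
import json


def make_record(doc_id, sentences, relations=()):
    """Bundle per-sentence tagging data into one document-level record."""
    for i, sent in enumerate(sentences):
        # Every token needs exactly one tag.
        if len(sent["tokens"]) != len(sent["ner_tags"]):
            raise ValueError(
                f"sentence {i}: tokens and ner_tags differ in length")
    return {
        "id": str(doc_id),
        "sentences": list(sentences),
        "relations": list(relations),
    }


# Hand-written tokens and tags for the title passage (assumed, not model output).
title = {
    "text": "Naloxone reverses the antihypertensive effect of clonidine.",
    "entities": [
        {"text": "Naloxone", "type": "Chemical", "offset": 0, "length": 8},
        {"text": "clonidine", "type": "Chemical", "offset": 49, "length": 9},
    ],
    "tokens": ["Naloxone", "reverses", "the", "antihypertensive",
               "effect", "of", "clonidine", "."],
    "ner_tags": ["B-Chemical", "O", "O", "O", "O", "O", "B-Chemical", "O"],
}

record = make_record(0, [title])
serialized = json.dumps(record)  # the record is plain JSON-serializable data
```

Keeping the sentences nested under a document id means the document can be reconstructed later, at the cost of flattening the records when a model expects one sentence per example.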

In [4]:
import bioc

fname = 'CDR.Corpus.v010516/CDR_TrainingSet.BioC.xml'

reader = bioc.BioCXMLDocumentReader(fname)
collection_info = reader.get_collection_info()
for doc in reader:
    for item in doc.passages:
        print(item.text)
        for anno in item.annotations:
            print(anno)
        print('---')
    print(doc.relations)
    break

Naloxone reverses the antihypertensive effect of clonidine.
BioCAnnotation[id=0,text='Naloxone',infons=[type=Chemical,MESH=D009270],locations=[BioCLocation[offset=0,length=8]],]
BioCAnnotation[id=1,text='clonidine',infons=[type=Chemical,MESH=D003000],locations=[BioCLocation[offset=49,length=9]],]
---
In unanesthetized, spontaneously hypertensive rats the decrease in blood pressure and heart rate produced by intravenous clonidine, 5 to 20 micrograms/kg, was inhibited or reversed by nalozone, 0.2 to 2 mg/kg. The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence clonidine-suppressible binding of [3H]-dihydroergocryptine (1 nM). These findings indicate that in spontaneously hyperten

## Sentence Boundary Detection / Tokenization

`scispacy` provides some nice tools for preprocessing biomedical documents. In general, I'd suggest doing minimal processing of the raw data, especially if you are implementing a dataset for HuggingFace's Datasets hub. However, most NER datasets are broken down into sentences, so in some cases we might want to preprocess the data. This is critical, for example, in clinical text, where documents can be quite long and we want to focus on a specific section or sentence when defining our input context.

In [5]:
import spacy
import scispacy

nlp = spacy.load('en_core_sci_lg')
# Note: this rebinds `doc` from the BioC document to a list of passage texts.
doc = [section.text for section in doc.passages]

In [6]:
tok_n = 0
pdoc = nlp(doc[1])
for sent in pdoc.sents:
    print(sent)
    tokens = [tok.text for tok in sent]
    tok_n += len(tokens)
    print('---')
    
tok_n

In unanesthetized, spontaneously hypertensive rats the decrease in blood pressure and heart rate produced by intravenous clonidine, 5 to 20 micrograms/kg, was inhibited or reversed by nalozone, 0.2 to 2 mg/kg.
---
The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone.
---
Naloxone alone did not affect either blood pressure or heart rate.
---
In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence clonidine-suppressible binding of [3H]-dihydroergocryptine (1 nM).
---
These findings indicate that in spontaneously hypertensive rats the effects of central alpha-adrenoceptor stimulation involve activation of opiate receptors.
---
As naloxone and clonidine do not appear to interact with the same receptor site, the observed functional antagonism suggests the release of an endogenous opiate by clonidine or a

177
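
Once text is split into sentences and tokens, character-offset annotations still have to be projected onto tokens to get `ner_tags`. Here is a minimal sketch of that alignment, assuming a BIO scheme and entity offsets relative to the sentence text. In BioC the annotation offsets appear to be document-level, so the sentence's start offset would likely need to be subtracted first. The regex tokenizer is only a stand-in to keep the example self-contained; with spaCy, token start offsets are available as `tok.idx`. `tokenize_with_offsets` and `bio_tags` are hypothetical helper names.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer: returns (token, start, end) character spans."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def bio_tags(tokens, entities):
    """Assign BIO tags to tokens that overlap an entity's character span."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        inside = False
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to the entity if the two spans overlap.
            if start < ent["end"] and end > ent["start"]:
                tags[i] = ("I-" if inside else "B-") + ent["type"]
                inside = True
    return tags


sentence = "Naloxone reverses the antihypertensive effect of clonidine."
# Spans taken from the BioC annotations: end = offset + length.
entities = [
    {"start": 0, "end": 8, "type": "Chemical"},
    {"start": 49, "end": 58, "type": "Chemical"},
]

tokens = tokenize_with_offsets(sentence)
tags = bio_tags(tokens, entities)
for (tok, _, _), tag in zip(tokens, tags):
    print(tok, tag)
```

Overlap-based matching is forgiving when token boundaries do not line up exactly with annotation boundaries, but it will tag a whole token even if only part of it is annotated, so mismatches are worth logging on real data.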

In [None]:
print(doc[0])