# BioCreative II Gene Mention (GM) Task

For more information: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-ii/task-1a-gene-mention-tagging/

## Training Data

The training data is described in the corpus `README.GM` file, but I'll describe it here as well. It consists of a sentence file, `train.in`, and a label file, `GENE.eval`, which lists the offsets of any gene mentions (a sentence may have none). It is easiest to understand through an example, using the first sentence from `train.in`:

```
P00001606T0076 Comparison with alkaline phosphatases and 5-nucleotidase
```
Each line contains a single sentence, starting with a unique sentence identifier, followed by the text. This particular sample contains two (2) gene mentions, listed on two lines in the `GENE.eval` file:

```
P00001606T0076|14 33|alkaline phosphatases
P00001606T0076|37 50|5-nucleotidase
```
The first field (fields are delimited by the bar symbol) is the matching sentence ID. The second field contains the offsets of the first and last characters of the gene mention, *not counting space characters*. The third field is the mention text. So, looking at *alkaline phosphatases*, the first letter *a* is at offset 14, keeping in mind that the first character in the sentence is at offset 0. If you are not careful, you may think the offset of *a* is 16, but remember that spaces are not counted. Counting in the same way, the last *s* in *phosphatases* is at offset 33.
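The offset rule can be checked with a few lines of Python. This is a minimal sketch, not part of the corpus tooling: `space_free_span` is a hypothetical helper name, and it assumes the mention text occurs verbatim in the sentence and that only its first occurrence matters.

```python
def space_free_span(sentence, mention):
    """Return (first, last) offsets of `mention` in `sentence`, ignoring spaces."""
    start = sentence.find(mention)  # ordinary offset, spaces included
    if start < 0:
        raise ValueError("mention not found in sentence")
    # Offsets count only non-space characters, so drop the spaces seen so far.
    first = start - sentence[:start].count(" ")
    # Spaces inside the mention itself are skipped as well.
    last = first + len(mention.replace(" ", "")) - 1
    return first, last


sentence = "Comparison with alkaline phosphatases and 5-nucleotidase"
annotations = [
    "P00001606T0076|14 33|alkaline phosphatases",
    "P00001606T0076|37 50|5-nucleotidase",
]
for line in annotations:
    # Split on the first two bars only, in case the mention text contains one.
    sentence_id, offsets, mention = line.split("|", 2)
    first, last = (int(x) for x in offsets.split())
    assert space_free_span(sentence, mention) == (first, last)
```

Both annotation lines from the example pass the check: the ordinary offset 16 of *alkaline* becomes 14 once the two preceding spaces are removed.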

## Prepare for Training

This format is not very convenient for training our ML model. One common method for training NER systems is to label each sentence token with 'B', 'I', or 'O', where 'B' marks the beginning token of an entity, 'I' marks subsequent tokens in a multi-token entity (*inside*), and 'O' marks tokens *outside* any entity.
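To make the labeling scheme concrete, here is a minimal sketch that derives B/I/O labels from space-free offsets. It is an illustration only and makes two assumptions: tokens are produced by splitting on whitespace, and every mention starts and ends on a token boundary. The function name `bio_labels` is hypothetical, and a token that only partly overlaps a mention is simply left as 'O'.

```python
def bio_labels(sentence, spans):
    """Label whitespace tokens with B/I/O given space-free (first, last) spans."""
    labels = []
    offset = 0  # space-free offset of the current token's first character
    for token in sentence.split():
        token_first = offset
        token_last = offset + len(token) - 1
        label = "O"
        for first, last in spans:
            # A token belongs to a mention if it lies entirely inside the span.
            if first <= token_first and token_last <= last:
                label = "B" if token_first == first else "I"
                break
        labels.append(label)
        offset += len(token)  # spaces are never counted
    return labels


sentence = "Comparison with alkaline phosphatases and 5-nucleotidase"
print(bio_labels(sentence, [(14, 33), (37, 50)]))
# → ['O', 'O', 'B', 'I', 'O', 'B']
```

Because the offsets ignore spaces, the running offset advances by the token length alone, which keeps the bookkeeping simple.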

The module *bc2reader.py* will help convert these two files to something more usable. The first argument to the `BC2Reader` constructor is the sentence file. The second is the gene mention file.


```
from bc2reader import BC2Reader

train_home = '/home/ryan/Development/deep-learn-bio-nlp/bc2/bc2geneMention/train'
reader = BC2Reader('{0}/train.in'.format(train_home), '{0}/GENE.eval'.format(train_home))
reader.convert('{0}/converted.json'.format(train_home))
```

This will generate a JSON file with a more familiar format. Here is the first sentence in our BIO format:

```
import json
with open('{0}/converted.json'.format(train_home), 'r') as json_file:
    training_data = json.load(json_file)
    print(training_data[0])
```

```
['P00001606T0076', ['Comparison', 'with', 'alkaline', 'phosphatases', 'and', '5-nucleotidase'], ['O', 'O', 'B', 'I', 'O', 'B']]
```


This may be easier to read if we zip together the tokens and labels:

```
print(list(zip(training_data[0][1], training_data[0][2])))
```

```
[('Comparison', 'O'), ('with', 'O'), ('alkaline', 'B'), ('phosphatases', 'I'), ('and', 'O'), ('5-nucleotidase', 'B')]
```
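As a sanity check on the converted data, the labels can be decoded back into mention strings by grouping each 'B' with the 'I' tokens that follow it. The sketch below is one simple way to do this. `extract_mentions` is a hypothetical helper, not part of *bc2reader.py*, and it joins tokens with a single space, which assumes whitespace tokenization.

```python
def extract_mentions(tokens, labels):
    """Group each 'B' token with its following 'I' tokens into one mention."""
    mentions = []
    current = []  # tokens of the mention being built, if any
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # 'O', or a stray 'I' with no preceding 'B', closes any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ['Comparison', 'with', 'alkaline', 'phosphatases', 'and', '5-nucleotidase']
labels = ['O', 'O', 'B', 'I', 'O', 'B']
print(extract_mentions(tokens, labels))
# → ['alkaline phosphatases', '5-nucleotidase']
```

The recovered strings match the third field of the two `GENE.eval` lines shown earlier.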
