# Transfer Learning for Biomedical Named Entity Recognition with Neural Networks

### Paper Description

The paper addresses named entity recognition (NER) on biomedical corpora.
The model/tool used is called NeuroNER. Its neural network architecture is: CNNs + BLSTM (character embedding + word embedding) + BLSTM + fully connected layer + CRF. The last two layers make the prediction, mapping an input sentence (a token sequence) to its predicted labels.

This neural approach depends on gold-standard corpora (GSCs) of hand-labeled entities, which tend to be small but highly reliable. An alternative to GSCs is silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, the two kinds of corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning.

The overall procedure is therefore: first train on an SSC to obtain a pretrained model, then continue training that model on a GSC (the transfer learning step). The results are compared with those obtained by training on the GSC alone.
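
The two-stage procedure can be illustrated with a deliberately tiny stand-in. The sketch below is not NeuroNER and not the paper's method: it uses a hypothetical logistic-regression classifier and made-up toy data, only to show the pattern of pretraining on a large noisy set, then continuing training on a small clean set, and keeping a clean-only baseline for comparison.

```python
import math


def train(examples, weights=None, epochs=20, lr=0.1):
    """Train a logistic-regression classifier with plain SGD.

    Passing `weights` continues training from an earlier model (the
    transfer learning step); otherwise training starts from zeros.
    The last weight is the bias term.
    """
    n_features = len(examples[0][0])
    w = list(weights) if weights is not None else [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, label in examples:
            score = w[-1] + sum(wi * xi for wi, xi in zip(w, features))
            prob = 1.0 / (1.0 + math.exp(-score))
            error = prob - label
            for i, xi in enumerate(features):
                w[i] -= lr * error * xi
            w[-1] -= lr * error
    return w


# Hypothetical toy data: the "silver" set is larger but has noisy labels,
# the "gold" set is small but clean.
ssc = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.7, 0.3], 0),
       ([0.0, 1.0], 0), ([0.1, 0.9], 0), ([0.2, 0.8], 0), ([0.3, 0.7], 1)]
gsc = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]

pretrained = train(ssc)                       # stage 1: train on the SSC
transferred = train(gsc, weights=pretrained)  # stage 2: continue on the GSC
baseline = train(gsc)                         # comparison: GSC only
```

In the real experiment, `transferred` and `baseline` correspond to the two NeuroNER runs whose scores are compared.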

![title](./neuromodel.png)

### Implementation Steps

You can get the biomedical corpora from [here](https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/).

The model/tool NeuroNER is available [here](https://github.com/Franck-Dernoncourt/NeuroNER/).

Please download both before running the experiments below.

The following steps are a good reference for running the experiment.
1. Download the [GloVe](https://nlp.stanford.edu/projects/glove/) embedding file, 
which provides the pretrained word embedding vectors. Since NeuroNER requires the
embedding vectors to be 100-dimensional, we used the `glove.6B.100d.txt` embedding file. 
Put the embedding file under the `./data/word_vectors` directory.

2. After setting up the embedding file, set up the training, validation, and test files. 
To experiment on the biomedical corpora with transfer learning, download
the corpora from the paper authors' [GitHub repository](https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/), 
unzip the SSC and GSC corpora you want to experiment on, and put the resulting folders under the `./data/` directory. There
are three processing modes, each with its own file requirements:

#### For training without a pretrained model, your folder must contain a `train` and a `valid` folder (or a `train.txt` file and a `valid.txt` file)
#### For training with a pretrained model, your folder must contain a `train` and a `valid` folder (or a `train.txt` file and a `valid.txt` file)
#### For testing or predicting, your folder must contain a `test` folder (or a `test.txt` file)
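
Steps 1 and 2 can be sanity-checked before starting a run. The sketch below is a hypothetical helper, not part of NeuroNER. It assumes the embedding file uses the plain GloVe text format (one token per line followed by space-separated values) and applies the folder requirements listed above.

```python
from pathlib import Path


def embedding_dimension(embedding_path):
    """Return the vector length of the first entry in a GloVe-style text file."""
    with open(embedding_path, encoding="utf-8") as handle:
        fields = handle.readline().rstrip().split(" ")
    return len(fields) - 1  # the first field is the token itself


def has_split(dataset_folder, split):
    """A split is present as either a sub-folder or a `<split>.txt` file."""
    folder = Path(dataset_folder)
    return (folder / split).is_dir() or (folder / f"{split}.txt").is_file()


def missing_splits(dataset_folder, train_model):
    """List the required splits that are absent from the dataset folder.

    Training (with or without a pretrained model) needs train and valid;
    testing or predicting needs test.
    """
    required = ["train", "valid"] if train_model else ["test"]
    return [split for split in required if not has_split(dataset_folder, split)]
```

For example, `embedding_dimension("./data/word_vectors/glove.6B.100d.txt")` should return 100, and `missing_splits("./data/my_ssc", train_model=True)` should return an empty list once the corpus is in place (`my_ssc` is a placeholder folder name).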
    

3. The processing mode is determined by the configuration in the `./parameters.ini` file. There, set `dataset_text_folder` and `token_pretrained_embedding_filepath` to point
to your data folder and embedding file. Then set `train_model`, `use_pretrained_model` and `pretrained_model_folder` as follows:

#### For training without a pretrained model, set `train_model` to True and `use_pretrained_model` to False.
#### For training with a pretrained model, set `train_model` to True and `use_pretrained_model` to True, and point `pretrained_model_folder` to the trained model's folder.
#### For testing or predicting, set `train_model` to False and `use_pretrained_model` to True, and point `pretrained_model_folder` to the trained model's folder.

4. Once the configuration is set, you are ready to run the code. There are two ways to run the system.

    First, you can run it from the command line: open your command-line tool, change to the `neuroner` folder,
and run the command `neuroner`. 
    Second, you can run it from a Python interpreter: open the `__main__.py` file and run it.

5. The next two steps are specific to transfer learning. After training on the SSC, the trained model is saved under the 
`output` folder. To use it as a pretrained model, move this folder under the `trained_models` folder, then delete
the unnecessary files, keeping only five: the `pickle` file, the `ckpt` files of your last epoch, and the `parameters.ini` file. Rename them to match the files in the
`./trained_models/conll_2003_en` folder.

6. You now have your own pretrained model and can do transfer learning on the GSC: edit `parameters.ini` under the root directory,
using the settings described above.
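
The file selection and renaming in step 5 can be scripted. The sketch below is a hypothetical helper, not part of NeuroNER. It assumes the checkpoint files in your output folder are named like `model_00012.ckpt.index` (epoch number before `.ckpt`) and that the target names are `model.ckpt.*`; both are assumptions, so check your own output folder and the `./trained_models/conll_2003_en` folder and adjust the pattern if they differ. It copies rather than moves, so the original output stays intact.

```python
import re
import shutil
from pathlib import Path

# ASSUMED naming scheme for per-epoch checkpoint files; adjust if yours differs.
CKPT_PATTERN = re.compile(r"^model_(\d+)\.ckpt(\..+)$")


def export_last_epoch(model_dir, target_dir):
    """Copy the last epoch's checkpoint, the pickle and the ini file to target_dir."""
    model_dir, target_dir = Path(model_dir), Path(target_dir)

    # Group checkpoint files by epoch number.
    checkpoints = {}
    for path in model_dir.iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match:
            epoch, suffix = int(match.group(1)), match.group(2)
            checkpoints.setdefault(epoch, []).append((path, suffix))
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoint files found in {model_dir}")

    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []

    # Keep only the last epoch and drop the epoch number from the file name.
    for path, suffix in checkpoints[max(checkpoints)]:
        destination = target_dir / f"model.ckpt{suffix}"
        shutil.copy2(path, destination)
        copied.append(destination.name)

    # The pickle and ini files keep their names.
    for path in model_dir.iterdir():
        if path.suffix in {".pickle", ".ini"}:
            shutil.copy2(path, target_dir / path.name)
            copied.append(path.name)
    return sorted(copied)
```

After the copy, compare the returned file names with the contents of `./trained_models/conll_2003_en` before pointing `pretrained_model_folder` at the new folder.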

### Some explanations

I didn't go through the whole experiment process. For one thing, the original NeuroNER code is very complex and convoluted, so I didn't wrap the original functions into my own reading, training, predicting, and evaluating functions. 
The original repository is also quite unreliable and contains many bugs, and it has kept changing since it was first created, so I have already spent a lot of time on debugging and experimenting. It is really not easy to run the whole system successfully and get the experiment results. With the final exam approaching, I can't afford the time needed to debug and test my own module class.

For another thing, the original project already provides a class named `NeuroNER`, in which the author has wrapped the loading, fitting, and predicting functions, so I think this also meets the requirement to some extent.

Therefore, to make up for this, I extended my experiments to two general NER benchmarks: CONLL2003 and CHEMDNER. I have also written detailed notes, above, on how to run the system both with and without transfer learning.
I hope these notes help.