# <p style="text-align: center;">Lab 5: Neural Networks NERC</p>
#### <p style="text-align: center;"> Alejandra López de Aberasturi Gómez, Thèo Ding</p>


### Introduction
----------------

The purpose of this assignment was to develop a neural network able to perform named entity recognition successfully in a specialized domain, namely the one covered by the given dataset. 
The training dataset comprised all the sentences available in the /Train folder, while the development and test datasets corresponded to the contents of the /Devel and /Test folders, respectively. 

In all the tested implementations, the basic architecture was a bidirectional LSTM (biLSTM hereafter). This kind of neural network is especially well suited to our goals, because at each time step it has access to both past and future input features. In such a model, words must be mapped to dense numerical vectors, and during training the network learns improved representations for them. Performance may be increased by initializing these word vectors with pre-trained word embeddings learned with GloVe. Furthermore, character-level embeddings can be fed to the network so that it automatically captures prefixes, suffixes and other morphological features of words, providing the classifier with richer representations. 
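
The mapping from words to integer indices, which precedes any embedding lookup, can be sketched as follows. This is a minimal illustration rather than our actual preprocessing code: the reserved symbols `<PAD>` and `<UNK>` and the helper names `build_index` and `encode` are assumptions made for the example.

```python
PAD, UNK = "<PAD>", "<UNK>"  # hypothetical reserved symbols


def build_index(sentences):
    """Map every token seen in training to an integer id (0 = PAD, 1 = UNK)."""
    index = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for token in sentence:
            index.setdefault(token, len(index))
    return index


def encode(sentence, index, max_len):
    """Turn a token list into a fixed-length list of ids (truncate, then pad)."""
    ids = [index.get(token, index[UNK]) for token in sentence[:max_len]]
    return ids + [index[PAD]] * (max_len - len(ids))


# Toy training sentences, invented for the example.
train = [["Aspirin", "inhibits", "platelets"], ["Avoid", "aspirin"]]
word_index = build_index(train)

# "ibuprofen" was never seen, so it falls back to the UNK id.
print(encode(["Aspirin", "inhibits", "ibuprofen"], word_index, max_len=5))
# prints [2, 3, 1, 0, 0]
```

Reserving index 0 for padding is what allows an embedding layer to mask padded positions.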

In our search for the most satisfying implementation, we experimented with both word-level and character-level embeddings. The former were always present in our networks, and we compared the accuracy obtained with random and with pre-trained word embeddings. The latter were optional and always randomly initialized. For the pre-trained case, we tried all the vectors available in [this file](http://nlp.stanford.edu/data/glove.6B.zip), as well as the ones [here](http://nlp.stanford.edu/data/glove.840B.300d.zip). We present the results obtained with the second package.
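
A weights matrix for the pre-trained case could be assembled roughly as shown below. This is a sketch under stated assumptions, not the `embedding_matrix` helper used in our code: the names `load_glove` and `embedding_matrix_sketch`, the lowercase fallback and the uniform initialization range for out-of-vocabulary words are all illustrative choices, and a tiny in-memory file stands in for the real GloVe text file.

```python
import io

import numpy as np


def load_glove(handle, dim):
    """Parse GloVe's text format: a token followed by `dim` floats per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        # The last `dim` fields are the vector; everything before them is the
        # token (this also copes with tokens that themselves contain spaces).
        token = " ".join(parts[:-dim])
        vectors[token] = np.asarray(parts[-dim:], dtype="float32")
    return vectors


def embedding_matrix_sketch(word_index, vectors, dim, seed=0):
    """Build a (vocabulary size, dim) matrix aligned with `word_index`."""
    rng = np.random.default_rng(seed)
    # Words without a pre-trained vector keep a small random vector.
    matrix = rng.uniform(-0.25, 0.25, size=(len(word_index), dim)).astype("float32")
    matrix[0] = 0.0  # assumed: row 0 is the padding index
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            vector = vectors.get(word.lower())  # assumed fallback
        if vector is not None:
            matrix[i] = vector
    return matrix


# Two fake 3-dimensional vectors stand in for the real GloVe file.
fake_glove = io.StringIO("aspirin 0.1 0.2 0.3\ndrug 0.4 0.5 0.6\n")
vectors = load_glove(fake_glove, dim=3)
word_index = {"<PAD>": 0, "<UNK>": 1, "Aspirin": 2, "warfarin": 3}
matrix = embedding_matrix_sketch(word_index, vectors, dim=3)
print(matrix.shape)  # prints (4, 3)
```

The resulting array can be passed to a Keras `Embedding` layer through its `weights` argument, as done in `build_network`.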

Once the embeddings were set up, their outputs were concatenated and fed to the main biLSTM network. The character-level embedding was first passed through a smaller biLSTM, and it was the output of this layer that was concatenated with the other embeddings. 

#### Feature Augmentation and other embeddings

We also worked with several experimental arrangements that used **lowercased** and non-lowercased word embeddings (*i.e.* separate embedding layers for lowercased and non-lowercased words), as well as **PoS-tag** embeddings. As with the character-level embedding, these two were randomly initialized rather than pre-trained. 
Finally, some features (**prefixes, words containing dashes, keywords**, etc.) were extracted and fed to a dedicated feature embedding in order to boost the performance of the network on group-type entities, whose recall and precision scores tended to be low. Again, a `keras.layers.Embedding` layer was used for this processing. 

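The kind of token-level features described above can be sketched as follows. The class inventories, the keyword set and the prefix list are invented for illustration and are not the ones used in our experiments; each token is mapped to one categorical id per feature, which is the format an embedding layer expects.

```python
# Hypothetical class inventories; index 0 is reserved for padding.
CASE_CLASSES = ["PAD", "lower", "upper", "title", "mixed", "numeric", "other"]
TYPE_CLASSES = ["PAD", "none", "keyword", "dash", "prefix"]

GROUP_KEYWORDS = {"agents", "inhibitors", "antagonists"}  # illustrative only
GROUP_PREFIXES = ("anti", "beta")                         # illustrative only


def case_class(token):
    """Coarse capitalization pattern of a token."""
    if token.isdigit():
        return "numeric"
    if token.islower():
        return "lower"
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    if any(char.isupper() for char in token):
        return "mixed"
    return "other"


def type_class(token):
    """First matching surface cue: keyword, then dash, then prefix."""
    lowered = token.lower()
    if lowered in GROUP_KEYWORDS:
        return "keyword"
    if "-" in token:
        return "dash"
    if lowered.startswith(GROUP_PREFIXES):
        return "prefix"
    return "none"


def encode_features(tokens, max_len):
    """Return padded id sequences for the case and type features."""
    kept = tokens[:max_len]
    case_ids = [CASE_CLASSES.index(case_class(t)) for t in kept]
    type_ids = [TYPE_CLASSES.index(type_class(t)) for t in kept]
    padding = [0] * (max_len - len(kept))
    return case_ids + padding, type_ids + padding


tokens = ["Anticoagulants", "and", "beta-blockers", "NSAIDs"]
case_ids, type_ids = encode_features(tokens, max_len=6)
print(case_ids)  # prints [3, 1, 1, 4, 0, 0]
print(type_ids)  # prints [4, 1, 3, 1, 0, 0]
```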



### Training and hyper-parameters
------------------------------------
The trained models were all tested blindly on unseen test data. The split of the dataset into training and evaluation sets was, as noted above, straightforward: we trained the model on the /Train folder and tested it on the /Devel and /Test datasets. During training, the loss and the Viterbi accuracy on the validation dataset were monitored, and the first of these metrics was used as the early-stopping criterion. The hyper-parameters of the network include: 

* Number of hidden nodes for both biLSTMs and dense layer: \\( \left(H_{w}, H_{c}, H_{d}\right)\in\{25,50,100\} \\)
* Word embedding dimension, \\(d_{w}\in\{50,100,200,300\}\\)
* Character embedding dimension, \\(d_{c}\in\{25,50\}\\)
* Learning rate, left at the optimizer's default value (*i.e.* 0.001 for the `adam` optimizer in Keras)
* Batch size, which was 32 for all the experiments.
* Number of epochs, which was set to 15, with early stopping when the validation loss failed to improve for 2 consecutive epochs. 

The model that performed best on the validation set was finally evaluated on the unseen test dataset.

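One reading of the early-stopping rule described in this section is a patience of 2 epochs on the validation loss. The plain-Python sketch below makes that logic explicit; the function name `stopping_epoch` and the loss values are invented for the example.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 3.
history = [0.90, 0.55, 0.41, 0.43, 0.44, 0.40]
print(stopping_epoch(history))  # prints 5
```

In Keras the same behaviour is normally obtained by passing `EarlyStopping(monitor="val_loss", patience=2)` in the `callbacks` argument of `model.fit`.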

## Performed Experiments
----------------------------------
In total, more than 30 experiments were carried out in pursuit of the best possible architecture. As already mentioned, the variants differed in the values of the hyper-parameters (always within the ranges listed above). Furthermore, we tried multiple combinations of the embedding layers. To name a few: 

* Randomized word embedding + char embedding
* Randomized word embedding + pos-tag embedding + char embedding 
* Randomized word embedding + feature augmentation + char embedding 
* Randomized word embedding + pos-tag embedding + feature augmentation + char embedding
* Lowercased/non-lowercased randomized word embedding + char embedding
* Lowercased/non-lowercased randomized word embedding + pos-tag embedding + char embedding
* Lowercased/non-lowercased randomized word embedding + feature augmentation + char embedding
* Lowercased/non-lowercased randomized word embedding + pos-tag embedding + feature augmentation + char embedding
* All the previous configurations without char embedding

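Combinations such as the ones listed above can be enumerated programmatically. The sketch below is only an illustration of the idea: the hyper-parameter names follow the attributes read in `build_network`, while the `inputs` field and the switch names in `OPTIONAL_INPUTS` are hypothetical. The full grid is much larger than the 30-odd experiments we actually ran, so in practice only a subset would be selected.

```python
import itertools
import json

# Value ranges taken from the hyper-parameter list of this report.
GRID = {
    "lstm_main_units": [25, 50, 100],
    "w_embedding": [50, 100, 200, 300],
    "c_embedding": [25, 50],
}
OPTIONAL_INPUTS = ["pos", "features", "chars"]  # hypothetical switch names


def enumerate_configs(grid, optional_inputs):
    """Yield one dict per combination of grid values and optional inputs."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        for flags in itertools.product([False, True], repeat=len(optional_inputs)):
            config = dict(zip(keys, values))
            # The word embedding is always present; the others are optional.
            config["inputs"] = ["words"] + [
                name for name, enabled in zip(optional_inputs, flags) if enabled
            ]
            yield config


configs = list(enumerate_configs(GRID, OPTIONAL_INPUTS))
print(len(configs))  # prints 192 (3 * 4 * 2 grid points times 8 input subsets)
print(json.dumps(configs[0], sort_keys=True))
```

Each dictionary could then be written to its own `.json` file and handed to the training script.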

Regarding the main LSTM layer, we tested the performance of the system when fed with pre-trained and with random word vectors. As mentioned in the Introduction, the pre-trained vectors came from the Wikipedia + Gigaword 5 and Common Crawl (840B tokens, 2.2M vocab) datasets.

## Architecture
--------------------

In what follows, we focus on the architecture of the experimental arrangement that performed best. More than 30 slight variations of this model were tested before we arrived at this one, with F1 scores ranging from 0.3 to the one presented here.
Nevertheless, the general structure does not change, and further feature-augmenting embeddings simply involve the addition of random embedding layers. 
It is noteworthy, however, that neither the fully-connected layer after the main biLSTM nor the dropout layer introduced to prevent overfitting was always present. The table below shows all the hyper-parameters used for the experiments reported in the Results section.

| Hyper-parameter | Value |
|--- | --- |
|Word embedding dim, \\(d_{w}\\) | 300 (common crawl)|
|Character embedding dim, \\(d_{c}\\) | 25 (random) |
|Character biLSTM hidden layer dimension, \\(H_{c}\\) | 25|
|Main biLSTM hidden layer dimension, \\(H_{w}\\)|80|
|Dense hidden layer dimension, \\(H_{d}\\) | 160|
|Activation in dense hidden layer| tanh |
|Dropout  | 0.2|
|Recurrent dropout | 0.5|
|Optimizer | adam |
|Loss | crf_loss|

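The values in the table fix the tensor widths along the network, which can be checked with a little arithmetic. The sketch below assumes that the main biLSTM receives the word embedding, the character biLSTM output and the case embedding (whose size equals the word embedding dimension in `build_network`); the helper `lstm_params` uses the standard parameter count of one LSTM direction with bias terms.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM direction: 4 gates, each with an input matrix,
    a recurrent matrix and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


d_w, d_c = 300, 25            # word and character embedding dimensions
h_c, h_w, h_d = 25, 80, 160   # char biLSTM, main biLSTM, dense layer sizes

char_out = 2 * h_c                   # both directions are concatenated
concat_dim = d_w + char_out + d_w    # word + char biLSTM + case embedding (assumed)
main_out = 2 * h_w                   # width fed to the dense layer

print(concat_dim, main_out)              # prints 650 160
print(2 * lstm_params(d_c, h_c))         # char biLSTM weights: prints 10200
print(2 * lstm_params(concat_dim, h_w))  # main biLSTM weights: prints 467840
print(main_out * h_d + h_d)              # dense layer weights: prints 25760
```

Under these assumptions the main biLSTM output (160) happens to match the size of the dense hidden layer.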

## Results
--------------
Below we present the results obtained with the configuration described above. The final Viterbi accuracy reached 0.97 on the validation dataset, with an F1 score of 0.64 (P = 0.67, R = 0.63). The results for the *drug_n* and *group* entity types are not as good as those for the rest, which was to be expected given the scarcity (if not absence) of examples of these classes in the training dataset. 
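
For reference, precision, recall and F1 can be computed per entity type from true-positive, false-positive and false-negative counts and then macro-averaged. The sketch below uses invented counts, not our results, and assumes macro-averaging (the text above does not state which averaging the evaluator applies). It also shows that a macro-averaged F1 need not equal the harmonic mean of the macro-averaged precision and recall.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented (tp, fp, fn) counts per entity type, for illustration only.
counts = {"drug": (8, 2, 2), "group": (3, 1, 3), "drug_n": (0, 0, 4)}
scores = {label: prf(*c) for label, c in counts.items()}

n = len(scores)
macro_p = sum(s[0] for s in scores.values()) / n
macro_r = sum(s[1] for s in scores.values()) / n
macro_f1 = sum(s[2] for s in scores.values()) / n
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
print(round(harmonic, 3))  # differs from the macro-averaged F1
```

A class with no correct predictions, such as `drug_n` in this toy example, contributes a zero to every macro-average and pulls the overall score down sharply.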

## Code
----------

We present here the code of `build_network`. The `config` parameter is an object containing all the hyper-parameters of the experimental arrangement. These parameters were supplied through a `.json` file so that several experiments could be launched automatically from a `bash` script.
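
A configuration object of this kind could be loaded from JSON along the following lines. This is a sketch, not our `NetConfig` class: the function `load_config` is hypothetical, the attribute names mirror those read inside `build_network`, the default values follow the hyper-parameter table, and the values of `pre_trained`, `return_sequences`, `mask_zero` and `metrics` are assumptions.

```python
import json
from types import SimpleNamespace

DEFAULTS = {
    "pre_trained": True,        # assumed
    "w_embedding": 300,
    "c_embedding": 25,
    "lstm_char_units": 25,
    "lstm_main_units": 80,
    "dense_units": 160,
    "return_sequences": True,   # assumed: one tag per token is needed
    "mask_zero": True,          # assumed
    "activation": "tanh",
    "dropout": 0.2,
    "rcrr_dropout": 0.5,
    "optimizer": "adam",
    "loss": "crf_loss",
    "metrics": ["crf_accuracy"],  # assumed
}


def load_config(json_text):
    """Overlay the values found in a JSON document on top of the defaults."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Fail early on misspelled keys instead of silently ignoring them.
        raise KeyError(f"unknown hyper-parameters: {sorted(unknown)}")
    return SimpleNamespace(**{**DEFAULTS, **overrides})


config = load_config('{"lstm_main_units": 100, "optimizer": "nadam"}')
print(config.lstm_main_units, config.dense_units, config.optimizer)
# prints 100 160 nadam
```

Rejecting unknown keys is a design choice that helps when many experiment files are generated automatically.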

In [None]:
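from keras.models import Model
from keras.layers import (Input, Embedding, TimeDistributed, Bidirectional,
                          LSTM, Dense, concatenate)
from keras.optimizers import Nadam
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_accuracy

# embedding_matrix is a project helper (defined elsewhere) that builds the
# pre-trained weights matrix for the vocabulary.


def build_network(idx, config):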
    '''
    Builds the neural network.

    Parameters:
    -----------
    idx: dict
        Index dictionary with the encodings of words, characters, tags and
        auxiliary features, plus the maximum sentence and word lengths

    config: NetConfig instance
        Contains configuration of the neural network
        
    Returns:
    --------    
    model: compiled Keras model
    '''

    # sizes

    n_pos =             len(idx['pos'])             # UNK & PAD considered
    n_case =            len(idx['case'])            # PAD considered
    n_type =            len(idx['type'])            # PAD considered
    n_chars =           len(idx['chars'])           # UNK & PAD considered
    n_words =           len(idx['words'])           # UNK & PAD considered
    n_tags =            len(idx['tags'])            # PAD considered
    max_len_sentences = idx['max_len_sentences']   
    max_len_words =     idx['max_len_words'] 

    # ************************************************

    # architectural parameters

    pre_trained =       config.pre_trained
    w_embedding =       config.w_embedding
    c_embedding =       config.c_embedding
    lstm_char_units =   config.lstm_char_units
    lstm_main_units =   config.lstm_main_units
    dense_units =       config.dense_units
    return_sequences =  config.return_sequences
    mask_zero =         config.mask_zero
    activation =        config.activation

    #training parameters

    dropout =       config.dropout
    rcrr_dropout =  config.rcrr_dropout
    optimizer =     config.optimizer
    loss =          config.loss
    metrics =       config.metrics

    #********************************************************

    # create network layers

    # type embedding
    #---------------#

    type_inp = Input(shape=(max_len_sentences,))
    type_emb = Embedding(
        input_dim=n_type,
        output_dim=w_embedding,
        input_length=max_len_sentences,
        mask_zero=mask_zero)(type_inp)

    # pos embedding
    #--------------#

    pos_inp = Input(shape=(max_len_sentences,))
    pos_emb = Embedding(
        input_dim=n_pos,
        output_dim=w_embedding,
        input_length=max_len_sentences,
        mask_zero=mask_zero)(pos_inp)

    # capitalization words embedding
    #--------------------------#    

    case_inp = Input(shape=(max_len_sentences,))
    case_emb = Embedding(
        input_dim=n_case,
        output_dim=w_embedding,
        input_length=max_len_sentences,
        mask_zero=mask_zero)(case_inp)

    # word embedding
    # --------------#

    word_inp = Input(shape=(max_len_sentences,))
      
    if pre_trained: 

        #  word embedding option (1): load pre-trained embeddings 
        # and create the customized weights matrix according to our dataset

        word_emb = Embedding(
            input_dim=n_words, 
            output_dim=w_embedding,
            weights=[embedding_matrix(idx, n_words, w_embedding)], 
            trainable=False)(word_inp)

    else:

        # word embedding option (2): random embedding

        word_emb = Embedding(
            input_dim=n_words, 
            output_dim=w_embedding,
            input_length=max_len_sentences, 
            mask_zero=mask_zero)(word_inp)        


    #char embedding + char biLSTM
    #----------------------------

    char_inp = Input(shape=(max_len_sentences, max_len_words)) 

    char_emb = TimeDistributed(
                    Embedding(
                        input_dim=n_chars,
                        output_dim=c_embedding,
                        input_length=max_len_words,
                        mask_zero=mask_zero)
                    )(char_inp)  

    char_biLSTM = TimeDistributed(
                    Bidirectional(LSTM(
                    units=lstm_char_units, 
                    return_sequences=False,
                    recurrent_dropout=rcrr_dropout, 
                    dropout=dropout))
                    )(char_emb) 
    
    # main LSTM
    #---------#

    model = concatenate([
        word_emb, 
        char_biLSTM,
        case_emb,
        # pos_emb, 
        # type_emb
        ]
    )


    # model = Dropout(dropout)(model)

    model = Bidirectional(LSTM(units=lstm_main_units, return_sequences=return_sequences,
                recurrent_dropout=rcrr_dropout, dropout=dropout))(model)

    model = TimeDistributed(Dense(units=dense_units, activation=activation))(model) 

    model = TimeDistributed(Dense(units=n_tags, activation=activation))(model) 
    
    # model = Dropout(dropout)(model)

    # CRF layer
    #----------

    crf = CRF(n_tags)

    out = crf(model)               
    
    # create and compile model

    model = Model([
        word_inp, 
        char_inp,
        case_inp,
        # pos_inp, 
        # type_inp, 
        
        ], out)

    
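    # Nadam is instantiated explicitly; any other optimizer is passed on to
    # Keras as given (e.g. the string 'adam').
    if isinstance(optimizer, str) and optimizer.lower() == 'nadam': 
        optimizer = Nadam()

    # The CRF output layer requires its own loss and accuracy, so config.loss
    # and config.metrics are not used here.
    model.compile(optimizer=optimizer, loss=crf_loss, metrics=[crf_accuracy])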

    return model

## References
------------------
### [1. LSTM-CRF for Drug-Named Entity Recognition](https://www.mdpi.com/1099-4300/19/6/283)
(Donghuo Zeng, Chengjie Sun, Lei Lin and Bingquan Liu)

### [2. Recurrent Neural Networks with Specialized Word Embeddings for Health-Domain Named-Entity Recognition](https://arxiv.org/abs/1706.09569)
(Iñigo Jauregi Unanue, Ehsan Zare Borzeshi and Massimo Piccardi)

### [3. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354)
(Xuezhe Ma and Eduard Hovy)