# Homework and bake-off: word-level entailment with neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Data](#Data)
  1. [Edge disjoint](#Edge-disjoint)
  1. [Word disjoint](#Word-disjoint)
1. [Baseline](#Baseline)
  1. [Representing words: vector_func](#Representing-words:-vector_func)
  1. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  1. [Classifier model](#Classifier-model)
  1. [Baseline results](#Baseline-results)
1. [Homework questions](#Homework-questions)
  1. [Hypothesis-only baseline [2 points]](#Hypothesis-only-baseline-[2-points])
  1. [Alternatives to concatenation [2 points]](#Alternatives-to-concatenation-[2-points])
  1. [A deeper network [2 points]](#A-deeper-network-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

The general problem is word-level natural language inference.

Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y = 1$ if $w_{L}$ entails $w_{R}$, otherwise $0$.

The homework questions below ask you to define baseline models for this and develop your own system for entry in the bake-off, which will take place on a held-out test-set distributed at the start of the bake-off. (Thus, all the data you have available for development is available for training your final system before the bake-off begins.)

<img src="fig/wordentail-diagram.png" width=600 alt="wordentail-diagram.png" />

## Set-up

See [the first notebook in this unit](nli_01_task_and_data.ipynb) for set-up instructions.

In [2]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import nli
import utils

In [3]:
DATA_HOME = 'data'

NLIDATA_HOME = os.path.join(DATA_HOME, 'nlidata')

wordentail_filename = os.path.join(
    NLIDATA_HOME, 'nli_wordentail_bakeoff_data.json')

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

## Data

I've processed the data into two different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.
* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.

These are very different problems. For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.
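As a rough illustration of what the two kinds of disjointness mean, the sketch below computes edge and vocabulary overlap on toy data. It assumes each example is a `[[left_word, right_word], label]` item; the function names and the toy examples are made up for illustration and are not the `nli` module's own implementations.

```python
def edge_overlap_size(train, dev):
    """Count (left, right) word pairs that occur in both portions."""
    train_edges = {tuple(pair) for pair, _ in train}
    dev_edges = {tuple(pair) for pair, _ in dev}
    return len(train_edges & dev_edges)


def vocab_overlap_size(train, dev):
    """Count individual words that occur in both portions."""
    train_vocab = {w for pair, _ in train for w in pair}
    dev_vocab = {w for pair, _ in dev for w in pair}
    return len(train_vocab & dev_vocab)


# Toy data (hypothetical examples, not taken from the dataset files).
toy_train = [[['herring', 'animal'], 1], [['sweater', 'stroke'], 0]]
toy_edge_dev = [[['herring', 'fish'], 1]]   # shares a word, but no edge
toy_word_dev = [[['tulip', 'flower'], 1]]   # shares neither words nor edges
```

In the toy edge-disjoint case the dev pair reuses `herring`, so a model can still benefit from what it learned about that word; in the toy word-disjoint case it cannot.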

In [4]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The outer keys are the splits plus a list giving the vocabulary for the entire dataset:

In [5]:
wordentail_data.keys()

dict_keys(['edge_disjoint', 'vocab', 'word_disjoint'])

### Edge disjoint

In [6]:
wordentail_data['edge_disjoint'].keys()

dict_keys(['dev', 'train'])

This is what the split looks like; the `train` and `dev` portions of both splits have this same format:

In [7]:
wordentail_data['edge_disjoint']['dev'][: 5]

[[['sweater', 'stroke'], 0],
 [['constipation', 'hypovolemia'], 0],
 [['disease', 'inflammation'], 0],
 [['herring', 'animal'], 1],
 [['cauliflower', 'outlook'], 0]]

Let's test to make sure no edges are shared between `train` and `dev`:

In [8]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')

0

As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:

In [9]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

2916

This is a large percentage of the entire vocab:

In [10]:
len(wordentail_data['vocab'])

8470

Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge for learning. (I'll go ahead and reveal that the `dev` set is similarly distributed.)

In [11]:
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()

In [12]:
label_distribution('edge_disjoint')

0    14650
1     2745
Name: 1, dtype: int64
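To make the imbalance concrete, here is a small sketch that uses only the two counts printed above. It computes the accuracy of always predicting `0` (about 84%) and a set of inverse-frequency class weights. The weighting formula, `total / (n_classes * count)`, is one common heuristic and is an assumption here, not something the baseline in this notebook uses.

```python
# Label counts for the edge_disjoint train set, copied from the output above.
counts = {0: 14650, 1: 2745}

total = sum(counts.values())

# Accuracy of a classifier that always predicts the majority label.
majority_accuracy = max(counts.values()) / total

# Inverse-frequency weights: rarer classes get proportionally larger weights.
class_weights = {
    label: total / (len(counts) * n) for label, n in counts.items()}

print(round(majority_accuracy, 3), class_weights)
```

A high majority-class accuracy is the reason the experiments below are judged by macro-F1 rather than accuracy.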

### Word disjoint

In [13]:
wordentail_data['word_disjoint'].keys()

dict_keys(['dev', 'train'])

In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:

In [14]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')

0

Because no words are shared between `train` and `dev`, no edges are either:

In [15]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')

0

The label distribution is similar to that of `edge_disjoint`, though the overall number of examples is a bit smaller:

In [16]:
label_distribution('word_disjoint')

0    7199
1    1349
Name: 1, dtype: int64

## Baseline

Even in deep learning, __feature representation is vital and requires care!__ For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [17]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [18]:
# Any of the files in glove.6B will work here:

glove_dim = 50

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))
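One detail worth noticing: if `utils.randvec` draws fresh values on every call, as its docstring suggests, then an out-of-vocabulary word gets a different vector each time `glove_vec` is called on it. The sketch below shows one way to keep those vectors stable by caching them. `make_cached_vector_func` and `toy_lookup` are hypothetical names, and NumPy's `default_rng` stands in for `utils.randvec`.

```python
import numpy as np


def make_cached_vector_func(lookup, n=50, lower=-1.0, upper=1.0, seed=0):
    """Build a `vector_func` that returns one fixed random vector per
    unknown word instead of a new one on every call."""
    rng = np.random.default_rng(seed)
    oov_cache = {}

    def vector_func(w):
        if w in lookup:
            return lookup[w]
        if w not in oov_cache:
            # First time we see this unknown word: draw and remember it.
            oov_cache[w] = rng.uniform(lower, upper, size=n)
        return oov_cache[w]

    return vector_func


# Toy stand-in for the GLOVE dict (assumption for illustration).
toy_lookup = {'animal': np.ones(50)}
toy_vec = make_cached_vector_func(toy_lookup)
```

With this wrapper, the same unknown word maps to the same point in training and assessment, which matters most for `edge_disjoint`, where words recur across portions.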

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [19]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here; [homework question 2](#Alternatives-to-concatenation-[2-points]) below pushes you to do some exploration.
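As a hedged illustration of the "combinations" idea, the sketch below defines an average and a richer combination that concatenates both vectors with their absolute difference and element-wise product. Both function names are made up for this example, and nothing here is claimed to perform better than plain concatenation.

```python
import numpy as np


def vec_average(u, v):
    """Element-wise mean of `u` and `v` (output has the same dimension)."""
    return (u + v) / 2.0


def vec_concat_diff_prod(u, v):
    """Concatenate u, v, |u - v| and u * v (output has 4x the dimension)."""
    return np.concatenate((u, v, np.abs(u - v), u * v))


u = np.array([1.0, 2.0])
v = np.array([3.0, -2.0])
print(vec_average(u, v))
print(vec_concat_diff_prod(u, v))
```

Note that the combination changes the input dimension `m` of the network, so the first `Linear` layer's size follows from whichever function is chosen.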

### Classifier model

For a baseline model, I chose `TorchShallowNeuralClassifier`:

In [20]:
net = TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100)

### Baseline results

The following puts the above pieces together, using `vector_func=glove_vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint`!

In [21]:
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Initialized model Sequential:
       Sequential(
        (0): Linear(in_features=100, out_features=50, bias=True)
        (1): Tanh()
        (2): Linear(in_features=50, out_features=2, bias=True)
      )
Finished epoch 100 of 100; error is 0.0256
              precision    recall  f1-score   support

           0      0.915     0.938     0.926      1910
           1      0.380     0.305     0.339       239

    accuracy                          0.867      2149
   macro avg      0.648     0.622     0.633      2149
weighted avg      0.856     0.867     0.861      2149
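The report above shows why accuracy (0.867) and macro-F1 (0.633) diverge: macro-F1 is the unweighted mean of the per-class F1 scores, so the weak positive class counts as much as the large negative class. The sketch below recomputes it from the rounded precision and recall values in the report; because those inputs are rounded, the result matches the reported figure only approximately. The helper names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_pr):
    """Unweighted mean of per-class F1 over (precision, recall) pairs."""
    scores = [f1(p, r) for p, r in per_class_pr]
    return sum(scores) / len(scores)


# (precision, recall) for classes 0 and 1, copied from the report above.
report = [(0.915, 0.938), (0.380, 0.305)]
score = macro_f1(report)
print(round(score, 2))
```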



## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Hypothesis-only baseline [2 points]

During our discussion of SNLI and MultiNLI, we noted that a number of research teams have shown that hypothesis-only baselines for NLI tasks can be remarkably robust. This question asks you to explore briefly how this baseline affects the 'edge_disjoint' and 'word_disjoint' versions of our task.

For this problem, submit two functions:

1. A `vector_combo_func` function called `hypothesis_only` that simply throws away the premise, using the unmodified hypothesis (second) vector as its representation of the example.

1. A function called `run_hypothesis_only_evaluation` that does the following:
    1. Loops over the two conditions 'word_disjoint' and 'edge_disjoint' and the two `vector_combo_func` values `vec_concatenate` and `hypothesis_only`, calling `nli.wordentail_experiment` to train on the condition's 'train' portion and assess on its 'dev' portion, with `glove_vec` as the `vector_func`. So that the results are consistent, use an `sklearn.linear_model.LogisticRegression` with default parameters as the model.
    1. Returns a `dict` mapping `(condition_name, function_name)` pairs to the 'macro-F1' score for that pair, as returned by the call to `nli.wordentail_experiment`. (Tip: you can get the `str` name of your function `hypothesis_only` with `hypothesis_only.__name__`.)
    
The test functions `test_hypothesis_only` and `test_run_hypothesis_only_evaluation` will help ensure that your functions have the desired logic.

In [22]:
##### YOUR CODE HERE
from sklearn.linear_model import LogisticRegression

def hypothesis_only(u, v):
    return v



def run_hypothesis_only_evaluation(include_shallow=False):
    ##### YOUR CODE HERE
    results = {}
    model_facts = [lambda: LogisticRegression()]
    if include_shallow:
        model_facts.append(lambda: TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100))
    for condition in [ 'word_disjoint', 'edge_disjoint']:
        for vector_combo_func in [vec_concatenate, hypothesis_only]:
            for model_fact in model_facts:
                model = model_fact()
                iter_name = (condition, vector_combo_func.__name__) 
                if include_shallow:
                    iter_name = iter_name + (model.__class__.__name__,)
                print("Running {}".format(iter_name))
                exp = nli.wordentail_experiment(
                    train_data=wordentail_data[condition]['train'],
                    assess_data=wordentail_data[condition]['dev'], 
                    model=model, 
                    vector_func=glove_vec,
                    vector_combo_func=vector_combo_func
                )   
                results[iter_name] = exp['macro-F1']
    pres = []
    for itr in results: 
        pres.append(("{:<15}".format(str(itr)+':'), results[itr]))
    pres = sorted(pres, key=lambda x:x[1])
    for r in pres: print(r[0], r[1])
    return results



In [23]:
def test_hypothesis_only(hypothesis_only):
    v = hypothesis_only(1, 2)
    assert v == 2   

In [24]:
test_hypothesis_only(hypothesis_only)

In [25]:
def test_run_hypothesis_only_evaluation(run_hypothesis_only_evaluation):
    results = run_hypothesis_only_evaluation()
    assert ('word_disjoint', 'vec_concatenate') in results, \
        "The return value of `run_hypothesis_only_evaluation` does not have the intended kind of keys"
    assert isinstance(results[('word_disjoint', 'vec_concatenate')], float), \
        "The values of the `run_hypothesis_only_evaluation` result should be floats"

In [26]:
test_run_hypothesis_only_evaluation(run_hypothesis_only_evaluation)

Running ('word_disjoint', 'vec_concatenate')
              precision    recall  f1-score   support

           0      0.901     0.979     0.938      1910
           1      0.446     0.138     0.211       239

    accuracy                          0.885      2149
   macro avg      0.673     0.558     0.574      2149
weighted avg      0.850     0.885     0.857      2149

Running ('word_disjoint', 'hypothesis_only')
              precision    recall  f1-score   support

           0      0.893     0.988     0.938      1910
           1      0.343     0.050     0.088       239

    accuracy                          0.884      2149
   macro avg      0.618     0.519     0.513      2149
weighted avg      0.831     0.884     0.843      2149

Running ('edge_disjoint', 'vec_concatenate')
              precision    recall  f1-score   support

           0      0.875     0.970     0.920      7376
           1      0.578     0.227     0.326      1321

    accuracy                          0.857    

In [27]:
#run_hypothesis_only_evaluation(include_shallow=True)

### Alternatives to concatenation [2 points]

We've so far just used vector concatenation to represent the premise and hypothesis words. This question asks you to explore two simple alternatives:

1. Write a function `vec_diff` that, for a given pair of vector inputs `u` and `v`, returns the element-wise difference between `u` and `v`.

1. Write a function `vec_max` that, for a given pair of vector inputs `u` and `v`, returns the element-wise max values between `u` and `v`.

You needn't include your uses of `nli.wordentail_experiment` with these functions, but we assume you'll be curious to see how they do!

In [28]:
def vec_diff(u, v):
    return u-v



    
def vec_max(u, v):
    return np.maximum(u,v)




In [29]:
def test_vec_diff(vec_diff):
    u = np.array([10.2, 8.1])
    v = np.array([1.2, -7.1])
    result = vec_diff(u, v)
    expected = np.array([9.0, 15.2])
    assert np.array_equal(result, expected), \
        "Expected {}; got {}".format(expected, result)

In [30]:
test_vec_diff(vec_diff)

In [31]:
def test_vec_max(vec_max):
    u = np.array([1.2,  8.1])
    v = np.array([10.2, -7.1])
    result = vec_max(u, v)
    expected = np.array([10.2, 8.1])
    assert np.array_equal(result, expected), \
        "Expected {}; got {}".format(expected, result)

In [32]:
test_vec_max(vec_max)

### A deeper network [2 points]

It is very easy to subclass `TorchShallowNeuralClassifier` if all you want to do is change the network graph: all you have to do is write a new `define_graph`. If your graph has new arguments that the user might want to set, then you should also redefine `__init__` so that these values are accepted and set as attributes.

For this question, please subclass `TorchShallowNeuralClassifier` so that it defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
r_{1} &= \textbf{Bernoulli}(1 - \textbf{dropout\_prob}, n) \\
d_{1} &= r_1 * h_{1} \\
h_{2} &= f(d_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

Here, $r_{1}$ and $d_{1}$ define a dropout layer: $r_{1}$ is a random binary vector of dimension $n$, where the probability of a value being $1$ is given by $1 - \textbf{dropout\_prob}$. $r_{1}$ is multiplied element-wise by our first hidden representation, thereby zeroing out some of the values. The result is fed to the user's activation function $f$, and the result of that is fed through another linear layer to produce $h_{3}$. (Inside `TorchShallowNeuralClassifier`, $h_{3}$ is the basis for a softmax classifier, so no activation function is applied to it.)

For your implementation, please use `nn.Sequential`, `nn.Linear`, and `nn.Dropout` to define the required layers.
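For intuition only, here is a NumPy sketch of the forward pass exactly as the equations above write it. It is not the requested `nn.Sequential` solution, and all names and shapes are assumptions for the example. Also note that `nn.Dropout` additionally rescales the kept units by `1 / (1 - p)` during training and does nothing in evaluation mode, which the equations and this sketch leave out.

```python
import numpy as np


def forward(x, W1, b1, W2, b2, dropout_prob, rng, f=np.tanh):
    """Literal transcription of the dropout graph (training-time view)."""
    h1 = x @ W1 + b1
    # r1: binary mask, each entry is 1 with probability 1 - dropout_prob.
    r1 = rng.binomial(1, 1 - dropout_prob, size=h1.shape)
    d1 = r1 * h1
    h2 = f(d1)
    h3 = h2 @ W2 + b2
    return h3


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))     # 3 examples, input_dim = 4
W1 = rng.normal(size=(4, 5))    # hidden_dim = 5
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 2))    # 2 output classes
b2 = np.array([0.5, -0.5])

logits = forward(x, W1, b1, W2, b2, dropout_prob=0.7, rng=rng)
```

Setting `dropout_prob=0.0` recovers the shallow graph, since the mask is then all ones.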

For comparison, using this notation, `TorchShallowNeuralClassifier` defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
h_{2} &= f(h_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

The following code starts this sub-class for you, so that you can concentrate on `define_graph`. Be sure to make use of `self.dropout_prob`.

For this problem, submit just your completed `TorchDeepNeuralClassifier`. You needn't evaluate it, though we assume you will be keen to do that!

You can use `test_TorchDeepNeuralClassifier` to ensure that your network has the intended structure.

In [33]:
import torch.nn as nn

class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, dropout_prob=0.7, **kwargs):
        self.dropout_prob = dropout_prob
        super().__init__(**kwargs)
    
    def define_graph(self):
        """Complete this method!
        
        Returns
        -------
        an `nn.Module` instance, which can be a free-standing class you 
        write yourself, as in `torch_rnn_classifier`, or the output of 
        `nn.Sequential`, as in `torch_shallow_neural_classifier`.
        
        """
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Dropout(self.dropout_prob),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes_)
        )


    

##### YOUR CODE HERE    




In [34]:
def test_TorchDeepNeuralClassifier(TorchDeepNeuralClassifier):
    dropout_prob = 0.55
    assert hasattr(TorchDeepNeuralClassifier(), "dropout_prob"), \
        "TorchDeepNeuralClassifier must have an attribute `dropout_prob`."
    try:
        inst = TorchDeepNeuralClassifier(dropout_prob=dropout_prob)
    except TypeError:
        raise TypeError("TorchDeepNeuralClassifier must allow the user "
                        "to set `dropout_prob` on initialization")
    inst.input_dim = 10
    inst.n_classes_ = 5
    graph = inst.define_graph()
    assert len(graph) == 4, \
        "The graph should have 4 layers; yours has {}".format(len(graph))    
    expected = {
        0: 'Linear',
        1: 'Dropout',
        2: 'Tanh',
        3: 'Linear'}
    for i, label in expected.items():
        name = graph[i].__class__.__name__
        assert label in name, \
            "The {} layer of the graph should be a {} layer; yours is {}".format(i, label, name)
    assert graph[1].p == dropout_prob, \
        "The user's value for `dropout_prob` should be the value of `p` for the Dropout layer."

In [35]:
test_TorchDeepNeuralClassifier(TorchDeepNeuralClassifier)

### Your original system [3 points]

This is a simple dataset, but our focus on the 'word_disjoint' condition ensures that it's a challenging one, and there are lots of modeling strategies one might adopt. 

You are free to do whatever you like. We require only that your system differ in some way from those defined in the preceding questions. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

Keep in mind that, for the bake-off evaluation, the 'edge_disjoint' portions of the data are off limits. You can, though, train on the combination of the 'word_disjoint' 'train' and 'dev' portions. You are free to use different pretrained word vectors and the like. Please do not introduce additional entailment datasets into your training data, though.

Please embed your code in this notebook so that we can rerun it.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [36]:
# Enter your system description in this cell.
# Please do not remove this comment.
__description__ = """

Solution

For this particular task two major directions have been identified: employing an MLP on a particular 
representation of the input (u, v) tuple, or creating architectures based on siamese networks. Although the 
MLP approach seems more direct, the main area explored here is that of siamese-based architectures. Random grid
search was used on various architectures. 

More information regarding the architectural details is in the following markdown cells, followed by the full source code.


"""




## Process overview

The overall approach was based on five steps:

(1) define the baselines (see the `get_baselines` function), which include both the simple MLP approach mentioned above 
and logistic regression 

(2) prepare an overall model architecture, `ThWordEntailModel`, encapsulated in the `WordEntailClassifier` model wrapper, that 
allows us to experiment with classic siamese network approaches such as "Dimensionality Reduction by Learning 
an Invariant Mapping" or "Siamese Neural Networks for One-shot Image Recognition", as well as approaches based on a 
graph architecture with two separate encoder columns and a final classification MLP backbone

(3) prepare a grid search strategy that cycles through most of the major architectures proposed in (2); this 
is provided by the `run_grid_search` function

(4) run and refine the random grid search, adding features and/or new approaches. Two separate grids were defined
and concatenated: one for the graphs that do not use contrastive loss and another for the ones based on contrastive loss

(5) select the best architecture and re-run the experiment (including ensembles)

## Experimentation details

The actual training procedure was based on early stopping combined with learning rate decay (which includes
restoring the parameters and the optimizer). The early stopping mechanism used the proposed `dev_data`. Due to class imbalance and 
to the model's tendency to favor "not entail", focal loss was added to the experiment as a grid-search option alongside standard
BCE. To further test the model's capabilities, 4 different datasets were extracted from the original two datasets (train 
and dev): a train dataset where all words are contained in the pretrained GloVe vocabulary, a similar dev set, and a train/dev pair
where one or both words in each observation are not contained in the pretrained GloVe embeddings (see the
function `test_glove_vs_data`).
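For readers unfamiliar with focal loss, the following NumPy sketch shows the general idea for the binary case: the usual cross-entropy term is scaled by `(1 - p_t) ** gamma`, so examples the model already classifies confidently contribute less. The `FocalLoss` module used in this system is not shown in this section; in particular, treating `alpha` as a plain weight on positive examples is an assumption made here, and `binary_focal_loss` is a hypothetical name.

```python
import numpy as np


def binary_focal_loss(logits, targets, alpha=4.0, gamma=2.0):
    """Mean binary focal loss computed from raw logits.

    alpha: weight on positive examples (assumed interpretation).
    gamma: focusing exponent; gamma=0 and alpha=1 give plain BCE.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    weights = np.where(targets == 1, alpha, 1.0)
    eps = 1e-12  # avoids log(0)
    losses = -weights * (1.0 - p_t) ** gamma * np.log(p_t + eps)
    return losses.mean()


# A confidently correct positive example is down-weighted relative to BCE.
easy_focal = binary_focal_loss([4.0], [1], alpha=1.0, gamma=2.0)
easy_bce = binary_focal_loss([4.0], [1], alpha=1.0, gamma=0.0)
```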

#### Word vectorizer and OOVs

Because both the training and validation datasets (as well as real-life data) contain words that are outside the 
pretrained GloVe vocabulary, an "approximator" function was developed that generates word embeddings for the OOV words.
Below are a few examples of the nearest neighbors of the generated embeddings (the value is the cosine
distance):

```
    replacement for 'unwrought'
    wrought        1.1327e-10
    devastation    4.6164e-01
    ironwork       4.7856e-01
    wreaked        4.8097e-01
    railings       5.1378e-01


    replacement for 'haematemesis'
    emesis                  1.3017e-10
    diaphoresis             5.0699e-01
    itraconazole            5.4648e-01
    chemotherapy-induced    5.5422e-01
    anovulation             5.5664e-01

    replacement for 'pyrexia'
    pyrex           9.7854e-11
    corningware     4.8760e-01
    borosilicate    5.2159e-01
    bakeware        5.2440e-01
    glassware       5.3898e-01

    replacement for 'cervicitis'
    cervi         1.0381e-10
    severini      5.9782e-01
    pasqualino    6.0177e-01
    kalantar      6.0200e-01
    conason       6.0281e-01

    replacement for 'dacryocystitis'
    cystitis          1.0738e-10
    interstitial      4.8357e-01
    pyelonephritis    5.2237e-01
    endometriosis     5.3036e-01
    bronchiolitis     5.4040e-01

    replacement for 'antheridium'
    anther      1.4220e-10
    stamen      5.4327e-01
    anthers     5.6385e-01
    piasters    5.9172e-01
    pistil      6.0421e-01
```
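The near-zero distances in the first row of each example are consistent with the generated embedding being taken from a known word contained inside the OOV word (`wrought` in `unwrought`, `pyrex` in `pyrexia`). The approximator itself is not shown in this section, so the sketch below is only a guess at that behavior: it returns the longest substring of a word that appears in a vocabulary. The function name, the `min_len` cutoff and the toy vocabulary are all assumptions.

```python
def longest_known_substring(word, vocab, min_len=3):
    """Return the longest proper substring of `word` found in `vocab`,
    or None if no substring of at least `min_len` characters is known."""
    for length in range(len(word) - 1, min_len - 1, -1):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                return piece
    return None


# Toy vocabulary standing in for the GloVe keys.
toy_vocab = {'wrought', 'rough', 'pyrex', 'anther', 'the'}

for oov in ['unwrought', 'pyrexia', 'antheridium']:
    print(oov, '->', longest_known_substring(oov, toy_vocab))
```

As the `pyrexia` example above shows, a purely string-based replacement can land on a semantically unrelated word, so these approximations are best treated as a fallback rather than a reliable signal.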

### The classifier architectures

The main parameters of the `WordEntailClassifier` model wrapper are the following:

```
   model_name,  # model name
   siam_lyrs,   # layers of the siamese or the individual word encoders
   s_l2,        # application of l2 on siamese/paths
   s_bn,        # BN in siams/paths
   c_act,       # apply activation on siam/paths combiner
   separ,       # use separate paths for each word
   layers,      # layers of the final classifier dnn
   inp_drp,     # apply drop on inputs
   o_drp,       # apply drop on each fc
   bn,          # apply BN on each linear layer in final dnn
   bn_inp,      # apply BN on inputs
   activ,       # activation (all)
   x_dev,       # x_dev for early stop w. lr decay
   y_dev,       # y_dev for early stop w. lr decay
   s_comb,      # method for combining siams/paths
   rev,         # reverse targets during training/predict
   lr,          # starting lr
   loss,        # loss function name for CL/BCE/FL
   bal,         # apply sample balancing during training

   cl_m=1,        # if using CL this is margin
   fl_a=4,        # focal loss weighting
   fl_g=2,        # focal loss discount exponent
   lr_decay=0.5,  # lr decay factor
   batch=256,     # batch size
   l2_strength=0, # l2 weight decay
   max_epochs=10000,  # not really used
   max_patience=10,   # maximum patience before reload & lr decay  
   max_fails=40,      # max consecutive fails before stop
```
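The interplay of `max_patience`, `max_fails` and `lr_decay` can be sketched as a small control loop. This is a reading of the parameter comments above, not the actual training code: it assumes that `max_patience` non-improving epochs trigger a reload of the best state plus a learning rate decay, and that `max_fails` consecutive non-improving epochs stop training. All function names are hypothetical, and in a real setting the saved state would include the optimizer as well as the model parameters.

```python
import copy


def train_with_patience(train_epoch, evaluate, get_state, set_state, lr,
                        lr_decay=0.5, max_patience=10, max_fails=40,
                        max_epochs=10000):
    """Early stopping with reload-and-decay; returns (best_score, final_lr)."""
    best_score = float('-inf')
    best_state = copy.deepcopy(get_state())
    patience = fails = 0
    for _ in range(max_epochs):
        train_epoch(lr)
        score = evaluate()
        if score > best_score:
            # New best: remember it and reset both counters.
            best_score = score
            best_state = copy.deepcopy(get_state())
            patience = fails = 0
            continue
        patience += 1
        fails += 1
        if fails >= max_fails:
            break
        if patience >= max_patience:
            # Reload the best state and continue with a smaller step.
            set_state(copy.deepcopy(best_state))
            lr *= lr_decay
            patience = 0
    set_state(copy.deepcopy(best_state))
    return best_score, lr


# Toy "model": a single weight, with a dev score that peaks at w == 3.
state = {'w': 0.0}


def toy_train_epoch(lr):
    state['w'] += lr


def toy_evaluate():
    return -(state['w'] - 3.0) ** 2


best, final_lr = train_with_patience(
    toy_train_epoch, toy_evaluate,
    get_state=lambda: state, set_state=state.update,
    lr=1.0, max_patience=2, max_fails=6)
```

In the toy run the weight overshoots the optimum, gets reset to the best value twice with a halved step each time, and training stops once six consecutive epochs fail to improve.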

To illustrate the different categories of models used in the experiments, three examples follow (to keep the models clear
and self-explanatory, each *operation was encapsulated within a module*):

#### Model type #1: Identical word encoders followed by a classifier
```
  ThWordEntailModel(
    (siam_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): Linear(in_features=300, out_features=256, bias=True)
      (2): Tanh()
      (3): Dropout(p=0.2, inplace=False)
      (4): Linear(in_features=256, out_features=128, bias=True)
      (5): Tanh()
      (6): Dropout(p=0.2, inplace=False)
      (7): Linear(in_features=128, out_features=64, bias=True)
      (8): L2_Normalizer()
    )
    (post_layers): ModuleList(
      (0): PathsCombiner(input_dim=64x2, output_dim=64, method='sqr', act=None)
      (1): Linear(in_features=64, out_features=128, bias=False)
      (2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Tanh()
      (4): Dropout(p=0.2, inplace=False)
      (5): Linear(in_features=128, out_features=32, bias=False)
      (6): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (7): Tanh()
      (8): Dropout(p=0.2, inplace=False)
      (9): Linear(in_features=32, out_features=1, bias=True)
    )
  )
  Loss: FocalLoss(alpha=4, gamma=2)
```  
Models of this type use either focal loss or standard BCE.
  
#### Model type #2: Siamese encoders followed by Euclidean distance and contrastive loss
```
  ThWordEntailModel(
    (siam_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): Linear(in_features=300, out_features=512, bias=False)
      (2): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): ReLU()
      (4): Dropout(p=0.5, inplace=False)
      (5): Linear(in_features=512, out_features=256, bias=False)
      (6): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (7): L2_Normalizer()
    )
    (post_layers): ModuleList(
      (0): PathsCombiner(input_dim=256x2, output_dim=1, method='eucl', act=None)
    )
  )
  Loss: ConstrativeLoss(margin=1)  
```  
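As background for this model type, here is a NumPy sketch of the contrastive loss in its usual textbook form: pairs marked as similar are pulled together, and dissimilar pairs are pushed apart until they are at least `margin` away. The `ConstrativeLoss` module printed above is not shown in this section, so how entailment labels map onto "similar" (and how the `rev` option interacts with that) is left open; the function names here are hypothetical.

```python
import numpy as np


def euclidean(a, b):
    """Row-wise Euclidean distance between two batches of encodings."""
    return np.sqrt(((a - b) ** 2).sum(axis=-1))


def contrastive_loss(dist, similar, margin=1.0):
    """Mean contrastive loss.

    dist: distances between the two encodings of each pair.
    similar: 1 where the pair should be close, 0 where it should be far.
    """
    dist = np.asarray(dist, dtype=float)
    similar = np.asarray(similar, dtype=float)
    pull = similar * dist ** 2
    push = (1.0 - similar) * np.maximum(0.0, margin - dist) ** 2
    return 0.5 * (pull + push).mean()


a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.0], [0.0, -1.0]])
d = euclidean(a, b)                    # distances 0.0 and 2.0
loss = contrastive_loss(d, [1, 0])     # both pairs are already "correct"
```

Because the encoders end in an L2 normalization step, distances between encodings lie between 0 and 2, so a margin of 1 is always reachable.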

#### Model type #3: Separated word encoders that are combined and passed through the final DNN classifier
```
  ThWordEntailModel(
    (path1_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): Linear(in_features=300, out_features=128, bias=True)
      (3): L2_Normalizer()
    )
    (path2_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): Linear(in_features=300, out_features=128, bias=True)
      (3): L2_Normalizer()
    )
    (post_layers): ModuleList(
      (0): PathsCombiner(input_dim=128x2, output_dim=128, method='add', act=None)
      (1): Linear(in_features=128, out_features=256, bias=True)
      (2): SELU()
      (3): Linear(in_features=256, out_features=128, bias=True)
      (4): SELU()
      (5): Linear(in_features=128, out_features=64, bias=True)
      (6): SELU()
      (7): Linear(in_features=64, out_features=1, bias=True)
    )
  )
  Loss: FocalLoss(alpha=4, gamma=2)
```

Please note that all above modules such as `ConstrativeLoss`, `FocalLoss`, `PathsCombiner`, `L2_Normalizer`, `InputPlaceholder` as well as the whole `ThWordEntailModel` can be reviewed in the following sections.

### About vector combining

In order to simplify the grid search process we decided to use a single `vector_combine_function` that just pairs 
the words in each observation. This allows us to experiment with a wide variety of combining functions at the level of the
`PathsCombiner`. For the classic case without any kind of individual word re-encoder, we have the following example:

```
  ThWordEntailModel(
    (path1_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): Dropout(p=0.3, inplace=False)
    )
    (path2_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): Dropout(p=0.3, inplace=False)
    )
    (post_layers): ModuleList(
      (0): PathsCombiner(input_dim=300x2, output_dim=600, method='cat', act=None)
      (1): Linear(in_features=600, out_features=256, bias=False)
      (2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Sigmoid()
      (4): Linear(in_features=256, out_features=128, bias=False)
      (5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): Sigmoid()
      (7): Linear(in_features=128, out_features=64, bias=False)
      (8): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (9): Sigmoid()
      (10): Linear(in_features=64, out_features=1, bias=True)
    )
  )
  Loss: BCEWithLogitsLoss()
```
In the above example the GloVe-300 embeddings are basically concatenated after a 30% dropout has been applied.

## The results

After multiple random grid search iterations we obtained a reduced list of classifier candidates. The results also confirmed 
the hypothesis that ensemble models outperform single models, as shown by the following results table:

```
Results:
             MODEL    SCORE   pos_F1  pos_Rec  pos_Pre         vf
10        H3v3_020  72.4773  51.4735  54.8117  48.5185  l_glv_rep
8         H3v2_321  72.8769  52.9101  62.7615  45.7317  l_glv_rep
4         H3v3_015  73.0251  54.1213  72.8033  43.0693      l_glv
172  E_424_335_020  73.0790  51.3317  44.3515  60.9195  l_glv_rep
9         H3v2_335  73.1663  52.3013  52.3013  52.3013  l_glv_rep
171  E_424_335_020  73.2309  51.5815  44.3515  61.6279      l_glv
1         H3v2_424  73.2915  52.2876  50.2092  54.5455      l_glv
158  E_424_197_335  73.3479  51.9048  45.6067  60.2210  l_glv_rep
157  E_424_197_335  73.4992  52.1531  45.6067  60.8939      l_glv
0         H3v2_197  73.5365  53.1440  54.8117  51.5748      l_glv
164  E_424_396_335  73.6420  52.5581  47.2803  59.1623  l_glv_rep
159  E_424_197_020  73.6420  52.5581  47.2803  59.1623      l_glv
2         H3v2_500  73.6656  53.0172  51.4644  54.6667      l_glv
163  E_424_396_335  73.7169  52.6807  47.2803  59.4737      l_glv
135  E_424_015_396  74.2958  54.3568  54.8117  53.9095      l_glv
5         H3v3_171  74.3145  54.5455  56.4854  52.7344      l_glv
30   E_197_500_017  74.3241  53.8813  49.3724  59.2965  l_glv_rep
43   E_197_500_020  74.3399  53.6232  46.4435  63.4286      l_glv
6         H3v3_197  74.4320  54.6939  56.0669  53.3865      l_glv
146  E_424_171_396  74.4381  54.3478  52.3013  56.5611  l_glv_rep
113  E_424_500_335  74.4446  53.6341  44.7699  66.8750      l_glv
180  E_500_017_396  74.4464  54.0416  48.9540  60.3093  l_glv_rep
335  E_396_321_020  74.4640  54.9407  58.1590  52.0599      l_glv
7         H3v3_396  74.4640  54.9407  58.1590  52.0599      l_glv
15   E_197_424_015  74.4870  54.5064  53.1381  55.9471      l_glv
339  E_321_335_020  74.6320  55.1308  57.3222  53.1008      l_glv
154  E_424_197_396  74.6333  54.6256  51.8828  57.6744  l_glv_rep
289  E_015_396_321  74.6549  55.5133  61.0879  50.8711      l_glv
3         H3v3_017  74.6551  54.8180  53.5565  56.1404      l_glv
284  E_015_197_321  74.6707  55.5766  61.5063  50.6897  l_glv_rep
340  E_321_335_020  74.7015  55.2419  57.3222  53.3074  l_glv_rep
151  E_424_171_020  74.7068  54.6275  50.6276  59.3137      l_glv
290  E_015_396_321  74.7375  55.6818  61.5063  50.8651  l_glv_rep
131  E_424_015_171  74.7425  55.1148  55.2301  55.0000      l_glv
47   E_197_017_171  77.4752  59.5937  55.2301  64.7059      l_glv
218  E_500_396_321  77.5012  59.7345  56.4854  63.3803  l_glv_rep
112  E_424_500_321  77.5472  59.6811  54.8117  65.5000  l_glv_rep
245  E_017_171_321  77.5506  60.0423  59.4142  60.6838      l_glv
217  E_500_396_321  77.5810  59.8670  56.4854  63.6792      l_glv
246  E_017_171_321  77.6334  60.2105  59.8326  60.5932  l_glv_rep
242  E_017_171_197  77.6374  59.8639  55.2301  65.3465  l_glv_rep
241  E_017_171_197  77.7190  60.0000  55.2301  65.6716      l_glv
229  E_017_015_171  77.7833  60.4255  59.4142  61.4719      l_glv
240  E_017_015_020  78.0222  60.8511  59.8326  61.9048  l_glv_rep
204  E_500_171_321  78.1480  60.8108  56.4854  65.8537  l_glv_rep
203  E_500_171_321  78.1480  60.8108  56.4854  65.8537      l_glv
239  E_017_015_020  78.2611  61.2766  60.2510  62.3377      l_glv
```
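The `E_*` rows are ensembles whose names join the ids of their member models. As defined later in this notebook, `EnsembleWrapper.predict` averages the members' positive-class probabilities and thresholds the mean at 0.5. The snippet below is a minimal, dependency-free sketch of that averaging step. The helper name `soft_vote` and all probability values are hypothetical and only illustrate the idea; they are not outputs of the models in the table.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model positive-class probabilities, then threshold.

    prob_lists: one list of probabilities per model, all of equal length.
    """
    n_models = len(prob_lists)
    n_examples = len(prob_lists[0])
    labels = []
    for i in range(n_examples):
        mean_prob = sum(probs[i] for probs in prob_lists) / n_models
        labels.append(int(mean_prob >= threshold))
    return labels


# Hypothetical probabilities from three models for four word pairs.
model_a = [0.9, 0.4, 0.2, 0.6]
model_b = [0.8, 0.7, 0.1, 0.3]
model_c = [0.7, 0.6, 0.4, 0.3]

ensemble_labels = soft_vote([model_a, model_b, model_c])
single_labels = soft_vote([model_a])
print(ensemble_labels, single_labels)
```

In this toy case the ensemble overrides `model_a` on the second and fourth pairs, because the other two models disagree with it strongly enough to move the mean across the threshold.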

##### The following cell presents the utility functions used by the main building blocks.

In [37]:

###############################################################################
###############################################################################
###############################################################################
####                                                                       ####
####                       Utility code section                            ####
####                                                                       ####
###############################################################################
###############################################################################
###############################################################################
import torch as th
from datetime import datetime as dt
from sklearn.linear_model import LogisticRegression
from collections import OrderedDict
from time import time
import textwrap
from sklearn.metrics import classification_report
from itertools import combinations
import vsm

lst_log = []
_date = dt.now().strftime("%Y%m%d_%H%M")
log_fn = "logs/" + _date + "_log.txt"

def P(s=''):
  lst_log.append(s)
  print(s, flush=True)
  try:
    with open(log_fn, 'w') as f:
      for item in lst_log:
        f.write("{}\n".format(item))
  except OSError:
    pass  # file logging is best-effort (e.g. the 'logs/' folder may be missing)
  return

def Pr(s=''):
  print('\r' + str(s), end='', flush=True)


def get_object_params(obj, n=None):
  """
  Parameters
  ----------
  obj : any type
    the inspected object.
  n : int, optional
    the number of params that are returned. The default is None
    (all params returned).
  Returns
  -------
  out_str : str
    the description of the object 'obj' in terms of parameters values.
  """
  
  out_str = obj.__class__.__name__+"("
  n_added_to_log = 0
  for _iter, (prop, value) in enumerate(vars(obj).items()):
    if type(value) in [int, float, bool]:
      out_str += prop+'='+str(value) + ','
      n_added_to_log += 1
    elif type(value) in [str]:
      out_str += prop+"='" + value + "',"
      n_added_to_log += 1
    
    if n is not None and n_added_to_log >= n:
      break
  #endfor
  
  out_str = out_str[:-1] if out_str[-1]==',' else out_str
  out_str += ')'
  return out_str  

  
def prepare_grid_search(params_grid, valid_fn, nr_trials):
  import itertools


  params = []
  values = []
  for k in params_grid:
    params.append(k)
    assert type(params_grid[k]) is list, 'All grid-search params must be lists. Error: {}'.format(k)
    values.append(params_grid[k])
  combs = list(itertools.product(*values))
  n_options = len(combs)
  grid_iterations = []
  for i in range(n_options):
    comb = combs[i]
    func_kwargs = {}
    for j,k in enumerate(params):
      func_kwargs[k] = comb[j]
    grid_iterations.append(func_kwargs)
  P("Filtering {} grid-search options...".format(len(grid_iterations)))
  cleaned_iters = [x for x in grid_iterations if valid_fn(x)]
  n_options = len(cleaned_iters)
  idxs = np.arange(n_options)
  np.random.shuffle(idxs)
  idxs = idxs[:nr_trials]
  P("Generated {} random grid-search iters out of a total of {} iters".format(
      len(idxs), n_options))
  return [cleaned_iters[i] for i in idxs]


def add_res(dct, model_name, score, **kwargs):
  if 'MODEL' not in dct:
    dct['MODEL'] = []
  if 'SCORE' not in dct:
    dct['SCORE'] = []
  n_existing = len(dct['MODEL'])
  dct['MODEL'].append(model_name)
  dct['SCORE'].append(score)
  for key in kwargs:
    if key not in dct:
      dct[key] = ['-' ] * n_existing
    dct[key].append(kwargs[key])
  for k in dct:
    if len(dct[k]) < (n_existing + 1):
      dct[k] = dct[k] + [' '] * ((n_existing + 1) - len(dct[k]))
  return dct


def maybe_add_top_model(top_models, model, score, k=5):
  if len(top_models) < k:
    top_models.append([model, score])
  else:
    for i in range(k):
      if top_models[i][1] < score:
        for jj in range(k-1, i, -1):
          top_models[jj] = top_models[jj-1].copy()
        top_models[i][0] = model
        top_models[i][1] = score
        break          
  return sorted(top_models, key=lambda x: x[1], reverse=True)
       

###############################################################################
###############################################################################
###############################################################################
####                                                                       ####
####                      END utility code section                         ####
####                                                                       ####
###############################################################################
###############################################################################
###############################################################################
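The random sub-sampling done by `prepare_grid_search` above can be sketched with the standard library alone: enumerate the Cartesian product of the grid, drop invalid configurations, shuffle, and keep the first `nr_trials`. The names `sample_grid`, `toy_grid`, and `is_valid` are hypothetical, and the seeded `random.Random` is an assumption made here for reproducibility (the notebook uses `np.random.shuffle`). The validity rule mirrors the constraint, enforced later in `WordEntailClassifier`, that contrastive loss requires reversed targets.

```python
import itertools
import random


def sample_grid(params_grid, valid_fn, nr_trials, seed=0):
    """Return up to `nr_trials` random valid configurations from the grid."""
    keys = list(params_grid)
    # Full Cartesian product of all option lists, one dict per combination.
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(params_grid[k] for k in keys))
    ]
    valid = [cfg for cfg in combos if valid_fn(cfg)]
    rng = random.Random(seed)  # assumption: fixed seed for repeatable sampling
    rng.shuffle(valid)
    return valid[:nr_trials]


# Hypothetical toy grid, much smaller than the real one.
toy_grid = {
    "loss": ["bce", "cl"],
    "rev": [True, False],
    "lr": [0.005, 0.001],
}


def is_valid(cfg):
    # Contrastive loss ('cl') is only allowed with reversed targets.
    return cfg["loss"] != "cl" or cfg["rev"]


trials = sample_grid(toy_grid, is_valid, nr_trials=4)
for cfg in trials:
    print(cfg)
```

Filtering before sampling means that `nr_trials` counts only runnable configurations, so no trial is wasted on a combination the classifier would reject.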


### The following section presents the main building blocks previously described 

This includes modules for the custom losses used in the models, as well as special-purpose modules used in the proposed architectures.
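To make the effect of one of these custom losses concrete, here is a scalar, dependency-free sketch of the computation performed by `FocalLossWithLogits_A`: the binary cross-entropy is scaled by `alpha * (1 - pt) ** gamma`, where `pt = exp(-BCE)` is the probability assigned to the true class. The helper names `bce_with_logit` and `focal_loss` and the example logits are hypothetical. The defaults `alpha=4` and `gamma=2` are taken from that class.

```python
import math


def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy for a single raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def focal_loss(logit, target, alpha=4.0, gamma=2.0):
    """Focal loss for one example: down-weights well-classified examples."""
    bce = bce_with_logit(logit, target)
    pt = math.exp(-bce)  # probability the model assigns to the true class
    return alpha * (1.0 - pt) ** gamma * bce


# A confidently correct positive versus a confidently wrong positive.
easy = focal_loss(logit=3.0, target=1.0)
hard = focal_loss(logit=-3.0, target=1.0)
print("easy example loss:", easy)
print("hard example loss:", hard)
```

Because `(1 - pt) ** gamma` is close to zero for confident correct predictions, the easy example contributes almost nothing, and training focuses on the hard (often minority-class) examples. With `gamma=0` and `alpha=1` the expression reduces to plain binary cross-entropy.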

In [38]:
def calc_label_distrib(data_split):
  info = pd.DataFrame(data_split)[1].value_counts()
  P(info)
  return info 


def maybe_find_glove_replacement(w):
  w = w.lower()
  lw = len(w)
  if lw < 4:
    return None
  else:   
    nw1 = ''
    found = False
    for i in range(lw//2 + 2):
      nw = w[:-(i+1)]
      if nw in GLOVE:
        nw1 = nw
        found = True
        break
    nw2 = ''
    for i in range(lw//2 + 2):
      nw = w[(i+1):]
      if nw in GLOVE:
        nw2 = nw
        found = True
        break
    if found:
      return nw1 if len(nw1) > len(nw2) else nw2
    return None

def l_glv_rep(w):    
  """Return the GloVe vector of lowercased `w` if available, else the vector
  of a replacement substring plus tiny uniform noise (the noise vector alone,
  which is close to zero, if nothing is found)."""
  w = w.lower()
  if w in GLOVE:
    return GLOVE[w]
  else:
    nw = maybe_find_glove_replacement(w)
    v = np.random.uniform(low=-1e-7, high=1e-7, size=GLOVE_DIM)
    if nw is not None:
      v = v + GLOVE[nw]
    return v
    
def l_glv(w):    
  """Return the GloVe vector of lowercased `w` if available, else return
  a zeros vector."""
  return GLOVE.get(w.lower(), np.zeros(GLOVE_DIM))

  
def concat(u, v):
  return np.concatenate((u,v))

def summar(u, v):
  return u + v

def arr(u,v):
  return np.array((u,v))


def test_glove_vs_data(trn, dev):
  train_words = set()
  for x in trn:
    train_words.add(x[0][0])
    train_words.add(x[0][1])

  dev_words = set()
  for x in dev:
    dev_words.add(x[0][0])
    dev_words.add(x[0][1])
  
  glove_train = [x for x in trn if (x[0][0] in GLOVE) and (x[0][1] in GLOVE)]
  glove_dev = [x for x in dev if (x[0][0] in GLOVE) and (x[0][1] in GLOVE)]
  out_train = [x for x in trn if (x[0][0] not in GLOVE) or (x[0][1] not in GLOVE)]
  out_dev = [x for x in dev if (x[0][0] not in GLOVE) or (x[0][1] not in GLOVE)]
  
  P("\nGlove train: {} ({:.1f}%)".format(len(glove_train), len(glove_train)/len(trn)*100))
  P("\nGlove dev: {} ({:.1f}%)".format(len(glove_dev), len(glove_dev)/len(dev)*100))
  
  miss_train = [x.lower() for x in train_words if x.lower() not in GLOVE] 
  miss_dev = [x.lower() for x in dev_words if x.lower() not in GLOVE]
  P("\nTrain has {} words that are not in GLOVE: {}...".format(len(miss_train), miss_train[:5]))
  positives = 0
  negatives = 0
  for i, w in enumerate(miss_train):
    for x in trn:
      if x[0][0] == w or x[0][1] == w:
        if x[1]:
          positives += 1
        else:
          negatives += 1
        if x[1] == 1 and positives < 5:
          P("  {}".format(x))
        if x[1] == 0 and negatives < 5:
          P("  {}".format(x))
  P("\nPos vs neg: {} vs {}".format(positives, negatives))
  P("\n\nDev has {} words that are not in GLOVE: {}...".format(len(miss_dev), miss_dev[:5]))
  positives = 0
  negatives = 0
  for i, w in enumerate(miss_dev):
    for x in dev:
      if x[0][0] == w or x[0][1] == w:
        if x[1]:
          positives += 1
        else:
          negatives += 1
        if x[1] == 1 and positives < 5:
          P("  {}".format(x))
        if x[1] == 0 and negatives < 5:
          P("  {}".format(x))
  P("\nPos vs neg: {} vs {}".format(positives, negatives))
  res = {
      'glove_train' : glove_train,
      'glove_dev' : glove_dev,
      'out_train' : out_train,
      'out_dev' : out_dev
      }
  return res, (miss_train, miss_dev)


class ConstrativeLoss(th.nn.Module):
  def __init__(self, margin=0.2):
    super(ConstrativeLoss, self).__init__()
    self.margin = margin
    
    
  def forward(self, dist, gold):
    th_d_sq = th.pow(dist, 2)
    th_d_sqm = th.pow(th.clamp(self.margin - dist, 0), 2)
    loss = (1 - gold) * th_d_sq + gold * th_d_sqm
    return loss.mean()
  
  def __repr__(self):
    s = self.__class__.__name__ + "(margin={})".format(
        self.margin,
        )
    return s  


class FocalLossWithLogits_B(th.nn.Module):
  def __init__(self, alpha=0.3, gamma=2):
    super().__init__()
    assert alpha <= 0.9 and alpha >= 0.1
    self.alpha = alpha
    self.gamma = gamma

  def forward(self, inputs, targets):
    pos_alpha = self.alpha
    neg_alpha = 1 - self.alpha
    eps = 1e-14
    
    y_pred = th.sigmoid(inputs)
    
    # Predicted probability at positive targets; other positions are filled
    # with 1 so that their (1 - pt) factor and log(pt) term vanish.
    pos_pt = th.where(targets==1 , y_pred , th.ones_like(y_pred))
    # Predicted probability at negative targets; other positions are filled
    # with 0 so that their term vanishes as well.
    neg_pt = th.where(targets==0 , y_pred , th.zeros_like(y_pred))
    
    pos_pt = th.clamp(pos_pt, eps, 1 - eps)
    neg_pt = th.clamp(neg_pt, eps, 1 - eps)
    
    # Positive modulating factor: approaches zero for confident, correct positives.
    pos_modulating = th.pow(1-pos_pt, self.gamma)
    # Negative modulating factor: approaches zero for confident, correct negatives.
    neg_modulating = th.pow(neg_pt, self.gamma)
    
    pos = - pos_alpha * pos_modulating * th.log(pos_pt) # positive part
    neg = - neg_alpha * neg_modulating * th.log(1 - neg_pt) # negative part
    
    loss = pos + neg  # per-element loss; reduced with the mean below
    
    return th.mean(loss)
      
  def __repr__(self):
    s = self.__class__.__name__ + "(alpha={}, gamma={})".format(
        self.alpha,
        self.gamma,
        )
    return s




class FocalLossWithLogits_A(th.nn.Module):
  def __init__(self, alpha=4, gamma=2):
    super().__init__()
    self.alpha = alpha
    self.gamma = gamma

  def forward(self, inputs, targets):
    BCE_loss = th.nn.functional.binary_cross_entropy_with_logits(
        inputs, 
        targets, 
        reduction='none',
        )

    pt = th.exp(-BCE_loss)
    F_loss = self.alpha * th.pow(1 - pt, self.gamma) * BCE_loss
    return th.mean(F_loss)
      
  def __repr__(self):
    s = self.__class__.__name__ + "(alpha={}, gamma={})".format(
        self.alpha,
        self.gamma,
        )
    return s


class InputPlaceholder(th.nn.Module):
  def __init__(self, input_dim):
    super().__init__()
    self.input_dim = input_dim
    
  def forward(self, inputs):
    return inputs
  
  def __repr__(self):
    s = self.__class__.__name__ + "(input_dim={})".format(
        self.input_dim,
        )
    return s
  
class L2_Normalizer(th.nn.Module):
  def __init__(self,):
    super().__init__()
    
  def forward(self, inputs):
    return th.nn.functional.normalize(inputs, p=2, dim=1)

class PathsCombiner(th.nn.Module):
  def __init__(self, input_dim, method, activ=None):
    super().__init__()
    self.method = method
    self.input_dim = input_dim
    self.activ = self.get_activation(activ)
    if method in ['sub','abs','sqr', 'add']:
      self.output_dim = input_dim
    elif method == 'cat':
      self.output_dim = input_dim * 2
    elif method == 'eucl':
      self.output_dim = 1
    else:
      raise ValueError("Unknown combine method '{}'".format(method))
    return
    
  def forward(self, paths):
    path1 = paths[0]
    path2 = paths[1]
#    if self.norm_each:
#      path1 = th.nn.functional.normalize(path1, p=2, dim=1)
#      path2 = th.nn.functional.normalize(path2, p=2, dim=1)
      
    if self.method == 'sub':
      th_x = path1 - path2
    elif self.method == 'add':
      th_x = path1 + path2
    elif self.method == 'cat':
      th_x = th.cat((path1, path2), dim=1)
    elif self.method == 'abs':
      th_x = (path1 - path2).abs()
    elif self.method == 'sqr':
      th_x = th.pow(path1 - path2, 2)
    elif self.method == 'eucl':
      th_x = th.pairwise_distance(path1, path2, keepdim=True)    
    
    if self.activ is not None:
      th_x = self.activ(th_x)
      
    return th_x
  
  def get_activation(self, act):
    if act == 'relu':
      return th.nn.ReLU()
    elif act == 'tanh':
      return th.nn.Tanh()
    elif act == 'selu':
      return th.nn.SELU()
    elif act == 'sigmoid':
      return th.nn.Sigmoid()
    else:
      return None
  
  
  def __repr__(self):
    s = self.__class__.__name__ + "(input_dim={}x2, output_dim={}, method='{}', act={})".format(
        self.input_dim,
        self.output_dim,
        self.method,
        self.activ,
        )
    return s

class ThWordEntailModel(th.nn.Module):
  def __init__(self,
               input_dim,
               siam_lyrs,
               siam_norm,
               siam_bn,
               comb_activ,
               separate_paths,
               layers,
               input_drop,
               other_drop,
               bn_inputs,
               bn,
               smethod,
               loss_type,
               device,
               activ='relu',
               ):
    super().__init__()
    self.device = device
    self.has_input_drop = input_drop
    self.has_input_bn = bn_inputs
    self.smethod = smethod
    self.separate = separate_paths
    self.loss_type = loss_type
    self.siam_norm = siam_norm
    self.siam_bn = siam_bn
    self.comb_activ = comb_activ
    
    if self.loss_type == 'cl' and (layers != [] or siam_lyrs == [] or separate_paths):
      raise ValueError("Cannot have siamese nets with CL with this config: layers={}  siam_lyrs={} sep={}".format(
          layers, siam_lyrs, separate_paths))
      
    if other_drop != 0 and layers == [] and siam_lyrs == []:
      raise ValueError("Cannot have dropout on no layers...")

    if self.separate:
      paths = [[],[]]
    else:
      paths = [[]]
    self.path_input = input_dim
    
    for path_no in range(len(paths)):
      last_output = self.path_input
      paths[path_no].append(InputPlaceholder(self.path_input))
      if input_drop > 0:
        paths[path_no].append(th.nn.Dropout(input_drop))
      if bn_inputs:
        paths[path_no].append(th.nn.BatchNorm1d(last_output))
      if len(siam_lyrs) > 0:
        for i, layer in enumerate(siam_lyrs):
          paths[path_no].append(th.nn.Linear(last_output, layer, bias=not self.siam_bn))
          if self.siam_bn:
            paths[path_no].append(th.nn.BatchNorm1d(layer))
          if i < (len(siam_lyrs) - 1):
            paths[path_no].append(self.get_activation(activ))
            if other_drop > 0:
              paths[path_no].append(th.nn.Dropout(other_drop))
          last_output = layer
      if self.siam_norm:
        paths[path_no].append(L2_Normalizer())
    if self.separate :
      self.path1_layers = th.nn.ModuleList(paths[0])
      self.path2_layers = th.nn.ModuleList(paths[1])
    else:
      self.siam_layers = th.nn.ModuleList(paths[0])
      
    
    siam_combine = PathsCombiner(
        last_output, 
        method=self.smethod, 
        activ=self.comb_activ,
        )
    last_output = siam_combine.output_dim
    post_lyrs = [siam_combine]    
    if self.loss_type != 'cl':
      for i, layer in enumerate(layers):
        post_lyrs.append(th.nn.Linear(last_output, layer, bias=not bn))
        if bn:
          post_lyrs.append(th.nn.BatchNorm1d(layer))
        post_lyrs.append(self.get_activation(activ))
        if other_drop > 0:
          post_lyrs.append(th.nn.Dropout(other_drop))
        last_output = layer
      post_lyrs.append(th.nn.Linear(last_output, 1))
    self.post_layers = th.nn.ModuleList(post_lyrs)
    return
  
  def forward(self, inputs):
    th_path1 = inputs[:,0]
    th_path2 = inputs[:,1]

    if self.separate:
      if len(self.path1_layers) > 0:
        for th_layer in self.path1_layers:
          th_path1 = th_layer(th_path1)
        for th_layer in self.path2_layers:
          th_path2 = th_layer(th_path2)
    else:
      if len(self.siam_layers) > 0:
        for th_layer in self.siam_layers:
          th_path1 = th_layer(th_path1)
        for th_layer in self.siam_layers:
          th_path2 = th_layer(th_path2)

    
    th_x = (th_path1, th_path2)    
    # first layer in post-siam must be the combination layer
    for layer in self.post_layers:
      th_x = layer(th_x)    
    return th_x
    
  
  def get_activation(self, act):
    if act == 'relu':
      return th.nn.ReLU()
    elif act == 'tanh':
      return th.nn.Tanh()
    elif act == 'selu':
      return th.nn.SELU()
    elif act == 'sigmoid':
      return th.nn.Sigmoid()
    else:
      raise ValueError("Unknown activation function '{}'".format(act))
  
      
      


class WordEntailClassifier():
  def __init__(self, 
               model_name,  # model name
               siam_lyrs,   # layers of the siamese or the individual word encoders
               s_l2,        # application of L2 normalization on siamese/paths
               s_bn,        # BN in siams/paths
               c_act,       # apply activation on siam/paths combiner
               separ,       # use separate paths for each word
               layers,      # layers of the final classifier dnn
               inp_drp,     # apply drop on inputs
               o_drp,       # apply drop on each fc
               bn,          # apply BN on each linear layer in final dnn
               bn_inp,      # apply BN on inputs
               activ,       # activation (all)
               x_dev,       # x_dev for early stop w. lr decay
               y_dev,       # y_dev for early stop w. lr decay
               s_comb,      # method for combining siams/paths
               rev,         # reverse targets during training/predict
               lr,          # starting lr
               loss,        # loss function name for CL/BCE/FL
               bal,         # apply sample balancing during training
               VF,
               
               cl_m=1,        # if using CL this is margin
               fl_g=2,        # focal loss discount exponent
               lr_decay=0.5,  # lr decay factor
               batch=256,     # batch size
               l2_strength=0, # l2 weight decay
               max_epochs=10000,  # not really used
               max_patience=10,   # maximum patience before reload & lr decay  
               max_fails=40,      # max consecutive fails before stop
               optim=th.optim.Adam,
               
               device=th.device("cuda" if th.cuda.is_available() else "cpu"),
               ):
    self.model = None
    self.model_name = model_name
    self.layers = layers
    self.siam_lyrs = siam_lyrs
    self.input_drop = inp_drp
    self.other_drop = o_drp
    self.bn = bn
    self.bn_inputs = bn_inp
    self.activ=activ    
    self.max_epochs = max_epochs
    self.x_dev = x_dev
    self.y_dev = np.array(y_dev)
    self.batch_size = batch
    self.max_patience = max_patience
    self.optimizer = optim
    self.siamese_method = s_comb
    self.reverse_target = rev
    self.device = device
    self.lr = lr
    self.lr_decay = lr_decay
    self.l2_strength = l2_strength
    self.max_fails = max_fails
    self.separate_paths = separ
    self.loss_type = loss
    self.siam_norm = s_l2
    self.siam_bn = s_bn
    self.comb_activ = c_act
    self.margin = cl_m
    self.use_balancing = bal
    self.focal_loss_alpha = 0.3 if loss == 'flb' else 4
    self.focan_loss_gamma = fl_g
    self.vector_func_name = VF if type(VF) == str else VF.__name__
    if loss == 'cl' and not rev:
      raise ValueError("CL must receive reversed targets")
    return
  
  
  def define_graph(self):
    if not hasattr(self, 'input_dim'):
      self.input_dim = GLOVE_DIM
    model = ThWordEntailModel(
        input_dim=self.input_dim,
        siam_lyrs=self.siam_lyrs,
        separate_paths=self.separate_paths,
        layers=self.layers,
        input_drop=self.input_drop,
        other_drop=self.other_drop,
        bn=self.bn,
        bn_inputs=self.bn_inputs,
        activ=self.activ,
        device=self.device,
        smethod=self.siamese_method,
        siam_norm=self.siam_norm,
        loss_type=self.loss_type,
        siam_bn=self.siam_bn,
        comb_activ=self.comb_activ,
        )
    return model    
      
  
  def fit(self, X, y):
    utils.fix_random_seeds()
    # Data prep:
    X = np.array(X).astype(np.float32)
    n_obs = X.shape[0]
    # Trick for siamese models: entailing pairs should end up at minimal
    # distance, so the targets are flipped (entailment becomes 0).
    np_y = np.array(y).reshape(-1,1)
    if self.reverse_target:
      np_y = 1 - np_y
    self.input_dim = X.shape[-1]
    X = th.tensor(X, dtype=th.float32)
    y = th.tensor(np_y, dtype=th.float32)
    dataset = th.utils.data.TensorDataset(X, y)
    
    sampler = None
    if self.use_balancing:
      cls_0 = (np_y == 0).sum()
      cls_1 = np_y.shape[0] - cls_0
      cls_weights = 1 / np.array([cls_0, cls_1])
      weights = [cls_weights[i] for i in np_y.ravel()]
      sampler = th.utils.data.sampler.WeightedRandomSampler(weights, num_samples=len(weights))
      
    dataloader = th.utils.data.DataLoader(
        dataset, 
        batch_size=self.batch_size, 
        shuffle=sampler is None,
        pin_memory=True,
        sampler=sampler,
        )
    # Optimization:
    if self.loss_type == 'bce':
      loss = th.nn.BCEWithLogitsLoss()
    elif self.loss_type == 'cl':
      loss = ConstrativeLoss(margin=self.margin)
    elif 'fl' in self.loss_type:
      if self.loss_type == 'flb':
        loss = FocalLossWithLogits_B(
            alpha=self.focal_loss_alpha,
            gamma=self.focan_loss_gamma,
            )
      else:
        loss = FocalLossWithLogits_A(
            alpha=self.focal_loss_alpha,
            gamma=self.focan_loss_gamma,
            )        
    else:
      raise ValueError('unknown loss {}'.format(self.loss_type))
    if self.model is None:
      self.model = self.define_graph()
      P("Initialized model:")
      self.print_model()
      P("  Loss: {}\n\n".format(str(loss) if loss.__class__.__name__ != 'method' else loss.__name__))

    else:
      P("\rFitting already loaded model...\t\t\t")
    self.model.to(self.device)
    self.model.train()
    optimizer = self.optimizer(
        self.model.parameters(),
        lr=self.lr,
        weight_decay=self.l2_strength)
    # Train:
    patience = 0
    fails = 0
    best_f1 = 0
    best_fn = ''
    best_epoch = -1
    self.errors = []
    not_del_fns = []
    for epoch in range(1, self.max_epochs+1):
      epoch_error = 0.0
      for batch_iter, (X_batch, y_batch) in enumerate(dataloader):
        X_batch = X_batch.to(self.device, non_blocking=True)
        y_batch = y_batch.to(self.device, non_blocking=True)
        batch_preds = self.model(X_batch)
        err = loss(batch_preds, y_batch)
        epoch_error += err.item()
        optimizer.zero_grad()
        err.backward()
        optimizer.step()
        Pr("Training epoch {} - {:.1f}% - Patience {}/{},  Fails {}/{}\t\t\t\t\t".format(
            epoch, 
            (batch_iter + 1) / (n_obs // self.batch_size + 1) * 100,
            patience, self.max_patience,
            fails, self.max_fails))
      # end epoch
      predictions = self.predict(self.x_dev)
      macrof1 = utils.safe_macro_f1(self.y_dev, predictions)
      # resume training
      self.model.train()
      if macrof1 > best_f1:
        patience = 0
        fails = 0
        last_best_fn = best_fn
        best_fn = "models/{}_e{:03}_F{:.4f}.th".format(
            self.model_name, epoch, macrof1)
        best_epoch = epoch
        P("\rFound new best macro-f1 {:.4f} > {:.4f} at epoch {}. \t\t\t".format(macrof1, best_f1, epoch))
        best_f1 = macrof1
        th.save(self.model.state_dict(), best_fn)
        th.save(optimizer.state_dict(), best_fn + '.optim')
        if last_best_fn != '':
          try:
            os.remove(last_best_fn)
            os.remove(last_best_fn + '.optim')
          except:
            not_del_fns.append(last_best_fn)
            not_del_fns.append(last_best_fn + '.optim')
      else:
        patience += 1
        fails += 1
        Pr("Finished epoch {}. Current score {:.3f} < {:.3f}. Patience {}/{},  Fails {}/{}".format(
            epoch, macrof1, best_f1, patience, self.max_patience, fails, self.max_fails))
        if patience > self.max_patience:
          lr_old = optimizer.param_groups[0]['lr'] 
          lr_new = lr_old * self.lr_decay
          self.model.load_state_dict(th.load(best_fn))
          optimizer.load_state_dict(th.load(best_fn + '.optim'))
          for param_group in optimizer.param_groups:
            param_group['lr'] = lr_new
          P("\nPatience reached {}/{}  -  reloaded from ep {} reduced lr from {:.1e} to {:.1e}".format(
              patience, self.max_patience, best_epoch, lr_old, lr_new))
          patience = 0
          
        if fails > self.max_fails:
          P("\nMax fails {}/{} reached!".format(fails, self.max_fails))
          break
          
      self.errors.append(epoch_error)
    # end all epochs
    if best_fn != '':
      P("Loading model from epoch {} with macro-f1 {:.4f}".format(best_epoch, best_f1))
      self.model.load_state_dict(th.load(best_fn))        
      not_del_fns.append(best_fn + '.optim')
      if best_f1 < 0.67:
        P("  Removing '{}'".format(best_fn))        
        not_del_fns.append(best_fn)
      else:
        os.rename(best_fn, "models/{}.th".format(self.model_name))
    P("  Cleaning after fit...")
    atmps = 0
    while atmps < 10 and len(not_del_fns) > 0:   
      atmps += 1  # bound the retries so an undeletable file cannot loop forever
      removed = []
      for fn in not_del_fns:
        if os.path.isfile(fn):
          try:
            os.remove(fn)
            removed.append(fn)
            P("  Removed '{}'".format(fn))
          except:
            pass
        else:
          removed.append(fn)            
      not_del_fns = [x for x in not_del_fns if x not in removed]
    return self
  

  def predict_proba(self, X):
    self.model.eval()
    with th.no_grad():
      self.model.to(self.device)
      X = th.tensor(X, dtype=th.float).to(self.device)
      preds = self.model(X)
      if self.loss_type in ['bce', 'fl', 'fla', 'flb']:
        result = th.sigmoid(preds).cpu().numpy().ravel()
        if self.reverse_target:
          result = 1 - result
      else:
        dists = preds.cpu().numpy().ravel()
        result = self._dist_to_proba(dists, eps=0.52)
        
      return result


  def predict(self, X):
    probs = self.predict_proba(X)
    classes = (probs >= 0.5).astype(np.int8)

    return classes
  
  def print_model(self, indent=2):
    P("{}Model name: {}".format(indent * " ",self.model_name))
    P(textwrap.indent(str(self.model), " " * indent))
    P("{}Vector func: {}".format(indent * " ", self.vector_func_name))
    P("{}Trained on {} data.".format(indent * " ", "BALANCED" if self.use_balancing else "raw unbalanced"))
    return
  
  
  def _dist_to_proba(self, y_pred, eps):
    s = -0.75 * y_pred / eps + 1.75
    d = -0.25 * y_pred / eps + 0.25
    sgn = ((eps - y_pred) > 0) + 0 - ((eps - y_pred) < 0)
    
    yproba = ((s + sgn * d) / 2).clip(0)
    return yproba
  
  def save(self, label=None):
    if label is None:
      label = self.model_name    
    fn = 'models/{}.th'.format(label)
    th.save(self.model.state_dict(), fn)
    P("Saved '{}'".format(fn))
    return
  
  def load(self, label=None):
    if label is None:
      label = self.model_name    
    fn = 'models/{}.th'.format(label)
    if not os.path.isfile(fn):
      raise ValueError("Model file '{}' not found!".format(fn))
    if self.model is None:
      self.model = self.define_graph()
    self.model.load_state_dict(th.load(fn))
    P("Loaded model from '{}'".format(fn))
    return
  

  def has_saved(self, label=None):
    if label is None:
      label = self.model_name    
    fn = 'models/{}.th'.format(label)
    return os.path.isfile(fn)
    

  
  
  
  
class EnsembleWrapper():
  def __init__(self, models, vector_func_name, verbose=False):
    self.models = models
    self.vector_func_name = vector_func_name
    names = [x.model_name.split('_')[1] for x in self.models]
    self.model_name = "E_" + "_".join(names)
    if verbose:
      P("Initialized ensemble '{}' with {} models using vectorizer '{}':".format(
          self.model_name, len(self.models), self.vector_func_name))
      self.print_models()
            
  def print_models(self):
    P("Ensemble '{}' architecture:".format(self.model_name))
    P("  " + "-"* 80)
    for i,model in enumerate(self.models):
      P("  Model {}/{}".format(i+1, len(self.models)))
      model.print_model()
      P("  " + "-"* 80)
  
  def predict(self, x):
    preds = []
    for model in self.models:
      model_preds = model.predict_proba(x)
      preds.append(model_preds)        
    final_preds_cat = np.vstack(preds).T
    final_preds = final_preds_cat.mean(axis=1)
    return (final_preds >= 0.5).astype(np.uint8)

    
  def fit(self, x, y):
    P("****** Ensemble model passing fit call ******")
    return self
  
  
    




def get_baselines(dct_res, trn, dev):
  baseline_model_facts = {
      "BaseLR_C6L2": lambda: LogisticRegression(fit_intercept=True, 
                                                solver='liblinear', 
                                                multi_class='auto',
                                                C=0.6,
                                                penalty='l2'),
      "BaseLR_C4L1": lambda: LogisticRegression(fit_intercept=True, 
                                                solver='liblinear', 
                                                multi_class='auto',
                                                C=0.4,
                                                penalty='l1'),
      "BaseNN_50" : lambda: TorchShallowNeuralClassifier(hidden_dim=50, eta=0.005),
      "BaseNN_150" : lambda: TorchShallowNeuralClassifier(hidden_dim=150, eta=0.005),
      "BaseNN_300" : lambda: TorchShallowNeuralClassifier(hidden_dim=300, eta=0.005),
  }
  
  baseline_vector_combo_funcs = [
      concat, 
  #    summar,
      ]
  for vf in [l_glv, l_glv_rep]:
    for vcf in baseline_vector_combo_funcs:
      for model_name in baseline_model_facts:
        P("=" * 70)
        P("Running baseline model '{}' with '{}'".format(
            model_name, vcf.__name__))
        model = baseline_model_facts[model_name]()
        x_d, y_d = nli.word_entail_featurize(
            data=dev, 
            vector_func=vf, 
            vector_combo_func=vcf
            )
        res = nli.wordentail_experiment(
            train_data=trn,
            assess_data=dev,
            vector_func=vf,
            vector_combo_func=vcf,
            model=model,
            )
        score = res['macro-F1']
        y_pred = model.predict(x_d)
        report = classification_report(y_d, y_pred, digits=3, output_dict=True)
        # Compare rounded values: exact float equality is brittle here.
        assert round(score, 3) == round(report['macro avg']['f1-score'], 3)
        P_REC = report['1']['recall']
        add_res(dct_res, model_name, round(score * 100,2), P_REC=round(P_REC*100,2), 
                VF=vf.__name__)
        # Use the dict passed in (dct_res), not the global dct_results.
        df = pd.DataFrame(dct_res).sort_values('SCORE')
        P("\nResults so far:\n{}\n".format(df))
  return dct_res


def run_grid_search(dct_res, trn, dev, non_cl_runs=350, cl_runs=100):
  grids = {
      300: {
          'non_cl' : {
                "siam_lyrs" : [[],[600, 300],[600, 300, 150],],          
                "separ" : [True,False,],      
                "layers" : [[512, 256],[1024, 384, 128]],          
                "inp_drp" : [0.3],  
                "o_drp" : [0.5,],          
                "bn" : [True,False],          
                "bn_inp" : [True,False],          
                "activ" : ['relu',],          
                "lr"  :[0.005,],          
                "s_comb" : ['sqr',],      
                's_l2' : [True,False,],          
                's_bn' :[True,False],          
                'rev' :[True,False,],          
                'loss' : ['bce','fla','flb'],
                'c_act' : [None],                    
                'VF':['l_glv','l_glv_rep',],          
                'bal' : [True,False,],          
                'cl_m' : [None,]        
              },
          'cl' : {
                "siam_lyrs" : [[256, 128],[512, 256],[512, 256, 128],[1024, 512],],          
                "separ" : [False,],      
                "layers" : [[],],          
                "inp_drp" : [0,0.3],  
                "o_drp" : [0.5,],          
                "bn" : [None],          
                "bn_inp" : [True,False],          
                "activ" : ['relu',],          
                "lr"  :[0.0001,],          
                'c_act' : [None,],
                "s_comb" : ['eucl',],      
                's_l2' : [True,],
                's_bn' : [False],          
                'rev' :[True,],          
                'loss' : ['cl',],          
                'VF':['l_glv','l_glv_rep',],          
                'bal' : [True,False,],
              }
          },
      100: {
          'non_cl' : {
                "siam_lyrs" : [[],[200, 100],[200, 100, 50],],          
                "separ" : [True,False,],      
                "layers" : [[368, 64],[512, 256, 64]],          
                "inp_drp" : [0.3],  
                "o_drp" : [0.5,],          
                "bn" : [True,False],          
                "bn_inp" : [True,False],          
                "activ" : ['relu',],          
                "lr"  :[0.005,],          
                "s_comb" : ['sqr',],      
                's_l2' : [True,False,],          
                's_bn' :[True,False],          
                'rev' :[True,False,],          
                'loss' : ['bce','fla','flb'],
                'c_act' : [None],                    
                'VF':['l_glv','l_glv_rep',],          
                'bal' : [True,False,],          
                'cl_m' : [None,]        
              },
          'cl' : {
                "siam_lyrs" : [[128, 64],[256, 128],[256, 128, 64],[512, 256],],          
                "separ" : [False,],      
                "layers" : [[],],          
                "inp_drp" : [0,0.3],  
                "o_drp" : [0.5,],          
                "bn" : [None],          
                "bn_inp" : [True,False],          
                "activ" : ['relu',],          
                "lr"  :[0.0001,],          
                'c_act' : [None,],
                "s_comb" : ['eucl',],      
                's_l2' : [True,],
                's_bn' : [False],          
                'rev' :[True,],          
                'loss' : ['cl',],          
                'VF':['l_glv','l_glv_rep',],          
                'bal' : [True,False,],
              }
          }
      
      }

        
  def filter_func(grid_iter):
    test_contrastive_loss = (
        grid_iter['separ'] or 
        grid_iter['rev'] == False or 
        grid_iter['layers'] != [] or
        grid_iter['siam_lyrs'] == [] or
        grid_iter['s_comb'] != 'eucl'
        )
    if grid_iter['loss'] == 'cl' and test_contrastive_loss:
      return False
    # The grids above name the dropout key 'o_drp'; 'other_drop' is not a grid key.
    if grid_iter['layers'] == [] and grid_iter['siam_lyrs'] == [] and grid_iter['o_drp'] != 0 :
      return False
    if grid_iter['layers'] != [] and grid_iter['s_comb'] == 'eucl':
      return False
    if not grid_iter['separ'] and grid_iter['siam_lyrs'] == []:
      return False
    return True
  dct_main_grid = grids[GLOVE_DIM]
  options1 = prepare_grid_search(dct_main_grid['non_cl'], valid_fn=filter_func, nr_trials=non_cl_runs)
  options2 = prepare_grid_search(dct_main_grid['cl'], valid_fn=filter_func, nr_trials=cl_runs)
  options = options1 + options2
  options = [options[x] for x in np.random.choice(len(options), size=len(options), replace=False)]
  timings = []
  t_left = np.inf
  top_models = []
  k=3
  last_ensemble = ''
  ver = '4'
  GD = str(GLOVE_DIM)[0]
  for grid_iter, option in enumerate(options):    
    g_type = 'G' if option['VF'] == 'l_glv' else 'R'
    model_name = 'H3_{}{}{}D{:03d}'.format(
        ver, g_type, GD, grid_iter+1)
    P("\n\n" + "=" * 70)
    P("Running grid search iteration {}/{}\n '{}' : {}".format(
        grid_iter+1, len(options), model_name, option))
#    for k in option:
#      P("  {}={},".format(k,option[k] if type(option[k]) != str else "'" + option[k] + "'"))
    P("  Time left for grid search completion: {:.1f} hrs".format(t_left / 3600))
    VF = option['VF']
    vector_func = globals()[VF]
    _t_start = time()
    #### we need this ...
    x_dev, y_dev = nli.word_entail_featurize(
        data=dev, 
        vector_func=vector_func, 
        vector_combo_func=arr
        )
    model = WordEntailClassifier(
        model_name=model_name,
        x_dev=x_dev,
        y_dev=y_dev,
        **option)
    res = nli.wordentail_experiment(
            train_data=trn,
            assess_data=dev,
            vector_func=vector_func,
            vector_combo_func=arr,
            model=model,
            )
    score = res['macro-F1']
    top_models = maybe_add_top_model(
        top_models=top_models,
        model=model,
        score=score,
        k=k
        )
    y_pred = model.predict(x_dev)
    report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
    assert round(score,3) == round(report['macro avg']['f1-score'],3), "ERROR:  score {} differs from report score {}".format(
        round(score,3), round(report['macro avg']['f1-score'],3))
    P_REC = report['1']['recall']
    ####
    t_res = time() - _t_start
    timings.append(t_res)
    t_left = (len(options) - grid_iter - 1) * np.mean(timings)
    dct_res = add_res(
        dct=dct_res, 
        model_name=model_name, 
        score=round(score * 100, 2), 
        P_REC=round(P_REC * 100, 2),
        **option)
    if len(top_models) >= k:
      P("Testing ensemble so far...")
      ensemble = EnsembleWrapper(
          [x[0] for x in top_models], 
          vector_func_name=VF)
      if last_ensemble != ensemble.model_name:
        last_ensemble = ensemble.model_name
        res = nli.wordentail_experiment(
                train_data=trn,
                assess_data=dev,
                vector_func=vector_func,
                vector_combo_func=arr,
                model=ensemble,
                )
        score = res['macro-F1']
        y_pred = ensemble.predict(x_dev)
        report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
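        assert round(score, 3) == round(report['macro avg']['f1-score'], 3)
        P_REC = report['1']['recall']
        # Log percentages so ensemble rows use the same scale as single-model rows.
        dct_res = add_res(
          dct=dct_res, 
          model_name=ensemble.model_name, 
          score=round(score * 100, 2), 
          P_REC=round(P_REC * 100, 2),
          VF=VF,
          )      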
    df = pd.DataFrame(dct_res).sort_values('SCORE')
    P("Results so far:\n{}".format(df.iloc[-100:]))
    df.to_csv("models/"+_date+"_results.csv")
  # end grid
  return df
      
def vect_neighbors(v, df):
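  # Import the function directly: a bare `import scipy` does not guarantee
  # that the `scipy.spatial.distance` submodule is loaded.
  from scipy.spatial.distance import cosine
  dists = df.apply(lambda x: cosine(v, x), axis=1)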
  return dists.sort_values().head()


def ensemble_test(clfs, trn, dev, vect_func):
  res = {'MODEL':[],'SCORE':[]}

  x_trn, y_trn = nli.word_entail_featurize(
      data=trn, 
      vector_func=vect_func, 
      vector_combo_func=arr
      )
  x_dev, y_dev = nli.word_entail_featurize(
      data=dev, 
      vector_func=vect_func, 
      vector_combo_func=arr
      )
  
  for model in clfs:
    P("Testing model '{}' on data vectorized with '{}'".format(
        model.model_name, vect_func.__name__))
    y_pred = model.predict(x_dev)
    report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
    res = add_res(
        res, 
        model.model_name, 
        score=round(report['macro avg']['f1-score'] * 100,2),
        pos_f1=report['1']['f1-score'] * 100,
        pos_rc=report['1']['recall'] * 100,
        pos_pr=report['1']['precision'] * 100,
        )
    
  
  ens = EnsembleWrapper(
      clfs, 
      vector_func_name=vect_func.__name__,
      )
  y_pred = ens.predict(x_dev)
  report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
  res = add_res(
      res, 
      ens.model_name, 
      score=round(report['macro avg']['f1-score'] * 100, 2),
      pos_f1=report['1']['f1-score'] * 100,
      pos_rc=report['1']['recall'] * 100,
      pos_pr=report['1']['precision'] * 100,
      )
  df = pd.DataFrame(res).sort_values('SCORE')
  P("Results:\n{}".format(df.iloc[-50:]))
  return df



def ensemble_train_test(dct_models_params, trn, dev, vect_funcs, n_models=None):
  clfs = []
  res = {'MODEL':[],'SCORE':[]}
  for ii, model_name in enumerate(dct_models_params):
    model_params = dct_models_params[model_name]
    c_vect_func = globals()[model_params['VF']]
    P("\n" + "-"*80)
    P("Loading or training ensemble component {}/{}: '{}' with vect_func: '{}'...".format(
        ii+1, len(dct_models_params), model_name, c_vect_func.__name__))
    x_dev, y_dev = nli.word_entail_featurize(
        data=dev, 
        vector_func=c_vect_func, 
        vector_combo_func=arr
        )
      
    clf = WordEntailClassifier(model_name=model_name, 
                               x_dev=x_dev,
                               y_dev=y_dev,
                               **model_params)
    if clf.has_saved():
      clf.load()
    else:
      x_trn, y_trn = nli.word_entail_featurize(
          data=trn, 
          vector_func=c_vect_func, 
          vector_combo_func=arr
          )  
      clf.fit(x_trn, y_trn)
      
    clfs.append(clf)
    y_pred = clf.predict(x_dev)
    report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
    res = add_res(
        res, 
        model_name,         
        score=round(report['macro avg']['f1-score'] * 100,2),
        pos_F1=report['1']['f1-score'] * 100,
        pos_Rec=report['1']['recall'] * 100,
        pos_Pre=report['1']['precision'] * 100,
        vf=c_vect_func.__name__,
        )
    df = pd.DataFrame(res).sort_values('SCORE')
    P("Results:\n{}".format(df))
  
  best_ens = None
  best_mf1 = 0
  if n_models is None:
    lst_n_models = list(range(2,6))
  elif isinstance(n_models, int):
    lst_n_models = [n_models]
  else:
    lst_n_models = n_models

  all_clfs_combs = []
  for n_clfs in lst_n_models:
    all_clfs_combs = all_clfs_combs + list(combinations(clfs, n_clfs))
  P("Testing/searching for best ensemble...")
  for i, selected_clfs in enumerate(all_clfs_combs):    
    for vect_func in vect_funcs:
      Pr(" Testing ensemble {}/{} ({:.2f}%) with {} models and {} vector func\t".format(
          i+1, len(all_clfs_combs), (i+1)/len(all_clfs_combs) * 100, 
          len(selected_clfs), vect_func.__name__))
      x_trn, y_trn = nli.word_entail_featurize(
          data=trn, 
          vector_func=vect_func, 
          vector_combo_func=arr,
          )
      x_dev, y_dev = nli.word_entail_featurize(
          data=dev, 
          vector_func=vect_func, 
          vector_combo_func=arr,
          )
        
      ens = EnsembleWrapper(
          selected_clfs, 
          vector_func_name=vect_func.__name__,
          verbose=False,
          )
      y_pred = ens.predict(x_dev)
      report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
      score = round(report['macro avg']['f1-score'] * 100,2)
      if score > best_mf1:
        best_mf1 = score
        best_ens = ens
      res = add_res(
          res, 
          ens.model_name, 
          score=score,
          pos_F1=report['1']['f1-score'] * 100,
          pos_Rec=report['1']['recall'] * 100,
          pos_Pre=report['1']['precision'] * 100,
          vf=vect_func.__name__,        
          )
  df = pd.DataFrame(res).sort_values('SCORE')
  df.to_csv('models/{}_ensembles.csv'.format(_date))
  P("Results:\n{}".format(df.iloc[-50:]))
  return best_ens, df


def test_model(model, trn, dev, dct_res, model_sufix=''):
  model_name = model.model_name
  VECT_FUNC = globals()[model.vector_func_name]
  P("Testing model '{}' with vector func '{}' on train: {},  dev: {}".format(
      model_name, VECT_FUNC.__name__, len(trn), len(dev)))
  dct_res_extra = {
      "DATA" : [],
      "VF" : [],
      "Macro-F1" : [],
      "Pos F1" : [],
      "Pos Recall" : [],
      "Pos Precis" : [],
      }
  def _log_data_result(dn, mf1, pf1, rec, prec, vf):
    dct_res_extra['DATA'].append(dn)
    dct_res_extra['VF'].append(vf.__name__)
    dct_res_extra['Macro-F1'].append(mf1)
    dct_res_extra['Pos F1'].append(pf1)
    dct_res_extra['Pos Recall'].append(rec)
    dct_res_extra['Pos Precis'].append(prec)

  x_trn, y_trn = nli.word_entail_featurize(
      data=trn, 
      vector_func=VECT_FUNC, 
      vector_combo_func=arr
      )
    
  x_dev, y_dev = nli.word_entail_featurize(
      data=dev, 
      vector_func=VECT_FUNC, 
      vector_combo_func=arr
      )
  y_pred = model.predict(x_dev)
  report = classification_report(y_dev, y_pred, digits=3, output_dict=True)
  mf1 = round(report['macro avg']['f1-score'] * 100,2)
  pf1 = round(report['1']['f1-score'] * 100,2)
  prc = round(report['1']['recall'] * 100,2)
  ppr = round(report['1']['precision'] * 100,2)
  dct_res = add_res(
      dct=dct_res,
      model_name=model_name + model_sufix,
      score=mf1,
      P_REC=prc)
  P("\n{}\n  MF1: {:.2f}, 1F1: {:.2f}, 1R: {:.2f}, 1P: {:.2f}\n".format(model_name, mf1, pf1, prc, ppr))
  _log_data_result(
      'dev_full', 
      mf1,
      pf1,
      prc,
      ppr,
      VECT_FUNC,
      )
  y_pred = model.predict(x_trn)
  report = classification_report(y_trn, y_pred, digits=3, output_dict=True)
  _log_data_result(
      'train_full', 
      report['macro avg']['f1-score'] * 100,
      report['1']['f1-score'] * 100,
      report['1']['recall'] * 100,
      report['1']['precision'] * 100,
      VECT_FUNC,
      )
  
  for test_name in dct_data_GLOBAL:
    P("\n\nTesting on '{}':".format(test_name))
    x, y = nli.word_entail_featurize(
        data=dct_data_GLOBAL[test_name], 
        vector_func=VECT_FUNC, 
        vector_combo_func=arr
        )
    yh = model.predict(x)
    P(classification_report(y, yh, digits=3))
    report = classification_report(y, yh, digits=3, output_dict=True)
    _log_data_result(
        test_name, 
        report['macro avg']['f1-score'] * 100,
        report['1']['f1-score'] * 100,
        report['1']['recall'] * 100,
        report['1']['precision'] * 100,
        VECT_FUNC,
        )
  df = pd.DataFrame(dct_res_extra).sort_values('Macro-F1')
  P(df)
  return dct_res


def train_test_config(model_name, model_config, trn, dev, dct_res):
    
  VECT_FUNC = globals()[model_config['VF']]
  

  x_dev, y_dev = nli.word_entail_featurize(
      data=dev, 
      vector_func=VECT_FUNC, 
      vector_combo_func=arr
      )

  model = WordEntailClassifier(      
    batch=256,
    x_dev=x_dev,
    y_dev=y_dev,
    model_name=model_name,
    optim=th.optim.Adam,
    **model_config      
    )
  
  if not model.has_saved():
    _ = nli.wordentail_experiment(
          train_data=trn,
          assess_data=dev,
          vector_func=VECT_FUNC,
          vector_combo_func=arr,
          model=model,    
        )
  else:
    model.load()
  
  dct_res = test_model(model, trn, dev, dct_res)
  
  return dct_res


def test_model_configs(dct_test_models, dct_res):
  for _model_name, _model_params in dct_test_models.items():
    VECT_FUNC = globals()[_model_params['VF']]
    _x_trn, _y_trn = nli.word_entail_featurize(
        data=train_data, 
        vector_func=VECT_FUNC, 
        vector_combo_func=arr
        )
    _x_dev, _y_dev = nli.word_entail_featurize(
        data=dev_data, 
        vector_func=VECT_FUNC, 
        vector_combo_func=arr
        )
    _model = WordEntailClassifier(      
      batch=256,
      x_dev=_x_dev,
      y_dev=_y_dev,
      model_name=_model_name,
      optim=th.optim.Adam,
      **_model_params      
      )  
    _model.load()
    _y_pred = _model.predict(_x_dev)
    report = classification_report(_y_dev, _y_pred, digits=3, output_dict=True)
    mf1 = round(report['macro avg']['f1-score'] * 100,2)
    pf1 = round(report['1']['f1-score'] * 100,2)
    prc = round(report['1']['recall'] * 100,2)
    ppr = round(report['1']['precision'] * 100,2)
    dct_res = add_res(
        dct_res,
        model_name=_model_name,
        score=mf1,
        pf1=pf1,
        prc=prc,
        ppr=ppr,
        )
  df_scores = pd.DataFrame(dct_res).sort_values('SCORE')
  P("-" * 80 + "\nResults:\n")
  P(df_scores)
  return dct_res
  
      

def train_models(dct_models, trn, dev):
  all_models = []
  for i, (model_name, model_params) in enumerate(dct_models.items()):
    P("\nLoading or training/saving model {}/{}".format(i+1, len(dct_models)))
    VECT_FUNC = globals()[model_params['VF']]
    x_trn, y_trn = nli.word_entail_featurize(
        data=trn, 
        vector_func=VECT_FUNC, 
        vector_combo_func=arr
        )
    x_dev, y_dev = nli.word_entail_featurize(
        data=dev, 
        vector_func=VECT_FUNC, 
        vector_combo_func=arr
        )
    model = WordEntailClassifier(      
      batch=256,
      x_dev=x_dev,
      y_dev=y_dev,
      model_name=model_name,
      optim=th.optim.Adam,
      **model_params      
      )  
    if model.has_saved():
      model.load()
      all_models.append(model)
      continue
    _ = nli.wordentail_experiment(
          # Use the function arguments, not the global train/dev splits.
          train_data=trn,
          assess_data=dev,
          vector_func=VECT_FUNC,
          vector_combo_func=arr,
          model=model,    
        )
    all_models.append(model)
  return all_models
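
The `EnsembleWrapper.predict` method defined above averages the per-model probabilities and thresholds the mean at 0.5 (soft voting). The following is a minimal standalone sketch of that step only. The helper name `soft_vote` and the probability arrays are hypothetical and chosen for illustration; it assumes each model returns a 1-D array of positive-class probabilities.

```python
import numpy as np


def soft_vote(prob_lists, threshold=0.5):
    """Average per-model positive-class probabilities and threshold the mean.

    `prob_lists` is a list of 1-D arrays, one per model, all of equal length.
    """
    # Stack to shape (n_models, n_examples), then transpose so that each
    # row holds the probabilities all models assign to one example.
    probs = np.vstack(prob_lists).T
    mean_probs = probs.mean(axis=1)
    return (mean_probs >= threshold).astype(np.uint8)


# Hypothetical probabilities from three models on three examples.
p1 = np.array([0.9, 0.2, 0.6])
p2 = np.array([0.7, 0.4, 0.3])
p3 = np.array([0.8, 0.1, 0.5])

ensemble_preds = soft_vote([p1, p2, p3])
print(ensemble_preds)
```

In this toy input, the third example is rejected even though one model scores it at 0.6, because the mean of the three probabilities falls below the threshold.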
    


### The next cell contains the code that builds the original system

In [39]:
USE_NEW_SPLIT = False
 
GLOVE_DIM = 300

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.precision', 4)  # full option name; bare 'precision' is ambiguous in newer pandas
  
if "GLOVE" not in globals() or len(next(iter(GLOVE.values()))) != GLOVE_DIM:
  P("Loading GloVe-{}...".format(GLOVE_DIM))
  GLOVE = utils.glove2dict(os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(GLOVE_DIM)))  
  P("GloVe-{} loaded.".format(GLOVE_DIM))

with open(wordentail_filename) as f:
  wordentail_data = json.load(f)  
  
train_data = wordentail_data['word_disjoint']['train']
dev_data = wordentail_data['word_disjoint']['dev']
dct_results = OrderedDict({'MODEL':[], 'SCORE':[], 'P_REC': []})

maybe_find_glove_replacement('aromatization')
glv_analysis = test_glove_vs_data(train_data, dev_data)
dct_data_GLOBAL = glv_analysis[0]
miss_train, miss_dev = glv_analysis[1]

sol_models_params = {
#'H3_2G424': {'siam_lyrs': [256, 128], 'separ': True, 'layers': [256, 128, 64], 'inp_drp': 0.3, 'o_drp': 0.2, 'bn': True, 'bn_inp': False, 'activ': 'relu', 'lr': 0.005, 's_comb': 'sub', 's_l2': True, 's_bn': False, 'rev': False, 'loss': 'fl', 'c_act': None, 'VF': 'l_glv', 'bal': True},
'H3_3G017' : {'siam_lyrs': [], 'separ': True, 'layers': [128, 32], 'inp_drp': 0.3, 'o_drp': 0.2, 'bn': True, 'bn_inp': True, 'activ': 'sigmoid', 'lr': 0.005, 's_comb': 'abs', 's_l2': True, 's_bn': True, 'rev': True, 'loss': 'fla', 'c_act': None, 'VF': 'l_glv', 'bal': False},  
'H3_3G015' : {'siam_lyrs': [256, 128, 64], 'separ': True, 'layers': [256, 128, 64], 'inp_drp': 0.3, 'o_drp': 0.5, 'bn': False, 'bn_inp': False, 'activ': 'relu', 'lr': 0.01, 's_comb': 'sqr', 's_l2': True, 's_bn': True, 'rev': False, 'loss': 'fla', 'c_act': None, 'VF': 'l_glv', 'bal': True},
'H3_3G093' : {'siam_lyrs': [], 'separ': True, 'layers': [256, 128, 64], 'inp_drp': 0.3, 'o_drp': 0, 'bn': True, 'bn_inp': True, 'activ': 'tanh', 'lr': 0.01, 's_comb': 'abs', 's_l2': True, 's_bn': True, 'rev': True, 'loss': 'flb', 'c_act': None, 'VF': 'l_glv', 'bal': False},
#"H3_2R321" : {'siam_lyrs': [256, 128, 64], 'separ': True, 'layers': [128, 32], 'inp_drp': 0.3, 'o_drp': 0, 'bn': True, 'bn_inp': False, 'activ': 'sigmoid', 'lr': 0.01, 's_comb': 'sqr', 's_l2': False, 's_bn': True, 'rev': True, 'loss': 'bce', 'c_act': None, 'bal': True, 'VF' : 'l_glv_rep'},
  }

SOL_VECT_FUNC = l_glv

P("Train label dist:")
calc_label_distrib(train_data)
P("Dev label dist:")
calc_label_distrib(dev_data)
# reduce DEV and add to train
new_dev_size = 500
train_added = len(dev_data) - new_dev_size
dev_to_train_idxs = np.random.choice(len(dev_data), size=train_added, replace=False)
added_train_data = [dev_data[x] for x in dev_to_train_idxs]


base_sol_models = train_models(
  sol_models_params, 
  trn=train_data, 
  dev=dev_data,
  )

base_solution_ens = EnsembleWrapper(
  models=base_sol_models,
  vector_func_name=SOL_VECT_FUNC.__name__,
  verbose=True
  )  
P("Ensemble results with standard train/dev distrib:")
dct_results = test_model(
  base_solution_ens,
  trn=train_data,
  dev=dev_data,
  dct_res=dct_results,
  )

if USE_NEW_SPLIT:  
    new_train_data = train_data + added_train_data
    new_dev_data = [dev_data[x] for x in range(len(dev_data)) if x not in dev_to_train_idxs]
    new_model_dict = {k+'XT':v for k,v in sol_models_params.items()}

    P("NEW Train label distrib:")
    calc_label_distrib(new_train_data)
    P("NEW Dev label distrib:")
    calc_label_distrib(new_dev_data)

    sol_models = train_models(
        new_model_dict,
        trn=new_train_data,
        dev=new_dev_data) 


    solution_ensemble = EnsembleWrapper(
        models=sol_models,
        vector_func_name=SOL_VECT_FUNC.__name__,
        verbose=True
        )  

    P("\nTEST on original train/dev splits")
    dct_results = test_model(
        solution_ensemble, 
        trn=train_data,
        dev=dev_data,
        dct_res=dct_results,
        model_sufix='_std')
    P("\nTEST on NEW train/dev splits")
    dct_results = test_model(
        solution_ensemble, 
        trn=new_train_data,
        dev=new_dev_data,
        dct_res=dct_results,
        model_sufix='_ext')

else:
    solution_ensemble = base_solution_ens

solution_result = nli.wordentail_experiment(
  train_data=train_data,
  assess_data=dev_data,
  vector_func=globals()[solution_ensemble.vector_func_name],
  vector_combo_func=arr,
  model=solution_ensemble,
  )



Loading GloVe-300...
GloVe-300 loaded.

Glove train: 8195 (95.9%)

Glove dev: 2038 (94.8%)

Train has 71 words that are not in GLOVE: ['worktop', 'overhead-travelling', 'prodrome', 'galactosidase', 'underpant']...
  [['plum', 'worktop'], 0]
  [['potato', 'overhead-travelling'], 0]
  [['chill', 'prodrome'], 0]
  [['coughing', 'prodrome'], 0]
  [['chorioretinitis', 'blindness'], 1]
  [['chorioretinitis', 'inflammation'], 1]
  [['iridocyclitis', 'inflammation'], 1]
  [['scleritis', 'inflammation'], 1]

Pos vs neg: 16 vs 299


Dev has 24 words that are not in GLOVE: ['tenosynovitis', 'hypernatremia', 'wharve', 'fermoota', 'vruttanama']...
  [['antigen', 'tenosynovitis'], 0]
  [['colitis', 'tenosynovitis'], 0]
  [['corticosteroid', 'tenosynovitis'], 0]
  [['depression', 'tenosynovitis'], 0]
  [['tenosynovitis', 'antigen'], 1]
  [['tenosynovitis', 'corticosteroid'], 1]
  [['tenosynovitis', 'infection'], 1]

Pos vs neg: 3 vs 92
Train label dist:
0    7199
1    1349
Name: 1, dtype: int64
Dev l

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0      0.973     1.000     0.986       108
           1      0.000     0.000     0.000         3

    accuracy                          0.973       111
   macro avg      0.486     0.500     0.493       111
weighted avg      0.947     0.973     0.960       111

          DATA     VF  Macro-F1   Pos F1  Pos Recall  Pos Precis
5      out_dev  l_glv   49.3151   0.0000      0.0000      0.0000
0     dev_full  l_glv   77.2000  59.6300     61.5100     57.8700
3    glove_dev  l_glv   77.2672  60.0000     62.2881     57.8740
4    out_train  l_glv   85.1243  71.4286     62.5000     83.3333
1   train_full  l_glv   93.5850  89.3733     97.2572     82.6717
2  glove_train  l_glv   93.6456  89.5461     97.6744     82.6667
****** Ensemble model passing fit call ******
              precision    recall  f1-score   support

           0      0.951     0.944     0.948      1910
           1      0.579     0.615     0.596       239

    accu

## Bake-off [1 point]

The goal of the bake-off is to achieve the highest macro-average F1 score on __word_disjoint__, on a test set that we will make available at the start of the bake-off. The announcement will go out on the discussion forum. To enter, you'll be asked to run `nli.bake_off_evaluation` on the output of your chosen `nli.wordentail_experiment` run. 
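
As a reminder of what the metric rewards, here is a minimal sketch of macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are averaged without weighting, so the rare positive class counts as much as the frequent negative class. The function name `macro_f1` and the toy labels are hypothetical; the official score comes from `nli.bake_off_evaluation`, not from this sketch.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1_scores = []
    # Consider every label that appears in either the gold or predicted labels.
    for label in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_scores))


# Toy example with an imbalanced label distribution (illustrative only).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Here class 0 has F1 = 0.75 and class 1 has F1 = 0.5, so the macro average is 0.625 even though four of the six predictions are correct.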

The cells below this one constitute your bake-off entry.

The rules described in the [Your original system](#Your-original-system-[3-points]) homework question are also in effect for the bake-off.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.

In [45]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.
##### YOUR CODE HERE

# Assign bakeoff_model before it is used to print the architecture.
bakeoff_model = solution_result

P("====== Bake-off ensemble architecture ======")
bakeoff_model['model'].print_models()

test_data_filename = os.path.join(
    NLIDATA_HOME,
    "bakeoff-wordentail-data",
    "nli_wordentail_bakeoff_data-test.json")

P("Bake-off evaluation:")

nli.bake_off_evaluation(
    bakeoff_model,
    test_data_filename)


Ensemble 'E_3G017_3G015_3G093' architecture:
  --------------------------------------------------------------------------------
  Model 1/3
  Model name: H3_3G017
  ThWordEntailModel(
    (path1_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): Dropout(p=0.3, inplace=False)
      (2): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): L2_Normalizer()
    )
    (path2_layers): ModuleList(
      (0): InputPlaceholder(input_dim=300)
      (1): Dropout(p=0.3, inplace=False)
      (2): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): L2_Normalizer()
    )
    (post_layers): ModuleList(
      (0): PathsCombiner(input_dim=300x2, output_dim=300, method='abs', act=None)
      (1): Linear(in_features=300, out_features=128, bias=False)
      (2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Sigmoid()
      (4): Dropout(p=0.2, inplace=False)
  

In [42]:
# On an otherwise blank line in this cell, please enter
# your macro-avg f1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.

##### YOUR CODE HERE

0.747



0.747