# Homework and bake-off: Relation extraction using distant supervision

In [1]:
__author__ = "Bill MacCartney and Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Baselines](#Baselines)
  1. [Hand-build feature functions](#Hand-build-feature-functions)
  1. [Distributed representations](#Distributed-representations)
1. [Homework questions](#Homework-questions)
  1. [Different model factory [1 points]](#Different-model-factory-[1-points])
  1. [Directional unigram features [1.5 points]](#Directional-unigram-features-[1.5-points])
  1. [The part-of-speech tags of the "middle" words [1.5 points]](#The-part-of-speech-tags-of-the-"middle"-words-[1.5-points])
  1. [Bag of Synsets [2 points]](#Bag-of-Synsets-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to developing really effective relation extraction systems using distant supervision.

As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system for you to enter into the bake-off.

## Set-up

See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions.

In [2]:
import numpy as np
import os
import rel_ext
from sklearn.linear_model import LogisticRegression
import utils

As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:

In [3]:
rel_ext_data_home = os.path.join('data', 'rel_ext_data')

In [4]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))

In [5]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))

In [6]:
dataset = rel_ext.Dataset(corpus, kb)

You are not wedded to this set-up for splits. The bake-off will be conducted on a previously unseen test set, so all of the data in `dataset` is fair game:

In [7]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)

In [8]:
splits

{'tiny': Corpus with 3,474 examples; KB with 445 triples,
 'train': Corpus with 263,285 examples; KB with 36,191 triples,
 'dev': Corpus with 64,937 examples; KB with 9,248 triples,
 'all': Corpus with 331,696 examples; KB with 45,884 triples}

## Baselines

### Hand-build feature functions

In [9]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter

In [10]:
featurizers = [simple_bag_of_words_featurizer]

In [11]:
model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')

In [12]:
baseline_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=model_factory,
    verbose=True)


relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.844      0.382      0.680        340       5716
author                    0.798      0.544      0.730        509       5885
capital                   0.545      0.189      0.396         95       5471
contains                  0.803      0.608      0.754       3904       9280
film_performance          0.811      0.565      0.746        766       6142
founders                  0.833      0.395      0.682        380       5756
genre                     0.524      0.129      0.325        170       5546
has_sibling               0.888      0.238      0.575        499       5875
has_spouse                0.881      0.325      0.656        594       5970
is_a                      0.723      0.225      0.501        497       5873
nationality               0.600      0.189      0.419        301       5677
parents    
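
The f-score column in the table above appears to be $F_{0.5}$, which weights precision more heavily than recall, rather than $F_1$. The following is a minimal sketch for checking this against the printed rows; the function name `f_beta` is an illustrative assumption and not part of `rel_ext`.

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    With beta < 1, precision counts for more than recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall copied from the `adjoins` and `author` rows above.
adjoins_f = f_beta(0.844, 0.382)
author_f = f_beta(0.798, 0.544)
print(round(adjoins_f, 3), round(author_f, 3))
```

Both values agree with the printed f-scores for those rows to three decimal places, which is consistent with the $F_{0.5}$ reading.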

Studying model weights might yield insights:

In [13]:
rel_ext.examine_model_weights(baseline_results)

Highest and lowest feature weights for relation adjoins:

     2.529 Córdoba
     2.430 Valais
     2.371 Taluks
     ..... .....
    -1.457 he
    -1.475 America
    -1.548 who

Highest and lowest feature weights for relation author:

     3.126 author
     2.602 wrote
     2.597 books
     ..... .....
    -2.026 sequence
    -2.041 produced
    -2.069 or

Highest and lowest feature weights for relation capital:

     3.323 capital
     1.685 km
     1.638 posted
     ..... .....
    -1.148 and
    -1.169 being
    -1.731 Antrim

Highest and lowest feature weights for relation contains:

     2.279 transferred
     2.211 bounded
     2.210 tiny
     ..... .....
    -2.765 Mile
    -2.867 band
    -4.198 Antrim

Highest and lowest feature weights for relation film_performance:

     4.269 starring
     3.736 alongside
     3.531 co-starring
     ..... .....
    -1.843 .The
    -1.911 Khakee
    -2.111 She

Highest and lowest feature weights for relation founders:

     4.106 founder
  

### Distributed representations

This simple baseline sums the GloVe vector representations for all of the words in the "middle" span and feeds those representations into the standard `LogisticRegression`-based `model_factory`. The crucial parameter that enables this is `vectorize=False`. This essentially says to `rel_ext.experiment` that your featurizer or your model will do the work of turning examples into vectors; in that case, `rel_ext.experiment` just organizes these representations by relation type.
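
To make the summing step concrete, here is a small self-contained sketch. It assumes a toy three-dimensional lookup with made-up values in place of `glove_lookup`, and it uses plain lists where the notebook uses NumPy arrays; `toy_lookup` and `sum_middle_vectors` are hypothetical names.

```python
import random

# Toy stand-in for `glove_lookup`; the words and values are made up.
toy_lookup = {
    "was": [0.1, 0.2, 0.3],
    "born": [0.4, 0.5, 0.6],
    "in": [0.0, -0.5, 1.0],
}


def sum_middle_vectors(middle, lookup, dim=3, seed=0):
    """Sum the vectors of the in-vocabulary words in `middle`.

    Falls back to a random vector of the right dimensionality when
    no word in `middle` is in the vocabulary.
    """
    reps = [lookup[w] for w in middle.split() if w in lookup]
    if not reps:
        rng = random.Random(seed)
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    # Column-wise sum: the list analogue of np.sum(reps, axis=0).
    return [sum(column) for column in zip(*reps)]


# "Paris" is out of vocabulary here, so it is simply skipped.
vec = sum_middle_vectors("was born in Paris", toy_lookup)
print(vec)
```

Each KB triple ends up as one fixed-length vector however many words the middle spans contain, which is what lets the classifier consume it directly when `vectorize=False`.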

In [14]:
GLOVE_HOME = os.path.join('data', 'glove.6B')

In [15]:
glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

In [16]:
def glove_middle_featurizer(kbt, corpus, np_func=np.sum):
    reps = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split():
            rep = glove_lookup.get(word)
            if rep is not None:
                reps.append(rep)
    # A random representation of the right dimensionality if the
    # example happens not to overlap with GloVe's vocabulary:
    if len(reps) == 0:
        dim = len(next(iter(glove_lookup.values())))                
        return utils.randvec(n=dim)
    else:
        return np_func(reps, axis=0)

In [17]:
glove_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[glove_middle_featurizer],    
    vectorize=False, # Crucial for this featurizer!
    verbose=True)


relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.888      0.465      0.751        340       5716
author                    0.859      0.444      0.724        509       5885
capital                   0.621      0.189      0.427         95       5471
contains                  0.661      0.416      0.591       3904       9280
film_performance          0.811      0.320      0.621        766       6142
founders                  0.789      0.237      0.538        380       5756
genre                     0.481      0.076      0.234        170       5546
has_sibling               0.842      0.246      0.568        499       5875
has_spouse                0.894      0.355      0.686        594       5970
is_a                      0.744      0.135      0.391        497       5873
nationality               0.679      0.183      0.440        301       5677
parents    

With the same basic code design, one can also use the PyTorch models included in the course repo, or write new ones that are better aligned with the task. For those models, it's likely that the featurizer will just return a list of tokens (or perhaps a list of lists of tokens), and the model will map those into vectors using an embedding.
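
As a rough illustration of a featurizer that returns a list of lists of tokens, the sketch below uses hypothetical stand-ins (`ToyExample`, `ToyTriple`, `ToyCorpus`) for the course's corpus objects. Only the `(kbt, corpus)` signature and the `get_examples_for_entities` method name are taken from the notebook.

```python
from collections import namedtuple

# Hypothetical stand-ins for the course's example, triple, and corpus objects.
ToyExample = namedtuple("ToyExample", ["entity_1", "entity_2", "middle"])
ToyTriple = namedtuple("ToyTriple", ["rel", "sbj", "obj"])


class ToyCorpus:
    def __init__(self, examples):
        self.examples = examples

    def get_examples_for_entities(self, e1, e2):
        return [ex for ex in self.examples
                if ex.entity_1 == e1 and ex.entity_2 == e2]


def middle_token_lists_featurizer(kbt, corpus):
    """Return one list of tokens per example, covering both entity orders."""
    token_lists = []
    for e1, e2 in [(kbt.sbj, kbt.obj), (kbt.obj, kbt.sbj)]:
        for ex in corpus.get_examples_for_entities(e1, e2):
            token_lists.append(ex.middle.split())
    return token_lists


toy_corpus = ToyCorpus([
    ToyExample("Ada", "London", "was born in"),
    ToyExample("London", "Ada", "is the birthplace of"),
    ToyExample("Ada", "Paris", "visited"),
])
kbt = ToyTriple("place_of_birth", "Ada", "London")
token_lists = middle_token_lists_featurizer(kbt, toy_corpus)
print(token_lists)
```

A model on the receiving end would then be responsible for mapping tokens to embedding indices and for combining the per-example sequences.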

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Different model factory [1 points]

The code in `rel_ext` makes it very easy to experiment with other classifier models: one need only redefine the `model_factory` argument. This question asks you to assess a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

__To submit:__ A wrapper function `run_svm_model_factory` that does the following: 

1. Uses `rel_ext.experiment` with the model factory set to one based on an `SVC` with `kernel='linear'` and all other arguments left with default values. 
1. Trains on the `train` part of `splits`.
1. Assesses on the `dev` part of `splits`.
1. Uses `featurizers` as defined above. 
1. Returns the return value of `rel_ext.experiment` for this set-up.

The function `test_run_svm_model_factory` will check that your function conforms to these general specifications.

In [18]:
def run_svm_model_factory():
    
    ##### YOUR CODE HERE
    from sklearn.svm import SVC
    res = rel_ext.experiment(
              splits,
              train_split='train',
              test_split='dev',
              featurizers=[glove_middle_featurizer], 
              # `max_iter=4` caps the solver at four iterations:
              model_factory=lambda: SVC(kernel='linear', max_iter=4),
              vectorize=False, # the GloVe featurizer returns vectors directly
              verbose=True)  
    return res


In [19]:
def test_run_svm_model_factory(run_svm_model_factory):
    results = run_svm_model_factory()
    assert 'featurizers' in results, \
        "The return value of `run_svm_model_factory` seems not to be correct"
    # Check one of the models to make sure it's an SVC:
    assert 'SVC' in results['models']['adjoins'].__class__.__name__, \
        "It looks like the model factory wasn't set to use an SVC."

In [20]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_svm_model_factory(run_svm_model_factory)




relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.933      0.285      0.642        340       5716
author                    0.333      0.069      0.188        509       5885
capital                   0.016      0.937      0.020         95       5471
contains                  0.804      0.136      0.405       3904       9280
film_performance          0.125      0.999      0.152        766       6142
founders                  0.066      0.995      0.081        380       5756
genre                     0.031      0.994      0.038        170       5546
has_sibling               0.083      0.976      0.102        499       5875
has_spouse                0.100      0.998      0.121        594       5970
is_a                      0.085      0.998      0.103        497       5873
nationality               0.053      0.987      0.065        301       5677
parents    

### Directional unigram features [1.5 points]

The current bag-of-words representation makes no distinction between "forward" and "reverse" examples. But, intuitively, there is a big difference between _X and his son Y_ and _Y and his son X_. This question asks you to modify `simple_bag_of_words_featurizer` to capture these differences. 

__To submit:__

1. A feature function `directional_bag_of_words_featurizer` that is just like `simple_bag_of_words_featurizer` except that it distinguishes "forward" and "reverse". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example.  The included function `test_directional_bag_of_words_featurizer` should help verify that you've done this correctly.

2. A call to `rel_ext.experiment` with `directional_bag_of_words_featurizer` as the only featurizer. (Aside from this, use all the default values for `rel_ext.experiment` as exemplified above in this notebook.)

3. `rel_ext.experiment` returns some of the core objects used in the experiment. How many feature names does the `vectorizer` have for the experiment run in the previous step? Include the code needed for getting this value. (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it!)
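
The motivation can be seen on a toy case. The sketch below works on bare lists of middle strings rather than on the course's corpus objects, and the function names are illustrative assumptions. It shows that undirected counts cannot tell the two orders apart, whereas suffixed counts can.

```python
from collections import Counter


def bag_of_words(middles_forward, middles_reverse):
    """Undirected counts: both orders feed the same keys."""
    counts = Counter()
    for middle in middles_forward + middles_reverse:
        counts.update(middle.split(' '))
    return counts


def directional_bag_of_words(middles_forward, middles_reverse):
    """Directed counts: keys are suffixed by the order they came from."""
    counts = Counter()
    for middle in middles_forward:
        counts.update(w + "_SO" for w in middle.split(' '))
    for middle in middles_reverse:
        counts.update(w + "_OS" for w in middle.split(' '))
    return counts


# The same middle, seen once in subject-object order and once in
# object-subject order.
pair_a = (["and his son"], [])
pair_b = ([], ["and his son"])

print(bag_of_words(*pair_a) == bag_of_words(*pair_b))
print(directional_bag_of_words(*pair_a) == directional_bag_of_words(*pair_b))
```

The price of the distinction is a larger feature space, since each word can now appear under two different keys.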

In [21]:
def directional_bag_of_words_featurizer(kbt, corpus, feature_counter): 
    # Append these to the end of the keys you add/access in 
    # `feature_counter` to distinguish the two orders. You'll
    # need to use exactly these strings in order to pass 
    # `test_directional_bag_of_words_featurizer`.
    subject_object_suffix = "_SO"
    object_subject_suffix = "_OS"
    
    ##### YOUR CODE HERE
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word + subject_object_suffix] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word + object_subject_suffix] += 1
    return feature_counter



# Call to `rel_ext.experiment`:
##### YOUR CODE HERE    
dir_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[directional_bag_of_words_featurizer],
    verbose=True)

print("Previous features: {}".format(len(baseline_results['vectorizer'].feature_names_)))
print("Current features: {}".format(len(dir_results['vectorizer'].feature_names_)))




relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.864      0.429      0.719        340       5716
author                    0.838      0.599      0.776        509       5885
capital                   0.618      0.221      0.455         95       5471
contains                  0.816      0.679      0.784       3904       9280
film_performance          0.839      0.661      0.796        766       6142
founders                  0.833      0.421      0.697        380       5756
genre                     0.727      0.235      0.513        170       5546
has_sibling               0.878      0.244      0.578        499       5875
has_spouse                0.897      0.354      0.686        594       5970
is_a                      0.788      0.247      0.549        497       5873
nationality               0.660      0.219      0.471        301       5677
parents    

In [22]:
def test_directional_bag_of_words_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter['is_OS'] += 5
    feature_counter = directional_bag_of_words_featurizer(kbt, corpus, feature_counter)
    expected = defaultdict(
        int, {'is_OS':6,'a_OS':1,'webcomic_OS':1,'created_OS':1,'by_OS':1})
    assert feature_counter == expected, \
        "Expected:\n{}\nGot:\n{}".format(expected, feature_counter)

In [23]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_directional_bag_of_words_featurizer(corpus)

### The part-of-speech tags of the "middle" words [1.5 points]

Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. Let's begin to explore whether there is information in these sequences, focusing on `middle_POS`.

__To submit:__

1. A feature function `middle_bigram_pos_tag_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a feature for bigram POS sequences. For example, given 

  `The/DT dog/N napped/V`
  
   we obtain the list of bigram POS sequences
  
   `b = ['<s> DT', 'DT N', 'N V', 'V </s>']`. 
   
   Of course, `middle_bigram_pos_tag_featurizer` should return count dictionaries defined in terms of such bigram POS lists, on the model of `simple_bag_of_words_featurizer`.  Don't forget the start and end tags, to model those environments properly! The included function `test_middle_bigram_pos_tag_featurizer` should help verify that you've done this correctly.

2. A call to `rel_ext.experiment` with `middle_bigram_pos_tag_featurizer` as the only featurizer. (Aside from this, use all the default values for `rel_ext.experiment` as exemplified above in this notebook.)
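
The worked example in step 1 can be reproduced with a few lines of standalone Python. This is only a sketch of the bigram construction on a raw tagged string; `pos_bigrams` is a hypothetical name, and nothing here touches the corpus objects.

```python
from collections import Counter


def pos_bigrams(tagged, start_symbol="<s>", end_symbol="</s>"):
    """Return the POS bigrams of a space-separated word/TAG string."""
    # Split on the rightmost '/' so that words containing '/' survive.
    tags = [tok.rsplit('/', 1)[1] for tok in tagged.split()]
    padded = [start_symbol] + tags + [end_symbol]
    return [a + ' ' + b for a, b in zip(padded, padded[1:])]


b = pos_bigrams("The/DT dog/N napped/V")
print(b)

# Counting bigrams over several tagged spans gives the count dictionary.
counts = Counter()
for span in ["The/DT dog/N napped/V", "The/DT cat/N"]:
    counts.update(pos_bigrams(span))
print(counts["<s> DT"], counts["DT N"])
```

An empty span still yields the single bigram `'<s> </s>'`, so adjacent entity mentions get a feature of their own.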

In [24]:
def middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter):
    
    ##### YOUR CODE HERE
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for bigram in get_tag_bigrams(ex.middle_POS):
            feature_counter[bigram] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for bigram in get_tag_bigrams(ex.middle_POS):
            feature_counter[bigram] += 1
    return feature_counter


def get_tag_bigrams(s):
    """Suggested helper method for `middle_bigram_pos_tag_featurizer`.
    This should be defined so that it returns a list of str, where each 
    element is a POS bigram."""
    # The values of `start_symbol` and `end_symbol` are defined
    # here so that you can use `test_middle_bigram_pos_tag_featurizer`.
    start_symbol = "<s>"
    end_symbol = "</s>"    
    ##### YOUR CODE HERE
    parts = [start_symbol] + get_tags(s) + [end_symbol]
    res = []
    for i in range(len(parts)-1):
        res.append(parts[i]+ ' ' + parts[i+1])
    return res

    
def get_tags(s): 
    """Given a sequence of word/POS elements (lemmas), this function
    returns a list containing just the POS elements, in order.    
    """
    return [parse_lem(lem)[1] for lem in s.strip().split(' ') if lem]


def parse_lem(lem):
    """Helper method for parsing word/POS elements. It just splits
    on the rightmost / and returns (word, POS) as a tuple of str."""
    return lem.strip().rsplit('/', 1)  

# Call to `rel_ext.experiment`:
##### YOUR CODE HERE

pos_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[middle_bigram_pos_tag_featurizer],
    verbose=True)



relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.881      0.347      0.674        340       5716
author                    0.723      0.338      0.589        509       5885
capital                   0.515      0.179      0.374         95       5471
contains                  0.755      0.593      0.716       3904       9280
film_performance          0.728      0.437      0.643        766       6142
founders                  0.566      0.168      0.385        380       5756
genre                     0.571      0.165      0.383        170       5546
has_sibling               0.719      0.164      0.429        499       5875
has_spouse                0.761      0.273      0.560        594       5970
is_a                      0.636      0.169      0.410        497       5873
nationality               0.477      0.070      0.220        301       5677
parents    

In [25]:
def test_middle_bigram_pos_tag_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter['<s> VBZ'] += 5
    feature_counter = middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter)
    expected = defaultdict(
        int, {'<s> VBZ':6,'VBZ DT':1,'DT JJ':1,'JJ VBN':1,'VBN IN':1,'IN </s>':1})
    assert feature_counter == expected, \
        "Expected:\n{}\nGot:\n{}".format(expected, feature_counter)

In [26]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_middle_bigram_pos_tag_featurizer(corpus)

### Bag of Synsets [2 points]

The following uses NLTK's WordNet API to get the synsets for _dog_ used as a noun:

```
from nltk.corpus import wordnet as wn
dog = wn.synsets('dog', pos='n')
dog
[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]
```

This question asks you to create synset-based features from the word/tag pairs in `middle_POS`.

__To submit:__

1. A feature function `synset_featurizer` that is just like `simple_bag_of_words_featurizer` except that it returns a list of synsets derived from `middle_POS`. Stringify these objects with `str` so that they can be `dict` keys. Use `convert_tag` (included below) to convert tags to `pos` arguments usable by `wn.synsets`. The included function `test_synset_featurizer` should help verify that you've done this correctly.

2. A call to `rel_ext.experiment` with `synset_featurizer` as the only featurizer. (Aside from this, use all the default values for `rel_ext.experiment`.)
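
Before any WordNet lookup, each word/tag element has to be turned into a `(word, pos)` pair. The sketch below shows only that preparatory step, so that it runs without NLTK or its data. `wordnet_pos` and `word_pos_pairs` are hypothetical names, and the tagged string is assembled from the words and tags that appear in the notebook's own test functions.

```python
def wordnet_pos(tag):
    """Map a Penn-style tag to a WordNet `pos` value, or None."""
    mapping = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
    return mapping.get(tag[:1].upper())


def word_pos_pairs(tagged):
    """Turn a space-separated word/TAG string into (word, pos) pairs."""
    pairs = []
    for tok in tagged.split():
        # Split on the rightmost '/' in case the word itself contains one.
        word, tag = tok.rsplit('/', 1)
        pairs.append((word, wordnet_pos(tag)))
    return pairs


pairs = word_pos_pairs("is/VBZ a/DT webcomic/JJ created/VBN by/IN")
print(pairs)
```

Each pair would then be passed on as `wn.synsets(word, pos=pos)`. Note that `pos=None` places no part-of-speech restriction on the lookup, so function words such as _a_ may still contribute synsets if WordNet happens to list them.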

In [27]:
from nltk.corpus import wordnet as wn

def synset_featurizer(kbt, corpus, feature_counter):
    
    ##### YOUR CODE HERE
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for ss in get_synsets(ex.middle_POS):
            feature_counter[ss] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for ss in get_synsets(ex.middle_POS):
            feature_counter[ss] += 1
    return feature_counter


def get_synsets(s):
    """Suggested helper method for `synset_featurizer`. This should
    be completed so that it returns a list of stringified Synsets 
    associated with elements of `s`.
    """   
    # Use `parse_lem` from the previous question to get a list of
    # (word, POS) pairs. Remember to convert the POS strings.
    wt = [parse_lem(lem) for lem in s.strip().split(' ') if lem]
    
    ##### YOUR CODE HERE
    res = []
    for word, tag in wt:
        t = convert_tag(tag)
        syns = wn.synsets(word, pos=t)
        for ss in syns:
            res.append(str(ss))
    return res

    
    
def convert_tag(t):
    """Converts tags so that they can be used by WordNet:
    
    | Tag begins with | WordNet tag |
    |-----------------|-------------|
    | `N`             | `n`         |
    | `V`             | `v`         |
    | `J`             | `a`         |
    | `R`             | `r`         |
    | Otherwise       | `None`      |
    """        
    if t[0].lower() in {'n', 'v', 'r'}:
        return t[0].lower()
    elif t[0].lower() == 'j':
        return 'a'
    else:
        return None    


# Call to `rel_ext.experiment`:
##### YOUR CODE HERE    

syn_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[synset_featurizer],
    verbose=True)






relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.799      0.338      0.628        340       5716
author                    0.764      0.452      0.671        509       5885
capital                   0.600      0.221      0.447         95       5471
contains                  0.786      0.587      0.736       3904       9280
film_performance          0.796      0.555      0.732        766       6142
founders                  0.760      0.384      0.636        380       5756
genre                     0.429      0.194      0.345        170       5546
has_sibling               0.803      0.220      0.525        499       5875
has_spouse                0.800      0.303      0.602        594       5970
is_a                      0.568      0.227      0.437        497       5873
nationality               0.556      0.150      0.360        301       5677
parents    

In [28]:
def test_synset_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter["Synset('be.v.01')"] += 5
    feature_counter = synset_featurizer(kbt, corpus, feature_counter)
    # The full return values for this tend to be long, so we just
    # test a few examples to avoid cluttering up this notebook.
    test_cases = {
        "Synset('be.v.01')": 6,
        "Synset('embody.v.02')": 1
    }
    for ss, expected in test_cases.items():   
        result = feature_counter[ss]
        assert result == expected, \
            "Incorrect count for {}: Expected {}; Got {}".format(ss, expected, result)

In [29]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_synset_featurizer(corpus)

### Your original system [3 points]

There are many options, and this could easily grow into a project. Here are a few ideas:

- Try out different classifier models, from `sklearn` and elsewhere.
- Add a feature that indicates the length of the middle.
- Augment the bag-of-words representation to include bigrams or trigrams (not just unigrams).
- Introduce features based on the entity mentions themselves. <!-- \[SPOILER: it helps a lot, maybe 4% in F-score. And combines nicely with the directional features.\] -->
- Experiment with features based on the context outside (rather than between) the two entity mentions — that is, the words before the first mention, or after the second.
- Try adding features which capture syntactic information, such as the dependency-path features used by Mintz et al. 2009. The [NLTK](https://www.nltk.org/) toolkit contains a variety of [parsing algorithms](http://www.nltk.org/api/nltk.parse.html) that may help.
- The bag-of-words representation does not permit generalization across word categories such as names of people, places, or companies. Can we do better using word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/)?
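
Two of the ideas above, a middle-length feature and word n-grams, can be sketched on a bare middle string as follows. The helper names and the length buckets are illustrative assumptions; in an actual featurizer, these keys would be added to `feature_counter` inside the usual loop over corpus examples.

```python
from collections import Counter


def middle_length_feature(middle, bins=(2, 5, 10)):
    """Bucket the token count of `middle` into a coarse length feature."""
    n = len(middle.split())
    for upper in bins:
        if n <= upper:
            return "middle_len<={}".format(upper)
    return "middle_len>{}".format(bins[-1])


def word_ngrams(middle, n=2):
    """Return the contiguous word n-grams of `middle` as strings."""
    tokens = middle.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


feature_counter = Counter()
middle = "is a webcomic created by"
feature_counter[middle_length_feature(middle)] += 1
feature_counter.update(word_ngrams(middle, n=2))
print(feature_counter)
```

Bucketing the length keeps the feature space small and avoids a separate sparse feature for every exact length; whether it helps is an empirical question for the dev set.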

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [2]:
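# Enter your system description in this cell.
# Please do not remove this comment.

from collections import OrderedDict
from functools import partial

import pandas as pd
from torch_rnn_classifier import TorchRNNClassifier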


def P(s=''):
  print(s, flush=True)

def Pr(s=''):
  print('\r' + str(s), end='', flush=True)
  

def prepare_grid_search(params_grid, nr_trials):
  import itertools

  pd.set_option('display.max_rows', 500)
  pd.set_option('display.max_columns', 500)
  pd.set_option('display.width', 1000)

  params = []
  values = []
  for k in params_grid:
    params.append(k)
    assert type(params_grid[k]) is list, 'All grid-search params must be lists. Error: {}'.format(k)
    values.append(params_grid[k])
  combs = list(itertools.product(*values))
  n_options = len(combs)
  grid_iterations = []
  for i in range(n_options):
    comb = combs[i]
    func_kwargs = {}
    for j,k in enumerate(params):
      func_kwargs[k] = comb[j]
    grid_iterations.append(func_kwargs)
  idxs = np.arange(n_options)
  if nr_trials < n_options:
    idxs = np.random.choice(idxs, size=nr_trials, replace=False)
  return [grid_iterations[i] for i in idxs]


def get_seq_feats(kbt, corpus, how='middle', two_dir=True, max_words=50):
  reps = []
  so_sents = []
  
  def extract_text(exmpl):
    str_ex = ''
    if how == 'full':
      str_ex = ' '.join((exmpl.left, exmpl.mention_1, exmpl.middle, exmpl.mention_2, exmpl.right))
    else:
      if 'left' in how:
        str_ex += ' ' + exmpl.left
      if 'm1' in how:
        str_ex += ' ' + exmpl.mention_1
      if 'middle' in how:
        str_ex += ' ' + exmpl.middle
      if 'm2' in how:
        str_ex += ' ' + exmpl.mention_2
      if 'right' in how:
        str_ex += ' ' + exmpl.right
    return str_ex
  
  for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
    str_ex = extract_text(ex)
    so_sents.append(str_ex)
  so_sents = sorted(so_sents, key=lambda x: len(x))
  so_best = so_sents[-1] if len(so_sents) > 0 else ''
  
  os_best = ''
  if two_dir:
    os_sents = []
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
      str_ex = extract_text(ex)
      os_sents.append(str_ex)
    os_sents = sorted(os_sents, key=lambda x:len(x))
    os_best = os_sents[-1] if len(os_sents) > 0 else ''

  str_text = so_best + ' ' + os_best
    
  str_rep = ''
  for word in str_text.split():
    w = word.lower()
    rep = glove_lookup.get(w)
    if rep is not None:
      reps.append(rep.astype(np.float32))
      str_rep += ' ' + w


  if len(reps) == 0:
    # Fall back to a single zero vector of GloVe's dimensionality:
    dim = len(next(iter(glove_lookup.values())))
    reps = [np.zeros(dim, dtype=np.float32)]
    
  return np.array(reps[:max_words], dtype=np.float32)

def run_grid_search():
    grid = {
        'eta' : [
            0.01,
            0.001,
            ],

        'l2_strength' : [
            0,
            0.01
            ],

        'how' : [
            'full',
            'middle',
            'left-right',
            'm1-middle-m2',
            ],

        'batch_size' : [
              64,
              256
            ],
        'bidirectional' : [
             True,
             False,
            ],

        'max_words' : [
            50,
            100,
            ],

        'two_dir' : [
            True,
            False
            ],

        'hidden_dim' : [
            128,
            256,
            512
            ],

        }
    
    dct_results = OrderedDict({'MODEL':[], 'F05': [], 'HRS': []})
    for k in grid: 
      dct_results[k] = []
    
    options = prepare_grid_search(grid, nr_trials=15)
    for grid_iter, option in enumerate(options):
      model_name = 'rnn_v1_{:02}'.format(grid_iter+1)
      print("Running grid search iteration {}/{} '{}': {}".format(
          grid_iter+1, len(options), model_name, option), flush=True)
      dct_results['MODEL'].append(model_name)
      for k in option:
        dct_results[k].append(option[k])
      max_words = option.pop('max_words')
      how = option.pop('how')
      two_dir = option.pop('two_dir')
      rnn1_model_factory = lambda: TorchRNNClassifier(vocab={}, 
                                                      use_embedding=False,
                                                      max_iter=20,
                                                      **option)
      featurizer_func = partial(get_seq_feats, how=how, max_words=max_words, two_dir=two_dir)
      start_timer(model_name)
      res = rel_ext.experiment(
              splits,
              train_split='train',
              test_split='dev',
              model_factory=rnn1_model_factory,
              featurizers=[featurizer_func],
              vectorize=False,
              verbose=True,
              return_macro=True)   
      
      rnn_results, rnn_f1 = res
      t_res = end_timer(model_name, rnn_f1)  
      dct_results['F05'].append(rnn_f1)
      dct_results['HRS'].append(round(t_res/3600,2))
      P("Results so far:\n{}".format(pd.DataFrame(dct_results).sort_values('F05')))  
    
"""
 The original system is a simple RNN (LSTM) classifier whose hyperparameters are chosen by random grid search. 
 The model architecture is searched jointly with the feature-generation options: the same grid also contains 
 options that do not control model hyperparameters but rather how the corpus examples of each triplet are 
 extracted and featurized. The `get_seq_feats` function generates a sequence of GloVe-300 embedding vectors 
 from the text extracted from the given corpus. We control which parts of each example are extracted 
 (left/right text, mentions and middle text), the maximum number of words per sequence (one sequence equals 
 one observation), and whether to concatenate the sbj-obj examples with the obj-sbj examples from the triplet. 
 For the text extraction itself, we collect all the examples for the triplet and then keep only the longest 
 sequence of text (taken from the example parts selected by the `how` parameter). Each word is looked up in 
 the GloVe dictionary in lowercase form (due to the standard preprocessing of GloVe). 
 To summarize, the parameters used in the grid search are:
  - eta : the learning rate of the RNN model optimizer
  - l2_strength : the strength of the L2 weight regularization applied by the model optimizer
  - how : which sections of each corpus example are used
  - batch_size : training batch size
  - bidirectional : True if we want the LSTM to go both forward and backward on the input sequence
  - max_words : maximum number of words per generated sequence
  - two_dir : use examples from both sbj-obj and obj-sbj
  - hidden_dim : dimension of the LSTM cell (effectively doubled if bidirectional)
 The grid-search mini-framework is fully contained in the function `prepare_grid_search`.
  
 Below is the first random grid-search "execution", with the results sorted in ascending order of F_0.5 score (best last):


Results so far:
        MODEL       F05   HRS    eta  l2_strength                       how  batch_size  bidirectional  max_words  two_dir  hidden_dim
1   rnn_v1_02  0.123757  0.39  0.010         0.01                left-right         256          False         50     True         128
12  rnn_v1_13  0.171640  1.25  0.010         0.01                      full          64           True        100     True         256
11  rnn_v1_12  0.177829  1.17  0.010         0.01                      full          64          False        100     True         512
5   rnn_v1_06  0.203638  0.66  0.010         0.01                      full          64          False         50     True         128
3   rnn_v1_04  0.236578  0.89  0.001         0.01                left-right          64          False        100     True         128
13  rnn_v1_14  0.403284  0.27  0.010         0.01                    middle         256           True         50     True         256
10  rnn_v1_11  0.404736  0.23  0.001         0.01                    middle         256          False        100     True         256
7   rnn_v1_08  0.413200  0.69  0.010         0.01  mention1-middle-mention2          64           True         50     True         128
0   rnn_v1_01  0.414059  0.23  0.001         0.01                    middle         256          False        100     True         512
2   rnn_v1_03  0.422167  0.62  0.001         0.01                    middle          64           True         50     True         128
6   rnn_v1_07  0.424688  0.84  0.010         0.00                left-right          64           True         50     True         256
4   rnn_v1_05  0.518044  0.21  0.001         0.00                    middle         256          False         50    False         512
14  rnn_v1_15  0.534089  0.23  0.001         0.01  mention1-middle-mention2         256          False         50    False         256
8   rnn_v1_09  0.684349  0.23  0.001         0.00  mention1-middle-mention2         256          False         50    False         256
9   rnn_v1_10  0.697654  0.64  0.010         0.00  mention1-middle-mention2          64           True         50    False         256

This first round of random grid search (15 configurations) lets us narrow the grid-search options and search 
further. A second round revealed similar results:
Results so far:
        MODEL       F05   HRS     eta  l2_strength           how  batch_size  bidirectional  max_words  two_dir  hidden_dim
2   rnn_v1_03  0.152079  0.90  0.0050        0.005    left-right         512           True        100    False         256
10  rnn_v1_11  0.153213  0.55  0.0050        0.005    left-right         512          False         50    False         128
20  rnn_v1_21  0.195650  0.82  0.0005        0.005    left-right         512           True        100    False         128
5   rnn_v1_06  0.213335  0.74  0.0005        0.005    left-right         512          False        100    False         128
14  rnn_v1_15  0.285871  0.81  0.0005        0.005          full         512          False        100    False         128
15  rnn_v1_16  0.294115  0.55  0.0050        0.000    left-right         512          False         50    False         128
21  rnn_v1_22  0.296603  0.85  0.0005        0.005          full         512          False        100    False         256
16  rnn_v1_17  0.406255  0.68  0.0005        0.005          full         512           True         50    False         256
17  rnn_v1_18  0.425472  0.17  0.0050        0.005        middle         512          False        100    False         128
19  rnn_v1_20  0.436710  0.17  0.0050        0.005        middle         512          False         50    False         512
9   rnn_v1_10  0.436834  0.81  0.0050        0.000          full         512          False        100    False         128
13  rnn_v1_14  0.441768  0.20  0.0050        0.005        middle         512           True        100    False         512
24  rnn_v1_25  0.442149  0.17  0.0050        0.005        middle         512          False         50    False         256
12  rnn_v1_13  0.442889  0.17  0.0050        0.005        middle         512          False         50    False         128
11  rnn_v1_12  0.514573  0.19  0.0005        0.000        middle         512           True        100    False         256
22  rnn_v1_23  0.521309  0.16  0.0005        0.000        middle         512          False        100    False         256
4   rnn_v1_05  0.523081  0.20  0.0005        0.000        middle         512           True        100    False         512
18  rnn_v1_19  0.530735  0.19  0.0005        0.000        middle         512           True        100    False         128
7   rnn_v1_08  0.532243  0.16  0.0050        0.000        middle         512          False         50    False         128
8   rnn_v1_09  0.533425  0.19  0.0050        0.000        middle         512           True        100    False         256
23  rnn_v1_24  0.539101  0.92  0.0005        0.000          full         512           True         50    False         512
6   rnn_v1_07  0.542324  0.19  0.0050        0.000        middle         512           True        100    False         128
1   rnn_v1_02  0.667892  0.22  0.0005        0.000  m1-middle-m2         512           True        100    False         256
3   rnn_v1_04  0.669198  0.20  0.0005        0.000  m1-middle-m2         512          False        100    False         512
0   rnn_v1_01  0.691331  0.18  0.0050        0.000  m1-middle-m2         512          False        100    False         128  
X   rnn_v1_01  0.692352   0.2  0.0050            0  m1-middle-m2         512          False        100    False         128

The top model scores for each relation are as follows:
relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.672      0.459      0.615        340       5716
author                    0.899      0.507      0.779        509       5885
capital                   0.792      0.400      0.662         95       5471
contains                  0.915      0.338      0.682       3904       9280
film_performance          0.883      0.484      0.758        766       6142
founders                  0.815      0.361      0.651        380       5756
genre                     0.814      0.282      0.591        170       5546
has_sibling               0.757      0.437      0.660        499       5875
has_spouse                0.755      0.446      0.663        594       5970
is_a                      0.852      0.406      0.699        497       5873
nationality               0.882      0.571      0.796        301       5677
parents                   0.863      0.446      0.727        312       5688
place_of_birth            0.742      0.506      0.679        233       5609
place_of_death            0.648      0.428      0.587        159       5535
profession                0.938      0.486      0.791        247       5623
worked_at                 0.831      0.508      0.737        242       5618
------------------    ---------  ---------  ---------  ---------  ---------
macro-average             0.816      0.442      0.692       9248      95264

Finally, the decision was made to use the narrowed hyperparameters to construct an actual candidate for the 
original system. However, the main issue at hand is the potential difference in optimization dynamics between the 
individual per-relation models when training for a fixed number of epochs. As a result, the proposed approach is to 
train each relation's model with early stopping and to record the optimal epoch of each relation model as an 
experiment result. Using an early stopping criterion based on the F0.5 score, a final round of random grid search 
was performed and the code for the final (original) model was prepared.

"""
if 'IS_GRADESCOPE_ENV' not in os.environ:
    from sklearn.metrics import precision_recall_fscore_support
    import torch as th
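    max_epochs = 1000      # upper bound on training epochs per relation
    early_stop_steps = 5   # NOTE: unused; the patience actually applied is `max_patience` in the loop below
    epochs_per_fit = 1     # epochs run by each fit() call (warm_start=True is meant to resume training)
    fit_iters = max_epochs // epochs_per_fit  # NOTE: unused; the loop below iterates over max_epochs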
    final_model_factory  = lambda: TorchRNNClassifier(vocab={}, 
                                                      use_embedding=False,
                                                      warm_start=True,
                                                      max_iter=epochs_per_fit,
                                                      eta=0.001,
                                                      bidirectional=False,
                                                      batch_size=64,
                                                      l2_strength=0,
                                                      hidden_dim=256)    
    final_featurizer = partial(get_seq_feats, how='m1-middle-m2', two_dir=False, max_words=100)
    ### now we prepare the dataset
    # first the train
    train_dataset = splits['train']
    train_o, train_y = train_dataset.build_dataset()
    P("Featurizing train dataset...")
    train_X, vectorizer = train_dataset.featurize(
        train_o, [final_featurizer], vectorize=False)
    # now train_X and train_y hold the train data
    
    # now the dev
    assess_dataset = splits['dev']
    assess_o, assess_y = assess_dataset.build_dataset()
    P("Featurizing dev dataset...")
    test_X, _ = assess_dataset.featurize(
        assess_o,
        featurizers=[final_featurizer],
        vectorizer=None,
        vectorize=False)
    # now test_X and assess_y hold the dev data
    
    # let's train all the models
    start_timer('final_model')
    models = {}
    early_stops = {}
    n_rels = len(splits['all'].kb.all_relations)
    P("Training {} {} classifiers on {} relations".format(
        n_rels, final_model_factory().__class__.__name__, n_rels))
    for i_rel, rel in enumerate(splits['all'].kb.all_relations):
      models[rel] = final_model_factory()
      P("  Training {}/{}: Running {}.fit() for rel={} for max {} epochs with early stop...".format(
              i_rel + 1, n_rels, models[rel].__class__.__name__, rel, max_epochs))
      best_rel_f1 = 0
      best_rel_model = ''
      patience = 0
      max_patience = 5
      for ep in range(1, max_epochs + 1):
        models[rel].fit(train_X[rel], train_y[rel])
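        # fit stage finished; now evaluate on the dev data
        predictions =  models[rel].predict(test_X[rel], verbose=False)
        stats = precision_recall_fscore_support(assess_y[rel], predictions, beta=0.5)
        # each returned array has one entry per class in sorted label order;
        # index 1 selects the positive class (assuming both classes are present)
        stats = [stat[1] for stat in stats]     
        rel_f1 = stats[2]  # this is the F0.5 score, despite the variable name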
        if best_rel_f1 < rel_f1:
          patience = 0
          if best_rel_model != '':
            try:
              os.remove(best_rel_model)
              P("  Old model file '{}' removed".format(best_rel_model))
            except OSError:
              # only file-system errors are expected here; anything else should surface
              P("FAILED to remove old model file")
          best_rel_f1 = rel_f1
          best_rel_model = 'model_rel_{}_{}.th'.format(rel, ep)
          th.save(models[rel].model.state_dict(), best_rel_model)
          P("  Found new best for rel={} with f05={:.4f} @ ep {}".format(
              rel, best_rel_f1, ep * epochs_per_fit))
          early_stops[rel] = ep * epochs_per_fit
        else:
          patience += 1
          P("  Model did not improve {:.4f} < {:.4f}. Patience {}/{}".format(
              rel_f1, best_rel_f1, patience, max_patience))
        if patience > max_patience:
          P("  Stopping training for rel '{}' after {} epochs".format(
              rel, ep * epochs_per_fit))
          break
      if best_rel_model != '':
        P("  Loading best model '{}' for rel='{}'".format(best_rel_model, rel))
        models[rel].model.load_state_dict(th.load(best_rel_model))
        predictions =  models[rel].predict(test_X[rel], verbose=False)
        stats = precision_recall_fscore_support(assess_y[rel], predictions, beta=0.5)
        stats = [stat[1] for stat in stats]     
        rel_f1 = stats[2]   
        P("  Final model for rel '{}' has a F0.5 of {:.4f}".format(rel,
          rel_f1))
        assert rel_f1 == best_rel_f1, "Results can not be replicated {} vs {} ".format(best_rel_f1, rel_f1)
    # now we have trained one model for each relation with independent early stopping
    train_result = {
        'featurizers': [final_featurizer],  # must match the featurizer used to build train_X
        'vectorizer': vectorizer,
        'models': models,
        'all_relations': splits['all'].kb.all_relations,
        'vectorize': False}
    predictions, test_y = rel_ext.predict(
        splits,
        train_result,
        split_name='dev',
        vectorize=False)
    eval_res = rel_ext.evaluate_predictions(
                  predictions,
                  test_y,
                  verbose=True)
    end_timer('final_model', eval_res)

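The random grid search described in the docstring above can be sketched as follows. This is a minimal illustration, not the notebook's `prepare_grid_search`: the helper name `sample_grid` and the toy grid values are assumptions made for this example. It enumerates every combination in the grid and draws a fixed number of distinct configurations at random.

```python
import itertools
import random


def sample_grid(grid, n_samples, seed=0):
    """Draw `n_samples` distinct configurations from a dict of option lists.

    `sample_grid` is a hypothetical helper used for illustration only.
    """
    keys = sorted(grid)
    # Full Cartesian product of the option lists, one tuple per configuration.
    combos = list(itertools.product(*(grid[k] for k in keys)))
    rng = random.Random(seed)
    # Sampling without replacement guarantees no configuration is run twice.
    chosen = rng.sample(combos, min(n_samples, len(combos)))
    return [dict(zip(keys, combo)) for combo in chosen]


# Toy grid; the values are illustrative, not the ones searched above.
toy_grid = {
    'eta': [0.01, 0.001],
    'how': ['middle', 'left-right', 'full'],
    'max_words': [50, 100],
    'two_dir': [True, False],
}
options = sample_grid(toy_grid, n_samples=5)
for option in options:
    print(option)
```

Enumerating the full product is fine for a small grid like this one; for a much larger grid, sampling each option independently would avoid materializing every combination.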

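All scores reported above are F_0.5 scores, which weight precision more heavily than recall. As a quick sanity check, the sketch below applies the standard F-beta formula, `F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)`, to the rounded precision and recall of the `adjoins` row in the per-relation table. The helper name `f_beta` is introduced here for illustration only.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        # Avoid division by zero when the classifier predicts no positives.
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded precision and recall of the `adjoins` row in the table above.
score = f_beta(0.672, 0.459, beta=0.5)
print(round(score, 3))
```

Up to rounding of the inputs, this agrees with the f-score of 0.615 reported for `adjoins`, and it shows why the low recall values in the table reduce the scores less than they would under F_1.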

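The per-relation early stopping loop above follows a common patience pattern. The sketch below isolates that pattern from the model code: `run_with_patience` and the list of scores are hypothetical stand-ins for one epoch of `fit()` followed by a dev-set evaluation. As in the cell above, training stops once the number of consecutive non-improving epochs exceeds `max_patience`.

```python
def run_with_patience(epoch_scores, max_patience=5):
    """Return (best_score, best_epoch, last_epoch) under patience-based stopping.

    `epoch_scores` stands in for the dev F0.5 measured after each epoch.
    """
    best_score, best_epoch, patience = 0.0, None, 0
    last_epoch = 0
    for epoch, score in enumerate(epoch_scores, start=1):
        last_epoch = epoch
        if score > best_score:
            # New best: remember it and reset the patience counter.
            best_score, best_epoch, patience = score, epoch, 0
        else:
            patience += 1
        if patience > max_patience:
            break
    return best_score, best_epoch, last_epoch


# Made-up scores: the best value occurs at epoch 3, then nothing improves.
scores = [0.40, 0.55, 0.61, 0.60, 0.58, 0.61, 0.59, 0.57, 0.60, 0.56, 0.50]
best_score, best_epoch, last_epoch = run_with_patience(scores, max_patience=5)
print(best_score, best_epoch, last_epoch)
```

Because the comparison is strict, a tie with the current best (epoch 6 here) counts as a non-improving epoch. In the real loop, the weights saved at the best epoch are reloaded afterwards, so the extra epochs run while waiting out the patience window do not affect the final model.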
## Bake-off [1 point]

For the bake-off, we will release a test set. The announcement will go out on the discussion forum. You will evaluate your custom model from the previous question on this new test set using the function `rel_ext.bake_off_experiment`. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.

In [31]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your code in the scope of the above conditional.
    ##### YOUR CODE HERE




In [32]:
# On an otherwise blank line in this cell, please enter
# your macro-average f-score (an F_0.5 score) as reported 
# by the code above. Please enter only a number between 
# 0 and 1 inclusive. Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your score in the scope of the above conditional.
    ##### YOUR CODE HERE


