# Homework 3: Relation extraction using distant supervision

In [1]:
__author__ = "Bill MacCartney"
__version__ = "CS224U, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Baseline](#Baseline)
1. [Homework questions](#Homework-questions)
  1. [Different model factory [1 point]](#Different-model-factory-[1-point])
  1. [Directional unigram features [2 points]](#Directional-unigram-features-[2-points])
  1. [The part-of-speech tags of the "middle" words [2 points]](#The-part-of-speech-tags-of-the-"middle"-words-[2-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to the developing really effective relation extraction systems using distant supervision. 

As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system for you to enter into the bake-off.

## Set-up

See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions.

In [2]:
from functools import partial
import nltk
import numpy as np
import os
import random
import rel_ext
from nltk.corpus import wordnet as wn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import string
import utils

In [3]:
sbt_obj = "Stockholm"
obt_obj = "Bergen"
#feature_counter["sbt_obj"] = 0
#feature_counter["obt_obj"] = 0
text = nltk.word_tokenize(sbt_obj)
nes = nltk.ne_chunk(nltk.pos_tag(text))
for ne in nes:
    if type(ne) is nltk.tree.Tree and ne.label() == "GPE":
        print(ne.label())
        #feature_counter["sbt_obj"] = 1

text = nltk.word_tokenize(obt_obj)
nes = nltk.ne_chunk(nltk.pos_tag(text))
for ne in nes:
    if type(ne) is nltk.tree.Tree and ne.label() == "GPE":
        print("obt")
        #feature_counter["obt_obj"] = 1

GPE
obt


As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:

In [4]:
DATA_HOME = '/home/kd/data/data'
rel_ext_data_home = os.path.join(DATA_HOME, 'rel_ext_data')
GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

In [5]:
glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

In [6]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))

In [7]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))

In [8]:
dataset = rel_ext.Dataset(corpus, kb)

You are not wedded to this set-up for splits. The bake-off will be conducted on a previously unseen test-set, so all of the data in `dataset` is fair game:

In [9]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)

In [10]:
splits

{'all': Corpus with 331,696 examples; KB with 45,884 triples,
 'dev': Corpus with 64,937 examples; KB with 9,248 triples,
 'tiny': Corpus with 3,474 examples; KB with 445 triples,
 'train': Corpus with 263,285 examples; KB with 36,191 triples}

## Baseline

In [11]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter, 
                                    use_middle_length=False, 
                                    use_entities=False,
                                    context_section='middle', # can be 'left', 'right', or 'middle'
                                    use_synsets=False):
    
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        words = None
        if context_section == 'left':
            words = ex.left.split(' ')
        elif context_section == 'right':
            words = ex.right.split(' ')
        elif context_section == 'middle':
            words = ex.middle.split(' ')
        else:
            words = ' '.join((ex.left, ex.mention_1, ex.middle, ex.mention_2, ex.right)).split(' ')
        
        if use_synsets:            
            pos_s = ex.middle_POS.split(' ')
            for word, pos in zip(words,pos_s):
                if word not in string.punctuation:
                    feature_counter[word] += 1
                    pos_split = pos.rsplit('/', 1)
                    word, pos_word = pos_split[0], pos_split[1]
                    synsets = wn.synsets(word, pos_word)
                    for syn in synsets:
                        feature_counter[syn.lemma()] += 1
        else: 
            for word in words:
                feature_counter[word] += 1
        
        if use_middle_length:
            feature_counter['NUM_WORD_IN_MIDDLE']  += len(words)
        if use_entities:
            feature_counter[kbt.sbj] += 1
            feature_counter[kbt.obj] += 1
            
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        words = None
        if context_section == 'left':
            words = ex.left.split(' ')
        elif context_section == 'right':
            words = ex.right.split(' ')
        else:
            words = ex.middle.split(' ')

        if use_synsets:            
            pos_s = ex.middle_POS.split(' ')
            for word, pos in zip(words,pos_s):
                if word not in string.punctuation:
                    feature_counter[word] += 1
                    pos_split = pos.rsplit('/', 1)
                    word, pos_word = pos_split[0], pos_split[1]
                    synsets = wn.synsets(word, pos_word)
                    for syn in synsets:
                        feature_counter[syn.lemma()] += 1
        else: 
            for word in words:
                feature_counter[word] += 1
        if use_middle_length:
            feature_counter['NUM_WORD_IN_MIDDLE']  += len(words)
        if use_entities:
            feature_counter[kbt.sbj] += 1
            feature_counter[kbt.obj] += 1
            
    return feature_counter

In [12]:
featurizers = [simple_bag_of_words_featurizer]

In [13]:
model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')

# we create below model factories, in case model_factory fails to converge, we need to use more iterations
# to try and achieve convergence
model_factory_2k = lambda: LogisticRegression(fit_intercept=True, solver='liblinear', max_iter=2000)
model_factory_4k = lambda: LogisticRegression(fit_intercept=True, solver='liblinear', max_iter=4000)

In [14]:
baseline_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.588      0.199      0.423        301       5677
capital                   0.688      0.232      0.493         95       5471
author                    0.815      0.528      0.735        509       5885
contains                  0.790      0.600      0.743       3904       9280
founders                  0.805      0.392      0.665        380       5756
profession                0.588      0.190      0.414        247       5623
parents                   0.890      0.545      0.790        312       5688
worked_at                 0.694      0.244      0.507        242       5618
film_performance          0.803      0.570      0.743        766       6142
has_sibling               0.884      0.228      0.562        499       5875
place_of_death            0.462      0.113      0.286        159       5535
place_of_bir

Studying model weights might yield insights:

In [15]:
rel_ext.examine_model_weights(baseline_results)

Highest and lowest feature weights for relation nationality:

     2.866 born
     1.875 Foreign
     1.869 Pinky
     ..... .....
    -1.393 and
    -1.406 American
    -1.693 state

Highest and lowest feature weights for relation capital:

     3.363 capital
     1.911 especially
     1.765 now
     ..... .....
    -1.342 and
    -2.794 Isfahan
    -2.815 Province

Highest and lowest feature weights for relation author:

     2.837 books
     2.620 book
     2.196 by
     ..... .....
    -2.655 only
    -2.999 1818
    -3.201 1890

Highest and lowest feature weights for relation contains:

     2.897 third-largest
     2.084 bordered
     2.035 continent
     ..... .....
    -2.537 who
    -2.699 Isfahan
    -2.757 second-largest

Highest and lowest feature weights for relation founders:

     4.008 founder
     3.954 founded
     2.550 label
     ..... .....
    -1.800 novel
    -1.849 William
    -2.018 Griffith

Highest and lowest feature weights for relation profession:

     3.4

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Different model factory [1 point]

The code in `rel_ext` makes it very easy to experiment with other classifier models: one need only redefine the `model_factory` argument. This question asks you to assess a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

__To submit:__ A call to `rel_ext.experiment` training on the 'train' part of `splits` and assessing on its `dev` part, with `featurizers` as defined above in this notebook and the `model_factory` set to one based in an `SVC` with `kernel='linear'` and all other arguments left with default values.

In [16]:
svc_model_factory = lambda: SVC(kernel='linear')

In [17]:
svc_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=svc_model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.567      0.196      0.411        301       5677
capital                   0.690      0.305      0.551         95       5471
author                    0.737      0.611      0.708        509       5885
contains                  0.782      0.604      0.738       3904       9280
founders                  0.720      0.426      0.633        380       5756
profession                0.527      0.235      0.422        247       5623
parents                   0.815      0.593      0.758        312       5688
worked_at                 0.667      0.289      0.529        242       5618
film_performance          0.752      0.627      0.723        766       6142
has_sibling               0.811      0.240      0.550        499       5875
place_of_death            0.377      0.126      0.270        159       5535
place_of_bir

### Directional unigram features [2 points]

The current bag-of-words representation makes no distinction between "forward" and "reverse" examples. But, intuitively, there is big difference between _X and his son Y_ and _Y and his son X_. This question asks you to modify `simple_bag_of_words_featurizer` to capture these differences. 

__To submit:__

1. A feature function `directional_bag_of_words_featurizer` that is just like `simple_bag_of_words_featurizer` except that it distinguishes "forward" and "reverse". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example. The precise nature of the mark you add for the two cases doesn't make a difference to the model.

2. The macro-average F-score on the `dev` set that you obtain from running `rel_ext.experiment` with `directional_bag_of_words_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment` as exemplified above in this notebook.)

3. `rel_ext.experiment` returns some of the core objects used in the experiment. How many feature names does the `vectorizer` have for the experiment run in the previous step? (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it on Piazza!)

In [18]:
def directional_bag_of_words_featurizer(kbt, corpus, feature_counter, fwd_prefix='$FWD_DIRECTION: ', 
                                        bwd_prefix='$BWD_DIREECTION: ', use_middle_length=False,
                                        use_entities=False, include_left=False, include_right=False,
                                       use_entities2=False):
    count = 0
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        count += 1
        words = ex.middle.split(' ')
        for word in words:
            word_direction = fwd_prefix + "_middle_" + word
            feature_counter[word_direction] += 1
        if use_middle_length:
            feature_counter[fwd_prefix + 'NUM_WORD_IN_MIDDLE']  += len(words)
        if include_left:
            words = ex.left.split(' ')
            for word in words:
                word_key = fwd_prefix + "_left_" + word
                feature_counter[word_key] += 1
            if use_middle_length:
                feature_counter[fwd_prefix + 'NUM_WORD_IN_LEFT']  += len(words)
        if include_right:
            words = ex.right.split(' ')
            for word in words:
                word_key = fwd_prefix + "_right_" + word
                feature_counter[word_key] += 1
            if use_middle_length:
                feature_counter[fwd_prefix + 'NUM_WORD_IN_RIGHT']  += len(words)                
        if use_entities:
            feature_counter[fwd_prefix + kbt.sbj] += 1
            feature_counter[fwd_prefix + kbt.obj] += 1
        if use_entities2:
            feature_counter["fwd_kbt.sbj"] += 1
            feature_counter["fwd_kbt.obj"] += 1

    count = max(count, 1)
    if use_middle_length:
        feature_counter[fwd_prefix + 'NUM_WORD_IN_MIDDLE']  /= count
    if include_left:
        feature_counter[fwd_prefix + 'NUM_WORD_IN_LEFT'] /= count
    if include_right:
        feature_counter[fwd_prefix + 'NUM_WORD_IN_RIGHT'] /= count
        
    count = 0
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        count += 1
        words = ex.middle.split(' ')
        for word in words:
            word_direction = bwd_prefix +"_middle_" + word
            feature_counter[word_direction] += 1
        if use_middle_length:
            feature_counter['BWD_NUM_WORD_IN_MIDDLE']  += len(words)
            
        if include_left:
            words = ex.left.split(' ')
            for word in words:
                word_key = bwd_prefix + "_left_" + word
                feature_counter[word_key] += 1
            if use_middle_length:
                feature_counter['BWD_NUM_WORD_IN_LEFT']  += len(words)
        if include_right:
            words = ex.right.split(' ')
            for word in words:
                word_key = bwd_prefix + "_right_" + word
                feature_counter[word_key] += 1
            if use_middle_length:
                feature_counter[bwd_prefix +"BWD_NUM_WORD_IN_RIGHT'"]  += len(words)
        if use_entities:
            feature_counter[bwd_prefix + kbt.sbj] += 1
            feature_counter[bwd_prefix + kbt.obj] += 1
        if use_entities2:
            feature_counter["bwd_kbt.sbj"] += 1
            feature_counter["bwd_kbt.obj"] += 1
            
    count = max(count, 1)

    if use_middle_length:
        feature_counter[fwd_prefix + 'NUM_WORD_IN_MIDDLE']  /= count
    if include_left:
        feature_counter[fwd_prefix + 'NUM_WORD_IN_LEFT'] /= count
    if include_right:
        feature_counter[fwd_prefix + 'NUM_WORD_IN_RIGHT'] /= count

    return feature_counter

In [19]:
directional_bag_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[directional_bag_of_words_featurizer],
    model_factory=model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.648      0.233      0.477        301       5677
capital                   0.641      0.263      0.498         95       5471
author                    0.848      0.591      0.780        509       5885
contains                  0.809      0.668      0.776       3904       9280
founders                  0.793      0.413      0.670        380       5756
profession                0.731      0.231      0.510        247       5623
parents                   0.900      0.519      0.785        312       5688
worked_at                 0.810      0.264      0.573        242       5618
film_performance          0.855      0.653      0.805        766       6142
has_sibling               0.887      0.251      0.588        499       5875
place_of_death            0.697      0.145      0.395        159       5535
place_of_bir

In [20]:
directional_bag_left_right_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[partial(directional_bag_of_words_featurizer, use_middle_length=True, 
                         use_entities=True, include_left=True, include_right=True)],
    model_factory=model_factory_4k,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.831      0.654      0.789        301       5677
capital                   0.686      0.253      0.511         95       5471
author                    0.877      0.727      0.842        509       5885
contains                  0.860      0.692      0.820       3904       9280
founders                  0.790      0.495      0.706        380       5756
profession                0.877      0.462      0.743        247       5623
parents                   0.873      0.641      0.814        312       5688
worked_at                 0.761      0.343      0.612        242       5618
film_performance          0.861      0.702      0.824        766       6142
has_sibling               0.901      0.673      0.844        499       5875
place_of_death            0.795      0.415      0.672        159       5535
place_of_bir

### The part-of-speech tags of the "middle" words [2 points]

Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. Let's begin to explore whether there is information in these sequences, focusing on `middle_POS`.

__To submit:__

1. A feature function `middle_bigram_pos_tag_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a feature for bigram POS sequences. For example, given 

  `The/DT dog/N napped/V`
  
   we obtain the list of bigram POS sequences
  
   `b = ['<s> DT', 'DT N', 'N V', 'V </s>']`. 
   
   Of course, `middle_bigram_pos_tag_featurizer` should return count dictionaries defined in terms of such bigram POS lists, on the model of `simple_bag_of_words_featurizer`.
   
   Don't forget the start and end tags, to model those environments properly!

2. The macro-average F-score on the `dev` set that you obtain from running `rel_ext.experiment` with `middle_bigram_pos_tag_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment` as exemplified above in this notebook.)

Note: To parse `middle_POS`, one splits on whitespace to get the `word/TAG` pairs. Each of these pairs `s` can be parsed with `s.rsplit('/', 1)`.

In [21]:
def pos_featurize(pos_segments, feature_counter, prefix=""):
    word_POSs = pos_segments.split(' ')
    len_POS = len(word_POSs)
    for i in range(-1, len_POS - 1):
        pos = word_POSs[i].rsplit('/', 1)
        bigram = ""
        if len(pos) > 1:
            if i == -1:
                bigram = '<s> ' + pos[1]
            elif i == len_POS - 2:
                bigram = pos[1] + ' </s>'
            else:
                bigram = pos[1] + " " + word_POSs[i+1].rsplit('/', 1)[1]
        feature_counter[prefix + bigram] += 1
    return feature_counter

In [22]:
def bigram_pos_tag_featurizer(kbt, corpus, feature_counter, use_left=False, use_right=False, use_bt_pos=False):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        if use_bt_pos:
            mention_pos_1 = ex.mention_1_POS.rsplit('/',1)
            mention_pos_2 = ex.mention_1_POS.rsplit('/',1)
            feature_counter["sbj_"+mention_pos_1[1]] = 1
            feature_counter["obj_"+mention_pos_1[1]] = 1
        feature_counter = pos_featurize(ex.middle_POS, feature_counter, "middle")
        if use_left:
            feature_counter = pos_featurize(ex.left_POS, feature_counter, "left")
        if use_right:
            feature_counter = pos_featurize(ex.right_POS, feature_counter, "right")
    return feature_counter

In [23]:
def middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter):
    feature_counter = bigram_pos_tag_featurizer(kbt, corpus, feature_counter)
    return feature_counter

In [24]:
pos_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[middle_bigram_pos_tag_featurizer],
    model_factory=model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.648      0.153      0.393        301       5677
capital                   0.632      0.126      0.351         95       5471
author                    0.801      0.246      0.552        509       5885
contains                  0.712      0.290      0.551       3904       9280
founders                  0.689      0.134      0.377        380       5756
profession                0.692      0.182      0.444        247       5623
parents                   0.696      0.228      0.493        312       5688
worked_at                 0.589      0.136      0.354        242       5618
film_performance          0.721      0.230      0.505        766       6142
has_sibling               0.718      0.158      0.421        499       5875
place_of_death            0.607      0.107      0.314        159       5535
place_of_bir

In [25]:
pos_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[partial(bigram_pos_tag_featurizer, use_left=True, use_right=False, use_bt_pos=True)],
    model_factory=model_factory_4k,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.600      0.249      0.468        301       5677
capital                   0.483      0.147      0.332         95       5471
author                    0.741      0.348      0.604        509       5885
contains                  0.790      0.283      0.582       3904       9280
founders                  0.670      0.197      0.453        380       5756
profession                0.663      0.223      0.475        247       5623
parents                   0.709      0.266      0.532        312       5688
worked_at                 0.633      0.236      0.473        242       5618
film_performance          0.705      0.305      0.559        766       6142
has_sibling               0.712      0.263      0.530        499       5875
place_of_death            0.587      0.233      0.450        159       5535
place_of_bir

### Your original system [4 points]

There are many options, and this could easily grow into a project. Here are a few ideas:

- Try out different classifier models, from `sklearn` and elsewhere.
- Add a feature that indicates the length of the middle.
- Augment the bag-of-words representation to include bigrams or trigrams (not just unigrams).
- Introduce features based on the entity mentions themselves. <!-- \[SPOILER: it helps a lot, maybe 4% in F-score. And combines nicely with the directional features.\] -->
- Experiment with features based on the context outside (rather than between) the two entity mentions — that is, the words before the first mention, or after the second.
- Try adding features which capture syntactic information, such as the dependency-path features used by Mintz et al. 2009. The [NLTK](https://www.nltk.org/) toolkit contains a variety of [parsing algorithms](http://www.nltk.org/api/nltk.parse.html) that may help.
- The bag-of-words representation does not permit generalization across word categories such as names of people, places, or companies. Can we do better using word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/)?
- Consider adding features based on WordNet synsets. Here's a little code to get you started with that:
  ```
  from nltk.corpus import wordnet as wn
  dog_compatible_synsets = wn.synsets('dog', pos='n')
 ```

In [26]:
def bow_featurize(words, feature_counter, n, prefix="", directional_prefix="", use_middle_length=False):
    for i in range(0, len(words), n):
            end = i + n
            if (len(words) - i) < n:
                end = len(words)
            n_gram = ' '.join(words[i:end])
            n_gram = directional_prefix + n_gram
            feature_counter[prefix + n_gram] += 1
    if use_middle_length:
        feature_counter[directional_prefix+'NUM_WORD_IN_MIDDLE']  += len(words)
    return feature_counter

In [27]:
def ngrams_bag_of_words_featurizer(kbt, corpus, feature_counter, n=2, 
                                   directional=False, use_middle_length=False,
                                   use_left=False, use_right=False):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        words = ex.middle.split(' ')
        directional_prefix=""
        if directional:
            directional_prefix = "FWD_"
        feature_counter = bow_featurize(words, feature_counter, n, "middle_", directional_prefix, use_middle_length)
        if use_left:
            words = ex.middle.split(' ')
            feature_counter = bow_featurize(words, feature_counter, n, "left_", directional_prefix, use_middle_length)
        if use_right:
            words = ex.middle.split(' ')
            feature_counter = bow_featurize(words, feature_counter, n, "right_", directional_prefix, use_middle_length)
        
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        words = ex.middle.split(' ')
        directional_prefix=""
        if directional:
            directional_prefix = "BWD_"
        feature_counter = bow_featurize(words, feature_counter, n, "middle_", directional_prefix, use_middle_length)
        if use_left:
            words = ex.middle.split(' ')
            feature_counter = bow_featurize(words, feature_counter, n, "left_", directional_prefix, use_middle_length)
        if use_right:
            words = ex.middle.split(' ')
            feature_counter = bow_featurize(words, feature_counter, n, "right_", directional_prefix, use_middle_length)
    return feature_counter

In [28]:
bigrams_bag_of_words_featurizer = partial(ngrams_bag_of_words_featurizer, n=2)
trigrams_bag_of_words_featurizer = partial(ngrams_bag_of_words_featurizer, n=3)

In [29]:
def ensembled_bow_pos_ngrams_final(kbt, corpus, feature_counter):
    feature_counter = directional_bag_of_words_featurizer(kbt, corpus, feature_counter, use_middle_length=True,
                                        use_entities=True, include_left=True, include_right=True)
    feature_counter = bigram_pos_tag_featurizer(kbt, corpus, feature_counter, use_left=True, use_bt_pos=True)
    return ngrams_bag_of_words_featurizer(kbt, corpus, feature_counter, n=2, directional=True, 
                                        use_left=True, use_right=True)

In [30]:
ensembled_bow_pos_ngrams_direct_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[ensembled_bow_pos_ngrams_final],
    model_factory=model_factory_4k,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.879      0.698      0.835        301       5677
capital                   0.676      0.263      0.514         95       5471
author                    0.870      0.764      0.847        509       5885
contains                  0.877      0.762      0.851       3904       9280
founders                  0.795      0.532      0.723        380       5756
profession                0.857      0.510      0.754        247       5623
parents                   0.895      0.654      0.833        312       5688
worked_at                 0.812      0.430      0.690        242       5618
film_performance          0.873      0.726      0.839        766       6142
has_sibling               0.891      0.685      0.840        499       5875
place_of_death            0.830      0.459      0.714        159       5535
place_of_bir

## Bake-off [1 point]

For the bake-off, we will release a test set right after class on April 29. The announcement will go out on Piazza. You will evaluate your custom model from the previous question on these new datasets using the function `rel_ext.bake_off_experiment`. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187248

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on May 1. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

In [31]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.
rel_ext.bake_off_experiment(ensembled_bow_pos_ngrams_direct_results,
    rel_ext_data_home,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
nationality               0.848      0.640      0.796        383       7067
capital                   0.566      0.261      0.459        115       6799
author                    0.879      0.730      0.844        645       7329
contains                  0.847      0.782      0.833       3808      10492
founders                  0.784      0.565      0.728        444       7128
profession                0.829      0.423      0.695        310       6994
parents                   0.912      0.700      0.860        427       7111
worked_at                 0.846      0.492      0.740        323       7007
film_performance          0.883      0.740      0.850       1011       7695
has_sibling               0.889      0.713      0.847        717       7401
place_of_death            0.860      0.460      0.732        200       6884
place_of_bir

In [32]:
# On an otherwise blank line in this cell, please enter
# your macro-average f-score (an F_0.5 score) as reported 
# by the code above. Please enter only a number between 
# 0 and 1 inclusive. Please do not remove this comment.
0.755

0.755