### References:
1. Videos and example code for this project: https://web.stanford.edu/class/cs224u/2021/
2. The knowledge base is a filtered version of a file available in its original form here: https://freebase-easy.cs.uni-freiburg.de/dump/




### Imports

In [1]:
# general imports 
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from collections import Counter

# imports specific to relation extraction 
import rel_ext
import utils

### Prepare the Dataset:
1. Import corpus
2. Import knowledge base
3. Combine knowledge base and corpus into a dataset object
4. Create dataset splits (training and test sets)

In [2]:
# create corpus object from pre-annotated file

corpus = rel_ext.Corpus('washington_post_test.tsv.gz')

In [3]:
# create knowledge base from pre-filtered file

kb = rel_ext.KB('Atsuko_filtered_KB.tsv.gz')

In [4]:
# list of all relation labels available in this knowledge base

kb.all_relations

['adjoins',
 'capital',
 'contains',
 'has_spouse',
 'nationality',
 'place_of_birth',
 'place_of_death',
 'worked_at']

In [5]:
# create a dataset by combining the corpus and knowledge base 

dataset = rel_ext.Dataset(corpus, kb)

In [6]:
# create training and test splits

splits = dataset.build_splits(
    split_names=['train', 'test'],
    split_fracs=[0.80, 0.20],
    seed=1)

In [7]:
# this is what the splits look like

splits

{'train': Corpus with 54,237 examples; KB with 22,313 triples,
 'test': Corpus with 14,543 examples; KB with 6,262 triples,
 'all': Corpus with 68,780 examples; KB with 28,575 triples}
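
As a quick sanity check on the output above, the split sizes can be compared with the requested 80/20 fractions. This is a minimal sketch that uses only the counts printed above; the helper name `split_fractions` is made up for illustration. The observed fractions are close to, but not exactly, 0.80 and 0.20. One plausible reason (an assumption about `rel_ext`, not verified here) is that the split is made over entities rather than over individual examples or triples.

```python
# Counts copied from the `splits` output above.
corpus_counts = {'train': 54237, 'test': 14543, 'all': 68780}
kb_counts = {'train': 22313, 'test': 6262, 'all': 28575}


def split_fractions(counts):
    """Return each split's share of the 'all' total (hypothetical helper)."""
    total = counts['all']
    return {name: n / total for name, n in counts.items() if name != 'all'}


corpus_fracs = split_fractions(corpus_counts)
kb_fracs = split_fractions(kb_counts)

for name in ('train', 'test'):
    print(f"{name}: corpus {corpus_fracs[name]:.3f}, KB {kb_fracs[name]:.3f}")
```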

In [8]:
# training set contains this distribution of labels 

splits['train'].count_examples()

                                             examples
relation               examples    triples    /triple
--------               --------    -------    -------
adjoins                     641       1283       0.50
capital                      90        406       0.22
contains                    271      14461       0.02
has_spouse                    4       2419       0.00
nationality                  14       1296       0.01
place_of_birth                0        874       0.00
place_of_death                4        676       0.01
worked_at                     1        898       0.00
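
The `examples/triple` column in the table above appears to be the number of corpus examples divided by the number of KB triples for each relation. The sketch below recomputes it from the printed training-set counts; the helper `examples_per_triple` is a made-up name, and the rounding to two decimals is an assumption based on how the table is displayed.

```python
# (examples, triples) per relation, copied from the training-set table above.
train_counts = {
    'adjoins': (641, 1283),
    'capital': (90, 406),
    'contains': (271, 14461),
    'has_spouse': (4, 2419),
    'nationality': (14, 1296),
    'place_of_birth': (0, 874),
    'place_of_death': (4, 676),
    'worked_at': (1, 898),
}


def examples_per_triple(examples, triples):
    """Ratio of corpus examples to KB triples (hypothetical helper)."""
    return examples / triples if triples else 0.0


ratios = {rel: round(examples_per_triple(*counts), 2)
          for rel, counts in train_counts.items()}

for rel, ratio in ratios.items():
    print(f"{rel:<16}{ratio:.2f}")
```

Only `adjoins`, `capital`, and `contains` have more than a handful of corpus examples, which is worth keeping in mind when reading the per-relation scores later on.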


In [9]:
# test set contains this distribution of labels 

splits['test'].count_examples()

                                             examples
relation               examples    triples    /triple
--------               --------    -------    -------
adjoins                     145        419       0.35
capital                      19        116       0.16
contains                     59       4220       0.01
has_spouse                    0        575       0.00
nationality                   1        302       0.00
place_of_birth                1        223       0.00
place_of_death                0        155       0.00
worked_at                     0        252       0.00


### Bag of Words Featurizer
1. Functions for the three featurizers used in this project 
    - Bag of words for middle feature (text that comes in between entity 1 and entity 2 in the sentence)
    - Bag of words for left feature (text that comes before entity 1 in the sentence)
    - Bag of words for right feature (text that comes after entity 2 in the sentence)
2. Example of what this featurizer looks like when applied to one relational triple

#### Bag of words functions

In [10]:
def middle_BOW_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter

In [11]:
def left_BOW_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.left.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.left.split(' '):
            feature_counter[word] += 1
    return feature_counter

In [12]:
def right_BOW_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.right.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.right.split(' '):
            feature_counter[word] += 1
    return feature_counter
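
The three featurizers above differ only in which text field they read. As a possible refactoring, a single factory could build all three. The sketch below is self-contained and therefore uses hypothetical stand-ins (`Example`, `KBTriple`, `ToyCorpus`) in place of the real `rel_ext` objects; `make_bow_featurizer` is also a made-up name, not part of `rel_ext`.

```python
from collections import Counter, namedtuple

# Hypothetical stand-ins for the rel_ext objects, for illustration only.
Example = namedtuple('Example', 'left middle right')
KBTriple = namedtuple('KBTriple', 'rel sbj obj')


class ToyCorpus:
    """Minimal corpus exposing get_examples_for_entities (an assumption)."""

    def __init__(self, examples_by_pair):
        self.examples_by_pair = examples_by_pair

    def get_examples_for_entities(self, e1, e2):
        return self.examples_by_pair.get((e1, e2), [])


def make_bow_featurizer(field):
    """Build a bag-of-words featurizer over 'left', 'middle', or 'right'."""
    def featurizer(kbt, corpus, feature_counter):
        # Look up the entity pair in both orders, as the functions above do.
        for e1, e2 in ((kbt.sbj, kbt.obj), (kbt.obj, kbt.sbj)):
            for ex in corpus.get_examples_for_entities(e1, e2):
                for word in getattr(ex, field).split(' '):
                    feature_counter[word] += 1
        return feature_counter
    return featurizer


middle_bow = make_bow_featurizer('middle')

toy_corpus = ToyCorpus({
    ('Kenya', 'Somalia'): [
        Example(
            left='The al-Qaida-linked extremist group has vowed retribution on ',
            middle=' for sending troops to ',
            right=' to fight it.'),
    ],
})
counts = middle_bow(KBTriple('adjoins', 'Kenya', 'Somalia'), toy_corpus, Counter())
print(counts)
```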

#### Example of the Bag of Words Featurizer in Action:
1. Get all examples from the corpus for the entity pair (Kenya, Somalia)
2. Apply middle bag of words featurizer
3. Print result

In [13]:
# 2 sentences in the entire corpus that contain the entity pair Kenya and Somalia 

ex = corpus.get_examples_for_entities('Kenya', 'Somalia')
ex

[Example(entity_1='Kenya', entity_2='Somalia', left='The al-Qaida-linked extremist group has vowed retribution on ', mention_1='Kenya', middle=' for sending troops to ', mention_2='Somalia', right=' to fight it.', left_POS='The/DET al/PROPN -/PUNCT Qaida/PROPN -/PUNCT linked/VERB extremist/ADJ group/NOUN has/AUX vowed/VERB retribution/NOUN on/ADP', mention_1_POS='Kenya/PROPN', middle_POS='/SPACE for/ADP sending/VERB troops/NOUN to/PART', mention_2_POS='Somalia/PROPN', right_POS='/SPACE to/PART fight/VERB it/PRON ./PUNCT'),
 Example(entity_1='Kenya', entity_2='Somalia', left='NAIROBI, ', mention_1='Kenya', middle=' A police officer in ', mention_2='Somalia', right=' says a car bomb blast near a security checkpoint at the presidential palace in the capital killed at least two people.', left_POS='NAIROBI/PROPN ,/PUNCT', mention_1_POS='Kenya/PROPN', middle_POS='/SPACE A/DET police/NOUN officer/NOUN in/ADP', mention_2_POS='Somalia/PROPN', right_POS='/SPACE says/VERB a/DET car/NOUN bomb/NOUN

In [33]:
# this is what example 1 looks like in its sentence form

''.join((ex[0][2], ex[0][3], ex[0][4], ex[0][5], ex[0][6]))

'The al-Qaida-linked extremist group has vowed retribution on Kenya for sending troops to Somalia to fight it.'

In [34]:
# this is what example 2 looks like in its sentence form

''.join((ex[1][2], ex[1][3], ex[1][4], ex[1][5], ex[1][6]))

'NAIROBI, Kenya A police officer in Somalia says a car bomb blast near a security checkpoint at the presidential palace in the capital killed at least two people.'
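
The positional indices 2 through 6 used above correspond to the `left`, `mention_1`, `middle`, `mention_2`, and `right` fields shown in the example output. Joining by field name may be easier to read. The sketch below uses a hypothetical namedtuple stand-in for the example object (without the POS fields), and `to_sentence` is a made-up helper name.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the first seven fields shown in the output.
Example = namedtuple(
    'Example', 'entity_1 entity_2 left mention_1 middle mention_2 right')


def to_sentence(example):
    """Rebuild the original sentence from its five text spans."""
    return ''.join((example.left, example.mention_1, example.middle,
                    example.mention_2, example.right))


first = Example(
    entity_1='Kenya',
    entity_2='Somalia',
    left='The al-Qaida-linked extremist group has vowed retribution on ',
    mention_1='Kenya',
    middle=' for sending troops to ',
    mention_2='Somalia',
    right=' to fight it.')

sentence = to_sentence(first)
print(sentence)
```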

In [35]:
test = kb.get_triples_for_entities('Kenya', 'Somalia')
test

[KBTriple(rel='adjoins', sbj='Kenya', obj='Somalia')]

In [36]:
kbt = test[0]
kbt

KBTriple(rel='adjoins', sbj='Kenya', obj='Somalia')

In [37]:
middle_BOW_featurizer(kbt, corpus, Counter())

Counter({'': 4,
         'for': 1,
         'sending': 1,
         'troops': 1,
         'to': 1,
         'A': 1,
         'police': 1,
         'officer': 1,
         'in': 1})
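
The empty-string key with a count of 4 comes from `split(' ')`: each of the two middle spans starts and ends with a space, so each contributes two empty tokens. The sketch below reproduces this with the two middle strings from the examples above and compares it with `split()` (no argument), which drops empty tokens. Switching to `split()` would be a change to the featurizer, not what the notebook does.

```python
from collections import Counter

# The two `middle` spans from the Kenya/Somalia examples above.
middles = [' for sending troops to ', ' A police officer in ']

# split(' ') keeps empty strings produced by leading/trailing spaces.
with_empties = Counter(w for m in middles for w in m.split(' '))

# split() with no argument splits on runs of whitespace and drops empties.
without_empties = Counter(w for m in middles for w in m.split())

print(with_empties[''])
print('' in without_empties)
```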

### Pipeline Function to Train, Predict, and Evaluate

In [19]:
def get_results(featurizer, model):
    """Train, predict, and evaluate one featurizer/model combination.

    Returns a dict mapping each relation label to its F1 score.
    """
    # set featurizer
    if featurizer == 'middle':
        featurizers = [middle_BOW_featurizer]
    elif featurizer == 'right':
        featurizers = [right_BOW_featurizer]
    elif featurizer == 'left':
        featurizers = [left_BOW_featurizer]
    elif featurizer == 'glove':
        featurizers = [glove_middle_featurizer]
    else:
        raise ValueError(f"Unknown featurizer: {featurizer!r}")

    # set model
    if model == 'LR':
        model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')
    elif model == 'SVC':
        model_factory = lambda: SVC(kernel='linear')
    else:
        raise ValueError(f"Unknown model: {model!r}")

    # train
    train_result = rel_ext.train_models(
        splits, split_name='train',
        model_factory=model_factory,
        featurizers=featurizers)

    # predict
    predictions, true_labels = rel_ext.predict(
        splits, train_result, split_name='test')

    # obtain results and store them to a dictionary
    results = rel_ext.evaluate_predictions(predictions, true_labels)
    all_vals = list(results.values())
    all_rels = list(kb.all_relations)

    # all F1 scores for one featurizer/model combination
    # (relies on the results being in the same order as kb.all_relations)
    return_me = {}
    for i in range(len(all_rels)):
        return_me[all_rels[i]] = all_vals[i][2]

    return return_me
        

### Run model for:
1. Middle BOW + Logistic Regression
2. Middle BOW + SVC
3. Left BOW + Logistic Regression
4. Left BOW + SVC
5. Right BOW + Logistic Regression
6. Right BOW + SVC

In [20]:
all_results = []

In [21]:
all_results.append(get_results('middle', 'LR'))
all_results.append(get_results('middle', 'SVC'))
all_results.append(get_results('left', 'LR'))
all_results.append(get_results('left', 'SVC'))
all_results.append(get_results('right', 'LR'))
all_results.append(get_results('right', 'SVC'))
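
The six calls above could also be generated from the two option lists, which keeps the run order and the later column names in sync. The sketch below is self-contained, so it uses a hypothetical `get_results_stub` in place of the real `get_results` (which needs the trained pipeline); the column-name pattern is an assumption based on the names used for the results table.

```python
from itertools import product


def get_results_stub(featurizer, model):
    """Hypothetical stand-in for get_results; returns its arguments only."""
    return {'featurizer': featurizer, 'model': model}


featurizer_names = ['middle', 'left', 'right']
model_names = ['LR', 'SVC']

# product() varies the last list fastest: middle/LR, middle/SVC, left/LR, ...
combos = list(product(featurizer_names, model_names))

all_results = [get_results_stub(f, m) for f, m in combos]
column_names = [f"{f.capitalize()}_{m}" for f, m in combos]

print(column_names)
```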



### Convert results to a dataframe and save it as a csv file to be analyzed/graphed in a different notebook

In [22]:
import pandas as pd

In [23]:
df = pd.DataFrame(all_results)
df = df.T
df = df.reset_index()

In [24]:
df

            index         0         1         2         3         4         5
0         adjoins  0.925836  0.886311  0.925301  0.900607  0.857788  0.821475
1         capital  0.972222  0.951087  0.892857  0.875000  0.815217  0.763081
2        contains  0.992354  0.995379  0.991273  0.991322  0.986098  0.988235
3      has_spouse  0.991721  0.997225  0.991721  0.995843  0.965089  0.975568
4     nationality  0.983660  0.988830  0.978544  0.981095  0.932466  0.932466
5  place_of_birth  0.984916  0.995516  0.981432  0.984916  0.928870  0.941476
6  place_of_death  0.979772  0.989783  0.974843  0.979772  0.902212  0.910693
7       worked_at  0.987461  0.996835  0.978261  0.987461  0.940299  0.943114


In [25]:
df.columns =['relation_label', 'Middle_LR', 'Middle_SVC', 'Left_LR', 'Left_SVC', 'Right_LR', 'Right_SVC']

In [26]:
df

   relation_label  Middle_LR  Middle_SVC   Left_LR  Left_SVC  Right_LR  Right_SVC
0         adjoins   0.925836    0.886311  0.925301  0.900607  0.857788   0.821475
1         capital   0.972222    0.951087  0.892857  0.875000  0.815217   0.763081
2        contains   0.992354    0.995379  0.991273  0.991322  0.986098   0.988235
3      has_spouse   0.991721    0.997225  0.991721  0.995843  0.965089   0.975568
4     nationality   0.983660    0.988830  0.978544  0.981095  0.932466   0.932466
5  place_of_birth   0.984916    0.995516  0.981432  0.984916  0.928870   0.941476
6  place_of_death   0.979772    0.989783  0.974843  0.979772  0.902212   0.910693
7       worked_at   0.987461    0.996835  0.978261  0.987461  0.940299   0.943114


In [27]:
df.to_csv('model_results.csv', index=False)

### Find matches: 
I wanted to see which relational triples in the knowledge base have entity pairs that actually appear together in the corpus.

In [28]:
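def find_match(dataset):
    # use the corpus and knowledge base of the dataset that was passed in
    corpus = dataset.corpus
    kb = dataset.kb

    # collect every entity pair (in either order) that has a KB triple
    related_pairs = set()
    for ex in corpus.examples:
        if kb.get_triples_for_entities(ex.entity_1, ex.entity_2):
            related_pairs.add((ex.entity_1, ex.entity_2))
        if kb.get_triples_for_entities(ex.entity_2, ex.entity_1):
            related_pairs.add((ex.entity_2, ex.entity_1))

    # look up the list of triples for each related pair
    matches_found = []
    for pair in related_pairs:
        match = kb.get_triples_for_entities(pair[0], pair[1])
        matches_found.append(match)
    return matches_found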

In [29]:
matches = find_match(dataset)

In [30]:
len(matches)

399

In [31]:
# look at the first 10

matches[:10]

[[KBTriple(rel='adjoins', sbj='Kuwait', obj='Saudi_Arabia')],
 [KBTriple(rel='capital', sbj='Serbia', obj='Belgrade'),
  KBTriple(rel='contains', sbj='Serbia', obj='Belgrade')],
 [KBTriple(rel='contains', sbj='Australia', obj='Canberra'),
  KBTriple(rel='capital', sbj='Australia', obj='Canberra')],
 [KBTriple(rel='adjoins', sbj='Liberia', obj='Guinea')],
 [KBTriple(rel='contains', sbj='New_York', obj='Westchester_County')],
 [KBTriple(rel='adjoins', sbj='Georgia', obj='Florida')],
 [KBTriple(rel='contains', sbj='Mexico', obj='Ensenada')],
 [KBTriple(rel='contains', sbj='Kentucky', obj='Louisville')],
 [KBTriple(rel='contains', sbj='England', obj='Manchester')],
 [KBTriple(rel='adjoins', sbj='Honduras', obj='Nicaragua')]]

### Predicting Relation Labels for Unseen Pairs
The function below extracts newly predicted relation labels for all entity pairs in the corpus that did not have a match in the knowledge base.

In [32]:
rel_ext.find_new_relation_instances(dataset, featurizers=[middle_BOW_featurizer])

Highest probability examples for relation adjoins:

     1.000 KBTriple(rel='adjoins', sbj='Ladkani', obj='about_$3,000')
     1.000 KBTriple(rel='adjoins', sbj='about_$3,000', obj='Ladkani')
     0.998 KBTriple(rel='adjoins', sbj='Taiwanese', obj='China')
     0.998 KBTriple(rel='adjoins', sbj='China', obj='Taiwanese')
     0.997 KBTriple(rel='adjoins', sbj='Sudan', obj='2011')
     0.997 KBTriple(rel='adjoins', sbj='2011', obj='Sudan')
     0.992 KBTriple(rel='adjoins', sbj='2010.Now_Ouyang', obj='China')
     0.992 KBTriple(rel='adjoins', sbj='China', obj='2010.Now_Ouyang')
     0.992 KBTriple(rel='adjoins', sbj='CAD', obj='$1.4_billion')
     0.992 KBTriple(rel='adjoins', sbj='$1.4_billion', obj='CAD')

Highest probability examples for relation capital:

     0.834 KBTriple(rel='capital', sbj='National_Security_Council', obj='William_Happer')
     0.834 KBTriple(rel='capital', sbj='day', obj='Omar_Marrero')
     0.834 KBTriple(rel='capital', sbj='Omar_Marrero', obj='day')
     0.83