<a href="https://colab.research.google.com/github/dbamman/nlp21/blob/main/HW7/HW_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import json
import gensim
import sys
import gensim.downloader as api
import torch
import torch.nn as nn
import numpy as np
import nltk


# Setup

In [None]:
!python -m nltk.downloader punkt

In [None]:
#Let's install qwikidata, which we will use to query the Wikidata triples
!pip install qwikidata
#Let's install pywikibot, which we will use to query Wikipedia articles
!pip install pywikibot
!export PYWIKIBOT_NO_USER_CONFIG=1
!pip install wikipedia

In [None]:
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW7/glove.6B.50d.50K.txt
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW7/dev_dataset.csv
!wget https://raw.githubusercontent.com/dbamman/nlp21/main/HW7/train_dataset.csv


In [None]:
user_config = '''
mylang = 'en'
family = 'wikipedia'
username = 'ExampleBot'
'''
with open("./user-config.py", "w") as f:
  f.write(user_config)

# **IMPORTANT**: GPU is not enabled by default

You must switch runtime environments if the next block of code prints "Running on cpu", or if later code fails with an error saying "ValueError: Expected a cuda device, but got: cpu".

Go to Runtime > Change runtime type > Hardware accelerator > GPU

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))

# Deliverable 1: Distant Supervision

In this component, we will query Wikipedia articles and structured triples of information from Wikidata.  We will align the triples with text via the algorithm in SLP3 Chapter 17, Figure 17.9.  We will do this for a small set of Wikipedia articles and relationship types.

First, let's download the text of the Wikipedia articles we are interested in.  For the purpose of Deliverable 1, this is 50 Alternative Hip Hop artists.


In [None]:

import pywikibot
import wikipedia
from pywikibot import pagegenerators
from tqdm import tqdm
import json
import requests
site = pywikibot.Site()

#This line grabs all the wikipedia articles for a given category
cat = pywikibot.Category(site,'Category:Alternative hip hop musicians')
#Now, we will use pywikibot to grab all wikipedia pages linked to this category
gen = pagegenerators.CategorizedPageGenerator(cat)


documents = {}
#For this homework, we will stop once 50 articles have been retrieved
counter = 0

#iterates through all the pages
for page in tqdm(gen):
    #grabs the title as plaintext
    title = page.title(with_ns=False)
    
    #If fetching the page raises an error, we skip the entity
    try:
      #grabs the wikipedia page text
      text = wikipedia.page(title)
      documents[title] = text.content
      counter += 1 
    except Exception:
      continue
    if counter >= 50:
      break

Let's look at an example of one of the returned Wikipedia pages.

In [None]:
documents['WebsterX']

Because the Wikidata corpus is incredibly large, we will use a series of SPARQL queries to get relevant triples for our corpus.  We will return all relevant triples for alternative hip hop artists and filter them later.

Please see https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_Help 
for more details on Wikidata SPARQL queries if interested.

In [None]:
'''
Now we will define the 5 queries we will be using for the distant supervision.

We are interested in artists' date of birth, place of birth, school attended,
start of musician career, and band name.

Please see https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_Help 
for more information on how to structure queries like this, if interested
'''

q1 = '''
SELECT DISTINCT ?human ?humanLabel ?dob 
WHERE
{
    VALUES ?professions {wd:Q177220 wd:Q639669}
    ?human wdt:P31 wd:Q5 .
    ?human wdt:P106 ?professions .
    ?human wdt:P136 ?genre .
    ?human wikibase:statements ?statementcount .
    ?human wdt:P136 wd:Q438503 .  
    ?human wdt:P569 ?dob.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

'''

q2 = '''

SELECT DISTINCT ?human ?humanLabel ?pobLabel 
WHERE
{
    VALUES ?professions {wd:Q177220 wd:Q639669}
    ?human wdt:P31 wd:Q5 .
    ?human wdt:P106 ?professions .
    ?human wdt:P136 ?genre .
    ?human wikibase:statements ?statementcount .
    ?human wdt:P136 wd:Q438503 .  
    ?human wdt:P19 ?pob.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
'''

q3 = '''
SELECT DISTINCT ?human ?humanLabel ?schoolLabel 
WHERE
{
    VALUES ?professions {wd:Q177220 wd:Q639669}
    ?human wdt:P31 wd:Q5 .
    ?human wdt:P106 ?professions .
    ?human wdt:P136 ?genre .
    ?human wikibase:statements ?statementcount .
    ?human wdt:P136 wd:Q438503 .  
    ?human wdt:P69 ?school.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
'''

q4 = '''

SELECT DISTINCT ?human ?humanLabel ?start 
WHERE
{
    VALUES ?professions {wd:Q177220 wd:Q639669}
    ?human wdt:P31 wd:Q5 .
    ?human wdt:P106 ?professions .
    ?human wdt:P136 ?genre .
    ?human wikibase:statements ?statementcount .
    ?human wdt:P136 wd:Q438503 .  
    ?human wdt:P2031 ?start.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
'''

q5 = '''
SELECT DISTINCT ?human ?humanLabel ?memberLabel
WHERE
{
    VALUES ?professions {wd:Q177220 wd:Q639669}
    ?human wdt:P31 wd:Q5 .
    ?human wdt:P106 ?professions .
    ?human wdt:P136 ?genre .
    ?human wikibase:statements ?statementcount .
    ?human wdt:P136 wd:Q438503 .  
    ?human wdt:P463 ?member.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

'''
queries = [q1, q2, q3, q4, q5]

In [None]:
'''
Next, let's query Wikidata for the triples we are interested in 

This cell sends each SPARQL query to Wikidata and stores the returned triples.
'''
from qwikidata.sparql import return_sparql_query_results
from datetime import datetime

triples = []
query_names = ["hasDateOfBirth", "hasPlaceOfBirth", "hasSchool", "hasYearStarted", "hasMembershipOf"]
for count, query in enumerate(queries):
  res = return_sparql_query_results(query)
  #The second and third SELECT variables hold the entity label and the value
  label_var, value_var = res['head']['vars'][1], res['head']['vars'][2]
  #We want to save all the triples which are returned from the query
  for item in res['results']['bindings']:
    subject = item[label_var]['value']
    value = item[value_var]['value']
    if count == 0:
      #date of birth: format as "Month D, YYYY"
      dt = datetime.fromisoformat(value.split("T")[0])
      value = dt.strftime("%B") + " " + str(dt.day) + ", " + str(dt.year)
    elif count == 3:
      #career start: keep only the year
      dt = datetime.fromisoformat(value.split("T")[0])
      value = str(dt.year)
    triples.append((query_names[count], subject, value))
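
The parsing loop above depends on the SPARQL JSON results layout: `head.vars` lists the selected variable names in order, and each entry of `results.bindings` maps a variable name to an object with a `value` field. The sketch below applies the same date formatting to a hand-written response, so it needs no network access. The entity `Jane Doe`, her date of birth, and the helper name `format_dob` are made up for illustration and are not real Wikidata output.

```python
from datetime import datetime

# Hypothetical response in the SPARQL JSON results layout (not real Wikidata output)
sample_response = {
    "head": {"vars": ["human", "humanLabel", "dob"]},
    "results": {"bindings": [{
        "human": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
        "humanLabel": {"type": "literal", "value": "Jane Doe"},
        "dob": {"type": "literal", "value": "1980-01-02T00:00:00Z"},
    }]},
}

def format_dob(timestamp):
    """Turn an ISO timestamp such as 1980-01-02T00:00:00Z into 'January 2, 1980'."""
    dt = datetime.fromisoformat(timestamp.split("T")[0])
    return "{} {}, {}".format(dt.strftime("%B"), dt.day, dt.year)

# The second and third selected variables are the entity label and the value
label_var, value_var = sample_response["head"]["vars"][1:3]
sample_triples = [
    ("hasDateOfBirth", row[label_var]["value"], format_dob(row[value_var]["value"]))
    for row in sample_response["results"]["bindings"]
]
print(sample_triples)
```

Each triple is stored as (relation name, entity label, value), which is the shape the alignment step expects.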

Let's examine the format of one of these returned triples, indicating that Kelis was born on August 21, 1979.

In [None]:
print(triples[0])

Now, let's iterate through our Wikipedia articles and align factual triples to sentences from our Wikipedia articles.

You will add code to follow the entity alignment algorithm described in Figure 17.9 in SLP3.  Do not worry about the training component of the algorithm; this is covered in Deliverable 2.

In [None]:
'''
Now, we will iterate through and align the triples to sentences to create the dataset.

This uses the algorithm from SLP3 Figure 17.9.
'''
import spacy
from tqdm import tqdm

def process_dataset(documents, triples, query_names):
  nlp = spacy.load("en_core_web_sm")
  aligned_dataset = []

  for document in tqdm(documents):
    doc = nlp(documents[document])
    sentences = list(doc.sents)
    for sent in sentences:
      sent = sent.text
      for relation in query_names:
        for triple in [t for t in triples if t[0] == relation]:
          should_align = False
          
          #YOUR CODE HERE

          if should_align:
            #Let's mark the entities with a special prefix and join the multi-word ones with underscores as in SLP
            formatted_sent = sent.replace(triple[1], "_ent1_" + "_".join(triple[1].split(" ")))
            formatted_sent = formatted_sent.replace(triple[2], "_ent2_" + "_".join(triple[2].split(" ")))
            aligned_dataset.append((formatted_sent, triple[0]))
          
  return aligned_dataset
aligned_dataset = process_dataset(documents, triples, query_names)
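
To make the output format concrete, here is a small standalone sketch of only the entity-marking step from `process_dataset`, applied to one sentence. It does not decide whether a sentence should be aligned; that part is left to you. The sentence, the triple, and the helper name `mark_entities` are hypothetical.

```python
def mark_entities(sent, triple):
    """Prefix both entities and join multi-word entities with underscores."""
    relation, ent1, ent2 = triple
    formatted = sent.replace(ent1, "_ent1_" + "_".join(ent1.split(" ")))
    formatted = formatted.replace(ent2, "_ent2_" + "_".join(ent2.split(" ")))
    return formatted, relation

# Made-up sentence and triple, for illustration only
example_sentence = "Jane Doe was born in Oakland ."
example_triple = ("hasPlaceOfBirth", "Jane Doe", "Oakland")

marked = mark_entities(example_sentence, example_triple)
print(marked)
```

After marking, each entity is a single whitespace-delimited token, which is what makes per-word positions relative to `_ent1_` and `_ent2_` easy to compute in Deliverable 2.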

Now let's look at our newly-aligned dataset, containing a small number of aligned triples.

In [None]:
aligned_dataset

We've now successfully used distant supervision to align sentences from Wikipedia articles to information triples from Wikidata.  Note that the dataset is not perfect, as it is created without human annotation.  This process scales up without additional human effort, at the cost of more compute time.  For Deliverable 2, we will provide a dataset created using distant supervision, as the compute time required to create a sizable dataset is large.

# Deliverable 2: Relation Prediction Model

Now that we have the process to create an aligned dataset, let's train a CNN-based model to predict a relationship from the text spans.  Note that we will be using a different, larger dataset than the one you created in Deliverable 1.

In [None]:
train_dataset = "./train_dataset.csv"
dev_dataset = "./dev_dataset.csv"

In [None]:
'''
Let's create a dictionary of relation types to define the classification output space
For this deliverable we have an additional category, no_relation_found, which can be
applied to sentences which do not align with a triple.
'''
query_names = ["hasDateOfBirth", "hasPlaceOfBirth", "hasSchool", "hasYearStarted", "hasMembershipOf", "no_relation_found"]

labels = {}
count = 0 
for query in query_names:
  labels[query] = count
  count += 1

In [None]:
def get_batches(x, s1, s2, y, xType, batch_size=12):
    '''
    Splits the data into consecutive batches of size batch_size.
    s1 and s2 hold the per-word position indices (relative to the first and
    second entity) that are fed to the position embeddings.
    '''
    batches_x = []
    batches_s1 = []
    batches_s2 = []
    batches_y = []
    for i in range(0, len(x), batch_size):
        batches_x.append(xType(x[i:i+batch_size]))
        batches_s1.append(xType(s1[i:i+batch_size]))
        batches_s2.append(xType(s2[i:i+batch_size]))
        batches_y.append(torch.LongTensor(y[i:i+batch_size]))
    
    return batches_x, batches_s1, batches_s2, batches_y

In [None]:
PAD_INDEX = 0             # reserved for padding words
UNKNOWN_INDEX = 1         # reserved for unknown words
SEP_INDEX = 2

MAX_DATA_LEN = 300

data_lens = []

def read_embeddings(filename, vocab_size=50000):
  """
  Utility function, loads in the `vocab_size` most common embeddings from `filename`
  
  Arguments:
  - filename:     path to file
                  (the embedding dimension is inferred from the first line of the file)
  - vocab_size:   maximum number of embeddings to load

  Returns 
  - embeddings:   torch.FloatTensor matrix of size (vocab_size x word_embedding_dim)
  - vocab:        dictionary mapping word (str) to index (int) in embedding matrix
  """

  # get the embedding size from the first embedding
  with open(filename, encoding="utf-8") as file:
    word_embedding_dim = len(file.readline().split(" ")) - 1

  vocab = {}

  embeddings = np.zeros((vocab_size, word_embedding_dim))
  with open(filename, encoding="utf-8") as file:
    for idx, line in enumerate(file):

      if idx + 2 >= vocab_size:
        break

      cols = line.rstrip().split(" ")
      val = np.array(cols[1:], dtype=np.float64)
      word = cols[0]
      embeddings[idx + 2] = val
      vocab[word] = idx + 2
  
  # a FloatTensor is a multidimensional matrix
  # that contains 32-bit floats in every entry
  # https://pytorch.org/docs/stable/tensors.html
  return torch.FloatTensor(embeddings), vocab
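
As a rough illustration of the indexing convention used by `read_embeddings`, the sketch below parses a tiny in-memory file in the GloVe text format (one word followed by its vector per line). It uses plain Python lists instead of NumPy and PyTorch, and the three words and their 2-dimensional vectors are made up. The point is that rows 0 and 1 stay reserved for padding and unknown words, so the first word in the file gets index 2.

```python
import io

PAD_INDEX = 0
UNKNOWN_INDEX = 1

# Hypothetical three-word, 2-dimensional embedding file in GloVe text format
toy_file = io.StringIO("the 0.1 0.2\nof 0.3 0.4\nmusic 0.5 0.6\n")

toy_vocab = {}
toy_rows = [[0.0, 0.0], [0.0, 0.0]]  # rows 0 and 1 are reserved for PAD and UNKNOWN
for idx, line in enumerate(toy_file):
    cols = line.rstrip().split(" ")
    toy_vocab[cols[0]] = idx + 2          # shift by 2 to skip the reserved rows
    toy_rows.append([float(v) for v in cols[1:]])

# Words outside the vocabulary fall back to UNKNOWN_INDEX
ids = [toy_vocab.get(w, UNKNOWN_INDEX) for w in "the sound of music".split()]
print(toy_vocab)
print(ids)
```

Here "sound" is not in the toy vocabulary, so it is mapped to `UNKNOWN_INDEX` rather than dropped.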



The format_data() function below is where you will add code to determine each word's position relative to the two entity mentions, m1 and m2.  As a reminder, we don't want negative values in m1_pos_list or m2_pos_list.  To address this, negative positions are indexed after max_length (300).  For example, the position -10 would be stored as 310, the position -17 would be stored as 317, and so on. 
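
The mapping for negative positions can be sketched as a small helper. This only illustrates the encoding rule stated above; the name `encode_position` is hypothetical, and how you compute the signed offset between a word and an entity is left to you.

```python
def encode_position(offset, max_length=300):
    """Map a signed offset to a non-negative index.

    Non-negative offsets are kept as they are; a negative offset -k is
    stored as max_length + k, so -10 becomes 310 when max_length is 300.
    """
    if offset >= 0:
        return offset
    return max_length + abs(offset)

encoded = [encode_position(o) for o in (0, 5, -10, -17)]
print(encoded)
```

With max_length = 300, offsets between -299 and 299 map to indices between 0 and 599, so a position embedding table needs to cover that whole range.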

In [None]:
import csv
def format_data(filename, vocab, labels, max_length):
    """
    Inputs:
      filename: path to the CSV file holding the dataset we wish to process
      vocab: GloVe vocabulary dictionary created by the read_embeddings function
      labels: dictionary mapping relationship name to integer index
      max_length: maximum length of input
    Returns:
      data: Input sentences processed as GloVe embedding indices
      data_m1: For each example in the dataset there is a list of positions (one for each word)
                from the word to the first entity (prefixed with _ent1_) with no negative values
      data_m2: For each example in the dataset there is a list of positions (one for each word)
                from the word to the second entity (prefixed with _ent2_) with no negative values
      data_labels:  Includes the integer label associated with each example in the dataset
    """    
    data = []
    data_labels = []
    data_m1 = []
    data_m2 = []
    # Read all rows up front so the file is closed before processing
    with open(filename, newline="", encoding="utf-8") as file:
        rows = list(csv.reader(file, delimiter=','))

    for line in rows:
        sentence = line[0]
        label = line[1]
        
        m1_pos_list = []
        m2_pos_list = []
        split_sentence = sentence.split(" ")

        #YOUR CODE HERE

        w_int = []
        for w in nltk.word_tokenize(sentence.lower()):
            # map words outside the vocabulary to UNKNOWN_INDEX
            if w in vocab:
                w_int.append(vocab[w])
            else:
                w_int.append(UNKNOWN_INDEX)
        data_lens.append(len(w_int))

        #makes sure the example isn't too long for our model
        if len(w_int) < max_length:
          w_int.extend([PAD_INDEX] * (max_length - len(w_int)))
          data.append((w_int))
          m1_pos_list.extend([max_length-1] * (max_length-len(m1_pos_list)))
          data_m1.append(m1_pos_list)
          m2_pos_list.extend([max_length*2-1] * (max_length-len(m2_pos_list)))
          data_m2.append(m2_pos_list)
          data_labels.append(labels[label])
    return data, data_m1, data_m2, data_labels

In [None]:
class EntityCNNClassifier(nn.Module):

   def __init__(self, params, pretrained_embeddings):
      super().__init__()
      self.seq_len = params["max_seq_len"]
      self.num_labels = params["label_length"]
      self.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)

      self.m1_embeddings = ...
      self.m2_embeddings = ...

      self.conv_2 = nn.Conv1d(82, 16, 2, 1)
      self.pool_2 = nn.MaxPool1d(299,1)

      self.fc = nn.Linear(16, self.num_labels)
    
   def forward(self, input, m1_pos_list, m2_pos_list): 
      x_word_emb = self.embeddings(input)

      x_m1 = ...
      x_m2 = ...

      x = torch.cat((x_word_emb, x_m1, x_m2), 2)
      x = x.permute(0, 2, 1)
    
      conv = self.conv_2(x)
      conv = torch.tanh(conv)
      conv = self.pool_2(conv)
      conv = conv.view((conv.shape[0], -1))


      # shape (batch, num_labels); no squeeze, so a batch of size 1 keeps its batch dimension
      self.out = self.fc(conv)
      return self.out

   def evaluate(self, x, s1, s2, y):
      
      self.eval()
      corr = 0.
      total = 0.

      with torch.no_grad():

        for batch_x, batch_s1, batch_s2, batch_y in zip(x, s1, s2, y):
          y_preds = self.forward(batch_x, batch_s1, batch_s2)
          for idx, y_pred in enumerate(y_preds):
              prediction = torch.argmax(y_pred)
              if prediction == batch_y[idx]:
                corr += 1.
              total += 1
      return corr/total
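
The hard-coded sizes in `EntityCNNClassifier` can be checked with a little arithmetic. The sketch below uses the output-length formula documented for PyTorch's `nn.Conv1d` and `nn.MaxPool1d`; the helper name `conv1d_out_len` is hypothetical. The position embedding width of 16 is an assumption inferred from the 82 input channels of `conv_2`, since `m1_embeddings` and `m2_embeddings` are left for you to define.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling layer."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

word_dim = 50        # glove.6B.50d vectors
position_dim = 16    # assumption: (82 - 50) / 2 for each position embedding
in_channels = word_dim + 2 * position_dim

# Conv1d(82, 16, 2, 1) over a 300-token input, then MaxPool1d(299, 1)
after_conv = conv1d_out_len(300, kernel_size=2, stride=1)
after_pool = conv1d_out_len(after_conv, kernel_size=299, stride=1)
print(in_channels, after_conv, after_pool)
```

Under these assumptions the pooled output has length 1 for each of the 16 filters, which is why the final layer is `nn.Linear(16, self.num_labels)`. It also means the model as written expects inputs padded to exactly 300 tokens.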



In [None]:
embs, cnn_vocab = read_embeddings("glove.6B.50d.50K.txt")
cnn_train_x, cnn_train_s1, cnn_train_s2, cnn_train_y = format_data(train_dataset, cnn_vocab, labels, 300)
cnn_dev_x, cnn_dev_s1, cnn_dev_s2, cnn_dev_y = format_data(dev_dataset, cnn_vocab, labels, 300)
cnn_trainX, cnn_trainS1, cnn_trainS2, cnn_trainY=get_batches(cnn_train_x, cnn_train_s1, cnn_train_s2, cnn_train_y, torch.LongTensor)
cnn_devX, cnn_devS1, cnn_devS2, cnn_devY=get_batches(cnn_dev_x, cnn_dev_s1, cnn_dev_s2, cnn_dev_y, torch.LongTensor)


In [None]:
cnnmodel = EntityCNNClassifier(params={"max_seq_len": 300, "label_length": len(labels)}, pretrained_embeddings=embs)

optimizer = torch.optim.Adam(cnnmodel.parameters(), lr=0.001, weight_decay=1e-5)
losses = []
cross_entropy=nn.CrossEntropyLoss()

num_epochs=15
best_dev_acc = 0.

for epoch in range(num_epochs):
    cnnmodel.train()

    for x, s1, s2, y in zip(cnn_trainX, cnn_trainS1, cnn_trainS2, cnn_trainY):
      y_pred = cnnmodel(x, s1, s2)
      loss = cross_entropy(y_pred.view(-1, cnnmodel.num_labels), y.view(-1))
      #store a plain float so each batch's computation graph can be freed
      losses.append(loss.item())
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
    dev_accuracy=cnnmodel.evaluate(cnn_devX, cnn_devS1, cnn_devS2, cnn_devY)
    if epoch % 1 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
        if dev_accuracy > best_dev_acc:
          torch.save(cnnmodel.state_dict(), 'best-cnnmodel-parameters.pt')
          best_dev_acc = dev_accuracy

cnnmodel.load_state_dict(torch.load('best-cnnmodel-parameters.pt'))
print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))