<a href="https://colab.research.google.com/github/asliakalin/NLP/blob/master/6.%20Neural%20Coreference%20Resolution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 6: Neural Coreference Resolution

**Due April 20, 2020 at 11:59pm**

In this homework, you will implement parts of a PyTorch model for neural coreference resolution, inspired by [Lee et al. (2017), “End-to-end Neural Coreference Resolution” (EMNLP)](https://arxiv.org/pdf/1707.07045.pdf).

### REMEMBER TO UPLOAD THE DATASET!
Click the Files icon > Upload, then upload the `train.conll` and `dev.conll` files that you downloaded from bCourses (Files/HW_6).

### Setup

In [None]:
import sys, re
from collections import Counter

import torch
from torch import nn
import torch.optim as optim

import numpy as np
from scipy.stats import spearmanr

We noticed that running this on CPU is faster than running on GPU. Thus, we will default to running on CPU. However, feel free to change it to GPU if you wish.

In [None]:
device = torch.device("cpu")
print("Running on {}".format(device))

Running on cpu


### Download and process data
Note: You do **not** have to modify this section.
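
For orientation, `read_conll` below reads each token's coreference annotation from the last whitespace-separated column, where `(7` opens a mention of entity 7, `7)` closes it, `(7)` marks a one-token mention, and `|` separates several annotations on one token. The following is a minimal sketch of that bracket matching for a single sentence; `parse_coref_column` and the sample column are hypothetical and not part of the homework code.

```python
def parse_coref_column(column_values):
    """Turn one sentence's coreference column into (start, end, entity_id) spans."""
    open_spans = {}  # entity id -> stack of start positions that are still open
    spans = []
    for position, value in enumerate(column_values):
        for part in value.split("|"):
            # anything without a bracket carries no mention boundary
            if not (part.startswith("(") or part.endswith(")")):
                continue
            entity_id = int(part.strip("()"))
            if part.startswith("("):
                open_spans.setdefault(entity_id, []).append(position)
            if part.endswith(")"):
                start = open_spans[entity_id].pop()
                spans.append((start, position, entity_id))
    return sorted(spans)


# hypothetical column values for a six-token sentence
column = ["(7", "7)", "-", "(3)", "-", "(7)"]
print(parse_coref_column(column))  # [(0, 1, 7), (3, 3, 3), (5, 5, 7)]
```

Each span is sentence-relative here; the notebook later adds the running token offset to obtain document-level positions.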

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2020-04-29 05:33:30--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-04-29 05:33:30--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-04-29 05:33:31--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [None]:
def read_conll(filename):

  docid=None
  partID=None

  # collection
  all_sents=[]
  all_ents=[]

  # for one doc
  all_doc_sents=[]
  all_doc_ents=[]

  # for one sentence
  sent=[]
  ents=[]

  named_ents=[]
  cur_tid=0
  open_count=0

  global_eid=0
  doc_eid_to_global_eid={}

  with open(filename, encoding="utf-8") as file:
    for line in file:
      if line.startswith("#begin document"):

        all_doc_ents=[]
        all_doc_sents=[]

        open_ents={}
        open_named_ents={}

        docid=None
        matcher=re.match(r"#begin document \((.*)\); part (.*)$", line.rstrip())
        if matcher is not None:
          docid=matcher.group(1)
          partID=matcher.group(2)

      elif line.startswith("#end document"):

        all_sents.append(all_doc_sents)
        all_ents.append(all_doc_ents)

        
      else:

        parts=re.split(r"\s+", line.rstrip())

        # sentence boundary
        if len(parts) < 2:
    
          all_doc_sents.append(sent)

          ents=sorted(ents, key=lambda x: (x[0], x[1]))

          all_doc_ents.append(ents)

          sent=[]
          ents=[]

          cur_tid=0

          continue

        tid=cur_tid
        token=parts[3]
        cur_tid+=1

        identifier="%s.%s" % (docid, partID)

        coref=parts[-1].split("|")

        for c in coref:
          if c.startswith("(") and c.endswith(")"):
            c=re.sub(r"\(", "", c)
            c=int(re.sub(r"\)", "", c))

            if (identifier, c) not in doc_eid_to_global_eid:
              doc_eid_to_global_eid[(identifier, c)]=len(doc_eid_to_global_eid)

            ents.append((tid, tid, doc_eid_to_global_eid[(identifier, c)], identifier))

          elif c.startswith("("):
            c=int(re.sub(r"\(", "", c))

            if c not in open_ents:
              open_ents[c]=[]
            open_ents[c].append(tid)
            open_count+=1

          elif c.endswith(")"):
            c=int(re.sub(r"\)", "", c))

            assert c in open_ents

            start_tid=open_ents[c].pop()
            open_count-=1

            if (identifier, c) not in doc_eid_to_global_eid:
              doc_eid_to_global_eid[(identifier, c)]=len(doc_eid_to_global_eid)

            ents.append((start_tid, tid, doc_eid_to_global_eid[(identifier, c)], identifier))

        sent.append(token)

  return all_sents, all_ents

def load_embeddings(filename, vocab_size):
  # 0 idx is for padding
  # 1 idx is for unknown words

  # get the embedding size from the first embedding
  with open(filename, encoding="utf-8") as file:
    word_embedding_dim=len(file.readline().split(" "))-1

  vocab={"[PAD]":0, "[UNK]":1}

  print("word_embedding_dim:", word_embedding_dim)

  embeddings=np.zeros((vocab_size, word_embedding_dim))

  with open(filename, encoding="utf-8") as file:
    for idx,line in enumerate(file):

      if idx + 2 >= vocab_size:
        break

      cols=line.rstrip().split(" ")
      val=np.array(cols[1:])
      word=cols[0]
      embeddings[idx+2]=val
      vocab[word]=idx+2

  return torch.FloatTensor(embeddings), vocab

In [None]:
embeddingFile = "glove.6B.50d.txt"
trainFile = "train.conll"
devFile = "dev.conll"

all_sents, all_ents=read_conll(trainFile)	
dev_all_sents, dev_all_ents=read_conll(devFile)

embeddings, vocab=load_embeddings(embeddingFile, 50000)

word_embedding_dim: 50
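
`load_embeddings` reserves index 0 for padding and index 1 for unknown words, then assigns the remaining indices in file order until `vocab_size` is reached. Below is a small sketch of that indexing and of the lowercase lookup that `Mention` performs later. It uses a made-up word list instead of the GloVe file, and `build_vocab` and `to_ids` are illustrative names.

```python
def build_vocab(words, vocab_size):
    """Assign ids in order, keeping 0 for padding and 1 for unknown words."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for word in words:
        if len(vocab) >= vocab_size:
            break  # the vocabulary is full; later words will map to [UNK]
        vocab[word] = len(vocab)
    return vocab


def to_ids(sentence, vocab):
    """Lowercase each token and fall back to the [UNK] id when it is missing."""
    return [vocab.get(token.lower(), vocab["[UNK]"]) for token in sentence]


# hypothetical word list standing in for the first lines of an embedding file
vocab = build_vocab(["the", "of", "he"], vocab_size=4)
print(vocab)                               # {'[PAD]': 0, '[UNK]': 1, 'the': 2, 'of': 3}
print(to_ids(["The", "He", "of"], vocab))  # [2, 1, 3]
```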


### **Part 1. Implement B3**

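In this part, you’ll implement the B3 coreference metric discussed in class, without importing external libraries.

Recall the definition, where $n$ is the number of mentions and $Gold_{i}$ and $System_{i}$ are the gold and system clusters that contain mention $i$:

$B^{3}_{precision} = \frac{1}{n}\sum_{i=1}^{n} \frac{\left |Gold_{i} \cap  System_{i} \right |}{\left | System_{i} \right |}$

$B^{3}_{recall} = \frac{1}{n}\sum_{i=1}^{n} \frac{\left |Gold_{i} \cap  System_{i} \right |}{\left | Gold_{i} \right |}$

$F = \frac{2 \cdot B^{3}_{precision} \cdot B^{3}_{recall}}{B^{3}_{precision} + B^{3}_{recall}}$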

You should be able to pass the sanity check b3_test() after implementing it.
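
As a worked example of the definition, take three mentions where gold puts `m1` and `m2` in one entity and the system instead groups `m2` with `m3`. The sketch below is one possible way to compute the scores, not the reference solution. `b3_sketch` is a hypothetical name, and it combines precision and recall into F with the harmonic mean, which is what the sanity-check values imply.

```python
from collections import defaultdict


def b3_sketch(gold, system):
    """B3 precision, recall and F for two mention -> entity dictionaries."""
    def clusters(assignment):
        groups = defaultdict(set)
        for mention, entity in assignment.items():
            groups[entity].add(mention)
        return groups

    n = len(gold)
    if n == 0:
        return 0.0, 0.0, 0.0

    gold_clusters = clusters(gold)
    system_clusters = clusters(system)

    precision = recall = 0.0
    for mention in gold:
        g = gold_clusters[gold[mention]]        # gold cluster containing the mention
        s = system_clusters[system[mention]]    # system cluster containing the mention
        overlap = len(g & s)
        precision += overlap / len(s)
        recall += overlap / len(g)

    precision /= n
    recall /= n
    total = precision + recall
    f = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f


gold = {"m1": "A", "m2": "A", "m3": "B"}
system = {"m1": "X", "m2": "Y", "m3": "Y"}
print(b3_sketch(gold, system))
```

Per mention, the precision terms are 1, 1/2 and 1/2 and the recall terms are 1/2, 1/2 and 1, so both averages come out to 2/3.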


In [None]:
def b3(gold, system):
  """ Calculate B3 metrics given the gold and system output
    Args:
        gold  : A dictionary that contains the true references. The
        key is a tuple (docid, absolute_start_idx, absolute_end_idx) representing a target to be predicted;
        value is the true reference entity id.

        system: A dictionary that contains the predicted references.
        key in gold and system should be identical;
        value is the predicted entity generated by the model.

    Returns:
        precision, recall, F (following the formula above)

    """
  precision=0.
  recall=0.
  F = 0.
  #####
  # Your code here
  #####
  common_gold_links = {}
  common_sys_links = {}
  for k in system.keys():
    v = system[k]
    if v in common_sys_links:
      common_sys_links[v] += [k]
    else:
      common_sys_links[v] = [k]

  for k in gold.keys():
    v = gold[k]
    if v in common_gold_links:
      common_gold_links[v] += [k]
    else:
      common_gold_links[v] = [k]
  #print(common_gold_links)

  for item in system.keys():
    goldkey = gold[item]
    syskey = system[item] 
    goldchain = common_gold_links[goldkey]
    syschain = common_sys_links[syskey]
    #print(goldchain, syschain)

    precision += len(set(goldchain).intersection(set(syschain)))/len(set(syschain))
    recall += len(set(goldchain).intersection(set(syschain)))/len(set(goldchain))

  precision = precision/len(gold.keys())
  recall = recall/len(gold.keys())
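  # harmonic mean of precision and recall; defined as 0 when both are 0
  F = 2 * (recall * precision) / (recall + precision) if (recall + precision) > 0 else 0.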
    
  return precision, recall, F

In [None]:
def b3_test():
  gold={"a1":1, "a2": 2, "a3": 1, "a4":1, "a5": 3, "a6":3, "a7":2, "a8":2, "a9":1, "a10":1}
  system={"a1":5, "a2": 6, "a3": 6, "a4":6, "a5": 7, "a6":7, "a7":5, "a8":5, "a9":5, "a10":8}

  precision, recall, F=b3(gold, system)
  print("P: %.3f, R: %.3f, F: %.3f" % (precision, recall, F))

  assert abs(precision-0.667) < 0.001
  assert abs(recall-0.547) < 0.001
  assert abs(F-0.601) < 0.001
  
  print ("B3 sanity check passed")
b3_test()

P: 0.667, R: 0.547, F: 0.601
B3 sanity check passed


### **Part 2. Neural coref**
In Part 2, the skeleton code for a mention-ranking model is provided to you; you will not need to change any code until Part 2.1 begins. The following section provides the `Mention` class, which stores relevant information about a mention, and the `BasicCorefModel`. At the very least, you will need to read these two classes carefully and understand the information stored in `Mention` and the structure of the model to complete this homework.
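
In a mention-ranking setup, each mention is paired with every earlier mention in the same document as a candidate antecedent, and the gold labels are the positions of the candidates that share its entity. Here is a minimal sketch of that construction on entity labels alone. `build_instances` is an illustrative helper; the notebook's `convert_data_to_training_instances` does the same with `Mention` objects.

```python
def build_instances(entity_ids):
    """entity_ids[i] is the gold entity of the i-th mention in one document."""
    instances = []
    for i, entity in enumerate(entity_ids):
        candidates = list(range(i))  # every earlier mention is a candidate
        gold = [j for j in candidates if entity_ids[j] == entity]
        instances.append((candidates, gold))
    return instances


# hypothetical document with four mentions of two entities
for candidates, gold in build_instances(["A", "B", "A", "A"]):
    print(candidates, gold)
# [] []
# [0] []
# [0, 1] [0]
# [0, 1, 2] [0, 2]
```

An empty gold list means the correct decision for that mention is to start a new entity.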


In [None]:
class Mention():

  """
  An object to contain information about each mention
  """

  def __init__(self, mention_id, docid, absolute_start_idx, absolute_end_idx, sentence_start_idx, sentence_end_idx, sentence, vocab):
    self.docid=docid

    # mention id (globally unique within one file, but not across different train and test files)
    self.mention_id=mention_id
    # the token index of the mention start position, measured from the beginning of the document
    self.absolute_start_idx=absolute_start_idx
    # the token index of the mention end position, measured from the beginning of the document
    self.absolute_end_idx=absolute_end_idx
    # the token index of the mention start position, measured from the beginning of the sentence
    self.sentence_start_idx=sentence_start_idx
    # the token index of the mention end position, measured from the beginning of the sentence
    self.sentence_end_idx=sentence_end_idx
    # a list of tokens for all the words in the mention's sentence
    self.sentence=sentence
    # a list of tokens ids for all the words in the mention's sentence
    self.sentence_ids=[]
    self.sentence_length=len(sentence)

    for word in sentence:
      word=word.lower()
      self.sentence_ids.append(vocab[word] if word in vocab else vocab["[UNK]"])

In [None]:
def convert_data_to_training_instances(all_sents, all_ents, vocab):
  X=[]
  Y=[]
  M=[]

  global_id=0
  truth={}

  for doc_idx, doc_ent in enumerate(all_ents):
    current_token_position=0
    existing_mentions=[]
    for sent_idx, mention_list in enumerate(doc_ent):
      sent=all_sents[doc_idx][sent_idx]

      for mention_idx, mention in enumerate(mention_list):
        start_sent_idx, end_sent_idx, entity_id, identifier=mention
        mention=Mention(global_id, identifier, current_token_position+start_sent_idx, current_token_position+end_sent_idx, start_sent_idx, end_sent_idx, sent, vocab)
        M.append(mention)
        truth[global_id]=entity_id
        global_id+=1
        x=[]
        y=[]
        for aidx, antecedent in enumerate(existing_mentions):
          x.append(antecedent)
          if truth[antecedent.mention_id] == truth[mention.mention_id]:
            y.append(aidx)

        X.append(x)
        Y.append(torch.LongTensor(y).to(device))
        existing_mentions.append(mention)
      current_token_position+=len(sent)

  return X, Y, M, truth

In [None]:
######### HELPER FUNCTION FOR TRAINING STARTS #########
#########  DON'T EDIT THIS SECTION OF CODE   #########
def forward_predict(batch_x, batch_m, scoring_function):

  this_batch_size=len(batch_x)
  num_ants=len(batch_x[0])

  # if this batch has no antecedents, then it must start a new entity
  if num_ants == 0:
    return torch.LongTensor([0]*this_batch_size)
  
  # get predictions
  preds=scoring_function(batch_x, batch_m)

  # rank candidates by score; a top index equal to num_ants selects the "new entity" slot
  arg_sorts=torch.argsort(preds, descending=True, dim=1)
  tops=arg_sorts[:,0]

  return tops


def forward_train(batch_x, batch_y, batch_m, scoring_function):

  num_batch=len(batch_x)
  num_ants=len(batch_x[0])

  # if this batch has no candidate antecedents, then each mention must start a new entity so there is only one choice we could make (hence no loss)
  if num_ants == 0:
    return None

  preds=scoring_function(batch_x, batch_m)
  preds_sum=torch.logsumexp(preds, 1)

  running_loss=None


  for i in range(num_batch):

    # optimize marginal log-likelihood of true antecedents
    if batch_y[i].nelement() == 0:
      golds_sum=0.
    else:
      golds=torch.index_select(preds[i], 0, batch_y[i])
      golds_sum=torch.logsumexp(golds, 0)

    diff=preds_sum[i]-golds_sum

    running_loss = diff if running_loss is None else running_loss + diff

  return running_loss

def get_batches(X, Y, M, batchsize):
  sizes={}
  for i in range(len(M)):
    size=len(X[i])
    if size not in sizes:
      sizes[size]=[]
    sizes[size].append((X[i], Y[i], M[i]))

  batches=[]

  for size in sizes:
    i=0
    while (i < len(sizes[size])):

      data=sizes[size][i:i+batchsize]
      batch_x=[]
      batch_y=[]
      batch_m=[]
      for x, y, m in data:
        batch_x.append(x)
        batch_y.append(y)
        batch_m.append(m)

      batches.append((batch_x, batch_y, batch_m))
      i+=batchsize

  return batches


def train(X, Y, M, train_gold, test_X, test_Y, test_M, test_gold, model):

  batches=get_batches(X, Y, M, 32)
  test_batches=get_batches(test_X, test_Y, test_M, 32)
  optimizer = optim.Adam(model.parameters(), lr=0.001)

  for epoch in range(10):

    model.train()
    # train
    bigloss=0.
    for batch_x, batch_y, batch_m in batches:
      model.zero_grad()
      loss=forward_train(batch_x, batch_y, batch_m, model.scorer)
      if loss is not None:
        loss.backward()
        optimizer.step()
        # accumulate a plain float so the running total does not hold on to autograd history
        bigloss+=loss.item()

    # evaluate
    model.eval()

    gold={}
    predicted={}

    eid=0
    tot=0

    for batch_x, batch_y, batch_m in test_batches:
      predictions=forward_predict(batch_x, batch_m, model.scorer)

      for idx, mention in enumerate(batch_m):

        gold[mention.docid, mention.absolute_start_idx, mention.absolute_end_idx]=test_gold[mention.mention_id]
        prediction=predictions[idx]
        tot+=1
      
        # prediction is to start a new entity
        if prediction >= len(batch_x[idx]):
          predicted[mention.docid, mention.absolute_start_idx, mention.absolute_end_idx]=eid
          eid+=1

        # prediction is to link to a previous mention
        else:

          best_antecedent=batch_x[idx][prediction]
          predicted_entity_id=predicted[best_antecedent.docid, best_antecedent.absolute_start_idx, best_antecedent.absolute_end_idx]
          predicted[mention.docid, mention.absolute_start_idx, mention.absolute_end_idx]=predicted_entity_id

    P, R, F=b3(gold, predicted)
    print("loss: %.3f, B3 F: %.3f, unique entities: %s, num mentions: %s" % (bigloss, F, eid, tot))

def set_seed(seed):
  """
  Sets random seeds and sets model in deterministic
  training mode. Ensures reproducible results
  """
  torch.manual_seed(seed)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  np.random.seed(seed)
######### HELPER FUNCTION FOR TRAINING ENDS #########
#########  DON'T EDIT THIS SECTION OF CODE   #########
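
`get_batches` groups mentions by their number of candidate antecedents, so every batch can be scored as one rectangular tensor. Below is a stripped-down sketch of that grouping, returning mention indices rather than `Mention` objects (`group_by_candidate_count` is a hypothetical name).

```python
from collections import defaultdict


def group_by_candidate_count(candidate_lists, batch_size):
    """Batch mention indices so that each batch shares one candidate count."""
    by_size = defaultdict(list)
    for index, candidates in enumerate(candidate_lists):
        by_size[len(candidates)].append(index)

    batches = []
    for indices in by_size.values():
        # slice every same-size group into chunks of at most batch_size
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    return batches


# hypothetical candidate lists for five mentions
candidate_lists = [[], [0], [0], [0, 1], [0]]
print(group_by_candidate_count(candidate_lists, batch_size=2))
# [[0], [1, 2], [4], [3]]
```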

In [None]:
class BasicCorefModel(nn.Module):

	def __init__(self, vocab, embeddings):
		super(BasicCorefModel, self).__init__()
		self.vocab=vocab
		self.embeddings = nn.Embedding.from_pretrained(embeddings)
		_, embedding_size=embeddings.shape
		self.hidden_dim=50

		self.input_size=2 * embedding_size

		self.W1 = nn.Linear(self.input_size, self.hidden_dim)
		self.tanh=nn.Tanh()
		self.W2 = nn.Linear(self.hidden_dim, 1)	

	def scorer(self, batch_x, batch_m):

		"""
		Input: a batch containing:
			-- batch_m [list of Mention objects]: mention to resolve.  batch_m[i] contains a single Mention
			-- batch_x [list of [list of Mention objects]]: candidate antecedents. batch_x[i] contains a list of candidate antecedents for mention batch_m[i]

		Each input batch is batched to contain the same number of candidate antecedents

		Output: torch tensor [batch_size, number_of_antecedents + 1] containing scores for all antecedents
			-- for j < number_of_antecedents, output[i,j] contains the score of batch_x[i][j] being the correct antecedent for batch_m[i] 
			-- for j == number_of_antecedents, output[i,j] = 0 (the score for batch_m[i] being linked to no antecedent)

		"""

		this_batch_size=len(batch_x)
		num_ants=len(batch_x[0])

		# get representations for mentions
		lastWordID=[]

		for idx, mention in enumerate(batch_m):
			lastWordID.append(mention.sentence_ids[mention.sentence_end_idx])

		# [this_batch_size, 1, embedding_size]
		mention_LW_embeddings=self.embeddings(torch.LongTensor(lastWordID).to(device)).unsqueeze(1)

		# get representations for antecedents
		antLastWords=[]
		for idx in range(len(batch_x)):
			antWords=[]
			for ant_idx, ant in enumerate(batch_x[idx]):
				antWords.append(ant.sentence_ids[ant.sentence_end_idx])

			antLastWords.append(antWords)

		# [this_batch_size, num_ants, embedding_size]
		antecedent_LW_embeddings=self.embeddings(torch.LongTensor(antLastWords).to(device))

		# We want to generate a score for each antecedent for each mention. However,
		# mention_LW_embeddings is [this_batch_size, 1, embedding_size] while,
		# antecedent_LW_embeddings is [this_batch_size, num_ants, embedding_size].
		# So let's make a bunch of copies of mention_LW_embeddings (one for each of its candidate antecedents)

		# [this_batch_size, num_ants, embedding_size]
		mention_LW_embeddings_copies=mention_LW_embeddings.expand_as(antecedent_LW_embeddings)

		# Now that they're the same size, we can concatenate them together into one big matrix

		# [this_batch_size, num_ants, (embedding_size + embedding_size)]
		all_features=torch.cat([mention_LW_embeddings_copies, antecedent_LW_embeddings], 2)
		
		# [this_batch_size, num_ants] (the trailing singleton dimension is squeezed away)
		preds=self.W2(self.tanh(self.W1(all_features))).squeeze(-1)

		# Let's fix the score for starting a new entity to be 0; all of the other scores for candidate antecedents will end up 
		# being relative to that.

		# [this_batch_size, 1]
		zeros=torch.FloatTensor(np.zeros((this_batch_size, 1))).to(device)

		# [this_batch_size, num_ants + 1]
		preds=torch.cat((preds, zeros), 1)

		return preds

Now, everything is set up to run the BasicCorefModel. Let's run the cell below to train the model and look at the result of the model.
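
During evaluation, `train` reads each mention's top-scoring index as either a link to an earlier mention or, when the index points one past the last candidate, the start of a new entity. The sketch below replays that bookkeeping for a single document processed in order; `decode_entities` is an illustrative name and the prediction list is made up.

```python
def decode_entities(predictions):
    """predictions[i] is the chosen index for mention i, which has i candidates.

    An index >= i is the "new entity" slot; a smaller index links mention i
    to that earlier mention and inherits its entity id.
    """
    entity_of = []
    next_entity = 0
    for i, choice in enumerate(predictions):
        if choice >= i:
            entity_of.append(next_entity)
            next_entity += 1
        else:
            entity_of.append(entity_of[choice])
    return entity_of


print(decode_entities([0, 1, 0, 2, 4]))  # [0, 1, 0, 0, 2]
```

Because a link copies the antecedent's entity id, chains of links end up in the same cluster, which is what B3 then scores.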

In [None]:
X, Y, M, train_truth=convert_data_to_training_instances(all_sents, all_ents, vocab)
dev_X, dev_Y, dev_M, dev_truth=convert_data_to_training_instances(dev_all_sents, dev_all_ents, vocab)
set_seed(159)
model=BasicCorefModel(vocab, embeddings)
model=model.to(device)
print ("Training BasicCorefModel")
train(X, Y, M, train_truth, dev_X, dev_Y, dev_M, dev_truth, model)

Training BasicCorefModel
loss: 41274.820, B3 F: 0.764, unique entities: 29578, num mentions: 29597
loss: 33196.117, B3 F: 0.764, unique entities: 29569, num mentions: 29597
loss: 29578.078, B3 F: 0.765, unique entities: 29323, num mentions: 29597
loss: 27773.996, B3 F: 0.771, unique entities: 28797, num mentions: 29597
loss: 26788.643, B3 F: 0.779, unique entities: 27948, num mentions: 29597
loss: 26151.432, B3 F: 0.782, unique entities: 27427, num mentions: 29597
loss: 25682.389, B3 F: 0.783, unique entities: 27372, num mentions: 29597
loss: 25319.137, B3 F: 0.786, unique entities: 26932, num mentions: 29597
loss: 25024.688, B3 F: 0.789, unique entities: 26854, num mentions: 29597
loss: 24776.920, B3 F: 0.793, unique entities: 26450, num mentions: 29597
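
The loss reported on each line sums, over all mentions with at least one candidate, the negative marginal log-likelihood of the gold antecedents: the log-sum-exp of all scores (including the fixed 0 for starting a new entity) minus the log-sum-exp of the gold scores. Here is a plain-Python sketch for a single mention, assuming made-up scores (`mention_loss` is a hypothetical helper, not part of the homework code).

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mention_loss(antecedent_scores, gold_indices):
    """Negative marginal log-likelihood of the gold antecedents for one mention."""
    scores = list(antecedent_scores) + [0.0]  # last slot: start a new entity
    if gold_indices:
        gold = [scores[j] for j in gold_indices]
    else:
        gold = [0.0]  # no gold antecedent: the new-entity slot is the right answer
    return logsumexp(scores) - logsumexp(gold)


# two candidates with equal scores, the first one is gold: loss = log(3)
print(mention_loss([0.0, 0.0], [0]))
```

The loss shrinks toward 0 as the gold candidates absorb nearly all of the probability mass.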


### **Part 2.1 Incorporate distance**

In this part, you should incorporate word distance information into `BasicCorefModel`, as described in the HW. The code structure provided below is exactly the same as `BasicCorefModel`; your job is to add code to both the `__init__()` and `scorer()` functions as you see fit.

Hint: You might consider initializing a distance embedding in `__init__()`, then concatenating the original embeddings with the corresponding distance embedding in `scorer()`.

After implementing this, run the sanity check, test_distance(), provided to you.
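
One way to turn a raw token distance into an embedding index is to bucket it. The sketch below assumes the ten buckets 0, 1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63 and 64+, which are the boundaries used by `DistanceCorefModel` in this notebook; `distance_bucket` and `one_hot` are illustrative names.

```python
from bisect import bisect_left

# inclusive upper bound of buckets 0..8; anything larger falls into bucket 9
BUCKET_UPPER_BOUNDS = [0, 1, 2, 3, 4, 7, 15, 31, 63]


def distance_bucket(distance):
    """Map a token distance to a bucket index between 0 and 9."""
    return bisect_left(BUCKET_UPPER_BOUNDS, distance)


def one_hot(index, size=10):
    """One-hot vector, i.e. one row of a size x size identity matrix."""
    return [1.0 if i == index else 0.0 for i in range(size)]


for d in (0, 1, 5, 8, 63, 64):
    print(d, distance_bucket(d))
# 0 0
# 1 1
# 5 5
# 8 6
# 63 8
# 64 9
```

Looking up a bucket index in an identity-initialized embedding table yields the same vector as `one_hot` here.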

In [None]:
class DistanceCorefModel(nn.Module):
	""" The code provided here starts out as just a copy of BasicCorefModel """
	def __init__(self, vocab, embeddings):
		super(DistanceCorefModel, self).__init__()
		self.vocab=vocab

		# initialize distance embeddings as identity matrix	
		self.distance_embeddings = nn.Embedding.from_pretrained(torch.eye(10,10))

		self.embeddings = nn.Embedding.from_pretrained(embeddings)
		_, embedding_size= embeddings.shape
		self.hidden_dim=50

		# update input size to reflect concatenated distance embedding of one-hot vector of size d=10
		self.input_size= 2 * embedding_size + 10
		self.W1 = nn.Linear(self.input_size, self.hidden_dim)
		self.tanh = nn.Tanh()
		self.W2 = nn.Linear(self.hidden_dim, 1)	

	def scorer(self, batch_x, batch_m):

		"""
		Input: a batch containing:
			-- batch_m [list of Mention objects]: mention to resolve.  batch_m[i] contains a single Mention
			-- batch_x [list of [list of Mention objects]]: candidate antecedents. batch_x[i] contains a list of candidate antecedents for mention batch_m[i]
		Each input batch is batched to contain the same number of candidate antecedents
		Output: torch tensor [batch_size, number_of_antecedents + 1] containing scores for all antecedents
			-- for j < number_of_antecedents, output[i,j] contains the score of batch_x[i][j] being the correct antecedent for batch_m[i] 
			-- for j == number_of_antecedents, output[i,j] = 0 (the score for batch_m[i] being linked to no antecedent)

		"""
		# use whichever device the model's embeddings live on instead of hard-coding the CPU
		device = self.embeddings.weight.device
		this_batch_size=len(batch_x)
		num_ants=len(batch_x[0])
		# get representations for mentions
		lastWordID=[]

		for idx, mention in enumerate(batch_m):
			lastWordID.append(mention.sentence_ids[mention.sentence_end_idx])

		# [this_batch_size, 1, embedding_size]
		mention_LW_embeddings=self.embeddings(torch.LongTensor(lastWordID).to(device)).unsqueeze(1)
		

		# get representations for antecedents
		antLastWords=[]
		for idx in range(len(batch_x)):
			antWords=[]
			for ant_idx, ant in enumerate(batch_x[idx]):
				antWords.append(ant.sentence_ids[ant.sentence_end_idx])

			antLastWords.append(antWords)

		# [this_batch_size, num_ants, embedding_size]
		antecedent_LW_embeddings=self.embeddings(torch.LongTensor(antLastWords).to(device))
	
		antAbsDistance = []
		for idx,mention in enumerate(batch_m):
			dists = []
			for ant_id, ant in enumerate(batch_x[idx]):
				dis = mention.absolute_end_idx - ant.absolute_end_idx
				if dis <= 0: 
					antDist = 0
				elif dis <= 1:
					antDist = 1
				elif dis <= 2:
					antDist = 2
				elif dis <= 3:
					antDist = 3
				elif dis <= 4:
					antDist = 4
				elif dis <= 7:
					antDist = 5
				elif dis <= 15:
					antDist = 6
				elif dis <= 31:
					antDist = 7
				elif dis <= 63:
					antDist = 8
				else:
					antDist = 9
				dists.append(antDist)
			antAbsDistance.append(dists)

		distance_abs_embeddings = self.distance_embeddings(torch.LongTensor(antAbsDistance).to(device))

		# We want to generate a score for each antecedent for each mention. However,
		# mention_LW_embeddings is [this_batch_size, 1, embedding_size] while,
		# antecedent_LW_embeddings is [this_batch_size, num_ants, embedding_size].
		# So let's make a bunch of copies of mention_LW_embeddings (one for each of its candidate antecedents)

		# [this_batch_size, num_ants, embedding_size]
		mention_LW_embeddings_copies=mention_LW_embeddings.expand_as(antecedent_LW_embeddings)
		# Now that they're the same size, we can concatenate them together into one big matrix

		# [this_batch_size, num_ants, (embedding_size + embedding_size + distance_embedding_size)]
		all_features=torch.cat([mention_LW_embeddings_copies, antecedent_LW_embeddings, distance_abs_embeddings], 2)

		# [this_batch_size, num_ants] (the trailing singleton dimension is squeezed away)
		preds=self.W2(self.tanh(self.W1(all_features))).squeeze(-1)

		# Let's fix the score for starting a new entity to be 0; all of the other scores for candidate antecedents will end up 
		# being relative to that.

		# [this_batch_size, 1]
		zeros=torch.FloatTensor(np.zeros((this_batch_size, 1))).to(device)

		# [this_batch_size, num_ants + 1]
		preds=torch.cat((preds, zeros), 1)

		return preds

In [None]:
def test_distance(model):
  batch_x=[]
  maxLen=100
  for i in range(maxLen):
    mention=Mention(i, "testdoc", i, i+1, 0, 1, ["John", "Smith", "is", "a", "person"], model.vocab)
    batch_x.append(mention)

  mention=Mention(maxLen, "testdoc", maxLen, maxLen, 0, 0, ["He", "is", "a", "person"], model.vocab)

  preds=model.scorer([batch_x], [mention])
  preds=preds.detach().cpu().numpy()[0]
  spearman, _= spearmanr(preds, np.arange(len(preds)))
  print("Distance check: %.3f" % spearman)
  with open("distance_predictions.txt", "w", encoding="utf-8") as out:
    out.write(' '.join(["%.5f" % x for x in preds]))

In [None]:
set_seed(159)
model=DistanceCorefModel(vocab, embeddings)
model=model.to(device)
print ("Training DistanceCorefModel")
train(X, Y, M, train_truth, dev_X, dev_Y, dev_M, dev_truth, model)
test_distance(model)

Training DistanceCorefModel
loss: 38142.238, B3 F: 0.771, unique entities: 27516, num mentions: 29597
loss: 31295.287, B3 F: 0.792, unique entities: 25643, num mentions: 29597
loss: 27499.352, B3 F: 0.801, unique entities: 25037, num mentions: 29597
loss: 24978.295, B3 F: 0.806, unique entities: 24524, num mentions: 29597
loss: 23457.592, B3 F: 0.812, unique entities: 24143, num mentions: 29597
loss: 22521.248, B3 F: 0.815, unique entities: 23976, num mentions: 29597
loss: 21872.807, B3 F: 0.816, unique entities: 23865, num mentions: 29597
loss: 21376.531, B3 F: 0.817, unique entities: 23796, num mentions: 29597
loss: 20969.732, B3 F: 0.818, unique entities: 23712, num mentions: 29597
loss: 20621.184, B3 F: 0.819, unique entities: 23643, num mentions: 29597
Distance check: 0.925
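
`test_distance` reports the Spearman rank correlation between the scores and the candidate positions, so a value near 1 means candidates closer to the mention (higher index) tend to score higher. A minimal sketch of the statistic for inputs without ties follows; `scipy.stats.spearmanr` additionally handles ties, and `spearman_no_ties` is a hypothetical name.

```python
def ranks(values):
    """Rank of each value (0 = smallest), assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman_no_ties(xs, ys):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# made-up scores that mostly increase with candidate position
print(spearman_no_ties([1, 3, 2, 4], [1, 2, 3, 4]))  # 0.8
```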


### **Part 2.2 Design a fancier model**
Here comes the fun part! After completing `DistanceCorefModel`, you have some familiarity with the model architecture. In this section, you will implement a fancier model using any features you'd like. Feel free to change the architecture as you see fit.

Submit this notebook to Gradescope along with a writeup file "fancymodel.txt" describing your model and the features you use.
**Your code must implement exactly what you describe in your writeup**

In my fancy model implementation (more detailed explanation in the writeup):

*   I used a 100K-word vocabulary instead of 50K
*   I used 200-d word embeddings instead of 50-d
*   I included the cosine similarity between the token ids of the antecedent's sentence and of the mention's sentence, binarized at 0.5
*   I used the distance between the two mentions in terms of absolute token index
*   I used parallelism between the two mentions' positions in their respective sentences, binning the difference in start positions as at most 1, at most 4, at most 7, at most 10, or further away
*   I also tried to include gender agreement (male, female, neutral) and number agreement (singular, plural), but I couldn't find an existing dictionary to download and use with the mentions.
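
The similarity and sentence-position features in the list above can be sketched in plain Python. This assumes, as `FancyCorefModel` does, that the shorter id list is padded with -1 and that position gaps are binned at 1, 4, 7 and 10; `padded_cosine` and `position_bin` are illustrative names.

```python
import math


def padded_cosine(ids_a, ids_b, pad=-1):
    """Cosine similarity of two id lists after padding the shorter one."""
    length = max(len(ids_a), len(ids_b))
    a = list(ids_a) + [pad] * (length - len(ids_a))  # copies, inputs stay untouched
    b = list(ids_b) + [pad] * (length - len(ids_b))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def position_bin(start_a, start_b):
    """Bin the gap between two sentence-level start positions into 5 classes."""
    gap = abs(start_a - start_b)
    for bin_index, upper in enumerate((1, 4, 7, 10)):
        if gap <= upper:
            return bin_index
    return 4


# hypothetical token ids for two sentences
similar = padded_cosine([12, 40, 7], [12, 40, 7]) > 0.5
print(similar, position_bin(0, 1), position_bin(2, 9), position_bin(0, 11))
# True 0 2 4
```

Building padded copies keeps the stored `sentence_ids` of each `Mention` unchanged between calls.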



In [None]:
embeddingFile = "glove.6B.200d.txt"
trainFile = "train.conll"
devFile = "dev.conll"
all_sents, all_ents=read_conll(trainFile)	
dev_all_sents, dev_all_ents = read_conll(devFile)
embeddings, vocab = load_embeddings(embeddingFile, 100000)

word_embedding_dim: 200


In [None]:
class FancyCorefModel(nn.Module):

	def __init__(self, vocab, embeddings):
		super(FancyCorefModel, self).__init__()
		self.vocab=vocab

		# initialize distance embeddings as identity matrix	
		self.distance_embeddings = nn.Embedding.from_pretrained(torch.eye(10,10))
		self.number_embeddings = nn.Embedding.from_pretrained(torch.eye(2,2))
		self.gender_embeddings = nn.Embedding.from_pretrained(torch.eye(3,3))
		self.local_positional_embeddings = nn.Embedding.from_pretrained(torch.eye(5,5))
		self.context_similarity_embeddings = nn.Embedding.from_pretrained(torch.eye(2,2))
	
		self.embeddings = nn.Embedding.from_pretrained(embeddings)
		_, embedding_size= embeddings.shape
		self.hidden_dim=50

		# update input size to reflect the concatenated one-hot features: distance (10) + sentence position (5) + similarity (2)
		self.input_size= 2 * embedding_size + 17
		self.W1 = nn.Linear(self.input_size, self.hidden_dim)
		self.tanh=nn.Tanh()
		self.W2 = nn.Linear(self.hidden_dim, 1)	

	def scorer(self, batch_x, batch_m):

		"""
		Input: a batch containing:
			-- batch_m [list of Mention objects]: mention to resolve.  batch_m[i] contains a single Mention
			-- batch_x [list of [list of Mention objects]]: candidate antecedents. batch_x[i] contains a list of candidate antecedents for mention batch_m[i]
		Each input batch is batched to contain the same number of candidate antecedents
		Output: torch tensor [batch_size, number_of_antecedents + 1] containing scores for all antecedents
			-- for j < number_of_antecedents, output[i,j] contains the score of batch_x[i][j] being the correct antecedent for batch_m[i] 
			-- for j == number_of_antecedents, output[i,j] = 0 (the score for batch_m[i] being linked to no antecedent)

		"""
		# use whichever device the model's embeddings live on instead of hard-coding the CPU
		device = self.embeddings.weight.device
		this_batch_size=len(batch_x)
		num_ants=len(batch_x[0])
		# get representations for mentions
		lastWordID=[]

		for idx, mention in enumerate(batch_m):
			lastWordID.append(mention.sentence_ids[mention.sentence_end_idx])

		# [this_batch_size, 1, embedding_size]
		mention_LW_embeddings=self.embeddings(torch.LongTensor(lastWordID).to(device)).unsqueeze(1)
		

		# get representations for antecedents
		antLastWords=[]
		genderWords = []
		numberWords = []
		
		for idx in range(len(batch_x)):
			antWords=[]
			for ant_idx, ant in enumerate(batch_x[idx]):
				antWords.append(ant.sentence_ids[ant.sentence_end_idx])

			antLastWords.append(antWords)

		# [this_batch_size, num_ants, embedding_size]
		antecedent_LW_embeddings=self.embeddings(torch.LongTensor(antLastWords).to(device))
	
		antAbsDistance = []
		antLocalDistance = []
		similarity = []
		for idx,mention in enumerate(batch_m):
			dists = []
			locs = []
			sims = []
			# copy the id list so the padding below never mutates the Mention object
			sent_ids = list(mention.sentence_ids)

			for ant_id, ant in enumerate(batch_x[idx]):
				dis = mention.absolute_end_idx - ant.absolute_end_idx
				local = abs(mention.sentence_start_idx - ant.sentence_start_idx)

				# pad copies of both sentences' token ids with -1 to a common length
				ant_ids = list(ant.sentence_ids)
				max_len = max(len(sent_ids), len(ant_ids))
				sent_vec = np.array(sent_ids + [-1] * (max_len - len(sent_ids)))
				ant_vec = np.array(ant_ids + [-1] * (max_len - len(ant_ids)))

				# cosine similarity of the padded id vectors, binarized at 0.5
				magn = np.linalg.norm(sent_vec) * np.linalg.norm(ant_vec)
				val = np.dot(ant_vec, sent_vec) / magn
				if val <= 0.5:
					sims.append(0)
				else:
					sims.append(1)

				if local <= 1:
					locs.append(0)
				elif local <= 4:
					locs.append(1)
				elif local <= 7:
					locs.append(2)
				elif local <= 10:
					locs.append(3)
				else:
					locs.append(4)
			
				if dis <= 0: 
					antDist = 0
				elif dis <= 1:
					antDist = 1
				elif dis <= 2:
					antDist = 2
				elif dis <= 3:
					antDist = 3
				elif dis <= 4:
					antDist = 4
				elif dis <= 7:
					antDist = 5
				elif dis <= 15:
					antDist = 6
				elif dis <= 31:
					antDist = 7
				elif dis <= 63:
					antDist = 8
				else:
					antDist = 9
				dists.append(antDist)
			antAbsDistance.append(dists)
			antLocalDistance.append(locs)
			similarity.append(sims)

		distance_abs_embeddings = self.distance_embeddings(torch.LongTensor(antAbsDistance).to(device))
		distance_local_embeddings = self.local_positional_embeddings(torch.LongTensor(antLocalDistance).to(device))
		similarity_embeddings = self.context_similarity_embeddings(torch.tensor(similarity, dtype=torch.long).to(device))
	
		# We want to generate a score for each antecedent for each mention. However,
		# mention_LW_embeddings is [this_batch_size, 1, embedding_size] while,
		# antecedent_LW_embeddings is [this_batch_size, num_ants, embedding_size].
		# So let's make a bunch of copies of mention_LW_embeddings (one for each of its candidate antecedents)

		# [this_batch_size, num_ants, embedding_size]
		mention_LW_embeddings_copies=mention_LW_embeddings.expand_as(antecedent_LW_embeddings)
		# Now that they're the same size, we can concatenate them together into one big matrix

		# [this_batch_size, num_ants, (2 * embedding_size + 10 + 5 + 2)]
		all_features=torch.cat([mention_LW_embeddings_copies, antecedent_LW_embeddings, distance_abs_embeddings, distance_local_embeddings, similarity_embeddings], 2)

		# [this_batch_size, num_ants] (the trailing singleton dimension is squeezed away)
		preds=self.W2(self.tanh(self.W1(all_features))).squeeze(-1)

		# Let's fix the score for starting a new entity to be 0; all of the other scores for candidate antecedents will end up 
		# being relative to that.

		# [this_batch_size, 1]
		zeros=torch.FloatTensor(np.zeros((this_batch_size, 1))).to(device)

		# [this_batch_size, num_ants + 1]
		preds=torch.cat((preds, zeros), 1)

		return preds

In [None]:
model=FancyCorefModel(vocab, embeddings)
model=model.to(device)

print ("Training FancyCorefModel")
train(X, Y, M, train_truth, dev_X, dev_Y, dev_M, dev_truth, model)

Training FancyCorefModel
loss: 35771.641, B3 F: 0.787, unique entities: 25934, num mentions: 29597
loss: 29107.641, B3 F: 0.804, unique entities: 24805, num mentions: 29597
loss: 25350.598, B3 F: 0.814, unique entities: 24395, num mentions: 29597
loss: 22846.283, B3 F: 0.820, unique entities: 24134, num mentions: 29597
loss: 21188.838, B3 F: 0.823, unique entities: 23927, num mentions: 29597
loss: 19990.572, B3 F: 0.825, unique entities: 23734, num mentions: 29597
loss: 19059.873, B3 F: 0.826, unique entities: 23568, num mentions: 29597
loss: 18303.123, B3 F: 0.826, unique entities: 23423, num mentions: 29597
loss: 17672.068, B3 F: 0.827, unique entities: 23226, num mentions: 29597
loss: 17130.100, B3 F: 0.827, unique entities: 23096, num mentions: 29597
