# Project 2: Named Entity Recognition (NER) with Sequence Labeling Models
## CS4740/5740 Fall 2021

### Project Submission Due: Oct 15th, 2021 (11.59PM)
Please submit a **PDF file** of this notebook on **Gradescope** and the **ipynb** file on **CMS**. For instructions on generating the PDF and ipynb files, please refer to the Project 1 instructions.



**Names:Cam Stine Randy Tang**

**Netids:crs338 rt378**

Don't forget to share your newly copied notebook with your partner!


**Reminder: both of you can't work in this notebook at the same time from different computers/browser windows because of sync issues. We even suggest closing the tab with this notebook when you are not working on it so your partner doesn't run into sync issues.**


# **Introduction** 🔎

---

In this project, you will implement a model that identifies named entities in text and tags them with the appropriate label. Specifically, the task of this project is **Named Entity Recognition**. A primer on this task is provided further on. The given dataset is a modified version of the CoNLL-2003 ([Sang et al.](https://arxiv.org/pdf/cs/0306050v1.pdf)) dataset. Please use the datasets that we have released to you instead of versions found online, as we have made simplifications to the dataset for your benefit. Your task is to develop NLP models to identify these named entities automatically. We will treat this as a **sequence-tagging task**: for each token in the input text, assign one of the following 5 labels: **ORG** (Organization), **PER** (Person), **LOC** (Location), **MISC** (Miscellaneous), and **O** (Not Named Entity). More information about the dataset is provided later.

For this project, you will implement two sequence labeling approaches:
- Model 1 : a Hidden Markov Model (HMM)
- Model 2 : a Maximum Entropy Markov Model (MEMM), which is an adaptation of an HMM in which a Logistic Regression classifier (also known as a MaxEnt classifier) is used to obtain the lexical generation probabilities (i.e., the observation/emission probability matrix, so "observations" == "emissions" == "lexical generations"). Feature engineering is strongly suggested for this model!

Implementation of the Viterbi algorithm (for finding the most likely tag sequence to assign to an input text) is required for both models above, so make sure that you understand it ASAP.

You will implement and train two sequence tagging models, generate your predictions for the provided test set, and submit them to **Kaggle**. Please enter all code in this colab notebook and answer all the questions in the supporting document.

To refresh your memory on HMMs, MEMMs, and Viterbi you can refer to **Jurafsky & Martin Ch. 8.3–8.5** and the lecture slides which can be found on EdStem.

## **Logistics**

---

- You **must** work in **groups of 2 students**. Students in the same group will get the same grade. Thus, you should make sure that everyone in your group contributes to the project. 
- **Remember to form groups on BOTH CMS and Gradescope** or not all group members will receive grades. You can make a post on EdStem to find a partner for this project.
- Please complete the written questions of this notebook in a clear and informative way. We have created a template document for you to answer the written questions. This document can be found [here](https://docs.google.com/document/d/1vnxYFS-rxxLOYfKG6YN35YktJZnzqQBhE1Mj1Xu7IF0/edit?usp=sharing). Please make a copy of this document for yourself and add your names and netids in the header and answer the written questions on it. You will need to submit this document to gradescope as well (do not forget to do this please!).
- At the end: please make sure to submit the following 3 items:
  1. PDF version of Colab notebook on Gradescope (instructions for converting to PDF are at the end).
  2. PDF version of Google Doc with written answers to the numbered questions on this colab on Gradescope.
  3. .ipynb version of your colab notebook on CMS.

- Note: When submitting the PDF documents to Gradescope (colab notebook & writeup doc) please join/concatenate the PDFs and then submit them as one. You may do this any way you please. You can use [this](https://pdfjoiner.com/) website if you wish to.

## **Advice**

---

1. Please read through the entire notebook before you start coding. That might inform your code structure.
2. Grading breakdown is found at the end; please consult it.
3. Google Colab does **not** provide good synchronization; we do not recommend that multiple people work on the same notebook at the same time.
4. The project is somewhat open ended. ("But that's a good thing.  Really. It's more fun that way", says Claire.) We will ask you to implement some model, but precise data structures and so on can be chosen by you. However, to integrate with Kaggle, you will need to submit Kaggle predictions using the given evaluation code (more instructions later).
5. You will be asked to fill in your code at various points of the document. You will also be asked to answer questions that analyze your results and motivate your implementation. Please answer these on an additional writeup document. A template has been provided to you.

## **Named Entity Recognition: A Primer**

---

Let us now take a look at the task at hand: Named Entity Recognition (NER). This section provides a brief introduction to the task and why it is important.

**What is NER?**
NER refers to the information extraction technique of identifying and categorizing key information about entities within textual data. Let's look at an example: 

<br/>

![picture](https://drive.google.com/uc?id=1mxwn1_2Ef16_MJeyl9jJwwR6IohUOeHO)

<br/>

In the above example, we can see that the text has numerous named entities that can be categorized as LOC (location), ORG (organization), PER (person), etc. Today, NER is dominated by deep learning approaches. However, for this assignment, we will try to do NER using something simpler: HMMs and MEMMs. NER is important for a number of reasons and has a wide variety of use cases, including but not limited to:
  - Detecting entities in search engines and voice assistants for more relevant search results.
  - Automatically parsing resumes.
  - ...and many more!


To read more on NER, we refer to any of the following sources:
1. Medium post [1](https://umagunturi789.medium.com/everything-you-need-to-know-about-named-entity-recognition-2a136f38c08f) and [2](https://medium.com/mysuperai/what-is-named-entity-recognition-ner-and-how-can-i-use-it-2b68cf6f545d).
2. Try out [this](https://demo.allennlp.org/named-entity-recognition/named-entity-recognition) AllenNLP demo!

## **Entity Level Mean F1**

---

Let's take a look at the metrics that you will focus on in this assignment. The standard measures to report for NER are recall, precision, and F1 score (also called F-measure) evaluated at the **named entity level** (not at the token level). The code for this is provided later, in the validation section of Part 2. Please use this code when evaluating your models.


If P and T are the sets of predicted and true *named entity spans*, respectively (e.g., the five named entity spans in the above example are "Zifa", "Renate Goetschl", "Austria", "World Cup", and "Germany"), then

####<center>Precision = $\frac{|\text{P}\;\cap\;\text{T}|}{|\text{P}|}$ and Recall = $\frac{|\text{P}\;\cap\;\text{T}|}{|\text{T}|}$.</center><br/>


####<center>F1 = $\frac{2 * \text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}$. </center><br/>

For each type of named entity, e.g. *LOC*ation, *MISC*ellaneous, *ORG*anization and *PER*son, we calculate the F1 score as shown above, and take the mean of all these F1 scores to get the **Entity Level Mean F1** score for the test set. If $N$ is the total number of labels (i.e., named entity types), then

####<center>Entity Level Mean F1 = $\frac{\sum_{i = 1}^{N} \text{F1}_{{label}_i}}{N}$. </center>

More details under the validation section in Part 2.
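
As a small worked illustration of the formulas above, the sketch below computes precision, recall, and F1 for a single label from two lists of spans. The function name `span_prf` and the example spans are hypothetical and are not part of the provided evaluation code; the `mean_f1` function given in Part 2 remains the reference implementation.

```python
def span_prf(pred_spans, true_spans):
    """Entity-level precision, recall and F1 for one label.

    Spans are (start, end) tuples; a predicted span counts as correct
    only if it matches a true span exactly.
    """
    pred, true = set(pred_spans), set(true_spans)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spans for one label: one exact match and one boundary error.
pred = [(15, 16), (23, 23)]
true = [(15, 16), (23, 24)]
precision, recall, f1 = span_prf(pred, true)
print(precision, recall, f1)
```

Here the span `(23, 23)` receives no credit even though it overlaps `(23, 24)`, which is what evaluating at the entity level rather than the token level means.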



# **Part 1: Dataset** 📈

Load the dataset as follows:
  1. Obtain the data from Kaggle at https://www.kaggle.com/c/cs4740-fa21-p2/data.
  2. Unzip the data. Put it into your google drive, and mount it on colab as per below:

In [None]:
"""from google.colab import drive
import os
drive.mount('/content/drive', force_remount=True)"""

In [None]:
"""import json

# TODO: please change the line below with your drive organization
path = os.path.join(os.getcwd(), "drive", "MyDrive","Colab Notebooks","cs4740-fa21-p2")

with open(os.path.join(path,'train.json'), 'r') as f:
     train = json.loads(f.read())

with open(os.path.join(path,'test.json'), 'r') as f:
     test = json.loads(f.read())"""

In [1]:
import pyRAPL

pyRAPL.setup()
measure = pyRAPL.Measurement('bar')
measure.begin()

In [2]:
import json

# TODO: please change the line below with your drive organization
#path = os.path.join(os.getcwd(), "drive", "MyDrive","Colab Notebooks","cs4740-fa21-p2")

with open(('train.json'), 'r') as f:
     train = json.loads(f.read())

with open(('test.json'), 'r') as f:
     test = json.loads(f.read())

Here are a few things to note about the dataset above:
1. We have just loaded 2 json files: train and test. Please note that these files are different from the original release of CoNLL-2003 since we have already processed and tokenized them for you. Hence, the documents are represented as a list of strings. Note that the data is **not** split into separate training and development/validation sets. You will need to do this yourself as needed using the train set.
2. The train file contains the following 4 fields (each is a nested list): 
  - **'text'** - actual input tokens
  - **'NER'** - the token-level entity tag (ORG/PER/LOC/MISC/O) where **O is used to denote tokens that are not part of any named entity**
  - **'POS'** - the part of speech tag (will be handy for feature engineering of the MEMM model)
  - **'index'** - index of the token in the dataset
3. The test data only has 'text', 'POS' and 'index' fields. You will need to submit your prediction of the 'NER' tag to Kaggle. More instructions on this later!

Let's take a look at a sample sentence from the dataset!

In [3]:
print(train['text'][0])
print(train['index'][0])
print(train['POS'][0])
print(train['NER'][0])

['Romania', 'state', 'budget', 'soars', 'in', 'June', '.', 'BUCHAREST', '1996-08-28', 'Romania', "'s", 'state', 'budget', 'deficit', 'jumped', 'sharply', 'in', 'June', 'to', '1,242.9', 'billion', 'lei', 'for', 'the', 'January-June', 'period', 'from', '596.5', 'billion', 'lei', 'in', 'January-May', ',', 'official', 'data', 'showed', 'on', 'Wednesday', '.', 'Six-month', 'expenditures', 'stood', 'at', '9.50', 'trillion', 'lei', ',', 'up', 'from', '7.56', 'trillion', 'lei', 'at', 'end-May', ',', 'with', 'education', 'and', 'health', 'spending', 'accounting', 'for', '31.6', 'percent', 'of', 'state', 'expenses', 'and', 'economic', 'subsidies', 'and', 'support', 'taking', 'some', '26', 'percent', '.', 'January-June', 'revenues', 'went', 'up', 'to', '8.26', 'trillion', 'lei', 'from', '6.96', 'trillion', 'lei', 'in', 'the', 'first', 'five', 'months', 'this', 'year', '.', 'Romania', "'s", 'government', 'is', 'expected', 'to', 'revise', 'the', '1996', 'budget', 'on', 'Wednesday', 'to', 'bring', '

As you can see above, the sentence "Romania state budget soars in June ." has already been tokenized into a list of word tokens. The index array gives the index of each token in the entire dataset (not within the sentence). The POS tags and the NER tags correspond to the given indices. For example, the token **Romania** has:
  - index: 0
  - POS: 'NNP'
  - NER: **'LOC'**

### **Q1: Initial Data Observations**
What are your initial observations after you explore the dataset?  Provide some quantitative data exploration. Assess dataset size, document lengths and the token-level NER class distribution, and the entity-level NER class distribution (skipping the 'O' label for the latter). Give some examples of sentences with their named entities bracketed, e.g. [[LOC Romania] state budget soars in June .] and [[ORG Zifa] said [PER Renate Goetschl] of [LOC Austria]...]. 

Present your findings in the supporting template document!
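
One possible way to produce the bracketed format requested in Q1 is to merge consecutive tokens that share the same non-O label into a single span. The helper below is a minimal sketch under that assumption (the dataset has no B-/I- prefixes, so adjacent entities of the same type are merged, as in the provided `format_output_labels`). The name `bracket_entities` and the example tokens are hypothetical.

```python
from itertools import groupby


def bracket_entities(tokens, labels):
    """Merge runs of identical non-O labels into one bracketed span."""
    pieces = []
    pos = 0
    for label, group in groupby(labels):
        length = len(list(group))            # size of this run of equal labels
        words = " ".join(tokens[pos:pos + length])
        pieces.append(words if label == "O" else f"[{label} {words}]")
        pos += length
    return " ".join(pieces)


tokens = ["Zifa", "said", "Renate", "Goetschl", "of", "Austria"]
labels = ["ORG", "O", "PER", "PER", "O", "LOC"]
print(bracket_entities(tokens, labels))
```

Applied to a document, this would be called as `bracket_entities(train['text'][i], train['NER'][i])`.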

In [4]:
from collections import Counter
word_count=Counter()
for sentence in train['text']:
  # Counter.update accepts any iterable of tokens directly
  word_count.update(sentence)

word_count.most_common(10)

[('.', 5920),
 (',', 5818),
 ('the', 5795),
 ('of', 3038),
 ('to', 2734),
 ('in', 2677),
 ('a', 2359),
 (')', 2358),
 ('(', 2355),
 ('and', 2277)]

In [5]:
import numpy as np
dataset=[]
for sentence in train['text']:
  for word in sentence:
    dataset.append(word)
len(dataset)

162341

In [6]:
len(train['text'])

756

In [7]:
docu_len=[]
for docu in train['text']:
  docu_len.append(len(docu))
np.array(docu_len).mean()

214.73677248677248

In [8]:
ten_tagged=[]
for index in range(10):
  sentence_NER=[]
  for i in range(len(train['text'][index])):

    if train['NER'][index][i] != 'O':
      sentence_NER.append('[')
      sentence_NER.append(train['NER'][index][i])
      sentence_NER.append(train['text'][index][i])
      sentence_NER.append(']')
    else:
      sentence_NER.append(train['text'][index][i])
  ten_tagged.append(sentence_NER)

for sentences in ten_tagged:

    bag=np.array(sentences)
    string_sent=' '.join(bag)
    print(string_sent)


[ LOC Romania ] state budget soars in June . [ LOC BUCHAREST ] 1996-08-28 [ LOC Romania ] 's state budget deficit jumped sharply in June to 1,242.9 billion lei for the January-June period from 596.5 billion lei in January-May , official data showed on Wednesday . Six-month expenditures stood at 9.50 trillion lei , up from 7.56 trillion lei at end-May , with education and health spending accounting for 31.6 percent of state expenses and economic subsidies and support taking some 26 percent . January-June revenues went up to 8.26 trillion lei from 6.96 trillion lei in the first five months this year . [ LOC Romania ] 's government is expected to revise the 1996 budget on Wednesday to bring it into line with higher inflation , new wage and pension indexations and costs of energy imports that have pushed up the state deficit . Under the revised version state spending is expected to rise by some 566 billion lei . No new deficit forecast has been issued so far . In July the government gave a

In [9]:
ner_count=Counter()
for index in range(len(train['text'])):
  ner_count.update(Counter(train['NER'][index]))
ner_count.most_common()

[('O', 135186), ('PER', 8875), ('ORG', 8065), ('LOC', 6556), ('MISC', 3659)]

In [10]:
pos_count=Counter()
for index in range(len(train['text'])):
  pos_count.update(Counter(train['POS'][index]))
pos_count.most_common(10)

[('NNP', 27383),
 ('NN', 19053),
 ('IN', 15248),
 ('CD', 15130),
 ('DT', 10753),
 ('JJ', 9425),
 ('NNS', 8002),
 ('VBD', 6659),
 ('.', 5933),
 (',', 5818)]

In [11]:
ner_count=Counter()
for index in range(len(train['text'])):
  ner_count.update(Counter(train['NER'][index]))
ner_count.most_common()
tot=0
for i in ner_count.values():
  tot+=i

In [12]:
ner_count=Counter()
for index in range(len(train['text'])):
  ner_count.update(Counter(train['NER'][index]))
ner_count.most_common()
tot=0
for i in ner_count.values():
  tot+=i
values=[]
for i in ner_count.values():
  values.append(i)

In [13]:

# Token-level label distribution; iterating over the Counter avoids
# relying on a hard-coded position for each label.
dic = {label: count / tot for label, count in ner_count.items()}
dic

{'LOC': 0.04038412970229332,
 'O': 0.8327286390991802,
 'ORG': 0.04967937859197615,
 'MISC': 0.022538976598641132,
 'PER': 0.054668876007909276}

In [14]:
del ner_count['O']

In [15]:
total=0
for i in ner_count.values():
  total+=i

In [16]:
value=[]
for i in ner_count.values():
  value.append(i)

In [17]:
# Label shares among non-O tokens (these are token counts, not entity-span counts).
ent_dic = {label: count / total for label, count in ner_count.items()}

ent_dic

{'LOC': 0.24142883446879027,
 'ORG': 0.29699871110292764,
 'MISC': 0.13474498250782543,
 'PER': 0.32682747192045664}
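
The ratios above are computed over tokens with the O label removed, so a two-token name such as "Renate Goetschl" contributes two counts to PER. For an entity-level distribution, one option is to count each run of consecutive identical labels once. The sketch below assumes that convention (the same one used by `format_output_labels` later on); the function name `entity_span_counts` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import groupby


def entity_span_counts(label_docs):
    """Count entity spans per label, treating each run of equal labels as one span."""
    counts = Counter()
    for labels in label_docs:
        for label, _ in groupby(labels):
            if label != "O":
                counts[label] += 1
    return counts


# Toy stand-in for train['NER']: two short documents.
toy = [["LOC", "O", "PER", "PER", "O"], ["ORG", "ORG", "O", "LOC"]]
counts = entity_span_counts(toy)
total = sum(counts.values())
print({label: n / total for label, n in counts.items()})
```

On the real data this would be called as `entity_span_counts(train['NER'])`.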

# **Part 2: Hidden Markov Model** 🧨

---

1. Code for counting and smoothing of labels and words and unknown word handling as necessary to support the Viterbi algorithm. (This is pretty much what you already know how to do from project 1.)
2. Build a Hidden Markov Model in accordance with the starter code that has been provided. If you wish to change this starter code you can. However, please ensure that your code is clear, concise, and, most important of all, modular. So break your implementation down into functions or write it within a class. We suggest you compute all probabilities in a log form when building the HMM.
3. An implementation of the **Viterbi algorithm** that can be used to infer token-level labels (identifying the appropriate named entity) for an input document. This process is commonly referred to as **decoding**. Bigram-based Viterbi is $ \mathcal{O}(sm^2)$ where s is the length of the sentence and m is the number of tags. Your implementation should have similar efficiency. The code for this can be used later on for the MEMM too.
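
To see why log probabilities are suggested, the short sketch below multiplies many small probabilities directly and compares the result with a sum of logs. The per-step probability of 0.01 and the length of 200 are hypothetical values chosen only to trigger floating-point underflow.

```python
import math

# Hypothetical per-token probabilities along one tag path of length 200.
probs = [0.01] * 200

# Multiplying directly underflows: the true value 1e-400 is below the
# smallest positive double, so the running product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score finite and comparable across paths.
log_score = sum(math.log(p) for p in probs)

print(product)
print(log_score)
```

Once every path scores 0.0, a maximum over paths is no longer meaningful, whereas log scores preserve the ordering.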

Code of Academic Integrity:  We encourage collaboration regarding ideas, etc. However, please **do not copy code from online or share code with other students**. We will be running programs to detect plagiarism.


## **Unknown Word Handling**
---

In [18]:
# Implement unknown word handling here! You may do this any way that you please
#convert some percent of tokens into UNK.
#https://edstem.org/us/courses/12801/discussion/689899
import copy
import random
#method 1: unkify some percentage k of all tokens
def unkify_k_percent(tokens, k):
  unkTokens = copy.deepcopy(tokens)
  for doc in unkTokens:
    for i in range(len(doc)): 
      if random.random() <= k:
        doc[i] = "UNK"
  return unkTokens

#method 2: unk first occurrence of each token
def unkify_first_occ(tokens):
  unkTokens = copy.deepcopy(tokens)
  seen = set()
  for doc in unkTokens:
    for i in range(len(doc)):
      if doc[i] not in seen:
        seen.add(doc[i])
        doc[i] = "UNK"
  return unkTokens

In [19]:
#ORG/PER/LOC/MISC/O
string2number = {'ORG':0,'PER':1,'LOC':2,'MISC':3,'O':4}
number2string = {0:'ORG',1:'PER',2:'LOC',3:'MISC',4:'O'}

def numifyLabels(labels):
  numLabels = copy.deepcopy(labels)
  for doc in numLabels:
    for i in range(len(doc)):
      doc[i] = string2number[doc[i]]
  return numLabels

def stringifyLabels(labels):
  strLabels = copy.deepcopy(labels)
  for doc in strLabels:
    for i in range(len(doc)):
      doc[i] = number2string[doc[i]]
  return strLabels

def stringifyLabels_doc(labels):
  strLabels = []
  for i in range(len(labels)):
    strLabels.append(number2string[labels[i]])
  return strLabels

In [20]:
#dictionary form: count['tag1']['tag2'] is count of times tag2 comes AFTER tag1.
def transition_count(labels):
  counts = dict()
  for doc in labels:
    for i in range(1,len(doc)):
      curr = doc[i]
      prev = doc[i-1]
      if (prev in counts): 
        if (curr in counts[prev]):
          counts[prev][curr] = counts[prev][curr]+1
        else:
          counts[prev][curr] = 1
      else:
        counts[prev] = dict()
        counts[prev][curr] = 1
  return counts

In [21]:
# returns the transition probabilities (log-probabilities only when log=True)
# add-k smoothing over the V=5 tags (k=0 disables smoothing)
# probabilities are normalized per conditioning (previous) tag
import math
#dictionary form: transition['tag1']['tag2'] is probability tag2 comes AFTER tag1.
def build_transition_matrix(labels,k,log=False):
  transCounts = transition_count(labels)
  matrix = copy.deepcopy(transCounts)
  smoothVals = {}
  V=5
  for i in matrix:
    #V = len(transCounts[i])
    counts = transCounts[i].values()
    totalCount = sum(counts)
    for j in matrix[i]:
      prob = (transCounts[i][j]+k)/(totalCount + k*V)
      if log:
        matrix[i][j] = math.log(prob)
      else:
        matrix[i][j] = prob
      smoothVals[i] = k/(totalCount+k*V)
  return matrix,smoothVals

In [22]:
transMatrix0 = build_transition_matrix(train['NER'],0)
print(transMatrix0)
transMatrix2 = build_transition_matrix(train['NER'],2)
print(transMatrix2)

({'LOC': {'O': 0.8448854961832061, 'LOC': 0.13755725190839696, 'MISC': 0.0044274809160305345, 'ORG': 0.011908396946564885, 'PER': 0.0012213740458015267}, 'O': {'O': 0.8664841826526134, 'LOC': 0.040597734322118995, 'ORG': 0.035852158195788485, 'MISC': 0.01878891112086343, 'PER': 0.03827701370861568}, 'ORG': {'ORG': 0.3696676587301587, 'O': 0.6281001984126984, 'LOC': 0.0003720238095238095, 'PER': 0.000992063492063492, 'MISC': 0.0008680555555555555}, 'MISC': {'O': 0.6971850232303908, 'MISC': 0.27165892320306095, 'ORG': 0.018584312653730527, 'PER': 0.01065864990434545, 'LOC': 0.0019130910084722602}, 'PER': {'PER': 0.41111486867320485, 'O': 0.5815578852440536, 'LOC': 0.006989065494307293, 'MISC': 0.00011272686281140796, 'ORG': 0.00022545372562281593}}, {'LOC': 0.0, 'O': 0.0, 'ORG': 0.0, 'MISC': 0.0, 'PER': 0.0})
({'LOC': {'O': 0.8439024390243902, 'LOC': 0.13765243902439026, 'MISC': 0.004725609756097561, 'ORG': 0.012195121951219513, 'PER': 0.001524390243902439}, 'O': {'O': 0.866434611866033,

In [23]:
#dictionary form: counts[label][token] is the number of times token appears as label
def emission_count(tokens,labels):
  counts = dict()
  for i in range(len(tokens)):
    for j in range(len(tokens[i])):
      token = tokens[i][j]
      label = labels[i][j]
      if label in counts:
        if token in counts[label]:
          counts[label][token] = counts[label][token]+1
        else:
          counts[label][token] = 1
      else:
        counts[label] = dict()
        counts[label][token] = 1
  return counts

In [24]:
# returns the emission probabilities (log-probabilities only when log=True)
def build_emission_matrix(tokens, labels,k,log=False):
  emissCounts = emission_count(tokens,labels)
  matrix = copy.deepcopy(emissCounts)
  smoothVals = {}
  V=0
  for i in matrix.keys():
    V= V+len(emissCounts[i].keys())
  for i in matrix.keys():
    #V = len(emissCounts[i].keys())
    counts = emissCounts[i].values()
    totalCount = sum(counts)
    for j in matrix[i]:
      prob = (emissCounts[i][j]+k)/(totalCount + k*V)
      if log:
        matrix[i][j] = math.log(prob)
      else:
        matrix[i][j] = prob
    smoothVals[i] = k/(totalCount+k*V)
  return matrix,smoothVals

In [25]:
# returns the starting state probabilities (log-probabilities only when log=True)
def get_start_state_probs(labels,log=False):
  counts = dict()
  for doc in labels:
    start = doc[0]
    if start in counts:
      counts[start] = counts[start]+1
    else:
      counts[start] = 1
  vals = counts.values()
  totalCount = sum(vals)
  matrix = dict()
  for i in counts:
    if log:
      matrix[i] = math.log(counts[i]/totalCount)
    else:
      matrix[i] = counts[i]/totalCount
  return matrix
  

In [26]:
# takes in the tokens & labels and returns a representation of the HMM
# call the three functions above in this function to build your HMM
#tuple of matrices, new object/class, something
def build_hmm(tokens, labels,k_trans,k_emiss,log=False):
  transMatrix,smoothProbsTrans = build_transition_matrix(labels,k_trans,log=log)
  emissMatrix,smoothProbsEmiss = build_emission_matrix(tokens,labels,k_emiss,log=log)
  startProbs = get_start_state_probs(labels,log=log)
  hmm = (transMatrix,emissMatrix,startProbs,smoothProbsTrans,smoothProbsEmiss)
  return hmm

In [27]:
#no unk handling
hmm0505 = build_hmm(train['text'],numifyLabels(train['NER']),0.5,0.5)
print(hmm0505)
hmm11 = build_hmm(train['text'],numifyLabels(train['NER']),1,1)



## **HMM Implementation**

---

The code below is just a suggestion for how you may go about building your HMM. Feel free to change it any way you see fit. In fact, you will probably have to tweak it a little to implement smoothing. In the skeleton code below, we have broken down the HMM into its three components: the transition matrix, the emission (i.e., lexical generation, observation) matrix, and the starting state probabilities. We suggest you implement them separately and then use them to build the HMM.

Note: it may help to map your classes (named entity types) to discrete values rather than string labels as you will have to do this for your MEMM anyways. However, the HMM can be done without this.
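
If a plain tuple of matrices becomes hard to read (e.g., `hmm[0]`, `hmm[3]`), one option is a `typing.NamedTuple` that gives each component a name while still supporting positional indexing. The sketch below is only an illustration: the class name `HMM`, its field names, and the toy probabilities are assumptions, not part of the starter code.

```python
from typing import Dict, NamedTuple


class HMM(NamedTuple):
    """Hypothetical container for the HMM components described above."""
    transition: Dict[int, Dict[int, float]]   # transition[prev_tag][tag]
    emission: Dict[int, Dict[str, float]]     # emission[tag][word]
    start: Dict[int, float]                   # start[tag]
    unseen_transition: Dict[int, float]       # smoothed mass per previous tag
    unseen_emission: Dict[int, float]         # smoothed mass per tag


# Toy two-tag model with made-up probabilities.
toy = HMM(
    transition={0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}},
    emission={0: {"the": 1.0}, 1: {"Ithaca": 1.0}},
    start={0: 0.8, 1: 0.2},
    unseen_transition={0: 0.0, 1: 0.0},
    unseen_emission={0: 0.0, 1: 0.0},
)

# Named access and positional access refer to the same objects.
print(toy.transition[0][1], toy[2] is toy.start)
```

Because a `NamedTuple` is still a tuple, code that indexes the model positionally continues to work unchanged.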

## **Viterbi Implementation**

---

At the end of your implementation, we expect a function or class that maps a sequence of tokens (observation) to a sequence of labels via the Viterbi algorithm.
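
Before running on the real data, it can help to check a Viterbi implementation against a model small enough to verify by hand. The sketch below is a generic log-space Viterbi over dictionaries, not the notebook's implementation; the function name `viterbi_log`, the two-state model, and all probabilities are hypothetical.

```python
import math


def viterbi_log(start, trans, emit, obs):
    """Most likely state path; start[s], trans[p][s], emit[s][o] are log-probs."""
    states = list(start)
    score = [{s: start[s] + emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for s in states:
            # best previous state for reaching s at this step
            best = max(states, key=lambda p: prev[p] + trans[p][s])
            col[s] = prev[best] + trans[best][s] + emit[s][o]
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    # follow backpointers from the best final state
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Made-up two-state model.
start = logs({"O": 0.8, "LOC": 0.2})
trans = {"O": logs({"O": 0.9, "LOC": 0.1}), "LOC": logs({"O": 0.6, "LOC": 0.4})}
emit = {
    "O": logs({"in": 0.7, "Ithaca": 0.05, "is": 0.25}),
    "LOC": logs({"in": 0.05, "Ithaca": 0.9, "is": 0.05}),
}
obs = ["in", "Ithaca", "is"]
path = viterbi_log(start, trans, emit, obs)
print(path)
```

With only two states and three tokens there are eight possible paths, so the result can be confirmed by enumerating them and comparing joint probabilities.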

In [28]:
#gets probability of tag2 following tag1 from dictionary
def transProb(tag1,tag2,hmm):
  transDict = hmm[0]
  smoothProbs = hmm[3]
  if tag2 in transDict[tag1]:
    return transDict[tag1][tag2]
  else:
    #unseen transition: the smoothed value is stored per conditioning (previous) tag
    return smoothProbs[tag1]

#gets prob of word given tag from dictionary, + unknown handling
def emissProb(word,tag,hmm):
  emissDict = hmm[1]
  smoothProbs = hmm[4]
  if word in emissDict[tag]:
    return emissDict[tag][word]
  elif "UNK" in emissDict[tag]:
    return emissDict[tag]["UNK"]
  else:
    #key_min = min(emissDict[tag].keys(), key=(lambda k: emissDict[tag][k]))
    #return emissDict[tag][key_min]
    #return minLog
    return smoothProbs[tag]

In [29]:
# takes in the hmm built above (with log=False; logs are taken here) and an
# observation: list of tokens
# and returns the predicted tag ids (see number2string) for the tokens
def viterbi(hmm, observation):
  n = len(observation)
  c = 5
  startProbs = hmm[2]
  score = np.zeros((c,n))
  backpointers = np.zeros((c,n)).astype(int)
  #init: start probability times emission of the first token, in log space
  for i in range(c):
    tag = i
    start = startProbs[tag]
    emiss = emissProb(observation[0],tag,hmm)
    score[i][0] = math.log(start) + math.log(emiss)
    backpointers[i][0] = 0
  #forward pass
  for t in range(1,n):
    word = observation[t]
    for i in range(c):
      currTag = i
      #emission depends only on the current tag, so compute it once per tag
      emiss = emissProb(word,currTag,hmm)
      currScores = np.zeros(c)
      for j in range(c):
        prevTag = j
        trans = transProb(prevTag,currTag,hmm)
        currScores[j] = score[j][t-1]+math.log(trans)+math.log(emiss)
      score[i][t] = np.max(currScores)
      backpointers[i][t] = np.argmax(currScores)
  #identify tags by following backpointers from the best final state
  T = np.zeros((n))
  T = T.astype(int)
  T[n-1] = np.argmax(score[:,n-1])
  for i in range(n-2,-1,-1):
    T[i] = backpointers[T[i+1]][i+1]
  
  return T

In [30]:
# here's a sample observation that you can use to test your code
obs_1 = ['Cornelll',
 'University',
 'is',
 'located',
 'in',
 'Ithaca',
 'and',
 'was',
 'founded',
 'by',
 'Ezra',
 'Cornell']

#print(viterbi(hmm=hmm00, observation= obs_1))
print(viterbi(hmm=hmm0505, observation= obs_1))

[0 0 4 4 4 4 4 4 4 4 1 1]


In [31]:
#test unk handling with raw (non-log) probabilities
hmm00prob = build_hmm(unkify_k_percent(train['text'],0.5),numifyLabels(train['NER']),0,0.5,log=False)
print(viterbi(hmm=hmm00prob, observation= obs_1))

[4 4 1 1 1 1 1 1 1 1 4 4]


## **Validation Step**

---

In this part of the project, we expect you to split the training data into train and validation datasets. You may use whatever split you see fit and use any external libraries to perform this split. You may want to look into the following function for splitting data: [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

Once you have split the data, train your HMM model on the training data and evaluate it on the validation data. Report **Entity Level Mean F1**, which was explained earlier. Please use the code we have provided below to compute this metric.

Please also take a look into your misclassified cases, as we will be performing error analysis in the *Evaluation* section. We expect smoothing, unknown word handling and correct emission (i.e., lexical generation) probabilities.

Consider the example below. After getting a sequence of NER labels for the sequence of tokens from your Viterbi algorithm implementation, you need to convert the sequence of tokens, associated token indices and NER labels into a format which can be used to calculate **Entity Level Mean F1**. We do this by finding the starting and ending indices of the spans representing each entity (as given in the corpus) and adding it to a list that is associated with the label with which the spans are labelled. To score your validation data on Google Colab or your local device, you can get a dictionary format as shown in the picture below from the function **format_output_labels** of both the predicted and true label sequences, and use the two dictionaries as input to the **mean_f1** function.

NOTE: We do **not** include the spans of the tokens labelled as "O" in the formatted dictionary output.

![picture](https://docs.google.com/uc?export=download&id=1M57DEHgfusVPU_hlvmiOpkS3yn9GGEgj)

In [32]:
def format_output_labels(token_labels, token_indices):
    """
    Returns a dictionary that has the labels (LOC, ORG, MISC or PER) as the keys, 
    with the associated value being the list of entities predicted to be of that key label. 
    Each entity is specified by its starting and ending position indicated in [token_indices].

    Eg. if [token_labels] = ["ORG", "ORG", "O", "O", "ORG"]
           [token_indices] = [15, 16, 17, 18, 19]
        then dictionary returned is 
        {'LOC': [], 'MISC': [], 'ORG': [(15, 16), (19, 19)], 'PER': []}

    :parameter token_labels: A list of token labels (eg. PER, LOC, ORG or MISC).
    :type token_labels: List[String]
    :parameter token_indices: A list of token indices (taken from the dataset) 
                              corresponding to the labels in [token_labels].
    :type token_indices: List[int]
    """
    label_dict = {"LOC":[], "MISC":[], "ORG":[], "PER":[]}
    prev_label = token_labels[0]
    start = token_indices[0]
    for idx, label in enumerate(token_labels):
      if prev_label != label:
        end = token_indices[idx-1]
        if prev_label != "O":
            label_dict[prev_label].append((start, end))
        start = token_indices[idx]
      prev_label = label
      if idx == len(token_labels) - 1:
        if prev_label != "O":
            label_dict[prev_label].append((start, token_indices[idx]))
    return label_dict

In [33]:
# Code for mean F1

import numpy as np

def mean_f1(y_pred_dict, y_true_dict):
    """ 
    Calculates the entity-level mean F1 score given the actual/true and 
    predicted span labels.
    :parameter y_pred_dict: A dictionary containing predicted labels as keys and the 
                            list of associated span labels as the corresponding
                            values.
    :type y_pred_dict: Dict<key [String] : value List[Tuple]>
    :parameter y_true_dict: A dictionary containing true labels as keys and the 
                            list of associated span labels as the corresponding
                            values.
    :type y_true_dict: Dict<key [String] : value List[Tuple]>

    Implementation modified from original by author @shonenkov at
    https://www.kaggle.com/shonenkov/competition-metrics.
    """
    F1_lst = []
    for key in y_true_dict:
        TP, FN, FP = 0, 0, 0
        num_correct, num_true = 0, 0
        preds = y_pred_dict[key]
        trues = y_true_dict[key]
        for true in trues:
            num_true += 1
            if true in preds:
                num_correct += 1
            else:
                continue
        num_pred = len(preds)
        if num_true != 0:
            if num_pred != 0 and num_correct != 0:
                R = num_correct / num_true
                P = num_correct / num_pred
                F1 = 2*P*R / (P + R)
            else:
                F1 = 0      # either no predictions or no correct predictions
        else:
            continue
        F1_lst.append(F1)
    return np.mean(F1_lst)

In [34]:
# Usage using above example

pred_token_labels = ["ORG", "O", "PER", "PER", "O", "LOC", "O", "O", "O", "O", "MISC", "O", "O", "O", "O", "LOC"]
true_token_labels = ["ORG", "O", "PER", "PER", "O", "LOC", "O", "O", "O", "O", "MISC", "MISC", "O", "O", "O", "LOC"]
token_indices = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]

y_pred_dict = format_output_labels(pred_token_labels, token_indices)
print("y_pred_dict is : " + str(y_pred_dict))
y_true_dict = format_output_labels(true_token_labels, token_indices)
print("y_true_dict is : " + str(y_true_dict))

print("Entity Level Mean F1 score is : " + str(mean_f1(y_pred_dict, y_true_dict)))

y_pred_dict is : {'LOC': [(18, 18), (28, 28)], 'MISC': [(23, 23)], 'ORG': [(13, 13)], 'PER': [(15, 16)]}
y_true_dict is : {'LOC': [(18, 18), (28, 28)], 'MISC': [(23, 24)], 'ORG': [(13, 13)], 'PER': [(15, 16)]}
Entity Level Mean F1 score is : 0.75
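
To see where the 0.75 comes from, the per-label scores can be computed separately before averaging. The following is a minimal sketch that follows the same logic as `mean_f1`; the helper name `per_label_f1` is hypothetical and is not part of the provided code.

```python
def per_label_f1(y_pred_dict, y_true_dict):
    """Return {label: entity-level F1}, skipping labels with no true spans."""
    scores = {}
    for label, trues in y_true_dict.items():
        if not trues:
            continue  # labels without true spans do not enter the mean
        preds = y_pred_dict.get(label, [])
        correct = sum(1 for span in trues if span in preds)
        if correct == 0 or not preds:
            scores[label] = 0.0  # no predictions or no exact span matches
            continue
        precision = correct / len(preds)
        recall = correct / len(trues)
        scores[label] = 2 * precision * recall / (precision + recall)
    return scores

# Span dictionaries copied from the usage example output
y_pred = {'LOC': [(18, 18), (28, 28)], 'MISC': [(23, 23)], 'ORG': [(13, 13)], 'PER': [(15, 16)]}
y_true = {'LOC': [(18, 18), (28, 28)], 'MISC': [(23, 24)], 'ORG': [(13, 13)], 'PER': [(15, 16)]}

scores = per_label_f1(y_pred, y_true)
print(scores)
print(sum(scores.values()) / len(scores))
```

The predicted MISC span (23, 23) does not exactly match the true span (23, 24), so MISC scores 0 while the other three labels score 1, which averages to 0.75.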


In [35]:
import sklearn.model_selection as skmodel
text_train,text_val,index_train,index_val,POS_train,POS_val,NER_train,NER_val = skmodel.train_test_split(train['text'], train['index'],train['POS'],train['NER'], test_size=0.2, random_state=42)
print(len(text_val))
NER_num_train = numifyLabels(NER_train)
NER_num_val = numifyLabels(NER_val)

152


In [36]:
#train hmms here

#no unk handling
#hmm00 = build_hmm(text_train,numifyLabels(NER_train),0,0)
hmm11 = build_hmm(text_train,numifyLabels(NER_train),1,1)
hmm01 = build_hmm(text_train,numifyLabels(NER_train),0,0.5)
hmmLow = build_hmm(text_train,numifyLabels(NER_train),0.05,0.05)

unkText1 = unkify_first_occ(text_train)
unkText2 = unkify_k_percent(text_train,0.1)
#first occ unk
hmm11unk1 = build_hmm(unkText1,numifyLabels(NER_train),1,1)
#10 percent unk
hmm11unk2 = build_hmm(unkText2,numifyLabels(NER_train),1,1)

In [37]:
#testing on validation set
def getPredictions(hmm,text_val,index_val,NER_val):
  labels_pred = []
  indices_pred = []
  labels_true = []
  for i in range(len(text_val)):
    num_pred = viterbi(hmm,text_val[i])
    str_pred = stringifyLabels_doc(num_pred.flatten())
    indices = index_val[i]
    labels = NER_val[i]
    labels_pred.append(str_pred)
    indices_pred.append(indices)
    labels_true.append(labels)
  
  return labels_pred,labels_true,indices_pred

def getPredDicts(hmm,text_val,index_val,NER_val):
  labels_pred,labels_true,indices_pred = getPredictions(hmm,text_val,index_val,NER_val)
  #need to add lists together into 1
  labels_pred = [x for l in labels_pred for x in l]
  indices_pred = [x for l in indices_pred for x in l]
  labels_true = [x for l in labels_true for x in l]
  pred_dict = format_output_labels(labels_pred, indices_pred)
  true_dict = format_output_labels(labels_true, indices_pred)
  return pred_dict,true_dict

def getPredictionsTest(hmm,text_val,index_val):
  labels_pred = []
  indices_pred = []
  for i in range(len(text_val)):
    num_pred = viterbi(hmm,text_val[i])
    str_pred = stringifyLabels_doc(num_pred.flatten())
    indices = index_val[i]
    labels_pred.append(str_pred)
    indices_pred.append(indices)
  #need to add lists together into 1
  labels_pred = [x for l in labels_pred for x in l]
  indices_pred = [x for l in indices_pred for x in l]
  return labels_pred,indices_pred

In [38]:
"""#parameter search
num_trans_params = 5
num_emiss_params = 5

trans_params = np.linspace(0.2, 1, num=num_trans_params)
emiss_params = np.linspace(0.2,1, num=num_emiss_params)

f1_scores = np.zeros((num_trans_params,num_emiss_params))

for j in range(len(trans_params)):
  for k in range(len(emiss_params)):
    hmm = build_hmm(text_train,numifyLabels(NER_train),trans_params[j],emiss_params[k])
    preds,trues = getPredDicts(hmm,text_val,index_val,NER_val)
    f1_scores[j][k]= mean_f1(preds, trues)"""


In [39]:
#print(f1_scores)
#found that transition k doesn't really matter
#emission k = 0.2 close to optimal
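
The smoothing-parameter search can also be written as a small reusable helper. This is a sketch only: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "build an HMM with these two k values and return its validation F1".

```python
from itertools import product

def grid_search(score_fn, trans_values, emiss_values):
    """Try every (trans_k, emiss_k) pair and return (best_score, trans_k, emiss_k)."""
    best = None
    for trans_k, emiss_k in product(trans_values, emiss_values):
        score = score_fn(trans_k, emiss_k)
        if best is None or score > best[0]:
            best = (score, trans_k, emiss_k)
    return best

# Toy scoring function (an assumption for illustration, not real F1 values)
def toy_score(trans_k, emiss_k):
    return 1.0 - abs(emiss_k - 0.2) - 0.01 * trans_k

values = [0.2, 0.4, 0.6, 0.8, 1.0]
best = grid_search(toy_score, values, values)
print(best)
```

In the notebook, `score_fn` would wrap `build_hmm`, `getPredDicts`, and `mean_f1`, so each grid point costs one full training and validation pass.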

In [40]:
hmmTest = build_hmm(text_train,numifyLabels(NER_train),0.2,0.01)
preds,trues = getPredDicts(hmmTest,text_val,index_val,NER_val)
print(mean_f1(preds,trues))

0.6872210437962467


### **Q2.1: Explain your HMM Implementations**

Explain how you implemented the HMM including the Viterbi algorithm (e.g. **which algorithms/data structures** you used). Make clear which parts were implemented from scratch vs. obtained via an existing package. Explain and motivate any design choices providing the intuition behind them (e.g. which methods you used for your HMM implementation, and why?). (Please answer on the written questions template document)


### **Q2.2: Results Analysis**

Explain here how you evaluated the models. Summarize the performance of your system and any variations that you experimented with on the validation datasets. Put the results into clearly labeled tables or diagrams and include your observations and analysis. (Please answer on the written questions template document)


### **Q2.3: Error Analysis**
When did the system work well? When did it fail?  Any ideas as to why? How might you improve the system? (Please answer on the written questions template document)


In [41]:
labels_pred,labels_true,indices_pred = getPredictions(hmmTest,text_val,index_val,NER_val)
def analyzeErrors(labels_pred,labels_true,indices_pred):
  #comparing predictions vs expected
  difference = [[],[],[]]
  #dictionary- key:value - actualTag:list of mispredictions
  mispredictedAs = {'ORG':[],'PER':[],'LOC':[],'MISC':[],'O':[]}
  #dictionary- key:value - predictedTag:list of actual tags
  mispredictedFrom = {'ORG':[],'PER':[],'LOC':[],'MISC':[],'O':[]}
  for i in range(len(labels_pred)):
    for j in range(len(labels_pred[i])):
      pred = labels_pred[i][j]
      true = labels_true[i][j]
      if pred != true:
        difference[0].append(pred)
        difference[1].append(true)
        difference[2].append(text_val[i][j])
        mispredictedAs[true].append(pred)
        mispredictedFrom[pred].append(true)
  return difference,mispredictedAs,mispredictedFrom

In [42]:
def tagCounts(labels_pred,labels_true):
  predictedCounts = {'ORG':0,'PER':0,'LOC':0,'MISC':0,'O':0}
  actualCounts = {'ORG':0,'PER':0,'LOC':0,'MISC':0,'O':0}
  for i in range(len(labels_pred)):
    for j in range(len(labels_pred[i])):
      pred = labels_pred[i][j]
      true = labels_true[i][j]
      predictedCounts[pred] = predictedCounts[pred]+1
      actualCounts[true] = actualCounts[true]+1
  return predictedCounts,actualCounts

### **Q2.4: What is the effect of unknown word handling and smoothing?**
(Please answer on the written questions template document)

# **Part 3: Maximum Entropy Markov Model** 💫

---

In this section, you will implement a Maximum Entropy Markov Model (**MEMM**) to perform the same NER task. Your model should consist of a MaxEnt classifier with Viterbi decoding. 

1. We have already tokenized the documents. You can either use a MaxEnt classifier from an existing package or write the MaxEnt code yourself. **Important note: MaxEnt classifiers are statistically equivalent to multi-class logistic regression, so you can use packages for multi-class LR instead of MaxEnt.**

2. Use the classifier to learn a probability $P(t_i|features)$. You may replace either the lexical generation probability, $P(w_i|t_i)$, or the transition probability, $P(t_i|t_{i-1})$, in the HMM with it, or you may replace the entire *lexical generation probability × transition probability* calculation, $P(w_i|t_i) \cdot P(t_i|t_{i-1})$, in the HMM with it.

3. To train such a classifier, you need to pick a feature set. The content of the feature set is up to you. You should try different feature sets and evaluate your choices on the validation set. Pick the feature set that performs best overall according to the F1 measure. If you draw inspiration for your features from external sources, please link them in the code.

4. Use your own implementation of the **Viterbi algorithm**, which you can modify from the one you developed for the HMM model. You will need the probabilities that you obtain from the MaxEnt classifier. 

5. Remember to use the same training and validation split when evaluating the MEMM to have a **fair comparison** with your **HMM model**.
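
To make the MaxEnt / logistic regression equivalence in note 1 concrete, here is a minimal sketch of how such a classifier turns features into $P(t_i|features)$: a weighted feature sum per tag, followed by a softmax. The function name `maxent_probs`, the feature names, and the weights are all invented for illustration; a trained classifier would learn the weights from data.

```python
import math

def maxent_probs(weights, features):
    """weights: {tag: {feature_name: weight}}; features: {feature_name: value}.
    Returns {tag: P(tag | features)} via a softmax over per-tag scores."""
    scores = {
        tag: sum(w.get(name, 0.0) * value for name, value in features.items())
        for tag, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tag: math.exp(s - top) for tag, s in scores.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}

# Hypothetical weights and features for one token
weights = {
    "PER": {"is_capitalized": 1.5, "prev_tag=PER": 2.0},
    "O": {"is_capitalized": -1.0},
}
features = {"is_capitalized": 1.0, "prev_tag=PER": 1.0}

probs = maxent_probs(weights, features)
print(probs)
```

Including the previous tag among the features is what lets the classifier output stand in for the transition probability during decoding.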


Please also look at your misclassified cases, as we will be performing error analysis in the *Evaluation* section.





---
Here's a summary of the workflow for Part 3:

![alt text](https://drive.google.com/uc?export=view&id=14VfjW3yDyXLojWM_u0LeJYdDOSLkElBn)




Note that we have not provided any skeleton code for feature engineering, since this is meant to be an open-ended task and we want you to experiment with the dataset. However, please make sure that your code is concise, clean, and readable! Ultimately, we expect a function or class mapping a sequence of tokens to a sequence of labels. At the end of this section you should have done the following:
1. Extract Features
2. Build & Train MaxEnt
3. Call Viterbi when evaluating
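
Step 3 can be sketched independently of any particular classifier. The version below assumes the per-position scores have already been converted to log probabilities; `viterbi_decode` and its argument layout are hypothetical and are not the implementation used later in this notebook.

```python
import math

def viterbi_decode(start_logp, step_logp):
    """start_logp[j]: log score of tag j at position 0.
    step_logp[t-1][i][j]: log score of tag j at position t given tag i at t-1
    (for an MEMM this would come from the classifier's P(t_i | t_{i-1}, features)).
    Returns the highest-scoring tag sequence as a list of tag ids."""
    n_tags = len(start_logp)
    score = [list(start_logp)]
    back = []
    for trans in step_logp:
        prev = score[-1]
        row, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: prev[i] + trans[i][j])
            row.append(prev[best_i] + trans[best_i][j])
            ptr.append(best_i)
        score.append(row)
        back.append(ptr)
    # follow backpointers from the best final tag
    path = [max(range(n_tags), key=lambda j: score[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

log = math.log
start = [log(0.9), log(0.1)]
steps = [
    [[log(0.1), log(0.9)], [log(0.5), log(0.5)]],
    [[log(0.5), log(0.5)], [log(0.8), log(0.2)]],
]
path = viterbi_decode(start, steps)
print(path)
```

Working in log space keeps long products of small probabilities from underflowing, which is why the scores are added rather than multiplied.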

### **Feature Engineering**
---

In [43]:
from sklearn.linear_model import LogisticRegression as logit
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack, csr_matrix, vstack

In [44]:
#returns the list of integer indices 0..len(vocab)-1, one per vocabulary entry
def indexer (vocab):
  return list(range(len(vocab)))

In [45]:
def capital_feature(text):
  cap_feature=[]
  for i in range(len(text)):
    for index in range(len(text[i])):
      cap=np.zeros(1)
      if text[i][index][0].isupper():
        cap+=1
      cap_feature.append(cap)
  return cap_feature


In [46]:
def all_capital(text):
  is_cap=[]
  for i in range(len(text)):
    for index in range(len(text[i])):
      cap=np.zeros(1)
      if text[i][index].isupper():
        cap+=1
      is_cap.append(cap)
  return is_cap

In [47]:
#creates a part-of-speech dictionary mapping each POS tag to an integer id
#'NULL' maps to -1 and stands for a missing neighbor (e.g. no previous word)
pos_list=[]
for docu in train['POS']:
  for pos in docu:
    pos_list.append(pos)
pos_list=np.unique(np.array(pos_list))
pos_dict=dict(zip(pos_list,indexer(pos_list)))
pos_dict['NULL']=-1

In [48]:
#vectorizes for the part of speech for current word
def pos_vector_single(pos):
  pos_feature=[]
  for doc in pos:
    for p in doc:
      feature=0
      temp=pos_dict[p]
      feature+=temp
      pos_feature.append(feature)
  return pos_feature


In [49]:
#creates a feature vector holding the POS ids of the previous two words and
#the next two words
#positions that fall outside the document are set to -1
def pos_vector_distance(pos):
  pos_vector=[]
  offsets=(-2,-1,1,2)
  for doc in pos:
    n=len(doc)
    for i in range(n):
      temp=np.zeros(4)
      for slot,offset in enumerate(offsets):
        j=i+offset
        #bounds check avoids IndexError and negative-index wraparound on short documents
        temp[slot]=pos_dict[doc[j]] if 0<=j<n else -1
      pos_vector.append(temp)
  return pos_vector
        

        

In [50]:

#this creates a bag of words that counts the distribution of
#POS tags in a document
def bag_of_word(ps):
  bow_feature=[]
  for doc in ps:
    bow=np.zeros(len(pos_list))
    for pos in doc:
      
      temp=pos_dict[pos]
      bow[temp]+=1
    for pos in doc:
      bow_feature.append(bow)
  return bow_feature


In [51]:
#creates features given an input of 1s and 0s to turn on and off features
#creates a sparse matrix to feed into the sklearn logit
#0 for off
#1 for on
def feature_selection(allcp,capital,pos,poschunk,bow,text,postest,indextest):
  cp = capital_feature(text)
  allcap=all_capital(text)
  po=pos_vector_single(postest)
  poschu=pos_vector_distance(postest)
  bagbag=bag_of_word(postest)
  features=[]
  indexx=0

  for i in range(len(text)):
    for index in range(len(text[i])):
      feature=[]
      ind=indexx
      if allcp ==1:
        feature.append(allcap[ind][0])
      if capital == 1:
        feature.append(cp[ind][0])
      if pos==1:
        feature.append(po[ind])
      if poschunk ==1:
        for x in poschu[ind]:
          feature.append(x)
      if bow ==1:
        for x in bagbag[ind]:
          feature.append(x)
      indexx+=1
      features.append(feature)
  return csr_matrix(features)

        




In [52]:
#flattens a nested list (one inner list per document) into a single flat list
def empty_nester(nestedlist):
  ydata=[]
  for docu in nestedlist:
    for ner in docu:
      ydata.append(ner)
  return ydata
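
A shorter way to do the same flattening is available in the standard library. This is a sketch with a hypothetical helper name, `flatten`; it assumes the input is a list of lists, as with the per-document label and index lists here.

```python
from itertools import chain

def flatten(nested):
    """Concatenate the inner lists of a list of lists, preserving order."""
    return list(chain.from_iterable(nested))

flat = flatten([["O", "PER"], ["LOC"], []])
print(flat)
```

Because it only relies on iteration order, the same helper works for labels, tokens, and token indices alike.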
  


In [53]:
#flattens the nested per-document token indices into a single flat list
def get_indexes(nestedindex):
  indexes=[]

  for i in range(len(nestedindex)):
    for index in range(len(nestedindex[i])):
      indexes.append(nestedindex[i][index])
  return indexes

In [54]:
#this was not used.
def vector(text):
  vectorizer = CountVectorizer(max_features=1000)
  #collect every word into a flat list without shadowing the input argument
  words=[]
  for x in text:
    for word in x:
      words.append(word)
  return vectorizer.fit_transform(words)

In [55]:
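from sklearn.model_selection import train_test_split
#note: this re-split overwrites text_train/text_val/index_train/index_val from the HMM cells
#and uses test_size=0.1 with no random_state, so it differs from the HMM split (test_size=0.2, random_state=42)
text_train, text_val, pos_train, pos_val, index_train, index_val, ner_train, ner_val= train_test_split(train['text'],train['POS'],train['index'],train['NER'], test_size=0.1)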


In [56]:
flat_val_index=get_indexes(index_val)
flat_train_index=get_indexes(index_train)

In [57]:
trainer=feature_selection(0,1,1,1,0,text_train,pos_train,index_train)

allfeatures=feature_selection(0,1,1,1,0,train['text'],train['POS'],train['index'])
valid=feature_selection(0,1,1,1,0,text_val,pos_val,index_val)

In [58]:
ner_train_flat_number=empty_nester(numifyLabels(ner_train))
ner_val_flat_number=empty_nester(numifyLabels(ner_val))

In [59]:
memm=logit(multi_class='multinomial',max_iter=7000)
memm.fit(trainer,ner_train_flat_number)

In [60]:
val_pred=stringifyLabels_doc(memm.predict(valid))
fake_pred_dict=format_output_labels(val_pred, flat_val_index)
real_pred_dict=format_output_labels(empty_nester(ner_val),flat_val_index)
mean_f1(fake_pred_dict, real_pred_dict)


0.1895897411202449

In [61]:
#viterbi with memm model incorporating trans prob, emission prob and log regress
def viterbiMEMM(hmm, observation,indices,features,classifier):
  n = len(observation)
  c = 5
  startProbs = hmm[2]
  score = np.zeros((c,n))
  backpointers = np.zeros((c,n)).astype(int) 

  #initialization
  for i in range(c):
    tag = i
    start = startProbs[tag]
    emiss = emissProb(observation[0],tag,hmm)
    score[i][0] = math.log(start) + math.log(emiss)
    backpointers[i][0] = 0
  #forward pass
  for t in range(1,n):
    word = observation[t]
    featureVec = features[indices[t]]
    #assumes classifier.classes_ is ordered 0..c-1, so ner_prob[i] is P(tag i | features)
    ner_prob = classifier.predict_proba(featureVec).flatten()
    for i in range(c):
      currTag = i
      currScores = np.zeros(c)
      regressProb = ner_prob[i]
      #emission does not depend on the previous tag, so compute it once per current tag
      emiss = emissProb(word,currTag,hmm)
      for j in range(c):
        prevTag = j
        trans = transProb(prevTag,currTag,hmm)
        currScores[j] = score[j][t-1]+math.log(regressProb)+math.log(emiss)+math.log(trans)
      score[i][t] = np.max(currScores)
      backpointers[i][t] = np.argmax(currScores)
  #identify tags by following backpointers from the best final state
  T = np.zeros(n).astype(int)
  T[n-1] = np.argmax(score[:,n-1])
  for i in range(n-2,-1,-1):
    T[i] = backpointers[T[i+1]][i+1]
  
  return T

In [62]:
#testing on validation set
def getMEMMPredictions(hmm,text_val,index_val,NER_val,features,classifier):
  labels_pred = []
  indices_pred = []
  labels_true = []
  for i in range(len(text_val)):
    num_pred = viterbiMEMM(hmm,text_val[i],index_val[i],features,classifier)
    str_pred = stringifyLabels_doc(num_pred.flatten())
    indices = index_val[i]
    labels = NER_val[i]
    labels_pred.append(str_pred)
    indices_pred.append(indices)
    labels_true.append(labels)
  return labels_pred,labels_true,indices_pred

def getMEMMPredDicts(hmm,text_val,index_val,NER_val,features,classifier):
  labels_pred,labels_true,indices_pred = getMEMMPredictions(hmm,text_val,index_val,NER_val,features,classifier)
  #need to add lists together into 1
  labels_pred = [x for l in labels_pred for x in l]
  indices_pred = [x for l in indices_pred for x in l]
  labels_true = [x for l in labels_true for x in l]
  pred_dict = format_output_labels(labels_pred, indices_pred)
  true_dict = format_output_labels(labels_true, indices_pred)
  return pred_dict,true_dict

def getMEMMPredictionsTest(hmm,text_val,index_val,testfeatures,classifier):
  labels_pred = []
  indices_pred = []
  for i in range(len(text_val)):
    num_pred = viterbiMEMM(hmm,text_val[i],index_val[i],testfeatures,classifier)
    str_pred = stringifyLabels_doc(num_pred.flatten())
    indices = index_val[i]
    labels_pred.append(str_pred)
    indices_pred.append(indices)
  #need to add lists together into 1
  labels_pred = [x for l in labels_pred for x in l]
  indices_pred = [x for l in indices_pred for x in l]
  return labels_pred,indices_pred

In [63]:
preds,trues =getMEMMPredDicts(hmm0505,text_val,index_val,ner_val,allfeatures,memm)

In [64]:
mean_f1(preds, trues)

0.8469137719549932

In [65]:
pred_list_MEMM, true_list_MEMM , indicis_list =getMEMMPredictions(hmm0505,text_val,index_val,ner_val,allfeatures,memm)

In [66]:
difference,mispredictedAs,mispredictedFrom = analyzeErrors(pred_list_MEMM,true_list_MEMM,indicis_list)
predictedCounts,actualCounts = tagCounts(pred_list_MEMM,true_list_MEMM)
print("predicted counts:")
print(predictedCounts)
print("actual counts:")
print(actualCounts)
for i in mispredictedAs:
  print(f"{i} was misclassified as something else {len(mispredictedAs[i])} times")
print()
for i in mispredictedFrom:
  print(f"{i} was wrongly predicted {len(mispredictedFrom[i])} times")

predicted counts:
{'ORG': 914, 'PER': 1313, 'LOC': 737, 'MISC': 216, 'O': 16233}
actual counts:
{'ORG': 958, 'PER': 1339, 'LOC': 810, 'MISC': 326, 'O': 15980}
ORG was misclassified as something else 162 times
PER was misclassified as something else 74 times
LOC was misclassified as something else 103 times
MISC was misclassified as something else 117 times
O was misclassified as something else 35 times

ORG was wrongly predicted 118 times
PER was wrongly predicted 48 times
LOC was wrongly predicted 30 times
MISC was wrongly predicted 7 times
O was wrongly predicted 288 times
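
The totals above say how often each tag was involved in an error, but not which tag it was confused with. A minimal sketch of a pairwise breakdown follows; `confusion_counts` is a hypothetical helper, and the tiny label lists are made up for illustration rather than taken from the validation set.

```python
from collections import Counter

def confusion_counts(labels_pred, labels_true):
    """Count (true_label, predicted_label) pairs over all mispredicted tokens.
    Both arguments are lists of per-document label lists."""
    counts = Counter()
    for doc_pred, doc_true in zip(labels_pred, labels_true):
        for pred, true in zip(doc_pred, doc_true):
            if pred != true:
                counts[(true, pred)] += 1
    return counts

# Made-up example: two short documents
toy_pred = [["O", "LOC", "O"], ["PER", "O"]]
toy_true = [["ORG", "ORG", "O"], ["PER", "MISC"]]

toy_confusions = confusion_counts(toy_pred, toy_true)
for (true, pred), count in toy_confusions.most_common():
    print(f"{true} predicted as {pred}: {count}")
```

Applied to `pred_list_MEMM` and `true_list_MEMM`, this would show, for example, how many ORG errors went to O versus LOC.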


In [67]:
for i in range(10,50):
  print(f"word: {difference[2][i]}")
  print(f"predicted label: {difference[0][i]}")
  print(f"actual label: {difference[1][i]}")
  print()

word: Peterborough
predicted label: O
actual label: ORG

word: Preston
predicted label: PER
actual label: ORG

word: Shrewsbury
predicted label: O
actual label: ORG

word: Millwall
predicted label: O
actual label: ORG

word: York
predicted label: LOC
actual label: ORG

word: Cardiff
predicted label: LOC
actual label: ORG

word: Fulham
predicted label: O
actual label: ORG

word: Doncaster
predicted label: O
actual label: ORG

word: Northampton
predicted label: LOC
actual label: ORG

word: Colchester
predicted label: LOC
actual label: ORG

word: Torquay
predicted label: O
actual label: ORG

word: of
predicted label: O
actual label: MISC

word: the
predicted label: O
actual label: MISC

word: Netherlands
predicted label: LOC
actual label: MISC

word: Almere
predicted label: O
actual label: LOC

word: van
predicted label: O
actual label: PER

word: Telekom
predicted label: O
actual label: ORG

word: Palmans
predicted label: O
actual label: ORG

word: .Giuseppe
predicted label: O
actual lab

In [68]:

test_features=feature_selection(0,1,1,1,0,test['text'],test['POS'],test['index'])
predictions,indicis=getMEMMPredictionsTest(hmm0505,test['text'],test['index'],test_features,memm)

### **MEMM Implementation**
---

In [112]:
measure.end()

In [113]:
usage=measure.result.pkg[0]

In [114]:
usage_joule=usage/1000000

domestic heater or air-conditioner. The peak solar energy falling on 1m² of the earth. 

When doing this assignment we easily ran this over 100 times, meaning the usage was more along the lines of

In [115]:
usage_total_joule=usage_joule*100*102

In [116]:
usage_total_joule/1000000

128.5122476226
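
The unit conversions in the cells above can be collected into one helper. This is a minimal sketch that assumes the raw reading is in microjoules (which is what dividing by 1,000,000 to obtain joules implies); `total_energy` and its parameter names are hypothetical, and the input below is a made-up reading, not a measurement from this notebook.

```python
MICROJOULES_PER_JOULE = 1_000_000
JOULES_PER_MEGAJOULE = 1_000_000
JOULES_PER_KWH = 3_600_000

def total_energy(usage_microjoule, runs):
    """Convert one run's reading (assumed microjoules) into totals over `runs` runs."""
    joule_per_run = usage_microjoule / MICROJOULES_PER_JOULE
    total_joule = joule_per_run * runs
    return {
        "joule_per_run": joule_per_run,
        "total_megajoule": total_joule / JOULES_PER_MEGAJOULE,
        "total_kwh": total_joule / JOULES_PER_KWH,
    }

# Made-up reading of 2,000,000 microjoules, repeated 100*102 times as in the cell above
energy = total_energy(2_000_000, 100 * 102)
print(energy)
```

Expressing the total in kWh as well makes it easier to compare against household appliance figures.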