# Lesson notebook 9 - Entities and Linking



### NER and Unsupervised Relation Extraction with SpaCy

We'll use SpaCy again, a pretrained open source language processing pipeline.  It provides a platform for processing text in a number of ways without having to perform any fine-tuning.

We'll use it to demonstrate SpaCy's NER capabilities out of the box.  Take a look at the entities it finds.  How well do you think it performs?

Then we'll use the dependency parsing capability to extract SVO triples from a set of sentences.  Again, look at how well the extraction works.

You should experiment by adding multiple sentences into the variable nnp_doc and see how well SpaCy does with your alternative sentences.


<a id = 'returnToTop'></a>

## Notebook Contents

  * 1. [Setup](#spacySetup)
  * 2. [Spacy Language Model Selection](#spacyPipeline)
  * 3. [Named Entity Recognition](#spacyNER)
  * 4. [Dependency Parsing for Information Extraction](#spacyDep)
      * 4.1. [SVO Triple Extraction Example](#spacySVO)
  * 5. [Entity and Relation Extraction Example](#mistralRE)
  * 6. [Answers](#answers)      









[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-spring-main/blob/master/materials/lesson_notebooks/lesson_9_Entities_and_Linking.ipynb)

[Return to Top](#returnToTop)  
<a id = 'spacySetup'></a>

## 1. Setup  


Let's set up our environment to run the current version of [SpaCy](https://spacy.io) and feed it a sequence of text to see what it can do.  

SpaCy is an open source industrial strength NLP engine that can perform multiple functions out of the box. It strikes a good balance between speed of processing and accuracy of predictions.  It comes with a number of different language models trained on the [OntoNotes5](https://catalog.ldc.upenn.edu/LDC2013T19) data set.  This means that it is already trained to do part of speech tagging and dependency parsing.  It can also be trained to do classification and a number of other tasks in the standard NLP stack.  It is very fast.  It can be a handy way of analyzing some text for exploratory data analysis. Another use is annotating some text to then create a labelled training set that you use to train up your own model independent of spaCy.

spaCy uses a combination of techniques including embeddings and convolutional neural nets to generate the output we see.


In [1]:
!pip install -q spacy

In [2]:
!pip install -q spacy-lookups-data

In [3]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.7.5
2.2.2


[Return to Top](#returnToTop)  
<a id = 'spacyPipeline'></a>

## 2. Pre-trained Language Models for SpaCy

SpaCy has also been pre-trained on multiple languages.  When using it you need to select and load a specific language model.

Make sure you first download a language model then load it into SpaCy. We're selecting English via the large model which gives us access to embeddings.  There are many other options and other languages.

Downloading the large model can take a couple of minutes on a slow network connection.

In [4]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:
#load an english model -- the large model includes word embeddings
nlp = spacy.load("en_core_web_lg")

[Return to Top](#returnToTop)  
<a id = 'spacyNER'></a>

## 3. Named Entity Recognition

SpaCy is also trained to do some basic NER out of the box. Because it was trained on OntoNotes5, it annotates text with the OntoNotes entity tag set. It identifies things like persons (PERSON), organizations (ORG), facilities (FAC), dates (DATE) and others. If those tags don't work for you, then you can train spaCy to identify different entities or use a different tag set.

You can modify the nnp_doc variable below if you want to experiment with your own set of sentences to see how they work with the existing tagset.


In [6]:
#We'll use nnp_doc to demonstrate SpaCy's information extraction capability
nnp_doc = nlp('On the afternoon of November 19, 1863, Lincoln went to Gettysburg. He gave his famous speech there. The Gettysburg Address began with four score and seven years ago.')
#nnp_doc = nlp('The School of Information is located on the Berkeley campus of the University of California. The iSchool offers a variety of Masters degrees. Berkeley is adjacent to Oakland, Albany, and El Cerrito.')

#NER example
for ent in nnp_doc.ents:
    print(ent.text, ent.start_char, ent.end_char,
        ent.label_, spacy.explain(ent.label_))

the afternoon 3 16 TIME Times smaller than a day
Lincoln 39 46 ORG Companies, agencies, institutions, etc.
Gettysburg 55 65 GPE Countries, cities, states
The Gettysburg Address 100 122 ORG Companies, agencies, institutions, etc.
four 134 138 CARDINAL Numerals that do not fall under another type
seven years ago 149 164 DATE Absolute or relative dates or periods


Note that the NER component makes some errors.  It tags both Lincoln and The Gettysburg Address as organizations (ORG).  This is clearly wrong, but Lincoln on its own is also ambiguous: it can refer to a large city in Nebraska or a make of car. What happens if you add Abraham before Lincoln? That should help to disambiguate the reference.  You could also try to fix the issue by fine-tuning the SpaCy model to teach it to prefer Lincoln as a person rather than an organization.
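To put a number on "how well does it perform", the predicted entities can be scored against a hand annotation. The sketch below copies the predicted spans from the output above; the `gold` set is a hypothetical annotation written for this example (including the choice of WORK_OF_ART for the speech), not an official reference.

```python
# Entities predicted by spaCy, copied from the output above: (start_char, end_char, label)
predicted = {
    (3, 16, "TIME"),         # the afternoon
    (39, 46, "ORG"),         # Lincoln
    (55, 65, "GPE"),         # Gettysburg
    (100, 122, "ORG"),       # The Gettysburg Address
    (134, 138, "CARDINAL"),  # four
    (149, 164, "DATE"),      # seven years ago
}

# Hypothetical hand annotation (an assumption made for this sketch)
gold = {
    (20, 37, "DATE"),           # November 19, 1863
    (39, 46, "PERSON"),         # Lincoln
    (55, 65, "GPE"),            # Gettysburg
    (100, 122, "WORK_OF_ART"),  # The Gettysburg Address
}


def precision_recall(pred, ref):
    """Return (precision, recall) for two sets of hashable items."""
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Strict scoring: span boundaries and label must both match
strict_p, strict_r = precision_recall(predicted, gold)

# Relaxed scoring: only the span boundaries must match
pred_spans = {(start, end) for start, end, _ in predicted}
gold_spans = {(start, end) for start, end, _ in gold}
span_p, span_r = precision_recall(pred_spans, gold_spans)

print(f"strict:     precision={strict_p:.2f} recall={strict_r:.2f}")
print(f"spans only: precision={span_p:.2f} recall={span_r:.2f}")
```

Under this annotation the model finds most of the right spans but gets few labels right, which suggests the boundaries are easier for it than the types on this passage.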

[Return to Top](#returnToTop)  
<a id = 'spacyDep'></a>


## 4. Dependency Parsing for Information Extraction

As we saw last week, SpaCy performs dependency parsing right out of the box. This can be a very handy way of identifying words and the relations between them. Sometimes those relations fundamentally change the meaning of the word as in the case of negation.


In [7]:
#dependency parsing
w266_text = 'Students are learning Natural Language Processing with transformers in the W266 class.'
w266_doc = nlp(w266_text)
for token in w266_doc:
    print(token.text, token.tag_, token.head.text, token.dep_)

Students NNS learning nsubj
are VBP learning aux
learning VBG learning ROOT
Natural NNP Language compound
Language NNP Processing compound
Processing NNP learning dobj
with IN learning prep
transformers NNS with pobj
in IN transformers prep
the DT class det
W266 NNS class compound
class NN in pobj
. . learning punct


Let's capture it in a pandas data frame so it is easier to read and manipulate.  Once it's in a dataframe you can then perform additional operations like counting and filtering and searching on the content.

In [8]:
df = pd.DataFrame()
df['text'] = [token.text for token in w266_doc]
df['lemma'] = [token.lemma_ for token in w266_doc]
df['is_punctuation'] = [token.is_punct for token in w266_doc]
df['is_space'] = [token.is_space for token in w266_doc]
df['shape'] = [token.shape_ for token in w266_doc]
df['part_of_speech'] = [token.pos_ for token in w266_doc]
df['pos_tag'] = [token.tag_ for token in w266_doc]
df['head'] = [token.head.text for token in w266_doc]
df['dep'] = [token.dep_ for token in w266_doc]

df

|   | text | lemma | is_punctuation | is_space | shape | part_of_speech | pos_tag | head | dep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Students | student | False | False | Xxxxx | NOUN | NNS | learning | nsubj |
| 1 | are | be | False | False | xxx | AUX | VBP | learning | aux |
| 2 | learning | learn | False | False | xxxx | VERB | VBG | learning | ROOT |
| 3 | Natural | Natural | False | False | Xxxxx | PROPN | NNP | Language | compound |
| 4 | Language | Language | False | False | Xxxxx | PROPN | NNP | Processing | compound |
| 5 | Processing | Processing | False | False | Xxxxx | PROPN | NNP | learning | dobj |
| 6 | with | with | False | False | xxxx | ADP | IN | learning | prep |
| 7 | transformers | transformer | False | False | xxxx | NOUN | NNS | with | pobj |
| 8 | in | in | False | False | xx | ADP | IN | transformers | prep |
| 9 | the | the | False | False | xxx | DET | DT | class | det |




[Return to Top](#returnToTop)  
<a id = 'spacySVO'></a>

### 4.1 SVO Triple Extraction example

You can leverage the dependency graph to identify subject-verb-object triples. These can be used to populate a knowledge graph or to extract "facts" from text.

We need to identify the dependency arc labels that we want to associate with a subject relationship and the labels we want to associate with an object relationship.


In [9]:
#SVO extraction

# specify object and subject constants
OBJECT_DEPS = {"dobj", "dative", "attr", "oprd", "pobj"}
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "agent", "expl"}

# extract the subject, object and verb from the input
def extract_triples(doc):
    sub = []
    at = []
    ve = []
    for token in doc:
        # is this a verb?
        if token.pos_ == "VERB":
            ve.append(token.text)
            #print("append to VERB: "+token.text)
        # is this the object?
        if token.dep_ in OBJECT_DEPS or token.head.dep_ in OBJECT_DEPS:
            at.append(token.text)
            #print("append to OBJ: " + token.text)
        # is this the subject?
        if token.dep_ in SUBJECT_DEPS or token.head.dep_ in SUBJECT_DEPS:
            sub.append(token.text)
            #print("append to SUBJ: " + token.text)
    return " ".join(sub).strip().lower(), " ".join(ve).strip().lower(), " ".join(at).strip().lower()


# print out the pos tags and dependency relation labels
for token in nnp_doc:
    print("Token {} POS: {}, dep: {}".format(token.text, token.pos_, token.dep_))

Token On POS: ADP, dep: prep
Token the POS: DET, dep: det
Token afternoon POS: NOUN, dep: pobj
Token of POS: ADP, dep: prep
Token November POS: PROPN, dep: pobj
Token 19 POS: NUM, dep: nummod
Token , POS: PUNCT, dep: punct
Token 1863 POS: NUM, dep: nummod
Token , POS: PUNCT, dep: punct
Token Lincoln POS: PROPN, dep: nsubj
Token went POS: VERB, dep: ROOT
Token to POS: ADP, dep: prep
Token Gettysburg POS: PROPN, dep: pobj
Token . POS: PUNCT, dep: punct
Token He POS: PRON, dep: nsubj
Token gave POS: VERB, dep: ROOT
Token his POS: PRON, dep: poss
Token famous POS: ADJ, dep: amod
Token speech POS: NOUN, dep: dobj
Token there POS: ADV, dep: advmod
Token . POS: PUNCT, dep: punct
Token The POS: DET, dep: det
Token Gettysburg POS: PROPN, dep: compound
Token Address POS: PROPN, dep: nsubj
Token began POS: VERB, dep: ROOT
Token with POS: ADP, dep: prep
Token four POS: NUM, dep: nummod
Token score POS: NOUN, dep: pobj
Token and POS: CCONJ, dep: cc
Token seven POS: NUM, dep: nummod
Token years POS:

We'll process our document again, first dividing it into sentences, and then looping through those sentences to extract SVO triples from each sentence.

In [10]:
# Let's take our nnp_doc again and split it into sentences.  We'll then operate on each sentence separately.
sentences = list(nnp_doc.sents)
sents = [str(sentence) for sentence in sentences]
print("First three sentences:")
sents[0:3]

First three sentences:


['On the afternoon of November 19, 1863, Lincoln went to Gettysburg.',
 'He gave his famous speech there.',
 'The Gettysburg Address began with four score and seven years ago.']

Now, let's iterate through the sentences, extract the SVO triples, and display them in a pandas dataframe.

In [11]:
# Create empty lists to store all subject, verbs, and objects
subjects = []
verbs = []
objects = []


# Grab the SVOs from each parsed input sentence
for sent in sents:
    doc = nlp(sent)
    s,v,o = extract_triples(doc)
    subjects.append(s)
    verbs.append(v)
    objects.append(o)


# store them in a df
svo_df = pd.DataFrame({'subj':subjects, 'verb':verbs, 'obj':objects})
svo_df.sample(3)

|   | subj | verb | obj |
|---|---|---|---|
| 1 | he | gave | his famous speech |
| 2 | the gettysburg address | began | four score and |
| 0 | lincoln | went | the afternoon of november 19 , 1863 gettysburg |


How well does the system perform on grabbing the subjects, verbs, and objects?  Where is it failing? How might we improve the model's performance?
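One possible direction, sketched under assumptions: rather than collecting every token whose label looks like a subject or object, follow the dependency arcs and keep only the subject and object attached directly to the same head. The example below works on a hand-transcribed copy of the W266 parse from section 4, so it runs without a model. The `Tok` class, the `PHRASE_DEPS` set, and the choice to drop `pobj` from the object labels are illustrative, not part of the original extractor.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    dep: str
    head: int  # index of the head token; the root points to itself


# Parse of the W266 sentence, transcribed from the dependency output in section 4
words = ["Students", "are", "learning", "Natural", "Language", "Processing",
         "with", "transformers", "in", "the", "W266", "class", "."]
deps = ["nsubj", "aux", "ROOT", "compound", "compound", "dobj",
        "prep", "pobj", "prep", "det", "compound", "pobj", "punct"]
heads = [2, 2, 2, 4, 5, 2, 2, 6, 7, 11, 11, 8, 2]
tokens = [Tok(w, d, h) for w, d, h in zip(words, deps, heads)]

SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj"}
OBJECT_DEPS = {"dobj", "dative", "attr", "oprd"}  # pobj left out on purpose
PHRASE_DEPS = {"compound", "amod", "det", "poss", "nummod"}  # modifiers kept in a phrase


def children(tokens, i):
    """Indices of the tokens whose head is token i."""
    return [j for j, tok in enumerate(tokens) if tok.head == i and j != i]


def phrase(tokens, i):
    """Join token i with its recursively collected modifiers, in sentence order."""
    keep, stack = {i}, [i]
    while stack:
        current = stack.pop()
        for j in children(tokens, current):
            if tokens[j].dep in PHRASE_DEPS:
                keep.add(j)
                stack.append(j)
    return " ".join(tokens[j].text for j in sorted(keep))


def extract_triples_by_head(tokens):
    """Emit one triple for each subject/object pair that shares a head token."""
    triples = []
    for i, tok in enumerate(tokens):
        kids = children(tokens, i)
        subjects = [j for j in kids if tokens[j].dep in SUBJECT_DEPS]
        objects = [j for j in kids if tokens[j].dep in OBJECT_DEPS]
        for s in subjects:
            for o in objects:
                triples.append((phrase(tokens, s), tok.text, phrase(tokens, o)))
    return triples


print(extract_triples_by_head(tokens))
```

Tying the subject and object to a shared head keeps prepositional phrases such as "on the afternoon of November 19, 1863" out of the object slot, at the cost of missing objects that are only reachable through a preposition ("went to Gettysburg").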

We can use these extracted triples to populate a database or to help generate a knowledge graph.
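As a minimal sketch of that idea, the triples from the table above can be loaded into a simple adjacency structure, with subjects as nodes and (verb, object) pairs as outgoing edges. The `build_graph` helper is hypothetical; a real knowledge graph would also need entity normalization and coreference resolution (for example, linking "he" to "lincoln").

```python
from collections import defaultdict

# Triples copied from the svo_df output above
triples = [
    ("lincoln", "went", "the afternoon of november 19 , 1863 gettysburg"),
    ("he", "gave", "his famous speech"),
    ("the gettysburg address", "began", "four score and"),
]


def build_graph(triples):
    """Map each subject to the list of (verb, object) edges that leave it."""
    graph = defaultdict(list)
    for subj, verb, obj in triples:
        if subj and verb and obj:  # skip incomplete triples
            graph[subj].append((verb, obj))
    return dict(graph)


graph = build_graph(triples)
for subj, edges in graph.items():
    for verb, obj in edges:
        print(f"({subj}) -[{verb}]-> ({obj})")
```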

That's it for this demo.  Again you are encouraged to experiment with this notebook to build intuition about how systems perform on these tasks.

[Return to Top](#returnToTop)  
<a id = 'mistralRE'></a>

## 5. Entity and Relation Extraction Example with Mistral 7B

[Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) is a small but highly performant model that can also be used commercially. It has been instruction fine-tuned by Mistral AI, so it should be able to follow our prompts and return good, on-point output.  We'll also use a quantized version (down to 4 bits, thanks to Hugging Face) so that it can load on our small GPU.  

First let's load the libraries necessary for it to work.

In [1]:
%%capture

#!pip uninstall -y transformers
#!pip install git+https://github.com/huggingface/transformers
!pip install -q -U transformers

In [2]:
!pip install -q accelerate
!pip install -q bitsandbytes

Now, let's specify the model we want.  There is no port to TensorFlow yet so we'll use the PyTorch version.  Since we are using the Hugging Face AutoModel classes and only running inference, we don't need to write a lot of PyTorch code.

In [3]:
import torch
import pprint

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

In 32-bit floating point, the 7 billion parameter model needs close to 30GB of GPU memory for its weights alone (about 4 bytes per parameter), so it might only fit on an A100 GPU on Google Colab Pro (if one is available).  To shrink the memory footprint of the model we can use quantization.  Specifically, we'll shrink the 32-bit floating point number representations down to 4 bits. The BitsAndBytesConfig object below is where we specify our quantization arguments.  You can read about it [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [4]:
from transformers import BitsAndBytesConfig


nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
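As a rough check on why quantization matters here, the memory needed for the weights alone can be estimated from the parameter count and the bits per parameter. The parameter count below is an approximation assumed for this estimate. Real usage is higher, since activations, the generation cache, and quantization constants also take memory.

```python
# Approximate parameter count for Mistral 7B (an assumption for this estimate)
N_PARAMS = 7.24e9


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```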


This model has been trained to work with dialog: each training instance contains multiple utterance and response pairs that build up the context the model replies to. We'll populate the context with only our prompt and not have any back and forth.
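To make the dialog format concrete, here is a minimal sketch of how a list of messages is flattened into a single string with `[INST]` markers, mirroring the layout visible in the decoded outputs later in this section. The `format_mistral_prompt` function is a hypothetical illustration; in the notebook the tokenizer's own chat template does this work, and it decides the exact tokens and whitespace.

```python
def format_mistral_prompt(messages):
    """Approximate the [INST] ... [/INST] layout used for Mistral instruct prompts."""
    parts = ["<s>"]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f"{message['content']}</s>")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


# A single user turn, which is all this notebook uses
print(format_mistral_prompt([{"role": "user", "content": "Who spoke at Gettysburg?"}]))
```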

First we'll ask the model to identify entities.

In [5]:
myprompt = (
    "Given the text `On the afternoon of November 19, 1863, Lincoln went to Gettysburg. He gave his famous speech there. "
    " The Gettysburg Address began with four score and seven years ago.` Please identify the people, places,"
    " and dates in the sentences.  Please output a JSON formatted record with one field for pers:, one for locs:, and date:."
)

In [7]:
#Note: It can take up to 3 minutes to download this model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reuse `device` from the earlier cell ("cuda:0" when a GPU is available)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": myprompt}
]

# Apply the model's chat template, tokenize, and move the token ids to the device
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
encodeds = encodeds.to(device)


generated_ids = model.generate(encodeds, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint.pprint(decoded[0], compact=True)


`low_cpu_mem_usage` was None, now default to True since model is quantized.



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


('<s> [INST] Given the text `On the afternoon of November 19, 1863, Lincoln '
 'went to Gettysburg. He gave his famous speech there.  The Gettysburg Address '
 'began with four score and seven years ago.` Please identify the people, '
 'places, and dates in the sentences.  Please output a JSON formatted record '
 'with one field for pers:, one for locs:, and date:. [/INST] {\n'
 ' "date": "November 19, 1863",\n'
 ' "pers": [{"name": "Lincoln", "role": "subject"}],\n'
 ' "locs": [{"name": "Gettysburg", "role": "location"}]\n'
 '}</s>')
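The decoded string contains the prompt, the special tokens, and the JSON reply all together, so the record has to be cut out before it can be used. A minimal sketch, assuming the reply follows the last `[/INST]` marker; the `extract_json_reply` helper is hypothetical, and the prompt text is shortened to `...` here for brevity.

```python
import json

# Decoded output copied from the cell above (prompt shortened to "...")
decoded_text = (
    '<s> [INST] ... [/INST] {\n'
    ' "date": "November 19, 1863",\n'
    ' "pers": [{"name": "Lincoln", "role": "subject"}],\n'
    ' "locs": [{"name": "Gettysburg", "role": "location"}]\n'
    '}</s>'
)


def extract_json_reply(text, end_of_prompt="[/INST]", end_of_sequence="</s>"):
    """Return the parsed JSON object that follows the last end-of-prompt marker, or None."""
    reply = text.rsplit(end_of_prompt, 1)[-1]
    reply = reply.replace(end_of_sequence, "").strip()
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None  # the model is not guaranteed to return valid JSON


record = extract_json_reply(decoded_text)
print(record["pers"][0]["name"], "|", record["locs"][0]["name"], "|", record["date"])
```

Because the cell samples its output (`do_sample=True`), the reply can differ between runs, so checking for `None` before using the record is a sensible precaution.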


Let's try a better, more complex prompt that extracts entities.  The prompt is based on work found in [this paper](https://arxiv.org/pdf/2305.15444.pdf) by Ashok and Lipton on prompting for NER.

In [9]:
myprompt = (
        "Definition: An entity is a person (person), university (university), scientist(scientist), event(event), award(award) or theory(theory). "
        "Abstract scientific concepts can be entities if they have a name associated with them. If an entity does not fit the the types above it is (misc). "
        "Dates, times, adjectives and verbs are not entities."
        "Given the paragraph below, identify a list of possible entities and for each entry explain why it either is or is not an entity:"
        "Paragraph: `On the afternoon of November 19, 1863, Lincoln went to Gettysburg. He gave his famous speech there. "
        " The Gettysburg Address began with four score and seven years ago.` "
)

messages = [
    {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
encodeds = encodeds.to(device)


generated_ids = model.generate(encodeds, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint.pprint(decoded[0], compact=True)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


('<s> [INST] Definition: An entity is a person (person), university '
 '(university), scientist(scientist), event(event), award(award) or '
 'theory(theory). Abstract scientific concepts can be entities if they have a '
 'name associated with them. If an entity does not fit the the types above it '
 'is (misc). Dates, times, adjectives and verbs are not entities.Given the '
 'paragraph below, identify a list of possible entities and for each entry '
 'explain why it either is or is not an entity:Paragraph: `On the afternoon of '
 'November 19, 1863, Lincoln went to Gettysburg. He gave his famous speech '
 'there.  The Gettysburg Address began with four score and seven years ago.`  '
 '[/INST] Entities:\n'
 '\n'
 '1. Lincoln: Lincoln is a person and hence he is an entity.\n'
 '2. November 19, 1863: This is a date and is not an entity as per the '
 'definition provided.\n'
 '3. Gettysburg: Gettysburg is a location and therefore it is an entity.\n'
 '4. The Gettysburg Address: The Gettysb

Let's see if we can get the LLM to do relation extraction in addition to the entity detection as seen in [this paper](https://aclanthology.org/2023.acl-long.868.pdf) by Wadhwa, Amir, and Wallace.  We'll modify our prompt to define entities and relations and then ask the LLM to identify them both along with providing an explanation.  How well do you think it does?  What might we do to improve accuracy?

In [10]:
myprompt = (
        "Definition: An entity is a person (person), location (location), organization (organization), event(event), award(award) or theory(theory). "
        "Abstract scientific concepts can be entities if they have a name associated with them. If an entity does not fit the the types above it is (misc). "
        "Dates, times, adjectives and verbs are not entities."
        "List the entities of the types [LOCATION, ORGANIZATION, PERSON] and relations of types [Organization Based In, Work For, Located In, "
        "Live In, Killed By, Travel To] among the entities in the given text. "
        "Given the paragraph below, identify a list of possible entities and for each entry explain why it either is or is not an entity: "
        "Paragraph: `on the evening of April 14th, 1865, Abraham Lincoln went to The Ford Theater in Washington D.C., where he was assasinated by John Wilkes Booth. "
        "Wilkes, a resident of Maryland, was a known Confederate sympathizer. "
)

messages = [
    {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
encodeds = encodeds.to(device)


generated_ids = model.generate(encodeds, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint.pprint(decoded[0], compact=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


('<s> [INST] Definition: An entity is a person (person), location (location), '
 'organization (organization), event(event), award(award) or theory(theory). '
 'Abstract scientific concepts can be entities if they have a name associated '
 'with them. If an entity does not fit the the types above it is (misc). '
 'Dates, times, adjectives and verbs are not entities.List the entities of the '
 'types [LOCATION, ORGANIZATION, PERSON] and relations of types [Organization '
 'Based In, Work For, Located In, Live In, Killed By, Travel To] among the '
 'entities in the given text. Given the paragraph below, identify a list of '
 'possible entities and for each entry explain why it either is or is not an '
 'entity: Paragraph: `on the evening of April 14th, 1865, Abraham Lincoln went '
 'to The Ford Theater in Washington D.C., where he was assasinated by John '
 'Wilkes Booth. Wilkes, a resident of Maryland, was a known Confederate '
 'sympathizer.  [/INST] Entities:\n'
 '\n'
 '1. Abraham Lin
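The prompts in this section are assembled by hand from adjacent string fragments, which makes it easy to fuse two sentences together (note "entities.List" in the echoed prompt above) or to leave the paragraph's backtick unclosed. One way to reduce that risk is to build the prompt from its parts. The `build_extraction_prompt` helper and its wording are a hypothetical sketch, not the prompt used in the cited papers.

```python
def build_extraction_prompt(entity_types, relation_types, paragraph):
    """Assemble an entity and relation extraction prompt from its parts."""
    sentences = [
        "Definition: An entity is one of the following types: "
        + ", ".join(entity_types) + ".",
        "Dates, times, adjectives and verbs are not entities.",
        "List the entities and the relations of types ["
        + ", ".join(relation_types) + "] among the entities in the given text.",
        "For each candidate entity, explain why it either is or is not an entity.",
        f"Paragraph: `{paragraph}`",  # the paragraph is always closed with a backtick
    ]
    # Joining with a single space avoids fused sentences
    return " ".join(sentences)


prompt = build_extraction_prompt(
    ["person", "location", "organization"],
    ["Located In", "Live In", "Killed By", "Travel To"],
    "Abraham Lincoln went to The Ford Theater in Washington D.C.",
)
print(prompt)
```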



[Return to Top](#returnToTop)  
<a id = 'classExercise'></a>