<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Named Entity Recognition

We will use the spaCy library:
https://spacy.io/usage/spacy-101


<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Header with institutional logos" width="525"/>

| Instructor | Course | Academic Year |
| :--- | :---- | ---: |
| Matteo Lissandrini | Laboratorio Avanzato di Informatica Umanistica | 2024/2025 |

## Usual install and basic imports

In [1]:
%pip install wikipedia-api
%pip install spacy==3.7.0



In [2]:
%pip install pdfx



In [3]:
import spacy
from spacy import displacy
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
# More options here: https://spacy.io/models/en



In [4]:
import pdfx

## Download a PDF and put it in your Files folder
## You can also use the wget command or the requests library (see other notebooks) to download it directly

# Assume you have downloaded the file from: https://dl.acm.org/doi/pdf/10.1145/3546954
pdf = pdfx.PDFx("3546954.pdf")

pdf

<pdfx.PDFx at 0x7b163d9dee10>

In [5]:
text = pdf.get_text()
text

'DOI:10.1145/3546954 \n\nhttps://cacm.acm.org/blogs/blog-cacm\n\nChanging the Nature \nof AI Research\n\nSubbarao Kambhampati considers how artificial intelligence \nmay be straying from its roots.\n\nSubbarao \nKambhampati \nAI as (an Ersatz) \nNatural Science?\nhttps://bit.ly/3Rcf5NW\nJune 8, 2022\nIn  many  ways,  we  are  living  in  quite \na  wondrous  time  for  artificial  intel-\nligence  (AI),  with  every  week  bring-\ning  some  awe-inspiring  feat  in  yet \nanother  tacit  knowledge  (https://bit.\nly/3qYrAOY)  task  that  we  were  sure \nwould  be  out  of  reach  of  computers \nfor  quite  some  time  to  come.  Of  par-\nticular  recent  interest  are  the  large \nlearned  systems  based  on  trans-\nformer  architectures  that  are  trained \nwith  billions  of  parameters  over \nmassive  Web-scale  multimodal  cor-\npora.  Prominent  examples  include \nlarge  language  models  (https://bit.\nly/3iGdekA) like GPT3 and PALM that \nrespond  to  free-form  text  pr

In [6]:
doc = nlp(text)

In [7]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
df

Unnamed: 0,text,lemma,POS,explain,stopword
0,DOI:10.1145/3546954,DOI:10.1145/3546954,PROPN,proper noun,False
1,\n\n,\n\n,SPACE,space,False
2,https://cacm.acm.org/blogs/blog-cacm,https://cacm.acm.org/blogs/blog-cacm,PROPN,proper noun,False
3,\n\n,\n\n,SPACE,space,False
4,Changing,change,VERB,verb,False
...,...,...,...,...,...
2499,,,SPACE,space,False
2500,ACM,ACM,PROPN,proper noun,False
2501,,,SPACE,space,False
2502,9,9,NUM,numeral,False


In [8]:
import string
import requests
import numpy as np
import re
from collections import Counter

punct_regex = re.compile('[{}]'.format(re.escape(string.punctuation))) # Regex matching any punctuation
space_regex = re.compile(' +') # Regex matching runs of one or more spaces

In [9]:
for label in nlp.get_pipe('ner').labels:
    print(f"{label}: {spacy.explain(label)}")

CARDINAL: Numerals that do not fall under another type
DATE: Absolute or relative dates or periods
EVENT: Named hurricanes, battles, wars, sports events, etc.
FAC: Buildings, airports, highways, bridges, etc.
GPE: Countries, cities, states
LANGUAGE: Any named language
LAW: Named documents made into laws.
LOC: Non-GPE locations, mountain ranges, bodies of water
MONEY: Monetary values, including unit
NORP: Nationalities or religious or political groups
ORDINAL: "first", "second", etc.
ORG: Companies, agencies, institutions, etc.
PERCENT: Percentage, including "%"
PERSON: People, including fictional
PRODUCT: Objects, vehicles, foods, etc. (not services)
QUANTITY: Measurements, as of weight or distance
TIME: Times smaller than a day
WORK_OF_ART: Titles of books, songs, etc.


## Test the NER methods

Read more here:

-  https://spacy.io/usage/linguistic-features
-  https://spacy.io/usage/visualizers

In [10]:
doc = nlp("Paris Hilton has three children with Mark")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Paris Hilton 0 12 ORG
three 17 22 CARDINAL
Mark 37 41 PERSON


In [11]:
# Render the annotations in a readable, highlighted format
displacy.render(doc, style="ent", jupyter=True)

In [12]:
doc = nlp("The Hilton Paris  hotel welcomes this year more than 1640 guests")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Hilton 4 10 GPE
Paris 11 16 GPE
this year 33 42 DATE
more than 1640 43 57 CARDINAL


In [13]:
doc = nlp("Hilton Paris: Born in New York City, and raised there and in Los Angeles ")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Hilton 0 6 GPE
Paris 7 12 GPE
New York City 22 35 GPE
Los Angeles 61 72 GPE


In [14]:
# Longer document
doc = nlp("""
Citing high fuel prices, United Airlines said Friday it has increased fares by $6
per round trip on flights to some cities also served by lower-cost carriers.
American Airlines, a unit of AMR Corp., immediately matched the move,
spokesman Tim Wagner said.
""")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, style="ent", jupyter=True)

United Airlines 26 41 ORG
Friday 47 53 DATE
6 81 82 MONEY
American Airlines 160 177 ORG
AMR Corp. 189 198 ORG
Tim Wagner 240 250 PERSON


## Goal: Test how NER works on text from Alice in Wonderland and Aesop's Fables



In [15]:
def get_pages(book_text):
  """
  Given the full text of a book, return a list of cleaned, non-empty pages.
  """
  _pages = [ _page.strip() for _page in book_text.split("\n\r\n\r\n\r")] # pages are divided by multiple newlines
  _pages = [ space_regex.sub(' ', _page).strip() for _page in _pages ] # collapse repeated spaces
  _pages = [ space_regex.sub(' ', " ".join(_page.splitlines())) for _page in _pages ] # join each page into one line
  _pages = [ _page for _page in _pages if _page != '' ] # drop empty pages

  return _pages

In [16]:
# request the raw text of Alice in Wonderland
r = requests.get(r'https://ia801604.us.archive.org/6/items/alicesadventures19033gut/19033.txt')
alice = r.text

alice_pages = get_pages(alice)


r = requests.get(r'https://ia600906.us.archive.org/29/items/aesopsfablesanew11339gut/11339.txt')
fables = r.text

fables_pages = get_pages(fables)

In [17]:
test_page = alice_pages[16]
doc = nlp(test_page)
displacy.render(doc, style="ent", jupyter=True)

## Repeat the same for Aesop's Fables

- Pick a few pages, test the NER
- Does it find any entity at all? Does it miss some entities? What is going on?

In [None]:
test_page = fables_pages[16]
doc = nlp(test_page)
displacy.render(doc, style="ent", jupyter=True)

In [None]:
test_page = fables_pages[13]
doc = nlp(test_page)
displacy.render(doc, style="ent", jupyter=True)

In [None]:
test_page = fables_pages[2]
doc = nlp(test_page)
displacy.render(doc, style="ent", jupyter=True)

**If you are curious about `entity linking` you can see this tutorial:**

https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/notebooks/notebook_video.ipynb

## Compute Frequencies of Entities in pages
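
One possible starting point is sketched below. It assumes the entities of each page have already been collected as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in nlp(page).ents]` for every page in `alice_pages`. The function name `entity_frequencies` and the sample data are hypothetical and only serve the illustration.

```python
from collections import Counter

def entity_frequencies(pages_entities):
    """Count how often each (entity text, label) pair occurs across all pages."""
    counts = Counter()
    for page_entities in pages_entities:
        counts.update(page_entities)
    return counts

# Hypothetical sample: one list of (text, label) pairs per page
sample_pages_entities = [
    [("Alice", "PERSON"), ("Rabbit", "PERSON"), ("Alice", "PERSON")],
    [("Alice", "ORG"), ("Dinah", "PERSON")],
]

freqs = entity_frequencies(sample_pages_entities)
for (text, label), n in freqs.most_common(3):
    print(text, label, n)
```

Counting the pair rather than the text alone keeps conflicting labels for the same name visible, which is useful for the majority-voting exercise further down.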


## Build an inverted index for named entities
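
A minimal sketch of such an index follows, again assuming one list of `(text, label)` pairs per page. The name `build_entity_index`, the choice to lowercase entity text, and the sample data are assumptions made for this example, not part of the notebook.

```python
from collections import defaultdict

def build_entity_index(pages_entities):
    """Map each entity text (lowercased) to the sorted list of page numbers where it occurs."""
    index = defaultdict(set)
    for page_no, page_entities in enumerate(pages_entities):
        for text, _label in page_entities:
            index[text.lower()].add(page_no)
    return {entity: sorted(page_nos) for entity, page_nos in index.items()}

# Hypothetical sample: one list of (text, label) pairs per page
sample_pages_entities = [
    [("Alice", "PERSON"), ("Rabbit", "PERSON")],
    [("Alice", "ORG"), ("Dinah", "PERSON")],
    [("Rabbit", "PERSON")],
]

entity_index = build_entity_index(sample_pages_entities)
print(entity_index["alice"])   # pages mentioning "Alice", whatever label was assigned
```

Using a set per entity means a name mentioned several times on one page is still listed once for that page.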

## Compute via majority voting the most likely type for a given named entity

For example, should we say Weasel is a person or an org?
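
One way to approach this is sketched below: collect every label the model assigned to a given entity text and keep the most frequent one. The function name `majority_label` and the sample mentions are hypothetical. In the notebook the mentions could come from `(ent.text, ent.label_)` pairs gathered over all pages.

```python
from collections import Counter, defaultdict

def majority_label(mentions):
    """Return, for each entity text, the label it received most often."""
    votes = defaultdict(Counter)
    for text, label in mentions:
        votes[text][label] += 1
    # most_common(1) yields the (label, count) pair with the highest count
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}

# Hypothetical mentions collected from several pages
sample_mentions = [
    ("Weasel", "PERSON"),
    ("Weasel", "ORG"),
    ("Weasel", "PERSON"),
    ("Fox", "ORG"),
]

entity_types = majority_label(sample_mentions)
print(entity_types)
```

With a tie, `Counter.most_common` keeps the label that was seen first, so ties may deserve a manual check.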

## Extract named entities from a Wikipedia page
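
A possible sketch using the `wikipedia-api` package installed at the top of the notebook is shown below. The helper names `fetch_wikipedia_text` and `extract_entities`, the page title `"Aesop"`, and the user-agent string are assumptions for illustration. Recent versions of the package ask for a descriptive user agent, so replace the placeholder with your own.

```python
def fetch_wikipedia_text(title, language="en"):
    """Return the plain text of a Wikipedia page, or None if the page does not exist."""
    import wikipediaapi  # imported here so the rest of the cell works without network access

    wiki = wikipediaapi.Wikipedia(
        user_agent="ADHLab-notebook (example@example.org)",  # placeholder, use your own contact
        language=language,
    )
    page = wiki.page(title)
    return page.text if page.exists() else None


def extract_entities(text, nlp):
    """Run a spaCy pipeline on the text and return (entity text, label) pairs."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


# Example usage (needs network access and the `nlp` pipeline loaded earlier):
# text = fetch_wikipedia_text("Aesop")
# if text is not None:
#     print(extract_entities(text, nlp)[:20])
```

The returned pairs have the same shape as those produced for the book pages, so the same frequency and voting steps can be applied to them.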