<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Named Entity Recognition

We will use the spaCy library:
https://spacy.io/usage/spacy-101


<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Header with institutional logos" width="525"/>

| Instructor      | Course | Academic Year    |
| :---        |    :----   |          ---: |
| Matteo Lissandrini      | Laboratorio Avanzato di Informatica Umanistica       | 2024/2025   |

## Usual install and basic imports

In [1]:
%pip install wikipedia-api
%pip install spacy==3.7.0

Collecting wikipedia-api
  Downloading wikipedia_api-0.7.1.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.7.1-py3-none-any.whl size=14346 sha256=ab784fc2645bda3c22ae3139463a6ab9563d0278b8011e63a7ab337e7b1bebcc
  Stored in directory: /root/.cache/pip/wheels/4c/96/18/b9201cc3e8b47b02b510460210cfd832ccf10c0c4dd0522962
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.7.1
Collecting spacy==3.7.0
  Downloading spacy-3.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting weasel<0.4.0,>=0.1.0 (from spacy==3.7.0)
  Downloading weasel-0.3.4-py3-none-any.whl.metadata (4.7 kB)
Collecting typer<0.10.0,>=0.3.0 (from spacy==3.7.0)
  Downloading typer-0.9.4-py3-none-any.whl.metadata (14 kB)

In [2]:
import string
import requests
import numpy as np
import regex as re
from collections import Counter

punct_regex = re.compile('[{}]'.format(re.escape(string.punctuation))) # Regex matching any punctuation
space_regex = re.compile(' +') # Regex matching runs of one or more spaces

In [3]:
import spacy
from spacy import displacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
# More options here: https://spacy.io/models/en



In [4]:
for label in nlp.get_pipe('ner').labels:
    print(f"{label}: {spacy.explain(label)}")

CARDINAL: Numerals that do not fall under another type
DATE: Absolute or relative dates or periods
EVENT: Named hurricanes, battles, wars, sports events, etc.
FAC: Buildings, airports, highways, bridges, etc.
GPE: Countries, cities, states
LANGUAGE: Any named language
LAW: Named documents made into laws.
LOC: Non-GPE locations, mountain ranges, bodies of water
MONEY: Monetary values, including unit
NORP: Nationalities or religious or political groups
ORDINAL: "first", "second", etc.
ORG: Companies, agencies, institutions, etc.
PERCENT: Percentage, including "%"
PERSON: People, including fictional
PRODUCT: Objects, vehicles, foods, etc. (not services)
QUANTITY: Measurements, as of weight or distance
TIME: Times smaller than a day
WORK_OF_ART: Titles of books, songs, etc.


## Test the NER methods

Read more here:

-  https://spacy.io/usage/linguistic-features
-  https://spacy.io/usage/visualizers

In [5]:
doc = nlp("Paris Hilton welcomes 2nd child with Carter Reum")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Paris Hilton 0 12 PERSON
2nd 22 25 ORDINAL
Carter Reum 37 48 PERSON


In [6]:
# Render the annotations in a readable format
displacy.render(doc, style="ent", jupyter=True)

In [7]:
doc = nlp("The Hilton Paris  hotel welcomes this year more than 1640 guests")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Hilton Paris 4 16 GPE
this year 33 42 DATE
more than 1640 43 57 CARDINAL


In [8]:
doc = nlp("Hilton Paris: Born in New York City, and raised there and in Los Angeles ")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Hilton Paris 0 12 PERSON
New York City 22 35 GPE
Los Angeles 61 72 GPE


In [9]:
# Longer document
doc = nlp("""
Citing high fuel prices, United Airlines said Friday it has increased fares by $6
per round trip on flights to some cities also served by lower-cost carriers.
American Airlines, a unit of AMR Corp., immediately matched the move,
spokesman Tim Wagner said.
""")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, style="ent", jupyter=True)

United Airlines 26 41 ORG
Friday 47 53 DATE
6 81 82 MONEY
American Airlines 160 177 ORG
AMR Corp. 189 198 ORG
Tim Wagner 240 250 PERSON


## Goal: Test how NER works on text from Alice in Wonderland and Aesop's Fables



In [10]:
def get_pages(book_text):
  """
  Given the full text of a book, return its pages as a list of strings.
  """
  # Pages are divided by multiple newlines (this separator assumes \r\n line endings)
  _pages = [ _page.strip() for _page in book_text.split("\n\r\n\r\n\r") ]
  # Join the lines of each page and collapse runs of spaces into a single space
  _pages = [ space_regex.sub(' ', " ".join(_page.splitlines())).strip() for _page in _pages ]
  # Drop empty pages (the original compared the whole list, so nothing was filtered)
  _pages = [ _page for _page in _pages if _page != '' ]

  return _pages
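
The separator used in `get_pages` contains `\r`, so it can only match text with `\r\n` line endings. As a minimal sketch of a more tolerant variant, the hypothetical helper `split_pages` below (not part of the original notebook) treats any run of four or more line breaks as a page break. That threshold is an assumption about how page breaks appear in these files, and the helper also collapses every kind of whitespace rather than spaces only.

```python
import re

# Hypothetical alternative to get_pages: a page break is assumed to be
# a run of at least four line breaks, written either as LF or as CRLF.
PAGE_BREAK = re.compile(r"(?:\r?\n){4,}")


def split_pages(book_text):
    """Return the non-empty pages of book_text, one string per page."""
    pages = []
    for raw_page in PAGE_BREAK.split(book_text):
        # str.split() without arguments splits on any whitespace run, so
        # re-joining with one space flattens line breaks and extra spaces.
        page = " ".join(raw_page.split())
        if page:
            pages.append(page)
    return pages


sample = "First page,\r\nsecond line.\r\n\r\n\r\n\r\nSecond   page.\r\n"
print(split_pages(sample))  # ['First page, second line.', 'Second page.']
```

Page indices produced by this variant may differ from those returned by `get_pages`, so a selection such as `alice_pages[16]` would need to be checked again.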

In [11]:
# Request the raw text of Alice in Wonderland
r = requests.get(r'https://ia801604.us.archive.org/6/items/alicesadventures19033gut/19033.txt')
alice = r.text

alice_pages = get_pages(alice)


# Request the raw text of Aesop's Fables
r = requests.get(r'https://ia600906.us.archive.org/29/items/aesopsfablesanew11339gut/11339.txt')
fables = r.text

fables_pages = get_pages(fables)

In [12]:
test_page = alice_pages[16]
doc = nlp(test_page)
displacy.render(doc, style="ent", jupyter=True)

## Repeat the same for Aesop's Fables

- Pick a few pages and test the NER
- Does it find any entities at all? Does it miss some entities? What is going on?
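
One possible way to approach the exercise is sketched below. The helper `ner_report` is hypothetical (it is not defined elsewhere in the notebook): it runs a pipeline on a handful of pages and collects the entities found on each, so that pages with no entities are easy to spot.

```python
def ner_report(nlp, pages, page_ids):
    """Return {page_id: [(entity text, label), ...]} for the selected pages.

    `nlp` is any callable that turns a string into an object exposing
    `.ents`, such as the spaCy pipeline loaded earlier in the notebook.
    """
    report = {}
    for page_id in page_ids:
        doc = nlp(pages[page_id])
        report[page_id] = [(ent.text, ent.label_) for ent in doc.ents]
    return report


def print_report(report):
    """Print one line per page with the number of entities and their labels."""
    for page_id, entities in report.items():
        print(f"page {page_id}: {len(entities)} entities -> {entities}")
```

With the objects defined in the cells above, a call could look like `print_report(ner_report(nlp, fables_pages, [10, 20, 30]))`, where the page numbers are arbitrary picks. Rendering the same pages with `displacy.render` helps to see which animal names the model labels and which it misses.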

**If you are curious about `entity linking` you can see this tutorial:**

https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/notebooks/notebook_video.ipynb

## Compute frequencies of entities in pages
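
A minimal sketch of this step, assuming that an entity is identified by its surface text together with its label. The function name `entity_frequencies` is hypothetical, and the counting relies only on `collections.Counter`.

```python
from collections import Counter


def entity_frequencies(docs):
    """Count (entity text, label) pairs per page and over all pages.

    `docs` is any iterable of processed documents exposing `.ents`,
    for example the result of `nlp.pipe(alice_pages)`.
    """
    per_page = []
    total = Counter()
    for doc in docs:
        counts = Counter((ent.text, ent.label_) for ent in doc.ents)
        per_page.append(counts)
        total.update(counts)
    return per_page, total
```

With the pipeline and pages from the cells above, `per_page, total = entity_frequencies(nlp.pipe(alice_pages))` followed by `total.most_common(10)` would list the ten most frequent entities. The same surface form may appear under several labels (for instance as both `PERSON` and `ORG`), which is itself a useful signal of how reliable the model is on this kind of text.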


## Build an inverted index for named entities
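
An inverted index maps each entity to the pages where it occurs, which is the reverse of the page-to-entities view used so far. The sketch below is one possible implementation, with the hypothetical name `build_entity_index`. It keys the index on the raw entity text, which is an assumption: variants such as "Alice" and "Alice's" would get separate entries.

```python
from collections import defaultdict


def build_entity_index(docs):
    """Map each entity text to the sorted list of page numbers where it occurs.

    `docs` is any iterable of processed documents exposing `.ents`;
    page numbers are the positions of the documents in that iterable.
    """
    index = defaultdict(set)
    for page_no, doc in enumerate(docs):
        for ent in doc.ents:
            index[ent.text].add(page_no)
    # Convert the sets to sorted lists for stable, readable output.
    return {entity: sorted(page_nos) for entity, page_nos in index.items()}
```

For example, `index = build_entity_index(nlp.pipe(alice_pages))` followed by `index.get("Alice", [])` would return the pages on which the model recognised "Alice" as an entity.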