<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture05-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Named Entity Recognition

We will use the spaCy library:
https://spacy.io/usage/spacy-101


<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Header with institutional logos" width="525"/>

| Instructor      | Course | Academic Year    |
| :---        |    :----   |          ---: |
| Matteo Lissandrini      | Laboratorio Avanzato di Informatica Umanistica       | 2023/2024   |

## Usual install and basic imports

In [11]:
%pip install wikipedia-api
%pip install spacy==3.7.0



In [22]:
import gzip
import math
import string
import requests
import numpy as np
import regex as re
import matplotlib.pyplot as plt
from collections import Counter

punct_regex = re.compile('[{}]'.format(re.escape(string.punctuation))) # Regex matching any punctuation
space_regex = re.compile(' +') # Regex matching one or more consecutive spaces

In [23]:
import spacy
from spacy import displacy
# Load the small English pipeline
# (if it is missing, install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

## Test the NER methods

In [14]:
doc = nlp("Paris Hilton welcomes 2nd child with Carter Reum")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Paris Hilton 0 12 PERSON
2nd 22 25 ORDINAL
Carter Reum 37 48 PERSON
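
The two numbers printed for each entity are `ent.start_char` and `ent.end_char`, which are character offsets into the original string. As a quick sanity check that needs no model, the sketch below slices the text with the offsets from the output above. The `spans` list is copied by hand from that output and is not produced by spaCy here.

```python
text = "Paris Hilton welcomes 2nd child with Carter Reum"

# (start_char, end_char, label) triples copied from the printed output
spans = [(0, 12, "PERSON"), (22, 25, "ORDINAL"), (37, 48, "PERSON")]

# Slicing with the offsets recovers the surface form of each entity
recovered = [(text[start:end], label) for start, end, label in spans]

for surface, label in recovered:
    print(surface, label)
```

The end offset is exclusive, so `text[start:end]` returns exactly the entity text.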


In [15]:
# We can render our annotations in a readable, highlighted format
displacy.render(doc, style="ent", jupyter=True)

In [16]:
doc = nlp("The Hilton Paris  hotel welcomes this year more than 1640 guests")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Hilton Paris 4 16 GPE
this year 33 42 DATE
more than 1640 43 57 CARDINAL


In [17]:
doc = nlp("Hilton Paris: Born in New York City, and raised there and in Los Angeles ")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Hilton Paris 0 12 PERSON
New York City 22 35 GPE
Los Angeles 61 72 GPE


In [18]:
# Longer document
doc = nlp("""
Citing high fuel prices, United Airlines said Friday it has increased fares by $6
per round trip on flights to some cities also served by lower-cost carriers.
American Airlines, a unit of AMR Corp., immediately matched the move,
spokesman Tim Wagner said.
""")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, style="ent", jupyter=True)

United Airlines 26 41 ORG
Friday 47 53 DATE
6 81 82 MONEY
American Airlines 160 177 ORG
AMR Corp. 189 198 ORG
Tim Wagner 240 250 PERSON


## Goal: test how NER works on text from Alice in Wonderland and Aesop's Fables



In [19]:
def get_pages(book_text):
  """
  Given the full text of a book, return the list of its non-empty pages.
  """
  # Pages are separated by runs of blank lines
  _pages = [ _page.strip() for _page in book_text.split("\n\r\n\r\n\r") ]
  # Join the lines of each page and collapse repeated spaces
  _pages = [ space_regex.sub(' ', " ".join(_page.splitlines())).strip() for _page in _pages ]
  # Drop empty pages (test each page, not the whole list)
  _pages = [ _page for _page in _pages if _page != '' ]

  return _pages
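
To see what this kind of page splitting does, here is a minimal sketch on a toy string. It makes several assumptions: the helper name `split_pages` is hypothetical, the separator is simplified to three plain newlines (the real files use a different line-ending pattern), and the toy text is invented for illustration.

```python
import re

def split_pages(text, sep="\n\n\n"):
    """Split text on a separator, collapse whitespace in each page, drop empty pages."""
    pages = []
    for raw in text.split(sep):
        # Join lines and collapse any run of whitespace into a single space
        page = re.sub(r"\s+", " ", raw).strip()
        if page:  # keep only non-empty pages
            pages.append(page)
    return pages

# Two short "pages" separated by runs of blank lines
toy = "THE FOX\nA Fox saw  grapes.\n\n\n\n\n\nTHE CROW\nA Crow found a pitcher.\n\n\n"
pages = split_pages(toy)
print(pages)
```

Consecutive separators produce empty strings after the split, which is why the final filter on each page matters.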

In [20]:
# request the raw text of Alice in Wonderland
r = requests.get(r'https://ia801604.us.archive.org/6/items/alicesadventures19033gut/19033.txt')
alice = r.text

alice_pages = get_pages(alice)


r = requests.get(r'https://ia600906.us.archive.org/29/items/aesopsfablesanew11339gut/11339.txt')
fables = r.text

fables_pages = get_pages(fables)



In [26]:
print(len(fables_pages))
print(len(alice_pages))

296
22


In [29]:
test_page = fables_pages[24]
doc = nlp(test_page)
displacy.render(doc, style="ent", jupyter=True)

In [21]:
"""
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including “%”.
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
"""


In [None]:


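# Toy example of an inverted index: each key maps to the list of items associated with it
# ("torta" is Italian for "cake")
invert = {}
invert['torta'] = ['Anna', 'Marco']
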

In [43]:
# Inverted index: (entity text, entity label) -> set of page positions where it occurs
index = {}

for pos, page in enumerate(fables_pages):
  if pos < 10:
    continue  # skip the first 10 pages
  # for each page use spaCy to identify named entities
  doc = nlp(page)

  for ent in doc.ents:
      _key = (ent.text, ent.label_)
      if _key not in index:
        index[_key] = set()

      index[_key].add(pos)

print(len(index))

# Query: look up the pages for each entity found in the query text
q = "The Ass, The Fox, they saw a Lion"
doc = nlp(q)

for ent in doc.ents:
  # .get avoids a KeyError for entities that never occurred in the indexed pages
  print(ent.text, ent.label_, index.get((ent.text, ent.label_), set()))


605
Fox ORG {256, 259, 261, 263, 264, 138, 15, 271, 272, 273, 149, 24, 152, 32, 37, 41, 48, 53, 60, 67, 69, 71, 72, 202, 210, 211, 93, 96, 104, 237, 110, 120, 252}
Lion ORG {128, 259, 261, 135, 138, 273, 148, 24, 25, 166, 44, 181, 187, 69, 207, 84, 254, 86, 219, 223, 99, 247, 249, 252, 126}
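
Since every key maps to a set of page positions, a query that mentions several entities can be answered by intersecting those sets. The following is a minimal sketch of that idea, not part of the original notebook: `toy_index` holds only a small hand-picked subset of the page sets printed in this notebook, and `pages_with_all` is a hypothetical helper name.

```python
# Toy inverted index: (entity text, label) -> set of page positions
# (a small hand-picked subset of the sets printed in this notebook)
toy_index = {
    ("Fox", "ORG"): {15, 24, 32, 69},
    ("Lion", "ORG"): {24, 25, 44, 69},
    ("Crow", "PERSON"): {15, 26},
}

def pages_with_all(index, keys):
    """Return the sorted pages that contain every key; unknown keys yield no pages."""
    page_sets = [index.get(key, set()) for key in keys]
    if not page_sets:
        return []
    return sorted(set.intersection(*page_sets))

both = pages_with_all(toy_index, [("Fox", "ORG"), ("Lion", "ORG")])
print(both)
```

Using a union instead of an intersection would return the pages that mention at least one of the entities.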


In [42]:
print(index.keys())

dict_keys([('Charcoal', 'ORG'), ('Fuller', 'ORG'), ('THE MICE IN COUNCIL Once', 'ORG'), ('Mice', 'ORG'), ('Council', 'ORG'), ('Mouse', 'PERSON'), ('Bat', 'PERSON'), ('Weasel', 'PERSON'), ('Weasel', 'ORG'), ('FOX', 'ORG'), ('THE CROW A Crow', 'ORG'), ('Fox', 'ORG'), ('Crow', 'PERSON'), ('HORSE', 'ORG'), ('Groom', 'PRODUCT'), ('long hours', 'TIME'), ('daily', 'DATE'), ('Groom', 'PERSON'), ('WOLF', 'PERSON'), ('LAMB', 'NORP'), ('Wolf', 'PERSON'), ('Lamb', 'PERSON'), ('Last year', 'DATE'), ('PEACOCK', 'ORG'), ('Crane', 'PRODUCT'), ('Crane', 'PERSON'), ('CAT', 'ORG'), ('Cat', 'ORG'), ('SPENDTHRIFT', 'ORG'), ('early spring', 'DATE'), ('Swallow', 'PERSON'), ('Spendthrift', 'ORG'), ('One', 'CARDINAL'), ('summer', 'DATE'), ('MOON', 'ORG'), ('Moon', 'PERSON'), ('one', 'CARDINAL'), ('New Moon', 'GPE'), ('MERCURY', 'ORG'), ('THE WOODMAN', 'ORG'), ('Mercury', 'ORG'), ('Woodman', 'ORG'), ('second', 'ORDINAL'), ('Woodman', 'FAC'), ('Woodman', 'PERSON'), ('two', 'CARDINAL'), ('Honesty', 'GPE'), ('Lion
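
The key listing shows the same name stored under several keys, for example `('FOX', 'ORG')` and `('Fox', 'ORG')`, or `('Weasel', 'PERSON')` and `('Weasel', 'ORG')`. One possible way to make lookups more forgiving is to merge keys by their case-folded text and ignore the label. This is a sketch under assumptions, not the notebook's method: the page sets below are made up, and only the key names come from the listing.

```python
from collections import defaultdict

# Keys taken from the listing above; the page sets are invented for illustration
raw_index = {
    ("FOX", "ORG"): {1},
    ("Fox", "ORG"): {2, 3},
    ("Weasel", "PERSON"): {4},
    ("Weasel", "ORG"): {5},
}

merged = defaultdict(set)
for (text, _label), pages in raw_index.items():
    # Ignore the label and the casing: all variants share one entry
    merged[text.casefold()] |= pages

merged = dict(merged)
print(merged)
```

The trade-off is that genuinely different entities with the same spelling would also be merged.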

In [44]:
index[('Crow', 'PERSON')]

{15, 26, 236, 265, 286}

In [45]:
fables_pages[26]

'THE CROW AND THE PITCHER A thirsty Crow found a Pitcher with some water in it, but so little was there that, try as she might, she could not reach it with her beak, and it seemed as though she would die of thirst within sight of the remedy. At last she hit upon a clever plan. She began dropping pebbles into the Pitcher, and with each pebble the water rose a little higher until at last it reached the brim, and the knowing bird was enabled to quench her thirst. Necessity is the mother of invention.'

**If you are curious about entity linking, see this tutorial:**

https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/notebooks/notebook_video.ipynb