<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Named Entity Recognition

We will use the spaCy library:
https://spacy.io/usage/spacy-101


<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Header with institutional logos" width="525"/>

| Instructor   | Course       | Academic Year      |
| :---        |    :----   |          ---: |
| Matteo Lissandrini      | Laboratorio Avanzato di Informatica Umanistica       | 2023/2024   |

## Usual install and basic imports

In [1]:
%pip install wikipedia-api
%pip install spacy==3.7.0

Collecting wikipedia-api
  Downloading Wikipedia_API-0.6.0-py3-none-any.whl (14 kB)
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.6.0
Collecting spacy==3.7.0
  Downloading spacy-3.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
Collecting weasel<0.4.0,>=0.1.0 (from spacy==3.7.0)
  Downloading weasel-0.3.4-py3-none-any.whl (50 kB)
Collecting cloudpathlib<0.17.0,>=0.7.0 (from weasel<0.4.0,>=0.1.0->spacy==3.7.0)
  Downloading cloudpathlib-0.16.0-py3-none-any.whl (45 kB)
Installing collected packages: cloudpathlib, weasel, spacy
  Attempting uninstall: spacy
    Found existing 

In [6]:
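# Step 1: Import necessary libraries
import wikipediaapi
import re

# Create the API client. In the wikipedia-api release installed above, the user
# agent is the first argument, so the two calls below (language first) are wrong.
# wiki_wiki = wikipediaapi.Wikipedia('en', user_agent='YourUserAgent/1.0')  # WRONG
# wiki_wiki = wikipediaapi.Wikipedia('en', extract_format=wikipediaapi.ExtractFormat.WIKI, user_agent='YourUserAgent/1.0')  # WRONG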
wiki_wiki = wikipediaapi.Wikipedia('MyTestProjectName (my.name@univr.it)',
                                   'en',
                                   extract_format=wikipediaapi.ExtractFormat.WIKI)

# Step 2: Prepare a list of titles of Wikipedia pages
fantasy_literature_titles = [
    "Fantasy literature",
    "The Lord of the Rings",
    "Harry Potter",
    "A Song of Ice and Fire",
    "The Chronicles of Narnia",
    "The Hobbit",
    "Alice's Adventures in Wonderland",
    "The Wizard of Oz",
    "The Silmarillion",
    "Discworld",
    "His Dark Materials",
    "The Wheel of Time",
    "Earthsea",
    "The Once and Future King",
    "The Princess Bride",
    "The Name of the Wind",
    "Mistborn",
    "The Malazan Book of the Fallen",
    "The Kingkiller Chronicle",
    "The Inheritance Cycle"
]

# Step 3: Write code to download main content in plain text and create a collection of cleaned text
cleaned_texts = []

for title in fantasy_literature_titles:
    page_py = wiki_wiki.page(title)

    if page_py.exists():
        # Keep the original URL of the page
        original_url = page_py.fullurl

        # Remove brackets and their contents from the text
        content = re.sub(r'\[[^\]]*\]', '', page_py.text)

        # Remove newline characters and extra spaces
        clean_text = ' '.join(content.split())

        # Store the cleaned text along with the original URL
        cleaned_texts.append({'title': title, 'content': clean_text, 'url': original_url})
    else:
        print(f"Page '{title}' does not exist on Wikipedia.")

# Display the cleaned texts and their original URLs (optional)
for idx, entry in enumerate(cleaned_texts, 1):
    print(f"Text {idx} - Title: {entry['title']}\nURL: {entry['url']}\nContent:\n{entry['content']}\n{'='*50}\n")


Text 1 - Title: Fantasy literature
URL: https://en.wikipedia.org/wiki/Fantasy_literature
Content:
Fantasy literature is literature set in an imaginary universe, often but not always without any locations, events, or people from the real world. Magic, the supernatural and magical creatures are common in many of these imaginary worlds. Fantasy literature may be directed at both children and adults. Fantasy is considered a subgenre of speculative fiction and is distinguished from the genres of science fiction and horror by the absence of scientific or macabre themes, respectively, though these genres overlap. Historically, most works of fantasy were in written form, but since the 1960s, a growing segment of the fantasy genre has taken the form of films, television programs, graphic novels, video games, music and art. Many fantasy novels originally written for children and adolescents also attract an adult audience. Examples include Alice's Adventures in Wonderland, the Harry Potter series
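
The two cleaning steps used in the loop above (dropping bracketed spans, then collapsing whitespace) can be tried in isolation on a short string. This is a minimal sketch: the helper name `clean_wiki_text` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

def clean_wiki_text(raw: str) -> str:
    """Strip bracketed spans and collapse whitespace runs (hypothetical helper)."""
    # Drop anything enclosed in square brackets, e.g. "[1]" or "[citation needed]"
    without_brackets = re.sub(r'\[[^\]]*\]', '', raw)
    # split() with no arguments splits on any run of whitespace, so joining
    # with a single space also removes newlines and leading/trailing blanks
    return ' '.join(without_brackets.split())

# Invented sample text, not taken from Wikipedia
sample = "The Hobbit[1] is a novel.\n\n  It was published in 1937.[citation needed]"
print(clean_wiki_text(sample))
# → The Hobbit is a novel. It was published in 1937.
```

Note that the regular expression does not handle nested brackets, which is usually acceptable for plain-text extracts like these.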

In [7]:
from collections import defaultdict
import re

# Step 4: Build an inverted index of lowercase trigrams ignoring non-alphanumeric symbols
inverted_index = defaultdict(list)

for entry in cleaned_texts:
    title = entry['title']
    content = entry['content'].lower()
    url = entry['url']

    # Remove non-alphanumeric symbols
    content = re.sub(r'[^a-z0-9 ]', '', content)

    # Generate character trigrams (sliding window of 3 characters, spaces included)
    trigrams = [content[i:i+3] for i in range(len(content)-2)]

    # Build inverted index
    for trigram in set(trigrams):  # Using set to remove duplicates
        inverted_index[trigram].append({'title': title, 'url': url})

# Display the inverted index (optional)
for trigram, entries in inverted_index.items():
    print(f"Trigram: {trigram}")
    for entry in entries:
        print(f"  Title: {entry['title']}, URL: {entry['url']}")


Streaming output truncated to the last 5000 lines.
  Title: The Chronicles of Narnia, URL: https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia
  Title: The Hobbit, URL: https://en.wikipedia.org/wiki/The_Hobbit
Trigram: cai
  Title: The Chronicles of Narnia, URL: https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia
Trigram: bde
  Title: The Chronicles of Narnia, URL: https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia
  Title: The Hobbit, URL: https://en.wikipedia.org/wiki/The_Hobbit
Trigram: yah
  Title: The Chronicles of Narnia, URL: https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia
Trigram: kpa
  Title: The Chronicles of Narnia, URL: https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia
Trigram: aal
  Title: The Chronicles of Narnia, URL: https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia
  Title: His Dark Materials, URL: https://en.wikipedia.org/wiki/His_Dark_Materials
  Title: The Wheel of Time, URL: https://en.wikipedia.org/wiki/The_Wheel_of_Time
Tr
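
Printing the whole index is mostly useful as a sanity check. A more typical use is to look up the trigrams of a query and rank documents by how many of them they contain. The sketch below illustrates that idea on two invented toy documents. The names `char_trigrams`, `toy_docs`, `toy_index` and `search` are hypothetical, and unlike the cell above it stores sets of document identifiers and collapses repeated spaces after removing punctuation.

```python
import re
from collections import Counter, defaultdict

def char_trigrams(text: str) -> set[str]:
    """Lowercase, drop non-alphanumeric symbols, return the set of character trigrams."""
    normalized = re.sub(r'[^a-z0-9 ]', '', text.lower())
    # Collapse runs of spaces left behind by removed punctuation
    normalized = ' '.join(normalized.split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}

# Toy documents standing in for the cleaned Wikipedia pages (illustrative only)
toy_docs = {
    "doc_hobbit": "Bilbo Baggins is a hobbit.",
    "doc_wizard": "Gandalf is a wizard.",
}

# Inverted index: trigram -> set of document identifiers containing it
toy_index = defaultdict(set)
for doc_id, text in toy_docs.items():
    for trigram in char_trigrams(text):
        toy_index[trigram].add(doc_id)

def search(query: str, index) -> list[tuple[str, int]]:
    """Rank documents by the number of query trigrams they contain."""
    hits = Counter()
    for trigram in char_trigrams(query):
        # .get() avoids creating empty entries in the defaultdict
        for doc_id in index.get(trigram, ()):
            hits[doc_id] += 1
    return hits.most_common()

print(search("hobbit", toy_index))
# → [('doc_hobbit', 4)]
```

Because matching works on character trigrams rather than whole words, a slightly misspelled query still shares most of its trigrams with the intended document, which is the main reason for indexing at this granularity.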