
# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate scraping the news article from a webpage.

In [13]:
!pip install newspaper3k
!pip install lxml_html_clean
!pip install spacy-transformers



In [14]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/best-christmas-markets-around-the-world/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Tamara Hardingham-Gill'] 

Title:  The top Christmas markets for 2024 

Text of article: 
 CNN —

There’s nothing quite like a festive market to bring out the Christmas spirit in people.

While these events can be traced back to Vienna – the city’s first recorded December market was in 1298 – the tradition has spread across the world over the centuries.

From Germany and Switzerland to Singapore and New York, it’s difficult to find a coveted destination that doesn’t hold an impressive annual market. In fact, some have grown so popular that they’ve become tourist attractions in their own right.

Here’s CNN Travel’s rundown of some of the top Christmas markets taking place around the world this year:

Wiener Christkindlmarkt, Austria

With reindeer rides, a giant Ferris wheel and a classic Nativity scene, Vienna’s magical spectacle encapsulates the festive spirit fantastically.

Although there are around 20 Christmas markets in the Austrian capital from which to choose, Wiene

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [15]:
# Calculate and print the number of unique words in the text

unique_words = set(article.text.split())    # split() with no argument splits on any whitespace

print(sorted(unique_words))
print(len(unique_words))

# NB: the output still includes numbers, punctuation and possessive markers
# NB: it also still distinguishes between uppercase and lowercase variants of
# the same underlying form

['$25', '(Czech', '(Finland’s', '(mulled', '(or', '(sausage),', '1', '1,000', '1.', '10', '100', '110-square-meter', '11th', '12', '1298', '13,', '13-meter', '13.', '1441.', '14th', '15', '150', '1570,', '16', '16th', '17,000-square-foot', '1786,', '1786.', '1996,', '2', '2.', '20', '20-meter-tall', '200', '200-plus', '2005,', '2024', '21', '22', '22.', '23', '23,', '23.', '24.', '25', '26,', '26.', '27', '27,000', '27.', '28', '29', '30', '31.', '3D', '4.', '450', '46-meter', '5.', '50-foot', '6.', '7.', '70', '70-meter-high', '8.', '80', '800,000', 'A', 'AP', 'Adam', 'Admission', 'Advent', 'Alexander', 'Alsatian', 'Although', 'America', 'Andrew', 'At', 'Attendees', 'Aurora.', 'Austria', 'Austrian', 'Band', 'Bank', 'Barcelona', 'Barfüsserplatz', 'Barfüsserplatz.', 'Basel', 'Basilica', 'Basilica.', 'Bavaria’s', 'Bay', 'Bay,', 'Bazilika', 'Bazilika,', 'Belgian', 'Belgium', 'Belgium’s', 'Berlin', "Berlin's", 'Best', 'BoA', 'Bollnäs', 'Borchi/Atlantide', 'Bourse,', 'Bratislava', 'Brussels
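The two caveats noted in the code comments (attached punctuation and case distinctions) can be illustrated on a toy sentence. The following is only a minimal sketch using the standard library; the helper name `unique_word_counts` and the sample sentence are made up for illustration and are not part of the exercise.

```python
import string

def unique_word_counts(text: str) -> dict:
    """Count unique whitespace-separated tokens after each normalisation step."""
    raw = text.split()
    lowered = [w.lower() for w in raw]
    # Strip leading/trailing ASCII punctuation plus typographic quotes,
    # which are common in newspaper text.
    strip_chars = string.punctuation + "’‘“”"
    stripped = [w.strip(strip_chars) for w in lowered]
    stripped = [w for w in stripped if w]  # drop tokens that were pure punctuation
    return {
        "raw": len(set(raw)),
        "lowercased": len(set(lowered)),
        "punctuation_stripped": len(set(stripped)),
    }

# Invented sample that mimics the variants visible in the output above
sample = "Basilica. The Basilica, the basilica (or Bazilika) and the Bay, Bay"
counts = unique_word_counts(sample)
print(counts)
```

Each step merges variants of the same word, so the count can only stay equal or shrink from one step to the next.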

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.

In [16]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again

# NB: at first I tried a different approach, in which I preprocessed the text
# manually before passing it to the spaCy nlp pipeline (kept in the string
# below for reference). In the end I stuck with the second solution, as its
# output is not as restricted as that of the first solution.

import spacy
import string

"""
nlp = spacy.load("en_core_web_sm")
text = article.text.lower()

tokens_2 = []
for word in text.split():
  if word.isalpha():
    tokens_2.append(word)

text_no_punctuation = " ".join(tokens_2)
doc = nlp(text_no_punctuation)

unique_lemmas = set()
for token in doc:
  unique_lemmas.add(token.lemma_)

print(sorted(unique_lemmas))
print(len(unique_lemmas))
"""

nlp = spacy.load("en_core_web_sm")
text = article.text.lower()   # lowercase input

doc = nlp(text)

tokens = set()
for token in doc:
  if token.is_alpha:
    tokens.add(token.lemma_)   # add lemmatized word to set


print(sorted(tokens))   # sorted() returns the set as an alphabetically ordered list
print(len(tokens))    # print number of unique lemmas

# Observation: lemmatizer returns "i" in its standard form "I"

['I', 'a', 'able', 'about', 'abuzz', 'across', 'activity', 'adam', 'add', 'addition', 'admission', 'adorn', 'advent', 'affair', 'after', 'ahead', 'air', 'akin', 'alamy', 'alexander', 'all', 'alleyways', 'along', 'alongside', 'alsatian', 'also', 'although', 'america', 'among', 'amusement', 'an', 'and', 'andrew', 'annual', 'annually', 'anticipated', 'any', 'anywhere', 'ap', 'apart', 'area', 'arguably', 'around', 'art', 'article', 'artificial', 'artisan', 'artisanal', 'artist', 'as', 'associate', 'at', 'atlantide', 'atmosphere', 'attend', 'attendee', 'attract', 'attraction', 'aurora', 'austria', 'austrian', 'aux', 'back', 'backdrop', 'band', 'bank', 'bar', 'barcelona', 'barfüsserplatz', 'base', 'basel', 'basilica', 'basis', 'bavaria', 'bay', 'bazilika', 'be', 'beat', 'beautiful', 'beautifully', 'become', 'beer', 'before', 'begin', 'behold', 'belgian', 'belgium', 'berlin', 'between', 'big', 'boa', 'bollnäs', 'borchi', 'both', 'bourse', 'bratislava', 'bratwurst', 'bread', 'bring', 'browse',

## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [26]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)

👋 ⚒  Use each of the three models that were downloaded above and perform named entity recognition with each of them on the original, non-preprocessed article, one after another. You can use different code cells for the different models or write everything into one cell, as you prefer. For each of the model outputs, automatically calculate the number of named entities for each entity type that the model identifies.

In [27]:
# load spaCy models
nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")
nlp_trf = spacy.load("en_core_web_trf")

original_text = article.text

# use a function to avoid repeating code (refactored afterwards: text added as an argument)
def display_named_entities(nlp_model, model_name, text):
    doc = nlp_model(text)

    # count named entities by label
    entity_counts = {}
    for ent in doc.ents:
        entity_counts[ent.label_] = entity_counts.get(ent.label_, 0) + 1

    # display named entities and counts
    print(f"Named Entities for {model_name}:")
    print(f"Total Named Entities: {len(doc.ents)}\n")

    for label, count in entity_counts.items():
        print(f"{label}: {count}")

    print("\nEntity Details:")
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

    print("\n" + "="*50 + "\n")   # line for better readability

# call the function for each model
display_named_entities(nlp_sm, "en_core_web_sm", original_text)
display_named_entities(nlp_lg, "en_core_web_lg", original_text)
display_named_entities(nlp_trf, "en_core_web_trf", original_text)


# NB: When doing this exercise I encountered massive problems with loading the
# transformer model. In the end, I found out that one always has to update
# spacy and to reload the runtime. I hope that we will cover some workarounds
# for this issue in the upcoming sessions.

Named Entities for en_core_web_sm:
Total Named Entities: 452

ORG: 55
DATE: 146
GPE: 97
ORDINAL: 3
PERSON: 35
NORP: 26
QUANTITY: 8
FAC: 31
WORK_OF_ART: 3
CARDINAL: 28
LOC: 8
EVENT: 4
TIME: 3
PRODUCT: 2
LAW: 1
LANGUAGE: 1
MONEY: 1

Entity Details:
CNN 0 3 ORG
Christmas 68 77 DATE
Vienna 138 144 GPE
first 158 163 ORDINAL
December 173 181 DATE
1298 196 200 DATE
the centuries 250 263 DATE
Germany 271 278 GPE
Switzerland 283 294 GPE
Singapore 298 307 GPE
New York 312 320 GPE
annual 399 405 DATE
Christmas 558 567 DATE
this year 606 615 DATE
Wiener Christkindlmarkt 618 641 PERSON
Austria 643 650 GPE
Ferris 681 687 PERSON
Nativity 708 716 ORG
Vienna 724 730 GPE
Austrian 853 861 NORP
Wiener Christkindlmarkt 892 915 ORG
Rathausplatz 920 932 ORG
Viennese Dream Christmas Market 1010 1041 ORG
110-square-meter 1054 1070 QUANTITY
City Hall 1115 1124 FAC
Tree of Hearts 1137 1151 WORK_OF_ART
hundreds 1188 1196 CARDINAL
Austrian 1326 1334 NORP
Christmas 1389 1398 DATE
Wiener Christkindlmarkt 1407 1430 O
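The per-label counting in `display_named_entities` could also be written with `collections.Counter`. The sketch below is an alternative under stated assumptions, not the method used in the cell above: to stay runnable without any model, it counts the first few (text, label) pairs copied from the en_core_web_sm output; with a real spaCy `Doc`, the same idea would be `Counter(ent.label_ for ent in doc.ents)`.

```python
from collections import Counter

# First eight (text, label) pairs copied from the en_core_web_sm output above
entities = [
    ("CNN", "ORG"),
    ("Christmas", "DATE"),
    ("Vienna", "GPE"),
    ("first", "ORDINAL"),
    ("December", "DATE"),
    ("1298", "DATE"),
    ("the centuries", "DATE"),
    ("Germany", "GPE"),
]

# Counter tallies how often each label occurs
label_counts = Counter(label for _, label in entities)

# most_common() returns the labels sorted by descending frequency
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

Sorting by frequency makes the dominant entity types easier to spot than the insertion order of a plain dictionary.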

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [19]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

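# Question: Which model is being used for this?
# Note: `doc` is not reassigned in this cell, and display_named_entities only
# creates a local `doc`, so this renders the last notebook-level `doc` (in the
# cells shown here: the lowercased article processed with en_core_web_sm).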

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

en_core_web_sm:

    Total entities detected: 452.
    For this analysis, I focused mainly on the labels ORG, DATE, FAC and GPE.
    The sm model identified 55 organizations (ORG), 146 dates (DATE) and 97 geopolitical entities (GPE).
    However, it labeled market and place names inconsistently: "Wiener Christkindlmarkt", for example, is tagged PERSON at one occurrence and ORG at another.
    This is probably because "Wiener Christkindlmarkt" is a German rather than an English term.

en_core_web_lg:

    Total entities detected: 463.
    This model shows slightly improved classification in the FAC (45) and ORG categories, and it labeled some entities such as "Christmas Market" as FAC rather than PERSON.
    It also handles international location names better, although it still struggles with specific phrases (e.g. "Advent" or "Fira de Santa Llucia" are sometimes categorized as PERSON).

en_core_web_trf:

    Total entities detected: 465.
    This model clearly outperformed the others in FAC classification: it identified 102 facilities, which matches my expectation that markets and plazas should be labeled FAC.
    It also produced the fewest misclassifications, particularly for complex entities, e.g. "the Great Christmas Tree" as WORK_OF_ART and "Strasbourg Christmas Market" as FAC.
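A comparison like the one above could be made more systematic by comparing the models span by span. The following is a rough sketch only: `compare_entity_sets` is a hypothetical helper, the `sm_ents` rows are taken from the en_core_web_sm output shown earlier, and the `other_ents` rows are invented purely for illustration (they are not real output of the lg or trf model). With spaCy, such triples could be collected as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.

```python
def compare_entity_sets(ents_a, ents_b):
    """Compare two collections of (start_char, end_char, label) triples."""
    a, b = set(ents_a), set(ents_b)
    # Map each character span to its label for both models
    spans_a = {(start, end): label for start, end, label in a}
    spans_b = {(start, end): label for start, end, label in b}
    shared_spans = spans_a.keys() & spans_b.keys()
    # Spans found by both models but labeled differently
    relabeled = {
        span: (spans_a[span], spans_b[span])
        for span in shared_spans
        if spans_a[span] != spans_b[span]
    }
    return {
        "same_span_same_label": len(a & b),
        "same_span_different_label": relabeled,
        "only_in_a": len(spans_a.keys() - spans_b.keys()),
        "only_in_b": len(spans_b.keys() - spans_a.keys()),
    }

# Rows from the en_core_web_sm output above
sm_ents = [(0, 3, "ORG"), (618, 641, "PERSON"), (681, 687, "PERSON"), (708, 716, "ORG")]
# Invented rows standing in for a second model (illustration only)
other_ents = [(0, 3, "ORG"), (618, 641, "FAC"), (1010, 1041, "FAC")]

report = compare_entity_sets(sm_ents, other_ents)
print(report)
```

Separating "same span, different label" from "span found by only one model" distinguishes classification disagreements from detection disagreements, which the raw totals (452, 463, 465) cannot show.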

👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [20]:
# Perform NER on preprocessed text

preprocessed_text = " ".join(tokens)    # NB: tokens is a set of unique lemmas, so word order and repetitions are lost

display_named_entities(nlp_sm, "en_core_web_sm", preprocessed_text)
display_named_entities(nlp_lg, "en_core_web_lg", preprocessed_text)
display_named_entities(nlp_trf, "en_core_web_trf", preprocessed_text)

Named Entities for en_core_web_sm:
Total Named Entities: 79

ORG: 12
DATE: 14
GPE: 20
NORP: 12
LOC: 2
CARDINAL: 5
PERSON: 10
ORDINAL: 1
PRODUCT: 1
FAC: 1
TIME: 1

Entity Details:
bank london complete grand 0 26 ORG
week 241 245 DATE
brussels 314 322 GPE
stockholm 383 392 GPE
swedish 406 413 NORP
chicago 423 430 GPE
polish 820 826 NORP
austrian 925 933 NORP
poland 1070 1076 GPE
europe 1274 1280 LOC
four 1383 1387 CARDINAL
barcelona 1388 1397 GPE
hungary 1398 1405 GPE
thursday 1496 1504 DATE
scotland 1519 1527 GPE
finland 1538 1545 GPE
josephine plaza 1560 1575 PERSON
second 1722 1728 ORDINAL
hyde 1803 1807 PERSON
strasbourg 1849 1859 GPE
kleber germany 1900 1914 ORG
edinburgh bratislava 1948 1968 PERSON
aurora european 2208 2223 NORP
george mary central 2413 2432 PERSON
senate 2440 2446 ORG
magic christkindlmarket 2447 2470 PRODUCT
bay christkindlmarkt 2512 2532 ORG
rathausplatz france natale 2587 2613 ORG
mercati 2626 2633 ORG
danish 2734 2740 NORP
thousand 2809 2817 CARDINAL
djurgarde

Analysis of unprocessed vs. preprocessed text input performance:

    We can see that NER on the preprocessed text yields far fewer entities than on the original text (79 instead of 452 for en_core_web_sm), which is to be expected.
    Because the preprocessed input is an unordered set of unique lowercased lemmas, the context is lost and the models cannot identify genuine multi-word entities; spans such as "edinburgh bratislava" (PERSON) are artifacts of the arbitrary word order.
    The lack of capitalization and other structural cues also results in changed entity labels, as words are less likely to be recognized correctly without their original grammatical surroundings.
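Part of the context loss comes from joining a `set`, which discards word order and repeated words. A variant that keeps the running order might look like the sketch below. It is an assumption-laden illustration, not the approach used in the cell above: `preprocess_in_order` and `fake_token` are hypothetical names, the lemma is lowercased after lemmatization (rather than lowercasing the text first), and stand-in token objects are used so that the example runs without a spaCy model. A spaCy `Doc` could be passed instead, since its tokens expose `lemma_` and `is_alpha`.

```python
from types import SimpleNamespace

def preprocess_in_order(tokens):
    """Keep alphabetic tokens, replace each by its lowercased lemma, preserve order."""
    return " ".join(tok.lemma_.lower() for tok in tokens if tok.is_alpha)

def fake_token(text, lemma=None):
    """Stand-in for a spaCy token with the attributes used here (illustration only)."""
    return SimpleNamespace(text=text, lemma_=lemma or text, is_alpha=text.isalpha())

toy_doc = [
    fake_token("Markets", "market"),
    fake_token("in"),
    fake_token("New"),
    fake_token("York"),
    fake_token("opened", "open"),
    fake_token("in"),
    fake_token("2024"),
    fake_token("."),
]

ordered = preprocess_in_order(toy_doc)
# For contrast: joining a set keeps each lemma once, in arbitrary order
unordered = " ".join({tok.lemma_.lower() for tok in toy_doc if tok.is_alpha})

print(ordered)
print(unordered)
```

In the ordered variant "new york" stays adjacent, so a model at least has a chance to see it as one span; lowercasing and lemmatization would still remove cues that the models rely on.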

## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models to identify supported languages on the left listed under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [21]:
!python -m spacy download de_core_news_sm

Collecting de-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl (14.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [22]:
url_2 = 'https://www.tagesschau.de/ausland/europa/koenigsfamilie-finanzen-charity-100.html?utm_source=pocket-newtab-de-de'
article_2 = Article(url_2)
article_2.download()
article_2.parse()

#This line displays the authors of the article
print("Authors: ", article_2.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article_2.title, "\n")
print("Text of article: \n", article_2.text)

Authors:  ['Valerie Krall', 'Dame Margaret Hodge'] 

Title:  Die lukrative Wohltätigkeit des britischen Königs 

Text of article: 
 Recherche zu royalen Finanzen Die lukrative Wohltätigkeit des britischen Königs Stand: 09.11.2024 13:54 Uhr

Die britische Königsfamilie zeigt sich gerne wohltätig - schweigt aber über finanzielle Verstrickungen. Recherchen zeigen, dass sie unter anderem viel Geld vom NHS, Schulen und der Armee erhält.

Ende April besuchte Charles III. ein Zentrum der Krebshilfe Macmillan in London. Der König ist Schirmherr der Wohltätigkeitsorganisation und wirbt öffentlichkeitswirksam für deren Unterstützung.

Was er dabei nicht erwähnt: Er selbst verdient an der Charity. Denn Macmillans Büro ist in einem Gebäude, das zum Privatbesitz des Monarchen gehört. Seit 2005 hat die Wohltätigkeitsorganisation umgerechnet fast 20 Millionen Euro Miete gezahlt.

Flecken auf der royalen Weste

Dies ist nur ein Beispiel dafür, wie die britische Königsfamilie heimlich von Pachtzahlunge

👋 ⚒ Perform NER on the selected article.

In [31]:
nlp_de = spacy.load("de_core_news_sm")

original_text_2 = article_2.text

display_named_entities(nlp_de, "de_core_news_sm", original_text_2)

Named Entities for de_core_news_sm:
Total Named Entities: 46

MISC: 18
PER: 10
LOC: 12
ORG: 6

Entity Details:
britischen 62 72 MISC
britische 113 122 MISC
NHS 269 272 MISC
Charles III. 325 337 PER
Krebshilfe Macmillan 354 374 LOC
London 378 384 LOC
Charity 555 562 ORG
Macmillans 569 579 PER
Monarchen 632 641 LOC
britische 818 827 MISC
Pachtzahlungen 855 869 LOC
Mietverträgen 874 887 LOC
Vereinigten Königreich 1047 1069 LOC
Channel 4 1110 1119 MISC
Sunday Times 1128 1140 ORG
Royals 1245 1251 LOC
Margaret Hodge 1301 1315 PER
britischen Unterhaus 1330 1350 ORG
Margaret Hodge 1434 1448 PER
Mietzahlungen 1555 1568 MISC
Grant 1793 1798 MISC
Charles 1949 1956 PER
Prinz William jedes Jahr 1961 1985 PER
Lancaster 2052 2061 LOC
Cornwall 2066 2074 LOC
Norman Baker 2323 2335 PER
NHS 2352 2355 MISC
Armee 2368 2373 ORG
Londoner 2512 2520 MISC
Krankenwagen 2533 2545 LOC
NHS 2633 2636 MISC
Charles III 2646 2657 PER
William 2667 2674 PER
Militär 2713 2720 ORG
Marine 2743 2749 ORG
Cornwall 2802 2810 LO

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

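    The NER performance of de_core_news_sm on the German article is noticeably less robust than that of the English models.
    This is evident from the higher share of incorrectly categorized entities: common nouns such as "Pachtzahlungen", "Mietverträgen" and "Krankenwagen" are labeled LOC, and "Macmillans" is labeled PER.
    In addition, 18 of the 46 entities are labeled MISC. As the output shows, the German model uses only four labels (PER, LOC, ORG, MISC), so MISC acts as a catch-all for everything that is not a person, location or organization, which makes the output less fine-grained than the English label set.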
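To put the label distributions side by side, the per-label counts printed above can be converted into percentages. This is a small sketch using only the counts reported in this notebook for de_core_news_sm and en_core_web_sm; the helper name `label_shares` is made up, and the shares say nothing about whether the labels are correct.

```python
# Counts copied from the outputs above
de_counts = {"MISC": 18, "PER": 10, "LOC": 12, "ORG": 6}
en_sm_counts = {
    "ORG": 55, "DATE": 146, "GPE": 97, "ORDINAL": 3, "PERSON": 35, "NORP": 26,
    "QUANTITY": 8, "FAC": 31, "WORK_OF_ART": 3, "CARDINAL": 28, "LOC": 8,
    "EVENT": 4, "TIME": 3, "PRODUCT": 2, "LAW": 1, "LANGUAGE": 1, "MONEY": 1,
}

def label_shares(counts):
    """Return each label's share of all entities in percent, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {label: round(100 * n / total, 1) for label, n in ranked}

de_shares = label_shares(de_counts)
en_shares = label_shares(en_sm_counts)

print("German:", de_shares)
print("English (sm):", en_shares)
```

The two label sets are not directly comparable (four labels versus the 17 that occur in the English output), so the percentages mainly show how coarse the German categories are.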
