<a href="https://colab.research.google.com/github/fubotz/cl_intro_ws2024/blob/main/HomeExercise1_Fabian_SCHAMBECK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate the scraping of the news article from a webpage.

In [8]:
!pip install newspaper3k
!pip install lxml_html_clean
!pip install nltk



In [9]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/best-christmas-markets-around-the-world/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Tamara Hardingham-Gill'] 

Title:  The top Christmas markets for 2024 

Text of article: 
 CNN —

There’s nothing quite like a festive market to bring out the Christmas spirit in people.

While these events can be traced back to Vienna – the city’s first recorded December market was in 1298 – the tradition has spread across the world over the centuries.

From Germany and Switzerland to Singapore and New York, it’s difficult to find a coveted destination that doesn’t hold an impressive annual market. In fact, some have grown so popular that they’ve become tourist attractions in their own right.

Here’s CNN Travel’s rundown of some of the top Christmas markets taking place around the world this year:

Wiener Christkindlmarkt, Austria

With reindeer rides, a giant Ferris wheel and a classic Nativity scene, Vienna’s magical spectacle encapsulates the festive spirit fantastically.

Although there are around 20 Christmas markets in the Austrian capital from which to choose, Wiene

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [44]:
# Calculate and print the number of unique words in the text (=types?)

# import relevant packages
import nltk
from nltk.tokenize import word_tokenize
import string

# download 'punkt' tokenizer
nltk.download('punkt', quiet=True)

# initialize text from article
text = article.text
word_tokens = word_tokenize(text)

# define filtered tokens
unwanted_chars = set(string.punctuation).union({'–', '—', '’', '“', '”'})

# filter out unwanted tokens and lowercase
unique_words = []
for word in word_tokens:
    word_lower = word.lower()   # lowercase so that unique words are counted case-insensitively
    if word not in unwanted_chars:
        unique_words.append(word_lower)

unique_words = set(unique_words)

# print the number of unique words
print("Unique words:", sorted(unique_words))
print("Number of unique words (types):", len(unique_words))


# When doing this exercise, I wasn't sure which words to include. Besides, I
# thought about filtering out numbers and possessive forms like "'s" and it was
# tricky to decide which punctuation or special characters to remove for an
# accurate count. Furthermore, the final set of words contains words of different
# languages. I imagine this to be problematic for a possible further analysis...

Unique words: ["'s", '1', '1,000', '10', '100', '110-square-meter', '11th', '12', '1298', '13', '13-meter', '1441', '14th', '15', '150', '1570', '16', '16th', '17,000-square-foot', '1786', '1996', '2', '20', '20-meter-tall', '200', '200-plus', '2005', '2024', '21', '22', '23', '24', '25', '26', '27', '27,000', '28', '29', '30', '31', '3d', '4', '450', '46-meter', '5', '50-foot', '6', '7', '70', '70-meter-high', '8', '80', '800,000', 'a', 'able', 'about', 'abuzz', 'across', 'activities', 'activity', 'adam', 'added', 'addition', 'admission', 'adorning', 'advent', 'affair', 'after-dark', 'ahead', 'akin', 'alexander', 'all', 'alleyways', 'along', 'alongside', 'alsatian', 'also', 'although', 'america', 'among', 'amusement', 'an', 'and', 'andrew', 'annual', 'annually', 'anticipated', 'any', 'anywhere', 'ap', 'apart', 'are', 'area', 'arguably', 'around', 'article', 'artificial', 'artisan', 'artisanal', 'artists', 'arts', 'as', 'associated', 'at', 'atmosphere', 'attendees', 'attending', 'attra
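The comment in the cell above asks which tokens should count as words. A minimal sketch of how that choice moves the type count, using a hypothetical token list in place of `word_tokenize(article.text)`; the helper names `count_types`, `keep_all`, `not_punctuation`, and `has_letter` are assumptions introduced only for this illustration:

```python
import string

PUNCTUATION = set(string.punctuation)


def count_types(tokens, keep):
    """Count distinct lowercased tokens for which keep(token) is true."""
    return len({tok.lower() for tok in tokens if keep(tok)})


def keep_all(tok):
    return True


def not_punctuation(tok):
    return tok not in PUNCTUATION


def has_letter(tok):
    return any(ch.isalpha() for ch in tok)


# Hypothetical token list standing in for word_tokenize(article.text).
tokens = ["The", "market", "opened", "in", "1298", ".", "the", "Market",
          "'s", "20-meter-tall", "tree", ",", "1,000", "lights"]

# Each filter is stricter than the one before it.
for name, keep in [("no filter", keep_all),
                   ("drop punctuation", not_punctuation),
                   ("must contain a letter", has_letter),
                   ("letters only", str.isalpha)]:
    print(f"{name}: {count_types(tokens, keep)} types")
```

On this toy list the counts fall from 12 to 10, 8, and 6 as the filter tightens. The "letters only" filter also discards `'s` and hyphenated compounds such as `20-meter-tall`, which may or may not be desirable depending on the analysis.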

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `str.isalpha()`).
3. Lemmatize all words in the text.

In [48]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again

# import relevant packages
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# download 'wordnet' lemmatizer
nltk.download('wordnet', quiet=True)

# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# filter out unwanted tokens, remove numbers, lowercase, and lemmatize
unique_lem_words = []
for word in word_tokens:
    word_lower = word.lower()   # lowercase so that counting is case-insensitive
    if word.isalpha():    # filters out punctuation and numbers
        lemmatized_word = lemmatizer.lemmatize(word_lower)    # lemmatizes the word
        unique_lem_words.append(lemmatized_word)

unique_lem_words = set(unique_lem_words)

# print the number of unique words
print("Unique lemmatized words:", sorted(unique_lem_words))
print("Number of unique lemmatized words:", len(unique_lem_words))

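# Filtering with .isalpha() reduces the number of unique words further, because
# it drops every token that contains a non-letter character, such as digits,
# punctuation, or hyphens (e.g. '1298', "'s", '20-meter-tall'). Letters outside
# the English alphabet are kept (e.g. 'ü' in 'barfüsserplatz'). Lemmatization
# reduces the count again by merging inflected forms such as 'artists' and 'artist'.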

Unique lemmatized words: ['a', 'able', 'about', 'abuzz', 'across', 'activity', 'adam', 'added', 'addition', 'admission', 'adorning', 'advent', 'affair', 'ahead', 'akin', 'alexander', 'all', 'alleyway', 'along', 'alongside', 'alsatian', 'also', 'although', 'america', 'among', 'amusement', 'an', 'and', 'andrew', 'annual', 'annually', 'anticipated', 'any', 'anywhere', 'ap', 'apart', 'are', 'area', 'arguably', 'around', 'art', 'article', 'artificial', 'artisan', 'artisanal', 'artist', 'associated', 'at', 'atmosphere', 'attendee', 'attending', 'attraction', 'attracts', 'aurora', 'austria', 'austrian', 'aux', 'back', 'backdrop', 'band', 'bank', 'bar', 'barcelona', 'barfüsserplatz', 'based', 'basel', 'basilica', 'basis', 'bavaria', 'bay', 'bazilika', 'be', 'beaten', 'beautiful', 'beautifully', 'become', 'been', 'beer', 'before', 'began', 'begin', 'behold', 'belgian', 'belgium', 'berlin', 'best', 'better', 'between', 'big', 'biggest', 'boa', 'bollnäs', 'both', 'bourse', 'bratislava', 'bratwurs
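Inflected verb forms such as 'attracts', 'began', and 'beaten' survive in the output above. One likely reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. The sketch below shows one way to supply that argument from Penn Treebank tags; `penn_to_wordnet`, `lemmatize_tagged`, and the stand-in `show_pos` are hypothetical helpers, not part of NLTK:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS codes WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for every other tag


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lowercase and lemmatize alphabetic (word, tag) pairs via lemmatize(word, pos)."""
    return [lemmatize(word.lower(), penn_to_wordnet(tag))
            for word, tag in tagged_tokens
            if word.isalpha()]


# Stand-in lemmatizer for demonstration only: it reports the POS it was given.
def show_pos(word, pos):
    return f"{word}/{pos}"


print(lemmatize_tagged([("Vienna", "NNP"), ("attracts", "VBZ"), ("1298", "CD")],
                       show_pos))
# prints ['vienna/n', 'attracts/v']

# With NLTK (assumes a POS tagger resource has been downloaded):
#   tagged = nltk.pos_tag(word_tokens)
#   lemmas = lemmatize_tagged(tagged, lemmatizer.lemmatize)
```

Passing the tagger output through this mapping should let the lemmatizer reduce verb forms as well, which would lower the type count further.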

## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [49]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 43.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 867.7 kB/s eta 0:00:00
Installing collected packages: en-core-w

👋 ⚒  Use each of the three models downloaded above to perform named entity recognition on the original, non-preprocessed article, one after another. You can use a separate code cell for each model or write everything into one cell, as you prefer. For each model's output, automatically calculate the number of named entities for each entity type that the model identifies.

In [57]:
import spacy

# load spaCy models
nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")
nlp_trf = spacy.load("en_core_web_trf")

text = article.text

# sm
print("Named Entities for en_core_web_sm:")
doc_sm = nlp_sm(text)
for ent in doc_sm.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
print("\n" + "="*50 + "\n")

# lg
print("Named Entities for en_core_web_lg:")
doc_lg = nlp_lg(text)
for ent in doc_lg.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
print("\n" + "="*50 + "\n")

#trf
print("Named Entities for en_core_web_trf:")
doc_trf = nlp_trf(text)
for ent in doc_trf.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
print("\n" + "="*50 + "\n")


Named Entities for en_core_web_sm:
CNN 0 3 ORG
Christmas 68 77 DATE
Vienna 138 144 GPE
first 158 163 ORDINAL
December 173 181 DATE
1298 196 200 DATE
the centuries 250 263 DATE
Germany 271 278 GPE
Switzerland 283 294 GPE
Singapore 298 307 GPE
New York 312 320 GPE
annual 399 405 DATE
CNN Travel’s 518 530 ORG
Christmas 558 567 DATE
this year 606 615 DATE
Wiener Christkindlmarkt 618 641 PERSON
Austria 643 650 GPE
Ferris 681 687 PERSON
Nativity 708 716 NORP
Vienna 724 730 GPE
20 825 827 CARDINAL
Austrian 853 861 NORP
Wiener Christkindlmarkt 892 915 ORG
Rathausplatz 920 932 ORG
Viennese Dream Christmas Market 1010 1041 EVENT
110-square-meter 1054 1070 QUANTITY
City Hall 1115 1124 FAC
Tree of Hearts 1137 1151 WORK_OF_ART
hundreds 1188 1196 CARDINAL
Austrian 1326 1334 NORP
Christmas 1389 1398 DATE
Wiener Christkindlmarkt 1407 1430 ORG
November 16 to December 26 1441 1467 DATE
Basel Christmas Market 1470 1492 FAC
Switzerland 1494 1505 GPE
Basel 1507 1512 GPE
Christmas 1569 1578 DATE
Meinrad Rie
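The cell above prints every entity but does not yet compute the per-type counts that the task asks for. A minimal sketch using `collections.Counter` over `doc.ents`; the function name `count_entity_labels` is an assumption, and `toy_doc` is a stand-in object built from the first five entities printed above so that the example runs without a spaCy model:

```python
from collections import Counter
from types import SimpleNamespace


def count_entity_labels(doc):
    """Return a Counter mapping each entity label to its frequency in doc.ents."""
    return Counter(ent.label_ for ent in doc.ents)


# Stand-in for a spaCy Doc: only the .ents and .label_ attributes are mimicked.
toy_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="CNN", label_="ORG"),
    SimpleNamespace(text="Christmas", label_="DATE"),
    SimpleNamespace(text="Vienna", label_="GPE"),
    SimpleNamespace(text="first", label_="ORDINAL"),
    SimpleNamespace(text="December", label_="DATE"),
])

for label, count in count_entity_labels(toy_doc).most_common():
    print(label, count)
```

With the real pipeline outputs, the same function can be called as `count_entity_labels(doc_sm)`, `count_entity_labels(doc_lg)`, and `count_entity_labels(doc_trf)` to obtain one table per model.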

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities.
# Pass one of the processed documents from above: doc_sm, doc_lg, or doc_trf.
from spacy import displacy
displacy.render(doc_trf, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

*Your NE performance analysis here*
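To support the comparison with something more systematic than reading three long printouts, one option is to compare the sets of `(start_char, end_char, label_)` triples that two models produce. This is only a sketch: `entity_spans`, `compare_entities`, and `_ent` are hypothetical helpers, `doc_a` copies three entities from the `en_core_web_sm` output above, and `doc_b` is invented purely to show a label disagreement:

```python
from types import SimpleNamespace


def entity_spans(doc):
    """Collect entities as (start_char, end_char, label) triples."""
    return {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}


def compare_entities(doc_a, doc_b):
    """Split the entities of two docs into shared and model-specific sets."""
    spans_a, spans_b = entity_spans(doc_a), entity_spans(doc_b)
    return {"shared": spans_a & spans_b,
            "only_a": spans_a - spans_b,
            "only_b": spans_b - spans_a}


def _ent(start, end, label):
    # Stand-in for a spaCy Span, mimicking only the attributes used above.
    return SimpleNamespace(start_char=start, end_char=end, label_=label)


doc_a = SimpleNamespace(ents=[_ent(0, 3, "ORG"), _ent(138, 144, "GPE"),
                              _ent(618, 641, "PERSON")])
# Invented second model output that labels the same span differently.
doc_b = SimpleNamespace(ents=[_ent(0, 3, "ORG"), _ent(138, 144, "GPE"),
                              _ent(618, 641, "EVENT")])

diff = compare_entities(doc_a, doc_b)
for key in ("shared", "only_a", "only_b"):
    print(key, sorted(diff[key]))
```

Applied to the real documents, for example `compare_entities(doc_sm, doc_trf)`, the `only_a` and `only_b` sets point directly at the spans where the models disagree, which are the cases worth discussing in the analysis.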

👋 ⚒ Run the best-performing spaCy model on the preprocessed article and compare its NER results with its results on the non-preprocessed article.

In [None]:
# Your code here
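One possible starting point, assuming the preprocessed article is rebuilt as a single string before it is passed to the model: the hypothetical helper `preprocess_for_ner` below applies the same steps as the preprocessing section (lowercasing, keeping alphabetic tokens, optional lemmatization) and rejoins the tokens with spaces.

```python
def preprocess_for_ner(tokens, lemmatize=None):
    """Lowercase, keep alphabetic tokens, optionally lemmatize, and rejoin as text."""
    kept = [tok.lower() for tok in tokens if tok.isalpha()]
    if lemmatize is not None:
        kept = [lemmatize(tok) for tok in kept]
    return " ".join(kept)


# Hypothetical tokens standing in for word_tokenize(article.text).
sample_tokens = ["Vienna", "hosted", "its", "first", "market", "in", "1298", "."]
print(preprocess_for_ner(sample_tokens))
# prints: vienna hosted its first market in

# With the notebook's objects (assumes nlp_trf is the best-performing model):
#   doc_pre = nlp_trf(preprocess_for_ner(word_tokens, lemmatizer.lemmatize))
```

Character offsets in the preprocessed text no longer line up with the original article, so the two runs are easier to compare by entity text and per-label counts than by span position. Since lowercasing removes capitalization and the filter removes dates and numbers, a drop in recognized entities would not be surprising.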

## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models and identify the supported languages listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to download the model: replace
# "en_core_web_sm" with the name of your chosen model
!python -m spacy download en_core_web_sm

👋 ⚒ Perform NER on the selected article.

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

*Your NE performance analysis here*