<a href="https://colab.research.google.com/github/fubotz/cl_intro_ws2024/blob/main/HomeExercise1_Fabian_SCHAMBECK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete every instruction marked with 👋 ⚒, either by writing code in the code cell that follows the marker or by providing your analysis in the text cell that follows it.

We will use the newspaper library to scrape the news article from a web page.

In [12]:
!pip install newspaper3k
!pip install lxml_html_clean
!pip install nltk



In [7]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/best-christmas-markets-around-the-world/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Tamara Hardingham-Gill'] 

Title:  The top Christmas markets for 2024 

Text of article: 
 CNN —

There’s nothing quite like a festive market to bring out the Christmas spirit in people.

While these events can be traced back to Vienna – the city’s first recorded December market was in 1298 – the tradition has spread across the world over the centuries.

From Germany and Switzerland to Singapore and New York, it’s difficult to find a coveted destination that doesn’t hold an impressive annual market. In fact, some have grown so popular that they’ve become tourist attractions in their own right.

Here’s CNN Travel’s rundown of some of the top Christmas markets taking place around the world this year:

Wiener Christkindlmarkt, Austria

With reindeer rides, a giant Ferris wheel and a classic Nativity scene, Vienna’s magical spectacle encapsulates the festive spirit fantastically.

Although there are around 20 Christmas markets in the Austrian capital from which to choose, Wiene

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [36]:
# Calculate and print the number of unique words in the text (=types)

# import relevant packages
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

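# Download 'punkt' and 'stopwords' only if they are not already available.
# nltk.data.find() raises LookupError when a resource is missing, so the check
# has to be wrapped in try/except instead of being tested with `if not`.
# Note: recent NLTK releases may additionally require the 'punkt_tab' resource.
for resource_path, resource_name in [('tokenizers/punkt', 'punkt'),
                                     ('corpora/stopwords', 'stopwords')]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_name, quiet=True)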

# initialize text from article
text = article.text
word_tokens = word_tokenize(text)

# define filtered tokens
stop_words = stopwords.words('english')
punctuation = string.punctuation
other = ['–', '—', '’', '“', '”']

# filter out unwanted tokens
filtered_tokens = []
for word in word_tokens:
    word_lower = word.lower()
    if word_lower not in stop_words and word not in punctuation and word not in other:
        filtered_tokens.append(word_lower)

unique_words = set(filtered_tokens)

print('Unique words:', sorted(unique_words))
print('Number of unique words (types):', len(unique_words))

Unique words: ["'s", '1', '1,000', '10', '100', '110-square-meter', '11th', '12', '1298', '13', '13-meter', '1441', '14th', '15', '150', '1570', '16', '16th', '17,000-square-foot', '1786', '1996', '2', '20', '20-meter-tall', '200', '200-plus', '2005', '2024', '21', '22', '23', '24', '25', '26', '27', '27,000', '28', '29', '30', '31', '3d', '4', '450', '46-meter', '5', '50-foot', '6', '7', '70', '70-meter-high', '8', '80', '800,000', 'able', 'abuzz', 'across', 'activities', 'activity', 'adam', 'added', 'addition', 'admission', 'adorning', 'advent', 'affair', 'after-dark', 'ahead', 'akin', 'alexander', 'alleyways', 'along', 'alongside', 'alsatian', 'also', 'although', 'america', 'among', 'amusement', 'andrew', 'annual', 'annually', 'anticipated', 'anywhere', 'ap', 'apart', 'area', 'arguably', 'around', 'article', 'artificial', 'artisan', 'artisanal', 'artists', 'arts', 'associated', 'atmosphere', 'attendees', 'attending', 'attraction', 'attractions', 'attracts', 'aurora', 'austria', 'aus

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation marks and numbers (Hint: `str.isalpha()`).
3. Lemmatize all words in the text.

In [None]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again
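
The three preprocessing steps could be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the names `preprocess` and `toy_lemmatize` are hypothetical, and the lemmatizer is passed in as a plain callable so that the example runs without downloading any data. In the notebook you would pass a real lemmatizer instead, for example NLTK's `WordNetLemmatizer().lemmatize`, which needs the `wordnet` resource.

```python
from typing import Callable, Iterable, List


def preprocess(tokens: Iterable[str],
               lemmatize: Callable[[str], str] = lambda w: w) -> List[str]:
    """Lowercase tokens, drop non-alphabetic ones, then lemmatize the rest."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()             # step 1: lowercase
        if not lowered.isalpha():           # step 2: drop punctuation and numbers
            continue
        cleaned.append(lemmatize(lowered))  # step 3: lemmatize
    return cleaned


def toy_lemmatize(word: str) -> str:
    """Hypothetical stand-in lemmatizer: strips a plural 's' (illustration only)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Invented sample tokens; in the notebook these would be `word_tokens`.
sample_tokens = ["Markets", "market", ",", "1298", "Vienna", "markets", "20-meter-tall"]
processed = preprocess(sample_tokens, lemmatize=toy_lemmatize)

types_before = set(sample_tokens)
types_after = set(processed)
print("Types before:", len(types_before))
print("Types after: ", len(types_after))
```

Note that `str.isalpha()` also rejects hyphenated tokens such as `20-meter-tall` and clitics such as `'s`, so this filter is stricter than removing punctuation and digits alone.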

## **NER**

In the tutorial, we used only one of the models available in spaCy. In this exercise, you will compare the performance of models that differ in size and implementation. The available models are described in the [spaCy documentation](https://spacy.io/models/en). First, the models need to be installed. We will use the following three models.

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

👋 ⚒ Use each of the three models downloaded above to perform named entity recognition on the original, non-preprocessed article, one model after another. You can use separate code cells for the different models or write everything in one cell, as you prefer. For each model output, automatically calculate the number of named entities of each entity type that the model identifies.

In [None]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")
# Your code here
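
Counting entities per type can be done with `collections.Counter`. The following is a minimal sketch: `count_entity_labels` and `FakeEntity` are hypothetical names, and the sample entities are invented for illustration rather than produced by any model. The function only relies on each entity having a `label_` attribute, which is what the spans in spaCy's `doc.ents` provide, so in the notebook it could be called as `count_entity_labels(nlp_sm(article.text).ents)`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeEntity:
    """Hypothetical stand-in for a spaCy Span; only .text and .label_ are used."""
    text: str
    label_: str


def count_entity_labels(entities: Iterable) -> Counter:
    """Count how many entities were found for each entity type."""
    return Counter(ent.label_ for ent in entities)


# Invented sample data, not real model output.
entities = [
    FakeEntity("Vienna", "GPE"),
    FakeEntity("Germany", "GPE"),
    FakeEntity("CNN Travel", "ORG"),
    FakeEntity("1298", "DATE"),
]

label_counts = count_entity_labels(entities)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

Running the same function on the output of each of the three models gives one `Counter` per model, which makes the counts easy to compare side by side.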

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities.
# `doc` is the Doc returned by one of the models, e.g. doc = nlp_sm(article.text)
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ Add your analysis of the differences between the three models in the next text field.

*Your NE performance analysis here*

👋 ⚒ Compare the performance of the best-performing spaCy model for NER on the preprocessed article with its performance on the non-preprocessed article.

In [None]:
# Your code here
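
One way to make this comparison concrete is to compare the sets of entities found in the two versions of the text. The sketch below is an assumption about how such a comparison might look, not the expected solution: `compare_entities` is a hypothetical helper, and the two sample lists are invented for illustration. Entities are matched case-insensitively because the preprocessed text is lowercased. In the notebook, each list could be built with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from typing import Dict, Iterable, Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity label)


def compare_entities(raw: Iterable[Entity],
                     preprocessed: Iterable[Entity]) -> Dict[str, Set[Entity]]:
    """Compare two entity lists case-insensitively on (text, label) pairs."""
    raw_set = {(text.lower(), label) for text, label in raw}
    pre_set = {(text.lower(), label) for text, label in preprocessed}
    return {
        "shared": raw_set & pre_set,             # found in both versions
        "only_raw": raw_set - pre_set,           # lost after preprocessing
        "only_preprocessed": pre_set - raw_set,  # appeared after preprocessing
    }


# Invented sample data, not real model output.
raw_entities = [("Vienna", "GPE"), ("CNN Travel", "ORG"), ("1298", "DATE")]
preprocessed_entities = [("vienna", "GPE")]

result = compare_entities(raw_entities, preprocessed_entities)
for name, items in result.items():
    print(f"{name}: {len(items)}")
```

Entities that appear only in the raw set point to information the model relied on, such as capitalization or digits, that preprocessing removed.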

## **Multilingual NER**
In this exercise, you will compare the NER performance of spaCy in English with its performance in another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models. The supported languages are listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to download the model: replace
# "en_core_web_sm" below with the name of your model
!python -m spacy download en_core_web_sm

👋 ⚒ Perform NER on the selected article.

👋 ⚒ How well did NER work in the language of your choice compared with the overall performance of spaCy's NER in English?

*Your NE performance analysis here*