# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to scrape the news article from a webpage.

In [1]:
!pip install newspaper3k



In [2]:
import newspaper
from newspaper import Article

url = 'https://www.bbc.com/culture/article/20241018-how-monets-paintings-changed-the-way-we-see-london'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  [] 

Title:  'A mesmerising mirage': How Monet's paintings changed the way we see London 

Text of article: 
 'A mesmerising mirage': How Monet's paintings changed the way we see London

Christie’s Images

A new exhibition charts how Claude Monet's revolutionary, fog-shrouded visions of the Thames would "irreversibly alter how London saw itself".

Some artists help us perceive the world more precisely. A rare few go further. They look beyond looking. Theirs is a deeper reality, more felt than seen. Claude Monet is one of those. In three visits to London between 1899 and 1901, the French Impressionist, then approaching 60 years of age, embarked upon one of the most ambitious series of penetrating paintings ever undertaken by any artist – a project that is now the focus of a groundbreaking exhibition at the Courtauld Institute, Monet and London: Views of the Thames.

From a murky miasma of toxic, soot-laced smog that choked the very breath of the Thames, Monet magicked up nearl

In [4]:
!pip install 'lxml[html_clean]'



👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [15]:
my_text_1 = article.text
words_1 = my_text_1.split()
unique_words = set(words_1)
print(f"The text has {len(unique_words)} unique words before preprocessing.")
print(unique_words)

The text has 952 unique words before preprocessing.
{'"mysterious', 'such', 'rare', 'key', 'instalment', 'medium', 'precious', 'features,', 'His', 'need', 'blocks', 'vantage', 'artists', 'began', 'respective', 'time,', 'revolutionary,', 'not', 'scores', 'views', 'significant,', 'more', 'clean', 'R', 'understanding', 'revelation.', 'refulgent', 'explained;', '1871,', 'armature', 'succeeded', 'multi-panel', 'way', 'Its', 'smog', 'often', 'shine', 'were', 'canvases', 'By', 'city.', 'short', '14th', 'paintings,', 'given', 'rigid', 'fog-shrouded', 'fledgling', 'late', 'have', 'Essential', 'week.', 'commenting', 'getting', 'experiments', 'Parliament,', 'years.', 'corroborates', 'divergence.', 'Turner', 'virulent', 'fumble', 'already', 'illusion"', 'tapestries', 'Courtesy', 'alchemised', 'conviction', 'lens', 'best', 'esteemed', 'List', 'seeing', '"Upon', 'second', "wouldn't", 'physicist', 'Above', 'itself', '1899,', 'bridges', 'initially', 'purposes.', 'perceive', 'Musée', 'penetrates', 'int
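The raw set above contains entries such as 'time,' and 'city.' because `split()` only separates on whitespace, so punctuation and capitalisation create extra "unique" words. A minimal sketch of that effect, using a made-up sample string rather than the article:

```python
import string

# Made-up sample (not from the article) with repeated words in different forms.
sample = "London saw itself. London, itself, saw"

# Whitespace splitting keeps punctuation attached and is case-sensitive.
raw_unique = set(sample.split())

# Stripping surrounding punctuation and lowercasing merges the variants.
normalized_unique = {w.strip(string.punctuation).lower() for w in sample.split()}
normalized_unique.discard("")  # drop tokens that were punctuation only

print(len(raw_unique), sorted(raw_unique))
print(len(normalized_unique), sorted(normalized_unique))
```

This only illustrates why the count drops after preprocessing; it is not a replacement for the steps in the next section.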

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.

In [19]:
import spacy
import re
#I have decided to use regex as I already have experience with it :)

nlp = spacy.load("en_core_web_sm")
my_text_pp = article.text

my_text_pp = my_text_pp.lower()
my_text_pp = re.sub(r"[^a-zA-Z]+", " ", my_text_pp)
doc = nlp(my_text_pp)
my_text_pp = [token.lemma_ for token in doc]
my_text_pp = " ".join(my_text_pp)

unique_words_pp = set(my_text_pp.split())
print(f"The text has {len(unique_words_pp)} unique words after preprocessing.")
print(unique_words_pp)


The text has 733 unique words after preprocessing.
{'embark', 'fuse', 'such', 'impressionist', 'rare', 'attest', 'key', 'instalment', 'crisp', 'medium', 'billow', 'infrastructure', 'precious', 'need', 'brushstroke', 'muse', 'vantage', 'ludicrous', 'minute', 'respective', 'lewandowski', 'not', 'charing', 'more', 'clean', 'understanding', 'deep', 'refulgent', 'armature', 'weigh', 'way', 'side', 'smog', 'often', 'bass', 'shine', 'alain', 'old', 'archaeology', 'short', 'blake', 'yellow', 'rigid', 'fledgling', 'late', 'have', 'thirty', 'list', 'don', 'virulent', 'object', 'fumble', 'already', 'monet', 'conviction', 'lens', 'posit', 'esteemed', 'introduce', 'finish', 'physicist', 'second', 'glimpse', 'itself', 'cloak', 'invigorate', 'present', 'initially', 'density', 'static', 'perceive', 'florence', 'intangible', 'provocatively', 'atmospheric', 'precisely', 'describe', 'inbox', 'magnificent', 'go', 'an', 'undiscovered', 'lyon', 'third', 'dark', 'september', 'havre', 'mesmerise', 'glimmer', 
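The hint suggests `isalpha()`, while the cell above uses a regex. The two are not equivalent when applied to whitespace-separated tokens, which may explain fragments such as 'don' in the set above. A small sketch with a made-up sentence, assuming plain `str.split()` tokens rather than spaCy tokens:

```python
import re

# Made-up sample sentence (not from the article).
sample = "Monet's fog-shrouded views, painted 1899-1901."

# Token-level filter: any whitespace token containing a non-letter is dropped whole.
token_level = [w for w in sample.lower().split() if w.isalpha()]

# Character-level regex (as in the cell above): non-letters become spaces,
# so contractions and hyphenated words are split into pieces.
char_level = re.sub(r"[^a-zA-Z]+", " ", sample.lower()).split()

print(token_level)
print(char_level)
```

With spaCy tokens, punctuation is usually already a separate token, so filtering on `token.is_alpha` would sit somewhere between these two behaviours.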

## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [1]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. 

👋 ⚒  Use each of the three models that were downloaded above to perform named entity recognition on the original, non-preprocessed article, one after another. You can use different code cells for the different models or write everything into one cell, as you prefer. For each of the model outputs, automatically calculate the number of named entities for each NER type that the model identifies.

In [11]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")
nlp_trf = spacy.load("en_core_web_trf")

doc_sm = nlp_sm(article.text)
doc_lg = nlp_lg(article.text)
doc_trf = nlp_trf(article.text)

def count_ner_types(doc):
    ner_counts = {}
    for ent in doc.ents:
        ner_type = ent.label_
        if ner_type in ner_counts:
            ner_counts[ner_type] += 1
        else:
            ner_counts[ner_type] = 1
    return ner_counts

ner_counts_sm = count_ner_types(doc_sm)
ner_counts_lg = count_ner_types(doc_lg)
ner_counts_trf = count_ner_types(doc_trf)

print("NER Counts for en_core_web_sm:")
for ner_type, count in ner_counts_sm.items():
    print(f"{ner_type}: {count}")

print("\n" + "NER Counts for en_core_web_lg:")
for ner_type, count in ner_counts_lg.items():
    print(f"{ner_type}: {count}")

print("\n" + "NER Counts for en_core_web_trf:")
for ner_type, count in ner_counts_trf.items():
    print(f"{ner_type}: {count}")

NER Counts for en_core_web_sm:
PERSON: 51
GPE: 24
ORG: 35
CARDINAL: 8
DATE: 26
LOC: 2
TIME: 2
ORDINAL: 6
NORP: 5
LANGUAGE: 1
WORK_OF_ART: 2
PRODUCT: 1
FAC: 1
EVENT: 1

NER Counts for en_core_web_lg:
PERSON: 48
GPE: 32
ORG: 33
CARDINAL: 9
DATE: 24
NORP: 7
TIME: 3
ORDINAL: 8
FAC: 6
LANGUAGE: 1
WORK_OF_ART: 1
EVENT: 3

NER Counts for en_core_web_trf:
PERSON: 52
GPE: 29
LOC: 8
CARDINAL: 5
DATE: 26
NORP: 4
ORG: 21
WORK_OF_ART: 9
FAC: 13
TIME: 5
ORDINAL: 5
LANGUAGE: 1
EVENT: 1
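The counting function above could also be written with `collections.Counter`. The sketch below is one possible variant, not the required solution; `fake_doc` is a hypothetical stand-in for a spaCy `Doc` so that the snippet runs without downloading a model.

```python
from collections import Counter
from types import SimpleNamespace


def count_ner_types(doc):
    """Return a Counter mapping each entity label to its number of occurrences."""
    return Counter(ent.label_ for ent in doc.ents)


# Hypothetical stand-in for a spaCy Doc: only the .ents / .label_ attributes
# used by the function are provided.
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(label_="PERSON"),
        SimpleNamespace(label_="GPE"),
        SimpleNamespace(label_="PERSON"),
    ]
)

counts = count_ner_types(fake_doc)
for label, n in counts.most_common():  # sorted from most to least frequent
    print(f"{label}: {n}")
```

With a real pipeline, `count_ner_types(doc_trf)` would be called in the same way.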


You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [9]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc_trf, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

When analysing the three models, we can compare them on two points: accuracy of NER detection and model size vs. performance. Looking at the sm, lg and trf models for my article, I see that each of them gives different results. The sm model detects the widest range of entity types (14), but does so with worse accuracy. The lg model finds more entities of some types than sm (GPE: 32 vs. 24, FAC: 6 vs. 1) but covers fewer types overall (12). The trf model outputs a more balanced result: it covers 13 entity types, with the highest PERSON count (52) and better LOC/FAC recognition (8 and 13 entities). Generally speaking, larger models recognize entity types more accurately, which is also visible in our results here: en_core_web_lg and en_core_web_trf show improved performance on specific entity types such as GPE, LOC, and FAC compared to the smaller en_core_web_sm, which produces more inaccurate results. More extensive testing is needed to draw firm conclusions, but I personally prefer the output of the transformer model for this one example, even though it runs a bit slower due to its larger size.
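To make the comparison easier to read, the counts printed above can be put side by side. The sketch below simply copies the numbers from the three outputs into dictionaries (it does not rerun the models) and prints one row per entity type, plus the number of types and the total number of entities per model.

```python
# Counts copied from the outputs above; nothing is recomputed by spaCy here.
counts = {
    "sm": {"PERSON": 51, "GPE": 24, "ORG": 35, "CARDINAL": 8, "DATE": 26,
           "LOC": 2, "TIME": 2, "ORDINAL": 6, "NORP": 5, "LANGUAGE": 1,
           "WORK_OF_ART": 2, "PRODUCT": 1, "FAC": 1, "EVENT": 1},
    "lg": {"PERSON": 48, "GPE": 32, "ORG": 33, "CARDINAL": 9, "DATE": 24,
           "NORP": 7, "TIME": 3, "ORDINAL": 8, "FAC": 6, "LANGUAGE": 1,
           "WORK_OF_ART": 1, "EVENT": 3},
    "trf": {"PERSON": 52, "GPE": 29, "LOC": 8, "CARDINAL": 5, "DATE": 26,
            "NORP": 4, "ORG": 21, "WORK_OF_ART": 9, "FAC": 13, "TIME": 5,
            "ORDINAL": 5, "LANGUAGE": 1, "EVENT": 1},
}

# All labels seen by at least one model, in alphabetical order.
labels = sorted(set().union(*counts.values()))

print(f"{'label':<12}" + "".join(f"{model:>6}" for model in counts))
for label in labels:
    # A label a model never produced is shown as 0.
    print(f"{label:<12}" + "".join(f"{counts[m].get(label, 0):>6}" for m in counts))

n_types = {m: len(c) for m, c in counts.items()}
totals = {m: sum(c.values()) for m, c in counts.items()}
print("types: ", n_types)
print("totals:", totals)
```

Counts alone do not show which entities are correct, so the table is best read together with the displacy visualization.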

👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [21]:
import spacy
import re

nlp = spacy.load("en_core_web_trf")
my_text_pp = article.text

my_text_pp = my_text_pp.lower()
my_text_pp = re.sub(r"[^a-zA-Z]+", " ", my_text_pp)
doc = nlp(my_text_pp)
my_text_pp = [token.lemma_ for token in doc]
my_text_pp = " ".join(my_text_pp)

unique_words_pp = set(my_text_pp.split())
print(f"The text has {len(unique_words_pp)} unique words after preprocessing.")
print(unique_words_pp)

doc_trf = nlp(my_text_pp)

def count_ner_types(doc):
    ner_counts = {}
    for ent in doc.ents:
        ner_type = ent.label_
        if ner_type in ner_counts:
            ner_counts[ner_type] += 1
        else:
            ner_counts[ner_type] = 1
    return ner_counts

ner_counts_trf = count_ner_types(doc_trf)

print("\n" + "NER Counts for en_core_web_trf on a preprocessed text:")
for ner_type, count in ner_counts_trf.items():
    print(f"{ner_type}: {count}")

The text has 737 unique words after preprocessing.
{'embark', 'fuse', 'such', 'impressionist', 'rare', 'attest', 'key', 'instalment', 'crisp', 'medium', 'billow', 'infrastructure', 'precious', 'need', 'brushstroke', 'muse', 'vantage', 'ludicrous', 'minute', 'began', 'respective', 'lewandowski', 'not', 'views', 'charing', 'more', 'clean', 'understanding', 'deep', 'refulgent', 'armature', 'weigh', 'way', 'side', 'smog', 'often', 'bass', 'shine', 'alain', 'old', 'archaeology', 'short', 'blake', 'yellow', 'rigid', 'fledgling', 'late', 'have', 'thirty', 'list', 'don', 'virulent', 'object', 'fumble', 'already', 'monet', 'conviction', 'lens', 'posit', 'esteemed', 'introduce', 'finish', 'physicist', 'second', 'glimpse', 'itself', 'cloak', 'invigorate', 'present', 'initially', 'density', 'static', 'perceive', 'florence', 'intangible', 'provocatively', 'atmospheric', 'precisely', 'describe', 'inbox', 'magnificent', 'go', 'an', 'undiscovered', 'writing', 'lyon', 'third', 'dark', 'september', 'hav

NER Counts for en_core_web_trf:
PERSON: 52
GPE: 29
LOC: 8
CARDINAL: 5
DATE: 26
NORP: 4
ORG: 21
WORK_OF_ART: 9
FAC: 13
TIME: 5
ORDINAL: 5
LANGUAGE: 1
EVENT: 1
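When comparing NER on raw and preprocessed text, it can help to look at what the preprocessing removes. The sketch below applies the same lowercasing and regex step to a fragment of the article's first paragraph and lists the capitalised words and numbers that disappear. It does not run spaCy; it only shows which surface cues are no longer available to a model.

```python
import re

# Fragment taken from the article text shown earlier in this notebook.
fragment = "In three visits to London between 1899 and 1901, the French Impressionist"

# Same cleaning as in the cell above: lowercase, then replace non-letters with spaces.
cleaned = re.sub(r"[^a-zA-Z]+", " ", fragment.lower()).strip()

# Surface cues present in the original but absent after cleaning.
numbers = re.findall(r"\d+", fragment)
capitalised = [w for w in fragment.split() if w[0].isupper()]

print(cleaned)
print("numbers removed:    ", numbers)
print("capitalisation lost:", capitalised)
```

Years such as 1899 are typical DATE entities and capitalised words are typical candidates for GPE or NORP, so results on the cleaned text may differ from those on the original.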

## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models to identify the supported languages, listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [23]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model
!python -m spacy download uk_core_news_trf

Collecting uk-core-news-trf==3.7.2
  Downloading https://github.com/explosion/spacy-models/releases/download/uk_core_news_trf-3.7.2/uk_core_news_trf-3.7.2-py3-none-any.whl (410.6 MB)
Collecting pymorphy3>=1.0.0 (from uk-core-news-trf==3.7.2)
  Downloading pymorphy3-2.0.2-py3-none-any.whl.metadata (1.8 kB)
Collecting pymorphy3-dicts-uk (from uk-core-news-trf==3.7.2)
  Downloading pymorphy3_dicts_uk-2.4.1.1.1663094765-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting dawg-python>=0.7.1 (from pymorphy3>=1.0.0->uk-core-news-trf==3.7.2)
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting pymorphy3-dicts-ru (from pymorphy3>=1.0.0->uk-core-news-trf==3.7.2)
  Downloading pymorphy3_dicts_ru-2.4.417150.4580142-py2.py3-none-any.whl.metadata (2.0 kB)
Downloading pymorphy3-2.0.2-py3-none-any.whl (53 kB)

👋 ⚒ Perform NER on the selected article.

In [1]:
!pip install 'lxml[html_clean]'



In [8]:
import newspaper
from newspaper import Article
import spacy
import re
from spacy import displacy

url = 'https://real-vin.com/google-zasvitiv-na-kartah-vijskovi-sistemi-ukraini'
article = Article(url)
article.download()
article.parse()
print("Text of article: \n", article.text)
nlp = spacy.load("uk_core_news_trf")
my_text_pp = article.text

my_text_pp = my_text_pp.lower()
my_text_pp = "".join([char if char.isalpha() or char.isspace() else " " for char in my_text_pp])
my_text_pp = re.sub(' +', ' ', my_text_pp)
doc = nlp(my_text_pp)
my_text_pp = [token.lemma_ for token in doc]
my_text_pp = " ".join(my_text_pp)

unique_words_pp = set(my_text_pp.split())
print(f"The text has {len(unique_words_pp)} unique words after preprocessing.")
print(unique_words_pp)

doc_trf = nlp(my_text_pp)

def count_ner_types(doc):
    ner_counts = {}
    for ent in doc.ents:
        ner_type = ent.label_
        if ner_type in ner_counts:
            ner_counts[ner_type] += 1
        else:
            ner_counts[ner_type] = 1
    return ner_counts

ner_counts_trf = count_ner_types(doc_trf)

print("\n" + "NER Counts on a preprocessed text:")
for ner_type, count in ner_counts_trf.items():
    print(f"{ner_type}: {count}")

displacy.render(doc_trf, style="ent", jupyter=True)

Text of article: 
 Google на своїх картах показав розміщення українських військових систем.

Про це без конкретизації повідомив керівник Центру протидії дезінформації (ЦПД) при РНБО Андрій Коваленко.

«Google викладає на картах оновлені знімки, на яких «палить» розміщення наших військових систем (не буду уточнювати)», — зазначив він.

Україна звернулася до офісу Google, щоб виправити ситуацію.

Пізніше Коваленко додав, що представники Google зв’язалися з українською стороною для вирішення проблеми зі знімками розміщення військових систем.

Нагадаємо, у вересні представники Google зустрілися з керівником Центру протидії дезінформації (ЦПД) РНБО України Андрієм Коваленком та обговорили заходи пришвидшення протидії пропаганді на YouTube.

«Зустрілися з Google — Марта Послад, директор із зв’язків з державними органами та громадської політики, CEE & Transatlantic; Анна Булах, Директорка з питань взаємодії з органами державної влади Google Україна», — зазначив Коваленко.

За його словами, во



The text has 97 unique words after preprocessing.
{'при', 'додати', 'зв', 'україна', 'військовий', 'вирішення', 'влада', 'конкретизація', 'уточнювати', 'нагадати', 'слово', 'тоді', 'до', 'з', 'орган', 'його', 'державний', 'оновлений', 'обговорити', 'бути', 'говорити', 'і', 'коваленко', 'не', 'буду', 'марта', 'взаємодія', 'язалися', 'показати', 'протидія', 'повідомити', 'зі', 'рнбо', 'він', 'пропаганда', 'пропагандист', 'наголосити', 'без', 'могти', 'офіс', 'зазначити', 'сподобатися', 'український', 'багато', 'керівник', 'у', 'свій', 'звернутися', 'цпд', 'щоб', 'для', 'андрій', 'суттєвий', 'вони', 'який', 'та', 'про', 'youtube', 'директор', 'пропаганді', 'google', 'із', 'знімок', 'пізніше', 'викладати', 'за', 'transatlantic', 'політика', 'що', 'пришвидшення', 'анна', 'система', 'послад', 'сторона', 'розміщення', 'захід', 'палити', 'громадський', 'скоро', 'зрушення', 'директорка', 'центр', 'язків', 'вересень', 'дезінформація', 'проблема', 'cee', 'представник', 'на', 'питання', 'наш', 'зу

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

I used a short article in Ukrainian to test the capabilities of multilingual NER with spaCy, and picked the transformer model for the Ukrainian language (uk_core_news_trf). I ran it on the preprocessed text and was pleasantly surprised by the output: all entities were marked and displayed correctly. This is nice to see, as I very often run into drastically degraded performance when comparing the same tool on English and Ukrainian.
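One detail in the unique-word output above is that words such as "зв’язалися" end up split into 'зв' and 'язалися', because the apostrophe is not alphabetic and gets replaced by a space. A possible variant of the cleaning step that keeps word-internal apostrophes is sketched below; the set of apostrophe characters and the function name are assumptions made for illustration.

```python
# Apostrophe variants that may appear inside Ukrainian words (an assumption;
# extend this string if your source uses other characters).
APOSTROPHES = "'’ʼ"


def clean_keep_apostrophes(text):
    """Lowercase, keep letters and word-internal apostrophes, collapse whitespace."""
    kept = "".join(
        ch if ch.isalpha() or ch.isspace() or ch in APOSTROPHES else " "
        for ch in text.lower()
    )
    # Remove apostrophes at word edges (e.g. quotation marks), keep inner ones.
    words = [w.strip(APOSTROPHES) for w in kept.split()]
    return " ".join(w for w in words if w)


def clean_letters_only(text):
    """Cleaning as in the cell above: every non-letter becomes a space."""
    kept = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


sample = "Google зв’язалися з українською стороною (2024)."
print(clean_letters_only(sample))
print(clean_keep_apostrophes(sample))
```

Whether this changes the NER result for the article would need to be checked by rerunning the model on the re-cleaned text.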