## Class 03 - 24/11/2025

Standard NLP Pipeline:
- Clean Raw Texts
- Normalize
- Tokenize
- Remove Stopwords
- Part-of-Speech Tagging (assigning each word a grammatical category [verbs, adjectives, nouns, proper nouns, conjunctions...])
- Named Entity Recognition (identifying and labeling mentions of entities such as people, organizations, locations and dates)
- Stemming / Lemmatization (techniques that reduce a word to a base form: stemming strips affixes and keeps the stem, lemmatization maps the word to its dictionary form; both reduce the size of the index and can improve retrieval)

In [31]:
import re #for cleaning

import nltk
import spacy

text = "The rate increase today reflects the Governing Council’s assessment of the inflation outlook in light of the incoming economic and financial data, the dynamics of underlying inflation, and the strength of monetary policy transmission. The September ECB staff macroeconomic projections for the euro area see average inflation at 5.6% in 2023, 3.2% in 2024 and 2.1% in 2025. This is an upward revision for 2023 and 2024 and a downward revision for 2025. The upward revision for 2023 and 2024 mainly reflects a higher path for energy prices. Underlying price pressures remain high, even though most indicators have started to ease. ECB staff have slightly revised down the projected path for inflation excluding energy and food, to an average of 5.1% in 2023, 2.9% in 2024 and 2.2% in 2025. The Governing Council’s past interest rate increases continue to be transmitted forcefully. Financing conditions have tightened further and are increasingly dampening demand, which is an important factor in bringing inflation back to target. With the increasing impact of this tightening on domestic demand and the weakening international trade environment, ECB staff have lowered their economic growth projections significantly. They now expect the euro area economy to expand by 0.7% in 2023, 1.0% in 2024 and 1.5% in 2025."

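no_number = re.sub("[^a-zA-Z]", " ", text) #replaces every character that is not an ASCII letter (digits, punctuation, curly apostrophes) with a space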
print(no_number)

The rate increase today reflects the Governing Council s assessment of the inflation outlook in light of the incoming economic and financial data  the dynamics of underlying inflation  and the strength of monetary policy transmission  The September ECB staff macroeconomic projections for the euro area see average inflation at      in            in      and      in       This is an upward revision for      and      and a downward revision for       The upward revision for      and      mainly reflects a higher path for energy prices  Underlying price pressures remain high  even though most indicators have started to ease  ECB staff have slightly revised down the projected path for inflation excluding energy and food  to an average of      in            in      and      in       The Governing Council s past interest rate increases continue to be transmitted forcefully  Financing conditions have tightened further and are increasingly dampening demand  which is an important factor in bring

In [32]:
no_upper = no_number.lower()
print(no_upper)

the rate increase today reflects the governing council s assessment of the inflation outlook in light of the incoming economic and financial data  the dynamics of underlying inflation  and the strength of monetary policy transmission  the september ecb staff macroeconomic projections for the euro area see average inflation at      in            in      and      in       this is an upward revision for      and      and a downward revision for       the upward revision for      and      mainly reflects a higher path for energy prices  underlying price pressures remain high  even though most indicators have started to ease  ecb staff have slightly revised down the projected path for inflation excluding energy and food  to an average of      in            in      and      in       the governing council s past interest rate increases continue to be transmitted forcefully  financing conditions have tightened further and are increasingly dampening demand  which is an important factor in bring

In [33]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(no_upper)

from nltk.corpus import stopwords
stops = stopwords.words('english') #It is worth noting that the stopword list may include words that are relevant depending on the context (e.g. "above")

tokens = [t for t in tokens if t not in stops]
print(tokens)

['rate', 'increase', 'today', 'reflects', 'governing', 'council', 'assessment', 'inflation', 'outlook', 'light', 'incoming', 'economic', 'financial', 'data', 'dynamics', 'underlying', 'inflation', 'strength', 'monetary', 'policy', 'transmission', 'september', 'ecb', 'staff', 'macroeconomic', 'projections', 'euro', 'area', 'see', 'average', 'inflation', 'upward', 'revision', 'downward', 'revision', 'upward', 'revision', 'mainly', 'reflects', 'higher', 'path', 'energy', 'prices', 'underlying', 'price', 'pressures', 'remain', 'high', 'even', 'though', 'indicators', 'started', 'ease', 'ecb', 'staff', 'slightly', 'revised', 'projected', 'path', 'inflation', 'excluding', 'energy', 'food', 'average', 'governing', 'council', 'past', 'interest', 'rate', 'increases', 'continue', 'transmitted', 'forcefully', 'financing', 'conditions', 'tightened', 'increasingly', 'dampening', 'demand', 'important', 'factor', 'bringing', 'inflation', 'back', 'target', 'increasing', 'impact', 'tightening', 'domesti
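
The caveat about context-relevant stopwords can be handled by whitelisting the words you want to keep. The following is a minimal sketch, not the method used in class: `BASE_STOPS` is a tiny hypothetical stand-in for `stopwords.words('english')`, and the choice of `KEEP` words is an assumption about what matters in monetary policy text.

```python
# Hypothetical mini stopword list; in the notebook this would be stopwords.words('english')
BASE_STOPS = {"the", "is", "of", "and", "to", "above", "below", "not"}
# Assumption: direction and negation words carry meaning in policy text
KEEP = {"above", "below", "not"}

def remove_stopwords(tokens, stops=BASE_STOPS, keep=KEEP):
    """Drop stopwords, except those explicitly whitelisted in `keep`."""
    active = set(stops) - set(keep)
    return [t for t in tokens if t not in active]

tokens = "inflation is above the target and not easing".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Building `active` as a set also makes each membership test constant-time, which helps on a corpus with millions of tokens.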

In [56]:
import spacy
from spacy import displacy

In [36]:
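nlp = spacy.load('en_core_web_sm')
doc = nlp(no_upper) #note: the text was lowercased above, which typically removes the capitalization cues that NER relies on
print([(ent.label_, ent.text) for ent in doc.ents]) #a bare expression is only displayed when it is the last line of the cell
displacy.render(doc, style='ent', jupyter=True)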

### Stemming

Different stemmers have different compression rates (the ratio of distinct input words to the distinct stems that remain).
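
One way to make the compression rate concrete is to divide the number of distinct words by the number of distinct stems. The sketch below assumes that definition; `toy_stem` is a hypothetical suffix stripper used only to keep the example self-contained, and `compression_ratio` is an illustrative helper, not an NLTK function.

```python
def toy_stem(word):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def compression_ratio(words, stem):
    """Distinct words divided by distinct stems; higher means more merging."""
    vocab = set(words)
    stems = {stem(w) for w in vocab}
    return len(vocab) / len(stems)

words = ["program", "programs", "programmer", "programmers", "rates", "rate"]
ratio = compression_ratio(words, toy_stem)
print(ratio)
```

In the notebook, any callable can be passed as `stem`, for example `snowball.stem` or `porter.stem`, which allows comparing stemmers on the same vocabulary.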

In [51]:
from nltk.stem import SnowballStemmer, PorterStemmer, WordNetLemmatizer
snowball = SnowballStemmer('english')
snowball.stem("programmers")

'programm'

In [47]:
porter = PorterStemmer()
print(porter.stem("programmers"))

programm


### Lemmatization

In [52]:
wnl = WordNetLemmatizer()
print(wnl.lemmatize("programmers"))

programmer


### ECB All Speeches Dataset

In [65]:
import pandas as pd

path = "/Users/eduardo/Downloads/all_ECB_speeches.csv"

df = pd.read_csv(path, sep="|").dropna()

print(df.columns)
df.head()

Index(['date', 'speakers', 'title', 'subtitle', 'contents'], dtype='object')


|   | date | speakers | title | subtitle | contents |
|---|---|---|---|---|---|
| 0 | 2025-10-29 | Christine Lagarde | Remarks delivered at the Palazzo Vecchio on th... | Speech by Christine Lagarde, President of the ... | SPEECH Remarks delivered at the Palazzo Vec... |
| 1 | 2025-10-27 | Frank Elderson | Making supervision simpler: the role of superv... | Keynote speech by Frank Elderson, Member of th... | SPEECH Making supervision s... |
| 2 | 2025-10-23 | Philip R. Lane | Acceptance speech 2025 Pádraig Ó hUiginn Award | Remarks by Philip R. Lane, Member of the Execu... | SPEECH Acceptance speech 2025 Pádraig Ó h... |
| 3 | 2025-10-21 | Christine Lagarde | Europe’s road to renewables | Speech by Christine Lagarde, President of the ... | SPEECH Europe’s road to renewables Speech... |
| 4 | 2025-10-21 | Philip R. Lane | The transmission of monetary policy: financial... | Welcome address by Philip R. Lane, Member of t... | SPEECH The transmission of monetary policy:... |


### Compute N of Tokens

In [70]:
tokens_list = df['contents'].apply(word_tokenize).tolist()

#flatten the per-speech token lists into a single corpus-wide list
all_tokens = [token for tokens in tokens_list for token in tokens]

len(all_tokens)

9845544

### Compute N of Unique Tokens


In [72]:
unique_tokens = len(set(all_tokens))
print(unique_tokens)

107519
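
The two counts above can be combined into a type-token ratio (distinct tokens divided by total tokens). With the figures from this notebook, 107519 / 9845544 is roughly 0.011. The helper below is a minimal sketch; `type_token_ratio` and the sample list are illustrative names, not part of the original notebook.

```python
def type_token_ratio(tokens):
    """Share of distinct tokens (types) among all tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty corpus
    return len(set(tokens)) / len(tokens)

sample = ["the", "rate", "increase", "reflects", "the", "inflation", "outlook", "the"]
ttr = type_token_ratio(sample)
print(ttr)
```

The ratio shrinks as a corpus grows, because frequent words keep repeating while new types become rarer, so it is best compared across corpora of similar size.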


### Simple Pre-processing

In [77]:
def simple_prepro(text):

    text = re.sub("[^a-zA-Z]", " ", text.lower())
    tokens = word_tokenize(text)

    return tokens

df['simple_prepro'] = df['contents'].apply(simple_prepro)

#vocabulary size after pre-processing, and its share of the raw vocabulary
n_unique_simple = len(df['simple_prepro'].explode().unique())
print(n_unique_simple)
print(n_unique_simple/unique_tokens)

71689
0.6667565732568197


### Compute N of Tokens that appear only once


In [83]:
print((df['simple_prepro'].explode().value_counts()==1).sum()) #problematic to train a model if roughly 1/3 of your unique tokens appear only once

26627
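
Tokens that occur exactly once are often called hapax legomena. Here they are 26627 of the 71689 unique tokens, about 37%. The sketch below shows the same count and its share with `collections.Counter` on a plain token list; `hapax_share` is a hypothetical helper name introduced for illustration.

```python
from collections import Counter

def hapax_share(tokens):
    """Return (number of tokens seen exactly once, their share of the vocabulary)."""
    counts = Counter(tokens)
    if not counts:
        return 0, 0.0
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_hapax, n_hapax / len(counts)

sample = ["ecb", "rate", "rate", "inflation", "inflation", "inflation", "outlook"]
n_hapax, share = hapax_share(sample)
print(n_hapax, share)
```

A model sees each of these tokens in a single context only, which is why a large share of them is a concern for training.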


### Aggressive Pre-processing


In [86]:
def agressive_prepro(text):

    text = re.sub("[^a-zA-Z]", " ", text.lower())
    tokens = word_tokenize(text)
    clean_tokens = [t for t in tokens if t not in stops]
    lemmas = [wnl.lemmatize(t) for t in clean_tokens]

    return lemmas

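#apply the function first; without this assignment the column does not exist and the lines below raise a KeyError
df['agressive_prepro'] = df['contents'].apply(agressive_prepro)

print(len(df['agressive_prepro'].explode().unique()))
print(len(df['agressive_prepro'].explode().unique())/unique_tokens)
print((df['agressive_prepro'].explode().value_counts()==1).sum())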

67767
0.6302792994726514
25683
