Welcome to the TF-IDF Dutch Reading Exams notebook! Here you will find the words with the highest, lower, and lowest weights according to the TF-IDF method. Here is the pipeline:
1. Load the necessary libraries
2. Load data: A sample from [Inburgeren Reading Demo A2 level exams](https://inburgeren.nl/)
3. Preprocess data
4. Compute Term Frequency - Inverse Document Frequency (TF-IDF) values per word
And that's it! You're ready to see results using widgets. Succes (not a typo, it's in Dutch 😉).

In [1]:
!pip install nbstripout


Collecting nbstripout
  Downloading nbstripout-0.8.1-py2.py3-none-any.whl.metadata (19 kB)
Downloading nbstripout-0.8.1-py2.py3-none-any.whl (16 kB)
Installing collected packages: nbstripout
Successfully installed nbstripout-0.8.1


In [2]:
!pip install -q spacy
!python -m spacy download nl_core_news_sm


Collecting nl-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-3.8.0/nl_core_news_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: nl-core-news-sm
Successfully installed nl-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('nl_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import pandas as pd
import numpy as np
import math
from collections import Counter, defaultdict
import spacy

nlp = spacy.load("nl_core_news_sm")


In [4]:
!wget https://raw.githubusercontent.com/aycignl/LayerByLayerLab/main/dataset/dutch_reading_demo_exams.csv
df_corpus = pd.read_csv("dutch_reading_demo_exams.csv")
df_corpus.head()

--2025-06-19 20:32:17--  https://raw.githubusercontent.com/aycignl/LayerByLayerLab/main/dataset/dutch_reading_demo_exams.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25419 (25K) [text/plain]
Saving to: ‘dutch_reading_demo_exams.csv’


2025-06-19 20:32:17 (14.9 MB/s) - ‘dutch_reading_demo_exams.csv’ saved [25419/25419]



Unnamed: 0,exam,doc_id,content
0,Exam1,1,Iedereen leert op zijn eigen manier. Toch geve...
1,Exam1,2,Onderwerp: Jaarlijkse hardloopwedstrijd Wij...
2,Exam1,3,Hebt u meubels gezien die u wilt kopen? En wil...
3,Exam1,4,"Beste collega, Op vrijdag 9 mei gaan we met..."
4,Exam1,5,Aan: j.degraaf@werk.nl CC: On...


In [5]:
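# Preprocess the texts, then compute IDF over the whole corpus
# Entity labels to drop. Label names are model-specific ("PER" in some models,
# "PERSON" in others); check nlp.get_pipe("ner").labels for the loaded model.
ENTITY_LABELS = {"PER", "PERSON", "LOC", "ORG"}

def preprocess_text(text, remove_named_entities=True):
    """Lemmatize, lowercase, and drop stop words, non-alphabetic tokens, and
    (optionally) tokens inside named entities."""
    doc = nlp(text)
    tokens = [
        token.lemma_.lower()
        for token in doc
        if token.is_alpha
        and not token.is_stop
        # Test each token's own entity type so multi-word entities are removed too.
        and not (remove_named_entities and token.ent_type_ in ENTITY_LABELS)
    ]
    return " ".join(tokens)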

df_corpus["processed"] = df_corpus["content"].apply(preprocess_text)

doc_freq = defaultdict(int)
total_docs = len(df_corpus)

for doc in df_corpus["processed"]:
    for term in set(doc.split()):
        doc_freq[term] += 1

idf = {term: math.log(total_docs / df) for term, df in doc_freq.items()}
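
To make the document-frequency and IDF step above concrete, here is a minimal sketch on a made-up corpus. The three toy documents and the `toy_` names are illustrative assumptions, not part of the exam dataset; the formula mirrors the natural-log, unsmoothed IDF used in the cell above.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: three already-preprocessed "documents".
toy_docs = [
    "fiets winkel fiets",
    "winkel brief",
    "winkel afspraak dokter",
]

# Document frequency: the number of documents each term appears in.
toy_doc_freq = defaultdict(int)
for doc in toy_docs:
    for term in set(doc.split()):
        toy_doc_freq[term] += 1

# IDF with the natural logarithm and no smoothing, as in the cell above.
n_docs = len(toy_docs)
toy_idf = {term: math.log(n_docs / df) for term, df in toy_doc_freq.items()}

# Print terms from rarest (highest IDF) to most common (lowest IDF).
for term in sorted(toy_idf, key=toy_idf.get, reverse=True):
    print(f"{term:10s} df={toy_doc_freq[term]}  idf={toy_idf[term]:.3f}")
```

In this sketch "winkel" appears in all three documents, so its IDF is `log(3 / 3) = 0`, while a term found in a single document gets `log(3)`.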


### TF-IDF Weighting Scheme [1]

The **TF-IDF** weighting scheme assigns to term *t* (in this use case, a word) a weight in document *d* (here, one reading text from an exam) given by:

$$
\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t
$$
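
The code cells in this notebook appear to use the raw term count for tf and an unsmoothed natural-log idf. Written out under that assumption, with $N$ the number of documents in the corpus and $\text{df}_t$ the number of documents containing *t*:

```latex
\text{idf}_t = \ln\frac{N}{\text{df}_t},
\qquad
\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \ln\frac{N}{\text{df}_t}
```

With this variant, a term that occurs in every document has $\text{idf}_t = \ln 1 = 0$, so its weight is zero regardless of how often it occurs.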

In other words, tf-idf assigns to term *t* a weight in document *d* that is:

1. **Highest** when *t* occurs many times **within a small number of documents**  
   (lending high discriminating power to those documents)

2. **Lower** when the term occurs **fewer times** in a document,  
   or occurs in **many documents**  
   (offering a less pronounced relevance signal)

3. **Lowest** when the term occurs in **virtually all documents**.
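
The three cases can be checked with a few numbers. This is a minimal sketch, assuming raw-count tf and natural-log idf as in the notebook's code cells; the corpus size `N = 10` and the tf/df values are made up for illustration.

```python
import math

N = 10  # hypothetical corpus size


def tf_idf(tf, df, n_docs=N):
    """Raw-count tf times natural-log idf (an assumed, unsmoothed variant)."""
    return tf * math.log(n_docs / df)


highest = tf_idf(tf=5, df=2)    # frequent in the document, rare in the corpus
lower_tf = tf_idf(tf=1, df=2)   # equally rare, but occurs only once
lower_df = tf_idf(tf=5, df=8)   # frequent, but spread over many documents
lowest = tf_idf(tf=5, df=10)    # occurs in every document, so idf = 0

print(round(highest, 3), round(lower_tf, 3), round(lower_df, 3), lowest)
```

The first value is the largest, the two middle values are smaller for different reasons (low tf versus high df), and the last is exactly zero.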

## References
[1] C. D. Manning, P. Raghavan, and H. Schütze, *An Introduction to Information Retrieval*, Cambridge University Press.

In [6]:
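def compute_tfidf_custom(exam_filter, top_n):
    """Return the top_n terms of one exam, ranked by TF-IDF summed over its texts.

    IDF comes from the corpus-wide `idf` dict; TF is the raw count per text.
    """
    filtered_df = df_corpus[df_corpus["exam"] == exam_filter].copy().reset_index(drop=True)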

    term_tfidf_scores = defaultdict(float)
    term_doc_map = defaultdict(set)

    for _, row in filtered_df.iterrows():
        doc_text = row["processed"]
        exam_name = row["exam"]
        text_id = f"{exam_name}-text{row['doc_id']}"
        term_counts = Counter(doc_text.split())

        for term, tf in term_counts.items():
            tfidf = tf * idf.get(term, 0)
            term_tfidf_scores[term] += tfidf
            if tfidf > 0:
                term_doc_map[term].add(text_id)

    sorted_terms = sorted(term_tfidf_scores.items(), key=lambda x: x[1], reverse=True)
    top_terms = sorted_terms[:top_n]

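    # Percentile-based categorization, relative to the selected top_n terms only
    if not top_terms:
        # np.percentile fails on an empty list, so return an empty table instead
        return pd.DataFrame(columns=["Dutch Term", "TF-IDF Score", "Category", "Text IDs"])
    scores_only = [score for _, score in top_terms]
    high_thresh = np.percentile(scores_only, 75)
    low_thresh = np.percentile(scores_only, 25)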

    rows = []
    for term, score in top_terms:
        if score >= high_thresh:
            label = "Highest"
        elif score <= low_thresh:
            label = "Lowest"
        else:
            label = "Lower"

        contributing_ids = ", ".join(sorted(term_doc_map[term]))
        rows.append((term, round(score, 3), label, contributing_ids))

    return pd.DataFrame(rows, columns=["Dutch Term", "TF-IDF Score", "Category", "Text IDs"])
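
The percentile-based labels in the function above are assigned relative to the selected top terms, so "Lowest" means lowest among the terms shown, not across the whole vocabulary. Here is a minimal sketch of that labelling step on its own; the eight scores and the `label_score` helper are hypothetical and exist only for illustration.

```python
import numpy as np

# Hypothetical summed TF-IDF scores for eight terms, sorted descending.
toy_scores = [9.0, 7.5, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

# Same thresholds as in the function above: 75th and 25th percentiles.
high_thresh = np.percentile(toy_scores, 75)
low_thresh = np.percentile(toy_scores, 25)


def label_score(score):
    """Map a score to a category using the two percentile thresholds."""
    if score >= high_thresh:
        return "Highest"
    if score <= low_thresh:
        return "Lowest"
    return "Lower"


toy_labels = [label_score(s) for s in toy_scores]
for score, label in zip(toy_scores, toy_labels):
    print(f"{score:4.1f}  {label}")
```

Because the thresholds are quartiles of the displayed scores, roughly a quarter of the terms land in "Highest" and a quarter in "Lowest", whatever the absolute values are.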


In [7]:
# import ipywidgets as widgets
# from IPython.display import display, clear_output

# exam_dropdown = widgets.Dropdown(
#     options=df_corpus["exam"].unique().tolist(),
#     description='Exam:',
#     style={'description_width': 'initial'},
#     layout=widgets.Layout(width='50%')
# )

# top_n_slider = widgets.IntSlider(
#     value=15,
#     min=5,
#     max=30,
#     step=1,
#     description='Top N terms:',
#     style={'description_width': 'initial'},
#     layout=widgets.Layout(width='50%')
# )

# output_area = widgets.Output()

# run_button = widgets.Button(
#     description='Compute TF-IDF',
#     button_style='primary'
# )

# def on_button_click(b):
#     with output_area:
#         clear_output()
#         result = compute_tfidf_custom(exam_dropdown.value, top_n=top_n_slider.value)
#         display(result)

# run_button.on_click(on_button_click)
# display(widgets.VBox([exam_dropdown, top_n_slider, run_button, output_area]))
