<a href="https://colab.research.google.com/github/deadbirddancing/Draft-Rep-Hausarbeit/blob/main/modul_04/Wysocki_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. API Search on the DDB Zeitungsportal**


We start by querying the DDB Zeitungsportal (via the `newspaper-issues` index) for pages from 1914–1918 that mention “lawine” (avalanche) or its plural “lawinen” together with at least one war-related term: “krieg” (war), “feind” (enemy), or “militär” (military). Pages that match both groups may discuss the disaster in a wartime context.
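
The query described above can be assembled from two term lists, which makes it easier to swap keywords in later. The following is a minimal sketch: `build_query` is a hypothetical helper, not part of pysolr or the DDB API, and it only reuses the field names (`type`, `publication_date`, `plainpagefulltext`) that appear in the query in the next cell.

```python
def build_query(topic_terms, context_terms, start_year=1914, end_year=1918):
    """Assemble a Solr query string for a page-level full-text search.

    Hypothetical helper: a page must contain at least one topic term
    AND at least one context term within the given date range.
    """
    date_range = (
        f"publication_date:[{start_year}-01-01T00:00:00Z "
        f"TO {end_year}-12-31T23:59:59Z]"
    )
    topic = " OR ".join(topic_terms)
    context = " OR ".join(context_terms)
    return (
        f"type:page AND {date_range} "
        f"AND plainpagefulltext:({topic}) "
        f"AND plainpagefulltext:({context})"
    )


query = build_query(["lawine", "lawinen"], ["krieg", "feind", "militär"])
print(query)
```

The resulting string could be passed to pysolr as `solr.search(query, rows=1000)`.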


In [1]:
!pip install pysolr
!pip install pandas


In [2]:
import pysolr
import pandas as pd

# Define the API endpoint for the newspaper-issues index
solr_url = 'https://api.deutsche-digitale-bibliothek.de/2/search/index/newspaper-issues'

# Initialize the pysolr client
solr = pysolr.Solr(solr_url, timeout=60)

# Construct the query:
# - 'type:page' restricts the search to individual pages
# - 'publication_date' covers the WWI period (1914-1918)
# - 'plainpagefulltext' must contain an avalanche term AND a war-related term
# - optionally, a clause such as 'zdb_id:2149754-0' can be added to target a
#   specific newspaper (not used in the query below; adjust as appropriate)
q = {
    'q': 'type:page AND publication_date:[1914-01-01T00:00:00Z TO 1918-12-31T23:59:59Z] '
         'AND plainpagefulltext:(lawine OR lawinen) AND plainpagefulltext:(krieg OR feind OR militär)',
    'rows': 1000
}

# Execute the query
results = solr.search(**q)

# Convert the results to a DataFrame for further processing
df_api = pd.DataFrame(results.docs)
print("API results from DDB Zeitungsportal:")
print(df_api.head())

API results from DDB Zeitungsportal:
                                                  id  pagenumber  \
0  TXWJ5GPO7QHW32XFDVOAPTMFHCFCSVL5-uuid-f6db67d9...           7   
1  ENRXBIM3OFB7MOPR7SZ6T3TIZNDADQ3R-ALTO9250561_D...           8   
2  5NRJUU73H6ZK7W4XNLOMCN5KXZBKPXBQ-FILE_0010_DDB...          10   
3  KPRC5GOIXVMFRZN5O7E3AYICC3AZ3B7A-ALTO105562_DD...           4   
4  64S6J2X7KAVVOWBS2ZFJ7L7VWG7Y7AKB-uuid-449422ab...           2   

                                         paper_title  \
0  Sächsische Volkszeitung : für christliche Poli...   
1                       Kölnische Zeitung. 1803-1945   
2  Schwäbischer Merkur : mit Schwäbischer Kronik ...   
3  Mannheimer General-Anzeiger : badische neueste...   
4  Weißeritz-Zeitung : Tageszeitung und Anzeiger ...   

                    provider_ddb_id  \
0  265BI7NE7QBS4NQMZCCGIVLFR73OCOSL   
1  VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW   
2  VNHXUCEEKHOUSYH4NVOUBHJGSRMOGK7J   
3  NWNEPSPSGSSYWU3IP75BYGGBRNQORN6A   
4  265BI7NE7QBS4NQMZCC
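
The query above requests at most 1000 rows, so any matches beyond that limit would not reach the DataFrame. A minimal paging sketch follows. It assumes that the search function behaves like pysolr's `Solr.search`, accepting `start` and `rows` and returning an object with `.docs` and `.hits`; whether the DDB endpoint honors `start` in this way is an assumption and should be checked. `fetch_all` and `fake_search` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def fetch_all(search_fn, query, page_size=1000, max_docs=None):
    """Collect documents page by page until no more hits remain.

    search_fn is assumed to accept (query, start=..., rows=...) and to
    return an object with `.docs` (list of dicts) and `.hits` (total matches).
    """
    docs = []
    start = 0
    while True:
        page = search_fn(query, start=start, rows=page_size)
        docs.extend(page.docs)
        start += page_size
        # Stop when the page is empty or the reported total has been covered
        if not page.docs or start >= page.hits:
            break
        # Optional safety cap on the number of collected documents
        if max_docs is not None and len(docs) >= max_docs:
            break
    return docs if max_docs is None else docs[:max_docs]


def fake_search(query, start=0, rows=10):
    """Stand-in for a Solr client so the sketch runs without network access."""
    corpus = [{"id": f"page-{i}"} for i in range(25)]
    return SimpleNamespace(docs=corpus[start:start + rows], hits=len(corpus))


pages = fetch_all(fake_search, "type:page", page_size=10)
print(len(pages))
```

With the real client, the call would look like `fetch_all(solr.search, q['q'])`, followed by `pd.DataFrame(...)` on the returned list.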

# **2. Semantic Search to Identify New Keywords and Filter Articles**
Next, we apply a semantic search pipeline using a transformer model to find semantically related keywords and to further filter the articles based on how they discuss the loss of life.

### **A. Discovering New Relevant Keywords**
For example, we can take a target term like “naturkatastrophen” (natural disasters) or a combined query phrase (e.g., “Krieg und Lawinentragödie”) and find words in our corpus that are semantically similar to it. This may reveal additional keywords that newspapers used to frame the disaster, such as terms that either naturalize the event or subtly imply military culpability.

This step helps surface terms that may not have been obvious at first, such as potential synonyms or related concepts (e.g., “naturgewalt,” “kriegspropaganda,” “feindbilder,” “tragödie,” “opfer”). The code below uses “lawinentragödie” as the target term.

### **B. Document-Level Semantic Filtering**
We can also use document-level semantic search to prioritize articles that discuss the loss of life in a natural disaster within a wartime context. For example, using a query such as “Verlust von Menschenleben durch Lawine” can help filter for the relevant articles.
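
Document-level filtering of this kind comes down to comparing a query vector with one vector per page and keeping the pages whose cosine similarity exceeds a cutoff. The sketch below illustrates that step with plain Python lists. The three-dimensional vectors are invented toy values, not real model embeddings, and `cosine_similarity` and `rank_documents` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, threshold):
    """Keep documents scoring above the threshold, best match first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    kept = [(doc_id, score) for doc_id, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed for illustration only)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "page_a": [1.0, 0.0, 1.0],  # same direction as the query
    "page_b": [1.0, 1.0, 0.0],  # partial overlap with the query
    "page_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

ranked = rank_documents(query_vec, doc_vecs, threshold=0.4)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

In the notebook itself, `util.cos_sim` from sentence-transformers performs the same comparison on model embeddings.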

In [3]:
!pip install --upgrade torch
!pip install --upgrade transformers
!pip install --upgrade sentence-transformers


In [4]:
import pysolr
import pandas as pd
import re
from collections import Counter

from sentence_transformers import SentenceTransformer, util

import torch


# Text preprocessing: use 'plainpagefulltext' if available, otherwise 'title'
def preprocess_text(text):
    """Lowercase the text and keep only German letters and whitespace."""
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r'[^a-zäöüß\s]', '', text)
    return text

if 'plainpagefulltext' in df_api.columns:
    df_api['processed_text'] = df_api['plainpagefulltext'].apply(preprocess_text)
else:
    df_api['processed_text'] = df_api['title'].apply(preprocess_text)

# Extract the unique words of each processed page
def get_unique_words(text):
    return list(set(text.split()))

all_words = []
for text in df_api['processed_text']:
    all_words.extend(get_unique_words(text))

# Each page contributes a word at most once, so these counts are document frequencies
word_freq = Counter(all_words)
unique_words = list(word_freq.keys())

min_freq = 5
filtered_words = [word for word, freq in word_freq.items() if freq >= min_freq]

print(f"Number of words before filtering: {len(word_freq)}")
print(f"Number of words after filtering (at least {min_freq} occurrences): {len(filtered_words)}")

# Load the transformer model for word-similarity comparison
model_word = SentenceTransformer('sentence-transformers/LaBSE', device='cuda' if torch.cuda.is_available() else 'cpu')

target_term = "lawinentragödie"  # target term
target_embedding = model_word.encode([target_term], batch_size=32, show_progress_bar=True)
# Note: this embeds every unique word; filtered_words (computed above) is not applied here
word_embeddings = model_word.encode(unique_words, batch_size=32, show_progress_bar=True)

# Compute the cosine similarity and build a DataFrame
similarities = util.cos_sim(target_embedding, word_embeddings)[0].tolist()
word_sim_df = pd.DataFrame({
    'word': unique_words,
    'similarity': similarities
})

# Show the 20 most similar keywords
top_similar = word_sim_df.sort_values('similarity', ascending=False).head(20)
print("New relevant keywords:")
print(top_similar)

# Load a model for document-level semantic search
model_doc = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Define the semantic search query
semantic_query = "Lawine"

# Encode the query and the article texts
article_embeddings = model_doc.encode(df_api['processed_text'].tolist(), convert_to_tensor=True)
query_embedding = model_doc.encode(semantic_query, convert_to_tensor=True)

# Compute the cosine similarities and add them to the DataFrame
similarities = util.pytorch_cos_sim(query_embedding, article_embeddings)[0]
df_api['similarity'] = similarities.cpu().numpy()

# Filter the articles and sort them by relevance
filtered_articles = df_api[df_api['similarity'] > 0.6].sort_values('similarity', ascending=False)

# Check which title field is available
if 'paper_title' in filtered_articles.columns:
    display_columns = ['id', 'paper_title', 'similarity']
else:
    display_columns = ['id', 'title', 'similarity']

print("Top semantically relevant articles on loss of life during avalanches in wartime:")
print(filtered_articles[display_columns].to_string(index=False))


Number of words before filtering: 283890
Number of words after filtering (at least 5 occurrences): 33619

New relevant keywords:
                       word  similarity
258296      fliegertragödie    0.766560
190425  eifersuchtstragödie    0.745064
261561      bübnentragödien    0.740158
197905       lawincuunglück    0.735731
109410       lawineuunglück    0.734753
258282       liebestragödie    0.730592
33242        lawmnenunglück    0.725489
267459      lawfnenunglucke    0.720340
265183   lawinenkataftrophe    0.719127
217089          fllmkomödie    0.715371
4525           lawinentürze    0.714783
175         lawinenunglücke    0.713612
71345         lawineunglück    0.708648
261779  menschheiistragödie    0.707550
87289           ehetragödie    0.705490
5654          lawinengelaht    0.704879
5047         lawinenunglück    0.699750
237493  menschheitstragödie    0.695463
8390        lawinenunglücks    0.695284
244911   lawinenkatastrcphe    0.694970


Top semantically relevant articles on loss of life during avalanches in wartime:
Empty DataFrame
Columns: [id, paper_title, similarity]
Index: []


In [6]:
print(df_api.columns)

Index(['id', 'pagenumber', 'paper_title', 'provider_ddb_id', 'provider',
       'zdb_id', 'publication_date', 'place_of_distribution', 'language',
       'thumbnail', 'pagefulltext', 'pagename', 'preview_reference',
       'plainpagefulltext', 'processed_text', 'similarity'],
      dtype='object')


The filtered DataFrame comes back empty, so the similarity threshold needs to be adjusted.

Threshold value (0.6): after the semantic similarity is computed, only articles with a score above 0.6 are kept. This cutoff is probably too high for this setup, which compares a one-word query (“Lawine”) against the text of entire newspaper pages: the pages are present in `df_api`, but they all score below the threshold. The next cell lowers the threshold to 0.4.
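
Before settling on a new cutoff, it can help to look at how the similarity scores are distributed and how many pages each candidate threshold would keep. The sketch below does this with the standard library. The scores in `toy_scores` are invented for illustration and are not results from this notebook, and `summarize_scores` and `count_above` are hypothetical helper names.

```python
import statistics


def summarize_scores(scores):
    """Return the minimum, median, and maximum of a list of scores."""
    ordered = sorted(scores)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


def count_above(scores, thresholds):
    """Count how many scores lie strictly above each threshold."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Invented example scores (assumption, for illustration only)
toy_scores = [0.12, 0.18, 0.25, 0.31, 0.38, 0.42, 0.47, 0.54]

summary = summarize_scores(toy_scores)
counts = count_above(toy_scores, [0.6, 0.5, 0.4, 0.3])
print(summary)
print(counts)
```

On the real data, the input would be `df_api['similarity'].tolist()`; pandas' `Series.describe()` gives a comparable summary.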

In [7]:
# Same document-level model as in the previous cell
model_doc = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Define the semantic search query
semantic_query = "Lawine"

# Encode the query and the article texts
article_embeddings = model_doc.encode(df_api['processed_text'].tolist(), convert_to_tensor=True)
query_embedding = model_doc.encode(semantic_query, convert_to_tensor=True)

# Compute the cosine similarities and add them to the DataFrame
similarities = util.pytorch_cos_sim(query_embedding, article_embeddings)[0]
df_api['similarity'] = similarities.cpu().numpy()

# Filter the articles and sort them by relevance
filtered_articles = df_api[df_api['similarity'] > 0.4].sort_values('similarity', ascending=False)  # threshold lowered from 0.6 to 0.4

# Check which title field is available
if 'paper_title' in filtered_articles.columns:
    display_columns = ['id', 'paper_title', 'similarity']
else:
    display_columns = ['id', 'title', 'similarity']

print("Top semantically relevant articles on loss of life during avalanches in wartime:")
print(filtered_articles[display_columns].to_string(index=False))


Top semantically relevant articles on loss of life during avalanches in wartime:
                                                                                     id                                                                                                                                                                                                                                                            paper_title  similarity
                              V3KETSVMDU3SBMTMEEZGJDDUIGFRRK6S-ALTO1772534_DDB_FULLTEXT                                                                                                                                                                                                                                      Mülheimer Volkszeitung. 1908-1919    0.535781
                             KGGVRL65XA5F2KP4YASK5XQODIWCCCU3-ALTO10119131_DDB_FULLTEXT                                                                                                          

This filtering helps pinpoint which articles discuss the loss of life in avalanches. By examining their language, one can then assess whether they frame the events as unavoidable acts of nature or subtly (or overtly) attribute them to military circumstances or enemy actions.
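
To examine the language of the retrieved pages, a keyword-in-context view can pull out the passages around each mention of the search term. The following is a minimal sketch using only the standard library. `keyword_in_context` is a hypothetical helper, and the sample sentence is invented for illustration rather than taken from the corpus.

```python
import re


def keyword_in_context(text, keyword, width=40):
    """Return a snippet of up to `width` characters on each side of every match."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end].strip())
    return snippets


# Invented sample text (assumption, not a real newspaper page)
sample = (
    "Bei dem Vormarsch wurde eine Kompanie von einer Lawine verschüttet. "
    "Die Lawine ging am Morgen nieder."
)

snippets = keyword_in_context(sample, "lawine", width=30)
for snippet in snippets:
    print(snippet)
```

Applied to the notebook's data, the function could be mapped over `filtered_articles['plainpagefulltext']` to collect the passages worth reading closely. Matching is by substring, so inflected forms such as “Lawinen” are also caught.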