Part 2 - Event Detection & Linking

In this part, i wrote a script for event detection & linking for the content we scraped. 

My first idea was to extract the qualities we desired (such as union presence, number of workers on strike etc.) automatically from the articles. In this endeavour, i attempted to use sequence linking for both training the detection (using spans of a sentence in json format) and event linking. After producing very low evaluation scores (see sequence_linking_deneme.ipynb), i decided to manually annotate the spans using doccano. Unfortunately, doccano works terribly with Turkish; so that was a waste. I then gave up on automatic extraction completely, instead opting to manually label the relevant values for our linked events.

My next idea was annotating 738 of our articles by hand for their relevance (1 for relevant and 0 for irrelevant) and training the model on this set, with the majority of them being labeled negative (160 positive, 578 negative) as to reflect the pool of articles. Although training was fine, this approach lacked in event linking; producing huge clusters (see event detection & linking).

To overcome this, we attempted to use rule-based clustering for cross-document linking. However, because of normalization, tuning and rule errors; the model was not able to effectively link events (see detection_extraction_deneme3.ipynb)

Then, we tried to add a NER package to do the linking, but the Turkish depository could not be accessed (see detection_extraction_deneme4.ipynb).



Finally, after all these trials, we arrived at our last model. The details of each cell will be explained on the way.

In [1]:

import re
import math
import numpy as np
import pandas as pd

import torch
from collections import Counter, defaultdict
from openpyxl import Workbook

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import precision_recall_curve


In [2]:
#from google.colab import files
#uploaded = files.upload()


In [3]:
!pip -q install transformers accelerate evaluate openpyxl scikit-learn pandas numpy torch


In [4]:
!pip -q install "transformers>=4.38"

import re
import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE


'cuda'

After working on jupyter notebook up to this point, i decided to switch to google collab since the existence of GPU's allowed for much shorter training & prediction durations when compared to using only my PC's CPU.

In [5]:
PATH_XLSX = "evrensel_isci_sendika_2024_dec2025_clean_fin_uncorrupted_real.xlsx"  
df = pd.read_excel(PATH_XLSX)

required_cols = ["EVENT_RELEVANT", "EVENT_ID", "title", "content","date"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns in XLSX: {missing}")
TITLE_COL = "title"
CONTENT_COL = "content"
TEXT_COL = "text"
DATE_COL = "date"    
MANUAL_COL = "EVENT_RELEVANT"  
EVENT_ID_COL = "EVENT_ID"      
EVENT_ID = EVENT_ID_COL

LINK = "link"

df.shape, df.columns.tolist()


((9186, 15),
 ['title',
  'date',
  'link',
  'content',
  'EVENT_RELEVANT',
  'EVENT_ID',
  'Unnamed: 6',
  'error',
  'Unnamed: 8',
  'Unnamed: 9',
  'Unnamed: 10',
  'Unnamed: 11',
  'Unnamed: 12',
  650,
  160])

Now in the following part, we dip our toes into some text naturalization. This will be a huge problem for us throughout the script. After that, we correct the date parsing. They were written previously as "13 Ağustos 2024", we turned them to "13.08.2024". This will be important later for event linking, since days between article dates will be one of our constraints.

In [6]:
# %%

SINGLE_LETTER_TOKEN_RE = re.compile(r"\b[^\W\d_]\b", flags=re.UNICODE)

TR_MONTHS = {
    "ocak": 1, "şubat": 2, "subat": 2, "mart": 3, "nisan": 4, "mayıs": 5, "mayis": 5,
    "haziran": 6, "temmuz": 7, "ağustos": 8, "agustos": 8, "eylül": 9, "eylul": 9,
    "ekim": 10, "kasım": 11, "kasim": 11, "aralık": 12, "aralik": 12
}

def parse_tr_datetime(s):
    if s is None or (isinstance(s, float) and pd.isna(s)):
        return pd.NaT
    s = str(s).strip()
    if not s:
        return pd.NaT

    # Example: "8 Şubat 2024 13:43" (sometimes with extra whitespace)
    m = re.search(r"(\d{1,2})\s+([A-Za-zÇĞİÖŞÜçğıöşü]+)\s+(\d{4})(?:\s+(\d{1,2}):(\d{2}))?", s)
    if not m:
        return pd.NaT

    day = int(m.group(1))
    month_name = m.group(2).lower().replace("ı", "i")  # helps "mayıs/kasım" normalization
    # But keep Turkish chars too:
    month_name = m.group(2).lower()
    month = TR_MONTHS.get(month_name, None)
    if month is None:
        # try ascii fallback
        month = TR_MONTHS.get(month_name.replace("ş","s").replace("ğ","g").replace("ü","u").replace("ö","o").replace("ç","c").replace("ı","i"), None)
    if month is None:
        return pd.NaT

    year = int(m.group(3))
    hh = int(m.group(4)) if m.group(4) else 0
    mm = int(m.group(5)) if m.group(5) else 0

    return pd.Timestamp(year=year, month=month, day=day, hour=hh, minute=mm)

# Create a clean datetime column and use it everywhere downstream
df["DATE_DT"] = df[DATE_COL].apply(parse_tr_datetime)

print("Parsed DATE_DT non-null:", int(df["DATE_DT"].notna().sum()), "/", len(df))
df.loc[df["DATE_DT"].isna(), [DATE_COL]].head(10)

DATE_COL = "DATE_DT"


Parsed DATE_DT non-null: 9167 / 9186


In [7]:
print("df shape:", df.shape)
print("Columns ok:", all(c in df.columns for c in [TITLE_COL, CONTENT_COL, TEXT_COL, DATE_COL]))


df shape: (9186, 16)
Columns ok: False


The "Columns ok: False" seems worrying but is actually not. This will be solved when we create the "text" column in the next cell. Now, we will create that cell and apply some more normalization by clearing weird spaces.

In [8]:
# %%
def normalize_text(x):
    if pd.isna(x):
        return ""
    x = str(x).replace("\u00A0", " ")  # non-breaking space
    x = re.sub(r"\s+", " ", x).strip()
    return x

df["title"] = df["title"].apply(normalize_text)
df["content"] = df["content"].apply(normalize_text)

# final text fed into the model
df["text"] = (df["title"].astype(str) + "\n\n" + df["content"].astype(str)).str.strip()
df[["title","content","text"]].head(2)


Unnamed: 0,title,content,text
0,Bartın'da Hema'ya ait maden ocağında vagonları...,Bartın'ın Amasra ilçesindeki Hema Enerji şirke...,Bartın'da Hema'ya ait maden ocağında vagonları...
1,Bu soygun düzeni değişmeli,Pendik Marmara Eğitim ve Araştırma Hastanesind...,Bu soygun düzeni değişmeli\n\nPendik Marmara E...


In the excel sheet, EVENT_RELEVANT values are recorded as both numeric and string variables by mistake. We will normalize them here.

In [9]:

import numpy as np
import pandas as pd

def normalize_label(x):
    if pd.isna(x):
        return np.nan

  
    if isinstance(x, (int, np.integer, float, np.floating)):
        if x == 0 or x == 0.0:
            return 0
        if x == 1 or x == 1.0:
            return 1
        return np.nan

    s = str(x).strip().lower()
    if s in {"0", "0.0", "no", "n", "false"}:
        return 0
    if s in {"1", "1.0", "yes", "y", "true"}:
        return 1

    return np.nan

df["LABEL_CLEAN"] = df["EVENT_RELEVANT"].apply(normalize_label)
labeled_mask = df["LABEL_CLEAN"].notna()

print("Total rows:", len(df))
print("Labeled rows:", int(labeled_mask.sum()))
df.loc[labeled_mask, ["EVENT_RELEVANT","LABEL_CLEAN"]].head(10)


Total rows: 9186
Labeled rows: 738


Unnamed: 0,EVENT_RELEVANT,LABEL_CLEAN
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0
5,0.0,0.0
6,0.0,0.0
7,0.0,0.0
8,0.0,0.0
9,0.0,0.0


Now we can finally start training. A classic 20/80 test/train split will be used. The model will be trained on the 738 manually labeled rows as said before. We can observe that both the training and testing sets are balanced with respect to the distribution of 0/1's.

In [10]:

df_labeled = df.loc[labeled_mask].copy()

df_labeled["label"] = df_labeled["LABEL_CLEAN"].astype(int)

train_df, val_df = train_test_split(
    df_labeled,
    test_size=0.2,
    random_state=42,
    stratify=df_labeled["label"]
)

print("Train size:", len(train_df), "Val size:", len(val_df))
train_df["label"].value_counts(normalize=True), val_df["label"].value_counts(normalize=True)


Train size: 590 Val size: 148


(label
 0    0.783051
 1    0.216949
 Name: proportion, dtype: float64,
 label
 0    0.783784
 1    0.216216
 Name: proportion, dtype: float64)

Just a final check before moving on.

In [11]:

s = df["EVENT_RELEVANT"]

print("dtype:", s.dtype)
print("non-null count:", s.notna().sum())


u = s.dropna().unique()
print("unique values sample (up to 50):", u[:50])


u_str = pd.Series(u).astype(str).str.strip()
print("stringified sample (up to 50):", u_str.head(50).tolist())


dtype: float64
non-null count: 738
unique values sample (up to 50): [0. 1.]
stringified sample (up to 50): ['0.0', '1.0']


We are all set. As specified in the midterm report, this project will not train a model from scratch; but modify and fine-tune BERTurk for strike detection purposes.

In [12]:

MODEL_NAME = "dbmdz/bert-base-turkish-cased" 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

MAX_LEN = 384 

class TextClsDataset(Dataset):
    def __init__(self, texts, labels=None, tokenizer=None, max_len=384):
        self.texts = list(texts)
        self.labels = None if labels is None else list(labels)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        enc = self.tokenizer(
            self.texts[i],
            truncation=True,
            max_length=self.max_len,
            padding=False
        )
        item = {k: torch.tensor(v) for k, v in enc.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(int(self.labels[i]))
        return item

train_ds = TextClsDataset(train_df["text"], train_df["label"], tokenizer, MAX_LEN)
val_ds   = TextClsDataset(val_df["text"],   val_df["label"],   tokenizer, MAX_LEN)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The above error is fine, since the public model still can be accessed.

In [13]:
# %%
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(DEVICE)

pos = int(train_df["label"].sum())
neg = int(len(train_df) - pos)

# More weight to positive class if positives are rare
class_weights = torch.tensor([1.0, (neg / max(pos, 1))], dtype=torch.float, device=DEVICE)
class_weights


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([1.0000, 3.6094], device='cuda:0')

Let us train the model on 3 epochs. We will implement the ROC-AUC score, since this is a binary classification task. We will also weigh our trainer; since we value recall highly and relevant strikes are far less common then irrelevant ones in the article database.

In [14]:

import inspect

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=-1).numpy()[:, 1]
    return {
        "roc_auc": float(roc_auc_score(labels, probs)) if len(np.unique(labels)) > 1 else float("nan")
    }

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Weighted loss
def custom_loss_fn(model, inputs, return_outputs=False):
    labels = inputs.pop("labels")
    outputs = model(**inputs)
    logits = outputs.logits
    loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
    loss = loss_fct(logits, labels)
    return (loss, outputs) if return_outputs else loss

# Some transformers versions support compute_loss in Trainer; fallback safely
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        return custom_loss_fn(model, inputs, return_outputs=return_outputs)

args = TrainingArguments(
    output_dir="berturk_event_detect",
    learning_rate=2e-5,
    per_device_train_batch_size=8 if DEVICE=="cuda" else 4,
    per_device_eval_batch_size=16 if DEVICE=="cuda" else 8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
    greater_is_better=True,
    fp16=(DEVICE=="cuda"),
    report_to=[]
)

trainer = WeightedTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()


  trainer = WeightedTrainer(


Epoch,Training Loss,Validation Loss,Roc Auc
1,0.6477,0.366318,0.938578
2,0.3923,0.458118,0.94208
3,0.1235,0.420896,0.945043


TrainOutput(global_step=222, training_loss=0.32484422932873974, metrics={'train_runtime': 25.5963, 'train_samples_per_second': 69.151, 'train_steps_per_second': 8.673, 'total_flos': 349279925990400.0, 'train_loss': 0.32484422932873974, 'epoch': 3.0})

The ROC AUC score is highly promising, although the model is also somewhat prone to overfitting according to the validation loss over the epochs.

In [15]:
# %%
val_out = trainer.predict(val_ds)
val_logits = val_out.predictions
val_labels = val_out.label_ids

val_probs = torch.softmax(torch.tensor(val_logits), dim=-1).numpy()[:, 1]
val_preds = (val_probs >= 0.5).astype(int)

print("ROC-AUC:", roc_auc_score(val_labels, val_probs) if len(np.unique(val_labels)) > 1 else "NA")
print(classification_report(val_labels, val_preds, digits=3))


ROC-AUC: 0.9450431034482759
              precision    recall  f1-score   support

           0      0.957     0.957     0.957       116
           1      0.844     0.844     0.844        32

    accuracy                          0.932       148
   macro avg      0.900     0.900     0.900       148
weighted avg      0.932     0.932     0.932       148



Our scores for 1's are lower than 0's, although they are still workable. This is likely because we have quite a small support in the validation set for 1's, with only 32 of them. Now, it is prediction time. To not lose too many articles (since we prioritize recall as explained before), we will pick a threshold that yields a minimum recall of 0.85. This will most likely be a low threshold.

In [16]:

full_ds = TextClsDataset(df["text"], labels=None, tokenizer=tokenizer, max_len=MAX_LEN)
full_out = trainer.predict(full_ds)
full_logits = full_out.predictions
full_probs = torch.softmax(torch.tensor(full_logits), dim=-1).numpy()[:, 1]

import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold_min_recall(y_true, y_prob, min_recall=0.85):
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    best_t = 0.0
    best_p = -1.0

    for t, p, r in zip(thresholds, precision[1:], recall[1:]):
        if r >= min_recall and p > best_p:
            best_p = p
            best_t = float(t)


    if best_p < 0:
      
        best_t = float(thresholds[np.argmax(recall[1:])])

    return best_t


BEST_T = pick_threshold_min_recall(val_labels, val_probs, min_recall=0.85)
print("Chosen threshold (min_recall=0.85):", BEST_T)

df["EVENT_PRED"] = (full_probs >= BEST_T).astype(int)
df["EVENT_PRED"].value_counts()


Chosen threshold (min_recall=0.85): 0.04676847159862518


Unnamed: 0_level_0,count
EVENT_PRED,Unnamed: 1_level_1
0,7331
1,1855


Out of the 9186 articles, our model has predicted 1855 of them to be strike relevant. Since this is such an important part of our model, we will re-check it just in case.

In [17]:
class TextClsDataset(torch.utils.data.Dataset):
    # Keep compatible with your training dataset structure:
    def __init__(self, texts, labels=None, tokenizer=None, max_len=384):
        self.texts = list(texts)
        self.labels = None if labels is None else list(labels)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_len,
            padding=False,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

full_ds = TextClsDataset(df[TEXT_COL], labels=None, tokenizer=tokenizer, max_len=MAX_LEN)
full_out = trainer.predict(full_ds)
full_logits = full_out.predictions
full_probs = torch.softmax(torch.tensor(full_logits), dim=-1).numpy()[:, 1]

df["EVENT_PROB"] = full_probs
df["EVENT_PRED"] = (df["EVENT_PROB"] >= BEST_T).astype(int)

print(df["EVENT_PRED"].value_counts(dropna=False))

EVENT_PRED
0    7331
1    1855
Name: count, dtype: int64


All is good. We can finally move onto the linking part.

We know that we have 1855 strike-relevant articles. However, due to heavy semantic similarities, BERTurk cannot specifically detect wage/collective bargaining based strike events. Instead, it just detects "strike events" in general. We will implement some rules to narrow these 1855 articles down to wage/collective bargaining relevant ones.

At first, we had only CB_KEEP but later decided to split it into two parts: CB_KEEP_STRONG and CB_KEEP_WEAK. The reason is simple; CB_EXCLUDE was getting rid of too many articles (false negatives). You can find this version as a comment in the code. So, we had to relax our narrowing a bit. That is why we also added CB_DISMISSAL; we are not interested in strikes occuring because of dismissals but we do not want to lose the wage/collective bargaining strikes which also have workers dismissed. The CB_KEEP_STRONG and CB_KEEP_WEAK lists are self explanatory (relating to wage/collective bargaining. So is the CB_EXCLUDE, especially considering Evrensel's reporting tendencies.

In [18]:
CB_KEEP = re.compile(
    r"(toplu\s+i[şs]\s+sözleşme|toplu\s+sözleşme|tis\b|"
    r"ücret|zam|sözleşme|protokol|arabulucu|"
    r"grev\s+karar|grev\s+başla|grev\s+çıktı|iş\s+bırak)",
    re.IGNORECASE
)

# Strong CB / bargaining signals (high precision)
CB_KEEP_STRONG = re.compile(
    r"(?i)\b("
    r"toplu\s+i[şs]\s+sözleşme|toplu\s+sözleşme|\btis\b|"
    r"arabulucu|arabuluculuk|"
    r"görüşme(leri)?|müzakere|protokol|"
    r"imza(landı|sı|lamak)|sözleşme\s+imza|"
    r"zam\s+oranı|ücret\s+artış|"
    r"ikramiye|sosyal\s+hak(lar)?"
    r")\b"
)
# Weaker wage-related signals (higher recall, lower precision)
CB_KEEP_WEAK = re.compile(
    r"(?i)\b(ücret|maaş|zam|"
    r"ikramiye|prim|"
    r"yemek\s+ücreti|yol\s+ücreti|"
    r"tazminat|kıdem|ihbar|"
    r"asgari\s+ücret)\b",
    re.IGNORECASE
)


CB_EXCLUDE = re.compile(
    r"(?i)\b("
    r"iş\s+cinayet|kaza|yaralandı|öl(ü|u)m|"
    r"enflasyon|hayat\s+pahalılığı|pahalılık|"
    r"genel\s+değerlendirme|genel\s+yorum|"
    r"seçim|oy\s+ver|sandık|miting|"
    r"sendikalaş|"
    r"mülteci|deprem|sel|yangın|"
    r"iş\s+kaz|göçük|gözalt|tutuk|"
    r"dava|mahkeme|ziyaret|dayanışma|anma|basın\s+açıklama"
    r")\b"
)

CB_DISMISSAL = re.compile(
    r"(?i)\b("
    r"işten\s+(çıkar(ıl|ma)|at(ıl|ma)|çıkarıldı|atıldı)|"
    r"işten\s+çıkarmalar?|toplu\s+çıkarma|"
    r"kod\s*29|kod\s*46|"
    r"tazminat(sız|siz)|"
    r"ücretsiz\s+izin"
    r")\b"
)



def cb_filter_soft(text: str) -> int:
    t = str(text or "")

    strong = bool(CB_KEEP_STRONG.search(t))
    weak   = bool(CB_KEEP_WEAK.search(t))
    dism   = bool(CB_DISMISSAL.search(t))
    excl   = bool(CB_EXCLUDE.search(t))


    if strong:
        return 1


    if weak and not excl:
        return 1

   
    if dism and weak and not excl:
        return 1

    return 0


#def cb_filter(text: str) -> int:
#    t = str(text or "")
#    if not CB_KEEP.search(t):
#        return 0
#    if CB_EXCLUDE.search(t):
#        return 0
#    return 1

df["EVENT_PRED_CB"] = df.apply(
    lambda r: int(r["EVENT_PRED"] == 1 and cb_filter_soft(r[TEXT_COL]) == 1),
    axis=1
)

print("EVENT_PRED:", int(df["EVENT_PRED"].sum()))
print("EVENT_PRED_CB:", int(df["EVENT_PRED_CB"].sum()))

EVENT_PRED: 1855
EVENT_PRED_CB: 1150


So, we have reduced the wage/collective bargaining relevant articles to 1150 from the initial candidates of 1855.

Now, first we will implement some canonicalization (dropping suffixes, firm indicators etc) while also trying to fix broken encoding like "i ş" (this attempt will not work). Since we will be linking through employer/firm names, we must make sure that unwanted recurring tokens like union names, people names, any other junk and common Turkish words are not present. Although this list is not bad, as we will see, the issue of peope names are very, very hard to overcome in Evrensel since for some reason; they record their reporters inside the text itself instead of having them as authors. Since no comprehensive list of reporters can be found (and since weird tokenizations mix some parts of the reporters' names with a part of the text), this part was appended by hand through trial and error and observing the most common results (this is actually the case for all of the lists but is especially true for person terms.).

In [19]:
import unicodedata


LEGAL_RE = re.compile(
    r"\b(a\.?ş\.?|aş|anonim|şirketi|şti|ltd|limited|inc|corp|co|holding|sanayi|ticaret|ve)\b",
    flags=re.IGNORECASE
)
GENERIC_TAIL_RE = re.compile(
    r"\b(fabrika(sı|si|da|de|nda|nde)?|işletme(si|de|da|nde|nda)?|tesis(leri|de|da|nde|nda)?|işyeri(nde|ne|ni)?)\b",
    flags=re.IGNORECASE
)


def _fold_tr(s: str) -> str:
    if s is None or (isinstance(s, float) and pd.isna(s)):
        return ""
    s = str(s)

    s = unicodedata.normalize("NFKC", s)


    s = s.replace("’", "'").replace("`", "'")


    s = s.strip().lower()
    s = re.sub(r"\s+", " ", s)

    s = re.sub(r"\b([a-zçğıöşü])\s+([a-zçğıöşü])\b", r"\1\2", s)
    s = re.sub(r"\b([a-zçğıöşü])\s+([a-zçğıöşü])\b", r"\1\2", s)  # run twice

    return s

def canonical_employer(name: str) -> str:
    s = _fold_tr(name)
    if not s:
        return ""
    s = re.sub(r"[^\w\sçğıöşü0-9'-]", " ", s, flags=re.UNICODE)
    s = LEGAL_RE.sub(" ", s)
    s = GENERIC_TAIL_RE.sub(" ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s



UNION_TERMS = {
    "disk", "türk iş", "turk is", "hak iş", "kesk",
    "türk metal", "metal iş", "birleşik metal iş", "birlesik metal is",
    "genel iş", "petrol iş", "tek gıda iş", "tek gida is",
    "tüm bel sen", "tum bel sen", "tüm bel-sen", "tum bel-sen",
    "birtek sen", "birleşik metal", "tüm bel","eğitim sen", "iş gebze",
    "iş aliağa",
}
PERSON_TERMS = {
    "hilal tok", "ramis sağlam", "ramis saglam", "özkan atar", "ozkan atar",
    "hayrettin çakmak", "hayrettin cakmak", "cemil tugay",
    "hasret gültekin kozan", "hasret gultekin kozan",
    "sevda karaca", "iskender bayhan", "nazlıer çerik",
    "andac aydın arıduru", "volkan pekal", "özer akdemir",
    "duygu ayber gültekin", "dudu selçuk", "michael roberts", "çerik",
    "murat özveri", "andaç aydın", "ayça pektaş", "cem turan", "mehmet güngör",
    "recep tayyip", "emirhan durmaz", "ergün atalay", "ergun atalay", "vedat ışıkhan"
    "mehmet türkmen", "aydın arıduru", "ercan gül", "duygu ayber","aydın ariduru i stanbul",
    "ali çeltek", "hasret gültekin", "gültekin kozan gebze", "arzu erkan","ali rıza",
    "eda aktaş", "halil i mrek",

}

DROP_IF_CONTAINS = [
    "genel başkan", "genel baskan", "başkanı", "baskani", "genel sekreter",
    "genel sekreteri",
    "şube başkanı", "sube baskani", "başkan yardımcısı", "baskan yardimcisi"
    "belediyesi", "organize sanayi", "osb",
    "partisi", "chp", "akp", "mhp", "emep",
    "servisi", "muhabir", "haber merkezi", "gazetesi", "ajans",
    # generic wage/strike words are NOT employers:
    "ücret", "ucret", "zam", "tis", "sözleşme", "sozlesme", "grev", "işçi", "isci",
     "merhaba evrensel", "organize", "çalışma bakanlığı", "şube", "izmir şube",
    "yönetim kurulu", "genel merkez", "genel merkezi",
    "milletvekili", "bakan", "bakanı", "bakanlığı", "bakanligi",
     "valilik", "tbmm", "buyuksehir",
    "parti", "partisi", "dem parti", "i zmir", "di sk", "i şi", "il örgütü","hizmet sektörü",
    "i şçi sendi ka servi si",
]

STARTER_BAD = {
    "yapılan", "yapilan", "önünde", "onunde",
    "sosyal", "sabah", "aynı", "ayni", "en", "daha", "gece",
    "kamu", "türkiye", "turkiye", "açıklamada", "aciklamada",
    "geçtiğimiz", "gectigimiz", "günlerde", "gunlerde", "çok", "cok"
}

UNION_TERMS = {_fold_tr(x) for x in UNION_TERMS}
PERSON_TERMS = {_fold_tr(x) for x in PERSON_TERMS}
DROP_IF_CONTAINS = [_fold_tr(x) for x in DROP_IF_CONTAINS]
STARTER_BAD = {_fold_tr(x) for x in STARTER_BAD}

def looks_like_union(k: str) -> bool:
    kk = _fold_tr(k)
    if not kk:
        return False

    kk_ns = kk.replace(" ", "")

    # 1) normal substring match (space robust)
    for u in UNION_TERMS:
        uu = _fold_tr(u)
        if uu and (uu in kk or uu.replace(" ", "") in kk_ns):
            return True


    toks = kk.split()
    s = set(toks)

    if ("iş" in s or "is" in s) and ("genel" in s or "metal" in s or "petrol" in s or "özçelik" in s or "ozcelik" in s):
        return True


    if ("birleşik" in s or "birlesik" in s) and ("metal" in s):
        return True

    return False
def looks_like_person(k: str) -> bool:
    kk = _fold_tr(k)
    if not kk:
        return False
    kk_ns = kk.replace(" ", "")
    for p in PERSON_TERMS:
        pp = _fold_tr(p)
        if pp and (pp in kk or pp.replace(" ", "") in kk_ns):
            return True
    return False


def looks_like_sentence_starter(k: str) -> bool:
    kk = _fold_tr(k)
    toks = kk.split()
    if not toks:
        return True
    if toks[0] in STARTER_BAD:
        return True
    if len(toks) >= 2 and (toks[0] in STARTER_BAD or toks[1] in STARTER_BAD):
        return True
    return False

def looks_like_non_entity_phrase(k: str) -> bool:
    kk = _fold_tr(k)
    if not kk:
        return True
    if kk.endswith(("deki", "daki", "teki", "taki")):
        return True
    if any(w in kk for w in ["açıklamada", "aciklamada", "saatlerinde", "günlerde", "gunlerde"]):
        return True
    return False

def should_drop_orgish(k: str) -> bool:
    kk = _fold_tr(k)
    if not kk or len(kk) < 3:
        return True
    if looks_like_union(kk) or looks_like_person(kk):
        return True
    if looks_like_sentence_starter(kk) or looks_like_non_entity_phrase(kk):
        return True
    for bad in DROP_IF_CONTAINS:
        if bad in kk:
            return True
    return False

Now that we have our lists, we can start the phrase mining process for firm names. We will build a candidate firm list by searching through the first 1200 characters of each text, using trigger and stop terms. Then, we will drop all the candidates that exist in our created lists from above.

In [20]:

ALL_TRIG = re.compile(
    r"(grev(e)? çıktı|grev başladı|iş bırak(tı|ıyor)|üretim(i)? durdur|"
    r"grevde|grev sürüyor|(\d+)\.? ?gün(ü)?nde|"
    r"anlaşma sağlandı|grev bitti|grev sona erdi|protokol imzalandı|"
    r"kabul edildi|reddedildi|imza(landı)?|uzlaş(ma)?|anlaş(ma)?|"
    r"direnişi sürüyor|direniş(i)?)",
    re.IGNORECASE
)

CONTENT_CHARS = 1200
WIN_CHARS = 220
MAX_WINS = 8
MIN_FREQ = 3
MAX_PHRASES = 5000
NGRAM_MIN = 2
NGRAM_MAX = 5

STOP = set("""
ve veya ile için gibi üzere da de ki mi mı mu mü
işçi işçileri grev grevi direniş direnişi eylem açıklama basın
sendika sendikası işçilerden işçilerin mücadele talep sözleşme toplu iş emekçi
emekçiler emekçileri örgütlü ama fakat lakin çünkü
""".split())

EMPLOYER_DROP_TERMS = [
    "iş sözleşmesi", "is sozlesmesi", "toplu iş sözleşme", "toplu is sozlesme",
    "sözleşme", "sozlesme", "protokol", "arabulucu", "uzlaşma", "uzlasma",
    "zam", "ücret", "ucret", "talepleri", "görüşme", "gorusme",
]

def should_drop_employer_key(k: str) -> bool:
    kk = _fold_tr(k)
    if should_drop_orgish(kk):  
        return True
    for bad in EMPLOYER_DROP_TERMS:
        if bad in kk:
            return True
    return False


def clean_for_phrase_mining(text: str) -> str:
    text = (text or "")
    text = text.replace("\u00A0", " ").replace("’","'").replace("`","'")
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize_simple(text: str):
    return re.findall(r"[A-Za-zÇĞİÖŞÜçğıöşü0-9]+", text)

def is_titlecase_like(tok: str) -> bool:
    if len(tok) < 2:
        return False
    if tok.isupper() and len(tok) >= 2:
        return True
    return tok[0].isupper() and any(c.islower() for c in tok[1:])

def trigger_windows_text(title: str, content: str, win_chars=WIN_CHARS, max_wins=MAX_WINS):
    t = clean_for_phrase_mining(title)
    c = clean_for_phrase_mining(content)[:CONTENT_CHARS]
    text = f"{t} {c}".strip()
    wins = []
    for m in ALL_TRIG.finditer(text):
        a = max(0, m.start() - win_chars)
        b = min(len(text), m.end() + win_chars)
        wins.append(text[a:b])
        if len(wins) >= max_wins:
            break
    return wins

phrase_counts = Counter()


mask_mine = df["EVENT_PRED_CB"] == 1

for t, c in zip(df.loc[mask_mine, TITLE_COL].astype(str), df.loc[mask_mine, CONTENT_COL].astype(str)):
    wins = trigger_windows_text(t, c)
    if not wins:
        continue
    for w in wins:
        toks = tokenize_simple(w)
        flags = [is_titlecase_like(tok) for tok in toks]

        i = 0
        while i < len(toks):
            if not flags[i]:
                i += 1
                continue
            j = i
            while j < len(toks) and flags[j]:
                j += 1

            span = toks[i:j]
            for n in range(NGRAM_MIN, NGRAM_MAX + 1):
                for k in range(0, len(span) - n + 1):
                    ng = span[k:k+n]
                    ng_l = [w.lower() for w in ng]
                    if any(w in STOP for w in ng_l):
                        continue
                    if all(w.isdigit() for w in ng):
                        continue
                    phrase = " ".join(ng)
                    phrase_counts[phrase] += 1

            i = j

candidates = [(p, cnt) for p, cnt in phrase_counts.items() if cnt >= MIN_FREQ]
candidates.sort(key=lambda x: x[1], reverse=True)

def is_plausible_firm_phrase(p: str) -> bool:
    k = canonical_employer(p)
    if not k:
        return False

    if len(k.split()) < 2 or len(k) < 6:
        return False

    if looks_like_union(k) or looks_like_person(k) or should_drop_employer_key(k):
        return False
    return True

firm_phrases = [p for p, cnt in candidates if is_plausible_firm_phrase(p)]


print("Candidate phrases:", len(candidates))
print("firm_phrases kept:", len(firm_phrases))
print("Top 30 candidates:")
for p, cnt in candidates[:30]:
    print(cnt, "-", p)

Candidate phrases: 2049
firm_phrases kept: 951
Top 30 candidates:
299 - Genel İş
270 - Birleşik Metal
269 - Metal İş
266 - Petrol İş
262 - Birleşik Metal İş
159 - Şube Başkanı
137 - Özçelik İş
123 - Organize Sanayi
114 - Emek Partisi
111 - BİRTEK SEN
103 - İş İzmir
102 - Genel Başkanı
95 - Temel Conta
89 - İş Genel
80 - Genel İş İzmir
78 - Türk İş
77 - DİSK Genel
70 - DİSK Genel İş
69 - Sanayi Bölgesi
68 - İş İstanbul
65 - Özak Tekstil
64 - Organize Sanayi Bölgesi
63 - İzmir Büyükşehir
63 - İstanbul Anadolu
62 - Anadolu Yakası
61 - İş İstanbul Anadolu
60 - Genel İş İstanbul
60 - İstanbul Anadolu Yakası
60 - Genel İş İstanbul Anadolu
59 - İş İstanbul Anadolu Yakası


From 2049 candidate phrases to keeping 951 firm phrases, we have done quite well. Let us now try to go even further and use a Turkish spaCy pipeline to extract canonicalized org_keys from our articles and to filter our organization keys.

In [21]:
import spacy
from spacy.pipeline import EntityRuler

nlp_ruler = spacy.blank("tr")
ruler = nlp_ruler.add_pipe("entity_ruler")

patterns = [{"label": "ORG", "pattern": p} for p in firm_phrases]
ruler.add_patterns(patterns)

def extract_org_keys_ruler(title: str, content: str):
    txt = f"{title} {content[:CONTENT_CHARS]}"
    doc = nlp_ruler(txt)
    orgs = []
    for ent in doc.ents:
        if ent.label_ != "ORG":
            continue
        k = canonical_employer(ent.text)
        if not k:
            continue
        orgs.append(k)
    # dedup preserve order
    return list(dict.fromkeys(orgs))

df["ORG_KEYS"] = [
    extract_org_keys_ruler(t, c)
    for t, c in zip(df[TITLE_COL].astype(str), df[CONTENT_COL].astype(str))
]

# filter
def is_good_org_key(k: str) -> bool:
    k2 = canonical_employer(k)
    if not k2:
        return False
    if len(k2.split()) < 2:
        return False
    if len(k2) < 6:
        return False
    if should_drop_employer_key(k2):
        return False
    return True

df["ORG_KEYS_FILTERED"] = df["ORG_KEYS"].apply(lambda ks: [canonical_employer(k) for k in (ks or []) if is_good_org_key(k)])

print("Docs with >=1 ORG_KEY before:", int((df["ORG_KEYS"].apply(lambda x: len(x or [])) > 0).sum()))
print("Docs with >=1 ORG_KEY after :", int((df["ORG_KEYS_FILTERED"].apply(lambda x: len(x or [])) > 0).sum()))



Docs with >=1 ORG_KEY before: 4304
Docs with >=1 ORG_KEY after : 4304


No docs were deleted, but this is not necessarily a bad thing. It likely helped with cleaning inside the articles. We now build regular expression patterns to specifize employer and union keys seperately instead of the "org_keys" we had before, which included both.

In [22]:
P_FACILITY = re.compile(
    r"(?P<name>[A-ZÇĞİÖŞÜ][\w’'-.]+(?:\s+[A-ZÇĞİÖŞÜ][\w’'-.]+){0,6})\s+"
    r"(fabrika(sı|sinda|sında|da|de|nda|nde)?|işyeri(nde|ne|ni)?|tesis(leri|de|da|nde|nda)?)",
    re.UNICODE
)
P_WORKERS = re.compile(
    r"(?P<name>[A-ZÇĞİÖŞÜ][\w’'-.]+(?:\s+[A-ZÇĞİÖŞÜ][\w’'-.]+){0,6})\s+işçi(leri|ler)?",
    re.UNICODE
)

def _as_list(x):
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return []
    if isinstance(x, list):
        return [str(i).strip() for i in x if str(i).strip()]
    s = str(x).strip()
    if not s:
        return []
    return [p.strip() for p in re.split(r"[;|,]\s*", s) if p.strip()]

def trigger_windows(text: str, window_chars: int = 180, max_wins: int = 6):
    t = text or ""
    wins = []
    for m in ALL_TRIG.finditer(t):
        a = max(0, m.start() - window_chars)
        b = min(len(t), m.end() + window_chars)
        wins.append(t[a:b])
        if len(wins) >= max_wins:
            break
    return wins

def extract_employer_candidates_from_text(title: str, content: str, org_list=None):
    """
    Keep mined candidates only if they already exist in ORG_KEYS_FILTERED for the doc.
    """
    txt = (str(title or "") + " " + str(content or "")).strip()
    wins = trigger_windows(txt, window_chars=180, max_wins=6)

    emp, uni = [], []
    org_list = org_list or []
    org_can = {_fold_tr(canonical_employer(o)) for o in org_list}

    for w in wins:
        for pat in (P_FACILITY, P_WORKERS):
            for m in pat.finditer(w):
                raw = m.group("name").strip()
                k = _fold_tr(canonical_employer(raw))
                if not k:
                    continue

                if k not in org_can:
                    continue

                if looks_like_union(k):
                    uni.append(k)
                    continue
                if looks_like_person(k):
                    continue
                if should_drop_employer_key(k):
                    continue

                emp.append(k)

    def dedup(seq):
        seen, out = set(), []
        for x in seq:
            if x and x not in seen:
                seen.add(x)
                out.append(x)
        return out

    return dedup(emp), dedup(uni)

def build_employer_and_union_keys(df_in, org_col="ORG_KEYS_FILTERED", max_emp_per_doc=3):
    df_out = df_in.copy()
    employer_keys, union_keys = [], []

    for t, c, orgs in zip(
        df_out[TITLE_COL].astype(str),
        df_out[CONTENT_COL].astype(str),
        df_out[org_col] if org_col in df_out.columns else [None] * len(df_out)
    ):
        org_list = _as_list(orgs)

        emp_from_org, uni_from_org = [], []
        for o in org_list:
            k = _fold_tr(canonical_employer(o))
            if not k:
                continue
            if looks_like_union(k):
                uni_from_org.append(k)
                continue
            if looks_like_person(k):
                continue
            if should_drop_employer_key(k):
                continue
            emp_from_org.append(k)

        emp_mined, uni_mined = extract_employer_candidates_from_text(t, c, org_list)

        def dedup(seq):
            seen, out = set(), []
            for x in seq:
                if x and x not in seen:
                    seen.add(x)
                    out.append(x)
            return out

        emp_all = dedup(emp_from_org + emp_mined)


        emp_clean = []
        for k in emp_all:
            kk = canonical_employer(k)   
            kk = _fold_tr(kk)            
            if not kk:
                continue
            if looks_like_union(kk):
                continue
            if looks_like_person(kk):
                continue
            if should_drop_employer_key(kk):
                continue
            emp_clean.append(kk)
        if not emp_clean and emp_from_org:
               emp_clean = emp_from_org[:1]

        emp = []
        seen = set()
        for x in emp_clean:
            if x not in seen:
                seen.add(x)
                emp.append(x)

        emp = emp[:max_emp_per_doc]

        uni = dedup(uni_from_org + uni_mined)

        employer_keys.append(emp)
        union_keys.append(uni)




    df_out["EMPLOYER_KEYS"] = employer_keys
    df_out["UNION_KEYS"] = union_keys
    return df_out

df_linked = df.copy()
df_linked = build_employer_and_union_keys(df_linked, org_col="ORG_KEYS_FILTERED", max_emp_per_doc=3)


print("Docs with >=1 ORG_KEY before:", int((df["ORG_KEYS"].apply(lambda x: len(x or [])) > 0).sum()))
print("Docs with >=1 ORG_KEY after :", int((df["ORG_KEYS_FILTERED"].apply(lambda x: len(x or [])) > 0).sum()))




Docs with >=1 ORG_KEY before: 4304
Docs with >=1 ORG_KEY after : 4304


In [23]:
from collections import Counter

c = Counter()
for ks in df_linked.loc[df_linked["EVENT_PRED_CB"]==1, "EMPLOYER_KEYS"]:
    for k in (ks or []):
        c[k] += 1

print("Top 30 EMPLOYER_KEYS:")
for k,v in c.most_common(30):
    print(v, "-", k)


Top 30 EMPLOYER_KEYS:
43 - büyükşehir belediyesi
36 - temel conta
29 - ge grid solutions
27 - büyükşehir belediyesine
25 - i stanbul anadolu yakası
25 - karşıyaka belediyesi
25 - buca belediyesi
24 - i şçi sendi ka
23 - iş sendikasının
20 - yolbulan metal
20 - hitachi energy
17 - as plastik
17 - schneider elektrik
17 - gültekin kozan
17 - i smail cem şimşek
16 - sen genel
15 - green transfo
14 - metal sanayicileri
14 - mehmet türkmen
14 - büyükşehir belediyesinde
13 - pamukkale üniversitesi
13 - genel müdürlüğü
12 - arıtaş kriyojenik
12 - iş yeri temsilcisi
12 - toros tarım
12 - kartal belediyesi
11 - iş sendikasına
11 - schneider electric
11 - iş sendikasında
11 - buca belediyesi i mar


These are our top employer keys. Even though quite a lot of actual firms are present (temel conta, green transfo, hitachi energy); there are still person names (mehmet türkmen), canonicalization issues (schneider electric and schneider elektrik), nonsense unions (iş sendikasında) and broken tokenizations ( i şçi sendi ka) we could not get rid of.

Now, we will load a multilingual sentence-transformer model to produce contextual embeddings to check for semantic similarity.

In [24]:
EMB_MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
emb_tokenizer = AutoTokenizer.from_pretrained(EMB_MODEL_NAME)
emb_model = AutoModel.from_pretrained(EMB_MODEL_NAME).to("cuda" if torch.cuda.is_available() else "cpu")
emb_model.eval()

def mean_pool(last_hidden, attention_mask):
    mask = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

@torch.no_grad()
def encode_texts(texts, batch_size=32, max_len=256):
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    vecs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        enc = emb_tokenizer(
            batch,
            truncation=True,
            max_length=max_len,
            padding=True,
            return_tensors="pt"
        ).to(dev)
        out = emb_model(**enc)
        v = mean_pool(out.last_hidden_state, enc["attention_mask"])
        v = torch.nn.functional.normalize(v, p=2, dim=1)
        vecs.append(v.detach().cpu().numpy())
    return np.vstack(vecs)

def connected_components(n, edges):
    parent = list(range(n))
    rank = [0]*n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if rank[ra] < rank[rb]:
            parent[ra] = rb
        elif rank[ra] > rank[rb]:
            parent[rb] = ra
        else:
            parent[rb] = ra
            rank[ra] += 1

    for a, b in edges:
        union(a, b)

    comps = defaultdict(list)
    for i in range(n):
        comps[find(i)].append(i)
    return list(comps.values())

Here, we will build rare tokens for titles for extra linking evidence. 

In [25]:
def build_rare_tokens(titles, min_len=4, max_df=0.05):
    toks_all = []
    for t in titles:
        toks = re.findall(r"[A-Za-zÇĞİÖŞÜçğıöşü]+", str(t).lower())
        toks = [x for x in toks if len(x) >= min_len]
        toks_all.extend(set(toks))
    c = Counter(toks_all)
    n = len(titles)
    return {tok for tok, dfreq in c.items() if (dfreq / max(n,1)) <= max_df}

rare_tokens = build_rare_tokens(df_linked[TITLE_COL].tolist(), min_len=4, max_df=0.05)

def rare_title_tokens(title, min_len=4):
    toks = re.findall(r"[A-Za-zÇĞİÖŞÜçğıöşü]+", str(title).lower())
    return {t for t in toks if len(t) >= min_len and t in rare_tokens}

Now it is time for linking. We will use a safe time decay (to not completely miss strikes lasting a long time and hardly ever being reported on), employer buckets, similarity thresholds, rare token overlaps and trying for minimal clustering. The thresholds for our variables have been picked through many trial and errors in our re-runs.

In [26]:
def _safe_days_diff(d1, d2):
    if pd.isna(d1) or pd.isna(d2):
        return None
    try:
        return abs((pd.to_datetime(d1) - pd.to_datetime(d2)).days)
    except Exception:
        return None

def _time_decay(days_diff, tau_days=25.0):
    if days_diff is None:
        return 1.0
    return math.exp(-float(days_diff) / float(tau_days))

def assign_event_ids_hybrid(
    df_in,
    rel_flag_col="EVENT_PRED_CB",
    date_col=DATE_COL,
    employer_col="EMPLOYER_KEYS",
    sim_short=0.92,
    sim_emp=0.935,
    tau_days=80.0,
    max_rel=6000,
    EMP_MAX_BUCKET=40,
    TITLE_OVERLAP_K=2,
    MIN_CLUSTER_SIZE=2,
    DROP_EMPTY_EMPLOYER_CLUSTERS=True,
):
    df_out = df_in.copy()

    if EVENT_ID_COL not in df_out.columns:
        df_out[EVENT_ID_COL] = np.nan

    df_out[EVENT_ID_COL] = df_out[EVENT_ID_COL].where(df_out[EVENT_ID_COL].notna(), np.nan)
    df_out[EVENT_ID_COL] = df_out[EVENT_ID_COL].apply(lambda x: str(x).strip() if not pd.isna(x) else np.nan)

    rel = df_out[df_out[rel_flag_col] == 1].copy()
    if len(rel) == 0:
        print(f"No rows where {rel_flag_col} == 1.")
        return df_out
    if len(rel) > max_rel:
        print(f"Too many relevant rows ({len(rel)}).")
        return df_out

    rel["_has_date"] = rel[date_col].notna()
    rel = rel.sort_values(by=[date_col, "_has_date"], ascending=[True, False])

    idx = list(rel.index)
    n = len(idx)

    E = encode_texts(rel[TEXT_COL].tolist(), batch_size=32 if torch.cuda.is_available() else 8, max_len=256)

    title_tok_sets = [rare_title_tokens(t) for t in rel[TITLE_COL].tolist()]
    rel_dates = rel[date_col].tolist()

    emp_to_pos = defaultdict(list)
    rel_emp_lists = rel[employer_col].tolist() if employer_col in rel.columns else [None] * n
    for pos, keys in enumerate(rel_emp_lists):
        for k in (keys or []):
            if k:
                emp_to_pos[k].append(pos)

    edges = set()

    print("Relevant n:", n)
    print("Unique employer keys among relevant:", len(emp_to_pos))


    EMP_MAX_GAP_DAYS = 60          
    EMP_SIM_MIN = 0.75             
    USE_EMP_SIM = True             

    for emp, positions in emp_to_pos.items():
        if looks_like_union(emp) or should_drop_employer_key(emp):
            continue
        if len(emp) < 3:
            continue
        if len(positions) <= 1 or len(positions) > EMP_MAX_BUCKET:
            continue

        
        def _pos_date(p):
            d = rel_dates[p]
            try:
                return pd.to_datetime(d) if not pd.isna(d) else pd.Timestamp.min
            except Exception:
                return pd.Timestamp.min

        positions = sorted(positions, key=_pos_date)

        for a in range(1, len(positions)):
            i = positions[a - 1]
            j = positions[a]
            dd = _safe_days_diff(rel_dates[i], rel_dates[j])
            if dd is None or dd > EMP_MAX_GAP_DAYS:
                continue

            if USE_EMP_SIM:
                base_sim = float(np.dot(E[i], E[j]))
                if base_sim < EMP_SIM_MIN:
                    continue

            edges.add((min(i, j), max(i, j)))




    edges_after_A = len(edges)
    print("Edges after employer bucket (A):", edges_after_A)


    for i in range(n):
        for j in range(i + 1, n):

            ei = rel_emp_lists[i] if i < len(rel_emp_lists) else None
            ej = rel_emp_lists[j] if j < len(rel_emp_lists) else None
            if (not ei) and (not ej):
                continue

            base_sim = float(np.dot(E[i], E[j]))
            dd = _safe_days_diff(rel_dates[i], rel_dates[j])
            sim = base_sim * _time_decay(dd, tau_days=tau_days)
            if sim >= sim_short:
                if len(title_tok_sets[i] & title_tok_sets[j]) >= 1:
                  edges.add((min(i, j), max(i, j)))


    comps = connected_components(n, list(edges))
    clusters = [[idx[pos] for pos in comp] for comp in comps]


    filtered = []
    for cluster in clusters:
        if len(cluster) < MIN_CLUSTER_SIZE:
            continue
        if DROP_EMPTY_EMPLOYER_CLUSTERS and employer_col in df_out.columns:
            emp_nonempty = df_out.loc[cluster, employer_col].apply(lambda x: isinstance(x, list) and len(x) > 0).sum()
            if int(emp_nonempty) == 0:
                continue
        filtered.append(cluster)
    clusters = filtered


    new_counter = 1
    for cluster in clusters:
        existing = df_out.loc[cluster, EVENT_ID_COL].dropna()
        if len(existing) > 0:
            chosen = existing.value_counts().idxmax()
        else:
            chosen = f"EV{new_counter:06d}"
            new_counter += 1
        df_out.loc[cluster, EVENT_ID_COL] = chosen
    mask_rel = df_out[rel_flag_col] == 1
    unassigned = df_out.index[mask_rel & df_out[EVENT_ID_COL].isna()].tolist()


    used = set(df_out.loc[mask_rel, EVENT_ID_COL].dropna().astype(str))

    while f"EV{new_counter:06d}" in used:
        new_counter += 1

    for ix in unassigned:
        eid = f"EV{new_counter:06d}"
        while eid in used:
            new_counter += 1
            eid = f"EV{new_counter:06d}"
        df_out.at[ix, EVENT_ID_COL] = eid
        used.add(eid)
        new_counter += 1


    return df_out

df_linked = assign_event_ids_hybrid(
    df_linked,
    rel_flag_col="EVENT_PRED_CB",
    date_col=DATE_COL,
    employer_col="EMPLOYER_KEYS",
    sim_short=0.85,
    sim_emp=0.87,
    tau_days=100.0,
    EMP_MAX_BUCKET=40,
    TITLE_OVERLAP_K=1,
    MIN_CLUSTER_SIZE=2,
    DROP_EMPTY_EMPLOYER_CLUSTERS=True
)

mask_rel = df_linked["EVENT_PRED_CB"] == 1
print("Pred-relevant (CB/Wage):", int(mask_rel.sum()))
print("Unique EVENT_ID among predicted relevant:",
      int(df_linked.loc[mask_rel, EVENT_ID_COL].nunique()))
print(df_linked.loc[mask_rel].groupby(EVENT_ID_COL).size().describe())
print(df_linked.loc[mask_rel].groupby("EVENT_ID").size().describe())


Relevant n: 1150
Unique employer keys among relevant: 441
Edges after employer bucket (A): 613
Pred-relevant (CB/Wage): 1150
Unique EVENT_ID among predicted relevant: 550
count    550.000000
mean       2.090909
std        8.820767
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max      158.000000
dtype: float64
count    550.000000
mean       2.090909
std        8.820767
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max      158.000000
dtype: float64


We have 1150 articles, spread over 550 clusters with 441 unique emplyer keys. Although there is still one mega cluster, this will be solved when we firm split now. This way, the events (articles) containing more than one firm will not fall under the same EVENT_ID now.

In [27]:
def pick_primary_employer(emps):
    good = []
    for e in (emps or []):
        k = canonical_employer(e)
        if not k or should_drop_employer_key(k) or looks_like_union(k) or looks_like_person(k):
            continue
        toks = k.split()
        score = 0
        score += 2 if len(toks) >= 2 else 0
        score += 1 if len(k) >= 10 else 0
        score += min(len(k), 30) / 30.0  
        good.append((score, k))
    if not good:
        return None
    good.sort(reverse=True)
    return good[0][1]



def firm_split_event_ids(df_in, rel_flag_col="EVENT_PRED_CB", event_col="EVENT_ID", employer_col="EMPLOYER_KEYS"):
    df_out = df_in.copy()
    df_out["EVENT_ID_FIRM"] = np.nan

    rel = df_out[df_out[rel_flag_col] == 1].copy()
    for ev, g in rel.groupby(event_col):
        # assign per employer; if no employer, keep one bucket
        for idx_row in g.index:
            emps = df_out.at[idx_row, employer_col]
            if not isinstance(emps, list) or len(emps) == 0:
                df_out.at[idx_row, "EVENT_ID_FIRM"] = f"{ev}_F0"
            else:
                # pick first employer as primary key for firm-level split
                primary = pick_primary_employer(emps)
                if primary is None:
                     df_out.at[idx_row, "EVENT_ID_FIRM"] = f"{ev}_F0"
                else:
                    df_out.at[idx_row, "EVENT_ID_FIRM"] = f"{ev}_{primary}"


    return df_out

df_linked = firm_split_event_ids(df_linked, rel_flag_col="EVENT_PRED_CB", event_col="EVENT_ID", employer_col="EMPLOYER_KEYS")



  df_out.at[idx_row, "EVENT_ID_FIRM"] = f"{ev}_{primary}"


Now, we will just save these outputs to an excel file. Notice how the firm-split values are still not showing. This is not the case in the excel file.

In [28]:
def list_to_str(x):
    if isinstance(x, list):
        return "; ".join([str(i) for i in x if str(i).strip()])
    if pd.isna(x):
        return ""
    return str(x)

def flatten_unique_keys(series_of_lists):
    s = set()
    for ks in series_of_lists:
        if ks is None or (isinstance(ks, float) and pd.isna(ks)):
            continue
        if isinstance(ks, list):
            for k in ks:
                k = str(k).strip()
                if k:
                    s.add(k)
        else:
            for k in str(ks).split(";"):
                k = k.strip()
                if k:
                    s.add(k)
    return "; ".join(sorted(s))

mask = df_linked["EVENT_PRED_CB"] == 1

events_firm = (
    df_linked.loc[mask]
    .groupby("EVENT_ID_FIRM", dropna=False)
    .agg(
        start=(DATE_COL, "min"),
        end=(DATE_COL, "max"),
        duration=(DATE_COL, lambda x: (pd.to_datetime(x).max() - pd.to_datetime(x).min()).days + 1 if x.notna().any() else ""),
        n_articles=(TITLE_COL, "count"),
        firms=("ORG_KEYS_FILTERED", flatten_unique_keys),
        employers=("EMPLOYER_KEYS", flatten_unique_keys),
        unions=("UNION_KEYS", flatten_unique_keys),
    )
    .reset_index()
)

for c in ["start", "end"]:
    events_firm[c] = pd.to_datetime(events_firm[c], errors="coerce")


if "EVENT_PROB" not in df_linked.columns:
    if "EVENT_PROB" in df.columns:
        df_linked = df_linked.join(df[["EVENT_PROB"]], how="left")
    else:
        df_linked["EVENT_PROB"] = np.nan

for col in ["EVENT_PRED", "EVENT_PRED_CB"]:
    if col not in df_linked.columns:
        if col in df.columns:
            df_linked = df_linked.join(df[[col]], how="left")
        else:
            df_linked[col] = 0


if MANUAL_COL not in df_linked.columns:
    df_linked[MANUAL_COL] = np.nan


print("Missing among export-critical cols:",
      [c for c in ["EVENT_PROB","EVENT_PRED","EVENT_PRED_CB",MANUAL_COL] if c not in df_linked.columns])


wb = Workbook()
ws1 = wb.active
ws1.title = "Firm_Level_Strikes"
ws1.append(list(events_firm.columns))

for _, row in events_firm.iterrows():
    ws1.append([list_to_str(v) for v in row.tolist()])

ws2 = wb.create_sheet("Articles_By_Firm_Event")
cols = ["EVENT_ID_FIRM", DATE_COL, TITLE_COL, LINK, "ORG_KEYS_FILTERED", "EMPLOYER_KEYS", "UNION_KEYS", "EVENT_ID", MANUAL_COL, "EVENT_PRED", "EVENT_PRED_CB", "EVENT_PROB"]
ws2.append(cols)

tmp = df_linked.loc[mask, cols].copy()
tmp[DATE_COL] = pd.to_datetime(tmp[DATE_COL], errors="coerce")
for col in ["ORG_KEYS_FILTERED", "EMPLOYER_KEYS", "UNION_KEYS"]:
    tmp[col] = tmp[col].apply(list_to_str)

for _, r in tmp.iterrows():
    ws2.append([list_to_str(v) for v in r.tolist()])

path = "firm_level_strikes_trigger_window.xlsx"
wb.save(path)
print("Saved:", path)

Missing among export-critical cols: []
Pred-relevant (CB/Wage): 1150
Unique EVENT_ID among predicted relevant: 550
count    550.000000
mean       2.090909
std        8.820767
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max      158.000000
dtype: float64
count    550.000000
mean       2.090909
std        8.820767
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max      158.000000
dtype: float64
Saved: firm_level_strikes_trigger_window.xlsx


This is just a sanity check.

In [29]:
def top30_listcol(df_in, col, mask):
    c = Counter()
    for ks in df_in.loc[mask, col]:
        if ks is None or (isinstance(ks, float) and pd.isna(ks)):
            continue
        if isinstance(ks, list):
            for k in ks:
                k = str(k).strip()
                if k:
                    c[k] += 1
        else:
            for k in str(ks).split(";"):
                k = k.strip()
                if k:
                    c[k] += 1
    return c.most_common(30)

print("Top 30 ORG_KEYS_FILTERED:")
for k, v in top30_listcol(df_linked, "ORG_KEYS_FILTERED", df_linked["EVENT_PRED_CB"]==1):
    print(v, "-", k)

print("\nTop 30 EMPLOYER_KEYS:")
for k, v in top30_listcol(df_linked, "EMPLOYER_KEYS", df_linked["EVENT_PRED_CB"]==1):
    print(v, "-", k)

Top 30 ORG_KEYS_FILTERED:
44 - büyükşehir belediyesi
38 - temel conta
37 - arıtaş kriyojenik
32 - green transfo
31 - ge grid solutions
31 - i stanbul anadolu yakası
27 - büyükşehir belediyesine
25 - i şçi sendi ka
25 - karşıyaka belediyesi
25 - buca belediyesi
24 - hitachi energy
23 - iş sendikasının
21 - yolbulan metal
21 - schneider elektrik
21 - schneider electric
20 - gültekin kozan
18 - i smail cem şimşek
17 - as plastik
17 - mehmet türkmen
16 - sen genel
15 - kartal belediyesi
14 - iş yeri temsilcisi
14 - metal sanayicileri
14 - maltepe belediyesi
14 - büyükşehir belediyesinde
13 - iş sendikasına
13 - toros tarım
13 - pamukkale üniversitesi
13 - genel müdürlüğü
12 - seyit aslan

Top 30 EMPLOYER_KEYS:
43 - büyükşehir belediyesi
36 - temel conta
29 - ge grid solutions
27 - büyükşehir belediyesine
25 - i stanbul anadolu yakası
25 - karşıyaka belediyesi
25 - buca belediyesi
24 - i şçi sendi ka
23 - iş sendikasının
20 - yolbulan metal
20 - hitachi energy
17 - as plastik
17 - schneider

Now, we will check how our detection performs with respect to our manually labeled data.

In [30]:

mask_manual = df_linked[MANUAL_COL] == 1         
mask_pred = df_linked["EVENT_PRED"] == 1

print("Manual CB/Wage:", mask_manual.sum())
print("Predicted CB/Wage:", mask_pred.sum())
print("Overlap:", (mask_manual & mask_pred).sum())


Manual CB/Wage: 160
Predicted CB/Wage: 1855
Overlap: 155


In [31]:
print("Recall:", (mask_manual & mask_pred).sum() / mask_manual.sum())
print("Precision:", (mask_manual & mask_pred).sum() / mask_pred.sum())


Recall: 0.96875
Precision: 0.08355795148247978


As expected, for wage/collective bargaining detection; we have extremely high recall but low precision.

In [32]:
mask_manual = df_linked[MANUAL_COL] == 1

mask_pred_raw = df_linked["EVENT_PRED"] == 1
mask_pred_cb   = df_linked["EVENT_PRED_CB"] == 1     

print("Manual:", mask_manual.sum())

print("\nMODEL ONLY (EVENT_PRED)")
print("Pred:", mask_pred_raw.sum())
print("Overlap:", (mask_manual & mask_pred_raw).sum())

print("\nMODEL + RULE GATE (EVENT_PRED_CB)")
print("Pred:", mask_pred_cb.sum())
print("Overlap:", (mask_manual & mask_pred_cb).sum())


Manual: 160

MODEL ONLY (EVENT_PRED)
Pred: 1855
Overlap: 155

MODEL + RULE GATE (EVENT_PRED_CB)
Pred: 1150
Overlap: 116


In [33]:
def cb_filter_reason(text: str):
    t = str(text or "")
    if not CB_KEEP.search(t):
        return "FAIL_KEEP"
    if CB_EXCLUDE.search(t):
        return "HIT_EXCLUDE"
    return "PASS"


fn = df_linked[(df_linked[MANUAL_COL]==1) & (df_linked["EVENT_PRED_CB"]==0)].copy()

fn["cb_reason"] = fn[TEXT_COL].apply(cb_filter_reason)

print(fn["cb_reason"].value_counts())


cols = [TITLE_COL, LINK, "cb_reason"]
display(fn[cols].head(30))


cb_reason
HIT_EXCLUDE    25
PASS           14
FAIL_KEEP       5
Name: count, dtype: int64


Unnamed: 0,title,link,cb_reason
276,İslahiye OSB'de Key Mensucat işçilerinin diren...,https://www.evrensel.net/haber/510329/islahiye...,HIT_EXCLUDE
283,"1 Mayıs’ı birliğimizi, kararlılığımızı gösterm...",https://www.evrensel.net/haber/517237/mega-pol...,HIT_EXCLUDE
287,Grevdeki Mersen işçileri Fransız Konsolosluğu ...,https://www.evrensel.net/haber/519324/grevdeki...,PASS
297,Eğitim Sen grevdeki Purmo işçilerini ziyaret etti,https://www.evrensel.net/haber/520439/egitim-s...,HIT_EXCLUDE
302,Grevdeki Kristal Yağ işçilerinden birlik çağrı...,https://www.evrensel.net/haber/522517/grevdeki...,HIT_EXCLUDE
305,Kristal Yağ işçilerinin grevi bir ayı geride b...,https://www.evrensel.net/haber/522871/kristal-...,HIT_EXCLUDE
309,Eti Krom eylemi 14’üncü günü geride bıraktı,https://www.evrensel.net/haber/523213/eti-krom...,PASS
313,İzBB’de çalışan emekçiler belediyenin zamsız t...,https://www.evrensel.net/haber/524499/izbbde-c...,HIT_EXCLUDE
318,Hatay'da grevdeki Yolbulan ve Befesa işçilerin...,https://www.evrensel.net/haber/525461/hatayda-...,HIT_EXCLUDE
322,"EMEP'li Bayhan, CarrefourSA işçilerinin direni...",https://www.evrensel.net/haber/526679/emepli-b...,HIT_EXCLUDE


In [34]:
mask_manual = df_linked[MANUAL_COL] == 1
mask_pred_cb = df_linked["EVENT_PRED_CB"] == 1

print("Manual:", mask_manual.sum())
print("Pred_CB:", mask_pred_cb.sum())
print("Overlap:", (mask_manual & mask_pred_cb).sum())
print("Recall:", (mask_manual & mask_pred_cb).sum() / mask_manual.sum())
print("Precision:", (mask_manual & mask_pred_cb).sum() / mask_pred_cb.sum())

Manual: 160
Pred_CB: 1150
Overlap: 116
Recall: 0.725
Precision: 0.10086956521739131


It seems our blocking rules are still too strong, increasing precision slightly for a relatively big fall in recall. However, this is a necessary evil. In the end, i was left with 816 unique events encompassing 1151 articles. Manually, i was able to identify 134 unique events. So, our linking was still pretty subpar. For these unique events, i manually labeled "EVENT_ID","FIRM","WORKER_TOTAL","WORKER_STRIKE", "STRIKING_WORKER_RATIO","SECTOR","STRIKE_DURATION (DAYS)", "UNION_PRESENCE","LEGAL_STRIKE" and "RESULT". Both files can be seen at "firm_level_strikes_trigger_window(20)" and "Evrensel_2024_2025_found_strikes.xlsx" respectively. The columns are self explanatory ( "UNION_PRESENCE","LEGAL_STRIKE" and "RESULT" are binary variables) except when "RESULT" is blank, it means that the strike is ongoing/result unknown, and in this case "STRIKE_DURATION (DAYS)" indicates the strike duration up until the last date the strike was mentioned.

Now, we will use this file for quantitative analysis.

Go to file "discrete_time_hazard_strikes_FIN.ipynb" for the next part. 