### **Enhancements based on the baseline: Dataset Mention Extraction 📄🔍**

This notebook is an improvement upon the work from: [LB-0.289-improved-basline-prompt-engineering](https://www.kaggle.com/code/kawchar85/lb-0-289-improved-basline-prompt-engineering). Thank you.

**Inspiration for using Regex & Context Chunking:**

Inspired by the need to extract dataset accessions and DOIs with high precision, I combined regex with smart context slicing and domain-specific heuristics.

**What I changed:**

1. **Regex-Based Identifier Extraction**

   * Added robust patterns to detect **DOIs**, **GSE/SRA**, **CHEMBL**, **UniProt**, and other dataset-related IDs.

2. **Heuristic Keyword Filtering**

   * Matched surrounding text against known **dataset-related phrases** (e.g., “data available at”, “repository”) to filter meaningful mentions.

3. **Smart Contextual Chunking**

   * Implemented a `TextChunker` that aligns context by sentence boundaries, ensuring that extracted snippets are informative and self-contained.

4. **Dataset DOI Classification**

   * Checked matched DOIs against a curated list of known **dataset DOI prefixes** to validate dataset relevance.

5. **Parallel PDF Processing**

   * Boosted performance with **ThreadPoolExecutor**, allowing multiple PDFs to be parsed concurrently.

6. **Model Testing: Non-Reasoning vs Reasoning**

   * This notebook evaluates **Qwen 2.5** (32B Instruct, AWQ) for non-reasoning classification and **Qwen 3** (8B, AWQ) for reasoning-based classification, so the two modes can be compared and ablated via the `think_mode` flag.

7. **Detailed False Negative (FN) Analysis**

   * Added in-depth analysis to categorize and quantify **False Negatives** (FN), separated into:

     * **Wrongly classified**: The model predicted the identifier but assigned it the wrong type.
     * **Completely missed**: No prediction was made for a ground-truth item.
   * Each group is further broken down into:

     * **DOI**-based errors (e.g., wrong prefix or mismatched)
     * **Accession ID** errors (e.g., GSE, PRJNA, etc.)
   * This helps reveal weaknesses such as:

     * Ambiguous contexts
     * Incomplete extraction logic
     * Confusions between similar dataset identifiers
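
The extraction code itself lives in a separate `my_secret` module that is not shown in this notebook, so the following is only a minimal sketch of how items 1, 2, and 4 above could fit together. The patterns, the keyword list, the prefix list, and the function names (`find_candidates`, `has_data_context`, `is_dataset_doi`) are illustrative assumptions, not the values actually used.

```python
import re

# Illustrative patterns only; the notebook's real patterns are not shown.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:a-z0-9]+', re.IGNORECASE)
ACCESSION_RE = re.compile(r'\b(?:GSE\d+|SRR\d+|SRP\d+|PRJNA\d+|CHEMBL\d+)\b')

# Hypothetical examples of dataset-related phrases and dataset DOI prefixes.
DATA_KEYWORDS = ("data available at", "deposited in", "repository", "accession")
DATASET_DOI_PREFIXES = ("10.5061/", "10.5281/")  # Dryad and Zenodo


def is_dataset_doi(doi: str) -> bool:
    """True when the DOI starts with one of the assumed dataset prefixes."""
    return doi.lower().startswith(DATASET_DOI_PREFIXES)


def has_data_context(text: str, start: int, end: int, window: int = 200) -> bool:
    """True when a dataset-related phrase occurs within `window` chars of the match."""
    context = text[max(0, start - window):end + window].lower()
    return any(keyword in context for keyword in DATA_KEYWORDS)


def find_candidates(text: str):
    """Return (kind, identifier) pairs that pass the keyword and prefix filters."""
    candidates = []
    for kind, pattern in (("doi", DOI_RE), ("accession", ACCESSION_RE)):
        for m in pattern.finditer(text):
            # Trailing sentence punctuation is captured by the DOI pattern; strip it.
            identifier = m.group(0).rstrip(".,;)")
            if not has_data_context(text, m.start(), m.end()):
                continue
            if kind == "doi" and not is_dataset_doi(identifier):
                continue
            candidates.append((kind, identifier))
    return candidates


sample = (
    "Raw reads were deposited in the NCBI repository under accession GSE12345. "
    "Processed tables are available at https://doi.org/10.5061/dryad.abc123."
)
print(find_candidates(sample))
```

Running the keyword filter before the prefix check keeps the cheap test first; a real pipeline would likely need many more patterns and prefixes than the handful shown here.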

**Next Goals:**

1. **Improve chunk quality and reduce junk chunks**
2. **Identify what to improve in the regex patterns**
3. **Add prompt caching**
4. **Reduce overall runtime**
5. **Raise the F1 score by reducing false negatives**

**I hope this notebook can reach gold-medal level.**


In [1]:
#!pip install pymupdf



In [2]:
#pip install pymupdf -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
[0m[31mERROR: Could not find a version that satisfies the requirement pymupdf (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pymupdf[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


In [3]:
!pip list | grep -i pdf

pdf2image                          1.17.0
pypdf                              5.4.0


In [4]:
import os
import re
#import fitz
from pypdf import PdfReader
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm
import torch
from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing as mp
from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor
import vllm
import sys

sys.path.append('/kaggle/input/secret-process')

import my_secret

print("Starting PDF processing using secret code...")
chunks, chunks2, ids = my_secret.main()

ModuleNotFoundError: No module named 'logits_processor_zoo'
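
Because `my_secret` is not included here, the cell above cannot be followed line by line. As a rough sketch of the two ideas it is described as implementing (sentence-aligned context chunks and concurrent PDF parsing), something like the following could work. `sentence_chunk`, `read_pdf_text`, and `parse_pdfs` are hypothetical names, and the sentence splitter is deliberately naive.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def sentence_chunk(text: str, start: int, end: int, n_context: int = 1) -> str:
    """Return the sentence(s) covering text[start:end] plus n_context neighbours."""
    # Build (begin, finish) character spans, one per sentence.
    spans, pos = [], 0
    for m in SENTENCE_END.finditer(text):
        spans.append((pos, m.start()))
        pos = m.end()
    spans.append((pos, len(text)))

    # Indices of sentences that overlap the matched identifier.
    hit = [i for i, (b, f) in enumerate(spans) if b < end and f > start]
    if not hit:
        return text[start:end]
    first = max(0, hit[0] - n_context)
    last = min(len(spans) - 1, hit[-1] + n_context)
    return text[spans[first][0]:spans[last][1]]


def read_pdf_text(path) -> str:
    """Extract plain text from one PDF with pypdf (imported lazily)."""
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_pdfs(paths, loader=read_pdf_text, max_workers: int = 4) -> dict:
    """Load many documents concurrently and return {path: text}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(loader, paths))
    return dict(zip(paths, texts))


text = "We studied mice. Data were deposited under GSE12345. Analysis used R. Thanks."
hit = text.index("GSE12345")
print(sentence_chunk(text, hit, hit + len("GSE12345")))
```

Threads mainly help when file I/O dominates; since text extraction in pypdf is pure Python, the actual speed-up would need to be measured.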

## Load LLM

In [None]:
think_mode = True

In [None]:
if think_mode:
    model_path = "/kaggle/input/qwen-3/transformers/8b-awq/1"
    llm = vllm.LLM(
        model_path,
        quantization='awq',
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.92,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=4096,
        disable_log_stats=True,
        enable_prefix_caching=True
    )
else:
    model_path = "/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1"
    llm = vllm.LLM(
        model_path,
        quantization='awq',
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.92,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=1024+512,
        disable_log_stats=True,
        enable_prefix_caching=True
    )
tokenizer = llm.get_tokenizer()

## System prompts

In [None]:
SYS_PROMPT_DOI = """
You are an expert at identifying RESEARCH DATA citations in academic papers.
Your task is to determine if a DOI in the provided text specifically refers to a dataset, software, or data repository, NOT another academic paper.

**Crucial Rules:**
1.  **LOOK FOR DATA CONTEXT:** The DOI must be near keywords like "data available", "deposited in", "repository", "accession number", "software", "code".
2.  **IGNORE BIBLIOGRAPHY:** If the DOI is clearly part of a numbered or author-year list in a "References" or "Bibliography" section, you MUST respond with "Irrelevant".
3.  **PRIORITIZE DATA DOIs:** If there are multiple DOIs, return the one most likely to be a dataset.

Only respond with either a full normalized DOI URL starting with "https://doi.org/" or the single word "Irrelevant".
Do NOT include any other text or explanation.
"""

if think_mode:
    
    SYS_PROMPT_ACCESSION = """
    You are an expert at analyzing research data usage in academic papers.
    
    Think step-by-step about the surrounding text, identifying clues such as:
    - PRIMARY data: “we deposited”, “data generated in this study”, “our data”, “submitted to”, “newly generated”
    - SECONDARY data: “downloaded from”, “obtained from”, “previously published”, “publicly available”, “existing dataset”
    - NONE: mentioned only in references, general methodology descriptions without actual usage, or contexts unrelated to research data
    
    Silently reason through the classification.
    
    Please show your choice in the answer field with only the choice letter, e.g.,  
    "answer": "C"
    """
    
    SYS_PROMPT_CLASSIFY_DOI = """
    You are an expert at analyzing research data citations in academic papers.
    
    First, reason step-by-step about whether the DOI refers to data that is:
    A) Primary – generated specifically for this study  
    B) Secondary – reused or derived from prior work  
    C) None – merely cited in references, not research data, or otherwise unrelated
    
    Perform this reasoning silently.
    
    Please show your choice in the answer field with only the choice letter, e.g.,  
    "answer": "B"
    """

else:    
    SYS_PROMPT_ACCESSION = """
    You are an expert at analyzing research data usage in academic papers.
    
    Look for contextual clues:
    - For PRIMARY data: "we deposited", "data generated in this study", "our data", "submitted to", "newly generated"
    - For SECONDARY data: "downloaded from", "obtained from", "previously published", "publicly available", "existing dataset"
    - For NONE: mentioned in references, methodology descriptions without actual usage, or unrelated contexts
    
    Respond with only one letter: A, B, or C.
    """
    
    SYS_PROMPT_CLASSIFY_DOI = """
    You are an expert at analyzing research data citations in academic papers.
    
    Classify the data as:
    A) Primary: if the data was generated specifically for this study
    B) Secondary: if the data was reused or derived from prior work  
    C) None: if the DOI is in references, doesn't refer to research data, or is unrelated
    
    Respond with only one letter: A, B, or C.
    """

## Ask LLM to extract DOI links

In [None]:
prompts = []
for article_id, academic_text in chunks:
    messages = [
        {"role": "system", "content": SYS_PROMPT_DOI},
        {"role": "user", "content": academic_text}
    ]

    # Thinking is switched off for this extraction step when Qwen 3 is loaded.
    template_kwargs = {"enable_thinking": False} if think_mode else {}
    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
        **template_kwargs
    )
    
    prompts.append(prompt)

outputs = llm.generate(
    prompts,
    vllm.SamplingParams(
        seed=0,
        skip_special_tokens=True,
        max_tokens=64,
        temperature=0
    ),
    use_tqdm=True
)

responses = [output.outputs[0].text.strip() for output in outputs]

doi_pattern = re.compile(r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)', re.I)

doi_urls = []
for response in responses:
    if response.lower() == "irrelevant":
        doi_urls.append("Irrelevant")
    else:
        match = doi_pattern.search(response)
        if match:
            doi_urls.append("https://doi.org/" + match.group(1))
        else:
            doi_urls.append("Irrelevant")
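
The character class in `doi_pattern` includes `.`, `;`, and `)`, so a DOI that ends a sentence or sits inside parentheses is captured together with that trailing punctuation. One possible post-processing step, shown here as a sketch and not as part of the original pipeline, is to strip such characters before building the URL. `normalize_doi` is a hypothetical helper.

```python
import re

doi_pattern = re.compile(r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)', re.I)


def normalize_doi(response: str) -> str:
    """Map a raw LLM response to a lowercase https://doi.org/ URL or 'Irrelevant'."""
    match = doi_pattern.search(response)
    if not match:
        return "Irrelevant"
    doi = match.group(1).rstrip(".,;:")
    # Drop a closing parenthesis only when it has no matching opener,
    # so DOIs that legitimately contain "(...)" are left intact.
    while doi.endswith(")") and doi.count(")") > doi.count("("):
        doi = doi[:-1].rstrip(".,;:")
    return "https://doi.org/" + doi.lower()


print(normalize_doi("See https://doi.org/10.5061/Dryad.ABC123)."))
print(normalize_doi("10.1016/S0140-6736(20)30183-5"))
print(normalize_doi("Irrelevant"))
```

Lowercasing here mirrors the `.str.lower()` call applied to `dataset_id` when the submission is prepared.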


In [None]:
import re

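def parse_answer_with_regex(response_text: str):
    """Return the choice letter 'A', 'B', or 'C' from an LLM response, or None."""
    if not isinstance(response_text, str):
        return None

    # Ignore any reasoning emitted before a closing </think> tag, if present.
    response_text = response_text.split('</think>')[-1]

    # Prefer an explicit `"answer": "X"` field. Only the word "answer" is
    # matched case-insensitively, so a lowercase "a" is never read as a choice.
    match = re.search(r'(?i:answer)\b.*?\b([ABC])\b', response_text, re.DOTALL)
    if match:
        return match.group(1)

    # Fall back to the last standalone capital A, B, or C.
    all_choices = re.findall(r'\b[ABC]\b', response_text)
    if all_choices:
        return all_choices[-1]

    return None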

In [None]:
prompts = []
valid_indices = []

if think_mode:
    for i, (chunk, url) in enumerate(zip(chunks, doi_urls)):
        if url == "Irrelevant":
            continue
    
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_CLASSIFY_DOI},
            {"role": "user", "content": f"DOI: {url}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=True
        )
        prompts.append(prompt)
        valid_indices.append(i)
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.65,
            top_p=0.95,
            top_k=20,
            skip_special_tokens=True,
            max_tokens=2048+1024,
            presence_penalty=1.5
        ),
        use_tqdm=True
    )

    choice_to_type_map = {'A': 'Primary', 'B': 'Secondary', 'C': None}

    responses = [output.outputs[0].text.strip() for output in outputs]
    
    parsed_doi_choices = [parse_answer_with_regex(resp) for resp in responses]
    final_doi_answers = [choice_to_type_map.get(choice) for choice in parsed_doi_choices]
    
    answers = [None] * len(chunks)
    for i, answer in zip(valid_indices, final_doi_answers):
        answers[i] = answer
    
    
else:
    for i, (chunk, url) in enumerate(zip(chunks, doi_urls)):
        if url == "Irrelevant":
            continue
    
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_CLASSIFY_DOI},
            {"role": "user", "content": f"DOI: {url}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=False
        )
        prompts.append(prompt)
        valid_indices.append(i)
    
    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=["A", "B", "C"])
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.05, 
            skip_special_tokens=True,
            max_tokens=1,
            logits_processors=[mclp],
            logprobs=len(mclp.choices)
        ),
        use_tqdm=True
    )
    
    logprobs = []
    for lps in [output.outputs[0].logprobs[0].values() for output in outputs]:
        logprobs.append({lp.decoded_token: lp.logprob for lp in list(lps)})
    
    logit_matrix = pd.DataFrame(logprobs)[["A", "B", "C"]].values
    choices = ["Primary", "Secondary", None]
    answers = [None] * len(chunks)
    
    # Keep a label only when its log-probability exceeds -2.0 (probability above ~0.135).
    for idx, logprob_row in zip(valid_indices, logit_matrix):
        best = int(np.argmax(logprob_row))
        if logprob_row[best] > -2.0:
            answers[idx] = choices[best]
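
In the non-reasoning branch, each prompt yields one log-probability per choice token, and a label is kept only when the best log-probability exceeds -2.0, i.e. a probability above exp(-2) ≈ 0.135. The toy example below restates that decision rule in plain Python with made-up numbers; `pick_label` is a hypothetical name.

```python
import math

CHOICES = {"A": "Primary", "B": "Secondary", "C": None}


def pick_label(logprobs: dict, threshold: float = -2.0):
    """Return the label of the most likely choice, or None if it is below threshold."""
    best = max(logprobs, key=logprobs.get)
    if logprobs[best] > threshold:
        return CHOICES[best]
    return None


# Made-up log-probabilities for three prompts.
examples = [
    {"A": -0.1, "B": -2.5, "C": -4.0},  # confident Primary
    {"A": -3.0, "B": -0.2, "C": -2.1},  # confident Secondary
    {"A": -2.1, "B": -2.4, "C": -2.6},  # best choice is A, but below the threshold
]
for lp in examples:
    best_prob = math.exp(max(lp.values()))
    print(pick_label(lp), round(best_prob, 3))
```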

In [None]:
prompts = []

if think_mode:
    for chunk, acc_id in zip(chunks2, ids):
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_ACCESSION},
            {"role": "user", "content": f"Accession ID: {acc_id}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=True
        )
        prompts.append(prompt)
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.65,
            top_p=0.95,
            top_k=20,
            skip_special_tokens=True,
            max_tokens=2048+1024,
            presence_penalty=1.5
        ),
        use_tqdm=True
    )
    choice_to_type_map = {'A': 'Primary', 'B': 'Secondary', 'C': None}

    responses = [output.outputs[0].text.strip() for output in outputs]
    
    parsed_accession_choices = [parse_answer_with_regex(resp) for resp in responses]
    answers2 = [choice_to_type_map.get(choice) for choice in parsed_accession_choices]

else:
    for chunk, acc_id in zip(chunks2, ids):
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_ACCESSION},
            {"role": "user", "content": f"Accession ID: {acc_id}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=False
        )
        prompts.append(prompt)
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.05,
            skip_special_tokens=True,
            max_tokens=1,
            logits_processors=[mclp],
            logprobs=len(mclp.choices)
        ),
        use_tqdm=True
    )
    
    logprobs2 = []
    for lps in [output.outputs[0].logprobs[0].values() for output in outputs]:
        logprobs2.append({lp.decoded_token: lp.logprob for lp in list(lps)})
    
    logit_matrix2 = pd.DataFrame(logprobs2)[["A", "B", "C"]].values
    choices2 = ["Primary", "Secondary", None]
    
    answers2 = []
    for logit_row in logit_matrix2:
        max_logit = np.max(logit_row)
        max_idx = np.argmax(logit_row)
        
        if max_logit > -2.0:
            answers2.append(choices2[max_idx])
        else:
            answers2.append(None)
    

## Prepare Submission

In [None]:

sub_df = pd.DataFrame()
sub_df["article_id"] = [c[0] for c in chunks]
sub_df["dataset_id"] = doi_urls
sub_df["dataset_id"] = sub_df["dataset_id"].str.lower()
sub_df["type"] = answers
sub_df = sub_df[sub_df["type"].notnull()].reset_index(drop=True)

sub_df2 = pd.DataFrame()
sub_df2["article_id"] = [c[0] for c in chunks2]
sub_df2["dataset_id"] = ids
sub_df2["type"] = answers2
sub_df2 = sub_df2[sub_df2["type"].notnull()].reset_index(drop=True)

# Combine and clean
sub_df = pd.concat([sub_df, sub_df2], ignore_index=True)
sub_df = sub_df[sub_df["type"].isin(["Primary", "Secondary"])].reset_index(drop=True)

# Enhanced deduplication with priority to Primary data
sub_df = sub_df.sort_values(by=["article_id", "dataset_id", "type"], 
                           key=lambda x: x.map({"Primary": 0, "Secondary": 1}) if x.name == "type" else x)\
               .drop_duplicates(subset=['article_id', 'dataset_id'], keep="first")\
               .reset_index(drop=True)

sub_df['row_id'] = range(len(sub_df))
sub_df.to_csv("submission.csv", index=False, columns=["row_id", "article_id", "dataset_id", "type"])

print("Final submission stats:")
print(sub_df["type"].value_counts())
print(f"Total entries: {len(sub_df)}")

## Evaluate validation score

In [None]:
def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) != 0 else 0.0
    
if not os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    pred_df = pd.read_csv("submission.csv")
    label_df = pd.read_csv("/kaggle/input/make-data-count-finding-data-references/train_labels.csv")
    label_df = label_df[label_df['type'] != 'Missing'].reset_index(drop=True)

    hits_df = label_df.merge(pred_df, on=["article_id", "dataset_id", "type"])
    
    tp = hits_df.shape[0]
    fp = pred_df.shape[0] - tp
    fn = label_df.shape[0] - tp
    
    print("\nValidation Results:")
    print("TP:", tp)
    print("FP:", fp)
    print("FN:", fn)
    print("F1 Score:", round(f1_score(tp, fp, fn), 3))
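
The score above counts a prediction as a true positive only when `article_id`, `dataset_id`, and `type` all match a label row. The same arithmetic can be checked by hand on a tiny example using sets of tuples. The rows below are invented for illustration, and `f1_from_sets` is a hypothetical helper.

```python
def f1_from_sets(labels: set, preds: set) -> float:
    """Micro F1 over exact (article_id, dataset_id, type) matches."""
    tp = len(labels & preds)
    fp = len(preds - labels)
    fn = len(labels - preds)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


labels = {
    ("art1", "gse1", "Primary"),
    ("art1", "https://doi.org/10.5061/dryad.x", "Secondary"),
    ("art2", "prjna1", "Secondary"),
}
preds = {
    ("art1", "gse1", "Primary"),                             # exact match: TP
    ("art1", "https://doi.org/10.5061/dryad.x", "Primary"),  # wrong type: FP and FN
    ("art3", "gse9", "Secondary"),                           # spurious: FP
}

# TP = 1, FP = 2, FN = 2, so F1 = 2 / (2 + 2 + 2)
print(round(f1_from_sets(labels, preds), 3))
```

A wrong type costs twice: the prediction counts as a false positive and the label remains a false negative.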

In [None]:
import os
import pandas as pd

def calculate_f1_score(y_true, y_pred):
    if y_true.empty or y_pred.empty:
        tp = 0
        fp = len(y_pred)
        fn = len(y_true)
    else:
        hits = y_true.merge(y_pred, on=["article_id", "dataset_id", "type"])
        tp = len(hits)
        fp = len(y_pred) - tp
        fn = len(y_true) - tp
    
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    return tp, fp, fn, f1

def analyze_error_sources(pred_df, label_df):
    label_df_filtered = label_df[label_df['type'] != 'Missing'].copy()

    is_doi_pred = pred_df['dataset_id'].str.startswith('https://doi.org/')

    pred_doi = pred_df[is_doi_pred]
    pred_accession = pred_df[~is_doi_pred]
    label_df_filtered['dataset_id_normalized'] = label_df_filtered['dataset_id'].apply(
        lambda x: f"https://doi.org/{x}" if x.startswith('10.') else x
    )
    label_df_filtered = label_df_filtered.rename(columns={'dataset_id': 'original_dataset_id', 'dataset_id_normalized': 'dataset_id'})
    
    is_doi_label_norm = label_df_filtered['dataset_id'].str.startswith('https://doi.org/')

    label_doi = label_df_filtered[is_doi_label_norm]
    label_accession = label_df_filtered[~is_doi_label_norm]

    tp_doi, fp_doi, fn_doi, f1_doi = calculate_f1_score(label_doi, pred_doi)
    tp_acc, fp_acc, fn_acc, f1_acc = calculate_f1_score(label_accession, pred_accession)
    
    print("="*40)
    print("🔬 Error Analysis by ID Type")
    print("="*40)

    print("\n--- DOI ---")
    print(f"Total Predictions: {len(pred_doi)}")
    print(f"True Positives (TP): {tp_doi}")
    print(f"False Positives (FP): {fp_doi}")
    print(f"False Negatives (FN): {fn_doi}")
    print(f"F1 Score: {f1_doi:.4f}")

    print("\n--- Accession ID ---")
    print(f"Total Predictions: {len(pred_accession)}")
    print(f"True Positives (TP): {tp_acc}")
    print(f"False Positives (FP): {fp_acc}")
    print(f"False Negatives (FN): {fn_acc}")
    print(f"F1 Score: {f1_acc:.4f}")
    
    print("\n" + "="*40)
    print("Total FP:", fp_doi + fp_acc)
    print("Total FN:", fn_doi + fn_acc)
    print("="*40)

if not os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    try:
        pred_df = pd.read_csv("submission.csv")
        pred_df['dataset_id'] = pred_df['dataset_id'].astype(str)
        
        label_df = pd.read_csv("/kaggle/input/make-data-count-finding-data-references/train_labels.csv")
        label_df['dataset_id'] = label_df['dataset_id'].astype(str)

        analyze_error_sources(pred_df, label_df)

    except FileNotFoundError as e:
        print(f"Error: Could not find a required file. {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

In [None]:
import pandas as pd

try:
    pred_df = pd.read_csv("submission.csv")
    label_df = pd.read_csv("/kaggle/input/make-data-count-finding-data-references/train_labels.csv")
    label_df_filtered = label_df[label_df['type'] != 'Missing'].copy()
except FileNotFoundError as e:
    print(f"Error: file not found - {e}")
    exit()

fn_df = pd.merge(
    label_df_filtered,
    pred_df,
    on=['article_id', 'dataset_id', 'type'],
    how='left',
    indicator=True
).query('_merge == "left_only"').drop(columns=['_merge'])

merged_df = pd.merge(
    fn_df,
    pred_df,
    on=['article_id', 'dataset_id'],
    how='left',
    indicator='source'
)

classified_incorrectly_df = merged_df[merged_df['source'] == 'both']
classified_incorrectly_count = len(classified_incorrectly_df)

completely_missed_df = merged_df[merged_df['source'] == 'left_only']
completely_missed_count = len(completely_missed_df)

incorrect_doi_count = classified_incorrectly_df[classified_incorrectly_df['dataset_id'].str.startswith('https://', na=False)].shape[0]
incorrect_accession_count = classified_incorrectly_df[~classified_incorrectly_df['dataset_id'].str.startswith('https://', na=False)].shape[0]


missed_doi_count = completely_missed_df[completely_missed_df['dataset_id'].str.startswith('https://', na=False)].shape[0]
missed_accession_count = completely_missed_df[~completely_missed_df['dataset_id'].str.startswith('https://', na=False)].shape[0]


print("="*55)
print("False Negative (FN) Analysis")
print("="*55)
print(f"All FN: {fn_df.shape[0]} records")
print("-" * 55)
print(f"↳ Predicted, but with the wrong type: {classified_incorrectly_count} records")
print(f"    Wrong DOI: {incorrect_doi_count} records")
print(f"    Wrong Accession ID: {incorrect_accession_count} records")
print("-" * 55)
print(f"↳ Completely missed: {completely_missed_count} records")
print(f"    Missed DOI: {missed_doi_count} records")
print(f"    Missed Accession ID: {missed_accession_count} records")
print("="*55)
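
The two merges above split false negatives into labels whose identifier was predicted with the wrong type and labels whose identifier was never predicted at all. A plain-Python restatement of that split, on invented rows, may make the logic easier to verify; `split_false_negatives` and `is_doi` are hypothetical names.

```python
def is_doi(dataset_id: str) -> bool:
    """Mirror the notebook's check: DOIs are stored as https:// URLs."""
    return dataset_id.startswith("https://")


def split_false_negatives(labels, preds):
    """Split missed labels into (wrong_type, completely_missed).

    labels and preds are iterables of (article_id, dataset_id, type) tuples.
    """
    pred_rows = set(preds)
    pred_keys = {(article_id, dataset_id) for article_id, dataset_id, _ in pred_rows}
    wrong_type, missed = [], []
    for row in labels:
        if row in pred_rows:
            continue  # exact match: a true positive, not a false negative
        article_id, dataset_id, _ = row
        if (article_id, dataset_id) in pred_keys:
            wrong_type.append(row)
        else:
            missed.append(row)
    return wrong_type, missed


labels = [
    ("art1", "gse1", "Primary"),
    ("art1", "https://doi.org/10.5061/dryad.x", "Secondary"),
    ("art2", "prjna1", "Secondary"),
]
preds = [
    ("art1", "gse1", "Primary"),
    ("art1", "https://doi.org/10.5061/dryad.x", "Primary"),
]

wrong_type, missed = split_false_negatives(labels, preds)
print("wrong type:", wrong_type)
print("missed:", missed)
print("wrong-type DOIs:", sum(is_doi(d) for _, d, _ in wrong_type))
```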