<a href="https://colab.research.google.com/github/elizabethavargas/Dataset-Description-Generation/blob/main/evaluation_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Create Evaluator Class

In [4]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d"
import unsloth
from unsloth import FastLanguageModel
import torch
import pandas as pd
from tqdm import tqdm

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d)
  Using cached unsloth-2025.10.10-py3-none-any.whl


In [5]:
evaluation_system_message = "You are a helpful and precise assistant for checking the quality of the dataset description."
evaluation_prompt = """You will be given one tabular dataset description. Your task is to rate the description on 3 metrics.
    Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

    Evaluation Criteria:
    1. Completeness (1-10) - Evaluates how thoroughly the dataset description covers essential aspects such as the scope of data, query workloads, summary statistics, and possible tasks or applications.
    A high score indicates that the description provides a comprehensive overview, including details on dataset size, structure, fields, and potential use cases.
    2. Conciseness (1-10) - Measures the efficiency of the dataset description in conveying necessary information without redundancy.
    A high score indicates that the description is succinct, avoiding unnecessary details while employing semantic types (e.g., categories, entities) to streamline communication.
    3. Readability (1-10) -  Evaluates the logical flow and readability of the dataset description.
    A high score suggests that the description progresses logically from one section to the next, creating a coherent and integrated narrative that facilitates understanding of the dataset.

    Evaluation Steps:
    Read the dataset description carefully and identify the main topic and key points. Assign a score for each criterion on a scale of 1 to 10, where 1 is the lowest and 10 is the highest based on the Evaluation Criteria.

    Example 1:
    Description: The dataset provides information on alcohol-impaired driving deaths and occupant deaths across various states in the United States. It includes data for 51 states, detailing the number of alcohol-impaired driving deaths and occupant deaths, with values ranging from 0 to 3723 and 0 to 10406, respectively. Each entry also contains the state abbreviation and its geographical coordinates. The dataset is structured with categorical and numerical data types, focusing on traffic safety and casualty statistics. Key attributes include state names, death counts, and location coordinates, making it a valuable resource for analyzing traffic safety trends and issues related to impaired driving.
    Evaluation Form (scores ONLY): Completeness: 7, Conciseness: 9, Readability: 9

    Example 2:
    Description: The dataset provides a comprehensive overview of traffic safety statistics across various states in the United States, specifically focusing on alcohol-impaired driving deaths and occupant deaths. It includes data from 51 unique states, represented by their two-letter postal abbreviations, such as MA (Massachusetts), SD (South Dakota), AK (Alaska), MS (Mississippi), and ME (Maine). Each entry in the dataset captures critical information regarding the number of alcohol-impaired driving deaths and the total occupant deaths resulting from traffic incidents.
    The column "Alcohol-Impaired Driving Deaths" is represented as an integer, indicating the number of fatalities attributed to alcohol impairment while driving. The dataset reveals a range of values, with the highest recorded number being 2367 deaths in Mississippi, highlighting the severity of the issue in certain regions. In contrast, states like Alaska report significantly lower figures, with only 205 alcohol-impaired driving deaths.
    The "Occupant Deaths" column also consists of integer values, representing the total number of deaths among vehicle occupants, regardless of the cause. This data spans from 0 to 10406, with Mississippi again showing the highest number of occupant deaths at 6100, which raises concerns about overall traffic safety in the state.
    Additionally, the dataset includes a "Location" column that provides geographical coordinates for each state, enhancing the spatial understanding of the data. The coordinates are formatted as latitude and longitude pairs, allowing for potential mapping and geographical analysis of traffic safety trends.
    Overall, this dataset serves as a valuable resource for researchers, policymakers, and public safety advocates aiming to understand and address the impact of alcohol on driving safety across different states. It highlights the need for targeted interventions and policies to reduce alcohol-impaired driving incidents and improve occupant safety on the roads.
    Evaluation Form (scores ONLY): Completeness: 8, Conciseness: 7, Readability: 8

    Please provide scores for the given dataset description based on the Evaluation Criteria. Do not include any additional information or comments in your response."""


evaluation_models = [
    "unsloth/Meta-Llama-3.1-70B-Instruct",
    "unsloth/Qwen2-72B-Instruct",
]

class HFEvaluator:
    """Evaluates descriptions using a Hugging Face model"""

    def __init__(self, model_name):
        if model_name not in evaluation_models:
            raise ValueError(f"Model '{model_name}' is not in the list of available models. "
                             f"Choose from: {evaluation_models}")
        self.model_name = model_name
        self.template = evaluation_prompt
        self.system_message = evaluation_system_message

        # Load model + tokenizer
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=4096,
            dtype=None,
            load_in_4bit=True,
        )

        FastLanguageModel.for_inference(self.model)

        if "Qwen" in model_name:
            self.tokenizer.eos_token = "<|im_end|>"       # real EOS
            self.tokenizer.pad_token = "<|endoftext|>"    # real PAD
            self.tokenizer.bos_token = self.tokenizer.pad_token

            self.eos_ids = [self.tokenizer.eos_token_id]
            self.pad_id = self.tokenizer.pad_token_id
            self.bos_id = self.tokenizer.bos_token_id

        else:  # LLaMA
            self.eos_ids = [
                self.tokenizer.eos_token_id,
                self.tokenizer.convert_tokens_to_ids("<|eot_id|>")
            ]

    def evaluate_description(self, description):
        """Scores one dataset description and returns the model's raw reply text"""

        # Build final message content in one step
        user_content = (
            f"{self.template}\n"
            f"Description: {description}\n"
            "Evaluation Form (scores ONLY): "
        )

        prompt = self.tokenizer.apply_chat_template(
            [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": user_content},
            ],
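            tokenize=False,
            add_generation_prompt=True,  # append the assistant header so the model answers rather than continuing the user turn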
        )

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")

        if "Llama" in self.model_name:  # also matches "Meta-Llama"
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=50,
                    do_sample=True,
                    temperature=0.3,
                    num_beams=1,
                    eos_token_id=self.eos_ids,
                    pad_token_id=self.tokenizer.eos_token_id,
                    use_cache=True,
                )

        else:  # Qwen branch
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=50,
                    do_sample=True,
                    temperature=0.3,
                    eos_token_id=self.eos_ids,
                    pad_token_id=self.pad_id,
                    bos_token_id=self.bos_id,
                    use_cache=True,
                )

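        # Decode only the newly generated tokens: the decoded text has special tokens
        # stripped, so slicing it by len(prompt) would cut at the wrong position
        prompt_length = inputs["input_ids"].shape[1]
        return self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()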
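
The evaluation prompt asks the model to reply in the form `Completeness: 7, Conciseness: 9, Readability: 9`. A minimal sketch of turning such a reply into numbers is shown here; `parse_scores` and `METRICS` are hypothetical names that are not part of the notebook, and the sketch assumes the model follows the requested format.

```python
import re
from typing import Dict

# Hypothetical helper (not in the original notebook): metric names taken from the prompt.
METRICS = ("Completeness", "Conciseness", "Readability")


def parse_scores(reply: str) -> Dict[str, int]:
    """Extract the three 1-10 scores from a reply in the requested format."""
    scores: Dict[str, int] = {}
    for metric in METRICS:
        match = re.search(rf"{metric}\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"No score found for {metric!r} in reply: {reply!r}")
        value = int(match.group(1))
        if not 1 <= value <= 10:
            raise ValueError(f"{metric} score {value} is outside the 1-10 scale")
        scores[metric] = value
    return scores


print(parse_scores("Completeness: 7, Conciseness: 9, Readability: 9"))
# prints {'Completeness': 7, 'Conciseness': 9, 'Readability': 9}
```

Raising on a missing or out-of-range score, rather than returning a default, makes malformed replies visible when many descriptions are scored in a loop.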


In [None]:
!hf auth login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
The token `ergth` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate

In [6]:
description = "The description should be clear and concise, explaining the purpose and potential uses of the dataset.\nAnswer: Without any specific details about the dataset, it's challenging to provide a comprehensive description. However, given that the title, agency, category, tags, and column definitions are all unspecified, we can infer that this dataset might serve as a placeholder or a starting point for various data-related tasks. It could potentially be utilized for general data analysis, data visualization projects, or as a foundation for developing more specialized datasets tailored to specific industries or research areas. Its flexibility allows it to be adapted to numerous applications depending on the user's needs and objectives. Since there are no predefined columns or tags, users have the freedom to structure the data according to their requirements, making it suitable for a wide range of data-driven projects. Overall, this dataset represents a versatile resource that can be customized and leveraged for diverse purposes within the realm of data science and analytics."

llama_evaluator = HFEvaluator("unsloth/Meta-Llama-3.1-70B-Instruct")
llama_evaluator.evaluate_description(description)


==((====))==  Unsloth 2025.10.10: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
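
The error above is consistent with a simple back-of-the-envelope estimate. The sketch below is a rough lower bound only: it assumes a nominal 70 billion parameters stored at 4 bits each and ignores activations, the KV cache, and quantization overhead. `estimate_weight_memory_gb` is a hypothetical helper, and the GPU figure is the "Max memory" value printed in the Unsloth banner.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough lower bound on weight storage in GB (10**9 bytes); an assumption-laden estimate."""
    return num_params * bits_per_param / 8 / 1e9


gpu_memory_gb = 22.161  # "Max memory" reported for the NVIDIA L4 in the log above
needed = estimate_weight_memory_gb(70e9, 4)  # nominal 70B parameters at 4 bits each

print(f"~{needed:.0f} GB of weights vs {gpu_memory_gb} GB available")
print("fits on one GPU:", needed < gpu_memory_gb)
```

Under these assumptions the weights alone exceed the memory of a single L4, so some layers are offloaded to CPU or disk, which is what triggers the `ValueError`.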

## Dataset Retrieval Evaluation
Purpose: Measures how generated descriptions improve dataset search and findability.

- Method: Integrates descriptions into a keyword-based search engine.
- Metric: Normalized Discounted Cumulative Gain (NDCG@k).
- Evaluates ranking quality by comparing ideal vs. actual search results.
- Higher scores mean relevant datasets appear earlier in search results.
- Tools: Tested with BM25 (lexical keyword matching) and SPLADE (semantic term expansion).
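
The comparison of ideal and actual rankings can be worked through by hand. The following is a minimal sketch with toy values, assuming linear gain (the relevance itself) and a log2 rank discount; other NDCG variants use an exponential gain instead. `dcg_linear` is a hypothetical helper name.

```python
import math
from typing import Sequence


def dcg_linear(relevances: Sequence[float], k: int) -> float:
    """DCG@k with linear gain: each relevance is divided by log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


true_relevance = [3, 2, 3, 0, 1]            # toy ground-truth relevance per dataset
predicted_scores = [0.9, 0.8, 0.7, 0.2, 0.3]  # toy search-engine scores

# Order the datasets by predicted score, highest first, and read off their true relevance
order = sorted(range(len(predicted_scores)), key=lambda i: predicted_scores[i], reverse=True)
retrieved = [true_relevance[i] for i in order]  # [3, 2, 3, 1, 0]
ideal = sorted(true_relevance, reverse=True)    # [3, 3, 2, 1, 0]

ndcg_at_5 = dcg_linear(retrieved, 5) / dcg_linear(ideal, 5)
print("retrieved:", retrieved)
print("ideal:    ", ideal)
print("NDCG@5:", round(ndcg_at_5, 3))
```

The only difference between the two orderings is that a relevance-2 dataset is ranked above a relevance-3 one, so the score falls slightly below 1.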

In [None]:
from sklearn.metrics import ndcg_score

# Example: relevance scores for 5 datasets
true_relevance = [[3, 2, 3, 0, 1]]   # ground truth relevance
scores = [[0.9, 0.8, 0.7, 0.2, 0.3]] # model scores

# Compute NDCG@5
ndcg = ndcg_score(true_relevance, scores, k=5)
print("NDCG@5:", ndcg)


## Reference-Based Evaluation
Purpose: Compares generated descriptions against existing dataset descriptions.

- Metrics:
  - METEOR: Accounts for precision, recall, synonyms, and stemming.
  - ROUGE: Measures n-gram and longest-common-subsequence overlap, with an emphasis on recall.
  - BERTScore: Uses contextual embeddings to assess semantic similarity.
- Outcome: Determines how closely AutoDDG's descriptions match human-written ones in wording and meaning.
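
To make the overlap idea concrete, here is a minimal sketch of unigram overlap in the spirit of ROUGE-1. It is a simplification, not the `rouge_score` implementation: it lowercases, splits on word characters, and applies no stemming. `rouge1` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Dict


def rouge1(reference: str, candidate: str) -> Dict[str, float]:
    """Unigram precision, recall, and F1 between a candidate and a reference (no stemming)."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "This dataset contains wind speed and direction measurements."
candidate = "Wind speed and direction data collected during 2003."
print(rouge1(reference, candidate))
# prints {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Four of the eight tokens on each side match ("wind", "speed", "and", "direction"). Without stemming or synonym handling, "data" and "dataset" count as different tokens, which is the gap METEOR and BERTScore are meant to narrow.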

In [None]:
from rank_bm25 import BM25Okapi

corpus = [
    "health insurance premiums liabilities assets",
    "citi bike trips 2022",
    "yellow taxi trip data multiple years"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "insurance financial data"
scores = bm25.get_scores(query.split(" "))
print(scores)


In [None]:
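# METEOR (recent NLTK versions expect pre-tokenized input and need the WordNet corpus)
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "This dataset contains wind speed and direction measurements."
candidate = "Wind speed and direction data collected during 2003."
# Whitespace tokenization keeps the example dependency-free
print("METEOR:", meteor_score([reference.split()], candidate.split()))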

# ROUGE
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
print("ROUGE:", scores)

# BERTScore
from bert_score import score
cands = [candidate]
refs = [reference]
P, R, F1 = score(cands, refs, lang="en", verbose=True)
print("BERTScore F1:", F1.mean().item())


In [None]:
from __future__ import annotations

import math
from typing import Dict, Iterable, List

import numpy as np
import pandas as pd
from beartype import beartype
from rank_bm25 import BM25Okapi


@beartype
def compute_dcg(relevances: Iterable[float], p: int) -> float:
    """
    Discounted cumulative gain at rank p

    Args:
        relevances: Relevance scores
        p: Cut-off rank

    Returns:
        DCG value
    """

    dcg = 0.0
    for index, relevance in enumerate(relevances):
        if index >= p:
            break
        dcg += (2**relevance - 1) / math.log2(index + 2)
    return dcg


@beartype
def compute_avg_single_Q(
    stats: Dict[str, Dict[str, Dict[str, List[float]]]],
    description_version_key: str,
    Q_key: str,
) -> pd.DataFrame:
    """
    Average nested metrics into a DataFrame by index version

    Args:
        stats: Nested metrics dict
        description_version_key: Key selecting description version
        Q_key: Key selecting metric group

    Returns:
        DataFrame of averaged metrics
    """

    averages: Dict[str, Dict[str, float]] = {}
    ndcg_dicts = stats[description_version_key][Q_key]
    for index_version, ndcg_metric in ndcg_dicts.items():
        averages[index_version] = {
            key: float(np.average(scores)) for key, scores in ndcg_metric.items()
        }
    return pd.DataFrame(averages)


@beartype
def compute_ndcg(
    retrieved_relevances: Iterable[float], ideal_relevances: Iterable[float], p: int
) -> float:
    """
    Normalised DCG at rank p

    Args:
        retrieved_relevances: Relevances by retrieved order
        ideal_relevances: Relevances by ideal order
        p: Cut-off rank

    Returns:
        nDCG value
    """

    dcg = compute_dcg(retrieved_relevances, p)
    idcg = compute_dcg(ideal_relevances, p)
    return dcg / idcg if idcg > 0 else 0.0


@beartype
def downstream_task_rank(
    documents: List[str],
    query: str,
    relevances: List[float],
    ks: Iterable[int],
    debug: bool = False,
) -> Dict[int, Dict[str, float]]:
    """
    BM25 ranking with nDCG@k over the given documents

    Args:
        documents: List of document texts
        query: Query string
        relevances: Ground-truth relevance scores
        ks: List of cut-off values
        debug: Print debug output if True

    Returns:
        Mapping k -> metrics
    """

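    def _compute_ndcg(relevance_true: List[float], relevance_test: List[float], k: int) -> float:
        # Linear-gain nDCG; size the discounts to the lists, which may be shorter than k
        n = min(k, len(relevance_true))
        discounts = np.log2(np.arange(2, n + 2))
        ideal_dcg = np.sum(np.array(relevance_true[:n]) / discounts)
        dcg = np.sum(np.array(relevance_test[:n]) / discounts)
        # Guard against division by zero when no document is relevant
        return float(dcg / ideal_dcg) if ideal_dcg > 0 else 0.0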

    tokenized_corpus = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_corpus)
    tokenized_query = query.lower().split()
    if debug:
        print(tokenized_query)

    scores = bm25.get_scores(tokenized_query)
    sorted_indices = np.argsort(scores)[::-1]
    sorted_rel_true = sorted(relevances, reverse=True)
    sorted_rel_test = np.array(relevances)[sorted_indices].tolist()

    results: Dict[int, Dict[str, float]] = {}
    for k in ks:
        ndcg = _compute_ndcg(sorted_rel_true[:k], sorted_rel_test[:k], k)
        results[k] = {"ndcg": ndcg}
    return results