<a href="https://colab.research.google.com/github/elizabethavargas/Dataset-Description-Generation/blob/main/testing_prompts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Prompts
The paper used GPT-4o-mini and LLaMA-3.1-8B-Instruct. However, in order to make generation and


models = [
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    "unsloth/Meta-Llama-3.1-70B-Instruct",
    "unsloth/Qwen2-72B-Instruct",
    "unsloth/Qwen2-7B-Instruct",
]


### Setup LLMs

In [5]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d"
import unsloth
from unsloth import FastLanguageModel
import torch
import pandas as pd
from tqdm import tqdm

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d)
  Using cached unsloth-2025.10.10-py3-none-any.whl


In [7]:
!hf auth login





    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The token `colab2` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authe
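The interactive prompt above needs someone at the keyboard. For unattended runs, a token can instead be read from an environment variable and passed to `huggingface_hub.login`. The sketch below is a minimal version of that idea; the helper name `login_from_env` and the variable name `HF_TOKEN` are assumptions for illustration, not something this notebook uses.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hugging Face Hub with a token taken from the environment.

    Returns False (and does nothing) when the variable is unset or empty.
    """
    token = os.environ.get(var_name)
    if not token:
        return False

    # Imported lazily so the helper can be defined even without the package installed
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` before the models are loaded would replace the `!hf auth login` cell, provided the variable has been set in the runtime.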

## Create Objects

In [8]:
generation_models = [
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    "unsloth/Qwen2-7B-Instruct",
]

class HFGenerator:
    """Generates descriptions using a Hugging Face model"""

    def __init__(self, model_name):
        if model_name not in generation_models:
            raise ValueError(f"Model '{model_name}' is not in the list of available models. "
                             f"Choose from: {generation_models}")

        self.model_name = model_name

        # Load model + tokenizer
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=4096,
            dtype=None,
            load_in_4bit=True,
        )

        FastLanguageModel.for_inference(self.model)

        if "Qwen" in model_name:
            self.tokenizer.pad_token = "<|extra_0|>"
            self.tokenizer.eos_token = "</s>"
            self.tokenizer.bos_token = "<s>"

            self.eos_ids = [self.tokenizer.eos_token_id]

        else:  # LLaMA
            self.eos_ids = [
                self.tokenizer.eos_token_id,
                self.tokenizer.convert_tokens_to_ids("<|eot_id|>")
            ]

    def generate_description(self, prompt, temperature=0.0):
        """Generates a description given a prompt and temperature"""

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        prompt_length = inputs["input_ids"].shape[1]

        # The LLaMA branch pads with EOS; the Qwen branch uses the pad token set in __init__
        if "Llama" in self.model_name:
            pad_token_id = self.tokenizer.eos_token_id
        else:
            pad_token_id = self.tokenizer.pad_token_id

        generation_kwargs = dict(
            max_new_tokens=200,
            do_sample=temperature > 0,
            num_beams=1,
            eos_token_id=self.eos_ids,
            pad_token_id=pad_token_id,
            use_cache=True,
        )
        # Temperature only applies when sampling, so leave it out for greedy decoding
        if temperature > 0:
            generation_kwargs["temperature"] = temperature

        with torch.no_grad():
            outputs = self.model.generate(**inputs, **generation_kwargs)

        # Decode only the newly generated tokens. Slicing the decoded string by
        # len(prompt) breaks whenever decoding does not reproduce the prompt exactly.
        new_tokens = outputs[0][prompt_length:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
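
Both checkpoints are instruction-tuned, and such models are usually prompted through the tokenizer's chat template rather than with a raw string. The sketch below shows one possible way to wrap a prompt as chat messages. `build_chat_messages` is a hypothetical helper that is not part of this notebook, and whether it would improve the descriptions here is untested.

```python
def build_chat_messages(user_prompt, system_message=None):
    """Wrap a prompt in the role/content message format used by chat templates."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_chat_messages(
    "Rewrite this dataset description.",
    system_message="You are a data documentation expert.",
)
print(messages)

# With a Hugging Face tokenizer, the messages could then be rendered to a prompt string:
# prompt_text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True
# )
```

The rendered string could be passed to `generate_description` in place of the raw prompt.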


In [10]:
test_prompt = """You are a data documentation expert.
Your task is to rewrite the dataset description so it sounds professional, informative, and engaging — suitable for the NYC Open Data catalog.

Dataset title: 2019 For Hire Vehicles Trip Data
Category: Transportation
Agency: Taxi and Limousine Commission (TLC)
Tags: ['taxi', 'trip data', 'fhv', 'trip', 'base', 'high volume', 'uber', 'lyft', 'via']

Current description:
"These records are generated from the For-Hire Vehicle (“FHV”) Trip Record submissions made by traditional livery, luxury, and black car bases. The FHV trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID, which correspond with the NYC Taxi Zones open dataset. Each row represents a single trip in an FHV."

Example row:
{
  "dispatching_base_num": "B01239",
  "pickup_datetime": "2019-01-01T00:10:37.000",
  "dropoff_datetime": "2019-01-01T00:26:19.000",
  "dolocationid": "265"
}

Column definitions:
dispatching_base_num: The TLC Base License Number of the base that dispatched the trip
pickup_datetime: The date and time of the trip pick-up
dropOff_datetime: The date and time of the trip dropoff
PUlocationID: TLC Taxi Zone in which the trip began
DOlocationID: TLC Taxi Zone in which the trip ended
SR_Flag: Indicates if the trip was a part of a shared ride chain offered by a High Volume FHV company (e.g. Uber Pool, Lyft Line).
Affiliated_base_number: Base number of the base with which the vehicle is affiliated.

When improving the description:
- Do NOT restate or list individual column definitions.
- Expand on what the dataset enables — such as transportation planning, ride-share regulation, equity analysis, or urban mobility research.
- Include *context* (why this data matters, who uses it, what insights it offers).
- Use confident, clear, natural language.
- Keep it concise (1–2 paragraphs).
- Write as if it were the official NYC Open Data description.

**Improved description:**
"""


In [11]:
llama_generator = HFGenerator("unsloth/Meta-Llama-3.1-8B-Instruct")
llama_generator.generate_description(test_prompt)

"The 2019 For Hire Vehicles Trip Data dataset provides a comprehensive view of taxi and ride-hailing activity in New York City. With over 150 million records, this dataset offers a rich source of information for transportation planners, researchers, and policymakers seeking to understand the dynamics of urban mobility. By analyzing trip patterns, ridership trends, and service provider data, users can gain insights into the city's transportation ecosystem, inform evidence-based policy decisions, and evaluate the effectiveness of regulations.\n\nThis dataset captures the dispatching base, pick-up and drop-off times, and locations of trips, as well as information on shared ride chains and affiliations. By leveraging this data, users can explore questions such as: How do different ride-hailing services impact traffic congestion and air quality? What are the demographics of riders and drivers, and how do they vary by service type? How can the city optimize its transportation infrastructure 

In [12]:
qwen_generator = HFGenerator("unsloth/Qwen2-7B-Instruct")
qwen_generator.generate_description(test_prompt)

==((====))==  Unsloth 2025.10.10: Fast Qwen2 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



"The 2019 For Hire Vehicles Trip Data, curated by the Taxi and Limousine Commission (TLC), serves as a comprehensive chronicle of rideshare activity within New York City's bustling transportation landscape. This invaluable dataset captures the essence of urban mobility through meticulously recorded trips by traditional livery, luxury, and black car bases, alongside the burgeoning realm of high-volume for-hire vehicles (FHVs) including industry giants like Uber, Lyft, and Via.\n\nEach entry in this dataset is a snapshot of a journey, meticulously detailing the dispatching base license number, the precise moment of pickup and drop-off, and the starting and ending locations within the city's intricate network of taxi zones. This information is pivotal for a myriad of applications:\n\nFor transportation planners, these records offer a granular view into the patterns and dynamics of urban travel, enabling them to optimize routes, enhance infrastructure, and forecast demand with precision. I

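With `max_new_tokens=200`, a generation can stop in the middle of a sentence. One option is to post-process the text and keep only complete sentences. The helper below is a rough sketch of that idea; `trim_to_last_sentence` is a hypothetical name, and the rule it uses (cut after the last `.`, `!` or `?` that is followed by whitespace or the end of the text) will also treat abbreviations such as "e.g." as sentence ends.

```python
import re


def trim_to_last_sentence(text):
    """Drop a trailing partial sentence, keeping text up to the last ., ! or ?."""
    # Sentence-ending punctuation followed by whitespace or the end of the text;
    # a period inside a number such as 20.7 is not matched.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text.strip()
    return text[: matches[-1].end()].strip()


print(trim_to_last_sentence("First sentence. Second one is cut off in the mid"))
```

Raising `max_new_tokens` is the other obvious option, at the cost of longer generation time.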
## Create & Apply Prompt Templates
The first prompt template is adapted from the autoDDG prompts.

In [13]:
description = None
dataset_sample = None
title = None
agency = None
category = None
column_definitions = None
tags = None


system_message = f"""You are an assistant for a dataset search engine. Your goal
is to improve the readability of dataset descriptions for dataset search engine users."""

introduction = f"""Answer the question using the following information.

    First, consider the dataset sample:

    {dataset_sample}"""

initial_description = f"""The initial description is {description}."""

title_agency_cat = f"""Additionally, the dataset title is {title}, the agency is {agency}, and the category is
{category}. Based on this topic and agency, please add sentence(s) describing what this
dataset can be used for."""

tag = f"""The tags are {tags}."""

column_defs = f"""Additionally, the column definitions are {column_definitions}."""

closing_instruction = f"""Question: Based on the information above and the
requirements, provide a dataset description in sentences. Use only natural,
readable sentences without special formatting."""
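
One thing to watch for with templates like these: Python renders an f-string at the moment it is defined, so a placeholder filled while its variable is `None` stays `None` even if the variable is reassigned later. The minimal sketch below contrasts that with a deferred `str.format` template. The names `eager` and `deferred` are illustrative and not part of this notebook.

```python
title = None

# An f-string is rendered immediately, so it captures the value title has right now.
eager = f"The dataset title is {title}."

# A plain string with a named placeholder is rendered only when .format() is called.
deferred = "The dataset title is {title}."

title = "Class Size Report (2006-2007)"

print(eager)                         # still contains "None"
print(deferred.format(title=title))  # contains the real title
```

Templates meant to be filled once per dataset therefore need to be rendered inside the loop, either with `.format()` or by building the f-string there.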



In [14]:
# read datasets.pkl
import pandas as pd
datasets = pd.read_pickle("datasets.pkl")
datasets[1]

{'dataset_id': 'npwk-bcm6',
 'data_example': {'school_year': '2006-2007',
  'report_type': 'Citywide',
  'program': 'GENERAL EDUCATION',
  'grade_or_service_category': 'Kindergarten',
  'average_class_size': '20.7'},
 'dataset_name': 'Class Size Report (2006-2007)',
 'category': 'Education',
 'description': 'For schools with students in any grades between Kindergarten and 9th grade (where 9th grade is the termination grade for the school), class size is reported by four program areas: general education, special education self-contained class, collaborative team teaching and gifted and talented self-contained class. Within each program area class size is reported by grade or service category, which indicates how a special education self-contained class is delivered. Class size is calculated by dividing the number of students in a program and grade by the number of official classes in that program and grade.\nThe following data is excluded from all the reports: District 75 schools, bridg
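
Before building prompts, it may help to check which metadata fields each record actually has, since indexing a missing key raises `KeyError`. The sketch below assumes each record is a plain dict, as the output above suggests. The field list mirrors the keys this notebook reads, and `missing_fields` is a hypothetical helper.

```python
REQUIRED_FIELDS = [
    "dataset_id", "data_example", "dataset_name", "agency",
    "category", "description", "column_info", "tags",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required keys that are absent from the record or set to None."""
    return [key for key in required if record.get(key) is None]


# A deliberately incomplete record, for illustration only
example = {
    "dataset_id": "npwk-bcm6",
    "dataset_name": "Class Size Report (2006-2007)",
    "category": "Education",
}
print(missing_fields(example))
```

Running `missing_fields` over every record in `datasets` would show which optional prompt pieces can be included for which dataset.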

In [None]:
new_descriptions = {}

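for dataset in datasets:
  dataset_sample = dataset["data_example"]
  description = dataset['description']
  title = dataset['dataset_name']
  agency = dataset['agency']
  category = dataset['category']
  column_definitions = dataset["column_info"]
  tags = dataset['tags']
  dataset_id = dataset['dataset_id']

  # The template strings defined earlier are f-strings, so they were filled in once,
  # while every variable was still None. Rebuild each piece here so the prompt
  # contains this dataset's values, and separate the pieces with blank lines.
  parts = [system_message]
  if dataset_sample is not None:
    parts.append("Answer the question using the following information.\n\n"
                 f"First, consider the dataset sample:\n\n{dataset_sample}")
  if description is not None:
    parts.append(f"The initial description is {description}.")
  if title is not None:
    parts.append(f"Additionally, the dataset title is {title}, the agency is {agency}, "
                 f"and the category is {category}. Based on this topic and agency, please "
                 "add sentence(s) describing what this dataset can be used for.")
  if tags is not None:
    parts.append(f"The tags are {tags}.")
  if column_definitions is not None:
    parts.append(f"Additionally, the column definitions are {column_definitions}.")
  parts.append(closing_instruction)
  prompt = "\n\n".join(parts)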

  qwen_description = qwen_generator.generate_description(prompt)
  llama_description = llama_generator.generate_description(prompt)

  # Initialize the inner dictionary if it doesn't exist
  if dataset_id not in new_descriptions:
    new_descriptions[dataset_id] = {}

  new_descriptions[dataset_id]['qwen_description'] = qwen_description
  new_descriptions[dataset_id]['llama_description'] = llama_description
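
Generation over many datasets is slow, so it may be worth writing the results to disk once the loop finishes. The sketch below saves a nested dict shaped like `new_descriptions` as JSON and reads it back. The helper names, the file name and the demo values are assumptions for illustration.

```python
import json
import os
import tempfile


def save_descriptions(descriptions, path):
    """Write the nested {dataset_id: {model_key: text}} dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(descriptions, f, ensure_ascii=False, indent=2)


def load_descriptions(path):
    """Read the dict back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder values for illustration only, not real model output
demo = {"npwk-bcm6": {"qwen_description": "placeholder", "llama_description": "placeholder"}}
demo_path = os.path.join(tempfile.mkdtemp(), "new_descriptions.json")
save_descriptions(demo, demo_path)
print(load_descriptions(demo_path) == demo)
```

In the notebook this would amount to calling `save_descriptions(new_descriptions, "new_descriptions.json")` after the loop.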