<a href="https://colab.research.google.com/github/elizabethavargas/Dataset-Description-Generation/blob/main/testing_prompts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Prompts
The paper used GPT-4o-mini and LLaMA-3.1-8B-Instruct. However, GPT-4o-mini is no longer available, so one option is to switch to GPT-5-mini.

## Set Up LLMs

In [2]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d"
!pip install openai

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d)
  Cloning https://github.com/unslothai/unsloth.git (to revision 31b667b54139962832ea2de890383eed14a0a17d) to /tmp/pip-install-mpbvbzcx/unsloth_b0a6aee231a9448bb1a9919f35a59628
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-mpbvbzcx/unsloth_b0a6aee231a9448bb1a9919f35a59628
  Running command git rev-parse -q --verify 'sha^31b667b54139962832ea2de890383eed14a0a17d'
  Running command git fetch -q https://github.com/unslothai/unsloth.git 31b667b54139962832ea2de890383eed14a0a17d
  Running command git checkout -q 31b667b54139962832ea2de890383eed14a0a17d
  Resolved https://github.com/unslothai/unsloth.git to commit 31b667b54139962832ea2de890383eed14a0a17d
  Installing build dependencies ... done
  Getti

In [3]:
import unsloth
from unsloth import FastLanguageModel
import torch
import pandas as pd
from tqdm import tqdm

# Load the model and tokenizer from Hugging Face
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = True,
)

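from openai import OpenAI

# API_KEY must already be defined (it is not set in the cells shown here)
client = OpenAI(
    api_key=API_KEY,
)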
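
`API_KEY` is not defined anywhere in the cells shown. One way to supply it without hard-coding the secret is to read it from an environment variable, sketched below. The variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions for illustration, not part of the original notebook.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API
        raise RuntimeError(f"Set the {var_name} environment variable before creating the client.")
    return key


# Usage (assumes the variable was set before the notebook started):
# API_KEY = load_api_key()
```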

## Basic Testing

In [5]:
test_prompt = """You are a data documentation expert.
Your task is to rewrite the dataset description so it sounds professional, informative, and engaging — suitable for the NYC Open Data catalog.

Dataset title: 2019 For Hire Vehicles Trip Data
Category: Transportation
Agency: Taxi and Limousine Commission (TLC)
Tags: ['taxi', 'trip data', 'fhv', 'trip', 'base', 'high volume', 'uber', 'lyft', 'via']

Current description:
"These records are generated from the For-Hire Vehicle (“FHV”) Trip Record submissions made by traditional livery, luxury, and black car bases. The FHV trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID, which correspond with the NYC Taxi Zones open dataset. Each row represents a single trip in an FHV."

Example row:
{
  "dispatching_base_num": "B01239",
  "pickup_datetime": "2019-01-01T00:10:37.000",
  "dropoff_datetime": "2019-01-01T00:26:19.000",
  "dolocationid": "265"
}

Column definitions:
dispatching_base_num: The TLC Base License Number of the base that dispatched the trip
pickup_datetime: The date and time of the trip pick-up
dropOff_datetime: The date and time of the trip dropoff
PUlocationID: TLC Taxi Zone in which the trip began
DOlocationID: TLC Taxi Zone in which the trip ended
SR_Flag: Indicates if the trip was a part of a shared ride chain offered by a High Volume FHV company (e.g. Uber Pool, Lyft Line).
Affiliated_base_number: Base number of the base with which the vehicle is affiliated.

When improving the description:
- Do NOT restate or list individual column definitions.
- Expand on what the dataset enables — such as transportation planning, ride-share regulation, equity analysis, or urban mobility research.
- Include *context* (why this data matters, who uses it, what insights it offers).
- Use confident, clear, natural language.
- Keep it concise (1–2 paragraphs).
- Write as if it were the official NYC Open Data description.

**Improved description:**
"""


In [6]:
# Prepare model for inference
FastLanguageModel.for_inference(model)

# Get token IDs for stopping
eos_id = tokenizer.eos_token_id
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

# Tokenize the test prompt
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

# Generate with proper stopping
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=200,
        do_sample=False,
        temperature=0.0,
        num_beams=1,
        eos_token_id=[eos_id, eot_id],
        pad_token_id=eos_id,
        use_cache=True,
    )

# Decode and parse the entire output
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

print(response_text)

You are a data documentation expert.
Your task is to rewrite the dataset description so it sounds professional, informative, and engaging — suitable for the NYC Open Data catalog.

Dataset title: 2019 For Hire Vehicles Trip Data
Category: Transportation
Agency: Taxi and Limousine Commission (TLC)
Tags: ['taxi', 'trip data', 'fhv', 'trip', 'base', 'high volume', 'uber', 'lyft', 'via']

Current description:
"These records are generated from the For-Hire Vehicle (“FHV”) Trip Record submissions made by traditional livery, luxury, and black car bases. The FHV trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID, which correspond with the NYC Taxi Zones open dataset. Each row represents a single trip in an FHV."

Example row:
{
  "dispatching_base_num": "B01239",
  "pickup_datetime": "2019-01-01T00:10:37.000",
  "dropoff_datetime": "2019-01-01T00:26:19.000",
  "dolocationid": "265"
}

Column definitions:
dispatching_b

## Create Objects

In [7]:
class HFGenerator:
    """Generates descriptions using a Hugging Face model"""
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        # Prepare model for inference
        FastLanguageModel.for_inference(self.model)
        self.eos_id = self.tokenizer.eos_token_id
        self.eot_id = self.tokenizer.convert_tokens_to_ids("<|eot_id|>")

    def generate_description(self, prompt, temperature=0.0):
        """Generates a description given a prompt and temperature"""

        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = self.model.generate(
                input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=200,
                do_sample=temperature > 0,
                temperature=temperature,
                num_beams=1,
                eos_token_id=[self.eos_id, self.eot_id],
                pad_token_id=self.eos_id,
                use_cache=True,
            )

        # Decode only the newly generated tokens (everything after the prompt tokens),
        # so the result does not depend on the decoded text reproducing the prompt exactly
        prompt_length = inputs.input_ids.shape[1]
        generated_description = self.tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        ).strip()
        return generated_description

In [14]:
class OpenAIGenerator:
    """Generates descriptions using an OpenAI model"""
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)

    def generate_description(self, prompt, model_name="gpt-5-nano", temperature=1.0):
        """Generates a description given a prompt, model name, and temperature."""
        response = self.client.responses.create(
            model=model_name,
            input=prompt,  # use the prompt argument, not the global test_prompt
            store=True,
            temperature=temperature,
        )
        return response.output_text

In [15]:
openAIGenerator = OpenAIGenerator(API_KEY)
openAIGenerator.generate_description(test_prompt)


'The 2019 For Hire Vehicles Trip Data comprises trip-level records submitted to the Taxi and Limousine Commission (TLC) from traditional livery, luxury, and black-car bases operating in New York City. Each record captures core trip attributes, including the dispatch base, pick-up and drop-off times, and the corresponding Taxi Zone locations, which are linked to the NYC Taxi Zones dataset to enable precise geospatial analysis. This dataset provides a granular view of for-hire vehicle activity across the city, reflecting how rides are dispatched, when and where trips originate and end, and how zones relate to service patterns.\n\nThis data supports a wide range of analytical and policy objectives, from transportation planning and urban mobility research to ride-hail regulation and equity analysis. Users—city agencies, planners, researchers, and advocates—can examine base-level activity, demand dynamics by time and place, and the reach of high-volume or shared-ride models. The dataset und

In [8]:
# Load the model and tokenizer from Hugging Face
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = True,
)

generator = HFGenerator(model, tokenizer)
generator.generate_description(test_prompt)

==((====))==  Unsloth 2025.10.10: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


"The 2019 For Hire Vehicles Trip Data dataset provides a comprehensive view of taxi and ride-hailing activity in New York City. With over 150 million records, this dataset offers a rich source of information for transportation planners, researchers, and policymakers seeking to understand the dynamics of urban mobility. By analyzing trip patterns, ridership trends, and service provider data, users can gain insights into the city's transportation ecosystem, inform evidence-based policy decisions, and evaluate the effectiveness of regulations.\n\nThis dataset captures the dispatching base, pick-up and drop-off times, and locations of trips, as well as information on shared ride chains and affiliations. By leveraging this data, users can explore questions such as: How do different ride-hailing services impact traffic congestion and air quality? What are the demographics of riders and drivers, and how do they vary by service type? How can the city optimize its transportation infrastructure 
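
Both generator classes expose the same `generate_description(prompt)` method, so they can be compared on the same prompts with one loop. A minimal sketch follows; `compare_generators` and the `EchoGenerator` stand-in are hypothetical names, used only so the example runs without a GPU or an API key.

```python
def compare_generators(prompts, generators):
    """Run every prompt through every generator and collect the outputs.

    `generators` maps a label (e.g. "llama", "openai") to any object with a
    generate_description(prompt) method, as HFGenerator and OpenAIGenerator have.
    """
    rows = []
    for prompt in prompts:
        for label, generator in generators.items():
            rows.append({
                "generator": label,
                "prompt": prompt,
                "description": generator.generate_description(prompt),
            })
    return rows


class EchoGenerator:
    """Stand-in generator (hypothetical) that just upper-cases the prompt."""

    def generate_description(self, prompt):
        return prompt.upper()


rows = compare_generators(["describe taxi trips"], {"echo": EchoGenerator()})
print(rows[0]["description"])
```

The returned list of dictionaries can be passed to `pd.DataFrame(rows)` to view the outputs side by side.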

## Create & Apply Prompt Templates
The first prompt is adapted from the AutoDDG prompts.

In [None]:
# Templates are plain strings with {placeholders}; fill them in later with
# str.format(), e.g. introduction.format(dataset_sample=...), so this cell
# runs before dataset_sample, agency, and category are defined.
system_message = """You are an assistant for a dataset search engine. Your goal
is to improve the readability of dataset descriptions for dataset search engine users."""


introduction = """Answer the question using the following information.

    First, consider the dataset sample:

    {dataset_sample}"""

agency_cat = """Additionally, the agency is {agency} and the category is
{category}. Based on this topic and agency, please add sentence(s) describing what this
dataset can be used for."""

closing_instruction = """Question: Based on the information above and the
requirements, provide a dataset description in sentences. Use only natural,
readable sentences without special formatting."""
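
The pieces above are meant to be combined into one prompt. Below is a sketch of one way to do that, assuming the templates are filled with `str.format` and the result is packaged as a system/user message pair. `build_messages`, the shortened template strings, and the sample values are illustrative assumptions rather than the notebook's own code.

```python
# Shortened stand-ins for the templates; placeholders are filled with str.format().
SYSTEM_MESSAGE = "You are an assistant for a dataset search engine."
INTRODUCTION = "First, consider the dataset sample:\n\n{dataset_sample}"
AGENCY_CAT = "Additionally, the agency is {agency} and the category is {category}."
CLOSING_INSTRUCTION = "Question: provide a dataset description in sentences."


def build_messages(dataset_sample, agency, category):
    """Fill the templates and return a chat-style list of messages."""
    user_prompt = "\n\n".join([
        INTRODUCTION.format(dataset_sample=dataset_sample),
        AGENCY_CAT.format(agency=agency, category=category),
        CLOSING_INSTRUCTION,
    ])
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    dataset_sample='{"dispatching_base_num": "B01239"}',
    agency="Taxi and Limousine Commission (TLC)",
    category="Transportation",
)
print(messages[1]["content"])
```

A list in this shape can be rendered with the tokenizer's `apply_chat_template` for the Llama model or passed as chat input to the OpenAI client. The generator classes above take a single prompt string, so they would need a small change to accept it.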


