<a href="https://colab.research.google.com/github/elizabethavargas/Dataset-Description-Generation/blob/main/quick_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quick Start Guide


## Installation

In [None]:
!git clone https://github.com/elizabethavargas/Dataset-Description-Generation.git
!pip install -r Dataset-Description-Generation/requirements.txt

Cloning into 'Dataset-Description-Generation'...
remote: Enumerating objects: 170, done.
remote: Counting objects: 100% (170/170), done.
remote: Compressing objects: 100% (126/126), done.
remote: Total 170 (delta 81), reused 96 (delta 33), pack-reused 0 (from 0)
Receiving objects: 100% (170/170), 3.88 MiB | 9.13 MiB/s, done.
Resolving deltas: 100% (81/81), done.
Collecting unsloth (from -r Dataset-Description-Generation/requirements.txt (line 1))
  Downloading unsloth-2025.12.1-py3-none-any.whl.metadata (65 kB)
Collecting jupyter (from -r Dataset-Description-Generation/requirements.txt (line 7))
  Downloading jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting unsloth_zoo>=2025.12.3 (from unsloth->-r Dataset-Description-Generation/requirements.txt (line 1))
  Downloading unsloth_zoo-2025.12.3-py3-none-any.whl.metadata (32 kB)
Collecting tyro (f

## Imports

In [None]:
import sys
sys.path.append('/content/Dataset-Description-Generation')
from src.generator.generator import HFGenerator
from src.utils.nyc_utils import fetch_dataset_info
from src.evaluator.evaluator import OpenAIEvaluator

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Load Dataset Info
The URL of every NYC Open Data dataset contains a dataset ID. For example, in the following URL the dataset ID is **8wbx-tsch**: https://data.cityofnewyork.us/Transportation/For-Hire-Vehicles-FHV-Active/8wbx-tsch/about_data. Pass this ID to `fetch_dataset_info` to fetch the dataset info.
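If you only have the full URL, the ID can be pulled out programmatically. The sketch below is a minimal illustration, not part of this repository: `extract_dataset_id` is a hypothetical helper, and it assumes dataset IDs always take the form of two four-character lowercase alphanumeric groups joined by a hyphen, as in the example above.

```python
import re
from typing import Optional

# Assumption: IDs look like "8wbx-tsch" (four characters, a hyphen, four characters)
# and appear as their own path segment in the URL.
DATASET_ID_PATTERN = re.compile(r"/([a-z0-9]{4}-[a-z0-9]{4})(?:[/?#]|$)")


def extract_dataset_id(url: str) -> Optional[str]:
    """Return the dataset ID found in an NYC Open Data URL, or None if absent."""
    match = DATASET_ID_PATTERN.search(url)
    return match.group(1) if match else None


example_url = (
    "https://data.cityofnewyork.us/Transportation/"
    "For-Hire-Vehicles-FHV-Active/8wbx-tsch/about_data"
)
dataset_id = extract_dataset_id(example_url)
print(dataset_id)
```

The returned string can then be passed to `fetch_dataset_info` in place of the hard-coded ID.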

In [None]:
test_dataset = fetch_dataset_info('8wbx-tsch')
test_dataset


--- Querying dataset (ID: 8wbx-tsch) ---
Finished dataset: For Hire Vehicles (FHV) - Active


{'dataset_id': '8wbx-tsch',
 'dataset_name': 'For Hire Vehicles (FHV) - Active',
 'data_example': {'active': 'YES',
  'vehicle_license_number': '6032728',
  'name': 'UPPAL, ARSHDEEP',
  'license_type': 'FOR HIRE VEHICLE',
  'expiration_date': '2026-06-30T00:00:00.000',
  'permit_license_number': 'AA005',
  'dmv_license_plate_number': 'T117661C',
  'vehicle_vin_number': 'KMHLM4AJ2NU023825',
  'wheelchair_accessible': 'PILOT',
  'vehicle_year': '2022',
  'base_number': 'B03152',
  'base_name': 'EXIT LUXURY INC.',
  'base_type': 'BLACK-CAR',
  'veh': 'HYB',
  'base_telephone_number': '(718)472-9800',
  'base_address': '29 - 10   36 AVENUE LONGISLAND CITY NY 11106',
  'reason': 'G',
  'last_date_updated': '2025-12-06T00:00:00.000',
  'last_time_updated': '13:25'},
 'category': 'Transportation',
 'description': "<b>PLEASE NOTE:</b> This dataset, which includes all TLC licensed for-hire vehicles which are in good standing and able to drive, is updated every day in the evening between 4-7pm. 

## Generate Description

In [None]:
llama_generator = HFGenerator("unsloth/Meta-Llama-3.1-8B-Instruct")

==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
test_description = llama_generator.generate_description(test_dataset, user_focused=True, few_shot=False)
test_description

'This dataset, For Hire Vehicles (FHV) - Active, is provided by the Taxi and Limousine Commission (TLC) and contains information about active for-hire vehicles in New York City. The dataset includes details such as vehicle license numbers, owner names, license types, expiration dates, and base information. It is updated daily between 4-7pm and can be used to track the status of for-hire vehicles, identify active drivers, and monitor the taxi industry in New York City. This dataset can be used by researchers, policymakers, and industry professionals to analyze trends, identify patterns, and make informed decisions about the for-hire vehicle industry.'

## LLM Evaluation

To evaluate the dataset description, you need an OpenAI API key.

In [None]:
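import os

# Assumes the key is stored in the OPENAI_API_KEY environment variable.
api_key = os.environ["OPENAI_API_KEY"]
evaluator = OpenAIEvaluator(api_key)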

evaluator.evaluate_description(test_description)

'Completeness: 6, Conciseness: 9, Readability: 9'
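The evaluator output shown above is a single string. To compare scores across descriptions, it can help to turn it into a dictionary. The following is a rough sketch that assumes the output always follows the `Name: integer, Name: integer` layout seen here; `parse_scores` is a hypothetical helper, not a function from this repository.

```python
from typing import Dict


def parse_scores(result: str) -> Dict[str, int]:
    """Parse a string such as 'Completeness: 6, Conciseness: 9' into a dict."""
    scores: Dict[str, int] = {}
    for part in result.split(","):
        # Each part is expected to look like "Name: value".
        name, _, value = part.partition(":")
        scores[name.strip()] = int(value.strip())
    return scores


example_result = "Completeness: 6, Conciseness: 9, Readability: 9"
scores = parse_scores(example_result)
print(scores)
```

If the evaluator ever returns a differently formatted string, `int()` raises a `ValueError`, which makes the mismatch visible rather than silently producing wrong scores.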