Absolutely! Let's craft an effective dataset format for fine-tuning your LLM to excel at intent identification and location/information extraction for your weather chatbot.

Recommended Dataset Format

I recommend a structured format using either CSV (Comma-Separated Values) or JSON (JavaScript Object Notation) to store your training examples. Each example in the dataset should consist of the following fields:

text: The raw user query (string).
intent: The identified intent of the query (string). Example intents for your weather chatbot:
    get_current_weather
    get_forecast
    get_temperature
    get_wind_speed
    get_humidity
location: The city or location mentioned in the query (string).
weather_info: (Optional) The specific weather information requested (string). This is only needed for some intents, such as get_temperature. Example values:
    temperature
    wind_speed
    humidity
    forecast
    all (for getting all available information)

CSV Example

text,intent,location,weather_info
What's the weather like in London today?,get_current_weather,London,
Tell me the forecast for New York for the next week.,get_forecast,New York,
How hot is it in Tokyo right now?,get_temperature,Tokyo,temperature
Is it windy in Chicago?,get_wind_speed,Chicago,wind_speed
What's the humidity level in Mumbai?,get_humidity,Mumbai,humidity
Give me all the weather details for Paris.,get_current_weather,Paris,all
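
The same examples can also be stored as JSON, the other format recommended above. The following is a minimal sketch, assuming a JSON Lines layout (one object per line) and only the standard-library csv and json modules; the helper name csv_to_jsonl is hypothetical, and empty weather_info cells are written as null.

```python
import csv
import io
import json

CSV_TEXT = """text,intent,location,weather_info
What's the weather like in London today?,get_current_weather,London,
How hot is it in Tokyo right now?,get_temperature,Tokyo,temperature
"""

def csv_to_jsonl(csv_text):
    """Convert CSV training rows to JSON Lines, one object per line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The optional weather_info column is empty for some intents; store it as null.
        row["weather_info"] = row["weather_info"] or None
        lines.append(json.dumps(row))
    return "\n".join(lines)

print(csv_to_jsonl(CSV_TEXT))
```

Note that a query containing a comma would need to be quoted in the CSV file, which is one reason JSON Lines can be easier to maintain for free-form text.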

In [14]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset


In [8]:
# !pip install datasets

In [11]:
# 1. Load Your Dataset (CSV here; for a JSON file, use load_dataset("json", data_files=...))
dataset = load_dataset("csv", data_files="gemini_tempdata.csv")


In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'intent', 'location', 'weather_info'],
        num_rows: 6
    })
})
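
Before tokenizing, it may be worth checking that every row uses one of the intents listed earlier and names a location. This is a minimal sketch in plain Python; the helper find_invalid_rows and the sample rows are hypothetical, not part of the notebook.

```python
ALLOWED_INTENTS = {
    "get_current_weather", "get_forecast", "get_temperature",
    "get_wind_speed", "get_humidity",
}

def find_invalid_rows(rows):
    """Return (index, reason) pairs for rows with an unknown intent or a missing location."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("intent") not in ALLOWED_INTENTS:
            problems.append((i, "unknown intent"))
        elif not row.get("location"):
            problems.append((i, "missing location"))
    return problems

# Illustrative rows only; the second one uses an intent that is not in the list.
sample = [
    {"text": "Is it windy in Chicago?", "intent": "get_wind_speed", "location": "Chicago"},
    {"text": "Will it rain?", "intent": "get_rain", "location": None},
]
print(find_invalid_rows(sample))
```

With the DatasetDict above, the rows could be passed as dataset["train"], which yields one dictionary per example when iterated.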

In [17]:
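# 2. Load the Tokenizer (this relative path assumes a local copy of the model; the Hub ID is "microsoft/Phi-3-mini-4k-instruct")
tokenizer = AutoTokenizer.from_pretrained("Phi-3-mini-4k-instruct")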

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [18]:
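def tokenize_function(examples):
    # max_length=64 is an assumed cap; without it, every query is padded to the tokenizer's full model_max_length
    tokens = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=64)
    # Causal-LM training needs labels: copy input_ids and set padding positions to -100 so the loss ignores them
    tokens["labels"] = [[t if m else -100 for t, m in zip(ids, mask)] for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])]
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True)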
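
Note that tokenize_function only reads the text column, so the intent, location, and weather_info labels never reach the model. One way to include them is to fold each row into a single prompt-plus-answer string before tokenizing. The following is a minimal sketch; the helper build_training_text, the training_text column name, and the prompt template are all assumptions made for illustration.

```python
import json

def build_training_text(example):
    """Fold one dataset row into a prompt followed by the expected JSON answer."""
    target = {
        "intent": example["intent"],
        "location": example["location"],
        # Empty CSV cells load as None; keep that explicit in the target.
        "weather_info": example.get("weather_info") or None,
    }
    prompt = f"Query: {example['text']}\nAnswer: "
    return {"training_text": prompt + json.dumps(target)}

# Illustrative row taken from the CSV example.
row = {"text": "Is it windy in Chicago?", "intent": "get_wind_speed",
       "location": "Chicago", "weather_info": "wind_speed"}
print(build_training_text(row)["training_text"])
```

With the datasets library, this could be applied via dataset.map(build_training_text), with the tokenizer then pointed at the new training_text column instead of text.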


In [None]:
# 3. Load and Adapt the Phi-3 Mini Model
model = AutoModelForCausalLM.from_pretrained("Phi-3-mini-4k-instruct")


In [None]:
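peft_config = LoraConfig(
    r=8,  # LoRA rank: lower is cheaper to train, higher gives the adapter more capacity
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Phi-3 fuses its attention projections into qkv_proj, so q_proj/v_proj would match no modules
    target_modules=["qkv_proj", "o_proj"],  # You can experiment with these
)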
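
To get a feel for the rank tradeoff: a LoRA adapter on a linear layer with input size d_in and output size d_out adds r * (d_in + d_out) trainable parameters. The sketch below only does that arithmetic; the layer dimensions are placeholders for illustration, not values read from the Phi-3 configuration, and lora_param_count is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Placeholder dimensions for illustration only (not taken from the model config).
d_in, d_out = 1024, 1024
for r in (4, 8, 16):
    print(r, lora_param_count(d_in, d_out, r))

# The adapter update is scaled by lora_alpha / r; with lora_alpha=16 and r=8 that is 2.0.
scaling = 16 / 8
print(scaling)
```

Doubling r doubles the number of adapter parameters per targeted layer, which is why the rank is usually the first knob to adjust.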

In [None]:
model = get_peft_model(model, peft_config)

# 4. Set up Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,  # Lower this if you run out of memory
    gradient_accumulation_steps=4,  # Raise this when lowering the batch size to keep the effective batch size (4 x 4 = 16 per device)
    num_train_epochs=3,
    learning_rate=2e-4,  # You can experiment with different values
    logging_steps=50,  # With only 6 rows there are far fewer than 50 steps; lower this to see loss logs
    save_strategy="epoch",
)

# 5. Train the Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # Assuming you have a 'train' split in your dataset
)
trainer.train()

# 6. Save the Fine-Tuned Model
model.save_pretrained("./fine_tuned_phi3")  # Saves only the LoRA adapter weights, not the full base model
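
To see how the training arguments above interact with a dataset of this size, the optimizer-step arithmetic can be sketched as follows. This is a simplified estimate, assuming a single device and that a partial final accumulation group still triggers an update; the function name is hypothetical and the Trainer's exact bookkeeping may differ.

```python
import math

def estimate_optimizer_steps(num_rows, batch_size, grad_accum, epochs):
    """Rough count of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    # Gradients from grad_accum batches are combined into one update.
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return updates_per_epoch * epochs

# Values from the notebook: 6 rows, batch size 4, accumulation 4, 3 epochs.
print(estimate_optimizer_steps(6, 4, 4, 3))
# Effective batch size per device: batch size times accumulation steps.
print(4 * 4)
```

With roughly three updates in total, a logging interval of 50 steps would never be reached, so a larger dataset or a smaller logging_steps value is needed to see any loss values.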