<a href="https://colab.research.google.com/github/dapopov-st/MiniLlamaSQL/blob/main/Fine_tune_CodeLlama_7b_Instruct_on_b_mc2_sql_create_context.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune CodeLlama-7b-Instruct on b-mc2-sql-create-context.ipynb

Built starting from this [Colab NB](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing).

This notebook runs on an A100 GPU on Colab. (Last update: 24 Aug 2023)


# Llama 2 Script

## TODOS:
Try additional datasets (merge)
- https://huggingface.co/datasets/b-mc2/sql-create-context
- Mistral or Zephyr 7b
- Evaluate on SQLEval https://defog.ai/blog/open-sourcing-sqleval/ (question/answer/context format) just like knowrohit07/know_sql. Possibly include Clinton/Text-to-sql-v1 (much larger dataset, include/rename columns I need)
- If sticking with Llama (or better yet CodeLlama: Phind/Phind-CodeLlama-34B-v2, TheBloke/CodeLlama-7B-GGUF, TheBloke/CodeLlama-13B-Instruct-GGUF, or better still https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf, the CodeLlama Instruct model, which was also fine-tuned on SQL), the knowledge can be distilled into TinyLlama (TinyLlama/TinyLlama-1.1B-intermediate-step-715k-1.5T). This approach would likely require spinning up a cluster on GCP.
- Consider fine-tuning on CoSQL (harder)

Additional:
- https://www.datacamp.com/tutorial/mistral-7b-tutorial is quite good for Mistral finetuning/general finetuning advice
- Bing Chat Experimental: According to the web search results I found for you, both Code-Llama Instruct 7b and SQLCoder 7B are state-of-the-art large language models for generating SQL queries from natural language. However, based on the performance benchmarks reported by their developers, SQLCoder 7B seems to have a slight edge over Code-Llama Instruct 7b in terms of execution accuracy. SQLCoder 7B achieved 71% correct SQL queries on a novel dataset not seen in training, while Code-Llama Instruct 7b achieved 70%. SQLCoder 7B also outperformed Code-Llama Instruct 7b on most query categories, such as date, group by, order by, ratio, and join. However, Code-Llama Instruct 7b has the advantage of being faster and more suitable for low latency tasks, as it can be served on a single GPU. SQLCoder 7B requires a more powerful hardware setup, such as an A100 40GB GPU or an RTX 4090. Therefore, the answer to your question may depend on your specific use case and preferences.
- If the above is accurate, CodeLlama would probably be the best teacher for distillation to TinyLlama-1.1B.

- Bing: According to the web search results, CodeLlama-7b-Instruct was not fine-tuned on b-mc2/sql-create-context. However, there are some other models that were fine-tuned on this dataset, such as qblocks/llama2_SQL_Answers_finetuned and TheBloke/CodeLlama-7B-Instruct-GPTQ. These models are available on Hugging Face and can be used for generating SQL queries from natural language questions.

## Installs and Config

- To set up the tokenizer for fine-tuning CodeLlama-7b-Instruct, install the transformers library from the main development branch, since the CodeLlamaTokenizer class was not yet available in a tagged release when this notebook was written. The next cell does this:


In [None]:
!pip install git+https://github.com/huggingface/transformers.git@main

Collecting git+https://github.com/huggingface/transformers.git@main
  Cloning https://github.com/huggingface/transformers.git (to revision main) to /tmp/pip-req-build-zhu7hlv2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-zhu7hlv2
  Resolved https://github.com/huggingface/transformers.git to commit af8acc4760d44e48f953e075e3b13a43843d5f91
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.36.0.dev0-py3-none-any.whl size=8075999 sha256=5631855edb21017fe093ff61c1245909d14c78e9be8a6be57add311f3a5a2b2f
  Stored in directory: /tmp/pip-ephem-wheel-cache-iggidq93/wheels/cf/59/82/6492402e887a68975030bf8c06532260abc16abb7c

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 trl==0.4.7 datasets==2.10.1 wandb==0.16.0
#transformers==4.31.0
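
Because the cell above pins exact versions while transformers comes from the main branch, it can help to confirm what actually got installed before training. The following is a minimal sketch using only the standard library; `installed_versions` and `report_mismatches` are made-up helper names, and the pins are copied from the install cell.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINNED = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "trl": "0.4.7",
    "datasets": "2.10.1",
    "wandb": "0.16.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report_mismatches(pinned):
    """Return {package: (wanted, found)} for every package that differs from its pin."""
    found = installed_versions(pinned)
    return {
        name: (wanted, found[name])
        for name, wanted in pinned.items()
        if found[name] != wanted
    }


for name, (wanted, found) in report_mismatches(PINNED).items():
    print(f"{name}: wanted {wanted}, found {found}")
```

An empty printout means every pinned package matches; anything listed is worth resolving before a long training run.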

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer


In [None]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports

# Set the code page width to a larger value
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
from huggingface_hub import login
login()


In [None]:
import wandb # do from script later
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
from google.colab import drive
drive.mount('/content/drive')

output_dir = '/content/drive/MyDrive/SQL/sql_output_dir'
logging_dir = '/content/drive/MyDrive/SQL/sql_logging_dir'

Mounted at /content/drive


In [None]:
import sys,gc,traceback
import torch
def clean_ipython_hist():
    # Code in this function mainly copied from IPython source
    if not 'get_ipython' in globals(): return
    ip = get_ipython()
    user_ns = ip.user_ns
    ip.displayhook.flush()
    pc = ip.displayhook.prompt_count + 1
    for n in range(1, pc): user_ns.pop('_i'+repr(n),None)
    user_ns.update(dict(_i='',_ii='',_iii=''))
    hm = ip.history_manager
    hm.input_hist_parsed[:] = [''] * pc
    hm.input_hist_raw[:] = [''] * pc
    hm._i = hm._ii = hm._iii = hm._i00 =  ''



def clean_tb():
    # h/t Piotr Czapla
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    if hasattr(sys, 'last_type'): delattr(sys, 'last_type')
    if hasattr(sys, 'last_value'): delattr(sys, 'last_value')

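def clean_mem():
    # Release references held by the last traceback and the IPython history,
    # then run the garbage collector and free cached CUDA memory.
    clean_tb()
    clean_ipython_hist()
    gc.collect()
    torch.cuda.empty_cache()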

In [None]:
clean_mem()

## Original HF Script

In [None]:

# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from dataclasses import dataclass, field
from typing import Optional

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
)

from trl import SFTTrainer

# This example fine-tunes a Llama v2 model on the Guanaco dataset
# using QLoRA. At the end of the script the adapter weights can be merged
# into the base model. Use it by passing the --model_name argument when
# running the script.
#
# Versions used:
# accelerate == 0.21.0
# peft == 0.4.0
# bitsandbytes == 0.40.2
# transformers == 4.31.0
# trl == 0.4.7

# For models that have `config.pretraining_tp > 1` install:
# pip install git+https://github.com/huggingface/transformers.git

@dataclass
class ScriptArguments:
    """
    These arguments vary depending on how many GPUs you have, what their capacity and features are, and what size model you want to train.
    """

    local_rank: Optional[int] = field(default=-1, metadata={"help": "Used for multi-gpu"})

    per_device_train_batch_size: Optional[int] = field(default=4)
    per_device_eval_batch_size: Optional[int] = field(default=1)
    gradient_accumulation_steps: Optional[int] = field(default=4)
    learning_rate: Optional[float] = field(default=2e-4)
    max_grad_norm: Optional[float] = field(default=0.3)
    weight_decay: Optional[float] = field(default=0.001)
    lora_alpha: Optional[int] = field(default=16)
    lora_dropout: Optional[float] = field(default=0.1)
    lora_r: Optional[int] = field(default=64)
    max_seq_length: Optional[int] = field(default=512)
    model_name: Optional[str] = field(
        default="meta-llama/Llama-2-7b-hf",
        metadata={
            "help": "The model that you want to train from the Hugging Face hub. E.g. gpt2, gpt2-xl, bert, etc."
        }
    )
    dataset_name: Optional[str] = field(
        default="timdettmers/openassistant-guanaco",
        metadata={"help": "The preference dataset to use."},
    )
    use_4bit: Optional[bool] = field(
        default=True,
        metadata={"help": "Activate 4bit precision base model loading"},
    )
    use_nested_quant: Optional[bool] = field(
        default=False,
        metadata={"help": "Activate nested quantization for 4bit base models"},
    )
    bnb_4bit_compute_dtype: Optional[str] = field(
        default="float16",
        metadata={"help": "Compute dtype for 4bit base models"},
    )
    bnb_4bit_quant_type: Optional[str] = field(
        default="nf4",
        metadata={"help": "Quantization type fp4 or nf4"},
    )
    num_train_epochs: Optional[int] = field(
        default=1,
        metadata={"help": "The number of training epochs for the reward model."},
    )
    fp16: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables fp16 training."},
    )
    bf16: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables bf16 training."},
    )
    packing: Optional[bool] = field(
        default=False,
        metadata={"help": "Use packing dataset creating."},
    )
    gradient_checkpointing: Optional[bool] = field(
        default=True,
        metadata={"help": "Enables gradient checkpointing."},
    )
    optim: Optional[str] = field(
        default="paged_adamw_32bit",
        metadata={"help": "The optimizer to use."},
    )
    lr_scheduler_type: str = field(
        default="constant",
        metadata={"help": "Learning rate schedule. Constant a bit better than cosine, and has advantage for analysis"},
    )
    max_steps: int = field(default=10000, metadata={"help": "How many optimizer update steps to take"})
    warmup_ratio: float = field(default=0.03, metadata={"help": "Fraction of steps to do a warmup for"})
    group_by_length: bool = field(
        default=True,
        metadata={
            "help": "Group sequences into batches with same length. Saves memory and speeds up training considerably."
        },
    )
    save_steps: int = field(default=10, metadata={"help": "Save checkpoint every X updates steps."})
    logging_steps: int = field(default=10, metadata={"help": "Log every X updates steps."})
    merge_and_push: Optional[bool] = field(
        default=False,
        metadata={"help": "Merge and push weights after training"},
    )
    output_dir: str = field(
        default="./results",
        metadata={"help": "The output directory where the model predictions and checkpoints will be written."},
    )


parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]


def create_and_prepare_model(args):
    compute_dtype = getattr(torch, args.bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    )

    if compute_dtype == torch.float16 and args.use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
            print("=" * 80)

    # Load the entire model on the GPU 0
    # switch to `device_map = "auto"` for multi-GPU
    device_map = {"": 0}

    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        quantization_config=bnb_config,
        device_map=device_map,
        use_auth_token=True
    )

    # check: https://github.com/huggingface/transformers/pull/24906
    model.config.pretraining_tp = 1

    peft_config = LoraConfig(
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        r=script_args.lora_r,
        bias="none",
        task_type="CAUSAL_LM",
    )

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    return model, peft_config, tokenizer


training_arguments = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    optim=script_args.optim,
    save_steps=script_args.save_steps,
    logging_steps=script_args.logging_steps,
    learning_rate=script_args.learning_rate,
    fp16=script_args.fp16,
    bf16=script_args.bf16,
    max_grad_norm=script_args.max_grad_norm,
    max_steps=script_args.max_steps,
    warmup_ratio=script_args.warmup_ratio,
    group_by_length=script_args.group_by_length,
    lr_scheduler_type=script_args.lr_scheduler_type,
)

model, peft_config, tokenizer = create_and_prepare_model(script_args)
model.config.use_cache = False
dataset = load_dataset(script_args.dataset_name, split="train")

# Fix weird overflow issue with fp16 training
tokenizer.padding_side = "right"

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=script_args.packing,
)

trainer.train()

if script_args.merge_and_push:
    output_dir = os.path.join(script_args.output_dir, "final_checkpoints")
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    torch.cuda.empty_cache()

    from peft import AutoPeftModelForCausalLM

    model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
    model = model.merge_and_unload()

    output_merged_dir = os.path.join(script_args.output_dir, "final_merged_checkpoint")
    model.save_pretrained(output_merged_dir, safe_serialization=True)
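
The script loads the base model with `use_4bit=True`. As a rough back-of-the-envelope check of why that matters on a single GPU, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores quantization constants, LoRA adapters, activations, and optimizer state, so real usage will be higher. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: nominal size of a "7b" model

for label, bits in [("fp32", 32), ("fp16 / bf16", 16), ("4-bit (nf4 or fp4)", 4)]:
    print(f"{label:>20}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```

Under these assumptions the weights drop from about 13 GiB at 16-bit to about 3.3 GiB at 4-bit, which is what leaves room for activations and the paged optimizer on one card.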

## Mistral 7B script

In [None]:
import os
from dataclasses import dataclass, field
from typing import Optional
from datasets.arrow_dataset import Dataset
import torch
from datasets import load_dataset
from peft import LoraConfig
from peft import AutoPeftModelForCausalLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
)

from trl import SFTTrainer

torch.manual_seed(42)

@dataclass
class ScriptArguments:
    """
    These arguments vary depending on how many GPUs you have, what their capacity and features are, and what size model you want to train.
    """

    local_rank: Optional[int] = field(default=-1, metadata={"help": "Used for multi-gpu"})

    per_device_train_batch_size: Optional[int] = field(default=4)
    per_device_eval_batch_size: Optional[int] = field(default=4)
    gradient_accumulation_steps: Optional[int] = field(default=4)
    learning_rate: Optional[float] = field(default=2e-5)
    max_grad_norm: Optional[float] = field(default=0.3)
    weight_decay: Optional[float] = field(default=0.01)
    lora_alpha: Optional[int] = field(default=16)
    lora_dropout: Optional[float] = field(default=0.1)
    lora_r: Optional[int] = field(default=32)
    max_seq_length: Optional[int] = field(default=512)
    model_name: Optional[str] = field(
        default="mistralai/Mistral-7B-Instruct-v0.1",
        metadata={
            "help": "The model that you want to train from the Hugging Face hub. E.g. gpt2, gpt2-xl, bert, etc."
        }
    )
    dataset_name: Optional[str] = field(
        default="iamtarun/python_code_instructions_18k_alpaca",
        metadata={"help": "The preference dataset to use."},
    )

    use_4bit: Optional[bool] = field(
        default=True,
        metadata={"help": "Activate 4bit precision base model loading"},
    )
    use_nested_quant: Optional[bool] = field(
        default=False,
        metadata={"help": "Activate nested quantization for 4bit base models"},
    )
    bnb_4bit_compute_dtype: Optional[str] = field(
        default="float16",
        metadata={"help": "Compute dtype for 4bit base models"},
    )
    bnb_4bit_quant_type: Optional[str] = field(
        default="nf4",
        metadata={"help": "Quantization type fp4 or nf4"},
    )
    num_train_epochs: Optional[int] = field(
        default=100,
        metadata={"help": "The number of training epochs for the reward model."},
    )
    fp16: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables fp16 training."},
    )
    bf16: Optional[bool] = field(
        default=True,
        metadata={"help": "Enables bf16 training."},
    )
    packing: Optional[bool] = field(
        default=False,
        metadata={"help": "Use packing dataset creating."},
    )
    gradient_checkpointing: Optional[bool] = field(
        default=True,
        metadata={"help": "Enables gradient checkpointing."},
    )
    optim: Optional[str] = field(
        default="paged_adamw_32bit",
        metadata={"help": "The optimizer to use."},
    )
    lr_scheduler_type: str = field(
        default="constant",
        metadata={"help": "Learning rate schedule. Constant a bit better than cosine, and has advantage for analysis"},
    )
    max_steps: int = field(default=1000000, metadata={"help": "How many optimizer update steps to take"})
    warmup_ratio: float = field(default=0.03, metadata={"help": "Fraction of steps to do a warmup for"})
    group_by_length: bool = field(
        default=True,
        metadata={
            "help": "Group sequences into batches with same length. Saves memory and speeds up training considerably."
        },
    )
    save_steps: int = field(default=50, metadata={"help": "Save checkpoint every X updates steps."})
    logging_steps: int = field(default=50, metadata={"help": "Log every X updates steps."})
    merge_and_push: Optional[bool] = field(
        default=False,
        metadata={"help": "Merge and push weights after training"},
    )
    output_dir: str = field(
        default="./results_packing",
        metadata={"help": "The output directory where the model predictions and checkpoints will be written."},
    )


parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]


def gen_batches_train():
    ds = load_dataset(script_args.dataset_name, streaming=True, split="train")
    total_samples = 10000
    val_pct = 0.1
    train_limit = int(total_samples * (1 - val_pct))
    counter = 0

    for sample in iter(ds):
        if counter >= train_limit:
            break

        original_prompt = sample['prompt'].replace("### Input:\n", '').replace('# Python code\n', '')
        instruction_start = original_prompt.find("### Instruction:") + len("### Instruction:")
        # prompt has ### Input\n which i want to remove
        instruction_end = original_prompt.find("### Output:")

        instruction = original_prompt[instruction_start:instruction_end].strip()
        content_start = original_prompt.find("### Output:") + len("### Output:")
        content = original_prompt[content_start:].strip()
        new_text_format = f'<s>[INST] {instruction} [/INST] ```python\n{content}```</s>'

        tokenized_output = tokenizer(new_text_format)
        yield {'text': new_text_format}

        counter += 1

def gen_batches_val():
    ds = load_dataset(script_args.dataset_name, streaming=True, split="train")
    total_samples = 10000
    val_pct = 0.1
    train_limit = int(total_samples * (1 - val_pct))
    counter = 0

    for sample in iter(ds):
        if counter < train_limit:
            counter += 1
            continue

        if counter >= total_samples:
            break

        original_prompt = sample['prompt'].replace("### Input:\n", '').replace('# Python code\n', '')
        instruction_start = original_prompt.find("### Instruction:") + len("### Instruction:")
        instruction_end = original_prompt.find("### Output:")
        instruction = original_prompt[instruction_start:instruction_end].strip()
        content_start = original_prompt.find("### Output:") + len("### Output:")
        content = original_prompt[content_start:].strip()
        new_text_format = f'<s>[INST] {instruction} [/INST] ```python\n{content}```</s>'

        tokenized_output = tokenizer(new_text_format)
        yield {'text': new_text_format}

        counter += 1


def create_and_prepare_model(args):
    compute_dtype = getattr(torch, args.bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    )

    if compute_dtype == torch.float16 and args.use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
            print("=" * 80)

    # Load the entire model on the GPU 0
    # switch to `device_map = "auto"` for multi-GPU
    device_map = {"": 0}

    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        quantization_config=bnb_config,
        device_map=device_map,
        use_auth_token=True,
        # revision="refs/pr/35"
    )

    #### LLAMA STUFF
    # check: https://github.com/huggingface/transformers/pull/24906
    model.config.pretraining_tp = 1
    # model.config.
    #### LLAMA STUFF
    model.config.window = 256

    peft_config = LoraConfig(
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        # target_modules=["query_key_value"],
        r=script_args.lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    )

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    return model, peft_config, tokenizer


training_arguments = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    optim=script_args.optim,
    save_steps=script_args.save_steps,
    logging_steps=script_args.logging_steps,
    learning_rate=script_args.learning_rate,
    fp16=script_args.fp16,
    bf16=script_args.bf16,
    evaluation_strategy="steps",
    max_grad_norm=script_args.max_grad_norm,
    max_steps=script_args.max_steps,
    warmup_ratio=script_args.warmup_ratio,
    group_by_length=script_args.group_by_length,
    lr_scheduler_type=script_args.lr_scheduler_type,
    report_to='wandb',
)

model, peft_config, tokenizer = create_and_prepare_model(script_args)
model.config.use_cache = False
# dataset = load_dataset(script_args.dataset_name, split="train")


# Usage
# train_gen = gen_batches('train', total_samples=10000, val_pct=0.1)
# val_gen = gen_batches('val', total_samples=10000, val_pct=0.1)
train_gen = Dataset.from_generator(gen_batches_train)
val_gen = Dataset.from_generator(gen_batches_val)

# dataset = gen_batches(script_args.per_device_train_batch_size)

# Fix weird overflow issue with fp16 training
tokenizer.padding_side = "right"

trainer = SFTTrainer(
    model=model,
    train_dataset=train_gen,
    eval_dataset=val_gen,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=script_args.packing,
)

trainer.train()

if script_args.merge_and_push:
    output_dir = os.path.join(script_args.output_dir, "final_checkpoints")
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    torch.cuda.empty_cache()

    model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
    model = model.merge_and_unload()

    output_merged_dir = os.path.join(script_args.output_dir, "final_merged_checkpoint")
    model.save_pretrained(output_merged_dir, safe_serialization=True)
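
`gen_batches_train` and `gen_batches_val` repeat the same counting logic, and the commented-out "Usage" lines hint at a single `gen_batches(split, total_samples, val_pct)`. One possible shape for that is sketched below on a plain iterable with `itertools.islice`. `split_samples` is a hypothetical name, and the prompt formatting step is deliberately left out.

```python
from itertools import islice


def split_samples(samples, split, total_samples=10000, val_pct=0.1):
    """Yield the train or validation slice of the first total_samples items.

    The first (1 - val_pct) share goes to "train", the remainder to "val",
    mirroring the counters in gen_batches_train / gen_batches_val.
    """
    train_limit = int(total_samples * (1 - val_pct))
    if split == "train":
        start, stop = 0, train_limit
    elif split == "val":
        start, stop = train_limit, total_samples
    else:
        raise ValueError(f"unknown split: {split!r}")
    # islice skips and stops lazily, so it also works on streaming iterables.
    yield from islice(samples, start, stop)


# Toy check on a range instead of a real dataset.
print(list(split_samples(range(100), "train", total_samples=10, val_pct=0.2)))
print(list(split_samples(range(100), "val", total_samples=10, val_pct=0.2)))
```

Keeping both splits in one function means the boundary between them is computed in exactly one place, so the two slices cannot drift apart.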

## My script below
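
The cells below turn each dataset row (`question`, `context`, `answer`) into a single training string. For reference, here is a minimal sketch of the Llama 2 chat layout as published by Meta, with the system prompt on its own lines and a closing `</s>`. The notebook's own string uses spaces instead of newlines and has no closing `</s>`. `format_example` and `SYSTEM_PROMPT` are illustrative names, the sample row is made up, and whether the tokenizer adds its own `<s>` on top should be checked before using this for training.

```python
SYSTEM_PROMPT = "Write SQL code to answer the question based on the context."


def format_example(question, context, answer=None, system=SYSTEM_PROMPT):
    """Build a Llama-2-chat style string; omit the answer to get an inference prompt."""
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{question}\n{context} [/INST]"
    if answer is None:
        return prompt
    # The closing </s> marks the end of the assistant turn.
    return f"{prompt} {answer} </s>"


# Made-up row in the question / context / answer shape of the dataset.
print(format_example(
    "How many heads are there?",
    "CREATE TABLE head (age INTEGER)",
    "SELECT COUNT(*) FROM head",
))
```

Calling the function without `answer` gives the matching inference-time prompt, which keeps training and generation formats in sync.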

In [None]:
import os
from dataclasses import dataclass, field
from typing import Optional
from datasets.arrow_dataset import Dataset
import torch
from datasets import load_dataset
from peft import LoraConfig
from peft import AutoPeftModelForCausalLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
)

from trl import SFTTrainer


# Used for multi-GPU setup; -1 default means single GPU
local_rank=-1
# Will likely need to lower this (to 1?) once memory consumption is known
per_device_train_batch_size=4
per_device_eval_batch_size=4
gradient_accumulation_steps=4 # if batch size is changed to 1, increase this to 16
learning_rate=2e-5
# Maximum norm for gradient clipping
max_grad_norm=0.3
weight_decay=0.01
lora_alpha=16
lora_dropout=0.1
lora_r=32
# Should be OK, but may need to adjust in conjunction with per_device_batch_size and gradient_accumulation_steps
max_seq_length=512

model_name="codellama/CodeLlama-7b-Instruct-hf" #Adjusted

dataset_name="b-mc2/sql-create-context"
# Activate 4-bit precision base model loading
use_4bit=True

use_nested_quant=False # nested (double) quantization also quantizes the quantization constants to save a little more memory; off here

bnb_4bit_compute_dtype="float16" # computations performed in 16-bit; can lower accuracy, experiment
# nf4 (4-bit NormalFloat) is the data type used to store the 4-bit weights. It is independent of use_nested_quant, so nf4 with use_nested_quant=False is a valid combination.
bnb_4bit_quant_type="nf4"
# OK, maybe set this to 1 or 2. Adjust in conjunction with max_steps
num_train_epochs=100
# MosaicML: use bf16 over fp16 for training; bf16 tends to be more numerically stable. bf16 needs a GPU with compute capability >= 8, so it works on an A100 but not on a T4.
fp16=False
bf16=True
# I think I should set packing = True; not sure why HF set it to False for the Llama 2 fine-tune
packing=False
# Store some but not all intermediate activations from the forward pass and recompute the rest during the backward pass, trading extra compute for lower training memory
gradient_checkpointing=True
#TODO: May need to adjust to 8-bit (16-bit?) for T4
optim="paged_adamw_32bit"
# May consider cosine, since not much learning happens at the start (perhaps OK if it's an extended warmup step in essence)
lr_scheduler_type="constant"
# Adjust to, say, 100. Specifies the total number of optimizer update steps to perform; one step covers per_device_train_batch_size * gradient_accumulation_steps samples per device. A positive max_steps overrides num_train_epochs. Can be useful for more control when training over a large dataset; consider max_steps=-1 with
# num_train_epochs=1 once working
max_steps=1000000
max_steps=1000000
# Fraction of steps to use for warmup
warmup_ratio=0.03
# I guess since not using packing, can group by length to save memory and speed up training
group_by_length=True
#Save checkpoint every X steps
save_steps=50
#Log every X updates steps.
logging_steps=50
# Merge and push weights after training
merge_and_push=False
# The output directory where the model predictions and checkpoints will be written
# (reuses the Drive path defined in the mount cell above)
output_dir=output_dir
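
With these settings the interplay between batch size, gradient accumulation, `max_steps`, and `num_train_epochs` is easy to lose track of. The sketch below works through the arithmetic, assuming a single GPU and the 9,000 training rows implied by `NUM_SAMPLES = 10000` with a 10% validation split. `training_schedule` is a hypothetical helper, and the step count is approximate since the Trainer's own rounding may differ by a step.

```python
import math


def training_schedule(n_train, per_device_batch_size, grad_accum_steps,
                      n_devices=1, max_steps=-1, num_train_epochs=1):
    """Approximate the optimizer-step budget for a given configuration."""
    # Samples consumed per optimizer update.
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    steps_per_epoch = math.ceil(n_train / effective_batch)
    # A positive max_steps takes precedence over num_train_epochs.
    if max_steps > 0:
        total_steps = max_steps
    else:
        total_steps = steps_per_epoch * num_train_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "epochs_covered": total_steps / steps_per_epoch,
    }


# Settings from the cell above.
print(training_schedule(9000, 4, 4, max_steps=1_000_000, num_train_epochs=100))
# One pass over the data, with max_steps disabled.
print(training_schedule(9000, 4, 4, max_steps=-1, num_train_epochs=1))
```

Under these assumptions `max_steps=1000000` corresponds to well over a thousand passes over the 9,000 rows, which is why the comments above suggest lowering it or switching to a single epoch.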

In [None]:
NUM_SAMPLES = 10000

def gen_batches_train():
    ds = load_dataset(script_args.dataset_name, streaming=True, split="train")
    total_samples = NUM_SAMPLES
    val_pct = 0.1
    train_limit = int(total_samples * (1 - val_pct))
    counter = 0

    for sample in iter(ds):
        if counter >= train_limit:
            break

        # Build one training string from the row's question, context
        # (CREATE TABLE statements) and answer, using [INST] / <<SYS>> markers.
        question, context, answer = sample['question'], sample['context'], sample['answer']
        new_text_format = f'<s> [INST] <<SYS>> Write SQL code to answer the question based on the context. Please wrap your code answer using ```: <</SYS>> {question} {context} [/INST] {answer}' #Adjusted

        yield {'text': new_text_format}

        counter += 1

def gen_batches_val():
    ds = load_dataset(script_args.dataset_name, streaming=True, split="train")
    total_samples = NUM_SAMPLES
    val_pct = 0.1
    train_limit = int(total_samples * (1 - val_pct))
    counter = 0

    for sample in iter(ds):
        if counter < train_limit:
            counter += 1
            continue

        if counter >= total_samples:
            break


        # Same formatting as in gen_batches_train, applied to the validation rows.
        question, context, answer = sample['question'], sample['context'], sample['answer']
        new_text_format = f'<s> [INST] <<SYS>> Write SQL code to answer the question based on the context. Please wrap your code answer using ```: <</SYS>> {question} {context} [/INST] {answer}' #Adjusted

        yield {'text': new_text_format}

        counter += 1


def create_and_prepare_model(args):
    compute_dtype = getattr(torch, args.bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    )

    if compute_dtype == torch.float16 and args.use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
            print("=" * 80)

    # Load the entire model on the GPU 0
    # switch to `device_map = "auto"` for multi-GPU
    device_map = {"": 0}

    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        quantization_config=bnb_config,
        device_map=device_map,
        use_auth_token=True,
        # revision="refs/pr/35"
    )

    #### LLAMA STUFF
    # check: https://github.com/huggingface/transformers/pull/24906
    model.config.pretraining_tp = 1
    #### LLAMA STUFF
    # NOTE: `window` is not a standard LlamaConfig field, so this assignment likely has no effect
    model.config.window = 256

    peft_config = LoraConfig(
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
        r=args.lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        # Adapt all attention and MLP projections, plus the output head
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
    )

    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    return model, peft_config, tokenizer
from collections import namedtuple
script_args_ = {
    "local_rank": -1,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "max_grad_norm": 0.3,
    "weight_decay": 0.01,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "lora_r": 32,
    "max_seq_length": 512,
    "model_name": "codellama/CodeLlama-7b-Instruct-hf",
    "dataset_name": "b-mc2/sql-create-context",
    "use_4bit": True,
    "use_nested_quant": False,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_type": "nf4",
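    "num_train_epochs": 100,  # not passed to TrainingArguments below; max_steps controls run length
    "fp16": True,  # set to False when enabling bf16
    "bf16": False,  # set to True on an A100 (or another GPU with bfloat16 support)
    "packing": False,
    "gradient_checkpointing": True,
    "optim": "paged_adamw_32bit",
    "lr_scheduler_type": "constant",  # "constant" applies no warmup; "constant_with_warmup" is needed for warmup_ratio to take effect
    "max_steps": 1000000,
    "warmup_ratio": 0.03,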
    "group_by_length": True,
    "save_steps": 50,
    "logging_steps": 50,
    "merge_and_push": False,
    "output_dir": output_dir
}
ScriptArgs = namedtuple('ScriptArgs',script_args_.keys())
script_args = ScriptArgs(*script_args_.values())
training_arguments = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    optim=script_args.optim,
    save_steps=script_args.save_steps,
    logging_steps=script_args.logging_steps,
    learning_rate=script_args.learning_rate,
    fp16=script_args.fp16,
    bf16=script_args.bf16,
    evaluation_strategy="steps",
    max_grad_norm=script_args.max_grad_norm,
    max_steps=script_args.max_steps,
    warmup_ratio=script_args.warmup_ratio,
    group_by_length=script_args.group_by_length,
    lr_scheduler_type=script_args.lr_scheduler_type,
    report_to='wandb',
)

# training_arguments = TrainingArguments(
#     output_dir=output_dir,
#     per_device_train_batch_size=per_device_train_batch_size,
#     gradient_accumulation_steps=gradient_accumulation_steps,
#     optim=optim,
#     save_steps=save_steps,
#     logging_steps=logging_steps,
#     learning_rate=learning_rate,
#     fp16=fp16,
#     bf16=bf16,
#     evaluation_strategy="steps",
#     max_grad_norm=max_grad_norm,
#     max_steps=max_steps,
#     warmup_ratio=warmup_ratio,
#     group_by_length=group_by_length,
#     lr_scheduler_type=lr_scheduler_type,
#     report_to='wandb',
# )

model, peft_config, tokenizer = create_and_prepare_model(script_args)
model.config.use_cache = False
# dataset = load_dataset(script_args.dataset_name, split="train")


# Usage
# train_gen = gen_batches('train', total_samples=10000, val_pct=0.1)
# val_gen = gen_batches('val', total_samples=10000, val_pct=0.1)
train_gen = Dataset.from_generator(gen_batches_train)
val_gen = Dataset.from_generator(gen_batches_val)

# dataset = gen_batches(script_args.per_device_train_batch_size)

# Fix weird overflow issue with fp16 training
tokenizer.padding_side = "right" #change to 'left' if don't use fp16?

trainer = SFTTrainer(
    model=model,
    train_dataset=train_gen,
    eval_dataset=val_gen,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=script_args.packing,
)

trainer.train()

if script_args.merge_and_push:
    output_dir = os.path.join(script_args.output_dir, "final_checkpoints")
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    torch.cuda.empty_cache()

    model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
    model = model.merge_and_unload()

    output_merged_dir = os.path.join(script_args.output_dir, "final_merged_checkpoint")
    model.save_pretrained(output_merged_dir, safe_serialization=True)



Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.
wandb: Currently logged in as: dpopovvelasco. Use `wandb login --relogin` to force relogin


You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50 | 2.4932 | 1.725899 |




- Note: LoRA reduces memory usage but does not speed up training (MosaicML).
- GitHub Copilot: The `max_steps` parameter of `TrainingArguments` specifies the total number of training steps to perform. A training step is one optimizer update: forward and backward passes over `per_device_train_batch_size * gradient_accumulation_steps` samples per device, followed by a parameter update based on the accumulated gradients.

When `max_steps` is positive, it overrides `num_train_epochs`: training stops once it reaches that many steps, regardless of how many epochs were requested. The default of `-1` leaves `num_train_epochs` in control.

This gives more granular control over exactly how much training is performed. It is particularly useful with large datasets, where a single epoch (i.e., one pass over the entire dataset) can take a long time.
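
To make the interaction between `max_steps` and epochs concrete, here is a rough sketch of the arithmetic. It assumes a single GPU and the values used in this notebook (9,000 training samples, batch size 4, 4 gradient-accumulation steps). The helper name `update_steps_per_epoch` is made up for illustration, and the exact rounding inside `Trainer` can differ between `transformers` versions.

```python
import math

def update_steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Approximate number of optimizer updates in one pass over the data."""
    # Number of mini-batches the dataloader yields per epoch
    batches = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` mini-batches
    return max(batches // grad_accum_steps, 1)

steps = update_steps_per_epoch(9000, 4, 4)
max_steps = 1_000_000
print(f"~{steps} optimizer steps per epoch")
print(f"max_steps={max_steps} corresponds to ~{max_steps / steps:.0f} epochs")
```

With these settings, `max_steps = 1000000` is far more than the run will realistically reach, so training has to be stopped manually or `max_steps` lowered.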

In [None]:
# The model that you want to train from the Hugging Face hub
#model_name = "NousResearch/Llama-2-7b-chat-hf"
model_name = "meta-llama/Llama-2-13b-hf"

# The instruction dataset to use
#dataset_name = "mlabonne/guanaco-llama2-1k"
dataset_name = "knowrohit07/know_sql"

# Fine-tuned model name
new_model = "llama-2-13b-sql"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 32 # TODO: changed from 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.05 #TODO: changed from .1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
# (reuses the `output_dir` defined earlier in the notebook)
output_dir = output_dir

# Number of training epochs
num_train_epochs = 5

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
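
As a rough guide to what `lora_r` and `lora_alpha` control: in standard LoRA, each adapted weight matrix gets two low-rank factors of shapes `(r, d_in)` and `(d_out, r)`, and their product is scaled by `lora_alpha / r` before being added to the frozen weight. The sketch below only does that arithmetic. The helper names and the 4096x4096 layer size are illustrative assumptions, not values read from the model.

```python
def lora_scaling(lora_alpha, r):
    """Scale applied to the low-rank update in standard LoRA (alpha / r)."""
    return lora_alpha / r

def lora_params_per_matrix(d_in, d_out, r):
    """Trainable parameters added to one adapted matrix: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# Values from the parameter cell above; the layer size is an assumed example
print(lora_scaling(lora_alpha=16, r=32))
print(lora_params_per_matrix(d_in=4096, d_out=4096, r=32))
```

Under this formula, lowering `lora_r` from 64 to 32 while keeping `lora_alpha = 16` doubles the scale from 0.25 to 0.5 and halves the number of adapter parameters per matrix.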

In [None]:
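# Define the formatting function
def formatting_func(examples): # Original version worked, but result wasn't great: did not use GROUP BY in SQL
    # SFTTrainer here calls this with a batch (a dict of lists) and expects one string per example.
    # The original `text.split('\n')` formatted whole lists and produced only three strings per batch.
    return [
        f"### Question: {q}\n ### Context: {c}\n ### Answer: {a}"
        for q, c, a in zip(examples['question'], examples['context'], examples['answer'])
    ]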


In [None]:
# Load dataset (you can process it here)
#dataset = load_dataset(dataset_name, split="train")
#dataset = load_dataset(dataset_name, split="validation") #TODO: works, trains super quickly, guess grabbed small subset
dataset=load_dataset('knowrohit07/know_sql', revision='f33425d13f9e8aab1b46fa945326e9356d6d5726')['train']

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = LlamaForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1



Your GPU supports bfloat16: accelerate training with bf16=True



In [None]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.bos_token = "<s>"
tokenizer.eos_token = "</s>"
tokenizer.unk_token = "<unk>"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training


# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)


In [None]:
# Set training parameters. This run used num_train_epochs = 1 (see the log below).
training_arguments = TrainingArguments(
    output_dir=output_dir,
    logging_dir=logging_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    formatting_func=formatting_func,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)



# Train the model
trainer.train()


# Save trained model
trainer.model.save_pretrained(new_model)



Map:   0%|          | 0/78562 [00:00<?, ? examples/s]



{'loss': 1.4418, 'learning_rate': 0.000131930153013598, 'epoch': 0.42}
{'loss': 1.2081, 'learning_rate': 1.4314282383241096e-05, 'epoch': 0.83}
{'train_runtime': 219.3233, 'train_samples_per_second': 1.081, 'train_steps_per_second': 0.274, 'train_loss': 1.328073501586914, 'epoch': 1.0}


In [None]:
# Set training parameters: Training for 5 epochs
training_arguments = TrainingArguments(
    output_dir=output_dir,
    logging_dir=logging_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    formatting_func=formatting_func,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)



# Train the model
trainer.train()


# Save trained model
trainer.model.save_pretrained(new_model)

Map:   0%|          | 0/78562 [00:00<?, ? examples/s]



{'loss': 1.4518, 'learning_rate': 0.00019851186053243666, 'epoch': 0.42}
{'loss': 1.1896, 'learning_rate': 0.00019036283606085053, 'epoch': 0.83}
{'loss': 1.1934, 'learning_rate': 0.00017567128158176953, 'epoch': 1.25}
{'loss': 1.218, 'learning_rate': 0.000155500908021347, 'epoch': 1.67}
{'loss': 1.1128, 'learning_rate': 0.00013131210861240026, 'epoch': 2.08}
{'loss': 1.0871, 'learning_rate': 0.00010485622221144484, 'epoch': 2.5}
{'loss': 1.0573, 'learning_rate': 7.804873131325954e-05, 'epoch': 2.92}
{'loss': 1.0316, 'learning_rate': 5.283057559252341e-05, 'epoch': 3.33}
{'loss': 0.9779, 'learning_rate': 3.102762227218957e-05, 'epoch': 3.75}
{'loss': 1.0043, 'learning_rate': 1.4218468069322578e-05, 'epoch': 4.17}
{'loss': 1.0098, 'learning_rate': 3.620144238882206e-06, 'epoch': 4.58}
{'loss': 0.9497, 'learning_rate': 0.0, 'epoch': 5.0}
{'train_runtime': 1097.6116, 'train_samples_per_second': 1.08, 'train_steps_per_second': 0.273, 'train_loss': 1.106935551961263, 'epoch': 5.0}
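
The learning rates in the log above are consistent with a cosine schedule with linear warmup. The sketch below recomputes them under assumptions inferred from the log rather than stated in the notebook: 300 total optimizer steps (25 logged steps is about 0.42 of an epoch, over 5 epochs) and 9 warmup steps (`warmup_ratio = 0.03` of 300). `cosine_lr` is a hypothetical helper written for illustration, not the scheduler's own code.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Learning rate after `step` optimizer updates: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Assumed values: 300 total steps, 9 warmup steps, learning_rate = 2e-4
for step in (25, 150, 300):
    print(step, cosine_lr(step, total_steps=300, warmup_steps=9, base_lr=2e-4))
```

Under these assumptions the sketch gives roughly 1.985e-4 at step 25, 1.049e-4 at step 150 and 0 at step 300, in line with the values logged at epochs 0.42, 2.5 and 5.0.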


In [None]:
ds = load_dataset('knowrohit07/know_sql', revision='f33425d13f9e8aab1b46fa945326e9356d6d5726')


In [None]:
trn = ds['train']
trn[3]



{'question': 'What are the hosts of competitions whose theme is not "Aliens"?',
 'context': 'CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)',
 'answer': "SELECT Hosts FROM farm_competition WHERE Theme <> 'Aliens'"}

In [None]:
tst = dict(**trn[3])
tst['question'] = 'Get the count of competition hosts by theme.'
tst



{'question': 'Get the count of competition hosts by theme.',
 'context': 'CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)',
 'answer': "SELECT Hosts FROM farm_competition WHERE Theme <> 'Aliens'"}

In [None]:
#def sql_prompt(d): return fmt.format(d["context"], d["question"], d['answer'])
def sql_prompt(example): return f"### Question: {example['question']}\n ### Context: {example['context']}\n ### Answer: {example['answer']}"

In [None]:
print(sql_prompt(tst))

### Question: Get the count of competition hosts by theme.
 ### Context: CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)
 ### Answer: SELECT Hosts FROM farm_competition WHERE Theme <> 'Aliens'


In [None]:
toks = tokenizer(sql_prompt(tst), return_tensors="pt")

In [None]:
res = trainer.model.generate(**toks.to("cuda"), max_new_tokens=250).to('cpu')



In [None]:
for elt in tokenizer.batch_decode(res):
  print(elt)


<s> ### Question: Get the count of competition hosts by theme.
 ### Context: CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)
 ### Answer: SELECT Hosts FROM farm_competition WHERE Theme <> 'Aliens'

### Question: Get the count of competition hosts by theme and the count of competitions by theme.
 ### Context: CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)
 ### Answer: SELECT Hosts, COUNT(Theme) FROM farm_competition GROUP BY Hosts

### Question: Get the count of competition hosts by theme and the count of competitions by theme.
 ### Context: CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)
 ### Answer: SELECT Hosts, COUNT(Theme) FROM farm_competition GROUP BY Hosts

### Question: Get the count of competition hosts by theme and the count of competitions by theme.
 ### Context: CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)
 ### Answer: SELECT Hosts, COUNT(Theme) FROM farm_competition GROUP BY Hosts

### Question: Get the count of compe
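
The generation above starts from a prompt that already contains an answer (the stale `SELECT ... <> 'Aliens'` query copied from `trn[3]`), so the model can only continue with further question/answer pairs. A possible alternative, sketched here with a hypothetical helper `sql_inference_prompt`, is to end the prompt at `### Answer:` so that the model has to produce the query itself.

```python
def sql_inference_prompt(example):
    """Same layout as sql_prompt, but stops at the answer marker so the model fills in the SQL."""
    return (
        f"### Question: {example['question']}\n"
        f" ### Context: {example['context']}\n"
        f" ### Answer:"
    )

example = {
    'question': 'Get the count of competition hosts by theme.',
    'context': 'CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)',
}
prompt = sql_inference_prompt(example)
print(prompt)
```

The resulting string can be tokenized and passed to `trainer.model.generate` in the same way as in the cells above; everything generated after `### Answer:` is then the model's own prediction.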

In [None]:
trainer.save_model(output_dir)

### Possible TODOs:
- Evaluate formally
- Train a Mistral version
- Compare to other models (GPT-4 and a Mistral model trained on SQL): expect Mistral 7B to do better according to https://dev.to/ananddas/mistral-7b-beats-llama-2-13b-on-all-benchmarks-55j2
- Then train a larger Mistral model

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

In [None]:
# trainer.model.push_to_hub(new_model, use_temp_dir=False)
# tokenizer.push_to_hub(new_model, use_temp_dir=False)