In [23]:
!pip install datasets
!pip install transformers -U
!pip install accelerate -U
!pip install trl
!pip install bitsandbytes       # for quantization
!pip install peft               # to allow us to use LoRA

Collecting transformers
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Downloading transformers-4.56.2-py3-none-any.whl (11.6 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.1
    Uninstalling transformers-4.56.1:
      Successfully uninstalled transformers-4.56.1
Successfully installed transformers-4.56.2


In [24]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [25]:
from datasets import load_dataset

DATASET_NAME = "ChrisHayduk/Llama-2-SQL-Dataset"
dataset = load_dataset(DATASET_NAME)    # downloads the entire dataset into our runtime

In [26]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 70719
    })
    eval: Dataset({
        features: ['input', 'output'],
        num_rows: 7858
    })
})


In [27]:
print(dataset["train"][0]["input"])
print(dataset["train"][0]["output"])

Below is an instruction that describes a SQL generation task, paired with an input that provides further context about the available table schemas. Write SQL code that appropriately answers the request.

### Instruction:
What is the release date of Milk and Money?

### Input:
CREATE TABLE table_name_50 (release_date VARCHAR, title VARCHAR)

### Response: 
SELECT release_date FROM table_name_50 WHERE title = "milk and money"
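
The printed example suggests that every `input` follows a fixed template. A minimal sketch of rebuilding that template for a new question follows. It assumes the layout is exactly what the printout shows; the helper name `build_prompt` and the exact whitespace are assumptions, not something the dataset documents.

```python
PREAMBLE = (
    "Below is an instruction that describes a SQL generation task, "
    "paired with an input that provides further context about the "
    "available table schemas. Write SQL code that appropriately "
    "answers the request."
)

def build_prompt(question: str, schema: str) -> str:
    """Hypothetical helper: format a question and schema like the dataset's 'input' field."""
    return (
        f"{PREAMBLE}\n\n"
        f"### Instruction:\n{question}\n\n"
        f"### Input:\n{schema}\n\n"
        f"### Response: \n"
    )

prompt = build_prompt(
    "What is the release date of Milk and Money?",
    "CREATE TABLE table_name_50 (release_date VARCHAR, title VARCHAR)",
)
print(prompt)
```

Keeping the prompt identical to the training format matters, because the fine-tuned model only learns to respond to the layout it saw during training.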


In [28]:
full_training_data = dataset["train"]
shuffled = full_training_data.shuffle()     # randomize the order so the subset below is not biased by how the creator ordered the data
training_dataset = shuffled.select(range(1000)) # keep only the first 1000 shuffled examples for fine-tuning
# shuffling first makes these 1000 examples a random sample of the entire dataset
# with a very large pretrained model, about 1000 examples is often enough for fine-tuning
# (the quality of the fine-tuning data tends to matter more than the number of examples)
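
The shuffle above draws a different subset on every run. A minimal standard-library sketch of reproducible random subsetting follows. The seed value and the helper name `sample_indices` are assumptions chosen for illustration.

```python
import random

NUM_ROWS = 70719      # size of the train split printed earlier
SUBSET_SIZE = 1000
SEED = 42             # hypothetical seed, any fixed integer works

def sample_indices(num_rows: int, k: int, seed: int) -> list[int]:
    """Pick k distinct row indices uniformly at random, reproducibly."""
    return random.Random(seed).sample(range(num_rows), k)

first = sample_indices(NUM_ROWS, SUBSET_SIZE, SEED)
second = sample_indices(NUM_ROWS, SUBSET_SIZE, SEED)
print(first == second)       # same seed, same subset
print(len(set(first)))       # number of distinct indices drawn
```

With the `datasets` library, passing `seed=` to `shuffle` should give the same kind of reproducibility.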

In [29]:
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the model weights in 4-bit format
    bnb_4bit_quant_type="nf4",              # use the NF4 (4-bit NormalFloat) quantization data type
    bnb_4bit_compute_dtype="float16"        # compute in float16: weights are stored in 4 bits but dequantized to this higher precision for the actual computations
)
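
To see why 4-bit loading matters, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count is a round-number assumption for a "7B" model, and the estimate ignores quantization constants, activations, and optimizer state.

```python
NUM_PARAMS = 7_000_000_000   # assumption: round figure for a "7B" model

def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

The 4-bit figure is a quarter of the float16 figure, which is roughly what makes a model of this size fit on a single consumer or free-tier GPU.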

In [30]:
import transformers
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

MODEL_NAME = "NousResearch/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto"
)
model.config.use_cache = True       # speeds up generation

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code = True
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

ImportError: The installed version of bitsandbytes (<0.43.1) requires CUDA, but CUDA is not available. You may need to install PyTorch with CUDA support or upgrade bitsandbytes to >=0.43.1.

In [None]:
def construct_datapoint(x):     # builds one training example in the format the model expects during training
    combined = x["input"] + x["output"]     # join prompt and response into a single string
    return tokenizer(combined)      # convert the text to integer token ids; padding is left to the data collator, since padding a single example on its own is a no-op

training_dataset = training_dataset.map(construct_datapoint)    # apply the function above to every element of the training dataset
# we didn't need to do this for GPT-2 because the dataset we were using was already in the format we wanted (every element was one long string with the prompt and response together)

# the dataset comes in an input / output (prompt / response) format, but causal language models are not trained on separate pairs
# instead the model receives one block of text per example and learns to predict each next token in it
# so for every data point we concatenate the input and the output into a single piece of text before training
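
The comments above describe next-token prediction over the concatenated text. A toy sketch of what that objective looks like follows, using a whitespace split as a stand-in for the real tokenizer. The example strings are made up.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in for a real tokenizer: split on whitespace."""
    return text.split()

prompt = "### Instruction: count rows ### Response:"
response = "SELECT COUNT(*) FROM t"

tokens = toy_tokenize(prompt + " " + response)

# For causal language modeling, position i is trained to predict token i + 1.
inputs = tokens[:-1]
targets = tokens[1:]

for seen, predicted in zip(inputs, targets):
    print(f"{seen!r:>16} -> {predicted!r}")
```

In Hugging Face causal language models this shift happens inside the model when `labels` are supplied, so the training data only needs the token ids of the concatenated text.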

In [None]:
print(training_dataset)

Because we called the tokenizer, the dataset now has input_ids (every token converted to the integer id that the model actually works with) as well as an attention_mask. The attention_mask marks which positions hold real tokens (1) and which hold padding (0), so the model can ignore padding. It is separate from the causal mask inside the model, which is what stops the model from seeing the next token while it learns to predict it.
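
A simplified sketch of what padding a batch looks like, and how the attention mask and labels line up with it, follows. The token ids are made up, `PAD_ID = 2` is an assumption that mirrors reusing the EOS token as the pad token, and `pad_batch` is a hypothetical helper, not the library's collator.

```python
PAD_ID = 2            # assumption: pad id equals the EOS id, as in this notebook
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss

def pad_batch(sequences: list[list[int]]) -> dict[str, list[list[int]]]:
    """Right-pad token id lists to a common length and build masks and labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [PAD_ID] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = pad_batch([[1, 887, 526], [1, 22172, 3186, 363, 29889]])
for name, rows in batch.items():
    print(name, rows)
```

Padding positions get an attention mask of 0 and a label of -100, so they neither influence the other tokens nor contribute to the loss.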

In [None]:
from peft import (LoraConfig, get_peft_model, prepare_model_for_kbit_training)

peft_config = LoraConfig(
    r = 16,     # the rank of the LoRA matrices B and A; the higher it is, the closer we get to fine-tuning all the parameters of the targeted layer
    # the lower it is, the faster training runs and the less memory it takes, but results may suffer because fewer parameters are being tuned
    # 16 is a typical value for r; it can vary with model size, but for 7B+ models r = 16 is a common choice
    lora_alpha = 32,    # scaling factor: the LoRA update is multiplied by lora_alpha / r, so tuning it has an effect similar to tuning the learning rate of the adapters
    # 32 is also common, people usually use a value that is 2 * r
    target_modules = ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
    # the layers LoRA adapters are attached to; every module not listed here stays fully frozen
    # q, k, v and o proj belong to attention; down, up and gate proj belong to the feedforward layers of the transformer
    lora_dropout = 0.05,    # dropout on the LoRA path: randomly zeroes some values during every training iteration to reduce overfitting
    # we don't want the model to memorize random noise in the training data
    task_type="CAUSAL_LM"   # next token prediction
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)      # all original weights are frozen; only the LoRA adapters added to the target modules are trained

generation_configuration = model.generation_config
generation_configuration.pad_token_id = tokenizer.eos_token_id
generation_configuration.eos_token_id = tokenizer.eos_token_id
generation_configuration.max_new_tokens = 256  # max amount of tokens to be generated by model, it can't just go on forever
generation_configuration.do_sample = True
generation_configuration.temperature = 0.7
generation_configuration.top_p = 0.9
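
To make the roles of `r` and `lora_alpha` in the LoRA configuration above concrete, here is a small pure-Python sketch of the LoRA update `W + (alpha / r) * B @ A` and of the parameter savings. The tiny matrices are made up, and the 4096 hidden size is an assumption about a 7B Llama-style model.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the effective weight of a LoRA-adapted layer."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Tiny example: a 3x3 frozen weight with rank-1 adapters.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.25]]          # shape (r, d_in)
B_init = [[0.0], [0.0], [0.0]]   # shape (d_out, r), zero at initialization
print(lora_weight(W, A, B_init, alpha=2, r=1) == W)

# Trainable parameters for one 4096 x 4096 projection at r = 16.
d_in = d_out = 4096   # assumption: hidden size of a 7B Llama-style model
r = 16
full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(full_params, lora_params, lora_params / full_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to learn the low-rank correction.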

In [None]:
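def generate(prompt):
    generation_configuration.max_new_tokens = 20    # keep completions short (this overrides the 256 set above)

    # tokenize the prompt and move the token ids to the GPU if one is available
    encoded = tokenizer.encode(prompt, add_special_tokens = True, return_tensors = "pt").to(device)
    with torch.inference_mode():    # no gradients are needed for generation
        out = model.generate(input_ids = encoded, generation_config = generation_configuration, repetition_penalty = 2.0, do_sample = True)
    string_decoded = tokenizer.decode(out[0], clean_up_tokenization_spaces = True)  # the decoded text includes the prompt itself
    print(string_decoded)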
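
The sampling settings used during generation (temperature 0.7, top_p 0.9) can be illustrated with a small standard-library sketch. This is a simplified version of temperature scaling plus nucleus (top-p) filtering, not the library's implementation. The logits are made up and `sample_top_p` is a hypothetical helper.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample an index from logits using temperature scaling and nucleus filtering."""
    rng = rng or random.Random()
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one of the kept tokens, weighted by its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept

toy_logits = [4.0, 3.0, 1.0, -2.0]   # made-up scores for a 4-token vocabulary
token, nucleus = sample_top_p(toy_logits, rng=random.Random(0))
print(token, nucleus)
```

Unlikely tokens are cut off before sampling, which keeps the output varied without letting it drift into very low-probability continuations.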

We can't talk to this model directly like a chatbot yet: it has not been fine-tuned on any data and has only been pretrained on a massive amount of text. For now we can only use it for text completion, where it keeps predicting the next token in the sequence.

In [None]:
generate('today I want to')

In [None]:
generate('the name of the first person to land on the moon was')

In [None]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,  
    gradient_accumulation_steps=4,  # simulate a larger batch size without the extra memory: gradients from 4 batches are accumulated before each weight update, giving an effective batch size of 4
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    output_dir="fine_tuning",
)

trainer = transformers.Trainer(model = model, train_dataset = training_dataset, data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm = False), args = training_args)
# we previously used SFT trainer (a highly optimized trainer) because it supports example packing (multiple data points are packed into the same sequence fed to the model, which increases efficiency)
# we only did that because of the unusual format of the guanaco dataset; most of the time it isn't needed, so we can use the Trainer class instead
# Trainer also expects a data_collator, the object that turns a list of examples into a batch: here it pads them to the same length and, with mlm = False, builds the labels for next-token prediction

model.config.use_cache = False  # the key/value cache only helps during generation, where each predicted token is fed back into the model to predict the next one
# during training the model processes whole existing sequences at once, so the cache is unnecessary (it is also incompatible with gradient checkpointing)
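
The training arguments above imply a specific number of optimizer steps and a specific learning rate curve. A rough standard-library sketch follows, assuming a single GPU and the 1000-example subset. The formula reflects the usual linear-warmup-plus-cosine-decay shape, not necessarily the library's code line for line.

```python
import math

NUM_EXAMPLES = 1000
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
EPOCHS = 1
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM          # assumes a single GPU
total_steps = math.ceil(NUM_EXAMPLES / effective_batch) * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Gradient accumulation changes how many examples contribute to each update, not how many examples are processed, so it reduces the number of optimizer steps rather than the amount of computation.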

In [None]:
trainer.train()

In [None]:
evaluation_dataset = dataset['eval'].shuffle()

sample_sql_question = evaluation_dataset[0]['input']
correct_answer = evaluation_dataset[0]['output'] 

generate(sample_sql_question)

In [None]:
sample_sql_question

In [None]:
correct_answer

This is still a good result. Before any fine-tuning, the pretrained model could not answer these questions at all; it could only keep predicting the next token in the sequence like an autocomplete. After only about 10 minutes of fine-tuning on 1000 examples, the model follows the instruction, reads the table schema, and generates an almost correct SQL query as a response. With a larger dataset and longer training, its performance could likely be improved even further.
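
One way to turn "almost correct" into something measurable is a normalized exact-match check between the generated query and the reference. The sketch below is one possible approach, not part of the original notebook. The helper names are hypothetical and the generated text is made up for illustration.

```python
import re

RESPONSE_MARKER = "### Response:"

def extract_response(generated_text: str) -> str:
    """Return the text after the last response marker (the model echoes the prompt)."""
    return generated_text.rsplit(RESPONSE_MARKER, 1)[-1].strip()

def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, drop a trailing semicolon."""
    collapsed = re.sub(r"\s+", " ", sql.strip().lower())
    return collapsed.rstrip(";").strip()

def exact_match(generated_text: str, reference_sql: str) -> bool:
    """Compare the generated response with the reference after normalization."""
    return normalize_sql(extract_response(generated_text)) == normalize_sql(reference_sql)

# Made-up generation for illustration, not actual model output.
fake_generation = (
    "### Instruction:\nWhat is the release date of Milk and Money?\n\n"
    "### Response: \nSELECT release_date  FROM table_name_50\n"
    "WHERE title = \"milk and money\";"
)
reference = 'SELECT release_date FROM table_name_50 WHERE title = "milk and money"'
print(exact_match(fake_generation, reference))
```

Exact match is a strict metric: it lowercases string literals too and treats any semantically equivalent but differently written query as wrong, so it gives a lower bound on quality rather than a full picture.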