# <center>Fine-Tuning TinyLlama</center>
    
    
<center><img src='https://github.com/jzhang38/TinyLlama/raw/main/.github/TinyLlama_logo.png' height=380px width=380px></center>

## Project Summary
    
TinyLlama is a 1.1B-parameter Llama model that is being pretrained on 3 trillion tokens; pretraining began on September 1st. In this project, I fine-tune the latest checkpoint of TinyLlama to generate song lyrics in the style of Taylor Swift.

I used Hugging Face's transformers and peft (parameter-efficient fine-tuning) packages for this project. One of the major challenges of fine-tuning a large language model (LLM) is the high memory usage on the GPU. To address this challenge, I used the quantization and fine-tuning methods described in the 2023 paper "QLoRA: Efficient Finetuning of Quantized LLMs". These methods are summarized below:

- Low-rank adaptation: This technique freezes the existing weights of TinyLlama and, for each adapted weight matrix, adds a pair of much smaller matrices whose product is a low-rank update to that matrix. Only these small matrices are trained, instead of all of the model weights. Another way to think of this is that the update to each weight matrix is restricted to a small number of directions, which takes far fewer parameters than training each weight individually. In this project, low-rank adaptation is applied only to the query and value projection weights in the attention layers, while all other parts of the model stay frozen. This greatly reduces the memory and computation needed to fine-tune the model, typically with little loss in performance.

- Double quantization: The weights of TinyLlama's linear layers are quantized to 4 bits, and the quantization constants are themselves quantized to 8 bits, which further reduces the memory usage of the model. The low-rank adaptation weights are stored in 16 bits, and the 4-bit model weights are dequantized to 16 bits at computation time.

- NormalFloat data type: The NormalFloat data type is used for quantization. This data type minimizes information loss during quantization by assigning each data point to a quantile bin based on the estimated normal distribution of the data.

- Gradient checkpointing: This technique reduces memory use during training by storing only some of the activations from the forward pass and recomputing the rest during the backward pass, at the cost of extra computation.

- Paged optimizers: This technique lets optimizer state be paged automatically between GPU and CPU memory, so that memory spikes during training (for example, when gradient checkpointing recomputes activations for a long sequence) do not cause out-of-memory errors.

Together, these methods reduce the GPU memory and computation needed for fine-tuning enough that the whole project runs on a single P100 GPU.
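
As a rough back-of-the-envelope check of the memory savings from quantization, the sketch below estimates the average storage per weight for a 1.1B-parameter model. The block sizes (64 weights per first-level constant, 256 first-level constants per second-level constant) are the values reported in the QLoRA paper and are assumed here rather than read from this notebook's configuration. The function name `bits_per_param` is hypothetical, and the estimate ignores layers that are not quantized (such as embeddings).

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_const_bits=8, dq_block_size=256):
    """Average storage in bits per weight, including quantization constants."""
    if double_quant:
        # First-level constants are stored in 8 bits, plus one 32-bit
        # second-level constant per dq_block_size first-level constants.
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    else:
        # One 32-bit constant per block of weights.
        overhead = const_bits / block_size
    return weight_bits + overhead

n_params = 1.1e9  # approximate parameter count of TinyLlama
settings = [
    ("16-bit", 16.0),
    ("4-bit", bits_per_param()),
    ("4-bit + double quantization", bits_per_param(double_quant=True)),
]
for label, bits in settings:
    print(f"{label}: {bits:.3f} bits/param, about {n_params * bits / 8 / 1e9:.2f} GB")
```

Under these assumptions, double quantization lowers the overhead of the quantization constants from 0.5 bits to roughly 0.127 bits per weight.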

Link to TinyLlama - https://huggingface.co/PY007/TinyLlama-1.1B-step-50K-105b

In [1]:
!pip install trl transformers accelerate git+https://github.com/huggingface/peft.git -Uqqq
!pip install bitsandbytes einops wandb -Uqqq

In [2]:
import torch
import glob
import pandas as pd
import numpy as np
import re
from peft import get_peft_model, PeftConfig, PeftModel, LoraConfig, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from trl import SFTTrainer
from datasets import Dataset



In [3]:
# Importing the dataset 
path = '/kaggle/input/taylor-swift-song-lyrics-all-albums/'
csv_files = glob.glob(path + "/*.csv")
df_list = (pd.read_csv(i) for i in csv_files)
df = pd.concat(df_list, ignore_index=True)
lyrics = '\n'.join(df.loc[:,'lyric']) 
print(lyrics[:200])

Knew he was a killer first time that I saw him
Wondered how many girls he had loved and left haunted
But if he's a ghost, then I can be a phantom
Holdin' him for ransom, some
Some boys are tryin' too 


In [4]:
# List of all unique characters
print(' '.join(sorted(set(lyrics))))


   ! " & ' ( ) , - . 0 1 2 3 4 5 6 7 8 9 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y [ ] a b c d e f g h i j k l m n o p q r s t u v w x y z |   é í ï ó е   ​ – — ‘ ’ ” …  


In [5]:
# Cleaning the file by removing/replacing unnecessary characters and removing sections 
# that are not lyrics
replace_with_space = ['\u2005', '\u200b', '\u205f', '\xa0', '-']
replace_letters = {'í':'i', 'é':'e', 'ï':'i', 'ó':'o', ';':',', '‘':'\'', '’':'\'', ':':',', 'е':'e'} 
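# Regex patterns (raw strings, to avoid invalid escape sequences) for characters and whole lines to remove
remove_list = [r'\)', r'\(', '–', '"', '”', r'\[.*\]', r'.*\|.*', '—']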

cleaned_lyrics = lyrics

for old, new in replace_letters.items():
    cleaned_lyrics = cleaned_lyrics.replace(old, new)
for string in remove_list:
    cleaned_lyrics = re.sub(string,'',cleaned_lyrics)
for string in replace_with_space:
    cleaned_lyrics = re.sub(string,' ',cleaned_lyrics)
print(''.join(sorted(set(cleaned_lyrics))))


 !',.0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxyz…


In [6]:
# Hold out the last 5% of the text as test data so that the model is evaluated on
# lyrics it was not trained on, then split the training text into 500-character segments
split_point = int(len(cleaned_lyrics)*0.95)
train_data = cleaned_lyrics[:split_point]
test_data = cleaned_lyrics[split_point:]
train_data_seg = []
for i in range(0, len(train_data), 500):
    text = train_data[i:i+500]  # slicing stops at the end of the string
    train_data_seg.append(text)
train_data_seg = Dataset.from_dict({'text':train_data_seg})
print(len(train_data_seg))

557


In [17]:
# You will need to create a Hugging Face account if you do not have one, 
# and then generate a write token to enter in the widget below
from huggingface_hub import notebook_login
notebook_login()


In [18]:
# Loading the model with double quantization
model_name = "PY007/TinyLlama-1.1B-step-50K-105b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,           
    bnb_4bit_quant_type="nf4",    
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_compute_dtype=torch.bfloat16, 
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, 
    device_map="auto",  
    trust_remote_code=True, 
)
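
To make the `nf4` quantization type used above more concrete, here is a simplified sketch of quantile-based 4-bit quantization. It assumes the weights in a block are roughly normally distributed, places 16 levels at quantiles of a standard normal, and scales each block by its absolute maximum. The function names are hypothetical, and this is not the bitsandbytes implementation: the actual NF4 levels are constructed somewhat differently (for example, so that zero is represented exactly).

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Return 2**bits levels in [-1, 1] placed at quantiles of N(0, 1)."""
    n = 2 ** bits
    std_normal = NormalDist()
    # Midpoint of each of the n equal-probability bins of the standard normal.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(weights, levels):
    """Map each weight to the index of the nearest level after absmax scaling."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [min(range(len(levels)), key=lambda k: abs(levels[k] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from 4-bit codes and the block constant."""
    return [levels[c] * absmax for c in codes]

levels = normal_quantile_levels()
block = [0.8, -0.31, 0.05, -1.2, 0.44]  # made-up weights for illustration
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes)
print([round(x, 3) for x in restored])
```

Each weight is stored as a 4-bit code, and `absmax` is the per-block quantization constant that double quantization compresses further.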

In [19]:
# Creating tokenizer and defining the pad token
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) 
tokenizer.pad_token = tokenizer.eos_token

In [20]:
# Generating lyrics with the base model. The repetition penalty in the generation config
# discourages the model from continually repeating the same string.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def generate_lyrics(query, model):
    encoding = tokenizer(query, return_tensors="pt").to(device)
    generation_config = GenerationConfig(
        max_new_tokens=250, repetition_penalty=1.3,
        pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
    outputs = model.generate(input_ids=encoding.input_ids, attention_mask=encoding.attention_mask,
                             generation_config=generation_config)
    text_output = tokenizer.decode(outputs[0],skip_special_tokens=True)
    print('INPUT\n', query, '\n\nOUTPUT\n', text_output[len(query):])
generate_lyrics(test_data[200:700], model)

INPUT
  to get you where you wanna go
Oh, they didn't teach you that in prep school so it's up to me
But no amount of vintage dresses gives you dignity
Think about what you did

She's not a saint and she's not what you think
She's an actress, whoa
She's better known for the things that she does
On the mattress, whoa
Soon she's gonna find stealing other people's toys
On the playground won't make you many friends
She should keep in mind, she should keep in mind
There is nothing I do better than revenge,  

OUTPUT
 
I don't know why but I feel like I have something to say
And if there was one thing I would change about myself
It'd be my name.

So now we can all just sit back and enjoy this song
We could even sing along with her on our own
If only we had some money left over from last time
Well then maybe we could buy her a car
Or at least give her a ride home

<NAME> - <NAME>, <NAME>. (2013). "The Best Of The Olsen Twins". Retrieved April 9, 2018, from https://www.youtube.com/watch?v=g7Yf
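
The `repetition_penalty=1.3` setting used above can be illustrated with a small sketch. As far as I understand the standard formulation, the score of every token that has already appeared is divided by the penalty when it is positive and multiplied by it when it is negative, so repeated tokens become less likely either way. The function below is a hypothetical pure-Python illustration on a toy four-token vocabulary, not the transformers implementation.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=1.3):
    """Return a copy of logits with already-seen tokens made less likely."""
    penalized = list(logits)
    for token_id in set(seen_ids):
        score = penalized[token_id]
        # Positive scores shrink toward zero; negative scores move further below zero.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

logits = [2.6, -1.0, 0.5, 1.3]   # made-up scores for a 4-token vocabulary
seen_ids = [0, 1, 0]             # tokens 0 and 1 have already appeared
print(apply_repetition_penalty(logits, seen_ids))
```

A penalty of 1.0 leaves the scores unchanged, and larger values suppress repetition more strongly.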

In [21]:
# Setting arguments for low-rank adaptation 

model = prepare_model_for_kbit_training(model)

lora_rank = 32
lora_alpha = 32 # The LoRA update is scaled by lora_alpha/lora_rank, so lora_alpha = lora_rank gives a scale of 1
lora_dropout = 0.05

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none",  # setting to 'none' for only training weight params instead of biases
    task_type="CAUSAL_LM")

peft_model = get_peft_model(model, peft_config)
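
To show what the rank and alpha settings mean, here is a minimal pure-Python sketch of a LoRA layer's forward pass, `h = W x + (alpha / r) * B (A x)`, together with a count of the trainable parameters one adapter adds. The helper names are hypothetical, and the 2048 x 2048 projection size is an assumed, illustrative dimension rather than a value read from the model configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one adapter: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

# Tiny example with d_in = 3, d_out = 2, r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen pretrained weights
A = [[0.1, 0.2, 0.3]]                   # r x d_in, randomly initialized in practice
B = [[0.0], [0.0]]                      # d_out x r, initialized to zero
x = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x, alpha=1, r=1))  # same as W x while B is all zeros

# Illustrative size only: a 2048 x 2048 projection adapted with r = 32.
full = 2048 * 2048
added = lora_param_count(2048, 2048, 32)
print(full, added, added / full)
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the small matrices A and B.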

In [22]:
# Setting training arguments 

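output_dir = "tommyadams/tinyllama" # Model repo on your Hugging Face account where you want to save your model
per_device_train_batch_size = 3 # Samples per forward/backward pass
gradient_accumulation_steps = 2 # Batches accumulated per optimizer step (effective batch size of 6 on one GPU)
optim = "paged_adamw_32bit"     # Paged AdamW optimizer
save_strategy="steps"           # Save a checkpoint every save_steps steps
save_steps = 10 
logging_steps = 10              # Log the training loss every 10 steps
learning_rate = 2e-3  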
max_grad_norm = 0.3 # Sets limit for gradient clipping
max_steps = 200     # Number of training steps
warmup_ratio = 0.03 # Portion of steps used for learning_rate to warmup from 0
lr_scheduler_type = "cosine" # I chose cosine to avoid learning plateaus

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_strategy=save_strategy,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
    report_to='none'
)
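
As a sanity check on these settings, the sketch below works out how many samples and epochs the run covers and what the learning rate looks like under linear warmup followed by cosine decay. It assumes a single GPU and uses the 557 training segments printed earlier. `lr_at_step` is a hypothetical helper written for illustration and is not the scheduler code used by the trainer.

```python
import math

def lr_at_step(step, base_lr, max_steps, warmup_steps):
    """Linear warmup from 0, then cosine decay to 0 (half a cosine cycle)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

per_device_train_batch_size = 3
gradient_accumulation_steps = 2
max_steps = 200
warmup_ratio = 0.03
learning_rate = 2e-3
num_segments = 557  # size of train_data_seg printed earlier

# Samples consumed per optimizer step (single GPU assumed).
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
epochs = max_steps * effective_batch / num_segments
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(effective_batch, round(epochs, 2), warmup_steps)
for step in (0, warmup_steps, 100, max_steps):
    print(step, lr_at_step(step, learning_rate, max_steps, warmup_steps))
```

With these values, 200 steps correspond to a little over two passes through the training segments.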

In [23]:
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_data_seg,
    peft_config=peft_config,
    max_seq_length=500,
    dataset_text_field='text',
    tokenizer=tokenizer,
    args=training_arguments
)
peft_model.config.use_cache = False


In [24]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 10 | 2.8298 |
| 20 | 2.6352 |
| 30 | 2.5815 |
| 40 | 2.6166 |
| 50 | 2.5888 |
| 60 | 2.5329 |
| 70 | 2.5101 |
| 80 | 2.4791 |
| 90 | 2.5277 |
| 100 | 2.2997 |


TrainOutput(global_step=200, training_loss=2.403262138366699, metrics={'train_runtime': 348.0262, 'train_samples_per_second': 3.448, 'train_steps_per_second': 0.575, 'total_flos': 649013488091136.0, 'train_loss': 2.403262138366699, 'epoch': 2.15})

In [25]:
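# Generating lyrics with the fine-tuned model. get_peft_model adds the LoRA layers to
# `model` in place, so `peft_model` wraps the same weights and is used here for clarity
generate_lyrics(test_data[200:700], peft_model)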

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


INPUT
  to get you where you wanna go
Oh, they didn't teach you that in prep school so it's up to me
But no amount of vintage dresses gives you dignity
Think about what you did

She's not a saint and she's not what you think
She's an actress, whoa
She's better known for the things that she does
On the mattress, whoa
Soon she's gonna find stealing other people's toys
On the playground won't make you many friends
She should keep in mind, she should keep in mind
There is nothing I do better than revenge,  

OUTPUT
 20/10
And if there was one thing I could have done differently
It would be this: If I had been around when he broke my heart
I wouldn't have felt like such a fool then
If I hadn't seen him through all those years
He never got over his breakup with me
So why don't we just say goodbye?
Goodbye, baby, oh, yeah
You know how much I love you
Baby, oh, yeah
We can always re-live our first kiss
In your old car, on the way home from work
When you were still single
Now we are married, bu

## Results

Fine-tuning the model for 200 steps on a P100 GPU took about 6 minutes. Before fine-tuning, the model generated a few lines of lyrics in response to the prompt, but then drifted into what looks like a citation with a YouTube link, presumably resembling text it saw during pretraining. After fine-tuning, the model improved: it learned words that are common in Taylor Swift's song lyrics, and the sample above stays in lyric form throughout. However, many of the lines are still nonsensical, if humorous. To further improve this model, I could start with a larger base model (such as Falcon-7B), train for longer, and provide longer training segments so that the model can learn song structure in terms of verses and choruses.
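
One of the ideas above, longer training segments, could be sketched as follows. Instead of cutting the text every 500 characters (which can split a line in the middle), this hypothetical helper groups whole lines into segments up to a chosen character budget. The name `segment_by_lines` and the default of 1500 characters are assumptions for illustration. Since longer segments contain more tokens, the `max_seq_length` passed to the trainer may also need to be checked.

```python
def segment_by_lines(text, max_chars=1500):
    """Group whole lines into segments of at most max_chars characters.

    A single line longer than max_chars is kept as its own segment.
    """
    segments, current, current_len = [], [], 0
    for line in text.split("\n"):
        added_len = len(line) + 1  # +1 for the newline that joins lines
        if current and current_len + added_len > max_chars:
            segments.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += added_len
    if current:
        segments.append("\n".join(current))
    return segments

sample = "line one\nline two\nline three\nline four"
print(segment_by_lines(sample, max_chars=20))
```

The resulting list could be passed to `Dataset.from_dict({'text': ...})` in the same way as the fixed-width segments.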