<a href="https://colab.research.google.com/github/danielbauer1860/LDS_Project/blob/main/proposal/Llama_2_fine_tuning_proposal_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 tensorboard huggingface_hub[cli] xformers

In [None]:
!huggingface-cli login



    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.ca

In [None]:
import pandas as pd
import os
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    )
from datasets import load_dataset, Dataset
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model
import transformers

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Linguistic Data Science/data/bnc_baby_texts.csv', sep='|')
df

Unnamed: 0,text,category
0,Oxford Art Journal. Sample containing about ...,ACA
1,The Lancet. Sample containing about 44333 wo...,ACA
2,Computers and the humanities. Sample contain...,ACA
3,British journal of social work. Sample conta...,ACA
4,The British polity. Sample containing about ...,ACA
...,...,...
177,Liverpool Daily Post and Echo: Arts section....,NEWS
178,"The Guardian, electronic edition of 1989-12-...",NEWS
179,"Independent, electronic edition of 1989-10-0...",NEWS
180,The Scotsman: Applied Science pages. Sample ...,NEWS


In [None]:
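# Split the corpus by category; groupby sorts the group keys, so the
# unpacking order relies on the alphabetical order of the four category labels
df_aca, df_dem, df_fic, df_news = [group for _, group in df.groupby('category')]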



In [None]:
df_aca

Unnamed: 0,text,category
0,Oxford Art Journal. Sample containing about ...,ACA
1,The Lancet. Sample containing about 44333 wo...,ACA
2,Computers and the humanities. Sample contain...,ACA
3,British journal of social work. Sample conta...,ACA
4,The British polity. Sample containing about ...,ACA
5,Design of computer data files. Sample conta...,ACA
6,The age of Balfour and Baldwin 1902-1940. Sa...,ACA
7,Lectures on electromagnetic theory. Sample c...,ACA
8,Handling geographical information. Sample co...,ACA
9,Crime. Sample containing about 33296 words ...,ACA


In [None]:
aca_dataset = Dataset.from_pandas(df_aca)
fic_dataset = Dataset.from_pandas(df_fic)
news_dataset = Dataset.from_pandas(df_news)

In [None]:
aca_dataset

Dataset({
    features: ['text', 'category', '__index_level_0__'],
    num_rows: 30
})

# Fine-tuning Llama 2

Norouzi (2023) describes an efficient way to fine-tune Llama 2 within the system RAM limits of Google Colab; his guide therefore serves as the main source for this section. For more details, see [Mastering Llama 2: A Comprehensive Guide to Fine-Tuning in Google Colab](https://artificialcorner.com/mastering-llama-2-a-comprehensive-guide-to-fine-tuning-in-google-colab-bedfcc692b7f). Additionally, [this notebook](https://colab.research.google.com/drive/12dVqXZMIVxGI0uutU6HG9RWbWPXL3vts?authuser=2#scrollTo=qmA4G6C64dJ4) was used as a reference for further parameters.

In [None]:
# Declaring the model name; as described in the paper, the 7 billion parameter version of Llama 2 is used
model_name = "meta-llama/Llama-2-7b-hf"

The parameters for the quantization config need to be defined:

In [None]:
# Load the entire model on GPU 0
device_map = {"": 0}

# Set base model loading in 4-bits
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = torch.float16

# Quantization type (fp4 or nf4); nf4 is shown to be better in the QLoRA paper
bnb_4bit_quant_type = "nf4"

# Nested (double) quantization for 4-bit base models; disabled here
use_nested_quant = False

In [None]:
# Initialize the quantization config using the previously declared parameters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant
)
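
To illustrate why 4-bit loading matters here, the following is a rough back-of-the-envelope sketch, not a measurement. It assumes a round figure of 7 billion parameters and counts only the stored weights; activations, the CUDA context, and the quantization constants are ignored, so actual usage is higher. The helper `weight_memory_gib` is a hypothetical name introduced for this example.

```python
# Rough weight-memory estimate for a ~7B-parameter model.
# Assumption: 7e9 parameters (a round figure); only the weights are counted.
N_PARAMS = 7_000_000_000


def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half precision
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the quantized weights take a quarter of the half-precision footprint, which leaves room on a single GPU for the LoRA adapter and the optimizer state.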

Now that the `device_map` is set to a single GPU and the quantization config `bnb_config` is initialized, the base model can be loaded without running into memory issues:

In [None]:
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
)
model.config.use_cache = False

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Load the corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

As the original tokenizer has `pad_id` set to -1, we need to define a custom padding token. While the [Llama 2 Hugging Face documentation](https://huggingface.co/docs/transformers/main/model_doc/llama2#overview) recommends using `tokenizer.add_special_tokens({"pad_token":"<pad>"})`, Norouzi points out that this can introduce CUDA-related errors and considers setting the `pad_token` directly to be the safer option.




In [None]:
# Define a custom padding token by reusing the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Set the padding direction to the right
tokenizer.padding_side = "right"
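
What these two settings mean for a batch can be shown with a toy sketch. It does not call the real tokenizer: the token ids and `EOS_ID = 2` are made-up values for illustration, and `pad_right` is a hypothetical helper that only mimics right-padding with the end-of-sequence id plus the matching attention mask.

```python
# Toy illustration of right-padding with the EOS id reused as the pad id.
# Assumption: EOS_ID and the token ids below are invented for this example;
# the real values come from the tokenizer.
EOS_ID = 2


def pad_right(batch, pad_id):
    """Pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


ids, mask = pad_right([[1, 15, 27, 9], [1, 42]], pad_id=EOS_ID)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions, even though the pad id coincides with a real token id.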

These parameters are set in accordance with the QLoRA method.

In [None]:
# LoRA attention dimension (rank r of the low-rank update matrices)
lora_r = 64
# Alpha for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA
lora_dropout = 0.1

In [None]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
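
As a sanity check, the number of trainable parameters that the training log reports (33,554,432) can be reproduced by hand. The sketch below rests on assumptions not stated in this notebook: that Llama-2-7b has 32 decoder layers with hidden size 4096, and that, since no `target_modules` are passed, LoRA is applied only to the `q_proj` and `v_proj` projections, each a 4096 x 4096 linear layer.

```python
# Assumptions (not stated in the notebook): 32 decoder layers, hidden size
# 4096, and LoRA applied to q_proj and v_proj only (both 4096 x 4096).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES = ("q_proj", "v_proj")

lora_r = 64
lora_alpha = 16

# Each adapted layer gets two matrices: A (r x in_features) and B (out_features x r).
params_per_module = lora_r * HIDDEN_SIZE + HIDDEN_SIZE * lora_r
trainable_params = params_per_module * len(TARGET_MODULES) * NUM_LAYERS

# The low-rank update is scaled by alpha / r before being added to the frozen weights.
scaling = lora_alpha / lora_r

print(f"{trainable_params:,}")
print(scaling)
```

With `lora_r = 64` and `lora_alpha = 16`, the scaling factor is 0.25, so raising the rank without raising alpha shrinks the contribution of each adapter update.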

In [None]:
output_dir = "/content/drive/MyDrive/results"
final_checkpoint_dir = os.path.join(output_dir, "final_checkpoint")

The training parameters below are preliminary and will be varied in later runs. TODO: revisit and optimize.

In [None]:
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_train_epochs = 1
optim = "paged_adamw_32bit"
save_steps = 1000
logging_steps = 10
learning_rate = 4e-5
max_grad_norm = 0.3
warmup_ratio = 0.03
lr_scheduler_type = "constant"

In [None]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    num_train_epochs=num_train_epochs
)
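
How these settings translate into optimization steps can be sketched with simple arithmetic. This is a simplified calculation, not the trainer's own logic: it assumes a single GPU and takes the number of training examples (97) from the trainer's log for the news subset.

```python
import math

# Values from the cells above; num_examples = 97 is taken from the trainer's log.
num_examples = 97
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_train_epochs = 1
logging_steps = 10
save_steps = 1000

# Assumption: one GPU, so the effective batch size has no device multiplier.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a loss value is logged, and how many intermediate checkpoints are written.
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))
intermediate_checkpoints = total_steps // save_steps

print(total_steps, logged_steps, intermediate_checkpoints)
```

With a batch size of 1, each example is its own optimization step, and because `save_steps` exceeds the total step count, no intermediate checkpoint is written during the run.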

In [None]:
max_seq_length = 2500
packing = False

In [None]:
news_trainer = SFTTrainer(
    model=model,
    train_dataset=news_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=packing,  # pass the setting declared above explicitly
    tokenizer=tokenizer,
    args=training_arguments
)




In [None]:
resume_checkpoint = False

In [None]:
transformers.logging.set_verbosity_info()

In [None]:
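# Cast the normalization layers to float32 for numerical stability
for name, module in news_trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)  # nn.Module.to() casts the module in place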

In [None]:
news_trainer.train(resume_from_checkpoint=resume_checkpoint)

***** Running training *****
  Num examples = 97
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 97
  Number of trainable parameters = 33,554,432
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.5782
20,1.5752
30,1.737
40,1.5651
50,1.6073
60,1.474
70,1.381
80,1.344
90,1.3144




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=97, training_loss=1.4949309457208693, metrics={'train_runtime': 930.51, 'train_samples_per_second': 0.104, 'train_steps_per_second': 0.104, 'total_flos': 4643543367720960.0, 'train_loss': 1.4949309457208693, 'epoch': 1.0})

In [None]:
# Unwrap the model if it was wrapped for distributed/parallel training
model_to_save = news_trainer.model.module if hasattr(news_trainer.model, 'module') else news_trainer.model
model_to_save.save_pretrained("outputs")

In [None]:
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)

In [None]:
model.push_to_hub("dbauer1860/llama-2-bnc-baby-news", create_pr=True)

Uploading the following files to dbauer1860/llama-2-bnc-baby-news: README.md,adapter_config.json,adapter_model.bin



CommitInfo(commit_url='https://huggingface.co/dbauer1860/llama-2-bnc-baby-news/commit/9b5bf98cde8a14d3a1b72ac6906bce1d64eed094', commit_message='Upload model', commit_description='', oid='9b5bf98cde8a14d3a1b72ac6906bce1d64eed094', pr_url='https://huggingface.co/dbauer1860/llama-2-bnc-baby-news/discussions/2', pr_revision='refs/pr/2', pr_num=2)