Let's think about quantization from a very high level, using some oversimplifications to understand what's really happening under the hood.

In essence, we can think of quantization as placing a pin on the number line (our quantization constant) and then expressing a low-precision, zero-centered range around that pinned number for each block of 64 weights. Exploiting the fact that our weights are approximately normally distributed and that we scale them to the range [-1, 1], this lets us use the NF4 datatype to express our high-precision weights near-optimally in a low-precision format. We still need some higher-precision numbers, but this process lets us represent many numbers in low precision for the cost of one number in high precision.
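
To make the "pin plus low-precision range" picture concrete, here is a minimal sketch of block-wise absmax quantization in plain Python. It is not the bitsandbytes implementation: the 16 evenly spaced levels are an assumption standing in for the real NF4 code book (whose levels follow normal quantiles), and the function names are hypothetical.

```python
import random

BLOCK_SIZE = 64  # block size named in the text
# Assumption: 16 evenly spaced levels in [-1, 1] stand in for the NF4 code book
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]

def quantize_block(block):
    """Return (quantization constant, 4-bit codes) for one block of floats."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = []
    for w in block:
        scaled = w / absmax  # now inside [-1, 1]
        codes.append(min(range(16), key=lambda i: abs(LEVELS[i] - scaled)))
    return absmax, codes

def dequantize_block(absmax, codes):
    """Map the 4-bit codes back to approximate weights."""
    return [LEVELS[i] * absmax for i in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
absmax, codes = quantize_block(weights)
restored = dequantize_block(absmax, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"quantization constant: {absmax:.5f}, max abs error: {max_error:.5f}")
```

One high-precision number (the constant) is stored per block, while each of the 64 weights costs only 4 bits.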

However, we can take it one step further: we can quantize the quantization constants themselves as well! This saves roughly 0.373 bits per parameter.
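
The 0.373 figure can be reproduced with a little arithmetic. The sketch below assumes 32-bit first-level constants that are re-quantized to 8 bits in groups of 256; these values are not stated in this notebook and are taken as assumptions here.

```python
BLOCK_SIZE = 64            # weights per quantization constant (from the text)
CONST_BITS = 32            # assumed: each constant is stored as a 32-bit float
SECOND_LEVEL_BITS = 8      # assumed: constants are re-quantized to 8 bits
SECOND_LEVEL_BLOCK = 256   # assumed: constants sharing one second-level constant

# Overhead per parameter when constants are kept in full precision
plain_overhead = CONST_BITS / BLOCK_SIZE

# Overhead per parameter when the constants are themselves quantized
double_overhead = (
    SECOND_LEVEL_BITS / BLOCK_SIZE
    + CONST_BITS / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)
)

saving = plain_overhead - double_overhead
print(f"{plain_overhead:.3f} -> {double_overhead:.3f} bits per parameter (saving {saving:.3f})")
```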

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install torch
!pip install git+https://github.com/huggingface/accelerate.git
!pip install bitsandbytes
!pip install datasets==2.13.1
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/lvwerra/trl.git
!pip install scipy


Set up Python environment

***Fine-tune LLaMA 2 models on datasets***



In [None]:
import argparse
import os
from functools import partial

import bitsandbytes as bnb
import torch
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments, set_seed,
)

In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
import pandas as pd
import time

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = "/content/drive/My Drive/Datasets/W_7/TIKVAH.csv"
dataset_file_path = "/content/drive/My Drive/Datasets/W_7/TIKVAH_dataset.csv"

In [None]:
df = pd.read_csv(file_path)

In [None]:
df.tail()

Unnamed: 0,id,text,date,channel_id,hashtags,emojis,symbols,links,mentions
39749,84203,#AddisAbaba በእኛ የህግ አማካሪ በኩል ጥፋት ነበረ ” አቶ አበባው...,2024-01-11T21:34:06,1130580549,"['#AddisAbaba', '#ጥፋት', '#‘ታስሮ', '#መመሪያውም']",,“-““““““““,[],['tikvahethiopia']
39750,84205,ቀብሩ ዛሬ ተፈፅሟል በቢሾፍቱ ቃጂማ ጊዮርጊስ ቤተክርስቲያን። የ8 ኣመት ...,2024-01-11T21:59:26,1130580549,[],,“-““““,[],['tikvahethiopia']
39751,84207,የግብፁ መሪ የኤርትራው ፕሬዜዳንት ኢሳያስ አፈወርቂ ግብፅን እንዲጎበኙ ግ...,2024-01-12T00:16:19,1130580549,[],,ℹ,[],['tikvahethiopia']
39752,84209,#Ethiopia የጠቅላይ ሚኒስትሩ የብሄራዊ ደህንነት አማካሪ አምባሳደር ...,2024-01-12T00:19:12,1130580549,"['#Ethiopia', '#ጫና']",,"""""""""",[],['tikvahethiopia']
39753,84217,በምጥ የተያዘችን እናት ሊያመጣ ሲሄድ በተተኮሰበት ጥይት ተመቶ ህይወቱ አ...,2024-01-12T00:54:13,1130580549,['#ተገደለ።'],,"""""-",[],['tikvahethiopia']


In [None]:
dataset = df[['text','hashtags']]
dataset.tail()

Unnamed: 0,text,hashtags
39749,#AddisAbaba በእኛ የህግ አማካሪ በኩል ጥፋት ነበረ ” አቶ አበባው...,"['#AddisAbaba', '#ጥፋት', '#‘ታስሮ', '#መመሪያውም']"
39750,ቀብሩ ዛሬ ተፈፅሟል በቢሾፍቱ ቃጂማ ጊዮርጊስ ቤተክርስቲያን። የ8 ኣመት ...,[]
39751,የግብፁ መሪ የኤርትራው ፕሬዜዳንት ኢሳያስ አፈወርቂ ግብፅን እንዲጎበኙ ግ...,[]
39752,#Ethiopia የጠቅላይ ሚኒስትሩ የብሄራዊ ደህንነት አማካሪ አምባሳደር ...,"['#Ethiopia', '#ጫና']"
39753,በምጥ የተያዘችን እናት ሊያመጣ ሲሄድ በተተኮሰበት ጥይት ተመቶ ህይወቱ አ...,['#ተገደለ።']


In [None]:
dataset = dataset.dropna(subset=['hashtags'])
#dataset = dataset[dataset['hashtags'].astype(bool)]  # Keep only non-empty lists
dataset = dataset[dataset['hashtags'].apply(lambda x: x != '[]')]

# Reset the index after dropping rows
dataset = dataset.reset_index(drop=True)

In [None]:

dataset.head()

Unnamed: 0,text,hashtags
0,ሰበር ዜና : ደህና ሁኑ ልጆች! / RIP #ETHIOPIA | አርቲስት ተ...,['#ETHIOPIA']
1,# ተስፋዬ ሳህሉ #,"['#', '#']"
2,️መልካም ቀን️ # በህይወት እስካለህ: : ልትሳሳት ፣ልትወድቅ ትቸላለህ ...,"['#', '#', '#', '#']"
3,አስመሳይ ነው የበዛው ጥቅሙን ፈላጊ ሳንቲም ባገኘም ቁጥር እራሱን አስቀዳ...,['#Panfalon']
4,# ክብር ለኢትዮጵያ እናቶች #,"['#', '#']"


In [None]:
dataset.shape

(23005, 2)

In [None]:
import re

def update_hashtags(dataset):
    """Preprocess the data in place: wherever '#' is followed by optional
    whitespace and then a word, join the '#' and the word."""
    for index, row in dataset.iterrows():
        text = row['text']

        # Find every word that follows a '#', allowing whitespace in between
        matches = re.findall(r'#\s*(\w+)', text)
        if not matches:
            continue

        # Join each '#' to its word everywhere in the text
        dataset.at[index, 'text'] = re.sub(r'#\s*(\w+)', r'#\1', text)
        # Keep a single label per row: the last hashtag found in the text
        dataset.at[index, 'hashtags'] = '#' + matches[-1]


# Call the function to update hashtags
update_hashtags(dataset)

# Display the updated DataFrame
dataset.head()


Unnamed: 0,text,hashtags
0,ሰበር ዜና : ደህና ሁኑ ልጆች! / RIP #ETHIOPIA | አርቲስት ተ...,#ETHIOPIA
1,#ተስፋዬ ሳህሉ #,#ተስፋዬ
2,️መልካም ቀን️ #በህይወት እስካለህ: : ልትሳሳት ፣ልትወድቅ ትቸላለህ ፣...,#ስብሀት
3,አስመሳይ ነው የበዛው ጥቅሙን ፈላጊ ሳንቲም ባገኘም ቁጥር እራሱን አስቀዳ...,#Panfalon
4,#ክብር ለኢትዮጵያ እናቶች #,#ክብር


In [None]:
df2 = dataset.copy()

In [None]:
from datasets import Dataset

# Create a dictionary containing your Amharic text data
data_dict = {"text": dataset['text'].tolist(), "hashtags": dataset['hashtags'].tolist()}

# Create a Dataset object
dataset = Dataset.from_dict(data_dict)



In [None]:
# df2['formatted_text'] = 'text: ' + df2['text'] +',' + 'hashtags: #' + df2['hashtags'].astype(str)

# # Create a dictionary containing your Amharic text data
# data_dict = {"formatted_text": df2['formatted_text'].tolist()}

# # Create a Dataset object
# fullDataset = Dataset.from_dict(data_dict)



In [None]:
# # Print the first few examples
# print(fullDataset['formatted_text'][:5])

In [None]:
# print(len(fullDataset))

In [None]:
# # Save the dataset to a file (e.g., in Arrow format)
# fullDataset.to_csv("sample_data/fullDataset.csv")


In [None]:
train_dataset = dataset.select(range(18404))
test_dataset = dataset.select(range(18404, len(dataset)))
dataset = train_dataset
dataset_subset = test_dataset

In [None]:
print(dataset['text'][0])

ሰበር ዜና : ደህና ሁኑ ልጆች! / RIP #ETHIOPIA | አርቲስት ተስፋዬ ሳህሉ(አባባ ተስፋዬ) ከዚህ ኣለም በ94 ኣመታቸው ተለዩ። ጤና ይስጥልኝ ልጆች!  የዛሬ አበባዎች፤ የነገ ፍሬዎች!  እንደምን አላችሁ ልጆች!  አያችሁ ልጆች!  የኢትዮጵያ ቴሌቪዥን የልጆች ክፍለ ጊዜ ዝግጅት እናንተን ለማስደሰት ልክ በሰኣቱ ይገኛል። አባባ ደሞ የልጆች ሰኣት እንዳያልፍባቸው በሩጫ ዲ ዲ ዲ ከተፍ ፤ እናንተ ደግሞ ቆማችሃል። ይሄ በጣም ጥሩ ነው ልጆች። አንድ አባት ሲመጣ በአክብሮት መነሳት አስፈላጊ ነው ። ደህና ሁኑ ልጆች!  ደህና ሁኑ ልጆች! ደህና ሁኑ ልጆች! ነፍስ ይማር Getu Temesgen


In [None]:
print(dataset_subset['text'][0])

#አብን የአማራ ብሄራዊ ንቅናቄ (አብን) ዶ/ር በለጠ ሞላን በድጋሚ የፓርቲው ሊቀመንበር አድርጎ መረጠ። አቶ መልካሙ ሹምዬ ደግሞ የፓርቲው ምክትል ሊቀመንበር ሆነው ተመርጠዋል። የአብን የሕዝብ ግንኙነት ሃላፊ አቶ ጣሂር መሀመድ ፤ ፓርቲው ለ3 ቀናት የማእከላዊ ኮሚቴ ስብሰባውን ሲያካሂድ እንደቆየና የሥራ አስፈፃሚ አባላቱን በአዲስ በማደራጀት ማጠናቀቁን ገልፀዋል። በዚህም መሰረት፦ ዶ/ር በለጠ ሞላ ሊቀመንበር አቶ መልካሙ ሹምዬ ምክትል ሊቀመንበር ዶ/ር ደሳለኝ ጫኔ የውጭ ግንኙነት ሃላፊ አቶ ዩሱፍ ኢብራሂም የሕግ ጉዳዮች ሃላፊ አቶ ክርስቲያን ታደለ የፖሊሲ እና ስትራቴጂ ክፍል ሃላፊ አቶ ጋሻው መርሻ የፖለቲካ ጉዳዮች ሃላፊ አቶ ጣሂር መሀመድ የሕዝብ ግንኙነት ሃላፊ ዶ/ር ቴዎድሮስ ሃ/ማርያም የአብን ፅ/ቤት ሃላፊ አቶ ሀሳቡ ተስፋየ አደረጃጀት ጉዳዮች ሃላፊ አድርጎ የሥራ አስፈፃሚውን በአዲስ አደራጅቷል። ከተመረጡት የፓርቲው የሥራ አስፈፃሚዎች መካከል ስስቱ አዲስ መሆናቸው ተገልጿል። ፓርቲው ዛሬ 3ኛ መደበኛ ጠቅላላ ጉባኤውን በአማራ ክልል መዲና ባህር ዳር ማካሄድ መጀመሩን #ኢብኮ / #አሚኮ ዘግቧል።


In [None]:
# Custom tokenizer
class CustomTokenizer:
    def __init__(self):
        self.pad_token = "[PAD]"  # Any string can serve as the pad token

    def tokenize(self, text):
        # Simple whitespace tokenization
        return text.split()

# Instantiate the custom tokenizer
custom_tokenizer = CustomTokenizer()

Function to download the LLaMA 2 model and its tokenizer. It requires a bitsandbytes configuration.

In [None]:
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    # Load the pre-trained causal language model with Hugging Face Transformers
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently across the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # The LLaMA tokenizer has no pad token, so reuse the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer


Pre-processing dataset

Instruction fine-tuning is a common technique used to fine-tune a base LLM for a specific downstream use-case.



In [None]:
def create_prompt_formats(sample):
    """
    Format the fields of the sample ('text', 'hashtags'),
    then concatenate them using two newline characters.
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Identify Hashtags from the given text."
    INSTRUCTION_KEY = "### Text:"
    RESPONSE_KEY = "Hashtags:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    text = f"{INSTRUCTION_KEY}\n{sample['text']}"
    response = f"{RESPONSE_KEY}\n{sample['hashtags']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, text, response, end] if part]

    formatted_prompt = "\n\n".join(parts)

    sample["text"] = formatted_prompt

    return sample
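
To see what a formatted training prompt looks like, here is a minimal standalone sketch that mirrors the joining logic above. The helper name `build_prompt` and the sample are hypothetical, invented for illustration only; they are not rows from the dataset.

```python
def build_prompt(sample):
    """Join the blurb, text, response and end marker with blank lines."""
    parts = [
        "Identify Hashtags from the given text.",
        f"### Text:\n{sample['text']}",
        f"Hashtags:\n{sample['hashtags']}",
        "### End",
    ]
    return "\n\n".join(parts)

# Hypothetical sample, invented for illustration only
example = {"text": "#Ethiopia Breaking news from Addis Ababa", "hashtags": "#Ethiopia"}
prompt_text = build_prompt(example)
print(prompt_text)
```

The "### End" marker gives the model an explicit stopping pattern to learn after the hashtags.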

Next, we use the model's tokenizer to turn these prompts into token IDs.

* The goal is to create input sequences that never exceed the model’s maximum token limit. Sequences are truncated at that limit during tokenization, and the data collator later pads each batch to a common length, which keeps computational overhead low.

In [None]:
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


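def preprocess_dataset(tokenizer, max_length: int, seed, dataset):
    """Format and tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from the tokenizer
    :param seed: Random seed used to shuffle the dataset
    :param dataset: Dataset with 'text' and 'hashtags' columns
    """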

    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)

    # Tokenize each batch of the dataset and remove the raw 'text' and 'hashtags' columns
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["text", "hashtags"],
    )

    # Drop samples that reached max_length, i.e. prompts that had to be truncated
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
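
The interaction between truncation and the length filter is easy to miss, so here is a small sketch of it. The whitespace tokenizer `toy_tokenize` is a hypothetical stand-in for the real tokenizer, and the tiny limit of 5 tokens is chosen only to make the effect visible.

```python
def toy_tokenize(text, max_length):
    """Whitespace tokenizer that truncates to max_length tokens (stand-in for the real one)."""
    return text.split()[:max_length]

max_length = 5  # tiny limit so the effect is visible
prompts = [
    "short prompt",
    "this prompt has exactly five",
    "this prompt is clearly longer than five tokens",
]

tokenized = [toy_tokenize(p, max_length) for p in prompts]
# Same rule as the filter above: keep only sequences strictly shorter than max_length
kept = [tokens for tokens in tokenized if len(tokens) < max_length]

print([len(t) for t in tokenized])
print(kept)
```

Because truncated sequences end up with exactly `max_length` tokens, the strict `<` comparison removes them, along with any prompt that happens to land exactly on the limit.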

**Create a bitsandbytes configuration**

> This lets us load our LLM in 4 bits, which divides the memory used by the weights by roughly 4 compared with 16-bit weights and allows the model to fit on smaller devices. We use the bfloat16 compute data type and nested (double) quantization to save additional memory.



In [None]:
def create_bnb_config():
    """Create and return a bitsandbytes (bnb) configuration object
    for 4-bit quantization."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,  # load the weights in 4-bit precision
        bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
        bnb_4bit_quant_type="nf4",  # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
    )

    return bnb_config
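
As a rough check of the "divide by 4" claim, the sketch below uses the element counts from the dtype report printed by the training cell later in this notebook. It is a back-of-the-envelope estimate that assumes two 4-bit weights are packed into each uint8 element and ignores the quantization constants and activations.

```python
# Element counts from the dtype report printed by the training cell of this notebook
packed_uint8_elements = 3_238_002_688  # assumed: each uint8 packs two 4-bit weights
float32_elements = 302_387_200         # parameters kept in float32

quantized_weights = packed_uint8_elements * 2

GIB = 1024 ** 3
size_4bit = packed_uint8_elements * 1 / GIB   # 1 byte per packed pair of weights
size_16bit = quantized_weights * 2 / GIB      # 2 bytes per weight in a 16-bit format
size_float32_rest = float32_elements * 4 / GIB

print(f"quantized weights: {quantized_weights:,}")
print(f"4-bit storage:  {size_4bit:.2f} GiB")
print(f"16-bit storage: {size_16bit:.2f} GiB")
print(f"float32 rest:   {size_float32_rest:.2f} GiB")
```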

**LoRA configuration**

> To leverage the LoRA method, we need to wrap the model as a PeftModel.


In [None]:
def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for the model
    :param modules: Names of the modules to apply LoRA to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config
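
To get a feel for what `r=16` costs, the following sketch counts the parameters LoRA adds when every linear projection in each decoder layer is adapted. The layer shapes (hidden size 4096, intermediate size 11008, 32 layers) are assumed values for Llama-2-7b and are not stated in this notebook; `lora_params` is a hypothetical helper.

```python
r = 16           # LoRA rank, as in the config above
lora_alpha = 64  # scaling parameter, as in the config above

# Assumed Llama-2-7b shapes (not stated in this notebook)
hidden, intermediate, n_layers = 4096, 11008, 32

def lora_params(d_in, d_out, rank):
    """Parameters of A (rank x d_in) plus B (d_out x rank) for one adapted linear layer."""
    return rank * (d_in + d_out)

# Four attention projections of shape hidden x hidden
attention = 4 * lora_params(hidden, hidden, r)
# Gate and up projections (hidden -> intermediate) and down projection (intermediate -> hidden)
mlp = 2 * lora_params(hidden, intermediate, r) + lora_params(intermediate, hidden, r)
total = n_layers * (attention + mlp)

print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the total agrees with the trainable-parameter count printed by the training run in this notebook.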

> The previous function needs the names of the target modules whose weight
matrices LoRA will update. The following function collects them for our model:

In [None]:


def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

> Once everything is set up and the base model is prepared, we can
use the print_trainable_parameters() helper function to see how many trainable parameters are in the model.

In [None]:
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        # Integer division keeps the count an int so the ',d' format below works
        trainable_params //= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )


**Train**

Now we can pre-process our dataset and load our model using the configurations defined above.


In [None]:

from huggingface_hub import login

login("HUGGINGFACE TOKEN")

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf"

bnb_config = create_bnb_config()

model, tokenizer2 = load_model(model_name, bnb_config)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
#tokenizer = custom_tokenizer

In [None]:

import random

seed = 42
random.seed(50)

In [None]:
## Preprocess dataset

max_length = get_max_length(model)

dataset = preprocess_dataset(tokenizer2, max_length, seed, dataset)

Found max length: 4096
Preprocessing dataset...



**Fine-tuning process using Single GPU**

In [None]:
def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)

    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)

    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=50,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )

    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs


    # Verifying the datatypes before training

    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)

    do_train = True

    # Launch training
    print("Training...")

    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)

    ###

    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()


output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer2, dataset, output_dir)


all params: 3,540,389,888 || trainable params: 39,976,960 || trainable%: 1.1291682911958425
torch.float32 302387200 0.08541070604255438
torch.uint8 3238002688 0.9145892939574456
Training...


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.0765
2,1.1646
3,1.1747
4,0.9646
5,1.0061
6,0.997
7,0.9498
8,0.9518
9,0.9249
10,0.8949


***** train metrics *****
  epoch                    =       0.01
  total_flos               =  7728012GF
  train_loss               =     0.8976
  train_runtime            = 0:03:17.67
  train_samples_per_second =      1.012
  train_steps_per_second   =      0.253
{'train_runtime': 197.6706, 'train_samples_per_second': 1.012, 'train_steps_per_second': 0.253, 'total_flos': 8297890158944256.0, 'train_loss': 0.8975745522975922, 'epoch': 0.01}
Saving last checkpoint of the model...


* If we prefer to train for a number of epochs (full passes of the training dataset through the model) instead of a number of training steps (optimizer updates, each covering per_device_train_batch_size × gradient_accumulation_steps samples), we can replace the max_steps argument with num_train_epochs.

* Because the model is wrapped with PEFT, trainer.model.save_pretrained(output_dir) saves only the LoRA adapter weights and their configuration, which we load later to rebuild the model for inference. The tokenizer is saved separately below.
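
The relationship between steps and epochs can be checked with the training arguments used above. This is a sketch that assumes a single GPU and uses the size of the training split before the length filter is applied.

```python
import math

max_steps = 50
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_samples = 18404  # size of the training split before filtering

# Samples consumed by one optimizer step on a single GPU
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = max_steps * samples_per_step
epochs_covered = samples_seen / num_train_samples
steps_per_epoch = math.ceil(num_train_samples / samples_per_step)

print(f"samples seen: {samples_seen}")
print(f"fraction of an epoch: {epochs_covered:.4f}")
print(f"steps needed for one epoch: {steps_per_epoch}")
```

The small fraction is consistent with the `epoch = 0.01` reported in the train metrics, and it shows how many steps a full epoch would take at this batch size.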

**Merge weights**

> Once we have our fine-tuned weights, we can build our fine-tuned
model and save it to a new directory, with its associated tokenizer.
By performing these steps, we get a self-contained fine-tuned
model and tokenizer ready for inference!

In [None]:
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)


In [None]:
# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)

('results/llama2/final_merged_checkpoint/tokenizer_config.json',
 'results/llama2/final_merged_checkpoint/special_tokens_map.json',
 'results/llama2/final_merged_checkpoint/tokenizer.json')

In [None]:
#model.save_pretrained(output_merged_dir, safe_serialization=True)


In [None]:
def create_prompt_formats_for_test(sample):
    """
    Format the 'text' field of the sample for inference, leaving out
    the 'hashtags' response so that the model has to generate it.
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Identify Hashtags from the given text."
    INSTRUCTION_KEY = "### Text:"
    # RESPONSE_KEY = "Hashtags:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    text = f"{INSTRUCTION_KEY}\n{sample['text']}"
    # response = f"{RESPONSE_KEY}\n{sample['hashtags']}"
    # end = f"{END_KEY}"

    parts = [part for part in [blurb, text] if part]

    formatted_prompt = "\n\n".join(parts)

    sample["text"] = formatted_prompt

    return sample

In [None]:
sample = dataset_subset[10]

prompt = create_prompt_formats_for_test(sample)

In [None]:
print(prompt)

{'text': 'Identify Hashtags from the given text.\n\n### Text:\n#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል።', 'hashtags': '#DrAbiyAhmed'}


In [None]:
import time

**Inference using Instruction or Question Only**


In [None]:
input_text = f"Instruction: {prompt['text']}"

In [None]:
# Tokenize the input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

# Measure inference time
start_time = time.time()

# Generate predictions
output = model.generate(input_ids, max_length=500, temperature=1.0, top_k=50, top_p=0.95, num_return_sequences=1)
generated_output = tokenizer.decode(output[0], skip_special_tokens=True)

end_time = time.time()

# Calculate and print the inference time
inference_time = end_time - start_time


In [None]:
# Print the formatted input
print(f"======")
print(f"Input:\n======\n{input_text}\n")
print(f"======================")
print(f"Generated Output:\n======================\n{generated_output}\n")
print(f"=========================================")
print(f"Inference Time:{inference_time} seconds\n==========================================")

Input:
Instruction: Identify Hashtags from the given text.

### Text:
#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል።

Generated Output:
Instruction: Identify Hashtags from the given text.

### Text:
#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል። #Ethiopia ኢትዮጵያ

Hashtags:
#Ethiopia

### End


Inference Time:1.6159441471099854 seconds


In [None]:
# Print the formatted input
print(f"======")
print(f"Input:\n======\n{input_text}\n")
print(f"======================")
print(f"Generated Output:\n======================\n{generated_output}\n")
print(f"=========================================")
print(f"Inference Time:{inference_time} seconds\n==========================================")

Input:
Instruction: Identify Hashtags from the given text.

### Text:
#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል።

Generated Output:
Instruction: Identify Hashtags from the given text.

### Text:
#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል። #Ethiopia #Ethiopian #DrAbiyAhmed

Hashtags:
#DrAbiyAhmed

### End


Inference Time:1.4403419494628906 seconds
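
Because the generated text echoes the prompt, the predicted hashtags have to be pulled out of the section between "Hashtags:" and "### End". Below is a minimal sketch of such a post-processing step; the helper `extract_hashtags` is hypothetical and the input string is a shortened stand-in for the generated output above.

```python
import re

def extract_hashtags(generated_output):
    """Return the hashtags listed between 'Hashtags:' and '### End', if any."""
    if "Hashtags:" not in generated_output:
        return []
    # Keep only the text after the response key and before the end marker
    answer = generated_output.split("Hashtags:", 1)[1]
    answer = answer.split("### End", 1)[0]
    return re.findall(r"#\w+", answer)

# Shortened, hypothetical stand-in for the generated text above
generated_output = (
    "Instruction: Identify Hashtags from the given text.\n\n"
    "### Text:\n#DrAbiyAhmed some text #Ethiopia\n\n"
    "Hashtags:\n#DrAbiyAhmed\n\n### End"
)
predicted = extract_hashtags(generated_output)
print(predicted)
```

Hashtags that the model appends to the echoed text itself are ignored, since only the response section is parsed.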


**Fine-tuning using multiple GPUs**

In [None]:
# def train(model, tokenizer, dataset, output_dir):
#     # Apply preprocessing to the model to prepare it by
#     # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
#     model.gradient_checkpointing_enable()

#     # 2 - Using the prepare_model_for_kbit_training method from PEFT
#     model = prepare_model_for_kbit_training(model)

#     # Get lora module names
#     modules = find_all_linear_names(model)

#     # Create PEFT config for these modules and wrap the model to PEFT
#     peft_config = create_peft_config(modules)
#     model = get_peft_model(model, peft_config)

#     # Print information about the percentage of trainable parameters
#     print_trainable_parameters(model)

#     #total_batch_size = n_gpus * per_device_batch_size
#     # Training parameters
#     trainer = Trainer(
#         model=model,
#         train_dataset=dataset,
#         args=TrainingArguments(
#             n_gpu=2,
#             per_device_train_batch_size=2,
#             gradient_accumulation_steps=4,
#             warmup_steps=2,
#             max_steps=20,
#             learning_rate=2e-4,
#             fp16=True,
#             logging_steps=1,
#             output_dir="outputs",
#             optim="paged_adamw_8bit",

#         ),
#         data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
#     )

#     model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs


#     # Verifying the datatypes before training

#     dtypes = {}
#     for _, p in model.named_parameters():
#         dtype = p.dtype
#         if dtype not in dtypes: dtypes[dtype] = 0
#         dtypes[dtype] += p.numel()
#     total = 0
#     for k, v in dtypes.items(): total+= v
#     for k, v in dtypes.items():
#         print(k, v, v/total)

#     do_train = True

#     # Launch training
#     print("Training...")

#     if do_train:
#         train_result = trainer.train()
#         metrics = train_result.metrics
#         trainer.log_metrics("train", metrics)
#         trainer.save_metrics("train", metrics)
#         trainer.save_state()
#         print(metrics)

#     ###

#     # Saving model
#     print("Saving last checkpoint of the model...")
#     os.makedirs(output_dir, exist_ok=True)
#     trainer.model.save_pretrained(output_dir)

#     # Free memory for merging weights
#     del model
#     del trainer
#     torch.cuda.empty_cache()


# output_dir = "results/llama2/final_checkpoint_2g"
# train(model, tokenizer, dataset, output_dir)


In [None]:
# model_2g = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
# model_2g = model_2g.merge_and_unload()

In [None]:
# # save tokenizer for easy inference
# tokenizer_2g = AutoTokenizer.from_pretrained(model_name)