In [1]:
!nvidia-smi

Wed Sep  6 19:46:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    Off  | 00000000:00:05.0 Off |                  Off |
| 30%   38C    P8    16W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import os
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, PeftModel, LoraConfig, get_peft_model


In [3]:
from datasets import load_dataset

train_data , test_data =load_dataset("databricks/databricks-dolly-15k", split=["train[:80%]", "train[80%:]"])

Using custom data configuration databricks--databricks-dolly-15k-7427aa6e57c34282
Reusing dataset json (/root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253)


  0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
import pandas as pd

train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

In [5]:
train_df = train_df[train_df["context"].str.len()>=10]
test_df = test_df[test_df["context"].str.len()>=10]
train_df.reset_index(drop=True,inplace=True)
test_df.reset_index(drop=True,inplace=True)


In [6]:
train_df

Unnamed: 0,instruction,context,response,category
0,When did Virgin Australia start operating?,"Virgin Australia, the trading name of Virgin A...",Virgin Australia commenced services on 31 Augu...,closed_qa
1,When was Tomoaki Komorida born?,Komorida was born in Kumamoto Prefecture on Ju...,"Tomoaki Komorida was born on July 10,1981.",closed_qa
2,If I have more pieces at the time of stalemate...,Stalemate is a situation in chess where the pl...,No. \nStalemate is a drawn position. It doesn'...,information_extraction
3,"Given a reference text about Lollapalooza, whe...",Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an ann...,Lollapalooze is an annual musical festival hel...,closed_qa
4,Who was John Moses Browning?,"John Moses Browning (January 23, 1855 – Novemb...",John Moses Browning is one of the most well-kn...,information_extraction
...,...,...,...,...
3560,What languages are spoken in Tunisia?,The official language of Tunisia is Modern Sta...,The official language of Tunisia is Modern Sta...,closed_qa
3561,Extract the venues of each Phish concert refer...,"On October 1, 2008, the band announced on thei...",-Hampton Coliseum\n-Fenway Park\n-Red Rocks Am...,information_extraction
3562,"Extract the title of the game, the name of its...",Horizon Zero Dawn is a 2017 action role-playin...,"Horizon Zero Dawn, Guerrilla Games, Aloy",information_extraction
3563,What's the architecture in Maskavas Forstate l...,Maskavas Forštate (German: Moskauer Vorstadt) ...,The architecture of Maskavas Forštate reflects...,closed_qa


In [7]:
train_df.shape, test_df.shape

((3565, 4), (901, 4))

In [8]:
def prepare_dataset(df, split="train"):
    text_col=[]
    instruction="""write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary."""
    if split == "train":
        for _, row in df.iterrows():
            inst=row["instruction"]
            inputc=row["context"]
            output=row["response"]
            text = ("### Instruction: \n" + instruction +"\n" + inst
                    +"\n### Input: \n" + inputc + "\n### Response: \n" + output)
            text_col.append(text)
        df.loc[:, "text"] = text_col
    else:
        for _, row in df.iterrows():
            inst=row["instruction"]
            inputc=row["context"]
            text = ("### Instruction: \n" + instruction +"\n" + inst
                    +"\n### Input: \n" + inputc + "\n### Response: \n" )
            text_col.append(text)
        df.loc[:, "text"] = text_col
    return df    

In [9]:
train_df = prepare_dataset(train_df, "train")
test_df = prepare_dataset(test_df, "test")

In [10]:
print(train_df["text"][0])

### Instruction: 
write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary.
When did Virgin Australia start operating?
### Input: 
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response: 
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.


In [11]:
print(test_df["text"][0])

### Instruction: 
write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary.
Using the two paragraphs below, when was the Ukrainian Chorus Dumka of NY founded, and when did it play in Ukraine for the first time?
### Input: 
Ukrainian Chorus Dumka of New York was founded in 1949 with the goal "to preserve and cultivate the rich musical heritage of Ukraine", both for the church and for secular occasions. In the beginning, the chorus was a men's chorus of Ukrainian immigrants who met to sing music they loved. The first music director was L. Krushelnycky. The group became a mixed choir in 1959. 

They have performed in New York at locations including in Alice Tully Hall, Avery Fisher Hall, Brooklyn Academy of Music, Carnegie Hall, Madison Square Garden, St. Patrick's Cathedral, and Town Hall. They toured to the Kennedy Center in Washington, and in several Europea

In [12]:
from datasets import Dataset
dataset = Dataset.from_pandas(train_df)
dataset

Dataset({
    features: ['instruction', 'context', 'response', 'category', 'text'],
    num_rows: 3565
})

In [13]:
model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_type="float16"
)
model =AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config,trust_remote_code=True,device_map="auto")
model.config.use_cache = True

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True,return_token_type_ids=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [15]:
#Lora configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj","v_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

In [16]:
# training args & prepare model for kbit training
from transformers import TrainingArguments
from trl import SFTTrainer
from peft import prepare_model_for_kbit_training
from time import perf_counter


args = TrainingArguments(
    output_dir="./llama2_ft_dolly",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    save_strategy="epoch",
    logging_steps=10,
    num_train_epochs=1,
    max_steps=100,
    fp16=True,
    push_to_hub=False,
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)


In [17]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    args=args,
    tokenizer=tokenizer,
    packing=False,
    max_seq_length=512
)



  0%|          | 0/4 [00:00<?, ?ba/s]

In [18]:
start_time = perf_counter()
trainer.train()
end_time = perf_counter()
training_time = end_time - start_time
print(f"Time taken for training: {training_time}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[34m[1mwandb[0m: Currently logged in as: [33mvikram-n[0m. Use [1m`wandb login --relogin`[0m to force relogin


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,1.7202
20,1.3312
30,1.1506
40,1.1849
50,1.1855
60,1.1547
70,1.1485
80,1.144
90,1.1536
100,1.1566


Time taken for training: 723.2419702569023


In [19]:
model_to_save = trainer.model.module if hasattr(trainer.model, "module") else trainer.model
model_to_save.save_pretrained("./llama2_ft_dolly/results")

In [20]:
lora_config = LoraConfig.from_pretrained("./llama2_ft_dolly/results")
tmodel = get_peft_model(model_to_save, lora_config)

In [21]:
from transformers import GenerationConfig
start_time = perf_counter()
text = test_df["text"][5]

inputs = tokenizer(text, return_tensors="pt").to("cuda")
generation_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True, top_k=5, temperature=0.5, repetition_penalty=1.2)
outputs = tmodel.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    generation_config=generation_config,
    max_new_tokens=100,)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
end_time = perf_counter()
output_time = end_time - start_time
print(f"Time taken for output: {output_time} seconds")



### Instruction: 
write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary.
When was Irina Vysheslavska born?
### Input: 
Irina Vysheslavska was born in Kiev on February 20, 1939, into a family of great cultural traditions. Her father Leonid Vysheslavsky was a noted poet and her mother Agnes Baltaga was a writer. Several of her ancestors were priests in Greece, Romania and Ukraine.
### Response: 
- Irina Visheshlavsa is from kiev ukraine and she has many famous relatives such as her dad leonard visheshlavsa who wrote poetry and her mum agnes baltage who also wrote poems. they are both very talented people but unfortunately irinas parents died when she was young so now shes an orphan living with her uncle.

Time taken for output: 23.099115974036977 seconds


In [22]:
text1 = test_df["text"][10]
inputs = tokenizer(text1, return_tensors="pt").to("cuda")
generation_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True, top_k=5, temperature=0.5, repetition_penalty=1.2)
outputs = tmodel.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    generation_config=generation_config,
    max_new_tokens=100,)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
end_time = perf_counter()
output_time = end_time - start_time
print(f"Time taken for output: {output_time} seconds")

### Instruction: 
write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary.
Was Venice always part of Italy?
### Input: 
The Republic of Venice lost its independence when Napoleon Bonaparte conquered Venice on 12 May 1797 during the War of the First Coalition. Napoleon was seen as something of a liberator by the city's Jewish population. He removed the gates of the Ghetto and ended the restrictions on when and where Jews could live and travel in the city.

Venice became Austrian territory when Napoleon signed the Treaty of Campo Formio on 12 October 1797. The Austrians took control of the city on 18 January 1798. Venice was taken from Austria by the Treaty of Pressburg in 1805 and became part of Napoleon's Kingdom of Italy. It was returned to Austria following Napoleon's defeat in 1814, when it became part of the Austrian-held Kingdom of Lombardy–Venetia. In

In [23]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Wed Sep  6 19:59:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    Off  | 00000000:00:05.0 Off |                  Off |
| 58%   77C    P2   183W / 300W |   8593MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [24]:
from transformers import GPTQConfig

quantization_config = GPTQConfig(bits=4,dataset=["c4"],desc_act=False)
quant_model = AutoModelForCausalLM.from_pretrained("./llama2_ft_dolly/outputs", quantization_config=quantization_config,device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

In [25]:
# saving the quantized model
quant_model.save_pretrained("./llama2_ft_dolly/quantized", safe_serialization=True)
tokenizer.save_pretrained("./llama2_ft_dolly/quantized")

('./llama2_ft_dolly/quantized/tokenizer_config.json',
 './llama2_ft_dolly/quantized/special_tokens_map.json',
 './llama2_ft_dolly/quantized/tokenizer.model',
 './llama2_ft_dolly/quantized/added_tokens.json',
 './llama2_ft_dolly/quantized/tokenizer.json')

In [26]:
start_time = perf_counter()
test = test_df["text"][5]
inputs = tokenizer(test, return_tensors="pt").to("cuda")
gen_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True, top_k=5, temperature=0.5, repetition_penalty=1.2)
outputs = quant_model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    generation_config=gen_config,
    max_new_tokens=100,)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
end_time = perf_counter()
output_time = end_time - start_time
print(f"Time taken for output: {output_time} seconds")

### Instruction: 
write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary.
When was Irina Vysheslavska born?
### Input: 
Irina Vysheslavska was born in Kiev on February 20, 1939, into a family of great cultural traditions. Her father Leonid Vysheslavsky was a noted poet and her mother Agnes Baltaga was a writer. Several of her ancestors were priests in Greece, Romania and Ukraine.
### Response: 
Irina Vyscheslawskaya (b. Feb 20th) is Ukrainian composer from Kyiv. She studied composition at the Kyiv Conservatory under Mykola Kolomiyets with further studies at the Moscow State Tchaikovsky Conservatory as well as private lessons in Paris with Nadia Boulanger. From an early age she composed music for piano and voice that has been performed by leading artists including Martha Argerich, Ol
Time taken for output: 3.429542585974559 seconds


In [27]:
start_time = perf_counter()
test1 = test_df["text"][10]
inputs = tokenizer(test1, return_tensors="pt").to("cuda")
gen_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True, top_k=5, temperature=0.5, repetition_penalty=1.2)
outputs = quant_model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    generation_config=gen_config,
    max_new_tokens=100,)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
end_time = perf_counter()
output_time = end_time - start_time
print(f"Time taken for output: {output_time} seconds")

### Instruction: 
write a precise summary of the below input text. 
    Return your response in bullet points which covers the keypoints of the input text.
    only provide full sentences response summary.
Was Venice always part of Italy?
### Input: 
The Republic of Venice lost its independence when Napoleon Bonaparte conquered Venice on 12 May 1797 during the War of the First Coalition. Napoleon was seen as something of a liberator by the city's Jewish population. He removed the gates of the Ghetto and ended the restrictions on when and where Jews could live and travel in the city.

Venice became Austrian territory when Napoleon signed the Treaty of Campo Formio on 12 October 1797. The Austrians took control of the city on 18 January 1798. Venice was taken from Austria by the Treaty of Pressburg in 1805 and became part of Napoleon's Kingdom of Italy. It was returned to Austria following Napoleon's defeat in 1814, when it became part of the Austrian-held Kingdom of Lombardy–Venetia. In