# Fine-tuning Llama 2 (7B Chat) for KIIT QnA with LoRA

This notebook demonstrates how to fine-tune the `meta-llama/Llama-2-7b-chat-hf` model on a custom KIIT Question-Answering dataset using Parameter-Efficient Fine-Tuning (PEFT) with LoRA and the `trl` library.

**Steps:**
1. Install necessary libraries.
2. Load the base Llama 2 model and tokenizer (quantized).
3. Load the custom KIIT dataset (`kiit_data.jsonl`).
4. Configure LoRA (`LoraConfig`).
5. Configure Training Arguments.
6. Initialize and run the `SFTTrainer`.
7. Save the LoRA adapters.
8. Merge adapters with the base model and save the final model locally.
9. Test inference with the fine-tuned model.
10. Test inference using the adapters directly (no merge).


## Step 1: Install Libraries


In [None]:
!pip install -q accelerate peft bitsandbytes transformers trl datasets torch
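
The `trl` and `peft` APIs change between releases, so it can help to record which versions the install step produced before running the rest of the notebook. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages:

```python
from importlib.metadata import PackageNotFoundError, version

# Same package list as the pip command above.
PACKAGES = ["accelerate", "peft", "bitsandbytes", "transformers", "trl", "datasets", "torch"]

def installed_versions(packages=PACKAGES):
    """Return {package name: version string}, or "not installed" when a package is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found

for name, ver in installed_versions().items():
    print(f"{name}: {ver}")
```

Keeping this output next to the training logs makes it easier to reproduce a run later.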


## Step 2: Load Model, Tokenizer, and Configure Quantization


In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from datasets import load_dataset
import os

# Suppress warnings
logging.set_verbosity(logging.CRITICAL)


In [None]:
# --- Configuration ---

# Model from Hugging Face hub (Make sure you have access requested and granted)
base_model_name = "meta-llama/Llama-2-7b-chat-hf"

# Custom KIIT dataset path
dataset_path = "kiit_data.jsonl" # Ensure this file is accessible

# Output directory for LoRA adapters
output_adapter_dir = "kiit-llama2-7b-chat-lora-adapters"

# Output directory for the final merged model (for local use/Streamlit)
final_merged_model_dir = "kiit-llama2-7b-chat-final"

# --- Quantization Config ---
# Use 4-bit quantization to reduce memory usage
compute_dtype = torch.float16 # Or torch.bfloat16 if supported (Ampere+)

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False, # Optional
)

# --- Load Base Model ---
print(f"Loading base model: {base_model_name}...")
# Set HF_TOKEN to your Hugging Face access token.
# Llama 2 is a gated model: request access on its Hugging Face model page before downloading.
HF_TOKEN = "hf_*************************"  # Replace with your token; do not commit a real token

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0}, # Place the entire model on GPU 0
    token=HF_TOKEN # Pass the token here
)
# Configure model for training
model.config.use_cache = False # Disable caching for gradient checkpointing
model.config.pretraining_tp = 1 # Tensor parallelism setting (usually 1 for single GPU)
print("Base model loaded.")

# --- Load Tokenizer ---
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True, token=HF_TOKEN) # Pass the token here as well
# Set padding token to EOS token for autoregressive models
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Important for fp16 training
print("Tokenizer loaded.")

Loading base model: meta-llama/Llama-2-7b-chat-hf...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Base model loaded.
Loading tokenizer...
Tokenizer loaded.
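
Hardcoding the token in a cell makes it easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the token has been exported as an environment variable named `HF_TOKEN`, is to read it at runtime. The helper `get_hf_token` is hypothetical:

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable and fail early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token

# Usage (uncomment once the variable is set):
# HF_TOKEN = get_hf_token()
```

The returned value can then be passed to `from_pretrained(..., token=HF_TOKEN)` exactly as in the cell above.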


## Step 3: Load KIIT Dataset

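Load the JSON Lines file created earlier. Each line is a JSON object with a `text` field that holds one complete training example in the Llama 2 chat format.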


In [None]:
# Load the dataset
try:
    dataset = load_dataset('json', data_files=dataset_path, split="train")
    print(f"Dataset loaded successfully from {dataset_path}")
    print(f"Dataset size: {len(dataset)}")
    print("\nFirst example:")
    # Print the structure of the first example's text
    print(dataset[0]['text'].replace('\\n', '\n')) # Make newlines readable
except Exception as e:
    print(f"Error loading dataset from '{dataset_path}': {e}")
    print("Please ensure 'kiit_data.jsonl' exists in the correct path and is formatted correctly.")
    # Stop execution if dataset fails to load
    raise SystemExit("Dataset loading failed.")


Dataset loaded successfully from kiit_data.jsonl
Dataset size: 1047

First example:
<s>[INST] <<SYS>>
You are a helpful assistant knowledgeable about Kalinga Institute of Industrial Technology (KIIT). Provide concise and accurate information based on the user's question about KIIT.
<</SYS>>

What does KIIT stand for? [/INST] KIIT stands for Kalinga Institute of Industrial Technology. Established in 1992, it's a deemed university located in Bhubaneswar, Odisha, India. </s>
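
How `kiit_data.jsonl` was produced is not shown in this notebook. As an illustration only, the sketch below builds records in the same layout as the example printed above from question/answer pairs. The helper names `format_example` and `write_jsonl` are hypothetical:

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant knowledgeable about Kalinga Institute of Industrial "
    "Technology (KIIT). Provide concise and accurate information based on the user's "
    "question about KIIT."
)

def format_example(question, answer, system_prompt=SYSTEM_PROMPT):
    """Wrap one question/answer pair in the Llama 2 chat template used by the dataset."""
    text = (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{question} [/INST] {answer} </s>"
    )
    return {"text": text}

def write_jsonl(pairs, path):
    """Write one JSON object per line, each with a single `text` field."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            f.write(json.dumps(format_example(question, answer), ensure_ascii=False) + "\n")

# Show one formatted record without touching any file on disk.
sample = format_example(
    "What does KIIT stand for?",
    "KIIT stands for Kalinga Institute of Industrial Technology.",
)
print(sample["text"])
```

A file written with `write_jsonl` can be loaded with the same `load_dataset('json', ...)` call used in this step.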


## Step 4: Configure LoRA (PEFT)

Configure LoRA to adapt specific layers of the base model efficiently.


In [None]:
# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,          # Scaling factor for LoRA weights
    lora_dropout=0.1,       # Dropout probability for LoRA layers
    r=64,                   # Rank of the LoRA matrices (higher rank = more parameters)
    bias="none",            # Bias terms to train ('none', 'all', 'lora_only')
    task_type="CAUSAL_LM",  # Task type
    # Target modules specific to Llama 2 architecture
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "embed_tokens"

    ]
)

print("LoRA Config set up.")


LoRA Config set up.
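
To get a feel for what `r=64` means in practice, the number of trainable LoRA parameters can be estimated by hand: each adapted layer gains two matrices, `A` (r × in) and `B` (out × r). The sketch below assumes the published Llama 2 7B dimensions (hidden size 4096, MLP size 11008, 32 decoder layers, vocabulary 32000); these values are not stated in this notebook and are labeled as assumptions in the code:

```python
# Assumed Llama 2 7B dimensions (from the model's published configuration).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32
VOCAB = 32000

# Values from the LoraConfig above.
R = 64
ALPHA = 16

def lora_params(in_features, out_features, r=R):
    """LoRA adds A (r x in) and B (out x r), i.e. r * (in + out) parameters."""
    return r * (in_features + out_features)

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)          # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, INTERMEDIATE)  # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN)      # down_proj
)
embedding = lora_params(VOCAB, HIDDEN)       # embed_tokens
total = NUM_LAYERS * per_layer + embedding
scaling = ALPHA / R                          # factor applied to the LoRA update

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"Total trainable LoRA parameters:   {total:,}")
print(f"LoRA scaling (alpha / r):          {scaling}")
```

Once the trainer has wrapped the model, `trainer.model.print_trainable_parameters()` reports the actual count and can be compared against this estimate.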


## Step 5: Configure Training Arguments

Set hyperparameters for the training process. Adjust `num_train_epochs`, `per_device_train_batch_size`, and `gradient_accumulation_steps` based on your dataset size and GPU memory.


In [None]:
import os

# --- Set PyTorch CUDA Allocation Configuration ---
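# Suggested by PyTorch's OOM error message to potentially reduce fragmentation.
# To be sure the setting takes effect, set it before any CUDA memory is allocated
# (ideally at the top of the notebook, before the model is loaded).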
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
print(f"Set PYTORCH_CUDA_ALLOC_CONF={os.environ.get('PYTORCH_CUDA_ALLOC_CONF')}")


Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True


In [None]:
from trl import SFTConfig # Import SFTConfig

# --- Check GPU Capability for fp16/bf16 ---
use_bf16 = False # Default to False
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if major >= 8: # Ampere or newer supports bfloat16
        use_bf16 = True
        print("Ampere+ GPU detected, enabling bf16.")
    else:
        print("Older GPU detected, using fp16.")
else:
    print("CUDA not available, cannot use bf16 or fp16.")

# Use SFTConfig to hold all arguments
training_args = SFTConfig(
    # --- Training Arguments ---
    output_dir="./kiit-results",
    num_train_epochs=1,
    per_device_train_batch_size=1,          # Keep batch size at 1
    gradient_accumulation_steps=8,          # Keep accumulation
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",
    save_strategy="steps",
    save_steps=100,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=torch.cuda.is_available() and not use_bf16,  # fp16 only when a CUDA GPU without bf16 support is present
    bf16=use_bf16,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",


    max_seq_length=512,
    dataset_text_field="text",
    packing=False,
)

print(f"SFTConfig set up: fp16={training_args.fp16}, bf16={training_args.bf16}, max_seq_length={training_args.max_seq_length}, LoRA r={peft_config.r}.")



Older GPU detected, using fp16.
SFTConfig set up: fp16=True, bf16=False, max_seq_length=512, LoRA r=64.


## Step 6: Initialize and Run Trainer (`SFTTrainer`)

Initialize the `SFTTrainer` from `trl`, which simplifies the process of supervised fine-tuning.


In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_args

)

# Start training
print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning finished.")


Starting fine-tuning...
{'loss': 1.9617, 'grad_norm': 58.88863754272461, 'learning_rate': 0.00019308737486442045, 'num_tokens': 23253.0, 'mean_token_accuracy': 0.6693099531531334, 'epoch': 0.19102196752626552}
{'loss': 0.9357, 'grad_norm': 1.1141765117645264, 'learning_rate': 0.00015425462638657595, 'num_tokens': 46159.0, 'mean_token_accuracy': 0.7916224443912506, 'epoch': 0.38204393505253104}
{'loss': 0.7847, 'grad_norm': 0.8637145757675171, 'learning_rate': 9.501541143393028e-05, 'num_tokens': 68880.0, 'mean_token_accuracy': 0.8164774137735367, 'epoch': 0.5730659025787965}
{'loss': 0.7508, 'grad_norm': 0.8939021825790405, 'learning_rate': 3.7651019814126654e-05, 'num_tokens': 91162.0, 'mean_token_accuracy': 0.8221593025326729, 'epoch': 0.7640878701050621}




{'loss': 0.6828, 'grad_norm': 0.8343603014945984, 'learning_rate': 3.7375753049987973e-06, 'num_tokens': 113686.0, 'mean_token_accuracy': 0.8278827980160713, 'epoch': 0.9551098376313276}




{'train_runtime': 1645.3758, 'train_samples_per_second': 0.636, 'train_steps_per_second': 0.079, 'train_loss': 1.011131897339454, 'num_tokens': 117925.0, 'mean_token_accuracy': 0.8320745095610619, 'epoch': 0.9933142311365807}
Fine-tuning finished.
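
The `epoch` values in the log above follow from the dataset size and the batch settings. The arithmetic below is a sketch that assumes a single GPU and that only full groups of `gradient_accumulation_steps` batches count as optimizer steps; it reproduces the logged numbers but is not a description of the trainer's internals:

```python
dataset_size = 1047    # from the dataset loading output
per_device_batch = 1   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
logging_steps = 25

# Each optimizer step consumes batch_size * accumulation examples.
examples_per_step = per_device_batch * grad_accum

# Assumption: the incomplete final group of 7 examples does not form a step.
optimizer_steps = dataset_size // examples_per_step

epoch_at_first_log = logging_steps * examples_per_step / dataset_size
epoch_at_end = optimizer_steps * examples_per_step / dataset_size

print(f"Examples per optimizer step: {examples_per_step}")
print(f"Optimizer steps in one epoch: {optimizer_steps}")
print(f"Epoch fraction at step {logging_steps}: {epoch_at_first_log:.5f}")
print(f"Epoch fraction at the last step: {epoch_at_end:.5f}")
```

With these settings the effective batch size is 8, so doubling `gradient_accumulation_steps` would roughly halve the number of optimizer steps per epoch.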


## Step 7: Save LoRA Adapters

Save the trained adapter weights. These are small files representing the changes made to the base model.


In [None]:
# Save the LoRA adapter weights
print(f"Saving LoRA adapters to {output_adapter_dir}...")
trainer.model.save_pretrained(output_adapter_dir)
print("Adapters saved.")

# Save tokenizer files as well (good practice)
print(f"Saving tokenizer to {output_adapter_dir}...")
tokenizer.save_pretrained(output_adapter_dir)
print("Tokenizer saved with adapters.")


Saving LoRA adapters to kiit-llama2-7b-chat-lora-adapters...




Adapters saved.
Saving tokenizer to kiit-llama2-7b-chat-lora-adapters...
Tokenizer saved with adapters.
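
To confirm what was written and how large the adapter files are, the output directory can be listed with the standard library. This is a small sketch; the helper name `list_files_with_sizes` is hypothetical:

```python
from pathlib import Path

def list_files_with_sizes(directory):
    """Return [(relative path, size in MB)] for every file under `directory`, largest first."""
    root = Path(directory)
    rows = [
        (str(path.relative_to(root)), path.stat().st_size / 1e6)
        for path in root.rglob("*")
        if path.is_file()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

adapter_dir = Path("kiit-llama2-7b-chat-lora-adapters")
if adapter_dir.is_dir():
    for name, size_mb in list_files_with_sizes(adapter_dir):
        print(f"{size_mb:10.2f} MB  {name}")
else:
    print(f"{adapter_dir} not found; run the save step first.")
```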


## Step 8: Merge Adapters and Save Final Model Locally

This step combines the original Llama 2 weights with your trained LoRA adapters to create a new, standalone model directory. The merged model can then be loaded in Streamlit or other applications with plain `transformers`, without needing the PEFT library at inference time.

**Requires significant RAM/GPU memory.**
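
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W_merged = W + (alpha / r) * B @ A`. The toy example below illustrates that idea in pure Python with tiny made-up matrices; it is not PEFT's implementation, which also handles dtypes, devices, and other details:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Toy layer: 3 inputs, 2 outputs, LoRA rank 1 (all values are made up).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # out x in
A = [[0.5, -1.0, 2.0]]                   # r x in
B = [[2.0], [4.0]]                       # out x r
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Path 1: base layer plus a separate adapter branch (how the model runs before merging).
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
y_adapter = [b + (alpha / r) * a for b, a in zip(base_out, adapter_out)]

# Path 2: a single merged weight matrix (how the model runs after merging).
y_merged = matvec(merge_lora(W, A, B, alpha, r), x)

print("With separate adapter:", y_adapter)
print("With merged weights:  ", y_merged)
```

Both paths give the same output, which is why the merged model no longer needs the adapter files.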


In [None]:
from peft import AutoPeftModelForCausalLM
import gc # Garbage collector

# --- Clear Memory Before Loading Full Model ---
print("Clearing trainer and model from memory...")
del trainer # Delete the trainer object
del model   # Delete the LoRA-wrapped model object
gc.collect() # Force garbage collection
torch.cuda.empty_cache() # Clear GPU cache
print("Memory cleared.")

Clearing trainer and model from memory...
Memory cleared.
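
The memory warning for this step can be made concrete with a rough estimate of the space the weights alone occupy at different precisions. The parameter count below is an approximation for Llama 2 7B and is an assumption; the estimate ignores quantization metadata, activations, and framework overhead:

```python
APPROX_PARAMS = 6.74e9  # approximate parameter count of Llama 2 7B (assumption)

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("4-bit (NF4)", 4), ("float16", 16), ("float32", 32)]:
    print(f"{label:>12}: ~{weight_memory_gb(APPROX_PARAMS, bits):.1f} GB")
```

The jump from the 4-bit model used in training to the float16 model used for merging is roughly fourfold, which is why the trainer and quantized model are deleted first.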


In [None]:
# --- Load Base Model (non-quantized, FP16) ---
print(f"Reloading base model ({base_model_name}) for merging...")
# Load in float16 (not 4-bit) so the adapters are merged into unquantized weights
base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    return_dict=True,
    torch_dtype=torch.float16, # Load in half-precision
    device_map='auto', # Load across available GPUs if needed
    trust_remote_code=True,
    token=HF_TOKEN # Gated model: pass the token as in Step 2
)
print("Base model reloaded.")

# --- Load PEFT Model (Adapters) ---
print(f"Loading PEFT adapters from {output_adapter_dir}...")
# Load the PeftModel using the base model and the adapter directory
# device_map='auto' will try to load adapters onto the same devices
lora_model = PeftModel.from_pretrained(
    base_model_reload,
    output_adapter_dir,
    device_map='auto',
    offload_folder="offload" # Create a directory named "offload" for this
)
print("PEFT model (adapters) loaded.")

# --- Merge Adapters ---
print("Merging LoRA adapters into the base model...")
merged_model = lora_model.merge_and_unload()
print("Adapters merged successfully.")

# --- Save Merged Model ---
print(f"Saving the final merged model to {final_merged_model_dir}...")
# Use safe_serialization=True for better compatibility and safety
merged_model.save_pretrained(final_merged_model_dir, safe_serialization=True)
print("Merged model saved.")

# --- Save Tokenizer with Merged Model ---
print(f"Saving tokenizer to {final_merged_model_dir}...")
# Reload tokenizer associated with the adapters (or use the original one)
tokenizer_for_merged = AutoTokenizer.from_pretrained(output_adapter_dir)
tokenizer_for_merged.save_pretrained(final_merged_model_dir)
print("Tokenizer saved with merged model.")

# --- Final Memory Cleanup ---
print("Cleaning up merged model objects...")
del base_model_reload
del lora_model
del merged_model
gc.collect()
torch.cuda.empty_cache()
print("Cleanup complete.")


Reloading base model (meta-llama/Llama-2-7b-chat-hf) for merging...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Base model reloaded.
Loading PEFT adapters from kiit-llama2-7b-chat-lora-adapters...




PEFT model (adapters) loaded.
Merging LoRA adapters into the base model...


In [None]:
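# merge_script.py
# Standalone version of the merge step; it can be run outside this notebook on the downloaded adapters.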
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import gc
import os

base_model_name = "meta-llama/Llama-2-7b-chat-hf"
adapter_dir = "kiit-llama2-7b-chat-lora-adapters" # Path to downloaded adapters
output_merged_dir = "kiit-llama2-7b-chat-final" # Output path

print(f"Loading base model: {base_model_name}")
# Load on CPU first if RAM is plentiful, or use device_map if GPU available
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    return_dict=True,
    torch_dtype=torch.float16, # Or desired final precision
    # device_map="auto", # Optional if GPU available in merge environment
    low_cpu_mem_usage=True # Reduces peak CPU RAM while the checkpoint is loaded
)
print("Base model loaded.")

print(f"Loading adapters from: {adapter_dir}")
# Load adapters onto the base model
model_to_merge = PeftModel.from_pretrained(
    base_model,
    adapter_dir
    # device_map="auto" # Optional
)
print("Adapters loaded.")

print("Merging adapters...")
merged_model = model_to_merge.merge_and_unload()
print("Merge complete.")

print(f"Saving merged model to: {output_merged_dir}")
merged_model.save_pretrained(output_merged_dir, safe_serialization=True)
print("Merged model saved.")

print(f"Saving tokenizer to: {output_merged_dir}")
tokenizer = AutoTokenizer.from_pretrained(adapter_dir) # Load tokenizer from adapter dir
tokenizer.save_pretrained(output_merged_dir)
print("Tokenizer saved.")

print("Merge script finished.")


Loading base model: meta-llama/Llama-2-7b-chat-hf


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Base model loaded.
Loading adapters from: kiit-llama2-7b-chat-lora-adapters
Adapters loaded.
Merging adapters...


## Step 9: Test Inference with Fine-tuned Merged Model

Load the final standalone model saved locally and test its QnA capabilities on KIIT-specific questions.


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import gc

# --- Load the Final Merged Model ---
print(f"Loading the final fine-tuned model from: {final_merged_model_dir}")

# Ensure torch_dtype matches how you saved the merged model (float16 in this case)
ft_model = AutoModelForCausalLM.from_pretrained(
    final_merged_model_dir,
    torch_dtype=torch.float16,
    device_map="auto", # Load onto available GPU(s)
)
ft_tokenizer = AutoTokenizer.from_pretrained(final_merged_model_dir)

print("Final fine-tuned model and tokenizer loaded.")

# --- Set up Generation Pipeline ---
# device_map="auto" should handle device placement
pipe = pipeline(
    "text-generation",
    model=ft_model,
    tokenizer=ft_tokenizer,
    torch_dtype=torch.float16,
    device_map="auto" # Redundant but safe
)

# --- Test Prompts ---
def ask_kiit_bot(question):
    """Generates an answer using the fine-tuned model and Llama 2 chat format."""
    system_prompt = "You are a helpful assistant knowledgeable about Kalinga Institute of Industrial Technology (KIIT). Provide concise and accurate information based on the user's question about KIIT."
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{question} [/INST]"

    print(f"\n--- Testing Prompt ---\n{question}\n")
    sequences = pipe(
        prompt,
        do_sample=True,       # Enable sampling for more varied responses
        top_k=10,             # Consider top 10 probable tokens
        num_return_sequences=1,
        eos_token_id=ft_tokenizer.eos_token_id,
        max_new_tokens=200    # Limit the answer length
    )

    print("--- Generated Response ---")
    full_response = sequences[0]['generated_text']
    # Extract only the text after [/INST]
    answer_part = full_response.split('[/INST]')[-1].strip()
    # Remove potential EOS token if it appears right at the end
    if answer_part.endswith(ft_tokenizer.eos_token):
         answer_part = answer_part[:-len(ft_tokenizer.eos_token)].strip()
    print(answer_part)
    print("-" * 26)


# Test Case 1
ask_kiit_bot("What is the fee structure for B.Tech CSE?")

# Test Case 2
ask_kiit_bot("How many schools does KIIT have?")

# Test Case 3
ask_kiit_bot("Tell me about the placement process at KIIT.")

# Test Case 4 (More conversational)
ask_kiit_bot("Is the campus life good at KIIT?")

# --- Cleanup ---
# del ft_model
# del ft_tokenizer
# del pipe
# gc.collect()
# torch.cuda.empty_cache()
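
The prompt construction and answer extraction inside `ask_kiit_bot` are repeated in Step 10. One way to keep them consistent is to factor them into small pure functions that can be checked without loading a model. A sketch, with hypothetical helper names `build_prompt` and `extract_answer`:

```python
def build_prompt(question, system_prompt):
    """Build a single-turn prompt in the Llama 2 chat format used for training."""
    return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{question} [/INST]"

def extract_answer(generated_text, eos_token="</s>"):
    """Return the text after the last [/INST], without a trailing EOS token."""
    answer = generated_text.split("[/INST]")[-1].strip()
    if eos_token and answer.endswith(eos_token):
        answer = answer[: -len(eos_token)].strip()
    return answer

# Quick check with a made-up generation string (no model required).
demo_prompt = build_prompt("How many schools does KIIT have?", "You are a helpful assistant.")
demo_output = demo_prompt + " This is a placeholder answer. </s>"
print(extract_answer(demo_output))
```

Inside `ask_kiit_bot`, `prompt` would come from `build_prompt(question, system_prompt)` and the printed text from `extract_answer(sequences[0]['generated_text'], ft_tokenizer.eos_token)`.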


## Step 10: Test Inference using Adapters Directly (No Merge)

In [None]:
# --- Step 10: Test Inference using Adapters Directly (No Merge) ---

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from peft import PeftModel # Import PeftModel
import gc
import os

# --- Configuration ---
# Redefined here so this cell can run on its own (values match the previous steps)
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
adapter_dir = "kiit-llama2-7b-chat-lora-adapters" # Directory where adapters were saved

# --- Define Quantization Config (Needed again for loading base model) ---
# Determine compute dtype based on GPU capability check
major, minor = torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)
if major >= 8: # Ampere or newer
    compute_dtype = torch.bfloat16
    print("Using bfloat16.")
else:
    compute_dtype = torch.float16
    print("Using float16.")

bnb_config_inference = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False, # Matches the training config in Step 2
)

# --- Load Quantized Base Model ---
print(f"Loading base model ({base_model_name}) quantized...")
base_model_inference = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config_inference,
    device_map="auto",
    trust_remote_code=True
    # token="YOUR_HF_TOKEN" # Add if needed
)
print("Base model loaded.")

# --- Load Tokenizer ---
# Load tokenizer from the adapter directory (where it was saved)
print(f"Loading tokenizer from {adapter_dir}...")
tokenizer_inference = AutoTokenizer.from_pretrained(adapter_dir)
print("Tokenizer loaded.")

# --- Load LoRA Adapters onto the Base Model ---
print(f"Loading LoRA adapters from {adapter_dir} onto the base model...")
# This automatically uses the device map from the base model
model_with_adapters = PeftModel.from_pretrained(base_model_inference, adapter_dir)
print("LoRA adapters loaded.")

# --- Set up Pipeline ---
print("Setting up text generation pipeline...")
# Use the model with adapters loaded
pipe = pipeline(
    "text-generation",
    model=model_with_adapters, # Use the PEFT model
    tokenizer=tokenizer_inference,
    torch_dtype=compute_dtype, # Match the compute dtype
    device_map="auto"
)
print("Pipeline ready.")

# --- Test Inference ---
def ask_kiit_bot_adapters(question):
    """Generates response using the model with adapters."""
    system_prompt = "You are a helpful assistant knowledgeable about Kalinga Institute of Industrial Technology (KIIT). Provide concise and accurate information based on the user's question about KIIT."
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{question} [/INST]"

    print(f"\n--- Testing Prompt ---\n{question}\n")
    sequences = pipe(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer_inference.eos_token_id,
        max_new_tokens=250
    )

    print("--- Generated Response ---")
    full_response = sequences[0]['generated_text']
    answer_part = full_response.split('[/INST]')[-1].strip()
    if answer_part.endswith(tokenizer_inference.eos_token):
         answer_part = answer_part[:-len(tokenizer_inference.eos_token)].strip()
    print(answer_part)
    print("-" * 26)

# --- Run Tests ---
ask_kiit_bot_adapters("What is the eligibility criteria for B.Tech programs at KIIT?")
ask_kiit_bot_adapters("Describe the campus infrastructure.")
ask_kiit_bot_adapters("Who is the founder of KIIT?")

# --- Cleanup ---
del base_model_inference
del model_with_adapters
del tokenizer_inference
del pipe
gc.collect()
torch.cuda.empty_cache()


Using float16.
Loading base model (meta-llama/Llama-2-7b-chat-hf) quantized...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Base model loaded.
Loading tokenizer from kiit-llama2-7b-chat-lora-adapters...
Tokenizer loaded.
Loading LoRA adapters from kiit-llama2-7b-chat-lora-adapters onto the base model...


Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForConditionalGeneration', 'Gemma3ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GotOcr2ForConditionalGeneration', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeo

LoRA adapters loaded.
Setting up text generation pipeline...
Pipeline ready.

--- Testing Prompt ---
What is the eligibility criteria for B.Tech programs at KIIT?

--- Generated Response ---
12th with 60%+ in PCM subjects. 100% seats reserved for Odisha applicants. 2024: 1.6 Lacs applicants.
--------------------------

--- Testing Prompt ---
Describe the campus infrastructure.

--- Generated Response ---
150+ buildings with 100% Wi-Fi. 120+ sports facilities. 550+ CCTV cameras. 24/7 security guards.
--------------------------

--- Testing Prompt ---
Who is the founder of KIIT?

--- Generated Response ---
Dr. Achyuta Samanta. Founded KIIT in 1992 as a tribal university. 2008 UGC Act gave it central status.
--------------------------


In [None]:
# Zip the adapters folder so it can be downloaded
!zip -r /content/kiit-llama2-7b-chat-lora-adapters.zip /content/kiit-llama2-7b-chat-lora-adapters


  adding: content/kiit-llama2-7b-chat-lora-adapters/ (stored 0%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/README.md (deflated 66%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/tokenizer_config.json (deflated 66%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/special_tokens_map.json (deflated 73%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/tokenizer.model (deflated 55%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/adapter_config.json (deflated 56%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/tokenizer.json (deflated 85%)
  adding: content/kiit-llama2-7b-chat-lora-adapters/adapter_model.safetensors (deflated 28%)
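
Outside Colab, where the `!zip` shell command or the `/content` path may not be available, the same archive can be created from Python with the standard library. A sketch; the helper name `zip_directory` is hypothetical:

```python
import shutil
from pathlib import Path

def zip_directory(directory):
    """Create <directory>.zip next to the directory and return the archive path."""
    directory = Path(directory).resolve()
    return shutil.make_archive(
        str(directory),             # archive path without the .zip extension
        "zip",
        root_dir=directory.parent,  # paths inside the archive are relative to this
        base_dir=directory.name,    # keep the top-level folder name in the archive
    )

adapter_dir = Path("kiit-llama2-7b-chat-lora-adapters")
if adapter_dir.is_dir():
    print("Archive written to:", zip_directory(adapter_dir))
else:
    print(f"{adapter_dir} not found; run the save step first.")
```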
