## Training the Model

In this notebook, we will train the model using the dataset we created.

### Overview
- **Objective**: Train the Qwen2.5-3B model to answer AI-related questions using the generated dataset.
- **Approach**: Load the dataset, configure the training parameters, and train the model using the `SFTTrainer` from the `trl` library.

### Steps:
1. **Load Dataset**: Load the dataset from the JSON file created in the previous steps.
2. **Preprocess Data**: Tokenize the data and prepare it for training.
3. **Configure Training**: Set up the training arguments and configure the model for training.
4. **Train Model**: Use the `SFTTrainer` to train the model on the dataset.

By following these steps, we will fine-tune the Qwen2.5-3B model to improve its ability to answer AI-related questions.

In [None]:
!pip install datasets
!pip install transformers
!pip install faiss-cpu

### Import Libraries

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install --no-deps cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth



# Unsloth recommends importing it before transformers so its optimizations are applied
from unsloth import FastLanguageModel

import json
import pickle

import faiss
import fitz
import markdown
import numpy as np
import torch
from datasets import load_dataset
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

### Loading the Pre-trained Model

Now, we load the Qwen2.5-3B model using the `FastLanguageModel` class with specific settings to optimize performance and memory usage.

#### Parameters:
- **`max_seq_length`**: `2048` - Defines the maximum sequence length supported by the model.
- **`dtype`**: `None` - Detects the appropriate data type automatically; it can also be set explicitly, for example to `torch.float16` on GPUs without bfloat16 support.
- **`load_in_4bit`**: `True` - Enables 4-bit quantization to reduce memory usage.

#### Process:
1. **Model Selection**:
   - The `fourbit_models` list of pre-quantized 4-bit checkpoints is provided for reference only; the loading call does not use it.
   - The `Qwen2.5-3B` model is selected for training.

2. **Load Model**:
   - Use `FastLanguageModel.from_pretrained()` to load the selected model (`unsloth/Qwen2.5-3B`), with specified parameters for sequence length, data type, and quantization.
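
To give a rough sense of why `load_in_4bit` matters, the sketch below estimates the memory needed for the weights alone at 16-bit and 4-bit precision. It assumes roughly 3 billion parameters for a "3B" model and ignores quantization metadata, activations, optimizer state, and the LoRA adapters, so the real footprint will be higher. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Assumption: roughly 3 billion parameters for a "3B" model.
NUM_PARAMS = 3e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.2f} GiB")
print(f" 4-bit weights: ~{int4_gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what makes a model of this size practical on a single small GPU.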


In [None]:

max_seq_length = 2048
dtype = None
load_in_4bit = True


fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit",
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-3B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,

)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

### Loading and Formatting the Dataset

In this cell, we load the dataset from a JSON file, split it into training and testing sets, and format the prompts for training.

#### Process:
1. **Load Dataset**:
   - Load the dataset from the JSON file using `load_dataset("json", data_files="/content/dataset.json")`.
   - Split the dataset into training and testing sets with a 99:1 ratio using `train_test_split`.

2. **Define Prompt Template**:
   - Define the `alpaca_prompt` template to structure the instruction, input, and response for training.
   - The Alpaca format is used here to provide a clear and consistent structure for the model to learn from. It includes:
     - **Instruction**: Describes the task to be performed.
     - **Input**: Provides context or additional information needed to complete the task.
     - **Response**: The expected output or answer to the instruction based on the input.

3. **Format Prompts**:
   - Define the `formatting_prompts_func` function to format the dataset examples using the `alpaca_prompt` template.
   - Add the end-of-sequence token (`EOS_TOKEN`) to each formatted text to indicate the end of the response.

4. **Apply Formatting**:
   - Apply the `formatting_prompts_func` to the training dataset using the `map` function.

#### Reason for using Alpaca Format:
- **Consistency**: The Alpaca format provides a consistent structure for each training example, making it easier for the model to learn the relationship between instructions, inputs, and responses.
- **Clarity**: By clearly separating the instruction, input, and response, the model can better understand the context and generate appropriate answers.
- **Flexibility**: This format can accommodate a wide range of tasks and contexts, making it suitable for training a versatile chatbot model.
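
As a small illustration of the formatting step, the sketch below applies the same template to one made-up record and prints the resulting training string. The field names `Input`, `Context`, and `Response` follow the dataset used in this notebook; the sample record and the `EOS_TOKEN` string are placeholders (in the notebook the token comes from `tokenizer.eos_token`), and `format_batch` is a stand-in name for `formatting_prompts_func`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Placeholder value; in the notebook use tokenizer.eos_token instead.
EOS_TOKEN = "<|endoftext|>"


def format_batch(examples):
    """Build one training string per example in a batch (dict of lists)."""
    texts = [
        ALPACA_PROMPT.format(instruction, context, response) + EOS_TOKEN
        for instruction, context, response in zip(
            examples["Input"], examples["Context"], examples["Response"]
        )
    ]
    return {"text": texts}


# Made-up record, only for illustration.
sample_batch = {
    "Input": ["What does LoRA stand for?"],
    "Context": [""],
    "Response": ["Low-Rank Adaptation."],
}

formatted = format_batch(sample_batch)
print(formatted["text"][0])
```

The function takes a dict of lists and returns a dict of lists, which is the shape `map(..., batched=True)` works with.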


In [None]:
from datasets import load_dataset, concatenate_datasets
# Load the dataset from the JSON file
dataset = load_dataset("json", data_files="/content/dataset.json")
dataset = dataset['train'].train_test_split(test_size=0.01, shuffle=True)

train_dataset = dataset['train']
test_dataset = dataset['test']

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Build one Alpaca-style training string per example in the batch."""
    instructions = examples["Input"]
    contexts = examples["Context"]
    responses = examples["Response"]
    texts = []
    for instruction, context, response in zip(instructions, contexts, responses):
        # Append EOS so the model learns where a response ends
        text = alpaca_prompt.format(instruction, context, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Add the formatted "text" column that the trainer reads
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

In [None]:
print(train_dataset[302])  # Inspect one formatted training example, including its "text" field

### Configuring and Initializing the Trainer

Now we configure and initialize the `SFTTrainer` from the `trl` library to train the model using the formatted dataset.

#### Parameters and Reasoning:

1. **Trainer Initialization**:
   - **`model`**: The pre-trained Qwen2.5-3B model loaded earlier.
   - **`tokenizer`**: The tokenizer associated with the model.
   - **`train_dataset`**: The training dataset prepared and formatted in previous steps.
   - **`dataset_text_field`**: `"text"` - The field in the dataset containing the formatted text.
   - **`max_seq_length`**: `2048` - Maximum sequence length for the model.
   - **`dataset_num_proc`**: `2` - Number of processes used to preprocess (tokenize) the dataset.

2. **Training Arguments**:
   - **`per_device_train_batch_size`**: `2` - Small batch size due to limited dataset size and to fit within memory constraints.
   - **`gradient_accumulation_steps`**: `4` - Accumulate gradients over 4 steps, giving an effective batch size of 2 × 4 = 8 per device without requiring more memory.
   - **`warmup_steps`**: `5` - Number of warmup steps to gradually increase the learning rate at the start of training.
   - **`max_steps`**: `60` - Total number of training steps, kept low due to the small dataset size.
   - **`learning_rate`**: `2e-4` - Learning rate for the optimizer, chosen to balance between convergence speed and stability.
   - **`optim`**: `"adamw_8bit"` - Use the AdamW optimizer with 8-bit precision to reduce memory usage.
   - **`weight_decay`**: `0.01` - Weight decay for regularization to prevent overfitting.
   - **`lr_scheduler_type`**: `"linear"` - Linear learning rate scheduler for gradual learning rate decay.
   - **`seed`**: `3407` - Seed for reproducibility.
   - **`output_dir`**: `"outputs"` - Directory to save training outputs and checkpoints.
   - **`report_to`**: `"none"` - Disable reporting to external services (e.g., WandB).

With these settings, the `SFTTrainer` is configured and initialized, ready to train the model on the provided dataset.
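
To make the numbers above concrete, the sketch below works out the effective batch size, the total number of sequences processed in 60 steps, and the learning rate at a given step under linear warmup followed by linear decay. It assumes a single GPU and mirrors the usual shape of a linear schedule with warmup; `linear_lr` is a hypothetical helper written for this illustration, not a function from `trl` or `transformers`.

```python
PER_DEVICE_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 4
WARMUP_STEPS = 5
MAX_STEPS = 60
BASE_LR = 2e-4

# Assumption: one GPU, so the per-device batch size is the whole micro-batch.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS
sequences_seen = effective_batch_size * MAX_STEPS


def linear_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


print("effective batch size:", effective_batch_size)
print("sequences seen in training:", sequences_seen)
for step in (0, 1, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

With `max_steps=60`, training covers at most 480 sequences, so a larger dataset will not be seen in full during this run.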

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Set to "wandb" or similar to enable experiment tracking
    ),
)

## Training the model

In [None]:
trainer_stats = trainer.train()


### Testing the Trained Model


Here we manually test the newly trained model on a sample instruction.

By the end of this cell, the model generates a response to the given instruction using the specified prompt format, demonstrating its ability to provide concise and relevant answers.

In [None]:
query_alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a brief and precise response that directly answers the question without unnecessary information.

1. Keep the response brief and concise.
2. Avoid elaborating on unrelated topics.
3. Provide only the most relevant information from the input.
### Instruction:
{}

### Input:
{}

### Response:
{}"""

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
   query_alpaca_prompt.format(
        "is deepseek R1 Zero fast? ", # instruction
        " ", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 300)

### Saving the Trained Model with 4-bit Quantization

In [None]:
# Save to 4-bit Q4_K_M GGUF
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")






### Loading and Using Embeddings for Similarity Search (RAG Retrieval)

In this cell, we load embeddings and document chunks, and set up a FAISS index for similarity search.

#### Process:
1. **Load Embedding Model**:
   - Load the `multi-qa-mpnet-base-dot-v1` model from `SentenceTransformer` for generating embeddings.

2. **Load Document Chunks**:
   - Load the document chunks from a pickle file (`chunks.pkl`).
   - Convert the `LangChain` `Document` objects to a list of dictionaries containing text and metadata.

3. **Load Embeddings**:
   - Load the precomputed embeddings from a NumPy file (`vector_store.npy`).
   - Print the shape of the loaded embeddings to verify.

4. **Normalize Embeddings**:
   - Normalize the embeddings to ensure they are suitable for cosine similarity calculations.

5. **Create FAISS Index**:
   - Create a FAISS index for inner product (cosine similarity) using the loaded embeddings.
   - Add the embeddings to the FAISS index.

6. **Define Function for Similarity Search**:
   - `get_embeddings(query_text)`: Encodes a query text, normalizes it, and finds the top 3 similar document chunks using the FAISS index.
   - Returns the text content of the related document chunks.
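
The reason for normalizing before an inner-product search can be seen in a dependency-free sketch. The toy vectors below are made up; the point is that on raw vectors a long vector can win purely on magnitude, whereas on L2-normalized vectors the inner product equals cosine similarity. The helper names are hypothetical and stand in for what FAISS does at scale.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query, corpus, k=3):
    """Rank corpus vectors by inner product after L2 normalization (cosine similarity)."""
    q = l2_normalize(query)
    scores = [(dot(q, l2_normalize(doc)), idx) for idx, doc in enumerate(corpus)]
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:k]]


# Made-up 2-D "embeddings", only for illustration.
toy_corpus = [
    [1.0, 0.0],    # same direction as the query
    [10.0, 10.0],  # 45 degrees away, but large magnitude
    [0.0, 1.0],    # orthogonal to the query
    [-1.0, 0.0],   # opposite direction
]
toy_query = [2.0, 0.0]

ranked = top_k_by_inner_product(toy_query, toy_corpus, k=3)
raw_best = max(range(len(toy_corpus)), key=lambda i: dot(toy_query, toy_corpus[i]))

print("top-3 with normalization:", ranked)
print("best match on raw inner product:", raw_best)
```

Both the stored embeddings and the query need to be normalized for the index scores to be cosine similarities.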


In [None]:


embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')



with open("chunks.pkl", "rb") as f:     #loading chunks
    document_chunks = pickle.load(f)



# Convert LangChain Document objects to dict
chunk_data = [{"text": doc.page_content, "metadata": doc.metadata} for doc in document_chunks]


embeddings_array = np.load("vector_store.npy")

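# Check loaded data
#print(document_chunks[0].page_content)  # Print first chunk text
print(embeddings_array.shape)


# Normalize embeddings for cosine similarity
# (FAISS expects a contiguous float32 array; normalize_L2 works in place)
embeddings_array = np.ascontiguousarray(embeddings_array, dtype='float32')
faiss.normalize_L2(embeddings_array)

# Create a FAISS index for cosine similarity (using Inner Product)
dimension = embeddings_array.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings_array)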

def get_embeddings(query_text=""):
    """Return the text of the top 3 chunks most similar to query_text."""
    query_embedding = embedding_model.encode([query_text]).astype('float32')
    # Normalize query embedding before searching
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    distances, indices = index.search(query_embedding, k=3)
    # FAISS pads the result with -1 when fewer than k matches are available
    related_indices = [i for i in indices[0] if i != -1]
    return [document_chunks[i].page_content for i in related_indices]

## ~~Add LSTM for trained model~~

This part is not finished yet; the LSTM memory defined below is not used during generation.

In [None]:
import torch
import torch.nn as nn

class LSTMMemory(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

    def forward(self, input_sequence, hidden_state=None):
        output, (hn, cn) = self.lstm(input_sequence, hidden_state)
        return output, (hn, cn)

    def init_hidden(self, batch_size):
        # Initialize hidden state and cell state, one slice per LSTM layer
        return (torch.zeros(self.num_layers, batch_size, self.hidden_dim),
                torch.zeros(self.num_layers, batch_size, self.hidden_dim))

# Initialize the LSTM Memory
input_dim = 768  # Embedding dimension (based on your model)
hidden_dim = 512
lstm_memory = LSTMMemory(input_dim, hidden_dim)


## Generating Responses
The response prompt and generation procedure are defined here. Chat history is not integrated yet.
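
A minimal sketch of the prompt assembly is shown below, using the same template as the notebook. It joins the retrieved chunks into one context string and, as one possible way to bring in chat history later, optionally prepends the most recent messages. `build_prompt` and `max_history_messages` are hypothetical names introduced for this illustration, and the sample chunks and history are made up.

```python
PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a brief and precise response that directly answers the question without unnecessary information.

1. Keep the response brief and concise.
2. Avoid elaborating on unrelated topics.
3. Provide only the most relevant information from the input.
### Instruction:
{query}

### Input:
{context}

### Response:"""


def build_prompt(query, relevant_chunks, chat_history=None, max_history_messages=4):
    """Join optional recent history and retrieved chunks into one prompt string."""
    parts = []
    if chat_history and max_history_messages > 0:
        # Hypothetical history handling: keep only the most recent messages
        parts.extend(chat_history[-max_history_messages:])
    parts.extend(relevant_chunks)
    context = "\n\n".join(parts)
    return PROMPT_TEMPLATE.format(query=query, context=context)


# Made-up inputs, only for illustration.
example_prompt = build_prompt(
    "What is retrieval-augmented generation?",
    ["Chunk A: made-up retrieved text.", "Chunk B: more made-up text."],
    chat_history=["old question", "old answer", "q1", "a1", "q2", "a2"],
)
print(example_prompt)
```

Capping the history keeps the prompt within the sequence length set when the model was loaded; the cap of 4 messages is an arbitrary choice for the example.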

In [None]:
chat_history = []


# Generate Response Function
def generate_response(query, chat_history, model, tokenizer, lstm_memory):
    # Retrieve relevant chunks
    relevant_chunks = get_embeddings(query)
    print(f"relevant chunks: \n{relevant_chunks}")
    print("\n\n\n")
    # Chat history is not integrated yet, so only the retrieved chunks form the context
    context = "\n\n".join(relevant_chunks)
    prompt= f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a brief and precise response that directly answers the question without unnecessary information.

1. Keep the response brief and concise.
2. Avoid elaborating on unrelated topics.
3. Provide only the most relevant information from the input.
### Instruction:
{query}

### Input:
{context}

### Response:"""
    # TODO: LSTM memory is not wired in yet; the earlier draft is kept for reference.
    # inputs = tokenizer(context, return_tensors='pt')
    # embeddings = model(**inputs).last_hidden_state.mean(dim=1).unsqueeze(0)
    # lstm_output, hidden_state = lstm_memory(embeddings, hidden_state)

    # Tokenize the prompt and stream the generated tokens as they are produced
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    response = model.generate(**inputs, streamer=text_streamer, max_new_tokens=200)
    # Return only the newly generated tokens, not the echoed prompt
    return response[:, inputs["input_ids"].shape[1]:]



# **Final Chatbot**

You can keep chatting with the newly trained model here.

In [None]:
# Continuous Chat Loop
print("Start chatting with your bot (type 'exit' to stop):")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        break

    response = generate_response(
        user_input, chat_history, model, tokenizer, lstm_memory
    )
    response = tokenizer.decode(response[0],skip_special_tokens=True)

    # Update chat history for context and memory
    chat_history.append(user_input)
    chat_history.append(response)