## Generating Dataset for Model Training

In this notebook, we generate a training dataset using the pre-trained `Qwen2.5-3B` model. We chose this model because of resource constraints (no free API access to stronger models) and limited time.

### Overview
- **Objective**: Create a dataset suitable for training a chatbot model that answers AI-related questions.
- **Approach**: Use the `Qwen2.5-3B` model to generate the dataset, and the `multi-qa-mpnet-base-dot-v1` model to generate embeddings efficiently.

### Steps:
1. **Load Data**: Load the previously generated vector embeddings and document chunks.
2. **Generate Dataset**: Use `faiss` to retrieve similar chunks, then feed them to the model with predefined prompts to generate data.
3. **Save Dataset**: Store the generated dataset for use in model training.


Installing Dependencies

In [None]:
!pip install numpy langchain langchain-community faiss-cpu sentence-transformers

### Loading the embeddings & initializing vector store
We load the embedding vectors and document chunks created previously.
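
Because each row of the embedding matrix has to correspond to one chunk, a quick consistency check after loading can catch a mismatched pair of files. The sketch below is illustrative only: it writes tiny stand-in files to a temporary directory (the file names mirror the ones used here, but the contents and the `check_alignment` helper are hypothetical) and then verifies the row count and dtype.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def check_alignment(chunks, embeddings):
    """Return a contiguous float32 copy with one row per chunk, or raise ValueError."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if len(chunks) != embeddings.shape[0]:
        raise ValueError(
            f"{len(chunks)} chunks but {embeddings.shape[0]} embedding rows"
        )
    # FAISS indexes expect contiguous float32 input.
    return np.ascontiguousarray(embeddings, dtype=np.float32)


# Stand-in data (hypothetical): three text chunks and three 8-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo_chunks = ["chunk one", "chunk two", "chunk three"]
    demo_vectors = np.random.default_rng(0).normal(size=(3, 8))  # float64 on purpose

    with open(tmp / "chunks.pkl", "wb") as f:
        pickle.dump(demo_chunks, f)
    np.save(tmp / "vector_store.npy", demo_vectors)

    with open(tmp / "chunks.pkl", "rb") as f:
        loaded_chunks = pickle.load(f)
    loaded_vectors = check_alignment(loaded_chunks, np.load(tmp / "vector_store.npy"))

print(loaded_vectors.shape, loaded_vectors.dtype)
```

In the notebook itself, the same check would be applied to `document_chunks` and `embeddings_array` right after they are loaded.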



In [None]:
import pickle
import numpy as np
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
import json

embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')



# Load the document chunks (LangChain Document objects)
with open("chunks.pkl", "rb") as f:
    document_chunks = pickle.load(f)

# Convert LangChain Document objects to dicts
chunk_data = [{"text": doc.page_content, "metadata": doc.metadata} for doc in document_chunks]

# Load the embedding matrix (one row per chunk)
embeddings_array = np.load("vector_store.npy")

# Check loaded data
# print(document_chunks[0].page_content)  # Print first chunk text
print(embeddings_array.shape)

### Finding Similar Embeddings with FAISS

In this cell, we use FAISS to retrieve the document chunks whose embeddings are most similar to a query or to another chunk.

#### Libraries Used:
- `faiss`: A library for efficient similarity search and clustering of dense vectors.

#### Process:
1. **Normalize Embeddings**:
   - Ensure embeddings are normalized for cosine similarity.

2. **Create FAISS Index**:
   - Create a FAISS index for inner product (cosine similarity) using the embeddings.

3. **Define Functions**:
   - `get_embeddings(query_text)`: Encodes a query text, normalizes it, and finds the top 3 similar document chunks.
   - `get_similar_indices(chunk_index, k)`: Finds the top `k` similar document chunks for a given chunk index, excluding itself.
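
To see why an inner-product index gives cosine similarity once the vectors have unit length, here is a small NumPy-only sketch. It uses brute-force search in place of FAISS, and the helper names (`l2_normalize`, `top_k_neighbors`) and the toy vectors are made up for illustration. It also mirrors the "ask for `k + 1` results, then drop the query row" idea behind `get_similar_indices`.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit L2 norm so that dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_neighbors(unit_vectors, row, k=3):
    """Brute-force stand-in for an inner-product search that excludes the query row."""
    scores = unit_vectors @ unit_vectors[row]
    order = np.argsort(-scores)[: k + 1]  # k + 1 because the row matches itself
    return [int(i) for i in order if i != row][:k]


toy = np.array(
    [
        [1.0, 0.0],   # row 0: the query
        [10.0, 1.0],  # row 1: nearly the same direction, much larger norm
        [0.0, 5.0],   # row 2: orthogonal
        [-1.0, 0.0],  # row 3: opposite direction
    ],
    dtype=np.float32,
)
unit = l2_normalize(toy)

raw = toy @ toy[0]    # unnormalized inner products
cos = unit @ unit[0]  # cosine similarities

print("raw inner products:", raw)
print("cosine similarities:", cos)
print("neighbors of row 0:", top_k_neighbors(unit, 0, k=2))
```

Here the raw inner product ranks the long vector `[10, 1]` above the query's own row, while after normalization the query row scores highest and the neighbors of row 0 come out as `[1, 2]`.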


In [None]:
import faiss

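# Normalize embeddings for cosine similarity (in place; FAISS expects contiguous float32)
embeddings_array = np.ascontiguousarray(embeddings_array, dtype='float32')
faiss.normalize_L2(embeddings_array)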

# Create a FAISS index for cosine similarity (using Inner Product)
dimension = embeddings_array.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings_array)

def get_embeddings(query_text=""):
    query_embedding = embedding_model.encode([query_text]).astype('float32')
    # Normalize query embedding before searching
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    distances, indices = index.search(query_embedding, k=3)
    related_indices = [i for i in indices[0] if i != -1]
    return [document_chunks[i].page_content for i in related_indices]


def get_similar_indices(chunk_index=-1,k=3):
    chunk_embedding = embeddings_array[chunk_index].reshape(1, -1)  # Ensure correct shape

    # Perform FAISS search
    distances, indices = index.search(chunk_embedding, k=k+1)  # k+1 to exclude itself

    # Remove the chunk itself from results (if present) and keep at most k
    related_indices = [i for i in indices[0] if (i != chunk_index and i != -1)][:k]

    # Return related chunks
    return [document_chunks[i].page_content for i in related_indices]

Testing the retrieval of similar chunks

In [None]:
# Query the vector database
query_text = "What are the first-generation reasoning models?"
retrieved_chunks= get_embeddings(query_text)
print(retrieved_chunks)

### Prompt Templates for Dataset Generation

We define the prompt templates used to generate queries and responses for the dataset.
To improve dataset quality, we combine query generation and query evolution with expected-response generation.


#### Functions:
1. **`Queryprompt(contexts)`**:
   - Generates a prompt to create JSON objects with questions or statements related to the given contexts.

2. **`multi_context_template(context, original_input)`**:
   - Rewrites an input to require information from all elements in the context, ensuring conciseness and relevance.

3. **`reasoning_template(context, original_input)`**:
   - Rewrites an input to explicitly request multi-step reasoning, ensuring it is concise and understandable.

4. **`hypothetical_scenario_template(context, original_input)`**:
   - Rewrites an input to incorporate a hypothetical scenario, encouraging application of knowledge from the context.

5. **`response_template(context, input)`**:
   - Generates a detailed and informative response to the given input, using relevant details from the context.

These templates help in structuring the prompts for generating diverse and relevant dataset entries.

In [None]:
def Queryprompt(contexts):
    return f"""I want you to act as a copywriter. Based on the given context,
which is a list of strings, generate a list of JSON objects
with an `input` key. The `input` can be a question or a statement
related to the contexts.

contexts:
{contexts}"""


def multi_context_template(context,original_input):
  return f"""
  I want you to rewrite the given `input` so that it requires readers to use information from all elements in `Context`.

1. `Input` should require information from all `Context` elements.
2. `Rewritten Input` must be concise and fully answerable from `Context`.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

def reasoning_template(context,original_input):
  return f"""
I want you to rewrite the given `input` so that it explicitly requests multi-step reasoning.

1. `Rewritten Input` should require multiple logical connections or inferences.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

def hypothetical_scenario_template(context,original_input):
  return f"""
I want you to rewrite the given `input` to incorporate a hypothetical or speculative scenario.

1. `Rewritten Input` should encourage applying knowledge from `Context` to deduce outcomes.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

def response_template(context, input):
  return f"""
I want you to generate a well-structured response to the given `input`, using relevant details from `Context`.
1. `Response` must be detailed, informative.
2. `Response` should provide clear and accurate information without unnecessary repetition.
3. Do not use phrases like 'based on the provided context.'
4. Maintain a professional and engaging tone.

Context: {context}
Input: {input}
Response:
"""

The decoded model output contains the prompt and other text that does not belong in the dataset, so we post-process each response with one of the following functions, depending on the case.
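
As an illustration of the two extraction strategies, the sketch below applies them to hand-written strings that stand in for decoded model output. The sample texts and the helper names (`find_inputs`, `text_after_marker`) are invented for this example; the notebook's own functions first decode token IDs with the tokenizer.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def find_inputs(decoded_text):
    """Collect the value of every "input" key, ignoring Markdown code fences."""
    cleaned = decoded_text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    return re.findall(r'"input":\s*"([^"]+)"', cleaned)


def text_after_marker(decoded_text, marker):
    """Return the text following the first occurrence of marker, or None if absent."""
    start = decoded_text.find(marker)
    if start == -1:
        return None
    return decoded_text[start + len(marker):].strip()


# Hand-written stand-ins for decoded output (prompt followed by completion).
sample_questions = (
    "contexts: [...]\n"
    + FENCE + "json\n"
    + '[{"input": "What is LoRA?"}, {"input": "Explain 4-bit quantization."}]\n'
    + FENCE
)
sample_answer = "Context: [...]\nInput: What is LoRA?\nResponse:\n LoRA adds low-rank adapters."

print(find_inputs(sample_questions))
print(text_after_marker(sample_answer, "Response:"))
print(text_after_marker(sample_answer, "Rewritten Input:"))  # marker missing, so None
```

One limitation worth knowing: the `[^"]+` pattern stops at the first double quote, so a question that contains an escaped quote would be cut short.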

In [None]:
import re
def extract_questions(response):
    chatbot_response = tokenizer.decode(response[0], skip_special_tokens=True)

    cleaned_response = re.sub(r"```json|```", "", chatbot_response).strip()

    # Try to find the JSON structure using regex
    questions = re.findall(r'"input":\s*"([^"]+)"', cleaned_response)

    return questions

def extract_response(response):
    text = tokenizer.decode(response[0], skip_special_tokens=True)
    marker = 'Response:'
    start_index = text.find(marker)

    if start_index != -1:
        # Slice the text from just after the 'Response:' marker onwards
        return text[start_index + len(marker):].strip()
    return None

def extract_rewritten_input(response):
    # Extract the text that follows the 'Rewritten Input:' marker
    text = tokenizer.decode(response[0], skip_special_tokens=True)
    marker = 'Rewritten Input:'
    start_index = text.find(marker)

    if start_index != -1:
        # Slice the text from just after the 'Rewritten Input:' marker onwards
        return text[start_index + len(marker):].strip()
    return None

### Evolving Queries and Generating Responses

In this cell, we define a function to perform random evolution steps on queries and generate responses, creating entries for the training dataset.

#### Libraries Used:
- `transformers.TextStreamer`: Used for streaming text generation.
- `random`: Used to pick an evolution template at random.
- `re`: Regular expression operations.

#### Process:
1. **Templates and Evolution Steps**:
   - Define a list of evolution templates (`multi_context_template`, `reasoning_template`, `hypothetical_scenario_template`).
   - Set the number of evolution steps to apply.

2. **Function `evolve_query(original_input, context)`**:
   - Randomly choose a template from `evolution_templates`.
   - Generate an evolved prompt using the chosen template.
   - Tokenize and generate a response using the model.
   - Extract the rewritten input from the response.
   - Generate a final response using the `response_template`.
   - Append the original input, final response, and context to the `training_dataset`.

`training_dataset` collects the resulting entries, ready for use in model training.
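
The control flow of a single evolution step can be tried without a GPU by swapping the model for a stub. In this sketch, `fake_generate`, the shortened templates, `text_after`, and `evolve_once` are placeholders invented for illustration; only the order of operations follows the steps listed above.

```python
import random


def reasoning_prompt(context, original_input):
    return (
        "Rewrite the input so that it requires multi-step reasoning.\n"
        f"Context: {context}\nInput: {original_input}\nRewritten Input:"
    )


def hypothetical_prompt(context, original_input):
    return (
        "Rewrite the input as a hypothetical scenario.\n"
        f"Context: {context}\nInput: {original_input}\nRewritten Input:"
    )


def answer_prompt(context, question):
    return f"Answer using the context.\nContext: {context}\nInput: {question}\nResponse:"


def fake_generate(prompt):
    """Stub for tokenize + generate + decode: returns the prompt plus a canned completion."""
    if prompt.endswith("Rewritten Input:"):
        return prompt + " How would X change if Y held?"
    return prompt + " A canned answer."


def text_after(text, marker):
    start = text.find(marker)
    return None if start == -1 else text[start + len(marker):].strip()


def evolve_once(original_input, context, templates, rng):
    # 1. Pick a template at random and build the evolution prompt.
    template = rng.choice(templates)
    decoded = fake_generate(template(str(context), original_input))
    # 2. Keep only the rewritten question.
    evolved = text_after(decoded, "Rewritten Input:")
    # 3. Ask for a response to the rewritten question and extract it.
    answer = text_after(fake_generate(answer_prompt(str(context), evolved)), "Response:")
    # 4. Build the dataset entry with the same keys as the notebook.
    return {"Input": original_input, "Response": answer, "Context": context}


entry = evolve_once(
    "What is X?",
    ["chunk about X", "chunk about Y"],
    [reasoning_prompt, hypothetical_prompt],
    random.Random(0),
)
print(entry)
```

Passing a seeded `random.Random` instance keeps the template choice reproducible, which can help when debugging prompts.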

In [None]:
!pip install torch transformers
!pip install tqdm pandas

In [None]:
import random
import re

from transformers import TextStreamer

evolution_templates = [multi_context_template, reasoning_template, hypothetical_scenario_template]

# Number of evolution steps to apply
num_evolution_steps = 3


training_dataset=[]


# Function to perform one random evolution step
def evolve_query(original_input, context):
    # Choose a random template from the list (custom selection logic could go here)
    chosen_template = random.choice(evolution_templates)
    # Fill the template with the current context and input
    evolved_prompt = chosen_template(str(context), original_input)

    # Generate the rewritten (evolved) question
    inputs = tokenizer(evolved_prompt, return_tensors="pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    response = model.generate(**inputs, streamer=text_streamer, max_new_tokens=250)
    evolved_question = extract_rewritten_input(response)

    # Generate the expected response for the evolved question
    prompt_for_response = response_template(str(context), evolved_question)
    inputs = tokenizer(prompt_for_response, return_tensors="pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    response = model.generate(**inputs, streamer=text_streamer, max_new_tokens=250)
    final_response = extract_response(response)

    training_dataset.append({"Input": original_input, "Response": final_response, "Context": context})




## Generating training dataset

So far we have laid the groundwork for dataset generation. Now we start the process by loading the required libraries and our base model.
#### Libraries Used:
- `torch`: PyTorch library for tensor operations and deep learning.
- `transformers`: Hugging Face library for transformer models.
- `numpy`: Library for numerical operations.
- `datasets`: Hugging Face library for loading and processing datasets.
- `sklearn.model_selection`: Provides tools for splitting datasets into training and testing sets.
- `unsloth.FastLanguageModel`: Class from the Unsloth library for memory-efficient model loading, fine-tuning, and inference.

By importing these libraries, we set up the environment for model loading, data processing, and training.

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install --no-deps cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
!pip install -U bitsandbytes

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import Trainer, TrainingArguments
from unsloth import FastLanguageModel



### Loading and Configuring the Model

Here we load and configure the Qwen2.5-3B model using the `FastLanguageModel` class with specific settings to optimize performance and memory usage.

#### Parameters:
- **`max_seq_length`**: Set to 2048 to define the maximum sequence length.
- **`dtype`**: Left as `None` for automatic detection; can be set explicitly to `float16` (Tesla T4, V100) or `bfloat16` (Ampere and newer GPUs).
- **`load_in_4bit`**: Enabled to use 4-bit quantization, reducing memory usage.

#### Process:
1. **Model Selection**:
   - Choose from a list of available models, including various versions of Qwen2.5 and other models from the `unsloth` repository.

2. **Load Model**:
   - Use `FastLanguageModel.from_pretrained()` to load the selected model (`unsloth/Qwen2.5-3B`), with specified parameters for sequence length, data type, and quantization.

3. **Configure PEFT (Parameter-Efficient Fine-Tuning)**:
   - Apply PEFT to the model using `FastLanguageModel.get_peft_model()` with specified parameters for target modules, LoRA settings, and gradient checkpointing.

4. **Enable Faster Inference**:
   - Call `FastLanguageModel.for_inference()` to switch the model to Unsloth's faster inference mode (advertised as 2x faster).

By the end of this cell, the Qwen2.5-3B model is loaded and configured for efficient training and inference.
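
For a rough sense of why 4-bit loading matters on a free Colab GPU, the back-of-the-envelope calculation below estimates weight storage at different precisions. It assumes a round 3 billion parameters and counts weights only; quantization metadata, LoRA adapters, activations, and the KV cache are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB for a given parameter count and precision."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 3_000_000_000  # round figure for a "3B" model, not an exact count

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights take roughly 11.2 GiB in `float32`, 5.6 GiB in 16-bit, and 1.4 GiB in 4-bit.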

In [None]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth


model, tokenizer = FastLanguageModel.from_pretrained(
    # Can select any from the below:
    # "unsloth/Qwen2.5-0.5B", "unsloth/Qwen2.5-1.5B", "unsloth/Qwen2.5-3B"
    # "unsloth/Qwen2.5-14B",  "unsloth/Qwen2.5-32B",  "unsloth/Qwen2.5-72B",
    # And also all Instruct versions and Math. Coding versions!
    model_name = "unsloth/Qwen2.5-3B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference


Again, due to time constraints and Colab environment limitations, we skipped query evolution and used the following function for faster dataset generation.

In [None]:
def without_evolve(original_input, context):
      prompt_for_response = response_template(str(context), original_input)
      inputs = tokenizer(prompt_for_response, return_tensors = "pt").to("cuda")
      text_streamer = TextStreamer(tokenizer)
      response = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 250)
      final_response = extract_response(response)

      training_dataset.append({"Input": original_input, "Response": final_response, "Context": context})

### Generating Queries and Saving Dataset

In this cell, we generate queries based on document chunks, process them, and save the resulting dataset.

#### Process:
1. **Generate Queries**:
   - Loop 800 times to generate queries.
   - Randomly select a reference index from `document_chunks`.
   - Retrieve similar contexts using `get_similar_indices(reference_index)`.
   - Generate a prompt using `Queryprompt(contexts)`.
   - Tokenize the prompt and generate a response using the model.
   - Extract questions from the response.
   - If questions are found, randomly select one as `test_question`.

2. **Process Query**:
   - Call `without_evolve(test_question, contexts)` to process the query without evolution (or use `evolve_query` if needed).

3. **Save Dataset**:
   - Save the `training_dataset` to a JSON file named `dataset.json`.

The generated queries, their responses, and the corresponding contexts are saved in a JSON file for use in model training.
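
Since `extract_response` can return `None` when its marker is missing, it may be worth filtering the saved file before training. The sketch below is one possible post-processing step, not part of the pipeline in this notebook: it writes a small stand-in `dataset.json` to a temporary directory, reloads it, and keeps only entries with a non-empty `Input` and `Response`. The helper name `clean_entries` and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def _is_filled(value):
    """True for strings that contain something other than whitespace."""
    return isinstance(value, str) and bool(value.strip())


def clean_entries(entries):
    """Keep entries whose Input and Response are both non-empty strings."""
    return [
        entry
        for entry in entries
        if _is_filled(entry.get("Input")) and _is_filled(entry.get("Response"))
    ]


# Stand-in records (hypothetical) covering the failure cases we want to drop.
sample = [
    {"Input": "What is LoRA?", "Response": "A low-rank adaptation method.", "Context": ["..."]},
    {"Input": "What is FAISS?", "Response": None, "Context": ["..."]},  # extraction failed
    {"Input": "Explain RAG.", "Response": "   ", "Context": ["..."]},   # empty completion
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.json"
    path.write_text(json.dumps(sample, indent=4), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

cleaned = clean_entries(reloaded)
print(f"kept {len(cleaned)} of {len(reloaded)} entries")
```

With the three sample records, only the first one survives the filter.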

In [None]:
import random

for i in range(800):
    # Pick a random chunk and use its nearest neighbours as context
    reference_index = random.randint(0, len(document_chunks) - 1)
    contexts = get_similar_indices(reference_index)
    prompt = Queryprompt(contexts)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    text_streamer = TextStreamer(tokenizer)
    response = model.generate(**inputs, streamer=text_streamer, max_new_tokens=150)
    questions = extract_questions(response)
    if not questions:
        continue  # No parsable question in this response; move on to another chunk
    test_question = random.choice(questions)
    # evolve_query(test_question, contexts)
    without_evolve(test_question, contexts)

    # The JSON file is rewritten after every new entry
    with open('dataset.json', 'w') as json_file:
        json.dump(training_dataset, json_file, indent=4)

