# **1. Introduction**

In [1]:
%%capture
%pip install -q bitsandbytes
%pip install -q transformers
%pip install -q peft
%pip install -q accelerate
%pip install -q trl
%pip install -q torch
%pip install -q langchain pypdf sentence-transformers
%pip install -U langchain-community
%pip install chromadb

## **Load all libraries**

In [2]:
%%capture
import os, torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig, TrainingArguments, pipeline
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
from IPython.display import Markdown, display
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

## The code below configures a large language model (LLM) for inference with quantization techniques for efficiency. Here's a breakdown of what each part does:

**Model Path and Quantization Configuration**

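1. **Model Path:** The `model_path` variable stores the local path of the pre-trained causal language model `google/gemma-2b-it` (a 2B-class instruction-tuned model), downloaded from the Hugging Face Hub with `snapshot_download`. The commented-out `model` path points to a copy of the same model hosted on Kaggle.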

2. **BitsAndBytesConfig:** The `bnbConfig` object defines the configuration for quantization using the BitsAndBytes library. Here are the key arguments:
    * `load_in_4bit (bool, optional)`: This argument enables 4-bit quantization of the weights, reducing their memory usage by approximately fourfold compared to 16-bit weights.
    * `bnb_4bit_quant_type (str, optional)`: This parameter specifies the type of 4-bit quantization to use. Here, it's set to `"nf4"` (4-bit NormalFloat), a data type designed for weights that are roughly normally distributed.
    * `bnb_4bit_compute_dtype (torch.dtype, optional)`: This argument defines the data type used for computation. The 4-bit weights are dequantized to this type for matrix multiplications. Here, it's set to `torch.bfloat16`, a 16-bit format that can improve speed on compatible hardware.
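
To make the "approximately fourfold" figure concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and ignores quantization constants, layers that stay unquantized, and activations. The parameter count is a hypothetical round number, not the exact size of the model used in this notebook.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical round figure for a "2B" model; the real parameter count differs.
NUM_PARAMS = 2_000_000_000

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

The ratio between the 16-bit and 4-bit rows is exactly 4; the real saving is somewhat smaller because of the overheads listed above.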

**Loading Tokenizer and Model with Quantization**

1. **AutoTokenizer:** The `AutoTokenizer.from_pretrained` function loads the tokenizer stored with the pre-trained model at the specified path (`model_path`). The tokenizer only converts text to token IDs and back, so quantization does not affect it and it takes no `quantization_config`.

2. **AutoModelForCausalLM:** Similarly, `AutoModelForCausalLM.from_pretrained` loads the actual LLM from the same path (`model_path`). The `device_map="auto"` argument allows automatic device placement (CPU or GPU), and `quantization_config` ensures the model is loaded with the 4-bit quantization configuration.

**Overall, this code snippet aims to achieve two goals:**

* **Load a pre-trained LLM:** It retrieves a pre-trained causal language model from the specified path.
* **Enable Quantization for Efficiency:** By passing the `BitsAndBytesConfig` during loading, the code loads the model with 4-bit quantized weights for memory reduction and potentially faster inference on compatible hardware. The tokenizer is loaded unchanged.


<h3><strong>Know More about <a href="https://www.kaggle.com/code/lorentzyeung/what-s-4-bit-quantization-how-does-it-help-llama2">4-bit quantization</a></strong></h3>

In [None]:
from huggingface_hub import snapshot_download , login


login(token="hf_") 

model_path = snapshot_download("google/gemma-2b-it")

print(model_path)  # Check where it was downloaded


/root/.cache/huggingface/hub/models--google--gemma-2b-it/snapshots/96988410cbdaeb8d5093d1ebdc5a8fb563e02bad


In [4]:
# import json
# from transformers import AutoTokenizer

# # Save the model (already done)
# model.save_pretrained("./fine_tuned_gemma")

# # Save the tokenizer with a cleaned config
# save_directory = "./fine_tuned_gemma"

# # Extract essential tokenizer config
# tokenizer_config = {
#     "tokenizer_class": tokenizer.__class__.__name__,
#     "model_max_length": tokenizer.model_max_length,
#     "padding_side": tokenizer.padding_side,
#     "truncation_side": tokenizer.truncation_side,
#     "bos_token": tokenizer.bos_token,
#     "eos_token": tokenizer.eos_token,
#     "unk_token": tokenizer.unk_token,
#     "pad_token": tokenizer.pad_token,
#     "additional_special_tokens": tokenizer.additional_special_tokens,
#     "chat_template": tokenizer.chat_template,
#     "name_or_path": tokenizer.name_or_path,
#     "add_bos_token": tokenizer._add_bos_token,
#     "add_eos_token": tokenizer._add_eos_token,
# }

# # Save the cleaned config
# with open(f"{save_directory}/tokenizer_config.json", "w", encoding="utf-8") as f:
#     json.dump(tokenizer_config, f, indent=2, ensure_ascii=False)

# # Save the vocabulary and other necessary files
# tokenizer.save_vocabulary(save_directory)
# # If your tokenizer uses a .model file (e.g., tokenizer.model), copy it
# import shutil
# vocab_file = tokenizer.init_kwargs.get("vocab_file")
# if vocab_file:
#     shutil.copy(vocab_file, f"{save_directory}/tokenizer.model")

# print("Tokenizer saved successfully!")

In [5]:
# model = "/kaggle/input/gemma/transformers/2b-it/2"

bnbConfig = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The tokenizer runs on the CPU and holds no weights, so it needs no device placement.
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map = "auto",
    quantization_config=bnbConfig
)

# **2. Fine Tune Model**

## **Load the dataset**

In [6]:
# data = pd.read_csv("/kaggle/input/dataset-python-question-answer/Dataset_Python_Question_Answer.csv")
# dataset = Dataset.from_pandas(data)
# data.head()
data = pd.read_csv("/kaggle/input/child-friendly-dataset/Child_Friendly_Python_QA.csv")
# data = data.drop(columns=["Answer"], errors='ignore')  # 'ignore' prevents errors if column doesn't exist
# data = data.rename(columns={"Child_Friendly_Answer": "Answer"})
dataset = Dataset.from_pandas(data)

data.head()

| | Question | Answer | Child_Friendly_Answer |
|---|---|---|---|
| 0 | What is the difference between a variable and... | ["Sure, here's the difference between a variab... | , here's the difference between a a magical tr... |
| 1 | What is the difference between a built-in fun... | ["Sure. Here's the difference between built-in... | . here's the difference between built-in a sup... |
| 2 | What is the difference between the `print` fu... | ["Sure. Here's the difference between the two ... | . here's the difference between the two a supe... |
| 3 | What is the difference between an expression ... | ["Sure! Here's the difference between an expre... | ! here's the difference between an a tiny reci... |
| 4 | What is the difference between `True` and `Fa... | ["Sure. Here's the difference between `True` a... | . here's the difference between `true` and `fa... |


## **Define a formatting function for the training examples.**

In [7]:
print(dataset.column_names)

['Question', 'Answer', 'Child_Friendly_Answer']


In [8]:
def formatting_func(example):
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    line = template.format(instruction=example['Question'], response=example['Answer'])
    return line
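
As a quick check of what the trainer will see, the sketch below applies the same template to a single made-up row. The sample question and answer are hypothetical; only the column names match the dataset.

```python
def formatting_func(example):
    # Same template as the notebook cell above.
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    return template.format(instruction=example["Question"], response=example["Answer"])


# Hypothetical row with the same column names as the dataset.
sample = {
    "Question": "What is a variable?",
    "Answer": "A variable is a name that refers to a value.",
}

formatted = formatting_func(sample)
print(formatted)
```

Each training example therefore becomes one block of text with an `Instruction:` part followed by a `Response:` part.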

In [9]:
import os
os.environ["WANDB_DISABLED"] = "true"

## **What is WANDB?**

WANDB (Weights & Biases, also written W&B) is a cloud platform for tracking machine learning experiments. It provides functionalities like:

* **Logging** training metrics, parameters, and visualizations.
* **Version control** for your machine learning experiments.
* **Sweeping** hyperparameters to find the best performing configuration.
* **Collaboration** features to share and discuss experiments with your team.

## **Why disable WANDB?**

There are a few reasons why you might want to disable WANDB:

* **You don't need experiment tracking:** If your code is a simple experiment or you're not interested in tracking the results, disabling WANDB can reduce overhead.
* **Privacy concerns:** WANDB logs your experiment data to the cloud. If your data is sensitive, you might not want to upload it.
* **Troubleshooting errors:**  Sometimes errors can arise from WANDB itself. Disabling it can help isolate the issue.

## **Setting the `WANDB_DISABLED` environment variable**

We set the environment variable `WANDB_DISABLED` to `"true"`. This tells the training code not to initialize WANDB logging, effectively disabling it for the current session.
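
A minimal sketch of the same idea, with a small helper to confirm the flag is set. The helper `wandb_disabled` is hypothetical and only reads the environment variable back. As an alternative that is not used in this notebook, `TrainingArguments` also accepts `report_to="none"` to turn off logging integrations.

```python
import os

# Set the flag before the trainer is created so logging is never started.
os.environ["WANDB_DISABLED"] = "true"


def wandb_disabled() -> bool:
    """Hypothetical helper: report whether the flag holds a truthy value."""
    return os.environ.get("WANDB_DISABLED", "").lower() in {"1", "true", "yes"}


print(wandb_disabled())
```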

## **In summary:**

* WANDB is a useful tool for experiment tracking in machine learning.
* You might disable WANDB if you don't need experiment tracking or for debugging purposes.

In [10]:
lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

### The code above defines a configuration object called `lora_config` for a technique called LoRA (Low-Rank Adaptation). Here's a breakdown of what each parameter does:

**LoRA - Low-Rank Adaptation**

LoRA is a technique used to fine-tune large language models (LLMs) more efficiently. It allows you to adapt pre-trained models to new tasks with minimal memory and computational cost compared to traditional fine-tuning.

**LoraConfig Parameters:**

* **r (int):** The rank of the low-rank update matrices used in LoRA. It controls the trade-off between adapter capacity and cost: a lower `r` means fewer trainable parameters and less memory, but it may limit how much the adapter can learn. Here it is set to 8, which is also the default in PEFT.

* **target_modules (List[str]):** The names of the linear layers inside each Transformer block that receive LoRA adapters. This configuration targets the attention projections and the feed-forward (MLP) projections:
    * `q_proj`: Query projection (attention)
    * `o_proj`: Output projection (attention)
    * `k_proj`: Key projection (attention)
    * `v_proj`: Value projection (attention)
    * `gate_proj`: Gate projection (feed-forward block)
    * `up_proj`: Up projection, which expands the hidden size to the intermediate size (feed-forward block)
    * `down_proj`: Down projection, which maps the intermediate size back to the hidden size (feed-forward block)

By applying LoRA to these projection layers, the model learns task-specific adaptations in small adapter matrices while the original model weights stay frozen.

* **task_type (str, optional):** The type of task being fine-tuned. Here it is set to `"CAUSAL_LM"` (causal language modeling), which tells PEFT what kind of model it is wrapping.

**In summary:**

This configuration defines how LoRA will be applied to a pre-trained model for fine-tuning. It specifies the rank of the decomposition (memory usage) and the target layers within the Transformer architecture where LoRA will be used to adapt the model to a new task.
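
To see why a small rank keeps fine-tuning cheap, the sketch below counts the parameters LoRA adds to one linear layer. For a weight matrix of shape `d_out x d_in`, LoRA trains two matrices, `A` (`r x d_in`) and `B` (`d_out x r`), instead of the full matrix. The layer size used here is a hypothetical example, not read from the model's config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one layer: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen d_out x d_in weight matrix."""
    return d_in * d_out


# Hypothetical square projection layer; real layer shapes vary by module.
D_IN = D_OUT = 2048
R = 8

added = lora_param_count(D_IN, D_OUT, R)
frozen = full_param_count(D_IN, D_OUT)
print(f"LoRA parameters: {added}")
print(f"Frozen parameters: {frozen}")
print(f"Trainable fraction for this layer: {added / frozen:.4%}")
```

Doubling `r` doubles the number of added parameters, which is the memory side of the trade-off described above.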


In [11]:
# tokenizer.model_max_length = 512 

In [12]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=50,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        # dataset_text_field="text"
        # max_seq_length=512,
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
    # dataset_kwargs={"batched": False}
)

Output: progress bars for applying the formatting function, converting to ChatML, applying the chat template, tokenizing, and truncating the train dataset (419 examples each).

### The code above creates an instance of `SFTTrainer` from the TRL library, which is designed for supervised fine-tuning (SFT) tasks. Here's a breakdown of what each part does:

**SFTTrainer for Supervised Fine-Tuning**

This code utilizes `SFTTrainer` to fine-tune a pre-trained model (`model`) on a specific training dataset (`dataset`). It's designed for tasks where you have labeled data and want to adapt the model for a new purpose.

**Key Parameters:**

* **model (PreTrainedModel):** This argument specifies the pre-trained model you want to fine-tune.

* **train_dataset (Dataset):** This argument points to the training dataset you'll use for fine-tuning. The dataset should be formatted appropriately for the task.

* **max_seq_length (int):** This parameter defines the maximum sequence length allowed in the training data. Sequences exceeding this length are truncated. It is not set in the call above, so the trainer's default applies.

* **args (TrainingArguments):** This argument is an instance of `TrainingArguments` that defines various hyperparameters for the training process. Here are some notable arguments within `args`:
    * `per_device_train_batch_size (int)`: Sets the batch size per device (GPU/TPU) during training. Here, it's set to 1, which is a small batch size commonly used with gradient accumulation.
    * `gradient_accumulation_steps (int)`: This parameter allows accumulating gradients over several batches before updating the model weights. Here, it's set to 4, effectively increasing the effective batch size.
    * `warmup_steps (int)`: This defines the number of warmup steps where the learning rate is gradually increased from 0 to its full value. Here, it's set to 2.
    * `max_steps (int)`: This parameter specifies the total number of training steps. Here, it's set to 50, which might be a short training run for fine-tuning depending on your dataset size and task complexity.
    * `learning_rate (float)`: This sets the learning rate for the optimizer. Here, it's set to 2e-4, which is a common starting point for fine-tuning.
    * `fp16 (bool)`: Enables training using 16-bit floating-point precision (mixed precision) for faster training with minimal accuracy loss (if supported by your hardware).
    * `logging_steps (int)`: Defines how often training metrics are logged during training. Here, it's set to 1, logging metrics for every step. 
    * `output_dir (str)`: Specifies the directory where training outputs (model checkpoints, logs, etc.) will be saved. Here, it's set to "outputs".
    * `optim (str)`: Defines the optimizer used for training. Here, it's set to "paged_adamw_8bit", an 8-bit AdamW optimizer from bitsandbytes with paged optimizer states, which lowers GPU memory use during training.

* **peft_config (LoraConfig):** This argument receives the `lora_config` object defined earlier. It provides the configuration for LoRA (Low-Rank Adaptation), which helps fine-tune the model more efficiently.

* **formatting_func (Callable):** This argument (if provided) specifies a custom function for formatting the training data before feeding it to the model. This allows for specific pre-processing steps tailored to your task.

**In essence:**

This code snippet configures and initializes an `SFTTrainer` for fine-tuning a pre-trained model with LoRA for memory efficiency. The training hyperparameters are set within the `TrainingArguments` object. 
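
The batch-size and step settings interact, so a short arithmetic sketch helps. It assumes a single GPU and uses the 419 training examples reported in the trainer's output; both are assumptions about this particular run.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 50
num_devices = 1     # assumption: a single GPU
dataset_size = 419  # training examples reported in the notebook output

# Examples contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples processed over the whole run, and the share of the dataset covered.
examples_seen = effective_batch_size * max_steps
epochs = examples_seen / dataset_size

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen: {examples_seen}")
print(f"Fraction of one epoch: {epochs:.2f}")
```

Under these assumptions the run covers less than half of the dataset once, which is worth keeping in mind when judging the quality of the fine-tuned model.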

In [13]:
trainer.train()

| Step | Training Loss |
|---|---|
| 1 | 5.6048 |
| 2 | 5.5036 |
| 3 | 4.8797 |
| 4 | 4.2643 |
| 5 | 3.8999 |
| 6 | 5.6644 |
| 7 | 4.2097 |
| 8 | 3.7917 |
| 9 | 4.5187 |
| 10 | 3.5101 |


TrainOutput(global_step=50, training_loss=2.8512011623382567, metrics={'train_runtime': 117.0933, 'train_samples_per_second': 1.708, 'train_steps_per_second': 0.427, 'total_flos': 967136896880640.0, 'train_loss': 2.8512011623382567})

# **3. Retrieval-Augmented Generation (RAG)**
**Retrieval-Augmented Generation (RAG)** is a paradigm in language model architecture that integrates both retrieval and generation processes to enhance the model's understanding and response capabilities. In essence, it combines the strengths of retrieval-based models, which excel at accessing and utilizing external knowledge sources, with generative models, which can generate novel and contextually relevant responses.

The primary benefit of RAG in large language models (LLMs) is its ability to leverage external knowledge sources during the generation process. By retrieving relevant information from a predefined knowledge base or corpus, the model can augment its understanding of the input context and produce more accurate and informative responses. This approach not only improves the coherence and relevance of generated text but also enables the model to incorporate real-world knowledge and factual accuracy into its outputs.

RAG aims to achieve several key objectives:

1. **Enhanced Contextual Understanding:** By retrieving relevant information from external sources, RAG can better understand the context of a given prompt or query, leading to more contextually appropriate responses.

2. **Improved Content Quality:** Integrating external knowledge sources allows RAG to generate content that is more accurate, informative, and relevant to the input context, enhancing the overall quality of generated text.

3. **More Factually Grounded Responses:** By drawing on external knowledge bases, RAG can ground its responses in retrieved real-world information, reducing (though not eliminating) the likelihood of generating misleading or incorrect information.

The workflow of RAG typically involves the following steps:

1. **Retrieval:** The model first retrieves relevant information from a knowledge base or corpus based on the input prompt or query. This retrieval process aims to identify key facts, concepts, or contextually relevant information to inform the generation process.

2. **Augmentation:** The retrieved information is then used to augment the model's understanding of the input context. By incorporating this external knowledge, the model can generate more informed and contextually appropriate responses.

3. **Generation:** Finally, the model generates a response based on the augmented understanding of the input context, leveraging both the original prompt and the retrieved information to produce a coherent and relevant output.
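
The three steps above can be sketched with a toy example. This is not how the notebook implements RAG: retrieval here uses simple word overlap instead of embeddings, the knowledge base is three made-up sentences, and `generate` is a placeholder where a real LLM call would go.

```python
import re

# Hypothetical knowledge base standing in for the PDF chunks.
DOCUMENTS = [
    "A list is an ordered, mutable collection of items in Python.",
    "A dictionary maps keys to values and is written with curly braces.",
    "A for loop repeats a block of code once for each item in a sequence.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, k=1):
    """Step 1: rank documents by word overlap with the query (stand-in for embedding search)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


def augment(query, context_docs):
    """Step 2: put the retrieved text and the question into one prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Step 3: placeholder for the LLM call."""
    return f"[model output for a prompt of {len(prompt)} characters]"


query = "How does a for loop work?"
top_docs = retrieve(query, DOCUMENTS)
prompt = augment(query, top_docs)
print(prompt)
print(generate(prompt))
```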

The necessity of using RAG lies in its ability to address the limitations of traditional generative models, such as lack of factual accuracy and coherence in responses. By integrating retrieval-based mechanisms, RAG can access external knowledge sources to enhance its understanding of the input context, leading to more accurate, informative, and contextually relevant generated text. This approach is particularly valuable in tasks requiring a deep understanding of complex topics or access to large knowledge bases, such as question answering, dialogue generation, and content summarization.

## **Load documents for RAG**

In [14]:
# Instantiate a PyPDFDirectoryLoader object with the specified directory path
pdf_loader = PyPDFDirectoryLoader("/kaggle/input/knowledge-base")

# Load PDF documents from the specified directory
pdfs = pdf_loader.load()

In [15]:
# Create the embedding model used to index and query the documents.
embeddings = HuggingFaceEmbeddings(
    # Name of the pre-trained embedding model.
    # "sentence-transformers/all-mpnet-base-v2" is a model from the Sentence Transformers
    # library (not Transformers), trained to produce sentence embeddings that capture
    # semantic similarity.
    model_name="sentence-transformers/all-mpnet-base-v2",

    # Keyword arguments forwarded to the underlying Sentence Transformers model.
    # "device": "cuda" runs it on the GPU; use "cpu" if no GPU is available.
    model_kwargs={"device": "cuda"}
)

In [16]:
# Instantiate a RecursiveCharacterTextSplitter object with specified parameters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split documents into chunks using the RecursiveCharacterTextSplitter
all_splits = text_splitter.split_documents(pdfs)
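
To illustrate what `chunk_size=500` and `chunk_overlap=100` mean, here is a simplified fixed-width splitter. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break at separators such as paragraphs and sentences, which this version does not do. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical 1200-character document.
text = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = split_with_overlap(text, chunk_size=500, chunk_overlap=100)

print([len(chunk) for chunk in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.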

In [17]:
# zip_file_path = '/content/vectordb.zip'  # Replace with ZIP file's path
# extract_dir = '/content/vectordb' 

In [18]:
# import zipfile
# zip_file_path=
# with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
#     zip_ref.extractall(extract_dir)

In [19]:
from langchain.vectorstores import Chroma
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory='./vectordb')
# Older Chroma versions need an explicit .persist() call in notebooks; newer versions persist automatically and may warn here.
vectordb.persist() 

#load chroma db instead of constructing it again (faster)
# vectordb = Chroma(persist_directory="/kaggle/input/vectordb2", embedding_function=embeddings)


In [20]:
# Create a retriever
retriever = vectordb.as_retriever()
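
Conceptually, a vector-store retriever embeds the query and returns the stored chunks whose embeddings are closest to it. The sketch below shows that ranking with cosine similarity on tiny made-up vectors. The vectors, the value of `k`, and the helper names are all hypothetical, and the distance metric the real Chroma collection uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), index) for index, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [index for _, index in scored[:k]]


# Hypothetical 3-dimensional embeddings; real sentence embeddings have hundreds of dimensions.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.9, 0.1, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```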

In [21]:
# !zip -r vectordb.zip /kaggle/working/vectordb

In [22]:
# Build a text-generation pipeline from the already-loaded 4-bit model and its tokenizer.
# The model's dtype and quantization were fixed at load time, so no dtype is passed here.
# max_new_tokens caps each generated continuation at 512 tokens.
# Note: this assignment shadows the imported `pipeline` function; later cells rely on the name.
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
)

## Here's a breakdown of the code snippet below and the parameters used for text generation:

**1. Prompt Creation (`pipeline.tokenizer.apply_chat_template`)**

* **Function:** `apply_chat_template` is a method of the Hugging Face tokenizer (reached here through `pipeline.tokenizer`). It formats the conversation history (`messages`) into a single prompt string using the chat template stored with the model's tokenizer.
* **Arguments:**
    * `messages (List[Dict])`: A list of dictionaries representing the conversation history. Each dictionary has a `"role"` key (for example `"user"` or `"assistant"`) and a `"content"` key (the actual text).
    * `tokenize (bool, optional)`: Set to `False` here, so the method returns the formatted prompt as a string rather than token IDs. Tokenization happens later, inside the pipeline.
    * `add_generation_prompt (bool, optional)`: Set to `True` here, so the template appends the marker that opens the model's turn at the end of the prompt, signalling that the model should respond next. The exact marker depends on the model's chat template.
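
The sketch below is a hand-written approximation of what a Gemma-style chat template produces, to show where the generation prompt lands. It is illustrative only: the function `build_gemma_style_prompt` is hypothetical, and the tokenizer's own `chat_template` remains the authoritative definition of the format.

```python
def build_gemma_style_prompt(messages, add_generation_prompt=True):
    """Hand-written approximation of a Gemma-style chat template (illustrative only)."""
    parts = ["<bos>"]
    for message in messages:
        # Assumption: the assistant role is written as "model" in this format.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is a list?"}]

with_prompt = build_gemma_style_prompt(messages, add_generation_prompt=True)
without_prompt = build_gemma_style_prompt(messages, add_generation_prompt=False)
print(with_prompt)
```

With `add_generation_prompt=False` the string ends after the user's turn, so the model has no cue that it should answer next.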

**2. Text Generation with Pipeline (`pipeline`)**

* **Function:** This line calls the text-generation pipeline itself to generate text that continues the conversation based on the provided prompt.
* **Arguments:**
    * `prompt (str)`: The prompt created in the previous step, containing the formatted conversation history and the generation prompt.
    * `max_new_tokens (int, optional)`: Limits the number of newly generated tokens to 512. Tokens are subword units, so this is not the same as 512 words.
    * `add_special_tokens (bool, optional)`: Set to `True` here, instructing the pipeline to add special tokens (like start-of-sequence tokens) when it tokenizes the prompt.
    * `do_sample (bool, optional)`: Set to `True` here, enabling random sampling during generation, which introduces variation in the response compared to greedy decoding.
    * `temperature (float, optional)`: Controls the randomness of the generated text. The logits are divided by the temperature before sampling: 1.0 leaves the model's distribution unchanged, values below 1.0 such as 0.7 (used here) sharpen it toward high-probability tokens, and values above 1.0 flatten it.
    * `top_k (int, optional)`: Restricts sampling to the k most likely tokens at each step, reducing the chance of unlikely continuations. Here, it's set to 10.
    * `top_p (float, optional)`: Nucleus sampling. It keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p` (0.95 here) and samples only from that set. When combined with `top_k`, both filters apply.
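
The effect of `temperature`, `top_k`, and `top_p` can be sketched on a toy distribution. This is a simplified illustration, not the Transformers implementation: the logits are made up, and the helper names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k tokens, then the smallest prefix whose mass reaches top_p; renormalize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {index: p / kept_total for index, p in kept}


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

plain = softmax(logits, temperature=1.0)
sharpened = softmax(logits, temperature=0.7)
candidates = filter_top_k_top_p(sharpened, top_k=2, top_p=0.95)

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharpened])
print(sorted(candidates))
```

Lowering the temperature moves probability mass toward the most likely token, and the two filters then remove the unlikely tail before a token is sampled from what remains.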

In [23]:
# # question = "What is the difference between a variable and an object"
# system = "You are a friendly tutor who explains programming to 10-year-olds using emojis. Keep explanations simple, use relatable examples, and add a cheerful tone. 😊"
# question =system + "What is the difference between a variable and an object"
# message = [
#     {"role": "user", "content": question},
# ]

# prompt = pipeline.tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)

# outputs = pipeline(
#     prompt,
#     max_new_tokens=512,
#     add_special_tokens=True,
#     do_sample=True,
#     temperature=0.7,
#     top_k=10,
#     top_p=0.95
# )
# Markdown(outputs[0]["generated_text"][len(prompt):])

## The code below combines a large language model (LLM) for text generation with a retrieval system for building a more informative response system. Here's a breakdown of what each part does:

**1. Creating a Custom LLM Pipeline (`gemma_llm`)**

* **HuggingFacePipeline:** LangChain's `HuggingFacePipeline` class wraps the existing Transformers text-generation pipeline (`pipeline`) from the previous steps so that it can be used as an LLM inside LangChain chains.
* **Model Kwargs:** The `model_kwargs` dictionary passes the generation settings explained earlier (`temperature=0.7`, `max_new_tokens=512`, `add_special_tokens=True`, `do_sample=True`, `top_k=10`, and `top_p=0.95`) to the wrapper.

**2. Building a RetrievalQA Object (`qa`)**

* **RetrievalQA.from_chain_type:** This line creates a `RetrievalQA` chain, a LangChain class that combines document retrieval with question answering.
* **Arguments:**
    * `llm`: This argument takes the `gemma_llm` object, which provides the text-generation capabilities.
    * `chain_type (str)`: This argument is set to `"stuff"`, which places ("stuffs") all retrieved document chunks into a single prompt together with the question.
    * `retriever`: This argument takes the `retriever` object created earlier with `vectordb.as_retriever()`, which fetches the chunks most relevant to the user's query from the Chroma vector store.

**In essence:**

This code combines the text-generation capabilities of the LLM with a retrieval system. For each query, the `RetrievalQA` chain uses the `retriever` to fetch relevant chunks, inserts them into the prompt with the `"stuff"` strategy, and passes that prompt to the LLM, which can lead to more comprehensive and informative responses.
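
A rough sketch of the "stuff" strategy follows. The template is illustrative and only modeled on the `Helpful Answer:` marker that a later cell splits on; the exact wording LangChain uses may differ. The helper names, the sample chunks, and the simulated generation are hypothetical.

```python
# Illustrative template; the exact prompt text used by the library may differ.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def stuff_prompt(question, docs):
    """Join every retrieved chunk and place them in one prompt with the question."""
    context = "\n\n".join(docs)
    return TEMPLATE.format(context=context, question=question)


def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    return generated_text.split(marker, 1)[-1].strip()


# Hypothetical retrieved chunks.
docs = ["Chunk one about lists.", "Chunk two about loops."]
prompt = stuff_prompt("What is a list?", docs)

# Simulated output: a text-generation pipeline can return the prompt followed by the completion.
fake_generation = prompt + " A list is an ordered collection."
print(extract_answer(fake_generation))
```

Because the returned text can include the prompt itself, keeping only what follows the marker isolates the model's answer.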


In [24]:
gemma_llm = HuggingFacePipeline(
    pipeline=pipeline,
    model_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 512,
        "add_special_tokens": True,
        "do_sample": True,
        "top_k": 10,
        "top_p": 0.95
    },
)
# Create a RetrievalQA object
qa = RetrievalQA.from_chain_type(
    llm=gemma_llm,  # Pass the text-generation pipeline object
    chain_type="stuff",
    retriever=retriever  # retriever object
)


In [25]:
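def askQuestion(question):
    # The tutor instructions are prepended to the user's question in a single user turn.
    system = "You are a friendly tutor who explains programming to 10-year-olds using emojis. Keep explanations simple, use relatable examples, and add a cheerful tone. 😊"
    question = system + "\n\n" + question
    message = [
        {"role": "user", "content": question},
    ]

    # Format the message with the model's chat template, then run the retrieval chain.
    prompt = pipeline.tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True, truncation=True)
    result = qa.invoke(prompt)
    # Keep only the text after the first answer marker; fall back to the full text if it is missing.
    display(Markdown(result['result'].split('Helpful Answer:', 1)[-1]))
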

In [26]:
ques='''I wrote a Python function to check if a number is prime, but it's not working correctly. Here’s my code:"
    
    def is_prime(n):
        for i in range(2, n):
            if n % i == 0:
                return False
        return True
    
    print(is_prime(1))  # Expected: False
    print(is_prime(2))  # Expected: True
    print(is_prime(4))  # Expected: False
    "It returns True for 1, but 1 is not a prime number. What’s wrong?'''
askQuestion(ques)



Hey there! 👋

Let's take a look at your code and see what's going on. 😊

The function is called `is_prime` and it takes a single argument, `n`.

The function checks if the number `n` is a prime number. It does this by iterating through all the numbers from 2 to `n-1`.

Inside the for loop, it checks if `n` is divisible by any of the numbers in that range. If it finds a divisor, it returns `False`.

However, there's a small issue in the code. It returns `True` for `n` = 1, which is not a prime number.

So, the function is not working correctly for numbers less than 2.

Here's the corrected code:

```python
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

Now, let's see how it works:

1. The function first checks if `n` is less than or equal to 1. If it is, it returns `False` because 1 is not considered a prime number.

2. If `n` is greater than 1 and not divisible by any numbers in the range from 2 to its square root, it means `n` is a prime number.

3. If it finds a divisor during the for loop, it returns `False`.

4. If the for loop completes without finding any divisors, it returns `True`, indicating that `n` is a prime number.

With these changes, the function should work correctly for all positive integers, including prime numbers. 😊

In [27]:
q='''

    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    filtered_numbers = filter(lambda x: x % 2 == 0, numbers)
    print(list(filtered_numbers))  # Expected output: [1, 3, 5, 7, 9]
    Question:
    Identify the bug in the code.
    Fix the bug so that the function correctly returns only the odd numbers.

'''
askQuestion(q)


The bug in the code is that the condition in the lambda function is checking if the number is divisible by 2. However, the code is supposed to return only the odd numbers from the list. To fix this, we need to change the condition to check if the number is odd.

Here's the corrected code:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
filtered_numbers = filter(lambda x: x % 2 != 0, numbers)
print(list(filtered_numbers))  # Expected output: [1, 3, 5, 7, 9]
```

In this corrected code, we use the lambda function to filter the numbers list. The condition now checks if the number is odd (represented by the condition `x % 2 != 0`) and adds the numbers to the `filtered_numbers` list.
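One detail worth making explicit: in Python 3, `filter` returns a lazy iterator rather than a list, which is why the corrected code wraps it in `list(...)` before printing. The sketch below shows this next to an equivalent list comprehension; the variable names `odd_via_filter` and `odd_via_comprehension` are introduced here for illustration only.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# filter() yields matching items lazily; list() materializes them.
odd_via_filter = list(filter(lambda x: x % 2 != 0, numbers))

# Equivalent list comprehension with the same condition.
odd_via_comprehension = [x for x in numbers if x % 2 != 0]

print(odd_via_filter)
print(odd_via_comprehension)
```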

In [28]:

q='''okay so if I have a list and I wanna like take out the middle thing but I don't know how many things there are what do I do?'''
askQuestion(q)

 "Hey there! 👋 Let's say you have a list of things, like a yummy fruit basket. To take out the middle item, we can use a special method called 'middle'. It's like a magician's trick where we move things around a bit.

Here's how it works:

1. First, we use the 'insert' method to add a new item at the end of the list. Think of it like adding a new toy to the middle of the fruit basket.

2. Then, we use the 'remove' method to take the middle item out. It's like taking out a toy from the middle of the basket.

3. If you want to remove the entire middle item, we can use the 'del' statement. It's like a magician's trick where we remove a toy from the middle of the basket.

4. If you want to change the order of the items in the list, we can use the 'swap' method. It's like a magician's trick where we rearrange toys in the basket.

So, to take out the middle item, we use the 'middle' method, and to remove the entire middle item, we use the 'del' statement. It's like a magician's trick where we rearrange the toys in the basket and then remove one of them!"
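For reference, Python lists have no `middle` or `swap` method. A minimal sketch of what the question asks for, assuming the list is non-empty, is to compute the middle index from `len()` and remove it with `pop`. The helper name `remove_middle` and the sample data are hypothetical, and picking index `len(items) // 2` for even-length lists is an assumption about which of the two central items counts as "the middle".

```python
def remove_middle(items):
    """Remove and return the middle element of a non-empty list.

    For an even-length list this takes the upper of the two central
    elements (index len(items) // 2); that tie-break is an assumption.
    """
    if not items:
        raise ValueError("cannot remove the middle of an empty list")
    return items.pop(len(items) // 2)

# Hypothetical sample data; the length does not need to be known in advance.
fruits = ["apple", "banana", "cherry", "date", "elderberry"]
removed = remove_middle(fruits)
print(removed)
print(fruits)
```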

In [29]:
q="okay but can Python make my computer talk?"
askQuestion(q)



Sure, here's a simplified explanation of how Python can make your computer talk:

"Python can make your computer talk by using a special language called **Natural Language Processing (NLP)**. NLP is like a translator that can understand human language and convert it into code.

When you write Python code, you're essentially telling the computer to do something. For example, you could tell Python to:

* Open a new window
* Write a song
* Play a game
* Read a book

When you run the code, Python will use its NLP capabilities to understand what you've written and then carry out the instructions.

So, while Python itself doesn't directly "talk to" your computer, it can be used to control it and make it do things that would otherwise be difficult or impossible for a human to do.

In [30]:
q="why is my turtle not moving I said forward but it's just sitting there??"
askQuestion(q)



Hey there! 👋

Let's work together to understand why your turtle isn't moving. 🤔

First, let's check if the turtle has been initialized properly. 🐢

**Step 1: Checking if the turtle is alive**

Like a friendly shopkeeper checking if the batteries are inserted correctly, make sure the turtle is connected to the power source. 🔌

**Step 2: Checking if the turtle is moving**

Hold down the "spacebar" key. 🎉 Did you hear the turtle move? 🐢

**Step 3: Checking if the turtle is listening**

Is the turtle looking for instructions? 🤔

**Step 4: Checking if the turtle is facing the right direction**

Make sure the turtle is facing the direction you're telling it to move in. 😉

**Step 5: Checking for any errors**

Is the turtle plugged in properly? Is the power source turned on? Are there any errors in the code?

Remember, my friend, troubleshooting takes time and patience. Don't get discouraged if you don't get it right away. Keep trying different things and asking for help from a friendly tutor like me! 😊

In [34]:
output_dir = "./fine_tuned_gemma"

# Save the fine-tuned model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('./fine_tuned_gemma/tokenizer_config.json',
 './fine_tuned_gemma/special_tokens_map.json',
 './fine_tuned_gemma/tokenizer.model',
 './fine_tuned_gemma/added_tokens.json',
 './fine_tuned_gemma/tokenizer.json')

In [32]:
trainer.model.save_pretrained(output_dir)

In [33]:
from peft import PeftModel

# Load the base model (Gemma-2B-it)
base_model_path = model_path  # Already defined as the snapshot download path
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, device_map="auto")

# Load the fine-tuned LoRA model
lora_model = PeftModel.from_pretrained(base_model, output_dir)

# Merge the LoRA adapters with the base model
merged_model = lora_model.merge_and_unload()

# Save the merged model
merged_output_dir = "./merged_gemma"
merged_model.save_pretrained(merged_output_dir)
tokenizer.save_pretrained(merged_output_dir)
print(f"Merged model saved to {merged_output_dir}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Merged model saved to ./merged_gemma
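To make the merge step less opaque, here is a toy sketch of the arithmetic that merging a LoRA adapter into a base weight amounts to, assuming the standard LoRA scaling of `alpha / rank`: the merged weight is `W + (alpha / rank) * (B @ A)`. The helper names `matmul` and `merge_lora` and all numeric values are made up for illustration; this is not PEFT's actual implementation, which operates on tensors per layer.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, same shape as weight
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy example: 2x2 base weight with a rank-1 adapter (values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (rank, in_features)
B = [[0.5], [1.0]]    # shape (out_features, rank)
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is what lets the result be saved and converted as an ordinary model.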


In [34]:
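# Clone llama.cpp and install the Python requirements for its conversion scripts
!git clone https://github.com/ggerganov/llama.cpp.git
%cd llama.cpp
# Note: the Makefile build is deprecated upstream and stops with an error;
# the binaries are built with CMake in a later cell instead
!make
!pip install -r requirements.txt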

Cloning into 'llama.cpp'...
remote: Enumerating objects: 47520, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 47520 (delta 12), reused 5 (delta 5), pack-reused 47492 (from 3)
Receiving objects: 100% (47520/47520), 99.42 MiB | 30.22 MiB/s, done.
Resolving deltas: 100% (34154/34154), done.
/kaggle/working/llama.cpp
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu
Collecting gguf>=0.1.0 (from -r ./requirements/requirements-convert_legacy_llama.txt (line 4))
  Downloading gguf-0.14.0-py3-none-any.whl.metadata (3.7 kB)
Collecting protobuf<5.0.0,>=4.21.0 (from -r ./requirements/requirements-convert

In [35]:
!apt-get update
!apt-get install -y cmake build-essential

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]                           
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease                                              
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]                                
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]                             
Get:7 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [69.9 kB]       
Get:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,381 kB]
Get:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]              
Get:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Get:11 http://archive

In [38]:
!mkdir -p build
%cd build
!cmake ..
!cmake --build . --config Release

/kaggle/working/llama.cpp/build
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- Configuring d

In [44]:

# Install Python requirements (for conversion script)
!pip install -r ../requirements.txt

# Convert the merged Hugging Face model to GGUF
!python ../convert_hf_to_gguf.py ../../merged_gemma --outfile ../../gemma-2b-finetuned.gguf

# Quantize the GGUF file to Q4_K_M
!./bin/llama-quantize ../../gemma-2b-finetuned.gguf ../../gemma-2b-finetuned-q4_k_m.gguf Q4_K_M


Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu
main: build = 5021 (e39e727e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '../../gemma-2b-finetuned.gguf' to '../../gemma-2b-finetuned-q4_k_m.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 29 key-value pairs and 164 tensors from ../../gemma-2b-finetuned.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Merged_Gemma
llama_model_loader: - kv   3:                         ge