![Colegio Bourbaki](./Images/Bourbaki.png)

## Natural Language Processing

In this notebook we will:

1. **Explain** the difference between:
- Retrieval-augmented generation (**RAG**)
- **Fine-tuning** a language model
- Using **both together**

2. **Implement a small RAG pipeline**:
- Use a sentence transformer to embed documents
- Store the embeddings in a vector index
- Retrieve relevant passages
- Use a small open-weights chat model to answer questions from that context

3. **Fine-tune a small open-weights model** on a small question-answer dataset:
- Use LoRA / QLoRA so that training fits on a ~4 GB GPU
- Compare the answers **before and after** fine-tuning

The target hardware is a GPU such as the **NVIDIA GeForce GTX 1650 Ti 4 GB**, so we will:
- Use a small model: `TinyLlama/TinyLlama-1.1B-Chat-v1.0`.
- Load it in **4-bit** whenever possible.
- Keep batch sizes small.


## RAG vs. fine-tuning (conceptual)

### What is RAG (retrieval-augmented generation)?

LLMs have a **knowledge cutoff**: they only know what they saw during pre-training.  
RAG adds an **external knowledge store** (for example, a vector database):

1. The documents (articles, docs, tickets) are **embedded** into vectors.
2. At query time, the system:
   - Embeds the user's question.
   - Retrieves the **most similar documents**.
   - Passes the *question + retrieved context* to the LLM.
3. The model answers *using that context*, without changing its weights.
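
The three steps above can be sketched in plain Python. This is a toy illustration, not the pipeline built later in this notebook: the 3-dimensional vectors stand in for real embeddings, and `toy_docs`, `cosine`, `retrieve`, and the query vector are all made up for the example.

```python
import math

# Hand-made 3-d "embeddings" standing in for the output of a real embedding model.
toy_docs = {
    "Meta open-sourced Llama-based models.": [0.9, 0.1, 0.0],
    "AWS introduced cheaper GPU instances.": [0.1, 0.9, 0.1],
    "Slack rolled out AI summarization.": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=1):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(toy_docs, key=lambda d: cosine(query_vec, toy_docs[d]), reverse=True)
    return ranked[:k]

# Step 2: embed the question (here: a made-up vector), retrieve, build the prompt.
toy_question = "Who open-sourced Llama models?"
toy_question_vec = [0.8, 0.2, 0.1]
toy_context = retrieve(toy_question_vec, k=1)[0]
toy_prompt = f"Context: {toy_context}\nQuestion: {toy_question}\nAnswer:"
# Step 3: `toy_prompt` would now be sent to the LLM; its weights are never updated.
print(toy_prompt)
```

The only component that changes when new documents arrive is `toy_docs`; the model itself is untouched.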

**Advantages:**
- Well suited to **new and frequently changing data** (such as daily news).
- No heavy training required, only embedding + retrieval.
- Safe: the model's weights are not overwritten.

**Disadvantages:**
- Answer quality depends on **retrieval quality** and on the size of the prompt.
- Limited by the **context window**: only a bounded amount of text can be passed in.

---

### What is fine-tuning?

Fine-tuning means **continuing to train** a pre-trained LLM on a **specific task or domain**:

- Example: thousands of question-answer pairs about cloud computing, Kubernetes, fintech, etc.
- The model **updates its weights** to internalize that domain.

**Advantages:**
- The model gets natively better at that domain or style.
- There is no need to always supply a long context: it "knows" more in its weights.

**Disadvantages:**
- **It is costly** (GPU time, training pipeline).
- It needs **good, curated data**.
- The model still has a fixed knowledge cutoff (it will not "see" new articles unless it is retrained).

---

### Using both together

The two approaches can be combined. You can:

- Use GPT-4 / larger models (or any "expert model") to **generate question-answer pairs** from documents.
- **Fine-tune a smaller open-weights model** on those question-answer pairs.
- Keep RAG as well, to inject **very recent documents**.

Result:
- The small model gets better at **jargon and style** thanks to fine-tuning.
- RAG keeps it **up to date** with new documents.

In the rest of this notebook we will implement:

1. A small **RAG pipeline**.
2. A small **LoRA fine-tune**.
3. A quick **comparison**.

In [1]:
# Dependencies
# !pip install -q \
#   torch \
#   transformers \
#   accelerate \
#   bitsandbytes \
#   peft \
#   sentence-transformers \
#   datasets \
#   scikit-learn \
#   faiss-cpu \
#   torchinfo

### Libraries

In [2]:
import numpy as np
import faiss
import json
import os
import sys
import torch

from datasets import Dataset, load_dataset
from pathlib import Path
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from pprint import pprint
from sentence_transformers import SentenceTransformer, util
from sklearn.neighbors import NearestNeighbors
from torch.utils.data import DataLoader
from torchinfo import summary
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

### Configuration

In [3]:
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
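torch.backends.cuda.matmul.fp32_precision = (
    "ieee"  # full-precision FP32 matmuls; legacy equivalent: torch.backends.cuda.matmul.allow_tf32 = False
)
torch.backends.cudnn.conv.fp32_precision = (
    "tf32"  # allow TF32 in cuDNN convolutions; legacy equivalent: torch.backends.cudnn.allow_tf32 = True
)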
torch.cuda.empty_cache()
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False

In [4]:
print("__Python VERSION:", sys.version)
print("__pyTorch VERSION:", torch.__version__)
print(
    "__CUDA VERSION",
)
print("__CUDNN VERSION:", torch.backends.cudnn.version())
print("__Number CUDA Devices:", torch.cuda.device_count())
print("__Devices")
print("Active CUDA Device: GPU", torch.cuda.current_device())
print("Available devices ", torch.cuda.device_count())
print("Current cuda device ", torch.cuda.current_device())

__Python VERSION: 3.12.11 (main, Sep  5 2025, 19:35:43) [GCC 13.3.0]
__pyTorch VERSION: 2.9.0+cu128
__CUDA VERSION
__CUDNN VERSION: 91002
__Number CUDA Devices: 1
__Devices
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0


In [5]:
! nvidia-smi

Wed Nov 19 18:18:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1650 Ti     Off |   00000000:01:00.0 Off |                  N/A |
| N/A   56C    P8              2W /   50W |     177MiB /   4096MiB |     38%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"

Let's start with a small example:

In [7]:
corpus_docs = [
    # 1
    """OpenAI released a new model that improves reasoning on complex code and math problems. 
    The model is optimized for tool use and retrieval-augmented generation pipelines.""",
    # 2
    """Google announced updates to its Vertex AI platform, making it easier to deploy and monitor 
    large language models at enterprise scale.""",
    # 3
    """Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases.""",
    # 4
    """Microsoft integrated generative AI into its Office suite, adding features such as 
    AI-powered summarization, drafting assistance, and automatic meeting notes generation.""",
    # 5
    """Amazon Web Services introduced cheaper GPU instances optimized for inference workloads 
    like chatbots, code assistants, real-time search, and document question-answering.""",
    # 6
    """NVIDIA released new open-source libraries for accelerating transformer inference, 
    offering significant speedups on consumer GPUs like the RTX 4090.""",
    # 7
    """Anthropic published a research paper describing improvements in constitutional AI, 
    focusing on scalable oversight and safer model behavior.""",
    # 8
    """Apple reportedly began testing on-device LLMs for future iPhone models, enabling 
    private AI features such as offline summarization and personal context reasoning.""",
    # 9
    """Hugging Face launched a new inference API tier with higher throughput and native 
    support for vLLM, making it cheaper to serve models like Mistral-7B and Llama-3-8B.""",
    # 10
    """Mistral AI released Mixtral-8x22B, a sparse mixture-of-experts model offering state-of-the-art 
    performance while remaining efficient enough for commercial deployment.""",
    # 11
    """IBM announced a partnership with NASA to fine-tune foundation models on geospatial data 
    to improve climate analysis, wildfire prediction, and satellite imagery classification.""",
    # 12
    """Databricks released DBRX, a 132B-weight mixture-of-experts model trained on curated 
    scientific and enterprise datasets, outperforming models of similar size.""",
    # 13
    """Stability AI introduced Stable Diffusion 3, featuring improved text-image alignment 
    and reduced hallucination in multilingual prompting scenarios.""",
    # 14
    """Snowflake added native vector search capabilities, allowing enterprises to store embeddings 
    and run RAG pipelines directly on their data warehouse.""",
    # 15
    """Cohere launched a secure enterprise-grade embedding model designed for document retrieval, 
    semantic search, and multi-lingual knowledge-base applications.""",
    # 16
    """Red Hat announced AI-enhanced DevOps tooling, including automated deployment validation 
    powered by small specialized LLMs.""",
    # 17
    """Salesforce updated Einstein GPT with better CRM-specific reasoning, including lead scoring, 
    automatic email drafting, and pipeline forecasting.""",
    # 18
    """Dropbox introduced AI-powered universal search across files, documents, PDFs, and images, 
    enabling users to query semantic content instantly.""",
    # 19
    """Slack rolled out AI summarization for channels and threads, automatically generating 
    daily digests and extracting key decisions from long discussions.""",
    # 20
    """Zoom added real-time conversation translation and AI-based meeting action items, 
    powered by a fine-tuned multilingual transformer model.""",
]

corpus_titles = [
    "OpenAI releases new reasoning model",
    "Google updates Vertex AI",
    "Meta open-sources Llama models",
    "Microsoft adds AI to Office",
    "AWS introduces cheaper GPU instances",
    "NVIDIA releases transformer acceleration libs",
    "Anthropic improves constitutional AI",
    "Apple tests on-device LLMs",
    "Hugging Face launches new inference tier",
    "Mistral releases Mixtral-8x22B",
    "IBM partners with NASA on geospatial AI",
    "Databricks releases DBRX",
    "Stability AI releases SD3",
    "Snowflake adds vector search",
    "Cohere launches enterprise embedding model",
    "Red Hat adds AI DevOps tools",
    "Salesforce updates Einstein GPT",
    "Dropbox adds AI universal search",
    "Slack adds AI summaries",
    "Zoom adds real-time AI translation",
]

In [8]:
len(corpus_docs)

20

We embed the documents:

In [9]:
# small and fast embedding model (open weights)
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

In [10]:
embedder = SentenceTransformer(model_name_or_path=embedding_model_name, device=device)

In [11]:
# Compute embeddings
doc_embeddings = embedder.encode(
    sentences=corpus_docs, 
    convert_to_numpy=True,
    show_progress_bar=True,
    device=device,
    normalize_embeddings=True
)

In [12]:
doc_embeddings, doc_embeddings.shape

(array([[-0.05694531, -0.01272646, -0.06879752, ...,  0.05730383,
          0.04768759,  0.00835872],
        [-0.07661536, -0.08271152,  0.03887794, ..., -0.00933229,
          0.05409345, -0.03391702],
        [-0.03868605, -0.02878194, -0.02075998, ..., -0.04762491,
         -0.01284006,  0.03692014],
        ...,
        [-0.02095782, -0.03330291, -0.04754037, ...,  0.04166466,
          0.05232637,  0.02517612],
        [-0.00392685, -0.02862822, -0.01042572, ...,  0.05525399,
         -0.05279417, -0.02444052],
        [-0.08633485, -0.04698378,  0.00952209, ...,  0.04447945,
         -0.1053777 , -0.02525596]], dtype=float32),
 (20, 384))

We now build a FAISS index for efficient nearest-neighbor search, along with a scikit-learn `NearestNeighbors` index.

In [13]:
# Sklearn
nn_index = NearestNeighbors(n_neighbors=3, metric="cosine")
nn_index.fit(doc_embeddings)

# Faiss
faiss_emb = np.array(doc_embeddings).astype("float32")
faiss_index = faiss.IndexFlatIP(faiss_emb.shape[1])  # cosine similarity via inner product
faiss.normalize_L2(faiss_emb)
faiss_index.add(faiss_emb)
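
Both indexes should rank documents the same way because the embeddings are L2-normalized: for unit vectors the inner product equals cosine similarity, and a cosine distance is one minus that value. Here is a minimal check in plain Python with made-up vectors; the helper names are illustrative and are not part of FAISS or scikit-learn.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a_raw, b_raw = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a_raw), l2_normalize(b_raw)

cos_raw = cosine_similarity(a_raw, b_raw)   # cosine on the raw vectors
ip_unit = inner_product(a_unit, b_unit)     # inner product on the unit vectors
cosine_distance = 1.0 - ip_unit             # what a cosine-distance KNN reports

print(cos_raw, ip_unit, cosine_distance)
```

Since `embedder.encode(..., normalize_embeddings=True)` already returns unit vectors, the extra `faiss.normalize_L2` call is a harmless safeguard rather than a requirement.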

### RAG

![RAG](./Images/RAG.png)

Source: P. Iusztin & M. Labonne - LLM Engineer's Handbook - Chapter 4 - RAG Feature Pipeline

Let's write a function that returns both the raw output (prompt + generated text) and the net output (generated text only):

In [14]:
def generate_answer(model, tokenizer, prompt, max_length, max_new_tokens):
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    ).to(device)

    gen_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample from the distribution instead of taking the argmax;
        # required for temperature / top_p to have any effect
        temperature=0.3,  # divides the logits before the softmax (lower = sharper)
        top_p=0.9,  # nucleus sampling: sample only among the most probable tokens
        # whose cumulative probability reaches 90% (variable-size set)
        # top_k=50,        # OPTIONAL: restrict to the k most probable tokens
        pad_token_id=tokenizer.pad_token_id,
    )

    with torch.no_grad():
        output = model.generate(**inputs, generation_config=gen_config)

    # Full decoded output (prompt + generated)
    full_decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    # Only the continuation (generated tokens after the prompt)
    generated_ids = output[0][inputs["input_ids"].shape[1] :]
    generated_decoded = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return full_decoded, generated_decoded

In text generation, these parameters control how much **randomness**, **creativity**, or **determinism** the model shows.

**1. do_sample**
Whether the model samples at random instead of always choosing the most probable token.

`do_sample=False` → Deterministic decoding
- With a single beam, the model always picks the highest-probability token (argmax): *greedy decoding*.  
- With `num_beams > 1` this becomes *beam search*, which is still deterministic.  
- The output is always the same for a given input.

`do_sample=True` → Sampling-based decoding
- The model does **not** always take the most probable token.  
- It samples at random from the probability distribution (softmax).  
- This allows creativity and variation.
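
The difference between the two modes can be illustrated on a single decoding step. The sketch below uses a hypothetical next-token distribution and the standard library only; it mimics the idea, not the internals of `model.generate`.

```python
import random

# Hypothetical next-token distribution (probabilities sum to 1).
next_token_probs = {"Meta": 0.6, "Google": 0.3, "OpenAI": 0.1}

def greedy_pick(probs):
    """do_sample=False with a single beam: always take the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """do_sample=True: draw a token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only to make the demo reproducible
greedy_tokens = {greedy_pick(next_token_probs) for _ in range(100)}
sampled_tokens = {sample_pick(next_token_probs, rng) for _ in range(100)}

print("greedy always yields:", greedy_tokens)
print("sampling yields:", sampled_tokens)
```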


**2. Temperature**
Temperature controls how "flat" or "peaked" the probability distribution is.

Practical effects:
- **Low temperature (below ~0.5):**  
  More deterministic, formal, predictable text.
- **Medium temperature (0.7 – 1.0):**  
  A good balance between coherence and creativity.
- **High temperature (≥1.2):**  
  Very creative and unpredictable text.

Intuitively:
"Higher temperature = more freedom in choosing words."

Mathematically:
The temperature $T$ is applied by scaling the model's logits:

$P(w_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$

where:
- $z_i$ = logit of token $i$  
- $T$ = temperature, with $T > 0$; as $T \to 0$ the distribution collapses onto the most probable token
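
To see the formula at work, here is a small sketch that applies it to three made-up logits at several temperatures. The function name and the logits are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T; T must be strictly positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures push probability mass toward the top token, while higher ones spread it across the alternatives.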


**3. Top-p (Nucleus Sampling)**
Top-p controls randomness by sampling only from the most probable tokens whose **cumulative probability** reaches a threshold $p$.

Examples:
- **p = 0.5** → very conservative  
- **p = 0.9** → balanced (a common choice)  
- **p = 0.95–0.99** → more creative  

Algorithm:
1. Sort the tokens by probability:  
   $P(w_1) \ge P(w_2) \ge \dots \ge P(w_n)$
2. Build the smallest set $S$ of top-ranked tokens such that:  
   $\sum_{w_i \in S} P(w_i) \ge p$
3. Sample **only within $S$**, after renormalizing:
   $w \sim \text{Multinomial}\big(P(w_i \mid w_i \in S)\big)$

Key property:
The size of the set **varies dynamically** with the shape of the distribution, which makes it more flexible than top-k.
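
The algorithm can be sketched as follows. `top_p_filter` is a hypothetical helper written for this illustration, and the two distributions are invented to contrast a peaked case with a flat one.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative probability
    reaches p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

peaked = {"a": 0.90, "b": 0.05, "c": 0.03, "d": 0.02}
flat = {"a": 0.30, "b": 0.25, "c": 0.25, "d": 0.20}

nucleus_peaked = top_p_filter(peaked, p=0.9)
nucleus_flat = top_p_filter(flat, p=0.9)

print("peaked distribution keeps:", list(nucleus_peaked))
print("flat distribution keeps:", list(nucleus_flat))
```

With the same threshold, the peaked distribution keeps a single candidate while the flat one keeps all four, which is the variable-size behavior described above.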


**4. Top-k**
Top-k restricts the choice to the **k most probable tokens** and discards the rest.

Examples:
- **Small k (10):** more control and coherence.  
- **Large k (50–100):** more diversity.  
- **k infinite / disabled:** all tokens are used.

Mathematically:
Top-k acts **before the softmax**:

1. Select the $k$ highest logits.  
2. Discard the others.  
3. Apply the softmax over those $k$ only:

$$
P(w_i)=
\begin{cases}
\frac{e^{z_i}}{\sum_{j \in \text{top-k}} e^{z_j}} & i \in \text{top-k} \\
0 & i \notin \text{top-k}
\end{cases}
$$
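
A direct transcription of the piecewise formula, as a sketch on invented logits (`top_k_softmax` is an illustrative name, not a library function):

```python
import math

def top_k_softmax(logits, k):
    """Softmax restricted to the k largest logits; all other tokens get probability 0."""
    top_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top_indices)  # for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in top_indices}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

example_logits = [3.0, 1.0, 2.0, -1.0]
probs_k2 = top_k_softmax(example_logits, k=2)
print([round(p, 3) for p in probs_k2])
```

Unlike top-p, the number of surviving candidates is always `k`, whatever the shape of the distribution.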


**Comparison summary**

| Parameter | What it controls | Type of limit |
|-----------|------------------|---------------|
| **do_sample** | Whether sampling is used | boolean |
| **temperature** | Smoothness of the distribution | continuous scale |
| **top-k** | Fixed number of candidates | fixed size |
| **top-p** | Cumulative probability | variable size |

**Top-p is more flexible**, because the candidate set adapts to the shape of the distribution.  
**Top-k is simpler and more predictable**, but rigid.

Next, a function that retrieves context:

In [15]:
def retrieve_context(question, k, backend):
    """
    Retrieve top-k most similar documents using a selected backend:

        - 'sklearn' : exact brute-force KNN using Scikit-Learn (cosine distance)
        - 'faiss'   : FAISS IndexFlatIP (exact brute-force inner-product search)
        - 'st'      : SentenceTransformers' own cosine-similarity search
    """

    # Embed query → normalized vector (good for cosine similarity / inner product)
    q_emb = embedder.encode(
        [question], convert_to_numpy=True, normalize_embeddings=True
    )

    # ---------------------------------------------------------
    # 1) Scikit-Learn NearestNeighbors (exact search)
    # ---------------------------------------------------------
    if backend == "sklearn":
        # Brute-force cosine distance via sklearn's KNN search.
        # Works well for small / medium corpora (<100k).
        distances, indices = nn_index.kneighbors(q_emb, n_neighbors=k)
        return [corpus_docs[i] for i in indices[0]]

    # ---------------------------------------------------------
    # 2) FAISS (exact inner-product search with a flat index)
    # ---------------------------------------------------------
    elif backend == "faiss":
        # FAISS expects float32 arrays.
        q = q_emb.astype("float32")
        # The embedder already normalized q; normalizing again is harmless and
        # guarantees that inner product equals cosine similarity.
        faiss.normalize_L2(q)
        # IndexFlatIP scans every vector, so the result is exact; approximate
        # (ANN) search would require a different index type.
        distances, indices = faiss_index.search(q, k)
        return [corpus_docs[i] for i in indices[0]]

    # ---------------------------------------------------------
    # 3) SentenceTransformers semantic_search (exact cosine)
    # ---------------------------------------------------------
    elif backend == "st":
        # Computes cosine similarity against all doc embeddings (brute force).
        hits = util.semantic_search(q_emb, doc_embeddings, top_k=k)[0]
        return [corpus_docs[hit["corpus_id"]] for hit in hits]

    else:
        raise ValueError(f"Unknown retrieval backend: {backend}")

We define two functions: one that builds the prompt and another that generates the RAG system's answer.

In [16]:
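def build_prompt(question, tokenizer, contexts):
    # Only the top-ranked passage goes into the prompt; the other retrieved
    # passages are still returned by rag_answer for inspection.
    context_text = contexts[0]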
    prompt = f"""
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer ONE short sentence. Do NOT repeat context or question
    \n  Question: {question}
    \n  Context: {context_text}
    \n Answer:
    """
    count = len(tokenizer(prompt, return_tensors="pt")["input_ids"][0])
    return prompt, count

In [17]:
def rag_answer(model, tokenizer, question, k, max_length, max_new_tokens, backend):
    """
    Full RAG flow:
    - Retrieve similar docs
    - Build a plain-text prompt with the top-ranked context
    - Generate answer with the LLM
    """
    contexts = retrieve_context(question, k, backend)
    prompt, token_count = build_prompt(question, tokenizer, contexts)
    full_output, gen_output = generate_answer(model, tokenizer, prompt, max_length, max_new_tokens)
    return prompt, full_output, gen_output, contexts, token_count
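
`build_prompt` places only `contexts[0]` in the prompt, even though `retrieve_context` returns `k` passages. One possible variant, shown only as a sketch, joins several retrieved passages under a character budget. `build_prompt_multi` and `max_context_chars` are hypothetical names that do not appear elsewhere in this notebook, and the budget value is arbitrary.

```python
def build_prompt_multi(question, contexts, max_context_chars=1200):
    """Join retrieved passages (best first) until a character budget is reached."""
    selected, used = [], 0
    for i, passage in enumerate(contexts, start=1):
        entry = f"[{i}] {' '.join(passage.split())}"  # collapse newlines and indentation
        if selected and used + len(entry) > max_context_chars:
            break  # the best passage is always kept, later ones only if they fit
        selected.append(entry)
        used += len(entry)
    context_text = "\n".join(selected)
    return (
        "You are a helpful assistant specialized in technology news.\n"
        "Use ONLY the context below to answer the user question.\n"
        "If the answer is not in the context, say I don't know.\n"
        f"Context:\n{context_text}\n"
        f"Question: {question}\n"
        "Answer:"
    )

example_prompt = build_prompt_multi(
    "Which company open-sourced Llama-based models?",
    [
        "Meta open-sourced a set of\n    Llama-based models.",
        "Apple began testing on-device LLMs.",
    ],
)
print(example_prompt)
```

More passages mean more prompt tokens. Because `generate_answer` tokenizes with `truncation=True` and a fixed `max_length`, a prompt that grows past that limit would lose its tail, including the final `Answer:` cue, so the budget and `max_length` would need to be chosen together.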

We choose a model:

In [18]:
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Model link: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

Related link: https://codingscape.com/blog/llms-with-largest-context-windows

And a tokenizer:

In [19]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [20]:
tokenizer.vocab, tokenizer.vocab_size

({'▁Ott': 13476,
  '▁clust': 16993,
  'chem': 14969,
  '▁`.': 5050,
  'φ': 30237,
  'remove': 5992,
  '══': 13648,
  '▁Fragment': 19063,
  'ведения': 25321,
  '▁ie': 19282,
  '▁vars': 24987,
  '▁bi': 4768,
  '▁приз': 25660,
  'km': 8848,
  'amentos': 26376,
  '▁already': 2307,
  'half': 24498,
  '<0xBE>': 193,
  'Float': 11031,
  '▁spread': 9677,
  '▁flow': 4972,
  'pub': 5467,
  'LES': 17101,
  'istic': 4695,
  '▁dressed': 27121,
  '▁R': 390,
  'System': 3924,
  'anging': 9776,
  'joint': 12090,
  'евич': 13177,
  '▁etwas': 23452,
  '▁Raz': 24961,
  '▁Gabriel': 18672,
  '▁meets': 28103,
  'Async': 8123,
  'zet': 4975,
  'plementation': 14607,
  '←': 30245,
  'ським': 29796,
  'handler': 13789,
  '▁entities': 16212,
  'otta': 13536,
  '▁charset': 17425,
  'spre': 17703,
  'Mon': 7185,
  'isti': 14194,
  '▁брига': 26672,
  '▁Д': 1453,
  'zeti': 21047,
  '▁storia': 19097,
  '▁ros': 14652,
  'anna': 9713,
  "%'": 29001,
  '▁город': 12816,
  '▁SC': 12314,
  'variant': 19365,
  '▁made': 175

### Quantization

To make the workflow more efficient, we apply quantization.

![Quant1](./Images/Quant1.png)

![Quant2](./Images/Quant2.png)

Source: P. Iusztin & M. Labonne - LLM Engineer's Handbook - Chapter 8 - Inference Optimization

In [21]:
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

We define the model:

In [22]:
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

In [23]:
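# device_map="auto" has already placed the quantized model on the GPU; calling
# .to() on a bitsandbytes 4-bit model is redundant and can raise an error in
# some library versions, so we only switch to eval mode.
base_model.eval()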

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm(

In [24]:
summary(base_model)

Layer (type:depth-idx)                             Param #
LlamaForCausalLM                                   --
├─LlamaModel: 1-1                                  --
│    └─Embedding: 2-1                              65,536,000
│    └─ModuleList: 2-2                             --
│    │    └─LlamaDecoderLayer: 3-1                 22,024,192
│    │    └─LlamaDecoderLayer: 3-2                 22,024,192
│    │    └─LlamaDecoderLayer: 3-3                 22,024,192
│    │    └─LlamaDecoderLayer: 3-4                 22,024,192
│    │    └─LlamaDecoderLayer: 3-5                 22,024,192
│    │    └─LlamaDecoderLayer: 3-6                 22,024,192
│    │    └─LlamaDecoderLayer: 3-7                 22,024,192
│    │    └─LlamaDecoderLayer: 3-8                 22,024,192
│    │    └─LlamaDecoderLayer: 3-9                 22,024,192
│    │    └─LlamaDecoderLayer: 3-10                22,024,192
│    │    └─LlamaDecoderLayer: 3-11                22,024,192
│    │    └─LlamaDecoderLayer: 3-12
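
The per-layer count of 22,024,192 reported by `summary` is about half of what the `Linear4bit` shapes imply, presumably because two 4-bit weights are packed into each stored byte and the summary counts stored elements. The arithmetic below reproduces that number from the shapes printed above and gives a rough estimate of weight memory. It ignores quantization constants, activations, and the KV cache, and it assumes that the embedding and an untied output layer of the same size remain in 16-bit; that assumption is not confirmed by the truncated printout.

```python
# Shapes (in_features, out_features) of the Linear4bit modules in one decoder
# layer, copied from the model printout above.
linear_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 5632),
    "up_proj": (2048, 5632),
    "down_proj": (5632, 2048),
}
num_layers = 22
norm_params_per_layer = 2 * 2048  # input_layernorm + post_attention_layernorm

linear_params = sum(i * o for i, o in linear_shapes.values())
packed_elements = linear_params // 2 + norm_params_per_layer  # two 4-bit weights per byte

# Rough weight memory.
embedding_params = 32000 * 2048
four_bit_bytes = num_layers * linear_params * 0.5
fp16_bytes = num_layers * linear_params * 2
# Assumption: embedding + untied output layer of the same size, both in 16-bit.
unquantized_bytes = 2 * embedding_params * 2

print("linear params per layer:", linear_params)
print("stored elements per layer:", packed_elements)
print(f"decoder weights in 4-bit: {four_bit_bytes / 2**20:.0f} MiB")
print(f"decoder weights in fp16:  {fp16_bytes / 2**20:.0f} MiB")
print(f"embedding + output layer in fp16: {unquantized_bytes / 2**20:.0f} MiB")
```

Under these assumptions the quantized decoder weights take roughly a quarter of their 16-bit size, which is what leaves headroom on a 4 GB card.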

Let's try the RAG system:

In [25]:
question = "Which company open-sourced Llama-based models and for what purpose?"

In [26]:
prompt, raw_answer, answer, ctx, token_count = rag_answer(
    base_model, tokenizer, question, 3, 256, 128, "faiss"
)

In [27]:
print("\nPrompt:", prompt)


Prompt: 
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer ONE short sentence. Do NOT repeat context or question
    
  Question: Which company open-sourced Llama-based models and for what purpose?
    
  Context: Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases.
    
 Answer:
    


In [28]:
print("Prompt token count:", token_count)

Prompt token count: 140


In [29]:
print("Retrieved Contexts:\n")
for i, c in enumerate(ctx, 1):
    print(f"--- Context {i} ---")
    print(c.strip(), "\n")

Retrieved Contexts:

--- Context 1 ---
Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases. 

--- Context 2 ---
Hugging Face launched a new inference API tier with higher throughput and native 
    support for vLLM, making it cheaper to serve models like Mistral-7B and Llama-3-8B. 

--- Context 3 ---
Apple reportedly began testing on-device LLMs for future iPhone models, enabling 
    private AI features such as offline summarization and personal context reasoning. 



In [30]:
print("Raw RAG Answer:", raw_answer)

Raw RAG Answer: 
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer ONE short sentence. Do NOT repeat context or question
    
  Question: Which company open-sourced Llama-based models and for what purpose?
    
  Context: Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases.
    
 Answer:
     Meta's Llama-based models are open-sourced for the purpose of enabling researchers and companies to fine-tune them for their own use cases.
     The models are based on Llama, a popular open-source machine learning library.
     Meta's Llama-based models are available for research and development purposes.


In [31]:
print('RAG Answer:\n', answer)

RAG Answer:
 Meta's Llama-based models are open-sourced for the purpose of enabling researchers and companies to fine-tune them for their own use cases.
     The models are based on Llama, a popular open-source machine learning library.
     Meta's Llama-based models are available for research and development purposes.


Now something interesting: we pass the bare question to the model, with no instructions and no context.

In [32]:
question

'Which company open-sourced Llama-based models and for what purpose?'

In [33]:
raw_answer, answer = generate_answer(base_model, tokenizer, question, 256, 128)

In [34]:
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 


In [35]:
answer

''

Let's reuse the previous prompt (which already contains the retrieved context):

In [36]:
prompt

"\n    You are a helpful assistant specialized in technology news.\n    Use ONLY the context below to answer the user question.\n    If the answer is not in the context, say I don't know.\n    Answer ONE short sentence. Do NOT repeat context or question\n    \n  Question: Which company open-sourced Llama-based models and for what purpose?\n    \n  Context: Meta open-sourced a set of Llama-based models with billions of parameters, \n    enabling researchers and companies to fine-tune them for their own use cases.\n    \n Answer:\n    "

In [37]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 Meta open-sourced Llama-based models for research and development.
     The models are used for various applications, including natural language processing, 
     computer vision, and machine learning.
     The open-sourcing of these models is a significant step towards making them more accessible to the wider community.
     The models are available for researchers and companies to fine-tune for their own use cases.
     Meta's commitment to open-sourcing its technology is a testament to its commitment to innovation and collaboration.


We create a new prompt that contains instructions, the question, and a slot for the answer:

In [38]:
prompt = f"""
ONLY answer the question below. Do NOT repeat the question below in the answer.
Question: {question}
Answer:
"""

In [39]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 Llama-based models are open-sourced by Google. They are used for image recognition and object detection tasks. The company uses these models for various applications, including autonomous vehicles, medical imaging, and security.


Another interesting case: the question followed only by a colon.

In [40]:
prompt = f"""{question}:"""

In [41]:
prompt

'Which company open-sourced Llama-based models and for what purpose?:'

In [42]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 Llama is a machine learning library for Python, which is open-sourced by its author, <|assistant|> (https://github.com/johannesbuchner/llama). The library is designed to be easy to use and powerful, and it provides a wide range of tools for building and deploying machine learning models.

The main purpose of Llama is to provide a powerful and flexible framework for building and deploying machine learning models. It is designed to be easy to use, with a focus on simplicity and ease of use. Llama provides a wide range of tools for


Now let's see how to fine-tune, using a toy example:

In [43]:
with open("Data/qa_data.json", "r") as file:
    qa_pairs = json.load(file)

In [44]:
qa_pairs

[{'instruction': 'Which company released a new reasoning model optimized for tool use and RAG?',
  'input': '',
  'output': 'OpenAI released a new model optimized for tool use and retrieval-augmented generation.'},
 {'instruction': 'What did Google update to make it easier to deploy large language models?',
  'input': '',
  'output': 'Google updated its Vertex AI platform to make it easier to deploy and monitor large language models.'},
 {'instruction': 'Which company open-sourced Llama-based models and why is this important?',
  'input': '',
  'output': 'Meta open-sourced Llama-based models, enabling researchers and companies to fine-tune them for their own use cases.'},
 {'instruction': 'What AI features did Microsoft add to Office?',
  'input': '',
  'output': 'Microsoft added generative AI features to Office, including AI-powered summarization, drafting, and meeting notes.'},
 {'instruction': 'What did AWS introduce for inference workloads?',
  'input': '',
  'output': 'AWS introdu

In [45]:
dataset = Dataset.from_list(qa_pairs)
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 20
})

### Supervised FT

![ft1](./Images/FT1.png)

![ft2](./Images/FT2.png)

![ft3](./Images/FT3.png)

Source: P. Iusztin & M. Labonne - LLM Engineer's Handbook - Chapter 5 - Supervised Fine Tuning

In [46]:
def format_example(example):
    # Alpaca-style formatting
    if example["input"]:
        return f"""Below is an instruction and an input. Write a helpful answer.

### Instruction:
{example["instruction"]}

### Input:
{example["input"]}

### Response:
{example["output"]}
"""
    else:
        return f"""Below is an instruction. Write a helpful answer.

### Instruction:
{example["instruction"]}

### Response:
{example["output"]}
"""


def tokenize_function(example):
    text = format_example(example)
    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=256,
        padding="max_length",
    )
    # For causal LM, labels = input_ids (the model shifts them internally)
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
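
Because every example is padded to `max_length`, copying `input_ids` into `labels` means the padded positions also contribute to the loss. A common alternative is to set the label to `-100` wherever `attention_mask` is 0, since `-100` is the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_padding_labels` is a hypothetical helper, not part of the notebook. It relies on the attention mask rather than on the pad token id because, judging by the trailing `2`s in the tokenized example printed in this notebook, the pad token appears to coincide with the end-of-sequence token.

```python
IGNORE_INDEX = -100  # label value skipped by torch.nn.CrossEntropyLoss by default


def mask_padding_labels(input_ids, attention_mask):
    """Return labels equal to input_ids, with padded positions set to IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy sequence: four real tokens followed by three padding tokens (id 2).
input_ids = [1, 13866, 338, 385, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Inside `tokenize_function`, the same transformation could be applied to `tokenized["input_ids"]` and `tokenized["attention_mask"]` in place of the plain `.copy()`.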

In [47]:
tokenized_dataset = dataset.map(tokenize_function, batched=False)

# Remove the original text columns
tokenized_dataset = tokenized_dataset.remove_columns(["instruction", "input", "output"])

# Set format
tokenized_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)


In [48]:
tokenized_dataset[0]

{'input_ids': tensor([    1, 13866,   338,   385, 15278, 29889, 14350,   263,  8444,  1234,
         29889,    13,    13,  2277, 29937,  2799,  4080, 29901,    13,  8809,
           436,  5001,  5492,   263,   716, 24481,  1904, 27545,   363,  5780,
           671,   322,   390, 10051, 29973,    13,    13,  2277, 29937, 13291,
         29901,    13,  6585, 23869,  5492,   263,   716,  1904, 27545,   363,
          5780,   671,   322,  5663, 16837, 29899,  2987,   358,   287, 12623,
         29889,    13,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,   

In [49]:
train_loader = DataLoader(
    tokenized_dataset,
    batch_size=2,
    shuffle=True,
)

We define a new model to fine-tune:

In [50]:
ft_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, quantization_config=bnb_config, device_map="auto", use_cache=False
)

In [51]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "o_proj",
    ],  # may need to adjust per model
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

In [52]:
# Prepare model for k-bit training (LoRA on top of 4-bit base)
ft_model.to(device).train()
ft_model = prepare_model_for_kbit_training(ft_model)
ft_model = get_peft_model(ft_model, lora_config)
ft_model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
ft_model.enable_input_require_grads()
ft_model.print_trainable_parameters()

trainable params: 2,252,800 || all params: 1,102,301,184 || trainable%: 0.2044
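
The reported number of trainable parameters can be reproduced by hand. A LoRA-wrapped linear layer with input size `d_in` and output size `d_out` adds two matrices of shapes `r × d_in` and `d_out × r`, that is, `r · (d_in + d_out)` parameters. The sketch below assumes the published TinyLlama-1.1B configuration (22 decoder layers, hidden size 2048, and grouped-query attention with 4 key/value heads of dimension 64, so `k_proj` and `v_proj` project to 256). Those dimensions are not printed in this notebook and are taken as assumptions here.

```python
# Assumed TinyLlama-1.1B dimensions (not shown in the notebook).
HIDDEN = 2048        # model hidden size
KV_DIM = 4 * 64      # 4 key/value heads x head dimension 64
NUM_LAYERS = 22      # decoder layers

# (d_in, d_out) of each attention projection targeted by LoraConfig.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(r):
    """Parameters added by LoRA of rank r: r * (d_in + d_out) per target layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * NUM_LAYERS


print(lora_param_count(8))
```

Under these assumptions, `r=8` gives 2,252,800, which matches the output of `print_trainable_parameters()`. The count is linear in the rank, so doubling `r` doubles it.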


In [53]:
summary(ft_model)

Layer (type:depth-idx)                                            Param #
PeftModelForCausalLM                                              --
├─LoraModel: 1-1                                                  --
│    └─LlamaForCausalLM: 2-1                                      --
│    │    └─LlamaModel: 3-1                                       552,323,072
│    │    └─Linear: 3-2                                           (65,536,000)
Total params: 617,859,072
Trainable params: 2,252,800
Non-trainable params: 615,606,272

In [54]:
optimizer = torch.optim.AdamW(ft_model.parameters(), lr=1e-3, amsgrad=True, weight_decay=0.01)

In [55]:
num_epochs = 2
for epoch in range(num_epochs):
    total_loss = 0.0

    for step, batch in enumerate(train_loader):
        # batch is a dict of tensors with shape [batch_size, seq_len]
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = ft_model(**batch)

        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        print(f"Epoch {epoch+1} | Step {step+1} | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"== Epoch {epoch+1} finished | Avg loss: {avg_loss:.4f} ==")

Epoch 1 | Step 1 | Loss: 14.7345
Epoch 1 | Step 2 | Loss: 8.1762
Epoch 1 | Step 3 | Loss: 3.2726
Epoch 1 | Step 4 | Loss: 0.9673
Epoch 1 | Step 5 | Loss: 0.9050
Epoch 1 | Step 6 | Loss: 0.9186
Epoch 1 | Step 7 | Loss: 0.8490
Epoch 1 | Step 8 | Loss: 0.8782
Epoch 1 | Step 9 | Loss: 0.8042
Epoch 1 | Step 10 | Loss: 0.6960
== Epoch 1 finished | Avg loss: 3.2202 ==
Epoch 2 | Step 1 | Loss: 0.7238
Epoch 2 | Step 2 | Loss: 0.5138
Epoch 2 | Step 3 | Loss: 0.6198
Epoch 2 | Step 4 | Loss: 0.7040
Epoch 2 | Step 5 | Loss: 0.5648
Epoch 2 | Step 6 | Loss: 0.6583
Epoch 2 | Step 7 | Loss: 0.4236
Epoch 2 | Step 8 | Loss: 0.4197
Epoch 2 | Step 9 | Loss: 0.4426
Epoch 2 | Step 10 | Loss: 0.4661
== Epoch 2 finished | Avg loss: 0.5537 ==


Let's save the trained adapter (the cells that reload it are left commented out):

In [56]:
output_dir = "./tinyllama-tech-lora"
ft_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("Saved LoRA adapter to:", output_dir)

Saved LoRA adapter to: ./tinyllama-tech-lora


In [57]:
torch.cuda.empty_cache()

In [58]:
# # Load FT Model
# ft_model = AutoModelForCausalLM.from_pretrained(output_dir)

# # Load the tokenizer
# tokenizer = AutoTokenizer.from_pretrained(output_dir)

In [59]:
# ft_model.to(device)
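
The commented-out cells point at reloading the model from `output_dir`. Since `save_pretrained` on a PEFT model writes only the adapter weights, one way to reload is to load the base model first and then attach the adapter with `PeftModel.from_pretrained`. The following is a minimal sketch under that assumption; `load_lora_adapter` is a hypothetical helper, and `base_model_name` and `bnb_config` are expected to be the objects already used in this notebook.

```python
def load_lora_adapter(base_model_name, adapter_dir, bnb_config=None):
    """Load the base model and attach a saved LoRA adapter for inference."""
    # Imported lazily so that defining the helper does not require a GPU stack.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,  # pass None to load without quantization
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # disable dropout (including LoRA dropout) for generation
    return model
```

A call such as `load_lora_adapter(base_model_name, "./tinyllama-tech-lora", bnb_config)` would then return a model that can be passed to `generate_answer` in place of the in-memory `ft_model`.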

Once again:

In [60]:
question = "Which company open-sourced Llama-based models and for what purpose?"

In [61]:
raw_answer, answer = generate_answer(ft_model, tokenizer, question, 256, 128)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Caching is incompatible with gradient checkpointing in LlamaDecoderLayer. Setting `past_key_values=None`.


In [62]:
print(answer)




In [63]:
prompt

'Which company open-sourced Llama-based models and for what purpose?:'

In [64]:
raw_answer, answer = generate_answer(ft_model, tokenizer, prompt, 256, 128)

In [65]:
print(answer)




In [66]:
print("\n--- RAG (no FT) ---")
prompt, full_output, gen_output, contexts, token_count = rag_answer(
    base_model, tokenizer, question, 3, 256, 128, 'faiss'
)
print(gen_output)


--- RAG (no FT) ---
Meta open-sourced Llama-based models for research and development.
     The models are designed for tasks such as image classification, object detection, and language modeling.
     The open-sourcing of Llama models will enable researchers and companies to fine-tune them for their own use cases.
     The models are designed for tasks such as image classification, object detection, and language modeling.
     Meta open-sourced Llama-based models for research and development.
     The models are designed for tasks such as image classification, object detection, and language model


In [67]:
question

'Which company open-sourced Llama-based models and for what purpose?'

In [68]:
print("\n--- Fine-tuned + RAG ---")
_, raw_answer, answer, _, _ = rag_answer(
    ft_model, tokenizer, question, 3, 256, 128, "faiss"
)
print(answer)


--- Fine-tuned + RAG ---



Now let's fine-tune the same small model on a somewhat larger dataset from the Hugging Face Hub:

In [69]:
raw_ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

In [70]:
train_sft = raw_ds["train_sft"]
test_sft = raw_ds["test_sft"]
pprint(train_sft)

Dataset({
    features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
    num_rows: 61135
})


In [71]:
pprint(train_sft[0])

{'chosen': [{'content': 'how can i develop a habit of drawing daily',
             'role': 'user'},
            {'content': 'Developing a daily habit of drawing can be '
                        'challenging but with consistent practice and a few '
                        'tips, it can become an enjoyable and rewarding part '
                        'of your daily routine. Here are some strategies to '
                        'help you develop the habit of drawing daily:\n'
                        '\n'
                        '1. Set a specific time: Allocate a specific time of '
                        'the day to draw. It could be in the morning, '
                        'afternoon, or evening. Make drawing a part of your '
                        'daily routine.\n'
                        '2. Set a specific duration: Determine the amount of '
                        'time you want to spend on drawing each day. It can be '
                        'as little as 10 minutes or as long a

In [72]:
def ultrafeedback_to_qa(example):
    """
    Convert one UltraFeedback example into your {instruction, input, output} format.
    We'll use the 'messages' field.
    """
    msgs = example["messages"]

    # all user messages, in order (the last one is used below)
    user_msgs = [m["content"] for m in msgs if m["role"] == "user"]
    # all assistant messages, in order (the last one is used below)
    assistant_msgs = [m["content"] for m in msgs if m["role"] == "assistant"]

    if not user_msgs or not assistant_msgs:
        return {
            "instruction": "",
            "input": "",
            "output": "",
        }

    instruction = user_msgs[-1].strip()
    output = assistant_msgs[-1].strip()

    return {
        "instruction": instruction,
        "input": "",
        "output": output,
    }

In [73]:
# For GPU sanity, start with a subset
train_small = train_sft.shuffle().select(range(1000))
eval_small = test_sft.shuffle().select(range(100))

train_qa = [ultrafeedback_to_qa(ex) for ex in train_small]
eval_qa = [ultrafeedback_to_qa(ex) for ex in eval_small]

# Filter out empties
train_qa = [ex for ex in train_qa if ex["instruction"] and ex["output"]]
eval_qa = [ex for ex in eval_qa if ex["instruction"] and ex["output"]]

In [74]:
pprint(train_qa[0])

{'input': '',
 'instruction': 'Given a pair of words, deduce the type of relationship '
                "between them. The various types of relations are: 'Entails, "
                'HasProperty, Synonym, Antonym, HasA, MemberOf, PartOf, '
                "MadeOf, IsA'. Let's denote the first word by X and the second "
                "word by Y. An 'IsA' relation holds when 'X is a kind of Y'. "
                "An 'Antonym' relation holds when 'X can be used as the "
                "opposite of Y'. A 'Synonym' relation applies when 'X can be "
                "used in place of Y, without changing the meaning'. A 'PartOf' "
                "relation holds when 'X is a part of Y'. A 'MemberOf' relation "
                "holds when 'X is a member of Y'. A 'MadeOf' relation holds "
                "when 'X is made of Y'. An 'Entailment' relation holds when "
                "'If X is true, then Y is true as well'. A 'HasA' relation "
                "holds when 'X can have or contain 

In [75]:
train_dataset = Dataset.from_list(train_qa)
eval_dataset = Dataset.from_list(eval_qa)

In [76]:
pprint(train_dataset[0])

{'input': '',
 'instruction': 'Given a pair of words, deduce the type of relationship '
                "between them. The various types of relations are: 'Entails, "
                'HasProperty, Synonym, Antonym, HasA, MemberOf, PartOf, '
                "MadeOf, IsA'. Let's denote the first word by X and the second "
                "word by Y. An 'IsA' relation holds when 'X is a kind of Y'. "
                "An 'Antonym' relation holds when 'X can be used as the "
                "opposite of Y'. A 'Synonym' relation applies when 'X can be "
                "used in place of Y, without changing the meaning'. A 'PartOf' "
                "relation holds when 'X is a part of Y'. A 'MemberOf' relation "
                "holds when 'X is a member of Y'. A 'MadeOf' relation holds "
                "when 'X is made of Y'. An 'Entailment' relation holds when "
                "'If X is true, then Y is true as well'. A 'HasA' relation "
                "holds when 'X can have or contain 

In [77]:
def format_ex(example):
    instr = example["instruction"]
    inp = example["input"]
    out = example["output"]

    if inp:
        prompt = (
            "Below is an instruction and additional input. "
            "Write a helpful, honest, and concise response.\n\n"
            f"### Instruction:\n{instr}\n\n"
            f"### Input:\n{inp}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction. "
            "Write a helpful, honest, and concise response.\n\n"
            f"### Instruction:\n{instr}\n\n"
            "### Response:\n"
        )

    # For causal LM, we feed prompt + output as a single sequence
    full_text = prompt + out
    return {"text": full_text}

In [78]:
train_dataset = train_dataset.map(format_ex)
eval_dataset = eval_dataset.map(format_ex)


In [79]:
def tokenize_fn(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=256,
        padding="max_length",
    )

In [80]:
train_tokenized = train_dataset.map(
    tokenize_fn, batched=True, remove_columns=train_dataset.column_names
)
eval_tokenized = eval_dataset.map(
    tokenize_fn, batched=True, remove_columns=eval_dataset.column_names
)

train_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])
eval_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])


In [81]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

In [82]:
ft_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False,
)

ft_model = prepare_model_for_kbit_training(ft_model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # typical for Llama-like
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

ft_model = get_peft_model(ft_model, lora_config)
ft_model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
ft_model.enable_input_require_grads()
ft_model.print_trainable_parameters()

trainable params: 4,505,600 || all params: 1,104,553,984 || trainable%: 0.4079


In [83]:
output_dir = "./tinyllama-ultrafeedback-lora"

In [84]:
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size = 8
    learning_rate=1e-4,
    num_train_epochs=1,  # start with 1 epoch; increase if stable
    warmup_ratio=0.03,
    logging_steps=20,
    eval_steps=200,  # only takes effect when a step-based evaluation strategy is enabled
    save_steps=200,
    save_total_limit=2,
    report_to=[],
)

In [85]:
trainer = Trainer(
    model=ft_model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    data_collator=data_collator,
)

In [86]:
trainer.train()

Step,Training Loss
20,1.8053
40,1.5935
60,1.6004
80,1.5006
100,1.5398
120,1.5685


TrainOutput(global_step=125, training_loss=1.5959888496398926, metrics={'train_runtime': 5291.6606, 'train_samples_per_second': 0.189, 'train_steps_per_second': 0.024, 'total_flos': 1595931623424000.0, 'train_loss': 1.5959888496398926, 'epoch': 1.0})
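
The numbers in `TrainOutput` follow from the training arguments. With a per-device batch of 1 and 8 gradient-accumulation steps, each optimizer step consumes 8 examples, so 1,000 examples give 125 steps for one epoch. The short check below reproduces the step count and the reported throughput. It assumes a single device, which is consistent with the 4 GB GPU setup described at the top of the notebook.

```python
import math

num_examples = 1000              # rows in train_tokenized
per_device_batch = 1             # per_device_train_batch_size
grad_accum = 8                   # gradient_accumulation_steps
num_devices = 1                  # assumption: single GPU
train_runtime_s = 5291.6606      # train_runtime from TrainOutput

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)

samples_per_second = num_examples / train_runtime_s
steps_per_second = steps_per_epoch / train_runtime_s
seconds_per_step = train_runtime_s / steps_per_epoch

print(effective_batch, steps_per_epoch)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(round(seconds_per_step, 1))
```

At roughly 42 seconds per optimizer step, this single epoch over 1,000 examples takes close to an hour and a half, which is worth keeping in mind before increasing `num_train_epochs` or the subset size.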

In [87]:
adapter_dir = "./tinyllama-ultrafeedback-lora-adapter"
ft_model.save_pretrained(adapter_dir)
tokenizer.save_pretrained(adapter_dir)
print("Saved LoRA adapter to:", adapter_dir)

Saved LoRA adapter to: ./tinyllama-ultrafeedback-lora-adapter


In [88]:
corpus_docs = [
    f"User: {ex['prompt']}\n\nAssistant: {ultrafeedback_to_qa(ex)['output']}"
    for ex in train_small
]

corpus_titles = [f"UltraFeedback sample {i}" for i in range(len(corpus_docs))]

In [89]:
# Compute embeddings
doc_embeddings = embedder.encode(
    corpus_docs,
    convert_to_numpy=True,
    show_progress_bar=True,
    device=device,
    normalize_embeddings=True,
)
doc_embeddings = np.array(doc_embeddings).astype("float32")


In [90]:
# Sklearn
nn_index = NearestNeighbors(n_neighbors=10, metric="cosine")
nn_index.fit(doc_embeddings)
# Faiss
faiss_emb = np.array(doc_embeddings).astype("float32")
faiss_index = faiss.IndexFlatIP(
    faiss_emb.shape[1]
)  # cosine similarity via inner product
faiss.normalize_L2(faiss_emb)
faiss_index.add(faiss_emb)
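
The comment `cosine similarity via inner product` relies on a simple identity: once vectors are scaled to unit L2 norm, their dot product equals their cosine similarity. That is why an `IndexFlatIP` over normalized embeddings ranks documents by cosine similarity. Note that scikit-learn's `metric="cosine"` reports cosine *distance*, that is, one minus the similarity, so the two indexes should agree on ranking but not on the score itself. A small check in plain Python (no FAISS or NumPy required; the helper names are illustrative):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]
doc = [1.0, 2.0, 2.0]

cos = cosine_similarity(query, doc)                 # direct definition
ip = dot(l2_normalize(query), l2_normalize(doc))    # inner product after normalizing
cos_distance = 1.0 - cos                            # what a cosine-distance metric returns

print(round(cos, 4), round(ip, 4), round(cos_distance, 4))
```

With inner-product scores, larger means more similar; with cosine distance, smaller means more similar. Code that switches between the two backends needs to sort accordingly.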

In [125]:
question = "how can i develop a habit of drawing daily?"

In [126]:
prompt, raw_answer, answer, ctx, token_count = rag_answer(
    base_model, tokenizer, question, 3, 256, 128, "faiss"
)

In [127]:
print("\nPrompt:", prompt)


Prompt: 
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer ONE short sentence. Do NOT repeat context or question
    
  Question: how can i develop a habit of drawing daily?
    
  Context: User: Write a 1500-word report in APA format on the impact of mindfulness practices, such as meditation and breathing exercises, on enhancing creativity in individuals. Use at least three scholarly articles from peer-reviewed journals as sources, and include examples of how mindfulness has been applied in various creative fields, such as music, writing, and visual arts. Additionally, discuss potential limitations or challenges of incorporating mindfulness into a creative practice, and provide recommendations for those interested in implementing mindfulness into their own creative process.

Assistant: Title: The Impact of Mindfulness Practices on Enhancing C

In [128]:
answer

'practices, such as meditation and breathing exercises, on enhancing creativity in individuals. The paper will analyze the research conducted by scholars on mindfulness and creativity, and provide examples of how mindfulness has been applied in various creative fields. The paper will also discuss potential limitations or challenges of incorporating mindfulness into a creative practice, and provide recommendations for those interested in implementing mindfulness into their own creative process.\n\nIntroduction\n\nCreativity is a complex and multifaceted concept that encompasses various aspects of human behavior, including'

In [129]:
raw_answer, answer = generate_answer(base_model, tokenizer, question, 256, 128)

In [130]:
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 


In [131]:
raw_answer, answer = generate_answer(ft_model, tokenizer, question, 256, 128)

In [132]:
print("Baseline Answer (zero shot FT, no RAG):\n", answer)

Baseline Answer (zero shot FT, no RAG):
 


In [135]:
prompt

"\n    You are a helpful assistant specialized in technology news.\n    Use ONLY the context below to answer the user question.\n    If the answer is not in the context, say I don't know.\n    Answer ONE short sentence. Do NOT repeat context or question\n    \n  Question: how can i develop a habit of drawing daily?\n    \n  Context: User: Write a 1500-word report in APA format on the impact of mindfulness practices, such as meditation and breathing exercises, on enhancing creativity in individuals. Use at least three scholarly articles from peer-reviewed journals as sources, and include examples of how mindfulness has been applied in various creative fields, such as music, writing, and visual arts. Additionally, discuss potential limitations or challenges of incorporating mindfulness into a creative practice, and provide recommendations for those interested in implementing mindfulness into their own creative process.\n\nAssistant: Title: The Impact of Mindfulness Practices on Enhancing

In [136]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 practices, such as meditation and breathing exercises, on enhancing creativity in individuals. The paper will review three scholarly articles from peer-reviewed journals to provide evidence for the effectiveness of mindfulness in enhancing creativity. The paper will also discuss potential limitations and challenges of incorporating mindfulness into a creative practice, and provide recommendations for those interested in implementing mindfulness into their own creative process.

Introduction

Creativity is a vital aspect of human development, and it is often associated with innovation, originality, and the ability to generate


In [137]:
prompt = f"""
ONLY answer the question below. Do NOT repeat the question below in the answer.
Question: {question}
Answer:
"""

In [138]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 1. Start small: Begin by drawing a simple sketch or drawing a few lines.
2. Make it a habit: Once you have started drawing, make it a habit. Try to draw every day, even if it's just for a few minutes.
3. Choose a specific time: Choose a specific time of day, such as before bed, during lunch break, or during your commute.
4. Make it a part of your routine: Try to draw every day, even if it's just for a few minutes, as it becomes a habit.
5. Visualize the result:


In [139]:
prompt = f"""{question}:"""

In [140]:
prompt

'how can i develop a habit of drawing daily?:'

In [141]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 I'm not a big fan of drawing. I've never been able to draw a straight line. But I can draw a picture. I can draw a picture of a picture. I can draw a picture of a picture of a picture. I can draw a picture of a picture of a picture of a picture. I can draw a picture of a picture of a picture of a picture of a picture. I can draw a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of a picture of


In [142]:
raw_answer, answer = generate_answer(ft_model, tokenizer, question, 256, 128)

In [144]:
print(answer)




In [145]:
print("\n--- RAG (no FT) ---")
prompt, full_output, gen_output, contexts, token_count = rag_answer(
    base_model, tokenizer, question, 3, 256, 128, 'faiss'
)
print(gen_output)


--- RAG (no FT) ---
practices, such as meditation and breathing exercises, on enhancing creativity in individuals. The paper will draw on three scholarly articles from peer-reviewed journals to provide a comprehensive overview of the topic. The paper will discuss the potential benefits of mindfulness on creativity, including increased focus, improved problem-solving, and enhanced emotional regulation. The paper will also explore the challenges of incorporating mindfulness into a creative practice, including the need for self-awareness, mindfulness training, and the potential for distraction. The paper will provide


In [123]:
question

'how can i develop a habit of drawing daily'

In [146]:
print("\n--- Fine-tuned + RAG ---")
_, raw_answer, answer, _, _ = rag_answer(
    ft_model, tokenizer, question, 3, 256, 128, "faiss"
)
print(answer)


--- Fine-tuned + RAG ---
practices. Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below Below
