![Colegio Bourbaki](./Images/Bourbaki.png)

## Natural Language Processing

In this notebook we will do the following:

1. **Explain** the difference between:
- Retrieval-augmented generation (**RAG**)
- **Fine-tuning** a language model
- Using **both together**

2. **Implement a small RAG pipeline**:
- Use a sentence transformer to embed documents
- Store the embeddings in a vector index
- Retrieve relevant passages
- Use a small open-weights chat model to answer questions from that context

3. **Fine-tune a small open-weights model** on a small question-answer dataset:
- Use LoRA / QLoRA to fit it on a ~4 GB GPU
- Compare the answers **before and after** fine-tuning

We assume a GPU such as the **NVIDIA GeForce GTX 1650 Ti 4 GB**, so we will:
- Use a small model: `TinyLlama/TinyLlama-1.1B-Chat-v1.0`.
- Load it in **4-bit** whenever possible.
- Keep batch sizes small.

This is a *didactic* notebook: do not expect state-of-the-art quality, but it should help you understand **when to use RAG and when to fine-tune**.


## RAG vs. fine-tuning (conceptual)

### What is RAG (retrieval-augmented generation)?

LLMs have **limited knowledge**: they only know what they saw during pretraining.  
RAG adds an **external knowledge store** (for example, a vector database):

1. The documents (articles, reports, tickets) are **embedded** into vectors.
2. At query time, the system:
   - Embeds the user's question.
   - Retrieves the **most similar documents**.
   - Passes the *question + retrieved context* to the LLM.
3. The model answers *using that context*, without changing its weights.
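
The three steps above can be sketched with plain NumPy before any real model is involved. This is a minimal illustration, not the pipeline built later in the notebook: the document texts, the 3-dimensional "embeddings", and the `retrieve` helper are all hypothetical stand-ins for what an embedding model and a vector index would provide.

```python
import numpy as np

# Hypothetical toy corpus with hand-written 3-d "embeddings".
# A real pipeline would obtain these vectors from an embedding model.
docs = ["Doc about GPUs", "Doc about cooking", "Doc about LLM fine-tuning"]
doc_vecs = np.array(
    [[0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0],
     [0.7, 0.0, 0.7]]
)
# Normalize rows so that the dot product equals cosine similarity.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)


def retrieve(query_vec, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q              # cosine similarity per document
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [docs[i] for i in top]


# Step 2: embed the question (here: a made-up vector) and retrieve.
query_vec = np.array([0.6, 0.0, 0.8])
contexts = retrieve(query_vec, k=2)

# Step 3: the retrieved text is pasted into the prompt; no weights change.
prompt = (
    "Context:\n" + "\n".join(contexts)
    + "\n\nQuestion: How do I fine-tune an LLM?\nAnswer:"
)
print(prompt)
```

The language model only ever sees `prompt`, which is why retrieval quality bounds answer quality.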

**Advantages:**
- Ideal for **new and frequently changing data** (such as daily news).
- No heavy training required, only embedding + retrieval.
- Safe: it does not overwrite the model.

**Disadvantages:**
- Answer quality depends on **retrieval quality** and on the prompt size.
- Limited by the **context window**: only a limited amount of text can be passed.

---

### What is fine-tuning?

Fine-tuning means **continuing to train** a pretrained LLM on a **specific task or domain**:

- Example: thousands of question-answer pairs about cloud computing, Kubernetes, fintech, etc.
- The model **updates its weights** to internalize that domain.

**Advantages:**
- The model natively improves in that domain or style.
- There is no need to always supply a long context: it "knows" more in its weights.

**Disadvantages:**
- **It is expensive** (GPU time, training pipeline).
- It needs **good, curated data**.
- The model still has a fixed knowledge cutoff (it will not "see" new articles unless it is retrained).

---

### Using both together

You can combine the two approaches:

- Use GPT-4 / larger models (or any "expert model") to **generate question-answer pairs** from documents.
- **Fine-tune a smaller open-weights model** on these question-answer pairs.
- Keep RAG as well to inject **very recent documents**.

Result:
- The small model improves in **jargon and style** thanks to fine-tuning.
- RAG keeps it **up to date** with new documents.

In the rest of this notebook we will implement:

1. A small **RAG pipeline**.
2. A small **LoRA fine-tune**.
3. A quick **comparison**.

In [38]:
# Dependencies (faiss-cpu and torchinfo are needed by the imports below)
# !pip install -q \
#   torch \
#   transformers \
#   accelerate \
#   bitsandbytes \
#   peft \
#   sentence-transformers \
#   datasets \
#   scikit-learn \
#   faiss-cpu \
#   torchinfo


### Libraries

In [39]:
import numpy as np
import faiss
import os
import sys
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
)
from torch.utils.data import DataLoader
from torchinfo import summary
from sentence_transformers import SentenceTransformer, util
from sklearn.neighbors import NearestNeighbors
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

### Configuration

In [40]:
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
torch.backends.cuda.matmul.fp32_precision = (
    "ieee"  # full FP32 matmuls; legacy equivalent: torch.backends.cuda.matmul.allow_tf32 = False
)
torch.backends.cudnn.conv.fp32_precision = (
    "tf32"  # allow TF32 in cuDNN convolutions; legacy equivalent: torch.backends.cudnn.allow_tf32 = True
)
torch.cuda.empty_cache()
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False

In [41]:
print("__Python VERSION:", sys.version)
print("__pyTorch VERSION:", torch.__version__)
print("__CUDA VERSION:", torch.version.cuda)
print("__CUDNN VERSION:", torch.backends.cudnn.version())
print("__Number CUDA Devices:", torch.cuda.device_count())
print("__Devices")
print("Active CUDA Device: GPU", torch.cuda.current_device())
print("Available devices ", torch.cuda.device_count())
print("Current cuda device ", torch.cuda.current_device())

__Python VERSION: 3.12.11 (main, Sep  5 2025, 19:35:43) [GCC 13.3.0]
__pyTorch VERSION: 2.9.0+cu128
__CUDA VERSION
__CUDNN VERSION: 91002
__Number CUDA Devices: 1
__Devices
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0


In [42]:
! nvidia-smi

Mon Nov 17 21:30:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1650 Ti     Off |   00000000:01:00.0  On |                  N/A |
| N/A   58C    P0             14W /   50W |    1264MiB /   4096MiB |     31%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [43]:
device = "cuda" if torch.cuda.is_available() else "cpu"

Let's start with a small example.

In [44]:
corpus_docs = [
    # 1
    """OpenAI released a new model that improves reasoning on complex code and math problems. 
    The model is optimized for tool use and retrieval-augmented generation pipelines.""",
    # 2
    """Google announced updates to its Vertex AI platform, making it easier to deploy and monitor 
    large language models at enterprise scale.""",
    # 3
    """Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases.""",
    # 4
    """Microsoft integrated generative AI into its Office suite, adding features such as 
    AI-powered summarization, drafting assistance, and automatic meeting notes generation.""",
    # 5
    """Amazon Web Services introduced cheaper GPU instances optimized for inference workloads 
    like chatbots, code assistants, real-time search, and document question-answering.""",
    # 6
    """NVIDIA released new open-source libraries for accelerating transformer inference, 
    offering significant speedups on consumer GPUs like the RTX 4090.""",
    # 7
    """Anthropic published a research paper describing improvements in constitutional AI, 
    focusing on scalable oversight and safer model behavior.""",
    # 8
    """Apple reportedly began testing on-device LLMs for future iPhone models, enabling 
    private AI features such as offline summarization and personal context reasoning.""",
    # 9
    """Hugging Face launched a new inference API tier with higher throughput and native 
    support for vLLM, making it cheaper to serve models like Mistral-7B and Llama-3-8B.""",
    # 10
    """Mistral AI released Mixtral-8x22B, a sparse mixture-of-experts model offering state-of-the-art 
    performance while remaining efficient enough for commercial deployment.""",
    # 11
    """IBM announced a partnership with NASA to fine-tune foundation models on geospatial data 
    to improve climate analysis, wildfire prediction, and satellite imagery classification.""",
    # 12
    """Databricks released DBRX, a 132B-weight mixture-of-experts model trained on curated 
    scientific and enterprise datasets, outperforming models of similar size.""",
    # 13
    """Stability AI introduced Stable Diffusion 3, featuring improved text-image alignment 
    and reduced hallucination in multilingual prompting scenarios.""",
    # 14
    """Snowflake added native vector search capabilities, allowing enterprises to store embeddings 
    and run RAG pipelines directly on their data warehouse.""",
    # 15
    """Cohere launched a secure enterprise-grade embedding model designed for document retrieval, 
    semantic search, and multi-lingual knowledge-base applications.""",
    # 16
    """Red Hat announced AI-enhanced DevOps tooling, including automated deployment validation 
    powered by small specialized LLMs.""",
    # 17
    """Salesforce updated Einstein GPT with better CRM-specific reasoning, including lead scoring, 
    automatic email drafting, and pipeline forecasting.""",
    # 18
    """Dropbox introduced AI-powered universal search across files, documents, PDFs, and images, 
    enabling users to query semantic content instantly.""",
    # 19
    """Slack rolled out AI summarization for channels and threads, automatically generating 
    daily digests and extracting key decisions from long discussions.""",
    # 20
    """Zoom added real-time conversation translation and AI-based meeting action items, 
    powered by a fine-tuned multilingual transformer model.""",
]

corpus_titles = [
    "OpenAI releases new reasoning model",
    "Google updates Vertex AI",
    "Meta open-sources Llama models",
    "Microsoft adds AI to Office",
    "AWS introduces cheaper GPU instances",
    "NVIDIA releases transformer acceleration libs",
    "Anthropic improves constitutional AI",
    "Apple tests on-device LLMs",
    "Hugging Face launches new inference tier",
    "Mistral releases Mixtral-8x22B",
    "IBM partners with NASA on geospatial AI",
    "Databricks releases DBRX",
    "Stability AI releases SD3",
    "Snowflake adds vector search",
    "Cohere launches enterprise embedding model",
    "Red Hat adds AI DevOps tools",
    "Salesforce updates Einstein GPT",
    "Dropbox adds AI universal search",
    "Slack adds AI summaries",
    "Zoom adds real-time AI translation",
]

len(corpus_docs)

20

In [45]:
# small and fast embedding model (open weights)
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(embedding_model_name, device=device)

In [46]:
# Compute embeddings
doc_embeddings = embedder.encode(
    corpus_docs, convert_to_numpy=True, show_progress_bar=True, device=device, normalize_embeddings=True
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [47]:
doc_embeddings, doc_embeddings.shape

(array([[-0.05694531, -0.01272646, -0.06879752, ...,  0.05730383,
          0.04768759,  0.00835872],
        [-0.07661536, -0.08271152,  0.03887794, ..., -0.00933229,
          0.05409345, -0.03391702],
        [-0.03868605, -0.02878194, -0.02075998, ..., -0.04762491,
         -0.01284006,  0.03692014],
        ...,
        [-0.02095782, -0.03330291, -0.04754037, ...,  0.04166466,
          0.05232637,  0.02517612],
        [-0.00392685, -0.02862822, -0.01042572, ...,  0.05525399,
         -0.05279417, -0.02444052],
        [-0.08633485, -0.04698378,  0.00952209, ...,  0.04447945,
         -0.1053777 , -0.02525596]], dtype=float32),
 (20, 384))

In [48]:
# Sklearn
nn_index = NearestNeighbors(n_neighbors=3, metric="cosine")
nn_index.fit(doc_embeddings)
# Faiss
faiss_emb = np.array(doc_embeddings).astype("float32")
faiss_index = faiss.IndexFlatIP(faiss_emb.shape[1])  # cosine similarity via inner product
faiss.normalize_L2(faiss_emb)
faiss_index.add(faiss_emb)
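
The two indexes above report different quantities: scikit-learn's `metric="cosine"` returns cosine *distances* (smaller is better), while an inner-product index returns *similarities* (larger is better). For unit-length vectors the two are related by `distance = 1 - similarity`, so both rank documents identically. The sketch below checks this with random vectors in NumPy only; the data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five random "document" vectors and one "query", all scaled to unit length.
emb = rng.normal(size=(5, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
q = rng.normal(size=8)
q /= np.linalg.norm(q)

inner = emb @ q            # what an inner-product index scores (cosine similarity here)
cos_dist = 1.0 - inner     # what a cosine-distance metric reports

rank_by_ip = np.argsort(-inner)      # best match = largest inner product
rank_by_dist = np.argsort(cos_dist)  # best match = smallest distance

print(rank_by_ip, rank_by_dist)
```

This is also why the embeddings must be normalized before being added to an inner-product index: without unit length, the inner product is no longer a cosine similarity.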

In [49]:
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

In [50]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [51]:
tokenizer.vocab, tokenizer.vocab_size

({'▁Bür': 15455,
  'Pr': 4040,
  '▁share': 6232,
  '▁Major': 11019,
  '▁Host': 16956,
  '▁Lag': 16952,
  'гу': 3325,
  '▁сель': 11823,
  'full': 8159,
  'irche': 26846,
  '▁Dro': 22938,
  'Change': 7277,
  'Processor': 18689,
  'Access': 6638,
  'тре': 11414,
  '▁Late': 23089,
  '▁scarc': 19494,
  'ω': 30206,
  'GA': 12739,
  '▁Vil': 16450,
  '▁Ball': 13402,
  'oko': 15218,
  '▁costa': 26303,
  '▁vere': 24269,
  '▁anywhere': 12214,
  '▁typically': 12234,
  'èt': 23855,
  '▁également': 8648,
  'itories': 20106,
  '▁rang': 19120,
  '▁brig': 16724,
  '▁Gén': 26236,
  '▁rim': 12726,
  '▁Fish': 12030,
  'Git': 28712,
  'NET': 6006,
  'key': 1989,
  '▁upgrad': 20337,
  '▁площа': 20281,
  'ierten': 12025,
  '▁Y': 612,
  'estic': 15931,
  'êtes': 22730,
  '▁ERR': 22307,
  'ře': 12859,
  '▁relatives': 14576,
  '▁cl': 1067,
  '▁Oriental': 29702,
  'Also': 17351,
  '▁uncle': 22169,
  '**': 1068,
  '▁artifact': 24238,
  '▁▁▁▁▁▁▁▁▁▁': 965,
  '▁performance': 4180,
  '▁craw': 29349,
  '▁kleine': 2011

In [52]:
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

In [53]:
# device_map="auto" has already placed the quantized weights, so no extra .to(device) is needed.
base_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm(

In [54]:
summary(base_model)

Layer (type:depth-idx)                             Param #
LlamaForCausalLM                                   --
├─LlamaModel: 1-1                                  --
│    └─Embedding: 2-1                              65,536,000
│    └─ModuleList: 2-2                             --
│    │    └─LlamaDecoderLayer: 3-1                 22,024,192
│    │    └─LlamaDecoderLayer: 3-2                 22,024,192
│    │    └─LlamaDecoderLayer: 3-3                 22,024,192
│    │    └─LlamaDecoderLayer: 3-4                 22,024,192
│    │    └─LlamaDecoderLayer: 3-5                 22,024,192
│    │    └─LlamaDecoderLayer: 3-6                 22,024,192
│    │    └─LlamaDecoderLayer: 3-7                 22,024,192
│    │    └─LlamaDecoderLayer: 3-8                 22,024,192
│    │    └─LlamaDecoderLayer: 3-9                 22,024,192
│    │    └─LlamaDecoderLayer: 3-10                22,024,192
│    │    └─LlamaDecoderLayer: 3-11                22,024,192
│    │    └─LlamaDecoderLayer: 3-12
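
The per-layer count in the summary above (22,024,192) is about half of what the layer shapes printed earlier imply. A likely explanation is that the 4-bit linear layers store two weights per byte, so the summary counts packed storage elements rather than logical weights. The arithmetic below reconstructs both numbers from the printed shapes. It assumes an untied `lm_head` of size 32000 x 2048, which is not visible in the truncated output above.

```python
# Shapes taken from the printed model: hidden size, key/value projection size,
# MLP intermediate size, vocabulary size, number of decoder layers.
hidden, kv_dim, mlp_dim = 2048, 256, 5632
vocab, n_layers = 32000, 22

attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * mlp_dim                         # gate_proj, up_proj, down_proj
norms = 2 * hidden                                 # two RMSNorm weight vectors per layer
linear_per_layer = attn + mlp

# Packed view: two 4-bit weights per stored element, norms left unpacked.
packed_per_layer = linear_per_layer // 2 + norms
# Logical view: every weight counted once.
true_per_layer = linear_per_layer + norms

# Embedding + assumed untied lm_head + decoder layers + final norm.
total = 2 * vocab * hidden + n_layers * true_per_layer + hidden

print(f"packed per layer : {packed_per_layer:,}")
print(f"logical per layer: {true_per_layer:,}")
print(f"logical total    : {total:,}")
print(f"fp16 weights     : {total * 2 / 1e9:.2f} GB")
```

The fp16 figure shows why quantization matters on a 4 GB card: the unquantized weights alone would take more than half of the available memory.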

Let's create a function that returns both the raw output (prompt + completion) and the net output (completion only):

In [55]:
def generate_answer(model, tokenizer, prompt, max_length, max_new_tokens):
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    ).to(device)

    gen_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )

    with torch.no_grad():
        output = model.generate(**inputs, generation_config=gen_config)

    # Full decoded output (prompt + generated)
    full_decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    # Only the continuation (generated tokens after the prompt)
    generated_ids = output[0][inputs["input_ids"].shape[1] :]
    generated_decoded = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return full_decoded, generated_decoded
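
To build intuition for the `temperature=0.3` and `top_p=0.9` settings used above, the following NumPy sketch applies both to a made-up vector of logits. It is a simplified illustration of the idea, not the library's implementation; `next_token_probs` and the logit values are hypothetical.

```python
import numpy as np


def next_token_probs(logits, temperature=0.3, top_p=0.9):
    """Temperature-scaled softmax followed by nucleus (top-p) filtering."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()

    order = np.argsort(-probs)              # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    cutoff = min(int(np.searchsorted(cumulative, top_p)) + 1, len(probs))
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()        # renormalize over the kept tokens


logits = [2.0, 1.0, 0.5, -1.0]
sharp = next_token_probs(logits, temperature=0.3, top_p=0.9)
flat = next_token_probs(logits, temperature=1.0, top_p=0.9)
print("temperature 0.3:", np.round(sharp, 3))
print("temperature 1.0:", np.round(flat, 3))
```

A low temperature concentrates probability on the most likely token, which makes sampled answers more repeatable; a higher temperature spreads it over more candidates.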

In [56]:
def retrieve_context(question, k, backend):
    """
    Retrieve the top-k most similar documents using the selected backend:
        - 'sklearn' : scikit-learn NearestNeighbors (cosine distance)
        - 'faiss'   : FAISS IndexFlatIP (inner product on normalized vectors)
        - 'st'      : sentence_transformers.util.semantic_search
    """
    q_emb = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)

    # 1) Scikit-Learn NearestNeighbors
    if backend == "sklearn":
        distances, indices = nn_index.kneighbors(q_emb, n_neighbors=k)
        return [corpus_docs[i] for i in indices[0]]
    
    # 2) FAISS (cosine via inner product)
    elif backend == "faiss":
        q = q_emb.astype("float32")
        faiss.normalize_L2(q)
        distances, indices = faiss_index.search(q, k)
        return [corpus_docs[i] for i in indices[0]]

    # 3) SentenceTransformers semantic search
    elif backend == "st":
        hits = util.semantic_search(q_emb, doc_embeddings, top_k=k)[0]
        return [corpus_docs[hit["corpus_id"]] for hit in hits]

    else:
        raise ValueError(f"Unknown retrieval backend: {backend}")

In [57]:
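def build_prompt(question, tokenizer, contexts):
    # Only the top-ranked passage is placed in the prompt; the other retrieved
    # passages are still returned by rag_answer for inspection.
    context_text = contexts[0]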
    prompt = f"""
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer one short sentence. Do NOT repeat context or question
    \n  Question: {question}
    \n  Context: {context_text}
    \n Answer:
    """
    count = len(tokenizer(prompt, return_tensors="pt")["input_ids"][0])
    return prompt, count
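
`build_prompt` uses only the first retrieved passage. If several passages should reach the model, one option is to number them and join them into a single context block, as sketched below. `build_prompt_multi` and its `max_contexts` parameter are hypothetical and not part of the notebook; the wording of the instructions is adapted from the prompt above.

```python
def build_prompt_multi(question, contexts, max_contexts=3):
    """Hypothetical variant of build_prompt that includes several passages."""
    # Collapse internal newlines/indentation and number each passage so the
    # model can tell them apart.
    numbered = "\n".join(
        f"[{i}] {' '.join(c.split())}"
        for i, c in enumerate(contexts[:max_contexts], 1)
    )
    return (
        "You are a helpful assistant specialized in technology news.\n"
        "Use ONLY the context below to answer the user question.\n"
        "If the answer is not in the context, say \"I don't know\".\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example = build_prompt_multi(
    "Who open-sourced Llama?",
    ["Meta open-sourced\n    Llama models.", "Apple tests LLMs."],
)
print(example)
```

Every extra passage consumes part of the token budget, so with a small `max_length` the prompt may be truncated before the question is fully included.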

In [58]:
def rag_answer(model, tokenizer, question, k, max_length, max_new_tokens, backend):
    """
    Full RAG flow:
    - Retrieve similar docs
    - Build a plain-text prompt with the top-ranked context
    - Generate answer with the LLM
    """
    contexts = retrieve_context(question, k, backend)
    prompt, token_count = build_prompt(question, tokenizer, contexts)
    full_output, gen_output = generate_answer(model, tokenizer, prompt, max_length, max_new_tokens)
    return prompt, full_output, gen_output, contexts, token_count

In [59]:
question = "Which company open-sourced Llama-based models?"


In [60]:
prompt, raw_answer, answer, ctx, token_count = rag_answer(
    base_model, tokenizer, question, 3, 256, 128, "st"
)

In [61]:
print("\nPrompt:", prompt)


Prompt: 
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer one short sentence. Do NOT repeat context or question
    
  Question: Which company open-sourced Llama-based models?
    
  Context: Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases.
    
 Answer:
    


In [62]:
print("Prompt token count:", token_count)

Prompt token count: 135


In [63]:
print("Retrieved Contexts:\n")
for i, c in enumerate(ctx, 1):
    print(f"--- Context {i} ---")
    print(c.strip(), "\n")

Retrieved Contexts:

--- Context 1 ---
Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases. 

--- Context 2 ---
Hugging Face launched a new inference API tier with higher throughput and native 
    support for vLLM, making it cheaper to serve models like Mistral-7B and Llama-3-8B. 

--- Context 3 ---
Apple reportedly began testing on-device LLMs for future iPhone models, enabling 
    private AI features such as offline summarization and personal context reasoning. 



In [64]:
print("Raw RAG Answer:", raw_answer)

Raw RAG Answer: 
    You are a helpful assistant specialized in technology news.
    Use ONLY the context below to answer the user question.
    If the answer is not in the context, say I don't know.
    Answer one short sentence. Do NOT repeat context or question
    
  Question: Which company open-sourced Llama-based models?
    
  Context: Meta open-sourced a set of Llama-based models with billions of parameters, 
    enabling researchers and companies to fine-tune them for their own use cases.
    
 Answer:
     Meta is an American multinational technology company that develops artificial intelligence and machine learning technologies.
     Llama is a type of deep learning model that is popular in the field of natural language processing.
     Open-sourcing Llama-based models is a significant step towards making AI more accessible to researchers and companies.
     The models can be used for a wide range of applications, including natural language processing, speech recognition, an

In [65]:
print('RAG Answer:\n', answer)

RAG Answer:
 Meta is an American multinational technology company that develops artificial intelligence and machine learning technologies.
     Llama is a type of deep learning model that is popular in the field of natural language processing.
     Open-sourcing Llama-based models is a significant step towards making AI more accessible to researchers and companies.
     The models can be used for a wide range of applications, including natural language processing, speech recognition, and image classification.
     The open-sourcing of these models is a significant step towards making AI more accessible to researchers and companies.


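Let's look at something interesting: what does the base model answer when it receives only the bare question, with no instructions and no retrieved context?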

In [66]:
question

'Which company open-sourced Llama-based models?'

In [67]:
raw_answer, answer = generate_answer(base_model, tokenizer, question, 256, 128)

In [68]:
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 


Let's reuse the previous prompt:

In [69]:
prompt

"\n    You are a helpful assistant specialized in technology news.\n    Use ONLY the context below to answer the user question.\n    If the answer is not in the context, say I don't know.\n    Answer one short sentence. Do NOT repeat context or question\n    \n  Question: Which company open-sourced Llama-based models?\n    \n  Context: Meta open-sourced a set of Llama-based models with billions of parameters, \n    enabling researchers and companies to fine-tune them for their own use cases.\n    \n Answer:\n    "

In [70]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 Meta open-sourced a set of Llama-based models with billions of parameters, enabling researchers and companies to fine-tune them for their own use cases.
     The models are based on the Llama architecture, which is a popular machine learning framework.
     The models are open-sourced to enable researchers and companies to use them for their own use cases.
     The context is a specific use case, so the answer is not general knowledge.


We now create a new prompt that contains instructions, the question, and a slot for the answer:

In [71]:
prompt = f"""
ONLY answer the question below. Do NOT repeat the question below in the answer.
Question: {question}
Answer:
"""

In [72]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 The answer is Google.


Here is another interesting case: the bare question followed by a colon.

In [73]:
prompt = f"""{question}:"""

In [74]:
prompt

'Which company open-sourced Llama-based models?:'

In [75]:
raw_answer, answer = generate_answer(base_model, tokenizer, prompt, 256, 128)
print("Baseline Answer (no RAG, no fine-tuning):\n", answer)

Baseline Answer (no RAG, no fine-tuning):
 The Llama project, which aims to create a standard for building machine learning models, has opened up its source code to the public. The project, which was founded by Google's DeepMind AI research group, has been working on the project for the past year. The Llama project is a collection of open-source models that are designed to be easy to use and to be able to be trained on a variety of data sets. The project is focused on building models that can be used for a variety of tasks, including natural language processing, image recognition, and recommendation systems. The project is open-s


Now let's see how to fine-tune with a toy example:

In [30]:
qa_pairs = [
    {
        "instruction": "Which company released a new reasoning model optimized for tool use and RAG?",
        "input": "",
        "output": "OpenAI released a new model optimized for tool use and retrieval-augmented generation.",
    },
    {
        "instruction": "What did Google update to make it easier to deploy large language models?",
        "input": "",
        "output": "Google updated its Vertex AI platform to make it easier to deploy and monitor large language models.",
    },
    {
        "instruction": "Which company open-sourced Llama-based models and why is this important?",
        "input": "",
        "output": "Meta open-sourced Llama-based models, enabling researchers and companies to fine-tune them for their own use cases.",
    },
    {
        "instruction": "What AI features did Microsoft add to Office?",
        "input": "",
        "output": "Microsoft added generative AI features to Office, including AI-powered summarization, drafting, and meeting notes.",
    },
    {
        "instruction": "What did AWS introduce for inference workloads?",
        "input": "",
        "output": "AWS introduced cheaper GPU instances optimized for inference workloads like chatbots, code assistants, and document question-answering.",
    },
    {
        "instruction": "What did NVIDIA release to accelerate transformer inference?",
        "input": "",
        "output": "NVIDIA released open-source libraries that accelerate transformer inference, improving performance on consumer GPUs.",
    },
    {
        "instruction": "What research focus did Anthropic publish new work on?",
        "input": "",
        "output": "Anthropic published research on improving constitutional AI with scalable oversight and safer model behavior.",
    },
    {
        "instruction": "What AI capability is Apple reportedly testing for iPhones?",
        "input": "",
        "output": "Apple is testing on-device LLMs that enable private offline capabilities like summarization and personal context reasoning.",
    },
    {
        "instruction": "What did Hugging Face launch to improve model serving?",
        "input": "",
        "output": "Hugging Face launched a new inference API tier with high throughput and native vLLM support.",
    },
    {
        "instruction": "Which company released the Mixtral-8x22B model and what type of model is it?",
        "input": "",
        "output": "Mistral AI released Mixtral-8x22B, a sparse mixture-of-experts model that provides state-of-the-art performance efficiently.",
    },
    {
        "instruction": "What partnership did IBM announce involving geospatial data?",
        "input": "",
        "output": "IBM partnered with NASA to fine-tune foundation models on geospatial data to improve climate and satellite analysis.",
    },
    {
        "instruction": "What model did Databricks release and what is notable about it?",
        "input": "",
        "output": "Databricks released DBRX, a 132B-parameter MoE model trained on curated scientific and enterprise datasets.",
    },
    {
        "instruction": "What did Stability AI introduce with improved text-image alignment?",
        "input": "",
        "output": "Stability AI introduced Stable Diffusion 3, offering better text-image alignment and reduced hallucinations.",
    },
    {
        "instruction": "What new capability did Snowflake add for enterprise AI pipelines?",
        "input": "",
        "output": "Snowflake added native vector search, enabling RAG workflows directly within its data warehouse.",
    },
    {
        "instruction": "What product did Cohere launch for retrieval and semantic search?",
        "input": "",
        "output": "Cohere launched an enterprise-grade embedding model designed for semantic search and multilingual retrieval.",
    },
    {
        "instruction": "What AI enhancement did Red Hat introduce for DevOps workflows?",
        "input": "",
        "output": "Red Hat introduced AI-enhanced DevOps tools, including automated deployment validation powered by small LLMs.",
    },
    {
        "instruction": "What updates did Salesforce make to Einstein GPT?",
        "input": "",
        "output": "Salesforce updated Einstein GPT with improved CRM reasoning, including lead scoring and automated email drafting.",
    },
    {
        "instruction": "What AI feature did Dropbox add to help users navigate their files?",
        "input": "",
        "output": "Dropbox added AI-powered universal search that allows semantic querying across documents, PDFs, and images.",
    },
    {
        "instruction": "How is Slack using AI to help teams stay informed?",
        "input": "",
        "output": "Slack added AI summarization for channels and threads, generating digests and extracting key decisions.",
    },
    {
        "instruction": "What real-time AI capabilities did Zoom add to its platform?",
        "input": "",
        "output": "Zoom added real-time translation and AI-generated meeting action items powered by a multilingual transformer model.",
    },
]



In [31]:
dataset = Dataset.from_list(qa_pairs)
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 20
})

In [32]:
def format_example(example):
    # Alpaca-style formatting
    if example["input"]:
        return f"""Below is an instruction and an input. Write a helpful answer.

### Instruction:
{example["instruction"]}

### Input:
{example["input"]}

### Response:
{example["output"]}
"""
    else:
        return f"""Below is an instruction. Write a helpful answer.

### Instruction:
{example["instruction"]}

### Response:
{example["output"]}
"""


def tokenize_function(example):
    text = format_example(example)
    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=256,
        padding="max_length",
    )
    # For causal LM, labels = input_ids
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
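
Because `labels` is a plain copy of `input_ids`, the padding positions also count toward the loss whenever it is computed from the `labels` field. A common alternative is to replace the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper and the token ids are illustrative.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, ignoring positions where the mask is 0."""
    return [
        tok if mask == 1 else IGNORE_INDEX
        for tok, mask in zip(input_ids, attention_mask)
    ]


# Three real tokens followed by three padding tokens (pad id 2 here).
input_ids = [1, 13866, 338, 2, 2, 2]
attention_mask = [1, 1, 1, 0, 0, 0]
labels = mask_padding(input_ids, attention_mask)
print(labels)
```

One caveat: in this notebook the pad token is the EOS token, so masking every padded position also removes the signal that teaches the model where to stop. Appending an explicit EOS token to each example before padding is one way to keep that signal.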

In [33]:
tokenized_dataset = dataset.map(tokenize_function, batched=False)

# Remove the original text columns
tokenized_dataset = tokenized_dataset.remove_columns(["instruction", "input", "output"])

# Set format
tokenized_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)


In [34]:
tokenized_dataset[0]

{'input_ids': tensor([    1, 13866,   338,   385, 15278, 29889, 14350,   263,  8444,  1234,
         29889,    13,    13,  2277, 29937,  2799,  4080, 29901,    13,  8809,
           436,  5001,  5492,   263,   716, 24481,  1904, 27545,   363,  5780,
           671,   322,   390, 10051, 29973,    13,    13,  2277, 29937, 13291,
         29901,    13,  6585, 23869,  5492,   263,   716,  1904, 27545,   363,
          5780,   671,   322,  5663, 16837, 29899,  2987,   358,   287, 12623,
         29889,    13,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
             2,     2,     2,     2,   

In [35]:
train_loader = DataLoader(
    tokenized_dataset,
    batch_size=2,
    shuffle=True,
)

We now define a new model instance to fine-tune (FT):

In [36]:
ft_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, quantization_config=bnb_config, device_map="auto", use_cache=False
)

In [37]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "o_proj",
    ],  # may need to adjust per model
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
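
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the LoRA update for a single linear layer: the frozen weight `W` stays untouched and the layer output becomes `W x + (lora_alpha / r) * B (A x)`. The tiny dimensions and the helper names (`matvec`, `lora_forward`) are illustrative assumptions; PEFT's real implementation works on tensors and also applies dropout to the input of the adapter branch.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank correction B(Ax)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scaling * d for b, d in zip(base, delta)]


# Toy layer: 3 inputs -> 2 outputs, with rank r = 1 (the notebook uses r = 8).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen weight, shape (2, 3)
A = [[0.5, 0.5, 0.0]]                   # trainable, shape (r, 3)
B_zero = [[0.0], [0.0]]                 # B starts at zero, shape (2, r)
x = [1.0, 2.0, 3.0]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
print(lora_forward(W, A, B_zero, x, r=1, lora_alpha=2))     # [7.0, 2.0]

# Once B is non-zero, the output shifts by scaling * B(Ax).
B_trained = [[1.0], [-1.0]]
print(lora_forward(W, A, B_trained, x, r=1, lora_alpha=2))  # [10.0, -1.0]
```

With the configuration above (`r=8`, `lora_alpha=16`) the scaling factor is likewise 16 / 8 = 2.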

In [38]:
# Prepare model for k-bit training (LoRA on top of 4-bit base)
# device_map="auto" has already placed the quantized weights, so no .to(device) call is needed
ft_model.train()
ft_model = prepare_model_for_kbit_training(ft_model)
ft_model = get_peft_model(ft_model, lora_config)
ft_model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
ft_model.enable_input_require_grads()
ft_model.print_trainable_parameters()

trainable params: 2,252,800 || all params: 1,102,301,184 || trainable%: 0.2044
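
The reported count can be checked by hand. Each adapted projection adds `r * (in_features + out_features)` parameters, since matrix `A` has shape `r x in_features` and matrix `B` has shape `out_features x r`. The sketch below assumes TinyLlama's shapes: 22 decoder layers, hidden size 2048, and key/value projections with 256 output features because of grouped-query attention. The 256 figure comes from the model's published configuration rather than from this notebook, so treat it as an assumption.

```python
R = 8
NUM_LAYERS = 22
HIDDEN = 2048
KV_DIM = 256  # assumption: 4 key/value heads x head dimension 64

# (in_features, out_features) for each targeted projection in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters in A (r x in) plus parameters in B (out x r)."""
    return r * in_features + out_features * r


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
trainable = per_layer * NUM_LAYERS
all_params = 1_102_301_184  # "all params" as printed by print_trainable_parameters()

print(f"per layer: {per_layer:,}")                         # per layer: 102,400
print(f"trainable: {trainable:,}")                         # trainable: 2,252,800
print(f"trainable%: {100 * trainable / all_params:.4f}")   # trainable%: 0.2044
```

Under these assumptions the arithmetic reproduces the 2,252,800 trainable parameters and the 0.2044% share printed above.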


In [39]:
summary(ft_model)

Layer (type:depth-idx)                                            Param #
PeftModelForCausalLM                                              --
├─LoraModel: 1-1                                                  --
│    └─LlamaForCausalLM: 2-1                                      --
│    │    └─LlamaModel: 3-1                                       552,323,072
│    │    └─Linear: 3-2                                           (65,536,000)
Total params: 617,859,072
Trainable params: 2,252,800
Non-trainable params: 615,606,272

In [40]:
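# Only the LoRA adapter weights require gradients; leave the frozen base weights out
trainable_params = [p for p in ft_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    trainable_params, lr=1e-3, amsgrad=True, weight_decay=0.01
)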

In [None]:
num_epochs = 2
for epoch in range(num_epochs):
    total_loss = 0.0

    for step, batch in enumerate(train_loader):
        # batch is a dict of tensors with shape [batch_size, seq_len]
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = ft_model(**batch)

        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        print(f"Epoch {epoch+1} | Step {step+1} | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"== Epoch {epoch+1} finished | Avg loss: {avg_loss:.4f} ==")

Epoch 1 | Step 1 | Loss: 14.5722
Epoch 1 | Step 2 | Loss: 8.6726
Epoch 1 | Step 3 | Loss: 3.9434
Epoch 1 | Step 4 | Loss: 1.1957
Epoch 1 | Step 5 | Loss: 0.9001
Epoch 1 | Step 6 | Loss: 0.9521
Epoch 1 | Step 7 | Loss: 0.8895
Epoch 1 | Step 8 | Loss: 0.9358
Epoch 1 | Step 9 | Loss: 0.9274
Epoch 1 | Step 10 | Loss: 0.8169
== Epoch 1 finished | Avg loss: 3.3806 ==


KeyboardInterrupt: 

In [67]:
output_dir = "./tinyllama-tech-lora"
ft_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("Saved LoRA adapter to:", output_dir)

Saved LoRA adapter to: ./tinyllama-tech-lora


In [112]:
ft_model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2048)
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.L

In [None]:
print(question)

=== Question ===
Which company open-sourced Llama-based models?


In [128]:
raw_answer, answer = zero_shot_answer(ft_model, question)

In [129]:
print("\n--- Fine-tuned (no RAG) ---")
print(answer)


--- Fine-tuned (no RAG) ---
Yes, Llama is an open-source project that allows developers to use the Llama library for building models.


In [130]:
print("\n--- RAG (no FT) ---")
prompt, full_output, gen_output, contexts, token_count = rag_answer(
    base_model, tokenizer, question, 3, 256, 128, "faiss"
)
print(gen_output)


--- RAG (no FT) ---
Meta is an American multinational technology company that specializes in artificial intelligence.
     Llama is a type of machine learning algorithm that is popular in the field of natural language processing.
     Open-sourcing means that the code is available for others to use, modify, and improve.
     Meta's Llama-based models are a set of pre-trained language models that can be fine-tuned for specific tasks.
     The models are available for researchers and companies to use for their own purposes.
     Meta's open-sourcing of Llama-


In [131]:
print("\n--- Fine-tuned + RAG ---")
_, raw_answer, answer, _, _ = rag_answer(
    ft_model, tokenizer, question, 3, 256, 128, "faiss"
)
print(answer)


--- Fine-tuned + RAG ---
Meta is an American multinational technology company that develops and sells social media platforms, including Facebook, Instagram, and WhatsApp.
     Llama is a machine learning framework developed by Google that is used for training large neural networks.
     Open-sourcing Llama-based models is a significant step towards making AI more accessible to researchers and companies.


Next, we fetch news stories from the Hacker News API.
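
As a minimal sketch of what that retrieval step could look like, the snippet below uses the public Hacker News Firebase API (`/v0/topstories.json` for story IDs and `/v0/item/<id>.json` for each item) with only the standard library. The function names and the injectable `fetch` parameter are illustrative assumptions, not the notebook's own code; the fetcher is injectable so the logic can be exercised without network access.

```python
import json
from urllib.request import urlopen

HN_BASE_URL = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url, timeout=10):
    """Download a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def get_top_stories(limit=5, fetch=fetch_json):
    """Return title/url dicts for the first `limit` top stories."""
    story_ids = fetch(f"{HN_BASE_URL}/topstories.json")[:limit]
    stories = []
    for story_id in story_ids:
        item = fetch(f"{HN_BASE_URL}/item/{story_id}.json")
        if item and item.get("title"):  # skip null or title-less items
            stories.append({"title": item["title"], "url": item.get("url", "")})
    return stories


# Example (requires network access):
# for story in get_top_stories(limit=3):
#     print(story["title"], story["url"])
```

Stories without an external link (for example "Ask HN" posts) come back with an empty `url` in this sketch.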

Next comes chunking, for which many strategies exist.
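
One of the simplest strategies is a fixed-size sliding window with overlap, so that a sentence cut at a chunk boundary still appears whole in a neighbouring chunk. The sketch below splits on whitespace-separated words; the function name `chunk_words` and the default sizes are illustrative assumptions, and other strategies (sentence-based, paragraph-based, token-based) follow the same pattern with a different splitting unit.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

Larger chunks give the model more context per retrieved passage but make the embedding less specific; the overlap trades some index size for robustness at chunk boundaries.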