<a href="https://colab.research.google.com/github/alessandrovicenti10/exampleRAGtutorial/blob/main/RAG_Colab_txt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM RAG Tutorial
<a target="_blank" href="https://colab.research.google.com/github/SamHollings/llm_tutorial/blob/main/llm_tutorial_rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial is a short introduction to building a simple RAG app with an LLM.

RAG (Retrieval Augmented Generation) lets us give foundation models local context without expensive fine-tuning, and it can run even on everyday machines like your laptop.
The basic idea is that we store documents as vectors in a database. When the user asks the LLM a question, we can use langchain to first pass that question to the vector database, which retrieves the relevant documents (these can be broken up into chunks, given metadata, summarised, and processed in various other ways to improve retrieval). The original question and the retrieved documents are then passed to the LLM (e.g. Claude), which returns the answer. In effect, the model appears to know what was in the database (e.g. local knowledge about your business or hobby), when in reality that information was injected into the prompt just before the model saw it!
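
To make the flow concrete, here is a minimal, self-contained sketch of retrieve-then-prompt. It is not what the rest of this notebook runs: word counts stand in for a real embedding model, a plain Python list stands in for the vector database, and the names `embed`, `retrieve` and `build_prompt`, along with the example documents, are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Using the following documents:\n{context}\nAnswer the question: {question}"

# Hypothetical "local knowledge" that the model has never seen
documents = [
    "Our office is closed on public holidays.",
    "Expense claims must be submitted within 30 days.",
    "The cafeteria serves lunch from noon until two.",
]
question = "When must expense claims be submitted?"
prompt = build_prompt(question, retrieve(question, documents, k=1))
print(prompt)
```

The printed prompt holds the expense-claims sentence followed by the question, and that combined text is all the LLM would receive.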

The main libraries we will use are:
- Langchain: a wrapper around the various LLMs and other tools that gives them a more consistent interface (so you can easily swap, say, OpenAI for Anthropic)
- Anthropic: the library through which we will access the Claude model (more on why this is chosen below)
- ChromaDB: a simple vector database, which is a key part of the RAG approach
- sentence-transformers: an open-source library of models for embedding text

None of the above are "the best" tools - they're just examples, and you may wish to use different embedding models, LLMs, vector databases, etc.
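
One way to keep those choices easy to change is to pass the embedding function in as a parameter, so the rest of the pipeline never names a specific model. The sketch below only illustrates that idea: `Embedder`, `length_embedder`, `vowel_embedder` and `index_documents` are hypothetical names, and the two toy embedders are placeholders for real models.

```python
from typing import Callable, List, Sequence, Tuple

# Any function that maps a string to a vector of floats counts as an embedder
Embedder = Callable[[str], Sequence[float]]

def length_embedder(text: str) -> List[float]:
    """Toy embedder: a single dimension holding the text length."""
    return [float(len(text))]

def vowel_embedder(text: str) -> List[float]:
    """Toy embedder: [number of vowels, number of other letters]."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    letters = sum(ch.isalpha() for ch in text)
    return [float(vowels), float(letters - vowels)]

def index_documents(docs: Sequence[str], embed: Embedder) -> List[Tuple[str, List[float]]]:
    """Pair each document with its vector, using whichever embedder was passed in."""
    return [(doc, list(embed(doc))) for doc in docs]

docs = ["abc", "hello"]
print(index_documents(docs, length_embedder))
print(index_documents(docs, vowel_embedder))
```

Swapping the embedding model is then a change to one argument, not to the indexing code.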

In [None]:
# On Google Colab, clone the tutorial repo so that requirements.txt and the docs are available
if "google.colab" in str(get_ipython()):
    print("Running on Colab")
    !git clone https://github.com/SamHollings/llm_tutorial.git -q
    %cd llm_tutorial


Running on Colab
/content/llm_tutorial


In [None]:
!pip install -r requirements.txt -q -q


In [None]:
pip install ipykernel==5.5.6


Collecting ipykernel==5.5.6
  Downloading ipykernel-5.5.6-py3-none-any.whl.metadata (1.1 kB)
Downloading ipykernel-5.5.6-py3-none-any.whl (121 kB)
Installing collected packages: ipykernel
  Attempting uninstall: ipykernel
    Found existing installation: ipykernel 6.29.5
    Uninstalling ipykernel-6.29.5:
      Successfully uninstalled ipykernel-6.29.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyterlab 4.3.3 requires ipykernel>=6.5.0, but you have ipykernel 5.5.6 which is incompatible.
notebook 6.5.5 requires jupyter-client<8,>=5.3.4, but you have jupyter-client 8.6.3 which is incompatible.
Successfully installed ipykernel-5.5.6


In [None]:
pip install transformers




In [None]:
import os
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from scipy.spatial.distance import cdist

# Configuration
DEV_MODE = True
PERSIST_DIRECTORY = "/content/llm_tutorial/db"
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"

if DEV_MODE:
    PERSIST_DIRECTORY += "/dev"

# Initialise the embedding model
embedding_model = SentenceTransformer(EMBEDDING_MODEL)

# Initialise the Hugging Face text-generation pipeline (GPT-2)
huggingface_pipeline = pipeline("text-generation", model="gpt2")

# Generate embeddings for a list of texts
def generate_embeddings(texts):
    embeddings = embedding_model.encode(texts, convert_to_tensor=True)
    return embeddings.cpu().numpy()  # Move to CPU and convert to a NumPy array

# Save the embeddings to a file
def save_embeddings(embeddings, metadata):
    os.makedirs(PERSIST_DIRECTORY, exist_ok=True)
    temp_file_path = os.path.join(PERSIST_DIRECTORY, "embeddings_temp.npz")
    final_file_path = os.path.join(PERSIST_DIRECTORY, "embeddings.npz")
    np.savez(temp_file_path, embeddings=embeddings, metadata=metadata)
    os.replace(temp_file_path, final_file_path)  # Swap in the new file only once the save has succeeded
    print(f"Embeddings file saved to: {final_file_path}")

# Load the embeddings from a file
def load_embeddings():
    file_path = os.path.join(PERSIST_DIRECTORY, "embeddings.npz")
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The embeddings file '{file_path}' does not exist.")
    try:
        data = np.load(file_path, allow_pickle=True)
        return data["embeddings"], data["metadata"]
    except Exception as e:
        raise ValueError(f"Error while loading '{file_path}': {e}") from e

# Retrieve the k documents most similar to the query
def retrieve_documents(query, embeddings, metadata, k=4):
    query_embedding = generate_embeddings([query])[0]  # Embed the query
    distances = cdist([query_embedding], embeddings, metric="cosine")[0]
    indices = np.argsort(distances)[:k]  # Smallest cosine distance first
    return [metadata[i] for i in indices]

# Send a prompt to the Hugging Face model
def query_huggingface(prompt):
    response = huggingface_pipeline(
        prompt,
        max_new_tokens=150,         # Maximum number of newly generated tokens
        num_return_sequences=1,    # Return a single sequence
        truncation=True,           # Truncate inputs that are too long
        pad_token_id=50256         # GPT-2's end-of-text token ID, used for padding (change it for other models)
    )
    return response[0]["generated_text"]


# Populate the vector store
def populate_vectorstore():
    documents = []
    docs_path = "/content/llm_tutorial/docs/goldacre_review.txt"

    # Read the specific file
    if not os.path.exists(docs_path):
        raise FileNotFoundError(f"File '{docs_path}' not found.")

    with open(docs_path, "r", encoding="utf-8") as text_file:
        documents.append(text_file.read())

    print(f"Documents loaded: {len(documents)}")

    # Generate embeddings for the documents
    metadata = documents  # The metadata is simply the original text
    embeddings = generate_embeddings(documents)
    print(f"Embeddings generated: {embeddings.shape}")
    save_embeddings(embeddings, metadata)

# Main
if __name__ == "__main__":
    embeddings_file_path = os.path.join(PERSIST_DIRECTORY, "embeddings.npz")

    # Check whether the embeddings file exists and is intact
    if not os.path.exists(embeddings_file_path):
        print(f"File '{embeddings_file_path}' not found. Creating the vector store...")
        populate_vectorstore()
    else:
        print(f"File '{embeddings_file_path}' found. Checking integrity...")

        try:
            embeddings, metadata = load_embeddings()
        except (FileNotFoundError, ValueError) as e:
            print(f"Error while loading the file: {e}")
            print("Recreating the vector store...")
            populate_vectorstore()
            embeddings, metadata = load_embeddings()

    # Load embeddings and metadata
    embeddings, metadata = load_embeddings()

    # User question
    question = "Describe what the Goldacre says about RAP (Reproducible Analytical Pipelines) and what we need to do to make them work."

    # Retrieve the most relevant documents
    relevant_docs = retrieve_documents(question, embeddings, metadata)

    # Build the prompt for the Hugging Face model, truncating the context
    max_prompt_length = 500  # Maximum number of context characters kept in the prompt
    truncated_docs = " ".join(relevant_docs)[:max_prompt_length]
    prompt = f"Using the following documents: {truncated_docs}\nAnswer the question: {question}"


    # Get the answer
    answer = query_huggingface(prompt)
    print("Answer from HuggingFace:", answer)


File '/content/llm_tutorial/db/dev/embeddings.npz' found. Checking integrity...
Answer from HuggingFace: Using the following documents: Skip to main content
 GOV.UK
 Navigation menu 
Menu Search GOV.UK 
HomeHealth and social careTechnology in health and social careBetter, broader, safer: using health data for research and analysis
Department
of Health &
Social Care
Independent report
Better, broader, safer: using health data for research and analysis
Published 7 April 2022

Applies to England
Contents
Review team
Senior stakeholder group
Background information
Ministerial introduction
Foreword
Executive summary
Summary recommend
Answer the question: Describe what the Goldacre says about RAP (Reproducible Analytical Pipelines) and what we need to do to make them work.
Key points

What this report will cover


Conduct the investigation


Identify the risks

Satisfy the team


Develop solutions to the problem. Develop a plan to address the problem.

Use research data


Identify the cost
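
In the cell above, `populate_vectorstore` reads the whole of `goldacre_review.txt` into a single entry, so there is only one vector to retrieve, and the 500-character cut keeps just the start of the page (the navigation text visible in the output). The introduction mentions breaking documents into chunks, which would give retrieval several passages to choose between. Below is a minimal sketch of that step, assuming simple character-based chunks with a small overlap; `chunk_text`, `chunk_size` and `overlap` are hypothetical names and are not part of the original notebook.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk (when it is short enough).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters of stand-in text
chunks = chunk_text(sample, chunk_size=12, overlap=2)
print(len(chunks), [len(c) for c in chunks])
```

Under this assumption, each chunk (not the whole file) would be appended to `documents` before calling `generate_embeddings`, so that `retrieve_documents` can rank passages against the question.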

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
