

### Example 1 – Simple semantic search (TF-IDF + cosine + MMR)

In this first example, we are **not yet doing full RAG**, only **information retrieval**. The sentences are turned into TF-IDF vectors, and we then compute **cosine similarity** to find the passages closest to a question. We also introduce **MMR (Maximal Marginal Relevance)** to show how the *relevance* and *diversity* of the retrieved passages can be balanced.
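
To make the relevance/diversity trade-off concrete, here is a minimal sketch on hand-made 2-D vectors. The toy vectors and the `cosine` and `mmr_select` helpers are illustrative assumptions, not part of the notebook's pipeline. With `lambda_mult=1.0` the selection is driven by relevance alone, so two near-duplicate documents are both kept. A lower value penalizes redundancy and lets a less similar document in.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query, docs, lambda_mult=0.7, top_k=2):
    """Greedy MMR: at each step pick the candidate that maximizes
    lambda * sim(query, d) - (1 - lambda) * max sim(d, already selected)."""
    rel = [cosine(query, d) for d in docs]
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < top_k:
        def score(i):
            # Redundancy is 0 while nothing has been selected yet
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * rel[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): docs 0 and 1 are near-duplicates, doc 2 points elsewhere.
query = np.array([1.0, 0.2])
toy_docs = [np.array([1.0, 0.0]), np.array([0.99, 0.05]), np.array([0.3, 1.0])]

relevance_only = mmr_select(query, toy_docs, lambda_mult=1.0)  # pure relevance
balanced = mmr_select(query, toy_docs, lambda_mult=0.3)        # favors diversity
print("lambda=1.0:", relevance_only)
print("lambda=0.3:", balanced)
```

In this toy setup the relevance-only run keeps both near-duplicates, while the lower `lambda_mult` replaces the second one with the document that points in a different direction.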

In [None]:

from typing import List, Tuple
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Graph neural networks generalize convolution to graph-structured data.",
    "RAG retrieves relevant passages and augments the prompt for generation.",
    "Cosine similarity measures the angle between two vectors.",
    "FAISS is a library for efficient similarity search at scale.",
    "Sentence transformers map text to dense semantic vectors.",
    "Maximal Marginal Relevance (MMR) balances relevance and diversity."
]


vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

def embed_query_tfidf(q: str):
    return vectorizer.transform([q])

def retrieve_topk(q: str, k: int = 3) -> List[Tuple[int, float]]:
    q_vec = embed_query_tfidf(q)
    sims = cosine_similarity(q_vec, doc_matrix).ravel()
    idx = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in idx]

def mmr(query_vec, doc_mat, lambda_mult=0.7, top_k=3):
    """Greedy MMR: trade off relevance to the query against similarity to already-selected docs."""
    rel = cosine_similarity(query_vec, doc_mat).ravel()
    selected = []
    candidates = list(range(doc_mat.shape[0]))

    while len(selected) < top_k and candidates:
        if not selected:
            i = int(np.argmax(rel[candidates]))
            selected.append(candidates[i])
            candidates.pop(i)
        else:
            cand_mat = doc_mat[candidates]
            sel_mat = doc_mat[selected]
            div = cosine_similarity(cand_mat, sel_mat).max(axis=1)

            scores = lambda_mult * rel[candidates] - (1 - lambda_mult) * div
            i = int(np.argmax(scores))
            selected.append(candidates[i])
            candidates.pop(i)


    return [(i, float(rel[i])) for i in selected]

def generate_answer(query: str, hits: List[Tuple[int, float]]) -> str:
    # Build the body outside the f-string: backslashes inside f-string
    # expressions are a SyntaxError before Python 3.12.
    if hits:
        body = "Here is what we found:\n- " + "\n- ".join(docs[i] for i, _ in hits)
    else:
        body = "No supporting passages found."
    return f"""Question: {query}

Grounded answer (based on retrieved context):
{body}
"""

# ==== Try it ====
query = "How does RAG produce better answers than a plain LLM?"
hits = retrieve_topk(query, k=4)
hits_mmr = mmr(embed_query_tfidf(query), doc_matrix, lambda_mult=0.7, top_k=3)

print("Top-k by cosine:", [(docs[i], round(score, 3)) for i, score in hits], "\n")
print("MMR re-ranked:",   [(docs[i], round(score, 3)) for i, score in hits_mmr], "\n")
print(generate_answer(query, hits_mmr))


Top-k by cosine: [('RAG retrieves relevant passages and augments the prompt for generation.', 0.378), ('Graph neural networks generalize convolution to graph-structured data.', 0.0), ('Cosine similarity measures the angle between two vectors.', 0.0), ('FAISS is a library for efficient similarity search at scale.', 0.0)] 

MMR re-ranked: [('RAG retrieves relevant passages and augments the prompt for generation.', 0.378), ('Graph neural networks generalize convolution to graph-structured data.', 0.0), ('Cosine similarity measures the angle between two vectors.', 0.0)] 

Question: How does RAG produce better answers than a plain LLM?

Grounded answer (based on retrieved context):
Here is what we found:
- RAG retrieves relevant passages and augments the prompt for generation.
- Graph neural networks generalize convolution to graph-structured data.
- Cosine similarity measures the angle between two vectors.





### Example 2 – RAG “from scratch” with SentenceTransformers + FAISS + llama.cpp

In this example, we build a RAG pipeline **by hand**. We first split the texts into small pieces (chunks), then use a **sentence embedding model** to turn them into vectors. These vectors are indexed in **FAISS** to allow fast search. Finally, we query a **local LLM (llama.cpp)** and give it the most relevant chunks as context.
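
The pipeline relies on one property worth checking once: when vectors are L2-normalized, the inner product equals the cosine similarity. That is why an inner-product index can rank by cosine. The sketch below verifies this with NumPy only. The random vectors are stand-ins for real embeddings, and `l2_normalize` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8)).astype("float32")  # stand-in for chunk embeddings
query = rng.normal(size=(8,)).astype("float32")      # stand-in for a query embedding

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity computed explicitly from the raw vectors
cos = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Plain inner product after normalizing both sides
ip = l2_normalize(vectors) @ l2_normalize(query)

top_k = np.argsort(-ip)[:3]  # indices of the 3 highest-scoring vectors
print(np.allclose(cos, ip, atol=1e-5), top_k)
```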

In [None]:
!huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF qwen2.5-1.5b-instruct-q5_k_m.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'qwen2.5-1.5b-instruct-q5_k_m.gguf' to '.cache/huggingface/download/GmkwVQxx3T_8EJQrCC1D671Wtjg=.b46661073c18e5b56a41fa320975f866a00def1ff08feef4718e013258896f8c.incomplete'
qwen2.5-1.5b-instruct-q5_k_m.gguf: 100% 1.29G/1.29G [00:10<00:00, 124MB/s]
Download complete. Moving file to qwen2.5-1.5b-instruct-q5_k_m.gguf
qwen2.5-1.5b-instruct-q5_k_m.gguf


In [None]:
from typing import List
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from llama_cpp import Llama


docs = [
    ("doc1", "RAG (Retrieval-Augmented Generation) combines retrieval over a vector store with a generator LLM to answer questions grounded in your data."),
    ("doc2", "FAISS is a library for efficient similarity search on dense vectors. It supports IndexFlatIP for cosine-like similarity via normalized vectors."),
    ("doc3", "Weaviate is a vector database you can run locally with Docker; you can store vectors and metadata and perform hybrid or vector search."),
    ("doc4", "LangChain provides chains and integrations: text splitters, embedding helpers, vector stores, and retrieval-augmented QA pipelines."),
    ("doc5", "Llama.cpp lets you run GGUF local models on CPU/GPU. Use a chat-tuned model for best QA quality with RAG."),
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = []
metas = []
for doc_id, text in docs:
    for i, chunk in enumerate(text_splitter.split_text(text)):
        chunks.append(chunk)
        metas.append({"doc_id": doc_id, "chunk_id": i})

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = embed_model.encode(chunks, convert_to_numpy=True, normalize_embeddings=True).astype("float32")

d = emb.shape[1]
index = faiss.IndexFlatIP(d)
index.add(emb)

def retrieve(query: str, k: int = 3):
    qv = embed_model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype("float32")
    scores, idxs = index.search(qv, k)  # both have shape (1, k)
    results = []
    for score, idx in zip(scores[0], idxs[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors are indexed
            continue
        results.append({"score": float(score), "text": chunks[idx], "meta": metas[idx]})
    return results


GGUF_PATH = "/content/qwen2.5-1.5b-instruct-q5_k_m.gguf"
llm = Llama(
    model_path=GGUF_PATH,
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=0
)

def answer_with_rag(query: str, top_k: int = 3):
    retrieved = retrieve(query, k=top_k)
    context_block = "\n\n".join(
        [f"[{i+1}] {r['text']}" for i, r in enumerate(retrieved)]
    )
    prompt = f"""You are a helpful expert. Answer the user's question using ONLY the context.
If the answer isn't in the context, say you don't know.

# Context
{context_block}

# Question
{query}

# Answer"""

    out = llm.create_chat_completion(
        messages=[{"role":"user","content":prompt}],
        temperature=0.2,
        max_tokens=512,
    )
    return out["choices"][0]["message"]["content"], retrieved

if __name__ == "__main__":
    query = "how does RAG use FAISS and LangChain together?"
    answer, retrieved = answer_with_rag(query, top_k=3)
    print("Top passages:")
    for r in retrieved:
        print(f"- {r['meta']} | score={r['score']:.3f}\n  {r['text']}\n")
    print("Answer:\n", answer)



Top passages:
- {'doc_id': 'doc1', 'chunk_id': 0} | score=0.370
  RAG (Retrieval-Augmented Generation) combines retrieval over a vector store with a generator LLM to answer questions grounded in your data.

- {'doc_id': 'doc4', 'chunk_id': 0} | score=0.291
  LangChain provides chains and integrations: text splitters, embedding helpers, vector stores, and retrieval-augmented QA pipelines.

- {'doc_id': 'doc2', 'chunk_id': 0} | score=0.111
  FAISS is a library for efficient similarity search on dense vectors. It supports IndexFlatIP for cosine-like similarity via normalized vectors.

Answer:
 RAG uses FAISS for efficient similarity search on dense vectors, and LangChain provides chains and integrations such as text splitters, embedding helpers, vector stores, and retrieval-augmented QA pipelines.




### Example 3 – RAG with LangChain + FAISS + LlamaCpp

Here we do the same thing as in the previous example, but use **LangChain** to simplify the code. LangChain provides ready-made components: a text splitter, embeddings, a FAISS vector store, and a question-answering chain (`RetrievalQA`). Instead of assembling every step ourselves, we **“wire” them together through LangChain**. The teaching goal is to show that once you understand RAG “from scratch”, you can move to a framework that makes the code shorter, more readable, and easier to maintain.
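
Conceptually, the `"stuff"` chain type used with `RetrievalQA` places all retrieved passages into the `{context}` slot of the prompt. The sketch below is a plain-Python approximation of that step, not LangChain's actual implementation. `stuff_prompt`, `TEMPLATE` and the sample passages are hypothetical names introduced for illustration.

```python
from typing import List

TEMPLATE = """Use only the context to answer. If missing, say you don't know.
Context:
{context}

Question: {question}
Answer:"""

def stuff_prompt(passages: List[str], question: str, separator: str = "\n\n") -> str:
    """Concatenate ("stuff") every retrieved passage into one context string, then fill the template."""
    context = separator.join(passages)
    return TEMPLATE.format(context=context, question=question)

# Hypothetical retrieval result, most relevant passage first
passages = [
    "FAISS performs fast vector similarity search.",
    "RAG couples retrieval with generation for grounded answers.",
]
prompt_text = stuff_prompt(passages, "how does FAISS help RAG?")
print(prompt_text)
```

This approach is simple but puts every passage in a single prompt, so the total length has to fit within the model's context window.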



In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

docs = [
  "RAG couples retrieval with generation for grounded answers.",
  "FAISS performs fast vector similarity search.",
  "LangChain wires together splitters, embeddings, vector stores, and QA chains.",
]
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
splits = splitter.create_documents(docs)

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vs = FAISS.from_documents(splits, emb)

llm = LlamaCpp(model_path="/content/qwen2.5-1.5b-instruct-q5_k_m.gguf", n_ctx=4096, temperature=0.2)

template = """Use only the context to answer. If missing, say you don't know.
Context:
{context}

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context","question"])

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vs.as_retriever(search_kwargs={"k":3}),
    chain_type_kwargs={"prompt": prompt},
)

print(qa.invoke({"query": "how does FAISS help RAG?"}))




### Example 4 – RAG with an external vector database (Weaviate + SentenceTransformers + Llama)

In this last example, we move to a more **“production”** scenario by using **Weaviate** as the vector database. The embeddings are still computed on the Python side (SentenceTransformers), but the vectors and metadata are **stored and searched in Weaviate**, which handles indexing and search at scale. The LLM (llama.cpp) stays on the application side. This example illustrates the difference between a small local RAG (FAISS in memory) and a more industrial RAG built on a **remote vector database**, as you would do in a real application.
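
One practical difference from the FAISS examples: the retrieval code in this example reads a *distance* from Weaviate rather than a similarity score, so lower is better. The sketch below assumes the collection uses the cosine distance metric, defined as `1 - cosine similarity` (check your collection's configuration to confirm). `cosine_distance` and `distance_to_similarity` are hypothetical helpers written with NumPy only.

```python
import numpy as np

def cosine_distance(a, b) -> float:
    """Cosine distance = 1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance back to a cosine similarity."""
    return 1.0 - distance

same = cosine_distance([1, 0], [2, 0])        # same direction
orthogonal = cosine_distance([1, 0], [0, 3])  # unrelated directions
opposite = cosine_distance([1, 0], [-1, 0])   # opposite directions
print(same, orthogonal, opposite)
```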

In [None]:

import os, uuid, pathlib
import numpy as np
import weaviate
from weaviate.classes.config import Property, DataType, Configure
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama


WCS_REST_URL = "https://6ca79jmsuiwhl8eebxug.c0.europe-west3.gcp.weaviate.cloud"  # from your dashboard
WCS_API_KEY  = os.environ["WCS_API_KEY"]  # read the key from an environment variable; never hard-code a real key

GGUF_PATH = "/content/qwen2.5-1.5b-instruct-q5_k_m.gguf"  # e.g. "/content/models/llama-3.1-8B-instruct.Q4_K_M.gguf"

# Quick sanity check on model path (helpful in Colab)
assert pathlib.Path(GGUF_PATH).exists(), f"GGUF model not found at: {GGUF_PATH}"

# --- Connect to Weaviate v4 (collections API) ---
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WCS_REST_URL,
    auth_credentials=weaviate.auth.AuthApiKey(WCS_API_KEY),
    headers={"X-OpenAI-Project": "rag-demo-colab"}
)


print("Connected to Weaviate:", client.is_connected())

COLLECTION = "Note"

try:
    client.collections.delete(COLLECTION)
except Exception:
    pass

notes = client.collections.create(
    name=COLLECTION,
    # no auto-vectorization; we'll push our own vectors
    vectorizer_config=None,
    properties=[
        Property(name="text", data_type=DataType.TEXT)
    ],
    vector_index_config=Configure.VectorIndex.hnsw()
)

print("Collection ready:", notes.name)

docs = [
    "RAG ties retrieval over your corpus to an LLM so answers stay grounded.",
    "FAISS is efficient for similarity search over dense vectors.",
    "Weaviate is a vector DB: store vectors plus metadata, and query by vector.",
    "Sentence Transformers turn text into dense vectors that capture semantics.",
    "In RAG, you retrieve top-k passages and feed them into the LLM as context."
]

embed = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = embed.encode(docs, normalize_embeddings=True)

for text, vec in zip(docs, vecs):
    notes.data.insert(
        properties={"text": text},
        vector=vec.astype(np.float32)
    )

print("Inserted", notes.aggregate.over_all().total_count, "objects")

def retrieve(query: str, k: int = 3):
    qv = embed.encode([query], normalize_embeddings=True)[0].astype(np.float32)
    res = notes.query.near_vector(
        near_vector=qv,
        limit=k,
        return_properties=["text"],
        return_metadata=weaviate.classes.query.MetadataQuery(distance=True)
    )
    items = [
        {"text": o.properties["text"], "distance": o.metadata.distance}
        for o in res.objects
    ]
    return items

# --- Local LLM via llama.cpp ---
# Tip: On Colab CPU, keep n_threads modest. If you have Colab T4, you can set n_gpu_layers > 0 if compiled with CUDA.
llm = Llama(
    model_path=GGUF_PATH,
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=0,     # set >0 only if your wheel supports GPU and your Colab has CUDA
    verbose=False
)

def answer_with_llama(context: str, question: str, temperature: float = 0.2, max_tokens: int = 512):
    prompt = f"""You are a precise assistant. Answer using ONLY the context. If insufficient, say "I don't know".

Context:
{context}

Question: {question}
Answer:"""
    # Prefer chat API if the model is chat-tuned; otherwise use .create_completion
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"].strip()

def rag_answer(query: str, k: int = 3):
    hits = retrieve(query, k=k)
    ctx = "\n\n".join(f"[{i+1}] {h['text']}" for i, h in enumerate(hits))
    answer = answer_with_llama(ctx, query)
    return answer, hits


q = "What is FAISS used for in a RAG pipeline?"
ans, hits = rag_answer(q, k=3)

print("Query:", q)
print("\nTop-K retrieved:")
for i, h in enumerate(hits, 1):
    print(f"{i}. (distance={h['distance']:.4f}) {h['text']}")

print("\nAnswer:\n", ans)

client.close()  # release the Weaviate connection when done




### Building a RAG system from PDF files




In [None]:
pip install numpy weaviate-client faiss-cpu sentence-transformers llama-cpp-python pypdf


In [None]:
import os
import pathlib
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama
from pypdf import PdfReader
import faiss  # FAISS for vector storage
from huggingface_hub import hf_hub_download

# ==========================================
# 1. CONFIGURATION & INITIALIZATION
# ==========================================

DATA_PATH = "/content/drive/MyDrive/CYBERSECURITÉ FONDEMENTS ET PRATIQUES AVANCÉES - Introduction.pdf"

repo_id = "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
filename = "qwen2.5-1.5b-instruct-q5_k_m.gguf"

print("⏳ Downloading the model from HuggingFace...")
MODEL_PATH = hf_hub_download(repo_id=repo_id, filename=filename)
print("Model downloaded:", MODEL_PATH)


⏳ Downloading the model from HuggingFace...
Model downloaded: /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct-GGUF/snapshots/91cad51170dc346986eccefdc2dd33a9da36ead9/qwen2.5-1.5b-instruct-q5_k_m.gguf


In [None]:

# ==========================================
# 2. Chargement du modèle d'embedding
# ==========================================
print("⏳ Chargement du modèle d'embedding...")
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

# ==========================================
# 3. Chargement du LLM
# ==========================================
print("⏳ Chargement du LLM...")
llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=2048,
    verbose=False
)

# ==========================================
# 4. Lecture du PDF
# ==========================================
print("⏳ Lecture du PDF et création des chunks...")
reader = PdfReader(DATA_PATH)
texts = [page.extract_text() for page in reader.pages if page.extract_text() is not None]

# Fractionnement simple en chunks
chunk_size = 500  # caractères
chunks = []
for t in texts:
    for i in range(0, len(t), chunk_size):
        chunks.append(t[i:i+chunk_size])



⏳ Chargement du modèle d'embedding...
⏳ Chargement du LLM...


llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


⏳ Lecture du PDF et création des chunks...
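
The fixed-size split above can cut a sentence in half at a chunk boundary. A common mitigation is to let consecutive chunks overlap, as the `RecursiveCharacterTextSplitter` in Example 2 does with `chunk_overlap=50`. Below is a minimal character-based sketch of that idea. `chunk_text` and its `overlap` parameter are hypothetical names, not part of the notebook.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [w for w in windows if w.strip()]  # drop whitespace-only windows

sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, overlap=4)
for p in pieces:
    print(repr(p))
```

A larger overlap reduces the chance of splitting a key sentence, at the cost of more chunks to embed and some duplicated text in the retrieved context.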


In [None]:
# ==========================================
# 5. Embeddings and FAISS index
# ==========================================
print("⏳ Creating the embeddings and the FAISS index...")
vectors = embed_model.encode(chunks, normalize_embeddings=True)
dim = vectors.shape[1]

index = faiss.IndexFlatIP(dim)  # inner product equals cosine similarity on normalized vectors
index.add(np.array(vectors, dtype=np.float32))

# ==========================================
# 6. Retrieval and answer functions
# ==========================================
def retrieve_faiss(query, k=3):
    q_vec = embed_model.encode([query], normalize_embeddings=True).astype(np.float32)
    # IndexFlatIP returns inner-product scores (higher is better), not distances
    scores, indices = index.search(q_vec, k)
    # FAISS pads with -1 when fewer than k vectors are indexed; skip those slots
    results = [(chunks[i], float(s)) for s, i in zip(scores[0], indices[0]) if i >= 0]
    return results

def answer_with_llama(context, question, temperature=0.2, max_tokens=512):
    prompt = f"""You are a precise assistant. Answer using ONLY the context. If insufficient, say "I don't know".

Context:
{context}

Question: {question}
Answer:"""
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens
    )
    return out["choices"][0]["message"]["content"].strip()

def rag_answer(query, k=3):
    hits = retrieve_faiss(query, k)
    ctx = "\n\n".join(f"[{i+1}] {h[0]}" for i, h in enumerate(hits))
    answer = answer_with_llama(ctx, query)
    return answer, hits



⏳ Creating the embeddings and the FAISS index...


In [None]:
# ==========================================
# 7. User input loop
# ==========================================
while True:
    query = input("\nAsk your question (or type 'exit' to quit): ")
    if query.strip().lower() in ['exit', 'quit']:
        print("End of program.")
        break

    ans, hits = rag_answer(query, k=3)

    print("\nTop-K retrieved chunks:")
    for i, (text, score) in enumerate(hits, 1):
        print(f"{i}. (score={score:.4f}) {text[:100]}...")

    print("\nGenerated answer:\n", ans)



Top-K retrieved chunks:
1. (score=0.7085) INTRODUCTION A LA SECURITE 
INFORMATIQUE
La Triade CIA vs DAD
• Disclosure: est l'exposition d'infor...
2. (score=0.6750) INTRODUCTION A LA SECURITE 
INFORMATIQUE
La Triade CIA et la Non-Répudiation
• La non-répudiation, m...
3. (score=0.6510) INTRODUCTION A LA SECURITE 
INFORMATIQUE
La Triade CIA vs DAD
16
CYBERSECURITE - FONDEMENTS ET PRATI...

Generated answer:
 La triade CID/CIA, également connue sous le nom de la triade CIA, comprend les trois propriétés fondamentales suivantes :

1. Confidentialité (Confidentiality)
   - Disclosure: Exposition d'informations sensibles à des personnes non autorisées.
   - Alteration: Modification non autorisée des informations.
   - Destruction/Denial: Perturbation de l'accès légitime d'un utilisateur autorisé.

2. Intégrité (Integrity)
   - Alteration: Modification non autorisée des informations.

3. Autonomie (Availability)
   - Destruction/Denial: Perturbation de l'accès légitime d'un utilisateur autori