## Retrieval-Augmented Generation

<div style="text-align: center;">
    <img src="images/rag.png" alt="RAG">
</div>

### Embeddings and semantics

Goal: encode a text as a vector so that two semantically similar texts are mapped to nearby vectors.

![Vectors and semantics](images/vectors-and-semantics.png "Vectors")

### Embeddings: bag of words

![Bag of words](images/Bag-of-words.png "BoW")

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Demonstration text, first document',
    "Demo text, and here's a second document.",
    'And finally, this is the third document.'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary :", vectorizer.get_feature_names_out())
print("BoW vector:\n", X.toarray())

Vocabulary : ['and' 'demo' 'demonstration' 'document' 'finally' 'first' 'here' 'is'
 'second' 'text' 'the' 'third' 'this']
BoW vector:
 [[0 0 1 1 0 1 0 0 0 1 0 0 0]
 [1 1 0 1 0 0 1 0 1 1 0 0 0]
 [1 0 0 1 1 0 0 1 0 0 1 1 1]]
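
To make the counting explicit, the sketch below rebuilds the same matrix with the standard library only. It assumes the default `CountVectorizer` behavior (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary). The helper names `tokenize` and `bag_of_words` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'Demonstration text, first document',
    "Demo text, and here's a second document.",
    'And finally, this is the third document.'
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (assumed to mirror CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # The vocabulary is the sorted set of all tokens in the corpus.
    vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One count per vocabulary word, in vocabulary order.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, X = bag_of_words(corpus)
print(vocabulary)
for row in X:
    print(row)
```

Word order is discarded: two sentences made of the same words in a different order get the same vector.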


### Transformer-based embeddings

In [1]:
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence.", "Each sentence is converted into a fixed-sized vector."]

# Trained mostly on English-language data.
# Designed to be lightweight and fast while keeping good accuracy on English.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
    print(f'"{sentence}" -> {embedding[:3]}...')

print(f"Embedding size: {len(embedding)}")



In [None]:
from sentence_transformers import SentenceTransformer

# Trained with a paraphrase-detection objective on a multilingual corpus.
# Balanced performance on semantic similarity, information retrieval, and zero-shot classification across several languages.

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
print(model.encode(["Texte à encoder"]))

### Semantic similarity

In [24]:
import numpy as np

def cosine_similarity(A, B):
    dot_product = np.dot(A, B)
    norm_A = np.linalg.norm(A)
    norm_B = np.linalg.norm(B)
    return dot_product / (norm_A * norm_B)

cosine_similarity(embeddings[0], embeddings[1])

np.float32(0.37034684)
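
To help read such a score, here is a small sanity check on hand-made 2-D vectors. It is a sketch using only the standard library; the function `cosine` is a plain-Python equivalent of the formula above, and the toy vectors are made up for illustration.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

The three values are close to 1, 0 and -1 respectively. Since the score depends only on direction and not on vector length, scaling a vector leaves it unchanged.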

### OpenAI embeddings

In [28]:
from openai import OpenAI

openai = OpenAI()

def embed(text, model="text-embedding-3-large", dimensions=3072):  # 3072: maximum dimension
    return openai.embeddings.create(input = [text], model=model, dimensions=dimensions).data[0].embedding

vector1 = embed("What is Mycobacterium kansasii ?")
vector2 = embed("To sum up, we have presented a case of Mycobacterium kansasii monoarthritis of the elbow complicated with unusual clinical and radiological findings. A combination of synovectomy and multidrug antimycobacterial treatment yielded a favorable clinical course without recurrence of arthritis after 10 months of follow-up. This case emphasizes the need to consider this rare infection in the differential diagnosis of intra-articular soft tissue tumor-like lesions of the elbow even in immunocompetent patients.")
cosine_similarity(vector1, vector2)

np.float64(0.5604925298797377)
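
The `dimensions` parameter asks the API for a shorter vector. A full-size vector that is already stored can also be shortened locally by truncating it and rescaling it to unit length, the approach OpenAI's embeddings guide describes for the `text-embedding-3` models. The sketch below shows only that arithmetic on a toy vector; the helper `shorten` is hypothetical, and whether its result matches exactly what the API returns for a smaller `dimensions` is an assumption, not something the code checks.

```python
import math

def shorten(vector, dimensions):
    # Keep the first `dimensions` components...
    head = vector[:dimensions]
    # ...then rescale so the shortened vector has unit L2 norm again.
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]      # toy unit-length vector, not a real embedding
short = shorten(full, 2)

print(len(short), math.sqrt(sum(x * x for x in short)))
```

Shorter vectors reduce storage and search cost, usually at some loss in retrieval quality.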

### RAG: basic principle

<div style="text-align: center;">
    <img src="images/rag2.png" alt="RAG">
</div>

In [27]:
from langchain_mistralai.chat_models import ChatMistralAI

llm = ChatMistralAI(model_name="mistral-large-latest")

query = "What is Mycobacterium kansasii ?"
context = "To sum up, we have presented a case of Mycobacterium kansasii monoarthritis of the elbow complicated with unusual clinical and radiological findings. A combination of synovectomy and multidrug antimycobacterial treatment yielded a favorable clinical course without recurrence of arthritis after 10 months of follow-up. This case emphasizes the need to consider this rare infection in the differential diagnosis of intra-articular soft tissue tumor-like lesions of the elbow even in immunocompetent patients."

text = f"""You are an expert in the Mycobacterium field. 
Answer to the following question by only using the context below.

question: {query}

context : {context}"""

response = llm.invoke(text)
print(response.content)

Mycobacterium kansasii
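
In practice the context is rarely a single passage: the retriever returns several. The sketch below shows one possible way to assemble them into a prompt. The function `build_prompt`, the numbering scheme, and the fallback instruction are illustrative choices, not part of LangChain or of the cell above.

```python
def build_prompt(query, passages):
    # Number each passage so the answer can refer back to its source.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the following question using only the context below.\n"
        "If the context is not sufficient, say so.\n\n"
        f"question: {query}\n\n"
        f"context:\n{context}"
    )

passages = [
    "We have presented a case of Mycobacterium kansasii monoarthritis of the elbow.",
    "Analysis of genome sequencing data revealed a previously unknown lineage, proposed name L10.",
]
prompt = build_prompt("What is Mycobacterium kansasii ?", passages)
print(prompt)
```

The resulting string can be passed to `llm.invoke` exactly like `text` above.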


### Implementing a vector store

In [31]:
# pip install langchain-community langchain-huggingface faiss-cpu pypdf

import warnings; warnings.simplefilter('ignore')
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

loader = PyPDFLoader("images/Guyeux_2024.pdf")
pages = loader.load_and_split()

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
faiss_index = FAISS.from_documents(pages, embeddings)
docs = faiss_index.similarity_search("Is there a lineage 10 in M.tuberculosis?", k=2)

In [32]:
from textwrap import shorten, fill

for doc in docs:
    print(f"Page {doc.metadata['page']}: {fill(shorten(doc.page_content, 500), 80)}\n")

Page 3: M. africanum Lineage 10, Central Africa Conclusions Through the extensive mining
of WGS and genotyp- ing databases, we newly identified a thus far rare M.
tuberculosis complex lineage, L10 (proposed), pres- ent in central Africa. The
lineage is characterized by a new region of deletion, IS6110 insertions, and 243
SNPs, including gyrA G7901T, recN C1920096T, and dnaG C2621730T. L10 represents
a sister clade to L6, found mainly in western Africa, and L9, specifically in
eastern Africa, and [...]

Page 0: nity of Lille, Lille, France (P. Supply, C. Gaudin); London School of Hygiene
and Tropical Medicine, London, UK (J.E. Phelan, T.G. Clark, L. Rigouts, B. de
Jong); Université Paris-Saclay, Saint- Aubin, France (C. Sola); Université Paris
Cité, Paris (C. Sola) DOI: https://doi.org/10.3201/eid3003.231466 Analysis of
genome sequencing data from >100,000 genomes of Mycobacterium tuberculosis
complex using TB-Annotator software revealed a previously unknown lineage,
proposed name L10, 
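
Conceptually, a vector store does two things: it embeds every document once at indexing time, then embeds each query and returns the closest documents. The sketch below illustrates that idea with a brute-force cosine search over toy word-count vectors. It is not how FAISS works internally; `TinyVectorStore`, `toy_embed`, and the three sample texts are made up for illustration.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Toy embedding: sparse word counts, a stand-in for a real embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    def __init__(self, texts):
        # Index step: embed every text once and keep (vector, text) pairs.
        self.entries = [(toy_embed(t), t) for t in texts]

    def similarity_search(self, query, k=2):
        # Query step: embed the query, score every entry, return the top k.
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore([
    "A new lineage of M. tuberculosis was found in central Africa",
    "FAISS is a library for similarity search",
    "The elbow infection was treated with synovectomy",
])
results = store.similarity_search("Is there a new lineage in M. tuberculosis?", k=1)
print(results)
```

The linear scan costs one comparison per stored document; dedicated vector stores replace it with specialized indexes so that search stays fast on large collections.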

### OpenAI version

In [None]:
# pip install langchain-community langchain-openai faiss-cpu pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

loader = PyPDFLoader("images/Guyeux_2024.pdf")
pages = loader.load_and_split()

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("Is there a lineage 10 in M.tuberculosis?", k=2)

### Text splitters

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = '''Vous pouvez partager un article en cliquant sur les icônes de partage en haut à droite de celui-ci. 
La reproduction totale ou partielle d’un article, sans l’autorisation écrite et préalable du Monde, est strictement interdite. 
Pour plus d’informations, consultez nos conditions générales de vente. 

Comme la finance, la politique est parfois affaire d’opportunités. Aux Etats-Unis, l’opposition démocrate à Donald Trump a en tout cas trouvé un nouvel angle d’attaque après l’annonce par le président américain d’une pause dans sa guerre commerciale : elle le soupçonne d’avoir manipulé les marchés boursiers et d’avoir ainsi favorisé des délits d’initié.
Lire aussi | Article réservé à nos abonnés Droits de douane : les Bourses rechutent, l’inquiétude s’étend aux emprunts d’Etat

Le sénateur Adam Schiff a écrit, jeudi 10 avril, au directeur par intérim du Bureau pour l’éthique gouvernementale (Office of Government Ethics, OGE), une agence fédérale indépendante, et à Susan Wiles, la cheffe de cabinet de la Maison Blanche, pour leur demander d’ouvrir une enquête « urgente » afin de déterminer si « le président Trump, sa famille ou d’autres membres de [son] administration » ont commis la veille des délits d’initié en profitant d’informations confidentielles sur le revirement de sa politique commerciale.

'''

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    keep_separator=False,
    separators=["\n\n", "\n", ". "]
)

texts = text_splitter.create_documents([text])

for k in texts[:7]:
    print(k.page_content)
    print("="*20+'\n')
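
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified splitter: a fixed-size character window that slides forward by `chunk_size - chunk_overlap`. It ignores separators entirely, so it is not the recursive algorithm used by `RecursiveCharacterTextSplitter`; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one. The overlap keeps a sentence that straddles a boundary from being cut off from its context in both chunks.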


### Loaders (LangChain)

In [None]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=YcIbZGTRMjI", 
    language=['fr'],
    add_video_info=False
)

print(loader.load())

### Vectorstores

There are many vector stores to choose from, for example:
 - FAISS, Chroma: easy to learn and deploy.
 - Milvus: multiple embeddings, BM25, column-based filtering.

<div style="text-align: center;">
    <img src="images/Milvus.png" alt="RAG">
</div>
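
BM25, listed above among the Milvus features, is a keyword-based ranking function that complements embedding search. The sketch below implements the usual Okapi BM25 formula with a non-negative IDF variant and the common defaults `k1=1.5`, `b=0.75`. It is a minimal illustration, not Milvus's implementation; the function names and the three sample documents are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "FAISS and Chroma are easy to deploy",
    "Milvus supports BM25 and column filtering",
    "Bag of words counts word occurrences",
]
scores = bm25_scores("Milvus BM25", docs)
print(scores)
```

Only the second document contains the query terms, so it is the only one with a non-zero score. Unlike cosine similarity on embeddings, BM25 rewards exact term matches, which helps with rare identifiers such as gene or strain names.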