For this project, we will use LLMs (Large Language Models).

To do so, you need to create an OpenAI API key. This costs 5 euros, which is the minimum amount required to activate a key. The project itself will only consume a few cents, so the remaining balance can be used for future projects.

*N.B.: Free tools can also be used. Ollama, for example, gives access to models such as Llama 3, which are sufficient for the needs of this project.*

In [37]:
from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
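
`os.getenv` returns `None` without raising when the variable is missing, so a forgotten `.env` entry only shows up later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of `dotenv` or the standard library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment.")
    return value


# Hypothetical usage, after load_dotenv() has been called:
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```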

# Loading the data
Load the JSONL file meta.jsonl, which contains product descriptions.

In [2]:
import pandas as pd

df_meta = pd.read_json('data/meta.jsonl', lines=True)
df_meta.head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author
0,Cell Phones & Accessories,ARAREE Slim Diary Cell Phone Case for Samsung ...,3.8,5,"[Genuine Cow leather with 6 different colors, ...","[JUST LOOK, You can tell the difference. Make ...",,[{'thumb': 'https://m.media-amazon.com/images/...,[],araree,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Product Dimensions': '3.35 x 0.59 x 6.18 inc...,B013SK1JTY,,,
1,Cell Phones & Accessories,Bastmei for OnePlus 7T Case Extremely Light Ul...,4.4,177,[Ultra-thin & Ultra-light: The ultra slim fit ...,[],11.98,[{'thumb': 'https://m.media-amazon.com/images/...,[],Bastmei,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Package Dimensions': '7.6 x 4.29 x 0.75 inch...,B07ZPSG8P5,,,
2,Cell Phones & Accessories,Wireless Fones Branded New Iphone 5C/LITE Hot ...,4.0,2,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[],WIRELESS FONES,"[Cell Phones & Accessories, iPhone Accessories]","{'Item model number': 'Apple Iphone 5C', 'Othe...",B00GKR3L12,,,
3,Cell Phones & Accessories,"iPhone 6 Plus + Case, DandyCase Perfect PATTER...",4.0,15,"[Slim-Fit design for the iPhone 6 Plus (5.5"" s...",[Case does not need to be removed for charging...,,[{'thumb': 'https://m.media-amazon.com/images/...,[],DandyCase,"[Cell Phones & Accessories, iPhone Accessories]",{'Product Dimensions': '5.43 x 0.28 x 2.64 inc...,B00PB8U8BW,,,
4,Cell Phones & Accessories,"Case for Galaxy S6/S6 Edge, Thin Translucent V...",4.0,1,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[],7Pite,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Package Dimensions': '8.31 x 3.74 x 0.55 inc...,B07D3RHSRV,,,


Removing empty descriptions

In [3]:
df_meta = df_meta[df_meta['description'].apply(lambda x: len(x) > 0)]
df_meta.head()

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together,subtitle,author
0,Cell Phones & Accessories,ARAREE Slim Diary Cell Phone Case for Samsung ...,3.8,5,"[Genuine Cow leather with 6 different colors, ...","[JUST LOOK, You can tell the difference. Make ...",,[{'thumb': 'https://m.media-amazon.com/images/...,[],araree,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Product Dimensions': '3.35 x 0.59 x 6.18 inc...,B013SK1JTY,,,
3,Cell Phones & Accessories,"iPhone 6 Plus + Case, DandyCase Perfect PATTER...",4.0,15,"[Slim-Fit design for the iPhone 6 Plus (5.5"" s...",[Case does not need to be removed for charging...,,[{'thumb': 'https://m.media-amazon.com/images/...,[],DandyCase,"[Cell Phones & Accessories, iPhone Accessories]",{'Product Dimensions': '5.43 x 0.28 x 2.64 inc...,B00PB8U8BW,,,
5,Cell Phones & Accessories,Rikki Knight 3D Chevron Peach on White with An...,1.0,1,"[Includes attached ribbon Wristlet Strap., Bla...",[The Rikki Knight new Flip Wallet iPhone case ...,22.42,[{'thumb': 'https://m.media-amazon.com/images/...,[],Rikki Knight,"[Cell Phones & Accessories, iPhone Accessories]","{'Product Dimensions': '5 x 2.5 x 0.1 inches',...",B00JIFK1HA,,,
7,Cell Phones & Accessories,Piel Frama Wallet Case for Samsung Galaxy Nexu...,5.0,1,"[Sync through travel cable, Removable belt cli...","[Handcrafted in Spain by Leather artisans, pie...",,[{'thumb': 'https://m.media-amazon.com/images/...,[],Piel Frama,"[Cell Phones & Accessories, Cases, Holsters & ...",{'Product Dimensions': '5.71 x 3.15 x 0.79 inc...,B0071OC36W,,,
11,Sports & Outdoors,"BCSLINE Outdoor Exercise Jogging Armband Case,...",3.2,5,"[Practical&Beautiful high quality Armband, Exe...",[Features: Carry your phone around with style!...,,[{'thumb': 'https://m.media-amazon.com/images/...,[],BCSCasek,"[Cell Phones & Accessories, Cases, Holsters & ...","{'Brand Name': 'BCSCasek', 'Color': 'Blue', 'M...",B00JI55D2I,,,


# Preprocessing and splitting the texts
Install the LangChain library (alternative: LlamaIndex).

In [4]:
#!pip install langchain

Split long descriptions into appropriately sized chunks with RecursiveCharacterTextSplitter.
- Chunk size (chunk_size): for example, 512 characters (the function below defaults to 256).
- Overlap between chunks (chunk_overlap): 128 characters.

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def preprocess_and_split_descriptions(descriptions, chunk_size=256, chunk_overlap=128):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    split_texts = []
    
    for desc in descriptions:
        if isinstance(desc, list):
            desc = " ".join(desc)
        elif pd.isna(desc):
            desc = ""
        
        chunks = text_splitter.split_text(desc)
        split_texts.append(chunks)
    
    return split_texts

df_meta["description_chunks"] = preprocess_and_split_descriptions(df_meta["description"])

print(df_meta[["description_chunks"]].head(3))

                                  description_chunks
0  [JUST LOOK, You can tell the difference. Make ...
3  [Case does not need to be removed for charging...
5  [The Rikki Knight new Flip Wallet iPhone case ...


### *Note:*
*A chunk size of 256 characters (RecursiveCharacterTextSplitter counts characters by default, not tokens) keeps each chunk long enough to be meaningful while staying fine-grained enough for tasks such as search or question answering.*

*The overlap of 128 characters (50% of the chunk) ensures that important parts of the text appear in several chunks.*
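
To make the interaction between chunk size and overlap concrete, here is a simplified sketch of a fixed-size sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to cut on separators such as paragraphs and spaces); `sliding_window_chunks` is a hypothetical helper used for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int = 256, chunk_overlap: int = 128) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters, each sharing
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


demo_text = "".join(str(i % 10) for i in range(600))
demo_chunks = sliding_window_chunks(demo_text)
print([len(c) for c in demo_chunks])  # [256, 256, 256, 216]
```

With a 50% overlap, the second half of each chunk is repeated as the first half of the next one, which roughly doubles the number of chunks to embed and store.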

# Creating a vector index
Generate an embedding for each text chunk using the HuggingFace sentence-embedding model all-MiniLM-L6-v2.

OllamaEmbeddings can be used as an alternative.

In [6]:
sentences_list = df_meta["description_chunks"].explode().dropna().unique().tolist()
sentences_list

['JUST LOOK, You can tell the difference. Make everyday more convenient, it is slim but has big rooms. If you are looking for a rich and luxurious appearance, look no further. These double shoulders are the perfect leather for creating attractive finished',
 "rich and luxurious appearance, look no further. These double shoulders are the perfect leather for creating attractive finished belts, straps and wallets. It doesn't only show the perfect weight for accessories where rugged durability is needed but also",
 "belts, straps and wallets. It doesn't only show the perfect weight for accessories where rugged durability is needed but also has a natural finish and coarse grain.",
 'Case does not need to be removed for charging. Camera opening allows unobstructed use of camera and flash. DandyCase proudly presents the premium "PERFECT PATTERN" from the line of stylish cases that will make your friends jealous! Stand out from the rest',
 'the premium "PERFECT PATTERN" from the line of stylis

In [7]:
from langchain.embeddings import HuggingFaceEmbeddings

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
hf_embeddings = HuggingFaceEmbeddings(model_name=MODEL_NAME)



# Creating a vector database
Store the generated embeddings in a Chroma vector database. Chroma.from_texts computes the embeddings with hf_embeddings and persists them to disk.

In [8]:
from langchain.vectorstores import Chroma
CHROMA_DB_PATH = "./chroma_db"

vector_store = Chroma.from_texts(
    texts=sentences_list,
    embedding=hf_embeddings,
    persist_directory=CHROMA_DB_PATH
)

# Creating a retrieval system
Configure a retriever on top of the vector store to find the descriptions relevant to user queries.

In [9]:
retriever = vector_store.as_retriever()

# Configuring the LLM
Structure the prompt using prompt engineering techniques:
- Tell the model to use only the information coming from the database.
- Add a clear instruction not to rely on its internal knowledge.
- Specify that the model must say it does not know when it cannot find an appropriate answer.
- Explain in the prompt that the model acts as an assistant that answers user questions from the documents provided.
- Ask the model to give the exact passages or the specific documents from which it extracted the information.
- Specify that the model must not generate sensitive, inappropriate, or potentially incorrect information.
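
The points above can be combined into a single template. The sketch below is one possible wording, not the prompt used in this notebook; `STRICT_PROMPT` and `build_prompt` are hypothetical names, and plain `str.format` stands in for LangChain's PromptTemplate.

```python
STRICT_PROMPT = """You are an assistant that answers user questions using only the documents provided below.
Do not rely on prior or internal knowledge.

Documents:
{documents}

Question: {question}

Rules:
- If the documents do not contain the answer, reply exactly: "I don't know."
- After your answer, quote the exact passages you relied on.
- Do not produce sensitive, inappropriate, or unverifiable information.
"""


def build_prompt(documents: list[str], question: str) -> str:
    """Fill the template, separating the documents with blank lines."""
    return STRICT_PROMPT.format(documents="\n\n".join(documents), question=question)


print(build_prompt(["First passage.", "Second passage."], "Tell me about OnePlus 6T"))
```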

In [10]:
prompt = """
You are an intelligent assistant that answers questions using only the information contained in the relevant documents provided. Here are the relevant documents for this question: {documents}
Based solely on these documents, answer the following question precisely and concisely: {question}.
"""

# Retrieving relevant documents
Use the retriever to find the documents relevant to the user query.
- The user query is converted into an embedding with the same model used for the documents.
- ChromaDB's search method then finds the closest documents in the vector space.
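
The two steps above amount to a nearest-neighbour search. The sketch below illustrates the idea on toy 3-dimensional vectors using cosine similarity; it is an illustration only, since Chroma's actual distance metric and index structure may differ, and `cosine_similarity`, `top_k`, and the vectors are all hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), index) for index, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))  # [0, 1]
```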

In [11]:
from langchain.chains.retrieval_qa.base import RetrievalQA

user_query = "Tell me about OnePlus 6T"

relevant_docs = retriever.get_relevant_documents(user_query)
relevant_docs



[Document(metadata={}, page_content='with our cutting-edge Screen Unlock technology. Featuring our largest display ever and a resilient glass back, the OnePlus 6T was crafted with care and purpose. Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body'),
 Document(metadata={}, page_content='with our cutting-edge Screen Unlock technology. Featuring our largest display ever and a resilient glass back, the OnePlus 6T was crafted with care and purpose. Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body'),
 Document(metadata={}, page_content='with our cutting-edge Screen Unlock technology. Featuring our largest display ever and a resilient glass back, the OnePlus 6T was crafted with care and purpose. Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body'),
 Document(metadata={}, page_content='our Fast Charge technology gets you up and running in just half an hour. Our oper
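
The output above lists the same passage three times. One plausible cause, stated here as an assumption since the notebook does not say, is that the Chroma.from_texts cell was run several times against the same persist_directory, inserting the texts again on each run. Independently of the cause, duplicates can be filtered before the documents reach the prompt. In this sketch, `Doc` is a hypothetical stand-in for a retrieved document and `deduplicate` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; only `page_content` is used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def deduplicate(docs: list) -> list:
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


retrieved = [Doc("passage A"), Doc("passage A"), Doc("passage A"), Doc("passage B")]
print([d.page_content for d in deduplicate(retrieved)])  # ['passage A', 'passage B']
```

Removing duplicates leaves room in the prompt for other passages, since the retriever returns a fixed number of results.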

# Retrieval chain
- Combine the relevant product descriptions ({documents}) with the user question.
- Create an LLM with gpt-3.5-turbo for OpenAI.

*Llama 3.1 can also be used with Ollama.*

- Generate a structured response.

In [18]:
from langchain_core.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_key=OPENAI_API_KEY
)

qa_prompt = PromptTemplate(
    template=prompt,
    input_variables=["documents", "question"]
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": qa_prompt,
        "document_variable_name": "documents"
    }
)



In [33]:
response = retrieval_qa({"query": user_query})

print("Question:", user_query)
print("Generated response:", response['result'])

Question: Tell me about OnePlus 6T
Generated response: The OnePlus 6T features a large 6.41 inch Optic AMOLED display with an 86% screen-to-body ratio, a resilient glass back, and cutting-edge Screen Unlock technology. It also includes Fast Charge technology for quick charging and runs on Android Pie for a smooth user experience.


# Verifying that your RAG architecture works
## Consistency between the responses and the data
Check that the responses produced by the LLM match the information present in the meta.jsonl file.

Inspect the relevant documents returned by the retriever and passed into the prompt.

*Is the response given by the LLM actually based on the relevant retrieved data?*

In [43]:
print("Sources:")
for source in response['source_documents']:
    print(source)

Sources:
page_content='with our cutting-edge Screen Unlock technology. Featuring our largest display ever and a resilient glass back, the OnePlus 6T was crafted with care and purpose. Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body'
page_content='with our cutting-edge Screen Unlock technology. Featuring our largest display ever and a resilient glass back, the OnePlus 6T was crafted with care and purpose. Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body'
page_content='with our cutting-edge Screen Unlock technology. Featuring our largest display ever and a resilient glass back, the OnePlus 6T was crafted with care and purpose. Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body'
page_content='our Fast Charge technology gets you up and running in just half an hour. Our operating system is all about ensuring your phone works for you, not the other way around. Po

### *Note:*
*The response given by the LLM does come from the meta.jsonl data, and it is grounded in the relevant passages that were retrieved.*
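
This comparison was done by eye. A rough automated check, sketched below under strong simplifying assumptions, is to verify that every number quoted in the answer also appears in a source passage. `unsupported_numbers` is a hypothetical helper; it only catches invented figures, not paraphrased or invented statements.

```python
import re

# Integers or decimals, optionally followed by a percent sign (e.g. "6", "6.41", "86%").
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?%?")


def unsupported_numbers(answer: str, sources: list[str]) -> list[str]:
    """Return the numbers found in `answer` that appear in none of the `sources`."""
    source_numbers = set()
    for source in sources:
        source_numbers.update(NUMBER_PATTERN.findall(source))
    return [n for n in NUMBER_PATTERN.findall(answer) if n not in source_numbers]


source_passages = [
    "with our cutting-edge Screen Unlock technology. Featuring our largest display ever "
    "and a resilient glass back, the OnePlus 6T was crafted with care and purpose. "
    "Experience a 6.41 inch Optic AMOLED display for true immersion through an 86% screen-to-body"
]
generated_answer = (
    "The OnePlus 6T features a large 6.41 inch Optic AMOLED display "
    "with an 86% screen-to-body ratio"
)
print(unsupported_numbers(generated_answer, source_passages))  # []
```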

## Ability of the LLM to recognize its limits
Check that the model clearly states that it cannot answer when it receives a question whose answer is not in the meta.jsonl file.

The model must not use its internal knowledge to generate responses.

To enforce this behavior, we modify the prompt by adding explicit instructions, for example: "Answer using only the information provided. If no relevant information is available, say that you do not know."

In [53]:
prompt = """
You are an intelligent assistant that answers questions using only the information contained in the relevant documents provided. Here are the relevant documents for this question: {documents}
Based solely on these documents, answer the following question precisely and concisely: {question}.
If no relevant information is available, please indicate that you do not know.
"""

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_key=OPENAI_API_KEY
)

qa_prompt = PromptTemplate(
    template=prompt,
    input_variables=["documents", "question"]
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": qa_prompt,
        "document_variable_name": "documents"
    }
)

user_query = "What is the best song in the world?"

response = retrieval_qa({"query": user_query})
print("Question:", user_query)
print("Generated response:", response['result'])

Question: What is the best song in the world?
Generated response: I do not have any information on what the best song in the world is based on the provided documents.


### *Note:*
*The model states that it has no information on the subject in the provided documents, which is the expected behavior.*
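
When many out-of-scope questions have to be tested, reading each response by hand becomes tedious. A crude keyword-based check is sketched below; the marker list and `looks_like_refusal` are hypothetical, and a response phrased differently would slip through.

```python
# Hypothetical list of phrases treated as "the model declined to answer".
REFUSAL_MARKERS = (
    "i do not know",
    "i don't know",
    "i do not have any information",
    "i don't have any information",
)


def looks_like_refusal(response_text: str) -> bool:
    """Return True if the response contains one of the known refusal phrases."""
    normalized = response_text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in normalized for marker in REFUSAL_MARKERS)


print(looks_like_refusal(
    "I do not have any information on what the best song in the world is "
    "based on the provided documents."
))  # True
```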



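## Impact of parameters on the responses
Test how parameters such as temperature and top-p influence the model's responses. Note that the cell below changes only the wording of the prompt; temperature and top-p are left at their default values.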

In [50]:
prompt = """
You are a passionate and creative assistant, deeply enamored with the task of providing insightful and imaginative responses. Drawing inspiration solely from the provided documents: {documents}, craft a captivating and heartfelt answer to the following query: {question}. Remember, your words have the power to enchant and delight!
"""

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_key=OPENAI_API_KEY
)

qa_prompt = PromptTemplate(
    template=prompt,
    input_variables=["documents", "question"]
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": qa_prompt,
        "document_variable_name": "documents"
    }
)

user_query = "Tell me about OnePlus 6T"

response = retrieval_qa({"query": user_query})
print("Question:", user_query)
print("Generated response:", response['result'])

Question: Tell me about OnePlus 6T
Generated response: The OnePlus 6T is a masterpiece of technological innovation and thoughtful design. With its cutting-edge Screen Unlock technology, you can effortlessly access your phone in an instant. The stunning 6.41 inch Optic AMOLED display provides a truly immersive viewing experience, with an impressive 86% screen-to-body ratio that will captivate your senses.

Crafted with care and purpose, the OnePlus 6T features a resilient glass back that exudes elegance and durability. With Fast Charge technology, you can power up your device in just half an hour, ensuring that you are always ready to go. The operating system, powered by Android Pie, is designed to work seamlessly with you, offering a plethora of features to enhance your user experience.

In summary, the OnePlus 6T is more than just a smartphone - it is a work of art that combines cutting-edge technology with thoughtful design. Experience the future of mobile technology with the OnePlus

### *Note:*
*The response is more creative and less coherent than the previous one.*
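
For reference, here is a simplified sketch of what temperature and top-p do to a next-token distribution. It uses toy logits and is not OpenAI's implementation; `softmax_with_temperature` and `top_p_filter` are hypothetical helpers. In the notebook, the equivalent experiment would consist of passing a different `temperature` to ChatOpenAI while keeping the prompt fixed.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float = 0.9) -> dict[int, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(toy_logits, temperature=0.5)
base = softmax_with_temperature(toy_logits, temperature=1.0)
hot = softmax_with_temperature(toy_logits, temperature=2.0)
print([round(p, 3) for p in cold], [round(p, 3) for p in base], [round(p, 3) for p in hot])
print(top_p_filter(base, top_p=0.8))
```

At low temperature almost all probability mass sits on the most likely token, which gives more repeatable answers; at high temperature the mass spreads out, which gives more varied ones. Top-p then removes the unlikely tail before sampling.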