<a href="https://colab.research.google.com/github/doukansurel/Retrieval-Augmented-Generation/blob/main/RAG_Metrics%26Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q llama-index==0.9.14.post3 deeplake==3.8.12 openai==1.3.8 cohere==4.37

In [3]:
import os

os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
os.environ["ACTIVELOOP_TOKEN"] = "ACTIVELOOP_TOKEN"

In [4]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index import VectorStoreIndex



In [5]:
# build service context
llm = OpenAI(model="gpt-4",temperature=0.0)
service_context = ServiceContext.from_defaults(llm=llm)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   269k      0 --:--:-- --:--:-- --:--:--  268k


In [7]:
from llama_index.node_parser import SimpleNodeParser
from llama_index import SimpleDirectoryReader

doc = SimpleDirectoryReader("/content/data/paul_graham").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(doc)

In [8]:
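# A VectorStoreIndex is an index, not a vector store, so it cannot be passed as `vector_store=`.
# Build a default (in-memory) storage context and index the parsed nodes in it.
storage_context = StorageContext.from_defaults()
node_index = VectorStoreIndex(nodes, storage_context=storage_context)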

In [13]:
index = VectorStoreIndex.from_documents(
    documents=doc
)

In [15]:
from llama_index.evaluation import FaithfulnessEvaluator

# Define the evaluator
evaluator = FaithfulnessEvaluator(service_context=service_context)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What does Paul Graham do?")

eval_results = evaluator.evaluate_response(response=response)

print("> response:", response)
print("> evaluator result:", eval_results.passing)

> response: Paul Graham is a writer and entrepreneur. He has written numerous essays on various topics and has published them online. He has also authored a book called "Hackers & Painters." In addition to writing, Paul Graham has worked on projects such as spam filters and has invested in startups as an angel investor. He is also one of the founders of Y Combinator, an angel firm.
> evaluator result: True


RAGAS

In [16]:
!pip install html2text==2020.1.16 ragas==0.0.22

Installing collected packages: pysbd, pyarrow-hotfix, jsonpointer, html2text, langsmith, jsonpatch, langchain-core, langchain-community, datasets, langchain, ragas
Successfully installed datasets-2.16.0 html2text-2020.1.16 jsonpatch-1.33 jsonpointer-2.4 langchain-0.0.352 langchain-community-0.0.6 langchain-core-0.1.3 langsmith-0.0.75 pyarrow-hotfix-0.6 pysbd-0.3.4 ragas-0.0.22


In [17]:
from llama_index.readers.web import SimpleWebPageReader
from llama_index import VectorStoreIndex, ServiceContext

documents = SimpleWebPageReader(html_to_text=True).load_data( ["https://en.wikipedia.org/wiki/New_York_City"] )

vector_index = VectorStoreIndex.from_documents(
    documents, service_context=ServiceContext.from_defaults(chunk_size=512)
)

query_engine = vector_index.as_query_engine()

response_vector = query_engine.query("How did New York City get its name?")

print(response_vector)

New York City got its name in honor of the Duke of York, who later became King James II of England. The Duke's elder brother, King Charles II, appointed him as the proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control.


Returning to our goal of evaluating the models, the next step is to create a set of questions, ideally derived from the original document, so that the performance assessment is more accurate.

In [18]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,000",  # incorrect answer
    "Queens",  # incorrect answer
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

eval_answers = [[a] for a in eval_answers]

This step sets up the evaluation. The QueryEngine's proficiency is judged by how effectively it processes and answers these specific questions, with the reference answers serving as the standard for measuring performance. We need to import the metrics from the Ragas library.

In [19]:
from ragas.metrics import(
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    harmfulness
]

The metrics list gathers the metrics into one collection, which the evaluation step then uses to assess different aspects of the QueryEngine's performance. The results contain a score for each metric and can be analyzed in more detail. Finally, let's run the evaluation:

In [20]:
from ragas.llama_index import evaluate
result  = evaluate(query_engine,metrics,eval_questions,eval_answers)

#print the final scores
print(result)

evaluating with [faithfulness]


100%|██████████| 1/1 [00:15<00:00, 15.75s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:18<00:00, 18.32s/it]


evaluating with [context_precision]


100%|██████████| 1/1 [00:01<00:00,  1.88s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:05<00:00,  5.15s/it]


evaluating with [harmfulness]


100%|██████████| 1/1 [00:01<00:00,  1.44s/it]


{'faithfulness': 0.8000, 'answer_relevancy': 0.7866, 'context_precision': 0.6000, 'context_recall': 0.8000, 'harmfulness': 0.0000}


faithfulness: measures how accurately the system's answers adhere to the factual content of the source material. A score of 0.80 means the answers are mostly accurate and faithful to the source.

answer_relevancy: measures how relevant the system's answers are to the given queries.

context_precision: evaluates the precision of the context the system uses to generate its answers.

context_recall: measures the recall of the relevant context identified by the system.

harmfulness: checks whether the system produces harmful or inappropriate content. A score of 0 means no harmful content was generated in the evaluated answers.
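To make the precision/recall distinction behind the two context metrics concrete, here is a minimal set-based sketch of the classic information-retrieval definitions. It is only an illustration: Ragas scores these metrics with LLM judgments rather than id matching, and the function name and chunk ids below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Classic IR precision and recall over sets of chunk ids (illustrative only)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of retrieved chunks that are actually relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant chunks that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: three chunks retrieved, two of them relevant, one relevant chunk missed
retrieved_ids = ["c1", "c2", "c7"]
relevant_ids = ["c1", "c2", "c5"]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision points to noisy retrieved context, while low recall points to needed context that never reached the LLM.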

The Custom RAG Pipeline Evaluation

In [21]:
!wget 'https://raw.githubusercontent.com/idontcalculate/data-repo/main/venus_transmission.txt'

--2023-12-29 12:47:21--  https://raw.githubusercontent.com/idontcalculate/data-repo/main/venus_transmission.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19241 (19K) [text/plain]
Saving to: ‘venus_transmission.txt’


2023-12-29 12:47:22 (27.6 MB/s) - ‘venus_transmission.txt’ saved [19241/19241]



In [22]:
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=["/content/venus_transmission.txt"])

docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 1 docs


SimpleNodeParser converts documents into a structured format known as nodes and lets us customize how documents are parsed, in particular the chunk size, the overlap between chunks, and the metadata attached to them. Each chunk of the document is treated as a node. Here the parser's chunk_size is set to 512, which means each node holds up to 512 tokens (not characters) of the original document. These chunks can then be used to build indexes.
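The effect of a chunk size and an overlap can be sketched with a plain sliding window. This is a rough illustration, not the SimpleNodeParser implementation: a whitespace split stands in for a real tokenizer, and the function name `chunk_tokens` and the overlap of 20 are assumptions made for the example.

```python
def chunk_tokens(text, chunk_size=512, chunk_overlap=20):
    """Split text into windows of at most chunk_size whitespace-separated tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap  # each window starts `step` tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic document of 1200 "tokens" (w0 ... w1199)
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_tokens(sample, chunk_size=512, chunk_overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

Consecutive chunks share their boundary tokens, so a sentence that straddles a chunk border is still seen in full by at least one chunk when it is shorter than the overlap.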

In [23]:
from llama_index.node_parser import SimpleNodeParser
from llama_index import VectorStoreIndex

# Build the index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(docs)
vector_index = VectorStoreIndex(nodes)

In [24]:
query_engine = vector_index.as_query_engine()

response_vector = query_engine.query("What was The first beings to inhabit the planet?")
print( response_vector.response )

The first beings to inhabit the planet were a dinoid and reptoid race from two different systems outside our solar system.


The response generated by the query engine is stored in response_vector. In other words, the document is parsed into nodes, indexed, and then queried with a language model. To inspect the response further, we can use the .source_nodes attribute to access the nodes that were retrieved from the index to answer the question.

In [25]:
# First retrieved node
response_vector.source_nodes[0].get_text()

"They had heard of this beautiful new planet. At this time, Earth had two moons to harmonize the weather conditions and control the tides of the large bodies of water.\nThe first beings to inhabit the planet were a dinoid and reptoid race from two different systems outside our solar system. They were intelligent and walked on two legs like humans and were war-like considering themselves to be superior to all other life forms. In the past, the four races of humans had conflicts with them before they outgrew such behavior. They arrived on Earth to rob it of its minerals and valuable gems. Soon they had created a terrible war. They were joined by re-\n1\nenforcements from their home planets. One set up its base on one of the Earth's moons, the other on Earth. It was a terrible war with advanced nuclear and laser weapons like you see in your science fiction movies. It lasted very long. Most of the life forms lay in singed waste and the one moon was destroyed. No longer interested in Earth,

In [26]:
response_vector.source_nodes[1].get_text()

"Due to the radiation, the survivors of the dinoids and reptoids mutated into the Dinosaurs and giant reptilians you know of in your history. The humans that were trapped there mutated into what you call Neanderthals.\nThe Earth remained a devastated ruin, covered by a huge dark nuclear cloud and what vegetation was left was being devoured by the giant beings, also humans and animals by some. It was this way for hundreds of years before a giant comet crashed into one of the oceans and created another huge cloud. This created such darkness that the radiating heat of the Sun could not interact with Earth's gravitational field and an ice age was created. This destroyed the mutated life forms and gave the four races the chance to cleanse and heal the Earth with technology and their energy.\nOnce again, they brought various forms of life to the Earth, creating again a paradise, except for extreme weather conditions and extreme tidal activities.\nDuring this time they realized that their pla

We can also display the text of the second node that the query engine found relevant, which supplies additional context for the query. This helps us understand the breadth of information the query engine retrieves and how different parts of the indexed document contribute to the overall response.

The generate_question_context_pairs function uses the LLM to create questions from the content of each node. Two questions are generated per node, which yields a dataset where each item consists of a context (the node's text) and the corresponding set of questions. This question-answering dataset lets us evaluate how well a RAG system handles question generation and context understanding.
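The shape of such a dataset can be sketched with plain dictionaries. This is an assumption-laden illustration: the questions are hand-written placeholders standing in for LLM output, the node texts are made up, and `build_qa_pairs` is a hypothetical helper, not the library function.

```python
import uuid


def build_qa_pairs(node_texts, questions_per_node):
    """Link every question to the node it was generated from."""
    corpus, queries, relevant_docs = {}, {}, {}
    for node_id, text in node_texts.items():
        corpus[node_id] = text
        for question in questions_per_node[node_id]:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # The source node is the one context a retriever is expected to return
            relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


# Hypothetical nodes and placeholder questions (two per node)
node_texts = {"node-1": "Text of the first chunk.", "node-2": "Text of the second chunk."}
questions_per_node = {
    "node-1": ["Placeholder question 1a?", "Placeholder question 1b?"],
    "node-2": ["Placeholder question 2a?", "Placeholder question 2b?"],
}

dataset = build_qa_pairs(node_texts, questions_per_node)
print(len(dataset["queries"]), "questions over", len(dataset["corpus"]), "nodes")
```

Because each question remembers its source node, retrieval can later be scored by checking whether that node comes back for the question.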

In [27]:
from llama_index.llms import OpenAI
from llama_index.evaluation import generate_question_context_pairs

# Define an LLM

llm = OpenAI(model="gpt-3.5-turbo")

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

queries = list(qa_dataset.queries.values())
print(queries[0:10])

100%|██████████| 13/13 [00:23<00:00,  1.78s/it]

['How did the beings described in the context information communicate with different life forms and dimensions? How did their telepathic abilities and technology contribute to their understanding of creation and the creator?', 'Describe the role of different races in the colonization of planets within our solar system, as mentioned in the context information. How did Earth differ from other planets during that time period and what was its status in relation to the other planets?', 'Explain the concept of creativity as understood by the advanced beings in the context. How did they use their creative energy and what were the responsibilities associated with it?', 'Describe the initial state of Earth before it became a planet and settled into an orbit around the Sun. How did the four races of humans contribute to the development of life on Earth?', "How did the arrival of the dinoid and reptoid races on Earth lead to a devastating war? Discuss the reasons behind their conflict with the fo




In [28]:
from llama_index.evaluation import RetrieverEvaluator

retriever = vector_index.as_retriever(similarity_top_k=2)

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr","hit_rate"],retriever=retriever
)

In [30]:
#Evaluate
import pandas as pd

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

def display_results(name,eval_results):

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

display_results("OpenAI Embedding Retriever", eval_results)

   Retriever Name              Hit Rate  MRR
0  OpenAI Embedding Retriever  0.923077  0.807692
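Hit rate and MRR, the two metrics requested from RetrieverEvaluator above, can be computed by hand on a toy example. The sketch below uses the standard definitions; the node ids and the helper names are hypothetical, and it makes no claim about how the library implements them internally.

```python
def hit_rate(expected_ids, retrieved_ids):
    """1.0 if any expected node appears among the retrieved nodes, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(expected_ids, retrieved_ids):
    """1 / rank of the first expected node in the retrieved list, or 0.0 on a miss."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical (expected, retrieved) pairs with top-2 retrieval
examples = [
    (["n1"], ["n1", "n4"]),  # hit at rank 1
    (["n2"], ["n5", "n2"]),  # hit at rank 2
    (["n3"], ["n6", "n7"]),  # miss
    (["n4"], ["n4", "n1"]),  # hit at rank 1
]

mean_hit_rate = sum(hit_rate(e, r) for e, r in examples) / len(examples)
mrr = sum(reciprocal_rank(e, r) for e, r in examples) / len(examples)
print(f"hit_rate={mean_hit_rate} mrr={mrr}")
```

Hit rate only asks whether the right node was retrieved at all, whereas MRR also rewards ranking it first, which is why MRR can never exceed the hit rate.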


In [31]:
# gpt-3.5-turbo
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

vector_index = VectorStoreIndex(nodes, service_context = service_context_gpt35)
query_engine = vector_index.as_query_engine()

eval_query = queries[10]
response_vector = query_engine.query(eval_query)

print( "> eval_query: ", eval_query )
print( "> response_vector:", response_vector )

> eval_query:  How did the colonies respond to the declaration of war by the dark forces, and what measures did they take to protect their knowledge and technology?
> response_vector: The colonies did not fight back against the dark forces when they declared war. Instead, they sent most of their people into hiding in order to rebuild the colonies later. They also destroyed everything to ensure that their knowledge and technology would not fall into the hands of the dark forces. Additionally, Lemuria and Atlantis were destroyed by their inhabitants to prevent the misuse of their knowledge and technology by the dark forces.


We can now create the evaluator classes responsible for measuring each metric. We will then use a sample response to check whether it meets the test criteria.

In [32]:
from llama_index.evaluation import RelevancyEvaluator
from llama_index.evaluation import FaithfulnessEvaluator

relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)

# Faithfulness evaluation
eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)
# `passing` tells whether the response passed the evaluation
print(eval_result.passing)

# Relevancy evaluation
eval_result = relevancy_gpt4.evaluate_response(
    query=eval_query, response=response_vector
)
# `passing` tells whether the response passed the evaluation
print(eval_result.passing)

True
True


To evaluate the whole dataset we would have to loop over every sample and collect the results one by one. Instead, we can use the LlamaIndex BatchEvalRunner class, which runs the evaluations in batches and concurrently, so the evaluation finishes faster.
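The idea of bounded concurrent evaluation can be sketched with asyncio alone. This is a minimal illustration under stated assumptions, not the BatchEvalRunner implementation: `fake_evaluate` is a hypothetical stand-in for an LLM-backed evaluator call, and the semaphore plays the role of a worker limit.

```python
import asyncio


async def fake_evaluate(query):
    """Hypothetical stand-in for one LLM-backed evaluation call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"query": query, "passing": True}


async def run_batch(queries, workers=8):
    """Evaluate all queries concurrently, with at most `workers` in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(query):
        async with semaphore:
            return await fake_evaluate(query)

    # gather returns the results in the same order as the input queries
    return await asyncio.gather(*(guarded(q) for q in queries))


results = asyncio.run(run_batch([f"question {i}" for i in range(10)], workers=8))
print(len(results), all(r["passing"] for r in results))
```

Inside a notebook, where an event loop is already running, `await run_batch(...)` would replace the `asyncio.run(...)` call.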

In [33]:
from llama_index.evaluation import BatchEvalRunner

# Use the first 10 queries for the evaluation
batch_eval_queries = queries[:10]

# Set up a BatchEvalRunner that computes the faithfulness and relevancy evaluations
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

# Faithfulness score: fraction of responses that passed the faithfulness check
faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
# Relevancy score: fraction of responses that passed the relevancy check
relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

print( "> faithfulness_score", faithfulness_score )
print( "> relevancy_score", relevancy_score )

> faithfulness_score 1.0
> relevancy_score 1.0


Batch processing helps us evaluate the system's performance quickly across a range of different queries. A faithfulness score of 1.0 indicates that the generated answers contain no hallucinations and are based entirely on the retrieved context. Likewise, a relevancy score of 1.0 indicates that the generated answers are consistently aligned with the retrieved context and the queries.
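Each score is simply the fraction of results whose `passing` flag is true, computed separately for every evaluator. The sketch below shows that aggregation on made-up data; `FakeResult` is a hypothetical stand-in for the evaluation result objects and the values are not taken from the run above.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for an evaluation result with a `passing` flag."""
    passing: bool


def pass_rate(results):
    """Fraction of results that passed; 0.0 for an empty list."""
    return sum(r.passing for r in results) / len(results) if results else 0.0


# Made-up results keyed by evaluator name
fake_eval_results = {
    "faithfulness": [FakeResult(True), FakeResult(True), FakeResult(True), FakeResult(True)],
    "relevancy": [FakeResult(True), FakeResult(False), FakeResult(True), FakeResult(True)],
}

# Looping over the dictionary keeps each score tied to its own result list
scores = {name: pass_rate(results) for name, results in fake_eval_results.items()}
print(scores)
```

Computing the scores in one loop over the evaluator names avoids accidentally reusing one evaluator's results for another.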