# Contents
- [Introduction]()
- [Testset generation]()
- [Build RAG with llama-index]()
- [Tracing using Phoenix]()
- [Evaluation]()
- [Embedding analysis]()
- [Conclusion]()

TODO

- Launch Phoenix.
- Index corpus.
- Transform embedding spans into a corpus (including node IDs).
- Relaunch Phoenix.
- Instrument LlamaIndex.
- Run LlamaIndex application.
- Look at UI, understand traces.
- Export and flatten span data into a phoenix dataset.
- Save to disk.
- Restart Phoenix.
- Instrument LangChain.
- Run evaluations.
- Show evaluations.
- Close Phoenix.
- Load saved trace data into `px.TraceDataset`.
- Map root span IDs over the Ragas evaluations, create `SpanEvaluations`, use the `add_evaluations` API to attach the evaluations to the trace dataset, and launch Phoenix with that trace dataset.
- The user gets to see their annotated application spans.
- Display traces with annotated evaluations.
- Export using the query DSL to get a primary dataset for visualizing embeddings. Alternatively, call `get_spans_dataframe` on the `px.TraceDataset`, whichever is simpler.
- Wrangle out the corpus dataset from LlamaIndex in-memory vector store.
- Re-launch Phoenix with primary and corpus datasets.
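
The step that maps root span IDs onto the Ragas evaluations amounts to a join on the question text. A minimal sketch of that join in plain Python follows. The record layout (`span_id`, `parent_id`, `input`) and the helper `scores_by_root_span` are assumptions made for illustration, not the Phoenix export schema, and the values are made up.

```python
def scores_by_root_span(spans, evaluations):
    """Map each root span ID to the scores of the evaluation with the same question."""
    # Root spans have no parent; their input is assumed to be the user question.
    root_by_question = {
        span["input"]: span["span_id"] for span in spans if span["parent_id"] is None
    }
    matched = {}
    for row in evaluations:
        span_id = root_by_question.get(row["question"])
        if span_id is not None:  # skip evaluations that have no matching trace
            matched[span_id] = {k: v for k, v in row.items() if k != "question"}
    return matched


# Toy records (hypothetical layout, made-up values).
spans = [
    {"span_id": "a1", "parent_id": None, "input": "What is ICL?"},
    {"span_id": "a2", "parent_id": "a1", "input": "What is ICL?"},
    {"span_id": "b1", "parent_id": None, "input": "What is a Markov basis?"},
]
evaluations = [
    {"question": "What is ICL?", "faithfulness": 1.0},
    {"question": "Unseen question", "faithfulness": 0.5},
]
print(scores_by_root_span(spans, evaluations))
```

Joining on question text only works if each question appears once in the test set. A stricter version would carry an explicit ID through the pipeline.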

## Introduction

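In this notebook, we generate a synthetic test set with Ragas, build a RAG pipeline with LlamaIndex, trace it with Phoenix, and evaluate the pipeline's responses with Ragas metrics.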

In [None]:
!pip install ragas pypdf arize-phoenix llama-index pandas

In [1]:
import pandas as pd

# Display the complete contents of dataframe cells.
pd.set_option("display.max_colwidth", None)

## Synthetic test data generation

Follow the instructions [here](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage) to install `git-lfs`.

In [None]:
! git lfs install
! git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-papers

In [2]:
from llama_index import SimpleDirectoryReader

In [3]:
dir_path = "./prompt-engineering-papers"
reader = SimpleDirectoryReader(dir_path, num_files_limit=2)
documents = reader.load_data()

In [4]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator with openai models
generator = TestsetGenerator.with_openai()

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents, test_size=10, distributions=distribution
)

embedding nodes:   0%|          | 0/222 [00:00<?, ?it/s]

Generating:   0%|          | 0/10 [00:00<?, ?it/s]

max retries exceeded for ReasoningEvolution(generator_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=60, max_retries=15, max_wait=90, exception_types=<class 'openai.RateLimitError'>)), docstore=InMemoryDocumentStore(splitter=<langchain.text_splitter.TokenTextSplitter object at 0x29da905b0>, nodes=[...

In [5]:
test_df = testset.to_pandas()
test_df.to_csv("ragas_testdata.csv")
test_df.head()

Unnamed: 0,question,contexts,ground_truth,evolution_type,episode_done
0,What is the relationship between the lattice width of F and the diameter of F(M) according to Lemma 3.1?,"[1miisapathofminimallength, then ∥u′−v′∥ ≤r·∥M∥\nand the claim follows from diam( F(M))≥distF(M)(u′,v′) =r. □\nRemark 3.2. LetF ⊂Zdbe a normal set. For all l∈ {−1,0,1}dandu,v∈ Fwe have\n(u−v)Tl≤ ∥u−v∥1and thus width l(F) := max {(u−v)Tl:u,v∈ F} ≤ max{∥u−v∥1:\nu,v∈ F}. Suppose that u′,v′∈ Fare such that ∥u′−v′∥1= max{∥u−v∥1:u,v∈ F}and\nletl′\ni:= sign(u′\ni−v′\ni) fori∈[d], then\n∥u′−v′∥1= (u′−v′)T·l′≤widthl′(F)≤max{∥u−v∥1:u,v∈ F}=∥u′−v′∥1.\nThelattice width ofFis width( F) := minl∈Zdwidthl(F) and thus Lemma 3.1gives\n∥M∥1·diam(F(M))≥width(F).]",The relationship between the lattice width of F and the diameter of F(M) according to Lemma 3.1 is that the lattice width of F is greater than or equal to the diameter of F(M) multiplied by the 1-norm of M.,simple,True
1,What are some strategies proposed to enhance the in-context learning capability of language models?,"[ parameter adap-\ntation to learn the best model parameters for the\ntask with a limited number of supervised exam-\nples (Wang and Yao, 2019). In contrast, ICL does\nnot require parameter updates and is directly per-\nformed on pretrained LLMs.\n4 Model Warmup\nAlthough LLMs have shown promising ICL ca-\npability, many studies also show that the ICL ca-pability can be further improved through a con-\ntinual training stage between pretraining and ICL\ninference, which we call model warmup for short.\nWarmup is an optional procedure for ICL, which\nadjusts LLMs before ICL inference, including mod-\nifying the parameters of the LLMs or adding ad-\nditional parameters. Unlike finetuning, warmup\ndoes not aim to train the LLM for specific tasks but\nenhances the overall ICL capability of the model.\n4.1 Supervised In-context Training\nTo enhance ICL capability, researchers proposed\na series of supervised in-context finetuning strate-\ngies by constructing in-context training data and\nmultitask training. Since the pretraining objectives\nare not optimized for in-context learning (Chen\net al., 2022a), Min et al. (2022b) proposed a method\nMetaICL to eliminate the gap between pretraining\nand downstream ICL usage. The pretrained LLM\nis continually trained on a broad range of tasks\nwith demonstration examples, which boosts its few-\nshot abilities. To further encourage the model to\nlearn input-label mappings from the context, Wei\net al. (2023a) propose symbol tuning. This ap-\nproach fine-tunes language models on in-context\ninput-label pairs, substituting natural language la-\nbels (e.g., ""positive/negative sentiment"") with arbi-\ntrary symbols (e.g., ""foo/bar""). 
As a result, symbol\ntuning demonstrates an enhanced capacity to utilize\nin-context information for overriding prior seman-\ntic knowledge.\nBesides, recent work indicates the potential\nvalue of instructions (Mishra et al., 2021) and there\nis a research direction focusing on supervised in-\nstruction tuning. Instruction tuning enhances the\nICL ability of LLMs through training on task in-\nstructions. Tuning the]","Some strategies proposed to enhance the in-context learning capability of language models include supervised in-context finetuning, multitask training, MetaICL, symbol tuning, and instruction tuning.",simple,True
2,How do Transformers utilize implicit empirical risk minimization to enhance their task recognition ability?,"[ Trans-\nformers can implement a proper function class\nthrough implicit empirical risk minimization for\nthe demonstrations. Pan et al. (2023) decoupled the\nICL ability into task recognition ability and task\nlearning ability, and further showed how they uti-lize demonstrations. From an information-theoretic\nperspective, Hahn and Goyal (2023) showed an er-\nror bound for ICL under linguistically motivated\nassumptions to explain how next-token prediction\ncan bring about the ICL ability. Si et al. (2023)\nfound that large language models exhibit prior fea-\nture biases and showed a way to use intervention\nto avoid unintended features in ICL.\nAnother series of work attempted to build con-\nnections between ICL and gradient descent. Tak-\ning linear regression as a starting point, Akyürek\net al. (2022) found that Transformer-based in-\ncontext learners can implement standard finetun-\ning algorithms implicitly, and von Oswald et al.\n(2022) showed that linear attention-only Transform-\ners with hand-constructed parameters and mod-\nels learned by gradient descent are highly related.\nBased on softmax regression, Li et al. (2023e)\nfound that self-attention-only Transformers showed\nsimilarity with models learned by gradient-descent.\nDai et al. (2022) figured out a dual form between\nTransformer attention and gradient descent and fur-\nther proposed to understand ICL as implicit fine-\ntuning. Further, they compared GPT-based ICL\nand explicit finetuning on real tasks and found that\nICL indeed behaves similarly to finetuning from\nmultiple perspectives.\nFunctional Components Focusing on specific\nfunctional modules, Olsson et al. (2022) found that\nthere exist some induction heads in Transformers\nthat copy previous patterns to complete the next\ntoken. 
Further, they expanded the function of in-\nduction heads to more abstract pattern matching\nand completion, which may implement ICL. Wang\net al. (2023b) focused on the information flow in\nTransformers and found that during the ICL pro-\ncess, demonstration label words serves as anchors,\nwhich aggregates and distributes key information\nfor the final prediction.\n3Takeaway :(1) Knowing and considering\n]","Transformers utilize implicit empirical risk minimization to enhance their task recognition ability by decoupling the ability into task recognition ability and task learning ability, and utilizing demonstrations.",simple,True
3,What is the relationship between Markov bases and the mixing time of a random walk?,"[HEAT-BATH RANDOM WALKS WITH MARKOV BASES 9\neﬃciently. If the input of Algorithm 1is a normal set F={u∈Zd:Au≤b}that is given\ninH-representation, then the length of the ray RF,m(v) can be computed with a number of\nrounding, division, and comparing operations that is linea r in the number of rows of A.\nThere are situations in which the heat-bath random walk prov ides no speed-up compared\nwith the simple walk (Example 4.3). Intuitively, adding more moves to the set of allowed\nmoves should improve the mixing time of the random walk. In ge neral, however, this is not\ntrue for the heat-bath walk (Example 4.4).\nExample 4.3. Forn∈N, consider the normal set\nFn:={[\n0 1 1 ···1\n1 0 0 ···0]\n,[\n1 0 1 ···1\n0 1 0 ···0]\n,...,[\n1 1···1 0\n0 0···0 1]}\n⊂Q2×n.\nIn the language of [7, Section 1.1], Fnis precisely the ﬁber of the 2 ×nindependence model\nwhere row sums are ( n−1,1) and column sums are (1 ,1,...,1). The minimal Markov\nbasis of the independence model, often referred to as the basic moves , is precisely the set\nMn:={v−u:u,v∈ Fn} \ {0}. In particular, the ﬁber graph Fn(Mn) is the complete\ngraph on nnodes. All rays along basic moves have length 2 and thus the tr ansition matrices\nof the simple random walk and the heat-bath random walk coinc ide. There are n·(n−1)\nmany basic moves and the transition matrix of both random wal ks is\n1\nn(n−1)\n1...1\n......\n1...1\n+(n(n−1)−n)\nn(n−1)·In.\nThe second largest eigenvalue is 1 −1\nn−1which implies]",,simple,True
4,"How does the theory of general relativity explain the inability of anything, including light, to escape a black hole?","[1,...,vr\nProposition 4.1. LetF ⊂ZdandM ⊂Zdbe ﬁnite sets. Let f:M →[0,1]andπ:F →\n(0,1)be mass functions. Then Hπ,f\nF,Mis aperiodic, has stationary distribution π, is reversible\nwith respect to π, and all of its eigenvalues are non-negative. The random walk is irreducible\nif and only if {m∈ M:f(m)>0}is a Markov basis for F.\nProof.Since for any u∈ Fand any m∈ M,Hπ\nF,m(u,u)>0, there are halting states and\nthusHπ,f\nF,Mis aperiodic. By deﬁnition, π(x)Hπ\nF,m(x,y) =π(y)Hπ\nF,m(y,x) and thus Hπ,f\nF,M\nis reversible with respect to πandπis a stationary distribution. The statement on the\neigenvalues is exactly [8, Lemma 1.2]. Let M′={m∈ M:f(m)>0}andf′=f|M′, then\nHπ,f\nF,M=Hπ,f′\nF,M′and thus the heat-bath random walk is irreducible if and only ifM′is a\nMarkov basis for F. □\nRemark 4.2. Analyzing the speed of convergence of random walks with seco nd largest\neigenvalues does not take the computation time of a single tr ansition into account. From\na computational point of view, the diﬀerence of the simple wal k and the heat-bath random\nwalk is Step 4 of Algorithm 1. However, we argue that Step 4 can be done eﬃciently in\nmany cases. For instance, a hard normalizing constant of πcancels out. If πis the uniform\ndistribution, then one needs to sample uniformly from RF,m(v) in Step 4, which can be done]",,simple,True
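
Rows 3 and 4 in the preview above have an empty `ground_truth`, and row 4 pairs a question about general relativity with a context about Markov bases. Metrics that compare against a reference answer need a ground truth, so it may be worth screening such rows before evaluation. The following is a minimal sketch and not part of Ragas: `is_missing` and `split_by_ground_truth` are hypothetical helpers, and the rows are assumed to be plain dicts such as those returned by `test_df.to_dict("records")`.

```python
import math


def is_missing(value):
    """Return True for None, NaN, or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def split_by_ground_truth(rows):
    """Split rows into (usable, rejected) based on the ground_truth field."""
    usable, rejected = [], []
    for row in rows:
        target = rejected if is_missing(row.get("ground_truth")) else usable
        target.append(row)
    return usable, rejected


# Toy rows standing in for test_df.to_dict("records").
rows = [
    {"question": "q0", "ground_truth": "a0"},
    {"question": "q1", "ground_truth": float("nan")},
    {"question": "q2", "ground_truth": ""},
]
usable, rejected = split_by_ground_truth(rows)
print(len(usable), len(rejected))
```

When staying in pandas, `test_df.dropna(subset=["ground_truth"])` covers the NaN case. Keeping the rejected rows around makes it easy to inspect why generation failed for them.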


## Build RAG with llama-index

In [6]:
import phoenix as px

session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [7]:
import llama_index

llama_index.set_global_handler("arize_phoenix")

In [8]:
import nest_asyncio
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from datasets import Dataset

nest_asyncio.apply()


def build_query_engine(documents):
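    # The embedding model is configured on the service context,
    # alongside the chunk size used to split documents into nodes.
    service_context = ServiceContext.from_defaults(
        chunk_size=512,
        embed_model=OpenAIEmbedding(),
    )
    vector_index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
    )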

    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine


def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }


# Run every test question through the query engine, one at a time (synchronously),
# and collect the answers and retrieved contexts in the format Ragas expects.
def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values.tolist()
    responses = [generate_response(query_engine, q) for q in test_questions]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds

In [None]:
query_engine = build_query_engine(documents)

In [None]:
# Sanity-check a single question before running the whole test set.
generate_response(query_engine, test_df["question"][1])

In [None]:
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)

![](../../_static/imgs/arize-tracing1.gif)

In [None]:
ragas_eval_dataset

## Evaluation

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_correctness,
    context_recall,
    context_precision,
)

In [None]:
from phoenix.trace.langchain import OpenInferenceTracer

tracer = OpenInferenceTracer()

In [None]:
ragas_scores = evaluate(
    dataset=ragas_eval_dataset,
    metrics=[faithfulness, answer_correctness, context_recall, context_precision],
    callbacks=[tracer],
)

In [None]:
ragas_scores

![](../../_static/imgs/arize-tracing2.gif)

## Embedding analysis
TBD:
- cluster queries
- color each data point based on question type?
- display average score for each cluster
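
Once each query has a cluster label, the per-cluster average planned above is a simple group-and-mean. A minimal sketch, assuming cluster labels and metric scores are already available as parallel lists; `mean_score_by_cluster` is a hypothetical helper and the values are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_cluster(labels, scores):
    """Average the scores of all points that share a cluster label."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    grouped = defaultdict(list)
    for label, score in zip(labels, scores):
        grouped[label].append(score)
    return {label: mean(values) for label, values in grouped.items()}


# Made-up cluster assignments and metric scores for five queries.
labels = [0, 0, 1, 1, 1]
scores = [1.0, 0.5, 0.0, 0.5, 1.0]
averages = mean_score_by_cluster(labels, scores)
print(averages)
```

The same function works when grouping by `evolution_type` instead of a cluster ID, which would answer the question-type item in the list above.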