## Baseline

Build a baseline RAG pipeline and evaluate it with ragas.

In [2]:
!tree data/

data/
├── 37signals-is-you.md
├── benefits-and-perks.md
├── code-of-conduct.md
├── faq.md
├── getting-started.md
├── how-we-work.md
├── international-travel-guide.md
├── LICENSE.md
├── making-a-career.md
├── managing-work-devices.md
├── moonlighting.md
├── our-internal-systems.md
├── our-rituals.md
├── performance-plans.md
├── product-histories.md
├── README.md
├── stateFMLA.md
├── titles-for-data.md
├── titles-for-designers.md
├── titles-for-ops.md
├── titles-for-programmers.md
├── titles-for-support.md
├── vocabulary.md
├── what-influenced-us.md
├── what-we-stand-for.md
└── where-we-work.md

0 directories, 26 files


In [4]:
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader('./data', recursive=True)
docs = reader.load_data()

len(docs)

155

In [5]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()

In [7]:
r = qe.query("What is the company policy about Moonlighting?")
print(r)


The company policy about Moonlighting is that it is allowed as long as it does not create conflicts of interest or affect the employee's time, dedication, or performance at 37signals. Examples of activities that are not allowed include working full or part time for another company in the same industry, going on a regular speaking circuit tour, consulting for other companies in the same industry, aggressively marketing availability for side work, and taking on anything outside of work that will pull attention from work.


In [8]:
r = qe.query("What are the Benefits and Perks offered?")
print(r)


The Benefits and Perks offered are Paid Time Off.


### Running Evaluations

In [11]:
from llama_index.evaluation import DatasetGenerator
from ragas.llama_index import evaluate

data_generator = DatasetGenerator.from_documents(docs)
eval_questions = data_generator.generate_questions_from_nodes(num=20)

eval_questions

['How does 37signals emphasize the importance of customer support and communication?',
 "Why does 37signals believe that marketing is everyone's responsibility?",
 'How does 37signals cultivate its image through customer interactions?',
 "What role does word of mouth play in 37signals' customer acquisition strategy?",
 'How does 37signals contribute to the open-source community?',
 'What advice does 37signals give to new employees about work-life balance?',
 'How does 37signals encourage collaboration and teamwork among its employees?',
 'What is the significance of setting boundaries and not becoming consumed by work at 37signals?',
 'How does 37signals view the role of writing and communication in their organization?',
 'How does 37signals prioritize customer satisfaction and experience?',
 'How can other organizations use this handbook as inspiration for their own organizations?',
 "What is the recommended approach for personalizing the handbook to match one's own organization?",
 '

In [12]:
import nest_asyncio
nest_asyncio.apply()

In [13]:
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

result = evaluate(
    qe, 
    [faithfulness, answer_relevancy, context_relevancy],
    eval_questions,
)

result

evaluating with [faithfulness]


100%|████████████████████████████████████████████████████████████| 2/2 [04:48<00:00, 144.49s/it]


evaluating with [answer_relevancy]


100%|█████████████████████████████████████████████████████████████| 2/2 [00:53<00:00, 26.86s/it]


evaluating with [context_relevancy]


100%|█████████████████████████████████████████████████████████████| 2/2 [00:47<00:00, 23.84s/it]


{'ragas_score': 0.2195, 'faithfulness': 0.9012, 'answer_relevancy': 0.9691, 'context_relevancy': 0.0868}

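Faithfulness and answer relevancy look good, but context relevancy is very low (0.0868), which pulls the overall ragas_score down to 0.2195. Let's look a bit closer to see which questions gave bad results.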
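The gap between `ragas_score` and the two strong metrics comes from how the metrics are combined. The reported value is consistent with a harmonic mean of the three metrics, which is dominated by the smallest input. A minimal sketch, assuming that aggregation (it is inferred from the numbers above, not taken from the ragas source):

```python
from statistics import harmonic_mean

# Metric values copied from the evaluation result above
scores = {
    "faithfulness": 0.9012,
    "answer_relevancy": 0.9691,
    "context_relevancy": 0.0868,
}

# Assumption: ragas_score is the harmonic mean of the individual metrics
ragas_score = harmonic_mean(list(scores.values()))

# Arithmetic mean, for comparison only
arithmetic = sum(scores.values()) / len(scores)

print(f"harmonic mean:   {ragas_score:.4f}")
print(f"arithmetic mean: {arithmetic:.4f}")
```

The harmonic mean comes out at roughly 0.22, matching the reported 0.2195 up to rounding of the inputs, while the arithmetic mean would be about 0.65. Under this assumption, improving retrieval (context relevancy) is the most direct way to raise the overall score.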

In [15]:
df = result.to_pandas()
low_faithfulness = df[df.faithfulness < 0.9]
low_faithfulness

| | question | contexts | answer | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|---|---|---|
| 2 | How does 37signals cultivate its image through... | [37signals Is You\n\nEveryone working at 37sig... | \n37signals cultivates its image through custo... | 0.857143 | 1.0 | 0.0 |
| 5 | What advice does 37signals give to new employe... | [Getting Started\n\nGetting started at 37signa... | \n37signals encourages new employees to take a... | 0.75 | 0.878195 | 0.0 |
| 6 | How does 37signals encourage collaboration and... | [Employee Gifts\n\nAt the end of every year, 3... | \n37signals encourages collaboration and teamw... | 0.75 | 0.997122 | 0.428571 |
| 7 | What is the significance of setting boundaries... | [Books\n\n* Turn The Ship Around: “Leadership ... | \nSetting boundaries and not becoming consumed... | 0.666667 | 0.985963 | 0.05 |
| 14 | As a teacher/professor, how would you structur... | [Frequently Asked Questions, FAQ\n\nThere are ... | \nAs a teacher/professor, one way to structure... | 0.0 | 0.961137 | 0.0 |
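One pattern in these rows is worth quantifying: most of them also have near-zero context relevancy. A dependency-free sketch using the values shown above (the 0.05 threshold is an arbitrary choice for illustration; inside the notebook the equivalent filter would be `low_faithfulness[low_faithfulness.context_relevancy <= 0.05]`):

```python
# (row index, faithfulness, context_relevancy) copied from the output above
rows = [
    (2, 0.857143, 0.0),
    (5, 0.75, 0.0),
    (6, 0.75, 0.428571),
    (7, 0.666667, 0.05),
    (14, 0.0, 0.0),
]

THRESHOLD = 0.05  # hypothetical cut-off for "poor retrieval"

# Collect the indices of rows whose retrieved context scored at or below the cut-off
poor_context = [idx for idx, _, ctx_rel in rows if ctx_rel <= THRESHOLD]

print(f"{len(poor_context)} of {len(rows)} low-faithfulness rows "
      f"have context_relevancy <= {THRESHOLD}: {poor_context}")
```

If most low-faithfulness answers also sit on weak contexts, retrieval is a likelier culprit than generation.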


With a small utility function we can view the question, answer, and retrieved contexts for a given row.

In [45]:
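from IPython.display import Markdown, display

# Markdown template for one evaluation row
msg = """\
**Q: {question}**

A: {answer}

**Contexts**

---

{context}
"""

# Template for a single retrieved context, followed by a separator
ctx_msg = """\
{c}

---

"""


def view_row(i, df):
    """Render the question, answer, and contexts of the i-th row (positional) of df."""
    r = df.iloc[i]
    contexts = ''.join(ctx_msg.format(c=ctx) for ctx in r["contexts"])
    display(Markdown(msg.format(
        question=r["question"],
        answer=r["answer"],
        context=contexts,
    )))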

In [46]:
view_row(4, low_faithfulness)

**Q: As a teacher/professor, how would you structure the quiz/examination to ensure diversity in the nature of the questions?**

A: 
As a teacher/professor, one way to structure the quiz/examination to ensure diversity in the nature of the questions is to use the Frequently Asked Questions (FAQ) provided in BC4 as a guide. This will help ensure that the questions cover a wide range of topics and are not too focused on any one particular area. Additionally, it is important to ensure that the questions are not too difficult or too easy, and that they are appropriate for the level of the students taking the quiz/examination. Finally, it is important to provide a variety of question types, such as multiple choice, short answer, and essay questions, to ensure that students are able to demonstrate their knowledge in different ways.

**Contexts**

---

Frequently Asked Questions

---

FAQ

There are many questions that arise from IT policies such as this, so we've produced an FAQ in BC4 to help answer them.

---




The question is off-topic for the handbook, and because the retrieved context contains nothing relevant, the LLM makes up most of its answer.

Here are the faithfulness metric logs from LangSmith.

**statements generated**
```txt
- One way to structure the quiz/examination to ensure diversity in the nature of the questions is to use the Frequently Asked Questions (FAQ) provided in BC4 as a guide.
- Using the FAQ as a guide will help ensure that the questions cover a wide range of topics and are not too focused on any one particular area.
- It is important to ensure that the questions are not too difficult or too easy, and that they are appropriate for the level of the students taking the quiz/examination.
- Providing a variety of question types, such as multiple choice, short answer, and essay questions, will ensure that students are able to demonstrate their knowledge in different ways.
```

**checking for support in context**
```txt
1. One way to structure the quiz/examination to ensure diversity in the nature of the questions is to use the Frequently Asked Questions (FAQ) provided in BC4 as a guide.
Explanation: The context mentions that an FAQ has been produced to answer questions related to IT policies. However, there is no information suggesting that the FAQ can be used as a guide to structure a quiz/examination. Verdict: No.

2. Using the FAQ as a guide will help ensure that the questions cover a wide range of topics and are not too focused on any one particular area.
Explanation: The context mentions that the FAQ has been produced to answer questions related to IT policies. It does not provide any information about the content or topics covered in the FAQ. Therefore, it cannot be deduced that using the FAQ as a guide will ensure question diversity. Verdict: No.

3. It is important to ensure that the questions are not too difficult or too easy, and that they are appropriate for the level of the students taking the quiz/examination.
Explanation: The context does not provide any information about the difficulty level or appropriateness of the questions. Therefore, it cannot be deduced whether it is important to ensure these factors. Verdict: No.

4. Providing a variety of question types, such as multiple choice, short answer, and essay questions, will ensure that students are able to demonstrate their knowledge in different ways.
Explanation: The context does not provide any information about the variety of question types or how students can demonstrate their knowledge. Therefore, it cannot be deduced that providing a variety of question types will ensure different ways of demonstrating knowledge. Verdict: No.

Final verdict for each statement in order: No. No. No. No.
```
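The score of 0.0 for this row follows directly from these verdicts. A minimal sketch, assuming faithfulness is the fraction of generated statements that the context supports (an assumption consistent with the other scores in the output, e.g. 0.75 = 3/4; the helper `faithfulness_from_verdicts` is hypothetical, not a ragas function):

```python
import re


def faithfulness_from_verdicts(final_line: str) -> float:
    """Return the fraction of statements whose verdict is 'Yes'."""
    # Keep only the part after the colon, then extract each Yes/No verdict
    verdict_part = final_line.split(":", 1)[1]
    verdicts = re.findall(r"\b(Yes|No)\b", verdict_part, flags=re.IGNORECASE)
    if not verdicts:
        raise ValueError("no verdicts found")
    supported = sum(v.lower() == "yes" for v in verdicts)
    return supported / len(verdicts)


# Final verdict line copied from the log above
final_line = "Final verdict for each statement in order: No. No. No. No."
score = faithfulness_from_verdicts(final_line)
print(score)
```

With all four statements unsupported, the ratio is 0 out of 4, which matches the faithfulness value reported for this question.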

In [44]:
low_relevancy = df[df.answer_relevancy < 0.9]
low_relevancy

| | question | contexts | answer | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|---|---|---|
| 5 | What advice does 37signals give to new employe... | [Getting Started\n\nGetting started at 37signa... | \n37signals encourages new employees to take a... | 0.75 | 0.878195 | 0.0 |


### Why?
Here the answer covers only paid time off, while the question asks about work-life balance in general. The answer addresses just part of the question, which shows that answer_relevancy needs to be improved.

In [47]:
view_row(0, low_relevancy)

**Q: What advice does 37signals give to new employees about work-life balance?**

A: 
37signals encourages new employees to take advantage of their 18 days of paid time off plus 11 local holidays every year. They also suggest tracking vacation time and note that vacation time rolls over year to year with a maximum bank of 27 days at any time. If an employee is terminated or resigns from 37signals with vacation days in their bank, they will be paid the monetary equivalent for those unused days (prorated at the time of their departure). This encourages employees to maintain a healthy work-life balance.

**Contexts**

---

Getting Started

Getting started at 37signals involves a lot of little details, a number of big tasks, learning the details of your new job, meeting new coworkers, all while working remotely. Your teammates, your manager, your 37signals buddy, your Ops buddy, and our People team are all here to help as you navigate your first few days and weeks.

---

Vacation Time

37signals offers 18 days of paid time off plus 11 local holidays every year. We ask that you track your time off. Your vacation time is prorated based on your start date during your first year at 37signals. Your vacation time rolls over year to year with a maximum bank of 27 days at any time. If you are terminated or resign from 37signals with vacation days in your bank, you’ll be paid the monetary equivalent for those unused days (prorated at the time of your departure).

---


