### Cleaned Data

In [1]:
import glob

In [43]:
files = glob.glob("data/**", recursive=True)

# Collect the extension of every path that contains exactly one dot
file_ends = set()
for f in files:
    splits = f.split('.')
    if len(splits) == 2:
        file_ends.add(splits[-1])
file_ends

{'md'}

In [44]:
file_ends = file_ends - {'md'}
file_ends

set()

In [45]:
files = []
# Search under data/ only, so nothing outside the dataset is matched
glob_str = "data/**/*.{end}"
for e in file_ends:
    ext_files = glob.glob(glob_str.format(end=e), recursive=True)
    files.extend(ext_files)

len(files)

0

In [46]:
import os

# Delete the collected non-Markdown files, skipping any that are already gone
for file in files:
    try:
        os.remove(file)
    except FileNotFoundError:
        print(f"{file} not found!")

In [2]:
!tree data

data
├── handbook
│   ├── about
│   │   ├── images
│   │   ├── _index.md
│   │   ├── migration-details.md
│   │   ├── migration-reports
│   │   │   ├── about-migration-report.md
│   │   │   ├── being-a-public-company-migration-report.md
│   │   │   ├── cadence-migration-report.md
│   │   │   ├── ceo-migration-report.md
│   │   │   ├── communication-migration-report.md
│   │   │   ├── content-websites-responsibility-migration-report.md
│   │   │   ├── eba-migration-report.md
│   │   │   ├── e-group-weekly-migration-report.md
│   │   │   ├── esg-migration-report.md
│   │   │   ├── family-and-friends-day-migration-report.md
│   │   │   ├── faq-gitlab-licensing-technology-to-independent-chinese-company-migration-report.md
│   │   │   ├── gitlab-all-company-meetings-migration-report.md
│   │   │   ├── group-conversations-migration-report.md
│   │   │   ├── handbook-usage-migration-report.md
│   │   │   ├── hist

In [3]:
!tree data/handbook/ceo

data/handbook/ceo
├── chief-of-staff-team
│   ├── _index.md
│   ├── jihu-support
│   │   ├── images
│   │   ├── _index.md
│   │   ├── jihu-contribution-process.md
│   │   ├── jihu-database-change-process.md
│   │   ├── jihu-security-review-process.md
│   │   ├── jihu-validation-pipelines.md
│   │   └── release-certification.md
│   ├── performance-indicators.md
│   ├── readmes
│   │   ├── dlangemak.md
│   │   ├── _index.md
│   │   ├── ipedowitz.md
│   │   ├── jamiemaynard.md
│   │   └── streas.md
│   └── workplace.md
├── _index.md
└── shadow.md

4 directories, 16 files


## Baseline LlamaIndex

This baseline uses the `ceo` section of the handbook, which is viewable [here](https://handbook.gitlab.com/handbook/ceo/). Feel free to use any other section as well, but do link to the hosted webpage so that the content is easier to view.

In [4]:
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader('./data/handbook/ceo/', recursive=True)
docs = reader.load_data()

len(docs)

279

In [5]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()

In [6]:
r = qe.query("What is Sid's view on Strong Opinions weakly held? can you give any examples")
print(r)


Sid believes in “strong opinions, weakly held.” He encourages people to have strong opinions, but to be open to changing them if presented with compelling new information and a data driven perspective. For example, if someone has a strong opinion about a product feature, but then new data is presented that suggests a different approach, Sid would encourage them to consider the new data and potentially change their opinion.


In [7]:
r = qe.query("What is the CEO Shadow Program?")
print(r)


The CEO Shadow Program is a program designed to give team members and eligible individuals an overview of all aspects of the company. Through attending meetings and completing short-term tasks, participants gain a better understanding of the company and its operations. The program also provides an opportunity for the CEO to build relationships with team members and identify challenges and opportunities earlier. Additionally, shadows can connect with one another, creating new cross-functional relationships.


In [8]:
r = qe.query("Who is the CEO?")
print(r)


The CEO is not specified in the context information.


### Evaluating with ragas

In [9]:
from llama_index.evaluation import DatasetGenerator
from ragas.llama_index import evaluate

data_generator = DatasetGenerator.from_documents(docs)
eval_questions = data_generator.generate_questions_from_nodes(num=20)

eval_questions

['What are some specific processes that are detailed on this page for Sid, CEO of GitLab?',
 'How is this page intended to be helpful?',
 'What is the purpose of the Executive Business Administrators (EBAs) mentioned on this page?',
 'Can you provide examples of guidelines mentioned on this page for the EBAs?',
 'How does the page suggest handling items that might seem pretentious or overbearing?',
 'What is the role of Sid in GitLab?',
 'How can someone deviate from the page and update it?',
 "What is the significance of the CEO's involvement in the processes detailed on this page?",
 'How does the page encourage collaboration and feedback?',
 'What is the overall purpose of this page in relation to Sid, CEO of GitLab?',
 "What is Sid Sijbrandij's role at GitLab Inc.?",
 'How did Sid Sijbrandij first become interested in programming?',
 "What is the purpose of GitLab's single application?",
 "What is Sid Sijbrandij's educational background?",
 'How did Sid Sijbrandij commercialize Git

In [10]:
import nest_asyncio
nest_asyncio.apply()

In [11]:
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

result = evaluate(
    qe, 
    [faithfulness, answer_relevancy, context_relevancy],
    eval_questions,
)

result

evaluating with [faithfulness]


100%|████████████████████████████████████████████████████████████| 2/2 [03:48<00:00, 114.23s/it]


evaluating with [answer_relevancy]


100%|█████████████████████████████████████████████████████████████| 2/2 [00:49<00:00, 24.76s/it]


evaluating with [context_relevancy]


100%|█████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.01s/it]


{'ragas_score': 0.2560, 'faithfulness': 0.7533, 'answer_relevancy': 0.9209, 'context_relevancy': 0.1075}
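
The aggregate `ragas_score` above is consistent with a harmonic mean of the three metric scores. That aggregation rule is an assumption inferred from the numbers, not something this notebook confirms. A minimal sketch that recomputes the aggregate from the reported per-metric values (the helper name `harmonic_mean_score` is hypothetical):

```python
from statistics import harmonic_mean


def harmonic_mean_score(scores):
    """Combine per-metric scores with a harmonic mean (assumed aggregation rule)."""
    return harmonic_mean(list(scores.values()))


# Per-metric scores copied from the evaluation output above
baseline = {
    "faithfulness": 0.7533,
    "answer_relevancy": 0.9209,
    "context_relevancy": 0.1075,
}

# Lands close to the reported 0.2560 (inputs are rounded to four decimals)
print(round(harmonic_mean_score(baseline), 3))
```

A harmonic mean is pulled toward its smallest input, so under this assumption the low `context_relevancy` dominates the aggregate even though `answer_relevancy` is above 0.9.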

In [12]:
df = result.to_pandas()

df[df.faithfulness < 0.71]

| | question | contexts | answer | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|---|---|---|
| 0 | What are some specific processes that are deta... | [Intro\n\nThis page details processes specific... | \nSome specific processes that are detailed on... | 0.333333 | 0.999171 | 0.0 |
| 2 | What is the purpose of the Executive Business ... | [Intro\n\nThis page details processes specific... | \nThe purpose of the Executive Business Admini... | 0.666667 | 0.983159 | 0.2 |
| 4 | How does the page suggest handling items that ... | [Brand\n\nPlease refer to our guidelines, Prod... | \nThe page does not suggest any specific handl... | 0.5 | 0.989259 | 0.0 |
| 7 | What is the significance of the CEO's involvem... | [Review the CEO Handbook\n\nThe CEO has a sect... | \nThe CEO's involvement in the processes detai... | 0.333333 | 0.982521 | 0.0 |
| 12 | What is the purpose of GitLab's single applica... | [Vision\n\nGitLab is an influencer and educato... | \nThe purpose of GitLab's single application i... | 0.333333 | 0.962176 | 0.0 |
| 17 | What were some of the innovative web applicati... | [CEO Bio\n\nSid Sijbrandij is the Co-founder, ... | \nSome of the innovative web applications deve... | 0.0 | 0.996383 | 0.0 |
| 18 | How did Sid Sijbrandij lead GitLab through Y C... | [CEO Bio\n\nSid Sijbrandij is the Co-founder, ... | \nSid Sijbrandij led GitLab through Y Combinat... | 0.6 | 0.937906 | 0.018182 |


It seems faithfulness is something we should improve. Let's try GPT-4 and see whether it brings an improvement.

In [14]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI

gpt4 = OpenAI(model="gpt-4")
sc = ServiceContext.from_defaults(llm=gpt4)
index = VectorStoreIndex.from_documents(docs, service_context=sc)
gpt4_qe = index.as_query_engine()

In [15]:
r = gpt4_qe.query("What is Sid's view on Strong Opinions weakly held? can you give any examples")
print(r)

Sid believes in the concept of "strong opinions, weakly held." This means that while he may have strong beliefs or ideas, he is open to changing his mind if presented with compelling new information and a data-driven perspective. No specific examples are given in the context.


In [16]:
r = gpt4_qe.query("Who is the CEO?")
print(r)

The context does not provide information on who the CEO is.


In [18]:
gpt4_result = evaluate(
    gpt4_qe, 
    [faithfulness, answer_relevancy, context_relevancy],
    eval_questions,
)

gpt4_result

evaluating with [faithfulness]


100%|█████████████████████████████████████████████████████████████| 2/2 [02:33<00:00, 76.93s/it]


evaluating with [answer_relevancy]


100%|█████████████████████████████████████████████████████████████| 2/2 [00:51<00:00, 25.98s/it]


evaluating with [context_relevancy]


100%|█████████████████████████████████████████████████████████████| 2/2 [00:29<00:00, 14.71s/it]


{'ragas_score': 0.2153, 'faithfulness': 0.8500, 'answer_relevancy': 0.9318, 'context_relevancy': 0.0856}
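
To compare the two runs metric by metric, the reported scores can be diffed directly. This is a small sketch using only the numbers printed in the two result dictionaries; the helper name `metric_deltas` is hypothetical.

```python
# Scores copied from the two evaluation outputs in this notebook
baseline = {
    "ragas_score": 0.2560,
    "faithfulness": 0.7533,
    "answer_relevancy": 0.9209,
    "context_relevancy": 0.1075,
}
gpt4 = {
    "ragas_score": 0.2153,
    "faithfulness": 0.8500,
    "answer_relevancy": 0.9318,
    "context_relevancy": 0.0856,
}


def metric_deltas(before, after):
    """Return after - before for every metric present in `before`."""
    return {name: round(after[name] - before[name], 4) for name in before}


for name, delta in metric_deltas(baseline, gpt4).items():
    print(f"{name:>18}: {delta:+.4f}")
```

In these numbers, faithfulness and answer relevancy went up, while context relevancy and the aggregate score went down.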

We managed to improve faithfulness from 0.75 to 0.85. How do we improve further?

First, we have to figure out what exactly is going wrong by inspecting the traces. We recommend using a tool such as LangSmith to view these traces, which will help you home in on the issues.
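
Before setting up a tracing tool, a quick first look is to print the chunks the retriever attached to a response. This is a minimal sketch, assuming the response object exposes `source_nodes`, where each entry has a `score` and a `node` with a `get_content()` method. That matches our understanding of the llama_index version used here, but check it against your installed version. The helper name `summarize_sources` is hypothetical.

```python
def summarize_sources(response, max_chars=80):
    """Return (score, snippet) pairs for the chunks retrieved for a response.

    Assumes `response.source_nodes` is a list of objects with `.score`
    and `.node.get_content()`; adjust if your llama_index version differs.
    """
    rows = []
    for src in getattr(response, "source_nodes", []):
        # Flatten newlines so each snippet prints on a single line
        text = src.node.get_content().strip().replace("\n", " ")
        rows.append((src.score, text[:max_chars]))
    return rows


# Usage with the query engine defined earlier (not executed here):
# for score, snippet in summarize_sources(qe.query("Who is the CEO?")):
#     print(score, snippet)
```

If the retrieved snippets have little to do with the question, the problem is in retrieval rather than in the LLM, which would fit the low `context_relevancy` scores above.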