#### 🚧 BONUS CHALLENGE 🚧

> NOTE: Completing this challenge will provide full marks on the assignment, regardless of the complete of the notebook. You do not need to complete this in the notebook for full marks.

##### **MINIMUM REQUIREMENTS**:

1. Baseline `LCEL RAG` Application using `NAIVE RETRIEVAL`
2. Baseline Evaluation using `RAGAS METRICS`
  - [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
  - [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
  - [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
  - [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
  - [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)
3. Implement a `SEMANTIC CHUNKING STRATEGY`.
4. Create an `LCEL RAG` Application using `SEMANTIC CHUNKING` with `NAIVE RETRIEVAL`.
5. Compare and contrast results.

##### **SEMANTIC CHUNKING REQUIREMENTS**:

Chunk semantically similar (based on designed threshold) sentences, and then paragraphs, greedily, up to a maximum chunk size. Minimum chunk size is a single sentence.

Have fun!

In [1]:
# Let's install the same libraries as the week 4 day 2 Homework
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant
!pip install -qU ragas
!pip install -qU qdrant-client pymupdf pandas

# And some other useful utils
!pip install -qU nltk 

In [2]:
# And set the OpenAI key
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [3]:
# And load up the same document collection to work with
from langchain_community.document_loaders import PyMuPDFLoader
from pprint import pprint

PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"

loader = PyMuPDFLoader(
    file_path=PDF_LINK,
)

documents = loader.load()

# visual inspection shows that page 7 (zero-indexed page 6) is pretty much the first page with any meaningful text, let's start there.
documents=documents[6:]

# Each document will now contain text extracted as blocks (pages)
for doc in documents[:10]:
    pprint(doc.page_content)

('Part 1: Why not to do a startup\n'
 'In this series of posts I will walk through some of my accumu-\n'
 'lated knowledge and experience in building high-tech startups.\n'
 'My speciXc experience is from three companies I have co-\n'
 'founded: Netscape, sold to America Online in 1998 for $4.2\n'
 'billion; Opsware (formerly Loudcloud), a public soaware com-\n'
 'pany with an approximately $1 billion market cap; and now\n'
 'Ning, a new, private consumer Internet company.\n'
 'But more generally, I’ve been fortunate enough to be involved\n'
 'in and exposed to a broad range of other startups — maybe 40\n'
 'or 50 in enough detail to know what I’m talking about — since\n'
 'arriving in Silicon Valley in 1994: as a board member, as an angel\n'
 'investor, as an advisor, as a friend of various founders, and as a\n'
 'participant in various venture capital funds.\n'
 'This series will focus on lessons learned from this entire cross-\n'
 'section of Silicon Valley startups — so don’t think

<font color="blue">We don't want to have artificial page-breaks, so let's recombine those into one long document before we start semantic chunking. It also looks like pymupdfloader didn't give us a good way to distinguish between paragraphs, as every line is just separated by a newline. To avoid spending a lot of time on pdf munging, let's just strip out the newlines and treat it as one long text. 

(Note: Unstructured could probably do a better job handling these paragraphs, but I ran into dependency conflicts installing it on my laptop and didn't feel like taking the time to resolve them.)

In [4]:
from pprint import pprint

one_big_string = ""
for doc in documents:
    cleaned_content = doc.page_content.strip().replace("\n", " ")
    one_big_string += cleaned_content

print(one_big_string)

Part 1: Why not to do a startup In this series of posts I will walk through some of my accumu- lated knowledge and experience in building high-tech startups. My speciXc experience is from three companies I have co- founded: Netscape, sold to America Online in 1998 for $4.2 billion; Opsware (formerly Loudcloud), a public soaware com- pany with an approximately $1 billion market cap; and now Ning, a new, private consumer Internet company. But more generally, I’ve been fortunate enough to be involved in and exposed to a broad range of other startups — maybe 40 or 50 in enough detail to know what I’m talking about — since arriving in Silicon Valley in 1994: as a board member, as an angel investor, as an advisor, as a friend of various founders, and as a participant in various venture capital funds. This series will focus on lessons learned from this entire cross- section of Silicon Valley startups — so don’t think that anything I am talking about is referring to one of my own companies: mo

In [5]:
# Split into sentences
import nltk
nltk.download('punkt')  # Download sentence tokenizer
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize

# Create the sentences as a list
sentences=sent_tokenize(one_big_string)

# Print out a few as a sanity check
for sentence in sentences[:5]:
    print(f"sentence: {sentence}\n\n")

sentence: Part 1: Why not to do a startup In this series of posts I will walk through some of my accumu- lated knowledge and experience in building high-tech startups.


sentence: My speciXc experience is from three companies I have co- founded: Netscape, sold to America Online in 1998 for $4.2 billion; Opsware (formerly Loudcloud), a public soaware com- pany with an approximately $1 billion market cap; and now Ning, a new, private consumer Internet company.


sentence: But more generally, I’ve been fortunate enough to be involved in and exposed to a broad range of other startups — maybe 40 or 50 in enough detail to know what I’m talking about — since arriving in Silicon Valley in 1994: as a board member, as an angel investor, as an advisor, as a friend of various founders, and as a participant in various venture capital funds.


sentence: This series will focus on lessons learned from this entire cross- section of Silicon Valley startups — so don’t think that anything I am talking abo

[nltk_data] Downloading package punkt to /Users/Angela/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/Angela/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


<font color="blue">Now let's find semantically similar sentences. I'm going to do it a naive way, using text-embedding-ada-3 and setting a cosine similarity threshold of >0.4 for "related" sentences, with the threshold based on quick trial and error. This isn't the most robust method but should be fine for a proof of concept. There are plenty of other methods we could try (for example, cross encoders?)

First, we need to get the embeddings for each sentence. 


In [6]:
from langchain_openai import OpenAIEmbeddings

# set up embedding model and some constants 
EMBEDDING_MODEL = "text-embedding-3-small"

embeddings = OpenAIEmbeddings(
   model=EMBEDDING_MODEL
)

sentence_embeddings = await embeddings.aembed_documents(sentences)

In [7]:
SIMILARITY_THRESHOLD = 0.4 # cosine similarity threshold
MAX_CHUNK_SIZE=1000 # max chunk size, although we can go over it to preserve a sentence


In [8]:
# Utility function for cosine similarity
import numpy as np
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


In [9]:
# This is the main cell that does semantic chunking. Check the relatedness of each sentence pair and greedily grow chunks to maximum size.
combined_chunks = []
this_chunk = sentences[0]

for i in range(len(sentences)-1):

    similarity = cosine_similarity(sentence_embeddings[i],sentence_embeddings[i+1])

    if (len(this_chunk) > MAX_CHUNK_SIZE) or (similarity<SIMILARITY_THRESHOLD):
        # we are over the max chunk size or sentences are unrelated, time to start a new one
        if this_chunk != "": combined_chunks.append(this_chunk)
        this_chunk = sentences[i+1]
    else:
        this_chunk += ("  ")+sentences[i+1]


In [14]:
# Get a sense for how the chunks look now
print(f"Num chunks: {len(combined_chunks)}\n")

smallest_chunk = min(combined_chunks,key=len)
largest_chunk = max(combined_chunks, key=len)

print(f"Smallest chunk ({len(smallest_chunk)} chars):{smallest_chunk}\n")
print(f"Biggest chunk ({len(largest_chunk)} chars):{largest_chunk}\n")
for chunk in combined_chunks[:5]:
    print(f"Chunk: {chunk}\n")

Num chunks: 1621

Smallest chunk (3 chars):No.

Biggest chunk (1879 chars):That’s an extreme case, but even a non-extreme version of this process — and all big companies have one; they have to — is mind-bend- ingly complex to try to understand, even from the inside, let alone the outside.  “… and the breath of the whale is frequently attended with such an insupportable smell, as to bring on a disorder of the brain.” — Ulloa’s South America You can count on there being a whole host of impinging forces that will aWect the dynamic of decision-making on any issue at a big company.  The consensus building process, trade-oWs, quids pro quo, poli- tics, rivalries, arguments, mentorships, revenge for past wrongs, Part 5: The Moby Dick theory of big companies 35turf-building, engineering groups, product managers, product marketers, sales, corporate marketing, Xnance, HR, legal, chan- nels, business development, the strategy team, the international divisions, investors, Wall Street analysts, ind

<font color="blue">Having orphan chunks like "No" isn't great. In practice you probably want some minimum chunk size, but the longer chunks seem to be coherent and topical. Let's try it out as is.

Create a RAG pipeline (I've written my own wrapper but you could also use the stuff_documents_chain). Grab the test data we previously used in class and also copy over the code to run RAGAS. 

In [26]:
import importlib
import vanilla_rag

importlib.reload(vanilla_rag)

rag_pipeline = await vanilla_rag.vanilla_rag(combined_chunks, openai.api_key)

# Test it out
response = await rag_pipeline.ainvoke({"input":"What is a good rule of thumb to follow when selecting an industry to invest in?"})

print(response)

created qdrant client
populated vector db
created chain
{'response': AIMessage(content="A good rule of thumb to follow when selecting an industry to invest in is to choose an industry where the founders of the important companies are still alive and actively involved. Additionally, if you're entering an old industry, make sure to align with the forces of radical change that could disrupt the existing order. Once you've picked an industry, aim to get to the center of it quickly, focusing on the core of change and opportunity.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 84, 'prompt_tokens': 553, 'total_tokens': 637}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_483d39d857', 'finish_reason': 'stop', 'logprobs': None}, id='run-bd5eb1cf-3cf7-4e02-ad01-a029f0682ba0-0', usage_metadata={'input_tokens': 553, 'output_tokens': 84, 'total_tokens': 637}), 'context': [Document(metadata={'_id': '9fd58a66555f47248977e6ffe4b2e208',

<font color="blue">Now for the test set and evaluation code

In [27]:
# Import test set
import pandas as pd

test_df = pd.read_csv("testset.csv")

test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

In [31]:
# Get the answers to evaluate
from datasets import Dataset

answers = []
contexts = []

for question in test_questions:
  response = await rag_pipeline.ainvoke({"input" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [32]:
# Evaluate target metrics
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

results = evaluate(response_dataset, metrics)
print(results)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

{'faithfulness': 0.8054, 'answer_relevancy': 0.8152, 'context_recall': 0.8604, 'context_precision': 0.6711, 'answer_correctness': 0.5325}


<font color="blue">Let's compare the results from all our runs.

- Baseline with ada 2: 
{'faithfulness': 0.7181, 'answer_relevancy': 0.8632, 'context_recall': 0.7539, 'context_precision': 0.6594, 'answer_correctness': 0.5941}

- Same thing with TE3: 
{'faithfulness': 0.5940, 'answer_relevancy': 0.8591, 'context_recall': 0.8167, 'context_precision': 0.6930, 'answer_correctness': 0.5590}

- With compression retriever, using TE3 embeddings and gpt-4o-mini as compressor: 
{'faithfulness': 0.5662, 'answer_relevancy': 0.8607, 'context_recall': 0.5554, 'context_precision': 0.6535, 'answer_correctness': 0.5863}

- With parent doc retriever using TE3:
{'faithfulness': 0.8329, 'answer_relevancy': 0.8994, 'context_recall': 1.0000, 'context_precision': 0.7632, 'answer_correctness': 0.5469}

- With simple semantic chunker using TE3:
{'faithfulness': 0.8054, 'answer_relevancy': 0.8152, 'context_recall': 0.8604, 'context_precision': 0.6711, 'answer_correctness': 0.5325}

It looks like the parent doc retriever still wins, with the best context-related metrics. The simple semantic chunker comes in second for context_recall and third for context_precision. Presumably, other other metrics such as relevancy could be improved with a better prompt or primary model. The semantic chunker itself can certainly be improved from the version in this notebook.

These metrics give us some info, but it's also worth considering that at scale, the parent doc retriever could be more expensive or slower due to the extra tokens, so it would still be important to include use case considerations when making a final decision.