#### 🚧 BONUS CHALLENGE 🚧

> NOTE: Completing this challenge will provide full marks on the assignment, regardless of the complete of the notebook. You do not need to complete this in the notebook for full marks.

##### **MINIMUM REQUIREMENTS**:

1. Baseline `LCEL RAG` Application using `NAIVE RETRIEVAL`
2. Baseline Evaluation using `RAGAS METRICS`
  - [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
  - [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
  - [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
  - [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
  - [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)
3. Implement a `SEMANTIC CHUNKING STRATEGY`.
4. Create an `LCEL RAG` Application using `SEMANTIC CHUNKING` with `NAIVE RETRIEVAL`.
5. Compare and contrast results.

##### **SEMANTIC CHUNKING REQUIREMENTS**:

Chunk semantically similar (based on designed threshold) sentences, and then paragraphs, greedily, up to a maximum chunk size. Minimum chunk size is a single sentence.

Have fun!

In [1]:
# Let's install the same libraries as the week 4 day 2 Homework
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant
!pip install -qU ragas
!pip install -qU qdrant-client pymupdf pandas

# And some other useful utils
!pip install -qU nltk 

In [9]:
# And set the OpenAI key
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [3]:
# And load up the same document collection to work with
from langchain_community.document_loaders import PyMuPDFLoader
from pprint import pprint

PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"

loader = PyMuPDFLoader(
    file_path=PDF_LINK,
)

documents = loader.load()

# visual inspection shows that page 7 (zero-indexed page 6) is pretty much the first page with any meaningful text, let's start there.
documents=documents[6:]

# Each document will now contain text extracted as blocks (paragraphs)
for doc in documents[:10]:
    pprint(doc.page_content)

('Part 1: Why not to do a startup\n'
 'In this series of posts I will walk through some of my accumu-\n'
 'lated knowledge and experience in building high-tech startups.\n'
 'My speciXc experience is from three companies I have co-\n'
 'founded: Netscape, sold to America Online in 1998 for $4.2\n'
 'billion; Opsware (formerly Loudcloud), a public soaware com-\n'
 'pany with an approximately $1 billion market cap; and now\n'
 'Ning, a new, private consumer Internet company.\n'
 'But more generally, I’ve been fortunate enough to be involved\n'
 'in and exposed to a broad range of other startups — maybe 40\n'
 'or 50 in enough detail to know what I’m talking about — since\n'
 'arriving in Silicon Valley in 1994: as a board member, as an angel\n'
 'investor, as an advisor, as a friend of various founders, and as a\n'
 'participant in various venture capital funds.\n'
 'This series will focus on lessons learned from this entire cross-\n'
 'section of Silicon Valley startups — so don’t think

Now, we don't want to have the artificial page-break chunks. So let's recombine those into one long document before we start our semantic chunking strategy. It also looks like pymupdfloader didn't really give us a good way to distinguish between paragraphs, as every line is just separated by a newline. To avoid spending a lot of time on pdf munging, let's just strip out the newlines and treat it as one long paragraph. 

In [4]:
from pprint import pprint

one_big_string = ""
for doc in documents:
    cleaned_content = doc.page_content.strip().replace("\n", " ")
    one_big_string += cleaned_content

print(one_big_string)

Part 1: Why not to do a startup In this series of posts I will walk through some of my accumu- lated knowledge and experience in building high-tech startups. My speciXc experience is from three companies I have co- founded: Netscape, sold to America Online in 1998 for $4.2 billion; Opsware (formerly Loudcloud), a public soaware com- pany with an approximately $1 billion market cap; and now Ning, a new, private consumer Internet company. But more generally, I’ve been fortunate enough to be involved in and exposed to a broad range of other startups — maybe 40 or 50 in enough detail to know what I’m talking about — since arriving in Silicon Valley in 1994: as a board member, as an angel investor, as an advisor, as a friend of various founders, and as a participant in various venture capital funds. This series will focus on lessons learned from this entire cross- section of Silicon Valley startups — so don’t think that anything I am talking about is referring to one of my own companies: mo

In [5]:
# Split into sentences
import nltk
nltk.download('punkt')  # Download sentence tokenizer
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize

# Create the sentences as a list
sentences=sent_tokenize(one_big_string)

# Print out a few as a sanity check
for sentence in sentences[:5]:
    print(f"sentence: {sentence}\n\n")

sentence: Part 1: Why not to do a startup In this series of posts I will walk through some of my accumu- lated knowledge and experience in building high-tech startups.


sentence: My speciXc experience is from three companies I have co- founded: Netscape, sold to America Online in 1998 for $4.2 billion; Opsware (formerly Loudcloud), a public soaware com- pany with an approximately $1 billion market cap; and now Ning, a new, private consumer Internet company.


sentence: But more generally, I’ve been fortunate enough to be involved in and exposed to a broad range of other startups — maybe 40 or 50 in enough detail to know what I’m talking about — since arriving in Silicon Valley in 1994: as a board member, as an angel investor, as an advisor, as a friend of various founders, and as a participant in various venture capital funds.


sentence: This series will focus on lessons learned from this entire cross- section of Silicon Valley startups — so don’t think that anything I am talking abo

[nltk_data] Downloading package punkt to /Users/Angela/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/Angela/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


<font color="blue">Now that we have the data somewhat tidied up, let's find semantically similar sentences. I'm going to define semantic similarity a very naive way, using text-embedding-ada-3 and setting a cosine similarity threshold of >0.5 for "related" sentences and <0.5 for "unrelated" sentences. First, we need to get the embeddings for each sentence. 


In [10]:
from langchain_openai import OpenAIEmbeddings

# set up embedding model and some constants 
EMBEDDING_MODEL = "text-embedding-3-small"

embeddings = OpenAIEmbeddings(
   model=EMBEDDING_MODEL
)

sentence_embeddings = await embeddings.aembed_documents(sentences)

In [29]:
SIMILARITY_THRESHOLD = 0.4 # cosine similarity threshold
MAX_CHUNK_SIZE=1000 # max chunk size, although we can go over it to preserve a sentence


In [12]:
import numpy as np
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


In [30]:
combined_chunks = []
this_chunk = sentences[0]

for i in range(len(sentences)-1):

    similarity = cosine_similarity(sentence_embeddings[i],sentence_embeddings[i+1])

    # Uncomment for debug output
    #print(f"""sentence {i}: {sentences[i]}\nsentence {i+1}: {sentences[i+1]}\n
    #      similarity: {similarity}\nthis_chunk: {this_chunk}\ncombined_chunks: {combined_chunks}\n\n\n""")

    if (len(this_chunk) > MAX_CHUNK_SIZE) or (similarity<SIMILARITY_THRESHOLD):
        # we are over the max chunk size or sentences are unrelated, time to start a new one
        if this_chunk != "": combined_chunks.append(this_chunk)
        this_chunk = sentences[i+1]
    else:
        this_chunk += ("  ")+sentences[i+1]


In [31]:
print(f"Num chunks: {len(combined_chunks)}\n")

smallest_chunk = min(combined_chunks,key=len)
largest_chunk = max(combined_chunks, key=max)

print(f"Smallest chunk ({len(smallest_chunk)} chars):{smallest_chunk}\n")
print(f"Biggest chunk ({len(largest_chunk)} chars):{largest_chunk}\n")
for chunk in combined_chunks[:5]:
    print(f"Chunk: {chunk}\n")

Num chunks: 1621

Smallest chunk (3 chars):No.

Biggest chunk (1166 chars):When you get right down to it, you can ignore almost every- thing else.  I’m not suggesting that you do ignore everything else — just that judging from what I’ve seen in successful startups, you can.  Whenever you see a successful startup, you see one that has reached product/market Ht — and usually along the way screwed up all kinds of other things, from channel model to pipeline development strategy to marketing plan to press rela- tions to compensation policies to the CEO sleeping with the venture capitalist.  And the startup is still successful.  Conversely, you see a surprising number of really well-run startups that have all aspects of operations completely buttoned down, HR policies in place, great sales model, thoroughly thought-through marketing plan, great interview processes, outstanding catered food, 30″ monitors for all the program- mers, top tier VCs on the board — heading straight oG a cliG due to

In [32]:
# Embed the combined chunks, proceed with naive RAG as usual
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

LOCATION = ":memory:"
COLLECTION_NAME = "PMarca Blogs Semantic Chunking"
VECTOR_SIZE = 1536

In [33]:
qdrant_client = QdrantClient(location=LOCATION)

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)
)

qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings  # Embedding function from OpenAI embeddings
)

await qdrant_vector_store.aadd_texts(combined_chunks)

['a4adae1841494279a037426c6507403d',
 'c0026b740f5548bfaad716e1ceb24232',
 'd2f3dc84cbb0473a8d773ce9f1b5716a',
 'ab5e87d2294c4bb7a6533de2689c6096',
 '787419fcb8da4deeb793a53f6865df72',
 'cf6d3bd45b3a436898a16e20cd460f56',
 '2c6d33666e794758a021342c2be1b0f0',
 '399f7044ebfe437eac263571bec250c2',
 '7098b145564a479b9c9806fb4711cf2d',
 '02c13b3b89f142b88d6eb448afb0587f',
 '08ea27cc0b314b9bbb71daa1b0b59f02',
 '272320522c30477a9547dc85f5f5bc50',
 '3e8ee093546d413b9f7f92274697b2f9',
 '5331bb18033f433eac5d28eb5978a0ce',
 '435255c5798a42eab47241480af97c1c',
 '4d03548553974aec94fb76719eef82d2',
 'bfc32759bc62408290a313d35e689efb',
 'b722a417cf344b488b52f6dd6b85b824',
 '9052d00dca0c4a63801480ba18abcd89',
 '48e1c3dce15a4c47812647abb1c670d5',
 '7f72d10175974439ac4ff835a3c06dca',
 'fca2fdc796e34cc6a0df99dcc5bceccb',
 '8f8db9884d4f4f6d9b698c6fa4cb4944',
 'ac5f9353b46e47728aaf3951eb8563d4',
 'dddd7dedd20d41eb9f4ee62bfca55d00',
 'aa4b92f1616b4b3d93c6368147cab801',
 '8273675389b243dfb33c46c8b7ae57c9',
 

In [34]:
retriever = qdrant_vector_store.as_retriever()

In [35]:
retrieved_documents = await retriever.ainvoke("What is a rule of thumb for selecting an industry to invest in?")

In [36]:
for document in retrieved_documents:
    print(f"Document: {document.page_content}")

Document: If you’re going to enter an old industry, make sure to do it on the side of the forces of radical change that threaten to up-end the existing order — and make sure that those forces of change have a reasonable chance at succeeding.  Second rule of thumb: Once you have picked an industry, get right to the center of it as fast as you possibly can.  Your target is the core of change and opportunity — Xgure out where the action is and head there, and do not delay your progress for extraneous opportunities, no matter how lucrative they might be.
Document: Part 2: Skills and education 119Part 3: Where to go and why When picking an industry to enter, my favorite rule of thumb is this: Pick an industry where the founders of the industry — the founders of the important companies in the industry — are still alive and actively involved.
Document: Apply this rule when selecting which company to go to.  Go to the company where all the action is happening.  Or, if you are going to join a s