# Lab 1 - Overview of embeddings-based retrieval

This notebook first shows how a document can be prepared and added to Chroma DB.    
It then defines a RAG function that uses an LLM (ChatGPT) to answer a question based on the results of querying the DB. 

In [2]:
from pypdf import PdfReader

reader = PdfReader("The-Mom-Test-en.pdf")

pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [4]:
#print(pdf_texts[0])

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [6]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 232
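
The `separators` list above is tried from coarsest to finest: paragraphs first, then lines, sentences, words, and finally single characters. The block below is a simplified pure-Python sketch of that idea, not LangChain's actual implementation. `recursive_split` is a hypothetical helper; it drops the separator it splits on and ignores `chunk_overlap`.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first (illustrative sketch only)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the first separator that occurs in the text; "" means per-character.
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        # No separator applies: fall back to hard cuts.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    remaining = separators[i + 1:]
    parts = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging small parts while they still fit.
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # The part alone is too long: retry with finer separators.
                chunks.extend(recursive_split(part, remaining, chunk_size))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nThree."
for chunk in recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=30):
    print(repr(chunk))
```

The second paragraph is longer than 30 characters, so it falls through to the `". "` separator, while the short paragraphs are kept whole.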


In [7]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)
   
print(token_split_texts[10])
print(f"\nTotal chunks: {len(token_split_texts)}")


failing the mom test son : “ mom, mom, i have an idea for a business — can i run it by you? ” i am about to expose my ego — please don ’ t hurt my feelings. mom : “ of course, dear. ” you are my only son and i am ready to lie to protect you. son : “ you like your ipad, right? you use it a lot? ” mom : “ yes. ” you led me to this answer, so here you go. son : “ okay, so would you ever buy an app which was like a cookbook for your ipad? ” i am optimistically asking a hypothetical question and you know what i want you to say. mom : “ hmmm. ” as if i need another cookbook at my age. son : “ and it only costs $ 40 — that ’ s cheaper than those hardcovers on your shelf. ” i ’ m going to skip that lukewarm signal and tell you more about my great idea. mom : “ well... ” aren ’ t apps supposed to cost a dollar? son : “ and you can share recipes with your friends, and there ’ s an iphone app which is your shopping list. and videos of that celebrity chef you love.

Total chunks: 236
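
The count grows from 232 to 236, presumably because a few character-based chunks were longer than 256 tokens and were split again. A minimal sketch of fixed-size token windows follows. It assumes whitespace-separated words as a stand-in for the model's real tokenizer, and `split_by_token_window` is a hypothetical helper, not the LangChain class.

```python
def split_by_token_window(text, tokens_per_chunk=256, chunk_overlap=0, tokenize=str.split):
    """Cut text into windows of at most tokens_per_chunk tokens.

    `tokenize` defaults to whitespace splitting, which is only an
    approximation of a real subword tokenizer.
    """
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")

    tokens = tokenize(text)
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        # Stop once the window has reached the end of the token list.
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks


words = " ".join(f"w{i}" for i in range(10))
print(split_by_token_window(words, tokens_per_chunk=4))
```

A text shorter than the window comes back as a single chunk, which is why most of the 232 character chunks pass through unchanged.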


In [8]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()

In [9]:
print(embedding_function([token_split_texts[10]]))

[[-0.054110657423734665, 0.10771001875400543, -0.005316439084708691, -0.03969819098711014, 0.025499017909169197, -0.018412992358207703, 0.005782117135822773, 0.05646170675754547, 0.09644944220781326, 0.016840169206261635, 0.05963132530450821, -0.025572126731276512, -0.021317411214113235, -0.04658119007945061, 0.08316187560558319, -0.0281317550688982, 0.12755519151687622, -0.017795145511627197, -0.07733485102653503, 0.002549288794398308, -0.02086193673312664, 0.03136380389332771, 0.11139810085296631, 0.03590168431401253, 0.03599106892943382, 0.04973244667053223, -0.0028566515538841486, -0.05273302271962166, -0.09187264740467072, 0.04816967621445656, 0.018823346123099327, -0.03593284636735916, -0.01929320953786373, 0.044532742351293564, 0.0014540485572069883, -0.07881123572587967, -0.006094179581850767, 0.016404516994953156, -0.02853977121412754, -0.07467714697122574, -0.01987944357097149, 0.037578143179416656, -0.0681605190038681, 0.033071838319301605, 0.01004654448479414, -0.0266507044


In [10]:
chroma_client = chromadb.Client()

# The string is just the collection name. 

chroma_collection = chroma_client.create_collection("TheMomTest_book", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

# The .add method embeds token_split_texts using the embedding_function specified above

chroma_collection.add(ids=ids, documents=token_split_texts)

chroma_collection.count()

236

In [13]:
query = "What is the main idea behind the book?"

results = chroma_collection.query(query_texts=[query], n_results=5)

# Under the hood, .query() embeds the query with the same embedding function used when adding the documents.
# Chroma then searches for the stored documents most similar to the query and returns the top n_results (5 here).

retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(document)
    print('\n')

spindler, and tim barnes for showing me the good bits of the startup education world and giving me my first chance to teach. beyond the obvious influence from steve blank and eric ries, a big thanks to some other writers who have directly helped this book with their work : amy hoy on worldviews, brant cooper on segmentation, richard rumelt and lafley / martin on strategy, neil rackham on sales, and derek sivers on remembering that businesses are meant to make you happy. and of course, big thanks for mom & dad for gently planting the entrepreneurial seed through both encouragement and their own collection of insane startup and / or shipwreck stories. the cover was put together by devin hunt. the author image is provided by heisenbergmedia. com. thanks! 121


this book isn ’ t a summary or description or re - interpretation of the process of customer development. that ’ s a bigger concept and something steve blank has covered comprehensively in 4 steps to the e. piphany and the startup o
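
Conceptually, retrieval ranks the stored embeddings by how close they are to the query embedding and returns the best matches. The block below is a brute-force sketch of that idea using cosine similarity. It is not how Chroma is implemented (Chroma uses its own index and distance setting), and the toy vectors and the `top_k` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, doc_embeddings, documents, k=5):
    """Return the k documents whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), doc)
        for emb, doc in zip(doc_embeddings, documents)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
toy_documents = ["about customers", "about pricing", "acknowledgements"]
toy_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.9]]
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_embeddings, toy_documents, k=2))
```

Similarity search only measures closeness in embedding space, which helps explain why a loosely related passage such as the acknowledgements page can still appear among the results.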

In [14]:
import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

# Pass the key explicitly; OpenAI() would also read OPENAI_API_KEY from the environment
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [15]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # The trailing space keeps the two concatenated string literals from running together
            "content": "You are a seasoned educator counseling fresh entrepreneurs in YCombinator. Your users are asking questions about information contained in the book. "
            "You will be shown the user's question, and the relevant information from the book. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
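
Before spending an API call, it can help to inspect exactly what the model will receive. The sketch below builds the same two-message prompt as a plain Python list without contacting OpenAI. `build_messages` is a hypothetical helper, and the shortened system prompt is an assumption made for the example.

```python
def build_messages(query, retrieved_documents):
    """Assemble the chat messages for a RAG prompt without calling any API."""
    # Retrieved chunks are joined with blank lines so they stay visually separate.
    information = "\n\n".join(retrieved_documents)
    system_prompt = (
        "You will be shown the user's question, and the relevant information from the book. "
        "Answer the user's question using only this information."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"},
    ]


example = build_messages("What is the main idea?", ["first chunk", "second chunk"])
for message in example:
    print(f"[{message['role']}]")
    print(message["content"])
    print()
```

Printing the messages makes it easy to spot formatting slips, such as two sentences fused by a missing space, before they reach the model.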

In [16]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(output)


The main idea behind the book is to provide guidance on how to properly talk to customers and learn from them. It is not a summary or description of the process of customer development, but rather focuses specifically on customer conversation. The book is intended for individuals who have read about customer development or lean startup and want to know how to have their first customer conversation, traditional business or sales people looking to be more effective in a young company, mentors, supporters, or investors in startups who want to help them have more useful customer conversations, individuals who have a new business idea and want to determine its viability before quitting their job, those seeking funding who need more evidence of solving a real problem, individuals who find the process of customer conversation awkward and want an easier way to do it, those with a vague sense of an opportunity and want to clarify it, and individuals who have always wanted to start their own com