# Chroma & LangChain quickstart

In this notebook we will see how to interact with ChromaDB and LangChain, so that we can wrap everything into a script for the next activities 🤗

## Chroma docker container 

First, run Chroma inside Docker with the command below. The server listens on port 8000 inside the container and is published on host port 8001, which is the port the client connects to later in this notebook:

```bash
docker run -d \
  --name chroma-db \
  -p 8001:8000 \
  -e ALLOW_RESET=true \
  -e ANONYMIZED_TELEMETRY=false \
  -e CHROMA_SERVER_AUTH_CREDENTIALS_ENABLE=false \
  -e CHROMA_SERVER_HTTP_PORT=8000 \
  -v "$(pwd)/data/chroma:/chroma/chroma" \
  ghcr.io/chroma-core/chroma:latest
```

You should now have a Chroma container running; you can check it with the `docker ps` command.


In [1]:
import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings


In [2]:
# Connect without authentication to the port exposed by the container
chroma_client = chromadb.HttpClient(host="localhost", port=8001)
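
Before going further, it can help to confirm that the server actually answers. The helper below is a minimal sketch, not part of the original notebook: `wait_for_chroma` is a hypothetical name, and it only relies on the client's `heartbeat()` method. It assumes the client object could be constructed in the first place (depending on the `chromadb` version, `HttpClient(...)` may itself raise if the server is unreachable).

```python
import time


def wait_for_chroma(client, retries: int = 5, delay: float = 1.0) -> bool:
    """Return True as soon as the server answers a heartbeat, False otherwise.

    `client` is expected to expose a `heartbeat()` method, as the Chroma
    client does. The function name and the retry policy are illustrative.
    """
    for attempt in range(retries):
        try:
            client.heartbeat()  # raises if the server cannot be reached
            return True
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

Calling `wait_for_chroma(chroma_client)` right after creating the client gives a quick yes/no answer instead of a stack trace on the first real request.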

Follow the [official quickstart](https://docs.trychroma.com/docs/overview/getting-started) to create a collection named `document_store`, add documents to it, and query the collection as in the cells below.
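
The quickstart steps can be sketched roughly as follows. This is an illustration under assumptions, not the official example: `seed_and_query` is a hypothetical helper, the two sample documents are made up, and the collection is assumed to use Chroma's default embedding function. It uses `upsert` rather than `add` so that re-running the cell does not fail on duplicate IDs.

```python
def seed_and_query(client, name: str = "document_store"):
    """Create (or reuse) a collection, store two documents and run one query.

    `client` is a Chroma client such as the `chroma_client` created above.
    The helper name, IDs and sample texts are illustrative assumptions.
    """
    collection = client.get_or_create_collection(name=name)

    # No embeddings are passed, so Chroma embeds the documents itself
    # with the collection's embedding function.
    collection.upsert(
        ids=["doc-1", "doc-2"],
        documents=[
            "Chroma is a vector database.",
            "LangChain helps build LLM applications.",
        ],
    )

    # Return the single closest document to the query text.
    return collection.query(query_texts=["vector database"], n_results=1)
```

For example, `seed_and_query(chroma_client)` returns a dictionary with `ids`, `documents` and `distances` entries for the best match.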

> ⚠️ Your output values will differ, since we use different data.

In [3]:
chroma_client.count_collections()

6

In [4]:
chroma_client.list_collections()

['document_store',
 'index_b0bc2f8a-da73-4b6f-a2f7-03dd57a644d4',
 'index_601514aa-c91f-4c50-98ce-bf098a471754',
 'index_default',
 'index_172ac334-b578-472f-b888-79b4d08eb6aa',
 'index_9885888a-e4e9-41e6-abf3-01e17be7452b']

In [5]:
collection = chroma_client.get_collection(name='document_store')

In [6]:
collection.get()['ids'][:5]

['0e3f30ea-574c-4134-a103-0b2669491080',
 '6f4e7096-7afd-4790-a550-d47862292341',
 'cc81e872-194a-4780-b237-334fea3b3a85',
 '81d4d14a-9db2-4b3c-af7e-212aba06a779',
 '828b3041-f982-4594-8106-449d791fb4b4']

## LangChain in a nutshell

[LangChain](https://python.langchain.com/v0.1/docs/get_started/quickstart/) is a framework that simplifies building applications powered by language models, providing components for document handling, memory, agents, and chains to orchestrate complex workflows. 

Its primary purpose is to help developers create context-aware AI applications that can connect language models to external data sources and tools. Let's take a look at how it works.

In [7]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()

In [8]:
llm.invoke("how can I be better at coding ?")

AIMessage(content="1. Practice regularly: Coding is a skill that improves with practice. Set aside time each day to work on coding challenges, projects, or practice problems to improve your coding skills.\n\n2. Focus on problem-solving: Coding is essentially problem-solving, so focus on understanding the problem at hand before writing any code. Break the problem down into smaller, manageable tasks and work through them sequentially.\n\n3. Collaborate with others: Join coding communities, attend coding meetups, or work on projects with other developers. Collaborating with others can help you learn new techniques, get feedback on your code, and discover best practices.\n\n4. Learn from your mistakes: When you encounter bugs or errors in your code, take the time to understand what went wrong and how you can improve it. Learning from your mistakes can help you become a better coder in the long run.\n\n5. Stay updated: The field of coding is constantly evolving, so it's important to stay up

In [9]:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class engineer with more than 20y of experience."),
    ("user", "{input}")
])

In [10]:
chain = prompt | llm 
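
To get an intuition for what the `|` operator does here, the toy below composes two steps so that the output of the first becomes the input of the second. It is a simplified sketch of the idea only, not LangChain's actual implementation: `Step`, `toy_prompt` and `toy_llm` are hypothetical stand-ins, and the "model" just upper-cases its input.

```python
class Step:
    """Wrap a function so that steps can be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for a prompt template and a model.
toy_prompt = Step(lambda question: f"system: be concise\nuser: {question}")
toy_llm = Step(lambda text: text.upper())

# Same shape as `prompt | llm`: format the input, then pass it to the model.
toy_chain = toy_prompt | toy_llm
print(toy_chain.invoke("hi"))
```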

In [11]:
chain.invoke("how can I be better at coding ?")

AIMessage(content="To become better at coding, I would suggest the following strategies based on my experience as a world-class engineer:\n\n1. **Practice Regularly**: Like any skill, coding requires regular practice to improve. Code consistently, work on small projects, and challenge yourself with increasingly difficult tasks.\n\n2. **Learn from Others**: Collaborate with more experienced developers, participate in code reviews, and seek feedback on your code. This will help you learn new techniques and best practices.\n\n3. **Read Code**: Study code written by experienced developers. Reading and understanding high-quality code is a great way to learn new concepts and improve your coding skills.\n\n4. **Continuous Learning**: Stay updated on the latest technologies, tools, and industry trends. Attend workshops, conferences, and read books or online resources to expand your knowledge.\n\n5. **Write Clean and Maintainable Code**: Focus on writing code that is easy to read, maintain, and

### Parser 

LangChain's parsers are specialized tools that help extract structured data from unstructured text or LLM outputs. They convert raw text into usable formats like dictionaries, lists, or custom objects. Let's follow the docs and use the `StrOutputParser` class, which returns the content of the model's message as a plain string.


In [12]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()
chain = prompt | llm | output_parser

In [14]:
chain.invoke("how can I be better at coding ?")

'To become better at coding, especially with over 20 years of experience, here are some advanced tips:\n\n1. **Continuous Learning**: Stay updated with the latest technologies, trends, and best practices in the industry. Attend workshops, webinars, and conferences to keep your skills sharp.\n\n2. **Practice Regularly**: Just like any skill, coding improves with practice. Challenge yourself with coding puzzles, open source projects, or personal projects that push your boundaries.\n\n3. **Code Reviews**: Engage in code reviews with colleagues or in online communities. This not only helps you learn new techniques but also provides valuable feedback on your own code.\n\n4. **Refactor and Optimize**: Take the time to refactor your code regularly. Optimize for performance and readability. This exercise will help you write cleaner, more efficient code.\n\n5. **Build Complex Projects**: Undertake challenging projects that require you to dive deep into complex problems. Solving these will broad

In [15]:
type(chain.invoke("how can I be better at coding ?"))

str

In this notebook we will not use the [Retrieval Chain](https://python.langchain.com/v0.1/docs/get_started/quickstart/) since we do not need information from the web, but you can check it out in the docs 😎

Next, let's see how to connect LangChain and Chroma, following [the integration guide](https://python.langchain.com/docs/integrations/vectorstores/chroma/#basic-initialization).

In [None]:
#!pip install -qU chromadb langchain-chroma

In [16]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma",  # Where to save data locally, remove if not necessary
)
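
The cell above stores data in a local `./chroma` directory. To target the Chroma server running in Docker instead, the vector store can be built from an existing client through the `client` argument of `Chroma`. The helper below is a sketch under assumptions: `connect_vector_store` is a hypothetical name, and the embedding model passed in must be the same one used to populate the collection, otherwise queries are likely to fail on an embedding dimension mismatch.

```python
def connect_vector_store(client, embeddings, collection_name: str = "example_collection"):
    """Build a LangChain vector store backed by an existing Chroma client.

    `client` is for instance the `chroma_client` created at the top of the
    notebook and `embeddings` the `OpenAIEmbeddings` instance from the cell
    above. The helper name is an illustrative assumption.
    """
    # Imported here so the helper can be defined before the package is installed.
    from langchain_chroma import Chroma

    return Chroma(
        client=client,  # reuse the HTTP connection to the running server
        collection_name=collection_name,
        embedding_function=embeddings,
    )
```

A call such as `vector_store = connect_vector_store(chroma_client, embeddings)` then replaces the `persist_directory` variant.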

### Your mini mission 

Now that you have the basics, your mission is to code a function named `batch_upload_documents` like the one below, upload this [PDF](https://arxiv.org/abs/1706.03762) into your collection, and run a simple query.

```python
from typing import List, Optional

# FastAPI helpers for multipart form uploads
from fastapi import File, Form, UploadFile


def batch_upload_documents(
    files: List[UploadFile] = File(...),
    title_prefix: str = Form("Document"),
    source: str = Form(""),
    author: str = Form(""),
    tags: str = Form(""),
    chunk_size: int = Form(1000, description="Size of text chunks for PDF documents"),
    chunk_overlap: int = Form(100, description="Overlap between text chunks for PDF documents"),
    index_id: Optional[str] = Form(None, description="Optional index ID to add documents to"),
):
    """
    Upload multiple documents to the knowledge base and optionally add to an index.
    Handles both text files and PDFs.

    For PDF files, the document will be split into chunks using LangChain's text splitter.
    """
    # TODO: code here
    pass
```
```
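
As a starting point, the sketch below shows what `chunk_size` and `chunk_overlap` mean, together with one possible way to build per-chunk metadata. It is an illustration under assumptions, not the expected solution: `split_text` is a plain sliding window written with the standard library (the mission asks for LangChain's text splitter, which splits more carefully on separators), and `chunk_metadata` is a hypothetical helper whose field layout is modeled on the sample output further down.

```python
from datetime import datetime
from typing import Dict, List


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Cut text into windows of chunk_size characters overlapping by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_metadata(filename: str, page: int, chunk_index: int,
                   title_prefix: str = "Document", doc_number: int = 1,
                   author: str = "", tags: str = "pdf") -> Dict[str, str]:
    """Build metadata for one PDF chunk (field layout is an assumption)."""
    now = datetime.now()
    return {
        "author": author,
        "created_at": now.isoformat(),
        "description": f"PDF page {page} uploaded as part of batch on {now.date().isoformat()}",
        "filename": f"{filename}_page{page}_chunk{chunk_index}",
        "source": filename,
        "tags": tags,
        "title": f"{title_prefix} {doc_number}: {filename} (Page {page}, Chunk {chunk_index})",
        "updated_at": now.isoformat(),
    }
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` is split into `"abcd"`, `"defg"` and `"ghij"`: each chunk repeats the last character of the previous one, which keeps some context across chunk boundaries.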

You should see results like the ones below 🥸

In [17]:
results = collection.query(
    query_texts=["attention"],
    n_results=5
)
results

{'ids': [['a39e9252-9f9d-4787-98af-e428a1a8088f',
   'dcff76f7-450b-4315-a348-20368b50afcc',
   'f9b79128-b5e9-4094-afc9-649096630d00',
   'db5de2e2-4281-48d8-bc51-5d928e7c3720',
   '70c60f56-8215-4e21-86be-e6e6ee061ea9']],
 'distances': [[0.9198479056358337,
   0.9198479056358337,
   0.9216511249542236,
   0.9216515948930766,
   0.9234551787376404]],
 'embeddings': None,
 'metadatas': [[{'author': '',
    'created_at': '2025-03-11T00:08:04.573912',
    'description': 'PDF page 9 uploaded as part of batch on 2025-03-11',
    'filename': 'attention.pdf_page9_chunk37',
    'id': 'a39e9252-9f9d-4787-98af-e428a1a8088f',
    'source': 'attention.pdf',
    'tags': 'pdf',
    'title': 'Document 1: attention.pdf (Page 9, Chunk 37)',
    'updated_at': '2025-03-11T00:08:04.573923'},
   {'author': '',
    'created_at': '2025-03-11T00:07:29.513343',
    'description': 'PDF page 9 uploaded as part of batch on 2025-03-11',
    'filename': 'attention.pdf_page9_chunk37',
    'id': 'dcff76f7-450b-4315-