In [1]:
!pip show chromadb

Name: chromadb
Version: 0.5.0
Summary: Chroma.
Home-page: 
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: y:\ml_projects\mcqgen\env\lib\site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, requests, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


In [8]:
import os
from dotenv import load_dotenv
# Load environment variables from the .env file
load_dotenv()

key = os.getenv("OPENAI_API_KEY")
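
The cell above stores the key but never checks that it was found. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of `python-dotenv` or LangChain, and the error wording is only illustrative.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for this notebook. `environ`
    defaults to os.environ and can be replaced by a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Usage in the notebook, after load_dotenv() has run:
# key = require_env("OPENAI_API_KEY")
```

Failing at this point gives a clearer message than an authentication error raised later, when the embeddings are first requested.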

In [9]:
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain_openai import OpenAIEmbeddings, OpenAI

In [10]:
loader = DirectoryLoader("news_articles", glob="./*.txt")

In [14]:
document = loader.load()

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text_chunks = text_splitter.split_documents(document)

In [20]:
text_chunks[0].page_content

'Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to exploring strategic partnerships and acquisitions with this round of funding.”'

In [22]:
len(text_chunks)

233
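
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each window repeats the last 4 characters of the previous one, which is the role the 200-character overlap plays for the 1000-character chunks above: text near a boundary appears in both neighbouring chunks.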

### Creating DB

In [24]:
# No extra import is needed here: OpenAIEmbeddings was already imported from langchain_openai above.

In [26]:
persist_directory = 'db'
embedding = OpenAIEmbeddings()

In [27]:
vectordb = Chroma.from_documents(documents=text_chunks, embedding=embedding,
                                 persist_directory=persist_directory)

In [29]:
# Reopen the store that was persisted to disk, using the same embedding function
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [39]:
retriever = vectordb.as_retriever()
docs = retriever.invoke("How much money did Microsoft raise?")

In [40]:
print(docs[0].page_content)

April 28, 2023

VC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.

April 25, 2023

Called ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”


In [41]:
len(docs)

4

In [42]:
retriever = vectordb.as_retriever(search_kwargs={"k":2})
docs2 = retriever.invoke("How much money did Microsoft raise?")

In [43]:
len(docs2)

2
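
The `k` setting controls how many of the nearest chunks the retriever returns (4 by default, as the earlier `len(docs)` shows). The sketch below illustrates the idea on toy 2-D vectors. It assumes cosine similarity as the ranking metric purely for illustration; the metric Chroma actually uses depends on the collection's configuration, and `top_k` and the vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings"; real ones have many more dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.8, 0.6],
    "chunk_c": [0.0, 1.0],
    "chunk_d": [-1.0, 0.0],
}
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

A smaller `k` sends less context to the LLM, which lowers token usage but raises the chance that the relevant chunk is left out.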

### Make a chain

In [45]:
from langchain.chains import RetrievalQA

In [46]:
llm = OpenAI()

qa_chain = RetrievalQA.from_chain_type(
    llm, chain_type = 'stuff', retriever=retriever, return_source_documents = True
)
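
The `'stuff'` chain type places all retrieved chunks into a single prompt. A rough sketch of that idea is shown below; the prompt wording is an assumption made for illustration and is not LangChain's actual template, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt string.

    Illustrative only: the real chain uses its own prompt template.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "How much money did Microsoft raise?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because everything goes into one call, this approach works only while the combined chunks fit in the model's context window, which is one reason to keep `k` small.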

In [47]:
# Print the answer, then the source file of each retrieved chunk
def process_llm_response(llm_response):
    print(llm_response['result'])
    print("\n\nSources: ")
    for source in llm_response['source_documents']:
        print(source.metadata['source'])

In [49]:
# Example
query = "How much money did Microsoft raise?"
llm_response = qa_chain.invoke({"query": query})
process_llm_response(llm_response)

 The size of Microsoft's investment is believed to be around $10 billion.


Sources: 
news_articles\05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
news_articles\05-03-checks-the-ai-powered-data-protection-project-incubated-in-area-120-officially-exits-to-google.txt
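
When several retrieved chunks come from the same article, the same path is printed more than once. A small sketch of collecting each source only once, in retrieval order, is shown below. `unique_sources` is a hypothetical helper, and `SimpleNamespace` stands in for LangChain `Document` objects so the example runs on its own.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each document's 'source' metadata once, in first-seen order."""
    seen = []
    for doc in source_documents:
        source = doc.metadata.get("source", "<unknown>")
        if source not in seen:
            seen.append(source)
    return seen


# Stand-ins for Document objects; only the .metadata attribute is used.
fake_docs = [
    SimpleNamespace(metadata={"source": "news_articles/a.txt"}),
    SimpleNamespace(metadata={"source": "news_articles/b.txt"}),
    SimpleNamespace(metadata={"source": "news_articles/a.txt"}),
]
print(unique_sources(fake_docs))
```

With the real chain, the same function could be applied to `llm_response['source_documents']`.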


### Archive and delete the database

In [None]:
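import shutil
shutil.make_archive("db", "zip", root_dir=".", base_dir="db")  # portable equivalent of: zip -r db.zip ./db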

In [None]:
# To clean up, delete the collection
vectordb.delete_collection()

In [None]:
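# Delete the directory (shutil also works on Windows, where `rm -rf` is usually unavailable)
import shutil
shutil.rmtree(persist_directory, ignore_errors=True)  # ignore_errors mirrors the -f flag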