**Importing Libraries**

In [1]:
# ! pip install -U langchain openai tiktoken chromadb pypdf sentence_transformers langchain-openai

In [2]:
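# Community integrations (vector store, document loaders) live in langchain_community
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader, PyPDFLoader, DirectoryLoader
# The OpenAI LLM and embedding wrappers come from the langchain-openai package installed above
from langchain_openai import OpenAI, OpenAIEmbeddings
import os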

**Setting up the OpenAI key**

In [3]:
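# Read the key from a local file (keep this file out of version control)
with open("./data/OpenAI-key.txt") as key_file:
    key = key_file.read().split()[0]
os.environ["OPENAI_API_KEY"] = key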
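The cell above assumes the key file exists and is non-empty. A slightly more defensive variant is sketched below. `load_api_key` and `set_api_key` are hypothetical helper names introduced here for illustration; they are not part of LangChain or the OpenAI client.

```python
import os
from pathlib import Path


def load_api_key(path):
    """Return the first whitespace-delimited token in the file at `path`.

    Hypothetical helper: raises ValueError if the file holds no token.
    """
    tokens = Path(path).read_text().split()
    if not tokens:
        raise ValueError(f"No API key found in {path}")
    return tokens[0]


def set_api_key(path, env=os.environ):
    """Store the key under OPENAI_API_KEY unless a value is already set."""
    if not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = load_api_key(path)
    return env["OPENAI_API_KEY"]
```

Calling `set_api_key("./data/OpenAI-key.txt")` would then leave an already exported key untouched and fail loudly on an empty file.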

**Loading PDF Files**

In [4]:
# PyPDFLoader returns one Document per PDF page, so the count below is pages, not files
pdfLoader = DirectoryLoader("./data", glob="./*.pdf", loader_cls=PyPDFLoader)
docs = pdfLoader.load()
print("Total Documents Found: ", len(docs))

Total Documents Found:  450


**Splitting text**

In [5]:
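text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=200,    # maximum characters shared by consecutive chunks
    length_function=len,  # measure chunk length in characters
)
texts = text_splitter.split_documents(docs)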

In [6]:
len(texts)

2026
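To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_window_chunks` is a hypothetical function that only illustrates the two parameters.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.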

**Initializing the vector store**

In [7]:
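# Note: the name `dir` shadows the Python builtin of the same name
dir = "ChromaDB"
embedding = OpenAIEmbeddings(disallowed_special=())
# Embed every chunk and write the vectors to the directory on disk
db = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory=dir,
)
# Explicit flush to disk; newer Chroma releases may persist automatically
db.persist()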

**Load from the directory**

In [8]:
# Drop the in-memory handle, then reopen the store from the persisted directory
db = None
db = Chroma(persist_directory=dir, embedding_function=embedding)
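Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The toy sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity. The vectors, the `toy_store` list, and both function names are made up for illustration, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the texts of the k entries most similar to `query_vector`."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for real chunk embeddings
toy_store = [
    ("chunk about prompts", [0.9, 0.1, 0.0]),
    ("chunk about finance", [0.0, 0.2, 0.9]),
    ("chunk about prompt design", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], toy_store))
# ['chunk about prompts', 'chunk about prompt design']
```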

**Creating Retriever**

In [9]:
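# Default retriever: similarity search returning the 4 closest chunks
retriever = db.as_retriever()
# Note: this reuses the name `docs`, replacing the list of loaded PDF pages
docs = retriever.invoke("What is a prompt?")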

In [10]:
docs

[Document(page_content='presented with a broad or undetailed prompt, its output predominantly exhibits a\ngeneric nature, which, while being applicable across a range of contexts, may not be\noptimal for any specific application. In contrast, a detailed and precise prompt enables\nthe model to generate content that is more aligned with the unique requirements of\n3', metadata={'page': 2, 'source': 'data/prompt engineering in LLM Review.pdf'}),
 Document(page_content='presented with a broad or undetailed prompt, its output predominantly exhibits a\ngeneric nature, which, while being applicable across a range of contexts, may not be\noptimal for any specific application. In contrast, a detailed and precise prompt enables\nthe model to generate content that is more aligned with the unique requirements of\n3', metadata={'page': 2, 'source': 'data/prompt engineering in LLM Review.pdf'}),
 Document(page_content='presented with a broad or undetailed prompt, its output predominantly exhibits a
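The retriever output above contains the same chunk more than once. One possible cause, stated here only as a guess, is that the persisted directory was populated on more than one run of the notebook. Whatever the cause, repeated chunks can be filtered after retrieval. The sketch below uses a small stand-in `Doc` class (hypothetical, mirroring only the `page_content` and `metadata` attributes visible in the output) so that it runs without LangChain, and it assumes metadata values are hashable.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_documents(documents):
    """Keep the first occurrence of each (content, metadata) pair, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        # Assumes metadata values are hashable (e.g. str, int)
        key = (doc.page_content, tuple(sorted(doc.metadata.items())))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


retrieved = [
    Doc("same text", {"page": 2, "source": "a.pdf"}),
    Doc("same text", {"page": 2, "source": "a.pdf"}),
    Doc("other text", {"page": 5, "source": "a.pdf"}),
]
print(len(dedupe_documents(retrieved)))  # 2
```

Because the function only reads `page_content` and `metadata`, it should also accept the documents returned by the retriever.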

**Creating chain using Embeddings and retriever**

In [11]:
chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",            # place all retrieved chunks into a single prompt
    retriever=retriever,
    return_source_documents=True,  # include the retrieved chunks in the output
)

In [12]:
chain

RetrievalQA(combine_documents_chain=StuffDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'question'], template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=OpenAI(client=<openai.resources.completions.Completions object at 0x7b09f58aa230>, async_client=<openai.resources.completions.AsyncCompletions object at 0x7b09f72885e0>, openai_api_key='sk-[REDACTED]', openai_proxy='')), document_variable_name='context'), return_source_documents=True, retriever=VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7b09f503afe0>))
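The repr above shows the prompt template that the "stuff" chain fills in: the retrieved chunks go into `{context}` and the user query into `{question}`. A rough sketch of that assembly follows. The template string is copied from the repr; `build_stuff_prompt` is a hypothetical function, and joining chunks with a blank line is an assumption about the separator rather than something the output confirms.

```python
# Template text copied from the chain repr shown above
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all chunks into one context block and fill in the template."""
    context = separator.join(chunks)  # separator is an assumption
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is prompt engineering?",
)
print(prompt)
```

Since every chunk lands in a single prompt, the number and size of retrieved chunks is bounded by the model's context window.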

**Outputs**

In [13]:
query = "What is prompt engineering?"
chain.invoke(query)

{'query': 'What is prompt engineering?',
 'result': ' Prompt engineering is the systematic design and optimization of input prompts to guide the responses of LLMs, ensuring accuracy, relevance, and coherence in the generated output. It is a crucial process in harnessing the full potential of LLMs and making them more applicable across diverse domains. Prompt engineering encompasses a spectrum of techniques, ranging from foundational approaches to more sophisticated methods, and is continually evolving through emerging research.',
 'source_documents': [Document(page_content='behavior [15, 16].\nThe discipline of prompt engineering has advanced alongside LLMs. What orig-\ninated as a fundamental practice of shaping prompts to direct model outputs has\nmatured into a structured research area, replete with its distinct methodologies and\nestablished best practices. Prompt engineering refers to the systematic design and\noptimization of input prompts to guide the responses of LLMs, ensuring

In [14]:
query = "Why do we need prompts?"
chain.invoke(query)

{'query': 'Why do we need prompts?',
 'result': ' Prompts are necessary in order to guide the model towards solving specific tasks, such as text classification and generation, in natural language processing. They provide structured information to the model, making it easier to extract task-specific information and improve performance on downstream tasks. Additionally, prompts can be reformulated to suit different tasks or models, increasing their transferability and usefulness.',
 'source_documents': [Document(page_content='10 Challenges\nAlthough prompt-based learning has shown signiﬁcant potential among different tasks and scenarios, several\nchallenges remain, some of which we detail below.\n10.1 Prompt Design\nTasks beyond Classiﬁcation and Generation Most existing works about prompt-based learning revolve around\neither text classiﬁcation or generation-based tasks. Applications to information extraction and text analysis tasks\nhave been discussed less, largely because the design 

In [15]:
query = "How do LLMs work?"
chain.invoke(query)

{'query': 'How do LLMs work?',
 'result': ' LLMs (Language Model Models) are trained using large amounts of data, and they can be fine-tuned for specific tasks to improve performance. There are different approaches to sharing trained LLM models, including freely sharing them with a license, requiring individuals to apply for access, providing API access without model parameters or detailed information, or providing no access at all. These decisions are based on factors such as model use, potential harms, and business decisions. LLMs are susceptible to data leakage attacks, where significant segments of text can be extracted from the model.',
 'source_documents': [Document(page_content='training logs will provide a guide for those training their own LLMs.\nWe have several interesting directions to pursue. First, task ﬁne-tuning has yielded\nsigniﬁcant improvements in LLMs, and we plan to consider what unique opportunities exist\n38', metadata={'page': 37, 'source': 'data/BloombergGPT.pd
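Each result above is a dict with `query`, `result`, and `source_documents` keys. When only the provenance matters, the sources can be condensed into (file, page) pairs. The sketch below uses `SimpleNamespace` objects as stand-ins for the returned documents, and the file names and page numbers in `example_result` are invented for illustration; `list_sources` is a hypothetical helper.

```python
from types import SimpleNamespace


def list_sources(result):
    """Return the distinct (source, page) pairs from a result dict, in order."""
    pairs = []
    for doc in result.get("source_documents", []):
        pair = (doc.metadata.get("source"), doc.metadata.get("page"))
        if pair not in pairs:
            pairs.append(pair)
    return pairs


# Hypothetical result with the same shape as the chain output
example_result = {
    "query": "What is prompt engineering?",
    "result": "(answer text)",
    "source_documents": [
        SimpleNamespace(page_content="...", metadata={"source": "data/example.pdf", "page": 2}),
        SimpleNamespace(page_content="...", metadata={"source": "data/example.pdf", "page": 2}),
        SimpleNamespace(page_content="...", metadata={"source": "data/other.pdf", "page": 37}),
    ],
}
print(list_sources(example_result))
# [('data/example.pdf', 2), ('data/other.pdf', 37)]
```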