# Indexing: Inspecting and Managing Documents in a Vectorstore

In [1]:
# Run the command below to check the version of langchain installed in the current environment.
# Substitute "langchain" with any other package name to check its version.

In [93]:
pip show langchain

Name: langchain
Version: 0.3.23
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /Users/akhil/anaconda3/lib/python3.11/site-packages
Requires: langchain-core, langchain-text-splitters, langsmith, pydantic, PyYAML, requests, SQLAlchemy
Required-by: langchain-community
Note: you may need to restart the kernel to use updated packages.
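The same check can be done from Python, which is handy when several packages need to be verified at once. The sketch below is a minimal example using the standard library; the package names in the loop are only a guess at what this notebook depends on.

```python
from importlib.metadata import PackageNotFoundError, version


def package_version(name):
    """Return the installed version of `name`, or None if it is not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Assumed list of relevant packages; adjust it to your environment.
for pkg in ("langchain", "langchain-openai", "langchain-community", "chromadb"):
    print(f"{pkg}: {package_version(pkg) or 'not installed'}")
```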


In [94]:
%load_ext dotenv
%dotenv -o

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [95]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [96]:
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

In [97]:
vectorstore_from_directory = Chroma(persist_directory = "./intro-to-ds-lectures", 
                                    embedding_function = embedding)

In [98]:
vectorstore_from_directory.get()

{'ids': ['6eacbdc6-fbe0-4b4f-8867-626eb8340d6b',
  '79cd123b-c14a-4026-a2c9-3f3f5e366895',
  '51f4c182-42c3-4a4c-b0a6-4282cda07a65',
  'a5b36d87-a133-40d8-8a95-2bcbbb4109da',
  '903d54f3-2837-44cb-9fc3-81d17e830616',
  '42a10a34-3bbb-4d24-8e63-d9ab283682ff',
  '47d6421f-fe5f-4e43-b75e-3a4089b30c89',
  'cf27a5a8-d42d-4593-9d83-6512eebdf5e5',
  'b51e6258-f4bb-489d-adb3-af286922040b',
  '2a39a50e-4391-45bd-97f7-130eb2efbc38',
  'f187450b-fead-44f4-a0f9-a03b0b70f7d6',
  'b0dd7c87-a89b-4327-9bec-788458df30de',
  '5b53436d-c2ab-4639-9958-90c08dd0723d',
  '7109c06b-1608-4a06-b746-3dd75108f91e',
  '7a596ba0-15c0-415a-a657-39cc76d11a90',
  '5f72ffa3-0cb4-4254-9f1e-d8b96cb098b3',
  'fad926dc-f376-4e8e-822b-e95be99a2b39',
  '28e212a5-8000-4d6c-bec0-76fced562694',
  '8551d6b6-e63f-4a62-8b95-a72d0ede9354',
  '7a2dd52f-92af-4a82-89a7-ce7940883344',
  '5fecede9-7547-4a25-935e-8ec05b8bb1ff',
  '7d960c28-0826-4c0e-b4b6-671e0cd1b477',
  '3dc0b7e7-af0c-4659-8b24-af4d7fec6355',
  '75482bed-283a-4d99-859d-

In [99]:
vectorstore_from_directory.get(ids = "e71f1a46-5739-496f-b919-a5092c14cf6a", 
                               include = ["embeddings"])

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': None,
 'uris': None,
 'included': ['embeddings'],
 'data': None,
 'metadatas': None}
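An empty `ids` list, as in the output above, means that no stored document matched the requested ID. A small helper can make this explicit. This is a sketch that works on any dictionary shaped like the result above; `summarize_get_result` is a hypothetical name, not part of Chroma or LangChain.

```python
def summarize_get_result(result):
    """Summarise a dictionary shaped like the output of a vector store get() call."""
    ids = result.get("ids") or []
    return {
        "count": len(ids),
        "found": len(ids) > 0,
        "included": list(result.get("included") or []),
    }


# Shape copied from the empty result above (the embeddings array is replaced by a plain list).
empty_result = {"ids": [], "embeddings": [], "documents": None, "included": ["embeddings"]}
print(summarize_get_result(empty_result))
```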

In [100]:
added_document = Document(page_content='Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Analysis vs Analytics'})

In [101]:
vectorstore_from_directory.add_documents([added_document])

['7411a7dd-aa38-408e-8ad4-9ee0a082dc4d']

In [102]:
vectorstore_from_directory.get("4d1cf745-ebbd-47cd-95e0-ddd4b7d11f4f")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

In [103]:
updated_document = Document(page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!', 
                            metadata={'Course Title': 'Introduction to Data and Data Science', 
                                     'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})

In [104]:
vectorstore_from_directory.update_document(document_id = "4d1cf745-ebbd-47cd-95e0-ddd4b7d11f4f", 
                                           document = updated_document)

In [105]:
vectorstore_from_directory.get("4d1cf745-ebbd-47cd-95e0-ddd4b7d11f4f")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

In [106]:
vectorstore_from_directory.delete("55409552-1943-4892-949a-3b475ff9c840")

In [107]:
vectorstore_from_directory.get("55409552-1943-4892-949a-3b475ff9c840")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}
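In the cells above, the IDs passed to `get`, `update_document`, and `delete` differ from the ID returned by `add_documents`, which is consistent with every lookup coming back empty. One way to avoid this is to capture the returned ID and reuse it. The sketch below shows that pattern with a hypothetical in-memory stand-in (`InMemoryStore`), so it runs without an API key; it does not reproduce Chroma's behaviour in detail.

```python
import uuid


class InMemoryStore:
    """Hypothetical stand-in that mimics add/get/update/delete by ID (no embeddings)."""

    def __init__(self):
        self._docs = {}

    def add_documents(self, texts):
        # Generate one ID per text and return the IDs, as the real call does.
        ids = [str(uuid.uuid4()) for _ in texts]
        self._docs.update(zip(ids, texts))
        return ids

    def get(self, doc_id):
        found = doc_id in self._docs
        return {"ids": [doc_id] if found else [],
                "documents": [self._docs[doc_id]] if found else []}

    def update_document(self, document_id, text):
        # Raising on unknown IDs is a choice of this stand-in, not a claim about Chroma.
        if document_id not in self._docs:
            raise KeyError(f"unknown id: {document_id}")
        self._docs[document_id] = text

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)


store = InMemoryStore()
(new_id,) = store.add_documents(["First version of the lecture text"])  # keep the returned ID
store.update_document(document_id=new_id, text="Updated lecture text")
after_update = store.get(new_id)
store.delete(new_id)
after_delete = store.get(new_id)
print(after_update["documents"], after_delete["ids"])
```

With the real vector store, the same idea would be to assign the list returned by `add_documents` to a variable and pass its first element to the later calls.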

# Retrieval: Similarity Search

In [62]:
# The goal of this lesson is the following.
# First, define a question related to data science.
# Second, use the embedding function of the vector store to create a vector representation of this question.
# Third, retrieve a pre-selected number of documents relevant to this question.
# In the next step, these retrieved documents will be fed to an LLM to devise a response.
# We will create a variable called question for this purpose.

In [108]:
vectorstore = Chroma(persist_directory = "./intro-to-ds-lectures", 
                     embedding_function = embedding)

In [109]:
added_document = Document(page_content='Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})

In [110]:
vectorstore.add_documents([added_document])

['c24d3a07-fd1b-4f31-b48a-61d6ad9cfe3a']

In [111]:
question = "What is Peter Lynch's approach to long-term investing?"

In [112]:
retrieved_docs = vectorstore.similarity_search(query = question, 
                                               k = 5)

In [68]:
# The first parameter we must pass is query: our question.
# The second parameter to set is k, the number of retrieved documents.
# It defaults to four, but let's change its value to five documents, allowing us to identify a pain point
# in the similarity search algorithm.
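To make the roles of `query` and `k` concrete, here is a toy version of similarity search. It is a sketch under loud assumptions: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and the corpus is invented. The ranking step (cosine similarity, then keep the top `k`) is the part that carries over.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, texts, k=4):
    """Return the k texts most similar to the query, most similar first."""
    q = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, toy_embed(t)), reverse=True)
    return ranked[:k]


corpus = [
    "python and r are popular tools",
    "python and r are popular tools",  # exact duplicate of the first entry
    "hadoop distributes computation across machines",
]
results = top_k("popular python tools", corpus, k=2)
print(results)
```

Both returned texts are identical, because pure similarity ranking has no notion of redundancy. The duplicate entries in the notebook output suggest this is the pain point the lesson refers to.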

In [113]:
retrieved_docs

[Document(metadata={'Lecture Title': 'Analysis vs Analytics', 'Course Title': 'Introduction to Data and Data Science'}, page_content="What is Peter Lynch's approach to long-term investing?He believes in holding stocks for years, focusing on the company's growth potential rather than short-term market movements. Why does Peter Lynch emphasize avoiding emotional decisions?Emotional decisions lead to impulsive trades, while disciplined investing based on research yields better results"),
 Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Analysis vs Analytics'}, page_content="What is Peter Lynch's approach to long-term investing?He believes in holding stocks for years, focusing on the company's growth potential rather than short-term market movements. Why does Peter Lynch emphasize avoiding emotional decisions?Emotional decisions lead to impulsive trades, while disciplined investing based on research yields better results"),
 Document(metadata={

In [114]:
# Displaying the variable, we find that retrieved_docs is a list of five documents.
# I find this a bit difficult to read, so let me use a for loop that displays only the page content
# and the lecture title of each document.
# Much better.

In [115]:
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n----------\nLecture Title:{i.metadata['Lecture Title']}\n")
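Metadata keys can differ between documents, and indexing a key that is absent raises `KeyError`. A defensive variant of the display loop uses `dict.get` with a fallback. The sketch below uses a hypothetical `Doc` dataclass in place of LangChain's `Document` so it runs on its own; `format_doc` is likewise an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_doc(doc, key="Lecture Title"):
    """Format one document, falling back to a placeholder if the key is missing."""
    title = doc.metadata.get(key, "<missing>")
    return f"Page Content: {doc.page_content}\n----------\n{key}: {title}\n"


docs = [
    Doc("Some lecture text", {"Lecture Title": "Analysis vs Analytics"}),
    Doc("Text without a title"),  # no metadata at all
]
for d in docs:
    print(format_doc(d))
```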

# Retrieval: Maximal Marginal Relevance Search

In [116]:
question = "What is Peter Lynch's approach to long-term investing?"

In [117]:
retrieved_docs = vectorstore.max_marginal_relevance_search(
    query=question, 
    k=3, 
    lambda_mult = 1, 
    filter = {"Lecture Title": "Programming Languages & Software Employed in Data Science - All the Tools You Need"}
)

In [118]:
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n----------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page Content: How does Lynch evaluate a company’s debt levels?Lynch avoids highly leveraged companies, preferring firms with manageable debt and strong cash flow. Why does Lynch believe in holding stocks for the long term?He advocates long-term holding, allowing compounding and growth to maximize returns. How does Lynch approach sector-based investing?Lynch invests across various sectors but favors those with strong long-term potential and less market hype
----------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: Why does Lynch focus on a company's long-term earnings potential?He focuses on sustainable long-term earnings rather than short-term market trends. How does Lynch approach international investing?Lynch primarily invests in domestic markets but does not rule out strong international opportunities. What is Lynch's take on investing in commodities?Commodity investments are risky and cyclical, making them less attr
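The `lambda_mult` parameter trades relevance against diversity. The sketch below is a minimal greedy MMR selection in plain Python, assuming cosine similarity and the common scoring rule `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. The vectors are invented, and this is an illustration of the idea rather than the library's exact code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k, lambda_mult):
    """Return the indices of up to k documents chosen greedily by MMR score."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]  # the first two vectors are duplicates
only_relevance = mmr(query, docs, k=2, lambda_mult=1.0)  # [0, 1]
with_diversity = mmr(query, docs, k=2, lambda_mult=0.3)  # [0, 2]
print(only_relevance, with_diversity)
```

With `lambda_mult = 1`, as in the cell above, the redundancy term vanishes and the search behaves like plain similarity search, duplicates included. Lower values penalise documents that resemble ones already selected.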

# Generation: Stuffing Documents

In [119]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser

In [120]:
vectorstore = Chroma(persist_directory = "./intro-to-ds-lectures", 
                     embedding_function = OpenAIEmbeddings(model='text-embedding-ada-002'))

In [121]:
len(vectorstore.get()['documents'])

448

In [122]:
retriever = vectorstore.as_retriever(search_type = 'mmr', 
                                     search_kwargs = {'k':3, 
                                                      'lambda_mult':0.7})

In [123]:
TEMPLATE = '''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}

At the end of the response, specify the name of the lecture this context is taken from in the format:
Resources: *Lecture Title*
where *Lecture Title* should be substituted with the title of all resource lectures.
'''

prompt_template = PromptTemplate.from_template(TEMPLATE)

In [124]:
chat = ChatOpenAI(model_name = 'gpt-4', 
                  seed = 365,
                  max_tokens = 250)


In [125]:
question = "How did Peter Lynch respond to the 1990 recession?"

In [126]:
chain = {'context': retriever, 
         'question': RunnablePassthrough()} | prompt_template

In [127]:
chain.invoke(question)

StringPromptValue(text='\nAnswer the following question:\nHow did Peter Lynch respond to the 1990 recession?\n\nTo answer the question, use only the following context:\n[Document(metadata={\'Course Title\': \'Introduction to Data and Data Science\', \'Lecture Title\': \'Analysis vs Analytics\'}, page_content=\'How did Peter Lynch respond to the 1990 recession?He intensified research, visiting over 200 companies that year alone, seeking undervalued opportunities. What does Lynch mean by "rechecking the story"?He periodically reviewed companies to ensure they still had strong fundamentals and growth potential. How does Peter Lynch classify "Fast Growers"?Companies with rapid earnings growth and high potential for price appreciation, though often riskier\'), Document(metadata={\'Course Title\': \'Introduction to Data and Data Science\', \'Lecture Title\': \'Analysis vs Analytics\'}, page_content=\'How did Peter Lynch respond to the 1990 recession?He intensified research, visiting over 200
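In the prompt above, the context is the raw `repr` of a list of `Document` objects, including escaped quotes and metadata dictionaries. A common alternative is to format the documents into plain text before filling the template. The sketch below shows the idea with plain string formatting; `Doc` and `format_docs` are hypothetical names, and the template is a shortened copy of the one used in this notebook.

```python
from dataclasses import dataclass, field

TEMPLATE = """Answer the following question:
{question}

To answer the question, use only the following context:
{context}
"""


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join documents into plain text, one titled block per document."""
    return "\n\n".join(
        f"[{d.metadata.get('Lecture Title', 'Unknown lecture')}]\n{d.page_content}"
        for d in docs
    )


docs = [
    Doc("R and Python are the two most popular tools.", {"Lecture Title": "Analysis vs Analytics"}),
    Doc("Hadoop distributes computational tasks on multiple computers."),
]
prompt_text = TEMPLATE.format(question="What software do data scientists use?",
                              context=format_docs(docs))
print(prompt_text)
```

In a LangChain chain, a function like this could be piped after the retriever (for example `retriever | format_docs`), since plain functions can be composed with runnables. Treat that as a suggestion to verify against your installed version.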

In [128]:
print("\nAnswer the following question:\nWhat software do data scientists use?\n\nTo answer the question, use only the following context:\n[Document(page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}), Document(page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}), Document(page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})]\n\nAt the end of the response, specify the name of the lecture this context is taken from in the format:\nResources: *Lecture Title*\nwhere *Lecture Title* should be substituted with the title of all resource lectures.\n")


Answer the following question:
What software do data scientists use?

To answer the question, use only the following context:
[Document(page_content='As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}), Document(page_content='It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers

# Generation: Generating a Response

In [129]:
chain = ({'context': retriever, 
         'question': RunnablePassthrough()} 
         | prompt_template 
         | chat 
         | StrOutputParser())

In [130]:
chain.invoke(question)

'Peter Lynch responded to the 1990 recession by intensifying his research. He visited over 200 companies that year alone, looking for undervalued opportunities.\n\nResources: Analysis vs Analytics'
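The `|` operator in the chain above passes the output of each step to the next, and the leading dictionary runs its entries on the same input. The toy below imitates that composition in plain Python to show the data flow. `Step`, `parallel`, and `fake_llm` are hypothetical names; this is not how LangChain is implemented, only a miniature of the same idea.

```python
class Step:
    """Hypothetical miniature of a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose: run this step first, then feed its output to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def parallel(**steps):
    """Run every step on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


retrieve = Step(lambda q: f"(context found for: {q})")  # stands in for the retriever
passthrough = Step(lambda q: q)                          # forwards the question unchanged
fill_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}")
fake_llm = Step(lambda prompt: prompt.upper())           # stands in for the chat model

toy_chain = parallel(context=retrieve, question=passthrough) | fill_prompt | fake_llm
answer = toy_chain.invoke("what is rag?")
print(answer)
```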

In [131]:
# Reference answer from the retrieved context, for comparison with the model output:
# "He intensified research, visiting over 200 companies that year alone, seeking undervalued opportunities."

In [132]:
# This notebook uses:
#
# 1. RAG (Retrieval-Augmented Generation): implemented with LangChain, a framework for building
#    RAG systems. Chroma provides the retrieval component.
#
# 2. Chroma (vector store): imported from langchain_community.vectorstores and used for
#    vector-based indexing and retrieval of documents.
#
# 3. Vector store indexing: Chroma indexes the document embeddings and supports the similarity
#    and MMR searches that the retrieval part of RAG relies on.
#
# 4. OpenAI's embedding model: text-embedding-ada-002, loaded through OpenAIEmbeddings.
#
# 5. GPT model: gpt-4, loaded through ChatOpenAI with max_tokens set to 250, handles the
#    generation part of the RAG system.
#
# Configuration:
# A. Embedding model: text-embedding-ada-002 from OpenAI.
# B. Vector store: Chroma from langchain_community.vectorstores.
# C. Framework: LangChain manages the RAG pipeline.

In [133]:
print('Data scientists use a variety of software tools. R and Python are the two most popular tools as they can manipulate data and are integrated within multiple data and data science software platforms. They are adaptable and can solve a wide range of business and data-related problems. Hadoop is a software framework designed to handle the complexity and computational intensity of big data by distributing computational tasks on multiple computers. Additionally, Power BI, SaS, Qlik, and Tableau are top-notch examples of software designed for business intelligence visualizations.\n\nResources: Programming Languages & Software Employed in Data Science - All the Tools You Need')

Data scientists use a variety of software tools. R and Python are the two most popular tools as they can manipulate data and are integrated within multiple data and data science software platforms. They are adaptable and can solve a wide range of business and data-related problems. Hadoop is a software framework designed to handle the complexity and computational intensity of big data by distributing computational tasks on multiple computers. Additionally, Power BI, SaS, Qlik, and Tableau are top-notch examples of software designed for business intelligence visualizations.

Resources: Programming Languages & Software Employed in Data Science - All the Tools You Need
