### Building a simple vector store index with LlamaIndex
We've covered several LlamaIndex concepts so far. This example combines those basics to build, persist, and query a vector store index.

In [2]:
from llama_index.indices.vector_store import VectorStoreIndex
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.storage import StorageContext
from llama_index.indices.loading import load_index_from_storage
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-02-10 22:59:50--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-02-10 22:59:51 (1.41 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [4]:
from llama_index.readers.file.base import SimpleDirectoryReader

In [5]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

In [6]:
len(documents)

1

In [7]:
index = VectorStoreIndex.from_documents(documents=documents)

In [8]:
index.ref_doc_info

{'c4c7eb35-392d-4546-99a0-78481984f168': RefDocInfo(node_ids=['3037ad99-71c6-4f2f-b9dc-2b766d990cec', '95d7d3ab-7f10-46e5-814a-624e45ddb42e', '5563f361-7183-4137-a9a3-3aca7b751374', 'a7ce764d-ebb4-4898-aa54-e088cbf966bf', '802ded21-4118-4b30-b2b6-d45621edc503', '9239476c-c099-403f-8afb-43db77ce8dbb', '28f208d1-54c0-47ff-b3d2-bdf108fb1ab1', '2381c633-61c9-45e7-ad23-fb2bc67b0c0c', 'dc95d3b7-ebf3-4239-aad8-bbf51e0243e1', '86328329-23b6-4873-9212-5d99a7b787b7', '7a35fe7c-66f1-4841-b0dc-5d2becdb6318', 'b7d4c21a-cf49-454a-bab4-d84a87564a23', '09389635-1134-4233-8fca-c924f551a712', 'ad1fc8c0-cfeb-4d5a-943a-7d52e901c7f1', 'b6ebd551-99c0-4913-8fc9-620695784809', '19ce608e-7c7e-416b-8d83-2f635dd78acf', '3fd18065-877f-42c2-8e28-3b26ba901005', '282cd029-0409-4720-9d60-b8f4746c6bf5', '1b4be1e5-e52d-4f79-9f27-87380881fcf3', '240d812e-9e89-4779-af8d-a05ae4dcca2e', '38851cc5-11ee-4f05-b662-c752d8d6d6af', 'cac44424-310a-4621-ae86-80c9f212a71b'], metadata={'file_path': 'data/paul_graham/paul_graham_essa
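
The single loaded document maps to many node IDs above because the index splits each document into smaller chunks (nodes) before embedding them. The sketch below illustrates the idea with a character-based splitter. It is only an illustration under stated assumptions: `chunk_text`, `chunk_size`, and `overlap` are hypothetical names, and LlamaIndex's own splitter works differently in detail (it is not a plain character window).

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size character windows that overlap slightly.

    Hypothetical helper for illustration only; not part of LlamaIndex.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Toy input: 250 characters cycling through the digits 0-9.
sample = "".join(str(i % 10) for i in range(250))
chunks = chunk_text(sample, chunk_size=100, overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.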

In [9]:
index.set_index_id("vector_index")
index.storage_context.persist()

In [11]:
# load the persisted index, using the ID set before persisting
storage_context = StorageContext.from_defaults(persist_dir="storage")
index = load_index_from_storage(storage_context=storage_context, index_id="vector_index")

In [13]:
query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("What did the author do growing up?")

In [16]:
from IPython.display import display, Markdown

In [17]:
display(Markdown(f"<b>{response}</b>"))

<b>The author grew up to the point where he was able to write software, start a company, and sell it to Yahoo. After that, he became wealthy and decided to pursue his passion for painting. He faced challenges adjusting to his new life in California and eventually returned to New York to continue painting and resume his old patterns, now with the added luxury of taxis and charming restaurants. He experimented with new techniques for painting and looked for an apartment to buy in his preferred neighborhood.</b>

In [20]:
query_modes = [
    "svm",
    "linear_regression",
    "logistic_regression",
]
for query_mode in query_modes:
    # set Logging to DEBUG for more detailed outputs
    query_engine = index.as_query_engine(vector_store_query_mode=query_mode)
    response = query_engine.query("What skill does the author have?")
    print(f"Query mode: {query_mode}")
    display(Markdown(f"<b>{response}</b>"))



Query mode: svm


<b>The author, Paul Graham, has the skill of leading and managing a startup incubator, Y Combinator, and writing essays. He also has the ability to identify and recruit talented individuals to take over and lead the organization. Additionally, he demonstrates strong problem-solving abilities and a dedication to working hard to ensure the success of the startups under his care.</b>



Query mode: linear_regression


<b>The author, Paul Graham, is an expert in running start-up incubators, such as Y Combinator. He is also skilled in programming, having written software in various languages including Arc and Lisp. Additionally, he has strong leadership abilities, as evidenced by his role in leading and eventually passing the leadership of Y Combinator to Sam Altman.</b>



Query mode: logistic_regression


<b>The author, Paul Graham, has the skill of leading and managing a startup incubator, Y Combinator, and writing essays. He also has the ability to identify and recruit talented individuals to take over and lead the organization. Additionally, he demonstrates strong problem-solving abilities and a dedication to working hard to ensure the success of the startups under his care.</b>

In [21]:
print(response.source_nodes[0].text)

[18] The worst thing about leaving YC was not working with Jessica anymore. We'd been working on YC almost the whole time we'd known each other, and we'd neither tried nor wanted to separate it from our personal lives, so leaving was like pulling up a deeply rooted tree.

[19] One way to get more precise about the concept of invented vs discovered is to talk about space aliens. Any sufficiently advanced alien civilization would certainly know about the Pythagorean theorem, for example. I believe, though with less certainty, that they would also know about the Lisp in McCarthy's 1960 paper.

But if so there's no reason to suppose that this is the limit of the language that might be known to them. Presumably aliens need numbers and errors and I/O too. So it seems likely there exists at least one path out of McCarthy's Lisp along which discoveredness is preserved.



Thanks to Trevor Blackwell, John Collison, Patrick Collison, Daniel Gackle, Ralph Hazell, Jessica Livingston, Robert Morris

In [22]:
from llama_index.schema import QueryBundle

In [23]:
query_bundle = QueryBundle(
    query_str="What did the author do growing up?",
    custom_embedding_strs=["The author grew up painting."],
)
query_engine = index.as_query_engine()
response = query_engine.query(query_bundle)

In [24]:
from pprint import pprint

In [25]:
pprint(response.response)

('The author grew up hearing about the World Wide Web and became interested in '
 'its potential. He decided to start a company to put art galleries online, '
 'but this idea was not successful. He then focused on creating software for '
 'building online stores, which led to the development of web applications. '
 'The author also had an idea for a web app to help create other web apps and '
 'started a new company called Aspra to pursue this idea. However, he '
 'eventually decided to build a subset of this project as an open source '
 "project instead. The author's experiences with starting companies and "
 'developing software influenced his later work with Y Combinator.')


In [26]:
query_engine = index.as_query_engine(
    vector_store_query_mode="mmr", vector_store_kwargs={"mmr_threshold": 0.2}
)
response = query_engine.query("What did the author do growing up?")

In [27]:
pprint(response.response)

('The author does not provide any information about what he did growing up in '
 'the context information provided.')


In [28]:
print(response.get_formatted_sources())

> Source (Doc id: b6ebd551-99c0-4913-8fc9-620695784809): As Jessica and I were walking home from dinner on March 11, at the corner of Garden and Walker st...

> Source (Doc id: cac44424-310a-4621-ae86-80c9f212a71b): [18] The worst thing about leaving YC was not working with Jessica anymore. We'd been working on ...
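
The `mmr` query mode stands for maximal marginal relevance, which trades relevance to the query against redundancy with results already selected. The sketch below shows the textbook greedy formulation in plain Python. The names `cosine` and `mmr_select`, the toy two-dimensional vectors, and the exact scoring formula are assumptions for illustration; LlamaIndex's implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, top_k=2, threshold=0.5):
    """Greedily pick top_k document indices by marginal relevance.

    score = threshold * sim(query, doc)
            - (1 - threshold) * max sim(doc, already selected)
    Hypothetical helper for illustration only.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = threshold * relevance - (1 - threshold) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # doc 1 is a near-duplicate of doc 0
print(mmr_select(query, docs, top_k=2, threshold=1.0))  # relevance only
print(mmr_select(query, docs, top_k=2, threshold=0.2))  # diversity favored
```

Under this formulation, a low threshold such as the 0.2 used above weights diversity far more heavily than relevance, which may be why the retrieved sources did not cover the author's childhood.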


In [29]:
from llama_index.schema import Document

doc = Document(text="target", metadata={"tag": "target"})

In [31]:
index.insert(doc)

In [36]:
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

In [37]:
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="tag", value="target")]
)

In [39]:
retriever = VectorIndexRetriever(
    index=index,
    filters=filters
)

In [40]:
query_engine = RetrieverQueryEngine(
    retriever=retriever
)

In [41]:
response = query_engine.query("what was the authors hobby")

In [42]:
pprint(response.response)

('I cannot provide an answer based on the given context as it does not provide '
 "sufficient information about the author's hobby.")


In [43]:
len(response.source_nodes)

1

In [44]:
response.get_formatted_sources()

'> Source (Doc id: f343bbd3-fb16-428d-8cac-eae83a4922e0): target'

In [45]:
source_nodes = response.source_nodes

In [46]:
print(source_nodes[0].text)
print(source_nodes[0].metadata)

target
{'tag': 'target'}
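
The single `target` source node suggests that the metadata filter narrows the candidate set to exact matches, so the essay chunks never reach the query engine and the question about the author's hobby cannot be answered. The sketch below mimics that behavior with plain dictionaries. The node layout and the `exact_match` helper are hypothetical and only illustrate the idea; they are not LlamaIndex's API.

```python
def exact_match(nodes, key, value):
    """Keep only nodes whose metadata has `key` set to exactly `value`.

    Hypothetical helper for illustration only.
    """
    return [node for node in nodes if node["metadata"].get(key) == value]


# Toy stand-ins for indexed nodes: essay chunks carry file metadata,
# while the inserted document carries the custom "tag" field.
nodes = [
    {"text": "essay chunk 1", "metadata": {"file_name": "paul_graham_essay.txt"}},
    {"text": "essay chunk 2", "metadata": {"file_name": "paul_graham_essay.txt"}},
    {"text": "target", "metadata": {"tag": "target"}},
]

matches = exact_match(nodes, key="tag", value="target")
print([node["text"] for node in matches])
```

Only nodes that pass the filter are eligible for retrieval, regardless of how similar the other nodes are to the query.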
