# LlamaIndex Introduction: Precision and Simplicity in Information Retrieval

## Data Connectors
The effectiveness of a RAG-based application is significantly enhanced by accessing a vector store that compiles information from various sources. However, managing data in diverse formats can be challenging.

Data connectors, also called `Readers`, are essential in LlamaIndex. Readers are responsible for parsing and converting the data into a simplified Document representation consisting of text and basic metadata. Data connectors are designed to streamline the data ingestion process, automate the task of fetching data from various sources (like APIs, PDFs, and SQL databases), and format it.
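To make the `Document` idea concrete, here is a minimal sketch of what a reader does, written in plain Python rather than with LlamaIndex classes. `ToyDocument` and `read_text_files` are hypothetical names used for illustration only; the real `Document` class carries more fields, and real readers handle far more formats.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List
import tempfile


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: raw text plus basic metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def read_text_files(directory: str) -> List[ToyDocument]:
    """Hypothetical reader: turn every .txt file in a directory into a ToyDocument."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append(
            ToyDocument(
                text=path.read_text(encoding="utf-8"),
                metadata={"file_name": path.name},
            )
        )
    return docs


# Demo on a throwaway directory with two small files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Hello from file A.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Hello from file B.", encoding="utf-8")
    toy_docs = read_text_files(tmp)

print(len(toy_docs))         # 2
print(toy_docs[0].metadata)  # {'file_name': 'a.txt'}
```

Whatever the source, the output has the same shape: a list of objects holding text and metadata, which is what lets the later steps treat all sources uniformly.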

`LlamaHub` is an open-source project that hosts data connectors. The `LlamaHub` repository offers connectors for a wide range of data formats, so that their content can be ingested and made available to the LLM.

You can browse the LlamaHub repository to explore the available integrations and data sources and to test some of the loaders. These implementations make the preprocessing step as simple as executing a function. Take the Wikipedia integration, for instance.

In [1]:
%pip install -q llama-index openai cohere deeplake==3.9.27 wikipedia

In [2]:
# Enable Logging
import logging
import sys

# You can set the logging level to DEBUG for more verbose output,
# or use level=logging.INFO for less detailed information.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

We have also added a logging mechanism to the code. Logging in LlamaIndex is a way to monitor the operations and events that occur during the execution of your application. Logging helps develop and debug the process and understand the details of what the application is doing. In a production environment, you can configure the logging module to output log messages to a file or a logging service.
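As a rough illustration of the production setup mentioned above, the sketch below sends log messages to a file using only the standard `logging` module. The logger name `my_rag_app` and the log file location are assumptions made for this example, not anything LlamaIndex requires.

```python
import logging
import os
import tempfile

# Hypothetical log location; a real deployment would use a stable path
log_path = os.path.join(tempfile.gettempdir(), "llamaindex_app.log")

# Hypothetical application logger, kept separate from the root logger
logger = logging.getLogger("my_rag_app")
logger.setLevel(logging.INFO)

# Write records to a file with a timestamped format
file_handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
file_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)
logger.addHandler(file_handler)

logger.info("Index build started")
logger.debug("This DEBUG message is filtered out at INFO level")

# Read the file back to confirm what was written
with open(log_path, encoding="utf-8") as f:
    log_text = f.read()
print(log_text)

# Detach the handler so repeated runs do not accumulate handlers
logger.removeHandler(file_handler)
file_handler.close()
```

Only the INFO message ends up in the file, because the logger's level discards anything below INFO before it reaches the handler.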

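We can now use the `download_loader` function to fetch an integration from LlamaHub by passing it the integration name; the function returns the loader class, which we then instantiate. In our sample code, the `WikipediaReader` class takes in several page titles and returns the text contained within them as `Document` objects. Note that recent LlamaIndex releases mark `download_loader` as deprecated in favor of installing the reader package directly (for example, `llama-index-readers-wikipedia`), so check the documentation for the version you are using.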

In [3]:
from llama_index.core import download_loader

WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()

documents = loader.load_data(pages=['Natural Language Processing', 'Artificial Intelligence'])
print(len(documents))

2


This retrieved information can be stored and utilized to enhance the knowledge base of our chatbot.

## Nodes
In LlamaIndex, once data is ingested as documents, it passes through a processing structure that transforms these documents into `Node` objects. `Nodes` are smaller, more granular data units created from the original documents. Besides their primary content, these nodes also contain metadata and contextual information.

LlamaIndex features a `NodeParser` class designed to convert the content of documents into structured nodes automatically. The `SimpleNodeParser` converts a list of document objects into nodes.

In [4]:
from llama_index.core.node_parser import SimpleNodeParser

# Assuming documents have already been loaded

# Initialize the parser
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)

# Parse documents into nodes
nodes = parser.get_nodes_from_documents(documents)
print(len(nodes))

58


The code above splits the two documents retrieved from Wikipedia into 58 smaller chunks, each at most 512 tokens long, with a 20-token overlap between neighboring chunks.
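To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified sketch of overlapping chunking. It is not how `SimpleNodeParser` is implemented: the real parser counts tokens and tries to keep sentences intact, whereas this hypothetical `split_with_overlap` helper works on a plain list of words.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

Each chunk starts with the last word of the previous one. The overlap keeps some shared context across chunk boundaries, so a sentence cut in two is still partly visible from both sides.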

## Indices
At the heart of LlamaIndex is the capability to index and search data from various sources, such as documents, PDFs, and database query results. Indexing is the initial step for storing information in a database; it transforms the unstructured data into embeddings that capture semantic meaning and optimizes the data format so it can be easily accessed and queried.

LlamaIndex has a variety of index types, each of which fulfills a specific role. We have highlighted some of the popular index types in the following subsections.

### Summary Index
The [Document Summary Index](https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/) extracts a summary from each document and stores it together with all the nodes in that document. Because the embedding of a small node does not always match a query well, the document-level summary gives retrieval an additional signal to work with.

### Vector Store Index
The Vector Store Index generates an embedding for each node during index construction. At query time, it embeds the query and returns the top-k most similar nodes.

It’s suitable for small-scale applications and easily scalable to accommodate larger datasets using high-performance vector databases.
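The top-k lookup can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the three-dimensional vectors are made up by hand, cosine similarity is used as the similarity measure, and `top_k` is a hypothetical helper. A real vector store works with model-generated embeddings and, at scale, usually with approximate search.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], node_vecs: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every node against the query and keep the k best matches."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made toy embeddings (not produced by any model)
toy_embeddings = {
    "node-nlp": [0.9, 0.1, 0.0],
    "node-ai": [0.7, 0.7, 0.0],
    "node-cooking": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]

results = top_k(query_embedding, toy_embeddings, k=2)
print([node_id for node_id, _ in results])  # ['node-nlp', 'node-ai']
```

This brute-force scan compares the query against every stored vector, which is fine for a handful of nodes. Dedicated vector databases exist largely to avoid that full scan on large datasets.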

The crawled Wikipedia documents can be stored in a Deep Lake vector store, and an index object can be created based on its data. We can create the dataset in Activeloop and append documents to it by employing the `DeepLakeVectorStore` class. First, we need to set the Activeloop and OpenAI API keys in the environment using the following code.

In [9]:
import os
from google.colab import userdata

# Get the API key from Colab's userdata
openai_api_key = userdata.get('OPENAI_API_KEY')

# Set it as an environment variable
os.environ["OPENAI_API_KEY"] = openai_api_key

# Get the activeloop token from Colab's userdata
activeloop_token = userdata.get('ACTIVELOOP_TOKEN')

# Set it as an environment variable
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token

activeloop_org_id = userdata.get('ACTIVELOOP_ORG_ID')

os.environ["ACTIVELOOP_ORG_ID"] = activeloop_org_id

To connect to the platform, use the `DeepLakeVectorStore` class and provide the `dataset_path` as an argument. Running the following code will create an empty dataset.

In [7]:
%pip install llama-index-vector-stores-deeplake

Collecting llama-index-vector-stores-deeplake
  Downloading llama_index_vector_stores_deeplake-0.3.2-py3-none-any.whl.metadata (730 bytes)
Downloading llama_index_vector_stores_deeplake-0.3.2-py3-none-any.whl (7.1 kB)
Installing collected packages: llama-index-vector-stores-deeplake
Successfully installed llama-index-vector-stores-deeplake-0.3.2


In [10]:
from llama_index.vector_stores.deeplake import DeepLakeVectorStore

my_activeloop_org_id = activeloop_org_id
my_activeloop_dataset_name = "LlamaIndex_intro"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

# Connect to the Deep Lake dataset (an empty one is created if it does not exist yet)
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)

Your Deep Lake dataset has been successfully created!


Now, we need to create a storage context using the `StorageContext` class and the Deep Lake dataset as the source. Pass this storage to a `VectorStoreIndex` class to create the index (generate embeddings) and store the results on the defined dataset.

In [11]:
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Uploading data to deeplake dataset.


100%|██████████| 31/31 [00:00<00:00, 66.88it/s]


Dataset(path='hub://glorrysibomana758/LlamaIndex_intro', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (31, 1)      str     None   
 metadata     json      (31, 1)      str     None   
 embedding  embedding  (31, 1536)  float32   None   
    id        text      (31, 1)      str     None   




The created database will be accessible in the future. The Deep Lake database efficiently stores and retrieves high-dimensional vectors.

## Query Engines

The next step is to leverage the generated indexes to query through the information. The Query Engine is a wrapper that combines a Retriever and a Response Synthesizer into a pipeline. The pipeline uses the query string to fetch nodes and then sends them to the LLM to generate a response. A query engine can be created by calling the `as_query_engine()` method on an already-created index.

The code below uses the documents fetched from the Wikipedia pages to construct a Vector Store Index using the `GPTVectorStoreIndex` class, a legacy name kept for backward compatibility with `VectorStoreIndex`. The `.from_documents()` method simplifies building indexes on these processed documents. The created index can then be utilized to generate a `query_engine` object, allowing us to ask questions based on the documents using the `.query()` method.

In [12]:
from llama_index.core import GPTVectorStoreIndex

index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What does NLP stand for?")
print(response.response)

NLP stands for natural language processing.


The indexes can also function solely as retrievers for fetching documents relevant to a query. This capability enables the creation of a Custom Query Engine, offering more control over various aspects, such as the prompt or the output format.
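The retrieve-then-synthesize pipeline behind a query engine can be sketched without any LlamaIndex code. Everything below is hypothetical and simplified: `keyword_retriever` ranks nodes by shared words instead of embeddings, `build_prompt` stands in for the response synthesizer's prompt template, and `echo_llm` is a stub that replaces the call to a real LLM.

```python
from typing import Callable, List


def keyword_retriever(query: str, nodes: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever: rank nodes by how many lowercase words they share with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        nodes,
        key=lambda node: len(query_terms & set(node.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context_nodes: List[str]) -> str:
    """Toy prompt template: retrieved context first, then the question."""
    context = "\n".join(f"- {node}" for node in context_nodes)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def toy_query_engine(query: str, nodes: List[str], llm: Callable[[str], str]) -> str:
    """Retrieve context, build the prompt, and hand it to the (stubbed) LLM."""
    retrieved = keyword_retriever(query, nodes)
    return llm(build_prompt(query, retrieved))


def echo_llm(prompt: str) -> str:
    """Stub standing in for an LLM call: returns the first context line of the prompt."""
    first_context_line = prompt.splitlines()[1]
    return first_context_line[2:]  # drop the leading "- " bullet


toy_nodes = [
    "NLP stands for natural language processing.",
    "Artificial intelligence is the study of intelligent agents.",
    "Deep Lake stores embeddings.",
]

answer = toy_query_engine("What does NLP stand for?", toy_nodes, echo_llm)
print(answer)  # NLP stands for natural language processing.
```

Because the retriever, the prompt builder, and the LLM are separate pieces, any of them can be swapped independently. That separation is the kind of control a custom query engine gives you over the prompt or the output format.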

## Routers
Routers play a role in determining the most appropriate retriever for extracting context from the knowledge base. The routing function selects the optimal query engine for each task, improving performance and accuracy.

These functions are beneficial when dealing with multiple data sources, each holding unique information. Consider an application that employs a SQL database and a Vector Store as its knowledge base. In this setup, the router can determine which data source is most applicable to the given query.
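The routing decision itself can be illustrated with a small sketch. LlamaIndex's own routers can delegate this choice to an LLM that reads a description of each query engine; the hypothetical `route` function below substitutes a simple word-overlap score, and both `ToyEngine` instances are stand-ins that only tag the query instead of querying a real SQL database or vector store.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyEngine:
    """Hypothetical query engine: a name, a description, and a callable."""
    name: str
    description: str
    run: Callable[[str], str]


def route(query: str, engines: List[ToyEngine]) -> ToyEngine:
    """Pick the engine whose description shares the most words with the query."""
    query_terms = set(query.lower().split())

    def score(engine: ToyEngine) -> int:
        return len(query_terms & set(engine.description.lower().split()))

    return max(engines, key=score)


sql_engine = ToyEngine(
    name="sql",
    description="sales revenue orders totals per month",
    run=lambda q: f"[sql] {q}",
)
vector_engine = ToyEngine(
    name="vector",
    description="wikipedia articles about nlp and artificial intelligence",
    run=lambda q: f"[vector] {q}",
)
engines = [sql_engine, vector_engine]

chosen = route("total revenue per month", engines)
print(chosen.name)  # sql

chosen_2 = route("history of artificial intelligence", engines)
print(chosen_2.run("history of artificial intelligence"))
# [vector] history of artificial intelligence
```

The quality of the routing depends heavily on how well each description characterizes its data source, whether the selector is a keyword score as here or an LLM.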

## Saving and Loading Indexes Locally
The examples we explored so far involved storing indexes on cloud-based vector stores like Deep Lake. However, there are scenarios where saving the data to disk might be preferable, for example for rapid testing. Storing here means saving the index data, which includes the nodes and their associated embeddings, to disk. This is done using the `persist()` method of the `storage_context` object attached to the index.

In [13]:
# Store the index (nodes and their vector embeddings) on disk
index.storage_context.persist()
# By default, the data is saved in the './storage' directory,
# so later runs can reload it instead of repeating the processing

If the index already exists in storage, you can load it directly instead of recreating it. We simply need to determine whether the index already exists on disk and proceed accordingly; here is how to do it:

In [14]:
# Index Storage Checks
import os.path
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core import download_loader

# Let's see if our index already exists in storage.
if not os.path.exists("./storage"):
    # If not, we'll load the Wikipedia data and create a new index
    WikipediaReader = download_loader("WikipediaReader")
    loader = WikipediaReader()
    documents = loader.load_data(pages=['Natural Language Processing', 'Artificial Intelligence'])
    index = VectorStoreIndex.from_documents(documents)
    # Index storing
    index.storage_context.persist()

else:
    # If the index already exists, we'll just load it:
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

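In this example, `os.path.exists("./storage")` checks whether the 'storage' directory exists. If it does not, the Wikipedia data is loaded, a new index is created, and the index is persisted to disk. Otherwise, the index is loaded directly from the existing directory, which avoids recomputing the embeddings.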