<center><h1><b>`RCK-GPT`</b> : A Production-grade RAG pipeline for Financial Analysis</h1></center>

### LangChain vs. LlamaIndex


**Creating a chatbot with a local LLM**

Keep the LLM running in a standalone inference server instead of loading it into memory every time you run the scripts. This saves start-up time and avoids repeatedly reading the model weights from disk. The server is reached through an OpenAI-compatible API schema.

In [None]:
# llamaindex
from llama_index.llms import ChatMessage, OpenAILike

llm = OpenAILike(
    api_base = 'http://localhost:1234/v1',   # local inference servers usually serve plain HTTP
    timeout = 600,
    api_key = 'loremIpsum',
    is_chat_model = True,   # chat or completion?
    context_window = 32768
)

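# placeholder prompts, assumed here for illustration; replace with your own
system_prompt = 'You are a helpful assistant'
user_prompt = 'what is your name?'

chat_history = [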
    ChatMessage(
        role = 'system',
        content = system_prompt
    ),
    ChatMessage(
        role = 'user',
        content = user_prompt,
    )
]

output = llm.chat(chat_history)
print(output)
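
For reference, "OpenAI-compatible" means the server accepts the same JSON body as OpenAI's `/v1/chat/completions` endpoint. The sketch below only builds that body with the standard library and does not contact any server; the helper name `build_chat_request` and the model name `local-model` are assumptions made for illustration.

```python
import json


def build_chat_request(system_prompt: str, user_prompt: str, model: str = "local-model") -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,  # hypothetical model name
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


payload = build_chat_request("You are a helpful assistant", "what is your name?")
print(json.dumps(payload, indent=2))
```

A client such as `OpenAILike` sends a body of this shape to the `api_base` URL, so any server that understands it can stand in for OpenAI.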

<center><h1>Using LLMs</h1></center>

You can use more than one LLM in a pipeline. They are used at multiple stages :
1. Indexing - to determine the relevance of data, or to summarize the raw data and index the summaries instead
2. Querying, in two ways - (i) Retrieval - an LLM can be given an array of options (such as multiple different indices) and decide where best to find the information you are looking for; an agentic LLM can also use tools to query different data sources, (ii) Response synthesis - an LLM can combine answers to multiple sub-queries into a single coherent answer, or can transform data

LlamaIndex provides a single interface to a large number of different LLMs, allowing you to pass in any LLM you choose to any stage of the pipeline.

In [None]:
# using OpenAI
from llama_index.llms import OpenAI

response = OpenAI().complete('Paul Graham is ')
print(response)

# customisation : gpt-4 instead of gpt-3.5-turbo
# the index will use gpt-4 wherever an LLM is needed (e.g. response synthesis);
# embeddings are still produced by the embedding model, not by the LLM
from llama_index.llms import OpenAI
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext
)

llm = OpenAI(
    temperature = 0.1,
    model='gpt-4'
)
service_context = ServiceContext.from_defaults(llm=llm)

documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

**Available LLMs**

LlamaIndex has integrations with OpenAI, HuggingFace, PaLM and others. You can also run a model such as Llama2 locally and pass it to the `ServiceContext`.

In [None]:
# e.g. using llama2 locally
# this uses llama2-chat-13b from LlamaCPP
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm='local')

**Privacy and Security**

By default, LlamaIndex sends your data to OpenAI to generate embeddings and natural language responses. This can be configured according to your preferences: LlamaIndex lets you use your own embedding model or run an LLM locally if desired.

When OpenAI is used, privacy and handling of your data are subject to OpenAI's policies. LlamaIndex offers modules to connect with other vector stores within indexes to store embeddings. Each vector store has its own privacy policies and practices. By default, LlamaIndex also has an option to store your embeddings locally.

Picking the proper LLM is the first step in building the application. LLMs are a core component: they can be used as standalone modules or plugged into other core LlamaIndex modules such as indices, retrievers and query engines. They are always used in the response synthesis step, i.e. after retrieval. Depending on the type of index being used, LLMs may also be used during index construction, insertion and query traversal.

LlamaIndex provides a unified interface for defining LLM modules, whether it is from OpenAI, HuggingFace or LangChain. It consists of the following - 
1. Support for text completion and chat
2. Support for streaming and non-streaming endpoints
3. Support for synchronous and asynchronous endpoints
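
To make the three axes above concrete (completion vs. chat, streaming vs. non-streaming, sync vs. async), here is a toy stand-in that only echoes its input. `EchoLLM` is a hypothetical class written for this illustration; it is not a LlamaIndex class and only mirrors the method names used in this notebook.

```python
import asyncio
from typing import Iterator


class EchoLLM:
    """Toy stand-in for an LLM; it just echoes the prompt back."""

    def complete(self, prompt: str) -> str:
        # non-streaming text completion
        return f"echo: {prompt}"

    def stream_complete(self, prompt: str) -> Iterator[str]:
        # streaming: yield the same answer one word at a time
        for word in self.complete(prompt).split(" "):
            yield word + " "

    def chat(self, messages: list) -> str:
        # chat: answer the most recent user message
        last_user = [m["content"] for m in messages if m["role"] == "user"][-1]
        return self.complete(last_user)

    async def acomplete(self, prompt: str) -> str:
        # asynchronous variant of complete()
        return self.complete(prompt)


llm_stub = EchoLLM()
sync_answer = llm_stub.complete("hello")
streamed_answer = "".join(llm_stub.stream_complete("hello")).strip()
chat_answer = llm_stub.chat([{"role": "user", "content": "hello"}])
# inside a notebook, use `await llm_stub.acomplete("hello")` instead of asyncio.run
async_answer = asyncio.run(llm_stub.acomplete("hello"))
print(sync_answer, streamed_answer, chat_answer, async_answer, sep="\n")
```

All four calls return the same text here; with a real model the difference lies in how the prompt is formatted and how the output is delivered.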

In [None]:
# non-streaming
response = OpenAI().complete('Paul Graham is ')
print(response)

**Using LLMs as standalone modules**

1. Text completion example

In [None]:
from llama_index.llms import OpenAI

# non-streaming
response = OpenAI().complete('Paul Graham is ')
print(response)

# using streaming endpoint
from llama_index.llms import OpenAI

llm = OpenAI()
resp = llm.stream_complete('Paul Graham is ')
for delta in resp:
    print(delta, end='')

2. Chat example

In [None]:
from llama_index.llms import ChatMessage, OpenAI

messages = [
    ChatMessage(
        role = 'system',
        content = 'You are a helpful assistant'
    ),
    ChatMessage(
        role = 'user',
        content = 'what is your name?'
    ),
]

resp = OpenAI().chat(messages)
print(resp)

**Customizing LLMs with LlamaIndex Abstractions**

You can plug these abstractions into other LlamaIndex modules such as indexes, retrievers, query engines and agents, allowing you to build advanced workflows over your data.

*Customisation examples:*
1. Changing the underlying LLM


In [None]:
llm = OpenAI(
    temperature = 0.1,
    model = 'gpt-4'
)
service_context = ServiceContext.from_defaults(llm=llm)

2. Changing the number of output tokens

In [None]:
# for OpenAI, Cohere and AI21
llm = OpenAI(
    temperature = 0,
    model = 'text-davinci-002',
    max_tokens = 512    # maxTokens for AI21
)
service_context = ServiceContext.from_defaults(llm=llm)

# for other LLM classes, configure the `context_window` and `num_output` explicitly via the `ServiceContext`,
# as this information is not available by default

context_window = 4096
num_output = 256

llm = OpenAI(
    temperature = 0,
    model = 'text-davinci-002',
    max_tokens = num_output
)

service_context = ServiceContext.from_defaults(
    llm = llm,
    context_window = context_window,
    num_output = num_output
)

3. Using a HuggingFace LLM : LlamaIndex supports using LLMs from HuggingFace directly; you can also set up a local embedding model. Many open-source models require some preamble before each prompt, which is the `system_prompt`. Queries may also need an additional wrapper around the `query_str` itself. Both are usually documented on the model card.

In [None]:
from llama_index.prompts import PromptTemplate

system_prompt = '''<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
'''

# this wraps the default prompts that are internal to llama_index
query_wrapper_prompt = PromptTemplate('<|USER|>{query_str}<|ASSISTANT|>')

import torch
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    context_window = 4096,
    max_new_tokens = 256,
    generate_kwargs = {
        'temperature' : 0.7,
        'do_sample' : False
    },
    system_prompt = system_prompt,
    query_wrapper_prompt = query_wrapper_prompt,
    tokenizer_name = 'StabilityAI/stablelm-tuned-alpha-3b',
    model_name = 'StabilityAI/stablelm-tuned-alpha-3b',
    device_map = 'auto',
    stopping_ids = [50278, 50279, 50277, 1, 0],
    tokenizer_kwargs = {
        'max_length' : 4096
    }
)

service_context = ServiceContext.from_defaults(
    chunk_size = 1024,
    llm = llm
)
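
As a rough picture of what `system_prompt` and `query_wrapper_prompt` do, the sketch below assembles the final string with plain Python formatting. `build_prompt` is a hypothetical helper, and the exact separators LlamaIndex inserts between the two parts may differ; treat this as an approximation, not the library's implementation.

```python
def build_prompt(system_prompt: str, wrapper_template: str, query_str: str) -> str:
    """Wrap the query in the model's template, then prepend the system preamble."""
    wrapped_query = wrapper_template.format(query_str=query_str)
    return f"{system_prompt}{wrapped_query}"


demo_system = "<|SYSTEM|>You are a helpful assistant.\n"
demo_wrapper = "<|USER|>{query_str}<|ASSISTANT|>"
full_prompt = build_prompt(demo_system, demo_wrapper, "What is a vector index?")
print(full_prompt)
```

The model therefore sees its special tokens in the order its model card specifies, which is why these two settings differ from model to model.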

Some models may raise an error if all the keys from the tokenizer are passed to the model. A common tokenizer output that causes issues is `token_type_ids`. Here is how to configure the predictor to remove it -

In [None]:
HuggingFaceLLM(
    #...
    tokenizer_outputs_to_remove = ['token_type_ids']
)
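
Conceptually, this option filters the tokenizer's output dictionary before it reaches the model. A minimal sketch of that idea, using a hypothetical helper `drop_tokenizer_outputs` and made-up token ids:

```python
def drop_tokenizer_outputs(encoded: dict, keys_to_remove: list) -> dict:
    """Return a copy of the tokenizer output without the unwanted keys."""
    return {key: value for key, value in encoded.items() if key not in keys_to_remove}


# made-up tokenizer output, for illustration only
encoded = {
    "input_ids": [101, 7592, 102],
    "attention_mask": [1, 1, 1],
    "token_type_ids": [0, 0, 0],
}
model_inputs = drop_tokenizer_outputs(encoded, ["token_type_ids"])
print(sorted(model_inputs))
```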

**Tokenization**

By default, LlamaIndex uses a global tokenizer for all token counting. This defaults to `cl100k_base` from tiktoken, which matches the default LLM `gpt-3.5-turbo`. If you change the LLM, you may need to update the tokenizer to ensure accurate token counts, chunking and prompting. The single requirement for a tokenizer is that it is a callable that takes a string and returns a list. You can set the global tokenizer as follows.


In [None]:
from llama_index import set_global_tokenizer

# tiktoken
import tiktoken
set_global_tokenizer(tiktoken.encoding_for_model('gpt-3.5-turbo').encode)

# HuggingFace
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta').encode
)
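
Because the only requirement is "a callable that takes a string and returns a list", even a plain function qualifies. The sketch below uses a regex-based splitter purely to show the contract; `simple_tokenizer` and `count_tokens` are hypothetical names, and the counts it produces will not match those of a real model tokenizer.

```python
import re


def simple_tokenizer(text: str) -> list:
    """Split text into word and punctuation tokens (illustration only)."""
    return re.findall(r"\w+|[^\w\s]", text)


def count_tokens(text: str, tokenizer=simple_tokenizer) -> int:
    """Token counting only needs the length of the returned list."""
    return len(tokenizer(text))


tokens = simple_tokenizer("Hello, world!")
print(tokens)
print(count_tokens("Hello, world!"))
```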

**Open-Source LLMs**

1. [llama2-chat-7b-4bit](https://colab.research.google.com/drive/14N-hmJ87wZsFqHktrw40OU6sVcsiSzlQ?usp=sharing)
2. [llama2-chat-13b (replicate)](https://colab.research.google.com/drive/1S3eCZ8goKjFktF9hIakzcHqDE72g0Ggb?usp=sharing)
3. [llama2-70b-chat (replicate)](https://colab.research.google.com/drive/1BeOuVI8StygKFTLSpZ0vGCouxar2V5UW?usp=sharing)
4. [Mistral-7b-instruct-v0.1 4bit](https://colab.research.google.com/drive/1ZAdrabTJmZ_etDp10rjij_zME2Q3umAQ?usp=sharing)
5. [zephyr-7b-alpha](https://colab.research.google.com/drive/16Ygf2IyGNkb725ZqtRmFQjwWBuzFX_kl?usp=sharing)
6. [zephyr-7b-beta](https://colab.research.google.com/drive/1UoPcoiA5EOBghxWKWduQhChliMHxla7U?usp=sharing)
7. [stablelm-zephyr-3b](https://colab.research.google.com/drive/1USBIOs4yUkjOcxTKBr7onjlzATE-974T?usp=sharing)
8. [starling-lm-7b-alpha](https://colab.research.google.com/drive/1Juk073EWt2utxHZY84q_NfVT9xFwppf8?usp=sharing)

**Module Integrations**

1. HuggingFace LLM - Camel-5b

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# download data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

# load documents, build the VectorStoreIndex
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

# model - 1
# setup prompts : for StableLM, taken from model card
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

# model - 2 
# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = PromptTemplate(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)

import torch

llm = HuggingFaceLLM(
    context_window=2048,    #4096
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",    #StabilityAI/stablelm-tuned-alpha-3b
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    #stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 2048},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm) #1024
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

# query index
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

# query index - streaming
query_engine = index.as_query_engine(streaming=True)
# set Logging to DEBUG for more detailed outputs
response_stream = query_engine.query("What did the author do growing up?")

# can be slower to start streaming since llama-index often involves many LLM calls
response_stream.print_response_stream()

# can also get a normal response object
response = response_stream.get_response()
print(response)

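# can also iterate over the generator yourself
# (the generator is single-use: re-run the streaming query first if it was already consumed above)
generated_text = ""
for text in response_stream.response_gen:
    generated_text += text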

2. Local Llama2 + VectorStoreIndex : requires a GPU (an A100 with at least 40GB of memory)

In [None]:
!pip install ipywidgets

# login to HF hub with access to llama2 models
# using 'huggingface-cli login' in console

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


from IPython.display import Markdown, display

import torch
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

# Model names (make sure you have access on HF)
LLAMA2_7B = "meta-llama/Llama-2-7b-hf"
LLAMA2_7B_CHAT = "meta-llama/Llama-2-7b-chat-hf"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
LLAMA2_70B = "meta-llama/Llama-2-70b-hf"
LLAMA2_70B_CHAT = "meta-llama/Llama-2-70b-chat-hf"

selected_model = LLAMA2_13B_CHAT

SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given source documents. Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)

# download data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

from llama_index import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
documents

from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    set_global_service_context,
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en"
)
set_global_service_context(service_context)

index = VectorStoreIndex.from_documents(documents)

# querying
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

# streaming support
import time

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("What happened at interleaf?")

start_time = time.time()

token_count = 0
for token in response.response_gen:
    print(token, end="")
    token_count += 1

time_elapsed = time.time() - start_time
tokens_per_second = token_count / time_elapsed

print(f"\n\nStreamed output at {tokens_per_second} tokens/s")


3. LlamaCPP : Highly configurable. Depending on the model used, you may want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs. The default model is llama2-chat, so we can use the utility functions in `llama_index.llms.llama_utils`. Any kwargs needed at initialization go in `model_kwargs`, and any kwargs needed at inference go in `generate_kwargs`.

In [None]:
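from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"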

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

# text completion
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

# stream_complete
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)

# query engine with llamacpp
# we can simply pass the llm abstraction to the query engine as usual
# but first, let's change the global tokenizer to match the LLM
from llama_index import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode
)
# use Huggingface embeddings
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# create a service context
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)
# load documents
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()
# create vector store index
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
# set up query engine
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)


4. [Using LlamaIndex with local models using GPT4ALL](https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing#scrollTo=20cf0152)

<center><h1>Embeddings</h1></center>

Embeddings represent documents numerically: an embedding model takes text as input and returns a long list of numbers that captures the semantics of the text. Similarity between embeddings is computed with cosine similarity by default. The default embedding model is `text-embedding-ada-002` from OpenAI.
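
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A minimal pure-Python sketch (the function name and the toy vectors are chosen for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.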

**Usage Pattern**

The most common usage is to set the embedding model in the service context object and then use it to construct the index and run queries. Input documents are broken down into nodes, and the embedding model generates an embedding for each node. Optionally, you can set a global service context to avoid passing it into other objects every time.

In [None]:
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# local model
service_context = ServiceContext.from_defaults(embed_model='local')

# global service context
from llama_index import set_global_service_context
set_global_service_context(service_context)

**Customization**

1. Batch size : the default is 10. Lower it if you run into rate limits, or raise it if you are embedding many documents and the default is too small.

In [None]:
# set batch size to 42
embed_model = OpenAIEmbedding(embed_batch_size=42)
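
The batch size controls how many texts go into each embedding request. The sketch below shows only the grouping step, with a hypothetical `batched` helper; it makes no API calls.

```python
def batched(items: list, batch_size: int):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(25)]
batch_lengths = [len(batch) for batch in batched(texts, 10)]
print(batch_lengths)  # [10, 10, 5]
```

With 25 chunks, a batch size of 10 means three requests; a larger batch size means fewer, larger requests.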

2. Local embedding model

In [None]:
service_context = ServiceContext.from_defaults(
    embed_model = 'local:BAAI/bge-large-en'
)

3. HuggingFace Optimum ONNX

In [None]:
!pip install transformers optimum[exporters]

from llama_index.embeddings import OptimumEmbedding
OptimumEmbedding.create_and_save_optimum_model(
    'BAAI/bge-small-en-v1.5',
    './bge_onnx'
)

embed_model = OptimumEmbedding(folder_name = './bge_onnx')
service_context = ServiceContext.from_defaults(embed_model = embed_model)

**Local Embeddings with HuggingFace**

Supports BGE, Instructor and other models. ONNX models can also be created and used.

In [None]:
# HuggingFace Embedding
from llama_index.embeddings import HuggingFaceEmbedding

# loads BAAI/bge-small-en
embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(
    model_name = 'BAAI/bge-small-en-v1.5'
)

embeddings = embed_model.get_text_embedding('Hello World!')
print(len(embeddings))
print(embeddings[:5])

In [None]:
# Instructor Embedding
from llama_index.embeddings import InstructorEmbedding

embed_model = InstructorEmbedding(
    model_name = 'hkunlp/instructor-base'
)

embeddings = embed_model.get_text_embedding('Hello World!')
print(len(embeddings))
print(embeddings[:5])

**Benchmarking**

In [None]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

# HF
import os
import openai

# needed to synthesize responses later
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
from llama_index.embeddings import HuggingFaceEmbedding

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
test_embeds = embed_model.get_text_embedding("Hello World!")

service_context = ServiceContext.from_defaults(embed_model=embed_model)

In [None]:
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)

# ONNX
from llama_index.embeddings import OptimumEmbedding

embed_model = OptimumEmbedding(folder_name="./bge_onnx")
test_embeds = embed_model.get_text_embedding("Hello World!")

service_context = ServiceContext.from_defaults(embed_model=embed_model)

In [None]:
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)


<center><h1>Loading Data (Ingestion)</h1></center>

Before your LLM can act on your data, you need to process and load it. Ingestion has three stages :
1. Load the data
2. Transform the data
3. Index and store the data

**Loaders**

Data connectors, also called `Readers`, ingest data from different sources and format it into `Document` objects. A `Document` is a collection of data and metadata about that data.

The easiest way to load data is with `SimpleDirectoryReader`. It can read a variety of formats including markdown, PDFs, Word documents, PowerPoint decks, images, audio and video.

In [None]:
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader('./data').load_data(show_progress=True)

In [None]:
# loading connector from LlamaHub - query against SQL database
import os

from llama_index import download_loader

DatabaseReader = download_loader('DatabaseReader')

reader = DatabaseReader(
    scheme = os.getenv('DB_SCHEME'),
    host = os.getenv('DB_HOST'),
    port = os.getenv('DB_PORT'),
    user = os.getenv('DB_USER'),
    password = os.getenv('DB_PASS'),
    dbname = os.getenv('DB_NAME'),
)

query = 'SELECT * FROM users'
documents = reader.load_data(query=query)

In [None]:
from llama_index import download_loader

GoogleDocsReader = download_loader('GoogleDocsReader')
loader = GoogleDocsReader()
documents = loader.load_data(
    document_ids = [...]
)

In [None]:
from llama_hub.web.simple_web.base import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(
    urls=['']
)

In [None]:
# creating documents directly
from llama_index.schema import Document
doc = Document(text='text')

**Transformations**

After loading, data has to be processed and transformed before putting it into a storage system. These include chunking, extracting metadata and embedding each chunk. This is necessary to make sure the data can be retrieved and used optimally by the LLM. Transformation inputs/outputs are Node objects (Document is a subclass of Node). They can be stacked and reordered.

A Document is a generic container around any data source. It can be constructed manually or created automatically via the data loaders, which return Document objects from their `load_data` function. By default, a Document stores text along with some attributes such as metadata and relationships.

A Node is a chunk of a source Document. Nodes also contain metadata and relationship information with other nodes. You may choose to parse source Documents into Nodes through the NodeParser classes.

Customising Documents :
Using the metadata dictionary on each document, additional information can be included to help inform responses and track down sources for query responses. Any information set in the metadata dictionary will show up in the metadata of each source node created from the document. This includes filename and doc_id.

In [None]:
# file name

# 1. in constructor
document = Document(
    text = 'text',
    metadata = {
        'filename' : '<doc_filename>',
        'category' : '<doc_category>'
    }
)

# 2. after document is created
document.metadata = {
    'filename' : '<doc_filename>'
}

# 3. set filename automatically
from llama_index import SimpleDirectoryReader

filename_fn = lambda filename : {'file_name' : filename}
documents = SimpleDirectoryReader(
    './data',
    file_metadata = filename_fn
).load_data()

In [None]:
# doc_id

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    './data',
    filename_as_id=True
).load_data()
print([x.doc_id for x in documents])

# set it manually
document.doc_id = 'My new document id'


High-level API : the `.from_documents()` method of an index accepts an array of Document objects and correctly parses and chunks them. This splits each Document into Node objects, which are similar to Documents but keep a relationship to their parent Document.

In [None]:
from llama_index import VectorStoreIndex
vector_index = VectorStoreIndex.from_documents(documents)
vector_index.as_query_engine()

You can also pass in a custom ServiceContext to customize core components.

In [None]:
from llama_index.node_parser import SentenceSplitter

# load documents first
text_splitter = SentenceSplitter(
    chunk_size = 512,
    chunk_overlap = 10
)
service_context = ServiceContext.from_defaults(
    text_splitter=text_splitter
)
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

Lower-level API : You can also define these steps explicitly, either using the transformation modules (text splitters, metadata extractors, etc.) as standalone components or compose them into our declarative Transformation pipeline interface.

Splitting documents into nodes : split documents into chunks. Key idea is to process data into bite-sized pieces that can be retrieved / fed to the LLM. Can be used on their own or as part of the ingestion pipeline. 

In [None]:
from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import TokenTextSplitter

docs = SimpleDirectoryReader('./data').load_data()
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(),
        ...
    ]
)
nodes = pipeline.run(documents=docs)
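
To show what "splitting into chunks" means mechanically, here is a minimal sliding-window splitter over a list of tokens. It is a sketch of the general technique under simplifying assumptions (whitespace tokens, fixed window); `split_with_overlap` is a hypothetical helper and not how `TokenTextSplitter` is implemented.

```python
def split_with_overlap(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.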

Adding metadata : Either manually or with automatic metadata extractors.

You can use LLMs to automate metadata extraction with the Metadata Extractor modules. It includes the following feature extractors:
1. SummaryExtractor - automatically extract a summary over a set of nodes
2. QuestionsAnsweredExtractor - extracts a set of questions that each Node can answer
3. TitleExtractor - extracts a title over the context of each Node
4. EntityExtractor - extracts entities mentioned in the context of each Node

Then you can chain the Metadata Extractors with the node parser.

In [None]:
from llama_index.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor
)
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=' ',
    chunk_size=512,
    chunk_overlap=128
)

title_extractor = TitleExtractor(nodes=5)
qa_extractor = QuestionsAnsweredExtractor(questions=3)

# assume documents are defined -> extract nodes
from llama_index.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor,
        qa_extractor
    ]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True
)
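
The pipeline idea itself is simple: each transformation takes a list of nodes and returns a list of nodes, and the pipeline applies them in order. The toy below uses plain dictionaries instead of LlamaIndex Node objects; all function names are hypothetical. It also shows how metadata set on a document is carried onto every node derived from it.

```python
def split_sentences(nodes: list) -> list:
    """Splitter step: one output node per sentence, inheriting the parent's metadata."""
    out = []
    for node in nodes:
        for sentence in node["text"].split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                out.append({"text": sentence, "metadata": dict(node["metadata"])})
    return out


def add_word_count(nodes: list) -> list:
    """Extractor step: attach one extra metadata field to every node."""
    for node in nodes:
        node["metadata"]["word_count"] = len(node["text"].split())
    return nodes


def run_pipeline(nodes: list, transformations: list) -> list:
    """Apply each transformation in order, feeding its output to the next."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs_in = [{"text": "Revenue grew. Costs fell sharply.", "metadata": {"file_name": "report.txt"}}]
nodes_out = run_pipeline(docs_in, [split_sentences, add_word_count])
for node in nodes_out:
    print(node)
```

Reordering the list changes the result, which is why the splitter comes before the extractors in the cell above.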

**Automated Metadata Extraction**



In [None]:
import nest_asyncio
nest_asyncio.apply()
import os

from llama_index.node_parser import SemanticSplitterNodeParser
from llama_index.extractors import (
    SummaryExtractor,
    TitleExtractor,
    EntityExtractor
)
from llama_index.ingestion import IngestionPipeline

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

extractor = [
    SummaryExtractor(
        summaries=['prev', 'self', 'next'],
        service_context=service_context
    ),
    TitleExtractor(
        nodes=5
    ),
    EntityExtractor(
        prediction_threshold=0.5,
        label_entities=False,
        device='cpu'
    )
]

pipeline = IngestionPipeline(
    transformations=[splitter, *extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=False,
    show_progress=True
)

In [None]:

index = VectorStoreIndex(nodes=nodes)
engine = index.as_query_engine(
    similarity_top_k=3
)


<center><h1>Query Engine</h1></center>

A query engine is a generic interface that lets you ask questions over your data. It takes in a natural language query and returns a rich response. It is most often built on one or more indexes via retrievers. You can compose multiple query engines to achieve more advanced capability.

In [None]:
# usage pattern
query_engine = index.as_query_engine()
response = query_engine.query('')

# to stream response
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query('')
response.print_response_stream()

**Configuring a query engine**

A query engine can be built and configured directly from an index, as follows. Response modes currently supported :
1. `refine` -
* creates and refines an answer by going through each retrieved text chunk sequentially
* makes a separate LLM call per node / retrieved chunk
* the first chunk is queried with the `text_qa_template` prompt; the answer and the next chunk are then used in another query with the `refine_template` prompt, and so on
* a chunk that is too large for the context window is split using `TokenTextSplitter`
* good for more detailed answers

2. `compact` -
* the default; similar to `refine`, but concatenates the chunks beforehand, resulting in fewer LLM calls
* stuffs as many chunks as can fit in the context window into each call

3. `tree_summarize` -
* queries the LLM using the `summary_template` prompt as many times as needed so that all concatenated chunks have been queried; the resulting answers are themselves treated as chunks and summarized recursively until a single answer remains
* good for summarization

4. `simple_summarize`
5. `no_text`
6. `accumulate`
7. `compact_accumulate`

In [None]:
query_engine = index.as_query_engine(
    response_mode='tree_summarize',
    verbose=True,
)
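
The difference between `refine` and `compact` comes down to how many LLM calls are made. The sketch below imitates both with a counting stub instead of a real model; `refine`, `compact`, `CountingLLM` and the character-based `max_chars` budget are all simplifications assumed for illustration (the real modes budget by tokens and use prompt templates).

```python
class CountingLLM:
    """Stub LLM that records how many times it was called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer after {self.calls} call(s)"


def refine(chunks: list, llm) -> str:
    """One call for the first chunk, then one refinement call per remaining chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    answer = llm(f"Answer using: {chunks[0]}")
    for chunk in chunks[1:]:
        answer = llm(f"Refine '{answer}' using: {chunk}")
    return answer


def compact(chunks: list, llm, max_chars: int) -> str:
    """Pack chunks together up to max_chars, then refine over the packed chunks."""
    packed, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) + 1 > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = f"{current} {chunk}".strip()
    if current:
        packed.append(current)
    return refine(packed, llm)


chunks = ["aaaa", "bbbb", "cccc", "dddd"]
refine_llm, compact_llm = CountingLLM(), CountingLLM()
refine(chunks, refine_llm)
compact(chunks, compact_llm, max_chars=10)
print(refine_llm.calls, compact_llm.calls)  # 4 2
```

Four chunks cost four calls under `refine` but only two under `compact` here, because two chunks fit into each packed prompt.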


**Low-level composition API**

For more granular control, explicitly construct a query engine object instead of calling `index.as_query_engine()`.

In [None]:
from llama_index import (
    VectorStoreIndex,
    get_response_synthesizer
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

index = VectorStoreIndex.from_documents(documents)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2
)

response_synthesizer = get_response_synthesizer(
    response_mode='tree_summarize',
    streaming=True
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer
)

response = query_engine.query('')
response.print_response_stream()
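
The retriever's `similarity_top_k` setting means "score every stored node against the query embedding and keep the k best". A minimal sketch of that selection with toy two-dimensional vectors follows; `top_k` and the node ids are hypothetical, and a real vector store would use a more efficient search.

```python
import heapq
import math


def cosine(a, b) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, node_vecs: dict, k: int) -> list:
    """Return the k (score, node_id) pairs with the highest similarity."""
    scored = ((cosine(query_vec, vec), node_id) for node_id, vec in node_vecs.items())
    return heapq.nlargest(k, scored)


node_vecs = {"n1": [1.0, 0.0], "n2": [0.7, 0.7], "n3": [0.0, 1.0]}
hits = top_k([1.0, 0.1], node_vecs, k=2)
print([node_id for _, node_id in hits])  # ['n1', 'n2']
```

Only the selected nodes are passed on to the response synthesizer, so k trades answer coverage against prompt length.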

**Using a cached index**

In [None]:
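import os
from llama_index import StorageContext, load_index_from_storage

# build and persist the index on the first run, reload it from disk afterwards
if not os.path.exists("./citation"):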
    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    index.storage_context.persist(persist_dir="./citation")
else:
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./citation"),
        service_context=service_context,
    )

**CitationQueryEngine**

In [None]:
import os
from llama_index.llms import OpenAI
from llama_index.query_engine import CitationQueryEngine
from llama_index.retrievers import VectorIndexRetriever
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    ServiceContext,
)


service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)


!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

if not os.path.exists("./citation"):
    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    index.storage_context.persist(persist_dir="./citation")
else:
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./citation"),
        service_context=service_context,
    )


query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    # here we can control how granular citation sources are, the default is 512
    citation_chunk_size=512,
)
response = query_engine.query("What did the author do growing up?")
print(response)

# there are 6 source nodes, because the retrieved 1024-sized chunks were broken into more granular nodes
print(len(response.source_nodes))


print(response.source_nodes[0].node.get_text())

print(response.source_nodes[1].node.get_text())


# adjusting settings
query_engine = CitationQueryEngine.from_args(
    index,
    # increase the citation chunk size!
    citation_chunk_size=1024,
    similarity_top_k=3,
)

response = query_engine.query("What did the author do growing up?")
print(response)

# should be fewer source nodes now!
print(len(response.source_nodes))

# inspecting the actual source
print(response.source_nodes[0].node.get_text())
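
The source-node counts mentioned in the comments above follow from simple arithmetic, assuming every retrieved chunk is a full 1024 tokens long (shorter chunks would yield fewer citation nodes). `citation_node_count` is a hypothetical helper written only to show the calculation.

```python
import math


def citation_node_count(top_k: int, retrieved_chunk_size: int, citation_chunk_size: int) -> int:
    """Upper bound on citation nodes: each retrieved chunk is re-split into smaller pieces."""
    pieces_per_chunk = math.ceil(retrieved_chunk_size / citation_chunk_size)
    return top_k * pieces_per_chunk


granular = citation_node_count(top_k=3, retrieved_chunk_size=1024, citation_chunk_size=512)
coarse = citation_node_count(top_k=3, retrieved_chunk_size=1024, citation_chunk_size=1024)
print(granular, coarse)  # 6 3
```

A smaller `citation_chunk_size` gives more precise citations at the cost of more source nodes in the prompt.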
