# RAG Template

- Vectorization and inference require an OpenAI API key, set as the `OPENAI_API_KEY` environment variable in your local environment.
- Upload your documents to the `data` folder.
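
Before running the cells below, it can help to confirm that the key is visible to the kernel. The following is a minimal sketch, assuming the key is stored in `OPENAI_API_KEY`; the helper name `has_openai_key` is hypothetical and not part of any library.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY exists in the given mapping."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; embedding and query calls will likely fail.")
```

The check only verifies that a value is present, not that the key is valid.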

In [8]:
%pip install -q llama-index

Note: you may need to restart the kernel to use updated packages.


## ETL (Oversimplified)

### Data Loading from Local Folder

Note the use of `metadata`: it is useful at inference time (e.g. to cite authors or sources).

In [2]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="data")
docs = reader.load_data()

for doc in docs:
    # Keep only the file name; `metadata` is the current name for the legacy `extra_info` alias.
    doc.metadata = {'filename': doc.metadata['file_name']}


### Vectorization & Indexation

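This step defaults to OpenAI embedding models; other models can be used instead.

Note the `chunk_size` and `chunk_overlap` parameters: they set the maximum chunk length in tokens and the number of tokens shared between consecutive chunks.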

In [3]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

from llama_index.core import Settings

# global default, used by any index that is built without its own transformations
Settings.text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# per-index override: these transformations replace the global default for this index
index = VectorStoreIndex.from_documents(
    docs,
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=32)],
)
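
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `SentenceSplitter`, which also respects sentence boundaries and counts tokens with a tokenizer; the function name `chunk_with_overlap` and the toy word list are illustrative assumptions.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


words = "a b c d e f g h i j".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap yields more chunks and more repeated context per chunk.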

### Persist Index

In practice, you might want to use a vector database such as [Weaviate](https://weaviate.io/) or [Pinecone](https://www.pinecone.io/).

Inspect the `./storage` folder to see how your documents were split into chunks and how the chunks and their embeddings are indexed.

In [None]:
index.storage_context.persist(persist_dir="./storage")
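
A persisted index can be reloaded in a later session without re-embedding the documents. The sketch below assumes the index was saved to `./storage` as above and uses `StorageContext.from_defaults` and `load_index_from_storage` from `llama_index.core`; the wrapper name `load_persisted_index` is hypothetical.

```python
def load_persisted_index(persist_dir="./storage"):
    """Rebuild an index from the files written by index.storage_context.persist()."""
    # Imported inside the function so the helper can be defined before llama-index is installed.
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
```

Calling `index = load_persisted_index()` should return an index that can be queried with `index.as_query_engine()`, provided the same embedding model is configured as when the index was built.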

### RAG

#### Sample_1

In [9]:
query_engine = index.as_query_engine()
response = query_engine.query("which vegetables contribute to carbon sequestration?")

Note how source citation is enabled by the custom `metadata` and `source_nodes` fields:

In [None]:
import pprint
pprint.pprint(response)

Response(response='Grain legumes and dryland cereals, such as chickpea, common '
                  'bean, cowpea, groundnut, lentil, pigeon pea, soybean, '
                  'finger millet, pearl millet, and sorghum, contribute to '
                  'carbon sequestration.',
         source_nodes=[NodeWithScore(node=TextNode(id_='fe4f1f62-c95f-4464-8084-844b65b8fbab', embedding=None, metadata={'filename': 'Grain-legumes-and-dryland-cereals-contribute-to-carbo_2023_Agriculture--Ecos.pdf'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f5d70dca-0fa6-4a6c-944d-4021ab5c9299', node_type='4', metadata={'filename': 'Grain-legumes-and-dryland-cereals-contribute-to-carbo_2023_Agriculture--Ecos.pdf'}, hash='ff5a334
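
Rather than reading the raw `Response` dump, the cited files can be pulled out of `source_nodes` directly. This is a minimal sketch that relies only on the attributes visible in the output above (`source_nodes`, `node`, `metadata`) plus the `score` attribute of `NodeWithScore`; the helper name `list_sources` and the stand-in object with made-up values are assumptions for illustration.

```python
from types import SimpleNamespace


def list_sources(response):
    """Return (filename, score) pairs, one per retrieved source node."""
    sources = []
    for item in response.source_nodes:
        # 'filename' is the custom metadata key set at load time.
        filename = item.node.metadata.get("filename", "unknown")
        sources.append((filename, item.score))
    return sources


# Stand-in object with made-up values, mimicking the shape of a query response.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            node=SimpleNamespace(metadata={"filename": "example.pdf"}),
            score=0.83,
        )
    ]
)
print(list_sources(fake_response))
```

With a real query result, `list_sources(response)` would give one entry per retrieved chunk, so the same file may appear more than once.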

#### Sample_2

In [14]:
query_engine = index.as_query_engine()
response = query_engine.query("which strategy do you recommend to optimize soil nitrate levels?")

In [15]:
import pprint
pprint.pprint(response)

Response(response='I recommend sowing cover crops on 9 – 10 August as this '
                  'strategy resulted in the lowest nitrate concentration in '
                  'the drainage water over autumn and winter months compared '
                  'to later sowing times. This approach can help optimize soil '
                  'nitrate levels by reducing nitrate leaching and potentially '
                  'improving nitrogen uptake by the cover crops.',
         source_nodes=[NodeWithScore(node=TextNode(id_='2b5a15e4-938f-46fe-96b6-afe3352606cb', embedding=None, metadata={'filename': 'Delaying-sowing-of-cover-crops-decreases-the-abi_2023_Agriculture--Ecosystem.pdf'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeI