## Load data (Ingestion)

Before your chosen LLM can act on your data, you first need to process the data and load it. This has parallels to data cleaning/feature engineering pipelines in the ML world, or ETL pipelines in the traditional data setting.

This ingestion pipeline typically consists of three main stages:

- Load the data
- Transform the data
- Index and store the data

#### SimpleDirectoryReader

In [2]:
from llama_index.core import SimpleDirectoryReader


documents = SimpleDirectoryReader("../../data").load_data()

### Using Readers from LlamaHub

In [None]:
from llama_index.core import download_loader
from llama_index.readers.database import DatabaseReader
import os
from dotenv import load_dotenv


reader = DatabaseReader(
    scheme = os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
    
)

query = "SELECT * FROM users"
documents = reader.load_data(query=query)

#### Creating Documents directly
Instead of using a loader, you can also use a Document directly.

In [14]:
from llama_index.core import Document

doc = Document(text = "This is the longer piece of text", metadata = {"source":"example.txt", "author":"Koyilbek"})

In [13]:
doc

Document(id_='b0d18c3e-dcfd-468b-966b-1f695c32d391', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Text', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')

### Transformations
After the data is loaded, you then need to process and transform your data before putting it into a storage system. These transformations include chunking, extracting metadata, and embedding each chunk. This is necessary to make sure that the data can be retrieved, and used optimally by the LLM.

Transformation input/outputs are Node objects (a Document is a subclass of a Node). Transformations can also be stacked and reordered.

We have both a high-level and lower-level API for transforming documents.

### High-Level Transformation API
Indexes have a .from_documents() method which accepts an array of Document objects and will correctly parse and chunk them up. However, sometimes you will want greater control over how your documents are split up.

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
import warnings
warnings.resetwarnings()


# local embedding
Settings.embed_model = HuggingFaceEmbedding(model_name = "BAAI/bge-small-en-v1.5")

# local LLM
Settings.llm = HuggingFaceLLM(
    model_name="microsoft/phi-2",  # This is a smaller model that works well for most tasks
    tokenizer_name="microsoft/phi-2",
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    device_map="auto",
)

# laod and index your document
documents = SimpleDirectoryReader("../../data").load_data()
vector_index = VectorStoreIndex.from_documents(documents)
query_engine =  vector_index.as_query_engine()

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.24s/it]
  docstore.set_document_hash(doc.get_doc_id(), doc.hash)


In [6]:
response1 = query_engine.query("What are the main topics discussed in these documents")
response1

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Response(response="------------\nGiven the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\nQuery: What are the main topics discussed in these documents\nAnswer: ------------\nGiven the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\nQuery: What are the main topics discussed in these documents\nAnswer: ------------\nGiven the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\nQuery: What are the main topics discussed in these documents\nAnswer: ------------\nGiven the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\nQuery: What are the main topics discussed in these documents\nAnswer: ------------\nGiven the new context, refine the original answer to better answer the query. If the co

Under the hood, this splits your Document into Node objects, which are similar to Documents (they contain text and metadata) but have a relationship to their parent Document.

If you want to customize core components, like the text splitter, through this abstraction you can pass in a custom transformations list or apply to the global Settings:

In [11]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader

# Create the text splitter
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# Set it globally
Settings.text_splitter = text_splitter

# Load documents
documents = SimpleDirectoryReader("../../data").load_data()

# Create index - note the correct syntax here
index = VectorStoreIndex.from_documents(
    documents,  # Just pass documents directly
    transformations=[text_splitter]
)

# Now you can create a query engine and use it
query_engine = index.as_query_engine()

  docstore.set_document_hash(doc.get_doc_id(), doc.hash)


### Lower-Level Transformation API
You can also define these steps explicitly.

You can do this by either using our transformation modules (text splitters, metadata extractors, etc.) as standalone components, or compose them in our declarative Transformation Pipeline interface.

Let's walk through the steps below.

Splitting Your Documents into Nodes#
A key step to process your documents is to split them into "chunks"/Node objects. The key idea is to process your data into bite-sized pieces that can be retrieved / fed to the LLM.

LlamaIndex has support for a wide range of text splitters, ranging from paragraph/sentence/token based splitters to file-based splitters like HTML, JSON.

These can be used on their own or as part of an ingestion pipeline.

In [30]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.ingestion import IngestionPipeline

In [31]:
# loading documents
documents = SimpleDirectoryReader("../../data").load_data()

# pipeline with text splitter 
pipeline = IngestionPipeline(
    transformations=[TokenTextSplitter(), ])

# processing documents into nodes
nodes =  pipeline.run(documents=documents)

In [32]:
# If you want to see the text content of each node
for node in nodes:
    print(node.text[:50])  #taking only 50 characters from each node's text

1
GPT (Generative Pre-trained Transformer) – A
Com
Generative Pre-trained Transformer
GPU Graphics Pr
2
machines in a more natural way. The evolution of
These
GPT models have demonstrated great potential
3
they concluded the paper by highlighting the key
material.
Next, we reviewed the abstracts of the r
4
SECTION-1:INTRODUCTION
MOTIVATION RELATED SURVEY
5
TABLE II
COMPARISON OF THIS SURVEY WITH THE EXIS
major
success as a result of its pre-training. Thi
6
ELIZA-1960
Pattern matching &
replacement
Helps 
7
model [30]. The Transformer model uses self-atte
that enables a model to discover
the statistical c
8
TABLE III
COMPARSION OF DIFFERENT VERSIONS OF GP
GPT models capture the
variations in language usag
9
Add & Norm
Multi-Head
Attention 
Softmax
Linear 
10
Layer Norm
Feed
Forward 
Text & Position
Embedd
11
HUMAN
INSTRUCTION GPT TEXT RESULTS
Data
Input O
12
and any breakdown in connectivity may result in
devices, cloud servers, and end-users [69].
Though
13
thereby improving its reliab

### Adding Metadata

You can also choose to add metadata to your documents and nodes. This can be done either manually or with automatic metadata extractors.

Here are guides on 1) how to customize Documents, and 2) how to customize Nodes.

In [38]:
from llama_index.core import Document
document = Document(
    text="I love you ",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)

In [39]:
document.metadata, document.text

({'filename': '<doc_file_name>', 'category': '<category>'}, 'I love you')

#### Adding Embeddings
To insert a node into a vector index, it should have an embedding. See our ingestion pipeline or our embeddings guide for more details.

Creating and passing Nodes directly#
If you want to, you can create nodes directly and pass a list of Nodes directly to an indexer:

In [46]:
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex


node1 = TextNode(text = "<I love you ...>", id = "<12>")
node2 = TextNode(text = "<I hope we will meet again>", id = "<13>")


index = VectorStoreIndex([node1, node2])

We will print out the vectore store embedding data

In [54]:
# Get vector store
vector_store = index._vector_store

# Print all data (includes embeddings)
print("Vector store data:")
print(vector_store.data)

Vector store data:
SimpleVectorStoreData(embedding_dict={'ba9ee7ba-2854-48a7-a1b7-70dbb80c389c': [-0.03040938451886177, -0.014018412679433823, 0.0174795500934124, -0.06650132685899734, -0.05843060091137886, -0.01967405155301094, -0.0110123660415411, 0.05264576897025108, 0.03445570170879364, -0.01756100356578827, -0.027648719027638435, 0.005101713351905346, -0.020395664498209953, -0.008675223216414452, 0.03625844419002533, 0.002586199901998043, 0.00774774793535471, -0.06673156470060349, -0.052667807787656784, 0.01689990796148777, 0.06968507170677185, -0.002469509607180953, -0.01484872866421938, -0.019200270995497704, 0.004476430360227823, -0.018963810056447983, 0.009290208108723164, -0.054354630410671234, -0.004516778979450464, -0.11951632052659988, 0.025269409641623497, -0.0022044547367841005, 0.07578909397125244, 0.01263631135225296, 0.02094903029501438, 0.06233750283718109, -0.03230639547109604, 0.006675063632428646, -0.026273347437381744, 0.010905041359364986, 0.03326594457030296, -