before an llm can use your data, the data has to be loaded and processed.

the ingestion pipeline usually follows three stages:
1. load data
2. transform data
3. index & store data

llamaindex loads data via data connectors, also called 'Readers'. these ingest data from different sources and format it into 'Document' objects.

a 'Document' is a collection of data (text) and metadata about that data.
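to make the "text plus metadata" idea concrete, here is a plain-python sketch of a reader that turns files into document-like records. it does not use llamaindex at all: `ToyDocument` and `load_text_files` are hypothetical names made up for this illustration, and the real `Document` class carries more fields than this.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: some text plus metadata about it."""

    text: str
    metadata: dict = field(default_factory=dict)


def load_text_files(directory):
    """Toy 'reader': build one ToyDocument per .txt file, sorted by file name."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append(
            ToyDocument(
                text=path.read_text(encoding="utf-8"),
                metadata={"file_name": path.name},
            )
        )
    return docs


# demo: write two small files to a temporary directory and load them back
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "b.txt").write_text("world", encoding="utf-8")
    toy_docs = load_text_files(tmp)

for d in toy_docs:
    print(d.metadata["file_name"], "->", d.text)
```

the point is only the shape of the result: every source item ends up as the same kind of object, so later stages do not care where the data came from.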

<h1>loading</h1>

In [None]:
import os

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data() # this creates 'documents' out of every file in the data directory
# it can handle many formats (markdown, pdfs, docx, images, audio, video, etc.)

In [None]:
from llama_index.readers.database import DatabaseReader

# you can also load data from a SQL database
# this runs a query against the database and returns every row of the result as a Document

reader = DatabaseReader(
    scheme=os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
)

query = "SELECT * FROM users"
documents = reader.load_data(query=query)

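note: if the DB_* environment variables are not set, every argument above is None and the cell fails with:

ValueError: You must provide either a SQLDatabase, a SQL Alchemy Engine, a valid connection URI, or a valid set of credentials.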

In [None]:
# you can also create a Document directly

from llama_index.core import Document

doc = Document(text="text") # this creates a document with the text "text"

<h1>transformations</h1>

after loading data, we need to process and transform it before putting it into a storage system.

transformations include chunking, extracting metadata, and embedding each chunk. this makes the data easier to retrieve and more useful to the llm.

the inputs and outputs of a transformation are 'Node' objects. (a 'Document' is a subclass of 'Node')

<h3>high level transformation api</h3>

indexes have a .from_documents() method which accepts a list of Document objects and parses and chunks them with default settings. however, sometimes you may want greater control over this.

In [None]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()  # keep a handle so the engine can be queried later

essentially, .from_documents() splits each Document into Node objects, which are similar to Documents (they contain text and metadata) but have a relationship to their parent Document.
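the parent relationship is easier to see in a small sketch. this is plain python, not llamaindex code: `ToyDoc`, `ToyNode`, `source_doc_id` and `split_into_nodes` are hypothetical names, and the naive split on "." is only there to keep the example short.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical document: text, metadata, and a unique id."""

    text: str
    metadata: dict = field(default_factory=dict)
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ToyNode:
    """Hypothetical node: a chunk of text that remembers its parent document."""

    text: str
    metadata: dict
    source_doc_id: str


def split_into_nodes(doc, sentences_per_node=2):
    """Group sentences into chunks; each chunk copies the parent's metadata and id."""
    sentences = [s.strip() for s in doc.text.split(".") if s.strip()]
    nodes = []
    for i in range(0, len(sentences), sentences_per_node):
        chunk = ". ".join(sentences[i : i + sentences_per_node]) + "."
        nodes.append(
            ToyNode(
                text=chunk,
                metadata=dict(doc.metadata),  # copy, so nodes do not share one dict
                source_doc_id=doc.doc_id,
            )
        )
    return nodes


parent = ToyDoc(text="One. Two. Three. Four. Five.", metadata={"file_name": "notes.txt"})
toy_nodes = split_into_nodes(parent)

for n in toy_nodes:
    print(n.text, "| parent:", n.source_doc_id == parent.doc_id)
```

because every node keeps a reference to its source, a retrieved chunk can always be traced back to the document it came from.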

If you want to customize core components, like the text splitter, while still using this abstraction, you can pass in a custom transformations list or set the component on the global Settings:

In [None]:
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# global
from llama_index.core import Settings

Settings.text_splitter = text_splitter

# per-index
index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter]
)

<h3>lower level transformation api</h3>

if we want, we can define these steps explicitly.

the first step is to split the documents into nodes:
we need to split our data into small pieces that fit within the llm context window.

LlamaIndex supports a wide range of text splitters, from paragraph-, sentence-, and token-based splitters to file-based splitters for formats such as HTML and JSON.
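what `chunk_size` and `chunk_overlap` mean can be shown with a sliding window over tokens. this is a rough plain-python sketch, not how any llamaindex splitter is implemented: `chunk_tokens` is a hypothetical helper, and it treats whitespace-separated words as tokens, whereas real splitters rely on a tokenizer.

```python
def chunk_tokens(text, chunk_size, chunk_overlap):
    """Split text into windows of `chunk_size` words, where consecutive
    windows share `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    tokens = text.split()  # assumption: one whitespace-separated word = one token
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_tokens("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

the overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is less likely to lose its context.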

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

# add further transformations (extractors, embedding model, ...) to this list as needed
pipeline = IngestionPipeline(transformations=[TokenTextSplitter()])

nodes = pipeline.run(documents=documents)

You can also choose to add metadata to your documents and nodes. This can be done either manually or with automatic metadata extractors.
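As a rough illustration of the automatic route, the sketch below derives metadata from the text itself and merges it with manually supplied metadata. It is plain Python with simple string heuristics rather than an LLM; `extract_metadata` and the `title` and `word_count` keys are hypothetical names chosen for this example, not part of LlamaIndex.

```python
def extract_metadata(text):
    """Toy extractor: first non-empty line as the title, plus a word count."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    return {"title": title, "word_count": len(text.split())}


# manual metadata is set up front ...
record = {
    "text": "Quarterly report\nRevenue grew in the third quarter.",
    "metadata": {"category": "finance"},
}

# ... and the extracted metadata is merged into it
record["metadata"].update(extract_metadata(record["text"]))
print(record["metadata"])
```

Both kinds of metadata end up in the same dictionary, so downstream steps can filter or rank on either.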

In [None]:
document = Document(
    text="text",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)

To insert a node into a vector index, it should have an embedding.

An IngestionPipeline applies a list of Transformations to your input data, and the resulting nodes are either returned or inserted into a vector database (if one is given). Each node+transformation pair is cached, so subsequent runs with the same node+transformation combination can reuse the cached result (if the cache is persisted) and save you time.
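The caching idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not LlamaIndex's implementation: `CachedPipeline` is a hypothetical class, transformations are plain string-to-string functions, and the cache key is a hash of the function name plus the input text.

```python
import hashlib


class CachedPipeline:
    """Toy pipeline that caches the result of each (input, transformation) pair."""

    def __init__(self, transformations):
        self.transformations = transformations
        self.cache = {}
        self.cache_hits = 0

    def _key(self, text, transform):
        # assumption: function name + input text identifies the pair uniquely
        payload = f"{transform.__name__}:{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run(self, texts):
        results = []
        for text in texts:
            for transform in self.transformations:
                key = self._key(text, transform)
                if key in self.cache:
                    self.cache_hits += 1  # reuse the stored result
                else:
                    self.cache[key] = transform(text)
                text = self.cache[key]
            results.append(text)
        return results


def strip_whitespace(text):
    return text.strip()


def lowercase(text):
    return text.lower()


pipeline_sketch = CachedPipeline([strip_whitespace, lowercase])
first = pipeline_sketch.run(["  Hello World  ", "  LlamaIndex "])
second = pipeline_sketch.run(["  Hello World  ", "  LlamaIndex "])
print(first, pipeline_sketch.cache_hits)  # ['hello world', 'llamaindex'] 4
```

The second run does no transformation work at all: all four (input, transformation) pairs are served from the cache, which is the saving the paragraph above describes.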

In [None]:
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

# Ingest directly into a vector db
pipeline.run(documents=[Document.example()])

# Create your index
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

we can also create nodes ourselves and pass them directly into an index.

In [None]:
from llama_index.core.schema import TextNode

node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")

index = VectorStoreIndex([node1, node2])

llamahub offers many data connectors that parse various data types and make it easy to add them to vector stores.

usage pattern:

In [None]:
from llama_index.readers.google import GoogleDocsReader

loader = GoogleDocsReader()
documents = loader.load_data(document_ids=[...])

SimpleDirectoryReader is built into llamaindex. it can read most of the file types you are likely to need.

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()