# Loading Data (Ingestion)


This example is adapted from [Loading Data, Ingestion](https://docs.llamaindex.ai/en/stable/understanding/loading/loading/).

Before your chosen LLM can act on your data, you first need to process the data and load it. This has parallels to data cleaning/feature engineering pipelines in the ML world, or ETL pipelines in the traditional data setting.

This ingestion pipeline typically consists of three main stages:

1. Load the data
2. Transform the data
3. Index and store the data

We cover indexing/storage in [future](https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/) [sections](https://docs.llamaindex.ai/en/stable/understanding/storing/storing/). In this guide we'll mostly talk about loaders and transformations.

## Steps
In this example we will:
1. Download some PDFs about Ray, a distributed compute framework
2. Use the `SimpleDirectoryReader` to create a list of Documents
3. Use the `VectorStoreIndex` class to split, chunk and index the documents
4. Run several queries over the documents using an OpenAI LLM

## Loaders
Before your chosen LLM can act on your data you need to load it. LlamaIndex does this via data connectors, also called `Reader`s. Data connectors ingest data from different data sources and format the data into `Document` objects. A `Document` is a collection of data (currently text, and in future, images and audio) and metadata about that data.
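To make the idea of a data connector concrete, here is a minimal, standard-library-only sketch. It is not LlamaIndex's implementation: `SketchDocument` and `read_text_files` are hypothetical names invented for illustration, and the sketch only handles plain `.txt` files.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document: text plus metadata about it."""

    text: str
    metadata: dict = field(default_factory=dict)


def read_text_files(directory: str) -> list[SketchDocument]:
    """Toy 'reader': turn every .txt file in a directory into a SketchDocument."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append(
            SketchDocument(
                text=path.read_text(encoding="utf-8"),
                # Metadata travels with the text so later stages can trace the source.
                metadata={"file_name": path.name, "file_size": path.stat().st_size},
            )
        )
    return docs
```

The point of the pattern is that every source, whatever its format, ends up as the same kind of object, so the transformation and indexing stages do not need to know where the data came from.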

### Loading using SimpleDirectoryReader
The easiest reader to use is our `SimpleDirectoryReader`, which creates documents out of every file in a given directory. It is built into LlamaIndex and can read a variety of formats including Markdown, PDFs, Word documents, PowerPoint decks, images, audio and video.

In [3]:
! mkdir -p data

In [10]:
! cd data && wget https://www.usenix.org/system/files/osdi18-moritz.pdf
! cd data && wget https://assets.ctfassets.net/xjan103pcp94/7gZbuzVlgVWMfynUTQstOc/65cc8708700710dff229cf50bf09c5fb/9781098117160_7-18-22.pdf
! cd data && wget https://www.scs.stanford.edu/24sp-cs244b/notes/ray.pdf

--2024-08-05 17:04:15--  https://www.usenix.org/system/files/osdi18-moritz.pdf
Resolving www.usenix.org (www.usenix.org)... 23.185.0.4, 2620:12a:8001::4, 2620:12a:8000::4
Connecting to www.usenix.org (www.usenix.org)|23.185.0.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5486963 (5.2M) [application/pdf]
Saving to: ‘osdi18-moritz.pdf’


2024-08-05 17:04:16 (7.12 MB/s) - ‘osdi18-moritz.pdf’ saved [5486963/5486963]

--2024-08-05 17:04:16--  https://assets.ctfassets.net/xjan103pcp94/7gZbuzVlgVWMfynUTQstOc/65cc8708700710dff229cf50bf09c5fb/9781098117160_7-18-22.pdf
Resolving assets.ctfassets.net (assets.ctfassets.net)... 3.165.102.46, 3.165.102.21, 3.165.102.9, ...
Connecting to assets.ctfassets.net (assets.ctfassets.net)|3.165.102.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4220592 (4.0M) [application/pdf]
Saving to: ‘9781098117160_7-18-22.pdf’


2024-08-05 17:04:16 (78.6 MB/s) - ‘9781098117160_7-18-22.pdf’ saved [4220592/422059

In [6]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

In [16]:
len(documents)

179
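Three PDFs yielding 179 documents suggests the reader emits more than one `Document` per file; for PDFs this is most likely one per page. The sketch below is one way to check, assuming each document exposes a `metadata` dict with a `"file_name"` key. `count_documents_per_file` is a hypothetical helper, not part of LlamaIndex.

```python
from collections import Counter


def count_documents_per_file(documents) -> Counter:
    """Count how many Document objects came from each source file.

    Assumes each document has a `metadata` dict with a "file_name" key;
    documents without that key are grouped under "<unknown>".
    """
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in documents)
```

In this notebook you could call `count_documents_per_file(documents)` and compare the counts against the page counts of the downloaded PDFs.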

### Prepare credentials
Load credentials from a `.env` file using the [python-dotenv package](https://pypi.org/project/python-dotenv/). The file should contain a line like this:

```toml
OPENAI_API_KEY="<KEY>"
```

In [14]:
import os
from dotenv import load_dotenv

load_dotenv()

assert os.environ["OPENAI_API_KEY"]
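The bare `assert` above raises a `KeyError` when the variable is missing, which can be confusing the first time you see it, and `assert` statements are skipped when Python runs with `-O`. A slightly more explicit check could look like the sketch below; `require_env` is a hypothetical helper name, not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would then either return the key or stop with an actionable error.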

### Index the documents
We will use `VectorStoreIndex` to index the documents.

When you use `from_documents`, your Documents are split into chunks and parsed into [Node objects](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/), lightweight abstractions over text strings that keep track of metadata and relationships. By default, VectorStoreIndex stores everything in memory.

In [15]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
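To build some intuition for what "split into chunks" means, here is a deliberately simplified sketch of fixed-size chunking with overlap. It works on characters, whereas the library's real splitters are sentence- and token-aware, so treat `chunk_text` and its parameters as illustrative assumptions rather than LlamaIndex's algorithm.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.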

### Run a query
A query engine is a generic interface that allows you to ask questions over your data.

A query engine takes in a natural language query, and returns a rich response. It is most often (but not always) built on one or many [indexes](https://docs.llamaindex.ai/en/stable/module_guides/indexing/) via [retrievers](https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/). You can compose multiple query engines to achieve more advanced capability.
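For a vector index, the retriever step commonly amounts to embedding the query, ranking the stored chunks by similarity, and handing the best matches to the LLM as context. The sketch below illustrates only the ranking part with hand-written toy vectors; `cosine_similarity` and `top_k` are hypothetical helpers, and real embeddings would come from an embedding model rather than being typed in.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs: dict, k: int = 2) -> list:
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]
```

Because the LLM only sees the retrieved chunks, the quality of an answer depends heavily on whether the relevant passage makes it into the top results.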

In [17]:
response = query_engine.query("What Python decorator is used to transform an ordinary Python function into a Ray function?")
print(response)

@ray.remote


In [18]:
response = query_engine.query("How does Ray store large objects which are several hundred megabytes in size in its distributed memory?")
print(response)

Ray stores large objects which are several hundred megabytes in size in its distributed memory by using a lightweight chain replication layer on top of Redis.


In [19]:
response = query_engine.query("How can I create a Ray cluster?")
print(response)

To create a Ray cluster, you can start by importing Ray and initializing it using the `ray.init()` function. This will set up a Ray cluster on your local machine, utilizing all available cores as workers. If you wish to run Ray on a cluster other than your local machine, you would need to provide additional arguments to the `init` function. After initializing Ray, you can access the Ray dashboard at the specified URL to monitor the cluster.


In [20]:
response = query_engine.query("Does Ray support a multi-user or multi-tenant environment?")
print(response)

Ray supports a multi-user environment.


In [22]:
response = query_engine.query("How do I create and handle secrets in a Ray cluster? Give me a code example of how do handle secrets in a Ray cluster")
print(response)

To create and handle secrets in a Ray cluster, you can use the `ray.put` and `ray.get` functions to securely store and retrieve sensitive information. Here is an example code snippet demonstrating how to handle secrets in a Ray cluster:

```python
import ray

# Initialize Ray
ray.init()

# Define the secret information
secret_data = "my_secret_password"

# Store the secret data securely in the Ray cluster
secret_id = ray.put(secret_data)

# Retrieve the secret data from the Ray cluster
retrieved_secret = ray.get(secret_id)

# Print the retrieved secret data
print("Retrieved secret data:", retrieved_secret)
```
