# Ingesting with LlamaIndex

### Introduction

In this lesson, we'll continue exploring how LlamaIndex can help us with the data ingestion stage. Let's get started.

### Loading Documents

If you look at the `index.py` file, you'll see that instead of using `pymupdf` to read in our documents, we use LlamaIndex's `SimpleDirectoryReader`.

```python
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./data/10k",
    required_exts=[".pdf"])

documents = reader.load_data()
```

Calling `reader.load_data()` directly returns a list of `Document` objects.

The `SimpleDirectoryReader` takes various arguments for loading files from a directory. Above, we specified the input directory and the extension type (`.pdf`) to select our documents. But we can also pass in specific input files.

```python
lyft_docs = SimpleDirectoryReader(
    input_files=["./data/10k/lyft_2021.pdf"]
).load_data()
```
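To make the selection step more concrete, here is a rough sketch of what combining an input directory with a list of required extensions amounts to. This is plain Python, not LlamaIndex's implementation: `select_files` is a hypothetical helper, the temporary directory stands in for `./data/10k`, and `uber_2021.pdf` and `notes.txt` are made-up file names used only for illustration.

```python
from pathlib import Path
import tempfile


def select_files(input_dir, required_exts):
    """Return files directly inside input_dir whose suffix is in required_exts."""
    wanted = set(required_exts)
    return sorted(
        path
        for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix in wanted
    )


# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["lyft_2021.pdf", "uber_2021.pdf", "notes.txt"]:
        (Path(tmp) / name).write_text("placeholder")

    # Only the PDF files survive the extension filter.
    selected_names = [path.name for path in select_files(tmp, [".pdf"])]

print(selected_names)
```

Passing `input_files` instead simply skips this filtering step, since we name the files ourselves.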

Either way, this returns a list of documents, and we can create an index directly from them.

```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
```

This one function parses our documents into nodes (chunks), embeds those nodes, creates a vector database, and indexes the nodes.

From there, we can create a query engine and ask questions of the data.

```python
# Retrieve the 3 most similar nodes for each query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is this document about?")
```

That's it! So we can see the benefit of using a library like LlamaIndex. With it, we can build a RAG pipeline in just a few lines of code. Yet we also have the option to customize our pipeline along the way.
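To give a feel for what the `similarity_top_k=3` argument controls, here is a minimal sketch of top-k retrieval, assuming nodes are ranked by cosine similarity to the query embedding. It is a toy illustration rather than LlamaIndex's actual retriever: the function names, the node names, and the three-dimensional vectors are all made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, node_embeddings, k=3):
    """Return the ids of the k nodes most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), node_id)
        for node_id, embedding in node_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [node_id for _, node_id in scored[:k]]


# Hypothetical embeddings for four nodes.
node_embeddings = {
    "revenue": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.0],
    "drivers": [0.7, 0.3, 0.1],
    "legal": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]

best = top_k(query_embedding, node_embeddings, k=3)
print(best)
```

Only the nodes returned here would be handed to the language model as context, so a larger `k` gives the model more material at the cost of a longer prompt.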

### More connectors

Above, the `SimpleDirectoryReader` is an example of a connector. A connector reads in data and returns `Document` objects through its `load_data` function. LlamaIndex has many different connectors.

For example, this is how we can create a Google Docs connector.

```python
from llama_index.readers.google import GoogleDocsReader

loader = GoogleDocsReader()
documents = loader.load_data(document_ids=[...])
```

> You can learn more about this [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/GoogleDocsDemo.html).

And this is a connector for reading webpages.

```python
from llama_index.readers.web import SimpleWebPageReader
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
```

> You can see more [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo.html).

Notice again that all of our connectors follow the same pattern. We create the connector, for example `SimpleWebPageReader()`, and then we call `load_data()` to return a list of documents.
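The shared pattern can be sketched in plain Python. The classes below are hypothetical stand-ins, not LlamaIndex classes: `TextDocument` plays the role of a document, and `InMemoryReader` plays the role of a connector whose only job is to turn some source into a list of documents.

```python
from dataclasses import dataclass, field


@dataclass
class TextDocument:
    """Toy stand-in for a document: some text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


class InMemoryReader:
    """Toy connector that 'reads' from a dict of name -> text."""

    def __init__(self, pages):
        self.pages = pages

    def load_data(self):
        # Every connector ends the same way: return a list of documents.
        return [
            TextDocument(text=body, metadata={"source": name})
            for name, body in self.pages.items()
        ]


reader = InMemoryReader({"intro.txt": "Hello", "faq.txt": "Questions"})
toy_documents = reader.load_data()
print(toy_documents[0])
```

Because every connector hands back the same kind of list, the rest of the pipeline does not need to know where the data came from.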

From there, we can parse the documents.

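```python
from llama_index.core.node_parser import SentenceSplitter

# Split each document into nodes (chunks) of up to 1024 tokens
parser = SentenceSplitter(chunk_size=1024)
nodes = parser.get_nodes_from_documents(documents)
```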
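As a rough intuition for what a sentence-aware splitter does, here is a simplified sketch that packs whole sentences into chunks without exceeding a size limit. It is not the `SentenceSplitter` algorithm: this toy version counts words rather than tokens, has no overlap between chunks, and `split_into_chunks` is a hypothetical name.

```python
import re


def split_into_chunks(text, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if this sentence would overflow the current one.
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Revenue grew this year. Costs also grew. The outlook is uncertain for now."
chunks = split_into_chunks(sample, chunk_size=8)
for chunk in chunks:
    print(chunk)
```

Keeping sentences intact means each chunk reads as a coherent piece of text, which tends to help both the embedding and the final answer.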

Create embeddings.

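```python
from llama_index.core import Settings

# Assumption: use the globally configured embedding model
embed_model = Settings.embed_model

for node in nodes:
    # Embed the node's text and attach the vector to the node
    node.embedding = embed_model.get_text_embedding(node.get_content())
```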

And store the data in a vector database.

```python
index = VectorStoreIndex(nodes)
```

Or we can pass the documents directly into `VectorStoreIndex`, which performs this procedure under the hood.

```python
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
```

### Wrapping Up

Curious about what connectors are available? You can check out [llamahub.ai](https://llamahub.ai/) to explore more. Here are a few that ChatGPT suggested when asked:

* NotionPageReader
* GoogleDocsReader
* SlackReader
* SimpleDirectoryReader

### Resources

[HTML Reader Connector](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo.html)

[10k tutorial](https://docs.llamaindex.ai/en/stable/examples/usecases/10k_sub_question.html)

[Custom Connectors](https://www.gettingstarted.ai/llamaindex-data-connectors-create-custom-chatgpt-using-own-documents/)