# Data Ingestion Summary

### Introduction

Over the past several lessons, we have discussed how llama-index can help us with data ingestion. Before moving on, it's useful to recap how that stage works.

### Moving through the workflow

### Loading Data with Connectors

The first step is loading data with connectors. Llama-index provides various connectors, such as:

* GoogleDocsReader
* SlackReader
* SimpleDirectoryReader
* UnstructuredReader

They all work the same way: we initialize the connector and then call its `load_data()` method, which returns a list of document objects. For example, with the `SimpleWebPageReader` connector:

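```python
# Requires the llama-index-readers-web package
from llama_index.readers.web import SimpleWebPageReader

# html_to_text=True converts the page's HTML to plain text (needs the html2text package)
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
```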
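
The connector pattern itself is small enough to sketch without any library. The block below is a hypothetical, standard-library-only reader (`ToyDocument` and `ToyTextFileReader` are illustrative names, not llama-index classes) that mirrors the same shape: construct the reader, then call `load_data()` to get back a list of document objects.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document object: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


class ToyTextFileReader:
    """Hypothetical connector that reads plain-text files into ToyDocument objects."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding

    def load_data(self, paths: list[str]) -> list[ToyDocument]:
        documents = []
        for path in paths:
            text = Path(path).read_text(encoding=self.encoding)
            # Keep the source path as metadata so chunks can be traced back later.
            documents.append(ToyDocument(text=text, metadata={"file_path": str(path)}))
        return documents


# Usage: write a small file, then load it through the reader.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "essay.txt"
    sample_path.write_text("What I worked on.", encoding="utf-8")
    toy_documents = ToyTextFileReader().load_data([str(sample_path)])

print(len(toy_documents), toy_documents[0].text)
```

Real connectors differ mainly in where the text comes from (a web page, a Slack channel, a Google Doc), while the calling code stays the same.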

> You can browse the available connectors on [LlamaHub](https://llamahub.ai/).

### Node Parsers

Once we have our document objects, the next step is to parse these documents into chunks, which llama-index stores as node objects.

Remember that these node objects can store various pieces of information, such as the text, the embedding, the start index within the document, and links to the previous and following chunks (i.e., nodes). They can also store additional metadata, like the page number from the document or a summary of the previous node's text.
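
To make that list of fields concrete, here is a simplified sketch of what a node carries. `ToyNode` and `link_nodes` are hypothetical names for illustration; the field names are assumptions, and the library's own node class is richer than this.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical, simplified picture of a node (not the llama-index class)."""
    text: str
    start_char_idx: int                      # where the chunk starts in the document
    embedding: Optional[list[float]] = None  # filled in later, at indexing time
    metadata: dict = field(default_factory=dict)
    prev_node: Optional["ToyNode"] = None
    next_node: Optional["ToyNode"] = None


def link_nodes(nodes: list[ToyNode]) -> None:
    """Wire each node to its neighbours so surrounding context can be reached."""
    for earlier, later in zip(nodes, nodes[1:]):
        earlier.next_node = later
        later.prev_node = earlier


first = ToyNode(text="First chunk of the document.", start_char_idx=0,
                metadata={"page_number": 1})
second = ToyNode(text="Second chunk of the document.",
                 start_char_idx=len(first.text) + 1,
                 metadata={"page_number": 1})
link_nodes([first, second])

print(second.prev_node.text)
```

Keeping the neighbour links and metadata on each node is what lets a later retrieval step pull in surrounding context or report where a chunk came from.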

So how do we turn a document into chunks?  With a parser.  So far we have used the sentence splitter.

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
```
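
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a list of words into fixed windows that share a few words with their neighbour. This is a deliberate simplification under stated assumptions: `chunk_with_overlap` is a hypothetical helper, it counts words rather than tokens, and it makes no attempt to keep sentences whole the way `SentenceSplitter` does.

```python
def chunk_with_overlap(words: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split words into windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means the last word of one chunk is repeated at the start of the next, so a thought that straddles a boundary is less likely to be cut off from its context.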

But there are various other parsers. For example, there are parsers for properly chunking HTML, code, or markdown. Take a look at some of the other parsers [here](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html).

### Storing and Querying

Once we have the nodes, we can store and query our data. Passing our nodes into an index will automatically create the embedding for each node. Conceptually, the index does something like this for every node:

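```python
# Roughly what the index does for each node (embed_model is the configured embedding model)
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content()
    )
    node.embedding = node_embedding
```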

Because the index handles the embedding step for us, we can just write the following:

```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes)
```

Remember, this will create and store an embedding on each node, store the nodes in a vector store (in memory by default), and declare an index that specifies how our nodes are stored.

From there we can create a query engine to query the data.

```python
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
```
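
The store-then-query flow can be sketched end to end with a toy example. Everything here is an assumption made for illustration: `toy_embed` uses word counts instead of a learned embedding model, `ToyVectorIndex` is a hypothetical in-memory class, and the sample sentences are made up. The sketch also stops at retrieval, whereas a query engine goes on to pass the retrieved chunks to an LLM to write the answer.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Real models return dense float vectors."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index: embeds each chunk once, then ranks by similarity."""

    def __init__(self, chunks: list[str]):
        self.entries = [(chunk, toy_embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        query_embedding = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_embedding, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]


toy_index = ToyVectorIndex([
    "cats sleep most of the day",
    "python is a programming language",
    "the stock market closed higher",
])
top_chunks = toy_index.retrieve("which programming language", top_k=1)
print(top_chunks)
```

The key design point carries over to the real thing: chunks are embedded once when the index is built, and only the query is embedded at question time.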

### Summary

So we saw that the data ingestion stage involves both connectors and parsing. Then we store the data and create a query engine to query it.

However, if we simply want to perform all of these steps as quickly as possible, we can skip the explicit node parsing stage (`from_documents` parses the documents into nodes for us with default settings) and go from loading to storage to querying in just a few lines, like so:

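```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the PDF into document objects
lyft_docs = SimpleDirectoryReader(
    input_files=["./data/10k/lyft_2021.pdf"]
).load_data()

# Parse, embed, and store in one step, then query
index = VectorStoreIndex.from_documents(lyft_docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
```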