# Data Ingestion Lab

In the last lesson we reviewed data ingestion with LlamaIndex. In this lab, we'll move through downloading and parsing the 10-K reports for Uber and Lyft.

### Downloading our data

1. Data Retrieval 

Begin by making a directory called `data/10k`.

```bash
mkdir -p 'data/10k/'
```

From there, download the following PDF documents.  
> Use `wget` if possible. Otherwise, simply download the documents manually.

* `uber_pdf = 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf'`
* `lyft_pdf = 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf'`

For example, we can download the `uber_2021.pdf` file with the following:

* `wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'`

So download the `uber_2021` report, and then use similar code to download the `lyft_2021` report.
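
If `wget` is not available, the same download can be sketched in Python with only the standard library. This is a minimal sketch, not part of the lab's starter code: the helper names `local_path_for` and `download_report` are hypothetical, and the target directory is assumed to be the `data/10k` folder created above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

BASE_URL = "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k"
REPORT_URLS = [f"{BASE_URL}/uber_2021.pdf", f"{BASE_URL}/lyft_2021.pdf"]


def local_path_for(url: str, dest_dir: str = "data/10k") -> Path:
    """Return the local path a URL would be saved to (the file name is kept)."""
    file_name = Path(urlparse(url).path).name
    return Path(dest_dir) / file_name


def download_report(url: str, dest_dir: str = "data/10k") -> Path:
    """Download one report into dest_dir, skipping files that already exist."""
    target = local_path_for(url, dest_dir)
    target.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    if not target.exists():
        urlretrieve(url, str(target))  # performs the actual HTTP request
    return target
```

Calling `download_report(url)` for each entry in `REPORT_URLS` should leave both PDFs under `data/10k/`.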

### Reading the data

* Uber report

For this section, we'll ask you to write various functions in the `index.py` file. You can try out the functions with the `console.py` file.

1. Write a function called `read_doc_text` that, given a `file_path`, returns the entire text content of the PDF using the `pymupdf` library.

2. Then write a function called `build_nodes_from_text` that, given the text from our PDF, returns a list of node objects. Do this by using the `Document` constructor and the `SentenceSplitter`, and then calling the `get_nodes_from_documents` method.

3. Now run the `console.py` file and take a look at the returned nodes.
    * For example, if you look at the `first_node`, what attributes are available on it?
    * `first_node.__dict__.keys()`
    * What are the `start_char_index` and the `end_char_index`?
    * Is there an embedding at this point? (See that this returns `None`/nothing.)
    * What is returned from `node.get_content()`?
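
To build intuition for what `start_char_index` and `end_char_index` mean, here is a toy chunker written in plain Python. It is only an illustration and not how `SentenceSplitter` is implemented (the real splitter measures chunks in tokens and supports overlap); `ToyNode` and `toy_sentence_chunks` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical stand-in for a LlamaIndex node; not the library's class."""
    text: str
    start_char_index: int
    end_char_index: int


def toy_sentence_chunks(text: str, max_chars: int = 60) -> list[ToyNode]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Each match is one sentence: text up to and including its terminator and trailing spaces.
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)]
    nodes, start, end = [], None, None
    for s, e in spans:
        # Close the current chunk if adding this sentence would exceed the budget.
        if start is not None and e - start > max_chars:
            nodes.append(ToyNode(text[start:end], start, end))
            start = None
        if start is None:
            start = s
        end = e
    if start is not None:
        nodes.append(ToyNode(text[start:end], start, end))
    return nodes
```

Each node's indexes point back into the original text, so `text[node.start_char_index:node.end_char_index]` recovers the node's content.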
    
4. Now we'll manually build our embeddings. You can see that in `build_embeddings(nodes)` we begin by setting the `OPENAI_API_KEY` environment variable and constructing our embedding model.

```python
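# Expose the key to any library code that reads it from the environment.
os.environ['OPENAI_API_KEY'] = api_key
# Build the embedding model, also passing the key explicitly.
embed_model = OpenAIEmbedding(api_key=api_key)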
```

Complete the function to manually embed each node and store the embedding on the node.
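
The shape of that loop can be sketched offline with stand-in classes. This is a sketch under assumptions, not the lab's solution: `StubNode` and `StubEmbedModel` are hypothetical, and the sketch assumes the real embedding model exposes a `get_text_embedding(text)` method that returns a list of floats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubNode:
    """Hypothetical stand-in exposing the two node members used below."""
    text: str
    embedding: Optional[list[float]] = None

    def get_content(self) -> str:
        return self.text


class StubEmbedModel:
    """Hypothetical stand-in for the embedding model so the loop runs offline."""

    def get_text_embedding(self, text: str) -> list[float]:
        # Fake 2-dimensional "embedding": character count and word count.
        return [float(len(text)), float(len(text.split()))]


def embed_nodes(nodes, embed_model):
    """Compute one embedding per node and store it on the node itself."""
    for node in nodes:
        node.embedding = embed_model.get_text_embedding(node.get_content())
    return nodes
```

With a real model, each call sends the node's text to the embedding API, so the loop makes one request per node.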

5. Now take another look at the first node and confirm that the embedding is stored on the node.
* `first_node.embedding`


6. Next, load the nodes into the vector store index. This should simply pass the nodes through to the `VectorStoreIndex`.

> In truth, `VectorStoreIndex` does a few things: (1) it creates the embeddings and stores them on each node (we did it manually for fun), (2) it builds the vector store, where these nodes are stored in a simple in-memory database, and (3) it builds the *index* of the nodes, which specifies *how* these nodes are stored. We'll talk more about indexes later.
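
To make the idea of a simple in-memory vector store concrete, here is a toy version in plain Python that ranks stored texts by cosine similarity to a query embedding. It is an illustration only, not LlamaIndex's implementation; `ToyVectorStore` and its methods are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (text, embedding) pairs."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self._rows.append((text, embedding))

    def query(self, query_embedding: list[float], top_k: int = 2) -> list[tuple[float, str]]:
        """Return the top_k (score, text) pairs, most similar first."""
        scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

A query is answered by embedding the question, finding the most similar stored chunks, and handing those chunks to the language model.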

7. Build a query engine from the index.

Do this by writing a function called `build_query_engine` that calls the `index.as_query_engine` function with the `tree_summarize` response mode, and then returns the query engine.
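
Roughly speaking, `tree_summarize` answers over groups of retrieved chunks and then combines those answers again until a single answer remains. The toy below shows only that recursive shape. It is a sketch, not LlamaIndex's implementation: `toy_tree_summarize` and `fan_in` are hypothetical, and the `summarize` callable stands in for an LLM call.

```python
from typing import Callable


def toy_tree_summarize(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarise groups of fan_in chunks, then the summaries, until one remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    level = list(chunks)
    while len(level) > 1:
        # One "LLM call" per group; the results form the next level of the tree.
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def bracket(group: list[str]) -> str:
    """Deterministic stand-in for an LLM: joins its inputs and brackets them."""
    return "(" + " + ".join(group) + ")"
```

For example, `toy_tree_summarize(["a", "b", "c"], bracket)` first combines `a` and `b`, leaves `c` in its own group, and then combines the two results.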

8. In the console, use the `query_engine.query` function to ask the following question, and store the result as `response`.

> "What is the revenue growth of Uber from 2020 to 2021?"

* You should be able to return text like the following: 

> 'The revenue growth of Uber from 2020 to 2021 was 57%.'

* Also, use `response.source_nodes[0]` to find the original text where this came from.

9. Persist and load data

* Then call the `persist_data` function, passing in the index to persist the data to disk. Notice that this creates a new folder called `storage`. Take a look at some of the files in this folder.
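
One way to take that look is a small helper that lists each file in the folder along with its top-level JSON keys. This is a sketch using only the standard library; `describe_storage` is a hypothetical name, and the helper assumes nothing about which files are present beyond the `storage` folder name mentioned above.

```python
import json
from pathlib import Path


def describe_storage(storage_dir: str = "storage") -> dict[str, list[str]]:
    """Map each file in storage_dir to its top-level JSON keys ([] if not a JSON object)."""
    summary = {}
    for path in sorted(Path(storage_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            data = None  # not JSON, so there are no keys to report
        summary[path.name] = sorted(data) if isinstance(data, dict) else []
    return summary
```

Running `describe_storage()` from the project root after persisting should show which files were written and how each one is organized at the top level.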

Finally, in the `console.py` file, we already imported the `load_index_from_storage` function from the LlamaIndex library for you. Use this function to load the index that you saved to your computer, then create a query engine from the index and submit your query.

### Bonus

From there, you can move through the following [LlamaIndex tutorial](https://docs.llamaindex.ai/en/stable/examples/usecases/10k_sub_question.html), and then re-read this resource on [chunking](https://www.pinecone.io/learn/chunking-strategies/).

* Then move through the following on [DataConnectors](https://www.gettingstarted.ai/llamaindex-data-connectors-create-custom-chatgpt-using-own-documents/)
