This tutorial shows how to load documents with a LlamaHub loader and upload them to a vector store through the ChatGPT Retrieval Plugin.

In [1]:
!git clone https://github.com/openai/chatgpt-retrieval-plugin.git

Cloning into 'chatgpt-retrieval-plugin'...
remote: Enumerating objects: 174, done.
remote: Counting objects: 100% (174/174), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 174 (delta 62), reused 154 (delta 49), pack-reused 0
Receiving objects: 100% (174/174), 226.44 KiB | 726.00 KiB/s, done.
Resolving deltas: 100% (62/62), done.


In [None]:
!pip install poetry

#### Run these commands in a separate shell

Run the following commands in a separate terminal; they install the plugin's dependencies and start the local server:

```bash
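# Install dependencies into a Python 3.10 virtualenv, then start the server.
# `poetry install` and `poetry run` already use the project's virtualenv,
# so an interactive `poetry shell` is not needed in a chained command.
cd chatgpt-retrieval-plugin && \
poetry env use python3.10 && \
poetry install && \
poetry run start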
```

## Import/Convert LlamaIndex Documents

In [25]:
from llama_index import download_loader, Document
from typing import Dict, List
import json

In [26]:
SimpleWebPageReader = download_loader("SimpleWebPageReader")

loader = SimpleWebPageReader(html_to_text=True)
url = "http://www.paulgraham.com/worked.html"
documents = loader.load_data(urls=[url])

In [20]:
# Convert LlamaIndex Documents to JSON format

def dump_docs_to_json(documents: List[Document], out_path: str) -> None:
    """Convert LlamaIndex Documents to JSON format and save it."""
    result_json = []
    for doc in documents:
        cur_dict = {
            "text": doc.get_text(),
            "id": doc.get_doc_id(),
            # NOTE: feel free to customize the other fields as you wish
            # fields taken from https://github.com/openai/chatgpt-retrieval-plugin/tree/main/scripts/process_json#usage
            # "source": ...,
            # "source_id": ...,
            # "url": url,
            # "created_at": ...,
            # "author": "Paul Graham",
        }
        result_json.append(cur_dict)
    
    # Use a context manager so the file is flushed and closed.
    with open(out_path, "w") as f:
        json.dump(result_json, f)

In [21]:
dump_docs_to_json(documents, "docs.json")
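
Before uploading, it can help to sanity-check the entries that were written. The sketch below is only an illustration: the helper name `check_docs` is hypothetical, and it assumes that each entry needs a non-empty `text` string and that `id`, when present, should be a string. Check the `process_json` README for the authoritative field list.

```python
import json
from typing import Any, Dict, List


def check_docs(entries: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found; an empty list means the entries look fine."""
    problems = []
    for i, entry in enumerate(entries):
        text = entry.get("text")
        # Assumption: an entry without usable text is not worth uploading.
        if not isinstance(text, str) or not text.strip():
            problems.append(f"entry {i}: missing or empty 'text'")
        # Assumption: ids are strings when they are provided.
        if "id" in entry and not isinstance(entry["id"], str):
            problems.append(f"entry {i}: 'id' should be a string")
    return problems


# In-memory example. For the real file, load it first:
#   with open("docs.json") as f:
#       entries = json.load(f)
sample = [
    {"text": "What I Worked On ...", "id": "doc-1"},
    {"text": "", "id": "doc-2"},
]
print(check_docs(sample))
```

An empty result from `check_docs` means every entry passed these two checks; it says nothing about the optional metadata fields.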

## Upload to Docstore server

We now follow the instructions at https://github.com/openai/chatgpt-retrieval-plugin/tree/main/scripts/process_json to upload the JSON documents to your docstore.

You need to set environment variables indicating 1) the docstore you're using, and 2) how to authenticate into the docstore.

For the full list of variables, follow the instructions at https://github.com/openai/chatgpt-retrieval-plugin#development.

### Pinecone example

Below, we give an example for connecting to Pinecone.

We first define the bearer token.
(Note: If you were confused like me as to how to generate a bearer token, use jwt.io and follow [this tutorial](https://www.ibm.com/docs/da/order-management?topic=SSGTJF/configuration/t_GeneratingJWTToken.htm).)

```bash
export BEARER_TOKEN=<bearer_token>
```

Now define the rest of the environment variables.

```bash
export DATASTORE=<datastore>
export PINECONE_API_KEY=<pinecone_api_key>
export PINECONE_ENVIRONMENT=<pinecone_environment>
export PINECONE_INDEX=<pinecone_index>
```
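
A quick way to catch a forgotten `export` is to check the environment from Python before running the upload. This is a minimal sketch: the helper `missing_vars` is hypothetical, and the list covers only the variables shown above. The plugin's development instructions may require others.

```python
import os
from typing import List, Mapping

# Names taken from the export commands above (Pinecone example).
REQUIRED_VARS = [
    "BEARER_TOKEN",
    "DATASTORE",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]


def missing_vars(env: Mapping[str, str], required: List[str] = REQUIRED_VARS) -> List[str]:
    """Return the required names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Note that the variables must be exported in the same terminal session that runs the server and the upload script; a notebook kernel started earlier will not see them.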

#### Run the `process_json.py` script

The `process_json.py` script takes our documents, chunks them under the hood, and uploads the chunks to the docstore.

**NOTE**: we leave `--screen_for_pii` and `--extract_metadata` disabled. Enabling either flag calls the OpenAI chat completion endpoint, which can fail when the input exceeds the model's context size.

Copy and paste the following command into a terminal, starting from the root directory of `chatgpt-retrieval-plugin`:
```bash
cd scripts/process_json && \
python process_json.py --filepath path/to/docs.json --custom_metadata '{"source": "file"}'
```
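
If you customize the metadata, quoting the JSON by hand in the shell gets error-prone. The sketch below builds the same command from a Python dict, using only the `--filepath` and `--custom_metadata` flags shown above. The helper name `build_command` is hypothetical.

```python
import json
import shlex
from typing import Dict, List


def build_command(filepath: str, custom_metadata: Dict[str, str]) -> List[str]:
    """Build the process_json.py invocation as an argument list."""
    return [
        "python",
        "process_json.py",
        "--filepath",
        filepath,
        "--custom_metadata",
        json.dumps(custom_metadata),  # serialize the dict as a JSON string
    ]


cmd = build_command("path/to/docs.json", {"source": "file"})
# shlex.join adds the shell quoting needed to paste the command into a terminal.
print(shlex.join(cmd))
```

The argument list can also be passed directly to `subprocess.run(cmd, cwd="scripts/process_json", check=True)`, which avoids shell quoting altogether.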

**NOTE**: If the command above fails with `ModuleNotFoundError: No module named 'models'`, run `pip install -e .` from the root directory of `chatgpt-retrieval-plugin` and rerun the command.

Your documents should now be uploaded to the vector database of your choice!
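
To confirm the upload, you can query the local server. The sketch below only builds the request. It rests on assumptions not stated in this tutorial: that the server listens on `http://localhost:8000`, that it exposes a `POST /query` endpoint, and that the body has the shape `{"queries": [{"query": ..., "top_k": ...}]}`. Verify these against the plugin repository before relying on them. The helper name `build_query_request` is hypothetical.

```python
import json
import urllib.request


def build_query_request(
    base_url: str, bearer_token: str, query: str, top_k: int = 3
) -> urllib.request.Request:
    """Build (but do not send) a query request for the retrieval plugin."""
    # Assumed request body shape; check the plugin's API docs.
    body = json.dumps({"queries": [{"query": query, "top_k": top_k}]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_query_request(
    "http://localhost:8000",  # assumed local server address
    "<bearer_token>",  # replace with the token exported earlier
    "What did the author work on?",
)
print(req.get_method(), req.full_url)

# To send it (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

If the upload worked, the response should contain chunks of the Paul Graham essay loaded earlier.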