# Ingesting a Website into ApertureDB

This notebook shows how to take web content and load it into ApertureDB so that it can be used in a RAG chain to answer questions.

First we need to install a few libraries.

In [5]:
%pip install --quiet --upgrade aperturedb langchain langchain-community langchainhub gpt-web-crawler Twisted gpt4all

Note: you may need to restart the kernel to use updated packages.


## Crawl the Website

We're going to use the `gpt-web-crawler` package to crawl a website for us.

First we grab the default configuration file.  This is where you can insert API keys for advanced services.

In [5]:
!wget https://raw.githubusercontent.com/Tim-Saijun/gpt-web-crawler/refs/heads/main/config_template.py -O config.py

--2024-10-26 20:15:18--  https://raw.githubusercontent.com/Tim-Saijun/gpt-web-crawler/refs/heads/main/config_template.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.


HTTP request sent, awaiting response... 200 OK
Length: 272 [text/plain]
Saving to: ‘config.py’


2024-10-26 20:15:18 (8.63 MB/s) - ‘config.py’ saved [272/272]



Now we do the actual crawl.  We've configured this to point to our documentation website, but feel free to change the starting URL.

In [1]:
START_URLS = "https://docs.aperturedata.io/"
MAX_PAGES = 1000
OUTPUT_FILE = "output.json"

import os

from gpt_web_crawler import run_spider, NoobSpider

# Delete the output file if it exists
if os.path.exists(OUTPUT_FILE):
    os.remove(OUTPUT_FILE)

# Crawl from the starting URL, following every link (the rule r'.*' matches everything)
run_spider(
    NoobSpider,
    max_page_count=MAX_PAGES,
    start_urls=START_URLS,
    output_file=OUTPUT_FILE,
    extract_rules=r'.*',
)

2024-10-26 22:49:54 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-10-26 22:49:54 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-5.15.0-122-generic-x86_64-with-glibc2.35
2024-10-26 22:49:54 [MySpider] INFO: start_urls: ['https://docs.aperturedata.io/']
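
The crawl above passes `extract_rules=r'.*'`, a pattern that matches everything. As a rough illustration of how a narrower pattern would behave, the sketch below filters a list of URLs with Python's `re` module. It assumes the rule is applied as a regular-expression search against each discovered URL, which this notebook does not confirm. The helper name `filter_urls` and the sample URLs are made up for the example.

```python
import re

def filter_urls(urls, pattern):
    """Keep only the URLs in which the pattern is found (assumed semantics)."""
    rule = re.compile(pattern)
    return [u for u in urls if rule.search(u)]

# Sample URLs, for illustration only
sample_urls = [
    "https://docs.aperturedata.io/",
    "https://docs.aperturedata.io/Setup/client/configuration",
    "https://example.com/blog",
]

everything = filter_urls(sample_urls, r".*")
docs_only = filter_urls(sample_urls, r"^https://docs\.aperturedata\.io/")
print(len(everything), len(docs_only))
```

If the assumption holds, a pattern like the second one would keep the crawl on the documentation site.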

## Create Documents

Now we load the website crawl and turn it into LangChain documents.

In [2]:
import json

from langchain_core.documents import Document

# Load the crawl results written by the previous step
with open(OUTPUT_FILE) as f:
    data = json.load(f)

# One LangChain document per crawled page, identified by its URL
documents = [
    Document(
        page_content=d['body'],
        id=d['url'],
        metadata={
            'title': d['title'],
            'keywords': d['keywords'],
            'description': d['description'],
            'url': d['url'],
        },
    )
    for d in data
]
print(len(documents))


260
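
The list comprehension above indexes every key directly, so a record without, say, `keywords` would raise a `KeyError`. Whether the crawler can emit such records is an assumption here, not something the run above shows. A defensive sketch follows. It uses plain dictionaries in place of `Document` so that it runs without LangChain, and the helper name `clean_records` and the sample records are hypothetical.

```python
def clean_records(records):
    """Drop records with no body text or a repeated URL; default missing metadata to ''."""
    seen_urls = set()
    cleaned = []
    for record in records:
        url = record.get("url")
        body = (record.get("body") or "").strip()
        if not url or not body or url in seen_urls:
            continue  # skip unusable or duplicate pages
        seen_urls.add(url)
        cleaned.append({
            "url": url,
            "body": body,
            "title": record.get("title") or "",
            "keywords": record.get("keywords") or "",
            "description": record.get("description") or "",
        })
    return cleaned

# Hypothetical crawl records, for illustration only
sample = [
    {"url": "https://docs.aperturedata.io/", "title": "Home", "body": "Welcome"},
    {"url": "https://docs.aperturedata.io/", "title": "Home", "body": "Welcome"},  # duplicate URL
    {"url": "https://docs.aperturedata.io/empty", "title": "Empty", "body": "  "},  # no text
]
cleaned = clean_records(sample)
print(len(cleaned), cleaned[0]["keywords"] == "")
```

The cleaned dictionaries could then be passed through the same `Document(...)` construction shown above.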


## Split Documents into Segments

A whole web page is usually too long, and covers too many topics, to be a useful retrieval unit in a RAG chain. Instead, we break each document into smaller, overlapping segments. LangChain provides a text splitter for this.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=64,
)

segments = text_splitter.split_documents(documents)
print(len(segments))

4470
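
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at separators such as paragraphs, lines, and spaces, so its chunks vary in length. The function name `sliding_chunks` is made up for this sketch.

```python
def sliding_chunks(text, chunk_size=256, chunk_overlap=64):
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(600))  # 600 characters of dummy text
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
# The end of one chunk is repeated at the start of the next
print(chunks[0][-64:] == chunks[1][:64])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one segment, at the cost of storing some text twice.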


## Choose an Embedding

Here we're using the GPT4All package and loading one of its smaller models. Don't worry if you see messages about CUDA libraries being unavailable.

In [5]:
from langchain_community.embeddings import GPT4AllEmbeddings

embeddings = GPT4AllEmbeddings(model_name="all-MiniLM-L6-v2.gguf2.f16.gguf")
embeddings_dim = len(embeddings.embed_query("test"))
print(f"Embeddings dimension: {embeddings_dim}")




2024-10-26 22:53:32 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): gpt4all.io:443
2024-10-26 22:53:32 [urllib3.connectionpool] DEBUG: https://gpt4all.io:443 "GET /models/models3.json HTTP/1.1" 301 167
2024-10-26 22:53:32 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): raw.githubusercontent.com:443
2024-10-26 22:53:32 [urllib3.connectionpool] DEBUG: https://raw.githubusercontent.com:443 "GET /nomic-ai/gpt4all/main/gpt4all-chat/metadata/models3.json HTTP/1.1" 200 4222

Embeddings dimension: 384


Failed to load libllamamodel-mainline-cuda.so: dlopen: libcudart.so.11.0: cannot open shared object file: No such file or directory
Failed to load libllamamodel-mainline-cuda-avxonly.so: dlopen: libcudart.so.11.0: cannot open shared object file: No such file or directory


## Connect to ApertureDB

For the next part, we need access to a specific ApertureDB instance.
There are several ways to set this up.
The code provided here will accept ApertureDB connection information as a JSON string.
See our [Configuration](https://docs.aperturedata.io/Setup/client/configuration) help page for more options.

In [None]:
! adb config create rag --from-json --active 

Here we create a LangChain vectorstore using ApertureDB.
We use the default client configuration that we have already set up.

If you want to create more than one version of the embeddings, then change the `DESCRIPTOR_SET` name.

See [AddDescriptorSet](https://docs.aperturedata.io/query_language/Reference/descriptor_commands/desc_set_commands/AddDescriptorSet) for more information about selecting an engine and metric.

We use the embeddings object we created above, which will be used when we add documents to the vectorstore.

In [8]:
from langchain_community.vectorstores import ApertureDB

DESCRIPTOR_SET = 'my_website'

vectorstore = ApertureDB(
    embeddings=embeddings,
    descriptor_set=DESCRIPTOR_SET,
    dimensions=embeddings_dim,
    engine="HNSW",
    metric="CS",
    log_level="INFO"
)

2024-10-26 23:12:00 [aperturedb.CommonLibrary] INFO: Using active configuration 'rag'
2024-10-26 23:12:00 [aperturedb.CommonLibrary] INFO: Configuration: [rag-p20eoib7.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]
2024-10-26 23:12:00 [aperturedb.CommonLibrary] DEBUG: Created connector using: [rag-p20eoib7.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]. Will connect on query.
2024-10-26 23:12:00 [aperturedb.CommonLibrary] DEBUG: Query=[{'GetStatus': {}}]
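
The vectorstore above is created with `metric="CS"`. Assuming `CS` stands for cosine similarity (the AddDescriptorSet reference linked above is the place to confirm this), the following minimal sketch shows what that measure computes for two vectors. The function and the toy vectors are illustrative only and are not part of the ApertureDB client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; the embeddings above have 384 dimensions
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # shares nothing with the query

print(cosine_similarity(query, same_direction))  # close to 1
print(cosine_similarity(query, orthogonal))      # 0
```

Because the measure depends only on direction, two segments can score as near-identical even when their embedding vectors differ in length.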

## Load the documents into the vectorstore

Finally, we load the segments into the vectorstore.
This will take a little while to run, because every segment has to be embedded and uploaded.

The call below loads all of the segments produced above (4470 in this run).
For a quicker pass through the notebook you can load a slice such as `segments[:1000]` instead; for a larger crawl, start the load and go for lunch.

Once you add the documents, your ApertureDB instance will be hard at work building a high-performance index for them.

In [9]:
ids = vectorstore.add_documents(segments)

2024-10-26 23:14:11 [aperturedb.ParallelQuery] INFO: Connection test successful with [rag-p20eoib7.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]
2024-10-26 23:14:11 [aperturedb.ParallelLoader] INFO: Starting ingestion with batchsize=1000, numthreads=4
2024-10-26 23:14:11 [aperturedb.ParallelQuery] INFO: Commands per query = 1, Blobs per query = 1
2024-10-26 23:14:11 [aperturedb.ParallelQuery] INFO: Worker 0 executing 2 batches
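
The log above reports `batchsize=1000` and `numthreads=4`. As a back-of-the-envelope illustration of what that batch size means for 4470 segments, the sketch below splits a stand-in list into batches. It is not ApertureDB's loader, and how the loader actually assigns batches to workers is not shown here. The helper name `make_batches` is made up.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Stand-in for the 4470 segments: integers instead of Document objects
fake_segments = list(range(4470))
batches = make_batches(fake_segments, batch_size=1000)
print(len(batches), [len(b) for b in batches])
```

With five batches and four workers, at least one worker has to take two batches, which is consistent with the `Worker 0 executing 2 batches` line in the log.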

Let's check out how many documents are in our vectorstore.

In [11]:
import json
print(json.dumps([ d for d in ApertureDB.list_vectorstores() if d['_name'] == DESCRIPTOR_SET ], indent=2))

2024-10-26 23:18:33 [aperturedb.CommonLibrary] INFO: Using active configuration 'rag'
2024-10-26 23:18:33 [aperturedb.CommonLibrary] INFO: Configuration: [rag-p20eoib7.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]
2024-10-26 23:18:33 [aperturedb.CommonLibrary] DEBUG: Created connector using: [rag-p20eoib7.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]. Will connect on query.


[
  {
    "_count": 4470,
    "_dimensions": 384,
    "_engines": [
      "HNSW"
    ],
    "_metrics": [
      "CS"
    ],
    "_name": "my_website",
    "_uniqueid": "2.0.80"
  }
]
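
The `_count` of 4470 matches the number of segments printed earlier. If you would rather check this in code than by eye, a small sketch along the following lines could work. It pastes the JSON shown above in as a string so that it runs without a database connection, and the helper name `find_descriptor_set` is hypothetical.

```python
import json

# The JSON printed above, pasted in as a string so the sketch runs offline
listing_json = """
[
  {
    "_count": 4470,
    "_dimensions": 384,
    "_engines": ["HNSW"],
    "_metrics": ["CS"],
    "_name": "my_website",
    "_uniqueid": "2.0.80"
  }
]
"""

def find_descriptor_set(entries, name):
    """Return the entry whose _name matches, or None if there is none."""
    return next((e for e in entries if e.get("_name") == name), None)

entry = find_descriptor_set(json.loads(listing_json), "my_website")
expected_segments = 4470  # len(segments) from the splitting step
print(entry is not None and entry["_count"] == expected_segments)
```

In the notebook itself, the list returned by `ApertureDB.list_vectorstores()` would take the place of the pasted JSON, and `len(segments)` the place of the hard-coded number.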


## Tidy up

If you want to tidy up and restore your ApertureDB instance to its previous state, you can delete the vectorstore.

We've deliberately left the next cell non-executable so that you can go on to use your database.

ApertureDB.delete_vectorstore(DESCRIPTOR_SET)

## What's next?

Next, you can use this vectorstore to drive a RAG (Retrieval-Augmented Generation) chain.

See [Building a RAG Chain from a Website](https://docs.aperturedata.io/HowToGuides/Applications/website_search).

## Further information

* [LangChain vectorstore integration](https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.aperturedb.ApertureDB.html)
* [ApertureDB documentation website](https://docs.aperturedata.io/)