# Scrape URLs and Generate Document Embeddings

This is the **first notebook** in the Agent Neo workflow. At a high level, this notebook scrapes Neo4j and GDS documentation websites, creates smaller document chunks, converts the text chunks into vectors, and saves the documents and vectors to a GCP bucket. 

Specifically, this notebook completes the following steps:
* Parse Neo4j documentation XML sitemaps to gather individual URLs for scraping
* Combine these URLs with other useful Neo4j and GDS resources that have been manually collected
* Scrape the combined list of documents using LangChain
* Sub-divide the scraped pages into smaller document chunks
* Create embeddings of the document chunks using GCP Text to VertexAI Text Embedding
* Save the text chunks, embeddings, and reference URLs to a GCP bucket

References: 
* [Generative AI / LLM - Document Retrieval and Question Answering](https://www.youtube.com/watch?v=inAY6M6UUkk)
* [Using Vertex AI Matching Engine and Vertex AI Embeddings for Text for StackOverflow Questions](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/matching_engine/sdk_matching_engine_create_stack_overflow_embeddings_vertex.ipynb)

In [1]:
# !pip3 install --upgrade google-cloud-aiplatform google-cloud-storage

In [2]:
# !pip install langchain --upgrade
# !pip install unstructured --upgrade # used by langchain
# !pip install pdf2image --upgrade # used by langchain
# !pip install tensorflow_text --upgrade # used by langchain

In [1]:
import os
import time
import logging
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

import json

In [2]:
import langchain
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import VertexAIEmbeddings

import requests
from bs4 import BeautifulSoup

import tensorflow_text
from google.cloud import aiplatform

2023-07-28 18:36:22.971928: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# GCP Authentication and Details

## Authenticate Account

In [3]:
# run the below in the jupyter terminal to authenticate your google account
# ! gcloud auth login

## Set Project and Region

In [4]:
PROJECT_ID = ''
REGION = ""

In [5]:
! gcloud config set project {PROJECT_ID}

Updated property [core/project].


In [6]:
aiplatform.init(project=PROJECT_ID, location=REGION)

# Get Lists of Websites to Scrape

## URLs from Neo4j Documentation Sitemaps

In [7]:
def parse_sitemap(sitemap) -> list:
    '''Parse a top-level sitemap to get a list of URLs
    inputs:
        sitemap: url comtaining an xml site map
        
    returns:
        parsed list of urls on the sitemap
    '''
    response = requests.get(sitemap)
    soup = BeautifulSoup(response.content, "xml")
    urls = [element.text for element in soup.find_all("loc")]
    return urls


In [8]:
neo4j_sitemaps = [
    'https://neo4j.com/docs/graph-data-science/current/sitemap.xml', # core GDS
    'https://neo4j.com/docs/graph-data-science-client/current/sitemap.xml', # GDS client
    'https://neo4j.com/docs/python-manual/current/sitemap.xml', # Neo4j python client 
    'https://neo4j.com/docs/cypher-manual/current/sitemap.xml', # Cypher manual 
    'https://neo4j.com/docs/apoc/current/sitemap.xml', # APOC manual
    'https://neo4j.com/docs/aura/sitemap.xml', # Aura DB and DS
    'https://neo4j.com/docs/operations-manual/5/sitemap.xml' # Neo4j5
]

In [9]:
# parse sitemaps into a single list 
neo4j_doc_sites = []
neo4j_doc_sites = [x for sitemap in neo4j_sitemaps for x in parse_sitemap(sitemap)]
len(neo4j_doc_sites)

1098

## Unofficial Practitioner's Guide to GDS
Markdown pages from the supplemental Neo4j and GDS guides I have created during my work as a GDS CSA.

In [10]:
practitioners_guide_sites = [
    'https://github.com/danb-neo4j/gds-guide/blob/main/README.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/gds-resources.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/graph-data-modeling.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/graph-eda.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/graphs-llms.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/neo4j-resources.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/usecase-specific.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/algorithms/README.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/algorithms/gds_centrality.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/algorithms/gds_community.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/algorithms/gds_pathfinding.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/algorithms/gds_similarity.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/embeddings/README.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/embeddings/fastrp.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/embeddings/graphSAGE.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/embeddings/hashGNN.md',
    'https://github.com/danb-neo4j/gds-guide/blob/main/embeddings/node2vec.md',
]
len(practitioners_guide_sites)

17

## Other Articles and Resources
The following is a list of Neo4j Developer Blogs and related GDS content that is recommended for additional Neo4j and GDS context.

In [11]:
other_articles = [
    'https://neo4j.com/developer-blog/get-started-with-neo4j-gds-python-client/',
    'https://medium.com/neo4j/graph-data-modeling-categorical-variables-dd8a2845d5e0',
    'https://medium.com/neo4j/graph-modeling-labels-71775ff7d121',
    'https://medium.com/neo4j/graph-data-modeling-all-about-relationships-5060e46820ce',
    'https://medium.com/neo4j/graph-data-modeling-keys-a5a5334a1297'
    'https://medium.com/neo4j/modeling-patient-journeys-with-neo4j-d0785fbbf5a2'
    'https://neo4j.com/docs/graph-data-science/current/installation/System-requirements/',
    'https://neo4j.com/developer/kb/understanding-memory-consumption/',
    'https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-summary/',
    'https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-part-1/',
    'https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-part-2/',
    'https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-part-3/',
    'https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-part-4/',
    'https://medium.com/neo4j/using-neo4j-graph-data-science-in-python-to-improve-machine-learning-models-c55a4e15f530',
    'https://medium.com/neo4j/user-segmentation-based-on-node-roles-in-the-peer-to-peer-payment-network-1a766c60a4ee',
    'https://medium.com/data-science-at-microsoft/using-graphs-to-model-and-analyze-the-customer-journey-4b1f1e9f3696',
    'https://towardsdatascience.com/exploring-practical-recommendation-engines-in-neo4j-ff09fe767782',
    'https://neo4j.com/developer-blog/supply-chain-neo4j-gds-bloom/',
    'https://neo4j.com/developer-blog/gds-supply-chains-metrics-performance-python/',
    'https://neo4j.com/developer-blog/gds-supply-chain-pathfinding-optimization/',
    'https://medium.com/neo4j/knowledge-graphs-llms-real-time-graph-analytics-89b392eaaa95',
    'https://medium.com/neo4j/knowledge-graphs-llms-multi-hop-question-answering-322113f53f51',
    'https://towardsdatascience.com/integrate-llm-workflows-with-knowledge-graph-using-neo4j-and-apoc-27ef7e9900a2',
    'https://medium.com/neo4j/knowledge-graphs-llms-fine-tuning-vs-retrieval-augmented-generation-30e875d63a35',
    'https://medium.com/neo4j/langchain-cypher-search-tips-tricks-f7c9e9abca4d',
    'https://towardsdatascience.com/langchain-has-added-cypher-search-cb9d821120d5',
    'https://medium.com/neo4j/generating-cypher-queries-with-chatgpt-4-on-any-graph-schema-a57d7082a7e7',
    'https://towardsdatascience.com/fine-tuning-an-llm-model-with-h2o-llm-studio-to-generate-cypher-statements-3f34822ad5',
    'https://towardsdatascience.com/implementing-a-sales-support-agent-with-langchain-63c4761193e7',
    'https://towardsdatascience.com/integrating-neo4j-into-the-langchain-ecosystem-df0e988344d2',
    'https://medium.com/neo4j/context-aware-knowledge-graph-chatbot-with-gpt-4-and-neo4j-d3a99e8ae21e',
    'https://towardsdatascience.com/what-happened-with-apoc-in-neo4j-v5-core-and-extended-edition-23994cdf0a2c',
    'https://medium.com/neo4j/creating-a-knowledge-graph-from-video-transcripts-with-gpt-4-52d7c7b9f32c',
    'https://towardsdatascience.com/how-to-use-cypher-aggregations-in-neo4j-graph-data-science-library-5d8c40c2670c',
    'https://medium.com/neo4j/enhancing-word-embedding-with-graph-neural-networks-c26d8e54fe4a',
    'https://medium.com/neo4j/knowledge-graph-based-chatbot-with-gpt-3-and-neo4j-c4ebbd325ed',
    'https://towardsdatascience.com/use-chatgpt-to-query-your-neo4j-database-78680a05ec2',
    'https://towardsdatascience.com/how-cypher-changed-in-neo4j-v5-d0f10cbb60bf',
    'https://towardsdatascience.com/analyze-your-website-with-nlp-and-knowledge-graphs-88e291f6cbf4',
    'https://towardsdatascience.com/investigate-family-connections-between-house-of-dragon-and-game-of-thrones-characters-ff2afd5bdb82',
    'https://medium.com/neo4j/user-segmentation-based-on-node-roles-in-the-peer-to-peer-payment-network-1a766c60a4ee',
    'https://towardsdatascience.com/analyzing-the-evolution-of-life-on-earth-with-neo4j-7daeeb1e032d',
    'https://medium.com/neo4j/how-to-get-started-with-the-neo4j-graph-data-science-python-client-56209d9b0d0d',
    'https://towardsdatascience.com/extract-knowledge-from-text-end-to-end-information-extraction-pipeline-with-spacy-and-neo4j-502b2b1e0754',
    'https://towardsdatascience.com/batching-transactions-in-neo4j-1001d12c9a4a',
    
]

## Combine Site Lists

In [12]:
combined_neo4j_gds_sites = neo4j_doc_sites + practitioners_guide_sites + other_articles
combined_neo4j_gds_sites = list(set(combined_neo4j_gds_sites))

print('total web pages to scrape:', len(combined_neo4j_gds_sites))

total web pages to scrape: 1157


# Scrape Sites into LangChain Documents

In [13]:
# instantiate the loader with our combined URL list
loader = UnstructuredURLLoader(urls=combined_neo4j_gds_sites)

In [14]:
# run the loader and save scraped documents
documents = loader.load()

In [15]:
# number of documents scraped
print('number of documents scraped:', len(documents))

number of documents scraped: 1157


In [16]:
# # uncomment to view text from a scraped document 
# documents[999]

# Chunk Scraped Documents

In [17]:
# instantiate CharacterTextSplittler to chunk scraped documents
text_splitter = CharacterTextSplitter(
    separator = "\n", # split on newline character
    chunk_size = 512, # smaller for context windows
    chunk_overlap  = 51) # set approx 10% overlap between chunked documents 


In [18]:
# apply splitter to scraped documents
# NOTE: some chunks may be larger than specified chunk_size due to location of separator 
# returns a list of langchain 'Documents'
document_chunks = text_splitter.split_documents(documents)

Created a chunk of size 513, which is longer than the specified 512
Created a chunk of size 780, which is longer than the specified 512
Created a chunk of size 513, which is longer than the specified 512
Created a chunk of size 823, which is longer than the specified 512
Created a chunk of size 1209, which is longer than the specified 512
Created a chunk of size 887, which is longer than the specified 512
Created a chunk of size 662, which is longer than the specified 512
Created a chunk of size 589, which is longer than the specified 512
Created a chunk of size 620, which is longer than the specified 512
Created a chunk of size 534, which is longer than the specified 512
Created a chunk of size 643, which is longer than the specified 512
Created a chunk of size 557, which is longer than the specified 512
Created a chunk of size 556, which is longer than the specified 512
Created a chunk of size 601, which is longer than the specified 512
Created a chunk of size 589, which is longer th

In [19]:
print(f"Number documents {len(documents)}")
print(f"Number chunks {len(document_chunks)}")

Number documents 1157
Number chunks 12331


In [23]:
# add 'Context' and 'Source' labels to the document chunks 
document_chunks = [f"Context: {chunk.page_content} Source: {chunk.metadata['source']}" for chunk in document_chunks]

In [26]:
# convert document chunks to a dataframe 
embeddings_df = pd.DataFrame(document_chunks, columns =['text'])
embeddings_df.shape

(12331, 1)

In [27]:
embeddings_df[995:1000]

Unnamed: 0,text
995,Context: Function\napoc.path.expand \nReturns paths expanded from the start node following the given relationship types from min-depth to max-depth.\nProcedure\napoc.path.expandConfig \nReturns paths expanded from the start node the given relationship types from min-depth to max-depth.\nProcedure\napoc.path.slice \nReturns a sub-path of the given length and offset from the given path.\nFunction\napoc.path.spanningTree Source: https://neo4j.com/docs/aura/platform/apoc/
996,Context: Function\napoc.path.spanningTree \nReturns spanning tree paths expanded from the start node following the given relationship types to max-depth.\nProcedure\napoc.path.subgraphAll \nReturns the sub-graph reachable from the start node following the given relationship types to max-depth.\nProcedure\napoc.path.subgraphNodes \nReturns the nodes in the sub-graph reachable from the start node following the given relationship types to max-depth.\nProcedure\napoc.periodic\napoc.periodic.cancel Source: https://neo4j.com/docs/aura/platform/apoc/
997,Context: Procedure\napoc.periodic\napoc.periodic.cancel \nCancels the given background job.\nProcedure\napoc.periodic.commit \nRuns the given statement in separate batched transactions.\nProcedure\napoc.periodic.countdown \nRuns a repeatedly called background statement until it returns 0.\nProcedure\napoc.periodic.iterate \nRuns the second statement for each item returned by the first statement.\nThis procedure returns the number of batches and the total number of processed rows.\nProcedure\napoc.periodic.list Source: https://neo4j.com/docs/aura/platform/apoc/
998,"Context: Procedure\napoc.periodic.list \nReturns a list of all background jobs.\nProcedure\napoc.periodic.repeat \nRuns a repeatedly called background job.\nTo stop this procedure, use apoc.periodic.cancel.\nProcedure\napoc.periodic.submit \nCreates a background job which runs the given Cypher statement once.\nProcedure\napoc.refactor\napoc.refactor.categorize \nCreates new category nodes from nodes in the graph with the specified sourceKey as one of its property keys. Source: https://neo4j.com/docs/aura/platform/apoc/"
999,Context: The new category nodes are then connected to the original nodes with a relationship of the given type.\nProcedure\napoc.refactor.cloneNodes \nClones the given nodes with their labels and properties.\nIt is possible to skip any node properties using skipProperties (note: this only skips properties on nodes and not their relationships).\nProcedure\napoc.refactor.cloneSubgraph Source: https://neo4j.com/docs/aura/platform/apoc/


*To-do: Explore performing additional pre-processing on the chunked documents to remove newlines, special characters, or other noise that may impact their readability or the quality of embeddings. This will need to be investgated more for future iterations.*

# Create GCP Text Embeddings using LangChain

## Define Error Handling Function

In [55]:
def handle_quota_errors(func,  *args, retry_delay=5, backoff_factor=2, **kwargs):
    """
    Executes the given function and retries in case of an exception after a delay.

    Args:
        func (callable): The function to be executed.
        *args: Positional arguments to be passed to `func`.
        retry_delay (int, optional): Initial delay before retrying in case of an error. Default is 5 seconds.
        backoff_factor (int, optional): Multiplier for the delay for subsequent retries. Default is 2.
        **kwargs: Keyword arguments to be passed to `func`.

    Returns:
        Any: Document embeddings from function func

    Raises:
        Exception: Any exception raised by `func` if it occurs after exhausting all retries.
    """
    retries = 0

    try:
      return func(*args, **kwargs)
    except Exception as e:
      print(f"error: {e}")
      retries += 1
      wait = retry_delay * (backoff_factor ** retries)
      time.sleep(wait)
      print("wait for {wait} seconds")

## Generate Embeddings and Save Document Chunks

In [50]:
# change to directory to save chunked documents
os.chdir('/home/jupyter/data/doc_chunks_v2')

In [54]:
# instantiate VertexAI Embeddings function 
embeddings = VertexAIEmbeddings()

In [None]:
# generate embeddings 
index_embeddings = []

for index, doc in embeddings_df.iterrows():
  # generate embeddings with error handling   
  print(f"Get embedding and write document for document {index} of {len(embeddings_df)-1}")
  embedding = handle_quota_errors(embeddings.embed_query, doc['text'])

  if embedding is not None:
    # create source doc id variable 
    doc_id = f"{index}.txt"
    
    # save doc_id and embedding to a dictionary 
    embedding_dict = {
              "id": doc_id,
              "embedding": [str(value) for value in embedding],
    }
    
    index_embeddings.append(json.dumps(embedding_dict) + "\n")

    # write source doc chunk text to local directory 
    doc_id = f"{index}.txt"
    with open(f"{doc_id}", "w") as document:
      document.write(doc['text'])

Get embedding and write document for document 0 of 12316
Get embedding and write document for document 1 of 12316
Get embedding and write document for document 2 of 12316
Get embedding and write document for document 3 of 12316
Get embedding and write document for document 4 of 12316
Get embedding and write document for document 5 of 12316
Get embedding and write document for document 6 of 12316
Get embedding and write document for document 7 of 12316
Get embedding and write document for document 8 of 12316
Get embedding and write document for document 9 of 12316
Get embedding and write document for document 10 of 12316
Get embedding and write document for document 11 of 12316
Get embedding and write document for document 12 of 12316
Get embedding and write document for document 13 of 12316
Get embedding and write document for document 14 of 12316
Get embedding and write document for document 15 of 12316
Get embedding and write document for document 16 of 12316
Get embedding and write 

In [103]:
# write embeddings JSON file to separate directory 
os.chdir('/home/jupyter/data/embeddings_json_v2')

with open("embeddings.json", "w") as f:
    f.writelines(index_embeddings)

os.listdir()

['embeddings.json']

# Create GCP Storage Bucket

In [1]:
# BUCKET_URI = f"gs://"  
# BUCKET_URI

In [2]:
# ! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

# Copy Document Chunks and Embedding JSON  to GCP Bucket

In [3]:
# confirm in data folder
os.chdir('/home/jupyter/data/')
os.listdir()

# Copy Document Chunks

In [64]:
# # upload embedded docks to bucket
# !gsutil -m cp -r 

Copying file://embedded_docs_v2/4954.txt [Content-Type=text/plain]...
Copying file://embedded_docs_v2/45.txt [Content-Type=text/plain]...             
Copying file://embedded_docs_v2/11396.txt [Content-Type=text/plain]...          
Copying file://embedded_docs_v2/9435.txt [Content-Type=text/plain]...           
Copying file://embedded_docs_v2/3466.txt [Content-Type=text/plain]...           
Copying file://embedded_docs_v2/117.txt [Content-Type=text/plain]...            
Copying file://embedded_docs_v2/11734.txt [Content-Type=text/plain]...
Copying file://embedded_docs_v2/1686.txt [Content-Type=text/plain]...           
Copying file://embedded_docs_v2/7619.txt [Content-Type=text/plain]...           
Copying file://embedded_docs_v2/4793.txt [Content-Type=text/plain]...           
Copying file://embedded_docs_v2/4079.txt [Content-Type=text/plain]...
Copying file://embedded_docs_v2/2903.txt [Content-Type=text/plain]...
Copying file://embedded_docs_v2/12285.txt [Content-Type=text/plain]... 

In [67]:
print('upload finished')

upload finished


## Copy Embeddings JSON File 

In [4]:
# !gsutil cp embeddings.json gs://