> # **Setup**

In [1]:
!rm -rf /opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info
!pip install -q llama-index llama-index-core llama-parse neo4j

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
libpysal 4.9.2 requires packaging>=22, but you have packaging 21.3 which is incompatible.
libpysal 4.9.2 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.
momepy 0.7.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.8.1 which is incompatible.
spopt 0.6.0 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.

- **Notes**

    llama-parse is async-first, which means it uses asynchronous programming by default.
    
- **Asynchronous code** 

    Technique that lets a program make progress on several tasks concurrently instead of blocking while one of them waits. It is most useful for I/O-bound work (reading from a file, fetching data from a network, querying a database), where the event loop can switch to another task during the wait. CPU-bound computations do not benefit in the same way, because they keep the event loop busy until they finish.
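
As a minimal sketch of the idea, the snippet below uses only the standard library's asyncio. The `fake_io` coroutine and its delays are hypothetical stand-ins for real I/O such as a network request. Three 0.2-second waits run concurrently, so the total time should be close to 0.2 seconds rather than 0.6. In a notebook, where an event loop is already running, `await run_all()` would take the place of `asyncio.run(run_all())`.

```python
import asyncio
import time


async def fake_io(name: str, delay: float) -> str:
    # asyncio.sleep yields control to the event loop, like a real network wait would.
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_all() -> list[str]:
    # gather schedules the three coroutines concurrently and keeps their order.
    return await asyncio.gather(
        fake_io("a", 0.2),
        fake_io("b", 0.2),
        fake_io("c", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(run_all())
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```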

In [2]:
import nest_asyncio

# Patch asyncio so that nested event loops are allowed inside the notebook.
nest_asyncio.apply()

- **Why use nest_asyncio?**

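    Jupyter already runs its own event loop, so library code that tries to start another one (for example through asyncio.run()) fails with a "cannot be called from a running event loop" error. nest_asyncio patches asyncio to allow nested event loops, which lets such code run inside a notebook.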
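
The failure mode can be reproduced without a notebook. The sketch below, using only the standard library, calls `asyncio.run()` from inside a coroutine that is itself running on an event loop, which is roughly the situation a library hits inside a Jupyter cell. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # We are inside a running event loop here, similar to code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop on the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With nest_asyncio applied, the nested call is intended to succeed instead of raising.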

> # **Load the Model**

In [3]:
!pip install -q llama-index-llms-groq

In [4]:
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
GROQ_API_KEY = user_secrets.get_secret("GROQ_API_KEY")

In [5]:
from llama_index.llms.groq import Groq

llm = Groq(api_key=GROQ_API_KEY, model="llama3-70b-8192")

> # **PDF Preprocessing**

In [6]:
user_secrets = UserSecretsClient()
LLAMA_CLOUD_API_KEY = user_secrets.get_secret("LLAMA_CLOUD_API_KEY")

- **Notes**
    
    For details on obtaining a LlamaParse API key, see [LlamaCloud](https://cloud.llamaindex.ai/).

In [7]:
from llama_parse import LlamaParse

# Download a PDF file from a specified URL and save it locally with a specified file name.
!wget 'https://raw.githubusercontent.com/floresernesto95/Documents/main/arXiv%20-%20Graphs.pdf' -O './arXiv_graph.pdf'

# Initialize the LlamaParse instance with the API key and specify the result type as markdown.
# The load_data method is used to load and parse the content of the specified PDF file.
documents = LlamaParse(api_key=LLAMA_CLOUD_API_KEY, result_type="markdown").load_data('./arXiv_graph.pdf')

--2024-06-21 15:26:24--  https://raw.githubusercontent.com/floresernesto95/Documents/main/arXiv%20-%20Graphs.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6782285 (6.5M) [application/octet-stream]
Saving to: './arXiv_graph.pdf'


2024-06-21 15:26:25 (149 MB/s) - './arXiv_graph.pdf' saved [6782285/6782285]

Started parsing the file under job_id 97f49bc8-98e7-41e5-aad5-397b0f947649
.

- **Notes**

    The result_type argument can be set to text, markdown, or json.

In [8]:
print(f"Documents Type: \n{type(documents)}\n")
print(f"Documents Len: \n{len(documents)}\n")

for document in documents:
    print(f"Document Element ID: \n{document.doc_id}\n")
    print(f"Document Element Content: \n{document.text[:8000]}...\n")

Documents Type: 
<class 'list'>

Documents Len: 
1

Document Element ID: 
9cf4378e-74ff-49af-9d8b-f6a5c1cbe7cb

Document Element Content: 
# From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge1† Ha Trinh1† Newman Cheng2 Joshua Bradley2 Alex Chao3 Apurva Mody3 Steven Truitt2 Jonathan Larson1

1Microsoft Research 2Microsoft Strategic Missions and Technologies 3Microsoft Office of the CTO

{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}@microsoft.com

†These authors contributed equally to this work

# Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) tas

- **Notes**

    - When result_type is set to markdown, the following part of the text is removed:

      "For example, while our default prompt extracting the broad class of 'named entities' like people, places, and organizations is generally applicable, domains with specialized knowledge (e.g., science, medicine, law) will benefit from few-shot examples specialized to those domains... This approach allows us to use larger chunk sizes without a drop in quality (Figure 2) or the forced introduction of noise."

      However, when result_type is set to text, this part of the text is included.
      
    - The images are poorly converted.

In [9]:
from llama_index.core.node_parser import MarkdownElementNodeParser

# Parse the documents into nodes, using the LLM to summarize table elements.
node_parser = MarkdownElementNodeParser(llm=llm, num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)

9it [00:00, 61580.32it/s]
100%|██████████| 9/9 [00:02<00:00,  3.60it/s]


- **Parsing**
    
    Process of analyzing and breaking down a piece of text into smaller, structured components. 

- **RateLimitError documentation**

    - [Groq Console Rate Limits Settings](https://console.groq.com/settings/limits)

    - [LangChain Discussion on RateLimitError](https://github.com/langchain-ai/langchain/discussions/20598)
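
To make the parsing definition above concrete, here is a deliberately simplified sketch that splits a markdown string into text and table elements. It is not how MarkdownElementNodeParser is implemented. The function names and the rule "a line starting with `|` belongs to a table" are assumptions chosen for illustration.

```python
from itertools import groupby


def line_kind(line: str) -> str:
    # Assumption: any line that starts with a pipe is part of a markdown table.
    return "table" if line.lstrip().startswith("|") else "text"


def split_markdown_elements(markdown: str) -> list[tuple[str, str]]:
    """Group consecutive lines of the same kind into (kind, text) elements."""
    elements = []
    for kind, lines in groupby(markdown.splitlines(), key=line_kind):
        text = "\n".join(lines).strip()
        if text:  # skip groups that contain only blank lines
            elements.append((kind, text))
    return elements


sample = (
    "# Title\n\nIntro paragraph.\n\n"
    "| a | b |\n|---|---|\n| 1 | 2 |\n\n"
    "Closing text."
)
elements = split_markdown_elements(sample)
for kind, text in elements:
    print(kind, repr(text))
```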

In [10]:
print(f"Nodes Type: \n{type(nodes)}\n")
print(f"Nodes Elements Type: \n{type(nodes[0])}\n")
print(f"Nodes Len: \n{len(nodes)}\n")

for i in range(7):
    print(f"Node {i} Content:")
    print(nodes[i].get_content())
    print("\n" + "=" * 115 + "\n")

Nodes Type: 
<class 'list'>

Nodes Elements Type: 
<class 'llama_index.core.schema.TextNode'>

Nodes Len: 
38

Node 0 Content:
From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge1† Ha Trinh1† Newman Cheng2 Joshua Bradley2 Alex Chao3 Apurva Mody3 Steven Truitt2 Jonathan Larson1

1Microsoft Research 2Microsoft Strategic Missions and Technologies 3Microsoft Office of the CTO

{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}@microsoft.com

†These authors contributed equally to this work

 Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than 

- **Notes**

    - Node 1: The table about a RAG pipeline is incomplete and cuts off under the section labeled "# Source...".
    
    - Node 2: This node repeats the content of Node 1, removes the section labeled "# Text...", and is also incomplete.
    
    - Node 3: This node contains the section labeled "# Text..." but is still incomplete.
    
    - Through Node 12: The same pattern of issues continues.

- **Conclusion**

    - Repeated content: Multiple nodes, especially those dealing with tables, contain repeated content (e.g., Node 1 and Node 2).

    - Split content: Content related to figures and their captions is split across nodes, disrupting the logical flow (e.g., Node 12 and Node 13).

    - Potentially deleted parts: Some nodes might be missing context or detailed explanations from the original document due to splitting, especially around figures and tables.
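
One way to check for the repeated content noted above is to compare node texts pairwise. The sketch below uses the standard library's difflib. The sample strings, the `find_near_duplicates` helper, and the 0.8 threshold are all hypothetical. In the notebook the input would be something like `[n.get_content() for n in nodes]`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (i, j, ratio) for every pair of texts whose similarity reaches the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs


# Hypothetical node contents: the third repeats the second and adds one row.
sample_texts = [
    "Intro paragraph about graphs.",
    "| step | output |\n| chunking | text units |",
    "| step | output |\n| chunking | text units |\n| extraction |",
]
duplicates = find_near_duplicates(sample_texts)
print(duplicates)
```

Pairwise comparison is quadratic in the number of nodes, which is acceptable for the few dozen nodes produced here.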

In [11]:
# Extract base nodes and objects from a given list of nodes.
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [12]:
print(f"Base Nodes Type: \n{type(base_nodes)}\n")
print(f"Base Node Elements Type: \n{type(base_nodes[0])}\n")
print(f"Base Nodes Len: \n{len(base_nodes)}\n")

for i in range(4):
    print(f"Base Node {i} Content:")
    print(base_nodes[i].get_content())
    print("\n" + "=" * 115 + "\n")

Base Nodes Type: 
<class 'list'>

Base Node Elements Type: 
<class 'llama_index.core.schema.TextNode'>

Base Nodes Len: 
20

Base Node 0 Content:
From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge1† Ha Trinh1† Newman Cheng2 Joshua Bradley2 Alex Chao3 Apurva Mody3 Steven Truitt2 Jonathan Larson1

1Microsoft Research 2Microsoft Strategic Missions and Technologies 3Microsoft Office of the CTO

{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso}@microsoft.com

†These authors contributed equally to this work

 Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS)

In [13]:
import json

# Constants to identify nodes related to tables by their ID suffixes.
TABLE_REF_SUFFIX = '_table_ref'
TABLE_ID_SUFFIX  = '_table'

# Print the data type of the 'objects' list and its first element, as well as its length.
print(f"Objects Type: \n{type(objects)}\n")
print(f"Objects Elements Type: \n{type(objects[0])}\n")
print(f"Objects Len: \n{len(objects)}\n")

# Initialize a counter to control the loop execution.
counter = 0

# Iterate over each node in the 'objects' list.
for node in objects: 
    # Break the loop if one node has been processed.
    if counter >= 1:
        break

    # Print basic information about the current node.
    print(f"Object Info:")
    print(f"id: \n{node.node_id}\n")
    print(f"hash: \n{node.hash}\n")
    print(f"parent: \n{node.parent_node}\n")
    print(f"prev: \n{node.prev_node}\n")
    print(f"next: \n{node.next_node}\n")

    # Check if the node's ID indicates it is a table reference.
    if node.node_id.endswith(TABLE_REF_SUFFIX):

        # Process the next node if it exists.
        if node.next_node is not None:
            next_node = node.next_node
        
            # Print metadata and the ID of the next node.
            print(f"next_node metadata: \n{next_node.metadata}")
            print(f"next_node id: \n{next_node.node_id}")

            # Load the JSON representation of the next node.
            obj_metadata = json.loads(str(next_node.json()))

            # Print the entire JSON content and specific metadata fields.
            print(str(obj_metadata))
            print(f"def: \n{obj_metadata['metadata']['table_df']}")
            print(f"summary: \n{obj_metadata['metadata']['table_summary']}")

    # Continue to print additional details about the node.
    print(f"next: \n{node.next_node}\n")
    print(f"type: \n{node.get_type()}\n")
    print(f"class: \n{node.class_name()}\n")
    print(f"content: \n{node.get_content()[:200]}\n")
    print(f"metadata: \n{node.metadata}\n")
    print(f"extra: \n{node.extra_info}\n")
    
    # Load the JSON representation of the node.
    node_json = json.loads(node.json())

    # Print the start and end character indexes.
    print(f"start_idx: \n{node_json.get('start_char_idx')}\n")
    print(f"end_idx: \n{node_json['end_char_idx']}\n")

    # If the node's metadata contains a table summary, print it.
    table_summary = node_json.get('metadata', {}).get('table_summary')
    if table_summary is not None:
        print(f"summary: \n{table_summary}")

    print("=" * 115 + "\n")
    counter += 1   

Objects Type: 
<class 'list'>

Objects Elements Type: 
<class 'llama_index.core.schema.IndexNode'>

Objects Len: 
9

Object Info:
id: 
ee4f0d9c-4a52-44f9-90c1-cbfd3704edd3

hash: 
8c336e9b1971caef23549b982b475480242c8fadc6a04d6e2612863059d8fd55

parent: 
None

prev: 
node_id='c7002040-12f0-4e7d-bc39-a3bd843b8412' node_type=<ObjectType.TEXT: '1'> metadata={} hash='b3467413c9c837d92283d0631cc2bba6e028592b2fc3b52b819fa525a0c7c6da'

next: 
node_id='242a236d-1f8c-4ede-afcc-ce4cfda4ad76' node_type=<ObjectType.TEXT: '1'> metadata={'table_df': "{'text extraction': {0: 'and chunking'}, 'query-focused': {0: 'summarization'}}", 'table_summary': ',\nwith the following columns:\n- : \n'} hash='2236bd99b1f3a44d4d1ab800b2de6126097ee72ba3bf7b8d84840d3a58b26bb9'

next: 
node_id='242a236d-1f8c-4ede-afcc-ce4cfda4ad76' node_type=<ObjectType.TEXT: '1'> metadata={'table_df': "{'text extraction': {0: 'and chunking'}, 'query-focused': {0: 'summarization'}}", 'table_summary': ',\nwith the following columns:\n-

- **Notes**

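    - base_nodes contains the text nodes that remain after the table elements are separated out (20 of the 38 parsed nodes in this run).
    
    - objects contains the IndexNode entries that reference the tables extracted from the nodes (9 in this run).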

> # **Create Graph Schema**

In [14]:
user_secrets = UserSecretsClient()
NEO4J_PASSWORD = user_secrets.get_secret("NEO4J_PASSWORD")

In [15]:
from neo4j import GraphDatabase

NEO4J_URL = "bolt://0.tcp.sa.ngrok.io:12998"
NEO4J_USER = "neo4j"
# NEO4J_PASSWORD was already loaded from the Kaggle secrets in the previous cell.
NEO4J_DATABASE = "db-test-0"

# Create a Neo4j driver instance.
driver = GraphDatabase.driver(
        NEO4J_URL, 
        database=NEO4J_DATABASE, 
        auth=(NEO4J_USER, NEO4J_PASSWORD)
)

In [16]:
# Function to initialize the Neo4j schema.
def initializeNeo4jSchema():
    # List of Cypher commands to create constraints and indexes.
    cypher_schema = [
        # Create a unique constraint on the 'key' property for nodes labeled 'Section'.
        "CREATE CONSTRAINT sectionKey IF NOT EXISTS FOR (c:Section) REQUIRE (c.key) IS UNIQUE;",
        # Create a unique constraint on the 'key' property for nodes labeled 'Chunk'.
        "CREATE CONSTRAINT chunkKey IF NOT EXISTS FOR (c:Chunk) REQUIRE (c.key) IS UNIQUE;",
        # Create a unique constraint on the 'url_hash' property for nodes labeled 'Document'.
        "CREATE CONSTRAINT documentKey IF NOT EXISTS FOR (c:Document) REQUIRE (c.url_hash) IS UNIQUE;",
        # Create a vector index on the 'value' property of 'Embedding' nodes.
        "CREATE VECTOR INDEX `chunkVectorIndex` IF NOT EXISTS FOR (e:Embedding) ON (e.value) OPTIONS { indexConfig: {`vector.dimensions`: 384, `vector.similarity_function`: 'cosine'}};"
    ]

    # Open a session with the Neo4j database.
    with driver.session() as session:     
        # Iterate over each Cypher command in the list and execute them in the session.
        for cypher in cypher_schema:
            session.run(cypher)            
            
    # Keep the shared driver open: the cells below reuse it to store documents, nodes, and chunks.

In [17]:
initializeNeo4jSchema()

- **Graph schema**

    Blueprint for a graph database that defines the structure and constraints of the data. It dictates how data is organized by specifying types of nodes and relationships, properties of those nodes and relationships, and the rules that govern the integrity of the data.
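
The node labels and relationship types used in this notebook can be summarized as plain Python data. This is only an illustrative sketch: the `GRAPH_SCHEMA` dictionary and the `is_allowed` helper are hypothetical, and Neo4j itself enforces only the uniqueness constraints and the vector index created above, not these relationship rules.

```python
# Labels mapped to the property that carries their uniqueness constraint (None if there is none).
GRAPH_SCHEMA = {
    "nodes": {
        "Document": "url_hash",
        "Section": "key",
        "Chunk": "key",
        "Embedding": None,
    },
    # (source label, relationship type, target label)
    "relationships": [
        ("Section", "HAS_DOCUMENT", "Document"),
        ("Section", "NEXT", "Section"),
        ("Chunk", "UNDER_SECTION", "Section"),
        ("Chunk", "HAS_EMBEDDING", "Embedding"),
    ],
}


def is_allowed(source: str, rel_type: str, target: str) -> bool:
    """Check whether a relationship fits the schema sketched above."""
    return (source, rel_type, target) in GRAPH_SCHEMA["relationships"]


print(is_allowed("Chunk", "UNDER_SECTION", "Section"))
print(is_allowed("Document", "UNDER_SECTION", "Chunk"))
```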

> # **Store Data in Neo4j**

In [18]:
print("Start saving documents to Neo4j...")

# Initialize a counter to keep track of how many documents have been processed.
i = 0

# Open a session to the Neo4j database.
with driver.session() as session:
    for doc in documents:
        # Define a Cypher query that uses the MERGE statement to ensure that a document with a specific 'url_hash' either exists or is created.
        # 'MERGE' looks for a 'Document' node with a 'url_hash' matching 'doc_id'. If it exists, it does nothing; if not, it creates it.
        # 'ON CREATE' sets the 'url' property of the node when a new node is created.
        cypher = "MERGE (d:Document {url_hash: $doc_id}) ON CREATE SET d.url=$url;"
        
        # Execute the Cypher query. No source URL is available here, so the document ID is passed for both 'doc_id' and 'url'.
        session.run(cypher, doc_id=doc.doc_id, url=doc.doc_id)
        
        i = i + 1
print(f"{i} documents saved.\n")

Start saving documents to Neo4j...




1 documents saved.
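
The MERGE ... ON CREATE SET behaviour used above can be mimicked in memory to show why rerunning the cell does not duplicate or overwrite documents. This is a sketch of the semantics only, not of how Neo4j implements them. The `merge_document` function and the dictionary store are hypothetical.

```python
def merge_document(store: dict, url_hash: str, url: str) -> bool:
    """Create the document if its url_hash is new; return True when it was created."""
    if url_hash in store:
        # MERGE matched an existing node, so ON CREATE SET is skipped.
        return False
    # MERGE found nothing, so the node is created and ON CREATE SET applies.
    store[url_hash] = {"url_hash": url_hash, "url": url}
    return True


store = {}
first = merge_document(store, "doc-1", "first-url")
second = merge_document(store, "doc-1", "second-url")
print(first, second, store["doc-1"]["url"])
```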



In [19]:
print("Start saving nodes to Neo4j...")

# Counter to keep track of how many nodes have been saved.
i = 0

# Open a session with the Neo4j database.
with driver.session() as session:
    for node in base_nodes: 
        # MERGE clause ensures that a node of type 'Section' with a specified key exists or creates it if it doesn't.
        cypher  = "MERGE (c:Section {key: $node_id})\n"
        # FOREACH clause sets multiple properties on the Section node only if its 'type' property is currently NULL.
        cypher += " FOREACH (ignoreMe IN CASE WHEN c.type IS NULL THEN [1] ELSE [] END |\n"
        cypher += "     SET c.hash = $hash, c.text=$content, c.type=$type, c.class=$class_name, c.start_idx=$start_idx, c.end_idx=$end_idx)\n"
        # WITH clause passes the Section node to the next part of the query.
        cypher += " WITH c\n"
        # MATCH clause finds a Document node with a specific url_hash.
        cypher += " MATCH (d:Document {url_hash: $doc_id})\n"
        # MERGE clause creates a relationship indicating the Section node is part of the Document node.
        cypher += " MERGE (d)<-[:HAS_DOCUMENT]-(c);"

        # Convert the JSON string of the node to a dictionary to access its properties.
        node_json = json.loads(node.json())

        # Execute the Cypher query with parameters extracted from the node object and its JSON data.
        session.run(
            cypher, node_id=node.node_id, hash=node.hash, content=node.get_content(), type='TEXT', class_name=node.class_name(), 
            start_idx=node_json['start_char_idx'], end_idx=node_json['end_char_idx'], doc_id=node.ref_doc_id
        )
        
        # If the node has a 'next_node', create a NEXT relationship from this node to the next node.
        if node.next_node is not None:
            cypher  = "MATCH (c:Section {key: $node_id})\n" 
            cypher += "MERGE (p:Section {key: $next_id})\n" 
            cypher += "MERGE (p)<-[:NEXT]-(c);" 
            
            session.run(
                cypher, node_id=node.node_id, next_id=node.next_node.node_id
            )

        # If the node has a 'prev_node', create a NEXT relationship from the previous node to this node.
        if node.prev_node is not None:  
            cypher  = "MATCH (c:Section {key: $node_id})\n"    
            cypher += "MERGE (p:Section {key: $prev_id})\n" 
            cypher += "MERGE (p)-[:NEXT]->(c);"

            # Adjust the ID for the previous node if it is a special type (such as a table).
            if node.prev_node.node_id.endswith(TABLE_ID_SUFFIX):
                prev_id = node.prev_node.node_id + '_ref'
            else:
                prev_id = node.prev_node.node_id

            session.run(
                cypher, node_id=node.node_id, prev_id=prev_id
            )
            
        i = i + 1
print(f"{i} nodes saved.")

Start saving nodes to Neo4j...




20 nodes saved.
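
The ID adjustment in the cell above maps a table node's ID to the ID of its table reference, which is the node stored as a Section. A small helper makes that rule explicit. The name `section_key_for` and the sample IDs are hypothetical.

```python
TABLE_ID_SUFFIX = "_table"


def section_key_for(node_id: str) -> str:
    """Return the Section key to link to: table IDs are redirected to their '_table_ref' node."""
    if node_id.endswith(TABLE_ID_SUFFIX):
        return node_id + "_ref"
    return node_id


print(section_key_for("abc"))
print(section_key_for("abc_table"))
print(section_key_for("abc_table_ref"))
```

Using `str.endswith` avoids the manual slice arithmetic and leaves IDs that already end in `_table_ref` unchanged.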


In [20]:
print("Start saving objects to Neo4j...")

# Initialize a counter to track the number of objects processed.
i = 0

# Open a session with the Neo4j database.
with driver.session() as session:
    for node in objects:              
        # Deserialize the JSON representation of the node to a Python dictionary.
        node_json = json.loads(node.json())

        # Check if the current node is a table by looking for a specific suffix in its node ID.
        # If the object is a Table, then the 'ref_table' object is created as a Section, and the 'table' object as a Chunk.
        if node.node_id.endswith(TABLE_REF_SUFFIX):
            # If there is a next node, process it as the actual table object.
            if node.next_node is not None:  
                next_node = node.next_node
                obj_metadata = json.loads(str(next_node.json()))

                # Cypher to merge or create a Section node for the table reference.
                cypher  = "MERGE (s:Section {key: $node_id})\n"
                # Continue the query to merge or create a Chunk node for the actual table content.
                cypher += "WITH s MERGE (c:Chunk {key: $table_id})\n"
                # If the Chunk node's type is NULL, set several properties.
                cypher += " FOREACH (ignoreMe IN CASE WHEN c.type IS NULL THEN [1] ELSE [] END |\n"
                cypher += "     SET c.hash = $hash, c.definition=$content, c.text=$table_summary, c.type=$type, c.start_idx=$start_idx, c.end_idx=$end_idx )\n"
                # Create a relationship indicating that the Chunk is under the Section.
                cypher += " WITH s, c\n"
                cypher += " MERGE (s) <-[:UNDER_SECTION]- (c)\n" 
                # Link the Section to its corresponding Document.
                cypher += " WITH s MATCH (d:Document {url_hash: $doc_id})\n"
                cypher += " MERGE (d)<-[:HAS_DOCUMENT]-(s);" 

                # Execute the Cypher query with parameters filled from node and metadata information.
                session.run(
                    cypher, node_id=node.node_id, hash=next_node.hash, content=obj_metadata['metadata']['table_df'], type='TABLE', 
                    start_idx=node_json['start_char_idx'], end_idx=node_json['end_char_idx'], doc_id=node.ref_doc_id, 
                    table_summary=obj_metadata['metadata']['table_summary'], table_id=next_node.node_id
                )
                
            # If a previous node exists, establish a relationship to maintain the document structure.
            if node.prev_node is not None:
                cypher  = "MATCH (c:Section {key: $node_id})\n"    
                cypher += "MERGE (p:Section {key: $prev_id})\n"  
                cypher += "MERGE (p)-[:NEXT]->(c);"

                # Modify the previous node ID if it ends with a specific suffix.
                if node.prev_node.node_id.endswith(TABLE_ID_SUFFIX):
                    prev_id = node.prev_node.node_id + '_ref'
                else:
                    prev_id = node.prev_node.node_id
                
                # Execute the Cypher query to link the current node to the previous one.
                session.run(cypher, node_id=node.node_id, prev_id=prev_id)
                
        i = i + 1
print(f"{i} objects saved.")

Start saving objects to Neo4j...
9 objects saved.




In [21]:
print("Start creating chunks for each TEXT Section...")

# Open a session with the Neo4j database.
with driver.session() as session:
    # Cypher query starts by matching all nodes labeled 'Section' with a property type of 'TEXT'.
    cypher  = "MATCH (s:Section) WHERE s.type='TEXT' \n"
    # Begins a subquery to process each 'Section' node individually.
    cypher += "WITH s CALL {\n" 
    # Splits the text of the section into paragraphs based on newline characters.
    cypher += "WITH s WITH s, split(s.text, '\n') AS para\n" 
    # Creates an array of indices for each paragraph to use later.
    cypher += "WITH s, para, range(0, size(para)-1) AS iterator\n"
    # Unwinds the iterator to process each paragraph individually; 'i' is the index of each paragraph.
    cypher += "UNWIND iterator AS i WITH s, trim(para[i]) AS chunk, i WHERE size(chunk) > 0\n"
    # Creates a new 'Chunk' node for each paragraph that contains text. The key of each chunk is a combination of the section key and its index.
    cypher += "CREATE (c:Chunk {key: s.key + '_' + i}) SET c.type='TEXT', c.text = chunk, c.seq = i \n"
    # Creates a relationship UNDER_SECTION from each chunk to its parent section.
    cypher += "CREATE (s) <-[:UNDER_SECTION]-(c) } IN TRANSACTIONS OF 500 ROWS ;"
    
    # Execute the assembled Cypher query; the 'with' block closes the session automatically.
    session.run(cypher)
    
print("=================DONE====================")
driver.close()

Start creating chunks for each TEXT Section...
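
For readers less familiar with Cypher, the chunking query above corresponds roughly to the following Python. It is a sketch for illustration, not code the notebook runs. The `split_into_chunks` function and the sample section are hypothetical.

```python
def split_into_chunks(section_key: str, text: str) -> list[dict]:
    """Split a section's text on newlines and keep the non-empty, trimmed lines as chunks."""
    chunks = []
    for i, para in enumerate(text.split("\n")):
        chunk = para.strip()
        if chunk:
            # The key combines the section key with the line index, as in the Cypher query.
            chunks.append({"key": f"{section_key}_{i}", "type": "TEXT", "text": chunk, "seq": i})
    return chunks


chunks = split_into_chunks("sec-1", "First paragraph.\n\n  Second paragraph.  \n")
for chunk in chunks:
    print(chunk)
```

Because empty lines are dropped after the index is assigned, `seq` values can have gaps, and they still preserve the original order.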






In [22]:
!pip install -q sentence-transformers

In [23]:
from sentence_transformers import SentenceTransformer

# Initialize the SentenceTransformer model for generating text embeddings.
model = SentenceTransformer('all-MiniLM-L6-v2')
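
The vector index created earlier compares embeddings with cosine similarity. As a reminder of what that measure computes, here is a dependency-free sketch. The `cosine_similarity` function and the short sample vectors are illustrative, whereas the index above is configured for 384 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Only the direction of the vectors matters for this measure, not their length.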


In [24]:
# Define a function that generates embeddings for a text property of every node with a given label and stores them in Neo4j.
def LoadEmbedding(label, prop):
    # Establish a connection to the Neo4j database.
    driver = GraphDatabase.driver(NEO4J_URL, auth=(NEO4J_USER, NEO4J_PASSWORD), database=NEO4J_DATABASE)

    # Open a session with the database; the 'with' block closes it automatically.
    with driver.session() as session:
        # Run a Cypher query to fetch all nodes with the specified label and retrieve their IDs and text properties.
        # The label and property name are interpolated into the query, so they must come from trusted code, not user input.
        result = session.run(f"MATCH (ch:{label}) RETURN id(ch) AS id, ch.{prop} AS text")
        
        # Initialize a counter to track the number of processed nodes.
        count = 0 
        
        # Iterate over the result set from the query.
        for record in result:
            # Extract the node ID from the record ('node_id' avoids shadowing the built-in 'id').
            node_id = record["id"]
            # Extract the text content from the specified property.
            text = record["text"]
            
            # Generate the embedding for the text using the SentenceTransformer model.
            embedding = model.encode(text).tolist()

            # Prepare a Cypher query to create an 'Embedding' node with the generated embedding,
            # and establish a 'HAS_EMBEDDING' relationship from the original node to this new Embedding node.
            cypher = "CREATE (e:Embedding) SET e.key=$key, e.value=$embedding, e.model=$model"
            cypher = cypher + " WITH e MATCH (n) WHERE id(n) = $id CREATE (n) -[:HAS_EMBEDDING]-> (e)"
                        
            # Execute the Cypher query to update the database.
            session.run(cypher, key=prop, embedding=embedding, id=node_id, model='all-MiniLM-L6-v2') 
            
            count += 1
        
        print(f"\nProcessed {count} {label} nodes for property @{prop}.")

    # Close the driver created by this function.
    driver.close()

- **With**

    Python statement that wraps a block of code in a context manager, which guarantees that resources are cleaned up when the block exits, even if an exception is raised. It is commonly used with files, database sessions, and network connections that must be closed or released after use.
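
A small standard-library sketch shows the cleanup guarantee. The `managed_resource` context manager and the `events` list are hypothetical stand-ins for a real resource such as a database session.

```python
from contextlib import contextmanager

events = []


@contextmanager
def managed_resource(name: str):
    events.append(f"open {name}")
    try:
        yield name
    finally:
        # Runs whether the 'with' block finishes normally or raises.
        events.append(f"close {name}")


try:
    with managed_resource("session") as res:
        events.append(f"use {res}")
        raise ValueError("boom")
except ValueError:
    events.append("error handled")

print(events)
```

The resource is closed before the exception reaches the outer handler, which is why an explicit close call inside a `with` block is redundant.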

In [25]:
LoadEmbedding("Chunk", "text")


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Processed 312 Chunk nodes for property @text.


<div style="text-align:center">
  <img src="https://raw.githubusercontent.com/floresernesto95/Images/main/graph%208.png" alt="Centered and Resized Image" width="50%">
</div>

> # **GraphRAG**

In [26]:
# Generate an embedding for the query "authors" and convert it to a list.
embedding = model.encode("authors").tolist()

# Connect to the Neo4j database; the `with` block closes the driver on exit.
with GraphDatabase.driver(NEO4J_URL, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
    # Cypher query: vector similarity search for the 6 nearest embeddings.
    cypher_query = """
    // 'chunkVectorIndex' is the name of the vector index used for searching
    CALL db.index.vector.queryNodes('chunkVectorIndex', 6, $queryEmbedding)
    YIELD node, score
    MATCH (c:Chunk)-[:HAS_EMBEDDING]->(node)
    RETURN c.text as Text, score as Score
    """

    # Execute the Cypher query with the embedding as a parameter.
    records, summary, keys = driver.execute_query(
        cypher_query,
        queryEmbedding=embedding,
        database_="db-test-0"
    )

for record in records:
    print(record['Text'] + "\n")
    print(record['Score'])
    print("\n" + "="*100 + "\n")


†These authors contributed equally to this work

0.7960867881774902


We would also like to thank the following people who contributed to the work: Alonso Guevara Fernández, Amber Hoak, Andrés Morales Esquivel, Ben Cutler, Billie Rinaldi, Chris Sanchez, Chris Trevino, Christine Caggiano, David Tittsworth, Dayenne de Souza, Douglas Orbaker, Ed Clark, Gabriel Nieves-Ponce, Gaudy Blanco Meneses, Kate Lytvynets, Katy Smith, Mónica Carvajal, Nathan Evans, Richard Ortega, Rodrigo Racanicci, Sarah Smith, and Shane Solomon.

0.7492123246192932


References

0.7294324636459351


References

0.7294324636459351


References

0.7294324636459351


Actors and Directors [...] Public Figures in Controversy [...] Musicians and Executives [...] Athletes and Coaches [...] Influencers and Entrepreneurs [...]

0.7002530694007874
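
The results above contain the same "References" chunk three times with an identical score, so half of the six retrieved slots carry no new information. A minimal sketch of de-duplicating the retrieved rows before they are handed to the LLM, assuming each row exposes `Text` and `Score` keys as in the query above. `dedupe_records` is a hypothetical helper (not part of Neo4j or LlamaIndex), and plain dicts stand in for the driver's records:

```python
def dedupe_records(records):
    """Keep the first (highest-ranked) row for each distinct chunk text."""
    seen = set()
    unique = []
    for record in records:
        # Normalize whitespace so trivially different copies collapse together.
        key = " ".join(record["Text"].split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Stand-in rows mirroring the shape of the output above (scores rounded).
sample = [
    {"Text": "†These authors contributed equally to this work", "Score": 0.796},
    {"Text": "References", "Score": 0.729},
    {"Text": "References", "Score": 0.729},
    {"Text": "References", "Score": 0.729},
]

unique_rows = dedupe_records(sample)
print(len(unique_rows))
```

Requesting more than six neighbours from the index and de-duplicating afterwards would be one way to refill the freed slots with distinct chunks.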




In [27]:
!pip install -q groq

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [28]:
from groq import Groq

# Initialize the Groq client with the API key loaded earlier.
client = Groq(
    api_key=GROQ_API_KEY
)

# Retrieve the chunks most similar to the question and ask the LLM to answer from them.
def query(question):
    # Embed the question with the same model used for the chunks.
    embedding = model.encode(question).tolist()

    cypher_query = """
    CALL db.index.vector.queryNodes('chunkVectorIndex', 6, $queryEmbedding)
    YIELD node, score
    MATCH (c:Chunk)-[:HAS_EMBEDDING]->(node)
    RETURN c.text as Text, score as Score
    """

    # Run the query while the driver is still open; the `with` block closes it on exit.
    with GraphDatabase.driver(NEO4J_URL, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
        records, summary, keys = driver.execute_query(
            cypher_query,
            queryEmbedding=embedding,
            database_="db-test-0"
        )

    # Aggregate the text content from the graph query results.
    graph_retrieved_content = ""

    for record in records:
        graph_retrieved_content += record['Text'] + "\n"

    # Use the Groq API client to generate a chat completion.
    # The system prompt is written in Spanish: it tells the model to act as an LLM expert,
    # to answer in Spanish, and to base its answer only on the retrieved texts.
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": f"""Sos un automata artificial inteligente experto en llms y todo lo que tiene que ver con ellos, \
                que recibe consultas en español y responde siempre en español de manera precisa, \
                pensando paso a paso y cuidando de no cometer errores ortográficos o gramaticales. \
                Tu principal tarea es la búsqueda y resumen de información a partir de una serie de preguntas y respuestas sobre un artículo científico. \
                Basándote únicamente en estos textos: \
                {graph_retrieved_content} \
                Quiero que armes paso a paso una respuesta o análisis sobre {question}, que refleje el contenido del texto. \
                La respuesta debe ser lo más completa posible."""
            },
            {
                "role": "user",
                "content": f"{question}"
            }
        ],
        model="llama3-70b-8192"
    )

    print(chat_completion.choices[0].message.content)
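
Separating prompt construction from retrieval and from the API call makes the function above easier to test without a database or an API key. A minimal sketch, assuming the same system/user message layout. `build_messages` and the shortened English system prompt are hypothetical stand-ins, not the prompt used in this notebook:

```python
def build_messages(context_chunks, question):
    """Build chat messages that ground the answer in the retrieved chunks."""
    context = "\n".join(context_chunks)
    system_prompt = (
        "Answer using only the following texts:\n"
        f"{context}\n"
        "If the texts do not contain the answer, say so."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    ["†These authors contributed equally to this work", "References"],
    "Who contributed to the paper?",
)
print(messages[0]["content"])
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`.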

In [29]:
query("Quienes contribuyeron con el paper?")


De acuerdo al texto proporcionado, los autores que contribuyeron con el paper "Fabula: Intelligence report generation using retrieval-augmented narrative construction" son:

* Alonso Guevara Fernández
* Amber Hoak
* Andrés Morales Esquivel
* Ben Cutler
* Billie Rinaldi
* Chris Sanchez
* Chris Trevino
* Christine Caggiano
* David Tittsworth
* Dayenne de Souza
* Douglas Orbaker
* Ed Clark
* Gabriel Nieves-Ponce
* Gaudy Blanco Meneses
* Kate Lytvynets
* Katy Smith
* Mónica Carvajal
* Nathan Evans
* Richard Ortega
* Rodrigo Racanicci
* Sarah Smith
* Shane Solomon

Es importante destacar que, según el texto, algunos autores contribuyeron de manera igual, lo que se indica con la marca † ( dagger ) al final del texto.

Espero que esta respuesta sea completa y precisa. Si necesitas más información o aclaraciones, no dudes en preguntar.


> # **Conclusion**

- LlamaParse may not be parsing this PDF correctly, or the PDF may need preprocessing first, such as extracting the images before parsing. Building your own chunks instead of relying on the parser's output is also worth trying.

- The graph schema could be made more specific to the document's domain. Experiment with alternative schemas.
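
As a starting point for the first suggestion above, a minimal sketch of building chunks by hand with a fixed window and overlap. The function name `chunk_text` and the sizes are illustrative assumptions. Splitting on words is a simplification, and a sentence- or section-aware splitter may suit a scientific paper better:

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word windows that share `overlap` words with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample_text, chunk_size=4, overlap=1)
print(chunks)
```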

In [None]:
import time

# Keep the notebook session alive for up to 8 hours by printing once a minute.
for i in range(60*8):
    time.sleep(60)
    print("A minute of existence has passed!")