Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: Apache-2.0

# Ask the Graph
# Notebook 2: Ingest Data
In this notebook, we ingest into Neptune the data we prepared in notebooks 0 and 1. This includes structured and unstructured organizational data. The following figure shows the data model.

TODO - data model

We have prepared the data for you. You do not need to prepare it yourself.

Two deployment options are available:

- Labeled Property Graph (LPG) deployment in an Amazon Neptune Analytics graph
- Resource Description Framework (RDF) deployment using a combination of an Amazon Neptune database cluster and an Amazon OpenSearch Service domain.

Refer to the README.md file in the repo for setup instructions.

The following diagram shows these options.

TODO - diagram


## Option 1: Labeled Property Graph Setup Using Neptune Analytics
We assume you have already set up the Neptune Analytics graph and that this notebook instance has connectivity to it.

Refer to the README.md file in the repo for setup instructions.

### Batch-load LPG CSV files into Neptune

TODO -remove the graph-notebook-host

In [6]:
%graph_notebook_host g-11z1ctsbu7.us-east-1.neptune-graph.amazonaws.com

set host to g-11z1ctsbu7.us-east-1.neptune-graph.amazonaws.com


In [7]:
import graph_notebook as gn
config = gn.configuration.get_config.get_config()

region = config.aws_region
s3_bucket = f"s3://aws-neptune-customer-samples-{region}/tmls2024/prepped/graphdata/lpg"
s3_bucket

's3://aws-neptune-customer-samples-us-east-1/tmls2024/prepped/graphdata/lpg'
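
The load cell that follows uses the placeholders `${s3_bucket}` and `${region}`, which are filled from the Python variables defined above before the query is sent. The sketch below imitates that substitution with Python's `string.Template`, which happens to use the same `${name}` syntax. It is only an illustration and makes no claim about how graph-notebook implements the feature. The query text is a lightly tidied copy of the load call, and the region value is taken from the output above.

```python
from string import Template

# Values standing in for the notebook variables defined in the previous cell.
region = "us-east-1"
s3_bucket = f"s3://aws-neptune-customer-samples-{region}/tmls2024/prepped/graphdata/lpg"

# The same placeholders the %%oc cell uses: ${s3_bucket} and ${region}.
query_template = Template("""
CALL neptune.load({
    format: "csv",
    source: "${s3_bucket}",
    region: "${region}",
    failOnError: false,
    concurrency: 1
})
""")

# substitute() raises KeyError if a placeholder has no matching variable,
# which makes a missing or misspelled variable name easy to spot.
load_query = query_template.substitute(s3_bucket=s3_bucket, region=region)
print(load_query)
```

Printing the substituted query before running it is a cheap way to confirm that the S3 path and region are the ones you expect.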

In [8]:
%%oc

CALL neptune.load({
    format: "csv",
    source: "${s3_bucket}",
    region: "${region}",
    failOnError: false,
    concurrency: 1
})


### View stats - nodes and edges in the graph
If the stats show no change, wait a minute and try again.

In [None]:
%summary pg --detailed

In [9]:
%%oc

MATCH (n)
WITH labels(n) AS ln
RETURN ln, count(ln)
ORDER BY ln



In [17]:
!grep ExtractedEntity graphdata/lpg/extractions.csv | wc -l

639


In [18]:
!grep ExtractedEvent graphdata/lpg/extractions.csv | wc -l

477


In [15]:
!wc -l graphdata/lpg/*.csv


     435 graphdata/lpg/chunk2doc.csv
       1 graphdata/lpg/_chunks.csv
     435 graphdata/lpg/chunks.csv
     118 graphdata/lpg/documents.csv
    1282 graphdata/lpg/entity_embeddings.csv
     266 graphdata/lpg/extraction_links.csv
    1461 graphdata/lpg/extraction_rels.csv
    1117 graphdata/lpg/extractions.csv
      22 graphdata/lpg/industries.csv
      11 graphdata/lpg/locations.csv
      56 graphdata/lpg/orgs.csv
      11 graphdata/lpg/persons.csv
      34 graphdata/lpg/products.csv
     141 graphdata/lpg/rels.csv
    1291 graphdata/lpg/resolved_entities.csv
      16 graphdata/lpg/services.csv
      12 graphdata/lpg/_summaries.csv
     118 graphdata/lpg/summaries.csv
      22 graphdata/lpg/tax_concept_rels.csv
      22 graphdata/lpg/taxonomy_concepts.csv
    6871 total
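
The two `grep` counts above add up to 639 + 477 = 1116, one less than the 1117 lines `wc -l` reports for `extractions.csv`. That is consistent with a single header line, assuming every data row occupies one line and carries exactly one of the two labels. The same check can be sketched in Python by counting rows per label. The column name `~label` and the sample rows below are assumptions for illustration; inspect the header of the real file before relying on them.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text, label_column="~label"):
    """Count rows per label value in a CSV string that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


# Tiny made-up sample; the real data lives in graphdata/lpg/extractions.csv.
sample = """~id,~label,label
e1,ExtractedEntity,Zoox
e2,ExtractedEvent,Acquisition
e3,ExtractedEntity,Amazon
"""

counts = count_labels(sample)
print(counts["ExtractedEntity"], counts["ExtractedEvent"])
```

To run it against the real file, read the file contents into a string and pass them to `count_labels`, then compare the totals with the node counts returned by the label query earlier in this section.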


In [None]:
%%oc

MATCH (n)
WHERE labels(n)=[]
RETURN id(n)
ORDER BY id(n)


In [None]:
%%oc

MATCH(n)
WHERE id(n)='http://example.org/orgdemo/Document/05b0a143-1900-4356-a562-dba1ee87c3a2'
RETURN n

In [5]:
import helpers

# The NLQ
#query="Does Amazon have a fulfillment center in Mississippi?"
#query="What activities does AWS have going on in San Francisco?"
#query="Summarize the top trends in AWS?"
#query="What info do you have on Jeff Bezos and Andy Jassy"
query="What info do you have on energy storage"
query="Is there anything in the press about Ammazon facikities in Mississipppi, Florada, or Saskachaon"

# Make an embedding of it
embedding = helpers.make_embedding(query)

# What terms are mentioned
terms = helpers.extract_keywords(query)

resies = helpers.resolve_entities("Pembroke")


embparams={'emb': embedding}
terms
resies



['Pembroke, Wales',
 'Pembroke Castle',
 'Pembroke Dock',
 'Pembrokeshire',
 'Pembroke College, Cambridge',
 'Pembroke College, Oxford',
 'Pembroke, Ontario',
 'Pembroke, Massachusetts',
 'Pembroke, New Hampshire',
 'Pembroke, Maine',
 'Pembroke Pines, Florida',
 'Pembroke Township, Illinois',
 'Pembroke Township, Michigan',
 'Pembroke, Bermuda',
 'Pembroke Parish, Bermuda',
 'http://dbpedia.org/resource/Pembroke,_Wales',
 'http://dbpedia.org/resource/Pembroke_Castle',
 'http://dbpedia.org/resource/Pembroke_Dock',
 'http://dbpedia.org/resource/Pembrokeshire']

In [3]:
%%oc -qp embparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb)
YIELD embedding, node, score
WITH node, score
OPTIONAL MATCH(node)-[:belongsToDocument]->(d:Document)
RETURN id(node), labels(node), score, d.title


Invalid query parameter input, ignoring.


Tab(children=(Output(layout=Layout(overflow='scroll')),), _titles={'0': 'Error'})
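
Conceptually, a top-K embedding search returns the K nodes whose stored embeddings lie nearest to the query vector. The brute-force sketch below shows that idea in plain Python. It assumes Euclidean distance, where a lower score means a closer match; the metric, index structure, and score semantics Neptune Analytics actually uses should be checked in its documentation. The node ids and vectors are toy values.

```python
import heapq
import math


def top_k_by_embedding(query, embeddings, k=10):
    """Return the k (node_id, distance) pairs closest to `query`, nearest first."""
    scored = ((node_id, math.dist(query, vec)) for node_id, vec in embeddings.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy two-dimensional embeddings keyed by hypothetical node ids.
toy = {
    "chunk-1": [0.0, 1.0],
    "chunk-2": [1.0, 0.0],
    "doc-1": [0.9, 0.1],
}

nearest = top_k_by_embedding([1.0, 0.0], toy, k=2)
print(nearest)
```

A real vector index avoids comparing the query against every node, but the shape of the result is the same: a ranked list of nodes with a score each.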

In [None]:
%%oc 

MATCH(d:ExtractedEvent)
CALL neptune.algo.vectors.get(d)
YIELD node, embedding
RETURN node.doc, embedding
LIMIT 20

In [None]:
%%oc 

MATCH(d:Document)
CALL neptune.algo.vectors.topKByNode(d)
YIELD node, score
RETURN d.title, node.title, score
LIMIT 10


### Explore the data

### Find entities by name

In [20]:
import helpers

#search_term="Whole Foods"
#search_term="Amazon Fire"
search_term="Zoox"
embedding = helpers.make_embedding(search_term)

qparams={'emb': embedding, 'term': search_term}



In [21]:
%%oc -qp qparams

MATCH(n) 
WHERE n.label = $term
RETURN n


### Find entities like

The results of this similarity search are pretty bad.


In [30]:
%%oc -qp qparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb, {topK: 10})
YIELD embedding, node, score
WITH node, score 
WHERE not 'Chunk' in labels(node) and not 'Document' in labels(node)

RETURN id(node), labels(node), score



### Find summaries like

In [31]:
import helpers

# The NLQ
#query="Does Amazon have a fulfillment center in Mississippi?"
#query="What activities does AWS have going on in San Francisco?"
#query="Summarize the top trends in AWS?"
#query="What info do you have on Jeff Bezos and Andy Jassy"
query="What info do you have on energy storage"
query="Is there anything in the press about Ammazon facikities in Mississipppi, Florada, or Saskachaon"

# Make an embedding of it
embedding = helpers.make_embedding(query)

# What terms are mentioned
terms = helpers.extract_keywords(query)

qparams={'emb': embedding, 'terms': terms}
terms

['misspelling', 'Amazon', 'Mississippi', 'Florida', 'Saskatchewan']
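
The misspelled place names in the query come back corrected in `terms`. How `helpers.extract_keywords` achieves this is not shown in this notebook. As a purely illustrative alternative, and not the method used here, the standard-library `difflib` module can map a misspelled token to the closest entry in a list of known names. The `known_names` list and the `normalize_term` helper are hypothetical.

```python
import difflib

# Hypothetical list of known entity names; in practice it might be
# collected from the label property of nodes already in the graph.
known_names = ["Amazon", "Mississippi", "Florida", "Saskatchewan"]


def normalize_term(term, candidates, cutoff=0.6):
    """Return the candidate most similar to `term`, or None if none is close enough."""
    matches = difflib.get_close_matches(term, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


raw_terms = ["Ammazon", "Mississipppi", "Florada", "Saskachaon", "facikities"]
normalized = [normalize_term(t, known_names) for t in raw_terms]
print(normalized)
```

Terms that match nothing come back as `None`, so they can be dropped before an exact lookup such as `n.label in $terms`.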

In [33]:
%%oc -qp qparams

MATCH(n) 
WHERE n.label IN $terms
RETURN id(n)


In [38]:
%%oc -qp qparams

WITH $emb as emb
CALL neptune.algo.vectors.topKByEmbedding(emb, {topK: 10})
YIELD embedding, node, score
WITH node, score 
MATCH (node:Chunk)-[:belongsToDocument]->(doc:Document)
OPTIONAL MATCH (doc)-[de:documentHasEvent]->(ev:ExtractedEvent)
RETURN id(node), labels(node), id(doc),doc.title, de.role, id(ev), ev.label, score


### Find document chunks like