Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: Apache-2.0


# Ask the Graph
# Notebook 0: Prep Structured Data

In this notebook we prepare structured organization data for the Neptune graph. We represent it in Labeled Property Graph (LPG) and Resource Description Framework (RDF) forms.

You do NOT need to run this notebook. Its output is already prepared for you and available in a public bucket. If you want to review how we prepared that data, follow along with the logic below.

Our input is sourced from s3://aws-neptune-customer-samples/tmls2024/source/. We download a local copy to the source folder:

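- dbpedia_orgsx.csv - from DBPedia query to obtain organizations
- dbpedia_labelsx.csv - from DBPedia query to obtain labels for orgs and related entities
- industry_taxonomy.json - A taxonomy for industry types, in JSON form
- industry_taxonomy.ttl - The industry taxonomy in SKOS (Turtle) form. We copy it unchanged to graphdata/rdf.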

We produce the following output locally:
- graphdata/lpg - CSV files in Gremlin format to bulk-load to Neptune as LPG data
- graphdata/rdf - Turtle files to bulk-load to Neptune as RDF data

Specific files that we produce are the following. LPG files in graphdata/lpg are:

- orgs.csv - Organizations
- persons.csv - Persons belonging to orgs
- industries.csv - Industries of organizations.
- products.csv - Products from organizations.
- services.csv - Services from organizations.
- locations.csv - Locations of organizations.
- rels.csv - Relationships of the above.
- taxonomy_concepts.csv - A taxonomy of industry types, including alternate labels and broader concepts.
- tax_concept_rels.csv - Links industries from industries.csv to their taxonomy concepts.

RDF files in graphdata/rdf are:

- orgdata.ttl - Orgs, persons, industries, products, services, locations, rels
- industry_taxonomy.ttl - A SKOS taxonomy of industry types, including alternate labels and broader concepts. Not produced here, but copied from source. We'll ingest this as is into Neptune later.
- tax_concept_rels.ttl - Links industries from orgdata.ttl to their taxonomy concepts.

These files are also maintained in s3://aws-neptune-customer-samples/tmls2024/graphdata/

TODO - data model, including what we've built so far

In the next notebook, we continue the prep by creating embeddings, summaries, and extracted entities from a set of press releases that refer to the organizations from this notebook.

## Get source data and create output folders

In [None]:
!aws s3 sync s3://aws-neptune-customer-samples-us-east-1/tmls2024/source/ source

In [None]:
%%bash
rm -rf graphdata summaries chunks documents

In [None]:
%%bash
mkdir -p graphdata graphdata/rdf graphdata/lpg summaries chunks documents
cp source/industry_taxonomy.ttl graphdata/rdf

## Create base org, person, product, and industry data

We'll use DBPedia as a source.


<details><summary>Click to view/hide how to get this data from DBPedia</summary>
<p>

To get a few of the orgs, run the following query on https://dbpedia.org/sparql against the default named graph http://dbpedia.org. Set the result format to CSV.

```
select ?company ?otype ?olocation ?oparent ?pstatus ?pname ?pparent ?ptype ?purl ?fhomepage ?fname
(GROUP_CONCAT(distinct ?oproduct;SEPARATOR="|") AS ?oproducts)
(GROUP_CONCAT(distinct ?oservice;SEPARATOR="|") AS ?oservices)
(GROUP_CONCAT(distinct ?osub;SEPARATOR="|") AS ?osubs)
(GROUP_CONCAT(distinct ?oindustry;SEPARATOR="|") AS ?oindustries)
(GROUP_CONCAT(distinct ?okeyPerson;SEPARATOR="|") AS ?okeyPersons)
(GROUP_CONCAT(distinct ?pindustry;SEPARATOR="|") AS ?pindustries)
where 
{
 values ?company { 
<http://dbpedia.org/resource/Amazon_(company)>
<http://dbpedia.org/resource/Amazon_Web_Services>
<http://dbpedia.org/resource/AbeBooks>

<http://dbpedia.org/resource/Epic_Games>
<http://dbpedia.org/resource/IMDb>
<http://dbpedia.org/resource/Lockheed_Martin>

<http://dbpedia.org/resource/Ring_(company)>
<http://dbpedia.org/resource/Rite_Aid>
<http://dbpedia.org/resource/Rivian>

<http://dbpedia.org/resource/Salesforce>
<http://dbpedia.org/resource/Shutterfly>
<http://dbpedia.org/resource/Spire_Global>
<http://dbpedia.org/resource/Saint_Mary's_Hospital,_Manchester>
<http://dbpedia.org/resource/Standard_Bank>

<http://dbpedia.org/resource/The_Globe_and_Mail>
<http://dbpedia.org/resource/Tsawwassen_First_Nation>
<http://dbpedia.org/resource/Verizon_Communications>
<http://dbpedia.org/resource/Visy>
<http://dbpedia.org/resource/Whole_Foods_Market>
 } .
OPTIONAL { ?company dbo:type ?otype . } .
OPTIONAL { ?company dbo:location ?olocation . } .
OPTIONAL { ?company dbo:parentCompany ?oparent . } .
OPTIONAL { ?company dbo:product	 ?oproduct . } .
OPTIONAL { ?company dbo:service	 ?oservice . } .
OPTIONAL { ?company dbo:subsidiary	 ?osub . } .
OPTIONAL { ?company dbo:industry ?oindustry . } .
OPTIONAL { ?company dbo:keyPerson ?okeyPerson . } .

OPTIONAL { ?company dbp:currentStatus ?pstatus . } .
OPTIONAL { ?company dbp:industry ?pindustry . } .
OPTIONAL { ?company dbp:name ?pname . } .
OPTIONAL { ?company dbp:parent ?pparent . } .
OPTIONAL { ?company dbp:type ?ptype . } .
OPTIONAL { ?company dbp:url ?purl . } .

OPTIONAL { ?company foaf:homepage ?fhomepage . } .
OPTIONAL { ?company foaf:name ?fname . } .

} 
GROUP BY ?company ?otype ?olocation ?oparent ?pstatus ?pname ?pparent ?ptype ?purl ?fhomepage ?fname
ORDER BY ?company
```

To get labels for the URIs that appear in those results, run the following query, listing the URIs to look up in the `values` clause:

```
select distinct *
where 
{
 values ?uri { 
<http://dbpedia.org/resource/The_Washington_Post>
 } .
OPTIONAL { ?uri dbp:name|foaf:name|rdfs:label ?label . FILTER(lang(?label) = 'en') } .
OPTIONAL { ?uri dbo:type ?dbotype . 
   OPTIONAL { ?dbotype dbp:name|foaf:name|rdfs:label ?dtlabel . FILTER(lang(?dtlabel) = 'en') } .
 } .
OPTIONAL { ?uri rdfs:seeAlso ?seeAlso . 
  OPTIONAL { ?seeAlso dbo:type ?stype  .
    OPTIONAL { ?stype dbp:name|foaf:name|rdfs:label ?stlabel . FILTER(lang(?stlabel) = 'en') } .
  } .
  OPTIONAL { ?seeAlso dbp:name|foaf:name|rdfs:label ?slabel . FILTER(lang(?slabel) = 'en') } .  
} .

}
```

URIs to check include the following. The DBPedia SPARQL endpoint returns a request size error if you submit all the URIs at once, so split them into chunks and run the query once per chunk.
```
<http://dbpedia.org/resource/AbeBooks>
<http://dbpedia.org/resource/Amazon_(company)>
<http://dbpedia.org/resource/BookFinder.com>
<http://dbpedia.org/resource/IberLibro>
<http://dbpedia.org/resource/LibraryThing>
<http://dbpedia.org/resource/A9.com>
<http://dbpedia.org/resource/Alexa_Internet>
<http://dbpedia.org/resource/Amazon.com>
<http://dbpedia.org/resource/Amazon_Air>
<http://dbpedia.org/resource/Amazon_Books>
<http://dbpedia.org/resource/Amazon_Fresh>
<http://dbpedia.org/resource/Amazon_Games>
<http://dbpedia.org/resource/Amazon_Lab126>
<http://dbpedia.org/resource/Amazon_Logistics>
<http://dbpedia.org/resource/Amazon_Pharmacy>
<http://dbpedia.org/resource/Amazon_Publishing>
<http://dbpedia.org/resource/Amazon_Robotics>
<http://dbpedia.org/resource/Amazon_Studios>
<http://dbpedia.org/resource/Amazon_Web_Services>
<http://dbpedia.org/resource/Audible_(service)>
<http://dbpedia.org/resource/Blink_Home>
<http://dbpedia.org/resource/Body_Labs>
<http://dbpedia.org/resource/Book_Depository>
<http://dbpedia.org/resource/ComiXology>
<http://dbpedia.org/resource/Digital_Photography_Review>
<http://dbpedia.org/resource/Goodreads>
<http://dbpedia.org/resource/Graphiq>
<http://dbpedia.org/resource/IMDb>
<http://dbpedia.org/resource/MGM_Holdings>
<http://dbpedia.org/resource/PillPack>
<http://dbpedia.org/resource/Ring_Inc.>
<http://dbpedia.org/resource/Souq.com>
<http://dbpedia.org/resource/Twitch_Interactive>
<http://dbpedia.org/resource/Whole_Foods_Market>
<http://dbpedia.org/resource/Woot>
<http://dbpedia.org/resource/Zappos>
<http://dbpedia.org/resource/Zoox_Inc>
<http://dbpedia.org/resource/Epic_Games>
<http://dbpedia.org/resource/Lockheed_Martin>
<http://dbpedia.org/resource/Lockheed_Martin_Canada>
<http://dbpedia.org/resource/Lockheed_Martin_UK>
<http://dbpedia.org/resource/Sikorsky_Aircraft>
<http://dbpedia.org/resource/Ring_(company)>
<http://dbpedia.org/resource/Rite_Aid>
<http://dbpedia.org/resource/Bartell_Drugs>
<http://dbpedia.org/resource/Rivian>
<http://dbpedia.org/resource/Saint_Mary's_Hospital,_Manchester>
<http://dbpedia.org/resource/Salesforce>
<http://dbpedia.org/resource/Shutterfly>
<http://dbpedia.org/resource/Spire_Global>
<http://dbpedia.org/resource/Standard_Bank>
<http://dbpedia.org/resource/The_Globe_and_Mail>
<http://dbpedia.org/resource/Tsawwassen_First_Nation>
<http://dbpedia.org/resource/Verizon_Communications>
<http://dbpedia.org/resource/Visy>
<http://dbpedia.org/resource/Book>
<http://dbpedia.org/resource/Collectable>
<http://dbpedia.org/resource/Ephemera>
<http://dbpedia.org/resource/Fine_art>
<http://dbpedia.org/resource/Out_of_print_books>
<http://dbpedia.org/resource/Rare_book>
<http://dbpedia.org/resource/Textbooks>
<http://dbpedia.org/resource/Used_book>
<http://dbpedia.org/resource/Amazon_Echo>
<http://dbpedia.org/resource/Amazon_Fire_TV>
<http://dbpedia.org/resource/Amazon_Fire_tablet>
<http://dbpedia.org/resource/Amazon_Kindle>
<http://dbpedia.org/resource/Fire_OS>
<http://dbpedia.org/resource/Bink_Video>
<http://dbpedia.org/resource/Epic_Games_Store>
<http://dbpedia.org/resource/Fortnite>
<http://dbpedia.org/resource/Gears_of_War>
<http://dbpedia.org/resource/Unreal_(video_game_series)>
<http://dbpedia.org/resource/Unreal_Engine>
<http://dbpedia.org/resource/Atlas_V>
<http://dbpedia.org/resource/Lockheed_Martin_C-130J_Super_Hercules>
<http://dbpedia.org/resource/Lockheed_Martin_F-35_Lightning_II>
<http://dbpedia.org/resource/Pharmacy>
<http://dbpedia.org/resource/Car_battery>
<http://dbpedia.org/resource/Electric_car>
<http://dbpedia.org/resource/Bureau_de_change>
<http://dbpedia.org/resource/Commercial_Banking>
<http://dbpedia.org/resource/Insurance>
<http://dbpedia.org/resource/Investment_Banking>
<http://dbpedia.org/resource/Investment_Management>
<http://dbpedia.org/resource/Private_Banking>
<http://dbpedia.org/resource/Retail_Banking>
<http://dbpedia.org/resource/Wealth_Management>
<http://dbpedia.org/resource/Online_shopping>
<http://dbpedia.org/resource/Amazon.ca>
<http://dbpedia.org/resource/Amazon_Alexa>
<http://dbpedia.org/resource/Amazon_Appstore>
<http://dbpedia.org/resource/Amazon_Prime>
<http://dbpedia.org/resource/Amazon_Prime_Video>
<http://dbpedia.org/resource/Amazon_Web_Services>
<http://dbpedia.org/resource/Ring_(company)>
<http://dbpedia.org/resource/Twitch_(service)>
<http://dbpedia.org/resource/Amazon_Luna>
<http://dbpedia.org/resource/Amazon_Music>
<http://dbpedia.org/resource/Amazon_Pay>
<http://dbpedia.org/resource/Electric_vehicle_charging_network>
<http://dbpedia.org/resource/Vehicle_insurance>
<http://dbpedia.org/resource/Cloud_computing>
<http://dbpedia.org/resource/Internet>
<http://dbpedia.org/resource/E-commerce>
<http://dbpedia.org/resource/Artificial_intelligence>
<http://dbpedia.org/resource/Cloud_Computing>
<http://dbpedia.org/resource/Consumer_electronics>
<http://dbpedia.org/resource/Digital_distribution>
<http://dbpedia.org/resource/Entertainment>
<http://dbpedia.org/resource/Self-driving_cars>
<http://dbpedia.org/resource/Supermarket>
<http://dbpedia.org/resource/Grocery_store>
<http://dbpedia.org/resource/Health_food_store>
<http://dbpedia.org/resource/Video_game_industry>
<http://dbpedia.org/resource/Retail>
<http://dbpedia.org/resource/Automotive_industry>
<http://dbpedia.org/resource/Energy_storage>
<http://dbpedia.org/resource/Cloud_computing>
<http://dbpedia.org/resource/Consulting>
<http://dbpedia.org/resource/Enterprise_software>
<http://dbpedia.org/resource/Aerospace>
<http://dbpedia.org/resource/Banking>
<http://dbpedia.org/resource/Packaging>
<http://dbpedia.org/resource/Victoria,_British_Columbia>
<http://dbpedia.org/resource/Bethesda,_Maryland>
<http://dbpedia.org/resource/Santa_Monica,_California>
<http://dbpedia.org/resource/Philadelphia,_Pennsylvania>
<http://dbpedia.org/resource/United_States>
<http://dbpedia.org/resource/Salesforce_Tower>
<http://dbpedia.org/resource/San_Francisco,_California>
<http://dbpedia.org/resource/Johannesburg>
<http://dbpedia.org/resource/Standard_Bank_Centre>
<http://dbpedia.org/resource/South_Africa>
<http://dbpedia.org/resource/Chief_executive_officer>
<http://dbpedia.org/resource/Andy_Jassy>
<http://dbpedia.org/resource/Jeff_Bezos>
<http://dbpedia.org/resource/President_(corporate_title)>
<http://dbpedia.org/resource/Chairman>
<http://dbpedia.org/resource/Chief_Creative_Officer>
<http://dbpedia.org/resource/Chief_technical_officer>
<http://dbpedia.org/resource/Mark_Rein_(software_executive)>
<http://dbpedia.org/resource/James_D._Taiclet>
<http://dbpedia.org/resource/CEO>
```
    
</p>
</details>
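
The chunking mentioned in the collapsed section above can be scripted. The following is a minimal sketch, not the procedure we actually used: `chunked`, `values_clause`, and the chunk size are hypothetical names and values chosen for illustration. It renders each chunk as a `values` clause that you can paste into the labeling query.

```python
from typing import Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def values_clause(uris: List[str], var: str = "uri") -> str:
    """Render URIs as a SPARQL values clause, one <IRI> per line."""
    body = "\n".join(f"<{u}>" for u in uris)
    return f"values ?{var} {{\n{body}\n}} ."

# A few URIs from the list above; in practice you would pass the full list.
uris = [
    "http://dbpedia.org/resource/AbeBooks",
    "http://dbpedia.org/resource/Amazon_(company)",
    "http://dbpedia.org/resource/BookFinder.com",
    "http://dbpedia.org/resource/IberLibro",
    "http://dbpedia.org/resource/LibraryThing",
]

# Assumption: the chunk size is arbitrary here. Raise it until the
# endpoint starts rejecting the request, then back off.
chunk_size = 2
clauses = [values_clause(chunk) for chunk in chunked(uris, chunk_size)]
for clause in clauses:
    print(clause)
    print()
```

Each printed clause replaces the `values ?uri { ... } .` block of the labeling query. The CSV results of the individual runs can then be concatenated into one labels file.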

In [None]:
import pandas as pd
import csv, json
import helpers

# data structures for entities and relationships and common labeling
orgs = {}
persons={}
locations={}
products={}
services={}
industries={}
labels={}
encountered_types={}
encountered_see_alsos={}
main_entities=[orgs, persons, locations, products, services, industries]
links=[]

# convenience function to print URIs to get labels of
# not used in main code, but to gather the labels CSV
def get_entities_for_labeling():
    def liri(s):
        return f"<{s}>"
        
    for o in orgs:
        print(liri(o))
    for o in products:
        print(liri(o))
    for o in services:
        print(liri(o))
    for o in industries:
        print(liri(o))
    for o in locations:
        print(liri(o))
    for o in persons:
        print(liri(o))

# add record with distinct key to entity collection
def add_record(coll, key):
    rec = {}
    if not (key in coll):
        coll[key] = rec
    else:
        rec= coll[key]
    return rec

# If val is defined, add to record at key (the first value seen wins)
def add_single(dicto, key, val):
    if str(val) == "nan" or str(val)=="":
        return
    # check the record passed in, not the global `org`
    if key not in dicto:
        dicto[key] = val

# If val is defined, add to multi-val record at key
def add_multi(dicto, key, val):
    if str(val) == "nan" or str(val)=="":
        return
    if not key in dicto:
        dicto[key] = [val]
    elif not(val in dicto[key]):
        dicto[key].append(val)

# If val is defined, add each token of it to multi-val record at key
def add_multi_arr(dicto, key, val):
    if str(val) == "nan":
        return
    vals = val.split("|")
    for v in vals:
        add_multi(dicto, key, v)
        
        
# Add labeling for all entities in a collection
def add_labels(coll):
    for key in coll:
        if key in labels:
            # The URI in the collection has labeling, so let's add it
            coll[key]['labels']=labels[key]
            
            # If it has a SeeAlso that is also a main entity type, link them
            if 'seeAlsos' in labels[key]:
                for sa in labels[key]['seeAlsos']:
                    for m in main_entities:
                        if sa in m:
                            print("Gotcha")
                            add_link(key, sa, "seeAlso")
                
# add an edge (source, target, label). Will need this to map to LPG and RDF
def add_link(source, target, label):
    links.append([source, target, label])

# Arrange labels and seeAlsos for this dataset that we sourced from DBPedia
df = pd.read_csv(filepath_or_buffer="source/dbpedia_labelsx.csv")
for index, row in df.iterrows():
    uri = row['uri']
    urirec = add_record(labels, uri)
    
    add_multi(urirec, "labels", row['label'])
    add_multi(urirec, "ulabels", helpers.get_local_name(uri))
    add_multi(urirec, "types",  row['dbotype'])
    add_multi(urirec, "typeLabels",  row['dtlabel'])
    add_multi(urirec, "seeAlsos", row['seeAlso'])
    add_multi(urirec, "seeAlsoTypes",  row['stype'])
    add_multi(urirec, "seeAlsoTypeLabels",  row['stlabel'])
    add_multi(urirec, "seeAlsoLabels",  row['slabel'])
    
    urirec['label'] =  urirec['labels'][0] if 'labels' in urirec and len(urirec['labels']) > 0 else ""
    
    if not(str(row['dbotype']) == "nan" or str(row['dbotype'])==""):
        dbtype = row['dbotype']
        dbtrec = add_record(encountered_types, dbtype)
        add_multi(dbtrec, "labels", row['dtlabel'])
        add_multi(dbtrec, "ulabels", helpers.get_local_name(dbtype))
        dbtrec['label'] =  dbtrec['labels'][0] if 'labels' in dbtrec and len(dbtrec['labels']) > 0 else ""

    if not(str(row['seeAlso']) == "nan" or str(row['seeAlso'])==""):
        sa = row['seeAlso']
        sarec = add_record(encountered_see_alsos, sa)
        add_multi(sarec, "labels", row['slabel'])
        add_multi(sarec, "ulabels", helpers.get_local_name(sa))
        add_multi(sarec, "types", row['stype'])
        sarec['label'] =  sarec['labels'][0] if 'labels' in sarec and len(sarec['labels']) > 0 else ""

    if not(str(row['stype']) == "nan" or str(row['stype'])==""):
        satype = row['stype']
        satrec = add_record(encountered_types, satype)
        add_multi(satrec, "labels", row['stlabel'])
        add_multi(satrec, "ulabels", helpers.get_local_name(satype))
        satrec['label'] =  satrec['labels'][0] if 'labels' in satrec and len(satrec['labels']) > 0 else ""

# Build orgs dynamically; loop through results from DBPedia
df = pd.read_csv(filepath_or_buffer="source/dbpedia_orgsx.csv")
for index, row in df.iterrows():

    # Company ID, which is an IRI, maps to org record in orgs dictionary
    company_id = row['company']
    org = add_record(orgs, company_id)
    
    add_single(org, 'leType', row['otype'])
    add_single(org, 'leStatus', row['pstatus'])
    add_single(org, 'parent', row['oparent']) 
    add_multi(org, 'locations', row['olocation'])
    add_multi_arr(org, 'products', row['oproducts'])
    add_multi_arr(org, 'services', row['oservices'])
    add_multi_arr(org, 'subs', row['osubs'])    
    add_multi_arr(org, 'industries', row['oindustries'])
    add_multi_arr(org, 'keyPersons', row['okeyPersons'])

    # Parent and subs are orgs too. Add them
    if 'parent' in org:
        add_record(orgs,org['parent'])
        
    if 'subs' in org:
        for s in org['subs']:
            add_record(orgs,s)

# Add related entities
for company_id in orgs:
    org = orgs[company_id]
    
    if 'parent' in org:
        add_link(company_id, org['parent'], 'hasParentOrg')
        
    if 'subs' in org:
        for s in org['subs']:
            add_link(s, company_id, 'subsidiaryOf')

    if 'products' in org:
        for p in org['products']:
            add_record(products, p)
            add_link(company_id, p, 'hasProduct')

    if 'services' in org:
        for s in org['services']:
            add_record(services, s)
            add_link(company_id, s, 'hasService')
            
    if 'locations' in org:
        for l in org['locations']:
            add_record(locations, l)
            add_link(company_id, l, 'hasLocation')

    if 'keyPersons' in org:
        for k in org['keyPersons']:
            add_record(persons, k)
            add_link(company_id, k, 'hasKeyPerson')

    if 'industries' in org:
        for i in org['industries']:
            add_record(industries, i)
            add_link(company_id, i, 'hasIndustry')

add_labels(orgs)
add_labels(persons)
add_labels(products)
add_labels(services)
add_labels(locations)
add_labels(industries)
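
Because the orgs query groups by single-valued columns such as `?olocation`, one company can appear in several result rows, and its multi-valued columns arrive as pipe-delimited `GROUP_CONCAT` strings. The standalone sketch below illustrates how the merge in the cell above collapses such rows into one record. It re-implements the helpers in simplified form: `is_missing` and the sample rows are hypothetical, and the NaN check stands in for the `str(val) == "nan"` test that the notebook applies to empty pandas cells.

```python
import math

def is_missing(val) -> bool:
    """True for empty strings and for the NaN that pandas yields for empty CSV cells."""
    if isinstance(val, float) and math.isnan(val):
        return True
    return str(val) == ""

def add_multi(record: dict, key: str, val) -> None:
    """Append val to the list at record[key], skipping missing values and duplicates."""
    if is_missing(val):
        return
    values = record.setdefault(key, [])
    if val not in values:
        values.append(val)

def add_multi_arr(record: dict, key: str, val, sep: str = "|") -> None:
    """Split a GROUP_CONCAT cell on sep and add each token."""
    if is_missing(val):
        return
    for token in val.split(sep):
        add_multi(record, key, token)

DBR = "http://dbpedia.org/resource/"

# Hypothetical rows for one company, shaped like the query results.
rows = [
    {"oproducts": f"{DBR}Amazon_Echo|{DBR}Amazon_Kindle", "olocation": float("nan")},
    {"oproducts": f"{DBR}Amazon_Kindle|{DBR}Fire_OS", "olocation": f"{DBR}United_States"},
]

org = {}
for row in rows:
    add_multi_arr(org, "products", row["oproducts"])
    add_multi(org, "locations", row["olocation"])

print(org)
```

The record ends up with three distinct products and one location, even though one product appears in both rows and one row has no location.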


## Map source data to LPG

In [None]:
def write_node_file(filename, coll_label, coll, extra_prop_keys=[]):
    headers = ["~id", "~label"]
    labeling= ["label", "labels", "ulabels", "types", "typeLabels", "seeAlsos", "seeAlsoTypes", "seeAlsoTypeLabels", "seeAlsoLabels"]    
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(headers + labeling + extra_prop_keys)
        for p in coll:
            # ID and node label
            values = [p, coll_label]
            for lk in labeling:
                values.append(helpers.get_delim_string(coll[p]['labels'], lk))
            for ex in extra_prop_keys:
                values.append(helpers.get_delim_string(coll[p], ex))
            writer.writerow(values)

# write nodes to CSV
write_node_file("graphdata/lpg/orgs.csv", "org", orgs, ["leType", "leStatus"])
write_node_file("graphdata/lpg/persons.csv", "person", persons)
write_node_file("graphdata/lpg/locations.csv", "location", locations)
write_node_file("graphdata/lpg/products.csv", "product", products)
write_node_file("graphdata/lpg/services.csv", "service", services)
write_node_file("graphdata/lpg/industries.csv", "industry", industries)

# write edges to CSV
with open('graphdata/lpg/rels.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id","~from", "~to", "~label"])
    for l in links:
        source = l[0]
        target = l[1]
        label = l[2]
        edge_id=f"structuredLink_{source}_{target}"
        writer.writerow([ edge_id, source, target, label])
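
To make the shape of the bulk-load files concrete, here is a small sketch that writes one node row and one edge row to in-memory buffers using the same headers as the cell above. It is illustrative only: `CELL_DELIM` is an assumed stand-in for the delimiter defined in `helpers` (its actual value is not shown in this notebook), `node_rows` is a hypothetical helper, and the sample node and edge are made up for the example.

```python
import csv
import io

# Assumption: stands in for helpers.CELL_DELIM, whose real value is not shown here.
CELL_DELIM = ";"

def node_rows(label, nodes, prop_keys):
    """Yield a header row, then one row per node, joining list values into one cell."""
    yield ["~id", "~label"] + prop_keys
    for node_id, props in nodes.items():
        row = [node_id, label]
        for key in prop_keys:
            val = props.get(key, "")
            row.append(CELL_DELIM.join(val) if isinstance(val, list) else val)
        yield row

DBR = "http://dbpedia.org/resource/"

# Hypothetical sample data.
nodes = {f"{DBR}Rivian": {"label": "Rivian", "ulabels": ["Rivian"]}}
edges = [[f"{DBR}Rivian", f"{DBR}Automotive_industry", "hasIndustry"]]

node_buf = io.StringIO()
csv.writer(node_buf).writerows(node_rows("org", nodes, ["label", "ulabels"]))

edge_buf = io.StringIO()
edge_writer = csv.writer(edge_buf)
edge_writer.writerow(["~id", "~from", "~to", "~label"])
for source, target, label in edges:
    edge_writer.writerow([f"structuredLink_{source}_{target}", source, target, label])

print(node_buf.getvalue())
print(edge_buf.getvalue())
```

Node files carry `~id` and `~label` followed by property columns. Edge files carry `~id`, `~from`, `~to`, and `~label`, where `~from` and `~to` refer to node `~id` values.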


## Map source data to RDF

In [None]:
import helpers
from rdflib import Graph, Literal, RDF, RDFS, URIRef, XSD, OWL, BNode, DC

rdf_file = helpers.rdf_open()

def write_entities(coll_type, coll,  extra_prop_literal_keys=[], extra_prop_uri_keys=[]):
    for p in coll:
        helpers.rdf_write(rdf_file, URIRef(p), RDF.type, helpers.make_uri("OrgEntity"))
        helpers.rdf_write(rdf_file, URIRef(p), RDF.type, helpers.make_uri(coll_type))
        for ex in extra_prop_literal_keys:
            for val in helpers.get_delim_array(coll[p], ex):
                helpers.rdf_write(rdf_file, URIRef(p), helpers.make_uri(ex), Literal(val))
        for ex in extra_prop_uri_keys:
            for val in helpers.get_delim_array(coll[p], ex):
                helpers.rdf_write(rdf_file, URIRef(p), helpers.make_uri(ex), URIRef(val))
        labels = coll[p]['labels']
        if len(labels['label']) > 0:
            helpers.rdf_write(rdf_file, URIRef(p), RDFS.label, Literal(labels['label']))
        for ex in helpers.get_delim_array(labels, "labels"):
            helpers.rdf_write(rdf_file, URIRef(p), helpers.make_uri("sourceLabel"),Literal(ex))
        for ex in helpers.get_delim_array(labels, "ulabels"):
            helpers.rdf_write(rdf_file, URIRef(p), helpers.make_uri("sourceLabel"),Literal(ex))
        for ex in helpers.get_delim_array(labels, "types"):
            helpers.rdf_write(rdf_file, URIRef(p), RDF.type, URIRef(ex))
        for ex in helpers.get_delim_array(labels, "seeAlsos"):
            helpers.rdf_write(rdf_file, URIRef(p), RDFS.seeAlso, URIRef(ex))
            
                
# write entities to RDF
write_entities("Organization", orgs, ["leStatus"], ["leType"])
write_entities("Person", persons)
write_entities("Location", locations)
write_entities("Product", products)
write_entities("Service", services)
write_entities("Industry", industries)

# write the links
for l in links:
    source = l[0]
    target = l[1]
    label = l[2]
    helpers.rdf_write(rdf_file, URIRef(source), helpers.make_uri(label), URIRef(target))

# write encountered types
for en in encountered_types:
    if len(encountered_types[en]['label']) > 0:
        helpers.rdf_write(rdf_file, URIRef(en), RDFS.label, Literal(encountered_types[en]['label']))

for en in encountered_see_alsos:
    if len(encountered_see_alsos[en]['label']) > 0:
        helpers.rdf_write(rdf_file, URIRef(en), RDFS.label, Literal(encountered_see_alsos[en]['label']))
                  
helpers.rdf_close(rdf_file, "graphdata/rdf/orgdata.ttl")

## Add taxonomy

We'll have some fun and have an LLM (Anthropic Claude 3 Sonnet) create a SKOS taxonomy of the world of industry!

I tried the following prompt in the Bedrock playground:

```
Write a taxonomy for industry using SKOS in ntriples format. Include alternate labels. Make sure to include the following terms.

Internet
E-commerce
Artificial intelligence
Cloud Computing
Consumer electronics
Digital distribution
Entertainment
Self-driving cars
Supermarket
Grocery store
Health food store
Video game industry
Retail
Automotive industry
Energy storage
Cloud computing
Consulting
Enterprise software
Aerospace
Banking
Packaging
```

Now let's add the taxonomy to both LPG and RDF models. 

The RDF file is already downloaded and is available in source/industry_taxonomy.ttl

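For LPG, we start from source/industry_taxonomy.json, a JSON rendering of the taxonomy in which each concept has an `@id` and lists of SKOS properties. We need to transform it from JSON to bulk-loadable CSV format.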
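
Before reading the real file, it may help to see the transformation on a tiny example. The sketch below assumes the JSON shape that the code in the next cell reads (an `@id` per concept, with `@value` items under the SKOS label properties and `@id` items under `broader`). The two concepts and their `example.org` URIs are hypothetical, and `skos_values` is a simplified version of the helper used below.

```python
import json

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical two-concept taxonomy; the real file has many more entries.
sample_json = """
[
  {"@id": "http://example.org/industry/Technology",
   "http://www.w3.org/2004/02/skos/core#prefLabel": [{"@value": "Technology"}]},
  {"@id": "http://example.org/industry/CloudComputing",
   "http://www.w3.org/2004/02/skos/core#prefLabel": [{"@value": "Cloud Computing"}],
   "http://www.w3.org/2004/02/skos/core#altLabel": [{"@value": "Cloud services"}],
   "http://www.w3.org/2004/02/skos/core#broader": [{"@id": "http://example.org/industry/Technology"}]}
]
"""
taxonomy = json.loads(sample_json)

def skos_values(entry, prop, subkey="@value"):
    """Collect one sub-field from every item listed under a SKOS property."""
    return [item[subkey] for item in entry.get(SKOS + prop, [])]

concepts = {}    # lowercased prefLabel -> concept record
id_to_pref = {}  # concept URI -> lowercased prefLabel
for entry in taxonomy:
    prefs = skos_values(entry, "prefLabel")
    if len(prefs) != 1:
        continue  # the notebook raises on more than one prefLabel; this sketch just skips
    key = prefs[0].lower()
    concepts[key] = {
        "id": entry["@id"],
        "pref": prefs[0],
        "alts": skos_values(entry, "altLabel"),
        "broader_ids": skos_values(entry, "broader", "@id"),
    }
    id_to_pref[entry["@id"]] = key

# Resolve broader concept URIs to their labels.
for concept in concepts.values():
    concept["broader_labels"] = [id_to_pref[b] for b in concept["broader_ids"]]

# An industry labeled "Cloud computing" matches on the lowercased prefLabel.
match = concepts["Cloud computing".lower()]
print(match["id"], match["alts"], match["broader_labels"])
```

Keying the lookup on the lowercased preferred label is what lets industry labels that differ only in capitalization, such as "Cloud computing" and "Cloud Computing", resolve to the same concept.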




In [None]:
# read JSON file
taxo_dict=None
with open('source/industry_taxonomy.json') as taxo_file:
    taxo_dict = json.load(taxo_file)

# look at each entry, tie to the industry entity
tax_labels={}
concept_to_pref={}
for entry in taxo_dict:
    def get_skos_values(dicto, key, subkey='@value'):
        vals=[]
        if key in dicto:
            for item in dicto[key]:
                vals.append(item[subkey])
        return vals
                
    concept=entry['@id']
    prefs=get_skos_values(entry, 'http://www.w3.org/2004/02/skos/core#prefLabel')
    if len(prefs)> 1:
        # entry is a dict, so format it rather than concatenating it to a str
        raise Exception(f"Too many prefLabels for entry {entry}")
    elif len(prefs) == 0:
        continue
    pref = prefs[0]
    alts=get_skos_values(entry, 'http://www.w3.org/2004/02/skos/core#altLabel')
    broaders=get_skos_values(entry, 'http://www.w3.org/2004/02/skos/core#broader', '@id')
    entry_record=[concept, pref, alts, broaders]
    tax_labels[pref.lower()] = entry_record
    concept_to_pref[concept] = pref.lower()

# prefer broaders as string, not URI
for t in tax_labels:
    broaders=[]
    for b in tax_labels[t][3]:
        broaders.append(concept_to_pref[b])
    tax_labels[t].append(broaders)
    
# Now tie industries we already have on file to this taxonomy
for i in industries:
    if industries[i]['labels']['label'].lower() in tax_labels:
        industries[i]['taxonomy']=tax_labels[industries[i]['labels']['label'].lower()]
    else:
        raise Exception(f"Industry {i} with label {industries[i]['labels']['label']} has NO tax")



## Write taxonomy for LPG

In [None]:
with open("graphdata/lpg/taxonomy_concepts.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id", "~label", "prefLabel", "altLabels", "broaders"])
    for i in industries:
        concept=industries[i]['taxonomy'][0]
        pref=industries[i]['taxonomy'][1]
        alts=industries[i]['taxonomy'][2]
        broaders=industries[i]['taxonomy'][4]
        
        values=[concept, "taxonomy_concept", pref, helpers.CELL_DELIM.join(alts), helpers.CELL_DELIM.join(broaders)]
        writer.writerow(values)

# write edges to CSV
with open('graphdata/lpg/tax_concept_rels.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["~id","~from", "~to", "~label"])
    for i in industries:
        concept=industries[i]['taxonomy'][0]
        edge_id=f"taxConceptLink_{i}_{concept}"
        writer.writerow([ edge_id, i, concept, "hasTaxonomyConcept"])


## Write taxonomy for RDF
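This is mostly already done, because the taxonomy itself is already in RDF form in industry_taxonomy.ttl. We only need to write one triple per industry that links it to its taxonomy concept.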
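
As a rough illustration of what such a link looks like on disk, the sketch below writes one triple per industry using only the standard library. It is not how the cell below works (that cell uses `helpers` and rdflib). The `EX` namespace, the `ntriple` function, and the concept URI are hypothetical, and the output uses N-Triples syntax, which is also valid Turtle.

```python
# Assumption: a made-up namespace standing in for whatever helpers.make_uri produces.
EX = "http://example.org/ontology/"

def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format one IRI-to-IRI triple. The IRIs must not contain spaces or angle brackets."""
    return f"<{subject}> <{predicate}> <{obj}> ."

# Hypothetical mapping from an industry URI to its taxonomy concept URI.
industry_to_concept = {
    "http://dbpedia.org/resource/Banking": "http://example.org/industry/Banking",
}

lines = [
    ntriple(industry, EX + "hasConcept", concept)
    for industry, concept in industry_to_concept.items()
]
turtle_text = "\n".join(lines) + "\n"
print(turtle_text)
```

Because the concept URI is the same one used in industry_taxonomy.ttl, a query can follow the link from an industry to its concept and from there to alternate labels and broader concepts once both files are loaded.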

In [None]:
rdf_file = helpers.rdf_open()

for i in industries:
    concept=industries[i]['taxonomy'][0]
    helpers.rdf_write(rdf_file, URIRef(i), helpers.make_uri("hasConcept"), URIRef(concept))

helpers.rdf_close(rdf_file, "graphdata/rdf/tax_concept_rels.ttl")