# Get Climate data

This notebook demonstrates graph virtualization in Amazon Neptune. 

In a Neptune database we maintain weather station data. We represent this data in the Resource Description Framework (RDF).

In a data lake we keep a record of readings from those stations collected over several years. The lake data resides in Parquet files in an Amazon Simple Storage Service (S3) bucket. We can query that data using SQL via the Amazon Athena service. We use the AWS Glue Data Catalog to define a relational/tabular structure of the readings data. This enables Athena to see the readings data as if it came from a table. 
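
To make the Athena step more concrete, the following is a minimal sketch of how a SQL request for one station's readings might be assembled for the boto3 Athena client. The database name (`climate_db`), table name (`readings`), column name (`station`), and S3 output location are hypothetical placeholders, not names taken from this notebook; the real ones come from your Glue Data Catalog and CloudFormation stack. The helper `build_athena_request` is also illustrative.

```python
def build_athena_request(database, table, station_id, output_location, limit=10):
    """Build keyword arguments for the boto3 Athena start_query_execution call.

    All names passed in are assumptions; substitute the ones from your Glue Data Catalog.
    """
    # Basic validation so that values interpolated into the SQL text are safe.
    if not str(station_id).isdigit():
        raise ValueError("station_id must be numeric")
    for name in (database, table):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected characters in identifier: {name!r}")

    # The column name 'station' is hypothetical.
    sql = f"SELECT * FROM \"{table}\" WHERE station = '{station_id}' LIMIT {int(limit)}"
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    database="climate_db",                                   # hypothetical
    table="readings",                                        # hypothetical
    station_id=3334099999,
    output_location="s3://example-bucket/athena-results/",  # hypothetical
)
print(request["QueryString"])

# With boto3 installed and AWS credentials configured, the request could be submitted as:
#   boto3.client("athena").start_query_execution(**request)
```

In the flow described in this notebook you do not issue such SQL yourself; it only illustrates the kind of tabular access that Athena provides over the Parquet files.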

A SPARQL query to the Neptune database can access the readings in the data lake through SPARQL federated query. In SPARQL federation, the SERVICE clause collects results from a second SPARQL endpoint. In our example, we wish to collect readings from the lake, but the lake is queried with SQL, not SPARQL. To bridge that gap we use Ontop VKG (https://ontop-vkg.org/), a tool that functions as a SPARQL endpoint but in turn queries a relational database using SQL.

The end-to-end flow is shown in the next figure.

<img src="https://raw.githubusercontent.com/aws-samples/amazon-neptune-graph-virtualization/main/images/nep2lake_design_flow.png">


For more detail, see our blog post, watch the YouTube video on this topic (https://www.youtube.com/watch?v=jNB0HpsGtEw), or visit our Git repository (https://github.com/aws-samples/amazon-neptune-graph-virtualization/tree/main).

See https://docs.aws.amazon.com/neptune/latest/userguide/sparql-service.html for more on SPARQL federation support in Neptune. We also recommend this blog post: https://aws.amazon.com/blogs/database/benefitting-from-sparql-1-1-federated-queries-with-amazon-neptune/. 

## Bulk-load RDF data to Neptune

Run the next cell to bulk-load the weather station data to Neptune. The S3 bucket was created as part of the CloudFormation stack. See https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html for more on the bulk loader.

In [None]:
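import subprocess

# Read the DATA_BUCKET variable defined in ~/.bashrc.
# bash is invoked explicitly because `source` is a bash builtin.
result = subprocess.run(
    ["bash", "-c", "source ~/.bashrc ; echo $DATA_BUCKET"],
    capture_output=True, text=True, check=True,
)
DATA_BUCKET = result.stdout.split("\n")[0]
print(DATA_BUCKET)

# Bulk-load the N-Quads files under s3://<bucket>/rdf and keep the response in loadres1.
%load -s s3://{DATA_BUCKET}/rdf -f nquads --store-to loadres1 --run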


In [None]:
%load_status {loadres1['payload']['loadId']} --errors --details
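
If you have a load status response available as a Python dictionary (for example, from the loader's Get-Status API), a small helper can condense it to the fields that matter most. The sketch below assumes the response carries a `payload.overallStatus` object with fields such as `status`, `totalRecords`, and the three error counters; the helper name `summarize_load_status` and the sample values are illustrative only, not output from this notebook.

```python
def summarize_load_status(response):
    """Condense a bulk-load status response (a dict) into a short summary."""
    overall = response.get("payload", {}).get("overallStatus", {})
    error_keys = ("parsingErrors", "datatypeMismatchErrors", "insertErrors")
    error_count = sum(overall.get(key, 0) for key in error_keys)
    status = overall.get("status", "UNKNOWN")
    return {
        "status": status,
        "totalRecords": overall.get("totalRecords", 0),
        "errorCount": error_count,
        "succeeded": status == "LOAD_COMPLETED" and error_count == 0,
    }


# Illustrative sample response; the numbers are made up.
sample_response = {
    "status": "200 OK",
    "payload": {
        "overallStatus": {
            "status": "LOAD_COMPLETED",
            "totalRecords": 1200,
            "parsingErrors": 0,
            "datatypeMismatchErrors": 0,
            "insertErrors": 0,
        }
    },
}
summary = summarize_load_status(sample_response)
print(summary)
```

Wait until the load reports completion without errors before running the queries in the next section.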


## Run queries across graph and lake!
Now let's query weather station and readings data. 

The first query lists details about a specific station. 

In [None]:
%%sparql

select * where {
    <http://climate.aws.com/resource#3129099999> ?p ?o}


The next query gets both the station and some of its readings, keeping only readings of 30 degrees Celsius or more. Before running it, change the IP address in the SERVICE clause to the private IP address of your Ontop container. For detailed instructions on how to do this, refer to the blog post.

In [None]:
%%sparql
PREFIX clmo: <http://climate.aws.com/ontology/>
PREFIX clmr: <http://climate.aws.com/resource#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>


# readings at Manchester UK in 
select * where {
    BIND(clmr:3334099999 as ?stat) .
    ?stat rdfs:label ?name .
    ?stat clmo:latitude ?lat .
    ?stat clmo:longitude ?long .
    
    SERVICE <http://172.30.1.129:8080/sparql> {
        ?reading clmo:station ?stat ;
            clmo:celsius ?celsius ;
            clmo:dateTime ?dateTime ;
            clmo:fahrenheit ?fahrenheit .
        FILTER(?celsius >= "30"^^xsd:double)
    } 
} 
LIMIT 10
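
Editing the IP address by hand is easy to get wrong, so one option is to generate the query text in Python. The sketch below is a simplified variant of the query above (it omits latitude, longitude, and Fahrenheit); the helper name `build_federated_query` and the address `10.0.0.5` are hypothetical placeholders, and port 8080 is taken from the SERVICE URL used in this notebook.

```python
import ipaddress

QUERY_TEMPLATE = """PREFIX clmo: <http://climate.aws.com/ontology/>
PREFIX clmr: <http://climate.aws.com/resource#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT * WHERE {{
    BIND(clmr:{station_id} AS ?stat) .
    ?stat rdfs:label ?name .

    SERVICE <http://{ip}:{port}/sparql> {{
        ?reading clmo:station ?stat ;
            clmo:celsius ?celsius ;
            clmo:dateTime ?dateTime .
        FILTER(?celsius >= "{min_celsius}"^^xsd:double)
    }}
}}
LIMIT {limit}
"""


def build_federated_query(ontop_ip, station_id, min_celsius=30.0, port=8080, limit=10):
    """Return federated SPARQL query text that targets the given Ontop endpoint."""
    # IPv4Address raises a ValueError subclass if the address is malformed.
    ip = ipaddress.IPv4Address(ontop_ip)
    if not str(station_id).isdigit():
        raise ValueError("station_id must be numeric")
    return QUERY_TEMPLATE.format(
        station_id=station_id,
        ip=ip,
        port=int(port),
        min_celsius=float(min_celsius),
        limit=int(limit),
    )


# 10.0.0.5 is a placeholder; use the private IP address of your Ontop container.
query_text = build_federated_query("10.0.0.5", 3334099999)
print(query_text)
```

The printed text can be pasted into a `%%sparql` cell. Validating the address up front catches typos before the query reaches Neptune.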


The next query is a variation on the previous one: it filters the readings by date rather than by temperature (the temperature filter is left in as a comment). As before, change the IP address in the SERVICE clause to the private IP address of your Ontop container before running it.

In [None]:
%%sparql
PREFIX clmo: <http://climate.aws.com/ontology/>
PREFIX clmr: <http://climate.aws.com/resource#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>


# readings at Manchester UK in 
select * where {
    BIND(clmr:3334099999 as ?stat) .
    ?stat rdfs:label ?name .
    ?stat clmo:latitude ?lat .
    ?stat clmo:longitude ?long .
    
    SERVICE <http://172.30.1.129:8080/sparql> {
        ?reading clmo:station ?stat ;
            clmo:celsius ?celsius ;
            clmo:dateTime ?dateTime ;
            clmo:fahrenheit ?fahrenheit .
        FILTER(?dateTime > "1985-12-04T13:20:00"^^xsd:dateTime)
        #FILTER(?celsius >= "30"^^xsd:double)
    } 
} 
LIMIT 40