# EarthCube Graph Analytics Exploration

## About

This notebook is a first step in learning how to apply graph analytics to the 
EarthCube graph. The goals are to explore the relationships in the graph and to 
look for methods that make it easier to search the graph for relevant connections.

## References

* [RDFLib](https://github.com/RDFLib/rdflib)
* [NetworkX](https://networkx.org/)
* [NetworkX link analysis](https://networkx.org/documentation/latest/reference/algorithms/link_analysis.html?highlight=page%20rank#)
* [PageRank with Python](https://faculty.math.illinois.edu/~riveraq2/teaching/simcamp16/PageRankwithPython.html)
* [Dask documentation](https://docs.dask.org/en/latest/)
* [Dask examples: Bag](https://examples.dask.org/bag.html)
* [s3fs documentation](https://s3fs.readthedocs.io/en/latest/)
* [Dask: remote data services](https://docs.dask.org/en/latest/remote-data-services.html)

## Installs

In [29]:
!pip -q install mimesis
!pip -q install minio 
!pip -q install s3fs
!pip -q install SPARQLWrapper
!pip -q install boto3
!pip -q install 'fsspec>=0.3.3'
!pip -q install rdflib
!pip -q install rdflib-jsonld
!pip -q install PyLD==2.0.2

## Imports


In [30]:
import dask, boto3
import dask.dataframe as dd
import pandas as pd
import json

from SPARQLWrapper import SPARQLWrapper, JSON

sweet = "http://cor.esipfed.org/sparql"
dbsparql = "http://dbpedia.org/sparql"
ufokn = "http://graph.ufokn.org/blazegraph/namespace/ufokn-dev/sparql"

## Code inits

### Helper function(s)
The following block defines a helper that runs a SPARQL query and returns the results as a Pandas DataFrame. Run the cell to load the function before calling it.

In [52]:
def get_sparql_dataframe(service, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)
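
The row-building loop in the helper can be tried without a live endpoint. The sketch below uses a hand-written response in the shape of the SPARQL JSON results format; the `example.org` URIs and the dataset name are invented for illustration. It shows why the helper uses `row.get(c, {}).get('value')`: a variable that is unbound in a given row is simply missing from that binding, and the chained `get` turns it into `None` instead of raising a `KeyError`.

```python
# Hypothetical SPARQL JSON response; the URIs and name are made up for illustration
sample = {
    "head": {"vars": ["s", "name"]},
    "results": {
        "bindings": [
            {
                "s": {"type": "uri", "value": "http://example.org/ds1"},
                "name": {"type": "literal", "value": "Example dataset"},
            },
            # ?name is unbound in this row, so the key is absent
            {"s": {"type": "uri", "value": "http://example.org/ds2"}},
        ]
    },
}

cols = sample["head"]["vars"]

# Same logic as the helper: one list per binding, None for unbound variables
rows = [
    [binding.get(c, {}).get("value") for c in cols]
    for binding in sample["results"]["bindings"]
]

for row in rows:
    print(row)
```

Passing `rows` and `cols` to `pd.DataFrame(rows, columns=cols)` would then give the same kind of frame the helper returns.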

### Set up some Pandas Dataframe options

In [53]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

### Set up the connection to the object store that holds the graph objects

In [54]:
import s3fs 

oss = s3fs.S3FileSystem(
    anon=True,
    key="",
    secret="",
    client_kwargs={"endpoint_url": "https://oss.geodex.org"},
)

In [55]:
# Simple command to list objects in the current results bucket prefix
oss.ls('gleaner/results/cdfv3')

['gleaner/results/cdfv3/bcodmo_graph.nq',
 'gleaner/results/cdfv3/earthchem_graph.nq',
 'gleaner/results/cdfv3/hydroshare_graph.nq',
 'gleaner/results/cdfv3/ieda_graph.nq',
 'gleaner/results/cdfv3/iris_graph.nq',
 'gleaner/results/cdfv3/lipdverse_graph.nq',
 'gleaner/results/cdfv3/ocd_graph.nq',
 'gleaner/results/cdfv3/opentopo_graph.nq',
 'gleaner/results/cdfv3/ssdb_graph.nq',
 'gleaner/results/cdfv3/unavco_graph.nq']
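
As a small convenience, the listing can be turned into a lookup from provider name to object path. This is only a sketch: it assumes every object follows the `<provider>_graph.nq` naming seen above, and the `listing` variable below is a hand-copied subset of that output, not a live call to `oss.ls`.

```python
from pathlib import PurePosixPath

# Hand-copied subset of the listing shown above (not a live oss.ls call)
listing = [
    "gleaner/results/cdfv3/bcodmo_graph.nq",
    "gleaner/results/cdfv3/opentopo_graph.nq",
    "gleaner/results/cdfv3/unavco_graph.nq",
]

SUFFIX = "_graph.nq"  # assumed naming convention for the graph objects

# Map "opentopo" -> "gleaner/results/cdfv3/opentopo_graph.nq", and so on
providers = {
    PurePosixPath(path).name[: -len(SUFFIX)]: path
    for path in listing
    if path.endswith(SUFFIX)
}

print(sorted(providers))
print(providers["opentopo"])
```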

## Pull a graph and load

Let's pull back an example graph and load it up into an RDFLib graph so we can test out a SPARQL call on it.

In [35]:
import rdflib
import gzip

# Read the raw N-Quads bytes for one provider graph from the object store
with oss.open('gleaner/results/cdfv3/opentopo_graph.nq', 'rb') as f:
    file_content = f.read()

# Alternative: read a local gzipped copy instead
# with gzip.open('./oceanexperts_graph.nq.gz', 'rb') as f:
#     file_content = f.read()


In [39]:

g = rdflib.Graph()
parsed = g.parse(data = file_content, format="nquads")

In [40]:
qres = g.query(
    """prefix schema: <http://schema.org/>
    SELECT DISTINCT ?s ?name
       WHERE {
          ?s  a schema:Dataset.
           ?s schema:name ?name.
       }
       LIMIT 10""")

for row in qres:
    print("%s Name: %s" % row)

Nb0186ff66b5649fbb32491dda263838a Name: High Resolution Topography near Santa Cruz, CA 2017
N40a2df2c83054580b8714aac155784be Name: Walker Fault System, Nevada, 2015
Na820cf1a029c4f21afaa90b4e2dd8934 Name: Almaty range front fault, Koram site, Kazakhstan
N6622f9260fed46dfb80bb0626648657b Name: 2019 Ridgecrest, CA: Trona-Argus Area - Post-Earthquake Lidar
Na713d5ea88a248ff96c0b78ab249b449 Name: Alteration of Groundwater Flow due to Slow Landslide Failure, CA
Ndcf2768257064225a2414b73eb6526ac Name: 2018 Faraglione, Vulcano Island, Sicily, Italy (simple demo)
Na4e31523cb0441b5ab29aa0de226b10d Name: Northern Rockies, Montana lidar 2016
Ne8f3de748f1b426cade89ce9ab61789c Name: IT-Ren, Fluxnet site
Nbfd99a87ac6d406d82b1204ac052acc3 Name: 2019 Ridgecrest, CA: Trona-Argus Area - Post-Earthquake Lidar
N8f41aac0fc3242e7a271f43e1601fedf Name: Manawatu - Whanganui, New Zealand 2015


## Convert to NetworkX

Convert the RDFLib graph to a NetworkX graph so we can explore some analytics calls.

In [45]:
import rdflib
from rdflib.extras.external_graph_libs import rdflib_to_networkx_multidigraph
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph
import networkx as nx
import matplotlib.pyplot as plt

G = rdflib_to_networkx_digraph(parsed)


### PageRank

Test a PageRank call and see if we can load the results into Pandas and sort them.

In [47]:
import pandas as pd

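# alpha is the damping factor (NetworkX defaults to 0.85)
pr = nx.pagerank(G, alpha=0.9)
# One row per node, with the score in column 0
prdf = pd.DataFrame.from_dict(pr, orient='index')
prdf.sort_values(by=0, ascending=False, inplace=True)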
prdf.head(10)

| | 0 |
|---|---|
| http://schema.org/PropertyValue | 0.065468 |
| http://schema.org/Organization | 0.031006 |
| http://schema.org/GeoShape | 0.017148 |
| Vertical Coordinates | 0.011443 |
| Horizontal Coordinates | 0.011443 |
| Area | 0.009463 |
| Unit | 0.008047 |
| meter | 0.007746 |
| http://schema.org/Place | 0.007105 |
| opentopoID | 0.00698 |
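
For intuition about what `alpha` controls, the following is a minimal pure-Python sketch of power-iteration PageRank. It is not NetworkX's implementation, and the function name `pagerank_sketch` and the toy edge list are invented for illustration. With probability `alpha` a random walker follows an outgoing edge, and otherwise jumps to a random node; rank held by nodes with no outgoing edges is spread evenly over all nodes.

```python
def pagerank_sketch(edges, alpha=0.9, max_iter=100, tol=1e-10):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank sitting on nodes without outgoing edges is shared evenly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes
        }
        # Every other node splits its rank equally among its targets
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += alpha * rank[src] / len(out_links[src])
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy graph: two datasets that share a type and a publisher
toy_edges = [
    ("dataset-1", "schema:Dataset"),
    ("dataset-2", "schema:Dataset"),
    ("dataset-1", "org-1"),
    ("dataset-2", "org-1"),
    ("org-1", "schema:Organization"),
]

ranks = pagerank_sketch(toy_edges, alpha=0.9)
for node, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{node}: {score:.4f}")
```

Nodes that many others point at, such as shared type URIs, collect rank from all of them, which is consistent with schema.org classes appearing at the top of the table above.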


### NetworkX HITS

Try the HITS API. Sadly, it fails to converge on this graph, although it works on smaller graphs. Raising the iteration limit (the `max_iter` argument, which the error message shows is 100 here) may help.

In [51]:
hits = nx.hits(G)  # could also be done with a node list (interesting: supply one from a SPARQL call?)

PowerIterationFailedConvergence: (PowerIterationFailedConvergence(...), 'power iteration failed to converge within 100 iterations')

In [None]:
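# nx.hits returns a (hubs, authorities) pair of dicts keyed by node
hubs, authorities = hits
hitsdf = pd.DataFrame({"hub": hubs, "authority": authorities})
hitsdf.head()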
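
To illustrate how an iteration cap produces the convergence failure seen above, here is a minimal pure-Python sketch of the HITS power iteration. It is not NetworkX's implementation; `hits_sketch`, `ConvergenceError`, and the toy edge list are all invented for illustration. The same toy graph fails with a very small cap and succeeds with a larger one.

```python
class ConvergenceError(Exception):
    """Hypothetical stand-in for a power-iteration convergence failure."""


def hits_sketch(edges, max_iter=100, tol=1e-8):
    """Return (hubs, authorities, iterations_used) for a list of edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hubs = {node: 1.0 / len(nodes) for node in nodes}
    for iteration in range(1, max_iter + 1):
        # Authority score: sum of hub scores of the nodes pointing at it
        auths = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auths[dst] += hubs[src]
        # Hub score: sum of authority scores of the nodes it points at
        new_hubs = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hubs[src] += auths[dst]
        # Normalise both score sets so each sums to 1
        auth_total = sum(auths.values())
        hub_total = sum(new_hubs.values())
        auths = {node: value / auth_total for node, value in auths.items()}
        new_hubs = {node: value / hub_total for node, value in new_hubs.items()}
        change = sum(abs(new_hubs[node] - hubs[node]) for node in nodes)
        hubs = new_hubs
        if change < tol:
            return hubs, auths, iteration
    raise ConvergenceError(f"failed to converge within {max_iter} iterations")


# Hypothetical toy graph with three hubs (a, b, c) and three targets (x, y, z)
toy_edges = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "z"), ("c", "z")]

try:
    hits_sketch(toy_edges, max_iter=2)
except ConvergenceError as err:
    print("small cap:", err)

hub_scores, auth_scores, used = hits_sketch(toy_edges, max_iter=1000)
print("larger cap converged after", used, "iterations")
```

On the real graph, the equivalent experiment would be a call such as `nx.hits(G, max_iter=1000)`; whether that is enough for this graph is untested here.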