<img src="dbpedia_getting_started.jpg">

# Experimental area for interacting with the Parthenos Discovery platform

Data notebooks are increasingly popular because they offer in-browser code execution and make it easy to share and reproduce algorithmic procedures.
In the context of Parthenos Discovery, Jupyter notebooks give more freedom in post-processing the resulting data. For querying the data we use [SPARQL](https://www.w3.org/TR/sparql11-query/). https://parthenos.acdh-dev.oeaw.ac.at/


Credits:
Based on [notebooks from gastrodon](https://github.com/paulhoule/gastrodon/blob/master/notebooks/remote/Querying%20DBpedia.ipynb)
"My method is a deliberate combination of systematic analysis (looking at counts, methods that can applied to arbitrary predicates or classes) and opportunism (looking at topics that catch my eye.)"


In [None]:
import sys
from os.path import expanduser
from gastrodon import RemoteEndpoint,QName,ttl,URIRef,inline
import pandas as pd
import json
pd.options.display.width=120
pd.options.display.max_colwidth=100

First let's define a few prefixes for namespaces we will use:

In [None]:
prefixes=inline("""
    @prefix : <http://dbpedia.org/resource/> .
    @prefix pe: <http://parthenos.d4science.org/CRMext/CRMpe.rdfs/> .
    @prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
    @prefix crmdig: <http://www.ics.forth.gr/isl/CRMext/CRMdig.rdfs/> .
""").graph

Next we set up a SPARQL endpoint and register the above prefixes.
Setting the base_uri makes the results more readable, because URIs within the base_uri namespace are displayed without that common prefix.
We leave the default_graph empty, because the data is spread over many graphs (grouped by provenance).

In [None]:
# for connecting to an endpoint with restricted access, you need to provide the credentials:
#connection_data=json.load(open(expanduser("config.json")))
#connection_data["prefixes"]=prefixes
#endpoint=RemoteEndpoint(**connection_data)

In [None]:
# for ease of use a dedicated PARTHENOS endpoint is available that allows anonymous read-access (SELECT queries)
endpoint=RemoteEndpoint(
    "https://triplestore-parthenos-cached.acdh-dev.oeaw.ac.at/parthenos-dev/sparql"
    ,default_graph=""
    ,prefixes=prefixes
    ,base_uri="http://parthenos.d4science.org/handle/"
)
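
To picture what the registered prefixes and the base_uri do to the results, here is a small stand-alone sketch. It is not gastrodon's implementation: the helper `shorten` and the example resource path are made up for illustration, and the exact rendering gastrodon produces may differ.

```python
# Illustrative only: mimics how long URIs turn into short names.
# This is NOT gastrodon's code; `shorten` is a hypothetical helper.
PREFIXES = {
    "pe": "http://parthenos.d4science.org/CRMext/CRMpe.rdfs/",
    "crm": "http://www.cidoc-crm.org/cidoc-crm/",
    "crmdig": "http://www.ics.forth.gr/isl/CRMext/CRMdig.rdfs/",
}
BASE_URI = "http://parthenos.d4science.org/handle/"


def shorten(uri, prefixes=PREFIXES, base_uri=BASE_URI):
    """Return a compact form of `uri` for display."""
    if uri.startswith(base_uri):
        # Resources under the base URI keep only their local part.
        return "<" + uri[len(base_uri):] + ">"
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            # Known namespaces are replaced by their prefix.
            return prefix + ":" + uri[len(namespace):]
    # Anything else stays a full URI.
    return "<" + uri + ">"


print(shorten("http://www.cidoc-crm.org/cidoc-crm/E38_Image"))
print(shorten("http://parthenos.d4science.org/handle/some/resource"))
```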

## Counting Triples

First let's count how many triples there are in the triple store, to get a first idea of the size of the overall dataset.

In [None]:
count=endpoint.select("""
    SELECT (COUNT(*) AS ?count) { ?s ?p ?o .}
""").at[0,"count"]
format(count, ",")
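
Since the data is grouped in many graphs, the same count can also be broken down per graph. The following is a sketch only: it assumes the endpoint exposes its named graphs to anonymous SELECT queries, and the names `TRIPLES_PER_GRAPH` and `triples_per_graph` are made up here.

```python
# Assumption: named graphs are visible to anonymous SELECT queries on the endpoint.
TRIPLES_PER_GRAPH = """
    SELECT ?g (COUNT(*) AS ?count) { GRAPH ?g { ?s ?p ?o . } }
    GROUP BY ?g ORDER BY DESC(?count)
"""


def triples_per_graph(endpoint):
    """Run the per-graph count on any object that offers select(), such as the endpoint above."""
    return endpoint.select(TRIPLES_PER_GRAPH)
```

In the notebook this would be called as `triples_per_graph(endpoint)`; on a large triple store the query may be slow.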

## Counting Predicates

A list of predicates and their frequency.

In [None]:
predicates=endpoint.select("""
    SELECT ?p (COUNT(*) AS ?count) { ?s ?p ?o .} GROUP BY ?p ORDER BY DESC(?count)
""")
predicates

Just give me the number of distinct predicates used:

In [None]:
endpoint.select("""
    SELECT (COUNT(*) AS ?count) { SELECT DISTINCT ?p { ?s ?p ?o .} }
""")

When you have a number of "things" ordered by how prevalent they are, a cumulative distribution function is a great nonparametric method of characterizing the statistics.

In [None]:
predicates["dist"]=predicates["count"].cumsum()/count
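
To see what the cell above computes, here is the same calculation on a toy table. The counts are invented for illustration and the names `toy` and `needed` do not come from the notebook. Each value in the dist column is the share of all triples covered by that predicate together with all more frequent ones.

```python
import pandas as pd

# Made-up counts, already sorted in descending order like the query result.
toy = pd.DataFrame(
    {"count": [50, 25, 15, 10]},
    index=["p1", "p2", "p3", "p4"],
)
total = toy["count"].sum()

# Running share of all triples covered by the top-k predicates.
toy["dist"] = toy["count"].cumsum() / total

# Number of top predicates needed to cover at least 80% of the triples:
# count the rows that stay below the threshold, then add the row that crosses it.
needed = int((toy["dist"] < 0.8).sum()) + 1

print(toy)
print(needed)
```

The same threshold test could be applied to the predicates table to check how many predicates are needed for a given coverage.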

In [None]:
%matplotlib inline
predicates["dist"].plot()

In [None]:
predicates["dist"].head(10).plot()

The top 10 predicates account for around 80% of the triples in the dataset. Which are they?

In [None]:
predicates.head(10)

And which are the least used predicates?
Let's see those used fewer than 20 times:

In [None]:
rare_predicates = predicates[predicates['count']<20]
rare_predicates

## Classes

Let's also get some numbers on the classes. How many instances of each class are there?

In [None]:
types=endpoint.select("""
    SELECT ?type (COUNT(*) AS ?count) { ?s a ?type .} GROUP BY ?type ORDER BY DESC(?count)
""")
types

And the number of distinct classes:

In [None]:
endpoint.select("""
    SELECT (COUNT(*) AS ?count) { SELECT DISTINCT ?type { ?s a ?type .} }
""")

Show me just the CIDOC CRM classes:

In [None]:
types[types.index.str.startswith('crm:')]
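
The same prefix-based filtering can be extended to a summary per namespace. The sketch below uses a toy table with invented counts in place of the real query result, and assumes, as the filter above does, that the index holds prefixed names such as crm:E38_Image.

```python
import pandas as pd

# Invented counts, indexed by prefixed class name like the `types` table.
toy_types = pd.DataFrame(
    {"count": [120, 40, 25, 10]},
    index=[
        "crm:E38_Image",
        "pe:PE1_Service",
        "crm:E39_Actor",
        "crmdig:D1_Digital_Object",
    ],
)

# The text before the first colon is the namespace prefix.
prefix = toy_types.index.map(lambda name: name.split(":", 1)[0])

# Sum the instance counts per prefix, largest namespace first.
per_namespace = (
    toy_types.groupby(prefix)["count"].sum().sort_values(ascending=False)
)
print(per_namespace)
```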

## Instances

Let's have a look at instances of one class, say **crm:E38_Image**:

In [None]:
images = endpoint.select("""
    SELECT ?img { 
        ?img a crm:E38_Image
    } LIMIT 10
""")
images

Render one of the images:

In [None]:
from IPython.display import display, HTML

# Render the image in the row labelled 3 of the result.
HTML('<img src="{0}">'.format(images.at[3,'img']))

Or render all ten images as thumbnails:

In [None]:
htmlimgs = ""
for ix, row in images.iterrows():
    # Access the column by name; positional row[0] is deprecated in recent pandas.
    htmlimgs += '<img src="{0}" style="float:left;height:80px;margin:1em;">'.format(row['img'])
HTML(htmlimgs)
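
If thumbnails are needed in several places, the loop can be wrapped in a small function. This is a sketch, not part of the original notebook: `thumbnails_html` is a hypothetical helper, and it additionally escapes the URLs so that unusual characters cannot break the generated markup.

```python
from html import escape


def thumbnails_html(urls, height_px=80):
    """Build one HTML string with an <img> tag per URL."""
    style = "float:left;height:{0}px;margin:1em;".format(height_px)
    tags = [
        # Escape quotes and angle brackets inside the attribute value.
        '<img src="{0}" style="{1}">'.format(escape(str(url), quote=True), style)
        for url in urls
    ]
    return "".join(tags)


# Placeholder URLs, only to show the shape of the output.
print(thumbnails_html(["http://example.org/a.jpg", "http://example.org/b.jpg"]))
```

In the notebook this would be used as `HTML(thumbnails_html(images['img']))`.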

Now I want to use a value from one result in the next query. The gastrodon library "lets you use Python variables in your SPARQL queries simply by adding ?_ to the name of your variables":

In [None]:
actors = endpoint.select("""
    SELECT ?actor { 
        ?actor a crm:E39_Actor.
        ?actor rdfs:label ?label.
    } LIMIT 10
""")
actor1 = actors.at[1,'actor']

actor1_properties = endpoint.select("""
    SELECT ?p ?o { 
        ?_actor1 ?p ?o.
    } LIMIT 10
""")
actor1_properties