# querying a local triples file using rdflib

## my use case
Examine the kind and number of statements on:
- the Bhagavad Gita written work (http://www.wikidata.org/entity/Q46802) and Annie Besant's English translation (http://www.wikidata.org/entity/Q63196925) in Wikidata
    - I can use the [WD query service](https://query.wikidata.org/) for this
- a [BF Hub/Work](https://id.loc.gov/resources/hubs/02138b5d-0a89-a6d8-1555-7f0731e6ea0c.html) (“Hub/Work”?), [BF Work](https://id.loc.gov/resources/works/1102998.html), and [BF Instance](https://id.loc.gov/resources/instances/1102998.html)
    - I need a way to query these files
    - No LOC endpoint
    - I can easily download the files


### first I need to install some libraries
I believe I could do this in the notebook itself, but I'll use the Anaconda Prompt.
```
(base) C:\Users\Benjamin>pip install rdflib
[...stuff happens...]
(base) C:\Users\Benjamin>pip install rdflib-jsonld
[...more stuff happens...]
```
### next I need to try and figure out what I'm doing

- I'm looking at the [rdflib 6.1.1 documentation](https://rdflib.readthedocs.io/en/stable/index.html) > [Querying with SPARQL](https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html)
- Also, here's a SO post that looks my speed: [Is there a Hello World example for SPARQL with RDFLib?](https://stackoverflow.com/questions/16829351/is-there-a-hello-world-example-for-sparql-with-rdflib)

### query version one

In [7]:
```
import rdflib
```

In [8]:
```
g = rdflib.Graph()
g.parse("testdata/02138b5d-0a89-a6d8-1555-7f0731e6ea0c.rdf", format="xml")
query = "SELECT (COUNT(?property) AS ?statements) WHERE { <http://id.loc.gov/resources/hubs/02138b5d-0a89-a6d8-1555-7f0731e6ea0c> ?property ?value. }"
result = g.query(query)
for row in result:
    print(row)
```

```
(rdflib.term.Literal('109', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)
```

In [9]:
```
g2 = rdflib.Graph()
g2.parse("testdata/1102998.rdf", format="xml")
query2 = "SELECT (COUNT(?property2) AS ?statements2) WHERE { <http://id.loc.gov/resources/works/1102998> ?property2 ?value2. }"
result2 = g2.query(query2)
for row in result2:
    print(row)
```

```
(rdflib.term.Literal('20', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)
```

In [10]:
```
g3 = rdflib.Graph()
g3.parse("testdata/1102998_Instance.rdf", format="xml")
query3 = "SELECT (COUNT(?property2) AS ?statements2) WHERE { <http://id.loc.gov/resources/instances/1102998> ?property2 ?value2. }"
result3 = g3.query(query3)
for row in result3:
    print(row)
```

```
(rdflib.term.Literal('21', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)
```


#### 🔍 query version one: success with very basic queries; repeating lots of things
Things I'd like to improve:
- use one query to get statements on all three resources, rather than repeat the query for each resource/file
    - maybe JOIN?
- To do the above, would I need to load the triples from all three downloaded BF files into one graph?

**RESOURCES**
- rdflib > [Querying with SPARQL](https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html#querying-with-sparql)
- Aha! [Merging graphs](https://rdflib.readthedocs.io/en/stable/merging.html#merging-graphs)

```
from rdflib import Graph
graph = Graph()
graph.parse(input1)
graph.parse(input2)
```

> graph now contains the merged graph of input1 and input2.

*It's that easy!?*

### query version two

In [2]:
```
from rdflib import Graph
```

In [3]:
```
# create a combined graph
cg = Graph()
cg.parse("testdata/02138b5d-0a89-a6d8-1555-7f0731e6ea0c.rdf", format="xml")
cg.parse("testdata/1102998.rdf", format="xml")
cg.parse("testdata/1102998_Instance.rdf", format="xml")
```

```
<Graph identifier=Ndb0686f2a4874568bdc3142430ca9184 (<class 'rdflib.graph.Graph'>)>
```

In [5]:
```
query = "SELECT (COUNT(?property) AS ?statements) WHERE { <http://id.loc.gov/resources/hubs/02138b5d-0a89-a6d8-1555-7f0731e6ea0c> ?property ?value. }"
result = cg.query(query)
for row in result:
    print(row)
query2 = "SELECT (COUNT(?property2) AS ?statements2) WHERE { <http://id.loc.gov/resources/works/1102998> ?property2 ?value2. }"
result2 = cg.query(query2)
for row in result2:
    print(row)
query3 = "SELECT (COUNT(?property2) AS ?statements2) WHERE { <http://id.loc.gov/resources/instances/1102998> ?property2 ?value2. }"
result3 = cg.query(query3)
for row in result3:
    print(row)
```

```
(rdflib.term.Literal('111', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)
(rdflib.term.Literal('20', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)
(rdflib.term.Literal('22', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)
```


#### 🔍 query version two - query one graph 3x instead of three graphs 1x/each
**NOTE** that I got different statement counts than in version one: 111 instead of 109 for the `bf:Hub`, and 22 instead of 21 for the Instance (!?!)  
The next step is to improve my query syntax so that one query can retrieve the statements on the Hub, the Work, and the Instance.

**QUESTIONS/RESOURCES/ETC.**
- I *think* that JOIN is what I need?
- How to check my SPARQL syntax before putting the query string in a var??
- Attempting to get a query structure right in the Wikidata query service but [not having much luck so far...](https://w.wiki/57re)