This notebook was developed on Google Colab: https://colab.research.google.com/drive/1yq8Z92vNAAXD9ExlK1jcJnYIGfBl4_h9?usp=sharing

# Install packages
This notebook requires two packages that are not installed by default: wikidataintegrator and shexer.

In [1]:
%%capture
# install the libraries (the %%capture cell magic has to be the first line of the cell)
!pip install wikidataintegrator
!pip install shexer

# Collect all pathways


In [12]:
from rdflib import Graph
import requests
from bs4 import BeautifulSoup, SoupStrainer
import zipfile
import io
from contextlib import closing
from google.colab import files

def fetch_wprdf():
  temp = Graph()
  url = 'http://data.wikipathways.org/current/rdf'
  page = requests.get(url).text
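  # Collect the archive URLs from the directory listing; a set removes duplicates.
  # Named archive_urls so it does not shadow google.colab.files imported above.
  archive_urls = set()
  for link in BeautifulSoup(page, "lxml", parse_only=SoupStrainer('a')):
      address = str(link).split("\"")
      if len(address) > 1:
          filename = address[1].replace("./", "/")
          if len(filename) > 1:  # skips the bare "./" directory link
              archive_urls.add(url + filename)
  for file in archive_urls: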
      if "rdf-authors" in file:  # archive with the per-pathway author files
          print(file)
          u = requests.get(file)
          with closing(u), zipfile.ZipFile(io.BytesIO(u.content)) as archive:
              for member in archive.infolist():
                  if "_" in str(member.filename):
                      continue
                  print("parsing: " + member.filename)
                  nt_content = archive.read(member)
                  temp.parse(data=nt_content, format="turtle")
          print("size: "+str(len(temp)))

      if "rdf-wp" in file:  # archive with the pathway RDF itself
          print(file)
          u = requests.get(file)
          with closing(u), zipfile.ZipFile(io.BytesIO(u.content)) as archive:
              for member in archive.infolist():
                  nt_content = archive.read(member)
                  temp.parse(data=nt_content.decode(), format="turtle")
          print("size: "+str(len(temp)))
  return temp
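
The archive handling in `fetch_wprdf` can be tried without network access. The following is a minimal sketch, assuming a small zip archive built in memory; the helper `read_members` and the member names are hypothetical and only mirror the pattern used above (open the downloaded bytes with `io.BytesIO`, skip members whose name contains an underscore).

```python
import io
import zipfile


def read_members(zip_bytes, skip_marker="_"):
    """Return {member name: decoded text} for members whose name lacks skip_marker."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.infolist():
            if skip_marker in member.filename:
                continue  # same filter as in the rdf-authors branch
            contents[member.filename] = archive.read(member).decode("utf-8")
    return contents


# Build a tiny archive in memory so the example runs offline (hypothetical content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "authors/WP1.ttl",
        "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    )
    archive.writestr("authors/WP1_r1.ttl", "# skipped: the name contains an underscore")

members = read_members(buffer.getvalue())
print(sorted(members))
```

In the notebook itself, each decoded member is passed on to `Graph.parse(data=..., format="turtle")` instead of being stored in a dictionary.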


In [6]:
wprdf = fetch_wprdf()

http://data.wikipathways.org/current/rdf/wikipathways-20220710-rdf-authors.zip
parsing: authors/WP1.ttl
parsing: authors/WP10.ttl
parsing: authors/WP100.ttl
parsing: authors/WP1000.ttl
parsing: authors/WP1001.ttl
parsing: authors/WP1002.ttl
parsing: authors/WP1004.ttl
parsing: authors/WP1005.ttl
parsing: authors/WP1006.ttl
parsing: authors/WP1007.ttl
parsing: authors/WP1008.ttl
parsing: authors/WP1009.ttl
parsing: authors/WP101.ttl
parsing: authors/WP1010.ttl
parsing: authors/WP1011.ttl
parsing: authors/WP1012.ttl
parsing: authors/WP1013.ttl
parsing: authors/WP1014.ttl
parsing: authors/WP1015.ttl
parsing: authors/WP1016.ttl
parsing: authors/WP1017.ttl
parsing: authors/WP1018.ttl
parsing: authors/WP1019.ttl
parsing: authors/WP102.ttl
parsing: authors/WP1020.ttl
parsing: authors/WP1021.ttl
parsing: authors/WP1022.ttl
parsing: authors/WP1023.ttl
parsing: authors/WP1024.ttl
parsing: authors/WP1025.ttl
parsing: authors/WP1026.ttl
parsing: authors/WP1027.ttl
parsing: authors/WP1028.ttl
parsi

# Extract the Shape Expression
A shape expression (ShEx) is a formal description of the structure of an RDF graph, or of a subset of it.
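
To make the idea concrete, here is a toy sketch in plain Python. It is not ShEx syntax and not how sheXer works internally: the "shape" is a hypothetical dictionary that maps a predicate to the minimum and maximum number of values a node may have, and the predicate names (`ex:title`, `ex:author`) are made up for illustration. Predicates not mentioned in the shape are allowed, which is an assumption of this sketch.

```python
# Toy "shape": predicate -> (min, max) number of values; max=None means unbounded.
pathway_shape = {
    "ex:title": (1, 1),      # exactly one title
    "ex:author": (0, None),  # any number of authors
}


def conforms(node_triples, shape):
    """Check whether the predicate counts of one node fit the toy shape."""
    counts = {}
    for predicate, _obj in node_triples:
        counts[predicate] = counts.get(predicate, 0) + 1
    for predicate, (low, high) in shape.items():
        n = counts.get(predicate, 0)
        if n < low or (high is not None and n > high):
            return False
    return True


good = [("ex:title", "Example pathway"), ("ex:author", "A"), ("ex:author", "B")]
bad = [("ex:title", "First title"), ("ex:title", "Second title")]
print(conforms(good, pathway_shape), conforms(bad, pathway_shape))
```

The extraction step below goes in the opposite direction: instead of checking data against a given shape, it derives the shapes from the data.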

In [14]:
from shexer.shaper import Shaper
from shexer.consts import TURTLE

# Collect every distinct class that is used as an rdf:type in the graph
q = "select distinct ?class where { ?item rdf:type ?class }"
target_classes = [str(row["class"]) for row in wprdf.query(q)]

shex_target_file = "wikipathways.shex"

shaper = Shaper(target_classes=target_classes,
                rdflib_graph=wprdf,
                input_format=TURTLE,
                )  # Default rdf:type
            
shaper.shex_graph(output_file=shex_target_file)

# Print the extracted shape expression
The extracted shape expression is printed below. It lists all the properties found, together with the percentage of items that use each property and the cardinality of each property. For example, for the Wikidata [occupation property](https://www.wikidata.org/wiki/Property:P106):
```
wdt:P106  IRI  *;
            # 79.8941798941799 % obj: IRI. Cardinality: +
            # 44.97354497354497 % obj: IRI. Cardinality: {1}
            # 16.402116402116402 % obj: IRI. Cardinality: {2}
            # 11.11111111111111 % obj: IRI. Cardinality: {3}
            # 3.7037037037037033 % obj: IRI. Cardinality: {4}
            # 1.5873015873015872 % obj: IRI. Cardinality: {8}
            # 1.0582010582010581 % obj: IRI. Cardinality: {6}
            # 0.5291005291005291 % obj: IRI. Cardinality: {5}
            # 0.5291005291005291 % obj: IRI. Cardinality: {10}
```
This states that in Wikidata the occupation is known for approximately 80% of the identified collectors. The cardinality is given as well: the "+" symbol on the 80% line indicates that each of those items has one or more occupations. The remaining lines break this down by exact count. About 45% of all items list exactly one occupation, about 16% list two, and so on; together these percentages add up to the 80% of the "+" line.
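
The way these percentages relate to each other can be reproduced with a few lines of Python. This is a minimal sketch with made-up numbers, not sheXer's implementation: `cardinality_report` is a hypothetical helper, and `counts` holds the number of values of one property for each of ten imaginary items.

```python
from collections import Counter


def cardinality_report(values_per_item):
    """Percentage of items with at least one value ("+") and per exact count."""
    total = len(values_per_item)
    with_value = [n for n in values_per_item if n > 0]
    report = {"+": 100 * len(with_value) / total}
    for n, count in sorted(Counter(with_value).items()):
        report["{%d}" % n] = 100 * count / total
    return report


# Hypothetical data: number of values of one property for each of 10 items.
counts = [1, 1, 1, 1, 2, 2, 3, 0, 0, 0]
report = cardinality_report(counts)
for label, pct in report.items():
    print(f"# {pct} % Cardinality: {label}")
```

All percentages are relative to the total number of items, so the exact-count lines sum to the value of the "+" line, just as in the occupation example above.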



In [15]:
!cat wikipathways.shex

PREFIX : <http://weso.es/shapes/>
PREFIX brick: <https://brickschema.org/schema/Brick#>
PREFIX csvw: <http://www.w3.org/ns/csvw#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dcmitype: <http://purl.org/dc/dcmitype/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcam: <http://purl.org/dc/dcam/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX odrl: <http://www.w3.org/ns/odrl/2/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX prof: <http://www.w3.org/ns/dx/prof/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX ssn: <http

In [16]:

files.download("wikipathways.shex")
