---

Schema Definition for RDF Graph

---

In this notebook, we define the schema for an RDF graph based on the data sources. This involves outlining the structure and relationships of the data to be used in constructing a semantic knowledge graph. The main steps include:

- Schema Design: We design the RDF schema, identifying key entities, attributes, and relationships that will form the basis of the knowledge graph.
- Mapping Data to RDF: We map the cleaned data to the RDF schema, converting it into RDF triples (subject-predicate-object format), as implemented in `KnowledgeGraph.py`.
- Validation: We validate the generated RDF data to ensure it adheres to the defined schema and accurately represents the original data.
- Integration and Storage: Finally, we integrate the RDF data into a triple store or RDF database for further querying and analysis.

In [23]:
#!pip install networkx pyvis rdflib ipython numpy matplotlib pygraphviz

In [1]:
import networkx as nx
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, XSD
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout
from rdflib.plugins.sparql import prepareQuery
import duckdb
import sys
import os

# Path to the project root directory
project_root = './..'
sys.path.append(project_root)
from utils import *

In [25]:
# Create an RDF graph
g = Graph()

# Define the namespaces
ex = Namespace('http://example.org/')
loc = Namespace('http://example.org/location/')
ent = Namespace('http://example.org/entertainment/')
apt = Namespace('http://example.org/apartment/')
inc = Namespace('http://example.org/incident/')
schema = Namespace('http://schema.org/')
# Bind the namespaces to the RDF graph
g.bind('ex', ex)
g.bind('loc', loc)
g.bind('ent', ent)
g.bind('apt', apt)
g.bind('inc', inc)
g.bind('schema', schema)

General Schema Definition

In [26]:
# Define Classes
g.add((loc.Locations, RDF.type, RDFS.Class))
g.add((loc.District, RDF.type, RDFS.Class))
g.add((inc.Incident, RDF.type, RDFS.Class))

# Define Locations subclasses
g.add((ent.Entertainment, RDF.type, RDFS.Class))
g.add((ent.Entertainment, RDFS.subClassOf, loc.Locations))

g.add((apt.Apartment, RDF.type, RDFS.Class))
g.add((apt.Apartment, RDFS.subClassOf, loc.Locations))

# Define isinDistrict property
g.add((loc.isinDistrict, RDF.type, RDF.Property))
g.add((loc.isinDistrict, RDFS.domain, loc.Locations))
g.add((loc.isinDistrict, RDFS.range, loc.District))

# Define Longitude property
g.add((loc.Longitude, RDF.type, RDF.Property))
g.add((loc.Longitude, RDFS.domain, loc.Locations))
g.add((loc.Longitude, RDFS.range, XSD.float))

# Define Latitude property
g.add((loc.Latitude, RDF.type, RDF.Property))
g.add((loc.Latitude, RDFS.domain, loc.Locations))
g.add((loc.Latitude, RDFS.range, XSD.float))

<Graph identifier=Nd153756c57f1489f84295873535b3786 (<class 'rdflib.graph.Graph'>)>

Incident Schema Definition: 

For the incident schema, the variables that will be needed are: 

- any: year
- num_mes: numberMonth
- nom_mes: nameMonth
- area_basica_policial: Reference the District Class (neighbourhood).
- tipus_de_lloc_dels_fets: wherePenalCode
- tipus_de_fet_codi_penal: typePenalCode
- ambit_fet: incidentType
- nombre_victimes: numberVictims

An example of an instance is: 
- any: 2021
- num_mes: 6
- nom_mes: Juny
- area_basica_policial: Nou Barris
- tipus_de_lloc_dels_fets: Via pública urbana
- tipus_de_fet_codi_penal: Amenaces
- ambit_fet: Political orientation
- nombre_victimes: 1.0

In [27]:
# Define happenedAt
g.add((inc.happenedAt, RDF.type, RDF.Property))
g.add((inc.happenedAt, RDFS.domain, inc.Incident))
g.add((inc.happenedAt, RDFS.range, loc.District))

# Define year
g.add((inc.year, RDF.type, RDF.Property))
g.add((inc.year, RDFS.domain, inc.Incident))
g.add((inc.year, RDFS.range, XSD.integer))

# Define numberMonth
g.add((inc.numberMonth, RDF.type, RDF.Property))
g.add((inc.numberMonth, RDFS.domain, inc.Incident))
g.add((inc.numberMonth, RDFS.range, XSD.integer))

# Define nameMonth
g.add((inc.nameMonth, RDF.type, RDF.Property))
g.add((inc.nameMonth, RDFS.domain, inc.Incident))
g.add((inc.nameMonth, RDFS.range, XSD.string))

# Define typePenalCode
g.add((inc.typePenalCode, RDF.type, RDF.Property))
g.add((inc.typePenalCode, RDFS.domain, inc.Incident))
g.add((inc.typePenalCode, RDFS.range, XSD.string))

# Define wherePenalCode
g.add((inc.wherePenalCode, RDF.type, RDF.Property))
g.add((inc.wherePenalCode, RDFS.domain, inc.Incident))
g.add((inc.wherePenalCode, RDFS.range, XSD.string))

# Define incidentType 
g.add((inc.incidentType, RDF.type, RDF.Property))
g.add((inc.incidentType, RDFS.domain, inc.Incident))
g.add((inc.incidentType, RDFS.range, XSD.string))

# Define numberVictims
g.add((inc.numberVictims, RDF.type, RDF.Property))
g.add((inc.numberVictims, RDFS.domain, inc.Incident))
g.add((inc.numberVictims, RDFS.range, XSD.float))

<Graph identifier=Nd153756c57f1489f84295873535b3786 (<class 'rdflib.graph.Graph'>)>

Airbnb Schema Definition

For the Airbnb schema, the variables that will be needed are: 

- neighbourhood: Reference the District Class (neighbourhood).
- latitude: Reference the Latitude property.
- longitude: Reference the Longitude property.
- criminality index: criminalityIndex
- extra_people: extraPeople
- property_type: propertyType
- room_type: roomType
- accommodates: accommodates
- bathrooms: bathrooms
- bedrooms: bedrooms
- beds: beds
- bed_type: bedType
- price (x night): price
- security_deposit: securityDeposit
- cleaning_fee: cleaningFee
- guests_included: guestsIncluded
- review_scores_location (0-10): reviewScoresLocation
- cancellation_policy: cancellationPolicy


An example of an instance is: 

- neighbourhood: Reference the District Class (neighbourhood).
- latitude: Reference the Latitude property.
- longitude: Reference the Longitude property.
- criminality index: criminalityIndex
- extra_people: extraPeople
- property_type: Apartment
- room_type: Private room
- accommodates: 2
- bathrooms: 2.0
- bedrooms: 1.0
- beds: 2.0
- bed_type: Real Bed
- price (x night): 21.0
- security_deposit: 123.0
- cleaning_fee: 54.0
- guests_included: 1
- review_scores_location (0-10): 100.0
- cancellation_policy: moderate
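
Several values in the example (`bathrooms: 2.0`, `bedrooms: 1.0`, `beds: 2.0`) are floats, while the schema cell below declares `xsd:integer` for those properties. How `add_airbnb_instances` handles this is not shown in this notebook. A minimal sketch of one option, assuming whole-number floats should become integers and anything else should be rejected (`to_whole_number` is a hypothetical helper):

```python
def to_whole_number(value):
    """Return `value` as an int if it is a whole number, otherwise raise ValueError."""
    number = float(value)
    if not number.is_integer():  # also rejects NaN and infinity
        raise ValueError(f"{value!r} is not a whole number")
    return int(number)

# Values taken from the example instance above.
example = {"bathrooms": 2.0, "bedrooms": 1.0, "beds": 2.0}
coerced = {name: to_whole_number(value) for name, value in example.items()}
print(coerced)
# The resulting ints could then be wrapped as Literal(value, datatype=XSD.integer).
```

If fractional values such as half bathrooms occur in the data, declaring a float range for that property would be the alternative to coercion.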

In [28]:
# Define criminalityIndex
g.add((apt.criminalityIndex, RDF.type, RDF.Property))
g.add((apt.criminalityIndex, RDFS.domain, apt.Apartment))
g.add((apt.criminalityIndex, RDFS.range, XSD.float))

# Define extraPeople
g.add((apt.extraPeople, RDF.type, RDF.Property))
g.add((apt.extraPeople, RDFS.domain, apt.Apartment))
g.add((apt.extraPeople, RDFS.range, XSD.integer))

# Define propertyType
g.add((apt.propertyType, RDF.type, RDF.Property))
g.add((apt.propertyType, RDFS.domain, apt.Apartment))
g.add((apt.propertyType, RDFS.range, XSD.string))

# Define roomType
g.add((apt.roomType, RDF.type, RDF.Property))
g.add((apt.roomType, RDFS.domain, apt.Apartment))
g.add((apt.roomType, RDFS.range, XSD.string))

# Define accommodates
g.add((apt.accommodates, RDF.type, RDF.Property))
g.add((apt.accommodates, RDFS.domain, apt.Apartment))
g.add((apt.accommodates, RDFS.range, XSD.integer))

# Define bathrooms
g.add((apt.bathrooms, RDF.type, RDF.Property))
g.add((apt.bathrooms, RDFS.domain, apt.Apartment))
g.add((apt.bathrooms, RDFS.range, XSD.integer))

# Define bedrooms
g.add((apt.bedrooms, RDF.type, RDF.Property))
g.add((apt.bedrooms, RDFS.domain, apt.Apartment))
g.add((apt.bedrooms, RDFS.range, XSD.integer))

# Define beds
g.add((apt.beds, RDF.type, RDF.Property))
g.add((apt.beds, RDFS.domain, apt.Apartment))
g.add((apt.beds, RDFS.range, XSD.integer))

# Define bedType
g.add((apt.bedType, RDF.type, RDF.Property))
g.add((apt.bedType, RDFS.domain, apt.Apartment))
g.add((apt.bedType, RDFS.range, XSD.string))

# Define price
g.add((apt.price, RDF.type, RDF.Property))
g.add((apt.price, RDFS.domain, apt.Apartment))
g.add((apt.price, RDFS.range, XSD.float))

# Define securityDeposit
g.add((apt.securityDeposit, RDF.type, RDF.Property))
g.add((apt.securityDeposit, RDFS.domain, apt.Apartment))
g.add((apt.securityDeposit, RDFS.range, XSD.float))

# Define cleaningFee
g.add((apt.cleaningFee, RDF.type, RDF.Property))
g.add((apt.cleaningFee, RDFS.domain, apt.Apartment))
g.add((apt.cleaningFee, RDFS.range, XSD.float))

# Define guestsIncluded
g.add((apt.guestsIncluded, RDF.type, RDF.Property))
g.add((apt.guestsIncluded, RDFS.domain, apt.Apartment))
g.add((apt.guestsIncluded, RDFS.range, XSD.integer))

# Define cancellationPolicy
g.add((apt.cancellationPolicy, RDF.type, RDF.Property))
g.add((apt.cancellationPolicy, RDFS.domain, apt.Apartment))
g.add((apt.cancellationPolicy, RDFS.range, XSD.string))

# Define reviewScoresLocation (currently disabled)
# g.add((apt.reviewScoresLocation, RDF.type, RDF.Property))
# g.add((apt.reviewScoresLocation, RDFS.domain, apt.Apartment))
# g.add((apt.reviewScoresLocation, RDFS.range, XSD.integer))

Tripadvisor Locations Schema Definition

For the Tripadvisor Locations schema, the variables that will be needed are: 

- location_id: (should this perhaps be removed?)

- neighbourhood: Reference the District Class (neighbourhood).
- latitude: Reference the Latitude property.
- longitude: Reference the Longitude property.
- name: name
- address_obj_address_string: adress
- type: typeEnt

An example of an instance is: 

- neighbourhood: Eixample
- latitude: 41.383205
- longitude: 2.162197
- name: Anna Subirats Xarcuteria
- address_obj_address_string: Carrer De Sepulveda, 167, 08011 Barcelona Spain
- type: restaurant

In [29]:
# Define name
g.add((ent.name, RDF.type, RDF.Property))
g.add((ent.name, RDFS.domain, ent.Entertainment))
g.add((ent.name, RDFS.range, XSD.string))

# Define adress
g.add((ent.adress, RDF.type, RDF.Property))
g.add((ent.adress, RDFS.domain, ent.Entertainment))
g.add((ent.adress, RDFS.range, XSD.string))

# Define typeEnt
g.add((ent.typeEnt, RDF.type, RDF.Property))
g.add((ent.typeEnt, RDFS.domain, ent.Entertainment))
g.add((ent.typeEnt, RDFS.range, XSD.string))

<Graph identifier=Nd153756c57f1489f84295873535b3786 (<class 'rdflib.graph.Graph'>)>

Instances Generator: Add the instances from DuckDB to the RDF graph

In [30]:
# Connect to the trusted-zone database
con = duckdb.connect(database='./../data/trusted_zone/barcelona_processed.db')

- Criminal Instances

In [31]:
# Load the criminal dataset and add its instances to the graph
df_criminal = con.execute("SELECT * FROM df_criminal_dataset").fetchdf()
add_criminal_instances(g, loc, inc, ex, df_criminal)

- Airbnb Dataset

In [32]:
# Load the Airbnb listings and add their instances to the graph
df_airbnb = con.execute("SELECT * FROM df_airbnb_listings").fetchdf()
add_airbnb_instances(g, apt, df_airbnb)

- Tripadvisor Datasets: Locations and Restaurants

In [33]:
# Load the Tripadvisor tables --> DOES NOT WORK
#df_tripadvisor_locations = con.execute("SELECT * FROM df_tripadvisor_locations").fetchdf()
#df_tripadvisor_reviews = con.execute("SELECT * FROM df_tripadvisor_reviews").fetchdf()
#add_entertainment_instances(g, ent, df_tripadvisor_reviews, 'locations')
#add_entertainment_instances(g, ent, df_tripadvisor_locations, 'restaurants')

Store the RDF dataset in Turtle format

In [39]:
g.serialize(destination='./../data/explotation_zone/RDFGraph.ttl', format='turtle')

<Graph identifier=Nd153756c57f1489f84295873535b3786 (<class 'rdflib.graph.Graph'>)>

Sanity Check: Verify that the instances were added correctly

In [35]:
print_random_instance(g, inc.Incident)


Random instance of type http://example.org/incident/Incident: http://example.org/incident_0
  22-rdf-syntax-ns#type: http://example.org/incident/Incident
  happenedAt: http://example.org/location/Nou_Barris
  year: 2021.0
  numberMonth: 6.0
  nameMonth: Juny
  typePenalCode: Amenaces
  wherePenalCode: Via pública urbana
  incidentType: Political orientation
  numberVictims: 1.0


In [36]:
con.close()

RDF Graph Visualization

In [37]:
#g = Graph()
#g.parse("./../data_preparation_pipeline/RDFGraph.ttl", format="turtle")
#visualize_rdf_graph(g)

In [38]:
#g = Graph()
#g.parse("./../data_preparation_pipeline/RDFGraph.ttl", format="turtle")
#visualize_hierarchical_rdf_graph(g)