___

<div style="text-align: center;">
  <span style="font-family: 'Playfair Display', serif; font-size: 24px; font-weight: bold;">
    Schema Definition for RDF Graph -> Model Embedding
  </span>
</div>

___

In this notebook, we define the schema for an RDF graph based on the data sources. This involves outlining the structure and relationships of the data to be used in constructing a semantic knowledge graph. The main steps include:

- Schema Design: We design the RDF schema, identifying key entities, attributes, and relationships that will form the basis of the knowledge graph.
- Mapping Data to RDF: We map the cleaned data to the RDF schema, converting it into RDF triples (subject-predicate-object format), as can be seen in `KnowledgeGraph.py`.
- Validation: We validate the generated RDF data to ensure it adheres to the defined schema and accurately represents the original data.
- Integration and Storage: Finally, we integrate the RDF data into a triple store or RDF database for further querying and analysis.

In [1]:
#!pip install networkx pyvis rdflib ipython numpy matplotlib pygraphviz

In [2]:
import networkx as nx
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, XSD
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout
from rdflib.plugins.sparql import prepareQuery
import duckdb
import sys
import os

# Path to the project root directory
project_root = './..'
sys.path.append(project_root)
import utils

In [3]:
# Create the RDF graph
g = Graph()

# Define Namespaces
ex = Namespace('http://example.org/')
loc = Namespace('http://example.org/location/')
ent = Namespace('http://example.org/entertainment/')
apt = Namespace('http://example.org/apartment/')
inc = Namespace('http://example.org/incident/')
schema = Namespace('http://schema.org/')

# Bind the namespaces to the RDF graph
g.bind('ex', ex)
g.bind('loc', loc)
g.bind('ent', ent)
g.bind('apt', apt)
g.bind('inc', inc)
g.bind('schema', schema)

### General Schema Definition

In [4]:
# Define Classes
g.add((loc.Locations, RDF.type, RDFS.Class))
g.add((loc.District, RDF.type, RDFS.Class))
g.add((inc.Incident, RDF.type, RDFS.Class))

# Define Locations subclasses
g.add((ent.Entertainment, RDF.type, RDFS.Class))
g.add((ent.Entertainment, RDFS.subClassOf, loc.Locations))

g.add((apt.Apartment, RDF.type, RDFS.Class))
g.add((apt.Apartment, RDFS.subClassOf, loc.Locations))

# Define isinDistrict property
g.add((loc.isinDistrict, RDF.type, RDF.Property))
g.add((loc.isinDistrict, RDFS.domain, loc.Locations))
g.add((loc.isinDistrict, RDFS.range, loc.District))

# Define hasLocation for apt
g.add((apt.hasLocation, RDF.type, RDF.Property))
g.add((apt.hasLocation, RDFS.domain, apt.Apartment))
g.add((apt.hasLocation, RDFS.range, loc.Locations))

# Define hasLocation for entertainment
g.add((ent.hasLocation, RDF.type, RDF.Property))
g.add((ent.hasLocation, RDFS.domain, ent.Entertainment))
g.add((ent.hasLocation, RDFS.range, loc.Locations))

# Define Longitude property
g.add((loc.Longitude, RDF.type, RDF.Property))
g.add((loc.Longitude, RDFS.domain, loc.Locations))
g.add((loc.Longitude, RDFS.range, XSD.float))

# Define Latitude property
g.add((loc.Latitude, RDF.type, RDF.Property))
g.add((loc.Latitude, RDFS.domain, loc.Locations))
g.add((loc.Latitude, RDFS.range, XSD.float))

<Graph identifier=N22212a1ddc3b4c018e42dd540a06f7b4 (<class 'rdflib.graph.Graph'>)>

### Incident Schema Definition

For the incident schema, the variables that will be needed are: 

- nom_mes: nameMonth
- neighbourhood: references an instance of the District class.
- type_crime: incidentType
- nombre_victimes: numberVictims
- criminality_index: criminalityIndex

An example of an instance is: 
- nom_mes: Juny
- neighbourhood: Nou Barris
- type_crime: Political orientation
- nombre_victimes: 1.0
- criminality_index: 0.053398

In [5]:
# Define happenedAt
g.add((inc.happenedAt, RDF.type, RDF.Property))
g.add((inc.happenedAt, RDFS.domain, inc.Incident))
g.add((inc.happenedAt, RDFS.range, loc.District))

"""# Define year
g.add((inc.year, RDF.type, RDF.Property))
g.add((inc.year, RDFS.domain, inc.Incident))
g.add((inc.year, RDFS.range, XSD.integer))

# Define numberMonth
g.add((inc.numberMonth, RDF.type, RDF.Property))
g.add((inc.numberMonth, RDFS.domain, inc.Incident))
g.add((inc.numberMonth, RDFS.range, XSD.integer))

# Define typePenalCode
g.add((inc.typePenalCode, RDF.type, RDF.Property))
g.add((inc.typePenalCode, RDFS.domain, inc.Incident))
g.add((inc.typePenalCode, RDFS.range, XSD.string))

# Define wherePenalCode
g.add((inc.wherePenalCode, RDF.type, RDF.Property))
g.add((inc.wherePenalCode, RDFS.domain, inc.Incident))
g.add((inc.wherePenalCode, RDFS.range, XSD.string))
"""

# Define nameMonth
g.add((inc.nameMonth, RDF.type, RDF.Property))
g.add((inc.nameMonth, RDFS.domain, inc.Incident))
g.add((inc.nameMonth, RDFS.range, XSD.string))

# Define incidentType 
g.add((inc.incidentType, RDF.type, RDF.Property))
g.add((inc.incidentType, RDFS.domain, inc.Incident))
g.add((inc.incidentType, RDFS.range, XSD.string))

# Define numberVictims
g.add((inc.numberVictims, RDF.type, RDF.Property))
g.add((inc.numberVictims, RDFS.domain, inc.Incident))
g.add((inc.numberVictims, RDFS.range, XSD.float))

# Define criminalityIndex
g.add((inc.criminalityIndex, RDF.type, RDF.Property))
g.add((inc.criminalityIndex, RDFS.domain, inc.Incident))
g.add((inc.criminalityIndex, RDFS.range, XSD.float))

<Graph identifier=N22212a1ddc3b4c018e42dd540a06f7b4 (<class 'rdflib.graph.Graph'>)>

### Airbnb Schema Definition

For the Airbnb schema, the variables that will be needed are: 

- name: name
- host_id: hostId
- host_since: hostSince
- host_total_listings_count: hostTotalListingsCount
- host_verifications: hostVerifications
- neighbourhood: references an instance of the District class.
- latitude: uses the shared Latitude property.
- longitude: uses the shared Longitude property.
- property_type: propertyType
- room_type: roomType
- accommodates: accommodates
- bathrooms: bathrooms
- bedrooms: bedrooms
- beds: beds
- bed_type: bedType
- price (x night): price
- security_deposit: securityDeposit
- cleaning_fee: cleaningFee
- guests_included: guestsIncluded
- extra_people: extraPeople
- minimum_nights: minimumNights
- maximum_nights: maximumNights
- number_of_reviews: numberOfReviews
- cancellation_policy: cancellationPolicy
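
Most of the property names above follow from the column names by a snake_case to camelCase conversion (the `price (x night)` column is a manual exception). A small helper, with the hypothetical name `snake_to_camel`, illustrates the convention:

```python
def snake_to_camel(column: str) -> str:
    """Convert a snake_case column name to a camelCase property name."""
    head, *tail = column.split("_")
    return head + "".join(part.capitalize() for part in tail)


columns = ["name", "host_id", "host_total_listings_count", "number_of_reviews"]
property_names = [snake_to_camel(column) for column in columns]
```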


An example of an instance is: 

- name: Piso cerca del Cam Nou.
- host_id: 38925857
- host_since: 2015-07-19
- host_total_listings_count: 1.0
- host_verifications: email, phone, reviews, jumio
- neighbourhood: Les Corts
- latitude: 41.378150177192396
- longitude: 2.122075545481956
- property_type: Apartment
- room_type: Private room
- accommodates: 2
- bathrooms: 2.0
- bedrooms: 1.0
- beds: 2.0
- bed_type: Real Bed
- price (x night): 21.0
- security_deposit: 123.0
- cleaning_fee: 54.0
- guests_included: 1
- extra_people: 7
- minimum_nights: 1
- maximum_nights: 1125
- number_of_reviews: 1
- cancellation_policy: moderate

In [6]:
# Define name
g.add((apt.name, RDF.type, RDF.Property))
g.add((apt.name, RDFS.domain, apt.Apartment))
g.add((apt.name, RDFS.range, XSD.string))

# Define hostId
g.add((apt.hostId, RDF.type, RDF.Property))
g.add((apt.hostId, RDFS.domain, apt.Apartment))
g.add((apt.hostId, RDFS.range, XSD.integer))

# Define hostSince
g.add((apt.hostSince, RDF.type, RDF.Property))
g.add((apt.hostSince, RDFS.domain, apt.Apartment))
g.add((apt.hostSince, RDFS.range, XSD.string))

# Define hostTotalListingsCount
g.add((apt.hostTotalListingsCount, RDF.type, RDF.Property))
g.add((apt.hostTotalListingsCount, RDFS.domain, apt.Apartment))
g.add((apt.hostTotalListingsCount, RDFS.range, XSD.float))

# Define hostVerifications
g.add((apt.hostVerifications, RDF.type, RDF.Property))
g.add((apt.hostVerifications, RDFS.domain, apt.Apartment))
g.add((apt.hostVerifications, RDFS.range, XSD.string))

# Define propertyType
g.add((apt.propertyType, RDF.type, RDF.Property))
g.add((apt.propertyType, RDFS.domain, apt.Apartment))
g.add((apt.propertyType, RDFS.range, XSD.string))

# Define roomType
g.add((apt.roomType, RDF.type, RDF.Property))
g.add((apt.roomType, RDFS.domain, apt.Apartment))
g.add((apt.roomType, RDFS.range, XSD.string))

# Define accommodates
g.add((apt.accommodates, RDF.type, RDF.Property))
g.add((apt.accommodates, RDFS.domain, apt.Apartment))
g.add((apt.accommodates, RDFS.range, XSD.integer))

# Define bathrooms
g.add((apt.bathrooms, RDF.type, RDF.Property))
g.add((apt.bathrooms, RDFS.domain, apt.Apartment))
g.add((apt.bathrooms, RDFS.range, XSD.integer))

# Define bedrooms
g.add((apt.bedrooms, RDF.type, RDF.Property))
g.add((apt.bedrooms, RDFS.domain, apt.Apartment))
g.add((apt.bedrooms, RDFS.range, XSD.integer))

# Define beds
g.add((apt.beds, RDF.type, RDF.Property))
g.add((apt.beds, RDFS.domain, apt.Apartment))
g.add((apt.beds, RDFS.range, XSD.integer))

# Define bedType
g.add((apt.bedType, RDF.type, RDF.Property))
g.add((apt.bedType, RDFS.domain, apt.Apartment))
g.add((apt.bedType, RDFS.range, XSD.string))

# Define price
g.add((apt.price, RDF.type, RDF.Property))
g.add((apt.price, RDFS.domain, apt.Apartment))
g.add((apt.price, RDFS.range, XSD.string))

# Define securityDeposit
g.add((apt.securityDeposit, RDF.type, RDF.Property))
g.add((apt.securityDeposit, RDFS.domain, apt.Apartment))
g.add((apt.securityDeposit, RDFS.range, XSD.float))

# Define cleaningFee
g.add((apt.cleaningFee, RDF.type, RDF.Property))
g.add((apt.cleaningFee, RDFS.domain, apt.Apartment))
g.add((apt.cleaningFee, RDFS.range, XSD.float))

# Define guestsIncluded
g.add((apt.guestsIncluded, RDF.type, RDF.Property))
g.add((apt.guestsIncluded, RDFS.domain, apt.Apartment))
g.add((apt.guestsIncluded, RDFS.range, XSD.integer))

# Define extraPeople
g.add((apt.extraPeople, RDF.type, RDF.Property))
g.add((apt.extraPeople, RDFS.domain, apt.Apartment))
g.add((apt.extraPeople, RDFS.range, XSD.integer))

# Define minimumNights
g.add((apt.minimumNights, RDF.type, RDF.Property))
g.add((apt.minimumNights, RDFS.domain, apt.Apartment))
g.add((apt.minimumNights, RDFS.range, XSD.integer))

# Define maximumNights
g.add((apt.maximumNights, RDF.type, RDF.Property))
g.add((apt.maximumNights, RDFS.domain, apt.Apartment))
g.add((apt.maximumNights, RDFS.range, XSD.integer))

# Define numberOfReviews
g.add((apt.numberOfReviews, RDF.type, RDF.Property))
g.add((apt.numberOfReviews, RDFS.domain, apt.Apartment))
g.add((apt.numberOfReviews, RDFS.range, XSD.integer)) 

# Define cancellationPolicy
g.add((apt.cancellationPolicy, RDF.type, RDF.Property))
g.add((apt.cancellationPolicy, RDFS.domain, apt.Apartment))
g.add((apt.cancellationPolicy, RDFS.range, XSD.string))

<Graph identifier=N22212a1ddc3b4c018e42dd540a06f7b4 (<class 'rdflib.graph.Graph'>)>

### Tripadvisor Locations Schema Definition

For the Tripadvisor Locations schema, the variables that will be needed are: 

- location_id: locationID
- name: name
- type: typeEnt
- neighbourhood: references an instance of the District class.
- latitude: uses the shared Latitude property.
- longitude: uses the shared Longitude property.

An example of an instance is: 

- location_id: 27200794
- name: Anna Subirats Xarcuteria
- type: restaurant
- neighbourhood: Eixample
- latitude: 41.383205
- longitude: 2.162197

In [7]:
# Define locationID
g.add((ent.locationID, RDF.type, RDF.Property))
g.add((ent.locationID, RDFS.domain, ent.Entertainment))
g.add((ent.locationID, RDFS.range, XSD.integer))

# Define name
g.add((ent.name, RDF.type, RDF.Property))
g.add((ent.name, RDFS.domain, ent.Entertainment))
g.add((ent.name, RDFS.range, XSD.string))

# Define typeEnt
g.add((ent.typeEnt, RDF.type, RDF.Property))
g.add((ent.typeEnt, RDFS.domain, ent.Entertainment))
g.add((ent.typeEnt, RDFS.range, XSD.string))

<Graph identifier=N22212a1ddc3b4c018e42dd540a06f7b4 (<class 'rdflib.graph.Graph'>)>

### Tripadvisor Reviews Schema Definition

For the Tripadvisor Reviews schema, the variables that will be needed are: 

- location_id: locationID
- rating: rating
- text: text
- title: title

An example of an instance is: 

- location_id: 8821646
- rating: 5
- text: Stopped for a light bites chicken, chips, rice...
- title: Good value tasty food.

In [8]:
# Define rating
g.add((ent.rating, RDF.type, RDF.Property))
g.add((ent.rating, RDFS.domain, ent.Entertainment))
g.add((ent.rating, RDFS.range, XSD.float))

# Define text
g.add((ent.text, RDF.type, RDF.Property))
g.add((ent.text, RDFS.domain, ent.Entertainment))
g.add((ent.text, RDFS.range, XSD.string))

# Define title
g.add((ent.title, RDF.type, RDF.Property))
g.add((ent.title, RDFS.domain, ent.Entertainment))
g.add((ent.title, RDFS.range, XSD.string))

<Graph identifier=N22212a1ddc3b4c018e42dd540a06f7b4 (<class 'rdflib.graph.Graph'>)>

Instance Generator: add the instances from the DuckDB database to the RDF graph

In [9]:
# Connect to the DuckDB database
con = duckdb.connect(database='./../data/explotation_zone/barcelona_processed_emb.db')

- Criminal Instances

In [10]:
# Load the criminal dataset and add its instances to the graph
df_criminal = con.execute("SELECT * FROM df_criminal_dataset").fetchdf()
utils.add_criminal_instances(g, loc, inc, ex, df_criminal)

- Airbnb Dataset

In [11]:
# Load the Airbnb listings and add their instances to the graph
df_airbnb = con.execute("SELECT * FROM df_airbnb_listings").fetchdf()
utils.add_airbnb_instances(g, loc, apt, df_airbnb, mode = 'model')

- Tripadvisor Datasets: Locations and Reviews

In [12]:
# Load the Tripadvisor locations and reviews and add their instances to the graph
df_tripadvisor_locations = con.execute("SELECT * FROM df_tripadvisor_locations").fetchdf()
df_tripadvisor_reviews = con.execute("SELECT * FROM df_tripadvisor_reviews").fetchdf()
utils.add_entertainment_instances(g, loc, ent, df_tripadvisor_locations, 'loc')
utils.add_entertainment_instances(g, loc, ent, df_tripadvisor_reviews, 'rev')

Serialize the RDF graph to Turtle format

In [13]:
g.serialize(destination='./../data/explotation_zone/RDFGraph_Model_emb.ttl', format='turtle')

<Graph identifier=N22212a1ddc3b4c018e42dd540a06f7b4 (<class 'rdflib.graph.Graph'>)>

In [14]:
con.close()

Sanity check: verify that the instances were added correctly

In [18]:
utils.print_random_detailed_instance(g, inc.Incident)


Random instance of type http://example.org/incident/Incident: http://example.org/incident_122
Properties of the selected instance:
  rdf:type: inc:Incident (URI)
  inc:happenedAt: loc:Eixample (URI)
  inc:nameMonth: Març (Literal)
  inc:incidentType: LGBTI-phobia (Literal)
  inc:numberVictims: 1.0 (Literal)
  inc:criminalityIndex: 0.1925566343042071 (Literal)

Related instances and their properties:

Properties of inc:Incident:
  rdf:type: rdfs:Class (URI)

Properties of loc:Eixample:
  rdf:type: loc:District (URI)
  rdfs:label: Eixample (Literal)


### RDF Graph Visualization

In [16]:
#g = Graph()
#g.parse("./../data_preparation_pipeline/RDFGraph.ttl", format="turtle")
#visualize_rdf_graph(g)

In [17]:
#g = Graph()
#g.parse("./../data_preparation_pipeline/RDFGraph.ttl", format="turtle")
#visualize_hierarchical_rdf_graph(g)