Data Model

Please note: this diagram should not be interpreted as a UML class diagram. It is a loose sketch of entities and relationships, so do not read details into it that were never specified, e.g.:

specific datatypes: int vs long
visibility of attributes

In the future, we should make both an Entity-Relationship and a Class Diagram out of this.

This shows the "normalized" version of the schema. We have not yet evaluated how it would perform if implemented exactly like this. When designing the actual schema from this rough draft, we should optimize for time efficiency first; better space efficiency would be a welcome bonus. For example, resource types can be squeezed into a single field within Resource, since OntologyClass definitions are held elsewhere anyway.
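As a minimal sketch of the "single field" idea above, the snippet below collapses one-row-per-type join data into a delimited string on the resource. All names here (`normalized_rows`, `Resource.types`, `pack_types`) and the `|` separator are illustrative assumptions, not part of the draft schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical normalized layout: one (res_id, class_name) row per type.
normalized_rows: List[Tuple[int, str]] = [
    (1, "Person"),
    (1, "Politician"),
    (2, "Location"),
]


@dataclass
class Resource:
    """Hypothetical denormalized resource with all types in one field."""
    res_id: int
    uri: str
    types: str  # assumed format: class names joined by "|"

    def type_list(self) -> List[str]:
        # An empty field means the resource has no known type.
        return self.types.split("|") if self.types else []


def pack_types(res_id: int, rows: List[Tuple[int, str]]) -> str:
    """Collapse the join-table rows of one resource into a single field."""
    return "|".join(name for rid, name in rows if rid == res_id)


obama = Resource(1, "dbpedia:Obama", pack_types(1, normalized_rows))
```

Reading the types then needs no join, at the cost of parsing the field and of losing referential integrity with the class definitions.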

Other desired features:

Think about the streaming learning use case: which design would make it easier to update the model with new resources, new context statistics, new candidate mappings, etc.?
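To make the streaming use case concrete, here is a rough sketch of a store that can absorb one new annotated occurrence at a time. The class name `StreamingStats`, its method `observe`, and the in-memory layout are assumptions made for illustration only; they do not describe the actual backend.

```python
from collections import Counter, defaultdict


class StreamingStats:
    """Hypothetical in-memory store that is updated one occurrence at a time."""

    def __init__(self):
        # res_id -> Counter(token -> count): context statistics
        self.context = defaultdict(Counter)
        # surface form -> Counter(res_id -> count): candidate mappings
        self.candidates = defaultdict(Counter)

    def observe(self, surface_form, res_id, tokens):
        """Fold a single annotated occurrence into the running statistics."""
        self.candidates[surface_form][res_id] += 1
        self.context[res_id].update(tokens)  # unseen resources are created lazily


stats = StreamingStats()
stats.observe("Obama", "dbpedia:Obama", ["president", "election"])
stats.observe("President Obama", "dbpedia:Obama", ["president", "senate"])
```

In this layout a new resource, a new candidate mapping, or a new context count each touch only one entry, which is the property the question above is asking for.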

Glossary

Resource: a resource is any entity or concept in our target knowledge base (e.g. DBpedia). We take this name from RDF (Resource Description Framework), as a generic name for things, concepts, ideas "that can be identified on the Web, even when they cannot be directly retrieved on the Web."
OntologyClass: an ontology class represents a set of resources sharing similar characteristics. Resources can be of several types: Person, Organisation, Location, FloweringPlant, etc. All of these classes are organized in a domain model (i.e. schema, ontology). The "type" or the "ontology class" of a resource comes from this ontology.
SurfaceForm: a surface form is the phrase used to refer to a resource in text. For example: "Barack Obama", "President Obama" and "Obama" are all surface forms for the resource dbpedia:Obama.
Context: the context refers to "the parts of something written or spoken that immediately precede and follow a word or passage and clarify its meaning." (source)
Token: a token is an individual element extracted when tokenizing the text. Tokens are the individual words in the context, or slightly modified versions of these words (e.g. running -> run).
Topic: a topic is a broad categorization of knowledge into areas of interest. For example, text can belong to Business, Politics, Sports or Arts topics.
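The relationship between surface forms and resources can be illustrated with the example from the glossary. This is only a sketch: the names `surface_forms` and `candidates` are hypothetical, and a real mapping would typically hold several candidate resources per phrase.

```python
from collections import defaultdict

# surface form -> set of candidate resources (hypothetical structure)
surface_forms = defaultdict(set)

# The three surface forms from the glossary all refer to the same resource.
for phrase in ("Barack Obama", "President Obama", "Obama"):
    surface_forms[phrase].add("dbpedia:Obama")


def candidates(phrase):
    """Return the candidate resources for a phrase, or an empty list."""
    # .get() avoids creating an empty entry for unknown phrases.
    return sorted(surface_forms.get(phrase, ()))
```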

Draft Diagram

[Figure: Draft Data Model]

Implementation issues

The context table is our main bottleneck because of its size. We should study efficient approaches to implement it, e.g.:

join table approach: (res_id,token_id,count)
hash approach: hash(res_id+token_id) -> count
postings list approach: res_id -> (token_id: count, token_id: count, ...)
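The three layouts listed above can be compared side by side on toy data. This is a sketch under assumptions: the sample rows are made up, and a tuple key stands in for `hash(res_id+token_id)`, since the exact key encoding is not specified here.

```python
from collections import defaultdict

# Toy data, invented for illustration: (res_id, token_id, count)
rows = [(1, 10, 3), (1, 11, 1), (2, 10, 5)]


# 1. Join table approach: flat rows; a lookup is a scan here
#    (a database would use an index on (res_id, token_id) instead).
def join_lookup(rows, res_id, token_id):
    for r, t, c in rows:
        if r == res_id and t == token_id:
            return c
    return 0


# 2. Hash approach: one composite key per (resource, token) pair.
hashed = {(r, t): c for r, t, c in rows}

# 3. Postings list approach: all token counts of a resource stored together.
postings = defaultdict(dict)
for r, t, c in rows:
    postings[r][t] = c
```

The hash layout favors single-pair lookups, while the postings layout returns the whole context of a resource in one read, which matters when scoring a resource against many tokens at once.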

DBpedia Spotlight - Shedding Light on the Web of Documents
