This repository has been archived by the owner on Oct 20, 2018. It is now read-only.

Data Model

sandroacoelho edited this page Jul 26, 2013 · 10 revisions

Please note: this diagram should not be interpreted as a UML class diagram. It is loosely designed to show entities and relationships. Do not read details into it that were deliberately left out, e.g.:

  • specific datatypes: int vs long
  • visibility of attributes

In the future, we should make both an Entity-Relationship and a Class Diagram out of this.

This shows the "normalized" version of the schema. We have not yet evaluated how it would perform if implemented exactly like this. When designing the actual schema from this rough draft, we should optimize for time efficiency first; space efficiency is a welcome bonus. For example, resource types can be squeezed into a single field within Resource, since OntologyClass definitions are held elsewhere anyway.
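As a rough illustration of squeezing resource types into a single field, here is a minimal sketch. The field names (`id`, `uri`, `types`), the `|` separator and the `Politician` class are assumptions made for this example and are not taken from the diagram.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    id: int
    uri: str
    # Hypothetical denormalized field: ontology class names joined by "|"
    # instead of one row per (resource, class) pair in a separate table.
    types: str

    def type_list(self):
        # An empty string means the resource has no known types.
        return self.types.split("|") if self.types else []


# "Politician" is a made-up class name used only for illustration.
obama = Resource(1, "dbpedia:Obama", "Person|Politician")
print(obama.type_list())
```

This trades the flexibility of a join table for a single read per resource; whether that is actually faster would still need to be measured.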

Other desired features:

  • Think about the streaming learning use case: which design would make it easier to update with new resources, new context statistics, new candidate mappings, etc.?

Glossary

  • Resource: a resource is any entity or concept in our target knowledge base (e.g. DBpedia). We take this name from RDF (Resource Description Framework), as a generic name for things, concepts, ideas "that can be identified on the Web, even when they cannot be directly retrieved on the Web."
  • OntologyClass: an ontology class represents a set of resources sharing similar characteristics. Resources can be of several types: Person, Organisation, Location, FloweringPlant, etc. All of these classes are organized in a domain model (i.e. schema, ontology). The "type" or the "ontology class" of a resource comes from this ontology.
  • SurfaceForm: a surface form is the phrase used to refer to a resource in text. For example: "Barack Obama", "President Obama" and "Obama" are all surface forms for the resource dbpedia:Obama.
  • Context: the context refers to "the parts of something written or spoken that immediately precede and follow a word or passage and clarify its meaning." (source)
  • Token: each individual element extracted after tokenizing the text (more). Tokens are the individual words in the context, or slightly modified versions of these words (e.g. running -> run).
  • Topic: a topic is a broad categorization of knowledge into areas of interest. For example, text can belong to Business, Politics, Sports or Arts topics.

Draft Diagram

Draft Data Model

Implementation issues

The context table is our main bottleneck because it is very large. We should study efficient approaches to implement it, e.g.:

  • join table approach: (res_id,token_id,count)
  • hash approach: hash(res_id+token_id) -> count
  • postings list approach: res_id -> (token_id: count, token_id: count, ...)
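
The three approaches above can be compared with a small in-memory sketch. This is only an illustration under assumptions: the sample ids and counts are invented, plain Python containers stand in for whatever store is eventually chosen, and the hash key uses a `:` separator (an assumption) so that pairs such as (1, 23) and (12, 3) do not collide.

```python
from collections import defaultdict

# Hypothetical sample data: (res_id, token_id, count) triples.
rows = [(1, 10, 3), (1, 11, 1), (2, 10, 5)]

# Join table approach: one row per (res_id, token_id) pair.
join_table = list(rows)


def join_lookup(res_id, token_id):
    # Linear scan for illustration; a real table would rely on an index.
    for r, t, c in join_table:
        if r == res_id and t == token_id:
            return c
    return 0


# Hash approach: a single key built from both ids maps to the count.
def make_key(res_id, token_id):
    return f"{res_id}:{token_id}"


hash_table = {make_key(r, t): c for r, t, c in rows}


def hash_lookup(res_id, token_id):
    return hash_table.get(make_key(res_id, token_id), 0)


# Postings list approach: res_id -> {token_id: count, ...}
postings = defaultdict(dict)
for r, t, c in rows:
    postings[r][t] = c


def postings_lookup(res_id, token_id):
    # .get() avoids creating empty entries for unknown resources.
    return postings.get(res_id, {}).get(token_id, 0)
```

All three return the same count for a given pair. The postings list additionally yields every token of a resource in one lookup (`postings[1]`), which may matter if whole contexts are read at once.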