Lookup using incremental indexing

Lookup Application

This application is an improved and DBpedia Databus compatible version of the DBpedia Lookup Service. The planned steps executed by the Docker image are the following:

  • The user supplies a YAML Configuration and a Databus Collection
  • The DBpedia Download Service loads the RDF data specified by the collection
  • The Indexing process parses the RDF data and creates a Lucene Index using the YAML Configuration
  • A Java Servlet is started on a Tomcat Server that accepts queries and executes searches on the Lucene Index as specified in the YAML Configuration

How to run it

Clone the git repository

git clone https://github.com/dbpedia/lookup-application.git

Build the docker image

docker build -t lookup .

Adjust the Docker Compose file and the application configuration, then run docker-compose:

docker-compose up

Docker Compose

services:
  dbpedia-lookup:
    image: lookup:latest
    ports:
      - 9273:8080
    environment:
      - COLLECTION=https://databus.dbpedia.org/collections/dbpedia/databus
      - DATAPATH=/root/data/
    volumes:
      - ./app-config.yml:/root/app-config.yml
      - ./local-files:/root/data/

The Docker Compose file loads the latest lookup Docker image and exposes the service on a configurable port (9273 in the example above). The following environment variables can be configured:

  • COLLECTION (optional): The URI of a published databus collection. You can learn more about databus collections here
  • DATAPATH: The folder relative to the docker container root where the data can be found

The lookup application looks for a configuration file at /root/app-config.yml. To use your own application configuration, overwrite the default one by mounting your file into the container as a volume (see the Docker Compose file above).

Additionally, you can mount local data as a volume at the path specified in the DATAPATH environment variable, where the lookup application will be able to find it.

YAML Application Configuration

This is the example YAML Configuration that is present in the Docker container. Because the configuration is specific to your data, you will have to overwrite it with your own configuration (see the description above) in almost all cases.

version: "1.0"
indexConfig:
  indexPath: /root/lucene-index
  cacheSize: 100000
  commitInterval: 100000
  indexFields:
    - fieldName: label
      resourceName: artifact
      query: >
        SELECT DISTINCT ?artifact ?label WHERE {
          ?artifact ^<http://dataid.dbpedia.org/ns/core#artifact> ?dataset .
          ?dataset <http://www.w3.org/2000/01/rdf-schema#label> ?label.
        }
    - fieldName: comment
      resourceName: artifact
      query: >
        SELECT DISTINCT ?artifact ?comment WHERE {
          ?artifact ^<http://dataid.dbpedia.org/ns/core#artifact> ?dataset .
          ?dataset <http://www.w3.org/2000/01/rdf-schema#comment> ?comment.
        }
    - fieldName: typeName
      resourceName: resource
      query: >
        SELECT DISTINCT ?resource (REPLACE(STR(?type), "(.*)(/|#)", "") AS ?typeName) WHERE{
          ?resource a ?type.
        }

queryConfig:
  exactMatchBoost: 5
  prefixMatchBoost: 2
  fuzzyMatchBoost: 1
  fuzzyEditDistance: 2
  fuzzyPrefixLength: 2
  maxResults: 100
  fieldFormat: DEFAULT
  minRelevanceScore: 0.1
  queryFields:
    - fieldName: label
      weight: 1.0
      highlight: true
      queryByDefault: true
    - fieldName: comment
      weight: 0.5
      highlight: true
      queryByDefault: true
    - fieldName: typeName
      weight: 0.1
      highlight: false
      required: true

The Configuration is split into the index configuration and the query configuration.

Index Configuration

indexPath The path relative to the container root where the Lucene index will be located

cacheSize The size of the document cache, in number of documents. The indexer not only creates documents but also updates them. Documents that are updated frequently are kept in a cache for faster retrieval

commitInterval The number of updates to the index before changes are committed

indexFields This configuration field is the most important one for the indexing process and consists of a list of index fields. Each index field has the following sub-fields

  • fieldName The name of the field. This will be used to fetch the field value from the SPARQL query as well as for the field name in the index.
  • resourceName The SPARQL variable name of the resource to index
  • query Before indexing, all RDF data is stored in a local TDB2 database. The Lookup indexer indexes resource URIs (the document IDs) together with string-valued fields. The indexable URI-value pairs are selected by a SPARQL query, which has to return a result set with two columns whose names match fieldName and resourceName. Please refer to the default configuration above for examples.
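
The two-column contract described above can be checked before starting an indexing run. The following is a minimal sketch and not part of the lookup application: projected_variables and check_index_field are hypothetical helper names, and the regex-based parsing only handles simple SELECT ... WHERE queries like the ones in the default configuration (a string literal with unbalanced parentheses would confuse it).

```python
import re

# Tokens of interest inside a SELECT projection: parentheses,
# "AS ?alias" bindings, and plain "?variable" references.
TOKEN = re.compile(r"\(|\)|\bAS\s+\?(\w+)|\?(\w+)", re.IGNORECASE)


def projected_variables(query):
    """Return the variable names projected by a simple SELECT query."""
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.*?)\s*WHERE",
        query,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        raise ValueError("expected a SELECT ... WHERE query")
    names = []
    depth = 0
    for token in TOKEN.finditer(match.group(1)):
        text = token.group(0)
        if text == "(":
            depth += 1
        elif text == ")":
            depth -= 1
        elif token.group(1):
            # "(expression AS ?alias)" projects the alias.
            names.append(token.group(1))
        elif depth == 0:
            # A bare "?variable" outside any expression is projected as-is.
            names.append(token.group(2))
    return names


def check_index_field(field):
    """True if the query projects exactly fieldName and resourceName."""
    expected = {field["fieldName"], field["resourceName"]}
    return set(projected_variables(field["query"])) == expected


# The typeName entry from the default configuration.
type_name_field = {
    "fieldName": "typeName",
    "resourceName": "resource",
    "query": """
        SELECT DISTINCT ?resource (REPLACE(STR(?type), "(.*)(/|#)", "") AS ?typeName) WHERE{
          ?resource a ?type.
        }
    """,
}
```

Calling check_index_field(type_name_field) reports whether the entry satisfies the contract. A real check would use a SPARQL parser instead of regular expressions.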

Query Configuration

exactMatchBoost The multiplier applied to the retrieval score when a search term matches a field exactly (e.g. applied to a result with the field "DBpedia" when searching for "DBpedia")

prefixMatchBoost The multiplier applied to the retrieval score when a search term is a prefix of a field (e.g. applied to a result with the field "DBpedia" when searching for "DBp")

fuzzyMatchBoost The multiplier applied to the retrieval score when a search term matches a field with some minor mistakes (e.g. applied to a result with the field "DBpedia" when searching for "DBpodia")

fuzzyEditDistance The number of mistakes a search term may contain and still be considered a fuzzy match of a field (e.g. "DBpedia" and "DBpodia" have an edit distance of 1). The maximum value of this parameter is 2.

fuzzyPrefixLength This is the number of characters at the start of a term that must be identical (not fuzzy) to the query term if the query is to match that term.
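
To make the interaction of fuzzyEditDistance and fuzzyPrefixLength concrete, here is a small sketch. It is an illustration only: it uses plain Levenshtein distance on raw strings, while the actual matching is done by Lucene and may differ in detail (for example in how transpositions or analyzed tokens are handled). The function names levenshtein and is_fuzzy_match are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def is_fuzzy_match(search_term, field_term, edit_distance=2, prefix_length=2):
    """True if the first prefix_length characters are identical and the
    terms are at most edit_distance edits apart."""
    if search_term[:prefix_length] != field_term[:prefix_length]:
        return False
    return levenshtein(search_term, field_term) <= edit_distance
```

With the default values from the example configuration, is_fuzzy_match("DBpodia", "DBpedia") holds (one substitution, identical "DB" prefix), while "XBpedia" is rejected because its first two characters differ from "DB".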

maxResults The maximum number of results returned in a single search

minRelevanceScore The minimum score a document has to receive in a certain search to appear in the result set. This is helpful when dealing with required query fields with weights equal to zero - documents without the field will be omitted as well as documents that only have a match in the zero weight field.

fieldFormat This can either be omitted (defaults to DEFAULT) or set to DEFAULT, NONE or JSON. DEFAULT returns all fields with highlighting tags (if highlighting is enabled for the field and highlightable tokens were found). NONE returns all fields without highlighting tags. JSON returns a JSON object with the fields value and highlight: the value field contains the field value without highlighting, and the highlight field contains the field value with highlighting tags (if any).

queryFields A list of query fields, each with the following sub-fields

  • fieldName The name of the search field (matching the ones in the indexing process)
  • weight The default weight applied to the field when searching
  • highlight Indicates whether matches should be highlighted in the search result for this field
  • required Indicates whether the result MUST have matches for this field. IMPORTANT: This does not apply when the search does not include the field!
  • queryByDefault Indicates whether the field will be queried when using the query request parameter

Using the Lookup Service

Once the docker container is running, you can query the index at the following address:

http://localhost:9273/lookup-application/api/search?query=YOUR_QUERY

Static Query Parameters

query This will query the index with the specified value on all fields with queryByDefault set to true

maxResults The maximum number of results, overrides the value in the configuration.

fieldFormat The field format for the query, overrides the value in the configuration.

minRelevance The minimum relevance score of the query, overrides the minRelevanceScore value in the configuration.
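
The static parameters can be combined in a single request URL. The following sketch builds such a URL with the Python standard library. The helper name build_search_url is hypothetical, and the base URL assumes the port mapping from the Docker Compose example above.

```python
from urllib.parse import urlencode

# Assumes the container is published on port 9273, as in the compose example.
BASE_URL = "http://localhost:9273/lookup-application/api/search"


def build_search_url(query, max_results=None, field_format=None,
                     min_relevance=None, base_url=BASE_URL):
    """Build a search URL from the static query parameters.

    Parameters left as None are omitted, so the values from the
    YAML configuration apply."""
    params = {"query": query}
    if max_results is not None:
        params["maxResults"] = max_results
    if field_format is not None:
        params["fieldFormat"] = field_format
    if min_relevance is not None:
        params["minRelevance"] = min_relevance
    # urlencode takes care of escaping spaces and special characters.
    return base_url + "?" + urlencode(params)
```

For example, build_search_url("DBpedia", max_results=10) yields a URL ending in ?query=DBpedia&maxResults=10, which can then be fetched with any HTTP client.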

Dynamic Query Parameters

[FIELD_NAME] You can use a field name as a parameter to query only that field. When the static query parameter is combined with a field name parameter, the query value is first applied to all fields as if only the query parameter were used, and the fields given by name are then overridden with their own values.

Example: http://localhost:9273/lookup-application/api/search?label=DBpedia

This search will only be run on the "label" field with the term "DBpedia"

Example: http://localhost:9273/lookup-application/api/search?query=something&label=else

This will search all fields for the term "something" except for the label field. The label field will be searched for the term "else" instead.
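
The override rule can be sketched as follows. This is an illustration of the behavior described above, not the servlet's actual code: resolve_field_terms is a hypothetical name, and the sketch assumes that "all fields" means the fields with queryByDefault set to true.

```python
# Query fields as in the example YAML configuration.
QUERY_FIELDS = [
    {"fieldName": "label", "queryByDefault": True},
    {"fieldName": "comment", "queryByDefault": True},
    {"fieldName": "typeName"},
]


def resolve_field_terms(params, query_fields=QUERY_FIELDS):
    """Map each searched field to the term it will be searched with."""
    default_term = params.get("query")
    terms = {}
    for field in query_fields:
        name = field["fieldName"]
        if name in params:
            # A field name parameter always wins over the query parameter.
            terms[name] = params[name]
        elif default_term is not None and field.get("queryByDefault", False):
            # Otherwise the query value applies to queryByDefault fields.
            terms[name] = default_term
    return terms
```

For the second example URL, resolve_field_terms({"query": "something", "label": "else"}) assigns "else" to the label field and "something" to the comment field.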

[FIELD_NAME]Weight Modifies the weight for a specific field

Example: http://localhost:9273/lookup-application/api/search?query=something&labelWeight=5

This will search on all fields for the term "something" and set the weight of the "label" field to 5

[FIELD_NAME]Required Overrides the default setting of the required parameter of a query field.

Example: http://localhost:9273/lookup-application/api/search?query=something&labelRequired=true

This will search on all fields for the term "something". Since it also searches the "label" field and it has been set to required, all results without a match on the "label" field will be omitted.
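
The [FIELD_NAME]Weight and [FIELD_NAME]Required parameters can be thought of as per-request overrides of the queryFields configuration. The sketch below illustrates this under stated assumptions: effective_field_settings is a hypothetical name, request values arrive as strings, a missing weight defaults to 1.0, and a missing required setting defaults to false. The README does not confirm these defaults.

```python
# Query fields as in the example YAML configuration.
QUERY_FIELDS = [
    {"fieldName": "label", "weight": 1.0},
    {"fieldName": "comment", "weight": 0.5},
    {"fieldName": "typeName", "weight": 0.1, "required": True},
]


def effective_field_settings(params, query_fields=QUERY_FIELDS):
    """Combine configured field settings with per-request overrides."""
    settings = {}
    for field in query_fields:
        name = field["fieldName"]
        # e.g. labelWeight=5 replaces the configured weight of "label".
        weight = float(params.get(name + "Weight", field.get("weight", 1.0)))
        # e.g. labelRequired=true replaces the configured required flag.
        raw_required = params.get(name + "Required")
        if raw_required is None:
            required = field.get("required", False)
        else:
            required = raw_required.lower() == "true"
        settings[name] = {"weight": weight, "required": required}
    return settings
```

With {"query": "something", "labelWeight": "5"}, the label weight becomes 5.0 while the comment and typeName fields keep their configured values.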
