ISemantics 2013 Supporting Data
This page contains supporting material for our submission to ISemantics 2013.
- A demo for most of our supported languages can be found here.
- A detailed internationalization guide can be found here.
In the simplified model, the result of the indexing process is a single directory, containing all necessary files and the system requires no further configuration to run.
    $ tree /data/spotlight/model_nl/
    /data/spotlight/model_nl/
    ├── model
    │   ├── candmap.mem
    │   ├── context.mem
    │   ├── res.mem
    │   ├── sf.mem
    │   └── tokens.mem
    ├── model.properties
    ├── opennlp
    │   ├── chunker.bin
    │   ├── pos-maxent.bin
    │   ├── sent.bin
    │   └── token.bin
    ├── spotter_thresholds.txt
    └── stopwords.list
To run this model on port 2222 using the DBpedia Spotlight server:
$ java -jar dbpedia-spotlight.jar /data/spotlight/model_nl http://localhost:2222
To select the correct annotation candidates from the set of candidates, which we generate in several ways, we estimate the probability that a surface form s is annotated as:
$$P(\text{annotation} \mid s) = \frac{\sum_{e} \text{count}(e, s)}{\text{count}(s)}$$
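As a minimal sketch of this estimate, assuming the counts are already available as plain Python dictionaries: the names `annotation_probability`, `annotated_counts` and `total_counts`, as well as the numbers, are illustrative and not part of PigNLProc or DBpedia Spotlight.

```python
from collections import defaultdict


def annotation_probability(annotated_counts, total_counts):
    """Estimate P(annotation|s) for every surface form s.

    annotated_counts: dict mapping (entity, surface_form) -> count(e, s)
    total_counts:     dict mapping surface_form -> count(s)
    """
    # Sum count(e, s) over all entities e for each surface form s.
    annotated_per_sf = defaultdict(int)
    for (entity, sf), n in annotated_counts.items():
        annotated_per_sf[sf] += n

    probabilities = {}
    for sf, total in total_counts.items():
        if total > 0:  # skip surface forms that never occur as a string
            probabilities[sf] = annotated_per_sf[sf] / total
    return probabilities


# Hypothetical counts, for illustration only.
annotated_counts = {("Apple_Inc.", "Apple"): 3, ("Apple", "Apple"): 1}
total_counts = {"Apple": 8}

probs = annotation_probability(annotated_counts, total_counts)
print(probs["Apple"])  # 0.5
```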
The value of count(s) is the total number of times the surface form occurs as a string in the whole dataset. Since a string search for each surface form would not be feasible (there are more than 5 million in the case of English), the current version of PigNLProc extracts n-grams (with n up to 5 by default) from the whole dataset and then joins them with the set of surface forms. This approach requires temporary storage of all possible n-grams, which creates a noticeable bottleneck. We therefore extended PigNLProc to collect n-grams only for the set of accepted surface forms, which is distributed to the cluster nodes via the Hadoop distributed cache.
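The idea of filtering n-grams at extraction time can be sketched on a single machine as follows. This is plain Python rather than Pig, and the in-memory set `accepted` only stands in for the surface form set that the real pipeline ships through the Hadoop distributed cache; the function name `count_surface_forms` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def count_surface_forms(tokens, accepted, max_n=5):
    """Count n-grams (n <= max_n) that are accepted surface forms."""
    counts = Counter()
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            if start + n > len(tokens):
                break
            ngram = " ".join(tokens[start:start + n])
            # Keep only n-grams that are known surface forms, so the full
            # set of n-grams never has to be stored.
            if ngram in accepted:
                counts[ngram] += 1
    return counts


# Hypothetical surface form set and text, for illustration only.
accepted = {"Apple", "Apple iPad"}
tokens = "Apple sells the Apple iPad".split()

sf_counts = count_surface_forms(tokens, accepted)
print(sf_counts["Apple"], sf_counts["Apple iPad"])  # 2 1
```

Because the membership test happens before anything is counted, memory use grows with the number of accepted surface forms instead of the number of distinct n-grams in the corpus.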
When we analyzed the values of P(annotation|s) for Dutch, we observed that for some entities the probabilities were consistently lower than expected. After further investigation, we found that these were all cases where the surface form is a substring of another surface form. In the Pig script, count(e, s) only considers full annotations and not their parts or tokens. Since count(s) is the general frequency of a surface form, every occurrence of a surface form inside a longer surface form is counted as a non-annotation of the contained surface form. Consider, for example, the surface form "Apple" and suppose that our corpus consists of the following text (where [...] indicates an annotation):
[Apple] is the company selling the [Apple MacBook], the [Apple iPod] and the [Apple iPad].
According to the counts produced by the Pig script, in this corpus ∑_e count(e, "Apple") would be 1 and count("Apple") would be 4. Hence P(annotation|"Apple") = 1 / 4. However, the annotated mentions of "Apple MacBook", "Apple iPod" and "Apple iPad" should not count as unannotated occurrences of the surface form "Apple", since the text span "Apple X" can carry only one annotation and not a nested pair (e.g. [[Apple] iPad]). Hence, in our build, we correct the counts from the Pig script by subtracting the annotated counts of longer surface forms from the total counts of the surface forms they contain.
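A minimal sketch of this correction, applied to the example corpus above. The source does not say how containment is tested in the actual build, so the token-level check used here is an assumption, and the function names `contains_tokens` and `correct_total_counts` are hypothetical.

```python
def contains_tokens(longer, shorter):
    """True if the tokens of `shorter` occur contiguously inside `longer`."""
    lt, st = longer.split(), shorter.split()
    return len(st) < len(lt) and any(
        lt[i:i + len(st)] == st for i in range(len(lt) - len(st) + 1)
    )


def correct_total_counts(total_counts, annotated_sf_counts):
    """Subtract annotated counts of longer surface forms from count(s)."""
    corrected = dict(total_counts)
    for sf in total_counts:
        for longer, annotated in annotated_sf_counts.items():
            if contains_tokens(longer, sf):
                corrected[sf] -= annotated
        # Guard against negative values caused by inconsistent input counts.
        corrected[sf] = max(corrected[sf], 0)
    return corrected


# Counts for the example corpus: count(s) and the annotated count per surface form.
total_counts = {"Apple": 4, "Apple MacBook": 1, "Apple iPod": 1, "Apple iPad": 1}
annotated_sf_counts = {"Apple": 1, "Apple MacBook": 1, "Apple iPod": 1, "Apple iPad": 1}

corrected = correct_total_counts(total_counts, annotated_sf_counts)
p_apple = annotated_sf_counts["Apple"] / corrected["Apple"]
print(corrected["Apple"], p_apple)  # 1 1.0
```

After the correction, count("Apple") drops from 4 to 1, so P(annotation|"Apple") becomes 1 / 1 instead of 1 / 4. The pairwise loop is quadratic in the number of surface forms and is only meant to illustrate the arithmetic, not how the build performs it at scale.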
|       | Precision | Recall | F1    |
|-------|:---------:|:------:|:------|
| PP    | 90.41     | 93.03  | 91.7  |
| NP    | 83.19     | 80.75  | 81.95 |
| MWU   | 87.27     | 75.1   | 80.73 |
| Total | 86.1      | 84.04  | 85.06 |
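The F1 column is consistent with the usual harmonic mean of precision and recall. The source does not state the formula, so this is an assumption, but the short check below reproduces the reported values from the precision and recall columns (the function name `f1_score` is illustrative).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table.
rows = {
    "PP": (90.41, 93.03),
    "NP": (83.19, 80.75),
    "MWU": (87.27, 75.1),
    "Total": (86.1, 84.04),
}

f1 = {label: round(f1_score(p, r), 2) for label, (p, r) in rows.items()}
print(f1)  # {'PP': 91.7, 'NP': 81.95, 'MWU': 80.73, 'Total': 85.06}
```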