Data generation manual
Summary of the data generation guidelines
- Download a few files from the latest DBpedia release for your language. For English we used the following datasets: labels_en.nt.bz2, redirects_en.nt.bz2, disambiguations_en.nt.bz2, instance_types_en.nt.bz2. If they are not available for your language, you will need to look into how to create them. For higher quality extraction, consider adding infobox mappings and DBpedia Ontology labels for your language at http://mappings.dbpedia.org. The results of the mapping-based extraction are available in the file mappingbased_properties_en.nt.bz2.
- Get a Wikipedia XML dump in your language (only pages-articles.xml.bz2).
- Get Lucene tokenizers, stemmers, stopwords, etc. in your language. For many languages SnowballAnalyzer may be enough. For others, this will require some language-specific knowledge, so I can't help much. But see for example http://hunspell.sourceforge.net/
- Extract name variations (lexicalizations) for DBpedia Resources via ExtractCandidateMap
- Extract DBpedia resource occurrences via ExtractOccsFromWikipedia
- Sort the extracted TSV (tab-separated values) file by URI
- Run DBpedia Spotlight occurrence indexing via IndexMergedOccurrences
- Add name variations and types to the index via AddSurfaceFormsToIndex and AddTypesToIndex
- Optionally compress the index
- Run evaluation and detect similarity threshold scores
- If you'd like to try indexing using Hadoop and Apache Pig, check this page
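The sorting step in the list above can be illustrated with a minimal Python sketch. It assumes the URI sits in the second tab-separated column of the occurrences file (index 1); that column position and the sample rows are assumptions made for illustration, so check them against your own extracted file. It also sorts in memory, which only suits small samples. A full Wikipedia extraction will likely need an external sort instead.

```python
def sort_tsv_by_column(lines, column=1):
    """Return the given TSV lines sorted by the value in one column.

    `column` is the zero-based index of the sort key. The default of 1
    (the URI column) is an assumption about the file layout.
    """
    def key(line):
        return line.rstrip("\n").split("\t")[column]

    return sorted(lines, key=key)


# Hypothetical sample rows: id, URI, surface form, context, offset.
sample = [
    "occ3\tParis\tParis\tParis is in France.\t0\n",
    "occ1\tBerlin\tBerlin\tBerlin is the capital of Germany.\t0\n",
    "occ2\tAmsterdam\tAmsterdam\tShe lives in Amsterdam.\t13\n",
]

sorted_lines = sort_tsv_by_column(sample)
sorted_uris = [line.split("\t")[1] for line in sorted_lines]
print(sorted_uris)
```

Sorting by URI groups all occurrences of the same resource on adjacent lines, so they can be read and merged in a single pass.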
We are also creating a page to catalog Internationalization issues.
Finally, if you plan to change the source code, please consider committing it back to our repository so that other people can also benefit from it - and you can be acknowledged as an awesome contributor!
You should make sure that each step is generating the data correctly.
- Use grep/cut, etc. to inspect the text files
- Use Luke to inspect the generated indexes.
Check conceptURIs to see if undesirable URIs were kept (e.g. disambiguations, redirects, lists, or anything else that does not fit your use case).
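As a rough aid for this check, the sketch below flags URIs whose names match a few patterns. The two patterns shown follow English Wikipedia naming conventions ("List_of_..." and "..._(disambiguation)"); the function name and the sample URIs are hypothetical, and other languages will need their own patterns. Redirects cannot be spotted from the name alone and would need to be compared against the redirects dataset.

```python
import re

# Name patterns that suggest a URI is not a real entity.
# These follow English Wikipedia conventions and are only a starting point.
UNDESIRABLE_PATTERNS = [
    re.compile(r"^List_of_"),
    re.compile(r"_\(disambiguation\)$"),
]


def flag_undesirable(uris, patterns=UNDESIRABLE_PATTERNS):
    """Return the URIs that match at least one undesirable pattern."""
    return [u for u in uris if any(p.search(u) for p in patterns)]


# Hypothetical sample of concept URIs (local names only).
sample_uris = [
    "Berlin",
    "List_of_cities_in_Germany",
    "Mercury_(disambiguation)",
    "Mercury_(planet)",
]

flagged = flag_undesirable(sample_uris)
print(flagged)
```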
Check surface forms to see if they look like legitimate names of things in your language.
See how many occurrences were extracted. How many URIs have occurrences? Are the contexts of a decent length (subjective, and dependent on your use case)?
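These questions can be answered with a short script over the occurrences file. The sketch below is one possible approach, assuming the URI is in column index 1 and the context text in column index 3; both positions, the function name, and the sample rows are assumptions for illustration. Context length is measured in whitespace-separated tokens, which is only a crude proxy.

```python
from collections import Counter


def occurrence_stats(lines, uri_col=1, context_col=3):
    """Summarize a TSV of occurrences: totals and mean context length."""
    uri_counts = Counter()
    context_lengths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(uri_col, context_col):
            continue  # skip malformed rows rather than crash
        uri_counts[fields[uri_col]] += 1
        context_lengths.append(len(fields[context_col].split()))
    mean_len = (
        sum(context_lengths) / len(context_lengths) if context_lengths else 0.0
    )
    return {
        "occurrences": sum(uri_counts.values()),
        "distinct_uris": len(uri_counts),
        "mean_context_tokens": mean_len,
    }


# Hypothetical sample rows: id, URI, surface form, context, offset.
sample_occs = [
    "occ1\tBerlin\tBerlin\tBerlin is the capital of Germany.\t0\n",
    "occ2\tBerlin\tBerlin\tHe moved to Berlin in 1999.\t12\n",
    "occ3\tParis\tParis\tParis is in France.\t0\n",
    "broken row without enough columns\n",
]

stats = occurrence_stats(sample_occs)
print(stats)
```

Comparing `distinct_uris` with the number of concept URIs gives a quick view of how many resources ended up with no occurrences at all.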
Use Luke to open the index. How many URIs are there in the index? Were the entity names stored in the field SURFACE_FORM? Were the counts stored in the field URI_COUNT? Were types stored in the field TYPE?
Look in the context field:
- Were words stemmed?
- Were stopwords removed?
- Do they have the correct morphology (e.g. were they all lowercased, if that's what you wanted)? Do they look "clean", or are there very generic references (e.g. "here" pointing to "Berlin")?
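Two of the checks above (stopword removal and lowercasing) can be partly automated once you have copied some tokens out of the context field, for example from Luke. The sketch below is a minimal illustration; the function name, the token sample, and the stopword list are all hypothetical, and it does not attempt to verify stemming, which still needs a manual look.

```python
def check_context_tokens(tokens, stopwords):
    """Return (leaked stopwords, tokens that are not fully lowercase)."""
    stop = {s.lower() for s in stopwords}
    leaked = sorted({t for t in tokens if t.lower() in stop})
    not_lower = sorted({t for t in tokens if t != t.lower()})
    return leaked, not_lower


# Hypothetical tokens copied from the context field of one document.
tokens = ["berlin", "capit", "the", "Germany", "citi"]
# Hypothetical stopword list; use the one configured for your analyzer.
stopwords = ["the", "of", "is"]

leaked, not_lower = check_context_tokens(tokens, stopwords)
print("stopwords still present:", leaked)
print("tokens not lowercased:", not_lower)
```

Empty results for both lists suggest the analyzer was applied as intended, at least for the sample you inspected.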