Some resources result in a very large set of precomputed quads #87
Comments
The answer to "Determine what aspect of the stored data is bloating the quads" is probably that all related sources are pointed to with a seeAlso. In the specific case of Geonames, the rules in https://github.com/bbcarchdev/spindle/blob/develop/twine/rulebase.ttl#L593-L624 cause entities such as http://www.geonames.org/2653822/cardiff.html to be located in http://www.geonames.org/2634895/wales.html (parentADM1) as well as in http://www.geonames.org/2635167/united-kingdom-of-great-britain-and-northern-ireland.html (parentCountry). Unless I am mistaken, the proxy for the latter will then point to every single proxy related to it with a seeAlso. After a manual check, it appears that the attached example is the proxy for Somalia and has a number of Somalian cities and features attached to it by a seeAlso. One way to fix that, which would not be a universal fix for similar situations but could still be worthwhile, would be to revise the rules for Geonames.
Per-authority rules aren't really a solution here. Anything which has a lot of inbound references will trigger the same thing (e.g., a programme genre).
Sure. Another way to phrase the issue is to ask whether all those links are necessary, but that then becomes a question of CBD versus SCBD, and opinions may vary.
I think we could decide not to materialise all those links and instead associate the resources at query time if we want, like what DBpedia does on http://dbpedia.org/page/Somalia by serving all the "is <> of" statements.
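To make the query-time idea concrete, here is a minimal sketch in plain Python. It is not how spindle works today; the `ex:` identifiers, the sample statements, and the function names are all hypothetical. The point is only that the heavily referenced proxy stores nothing about its inbound links, and they are looked up (and optionally capped) when the resource is described.

```python
from collections import defaultdict

# Hypothetical sample data: each place points at its parent country.
# Nothing is stored on the country itself.
TRIPLES = [
    ("ex:cardiff", "ex:parentCountry", "ex:uk"),
    ("ex:swansea", "ex:parentCountry", "ex:uk"),
    ("ex:mogadishu", "ex:parentCountry", "ex:somalia"),
]


def build_inbound_index(triples):
    """Map each object to the (subject, predicate) pairs that reference it."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[obj].append((subject, predicate))
    return index


def describe(resource, triples, index, limit=None):
    """Return (outbound, inbound) statements for a resource.

    Inbound statements ("is <predicate> of") come from the index at
    query time and can be capped with `limit`, so a resource with many
    inbound references does not produce an unbounded description.
    """
    outbound = [(p, o) for s, p, o in triples if s == resource]
    inbound = index.get(resource, [])
    if limit is not None:
        inbound = inbound[:limit]
    return outbound, inbound


index = build_inbound_index(TRIPLES)
outbound, inbound = describe("ex:uk", TRIPLES, index, limit=100)
```

The trade-off is the usual one: less stored data per proxy, at the cost of an extra lookup per request.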
Some resources result in an extremely large set of precomposed quads being stored in the bucket. This is problematic because parsing the quads takes a long time and causes the API request to fail.
The attached example (gzipped; 20M uncompressed) demonstrates the problem.
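One way to see what dominates a file like this is to count statements per predicate without building a full graph. The sketch below assumes the attachment is gzipped N-Quads with one statement per line; the function name and the file path in the usage comment are hypothetical. It relies on the fact that subjects and predicates in N-Quads contain no whitespace, so the predicate is always the second whitespace-separated token.

```python
import gzip
from collections import Counter


def predicate_histogram(path):
    """Count statements per predicate in a gzipped N-Quads file.

    The file is streamed line by line, so memory use stays small even
    when the uncompressed data is tens of megabytes.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            parts = line.split(None, 2)
            if len(parts) == 3:
                counts[parts[1]] += 1
    return counts


# Usage (hypothetical path):
# for predicate, n in predicate_histogram("example.nq.gz").most_common(10):
#     print(n, predicate)
```

If one predicate accounts for most of the lines, that points at the statements worth not materialising.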
We should:—
Internal tracking: RESDATA-1096