Some resources result in a very large set of precomputed quads #87

Open
nevali opened this issue Aug 23, 2016 · 4 comments

@nevali
Member

nevali commented Aug 23, 2016

Some resources result in an extremely large set of precomposed quads being stored in the bucket. This is problematic because parsing the quads takes a long time and causes the API request to fail.

The attached example (gzipped; 20M uncompressed) demonstrates the problem.

We should:

  • Determine what aspect of the stored data is bloating the quads
  • Apply a limit, and a mechanism to indicate to Quilt that more detail is available
  • Ensure the database schema supports an efficient query for this detail
  • Implement a pagination mechanism so that the client can page beyond the limited initial results if required
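
As a rough illustration of the limit-plus-pagination steps above, here is a minimal sketch in Python using SQLite. The `inbound` table, its columns, and the `page_inbound` helper are hypothetical names invented for this example; they are not Spindle's or Quilt's actual schema or API. The idea is keyset pagination over an indexed (target, source) pair: fetch one row more than the limit to learn whether more detail is available, and hand back a cursor the client can use to page further.

```python
import sqlite3

def open_store():
    """Create an in-memory table of inbound links (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    # The composite primary key doubles as the index that keeps
    # "all sources pointing at this target, in order" cheap to query.
    db.execute(
        "CREATE TABLE inbound ("
        " target TEXT NOT NULL,"
        " source TEXT NOT NULL,"
        " PRIMARY KEY (target, source))"
    )
    return db

def page_inbound(db, target, limit=100, after=""):
    """Return (sources, next_cursor); next_cursor is None on the last page."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Ask for one extra row so we can tell whether another page exists.
    rows = db.execute(
        "SELECT source FROM inbound"
        " WHERE target = ? AND source > ?"
        " ORDER BY source LIMIT ?",
        (target, after, limit + 1),
    ).fetchall()
    sources = [row[0] for row in rows[:limit]]
    next_cursor = sources[-1] if len(rows) > limit else None
    return sources, next_cursor
```

A client would call `page_inbound(db, target)` for the limited initial results and, while the returned cursor is not `None`, pass it back as `after` to fetch the next page. A non-`None` cursor plays the role of the "more detail is available" indicator.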

Internal tracking: RESDATA-1096

@cgueret
Collaborator

cgueret commented Aug 23, 2016

Regarding "Determine what aspect of the stored data is bloating the quads": the cause is probably that all related sources are pointed to with a seeAlso. In the specific case of Geonames, the rules in https://github.com/bbcarchdev/spindle/blob/develop/twine/rulebase.ttl#L593-L624 cause entities such as http://www.geonames.org/2653822/cardiff.html to be located in http://www.geonames.org/2634895/wales.html (parentADM1) as well as http://www.geonames.org/2635167/united-kingdom-of-great-britain-and-northern-ireland.html (parentCountry). Unless I am mistaken, the proxy for the latter will then point to every single proxy related to it with a seeAlso. After some manual checking, it appears that the attached example is the proxy for Somalia and has a number of Somali cities and features attached to it by a seeAlso.

One way to fix that would be to revise the rules for Geonames. It would not be a universal fix for similar situations, but it could still be worthwhile.

@nevali
Member Author

nevali commented Aug 23, 2016

Per-authority rules aren't really a solution here. Anything which has a lot of inbound references will trigger the same thing (e.g., a programme genre).

@cgueret
Collaborator

cgueret commented Aug 23, 2016

Sure. Another way to frame the issue is to ask whether all those links are necessary... but that then becomes a question of CBD (Concise Bounded Description) versus SCBD (Symmetric Concise Bounded Description), and opinions may vary.
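
To make the CBD/SCBD distinction concrete, here is a deliberately simplified Python sketch. It treats a graph as a list of (subject, predicate, object) tuples and ignores the blank-node closure and reification parts of the real definitions. The `ex:` identifiers and the sample triples are made up for illustration and are not taken from the attached data.

```python
def cbd(triples, resource):
    """Simplified CBD: statements whose subject is the resource."""
    return [t for t in triples if t[0] == resource]

def scbd(triples, resource):
    """Simplified SCBD: the CBD plus statements whose object is the resource."""
    return [t for t in triples if t[0] == resource or t[2] == resource]

# Hypothetical sample data: two places that name ex:uk as their country.
TRIPLES = [
    ("ex:uk", "rdfs:label", '"United Kingdom"'),
    ("ex:wales", "gn:parentCountry", "ex:uk"),
    ("ex:cardiff", "gn:parentCountry", "ex:uk"),
    ("ex:cardiff", "gn:parentADM1", "ex:wales"),
]

print(len(cbd(TRIPLES, "ex:uk")), len(scbd(TRIPLES, "ex:uk")))
```

With this sample, the simplified CBD of `ex:uk` holds one statement and the SCBD holds three. The symmetric form grows with every inbound reference, which is the kind of growth a heavily referenced resource would see.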

@cgueret
Collaborator

cgueret commented Aug 23, 2016

I think we could decide not to materialise all those links and instead associate the resources at query time. This is like what DBpedia does on http://dbpedia.org/page/Somalia by serving all the "is <> of" statements.
