Should we limit what's returned from the /data/edges API? #821
Going by the data I have saved from my LJ version of these tools, LJ probably limits fdata.bml output to 5000 total edges. We, on the other hand, seem to be limiting to 50000 for each individual edge.
Limit of 5k should be good. Or, not return edges for accounts that have the thing that forces an empty friends list turned on, which I'm blanking on what it is...

On Jul 15, 2014, at 2:50 PM, Sophira notifications@github.com wrote:
I just checked the dw_news edges again to get more accurate data. It seemed to be quicker this time, possibly because data had been loaded into memcache from the previous run. Total time getting the data was 22.584s, of which only about 2s was actual data transfer, meaning that about 20 seconds were spent getting the data on the server end. The actual data transfer was also not as bad as I had thought, but still sizable (2.4MB) and worth worrying about. I'll take a look to see what the empty friends list option is. It's probably still worth doing the 5000 limit too, though!
Checking -- did you mean to pick this up, or did you want to leave it for someone else? And if you're picking this up, I'd be happy to accept it straight into the release branch.
Oh, you're right, I did mean to pick it up. Claimed! [edit: Hmm. Did I do that wrongly?]
Er, this should respect %LJ::FORCE_EMPTY_SUBSCRIPTIONS. Adding a limit otherwise is fine and generally good, too, but at least make it respect the blacklist.
that was the thing i was trying to remember! |
Yeah, that's what I've been doing. |
Data obtained by /data/edges can be quite large; limit the data to 5000 edges for most edge types. Also, respect the %LJ::FORCE_EMPTY_SUBSCRIPTIONS configuration option. This patch also fixes a potential race condition and adds functionality to the member_userids and member_of_userids functions to request a limit to the number of userids returned.
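The actual fix lives in Perl in the dw-free codebase, but the shape of it can be sketched roughly. The following is a minimal, hypothetical Python illustration of the two behaviours the commit message describes (a hard cap on returned edges, and an empty result for accounts on a force-empty list); the names `get_edges`, `member_userids`, and `force_empty` are assumptions, not the real API.

```python
# Hypothetical sketch of the limiting logic described above.
# The real implementation is Perl; all names here are illustrative.

MAX_EDGES = 5000  # per-edge-type cap discussed in the thread


def member_userids(community, limit=None):
    """Return member userids, optionally truncated to `limit` entries.

    Mirrors the idea of adding an optional limit parameter to the
    member_userids / member_of_userids functions.
    """
    userids = community["members"]
    return userids if limit is None else userids[:limit]


def get_edges(community, edge_type, force_empty=frozenset()):
    """Return at most MAX_EDGES edges for `edge_type`.

    Accounts listed in `force_empty` get an empty result, analogous
    to respecting %LJ::FORCE_EMPTY_SUBSCRIPTIONS. (A real handler
    would branch on edge_type; this sketch ignores it.)
    """
    if community["name"] in force_empty:
        return []
    return member_userids(community, limit=MAX_EDGES)
```

With a 56,000-edge community like dw_news, a handler along these lines would return at most 5,000 userids per edge type instead of the full set.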
cc @momijizukamori for your API stuff as a consideration. Since this hasn't cropped up as A Problem in 4 years, I'm going to close it wrt our current APIs, but the new stuff should definitely make sure we have limits in them (at least planned, not saying we need them to begin with).
Out of curiosity, was there something wrong with my patch?

What patch? This is an issue, not a PR?

Look up in the thread; there's a referencing commit in her repo, not a PR.

I don't review every random commit on every random fork. If someone wants a review, they need to submit a PR.
Yeah, I apologise, sorry. I thought I had made a PR. I can make one now if wanted.
Oh, I had (#856), and there were changes requested. Sorry.
I'm developing a suite of tools (for my own use only as of right now) to use the data APIs to get data. For some reason I decided it would be a good idea to try it on dw_news. I'm not entirely sure why, but there you go.
I don't have verbose output yet for exactly what stages took the most time, but as a whole, the fetching and processing took about a minute. I looked in my database afterwards to see how many edges had been entered:
Yep, that's quite a lot.
Do we want to limit the amount of data that's sent from the APIs? I don't want to run the query again without permission, but I'm guessing that the biggest bottleneck would have been the downstream bandwidth. For my server that's about 1MB/s, meaning that if the bandwidth is indeed the limiting factor, then the implications are rather worrying.
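The bandwidth worry above can be sanity-checked with quick arithmetic, using the 2.4MB payload size reported elsewhere in the thread and the ~1MB/s downstream figure mentioned here (both rough numbers):

```python
# Back-of-the-envelope check: is downstream bandwidth plausibly the
# bottleneck for the observed transfer time?
payload_mb = 2.4      # response size reported in the thread (approximate)
bandwidth_mb_s = 1.0  # ~1 MB/s downstream, per this comment
transfer_s = payload_mb / bandwidth_mb_s
print(transfer_s)  # 2.4 seconds, consistent with the ~2s of transfer observed
```

That lines up with the later measurement showing transfer was only a small share of the total time, with most of it spent server-side.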
May I make this query one more time and save the data to a file locally, so that I can test things without having to get the data over the network again?