Implement caching to reduce loading time while working with diff files #59
@janakrajchadha Just curious where this issue stood. I know you worked on this in the past, and I kind of had the sense that maybe this was done. It might be worth providing a quick status update here?
@mhucka I had worked on the Python implementations and figured out the most basic way to do this.
I've mentioned in one of my reports that this will have to be kept on hold until we answer these questions and figure out the best way to add caching somewhere in the entire process.
From where I'm sitting:
I don't really think anything needs to be kept on hold here. For the scope of caching diffs that the diff service would provide to clients, there are two clear places to do it (and we could do it in one or both):
@janakrajchadha did you ever put the caching work you originally had in edgi-govdata-archiving/web-monitoring-processing#68 in a different branch? You removed it before there was ever much discussion, but IIRC, you had caching at the level of each diff function, and as noted above, it would probably be more effective at an earlier point. I think we also talked a bit about whether the cache would be better leveraging a centralized storage across processes (since the current service runs 4) by using something like Redis or Memcached. The one you implemented, I think, used in-process memory?
I agree with your point. I think one of the questions that Dan and I had discussed at that point in time was how expensive caching would prove to be in terms of the memory required, and whether that would vary from one service to another. After putting more thought into it, I think it really depends on what we want to cache, and the important information from the result of any diff service would not vary a lot.
We had considered the possibility of caching all diffs (due to concerns around PF's speed) and weren't sure how computationally expensive it would be. On the possible places you've suggested for caching -
If we are computing diffs locally, this could be done in front of the local service, if I'm not wrong. I'm not sure whether Nginx or Varnish would be helpful in that case (I have limited knowledge of both).
I have kept it in another branch locally; I haven't pushed that branch yet. I had only tried it out on the function which makes calls to the PF API, as this was around the time when our own diffing service had just come up and the first discussion was restricted to PF's speed issues.
Yes, it used in-process memory and evicted entries with a Least Recently Used (LRU) policy.
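For reference, the in-process LRU approach described above can be sketched with Python's standard library. This is a minimal illustration, not the removed branch's code: `fetch_diff` and its signature are hypothetical stand-ins for the function that calls the PF diffing API.

```python
import functools

# Hypothetical stand-in for the function that calls the PF diffing API;
# the real function in web-monitoring-processing has a different signature.
@functools.lru_cache(maxsize=128)
def fetch_diff(url_a, url_b):
    # Pretend this is an expensive network call to the diffing service.
    return {"a": url_a, "b": url_b, "changes": []}

first = fetch_diff("http://example.com/v1", "http://example.com/v2")
second = fetch_diff("http://example.com/v1", "http://example.com/v2")
assert first is second  # the second call is served from the in-process cache
```

The main limitation, as noted above, is that each of the 4 service processes keeps its own copy of this cache, so identical requests hitting different processes are cache misses.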
@Mr0grog If you can give me an idea of a good way to do this (keeping future additions in mind), I can start working on it.
There is no issue for this other than #64, which is an umbrella issue for a lot of ops/devops-related work. I think caching is among the lowest priority of all things we can be working on right now, so there’s no timeline or plan for it.
I think we are talking about different things when we talk about “in front of the service” here. When I say service, I mean an always-running application that responds to requests via some API (web or otherwise), so the diff service is
If we want to put a cache in front of either the diff service or the monitoring API, either Nginx’s built-in caching tools or Varnish are probably the way to go. Both are reverse proxies and, the way we are using Nginx now, both can do caching as well as the load balancing work we already have Nginx doing. I’m not a deep expert on these tools, so I would probably just go ahead and figure out how to configure Nginx’s caching and not worry about Varnish.
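As a rough idea of what “caching in front of the service” looks like, here is a minimal sketch of Nginx’s built-in proxy cache. The paths, port, zone name, and TTL are all assumptions for illustration, not the project’s actual deployment config.

```nginx
# Hypothetical values throughout; adjust to the real deployment.
proxy_cache_path /var/cache/nginx/diffs levels=1:2 keys_zone=diff_cache:10m
                 max_size=1g inactive=60m;

upstream diff_service {
    server 127.0.0.1:8888;  # one of the service's worker processes
}

server {
    listen 80;

    location / {
        proxy_cache diff_cache;
        proxy_cache_key "$scheme$request_method$host$request_uri";
        proxy_cache_valid 200 60m;  # cache successful diffs for an hour
        add_header X-Cache-Status $upstream_cache_status;
        proxy_pass http://diff_service;
    }
}
```

Since diff results for a fixed pair of versions are effectively immutable, even a long `proxy_cache_valid` is reasonably safe here.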
If you want to implement your own caching inside of our Python code, I’d look into leveraging either Memcached or Redis—both are mature, stable, and have great client libraries for almost every language. An in-process cache like the one you implemented is a good start, but ideally you want to move the actual cache outside your web service’s process. You can think of Memcached like a giant hash table that you access via TCP (you generally want it running on a separate machine so that it can consume pretty much all of that machine’s memory without causing your service any issues). Figure out what you want to use as a key for that hash table (e.g. the URL of the request or possibly something more concise).
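The hash-table-over-TCP pattern described above can be sketched as follows. To keep the example self-contained, a plain dict stands in for the Memcached/Redis client, and `compute_diff`, `get_diff`, and the key scheme are hypothetical illustrations, not the project’s code.

```python
import hashlib
import json

# A plain dict stands in for a Memcached/Redis client here; with a real
# client, the lookups below would become network get/set calls instead.
cache = {}

def cache_key(url_a, url_b, diff_type):
    # Derive a concise, fixed-length key from the request parameters,
    # rather than using the full request URL as the key.
    raw = json.dumps([url_a, url_b, diff_type], sort_keys=True)
    return "diff:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

def compute_diff(url_a, url_b, diff_type):
    # Hypothetical stand-in for actually computing a diff.
    return {"type": diff_type, "changes": []}

def get_diff(url_a, url_b, diff_type="html_text"):
    key = cache_key(url_a, url_b, diff_type)
    if key not in cache:  # cache miss: compute and store
        cache[key] = compute_diff(url_a, url_b, diff_type)
    return cache[key]
```

Because the real cache lives in a separate process, all 4 service workers would share it, unlike an in-process LRU cache.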
I'm sorry if I caused any confusion; I was talking about the issue related to rewriting the project in Python/JS, since you had mentioned that it can be a minor barrier. Also, thanks for clarifying what you mean by "in front of" the service.
So from what I understand, this means that we will necessarily have to access the diff service through its web API, right? Using Nginx seems like a good way to go, given that it has built-in caching tools and we already have it running for load balancing. Thank you for the super helpful information and production-related knowledge. Really appreciate that! 👐
Oh. To be clear, I don’t think the rewrite is a barrier to anything here. I meant that the fact that it is currently a Ruby app is a barrier to other people adding caching (hence the need to rewrite it). The only timeline is the timeline stated in the issue: it’s gotta happen before I leave, but unless somebody has something to say about it, there are too many other things that need to be worked on first.
Well, that’s the only API the diff service has! If you want to access the contents of the diff module and not the service, then you have other options ;)
OK, there is caching (with Redis) at the DB level now (has been for a while), so I'm going to go ahead and close this.
There is a need to implement caching of the diff results from the different diffing services, as it will reduce the time it takes to access them. This will also eliminate the need to repeatedly call the API for the same results.