Optimize the feed reader for websites with a large number of users #10996
Thanks for the issue! We'll take your request into consideration and follow up if we decide to tackle this issue. To our amazing contributors: to claim an issue to work on, please leave a comment. If you've claimed the issue and need help, please ping @forem/oss and we will follow up within 3 business days. For full info on how to contribute, please check out our contributors guide.
After talking with @benhalpern, a couple of optimization ideas (totally to be fleshed out and tested) have come out:
cc @citizen428 @mstruve as you provided feedback in #10998, but obviously open to anyone :)
We actually had to deal with a very similar problem at Kenna, where we dealt with what we called connectors. Those connectors would deliver tons of data on vulnerabilities from our clients to our app. In the beginning we processed and saved these vulns one by one, with all the Rails callbacks, and it was SLOW. Because it was such a cornerstone of our application, we ended up writing a separate "importer" service to handle creating all of the vulnerabilities from the connectors. This service eventually became in charge of all vulnerability creation, whether through the API or a connector.

It was a very big change, but it forced us to completely decouple the system from Rails callbacks and got us to the point where there was a single flow for creating vulnerabilities. We then optimized the crap out of that flow (including bulk inserts) until we were able to process hundreds of millions of vulnerabilities a day.

The whole point of that story is: let's think really long term here for Forem. I know it can be tempting to try and fit a solution into our current architecture, but maybe it's time we overhauled some of it? Maybe it's time to make a separate service for this, which would give us more control over it? What do you all think?

PS: being decoupled from Rails callbacks ended up being the best decision we ever made.
I'm 3000% in favor of phasing out callbacks on non-trivial models. We definitely don't need to come up with all the solutions ahead of time; as we all know, that's seldom a good approach. Anyway, I'm definitely in favor of having a single service which also stands as the one true way to do a task in the application code. Following callbacks for side effects becomes tiresome quickly, and it's always a bit tricky to refactor in non-trivial apps like ours.
When we did it, we ended up using the light-service gem (https://github.com/adomokos/light-service), mainly because its author was working with us at the time. I actually really enjoyed using it, as it made viewing all the steps for creation very easy. I'm sure there are lots of gems out there like it. It might be worth checking out to help us architect and organize a flow for article creation and RSS imports.
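A rough sketch of how such a flow could be organized with light-service, based on the gem's documented Organizer/Action API (all class names here are hypothetical, not actual Forem code; the remaining actions would follow the same shape as `FetchesFeed`):

```ruby
require "light-service"
require "httparty"

# Hypothetical organizer for an article-import flow: each step is an explicit,
# named action, instead of a chain of model callbacks.
class ImportArticlesFromFeed
  extend LightService::Organizer

  def self.call(user)
    with(user: user).reduce(
      FetchesFeed,     # download the RSS feed
      ParsesFeedItems, # turn the XML into article attributes
      CreatesArticles  # persist the articles, no model callbacks involved
    )
  end
end

class FetchesFeed
  extend LightService::Action
  expects :user        # the action fails fast if :user is missing from context
  promises :feed_body  # downstream actions can rely on :feed_body existing

  executed do |context|
    context.feed_body = HTTParty.get(context.user.feed_url).body
  end
end
```

The appeal is exactly what's described above: the whole pipeline is readable in one place, and each step declares its inputs and outputs.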
I think we are at a point where that goal is something we should be tackling now. We have hit a lot of the low-hanging fruit from an SRE and code-maintainability standpoint, and these bigger structural changes are the next step.
Maybe it's time to introduce a separate model? You can bulk-insert data into an unvalidated staging table, and to keep this table from getting too large, the processing step can clean up after itself (delete everything from the import id it just processed). We can also utilize a processing service object in that job, and over time this service will become the canonical way to validate articles, similar to what Molly mentioned. This gives us some isolation and introduces a new layer where abstraction can live, but it is still within our monolith. I don't think this is on the same level as ingesting security vulns from agents, so I'm a bit hesitant to introduce a service boundary with all the complexities that entails (transport latency, cross-service authentication, deployment dependencies) just yet.
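A minimal sketch of that staging-table idea, assuming a hypothetical `FeedImportEntry` model and worker; Rails 6's `insert_all` skips validations and callbacks, so the extract step becomes a single fast INSERT:

```ruby
# Extract: dump raw feed items into the staging table in one statement.
# insert_all bypasses validations/callbacks, so timestamps are set manually.
rows = feed_items.map do |item|
  { import_id: import_id, user_id: user.id, payload: item.to_h,
    created_at: Time.current, updated_at: Time.current }
end
FeedImportEntry.insert_all(rows)

# Transform/load: a background job validates entries, creates articles through
# one canonical service, then cleans up the import it just processed.
module Feeds
  class ProcessImportWorker
    include Sidekiq::Worker

    def perform(import_id)
      FeedImportEntry.where(import_id: import_id).find_each do |entry|
        Articles::CreateFromImport.call(entry) # hypothetical canonical service
      end
      FeedImportEntry.where(import_id: import_id).delete_all
    end
  end
end
```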
@citizen428 I love the idea of using a temporary table! You are essentially taking a lot of the memory burden off the processes and offloading it into a DB table in the interim, before the data is completely processed and made into an article. That approach also gives us a nice "log" as we ingest data, so that when things fail we don't have to "reprocess" to get the failed feed data; we simply have to look in this table.
I agree, the idea of using ETL techniques makes sense, as this kind of operation is mostly asynchronous (we still have the on-demand fetch that can be triggered from https://dev.to/settings/publishing-from-rss, but once we had the ETL machinery it should be trivial to add a use case for that). The only downside of the log table is cleanup, but the ETL workflow could clean up after itself, as mentioned by @citizen428 (though from my personal experience in advertising and finance, you still have to account for cleanup scripts that fail to clean up; that's easily solvable with a counter on the table and an alarm). I still think we should measure the current script and move forward with the performance refactoring we started before swapping parts out, but it's worth exploring (as in: measure whether it makes sense, then change approach). As we're essentially designing for DEV and for future Forems bigger than DEV, we can proceed step by step, with data in our hands more than anything else.
Sounds good @rhymes. I agree that cleanup scripts need to be monitored (which can be as simple as a row count in Datadog), or depending on the implementation you can also just truncate the table every X days. But I agree, step by step with measurements is the way to go here 👍
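For the monitoring side, a sketch using the dogstatsd-ruby client (the metric name, schedule, and retention window are made up for illustration):

```ruby
require "datadog/statsd"

# A scheduled (e.g. daily) housekeeping job: report the staging table's size so
# an alarm can fire if cleanup ever silently stops working, then drop old rows.
class FeedImportHousekeepingWorker
  include Sidekiq::Worker

  def perform
    statsd = Datadog::Statsd.new("localhost", 8125)
    statsd.gauge("feeds.import_entries.count", FeedImportEntry.count)
    FeedImportEntry.where("created_at < ?", 7.days.ago).delete_all
  end
end
```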
Adding these here as they are potential micro-optimizations for now:
Thanks @maestromac for bouncing ideas :D
@rhymes we'd probably have to use them both, because a fresher cached header can still contain a feed item that's already been imported.
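The micro-optimization list itself didn't survive above, but assuming "both" refers to HTTP caching headers plus per-item deduplication, a conditional fetch could look roughly like this (the `feed_etag` and `feed_fetched_at` columns are hypothetical):

```ruby
require "net/http"
require "time"

# Skip the download entirely when the server says the feed is unchanged (304).
# Note a fresh 200 can still contain items we already imported, so per-item
# deduplication is still needed downstream.
def fetch_feed_body(user)
  uri = URI(user.feed_url)
  req = Net::HTTP::Get.new(uri)
  req["If-None-Match"] = user.feed_etag if user.feed_etag
  req["If-Modified-Since"] = user.feed_fetched_at.httpdate if user.feed_fetched_at

  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(req)
  end
  return nil if res.is_a?(Net::HTTPNotModified) # nothing new at all

  user.update(feed_etag: res["ETag"], feed_fetched_at: Time.current)
  res.body
end
```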
Current situation
Currently, Forem's feed-import functionality takes too long, especially for DEV, which has 3420 feeds to go through.
The import is sequential: each feed is downloaded, then parsed, then its articles are built in memory and saved in the DB.
Unfortunately, at least in the case of DEV, we don't really have solid metrics on what is actually slow in production: the current implementation only tracks errors, nothing else. We should consider adding instrumentation before replacing it one way or another; a minimal sketch follows.
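For instance, the existing reader could be wrapped with Rails' built-in `ActiveSupport::Notifications` without changing its behavior (the event name and `fetch_feed` helper here are invented):

```ruby
# Inside the reader: wrap each phase in an instrumentation event.
ActiveSupport::Notifications.instrument("rss_reader.fetch", url: feed_url) do
  fetch_feed(feed_url) # placeholder for the current download step
end

# In an initializer: subscribe and forward timings to logs or a metrics backend.
ActiveSupport::Notifications.subscribe("rss_reader.fetch") do |_name, start, finish, _id, payload|
  Rails.logger.info("rss_reader.fetch #{payload[:url]} took #{(finish - start).round(3)}s")
end
```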
Variables to consider and things learned in benchmarking:
Optimization ideas
There are two main things we can optimize (my opinion is that we should find a combination of both that suits us):
This is why I think we should employ a combination of both: parallelize the fetching and parsing, but be careful about how we write to the `articles` table, as we could end up using too many ActiveRecord connections and exhaust the pool.
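A rough sketch of that combination using concurrent-ruby (the pool size and helper names are placeholders): I/O-bound fetching and parsing happen in a thread pool, while the DB writes stay sequential on a single connection.

```ruby
require "concurrent"

pool = Concurrent::FixedThreadPool.new(8) # network fetching + parsing only
parsed = Concurrent::Array.new            # thread-safe collection of results

users_with_feeds.each do |user|
  pool.post do
    body = fetch_feed(user.feed_url)    # I/O-bound, parallelizes well
    parsed << [user, parse_feed(body)]  # no ActiveRecord touched in here
  end
end
pool.shutdown
pool.wait_for_termination

# Writes happen here, sequentially, on the main thread's single ActiveRecord
# connection -- no risk of exhausting the connection pool.
parsed.each do |user, items|
  items.each { |item| create_article(user, item) }
end
```

Plan of action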
The first step is to write a POC which parallelizes network fetching and parsing. This is one step in a multi-step plan (not necessarily in this order) which comprises:
- instrument `RssReader` to understand what its profile is in production
- add a `Feeds::Import` service class which takes advantage of concurrency to fetch and parse feeds into articles (#10998: Add Feeds::Import service class)

Benchmarks, more or less
Disclaimer: these benchmarks don't really count, as all benchmarks don't really count; these especially, because they were conducted unscientifically, while using the computer for other things. They are only meant to give a really rough idea of what is going on with the `RSSReader` and the future service (called `Feeds::Import` as of today).

With 100 feeds, on October 21st 2020, tested on a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 (16 logical cores) and 64 GB RAM:
`Feeds::Import` was run with 8 fetching threads, 4 parsing threads, and batches of 50 users/feeds.
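As a usage sketch, that batching could look something like this (the `Feeds::Import` keyword interface shown here is hypothetical):

```ruby
# Mirror the benchmark setup: 8 fetching threads, 4 parsing threads,
# users processed in batches of 50.
User.where.not(feed_url: [nil, ""]).in_batches(of: 50) do |batch|
  Feeds::Import.call(users: batch, fetch_threads: 8, parse_threads: 4)
end
```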