Optimize the feed reader for websites with a large number of users #10996

rhymes opened this issue Oct 21, 2020 · 12 comments · Fixed by #12169

rhymes commented Oct 21, 2020

Current situation

Currently Forem's functionality to import feeds takes too long, especially for DEV which has 3420 feeds to go through.

The import is currently sequential: the feed is downloaded, then parsed, then articles are built in memory and saved in the DB.

Unfortunately, at least in the case of DEV, we don't really have solid metrics on what is actually slow in production, as the current reader only tracks errors and nothing else. We should consider adding instrumentation before replacing it one way or another.
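
For instance, a minimal sketch of what that instrumentation could look like, using ActiveSupport::Notifications (the event name and the per-feed fetch call are illustrative, not the current RssReader internals):

```ruby
require "active_support/notifications"

# subscribe once (e.g. in an initializer) and ship the timing wherever we track metrics
ActiveSupport::Notifications.subscribe("rss_reader.fetch") do |_name, start, finish, _id, payload|
  duration_ms = ((finish - start) * 1000).round(1)
  Rails.logger.info("rss_reader.fetch url=#{payload[:url]} duration_ms=#{duration_ms}")
end

# wrap the existing per-feed work; `fetch_feed` stands in for whatever RssReader does today
ActiveSupport::Notifications.instrument("rss_reader.fetch", url: feed.url) do
  fetch_feed(feed.url)
end
```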

Variables to consider and things learned in benchmarking:

  • network latency is not a constant (fetching thousands of feeds can have different performance results depending on network conditions of the upstream servers)
  • different feeds have different lengths, so each user has a variable number of articles to process on each run
  • currently we skip a random number of feeds at each run because feed fetching can be slow
  • Nokogiri parsing occupies a lot of memory (there are literally millions of objects allocated by the nokogiri gem; a profiling sketch follows this list)
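
As a way to quantify that last point, a hedged sketch using the memory_profiler gem (the parsing call and file name are illustrative):

```ruby
require "memory_profiler"
require "nokogiri"

# report how many objects (and how much memory) Nokogiri allocates for a single feed
report = MemoryProfiler.report do
  Nokogiri::XML(File.read("feed.xml"))
end

report.pretty_print(scale_bytes: true)
```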

Optimization ideas

There are two main things we can optimize (my opinion is that we should find a combination of both that suits us):

  1. make the actual fetching of feeds and parsing of them faster (it's all I/O, there's no reason for it to be sequential)
  2. process multiple users in parallel (basically by doing things more or less sequentially but splitting the workload into separate workers, one per user)

This is why I think we should employ a combination of both:

  • we process feeds sequentially, but downloading bytes from the web is inherently parallelizable, so we can download a bunch (in batches, obviously) from the web and then start processing those (a rough sketch follows this list)
  • we parse feeds sequentially, but parsing can be parallelized as well
  • both of the above steps have an upper ceiling, based not just on how many cores the machines run on but also on memory consumption (parsing is the more memory-hungry of the two operations, for obvious reasons)
  • writing articles to the DB can be parallelized, but that doesn't really need to be per user (we'd have 3420 jobs in the queue that could still be individually fast or slow)
  • we can parallelize article creation in batches by changing the logic a little: right now each single "future article" is responsible for checking whether it already exists. We could do that check in one sweep for the entire batch and then drop from the batch the items that don't need to be processed at all
  • we need to be careful about how many concurrent workers write to the articles table, as we could end up using too many ActiveRecord connections and exhausting the pools
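
A rough sketch of the batched parallel download described in the first bullet, using plain Ruby threads (the thread count, helper names and the parsing step are illustrative; the real Feeds::Import may well use a different concurrency primitive):

```ruby
require "net/http"
require "uri"
require "nokogiri"

FETCH_THREADS = 8 # upper ceiling: bounded by cores and memory, not "as many as possible"

def fetch_feed(url)
  Net::HTTP.get(URI(url))
rescue StandardError
  nil # a single broken feed shouldn't take down the whole batch
end

def import_feeds(feed_urls, batch_size: 50)
  feed_urls.each_slice(batch_size) do |batch|
    # download the batch in parallel, FETCH_THREADS at a time
    bodies = batch.each_slice(FETCH_THREADS).flat_map do |slice|
      slice.map { |url| Thread.new { fetch_feed(url) } }.map(&:value)
    end

    batch.zip(bodies).each do |url, xml|
      next if xml.nil?
      feed = Nokogiri::XML(xml) # parsing is the memory-hungry step; it could be parallelized too
      # ... build and persist articles for the user who owns `url` ...
    end
  end
end
```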

Plan of action

The first step is to write a POC which parallelizes network fetching and parsing. It is one step in a multi-step plan (not necessarily in this order) which comprises:

  • add monitoring to the existing RssReader to understand what its profile looks like in production
  • adding a Feeds::Import class which takes advantage of concurrency to fetch and parse feeds into articles - Add Feeds::Import service class #10998
  • add a related sidekiq worker
  • split the writing of articles to the DB into batch workers
  • measure the performance (speed and memory consumption) in production
  • hide the new service behind a feature flag so it can be activated and deactivated at will, transparently (a rough sketch follows this list)
  • refactor, tune, improve and optimize what can be refactored, tuned, improved and optimized (both in the logic and in knobs)
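
As a hedged sketch of the feature-flag step, assuming a Flipper-backed flag and a Sidekiq worker (the flag name, queue options, service interface and fallback call are all illustrative):

```ruby
module Feeds
  class ImportArticlesWorker
    include Sidekiq::Worker
    sidekiq_options queue: :medium_priority, retry: 5

    def perform
      if Flipper.enabled?(:feeds_import)
        Feeds::Import.call # the new concurrent service from #10998; .call interface assumed
      else
        # fall back to the current reader; the exact call here is illustrative
        RssReader.new.fetch_all_rss
      end
    end
  end
end
```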

Benchmarks, more or less

Disclaimer: these benchmarks don't really count, as all benchmarks don't really count. These especially don't, because they were conducted unscientifically, while using the computer for other things. They are only meant to give a really rough idea of what is going on with the RssReader and the future service (called Feeds::Import as of today).

With 100 feeds, on October 21st 2020, tested on a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 (16 logical cores) and 64 GB of RAM:

158.67s user 8.14s system 404.78s real 817832kB mem -- rails fetch_all_rss
151.71s user 7.14s system 258.57s real 781528kB mem -- rails fetch_feeds_import

Feeds::Import was run with 8 fetching threads, 4 parsing threads, and batches of 50 users/feeds.

rhymes added the "type: discussion", "type: optimization", and "internal team only" labels on Oct 21, 2020
rhymes self-assigned this on Oct 21, 2020

rhymes commented Oct 23, 2020

After talking with @benhalpern, a couple of optimization ideas (still to be fleshed out and tested) have come out:

  • is there maybe a chance to bulk insert articles? The issue here is that we have many validators and callbacks attached to articles, and a bulk insert would skip them. I'm more concerned about validators, as they would potentially let in invalid data if disabled. We could probably rely on the eventual article.save to re-validate everything and re-run all the computations (a rough sketch follows this list)

  • can we avoid fetching data for users that haven't published recently? Let's say we stop fetching feeds for users that haven't published an article in 60 days (give or take). Would that be a feasible strategy?
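
On the first point, a minimal sketch of what a bulk insert could look like with Rails 6's insert_all (the column names, unique index name and parsed_items shape are illustrative; note that insert_all skips validations and callbacks, which is exactly the trade-off discussed above):

```ruby
rows = parsed_items.map do |item|
  {
    title: item.title,
    body_markdown: item.content,
    feed_source_url: item.url,
    user_id: user.id,
    created_at: Time.current,
    updated_at: Time.current
  }
end

# unique_by avoids re-inserting feed items that were already imported
Article.insert_all(rows, unique_by: :index_articles_on_feed_source_url) if rows.any?
```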

cc @citizen428 @mstruve as you provided feedback in #10998 but obviously open to anyone :)

mstruve commented Oct 23, 2020

is there maybe a chance to bulk insert articles?

We actually had to deal with a very similar problem at Kenna. There we dealt with what we called connectors. Those connectors would deliver tons of data to our app on vulnerabilities from our clients. In the beginning, we processed and saved these vulns one by one with all the Rails callbacks and it was SLOW. Because it was such a cornerstone of our application, we ended up writing a separate "importer" service to handle creating all of the vulnerabilities from the connectors. This service eventually became in charge of always creating vulnerabilities, whether through the API or a connector. It was a very big change, but it forced us to completely decouple the system from Rails callbacks and got us to the point where there was a single flow for creating vulnerabilities. We then optimized the crap out of that flow (including bulk inserts) until we were able to process hundreds of millions of vulnerabilities a day.

The whole point of that story is: let's think really long term here for Forem. I know it can be tempting to try and fit a solution into our current architecture, but maybe it's time we overhaul some of it? Maybe it's time to make a separate service for something like this, one that gives us more control over it?

What do you all think?

PS being decoupled from Rails callbacks ended up being the best decision we ever made.

rhymes commented Oct 23, 2020

I'm 3000% in favor of phasing out callbacks on non-trivial models like Article, and I do agree it's a long-term goal. We could design a service like that in steps.

We definitely don't need to come up with all the solutions ahead of time, as we all know that's seldom a good approach.

Anyway, I'm definitely in favor of having a single service which also stands as the one true way to do a task in the application code. Following callbacks for side effects becomes tiresome quickly, and it's always a bit tricky to refactor in non-trivial apps like ours.

mstruve commented Oct 23, 2020

When we did it, we ended up using the light-service gem https://github.com/adomokos/light-service, mainly because the guy who wrote it was working with us at the time. I actually really enjoyed using it as it made viewing all the steps for creation very easy. I'm sure there are lots of gems out there like it. Might be worth checking out for helping us architect and organize a flow for article creation and RSS imports.
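
For context, a minimal light-service sketch of what such a flow could look like (the actions and their steps are made up for illustration):

```ruby
require "light-service"
require "net/http"
require "uri"

class FetchFeed
  extend LightService::Action
  expects :feed_url
  promises :feed_xml

  executed do |context|
    context.feed_xml = Net::HTTP.get(URI(context.feed_url))
  end
end

class BuildArticles
  extend LightService::Action
  expects :feed_xml
  promises :articles

  executed do |context|
    context.articles = [] # parse context.feed_xml and build Article records here
  end
end

class ImportFeed
  extend LightService::Organizer

  def self.call(feed_url)
    with(feed_url: feed_url).reduce(FetchFeed, BuildArticles)
  end
end
```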

I'm 3000% in favor of phasing out callbacks on non trivial models like Article and I do agree it's a long term goal.

I think we are at a point where that goal is something we should be tackling now. We have hit a lot of the low-hanging fruit from an SRE and code-maintainability standpoint, and I think these are the things we should be tackling next.

citizen428 commented Oct 26, 2020

is there maybe a chance to bulk insert articles?

Maybe it's time to introduce a separate model? You can bulk insert data into an unvalidated table like RssImportArticle, then trigger some other job to process these later and move valid ones to Article. I've done this in the past and it works well. It's a reasonably common pattern in ELT workloads, where the initial import goes into a different table (often temporary and potentially even in another schema), then data gets moved over to the target table in a processing step.

To keep this table from getting too large the processing step can clean up after itself (delete everything from the import id it just processed). We can also utilize a processing service object in the job I mentioned above and over time this service will become the canonical way to validate articles, similar to what Molly mentioned.

This gives us some isolation and introduces a new layer where abstraction can live, but it is still within our monolith. I don't think this is on the same level as ingesting security vulns from agents, so I'm a bit hesitant to introduce a service boundary with all the complexities that entails (transport latency, cross-service authentication, deployment dependencies) just yet.
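
A rough sketch of that staging-table approach (the model name RssImportArticle comes from the suggestion above; the columns, builder service and job wiring are illustrative):

```ruby
class CreateRssImportArticles < ActiveRecord::Migration[6.0]
  def change
    create_table :rss_import_articles do |t|
      t.bigint :user_id, null: false
      t.string :feed_source_url
      t.text :raw_payload          # the unvalidated feed item, straight from the import
      t.string :import_id, index: true
      t.timestamps
    end
  end
end

class RssImportArticle < ApplicationRecord; end

class ProcessRssImportWorker
  include Sidekiq::Worker

  def perform(import_id)
    RssImportArticle.where(import_id: import_id).find_each do |staged|
      article = Articles::BuildFromImport.call(staged) # hypothetical service that validates
      article.save if article&.valid?
    end
    # the processing step cleans up after itself so the staging table stays small
    RssImportArticle.where(import_id: import_id).delete_all
  end
end
```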

mstruve commented Oct 26, 2020

@citizen428 I love that idea of using a temporary table! You are essentially taking a lot of the memory burden off the processes and offloading it into a DB table in the interim, before it's completely processed and turned into an article. That approach also gives us a nice "log" as we ingest data, so that when things fail we don't have to "reprocess" to get the failed feed data; we simply have to look in this table.

rhymes commented Oct 27, 2020

I agree, the idea of using ETL techniques makes sense, as this kind of operation is mostly asynchronous (we still have the on-demand fetch that can be triggered by https://dev.to/settings/publishing-from-rss, but once we have the ETL machinery it should be trivial to add a use case for that).

The only downside of the log table is cleaning up, but the ETL workflow could clean up after itself as mentioned by @citizen428 (though from my personal experience in advertising and finance, you still have to account for cleanup scripts that fail to clean up; that's easily solvable with a counter on the table and an alarm).

I still think we should measure the current script and move forward with the performance refactoring we started before swapping parts, but I do think this is worth exploring (as in: measure whether it makes sense and then change approach).

As we're essentially designing for DEV, and for future Forems bigger than DEV, we can proceed step by step with data in our hands more than anything else.

@citizen428

Sounds good @rhymes. I agree that cleanup scripts need to be monitored (which can be as simple as row count in DataDog), or depending on the implementation you can also just truncate the table every X days.

But I agree, step by step with measurements is the way to go here 👍

rhymes commented Oct 30, 2020

Adding these here as they are potentially micro-optimizations for now:

  1. using caching headers for servers that support them: why download and parse a feed if we know it hasn't changed since the last time? (a sketch follows this list)
  2. using a streaming parser for servers that don't support caching headers: why load an entire feed into memory (as we do now) if we can use a fast parser that extracts only the fields we need to ask the DB "is this feed item a new article?" That way we would only fully load feeds that contain new items
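
A hedged sketch of the first idea, a conditional GET driven by caching headers (where the previously stored ETag/Last-Modified values would live is left out):

```ruby
require "net/http"
require "uri"

def fetch_if_changed(url, etag: nil, last_modified: nil)
  uri = URI(url)
  req = Net::HTTP::Get.new(uri)
  req["If-None-Match"] = etag if etag
  req["If-Modified-Since"] = last_modified if last_modified

  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(req)
  end

  return nil if res.is_a?(Net::HTTPNotModified) # 304: nothing changed, skip download and parsing

  { body: res.body, etag: res["ETag"], last_modified: res["Last-Modified"] }
end
```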

thanks @maestromac for bouncing ideas :D

@maestromac

@rhymes we probably have to use both, because a response that is fresh according to caching headers can still contain feed items that have already been imported.

rhymes mentioned this issue on Jan 7, 2021