Current situation
Currently Forem's functionality to import feeds takes too long, especially for DEV which has 3420 feeds to go through.
Currently the import is sequential: each feed is downloaded, then parsed, then its articles are built in memory and saved to the DB.
Unfortunately, at least in the case of DEV, we don't really have solid metrics on what is actually slow in production, as the current import only tracks errors and nothing else. We should consider adding instrumentation before replacing it one way or another.
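As a rough illustration of the kind of instrumentation meant here, this is a minimal sketch that records wall-clock time per stage (fetch, parse, save) using only the standard library. `StageTimer` and the three stand-in blocks are hypothetical, not existing Forem code, and the class is not thread-safe as written; in the app the same idea would more likely be wired into whatever metrics client is already in use.

```ruby
require "benchmark"

# Hypothetical helper (not part of Forem): accumulates wall-clock time per named stage.
class StageTimer
  attr_reader :totals, :counts

  def initialize
    @totals = Hash.new(0.0)
    @counts = Hash.new(0)
  end

  # Runs the block, records its duration under `stage`, and returns the block's value.
  def measure(stage)
    result = nil
    @totals[stage] += Benchmark.realtime { result = yield }
    @counts[stage] += 1
    result
  end

  # One human-readable line per stage.
  def report
    totals.map do |stage, seconds|
      format("%-6s %8.3fs over %d calls", stage, seconds, counts[stage])
    end
  end
end

timer = StageTimer.new
xml   = timer.measure(:fetch) { "<rss><item/></rss>" } # stand-in for the HTTP download
items = timer.measure(:parse) { xml.scan("<item") }    # stand-in for Nokogiri parsing
timer.measure(:save) { items.size }                    # stand-in for the DB write
puts timer.report
```

With per-stage totals like these collected in production, it would be possible to tell whether the time goes into the network, the parsing or the DB before deciding what to parallelize.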
Variables to consider and things learned in benchmarking:
- network latency is not a constant (fetching thousands of feeds can have different performance results depending on network conditions of the upstream servers)
- different feeds can have different lengths, so each user has a variable number of articles to process at each run
- currently we skip a random number of feeds at each run because feed fetching can be slow
- Nokogiri parsing occupies a lot of memory (the Nokogiri gem allocates literally millions of objects)
Optimization ideas
There are two main things we can optimize (my opinion is that we should find a combination of both that suits us):
- make the actual fetching of feeds and parsing of them faster (it's all I/O, there's no reason for it to be sequential)
- process multiple users in parallel (basically by doing things more or less sequentially but splitting the workload into separate workers, one per user)
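To make the first idea concrete, here is a minimal sketch of fetching a batch of feeds with a fixed number of threads, using only `Thread`, `Queue` and `Mutex` from core Ruby. The method name `fetch_concurrently` and the `fetcher` callable are assumptions made for illustration; the fake fetcher just sleeps to simulate network latency, and a real one would perform the HTTP request instead.

```ruby
# Hypothetical sketch: fetch a batch of feeds with a fixed number of threads.
# `fetcher` is any callable that takes a URL and returns the response body.
def fetch_concurrently(urls, num_threads:, fetcher:)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once closed and drained, `pop` returns nil and the workers stop

  results = {}
  lock = Mutex.new

  workers = Array.new(num_threads) do
    Thread.new do
      while (url = queue.pop)
        body = begin
          fetcher.call(url)
        rescue StandardError
          nil # one failing feed should not take down the whole batch
        end
        lock.synchronize { results[url] = body }
      end
    end
  end

  workers.each(&:join)
  results
end

# Usage with a fake fetcher that simulates network latency.
fake_fetcher = ->(url) { sleep(0.05); "<rss>#{url}</rss>" }
urls = (1..8).map { |i| "https://example.com/feed/#{i}" }

urls.each_slice(4) do |batch|
  bodies = fetch_concurrently(batch, num_threads: 4, fetcher: fake_fetcher)
  puts "fetched #{bodies.size} feeds"
end
```

The number of threads caps how many downloads are in flight at once, and `each_slice` caps how many response bodies are held in memory before the next stage picks them up.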
This is why I think we should employ a combination of both:
- we process feeds sequentially, but downloading bytes from the web is inherently parallelizable, so we can download a bunch (in batches obviously) from the web and then start processing those
- we parse feeds sequentially, but parsing can also be parallelized
- both of the above steps have an upper ceiling based not just on how many cores the machines have but also on memory consumption (parsing is the more memory-hungry of the two operations, given how many objects Nokogiri allocates)
- writing articles to the DB can be parallelized, but that doesn't really need to be per user (we'd have 3420 jobs in the queue that could still be individually fast or slow)
- we can parallelize article creation in batches by changing the logic a little bit: right now each single "future article" is responsible for knowing whether it already exists, through its own conditional check. We could do this check in one swoop for the entire batch and then remove from the batch of workers those that don't need to be processed at all
- we need to be careful about how many workers we add concurrently that write to the articles table, as we could end up using too many ActiveRecord connections and exhaust the pool
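The "one swoop" existence check could look roughly like the sketch below. It is plain Ruby so that it runs on its own: `FeedItem`, `new_items_only` and the `existing_lookup` callable are hypothetical names, and the lookup stands in for a single query against the articles table that covers the whole batch.

```ruby
require "set"

# Hypothetical value object standing in for a parsed feed entry.
FeedItem = Struct.new(:url, :title)

# Returns only the items whose URL is not already stored.
# `existing_lookup` is an assumed callable: given an array of URLs, it returns
# the subset that already exists, ideally with one query for the whole batch.
def new_items_only(items, existing_lookup:)
  urls = items.map(&:url).uniq
  existing = existing_lookup.call(urls).to_set
  items.uniq(&:url).reject { |item| existing.include?(item.url) }
end

# In-memory stand-in for the articles table.
stored_urls = Set["https://example.com/a"]
lookup = ->(urls) { urls.select { |url| stored_urls.include?(url) } }

batch = [
  FeedItem.new("https://example.com/a", "Already imported"),
  FeedItem.new("https://example.com/b", "New post"),
  FeedItem.new("https://example.com/b", "New post (duplicate entry)"),
]

p new_items_only(batch, existing_lookup: lookup).map(&:title) # => ["New post"]
```

Only the surviving items would then be handed to the workers that write to the DB, which also keeps the number of jobs, and therefore of ActiveRecord connections in use, lower than one job per candidate article.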
Plan of action
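The first step is to write a POC which parallelizes network fetching and parsing. This PR is one step in a multi-step plan (not necessarily in this order) which comprises:
- adding instrumentation to RssReader to understand what its profile is in production
- a Feeds::Import class which takes advantage of concurrency to fetch and parse feeds into articles - Add Feeds::Import service class #10998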
Benchmarks, more or less
Disclaimer: these benchmarks don't really count, as all benchmarks don't really count. These especially, because they were conducted unscientifically, while using the computer for other things. They are only meant to give a really rough idea of what is going on with RssReader and the future service (called Feeds::Import as of today).
With 100 feeds, on October 21st 2020, tested on a MacBook Pro 2.4 GHz 8-Core Intel Core i9, 16 cores, 64GB RAM:
158.67s user 8.14s system 404.78s real 817832kB mem -- rails fetch_all_rss
151.71s user 7.14s system 258.57s real 781528kB mem -- rails fetch_feeds_import
Feeds::Import was run with 8 fetching threads, 4 parsing threads, with batches of 50 users/feeds
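To put the two runs side by side, the arithmetic below uses only the numbers reported above. It comes out at roughly a 1.57x speedup in wall-clock time (about 36% less), while user CPU time drops by only about 4%. That is consistent with the gain coming from overlapping the waiting rather than from doing less work, though a single unscientific run is thin evidence.

```ruby
# Wall-clock ("real") seconds, CPU ("user") seconds and memory copied from the two runs above.
sequential = { real: 404.78, user: 158.67, mem_kb: 817_832 } # rails fetch_all_rss
concurrent = { real: 258.57, user: 151.71, mem_kb: 781_528 } # rails fetch_feeds_import

speedup        = sequential[:real] / concurrent[:real]
real_saved_pct = (1 - concurrent[:real] / sequential[:real]) * 100
user_saved_pct = (1 - concurrent[:user] / sequential[:user]) * 100
mem_saved_pct  = (1 - concurrent[:mem_kb].to_f / sequential[:mem_kb]) * 100

puts format("speedup:               %.2fx", speedup)
puts format("wall-clock time saved: %.1f%%", real_saved_pct)
puts format("user CPU time saved:   %.1f%%", user_saved_pct)
puts format("memory saved:          %.1f%%", mem_saved_pct)
```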