Problem: Ingest will always be flooded with as many transfers as are queued using the new (1.11) queuing behavior #1010

Open
jorikvankemenade opened this issue Nov 28, 2019 · 12 comments
Labels
Piql, NHA, Request: discussion (The path towards resolving the issue is unclear and opinion is sought from other community members.), Status: refining (The issue needs additional details to ensure that requirements are clear.)

Comments

jorikvankemenade commented Nov 28, 2019

Please describe the problem you'd like to be solved
I was just testing the work @cole did on package queueing in this PR. First of all, props to that! It provides a nice way to limit the number of transfers without having to implement custom throttling in your API calls (#911).

This PR introduces the concept of packages and limits the number of packages that run concurrently. From the PR:

If there are too many active packages, the Job is placed on a deferred package queue and held until a package stops processing (packages stop processing when they reach a decision point in the workflow that hasn't been pre-configured, or when they hit specific workflow links denoting SIP/Transfer completion or failure). If there is room to process the package, the job is placed on an active job queue, which is consumed by the processing loop running on the main thread.

I like this idea, it is simple and it seems to work pretty well. But I ran into one "problem" when adding a lot of packages at once. Let's say you add 20 packages. Currently, you will run each of the 20 packages until the Complete transfer job is finished. Once every package has finished this job, you will create the first SIP and start the ingest process.

This is because one of the workflow links used as a queueing trigger is the successful completion of Move to SIP creation directory for completed transfers. As a result, the latency of a large group of transfers is proportional to the number of started transfers.
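To make the mechanism concrete, here is a minimal sketch of the gating behaviour described in the quoted PR. The names (CONCURRENT_PACKAGES, schedule_job, package_stopped) are made up for illustration; this is not the actual MCPServer code.

```python
import collections

CONCURRENT_PACKAGES = 2          # assumed configurable limit
active_packages = set()
deferred_package_queue = collections.deque()
job_queue = collections.deque()  # consumed by the main processing loop

def schedule_job(job, package):
    """Queue a job, deferring it if too many other packages are active."""
    if package in active_packages or len(active_packages) < CONCURRENT_PACKAGES:
        active_packages.add(package)
        job_queue.append(job)
    else:
        deferred_package_queue.append((job, package))

def package_stopped(package):
    """Called when a package reaches a queueing link, e.g. the successful
    completion of Move to SIP creation directory for completed transfers."""
    active_packages.discard(package)
    if deferred_package_queue:
        schedule_job(*deferred_package_queue.popleft())
```

Because a deferred package only gets a slot when an active one reaches such a queueing point, where those points sit in the workflow determines how long a batch of transfers takes to reach ingest.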

Describe the solution you'd like to see implemented
I think you should always aim for the smallest possible latency, especially if throughput does not suffer. If we remove the Move to SIP creation directory for completed transfers link as a queueing point, we still have the same throughput (number of concurrent packages) but at a lower latency. You can use and inspect your AIPs earlier, and the overall AIP creation time will not suffer.

I think there is an additional advantage in the case of major problems with the Archivematica install. By queueing packages right after creating them, we don't have any transfers in flight. If we lose the queue, we can reconstruct the right "state" by resubmitting the API calls. For transfers that are further along in the processing chain, this might be more difficult. However, this is just a feeling; I am not familiar enough with the MCP Server design to make a fully qualified argument here.

Describe alternatives you've considered
An obvious alternative is to change nothing. I wasn't involved in designing this functionality, so maybe there is a very solid argument for the current design. Maybe @cole can elaborate a bit on this and also give his opinion on my proposed change.

A second alternative is to say that people need to rate-limit their API calls. This queuing mechanism works well for "normal" usage of the API. If you want to overload the API with a backlog of transfers, say tens or hundreds, you need to use some rate-limiting mechanism. One option I can think of is to add queueing to Enduro, so that if I add 20 transfers to a watched directory, Enduro only sends a couple of them at a time. The disadvantage is that you move implementation details, like processing capacity, to the user. Maybe @sevein has some additional ideas about this?

Additional context
I used CentOS 7 and installed the latest Archivematica qa/1.x branch as well as the latest Archivematica Storage Service qa/0.x branch. To create many transfers at once, I used Enduro with a watched directory and copied 20 transfers into the watched directory at the same time.


For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged
  • Details about this issue have been added to the release notes
@ross-spencer (Contributor)

Hi @jorikvankemenade, this is really fascinating. From the perspective of those reading the issue here, could you consider renaming it with the problem? e.g. (if I read correctly, and I'm not sure I am) Problem: Ingest will always be flooded with as many transfers as are queued using the new (1.11) queuing behavior? And then adding the discussion label? From my perspective, in this instance it just lowers the barrier to understanding what is being asked. In other cases where we use Problem it is to ask users to consider the multiple possible ways of resolving what someone might be experiencing. Thank you ahead of time if you can consider that! 🙂

jorikvankemenade changed the title from "Discussion: MCPServer package queueing behaviour" to "Problem: Ingest will always be flooded with as many transfers as are queued using the new (1.11) queuing behavior" on Nov 28, 2019
@jorikvankemenade (Author)

@ross-spencer sure, I wasn't sure about the best title. It was a tie between your suggestion and the original option. Unfortunately, I can't add the label so if you could do that for me that would be nice!

@ross-spencer (Contributor)

Oh yeah! Sure thing @jorikvankemenade, I think we can fix that so you can add labels yourself... let me ask @sromkey for future reference, and I can add it now!

ross-spencer added the "Request: discussion" label on Nov 28, 2019
sromkey (Contributor) commented Nov 28, 2019

@ross-spencer sure, I wasn't sure about the best title. It was a tie between your suggestion and the original option. Unfortunately, I can't add the label so if you could do that for me that would be nice!

@jorikvankemenade if you'd like to have label adding privileges feel free to become a member in the repo!!

@scollazo or @sevein I'm wondering if either of you want to take a look at this in relation to testing we've been doing for NHA?

sevein (Contributor) commented Nov 28, 2019

@sromkey, we could definitely take a look. Santi has been observing the same behaviour.

@joel-simpson (Member)

I really like the enhancement proposed above, as well as the suggested rate-limiting feature for Enduro.

Having said that, while the behaviour is subtly different now in 1.11, the basic problem that Archivematica can get "flooded" by placing a large number of transfers into a watched directory has existed in all of the previous versions. Cole's PR improves that situation considerably for small-ish groups of transfers (e.g. 20 or even 50), but probably won't help if you flood a pipeline with 1,000 transfers at once. Given that this has always been a problem, I don't think it is a high priority right now.

jorikvankemenade (Author) commented Nov 28, 2019

[T]he basic problem that Archivematica can get "flooded" by placing a large number of transfers into a watched directory has existed in all of the previous versions. Cole's PR improves that situation considerably for small-ish groups of transfers (e.g. 20 or even 50) but probably won't help if you flood a pipeline with 1,000 transfers at once.

I fully agree with this. Flooding Archivematica with thousands of transfers at once is outside of what I would call reasonable, supported usage, so I don't think this should be the first thing on our minds. But I do think it should be possible to support this case. I have given it some thought, and I have a very rough proposal; I am wondering what you guys think. If it is any good, I can make a better plan and see if I can find the time to make some PRs.

My proposal would include two changes to Archivematica and one addition to Enduro, and I think it would make sense even if you don't use Enduro. The first change would be the one proposed above: once a package is accepted/started, it won't end up in a queue again. This has two advantages: the lower latency mentioned at the start of this issue, and a single backlog for all work in Archivematica.

Having a single backlog for all packages enables the second change. An enormous backlog doesn't make sense: you don't want thousands of packages waiting, even if you have an institutional backlog of thousands of transfers, and each package takes up space in your shared directory, which is of course not unlimited. If we have this single queue, we can give it a configurable capacity. Once this capacity is reached, the API can return an error signaling the user to try again later.

And this brings me to my Enduro change. If the API can signal that it has had enough, we have a natural way of rate-limiting our clients: you can send packages to Archivematica until you get an error, or you could even create an API call that exposes the current capacity. The client just needs to handle this potential error scenario and make sure to retry later, for example by using a persistent queue.
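A very rough sketch of what I mean, with made-up names (BACKLOG_CAPACITY, submit_package, client_submit) rather than any existing Archivematica or Enduro API:

```python
import queue
import time

BACKLOG_CAPACITY = 50                  # hypothetical configurable setting
backlog = queue.Queue(maxsize=BACKLOG_CAPACITY)

def submit_package(package):
    """Server side: accept the package or signal 'try again later'."""
    try:
        backlog.put_nowait(package)
        return 202                     # accepted for processing
    except queue.Full:
        return 503                     # backlog full, retry later

def client_submit(package, retries=5, delay=30):
    """Client side (e.g. Enduro): back off and retry while the pipeline is full."""
    for _ in range(retries):
        if submit_package(package) == 202:
            return True
        time.sleep(delay)              # or park the package in a persistent queue
    return False
```

The status codes and retry policy are just placeholders; the point is that the server owns the capacity decision and the client only needs a retry loop or a persistent queue.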

sromkey added the "Status: refining" label on Jan 6, 2020
@jorikvankemenade (Author)

I have made some progress in understanding this issue. The basic building blocks for pipelining transfers are already in place in the form of package prioritization. This is organized so that jobs that are further along in their lifecycle are prioritized above jobs that are earlier in it: if there is processing capacity, a DIP is chosen over a SIP and a SIP is chosen over a Transfer.
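To illustrate the idea (the queue names below are assumptions based on my reading, not necessarily the real MCPServer identifiers), the prioritization boils down to consuming per-type queues in lifecycle order whenever processing capacity frees up:

```python
import collections

# One queue per package type; when a processing slot becomes available the
# scheduler always prefers the queue for the most advanced lifecycle stage.
dip_queue = collections.deque()
sip_queue = collections.deque()
transfer_queue = collections.deque()

def next_package():
    """Pick the next package to activate: DIP before SIP before Transfer."""
    for q in (dip_queue, sip_queue, transfer_queue):
        if q:
            return q.popleft()
    return None
```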

The basic scheduling is package-based. At certain points in the workflow, packages are activated and deactivated. The work loop in the MCP Server takes a greedy approach towards keeping the maximum number of packages active in the workflow. This means that if a package is deactivated during processing, before it has turned into a higher-level package (e.g. a Transfer into a SIP), it will end up at the back of the Transfer queue.

An example of this is that this link is defined as terminal. As a result, every Transfer will call the package_completed_callback, so the package is deactivated as soon as it completes the "Complete Transfer" phase and a new package is scheduled. If you submit n transfers at approximately the same time, the transfer_queue contains all transfers in submission order, and the scheduler schedules the oldest transfer as soon as a Transfer hits the "Complete Transfer" point. The deactivated Transfer is appended to the end of the transfer_queue as soon as the watched_directory handler has detected it in the right directory. As a result, each of the n transfers hits the transfer_queue twice.

The problems with the links are discussed and solved in #1055 and #1108. But it is important to realize that for a nice "pipelined" execution of transfers we need to ensure that a Package is never deactivated unless it is actually done. I'll link some PRs dealing with this issue later this weekend. This should solve the major problem we discussed above. The remaining question is the "nice to have" of rate-limiting using the API.

cole commented Feb 23, 2020

Hi @jorikvankemenade, sorry for the long silence.

I don't have anything very useful to say, unfortunately; it's been a while since I implemented this. The goal of the queuing change was to process each transfer completely (i.e. produce an AIP) before starting work on the next transfer. As you've seen, there are multiple queues, split by package type.

It sounds like this is not working as intended, but I'm not sure why, and I'm not sure I follow your proposed change here exactly. In general I think there's lots of room for improvement on MCP server 🙂. If you tag me in a PR I'll try to comment in a more timely manner.

Also, I struggled a lot with when a package is "complete", and how that is defined (or not defined) in the workflow and data model. It seems like the proposal to add terminal nodes to the workflow would greatly improve that situation.

@jorikvankemenade (Author)

Hi @cole, no problem, everyone deserves their time away!

It sounds like this is not working as intended, but I'm not sure why, and I'm not sure I follow your proposed change here exactly.

The change is actually quite straightforward. We need to remove d27fd07e-d3ed-4767-96a5-44a2251c6d0a as a terminal link and then it should behave as intended. I have done this in PR 1577, although I don't think I mentioned this explicitly.

jorikvankemenade (Author) commented Feb 26, 2020

I have opened a PR with a possible solution for limiting the number of transfers a user can start (the limit is configurable). Any feedback on that approach is welcome :).

sromkey (Contributor) commented Jul 19, 2020

Hi @jorikvankemenade, sorry (again!) for the long silence on this issue. I think it's definitely still an interesting discussion to have, but we're scoping the 1.12 release quite small for some internal reasons. Just wanted to let you know why I'm about to remove the triage label :)
