Problem: Ingest will always be flooded with as many transfers as are queued using the new (1.11) queuing behavior #1010
Comments
Hi @jorikvankemenade, this is really fascinating. From the perspective of those reading the issue here, could you consider renaming it with the problem? e.g. (If I read correctly, and I'm not sure I am)
@ross-spencer sure, I wasn't sure about the best title. It was a tie between your suggestion and the original option. Unfortunately, I can't add the label, so if you could do that for me that would be nice!
Oh yeah! Sure thing @jorikvankemenade, I think we can fix it so that you can add labels yourself... let me ask @sromkey for future reference, and I can add it now!
@jorikvankemenade if you'd like to have label adding privileges feel free to become a member in the repo! @scollazo or @sevein I'm wondering if either of you want to take a look at this in relation to testing we've been doing for NHA?
@sromkey, we could definitely take a look. Santi has been observing the same behaviour.
I really like the proposed enhancement suggested above, as well as the suggested feature for rate-limiting with Enduro. Having said that, while the behaviour is subtly different now in 1.11, the basic problem that Archivematica can get "flooded" by placing a large number of transfers into a watched directory has existed in all of the previous versions. Cole's PR improves that situation considerably for small-ish groups of transfers (e.g. 20 or even 50), but probably won't help if you flood a pipeline with 1,000 transfers at once. Given that this has always been a problem, I don't think it is a high priority right now.
I fully agree with this. Flooding Archivematica with 1,000s of transfers at once is outside of what I would call reasonable, supported usage, so I don't think this should be the first thing on our minds. But I do think that it should be possible to support this case. I have given it some thought, and I have a very rough proposal. I am wondering what you all think. If it is any good, I can make a better plan and see if I can find the time to make some PRs.

My proposal would include two changes to Archivematica and one addition to Enduro, and I think it would make sense even if you don't use Enduro. The first change would be my proposed change above: once a package is accepted/started, it won't end up in a queue. This has two advantages: the reduced latency mentioned at the start of this issue, and a single backlog for all work in Archivematica.

Having a single backlog for all packages enables the second change. An enormous backlog doesn't make sense: you don't want to have 1,000s of packages waiting, even if you have an institutional backlog of 1,000s of transfers. And each package takes up space in your shared directory, which is of course not unlimited. If we have this single queue, we can give it a configurable capacity. Once this capacity is reached, the API can return an error signaling the user to try again later.

And this brings me to my Enduro change. If the API can signal that it has had enough, we have a natural way of rate-limiting our clients. You can send packages to Archivematica until you get an error, or you could even create an API call that exposes the current capacity. The client just needs to handle this potential error scenario and make sure to retry later, for example by using a persistent queue.
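A minimal sketch of the bounded-backlog idea could look like the following. All names here (`BoundedPackageQueue`, `CapacityExceeded`) are hypothetical illustrations, not actual Archivematica classes; the point is only that a full queue maps naturally onto an API error the client can retry on:

```python
import queue


class CapacityExceeded(Exception):
    """Hypothetical error type; would map to e.g. HTTP 503 + Retry-After."""


class BoundedPackageQueue:
    """Sketch of a single backlog with a configurable capacity."""

    def __init__(self, capacity=100):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, package_id):
        """Accept a package, or signal the caller to retry later."""
        try:
            self._queue.put_nowait(package_id)
        except queue.Full:
            # The API layer would translate this into an error response.
            raise CapacityExceeded("backlog full, retry later")

    def remaining_capacity(self):
        """Could back an API call exposing the current free capacity."""
        return self._queue.maxsize - self._queue.qsize()
```

The design choice is that backpressure is explicit: instead of silently accumulating packages in the shared directory, the pipeline refuses new work and tells the client when to come back.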
I have made some progress in understanding this issue. The basic building blocks for pipelining transfers are already in place via package prioritization. This is organized in such a way that jobs that are further along in their progress are prioritized over jobs that are earlier in their lifecycle. This means that if there is processing capacity, a DIP is chosen over a SIP, and a SIP is chosen over a Transfer.

The basic scheduling is package based. At certain points in the workflow, packages are activated and deactivated. The work loop in the MCP Server takes a greedy approach towards having the maximum number of packages active in the workflow. This means that if, during processing, a package is deactivated before it has turned into a higher-level package, e.g. a Transfer into a SIP, it will end up at the back of the Transfer queue. An example of this is that this link is defined as terminal. As a result, every Transfer will call the

The problems with the links are discussed and solved in #1055 and #1108. But it is important to realize that for a nice "pipelined" execution of transfers we need to ensure that a
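The "further along wins" prioritization described above can be sketched with a simple priority queue. This is a hypothetical illustration of the scheduling order (DIP before SIP before Transfer), not the actual MCP Server implementation:

```python
import heapq

# Lower number = higher priority: work that is further along its
# lifecycle is picked first when there is processing capacity.
PRIORITY = {"DIP": 0, "SIP": 1, "Transfer": 2}


class PackageScheduler:
    """Hypothetical sketch of package-type prioritization."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within a type

    def activate(self, package_type, package_id):
        heapq.heappush(
            self._heap, (PRIORITY[package_type], self._counter, package_id)
        )
        self._counter += 1

    def next_package(self):
        """Return the highest-priority active package, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Under this ordering, a deactivated Transfer that re-enters the queue lines up behind all waiting SIPs and DIPs, which is exactly why the terminal-link behaviour described above matters for pipelining.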
Hi @jorikvankemenade, sorry for the long silence. I don't have anything very useful to say unfortunately; it's been a while since I implemented this. The goal of the queuing change was to process each transfer completely (i.e. produce an AIP) before starting work on the next transfer. As you've seen, there are multiple queues, split by package type. It sounds like this is not working as intended, but I'm not sure why, and I'm not sure I follow exactly your proposed change here. In general I think there's lots of room for improvement on MCP server 🙂. If you tag me in a PR I'll try to comment in a more timely manner. Also, I struggled a lot with when a package is "complete", and how that is defined (or not defined) in the workflow and data model. It seems like the proposal to add terminal nodes to the workflow would greatly improve that situation.
Hi @cole, no problem, everyone deserves their time away!
The change is actually quite straightforward. We need to remove
I have opened a PR with a possible solution for limiting the number of transfers (configurable) a user can start. Any feedback on that approach is welcome :).
Hi @jorikvankemenade, sorry (again!) for the long silence on this issue. I think it's definitely still an interesting discussion to have, but we're scoping the 1.12 release quite small for some internal reasons. Just wanted to let you know why I'm about to remove the triage label :)
Please describe the problem you'd like to be solved
I was just testing the work @cole did on package queueing in this PR. First of all, props to that! It provides a nice way to limit the number of transfers without having to implement custom throttling in your API calls (#911).
This PR introduces the concept of packages and limits the number of packages that run concurrently. From the PR:
I like this idea, it is simple and it seems to work pretty well. But I ran into one "problem" when adding a lot of packages at once. Let's say you add 20 packages. Currently, you will run each of the 20 packages until the `Complete transfer` job is finished. Once every package has finished this job, you will create the first SIP and start the ingest process. This is because one of the workflow links that is used as a queueing trigger is the successful completion of `Move to SIP creation directory for completed transfers`. As a result of this, the latency of a large group of transfers is proportional to the number of started transfers.

Describe the solution you'd like to see implemented
I think that you should always go for the smallest possible latency, especially if the throughput does not suffer. If we remove the `Move to SIP creation directory for completed transfers` link as a queueing point, we will still have the same throughput (number of concurrent packages) but at a smaller latency. You can use and inspect your AIPs earlier, and the overall AIP creation time will not suffer.

I think an additional advantage is in the case of major problems with the Archivematica install. By queueing the packages right after creating them, we don't have any transfers in flight. If we lose the queue, we can reconstruct the right "state" by resubmitting the API calls. For transfers that are further in the processing chain, this might be more difficult. However, this is just a feeling. I am not familiar enough with the MCP Server design to make a fully qualified argument here.
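The latency claim can be illustrated with back-of-the-envelope arithmetic. The timings and the assumption of serial transfer processing below are purely hypothetical, chosen only to show the proportionality, not measured Archivematica behaviour:

```python
def first_aip_latency(n_transfers, t_transfer, t_ingest, queue_at_sip_link):
    """Time until the *first* AIP appears, under two queueing policies.

    Hypothetical model: each transfer phase takes t_transfer, each
    ingest phase t_ingest, transfers processed serially for simplicity.
    """
    if queue_at_sip_link:
        # All transfers finish the transfer phase before any ingest starts,
        # so the first AIP waits on every started transfer.
        return n_transfers * t_transfer + t_ingest
    # Pipelined: the first package proceeds straight into ingest.
    return t_transfer + t_ingest
```

With 20 transfers at 5 minutes each and a 10-minute ingest, the queued variant yields a first AIP after 110 minutes versus 15 minutes pipelined, while total throughput is unchanged, which is the trade-off argued for above.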
Describe alternatives you've considered
An obvious alternative is to change nothing. I wasn't involved in designing this functionality, so maybe there is a very solid argument for the current design. Maybe @cole can elaborate a bit on this and also give his opinion on my proposed change.
A second alternative is to say that people need to rate-limit their API calls. This queuing mechanism works well for "normal" usage of the API. If you want to overload the API with a backlog of transfers, like 10s or 100s, you need to use some rate-limiting mechanism. One option I can think of is to add queueing to enduro. So if I add 20 transfers to a watched directory, enduro can only send a couple of them at the same time. The disadvantage is that you move implementation details, like processing capacity, to the user. Maybe @sevein has some additional ideas about this?
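If the rate-limiting burden does land on the client, the usual shape is retry with backoff. A minimal sketch, assuming a hypothetical `start_transfer` callable and a hypothetical `PipelineBusy` error raised when the pipeline is at capacity (neither is a real enduro or Archivematica API):

```python
import time


class PipelineBusy(Exception):
    """Hypothetical 'at capacity' signal, e.g. an HTTP 503 from the API."""


def submit_with_retry(start_transfer, path, max_attempts=5, base_delay=1.0):
    """Keep retrying a submission while the pipeline reports it is busy."""
    for attempt in range(max_attempts):
        try:
            return start_transfer(path)
        except PipelineBusy:
            # Exponential backoff spreads out the retries.
            time.sleep(base_delay * (2 ** attempt))
    raise PipelineBusy(f"gave up on {path} after {max_attempts} attempts")
```

A persistent queue on the client side, as suggested above, would simply make these retries survive a client restart.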
Additional context
I used CentOS 7 and installed the latest Archivematica `qa/1.x` branch as well as the latest Archivematica Storage Service `qa/0.x` branch. For creating many transfers, I used enduro with a watched directory, copying 20 transfers at once into the watched directory.