
Byte and item count limits for Sitemaps #382

Open · huntharo wants to merge 6 commits into master

Conversation

huntharo
Contributor

@derduher - Let me know your thoughts on how to proceed given that the Transforms appear to just hang after the error (either using throw or passing it to the callback). I'm not sure if that hang is due to our usage of Transform, which would need to be fixed, or if throwing during a write makes the Transform unusable by design. I think the hang might be our issue to fix.

@derduher
Collaborator

derduher commented Jan 3, 2022

@huntharo I think you need to call the callback with the error and handle it there. It's also very possible this exact scenario is already handled by streams and we just haven't run into it yet. It seems like the kind of thing you'd want to handle with a write stream, like a backpressure error or something. https://nodejs.org/api/stream.html#implementing-a-transform-stream
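For reference, a minimal sketch of the "pass the error to the callback" pattern for a Transform. The class name, limit, and error message are made up for illustration and are not the sitemap.js implementation:

import { Transform, TransformCallback } from 'stream';

// Hypothetical names for illustration; not the sitemap.js implementation.
class FussyTransform extends Transform {
  _transform(chunk: string, _enc: BufferEncoding, callback: TransformCallback): void {
    if (chunk.length > 1024) {
      // Report the problem through the callback instead of throwing.
      callback(new Error('item too large'));
      return;
    }
    callback(null, chunk); // pass the chunk through unchanged
  }
}

const t = new FussyTransform({ objectMode: true });
// The consumer handles the reported error via the 'error' event.
t.on('error', (err) => console.error('write rejected:', err.message));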

@derduher
Collaborator

derduher commented Jan 3, 2022

Some notes/thoughts.
note 1: The transform is in object mode, if I recall, which will affect how some of the docs you read apply.
thought 2: I wonder if it might be easier to take the string output of the transform and pipe it into another transform in byte mode and have it do the switch... That might already be partially how it works; it might just need to track the bytes in there.
Yeah, see line 120ish of sitemap-index-stream. That tracks the count by wrapping the stream. It's in object mode though.
Or have the wrapping stream handle the error emitted when the write stream sees it would exceed 50k if it wrote, and use that to kick off the getSitemapStream function and write the failed-to-write object to the new stream. A rough sketch of the byte-counting idea follows.
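Something like this, purely illustrative (the class name and wiring are assumptions, not existing sitemap.js code):

import { Transform, TransformCallback } from 'stream';

// A pass-through transform in byte (non-object) mode that just tracks how
// many bytes have flowed through it.
class ByteCounter extends Transform {
  public bytesSeen = 0;

  _transform(chunk: Buffer, _enc: BufferEncoding, callback: TransformCallback): void {
    this.bytesSeen += chunk.length; // chunks arrive as Buffers in byte mode
    callback(null, chunk);          // pass the data through unchanged
  }
}

// Hypothetical wiring: sitemapStream.pipe(new ByteCounter()).pipe(destination);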

@derduher
Collaborator

derduher commented Jan 3, 2022

I saw this first so I'm replying to some of your thoughts in the bug thread:
I think trying to anticipate hitting the limit is a flawed approach: at the object level, a single URL entry can have a massive description while everything that came before it was tiny. It's possible there is an XSD-defined limit for every possible XML field that you could add up...
Probably the more elegant-ish solution would be to exceed the limit, then back up and reattempt after a new sitemap file has been readied.

@huntharo
Contributor Author

@derduher - OK, I think we are largely in agreement here: the best thing is to absolutely keep the sitemap under the specified size limit, since the big search bots have set the limit to 50 MB and I think they will just ignore files larger than that. I also agree that a single large item could quite easily push a file over the limit.

Regarding the SitemapAndIndexStream._transform, I think that case is different for 2 reasons:

  • It's only a count limit, not a size limit. A count limit is easy because you can know before the write whether you are already at the limit. A size limit is harder because we serialize to XML in the stream, so we do not know the size of the item in hand until we do the write.
  • SitemapAndIndexStream is a consumer of SitemapStream and manages the count limit itself... which SitemapStream consumers can already do today. Handling the limit inside the Transform is more challenging.

I did some stand-alone experiments and it looks like Transform will always close the stream after an error.

Here is, I think, the best we can do (a rough sketch follows the list):

  • No behavior change when byte limit is not set
  • When byte limit is set, serialize the item, check the size (as done in the PR already):
    • If size will not be exceeded, write the item
    • If size limit will be exceeded, write the XML end tag, then raise the stream error
      • This will close the stream to further writes
      • This will also notify the consumer that no further items can be written to this stream and that they should create a new stream
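Roughly what that check might look like inside the transform. This is a sketch only; the field names, limit option, error message, and serialization are placeholders rather than the actual PR code:

import { Transform, TransformCallback } from 'stream';

// Placeholder names throughout; opening-tag handling is omitted for brevity.
const CLOSE_TAG = '</urlset>';

class ByteLimitedSitemapStream extends Transform {
  private byteCount = 0;

  constructor(private readonly byteLimit?: number) {
    super({ objectMode: true });
  }

  _transform(item: { url: string }, _enc: BufferEncoding, callback: TransformCallback): void {
    const xml = `<url><loc>${item.url}</loc></url>`; // stand-in for the real serialization
    if (
      this.byteLimit !== undefined &&
      this.byteCount + Buffer.byteLength(xml) + Buffer.byteLength(CLOSE_TAG) > this.byteLimit
    ) {
      // Close the document so the output is still valid XML...
      this.push(CLOSE_TAG);
      // ...then raise the error; the stream will not accept further writes.
      callback(new Error('adding this item would exceed the byte limit'));
      return;
    }
    this.byteCount += Buffer.byteLength(xml);
    callback(null, xml);
  }
}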

Trying to update this into the PR now.

@huntharo huntharo marked this pull request as ready for review January 15, 2022 21:02
@huntharo
Copy link
Contributor Author

@derduher - OK this is ready for review. Added specific exception types so they can be identified and added an example that shows how to rotate the files as they fill up (including writing the last item that could not be written to the prior sitemap).

A key point here is that write on a stream is async, and write must have only one outstanding call at a time so that it can be determined exactly which item caused the overflow and no items are written after it. That means consumers need to wrap write with a Promise, as shown in the example, so they can await each write, catch the error that is thrown, rotate the sitemap to a new file, and redo the write.
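For illustration, one way a consumer might do that wrapping (the helper names and the factory are assumptions, not the exact code in the PR's example):

import { Writable } from 'stream';

// Hypothetical helper: turn a single write into an awaitable Promise.
function writeAsync(stream: Writable, item: unknown): Promise<void> {
  return new Promise((resolve, reject) => {
    stream.write(item, (error) => (error ? reject(error) : resolve()));
  });
}

// Usage sketch: await each write so the overflow is caught for the exact item
// that did not fit, then rotate and retry that same item. Assumes the factory
// attaches an 'error' listener so the emitted 'error' event does not crash.
async function writeWithRotation(
  getSitemapStream: () => Writable, // assumed factory that opens the next sitemap file
  items: unknown[]
): Promise<void> {
  let current = getSitemapStream();
  for (const item of items) {
    try {
      await writeAsync(current, item);
    } catch {
      // A real consumer would check that this is the size/count limit error type.
      current = getSitemapStream();    // rotate to a new sitemap
      await writeAsync(current, item); // re-write the item that overflowed
    }
  }
}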

When closely managing file sizes and counts I do not think there is a much better option.

I've been using a version of this code for months now to write hundreds of millions of sitemap indices, with sitemaps totaling hundreds of millions of items. While awaiting each write might be a little slower, it's fine in practice even for the largest sitemaps.

@derduher
Collaborator

@huntharo OK. I'm going to give this a shot myself, if for no other reason than to better understand the problem space. I'll set a deadline of Saturday.

@huntharo
Contributor Author

@huntharo OK. I'm going to give this a shot myself, if for no other reason than to better understand the problem space. I'll set a deadline of Saturday.

Sounds good. I appreciate another pair of eyes on it. Let me know what you come up with.

@derduher
Collaborator

I got a chance to dig into this tonight and, in trying my own approach, ended up at something very similar to yours. So I'll just tack my modifications onto your branch if that's OK. I want to modify SitemapAndIndexStream to properly use this before releasing. It might end up as a breaking change; if that's the case I'll probably throw in some more changes, including dropping support for Node 12 a little early.
I'll try to wrap this up tomorrow.

A key point here is that write on a stream is async, and write must have only one outstanding call at a time so that it can be determined exactly which item caused the overflow and no items are written after it. That means consumers need to wrap write with a Promise, as shown in the example, so they can await each write, catch the error that is thrown, rotate the sitemap to a new file, and redo the write.

I'm not following how it would need to be async. AFAIK a stream only processes one item at a time unless you explicitly implement a specific API.

@huntharo
Contributor Author

I got a chance to dig into this tonight and, in trying my own approach, ended up at something very similar to yours. So I'll just tack my modifications onto your branch if that's OK. I want to modify SitemapAndIndexStream to properly use this before releasing. It might end up as a breaking change; if that's the case I'll probably throw in some more changes, including dropping support for Node 12 a little early. I'll try to wrap this up tomorrow.

Perfect, sounds good to me!

A key point here is that write on a stream is async, and write must have only one outstanding call at a time so that it can be determined exactly which item caused the overflow and no items are written after it. That means consumers need to wrap write with a Promise, as shown in the example, so they can await each write, catch the error that is thrown, rotate the sitemap to a new file, and redo the write.

I'm not following how it would need to be async. AFAIK a stream only processes one item at a time unless you explicitly implement a specific API.

Ah, let me see if I can clarify. Most folks use streams somewhat incorrectly:

s.write(item1);
s.write(item2);
s.write(item3);

The write for item1 will not have reached the code that computes the length and rejects the write before the writes for item2 and item3 have already been called and returned.

A correct but terrible way to write 3 items like this would be:

// Writable#write(chunk[, encoding][, callback]); the callback receives an error, if any.
s.write(item1, (error) => {
  if (!error) {
    s.write(item2, (error2) => {
      if (!error2) {
        s.write(item3, (error3) => {
          // handle error3 or completion here
        });
      }
    });
  }
});

When I said "it needs to be async", I meant that wrapping the write calls with a Promise allows just awaiting the write calls so that the item1 write will complete or reject before item2 is attempted to be written if the stream is not overfull already. I think most programmers do not do this... so length limits will not work for them unless they await the writes or do callback chaining.

(Review comments on lib/sitemap-stream.ts: outdated, resolved)
@derduher
Collaborator

Good catches @huntharo!
And re: your chained calls, I would think it'd still call them serially and add up the total correctly even so, but we might as well write a test to verify.

@huntharo
Contributor Author

Good catches @huntharo! And re: your chained calls, I would think it'd still call them serially and add up the total correctly even so, but we might as well write a test to verify.

Ah yes... it will call them serially... but they will only total up correctly in some cases. That is: if the first call throws, you'll still make the second and third calls and you won't know which item it failed on.

I sort of did test this, but I guess I didn't leave a test specifically for it... if you remove the async wrapper around write from the byte length tests, they will have unpredictable results. Sometimes they will succeed, other times they will hang (if you try a write after the first write fails and then try to wait for the stream to finish). The hanging was the worst and would likely prevent us from adding a test that proves that we have to wait for these writes.

But if you think about it, you absolutely have to wait for confirmation that an item was not rejected before writing another item, because the first rejected item has to be written to the next file (as do all the items after it).

@derduher
Collaborator

Hrmm, yeah, I think you are right. That's a lot of docs to update. I was pretty sloppy with the serial writes anyway. A hazard of no longer having a real-world use case.

@huntharo
Contributor Author

huntharo commented Mar 1, 2022

@derduher - What do you think the next steps are here? I've got a project that I'm trying to open source that depends on this (I have an internal version with the changes but don't want to include that in the open source version). Let me know if there is anything you want me to do or if there are things we can create issues for and follow up on.

@derduher
Collaborator

derduher commented Mar 1, 2022

@huntharo Oof, yeah, sorry about the slow progress. I've been a bit too burnt out on the weekends to work on this.
I basically want to propagate the changes up to the various wrappers (SitemapAndIndexStream and simpleSitemapAndIndex) and update the docs. I think this is the point where a lot of breaking changes get made, as this is one of the most important features of the tool.

I'm running into a problem with getting SitemapAndIndex to properly catch and rotate. I'll try to push up what I've got so you can move ahead unblocked by me.

@huntharo
Contributor Author

huntharo commented Mar 27, 2022

@derduher - I've spent the weekend trying to get SitemapAndIndex to handle rotations. I've gotten all tests passing except when using createGzip. createGzip is very challenging because it refuses to write the output if an error is passed to it in the pipeline (and it sees the write error returned by SitemapStream to its producer, which is odd as createGzip is a consumer).

Problems Handling Stream Writes into Size-Constrained Sitemap Files

  • When writing sitemap items into streams, the sitemap stream is chosen before it is known if that write will succeed
  • Streams allow multiple stacked up writes before any of them succeeds or fails
  • There might be 20 writes pending, the 1st one throws an exception because the file would overflow, so the 1st one creates a new sitemap, and the subsequent 19 writes need to catch the exception for writing after end, then re-submit their writes to the newly created sitemap file
  • The _flush call is no longer serialized to be after the last write call into a SitemapStream - There is a race condition where an overflow causes a new SitemapStream to be created and, at that moment, a _flush call comes in and calls .end() on the new SitemapStream before any of the items could be rewritten to it, resulting in writes that cannot be finished

Problems with createGzip

  • createGzip will not flush the received data if an error is encountered
  • This is different from other streams, which do flush the data when the error is raised from the transform and no more data is written after that
  • The Byte/Item limits pass an error to the callback after the close tag is written
  • There are no gzip tests for SitemapStream or for the sitemaps within a SitemapAndIndexStream - this error doesn't show up until the tests for simpleSitemapAndIndex are run as those tests use gzip compression of the sitemaps
  • If there was a way to prevent these two specific errors from being passed through to createGzip then everything would work

Summary

I'm not sure what to do. If we can prevent the error propagation to createGzip then I have code that works for SitemapAndIndexStream and simpleSitemapAndIndex.

If we can't prevent the error propagation to createGzip then we might want to take the approach I took for my own projects:

  • Provide a higher-level set of classes that do not expose streams directly but instead provide an async write interface: the write function serializes the item, checks whether it will fit in the stream, throws an exception if it won't (without attempting to write to the stream), and otherwise writes to the stream and returns (see the sketch below)
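For what it's worth, a rough sketch of the shape of that approach, with made-up names (this is not actual code from those projects):

import { Writable } from 'stream';

// All names here are hypothetical.
class SitemapWriteWouldOverflow extends Error {}

class ManagedSitemapWriter {
  private byteCount = 0;

  constructor(
    private readonly sink: Writable,   // e.g. a file or gzip stream
    private readonly byteLimit: number
  ) {}

  // Serialize first, check the fit, and only then touch the stream, so the
  // overflow case never produces a stream error at all.
  async write(item: { url: string }): Promise<void> {
    const xml = `<url><loc>${item.url}</loc></url>`; // stand-in serialization
    const size = Buffer.byteLength(xml);
    if (this.byteCount + size > this.byteLimit) {
      throw new SitemapWriteWouldOverflow('item does not fit; rotate to a new file');
    }
    this.byteCount += size;
    await new Promise<void>((resolve, reject) =>
      this.sink.write(xml, (err) => (err ? reject(err) : resolve()))
    );
  }
}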

@SGudbrandsson

Can this be merged without the createGzip part?

@huntharo
Contributor Author

Can this be merged without the createGzip part?

Thanks for the vote of confidence, @SGudbrandsson!

Unfortunately, if I recall correctly, there is no way to prevent a consumer from setting the sink (destination) to a gzip stream, and a gzip stream will refuse to correctly end the output stream when it detects that the stream being piped into it had any sort of error at all. Most streams only fail when the stream writing to them had a read error (indicating gzip couldn't read from the input stream). But gzip also refuses to write if the input stream rejected a write to itself, which gzip shouldn't care about, but it does.

I don't believe I found a way to throw an exception in the input stream without causing gzip to mess up.

So if we merge this and someone writes to gzip using this size limit feature they will get weird and difficult to diagnose behavior.

I might take another look after releasing some other projects that implement this at a layer above this project without a streaming interface, which avoids the problem of gzip stream error handling.


Successfully merging this pull request may close these issues.

[ BUG ] Incorrect division into files when the limit is exceeded