max_document_size check is not accurate #659
Comments
@nickva: There is an implementation in Erlang here: https://github.com/okeuday/erlang_term/blob/master/src/erlang_term.erl (MIT license).
Another implementation could be based on https://gist.github.com/iilyak/a11a481bd3f7311d8499e19f5e4c8f22
We don't really need it to be that generalized, because we know we'd only get what jiffy decoded, which is object proplists.
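For reference, a quick illustration of that shape: jiffy's default (non-maps) decoding represents objects as one-tuples wrapping a proplist.

```erlang
%% jiffy's default decoded form: objects are {Proplist} one-tuples.
{[{<<"a">>, 1}, {<<"b">>, [true, null]}]} =
    jiffy:decode(<<"{\"a\":1,\"b\":[true,null]}">>).
```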
Another option is to change chttpd:json_body to return …
@iilyak |
This was fixed. Close it.
max_document_size currently checks document sizes based on Erlang's external term size of the jiffy-decoded document body. This makes sense because that's what is used to store the data on disk, and it's what is manipulated by the CouchDB internals. However, the Erlang term size is not always a good approximation of the size of the JSON-encoded data. Sometimes it can be way off (I've seen it off by 30%), and it's hard for users to estimate or check the external term size beforehand. So, for example, if max_document_size is 1 MB, CouchDB might reject a user's 600 KB JSON document because Erlang's external term size of that document is greater than 1 MB.
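For illustration, a minimal sketch (the module name is made up and not CouchDB code; it assumes jiffy is available) comparing the three sizes in play:

```erlang
%% size_demo.erl -- illustrative sketch, not CouchDB code.
-module(size_demo).
-export([compare/1]).

%% For a JSON binary as received from the client, report:
%%   json_bytes          - the size the user actually sent
%%   external_term_bytes - what the max_document_size check measures today
%%   re_encoded_bytes    - the size after a jiffy decode/encode round trip
compare(Json) when is_binary(Json) ->
    Decoded = jiffy:decode(Json),
    #{json_bytes          => byte_size(Json),
      external_term_bytes => erlang:external_size(Decoded),
      re_encoded_bytes    => iolist_size(jiffy:encode(Decoded))}.
```

On realistic documents, external_term_bytes often comes out noticeably larger than json_bytes, which is exactly the gap described above.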
Possible Solutions
Do nothing. If the point of the current size check is to throttle or limit disk usage, and disk usage is driven by the external term size (though in most cases compression is applied to it, so it will usually be less), then at that level it makes sense to keep the check as is.
Re-encode the data using jiffy and check the size against that. That's a better check, but it will impact performance. However, this is also not an exact solution: users' JSON encoders might insert more whitespace (say, as indentation, or after commas) or use a different algorithm for encoding floating point numbers (scientific notation, representing exact floating point numbers without a decimal point: 5 instead of 5.0, etc.), so the size would still be off. Made a PR with this approach: Provide a more accurate size check for max_document_size limit #660.

Like the above, but enhance jiffy to return an "encoded size" without actually doing the encoding, or at least have it do the encoding internally and return just the byte size to Erlang, instead of a full term which would have to be thrown away.
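To see why no re-encoded size is canonical, even jiffy itself can produce different sizes for the same document (a sketch; jiffy's pretty option adds whitespace):

```erlang
%% Illustrative: two valid encodings of the same document differ in size.
Doc = {[{<<"a">>, 5.0}, {<<"b">>, [1, 2, 3]}]},
Compact = iolist_size(jiffy:encode(Doc)),
Pretty = iolist_size(jiffy:encode(Doc, [pretty])),
true = Pretty > Compact.  % pretty printing adds newlines and indentation
```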
Here is an attempt to do a size check in Erlang (a sketch follows below). Since jiffy is pretty quick, it might end up being slower than doing the full encoding in jiffy:
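The original snippet isn't preserved here; what follows is a hypothetical sketch of such a check (the module name is made up and this is not the gist's or CouchDB's actual code), assuming jiffy's default EJSON representation with objects as {Proplist} tuples:

```erlang
%% ejson_size.erl -- hypothetical sketch. Approximates the JSON-encoded
%% byte size of a jiffy-decoded (EJSON) term without building the binary.
-module(ejson_size).
-export([encoded_size/1]).

encoded_size({Props}) when is_list(Props) ->
    %% object: braces, "key:value" per pair, a comma between pairs
    Pairs = [encoded_size(K) + 1 + encoded_size(V) || {K, V} <- Props],
    2 + lists:sum(Pairs) + max(length(Props) - 1, 0);
encoded_size(List) when is_list(List) ->
    %% array: brackets, elements, a comma between elements
    2 + lists:sum([encoded_size(E) || E <- List]) + max(length(List) - 1, 0);
encoded_size(Bin) when is_binary(Bin) ->
    %% string: quotes + raw bytes; ignores escape expansion, so this
    %% clause underestimates strings that need escaping
    byte_size(Bin) + 2;
encoded_size(Int) when is_integer(Int) ->
    byte_size(integer_to_binary(Int));
encoded_size(Float) when is_float(Float) ->
    %% approximation: jiffy's float formatting may differ
    byte_size(float_to_binary(Float, [{decimals, 10}, compact]));
encoded_size(true) -> 4;
encoded_size(false) -> 5;
encoded_size(null) -> 4.
```

The idea would be to compare ejson_size:encoded_size(jiffy:decode(Json)) against max_document_size instead of erlang:external_size/1. Because string escaping is ignored, this version leans toward underestimating, which dovetails with the next idea below.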
Maybe modify it to provide a check that always underestimates, so users get the benefit of the doubt. That is, whatever encoding they pick, find a not-too-expensive check that is accurate enough and always comes out smaller.