max_document_size check is not accurate #659

Closed
nickva opened this Issue Jul 10, 2017 · 6 comments

@nickva
Contributor

nickva commented Jul 10, 2017

max_document_size currently checks document sizes based on the Erlang external term size of the jiffy-decoded document body. This makes sense because that's what's used to store the data on disk and it's what the CouchDB internals manipulate.

However, the Erlang term size is not always a good approximation of the size of the JSON-encoded data. Sometimes it can be way off (I've seen 30% off), and it's hard for users to estimate or check the external term size beforehand. So, for example, if max_document_size is 1MB, CouchDB might reject a user's 600KB JSON document because the Erlang external term size of that document is greater than 1MB.
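
For illustration, the mismatch is easy to observe directly (a rough sketch, assuming jiffy is available; the helper name and the use of erlang:external_size/1 as a stand-in for the external term size the current check is based on are mine):

% Compare the wire size of a JSON document with its Erlang external term
% size and with jiffy's re-encoded size.
compare_sizes(Json) when is_binary(Json) ->
    Body = jiffy:decode(Json),
    #{wire_size            => byte_size(Json),
      external_term_size   => erlang:external_size(Body),
      jiffy_reencoded_size => byte_size(jiffy:encode(Body))}.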

Possible Solutions

  • Do nothing. If the purpose of the current size check is to throttle or limit disk usage, and disk usage is driven by the external term size (though compression is usually applied to it, so the on-disk size will typically be smaller), then at that level it makes sense to keep the check as is.

  • Re-encode the data using jiffy and check the size against that. That's a better check but will impact performance. It is also not an exact solution: a user's JSON encoder might insert more whitespace (say, as indentation or after commas), or use a different algorithm for encoding floating point numbers (scientific notation, writing exact floats without a decimal point, e.g. 5 instead of 5.0, etc.), so the size would still be off. Made a PR with this approach: #660

  • Like above, but enhance jiffy to return an "encoded size" without actually doing the encoding. Or at least have it do the encoding internally and just return the byte size to Erlang instead of a full term that would have to be thrown away.

  • Here is an attempt to do the size check in Erlang. Since jiffy is pretty quick, this might end up being slower than doing the full encoding in jiffy:

encsize({[]}) ->
    2;  % opening { and closing }

encsize({Props}) ->
    % 2 is for the opening { and closing }
    % Inside the lc, 2 is for : and , but that counts one extra , at the end
    % -1 subtracts that last ,
    2 + lists:sum([encsize(K) + encsize(V) + 2 || {K, V} <- Props]) - 1;

encsize([]) ->
    2;  % opening [ and closing ]

encsize(List) when is_list(List) ->
    % 2 is for [ and ]
    % Inside the lc, 1 is for , but that counts one extra , for the last element
    2 + lists:sum([encsize(V) + 1 || V <- List]) - 1;

encsize(Float) when is_float(Float) ->
    % this doesn't match what jiffy does, so it will be off
    byte_size(float_to_binary(Float, [{decimals, 16}, compact]));

encsize(0) ->
    1;  % math:log10(0) is undefined; "0" encodes as a single character

encsize(Integer) when is_integer(Integer), Integer > 0 ->
    trunc(math:log10(Integer)) + 1;

encsize(Integer) when is_integer(Integer), Integer < 0 ->
    % negate before log10; the extra 1 is for the - character
    trunc(math:log10(-Integer)) + 2;

encsize(Binary) when is_binary(Binary) ->
    % count characters that need escaping, assuming each gets one extra
    % escape character
    Escapes = [<<"\b">>, <<"\f">>, <<"\n">>, <<"\r">>, <<"\t">>, <<"\\">>,
        <<"\"">>],
    MatchCount = length(binary:matches(Binary, Escapes)),
    % 2 is for the opening and closing "
    2 + byte_size(Binary) + MatchCount;

encsize(Atom) when is_atom(Atom) ->
    encsize(atom_to_binary(Atom, utf8)).

Maybe modify it to provide a check that always underestimates, so users get the benefit of the doubt. That is, whatever encoding they pick, find a not-too-expensive check that is accurate enough and always comes out smaller?
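
One way to give users the benefit of the doubt without paying for a full re-encoding on every request would be to use the cheap approximation as a first pass and only fall back to jiffy when the approximation is over the limit (a sketch combining the second and fourth options above; within_limit/2 and its control flow are mine, not part of the PR):

% Accept immediately if the cheap approximation is within the limit
% (if it underestimates, the document gets the benefit of the doubt);
% otherwise re-encode with jiffy and use the exact size before rejecting.
within_limit(Body, MaxSize) ->
    case encsize(Body) of
        ApproxSize when ApproxSize =< MaxSize ->
            true;
        _ ->
            byte_size(jiffy:encode(Body)) =< MaxSize
    end.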

nickva added a commit to cloudant/couchdb that referenced this issue Jul 10, 2017

Provide a more accurate size check for max_document_size limit

wohali added the bug and dbcore labels Jul 11, 2017

@iilyak
Contributor

iilyak commented Jul 11, 2017

@nickva: There is an implementation in erlang here https://github.com/okeuday/erlang_term/blob/master/src/erlang_term.erl. MIT license.

@iilyak
Contributor

iilyak commented Jul 11, 2017

Another implementation could be based on https://gist.github.com/iilyak/a11a481bd3f7311d8499e19f5e4c8f22

@nickva
Contributor

nickva commented Jul 11, 2017

We don't really need it to be that generalized, because we know we'd only get what jiffy decoded: object proplists ({[...]}), arrays (lists, [...]), and then integers, floats, and binaries (but unicode and escape sequences have to be accounted for). Oh, and maybe atoms, but those are just translated to binaries and handled that way.
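
For reference, jiffy's default decoded representation covers just those shapes, e.g. (made-up input):

jiffy:decode(<<"{\"a\": [1, 2.5, \"x\", true, null]}">>).
% => {[{<<"a">>,[1,2.5,<<"x">>,true,null]}]}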

@iilyak
Contributor

iilyak commented Jul 27, 2017

Another option is to change chttpd:json_body to return {Size, Body} and pass Size through.
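
A minimal sketch of that idea (json_body_with_size/1 is hypothetical, and chttpd:body/1 is assumed here to be the accessor for the raw request body):

% Hypothetical variant of chttpd:json_body/1: decode as usual but also
% return the byte size of the raw body, so max_document_size can be
% enforced against the bytes actually received.
json_body_with_size(Req) ->
    RawBody = chttpd:body(Req),   % assumed raw-body accessor
    {byte_size(RawBody), jiffy:decode(RawBody)}.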

@nickva
Contributor

nickva commented Jul 28, 2017

@iilyak Interesting idea. It would depend on how much of the code would be affected; _bulk_docs and possibly multipart/related PUTs might be an issue too.

nickva added a commit to cloudant/couchdb that referenced this issue Jul 31, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Jul 31, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Aug 2, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Aug 4, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 5, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 5, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 8, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 8, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 8, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 12, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 12, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 12, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 12, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 12, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 12, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit to cloudant/couchdb that referenced this issue Sep 13, 2017
Provide a more accurate size check for max_document_size limit

nickva added a commit that referenced this issue Sep 13, 2017
Provide a more accurate size check for max_document_size limit
@nickva
Contributor

nickva commented Sep 26, 2017

This was fixed. Close it.

@nickva nickva closed this Sep 26, 2017

wohali added a commit that referenced this issue Oct 19, 2017
Provide a more accurate size check for max_document_size limit

willholley added a commit to willholley/couchdb that referenced this issue May 22, 2018
Provide a more accurate size check for max_document_size limit