Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Skipping #5

Closed
camgunz opened this issue Sep 5, 2014 · 14 comments
Closed

Add Skipping #5

camgunz opened this issue Sep 5, 2014 · 14 comments

Comments

@camgunz
Copy link
Owner

camgunz commented Sep 5, 2014

So #3 proposed adding skipping to CMP, and there's some discussion there.

After trying to implement a version, it looks like the only way to fully implement this is by creating a SAX-style state machine.

"WHY!?", you might ask. Well I'll tell you, hopefully I'm wrong.

It's not a backend-support problem. We can add an optional skip callback, and CMP can just set an error whenever cmp_skip_object is called on a context where that callback is NULL.

The problem is nested arrays and maps. The naive approach is to just have cmp_skip_object recursively call itself, but that leaves CMP open to stack overflow attacks via specifically-crafted data. I absolutely will not do that.

The alternative is to have a bunch of state in cmp_ctx_t itself and use the heap. There are a few downsides to this:

  • It makes heap allocations, and it's doing this for every nested array/map. This is, therefore, a vector for memory exhaustion given specifically-crafted data.
  • It's a complicated little state machine, in contrast with the rest of CMP, which is extremely clear.
  • It would cause CMP to depend on the C Standard Library, and thus would require a build system. Not that that's a big deal, but it's still a definite paradigm shift.

I vote against adding skipping to CMP. To use the example in #3 of an RPC server, let's say you're getting MessagePack data as a stream and that's your CMP backend (error handling omitted):

uint32_t map_size;

cmp_read_map(&cmp, &map_size)

for (uint32_t i = 0; i < map_size; i++) {
    char *arg_name;
    uint32_t str_size;

    cmp_read_str_size(&cmp, &str_size);
    read_netstream_string(&arg_name, str_size);

    if (!rpc_arg_is_valid(arg_name)) {
        cmp_read_str_size(&cmp, &str_size);
        skip_netstream_bytes(str_size);
    }
}

This is pretty simple, and the only thing that would be different if CMP added cmp_skip_next_object is skip_netstream_bytes(str_size) is replaced with cmp_skip_next_object(&cmp).

The problem is that cmp_skip_next_object might have to skip a map containing 5000 other arrays, each containing 5000 arrays, each containing 5000 arrays that each contain 5000 entries of the Gettysburg Address. Skipping an unwanted string is much simpler than skipping the next object, whatever it might be. Furthermore, skipping can most easily be handled using backend API's designed specifically for that; CMP can add no value there. Therefore, I think adding skipping to CMP is out of scope.


That said, I'm always open to arguments! :) If this is a feature you're really needing and you've got a cool idea on how to do it, I'm absolutely happy to work on it (or, even better, merge a PR ;) ). I just think it's not feasible.

@catwell
Copy link

catwell commented Oct 6, 2014

I think maybe you could just go the recursive route and add a parameter to the skip function representing a maximum depth, decremented on recursive calls (and trying to skip with a max depth of 0 is an error). This solves the "stack overflow" problem.

@beatgammit
Copy link

I don't think it's possible to do it without either heap allocation or potential stack overflow since msgpack forces a tree traversal without letting you know the size of the tree in advance.

That being said, I'd really like this feature, and I think a configurable max-depth is the way to go with a callback when the max-depth is reached (so the user can cleanly abort/advance the stream). A lot can be done to mitigate stack usage for larger values of max-depth, but I doubt this will ever be an issue in practice.

In my case, I have the maximum size of the msgpack stream, so I can just advance the stream myself if max-depth is reached. However, in many cases I just want to skip an object and sub-parts it may or may not have. I know this won't be more than 5 layers deep or so.

If this is an acceptable solution, I'd be happy to work on a PR. I just don't want to go down that long and painful road if you don't want it in the library.

@camgunz
Copy link
Owner Author

camgunz commented Feb 2, 2015

So I could implement the skip callback, and bool cmp_skip_next_object(cmp_ctx_t *ctx, cmp_object_t *obj).

If the "next" object (the object to skip) contains no other objects, cmp_skip_next_object skips it, never touches obj isn't touched, returns true.

Otherwise, cmp_skip_next_object populates obj with the "next" object's data (which is functionally skipped), sets a special error code, and returns false.

This solves basically all the problems:

  • Avoids memory exhaustion by using the passed obj parameter
  • Avoids stack overflow by setting an error and returning false
  • Doesn't lose the skipped object, allowing the caller to decide what to do
    • As a side benefit, this allows the caller to put their own limit on depth
  • Avoids potentially costly reads by using the skip callback

I also think this is in-scope because you need to know the sizes of MP objects for maximum efficiency (from the skip callback), and that's not something you should ever need to know using an MP library (that's why you're using a library).

I'll work this up and see if it's feasible. If it's not then we're back where we started, but it seems promising.

@beatgammit
Copy link

Otherwise, cmp_skip_next_object populates obj with the "next" object's data (which is functionally skipped), sets a special error code, and returns false.

For example, let's say we have something like this:

// array containing an array with one item
[ "hello", [ 0 ], "world" ]

If I call cmp_skip_next_object, I'll get an error. The next read will be "hello", right?

So, if a user wanted to implement a recursive skip, they could do something like this:

void skip_all_the_things(cmp_ctx_t *ctx, int depth) {
    cmp_object_t obj;
    if (depth == MAX_DEPTH) {
        // handle max recursion...
    }
    if (cmp_skip_next_object(ctx, &obj)) {
        return;
    }
    if (obj.type == CMP_TYPE_ARRAY16 || obj.type == CMP_TYPE_ARRAY32 || obj.type == CMP_TYPE_FIXARRAY) {
        for (int i = 0; i < obj.array_size; i++) {
            skip_all_the_things(ctx, depth+1);
        }
    } else if (...) {
        // handle map
    } else {
        skip_all_the_things(ctx, depth+1);
    }
}

If that's the case, then I think it should work. I just didn't like having to know the type beforehand if I just wanted to throw stuff away.

@camgunz
Copy link
Owner Author

camgunz commented Feb 3, 2015

If I call cmp_skip_next_object, I'll get an error. The next read will be "hello", right?

Yeah.

So, if a user wanted to implement a recursive skip, they could do something like this:

Yeah but, watch out for stack overflow! It's probably better to do a while loop; if you're checking for max depth then it's more or less the same, but it depends on how much your function allocates on the stack really....

I just didn't like having to know the type beforehand if I just wanted to throw stuff away

I'm not super enthused about it either. Maybe I could add a cmp_get_length function; it would just return 0 for objects that don't have a length

@beatgammit
Copy link

It's probably better to do a while loop; if you're checking for max depth then it's more or less the same

I suppose you could, bit you'd have to keep track of your iteration index at each level so when you descend, you don't lose your place and forget where you are at an outer array. I suppose this has a smaller impact than recursing though.

I'm guessing it'd look something like this (completely untested, but hopefully the intent is clear):

int depth_todos[MAX_LENGTH];
int depth = 0;
memset(depth_todos, 0, sizeof(depth_todos));
do {
    cmp_object_t obj;
    if (cmp_skip_next_object(ctx, &obj)) {
        if (depth_todos[depth] > 0) {
            depth_todos[depth]--;
        }
        while (depth > 0 && depth_todos[depth] == 0) {
            depth--;
        }
        continue;
    }
    if (array) {
        depth++;
        depth_todos = obj.length;
    }
    // TODO: other types
} while (depth > 0 || depth_todos[0] > 0);

This has the problem where you burn a bunch of stack at the beginning, but your stack isn't going to grow. This gets more complicated if you need to do non-blocking i/o, in which case depth_todos lives in your struct.

Supporting maps is a little more complicated, so depth_todos may need to be a struct that contains the type of the object (e.g. cmp_object_t depth_todos[MAX_LENGTH]), but now we're basically the same size as the recursive method...

@camgunz
Copy link
Owner Author

camgunz commented Feb 23, 2015

OK, I've implemented this in my local repo. If user code wants to handle arrays and maps, it will have to implement a state machine and should build in a recursion limit, but at least the pieces are now in CMP to make skipping possible. I've also added a small RPC example that uses skipping. Current tests pass, but I need to write a couple more for the new functionality (even though the RPC example does some itself).

I still need to work on #15, and then I need to write some tests for it too, but progress!

@beatgammit
Copy link

Awesome! I'm currently running this on a micro-controller and throwing away parts (manually) that I don't actually need. Once this is pushed, I'll play with it and see what the damage is in terms of additional design space. It'll definitely clean my code up a bit.

@clwi
Copy link

clwi commented Mar 27, 2017

FYI: You don't need recursion to skip nested maps etc., you can do it with iteration. For an example look at cw_skip_items in CWPack file cwpack.c.

@antoinealb
Copy link
Contributor

Interesting approach, could be a good starting point if someone wants to implement skipping 👍

@m-wichmann
Copy link
Contributor

I really liked the idea of @clwi and implemented it. You can currently find it as a gist.

Two things, that should be thought about:

  1. obj_skip_count is currently a uint32_t and could overflow (e.g. while reading malicious data). Extending the type to uint64_t would help with the problem, but I didn't really want to change it, since I use cmp on an embedded device.
  2. Skipping of strings/bin/ext currently reads one byte at a time, since there is no buffer available. This could be replaced with a skip/seek function pointer.

I'm happy to change my code and open a pull request, if you think it's worth it.

@antoinealb
Copy link
Contributor

It would be nice to have a pull request opened for it, it would also make the code a bit easier to review :)

I looked at your implementation and it looks nice, good job 👍 ! The only thing I would personally like to see is some error checking on the ctx->read calls.

@camgunz
Copy link
Owner Author

camgunz commented May 8, 2017

This is done pending test coverage.

@camgunz
Copy link
Owner Author

camgunz commented Jun 19, 2017

Skipping is reasonably well covered in the test suite now; closing.

@camgunz camgunz closed this as completed Jun 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants