-
Notifications
You must be signed in to change notification settings - Fork 47
Home
Table of Contents
Normalize vocabulary to correspond the upcoming docopt.py v0.7.0.
Element should be Pattern:
Pattern
ChildPattern # a better name??: TerminalPattern or LeafPattern
Option
Argument
Command
ParentPattern # a better name??: CollectionPattern
Required
Optional
AnyOptions
OneOrMore
Either
How Pattern C-struct could look like:
typedef {
// probably more conventional to have enum
// all UPPER CASE
enum {
OPTION,
ARGUMENT,
COMMAND,
REQUIRED,
OPTIONAL,
ONEORMORE,
EITHER,
NONE,
} type;
union {
Option option;
Argument argument;
Command command;
// maybe having a single container for
// required/optional/oneoremore/either,
// since they all hold the same kind of data.
Container container;
} payload;
} Pattern;
typedef struct {
// in Python these are `short` and `long`,
// but these are reserved in C
const char *oshort; // maybe *flag?
const char *olong; // maybe *full?
bool argcount; // number of arguments 0 or 1
bool value; // value if argcount is 0
char *argument; // value if argcount is 1
// maybe instead of having `value` and `argument`
// we could have just `char *value` and treat it
// as NULL/non-NULL for true/false (if argcount == 0)
// and as value/no-value when argcount == 1?
// I wonder, would that be portable:
// char *value = true;
//
// Oh, it is actually more complicated than that.
// In Python, Option's value could be either:
// * True/False, if it's a single option without argument
// * a number, if it's an option (w/o argument) that could be repeated
// * a string, if it's an option with argument non-repeated
// * array of strings, if it's an option with argument, repeated
// Well:
// repeatable | no yes
// -----------+----------------------
// argcount |
// 0 | bool int
// 1 | char* char**
//
// So we need to handle all these cases.
//
// a) Maybe go full type-unsafe and declare it as char**
// and use it as bool/char*/int by casting? :-)
//
// b) Another variant would be to have a union of these types.
//
// c) Yet another would be to store all these in the same struct.
} Option;
typedef struct {
char *name;
bool repeating; // Maybe int count; instead?
// (to explicitly state the length of **array).
char *value; // Maybe get rid of this in favor of
// just having 1 item in **array.
char **array; // I can think of 2 ways how to allocate this array.
// 1. It could be statically allocated
// to fit, say, max 32 elements, and then each
// pointer in it will point to an item in argv.
// 2. Make it point to (a part of) argv directly
// and then use `count` to see how long it is.
} Argument;
typedef struct {
char *name;
// Command's value could be either True/False in Python
// or the number of times the command was mentioned.
// We could use the fact that 0 is falsy in C, and
// declare it simply as int:
int value; // Maybe int count; instead?
} Command;
// In Python version required/optional/etc hold an array
// called `children` which is an array of all child nodes.
// Since we don't want to allocate these arrays dynamically
// or overallocate them statically (by allocating, say
// 32 items "just in case"), I was thinking of 2 variants:
//
// 1. Use linked list like the following:
typedef struct {
Pattern *pattern;
Container *next;
} Container;
// and transform patterns in Python into a linked list form:
//
// Required(a, b, c) into Required(a, Required(b, Required(c, NULL)))
//
// 2. Another way could be to *generate* code for each struct. I.e. if the
// pattern is `usage: prog [<this> <that>] (-a | -b | -c)` or in Python terms:
// Required(Optional(<this>, <that>), Either(-a, -b, -c))
//
// Then just generate precisely these containers:
//
typedef struct {
Pattern patterns[2];
} ContainerRequired1;
typedef struct {
Pattern patterns[2];
} ContainerOptional1;
typedef struct {
Pattern patterns[3];
} ContainerEither1;
<Add new ideas here!>
- Runs function
parse_args
to populateElement *options
array - Overrides default values in
struct DocoptArgs
with the values parsed fromargv
- Verifies that the Pattern is respected.
Note: It might possible to integrate step 3 into step 2.
- Advantages
- all generated C code is readable because only pattern tree is generated, not any real code, (maybe just a little);
- implementation similar to the python one, which makes it only a matter of porting code to C
- Disadvantages
(- this needs to port considerable portion of Python code)
- Complexity: O(Size_of_the_pattern_tree)
- complex patterns might use a lot of call stack.
VK: First thing that comes into mind for me is to mimic Python implementation. Python parses pattern into a tree of nodes like:
'''usage: program --hai
program --bye <arg>'''
=> parse_pattern =>
Either(Option('--hai'), Required(Option('--bye'), Argument('<arg>')))
Then Python parses argv:
['--bye', 'arg'] => parse_argv => [Option('--bye'), Argument('arg')]
Then matches one against another. We could generate the pattern using Python. Write parse_argv
function similar to Python one and write a bunch of match
functions similar to Python match
methods for each kind of pattern (Option, Argument, Either, etc.).
- Advantages
- Complexity: O(Number_of_arguments_provided_in_CLI)
- Simple memory consumption (stack usage is not dependent on the pattern tree size). Definite number of stack levels.
- all generated C code is readable because only sub-patterns are generated, not any real code
- Disadvantages
- Non-recursive means more code, which might be a bad thing in a single file
VK: I don't see advantages in being non-recursive. How does it make memory consumption simpler? We are not going to run out of stack anyway.
This approach starts by dividing the pattern tree into multiple sub_patterns
. Example:
Usage:
prog ship new [--coordinator] <name>
prog ship move <x> <y>
prog fire <gun> <x> <y> [--burst]
This pattern has three sub-patterns
.
When we are iterating over the argv
elements, a sub_pattern
can be in two states: 'unsatisfiable' and 'not unsatisfiable'.
When a sub_pattern
is said to be unsatisfiable, it means the algorithm has come to the conclusion that no matter what the next elements of argv
might be, the sub_pattern
will not be satisfied. Caution: if there are still argv
elements to be parsed, a sub_pattern
might not be 'satisfiable'. Hence, the 'not unsatisfiable' notation.
The main workflow is to assume all sub-patterns
are NOT UNsatisfiable in the beginning and, after parsing each element of argv
, all sub-patterns
are tested if they have become unsatisfiable.
Example with the previous usage pattern:
user inserted argv: $ ./prog ship new SuperBoat
Initial sub_patterns list:
sub_pattern[0].IsUnsatisfiable() == False
sub_pattern[1].IsUnsatisfiable() == False
sub_pattern[2].IsUnsatisfiable() == False
Iteration 1:
parsed from argv: Command('ship')
sub_pattern[0].IsUnsatisfiable() == False
sub_pattern[1].IsUnsatisfiable() == False
sub_pattern[2].IsUnsatisfiable() == True
Iteration 2:
parsed from argv: Command('new')
sub_pattern[0].IsUnsatisfiable() == False
sub_pattern[1].IsUnsatisfiable() == True
sub_pattern[2].IsUnsatisfiable() == True
# (...)
In C, we can have an array of sub_patterns
that keep their state along iterations.
With this approach its easy to define a flowchart/state_machine:
- If all
sub_patterns
become unsatisfiable before allargv
elements are parsed, then presentusage_message
. - If all
argv
elements were parsed and there is not an unique SATISFIABLE(*)sub_pattern
, then presentusage_message
.
(*) After all argv
elements are parsed, the algorithm can conclude which sub_patterns
are satisfiable. At the end of argv
, the set of ["unsatisfiable", "not unsatisfiable"] sub_patterns
can be converted to a set of ["unsatisfiable", "satisfiable"] sub_patterns
.
- Iterate over
argv
- Parse
argv[i]
to(type, value [, name])
- Iterate over all not unsatisfiable
sub_patterns
- Change the state to unsatisfiable if test fails
- If all patterns are unsatisfiable, present
usage_message
and exit(1)
- Parse
- Iterate over all not unsatisfiable
sub_patterns
and convert them to satisfiable or unsatisfiable - If there is not one single unique satisfiable
sub_pattern
, presentusage_message
and exit(1) - Compile resulting
struct DocoptArgs
from the unique satisfiablesub_pattern
(copy parsed values and insert defaults)
In 2013-06-06, @halst said at docopt/docopt:
docopt does respect the order of arguments and commands in argv, but options could go in any order, anywhere.
We these characteristics in mind, a sub_pattern
can be implemented by separating the ordered elements (aka statics
) and Options (options
):
typedef enum {REQUIRED, OPTIONAL, ONEORMORE, EITHER} BRANCH_TYPE;
typedef enum {ARG, CMD, BRANCH} STATIC_TYPE;
typedef enum {UNSAT, NOTUNSAT, SAT} SUBPATTERN_STATE;
typedef struct {
const char *name;
bool value;
} Command;
typedef struct {
const char *name;
char *value;
char **array;
} Argument;
typedef struct {
const char *oshort;
const char *olong;
bool argcount;
bool value;
char *argument;
} Option;
typedef struct {
BRANCH_TYPE type;
Subpattern *subpattern;
} Branch;
typedef struct {
STATIC_TYPE type;
void *element;
} Static;
typedef struct __subpattern {
SUBPATTERN_STATE satisfiability;
int statics_len;
int statics_idx;
int options_len;
Static *statics;
Option *options;
} Subpattern;
- Advantages
- It is probably simple to implement in
docopt_c.py
- It is probably simple to implement in
- Disadvantages
- The produced code is unreadable :(((
- VK: I don't thing switch-case is a good idea. What about options—they can be positioned anywhere, so you need to take them into account at every point of time?
This approach needs to build a "tree of switch-cases" based on the chars in words of argv
. For instance, see this docopt string:
Usage:
program tcp <host> <port> [--timeout=<seconds>]
program serial <port> [--baud=9600] [--timetout=<seconds>]
program -h | --help | --version
can have in its argv[1]
the following words:
tcp
serial
-h
--help
--version
The tree of switch-cases for the first word in argv
can be summarized in the following diagram:
argv[1][0] == 't' -> argv[1] is "tcp"
argv[1][0] == 's' -> argv[1] is "serial"
argv[1][0] == '-'
|-- argv[1][1] == 'h' -> argv[1] is "-h"
|-- argv[1][1] == '-'
|-- argv[1][2] == 'h' -> argv[1] is "--help"
|-- argv[1][2] == 'v' -> argv[1] is "--version"
Check out this simple example in C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* "Usage:\n"
* " program tcp <host> <port> [--timeout=<seconds>]\n"
* " program serial <port> [--baud=9600] [--timetout=<seconds>]\n"
* " program -h | --help | --version\n"
*/
int main(int argc, char *argv[]) {
int i = 1;
char *word;
size_t word_len;
if (argc == 1) {
fprintf(stderr, "Usage:\n"
" program tcp <host> <port> [--timeout=<seconds>]\n"
" program serial <port> [--baud=9600] [--timetout=<seconds>]\n"
" program -h | --help | --version\n");
exit(0);
}
word = argv[1];
word_len = strlen(word);
if (word_len < 1) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
switch (word[0]) {
case 't':
if (strcmp(word, "tcp")) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
/* Keep on parsing the rest of the CLI knowing that argv[1] is "tcp" */
/* (...) */
break;
case 's':
if (strcmp(word, "serial")) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
/* Keep on parsing the rest of the CLI knowing that argv[1] is "serial" */
/* (...) */
break;
case '-':
if (word_len < 2) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
switch (word[1]) {
case 'h':
if (strcmp(word, "-h")) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
/* Executes the -h action */
/* (...) */
break;
case '-':
if (word_len < 3) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
switch (word[2]) {
case 'h':
if (strcmp(word, "--help")) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
/* Executes the --help action */
/* (...) */
break;
case 'v':
if (strcmp(word, "--version")) {
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
/* Executes the --version action */
/* (...) */
break;
default:
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
break;
default:
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
break;
default:
fprintf(stderr, "error: '%s' not recognized\n", word);
exit(1);
}
return 0;
}
This approach aims at separating implicit from explicit elements when
parsing argv
. Among the three possible Leaf elements (Arguments, Commands
and Options), only Options can be considered explicit because, unlike the
other two elements, they can be automaticly parsed when looping over
argv
.
When Arguments and Commands are present, parsing argv
becomes slightly
complicated since these are not explicitly identified like Options. Example:
"""
Usage: ./app [<file>] update
"""
This CLI specification has two elements, an optional
Argument(name='<file>')
and a Command(name='update')
, and can accept
one of the following inputs:
$ ./app update
$ ./app hello.txt update
$ ./app update update
In the last line, the user has indicated that the argument <file>
is to be
setted with the value "update" which has in fact the same string as the name of
the Command update
. The argv
parser algorithm had to implicitly identify
which of the two updates
was the Argument, and which was the Command.
This section was developed using section II.3 from Stuart J. Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach (2 ed.). Pearson Education.
This gist was develop as a proof-of-concept for the following examples.
"""
Usage: ./app [<file>] update
"""
In this example we define the basic operation that allows us to build a
tree of patterns. If one of the elements in a pattern is Optional, than it
means there are two possible patterns that the end-user can use in argv
(one
with the element and one without it). With this idea, the following tree can
be created:
There are two things important to notice:
- The numbers associated with each node reveal the expansion algorithm used. In this example, it doesn't actually reveal anything because Depth-First-Search (DFS) or Breadth-First-Search (BFS) would give the same numbering for the three nodes. (DFS was used because of its lower memory consumption).
- All leaf nodes do not have Branch-Pattern-Nodes, i.e. there are no Optionals.
It is easy to see that we are only interested in the Leaf nodes of this tree. Lets move to an example more complicated.
"""
Usage: ./app [<file> [<template>]] update
"""
Using the idea from the previous example, we can create an equivalent tree to this pattern:
The obtained Leaf nodes are:
(<file> <template> update)
(<file> update)
(update)
It is important to see that these Leafs are not ambiguous: all the implicit
elements in argv
(Arguments and Commands) will be parsed without any
problem. The recognition of implicit elements is directly related to the
number elements. In the following last example, we will see a pattern that is
in fact ambiguous.
"""
Usage: ./app [<file>] update [<template>]
"""
As you can see, the red nodes (nodes 3 and 5) are ambiguous when the user
inserts $ ./app update update
.
With the patterns that are in the Leaf nodes from the previous trees, we can build a Look-Up-Table (LUT) and embed it into the C source-code:
1. Instantiate `DocoptArgs struct`
2. Populate struct with default values
3. Iterate over `argv` and:
1. Parse explicit elements (known as Options) and, at the same time,
2. Count the number of implicit elements, N.
4. Use the LUT to convert N into a translation table.
5. Parse the implicit elements into the `DocoptArgs struct` using the
translation table
6. Validate `DocoptArgs struct`, i.e. check if the inserted values respect
the _root pattern_.