Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantics of start / end keywords for sequence operations #843

Open
waywardmonkeys opened this issue Jan 3, 2015 · 14 comments
Open

Semantics of start / end keywords for sequence operations #843

waywardmonkeys opened this issue Jan 3, 2015 · 14 comments
Labels

Comments

@waywardmonkeys
Copy link
Member

In the DRM, there are operations like copy-sequence and fill! which are documented to take start and end keyword arguments.

In other libraries, there are things like string-equal? which do the same.

None of these appear to be clearly specified as to what should happen when end is greater than the size of the sequence.

It ends up that this is also an area where the implementations in Open Dylan and Gwydion Dylan weren't entirely consistent for the DRM-provided operations.

The DRM says (for copy-sequence):

Creates a freshly allocated sequence containing the elements
of source between start and end.

This is in contrast to other operations which are documented to signal errors on out-of-bounds accesses or other conditions.

Creating a copy-sequence("abc", 2, 5) could be an entirely valid reading of the DRM ... and in fact, it is implemented that way in the specialization of copy-sequence for <string> in Gwydion:

define method copy-sequence
    (vector :: <byte-string>,
     #key start :: <integer> = 0,
          end: last :: type-union(<integer>, singleton($not-supplied))
                 = $not-supplied)
 => (result :: <byte-string>);
  let src-sz :: <integer> = size(vector);
  let last :: <integer>
    = if (last ~== $not-supplied & last < src-sz) last else src-sz end if;
  let start :: <integer> = if (start < 0) 0 else start end if;
  let sz :: <integer> = last - start;

  if(start > last) sz := 0 end;

  let result :: <byte-string> = make(<byte-string>, size: sz);
  for (from-index :: <integer> from start below last,
       to-index :: <integer> from 0)
    %element(result, to-index) := %element(vector, from-index);
  end for;
  result;
end method copy-sequence;

That said, it appears that other copy-sequence methods are not implemented in quite the same way in Gwydion Dylan. In Open Dylan, there isn't any ambiguity, they're all implemented to result in failures.

For someone familiar with C and other libraries, this wasn't expected behavior for me. From strncmp:

The strncmp() function shall compare not more than n bytes (bytes that follow a null byte are not compared) from the array pointed to by s1 to the array pointed to by s2.

And from strncpy:

The strncpy() function shall copy not more than n bytes (bytes
that follow a null byte are not copied) from the array pointed to
by s2 to the array pointed to by s1. If copying takes place
between objects that overlap, the behavior is undefined.

If the array pointed to by s2 is a string that is shorter than n bytes,
null bytes shall be appended to the copy in the array pointed to
by s1, until n bytes in all are written.

There are a few different possible things that copy-sequence and other operations could do in this situation, but I'm not sure that just resulting in an error is the nicest, most convenient, or most expected of them.

@BarAgent
Copy link
Member

BarAgent commented Jan 3, 2015

element signals an error if it tries to access an element with a key that isn't present in the collection (i.e., if index ≧ size).

I think that all the collection operations are assumed to be implemented via element, in the abstract if not in practice. Therefore copy-sequence should fall back to the behavior exhibited by its building blocks, and signal an error if end: exceeds the sequence size ("exceed" because the end: index is an exclusive index).

I think this is also the better approach from a software engineering perspective. If we do something special beyond a naive implementation that uses element, we risk inconsistent behavior across different specializations of the built-in and custom collection classes, not to mention the extra burden of developing and accounting for the special behavior code. Additionally, end: already defaults to the sequence's size; if the user explicitly overrides that, he should only put in valid values. And finally, I like the symmetry of both reporting an error in cases where the compiler can determine invalid end: indices at compile-time, and still reporting an error in cases where it has to make that determination at run-time.

strncpy is not a good model to use for Dylan code. The relevant argument in strncpy is the length, whereas in Dylan it is the end index. You can't find the end index or the length of a C string without iterating through the entire thing, and C does not want to force you to do that in order to provide a valid length argument for a copy that will be iterating through the entire thing anyway. So strncpy is forgiving of that. That consideration is not necessary for Dylan.

@BarAgent
Copy link
Member

BarAgent commented Jan 3, 2015

fill! presents an additional issue. If fill! is understood to act like element-setter, then you can fill a <stretchy-sequence> to an end index beyond its size. element-setter would ensure the sequence has that index before setting it. Likewise, fill! would grow a <stretchy-sequence> to fill everything you want filled…but the same code applied to a non-stretchy <sequence> would yield an error. There is a symmetry break there, but a sensible one.

Because of that difference in the way element and element-setter treats an invalid index, if there were a function that, in the abstract, combines the actions of element and element-setter — say, a version of map that had a start: and end: — its semantics might not be common sense. Of course, in that case, the DRM has already made the expected behavior clear in its section on collection alignment.

@waywardmonkeys
Copy link
Member Author

I feel like you've presented a lot of strawman arguments.

Whether or not copy-sequence and friends are implemented in terms of element doesn't matter much when specifying how the end argument should be handled with respect to the bounds. We can define it however we find best.

As for:

I think this is also the better approach from a software engineering perspective. If we do something
special beyond a naive implementation that uses element, we risk inconsistent behavior across
different specializations of the built-in and custom collection classes, not to mention the extra
burden of developing and accounting for the special behavior code. Additionally, end: already
defaults to the sequence's size; if the user explicitly overrides that, he should only put in valid
values. And finally, I like the symmetry of both reporting an error in cases where the compiler
can determine invalid end: indices at compile-time, and still reporting an error in cases where
it has to make that determination at run-time.

We have to specify the semantics, no matter what they are. And Gwydion already demonstrates that even within a single codebase, without a clear definition of those semantics, different specializations will end up implementing different behavior.

That's why I'd like to see a single semantics for how start: and end: keywords are handled. This goes beyond the DRM-specified operations as these are also common in the strings library for example:

define sealed generic  string-equal?
    (string1 :: <string>, string2 :: <string>, #key start1, end1, start2, end2, test)
 => (equal? :: <boolean>);

I'd like to be able to compare a chunk of a string without having to worry at the call site about whether or not I've gone beyond the bounds of one of the strings. That's a pretty reasonable thing, otherwise, I end up with min calls at the call site:

if (string-equal?(uri, "//", start1: scheme-end, end1: min(uri.size, scheme-end + 2)))

Looking afield more at other languages ...

In Apple's NSString, any such begin/end range throws an NSRangeException when it is not within the bounds of the string.

Python is pretty lenient.

C++ std::string::substr also allows the end to beyond the bounds of the string (much like strncpy).

I don't want to try to claim that one side is on the side of "better software engineering". There are valid arguments for both behaviors and there's no reason to try have one side try to claim some moral high ground. It isn't productive.

As for fill! on a <stretchy-sequence> ... that's interesting and worthy of thought as to what should happen and how that relates to this.

@DHovekamp42
Copy link

I would suggest to use the above implementation of copy-sequence for string from Gwydion with end: and start: keywords bounded by the length of the sequence as the definition. This would match programmers expectations and common-lisp semantic best. An out-of-bounds access situation would never occur.

@swmckay
Copy link

swmckay commented Jan 4, 2015

I think that they should all signal an error if:

  • start >= length
  • end > length
  • start > end

Silently doing the wrong thing, is the wrong thing.

The fill! situation on stretchy sequences… extending the sequence might be right.

@DHovekamp42
Copy link

Returning an empty sequence is the right thing for all 3 'error' cases.

Compare to the CL-HyperSpec: http://www.ai.mit.edu/projects/iiip/doc/CommonLISP/HyperSpec/Body/glo_b.html#bounding_index_designator
bounded adj. (of a sequence S, by an ordered pair of bounding indices istart and iend) restricted to a subrange of the elements of S that includes each element beginning with (and including) the one indexed by istart and continuing up to (but not including) the one indexed by iend.

bounding index n. (of a sequence with length n) either of a conceptual pair of integers, istart and iend, respectively called the lower bounding index'' andupper bounding index'', such that 0 <=istart <=iend <=n, and which therefore delimit a subrange of the sequence bounded by istart and iend.

bounding index designator (for a sequence) one of two objects that, taken together as an ordered pair, behave as a designator for bounding indices of the sequence; that is, they denote bounding indices of the sequence, and are either: an integer (denoting itself) and nil (denoting the length of the sequence), or two integers (each denoting themselves).

@BarAgent
Copy link
Member

BarAgent commented Jan 5, 2015

start: and end: are used with streams as well as with strings and general sequences. A consistent semantic for start: and end: should also apply to streams. Streams are more restrictive than sequences. I consider this a good thing, because it forces us to be thorough. replace-subsequence!(“hello”, “goodbye”, start: 0) can result in a whole new “goodbye” string, but you have to consider the consequences of trying to grow something that can’t be grown if your code is:

let s = make(<string-stream>, contents: “hello”);
write(s, “goodbye”, start: 0)

I think we can describe the different semantics in these terms:

  • usable start indexes in stretchy and non-stretchy contexts
  • usable end indexes in stretchy and non-stretchy contexts
  • effect when reading or writing to a stream using those indexes

I will use Swift range notation to describe usable start and end indexes.

I can think of the following semantics (this would be easier if GitHub allowed tables or monospace font while editing):

Conservative — Only indexes that are valid with element and the end-of-file-or-sequence index are allowed.

  • non-stretchy read
    • Valid start indexes are 0 ... size.
    • Valid end indexes are 0 ... size.
    • Reading outside that range is an error.
    • Reading the range size … size yields an empty result.
  • stretchy read
    • Valid start indexes are 0 … size.
    • Valid end indexes are 0 … size.
    • Reading outside that range is an error.
    • Reading the range size … size yields an empty result.
  • non-stretchy write
    • Valid start indexes are 0 ..< size.
    • Valid end indexes are 0 … size.
    • Writing outside that range is an error.
  • stretchy write
    • Valid start indexes are 0 … ∞.
    • Valid end indexes are 0 … ∞.
    • Writing outside the range 0 ..< size grows the collection or file.

Virtual — The file or sequence is assumed to be as big as you need it to be.

  • non-stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size results in a sequence of the fill value.
    • For example, reading 1...3 of “Hi” results in “i “.
  • stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size results in a sequence of the fill value. It does not actually grow the collection or file, though.
  • non-stretchy write
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Writing outside the range 0 ..< size only changes elements in the intersection of the specified range and the range 0 ..< size.
  • stretchy write.
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Writing outside the range 0 ..< size grows the collection or file.

Permissive — The system doesn’t complain if you access the file or sequence beyond its end.

  • non-stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size returns elements in the intersection of the specified range and the range 0 ..< size.
    • For example, reading 1…3 of “Hi” results in “i”. An empty intersection yields an empty result.
  • stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size returns elements in the intersection of the specified range and the range 0 ..< size.
    • For example, reading 1...3 of “Hi” results in “i”. An empty intersection yields an empty result.
  • non-stretchy write
    • Valid start range is 0 ... ∞.
    • Valid end range is 0 ... ∞.
    • Writing outside the range 0 ..< size only changes elements in the intersection of the collection or file’s range and the specified range.
  • stretchy write
    • Valid start range is 0 ... ∞.
    • Valid end range is 0 ... ∞.
    • Writing outside the range 0 ..< size grows the collection or file.

Permissive/Virtual — If the file or sequence is stretchy, it is assumed to be as big as you need it to be. If not stretchy, the system doesn’t complain if you access it beyond its end.

  • non-stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size returns elements in the intersection of the specified range and the range 0 ..< size.
    • For example, reading 1...3 of “Hi” results in “i”. An empty intersection yields an empty result.
  • stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size results in a sequence of the fill value. It does not actually grow the collection or file, though.
    • For example, reading 1...3 of “Hi” results in “i “.
  • non-stretchy write
    • Valid start range is 0 ... ∞.
    • Valid end range is 0 ... ∞.
    • Writing outside the range 0 ..< size only changes elements in the intersection of the collection or file’s range and the specified range.
  • stretchy write
    • Valid start range is 0 ... ∞.
    • Valid end range is 0 ... ∞.
    • Writing outside the range 0 ..< size grows the collection or file.

Conservative/Virtual — If the file or sequence is stretchy, it is assumed to be as big as you need it to be. If not stretchy, only indexes that are valid with element and the end-of-file-or-sequence index is allowed.

  • non-stretchy read
    • Valid start indexes are 0 ... size.
    • Valid end indexes are 0 ... size.
    • Reading outside that range is an error.
    • Reading the range size ... size yields an empty result.
  • stretchy read
    • Valid start indexes are 0 ... ∞.
    • Valid end indexes are 0 ... ∞.
    • Reading outside the range 0 ..< size results in a sequence of the fill value. It does not actually grow the collection or file, though.
    • For example, reading 1...3 of “Hi” results in “i “.
  • non-stretchy write
    • Valid start indexes are 0 ..< size.
    • Valid end indexes are 0 ... size.
    • Writing outside that range is an error.
  • stretchy write
    • Valid start range is 0 ... ∞.
    • Valid end range is 0 ... ∞.
    • Writing outside the range 0 ..< size grows the collection or file.

Let’s consider some advantages and disadvantages.

  • In the URL parsing example earlier, any semantic other than Conservative or Conservative/Virtual would give you the nice behavior where string-equal?(“http:”, "//", start1: 5, end1: 7) is error-free.
  • I do not like the Permissive or Permissive/Virtual semantics, because accessing an array beyond its end seems like a programmer error.
  • The data you read under the Virtual, Permissive/Virtual, and Conservative/Virtual semantics is incorrect (or, more generously, pre-populated). Sometimes you want data to be pre-populated for ease of processing, and sometimes you don’t.
  • To draw an analogy to databases, it could come down to distinguishing between null and empty. Is a zero-length string equivalent to a nonexistent string? The Permissive semantic answers that with “empty == non-existent”, the Conservative semantic answers “empty ~== non-existent”, and the question can’t even arise under the Virtual semantics.

Of these, then, the only semantic that does not make assumptions about desirability of pre-populated data or the equivalence of empty and null data is the Conservative semantic. More precisely, it does not prevent the application from making those distinctions.

@waywardmonkeys
Copy link
Member Author

@BarAgent , that's a great comment, thanks!

I think that you and @swmckay agree that the semantics described by your "Conservative" section are what you would prefer.

@DHovekamp42, I don't see a clear statement in the CLHS that an empty sequence is what should be returned in those situations. It defines a bounded sequence, but it doesn't state what should happen when the bounds are invalid. This is true in the docs for subseq as well by my reading.

I have no horse in this race and am not against that result at all. I just want something to be clearly specified and followed everywhere.

An interesting extension to this question is when the bounds should be verified and a condition signaled. This is particularly true in the case of fill! on a <mutable-sequence> where the size calculation is not necessarily cheap, like <list>.

The current implementation does not verify the bounds prior to mutating the list, so you can end up with a situation where the list is mutated and a condition is signaled.

@waywardmonkeys
Copy link
Member Author

Thanks to some help from #lisp (from |3b|, beach and ggole) on IRC, I got some information about this from the Common Lisp perspective:

http://www.lispworks.com/documentation/HyperSpec/Issues/iss332_w.htm is about out of bounds indices with subseq.

While things like find and subseq don't address out of bounds indices directly, they do say this about proper sequences:

Exceptional Situations:

Should be prepared to signal an error of type type-error if sequence is not a proper sequence.

beach indicates that:

The spirit is that if you have a long list, you should not have to traverse it to the end
just to check for errors if you can obtain a reasonable result anyway.

The usage of the term should be prepared to signal an error is described here:

Should be prepared to signal an error
This is similar to ``should be signaled'' except that it does not imply thatextra effort' has to be taken on the part of an operator to discover an erroneous situation if the normal action of that operator can be performed successfully with only lazy' checking. An
implementation is always permitted to signal an error, but even in safe code, it is only
required to signal the error when failing to signal it might lead to incorrect results. In
unsafe code, the consequences are undefined.

For example, defining that ``find should be prepared to signal an error of type
type-error if its second argument is not a proper list'' does not imply that an error is
always signaled. The form

(find 'a '(a b . c))

must either signal an error of type type-error in safe code, else return A. In unsafe code,
the consequences are undefined. By contrast,

(find 'd '(a b . c))

must signal an error of type type-error in safe code. In unsafe code, the consequences are
undefined. Also,

(find 'd '#1=(a b . #1#))

in safe code might return nil (as an implementation-defined extension), might never return, or
might signal an error of type type-error. In unsafe code, the consequences are undefined.

Typically, the ``should be prepared to signal'' terminology is used in type checking situations
where there are efficiency considerations that make it impractical to detect errors that are
not relevant to the correct operation of the operator.

@DHovekamp42
Copy link

After rereading the given comments and the http://www.lispworks.com/documentation/HyperSpec/Issues/iss332_w.htm clean-up clarification in CLHS
I'd vote for @BarAgent conservative view too.

For the convenience I still prefer always working code - i.e. to get an empty sequence as plausible result from out-of-bounds indexing - but can confirm that current CL-implementations like LispWorks & CCL verify bounds and signal an out-of-bounds condition as an error. Also it will be easier to wrap some code around an error-signaling standard to get my desired behavior - by catching and handling out-of-bounds conditions accordingly - as the other way around.

The case of having fill! on a leaving a modified list AND signaling a condition is worse - if possible the implementation should handle it in such a way that the mutation is done only after the indices have been verified as valid.

@swmckay
Copy link

swmckay commented Jan 13, 2015

By the way, code that runs without signaling an error is not the same
thing as "working code". A "plausible result" is not always a correct
result. Better to have errors signaled, which can then be corrected,
than to have code that fails silently without errors and yields results
that aren't predictable.

What you want isn't really very sound engineering practice. I've learned
the hard way over many years that these kinds of "favors" end up screwing
you sooner or later, and usually just later enough that you no longer have
any idea what the hell is going on in that code you wrote last year.

--S

On Tue, Jan 13, 2015 at 9:22 AM, Dieter Hovekamp notifications@github.com
wrote:

After rereading the given comments and the
http://www.lispworks.com/documentation/HyperSpec/Issues/iss332_w.htm
clean-up clarification in CLHS
I'd vote for @BarAgent https://github.com/BarAgent conservative view
too.

For the convenience I still prefer always working code - i.e. to get an
empty sequence as plausible result from out-of-bounds indexing - but can
confirm that current CL-implementations like LispWorks & CCL verify bounds
and signal an out-of-bounds condition as an error. Also it will be easier
to wrap some code around an error-signaling standard to get my desired
behavior - by catching and handling out-of-bounds conditions accordingly -
as the other way around.

The case of having fill! on a leaving a modified list AND signaling a
condition is worse - if possible the implementation should handle it in
such a way that the mutation is done only after the indices have been
verified as valid.

Reply to this email directly or view it on GitHub
#843 (comment)
.

@DHovekamp42
Copy link

Thanks @swmckay for elaborating and agreed to better have errors signaled than unpredictable results!

My statement that an empty sequence is the right thing to return for the index out of bounds 'error' should be seen as conceptual semantic expectation on implicit lower and upper bounding pairs of start and end into a sequence as written up.

Lifting this assumption puts the burden back to the user at each call as seen in the example
if (string-equal?(uri, "//", start1: scheme-end, end1: min(uri.size, scheme-end + 2)))

So trading in readability and potentially expensive double checking on start & end limits in user & library code - for making the programmers assertions explicit.

(The implicit semantic is working for me - but I may have just been lucky that I wasn't screwed up yet and read my simpler code with pleasure years later. It won't harm if I now have to write down my expectations in a kind of API to some library calls and/or make my "favors" explicit for others.)

@swmckay
Copy link

swmckay commented Jan 13, 2015

For the record, I'm very sympathetic to this kind of example.
My personal toolkit always includes functions like 'starts-with?',
'end-with?', and 'looking-at?'.

So your example would like like this:
if (looking-at?(uri, "//", start: scheme-end)) ...

I just think the language itself should be more strict where possible,
because it's hard to add strictness back in.

--S

On Tue, Jan 13, 2015 at 3:45 PM, Dieter Hovekamp notifications@github.com
wrote:

Thanks @swmckay https://github.com/swmckay for elaborating and agreed
to better have errors signaled than unpredictable results!

My statement that an empty sequence is the right thing to return for the
index out of bounds 'error' should be seen as conceptual semantic
expectation on implicit lower and upper bounding pairs of start and end
into a sequence as written up.

Lifting this assumption puts the burden back to the user at each call as
seen in the example
if (string-equal?(uri, "//", start1: scheme-end, end1: min(uri.size,
scheme-end + 2)))

So trading in readability and potentially expensive double checking on
start & end limits in user & library code - for making the programmers
assertions explicit.

(The implicit semantic is working for me - but I may have just been lucky
that I wasn't screwed up yet and read my simpler code with pleasure years
later. It won't harm if I now have to write down my expectations in a kind
of API to some library calls and/or make my "favors" explicit for others.)

Reply to this email directly or view it on GitHub
#843 (comment)
.

@cgay
Copy link
Member

cgay commented Jan 13, 2015

For the record, starts-with?: https://github.com/dylan-lang/strings/blob/master/strings.dylan#L426

For what it's worth, although I've been spoiled by the laxness of Python string operations over the past years, I'm totally on board with the conservative approach. The way I see it we can always provide a more lax implementation as part of a scripting library or similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants