Updated EEP 31

bufflig · Dec 18, 2009 · e3fceff
1 parent ea7ed7c
commit e3fceff
Showing 1 changed file with 152 additions and 10 deletions.
diff --git a/eep-0031.txt b/eep-0031.txt
@@ -2,7 +2,7 @@ EEP:            31
 Title:          Binary manipulation and searching module
 Version:        $Id: eep-0031.txt,v 1.1 2009/11/26 13:13:37 pan Exp pan $
 Last-Modified:  $Date: 2009/11/26 13:13:37 $
-Author:         Patrik Nyblom
+Author:         Patrik Nyblom, Fredrik Svahn
 Status:         Draft
 Type:           Standards Track
 Content-Type:   text/x-rst
@@ -46,6 +46,13 @@ However some common operations are useful to have as ordinary functions,
 both for performance and to support a more traditiona functional
 programming style.
 
+Some operations for converting lists to binaries and vice versa are
+today in the erlang module. The existing BIFs concerning binaries have
+inconsistent views of zero- versus one-based positioning in binaries:
+binary_to_list/3 uses one-based positions while split_binary/2 uses
+zero-based. As the convention is to use zero-based positions, new
+functions for converting binaries to lists and vice versa are needed.
+
 Binaries are in fact a shared data-type, with small binaries often
 referencing parts of larger binaries in a way not controllable by
 the programmer in a simple way. The bitstring data-type further
@@ -54,6 +61,11 @@ manage. I therefore also suggest some low level functions to
 inspect binary representation and to clone binaries to ensure a
 minimal representation.
 
+As matching is not allowed in guard expressions, I furthermore suggest
+that a function for extracting parts of binaries is added to the set
+of guard BIFs. This would be consistent with the function element/2
+being allowed in guards.
+
 Rationale
 =========
 
@@ -83,6 +95,9 @@ The functionality suggested is the following:
   operations that are not applicable to lists or that we still don't
   know the need for.
 
+- Functions for converting lists to binaries and v.v. These functions
+  should have a consistent view of zero-based indexing in binaries.
+
 - Operations on binaries concerning their internal
   representation. This functionality is sometimes necessary to avoid
   extensive use of memory due to the shared nature of the binaries. As
@@ -147,6 +162,12 @@ search-pattern, later to be used in the find, split or replace
 functions. The cp() returned is guaranteed to be a tuple() to allow
 programs to distinguish it from non precompiled search patterns
 
+When a list of binaries is given, it denotes a *set* of alternative
+binaries to search for. I.e., if ``[<<"functional">>, <<"programming">>]``
+is given as ``Pattern``, this means ''either ``<<"functional">>`` *or*
+``<<"programming">>``''. The pattern is a *set* of alternatives; when
+only a single binary is given, the set has only one element.
+
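Note on the added paragraph: the "set of alternatives" reading can be sketched as below. This is a minimal illustration, not part of the commit. The module and function names are hypothetical, and it assumes the ``binary`` module as shipped in Erlang/OTP, which may differ from this draft in details (for instance, the draft names ``no`` as the result for a miss, while the shipped module returns ``nomatch``).

```erlang
-module(eep31_pattern_demo).
-export([find_keyword/1]).

%% Hypothetical helper: precompile a set of two alternatives once and
%% search Subject for whichever alternative occurs first.
find_keyword(Subject) when is_binary(Subject) ->
    %% The list denotes "either <<"functional">> or <<"programming">>".
    Pattern = binary:compile_pattern([<<"functional">>, <<"programming">>]),
    %% Returns {Pos, Len} of the first match, with Pos zero-based.
    binary:match(Subject, Pattern).
```

Calling ``eep31_pattern_demo:find_keyword(<<"a functional style">>)`` should report the match at position 2 with length 10; the order of the alternatives in the list does not matter.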
 If pattern is not a binary or a flat proper list of binaries, a ``badarg``
 exception will be raised.
 
@@ -182,8 +203,7 @@ the lowest position in ``Subject``, Example::
 
 Even though ``<<"cd">>`` ends before ``<<"bcde">>``, ``<<"bcde">>``
 begins first and is therefore the first match. If two overlapping
-matches begins at the same position, the shortest is returned.
-
+matches begin at the same position, the longest is returned.
 
 Summary of the options:
 
@@ -195,6 +215,8 @@ Summary of the options:
 The found part() is returned, if none of the strings in ``Pattern`` is
 found, the atom ``no`` is returned.
 
+For a description of ``Pattern``, see ``compile_pattern/1``.
+
 If ``{scope, {Start,Length}}`` is given in the options such that
 ``Start`` is larger than the size of ``Subject``, ``Start`` +
 ``Length`` is less than zero or ``Start`` + ``Length`` is larger than
@@ -211,6 +233,9 @@ Types:
 The same as matches(Subject, Pattern, []).
 
 **matches(Subject,Pattern,Options) -> Found**
+
+Types:
+
 - Subject = binary()
 - Pattern = binary() | [ binary() ] | cp()
 - Found = [ part() ] | []
@@ -220,18 +245,30 @@ The same as matches(Subject, Pattern, []).
 Works like match, but the ``Subject`` is search until exhausted and
 a list of all non-overlapping parts present in Pattern are returned (in order).
 
-The first and shortest match is preferred
-to a longer, which is illustrated by the following example::
+The first and *longest* match is preferred
+to a shorter, which is illustrated by the following example::
 
     1> binary:matches(<<"abcde">>, [<<"bcde">>,<<"bc">>>,<<"de">>],[]).
-    [{1,2},{3,2}]
+    [{1,4}]
 
-\- the result shows that ``<<"bc">>>`` and then ``<<"de">>`` are
-selected instead of the longer match ``<<"bcde">>``. This corresponds
-to the default behavior of regular expressions in the ``re`` module.
+\- the result shows that ``<<"bcde">>`` is selected instead of the
+shorter match ``<<"bc">>`` (which would have given rise to one more
+match, ``<<"de">>``). This corresponds to the behavior of
+POSIX regular expressions (and programs like ``awk``), but is not
+consistent with alternative matches in ``re`` (and Perl), where
+instead lexical ordering in the search pattern selects which string
+matches.
 
 If none of the strings in pattern is found, an empty list is returned.
 
+For a description of ``Pattern``, see ``compile_pattern/1`` and for a
+description of available options, see ``match/3``.
+
+If ``{scope, {Start,Length}}`` is given in the options such that
+``Start`` is larger than the size of ``Subject``, ``Start`` +
+``Length`` is less than zero or ``Start`` + ``Length`` is larger than
+the size of ``Subject``, a ``badarg`` exception is raised.
+
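Note on the longest-match change: the difference from ``re`` described in this hunk can be sketched as follows. This is a hedged illustration under the assumption that the shipped OTP ``binary`` and ``re`` modules behave as this revision describes; the module and function names are hypothetical.

```erlang
-module(eep31_matches_demo).
-export([longest_first/0, re_leftmost_alternative/0]).

%% <<"bcde">> and <<"bc">> both begin at position 1, so the longest
%% alternative is chosen and <<"de">> is never reported separately.
longest_first() ->
    binary:matches(<<"abcde">>, [<<"bcde">>, <<"bc">>, <<"de">>], []).

%% For contrast: in re (Perl-style alternation) the alternative written
%% first in the pattern wins, even though a longer one would also match.
re_leftmost_alternative() ->
    re:run(<<"abcde">>, <<"bc|bcde">>).
```

The first function should give ``[{1,4}]``, while the ``re`` call should give ``{match,[{1,2}]}``, which is the inconsistency the revised text points out.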
 **split(Subject,Pattern) -> Parts**
 
 Types:
@@ -299,6 +336,8 @@ The return type is always a list of binaries which are all referencing
 copied to new binaries and that ``Subject`` cannot be garbage
 collected until the results of the split are no longer referenced.
 
+For a description of ``Pattern``, see ``compile_pattern/1``.
+
 **replace(Subject,Pattern,Replacement) -> Result**
 
 Types:
@@ -350,6 +389,42 @@ If any position given in InsPos is greater than the size of the replacement bina
 The options ``global`` and ``{scope, part()}`` works as for
 ``binary:split/3``. The return type is always a binary.
 
+For a description of ``Pattern``, see ``compile_pattern/1``.
+
+**longest_common_prefix(Binaries) -> int()**
+
+Types:
+
+- Binaries = [ binary() ]
+
+Returns the length of the longest common prefix of the binaries in the
+list ``Binaries``. Example::
+
+     1> binary:longest_common_prefix([<<"erlang">>,<<"ergonomy">>]).
+     2
+     2> binary:longest_common_prefix([<<"erlang">>,<<"perl">>]).
+     0
+
+If ``Binaries`` is not a flat list of binaries, a ``badarg`` exception
+is raised. 
+
+**longest_common_suffix(Binaries) -> int()**
+
+Types:
+
+- Binaries = [ binary() ]
+
+Returns the length of the longest common suffix of the binaries in the
+list ``Binaries``. Example::
+
+     1> binary:longest_common_suffix([<<"erlang">>,<<"fang">>]).
+     3
+     2> binary:longest_common_suffix([<<"erlang">>,<<"perl">>]).
+     0
+
+If ``Binaries`` is not a flat list of binaries, a ``badarg`` exception
+is raised. 
+
 **first(Subject) -> int()**
 
 Types:
@@ -379,7 +454,6 @@ Returns the byte at position ``Pos`` (zero-based) in the binary
 ``Subject`` as an integer. If ``Pos`` >= byte_size(Subject), a
 ``badarg`` exception is raised.
 
-
 **part(Subject, PosLen) -> binary()**
 
 Types:
@@ -408,6 +482,48 @@ Types:
 
 The same as part(Subject, {Pos, Len}). 
 
+**bin_to_list(Subject) -> list()**
+
+Types:
+
+- Subject = binary()
+
+The same as bin_to_list(Subject,{0,byte_size(Subject)}).
+
+**bin_to_list(Subject, PosLen) -> list()**
+
+- Subject = binary()
+- PosLen = part()
+
+Converts ``Subject`` to a list of int(), each int representing the
+value of one byte. The ``part()`` denotes which part of the
+``binary()`` to convert. Example::
+
+	     1> binary:bin_to_list(<<"erlang">>,{1,3}).
+	     "rla"
+	     %% or [114,108,97] in list notation.
+
+If ``PosLen`` in any way references outside the binary, a ``badarg``
+exception is raised.
+
+**bin_to_list(Subject, Pos, Len) -> list()**
+
+Types:
+
+- Subject = binary()
+- Pos = int()
+- Len = int()
+
+The same as bin_to_list(Subject,{Pos,Len}).
+
+**list_to_bin(ByteList) -> binary()**
+
+Types:
+
+- ByteList = iodata() (see module erlang)
+
+Works exactly like erlang:list_to_binary/1, added for completeness.
+
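Note on the new conversion functions: the indexing mismatch that motivates them can be sketched as below. This is an illustration only, with hypothetical module and function names, and it assumes the functions exist in the shipped ``binary`` module with the signatures proposed here.

```erlang
-module(eep31_index_demo).
-export([zero_based/0, one_based/0]).

%% Proposed style: {Pos, Len} with a zero-based Pos.
%% Position 1 is the second byte ($r), and three bytes are taken.
zero_based() ->
    binary:bin_to_list(<<"erlang">>, {1, 3}).

%% Existing BIF: one-based Start and Stop, both inclusive.
%% Bytes 2 through 4 are the same three bytes.
one_based() ->
    erlang:binary_to_list(<<"erlang">>, 2, 4).
```

Both calls should yield ``"rla"``, but the arguments differ in base and in meaning (position plus length versus start and stop), which is the inconsistency the motivation text refers to.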
 **copy(Subject) -> binary()**
 
 Types:
@@ -430,6 +546,13 @@ This function will always create a new binary, even if ``N`` = 1. By
 using ``copy/1`` on a binary referencing a larger binary, one might
 free up the larger binary for garbage collection.  
 
+NOTE! By deliberately copying a single binary to avoid referencing a
+larger binary, one might, instead of freeing up the larger binary for
+later garbage collection, create much more binary data than
+needed. Sharing binary data is usually good. Only in special cases,
+when small parts reference large binaries and the large binaries are
+no longer used *in any process*, deliberate copying might be a good idea.
+
 If ``N`` < 0, a ``badarg`` exception is raised.
 
 **referenced_byte_size(binary()) -> int()**
@@ -478,6 +601,12 @@ Example of binary sharing::
         10
 	6> binary:referenced_byte_size(B)
 	100
+
+NOTE! Binary data is shared among processes. If another process still
+references the larger binary, copying the part this process uses only
+consumes more memory and will not free up the larger binary for garbage
+collection. Use intrusive functions of this kind with extreme care,
+and only if a *real* problem is detected.
 
 **encode_unsigned(Unsigned) -> binary()**
 
@@ -528,6 +657,19 @@ Example::
 	1> binary:decode_unsigned(<<169,138,199>>,big). 
         11111111
 
+Guard BIF
+---------
+
+I suggest adding the functions binary:part/2 and binary:part/3 to the
+set of BIFs allowed in guard tests. As guard BIFs are traditionally
+put in the erlang module, the following names for the guard BIFs are
+suggested::
+
+	erlang:binary_part/2
+	erlang:binary_part/3
+
+They should both work exactly as their counterparts in the binary module.
+
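Note on the guard BIF proposal: a guard using the suggested names might look like the sketch below. The module name, function name, and the request prefixes are hypothetical; the sketch assumes ``binary_part/2`` and ``binary_part/3`` are accepted in guards as proposed here.

```erlang
-module(eep31_guard_demo).
-export([classify/1]).

%% Dispatch on a binary prefix inside the guard, without matching in
%% the clause body. The byte_size/1 test keeps the intent explicit; a
%% failing binary_part call would simply make the guard fail.
classify(Bin) when byte_size(Bin) >= 4,
                   binary_part(Bin, 0, 4) =:= <<"GET ">> ->
    get;
classify(Bin) when byte_size(Bin) >= 5,
                   binary_part(Bin, {0, 5}) =:= <<"POST ">> ->
    post;
classify(_) ->
    other.
```

This mirrors how ``element/2`` is used on tuples in guards, which is the consistency argument made in the motivation.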
 Interface design discussion
 ---------------------------