
Streaming API #39

Open
Shinmera opened this issue Mar 17, 2021 · 12 comments

@Shinmera (Contributor) commented Mar 17, 2021

So it happens more often than I'd like that I want to serialise longer pieces of text to/from an encoding. Having to round-trip through an array copy to do so is quite cumbersome. It would be great if there was instead an API that works either via callbacks, or even better, via a resumable state machine. The callback API and the current copying API could both be implemented in terms of the state machine API quite trivially, I think.

Naturally this would require refactoring most things, and is as such a big undertaking. Still, I feel like this is a very valuable feature, since having to copy megabytes if not gigabytes of text around is often not just slow, but also prohibitively taxing on memory. A state machine API would allow processing text in a streaming fashion, too, without needing to keep anything at all in memory.
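To illustrate the kind of resumable interface I mean, here's a minimal sketch of a state-machine UTF-8 decoder (the names utf-8-decoder and feed-octets are made up for illustration, and error handling is elided): decoding state survives across calls, so input can arrive in arbitrary chunks without any large intermediate copies.

(defstruct (utf-8-decoder (:conc-name dec-))
  (code 0 :type (unsigned-byte 32))   ; partially decoded code point
  (remaining 0 :type (integer 0 3)))  ; continuation octets still needed

(defun feed-octets (dec octets start end emit)
  "Feed OCTETS[START,END) into DEC, calling EMIT with each completed
character. A partial sequence at END carries over to the next call."
  (loop for i from start below end
        for byte = (aref octets i)
        do (cond ((plusp (dec-remaining dec))
                  ;; Continuation octet: 10xxxxxx
                  (setf (dec-code dec)
                        (logior (ash (dec-code dec) 6) (ldb (byte 6 0) byte)))
                  (when (zerop (decf (dec-remaining dec)))
                    (funcall emit (code-char (dec-code dec)))))
                 ((< byte #x80)                     ; ASCII
                  (funcall emit (code-char byte)))
                 ((= (ldb (byte 3 5) byte) #b110)   ; 2-octet lead
                  (setf (dec-code dec) (ldb (byte 5 0) byte)
                        (dec-remaining dec) 1))
                 ((= (ldb (byte 4 4) byte) #b1110)  ; 3-octet lead
                  (setf (dec-code dec) (ldb (byte 4 0) byte)
                        (dec-remaining dec) 2))
                 ((= (ldb (byte 5 3) byte) #b11110) ; 4-octet lead
                  (setf (dec-code dec) (ldb (byte 3 0) byte)
                        (dec-remaining dec) 3)))))

Feeding #(195) and then #(169) across two separate calls emits #\é only once the second octet arrives, which is exactly the property the copying API can't give you.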

@sionescu (Member) commented Mar 18, 2021

If you're talking of reading from a file and writing the converted octets to another file, here's the typical sequence of copies:

Kernel buffer =(syscall)=> Lisp input stream buffer =(Babel conversion)=> string =(Babel conversion)=> Lisp output stream buffer =(syscall)=> Kernel.

It's not necessary to slurp all the input into a string and then write it out in one go: you need an intermediate string anyway, but by keeping track of the bounding indices you get a fairly simple implementation of an intermediate buffer. Callbacks and state machines gain you little.

You could conceive of a direct octets-to-octets converter using a single character as an intermediate buffer, bypassing the CL standard stream interface and doing straight syscalls, but that's a specialised use case outside the scope of Babel.
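Something like this rough sketch of that chunked copy, built on the existing entry points (recode-file is just an illustrative name; note that it quietly assumes each chunk ends on a character boundary, which holds for a fixed-width source encoding like latin-1):

(defun recode-file (in-path out-path &key (from :latin-1) (to :utf-8))
  ;; Read octets in fixed-size chunks, decode, re-encode, write out.
  (with-open-file (in in-path :element-type '(unsigned-byte 8))
    (with-open-file (out out-path :element-type '(unsigned-byte 8)
                                  :direction :output :if-exists :supersede)
      (let ((buffer (make-array 65536 :element-type '(unsigned-byte 8))))
        (loop for end = (read-sequence buffer in)
              while (plusp end)
              do (write-sequence
                  (babel:string-to-octets
                   (babel:octets-to-string buffer :end end :encoding from)
                   :encoding to)
                  out))))))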

@Shinmera (Contributor, Author) commented Mar 18, 2021

For encodings like utf-8 that have a variable number of bytes per character, I don't know the byte boundaries, so I can't just pass an arbitrary window of bytes to Babel and expect it to convert it. Having a state machine that keeps track of partial decoding like that makes the whole process a lot easier.

@sionescu (Member)

We would need two new conversion functions that take an explicit output buffer with bounding indices, and return the number of elements consumed. That should be pretty easy to do.

@Shinmera (Contributor, Author)

That would work, yeah.

@sionescu (Member)

I may take a look sometime in the next few weeks. If you have the spare time, you could try implementing it and Luis or I can assist.

@Shinmera (Contributor, Author)

I'll see what I can do, but can't promise anything -- very busy with Kandria right now.

@luismbo (Member) commented Mar 18, 2021

> For encodings like utf-8 that have a variable number of bytes per character, I don't know the byte boundaries, so I can't just pass an arbitrary window of bytes to Babel and expect it to convert it.

In that case, end-of-input-in-character is signalled and character-coding-error-position tells you where that last incomplete character starts. But there's no restart, so you'd need to count characters with vector-size-in-chars first, then pass the bounding indices to octets-to-string (which will count characters yet again).
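For example, a rough sketch of that workaround (decode-chunk is just an illustrative name; this assumes the reported position is an index into the input vector and that the condition and its accessor are reachable via the babel-encodings package):

(defun decode-chunk (octets &key (start 0) (end (length octets))
                                 (encoding :utf-8))
  "Decode as much of OCTETS as possible. Returns the string and the
index of the first unconsumed octet (the start of a partial character)."
  (handler-case
      (values (babel:octets-to-string octets :start start :end end
                                             :encoding encoding)
              end)
    (babel-encodings:end-of-input-in-character (c)
      ;; Retry up to the start of the incomplete character; the caller
      ;; carries the remaining octets over into its next chunk.
      (let ((boundary (babel-encodings:character-coding-error-position c)))
        (values (babel:octets-to-string octets :start start :end boundary
                                               :encoding encoding)
                boundary)))))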

It would be nicer if we provided a restart (possibly from the code-point-counter function) so that conversion would proceed up until that last incomplete character.

@luismbo (Member) commented Mar 18, 2021

> We would need two new conversion functions that take an explicit output buffer with bounding indices, and return the number of elements consumed. That should be pretty easy to do.

Perhaps octets-to-string could take a target string (and respective bounding indices). Likewise, string-to-octets could take a target vector.

Babel is a bit picky about the type of the target strings/vectors, possibly due to premature optimization on my part.
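As a sketch of the intended contract (octets-to-string-into is a hypothetical name, implemented naively here on top of the current API; a native version would decode directly into the target and stop when it fills up):

(defun octets-to-string-into (octets target
                              &key (start 0) (end (length octets))
                                   (target-start 0)
                                   (encoding :utf-8))
  "Decode OCTETS into TARGET starting at TARGET-START.
Returns the number of characters written."
  (let ((decoded (babel:octets-to-string octets :start start :end end
                                                :encoding encoding)))
    ;; REPLACE truncates if TARGET is too short; a native version would
    ;; instead report how many octets were actually consumed.
    (replace target decoded :start1 target-start)
    (min (length decoded) (- (length target) target-start))))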

@sionescu (Member)

We should make sure that doesn't add too much branching to octets-to-string. I'd want to keep the changes to a minimum and talk about more extensive changes later.

I have the impression that the internals of Babel are pretty slow, especially compared to other libraries like trivial-utf-8, but who has the time for a complete rewrite now? :D

@luismbo (Member) commented Mar 18, 2021

I hope making Babel as fast as trivial-utf-8 doesn't take a complete rewrite. Could be as simple as enabling the optimization declarations that have been commented out in strings.lisp.

@Shinmera (Contributor, Author)

Lord, it's been a year already since I started this ticket. Uuuh. Well, I still don't have time to dedicate to this, but I would very much appreciate it if someone else did!

@Zulu-Inuoe (Contributor)

Hey all.

I started working on this, but after spending far too much time on it, I've decided to give up and just embed a UTF-8 decoder in jzon, since that's only about 50 LOC.

However, if anyone is interested, I've got an API working here: https://github.com/Zulu-Inuoe/babel/tree/feature/streaming

What I came up with:

  1. Define a define-multibyte-decoder macro which wraps several helpers for multibyte encodings; primarily, it allows them to supply a consume-octet symbol which is locally bound so the decoders can advance the octet-reading process.
  2. Specifically for the utf-8b encoding, define a buffer into which the decoder can emit additional code points.

Both of these together define a new protocol where the code is shared between the new define-decoder and the new define-streaming-decoder. An example might make more sense:

(define-multibyte-decoder :cp932 (u1 consume-octet)
  (let ((u2 0))
    (declare (type ub8 u1 u2))
    (macrolet
        ((handle-error (n &optional (c 'character-decoding-error))
           (declare (ignore n c))
           `(error "TODO"))
         (handle-error-if-icb (var n)
           `(when (not (< #x7f ,var #xc0))
              (handle-error ,n invalid-utf8-continuation-byte))))
      (cond
        ;; 2 octets
        ((or (<= #x81 u1 #x9f)
             (<= #xe0 u1 #xfc))
         (setf u2 (consume-octet))
         (cp932-to-ucs (logior (f-ash u1 8)
                               u2)))
        ;; 1 octet
        (t
         (cp932-to-ucs u1))))))

Things to note:

  1. The body expects to be given the first octet, here in the u1 variable. This allows us to control the iteration directly from outside. For example, in my code I do this:
(let ((u1 (read-byte stream nil nil)))
  (when u1 (funcall decoder u1) ...))

if I am processing a stream, or

(when (< i (length vect))
  (funcall decoder (prog1 (aref vect i) (incf i)) ...))

if I am processing a vector.

  2. The code is provided a consume-octet form it may invoke as a local macro to pull the next octet as needed.

  3. This body of code should return the next code point.

For UTF-8b, where a single 'call' may need to emit up to 4 code points (one per raw octet of an invalid sequence), I defined a buffer-var argument:

(define-multibyte-decoder :utf-8b (u1 consume-octet :buffer-var buffer)
  ...)

This is an (array (integer 0) (*)) that the decoder may use to place additional code points beyond the first.

The reason for this split is that for the great majority of encodings and cases a single value is all that's required, so I optimize for that case; performance would drop significantly otherwise.

Finally, here's how it's used, via a with-vector-decoder helper macro I created for iterating over the code points of a vector:

(defun octets-to-string-new2 (vector &key (start 0) end (encoding *default-character-encoding*))
  (with-vector-decoder (decoder vector :encoding encoding :start start :end end)
    (loop :with string := (make-array 128 :element-type 'character :adjustable t :fill-pointer 0)
          :for c := (decoder) :while c
          :do (vector-push-extend c string)
          :finally (return (coerce string 'simple-string)))))

It's all working, and it performs well. I just don't have it in me to review the macros to make sure they don't have issues like symbol leakage, and the errors need to be revised (the old error handlers reported positions etc., which are no longer directly available to each of the macros).

Good luck if you're braver than I.
