Change LineDecoder to match stdlib splitlines, resulting in significant speed up #2423

giannitedesco · 2022-10-26T11:30:09Z

This leads to enormous speedups when doing things such as Response(...).iter_lines().

tomchristie · 2022-11-02T11:32:53Z

A solution using the built-in splitlines would probably be preferable to our own regex.
It might be worth taking a look at what requests does here?

giannitedesco · 2022-11-03T06:24:09Z

> A solution using the built-in splitlines would probably be preferable to our own regex. It might be worth taking a look at what requests does here?

Actually requests is using splitlines, too. So should I just switch to that and update the failing tests?

tomchristie · 2022-11-03T10:26:01Z

> So should I just switch to that and update the failing tests?

Switching to a splitlines based approach, using the requests implementation as a reference point would be a good starting point...

giannitedesco · 2022-11-04T01:37:31Z

Haha, yes, the first implementation I did with splitlines happens to be identical to theirs when refactored (i.e. theirs is a generator, not based on an object 😂), but, yeah, that broke the tests, which is when I went off on a tangent to fix that.
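
For readers following along, here is a minimal sketch of what a generator-style, splitlines-based line iterator can look like. It is only an illustration of the approach discussed above, not the actual code from requests or from this PR; the name `iter_lines` and the `chunks` parameter are assumptions made for the example.

```python
from typing import Iterable, Iterator


def iter_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Yield lines, without their terminators, from an iterable of text chunks."""
    pending = ""  # text of a line that has not been terminated yet
    for chunk in chunks:
        text = pending + chunk
        pending = ""
        parts = text.splitlines(keepends=True)
        # With keepends=True, an unterminated final part comes back unchanged
        # from a second splitlines() call; hold it back until more text arrives.
        if parts and parts[-1].splitlines()[0] == parts[-1]:
            pending = parts.pop()
        for part in parts:
            # Drop the terminator from each completed line.
            yield part.splitlines()[0]
    if pending:
        yield pending


print(list(iter_lines(["one\ntw", "o\r\nthree"])))  # ['one', 'two', 'three']
```

One limitation of this simple form: a `\r\n` pair that happens to be split across two chunks is seen as two separate line breaks, so it produces a spurious empty line.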

tests/models/test_responses.py

giannitedesco · 2022-11-21T03:37:13Z

@florimondmanca @tomchristie is this acceptable? I suppose the benefit of this API is that if you do something like ''.join(resp.iter_lines()) you recover the original text, which means you could, for example, hash, encrypt, or compress the response body incrementally by lines without altering its contents. So, while incompatible, maybe it could be considered an improvement? But I'm not sure of the reasoning behind the previous normalization of line endings, so who can say?

tomchristie · 2022-11-21T09:17:53Z

> is this acceptable?

No. I don't believe resolving this issue should require the tests to change.

tomchristie · 2022-11-21T10:51:51Z

Hrm. I'd not appreciated that we have different behaviour from requests here. That's interesting.

(Our output includes trailing \n delimiters. Theirs does not.)

giannitedesco · 2022-11-23T14:55:33Z

Right, I feel like there are 3 options to what the behaviour could be:

1. no newlines (à la requests)
2. all newline types replaced with "\n" (à la httpx now)
3. all newlines preserved (result of this patch)

Options 1 and 3 can be done efficiently and simply with str.splitlines() (option 3 by passing keepends=True), and only option 3 allows for reconstructing the exact original document after the fact (and therefore supports, e.g., incrementally hashing a document line by line). (Edit: I feel option 3 is the superior option since it supports the most use cases.)
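
To make the three options concrete, here is a small sketch on a single in-memory string. The sample text and variable names are made up for illustration, and the option 2 expression is only an approximation of normalizing every terminator to "\n"; it is not httpx's previous implementation.

```python
text = "first\r\nsecond\rthird\n"

# Option 1: terminators dropped (the default behaviour of str.splitlines).
without_newlines = text.splitlines()

# Option 3: terminators preserved exactly as they appeared.
preserved = text.splitlines(keepends=True)

# Option 2 (approximation): every terminator rewritten as "\n".
normalized = [
    part.splitlines()[0] + "\n" if part.splitlines()[0] != part else part
    for part in preserved
]

assert without_newlines == ["first", "second", "third"]
assert preserved == ["first\r\n", "second\r", "third\n"]
assert normalized == ["first\n", "second\n", "third\n"]

# Only option 3 round-trips back to the original text.
assert "".join(preserved) == text
assert "".join(normalized) != text
```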

Preservation of the existing semantics could be done with a regex perhaps, but it's tricky to get right. Or maybe a straight Python implementation that just doesn't start from the beginning every time; it wouldn't be as fast as the ones that call into C code, but it could at least be linear-time?

Lemme know which direction you want to take it and I can give it a go..

tomchristie · 2022-11-30T10:14:22Z

> Right, I feel like there are 3 options to what the behaviour could be.

I'd suggest that we've got the current behaviour wrong, and that we need a breaking API change here.
The sensible behaviour for iter_lines would be to match the same style as the stdlib's splitlines().
So, yeah, I think we want option (1) here. (Right?)

ofek · 2022-12-07T23:51:11Z

> I think we want option (1) here. (Right?)

Yes imo

florimondmanca · 2022-12-08T04:29:57Z

After reading my comment pointing at a behavior change, it’s true that I would rather be expecting iter_lines() to return lines without newline separators at the end. This is what splitlines() does, but also other languages like Rust’s str.lines() (AdventOfCode season…), or JS when doing .split(/\r?\n/), etc.

What are the ways this change could impact existing code? I don’t see many. If code is currently using iter_lines() and calling trim() on the result, they shouldn’t be impacted, only those trim calls would become unnecessary. People can’t be relying on the current behavior to reconstruct the original content either, so there shouldn’t be breakage on this use case either.
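
As a rough illustration of the point about trimming, the sketch below shows that code which strips line terminators gets the same result whether or not the lines still carry a trailing "\n". The helper `strip_terminator` and the sample lists are hypothetical and stand in for the output of iter_lines() before and after the change.

```python
def strip_terminator(line: str) -> str:
    # Hypothetical helper: drop any trailing "\r" / "\n" characters.
    return line.rstrip("\r\n")


before = ["alpha\n", "beta\n", "gamma"]  # lines with normalized "\n" endings
after = ["alpha", "beta", "gamma"]       # lines as str.splitlines() returns them

# Stripping is needed for the old style and is a harmless no-op for the new one.
assert [strip_terminator(line) for line in before] == after
assert [strip_terminator(line) for line in after] == after
```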

httpx/_decoders.py

Handle text ending in `\r` more gracefully. Return as much content as possible.

tomchristie · 2023-01-10T14:40:37Z

Noticed that this implementation wouldn't return until flush for input that was just a sequence of \r characters, so I've adapted it slightly.

Implementation update in d3c6a8e.

Test change in 46cad89...

    from httpx._decoders import LineDecoder  # private module (httpx/_decoders.py)

    decoder = LineDecoder()
    assert decoder.decode("") == []
    # This will now return `["a", "", "b"]` and *only* buffer the last portion.
    assert decoder.decode("a\r\rb\rc\r") == ["a", "", "b"]
    assert decoder.flush() == ["c"]
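
The behaviour in that test can be approximated with a short, self-contained sketch. This is an illustration written for this thread, assuming a held-back trailing `\r` and a buffer for the unterminated tail; the class name `SketchLineDecoder` is hypothetical, and the code is not the implementation merged in d3c6a8e.

```python
from typing import List


class SketchLineDecoder:
    """Hypothetical incremental line splitter; not httpx's actual LineDecoder."""

    def __init__(self) -> None:
        self.buffer = ""          # unterminated tail of the previous input
        self.trailing_cr = False  # a "\r" held back from the previous input

    def decode(self, text: str) -> List[str]:
        # Re-attach a held-back "\r" so that "\r" + "\n" is seen as one break.
        if self.trailing_cr:
            text = "\r" + text
            self.trailing_cr = False
        # Hold back a final "\r": the next input may start with "\n".
        if text.endswith("\r"):
            self.trailing_cr = True
            text = text[:-1]
        text = self.buffer + text
        self.buffer = ""
        parts = text.splitlines(keepends=True)
        # An unterminated final part stays in the buffer for the next call.
        if parts and parts[-1].splitlines()[0] == parts[-1]:
            self.buffer = parts.pop()
        return [part.splitlines()[0] for part in parts]

    def flush(self) -> List[str]:
        if not self.buffer and not self.trailing_cr:
            return []
        lines = [self.buffer]
        self.buffer = ""
        self.trailing_cr = False
        return lines


decoder = SketchLineDecoder()
assert decoder.decode("a\r\rb\rc\r") == ["a", "", "b"]
assert decoder.flush() == ["c"]
```

Only the final `\r` is held back, so an input made up entirely of `\r` characters still returns its earlier lines immediately instead of waiting for flush().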

httpx/_decoders.py

cdeler · 2023-01-13T04:25:05Z

LGTM (with a small style suggestion which can be easily ignored)

zanieb · 2023-01-13T05:24:21Z

Can we also update this pull request title to note the change in behavior since it's no longer just a change in algo complexity?

tomchristie · 2023-03-16T14:29:10Z

Okay, I think we can merge this in now.
The line ending behaviour change means that we should bump the version on this one.

ofek · 2023-03-16T14:35:24Z

This is very exciting, thanks!

giannitedesco force-pushed the fast-line-decode branch 7 times, most recently from 08ad3ce to 50c6811 on October 26, 2022 12:23

giannitedesco force-pushed the fast-line-decode branch 7 times, most recently from 9cbd511 to 27b880e on November 6, 2022 07:12

florimondmanca reviewed Nov 6, 2022

tests/models/test_responses.py (resolved)

giannitedesco force-pushed the fast-line-decode branch from 27b880e to ee687ae on November 21, 2022 03:33

giannitedesco force-pushed the fast-line-decode branch from ee687ae to 042ab35 on November 21, 2022 03:38

tomchristie reviewed Dec 9, 2022

httpx/_decoders.py (outdated, resolved)

tomchristie reviewed Dec 9, 2022

httpx/_decoders.py (outdated, resolved)

tomchristie added the "api change" label (PRs that contain breaking public API changes) Jan 4, 2023

tomchristie added 4 commits January 10, 2023 10:37

Merge branch 'master' into fast-line-decode

b656f90

Update _decoders.py

d3c6a8e

Handle text ending in `\r` more gracefully. Return as much content as possible.

Update test_decoders.py

46cad89

Merge branch 'master' into fast-line-decode

becea55

tomchristie requested a review from a team January 10, 2023 14:40

cdeler reviewed Jan 11, 2023

httpx/_decoders.py (outdated, resolved)

tomchristie added 3 commits January 12, 2023 10:15

Update _decoders.py

e45ed3a

Update _decoders.py

0e3ad1b

Update _decoders.py

3e176f3

tomchristie requested a review from cdeler January 12, 2023 10:42

Merge branch 'master' into fast-line-decode

53ee5a1

cdeler reviewed Jan 13, 2023

httpx/_decoders.py (outdated, resolved)

cdeler approved these changes Jan 13, 2023

tomchristie and others added 2 commits January 13, 2023 12:08

Update httpx/_decoders.py

8bf3b43

Co-authored-by: cdeler <serj.krotov@gmail.com>

Update _decoders.py

8509ab5

cdeler approved these changes Jan 13, 2023

florimondmanca changed the title ~~Replace quadratic algo in LineDecoder~~ Change LineDecoder to match stdlib splitlines, resulting in significant speed up Feb 4, 2023

Merge branch 'master' into fast-line-decode

050ca64

florimondmanca mentioned this pull request Feb 12, 2023

Drop private imports from test_decoders.py #2570

Merged

Merge branch 'master' into fast-line-decode

f6685bd

tomchristie merged commit 85c5898 into encode:master Mar 16, 2023
5 checks passed

tomchristie mentioned this pull request Apr 6, 2023

Version 0.24.0 #2652

Merged

This was referenced Apr 13, 2023

Address inconsistent _lazy_values after attribute set gtsystem/lightkube#45

Closed

Client pod logs can now stream with or without newlines gtsystem/lightkube#46

Merged
