Combine cboar and cbor2, close #47 #51

waveform80 · 2019-05-22T23:53:46Z

Here we go! Turned out to be a few more than 15 commits... ahem...

As mentioned in the associated ticket (#47) the overall diff is ridiculously huge. I've done a few rebases now to break everything down into (hopefully) manageable / reviewable chunks. A rough overview is as follows:

The first 4 commits deal with trivial changes to the encoder / decoder.
The "migrate functions to methods" commit is the first "big" commit which moves most stuff into the CBOREncoder and CBORDecoder classes; although it looks complex it's mostly fairly trivial (moving the location of the function and replacing the first arg with "self") but there are some slightly more complex changes in there.
The next "big" commit is the one merging cboar into the package. This was a straight copy of cboar with a couple of regex-based search'n'replaces of CBOAR with CBOR2 etc.
From here on there's three different types of commits...
Those labelled "parity work" are commits aimed at bringing the Python and C implementations closer together, i.e. renaming attributes, ensuring exceptions have the same type and message for tests, etc.
Those labelled "bug" are commits fixing bugs I found while merging the projects - mostly things I'd missed in the cboar tests which the cbor2 tests picked up and on rare occasions the reverse.
The rest are parts of the merge work itself; making the test suite work across implementations, boosting coverage, including various utility scripts for testing speed, trying to catch ref-leaks, etc.
Finally, the one thing I haven't looked at in detail yet is the documentation, mostly because I'm not sure how they're currently built (the sphinx Makefile doesn't seem to be there?). That said, they will need adjusting as there's some API changes here (e.g. the change in the tag_hook prototype to remove shareable_index, etc).

Most of the commits have a bit more commentary in their commit messages, hopefully enough for some context. If you want anything split up further (or combined, re-ordered, removed, etc.) do let me know. I'm happy rebasing and editing stuff for clarity.

Good luck!

Also remove shareable_index argument

Move the bulk of the encoding logic into the encoding class alongside the other encoding methods

The C-module is written against the 3.3+ API. We could skip testing this and just declare it's 3.4+ (it'd certainly be easier given all the hoops one has to jump through to get tox working on 3.3) but I like to support as many versions as I can where possible. Still - do feel free to reject this commit - I can always strip it out easily enough.

Favouring the C variant where it successfully imports. This commit also handles generating the encoding dictionaries for the C variant from those defined for the Python variant (this is much simpler than doing this in C).

Just doc-string and comment changes, nothing behavioural. These changes are mostly to make some future changes a bit easier (particularly the use of context managers for exception testing which makes the tested call syntactically similar to other normal calls).

Various methods in the _cbor2 implementations had slightly different names or were missing from the public API; fix these so _cbor2 matches the cbor2 public API precisely

Previously, the C version was actively importing deferred types. The Python algorithm was better, only checking if the specified module was *already* imported (which, if we're encoding the specified type it must be or the specified type couldn't exist yet). This commit also removes the ugly reference counting in the deferred logic, using the standard "borrowed references on entry, new references on return" idiom throughout instead.

The C implementation of CBORSimpleValue had missed the __eq__ implementation from the Python version. Both were also missing a __ne__ method which led to some odd test results with inequalities.

This is a modification of the Python version to make it more like the C version: We don't call tell() in the C version in the event of an exception and arguably we shouldn't anyway as there's no guarantee that all file-like objects implement (e.g. sockets and pipes). This might make debugging a little harder for users but they can always call it themselves if they need it, and this makes the decoder applicable to literally any readable file-like object.

Minor changes to bring absolute parity between the implementations: ensuring exception messages are exactly equal, that tests equivalently in both (and don't rely on poking internal implementation details which is harder with the _cbor2 implementation).

The C variant of the module accepted out of ranges values like 256 without complaining (forgot that unsigned values aren't checked for overflow by ParseTuple)

Nice to see the test suite for the Python module picking up bugs in the C implementation!

This isn't a translation of FrozenDict to C, rather this simply imports the Python FrozenDict implementation into the _cbor2 module's __dict__

Use some py.test fixture magic to provide an "impl" parameter representing the C or the pure Python modules. Some further py.test magic is used to ensure the C variant is only tested on supported platforms and skipped otherwise.

The C variant supports semantic tags 260 and 261 for encoding IP addresses and networks; this commit adds a pure Python implementation for python 3.3 and above (the built-in ipaddress module was added in 3.3) along with the tests for these.

The C variant permits just about any type of value to be shared within the stream (though the encoder won't produce such shared values). This commit updates the Python variant to be similarly generous toward shared values.

The C implementation already correctly handles timestamp microseconds; ensure the Python variant does too

The C variant has a "str_errors" parameter, permitting configuration of the error-handling mode used when decoding UTF-8 strings. This commit adds this attribute to the Python variant, and also updates the decoding of indefinite-length byte and unicode strings to match the algorithm in the C variant. Specifically, this disallows split-unicode characters in indefinite length unicode strings, and also rejects non-string elements in indefinite length byte and unicode strings.

The C variant (by its nature) imposes more restrictions on certain attributes like object_hook, tag_hook, default, etc. It ensures the types assigned are valid (callable or None). This commit ensures the Python variant does the same. It also fixes some of the attribute names on the C side (which didn't match the Python side).

The C variant raises CBOREncodeError in the case that a deferred encoder has an invalid type (not a 2-tuple, etc.); this commit ensures the Python side does the same

This commit brings the Python CBORTag implementation in line with the C implementation. The repr of the class now handles recursive constructs correctly, and instances are now orderable as if they were simple 2-tuples

This commit brings the Python implementations of break_marker and undefined in line with the C variants. They are now true singletons, with the same reprs and boolean results.

In expanding coverage of the C variant, several segfaults were uncovered when calling load(s) and dump(s) with either no args (erroneously) or keyword-args only.

The immutable property was missing from the C variant, and a test was added for coverage

Doh! Should've run tox before that last commit...

Changaco · 2019-05-25T08:10:09Z

Looks good to me.

Sekenre · 2019-05-25T11:39:33Z

This reads well, thanks for structuring the commits so clearly. I'm going to run some tests and merge hopefully today and tomorrow.

Sekenre · 2019-05-25T15:24:39Z

It doesnt build with Microsofts toolchain which is used by Anaconda on windows. Can you add another filter to setup.py so it doesn't try? I will open another issue for it.

At least until we can figure out why it's not working...

waveform80 · 2019-05-25T22:25:45Z

Sure - I must admit I haven't tried this at all under Windows (I don't have a Windows development environment at the moment, but I would like to include the platform if it's not too much trouble). Any obvious error messages?

waveform80 · 2019-05-26T20:52:04Z

It doesnt build with Microsofts toolchain which is used by Anaconda on windows. Can you add another filter to setup.py so it doesn't try? I will open another issue for it.

Don't worry about a new issue - I had a shot at setting up a Windows environment for Python extension building this weekend and I think I've got a fix that allows everything to build and work. The issue is mostly the use of the "endian" routines which aren't standard (but they're easy to replace with byte-swapping routines given all the platforms Windows runs on are little-endian).

I should add I was absolutely dreading setting up a Windows environment for this - last time I tried it was a major pain, but it looks like things have improved significantly both on the Python and the VS side of things (what with Py3.5 and the "universal" CRT), so this time a straight-forward install of VS2019 Community Edition seemed to include everything I needed (including the Python and Git bits!).

I've just got to tidy up some bits and make sure the test suite still runs properly in both Linux & Windows then I'll push the commits to this branch (should only be a couple more).

Only build under CPython 3.5+ under Windows as that's when the universal CRT support was added (and we really don't want to deal with Windows builds prior to that).

Most of these were the VS compiler being overly picky about types (or just rubbish at static analysis), but it did make me notice a subtle and nasty little bug in handling datetimes with negative TZ offsets so worthwhile overall!

Okay, I confess I'm utterly stumped on this one. Took ages to find as despite my best efforts I've no idea how to interactively debug stuff under VS (after installing the lib via pip), so this was back to prints and a certain amount of guess-work. Anyway... Turns out the "s#" format (for string+length) in Py_BuildValue (or its derivatives like PyObject_CallMethod, etc.) doesn't work on Windows. It just results in a MemoryError whenever it's used! However, the simple "s" format (for NULL-terminated string) works fine. It's not exactly a huge deal to have it calculate the length of "big" so this is a worthwhile trade-off to get stuff working on Windows. At this point all the tests pass in both the C and Python variants.

waveform80 · 2019-05-27T23:31:56Z

Right, there's a few more commits now that get things working on Windows too. Nothing too major, but that last commit took me ages to figure out! :)

Sekenre · 2019-05-31T15:31:57Z

everything's working beautifully on Linux, Windows is another story. 🤷‍♂

I must have my python dev environment setup all wrong on Windows, it's taken me ages to set it up again from scratch and I'm still getting the same errors.

running build_ext
building '_cbor2' extension
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD "-IC:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\cppwinrt" /Tcsource/module.c /Fobuild\temp.win-amd64-3.6\Release\source/module.obj
module.c
c:\users\sekenre\projects\cbor2\source\encoder.h(17): error C2032: '__timezone': function cannot be member of struct '<anonymous-tag>'

Any idea if I'm just doing something wrong? I've tried both Anaconda and the Python install through Visual Studio.

waveform80 · 2019-05-31T16:10:59Z

Any idea if I'm just doing something wrong? I've tried both Anaconda and the Python install through Visual Studio.

My setup was pretty vanilla, but I'm not sure if it's the "right way" of doing things on Windows (I used to code on Windows quite extensively back in my Delphi days, but back then XP was shiny and new, and most of my Windows knowledge has bit-rotted!). Anyway, here's the process I followed:

Fired up a new VM on Azure, with desktop Windows 10
Installed VS2019 Community Edition with the "Python Environment" option, and checked the "Python native code" option too (to pull in the full SDK - reported installation size was a little over 10Gb)
Started VS2019 and selected "Clone or checkout code"
Selected the "GitHub" option under "Browse a repository"
Logged into GitHub
Selected and cloned my cbor2 repo (waveform80/cbor2)
In the main VS interface now, selected Branches under Team Explorer and switched to the "c_module" branch

Here's where I'm not sure what I'm doing is the "right thing" on Windows, simply because it's what I normally do on Linux (ish):

Selected Tools / Python / Python Environments to bring up the Python Environments sidebar
With Python 3.7 (64-bit) selected, clicked "Open in Powershell"
Switched to the cloned repo ("cd \Users\dave\source\repos\waveform80\cbor2")
Installed an editable version of library with the "test" option ("pip install -v -e .[test]")
Ran the test suite ("py.test -v")

I'm guessing that's not the right way of doing things because I couldn't debug interactively within VS to figure out the issue I mentioned above. Still, I figured it didn't matter that much because it seemed to build happily, so I assumed Windows compat was done. I'll have a dig into the error you saw a bit later and see if I can reproduce it.

waveform80 · 2019-05-31T16:13:13Z

Ah! Just had a quick google - Python issue 34657 looks relevant. At a guess I just need to rename the "timezone" entry in the CBOREncoderObject struct. I'll give that a whirl later.

Sekenre · 2019-05-31T16:36:19Z

I followed the same steps for VS on a laptop instead of a vm, and then running python setup.py build_ext instead of pip install -v -e .[test]

Your googling was much more successful than mine!

waveform80 · 2019-05-31T19:12:59Z

Okay, hopefully I've got that patched. I've renamed timezone to tz in the CBOREncoderObject struct (but left the attribute named "timezone" for compatibility - that should be fine as the attribute name doesn't get declared in a C struct). I haven't tested this on Windows yet as my setup has Python 3.7 and it looks like the issue (34657 is a duplicate of 24643) was fixed in 3.7. I could try and setup a down-level version in another VM, but I figured it might be quicker if you tested it with your already-setup laptop :)

agronholm · 2019-06-01T06:51:19Z

This sounds like it warrants the use of Azure Pipelines to check for Windows compilation issues.

waveform80 · 2019-06-01T17:19:31Z

This sounds like it warrants the use of Azure Pipelines to check for Windows compilation issues.

Interesting - I'll have a look at this (I was wondering whether it was time to dig into appveyor as well which I think supports Windows too).

agronholm · 2019-06-02T16:20:56Z

Appveyor supports only Windows. Travis currently offers support for Linux, macOS and Windows but their Python support for the latter two is nonexistent. It's particularly awful on macOS. Azure Pipelines has none of these issues and its design makes working across all three operating systems much less of a hassle.

waveform80 · 2019-06-02T18:25:30Z

Appveyor supports only Windows. Travis currently offers support for Linux, macOS and Windows but their Python support for the latter two is nonexistent. It's particularly awful on macOS. Azure Pipelines has none of these issues and its design makes working across all three operating systems much less of a hassle.

Actually Appveyor claims to support Linux too now: . However, having said that I spent a few hours playing with Azure Pipelines and Appveyor yesterday and couldn't get a thing working on the latter! Even their own recipe for tox based setups didn't work (on either Linux or Windows) with issues mostly around permission to install stuff (like tox) or upgrade stuff (like setuptools, necessary for recent versions of tox).

By contrast, Azure Pipelines was indeed ridiculously simple: had it up and testing within a few minutes. My one concern with it, is that it demands a fairly scary level of permissions on any repository it's associated with (ability to create commits, branches, etc. etc.) in stark contrast to Appveyor and Travis which both have fairly minimal access requirements. I understand why Pipelines does this - it's trying to provide a very "friendly" interface in which it can set up your repository by creating and committing the necessary configuration - but I must admit I prefer Travis and Appveyor's more minimal approach in this regard.

Anyway, it seems Pipelines is the way to go. There's a couple of branches in my cloned repo that you may want to have a look at:

The PIpelines configuration: https://github.com/waveform80/cbor2/blob/azure-pipelines/azure-pipelines.yml

And my failed Appveyor experiments (in case there's anything blindingly obviously stupid in there): https://github.com/waveform80/cbor2/blob/appveyor/.appveyor.yml

Both of these branch off the c_module branch, and here's their respective test portals (hopefully these are public so you should be able to view them):

Azure Pipelines: https://dev.azure.com/waveform88/cbor2

Appveyor: https://ci.appveyor.com/project/DaveJones/cbor2

Happy to take any suggestions for edits to the config; once we've got something reasonably comprehensive across Linux + Windows (+macOS?) I can merge it into the c_module branch and then presumably someone can set up PIpelines to work with this repo off that config.

Seems like Python.h didn't always include stdint.h historically

Apparently Mac OS has machine/endian.h ... but it doesn't define any of the standard routines. I've no idea what it actually *does* define but who cares - the consensus according to Google seems to be that everyone uses routines from the libkern/OSByteOrder.h header.

waveform80 · 2019-06-02T22:30:12Z

Got Mac OS tests working on Azure Pipelines (I've pretty much given up on Appveyor) and it picked up an issue (surprise surprise) with the endian header again. That's fixed along with a minor issue on 3.5 under Windows (which slightly bizarrely only occurred with one version of VS). Anyway, everything's now passing with CPython 2.7, 3.5, 3.6, 3.7, PyPy, and PyPy3 under Linux, Windows, and Mac OS X (CPython 3.4 is only tested on Linux).

Sekenre · 2019-06-03T13:44:27Z

It works! That was a huge quantity of work, thanks so much!

waveform80 · 2019-06-03T22:02:18Z

Wow! Thanks very much for reviewing it!

Do you want me to push the azure pipelines config as another PR? I didn't get around to merging it into c_module before it was merged, but it's only a single extra YAML file (azure-pipelines.yml) in the root of the repo.

waveform80 added 30 commits May 17, 2019 11:15

Store read/write methods instead of fp object itself

b5410bc

Don't concatenate when writing bytes and str

6f5fbb3

Pre-allocate list and dict structures for better performance

82cf5b0

Use a faster algorithm for encoding decimals

c6cb076

Migrate functions to methods

74d3dfc

Also remove shareable_index argument

Use ValueError as the basis for CBOR errors

214359c

Simplify shareable_encoder

706e8b7

Move the bulk of the encoding logic into the encoding class alongside the other encoding methods

Merge cboar into cbor2 as _cbor2 module

173cc15

Make the package import both variants

938b5d6

Favouring the C variant where it successfully imports. This commit also handles generating the encoding dictionaries for the C variant from those defined for the Python variant (this is much simpler than doing this in C).

Tidying up

9e526b2

Just doc-string and comment changes, nothing behavioural. These changes are mostly to make some future changes a bit easier (particularly the use of context managers for exception testing which makes the tested call syntactically similar to other normal calls).

Parity work: add missing method-defs, rename some methods

1ba43ba

Various methods in the _cbor2 implementations had slightly different names or were missing from the public API; fix these so _cbor2 matches the cbor2 public API precisely

Parity work: _cbor2.CBORSimpleValue test equal to ints

4fb375f

The C implementation of CBORSimpleValue had missed the __eq__ implementation from the Python version. Both were also missing a __ne__ method which led to some odd test results with inequalities.

Parity work: miscellaneous

14e5dbf

Minor changes to bring absolute parity between the implementations: ensuring exception messages are exactly equal, that tests equivalently in both (and don't rely on poking internal implementation details which is harder with the _cbor2 implementation).

Bug: fix CBORSimpleValue accepting out of range

ccdf743

The C variant of the module accepted out of ranges values like 256 without complaining (forgot that unsigned values aren't checked for overflow by ParseTuple)

Bug: fix premature EOF error message

c21493f

Bug: fix canonical mapping encoding

d3b3724

Nice to see the test suite for the Python module picking up bugs in the C implementation!

Parity work: add FrozenDict to the C implementation

e29149f

This isn't a translation of FrozenDict to C, rather this simply imports the Python FrozenDict implementation into the _cbor2 module's __dict__

Make test-suite work with both implementations

373870c

Use some py.test fixture magic to provide an "impl" parameter representing the C or the pure Python modules. Some further py.test magic is used to ensure the C variant is only tested on supported platforms and skipped otherwise.

Parity work: decode shareable for anything

44cf673

The C variant permits just about any type of value to be shared within the stream (though the encoder won't produce such shared values). This commit updates the Python variant to be similarly generous toward shared values.

Parity work: Close agronholm#44, close agronholm#45

85e22dd

The C implementation already correctly handles timestamp microseconds; ensure the Python variant does too

Parity work: tweak handling of bad deferred types

6c887da

The C variant raises CBOREncodeError in the case that a deferred encoder has an invalid type (not a 2-tuple, etc.); this commit ensures the Python side does the same

Parity work: CBORTag

f45aa27

This commit brings the Python CBORTag implementation in line with the C implementation. The repr of the class now handles recursive constructs correctly, and instances are now orderable as if they were simple 2-tuples

Parity work: break_marker and undefined

a341473

This commit brings the Python implementations of break_marker and undefined in line with the C variants. They are now true singletons, with the same reprs and boolean results.

Bug: fix several segfaults

5b9f34f

In expanding coverage of the C variant, several segfaults were uncovered when calling load(s) and dump(s) with either no args (erroneously) or keyword-args only.

waveform80 added 3 commits May 23, 2019 21:28

Parity work: immutable property

3100e95

The immutable property was missing from the C variant, and a test was added for coverage

Bump coverage to appease coveralls

5a5e202

Fix test for compare of CBORTag against unknown objects

406b9d1

Doh! Should've run tox before that last commit...

Don't build on Windows

07f8d97

At least until we can figure out why it's not working...

waveform80 added 3 commits May 28, 2019 00:19

Enable building under Windows

52aa754

Only build under CPython 3.5+ under Windows as that's when the universal CRT support was added (and we really don't want to deal with Windows builds prior to that).

Fix windows build warnings

da1bb4c

Most of these were the VS compiler being overly picky about types (or just rubbish at static analysis), but it did make me notice a subtle and nasty little bug in handling datetimes with negative TZ offsets so worthwhile overall!

Workaround Python issue #24643

82aa531

waveform80 added 2 commits June 2, 2019 23:07

Include stdint.h wherever required

1a94e2d

Seems like Python.h didn't always include stdint.h historically

Sekenre merged commit 58c9ec3 into agronholm:master Jun 3, 2019

waveform80 deleted the c_module branch June 5, 2019 10:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combine cboar and cbor2, close #47 #51

Combine cboar and cbor2, close #47 #51

waveform80 commented May 22, 2019

Changaco commented May 25, 2019

Sekenre commented May 25, 2019

Sekenre commented May 25, 2019

waveform80 commented May 25, 2019

waveform80 commented May 26, 2019

waveform80 commented May 27, 2019

Sekenre commented May 31, 2019

waveform80 commented May 31, 2019

waveform80 commented May 31, 2019

Sekenre commented May 31, 2019

waveform80 commented May 31, 2019

agronholm commented Jun 1, 2019

waveform80 commented Jun 1, 2019

agronholm commented Jun 2, 2019

waveform80 commented Jun 2, 2019

waveform80 commented Jun 2, 2019

Sekenre commented Jun 3, 2019

waveform80 commented Jun 3, 2019

Combine cboar and cbor2, close #47 #51

Combine cboar and cbor2, close #47 #51

Conversation

waveform80 commented May 22, 2019

Changaco commented May 25, 2019

Sekenre commented May 25, 2019

Sekenre commented May 25, 2019

waveform80 commented May 25, 2019

waveform80 commented May 26, 2019

waveform80 commented May 27, 2019

Sekenre commented May 31, 2019

waveform80 commented May 31, 2019

waveform80 commented May 31, 2019

Sekenre commented May 31, 2019

waveform80 commented May 31, 2019

agronholm commented Jun 1, 2019

waveform80 commented Jun 1, 2019

agronholm commented Jun 2, 2019

waveform80 commented Jun 2, 2019

waveform80 commented Jun 2, 2019

Sekenre commented Jun 3, 2019

waveform80 commented Jun 3, 2019