Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine cboar and cbor2, close #47 #51

Merged
merged 43 commits into from
Jun 3, 2019
Merged

Conversation

waveform80
Copy link
Contributor

Here we go! Turned out to be a few more than 15 commits... ahem...

As mentioned in the associated ticket (#47) the overall diff is ridiculously huge. I've done a few rebases now to break everything down into (hopefully) manageable / reviewable chunks. A rough overview is as follows:

  • The first 4 commits deal with trivial changes to the encoder / decoder.
  • The "migrate functions to methods" commit is the first "big" commit which moves most stuff into the CBOREncoder and CBORDecoder classes; although it looks complex it's mostly fairly trivial (moving the location of the function and replacing the first arg with "self") but there are some slightly more complex changes in there.
  • The next "big" commit is the one merging cboar into the package. This was a straight copy of cboar with a couple of regex-based search'n'replaces of CBOAR with CBOR2 etc.
  • From here on there's three different types of commits...
  • Those labelled "parity work" are commits aimed at bringing the Python and C implementations closer together, i.e. renaming attributes, ensuring exceptions have the same type and message for tests, etc.
  • Those labelled "bug" are commits fixing bugs I found while merging the projects - mostly things I'd missed in the cboar tests which the cbor2 tests picked up and on rare occasions the reverse.
  • The rest are parts of the merge work itself; making the test suite work across implementations, boosting coverage, including various utility scripts for testing speed, trying to catch ref-leaks, etc.
  • Finally, the one thing I haven't looked at in detail yet is the documentation, mostly because I'm not sure how they're currently built (the sphinx Makefile doesn't seem to be there?). That said, they will need adjusting as there's some API changes here (e.g. the change in the tag_hook prototype to remove shareable_index, etc).

Most of the commits have a bit more commentary in their commit messages, hopefully enough for some context. If you want anything split up further (or combined, re-ordered, removed, etc.) do let me know. I'm happy rebasing and editing stuff for clarity.

Good luck!

Also remove shareable_index argument
Move the bulk of the encoding logic into the encoding class alongside
the other encoding methods
The C-module is written against the 3.3+ API. We could skip testing this
and just declare it's 3.4+ (it'd certainly be easier given all the hoops
one has to jump through to get tox working on 3.3) but I like to support
as many versions as I can where possible.

Still - do feel free to reject this commit - I can always strip it out
easily enough.
Favouring the C variant where it successfully imports. This commit also
handles generating the encoding dictionaries for the C variant from
those defined for the Python variant (this is much simpler than doing
this in C).
Just doc-string and comment changes, nothing behavioural. These changes
are mostly to make some future changes a bit easier (particularly the
use of context managers for exception testing which makes the tested
call syntactically similar to other normal calls).
Various methods in the _cbor2 implementations had slightly different
names or were missing from the public API; fix these so _cbor2 matches
the cbor2 public API precisely
Previously, the C version was actively importing deferred types. The
Python algorithm was better, only checking if the specified module was
*already* imported (which, if we're encoding the specified type it must
be or the specified type couldn't exist yet).

This commit also removes the ugly reference counting in the deferred
logic, using the standard "borrowed references on entry, new references
on return" idiom throughout instead.
The C implementation of CBORSimpleValue had missed the __eq__
implementation from the Python version. Both were also missing a __ne__
method which led to some odd test results with inequalities.
This is a modification of the Python version to make it more like the C
version:

We don't call tell() in the C version in the event of an exception and
arguably we shouldn't anyway as there's no guarantee that all file-like
objects implement (e.g. sockets and pipes). This might make debugging a
little harder for users but they can always call it themselves if they
need it, and this makes the decoder applicable to literally any readable
file-like object.
Minor changes to bring absolute parity between the implementations:
ensuring exception messages are exactly equal, that tests equivalently
in both (and don't rely on poking internal implementation details which
is harder with the _cbor2 implementation).
The C variant of the module accepted out of ranges values like 256
without complaining (forgot that unsigned values aren't checked for
overflow by ParseTuple)
Nice to see the test suite for the Python module picking up bugs in the
C implementation!
This isn't a translation of FrozenDict to C, rather this simply imports
the Python FrozenDict implementation into the _cbor2 module's __dict__
Use some py.test fixture magic to provide an "impl" parameter
representing the C or the pure Python modules. Some further py.test
magic is used to ensure the C variant is only tested on supported
platforms and skipped otherwise.
The C variant supports semantic tags 260 and 261 for encoding IP
addresses and networks; this commit adds a pure Python implementation
for python 3.3 and above (the built-in ipaddress module was added in
3.3) along with the tests for these.
The C variant permits just about any type of value to be shared within
the stream (though the encoder won't produce such shared values). This
commit updates the Python variant to be similarly generous toward shared
values.
The C implementation already correctly handles timestamp microseconds;
ensure the Python variant does too
The C variant has a "str_errors" parameter, permitting configuration of
the error-handling mode used when decoding UTF-8 strings. This commit
adds this attribute to the Python variant, and also updates the decoding
of indefinite-length byte and unicode strings to match the algorithm in
the C variant.

Specifically, this disallows split-unicode characters in indefinite
length unicode strings, and also rejects non-string elements in
indefinite length byte and unicode strings.
The C variant (by its nature) imposes more restrictions on certain
attributes like object_hook, tag_hook, default, etc. It ensures the
types assigned are valid (callable or None). This commit ensures the
Python variant does the same. It also fixes some of the attribute names
on the C side (which didn't match the Python side).
The C variant raises CBOREncodeError in the case that a deferred encoder
has an invalid type (not a 2-tuple, etc.); this commit ensures the
Python side does the same
This commit brings the Python CBORTag implementation in line with the C
implementation. The repr of the class now handles recursive constructs
correctly, and instances are now orderable as if they were simple
2-tuples
This commit brings the Python implementations of break_marker and
undefined in line with the C variants. They are now true singletons,
with the same reprs and boolean results.
In expanding coverage of the C variant, several segfaults were uncovered
when calling load(s) and dump(s) with either no args (erroneously) or
keyword-args only.
The immutable property was missing from the C variant, and a test was
added for coverage
Doh! Should've run tox before that last commit...
@Changaco
Copy link
Contributor

Looks good to me.

@Sekenre
Copy link
Collaborator

Sekenre commented May 25, 2019

This reads well, thanks for structuring the commits so clearly. I'm going to run some tests and merge hopefully today and tomorrow.

@Sekenre
Copy link
Collaborator

Sekenre commented May 25, 2019

It doesnt build with Microsofts toolchain which is used by Anaconda on windows. Can you add another filter to setup.py so it doesn't try? I will open another issue for it.

At least until we can figure out why it's not working...
@waveform80
Copy link
Contributor Author

Sure - I must admit I haven't tried this at all under Windows (I don't have a Windows development environment at the moment, but I would like to include the platform if it's not too much trouble). Any obvious error messages?

@waveform80
Copy link
Contributor Author

It doesnt build with Microsofts toolchain which is used by Anaconda on windows. Can you add another filter to setup.py so it doesn't try? I will open another issue for it.

Don't worry about a new issue - I had a shot at setting up a Windows environment for Python extension building this weekend and I think I've got a fix that allows everything to build and work. The issue is mostly the use of the "endian" routines which aren't standard (but they're easy to replace with byte-swapping routines given all the platforms Windows runs on are little-endian).

I should add I was absolutely dreading setting up a Windows environment for this - last time I tried it was a major pain, but it looks like things have improved significantly both on the Python and the VS side of things (what with Py3.5 and the "universal" CRT), so this time a straight-forward install of VS2019 Community Edition seemed to include everything I needed (including the Python and Git bits!).

I've just got to tidy up some bits and make sure the test suite still runs properly in both Linux & Windows then I'll push the commits to this branch (should only be a couple more).

Only build under CPython 3.5+ under Windows as that's when the universal
CRT support was added (and we really don't want to deal with Windows
builds prior to that).
Most of these were the VS compiler being overly picky about types (or
just rubbish at static analysis), but it did make me notice a subtle and
nasty little bug in handling datetimes with negative TZ offsets so
worthwhile overall!
Okay, I confess I'm utterly stumped on this one. Took ages to find as
despite my best efforts I've no idea how to interactively debug stuff
under VS (after installing the lib via pip), so this was back to prints
and a certain amount of guess-work. Anyway...

Turns out the "s#" format (for string+length) in Py_BuildValue (or its
derivatives like PyObject_CallMethod, etc.) doesn't work on Windows. It
just results in a MemoryError whenever it's used! However, the simple
"s" format (for NULL-terminated string) works fine. It's not exactly a
huge deal to have it calculate the length of "big" so this is a
worthwhile trade-off to get stuff working on Windows. At this point all
the tests pass in both the C and Python variants.
@waveform80
Copy link
Contributor Author

Right, there's a few more commits now that get things working on Windows too. Nothing too major, but that last commit took me ages to figure out! :)

@Sekenre
Copy link
Collaborator

Sekenre commented May 31, 2019

everything's working beautifully on Linux, Windows is another story. 🤷‍♂

I must have my python dev environment setup all wrong on Windows, it's taken me ages to set it up again from scratch and I'm still getting the same errors.

running build_ext
building '_cbor2' extension
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD "-IC:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17763.0\cppwinrt" /Tcsource/module.c /Fobuild\temp.win-amd64-3.6\Release\source/module.obj
module.c
c:\users\sekenre\projects\cbor2\source\encoder.h(17): error C2032: '__timezone': function cannot be member of struct '<anonymous-tag>'

Any idea if I'm just doing something wrong? I've tried both Anaconda and the Python install through Visual Studio.

@waveform80
Copy link
Contributor Author

Any idea if I'm just doing something wrong? I've tried both Anaconda and the Python install through Visual Studio.

My setup was pretty vanilla, but I'm not sure if it's the "right way" of doing things on Windows (I used to code on Windows quite extensively back in my Delphi days, but back then XP was shiny and new, and most of my Windows knowledge has bit-rotted!). Anyway, here's the process I followed:

  • Fired up a new VM on Azure, with desktop Windows 10
  • Installed VS2019 Community Edition with the "Python Environment" option, and checked the "Python native code" option too (to pull in the full SDK - reported installation size was a little over 10Gb)
  • Started VS2019 and selected "Clone or checkout code"
  • Selected the "GitHub" option under "Browse a repository"
  • Logged into GitHub
  • Selected and cloned my cbor2 repo (waveform80/cbor2)
  • In the main VS interface now, selected Branches under Team Explorer and switched to the "c_module" branch

Here's where I'm not sure what I'm doing is the "right thing" on Windows, simply because it's what I normally do on Linux (ish):

  • Selected Tools / Python / Python Environments to bring up the Python Environments sidebar
  • With Python 3.7 (64-bit) selected, clicked "Open in Powershell"
  • Switched to the cloned repo ("cd \Users\dave\source\repos\waveform80\cbor2")
  • Installed an editable version of library with the "test" option ("pip install -v -e .[test]")
  • Ran the test suite ("py.test -v")

I'm guessing that's not the right way of doing things because I couldn't debug interactively within VS to figure out the issue I mentioned above. Still, I figured it didn't matter that much because it seemed to build happily, so I assumed Windows compat was done. I'll have a dig into the error you saw a bit later and see if I can reproduce it.

@waveform80
Copy link
Contributor Author

Ah! Just had a quick google - Python issue 34657 looks relevant. At a guess I just need to rename the "timezone" entry in the CBOREncoderObject struct. I'll give that a whirl later.

@Sekenre
Copy link
Collaborator

Sekenre commented May 31, 2019

I followed the same steps for VS on a laptop instead of a vm, and then running python setup.py build_ext instead of pip install -v -e .[test]

Your googling was much more successful than mine!

@waveform80
Copy link
Contributor Author

Okay, hopefully I've got that patched. I've renamed timezone to tz in the CBOREncoderObject struct (but left the attribute named "timezone" for compatibility - that should be fine as the attribute name doesn't get declared in a C struct). I haven't tested this on Windows yet as my setup has Python 3.7 and it looks like the issue (34657 is a duplicate of 24643) was fixed in 3.7. I could try and setup a down-level version in another VM, but I figured it might be quicker if you tested it with your already-setup laptop :)

@agronholm
Copy link
Owner

This sounds like it warrants the use of Azure Pipelines to check for Windows compilation issues.

@waveform80
Copy link
Contributor Author

This sounds like it warrants the use of Azure Pipelines to check for Windows compilation issues.

Interesting - I'll have a look at this (I was wondering whether it was time to dig into appveyor as well which I think supports Windows too).

@agronholm
Copy link
Owner

Appveyor supports only Windows. Travis currently offers support for Linux, macOS and Windows but their Python support for the latter two is nonexistent. It's particularly awful on macOS. Azure Pipelines has none of these issues and its design makes working across all three operating systems much less of a hassle.

@waveform80
Copy link
Contributor Author

Appveyor supports only Windows. Travis currently offers support for Linux, macOS and Windows but their Python support for the latter two is nonexistent. It's particularly awful on macOS. Azure Pipelines has none of these issues and its design makes working across all three operating systems much less of a hassle.

Actually Appveyor claims to support Linux too now: . However, having said that I spent a few hours playing with Azure Pipelines and Appveyor yesterday and couldn't get a thing working on the latter! Even their own recipe for tox based setups didn't work (on either Linux or Windows) with issues mostly around permission to install stuff (like tox) or upgrade stuff (like setuptools, necessary for recent versions of tox).

By contrast, Azure Pipelines was indeed ridiculously simple: had it up and testing within a few minutes. My one concern with it, is that it demands a fairly scary level of permissions on any repository it's associated with (ability to create commits, branches, etc. etc.) in stark contrast to Appveyor and Travis which both have fairly minimal access requirements. I understand why Pipelines does this - it's trying to provide a very "friendly" interface in which it can set up your repository by creating and committing the necessary configuration - but I must admit I prefer Travis and Appveyor's more minimal approach in this regard.

Anyway, it seems Pipelines is the way to go. There's a couple of branches in my cloned repo that you may want to have a look at:

The PIpelines configuration: https://github.com/waveform80/cbor2/blob/azure-pipelines/azure-pipelines.yml

And my failed Appveyor experiments (in case there's anything blindingly obviously stupid in there): https://github.com/waveform80/cbor2/blob/appveyor/.appveyor.yml

Both of these branch off the c_module branch, and here's their respective test portals (hopefully these are public so you should be able to view them):

Azure Pipelines: https://dev.azure.com/waveform88/cbor2

Appveyor: https://ci.appveyor.com/project/DaveJones/cbor2

Happy to take any suggestions for edits to the config; once we've got something reasonably comprehensive across Linux + Windows (+macOS?) I can merge it into the c_module branch and then presumably someone can set up PIpelines to work with this repo off that config.

Seems like Python.h didn't always include stdint.h historically
Apparently Mac OS has machine/endian.h ... but it doesn't define any of
the standard routines. I've no idea what it actually *does* define but
who cares - the consensus according to Google seems to be that everyone
uses routines from the libkern/OSByteOrder.h header.
@waveform80
Copy link
Contributor Author

Got Mac OS tests working on Azure Pipelines (I've pretty much given up on Appveyor) and it picked up an issue (surprise surprise) with the endian header again. That's fixed along with a minor issue on 3.5 under Windows (which slightly bizarrely only occurred with one version of VS). Anyway, everything's now passing with CPython 2.7, 3.5, 3.6, 3.7, PyPy, and PyPy3 under Linux, Windows, and Mac OS X (CPython 3.4 is only tested on Linux).

@Sekenre
Copy link
Collaborator

Sekenre commented Jun 3, 2019

It works! That was a huge quantity of work, thanks so much!

@Sekenre Sekenre merged commit 58c9ec3 into agronholm:master Jun 3, 2019
@waveform80
Copy link
Contributor Author

Wow! Thanks very much for reviewing it!

Do you want me to push the azure pipelines config as another PR? I didn't get around to merging it into c_module before it was merged, but it's only a single extra YAML file (azure-pipelines.yml) in the root of the repo.

@waveform80 waveform80 deleted the c_module branch June 5, 2019 10:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants