New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machida3 for Python3 application code #2354

Merged
merged 25 commits into from Oct 11, 2018

Conversation

Projects
None yet
5 participants
@caj-larsson
Copy link
Contributor

caj-larsson commented Aug 17, 2018

A copy of the machida source tree to machida3 except for the wallaroo python modules.

The Celsius test can be run with machida3, there might be some encoding issues with the result out as
most of the tests use string formatting which ends up calling str() on bytes objects resulting in broken output. In order to fix the integration tests for machida3 this will have to be addressed.

@rachelblasucci

This comment has been minimized.

Copy link
Contributor

rachelblasucci commented Aug 17, 2018

Thanks @caj-larsson!

Hey, @nisanharamati, can you also review please?

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Aug 17, 2018

Ooh exciting! I'll take a look at this later today!

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Aug 18, 2018

Ooh exciting! I'll take a look at this later today!

Sorry, I couldn't get around to this today. I'll take a look at it on Monday, unless someone else beats me to it!
An initial skim of the code looks pretty good, though!

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Aug 20, 2018

@caj-larsson The code changes look good, but there are a few places where integration tests (and manual tests) run into fatal errors due to typing errors.

You can grab the code for the additional tests from 8de4213

and run them as follows:

Examples/python/

cd examples/python
make test

testing/correctness/apps/

cd testing/correctness/apps/alphabet_python/
make test
cd testing/correctness/apps/alphabet_python27/
make test
cd testing/correctness/apps/sequence_window_python/
make test
@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Aug 20, 2018

The tests that pass are:

  • examples/python/alphabet
  • examples/python/celsius
  • examples/python/market_spread
  • examples/python/reverse

The tests that fail are:

  • testing/correctness/apps/alphabet_python27
  • testing/correctness/apps/alphabet_python
  • testing/correctness/apps/sequence_window_python
  • examples/python/alphabet_partitioned
  • examples/python/word_count
  • examples/python/word_count_with_dynamic_keys
@caj-larsson

This comment has been minimized.

Copy link
Contributor

caj-larsson commented Aug 21, 2018

Wow that's great, I'll pull those tests so we can have them run automatically in this PR. I have limited time now for the immediate future however. It will probably be a couple of days before I can take a look at the cause but I have a good idea of what type of issue it is.

@rachelblasucci

This comment has been minimized.

Copy link
Contributor

rachelblasucci commented Aug 21, 2018

@caj-larsson No worries. Take your time, and thanks for the help you've given us so far! :-)

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Oct 1, 2018

Hi @caj-larsson
We're picking this up as mainline work this week, and wanted to let you know. And we also wanted to thank you for the work you've done on this so far! It's been really helpful, and it's really heartening to see early contributions from outside the company!

However, we currently don't have access to make changes to the PR branch. As the PR creator, you can update that following these instructions: https://help.github.com/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork/ .
In the mean time, we'll work on the python3 branch in the main repo. But we'll be happy to move the updates from there back into this PR once we have access.

@caj-larsson

This comment has been minimized.

Copy link
Contributor

caj-larsson commented Oct 1, 2018

Hi @nisanharamati!
I'm sorry I couldn't finish this in time, since we last talked I've started a new job in a new country working in a new stack and I just haven't been able to dedicate any time to open source work.

The "allow edits from maintainers" -box was already checked in my UI. I unchecked and checked it again in hope that if there is some stale state that it's at least updated.

I've tried adding you to my repo if that was not enough however, is it an issue of me needing to rebase my master? I can imagine you can't rewrite the history.

@nisanharamati nisanharamati force-pushed the caj-larsson:master branch from f8958ea to 72ea50c Oct 2, 2018

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Oct 2, 2018

Oh hey, that worked!

I have:

  1. cloned this caj-larsson/wallaroo.git
  2. set up wallaroolabs/wallaroo.git as remote upstream
  3. copied the python work to python3 branch for safe keeping
  4. reset caj-larsson/wallaroo.git:master to the last common commit
  5. fetched upstream, and rebased master
  6. merged the upstream/python3 branch into caj-larsson/wallaroo.git:master so it includes both the original commits from here, and the new ones from the main wallaroo repo's python3 branch.
    We should be able to continue working on this branch here now.

Thanks for granting me permissions on the repo, @caj-larsson!

@caj-larsson

This comment has been minimized.

Copy link
Contributor

caj-larsson commented Oct 3, 2018

Thanks that sounds exactly like what I expected actually!

change some machida.pony python string conversion to use UTF8 instead
of bytes

This is exactly the type of issue I was working on when my move started, there were several places I had switched to bytes incorrectly because it's sometimes hard to tell a C/pony buffer from a printable string.
Mainly these were things like labels that I missed were such when going through the parsing machine.

Personally I think we could split those big switch blocks to something a little more high level, these almost reminds me of my SoC development days haha

@nisanharamati nisanharamati force-pushed the caj-larsson:master branch from be7084d to cf2fc8e Oct 3, 2018

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Oct 3, 2018

I did another rebase on upstream to take in improvements to CI error log and core dump capture. Hopefully this gives us core dumps from CI (in addition to the ones we can get from running the tests locally)

@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Oct 4, 2018

Last commit: all python examples pass in python3, regardless of whether they use bytes or unicode in their user code.
Next up: adding python3 resilience and topology tests

let py_string_size = @PyUnicode_GetLength(ps)

let ret = String.copy_cpointer(py_string_p, py_string_size)
let ret = String.copy_cstring(@py_bytes_or_unicode_as_char(ps))

This comment has been minimized.

@aturley

aturley Oct 4, 2018

Contributor

I think we need to stick with String.copy_cpointer here. Even though PyBytes_AsString returns a null-terminated string, there can be other nulls before the null terminator. copy_cstring will copy to the first null, which means we could lose the rest of the value.

I see this pattern used in a lot of places in the code and I think we should fix it in all of them.

@nisanharamati let me know if I'm missing something here.

This comment has been minimized.

@nisanharamati

nisanharamati Oct 4, 2018

Contributor

That makes sense. I saw copy_cstring was used everywhere else besides this place, so I figured it was sensible to use it here too. If it's the reverse that's sensible, that's reasonable too!


if not byte_string.is_null() then
let arr = recover val
// create a temporary Array[U8] wrapper for the C array, then clone it
Array[U8].from_cpointer(@PyBytes_AsString(byte_buffer),
Array[U8].from_cpointer(@py_bytes_or_unicode_as_char(byte_buffer),

This comment has been minimized.

@aturley

aturley Oct 4, 2018

Contributor

This implies that the user should be able to use bytes or unicode values as the output of an encoder. I think we should restrict encoders to returning bytes, not strings. Is the intention to allow unicode strings or bytes here? If so, why?

Unless there's a good reason for allowing unicode strings AND bytes in a lot of places, I think we should break py_bytes_or_unicode_as_char into two functions, py_bytes_as_char and py_unicode_as_char, and then use those functions in the appropriate places where we expect either bytes or unicode. Having this catch-all function makes me nervous and allowing bytes and unicode strings in so many places seems unnecessarily permissive and potentially error-prone.

This comment has been minimized.

@nisanharamati

nisanharamati Oct 4, 2018

Contributor

while the user should return bytes from that function, I think we should catch and handle that case in the python api side, and produce a useful error there.
In machida itself, I think segfaulting because we got the wrong kind of string is worse than simply sending it out as bytes, regardless of what it is.
We could put a special check for it being bytes here, but again, I think it's easier to produce useful error messages (and pointing at the offending code) from the python api side.

What do you think?

This comment has been minimized.

@aturley

aturley Oct 4, 2018

Contributor

I agree, we shouldn't segfault. What I'm saying is that I think if we're already enforcing type constraints on the Python side then we should be able to know that we could call py_bytes_as_char or py_unicode_as_char instead of having to rely on a single function that handles both cases.

This comment has been minimized.

@caj-larsson

caj-larsson Oct 5, 2018

Contributor

In principle I agree that we should be strict with this, something I experienced working on these parts was that the error messaging form Machida was a lot less than what's necessary for a good developer experience. As I'm not aware of any spec for what the application developer can expect, maybe a handful of user stories would be enough guidance for what this needs to be.

This comment has been minimized.

@aturley

aturley Oct 5, 2018

Contributor

The guidance here is that anything that would be expected to be a string (such as names) should be a string, anything that is a key or a data blob (for example: the return value of a partition function, a serialized representation of an object, or the output of a tcp encoder) should be bytes. @nisanharamati @caj-larsson does that sound like a reasonable description of the expected behavior?

This comment has been minimized.

@nisanharamati

nisanharamati Oct 5, 2018

Contributor

I disagree.
I think that in most places, machida should tolerate any text type that can be reliably compared as bytes (e.g. bytestrings or unicode). Most of these values are used for either of 2 things:

  1. uniquely identify a thing
  2. print some text to stdout

Both bytestrings and unicode work fine here, so I don't understand the benefit in restricting users to only one or the other. We can easily distinguish between the two types on the pony/c side, and adding this requirement makes user code less friendly because of both the potential errors, and the potential for user values to be either type (for example if you read them from a file or a buffer with a 3rd party parser that may return either type).

I also think there's some performance impact from requiring code in the hot path to also execute encode() on a string (e.g. partition function), but I don't know if it's significant enough to be of concern:

$ python3 -m timeit -s "x = 'hello'*1000" "x.encode()"
10000000 loops, best of 3: 0.149 usec per loop
$ python3 -m timeit -s "x = 'hello'*10000" "x.encode()"
1000000 loops, best of 3: 1.22 usec per loop

Maybe for some cases.

The places where I agree that machida should only accept byestrings are:

  • decoder input
  • encoder output
  • serialization output
  • deserialization input

For the serialization/deserialization case I think the main limitation arises because we don't know at deserialization time what the original serialization output type was, so we can choose whether to create a unicode or a bytestring python object as the argument for the user's deserialization method.
I think that's a tractable problem, but maybe it's outside the scope of this PR?

This comment has been minimized.

@caj-larsson

caj-larsson Oct 6, 2018

Contributor

As a Python programmer I like @nisanharamati expect the system to have some leniency between these types but as a C programmer I see them as very different things and that the interface should be strict. Is this an issue best solved in the python library?

caj-larsson and others added some commits Aug 14, 2018

Platform independent python3 includes and linking
Uses python3-config in the Makefile for flag generation and the
default Python.h header in the C file.

The pony linking is solved as suggested by @dipin at:
    https://gist.github.com/dipinhora/e49e146845a1a3af5e6af37729e1959f/revisions
Handle Python3 C API differences.
This has been automatically done with two replacements.

    PyInt_ -> PyLong_ and PyString_ -> PyBytes

Both should be safe, the first one is just a consolidation of
integral types in the C-API we must adapt for. The second one is
replacing all Python2 strings with bytes(), this is defintiely
the correct thing to do for the binary states that are passed
between Pony and Python. But the strings that configure the
computation graph should potentially be unicode instead. I think
this is good enough for the current ambition level.
Change PyBytes_ to PyUnicode for human readable strings
In a previous commit the PyString_ calls were replaced with PyBytes_
indiscriminately. This fixes a lot of issues related to this, it is
not wholly tested and there will probably be more issues related to
PyObjects not being the right python type.
Added support for unicode encoded data
This is mostly a bandaid for running python code intended for
python2 there the string literals mean bytes() but the python3
interpreters sees it as unicode().
Some fixes. Now it segfaults.
- Python3 syntax error in word_count.py
- Switch first py3 test logging level to debug
- refactor the machida3 Makefile
- change some machida.pony python string conversion to use UTF8 instead
of bytes
- Fix function signature in step builder in machida
Add machida3 to wallaroo-up.sh
- takes care of machida3 build in most environments
- TOOD: add machida3 build instructions to build-from-source
Fix segfault in partition function
- Use of pony's String.copy_cpointer with wrong size was causing a
  segfault:  PyUnicode_AsUTF8 + PyBytes_Size
- Fixed by using PyUnicode_GetLength instead
- Add some extra print statements to python word_count app for debugging
  (to be removed in final merge)
- Use python3-dbg for extra gdb debuggability (to be removed in final
  merge)

nisanharamati added some commits Oct 3, 2018

Delegate PyObject-to-Cstring to function
- delegate conversion of PyObjects that are bytes or unicode to a single
  function in python-wallaroo.c
- add a function for getting size from a PyObject that's either bytes or
  unicode
- Remove partition_function_u64 since it's no longer supported
- revert examples' test Makefiles to default (log-level error, and run
  tests for both python2 and python3)
Fix sink_encoder_encode c function
- Encoder must return a bytes, so we don't need to perform type checking
  on its result (and if we do wish to do so, it should probably be done
  in the python api side).
  - reverted function to original format as in python2 version
- This unshadowed the error that was causing
  testing/correctness/apps/alphabet_python to fail
  - There is an integer division in that application that was using `/`
    which becomes floating point division in python3. Since `//` is
    integer division in both py2 and py3, that is also fixed now.

@nisanharamati nisanharamati force-pushed the caj-larsson:master branch from eb8e8ce to a0b5f8e Oct 4, 2018

nisanharamati added some commits Oct 4, 2018

Always return bytestring in partition function
- Previously, this function returned a byte, which _was_ a string in py2
  but is a number (a u8) in py3.
- Fix is to change `word[0]` to `word[0:1]` to ensure it is always a
  bytestring (e.g. an array of bytes), regardless of python version.

@nisanharamati nisanharamati force-pushed the caj-larsson:master branch from 0923344 to 2a7b1f6 Oct 4, 2018

Fix sequence_window_python test
- struct.pack requires bytestring, not unicode, so encode the string
  first
- check python's `isatty` status to determine whether we need to replace
  its stdout/stderr objects with ones that _do_ flush on each write

@aturley aturley force-pushed the caj-larsson:master branch from 8bda663 to 340a8c5 Oct 9, 2018

Fix example tests that hang due to long builds
It looks like the script that tests the examples in the documentation
is failing because it doesn't get any terminal output from wallaroo-up
for 10 minutes. Dipin thought this might be happening because the
build takes a long time. This commit adds some echo statements which
should hopefully reset the timer and cause the tests to pass.

@aturley aturley force-pushed the caj-larsson:master branch from 340a8c5 to 63355a4 Oct 9, 2018

nisanharamati added some commits Oct 9, 2018

Add machida3 test to restart_without_resilience
Note that this test does not currently run. It is disabled due to an
error in the Makefile, which will be fixed outside of this PR.
let ret = String.copy_cstring(@py_bytes_or_unicode_as_char(ps))
let py_string_size = @py_bytes_or_unicode_size(ps)

let ret = String.copy_cpointer(@py_bytes_or_unicode_as_char(ps), py_string_size)

This comment has been minimized.

@nisanharamati

nisanharamati Oct 11, 2018

Contributor

Should this use the Machida.py_bytes_or_unicode_to_pony_string primitive?

This comment has been minimized.

@aturley

aturley Oct 11, 2018

Contributor

Yeah, I can fix this one.

@@ -772,14 +775,22 @@ primitive Machida
let ps = @get_name(o)
recover
if not ps.is_null() then
let ret = String.copy_cstring(@py_bytes_or_unicode_as_char(ps))
let ps_size = @py_bytes_or_unicode_size(ps)
let ret = String.copy_cpointer(@py_bytes_or_unicode_as_char(ps), ps_size)

This comment has been minimized.

@nisanharamati

nisanharamati Oct 11, 2018

Contributor

Could this use the py_bytes_or_unicode_to_pony_string primitive method instead?

This comment has been minimized.

@aturley

aturley Oct 11, 2018

Contributor

Yes, I think this can be replaced.

@nisanharamati
Copy link
Contributor

nisanharamati left a comment

I have one nit question around a couple of remaining use cases of @py_bytes... instead of Machida.py_bytes..., but I don't have a strong opinion about whether they need to be changed.
Otherwise, this looks good to merge to me.

@aturley aturley force-pushed the caj-larsson:master branch from eb33143 to 2123409 Oct 11, 2018

@nisanharamati nisanharamati changed the base branch from master to python3 Oct 11, 2018

@nisanharamati nisanharamati merged commit ef47826 into WallarooLabs:python3 Oct 11, 2018

2 of 7 checks passed

ci/circleci: build CircleCI is running your tests
Details
ci/circleci: integration-tests CircleCI is running your tests
Details
ci/circleci: integration-tests-with-resilience CircleCI is running your tests
Details
ci/circleci: unit-tests CircleCI is running your tests
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
ci/circleci: verify-changelog Your tests passed on CircleCI!
Details
license/cla Contributor License Agreement is signed.
Details
@nisanharamati

This comment has been minimized.

Copy link
Contributor

nisanharamati commented Oct 11, 2018

Woohoo this is basically at functional parity with python2 (on all the tests, at least!)!
The next steps will happen in the wallaroo repo:

  • we're going to add documentation
  • we're going to add some more tests (to both python2 and python3)

And then we'll be able to do an alpha release for public preview and feedback.

Thanks for everything you've done on this, @caj-larsson! It was really helpful to have something to start from.

@caj-larsson

This comment has been minimized.

Copy link
Contributor

caj-larsson commented Oct 11, 2018

That's fantastic! I'm sorry I couldn't help more at the final stages but I realize that the closer to done things got the more assumptions I had to start making.

I'll look at the python3 project to see if there is anything else I can help with

@SeanTAllen

This comment has been minimized.

Copy link
Member

SeanTAllen commented Oct 12, 2018

Thanks @caj-larsson!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment