Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several changes to the read() and write() methods #36

Merged
merged 5 commits into from
May 27, 2014
Merged

Conversation

mgeier
Copy link
Contributor

@mgeier mgeier commented May 5, 2014

See discussion in #34.

Replace ffi.new() with np.empty() and np.ascontiguousarray()

Remove dicts for readers/writers

read():
* reserve only as much memory as needed (if 'frames' is too large)
* check return value of sf_readf_*()

write():
* avoid copy if data has already the correct memory layout
* check return value of sf_writef_*()
* don't return the number of written frames!
  This is for symmetry with read() and it's redundant information anyway
See also #16

If less frames are left in the file than would fit into out and
fill_value=None, a smaller view into out is returned that contains only
the valid frames.

This includes several suggestions by @bastibe from #37.
@mgeier
Copy link
Contributor Author

mgeier commented May 17, 2014

I just re-wrote some commits, including 202fe2b, which has some interesting discussions.
I included some of the suggestions from #37.

@bastibe
Copy link
Owner

bastibe commented May 17, 2014

Very cool! I like the way this is going! I think we are finally getting close to a merge.

_read_or_write is a great idea! However, how about splitting it into two methods:

  • one that fetches the sndfile function given the data type and "_sf_readf_" or "_sf_writef_" (or even just some read or write identifier?) but does not call that function.

  • one that validates the array dimensions and size. This could also include something like

    if data.ndim not in (1, 2):
        raise ValueError("data must be one- or two-dimensional")
    elif data.ndim == 1 and self.channels != 1:
        raise ValueError("data must have 2 dimensions for non-mono signals")
    elif data.ndim == 2 and data.shape[1] != self.channels:
        raise ValueError("two-dimensional data must have %i columns" % self.channels)

    as in 042d425.

I think this would clear up the structure of the code a bit. What do you think about this?

Also, we should make sure to include tests before we merge (probably mostly cherry-picked from #37).

@mgeier
Copy link
Contributor Author

mgeier commented May 17, 2014

What would be the advantage of fetching the function and not calling it?
Wouldn't that just add more code in total and more duplicated code?

We could of course replace 'sf_readf_' with 'read' and funcname + ffi_type with 'sf_' + funcname + 'f_' + ffi_type, but would this really improve anything?

Probably I'm missing your point ...?

I'm not quite sure about your position regarding defensiveness. Your example with very detailed error reporting seems to contradict you statement there: https://github.com/bastibe/PySoundFile/pull/34#issuecomment-42779149.

I think that I actually used too much error checking initially and I'm trying to generally reduce the amount of error checking.
In the current case, however, I think we have to check the array size. We're passing the array to a C function afterwards, so if it doesn't have the right size, all sorts of bad things will happen.

What about something in-between: combine all checks and only raise a single exception?

if (data.ndim not in (1, 2) or
        data.ndim == 1 and self.channels != 1 or
        data.ndim == 2 and data.shape[1] != self.channels):
    raise ValueError("Invalid shape: %s" % repr(data.shape))

Also, we should make sure to include tests before we merge (probably mostly cherry-picked from #37).

I think it makes more sense if we first merge #36 and from there you create a new PR with the changes to the tests in #37.

The tests themselves have a few issues which I would prefer to work on in a separate PR authored by you.

* add comment for documentation
* add check for array shape
* add an assertion
@mgeier
Copy link
Contributor Author

mgeier commented May 17, 2014

I added the shape check, now all tests from #37 pass except the one test for the write() function which isn't surprising because it isn't implemented yet.

@bastibe
Copy link
Owner

bastibe commented May 19, 2014

What would be the advantage of fetching the function and not calling it?

Just a small measure of decoupling. I prefer methods that can be combined with other methods over methods that wrap other methods. But that's maybe not that important. The important thing is to separate reading/writing from error-checking.

We could of course replace 'sf_readf_' with 'read' and funcname + ffi_type with 'sf_' + funcname + 'f_' + ffi_type, but would this really improve anything?

I was thinking of {'read': 'sf_readf_', 'write': 'sf_writef_'}, just to make it somewhat easier to read. Not very important though.

I'm not quite sure about your position regarding defensiveness. Your example with very detailed error reporting seems to contradict you statement there: #34 (comment).

I think that I actually used too much error checking initially and I'm trying to generally reduce the amount of error checking.
In the current case, however, I think we have to check the array size. We're passing the array to a C function afterwards, so if it doesn't have the right size, all sorts of bad things will happen.

Right. Let's check as little as practical, but make sure we get the size right. We don't want to segfault the interpreter.

I think it makes more sense if we first merge #36 and from there you create a new PR with the changes to the tests in #37.

I disagree, but that's fine. Let's get this PR ready to merge as soon as possible. I am eagerly awaiting these features, and so are my colleagues.

@mgeier
Copy link
Contributor Author

mgeier commented May 22, 2014

What would be the advantage of fetching the function and not calling it?

Just a small measure of decoupling. I prefer methods that can be combined with other methods over methods that wrap other methods.

OK, I like functions that can be combined in different ways.
But I also like functions that wrap something complicated.

Why should we split it if the functions are only used internally and only exactly two times in exactly the same combination?

But that's maybe not that important. The important thing is to separate reading/writing from error-checking.

I don't get it.
I would understand if we would be able to use either separately, but as I see it, the two parts will always be done together.

I think as soon as we see that either part can be used somewhere else we should separate it into two functions, but not earlier.

YAGNI comes to mind.

If we were talking about public functions, this would probably be different, I'd often violate YAGNI, but we're talking about private functions here ...

Or am I missing something?

We could of course replace 'sf_readf_' with 'read' and funcname + ffi_type with 'sf_' + funcname + 'f_' + ffi_type, but would this really improve anything?

I was thinking of {'read': 'sf_readf_', 'write': 'sf_writef_'}, just to make it somewhat easier to read. Not very important though.

I know that everything can be solved by using another level of indirection.
But there is nothing here that needs solving.

And again, this isn't part of the public API, so there is no need to complicate things just to have marginally nicer string literals.

I think 'sf_readf_' even is more clear if someone happens to know the libsndfile API.
And if not, I guess they still can figure out that it has something to do with reading.

I think it makes more sense if we first merge #36 and from there you create a new PR with the changes to the tests in #37.

I disagree, but that's fine. Let's get this PR ready to merge as soon as possible. I am eagerly awaiting these features, and so are my colleagues.

OK, great.
What's keeping us?

@bastibe
Copy link
Owner

bastibe commented May 25, 2014

I would understand if we would be able to use either separately, but as I see it, the two parts will always be done together.

Because a function should do one thing, and one thing only. This is maybe at the core of most of our arguments. I strongly believe that whenever you would have to put an and in the function description, that function should be split in two. By any other logic, all code should be lumped into one giant function with a myriad of different arguments that branch out into different functionalities (Matlab style).

This is one of the lessons I learned from unit testing. Whenever functionality can be split in independent functions, one should do so, as it makes code easier to reason about.

_read_or_write currently has the following atomic code blocks:

  • Dimensionality checks (and raise errors).
  • Type checks (and raise errors).
  • Size checks (and raise errors).
  • fetch the correct libsndfile function and call that functions with the appropriate numpy memory buffer (mutates memory).

This can be factored into two functions:

  • one error-checking functions, which either runs through or throws an error. This function conceptually only has two exit points: An exception, or no exception. This puts a clear point of "beyond this, everything is guaranteed to work" in the code, which makes the following code much easier to reason about. Also, this function has definitely no side effects beyond raising an exception.
  • one sndfile-calling function, which always works (the gods of C permitting). This is then a clean abstraction over the C interface and the memory mutation this entails.

At the heart of this is the notion of splitting exception-raising code paths from memory-mutating code paths, so that they can be reasoned about independently. To make this even more useful, they can be tested independently.

Code that does many things in an intertwined manner can be called complected, as coined by Rich Hickey, a hero of mine, in Simple Made Easy.

Please split that function, then we're ready to merge. If you are still not convinced that this is the right way to go, do it for me.

@mgeier
Copy link
Contributor Author

mgeier commented May 26, 2014

I would understand if we would be able to use either separately, but as I see it, the two parts will always be done together.

Because a function should do one thing, and one thing only. This is maybe at the core of most of our arguments. I strongly believe that whenever you would have to put an and in the function description, that function should be split in two.

That sounds like a reasonable guideline.

By any other logic, all code should be lumped into one giant function with a myriad of different arguments that branch out into different functionalities (Matlab style).

That would be one extreme.
The other extreme would be to write a function for each line, which would also be quite silly.

I think we should aim for something in between.

This is one of the lessons I learned from unit testing. Whenever functionality can be split in independent functions, one should do so, as it makes code easier to reason about.

That sounds resonable, too.
But again, I wouldn't go into the extreme.
It's more or less always possible to split a function into parts.
But the resulting parts should still be worth being called a "function".

_read_or_write currently has the following atomic code blocks:

  • Dimensionality checks (and raise errors).
  • Type checks (and raise errors).
  • Size checks (and raise errors).
  • fetch the correct libsndfile function and call that functions with the appropriate numpy memory buffer (mutates memory).

That sounds to me like a single action with some preparation specific to this single action.

This can be factored into two functions:

[...]

At the heart of this is the notion of splitting exception-raising code paths from memory-mutating code paths, so that they can be reasoned about independently. To make this even more useful, they can be tested independently.

Yes they could be tested independently. But what does that improve?
When we add the number of tests for the first part and the number of tests for the second part, we get exactly the same number of tests as before.

Code that does many things in an intertwined manner can be called complected, as coined by Rich Hickey, a hero of mine, in Simple Made Easy.

Thanks for the link. I watched it and I really liked it!

But the code in question is not at all complex. It's totally linear.

I think we would complect the code if we would add another function for no reason.

Please split that function, then we're ready to merge. If you are still not convinced that this is the right way to go, do it for me.

I'm indeed not convinced yet.
But just for you, I tried to split it in cff33b0.
I think it just doesn't look right.
Especially the _ffi_types lookup is a pain, because it is checked in the first function and then done again in the second one.
I could also return the CFFI type from the first function and then pass it to the second function, but this would seem even stranger.

But most probably I didn't split it the right way ...

BTW: the "WIP" in the commit message means "Work In Progress", i.e. I don't want to merge this as is, I rather want to do some further work on it and then rewrite some commits.

@bastibe
Copy link
Owner

bastibe commented May 26, 2014

I indeed very much prefer it this way. In my first attempt to do this, I put the assertions into the check function as part of the "error checking" path. But seeing your solution, I prefer it your way.

Yes they could be tested independently. But what does that improve?

It improves me being able to reason about them independently from one another. Having two functions instead of one means that I can modify one without the other, and rest assured that my changes only alter the workings of one of them. Being able to test one without the other means that test failures can show me the origin of my error in more detail.

Also, the code now contains two function invocations, one saying "this checks the dimensions of the array", the second saying "this puts new data into the array". This is a form of codified documentation. When I read the code for the read method, I can clearly see that we do indeed check the array sizes, without having to go into read_or_write to verify that.

I think that these benefits make splitting the function well worth it.

Especially the _ffi_types lookup is a pain

Call me crazy, but I prefer the simple if branch to the try block. I also like how the lookup is guaranteed to work in the second function. But I agree that there is some degree of unnecessary duplication involved in this. Still, I prefer it this way.

The other extreme would be to write a function for each line, which would also be quite silly. [...]
It's more or less always possible to split a function into parts.

There are usually some parts which can not reasonably be split into different functions without breaking the control flow. The question is how to cut these atomic blocks into separate functions. My rule of thumb is to put them in as few functions as can be unambiguously named. As long as a function name can aptly describe its contents, it probably does not contain too much code. In this particular case, the read_and_write function contained too much error-checking for my taste. The name did not make it clear to me that this function would be throwing exceptions. I think that the split clarified this nicely.

I think it just doesn't look right.

In my eyes, it beats the alternative of lumping both error checking and reading/writing in one function. It took me quite a while to appreciate short functions. When I first learned programming, my teachers were scientists, and their code was atrocious. Monster functions everywhere! Matlab code, where function invocations were expensive things! But the more I grew, the more I learned to appreciate short, clearly named functions.

These days, I try to keep function complexity to an absolute minimum. This usually means splitting them up in many smaller functions. Short functions are like short sentences. They are easy to understand. They are easy to reason about. They are easy to disassemble and reuse. If you want to do a refactor of some giant function, it is never clear which parts can be taken out of context and still work, and which parts are intrinsic to the function and its data structures. In a function that only delegates to a bunch of other small functions, this is always obvious. Thus, refactoring becomes simple.

However, I have to admit that I didn't do a stellar job at this when I first wrote PySoundFile. At the time, it was a quick-and-dirty hack for playing with the CFFI. I probably should have taken the time back then to factor my code better. In the end, it took your invaluable feedback to make me realize this and hopefully inform future decisions about new code I write. Thank you for that!

Code that does many things in an intertwined manner can be called complected, as coined by Rich Hickey, a hero of mine, in Simple Made Easy.

Thanks for the link. I watched it and I really liked it!

Be sure to check out Hammock Driven Development as well. Those two talks changed the way I think about programming.

@mgeier
Copy link
Contributor Author

mgeier commented May 27, 2014

I indeed very much prefer it this way.

Fair enough. I changed the commit to have a proper commit message: ab33e66

Is this all now? Are we able to merge?

I still don't like the changes, but I'm OK with merging anyway.

In my first attempt to do this, I put the assertions into the check function as part of the "error checking" path. But seeing your solution, I prefer it your way.

Indeed, I'm quite sure on this one.
Assertions are very different from error checking.
If a block of code has some pre-conditions, they should be asserted immediately above (and the post-conditions immediately below).
It would be a mistake to put the assertions in a different function.

Thanks for your detailed response.
All you're saying sounds very interesting and resonable.
I still think your "rules" don't apply to error checking.
But it's hard to find information about this in the Python context, because function arguments are hardly checked at all.

Do you happen to know some sources of information about that?

I think in our concrete case it doesn't really matter, but in a larger codebase it would probably be dangerous to separate the error checking from the actual function, because one could simply forget to call the error-checking function.

In a larger context, I think it's also dangerous insofar as something could move in between error checking and the actual function and change the state which was carefully checked before.

But these are just thoughts, I don't have practical (bad) experience with that.

Especially the _ffi_types lookup is a pain

Call me crazy, but I prefer the simple if branch to the try block.

It's un-Pythonic and, in a more general context, it's un-safe because the dict could change between the check and the actual access.

If you don't like the try block, it would be preferable to use

ffi_type = _ffi_types.get(array.dtype)
if ffi_type is None:
    raise ValueError("Some Error")

I also like how the lookup is guaranteed to work in the second function.

It's only guaranteed if it's also guaranteed that _check_array() is always called immediately before _read_or_write().
That's a pretty weak guarantee.

But I agree that there is some degree of unnecessary duplication involved in this.

Indeed.

Still, I prefer it this way.

OK. As you wish.

If you want to do a refactor of some giant function, it is never clear which parts can be taken out of context and still work, and which parts are intrinsic to the function and its data structures.

Exactly.
I think in our case the error checking is intrinsic to the function and it cannot be taken out of context.
Having two separate functions suggests that they're both useful on their own, which they aren't (in our concrete case).

Be sure to check out Hammock Driven Development as well.

Thanks for the tip, I will.

@bastibe
Copy link
Owner

bastibe commented May 27, 2014

It would be a mistake to put the assertions in a different function.

I really haven't ever used assertions before. But seeing how you used them and reading your comments made me realize their purpose. Very useful indeed!

Do you happen to know some sources of information about [moving error checks into a function]?

Sadly, I don't. As an anecdote, one could maybe look at command line applications or web APIs, which always sanity-check/parse their arguments, and typically do so in some special function. I see our example in a similar light, since the point of the check-function is to sanity-check the array. I would not classify this as argument checking really, but that's just a feeling.

Are we able to merge?

I think so! I'll run the code through the tests one more time, then I'll merge! Let's have a virtual beer for finally accomplishing this!

bastibe added a commit that referenced this pull request May 27, 2014
Several changes to the read() and write() methods
@bastibe bastibe merged commit b5bce08 into master May 27, 2014
@mgeier
Copy link
Contributor Author

mgeier commented May 27, 2014

Great, cheers!

@mgeier mgeier deleted the read-write branch May 27, 2014 19:09
@mgeier
Copy link
Contributor Author

mgeier commented May 29, 2014

Although the issue is closed, I'd like to discuss a bit more about separating error checking from the rest of a function.
I'm still not convinced that it was a good idea to rip the error checking out of _read_or_write() into a separate function.

I have the feeling that this point will come up again quite soon and it would be good to have some kind of guideline.

I had a look at all functions/methods in PySoundFile (current master 7b0d0cb) regarding error checking.

There are 3 methods just for error checking, which are nicely re-used: _handle_error(), _handle_error_number(), _check_if_closed(). That's good.

And there are 2 groups of functions which violate the "rule" that error checking should be separated:

  1. Functions/methods with intertwined error checking: SoundFile.__init__(), SoundFile.__getattr__(), SoundFile._get_slice_bounds(), SoundFile.__setitem__(), SoundFile.seek(), _format_int()
  2. Functions/methods with error checking which could be easily factored out but the specific error message might get lost (and become more generic): SoundFile.__setattr__(), SoundFile.__getitem__(), SoundFile.__setitem__(), SoundFile.read(), SoundFile.write(), _format_int()

I think those are perfectly fine, no need to extract any more error checking.

Would it be really a good idea to create separate functions for the error checking of each of the above-mentioned functions?

Very few of these error-checking functions could be re-used, so we would have to add probably around 8 new functions/methods!

@bastibe
Copy link
Owner

bastibe commented Jun 6, 2014

I am coming around on this. I am still thinking about it though.

In general, I think we should split up functions as much as possible. We should probably even split some of the functions we already have in more, smaller, separable, ortogonal pieces (even if those pieces are only used once). In the case of splitting error checking from data usage, I might have been on the wrong track though.

I'll think some more about it.

mgeier added a commit that referenced this pull request Apr 15, 2015
mgeier added a commit that referenced this pull request May 16, 2015
mgeier added a commit that referenced this pull request May 18, 2015
mgeier added a commit that referenced this pull request May 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants