Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perhaps add something about not supporting numpy int/float to the docs? #18

Closed
endrebak opened this issue Oct 7, 2016 · 13 comments
Closed

Comments

@endrebak
Copy link

endrebak commented Oct 7, 2016

I converted a pandas df to lists of numpy.int64 (starts, ends) and numpy.float64 (values) with list(df.Starts) etc. This gave strange segfaults. Perhaps you should warn about this in the docs that you need to convert these to proper Python values first with int(numpy_int) and float(numpy_float)? I had no idea what I was doing wrong...

Thanks for the library by the way.

Ps. should it work in parallel? Some c-libs like bx-python do not.

@dpryan79
Copy link
Collaborator

dpryan79 commented Oct 7, 2016

Good suggestion, I'll add that in.

Regarding parallelism, it depends on how you try to go about things. You can't pickle bigWigFile objects, so can't pass one to a function and have that work (the same is true for AlignmentFile objects from pysam, python's multithreading support just leaves much to be desired). What we do to get around that is to have each thread open the bigWig file(s). That works well for deepTools. Writing is innately single-threaded.

@davek44
Copy link

davek44 commented Oct 18, 2016

I've now been tripped up by this multiple times, so I'll resound the request. However, I'm also curious why you can't work with numpy floats. I'd prefer to use the float16 to trade off precision for less memory. Thanks a lot for putting this code online!

@dpryan79
Copy link
Collaborator

@davek44: I certainly could work with numpy input, it's just a matter of having another dependency.

@dpryan79
Copy link
Collaborator

@endrebak @davek44 If either of you have a chance, give the numpy branch a try. I've started adding support for numpy arrays when creating bigWig files. I'll try to add a method to output a numpy array in the values() method too, since that'd be faster and more memory efficient. Note that this is only possible if you have numpy installed before you install pyBigWig, since pyBigWig is written in C and needs to link against the numpy .so file.

@endrebak
Copy link
Author

Most people use conda, so installation order should very seldom be a
problem.

Super btw. Will report back when I get around to it.

On Mon, Oct 31, 2016 at 8:57 AM, Devon Ryan notifications@github.com
wrote:

@endrebak https://github.com/endrebak @davek44
https://github.com/davek44 If either of you have a chance, give the
numpy branch a try. I've started adding support for numpy arrays when
creating bigWig files. I'll try to add a method to output a numpy array in
the values() method too, since that'd be faster and more memory
efficient. Note that this is only possible if you have numpy installed
before you install pyBigWig, since pyBigWig is written in C and needs to
link against the numpy .so file.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dpryan79/pyBigWig/issues/18#issuecomment-257233002,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AQ9I0utTkLDVZy-5XVXSovcRSfsYYFaqks5q5Z9XgaJpZM4KQuAw
.

@dpryan79
Copy link
Collaborator

Yup, I'll modify the bioconda recipe accordingly if this works out OK.

@dpryan79
Copy link
Collaborator

dpryan79 commented Nov 3, 2016

The numpy branch now has some automated tests. Further, the values() function can return a numpy array. This still needs a fair but of testing, since it's likely I've missed some things.

@endrebak
Copy link
Author

endrebak commented Nov 4, 2016

How to test dpryans new PR:

git clone -b numpy git@github.com:dpryan79/pyBigWig.git
cd pyBigWig
python setup.py install # I guess this should overwrite the already installed pyBigWig?

Code you can copy into ipython afterwards:

iimport pyBigWig

import pandas as pd

header = [("chr1", 500)]


bw = pyBigWig.open("test.bw", "w")

bw.addHeader(header)

c = ["chr1"] * 3

s = pd.Series([1, 5, 7])

e = s + 1

v = pd.Series([5, 0, -5])

bw.addEntries(c, list(s), ends=list(e), values=list(v)) # error happens here

bw.close()

For me, running the above code lead to the following error:

# RuntimeError: You must provide a valid set of entries. These can be comprised of any of the following:
# 1. A list of each of chromosomes, start positions, end positions and values.
# 2. A list of each of start positions and values. Also, a chromosome and span must be specified.
# 3. A list values, in which case a single chromosome, start position, span and step must be specified.

Edit: come to think of it, this would have segfaulted in the previous version, AFAICR so good job. Might be an error on my part, somewhere.

@dpryan79
Copy link
Collaborator

dpryan79 commented Nov 4, 2016

You can also just pip install --upgrade git+https://github.com/dpryan79/pyBigWig@numpy to install it.

I've only tested directly inputting a numpy array (note that there's a nosetest that uses this), so I'd have to look and see what pandas is actually giving things.

As an aside, this is the downside to writing python extensions in C. They're much faster, but also much less flexible.

@dpryan79
Copy link
Collaborator

dpryan79 commented Nov 4, 2016

@endrebak Values need to be floats in bigWig files. That's why you're getting the error.

@dpryan79
Copy link
Collaborator

dpryan79 commented Nov 4, 2016

For pandas, it looks like you can use the values attribute rather than making a list:

import pyBigWig
import pandas as pd
header = [("chr1", 500)]
bw = pyBigWig.open("test.bw", "w")
bw.addHeader(header)
c = ["chr1"] * 3
s = pd.Series([1, 5, 7])
e = s + 1
v = pd.Series([5, 0, -5], dtype=float)
bw.addEntries(c, s.values, ends=e.values, values=v.values)
bw.close()

As an aside, there was a bug preventing this from working with 64bit floats (what pandas uses) that I just fixed.

@endrebak
Copy link
Author

endrebak commented Nov 8, 2016

With the latest PR (Fix the minimum floating point value check) it worked. Thanks!

Ps. I am trying to recreate some of S4Vectors' functionality in Python. Do you have a good reference for reading about Python/numpy/C interop? If no, then please do not go looking for my sake (obviously :) ).

@endrebak endrebak closed this as completed Nov 8, 2016
@dpryan79
Copy link
Collaborator

dpryan79 commented Nov 8, 2016

Great! I'd like to sit on this a little while and play around with it before I make the next release, since the large increase in code size inevitably created a but or two.

I don't have any good reference for python/numpy/C interop. This was my first C/Python hybrid and was quite a learning curve to make (in comparison, I started and completed py2bit/lib2bit in 2 days, so feel free to "borrow" the boilerplate initialization/finalization stuff from my code). I personally found the python C API documentation to be...not the most helpful thing in the world (especially in regards to reference counting). If you're interested in python/C/numpy interop then think early on about python 2/3 differences and the variety of different numeric types that can get passed in by numpy. The latter is relatively easy to handle but the former I at least personally still find to be confusing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants