getcol() returning inconsistent results with large variable-shaped columns #130
I had seen something similar in the past - did you try with …
Like here #38
Yep, same bug. So yeah, definitely. In my case, I have 5715090 rows and a [474,4] data column... and if I read the whole thing in at once, everything from row 1184533 onwards is null. But if I read it with … @bennahugo take note.
@gervandiepen @tammojan what would be the problem for getcol to actually be a wrapper to getcolnp?
Note the possibly related bug affecting 4-correlation data, which I reported earlier: casacore/casacore#756
That may point to something deeper than python-casacore, since it affects CASA as well (the ms module specifically) |
I'm having a look at this; it would be useful to have a reproducible error. Preferably (but not necessarily) without sending enormous measurement sets over. |
I don't know, good question. getcol is a wrapper around getValueFromTable, getcolnp is a wrapper around getValueFromTableVH (VH is for ValueHolder, which is the C++ side of a python object, in this case a numpy array). The only difference between the functions seems to be who allocates the data. The wrapper @cyriltasse could just allocate a python object. Interested in @gervandiepen's ideas.
I'll take a look at it asap.
Early this week I got a similar error message from Francesco de Gasperin
who read a 53 GB column.
He determined the maximum of the column (with np.max) which was a lower
value than that of a subset. With TaQL's gmax() function he got correct
values, so the data themselves are fine.
Do you know how large your columns are? Maybe there is some kind of a 32
(or 16) GB problem. Francesco said that a numpy array > 32 GB behaved fine.
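The gmax() cross-check Ger describes can be run as a one-line TaQL query. This is a sketch, not a command from the thread - the MS and column names are placeholders, and for a complex-valued column you would take the amplitude first:

```
select gmax(abs(DATA)) from my.ms
```

Since TaQL's aggregate runs inside the table system rather than through the Python array converters, it sidesteps the conversion path under suspicion here.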
My column is 5715090*474*4*8 = 40.36 GB. The first 1184533 rows (8.36 GB) are read in properly, then I get nulls. (Neither number seems particularly round or significant...) Numpy arrays work fine, I can read the whole column with getcolnp().
I've written a test program to create a large column and read it back with
getcol.
In C++ it is all fine; the column's data are as expected when read with
getColumn.
In Python I indeed see 0 at the end of the getcol array, so I assume
something is wrong in the C++-Python conversion. I'm investigating it.
I've found the problem (in the C++-python converters of Casacore) and
issued a PR (issue-pc130, casacore/casacore#868).
In the converters the 64-bit array size was handled as 32-bit. It had the
effect that only the 32-bit modulo of the size was handled, thus for a 40
GB array only about 8 GB was handled.
Once the PR has been merged, a Casacore build and install (of the master)
should do the job.
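Ger's diagnosis lines up exactly with the numbers reported earlier in the thread. A quick arithmetic sketch (pure Python, using the 5715090-row, [474,4]-shaped column from above) shows what truncating the 64-bit element count to 32 bits does:

```python
# Sketch: a 64-bit element count handled as unsigned 32-bit keeps only
# count % 2**32 elements, which reproduces the reported breakpoint.
nrows = 5715090
elems_per_row = 474 * 4                  # the [474,4] cell shape
total_elems = nrows * elems_per_row      # 10,835,810,640 -- exceeds 2**32
kept_elems = total_elems % 2**32         # what a 32-bit size field retains
rows_filled = kept_elems // elems_per_row
print(rows_filled)                       # 1184533 -- the first null row reported
```

The truncated count lands on row 1184533, matching the report at the top of the thread.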
Excellent, thanks!
Fantastic, thanks @gervandiepen. @SpheMakh, this is a serious bugfix affecting most Python-based software in the pipeline. Please build casacore:master from source.
@gervandiepen nice nice! :)
@gervandiepen I guess this has been solved?
I've closed it.
Gah, I've been bitten by this bug again. Someone remind me: when does the fixed release (3.1.1) get into KERN and/or KERN-dev? And can I pip install it in the meantime?
This will land in KERN-dev / KERN-6 on the next KERN release sprint, which has not been scheduled yet.
Please reopen! I see it rearing its ugly head again. I'm running in a venv with python-casacore 3.3.1. Here's an example: …
Same issue with getcolnp, @o-smirnov?
An excellent question, @bennahugo! The plot thickens:
Doesn't look like it reads in the whole array, does it?
OK, then it is different from the previously known bug, and seems to indicate that TaQL selection is broken. Can you confirm these are variable-shaped columns?
Variable-shaped. But why do you think this is a TaQL issue and not another large-array issue? It could be that the influence of the TaQL query above is only in determining whether the resulting table is in the [broken] large-array or [working] small-array regime.
It used to be the case that getcolnp was a workaround, so I'm not sure from
the example if this is related to the previous bug or something explicitly
stemming from the queried data. I will do some checks tomorrow as well.
The column description, from Oleg's quoted message:

```
In [87]: tab0.getcoldesc("CORRECTED_DATA")
Out[87]:
{'valueType': 'complex',
 'dataManagerType': 'TiledShapeStMan',
 'dataManagerGroup': 'TiledCorrected',
 'option': 0,
 'maxlen': 0,
 'comment': 'The data column',
 'ndim': 2,
 '_c_order': True,
 'keywords': {'UNIT': 'Jy'}}
```
@bennahugo the MS is … Reproducing is dead simple: …
@o-smirnov Good news: I cannot reproduce this on another multifield dataset after compiling 3.1.1 from source. Bad news: I cannot reproduce this on another multifield dataset after compiling 3.1.1 from source... I'm copying your data over to my environment now. Either there is a difference in the storage manager (which I don't think is the case - both measurement sets are MeerKAT datasets dumped through ska-sa/katdal), or you are somehow binding to a broken casacore...
Just an update. I can reproduce this on @o-smirnov's dataset with a custom-compiled casacore and bound python-casacore, both at 3.3.1. The target he is attempting to load is a field whose scans are interleaved with many other scans, hundreds of thousands of rows apart. I tried reproducing the same issue on another dataset by loading the first few scans and the last scan of a target field at the native 208 kHz resolution, again separated by hundreds of thousands of rows. There I cannot reproduce the bug. This makes it extremely hard to track, because I can't just give you a command to simulate a large dataset at the temporal-spectral resolution of MeerKAT with a script. I do note that I have seen very strange artifacts popping up when imaging full-resolution MeerKAT datasets with the calibrator scans interleaved, where we use getcolnp and TaQL to select the target rows. When I split out and image only the target, the problem seemingly disappears! I will attempt to track down the bug with gdb and @o-smirnov's data - unfortunately it is 500 GiB in size, so shifting it around is not ideal.
This may be related to cyriltasse/DDFacet#536 - that was a multi-target observation, so this could show up because of the large separation of target scans for a particular field.
I've rewritten some of the observation logic in ratt-ru/simms#58 to simulate a multi-field interleaved dataset, with the scans dispersed round-robin hundreds of thousands of rows apart, to try and get something closer to @o-smirnov's edge case. Alas, it still does not reproduce, so I'm not sure how to construct a reproducible example, as it seemingly affects this dataset and this dataset only - perhaps there was some planetary alignment... We will need to make the dataset available for download somewhere - unfortunately the FTP is too small to host this.
I'll try to extract progressively smaller subsets of this MS to see if I can get a smaller reproducer MS. Thanks for verifying, @bennahugo. It's an extremely worrying bug, but thankfully doesn't seem to occur that easily, eh?
Here's another datapoint for you. If I read the column in chunks like so:

```python
cd1[:] = 9999
rowchunk = 100000
for row0 in range(0, nrows, rowchunk):
    tab1.getcolnp("CORRECTED_DATA", cd1[row0:row0+rowchunk], row0, rowchunk)
print(rowchunk, "unfilled:", np.where(cd1 == 9999))
```

...then the failure mode depends on the chosen chunk size.
I could keep up a binary search to find the exact chunk size at which the bug starts appearing (somewhere between 45000 and 47500), but there's no meaningful power of 2 or anything like that in there, so I don't see the point. (For reference, 32768 rows is 2^29 visibilities, and 65536 rows is 2^30 visibilities.) For now, I will implement a workaround in cubical to read/write in chunks of 2^29 data points or less. Stay tuned.
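The chunking approach above generalizes to a small helper. This is a sketch, not code from the thread: getcolnp_fn stands in for a table's getcolnp method (python-casacore signature: columnname, nparray, startrow, nrow), and the chunk size is whatever the caller has found safe:

```python
import numpy as np

def getcol_chunked(getcolnp_fn, colname, out, rowchunk):
    """Fill the preallocated array `out` a `rowchunk` rows at a time,
    keeping each individual read below the size where the bug bites."""
    nrows = out.shape[0]
    for row0 in range(0, nrows, rowchunk):
        nrow = min(rowchunk, nrows - row0)  # clamp the final partial chunk
        getcolnp_fn(colname, out[row0:row0 + nrow], row0, nrow)
    return out
```

With a real table this would be called as getcol_chunked(tab.getcolnp, "CORRECTED_DATA", cd1, 100000); the slices are numpy views, so getcolnp fills the destination array in place without extra copies.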
Interesting... that means we may be dealing with a signed 32-bit indexer somewhere. Good job. My gdb is now hooking into gdbserver on the server, but it is painfully slow to compile things with symbol paths relative to the host gdb session. I will probably only get to it next year, after my vacation.
Well, 32768 rows is 2^29 visibilities, which is 2^32 bytes, and that works OK... and some time before we get up to 2^30 visibilities and 2^33 bytes, it falls to the floor in a heap... So yes, we're suspiciously close to the dreaded binary powers of 31-32. Yet strangely, the "breaking point" is somewhere in between whole powers of 2, so what gives?
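The byte counts in that comment line up if each row holds 4096 channels x 4 correlations of 8-byte complex visibilities - an assumption about this particular MS (consistent with it later being averaged down to 1024 channels, a factor of 4), but the powers of two then check out:

```python
# Assumed MS geometry: 4096 channels x 4 correlations, complex64 (8 bytes/vis).
chans, corrs, bytes_per_vis = 4096, 4, 8
vis_per_row = chans * corrs                        # 2**14 visibilities per row

assert 32768 * vis_per_row == 2**29                # row count that still works
assert 32768 * vis_per_row * bytes_per_vis == 2**32
assert 65536 * vis_per_row == 2**30                # row count that fails
assert 65536 * vis_per_row * bytes_per_vis == 2**33
print("boundary arithmetic checks out")
```

So the working/failing boundary straddles the 2^32-byte mark, which is why a 32-bit size or index field is the natural suspect.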
@tammojan, @gervandiepen please feel free to chime in. This is an existential bug, after all. :)
OK, "good" news is, I averaged the MS down to 1024 channels (a factor of 4), and I can still reproduce the bug. Interestingly, the "breaking point" (in terms of the number of rows) has shifted; it is not simply 4x larger. The code above now works with rowchunks of 100,000, and breaks with rowchunks of 160,000. But at least the "reproducer" MS is now only 73 GB! So it is eminently practical to ship it to anyone interested. (@bennahugo: you can find it on …)
I'm looking into the problem, but my time is limited. Too often my grandchildren require my attention, which is pretty nice :-)
I'm sure they're a lot more fun than we are. 😀 Can I bribe them with some ice cream to take an interest in the table system?
Too cold for ice cream these days :-)
A very random contribution, but have you tried mucking with the cache size to see if it changes anything, @o-smirnov?
How would I do that exactly? You mean …?
Do not set it to zero, as that means no limit. Setting it to 1 (one byte, I believe) would be an interesting experiment. Actually, maybe unlimited would be interesting if you have enough RAM.
@Athanaseus has been bitten by this as well.
I am rubbing my eyes trying to figure out what I'm missing here, but it seems a genuinely puzzling bug. @gervandiepen, @tammojan, how can this be?
(This is on a pretty vanilla Ubuntu 16.04 server, using the latest packages from KERN-3.)