header challenge: header with multivalued columns raises "java.lang.ArithmeticException: / by zero" during df.select(any_column).show() or df.select(any_column).take() #69
Comments
Many thanks @jacobic for reporting the issue and providing a detailed explanation! I will have a closer look by the end of the week and try to provide a fix quickly, though I suspect it will require some dev work.
Hi @jacobic,

A bit more on this - the error you see is actually not related to the multivalued columns. This error is thrown when the row size (read from the header) is larger than the `recordlength` option (1 KB by default). Hence, if you do:

```python
# Use 5 KB for recordlength
df = spark.read.format("fits")\
    .option("hdu", 1)\
    .option("recordlength", 5*1024)\
    .load("path/to/photoObj-001000-1-0027.fits")
df.show(2)
```

it does not crash anymore. But that does not mean it works correctly, as it takes only the first value in each of the multivalued columns. I am working on a fix for this.
Thanks for the insight, Julien, I appreciate you working on this!
I have a few quick questions. By "one line", do you mean a single row? If so, is the large recordlength needed because of a large number of columns, or because of the data size (in bytes) of each element in a column? I assume it would not be related to the number of columns, since things are lazily evaluated. In short, how do I find the optimal recordlength for an arbitrary FITS file in order to avoid this exception?
Cheers,
Jacob
Hi @jacobic
Yes, by "one line" I mean a single row, whose size is given by the number of columns and the type of the objects in each column.
When you call `.take()` or `.show()`, each record of `recordLength` bytes is split into rows and decoded:

```scala
// in FitsRecordReader.scala
// Convert each row
// 1 task: 32 MB @ 2s
val tmp = Seq.newBuilder[Row]
for (i <- 0 to recordLength / rowSizeLong.toInt - 1) {
  tmp += Row.fromSeq(fits.getRow(
    recordValueBytes.slice(
      rowSizeInt*i, rowSizeInt*(i+1))))
}
recordValue = tmp.result
```

One could be more clever and decode only specific columns (Parquet does it, for example), and this is something that will be added in the future.
The current default of 1 KB was found purely empirically, through a series of benchmarks and profiling. For the moment, the user needs to manually make sure that the `recordlength` option is higher than the row size (which can be computed from the header).
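To make that concrete, here is a small sketch (my own helper names, not part of the spark-fits API) that estimates a binary-table row size from its TFORMn header values and rounds a target `recordlength` up to a whole number of rows, which is what the decoding loop quoted above assumes. The byte widths per FITS type code follow the FITS standard (`E` = float32, `J` = int32, etc.).

```python
# Hedged sketch (hypothetical helpers, not spark-fits code): estimate the
# row size of a FITS binary table from its TFORMn values, then pick a
# recordlength that is a whole multiple of the row size.
import re

# Bytes per FITS binary-table type code (from the FITS standard).
TYPE_BYTES = {"L": 1, "B": 1, "I": 2, "J": 4, "K": 8,
              "A": 1, "E": 4, "D": 8, "C": 8, "M": 16}

def field_bytes(tform: str) -> int:
    """Bytes occupied by one field, e.g. '5E' -> 20, 'J' -> 4."""
    m = re.match(r"(\d*)([A-Z])", tform.strip())
    repeat = int(m.group(1)) if m.group(1) else 1
    return repeat * TYPE_BYTES[m.group(2)]

def pick_recordlength(tforms, minimum=1024):
    """Smallest multiple of the row size that is >= `minimum` bytes."""
    row = sum(field_bytes(t) for t in tforms)
    nrows = max(1, -(-minimum // row))  # ceiling division
    return row * nrows

# Example: a scalar int, a 5-element float vector ('5E', like MODELMAG),
# a scalar float and an 8-char string: 4 + 20 + 4 + 8 = 36 bytes per row.
print(pick_recordlength(["J", "5E", "E", "8A"]))  # -> 1044 (29 rows of 36 bytes)
```

The `minimum` of 1024 mirrors the current 1 KB default; rounding up to a multiple of the row size avoids wasting the tail of each record when the loop slices it into rows.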
Issue 69: Add support for vector elements
Fixed in #70!
Hi guys, just wanted to give some more feedback about my favourite Spark package!
I encountered an error reading a FITS file with an "exotic" header. I assume the issue is due to the columns which contain data arrays. I would expect spark-fits to load multivalued columns as vectors, but I think they might be causing bigger problems, as I cannot view any columns at all.
For example when I read the data:
The following error is thrown when calling:
The header is shown in example.txt and the file itself is zipped as photoObj-001000-1-0027.fits.zip
Before the error the schema is inferred:
Despite this, the multivalued columns (e.g. code 5E with shape 5, such as 'MODELMAG') are treated as floats. I would expect them to be treated as vectors. Is this possible?
The error itself occurs after selecting any column (even a regular, non-multivalued one) and then applying the .take(n) or .show(n) method:
Please let me know if you require any additional information or have any questions,
Cheers,
Jacob