WIP: ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns#2750
farnoy wants to merge 5 commits into apache:master
constexpr int64_t kDoubleMax = 1LL << std::numeric_limits<double>::digits;
constexpr int64_t kDoubleMin = -(1LL << std::numeric_limits<double>::digits);
if (options.Safe && (value > kDoubleMax || value < kDoubleMin)) {
  // TODO: error out, but how?
This is not the right place to return this error, right? I wasn't sure where to do it, but I think the conditions are what we want.
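To illustrate why the bounds above are 2^53, here is a minimal Python sketch of the same range check. It relies on the fact that a Python float is an IEEE 754 double with 53 significand bits. The helper names `passes_range_check` and `roundtrips_exactly` are hypothetical and are not part of Arrow or of this patch.

```python
import sys

# Significand width of a Python float (an IEEE 754 double): 53 bits.
DOUBLE_DIGITS = sys.float_info.mant_dig
K_DOUBLE_MAX = 1 << DOUBLE_DIGITS     # mirrors kDoubleMax, i.e. 2**53
K_DOUBLE_MIN = -(1 << DOUBLE_DIGITS)  # mirrors kDoubleMin


def passes_range_check(value: int) -> bool:
    """Hypothetical helper: True if value lies within [-2**53, 2**53]."""
    return K_DOUBLE_MIN <= value <= K_DOUBLE_MAX


def roundtrips_exactly(value: int) -> bool:
    """True if int -> float -> int returns the original integer."""
    return int(float(value)) == value


# 2**53 + 1 is the value used in the test script in this thread.
for v in (2**53, 2**53 + 1):
    print(v, passes_range_check(v), roundtrips_exactly(v))
```

Every integer inside the checked range round-trips through a double without loss, while 2**53 + 1 is the first positive integer that does not, so it is a convenient probe value for testing.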
  # ARROW-1090: work around CMake rough edges
  if 'ARROW_HOME' in os.environ and sys.platform != 'win32':
-     pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib',
+     pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib64',
This is obviously temporary. I couldn't get it to build a self-contained .whl without this option. Is there something I'm missing with respect to the build system?
You'll have to pass -DCMAKE_INSTALL_LIBDIR=lib when building the C++ libraries
cc @wesm I was testing with a custom python script like below. This acts properly if you use it with
I think NULL/NaN values are handled out of band, somewhere else? The
import pyarrow
import pandas as pd
df = pd.DataFrame({'a': [None, 1, 2, 3, None, 9007199254740993]})
print(df.dtypes)
schema = pyarrow.schema([pyarrow.field(name='a', type=pyarrow.int64(), nullable=True)])
table = pyarrow.Table.from_pandas(df, schema=schema, preserve_index=False, safe=False)
print(table)
print(table.to_pandas())
print("null_count: ", table.columns[0].null_count)
print(table.to_pandas().a.iloc[-1])
And output:
@wesm ping 😄
Well, there are no tests. Can you let me know when you have a completed patch including tests?
const in_type* in_data = GetValues<in_type>(input, 1);
auto out_data = GetMutableValues<out_type>(output, 1);

if (options.allow_float_truncate) {
Should we have a specialized fast path for options.allow_float_truncate && options.allow_float_overflow?
Hey @wesm, I have implemented the approach I described in JIRA and it has tests now. I haven't exposed these fine-grained safety settings to Python yet, but I can do it in this PR. WDYT about this approach?
const in_type* in_data = GetValues<in_type>(input, 1);
auto out_data = GetMutableValues<out_type>(output, 1);

if (options.allow_float_truncate) {
    ctx->SetStatus(Status::Invalid("Floating point value truncated"));
  }
  if (!options.allow_float_overflow &&
      ARROW_PREDICT_FALSE(out_value >= kMax || out_value <= kMin)) {
Hm, I'm not sure this is totally legit since the conversion to out_value was lossy. Maybe better to compare *in_data with the maximum / minimum representable floating point integer?
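One possible reading of this suggestion, sketched in Python rather than the C++ of the patch: validate the floating point input itself, before any cast, against int64 bounds that are exactly representable as doubles. The function `conversion_error`, its overflow and non-finite messages, and the choice of int64 bounds are assumptions made for illustration. Only the "Floating point value truncated" message and the two option names come from the diff.

```python
import math

# -2**63 and 2**63 are both exactly representable as doubles, whereas
# INT64_MAX (2**63 - 1) is not: it rounds up to 2**63 when converted.
INT64_LOWER = float(-(2**63))  # inclusive lower bound
INT64_UPPER = float(2**63)     # exclusive upper bound


def conversion_error(value, *, allow_float_truncate=False,
                     allow_float_overflow=False):
    """Hypothetical check: return an error message, or None if value is OK.

    The checks inspect the float input, not the result of a lossy cast.
    """
    if not math.isfinite(value):
        # Assumption: NaN/inf are rejected here; the thread suggests nulls
        # are handled elsewhere in the real conversion path.
        return "Non-finite floating point value"
    if not allow_float_overflow and not (INT64_LOWER <= value < INT64_UPPER):
        return "Floating point value out of int64 range"
    if not allow_float_truncate and value != math.trunc(value):
        return "Floating point value truncated"
    return None


print(conversion_error(3.0))
print(conversion_error(3.5))
print(conversion_error(float(2**63)))
```

Using a half-open interval avoids comparing against a rounded copy of the int64 maximum, which is the kind of lossy comparison the comment above is worried about.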
Do you have any time to work on this for the 0.13 release? Sorry I wasn't able to spend time on this yet.
I don't think I can pick this up in the near future :(
I'm closing this stale PR. Hopefully someone can pick it up in the future |
This is very much WIP, see my comments below