Skip to content

WIP: ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns#2750

Closed
farnoy wants to merge 5 commits intoapache:masterfrom
farnoy:arrow-2298-double-to-nullable-int-conv
Closed

WIP: ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns#2750
farnoy wants to merge 5 commits intoapache:masterfrom
farnoy:arrow-2298-double-to-nullable-int-conv

Conversation

@farnoy
Copy link
Copy Markdown

@farnoy farnoy commented Oct 12, 2018

This is very much WIP, see my comments below

constexpr int64_t kDoubleMax = 1LL << std::numeric_limits<double>::digits;
constexpr int64_t kDoubleMin = -(1LL << std::numeric_limits<double>::digits);
if (options.Safe && (value > kDoubleMax || value < kDoubleMin)) {
// TODO: error out, but how?
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right place to return this error, right? I wasn't sure where to do it but I think the conditions are what we want

# ARROW-1090: work around CMake rough edges
if 'ARROW_HOME' in os.environ and sys.platform != 'win32':
pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib',
pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib64',
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is obviously temporary - I couldn't get it to build a self-contained .whl without this option, is there something I'm missing wrt the build system?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to pass -DCMAKE_INSTALL_LIBDIR=lib when building the C++ libraries

@farnoy
Copy link
Copy Markdown
Author

farnoy commented Oct 12, 2018

cc @wesm

I was testing with a custom python script like below. This acts properly if you use it with safe=False, but I haven't yet implemented an error notification for the safe=True case.

9007199254740993 is equal to (2<<53)+1, the first int64 number that a double cannot represent (on the positive side that is).

I think NULL/NaN values are handled out of band, somewhere else? The null_count, as reported by pyarrow.Table looks right, but I didn't do anything to get that working.

import pyarrow
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3, None, 9007199254740993]})

print(df.dtypes)

schema = pyarrow.schema([pyarrow.field(name='a', type=pyarrow.int64(), nullable=True)])

table = pyarrow.Table.from_pandas(df, schema=schema, preserve_index=False, safe=False)

print(table)
print(table.to_pandas())
print("null_count: ", table.columns[0].null_count)
print(table.to_pandas().a.iloc[-1])

And output:

a    float64
dtype: object
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "'
            b'float64", "metadata": null}], "pandas_version": "0.20.3"}'}
              a
0           NaN
1  1.000000e+00
2  2.000000e+00
3  3.000000e+00
4           NaN
5  9.007199e+15
null_count:  2
9007199254740992.0

@kou kou changed the title [ARROW-2298] Add conversion from np.float64 to nullable integer columns ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns Oct 12, 2018
@kou kou changed the title ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns WIP: ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns Oct 12, 2018
@farnoy
Copy link
Copy Markdown
Author

farnoy commented Oct 29, 2018

@wesm ping 😄

@wesm
Copy link
Copy Markdown
Member

wesm commented Oct 29, 2018

Well, there's no tests. Can you let me know when you have a completed patch including tests

const in_type* in_data = GetValues<in_type>(input, 1);
auto out_data = GetMutableValues<out_type>(output, 1);

if (options.allow_float_truncate) {
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a specialized fast path for options.allow_float_truncate && options.allow_float_overflow?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@farnoy
Copy link
Copy Markdown
Author

farnoy commented Nov 8, 2018

Hey @wesm, I have implemented the approach I described in JIRA and it has tests now. I haven't exposed these fine grained safety settings to python yet, but I can do it in this PR.

WDYT about this approach?

# ARROW-1090: work around CMake rough edges
if 'ARROW_HOME' in os.environ and sys.platform != 'win32':
pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib',
pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib64',
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to pass -DCMAKE_INSTALL_LIBDIR=lib when building the C++ libraries

const in_type* in_data = GetValues<in_type>(input, 1);
auto out_data = GetMutableValues<out_type>(output, 1);

if (options.allow_float_truncate) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

ctx->SetStatus(Status::Invalid("Floating point value truncated"));
}
if (!options.allow_float_overflow &&
ARROW_PREDICT_FALSE(out_value >= kMax || out_value <= kMin)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm not sure this is totally legit since the conversion to out_value was lossy. Maybe better to compare *in_data with the maximum / minimum representable floating point integer?

@wesm
Copy link
Copy Markdown
Member

wesm commented Jan 29, 2019

Do you have any time to work on this for the 0.13 release? Sorry I wasn't able to spend time on this yet

@farnoy
Copy link
Copy Markdown
Author

farnoy commented Jan 29, 2019

I don't think I can pick this up in the near future :(

@wesm
Copy link
Copy Markdown
Member

wesm commented May 17, 2019

I'm closing this stale PR. Hopefully someone can pick it up in the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants