WIP: ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns by farnoy · Pull Request #2750 · apache/arrow

farnoy · 2018-10-12T16:49:18Z

This is very much WIP, see my comments below

farnoy · 2018-10-12T16:50:02Z

cpp/src/arrow/compute/kernels/cast.cc

+      constexpr int64_t kDoubleMax = 1LL << std::numeric_limits<double>::digits;
+      constexpr int64_t kDoubleMin = -(1LL << std::numeric_limits<double>::digits);
+      if (options.Safe && (value > kDoubleMax || value < kDoubleMin)) {
+        // TODO: error out, but how?


This is not the right place to return this error, right? I wasn't sure where to do it but I think the conditions are what we want
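For reference, a minimal Python sketch of the range condition in the hunk above. It assumes the intent is to accept only magnitudes up to 2**digits (2**53 for IEEE 754 doubles); the name `within_exact_double_range` is hypothetical and not part of Arrow.

```python
import sys

# std::numeric_limits<double>::digits corresponds to sys.float_info.mant_dig,
# which is 53 on IEEE 754 platforms.
K_DOUBLE_MAX = 1 << sys.float_info.mant_dig
K_DOUBLE_MIN = -(1 << sys.float_info.mant_dig)


def within_exact_double_range(value):
    """Return True if value lies in [-2**53, 2**53], mirroring the diff's condition."""
    return K_DOUBLE_MIN <= value <= K_DOUBLE_MAX
```

With these bounds, `within_exact_double_range(2**53)` is accepted and `within_exact_double_range(2**53 + 1)` is rejected.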

farnoy · 2018-10-12T16:51:02Z

python/setup.py

                # ARROW-1090: work around CMake rough edges
                if 'ARROW_HOME' in os.environ and sys.platform != 'win32':
-                    pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib',
+                    pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib64',


This is obviously temporary - I couldn't get it to build a self-contained .whl without this option, is there something I'm missing wrt the build system?

farnoy · 2018-10-12T16:55:10Z

cc @wesm

I was testing with a custom python script like below. This acts properly if you use it with safe=False, but I haven't yet implemented an error notification for the safe=True case.

9007199254740993 is equal to (1<<53)+1, i.e. 2^53 + 1, the first positive int64 value that a double cannot represent exactly.

I think NULL/NaN values are handled out of band, somewhere else? The null_count, as reported by pyarrow.Table looks right, but I didn't do anything to get that working.

import pyarrow
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3, None, 9007199254740993]})

print(df.dtypes)

schema = pyarrow.schema([pyarrow.field(name='a', type=pyarrow.int64(), nullable=True)])

table = pyarrow.Table.from_pandas(df, schema=schema, preserve_index=False, safe=False)

print(table)
print(table.to_pandas())
print("null_count: ", table.columns[0].null_count)
print(table.to_pandas().a.iloc[-1])

And output:

a    float64
dtype: object
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "'
            b'float64", "metadata": null}], "pandas_version": "0.20.3"}'}
              a
0           NaN
1  1.000000e+00
2  2.000000e+00
3  3.000000e+00
4           NaN
5  9.007199e+15
null_count:  2
9007199254740992.0
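The last line of the output follows from double rounding rather than from the cast itself. A small check in plain Python (no pyarrow involved), assuming IEEE 754 doubles:

```python
n = 9007199254740993  # 2**53 + 1

# pandas stores the column as float64 because of the None entries,
# so the value is rounded to the nearest double before Arrow sees it.
as_float = float(n)

print(n == 2**53 + 1)        # True
print(as_float == 2.0**53)   # True: 2**53 + 1 rounds to 2**53
print(int(as_float))         # 9007199254740992
```

So by the time the cast runs, the original integer is already lost, which is why a range check on the floating-point input is needed to detect it.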

farnoy · 2018-10-29T14:42:42Z

@wesm ping 😄

wesm · 2018-10-29T22:02:39Z

Well, there are no tests. Can you let me know when you have a completed patch including tests?

farnoy · 2018-11-08T12:56:27Z

cpp/src/arrow/compute/kernels/cast.cc

    const in_type* in_data = GetValues<in_type>(input, 1);
    auto out_data = GetMutableValues<out_type>(output, 1);

-    if (options.allow_float_truncate) {


Should we have a specialized fast path for options.allow_float_truncate && options.allow_float_overflow?

farnoy · 2018-11-08T12:56:35Z

Hey @wesm, I have implemented the approach I described in JIRA and it has tests now. I haven't exposed these fine-grained safety settings to Python yet, but I can do it in this PR.

WDYT about this approach?

wesm · 2018-11-10T02:52:24Z

python/setup.py

                # ARROW-1090: work around CMake rough edges
                if 'ARROW_HOME' in os.environ and sys.platform != 'win32':
-                    pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib',
+                    pkg_config = pjoin(os.environ['ARROW_HOME'], 'lib64',


You'll have to pass -DCMAKE_INSTALL_LIBDIR=lib when building the C++ libraries

wesm · 2018-11-10T02:53:00Z

cpp/src/arrow/compute/kernels/cast.cc

    const in_type* in_data = GetValues<in_type>(input, 1);
    auto out_data = GetMutableValues<out_type>(output, 1);

-    if (options.allow_float_truncate) {


wesm · 2018-11-10T02:59:16Z

cpp/src/arrow/compute/kernels/cast.cc

+          ctx->SetStatus(Status::Invalid("Floating point value truncated"));
+        }
+        if (!options.allow_float_overflow &&
+            ARROW_PREDICT_FALSE(out_value >= kMax || out_value <= kMin)) {


Hm, I'm not sure this is totally legit since the conversion to out_value was lossy. Maybe better to compare *in_data with the maximum / minimum representable floating point integer?
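One reading of this suggestion, sketched in Python: validate the input double before converting, so the check never depends on an already-lossy result. The function name, the overflow message, and the choice of int64 bounds are assumptions for illustration, not the kernel's actual code.

```python
import math

# -2**63 is exactly representable as a double; 2**63 - 1 is not, so the
# upper bound is expressed as the exclusive limit 2**63.
INT64_LOWER = -(2.0 ** 63)
INT64_UPPER_EXCLUSIVE = 2.0 ** 63


def checked_double_to_int64(x):
    """Convert a float to an int64-range integer, validating the input first."""
    if math.isnan(x) or math.isinf(x):
        raise ValueError("cannot convert NaN or infinity to an integer")
    # Range check on the input, before any conversion happens.
    if not (INT64_LOWER <= x < INT64_UPPER_EXCLUSIVE):
        raise OverflowError("Floating point value overflowed")
    # Truncation check: the input must already be a whole number.
    if x != math.trunc(x):
        raise ValueError("Floating point value truncated")
    return int(x)
```

The sketch always enforces both checks; the allow_float_truncate and allow_float_overflow options from the patch are left out to keep it short.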

wesm · 2019-01-29T03:38:36Z

Do you have any time to work on this for the 0.13 release? Sorry I wasn't able to spend time on this yet

farnoy · 2019-01-29T10:04:30Z

I don't think I can pick this up in the near future :(

wesm · 2019-05-17T22:05:12Z

I'm closing this stale PR. Hopefully someone can pick it up in the future

Add rough conversion from Double to Int64

5b78eaf

kou changed the title ~~[ARROW-2298] Add conversion from np.float64 to nullable integer columns~~ ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns Oct 12, 2018

kou changed the title ~~ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns~~ WIP: ARROW-2298: [Python] Add conversion from np.float64 to nullable integer columns Oct 12, 2018

farnoy added 2 commits November 8, 2018 13:51

Use a better approach for checking float to int overflow

1bdc6d6

Remove debug prints

f12647e

farnoy added 2 commits November 8, 2018 13:58

Fix lint

fdfc3a2

Fix clang-format issue

1750b91

wesm reviewed Nov 10, 2018

wesm force-pushed the master branch from 3088183 to 0c6b2d2 Compare February 18, 2019 19:34

wesm closed this May 17, 2019

asfimport mentioned this pull request Jun 25, 2019

[Python] Add option to not consider NaN to be null when converting to an integer Arrow type #18252

Closed
