32bit test suite: pandas 1.3.3 "Buffer dtype mismatch" #8169
Comments
Can you check if changing `np.int64` to `np.intp` at `dask/dask/dataframe/backends.py` line 359 (commit 7b66144) fixes the issue?
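The distinction matters because `np.intp` is the platform's pointer-sized integer type, which is what a Cython buffer typed as `intp_t` (as in pandas' `groupsort_indexer`) expects. A minimal sketch illustrating the difference (nothing here is Dask-specific):

```python
import numpy as np

# np.intp is the pointer-sized integer: 32 bits on a 32-bit build,
# 64 bits on a 64-bit build. np.int64 is always 64 bits, which is why
# passing an int64 buffer to code compiled against intp raises
# "Buffer dtype mismatch" on 32-bit platforms.
c = np.array([0, 1, 0, 2])
print(np.dtype(np.intp).itemsize * 8)                  # 32 or 64, depending on the platform
print(c.astype(np.intp, copy=False).dtype == np.intp)  # True on either platform
```

On a 64-bit build `np.intp` and `np.int64` happen to coincide, which is why the original cast only breaks on 32-bit architectures.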
Checking right now. Meanwhile, here is the change in pandas for reference: pandas-dev/pandas#40528
I can confirm that changing the cast to `np.intp` fixes it.
With Dask 2022.01 and 2022.02 this patch gets past the initial test failure `test_categorical_set_index[tasks]`:

```diff
--- a/dask/dataframe/backends.py
+++ b/dask/dataframe/backends.py
@@ -352,7 +352,7 @@
 @group_split_dispatch.register((pd.DataFrame, pd.Series, pd.Index))
 def group_split_pandas(df, c, k, ignore_index=False):
     indexer, locations = pd._libs.algos.groupsort_indexer(
-        c.astype(np.int64, copy=False), k
+        c.astype(np.intp, copy=False), k
     )
     df2 = df.take(indexer)
     locations = locations.cumsum()
```

However, the new `info(verbose=True)` option introduced in #8222 gives a new error when running on 32 bit. Traceback from my 2022.02 test run:

```
_____________________________ test_categorize_info _____________________________

    @pytest.mark.skipif(not PANDAS_GT_120, reason="need newer version of Pandas")
    def test_categorize_info():
        # assert that we can call info after categorize
        # workaround for: https://github.com/pydata/pandas/issues/14368
        from io import StringIO

        pandas_format._put_lines = put_lines

        df = pd.DataFrame(
            {"x": [1, 2, 3, 4], "y": pd.Series(list("aabc")), "z": pd.Series(list("aabc"))},
            index=[0, 1, 2, 3],
        )
        ddf = dd.from_pandas(df, npartitions=4).categorize(["y"])

        # Verbose=False
        buf = StringIO()
        ddf.info(buf=buf, verbose=True)
        expected = (
            "<class 'dask.dataframe.core.DataFrame'>\n"
            "Int64Index: 4 entries, 0 to 3\n"
            "Data columns (total 3 columns):\n"
            " #   Column  Non-Null Count  Dtype\n"
            "---  ------  --------------  -----\n"
            " 0   x       4 non-null      int64\n"
            " 1   y       4 non-null      category\n"
            " 2   z       4 non-null      object\n"
            "dtypes: category(1), object(1), int64(1)\n"
            "memory usage: 496.0 bytes\n"
        )
>       assert buf.getvalue() == expected
E       assert "<class 'dask...312.0 bytes\n" == "<class 'dask...496.0 bytes\n"
E         <class 'dask.dataframe.core.DataFrame'>
E         Int64Index: 4 entries, 0 to 3
E         Data columns (total 3 columns):
E          #   Column  Non-Null Count  Dtype
E         ---  ------  --------------  -----
E          0   x       4 non-null      int64
E          1   y       4 non-null      category...
E
E         ...Full output truncated (7 lines hidden), use '-vv' to show
```

I'm pretty sure that 32-bit architectures will use less memory than 64-bit architectures, and that should be expected:

```
(pdb) ddf.info(verbose=True)
<class 'dask.dataframe.core.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       4 non-null      int64
 1   y       4 non-null      category
 2   z       4 non-null      object
dtypes: category(1), object(1), int64(1)
memory usage: 312.0 bytes
```

Would it be better to not include the memory usage in the test, or to make it something that can vary depending on the architecture? I tested the following patch as a solution to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006537, but I could also imagine just trimming the expected and actual output to end at "memory usage:" and deleting the "496.0 bytes\n":

```diff
--- a/dask/dataframe/tests/test_dataframe.py
+++ b/dask/dataframe/tests/test_dataframe.py
@@ -3,6 +3,7 @@
 import xml.etree.ElementTree
 from itertools import product
 from operator import add
+import platform

 import numpy as np
 import pandas as pd
@@ -3597,6 +3598,12 @@
     # Verbose=False
     buf = StringIO()
     ddf.info(buf=buf, verbose=True)
+
+    if platform.architecture()[0] == "32bit":
+        memory_usage = "312.0"
+    else:
+        memory_usage = "496.0"
+
     expected = (
         "<class 'dask.dataframe.core.DataFrame'>\n"
         "Int64Index: 4 entries, 0 to 3\n"
@@ -3607,7 +3614,7 @@
         " 1   y       4 non-null      category\n"
         " 2   z       4 non-null      object\n"
         "dtypes: category(1), object(1), int64(1)\n"
-        "memory usage: 496.0 bytes\n"
+        "memory usage: {} bytes\n".format(memory_usage)
     )
     assert buf.getvalue() == expected
```
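The trimming alternative mentioned above could look something like this. This is only a sketch; the helper name `strip_memory_usage` is my own and not part of the test suite:

```python
import re

def strip_memory_usage(info_text):
    # Drop the architecture-dependent figure from the final
    # "memory usage: ... bytes" line, so only "memory usage:" is compared.
    return re.sub(r"memory usage: .*\n?\Z", "memory usage:\n", info_text)

out = "dtypes: category(1), object(1), int64(1)\nmemory usage: 312.0 bytes\n"
exp = "dtypes: category(1), object(1), int64(1)\nmemory usage: 496.0 bytes\n"
print(strip_memory_usage(out) == strip_memory_usage(exp))  # True
```

This keeps the test checking the full column listing while staying agnostic to pointer size, at the cost of no longer catching genuine memory-usage regressions.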
Could you submit the fixes as PRs, please?
What happened:
When running the test suite in a 32-bit environment on the openSUSE Build Service, I get the following error. Unless I overlooked something, it is the same error for 187 tests:
Full buildlog:
dask-test-i586_log.txt
Anything else we need to know?:
`dask/dask/dataframe/backends.py`, lines 356 to 360 in 7b66144
Environment: