# RAPIDS cuDF bug. June 7, 2022
This notebook demonstrates a bug in the RAPIDS cuDF groupby transform function, using the 3 GB dataset available here:  
https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format?select=test.parquet

In [13]:
import matplotlib.pyplot as plt

import cudf
cudf.__version__

'22.04.00'

# Index without Consecutive Numbers
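Sorting leaves the index with non-consecutive numbers, which causes RAPIDS cuDF transform to return incorrect results. Every row of a customer should carry the same P_2_max, so the standard deviation of P_2_max within each customer should be close to zero. Below it reaches 0.64, and a single customer shows two different values.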

In [34]:
PATH = '../../May-2022/may-32-22-AMEX/'
df = cudf.read_parquet(PATH+'test.parquet')
df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
df = df.sort_values('customer_ID')

In [35]:
df['P_2_max'] = df.groupby('customer_ID').P_2.transform('max')
tmp = df.groupby('customer_ID').P_2_max.agg('std')
tmp.max()

0.6389674118471806

In [41]:
i = tmp.loc[tmp>0.5].index[0]
df.loc[df.customer_ID==i,['customer_ID','P_2_max']]

| | customer_ID | P_2_max |
|---|---|---|
| 5118281 | -35742134755167706 | -0.105689 |
| 5118282 | -35742134755167706 | -0.105689 |
| 5118283 | -35742134755167706 | -0.105689 |
| 5118284 | -35742134755167706 | -0.105689 |
| 5118285 | -35742134755167706 | 0.936756 |
| 5118286 | -35742134755167706 | 0.936756 |
| 5118287 | -35742134755167706 | 0.936756 |
| 5118288 | -35742134755167706 | 0.936756 |
| 5118289 | -35742134755167706 | 0.936756 |
| 5118290 | -35742134755167706 | 0.936756 |
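
For comparison, here is a minimal CPU sketch of the same check on toy data. It uses pandas as a stand-in for cuDF (an assumption, chosen so the cell runs without a GPU). The helper name `max_within_group_std` and the toy values are made up for illustration. The check allows a small tolerance because the correct runs in this notebook report about 1.9e-07 rather than exactly zero.

```python
import pandas as pd

def max_within_group_std(df, key="customer_ID", col="P_2"):
    """Broadcast the per-group max with transform, then return the largest
    per-group standard deviation of the broadcast column.
    A correct transform gives a value at (or very near) zero."""
    out = df.copy()
    out[col + "_max"] = out.groupby(key)[col].transform("max")
    return float(out.groupby(key)[col + "_max"].agg("std").max())

# Hypothetical toy data: three customers, rows deliberately interleaved.
toy = pd.DataFrame({
    "customer_ID": [30, 10, 20, 10, 30, 20, 10],
    "P_2": [0.1, 0.9, 0.4, 0.2, 0.7, 0.5, 0.3],
})

# Sorting keeps the original index labels, so the index is no longer 0..n-1.
toy_sorted = toy.sort_values("customer_ID")
worst_std = max_within_group_std(toy_sorted)
print(list(toy_sorted.index), worst_std)
```

The helper only uses groupby, transform, agg and max, so it should also accept a cuDF dataframe, though that has not been verified here.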


# Without Sort
If we don't sort the dataframe, the index keeps its consecutive numbers and transform works correctly.

In [42]:
df = cudf.read_parquet(PATH+'test.parquet')
df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')

In [43]:
df['P_2_max'] = df.groupby('customer_ID').P_2.transform('max')
tmp = df.groupby('customer_ID').P_2_max.agg('std')
tmp.max()

1.9088763446849327e-07
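
To tell the two situations apart before calling transform, one option is to test whether the index is exactly 0, 1, ..., n-1. The sketch below does this with pandas and NumPy on toy data. The helper name `has_consecutive_index` is hypothetical, and a cuDF index would first need to be copied to the host.

```python
import numpy as np
import pandas as pd

def has_consecutive_index(df):
    """True when the index is exactly 0, 1, ..., len(df) - 1, in that order."""
    return np.array_equal(df.index.to_numpy(), np.arange(len(df)))

# Hypothetical toy data standing in for the parquet file.
toy = pd.DataFrame({
    "customer_ID": [30, 10, 20, 10],
    "P_2": [0.1, 0.9, 0.4, 0.2],
})

fresh_ok = has_consecutive_index(toy)                              # as loaded
sorted_ok = has_consecutive_index(toy.sort_values("customer_ID"))  # after sort
print(fresh_ok, sorted_ok)
```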

# Reset Index
If we reset the index after sorting, the index has consecutive numbers again and transform works correctly.

In [44]:
df = cudf.read_parquet(PATH+'test.parquet')
df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
df = df.sort_values('customer_ID').reset_index()

In [45]:
df['P_2_max'] = df.groupby('customer_ID').P_2.transform('max')
tmp = df.groupby('customer_ID').P_2_max.agg('std')
tmp.max()

1.9088763446849327e-07
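
One side effect to be aware of: in pandas, reset_index() without arguments keeps the old index as a new column named index. Passing drop=True discards it instead. The sketch below shows the difference on toy data with pandas. cuDF is assumed to behave the same way here, since it mirrors the pandas API.

```python
import pandas as pd

# Hypothetical toy data standing in for the parquet file.
toy = pd.DataFrame({
    "customer_ID": [30, 10, 20, 10],
    "P_2": [0.1, 0.9, 0.4, 0.2],
})

# Default: the old index labels survive as an extra "index" column.
kept = toy.sort_values("customer_ID").reset_index()
# drop=True: the old labels are discarded, only a fresh 0..n-1 index remains.
dropped = toy.sort_values("customer_ID").reset_index(drop=True)

kept_cols = list(kept.columns)
dropped_cols = list(dropped.columns)
print(kept_cols, dropped_cols)
```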

# Agg then Merge
If we aggregate and then merge, the result is correct even when the index does not have consecutive numbers.

In [57]:
df = cudf.read_parquet(PATH+'test.parquet')
df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
df = df.sort_values('customer_ID')

In [58]:
tmp = df.groupby('customer_ID').P_2.agg('max').rename('P_2_max')
df = df.merge(tmp, left_on='customer_ID', right_index=True)
tmp = df.groupby('customer_ID').P_2_max.agg('std')
tmp.max()

1.9088763446849327e-07
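
As a cross-check of the workaround, the sketch below runs the same aggregate-then-merge steps on toy data with pandas (a CPU stand-in, not cuDF) and compares the merged column with what transform returns on the merged frame. The toy values are made up. The comparison is done row by row on the merged frame so it does not depend on how merge orders its output.

```python
import pandas as pd

# Hypothetical toy data, sorted so the index is non-consecutive.
toy = pd.DataFrame({
    "customer_ID": [30, 10, 20, 10, 30, 20, 10],
    "P_2": [0.1, 0.9, 0.4, 0.2, 0.7, 0.5, 0.3],
}).sort_values("customer_ID")

# Aggregate once per customer, then join the result back onto every row.
group_max = toy.groupby("customer_ID")["P_2"].agg("max").rename("P_2_max")
merged = toy.merge(group_max, left_on="customer_ID", right_index=True)

# Reference: transform computed on the merged frame itself.
expected = merged.groupby("customer_ID")["P_2"].transform("max")
same = bool((merged["P_2_max"].to_numpy() == expected.to_numpy()).all())
print(len(merged), same)
```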