<a href="https://colab.research.google.com/github/aaln/aaln/blob/main/rapids_cudf_pandas_accelerator_mode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 10 Minutes to RAPIDS cuDF's pandas accelerator mode (cudf.pandas)

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data through a pandas-style DataFrame API.

cuDF now provides a pandas accelerator mode (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.

This notebook is a short introduction to `cudf.pandas`.
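
Outside a notebook, the accelerator can also be switched on from Python code. The sketch below assumes a plain script and falls back to CPU pandas when cuDF is not installed; the `accelerated` flag and the toy DataFrame are names introduced here for illustration only.

```python
# Enable cudf.pandas before pandas is imported; fall back to CPU pandas otherwise.
try:
    import cudf.pandas
    cudf.pandas.install()  # must run before `import pandas`
    accelerated = True
except ImportError:
    accelerated = False  # cuDF is not installed, so pandas runs on the CPU

import pandas as pd

# The pandas code itself is identical in both cases.
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
totals = df.groupby("key")["value"].sum()
print(f"accelerated={accelerated}")
print(totals)
```

In a notebook, `%load_ext cudf.pandas` plays the same role, and `python -m cudf.pandas script.py` runs an unmodified script from the command line.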

# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU.

In [4]:
!nvidia-smi  # this should display information about available GPUs

Wed Oct 30 18:13:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
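
The same check can be made from Python, which is convenient when later cells should branch on whether a GPU is present. This is a minimal sketch that only shells out to `nvidia-smi`; the helper name `gpu_available` is hypothetical.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per GPU
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU detected" if gpu_available() else "No NVIDIA GPU detected")
```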
                                                                    

In [24]:
get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

With our GPU-enabled Colab runtime active, we're ready to go. cuDF is available by default in the GPU-enabled runtime.

If you're interested in installing on other platforms, please visit https://rapids.ai/#quick-start to learn more.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
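# Set up the OpenCL ICD file and the driver packages that LightGBM's GPU build needs
!mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
# Refresh the package index first, then install without interactive prompts
!apt-get update --fix-missing
!sudo apt install -y nvidia-driver-460 nvidia-cuda-toolkit clinfo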
!pip install -q  lightgbm==4.1.0 \
  --config-settings=cmake.define.USE_GPU=ON \
  --config-settings=cmake.define.OpenCL_INCLUDE_DIR="/usr/local/cuda/include/" \
  --config-settings=cmake.define.OpenCL_LIBRARY="/usr/local/cuda/lib64/libOpenCL.so"

In [2]:
%load_ext cudf.pandas

import cudf  # this should work without any errors
import cupy as cp
import lightgbm as lgb
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error



Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [3]:
from google.colab import files
files.upload()  # Select kaggle.json (your Kaggle API token) from your local files

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"aalndy","key":"<redacted>"}'}

In [4]:
!mkdir -p ~/.kaggle  # Use -p to avoid errors if the directory exists
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
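
The three shell commands above can also be written in Python. The sketch below is one possible equivalent, demonstrated in a temporary directory with a placeholder token; the function name `install_kaggle_token` and the `home` parameter are assumptions made for the example.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path: Path, home: Path) -> Path:
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same as `chmod 600`
    return target


# Demonstration in a throwaway directory with a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "kaggle.json"
    source.write_text('{"username": "example", "key": "placeholder"}')
    installed = install_kaggle_token(source, home=tmp_path / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed.name, oct(mode))
```

Because the token grants access to your Kaggle account, it helps to keep it out of cell output, for example by assigning the upload result to a variable (`_ = files.upload()`) so the notebook does not echo the file contents.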

In [5]:
!pip install -q kaggle

In [6]:

!kaggle competitions download -c odsc-2024-nvidia-hackathon

Downloading odsc-2024-nvidia-hackathon.zip to /content
100% 3.71G/3.72G [02:56<00:00, 24.4MB/s]
100% 3.72G/3.72G [02:56<00:00, 22.6MB/s]


In [7]:
import zipfile
with zipfile.ZipFile("odsc-2024-nvidia-hackathon.zip", "r") as zip_ref:
    zip_ref.extractall("odsc-2024-nvidia-hackathon")
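
Before extracting a multi-gigabyte archive, it can be useful to list its members and their uncompressed sizes. The sketch below uses a small in-memory archive as a stand-in for the competition download; `list_archive` is a hypothetical helper.

```python
import io
import zipfile


def list_archive(zip_source) -> list[tuple[str, int]]:
    """Return (name, uncompressed size in bytes) for every archive member."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return [(info.filename, info.file_size) for info in zip_ref.infolist()]


# Build a small in-memory archive to stand in for the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("train.csv", "id,y\n1,0.5\n")
    zf.writestr("test.csv", "id\n2\n")

members = list_archive(buffer)
for name, size in members:
    print(f"{name}: {size} bytes")
```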

In [3]:
# The CSVs were extracted into the odsc-2024-nvidia-hackathon directory.
# To experiment on a sample first, pass nrows to read_csv, e.g.:
# train = pd.read_csv('odsc-2024-nvidia-hackathon/train.csv', nrows=1000000)

# Load the full train and test sets. If the cudf.pandas extension is loaded,
# these pandas calls run on the GPU where supported and fall back to the CPU otherwise.
train = pd.read_csv('odsc-2024-nvidia-hackathon/train.csv')
test = pd.read_csv('odsc-2024-nvidia-hackathon/test.csv')
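
When the full file is too large to iterate on quickly, `read_csv` accepts `nrows` to read only the first rows. The sketch below illustrates this on a tiny in-memory CSV with made-up values standing in for `train.csv`.

```python
import io

import pandas as pd

# A tiny CSV standing in for train.csv (hypothetical data).
csv_text = "id,y,feature\n1,10.0,a\n2,12.5,b\n3,9.0,a\n4,11.0,c\n"

sample = pd.read_csv(io.StringIO(csv_text), nrows=2)  # first 2 data rows only
full = pd.read_csv(io.StringIO(csv_text))

print(len(sample), len(full))
```

Note that `nrows` takes the first rows of the file, not a random sample, so it is best suited to quick experiments rather than final training.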


In [4]:

# Check for missing values (in case some need imputation)
print(train.isnull().sum())
print(test.isnull().sum())

id                       0
y                        0
trickortreat        976662
kingofhalloween    1026793
mumming            1054562
                    ...   
satan                    0
monsterhunter       400267
tabulatable         479359
vampire                  0
hallo               468437
Length: 108, dtype: int64
id                       0
y                  1000000
trickortreat         88960
kingofhalloween      93404
mumming              95907
                    ...   
satan                    0
monsterhunter        36560
tabulatable          43653
vampire                  0
hallo                43008
Length: 108, dtype: int64
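
Raw counts like those above are easier to compare as a share of rows. The following sketch, on a made-up three-column frame, ranks columns by missing share and picks out those above a threshold; the 0.4 cutoff is an arbitrary example value, not something taken from this notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() gives the share of missing rows per column.
missing_share = df.isnull().mean().sort_values(ascending=False)
# Columns whose missing share exceeds the example threshold.
mostly_missing = missing_share[missing_share > 0.4].index.tolist()

print(missing_share)
print(mostly_missing)
```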


In [5]:
# Basic Feature Preprocessing
# Extract feature columns and target
target = train['y']
features = train.drop(['id', 'y'], axis=1)

In [6]:
# Label Encoding for Categorical Features using pandas factorize
categorical_cols = features.select_dtypes(include=['object']).columns
for col in categorical_cols:
    features[col], _ = pd.factorize(features[col])
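
`pd.factorize` assigns integer codes in order of first appearance, so applying it independently to two frames can give the same string different codes. The sketch below, on made-up values, shows what factorize returns and one way to reuse the training categories for another column; it illustrates the idea and is not necessarily how this notebook encodes its data.

```python
import pandas as pd

train_col = pd.Series(["cat", "dog", None, "cat"])
test_col = pd.Series(["dog", "bird", None, "cat"])

# factorize assigns codes in order of first appearance; missing values get -1.
train_codes, categories = pd.factorize(train_col)

# Reuse the training categories so test codes line up with train codes.
# Values not seen in training ("bird") and missing values become -1.
test_codes = pd.Categorical(test_col, categories=categories).codes

print(train_codes.tolist())
print(test_codes.tolist())
```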

In [7]:
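# Encode the test data with the categories learned from the training data so
# that each string maps to the same integer code in both sets
test_features = test.drop(['id'], axis=1)
for col in categorical_cols:
    if col in test_features.columns:
        # train still holds the original strings; factorize returns its unique values in code order
        train_categories = pd.factorize(train[col])[1]
        # Values unseen in training (and missing values) become -1, matching factorize's code for NaN
        test_features[col] = pd.Categorical(test_features[col], categories=train_categories).codes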

# Align train and test dataframes to have the same columns
features, test_features = features.align(test_features, join='left', axis=1, fill_value=0)

# Split data into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(features, target, test_size=0.2, random_state=42)

# Convert to NumPy arrays for LightGBM compatibility
X_train = X_train.values
y_train = y_train.values
X_valid = X_valid.values
y_valid = y_valid.values


In [8]:
# LightGBM Dataset Construction (LightGBM handles missing values natively, so no imputation is needed)
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_valid = lgb.Dataset(X_valid, y_valid, reference=lgb_train, free_raw_data=False)

# Set LightGBM Parameters
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'gpu_platform_id': 0,  # Specify GPU platform (optional, depending on the setup)
    'gpu_device_id': 0,    # Specify GPU device ID (optional, depending on the setup)
    'learning_rate': 0.1,
    'max_depth': -1,
    'num_leaves': 31,
    'verbose': -1,
    'device': 'gpu',  # Train on the GPU using LightGBM's OpenCL-based backend
}

# Train LightGBM Model with Callbacks
callbacks = [
    lgb.early_stopping(stopping_rounds=50),
    lgb.log_evaluation(period=100)
]

model = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=[lgb_valid], callbacks=callbacks)


Training until validation scores don't improve for 50 rounds
[100]	valid_0's rmse: 643.501
[200]	valid_0's rmse: 643.037
[300]	valid_0's rmse: 642.911
[400]	valid_0's rmse: 642.854
[500]	valid_0's rmse: 642.821
[600]	valid_0's rmse: 642.796
[700]	valid_0's rmse: 642.79
[800]	valid_0's rmse: 642.78
[900]	valid_0's rmse: 642.772
Early stopping, best iteration is:
[921]	valid_0's rmse: 642.77
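
The stopping behaviour in the log can be pictured as a patience rule: training halts once the validation metric has gone a fixed number of rounds without improving. The function below is a simplified illustration of that rule on made-up scores, not LightGBM's actual implementation; `early_stop` and its return convention are assumptions for the example.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round), both 1-based, for a lower-is-better metric.

    stop_round is None if all rounds run without triggering the rule.
    """
    best_score = float("inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            return best_round, round_number
    return best_round, None


# Hypothetical validation RMSE per round.
rmse_by_round = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.9]
print(early_stop(rmse_by_round, patience=2))
```

Under this rule, a best iteration of 921 with a patience of 50 would mean training halted around round 971, which is why `num_iteration=model.best_iteration` is passed when predicting.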


In [9]:
# Predict on Validation Set
y_pred_valid = model.predict(X_valid, num_iteration=model.best_iteration)
rmse_valid = np.sqrt(mean_squared_error(y_valid, y_pred_valid))
print(f'Validation RMSE: {rmse_valid}')

# Predict on Test Set
test_features = test_features.values
test_preds = model.predict(test_features, num_iteration=model.best_iteration)


Validation RMSE: 642.7704077659927
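
The metric reported above is the square root of the mean squared difference between targets and predictions. As a worked example on made-up numbers, a plain-Python version of that formula could look like this:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(rmse(y_true, y_pred))  # sqrt(4/3): only the last prediction is off, by 2
```

RMSE is expressed in the units of the target, so it is most meaningful when compared against the spread of `y` itself.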


In [10]:
# Create Submission File
submission = pd.DataFrame({'id': test['id'], 'y': test_preds})
submission.to_csv('submission.csv', index=False)

print("Submission saved as 'submission.csv'")


Submission saved as 'submission.csv'


In [11]:
# Read submission.csv back and count its rows as a sanity check

submission = pd.read_csv('submission.csv')
num_rows = len(submission)
print(f"Number of rows in submission.csv: {num_rows}")

Number of rows in submission.csv: 1000000
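
A row count catches truncated files, but a few more checks are cheap to add before submitting. The sketch below assumes the `id`/`y` column layout used in this notebook; the helper `check_submission` and the toy frames are hypothetical.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(submission.columns) != ["id", "y"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if len(submission) != len(expected_ids):
        problems.append(f"expected {len(expected_ids)} rows, found {len(submission)}")
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if submission["y"].isnull().any():
        problems.append("missing predictions")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    return problems


good = pd.DataFrame({"id": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
bad = pd.DataFrame({"id": [1, 2, 2], "y": [0.1, None, 0.3]})
test_ids = pd.Series([1, 2, 3])

print(check_submission(good, test_ids))
print(check_submission(bad, test_ids))
```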


In [12]:
# Submit the predictions to the competition with the Kaggle CLI

!kaggle competitions submit -c odsc-2024-nvidia-hackathon -f submission.csv -m "First submission using 1M rows of training"


100% 19.9M/19.9M [00:04<00:00, 5.06MB/s]
Successfully submitted to 🎃 Spooktacular NVIDIA Data Science Competition