# Welcome to cuDF: GPU-Accelerated DataFrames

## What is cuDF?
cuDF is a Python library for manipulating large tabular datasets on GPUs. Its interface closely mirrors pandas, so the code in this notebook should look familiar, and on large datasets many operations can run considerably faster than their CPU equivalents.

Let's start by installing cuDF using pip. For other installation methods, see the [RAPIDS Installation Guide](https://docs.rapids.ai/install/).

In [1]:
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==24.8.*

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu12==24.8.*
  Downloading https://pypi.nvidia.com/cudf-cu12/cudf_cu12-24.8.3-cp311-cp311-manylinux_2_28_x86_64.whl (517.8 MB)
Collecting cupy-cuda12x>=12.0.0 (from cudf-cu12==24.8.*)
  Downloading cupy_cuda12x-13.3.0-cp311-cp311-manylinux2014_x86_64.whl.metadata (2.7 kB)
Collecting rmm-cu12==24.8.* (from cudf-cu12==24.8.*)
  Downloading https://pypi.nvidia.com/rmm-cu12/rmm_cu12-24.8.2-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)
Downloading cupy_cuda12x-13.3.0-cp311-cp311-manylinux2014_x86_64.whl (91.2 MB)

## 📊 Your First cuDF DataFrame

With cuDF installed, let's create our first DataFrame from three columns of random numbers:

In [8]:
import cudf
import numpy as np

numRows = 1000000
# Create a DataFrame with cuDF
data = {
    'A': np.random.rand(numRows),
    'B': np.random.rand(numRows),
    'C': np.random.rand(numRows)
}
gdf = cudf.DataFrame(data)

# Display the first few rows
print(gdf.head())


          A         B         C
0  0.949641  0.879344  0.442241
1  0.393400  0.870158  0.498193
2  0.807076  0.506883  0.098362
3  0.102371  0.486801  0.730431
4  0.095091  0.677342  0.412487


## 📊 Explore the DataFrame

Now that we have created the DataFrame, let's explore it!

**Shape:**

In [3]:
gdf.shape

(1000000, 3)

The first value is the number of rows and the second is the number of columns we created.
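
Beyond the shape, a few other attributes are handy for a quick structural check. The snippet below is a minimal sketch on a small, made-up DataFrame. The alias `xdf` and the pandas fallback are assumptions added only so the snippet also runs on a machine without cuDF or a GPU; they are not part of this notebook.

```python
try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf
import numpy as np

# Small illustrative DataFrame (hypothetical data, seeded for repeatability)
rng = np.random.default_rng(seed=0)
small = xdf.DataFrame({'A': rng.random(5), 'B': rng.random(5)})

n_rows, n_cols = small.shape        # unpack (rows, columns)
column_names = list(small.columns)  # column labels as a plain Python list
column_dtypes = small.dtypes        # one dtype per column

print(n_rows, n_cols)
print(column_names)
print(column_dtypes)
```

Unpacking the shape tuple into two names tends to read better than indexing it with `[0]` and `[1]` later on.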

Get a more comprehensive view of the DataFrame by calling the .info() method. Note the parentheses: without them, the notebook only displays the bound method object instead of the summary. The call prints the index range, each column's name, non-null count and dtype, and the memory usage.

In [4]:
gdf.info()

## 📊 Filtering Data

Filtering data is a breeze with cuDF! Let's say you want to keep only the rows where column 'A' is greater than 0.5:

In [10]:
filtered_gdf = gdf[gdf['A'] > 0.5]
print(f"As you can tell from the shape of the new filtered DataFrame, the number of rows dropped from {numRows} to {filtered_gdf.shape[0]}. That's {numRows - filtered_gdf.shape[0]} rows that we've filtered out with 'A' values less than or equal to 0.5!")

As you can tell from the shape of the new filtered DataFrame, the number of rows dropped from 1000000 to 500559. That's 499441 rows that we've filtered out with 'A' values less than or equal to 0.5!
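
The same boolean-mask approach extends to several conditions at once. Here is a minimal sketch on a tiny, made-up DataFrame; the alias `xdf` and the pandas fallback are assumptions so the snippet can run without a GPU. Each condition needs its own parentheses because `&` and `|` bind more tightly than the comparison operators.

```python
try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf

# Hypothetical sample data, chosen so the results are easy to check by eye
df = xdf.DataFrame({
    'A': [0.1, 0.6, 0.9, 0.5, 0.3],
    'B': [0.8, 0.2, 0.7, 0.9, 0.4],
})

# Rows where BOTH columns exceed 0.5 (logical AND)
both = df[(df['A'] > 0.5) & (df['B'] > 0.5)]

# Rows where AT LEAST ONE column exceeds 0.5 (logical OR)
either = df[(df['A'] > 0.5) | (df['B'] > 0.5)]

print(len(both), len(either))
```

Note that the row with 'A' equal to exactly 0.5 does not pass `df['A'] > 0.5`; it is kept by the OR filter only because its 'B' value qualifies.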


## 📊 Grouping & Aggregating

Want to group your data and calculate averages? 
Let's create another dataframe with categories: 

In [23]:
data = {
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50]
}
gdf = cudf.DataFrame(data)
gdf

  Category  Value
0        A     10
1        B     20
2        A     30
3        B     40
4        A     50


In [30]:
# Group by 'Category' and calculate the mean

grouped = gdf.groupby('Category')['Value'].mean().reset_index()
print(grouped)

  Category  Value
0        B   30.0
1        A   30.0
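
Both groups average to 30.0 here: (10 + 30 + 50) / 3 for A and (20 + 40) / 2 for B. Category B is listed first because cuDF does not sort group keys unless asked to. The sketch below requests sorted keys and computes several aggregations in one pass. It rebuilds the same small DataFrame so it is self-contained; the alias `xdf` and the pandas fallback are assumptions added so it can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf

df = xdf.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50],
})

# sort=True orders the result by group key;
# agg() with a list computes several statistics per group in one call
summary = (
    df.groupby('Category', sort=True)['Value']
      .agg(['mean', 'sum', 'count'])
      .reset_index()
)
print(summary)
```

Each aggregation becomes its own column in the result, which makes it easy to compare groups that share the same mean but differ in size.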


## 📊 Data Manipulation

### Adding a New Column:



In [None]:
gdf['NewValue'] = gdf['Value'] * 2

### Modifying Existing Columns:

In [None]:
# Square every value in the existing 'NewValue' column, in place
gdf['NewValue'] = gdf['NewValue'] ** 2

### Dropping Columns:

In [None]:
gdf = gdf.drop('NewValue', axis=1)

### Renaming Columns:

In [None]:
gdf = gdf.rename(columns={'Value': 'newValue', 'Category': 'newCategory'})


## 📊 Sorting Data

### Sorting Column Values 

In [None]:
sorted_gdf = gdf.sort_values(by='newValue', ascending=True)
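
Sorting also works across several columns, each with its own direction. This is a minimal sketch using a standalone DataFrame with the renamed columns from the previous section; the alias `xdf` and the pandas fallback are assumptions so it can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf

df = xdf.DataFrame({
    'newCategory': ['A', 'B', 'A', 'B', 'A'],
    'newValue': [10, 20, 30, 40, 50],
})

# Sort by category ascending, then by value descending within each category
ordered = (
    df.sort_values(by=['newCategory', 'newValue'], ascending=[True, False])
      .reset_index(drop=True)  # renumber rows 0..n-1 after sorting
)
print(ordered)
```

Without `reset_index(drop=True)`, the sorted DataFrame keeps the original row labels, which can be confusing when you later select rows by label.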


## 📊 Handling Missing Data
### Detecting Missing Values:

In [None]:
missing_mask = gdf.isnull()

### Filling Missing Values:

In [None]:
gdf['newValue'] = gdf['newValue'].fillna(0)

### Dropping Missing Values:

In [None]:
gdf = gdf.dropna()
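
The DataFrame used so far has no missing values, so the three calls above leave it unchanged. To see what they actually do, here is a minimal sketch on a tiny, made-up DataFrame that contains one null. The alias `xdf` and the pandas fallback are assumptions so it can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf

# Hypothetical data with one missing value in 'newValue'
df = xdf.DataFrame({
    'newCategory': ['A', 'B', 'C'],
    'newValue': [10.0, None, 30.0],
})

null_counts = df.isnull().sum()     # number of missing values per column
filled = df['newValue'].fillna(0)   # replace missing values with 0
dropped = df.dropna()               # drop rows containing any missing value

print(null_counts)
print(filled)
print(dropped)
```

Counting nulls per column first is a cheap way to decide between filling and dropping: filling keeps every row, while dropping shrinks the DataFrame.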

## 📊 Merging and Joining DataFrames

### Concatenating Another DataFrame:

In [None]:
# Use the renamed column names so the rows line up with gdf's columns
additional_data = {
    'newCategory': ['A', 'C'],
    'newValue': [60, 70]
}
gdf2 = cudf.DataFrame(additional_data)

combined_gdf = cudf.concat([gdf, gdf2], ignore_index=True)
print(combined_gdf)
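
Concatenation stacks rows, whereas a join matches rows from two DataFrames on a key column. The sketch below shows an inner and a left join with `merge`. The lookup table, its 'Label' column, the alias `xdf`, and the pandas fallback are all hypothetical additions for illustration.

```python
try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf

left = xdf.DataFrame({
    'newCategory': ['A', 'B', 'C'],
    'newValue': [10, 20, 30],
})
# Hypothetical lookup table: a label for some of the categories
right = xdf.DataFrame({
    'newCategory': ['A', 'B', 'D'],
    'Label': ['alpha', 'beta', 'delta'],
})

# Inner join: keep only categories present in both DataFrames
inner = left.merge(right, on='newCategory', how='inner')

# Left join: keep every row of `left`; unmatched rows get a null 'Label'
left_join = left.merge(right, on='newCategory', how='left')

print(inner)
print(left_join)
```

The order of rows in a merge result is not something to rely on; sort the result explicitly if the order matters.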

## 📊 Input/Output

In [None]:
gdf.to_csv('output.csv', index=False) # To CSV
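
Reading the file back uses `read_csv`. Here is a minimal round-trip sketch that writes to a temporary directory so it leaves no files behind. The sample data, the alias `xdf`, and the pandas fallback are assumptions for illustration.

```python
import os
import tempfile

try:
    import cudf as xdf  # GPU DataFrame library used throughout this notebook
except Exception:  # assumption: pandas stand-in if cuDF or a GPU is unavailable
    import pandas as xdf

df = xdf.DataFrame({
    'newCategory': ['A', 'B'],
    'newValue': [10, 20],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')
    df.to_csv(path, index=False)   # write without the index column
    restored = xdf.read_csv(path)  # read it back into a new DataFrame

print(restored)
```

Writing with `index=False` keeps the row labels out of the file, so the restored DataFrame has exactly the original columns.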