<a href="https://colab.research.google.com/github/Viny2030/polars/blob/main/FireDucks_vs_Pandas_vs_Polars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://fireducks-dev.github.io/


## Pandas vs Polars vs FireDucks

This notebook was contributed by [Mr. Avi Chawla](https://www.linkedin.com/in/avi-chawla/), co-founder of [Daily Dose of Data Science](https://www.dailydoseofds.com/).

We are deeply grateful for his contribution.


In [None]:
!pip install -U fireducks polars linetimer



In [None]:
# download the dataset:
!wget https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv

--2024-12-26 18:19:44--  https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv
Resolving proxygate1.nic.nec.co.jp (proxygate1.nic.nec.co.jp)... 10.51.8.102, 10.51.8.101
Connecting to proxygate1.nic.nec.co.jp (proxygate1.nic.nec.co.jp)|10.51.8.102|:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 209447840 (200M) [text/csv]
Saving to: ‘yellow_tripdata_2015-01.csv.1’


2024-12-26 18:20:42 (3.45 MB/s) - ‘yellow_tripdata_2015-01.csv.1’ saved [209447840/209447840]



In [None]:
import polars as pl
df = pl.scan_csv("yellow_tripdata_2015-01.csv")

big_df = pl.concat([df for _ in range(20)])
big_df.collect().write_parquet("taxi.parquet")

In [None]:
!ls -lah | grep taxi

-rw-r--r--  1 sourav scaleup 613M Dec 26 18:21 taxi.parquet


In [None]:
import platform, psutil
print("="*30, "Evaluation Environment Information", "="*30)
print(f'platform: {platform.system()}')
print(f'architecture: {platform.machine()}')
print(f'processor: {platform.processor()}')
print(f'cpu: {psutil.cpu_count()}')

platform: Linux
architecture: x86_64
processor: x86_64
cpu: 128


# Pandas


In [None]:
# defining query to be performed on pandas DataFrame

from linetimer import CodeTimer

def pandas_query(key):
  with CodeTimer(name=f"Overall execution for ${key} using {pd.__name__}", unit="s"):
    res = (
        pd.read_parquet("taxi.parquet")
        .groupby(key)
        .agg(
            mean_mta_tax=("mta_tax", "mean"),
            mean_tip_amount=("tip_amount", "mean"),
            mean_tolls_amount=("tolls_amount", "mean"),
            mean_trip_distance=("trip_distance", "mean"),
        )
    )
    return res

In [None]:
import pandas as pd
pd.__version__

'2.2.3'

In [None]:
pandas_query("PULocationID")

Code block 'Overall execution for $PULocationID using pandas' took: 9.02208 s


PULocationID,mean_mta_tax,mean_tip_amount,mean_tolls_amount,mean_trip_distance
1,0.104895,5.670350,2.147483,1.645385
2,0.500000,5.030000,1.332500,10.280000
3,0.500000,1.368750,1.332500,9.281250
4,0.497843,1.401985,0.066277,2.877373
6,0.500000,0.000000,21.320000,36.700000
...,...,...,...,...
261,0.496182,1.193111,0.162796,3.924002
262,0.498785,1.318876,0.184178,2.736880
263,0.499182,1.282391,0.136376,2.596841
264,0.487009,1.526466,0.239003,2.742721


In [None]:
pandas_query("DOLocationID")

Code block 'Overall execution for $DOLocationID using pandas' took: 7.81713 s


DOLocationID,mean_mta_tax,mean_tip_amount,mean_tolls_amount,mean_trip_distance
1,0.012980,6.649897,12.258724,17.091866
2,0.500000,3.894000,3.198000,13.406000
3,0.500000,3.138618,1.753171,14.017805
4,0.499095,1.294488,0.057856,2.517735
5,0.500000,1.992222,12.678889,26.704444
...,...,...,...,...
261,0.499016,1.467040,0.137750,4.153200
262,0.499334,1.409528,0.271510,2.807141
263,0.499432,1.379754,0.211886,2.671808
264,0.485516,1.449961,0.269229,2.553356


# FireDucks

In [None]:
# Define the query for a FireDucks DataFrame (the same pandas query, plus _evaluate() to trigger execution)

from linetimer import CodeTimer

def fireducks_query(key):
  with CodeTimer(name=f"Overall execution for {key} using {pd.__name__}", unit="s"):
    res = (
        pd.read_parquet("taxi.parquet")
        .groupby(key)
        .agg(
            mean_mta_tax=("mta_tax", "mean"),
            mean_tip_amount=("tip_amount", "mean"),
            mean_tolls_amount=("tolls_amount", "mean"),
            mean_trip_distance=("trip_distance", "mean"),
        )
    )
    return res._evaluate()
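
The call to `_evaluate()` matters for the timing: FireDucks defers work until a result is actually needed, so without forcing execution inside the timed block the timer could stop before the query has run. The following is a minimal pure-Python sketch of that idea. `LazyResult`, `slow_mean`, and `timed` are hypothetical names invented for illustration and are not part of the FireDucks API; the sleep merely stands in for an expensive read and group-by.

```python
import time


class LazyResult:
    """Hypothetical stand-in for a lazily evaluated result (not the FireDucks API)."""

    def __init__(self, compute):
        self._compute = compute  # deferred work; nothing runs yet
        self._value = None
        self.evaluated = False

    def evaluate(self):
        # Run the deferred work once and cache the result.
        if not self.evaluated:
            self._value = self._compute()
            self.evaluated = True
        return self._value


def slow_mean(values):
    time.sleep(0.05)  # simulate an expensive read + aggregation
    return sum(values) / len(values)


def timed(fn):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start


data = [1.0, 2.0, 3.0, 4.0]

# Building the lazy result returns almost immediately: nothing is computed yet.
lazy, build_time = timed(lambda: LazyResult(lambda: slow_mean(data)))

# Forcing evaluation inside the timed region captures the real cost.
value, eval_time = timed(lazy.evaluate)

print(f"build: {build_time:.4f} s, evaluate: {eval_time:.4f} s, mean = {value}")
```

In this toy model, timing only the construction step would report a near-zero duration, which is why the benchmark function returns `res._evaluate()` from inside the `CodeTimer` block.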

In [None]:
# Make __version__ report the actual FireDucks version
from fireducks.core import set_fireducks_option
set_fireducks_option("fireducks-version", True)

In [None]:
%load_ext fireducks.ipyext
import fireducks.pandas as pd
pd.__version__

'1.1.5'

In [None]:
%%fireducks.profile
fireducks_query("PULocationID") # exact same pandas code, but much faster

Code block 'Overall execution for PULocationID using fireducks.pandas' took: 0.22191 s


PULocationID,mean_mta_tax,mean_tip_amount,mean_tolls_amount,mean_trip_distance
1,0.104895,5.670350,2.147483,1.645385
2,0.500000,5.030000,1.332500,10.280000
3,0.500000,1.368750,1.332500,9.281250
4,0.497843,1.401985,0.066277,2.877373
6,0.500000,0.000000,21.320000,36.700000
...,...,...,...,...
261,0.496182,1.193111,0.162796,3.924002
262,0.498785,1.318876,0.184178,2.736880
263,0.499182,1.282391,0.136376,2.596841
264,0.487009,1.526466,0.239003,2.742721


,name,type,n_calls,duration (msec)
0,read_parquet_with_metadata,kernel,1,163.54106
1,groupby_agg,kernel,1,48.112375
2,DataFrame._repr_html_,fallback,1,3.631178
3,read_parquet_metadata,kernel,1,1.695878
4,to_pandas.frame.metadata,kernel,1,1.223513
5,slice,kernel,2,0.02824
6,concat,kernel,1,0.027131
7,getattr:_repr_html_,fallback,1,0.0242
8,get_shape,kernel,2,0.00251


In [None]:
%%fireducks.profile
fireducks_query("DOLocationID") # exact same pandas code, but much faster

Code block 'Overall execution for DOLocationID using fireducks.pandas' took: 0.22334 s


DOLocationID,mean_mta_tax,mean_tip_amount,mean_tolls_amount,mean_trip_distance
1,0.012980,6.649897,12.258724,17.091866
2,0.500000,3.894000,3.198000,13.406000
3,0.500000,3.138618,1.753171,14.017805
4,0.499095,1.294488,0.057856,2.517735
5,0.500000,1.992222,12.678889,26.704444
...,...,...,...,...
261,0.499016,1.467040,0.137750,4.153200
262,0.499334,1.409528,0.271510,2.807141
263,0.499432,1.379754,0.211886,2.671808
264,0.485516,1.449961,0.269229,2.553356


,name,type,n_calls,duration (msec)
0,read_parquet_with_metadata,kernel,1,168.14886
1,groupby_agg,kernel,1,46.67986
2,read_parquet_metadata,kernel,1,1.935601
3,DataFrame._repr_html_,fallback,1,1.511927
4,to_pandas.frame.metadata,kernel,1,0.804799
5,concat,kernel,1,0.01111
6,slice,kernel,2,0.01024
7,getattr:_repr_html_,fallback,1,0.00325
8,get_shape,kernel,2,0.00099


# Polars

In [None]:
# Define the query for a Polars DataFrame (the API differs slightly from the pandas query)

from linetimer import CodeTimer

def polars_query(key):
  with CodeTimer(name=f"Overall execution for {key} using {pl.__name__}", unit="s"):
    res = (
        pl.scan_parquet("taxi.parquet")
        .group_by(key)
        .agg([
             pl.mean("mta_tax").alias("mean_mta_tax"),
             pl.mean("tip_amount").alias("mean_tip_amount"),
             pl.mean("tolls_amount").alias("mean_tolls_amount"),
             pl.mean("trip_distance").alias("mean_trip_distance"),
        ])
    )
    ret, prof = res.profile()
    print(prof.with_columns(((pl.col("end") - pl.col("start")) / 1e3).alias("duration(msec)")))
    return ret


In [None]:
import polars as pl
pl.__version__

'1.18.0'

In [None]:
polars_query("PULocationID") # different API, and a little slower than FireDucks

shape: (3, 4)
┌─────────────────────────────────┬────────┬────────┬────────────────┐
│ node                            ┆ start  ┆ end    ┆ duration(msec) │
│ ---                             ┆ ---    ┆ ---    ┆ ---            │
│ str                             ┆ u64    ┆ u64    ┆ f64            │
╞═════════════════════════════════╪════════╪════════╪════════════════╡
│ optimization                    ┆ 0      ┆ 7      ┆ 0.007          │
│ parquet(taxi.parquet)           ┆ 7      ┆ 255606 ┆ 255.599        │
│ group_by_partitioned(PULocatio… ┆ 255627 ┆ 976133 ┆ 720.506        │
└─────────────────────────────────┴────────┴────────┴────────────────┘
Code block 'Overall execution for PULocationID using polars' took: 0.98510 s


PULocationID,mean_mta_tax,mean_tip_amount,mean_tolls_amount,mean_trip_distance
i64,f64,f64,f64,f64
265,0.258384,5.619169,2.060236,2.776486
134,0.481651,1.662844,0.391193,4.269083
3,0.5,1.36875,1.3325,9.28125
6,0.5,0.0,21.32,36.7
137,0.498752,1.248832,0.178264,2.317984
…,…,…,…,…
259,0.5,0.05,0.25381,2.894286
125,0.495782,1.379642,0.130522,2.659686
128,0.5,0.0,0.0,1.947778
131,0.454545,0.597273,0.484545,4.785909
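
The profile table printed above reports `start` and `end` in microseconds, and `polars_query` divides their difference by `1e3` to obtain milliseconds. As a sanity check, the sketch below redoes that arithmetic in plain Python using the three rows and the overall time copied from the PULocationID output above. The variable names are illustrative only, and the group-by node name is kept truncated as it appears in the output.

```python
# (node, start_us, end_us) copied from the PULocationID profile output above
profile_rows = [
    ("optimization", 0, 7),
    ("parquet(taxi.parquet)", 7, 255606),
    ("group_by_partitioned(PULocatio…", 255627, 976133),
]

# Same conversion as in polars_query: microseconds -> milliseconds
durations_ms = {node: (end - start) / 1e3 for node, start, end in profile_rows}

total_profiled_ms = sum(durations_ms.values())

# Overall wall time reported by CodeTimer for the same call, in milliseconds
overall_ms = 0.98510 * 1e3

for node, ms in durations_ms.items():
    print(f"{node:<34} {ms:10.3f} ms")
print(f"{'profiled total':<34} {total_profiled_ms:10.3f} ms")
print(f"{'outside profiled nodes':<34} {overall_ms - total_profiled_ms:10.3f} ms")
```

The profiled nodes account for nearly all of the overall time, with the group-by step dominating in this run; the small remainder is work done outside the profiled nodes, such as printing the profile itself.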


In [None]:
polars_query("DOLocationID") # different API, and a little slower than FireDucks

shape: (3, 4)
┌─────────────────────────────────┬───────┬────────┬────────────────┐
│ node                            ┆ start ┆ end    ┆ duration(msec) │
│ ---                             ┆ ---   ┆ ---    ┆ ---            │
│ str                             ┆ u64   ┆ u64    ┆ f64            │
╞═════════════════════════════════╪═══════╪════════╪════════════════╡
│ optimization                    ┆ 0     ┆ 1      ┆ 0.001          │
│ parquet(taxi.parquet)           ┆ 1     ┆ 64577  ┆ 64.576         │
│ group_by_partitioned(DOLocatio… ┆ 64584 ┆ 921230 ┆ 856.646        │
└─────────────────────────────────┴───────┴────────┴────────────────┘
Code block 'Overall execution for DOLocationID using polars' took: 0.92597 s


DOLocationID,mean_mta_tax,mean_tip_amount,mean_tolls_amount,mean_trip_distance
i64,f64,f64,f64,f64
265,0.271957,6.414913,4.495978,12.248787
3,0.5,3.138618,1.753171,14.017805
134,0.494242,2.659962,0.910499,8.775873
137,0.499297,1.238587,0.185286,2.079024
6,0.5,3.629737,10.659211,16.370263
…,…,…,…,…
128,0.5,4.161176,0.752471,9.422235
259,0.498428,2.07239,0.871572,14.047893
125,0.498713,1.352669,0.081436,2.36063
262,0.499334,1.409528,0.27151,2.807141
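
To put the three sets of timings side by side, the sketch below computes each library's speedup relative to pandas from the `CodeTimer` values recorded in this notebook. These are single-run measurements on the 128-core machine described earlier, so the ratios should be read as indicative rather than as a rigorous benchmark. The dictionary layout and the `speedup_vs_pandas` helper are illustrative and not part of any library.

```python
# Overall execution times in seconds, copied from the CodeTimer output above
timings_s = {
    "pandas":    {"PULocationID": 9.02208, "DOLocationID": 7.81713},
    "FireDucks": {"PULocationID": 0.22191, "DOLocationID": 0.22334},
    "Polars":    {"PULocationID": 0.98510, "DOLocationID": 0.92597},
}


def speedup_vs_pandas(timings):
    """Return pandas_time / library_time for every library and group-by key."""
    baseline = timings["pandas"]
    return {
        lib: {key: baseline[key] / t for key, t in runs.items()}
        for lib, runs in timings.items()
    }


speedups = speedup_vs_pandas(timings_s)
for lib, runs in speedups.items():
    for key, factor in runs.items():
        print(f"{lib:<10} {key:<13} {factor:6.1f}x")
```

On these recorded runs, FireDucks comes out roughly 35x to 41x faster than pandas and Polars roughly 8x to 9x faster.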
