# A short presentation showing the speedup from switching from pandas to FireDucks

🔥 🐦[**FireDucks**](https://fireducks-dev.github.io/) is a high-performance, compiler-accelerated DataFrame library with highly pandas-compatible APIs, developed to speed up a pandas application without any manual source code changes. It comes with a multi-threaded C++ kernel and automatic query optimization (powered by a built-in compiler) under a lazy-execution model.

In this test drive, we will use the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y/about_data) dataset from NYC Open Data.

REF: https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/cudf_pandas_colab_demo.ipynb




In [1]:
!pip install -q -U fireducks

In [2]:
%load_ext fireducks.pandas
import time
import pandas as pd

In [3]:
print(f"evaluation with {pd.__name__}")

evaluation with fireducks.pandas
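
Outside a notebook, where the `%load_ext` magic is not available, the same swap can be done with an explicit import. The sketch below is illustrative only: the fallback to plain pandas is an assumption added so the snippet still runs on a machine without FireDucks installed.

```python
# Prefer FireDucks when it is installed; otherwise fall back to plain pandas.
# The fallback is an illustrative assumption, not part of the original notebook.
try:
    import fireducks.pandas as pd
except ImportError:
    import pandas as pd

# Report which backend the rest of the script will actually use.
print(f"evaluation with {pd.__name__}")
```

Because the rest of the code only refers to `pd`, nothing else needs to change when the backend is switched.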


In [4]:
# Enable benchmark mode, which disables FireDucks lazy execution,
# so that each timing below covers the work of its own step
from fireducks.core import get_fireducks_options
get_fireducks_options().set_benchmark_mode(True)

In [5]:
# Make __version__ report the actual FireDucks version
from fireducks.core import set_fireducks_option
set_fireducks_option("fireducks-version", True)

In [6]:

import platform, psutil
print("="*30, "Evaluation Environment Information", "="*30)
print(f'platform: {platform.system()}')
print(f'architecture: {platform.machine()}')
print(f'processor: {platform.processor()}')
print(f'cpu: {psutil.cpu_count()}')
print(f'{pd.__name__} version: {pd.__version__}')

platform: Linux
architecture: x86_64
processor: x86_64
cpu: 48
fireducks.pandas version: 1.1.6


## Let's load the Parquet dataset

In [7]:
# Data can be downloaded from here:
!wget -q https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet

In [8]:
t0 = time.time()
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description",
             "Vehicle Body Type", "Issue Date", "Summons Number"]
)
load_t = time.time() - t0
df

| | Registration State | Violation Description | Vehicle Body Type | Issue Date | Summons Number |
|---|---|---|---|---|---|
| 0 | NY | | VAN | 06/25/2021 | 1457617912 |
| 1 | NY | | SUBN | 06/25/2021 | 1457617924 |
| 2 | TX | | SDN | 06/17/2021 | 1457622427 |
| 3 | MO | | SDN | 06/16/2021 | 1457638629 |
| 4 | NY | | TAXI | 07/04/2021 | 1457639580 |
| ... | ... | ... | ... | ... | ... |
| 15435602 | 99 | 21-No Parking (street clean) | SUBN | 06/07/2022 | 8995222761 |
| 15435603 | TN | 21-No Parking (street clean) | PICK | 06/07/2022 | 8995222773 |
| 15435604 | NY | 21-No Parking (street clean) | 2DSD | 06/07/2022 | 8995222785 |
| 15435605 | VA | 21-No Parking (street clean) | SUBN | 06/07/2022 | 8995222827 |
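
The `t0 = time.time()` / `time.time() - t0` bookkeeping used for each step could also be wrapped in a small helper. The sketch below is one possible shape, not something the notebook itself uses: `timed` and the `timings` dictionary are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the elapsed time even if the block raises.
        results[label] = time.perf_counter() - start


timings = {}
with timed("toy_sum", timings):
    # Stand-in workload; in the notebook this would be a load or query step.
    total = sum(range(1_000_000))

print(f"toy_sum took {timings['toy_sum']:.6f} sec")
```

With a lazily evaluated backend, a helper like this only measures real work when execution is forced inside the block, which is why the notebook turns on benchmark mode first.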


## Q1: Which parking violation is most commonly committed by vehicles from each U.S. state?

In [9]:
t1 = time.time()
r1 = (df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)
q1_t = time.time() - t1
r1

| | Registration State | Violation Description | count |
|---|---|---|---|
| 0 | 99 | | 17550 |
| 1 | AB | 14-No Standing | 22 |
| 2 | AK | PHTO SCHOOL ZN SPEED VIOLATION | 125 |
| 3 | AL | PHTO SCHOOL ZN SPEED VIOLATION | 3668 |
| 4 | AR | PHTO SCHOOL ZN SPEED VIOLATION | 537 |
| ... | ... | ... | ... |
| 62 | VT | PHTO SCHOOL ZN SPEED VIOLATION | 3024 |
| 63 | WA | 21-No Parking (street clean) | 3732 |
| 64 | WI | 14-No Standing | 1639 |
| 65 | WV | PHTO SCHOOL ZN SPEED VIOLATION | 1185 |
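
The Q1 chain relies on `value_counts()` returning the (state, violation) pairs sorted by descending count, so `head(1)` within each state group picks that state's most frequent violation. A toy-sized sketch of the same idiom in plain pandas follows; the data is made up for illustration, and `reset_index(name="count")` is used so the count column has the same name across pandas versions.

```python
import pandas as pd

# Made-up data: NY has violation "A" twice and "B" once; NJ has "B" three times and "A" once.
toy = pd.DataFrame({
    "Registration State": ["NY", "NY", "NY", "NJ", "NJ", "NJ", "NJ"],
    "Violation Description": ["A", "A", "B", "B", "B", "B", "A"],
})

top = (
    toy[["Registration State", "Violation Description"]]
    .value_counts()                  # count each (state, violation) pair, largest first
    .groupby("Registration State")   # group the counts by the state index level
    .head(1)                         # keep the first (largest) pair per state
    .sort_index()                    # order the result by state
    .reset_index(name="count")       # turn the index levels back into columns
)
print(top)
```

The toy data avoids ties on purpose: when two violations have the same count within a state, which one `head(1)` keeps depends on how `value_counts()` orders the tied rows.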


## Q2: Which vehicle body types are most frequently involved in parking violations?

In [10]:
t2 = time.time()
r2 = (df
 .groupby(["Vehicle Body Type"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
)
q2_t = time.time() - t2
r2

| Vehicle Body Type | Count |
|---|---|
| SUBN | 6449007 |
| 4DSD | 4402991 |
| VAN | 1317899 |
| DELV | 436430 |
| PICK | 429798 |
| ... | ... |
| YANT | 1 |
| YBSD | 1 |
| YEL | 1 |
| YL | 1 |
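
For reference, the Q2 pattern (group, count one column, rename, sort) can be checked against `Series.value_counts()` on toy data. This is a sketch in plain pandas with made-up values; the two approaches agree here only because the toy `Summons Number` column has no missing values, since `count` skips nulls in the counted column.

```python
import pandas as pd

# Made-up data: three SUBN rows, two VAN rows, one PICK row.
toy = pd.DataFrame({
    "Vehicle Body Type": ["SUBN", "SUBN", "VAN", "SUBN", "VAN", "PICK"],
    "Summons Number": [1, 2, 3, 4, 5, 6],
})

# Same shape as the Q2 query: count summonses per body type, largest first.
by_agg = (
    toy
    .groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

# Shorter alternative that counts rows per body type directly.
by_value_counts = toy["Vehicle Body Type"].value_counts()

print(by_agg)
print(by_value_counts)
```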


## Q3: How do parking violations vary across days of the week?

In [11]:
t3 = time.time()
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)
r3 = df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()
q3_t = time.time() - t3
r3

issue_weekday
Sunday        462992
Saturday     1108385
Monday       2488563
Wednesday    2760088
Tuesday      2809949
Friday       2891679
Thursday     2913951
Name: Summons Number, dtype: int64
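
The weekday breakdown can also be written with an explicit date format and pandas' built-in day names. The sketch below is a variation on the Q3 query rather than the notebook's own code: it runs on four made-up rows in the dataset's `MM/DD/YYYY` format and reorders the result into calendar order instead of sorting by count.

```python
import pandas as pd

# Made-up rows in the same "MM/DD/YYYY" text format as the dataset.
toy = pd.DataFrame({
    "Issue Date": ["06/25/2021", "07/04/2021", "06/07/2022", "06/25/2021"],
    "Summons Number": [1, 2, 3, 4],
})

# Parse with an explicit format so month and day cannot be swapped.
issued = pd.to_datetime(toy["Issue Date"], format="%m/%d/%Y")

# dt.day_name() returns "Monday" ... "Sunday" without a manual lookup table.
weekday = issued.dt.day_name()

calendar_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"]

counts = (
    toy.groupby(weekday)["Summons Number"]
    .count()
    .reindex(calendar_order, fill_value=0)  # calendar order, zeros for absent days
)
print(counts)
```

Calendar order makes the weekday-versus-weekend pattern easier to read, while sorting by count (as in the notebook) highlights the busiest days.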

## Evaluation

In [12]:
s = pd.Series([load_t, q1_t, q2_t, q3_t], index = ["data_loading", "query_1", "query_2", "query_3"])
print(f"total time taken: {s.sum()} sec")
s

total time taken: 0.7754817008972168 sec


data_loading    0.521625
query_1         0.075056
query_2         0.035200
query_3         0.143601
dtype: float64
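
To see where the time goes, the per-stage timings reported above can be turned into shares of the total. A minimal sketch in plain Python, using the four printed values (rounded as shown in the output):

```python
# Per-stage timings in seconds, copied from the output above.
timings = {
    "data_loading": 0.521625,
    "query_1": 0.075056,
    "query_2": 0.035200,
    "query_3": 0.143601,
}

total = sum(timings.values())

# Fraction of the total run time spent in each stage.
shares = {name: seconds / total for name, seconds in timings.items()}

for name, share in shares.items():
    print(f"{name:<14}{share:6.1%}")
```

In this run, data loading accounts for roughly two thirds of the total, so the three queries together take well under a third of a second.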

## Conclusion

🚀 Execution time could be reduced **from 11.37 sec to 0.79 sec (~14.4x speedup)** with no migration cost (no pandas-to-FireDucks code translation and no new library to learn) and no extra hardware cost (no high-spec GPU system needed).
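
The speedup figure is simply the ratio of the two totals quoted above. A short sketch of the arithmetic, where `before_sec` and `after_sec` are illustrative names for those two numbers:

```python
# Totals quoted in the conclusion, in seconds.
before_sec = 11.37
after_sec = 0.79

# Speedup is the ratio of the baseline time to the accelerated time.
speedup = before_sec / after_sec

# Fraction of the original run time that was eliminated.
time_saved = 1 - after_sec / before_sec

print(f"speedup: {speedup:.1f}x, time saved: {time_saved:.0%}")
```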

🚀🚀 You may also want to check out the other [benchmarks](https://fireducks-dev.github.io/docs/benchmarks/).