## Pandas vs Polars — Run-time and Memory Comparison

## Install required libraries

In [1]:
! pip3 install polars
! pip3 install pandas



## Import Packages

In [2]:
import polars as pl
import pandas as pd

## Read CSV

In [3]:
%timeit pd.read_csv("../dataset/employee_dataset.csv")

2.01 ms ± 71.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
%timeit pl.read_csv("../dataset/employee_dataset.csv")

324 µs ± 7.37 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [5]:
df_pd = pd.read_csv("../dataset/employee_dataset.csv")
df_pl = pl.read_csv("../dataset/employee_dataset.csv")
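
The `%timeit` magic used in this notebook is only available inside IPython/Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `time_call` and the stand-in `workload` below are hypothetical, not part of the original notebook; in practice you would pass something like `lambda: pd.read_csv(path)` instead.

```python
import statistics
import timeit


def time_call(fn, repeat=7, number=100):
    """Return (mean, stdev) in seconds per call, similar in spirit to %timeit."""
    # timeit.repeat returns one total per run, each covering `number` calls.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


def workload():
    # Stand-in workload (assumption); swap in the call you want to measure.
    return sorted(range(1000, 0, -1))


mean, std = time_call(workload)
print(f"{mean * 1e6:.1f} µs ± {std * 1e6:.2f} µs per loop")
```

The defaults (`repeat=7`, `number=100`) mirror the "7 runs, 100 loops each" shown in the pandas output above; unlike `%timeit`, this sketch does not pick the loop count automatically.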

## Write CSV

In [6]:
%timeit df_pd.to_csv("dataset_dummy_pandas.csv")

2.14 ms ± 75.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
%timeit df_pl.write_csv("dataset_dummy_polars.csv")

700 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
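
One detail worth keeping in mind when reading these two timings: pandas' `to_csv` writes the row index as an extra column by default, while a Polars DataFrame has no index, so the two calls above do not produce identical files. A minimal sketch of the difference, using a tiny made-up frame rather than the employee dataset:

```python
import pandas as pd

# Tiny hypothetical frame, not the employee dataset used above.
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Credits": [3, 1]})

# With no path given, to_csv returns the CSV text as a string.
with_index = df.to_csv()                 # default: index=True
without_index = df.to_csv(index=False)   # closer to what Polars writes

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header contains only the data columns
```

Passing `index=False` to the pandas call would make the written output closer to a like-for-like comparison; the timing reported above was measured with the default.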


## Memory Usage

In [8]:
df_pl.estimated_size()  # estimated size in bytes

155580

In [9]:
df_pd.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                1000 non-null   object 
 1   Company_Name        1000 non-null   object 
 2   Employee_Job_Title  1000 non-null   object 
 3   Employee_City       1000 non-null   object 
 4   Employee_Country    1000 non-null   object 
 5   Employee_Salary     1000 non-null   int64  
 6   Employment_Status   1000 non-null   object 
 7   Employee_Rating     1000 non-null   float64
 8   Credits             1000 non-null   int64  
dtypes: float64(1), int64(2), object(6)
memory usage: 438.4 KB
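
The two outputs above use different units: Polars reports bytes, while `DataFrame.info` reports KB (pandas formats sizes with 1 KB = 1024 bytes). A short sketch that puts the two reported figures on the same scale; both numbers are copied from the outputs above, and both are estimates produced by different methods, so the ratio is only indicative:

```python
# Figures copied from the two cell outputs above.
polars_bytes = 155_580   # df_pl.estimated_size()
pandas_kb = 438.4        # df_pd.info(memory_usage="deep")

# Convert the Polars figure to KB using 1 KB = 1024 bytes.
polars_kb = polars_bytes / 1024
ratio = pandas_kb / polars_kb

print(f"Polars: {polars_kb:.1f} KB, pandas: {pandas_kb:.1f} KB, ratio: {ratio:.2f}x")
```

On this dataset the pandas frame is reported as roughly 2.9 times larger than the Polars one.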


## Selecting Columns

In [10]:
%timeit df_pd[["Name", "Employee_Rating"]]

189 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [11]:
%timeit df_pl[["Name", "Employee_Rating"]]

2.15 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


## Filtering

In [12]:
%timeit df_pd[df_pd.Credits > 2]

147 µs ± 844 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [13]:
%timeit df_pl.filter(pl.col('Credits') > 2)

111 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


## Grouping

In [14]:
%timeit df_pd.groupby("Company_Name").Employee_Salary.mean().reset_index()

389 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [15]:
# Note: newer Polars releases renamed `groupby` to `group_by`.
%timeit df_pl.group_by("Company_Name").agg(pl.col("Employee_Salary").mean())

478 µs ± 9.31 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Sorting

In [16]:
%timeit df_pd.sort_values("Employee_Salary")

116 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [17]:
%timeit df_pl.sort("Employee_Salary")

162 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
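
## Summary

To read the results side by side, the mean run times reported in the cells above can be turned into pandas/Polars ratios. The sketch below only restates those reported means (converted to microseconds); the dictionary and the helper name `speedups` are illustrative. Since the dataset has just 1,000 rows, these are micro-benchmarks and the ratios may look quite different on larger data.

```python
# Mean run times reported above, in microseconds: (pandas, polars).
reported_us = {
    "read_csv": (2010.0, 324.0),
    "write_csv": (2140.0, 700.0),
    "select columns": (189.0, 2.15),
    "filter": (147.0, 111.0),
    "group by": (389.0, 478.0),
    "sort": (116.0, 162.0),
}


def speedups(timings):
    """Map each operation to pandas_time / polars_time (>1 means Polars was faster)."""
    return {op: pandas_us / polars_us for op, (pandas_us, polars_us) in timings.items()}


ratios = speedups(reported_us)
for op, ratio in ratios.items():
    faster = "Polars" if ratio > 1 else "pandas"
    print(f"{op:<15} {ratio:6.2f}x  ({faster} faster)")
```

In this run Polars was faster for reading, writing, column selection, and filtering, while pandas was faster for grouping and sorting.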
