
# What is Polars and Why is it Faster Than Pandas?

Polars is a DataFrame library designed for parallel execution. It is written from the ground up in Rust and ships a Python package, which makes it a potential alternative to Pandas.
> Polars has two different APIs: an eager API and a lazy API. Eager execution runs each operation immediately, as Pandas does. Lazy execution defers the work until the result is requested, which lets Polars optimize the whole query and skip unnecessary computation.

Polars is often faster than Pandas, largely because it can run operations in parallel across all available cores on your machine. Its syntax differs from Pandas, and Polars code is usually a little longer than the equivalent Pandas code. If you need to do a lot of data processing on large datasets, Polars can be a good alternative to Pandas.

# Database-like ops benchmark

![](https://www.dominodatalab.com/hs-fs/hubfs/Imported_Blog_Media/polars_benchmark.png?width=774&name=polars_benchmark.png)

# Comparison between Pandas and Polars
At first glance, Pandas and Polars (eager API) are similar regarding syntax because of their shared main building blocks: Series and DataFrames.

This section explores the main aspects of how the Polars package differs from Pandas regarding syntax and execution time:

- Reading Data
- Selecting and Filtering Data
- Creating New Columns
- Grouping and Aggregation
- Missing Data

In [23]:
!pip install polars --q

## Reading Data
Reading a CSV file in Polars will feel familiar because the library provides a read_csv() function, just like Pandas:

In [24]:
%%time
import pandas as pd

# Pandas
df_pd = pd.read_csv("https://raw.githubusercontent.com/RandomFractals/chicago-crimes/main/data/crimes-2022.csv")

CPU times: user 1.55 s, sys: 259 ms, total: 1.81 s
Wall time: 3.2 s


In [25]:
%%time
import polars as pl

# Polars
df_pl = pl.read_csv("https://raw.githubusercontent.com/RandomFractals/chicago-crimes/main/data/crimes-2022.csv")

CPU times: user 507 ms, sys: 248 ms, total: 755 ms
Wall time: 1.31 s


## Selecting and Filtering Data
The first major difference between Pandas and Polars is that Polars does not use an index [1].

In [26]:
%%time
# Pandas
df_pd[['ID', 'Case Number', 'Date']] 

CPU times: user 10 ms, sys: 1.87 ms, total: 11.9 ms
Wall time: 13.2 ms


Unnamed: 0,ID,Case Number,Date
0,12757446,JF313117,07/08/2022 10:38:00 AM
1,12755229,JF310109,07/08/2022 03:21:00 AM
2,12763369,JF320208,07/16/2022 10:55:00 PM
3,12766036,JF323691,07/19/2022 04:00:00 PM
4,12758668,JF314314,07/12/2022 06:30:00 AM
...,...,...,...
215546,12759190,JF315350,07/09/2022 07:00:00 AM
215547,12765045,JF321506,07/08/2022 02:00:00 PM
215548,12742026,JF294494,06/24/2022 03:43:00 PM
215549,12757420,JF313061,07/01/2022 12:00:00 AM


In [5]:
%%time
# The Pandas-style selection above also runs in Polars,
# but the idiomatic Polars form is:
df_pl.select(pl.col(['ID', 'Case Number', 'Date'])) 

CPU times: user 2.1 ms, sys: 22 µs, total: 2.12 ms
Wall time: 2.41 ms


ID,Case Number,Date
i64,str,str
12757446,"""JF313117""","""07/08/2022 10:…"
12755229,"""JF310109""","""07/08/2022 03:…"
12763369,"""JF320208""","""07/16/2022 10:…"
12766036,"""JF323691""","""07/19/2022 04:…"
12758668,"""JF314314""","""07/12/2022 06:…"
12765741,"""JF323255""","""06/27/2022 09:…"
12759104,"""JF315164""","""07/12/2022 01:…"
12756232,"""JF311569""","""07/09/2022 09:…"
12761035,"""JF317440""","""07/13/2022 07:…"
12757374,"""JF313010""","""07/01/2022 09:…"


In [20]:
%%time
# Pandas
df_pd.query('Year > 2021')

CPU times: user 115 ms, sys: 34.4 ms, total: 149 ms
Wall time: 153 ms


Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,new_year
0,12757446,JF313117,07/08/2022 10:38:00 AM,087XX S WINCHESTER AVE,0820,THEFT,$500 AND UNDER,RESIDENCE,False,True,...,71,06,1165137.0,1846655.0,2022,11/12/2022 03:46:21 PM,41.734817,-87.670596,"(41.734817155, -87.670595647)",2022
1,12755229,JF310109,07/08/2022 03:21:00 AM,056XX N SPAULDING AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,,False,False,...,13,11,1153364.0,1937188.0,2022,11/12/2022 03:46:21 PM,41.983491,-87.711324,"(41.983490742, -87.711324421)",2022
2,12763369,JF320208,07/16/2022 10:55:00 PM,038XX N SHEFFIELD AVE,0330,ROBBERY,AGGRAVATED,SIDEWALK,True,False,...,6,03,1168917.0,1925693.0,2022,11/12/2022 03:46:21 PM,41.951624,-87.654458,"(41.951623924, -87.654458486)",2022
3,12766036,JF323691,07/19/2022 04:00:00 PM,113XX S PARNELL AVE,1792,KIDNAPPING,CHILD ABDUCTION / STRANGER,SIDEWALK,False,False,...,49,26,1174663.0,1829726.0,2022,11/12/2022 03:46:21 PM,41.688155,-87.636199,"(41.688154968, -87.636198645)",2022
4,12758668,JF314314,07/12/2022 06:30:00 AM,044XX W WALTON ST,0610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,False,True,...,23,05,1146679.0,1905959.0,2022,11/12/2022 03:46:21 PM,41.897926,-87.736710,"(41.897926219, -87.736710223)",2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215546,12759190,JF315350,07/09/2022 07:00:00 AM,098XX S CALHOUN AVE,0820,THEFT,$500 AND UNDER,RESIDENCE,False,False,...,51,06,1194763.0,1840111.0,2022,11/12/2022 03:46:21 PM,41.716183,-87.562275,"(41.71618255, -87.562274818)",2022
215547,12765045,JF321506,07/08/2022 02:00:00 PM,006XX S MICHIGAN AVE,0810,THEFT,OVER $500,HOTEL / MOTEL,False,False,...,32,06,1177377.0,1897431.0,2022,11/12/2022 03:46:21 PM,41.873884,-87.624219,"(41.873883785, -87.624218932)",2022
215548,12742026,JF294494,06/24/2022 03:43:00 PM,075XX N CLARK ST,0860,THEFT,RETAIL THEFT,DEPARTMENT STORE,False,False,...,1,06,1162907.0,1949949.0,2022,11/12/2022 03:46:21 PM,42.018312,-87.675867,"(42.018311737, -87.675866628)",2022
215549,12757420,JF313061,07/01/2022 12:00:00 AM,001XX N MAY ST,0281,CRIMINAL SEXUAL ASSAULT,NON-AGGRAVATED,HOTEL / MOTEL,False,False,...,28,02,1168809.0,1900813.0,2022,11/12/2022 03:46:21 PM,41.883354,-87.655578,"(41.883354174, -87.655578272)",2022


In [7]:
%%time
# Polars
df_pl.filter(pl.col('Year') > 2021)

CPU times: user 16.9 ms, sys: 13.7 ms, total: 30.6 ms
Wall time: 41.1 ms


ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,bool,bool,i64,i64,i64,i64,str,i64,i64,i64,str,f64,f64,str
12757446,"""JF313117""","""07/08/2022 10:…","""087XX S WINCHE…","""0820""","""THEFT""","""$500 AND UNDER…","""RESIDENCE""",false,true,2221,22,21,71,"""06""",1165137,1846655,2022,"""11/12/2022 03:…",41.734817,-87.670596,"""(41.734817155,…"
12755229,"""JF310109""","""07/08/2022 03:…","""056XX N SPAULD…","""1154""","""DECEPTIVE PRAC…","""FINANCIAL IDEN…",,false,false,1711,17,39,13,"""11""",1153364,1937188,2022,"""11/12/2022 03:…",41.983491,-87.711324,"""(41.983490742,…"
12763369,"""JF320208""","""07/16/2022 10:…","""038XX N SHEFFI…","""0330""","""ROBBERY""","""AGGRAVATED""","""SIDEWALK""",true,false,1923,19,46,6,"""03""",1168917,1925693,2022,"""11/12/2022 03:…",41.951624,-87.654458,"""(41.951623924,…"
12766036,"""JF323691""","""07/19/2022 04:…","""113XX S PARNEL…","""1792""","""KIDNAPPING""","""CHILD ABDUCTIO…","""SIDEWALK""",false,false,2233,22,34,49,"""26""",1174663,1829726,2022,"""11/12/2022 03:…",41.688155,-87.636199,"""(41.688154968,…"
12758668,"""JF314314""","""07/12/2022 06:…","""044XX W WALTON…","""0610""","""BURGLARY""","""FORCIBLE ENTRY…","""RESIDENCE""",false,true,1111,11,37,23,"""05""",1146679,1905959,2022,"""11/12/2022 03:…",41.897926,-87.73671,"""(41.897926219,…"
12765741,"""JF323255""","""06/27/2022 09:…","""002XX S LAVERG…","""1320""","""CRIMINAL DAMAG…","""TO VEHICLE""","""RESIDENCE""",false,false,1533,15,28,25,"""14""",1143292,1898696,2022,"""11/12/2022 03:…",41.87806,-87.749332,"""(41.878059641,…"
12759104,"""JF315164""","""07/12/2022 01:…","""012XX N ARTESI…","""0820""","""THEFT""","""$500 AND UNDER…","""STREET""",false,false,1423,14,26,24,"""06""",1159848,1908228,2022,"""11/12/2022 03:…",41.903891,-87.688279,"""(41.903891052,…"
12756232,"""JF311569""","""07/09/2022 09:…","""020XX W NORTH …","""0460""","""BATTERY""","""SIMPLE""","""HOTEL / MOTEL""",true,false,1434,14,2,24,"""08B""",1162488,1910644,2022,"""11/12/2022 03:…",41.910466,-87.678514,"""(41.910465849,…"
12761035,"""JF317440""","""07/13/2022 07:…","""029XX N SACRAM…","""0917""","""MOTOR VEHICLE …","""CYCLE, SCOOTER…","""STREET""",false,false,1411,14,33,21,"""07""",1155869,1919370,2022,"""11/12/2022 03:…",41.934547,-87.702594,"""(41.93454677, …"
12757374,"""JF313010""","""07/01/2022 09:…","""043XX S PACKER…","""1153""","""DECEPTIVE PRAC…","""FINANCIAL IDEN…","""OTHER COMMERCI…",false,false,924,9,11,61,"""11""",1168167,1876163,2022,"""11/12/2022 03:…",41.815726,-87.658647,"""(41.815726254,…"


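## Creating New Columns
Creating a new column in Polars also differs from what you might be used to in Pandas. Instead of assigning to a column label, you call the .with_columns() method, which accepts one or more expressions and returns a new DataFrame. Older Polars releases also offered .with_column() for a single column, but it has since been deprecated in favor of .with_columns().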

In [8]:
%%time
# Pandas
df_pd["new_year"] = df_pd["Year"] 

CPU times: user 1.53 ms, sys: 0 ns, total: 1.53 ms
Wall time: 1.54 ms


In [9]:
%%time
# Polars
df_pl.with_columns([(pl.col("Year")).alias("new_year")])


CPU times: user 429 µs, sys: 0 ns, total: 429 µs
Wall time: 452 µs


ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,new_year
i64,str,str,str,str,str,str,str,bool,bool,i64,i64,i64,i64,str,i64,i64,i64,str,f64,f64,str,i64
12757446,"""JF313117""","""07/08/2022 10:…","""087XX S WINCHE…","""0820""","""THEFT""","""$500 AND UNDER…","""RESIDENCE""",false,true,2221,22,21,71,"""06""",1165137,1846655,2022,"""11/12/2022 03:…",41.734817,-87.670596,"""(41.734817155,…",2022
12755229,"""JF310109""","""07/08/2022 03:…","""056XX N SPAULD…","""1154""","""DECEPTIVE PRAC…","""FINANCIAL IDEN…",,false,false,1711,17,39,13,"""11""",1153364,1937188,2022,"""11/12/2022 03:…",41.983491,-87.711324,"""(41.983490742,…",2022
12763369,"""JF320208""","""07/16/2022 10:…","""038XX N SHEFFI…","""0330""","""ROBBERY""","""AGGRAVATED""","""SIDEWALK""",true,false,1923,19,46,6,"""03""",1168917,1925693,2022,"""11/12/2022 03:…",41.951624,-87.654458,"""(41.951623924,…",2022
12766036,"""JF323691""","""07/19/2022 04:…","""113XX S PARNEL…","""1792""","""KIDNAPPING""","""CHILD ABDUCTIO…","""SIDEWALK""",false,false,2233,22,34,49,"""26""",1174663,1829726,2022,"""11/12/2022 03:…",41.688155,-87.636199,"""(41.688154968,…",2022
12758668,"""JF314314""","""07/12/2022 06:…","""044XX W WALTON…","""0610""","""BURGLARY""","""FORCIBLE ENTRY…","""RESIDENCE""",false,true,1111,11,37,23,"""05""",1146679,1905959,2022,"""11/12/2022 03:…",41.897926,-87.73671,"""(41.897926219,…",2022
12765741,"""JF323255""","""06/27/2022 09:…","""002XX S LAVERG…","""1320""","""CRIMINAL DAMAG…","""TO VEHICLE""","""RESIDENCE""",false,false,1533,15,28,25,"""14""",1143292,1898696,2022,"""11/12/2022 03:…",41.87806,-87.749332,"""(41.878059641,…",2022
12759104,"""JF315164""","""07/12/2022 01:…","""012XX N ARTESI…","""0820""","""THEFT""","""$500 AND UNDER…","""STREET""",false,false,1423,14,26,24,"""06""",1159848,1908228,2022,"""11/12/2022 03:…",41.903891,-87.688279,"""(41.903891052,…",2022
12756232,"""JF311569""","""07/09/2022 09:…","""020XX W NORTH …","""0460""","""BATTERY""","""SIMPLE""","""HOTEL / MOTEL""",true,false,1434,14,2,24,"""08B""",1162488,1910644,2022,"""11/12/2022 03:…",41.910466,-87.678514,"""(41.910465849,…",2022
12761035,"""JF317440""","""07/13/2022 07:…","""029XX N SACRAM…","""0917""","""MOTOR VEHICLE …","""CYCLE, SCOOTER…","""STREET""",false,false,1411,14,33,21,"""07""",1155869,1919370,2022,"""11/12/2022 03:…",41.934547,-87.702594,"""(41.93454677, …",2022
12757374,"""JF313010""","""07/01/2022 09:…","""043XX S PACKER…","""1153""","""DECEPTIVE PRAC…","""FINANCIAL IDEN…","""OTHER COMMERCI…",false,false,924,9,11,61,"""11""",1168167,1876163,2022,"""11/12/2022 03:…",41.815726,-87.658647,"""(41.815726254,…",2022


In [10]:
%%time
# Polars for multiple columns
# df.with_columns([(pl.col("col") * 10).alias("new_col"), ...])

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs


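## Grouping and Aggregation
Grouping and aggregation differ slightly in syntax between Pandas and Polars, but both follow the same pattern: a grouping call followed by .agg(). The Polars version used here spells the grouping method .groupby(), like Pandas; newer Polars releases rename it to .group_by().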

In [11]:
%%time
# Pandas
df_pd.groupby('Year')['Arrest'].agg('value_counts')

CPU times: user 20.1 ms, sys: 853 µs, total: 21 ms
Wall time: 26.3 ms


Year  Arrest
2022  False     191418
      True       24133
Name: Arrest, dtype: int64

In [12]:
%%time
# Polars
# df.groupby('col1').agg([pl.col('col2').mean()]) # As suggested in Polars docs
df_pl.groupby('Year').agg([pl.col(['Arrest']).value_counts()]) # Shorter

CPU times: user 14.3 ms, sys: 0 ns, total: 14.3 ms
Wall time: 15 ms


Year,Arrest
i64,list[struct[2]]
2022,"[{false,191418}, {true,24133}]"


## Missing Data
Another major difference between Pandas and Polars is that Pandas uses NaN values to indicate missing values, while Polars uses null [1].

In [13]:
%%time
# Pandas
df_pd['Location'].fillna(-999)

CPU times: user 25.5 ms, sys: 0 ns, total: 25.5 ms
Wall time: 31.9 ms


0         (41.734817155, -87.670595647)
1         (41.983490742, -87.711324421)
2         (41.951623924, -87.654458486)
3         (41.688154968, -87.636198645)
4         (41.897926219, -87.736710223)
                      ...              
215546     (41.71618255, -87.562274818)
215547    (41.873883785, -87.624218932)
215548    (42.018311737, -87.675866628)
215549    (41.883354174, -87.655578272)
215550    (41.778619322, -87.683625816)
Name: Location, Length: 215551, dtype: object

In [14]:
%%time
# Polars
# df_pl.with_columns(pl.col('col2').fill_null(pl.lit(-999))) # As suggested in Polars docs
df_pl.with_columns(pl.col('Location').fill_null(-999)) # Shorter

CPU times: user 13.4 ms, sys: 6.51 ms, total: 19.9 ms
Wall time: 29.7 ms




ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,bool,bool,i64,i64,i64,i64,str,i64,i64,i64,str,f64,f64,str
12757446,"""JF313117""","""07/08/2022 10:…","""087XX S WINCHE…","""0820""","""THEFT""","""$500 AND UNDER…","""RESIDENCE""",false,true,2221,22,21,71,"""06""",1165137,1846655,2022,"""11/12/2022 03:…",41.734817,-87.670596,"""(41.734817155,…"
12755229,"""JF310109""","""07/08/2022 03:…","""056XX N SPAULD…","""1154""","""DECEPTIVE PRAC…","""FINANCIAL IDEN…",,false,false,1711,17,39,13,"""11""",1153364,1937188,2022,"""11/12/2022 03:…",41.983491,-87.711324,"""(41.983490742,…"
12763369,"""JF320208""","""07/16/2022 10:…","""038XX N SHEFFI…","""0330""","""ROBBERY""","""AGGRAVATED""","""SIDEWALK""",true,false,1923,19,46,6,"""03""",1168917,1925693,2022,"""11/12/2022 03:…",41.951624,-87.654458,"""(41.951623924,…"
12766036,"""JF323691""","""07/19/2022 04:…","""113XX S PARNEL…","""1792""","""KIDNAPPING""","""CHILD ABDUCTIO…","""SIDEWALK""",false,false,2233,22,34,49,"""26""",1174663,1829726,2022,"""11/12/2022 03:…",41.688155,-87.636199,"""(41.688154968,…"
12758668,"""JF314314""","""07/12/2022 06:…","""044XX W WALTON…","""0610""","""BURGLARY""","""FORCIBLE ENTRY…","""RESIDENCE""",false,true,1111,11,37,23,"""05""",1146679,1905959,2022,"""11/12/2022 03:…",41.897926,-87.73671,"""(41.897926219,…"
12765741,"""JF323255""","""06/27/2022 09:…","""002XX S LAVERG…","""1320""","""CRIMINAL DAMAG…","""TO VEHICLE""","""RESIDENCE""",false,false,1533,15,28,25,"""14""",1143292,1898696,2022,"""11/12/2022 03:…",41.87806,-87.749332,"""(41.878059641,…"
12759104,"""JF315164""","""07/12/2022 01:…","""012XX N ARTESI…","""0820""","""THEFT""","""$500 AND UNDER…","""STREET""",false,false,1423,14,26,24,"""06""",1159848,1908228,2022,"""11/12/2022 03:…",41.903891,-87.688279,"""(41.903891052,…"
12756232,"""JF311569""","""07/09/2022 09:…","""020XX W NORTH …","""0460""","""BATTERY""","""SIMPLE""","""HOTEL / MOTEL""",true,false,1434,14,2,24,"""08B""",1162488,1910644,2022,"""11/12/2022 03:…",41.910466,-87.678514,"""(41.910465849,…"
12761035,"""JF317440""","""07/13/2022 07:…","""029XX N SACRAM…","""0917""","""MOTOR VEHICLE …","""CYCLE, SCOOTER…","""STREET""",false,false,1411,14,33,21,"""07""",1155869,1919370,2022,"""11/12/2022 03:…",41.934547,-87.702594,"""(41.93454677, …"
12757374,"""JF313010""","""07/01/2022 09:…","""043XX S PACKER…","""1153""","""DECEPTIVE PRAC…","""FINANCIAL IDEN…","""OTHER COMMERCI…",false,false,924,9,11,61,"""11""",1168167,1876163,2022,"""11/12/2022 03:…",41.815726,-87.658647,"""(41.815726254,…"


In [15]:
import pandas as pd

# Create a sample DataFrame with 1,000 rows
df = pd.DataFrame({'col1': range(1000), 'col2': ['abc'] * 1000})

# Get the total size of the DataFrame in memory (in bytes)
total_size = df.memory_usage(deep=True).sum()

# Calculate the average memory usage per row (in bytes)
row_size = total_size / len(df)

# Calculate the memory usage for 1 million rows (in bytes)
million_rows_size = row_size * 1000000

print(f"Average memory usage per row: {row_size:.2f} bytes")
print(f"Memory usage for 1 million rows: {million_rows_size:.2f} bytes")


Average memory usage per row: 68.13 bytes
Memory usage for 1 million rows: 68128000.00 bytes
