#### Top

# Chapter 6 Sorting and Ordering in Polars

* [6.0 Imports and Setup](#6.0-Imports-and-Setup)
* [6.1 Introduction](#6.1-Introduction)
* [6.2 Loading the Fuel Economy Dataset](#6.2-Loading-the-Fuel-Economy-Dataset)
* [6.3 Sorting by a Single Column](#6.3-Sorting-by-a-Single-Column)
* [6.4 Sorting by Multiple Columns](#6.4-Sorting-by-Multiple-Columns)
* [6.5 Specifying Custom Ordering for Categorical Columns](#6.5-Specifying-Custom-Ordering-for-Categorical-Columns)
* [6.6 Enums and Ordering](#6.6-Enums-and-Ordering)
* [6.7 Group Ordering and maintain_order](#6.7-Group-Ordering-and-maintain_order)
* [6.8 Stable Sorting](#6.8-Stable-Sorting)
* [6.9 Sorting and Filtering](#6.9-Sorting-and-Filtering)

---
# 6.0 Imports and Setup

[back to Top](#Top)

In [1]:
import polars as pl
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
import chardet
import pprint as pp
import hvplot.polars
hvplot.extension('matplotlib')

matplotlib_inline.backend_inline.set_matplotlib_formats("retina")
pd.options.mode.copy_on_write = True
print(pd.options.mode.copy_on_write)
pl.Config.set_verbose(True)
pl.show_versions()

def HR():
    print("-"*40)

@pl.Config(tbl_cols=-1, ascii_tables=True)
def tight_layout(df: pl.DataFrame, n=5) -> None:
    with pl.Config(tbl_cols=-1, fmt_str_lengths=n):
        print(df)

def detect_encoding(filename: str) -> str:
    """Return the most probable character encoding for a file."""
    with open(filename, "rb") as f:
        raw_data = f.read()
        result = chardet.detect(raw_data)
        return result["encoding"]

True
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            macOS-12.7.6-x86_64-i386-64bit
Python:              3.11.5 (main, Jan 16 2024, 17:25:53) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_manager  1.1.0
altair               5.4.0
cloudpickle          3.0.0
connectorx           0.3.3
deltalake            0.19.1
fastexcel            0.11.6
fsspec               2023.12.2
gevent               24.2.1
great_tables         0.10.0
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.0.2
openpyxl             3.1.5
pandas               2.2.2
pyarrow              17.0.0
pydantic             2.8.2
pyiceberg            0.6.1
sqlalchemy           2.0.32
torch                <not installed>
xlsx2csv             0.8.3
xlsxwriter           3.2.0


---
# 6.1 Introduction

[back to Top](#Top)

Cover these points:

* How sorting is a fundamental operation in data analysis.
* Sorting criteria and how to implement them in Polars.
* Performance aspects of sorting.
* How to optimize sorting operations for large datasets.

---
# 6.2 Loading the Fuel Economy Dataset

[back to Top](#Top)


In [2]:
path = 'data/vehicles.csv'
raw = pl.read_csv(path, null_values=['NA'])
print(raw.shape)
print(f"{raw.estimated_size(unit='mb'):.2f}MB")

#@pl.StringCache()
def tweak_auto(df):
    cols = ['year', 'make', 'model', 'displ', 'cylinders', 'trany', 
           'drive', 'VClass', 'fuelType', 'barrels08', 'city08', 
           'highway08', 'createdOn']
    return (
        df
        .select(pl.col(cols))
        .with_columns( 
            pl.col('year').cast(pl.Int16),
            pl.col(['cylinders', 'highway08', 'city08']).cast(pl.UInt8),
            pl.col(['displ', 'barrels08']).cast(pl.Float32),
            pl.col(['make', 'model', 'VClass', 'drive', 'fuelType']).cast(pl.Categorical),
            pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'),
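            # NOTE: .str.contains() returns Boolean, so filling nulls with the
            # string 'Automatic' upcasts the column to String (see the `str`
            # dtype of is_automatic in the outputs below).
            # .fill_null(True) would keep the column Boolean.
            is_automatic=pl.col('trany')
               .str.contains('Automatic')
               .fill_null('Automatic'),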
            num_gears=pl.col('trany')
                .str.extract(r'(\d+)')
                .cast(pl.UInt8)
                .fill_null(6)
        )
    )

autos = tweak_auto(raw)
print(type(autos))
print(autos.shape)
print(f"{autos.estimated_size(unit='mb'):.2f}MB")

avg line length: 434.78027
std. dev. line length: 23.885818
initial row estimate: 47850
no. of chunks: 4 processed by: 4 threads.


(48202, 84)
29.34MB
<class 'polars.dataframe.frame.DataFrame'>
(48202, 15)
2.89MB




---
# 6.3 Sorting by a Single Column

[back to Top](#Top)

* The primary method for sorting data in Polars is `.sort()`.
* It can sort data by single or multiple columns.
* Regular Python uses a `key` argument for specifying a custom sorting function.
* Polars uses a different approach, where instead of passing a function, we pass a column or expression.

In [3]:
names = ['Al', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']

# Python sorting, sort lexically via sorted()
sorted(names)

['Al', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']

In [4]:
# Python, sort by length of the strings by passing a function to key argument
sorted(names, key=len)

['Al', 'Bob', 'Dan', 'Edith', 'Frank', 'Charlie']

* In Polars, instead of using `key`, we use the `by` parameter.
* `by` takes a column name or an expression; Polars evaluates the expression for every row and sorts the rows by the resulting values.

For example, to sort the *autos* dataframe by the length of the *make* column:
* First convert the *make* column from a categorical column to a string column.
* Then use `.sort()` and pass in the length of the *make* column as the `by` parameter.

In [5]:
# test to make sure we get the correct length of 'make'
(
    autos
    .select(
        make_len_chars=pl.col('make').cast(pl.String).str.len_chars()
    )
)

make_len_chars
u32
10
7
5
5
6
…
6
6
6
6


---
* We can also sort the autos dataframe by *year* without providing a custom sorting expression.

In [6]:
autos.sort('year')

year,make,model,displ,cylinders,trany,drive,VClass,fuelType,barrels08,city08,highway08,createdOn,is_automatic,num_gears
i16,cat,cat,f32,u8,str,cat,cat,cat,f32,u8,u8,datetime[μs],str,u8
1984,"""Alfa Romeo""","""Spider Veloce 2000""",2.0,4,"""Manual 5-spd""",,"""Two Seaters""","""Regular""",14.167143,18,25,2013-01-01 00:00:00,"""false""",5
1984,"""Bertone""","""X1/9""",1.5,4,"""Manual 5-spd""",,"""Two Seaters""","""Regular""",13.523182,20,26,2013-01-01 00:00:00,"""false""",5
1984,"""Chevrolet""","""Corvette""",5.7,8,"""Automatic 4-spd""",,"""Two Seaters""","""Regular""",19.834,13,20,2013-01-01 00:00:00,"""true""",4
1984,"""Chevrolet""","""Corvette""",5.7,8,"""Manual 4-spd""",,"""Two Seaters""","""Regular""",19.834,13,20,2013-01-01 00:00:00,"""false""",4
1984,"""Nissan""","""300ZX""",3.0,6,"""Automatic 4-spd""",,"""Two Seaters""","""Regular""",17.500587,15,20,2013-01-01 00:00:00,"""true""",4
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2025,"""Land Rover""","""Defender 130 Outbound""",3.0,6,"""Automatic (S8)""","""4-Wheel Drive""","""Standard Sport Utility Vehicle…","""Premium""",16.528334,16,19,2024-10-17 00:00:00,"""true""",8
2025,"""Nissan""","""Pathfinder 4WD Platinum""",3.5,6,"""Automatic (S9)""","""4-Wheel Drive""","""Standard Sport Utility Vehicle…","""Regular""",13.523182,20,25,2024-10-17 00:00:00,"""true""",9
2025,"""Porsche""","""Cayenne Turbo GT""",4.0,8,"""Automatic (S8)""","""All-Wheel Drive""","""Standard Sport Utility Vehicle…","""Premium""",17.500587,15,20,2024-10-17 00:00:00,"""true""",8
2025,"""Porsche""","""Cayenne GTS""",4.0,8,"""Automatic (S8)""","""All-Wheel Drive""","""Standard Sport Utility Vehicle…","""Premium""",16.528334,15,22,2024-10-17 00:00:00,"""true""",8


* Sort the *autos* dataframe by the average of city and highway miles per gallon.

In [7]:
(
    autos
    .sort(
        (pl.col('city08') + pl.col('highway08'))/2
    )
    .select(['city08', 'highway08'])
)

city08,highway08
u8,u8
132,125
130,128
136,123
136,123
136,123
…,…
137,112
131,120
126,125
129,126


In [8]:
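# GB: if we sort by a calculated combination of columns,
# it is much clearer if we simply create that column
# and explicitly sort on that.

# Some of the output for total_avg looks wrong: city08 and highway08
# are UInt8, so their sum wraps around at 256 before the division
# (132 + 125 = 257 wraps to 1, and 1 / 2 = 0.5).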
(
    autos
    .with_columns(
        total_avg=(pl.col('city08') + pl.col('highway08'))/2
    )
    .sort(by='total_avg')
    .select(['city08', 'highway08', 'total_avg'])
)

city08,highway08,total_avg
u8,u8,f64
132,125,0.5
130,128,1.0
136,123,1.5
136,123,1.5
136,123,1.5
…,…,…
137,112,124.5
131,120,125.5
126,125,125.5
129,126,127.5


* Sort by a categorical column, *VClass*.

In [9]:
(
    autos
    .sort(by='VClass')
    .select(['year', 'make', 'model', 'VClass'])
)

year,make,model,VClass
i16,cat,cat,cat
1985,"""Alfa Romeo""","""Spider Veloce 2000""","""Two Seaters"""
1985,"""Ferrari""","""Testarossa""","""Two Seaters"""
1994,"""Acura""","""NSX""","""Two Seaters"""
1994,"""Acura""","""NSX""","""Two Seaters"""
1994,"""Alfa Romeo""","""Spider""","""Two Seaters"""
…,…,…,…
2025,"""Ford""","""Escape FWD""","""Small Sport Utility Vehicle 2W…"
2025,"""GMC""","""Terrain FWD""","""Small Sport Utility Vehicle 2W…"
2025,"""Hyundai""","""Venue""","""Small Sport Utility Vehicle 2W…"
2025,"""Lincoln""","""Corsair FWD""","""Small Sport Utility Vehicle 2W…"


* Sort by the average of the *city08* for each make.
* First check we can compute the average of *city08* for each make.
* We don't want the results to be aggregated by *make*.
* So, we use `.over()` to compute for the makes, but preserve the original rows.

In [10]:
(
    autos
    .with_columns(
        make_avg=pl.col('city08').mean().over(pl.col('make'))
    )
    .select(['make', 'year', 'make_avg'])
    .sort('make_avg')
)

make,year,make_avg
cat,i16,f64
"""Vector""",1996,7.25
"""Vector""",1997,7.25
"""Vector""",1992,7.25
"""Vector""",1993,7.25
"""Bugatti""",2006,8.318182
…,…,…
"""Lucid""",2025,124.454545
"""Lucid""",2025,124.454545
"""Lucid""",2025,124.454545
"""Lucid""",2025,124.454545


---
# 6.4 Sorting by Multiple Columns

[back to Top](#Top)

* Polars `.sort()` supports multiple columns.
* First, sort by *year*, then *make*, then *model*.

In [11]:

(
    autos
    .sort(['year', 'make', 'model'])
    .select('year', 'make', 'model', 'num_gears')
)

year,make,model,num_gears
i16,cat,cat,u8
1984,"""Alfa Romeo""","""Spider Veloce 2000""",5
1984,"""Alfa Romeo""","""Spider Veloce 2000""",5
1984,"""Alfa Romeo""","""GT V6 2.5""",5
1984,"""Alfa Romeo""","""GT V6 2.5""",5
1984,"""Dodge""","""Charger""",5
…,…,…,…
2025,"""Lucid""","""Air Pure RWD with 20 inch whee…",1
2025,"""Lucid""","""Air Sapphire AWD""",1
2025,"""INEOS Automotive""","""Grenadier""",8
2025,"""INEOS Automotive""","""Grenadier Trialmaster Edition""",8


* Sort by *year* in descending order
* Sort by *make* and *model* in ascending order.

In [12]:
(
    autos
    .sort(
        ['year', 'make', 'model'],
        descending=[True, False, False]
    )
    .select('year', 'make', 'model', 'num_gears')
)

year,make,model,num_gears
i16,cat,cat,u8
2025,"""Alfa Romeo""","""Giulia""",8
2025,"""Alfa Romeo""","""Giulia AWD""",8
2025,"""Alfa Romeo""","""Stelvio AWD""",8
2025,"""Ferrari""","""Daytona SP3""",7
2025,"""Ferrari""","""Roma Spider""",8
…,…,…,…
1984,"""Bill Dovell Motor Car Company""","""Dovell 230CE""",4
1984,"""Bill Dovell Motor Car Company""","""Dovell 230E""",4
1984,"""Import Foreign Auto Sales Inc""","""1fas 410""",4
1984,"""S and S Coach Company E.p. Du…","""Funeral Coach 2WD""",3


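* This did not work as intended: *year* is descending, but *make* and *model* are not in alphabetical order.
* This is because the *make* and *model* columns are categoricals with the default 'physical' ordering, which follows the order in which the categories were first seen.
* We need to set the ordering for them.
* We can cast them to lexically ordered categoricals with `cast(pl.Categorical('lexical'))`.
* This sets the ordering to *lexical ordering* (alphabetic ordering).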

In [13]:
(
    autos
    .with_columns(
        pl.col('make').cast(pl.Categorical('lexical')),
        pl.col('model').cast(pl.Categorical('lexical'))
    )
    .sort(
        ['year', 'make', 'model'],
        descending=[True, False, False]
    )
    .select('year', 'make', 'model', 'num_gears')
)

year,make,model,num_gears
i16,cat,cat,u8
2025,"""Acura""","""Integra""",7
2025,"""Acura""","""Integra""",6
2025,"""Acura""","""Integra A-Spec""",6
2025,"""Acura""","""Integra A-Spec""",7
2025,"""Acura""","""MDX AWD""",10
…,…,…,…
1984,"""Volvo""","""760 GLE""",5
1984,"""Volvo""","""760 GLE""",4
1984,"""Volvo""","""760 GLE""",4
1984,"""Volvo""","""760 GLE""",4


---
# 6.5 Specifying Custom Ordering for Categorical Columns

[back to Top](#Top)

* Suppose we have a column with month names, and we want to sort by month.

In [14]:
(
    autos
    .with_columns(
        month=pl.col('createdOn').dt.strftime('%B')
    )
    .group_by('month').all()
    .sort('month')
    .select(['month', 'year', 'make'])
)

keys/aggregates are not partitionable: running default HASH AGGREGATION


month,year,make
str,list[i16],list[cat]
"""April""","[2014, 2014, … 2025]","[""Porsche"", ""MINI"", … ""BMW""]"
"""August""","[2015, 2015, … 2024]","[""Audi"", ""BMW"", … ""Tesla""]"
"""December""","[2009, 2009, … 2023]","[""Pontiac"", ""Pontiac"", … ""Rivian""]"
"""February""","[2015, 2015, … 2023]","[""Jaguar"", ""Jaguar"", … ""Porsche""]"
"""January""","[1985, 1985, … 1993]","[""Alfa Romeo"", ""Ferrari"", … ""Subaru""]"
…,…,…
"""March""","[2015, 2014, … 2024]","[""Audi"", ""Honda"", … ""Toyota""]"
"""May""","[2015, 2015, … 2025]","[""Mazda"", ""Mazda"", … ""Aston Martin""]"
"""November""","[2014, 2014, … 2023]","[""Lexus"", ""Subaru"", … ""INEOS Automotive""]"
"""October""","[2014, 2014, … 2025]","[""Lexus"", ""Nissan"", … ""Porsche""]"


* A naive approach is to sort by the month names.
* This does not work because it orders the months alphabetically.
* We need the months in calendar order: January, February, March, and so on.
* To get this, we use a *string cache*.
* It can be used as a context manager, which lets us temporarily create categoricals with a custom ordering.
* Inside it, first create a `Series` with a categorical type, passing the categories in the order we want them sorted.
* This sets the 'physical' ordering, which is the default ordering the categorical type uses for sorting.
* Next, while still inside the context manager, cast the column containing those same values to a categorical.
* The column will use the ordering we specified in the string cache.

---

**polars.StringCache**

A better, more explicit name would be **polars.GlobalStringCache**

Context manager for enabling and disabling the global string cache.

Categorical columns created under the same global string cache have the same underlying physical value when string values are equal. This allows the columns to be concatenated or used in a join operation, for example.

Enabling the global string cache introduces some overhead. The amount of overhead depends on the number of categories in your data. It is advised to enable the global string cache only when strictly necessary.

If StringCache calls are nested, the global string cache will only be disabled and cleared when the outermost context exits.

* https://docs.pola.rs/user-guide/concepts/data-types/categoricals/#comparisons
* https://docs.pola.rs/py-polars/html/reference/api/polars.StringCache.html#polars.StringCache

In [15]:
autos.head(1)

year,make,model,displ,cylinders,trany,drive,VClass,fuelType,barrels08,city08,highway08,createdOn,is_automatic,num_gears
i16,cat,cat,f32,u8,str,cat,cat,cat,f32,u8,u8,datetime[μs],str,u8
1985,"""Alfa Romeo""","""Spider Veloce 2000""",2.0,4,"""Manual 5-spd""","""Rear-Wheel Drive""","""Two Seaters""","""Regular""",14.167143,19,25,2013-01-01 00:00:00,"""false""",5


In [16]:
month_order = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November',
    'December'
]

with pl.StringCache():

    # Need this line for the ordering
    pl.Series(month_order).cast(pl.Categorical)
    
    print(
        autos
        .with_columns(
            month=pl.col('createdOn')
                .dt.strftime('%B')
                .cast(pl.Categorical),
            month_physical=pl.col('createdOn')
                .dt.strftime('%B')
                .cast(pl.Categorical)
                .to_physical()
        )
        # .group_by('month').all()
        .sort('month')
        .select(['month', 'month_physical','year', 'make'])
    )

shape: (48_202, 4)
+----------+----------------+------+------------+
| month    | month_physical | year | make       |
| ---      | ---            | ---  | ---        |
| cat      | u32            | i16  | cat        |
| January  | 0              | 1985 | Alfa Romeo |
| January  | 0              | 1985 | Ferrari    |
| January  | 0              | 1985 | Dodge      |
| January  | 0              | 1985 | Dodge      |
| January  | 0              | 1993 | Subaru     |
| …        | …              | …    | …          |
| December | 11             | 2023 | Rivian     |
| December | 11             | 2023 | Rivian     |
| December | 11             | 2023 | Rivian     |
| December | 11             | 2023 | Rivian     |
| December | 11             | 2023 | Rivian     |
+----------+----------------+------+------------+


* Alternatively, we can use `@pl.StringCache` as a decorator to create a string cache.
* If we create a categorical column inside this decorated function, it will use the ordering we specified.

In [17]:
month_order = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November',
    'December'
]

@pl.StringCache()
def create_month_order():

    # Need this line for ordering.
    # Because this column has the same value as the series,
    # Polars uses the series order.
    s = pl.Series(month_order).cast(pl.Categorical)
    
    return (
        autos
        .with_columns(
            month=pl.col('createdOn')
                .dt.strftime('%B')
                .cast(pl.Categorical()),
            month_physical=pl.col('createdOn')
                .dt.strftime('%B')
                .cast(pl.Categorical)
                .to_physical()       
        )
    )

(
    create_month_order()
        .group_by('month').all()
        .sort('month')
        .select(['month', 'month_physical', 'year', 'make'])
        .head(12)
)

keys/aggregates are not partitionable: running default HASH AGGREGATION


month,month_physical,year,make
cat,list[u32],list[i16],list[cat]
"""January""","[0, 0, … 0]","[1985, 1985, … 1993]","[""Alfa Romeo"", ""Ferrari"", … ""Subaru""]"
"""February""","[1, 1, … 1]","[2015, 2015, … 2023]","[""Jaguar"", ""Jaguar"", … ""Porsche""]"
"""March""","[2, 2, … 2]","[2015, 2014, … 2024]","[""Audi"", ""Honda"", … ""Toyota""]"
"""April""","[3, 3, … 3]","[2014, 2014, … 2025]","[""Porsche"", ""MINI"", … ""BMW""]"
"""May""","[4, 4, … 4]","[2015, 2015, … 2025]","[""Mazda"", ""Mazda"", … ""Aston Martin""]"
…,…,…,…
"""August""","[7, 7, … 7]","[2015, 2015, … 2024]","[""Audi"", ""BMW"", … ""Tesla""]"
"""September""","[8, 8, … 8]","[2015, 2015, … 2024]","[""Chevrolet"", ""Ferrari"", … ""Chevrolet""]"
"""October""","[9, 9, … 9]","[2014, 2014, … 2025]","[""Lexus"", ""Nissan"", … ""Porsche""]"
"""November""","[10, 10, … 10]","[2014, 2014, … 2023]","[""Lexus"", ""Subaru"", … ""INEOS Automotive""]"


---
**Note**

* We can go back to the initial `tweak_auto` function and set the ordering to 'lexical' on the `.cast` call.

In [18]:
path = 'data/vehicles.csv'
raw = pl.read_csv(path, null_values=['NA'])
print(raw.shape)
print(f"{raw.estimated_size(unit='mb'):.2f}MB")


@pl.StringCache()
def tweak_auto_ordered(df):
    month_order = [
        'January', 'February', 'March', 'April', 'May', 'June',
        'July', 'August', 'September', 'October', 'November',
        'December'
    ]
    s = pl.Series(month_order).cast(pl.Categorical)
    
    cols = ['year', 'make', 'model', 'displ', 'cylinders', 'trany', 
           'drive', 'VClass', 'fuelType', 'barrels08', 'city08', 
           'highway08', 'createdOn']
    
    return (
        df
        .select(pl.col(cols))
        .with_columns( 
            pl.col('year').cast(pl.Int16),
            pl.col(['cylinders', 'highway08', 'city08']).cast(pl.UInt8),
            pl.col(['displ', 'barrels08']).cast(pl.Float32),

            # ordered
            # Does it make a difference if we only sort 'make'?
            pl.col(['make']).cast(pl.Categorical('lexical')),
            
            pl.col(['model', 'VClass', 'drive', 'fuelType']).cast(pl.Categorical()),
            
            createdOn=pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'),
            is_automatic=pl.col('trany')                    
               .str.contains('Automatic')
               .fill_null('Automatic'),
            num_gears=pl.col('trany')
                .str.extract(r'(\d+)')
                .cast(pl.UInt8)
                .fill_null(6),
        )
    )

autos = tweak_auto_ordered(raw)
print(type(autos))
print(autos.shape)
print(f"{autos.estimated_size(unit='mb'):.2f}MB")

autos.head(1)

avg line length: 434.78027
std. dev. line length: 23.885818
initial row estimate: 47850
no. of chunks: 4 processed by: 4 threads.


(48202, 84)
29.34MB
<class 'polars.dataframe.frame.DataFrame'>
(48202, 15)
2.94MB


year,make,model,displ,cylinders,trany,drive,VClass,fuelType,barrels08,city08,highway08,createdOn,is_automatic,num_gears
i16,cat,cat,f32,u8,str,cat,cat,cat,f32,u8,u8,datetime[μs],str,u8
1985,"""Alfa Romeo""","""Spider Veloce 2000""",2.0,4,"""Manual 5-spd""","""Rear-Wheel Drive""","""Two Seaters""","""Regular""",14.167143,19,25,2013-01-01 00:00:00,"""false""",5


---
* Check that 'make' is now an ordered category

In [19]:
(
    autos
    .sort('make')
)

year,make,model,displ,cylinders,trany,drive,VClass,fuelType,barrels08,city08,highway08,createdOn,is_automatic,num_gears
i16,cat,cat,f32,u8,str,cat,cat,cat,f32,u8,u8,datetime[μs],str,u8
1985,"""AM General""","""Post Office DJ5 2WD""",2.5,4,"""Automatic 3-spd""","""Rear-Wheel Drive""","""Special Purpose Vehicle 2WD""","""Regular""",18.594376,16,17,2013-01-01 00:00:00,"""true""",3
1985,"""AM General""","""Post Office DJ8 2WD""",4.2,6,"""Automatic 3-spd""","""Rear-Wheel Drive""","""Special Purpose Vehicle 2WD""","""Regular""",22.885386,13,13,2013-01-01 00:00:00,"""true""",3
1984,"""AM General""","""FJ8c Post Office""",4.2,6,"""Automatic 3-spd""","""2-Wheel Drive""","""Special Purpose Vehicle 2WD""","""Regular""",22.885386,13,13,2013-01-01 00:00:00,"""true""",3
1984,"""AM General""","""DJ Po Vehicle 2WD""",2.5,4,"""Automatic 3-spd""","""2-Wheel Drive""","""Special Purpose Vehicle 2WD""","""Regular""",17.500587,18,17,2013-01-01 00:00:00,"""true""",3
1984,"""AM General""","""FJ8c Post Office""",4.2,6,"""Automatic 3-spd""","""2-Wheel Drive""","""Special Purpose Vehicle 2WD""","""Regular""",22.885386,13,13,2013-01-01 00:00:00,"""true""",3
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2017,"""smart""","""fortwo electric drive converti…",,,"""Automatic (A1)""","""Rear-Wheel Drive""","""Two Seaters""","""Electricity""",0.0792,112,91,2017-10-11 00:00:00,"""true""",1
2018,"""smart""","""fortwo electric drive coupe""",,,"""Automatic (A1)""","""Rear-Wheel Drive""","""Two Seaters""","""Electricity""",0.0744,124,94,2017-12-05 00:00:00,"""true""",1
2018,"""smart""","""fortwo electric drive converti…",,,"""Automatic (A1)""","""Rear-Wheel Drive""","""Two Seaters""","""Electricity""",0.0792,112,91,2017-12-05 00:00:00,"""true""",1
2019,"""smart""","""EQ fortwo (coupe)""",,,"""Automatic (A1)""","""Rear-Wheel Drive""","""Two Seaters""","""Electricity""",0.0744,124,94,2019-01-29 00:00:00,"""true""",1


---
# 6.6 Enums and Ordering

[back to Top](#Top)

If you know the categories and their order ahead of time, and multiple dataframes share the same categories, use an `Enum`.

Enums are more efficient than categorical types because the categories are fixed when the type is defined, so the encoding does not have to be inferred from the data.

We define Enums using `pl.Enum`.

Here, we make a dataset of birthdays to explore enums.

In [20]:
import io

data = '''Name,Birthday
Brianna Smith,2000-02-16
Alex Johnson,2001-01-15
Carlos Gomez,2002-03-17
Diana Clarke,2003-04-18
Ethan Hunt,2002-05-19
Fiona Gray,2005-06-20
George King,2006-07-21
Hannah Scott,2007-08-22
Ian Miles,2008-09-23
Julia Banks,2009-10-24'''

students = pl.read_csv(io.StringIO(data))
students

file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.


Name,Birthday
str,str
"""Brianna Smith""","""2000-02-16"""
"""Alex Johnson""","""2001-01-15"""
"""Carlos Gomez""","""2002-03-17"""
"""Diana Clarke""","""2003-04-18"""
"""Ethan Hunt""","""2002-05-19"""
"""Fiona Gray""","""2005-06-20"""
"""George King""","""2006-07-21"""
"""Hannah Scott""","""2007-08-22"""
"""Ian Miles""","""2008-09-23"""
"""Julia Banks""","""2009-10-24"""


* Add a month column and sort by month.
* First, convert the string data from the *Birthday* column to a datetime.
* Then, use `.dt.strftime` to retrieve the month name.
* Finally, use `.sort()`

In [21]:
bday = pl.col('Birthday')

(
    students
    .with_columns(
        bday.str.to_datetime('%Y-%m-%d')
    )
    .with_columns(
        month=bday.dt.strftime('%B')
    )
    .sort(by='month')
)

Name,Birthday,month
str,datetime[μs],str
"""Diana Clarke""",2003-04-18 00:00:00,"""April"""
"""Hannah Scott""",2007-08-22 00:00:00,"""August"""
"""Brianna Smith""",2000-02-16 00:00:00,"""February"""
"""Alex Johnson""",2001-01-15 00:00:00,"""January"""
"""George King""",2006-07-21 00:00:00,"""July"""
"""Fiona Gray""",2005-06-20 00:00:00,"""June"""
"""Carlos Gomez""",2002-03-17 00:00:00,"""March"""
"""Ethan Hunt""",2002-05-19 00:00:00,"""May"""
"""Julia Banks""",2009-10-24 00:00:00,"""October"""
"""Ian Miles""",2008-09-23 00:00:00,"""September"""


* Note that the order of the *month* column is alphabetical.
* We want this in chronological order.
* For this, create an `Enum` with the months in the desired order.

In [22]:
month_type = pl.Enum([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November',
    'December'
])
month_type

Enum(categories=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])

* Once the enum is defined, we can cast a column to it.

In [23]:
(
    students
    .with_columns(bday.str.to_datetime('%Y-%m-%d'))
    .with_columns(month=bday.dt.strftime('%B').cast(month_type))
    .sort(by='month')
)

Name,Birthday,month
str,datetime[μs],enum
"""Alex Johnson""",2001-01-15 00:00:00,"""January"""
"""Brianna Smith""",2000-02-16 00:00:00,"""February"""
"""Carlos Gomez""",2002-03-17 00:00:00,"""March"""
"""Diana Clarke""",2003-04-18 00:00:00,"""April"""
"""Ethan Hunt""",2002-05-19 00:00:00,"""May"""
"""Fiona Gray""",2005-06-20 00:00:00,"""June"""
"""George King""",2006-07-21 00:00:00,"""July"""
"""Hannah Scott""",2007-08-22 00:00:00,"""August"""
"""Ian Miles""",2008-09-23 00:00:00,"""September"""
"""Julia Banks""",2009-10-24 00:00:00,"""October"""


---
* Here is another dataset with holidays.
* We want to join a holiday to each birthday month.

In [24]:
holiday_data = '''Month,Holiday
January,New Year's Day
February,Valentine's Day
March,St. Patrick's Day
April,April Fools' Day
May,Memorial Day
June,Juneteenth
July,Independence Day
August,Labor Day
September,Patriot Day
October,Halloween
November,Thanksgiving
December,Christmas Day'''

holidays = pl.read_csv(io.StringIO(holiday_data))
holidays

file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.


Month,Holiday
str,str
"""January""","""New Year's Day"""
"""February""","""Valentine's Day"""
"""March""","""St. Patrick's Day"""
"""April""","""April Fools' Day"""
"""May""","""Memorial Day"""
…,…
"""August""","""Labor Day"""
"""September""","""Patriot Day"""
"""October""","""Halloween"""
"""November""","""Thanksgiving"""


* If we join on the month columns as strings and then sort by month, we are back to lexical ordering.

In [25]:
(
    students
    .with_columns(bday.str.to_datetime('%Y-%m-%d'))
    .with_columns(month=bday.dt.strftime('%B'))
    .join(holidays, left_on='month', right_on='Month')
    .sort('month')
)

join parallel: true
INNER join dataframes finished


Name,Birthday,month,Holiday
str,datetime[μs],str,str
"""Diana Clarke""",2003-04-18 00:00:00,"""April""","""April Fools' Day"""
"""Hannah Scott""",2007-08-22 00:00:00,"""August""","""Labor Day"""
"""Brianna Smith""",2000-02-16 00:00:00,"""February""","""Valentine's Day"""
"""Alex Johnson""",2001-01-15 00:00:00,"""January""","""New Year's Day"""
"""George King""",2006-07-21 00:00:00,"""July""","""Independence Day"""
"""Fiona Gray""",2005-06-20 00:00:00,"""June""","""Juneteenth"""
"""Carlos Gomez""",2002-03-17 00:00:00,"""March""","""St. Patrick's Day"""
"""Ethan Hunt""",2002-05-19 00:00:00,"""May""","""Memorial Day"""
"""Julia Banks""",2009-10-24 00:00:00,"""October""","""Halloween"""
"""Ian Miles""",2008-09-23 00:00:00,"""September""","""Patriot Day"""


* If only one of the join columns is a categorical, we get an error because the key dtypes do not match.

In [26]:
try:
    print(
        students
        .with_columns(bday.str.to_datetime('%Y-%m-%d'))
        .with_columns(month=bday.dt.strftime('%B').cast(pl.Categorical()))
        .join(holidays, left_on='month', right_on='Month')
        .sort('month')
    )
except Exception as e:
    print(e)

join parallel: true


datatypes of join keys don't match - `month`: cat on left does not match `Month`: str on right


INNER join dataframes finished


* If both join columns are categoricals defined outside of a string cache, we get a `CategoricalRemappingWarning`.
* Also, the ordering is based on the order in which the categories appear in the *students* data.
* February appears first in the data, so it is the first month in the physical ordering.

In [27]:
(
    students
    .with_columns(bday.str.to_datetime('%Y-%m-%d'))
    .with_columns(month=bday.dt.strftime('%B').cast(pl.Categorical))
    .join(
        holidays.with_columns(pl.col('Month').cast(pl.Categorical)),
        left_on='month',
        right_on='Month'
    ).sort('month')
)

join parallel: true
INNER join dataframes finished


Name,Birthday,month,Holiday
str,datetime[μs],cat,str
"""Brianna Smith""",2000-02-16 00:00:00,"""February""","""Valentine's Day"""
"""Alex Johnson""",2001-01-15 00:00:00,"""January""","""New Year's Day"""
"""Carlos Gomez""",2002-03-17 00:00:00,"""March""","""St. Patrick's Day"""
"""Diana Clarke""",2003-04-18 00:00:00,"""April""","""April Fools' Day"""
"""Ethan Hunt""",2002-05-19 00:00:00,"""May""","""Memorial Day"""
"""Fiona Gray""",2005-06-20 00:00:00,"""June""","""Juneteenth"""
"""George King""",2006-07-21 00:00:00,"""July""","""Independence Day"""
"""Hannah Scott""",2007-08-22 00:00:00,"""August""","""Labor Day"""
"""Ian Miles""",2008-09-23 00:00:00,"""September""","""Patriot Day"""
"""Julia Banks""",2009-10-24 00:00:00,"""October""","""Halloween"""


---
* If we cast both columns to the month enum, we get the desired behavior.
* The months are now sorted in chronological order.

In [28]:
(
    students
    .with_columns(bday.str.to_datetime('%Y-%m-%d'))
    .with_columns(month=bday.dt.strftime('%B').cast(month_type))
    .join(
        holidays.with_columns(pl.col('Month').cast(month_type)),
        left_on='month',
        right_on='Month'
    )
    .sort('month')
)

join parallel: true
INNER join dataframes finished


Name,Birthday,month,Holiday
str,datetime[μs],enum,str
"""Alex Johnson""",2001-01-15 00:00:00,"""January""","""New Year's Day"""
"""Brianna Smith""",2000-02-16 00:00:00,"""February""","""Valentine's Day"""
"""Carlos Gomez""",2002-03-17 00:00:00,"""March""","""St. Patrick's Day"""
"""Diana Clarke""",2003-04-18 00:00:00,"""April""","""April Fools' Day"""
"""Ethan Hunt""",2002-05-19 00:00:00,"""May""","""Memorial Day"""
"""Fiona Gray""",2005-06-20 00:00:00,"""June""","""Juneteenth"""
"""George King""",2006-07-21 00:00:00,"""July""","""Independence Day"""
"""Hannah Scott""",2007-08-22 00:00:00,"""August""","""Labor Day"""
"""Ian Miles""",2008-09-23 00:00:00,"""September""","""Patriot Day"""
"""Julia Banks""",2009-10-24 00:00:00,"""October""","""Halloween"""


* Enums are preferred for handling low-cardinality categorical data when:
     - You are concerned about the order.
     - You know the values ahead of time.

---
# 6.7 Group Ordering and maintain_order

[back to Top](#Top)

* If we group by *make*, the results do not come back ordered by *make*.
* The row order is not guaranteed, so it can change from one run to the next.

In [29]:
def order_test():
    print(
        autos
        .group_by(pl.col('make'))
        .len()
    )

order_test()
order_test()

known unique values: 145
run PARTITIONED HASH AGGREGATION
known unique values: 145
run PARTITIONED HASH AGGREGATION


shape: (145, 2)
+-------------------------------+-----+
| make                          | len |
| ---                           | --- |
| cat                           | u32 |
| Bitter Gmbh and Co. Kg        | 5   |
| Goldacre                      | 1   |
| Aurora Cars Ltd               | 1   |
| Ruf Automobile Gmbh           | 3   |
| Isuzu                         | 434 |
| …                             | …   |
| Import Foreign Auto Sales Inc | 1   |
| Bentley                       | 176 |
| Panoz Auto-Development        | 1   |
| Daewoo                        | 67  |
| Merkur                        | 14  |
+-------------------------------+-----+
shape: (145, 2)
+-----------------------------+------+
| make                        | len  |
| ---                         | ---  |
| cat                         | u32  |
| TVR Engineering Ltd         | 4    |
| Consulier Industries Inc    | 3    |
| Dodge                       | 2695 |
| PAS Inc - GMC               | 2    |
| Grumman Olson 

* You can make the order consistent by passing `maintain_order=True`.
* However, the groups then appear in the order in which each *make* first occurs in the input data, not in the order of the categorical we grouped by.

In [30]:
(
    autos
    .group_by(pl.col('make'), maintain_order=True)
    .len()
)

known unique values: 145
run PARTITIONED HASH AGGREGATION


make,len
cat,u32
"""Alfa Romeo""",97
"""Ferrari""",279
"""Dodge""",2695
"""Subaru""",1013
"""Toyota""",2470
…,…
"""General Motors""",1
"""Consulier Industries Inc""",3
"""Goldacre""",1
"""Isis Imports Ltd""",1


* If you want the results in the order of the categorical, sort them after grouping.

In [31]:
(
    autos
    .group_by(pl.col('make'))
    .len()
    .sort('make')
)

known unique values: 145
run PARTITIONED HASH AGGREGATION


make,len
cat,u32
"""AM General""",6
"""ASC Incorporated""",1
"""Acura""",424
"""Alfa Romeo""",97
"""American Motors Corporation""",27
…,…
"""Volkswagen""",1326
"""Volvo""",929
"""Wallace Environmental""",32
"""Yugo""",8


---
# 6.8 Stable Sorting

[back to Top](#Top)

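* A *stable* sort keeps rows with equal sort keys in their original relative order.
* In the examples below, ties come back in input order (Alice before Charlie when sorting by *age*, Alice before Dana when sorting by *grade*). Note that `sort` only guarantees this behavior when called with `maintain_order=True`.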

In [32]:
students = pl.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'age': [25, 20, 25, 21, 24],
    'grade': [88, 92, 95, 88, 60],
})
students

name,age,grade
str,i64,i64
"""Alice""",25,88
"""Bob""",20,92
"""Charlie""",25,95
"""Dana""",21,88
"""Eve""",24,60


In [33]:
(
    students
    .sort('age')
)

name,age,grade
str,i64,i64
"""Bob""",20,92
"""Dana""",21,88
"""Eve""",24,60
"""Alice""",25,88
"""Charlie""",25,95


In [34]:
(
    students
    .sort('grade')
)

name,age,grade
str,i64,i64
"""Eve""",24,60
"""Alice""",25,88
"""Dana""",21,88
"""Bob""",20,92
"""Charlie""",25,95


---
# 6.9 Sorting and Filtering

[back to Top](#Top)

* Polars tracks whether a column is known to be sorted, and its query engine can use that knowledge to choose faster execution paths for some operations.

Test on unsorted data:

In [35]:
%%timeit
@pl.Config(set_verbose=False)
def timing_test():
    (
        autos
        .filter(pl.col('year')==1994)
    )

timing_test()

645 µs ± 122 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Test on sorted data:

In [36]:
autos_year_sorted = autos.sort('year')

In [37]:
%%timeit
@pl.Config(set_verbose=False)
def timing_test():
    (
        autos_year_sorted
        .filter(pl.col('year')==1994)
    )

timing_test()

575 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


---
Try searching for all Ford cars with three dataframes:
1. Categorical *make* column
2. String *make* column
3. Sorted string *make* column

In [38]:
cat_make = (
    autos
    .with_columns(make=pl.col('make').cast(pl.Categorical('lexical')))
)

string_make = (
    autos
    .with_columns(make=pl.col('make').cast(pl.String))
)

sorted_make = (
    autos
    .with_columns(make=pl.col('make').cast(pl.String)).sort('make')
)

* Now, time searching for all Ford cars with the three dataframes.

In [39]:
%%timeit
@pl.Config(set_verbose=False)
def timing_test():
    cat_make.filter(pl.col('make')=='Ford')

timing_test()

911 µs ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [40]:
%%timeit
@pl.Config(set_verbose=False)
def timing_test():
    string_make.filter(pl.col('make')=='Ford')

timing_test()

1 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [41]:
%%timeit
@pl.Config(set_verbose=False)
def timing_test():
    sorted_make.filter(pl.col('make')=='Ford')

timing_test()

773 µs ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
