# Categorical and enum dtypes

For a string column with many repeated values, it is often faster and less memory-intensive to cast the strings to the `pl.Categorical` or `pl.Enum` dtype.

In [1]:
import polars as pl

## Categorical dtype
The `pl.Categorical` dtype is useful when a string column has many repeated values.

Internally, the `pl.Categorical` dtype replaces each distinct string with an integer and stores the mapping between the integers and the strings.

By working with integers internally we typically:
- reduce memory usage as we store a single integer instead of a string
- reduce computation time as we work with integers instead of strings
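
The idea behind this encoding can be sketched in plain Python. This is only an illustration of dictionary encoding in general, not Polars' actual implementation, and the helper name `dictionary_encode` is made up for this example.

```python
def dictionary_encode(values):
    """Map each distinct string to an integer code in order of first occurrence."""
    code_of = {}  # string -> integer code
    codes = []    # one small integer per row instead of a full string
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    # Dicts preserve insertion order, so position i holds the string for code i
    categories = list(code_of)
    return codes, categories


codes, categories = dictionary_encode(["cat", "dog", "rabbit", "cat"])
print(codes)       # [0, 1, 2, 0]
print(categories)  # ['cat', 'dog', 'rabbit']
```

Each repeated string is stored once in `categories`, and comparisons between rows become integer comparisons on `codes`.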

In [2]:
df = pl.DataFrame({"text": ["cat", "dog", "rabbit", "cat"]})

Convert the string column to the categorical dtype.

In [3]:
df.with_columns(
    pl.col("text").cast(pl.Categorical).alias("text_cat")
)

text,text_cat
str,cat
"""cat""","""cat"""
"""dog""","""dog"""
"""rabbit""","""rabbit"""
"""cat""","""cat"""


There is no difference in the printed appearance of values in a `pl.Categorical` column and the original string column. 

However, internally Polars stores the `pl.Categorical` column as integers along with a unique mapping from integer to string.

### Physical representation of categoricals

In Polars, the integer representation used internally is referred to as the **"physical"** representation.

In [4]:
df.with_columns(
    pl.col("text").cast(pl.Categorical).alias("text_cat")
).with_columns(
    pl.col("text_cat").to_physical().alias("cat_physical")
)

text,text_cat,cat_physical
str,cat,u32
"""cat""","""cat""",0
"""dog""","""dog""",1
"""rabbit""","""rabbit""",2
"""cat""","""cat""",0


The integer representation is set by the order of occurrence in the column. 

This can be used as a way to add a column that numbers strings according to their first occurrence.

The dtype for the categorical encoding is `pl.UInt32` - unsigned 32-bit integers.

Polars can accommodate over 4 billion unique string mappings with `pl.UInt32` integers.

## Sorting categorical

A categorical has both a `lexical` (string) representation and an integer representation, so there are two ways to sort a categorical column.

In [5]:
df_physical = (
    pl.DataFrame(
        {"strings": ["c", "b", "a", "c"], "values": [0, 1, 2, 3]}
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical).alias("cats")
    )
    .with_columns(
        pl.col("cats").to_physical().alias("physical")
    )
)
df_physical

strings,values,cats,physical
str,i64,cat,u32
"""c""",0,"""c""",3
"""b""",1,"""b""",4
"""a""",2,"""a""",5
"""c""",3,"""c""",3


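> Key point: **A categorical column can be sorted by its `physical` representation or by its string representation. In the output below the rows are ordered by the strings (`a`, `b`, `c`) and not by the `physical` codes, so check which ordering your Polars version applies by default**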

In [7]:
df_physical.sort("cats")

strings,values,cats,physical
str,i64,cat,u32
"""a""",2,"""a""",5
"""b""",1,"""b""",4
"""c""",0,"""c""",3
"""c""",3,"""c""",3


### Integers as categorical?
Only a string column can be converted to `pl.Categorical` in Polars.

To cast an integer column to categorical, cast it to string dtype first.

### Saving categorical

If we save a `DataFrame` with a categorical column to:
- a `Parquet` file, the categorical dtype is preserved when we read the file back into a `DataFrame`
- a `CSV` file, the column is written as plain strings and is read back with string dtype

## Enum dtype

In some cases we know the repeated values in a column in advance, so we can define them up front with `pl.Enum`.

In [9]:
enum_dtype = pl.Enum(["dog", "cat", "rabbit"])

Cast the `text` column to this dtype

In [10]:
df.with_columns(
    pl.col("text").cast(enum_dtype)
)

text
enum
"""cat"""
"""dog"""
"""rabbit"""
"""cat"""


As with a `pl.Categorical` dtype, we can get the underlying physical representation of a `pl.Enum` column.

In [11]:
df.with_columns(
    pl.col("text").cast(enum_dtype)
).with_columns(
    pl.col("text").to_physical().alias("physical")
)

text,physical
enum,u8
"""cat""",1
"""dog""",0
"""rabbit""",2
"""cat""",1


> **Note** that the order of the integers in `physical` is set by the order given in `pl.Enum` rather than the order of occurrence.

With a `pl.Enum` dtype, Polars raises an `InvalidOperationError` if we try to cast a string that is not in the `pl.Enum` categories.

In [12]:
enum_no_rabbit_dtype = pl.Enum(["cat", "dog"])

In [13]:
df.with_columns(
    pl.col("text").cast(enum_no_rabbit_dtype)
)

InvalidOperationError: conversion from `str` to `enum` failed in column 'text' for 1 out of 4 values: ["rabbit"]

Ensure that all values in the input column are present in the categories of the enum datatype.

## Exercises

### Exercise 1
We have the following `DataFrame` of animals and their sizes

In [16]:
df_animal_sizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
)

df_animal_sizes

animals,size
str,str
"""dog""","""medium"""
"""cat""","""medium"""
"""mouse""","""small"""
"""giraffe""","""big"""


Cast the `size` column to categorical and call it `size_cats`

In [17]:
df_animal_sizes.with_columns(
    pl.col("size").cast(pl.Categorical).alias("size_cats")
)

animals,size,size_cats
str,str,cat
"""dog""","""medium""","""medium"""
"""cat""","""medium""","""medium"""
"""mouse""","""small""","""small"""
"""giraffe""","""big""","""big"""


Add a column with the physical values of the categoricals

In [18]:
df_animal_sizes.with_columns(
    pl.col("size").cast(pl.Categorical).alias("size_cats")
).with_columns(
    pl.col("size_cats").to_physical().alias("physical_cat")
)

animals,size,size_cats,physical_cat
str,str,cat,u32
"""dog""","""medium""","""medium""",6
"""cat""","""medium""","""medium""",6
"""mouse""","""small""","""small""",7
"""giraffe""","""big""","""big""",8


Sort the `DataFrame` by `size_cats` in alphabetical order

In [20]:
df_animal_sizes.with_columns(
    pl.col("size").cast(pl.Categorical).alias("size_cats")
).with_columns(
    pl.col("size_cats").to_physical().alias("physical_cat")
).sort("size_cats")

animals,size,size_cats,physical_cat
str,str,cat,u32
"""giraffe""","""big""","""big""",8
"""dog""","""medium""","""medium""",6
"""cat""","""medium""","""medium""",6
"""mouse""","""small""","""small""",7


Find a way to sort the `DataFrame` so that "small" comes before "medium", which comes before "big"

In [21]:
size_enum = pl.Enum(["small", "medium", "big"])

df_animal_sizes.with_columns(
    pl.col("size").cast(size_enum).alias("size_enum")
).with_columns(
    pl.col("size_enum").to_physical().alias("physical_enum")
).sort("size_enum")

animals,size,size_enum,physical_enum
str,str,enum,u8
"""mouse""","""small""","""small""",0
"""dog""","""medium""","""medium""",1
"""cat""","""medium""","""medium""",1
"""giraffe""","""big""","""big""",2


### Exercise 2
Create a `DataFrame` with the Spotify data

In [14]:
pl.Config.set_fmt_str_lengths(50)
spotify_csv = "data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

title,rank,date,artist,url,region,chart,trend,streams
str,i64,date,str,str,str,str,str,i64
"""Starboy""",1,2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track/5aAx2yezTd8zXrkmtKl…","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVl…","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track/4pdPtRcBmOSQDlJ3Fk9…","""Global""","""top200""","""MOVE_UP""",2545384


Get the estimated size of the `spotify_df` in megabytes

In [24]:
spotify_df.estimated_size(unit="mb")

43.30562686920166

Create a new Spotify `DataFrame` where we:
- cast any suitable columns to categorical
- cast any numerical columns to the smallest possible precision

Create the new `DataFrame` with a smaller size in memory

In [29]:
new_spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True).with_columns(
    pl.col(pl.Utf8).cast(pl.Categorical),
    pl.selectors.numeric().cast(pl.Int32)
)
new_spotify_df

title,rank,date,artist,url,region,chart,trend,streams
cat,i32,date,cat,cat,cat,cat,cat,i32
"""Starboy""",1,2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track/5aAx2yezTd8zXrkmtKl…","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVl…","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track/4pdPtRcBmOSQDlJ3Fk9…","""Global""","""top200""","""MOVE_UP""",2545384
"""Rockabye (feat. Sean Paul & Anne-Marie)""",4,2017-01-01,"""Clean Bandit""","""https://open.spotify.com/track/5knuzwU65gJK7IF5yJs…","""Global""","""top200""","""MOVE_DOWN""",2356604
"""One Dance""",5,2017-01-01,"""Drake, WizKid, Kyla""","""https://open.spotify.com/track/1xznGGDReH1oQq0xzbw…","""Global""","""top200""","""SAME_POSITION""",2259887
…,…,…,…,…,…,…,…,…
"""Slow Hands""",196,2018-01-31,"""Niall Horan""","""https://open.spotify.com/track/38yBBH2jacvDxrznF7h…","""Global""","""top200""","""MOVE_UP""",545008
"""New Freezer (feat. Kendrick Lamar)""",197,2018-01-31,"""Rich The Kid""","""https://open.spotify.com/track/4pYZLpX23Vx8rwDpJCp…","""Global""","""top200""","""MOVE_UP""",543534
"""Explícale (feat. Bad Bunny)""",198,2018-01-31,"""Yandel""","""https://open.spotify.com/track/1LszjjoVwDDZcWUQbze…","""Global""","""top200""","""NEW_ENTRY""",534209
"""The Scientist""",199,2018-01-31,"""Coldplay""","""https://open.spotify.com/track/75JFxkI2RXiU7L9VXzM…","""Global""","""top200""","""NEW_ENTRY""",533124


Get the estimated size of `new_spotify_df` in megabytes

In [30]:
new_spotify_df.estimated_size(unit="mb")

12.434532165527344

Find all rows where the artist is Taylor Swift

In [32]:
new_spotify_df.filter(
    pl.col("artist") == "Taylor Swift"
).head(3)

title,rank,date,artist,url,region,chart,trend,streams
cat,i32,date,cat,cat,cat,cat,cat,i32
"""Look What You Made Me Do""",176,2018-02-01,"""Taylor Swift""","""https://open.spotify.com/track/5troof8mcGO3AafoDbk…","""Global""","""top200""","""MOVE_DOWN""",603624
"""Look What You Made Me Do""",188,2018-02-02,"""Taylor Swift""","""https://open.spotify.com/track/5troof8mcGO3AafoDbk…","""Global""","""top200""","""MOVE_DOWN""",609744
"""Look What You Made Me Do""",174,2018-02-03,"""Taylor Swift""","""https://open.spotify.com/track/5troof8mcGO3AafoDbk…","""Global""","""top200""","""MOVE_UP""",586895
