## String and categorical dtypes
By the end of this lecture you will be able to:
- convert from string to categorical dtype
- get the integer mapping values
- sort categorical data

When we have a string column with repeated values it is often faster and less memory intensive to cast the strings to the `pl.Categorical` dtype. The categorical dtype works in some surprising ways, however. In this lecture we go through the fundamentals of how Polars works with the categorical dtype. 

In [None]:
import polars as pl

We create a `DataFrame` with a string column to illustrate how string data is stored in memory.

In [None]:
df = pl.DataFrame(
    {
        "text":["cat","dog","rabbit","cat"]
    }
)

## String data in memory
In memory, Polars (or really Apache Arrow) stores all the strings of a column together in one contiguous block.

The array `["cat", "dog", "rabbit"]` is stored as:

- a concatenated string "catdograbbit"
- an offset array with the start (and end) of each string [0, 3, 6, 12]
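
The layout described above can be sketched in plain Python. This is only an illustration of the idea, not Arrow's actual implementation; the helper names `to_arrow_layout` and `get_string` are made up for this example.

```python
def to_arrow_layout(strings):
    """Pack strings into one concatenated byte buffer plus a list of offsets."""
    encoded = [s.encode("utf-8") for s in strings]
    data = b"".join(encoded)
    offsets = [0]
    for chunk in encoded:
        # Each offset marks where the next string starts (and the previous one ends)
        offsets.append(offsets[-1] + len(chunk))
    return data, offsets


def get_string(data, offsets, i):
    """Recover the i-th string by slicing the buffer between two offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = to_arrow_layout(["cat", "dog", "rabbit"])
print(data)                          # b'catdograbbit'
print(offsets)                       # [0, 3, 6, 12]
print(get_string(data, offsets, 2))  # rabbit
```

Note that three strings need four offsets: the last offset marks the end of the final string.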

> In Pandas a string column is not a concatenated string. Instead the column is a set of pointers to generic Python objects that could be of any type. The actual strings are stored in different places in memory. This representation makes string operations slow in Pandas.

### Advantages of the Arrow string format
- Fast string search and transformation operations 

### Disadvantages of the Arrow string format
- Slow to re-order. For example when doing a `sort` all the data must be moved around (not just pointers to data).
- Repeated string values are stored each time they occur

These disadvantages have implications for operations other than just `sort`. For example a `join` might involve a `sort` internally.
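
To see why re-ordering is costly, here is a rough pure-Python sketch of sorting a column held as a concatenated buffer plus offsets. It is an assumption-laden illustration (the function `sort_arrow_strings` is hypothetical and this is not how Polars implements `sort`), but it shows that both the data buffer and the offsets have to be rebuilt.

```python
def sort_arrow_strings(data, offsets):
    """Sort a buffer-plus-offsets string column by rebuilding both parts."""
    # Slice out every value, then sort the values
    values = [data[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
    values.sort()
    # The sorted column needs a brand new data buffer and new offsets
    new_data = b"".join(values)
    new_offsets = [0]
    for value in values:
        new_offsets.append(new_offsets[-1] + len(value))
    return new_data, new_offsets


# The column ["cat", "dog", "rabbit", "cat"] in buffer-plus-offsets form
data = b"catdograbbitcat"
offsets = [0, 3, 6, 12, 15]

sorted_data, sorted_offsets = sort_arrow_strings(data, offsets)
print(sorted_data)     # b'catcatdograbbit'
print(sorted_offsets)  # [0, 3, 6, 9, 15]
```

The repeated value `cat` also takes up space twice in the buffer, which is the second disadvantage listed above.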

## Categorical dtype
The `pl.Categorical` dtype is useful when you have a string column with repeated values.

The `pl.Categorical` dtype replaces the strings with a unique mapping from each string to an integer.
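
The idea of mapping strings to integers can be sketched in plain Python. This is an illustration only: the helper `encode_categorical` is hypothetical, and numbering the strings in order of first appearance is an assumption that may not match the integers Polars assigns.

```python
def encode_categorical(strings):
    """Replace each string with an integer code and keep the unique strings once."""
    mapping = {}
    codes = []
    for s in strings:
        if s not in mapping:
            # Assumption: a new string gets the next unused integer
            mapping[s] = len(mapping)
        codes.append(mapping[s])
    categories = list(mapping)
    return codes, categories


codes, categories = encode_categorical(["cat", "dog", "rabbit", "cat"])
print(codes)       # [0, 1, 2, 0]
print(categories)  # ['cat', 'dog', 'rabbit']

# The original strings can be recovered from the codes
decoded = [categories[code] for code in codes]
print(decoded)     # ['cat', 'dog', 'rabbit', 'cat']
```

The repeated value `cat` is stored as a string only once; each row holds just an integer.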

We convert from string to categorical with `cast`. Our `DataFrame` has a repeated value (`cat`), which is the case where a categorical helps.

In [None]:
df = (
    pl.DataFrame(
        {
            "text":["cat","dog","rabbit","cat"]
        }
    )
)
(
    df
    .with_columns(
        pl.col("text").cast(pl.Categorical).alias("text_cat")
    )
)

There is no difference in the printed appearance of values in a `pl.Categorical` column.

### Physical representation of categoricals

In Polars the integer part of the categorical mapping is referred to as the **"physical"** representation.

We can see the underlying integer values with the `to_physical` expression.

In [None]:
df = pl.DataFrame({"text":["cat","dog","rabbit","cat"]})
(
    df
    .with_columns(
        pl.col("text").cast(pl.Categorical).alias("text_cat")
    )
    .with_columns(
        pl.col("text_cat").to_physical().alias("cat_physical")
    )
)

The dtype for the categorical encoding is `pl.UInt32` - unsigned 32-bit integers.

Polars can accommodate over 4 billion unique string mappings with `pl.UInt32` integers.

## Sorting categoricals

As categoricals have both a `lexical` (string) representation and an integer (physical) representation, there are two ways to sort a categorical column.

To illustrate this we create a `DataFrame` with some string values in the first column. We add their position in the `values` column to keep track of where they started.

In [None]:
dfPhysical = (
    pl.DataFrame(
        {"strings": ["c","b","a","c"], "values": [0, 1, 2, 3]}
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical).alias("cats")
    )
    .with_columns(
        pl.col("cats").to_physical().alias("physical")
    )
)
dfPhysical

If we sort this `DataFrame` on the `cats` column we see that the `"c"` values come first rather than `"a"`! 

**In Polars the default is to sort categoricals by the `physical` representation and not the string representation.**

In [None]:
dfPhysical.sort("cats")

We can change the ordering convention to sort by the lexical (string) representation using the `cat.set_ordering` expression.

In [None]:
dfLexical = (
    dfPhysical
    .with_columns(
        # "cats" is already categorical, so we only need to change the ordering
        pl.col("cats").cat.set_ordering("lexical")
    )
)
dfLexical.sort("cats")
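
The two orderings can be reproduced in plain Python to make the difference concrete. This sketch assumes the integer codes are assigned in order of first appearance (`c` is 0, `b` is 1, `a` is 2); it does not use Polars at all.

```python
strings = ["c", "b", "a", "c"]
values = [0, 1, 2, 3]

# Assumption: codes are handed out in order of first appearance
code_of = {}
for s in strings:
    code_of.setdefault(s, len(code_of))

rows = list(zip(strings, values))

# Physical ordering: compare the integer codes
physical_sorted = sorted(rows, key=lambda row: code_of[row[0]])
# Lexical ordering: compare the strings themselves
lexical_sorted = sorted(rows, key=lambda row: row[0])

print(physical_sorted)  # [('c', 0), ('c', 3), ('b', 1), ('a', 2)]
print(lexical_sorted)   # [('a', 2), ('b', 1), ('c', 0), ('c', 3)]
```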

### Why does Polars sort on the physical rather than the string representation?
It may seem strange that Polars defaults to sorting categoricals by their physical representation. However, there are advantages to this. 

Polars has fast-track algorithms for sorted data, including for key operations such as `groupby` and `join`. Polars can use these fast-track algorithms if the physical representation is sorted. We see examples of this in later sections.
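
To give a feel for why sorted data allows a faster path, here is a pure-Python sketch of counting group sizes. It is not Polars' implementation, and both function names are hypothetical. The point is only that sorted integer codes can be grouped in a single pass over neighbouring values, without building a hash table.

```python
from itertools import groupby


def group_counts_sorted(codes):
    """Count group sizes in one pass. Only valid when codes are already sorted."""
    # Equal codes sit next to each other, so each run is one group
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def group_counts_unsorted(codes):
    """General fallback: count with a hash table, then order the result."""
    counts = {}
    for code in codes:
        counts[code] = counts.get(code, 0) + 1
    return sorted(counts.items())


sorted_codes = [0, 0, 1, 2, 2, 2]
print(group_counts_sorted(sorted_codes))    # [(0, 2), (1, 1), (2, 3)]
print(group_counts_unsorted(sorted_codes))  # [(0, 2), (1, 1), (2, 3)]
```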

## Operations on categoricals
Some operations on categorical columns return `null`, even though they work on string columns. The `max` aggregation is one example.

You can see this behaviour in the following cell.

> Earlier versions of Polars raised an `Exception` for this operation. Update your Polars version if you get an exception here.

In [None]:
(
    dfLexical
    .select(
        pl.all().max()
    )
)

## Integers as categoricals?
We might have an integer column that we consider to be a categorical column.

For example, we can consider the passenger class column in the Titanic dataset to be categorical.

In [None]:
csvFile = "../data/titanic.csv"
pl.read_csv(csvFile,n_rows=2)

However, only a string column can be converted to `pl.Categorical` in Polars.

If we want to cast an integer column to categorical, we first cast it to string dtype.

In [None]:
(
    pl.read_csv(csvFile,n_rows=2)
    .select(
        [
            "Pclass",
            pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical).alias("cat")
        ]
    )
)

The physical representation may not match the original integer representation.
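
A plain-Python sketch shows why the two can differ. The class values below are made up for illustration (they are not read from the Titanic file), and assigning codes in order of first appearance is an assumption.

```python
# Made-up passenger classes for illustration
pclass = [3, 1, 3, 2]

# Step 1: cast the integers to strings
as_strings = [str(value) for value in pclass]

# Step 2: map each string to a code (assumed: order of first appearance)
code_of = {}
for s in as_strings:
    code_of.setdefault(s, len(code_of))
physical = [code_of[s] for s in as_strings]

print(as_strings)  # ['3', '1', '3', '2']
print(physical)    # [0, 1, 0, 2]
```

Here class `3` is the first value seen, so it gets code `0`: the code records the mapping, not the original number.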

In [None]:
(
    pl.read_csv(csvFile,n_rows=2)
    .select(
        [
            "Pclass",
            pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical).alias("cat")
        ]
    )
    .with_columns(
            pl.col("cat").to_physical().alias("physical")
    )
)

In the next lecture we learn more about categoricals, including:
- filtering a categorical column
- working with categoricals over multiple `DataFrames` or `Series`

## Exercises

In the exercises you will develop your understanding of:
- casting a string column to categorical
- accessing the physical values
- sorting by a categorical column in alphabetical order

### Exercise 1
We have the following `DataFrame` of animals and their sizes

In [None]:
dfAnimalSizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
)


Cast the `size` column to categorical and call it `size_cats`

In [None]:
dfAnimalSizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    <blank>
)

Add a column with the physical values of the categoricals

Sort the `DataFrame` by `size_cats` in alphabetical order

## Solutions

### Solution to Exercise 1 

Cast the `size` column to categorical and call it `size_cats`

In [None]:
dfAnimalSizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    .with_columns(
        pl.col("size").cast(pl.Categorical).alias("size_cats")
    )
)
dfAnimalSizes

Add a column with the physical values of the categoricals

In [None]:
dfAnimalSizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    .with_columns(
        pl.col("size").cast(pl.Categorical).alias("size_cats")
    )
    .with_columns(
        pl.col("size_cats").to_physical().alias("physical"),
    )
)
dfAnimalSizes

Sort the `DataFrame` by `size_cats` in alphabetical order

In [None]:
dfAnimalSizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    .with_columns(
        pl.col("size").cast(pl.Categorical).alias("size_cats")
    )
    .with_columns(
        [
            pl.col("size_cats").to_physical().alias("physical"),
            pl.col("size_cats").cat.set_ordering("lexical"),
        ]
    )
    .sort("size_cats")
)
dfAnimalSizes