## Categoricals and the string cache
By the end of this lecture you will be able to:
- filter a categorical column
- coordinate categorical mappings across objects with the string cache

We introduce the string cache here and will see later that it is essential when combining `DataFrames` with categorical columns. We will also see that joins on categorical columns can be much faster than joins on string columns.

In [None]:
import polars as pl

We create a `DataFrame` and add a categorical column called `cats`.

In [None]:
df = (
    pl.DataFrame(
            {"strings": ["c","b","a","c"], "values": [1, 2, 3, 4]}
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical).alias("cats")
    )
)
df

## Filtering a categorical column
We filter a categorical column for equality in the normal way

In [None]:
(
    df
    .filter(
        pl.col("cats") == "b"
    )
)

If we try to filter a categorical column with `is_in` we get an `Exception` (the cell is commented out so that the notebook runs without errors).

In [None]:
# (
#     df
#     .filter(
#         pl.col("cats").is_in(["b"])
#     )
# )

### Why do we get an `Exception` with `is_in`?

When we use `is_in` Polars: 
1. converts the list `["b"]` to a one-element `Series` with string dtype internally
2. casts this `Series` to `pl.Categorical`
3. uses its internal algorithms for comparing categoricals

The problem is in step 2. The mapping from strings to categories for the `Series` `["b"]` is not guaranteed to be the same as the mapping for `df["cats"]`.
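
The mismatch in step 2 can be illustrated with a toy model in plain Python. This is only a sketch of the idea and not how Polars is implemented: the `encode` helper is hypothetical, and assigning integer codes in order of first appearance is an assumption made for the illustration.

```python
def encode(values):
    """Toy categorical encoding: map each distinct string to an integer code.

    Codes are assigned in order of first appearance (an assumption made
    for this sketch; it is not a description of Polars internals).
    """
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Encode the column and the one-element list independently
column_codes, column_mapping = encode(["c", "b", "a", "c"])
filter_codes, filter_mapping = encode(["b"])

# The same string "b" ends up with a different code in each mapping
print(column_mapping["b"], filter_mapping["b"])
```

In this toy model the integer codes of the two objects cannot be compared directly, because the same integer can stand for different strings.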

## Combining categoricals with the `StringCache`
To ensure that different objects (in this case `df["cats"]` and `["b"]`) have the same categorical mapping we use the `StringCache`.

The `StringCache` object:
- stores the categorical mapping
- ensures that all categorical columns use the same mapping. 
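
The two roles listed above can be sketched with a toy class in plain Python. `ToyStringCache` is a hypothetical name and the class only illustrates the idea of a shared mapping. It does not reflect how Polars stores its cache.

```python
class ToyStringCache:
    """Toy stand-in for a shared string-to-code mapping (hypothetical class)."""

    def __init__(self):
        self.mapping = {}

    def encode(self, values):
        codes = []
        for value in values:
            # Reuse the existing code if the string was seen before,
            # otherwise assign the next free integer
            code = self.mapping.setdefault(value, len(self.mapping))
            codes.append(code)
        return codes


cache = ToyStringCache()
column_codes = cache.encode(["c", "b", "a", "c"])
filter_codes = cache.encode(["b"])

# Both objects share one mapping, so comparing integer codes is meaningful
mask = [code in filter_codes for code in column_codes]
print(mask)
```

Because both calls go through the same mapping, the string `"b"` receives the same code in both objects and the comparison picks out the right row.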

We can use the `StringCache` either inside a context manager or with a global toggle.

### Using the `StringCache` inside a context-manager

A context manager is a Python construct that guarantees setup and cleanup actions run at the start and end of a `with` block.

Everything inside the code block beginning with `with` is in the same context.

In this case
```python
with pl.StringCache():
```
ensures that everything that happens in the following code block uses the same categorical mappings.
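
To make the mechanics of a `with` block concrete, here is a minimal sketch using the standard library's `contextlib`. The `toy_cache` function and the `events` list are hypothetical names used only for illustration and are unrelated to the Polars API.

```python
from contextlib import contextmanager

events = []


@contextmanager
def toy_cache():
    """Hypothetical context manager that records when it is entered and left."""
    events.append("cache created")
    try:
        yield
    finally:
        # Runs when the with-block ends, even if the block raises an error
        events.append("cache deleted")


with toy_cache():
    events.append("work inside the block")

events.append("work after the block")
print(events)
```

The setup step runs before the body of the `with` block and the cleanup step runs as soon as the block ends, so code after the block no longer sees the toy cache.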

In [None]:
with pl.StringCache():
    df = (
        pl.DataFrame(
                {"strings": ["c","b","a","c"], "values": [1, 2, 3, 4]}
        )
        .with_columns(
            pl.col("strings").cast(pl.Categorical).alias("cats")
        )
        .filter(
            pl.col("cats").is_in(["b"])
        )
    )
df

At the end of the `with` block the `StringCache` is deleted.

### Toggling the `StringCache`
We can also toggle the `StringCache` so that it is enabled for the whole session. Be aware that this can have effects beyond this script/notebook. In fact I've commented it out here because when I run my test suite with `pytest` the command changes the outputs in other notebooks!

In [None]:
# pl.toggle_string_cache(toggle=True)

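When we use `pl.toggle_string_cache(toggle=True)` Polars enables a `StringCache` that is used by all categorical columns until:
- the end of the session or
- you call `pl.toggle_string_cache(toggle=False)`

Note that newer Polars releases replace `pl.toggle_string_cache` with `pl.enable_string_cache()` and `pl.disable_string_cache()`.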

You can see whether a string cache is enabled with:

In [None]:
pl.using_string_cache()

### Context-manager or toggle the string cache?
Toggling the string cache is easier than using the context-manager.

However, I recommend using the context-manager as:
- it makes the use of the string cache explicit in the code
- it avoids errors that can arise from setting global values
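
The difference between the two approaches can be sketched in plain Python. The `settings` dictionary, `toggle_cache` and `cache_context` are hypothetical stand-ins for a global library setting and are not part of Polars.

```python
from contextlib import contextmanager

# Hypothetical settings object standing in for a global library setting
settings = {"cache_enabled": False}


def toggle_cache(toggle):
    """Set the flag globally; it stays set until someone changes it back."""
    settings["cache_enabled"] = toggle


@contextmanager
def cache_context():
    """Enable the flag inside a with-block, then restore the previous value."""
    previous = settings["cache_enabled"]
    settings["cache_enabled"] = True
    try:
        yield
    finally:
        settings["cache_enabled"] = previous


# Context manager: the setting is limited to the block
with cache_context():
    inside_context = settings["cache_enabled"]
after_context = settings["cache_enabled"]

# Toggle: the setting persists for all code that runs afterwards
toggle_cache(True)
after_toggle = settings["cache_enabled"]
toggle_cache(False)  # we have to remember to switch it off ourselves

print(inside_context, after_context, after_toggle)
```

With the context manager the setting is restored automatically, whereas the toggle stays on until it is explicitly switched off.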

### Use cases for `pl.StringCache`

We need the string cache whenever different objects with a categorical dtype are involved. For example when:
- joining `DataFrames` with categorical dtypes
- concatenating `DataFrames` with categorical dtypes
- creating a `DataFrame` with categorical dtype from multiple files

We will see examples of these in later sections of the course.
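
In the meantime, a toy sketch in plain Python shows why concatenation needs a shared mapping. The `encode` and `decode` helpers are hypothetical and the first-appearance coding scheme is an assumption made for illustration. It is not a description of Polars internals.

```python
def encode(values, mapping):
    """Toy encoder: assign codes in order of first appearance within `mapping`."""
    return [mapping.setdefault(value, len(mapping)) for value in values]


def decode(codes, mapping):
    """Turn integer codes back into strings using `mapping`."""
    reverse = {code: value for value, code in mapping.items()}
    return [reverse[code] for code in codes]


left, right = ["a", "b"], ["b", "a"]

# Separate mappings: the same code means different strings on each side
left_map, right_map = {}, {}
naive_codes = encode(left, left_map) + encode(right, right_map)
wrong = decode(naive_codes, left_map)

# One shared mapping: the concatenated codes decode correctly
shared_map = {}
shared_codes = encode(left, shared_map) + encode(right, shared_map)
correct = decode(shared_codes, shared_map)

print(wrong)
print(correct)
```

When each side has its own mapping, stacking the integer codes silently changes the values, while a shared mapping keeps the concatenated data equal to `left + right`.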

## Exercises
In the exercises you will develop your understanding of:
- filtering a categorical column
- using the string cache
- the effect of categoricals on the query optimiser

### Exercise 1
Create a `DataFrame` from the Titanic dataset and cast the `Pclass` column to categorical.

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    <blank>
    .head(3)
)

Continue by casting the `Embarked` column to categorical (change `with_column` to `with_columns`). 

Filter the `Pclass` column for third class passengers

Add a filter on the `Embarked` column for passengers who embarked in either Southampton (`S`) or Queenstown (`Q`)

Do the full query again but in lazy mode. 

Print the optimised query plan (recall that you need to call `print` on the query for the optimised plan to format correctly).

Can Polars push the filters on `Pclass` and `Embarked` back to the CSV SCAN? See the `SELECTION` part of the optimised plan.

To see the effect of categoricals on the optimised query plan do the query again but without casting `Pclass` and `Embarked` to categorical.

## Solutions

### Solution to Exercise 1

Cast the `Pclass` column to categorical

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical)
    )
)


Cast the `Embarked` column to categorical

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        [
            pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
            pl.col("Embarked").cast(pl.Categorical)
        ]
    )
    .head(3)
)


Filter the `Pclass` column for third class passengers

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        [
            pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
            pl.col("Embarked").cast(pl.Categorical)
        ]
    )
    .filter(pl.col("Pclass")=="3")
    .head(3)
)


In addition, filter the `Embarked` column for passengers who embarked in Southampton (`S`) or Queenstown (`Q`)

In [None]:
with pl.StringCache():
    (
        pl.read_csv(csvFile)
        .with_columns(
            [
                pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
                pl.col("Embarked").cast(pl.Categorical)
            ]
        )
        .filter(pl.col("Pclass")=="3")
        .filter(pl.col("Embarked").is_in(["S","Q"]))
        .head(3)
    )

Do the full query again but in lazy mode.

Print the optimised query plan (recall that you need to call `print` on the query for the optimised plan to format correctly).

In [None]:
with pl.StringCache():
    print(
        pl.scan_csv(csvFile)
        .with_columns(
            [
                pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
                pl.col("Embarked").cast(pl.Categorical)
            ]
        )
        .filter(pl.col("Pclass")=="3")
        .filter(pl.col("Embarked").is_in(["S","Q"]))
        .explain()
    )

Can Polars push the filters on `Pclass` and `Embarked` back to the CSV SCAN? 

Polars is not able to push filters on categorical columns back to the CSV SCAN. In this case Polars reads all rows into memory, casts the columns to categorical and then applies the filters.

When we do the query without categoricals the filters are pushed back to the CSV SCAN and so the filtering happens when reading from the CSV.

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col("Pclass")=="3")
    .filter(pl.col("Embarked").is_in(["S","Q"]))
    .explain()
)