# Filtering rows 3: using `filter` in lazy mode
By the end of this section we will learn how to:
- use `filter` in lazy mode
- understand the optimized and non-optimized query plans
- combine multiple conditions in lazy mode

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

Create a `LazyFrame` by scanning the CSV

In [None]:
df = pl.scan_csv(csvFile)
df

This output is the non-optimized `naive` query plan.

## `filter` in lazy mode

When we apply a `filter` in lazy mode a `FILTER` line is added to the `naive` query plan.

In [None]:
df = (
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 30)
)
df

All query plans are read from bottom to top. 

We examine this non-optimized query plan first. 

The components of the query plan are separated by `FROM`. Wherever `FROM` appears, Polars creates a `DataFrame` internally when it executes the query plan.

### First stage
In this non-optimized query plan the first stage (before `FROM` when reading from the bottom) is:
`CSV SCAN ../data/titanic.csv
PROJECT */12 COLUMNS
`

and this means:
- `CSV SCAN` where Polars reads the CSV line-by-line
- `PROJECT */12 COLUMNS` reads all 12 of the columns from the CSV (`*` is a wildcard meaning all)

### Second stage
In this non-optimized query plan the second stage is:
`FILTER [(col("Age")) > (30i32)] FROM`

This states that **once the entire CSV file has been read into memory** as a `DataFrame`:
- the `DataFrame` will be filtered for rows with `Age` greater than 30

## Inspecting the optimized query plan
We compare this with the optimized query plan that Polars will actually run when the `LazyFrame` is evaluated with `collect` or `fetch`.

We need to `print` the output of `explain` to format it correctly.

In [None]:
df = (
    pl.scan_csv(csvFile)
    .filter(
        pl.col("Age") > 30
    )
)
print(df.explain())

The `CSV SCAN` and `PROJECT` parts have not changed relative to the non-optimized plan. However:

`SELECTION: [(col("Age")) > (30.0)]` shows that Polars will apply the filter on the `Age` column **as the CSV is being read**.

For emphasis: in the optimized plan only the rows of the CSV that meet the filter conditions are read into a `DataFrame`. This is memory efficient.

### Multiple conditions in lazy mode
In *lazy mode* if we chain multiple `filter` calls then the query optimizer combines these into a *single condition* inside `SELECTION`.

In this example we filter for first class passengers over the age of 70.

In [None]:
df = (
    pl.scan_csv(csvFile)
    .filter(
        pl.col('Pclass')==1
    )
    .filter(
        (pl.col('Age') > 70)
    )
)
print(df.explain())

## Exercises
In the exercises you will develop your understanding of:
- using the `filter` method in lazy mode
- interpreting optimized query plans
- applying multiple conditions

### Exercise 1
Create a `LazyFrame` with the rows where `Fare` is greater than 10

In [None]:
(
    pl.<blank>(csvFile)
    <blank>
)

Print out the optimized query plan and confirm the `SELECTION` is updated

Evaluate this query for the first 10 rows

### Exercise 2 
Create a `LazyFrame` where `Age` is greater than 30 and the passenger is in 2nd class

In [None]:
(
    pl.<blank>(csvFile)
    <blank>
)

Print out the optimized query plan and confirm the `SELECTION` is updated

Evaluate this query for the full `DataFrame`

### Exercise 3
Create a lazy query with the following optimized plan
```
  CSV SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS
  SELECTION: [([(col("Sex")) == (Utf8(female))]) & ([(col("Fare")) < (10.0)])]
```
Note: the order of the predicate conditions in the optimized plan can vary.

## Solutions

### Solution to Exercise 1
Create a `LazyFrame` with the rows where `Fare` is greater than 10

In [None]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col('Fare') > 10)
)

Print out the optimized query plan and confirm the `SELECTION` is updated

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col('Fare') > 10)
    .explain()
)    

Evaluate this query for the first 10 rows

In [None]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col('Fare') > 10)
    .fetch(10)
)    

### Solution to Exercise 2
Create a `LazyFrame` where `Age` is greater than 30 and the passenger is in 2nd class

In [None]:
(
    pl.scan_csv(csvFile)
    .filter(
        (pl.col('Age') > 30) & (pl.col('Pclass')==2)
    )
)

Print out the optimized query plan and confirm the `SELECTION` is updated

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(
        (pl.col('Age') > 30) & (pl.col('Pclass')==2)
    )
    .explain()
)

Evaluate this query for the full `DataFrame`

In [None]:
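# Rebuild the Exercise 2 query: `df` still holds the earlier first-class, over-70 query
pl.scan_csv(csvFile).filter((pl.col('Age') > 30) & (pl.col('Pclass') == 2)).collect()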

### Solution to Exercise 3

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col('Fare') < 10)
    .filter(pl.col('Sex') == 'female')
    .explain()
)