# Lazy mode 1: Introducing lazy mode
By the end of this section we will learn how to:
- create a `LazyFrame` from a CSV file
- explain the difference between a `DataFrame` and a `LazyFrame`
- print the optimized query plan

Lazy mode is crucial to taking full advantage of Polars, as it enables query optimisation and streaming of large datasets. We introduce lazy mode in this section and revisit it throughout the workshop.

## Code or queries?
Data analysis often involves multiple steps:
- loading data from a file or database
- transforming the data
- grouping by a column
- ...

We call the set of steps a **query**.

We can write some lines of code that carry out a query step-by-step in **eager mode**.

There are two problems with this approach:
- Each line of code is not aware of what the others are doing.
- Each line of code produces a full intermediate dataframe, which can mean extra work and memory use.

We can instead write the steps as an integrated query in **lazy mode**.

With an integrated query:
- a query optimizer can identify efficiencies
- a query engine can minimise the memory usage and produce a single output

## So what are eager and lazy modes?

**Eager mode**: each line of code is run as soon as it is encountered.

**Lazy mode**: each line is added to a query plan and the query plan is optimized.

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

## `DataFrames` and `LazyFrames`
We **read** a CSV in eager mode with `pl.read_csv`. This creates a **`DataFrame`**.

In [None]:
dfEager = pl.read_csv(csvFile)

We **scan** a CSV in lazy mode with `pl.scan_csv`. This creates a **`LazyFrame`**.

In [None]:
dfLazy = pl.scan_csv(csvFile)

In [None]:
dfLazy.schema

We cannot get the shape of a `LazyFrame` as Polars does not know how many rows there are from a scan.

In [None]:
# dfLazy.shape  # This will throw an error as shape is not an attribute of the LazyFrame object

In [None]:
dfLazy.collect().shape # This will work as we are converting LazyFrame to a DataFrame

### What's the difference between a `DataFrame` and a `LazyFrame`?

If we print a `DataFrame` we see data...

In [None]:
dfEager.head(2)

...but if we print a `LazyFrame` we see a **query plan**.


In [None]:
dfLazy.head(2)

**Key message:** 

**- An operation on a `DataFrame` acts on the `data`**

**- An operation on a `LazyFrame` acts on the `query plan`**

## Operations on a `DataFrame` and a `LazyFrame` 
To show the difference between operations on a `DataFrame` and a `LazyFrame` we rename the `PassengerId` column to `Id` using `rename`.

On a `DataFrame` we see the first column is renamed...

In [None]:
(
    dfEager
    .rename({"PassengerId":"Id"})
    .head(2)
)    

...while on a `LazyFrame` we see that a `RENAME` step is added to the query plan.

In [None]:
(
    dfLazy
    .rename({"PassengerId":"Id"})
)    

## Chaining or re-assigning?
In this workshop we typically run operations with method chaining like this:

In [None]:
print(
    pl.scan_csv(csvFile)
    .rename({"PassengerId":"Id"})
    .explain()
)    

However, we can also do operations by re-assigning the variable in each step:

In [None]:
dfLazy = pl.scan_csv(csvFile)
dfLazy = dfLazy.rename({"PassengerId":"Id"})
print(dfLazy.explain())

The two methods are equivalent.

## Query optimisation
Polars creates a *naive query plan* from your query.

Polars passes the naive query plan to its **query optimizer**. The query optimizer looks for more efficient ways to arrive at the output you want.

Printing the output of the `explain` method shows the optimized plan.

In [None]:
dfLazy = pl.scan_csv(csvFile)
print(dfLazy.explain())

## What query optimizations are applied?
Query optimizations aren't magic. Most optimizations could be implemented by users in a well-written query if the user:
- knows the optimization exists 
- remembers to implement the optimization and 
- implements the optimization correctly!

Optimizations applied by Polars include:
- `projection pushdown`: limit the number of columns read to those required
- `predicate pushdown`: apply filter conditions as early as possible
- `combine predicates`: combine multiple filter conditions
- `slice pushdown`: limit rows processed when limited rows are required
- `common subplan elimination`: run duplicated transformations on the same data once and then re-use

We'll see how these optimisations arise later in the workshop.

Polars also implements other optimisations such as fast-path algorithms on sorted data (separate from the query optimiser). 

## Exercises

In the exercises you will develop your understanding of:
- creating a `LazyFrame` from a CSV file
- getting metadata from a `LazyFrame`
- printing the query plans

### Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
df = pl.<blank>

Use the `fetch` method and count how many rows it returns by default.

Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- column names
- schema

## Solutions

### Solution to Exercise 1

In [None]:
df = pl.scan_csv(csvFile)

In [None]:
df.fetch().shape

A `LazyFrame` does not know the number of rows in a CSV, so asking for its shape throws an error.

In [None]:
df.shape

A `LazyFrame` does know the column names. As we will see in the I/O section, Polars scans the first row of the CSV file to get column names in `pl.scan_csv`.

In [None]:
df.columns

In [None]:
df.schema