<img src="../assets/data_analysis_with_polars_copyright-1.png" width="600"/>

This notebook is a free sample from the Data Analysis with Polars course on Udemy.

Use this link to do the full course at half price: https://www.udemy.com/course/data-analysis-with-polars/?couponCode=POLARS_HALF_PRICE


# Lazy mode 1: Introducing lazy mode
By the end of this lecture you will be able to:
- create a `LazyFrame` from a CSV file
- explain the difference between a `DataFrame` and a `LazyFrame`
- print the optimized query plan

## Code or queries?
Data analysis often involves multiple steps:
- loading data from a file or database
- transforming the data
- grouping by a column
- ...

We call the set of steps a **query**.

We can write some lines of code that carry out a query step-by-step in eager mode.

There are two problems with this approach:
- Each line of code is not aware of what the others are doing.
- Each line of code produces a full intermediate dataframe, even when only part of it is needed by later steps.

We can instead write the steps as an integrated query in lazy mode.

With an integrated query:
- a query optimizer can identify efficiencies
- a query engine can minimise the memory usage and produce a single output

## What are eager and lazy modes?

**Eager mode**: each line of code is run as soon as it is encountered.

**Lazy mode**: each line is added to a query plan and the query plan is optimized.

We introduce lazy mode in this lesson and revisit it repeatedly throughout the course.

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

## `DataFrames` and `LazyFrames`
We **read** a CSV in eager mode with `pl.read_csv`. This creates a **`DataFrame`**.

In [None]:
dfEager = pl.read_csv(csvFile)

We **scan** a CSV in lazy mode with `pl.scan_csv`. This creates a **`LazyFrame`**.

In [None]:
dfLazy = pl.scan_csv(csvFile)

> We look at what happens when you scan a CSV in more detail in the I/O section.

We compare the types of `dfEager` and `dfLazy`.

In [None]:
print(type(dfEager))
print(type(dfLazy))

### What's the difference between a `DataFrame` and a `LazyFrame`?

If we print a `DataFrame` we see data...

In [None]:
dfEager.head(2)

...but if we print a `LazyFrame` we see a **query plan**.

In [None]:
dfLazy

**Key message: if you do something to a `DataFrame` it updates the data. If you do something to a `LazyFrame` it updates the query plan**.

## What does Polars do with query plans?
Polars creates a *naive query plan* from your query.

Polars passes the naive query plan to its **query optimizer**. The query optimizer looks for more efficient ways to arrive at the output you want.

Printing the output of the `describe_optimized_plan` method shows the optimized plan. Recent Polars releases expose the same information through the `explain` method.

In [None]:
print(dfLazy.describe_optimized_plan())

## What query optimizations are applied?
Query optimizations aren't magic. Most optimizations could be implemented by users in a well-written query if the user:
- knows the optimization exists
- remembers to implement the optimization
- implements the optimization correctly!

Optimizations applied by Polars include:
- `projection pushdown`: limit the number of columns read to those required
- `predicate pushdown`: apply filter conditions as early as possible
- `slice pushdown`: limit rows processed when limited rows are required
- `combine predicates`: combine multiple filter conditions
- `common subplan elimination`: combine duplicated transformations

Polars also implements other optimisations such as fast-path algorithms on sorted data (separate from the query optimiser).

## Exercises

In the exercises you will develop your understanding of:
- creating a `LazyFrame` from a CSV file
- printing the query plans

## Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
df = pl.<blank>

Exercise 1 cont: Use the `fetch` method and count how many rows it returns by default

Exercise 1 cont: Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- column names

## Exercise 2: converting between eager and lazy mode
Create a `LazyFrame` from the Titanic CSV file

In [None]:
df = <blank>

Exercise 2 cont: Convert the `LazyFrame` to a `DataFrame`

In [None]:
df = <blank>

Exercise 2 cont: Convert the `DataFrame` to a `LazyFrame`

In [None]:
df = <blank>

## Solutions

## Solution to Exercise 1

In [None]:
df = pl.scan_csv(csvFile)

In [None]:
df.fetch().shape

A `LazyFrame` does not know the number of rows in a CSV.

In [None]:
# Left commented out: a LazyFrame has no `shape` attribute, so this line raises an AttributeError
# df.shape

A `LazyFrame` does know the column names. As we will see in the I/O section, Polars reads the first row of the CSV file in `pl.scan_csv` to get the column names.

In [None]:
df.columns

## Solution to Exercise 2

Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
df = pl.scan_csv(csvFile)
df

Exercise 2 cont: Convert the `LazyFrame` to a `DataFrame`

In [None]:
df = df.collect()
df.head(3)

Exercise 2 cont: Convert the `DataFrame` to a `LazyFrame`

In [None]:
df = df.lazy()
df