# Selecting columns 2: using `select` and expressions
By the end of this lecture you will be able to:
- select a column or columns with `select`
- transform a column while selecting it
- select a column in lazy mode

Selecting columns with expressions is key to performant analysis as:
- this approach works in lazy mode
- when we select and transform multiple columns, Polars runs these selections in parallel

We introduce the range of methods we can use to select columns with an expression in this lecture.

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Selecting a single column with a string

We can choose a column by passing its name as a string to the `select` method.

In [None]:
(
    df
    .select('Age')
    .head(3)
)

The output of `select` is always a `DataFrame` rather than a `Series` even if just one column is selected.

We can use `to_series` if we want a `Series`

In [None]:
(
    df
    .select('Age')
    .to_series()
    .head(3)
)

### Selecting multiple columns with a `list`

We can pass a list of column names to `select`

In [None]:
(
    df
    .select(
        ['Survived','Age']
    )
    .head(3)
)

## Differences between using `select` and `[]`

- `[]` indexing can only be used in eager mode, but **`select` can also be used in lazy mode**
- expressions in `select` can be **optimised** in lazy mode by the query optimiser
- multiple expressions in `select` can be run in *parallel*


## Selecting columns with an expression

We can select a column with an expression in the `select` method

In [None]:
(
    df
    .select(
        pl.col('Age')
    )
    .head(3)
)

## Selecting and transforming a column with an expression
We can apply a transformation to a column before we output it.

In this example we use the `round` expression to round the values of the `Fare` column

In [None]:
(
    df
    .select(
        pl.col('Fare').round(0)
    )
    .head(3)
)

We will see many more examples where we use expressions to transform data as we go through the course.

### Selecting multiple columns with a list of expressions

We can also pass a list of expressions to `select`. 

In this case we use the `alias` expression to change the name of one column in the output

In [None]:
(
    df
    .select(
        [
            pl.col('Fare'),
            pl.col('Fare').round(0).alias('roundedFare')
        ]
    )
    .head(3)
)

Recall that when you have multiple expressions Polars will run them in parallel.

## Selecting columns in lazy mode

If we apply `select` in lazy mode it changes the `PROJECT` part of the optimised query plan

In [None]:
df = (
    pl.scan_csv(csvFile)
    .select(['Survived','Age'])
)
print(df.describe_optimized_plan())

The optimised query plan now has:

`PROJECT 2/12 COLUMNS`

This means that Polars only loads the `Survived` and `Age` columns into memory when reading the CSV.

Reducing the number of columns that are read reduces both run time and memory usage.

The `FAST_PROJECT` part of the query plan doesn't have any implications for users but is described here if you are curious... 

> The `FAST_PROJECT` happens when `select` is applied to `scan_csv` but **no transformations are applied** to any columns.

> In this simpler case with column selections and no transformations Polars modifies its standard parallel approach and does the column selection in serial. This is faster than the standard method in parallel and so it is called `FAST_PROJECT`.

# Exercises

In the exercises you will develop your understanding of:
- selecting columns using the `select` method
- transforming columns within the `select` method
- using `select` in lazy mode

## Exercise 1: Select the `Age` and `Survived` columns using the Expression API

Do this twice:
- once using strings
- once using expressions

In [None]:
df = pl.read_csv(csvFile)
df.<blank>.head(3)
df.<blank>.head(3)

## Exercise 2: Select all rows where `Age` is greater than 30 and output the `Age` and `Survived` columns

In [None]:
df = pl.read_csv(csvFile)
df.<blank>.head(3)

## Exercise 3: Output a one-column DataFrame where the column is the `min` of the `Age` column


In [None]:
df = pl.read_csv(csvFile)
df.<blank>

Exercise 3 cont: Output a one-row DataFrame where the first column is the `min` of the `Age` column and the second column is the `max` of the `Age` column

Expand the following cell if you want a hint

In [None]:
#Hint: you cannot have two columns with the same name so you will have to use the `alias` expression 

In [None]:
df = pl.read_csv(csvFile)
df.<blank>

## Exercise 4: Convert the following Pandas code to Polars code

```python
dfPandas.loc[
    (dfPandas['SibSp'] > 0) & (dfPandas['Parch'] > 0),
    ['Survived','SibSp','Parch']
]
```

## Exercise 5: Using lazy mode, create a query that has the following query plan

```
FAST_PROJECT: [Age, Pclass, Survived]
    CSV SCAN ../data/titanic.csv
    PROJECT 3/12 COLUMNS
    SELECTION: None
```

In [None]:
print(
    <blank>.describe_optimized_plan()
)

# Solutions

## Solution to Exercise 1

In [None]:
df = pl.read_csv(csvFile)
df.select(['Age','Survived']).head(3)
df.select([pl.col('Age'),pl.col('Survived')]).head(3)


## Solution to Exercise 2

In [None]:
df = pl.read_csv(csvFile)
df.filter(pl.col('Age')>30).select(['Age','Survived']).head(3)

## Solution to Exercise 3

In [None]:
df = pl.read_csv(csvFile)
df.select(pl.col('Age').min())

In [None]:
df = pl.read_csv(csvFile)
df.select([pl.col('Age').min().alias('age_min'),pl.col('Age').max().alias('age_max')])

## Solution to Exercise 4

In [None]:
df = pl.read_csv(csvFile)
(df
 .filter((pl.col('SibSp') > 0) & (pl.col('Parch') > 0))
 .select(['Survived','SibSp','Parch'])
).head()


## Solution to Exercise 5
```
  FAST_PROJECT: [Age, Pclass, Survived]
    CSV SCAN ../data/titanic.csv
    PROJECT 3/12 COLUMNS
```

In [None]:
print(pl.scan_csv(csvFile).select(['Age','Pclass','Survived']).describe_optimized_plan())