# Selecting columns: using `select` and expressions

In [1]:
import polars as pl

In [2]:
csv_file = "data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Selecting a single column with a string

In [4]:
df.select("Age").head(3)

Age
f64
22.0
38.0
26.0


> **Note:** The output of `select` is always a `DataFrame` rather than a `Series` even if just one column is selected.

We can call `to_series` on the result if we want a `Series` instead.

In [5]:
df.select("Age").head(3).to_series()

Age
f64
22.0
38.0
26.0


### Selecting multiple columns with a list or comma-separated strings

Pass a list of column names:

In [6]:
df.select(
    ["Age", "Pclass"]
).head(3)

Age,Pclass
f64,i64
22.0,3
38.0,1
26.0,3


Or pass the column names as separate string arguments:

In [7]:
df.select(
    "Age", "Pclass"
).head(3)

Age,Pclass
f64,i64
22.0,3
38.0,1
26.0,3


## Differences between using `select` and `[]`

- `[]` indexing can only be used in eager mode, but **`select` can also be used in lazy mode**.
- Expressions in `select` can be **optimized** in lazy mode by the query optimizer.
- Multiple expressions in `select` can be run in *parallel*.

For these reasons, prefer `select()` over `[]` indexing.

## Selecting columns with an expression

In [8]:
df.select(
    pl.col("Age")
).head(3)

Age
f64
22.0
38.0
26.0


## Selecting and transforming a column with an expression

In [9]:
df.select(
    pl.col("Fare").round(0)
).head(3)

Fare
f64
7.0
71.0
8.0


### Selecting multiple columns with a list of expressions

In [None]:
df.select(
    [
        pl.col("Fare"),
        pl.col("Fare").round(0).alias("roundedFare"),  # alias can rename a column
    ]
).head(3)

Fare,roundedFare
f64,f64
7.25,7.0
71.2833,71.0
7.925,8.0


> **Note:** When `select` receives multiple expressions, Polars runs them in parallel.

## Returning a single value

In [None]:
df.select(
    pl.col("Name").first() # get the first value
).item() # return a scalar or an element

'Braund, Mr. Owen Harris'

## Selecting columns in lazy mode

If we apply `select` in lazy mode, it changes the `PROJECT` part of the optimised query plan, so only the selected columns are read from the CSV file.

In [None]:
lf = pl.scan_csv(
    csv_file
).select(
    ["Survived", "Age"]
)

print(lf.explain()) # Only 2/12 columns

Csv SCAN [data/titanic.csv]
PROJECT 2/12 COLUMNS
ESTIMATED ROWS: 971


## Exercises

### Exercise 1

Select the `Age` and `Survived` columns using the Expression API

Do this twice:
- once using strings
- once using expressions

In [14]:
df.select(["Age", "Survived"]).head(3)

Age,Survived
f64,i64
22.0,0
38.0,1
26.0,1


In [15]:
df.select(
    pl.col("Age"),
    pl.col("Survived")
).head(3)

Age,Survived
f64,i64
22.0,0
38.0,1
26.0,1


### Exercise 2
Select all rows where `Age` is greater than 30 and output the `Age` and `Survived` columns

In [16]:
df.filter(
    pl.col("Age") > 30
).select(
    ["Age", "Survived"]
).head()

Age,Survived
f64,i64
38.0,1
35.0,1
35.0,0
54.0,0
58.0,1


### Exercise 3
Output a one-column DataFrame where the column is the `min` of the `Age` column

In [17]:
df.select(
    pl.col("Age").min()
)

Age
f64
0.42


Add another line onto the query to output this single value as a float

In [19]:
df.select(
    pl.col("Age").min()
).item()

0.42

Output a one-row DataFrame where the first column is the `min` of the `Age` column and the second column is the `max` of the `Age` column

In [21]:
df.select(
    [
        pl.col("Age").min().alias("min_age"),
        pl.col("Age").max().alias("max_age"),
    ]
)

min_age,max_age
f64,f64
0.42,80.0
