# Selecting columns 3: selecting multiple columns
By the end of this section you will be able to:
- select all columns from a `DataFrame`
- exclude columns from a select on a `DataFrame` 
- select columns based on a regex
- select columns based on dtype


In [1]:
import polars as pl

In [2]:
csvFile = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csvFile)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Selecting all columns from a `DataFrame`

We can select all columns by using `pl.all()` in place of `pl.col`.

In [4]:
df.select(
    pl.all()
).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


We can exclude one or more columns with `pl.exclude`, which selects every column except the ones we name.

In [5]:
df.select(
    pl.exclude(['PassengerId','Survived','Pclass'])
).head(3)

Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,str,f64,i64,i64,str,f64,str,str
"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


## Selecting columns with a regex
We can select columns with a regex. Polars treats a string as a regex only if it starts with `^` and ends with `$`; otherwise the string is matched as a literal column name.

The following regex matches column names that start with `P`. The regex *wildcard* `.*` allows `P` to be followed by any characters.

In [6]:
df.select(
    "^P.*$"
).head(3)

PassengerId,Pclass,Parch
i64,i64,i64
1,3,0
2,1,0
3,3,0


We can pass this regex to `pl.col` to apply a transformation to each matching column. In this example we take the `max` of each column.

In [7]:
df.select(
    pl.col("^P.*$").max()
).head(3)

PassengerId,Pclass,Parch
i64,i64,i64
891,3,6


## Selecting columns based on dtype
We can select all of the columns that have a particular dtype by passing the dtype to `pl.col`.

We first call `df.describe()` for a summary of each column. We then select all the string columns with `pl.Utf8`, the string dtype.

In [14]:
df.describe()

describe,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,f64,f64,f64,str,str,f64,f64,f64,str,f64,str,str
"""count""",891.0,891.0,891.0,"""891""","""891""",891.0,891.0,891.0,"""891""",891.0,"""891""","""891"""
"""null_count""",0.0,0.0,0.0,"""0""","""0""",177.0,0.0,0.0,"""0""",0.0,"""687""","""2"""
"""mean""",446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
"""std""",257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
"""min""",1.0,0.0,1.0,"""Abbing, Mr. An…","""female""",0.42,0.0,0.0,"""110152""",0.0,"""A10""","""C"""
"""max""",891.0,1.0,3.0,"""van Melkebeke,…","""male""",80.0,8.0,6.0,"""WE/P 5735""",512.3292,"""T""","""S"""
"""median""",446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
"""25%""",223.0,0.0,2.0,,,20.0,0.0,0.0,,7.8958,,
"""75%""",669.0,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [22]:
(
    df
    .select(
        pl.col(pl.Utf8)
    )
    .head(3)
)

Name,Sex,Ticket,Cabin,Embarked
str,str,str,str,str
"""Braund, Mr. Ow…","""male""","""A/5 21171""",,"""S"""
"""Cumings, Mrs. …","""female""","""PC 17599""","""C85""","""C"""
"""Heikkinen, Mis…","""female""","""STON/O2. 31012…",,"""S"""


We can also pass a list of dtypes to `pl.col`. In this case we select the 64-bit integer and 64-bit floating point columns.

In [23]:
(
    df
    .select(
        pl.col([pl.Int64,pl.Float64])
    )
    .head(3)
)

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
i64,i64,i64,f64,i64,i64,f64
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925


## Exercises

In the exercises you will develop your understanding of:
- selecting all columns from a `DataFrame`
- excluding columns from a selection
- selecting columns with a regex
- selecting columns with a dtype

### Exercise 1

Select all columns from the `DataFrame` and sort each column

In [None]:
df = pl.read_csv(csvFile)
df.<blank>.head(3)

### Exercise 2
Select all columns from the `DataFrame` with the exception of the `PassengerId` column

In [None]:
df = pl.read_csv(csvFile)
df.<blank>.head(3)

### Exercise 3
Select all columns from the `DataFrame` that start with `S` or `N`

In [None]:
df = pl.read_csv(csvFile)
df.<blank>

### Exercise 4
Select all the columns with 64-bit floating point dtype

Hint: the 64-bit floating point dtype is `pl.Float64`

In [None]:
df = pl.read_csv(csvFile)
df.<blank>

### Exercise 5
Convert the following Pandas code to Polars

Avoid looping over columns in Polars wherever possible: expressions let Polars process the columns in parallel.

Convert this Pandas code with a loop over the columns to Polars code using the Expression API.

In the loop we create a dictionary `maxDict` that maps the name of each `float64` column to its maximum value.

In [None]:
import pandas as pd
import numpy as np
df = pl.read_csv(csvFile)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

## Solutions

### Solution to Exercise 1
Select all columns from the `DataFrame` and sort each column

In [None]:
(
    pl.read_csv(csvFile)
    .select(
        pl.all().sort()
    )
    .head(3)
)

### Solution to Exercise 2
Select all columns from the `DataFrame` with the exception of the `PassengerId` column

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .select(
        pl.all().exclude('PassengerId')
    )
    .head(3)
)

### Solution to Exercise 3
Select all columns from the `DataFrame` that start with `S` or `N`

In [25]:
df = pl.read_csv(csvFile)
(
    df
    .select(
        pl.col("^(S|N).*$")
    )
    .head(3)
)

Survived,Name,Sex,SibSp
i64,str,str,i64
0,"""Braund, Mr. Ow…","""male""",1
1,"""Cumings, Mrs. …","""female""",1
1,"""Heikkinen, Mis…","""female""",0


### Solution to Exercise 4
Select all the columns with 64-bit floating point dtype

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .select(
        pl.col(pl.Float64)
    )
)

### Solution to Exercise 5
Convert the following Pandas code to Polars
```python
import pandas as pd
import numpy as np
df = pl.read_csv(csvFile)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)
```

In [27]:
(
    pl.read_csv(csvFile)
    .select(
        pl.col(pl.Float64).max()
    )
)