# Selecting columns 3: selecting multiple columns

Polars offers two ways to select multiple columns:
- Expression API with `pl.col` or `pl.all`
- Selectors API with polars selectors such as `cs.contains`

In [1]:
import polars as pl
import polars.selectors as cs

In [2]:
csv_file = "data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


### Selecting all columns from a `DataFrame`

We can select all columns with `pl.all`.

In [4]:
df.select(
    pl.all()
).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


We can select all but a subset of columns with `pl.exclude`.

In [5]:
df.select(
    pl.exclude("PassengerId", "Survived", "Pclass")
).head(3)

Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,str,f64,i64,i64,str,f64,str,str
"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


### Selecting columns with a regex

Polars treats a string as a regex when it starts with `^` and ends with `$`.

In [None]:
df.select(
    "^P.*$"  # .* is a wildcard that matches any sequence of characters
).head(3)

PassengerId,Pclass,Parch
i64,i64,i64
1,3,0
2,1,0
3,3,0


Pass the regex to `pl.col` to apply transformations to the matching columns.

In [9]:
df.select(
    pl.col("^P.*$").max()
)

PassengerId,Pclass,Parch
i64,i64,i64
891,3,6


### Selecting columns based on dtype

In [10]:
df.select(
    pl.col(pl.Utf8)
).head(3)

Name,Sex,Ticket,Cabin,Embarked
str,str,str,str,str
"""Braund, Mr. Owen Harris""","""male""","""A/5 21171""",,"""S"""
"""Cumings, Mrs. John Bradley (Fl…","""female""","""PC 17599""","""C85""","""C"""
"""Heikkinen, Miss. Laina""","""female""","""STON/O2. 3101282""",,"""S"""


We can also pass a list of dtypes to `pl.col`.

In [11]:
df.select(
    pl.col([pl.Int64, pl.Float64])
).head(3)

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
i64,i64,i64,f64,i64,i64,f64
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925


## Using the selectors API
The selectors API aims to make selecting multiple columns less verbose.

In simple cases it replicates what the expression API does.

Select all columns with `cs.all`

In [13]:
df.select(
    cs.all()
).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


Select by position with `cs.first` or `cs.last`

In [14]:
df.select(
    cs.first()
).head(3)

PassengerId
i64
1
2
3


In [15]:
df.select(
    cs.last()
).head(3)

Embarked
str
"""S"""
"""C"""
"""S"""


A selector behaves like a standard Polars expression, so we can chain expression methods onto it.

In [16]:
df.select(
    cs.all().max()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
891,1,3,"""van Melkebeke, Mr. Philemon""","""male""",80.0,8,6,"""WE/P 5735""",512.3292,"""T""","""S"""


Selectors can also pick columns by dtype group. For example, `cs.numeric` selects all integer and floating-point columns.

In [17]:
df.select(
    cs.numeric()
).head()

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
i64,i64,i64,f64,i64,i64,f64
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925
4,1,1,35.0,1,0,53.1
5,0,3,35.0,0,0,8.05


Use the `~` operator to invert a selector and exclude the unwanted columns.

In [18]:
df.select(
    ~cs.by_name("Pclass", "Age")
).head()

PassengerId,Survived,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,str,str,i64,i64,str,f64,str,str
1,0,"""Braund, Mr. Owen Harris""","""male""",1,0,"""A/5 21171""",7.25,,"""S"""
2,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,"""Heikkinen, Miss. Laina""","""female""",0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",1,0,"""113803""",53.1,"""C123""","""S"""
5,0,"""Allen, Mr. William Henry""","""male""",0,0,"""373450""",8.05,,"""S"""


As a simpler alternative to the regex, we can use string methods such as:
- `contains`
- `starts_with`
- `ends_with`
- `matches`

In [19]:
df.select(
    cs.starts_with("P")
).head(3)

PassengerId,Pclass,Parch
i64,i64,i64
1,3,0
2,1,0
3,3,0


In [21]:
df.select(
    cs.starts_with("P", "A") # P or A
).head(3)

PassengerId,Pclass,Age,Parch
i64,i64,f64,i64
1,3,22.0,0
2,1,38.0,0
3,3,26.0,0


With `cs.matches` we can pass a regex without the `^` and `$` anchors, and the pattern may match anywhere in the column name.

In [22]:
df.select(
    cs.matches("Age|Fare")
).head()

Age,Fare
f64,f64
22.0,7.25
38.0,71.2833
26.0,7.925
35.0,53.1
35.0,8.05


The difference between `cs.contains` and `cs.matches` is:
- `cs.contains` looks for all column names that contain the literal substring
- `cs.matches` looks for all column names that match the regex

### Intersection of selectors

To do an intersection of selector conditions, we use the `&` operator to say both conditions must be fulfilled.

In [23]:
df.select(
    cs.numeric() & cs.contains("A")
).head()

Age
f64
22.0
38.0
26.0
35.0
35.0


### Union of selectors
To do a union operation we use the `|` operator to say at least one of the conditions must be satisfied.

In [25]:
df.select(
    cs.string() | cs.contains("P")
).head()

PassengerId,Pclass,Name,Sex,Parch,Ticket,Cabin,Embarked
i64,i64,str,str,i64,str,str,str
1,3,"""Braund, Mr. Owen Harris""","""male""",0,"""A/5 21171""",,"""S"""
2,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",0,"""PC 17599""","""C85""","""C"""
3,3,"""Heikkinen, Miss. Laina""","""female""",0,"""STON/O2. 3101282""",,"""S"""
4,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",0,"""113803""","""C123""","""S"""
5,3,"""Allen, Mr. William Henry""","""male""",0,"""373450""",,"""S"""


### Difference of selectors
To do a difference operation we use a minus operator `-`.

In [27]:
df.select(
    cs.string() - cs.starts_with("T")
).head(3)

Name,Sex,Cabin,Embarked
str,str,str,str
"""Braund, Mr. Owen Harris""","""male""",,"""S"""
"""Cumings, Mrs. John Bradley (Fl…","""female""","""C85""","""C"""
"""Heikkinen, Miss. Laina""","""female""",,"""S"""


### Exercise 1
We create a `DataFrame` from the Spotify data

In [28]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

title,rank,date,artist,url,region,chart,trend,streams
str,i64,date,str,str,str,str,str,i64
"""Starboy""",1,2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",2545384


Select the `title` and `artist` columns using the expression API (and not selectors)

In [31]:
spotify_df.select(
    pl.col(["title", "artist"])
).head()

title,artist
str,str
"""Starboy""","""The Weeknd, Daft Punk"""
"""Closer""","""The Chainsmokers, Halsey"""
"""Let Me Love You""","""DJ Snake, Justin Bieber"""
"""Rockabye (feat. Sean Paul & An…","""Clean Bandit"""
"""One Dance""","""Drake, WizKid, Kyla"""


Select all string and date columns from the Spotify `DataFrame` using the expression API

In [34]:
spotify_df.select(
    pl.col(pl.Utf8, pl.Date)
).head()

title,date,artist,url,region,chart,trend
str,date,str,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Let Me Love You""",2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…",2017-01-01,"""Clean Bandit""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_DOWN"""
"""One Dance""",2017-01-01,"""Drake, WizKid, Kyla""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""


Select all string and date columns from the Spotify `DataFrame`, except the `url` column, using the expression API

In [35]:
spotify_df.select(
    pl.col(pl.Utf8, pl.Date).exclude("url")
).head()

title,date,artist,region,chart,trend
str,date,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""Global""","""top200""","""SAME_POSITION"""
"""Let Me Love You""",2017-01-01,"""DJ Snake, Justin Bieber""","""Global""","""top200""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…",2017-01-01,"""Clean Bandit""","""Global""","""top200""","""MOVE_DOWN"""
"""One Dance""",2017-01-01,"""Drake, WizKid, Kyla""","""Global""","""top200""","""SAME_POSITION"""


Select all string and date columns again but use the selectors API

In [37]:
spotify_df.select(
    cs.by_dtype(pl.Utf8, pl.Date)
).head()

title,date,artist,url,region,chart,trend
str,date,str,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Let Me Love You""",2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…",2017-01-01,"""Clean Bandit""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_DOWN"""
"""One Dance""",2017-01-01,"""Drake, WizKid, Kyla""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""


Select all the columns that start with `t` or `a`

In [38]:
spotify_df.select(
    cs.starts_with("t", "a")
).head()

title,artist,trend
str,str,str
"""Starboy""","""The Weeknd, Daft Punk""","""SAME_POSITION"""
"""Closer""","""The Chainsmokers, Halsey""","""SAME_POSITION"""
"""Let Me Love You""","""DJ Snake, Justin Bieber""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…","""Clean Bandit""","""MOVE_DOWN"""
"""One Dance""","""Drake, WizKid, Kyla""","""SAME_POSITION"""


Select all columns except the integer columns (using the `~` operator)

In [40]:
spotify_df.select(
    ~cs.integer()
).head()

title,date,artist,url,region,chart,trend
str,date,str,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Let Me Love You""",2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…",2017-01-01,"""Clean Bandit""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_DOWN"""
"""One Dance""",2017-01-01,"""Drake, WizKid, Kyla""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""


### Exercise 2
We create a `DataFrame` with temperature and rainfall data from some weather stations

In [29]:
df_weather = pl.DataFrame(
    [
        {
            "Month": "Jan",
            "Station_A (°C)": 20.5,
            "Station_B (°C)": 18.0,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
        {
            "Month": "Feb",
            "Station_A (°C)": 21.0,
            "Station_B (°C)": 18.5,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
    ]
)
df_weather

Month,Station_A (°C),Station_B (°C),Station_A (mm),Station_B (mm)
str,f64,f64,f64,f64
"""Jan""",20.5,18.0,12.0,13.5
"""Feb""",21.0,18.5,12.0,13.5


Select all the columns with `Station` in the column name using `cs.contains`

In [41]:
df_weather.select(
    cs.contains("Station")
).head(3)

Station_A (°C),Station_B (°C),Station_A (mm),Station_B (mm)
f64,f64,f64,f64
20.5,18.0,12.0,13.5
21.0,18.5,12.0,13.5


Use `cs.matches` to select all the columns with `Station` and `°C` in the column name

In [43]:
df_weather.select(
    cs.matches("Station.*°C")
).head()

Station_A (°C),Station_B (°C)
f64,f64
20.5,18.0
21.0,18.5


### Exercise 3
Convert the following Pandas code to Polars using the expression API.

The Pandas code loops over the columns and builds a dictionary `maxDict` that maps each float column name to its maximum value.

Looping over columns should be avoided in Polars whenever an expression can do the same job.

In [30]:
import pandas as pd
import numpy as np
df = pl.read_csv(csv_file)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)

Unnamed: 0,Age,Fare
0,80.0,512.3292


In [45]:
df.select(
    pl.col(pl.Float64).max()
)

Age,Fare
f64,f64
80.0,512.3292
