# Selecting columns 3: selecting multiple columns
By the end of this lecture you will be able to:
- select columns based on a regex
- select columns based on dtype
- use selectors

Polars has two ways of selecting multiple columns:
- the expression API with `pl.col` or `pl.all`
- the selectors API with Polars selectors such as `cs.contains`

We see both of these in this lecture.

To use the selectors API we typically import it as `cs` alongside Polars.

> Got any feedback on the course? I'm always keen to hear from my learners - get in touch at liam@rhosignal.com or message me on linkedin: https://www.linkedin.com/in/liam-brannigan-9080b214a/

> Want to recommend the course to others? Please use the latest referral code from this page: https://linktr.ee/braaannigan

Here we import `polars.selectors` separately as `cs`

In [1]:
import polars as pl
import polars.selectors as cs

In [32]:
csv_file = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


### Selecting all columns from a `DataFrame`

We can select all columns by replacing `pl.col` with `pl.all`

In [4]:
(
    df
    .select(
        pl.all()
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


We can select all but a subset of columns with the `exclude` expression

In [5]:
(
    df
    .select(
        pl.exclude('PassengerId','Survived','Pclass')
    )
    .head(3)
)

Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,str,f64,i64,i64,str,f64,str,str
"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


In [10]:
(
    df
    .filter(
        pl.col('Sex').str.contains("^m.*$")
    )
    .head(2)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


Note that `pl.exclude(...)`, which we used above to drop columns, is a shorthand for `pl.all().exclude(...)`

### Selecting columns with a regex
We can select columns with a regex, provided the regex starts with `^` and ends with `$`. We meet an easier approach to doing this with selectors below.

The following regex looks for columns starting with `P` and uses the regex *wildcard* `.*` to show `P` can be followed by any characters.

In [6]:
(
    df
    .select(
        "^P.*$"
    )
    .head(3)
)

PassengerId,Pclass,Parch
i64,i64,i64
1,3,0
2,1,0
3,3,0


We can pass this regex to `pl.col` to apply transformations to these columns. In this example we take the `max` of each column

In [11]:
(
    df
    .select(
        pl.col("^P.*$").max()
    )
    .head(3)
)

PassengerId,Pclass,Parch
i64,i64,i64
891,3,6


### Selecting columns based on dtype
We can select all of the columns that have a particular dtype by passing the dtype to `pl.col`. I use this approach **a lot** in my Polars pipelines.

Here we select all the string columns with `pl.Utf8`, the string dtype object (in recent Polars versions `pl.Utf8` is an alias for `pl.String`)

In [12]:
(
    df
    .select(
        pl.col(pl.Utf8)
    )
    .head(3)
)

Name,Sex,Ticket,Cabin,Embarked
str,str,str,str,str
"""Braund, Mr. Owen Harris""","""male""","""A/5 21171""",,"""S"""
"""Cumings, Mrs. John Bradley (Fl…","""female""","""PC 17599""","""C85""","""C"""
"""Heikkinen, Miss. Laina""","""female""","""STON/O2. 3101282""",,"""S"""


We can also pass a list of dtypes to `pl.col`. In this case we select both 64-bit integer and float columns

In [13]:
(
    df
    .select(
        pl.col([pl.Int64,pl.Float64])
    )
    .head(3)
)

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
i64,i64,i64,f64,i64,i64,f64
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925


## Using the selectors API
The selectors API aims to make selecting multiple columns less verbose. 

For simple cases it replicates the expression API. For example, to select all columns we use `cs.all`

<hr>


### Selectors API examples from the Polars docs

* An API that makes column selection more intuitive, easier and faster

####  `cs.all()`

In [4]:
import polars.selectors as cs
import polars as pl
from datetime import date
# cs.all() selects all columns
tmp = pl.DataFrame(
    {
        "dt": [date(1999, 12, 31), date(2024, 1, 1)],
        "value": [1_234_500, 5_000_555],
    },
    schema_overrides={"value": pl.Int32},
)

(
    tmp
    .select(
        cs.all().cast(pl.String)  # .cast converts the dtype of the selected columns
    )
)

dt,value
str,str
"""1999-12-31""","""1234500"""
"""2024-01-01""","""5000555"""


In [5]:
(
    tmp
    .select(
        cs.all() - cs.numeric()  # subtracting a selector excludes the columns it matches
    )
)

dt
date
1999-12-31
2024-01-01


#### `cs.by_dtype`

In [7]:
# cs.by_dtype()
tmp = pl.DataFrame(
    {
        "dt": [date(1999, 12, 31), date(2024, 1, 1), date(2010, 7, 5)],
        "value": [1_234_500, 5_000_555, -4_500_000],
        "other": ["foo", "bar", "foo"],
    }
)
print(tmp)

(
    tmp
    .select(
        cs.by_dtype([pl.Date, pl.String])
    )
)

shape: (3, 3)
┌────────────┬──────────┬───────┐
│ dt         ┆ value    ┆ other │
│ ---        ┆ ---      ┆ ---   │
│ date       ┆ i64      ┆ str   │
╞════════════╪══════════╪═══════╡
│ 1999-12-31 ┆ 1234500  ┆ foo   │
│ 2024-01-01 ┆ 5000555  ┆ bar   │
│ 2010-07-05 ┆ -4500000 ┆ foo   │
└────────────┴──────────┴───────┘


dt,other
date,str
1999-12-31,"""foo"""
2024-01-01,"""bar"""
2010-07-05,"""foo"""


In [8]:
(
    tmp
    .select(
        ~cs.by_dtype([pl.Date, pl.String])
    )
)

value
i64
1234500
5000555
-4500000


In [9]:
(
    tmp
    .group_by(
        cs.string()
    )
    .agg(
        cs.numeric().sum()
    )
    .sort(by = "other")
)

other,value
str,i64
"""bar""",5000555
"""foo""",-3265500


#### `cs.by_name()`

In [11]:
# cs.by_name(): the same selection is easy without selectors, using pl.col
(
    tmp
    .select(
        pl.col('value', 'other')
    )
)

value,other
i64,str
1234500,"""foo"""
5000555,"""bar"""
-4500000,"""foo"""


In [12]:
(
    tmp
    .select(
        cs.by_name('value', 'other')
    )
)

value,other
i64,str
1234500,"""foo"""
5000555,"""bar"""
-4500000,"""foo"""


#### `cs.contains()`

In [22]:
tmp = pl.DataFrame(
    {
        "foo" : ["x", "y"],
        "bar" : [123, 456],
        "baz" : [2.0, 5.5],
        "zap" : [False, True]
    }
)

print((
    tmp
    .select(
        pl.selectors.contains("ba") # cs.contains("ba")
    )
))

print(
    (
        tmp
        .select(
            cs.contains("ba", "f")
        )
    )
)

print(
    (
        tmp
        .select(
            ~cs.contains("ba")
        )
    )
)


shape: (2, 2)
┌─────┬─────┐
│ bar ┆ baz │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 123 ┆ 2.0 │
│ 456 ┆ 5.5 │
└─────┴─────┘
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ x   ┆ 123 ┆ 2.0 │
│ y   ┆ 456 ┆ 5.5 │
└─────┴─────┴─────┘
shape: (2, 2)
┌─────┬───────┐
│ foo ┆ zap   │
│ --- ┆ ---   │
│ str ┆ bool  │
╞═════╪═══════╡
│ x   ┆ false │
│ y   ┆ true  │
└─────┴───────┘
shape: (0, 0)
┌┐
╞╡
└┘


#### `cs.ends_with()`, `cs.starts_with()`

In [18]:
tmp = pl.DataFrame(
    {
        "foo": ["x", "y"],
        "bar": [123, 456],
        "baz": [2.0, 5.5],
        "zap": [False, True],
    }
)

print(
    tmp
    .select(
        cs.ends_with("p")
    )
)

shape: (2, 1)
┌───────┐
│ zap   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
└───────┘


#### `cs.exclude`

In [19]:
tmp = pl.DataFrame(
    {
        "aa": [1, 2, 3],
        "ba": ["a", "b", None],
        "cc": [None, 2.5, 1.5],
    }
)

print(
    (
        tmp
        .select(
            cs.exclude("ba", cs.string(), pl.UInt32)
        )
    )
)

shape: (3, 2)
┌─────┬──────┐
│ aa  ┆ cc   │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ 2.5  │
│ 3   ┆ 1.5  │
└─────┴──────┘


<hr>

In [14]:
(
    df
    .select(
        cs.all()
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


In [55]:
(
    df
    .select(
        pl.selectors.contains("N")
    )
    .filter(
        pl.col("*").str.contains("^B")
    )
)

Name
str
"""Braund, Mr. Owen Harris"""
"""Bonnell, Miss. Elizabeth"""
"""Beesley, Mr. Lawrence"""
"""Bing, Mr. Lee"""
"""Backstrom, Mrs. Karl Alfred (M…"
…
"""Bystrom, Mrs. (Karolina)"""
"""Balkic, Mr. Cerin"""
"""Beckwith, Mrs. Richard Leonard…"
"""Banfield, Mr. Frederick James"""


In most Polars examples you see online the selectors sub-module is imported separately as `cs` (and I follow this practice below). However, in my own pipelines I find it easier to skip that extra import and use selectors with the main `pl` import

In [56]:
(
    df
    .select(
        pl.selectors.all()
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


We can also do selection by position with `first` or `last`

In [16]:
(
    df
    .select(
        cs.first()
    )
    .head(3)
)

PassengerId
i64
1
2
3


In [23]:
(
    df
    .select(
        pl.selectors.first()
    )
    .head()
)

PassengerId
i64
1
2
3
4
5


The output of a selector is a standard Polars expression so we can follow it up with standard expression chaining

In [17]:
(
    df
    .select(
        cs.all().max()
    )
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
891,1,3,"""van Melkebeke, Mr. Philemon""","""male""",80.0,8,6,"""WE/P 5735""",512.3292,"""T""","""S"""


In [25]:
(
    df
    .select(
        pl.selectors.all().max()
    )
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
891,1,3,"""van Melkebeke, Mr. Philemon""","""male""",80.0,8,6,"""WE/P 5735""",512.3292,"""T""","""S"""


The selectors API works well in lazy mode and for streaming queries just as expressions do.

We can select columns by groups of dtype - including a group of all integer and floating point dtypes with `cs.numeric`

In [18]:
(
    df
    .select(
        cs.numeric()
    )
    .head(3)
)

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
i64,i64,i64,f64,i64,i64,f64
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925


We can select by name - in this example with a `~` operator to exclude the names listed

In [19]:
(
    df
    .select(
        ~cs.by_name("Pclass","Age")
    )
    .head(3)
)

PassengerId,Survived,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,str,str,i64,i64,str,f64,str,str
1,0,"""Braund, Mr. Owen Harris""","""male""",1,0,"""A/5 21171""",7.25,,"""S"""
2,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,"""Heikkinen, Miss. Laina""","""female""",0,0,"""STON/O2. 3101282""",7.925,,"""S"""


As a simpler alternative to the regex example we saw earlier we can use string methods such as:
- `contains`
- `starts_with`
- `ends_with`
- `matches`

In this example we select all columns beginning with P

In [20]:
(
    df
    .select(
        cs.starts_with("P")
    )
    .head(3)
)

PassengerId,Pclass,Parch
i64,i64,i64
1,3,0
2,1,0
3,3,0


We can apply an OR condition by passing multiple strings

In [None]:
(
    df
    .select(
        cs.starts_with("P","A")
    )
    .head(3)
).columns

With the `matches` method we can pass a regex without the `^` and `$` we need for the expression API

In [21]:
(
    df
    .select(
        cs.matches("Age|Fare")
    )
    .head(3)
)

Age,Fare
f64,f64
22.0,7.25
38.0,71.2833
26.0,7.925


The difference between `cs.contains` and `cs.matches` is:
- `cs.contains` looks for all column names that contain the literal substring
- `cs.matches` looks for all column names that match the regex

### Intersection of selectors

To do an intersection of selector conditions we use the `&` operator to say both conditions must be fulfilled.

In this example we look for **numeric** columns that **contain A** in the column name

In [26]:
(
    df
    .select(
        cs.numeric() & cs.contains("A") 
    )
    .head(3)
)

Age
f64
22.0
38.0
26.0


### Union of selectors
To do a union operation we use the `|` operator to say at least one of the conditions must be satisfied

In [None]:
(
    df
    .select(
        cs.string() | cs.contains("P") 
    )
    .head(3)
)

### Difference of selectors
To do a difference operation we use a minus operator `-`.

In this example we select all string columns other than any column beginning with T

In [None]:
(
    df
    .select(
        cs.string() - cs.starts_with("T") 
    )
    .head(3)
)

# Exercises

In the exercises you will develop your understanding of:
- selecting all columns from a `DataFrame`
- excluding columns from a selection
- selecting columns with a dtype

### Exercise 1
We create a `DataFrame` from the Spotify data

In [23]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

title,rank,date,artist,url,region,chart,trend,streams
str,i64,date,str,str,str,str,str,i64
"""Starboy""",1,2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",2545384


Select the `title` and `artist` columns using the expression API (and not selectors)

In [28]:
(
    spotify_df
    .select(
        pl.col('title', 'artist')
    )
    .head(3)
)

title,artist
str,str
"""Starboy""","""The Weeknd, Daft Punk"""
"""Closer""","""The Chainsmokers, Halsey"""
"""Let Me Love You""","""DJ Snake, Justin Bieber"""


Select all string and date columns from the spotify `DataFrame` using the expression API

In [32]:
(
    spotify_df
    .select(
        pl.col([pl.Date, pl.Utf8])
    )
    .head(2)
)

title,date,artist,url,region,chart,trend
str,date,str,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""


Select all string and date columns from the spotify `DataFrame` - except the `url` column using the expression API (and not selectors)

In [34]:
(
    spotify_df
    .select(
        pl.col([pl.Date, pl.Utf8]).exclude('url')
    )
    .head()
)

title,date,artist,region,chart,trend
str,date,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""Global""","""top200""","""SAME_POSITION"""
"""Let Me Love You""",2017-01-01,"""DJ Snake, Justin Bieber""","""Global""","""top200""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…",2017-01-01,"""Clean Bandit""","""Global""","""top200""","""MOVE_DOWN"""
"""One Dance""",2017-01-01,"""Drake, WizKid, Kyla""","""Global""","""top200""","""SAME_POSITION"""


Select all string and date columns again but use the selectors API

In [36]:
(
    spotify_df
    .select(
        pl.selectors.by_dtype([pl.Date, pl.Utf8])
    )
    .head(2)
)

title,date,artist,url,region,chart,trend
str,date,str,str,str,str,str
"""Starboy""",2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""
"""Closer""",2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION"""


Select all the columns that start with `t` or `a`

In [38]:
(
    spotify_df
    .select(
        pl.selectors.starts_with("t", "a")
    )
    .head()
)

title,artist,trend
str,str,str
"""Starboy""","""The Weeknd, Daft Punk""","""SAME_POSITION"""
"""Closer""","""The Chainsmokers, Halsey""","""SAME_POSITION"""
"""Let Me Love You""","""DJ Snake, Justin Bieber""","""MOVE_UP"""
"""Rockabye (feat. Sean Paul & An…","""Clean Bandit""","""MOVE_DOWN"""
"""One Dance""","""Drake, WizKid, Kyla""","""SAME_POSITION"""


Select all columns except the integer columns (using the ~ operator)

In [28]:
print((
    spotify_df
    .select(
        ~cs.integer()
    )
    .head(3)
))

shape: (3, 7)
┌─────────────┬────────────┬──────────────────┬──────────────────┬────────┬────────┬───────────────┐
│ title       ┆ date       ┆ artist           ┆ url              ┆ region ┆ chart  ┆ trend         │
│ ---         ┆ ---        ┆ ---              ┆ ---              ┆ ---    ┆ ---    ┆ ---           │
│ str         ┆ date       ┆ str              ┆ str              ┆ str    ┆ str    ┆ str           │
╞═════════════╪════════════╪══════════════════╪══════════════════╪════════╪════════╪═══════════════╡
│ Starboy     ┆ 2017-01-01 ┆ The Weeknd, Daft ┆ https://open.spo ┆ Global ┆ top200 ┆ SAME_POSITION │
│             ┆            ┆ Punk             ┆ tify.com/track…  ┆        ┆        ┆               │
│ Closer      ┆ 2017-01-01 ┆ The              ┆ https://open.spo ┆ Global ┆ top200 ┆ SAME_POSITION │
│             ┆            ┆ Chainsmokers,    ┆ tify.com/track…  ┆        ┆        ┆               │
│             ┆            ┆ Halsey           ┆                  ┆        ┆  

### Exercise 2
We create a `DataFrame` with temperature and rainfall data from some weather stations

In [29]:
df_weather = pl.DataFrame(
    [
        {
            "Month": "Jan",
            "Station_A (°C)": 20.5,
            "Station_B (°C)": 18.0,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
        {
            "Month": "Feb",
            "Station_A (°C)": 21.0,
            "Station_B (°C)": 18.5,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
    ]
)
df_weather

Month,Station_A (°C),Station_B (°C),Station_A (mm),Station_B (mm)
str,f64,f64,f64,f64
"""Jan""",20.5,18.0,12.0,13.5
"""Feb""",21.0,18.5,12.0,13.5


Select all the columns with `Station` in the column name using `cs.contains`

In [42]:
(
    df_weather
    .select(
        pl.selectors.contains("Station")
    )
    .head(2)
)

Station_A (°C),Station_B (°C),Station_A (mm),Station_B (mm)
f64,f64,f64,f64
20.5,18.0,12.0,13.5
21.0,18.5,12.0,13.5


Use `cs.matches` to select all the columns with `Station` and `°C` in the column name

In [30]:
(
    df_weather
    .select(
        cs.matches(".*°C.*")
    )
)

Station_A (°C),Station_B (°C)
f64,f64
20.5,18.0
21.0,18.5


### Exercise 3
Convert the following Pandas code (that I've seen in the wild!) to Polars

Looping over columns in Polars is to be avoided at all costs. 

Convert this Pandas code with a loop over the columns to Polars code using the Expression API.

In the loop we create a dictionary `maxDict` with the column names and maximum values

In [33]:
import pandas as pd
import numpy as np
df = pl.read_csv(csv_file)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)

Unnamed: 0,Age,Fare
0,80.0,512.3292


In [41]:
print(
    pl.read_csv(csv_file)
    .select(
        cs.float().max()
    )
)

print(
    pl.read_csv(csv_file)
    .filter(
        pl.col('Age').max()  # Careful: filter needs a Boolean predicate, so this raises an error
    )
)

shape: (1, 2)
┌──────┬──────────┐
│ Age  ┆ Fare     │
│ ---  ┆ ---      │
│ f64  ┆ f64      │
╞══════╪══════════╡
│ 80.0 ┆ 512.3292 │
└──────┴──────────┘


ComputeError: filter predicate must be of type `Boolean`, got `f64`

## Solutions

### Solution to Exercise 1
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Select the `title` and `artist` columns using the expression API (and not selectors)

In [None]:
(
    spotify_df
    .select(
        pl.col("title","artist")
    )
    .head(3)
)

Select all string and date columns from the spotify `DataFrame` using the expression API

In [None]:
(
    spotify_df
    .select(
        pl.col(pl.Utf8,pl.Date)
    )
    .head(3)
)

Select all string and date columns from the spotify `DataFrame` - except the `url` column using the expression API (and not selectors)

In [None]:
(
    spotify_df
    .select(
        pl.col(pl.Utf8,pl.Date).exclude("url")
    )
    .head(3)
)

Select all string and date columns again but use the selectors API

In [None]:
(
    spotify_df
    .select(
        cs.by_dtype(pl.Utf8,pl.Date) - cs.by_name("url")
    )
    .head(3)
)

Select all the columns that start with `t` or `a`

In [None]:
(
    spotify_df
    .select(
        cs.starts_with("t","a")
    )
    .head(3)
)

Select all columns except the integer columns (using the ~ operator)

In [None]:
(
    spotify_df
    .select(
        ~cs.integer()
    )
    .head(3)
)

### Solution to Exercise 2
We create a `DataFrame` with temperature and rainfall data from some weather stations

In [None]:
df_weather = pl.DataFrame(
    [
        {
            "Month": "Jan",
            "Station_A (°C)": 20.5,
            "Station_B (°C)": 18.0,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
        {
            "Month": "Feb",
            "Station_A (°C)": 21.0,
            "Station_B (°C)": 18.5,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
    ]
)
df_weather

Select all the columns with `Station` in the column name using `cs.contains`

In [None]:
(
    df_weather
    .select(
        cs.contains("Station")
    )
)

Use `cs.matches` to select all the columns with `Station` and `°C` in the column name

In [None]:
(
    df_weather
    .select(
        cs.matches("Station.*°C")
    )
)

### Solution to Exercise 3
Convert the following Pandas code to Polars
```python
import pandas as pd
import numpy as np
df = pl.read_csv(csv_file)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)
```

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col(pl.Float64).max()
    )
)

Note that there is a better way to do this in Pandas (I just don't see this so often in the wild!)

In [None]:
df_pandas = df.to_pandas()
df_pandas.select_dtypes("float").max()