# Selecting columns 3: selecting multiple columns
By the end of this lecture you will be able to:
- select columns based on a regex
- select columns based on dtype
- use selectors

Polars has two ways for selecting multiple columns:
- the expression API with `pl.col` or `pl.all`
- the selectors API with polars selectors such as `cs.contains`

Here we import `polars.selectors` separately as `cs`.


In [24]:
import polars as pl
import polars.selectors as cs

In [25]:
csv_file = "../../Files/Sample_Superstore.csv"

In [26]:
df = pl.read_csv(csv_file)
df.head(3)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


### Selecting all columns from a `DataFrame`

We can select all columns by using `pl.all()` in place of `pl.col`.

In [36]:
df.select(pl.all()).head(3)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


We can select all but a subset of columns with the `exclude` expression

In [38]:
df.select(pl.exclude('Postal_Code','Sub_Category','Quantity')).head(3)  # shorthand for `pl.all().exclude(...)`
# pl.exclude() already implies pl.all(), so no explicit pl.all() is needed

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Product_Name,Sales,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bush Somerset Collection Bookc…",261.96,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Hon Deluxe Fabric Upholstered …",731.94,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Self-Adhesive Address Labels f…",14.62,0.0,6.8714


This is a shorthand for `pl.all().exclude(...)`

### Selecting columns with a regex
We can select columns with a regex, provided the regex starts with `^` and ends with `$`; otherwise the string is treated as a literal column name. Note that we meet an easier way to do this with selectors below.

The following regex looks for columns starting with `P` and uses the regex *wildcard* `.*` to show `P` can be followed by any characters.

In [39]:
(
    df
    .select(
        "^P.*$"
    )
    .head(3)
)

Postal_Code,Product_ID,Product_Name,Profit
i64,str,str,f64
42420,"""FUR-BO-10001798""","""Bush Somerset Collection Bookc…",41.9136
42420,"""FUR-CH-10000454""","""Hon Deluxe Fabric Upholstered …",219.582
90036,"""OFF-LA-10000240""","""Self-Adhesive Address Labels f…",6.8714


We can pass this regex to `pl.col` to apply transformations to these columns. In this example we take the `max` of each column

In [40]:
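# Note the difference between the two snippets. The commented-out version takes the max over only the first three rows.
# Each column is evaluated independently, and for a str column the max is the last value in lexicographic order.
# (
#     df
#     .select(
#         pl.col("^P.*$").head(3).max()
#     )
# )

# Calling .head(3) after the select instead would be pointless: max() already reduces each column to a single row.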
(
    df
    .select(
        pl.col("^P.*$").max()
    )
    # .head(3)
)

Postal_Code,Product_ID,Product_Name,Profit
i64,str,str,f64
99301,"""TEC-PH-10004977""","""netTALK DUO VoIP Telephone Ser…",8399.976


### Selecting columns based on dtype
We can select all of the columns that have a particular dtype by passing the dtype to `pl.col`. I use this approach **a lot** in my Polars pipelines.

Here we select all the string columns with `pl.Utf8` - the string dtype object. In recent Polars releases `pl.String` is the preferred name and `pl.Utf8` is kept as an alias.

In [41]:
(
    df
    .select(
        pl.col(pl.Utf8)  # select only the Utf8 (string) columns
    )
    .head(3)
)

Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Sub_Category,Product_Name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…"
"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …"
"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…"


We can also pass a list of dtypes to `pl.col`. In this case we select both 64-bit integer and float columns

In [42]:
(
    df
    .select(
        pl.col([pl.Int64, pl.Float64])  # the list selects every column whose dtype is Int64 or Float64
    )
    .head(3)
)

Row_ID,Postal_Code,Sales,Quantity,Discount,Profit
i64,i64,f64,i64,f64,f64
1,42420,261.96,2,0.0,41.9136
2,42420,731.94,3,0.0,219.582
3,90036,14.62,2,0.0,6.8714


## Using the selectors API
The selectors API aims to make selecting multiple columns less verbose.

For simple cases it replicates the expression API. For example, to select all columns we use `cs.all`.

In [43]:
(
    df
    .select(
        cs.all()
    )
    .head(3)
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


In most Polars examples you see online the selectors sub-module is imported separately as `cs` (and I follow this practice below). However, in my own pipelines I find it easier to skip that extra import and reach the selectors through the main `pl` import as `pl.selectors`.

In [44]:
(
    df
    .select(
        pl.selectors.all()
    )
    .head(3)
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


We can also do selection by position with `first` or `last`

In [45]:
(
    df
    .select(
        cs.first()  # selects the first column, not the first row
    )
    .head(3)
)

Row_ID
i64
1
2
3


The output of a selector is a standard Polars expression so we can follow it up with standard expression chaining

In [46]:
(
    df
    .select(
        cs.all().max()
    )
)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
9994,"""US-2017-169551""","""31-12-2016""","""31-12-2017""","""Standard Class""","""ZD-21925""","""Zuschuss Donatelli""","""Home Office""","""United States""","""Yuma""","""Wyoming""",99301,"""West""","""TEC-PH-10004977""","""Technology""","""Tables""","""netTALK DUO VoIP Telephone Ser…",22638.48,14,0.8,8399.976


The selectors API works well in lazy mode and for streaming queries just as expressions do.

We can select columns by groups of dtype - including a group of all integer and floating point dtypes with `cs.numeric`


In [47]:
(
    df
    .select(
        cs.numeric()
    )
    .head(3)
)

Row_ID,Postal_Code,Sales,Quantity,Discount,Profit
i64,i64,f64,i64,f64,f64
1,42420,261.96,2,0.0,41.9136
2,42420,731.94,3,0.0,219.582
3,90036,14.62,2,0.0,6.8714


We can select by name - in this example with a `~` operator to exclude the names listed

In [51]:
(
    df
    .select(
        ~cs.by_name("Profit","Row_ID")  # `~` negates the selector: every column except these
    )
    .head(3)
)

Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount
str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64
,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0
"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0
"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0


As a simpler alternative to the regex example we saw earlier we can use selectors that match on column names, such as:
- `contains`
- `starts_with`
- `ends_with`
- `matches`

In this example we select all columns beginning with P


In [52]:
(
    df
    .select(
        cs.starts_with("P")  # no regex needed: the selector covers this common match directly
    )
    .head(3)
)

Postal_Code,Product_ID,Product_Name,Profit
i64,str,str,f64
42420,"""FUR-BO-10001798""","""Bush Somerset Collection Bookc…",41.9136
42420,"""FUR-CH-10000454""","""Hon Deluxe Fabric Upholstered …",219.582
90036,"""OFF-LA-10000240""","""Self-Adhesive Address Labels f…",6.8714


We can apply an OR condition by passing multiple strings

In [54]:
(
    df
    .select(
        cs.starts_with("P","S")
    )
    .head(3)
).columns

['Ship_Date',
 'Ship_Mode',
 'Segment',
 'State',
 'Postal_Code',
 'Product_ID',
 'Sub_Category',
 'Product_Name',
 'Sales',
 'Profit']

With the `matches` method we can pass a regex without the `^` and `$` we need for the expression API

In [55]:
(
    df
    .select(
        cs.matches("Customer_Name|Profit")
    )
    .head(3)
)

Customer_Name,Profit
str,f64
"""Claire Gute""",41.9136
"""Claire Gute""",219.582
"""Darrin Van Huff""",6.8714


### Union of selectors
To do a union operation we use the `|` operator to say at least one of the conditions must be satisfied


In [28]:
(
    df
    .select(
        cs.string() | cs.contains("P") 
    )
    .head(3)
)

Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Profit
str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",41.9136
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",219.582
"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",6.8714


### Difference of selectors
To do a difference operation we use a minus operator `-`.

In this example we select all string columns other than any column beginning with T


In [29]:
(
    df
    .select(
        cs.string() - cs.starts_with("T") 
    )
    .head(3)
)

Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Region,Product_ID,Category,Sub_Category,Product_Name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…"
"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""","""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …"
"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""","""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…"
