## Iterating through a DataFrame
By the end of this lecture you will be able to:
- iterate through a column row-by-row
- iterate through multiple columns row-by-row
- understand the performance effect of the different options

Although this lecture introduces iteration methods, we should avoid iterating through a `DataFrame` whenever an expression can do the same job, because expressions are much faster.

In [1]:
import polars as pl

In [2]:
csv_file = "../../Files/Sample_Superstore.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


### Iterating over a single column
We can iterate over a single column just as we would over a pandas column or a NumPy array.

In [4]:
Profits = [Profit for Profit in df["Profit"]]
Profits[:3]

[41.9136, 219.582, 6.8714]

### Iterating over multiple columns
We can iterate over multiple columns using the `rows` method of a `DataFrame`.

In this example we create a list where each element is a tuple of the `Customer_Name` and `Profit` values for one row.

In [12]:
# Customer_Name is at position 6 and Profit at position 20 (zero-based)
Customer_Profit = [(row[6], row[20]) for row in df.rows()]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

Alternatively, we can do this with the `iter_rows` method.

In [13]:
# Customer_Name is at position 6 and Profit at position 20 (zero-based)
Customer_Profit = [(row[6], row[20]) for row in df.iter_rows()]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

#### Difference between `rows` and `iter_rows`
`rows` and `iter_rows` produce the same rows. The difference is when those rows are materialised:
- when we call `rows`, the entire `DataFrame` is materialised as a list of Python tuples, where each tuple is a row. We can then iterate over this list of tuples
- when we call `iter_rows`, Polars materialises each row as a Python tuple as we iterate over it, rather than materialising the whole `DataFrame` at the outset

Use `rows` if you are iterating through the full `DataFrame` and have enough memory to materialise the whole `DataFrame` as a list of tuples.

Use `iter_rows` to reduce memory use when you don't want to materialise the whole `DataFrame` as a list of tuples.

### Iterating with named columns
In the examples with `rows` and `iter_rows` above we use positional indexing to select the columns. We can instead look up values by column name by passing `named=True`, which returns a `dict` for each row.


In [14]:
Customer_Profit = [(row["Customer_Name"],row["Profit"]) for row in df.rows(named=True)]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

In [15]:
Customer_Profit = [(row["Customer_Name"],row["Profit"]) for row in df.iter_rows(named=True)]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

This approach with named values is easier to read but slower, as a `dict` must be created for each row.
