## Iterating through a DataFrame
By the end of this section you will be able to:
- iterate through a column row-by-row
- iterate through multiple columns row-by-row
- understand the performance effect of the different options

We introduce iteration methods here, but we should avoid iterating through a `DataFrame` whenever the same result can be produced with expressions, because expressions are much faster.

In [3]:
import polars as pl

In [4]:
csvFile = "../data/titanic.csv"

In [5]:
df = pl.read_csv(csvFile)
df.head(3)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | i64 | str | str | f64 | i64 | i64 | str | f64 | str | str |
| 1 | 0 | 3 | "Braund, Mr. Ow…" | "male" | 22.0 | 1 | 0 | "A/5 21171" | 7.25 | null | "S" |
| 2 | 1 | 1 | "Cumings, Mrs. …" | "female" | 38.0 | 1 | 0 | "PC 17599" | 71.2833 | "C85" | "C" |
| 3 | 1 | 3 | "Heikkinen, Mis…" | "female" | 26.0 | 0 | 0 | "STON/O2. 31012…" | 7.925 | null | "S" |


### Iterating over a single column
We can iterate over a single column just as we would with a pandas column or a NumPy array.

In [6]:
ages = [age for age in df["Age"]]
ages[:3]

[22.0, 38.0, 26.0]

### Iterating over multiple columns
We can iterate over multiple columns using the `rows` method of a `DataFrame`.

In this example we create a list where each element is the `Name` and `Age` of a passenger.

In [7]:
nameAge = [(row[3],row[5]) for row in df.rows()]
nameAge[:3]

[('Braund, Mr. Owen Harris', 22.0),
 ('Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 38.0),
 ('Heikkinen, Miss. Laina', 26.0)]

Alternatively, we can do this with the `iter_rows` method.

In [8]:
nameAge = [(row[3],row[5]) for row in df.iter_rows()]
nameAge[:3]

[('Braund, Mr. Owen Harris', 22.0),
 ('Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 38.0),
 ('Heikkinen, Miss. Laina', 26.0)]

#### Difference between `rows` and `iter_rows`?
The output of `rows` and `iter_rows` is the same. The difference is that:
- when we call `rows` the entire `DataFrame` is materialised as a list of Python tuples where each tuple is a row. We can then iterate over this list of tuples
- when we call `iter_rows` Polars materialises each row as a Python tuple when we iterate over it rather than materialising the whole `DataFrame` at the outset

Use `rows` if you are iterating through the full `DataFrame` and have enough memory to materialise the whole `DataFrame` as a list of tuples.

Use `iter_rows` to reduce memory use when you don't want to materialise the whole `DataFrame` as a list of tuples.

### Iterating with named columns
In the examples with `rows` and `iter_rows` above we use positional indexing to select the column. We can instead look up values by column name by passing `named=True`, which returns a `dict` for each row.

In [9]:
nameAge = [(row["Name"],row["Age"]) for row in df.rows(named=True)]
nameAge[:3]

[('Braund, Mr. Owen Harris', 22.0),
 ('Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 38.0),
 ('Heikkinen, Miss. Laina', 26.0)]

In [10]:
nameAge = [(row["Name"],row["Age"]) for row in df.iter_rows(named=True)]
nameAge[:3]

[('Braund, Mr. Owen Harris', 22.0),
 ('Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 38.0),
 ('Heikkinen, Miss. Laina', 26.0)]

This approach with named values is easier to read but slower, as a `dict` must be created for each row.

## Exercises
In the exercises you will develop your understanding of:
- iterating over multiple columns with `iter_rows`
- iterating over multiple columns with `rows`
- iterating with named columns

## Exercise 1
In the example below we create a random `DataFrame` with 1 million rows and 100 columns. We compare how long it takes to iterate through the `DataFrame` when we select the first 2 columns on each iteration

In [None]:
import numpy as np
N = 1_000_000
dfRandom = pl.DataFrame(np.random.standard_normal((N,100)))

Iterate through `dfRandom` with `iter_rows` to create a list where each element is a tuple with the first two columns of `dfRandom`

How long does it take to iterate if we pre-`select` the columns of interest first?

We use the IPython `%%timeit` magic with one loop and one run (`-n1 -r1`)

In [None]:
%%timeit -n1 -r1


Use pre-selected columns below.

Do the same iteration with `iter_rows` but use named columns

In [None]:
%%timeit -n1 -r1


Compare the performance with the `rows` method

In [None]:
%%timeit -n1 -r1


## Solution to exercise 1
In the example below we create a random `DataFrame` with 1 million rows and 100 columns. We compare how long it takes to iterate through the `DataFrame` when we select the first 2 columns on each iteration

In [15]:
import numpy as np
N = 1_000_000
dfRandom = pl.DataFrame(np.random.standard_normal((N,100)))

Iterate through `dfRandom` with `iter_rows` to create a list where each element is a tuple with the first two columns of `dfRandom`

In [19]:
%%timeit -n1 -r1
[(row[0],row[1]) for row in dfRandom.iter_rows()]

8.98 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


How long does it take to iterate if we `select` the columns of interest first?

In [20]:
%%timeit -n1 -r1
[(row[0],row[1]) for row in dfRandom.select(["column_0","column_1"]).iter_rows()]

321 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


It is much faster if we preselect the columns: 321 ms versus 8.98 s here, roughly 28 times faster!

Use pre-selected columns below.

Do the same iteration with `iter_rows` but use named columns

In [21]:
%%timeit -n1 -r1
[(row["column_0"],row["column_1"]) for row in dfRandom.select(["column_0","column_1"]).iter_rows(named=True)]

980 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


We see there is a performance penalty for using the `named` approach: 980 ms versus 321 ms here, roughly three times slower

Compare the performance without names using the `rows` method

In [22]:
%%timeit -n1 -r1
[(row[0],row[1]) for row in dfRandom.select(["column_0","column_1"]).rows()]

318 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In this example the `rows` method (318 ms) is marginally faster than the `iter_rows` method (321 ms), a gap of about 1% measured from a single run. In general I find that it varies whether `rows` or `iter_rows` is faster depending on the problem