## Iterating through a DataFrame
By the end of this lecture you will be able to:
- iterate through a column row-by-row
- iterate through a DataFrame row-by-row

We introduce iteration methods here, but we should avoid iterating through a `DataFrame` whenever the same result can be computed with expressions, as expressions are much faster.

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

### Iterating over a single column
We can iterate over a single column just as we would with a pandas column or a NumPy array.

In [None]:
ages = [age for age in df["Age"]]
ages[:3]

### Iterating over multiple columns
We can iterate over multiple columns using the `rows` method of a `DataFrame`.

In this example we create a list where each element is a tuple with the `Name` and `Age` of a passenger. We select these columns by their position in each row.

In [None]:
nameAge = [(row[3],row[5]) for row in df.rows()]
nameAge[:3]

Alternatively, we can do this with the `iterrows` method.

In [None]:
nameAge = [(row[3],row[5]) for row in df.iterrows()]
nameAge[:3]

#### What is the difference between `rows` and `iterrows`?
The output of `rows` and `iterrows` is the same. The difference is that:
- when we call `rows` the entire `DataFrame` is materialised as a list of Python tuples where each tuple is a row. We can then iterate over this list of tuples
- when we call `iterrows` Polars materialises each row as a Python tuple when we iterate over it rather than materialising the whole `DataFrame` at the outset

Use `rows` if you are iterating through the full `DataFrame` and have enough memory to materialise the whole `DataFrame` as a list of tuples.

Use `iterrows` if you want to reduce memory use by not materialising the whole `DataFrame` as a list of tuples.
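The trade-off is the same one we see between a list and a generator in plain Python. The sketch below is only an analogy and does not use Polars: `all_rows` and `lazy_rows` are hypothetical stand-ins for the eager and lazy styles of row access.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, int]


def make_row(i: int) -> Row:
    # Stand-in for converting one row of a DataFrame to a Python tuple
    return (i, i * i)


def all_rows(n: int) -> List[Row]:
    # Eager: every row is built and held in memory before we iterate
    return [make_row(i) for i in range(n)]


def lazy_rows(n: int) -> Iterator[Row]:
    # Lazy: each row is built only when the loop asks for it
    for i in range(n):
        yield make_row(i)


eager = all_rows(5)   # a list holding all 5 tuples
lazy = lazy_rows(5)   # a generator; no tuples created yet

first_eager = eager[0]
first_lazy = next(lazy)  # only now is the first tuple created
```

The eager version pays the memory cost for every row up front, while the lazy version holds just one row at a time.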

### Iterating with named columns
In the examples with `rows` and `iterrows` above we select columns by their position in each row. We can instead refer to each column by name by passing the `named=True` argument.

In [None]:
nameAge = [(row.Name,row.Age) for row in df.rows(named=True)]
nameAge[:3]

In [None]:
nameAge = [(row.Name,row.Age) for row in df.iterrows(named=True)]
nameAge[:3]

This approach with named values is easier to read but slower as the named objects must be created for each row.
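To see where the extra cost comes from, the sketch below mimics a named row with a standard-library `namedtuple`. This is an illustration only: the `Passenger` type and the values are hypothetical, and how named rows are represented may depend on the Polars version (some versions may return dictionaries rather than attribute-style objects).

```python
from collections import namedtuple

# Hypothetical row type standing in for a named row
Passenger = namedtuple("Passenger", ["Name", "Age"])

# A plain row: values are selected by position
plain = ("Example Passenger", 30.0)
by_index = (plain[0], plain[1])

# A named row: an extra object is constructed before we can use the names
named = Passenger(*plain)
by_name = (named.Name, named.Age)

print(by_index == by_name)
```

The values are identical either way. The named version simply does one more object construction per row, which adds up over many rows.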

## Exercises
In the exercises you will develop your understanding of:
- iterating over multiple columns with `iterrows`
- iterating over multiple columns with `rows`
- iterating with named columns

## Exercise 1
In the example below we create a random `DataFrame` with 1 million rows and 100 columns. We compare how long it takes to iterate through the `DataFrame` when we select the first two columns on each iteration.

In [None]:
import numpy as np
N = 1_000_000
dfRandom = pl.DataFrame(np.random.standard_normal((N,100)))

Iterate through `dfRandom` with `iterrows` to create a list where each element is a tuple with the values from the first two columns of `dfRandom`.

How long does it take to iterate if we `select` the columns of interest first?

We use the IPython `%%timeit` cell magic with a single loop and a single run (`-n1 -r1`).

In [None]:
%%timeit -n1 -r1


Do the same iteration with `iterrows` but use named columns

In [None]:
%%timeit -n1 -r1


Compare the performance with using the `rows` method

In [None]:
%%timeit -n1 -r1


## Solution to exercise 1
In the example below we create a random `DataFrame` with 1 million rows and 100 columns. We compare how long it takes to iterate through the `DataFrame` when we select the first two columns on each iteration.

In [None]:
import numpy as np
N = 1_000_000
dfRandom = pl.DataFrame(np.random.standard_normal((N,100)))

Iterate through `dfRandom` with `iterrows` to create a list where each element is a tuple with the values from the first two columns of `dfRandom`.

In [None]:
%%timeit -n1 -r1
[(row[0],row[1]) for row in dfRandom.iterrows()]

How long does it take to iterate if we `select` the columns of interest first?

In [None]:
%%timeit -n1 -r1
[(row[0],row[1]) for row in dfRandom.select(["column_0","column_1"]).iterrows()]

It is much faster if we preselect the columns!

Do the same iteration with `iterrows` but use named columns

In [None]:
%%timeit -n1 -r1
[(row.column_0,row.column_1) for row in dfRandom.iterrows(named=True)]

We see there is a performance penalty for using the `named` approach

Compare the performance with using the `rows` method

In [None]:
%%timeit -n1 -r1
[(row[0],row[1]) for row in dfRandom.rows()]

In this example we see that the `rows` method is much slower than the `iterrows` method