# Filtering rows 1: Indexing with `[]`

By the end of this lecture you will be able to:
- select single rows with `[]` indexing
- select multiple rows with `[]` indexing


In [1]:
import polars as pl

In [2]:
csv_file = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


## Selecting individual rows with `[]`

Unlike a Pandas `DataFrame`, a Polars `DataFrame` has no explicit index. It does, however, have an implicit integer row number: rows are numbered by position, starting from 0.

> Unlike the old-fashioned pandas ^^, Polars has no visible index. A hidden, absolute index does exist, however,
> so we can use this index to retrieve rows.

We select an individual row with its integer row number.

In [13]:
print(df[0].schema)
df[0]

Schema({'PassengerId': Int64, 'Survived': Int64, 'Pclass': Int64, 'Name': String, 'Sex': String, 'Age': Float64, 'SibSp': Int64, 'Parch': Int64, 'Ticket': String, 'Fare': Float64, 'Cabin': String, 'Embarked': String})


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""


Note that selecting a single row, as in this example, returns a one-row `DataFrame`. This differs from Pandas, where selecting a single row returns a `Series`.

## Selecting multiple rows

### List

We can pass a list of integer row numbers to `[]`.

In [12]:
df[[2,3]]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


### Slice

We can use slice notation to select rows. As usual in Python, the stop value is excluded.

In [14]:
df[:2]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


### Range
We can pass a `range` of integers. As with a slice, the stop value is excluded.

In [15]:
df[range(2,4)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


### NumPy array

Polars also accepts a NumPy array of row numbers in `[]`.

In [16]:
import numpy as np
df[np.arange(0,3)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


### Boolean lists
If we pass a `list` of **Boolean values** to `[]`, Polars uses it to select columns, not rows. The list must have one value per column (12 here).

In [17]:
df[[True,True] + [False]*10 ]

PassengerId,Survived
i64,i64
1,0
2,1
3,1
4,1
5,0
…,…
887,0
888,1
889,0
890,1


The Polars developers chose not to allow filtering rows with Boolean lists. This discourages Pandas-style queries and encourages the use of expressions, which we see in the next lecture.

### Boolean `Series`
We cannot pass a Boolean `Series` to `[]` to filter rows. Polars interprets it as a column mask, as above, and raises an error because the `Series` has 891 values rather than 12. We see how to filter rows with `filter` in the next section.

> Boolean indexing with a `Series` is not possible. Instead, let's index with the innovative `filter` function.

In [18]:
df[df["Age"]>30]

ValueError: expected 12 values when selecting columns by boolean mask, got 891

In [20]:
(
    df
    .filter(
        pl.col("Age") > 30
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


## Use case of indexing with `[]`

Square bracket indexing has limited use in Polars because it cannot be used in lazy mode, so we lose the advantages of query optimisation and streaming large datasets.

We see in the next section that the `filter` method is the primary way to filter rows in Polars.

> `filter()` -> indexing ROWS

There are good uses for `[]`, however.

One example is when we are inspecting data in interactive mode and want to see, for example, the first row or the last few rows.

Square bracket indexing is also useful for extracting scalar values from a `DataFrame`.

In this example we extract the value in the first row of the `Age` column.

In [21]:
df[0,'Age']

22.0

# Exercises
In the exercises you will develop your understanding of
- selecting individual rows with `[]`
- selecting multiple rows with `[]`

## Exercise 1
Select the fifth row using `[]`

In [22]:
df = pl.read_csv(csv_file)
df[4]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


Select the first 5 rows using a `slice`

In [23]:
df = pl.read_csv(csv_file)
df[:5]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


Select the second to fifth rows using a `range`

In [25]:
df = pl.read_csv(csv_file)
df[range(1,5)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


# Solutions

## Solution to Exercise 1
Select the fifth row using `[]`

In [26]:
df = pl.read_csv(csv_file)
df[4]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


Select the first 5 rows using a `slice`

In [27]:
df = pl.read_csv(csv_file)
df[:5]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


Select the second to fifth rows using a `range`

In [28]:
df = pl.read_csv(csv_file)
df[range(1,5)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""
