# Filtering rows 2: Using `filter` and the Expression API

By the end of this lecture you will be able to:
- apply conditions with the `filter` method
- add a row number column
- partition a `DataFrame`

The `filter` method is our first example of the *Expression API*.

_**Learning to use the *Expression API* is the most important step to writing high performance queries in Polars**_


In [1]:
import polars as pl

In [2]:
csv_file = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Applying conditions with `filter`

We use the `filter` method to filter rows according to a condition.
> `filter()` is the function that keeps only the rows that satisfy a given condition.
<hr>

> In Pandas we often use a boolean mask to filter rows but in Polars we use `filter`. Note also that the `filter` method in Polars is quite different from the filter method in Pandas.

We first use an *expression* in the `filter` method before we examine the syntax in more detail.

In this example we want to keep all rows for first-class passengers.

In [4]:
(
    df
    .filter(
        pl.col("Pclass") == 1
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S"""


## Syntax of `filter`
Inside the `filter` method we pass an _**expression**_ and apply a Boolean condition to it:

`pl.col('Pclass') == 1` is a condition that filters the rows of a given column; it evaluates to `True` or `False` for each row.

This expression has two parts:
- the `pl.col('Pclass')` expression selects the `Pclass` column from `df` (it picks out a specific column)
- `== 1` applies a Boolean condition to this expression (it sets the condition we want to apply to the rows of that column)

In this example we choose all rows where the number of parents & children (`Parch`) is greater than 1.

In [5]:
(
    df
    .filter(
        pl.col("Parch") > 1
    )
    .select(
        "PassengerId", "Parch", "SibSp"  # in select we don't need pl.col(); plain column names work
    )
    .head()
)

PassengerId,Parch,SibSp
i64,i64,i64
9,2,0
14,5,1
26,5,1
28,2,3
44,2,1


In [6]:
(
    df
    .filter(
        pl.col("Parch") > 1
    )
    .select(
        pl.col("PassengerId"),
        pl.col("Parch"),
        pl.col("SibSp")
    )
    .head(3)
)

PassengerId,Parch,SibSp
i64,i64,i64
9,2,0
14,5,1
26,5,1


As well as the mathematical operators such as `==`, `>`, `<` there are corresponding text operators that some people find more readable.

In [13]:
(
    df
    .filter(
        pl.col('Parch').gt(1)
    )
    .select("PassengerId","Parch","SibSp")
    .head(5)
)

PassengerId,Parch,SibSp
i64,i64,i64
9,2,0
14,5,1
26,5,1
28,2,3
44,2,1


You can see the full set of operators here: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/operators.html

We can make a filter condition based on two expressions (i.e. comparing data in one column to another) rather than one expression and a constant. In this example we find rows where the number of parents & children (`Parch`) is greater than the number of siblings (`SibSp`)

In [18]:
(
    df
    .filter(
        pl.col("Parch") > pl.col("SibSp")
    )
    .select(
        "PassengerId", "Parch", "SibSp"
    )
    .head(3)
)

PassengerId,Parch,SibSp
i64,i64,i64
9,2,0
14,5,1
26,5,1


In [21]:
(
    df
    .filter(
        pl.col('Parch').gt(pl.col("SibSp"))  # gt is optional here; the > operator works just as well
    )
    .select("PassengerId","Parch","SibSp")
    .head(5)
)

PassengerId,Parch,SibSp
i64,i64,i64
9,2,0
14,5,1
26,5,1
44,2,1
55,1,0


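To save a bit of typing we can also apply an equality filter by passing the column name directly as a keyword argument. We first write the condition as an expression and then repeat it in keyword form.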

In [24]:
(
    df
    .filter(
        pl.col("Parch") == 3
    )
    .select(
        "PassengerId", "Parch", "SibSp"
    )
    .head(4)
)

PassengerId,Parch,SibSp
i64,i64,i64
87,3,1
438,3,2
737,3,1
775,3,1


In [23]:
(
    df
    .filter(
        Parch=3,
    )
    .select("PassengerId","Parch","SibSp")
    .head(3)
)

PassengerId,Parch,SibSp
i64,i64,i64
87,3,1
438,3,2
737,3,1


This approach only works for equality conditions (i.e. not for `>`, `<` etc.).

Why does this simple approach only work for equalities? Because in this approach Polars takes advantage of Python keyword arguments - we are basically "pretending" we are calling `filter` with an argument called `Parch` equal to 3, which Polars internally converts to `pl.col("Parch") == 3`. Python only lets us use this trick with the `=` operator.
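The mechanism can be sketched in plain Python. The function `filter_rows` below is hypothetical and is not how Polars is implemented; it only shows that keyword arguments reach a function as a name-to-value `dict`, so equality is the only condition they can carry. The rows are illustrative.

```python
def filter_rows(rows, **constraints):
    """Keep the rows (dicts) whose values equal every keyword constraint."""
    # Python hands keyword arguments to the function as a plain dict,
    # e.g. filter_rows(rows, Parch=3) gives constraints == {"Parch": 3}.
    # Only a name and a value arrive, so there is no room for > or <.
    return [
        row
        for row in rows
        if all(row[name] == value for name, value in constraints.items())
    ]


# Illustrative rows, not read from the CSV
rows = [
    {"PassengerId": 87, "Parch": 3},
    {"PassengerId": 88, "Parch": 0},
    {"PassengerId": 438, "Parch": 3},
]

matching = filter_rows(rows, Parch=3)
print(matching)
```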

### Conditions based on row numbers with `filter`

We can add an explicit row number column using `with_row_index` on a `DataFrame`
> Unlike pandas, Polars has no index by default. A visible index column can still be added with `.with_row_index(name="...")`.

In [7]:
df = pl.read_csv(csv_file)
df = df.with_row_index(name='index')
df.head(3)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
2,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


We can then use `filter` to apply a condition based on row number

In [8]:
(
    df
    .filter(
        pl.col('index') < 4
    )
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
2,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


However, a simpler way to do this is with `slice`

In [9]:
(
    df
    .slice(0,4)
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
2,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


### Filtering on a Boolean column
We can filter for `True` values on a Boolean column by passing the column as an expression to `filter` without a condition

In [13]:
(
    df
    .with_columns(
        First_class = pl.col("Pclass") == 1
    )
    .filter(
        pl.col("First_class")
    )
    .head()
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,First_class
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,bool
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",True
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",True
6,7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S""",True
11,12,1,1,"""Bonnell, Miss. Elizabeth""","""female""",58.0,0,0,"""113783""",26.55,"""C103""","""S""",True
23,24,1,1,"""Sloper, Mr. William Thompson""","""male""",28.0,0,0,"""113788""",35.5,"""A6""","""S""",True


In [15]:
(
    df
    .filter(
        pl.col("Pclass") == 1
    )
    .head(5)
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
6,7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S"""
11,12,1,1,"""Bonnell, Miss. Elizabeth""","""female""",58.0,0,0,"""113783""",26.55,"""C103""","""S"""
23,24,1,1,"""Sloper, Mr. William Thompson""","""male""",28.0,0,0,"""113788""",35.5,"""A6""","""S"""


In [11]:
(
    df
    .with_columns(
        less_than_30 = pl.col("Age") < 30
        # .with_columns is the method that creates new columns
        # each value is True or False depending on whether pl.col("Age") < 30
    )
    .filter(
        pl.col("less_than_30")
    )
    .head(5)
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,less_than_30
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,bool
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",True
2,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",True
7,8,0,3,"""Palsson, Master. Gosta Leonard""","""male""",2.0,3,1,"""349909""",21.075,,"""S""",True
8,9,1,3,"""Johnson, Mrs. Oscar W (Elisabe…","""female""",27.0,0,2,"""347742""",11.1333,,"""S""",True
9,10,1,2,"""Nasser, Mrs. Nicholas (Adele A…","""female""",14.0,1,0,"""237736""",30.0708,,"""C""",True


We can negate a filter condition with `~`

In [18]:
(
    df
    .with_columns(
        less_than_30 = pl.col("Age") < 30
    )
    .filter(
        ~pl.col("less_than_30")
    )
    .head()
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,less_than_30
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,bool
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",False
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",False
4,5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S""",False
6,7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S""",False
11,12,1,1,"""Bonnell, Miss. Elizabeth""","""female""",58.0,0,0,"""113783""",26.55,"""C103""","""S""",False


or with the `not_` expression

In [None]:
(
    df
    .with_columns(
        less_than_30 = pl.col("Age") < 30
    )
    .filter(
        pl.col("less_than_30").not_()
    )
    .head(2)
)

> But why would we do it this way?

## Partitioning a `DataFrame`
In some cases we want to get the different subsets of the `DataFrame` that result from a single condition. 

We can do this partition into sub-`DataFrames` with the `partition_by` method.

In the examples below we partition first by the `Sex` column and then by the `Pclass` column.

> So this is a bit like splitting the `DataFrame` with a group-by and storing the pieces in a dictionary; in R it feels similar to a list.

In [23]:
df_sex_dict = (
    df
    .partition_by(
        by=["Sex"], as_dict=True
    )
)

In [26]:
df_sex_dict['male', ]

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
4,5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""
5,6,0,3,"""Moran, Mr. James""","""male""",,0,0,"""330877""",8.4583,,"""Q"""
6,7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S"""
7,8,0,3,"""Palsson, Master. Gosta Leonard""","""male""",2.0,3,1,"""349909""",21.075,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…,…
883,884,0,2,"""Banfield, Mr. Frederick James""","""male""",28.0,0,0,"""C.A./SOTON 34068""",10.5,,"""S"""
884,885,0,3,"""Sutehall, Mr. Henry Jr""","""male""",25.0,0,0,"""SOTON/OQ 392076""",7.05,,"""S"""
886,887,0,2,"""Montvila, Rev. Juozas""","""male""",27.0,0,0,"""211536""",13.0,,"""S"""
889,890,1,1,"""Behr, Mr. Karl Howell""","""male""",26.0,0,0,"""111369""",30.0,"""C148""","""C"""


In [20]:
df_pclass_dict = (
    df
    .partition_by(by=["Pclass"],as_dict=True)
)

The output is a Python `dict` mapping from the unique values in `Pclass` to the sub-`DataFrame` for each class. This partition requires copying the data in `df` to new sub-`DataFrames`.

Note that the keys of this `dict` are always tuples even if there is just one element in the tuple for each key

In [21]:
df_pclass_dict.keys()

dict_keys([(3,), (1,), (2,)])

Note that if we don't pass the `as_dict=True` argument we instead get a Python `list` of sub-`DataFrames`.

We can get the rows with first class passengers from this `dict` (note the `,` which turns `1` into the tuple `(1,)`).

In [22]:
df_pclass_dict[1,].head(2)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


## Filter in lazy mode
We create a `LazyFrame` by scanning the CSV and adding a `filter` operation

In [27]:
(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 30)
)

When we print the optimized plan we see that the `filter` operation is part of the `SELECTION`. This query optimisation is called **predicate pushdown**. With predicate pushdown Polars tries to apply a `filter` as early as possible in a query plan to reduce the amount of data that must be processed.

In [None]:
print(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 30)
    .explain()
)

When a `filter` is applied to a query like this on a CSV file on our local machine, the query optimisation will not have much impact: Polars just reads the CSV, makes a `DataFrame` in memory and then filters the `DataFrame`. The performance would probably be similar to running the query in eager mode.

However, if we are reading a file from cloud storage then Polars tries to apply the condition in `SELECTION` in the cloud storage and so reduces the amount of data that must be transferred across the network. The transfer across the network is typically the slowest and most expensive part of the query.



If we set `streaming=True` in `explain` we see that the `filter` operation comes after `STREAMING` in the query plan - this means that Polars can do this filter operation in streaming mode if we evaluate the lazy query with `.collect(streaming=True)`

In [None]:
print(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 30)
    .explain(streaming=True)
)

## Exercises
In the exercises you will develop your understanding of
- using the `filter` method
- adding a row number column
- partitioning a `DataFrame`

### Exercise 1 
Select all rows where `Age` is greater than 30

In [28]:
(
    pl.read_csv(csv_file)
    .filter(
        pl.col("Age") > 30
    )
    .head(5)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""
7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S"""
12,1,1,"""Bonnell, Miss. Elizabeth""","""female""",58.0,0,0,"""113783""",26.55,"""C103""","""S"""


Select all rows where `Embarked` is equal to "C" - using the keyword approach

In [29]:
(
    df
    .filter(
        Embarked="C"
    )
    .head(5)
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
9,10,1,2,"""Nasser, Mrs. Nicholas (Adele A…","""female""",14.0,1,0,"""237736""",30.0708,,"""C"""
19,20,1,3,"""Masselmani, Mrs. Fatima""","""female""",,0,0,"""2649""",7.225,,"""C"""
26,27,0,3,"""Emir, Mr. Farred Chehab""","""male""",,0,0,"""2631""",7.225,,"""C"""
30,31,0,1,"""Uruchurtu, Don. Manuel E""","""male""",40.0,0,0,"""PC 17601""",27.7208,,"""C"""


Select all rows where `Embarked` is equal to "C" - use `pl.col` with the text operator rather than the mathematical operator this time

Select all rows where `Embarked` is **not** equal to "C" 

In [31]:
(
    df
    .filter(
        pl.col("Embarked") != "C"
    )
    .head()
)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
4,5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""
5,6,0,3,"""Moran, Mr. James""","""male""",,0,0,"""330877""",8.4583,,"""Q"""


### Exercise 2 

In this exercise we filter on row numbers.

First add a row number column

In [32]:
df = (
    pl.read_csv(csv_file)
    .with_row_index("row_num")
)

print(df.head())

shape: (5, 13)
┌─────────┬─────────────┬──────────┬────────┬───┬──────────────────┬─────────┬───────┬──────────┐
│ row_num ┆ PassengerId ┆ Survived ┆ Pclass ┆ … ┆ Ticket           ┆ Fare    ┆ Cabin ┆ Embarked │
│ ---     ┆ ---         ┆ ---      ┆ ---    ┆   ┆ ---              ┆ ---     ┆ ---   ┆ ---      │
│ u32     ┆ i64         ┆ i64      ┆ i64    ┆   ┆ str              ┆ f64     ┆ str   ┆ str      │
╞═════════╪═════════════╪══════════╪════════╪═══╪══════════════════╪═════════╪═══════╪══════════╡
│ 0       ┆ 1           ┆ 0        ┆ 3      ┆ … ┆ A/5 21171        ┆ 7.25    ┆ null  ┆ S        │
│ 1       ┆ 2           ┆ 1        ┆ 1      ┆ … ┆ PC 17599         ┆ 71.2833 ┆ C85   ┆ C        │
│ 2       ┆ 3           ┆ 1        ┆ 3      ┆ … ┆ STON/O2. 3101282 ┆ 7.925   ┆ null  ┆ S        │
│ 3       ┆ 4           ┆ 1        ┆ 1      ┆ … ┆ 113803           ┆ 53.1    ┆ C123  ┆ S        │
│ 4       ┆ 5           ┆ 0        ┆ 3      ┆ … ┆ 373450           ┆ 8.05    ┆ null  ┆ S        │
└─────────┴─────────────┴──────────┴────────┴───┴──────────────────┴─────────┴───────┴──────────┘

Continue by selecting the first 5 rows using `filter` on the row number column

In [33]:
(
    df
    .filter(
        pl.col("row_num") < 5
    )
)

row_num,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
2,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
3,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
4,5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


### Exercise 3
Partition the `DataFrame` by the `Survived` and `Pclass` columns as a `dict` (you may want to check the API docs for help: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html#polars.DataFrame.partition_by)

In [35]:
survived_pclass_dict = (
    pl.read_csv(csv_file)
    .partition_by("Survived","Pclass",as_dict=True)
)

survived_pclass_dict.keys()

dict_keys([(0, 3), (1, 1), (1, 3), (0, 1), (1, 2), (0, 2)])

Return the sub-`DataFrame` with the passengers who did not survive from the third class

In [36]:
survived_pclass_dict[(0,3)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""
6,0,3,"""Moran, Mr. James""","""male""",,0,0,"""330877""",8.4583,,"""Q"""
8,0,3,"""Palsson, Master. Gosta Leonard""","""male""",2.0,3,1,"""349909""",21.075,,"""S"""
13,0,3,"""Saundercock, Mr. William Henry""","""male""",20.0,0,0,"""A/5. 2151""",8.05,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
883,0,3,"""Dahlberg, Miss. Gerda Ulrika""","""female""",22.0,0,0,"""7552""",10.5167,,"""S"""
885,0,3,"""Sutehall, Mr. Henry Jr""","""male""",25.0,0,0,"""SOTON/OQ 392076""",7.05,,"""S"""
886,0,3,"""Rice, Mrs. William (Margaret N…","""female""",39.0,0,5,"""382652""",29.125,,"""Q"""
889,0,3,"""Johnston, Miss. Catherine Hele…","""female""",,1,2,"""W./C. 6607""",23.45,,"""S"""


### Exercise 4
In this exercise we load data from the Spotify charts

In [37]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv)
spotify_df.head()

title,rank,date,artist,url,region,chart,trend,streams
str,i64,str,str,str,str,str,str,i64
"""Starboy""",1,"""2017-01-01""","""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,"""2017-01-01""","""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,"""2017-01-01""","""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",2545384
"""Rockabye (feat. Sean Paul & An…",4,"""2017-01-01""","""Clean Bandit""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_DOWN""",2356604
"""One Dance""",5,"""2017-01-01""","""Drake, WizKid, Kyla""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",2259887


Filter the `DataFrame` to find all rows with artist Post Malone

In [38]:
(
    spotify_df
    .filter(
        pl.col("artist") == "Post Malone"
    )
)

title,rank,date,artist,url,region,chart,trend,streams
str,i64,str,str,str,str,str,str,i64
"""White Iverson""",196,"""2017-01-01""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""NEW_ENTRY""",332756
"""White Iverson""",188,"""2017-01-02""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",343936
"""Psycho (feat. Ty Dolla $ign)""",2,"""2018-03-01""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",5999224
"""I Fall Apart""",22,"""2018-03-01""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",2003396
"""Candy Paint""",64,"""2018-03-01""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",1065141
…,…,…,…,…,…,…,…,…
"""White Iverson""",184,"""2018-01-30""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",590653
"""I Fall Apart""",18,"""2018-01-31""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",2034143
"""Candy Paint""",55,"""2018-01-31""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",1247324
"""Go Flex""",140,"""2018-01-31""","""Post Malone""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",704766


## Solutions

### Solution to Exercise 1
Select all rows with `Age` greater than 30

In [None]:
(
    pl.read_csv(csv_file)
    .filter(pl.col('Age') > 30)
    .head(3)
)

Select all rows where `Embarked` is equal to "C" - using the keyword approach

In [None]:
(
    pl.read_csv(csv_file)
    .filter(Embarked = "C")
    .head(3)
)

Select all rows where `Embarked` is equal to "C" - use `pl.col` with the text operator rather than the mathematical operator this time

In [None]:
(
    pl.read_csv(csv_file)
    .filter(pl.col("Embarked").eq("C"))
    .head(3)
)

Select all rows where `Embarked` is **not** equal to "C" 

In [None]:
(
    pl.read_csv(csv_file)
    .filter(~pl.col("Embarked").eq("C"))
    .head(3)
)

### Solution to Exercise 2
Add a row number column

In [None]:
(
    pl.read_csv(csv_file)
    .with_row_index("row_nr")
)

Continue by selecting the first 5 rows using `filter` on the row number column

In [None]:
(
    pl.read_csv(csv_file)
    .with_row_index("row_nr")
    .filter(pl.col("row_nr")<5)
)

### Solution to Exercise 3
Partition the `DataFrame` by the `Survived` and `Pclass` columns as a `dict`

In [None]:
survived_pclass_dict = (
    pl.read_csv(csv_file)
    .partition_by("Survived","Pclass",as_dict=True)
)

In [None]:
survived_pclass_dict.keys()

Return the sub-`DataFrame` with the passengers who did not survive from the third class

In [None]:
(
    survived_pclass_dict[(0,3)]
    .head(2)
)

### Solution to Exercise 4
In this exercise we load data from the Spotify charts in a compressed CSV

In [None]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv)
spotify_df.head()

Filter the `DataFrame` to find all rows with artist Post Malone

In [None]:
(
    spotify_df
    .filter(
        pl.col("artist") == "Post Malone"
    )
)