## Selecting columns 4: Transforming and adding a column
By the end of this lecture you will be able to:
- transform an existing column in place using `with_columns`
- add a new column with an expression
- add a new column with column arithmetic
- add a column with constant values using `pl.lit`

In [3]:
import polars as pl

In [1]:
csv_file = "../data/titanic.csv"

In [7]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Transforming an existing column

We can transform an existing column by passing the column to `with_columns`.

In this example we round `Fare` to 0 decimal places.

In [6]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').round(0)
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.0,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.0,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",8.0,,"""S"""


* If we make the change in `select`, only `Fare` is returned,
* but if we make it in `with_columns`, all of the columns are returned along with `Fare`.

In [15]:
# A clunky workaround like this does work
# But is it really worth doing it this way?
(
    pl.read_csv(csv_file)
    .select(
        pl.col("Fare").round(0).alias("roundFare"),
        pl.exclude('Fare')
    )

)

roundFare,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Cabin,Embarked
f64,i64,i64,i64,str,str,f64,i64,i64,str,str,str
7.0,1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",,"""S"""
71.0,2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""","""C85""","""C"""
8.0,3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",,"""S"""
53.0,4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""","""C123""","""S"""
8.0,5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
13.0,887,0,2,"""Montvila, Rev. Juozas""","""male""",27.0,0,0,"""211536""",,"""S"""
30.0,888,1,1,"""Graham, Miss. Margaret Edith""","""female""",19.0,0,0,"""112053""","""B42""","""S"""
23.0,889,0,3,"""Johnston, Miss. Catherine Hele…","""female""",,1,2,"""W./C. 6607""",,"""S"""
30.0,890,1,1,"""Behr, Mr. Karl Howell""","""male""",26.0,0,0,"""111369""","""C148""","""C"""


## Adding a new column from an existing column
We can create a new column from an existing column by giving the transformed expression a new name with `alias`. The original column is kept.
> `.alias` sets the name of the output column.

In [16]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').round(0).alias("컬럼이름변경")
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,컬럼이름변경
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",7.0
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",71.0
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",8.0


In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').round(0).alias('roundFare')
    )
    .head(3)
)

Instead of using `alias`, we can also create the new column by passing the column name as a keyword argument set to the expression (Polars refers to this approach as kwargs assignment).

In [17]:
(
    pl.read_csv(csv_file)
    .with_columns(
        roundFare = pl.col('Fare').round(0)
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,roundFare
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",7.0
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",71.0
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",8.0


## Difference between `with_columns` and `select`
- The `select` method returns only the columns we specify, whereas the `with_columns` method returns **all of the columns**
- `with_columns` accepts expressions only - no strings

## Adding or transforming a column with column arithmetic

We can transform columns with arithmetic in an expression.

In this example we double the values in the `Fare` column in a new column called `doubleFare`

In [18]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") * 2).alias("doubleFare")
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,doubleFare
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",14.5
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",142.5666
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",15.85


We can also do arithmetic with multiple columns in an expression.

In this example we add the values in the `Fare` and `Age` columns

In [21]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") + pl.col("Age")).alias("farePlusAge")
    )
    .head(2)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,farePlusAge
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",29.25
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",109.2833


Some people find named arithmetic methods more readable than operators.

We do the same example as above but with the `.add` method rather than the `+` operator

In [22]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').add(pl.col('Age')).alias('farePlusAge')
    )
    .head(2)
)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,farePlusAge
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",29.25
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",109.2833


The mapping from Python operators to expression methods is:
- `+` to `add`
- `==` to `eq`
- `//` to `floordiv`
- `>` to `gt`
- `>=` to `ge`
- `<` to `lt`
- `<=` to `le`
- `%` to `mod`
- `!=` to `ne`
- `-` to `sub`
- `/` to `truediv`
- `^` to `xor`
- `*` to `mul`

## Adding a new column with a constant value

Use the literal function `pl.lit` to specify a constant value in Polars.

Here we add a new column called `Aboard` with a value `yes` for all passengers

`pl.lit(value)` → use this to apply a fixed value to every row

In [23]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit('yes').alias('Aboard')
    )
    .select(['Name','Aboard'])
    .head(2)
)

Name,Aboard
str,str
"""Braund, Mr. Owen Harris""","""yes"""
"""Cumings, Mrs. John Bradley (Fl…","""yes"""


## Exercises

In the exercises you will develop your understanding of:
- transforming an existing column
- adding a new column from existing columns
- adding a new column with a constant value

### Exercise 1

Add a new column called `familySize` which is the sum of the number of siblings or spouses (`SibSp` column), the number of parents or children (`Parch` column) plus one for the passenger themselves.

Print out the first 3 rows.

Hint: Add the two columns inside `()` and then apply `.alias`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

### Exercise 2 

Add a new column called `decade` that converts the `Age` column to the passenger's age in decades, e.g. 15.2 goes to 10, where 10 is an integer. Add the new column using the kwargs approach.

Print out the first 3 rows.

Hint: use `cast` to convert the dtype

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

### Exercise 3
Create a new literal column

Add a new binary column called `Aboard` that has the value `1` for all passengers.

Print out the first 3 rows

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

### Exercise 4

Add a new Boolean column `overThirty` that captures whether a passenger's age is 30 years or older

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

## Solutions

### Solution to exercise 1

Add a new column for family size

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns( 
        (
        pl.col('SibSp') + pl.col('Parch') + 1
        ).alias('familySize')
    )
    .head(3)
)

### Solution to exercise 2

Create a decades column

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns( 
        # Floor to whole decades, scale back up by 10 (15.2 -> 10), then cast to integer
        decade = ((pl.col('Age') / 10).floor() * 10).cast(pl.Int64)
    )
    .select(['Age','decade'])
    .head(3)
)


### Solution to exercise 3

Create a new literal column

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit(1).alias('Aboard')
    )
    .head(3)
)

### Solution to exercise 4

Add a new Boolean column based on an expression

In [25]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Age") >= 30).alias("overThirty")
    )
    .filter(
        pl.col("overThirty")
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,overThirty
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,bool
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",True
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",True
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S""",True
