## Selecting columns 4: Transforming and adding a column
By the end of this lecture you will be able to:
- transform an existing column in place using `with_columns`
- add a new column with an expression
- add a new column with column arithmetic
- add a column with constant values using `pl.lit`

In [1]:
import polars as pl

In [2]:
csv_file = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Transforming an existing column

We can transform an existing column by passing the column to `with_columns`.

In this example we round `Fare` to 0 decimal places. Because the output keeps the name `Fare`, the rounded values replace the original column.

In [7]:
print((
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Fare").round(0)
        )
    .head(3)
))

print(
    (
        pl.read_csv(csv_file)
            .select(
            pl.col("Fare").round(0)
        ).head()
    )
)

shape: (3, 12)
┌─────────────┬──────────┬────────┬──────────────────────┬───┬───────────┬──────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name                 ┆ … ┆ Ticket    ┆ Fare ┆ Cabin ┆ Embarked │
│ ---         ┆ ---      ┆ ---    ┆ ---                  ┆   ┆ ---       ┆ ---  ┆ ---   ┆ ---      │
│ i64         ┆ i64      ┆ i64    ┆ str                  ┆   ┆ str       ┆ f64  ┆ str   ┆ str      │
╞═════════════╪══════════╪════════╪══════════════════════╪═══╪═══════════╪══════╪═══════╪══════════╡
│ 1           ┆ 0        ┆ 3      ┆ Braund, Mr. Owen     ┆ … ┆ A/5 21171 ┆ 7.0  ┆ null  ┆ S        │
│             ┆          ┆        ┆ Harris               ┆   ┆           ┆      ┆       ┆          │
│ 2           ┆ 1        ┆ 1      ┆ Cumings, Mrs. John   ┆ … ┆ PC 17599  ┆ 71.0 ┆ C85   ┆ C        │
│             ┆          ┆        ┆ Bradley (Fl…         ┆   ┆           ┆      ┆       ┆          │
│ 3           ┆ 1        ┆ 3      ┆ Heikkinen, Miss.     ┆ … ┆ STON/O2.  ┆ 8

## Adding a new column from an existing column
We can create a new column from an existing column by giving the output of the expression a new name with `alias`.

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').round(0).alias('roundFare')
    )
    .head(3)
)

In [7]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').round(0).alias('roundF')
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,roundF
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",7.0
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",71.0
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",8.0


Instead of using `alias`, we can create the new column by passing the expression as a keyword argument whose name is the new column name (in Polars this approach is referred to as kwargs assignment).

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        roundFare = pl.col('Fare').round(0)
    )
    .head(3)
)

## Difference between `with_columns` and `select`
- `select` returns only the columns produced by the expressions you pass, whereas `with_columns` returns all of the existing columns plus any added or replaced ones
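- `with_columns` is designed to take expressions; in recent Polars versions a bare string is read as the name of an existing column, so a string on its own cannot define a new column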

## Adding or transforming a column with column arithmetic

We can transform columns with arithmetic in an expression.

In this example we double the values in the `Fare` column in a new column called `doubleFare`

In [15]:
print((
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") * 2).alias("doubleFare")
    )
    .head(3)
))

print(
    (
        pl.read_csv(csv_file)
        .select(
            (pl.col('Fare') * 2).alias('d')
        )
        #.select("Fare", "d")
        .head(3)
    )
)

shape: (3, 13)
┌─────────────┬──────────┬────────┬──────────────────┬───┬─────────┬───────┬──────────┬────────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name             ┆ … ┆ Fare    ┆ Cabin ┆ Embarked ┆ doubleFare │
│ ---         ┆ ---      ┆ ---    ┆ ---              ┆   ┆ ---     ┆ ---   ┆ ---      ┆ ---        │
│ i64         ┆ i64      ┆ i64    ┆ str              ┆   ┆ f64     ┆ str   ┆ str      ┆ f64        │
╞═════════════╪══════════╪════════╪══════════════════╪═══╪═════════╪═══════╪══════════╪════════════╡
│ 1           ┆ 0        ┆ 3      ┆ Braund, Mr. Owen ┆ … ┆ 7.25    ┆ null  ┆ S        ┆ 14.5       │
│             ┆          ┆        ┆ Harris           ┆   ┆         ┆       ┆          ┆            │
│ 2           ┆ 1        ┆ 1      ┆ Cumings, Mrs.    ┆ … ┆ 71.2833 ┆ C85   ┆ C        ┆ 142.5666   │
│             ┆          ┆        ┆ John Bradley     ┆   ┆         ┆       ┆          ┆            │
│             ┆          ┆        ┆ (Fl…             ┆   ┆         ┆       ┆

In [9]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") * 2).alias("df")
    )
    .head(3)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,df
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",14.5
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",142.5666
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",15.85


In [16]:
(
    pl.read_csv(csv_file)
    .select(
        (pl.col('Age') * 2).alias('test')
    )
    .head()
)

test
f64
44.0
76.0
52.0
70.0
70.0


We can also do arithmetic on multiple columns in an expression.

In this example we add the values in the `Fare` and `Age` columns.

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Fare") + pl.col("Age")).alias("farePlusAge")
    )
    .head(2)
)

In [13]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col('Fare') + pl.col('Age') / 2).alias('test')
    )
    .head(2)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,test
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",18.25
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",90.2833


Some people find arithmetic written with named methods more readable than operators.

We repeat the `Fare` plus `Age` example, this time with the `.add` method rather than the `+` operator.

In [14]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Fare').add(pl.col('Age')).alias('farePlusAge')
    )
    .head(2)
)


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,farePlusAge
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,f64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",29.25
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",109.2833


The mapping from Python operators to expression methods is:
- `+` to `add`
- `==` to `eq`
- `//` to `floordiv`
- `>` to `gt`
- `>=` to `ge`
- `<` to `lt`
- `<=` to `le`
- `%` to `mod`
- `!=` to `ne`
- `-` to `sub`
- `/` to `truediv`
- `^` to `xor`
- `*` to `mul`

## Adding a new column with a constant value

Use the literal function `pl.lit` to specify a constant value in Polars.

Here we add a new column called `Aboard` with the value `True` for all passengers.

* Use `pl.lit` when you want to assign one specific value to every row
* Both string and numeric values are supported

In [24]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit(True).alias('Aboard')
    )
    .select(['Name','Aboard'])
    .head(2)
)

Name,Aboard
str,bool
"""Braund, Mr. Owen Harris""",True
"""Cumings, Mrs. John Bradley (Fl…",True


In [17]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit(77).alias('lucky')
    )
    .select('lucky', 'Name')
    .head()
)

lucky,Name
i32,str
77,"""Braund, Mr. Owen Harris"""
77,"""Cumings, Mrs. John Bradley (Fl…"
77,"""Heikkinen, Miss. Laina"""
77,"""Futrelle, Mrs. Jacques Heath (…"
77,"""Allen, Mr. William Henry"""


In [22]:
# Flag each passenger: 1 ("lucky") if they survived, otherwise 0 ("unlucky")
# First check the column names
print(df.columns)

#df.select(pl.col('Survived')).value_counts()
print(df.select(pl.col('Survived')).group_by('Survived').count())

print(
    (
        df
        .with_columns(
            pl.when(pl.col('Survived') == 1)
            .then(pl.lit(1))
            .otherwise(pl.lit(0))
            .alias("lucky_or_not")  
        )
        .select(
        pl.col('Survived', "lucky_or_not")
        )   
        
     )
    
)


['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
shape: (2, 2)
┌──────────┬───────┐
│ Survived ┆ count │
│ ---      ┆ ---   │
│ i64      ┆ u32   │
╞══════════╪═══════╡
│ 1        ┆ 342   │
│ 0        ┆ 549   │
└──────────┴───────┘
shape: (891, 2)
┌──────────┬──────────────┐
│ Survived ┆ lucky_or_not │
│ ---      ┆ ---          │
│ i64      ┆ i32          │
╞══════════╪══════════════╡
│ 0        ┆ 0            │
│ 1        ┆ 1            │
│ 1        ┆ 1            │
│ 1        ┆ 1            │
│ 0        ┆ 0            │
│ …        ┆ …            │
│ 0        ┆ 0            │
│ 1        ┆ 1            │
│ 0        ┆ 0            │
│ 1        ┆ 1            │
│ 0        ┆ 0            │
└──────────┴──────────────┘




## Exercises

In the exercises you will develop your understanding of:
- transforming an existing column
- adding a new column from existing columns
- adding a new column with a constant value

### Exercise 1

Add a new column called `familySize` which is the sum of the number of siblings or spouses (`SibSp` column) and the number of parents or children (`Parch` column), plus one for the passenger themself.

Print out the first 3 rows.

Hint: Add the two columns inside `()` and then apply `.alias`

In [18]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col('SibSp') + pl.col('Parch') + 1).alias('familySize')
    )
    .head()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,familySize
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str,i64
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S""",2
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C""",2
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S""",1
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S""",2
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S""",1


### Exercise 2 

Add a new column called `decade` that converts the `Age` column to the passenger's age in decades, e.g. 15.2 goes to 10, where 10 is an integer. Add the new column using the kwargs approach.

Print out the first 3 rows.

Hint: use `cast` to convert the dtype

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

### Exercise 3
Create a new literal column

Add a new binary column called `Aboard` that has the value `1` for all passengers.

Print out the first 3 rows

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

### Exercise 4

Add a new Boolean column called `overThirty` that is `True` when a passenger's age is 30 years or older.

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

## Solutions

### Solution to exercise 1

Add a new column for family size

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns( 
        (
        pl.col('SibSp') + pl.col('Parch') + 1
        ).alias('familySize')
    )
    .head(3)
)

### Solution to exercise 2

Create a decades column

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns( 
        # Floor to the decade, then scale back up so that 15.2 becomes 10
        decade = ((pl.col('Age') / 10).floor() * 10).cast(pl.Int64)
    )
    .select(['Age','decade'])
    .head(3)
)


### Solution to exercise 3

Create a new literal column

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.lit(1).alias('Aboard')
    )
    .head(3)
)

### Solution to exercise 4

Add a new Boolean column based on an expression

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        (pl.col("Age") >= 30).alias("overThirty")
    )
    .head(3)
)