## Selecting columns 4: Transforming and adding a column
By the end of this section we will learn how to:
- transform an existing column in place using `with_columns`
- add a new column with an expression
- add a new column with column arithmetic
- add a column with constant values using `pl.lit`

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Transforming an existing column

We can transform an existing column by passing an expression on that column to `with_columns`. Because the output keeps the original column name, it replaces the existing column.

In this example we round `Fare` to 0 decimal places.

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.col("Fare").round(0)
        )
    .head(3)
)

## Adding a new column from an existing column
We can create a new column from an existing column by giving the output of the expression a new name with `alias`.

In [None]:
# this is recommended 
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.col('Fare').round(0).alias('roundFare')
    )
    .head(3)
)

Instead of using `alias`, we can create the new column by passing the expression as a keyword argument whose name is the new column name (this approach is referred to in Polars as kwargs assignment).

In [None]:
# this is not very commonly used, but it works as well
(
    pl.read_csv(csvFile)
    .with_columns(
        roundFare = pl.col('Fare').round(0)
    )
    .head(3)
)

## Difference between `with_columns` and `select`
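- `select` returns only the columns produced by its expressions, whereas `with_columns` returns all of the original columns plus any new or transformed ones
- `with_columns` is normally called with expressions; recent Polars versions parse a bare string as a column name, which just returns that column unchanged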

## Adding or transforming a column with column arithmetic

We can transform columns with arithmetic in an expression.

In this example we double the values in the `Fare` column and store the result in a new column called `doubleFare`.

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .with_columns(
        (pl.col("Fare") * 2).alias("doubleFare")
    )
    .head(3)
)

We can also do arithmetic on multiple columns in an expression.

In this example we add the values in the `Fare` and `Age` columns.

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .with_columns(
        (pl.col("Fare") + pl.col("Age")).alias("farePlusAge")
    )
    .head(2)
)

Some people find named methods more readable than arithmetic operators.

We repeat the example above but with the `.add` method rather than the `+` operator.

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .with_columns(
        pl.col('Fare').add(pl.col('Age')).alias('farePlusAge')
    )
    .head(2)
)


The mapping from Python operators to expression methods is:
- `+` to `add`
- `-` to `sub`
- `*` to `mul`
- `/` to `truediv`
- `//` to `floordiv`
- `%` to `mod`
- `==` to `eq`
- `!=` to `ne`
- `>` to `gt`
- `>=` to `ge`
- `<` to `lt`
- `<=` to `le`
- `^` to `xor`

## Adding a new column with a constant value

Use the literal function `pl.lit` to specify a constant value in Polars.

Here we add a new column called `Aboard` with the value `yes` for every passenger.

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .with_columns(
        pl.lit('yes').alias('Aboard')
    )
    .select(['Name','Aboard'])
    .head(2)
)

## Exercises

In the exercises you will develop your understanding of:
- transforming an existing column
- adding a new column from existing columns
- adding a new column with a constant value


### Exercise 1: Add a new column for family size

Add a new column called `familySize` which is the sum of the number of siblings or spouses (`SibSp` column), the number of parents or children (`Parch` column) and one for the passenger themselves.

Print out the first 3 rows.

Hint: Wrap the whole sum in `()` and then apply `.alias`

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

### Exercise 2: Create a decades column
Add a new column called `decade` that converts the `Age` column to the passenger's age in whole decades, e.g. 15.2 goes to 1, where 1 is an integer. Add the new column using the kwargs approach.

Print out the first 3 rows.

Hint: use `cast` to convert the dtype

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

### Exercise 3: Create a new literal column
Add a new binary column called `Aboard` that has the value `1` for all passengers.

Print out the first 3 rows.

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

### Exercise 4: Add a new Boolean column based on an expression

Add a new Boolean column `overThirty` that captures whether a passenger's age is 30 years or older.

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

## Solutions

### Solution to exercise 1: Add a new column for family size

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns( 
        (
        pl.col('SibSp') + pl.col('Parch') + 1
        ).alias('familySize')
    )
    .head(3)
)


In [None]:
# Alternative: the same calculation written with the `add` method
df = pl.read_csv(csvFile)
(
    df
    .with_columns(
        pl.col('SibSp').add(pl.col('Parch')).add(1).alias('familySize')
    )
    .head(3)
)

### Solution to exercise 2: Create a decades column

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns( 
        decade = ((pl.col('Age')/10).floor()).cast(pl.Int64)
    )
    .select(['Age','decade'])
    .head(3)
)


### Solution to exercise 3: Create a new literal column

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.lit(1).alias('Aboard')
    )
    .head(3)
)

### Solution to Exercise 4: Add a new Boolean column based on an expression

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        (pl.col("Age") >= 30).alias("overThirty")
    )
    .head(3)
)

In [None]:
# Alternative with when/then/otherwise: a missing Age gives False here rather than null
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.when(pl.col("Age") >= 30).then(True).otherwise(False).alias("overThirty")
    )
    .head(3)
)