## Transforming a `DataFrame`
In this lecture you will learn how to:
- rename columns in a `DataFrame`
- drop columns from a `DataFrame`
- re-order the columns of a `DataFrame`
- transform a `DataFrame` with a function using `pipe`

In [None]:
import polars as pl
# Set the number of rows to be printed to 8
pl.Config.set_tbl_rows(8)

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(2)

## Renaming columns
We can rename columns by passing a `dict` that maps old names to new names.

In [None]:
(
    df
    .rename({"PassengerId":"ID"})
    .head(2)
)

## Dropping columns

We can drop columns by passing a `list` of column names.

In [None]:
(
    df
    .drop(["PassengerId","Pclass"])
    .head(2)
)

## Re-ordering columns
We can re-order columns by passing a `list` of column names to `select`.

In this example we re-order the columns in alphabetical order.

In [None]:
(
    df
    .select(sorted(df.columns))
)

## Re-ordering columns in a query

We may want to re-order the columns in the middle of a query, where the intermediate `DataFrame` is not assigned to a variable and so we cannot refer to its `columns` attribute directly. This may be because in the query:
- we have just read the file
- we have added/removed columns
- we have changed column names

We can still re-order columns using `pipe`.

The `pipe` method gives us access to the `DataFrame` at that point in a query, as if it were assigned to a variable.

In this example we sort the columns alphabetically after reading a CSV.

In [None]:
(
    pl.read_csv(csvFile)
    .pipe(
        lambda tempDf: tempDf.select(sorted(tempDf.columns))
    )
    .head(2)
)

We can also use `pipe` in lazy mode.

In [None]:
(
    pl.scan_csv(csvFile)
    .pipe(
        lambda qdf: qdf.select(sorted(qdf.columns[:3]))
    )
)

The transformations in `pipe` are passed to the query optimiser in lazy mode.

In this example we only use the first three columns in the `select` and print the optimised query plan with `explain`.

In [None]:
print(
    pl.scan_csv(csvFile)
    .pipe(
        lambda qdf: qdf.select(sorted(qdf.columns[:3]))
    )
    .explain()
)

The query optimiser sees that only 3 of the 12 columns are required, so the optimised plan reads just those columns from the CSV:
```bash
PROJECT 3/12 COLUMNS
```

## Applying a function to a `DataFrame`
We can apply a function to a `DataFrame` using `pipe`. Any extra arguments passed to `pipe` are forwarded to the function.

In this example we define a function `_multiply_floats` that selects all the `Float64` columns and multiplies them by `mul`, which we set to 3.

In [None]:
def _multiply_floats(df: pl.DataFrame, mul: int) -> pl.DataFrame:
    # Keep only the Float64 columns and multiply every value by mul
    return df.select(pl.col(pl.Float64)) * mul

(
    df
    .pipe(
        _multiply_floats,
        mul=3,
    )
    .head(3)
)


## Exercises
In the exercises you will develop your understanding of:
- renaming columns
- dropping columns
- transformations using `pipe`

### Exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
    .head(3)
)

### Exercise 2
Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
    .head(3)
)

Rename all columns to lower case. Expand the cell below if you would like a hint.

In [None]:
#Hint: do the renaming inside .pipe
#Hint: use the Python method .lower() on column name strings

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
    .head(3)
)

## Solutions

### Solution to exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csvFile)
    .drop(["Age","Fare"])
    .head(3)
)

### Solution to exercise 2

Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csvFile)
    .rename({"Age":"age"})
    .head(3)
)

Rename all columns to lower case. Expand the cell below if you would like a hint.

In [None]:
#Hint: do the renaming inside .pipe
#Hint: use the Python method .lower() on column name strings

In [None]:
(
    pl.read_csv(csvFile)
    .pipe(
        lambda df: df.rename({oldCol: oldCol.lower() for oldCol in df.columns})
    )
    .head(3)
)