## Transforming a `DataFrame`
In this lecture you will learn how to:
- rename, drop and re-order the columns of a `DataFrame`
- change the dtypes of columns with `cast`
- transform a `DataFrame` in a function using `pipe`

In [5]:
import polars as pl
import polars.selectors as cs # type: ignore
# Set the number of rows to be printed to 6
pl.Config.set_tbl_rows(6)

polars.config.Config

In [3]:
csv_file = "../data/titanic.csv"

In [6]:
df = pl.read_csv(csv_file)
df.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


## Renaming columns
We can rename columns by passing a `dict` that maps old names to new names.

In [13]:
(
    df
    .rename({
        "PassengerId": "ID",
        "Pclass": "Class",
        "Sex": "Gender",
        "SibSp": "Siblings/Spouses Aboard",
        "Parch": "Parents/Children Aboard",
        "Ticket": "Ticket Number",
        "Cabin": "Cabin Number",
        "Embarked": "Port of Embarkation"
    })
    .head(2)
)

ID,Survived,Class,Name,Gender,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Ticket Number,Fare,Cabin Number,Port of Embarkation
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""

We can also rename a column inside an expression with `alias`. Unlike `rename`, a `select` returns only the columns we select


In [8]:
(
    df
    .select(
        pl.col('PassengerId').alias('ID')
    )
    .head(2)
)

ID
i64
1
2


## Dropping columns

We can drop columns by passing a `list` of column names

In [9]:
(
    df
    .drop(["PassengerId","Pclass"])
    .head(2)
)

Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,str,str,f64,i64,i64,str,f64,str,str
0,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


Or we can pass the column names as separate, comma-separated arguments

In [10]:
(
    df
    .drop("PassengerId","Pclass")
    .head(2)
)

Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,str,str,f64,i64,i64,str,f64,str,str
0,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


## Re-ordering columns
We can re-order columns with a `list` in `select`.

In this example we re-order the columns in alphabetical order

In [11]:
(
    df
    .select(sorted(df.columns))
    .head(2)
)

Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
f64,str,str,f64,str,i64,i64,i64,str,i64,i64,str
22.0,,"""S""",7.25,"""Braund, Mr. Owen Harris""",0,1,3,"""male""",1,0,"""A/5 21171"""
38.0,"""C85""","""C""",71.2833,"""Cumings, Mrs. John Bradley (Fl…",0,2,1,"""female""",1,1,"""PC 17599"""


## Changing dtypes
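We can change dtypes within an expression using `pl.col(...).cast()`, or we can call `cast` with a `dict` argument on a `DataFrame`.

In the first example we cast within an expression: we bin `Age` into decades with `when`/`then`/`otherwise` and cast the resulting `Age Group` column to string. Missing ages and ages outside the range 10 to 49 fall through to the `otherwise` branch. In the second example we cast the `Survived` column from integer to string by calling `cast` on the `DataFrame`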

In [21]:
(
    df
    .select(
        pl.col("Age"),
        pl.when(
            (pl.col("Age") // 10) == 1
        )
        .then(10)
        .when(
            (pl.col("Age") // 10) == 2
        )
        .then(20)
        .when(
            (pl.col("Age") // 10) == 3
        )
        .then(30)
        .when(
            (pl.col("Age") // 10) == 4
        )
        .then(40)
        .otherwise(50)
        .alias("Age Group")
        .cast(pl.Utf8)
    )
)

Age,Age Group
f64,str
22.0,"""20"""
38.0,"""30"""
26.0,"""20"""
…,…
,"""50"""
26.0,"""20"""
32.0,"""30"""


In [None]:
(
    df
    .cast(
        {
            "Survived":pl.Utf8
        }
    )
    .head(2)
)

We can also cast an entire `DataFrame`

In [None]:
(
    df
    .cast(pl.Utf8)
    .head(2)
)

Or use selectors

In [None]:
(
    df
    .cast(
        {
            cs.numeric():pl.Utf8
        }
    )
    .head(2)
)

## Transforming `DataFrames` in a function

We may want to capture some `DataFrame` transformations in a function. This can be to:
- re-use the same transformations multiple times
- make code easier to read or
- make the transformations testable

If our function:
- takes a `DataFrame` (and some other optional arguments) as an input and
- outputs a `DataFrame`

then we can use the `pipe` method.

In this example we define a function that makes all string columns uppercase

In [22]:
def uppercase_all_strings(df):
    return (
        df
        .with_columns(
            pl.col(pl.Utf8).str.to_uppercase()
        )
    )

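For comparison, we first write a transformation inline that makes only the `Name` column uppercase. After that we pipe the `DataFrame` to our function, which makes every string column uppercase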

In [28]:
(
    df
    .with_columns(
        pl.col('Name').str.to_uppercase()
    )
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""BRAUND, MR. OWEN HARRIS""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""CUMINGS, MRS. JOHN BRADLEY (FL…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""HEIKKINEN, MISS. LAINA""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
889,0,3,"""JOHNSTON, MISS. CATHERINE HELE…","""female""",,1,2,"""W./C. 6607""",23.45,,"""S"""
890,1,1,"""BEHR, MR. KARL HOWELL""","""male""",26.0,0,0,"""111369""",30.0,"""C148""","""C"""
891,0,3,"""DOOLEY, MR. PATRICK""","""male""",32.0,0,0,"""370376""",7.75,,"""Q"""


In [23]:
(
    df
    .pipe(uppercase_all_strings)
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""BRAUND, MR. OWEN HARRIS""","""MALE""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""CUMINGS, MRS. JOHN BRADLEY (FL…","""FEMALE""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""HEIKKINEN, MISS. LAINA""","""FEMALE""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
889,0,3,"""JOHNSTON, MISS. CATHERINE HELE…","""FEMALE""",,1,2,"""W./C. 6607""",23.45,,"""S"""
890,1,1,"""BEHR, MR. KARL HOWELL""","""MALE""",26.0,0,0,"""111369""",30.0,"""C148""","""C"""
891,0,3,"""DOOLEY, MR. PATRICK""","""MALE""",32.0,0,0,"""370376""",7.75,,"""Q"""


One advantage of the `pipe` method is that it gives us access to metadata of the `DataFrame`, such as its column names, in the middle of a method chain where no variable holds the intermediate `DataFrame`.

In the following example the query starts by scanning a CSV file in lazy mode. We want to re-order the columns alphabetically without breaking the method chain.

We can do this with `pipe`.

The `pipe` method passes the intermediate `LazyFrame` to a function, where it is bound to a temporary variable.

In this example we sort the columns alphabetically inside a `lambda` function using `pipe`

In [24]:
(
    pl.scan_csv(csv_file)
    .pipe(
        lambda temp_df: temp_df.select(sorted(temp_df.collect_schema().names()))
    )
    .collect_schema().names()
)

['Age',
 'Cabin',
 'Embarked',
 'Fare',
 'Name',
 'Parch',
 'PassengerId',
 'Pclass',
 'Sex',
 'SibSp',
 'Survived',
 'Ticket']

The transformations in `pipe` are passed to the query optimiser in lazy mode.

In this example we only use the first three columns in the `select`

In [25]:
print(
    pl.scan_csv(csv_file)
    .pipe(
        lambda temp_df: temp_df.select(sorted(temp_df.collect_schema().names()[:3]))
    )
    .explain()
)

simple π 3/3 ["PassengerId", "Pclass", ... 1 other column]
  Csv SCAN [../data/titanic.csv]
  PROJECT 3/12 COLUMNS


The `PROJECT 3/12 COLUMNS` line shows that the query optimiser sees that only 3 of the 12 columns are required.

### Function arguments using `pipe`
The key point about `pipe` is that we pass a function where:
- a `DataFrame` is the first argument and
- only a `DataFrame` is output

We can pass additional arguments to the function through `pipe`. Anything after the function itself is forwarded to it

In [26]:
def _multiply_floats(df: pl.DataFrame, multiplication_factor: int) -> pl.DataFrame:
    return df.select(pl.col(pl.Float64)) * multiplication_factor

(
    df
    .pipe(
        _multiply_floats,
        multiplication_factor=3,
    )
    .head(3)
)


Age,Fare
f64,f64
66.0,21.75
114.0,213.8499
78.0,23.775


## Exercises
In the exercises you will develop your understanding of:
- renaming columns
- dropping columns
- transformations using `pipe`

### Exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Cast all of the integer columns to 16-bit integers

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

### Exercise 2
Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Rename all column names to lower case. Expand the cell below if you would like a hint

In [None]:
#Hint: do the renaming inside .pipe
#Hint: use the Python method .lower() on column name strings

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

## Solutions

### Solution to exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csv_file)
    .drop(["Age","Fare"])
    .head(3)
)

Cast all of the integer columns to 16-bit integers

In [None]:
(
    pl.read_csv(csv_file)
    .cast(
        {
            cs.integer():pl.Int16
        }
    )
    .head(3)
)

### Solution to exercise 2

Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csv_file)
    .rename({"Age":"age"})
    .head(3)
)

Rename all column names to lower case

In [27]:
(
    pl.read_csv(csv_file)
    .pipe(
        lambda df: df.rename({col: col.lower() for col in df.columns})
    )
    .head(3)
)

passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
