## List dtype 2: using expressions on List columns

By the end of this lecture you will be able to:
- select data in arrays
- re-order data in arrays
- aggregate data in arrays
- call expressions on each row in a `pl.List` column

In [None]:
import polars as pl

We create a `DataFrame` with a single `pl.List` column called `values`

In [None]:
df = (
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3,4],
                [4,5,6,7,8]
            ],
        }
    )
)
df

In this lecture we refer to the data on each row as an array.

The arrays do not have to be the same length on each row.

## The arr expression namespace
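Polars has an `.arr` namespace for expressions that work on `pl.List` columns:
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/array.html

In more recent Polars releases this namespace is called `.list`, so replace `.arr` with `.list` if the code in this lecture does not run on your version.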

## Selecting data in each array
We can use array expressions from the `arr` namespace to select data from:
- the start and end of the array on each row with `first`, `last`, `head` and `tail`
- slices with `slice`

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").arr.first().alias("first"),
            pl.col("values").arr.last().alias("last"),
            pl.col("values").arr.head(2).alias("head"),
            pl.col("values").arr.tail(2).alias("tail"),
            pl.col("values").arr.slice(1,2).alias("slice"),

        ]
    )
)
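To make the semantics of these selections concrete, here is a plain-Python sketch that mimics them on the same data. It is not Polars code, and the helper names `head`, `tail` and `slice_` are hypothetical stand-ins. The main point is that `slice(1,2)` takes an offset and a length, not a start and a stop.

```python
# Plain-Python sketch (not Polars) of the selections above
values = [[0, 1], [2, 3, 4], [4, 5, 6, 7, 8]]


def head(arr, n):
    # First n elements of the array
    return arr[:n]


def tail(arr, n):
    # Last n elements of the array
    return arr[max(len(arr) - n, 0):]


def slice_(arr, offset, length):
    # Start at position `offset` and take up to `length` elements
    return arr[offset:offset + length]


selected = [
    {
        "first": arr[0],
        "last": arr[-1],
        "head": head(arr, 2),
        "tail": tail(arr, 2),
        "slice": slice_(arr, 1, 2),
    }
    for arr in values
]

for row in selected:
    print(row)
```

For the short array `[0,1]` the slice has only one element, because there is nothing left after position 1.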

More generally, we use `arr.get` to select a value by its position in each array. Negative indices count from the end of the array.

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").arr.get(0).alias("first"),
            pl.col("values").arr.get(1).alias("second"),
            pl.col("values").arr.get(-1).alias("last"),

        ]
    )
)

### Finding values in arrays
- We can check whether a value is in an array with `arr.contains`
- We can find all unique values in an array with `arr.unique`

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").arr.contains(i).alias(str(i)) for i in range(3)
        ]
    )
    .with_columns(
        pl.col("values").arr.unique().alias("unique")
    )
)

### Re-ordering values in each array
We can re-order values in each array:
- `reverse` reverses the order of the array
- `sort` sorts each array
- `shift` moves values along each array in a non-periodic way: values pushed past the end are dropped and the vacated positions are filled with `null`

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").arr.reverse().alias("reverse"),
            pl.col("values").arr.sort().alias("sort"),
            pl.col("values").arr.shift(1).alias("shift"),
        ]
    )
)
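The non-periodic behaviour of `shift` can be sketched in plain Python. This is not Polars code; the helper `shift` below is a hypothetical stand-in that uses `None` where Polars would show `null`.

```python
# Plain-Python sketch (not Polars) of a non-periodic shift
values = [[0, 1], [2, 3, 4], [4, 5, 6, 7, 8]]


def shift(arr, periods):
    # Values do not wrap around: vacated positions become None
    # and values pushed past either end are dropped
    n = len(arr)
    if periods >= 0:
        k = min(periods, n)
        return [None] * k + arr[: n - k]
    k = min(-periods, n)
    return arr[k:] + [None] * k


shifted = [shift(arr, 1) for arr in values]
print(shifted)
```

A negative argument moves the values the other way, so `shift([2, 3, 4], -1)` in this sketch gives `[3, 4, None]`.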

### Array aggregations
We can use array expressions to aggregate each array to a single value per row, such as its length, minimum, mean or maximum

In [None]:
(
    df
    .with_columns(
        [
            pl.col("values").arr.lengths().alias("lengths"),
            pl.col("values").arr.min().alias("min"),
            pl.col("values").arr.mean().alias("mean"),
            pl.col("values").arr.max().alias("max"),
        ]
    )
)
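As a cross-check on what to expect from these aggregations, the same quantities can be computed row by row in plain Python. This sketch is not Polars code; it only reproduces the per-row arithmetic.

```python
# Plain-Python sketch (not Polars) of the per-row aggregations
values = [[0, 1], [2, 3, 4], [4, 5, 6, 7, 8]]

aggregated = [
    {
        "lengths": len(arr),
        "min": min(arr),
        "mean": sum(arr) / len(arr),
        "max": max(arr),
    }
    for arr in values
]

for row in aggregated:
    print(row)
```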

## Calling expressions on each array
Each row in a `pl.List` column is a `Series`. We can call the same expressions on each `Series` that we would call on a standalone `Series` or column in a `DataFrame`.

To do this we:
- call `arr.eval` on the `pl.List` column
- inside `arr.eval`, call `pl.element` to refer to the array on each row and then call expressions on it

The call to `pl.element` inside `arr.eval` is like calling `pl.col` on a column in a `DataFrame`

In this example we `rank` the elements of each array

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [4,3,2]
            ],
        }
    )
    .with_columns(
        pl.col("values").arr.eval(
            pl.element().rank(method="ordinal")
        ).alias("eval")
    )
)
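To see what an ordinal rank does within each array, here is a plain-Python sketch. It is not Polars code, and `ordinal_rank` is a hypothetical helper: each element gets a distinct rank starting from 1, with ties broken by order of appearance.

```python
# Plain-Python sketch (not Polars) of an ordinal rank within each array
values = [[0, 1], [4, 3, 2]]


def ordinal_rank(arr):
    # Sort the positions by value; sorted() is stable, so equal values
    # keep their original order and still get distinct ranks
    order = sorted(range(len(arr)), key=lambda i: arr[i])
    ranks = [0] * len(arr)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


ranked = [ordinal_rank(arr) for arr in values]
print(ranked)
```

The rank is computed separately for each row, which is the role that `arr.eval` plays in the cell above.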

If we call `pl.element` with no further expressions it returns the full array

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [4,3,2]
            ],
        }
    )
    .with_columns(
        pl.col("values").arr.eval(
            pl.element()
        ).alias("eval")
    )
)

We can do more complicated operations with repeated calls to `pl.element`

In this example we want to remove `null` values in the arrays using a `filter` inside `arr.eval`

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,None,1], 
                [2,3,None]
            ],
        }
    )
    .with_columns(
        pl.col("values").arr.eval(
            pl.element().filter(
                pl.element().is_not_null()
            )
        ).alias("eval")
    )
)
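The filter above keeps only the non-null elements of each array. A plain-Python sketch of the same idea, using `None` in place of `null` (this is not Polars code):

```python
# Plain-Python sketch (not Polars) of dropping nulls inside each array
values = [[0, None, 1], [2, 3, None]]

# For each row, keep the elements that are not None
without_nulls = [[x for x in arr if x is not None] for arr in values]
print(without_nulls)
```

The two references to `pl.element` in the Polars version correspond to the two uses of `x` in the comprehension: one is the value that is kept and the other is the value that is tested.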

## Exercises
In the exercises you will develop your understanding of:
- splitting a string into an array
- extracting elements of an array
- slicing an array
- indexing into an array using expressions

### Exercise 1
We need to parse the following address strings to get columns with the:
- number
- street
- city
- state
- zipcode

In [None]:
pl.Config.set_fmt_str_lengths(150)
addresses = [
    '93 NORTH 9TH STREET, BROOKLYN NY 11211',
    '380 WESTMINSTER ST, PROVIDENCE RI 02903',
    '177 MAIN STREET, LITTLETON NH 03561'
]
df = (
    pl.DataFrame(
        {"address":addresses}
    )
)
df

Add a column called `split` with the string split by whitespace (using `str.split`) into a list column

In [None]:
(
    df
    .with_columns(
        <blank>
    )
)

In an additional `with_columns` statement add a 32-bit integer column called `number` using the `first` element of each array

The street component of the address runs from the second element of the list to the element of the list that contains a comma.

Add a list column called `contains_comma` where we check whether each element of the arrays in `split` contains a comma. Use `eval` to run the `str.contains` expression on each element in the array

With a new call to `with_columns` slice each array in `split` from the second element to the index of the element that contains a comma.

Hint 1: there is an `arr.arg_max` expression that finds the index of the largest value in an array. Use this to find the index of the `True` value in `contains_comma`

In [None]:
(
    pl.DataFrame(
        {
            "values":
            [
                [0,1],
                [3,2]
            ]
        }
    )
    .with_columns(
        pl.col("values").arr.arg_max().alias("arg_max")
    )
)

Hint 2: you can pass an expression to `arr.slice` if you want the `slice` to depend on values in another column

Join the string arrays in `street` using `arr.join` (with a " " separating the strings)

Extract the `city` from `split` by slicing. The slice should start one position after the `arg_max` value in `contains_comma` and have a length of 1 (here we are taking advantage of 3 one-word city names!)

Get the `zipcode` as the last element in `split`

### Solution to exercise 1
We need to parse the following address strings to get columns with the:
- number
- street
- city
- state
- zipcode

In [None]:
pl.Config.set_fmt_str_lengths(150)
addresses = [
    '93 NORTH 9TH STREET, BROOKLYN NY 11211',
    '380 WESTMINSTER ST, PROVIDENCE RI 02903',
    '177 MAIN STREET, LITTLETON NH 03561'
]
df = (
    pl.DataFrame(
        {"address":addresses}
    )
)
df

Add a column called `split` with the string split by whitespace (using `str.split`) into a list column

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
)

Add a 32-bit integer column called `number` using the `first` element of each array

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        pl.col("split").arr.first().cast(pl.Int32).alias("number")
    )
)

The street component of the address runs from the second element of the list to the element of the list that contains a comma.

Add a list column called `contains_comma` where we check whether each element of the arrays in `split` contains a comma. Use `eval` to run the `str.contains` expression on each element in the array

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").arr.first().cast(pl.Int32).alias("number"),
            pl.col("split").arr.eval(
                pl.element().str.contains(",")
            ).alias("contains_comma")
        ]
    )
)

With a new call to `with_columns` slice each array in `split` from the second element to the index of the element that contains a comma.

Hint 1: there is an `arr.arg_max` expression that finds the index of the largest value in an array. Use this to find the index of the `True` value in `contains_comma`

In [None]:
(
    pl.DataFrame(
        {
            "values":
            [
                [0,1],
                [3,2]
            ]
        }
    )
    .with_columns(
        pl.col("values").arr.arg_max().alias("arg_max")
    )
)

Hint 2: you can pass an expression to `arr.slice` if you want the `slice` to depend on values in another column

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").arr.first().cast(pl.Int32).alias("number"),
            pl.col("split").arr.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").arr.slice(1,pl.col("contains_comma").arr.arg_max()).alias("street")
    )
    
)
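To check the slice arguments, we can walk through the same steps for one address in plain Python. This sketch is not Polars code; `comma_idx` is a hypothetical name for the value that `arr.arg_max` returns on `contains_comma`, found here with `list.index`.

```python
# Plain-Python sketch (not Polars) of the street slice for one address
address = "93 NORTH 9TH STREET, BROOKLYN NY 11211"

split = address.split(" ")
contains_comma = ["," in part for part in split]

# Position of the single True value
comma_idx = contains_comma.index(True)

# Offset 1 skips the house number. The street ends at position comma_idx,
# so the number of elements to take is comma_idx - 1 + 1 = comma_idx
street = split[1:1 + comma_idx]
print(comma_idx, street)
```

This is why the second argument to `arr.slice` is the `arg_max` value itself: it is a length, not an end position. The last street element still carries its trailing comma.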

Join the string arrays in `street` using `arr.join` (with a " " separating the strings)

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").arr.first().cast(pl.Int32).alias("number"),
            pl.col("split").arr.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").arr.slice(1,pl.col("contains_comma").arr.arg_max()).arr.join(" ").alias("street")
    )
    
)

Extract the `city` from `split` by slicing. The slice should start one position after the `arg_max` value in `contains_comma` and have a length of 1 (here we are taking advantage of 3 one-word city names!)

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").arr.first().cast(pl.Int32).alias("number"),
            pl.col("split").arr.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").arr.slice(1,pl.col("contains_comma").arr.arg_max()).arr.join(" ").alias("street")
    )
    .with_columns(
        pl.col("split").arr.slice(
            pl.col("contains_comma").arr.arg_max()+1,1
        ).alias("city")
    )
)

Get the `zipcode` as the last element in `split`

In [None]:
(
    df
    .with_columns(
        pl.col("address").str.split(" ").alias("split")
    )
    .with_columns(
        [
            pl.col("split").arr.first().cast(pl.Int32).alias("number"),
            pl.col("split").arr.eval(pl.element().str.contains(",")).alias("contains_comma")
        ]
    )
    .with_columns(
        pl.col("split").arr.slice(1,pl.col("contains_comma").arr.arg_max()).arr.join(" ").alias("street")
    )
    .with_columns(
        [
            pl.col("split").arr.slice(
                pl.col("contains_comma").arr.arg_max()+1,1
            ).alias("city"),
            pl.col("split").arr.last().cast(pl.Int32).alias("zipcode")
        ]
    )
)
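The city and zipcode steps can be checked the same way for one address. This plain-Python sketch is not Polars code, and `comma_idx` is again a hypothetical name for the `arg_max` value. It also shows one thing to be aware of when casting a zipcode to an integer: any leading zero is lost.

```python
# Plain-Python sketch (not Polars) of the city and zipcode steps
address = "380 WESTMINSTER ST, PROVIDENCE RI 02903"

split = address.split(" ")
comma_idx = [("," in part) for part in split].index(True)

# The city is the single element right after the one containing the comma
city = split[comma_idx + 1:comma_idx + 2]

# The zipcode is the last element; converting it to an integer drops the leading zero
zipcode_str = split[-1]
zipcode_int = int(zipcode_str)
print(city, zipcode_str, zipcode_int)
```

As in the Polars solution, `city` is a one-element list rather than a string. If the leading zero matters, the zipcode can be kept as a string instead of being cast.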