## List dtype 1: Creating and transforming List columns
By the end of this lecture you will be able to:
- select `pl.List` columns
- turn each `pl.List` row into its own row
- convert a `pl.List` column to a `pl.Struct` or `pl.DataFrame`
- convert a `pl.List` column to a NumPy array

In [None]:
import polars as pl

We create a `DataFrame` with integer, floating point and string `pl.List` columns

In [None]:
dfLists = pl.DataFrame({
    'ints':[ 
        [0,1], 
        [2,3]
    ],
    'floats':[ 
        [0.0,1], 
        [2,3]
    ],
    'strings':[ 
        ["0","1"],
        ["2","3"]
    ]
})
dfLists

In the printed representation we see a list on each row.

In reality the data on each row is a Polars `Series`.

We can see the underlying `Series` by selecting a row in a `pl.List` column

In [None]:
dfLists[0,"ints"]

In this lecture we refer to the data on each row as an array.

## Selecting `pl.List` columns 
At present we cannot select all `pl.List` columns by passing `pl.List` on its own, without the dtype of the elements

In [None]:
(
    dfLists
    .select(
        pl.col(pl.List)
    )
)

Instead we must pass the inner dtype, that is the dtype of the elements of the arrays in that column.

In this example we select the 64-bit integer `pl.List` column 

In [None]:
(
    dfLists
    .select(
        pl.col(pl.List(pl.Int64))
    )
)

The length of the array does not have to be the same on each row

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3,4],
                [4,5,6,7,8]
            ],
        }
    )
)

## Turning `pl.List` columns into rows
We use `explode` to turn each array element into a row.

In some cases we want to keep track of which rows came from which array (for example to do a `groupby` operation). If we do not have an existing column with a unique row identifier, we can call `with_row_count` before calling `explode`. This adds a `row_nr` column whose value is repeated for every element of the original array.

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3,4]
            ],
        }
    )
    .with_row_count()
    .explode("values")
)

## Convert a `pl.List` column to a `pl.Struct` column or a `DataFrame`
Polars has an `arr` namespace with expressions that work on `pl.List` columns (we see more of this in the next lecture).

We convert a `pl.List` column to a `pl.Struct` column with `arr.to_struct`.

In this example we use `arr.to_struct` to turn the `pl.List` column into `DataFrame` columns

In [None]:
(
    pl.DataFrame(
        {
            'values':[ 
                [0,1], 
                [2,3],
                [4,5]
            ],
        }
    )
    # Convert the arrays to a struct
    .with_columns(
        pl.col("values").arr.to_struct().alias("value_struct")
    )
    # Un-nest the struct to DataFrame columns
    .unnest("value_struct")
)

While a `pl.List` array can have a variable number of elements, a `pl.Struct` has a fixed number of fields on each row. See the API docs for strategies to deal with a variable number of elements: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.arr.to_struct.html

## Convert a `pl.List` column to a NumPy array
A `pl.List` column is a natural way to hold array data that we may need in NumPy

In [None]:
df = (
    pl.DataFrame(
        {
            'embeddings':[ 
                [0.0,1], 
                [2,3],
                [4,5]
            ],
        }
    )
)
df

To get the `embeddings` column as a NumPy array, we `explode` the column and then reshape the result in NumPy. The reshape assumes that every list has the same length.

In [None]:
(
    df["embeddings"]
    .arr.explode()
    .to_numpy()
    .reshape(len(df), -1)
)

## Exercises
In the exercises you will develop your understanding of:
- selecting list columns
- exploding list columns
- counting occurrences in a list column

### Exercise 1
We create a `DataFrame` with `pl.List` columns

In [None]:
dfLists = pl.DataFrame({
    'ints':[ 
        [0,1], 
        [2,3]
    ],
    'floats':[ 
        [0.0,1], 
        [2,3]
    ],
    'strings':[ 
        ["0","1"],
        ["2","3"]
    ]
})
dfLists

Select the floating point list column from `dfLists`

In [None]:
(
    dfLists
    <blank>
)

Select the floating point **and** integer list columns from `dfLists`

In [None]:
(
    dfLists
    <blank>
)

### Exercise 2
We create a `pl.List` column from the Titanic dataset by splitting the `Name` column on each space

In [None]:
csvFile = "../data/titanic.csv"
df = (
    pl.read_csv(csvFile)
    .select(
        [
            "PassengerId",
            "Pclass",
            "Name",
            pl.col("Name").str.split(" ").alias("Name_list")
        ]
    )
)
df.head(2)

Expand the `Name_list` column into separate rows

In [None]:
(
    df
    <blank>
    .head()
)

Remove the rows that hold titles such as `Mr.` and `Mrs.` from the output. Hint: use `~` to negate a filter

Find the most common names using `value_counts` on the `Name_list` column (we cover `value_counts` in more detail in the next Section)

## Solutions

### Solution to exercise 1
We create a `DataFrame` with `pl.List` columns

In [None]:
dfLists = pl.DataFrame({
    'ints':[ 
        [0,1], 
        [2,3]
    ],
    'floats':[ 
        [0.0,1], 
        [2,3]
    ],
    'strings':[ 
        ["0","1"],
        ["2","3"]
    ]
})
dfLists

Select the floating point list column from `dfLists`

In [None]:
(
    dfLists
    .select(
        pl.col(pl.List(pl.Float64))
    )
)

Select the floating point **and** integer list columns from `dfLists`

In [None]:
(
    dfLists
    .select(
        pl.col([pl.List(pl.Float64),pl.List(pl.Int64)])
    )
)

### Solution to exercise 2
We create a `pl.List` column from the Titanic dataset by splitting the `Name` column on each space

In [None]:
csvFile = "../data/titanic.csv"
df = (
    pl.read_csv(csvFile)
    .select(
        [
            "PassengerId",
            "Pclass",
            "Name",
            pl.col("Name").str.split(" ").alias("Name_list")
        ]
    )
)
df.head(2)

Expand the `Name_list` column into separate rows

In [None]:
(
    df
    .explode("Name_list")
    .head()
)

Remove the rows that hold titles such as `Mr.` and `Mrs.` from the output

In [None]:
(
    df
    .explode("Name_list")
    .filter(~pl.col("Name_list").is_in(["Mr.","Mrs.","Miss.","Master."]))
    .head()
)

Find the most common names using `value_counts` on the `Name_list` column (we cover `value_counts` in more detail in the next Section)

In [None]:
(
    df
    .explode("Name_list")
    .filter(~pl.col("Name_list").is_in(["Mr.","Mrs.","Miss.","Master."]))
    ["Name_list"]
    .value_counts()
    .sort("counts",descending=True)
    .head()
)