## Pivoting and melting
By the end of this lecture you will be able to:
- make a `DataFrame` wide with `pivot`
- make a `DataFrame` long with `melt`

In [None]:
import polars as pl
import numpy as np

In [None]:
csvFile = "../data/titanic.csv"

## Pivot
We start with a simple example of some data on sales of bikes

In [None]:
sales_data = pl.DataFrame({
    'date': ['2022-01-01', '2022-01-02', '2022-01-01', '2022-01-02','2022-01-03'],
    'region': ['East', 'West', 'East', 'West','West'],
    'bike_type': ['Mountain', 'Mountain', 'Road', 'Road','Mountain'],
    'sales': [100, 200, 300, 400,500]
})
sales_data

We want to pivot the data so that we have the sales broken down by bike type, with a row for each date and a column for each bike type

In [None]:
(
    sales_data
    .pivot(
        index='date', 
        columns='bike_type', 
        values='sales'
    )
)

When we use `pivot` we turn a `DataFrame` from long to wide format. Where there is no corresponding value in the original `DataFrame` Polars inserts a `null` value.

We can also create new columns from the data in multiple original columns by passing a list of column names to the `columns` argument

In [None]:
(
    sales_data
    .pivot(
        index='date', 
        columns=['region','bike_type'], 
        values='sales'
    )
)

We need to bear in mind that with multiple columns in `columns` we can no longer do horizontal aggregations across the whole row without double counting.

### Pivots and aggregation
When there are multiple values in the original `DataFrame` that correspond to a position in the pivoted `DataFrame` then Polars must aggregate them.

We tell Polars how to aggregate them using the `aggregate_function` argument. We demonstrate this by pivoting by `region`.

In our original `DataFrame` some combinations of region and date have more than one value: `East` has two sales values on 2022-01-01 and `West` has two sales values on 2022-01-02.

In [None]:
sales_data

We specify the aggregation by passing one of the following function names as a string: `first`, `sum`, `max`, `min`, `mean`, `median`, `last`, `count`.

If an aggregation is required but no aggregation function is specified then Polars will raise an exception.

In this case we pivot and take the `mean` where there are multiple values

In [None]:
(
    sales_data
    .pivot(
        index='date', 
        columns='region', 
        values='sales',
        aggregate_function="mean"
    )
)

The pivoted columns appear in the order in which Polars first encounters their values in the original column - so in this case there was an `East` entry before a `West` entry.

For example if we reverse the `DataFrame` with `.reverse` we get `West` before `East` in the columns

In [None]:
(
    sales_data
    .reverse()
    .pivot(
        index='date', 
        columns='region', 
        values='sales',
        aggregate_function="mean"
    )
)

We can ensure the columns are ordered with the `sort_columns` argument

In [None]:
(
    sales_data
    .reverse()
    .pivot(
        index='date', 
        columns='region', 
        values='sales',
        aggregate_function="mean",
        sort_columns=True
    )
)

One final point on aggregation: `pivot` is quite similar to a `groupby`. Internally, `pivot` runs the parallel `groupby` on the column(s) in the `index` argument and the column(s) in `columns` before reshaping the output.

### Pivot in lazy mode?
In lazy mode Polars must know the schema (column names and dtypes) at each stage of a query plan. However, after a `pivot` the column names cannot be known in advance as they are dependent on the data. As such, `pivot` is not available in lazy mode.

If you have a lazy query but want to do a `pivot` then you can either:
- `collect` your query, do the `pivot` and then call `lazy` to resume in lazy mode
- try to use `groupby` instead

## Melting from wide to long
To convert a `DataFrame` from wide to long we use the `melt` method. This is a common task when transforming data for visualisation libraries as we see in the exercises.

We begin this example with a wide `DataFrame` we get from calling `pivot` on `sales_data`

In [None]:
sales_pv = (
    sales_data
    .pivot(
        index='date', 
        columns='bike_type', 
        values='sales',
        aggregate_function="mean"
    )
)
sales_pv

We melt the `DataFrame` by specifying:
- which metadata column(s) will identify the data on each row in `id_vars`
- which columns will provide the values in `value_vars`

In [None]:
(
    sales_pv
    .melt(
        id_vars="date",
        value_vars=["Mountain","Road"]
    )
)

The column names in `value_vars` become the data in the `variable` column.

If we want to use all columns not specified in the `id_vars` as `value_vars` we can just omit the `value_vars` argument

In [None]:
(
    sales_pv
    .melt(
        id_vars="date",
    )
)

We can optionally specify different names for the `variable` and `value` columns with the `variable_name` and `value_name` arguments.

### Melt in lazy mode?
We can use `melt` in lazy mode as the new column names (`variable` and `value`) along with their dtypes are known in advance.

### Stacking and unstacking?
In Pandas there are also methods called `stack` and `unstack`. These work like `melt` and `pivot` except that they convert to/from the index of a Pandas `DataFrame` instead of columns. As Polars doesn't have an index we only need `melt` and `pivot`.


## Exercises
In the exercises you will develop your understanding of:
- converting a `DataFrame` to wide format with `pivot`
- converting a `DataFrame` to long format for visualisation with `melt`

### Exercise 1
For this exercise we use a dataset of bike sales in different countries.

We derive a `year` column from the `date` - see the lecture on Extracting datetime components in the Time Series section for more on this.

In [None]:
sales_df = (
    pl.read_parquet("../data/bike_sales.parquet")
    .with_columns(
        pl.col("date").dt.year().alias("year")
    )
)
sales_df.head(3)

Pivot the data to have a year on each row and a column for each `sub category`. Aggregate by getting the sum of the `order quantity`. Ensure the years are in ascending order

In [None]:
(
    sales_df
    <blank>
)

We want to visualise this data as a time series with Plotly so melt the pivoted `DataFrame` and assign it to `annual_sales_df`

In [None]:
annual_sales_df = (
    sales_df
    <blank>
)

We can now plot the output using `px.line` in Plotly (feel free to do this with your preferred visualisation library). If you haven't come across Plotly before see the lecture in the Visualisation section.

In [None]:
import plotly.express as px
px.line(
    <blank>
)

### Exercise 2
In this exercise we want to identify which words are present in a set of texts. This is a common task in natural language processing, often carried out using the `CountVectorizer` in Scikit-learn.

We begin by defining our `fake_news_df`

In [None]:
fake_news_df = pl.DataFrame({
    'publication': ['The Daily Deception', 'Faux News Network', 'The Fabricator', 'The Misleader', 
                     'The Hoax Herald', ],
    'date': ['2022-01-01', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', 
             ],
    'title': ['Scientists Discover New Species of Flying Elephant', 
              'Aliens Land on Earth and Offer to Solve All Our Problems', 
              'Study Shows That Eating Pizza Every Day Leads to Longer Life', 
              'New Study Finds That Smoking is Good for You', 
              "World's Largest Iceberg Discovered in Florida"],
    'text': ['In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.',
             'In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.',
             'A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn\'t. The study has been hailed as a breakthrough in the field of nutrition.',
             'In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn\'t. The findings have sparked controversy among health experts.',
             'In a bizarre turn of events, the world\'s largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have']
})
fake_news_df

Begin by:
- converting the text to lowercase and splitting the text by whitespace
- adding a new column called `placeholder` with 1 as a placeholder value

In [None]:
(
    fake_news_df
    .with_columns(
        <blank>
    )
)

Explode the lists in the `text` column

Pivot the output so that the article metadata is preserved on each row and the remainder of the columns indicate if the column name is present in the text of that article. Ensure the column names are sorted

Replace the `null` values with 0

## Solutions

### Solution to exercise 1

In [None]:
sales_df = (
    pl.read_parquet("../data/bike_sales.parquet")
    .with_columns(
        pl.col("date").dt.year().alias("year")
    )
)
sales_df.head(3)

Pivot the data to have a year on each row and a column for each `sub category` of bike. Aggregate by getting the sum of the `order quantity`. Ensure the years are in ascending order

In [None]:
(
    sales_df
    .pivot(
        index="year",
        columns="sub category",
        values="order quantity",
        aggregate_function="sum",
    )
    .sort("year")
)

We want to visualise this data as a time series with Plotly so melt the pivoted `DataFrame` and assign it to `annual_sales_df`

In [None]:
annual_sales_df = (
    sales_df
    .pivot(
        index="year",
        columns="sub category",
        values="order quantity",
        aggregate_function="sum"
    )
    .sort("year")
    .melt(
        id_vars="year",
    )
)

Plot the output using `px.line`

In [None]:
import plotly.express as px
px.line(
    x=annual_sales_df["year"],
    y=annual_sales_df["value"],
    color=annual_sales_df["variable"]
)

### Solution to exercise 2
In this exercise we want to identify which words are present in a set of texts. This is a common task in natural language processing, often carried out using the `CountVectorizer` in Scikit-learn.

We begin by defining our `fake_news_df`

In [None]:
fake_news_df = pl.DataFrame({
    'publication': ['The Daily Deception', 'Faux News Network', 'The Fabricator', 'The Misleader', 
                     'The Hoax Herald', ],
    'date': ['2022-01-01', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', 
             ],
    'title': ['Scientists Discover New Species of Flying Elephant', 
              'Aliens Land on Earth and Offer to Solve All Our Problems', 
              'Study Shows That Eating Pizza Every Day Leads to Longer Life', 
              'New Study Finds That Smoking is Good for You', 
              "World's Largest Iceberg Discovered in Florida"],
    'text': ['In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.',
             'In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.',
             'A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn\'t. The study has been hailed as a breakthrough in the field of nutrition.',
             'In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn\'t. The findings have sparked controversy among health experts.',
             'In a bizarre turn of events, the world\'s largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have']
})
fake_news_df

Begin by:
- converting the text to lowercase and splitting the text by whitespace
- adding a new column called `placeholder` with 1 as a placeholder value

In [None]:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
)

Explode the lists in the `text` column

In [None]:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
)

Pivot the output so that the article metadata is preserved on each row and the remainder of the columns indicate if the column name is present in the text of that article. Ensure the column names are sorted

In [None]:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    .pivot(
        index=["publication","date","title"],
        columns="text",
        values="placeholder",
        sort_columns=True
    )
)

Replace the `null` values with 0

In [None]:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    .pivot(
        index=["publication","date","title"],
        columns="text",
        values="placeholder",
        sort_columns=True
    )
    .fill_null(value=0)
)

If we wanted to split strings with a slightly more sophisticated pattern we could use the following regex (used by CountVectorizer in scikit-learn) and `str.extract_all`

In [None]:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.extract_all('(?u)\\b\\w\\w+\\b'),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    .pivot(
        index=["publication","date","title"],
        columns="text",
        values="placeholder",
        sort_columns=True
    )
    .fill_null(value=0)
)