# Value counts
By the end of this lecture you will be able to:
- count occurrences in a column with `value_counts`
- create a bar chart of the outputs
- use `value_counts` in an expression
- use `value_counts` in lazy mode

In [None]:
import polars as pl
import plotly.express as px

In [None]:
csvFile = '../data/titanic.csv'

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Count occurrences in a `Series`
We use `value_counts` to count the occurrences of each unique value in a `Series`

In [None]:
df['Pclass'].value_counts()

> In Pandas the output of this operation is a `Series` but in Polars the output is a `DataFrame` with one column for the categories and one for the counts.

The row order is not guaranteed and can vary between runs unless you pass `sort=True`, which sorts the output by count from highest to lowest

In [None]:
df['Pclass'].value_counts(sort=True)

We can also sort by the category values instead by calling the `sort` method on the output

In [None]:
df['Pclass'].value_counts().sort("Pclass")

## Value counts as an expression
We can call `value_counts` in an expression

In [None]:
(
    df
    .select(
        pl.col("Pclass").value_counts()
    )
)

The output is a one-column `DataFrame` with a `pl.Struct` column.

We can get the output as a two-column `DataFrame` by selecting the struct column as a `Series` and calling `.struct.unnest` on it

In [None]:
(
    df
    .select(
        pl.col("Pclass").value_counts()
    )
    ["Pclass"]
    .struct.unnest()
)

## Value counts in lazy mode
To calculate value counts in lazy mode we call `value_counts` as an expression on a `LazyFrame`.

As the output of the `value_counts` expression is a `struct` dtype we then:
- trigger evaluation of the `LazyFrame`
- transform the `struct` column to a `DataFrame`

In [None]:
(
    pl.scan_csv(csvFile)
    .select(
        pl.col("Pclass").value_counts()
    )
    .collect()
    ["Pclass"]
    .struct.unnest()
)

Polars detects that only the `Pclass` column needs to be read from the CSV in lazy mode.

In [None]:
print(
    pl.scan_csv(csvFile)
    .select(
        pl.col("Pclass").value_counts()
    )
    .explain()
)

We see this from `PROJECT 1/12 COLUMNS` in the optimised query plan.

## Exercises

In the exercises you will develop your understanding of:
- calculating value counts
- calculating percentages
- visualising the outputs
- doing `value_counts` in lazy mode

### Exercise 1 - value counts
Calculate the value counts on the `Survived` column as a `Series`. 

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Sort the output by the counts from highest to lowest

Calculate the value counts on the `Survived` column as an expression 

Calculate the value counts on the `Survived` column as an expression and convert the `pl.Struct` column to a `DataFrame`

### Exercise 2 - value counts as a percentage
As in the first part of Exercise 1, calculate the value counts on the `Survived` column as a `Series`

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Add an additional column with the percentage of passengers in each category (divide the `counts` column by the sum of the `counts` column).

Express the percentages as values ranging from 0 to 100.

Visualise the percentage values for each category in a bar chart

### Exercise 3

Construct the query that produces the following optimised query plan
```
   SELECT [col("Age").round().value_counts()] FROM
    CSV SCAN ../data/titanic.csv
    PROJECT 1/12 COLUMNS
```


In [None]:
dfLazy = (
     <blank>
)


print(dfLazy.explain())

## Solutions

### Solution to Exercise 1

Calculate the value counts on the `Survived` column as a `Series`

In [None]:
(
    pl.read_csv(csvFile)
    ['Survived']
    .value_counts()
)

Sort by the counts from highest to lowest

In [None]:
(
    pl.read_csv(csvFile)
    ['Survived']
    .value_counts(sort=True)
)

Calculate the value counts on the `Survived` column as an expression

In [None]:
(
    pl.read_csv(csvFile)
    .select(
        pl.col("Survived").value_counts()
    )
)

Calculate the value counts on the `Survived` column as an expression and convert the `pl.Struct` column to a `DataFrame`

In [None]:
(
    pl.read_csv(csvFile)
    .select(
        pl.col("Survived").value_counts()
    )
    ["Survived"]
    .struct.unnest()
)

### Solution to Exercise 2
As in the first part of Exercise 1, calculate the value counts on the `Survived` column as a `Series`

In [None]:
(
    pl.read_csv(csvFile)
    ['Survived']
    .value_counts(sort=True)
)

Add an additional column with the percentage of passengers in each category (divide the `counts` column by the sum of the `counts` column).

In [None]:
(
    pl.read_csv(csvFile)
    ['Survived']
    .value_counts(sort=True)
    .with_columns(
        (pl.col("counts")/pl.col("counts").sum()).alias("percent")
    )
)

Express the percentages as values ranging from 0 to 100.

In [None]:
(
    pl.read_csv(csvFile)
    ['Survived']
    .value_counts(sort=True)
    .with_columns(
        (100*(pl.col("counts")/pl.col("counts").sum())).alias("percent")
    )
)

Visualise the outputs as a bar chart

In [None]:
survivedCounts = (
    pl.read_csv(csvFile)
    ['Survived']
    .value_counts(sort=True)
    # Cast to string so the chart treats the values as categories
    .with_columns(
        pl.col("Survived").cast(pl.Utf8)
    )
    # Percentage of passengers in each category, from 0 to 100
    .with_columns(
        (100*(pl.col("counts")/pl.col("counts").sum())).alias("percent")
    )
)
px.bar(x=survivedCounts["Survived"], y=survivedCounts["percent"])

### Solution to Exercise 3
Construct the query that produces the following optimised query plan
```
   SELECT [col("Age").round().value_counts()] FROM
    CSV SCAN ../data/titanic.csv
    PROJECT 1/12 COLUMNS
```


In [None]:
dfLazy = (
    pl.scan_csv(csvFile)
    .select(
        pl.col("Age").round(0).value_counts()
    )
)
print(dfLazy.explain())