## Missing values
By the end of this section we will have learned how to:
- identify missing values in a `DataFrame`
- count the number of missing values in a column
- filter for `null` or non-`null` values

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

## Missing values in Polars
Missing values in Polars are represented with a `null` value for all dtypes. We can create them manually with the `None` value.

In [None]:
df = pl.DataFrame(
    {
        'col1':[0,None,2],
        'col2':[None,None,5],
        'col3':[1,2,3]
    }
)
df

> In Pandas, a missing value can be represented with a `null`, `NaN` or `None` value, depending on the dtype of the column. Polars also allows `NaN` values in floating point columns, as we will see.

## Metadata on `null` values
Polars stores metadata about `null` values for each column in a `DataFrame`.

### Null count
Polars stores a count of how many `null` values there are in each column. We can access this with the `null_count` method on a single column or on all the columns.

In [None]:
df.null_count()

Polars keeps track of the `null_count` at all times so this is a cheap operation regardless of the size of the column.

### Validity bitmap

As well as storing the data for each column Polars stores a separate array that indicates if each value in that column has valid data or is `null`. This array is called the validity bitmap as the value is 1 when valid data is present in that position and 0 when it is not.

If there are no `null` values in a column, the validity bitmap is not present. We can check if the validity bitmap is present with `has_validity`.

In [None]:
df["col1"].has_validity()

In [None]:
df["col3"].has_validity()

So `col1` does have `null` values and `col3` doesn't.

Using `has_validity` is the fastest way to check if a column or `Series` has any `null` values.

## Finding `null` values

We use the `is_null` expression to find out whether each value is `null` and `is_not_null` for the converse.

In [None]:
(
    df
    .select(
        [
            pl.col("col1"),
            pl.col("col1").is_null().alias("is_null"),
            pl.col("col1").is_not_null().alias("is_not_null")
        ]
    )
)

## Filtering by `null` values
We can use these methods to filter by `null` or non-`null` values.

In [None]:
(
    df
    .filter(
        pl.col("col1").is_not_null()
    )
)

## Exercises
In the exercises you will develop your understanding of:
- counting the `null` values
- filtering by `null` values

### Exercise 1
Count the number of `null` values in each column of the Titanic data

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    <blank>
)

Filter out the rows where the `Cabin` column is `null` and count the `null` values for all columns again

### Exercise 2
Find all the rows for which the `Age` is `null`

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    <blank>
)

Find all the rows for which neither the `Age` nor the `Cabin` is `null`

## Solutions
### Solution to Exercise 1
Count the number of `null` values in each column of the Titanic data

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .null_count()
)

Filter out the rows where the `Cabin` column is `null` and count the `null` values for all columns again

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .filter(pl.col("Cabin").is_not_null())
    .null_count()
)

### Solution to Exercise 2
Find all the rows for which the `Age` is `null`

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .filter(pl.col("Age").is_null())
    .head()
)

Find all the rows for which neither the `Age` nor the `Cabin` is `null`

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .filter(
        (pl.col("Age").is_not_null()) & (pl.col("Cabin").is_not_null())
    )
    .head()
)