# Data exploration and validation

In this exercise we will cover how to use Polars and Pandera to explore, tidy, and validate the data.

## Task 1 - load data from Pin

### 🔄 Task

- Use `polars` to load the data from Posit Connect into a Polars dataframe.

### 🧑‍💻 Code

In [None]:
import os
from pathlib import Path

import polars as pl
from dotenv import load_dotenv
import pins

In [None]:
# Get the API key and server URL from an environment variable.
if Path(".env").exists():
    load_dotenv()

connect_server = os.environ["CONNECT_SERVER"]
connect_api_key = os.environ["CONNECT_API_KEY"]
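
If either variable is missing, `os.environ[...]` raises a bare `KeyError`. The sketch below shows one way to fail with a clearer message. `require_env` is a hypothetical helper written for this example (it is not part of `pins` or `python-dotenv`), and the server URL is a placeholder.

```python
import os
from collections.abc import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return an environment variable, or fail with a readable message.

    Hypothetical helper for illustration only.
    """
    value = environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Use a stand-in mapping so the example does not depend on your shell.
example_env = {"CONNECT_SERVER": "https://connect.example.com"}
server = require_env("CONNECT_SERVER", example_env)
```

In the notebook you would call `require_env("CONNECT_SERVER")` and `require_env("CONNECT_API_KEY")` after `load_dotenv()`.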

In [None]:
# Set up a pins board.
board = pins.board_connect(server_url=connect_server, api_key=connect_api_key)
board

In [None]:
# Update the username with your Posit Connect username.
username = "sam.edwardes"

Get the vessel verbose data set.

In [None]:
vessel_verbose_paths = board.pin_download(f"{username}/vessel_verbose_raw")
vessel_verbose_paths

In [None]:
vessel_verbose = pl.read_parquet(vessel_verbose_paths)
vessel_verbose

Get the vessel history data set.

In [None]:
vessel_history_paths = board.pin_download(f"{username}/vessel_history_raw")
vessel_history_paths

In [None]:
vessel_history = pl.read_parquet(vessel_history_paths)
vessel_history

## Task 2 - explore the data

### 🔄 Task

Begin exploring the data. You will want to understand:

- What columns exist in the data?
- How do the two data sets relate to one another?
- What is the type of each column (e.g. string, number, category, date)?
- Which columns could be useful for the model?
- What is the cardinality of categorical data?
- Is all of the data in scope?
- What steps will I need to perform to clean the data?

**Tips**

- Use VS Code's built-in data viewer to explore the data.
- If you are more comfortable with Pandas, you can convert the polars dataframe into a pandas dataframe (e.g. `df.to_pandas()`).
- The polars user guide has great docs on how to use polars: https://docs.pola.rs.

🚨 We are not performing feature engineering at this stage, but it is a good time to start thinking about what features you can create from the data.

> 💡 We are not using it in this workshop, but `ydata-profiling` (<https://github.com/ydataai/ydata-profiling>) is a good tool for exploring a new dataset.

### 🧑‍💻 Code

#### vessel_history

In [None]:
(
    vessel_history
    .head(3)
)

- The dates and times are not formatted correctly. We can fix this when we tidy the data.

#### vessel_verbose

In [None]:
(
    vessel_verbose
    .head(2)
)

How many different vessels are in the data?

In [None]:
# Print more rows.
pl.Config.set_tbl_rows(100)

In [None]:
(
    vessel_verbose
    .select(pl.col('VesselID'), pl.col('VesselName'))
)

In [None]:
# Count the distinct VesselIDs. If this equals the number of rows, each VesselID is unique.
(
    vessel_verbose
    .get_column('VesselID')
    .n_unique()
)

What are all of the numerical columns?

In [None]:

(
    vessel_verbose
    .select(pl.selectors.numeric())
    .head(5)
)

- Some of the date-based columns are integers or floats. During data tidying we could convert them into a proper date type.


What are all of the string columns?

In [None]:
(
    vessel_verbose
    .select(pl.selectors.string())
    .head(5)
)

- It looks like some missing values are represented with an empty string `""` while others have a `null` value. We may want to make this consistent when we tidy the data.
- Some string columns are measurements that should be converted into numeric types.

How much data is missing?

In [None]:
(
    vessel_verbose
    .null_count()
    .transpose(include_header=True)
    .rename({"column": "Column Name", "column_0": "Missing Rows"})
    .with_columns(((pl.col("Missing Rows") / vessel_verbose.shape[0]) * 100).round(1).alias('% Missing'))
    .sort("Missing Rows", descending=True)
)

What's in the `Class` column?

In [None]:
(
    vessel_verbose
    .get_column("Class")
    .head(2)
)

The `Class` column contains a `struct`: https://docs.pola.rs/user-guide/expressions/structs/

> Polars `Structs` are the idiomatic way of working with multiple columns. It is also a free operation i.e. moving columns into Structs does not copy any data!

Let's look more closely at the `Class` column for Cathlamet.

In [None]:
(
    vessel_verbose
    .filter(pl.col("VesselName") == "Cathlamet")
    .get_column("Class")
    .to_list()
)

The output is a list with a single dictionary: `to_list()` returns one dictionary per row, and the struct's fields become its keys. When we tidy this data we can make it easier to work with by unnesting the struct and moving its fields into their own columns.

## Task 3 - Tidy the Data

### 🔄 Task

Now that you have a basic understanding of the data, the next step is to tidy the data.

### 🧑‍💻 Code

#### vessel_history

In [None]:
vessel_history.head(2)

Convert the datetimes from strings to polars datetime objects. The logic is fairly complex, so we will wrap it in a function that we can apply to all of the required columns.

In [None]:
def convert_string_to_datetime(series: pl.Series) -> pl.Series:
    """
    Convert the datetime format from wadot into a datetime format that polars
    can understand.

    >>> convert_string_to_datetime(pl.Series(['/Date(1714547700000-0700)/']))
    shape: (1,)
    Series: '' [datetime[μs, UTC]]
    [
        2024-05-01 07:15:00 UTC
    ]
    """
    # Extract the Unix timestamp. The source value is in milliseconds, but
    # the "%s" format below expects the number of seconds since
    # 1970-01-01 00:00 UTC, so divide by 1_000.
    unix_timestamp = (
        (series.str.extract(r"/Date\((\d{13})[-+]").cast(pl.Int64) / 1_000)
        .cast(pl.Int64)
        .cast(pl.String)
    )
    # Extract the timezone.
    timezone = series.str.extract(r"([-+]\d{4})")
    # Create a new series that has the timestamp and timezone.
    clean_timestamp = unix_timestamp + timezone
    # Convert into a datetime.
    datetime_series = clean_timestamp.str.to_datetime("%s%z")
    return datetime_series


convert_string_to_datetime(pl.Series(['/Date(1714547700000-0700)/']))

In [None]:
vessel_history_clean = (
    vessel_history
    .with_columns(
        pl.col("ScheduledDepart", "ActualDepart", "EstArrival", "Date")
        .map_batches(convert_string_to_datetime)
    )
)

In [None]:
vessel_history_clean.head(5)

#### vessel_verbose

In [None]:
vessel_verbose.head(3)