# Reading data

In this exercise we will cover how to use polars to read data from external data sources. To perform our analysis, we will need several different data sets:

1. Vessel Verbose (`/vesselverbose`): <https://www.wsdot.wa.gov/ferries/api/vessels/rest/help>
2. Vessel History (`/vesselhistory/{VesselName}/{DateStart}/{DateEnd}`): <https://www.wsdot.wa.gov/ferries/api/vessels/rest/help>

All data sets are hosted on <https://wsdot.wa.gov/traffic/api/>.

## Task 1 - read data

### 🔄 Task

- Download the **Vessel Verbose** data
- Convert the data into a polars dataframe

### 🧑‍💻 Code

The State of Washington data portal makes data available over an API. The API has many features; you can read more about how to use it here: <https://wsdot.wa.gov/traffic/api/>.

To download the data, many people's first instinct is to either:

- Click through the site in a web browser.
- Use the curl command in the terminal.

```bash
WSDOT_ACCESS_CODE='xxxx-xxxx-xxxx-xxxx-xxxx'
curl "https://www.wsdot.wa.gov/Ferries/API/Vessels/rest/vesselverbose?apiaccesscode=${WSDOT_ACCESS_CODE}"
```

There is a better way though! Using httpx we can download the data as JSON and convert it into Python objects (here, a list of dictionaries). Then we use polars to create a DataFrame directly from that list. First, let's download the data using httpx.

In [None]:
import os
from pathlib import Path

import httpx
from dotenv import load_dotenv

base_url = "https://www.wsdot.wa.gov/Ferries/API/Vessels/rest"
path = "vesselverbose"

# Load environment variables (including the API key) from a .env file, if present.
if Path(".env").exists():
    load_dotenv()

# Define our params in a dictionary.
params = {"apiaccesscode": os.environ["WSDOT_ACCESS_CODE"]}

with httpx.Client(base_url=base_url, params=params) as client:
    response = client.get(path)

response

The `Response` object from httpx has several methods and attributes we can use to get more information about the request and the response.

In [None]:
# The URL that was used to make the request.
response.url

In [None]:
# The status of the response
response.status_code
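
Before parsing the body, it is worth confirming that the request actually succeeded. The following is a minimal sketch and not part of the original notebook: `ensure_success` is a hypothetical helper that takes a plain integer status code, so it runs without network access. With a real httpx response, `response.raise_for_status()` provides a built-in check.

```python
from http import HTTPStatus


def ensure_success(status_code: int) -> int:
    """Return the status code if it is 2xx, otherwise raise (hypothetical helper)."""
    if not 200 <= status_code < 300:
        # HTTPStatus gives a readable phrase such as "Not Found" for known codes.
        try:
            phrase = HTTPStatus(status_code).phrase
        except ValueError:
            phrase = "Unknown status"
        raise RuntimeError(f"Request failed: {status_code} {phrase}")
    return status_code


# With a real httpx response, pass response.status_code here,
# or call response.raise_for_status() instead.
ensure_success(200)
```

Failing early like this avoids confusing errors later, for example when an error page is passed to `response.json()`.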

In [None]:
# Use the pprint function from rich for nicer formatting of the dictionary data.
from rich.pretty import pprint

# The JSON data parsed into Python objects: a list with one dictionary per vessel.
print(f"{len(response.json())=}")
pprint(response.json()[0])
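
`response.json()` returns a list here, with one dictionary per vessel, which is why the cell above indexes it with `[0]`. The sketch below illustrates that shape using only the standard-library `json` module. The payload is made up for illustration: `VesselName` appears elsewhere in this notebook, while `VesselID` and the values are assumptions.

```python
import json

# Made-up payload that mimics the shape of the API response:
# a JSON array containing one object per vessel.
payload = """
[
    {"VesselID": 1, "VesselName": "Example A"},
    {"VesselID": 2, "VesselName": "Example B"}
]
"""

# A JSON array becomes a Python list; each JSON object becomes a dict.
records = json.loads(payload)

print(type(records).__name__, len(records))
print(records[0]["VesselName"])
```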

Lastly, we can use polars to convert the list of dictionaries into a DataFrame.


In [None]:
import polars as pl

vessel_verbose_raw = pl.DataFrame(response.json())
vessel_verbose_raw

## Task 2 - write data to pin

### 🔄 Task

- Save `vessel_verbose_raw` to a Pin on Posit Connect.
- This way, we do not need to hit the API every time we need to interact with the raw data.

### 🧑‍💻 Code

In [None]:
import pins

# Get the API key and server URL from environment variables.
if Path(".env").exists():
    load_dotenv()

connect_server = os.environ["CONNECT_SERVER"]
connect_api_key = os.environ["CONNECT_API_KEY"]

board = pins.board_connect(
    server_url=connect_server,
    api_key=connect_api_key
)

board


In [None]:
# Update the username with your Posit Connect username.
username = "sam.edwardes"

# Upload the data to Connect. At this time Pins only has support for Pandas, so
# we need to convert the Polars DataFrame to a Pandas DataFrame.
board.pin_write(
    vessel_verbose_raw.to_pandas(),
    f"{username}/vessel_verbose_raw",
    type="parquet"
)

To reuse this data in future code we can use `board.pin_download` or `board.pin_read`.

In [None]:
paths = board.pin_download(f"{username}/vessel_verbose_raw")
paths

In [None]:
pl.read_parquet(paths)

## Task 3 - get other data sets

### 🔄 Task

- Get the other required datasets.

### 🧑‍💻 Code

In [None]:
import datetime

base_url = "https://www.wsdot.wa.gov/Ferries/API/Vessels/rest"
params = {"apiaccesscode": os.environ["WSDOT_ACCESS_CODE"]}

# Get all of the vessel names
with httpx.Client(base_url=base_url, params=params) as client:
    response = client.get("vesselverbose")

vessel_names = [i['VesselName'] for i in response.json()]

# For each vessel, get all of the history from the desired date range
start_date = datetime.date(2024, 5, 1)
end_date = datetime.date.today()

vessel_history_json = []
# Reuse a single client (and its connection pool) for every request.
with httpx.Client(base_url=base_url, params=params) as client:
    for vessel_name in vessel_names:
        print(f"Getting vessel history for {vessel_name}...")
        response = client.get(f"vesselhistory/{vessel_name}/{start_date}/{end_date}")
        # Parse the JSON body once and reuse the result.
        records = response.json()
        print(f"\t{len(records)} records retrieved for {vessel_name}.")
        vessel_history_json.extend(records)
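
The request path above is built by interpolating the vessel name and the two dates into the `vesselhistory/{VesselName}/{DateStart}/{DateEnd}` template. Formatting a `datetime.date` in an f-string yields its ISO form (`YYYY-MM-DD`), and a vessel name may contain spaces. The sketch below makes both steps explicit using only the standard library; `build_history_path` is a hypothetical helper, not part of the WSDOT API or httpx, and the vessel name is just an example input.

```python
import datetime
from urllib.parse import quote


def build_history_path(
    vessel_name: str,
    start: datetime.date,
    end: datetime.date,
) -> str:
    """Build the relative vessel history path (hypothetical helper)."""
    # Percent-encode the name so spaces and slashes cannot break the path.
    safe_name = quote(vessel_name, safe="")
    # isoformat() gives YYYY-MM-DD, the same text an f-string produces.
    return f"vesselhistory/{safe_name}/{start.isoformat()}/{end.isoformat()}"


path = build_history_path(
    "Walla Walla",
    datetime.date(2024, 5, 1),
    datetime.date(2024, 5, 31),
)
print(path)
```

The resulting string could be passed to `client.get(...)` in place of the inline f-string.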

In [None]:
vessel_history_raw = pl.DataFrame(vessel_history_json)
vessel_history_raw

In [None]:
board.pin_write(
    vessel_history_raw.to_pandas(),
    f"{username}/vessel_history_raw",
    type="parquet"
)

## Task 4 - Publish the solution notebook to Connect

### 🔄 Task

- Publish the solution notebook to Posit Connect.
- Share the notebook with the rest of the workshop.
- Schedule the notebook to run once every week.

### 🧑‍💻 Code