<div style="text-align: right;">
  <img src="https://raw.githubusercontent.com/exasol/ai-lab/refs/heads/main/assets/Exasol_Logo_2025_Dark.svg" style="width:200px; margin: 10px;" />
</div>

# Working with Exasol using the Ibis dataframe library

In this notebook, we will show some basic operations on Exasol data using Ibis. You can find more detailed information on using Ibis on the official [Ibis project](https://ibis-project.org/) website.

The notebook is organized as a quickstart tutorial in which we will be looking at US flight delays. In particular, we will explore the delay caused by the carrier. We will rank the carriers using the delay as the performance metric. The data is publicly accessible at the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/Homepage.asp) of the US Department of Transportation.

## Prerequisites

Before using this notebook, complete the following steps:
1. [Configure the AI Lab](../main_config.ipynb).
2. [Load the US Flights data](../data/data_flights.ipynb).

## Setup

### Open Secure Configuration Storage

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

## connect

Let's connect to the Exasol database. For convenience, we will use a wrapper around the [ibis.exasol.connect](https://ibis-project.org/backends/exasol) command.

In [None]:
from exasol.nb_connector.connections import open_ibis_connection

conn = open_ibis_connection(ai_lab_config, compression=True)

## table

We will start by creating a `table` object for the table with the flight delay data.

In [None]:
flights = conn.table('US_FLIGHTS')

Let's have a look at the content of this table.

In [None]:
flights.head().to_pandas()

## filter

Should we compute the statistics on all records in the table?
What about canceled or diverted flights? How should we account for them? Let's see what information we've got for such unfortunate flights.

In [None]:
flights.filter(flights.CANCELLED).head().to_pandas()

In [None]:
flights.filter(flights.DIVERTED).head().to_pandas()

These flights carry no delay information, so let's exclude them.

## group_by and aggregate

Let's compute some statistics on the delay for each carrier.
We can chain together the `filter`, `group_by`, and `aggregate` operators.

In [None]:
delay_by_carrier = (
    flights
    .filter((flights.CANCELLED | flights.DIVERTED).negate())  # drop cancelled and diverted flights
    .group_by('OP_CARRIER_AIRLINE_ID')
    .aggregate(
        flights.CARRIER_DELAY.sum().name('combined_delay'),
        flights.CARRIER_DELAY.count().name('total_delayed'),  # count() skips NULL delays
        flights.OP_CARRIER_AIRLINE_ID.count().name('total_flights'),
    )
)
delay_by_carrier.head().to_pandas()
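
For readers who think in SQL, the chained `filter`, `group_by`, and `aggregate` expression corresponds roughly to a `WHERE` / `GROUP BY` query. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module and a handful of made-up rows rather than the Exasol table, and the SQL that Ibis actually generates for Exasol may look different. It also shows why `total_delayed` can be smaller than `total_flights`: counting a column skips `NULL` values.

```python
import sqlite3

# Hypothetical sample rows, not the real US_FLIGHTS data:
# (OP_CARRIER_AIRLINE_ID, CANCELLED, DIVERTED, CARRIER_DELAY)
rows = [
    (1, 0, 0, 30),
    (1, 0, 0, None),   # no carrier delay recorded
    (1, 1, 0, None),   # cancelled: excluded by the filter
    (2, 0, 0, 10),
    (2, 0, 0, 20),
    (2, 0, 1, None),   # diverted: excluded by the filter
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE US_FLIGHTS ("
    "OP_CARRIER_AIRLINE_ID INTEGER, CANCELLED INTEGER, "
    "DIVERTED INTEGER, CARRIER_DELAY INTEGER)"
)
con.executemany("INSERT INTO US_FLIGHTS VALUES (?, ?, ?, ?)", rows)

# Rough SQL counterpart of filter -> group_by -> aggregate.
# ORDER BY is added here only to make the printed result deterministic.
query = """
SELECT OP_CARRIER_AIRLINE_ID,
       SUM(CARRIER_DELAY)           AS combined_delay,
       COUNT(CARRIER_DELAY)         AS total_delayed,
       COUNT(OP_CARRIER_AIRLINE_ID) AS total_flights
FROM US_FLIGHTS
WHERE NOT (CANCELLED OR DIVERTED)
GROUP BY OP_CARRIER_AIRLINE_ID
ORDER BY OP_CARRIER_AIRLINE_ID
"""
result = con.execute(query).fetchall()
print(result)  # [(1, 30, 1, 2), (2, 30, 2, 2)]
```

Carrier 1 has two remaining flights but only one recorded delay, so its `total_delayed` is 1 while its `total_flights` is 2.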

## mutate

Let's add two new columns to the statistics computed in the previous step: the percentage of flights that have been delayed and the average delay per flight.

In [None]:
delay_by_carrier = delay_by_carrier.mutate(
    percent_delayed=100 * delay_by_carrier.total_delayed / delay_by_carrier.total_flights,
    delay_per_flight=delay_by_carrier.combined_delay / delay_by_carrier.total_flights
)
delay_by_carrier.head().to_pandas()
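
To make the two derived columns concrete, here is the same arithmetic in plain Python for a single, made-up carrier. The function name `delay_metrics` and the numbers are illustrative assumptions, not part of the notebook's data. Note that `delay_per_flight` divides by all flights, not only the delayed ones.

```python
def delay_metrics(combined_delay, total_delayed, total_flights):
    """Return (percent_delayed, delay_per_flight) for one carrier."""
    percent_delayed = 100 * total_delayed / total_flights
    delay_per_flight = combined_delay / total_flights
    return percent_delayed, delay_per_flight


# Hypothetical carrier: 200 flights, 50 of them delayed,
# with a combined carrier delay of 1500 (in the units of CARRIER_DELAY).
percent, per_flight = delay_metrics(1500, 50, 200)
print(percent, per_flight)  # 25.0 7.5
```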

## join

Now, let's join these statistics with the carrier names, which are stored in another table called `US_AIRLINES`.

In [None]:
airlines = conn.table('US_AIRLINES')
delay_by_carrier = delay_by_carrier.join(airlines, 'OP_CARRIER_AIRLINE_ID', how='inner')
delay_by_carrier.head().to_pandas()
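
An inner join keeps only the rows whose key appears in both tables, so a carrier with statistics but no entry in the name table drops out of the result. The plain-Python sketch below mimics that behaviour with dictionaries; the IDs, names, and values are hypothetical.

```python
# Hypothetical delay_per_flight values keyed by OP_CARRIER_AIRLINE_ID.
stats = {1: 7.5, 2: 3.0, 3: 12.0}

# Hypothetical name lookup; carrier 3 has no entry, carrier 4 has no statistics.
names = {1: "Carrier A", 2: "Carrier B", 4: "Carrier D"}

# Inner join: keep only IDs present on both sides.
joined = [
    (carrier_id, names[carrier_id], delay)
    for carrier_id, delay in stats.items()
    if carrier_id in names
]
print(joined)  # [(1, 'Carrier A', 7.5), (2, 'Carrier B', 3.0)]
```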

## order_by

Let's order the airlines from worst to best, that is, by descending average delay per flight.

In [None]:
delay_by_carrier = delay_by_carrier.order_by(delay_by_carrier.delay_per_flight.desc())
delay_by_carrier.head().to_pandas()

## select

Finally, we will select the columns to display and print out the 10 worst airlines.

In [None]:
delay_by_carrier = delay_by_carrier.select('CARRIER_NAME', 'percent_delayed', 'delay_per_flight')
delay_by_carrier.head(10).to_pandas()
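
As a quick check of what the ordering and `head` steps do together, the following plain-Python sketch sorts a few made-up rows by `delay_per_flight` in descending order and keeps the top entries. The rows are hypothetical and stand in for the joined result.

```python
# Hypothetical rows: (CARRIER_NAME, percent_delayed, delay_per_flight)
rows = [
    ("Carrier A", 25.0, 7.5),
    ("Carrier B", 10.0, 3.0),
    ("Carrier C", 40.0, 12.0),
]

# Sort by delay_per_flight, largest first, then keep the two worst carriers.
worst = sorted(rows, key=lambda row: row[2], reverse=True)[:2]
print(worst)  # [('Carrier C', 40.0, 12.0), ('Carrier A', 25.0, 7.5)]
```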