<div style="text-align: right;">
  <img src="https://raw.githubusercontent.com/exasol/ai-lab/refs/heads/main/assets/Exasol_Logo_2025_Dark.svg" style="width:200px; margin: 10px;" />
</div>

# Working with Exasol Using the PyExasol Connector

In this notebook, we will show some basic operations on data using PyExasol. You can find more detailed information on using PyExasol in its [documentation](https://exasol.github.io/pyexasol/master/index.html).

The notebook is organized as a quickstart tutorial that shows how flexibly and easily PyExasol lets users import, export, and transform data, both in Python and in the Exasol database. To showcase this, we will use data on US flight delays. In particular, we will explore the delays caused by the carrier and rank the carriers using this delay as the performance metric. The data is publicly accessible at the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/Homepage.asp) of the US Department of Transportation.

## Prerequisites

Prior to using this notebook the following steps need to be completed:
1. [Configure the AI Lab](../main_config.ipynb).

Please note:
* AI Lab currently ships with PyExasol version `1.3.0`. The functions `import_from_parquet` and `export_to_parquet` were introduced in PyExasol version `1.2.0`, and polars support was first introduced in PyExasol version `1.0.0`. The parquet and polars functions should work with all supported database versions, because they convert the data into streamed CSV before executing the IMPORT or EXPORT statement.

## 1. Setup

### 1.1 Open Secure Configuration Storage (SCS)

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

### 1.2 Download Files to Local Filesystem

In [None]:
import requests

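def download_file(url, filename, timeout=30):
    """Stream the file at `url` to `filename`; print an error instead of raising."""
    try:
        # A timeout keeps the request from hanging indefinitely on a stalled connection.
        response = requests.get(url, stream=True, timeout=timeout)
        response.raise_for_status()

        # Write the response in chunks so large files are not held in memory.
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        print("File downloaded successfully!")

    except requests.exceptions.RequestException as e:
        print(f"Error downloading file: {str(e)}")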

#### 1.2.1 Download Parquet File from S3 Bucket & Inspect with Pyarrow

We download a prepared flights file from an AWS S3 Bucket to our local filesystem and then we load and inspect the data with pyarrow.

In [None]:
flights_parquet_file = "US_FLIGHTS_FEB_2024.parquet"

download_file(
    url="https://ai-lab-example-data-s3.s3.eu-central-1.amazonaws.com/first_steps/US_FLIGHTS_FEB_2024.parquet",
    filename=flights_parquet_file
)

Let's take a look at the contents of this file.

In [None]:
from pyarrow import dataset

parquet_table = dataset.dataset(flights_parquet_file).to_table()

parquet_table.schema

In [None]:
parquet_table.to_pandas().head()

#### 1.2.2 Download CSV File from Remote Filesystem & Inspect with Polars
This section demonstrates how to download a CSV file from a remote source.
The data is publicly accessible at the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/Homepage.asp) of the US Department of Transportation.

In [None]:
airlines_csv_file = "US_AIRLINES.csv"

download_file(
    url="https://dut5tonqye28.cloudfront.net/ai_lab/flight-info/US_AIRLINES.csv",
    filename=airlines_csv_file
)

Let's take a look at the contents of this file.

In [None]:
# This can be removed once https://github.com/exasol/notebook-connector/issues/306 is resolved, and there's a new notebook-connector release.
%pip install polars==1.35.2

In [None]:
import polars as pl

polars_dataframe = pl.read_csv(airlines_csv_file)

In [None]:
polars_dataframe.schema

In [None]:
polars_dataframe.head()

### 1.3 Create Tables

We will start by creating empty tables for storing the flight delay data.

In [None]:
from typing import NamedTuple

class TableInfo(NamedTuple):
    table: str
    schema: str = ai_lab_config.db_schema

    @property
    def as_tuple(self):
        """Return (schema, table), the format PyExasol methods expect when the table is passed as an argument."""
        return (self.schema, self.table)

    @property
    def as_string(self):
        """Return schema.table, the format used when building complete queries sent to PyExasol."""
        return f"{self.schema}.{self.table}"
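
The two properties serve different call sites: the tuple is passed as a table argument, while the string is interpolated into SQL text. A minimal sketch of the difference, using a hypothetical class `TableInfoSketch` with the made-up schema name `MY_SCHEMA` in place of `ai_lab_config.db_schema`, so that it runs without the AI Lab configuration:

```python
from typing import NamedTuple


class TableInfoSketch(NamedTuple):
    """Stand-in for TableInfo with a hard-coded, hypothetical default schema."""
    table: str
    schema: str = "MY_SCHEMA"  # assumption: replaces ai_lab_config.db_schema

    @property
    def as_tuple(self):
        return (self.schema, self.table)

    @property
    def as_string(self):
        return f"{self.schema}.{self.table}"


demo = TableInfoSketch(table="US_AIRLINES")
print(demo.as_tuple)   # tuple form, for table arguments of import/export calls
print(demo.as_string)  # string form, for interpolation into SQL text
```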
        

In [None]:
us_airlines = TableInfo(table="US_AIRLINES")
us_flights = TableInfo(table="US_FLIGHTS")


us_airlines_ddl = f"""
CREATE OR REPLACE TABLE {us_airlines.as_string} (
  OP_CARRIER_AIRLINE_ID DECIMAL(10, 0) IDENTITY PRIMARY KEY,
  CARRIER_NAME VARCHAR(1000)
)
"""

us_flights_ddl = f"""
CREATE OR REPLACE TABLE {us_flights.as_string} (
  FL_DATE TIMESTAMP, 
  OP_CARRIER_AIRLINE_ID DECIMAL(10, 0),
  ORIGIN_AIRPORT_SEQ_ID DECIMAL(10, 0),
  ORIGIN_STATE_ABR CHAR(2),
  DEST_AIRPORT_SEQ_ID DECIMAL(10, 0),
  DEST_STATE_ABR CHAR(2),
  CRS_DEP_TIME CHAR(4),
  DEP_DELAY DOUBLE, 
  CRS_ARR_TIME CHAR(4),
  ARR_DELAY DOUBLE,
  CANCELLED BOOLEAN,
  CANCELLATION_CODE CHAR(1),
  DIVERTED BOOLEAN,
  CRS_ELAPSED_TIME DOUBLE,
  ACTUAL_ELAPSED_TIME DOUBLE, 
  DISTANCE DOUBLE,
  CARRIER_DELAY DOUBLE,
  WEATHER_DELAY DOUBLE,
  NAS_DELAY DOUBLE,
  SECURITY_DELAY DOUBLE,
  LATE_AIRCRAFT_DELAY DOUBLE
)
"""

In [None]:
from exasol.nb_connector.connections import open_pyexasol_connection

with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    for ddl in (us_airlines_ddl, us_flights_ddl):
        conn.execute(ddl)

### 1.4 Import Airline Information from CSV with PyExasol

We use PyExasol's `import_from_file` to import the CSV file into the `US_AIRLINES` table.

In [None]:
from pathlib import Path

with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    path = Path(airlines_csv_file)
    import_params = {
        "column_delimiter": '"',
        "column_separator": ",",
        "row_separator": "CRLF",
        "skip": 1,
    }
    conn.import_from_file(path, us_airlines.as_tuple, import_params)
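
The `import_params` above describe the layout the CSV file is expected to have: one header line to skip, commas between columns, double quotes around values that need them, and CRLF line endings. As an illustration only, the following sketch writes and re-reads a file in that layout with Python's `csv` module. The file name `airlines_sample.csv` and both data rows are made up and are not taken from the real data set.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; the real file is US_AIRLINES.csv.
header = ["OP_CARRIER_AIRLINE_ID", "CARRIER_NAME"]
rows = [[1, "Example Air"], [2, "Sample Airways, Inc."]]

sample_path = Path(tempfile.mkdtemp()) / "airlines_sample.csv"

# newline="" lets the csv module control the line endings itself.
with open(sample_path, "w", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"', lineterminator="\r\n")
    writer.writerow(header)
    writer.writerows(rows)

raw = sample_path.read_bytes()
print(raw)  # shows the CRLF endings and the quoted value containing a comma

with open(sample_path, newline="") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    next(reader)  # skip the header line, comparable to "skip": 1
    data = list(reader)
print(data)
```

The second row shows why the column delimiter matters: without the quotes, the comma inside the name would be read as a column separator.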

## 2. Roundtrip with Parquet

### 2.1 Importing Parquet Data from a Local Filesystem

This section demonstrates how to import a parquet file from a local source into the database using PyExasol.


In [None]:
from pathlib import Path

with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.import_from_parquet(source=Path(flights_parquet_file), table=us_flights.as_tuple)

### 2.2 Aggregating & Transforming the Data in the Database

Let's find out which airline has the highest delay per flight:

In [None]:
highest_delay_per_flight_query = f"""
SELECT
  CARRIER_NAME "Airline",
  SUM(CARRIER_DELAY) "Combined Delay",
  COUNT(CARRIER_DELAY) "Delayed Flights",
  COUNT(F.OP_CARRIER_AIRLINE_ID) "Total flights",
  ROUND( SUM(CARRIER_DELAY) / COUNT(F.OP_CARRIER_AIRLINE_ID), 1 ) "Delay per flight"
FROM {us_flights.as_string} F
  JOIN {us_airlines.as_string} A 
  ON A.OP_CARRIER_AIRLINE_ID = F.OP_CARRIER_AIRLINE_ID
WHERE NOT (CANCELLED OR DIVERTED)
GROUP BY CARRIER_NAME
ORDER BY "Delay per flight" DESC
"""

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    result = conn.execute(highest_delay_per_flight_query).fetchall()

result
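
The query uses both `COUNT(CARRIER_DELAY)` and `COUNT(F.OP_CARRIER_AIRLINE_ID)`. In standard SQL, `COUNT(column)` skips NULL values, so the first expression counts only flights with a recorded carrier delay value, while the second counts every flight in the group. The sketch below reproduces this on a toy table in an in-memory SQLite database rather than in Exasol. It assumes that `CARRIER_DELAY` is NULL for flights without a recorded delay, and all rows and the lowercase names are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (airline_id INTEGER, carrier_delay REAL)")
con.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    # Airline 1: one delayed flight out of three; airline 2: two out of two.
    [(1, 30.0), (1, None), (1, None), (2, 10.0), (2, 20.0)],
)

rows = con.execute(
    """
    SELECT
      airline_id,
      SUM(carrier_delay)   AS combined_delay,
      COUNT(carrier_delay) AS delayed_flights,
      COUNT(airline_id)    AS total_flights,
      ROUND(SUM(carrier_delay) / COUNT(airline_id), 1) AS delay_per_flight
    FROM flights
    GROUP BY airline_id
    ORDER BY delay_per_flight DESC
    """
).fetchall()
con.close()

for row in rows:
    print(row)
```

Both toy airlines have the same combined delay, but the one with fewer flights overall ranks first, because the delay is divided by the total number of flights and not by the number of delayed flights.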

### 2.3 Exporting the Transformed Data to Parquet

This section demonstrates how to export transformed data to a parquet file in the local filesystem.

In [None]:
highest_delay_parquet_directory = "highest_delay"

#### 2.3.1 Save One File without Additional callback_params 

This code saves all of the data for the query into one parquet file.
Because `callback_params={"existing_data_behavior": ...}` is left at its default, re-running this cell without first deleting the directory
**raises an exception**:

```python
ValueError: 'highest_delay' contains existing files and `callback_params['existing_data_behavior']` is not one of these values: ('overwrite_or_ignore', 'delete_matching').
```

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.export_to_parquet(dst=highest_delay_parquet_directory, query_or_table=highest_delay_per_flight_query)
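
One way to make the cell above re-runnable is to remove the output directory before exporting. A minimal sketch using only the standard library; the helper name `reset_directory` is hypothetical, and the demonstration works on a temporary directory rather than on `highest_delay`:

```python
import shutil
import tempfile
from pathlib import Path


def reset_directory(directory):
    """Delete the directory and everything in it, if it exists."""
    shutil.rmtree(directory, ignore_errors=True)


# Demonstration on a throwaway directory containing a leftover file.
demo_dir = Path(tempfile.mkdtemp()) / "highest_delay_demo"
demo_dir.mkdir()
(demo_dir / "part-0.parquet").write_bytes(b"leftover")

reset_directory(demo_dir)
print(demo_dir.exists())
```

Calling such a helper with `highest_delay_parquet_directory` before the export would discard earlier results, so it only suits output that can be regenerated.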

#### 2.3.2 Repeatedly Save One File with callback_params["existing_data_behavior"]="overwrite_or_ignore"

This code saves all of the data for the query into one parquet file and overwrites any existing parquet file with the same name.<br>
This is due to `callback_params["existing_data_behavior"] = "overwrite_or_ignore"`.<br>
**This allows the cell to be executed multiple times, unlike in 2.3.1.**


`existing_data_behavior` can be set to:
> * `error` (default value) raises an error if **any** data exists in the destination.

> * `overwrite_or_ignore` ignores any existing data and overwrites files with the same name as an output file. Other existing files are ignored. This behavior, in combination with a unique `basename_template` for each write, allows for an append workflow.

> * `delete_matching` is useful when you are writing a partitioned dataset. The first time each partition directory is encountered, the entire directory is deleted. This allows you to overwrite old partitions completely.
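
The append workflow mentioned for `overwrite_or_ignore` relies on each write producing file names that do not collide with earlier ones. The sketch below builds such a name template from a random UUID. The option descriptions suggest that `callback_params` are forwarded to `pyarrow.dataset.write_dataset`, whose `basename_template` uses the token `{i}` as a file counter. Whether PyExasol accepts `basename_template` this way is an assumption and is not confirmed here. The helper name `unique_basename_template` is hypothetical.

```python
import uuid


def unique_basename_template():
    """Return a file name template that differs on every call.

    The literal token {i} is kept, since pyarrow replaces it with a counter.
    """
    return f"part-{uuid.uuid4().hex}-{{i}}.parquet"


first = unique_basename_template()
second = unique_basename_template()
print(first)
print(first != second)

# Hypothetical use, assuming the option is passed through to pyarrow:
# callback_params = {
#     "existing_data_behavior": "overwrite_or_ignore",
#     "basename_template": unique_basename_template(),
# }
```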

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.export_to_parquet(dst=highest_delay_parquet_directory, query_or_table=highest_delay_per_flight_query,
                          callback_params={"existing_data_behavior": "overwrite_or_ignore"})

#### 2.3.3 Save Multiple Files with callback_params["max_rows_per_file"]=5

This code saves all of the data for the query into 3 parquet files and overwrites any existing parquet file with the same name.<br>
The split into multiple files comes from `callback_params["max_rows_per_file"] = 5` and `callback_params["max_rows_per_group"] = 5`.<br>
The overwriting behavior is due to `callback_params["existing_data_behavior"] = "overwrite_or_ignore"`.

**Note:** If ``max_rows_per_file`` is altered, ensure that ``max_rows_per_group`` is set to a value less than or equal to the value of ``max_rows_per_file``.

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.export_to_parquet(dst=highest_delay_parquet_directory, query_or_table=highest_delay_per_flight_query,
                          callback_params={"existing_data_behavior": "overwrite_or_ignore",
                                           "max_rows_per_file":5,
                                           "max_rows_per_group":5
                                          })
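
The number of files follows from the row count, since each file holds at most `max_rows_per_file` rows. A small worked sketch of that arithmetic: the row count of 12 is a made-up value chosen to be consistent with the 3 files mentioned above, and `expected_file_count` is a hypothetical helper, not part of PyExasol.

```python
import math


def expected_file_count(row_count, max_rows_per_file, max_rows_per_group):
    """Estimate how many files a write with a per-file row limit produces."""
    # Mirrors the note above: the group size must fit inside one file.
    if max_rows_per_group > max_rows_per_file:
        raise ValueError("max_rows_per_group must not exceed max_rows_per_file")
    return math.ceil(row_count / max_rows_per_file)


print(expected_file_count(12, 5, 5))
```

With a limit of 5 rows per file, any result with 11 to 15 rows gives the same estimate.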