<div style="text-align: right;">
  <img src="https://raw.githubusercontent.com/exasol/ai-lab/refs/heads/main/assets/Exasol_Logo_2025_Dark.svg" style="width:200px; margin: 10px;" />
</div>

# Working with Exasol Using the PyExasol Connector

This notebook shows basic operations on example data using PyExasol. For more details, see also the [PyExasol documentation](https://exasol.github.io/pyexasol/master/index.html).

Organized as a quickstart tutorial, the notebook looks at the flexibility and ease with which PyExasol imports data into and exports data from an Exasol database. The notebook also shows how to transform data both in Python and within the database. The tutorial uses Polars, PyArrow and example data on US flight delays:

* [Polars](https://pola.rs/) is a fast DataFrame library for Python. A DataFrame is a two-dimensional data structure representing data as a table with rows and columns. The methods [import_from_polars](https://exasol.github.io/pyexasol/1.3.0/api.html#pyexasol.ExaConnection.import_from_polars) and [export_to_polars](https://exasol.github.io/pyexasol/1.3.0/api.html#pyexasol.ExaConnection.export_to_polars) were introduced in PyExasol version `1.0.0`, allowing users to quickly import data into and export data from an Exasol database.
* [PyArrow](https://arrow.apache.org/docs/python/) is a cross-language development platform for in-memory data. The methods [import_from_parquet](https://exasol.github.io/pyexasol/1.3.0/api.html#pyexasol.ExaConnection.import_from_parquet) and [export_to_parquet](https://exasol.github.io/pyexasol/1.3.0/api.html#pyexasol.ExaConnection.export_to_parquet) were introduced in PyExasol version `1.2.0`, allowing users to quickly import Parquet files into an Exasol database and export data from the database to Parquet files. Parquet is a compressed, columnar file format suited to large tabular datasets.
* The example data on US flight delays is used to explore the delay caused by the carrier. We will rank the carriers using the delay as the performance metric. This data is publicly accessible at the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/Homepage.asp) of the US Department of Transportation.

We will use the import and export functionality for both Polars and PyArrow in a round trip with the data on US flight delays to illustrate how PyExasol creates a versatile and flexible bridge between Python code and the Exasol database. Beyond specific use cases, like a project using Polars, this helps in ecosystems with multiple input and output formats, which are typical for real companies.

## Prerequisites

Prior to using this notebook the following steps need to be completed:
1. [Configure the AI Lab](../main_config.ipynb).

Please note:
* AI Lab currently ships with PyExasol version `1.3.0`.
* Functions for PyArrow's parquet and Polars's DataFrame should work for all supported database versions, as data is converted into streamed CSVs before executing SQL statements `IMPORT` or `EXPORT`.
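
The CSV streaming idea in the note above can be illustrated with a minimal standard-library sketch. It does not reproduce PyExasol's internals; the helper name `rows_to_csv_stream` and the sample rows are hypothetical and only show how tabular rows can be serialized into a CSV text stream.

```python
import csv
import io


def rows_to_csv_stream(rows):
    """Serialize an iterable of row tuples into an in-memory CSV text stream."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)  # rewind so a consumer can read from the start
    return buffer


# Hypothetical sample rows: (airline ID, carrier name)
sample_rows = [(1, "Example Air"), (2, "Sample, Inc.")]
csv_text = rows_to_csv_stream(sample_rows).read()
print(csv_text)
```

Note that the field containing a comma is quoted automatically, which is why the column delimiter matters when such data is read back.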

## 1. Setup

### 1.1 Open Secure Configuration Storage (SCS)

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

### 1.2 Download Files to Local Filesystem

In [None]:
import requests

def download_file(url, filename):
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()
        
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                
        print("File downloaded successfully!")
        
    except requests.exceptions.RequestException as e:
        print(f"Error downloading file: {str(e)}")

#### 1.2.1 Download Parquet File from S3 Bucket & Inspect with PyArrow

In this section we download a prepared file from an AWS S3 Bucket to our local filesystem. The file contains data on US flight delays. After that, we load and inspect the data with PyArrow.

In [None]:
flights_parquet_file = "US_FLIGHTS_FEB_2024.parquet"

download_file(
    url="https://ai-lab-example-data-s3.s3.eu-central-1.amazonaws.com/first_steps/US_FLIGHTS_FEB_2024.parquet",
    filename=flights_parquet_file
)

Let's take a look at the contents of this file.

In [None]:
from pyarrow import dataset

parquet_table = dataset.dataset(flights_parquet_file).to_table()

parquet_table.schema

In [None]:
parquet_table.to_pandas().head()

#### 1.2.2 Download CSV File from Remote Filesystem & Inspect with Polars
This section demonstrates how to download a CSV file from a remote source.

In [None]:
airlines_csv_file = "US_AIRLINES.csv"

download_file(
    url="https://dut5tonqye28.cloudfront.net/ai_lab/flight-info/US_AIRLINES.csv",
    filename=airlines_csv_file
)

Let's take a look at the contents of this file.

In [None]:
import polars as pl

polars_dataframe = pl.read_csv(airlines_csv_file)
polars_dataframe.schema

In [None]:
polars_dataframe.head()

### 1.3 Create Tables

We will start by creating empty tables for storing the flight delay data.

In [None]:
from typing import NamedTuple

class TableInfo(NamedTuple):
    table: str
    schema: str = ai_lab_config.db_schema

    @property
    def as_tuple(self):
        """Return (schema, table); PyExasol import and export methods accept this form as the table argument."""
        return (self.schema, self.table)

    @property
    def as_string(self):
        """Return 'schema.table'; this form is used inside SQL statements sent to PyExasol."""
        return f"{self.schema}.{self.table}"
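
To see what the two properties evaluate to without a configured AI Lab, here is a minimal stand-alone sketch. The class name `QualifiedTable` and the schema name `MY_SCHEMA` are hypothetical stand-ins; in the notebook the schema comes from `ai_lab_config.db_schema`.

```python
from typing import NamedTuple


class QualifiedTable(NamedTuple):
    """Hypothetical stand-in for TableInfo with an explicit schema."""
    schema: str
    table: str

    @property
    def as_tuple(self):
        # Tuple form, passed as the table argument of import/export calls
        return (self.schema, self.table)

    @property
    def as_string(self):
        # Dotted form, interpolated into SQL text
        return f"{self.schema}.{self.table}"


demo = QualifiedTable(schema="MY_SCHEMA", table="US_AIRLINES")
print(demo.as_tuple)
print(demo.as_string)
```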
        

In [None]:
us_airlines = TableInfo(table="US_AIRLINES")
us_flights = TableInfo(table="US_FLIGHTS")


us_airlines_ddl = f"""
CREATE OR REPLACE TABLE {us_airlines.as_string} (
  OP_CARRIER_AIRLINE_ID DECIMAL(10, 0) IDENTITY PRIMARY KEY,
  CARRIER_NAME VARCHAR(1000)
)
"""

us_flights_ddl = f"""
CREATE OR REPLACE TABLE {us_flights.as_string} (
  FLIGHT_ID DECIMAL(18,0) IDENTITY PRIMARY KEY,
  FL_DATE TIMESTAMP, 
  OP_CARRIER_AIRLINE_ID DECIMAL(10, 0),
  ORIGIN_AIRPORT_SEQ_ID DECIMAL(10, 0),
  ORIGIN_STATE_ABR CHAR(2),
  DEST_AIRPORT_SEQ_ID DECIMAL(10, 0),
  DEST_STATE_ABR CHAR(2),
  CRS_DEP_TIME CHAR(4),
  DEP_DELAY DOUBLE, 
  CRS_ARR_TIME CHAR(4),
  ARR_DELAY DOUBLE,
  CANCELLED BOOLEAN,
  CANCELLATION_CODE CHAR(1),
  DIVERTED BOOLEAN,
  CRS_ELAPSED_TIME DOUBLE,
  ACTUAL_ELAPSED_TIME DOUBLE, 
  DISTANCE DOUBLE,
  CARRIER_DELAY DOUBLE,
  WEATHER_DELAY DOUBLE,
  NAS_DELAY DOUBLE,
  SECURITY_DELAY DOUBLE,
  LATE_AIRCRAFT_DELAY DOUBLE
)
"""

In [None]:
from exasol.nb_connector.connections import open_pyexasol_connection

with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    for ddl in (us_airlines_ddl, us_flights_ddl):
        conn.execute(ddl)

### 1.4 Import Airline Information from CSV with PyExasol

We use PyExasol's import_from_file to import the CSV into the US_AIRLINES table.

In [None]:
from pathlib import Path

with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    path = Path(airlines_csv_file)
    import_params = {
        "column_delimiter": '"',
        "column_separator": ",",
        "row_separator": "CRLF",
        "skip": 1,
    }
    conn.import_from_file(path, us_airlines.as_tuple, import_params)
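
The `row_separator` above has to match the line endings of the file. A minimal sketch for checking this before an import follows; the helper `detect_row_separator` is hypothetical and not part of PyExasol, and it only inspects the first line of the file.

```python
import os
import tempfile


def detect_row_separator(path):
    """Return 'CRLF' if the first line ends with carriage return + line feed, else 'LF'."""
    with open(path, "rb") as f:
        first_line = f.readline()
    return "CRLF" if first_line.endswith(b"\r\n") else "LF"


# Demonstrate on a throwaway file with Windows-style line endings
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"ID,NAME\r\n1,Example Air\r\n")
    demo_path = tmp.name

detected = detect_row_separator(demo_path)
os.remove(demo_path)
print(detected)
```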

## 2. Roundtrip with Parquet

### 2.1 Importing Parquet Data from a Local Filesystem

This section demonstrates how to import a parquet file from a local source into the database using PyExasol.


In [None]:
from pathlib import Path

columns = parquet_table.column_names

with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.import_from_parquet(
        source=Path(flights_parquet_file),
        table=us_flights.as_tuple,
        import_params={"columns": columns},
    )

### 2.2 Aggregating & Transforming the Data in the Database

Let's find out which airline has the highest delay per flight:

In [None]:
highest_delay_per_flight_query = f"""
SELECT
  CARRIER_NAME "Airline",
  SUM(CARRIER_DELAY) "Combined Delay",
  COUNT(CARRIER_DELAY) "Delayed Flights",
  COUNT(F.OP_CARRIER_AIRLINE_ID) "Total flights",
  ROUND( SUM(CARRIER_DELAY) / COUNT(F.OP_CARRIER_AIRLINE_ID), 1 ) "Delay per flight"
FROM {us_flights.as_string} F
  JOIN {us_airlines.as_string} A 
  ON A.OP_CARRIER_AIRLINE_ID = F.OP_CARRIER_AIRLINE_ID
WHERE NOT (CANCELLED OR DIVERTED)
GROUP BY CARRIER_NAME
ORDER BY "Delay per flight" DESC
"""

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    result = conn.execute(highest_delay_per_flight_query).fetchall()

result
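
The aggregates in the query above rely on SQL's NULL handling: `SUM` skips NULLs and `COUNT(column)` counts only non-NULL values. The following sketch mirrors that arithmetic in plain Python on hypothetical numbers, assuming that `CARRIER_DELAY` is NULL for flights without a carrier-caused delay.

```python
# Hypothetical carrier delays in minutes for one airline; None stands in for SQL NULL
carrier_delays = [30.0, None, None, 15.0, None]

non_null = [d for d in carrier_delays if d is not None]

combined_delay = sum(non_null)       # like SUM(CARRIER_DELAY)
delayed_flights = len(non_null)      # like COUNT(CARRIER_DELAY)
total_flights = len(carrier_delays)  # like COUNT over a column that is never NULL
delay_per_flight = round(combined_delay / total_flights, 1)

print(combined_delay, delayed_flights, total_flights, delay_per_flight)
```

Dividing by the total number of flights rather than the number of delayed flights keeps airlines with few but long delays comparable to airlines with many short ones.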

### 2.3 Exporting the Transformed Data to Parquet

This section demonstrates how to export transformed data to a parquet file in the local filesystem.

In [None]:
local_directory = "highest_delay"

#### 2.3.1 Save One File without Additional callback_params 

This code saves all of the query's data into one Parquet file.
Because `callback_params={"existing_data_behavior":...}` is left at its default, re-running this cell without first deleting the directory
**raises an exception**:

```python
ValueError: 'highest_delay' contains existing files and `callback_params['existing_data_behavior']` is not one of these values: ('overwrite_or_ignore', 'delete_matching').
```
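
One way to avoid this exception is to delete the output directory before exporting again. The sketch below uses only the standard library; the helper `reset_directory` is hypothetical, and the demonstration runs on a throwaway directory instead of the real export destination.

```python
import shutil
import tempfile
from pathlib import Path


def reset_directory(directory):
    """Delete the directory and everything in it, if it exists."""
    path = Path(directory)
    if path.exists():
        shutil.rmtree(path)
    return path


# Demonstrate on a throwaway directory rather than real export output
demo_root = Path(tempfile.mkdtemp())
demo_dir = demo_root / "highest_delay_demo"
demo_dir.mkdir()
(demo_dir / "part-0.parquet").write_bytes(b"")  # placeholder file, not real Parquet

reset_directory(demo_dir)
print(demo_dir.exists())
```

In the notebook, the same helper could be called with `local_directory` before the export cell, at the cost of losing any earlier output stored there.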

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.export_to_parquet(
        dst=local_directory, 
        query_or_table=highest_delay_per_flight_query
    )

#### 2.3.2 Repeatedly Save One File with callback_params["existing_data_behavior"]="overwrite_or_ignore"

This code will save all of the data for the query into 1 parquet file and overwrite any existing & matching parquet filename.<br>
This is due to: `callback_params["existing_data_behavior"] = "overwrite_or_ignore"`.<br>
**This allows the cell to be executed multiple times, unlike in 2.3.1.**


`existing_data_behavior` can be set to:
> * `error` (default value) raises an error if **any** data exists in the destination.

> * `overwrite_or_ignore` will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.

> * `delete_matching` is useful when you are writing a partitioned dataset. The first time each partition directory is encountered the entire directory will be deleted. This allows you to overwrite old partitions completely.

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.export_to_parquet(
        dst=local_directory, 
        query_or_table=highest_delay_per_flight_query,
        callback_params={"existing_data_behavior": "overwrite_or_ignore"}
    )

#### 2.3.3 Save Multiple Files with callback_params["max_rows_per_file"]=5

This code will save all of the data for the query into 3 parquet files and overwrite any existing & matching parquet filename.<br>
The saving into multiple files comes from: `callback_params["max_rows_per_file"]=5` and `callback_params["max_rows_per_group"]=5`.<br>
This overwriting behavior is due to: `callback_params["existing_data_behavior"] = "overwrite_or_ignore"`. 

**Note:** If ``max_rows_per_file`` is altered, ensure that ``max_rows_per_group`` is set to a value less than or equal to the value of ``max_rows_per_file``.
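
As a rough estimate, the number of output files follows from ceiling division of the row count by `max_rows_per_file`. The sketch below assumes a hypothetical result size of 13 rows; the helper name is illustrative, and the real file count is decided by the Parquet writer.

```python
import math


def expected_file_count(total_rows, max_rows_per_file):
    """Estimate how many files a row limit per file produces."""
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    return math.ceil(total_rows / max_rows_per_file)


# Hypothetical: a query result with 13 rows and at most 5 rows per file
print(expected_file_count(13, 5))
```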

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.export_to_parquet(
        dst=local_directory,
        query_or_table=highest_delay_per_flight_query,
        callback_params={
            "existing_data_behavior": "overwrite_or_ignore",
            "max_rows_per_file":5,
            "max_rows_per_group":5
        }
    )

## 3. Roundtrip with Polars

### 3.1 Exporting the Data to a Polars DataFrame

This section demonstrates how to export data from an Exasol Database to a Polars DataFrame.

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    highest_delay_per_flight_dataframe = conn.export_to_polars(
        query_or_table=highest_delay_per_flight_query,
    )

highest_delay_per_flight_dataframe.head()

In [None]:
model_query = f"""
SELECT
  FLIGHT_ID,
  TO_CHAR(FL_DATE, 'D') AS day_of_week,
  CASE 
        WHEN TO_CHAR(FL_DATE, 'D') IN ('6', '7') THEN TRUE 
        ELSE FALSE 
    END AS is_weekend,
  OP_CARRIER_AIRLINE_ID,
  CAST(TO_CHAR(TO_TIMESTAMP(CRS_DEP_TIME, 'HH24MI'), 'HH24') AS INTEGER) AS CRS_DEP_HOUR,
  ORIGIN_STATE_ABR,
  DEST_STATE_ABR,
  DISTANCE,
  CASE
      WHEN ARR_DELAY > 0 THEN 1
      ELSE 0
  END AS was_delayed
FROM {us_flights.as_string}
  WHERE CANCELLED IS FALSE
  AND DIVERTED IS FALSE
"""

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    model_dataframe = conn.export_to_polars(query_or_table=model_query)

model_dataframe.head()
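
Two of the derived columns in the model query, `CRS_DEP_HOUR` and `was_delayed`, can be mirrored in plain Python to make their logic explicit. This is a sketch under the assumption that `CRS_DEP_TIME` is always a four-digit `HHMM` string; the function names are hypothetical.

```python
def scheduled_hour(hhmm):
    """Extract the hour from a 4-character 'HHMM' string such as '0930'."""
    if len(hhmm) != 4 or not hhmm.isdigit():
        raise ValueError(f"expected a 4-digit HHMM string, got {hhmm!r}")
    return int(hhmm[:2])


def was_delayed(arr_delay):
    """Return 1 for a positive arrival delay, else 0.

    None stands in for SQL NULL and maps to 0, like the ELSE branch of the CASE.
    """
    return 1 if arr_delay is not None and arr_delay > 0 else 0


print(scheduled_hour("0930"), was_delayed(12.0), was_delayed(-3.0), was_delayed(None))
```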

In [None]:
model_dataframe.select(pl.col("WAS_DELAYED").value_counts(sort=True))

### 3.2 Training a Model

This is a toy example of a machine learning model. It is not intended to be optimized or well-performing; many steps which you would expect in the preparation of a machine learning model were not done.

First, we one-hot encode categorical values and scale large numerical values.
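
To make the two transformations concrete, here is a dependency-free sketch of one-hot encoding and standard scaling on hypothetical values. It illustrates the idea only and is not how scikit-learn implements it; like `StandardScaler`, it divides by the population standard deviation.

```python
import statistics


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def standardize(values):
    """Shift to zero mean and scale to unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero for constant columns
    return [(v - mean) / std for v in values]


# Hypothetical sample values
categories, encoded = one_hot(["CA", "NY", "CA"])
scaled = standardize([100.0, 200.0, 300.0])
print(categories, encoded)
print(scaled)
```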

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn import set_config

# Globally configure sklearn to return Polars DataFrames
set_config(transform_output="polars")

# Apply different transformations to different columns
cat_cols = ['OP_CARRIER_AIRLINE_ID', 'ORIGIN_STATE_ABR', 'DEST_STATE_ABR']
num_cols = ['DISTANCE']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(sparse_output=False), cat_cols),
        ('num', StandardScaler(), num_cols)  # Added scaling for DISTANCE
    ],
    remainder='passthrough',
    # This removes the prefixes like 'remainder__' and 'cat__'
    verbose_feature_names_out=False 
)

# Transform directly using the Polars DataFrame
model_polars = preprocessor.fit_transform(model_dataframe)

model_polars.head()

Next, we split the data into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split

X = model_polars.drop('WAS_DELAYED', 'FLIGHT_ID')
y = model_polars.select('WAS_DELAYED')

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

Then, we train a logistic regression classifier on the data.

Finally, we evaluate how our model performed.
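
As a reminder of what the evaluation metrics measure, the sketch below computes accuracy, precision, and recall by hand on hypothetical labels. It is an illustration only; the notebook itself relies on scikit-learn's `accuracy_score` and `classification_report`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = delayed, 0 = on time
demo_true = [1, 0, 1, 1, 0]
demo_pred = [1, 0, 0, 1, 1]
demo_accuracy = accuracy(demo_true, demo_pred)
demo_precision, demo_recall = precision_recall(demo_true, demo_pred)
print(demo_accuracy, demo_precision, demo_recall)
```

When the classes are imbalanced, accuracy alone can look good for a model that rarely predicts the minority class, which is why precision and recall are reported as well.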

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

clf_model = LogisticRegression(solver='liblinear', random_state=42)

clf_model.fit(X_train, y_train.to_series())

pred_valid_clf = clf_model.predict(X_valid)

print(f"Accuracy: {accuracy_score(y_valid, pred_valid_clf):.2%}")
print(classification_report(y_valid, pred_valid_clf))

### 3.3 Importing the Model Results into an Exasol Database Table

We run the machine learning model on all of our flight data and insert the predictions into an Exasol database table.

In [None]:
all_preds = clf_model.predict(model_polars.drop('WAS_DELAYED', 'FLIGHT_ID'))

final_df = model_polars.with_columns(
    pl.Series("WAS_DELAYED_PRED", all_preds)
).select([
    "FLIGHT_ID",
    "WAS_DELAYED",
    "WAS_DELAYED_PRED"
])

final_df.head()

In [None]:
us_flights_model = TableInfo(table="US_FLIGHTS_MODEL")

us_flights_model_ddl = f"""
CREATE OR REPLACE TABLE {us_flights_model.as_string}(
    FLIGHT_ID DECIMAL(18,0) NOT NULL,
    WAS_DELAYED BOOLEAN,
    WAS_DELAYED_PRED BOOLEAN,
    CONSTRAINT fk_flight_id
        FOREIGN KEY (FLIGHT_ID) 
        REFERENCES {us_flights.as_string} (FLIGHT_ID)
)
"""

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(us_flights_model_ddl)

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.import_from_polars(
        src=final_df,
        table=us_flights_model.as_tuple
    )

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    df = conn.export_to_pandas(us_flights_model.as_tuple)
df.head()