# Reading from and writing to a database

In [1]:
from pathlib import Path

import polars as pl

## Creating a SQLite database

A `SQLite` database is simply a file on disk. 
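To make the "file on disk" point concrete, here is a minimal sketch that uses only Python's standard-library `sqlite3` module rather than Polars. The helper name `create_empty_database` and the file name `demo.sqlite` are illustrative assumptions, not part of this notebook's data.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_empty_database(db_path: Path) -> None:
    """Create a SQLite database file containing one empty table."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE records (passenger_count REAL)")
        conn.commit()
    finally:
        conn.close()


# Work in a temporary directory so nothing is left behind afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.sqlite"
    existed_before = demo_path.exists()  # no file yet
    create_empty_database(demo_path)
    existed_after = demo_path.exists()  # the database is now a single file
    size_in_bytes = demo_path.stat().st_size
```

Because the whole database lives in one file, copying or deleting that file copies or deletes the database.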

In [2]:
csv_file = "data/nyc_trip_data_1k.csv"
df = pl.read_csv(csv_file)

Create a directory to hold the database.

In [3]:
sqliteDBDirectory = Path("data/sqlite/nyc_data")

# exist_ok=True makes this a no-op if the directory already exists
sqliteDBDirectory.mkdir(parents=True, exist_ok=True)

In [4]:
sqliteDBPath = sqliteDBDirectory / "nyc_trip_data.sqlite"

### Engines for writing to a database

When working with a database, we need to specify an engine that handles communication between Polars and the database.

The options are:
- SQLAlchemy
- Arrow Database Connectivity (ADBC)

#### SQLAlchemy

If we choose SQLAlchemy, Polars converts the data to a Pandas `DataFrame` backed by PyArrow rather than NumPy.

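You can do this conversion yourself if you want full control over the operation:
```python
# table_name, engine (a SQLAlchemy engine) and if_exists are placeholders
# that you must define before running this snippet
df.to_pandas(use_pyarrow_extension_array=True).to_sql(
    name=table_name, con=engine, if_exists=if_exists
)
```
Internally, Polars calls the Pandas `to_sql` method on that Pandas `DataFrame` in the same way.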

SQLAlchemy is a tried and tested approach that works with many different databases.

#### Arrow Database Connectivity (ADBC)

ADBC is a promising new approach built around Apache Arrow. 

It *should* be better than SQLAlchemy in terms of performance and memory usage.

However, it is still early days for ADBC and its feature set is limited compared to SQLAlchemy.

If ADBC doesn't work for your use case, stick with SQLAlchemy.
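One way to act on this advice is to check which engine's packages are importable before choosing. The sketch below is an assumption-laden illustration: the helper names `is_installed` and `choose_engine` are hypothetical, and the check only confirms that `adbc_driver_sqlite`, `sqlalchemy` and `pandas` can be found, not that a given write will succeed.

```python
from importlib.util import find_spec
from typing import Callable, Optional


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def choose_engine(installed: Callable[[str], bool] = is_installed) -> Optional[str]:
    """Prefer ADBC when its SQLite driver is present, else fall back to SQLAlchemy."""
    if installed("adbc_driver_sqlite"):
        return "adbc"
    # The SQLAlchemy route goes through Pandas, so both packages are needed
    if installed("sqlalchemy") and installed("pandas"):
        return "sqlalchemy"
    return None  # neither engine is available


engine_name = choose_engine()
```

The returned string could then be passed as the `engine` argument when writing. Passing the lookup function as a parameter keeps the choice easy to test without installing or uninstalling packages.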

### Creating a database
We create a SQLite database with ADBC.

To use SQLite with ADBC, we need the `adbc_driver_sqlite` package.

The connection URI for a SQLite database on disk must begin with `sqlite:///` followed by the path to the database file. 

Call `as_posix` on the `Path` object to get the path as a string with forward slashes before building the URI.

In [5]:
uri = "sqlite:///" + sqliteDBPath.as_posix()
uri

'sqlite:///data/sqlite/nyc_data/nyc_trip_data.sqlite'
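The reason for `as_posix` rather than `str` is that it always uses forward slashes, whatever the operating system. A small sketch using the pure path classes from `pathlib` shows this; the helper name `sqlite_uri` is an illustrative assumption.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def sqlite_uri(db_path: PurePath) -> str:
    """Build a SQLite connection URI from a path object."""
    return "sqlite:///" + db_path.as_posix()


# The same relative path written in POSIX and in Windows style
posix_uri = sqlite_uri(PurePosixPath("data/sqlite/nyc_data/nyc_trip_data.sqlite"))
windows_uri = sqlite_uri(PureWindowsPath(r"data\sqlite\nyc_data\nyc_trip_data.sqlite"))
```

Both calls produce the same URI, so the notebook behaves the same way on Windows as on Linux or macOS.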

In [6]:
if not sqliteDBPath.exists():
    df.sort("passenger_count").write_database(
        table_name="records",
        connection=uri,
        if_table_exists="replace",
        engine="adbc"
    )

## Reading from a database

In [7]:
df = pl.read_database_uri(
    "select * from records limit 3", 
    uri=uri, 
    engine="adbc"
)

df

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount
str,str,str,f64,f64,f64,f64
"""id1""","""2022-01-01T02:31:41.000000""","""2022-01-01T02:45:31.000000""",0.0,2.7,12.0,3.15
"""id3""","""2022-01-03T07:59:25.000000""","""2022-01-03T08:17:00.000000""",0.0,3.4,14.0,2.0
"""id3""","""2022-01-03T08:41:51.000000""","""2022-01-03T08:54:11.000000""",0.0,1.8,10.0,0.0


## Reading from a client-server database
To read from a client-server database such as Postgres, MySQL or Oracle, the connection string must include the standard connection and login details, for example:
```python
uri = "postgresql://username:password@server:port/database"
pl.read_database_uri(query="select * from records", uri=uri)
```
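Passwords often contain characters such as `@` or `/` that have a special meaning inside a URI, so they need to be percent-encoded before being placed in the connection string. The following is a minimal sketch using the standard library; the helper name `build_postgres_uri` and the credentials are made up for illustration.

```python
from urllib.parse import quote


def build_postgres_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a Postgres connection URI, percent-encoding the credentials."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Hypothetical credentials; "@" and "/" in the password must be escaped
example_uri = build_postgres_uri("alice", "p@ss/word", "localhost", 5432, "nyc")
```

In practice it is safer to read the username and password from environment variables or a secrets manager than to hard-code them in a notebook.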

## Filtering rows and selecting columns
The `pl.read_database_uri` function works only in eager mode. 

If we read from the database first and then `select` columns or `filter` rows in Polars, the full result of the SQL query is read into memory before the `select` or `filter` is applied.

In [8]:
pl.read_database_uri(
    "select * from records",
    uri=uri,
    engine="adbc"
).filter(
    pl.col("passenger_count") > 3
).head(3)

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount
str,str,str,f64,f64,f64,f64
"""id8""","""2022-01-01T00:40:58.000000""","""2022-01-01T01:00:59.000000""",4.0,8.44,25.5,0.0
"""id5""","""2022-01-02T20:56:41.000000""","""2022-01-02T21:02:43.000000""",4.0,0.83,6.0,0.0
"""id2""","""2022-01-03T14:58:00.000000""","""2022-01-03T15:20:45.000000""",4.0,3.9,16.5,0.0


Instead, do the `select` and `filter` in the SQL query rather than in Polars syntax, so that only the required rows and columns are read.

In [9]:
pl.read_database_uri(
    "select * from records where passenger_count > 3",
    uri=uri,
    engine="adbc"
).head(3)

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount
str,str,str,f64,f64,f64,f64
"""id8""","""2022-01-01T00:40:58.000000""","""2022-01-01T01:00:59.000000""",4.0,8.44,25.5,0.0
"""id5""","""2022-01-02T20:56:41.000000""","""2022-01-02T21:02:43.000000""",4.0,0.83,6.0,0.0
"""id2""","""2022-01-03T14:58:00.000000""","""2022-01-03T15:20:45.000000""",4.0,3.9,16.5,0.0


In [10]:
pl.read_database_uri(
    "select pickup, dropoff from records",
    uri=uri,
    engine="adbc"
).head(3)

pickup,dropoff
str,str
"""2022-01-01T02:31:41.000000""","""2022-01-01T02:45:31.000000"""
"""2022-01-03T07:59:25.000000""","""2022-01-03T08:17:00.000000"""
"""2022-01-03T08:41:51.000000""","""2022-01-03T08:54:11.000000"""
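To see how much data each approach transfers, here is a small sketch that uses the standard-library `sqlite3` module on a toy in-memory table. The table contents and the helper names are assumptions made for illustration; they are not the NYC trip data.

```python
import sqlite3


def make_demo_connection() -> sqlite3.Connection:
    """Create an in-memory table with 70 rows and passenger counts 0 to 6."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (passenger_count REAL, trip_distance REAL)")
    rows = [(float(i % 7), 1.0 + i) for i in range(70)]
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    conn.commit()
    return conn


def filter_after_reading(conn: sqlite3.Connection):
    """Fetch every row, then filter on the client side."""
    all_rows = conn.execute(
        "SELECT passenger_count, trip_distance FROM records"
    ).fetchall()
    kept = [row for row in all_rows if row[0] > 3]
    return len(all_rows), kept


def filter_in_sql(conn: sqlite3.Connection):
    """Let the database apply the filter, so only matching rows are fetched."""
    kept = conn.execute(
        "SELECT passenger_count, trip_distance FROM records WHERE passenger_count > 3"
    ).fetchall()
    return len(kept), kept


conn = make_demo_connection()
rows_read_eager, result_eager = filter_after_reading(conn)
rows_read_sql, result_sql = filter_in_sql(conn)
conn.close()
```

Both functions return the same rows, but the first fetches all 70 rows from the database while the second fetches only the 30 that match. With a large table this difference determines how much memory the read needs.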


## DuckDB

DuckDB is an embedded database like SQLite, but it is better suited to analytical queries.

Although DuckDB is not built on Arrow, we can convert a Polars `DataFrame` to an Arrow table and pass it to DuckDB for querying.

Remember to install `duckdb` first.

In [11]:
import duckdb

dfPolars = pl.read_csv(csv_file)

Convert the Polars `DataFrame` to an Arrow table and pass it to DuckDB.

In [12]:
nyc = duckdb.arrow(dfPolars.to_arrow())

Run the query and get the result as an Arrow table

In [13]:
nyc.query(
    "nyc", "SELECT passenger_count,avg(trip_distance) FROM nyc group by passenger_count"
).to_arrow_table()

pyarrow.Table
passenger_count: double
avg(trip_distance): double
----
passenger_count: [[2,3,5],[1],...,[0],[6]]
avg(trip_distance): [[3.3519354838709683,3.627692307692308,1.4658823529411766],[3.533346774193546],...,[1.5687499999999999],[3.7658823529411762]]

Return the result to Polars `DataFrame`

In [14]:
pl.from_arrow(
    nyc.query(
        "nyc", "SELECT passenger_count,avg(trip_distance) FROM nyc group by passenger_count"
    ).to_arrow_table()
)

passenger_count,avg(trip_distance)
f64,f64
4.0,3.73
2.0,3.351935
3.0,3.627692
0.0,1.56875
6.0,3.765882
5.0,1.465882
1.0,3.533347


## Exercises

### Exercise 1
Get the maximum and average of the passenger count when the trip distance is greater than 5 km. 

Use the ADBC engine.

In [15]:
pl.read_database_uri(
    "select max(passenger_count), avg(passenger_count) from records where trip_distance > 5",
    uri=uri,
    engine="adbc"
)

max(passenger_count),avg(passenger_count)
f64,f64
6.0,1.433735


### Exercise 2
Read the Titanic dataset into a `DataFrame`

In [16]:
titaniccsv_file = "data/titanic.csv"

In [18]:
dfTitanic = pl.read_csv(titaniccsv_file)

Pass the data to DuckDB with `duckdb.arrow`

In [20]:
titanic = duckdb.arrow(dfTitanic.to_arrow())

Get the average age in each passenger class and return the result as a Polars `DataFrame`

In [22]:
pl.from_arrow(
    titanic.query(
        "titanic", "select Pclass, avg(Age) from titanic group by Pclass"
    ).to_arrow_table()
)

Pclass,avg(Age)
i64,f64
1,38.233441
3,25.14062
2,29.87763
