# 1 - Getting started

## First commands

Getting DuckDB running is as simple as installing the `duckdb` package with pip and importing it.

In [1]:
import duckdb

If you don't need to persist the database after you're done with your session, you can immediately run queries against the database with `duckdb.sql`.

In [2]:
query = """
SELECT 'Hello World!'
"""
res = duckdb.sql(query)
print(type(res))
print(res)

<class 'duckdb.duckdb.DuckDBPyRelation'>
┌────────────────┐
│ 'Hello World!' │
│    varchar     │
├────────────────┤
│ Hello World!   │
└────────────────┘



Printing the result displays it in the format shown above. Alternatively, you can call the result object's `.show()` method.

In [3]:
res.show()

┌────────────────┐
│ 'Hello World!' │
│    varchar     │
├────────────────┤
│ Hello World!   │
└────────────────┘



If and when you need to access the query results in Python, you can convert the result to, for example,
- a list of Python tuples with `res.fetchall()`
- a Pandas DataFrame with `res.df()` or `res.to_df()`

In [4]:
ls = res.fetchall()
print(ls)

[('Hello World!',)]


In [5]:
df = res.df()
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


   'Hello World!'
0    Hello World!


Naturally, you can create tables, insert values, create views, and so on, like in any database.

In [6]:
query = """
CREATE OR REPLACE TABLE test_table (
    int_col INTEGER,
    str_col VARCHAR
);
CREATE OR REPLACE TABLE another_table (
    int_col INTEGER
)
"""
duckdb.sql(query)
duckdb.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ another_table │
│ test_table    │
└───────────────┘

In [7]:
query = """
INSERT INTO test_table (int_col, str_col) VALUES
    (1, 'Hello'),
    (3, 'World'),
    (2, ' ')
"""
duckdb.sql(query)
duckdb.sql("FROM test_table ORDER BY int_col")

┌─────────┬─────────┐
│ int_col │ str_col │
│  int32  │ varchar │
├─────────┼─────────┤
│       1 │ Hello   │
│       2 │         │
│       3 │ World   │
└─────────┴─────────┘

In [8]:
query = """
CREATE OR REPLACE VIEW test_view AS (
    FROM test_table
    WHERE str_col != ' '
)
"""
duckdb.sql(query)
duckdb.sql("FROM test_view ORDER BY int_col")

┌─────────┬─────────┐
│ int_col │ str_col │
│  int32  │ varchar │
├─────────┼─────────┤
│       1 │ Hello   │
│       3 │ World   │
└─────────┴─────────┘

Note that in the DuckDB SQL dialect you can omit `SELECT *`. You can also
- reorder `SELECT` and `FROM`, i.e. you can query `FROM table SELECT cols`,
- exclude columns instead of listing all of the columns you want, i.e. `SELECT * EXCLUDE(cols, we, do, not, want) FROM table`,
- group by all non-aggregated columns, i.e. `SELECT ... FROM table GROUP BY ALL`.

See the [DuckDB documentation](https://duckdb.org/docs/sql/introduction) for the SQL syntax.

## Persisting the database

If you want to persist your database in a file or open a previously saved database, you first need to create a connection to it and then use the connection instead of `duckdb` to run your queries. For example, the following cell creates a new database in the file `test.db` or, if the file already exists, loads it.

In [9]:
conn = duckdb.connect("test.db")
query = """
SELECT 'Hello World!'
"""
conn.sql(query).show()

query = """
CREATE OR REPLACE TABLE yet_another_table (
    col INTEGER
)
"""
conn.sql(query)
conn.sql("SHOW TABLES").show()

┌────────────────┐
│ 'Hello World!' │
│    varchar     │
├────────────────┤
│ Hello World!   │
└────────────────┘

┌───────────────────┐
│       name        │
│      varchar      │
├───────────────────┤
│ yet_another_table │
└───────────────────┘



## Extensions

[Extensions](https://duckdb.org/docs/extensions/overview.html) allow you to add functionality to DuckDB. To see the list of extensions, you can use the `duckdb_extensions()` SQL function.

In [10]:
duckdb.sql("FROM duckdb_extensions()")

┌──────────────────┬─────────┬───────────┬──────────────────────┬──────────────────────────────────┬───────────────────┐
│  extension_name  │ loaded  │ installed │     install_path     │           description            │      aliases      │
│     varchar      │ boolean │  boolean  │       varchar        │             varchar              │     varchar[]     │
├──────────────────┼─────────┼───────────┼──────────────────────┼──────────────────────────────────┼───────────────────┤
│ arrow            │ false   │ false     │                      │ A zero-copy data integration b…  │ []                │
│ autocomplete     │ false   │ false     │                      │ Adds support for autocomplete …  │ []                │
│ aws              │ false   │ false     │                      │ Provides features that depend …  │ []                │
│ azure            │ false   │ false     │                      │ Adds a filesystem abstraction …  │ []                │
│ excel            │ false   │ f

In this tutorial we will need the `postgres` extension, or `postgres_scanner` more specifically. If the extension is listed as not installed, let's install and load it now since we'll use it later.

In [11]:
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

In [12]:
duckdb.sql("FROM duckdb_extensions()")

┌──────────────────┬─────────┬───────────┬──────────────────────┬──────────────────────────────────┬───────────────────┐
│  extension_name  │ loaded  │ installed │     install_path     │           description            │      aliases      │
│     varchar      │ boolean │  boolean  │       varchar        │             varchar              │     varchar[]     │
├──────────────────┼─────────┼───────────┼──────────────────────┼──────────────────────────────────┼───────────────────┤
│ arrow            │ false   │ false     │                      │ A zero-copy data integration b…  │ []                │
│ autocomplete     │ false   │ false     │                      │ Adds support for autocomplete …  │ []                │
│ aws              │ false   │ false     │                      │ Provides features that depend …  │ []                │
│ azure            │ false   │ false     │                      │ Adds a filesystem abstraction …  │ []                │
│ excel            │ false   │ f

Note that you can also install and load extensions with the Python API functions `duckdb.install_extension` and `duckdb.load_extension`.

# 2 - Dataframes

DuckDB can interact with Pandas and Polars dataframes in both directions, i.e. it can read dataframes and it can convert query results to dataframes. See the documentation for full details: [Pandas](https://duckdb.org/docs/archive/0.9.2/guides/python/import_pandas), [Polars](https://duckdb.org/docs/archive/0.9.2/guides/python/polars).

As an example, we'll load one of our generated example parquet-files with all three -- DuckDB, Pandas, and Polars -- and see how to convert between the three. If you haven't already, run the script `generate_example_data.py` before moving on. Note that to work with Pandas dataframes you only need to pip install Pandas, but to work with Polars dataframes you also need to install pyarrow.

In [13]:
import pandas as pd
import polars as pl

Let's first look at querying Pandas dataframes with DuckDB. If you have named the dataframe e.g. `df_pandas`, you can refer to it in a SQL query in DuckDB just like you'd refer to a table.

In [14]:
df_pandas = pd.read_parquet("data/df_1.parquet")

duckdb.sql("FROM df_pandas")

┌───────┬─────────────────────┬────────────────────┬────────────────────┬────────────────────┬──────────────┐
│  id   │      timestamp      │        col3        │        col1        │        col2        │     tags     │
│ int64 │       varchar       │       double       │       double       │       double       │  varchar[]   │
├───────┼─────────────────────┼────────────────────┼────────────────────┼────────────────────┼──────────────┤
│   101 │ 2023-12-13T06:32:49 │               NULL │               NULL │               NULL │ NULL         │
│   102 │ 2023-04-19T09:54:18 │  329.3245699179571 │  123.8749330113198 │ 208.80899087038708 │ [d, a, c, b] │
│   103 │ 2023-08-21T06:28:57 │               NULL │               NULL │               NULL │ [a, b, c, d] │
│   104 │ 2023-03-02T16:37:20 │  353.0100537270686 │  158.4221395570704 │  293.8240259081357 │ NULL         │
│   105 │ 2023-03-13T15:54:04 │  308.3782509448798 │               NULL │ 228.87403284948138 │ [d]          │
│   106 │ 

As we already saw in the previous section, we can call `.df()` or `.to_df()` on a DuckDB query result to convert it to a Pandas dataframe. For example, let's run a query to get all tags from the dataframe and convert the result back to a Pandas dataframe.

In [15]:
# unnest explodes the lists in the tags-column so DISTINCT(unnest(tags)) gets all the tags that appear in the column
duckdb.sql("SELECT DISTINCT(unnest(tags)) AS tag FROM df_pandas ORDER BY ALL").df()

  tag
0   a
1   b
2   c
3   d


Let's then see how to do the same steps with Polars instead. The only actual difference when working with Polars dataframes is that a DuckDB query result is converted to a Polars dataframe with the `.pl()` method.

In [16]:
df_polars = pl.read_parquet("data/df_1.parquet")

duckdb.sql("FROM df_polars")

┌───────┬─────────────────────┬────────────────────┬────────────────────┬────────────────────┬──────────────┐
│  id   │      timestamp      │        col3        │        col1        │        col2        │     tags     │
│ int64 │       varchar       │       double       │       double       │       double       │  varchar[]   │
├───────┼─────────────────────┼────────────────────┼────────────────────┼────────────────────┼──────────────┤
│   101 │ 2023-12-13T06:32:49 │               NULL │               NULL │               NULL │ NULL         │
│   102 │ 2023-04-19T09:54:18 │  329.3245699179571 │  123.8749330113198 │ 208.80899087038708 │ [d, a, c, b] │
│   103 │ 2023-08-21T06:28:57 │               NULL │               NULL │               NULL │ [a, b, c, d] │
│   104 │ 2023-03-02T16:37:20 │  353.0100537270686 │  158.4221395570704 │  293.8240259081357 │ NULL         │
│   105 │ 2023-03-13T15:54:04 │  308.3782509448798 │               NULL │ 228.87403284948138 │ [d]          │
│   106 │ 

In [17]:
duckdb.sql("SELECT DISTINCT(unnest(tags)) AS tag FROM df_polars ORDER BY ALL").pl()

shape: (4, 1)
┌─────┐
│ tag │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
│ d   │
└─────┘


# 3 - Working with files

DuckDB can read and write CSV, JSON, and Parquet files out of the box. You can add more filetypes with extensions. We'll look at examples of Parquet and JSON files. See the [documentation](https://duckdb.org/docs/archive/0.9.2/data/overview) for more information.

## Reading files

Reading from files can be as simple as querying the file as if it were a table in the database.

In [18]:
duckdb.sql("FROM 'data/1_record.json'").show()
duckdb.sql("FROM 'data/df_1.parquet' LIMIT 5").show()

┌───────┬─────────────────────┬──────────────┐
│  id   │      timestamp      │     tags     │
│ int64 │       varchar       │  varchar[]   │
├───────┼─────────────────────┼──────────────┤
│     1 │ 2023-04-16T20:12:06 │ [b, d, c, a] │
└───────┴─────────────────────┴──────────────┘

┌───────┬─────────────────────┬───────────────────┬───────────────────┬────────────────────┬──────────────┐
│  id   │      timestamp      │       col3        │       col1        │        col2        │     tags     │
│ int64 │       varchar       │      double       │      double       │       double       │  varchar[]   │
├───────┼─────────────────────┼───────────────────┼───────────────────┼────────────────────┼──────────────┤
│   101 │ 2023-12-13T06:32:49 │              NULL │              NULL │               NULL │ NULL         │
│   102 │ 2023-04-19T09:54:18 │ 329.3245699179571 │ 123.8749330113198 │ 208.80899087038708 │ [d, a, c, b] │
│   103 │ 2023-08-21T06:28:57 │              NULL │              NULL

You also have the functions `read_csv_auto()`, `read_csv()`, `read_parquet()`, `read_json()` and `read_json_auto()` if you need more control or you need to set some parameters.

DuckDB also supports reading multiple files by providing a list of filenames or by globbing. The files can even be of different filetypes. For example, if we want to read all of the Parquet-files in our `data/` directory, we can do it like this:

In [19]:
duckdb.sql("FROM read_parquet('data/*.parquet', union_by_name = true) SELECT MEAN(col1), MEAN(col2), MEAN(col3)").show()

┌────────────────────┬────────────────────┬──────────────────┐
│     mean(col1)     │     mean(col2)     │    mean(col3)    │
│       double       │       double       │      double      │
├────────────────────┼────────────────────┼──────────────────┤
│ 149.85944914296797 │ 245.98588768634744 │ 350.500765219307 │
└────────────────────┴────────────────────┴──────────────────┘



Note that we use the function `read_parquet()` with the parameter `union_by_name = true`. This tells DuckDB to combine the schemas of the files by name instead of by position, which is the default behavior. We need to do this since our randomly generated Parquet-files are not guaranteed to have the columns in the same order!

If we want to read only the files `1_record.json` and `2_record.json` we can give the filenames as a list like this:

In [20]:
duckdb.sql("FROM read_json_auto(['data/1_record.json', 'data/2_record.json'], union_by_name = true)").show()

┌───────┬─────────────────────┬──────────────┬───────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │       col1        │        col3        │        col2        │
│ int64 │       varchar       │  varchar[]   │      double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼───────────────────┼────────────────────┼────────────────────┤
│     1 │ 2023-04-16T20:12:06 │ [b, d, c, a] │              NULL │               NULL │               NULL │
│     2 │ 2023-11-11T03:38:00 │ [d]          │ 143.2773325891486 │ 348.14722168023326 │ 241.33740756266872 │
└───────┴─────────────────────┴──────────────┴───────────────────┴────────────────────┴────────────────────┘



If, instead of querying a file, you want to read a file and write the data to a table in your database, you can use the [COPY ... FROM](https://duckdb.org/docs/sql/statements/copy#copy--from) statement. Two important caveats to note with COPY ... FROM:
- the table must already exist, so you cannot create a new table with COPY ... FROM,
- the columns and their order must match between the table and file, i.e. you cannot match columns by their names.

If you need to create a new table based on a file or if you need to infer the column order, COPY ... FROM is not enough on its own; you will need to use e.g. CREATE TABLE AS SELECT or INSERT INTO ... BY NAME statements. In particular, since our example files are generated randomly and their column orders are not fixed, we cannot simply COPY ... FROM any of our files.

## Writing files

Writing files is even simpler. You do it with the [COPY ... TO](https://duckdb.org/docs/sql/statements/copy#copy--to) statement. For example, if we'd like to read all of the records in the JSON-files, exclude the tags column, and write the result to a CSV-file, we can do it like this:

In [21]:
query = """
COPY (
    FROM read_json_auto('data/*.json', union_by_name = true)
    SELECT * EXCLUDE(tags)
)
TO 'data/json-records.csv'
"""
duckdb.sql(query)

See the [documentation](https://duckdb.org/docs/sql/statements/copy#copy--to) for all the format options.

# 4 - Interacting with databases

DuckDB has extensions for [MySQL](https://duckdb.org/docs/extensions/mysql), [PostgreSQL](https://duckdb.org/docs/extensions/postgres.html), and [SQLite](https://duckdb.org/docs/extensions/sqlite). These extensions allow you to insert, query, update, and delete data and tables in said databases directly from DuckDB. We'll use Postgres as an example, but MySQL and SQLite work similarly. See the documentation for more information.

## ATTACH and USE

First, we need to connect to a database. This is done with the ATTACH command. Assuming that you have a PostgreSQL server running on localhost, with a database named `postgres` and a user `postgres` with the password `postgres`, you attach to it like this:

In [22]:
duckdb.sql("ATTACH 'dbname=postgres user=postgres password=postgres host=localhost' AS pg (TYPE postgres)")

The parameters for the connection can be given as a libpq connection string or as a PostgreSQL URI. The parameters can also be read from environment variables. See the [documentation](https://duckdb.org/docs/extensions/postgres.html#connecting).

Now, if we want to use the SHOW TABLES command to list all tables in the Postgres database, we need to make it the default database for DuckDB. You can do this with the USE command. We can also specify a schema with USE. For example, let's create a new schema named duckdbexamples in the Postgres database and switch to it:

In [23]:
duckdb.sql("CREATE SCHEMA pg.duckdbexamples")
duckdb.sql("USE pg.duckdbexamples")
duckdb.sql("SHOW TABLES").show()

┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ 0 rows  │
└─────────┘



## Querying

Now that we have attached the Postgres database, we can run queries against it just like we would against a DuckDB database. For example, we can create a table in the schema just like we would in a DuckDB database.

In [24]:
query = """
CREATE OR REPLACE TABLE pg.duckdbexamples.pg_test_table AS FROM read_json_auto(['data/1_record.json', 'data/2_record.json'], union_by_name = true);
FROM pg.duckdbexamples.pg_test_table;
"""
duckdb.sql(query)

┌───────┬─────────────────────┬──────────────┬───────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │       col1        │        col3        │        col2        │
│ int64 │       varchar       │  varchar[]   │      double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼───────────────────┼────────────────────┼────────────────────┤
│     1 │ 2023-04-16T20:12:06 │ [b, d, c, a] │              NULL │               NULL │               NULL │
│     2 │ 2023-11-11T03:38:00 │ [d]          │ 143.2773325891486 │ 348.14722168023326 │ 241.33740756266872 │
└───────┴─────────────────────┴──────────────┴───────────────────┴────────────────────┴────────────────────┘

In [25]:
duckdb.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ pg_test_table │
└───────────────┘

Now, when we switch back to the in-memory DuckDB database, SHOW TABLES lists the tables we created in DuckDB.

In [26]:
duckdb.sql("USE memory")
duckdb.sql("SHOW TABLES").show()

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ another_table │
│ test_table    │
│ test_view     │
└───────────────┘



We can still of course query the Postgres table without needing to USE pg.duckdbexamples.

In [27]:
duckdb.sql("FROM pg.duckdbexamples.pg_test_table")

┌───────┬─────────────────────┬──────────────┬───────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │       col1        │        col3        │        col2        │
│ int64 │       varchar       │  varchar[]   │      double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼───────────────────┼────────────────────┼────────────────────┤
│     1 │ 2023-04-16T20:12:06 │ [b, d, c, a] │              NULL │               NULL │               NULL │
│     2 │ 2023-11-11T03:38:00 │ [d]          │ 143.2773325891486 │ 348.14722168023326 │ 241.33740756266872 │
└───────┴─────────────────────┴──────────────┴───────────────────┴────────────────────┴────────────────────┘

Note especially that we are not using the PostgreSQL dialect. We are using DuckDB's SQL dialect!

## Transactions

We can use transactions too! For example, let's start a transaction and insert a new row to the test table we created.

In [28]:
query = """
BEGIN;
INSERT INTO pg.duckdbexamples.pg_test_table (id) VALUES (999);
FROM pg.duckdbexamples.pg_test_table;
"""
duckdb.sql(query)

┌───────┬─────────────────────┬──────────────┬───────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │       col1        │        col3        │        col2        │
│ int64 │       varchar       │  varchar[]   │      double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼───────────────────┼────────────────────┼────────────────────┤
│     1 │ 2023-04-16T20:12:06 │ [b, d, c, a] │              NULL │               NULL │               NULL │
│     2 │ 2023-11-11T03:38:00 │ [d]          │ 143.2773325891486 │ 348.14722168023326 │ 241.33740756266872 │
│   999 │ NULL                │ NULL         │              NULL │               NULL │               NULL │
└───────┴─────────────────────┴──────────────┴───────────────────┴────────────────────┴────────────────────┘

Now we can ROLLBACK, and the row is gone, as expected.

In [29]:
query = """
ROLLBACK;
FROM pg.duckdbexamples.pg_test_table;
"""
duckdb.sql(query)

┌───────┬─────────────────────┬──────────────┬───────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │       col1        │        col3        │        col2        │
│ int64 │       varchar       │  varchar[]   │      double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼───────────────────┼────────────────────┼────────────────────┤
│     1 │ 2023-04-16T20:12:06 │ [b, d, c, a] │              NULL │               NULL │               NULL │
│     2 │ 2023-11-11T03:38:00 │ [d]          │ 143.2773325891486 │ 348.14722168023326 │ 241.33740756266872 │
└───────┴─────────────────────┴──────────────┴───────────────────┴────────────────────┴────────────────────┘

For more on transactions in DuckDB, see the [transactions documentation](https://duckdb.org/docs/sql/statements/transactions.html), and for caveats with transactions when you have multiple databases attached, see the [transactional semantics of ATTACH](https://duckdb.org/docs/sql/statements/attach.html#transactional-semantics).

Let's then delete the schema we created in Postgres.

In [30]:
duckdb.sql("DROP SCHEMA pg.duckdbexamples CASCADE")

# 5 - Concurrency and MotherDuck

We saw above that we can create and use DuckDB databases with `duckdb.connect()`. An important thing to note with these databases is the [concurrency](https://duckdb.org/docs/connect/concurrency.html) model of DuckDB. In short:
- a single process can both read from and write to a database, OR
- multiple processes can read from a database, but none of them can write to it.

In other words, if a process needs to write to a database file, no other process can already be accessing the database, and no other process can access it while the write connection is open. To illustrate this limitation, run the script `read_database.py`, which simply tries to open the DuckDB database `test.db` we created above in read-only mode. Because the connection `conn` we opened earlier still holds a lock on the file, the script will fail with an exception along the lines of
```
duckdb.duckdb.IOException: IO Error: Could not set lock on file "/workspaces/DuckDB-examples/test.db": Resource temporarily unavailable
```
To release the lock we currently have on the file we need to `DETACH`. Before we can `DETACH` the database, we need to switch to a different default database.

In [31]:
conn.sql("""
ATTACH ':memory:';
USE memory;
DETACH test;
""")
conn.close()

Run the script `read_database.py` again. This time it should not throw any exceptions, and it should keep the read-only connection open until you terminate the script.

Now that the database is locked in read-only mode, we cannot open it in write mode, but we can open it in read-only mode.

In [32]:
try:
    with duckdb.connect("test.db") as write_conn:
        write_conn.sql("SHOW TABLES").show()
        print("Managed to open in write mode.")
except duckdb.duckdb.IOException:
    print("Opening in write mode failed.")

Opening in write mode failed.


In [33]:
try:
    with duckdb.connect("test.db", read_only=True) as read_conn:
        read_conn.sql("SHOW TABLES").show()
        print("Managed to open in read mode.")
except duckdb.duckdb.IOException:
    print("Opening in read mode failed.")

┌───────────────────┐
│       name        │
│      varchar      │
├───────────────────┤
│ yet_another_table │
└───────────────────┘

Managed to open in read mode.


The above two cells also show how you can use a context manager to handle the connection to a database file. With a context manager, the connection is closed automatically when the block exits, so you don't need to `DETACH` and call `.close()` yourself to release the lock and clean up the connection. In general, it is good practice to use a context manager when you connect to database files.