# 1 - Getting started

## First commands

Getting DuckDB running is as simple as installing the `duckdb` package with pip (`pip install duckdb`) and importing it.

In [1]:
import duckdb

If you don't need to persist the database after your session ends, you can run queries immediately against the default in-memory database with `duckdb.sql`.

In [2]:
query = """
SELECT 'Hello World!'
"""
res = duckdb.sql(query)
print(type(res))
print(res)

<class 'duckdb.duckdb.DuckDBPyRelation'>
┌────────────────┐
│ 'Hello World!' │
│    varchar     │
├────────────────┤
│ Hello World!   │
└────────────────┘



Printing the result displays it in the format shown above. Alternatively, you can call the result object's `.show()` method.

In [3]:
res.show()

┌────────────────┐
│ 'Hello World!' │
│    varchar     │
├────────────────┤
│ Hello World!   │
└────────────────┘



When you need to access the query results in Python, you can convert the result to
- a list of Python tuples with `res.fetchall()`
- a Pandas DataFrame with `res.df()` or `res.to_df()`

In [4]:
ls = res.fetchall()
print(ls)

[('Hello World!',)]


In [5]:
df = res.df()
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


  'Hello World!'
0   Hello World!


Naturally, you can create tables, insert values, create views, and so on, as in any database.

In [6]:
query = """
CREATE OR REPLACE TABLE test_table (
    int_col INTEGER,
    str_col VARCHAR
);
CREATE OR REPLACE TABLE another_table (
    int_col INTEGER
)
"""
duckdb.sql(query)
duckdb.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ another_table │
│ test_table    │
└───────────────┘

In [7]:
query = """
INSERT INTO test_table (int_col, str_col) VALUES
    (1, 'Hello'),
    (3, 'World'),
    (2, ' ')
"""
duckdb.sql(query)
duckdb.sql("FROM test_table ORDER BY int_col")

┌─────────┬─────────┐
│ int_col │ str_col │
│  int32  │ varchar │
├─────────┼─────────┤
│       1 │ Hello   │
│       2 │         │
│       3 │ World   │
└─────────┴─────────┘

In [8]:
query = """
CREATE OR REPLACE VIEW test_view AS (
    FROM test_table
    WHERE str_col != ' '
)
"""
duckdb.sql(query)
duckdb.sql("FROM test_view ORDER BY int_col")

┌─────────┬─────────┐
│ int_col │ str_col │
│  int32  │ varchar │
├─────────┼─────────┤
│       1 │ Hello   │
│       3 │ World   │
└─────────┴─────────┘

Note that in the DuckDB SQL dialect you can omit `SELECT *`, as in the queries above. You can also
- reorder `SELECT` and `FROM`, i.e. write `FROM table SELECT cols`,
- exclude columns instead of listing every column you want, i.e. `SELECT * EXCLUDE (cols, we, do, not, want) FROM table`,
- group by all non-aggregated columns, i.e. `SELECT ... FROM table GROUP BY ALL`.

See the [DuckDB documentation](https://duckdb.org/docs/sql/introduction) for the full SQL syntax.

## Persisting the database

If you want to persist your database in a file or open a previously saved database, you first need to create a connection to it and then use the connection instead of the `duckdb` module to run your queries. For example, the following cell creates a new database in the file `test.db`, or loads the database if the file already exists.

In [9]:
conn = duckdb.connect("test.db")
query = """
SELECT 'Hello World!'
"""
conn.sql(query).show()

query = """
CREATE OR REPLACE TABLE yet_another_table (
    col INTEGER
)
"""
conn.sql(query)
conn.sql("SHOW TABLES").show()

┌────────────────┐
│ 'Hello World!' │
│    varchar     │
├────────────────┤
│ Hello World!   │
└────────────────┘

┌───────────────────┐
│       name        │
│      varchar      │
├───────────────────┤
│ yet_another_table │
└───────────────────┘



## Extensions

[Extensions](https://duckdb.org/docs/extensions/overview.html) allow you to add functionality to DuckDB. To see the list of extensions, you can use the `duckdb_extensions()` SQL function.

In [10]:
duckdb.sql("FROM duckdb_extensions()")

┌──────────────────┬─────────┬───────────┬──────────────────────┬──────────────────────────────────┬───────────────────┐
│  extension_name  │ loaded  │ installed │     install_path     │           description            │      aliases      │
│     varchar      │ boolean │  boolean  │       varchar        │             varchar              │     varchar[]     │
├──────────────────┼─────────┼───────────┼──────────────────────┼──────────────────────────────────┼───────────────────┤
│ arrow            │ false   │ false     │                      │ A zero-copy data integration b…  │ []                │
│ autocomplete     │ false   │ false     │                      │ Adds support for autocomplete …  │ []                │
│ aws              │ false   │ false     │                      │ Provides features that depend …  │ []                │
│ azure            │ false   │ false     │                      │ Adds a filesystem abstraction …  │ []                │
│ excel            │ false   │ f

In this tutorial we will need the `postgres` extension, or more specifically `postgres_scanner`. If the extension is listed as not installed, install and load it now, since we'll use it later.

In [11]:
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

In [12]:
duckdb.sql("FROM duckdb_extensions()")

┌──────────────────┬─────────┬───────────┬──────────────────────┬──────────────────────────────────┬───────────────────┐
│  extension_name  │ loaded  │ installed │     install_path     │           description            │      aliases      │
│     varchar      │ boolean │  boolean  │       varchar        │             varchar              │     varchar[]     │
├──────────────────┼─────────┼───────────┼──────────────────────┼──────────────────────────────────┼───────────────────┤
│ arrow            │ false   │ false     │                      │ A zero-copy data integration b…  │ []                │
│ autocomplete     │ false   │ false     │                      │ Adds support for autocomplete …  │ []                │
│ aws              │ false   │ false     │                      │ Provides features that depend …  │ []                │
│ azure            │ false   │ false     │                      │ Adds a filesystem abstraction …  │ []                │
│ excel            │ false   │ f

Note that you can also install and load extensions with the Python API functions `duckdb.install_extension` and `duckdb.load_extension`.

# 2 - Dataframes

DuckDB can interact with Pandas and Polars dataframes in both directions, i.e. it can read dataframes and convert query results to dataframes. See the documentation for full details: [Pandas](https://duckdb.org/docs/archive/0.9.2/guides/python/import_pandas), [Polars](https://duckdb.org/docs/archive/0.9.2/guides/python/polars).

As an example, we'll load one of our generated example Parquet files with all three (DuckDB, Pandas, and Polars) and see how to convert between them. If you haven't already, run the script `generate_example_data.py` before moving on. Note that to work with Pandas dataframes you only need to pip install Pandas, but to work with Polars dataframes you also need to install pyarrow.

In [14]:
import pandas as pd
import polars as pl

Let's first look at querying Pandas dataframes with DuckDB. If you have named the dataframe e.g. `df_pandas`, you can refer to it by that name in a DuckDB SQL query, just as you would refer to a table.

In [15]:
df_pandas = pd.read_parquet("data/df_1.parquet")

duckdb.sql("FROM df_pandas")

┌───────┬─────────────────────┬────────────────────┬────────────────────┬──────────────┬────────────────────┐
│  id   │      timestamp      │        col3        │        col2        │     tags     │        col1        │
│ int64 │       varchar       │       double       │       double       │  varchar[]   │       double       │
├───────┼─────────────────────┼────────────────────┼────────────────────┼──────────────┼────────────────────┤
│   101 │ 2023-02-26T01:30:43 │ 314.40823312422475 │ 246.18723693298915 │ [a, b]       │               NULL │
│   102 │ 2023-12-05T16:57:48 │               NULL │               NULL │ NULL         │ 135.62228296381673 │
│   103 │ 2023-07-15T13:45:17 │               NULL │  293.7636032104896 │ [b, c, d, a] │ 146.59103165511684 │
│   104 │ 2023-10-16T14:09:49 │               NULL │               NULL │ [c, d, a, b] │ 123.75590364288591 │
│   105 │ 2023-01-17T03:49:30 │               NULL │ 211.88470721466794 │ [c, d, a]    │               NULL │
│   106 │ 

As we already saw in the previous section, we can call `.df()` or `.to_df()` on a DuckDB query result to convert it to a Pandas dataframe. For example, let's run a query that collects all tags from the dataframe and convert the result back to a Pandas dataframe.

In [26]:
# unnest explodes the lists in the tags-column so DISTINCT(unnest(tags)) gets all the tags that appear in the column
duckdb.sql("SELECT DISTINCT(unnest(tags)) AS tag FROM df_pandas ORDER BY ALL").df()

  tag
0   a
1   b
2   c
3   d


Let's then see how to do the same steps with Polars instead. The only real difference when working with Polars dataframes is that a DuckDB query result is converted to a Polars dataframe with the `.pl()` method.

In [31]:
df_polars = pl.read_parquet("data/df_1.parquet")

duckdb.sql("FROM df_polars")

┌───────┬─────────────────────┬────────────────────┬────────────────────┬──────────────┬────────────────────┐
│  id   │      timestamp      │        col3        │        col2        │     tags     │        col1        │
│ int64 │       varchar       │       double       │       double       │  varchar[]   │       double       │
├───────┼─────────────────────┼────────────────────┼────────────────────┼──────────────┼────────────────────┤
│   101 │ 2023-02-26T01:30:43 │ 314.40823312422475 │ 246.18723693298915 │ [a, b]       │               NULL │
│   102 │ 2023-12-05T16:57:48 │               NULL │               NULL │ NULL         │ 135.62228296381673 │
│   103 │ 2023-07-15T13:45:17 │               NULL │  293.7636032104896 │ [b, c, d, a] │ 146.59103165511684 │
│   104 │ 2023-10-16T14:09:49 │               NULL │               NULL │ [c, d, a, b] │ 123.75590364288591 │
│   105 │ 2023-01-17T03:49:30 │               NULL │ 211.88470721466794 │ [c, d, a]    │               NULL │
│   106 │ 

In [32]:
duckdb.sql("SELECT DISTINCT(unnest(tags)) AS tag FROM df_polars ORDER BY ALL").pl()

tag
str
"a"
"b"
"c"
"d"


# 3 - Working with files

# 4 - Interacting with databases