# DuckDB as Fugue Backend

[DuckDB](https://duckdb.org/) is an in-process SQL OLAP database management system. It performs well on local machines, even on gigabytes of data.
Fugue integrates deeply with DuckDB. It not only uses DuckDB as the SQL engine, but also implements all [execution engine](../advanced/execution_engine.ipynb) methods using DuckDB SQL and relations. As a result, the data tables stay in DuckDB for most of the workflow, and only in rare cases
are they materialized and converted to Arrow dataframes. This [blog post](https://duckdb.org/2021/12/03/duck-arrow.html) explains that converting
between DuckDB and Arrow has minimal overhead.

## Installation

```bash
# Quoting the extra keeps shells such as zsh from expanding the brackets
pip install "fugue[duckdb]"
```

## Hello World

To use it programmatically, you only need to import the `fugue_duckdb` module:

In [2]:
import fugue_duckdb

This registers all the DuckDB types and execution engines. Then you can write a hello world:

In [4]:
from fugue_sql import fsql
import pandas as pd

df = pd.DataFrame(dict(a=[0,1,1,2], b=[10,20,30,40]))

fsql("""
SELECT a, SUM(b) AS b FROM df GROUP BY a
PRINT
""", df=df).run("duckdb")

DuckDataFrame
a:long|b:long
------+------
0     |10    
1     |50    
2     |40    
Total count: 3



DataFrames()

Now, let's consider the notebook use case. Calling `setup()` enables the `%%fsql` cell magic used in the rest of this tutorial:

In [5]:
from fugue_notebook import setup

setup()


## A Practical Workflow

In this workflow, we will create some mock dataframes, save them to CSV, load them back and save them to Parquet, and then do some basic EDA on the dataframes.

### Create data and save and load

In [33]:
import numpy as np

np.random.seed(0)
n = 1000

df1 = pd.DataFrame(dict(
    a = np.random.choice(["a", "b", "c"], n),
    b = np.random.rand(n),
    c = pd.date_range(start="2020-01-01 14:15:16", periods=n, freq="s")
))

df2 = pd.DataFrame(dict(
    c = pd.date_range(start="2020-01-01 14:15:16", periods=n, freq="s"),
    d = np.random.choice([True, False, None], n),
))

In [34]:
%%fsql duck
PRINT df1, df2

Unnamed: 0,a,b,c
0,a,0.671383,2020-01-01 14:15:16
1,b,0.344718,2020-01-01 14:15:17
2,a,0.713767,2020-01-01 14:15:18
3,b,0.639187,2020-01-01 14:15:19
4,b,0.399161,2020-01-01 14:15:20
5,c,0.43176,2020-01-01 14:15:21
6,a,0.614528,2020-01-01 14:15:22
7,c,0.070042,2020-01-01 14:15:23
8,a,0.822407,2020-01-01 14:15:24
9,a,0.653421,2020-01-01 14:15:25


Unnamed: 0,c,d
0,2020-01-01 14:15:16,
1,2020-01-01 14:15:17,
2,2020-01-01 14:15:18,
3,2020-01-01 14:15:19,True
4,2020-01-01 14:15:20,True
5,2020-01-01 14:15:21,True
6,2020-01-01 14:15:22,
7,2020-01-01 14:15:23,
8,2020-01-01 14:15:24,True
9,2020-01-01 14:15:25,False


In [35]:
%%fsql duck
SAVE df1 OVERWRITE "/tmp/df1.csv" (header=TRUE)
SAVE df2 OVERWRITE "/tmp/df2.csv" (header=FALSE)

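Load the CSVs back and save them as Parquet files. Parquet is generally a better choice than CSV because it stores the schema with the data. CSV does not, which is why `df1.csv` is loaded with `infer_schema=TRUE` and the headerless `df2.csv` needs an explicit `COLUMNS` schema.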

In [36]:
%%fsql duck
LOAD "/tmp/df1.csv" (header=TRUE, infer_schema=TRUE)
SAVE OVERWRITE "/tmp/df1.parquet"
LOAD "/tmp/df2.csv" COLUMNS c:datetime,d:bool
SAVE AND USE OVERWRITE "/tmp/df2.parquet"
PRINT

Unnamed: 0,c,d
0,2020-01-01 14:15:16,
1,2020-01-01 14:15:17,
2,2020-01-01 14:15:18,
3,2020-01-01 14:15:19,True
4,2020-01-01 14:15:20,True
5,2020-01-01 14:15:21,True
6,2020-01-01 14:15:22,
7,2020-01-01 14:15:23,
8,2020-01-01 14:15:24,True
9,2020-01-01 14:15:25,False
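
To see why the CSV load above needs schema hints, here is a minimal sketch using only the Python standard library (not Fugue or DuckDB). It writes two rows shaped like `df2` to an in-memory CSV and reads them back. The `rows` data and the `parse_row` helper are made up for this illustration.

```python
import csv
import io
from datetime import datetime

# Two hypothetical rows shaped like df2: a timestamp and a nullable boolean.
rows = [
    (datetime(2020, 1, 1, 14, 15, 16), None),
    (datetime(2020, 1, 1, 14, 15, 19), True),
]

# Write to an in-memory CSV without a header, similar to header=FALSE.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)

# Read it back: every value is now a plain string, and None became "".
buf.seek(0)
loaded = list(csv.reader(buf))


def parse_row(row):
    """Hypothetical helper that re-applies the schema c:datetime, d:bool."""
    c = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
    d = None if row[1] == "" else row[1] == "True"
    return c, d


typed = [parse_row(r) for r in loaded]
```

A format that carries its own schema, such as Parquet, avoids this manual re-typing step, which is why the later cells can `LOAD` the Parquet files without any schema hints.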


### Basic EDA

In [37]:
%%fsql duck
df1 = LOAD "/tmp/df1.parquet"
df2 = LOAD "/tmp/df2.parquet"
df3 = 
    SELECT df1.*, df2.d FROM df1 INNER JOIN df2 ON df1.c = df2.c
    YIELD DATAFRAME

PRINT ROWCOUNT

Unnamed: 0,a,b,c,d
0,c,0.800256,2020-01-01 14:23:24,False
1,b,0.955568,2020-01-01 14:23:25,
2,b,0.31655,2020-01-01 14:23:26,True
3,b,0.826805,2020-01-01 14:23:27,True
4,a,0.103991,2020-01-01 14:23:28,True
5,a,0.633982,2020-01-01 14:23:29,
6,c,0.751032,2020-01-01 14:23:30,False
7,a,0.155978,2020-01-01 14:23:31,
8,a,0.426002,2020-01-01 14:23:32,False
9,c,0.892707,2020-01-01 14:23:33,True


The yielded `df3` can be used directly in the following cells. `YIELD DATAFRAME` and `YIELD FILE` are extremely useful during EDA.

In [38]:
%%fsql duck
top2 = SELECT a, SUM(b) AS b FROM df3 GROUP BY a ORDER BY b DESC LIMIT 2
top_groups = SELECT df3.* FROM df3 INNER JOIN top2 ON df3.a = top2.a YIELD DATAFRAME
PRINT ROWCOUNT

SELECT minute(c) AS m, COUNT(*) AS ct GROUP BY 1 ORDER BY 2 DESC
PRINT

Unnamed: 0,a,b,c,d
0,c,0.800256,2020-01-01 14:23:24,False
1,b,0.955568,2020-01-01 14:23:25,
2,b,0.31655,2020-01-01 14:23:26,True
3,b,0.826805,2020-01-01 14:23:27,True
4,c,0.751032,2020-01-01 14:23:30,False
5,c,0.892707,2020-01-01 14:23:33,True
6,b,0.103578,2020-01-01 14:23:34,False
7,c,0.018096,2020-01-01 14:23:35,True
8,c,0.590585,2020-01-01 14:23:36,True
9,c,0.798689,2020-01-01 14:23:38,False


Unnamed: 0,m,ct
0,24,46
1,25,45
2,28,43
3,19,42
4,26,42
5,31,41
6,30,40
7,17,39
8,18,39
9,22,39
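
For readers more familiar with pandas, the `top2` / `top_groups` logic above can be sketched roughly as follows. This is an illustrative equivalent on a tiny made-up frame, not what Fugue or DuckDB executes internally, and the sample values are assumptions.

```python
import pandas as pd

# A tiny stand-in for df3 (hypothetical values).
df3 = pd.DataFrame({
    "a": ["a", "b", "c", "a", "b", "c"],
    "b": [0.1, 0.9, 0.5, 0.2, 0.8, 0.6],
})

# SELECT a, SUM(b) AS b FROM df3 GROUP BY a ORDER BY b DESC LIMIT 2
totals = df3.groupby("a")["b"].sum()
top2_keys = totals.nlargest(2).index

# The INNER JOIN against top2 acts as a filter: keep only rows whose
# group key is among the two groups with the largest totals.
top_groups = df3[df3["a"].isin(top2_keys)]
```

Because `top2` has one row per group, the inner join neither duplicates nor reorders columns here, so a membership filter gives the same rows.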


And a few statements specific to Fugue SQL:

In [54]:
%%fsql duck
TAKE 2 ROWS FROM top_groups PREPARTITION BY a PRESORT b
PRINT

FILL NULLS PARAMS d:TRUE FROM top_groups
PRINT

SAMPLE 1 PERCENT SEED 0 FROM top_groups 
PRINT ROWCOUNT

DROP ROWS IF ANY NULLS FROM top_groups
PRINT ROWCOUNT

DROP COLUMNS b, x IF EXISTS FROM top_groups
PRINT

Unnamed: 0,a,b,c,d
0,b,0.000546,2020-01-01 14:17:51,True
1,b,0.024273,2020-01-01 14:21:34,True
2,c,0.001383,2020-01-01 14:16:47,
3,c,0.004655,2020-01-01 14:24:44,True


Unnamed: 0,a,b,c,d
0,c,0.800256,2020-01-01 14:23:24,False
1,b,0.955568,2020-01-01 14:23:25,True
2,b,0.31655,2020-01-01 14:23:26,True
3,b,0.826805,2020-01-01 14:23:27,True
4,c,0.751032,2020-01-01 14:23:30,False
5,c,0.892707,2020-01-01 14:23:33,True
6,b,0.103578,2020-01-01 14:23:34,False
7,c,0.018096,2020-01-01 14:23:35,True
8,c,0.590585,2020-01-01 14:23:36,True
9,c,0.798689,2020-01-01 14:23:38,False


Unnamed: 0,a,b,c,d
0,b,0.357639,2020-01-01 14:24:43,
1,b,0.401688,2020-01-01 14:24:47,False
2,b,0.898825,2020-01-01 14:30:34,
3,c,0.416934,2020-01-01 14:31:55,


Unnamed: 0,a,b,c,d
0,c,0.800256,2020-01-01 14:23:24,False
1,b,0.31655,2020-01-01 14:23:26,True
2,b,0.826805,2020-01-01 14:23:27,True
3,c,0.751032,2020-01-01 14:23:30,False
4,c,0.892707,2020-01-01 14:23:33,True
5,b,0.103578,2020-01-01 14:23:34,False
6,c,0.018096,2020-01-01 14:23:35,True
7,c,0.590585,2020-01-01 14:23:36,True
8,c,0.798689,2020-01-01 14:23:38,False
9,c,0.388404,2020-01-01 14:23:41,False


Unnamed: 0,a,c,d
0,c,2020-01-01 14:23:24,False
1,b,2020-01-01 14:23:25,
2,b,2020-01-01 14:23:26,True
3,b,2020-01-01 14:23:27,True
4,c,2020-01-01 14:23:30,False
5,c,2020-01-01 14:23:33,True
6,b,2020-01-01 14:23:34,False
7,c,2020-01-01 14:23:35,True
8,c,2020-01-01 14:23:36,True
9,c,2020-01-01 14:23:38,False
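
As a rough guide to what these statements do, the sketch below shows approximate pandas counterparts on a small made-up frame. The mapping is an assumption for illustration only: Fugue runs these statements on DuckDB, and in particular the rows chosen by `SAMPLE` will differ between engines.

```python
import pandas as pd

# A tiny stand-in for top_groups (hypothetical values).
top_groups = pd.DataFrame({
    "a": ["b", "b", "b", "c", "c", "c"],
    "b": [0.3, 0.1, 0.2, 0.6, 0.4, 0.5],
    "d": [True, None, False, None, True, False],
})

# TAKE 2 ROWS FROM top_groups PREPARTITION BY a PRESORT b
# -> sort by b ascending, then keep the first 2 rows of each group of a
take2 = top_groups.sort_values("b").groupby("a").head(2)

# FILL NULLS PARAMS d:TRUE FROM top_groups
# -> replace nulls in column d with True
filled = top_groups.fillna({"d": True})

# SAMPLE 1 PERCENT SEED 0 FROM top_groups
# -> random sample with a fixed seed (50% here, since the frame is tiny)
sampled = top_groups.sample(frac=0.5, random_state=0)

# DROP ROWS IF ANY NULLS FROM top_groups
# -> drop rows containing at least one null
no_nulls = top_groups.dropna(how="any")

# DROP COLUMNS b, x IF EXISTS FROM top_groups
# -> drop the listed columns, ignoring names that do not exist (x)
dropped = top_groups.drop(columns=["b", "x"], errors="ignore")
```

Note how `IF EXISTS` corresponds to `errors="ignore"`: column `x` is not in the frame, and the statement still succeeds.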
