# Introduction to DuckDB

DuckDB is an in-process, analytical SQL database. You can think of it as SQLite for analytics (OLAP), not transactions (OLTP).

Instead of running as a server, DuckDB runs inside your application process (Python, R, Java, C++, and others) and queries data where it already lives.

## What makes DuckDB special

#### Built for analytics

DuckDB is optimized for:

- Large scans
- Aggregations
- Joins
- Window functions
- Complex analytical queries

Think dashboards, feature engineering, ad-hoc analysis, and data science, not user logins or high-frequency writes.

#### In-process (no server)

There's:

- No daemon
- No port
- No client/server overhead

You just import it and run a query:

```python
import duckdb
duckdb.sql("SELECT * FROM 'data.parquet'")
```

#### Columnar + vectorized execution

Under the hood, DuckDB:

- Stores data in columnar format
- Processes data in vectors (chunks) instead of row-by-row

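This is a large part of why it's fast: analytical queries read only the columns they need, and per-row overhead is spread across a whole vector of values. For workloads that fit on a single machine, DuckDB is often competitive with Spark.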
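The idea can be illustrated in plain Python. This is a conceptual sketch only, not how DuckDB is implemented: the function names are made up, and the chunk size of 2048 simply mirrors DuckDB's default vector size.

```python
VECTOR_SIZE = 2048  # mirrors DuckDB's default vector size; illustrative only


def row_at_a_time_sum(rows):
    """Row-oriented style: visit every row and pick out one field."""
    total = 0
    for row in rows:
        total += row["price"]
    return total


def vectorized_sum(price_column, vector_size=VECTOR_SIZE):
    """Column-oriented style: touch only the needed column, one chunk at a time."""
    total = 0
    for start in range(0, len(price_column), vector_size):
        chunk = price_column[start:start + vector_size]
        total += sum(chunk)  # one tight loop per chunk
    return total


# Hypothetical data: the same values stored as rows and as a single column
rows = [{"id": i, "price": i % 10} for i in range(10_000)]
price_column = [row["price"] for row in rows]

print(row_at_a_time_sum(rows), vectorized_sum(price_column))
```

Both functions return the same answer. The columnar version never looks at the `id` field, and it does its work in batches, which is the property a real engine exploits with compiled, cache-friendly loops.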

#### Reads data directly (zero-copy vibes)

DuckDB can query:

- Parquet
- CSV
- JSON
- Arrow
- Pandas / Polars DataFrames
- S3 / HTTP / local files

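All without importing them into a database first. Remote sources such as S3 and HTTP are read through DuckDB's `httpfs` extension.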

```sql
SELECT avg(price)
FROM read_parquet('s3://bucket/sales/*.parquet');
```


### How it compares to other tools

| Tool     | Best at                  | DuckDB difference                                        |
| -------- | ------------------------ | -------------------------------------------------------- |
| SQLite   | App storage              | DuckDB is analytical                                     |
| Postgres | Transactions + analytics | DuckDB is faster for scans, simpler to embed             |
| Pandas   | In-memory analysis       | DuckDB scales better + SQL                               |
| Spark    | Distributed big data     | DuckDB is single-node but simpler & often faster locally |
| BigQuery | Cloud analytics          | DuckDB is local, offline, cheap (free)                   |


### Typical use cases

DuckDB shines when you:

- Do data analysis in Python/R
- Work with Parquet files
- Need SQL over DataFrames
- Want reproducible, local analytics
- Don't want to run a database server

Examples:

- Feature engineering for ML
- Exploratory data analysis
- Replacing slow pandas groupbys
- Local analytics before pushing to a warehouse


### Limitations

DuckDB is not:

- An OLTP database (it supports ACID transactions, but isn't tuned for many small concurrent writes)
- A multi-user, concurrent server
- Meant for high-write workloads

Rule of thumb:

> If you'd normally reach for Spark, BigQuery, or heavy pandas for data that fits on one machine, reach for DuckDB.
>
> If you need users, writes, and locks, consider Postgres.


## Installing DuckDB

If you're using Python without conda, install with pip:

```bash
pip install duckdb
```

If you're using a conda environment, install with:

```bash
conda install python-duckdb -c conda-forge
```


:::{tip} What is conda-forge?

conda-forge is a community-run package repository for Conda. It provides pre-built packages for a wide range of software, including DuckDB, making it easy to install and manage dependencies in Conda environments.

**Why conda-forge exists**

The default Conda channel (`defaults`, historically from Anaconda Inc.) has:

- Slower updates
- Fewer packages
- Occasional version lag or mismatches

conda-forge was created to:

- Provide newer packages faster
- Support more platforms
- Build packages against a shared set of pinned dependency versions so they work together

It's run by hundreds of open-source contributors.

:::


## Using DuckDB

Using DuckDB is as simple as importing the library and running SQL queries. Here's a quick example:


```python
import duckdb

duckdb.sql("SELECT 1").show()
```

```text
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘
```



### Reading data with DuckDB

DuckDB can read data from multiple sources, including Parquet files, CSVs, and even directly from Pandas DataFrames. You can query these data sources using SQL syntax, making it easy to perform complex analyses without needing to load everything into memory first.

From [DuckDB Documentation](https://duckdb.org/docs/stable/clients/python/overview):

```python
import duckdb

duckdb.read_csv("example.csv")                # read a CSV file into a Relation
duckdb.read_parquet("example.parquet")        # read a Parquet file into a Relation
duckdb.read_json("example.json")              # read a JSON file into a Relation

duckdb.sql("SELECT * FROM 'example.csv'")     # directly query a CSV file
duckdb.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
duckdb.sql("SELECT * FROM 'example.json'")    # directly query a JSON file
```


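DuckDB can directly query Pandas DataFrames that are already in memory, without first copying them into a DuckDB table. It finds the DataFrame by its Python variable name, so you can use that name as if it were a table. This lets you leverage the power of SQL for data manipulation while still working with familiar DataFrame structures.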


```python
import pandas as pd

pandas_df = pd.DataFrame({"name": ["John", "Jane"], "age": [30, 25]})
duckdb.sql("SELECT * FROM pandas_df")
```

```text
┌─────────┬───────┐
│  name   │  age  │
│ varchar │ int64 │
├─────────┼───────┤
│ John    │    30 │
│ Jane    │    25 │
└─────────┴───────┘
```

### Persistent Storage

You can create a connection to a persistent database using `duckdb.connect(dbname)`, where `dbname` is the path to a database file. That said, DuckDB is often used for ad-hoc analysis without needing to manage a database file.

For our credit card transactions analysis, we'll be using DuckDB to query Parquet files directly, so we won't need to set up a persistent database.
