# Dask Match Demonstration

We make a Dask client, mostly so we can use the dashboard.

In [None]:
from dask.distributed import Client
client = Client()

## Example DataFrame

In [None]:
import dask_expr as dx

df = dx.datasets.timeseries(
    start="2000-01-01", 
    end="2000-12-30", 
    freq="100ms",
)
df

In [None]:
df.head()

## We can do high-level query optimization

In [None]:
out = df[df.id == 1000].sum()["x"]

In [None]:
%%time
out.compute()

In [None]:
%%time
out.optimize().compute()

## Let's inspect what's going on

In database terms: 

-  Historically Dask DataFrame/Array never had a logical plan.
-  Instead we wrote everything immediately as a physical plan.

   This was great for flexibility and expressivity, which is what early users craved, but it holds us back now that we target less sophisticated users.

Now we're adding a logical plan around these high-level collections.
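To make the logical/physical distinction concrete, here is a minimal pure-Python sketch. It is not Dask's implementation: the `Source`, `Filter`, and `Sum` classes and the `lower` function are hypothetical names invented for illustration. The logical plan is a small tree that records intent; lowering it produces a flat dict of per-partition tasks, which is roughly the form earlier Dask collections built immediately.

```python
from dataclasses import dataclass


# Toy logical plan nodes (hypothetical, not dask_expr classes).
# Creating them records intent and does no work.
@dataclass(frozen=True)
class Source:
    columns: tuple
    npartitions: int


@dataclass(frozen=True)
class Filter:
    frame: object
    column: str
    value: int


@dataclass(frozen=True)
class Sum:
    frame: object


def lower(expr):
    """Lower a logical plan to a toy physical plan.

    Returns (tasks, keys): a dict of named tasks and the output keys of expr.
    """
    if isinstance(expr, Source):
        keys = [("source", i) for i in range(expr.npartitions)]
        tasks = {key: ("load", expr.columns, key[1]) for key in keys}
        return tasks, keys
    if isinstance(expr, Filter):
        tasks, inputs = lower(expr.frame)
        keys = [("filter", i) for i in range(len(inputs))]
        for key, inp in zip(keys, inputs):
            tasks[key] = ("filter", inp, expr.column, expr.value)
        return tasks, keys
    if isinstance(expr, Sum):
        tasks, inputs = lower(expr.frame)
        # A single reduction task; real systems would usually reduce in a tree.
        tasks[("sum", 0)] = ("sum", inputs)
        return tasks, [("sum", 0)]
    raise TypeError(f"cannot lower {type(expr).__name__}")


# Logical plan: compact, inspectable, and easy to rewrite.
plan = Sum(Filter(Source(("id", "x", "y"), npartitions=3), "id", 1000))

# Physical plan: one task per partition per step, plus the final reduction.
tasks, keys = lower(plan)
```

In this toy, the three-node plan lowers to seven tasks. Any rewrite is far easier to apply to the three nodes than to the task dict, which is the motivation for keeping a logical plan around.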

In [None]:
out.pprint()

In [None]:
out.optimize(fuse=False).pprint()

In [None]:
out.optimize().pprint()

In [None]:
df.x + 1    # good: select the column first, then add

In [None]:
(df + 1).x  # bad: adds to every column, then throws most of them away
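The two cells above produce the same column, but the second, taken literally, does the addition on every column. With a logical plan, an optimizer can rewrite the second form into the first. Below is a minimal sketch of such a rewrite, assuming toy node classes (`Frame`, `Add`, `Projection`) and a hypothetical `simplify` function; it illustrates the idea of projection pushdown and is not the rule as implemented in `dask_expr`.

```python
from dataclasses import dataclass


# Toy expression nodes (hypothetical names, for illustration only).
@dataclass(frozen=True)
class Frame:
    columns: tuple


@dataclass(frozen=True)
class Add:
    frame: object
    value: int


@dataclass(frozen=True)
class Projection:
    frame: object
    column: str


def simplify(expr):
    """Push a column projection beneath an elementwise add.

    Projection(Add(frame, v), col)  becomes  Add(Projection(frame, col), v)
    """
    if isinstance(expr, Projection) and isinstance(expr.frame, Add):
        add = expr.frame
        pushed = simplify(Projection(add.frame, expr.column))
        return Add(pushed, add.value)
    return expr


frame = Frame(("id", "x", "y"))
good = Add(Projection(frame, "x"), 1)   # like df.x + 1
bad = Projection(Add(frame, 1), "x")    # like (df + 1).x

# After the rewrite, the "bad" plan is identical to the "good" one.
rewritten = simplify(bad)
```

The rewrite is safe here because adding a scalar is elementwise and column-independent, so selecting `x` before or after the addition gives the same result.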

In [None]:
out.expr.visualize()

## Let's look at expressions

In [None]:
# A Dask DataFrame/Series class that matches the API users expect

out = (df.x + 1)
out

In [None]:
# Holds an expression object, which captures user intent

out.expr

In [None]:
# Operation is stored in the type

type(out.expr)

In [None]:
# Follows a standard class hierarchy

type(out.expr).mro()

In [None]:
# Most optimizations are written on the operations themselves

out.expr._simplify_down??

In [None]:
# State managed as parameters (names)

out.expr._parameters

In [None]:
# ... and operands (values) which are often other expressions

out.expr.operands

In [None]:
out.expr.left

In [None]:
out.expr.right

In [None]:
type(out.expr.left)

In [None]:
# Pair each parameter name with its value on the left operand
dict(
    zip(
        out.expr.left._parameters,
        out.expr.left.operands,
    )
)

In [None]:
# Same again, one level further down the expression tree
dict(
    zip(
        out.expr.left.frame._parameters,
        out.expr.left.frame.operands,
    )
)
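The cells above rely on a simple convention: each expression class lists parameter names in `_parameters`, stores the matching values in `operands`, and exposes each parameter as an attribute. Here is a minimal sketch of how that convention could be implemented. The `Expr` base class below is a simplified stand-in written for this example, and the subclasses and their parameter lists are assumptions, not copies of the `dask_expr` classes.

```python
class Expr:
    """Simplified stand-in for an expression base class (illustrative only)."""

    _parameters = []

    def __init__(self, *operands):
        self.operands = list(operands)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails:
        # map a parameter name to the operand in the same position.
        params = type(self)._parameters
        if name in params:
            return self.operands[params.index(name)]
        raise AttributeError(name)


# Hypothetical expression types; the operation is stored in the type.
class Timeseries(Expr):
    _parameters = ["start", "end", "freq"]


class Projection(Expr):
    _parameters = ["frame", "columns"]


class Add(Expr):
    _parameters = ["left", "right"]


source = Timeseries("2000-01-01", "2000-12-30", "100ms")
expr = Add(Projection(source, "x"), 1)  # plays the role of df.x + 1

# Parameter names paired with values, as in the cells above
left_state = dict(zip(expr.left._parameters, expr.left.operands))
```

Because state lives in a plain list of operands, generic code can walk or rebuild a tree without knowing anything about the individual operation types.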

## What works today and what doesn't

#### Works

-  Native Dask collection
-  Standard optimizations (column projection, predicate pushdown, ...)
-  Proof of concept for most operation types
    -  blockwise
    -  reductions
    -  groupby aggregations
    -  sorts/shuffling
    -  data ingestion (like parquet)

It also feels pretty clean and easy to work on from a maintainability perspective.

#### Doesn't work

-  API completeness (lots of fill-in to do)
-  Data writing (but this is easy)
-  Task Annotations / priorities / worker restrictions

#### Future plans

-  Adding new protocols, like parquet-style metadata (working on this now)
-  Keep the expressions on the Scheduler (can probably make better decisions)