Skip to content

Commit

Permalink
Update
Browse files Browse the repository at this point in the history
  • Loading branch information
phofl committed May 3, 2024
1 parent fd24c7a commit d9c6a9d
Showing 1 changed file with 98 additions and 137 deletions.
235 changes: 98 additions & 137 deletions docs/source/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,77 +40,89 @@ Dask offers a set of
API that give access to disitributed DataFrames build on pandas, distributed Arrays build on NumPy
and access to low level parallelism with a Futures API.

We will first create a Dask Cluster to run our computations on before we do a few
exemplary DataFrame computations.
Local
"""""

.. tab-set::
Dask can be used locally on a single machine or on a cluster of machines that is hosted
somewhere. We recommend starting with a cluster that runs on your local machine to get
a feel for the API and the way Dask works.

.. tab-item:: Local
We can create a fully-featured Dask Cluster that is running on our local machine.
This gives us access to multi-process computations and the diagnostic Dashboard.

We can create a fully-featured Dask Cluster that is running on our local machine.
This gives us access to multi-process computations and the diagnostic Dashboard.
.. code-block:: python
.. code-block:: python
from dask.distributed import LocalCluster
cluster = LocalCluster() # Fully-featured local Dask cluster
client = cluster.get_client()
from dask.distributed import LocalCluster
cluster = LocalCluster() # Fully-featured local Dask cluster
client = cluster.get_client()
The :class:`~distributed.LocalCluster` follows the same interface as all other Dask Clusters, so
you can easily switch to a distributed cluster.
The :class:`~distributed.LocalCluster` follows the same interface as all other Dask Clusters, so
you can easily switch to a distributed cluster.

The Dask DataFrame API is used for scaling out pandas computations that can run
on all cores of our machine. Dask uses pandas under the hood to perform the actual
computations, so the API has a very pandas-like feel.
The Dask DataFrame API is used for scaling out pandas computations that can run
on all cores of our machine. Dask uses pandas under the hood to perform the actual
computations, so the API has a very pandas-like feel.

.. code-block:: python
:emphasize-lines: 7
.. code-block:: python
:emphasize-lines: 7
import dask.dataframe as dd
import pandas as pd
import dask.dataframe as dd
import pandas as pd
index = pd.date_range("2021-09-01", periods=2400, freq="1h")
pdf = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
index = pd.date_range("2021-09-01", periods=2400, freq="1h")
pdf = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
df = dd.from_pandas(df, npartitions=10)
df
df = dd.from_pandas(df, npartitions=10)
df
Dask DataFrame Structure:
a b
npartitions=10
2021-09-01 00:00:00 int64 string
2021-09-11 00:00:00 ... ...
... ... ...
2021-11-30 00:00:00 ... ...
2021-12-09 23:00:00 ... ...
Dask Name: frompandas, 1 expression
Expr=df
Dask DataFrame Structure:
a b
npartitions=10
2021-09-01 00:00:00 int64 string
2021-09-11 00:00:00 ... ...
... ... ...
2021-11-30 00:00:00 ... ...
2021-12-09 23:00:00 ... ...
Dask Name: frompandas, 1 expression
Expr=df
:func:`from_pandas` can be used to create a Dask DataFrame from the
pandas version. Dask also offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
common data sources.
:func:`from_pandas` can be used to create a Dask DataFrame from the
pandas version. Dask also offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
common data sources.

The DataFrame API is almost identical to the pandas API. We can use normal pandas
methodology to execute operations on the Dask DataFrame.
The DataFrame API is almost identical to the pandas API. We can use normal pandas
methodology to execute operations on the Dask DataFrame.

The main difference is that Dask is lazy and doesn't actually execute the computation
before we ask for it. We can trigger the computation with :func:`~dask.compute` or
:meth:`DataFrame.compute`.
The main difference is that Dask is lazy and doesn't actually execute the computation
before we ask for it. We can trigger the computation with :func:`~dask.compute` or
:meth:`DataFrame.compute`.

.. code-block:: python
.. code-block:: python
df.groupby("b").a.mean().compute()
b
a 1197.5
b 1199.5
c 1198.0
d 1200.5
e 1203.0
Name: a, dtype: float64
df.groupby("b").a.mean().compute()
Triggering a computation will execute the query on the Dask Cluster and materialize the
result as a pandas DataFrame or Series.

b
a 1197.5
b 1199.5
c 1198.0
d 1200.5
e 1203.0
Name: a, dtype: float64
Scaling Out
"""""""""""

Triggering a computation will execute the query on the Dask Cluster and materialize the
result as a pandas DataFrame or Series.
Running Dask on your machine will restrict you to the resources that you have available
locally. We will now show how we can scale out onto a Cluster with dozens or hundreds of
workers. Dask Clusters can be hosted in the Cloud or on any local data center or just
a cluster of machines.


.. tab-set::

.. tab-item:: Managed Cloud

Expand All @@ -124,54 +136,6 @@ exemplary DataFrame computations.
cluster = coiled.Cluster(n_workers=15, region="us-east-2")
client = cluster.get_client()
The Dask DataFrame API is used for scaling out pandas computations that can run
on all cores of our machine. Dask uses pandas under the hood to perform the actual
computations, so the API has a very pandas-like feel.

.. code-block:: python
:emphasize-lines: 3
import dask.dataframe as dd
df = dd.read_parquet("s3://coiled-data/uber/")
df
Dask DataFrame Structure:
hvfhs_license_num tips ...
npartitions=720
string float64
... ...
... ... ...
... ...
... ...
Dask Name: read_parquet, 1 expression
Expr=ReadParquetFSSpec(8e22969)
Dask offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
common data sources. The interface of these connectors is very similar to the pandas
equivalent. :func:`read_parquet` has the same behavior as the pandas function.

The DataFrame API is almost identical to the pandas API. We can use normal pandas
methodology to execute operations on the Dask DataFrame.

The main difference is that Dask is lazy and doesn't actually execute the computation
before we ask for it. We can trigger the computation with :func:`~dask.compute` or
:meth:`DataFrame.compute`.

.. code-block:: python
df.groupby("hvfhs_license_num").tips.mean().compute()
hvfhs_license_num
HV0005 0.946925
HV0002 0.338458
HV0004 0.217237
HV0003 0.736362
Name: tips, dtype: float64
Triggering a computation will execute the query on the Coiled Cluster and materialize the
result as a pandas DataFrame on our local machine.

.. tab-item:: Other Deployments

Dask can be deployed with a number of other options that all have their strengths and weaknesses.
Expand All @@ -183,58 +147,55 @@ exemplary DataFrame computations.
from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
cluster.scale(10)
cluster.scale(15)
This is only a short example and needs more setup for proper use. See the
:ref:`Dask Kubernetes <deploying-kubernetes>` documentation for more information.

The Dask DataFrame API is used for scaling out pandas computations that can run
on all cores of our machine. Dask uses pandas under the hood to perform the actual
computations, so the API has a very pandas-like feel.

.. code-block:: python
:emphasize-lines: 3
Everything works exactly the same as on our local machine, but we can now run computations
on much bigger datasets. We will load a dataset from s3 that holds the `NYC Taxi dataset`_
with around 200 GB of data.

import dask.dataframe as dd
.. code-block:: python
:emphasize-lines: 3
df = dd.read_parquet("s3://coiled-data/uber/")
df
import dask.dataframe as dd
Dask DataFrame Structure:
hvfhs_license_num tips ...
npartitions=720
string float64
... ...
... ... ...
... ...
... ...
Dask Name: read_parquet, 1 expression
Expr=ReadParquetFSSpec(8e22969)
df = dd.read_parquet("s3://coiled-data/uber/")
df
Dask offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
common data sources. The interface of these connectors is very similar to the pandas
equivalent. :func:`read_parquet` has the same behavior as the pandas function.
Dask DataFrame Structure:
hvfhs_license_num tips ...
npartitions=720
string float64
... ...
... ... ...
... ...
... ...
Dask Name: read_parquet, 1 expression
Expr=ReadParquetFSSpec(8e22969)
The DataFrame API is almost identical to the pandas API. We can use normal pandas
methodology to execute operations on the Dask DataFrame.
The main difference is that Dask is lazy and doesn't actually execute the computation
before we ask for it. We can trigger the computation with :func:`~dask.compute` or
:meth:`DataFrame.compute`.
The cluster now allows us to process the data much more efficiently than what we could
do on our local machine. Computing the average tip for drivers from Uber and Lyft runs
almost instantly.

.. code-block:: python
.. code-block:: python
df.groupby("hvfhs_license_num").tips.mean().compute()
df.groupby("hvfhs_license_num").tips.mean().compute()
hvfhs_license_num
HV0005 0.946925
HV0002 0.338458
HV0004 0.217237
HV0003 0.736362
Name: tips, dtype: float64
hvfhs_license_num
HV0005 0.946925
HV0002 0.338458
HV0004 0.217237
HV0003 0.736362
Name: tips, dtype: float64
Triggering a computation will execute the query on the Coiled Cluster and materialize the
result as a pandas DataFrame on our local machine.
Triggering a computation will execute the query on the Coiled Cluster and materialize the
result as a pandas DataFrame on our local machine. We have to send the result over the network,
e.g. viewing large results can take significant time.

.. _Coiled: https://docs.coiled.io/user_guide/index.html?utm_source=dask-docs&utm_medium=quickstart
.. |Coiled| replace:: **Coiled**
.. _NYC Taxi dataset: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

0 comments on commit d9c6a9d

Please sign in to comment.