Update

dask · May 3, 2024 · d9c6a9d · d9c6a9d
1 parent fd24c7a
commit d9c6a9d
Showing 1 changed file with 98 additions and 137 deletions.
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
@@ -40,77 +40,89 @@ Dask offers a set of
 API that give access to disitributed DataFrames build on pandas, distributed Arrays build on NumPy
 and access to low level parallelism with a Futures API.
 
-We will first create a Dask Cluster to run our computations on before we do a few
-exemplary DataFrame computations.
+Local
+"""""
 
-.. tab-set::
+Dask can be used locally on a single machine or on a cluster of machines that is hosted
+somewhere. We recommend starting with a cluster that runs on your local machine to get
+a feel for the API and the way Dask works.
 
-   .. tab-item:: Local
+We can create a fully-featured Dask Cluster that is running on our local machine.
+This gives us access to multi-process computations and the diagnostic Dashboard.
 
-     We can create a fully-featured Dask Cluster that is running on our local machine.
-     This gives us access to multi-process computations and the diagnostic Dashboard.
+.. code-block:: python
 
-     .. code-block:: python
+    from dask.distributed import LocalCluster
+    cluster = LocalCluster()          # Fully-featured local Dask cluster
+    client = cluster.get_client()
 
-        from dask.distributed import LocalCluster
-        cluster = LocalCluster()          # Fully-featured local Dask cluster
-        client = cluster.get_client()
 
-     The :class:`~distributed.LocalCluster` follows the same interface as all other Dask Clusters, so
-     you can easily switch to a distributed cluster.
+The :class:`~distributed.LocalCluster` follows the same interface as all other Dask Clusters, so
+you can easily switch to a distributed cluster.
 
-     The Dask DataFrame API is used for scaling out pandas computations that can run
-     on all cores of our machine. Dask uses pandas under the hood to perform the actual
-     computations, so the API has a very pandas-like feel.
+The Dask DataFrame API is used for scaling out pandas computations that can run
+on all cores of our machine. Dask uses pandas under the hood to perform the actual
+computations, so the API has a very pandas-like feel.
 
-     .. code-block:: python
-        :emphasize-lines: 7
+.. code-block:: python
+    :emphasize-lines: 7
 
-        import dask.dataframe as dd
-        import pandas as pd
+    import dask.dataframe as dd
+    import pandas as pd
 
-        index = pd.date_range("2021-09-01", periods=2400, freq="1h")
-        pdf = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
+    index = pd.date_range("2021-09-01", periods=2400, freq="1h")
+    pdf = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
 
-        df = dd.from_pandas(df, npartitions=10)
-        df
+    df = dd.from_pandas(df, npartitions=10)
+    df
 
-        Dask DataFrame Structure:
-                                 a       b
-        npartitions=10
-        2021-09-01 00:00:00  int64  string
-        2021-09-11 00:00:00    ...     ...
-        ...                    ...     ...
-        2021-11-30 00:00:00    ...     ...
-        2021-12-09 23:00:00    ...     ...
-        Dask Name: frompandas, 1 expression
-        Expr=df
+    Dask DataFrame Structure:
+                             a       b
+    npartitions=10
+    2021-09-01 00:00:00  int64  string
+    2021-09-11 00:00:00    ...     ...
+    ...                    ...     ...
+    2021-11-30 00:00:00    ...     ...
+    2021-12-09 23:00:00    ...     ...
+    Dask Name: frompandas, 1 expression
+    Expr=df
 
-     :func:`from_pandas` can be used to create a Dask DataFrame from the
-     pandas version. Dask also offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
-     common data sources.
+:func:`from_pandas` can be used to create a Dask DataFrame from the
+pandas version. Dask also offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
+common data sources.
 
-     The DataFrame API is almost identical to the pandas API. We can use normal pandas
-     methodology to execute operations on the Dask DataFrame.
+The DataFrame API is almost identical to the pandas API. We can use normal pandas
+methodology to execute operations on the Dask DataFrame.
 
-     The main difference is that Dask is lazy and doesn't actually execute the computation
-     before we ask for it. We can trigger the computation with :func:`~dask.compute` or
-     :meth:`DataFrame.compute`.
+The main difference is that Dask is lazy and doesn't actually execute the computation
+before we ask for it. We can trigger the computation with :func:`~dask.compute` or
+:meth:`DataFrame.compute`.
 
-     .. code-block:: python
+.. code-block:: python
+
+    df.groupby("b").a.mean().compute()
+
+    b
+    a    1197.5
+    b    1199.5
+    c    1198.0
+    d    1200.5
+    e    1203.0
+    Name: a, dtype: float64
 
-        df.groupby("b").a.mean().compute()
+Triggering a computation will execute the query on the Dask Cluster and materialize the
+result as a pandas DataFrame or Series.
 
-        b
-        a    1197.5
-        b    1199.5
-        c    1198.0
-        d    1200.5
-        e    1203.0
-        Name: a, dtype: float64
+Scaling Out
+"""""""""""
 
-     Triggering a computation will execute the query on the Dask Cluster and materialize the
-     result as a pandas DataFrame or Series.
+Running Dask on your machine will restrict you to the resources that you have available
+locally. We will now show how we can scale out onto a Cluster with dozens or hundreds of
+workers. Dask Clusters can be hosted in the Cloud or on any local data center or just
+a cluster of machines.
+
+
+.. tab-set::
 
    .. tab-item:: Managed Cloud
 
@@ -124,54 +136,6 @@ exemplary DataFrame computations.
         cluster = coiled.Cluster(n_workers=15, region="us-east-2")
         client = cluster.get_client()
 
-     The Dask DataFrame API is used for scaling out pandas computations that can run
-     on all cores of our machine. Dask uses pandas under the hood to perform the actual
-     computations, so the API has a very pandas-like feel.
-
-     .. code-block:: python
-        :emphasize-lines: 3
-
-        import dask.dataframe as dd
-
-        df = dd.read_parquet("s3://coiled-data/uber/")
-        df
-
-        Dask DataFrame Structure:
-                        hvfhs_license_num       tips    ...
-        npartitions=720
-                                   string    float64
-                                      ...        ...
-        ...                           ...        ...
-                                      ...        ...
-                                      ...        ...
-        Dask Name: read_parquet, 1 expression
-        Expr=ReadParquetFSSpec(8e22969)
-
-     Dask offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
-     common data sources. The interface of these connectors is very similar to the pandas
-     equivalent. :func:`read_parquet` has the same behavior as the pandas function.
-
-     The DataFrame API is almost identical to the pandas API. We can use normal pandas
-     methodology to execute operations on the Dask DataFrame.
-
-     The main difference is that Dask is lazy and doesn't actually execute the computation
-     before we ask for it. We can trigger the computation with :func:`~dask.compute` or
-     :meth:`DataFrame.compute`.
-
-     .. code-block:: python
-
-        df.groupby("hvfhs_license_num").tips.mean().compute()
-
-        hvfhs_license_num
-        HV0005    0.946925
-        HV0002    0.338458
-        HV0004    0.217237
-        HV0003    0.736362
-        Name: tips, dtype: float64
-
-     Triggering a computation will execute the query on the Coiled Cluster and materialize the
-     result as a pandas DataFrame on our local machine.
-
    .. tab-item:: Other Deployments
 
      Dask can be deployed with a number of other options that all have their strengths and weaknesses.
@@ -183,58 +147,55 @@ exemplary DataFrame computations.
 
         from dask_kubernetes.operator import KubeCluster
         cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
-        cluster.scale(10)
+        cluster.scale(15)
 
      This is only a short example and needs more setup for proper use. See the
      :ref:`Dask Kubernetes <deploying-kubernetes>` documentation for more information.
 
-     The Dask DataFrame API is used for scaling out pandas computations that can run
-     on all cores of our machine. Dask uses pandas under the hood to perform the actual
-     computations, so the API has a very pandas-like feel.
 
-     .. code-block:: python
-        :emphasize-lines: 3
+Everything works exactly the same as on our local machine, but we can now run computations
+on much bigger datasets. We will load a dataset from s3 that holds the `NYC Taxi dataset`_
+with around 200 GB of data.
 
-        import dask.dataframe as dd
+.. code-block:: python
+    :emphasize-lines: 3
 
-        df = dd.read_parquet("s3://coiled-data/uber/")
-        df
+    import dask.dataframe as dd
 
-        Dask DataFrame Structure:
-                        hvfhs_license_num       tips    ...
-        npartitions=720
-                                   string    float64
-                                      ...        ...
-        ...                           ...        ...
-                                      ...        ...
-                                      ...        ...
-        Dask Name: read_parquet, 1 expression
-        Expr=ReadParquetFSSpec(8e22969)
+    df = dd.read_parquet("s3://coiled-data/uber/")
+    df
 
-     Dask offers :ref:`IO-connectors <dataframe-io-api>` to read and write data from most
-     common data sources. The interface of these connectors is very similar to the pandas
-     equivalent. :func:`read_parquet` has the same behavior as the pandas function.
+    Dask DataFrame Structure:
+                    hvfhs_license_num       tips    ...
+    npartitions=720
+                               string    float64
+                                  ...        ...
+    ...                           ...        ...
+                                  ...        ...
+                                  ...        ...
+    Dask Name: read_parquet, 1 expression
+    Expr=ReadParquetFSSpec(8e22969)
 
-     The DataFrame API is almost identical to the pandas API. We can use normal pandas
-     methodology to execute operations on the Dask DataFrame.
 
-     The main difference is that Dask is lazy and doesn't actually execute the computation
-     before we ask for it. We can trigger the computation with :func:`~dask.compute` or
-     :meth:`DataFrame.compute`.
+The cluster now allows us to process the data much more efficiently than what we could
+do on our local machine. Computing the average tip for drivers from Uber and Lyft runs
+almost instantly.
 
-     .. code-block:: python
+.. code-block:: python
 
-        df.groupby("hvfhs_license_num").tips.mean().compute()
+    df.groupby("hvfhs_license_num").tips.mean().compute()
 
-        hvfhs_license_num
-        HV0005    0.946925
-        HV0002    0.338458
-        HV0004    0.217237
-        HV0003    0.736362
-        Name: tips, dtype: float64
+    hvfhs_license_num
+    HV0005    0.946925
+    HV0002    0.338458
+    HV0004    0.217237
+    HV0003    0.736362
+    Name: tips, dtype: float64
 
-     Triggering a computation will execute the query on the Coiled Cluster and materialize the
-     result as a pandas DataFrame on our local machine.
+Triggering a computation will execute the query on the Coiled Cluster and materialize the
+result as a pandas DataFrame on our local machine. We have to send the result over the network,
+e.g. viewing large results can take significant time.
 
 .. _Coiled: https://docs.coiled.io/user_guide/index.html?utm_source=dask-docs&utm_medium=quickstart
 .. |Coiled| replace:: **Coiled**
+.. _NYC Taxi dataset: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page