Commit: Setup for O'Reilly Q3 2019 Dask
adbreind committed Aug 13, 2019
1 parent 21f7b92 commit 819f2bb
Showing 27 changed files with 1,334,974 additions and 0 deletions.
282 changes: 282 additions & 0 deletions 01-Intro.ipynb
@@ -0,0 +1,282 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src='https://dask.org/_images/dask_horizontal_white_no_pad.svg' style='background:#000' width=400>\n",
"\n",
"# Dask natively scales Python\n",
"## Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love\n",
"\n",
"### Integrates with existing projects\n",
"#### BUILT WITH THE BROADER COMMUNITY\n",
"\n",
"Dask is open source and freely available. It is developed in coordination with other community projects like Numpy, Pandas, and Scikit-Learn.\n",
"\n",
"*(from the Dask project homepage at dask.org)*\n",
"\n",
"* * *\n",
"\n",
"__What Does This Mean?__\n",
"* Built in Python\n",
"* Scales *properly* from single laptops to 1000-node clusters\n",
"* Leverages and interoperates with existing Python APIs as much as possible\n",
"* Adheres to (Tim Peters') \"Zen of Python\" (https://www.python.org/dev/peps/pep-0020/) ... especially these elements:\n",
" * Explicit is better than implicit.\n",
" * Simple is better than complex.\n",
" * Complex is better than complicated.\n",
" * Readability counts. <i>[ed: that goes for docs, too!]</i>\n",
" * Special cases aren't special enough to break the rules.\n",
" * Although practicality beats purity.\n",
" * In the face of ambiguity, refuse the temptation to guess.\n",
" * If the implementation is hard to explain, it's a bad idea.\n",
" * If the implementation is easy to explain, it may be a good idea.\n",
"* While we're borrowing inspiration, Dask also embodies one of Perl's slogans: making easy things easy and hard things possible\n",
" * Specifically, it supports common data-parallel abstractions like Pandas and Numpy\n",
"    * But also allows scheduling arbitrary custom computation that doesn't fit a preset mold"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's See Some Code\n",
"\n",
"Before we go any further, let's take a look at one particular, common use case for Dask: scaling Pandas dataframes to \n",
"* larger datasets (which don't fit in memory) and \n",
"* multiple processes (which could be on multiple nodes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client\n",
"\n",
"client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')\n",
"\n",
"client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe\n",
"\n",
"ddf = dask.dataframe.read_csv('data/beer_small.csv', blocksize=12e6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What is this Dask Dataframe?\n",
"\n",
"A large, virtual dataframe divided along the index into multiple Pandas dataframes:\n",
"\n",
"<img src=\"images/dask-dataframe.svg\" width=\"400px\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf.map_partitions(type).compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf[ddf.beer_style.str.contains('IPA')].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ipa = ddf[ddf.beer_style.str.contains('IPA')]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_ipa_review = ipa.groupby('brewery_name').review_overall.agg(['mean','count'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_ipa_review.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_ipa_review.nlargest(20, 'mean').compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`compute` doesn't just run the work; it also collects the result into a single, regular Pandas dataframe right here in our local Python process.\n",
"\n",
"Having a local result is convenient, but if we are generating large results, we may want (or need) to produce output in parallel to the filesystem, instead. \n",
"\n",
"Each read method has a writing counterpart which we can use:\n",
"\n",
"- `read_csv` / `to_csv`\n",
"- `read_hdf` / `to_hdf`\n",
"- `read_json` / `to_json`\n",
"- `read_parquet` / `to_parquet`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_ipa_review.to_csv('ipa-*.csv')  # the * is where the partition number will go"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### About Dask\n",
"\n",
"Dask was created in 2014 by Matthew Rocklin, who received his Ph.D. from the University of Chicago in 2013 and has contributed to numerous scientific computing packages. Matt has worked at Sandia and Argonne National Laboratories, as well as Anaconda/Continuum, and currently works at NVIDIA.\n",
"\n",
"Fundamentally, Dask allows a variety of parallel workflows using existing Python constructs, patterns, or libraries, including dataframes (scaling out Pandas), arrays (scaling out Numpy), bags (an unordered collection construct a bit like `Counter`), and `concurrent.futures`.\n",
"\n",
"In addition to working in conjunction with Python ecosystem tools, Dask's extremely low scheduling overhead (microseconds in some cases) allows it to work well even on single machines, and to scale up smoothly.\n",
"\n",
"Dask supports a variety of use cases for industry and research: https://stories.dask.org/en/latest/\n",
"\n",
"With its recent 2.x releases, and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.\n",
"\n",
"__Dask Ecosystem__\n",
"\n",
"In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...\n",
"* Dask ML - parallel machine learning, with a scikit-learn-style API\n",
"* Dask-kubernetes\n",
"* Dask-XGBoost\n",
"* Dask-YARN\n",
"* Dask-image\n",
"* Dask-cuDF\n",
"* ... and some others\n",
"\n",
"__What's Not Part of Dask?__\n",
"\n",
"There is lots of functionality that integrates with Dask but is not represented in the core Dask ecosystem, including...\n",
"\n",
"* a SQL engine\n",
"* data storage\n",
"* data catalog\n",
"* visualization\n",
"* coarse-grained scheduling / orchestration\n",
"* streaming\n",
"\n",
"... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).\n"
]
},
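{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the bag API mentioned above (using a tiny made-up list, not our beer dataset), we can count item frequencies across a partitioned, unordered collection:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask.bag as db\n",
"\n",
"# a bag is an unordered collection, split into partitions\n",
"b = db.from_sequence(['pale ale', 'IPA', 'IPA', 'stout'], npartitions=2)\n",
"\n",
"b.frequencies().compute()"
]
},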
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How Do We Set Up and/or Deploy Dask?\n",
"\n",
"The easiest way to install Dask is with Anaconda: `conda install dask`\n",
"\n",
"__Schedulers and Clustering__\n",
"\n",
"Dask has a simple default scheduler called the \"single machine scheduler\" -- this is the scheduler that's used if you `import dask` and start running code without explicitly using a `Client` object. It can be handy for quick-and-dirty testing, but I would (*warning! opinion!*) suggest that a best practice is to __use the newer \"distributed scheduler\" even for single-machine workloads__.\n",
"\n",
"The distributed scheduler can work with \n",
"* threads (although that is often not a great idea due to the GIL) in one process\n",
"* multiple processes on one machine\n",
"* multiple processes on multiple machines\n",
"\n",
"The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards."
]
},
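{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal sketch using `LocalCluster` (the same machinery behind the `Client(...)` shortcut we used earlier) to run multiple single-threaded worker processes on one machine:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client, LocalCluster\n",
"\n",
"# two single-threaded worker processes; processes=False would use threads instead\n",
"cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)\n",
"client = Client(cluster)\n",
"\n",
"client.close()\n",
"cluster.close()"
]
},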
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}