From d0800450e1454df8729f535d237a3fd94db93c2e Mon Sep 17 00:00:00 2001
From: Matthew Rocklin
Date: Fri, 5 Oct 2018 12:23:41 -0400
Subject: [PATCH] Update scikit-learn to 0.20 (#47)

This allows us to remove some text around joblib
---
 binder/environment.yml                     |  8 ++--
 machine-learning.ipynb                     | 19 ++++++----
 machine-learning/parallel-prediction.ipynb |  2 +-
 machine-learning/scale-scikit-learn.ipynb  | 37 +++++++++++--------
 machine-learning/tpot.ipynb                |  2 +-
 .../training-on-large-datasets.ipynb       |  2 +-
 machine-learning/voting-classifier.ipynb   |  5 +--
 machine-learning/xgboost.ipynb             |  2 +-
 8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/binder/environment.yml b/binder/environment.yml
index c36387b4..0d221ebc 100644
--- a/binder/environment.yml
+++ b/binder/environment.yml
@@ -3,15 +3,15 @@ channels:
 dependencies:
   - python=3
   - bokeh=0.13
-  - dask=0.19.1
-  - dask-ml=0.9.0
-  - distributed=1.23.1
+  - dask=0.19.2
+  - dask-ml=0.10.0
+  - distributed=1.23.2
   - jupyterlab=0.34
   - nodejs=8.9
   - numpy
   - pandas
   - pyarrow==0.10.0
-  - scikit-learn
+  - scikit-learn=0.20
   - matplotlib
   - nbserverproxy
   - nomkl
diff --git a/machine-learning.ipynb b/machine-learning.ipynb
index 124bda6e..3345e1b0 100644
--- a/machine-learning.ipynb
+++ b/machine-learning.ipynb
@@ -32,18 +32,24 @@
     "\n",
     "Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n",
     "\n",
-    "Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code. \n",
+    "Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n",
     "\n",
     "This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (see below)."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create Scikit-Learn Estimator"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "import dask_ml.joblib # register the distriubted backend\n",
     "from sklearn.datasets import make_classification\n",
     "from sklearn.svm import SVC\n",
     "from sklearn.model_selection import GridSearchCV\n",
@@ -96,14 +102,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To fit that normally, we'd call\n",
+    "To fit that normally, we would call\n",
     "\n",
     "```python\n",
     "grid_search.fit(X, y)\n",
     "```\n",
     "\n",
-    "To fit it using the cluster, we just need to use a context manager provided by joblib.\n",
-    "We'll pre-scatter the data to each worker, which can help with performance."
+    "To fit it using the cluster, we just need to use a context manager provided by joblib."
    ]
   },
   {
@@ -114,7 +119,7 @@
    "source": [
     "from sklearn.externals import joblib\n",
     "\n",
-    "with joblib.parallel_backend('dask', scatter=[X, y]):\n",
+    "with joblib.parallel_backend('dask'):\n",
     "    grid_search.fit(X, y)"
    ]
   },
@@ -270,7 +275,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
diff --git a/machine-learning/parallel-prediction.ipynb b/machine-learning/parallel-prediction.ipynb
index ef37a51a..5ab0740f 100644
--- a/machine-learning/parallel-prediction.ipynb
+++ b/machine-learning/parallel-prediction.ipynb
@@ -179,7 +179,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
diff --git a/machine-learning/scale-scikit-learn.ipynb b/machine-learning/scale-scikit-learn.ipynb
index 45ae7f2d..3a58a80a 100644
--- a/machine-learning/scale-scikit-learn.ipynb
+++ b/machine-learning/scale-scikit-learn.ipynb
@@ -64,19 +64,24 @@
     "\n",
     "Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n",
     "\n",
-    "Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code.\n",
+    "Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n",
     "\n",
     "This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (though Dask-ML provides other ways for working with larger than memory datasets)."
] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Scikit-Learn Pipeline" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "import dask_ml.joblib # register the distriubted backend\n", - "\n", "from pprint import pprint\n", "from time import time\n", "import logging\n", @@ -130,6 +135,13 @@ "])" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Grid for Parameter Search" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -185,16 +197,9 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.externals import joblib" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with joblib.parallel_backend('dask', scatter=[data.data, data.target]):\n", + "from sklearn.externals import joblib\n", + "\n", + "with joblib.parallel_backend('dask'):\n", " grid_search.fit(data.data, data.target)" ] }, @@ -253,14 +258,14 @@ "metadata": {}, "outputs": [], "source": [ - "svc = ParallelPostFit(SVC(random_state=0))\n", + "svc = ParallelPostFit(SVC(random_state=0, gamma='scale'))\n", "\n", "param_grid = {\n", " # use estimator__param instead of param\n", " 'estimator__C': [0.01, 1.0, 10],\n", "}\n", "\n", - "grid_search = GridSearchCV(svc, param_grid, iid=False)" + "grid_search = GridSearchCV(svc, param_grid, iid=False, cv=3)" ] }, { @@ -351,7 +356,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/machine-learning/tpot.ipynb b/machine-learning/tpot.ipynb index faffa87e..31600386 100644 --- a/machine-learning/tpot.ipynb +++ b/machine-learning/tpot.ipynb @@ -180,7 +180,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/machine-learning/training-on-large-datasets.ipynb b/machine-learning/training-on-large-datasets.ipynb index 83cf7845..a60bd924 100644 --- a/machine-learning/training-on-large-datasets.ipynb +++ b/machine-learning/training-on-large-datasets.ipynb @@ -129,7 +129,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/machine-learning/voting-classifier.ipynb b/machine-learning/voting-classifier.ipynb index 61824c73..128eef13 100644 --- a/machine-learning/voting-classifier.ipynb +++ b/machine-learning/voting-classifier.ipynb @@ -98,7 +98,6 @@ "metadata": {}, "outputs": [], "source": [ - "import dask_ml.joblib\n", "from sklearn.externals import joblib\n", "from distributed import Client\n", "\n", @@ -110,7 +109,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster. By providing the data in the ```scatter``` argument, the data is pre-emptively sent to each worker in the cluster (follow the [link](http://distributed.readthedocs.io/en/latest/api.html#distributed.client.Client.scatter) for more info)." + "To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster." 
    ]
   },
   {
@@ -120,7 +119,7 @@
    "outputs": [],
    "source": [
     "%%time \n",
-    "with joblib.parallel_backend(\"dask\", scatter=[X, y]):\n",
+    "with joblib.parallel_backend(\"dask\"):\n",
     "    clf.fit(X, y)\n",
     "\n",
     "print(clf)"
diff --git a/machine-learning/xgboost.ipynb b/machine-learning/xgboost.ipynb
index a29a80da..00819409 100644
--- a/machine-learning/xgboost.ipynb
+++ b/machine-learning/xgboost.ipynb
@@ -295,7 +295,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
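
After this patch, all of the notebooks share one usage pattern. The snippet below is a minimal sketch of that pattern, not code from the patch: the dataset, the `SVC` grid search, and the local `Client()` are illustrative stand-ins, and it assumes the versions pinned above (scikit-learn 0.20, distributed 1.23), where importing `distributed` registers the `'dask'` joblib backend so neither `import dask_ml.joblib` nor a `scatter=` argument is needed.

```python
from distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.externals import joblib  # scikit-learn 0.20's vendored joblib

# Illustrative: Client() starts a local cluster; in practice you would
# pass the address of a running Dask scheduler.
client = Client()

# Stand-in model and data for the sketch.
X, y = make_classification(n_samples=1000, random_state=0)
grid_search = GridSearchCV(
    SVC(random_state=0, gamma='scale'),  # gamma='scale' silences the 0.20 deprecation warning
    {'C': [0.01, 1.0, 10]},
    cv=3,
)

# With these pinned versions, importing distributed registers the 'dask'
# backend with joblib, so this context manager is the only change needed
# to move fitting from local cores to the cluster.
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)
```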