From d0800450e1454df8729f535d237a3fd94db93c2e Mon Sep 17 00:00:00 2001
From: Matthew Rocklin
Date: Fri, 5 Oct 2018 12:23:41 -0400
Subject: [PATCH] Update scikit-learn to 0.20 (#47)

This allows us to remove some text around joblib
---
 binder/environment.yml                     |  8 ++--
 machine-learning.ipynb                     | 19 ++++++----
 machine-learning/parallel-prediction.ipynb |  2 +-
 machine-learning/scale-scikit-learn.ipynb  | 37 +++++++++++--------
 machine-learning/tpot.ipynb                |  2 +-
 .../training-on-large-datasets.ipynb       |  2 +-
 machine-learning/voting-classifier.ipynb   |  5 +--
 machine-learning/xgboost.ipynb             |  2 +-
 8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/binder/environment.yml b/binder/environment.yml
index c36387b4..0d221ebc 100644
--- a/binder/environment.yml
+++ b/binder/environment.yml
@@ -3,15 +3,15 @@ channels:
 dependencies:
   - python=3
   - bokeh=0.13
-  - dask=0.19.1
-  - dask-ml=0.9.0
-  - distributed=1.23.1
+  - dask=0.19.2
+  - dask-ml=0.10.0
+  - distributed=1.23.2
   - jupyterlab=0.34
   - nodejs=8.9
   - numpy
   - pandas
   - pyarrow==0.10.0
-  - scikit-learn
+  - scikit-learn=0.20
   - matplotlib
   - nbserverproxy
   - nomkl
diff --git a/machine-learning.ipynb b/machine-learning.ipynb
index 124bda6e..3345e1b0 100644
--- a/machine-learning.ipynb
+++ b/machine-learning.ipynb
@@ -32,18 +32,24 @@
     "\n",
     "Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n",
     "\n",
-    "Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code. \n",
+    "Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n",
     "\n",
     "This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (see below)."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create Scikit-Learn Estimator"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "import dask_ml.joblib # register the distriubted backend\n",
     "from sklearn.datasets import make_classification\n",
     "from sklearn.svm import SVC\n",
     "from sklearn.model_selection import GridSearchCV\n",
@@ -96,14 +102,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To fit that normally, we'd call\n",
+    "To fit that normally, we would call\n",
     "\n",
     "```python\n",
     "grid_search.fit(X, y)\n",
     "```\n",
     "\n",
-    "To fit it using the cluster, we just need to use a context manager provided by joblib.\n",
-    "We'll pre-scatter the data to each worker, which can help with performance."
+    "To fit it using the cluster, we just need to use a context manager provided by joblib."
    ]
   },
   {
@@ -114,7 +119,7 @@
    "source": [
     "from sklearn.externals import joblib\n",
     "\n",
-    "with joblib.parallel_backend('dask', scatter=[X, y]):\n",
+    "with joblib.parallel_backend('dask'):\n",
     "    grid_search.fit(X, y)"
    ]
   },
@@ -270,7 +275,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
diff --git a/machine-learning/parallel-prediction.ipynb b/machine-learning/parallel-prediction.ipynb
index ef37a51a..5ab0740f 100644
--- a/machine-learning/parallel-prediction.ipynb
+++ b/machine-learning/parallel-prediction.ipynb
@@ -179,7 +179,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
diff --git a/machine-learning/scale-scikit-learn.ipynb b/machine-learning/scale-scikit-learn.ipynb
index 45ae7f2d..3a58a80a 100644
--- a/machine-learning/scale-scikit-learn.ipynb
+++ b/machine-learning/scale-scikit-learn.ipynb
@@ -64,19 +64,24 @@
     "\n",
     "Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n",
     "\n",
-    "Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code.\n",
+    "Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n",
     "\n",
     "This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (though Dask-ML provides other ways for working with larger than memory datasets)."
] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Scikit-Learn Pipeline" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "import dask_ml.joblib # register the distriubted backend\n", - "\n", "from pprint import pprint\n", "from time import time\n", "import logging\n", @@ -130,6 +135,13 @@ "])" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Grid for Parameter Search" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -185,16 +197,9 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.externals import joblib" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with joblib.parallel_backend('dask', scatter=[data.data, data.target]):\n", + "from sklearn.externals import joblib\n", + "\n", + "with joblib.parallel_backend('dask'):\n", " grid_search.fit(data.data, data.target)" ] }, @@ -253,14 +258,14 @@ "metadata": {}, "outputs": [], "source": [ - "svc = ParallelPostFit(SVC(random_state=0))\n", + "svc = ParallelPostFit(SVC(random_state=0, gamma='scale'))\n", "\n", "param_grid = {\n", " # use estimator__param instead of param\n", " 'estimator__C': [0.01, 1.0, 10],\n", "}\n", "\n", - "grid_search = GridSearchCV(svc, param_grid, iid=False)" + "grid_search = GridSearchCV(svc, param_grid, iid=False, cv=3)" ] }, { @@ -351,7 +356,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/machine-learning/tpot.ipynb b/machine-learning/tpot.ipynb index faffa87e..31600386 100644 --- a/machine-learning/tpot.ipynb +++ b/machine-learning/tpot.ipynb @@ -180,7 +180,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/machine-learning/training-on-large-datasets.ipynb b/machine-learning/training-on-large-datasets.ipynb index 83cf7845..a60bd924 100644 --- a/machine-learning/training-on-large-datasets.ipynb +++ b/machine-learning/training-on-large-datasets.ipynb @@ -129,7 +129,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/machine-learning/voting-classifier.ipynb b/machine-learning/voting-classifier.ipynb index 61824c73..128eef13 100644 --- a/machine-learning/voting-classifier.ipynb +++ b/machine-learning/voting-classifier.ipynb @@ -98,7 +98,6 @@ "metadata": {}, "outputs": [], "source": [ - "import dask_ml.joblib\n", "from sklearn.externals import joblib\n", "from distributed import Client\n", "\n", @@ -110,7 +109,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster. By providing the data in the ```scatter``` argument, the data is pre-emptively sent to each worker in the cluster (follow the [link](http://distributed.readthedocs.io/en/latest/api.html#distributed.client.Client.scatter) for more info)." + "To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster." 
    ]
   },
   {
@@ -120,7 +119,7 @@
    "outputs": [],
    "source": [
     "%%time \n",
-    "with joblib.parallel_backend(\"dask\", scatter=[X, y]):\n",
+    "with joblib.parallel_backend(\"dask\"):\n",
     "    clf.fit(X, y)\n",
     "\n",
     "print(clf)"
diff --git a/machine-learning/xgboost.ipynb b/machine-learning/xgboost.ipynb
index a29a80da..00819409 100644
--- a/machine-learning/xgboost.ipynb
+++ b/machine-learning/xgboost.ipynb
@@ -295,7 +295,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
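
After this patch, all of the notebooks share one usage pattern. The snippet below is a minimal sketch of that pattern, not code from the patch: the dataset, the `SVC` grid search, and the local `Client()` are illustrative stand-ins, and it assumes the versions pinned above (scikit-learn 0.20, distributed 1.23), where importing `distributed` registers the `'dask'` joblib backend so neither `import dask_ml.joblib` nor a `scatter=` argument is needed.

```python
from distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.externals import joblib  # scikit-learn 0.20's vendored joblib

# Illustrative: Client() starts a local cluster; in practice you would
# pass the address of a running Dask scheduler.
client = Client()

# Stand-in model and data for the sketch.
X, y = make_classification(n_samples=1000, random_state=0)
grid_search = GridSearchCV(
    SVC(random_state=0, gamma='scale'),  # gamma='scale' silences the 0.20 deprecation warning
    {'C': [0.01, 1.0, 10]},
    cv=3,
)

# With these pinned versions, importing distributed registers the 'dask'
# backend with joblib, so this context manager is the only change needed
# to move fitting from local cores to the cluster.
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)
```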