Update scikit-learn to 0.20 (#47)
This allows us to remove some text around joblib
mrocklin committed Oct 5, 2018
1 parent 0bd1df8 commit d080045
Showing 8 changed files with 43 additions and 34 deletions.
8 changes: 4 additions & 4 deletions binder/environment.yml
@@ -3,15 +3,15 @@ channels:
dependencies:
- python=3
- bokeh=0.13
- dask=0.19.1
- dask-ml=0.9.0
- distributed=1.23.1
- dask=0.19.2
- dask-ml=0.10.0
- distributed=1.23.2
- jupyterlab=0.34
- nodejs=8.9
- numpy
- pandas
- pyarrow==0.10.0
- scikit-learn
- scikit-learn=0.20
- matplotlib
- nbserverproxy
- nomkl
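As a quick sanity check after building the environment from binder/environment.yml, the pinned versions can be confirmed from a running kernel. A minimal sketch (the expected values simply mirror the pins above):

```python
# Confirm that the kernel sees the versions pinned in binder/environment.yml.
import dask
import dask_ml
import distributed
import sklearn

print(dask.__version__)         # expected 0.19.2
print(dask_ml.__version__)      # expected 0.10.0
print(distributed.__version__)  # expected 1.23.2
print(sklearn.__version__)      # expected 0.20.x
```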
19 changes: 12 additions & 7 deletions machine-learning.ipynb
@@ -32,18 +32,24 @@
"\n",
"Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n",
"\n",
"Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code. \n",
"Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n",
"\n",
"This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (see below)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Scikit-Learn Estimator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask_ml.joblib # register the distriubted backend\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.svm import SVC\n",
"from sklearn.model_selection import GridSearchCV\n",
@@ -96,14 +102,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To fit that normally, we'd call\n",
"To fit that normally, we would call\n",
"\n",
"```python\n",
"grid_search.fit(X, y)\n",
"```\n",
"\n",
"To fit it using the cluster, we just need to use a context manager provided by joblib.\n",
"We'll pre-scatter the data to each worker, which can help with performance."
"To fit it using the cluster, we just need to use a context manager provided by joblib."
]
},
{
@@ -114,7 +119,7 @@
"source": [
"from sklearn.externals import joblib\n",
"\n",
"with joblib.parallel_backend('dask', scatter=[X, y]):\n",
"with joblib.parallel_backend('dask'):\n",
" grid_search.fit(X, y)"
]
},
@@ -270,7 +275,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
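The change above boils down to one pattern: with scikit-learn 0.20 the `dask` joblib backend is available without importing `dask_ml.joblib`. A minimal, self-contained sketch (the dataset, parameter grid, and local `Client()` are illustrative, not taken from the notebook):

```python
# Fit a grid search on the Dask cluster via joblib's 'dask' backend.
from distributed import Client
from sklearn.datasets import make_classification
from sklearn.externals import joblib
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # connect to a scheduler; here, a local cluster

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {'C': [0.001, 0.1, 1.0, 10.0], 'kernel': ['rbf', 'poly']}
grid_search = GridSearchCV(SVC(gamma='scale'), param_grid, cv=3)

with joblib.parallel_backend('dask'):  # the one line that moves work to the cluster
    grid_search.fit(X, y)

print(grid_search.best_params_)
```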
2 changes: 1 addition & 1 deletion machine-learning/parallel-prediction.ipynb
@@ -179,7 +179,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
37 changes: 21 additions & 16 deletions machine-learning/scale-scikit-learn.ipynb
@@ -64,19 +64,24 @@
"\n",
"Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n",
"\n",
"Dask registers a joblib backend. This lets you train those estimators using all the cores of your *cluster*, by changing one line of code.\n",
"Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n",
"\n",
"This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (though Dask-ML provides other ways for working with larger than memory datasets)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Scikit-Learn Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask_ml.joblib # register the distriubted backend\n",
"\n",
"from pprint import pprint\n",
"from time import time\n",
"import logging\n",
@@ -130,6 +135,13 @@
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define Grid for Parameter Search"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -185,16 +197,9 @@
"metadata": {},
"outputs": [],
"source": [
"from sklearn.externals import joblib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with joblib.parallel_backend('dask', scatter=[data.data, data.target]):\n",
"from sklearn.externals import joblib\n",
"\n",
"with joblib.parallel_backend('dask'):\n",
" grid_search.fit(data.data, data.target)"
]
},
@@ -253,14 +258,14 @@
"metadata": {},
"outputs": [],
"source": [
"svc = ParallelPostFit(SVC(random_state=0))\n",
"svc = ParallelPostFit(SVC(random_state=0, gamma='scale'))\n",
"\n",
"param_grid = {\n",
" # use estimator__param instead of param\n",
" 'estimator__C': [0.01, 1.0, 10],\n",
"}\n",
"\n",
"grid_search = GridSearchCV(svc, param_grid, iid=False)"
"grid_search = GridSearchCV(svc, param_grid, iid=False, cv=3)"
]
},
{
@@ -351,7 +356,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
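The second half of this notebook wraps the estimator in `ParallelPostFit` so that prediction scales out after a normal fit. A minimal sketch of that pattern, assuming dask-ml 0.10 as pinned above (the data sizes and replication factor are illustrative):

```python
# Fit on small, in-memory data; predict block-wise on a larger dask array.
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

svc = ParallelPostFit(SVC(random_state=0, gamma='scale'))
param_grid = {'estimator__C': [0.01, 1.0, 10]}  # note the estimator__ prefix
grid_search = GridSearchCV(svc, param_grid, cv=3)
grid_search.fit(X, y)  # training itself stays on the small data

# Build a larger dask array by repeating X, then predict in parallel.
big_X = da.concatenate([da.from_array(X, chunks=X.shape)] * 10)
predictions = grid_search.predict(big_X)
print(predictions)
```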
2 changes: 1 addition & 1 deletion machine-learning/tpot.ipynb
@@ -180,7 +180,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion machine-learning/training-on-large-datasets.ipynb
@@ -129,7 +129,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
5 changes: 2 additions & 3 deletions machine-learning/voting-classifier.ipynb
@@ -98,7 +98,6 @@
"metadata": {},
"outputs": [],
"source": [
"import dask_ml.joblib\n",
"from sklearn.externals import joblib\n",
"from distributed import Client\n",
"\n",
@@ -110,7 +109,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster. By providing the data in the ```scatter``` argument, the data is pre-emptively sent to each worker in the cluster (follow the [link](http://distributed.readthedocs.io/en/latest/api.html#distributed.client.Client.scatter) for more info)."
"To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster."
]
},
{
@@ -120,7 +119,7 @@
"outputs": [],
"source": [
"%%time \n",
"with joblib.parallel_backend(\"dask\", scatter=[X, y]):\n",
"with joblib.parallel_backend(\"dask\"):\n",
" clf.fit(X, y)\n",
"\n",
"print(clf)"
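As a rough, self-contained sketch of what this notebook does (the particular sub-estimators, their hyper-parameters, and the local `Client()` are illustrative):

```python
# Train a VotingClassifier with its sub-estimators fit in parallel on the cluster.
from distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.externals import joblib
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

client = Client()  # here, a local cluster

X, y = make_classification(n_samples=1000, random_state=0)
clf = VotingClassifier(
    estimators=[('sgd', SGDClassifier(max_iter=1000)),
                ('svc', SVC(gamma='scale'))],
    n_jobs=-1,  # joblib parallelizes fitting the sub-estimators
)

with joblib.parallel_backend('dask'):
    clf.fit(X, y)

print(clf)
```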
2 changes: 1 addition & 1 deletion machine-learning/xgboost.ipynb
@@ -295,7 +295,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
