_Lambda School Data Science — Big Data_

# AWS SageMaker

### Links

#### AWS
- The Open Guide to Amazon Web Services: EC2 Basics _(just this one short section!)_ https://github.com/open-guides/og-aws#ec2-basics
- AWS in Plain English https://www.expeditedssl.com/aws-in-plain-english
- Amazon SageMaker » Create an Amazon SageMaker Notebook Instance https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html
- Amazon SageMaker » Install External Libraries https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html

#### Dask
- Why Dask? https://docs.dask.org/en/latest/why.html
- Use Cases https://docs.dask.org/en/latest/use-cases.html
- User Interfaces https://docs.dask.org/en/latest/user-interfaces.html

#### Numba
- A ~5 minute guide http://numba.pydata.org/numba-doc/latest/user/5minguide.html

## 1. Estimate pi
https://en.wikipedia.org/wiki/Approximations_of_π#Summing_a_circle's_area

### With plain Python

In [1]:
import random

def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:
%%time
monte_carlo_pi(1e7)

CPU times: user 4.25 s, sys: 0 ns, total: 4.25 s
Wall time: 4.25 s


3.141706

### With Numba
http://numba.pydata.org/

In [3]:
from numba import njit

In [4]:
@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [5]:
%%time
monte_carlo_pi(1e7)

CPU times: user 374 ms, sys: 4.27 ms, total: 378 ms
Wall time: 377 ms


3.14098
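
The two runs above return slightly different estimates (3.141706 and 3.14098) because the method is random. As a rough check, assuming each sample is an independent hit-or-miss trial with success probability π/4, the standard error of the estimate can be sketched as follows. The helper name `monte_carlo_standard_error` is made up for this illustration.

```python
import math

def monte_carlo_standard_error(nsamples):
    """Standard error of the estimator 4 * (hits / nsamples)."""
    # Probability that a uniform point in the unit square
    # lands inside the quarter circle of radius 1.
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / nsamples)

se = monte_carlo_standard_error(1e7)
print(f"standard error: {se:.6f}")

# Compare the two estimates reported in the cells above.
for estimate in (3.141706, 3.14098):
    error = estimate - math.pi
    print(f"{estimate}: error {error:+.6f}, {abs(error) / se:.2f} standard errors")
```

With 1e7 samples the standard error is roughly 0.0005, and both estimates fall within about 1.2 standard errors of π, so the difference between them is ordinary sampling noise rather than an effect of Numba.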

## 2. Loop a slow function

### With plain Python

In [6]:
from time import sleep

def slow_square(x):
    sleep(1)
    return x**2

In [7]:
%%time
[slow_square(n) for n in range(16)]

CPU times: user 1.01 ms, sys: 276 µs, total: 1.29 ms
Wall time: 16 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]

### With Dask
- https://examples.dask.org/delayed.html
- http://docs.dask.org/en/latest/setup/single-distributed.html

In [8]:
from dask import compute, delayed

In [9]:
%%time
compute(delayed(slow_square)(n) for n in range(16))

CPU times: user 11.2 ms, sys: 1.95 ms, total: 13.1 ms
Wall time: 1.01 s


([0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225],)
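
The drop from 16 s to about 1 s is most likely because the 16 `sleep` calls wait concurrently instead of one after another. The sketch below shows the same idea with the standard library's `ThreadPoolExecutor`. It is an analogy, not how Dask is implemented, and the `delay` parameter and the 0.1 s value are assumptions chosen so the example runs quickly.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep

def slow_square(x, delay=0.1):
    # Shorter delay than the notebook's 1 s so the sketch finishes quickly.
    sleep(delay)
    return x**2

start = perf_counter()
# 16 worker threads, one per task, so every sleep overlaps with the others.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(slow_square, range(16)))
elapsed = perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f} s")
```

Threads help here because `sleep` spends its time waiting rather than computing. CPU-bound pure Python code would not speed up the same way with threads.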

## 3. Analyze millions of Instacart orders

### Download data
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

In [10]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-02-25 22:16:09--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.81.11
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.81.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.2’


2019-02-25 22:16:12 (59.1 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.2’ saved [205548478/205548478]



In [11]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [12]:
%cd instacart_2017_05_01

/home/ec2-user/SageMaker/DS-Unit-3-Sprint-3-Big-Data/module1-aws-sagemaker/instacart_2017_05_01


In [13]:
!ls -lh *.csv

-rw-r--r-- 1 ec2-user ec2-user 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 ec2-user ec2-user  270 May  2  2017 departments.csv
-rw-r--r-- 1 ec2-user ec2-user 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 ec2-user ec2-user  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 ec2-user ec2-user 104M May  2  2017 orders.csv
-rw-r--r-- 1 ec2-user ec2-user 2.1M May  2  2017 products.csv


### With Pandas

#### Load & merge data

In [14]:
import pandas as pd

In [15]:
%%time
order_products = pd.concat([
    pd.read_csv('order_products__prior.csv'), 
    pd.read_csv('order_products__train.csv')])

order_products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33819106 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtypes: int64(4)
memory usage: 1.3 GB
CPU times: user 10.3 s, sys: 2.03 s, total: 12.3 s
Wall time: 12.3 s
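
The reported memory usage can be checked by hand from the `info()` output above: every value is an `int64`, and the `Int64Index` adds one more `int64` per row. The following is a back-of-the-envelope sketch. The `int8` saving at the end is a hypothetical illustration, not something the notebook does.

```python
rows = 33_819_106       # entries reported by order_products.info()
int64_columns = 4       # order_id, product_id, add_to_cart_order, reordered
bytes_per_value = 8     # an int64 takes 8 bytes

data_bytes = rows * int64_columns * bytes_per_value
index_bytes = rows * bytes_per_value  # the Int64Index stores one int64 label per row
total_bytes = data_bytes + index_bytes

print(f"total: {total_bytes:,} bytes = {total_bytes / 2**30:.2f} GiB")

# Hypothetical: storing the 0/1 `reordered` flag as int8 instead of int64
# would save 7 bytes per row.
saved_bytes = rows * (8 - 1)
print(f"possible saving: {saved_bytes:,} bytes")
```

The total of about 1.35 billion bytes is consistent with the 1.3 GB that pandas reports.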


In [16]:
order_products.head()

|   | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [17]:
products = pd.read_csv('products.csv')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [18]:
products.head()

|   | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


In [19]:
%%time
order_products = pd.merge(order_products, products[['product_id', 'product_name']])

CPU times: user 7.23 s, sys: 2 s, total: 9.23 s
Wall time: 9.22 s


In [20]:
order_products.head()

|   | order_id | product_id | add_to_cart_order | reordered | product_name |
|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | Organic Egg Whites |
| 1 | 26 | 33120 | 5 | 0 | Organic Egg Whites |
| 2 | 120 | 33120 | 13 | 0 | Organic Egg Whites |
| 3 | 327 | 33120 | 5 | 1 | Organic Egg Whites |
| 4 | 390 | 33120 | 28 | 1 | Organic Egg Whites |


#### Most popular products?

In [21]:
%time
order_products['product_name'].value_counts()

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
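Wall time: 6.68 µs

Note: this cell uses the line magic `%time` rather than the cell magic `%%time`, so it timed an empty statement. The microsecond figures above do not measure `value_counts()`.
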

Banana                                                                     491291
Bag of Organic Bananas                                                     394930
Organic Strawberries                                                       275577
Organic Baby Spinach                                                       251705
Organic Hass Avocado                                                       220877
Organic Avocado                                                            184224
Large Lemon                                                                160792
Strawberries                                                               149445
Limes                                                                      146660
Organic Whole Milk                                                         142813
Organic Raspberries                                                        142603
Organic Yellow Onion                                                       117716
Organic Garlic  

#### Organic?

In [22]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')

CPU times: user 17.1 s, sys: 163 ms, total: 17.3 s
Wall time: 17.3 s


In [23]:
%%time
order_products['organic'].value_counts()

CPU times: user 275 ms, sys: 121 ms, total: 395 ms
Wall time: 394 ms


False    23163118
True     10655988
Name: organic, dtype: int64

### With Dask
https://examples.dask.org/dataframe.html

In [24]:
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=16)
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:45431 | Workers: 16 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 16 |
|   | Memory: 67.53 GB |


#### Load & merge data
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files

In [25]:
%%time
order_products = dd.read_csv('order_products*.csv')

CPU times: user 17.7 ms, sys: 4.22 ms, total: 21.9 ms
Wall time: 20.1 ms


In [26]:
%%time
order_products = dd.merge(order_products, products[['product_id', 'product_name']])

CPU times: user 23.7 ms, sys: 3.95 ms, total: 27.6 ms
Wall time: 24.2 ms


http://docs.dask.org/en/latest/dataframe-performance.html#persist-intelligently

In [27]:
%%time
order_products = order_products.persist()

CPU times: user 5.11 ms, sys: 149 µs, total: 5.26 ms
Wall time: 5.05 ms


#### Most popular products?

In [28]:
%%time
order_products['product_name'].value_counts().compute()

CPU times: user 759 ms, sys: 117 ms, total: 876 ms
Wall time: 4.59 s


Banana                                                       491291
Bag of Organic Bananas                                       394930
Organic Strawberries                                         275577
Organic Baby Spinach                                         251705
Organic Hass Avocado                                         220877
Organic Avocado                                              184224
Large Lemon                                                  160792
Strawberries                                                 149445
Limes                                                        146660
Organic Whole Milk                                           142813
Organic Raspberries                                          142603
Organic Yellow Onion                                         117716
Organic Garlic                                               113936
Organic Zucchini                                             109412
Organic Blueberries                             

#### Organic?

In [33]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')

CPU times: user 5.5 ms, sys: 0 ns, total: 5.5 ms
Wall time: 4.94 ms


In [35]:
%%time
order_products['organic'].value_counts().compute()

CPU times: user 973 ms, sys: 155 ms, total: 1.13 s
Wall time: 8.17 s


False    23163118
True     10655988
Name: organic, dtype: int64

## 4. Fit a machine learning model

### Load data

In [7]:
%cd ./ds1-predictive-modeling-challenge

/home/ec2-user/SageMaker/DS-Unit-3-Sprint-3-Big-Data/module1-aws-sagemaker/ds1-predictive-modeling-challenge


In [8]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')

X_train_numeric = train_features.select_dtypes(np.number)
y_train = train_labels['status_group']

### With 2 cores (like Google Colab)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [9]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=2, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    2.7s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:   11.4s
[Parallel(n_jobs=2)]: Done 200 out of 200 | elapsed:   11.6s finished


Out-of-bag score: 0.7206397306397306


### With 16 cores (on AWS m4.4xlarge)

In [11]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:    2.3s finished


Out-of-bag score: 0.7206397306397306
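
The two fits return the same out-of-bag score, so only the elapsed time differs. A small sketch of the speedup arithmetic follows, assuming `n_jobs=-1` used all 16 cores of the instance. The helper name `speedup_and_efficiency` is made up for this illustration.

```python
def speedup_and_efficiency(slow_seconds, fast_seconds, slow_cores, fast_cores):
    """Return (speedup, parallel efficiency relative to the added cores)."""
    speedup = slow_seconds / fast_seconds
    core_ratio = fast_cores / slow_cores
    return speedup, speedup / core_ratio

# Elapsed times taken from the two fit logs above.
speedup, efficiency = speedup_and_efficiency(11.6, 2.3, slow_cores=2, fast_cores=16)
print(f"speedup: {speedup:.2f}x with {16 // 2}x the cores")
print(f"parallel efficiency: {efficiency:.0%}")
```

Eight times as many cores gave roughly a fivefold speedup, so scaling is useful but not perfectly linear for this workload.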


## ASSIGNMENT

Revisit a previous assignment or project that had slow speeds or big data.

Make it better with what you've learned today!

You can use `wget` or the Kaggle API to get data. Some possibilities include:

- https://www.kaggle.com/c/ds1-predictive-modeling-challenge
- https://www.kaggle.com/ntnu-testimon/paysim1
- https://github.com/mdeff/fma
- https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2 



Also, you can play with [Datashader](http://datashader.org/) and its [example datasets](https://github.com/pyviz/datashader/blob/master/examples/datasets.yml)!

In [2]:
!pip install category_encoders
import category_encoders as ce

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/f7/d3/82a4b85a87ece114f6d0139d643580c726efa45fa4db3b81aed38c0156c5/category_encoders-1.3.0-py2.py3-none-any.whl (61kB)
    100% |████████████████████████████████| 61kB 2.7MB/s
fastparquet 0.2.1 requires pytest-runner, which is not installed.
Installing collected packages: category-encoders
Successfully installed category-encoders-1.3.0
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
pd.set_option('display.max_columns', 100)  # full option name; the short form is rejected by newer pandas
train = pd.read_csv('https://raw.githubusercontent.com/valogonor/DS1-Predictive-Modeling-Challenge/master/train_features.csv')
test = pd.read_csv('https://raw.githubusercontent.com/valogonor/DS1-Predictive-Modeling-Challenge/master/test_features.csv')  # Unlabeled, for Kaggle submission
labels = pd.read_csv('https://raw.githubusercontent.com/valogonor/DS1-Predictive-Modeling-Challenge/master/train_labels.csv')
train = train.join(labels.status_group, how='inner')
train = train.drop('id', axis=1)
train.head()

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [4]:
train.isna().sum()

amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity

In [5]:
train_no_nulls = train.dropna(axis=1)
train_no_nulls.isna().sum()

amount_tsh               0
date_recorded            0
gps_height               0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
recorded_by              0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
waterpoint_type_group    0
status_group             0
dtype: int64

In [6]:
train_no_nulls.nunique().sort_values()

recorded_by                  1
status_group                 3
source_class                 3
management_group             5
quantity_group               5
quantity                     5
waterpoint_type_group        6
quality_group                6
payment_type                 7
payment                      7
source_type                  7
waterpoint_type              7
extraction_type_class        7
water_quality                8
basin                        9
source                      10
management                  12
extraction_type_group       13
extraction_type             18
district_code               20
region                      21
region_code                 27
construction_year           55
num_private                 65
amount_tsh                  98
lga                        125
date_recorded              356
population                1049
ward                      2092
gps_height                2428
wpt_name                 37400
longitude                57516
latitude

In [7]:
cols = ['recorded_by', 'lga', 'date_recorded', 'ward', 'wpt_name']
train = train_no_nulls.drop(columns=cols)
train.head()

Unnamed: 0,amount_tsh,gps_height,longitude,latitude,num_private,basin,region,region_code,district_code,population,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,6000.0,1390,34.938093,-9.856322,0,Lake Nyasa,Iringa,11,5,109,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,0.0,1399,34.698766,-2.147466,0,Lake Victoria,Mara,20,2,280,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,25.0,686,37.460664,-3.821329,0,Pangani,Manyara,21,4,250,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,0.0,263,38.486161,-11.155298,0,Ruvuma / Southern Coast,Mtwara,90,63,58,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,0.0,0,31.130847,-1.825359,0,Lake Victoria,Kagera,18,1,0,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [8]:
X = train.drop('status_group', axis=1)
y = train.status_group
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((44550, 27), (14850, 27), (44550,), (14850,))

In [18]:
%%time
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', multi_class='ovr')
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(pipeline.score(X, y))  # accuracy on all rows of X (train and test), not a held-out score

0.7344612794612795
CPU times: user 27.1 s, sys: 30.1 s, total: 57.1 s
Wall time: 6.09 s


In [21]:
accuracy_score(y_test, y_pred)

0.7312457912457913

In [20]:
%%time
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', multi_class='ovr', n_jobs=-1)
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(pipeline.score(X, y))

0.7344612794612795
CPU times: user 2.72 s, sys: 664 ms, total: 3.38 s
Wall time: 25.7 s


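With `n_jobs=-1`, the logistic regression pipeline took 25.7 s of wall time versus 6.09 s with the default setting, more than 4 times longer. The default run already used 57.1 s of CPU time in 6.09 s of wall time, which suggests its numerical routines were already running on several cores, so the extra parallelism likely added overhead rather than speed here.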

In [22]:
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42, verbose=1)
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(pipeline.score(X, y))  # accuracy on all rows of X (train and test), not a held-out score

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:   20.2s finished
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    1.0s finished


0.9455050505050505


[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.9s finished


In [23]:
accuracy_score(y_test, y_pred)

0.7971717171717172
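
The gap between the printed score (0.9455) and the test accuracy (0.7972) arises because `pipeline.score(X, y)` is evaluated on all 59,400 rows, including the 44,550 rows the model was trained on. The sketch below backs out the implied training accuracy from the figures already shown. The helper name `implied_train_accuracy` is made up for this illustration.

```python
n_train, n_test = 44_550, 14_850  # row counts from the train_test_split shapes
n_all = n_train + n_test

def implied_train_accuracy(score_all, score_test):
    """Recover training accuracy from the all-rows score and the test score."""
    correct_all = round(score_all * n_all)     # correct predictions over all rows
    correct_test = round(score_test * n_test)  # correct predictions on test rows
    return (correct_all - correct_test) / n_train

rf_train = implied_train_accuracy(0.9455050505050505, 0.7971717171717172)
logreg_train = implied_train_accuracy(0.7344612794612795, 0.7312457912457913)

print(f"random forest training accuracy:       {rf_train:.4f}")
print(f"logistic regression training accuracy: {logreg_train:.4f}")
```

The random forest is almost perfect on its own training rows, which inflates the all-rows score. The logistic regression scores nearly the same on training and test rows. The held-out `accuracy_score(y_test, y_pred)` is the fairer number for comparing the two models.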

In [25]:
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42, verbose=1)
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(pipeline.score(X, y))

[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:    1.9s finished
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 200 out of 200 | elapsed:    0.1s finished
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s


0.9455050505050505


[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    0.3s
[Parallel(n_jobs=16)]: Done 200 out of 200 | elapsed:    0.4s finished


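The random forest fit was about 10 times faster using all cores: 1.9 s with `n_jobs=-1` versus 20.2 s with the default single job. The scores are unchanged.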