# Random forest regression

## Dask + RAPIDS GPU cluster

<table>
    <tr>
        <td>
            <img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="300">
        </td>
        <td>
            <img src="https://rapids.ai/assets/images/RAPIDS-logo-purple.svg" width="300">
        </td>
    </tr>
</table>

In [27]:
import os

# Initialize Dask GPU cluster

In [28]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 2
cluster = SaturnCluster(
    n_workers=n_workers,
    scheduler_size='medium',
    worker_size='g4dnxlarge'
)
client = Client(cluster)
cluster

[2020-12-15 07:02:57] INFO - dask-saturn | Cluster is ready



Open the dashboard (link above) and watch it while you run commands; it shows which tasks are running across the cluster. Two other dashboard pages, for GPU memory and GPU utilization, are not listed on the navbar, so we grab direct links to them below.

In [29]:
from IPython.display import display, HTML

gpu_links = f'''
<b>GPU Dashboard links</b>
<ul>
<li><a href="{client.dashboard_link}/individual-gpu-memory" target="_blank">GPU memory</a></li>
<li><a href="{client.dashboard_link}/individual-gpu-utilization" target="_blank">GPU utilization</a></li>
</ul>
'''
display(HTML(gpu_links))

If you created your cluster in this notebook, it might take a few minutes for all your nodes to become available. Run the cell below to block until all nodes are ready.

>**Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [30]:
client.wait_for_workers(n_workers=n_workers)

# Load data and feature engineering

Load a full month for this exercise. Note that we now load the data with Dask + RAPIDS (`dask_cudf.read_csv` instead of `pd.read_csv`).

In [31]:
import numpy as np
import dask_cudf


In [32]:
data = dask_cudf.read_csv(
    's3://kjkasjdk2934872398ojljosudfsu8fuj23/data_rev8.csv',
    storage_options={'anon': True},
    assume_missing=True,
)

In [33]:
print(f'Num rows: {len(data)}, Size: {data.memory_usage(deep=True).sum().compute() / 1e6} MB')

Num rows: 200000, Size: 305.688911 MB


In [34]:
data = data.drop(columns=['Unnamed: 0', 'Time'])
data = data.astype('float32')

Dask evaluates computations [lazily](https://tutorial.dask.org/01x_lazy.html), so persisting the dataframe is what actually runs the data loading and feature processing and loads the result into GPU memory.
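To make the lazy/persist distinction concrete, here is a toy model in plain Python. It is not how Dask is implemented (Dask builds a task graph and, on a distributed cluster, persists asynchronously); the `Lazy` class and `load_and_cast` function are hypothetical names used only for illustration. The point it shows is that `compute()` re-runs the work each time, while `persist()` runs it once and returns a new object holding the result.

```python
class Lazy:
    """Toy model of a lazy collection: it stores a recipe and runs it only on demand."""

    def __init__(self, recipe, cached=None, is_cached=False):
        self._recipe = recipe
        self._cached = cached
        self._is_cached = is_cached

    def compute(self):
        # Re-run the recipe unless a persisted result is already held.
        return self._cached if self._is_cached else self._recipe()

    def persist(self):
        # Run the recipe once and return a NEW object that holds the result.
        return Lazy(self._recipe, cached=self._recipe(), is_cached=True)


load_calls = []

def load_and_cast():
    # Hypothetical stand-in for reading the CSV and casting to float32.
    load_calls.append(1)
    return [1.0, 2.0, 3.0]

lazy_data = Lazy(load_and_cast)    # nothing has run yet
first = lazy_data.compute()        # runs the recipe
second = lazy_data.compute()       # runs it again
persisted = lazy_data.persist()    # runs it once more and keeps the result
third = persisted.compute()        # served from the held result, no new run
n_runs = len(load_calls)
```

Because `persist()` returns a new object, the usual Dask idiom is to rebind the name, for example `data = data.persist()`.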

In [35]:
features = list(data.columns[1:])
target = data.columns[0]

# Train model

In [36]:
%pip install pyDOE

Note: you may need to restart the kernel to use updated packages.


In [37]:
n_samples = 10

min_rows_per_node = [2, 50]
rows_sample = [0.1, 0.99]
max_features = [40, 70]

In [38]:
from pyDOE import lhs
import numpy as np
np.random.seed(42)

lhd = lhs(3, samples=n_samples)

In [39]:
lhd

array([[0.71394939, 0.09507143, 0.4181825 ],
       [0.95142344, 0.53042422, 0.81996738],
       [0.51834045, 0.28661761, 0.11559945],
       [0.845607  , 0.62912291, 0.39699099],
       [0.37080726, 0.30205845, 0.07319939],
       [0.15986585, 0.72921446, 0.55247564],
       [0.03745401, 0.42123391, 0.2601115 ],
       [0.48324426, 0.11560186, 0.90464504],
       [0.6431945 , 0.8785176 , 0.73663618],
       [0.20580836, 0.95924146, 0.66118529]])
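The array above is a Latin hypercube design: within each column, the ten values fall into the ten intervals [0, 0.1), [0.1, 0.2), ... exactly once each. A minimal standard-library sketch of that idea follows. It is an illustration of the technique, not pyDOE's code, and `latin_hypercube` is a hypothetical helper name.

```python
import random

def latin_hypercube(n_dims, n_samples, seed=0):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of the n_samples equal-width strata...
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # ...then shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    # Transpose from per-dimension columns to per-sample rows.
    return [list(point) for point in zip(*columns)]

design = latin_hypercube(n_dims=3, n_samples=10)

# Stratum index (0..9) hit by each sample, per dimension.
strata = [sorted(int(point[d] * 10) for point in design) for d in range(3)]
```

Compared with plain uniform sampling, this spreads a small number of trials evenly over each hyperparameter's range.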

In [40]:
import pandas as pd

def scale_param(x, limits):
    range_ = limits[1]-limits[0]
    res = x*range_+min(limits)
    return res

samples = pd.DataFrame({'min_rows_per_node': np.round(scale_param(lhd[:,0], min_rows_per_node),0).astype(int).tolist(),
           'rows_sample': scale_param(lhd[:,1], rows_sample).tolist(),
           'max_features': np.round(scale_param(lhd[:,2], max_features),0).astype(int).tolist()
          })
samples.head()

| | min_rows_per_node | rows_sample | max_features |
|---|---|---|---|
| 0 | 36 | 0.184614 | 53 |
| 1 | 48 | 0.572078 | 65 |
| 2 | 27 | 0.35509 | 43 |
| 3 | 43 | 0.659919 | 52 |
| 4 | 20 | 0.368832 | 42 |
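As a worked check of the scaling step, the sketch below maps the first row of the unit-cube design onto the three parameter ranges using only the standard library. It restates `scale_param` with an explicit lower and upper bound; the variable names are chosen for illustration.

```python
def scale_param(x, limits):
    """Map x from [0, 1] onto the interval [min(limits), max(limits)]."""
    low, high = min(limits), max(limits)
    return low + x * (high - low)

# First row of the Latin hypercube design printed earlier.
unit_row = [0.71394939, 0.09507143, 0.4181825]

min_rows_per_node = round(scale_param(unit_row[0], [2, 50]))   # 2 + 0.7139 * 48
rows_sample = scale_param(unit_row[1], [0.1, 0.99])            # 0.1 + 0.0951 * 0.89
max_features = round(scale_param(unit_row[2], [40, 70]))       # 40 + 0.4182 * 30

print(min_rows_per_node, round(rows_sample, 6), max_features)  # 36 0.184614 53
```

The result matches row 0 of the `samples` table.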


In [41]:
from cuml.dask.ensemble import RandomForestRegressor
from cuml.metrics.regression import mean_absolute_error, mean_squared_error, r2_score
from dask import dataframe as dd 

In [42]:
from tqdm.auto import tqdm

In [43]:
fold_train = []
fold_test = []

# Materialize features and target once instead of on every fold
features_df = data[features].compute()
target_sr = data[target].compute()

for fold in tqdm(range(4), total=4):
    # Each fold trains on a 40,000-row block and tests on the rows that follow it
    fold_train_start = fold*40000
    fold_train_end = (fold+1)*40000
    fold_test_end = (fold+1)*50000

    train_data_x = dd.from_pandas(features_df.iloc[fold_train_start:fold_train_end], npartitions=n_workers)
    train_data_y = dd.from_pandas(target_sr.iloc[fold_train_start:fold_train_end], npartitions=n_workers)
    
    test_data_x = dd.from_pandas(features_df.iloc[fold_train_end:fold_test_end], npartitions=n_workers)
    test_data_y = dd.from_pandas(target_sr.iloc[fold_train_end:fold_test_end], npartitions=n_workers)
    
    fold_train.append([train_data_x, train_data_y])
    fold_test.append([test_data_x, test_data_y])
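To see which rows each fold uses, the index arithmetic from the loop can be evaluated on its own. The helper `fold_bounds` is a hypothetical name; the constants are the ones used in the cell above.

```python
def fold_bounds(fold, train_size=40000, test_stride=50000):
    """Return ((train_start, train_end), (test_start, test_end)) for one fold."""
    train_start = fold * train_size
    train_end = (fold + 1) * train_size
    test_end = (fold + 1) * test_stride
    # The test block starts where the training block ends.
    return (train_start, train_end), (train_end, test_end)

for fold in range(4):
    (tr_a, tr_b), (te_a, te_b) = fold_bounds(fold)
    print(f"fold {fold}: train [{tr_a}, {tr_b}), test [{te_a}, {te_b}), {te_b - te_a} test rows")
```

With these constants the test block grows from 10,000 rows in fold 0 to 40,000 rows in fold 3, so per-fold metrics are averaged over test sets of different sizes. Whether equal-sized test blocks were intended is not stated in the notebook.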


In [50]:
fold_train[0][0].persist()

Dask DataFrame structure (npartitions=2; divisions 0, 20000, 39999): 186 columns, all float32
Columns: 0_wind_speed_ms, 0_temp_c, 1_wind_speed_ms, 1_temp_c, 2_wind_speed_ms, ..., 0_wind_lag276_lag300, 1_wind_lag276_lag300, 2_wind_lag276_lag300, 3_wind_lag276_lag300, 4_wind_lag276_lag300


In [52]:
res = []

for sample in tqdm(list(samples.index), total=samples.shape[0]):
    # Cast to built-in types so the estimator and the results get plain Python numbers
    min_rows = int(samples.loc[sample, 'min_rows_per_node'])
    rows_frac = float(samples.loc[sample, 'rows_sample'])
    max_feats = int(samples.loc[sample, 'max_features'])

    this_res = {
        'min_rows_per_node': min_rows,
        'rows_sample': rows_frac,
        'max_features': max_feats,
        'res': {'folds': []},
    }
    for fold in tqdm(range(4), total=4):
        client.wait_for_workers(n_workers=n_workers)
        this_fold = {}

        rfr = RandomForestRegressor(n_estimators=2000,
                                    min_rows_per_node=min_rows,
                                    rows_sample=rows_frac,
                                    max_features=max_feats,
                                    ignore_empty_partitions=True)
        _ = rfr.fit(*fold_train[fold])

        preds = rfr.predict(fold_test[fold][0]).compute()
        orig = fold_test[fold][1].compute()

        this_fold['mae'] = float(mean_absolute_error(orig, preds))
        this_fold['rmse'] = float(mean_squared_error(orig, preds, squared=False))
        this_fold['r2'] = float(r2_score(orig, preds))
        this_res['res']['folds'].append(this_fold)
    # Average each metric over the folds
    for metric in ('mae', 'rmse', 'r2'):
        this_res['res'][metric] = np.mean([x[metric] for x in this_res['res']['folds']])
    res.append(this_res)

KeyboardInterrupt: 
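For reference, the three metrics collected per fold can be written out in plain Python. This is a sketch of the standard definitions on made-up toy values, not cuML's GPU implementation and not results from this dataset.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error (what squared=False requests)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

toy_mae = mae(y_true, y_pred)    # (1 + 0 + 1 + 0) / 4 = 0.5
toy_rmse = rmse(y_true, y_pred)  # sqrt(2 / 4)
toy_r2 = r2(y_true, y_pred)      # 1 - 2 / 20 = 0.9
```

MAE and RMSE are in the units of the target, so lower is better; R² is unitless and higher is better.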

In [48]:
help(RandomForestRegressor)

Help on class RandomForestRegressor in module cuml.dask.ensemble.randomforestregressor:

class RandomForestRegressor(cuml.dask.ensemble.base.BaseRandomForestModel, cuml.dask.common.base.DelayedPredictionMixin, cuml.dask.common.base.BaseEstimator)
 |  RandomForestRegressor(workers=None, client=None, verbose=False, n_estimators=10, seed=None, ignore_empty_partitions=False, **kwargs)
 |  
 |  Experimental API implementing a multi-GPU Random Forest classifier
 |  model which fits multiple decision tree classifiers in an
 |  ensemble. This uses Dask to partition data over multiple GPUs
 |  (possibly on different nodes).
 |  
 |  Currently, this API makes the following assumptions:
 |   * The set of Dask workers used between instantiation, fit,
 |     and predict are all consistent
 |   * Training data comes in the form of cuDF dataframes or Dask Arrays
 |     distributed so that each worker has at least one partition.
 |   * The print_summary and print_detailed functions print the
 |     in

In [126]:
test_data_x

Unnamed: 0,0_wind_speed_ms,0_temp_c,1_wind_speed_ms,1_temp_c,2_wind_speed_ms,3_wind_speed_ms,4_wind_speed_ms,8_temp_c,0_wind,1_wind,...,0_wind_speed_ms_lag276_lag300,1_wind_speed_ms_lag276_lag300,2_wind_speed_ms_lag276_lag300,3_wind_speed_ms_lag276_lag300,4_wind_speed_ms_lag276_lag300,0_wind_lag276_lag300,1_wind_lag276_lag300,2_wind_lag276_lag300,3_wind_lag276_lag300,4_wind_lag276_lag300
40000,2.571353,1.000000,100.328552,9.976666,27.726419,2.163373,5.995505,11.6100,0.0,341.499298,...,11.089567,38.500340,17.306984,53.157375,19.537560,12.653730,114.285362,9.911224,170.018341,12.030053
40001,2.406104,0.997500,101.086159,9.985833,25.175226,2.269611,6.237972,11.6400,0.0,344.262299,...,12.446843,38.215996,16.793606,52.945595,19.939255,15.817163,113.194664,8.672321,169.210434,13.021033
40002,2.248091,0.995000,101.847565,9.995000,22.785532,2.379270,6.486890,11.6700,0.0,347.025299,...,13.910580,37.933056,16.290480,52.734375,20.346416,18.980595,112.103958,7.433418,168.402527,14.012013
40003,2.097152,0.992500,102.612785,10.004167,20.552061,2.492407,6.742342,11.7000,0.0,349.788330,...,15.484797,37.651516,15.797506,52.523720,20.759085,22.144028,111.013260,6.194515,167.594620,15.002994
40004,1.953125,0.990000,103.381828,10.013333,18.469528,2.609074,7.004416,11.7300,0.0,352.551331,...,17.173512,37.371372,15.314579,52.313625,21.177296,25.307461,109.922562,4.955612,166.786713,15.993974
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,1.653497,-8.652500,19.792553,-3.700000,29.575298,19.248833,16.003008,0.3225,0.0,13.932680,...,4.019679,322.123474,10.978053,13.910580,10.978053,0.000000,1161.230103,0.000000,3.477742,0.000000
49996,1.713640,-8.643333,19.537560,-3.650000,30.371328,18.679939,15.500333,0.3100,0.0,13.298255,...,3.894594,326.608276,11.039912,13.255963,11.189322,0.000000,1176.253296,0.000000,3.091326,0.000000
49997,1.775224,-8.634167,19.284767,-3.600000,31.181515,18.122366,15.008296,0.2975,0.0,12.663830,...,3.772132,331.134521,11.102004,12.622211,11.403286,0.000000,1191.276489,0.000000,2.704911,0.000000
49998,1.838266,-8.625000,19.034163,-3.550000,32.005985,17.576000,14.526784,0.2850,0.0,12.029405,...,3.652264,335.702362,11.164328,12.008989,11.619960,0.000000,1206.299683,0.000000,2.318495,0.000000


In [86]:
data_train.loc[0:10].compute()

Unnamed: 0,Wind,0_wind_speed_ms,0_temp_c,1_wind_speed_ms,1_temp_c,2_wind_speed_ms,3_wind_speed_ms,4_wind_speed_ms,8_temp_c,0_wind,...,0_wind_speed_ms_lag276_lag300,1_wind_speed_ms_lag276_lag300,2_wind_speed_ms_lag276_lag300,3_wind_speed_ms_lag276_lag300,4_wind_speed_ms_lag276_lag300,0_wind_lag276_lag300,1_wind_lag276_lag300,2_wind_lag276_lag300,3_wind_lag276_lag300,4_wind_lag276_lag300
0,2915.0,4.826809,27.51,68.920998,26.799999,92.959679,84.604523,52.313625,31.59,0.0,...,26.463593,348.913666,28.652617,35.287552,80.062988,50.239567,1248.593994,78.279472,103.68232,259.761932
1,2945.0,4.539937,27.501667,77.854485,26.730833,97.12455,85.620346,52.03437,31.580833,0.0,...,26.308489,345.825226,29.50363,35.476158,79.507004,48.807091,1238.733154,82.222816,104.438751,257.747711
2,3028.0,4.264664,27.493334,87.528381,26.661667,101.41201,86.644264,51.756111,31.571667,0.0,...,26.153994,342.755066,30.371328,35.665436,78.95359,47.374615,1228.872314,86.166161,105.195183,255.733475
3,3125.0,4.000748,27.485001,97.972183,26.592501,105.823814,87.676323,51.478848,31.5625,0.0,...,26.000103,339.703125,31.255875,35.855389,78.402756,45.942135,1219.011475,90.109505,105.951614,253.719254
4,3220.0,3.747952,27.476667,109.215355,26.523333,110.361763,88.716537,51.202576,31.553333,0.0,...,25.846819,336.669342,32.157433,36.046009,77.854485,44.509659,1209.150635,94.052849,106.708038,251.705032
5,3299.0,3.506035,27.468334,121.287376,26.454166,115.027611,89.764946,50.927292,31.544167,0.0,...,25.694138,333.653687,33.07616,36.237309,77.308777,43.077183,1199.289795,97.996193,107.46447,249.690811
6,3366.0,3.274759,27.459999,134.217728,26.385,119.823158,90.821587,50.653,31.535,0.0,...,25.542059,330.656097,34.012222,36.429279,76.765625,41.644703,1189.428955,101.939537,108.220901,247.67659
7,3415.0,3.053884,27.451666,148.035889,26.315834,124.750168,91.88649,50.379692,31.525833,0.0,...,25.390581,327.676544,34.965782,36.621929,76.225021,40.212227,1179.568115,105.882881,108.977333,245.662369
8,3472.0,2.843171,27.443333,162.771332,26.246666,129.810425,92.959679,50.107372,31.516666,0.0,...,25.239704,324.714905,35.937,36.815258,75.686966,38.779751,1169.707153,109.826225,109.733765,243.648148
9,3470.0,2.642381,27.434999,178.453552,26.1775,135.005692,94.041191,49.836033,31.5075,0.0,...,25.089426,321.771179,36.926037,37.009266,75.151451,37.347271,1159.846313,113.769569,110.490189,241.633926


In [87]:
help(data.repartition)

Help on method repartition in module dask_cudf.core:

repartition(*args, **kwargs) method of dask_cudf.core.DataFrame instance
    Wraps dask.dataframe DataFrame.repartition method.
    Uses DataFrame.shuffle if `columns=` is specified.



In [61]:
samples.loc[sample, 'max_features']

50

cuML parameters and their equivalents (the right-hand names match the arguments of R's `ranger`):

- min_rows_per_node = min.node.size
- rows_sample = sample.fraction
- max_features = mtry


In [62]:
%%time
_ = rfr.fit(data_train[features], data_train[target])



CPU times: user 24.8 ms, sys: 7.98 ms, total: 32.8 ms
Wall time: 17.4 s


# Save model

In [None]:
# not yet supported with cuml.dask

## Calculate metrics on test set

Use a different month for the test set.

In [13]:
taxi_test = dask_cudf.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-02.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
)

taxi_test = prep_df(taxi_test)

<br>

Convert to a single-GPU DataFrame using `compute()`, because the Dask+RAPIDS implementation doesn't yet have `roc_auc_score`.

In [14]:
from cuml.metrics import roc_auc_score

preds = rfc.predict_proba(taxi_test[features])[1]
roc_auc_score(taxi_test[y_col].compute(), preds.compute())

0.5444324612617493

In [69]:
from cuml.metrics.regression import mean_absolute_error

preds = rfr.predict(data_train[features])
mean_absolute_error(data_train[target].compute(), preds.compute())

array(350.27127, dtype=float32)

In [67]:
preds[1]

NotImplementedError: Series getitem in only supported for other series objects with matching partition structure

In [None]:
# Signature: cuml.metrics.regression.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')