# Examples - Distributed Concurrent.futures 
## - Ad Hoc Distributed Random Forests
https://gist.github.com/mrocklin/9f5720d8658e5f2f66666815b1f03f00

Ad-Hoc Distributed Random Forests on NYCTaxi Dataframes
=======================================================

Using Dask.distributed and Scikit-Learn, we train a distributed random forest on the NYCTaxi data.

**Learning Objective**: Predict passenger counts given fare, distance, location, etc.

**Actual Objective**: Show how to use dask.distributed in a free-form way without collections

**Disclaimer**: Our machine learning approach is flawed


In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
from distributed import Client, progress, wait
e = Client()
e

Client: Scheduler: tcp://127.0.0.1:52993, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 8.48 GB


## NYCTaxi data living on S3

Loaded into memory, this data takes up something like 60 GB of RAM.

We'll try to predict `passenger_count` given the other numeric columns.

In [3]:
from s3fs import S3FileSystem

s3 = S3FileSystem(anon=True)
s3.ls('dask-data/nyc-taxi/2015/', detail = False)

['dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-02.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-03.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-04.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-05.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-06.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-07.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-08.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-09.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-10.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-11.csv',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-12.csv',
 'dask-data/nyc-taxi/2015/parquet.gz',
 'dask-data/nyc-taxi/2015/parquet',
 'dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.parq']

In [None]:
import dask.dataframe as dd

dfs = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv', 
                  parse_dates=['tpep_pickup_datetime', 
                               'tpep_dropoff_datetime'],
                  collection=False,
                  storage_options={'anon': True})

dfs = e.compute(dfs)
# dfs
# progress(dfs)

In [5]:
len(dfs)

365

In [6]:
dfs[:10]

[<Future: status: pending, key: pandas_read_text-652ad00ebcf26f069e83ac511ca8e5b0>,
 <Future: status: pending, key: pandas_read_text-da3e150dfdbb37450dab2e148a1c79b8>,
 <Future: status: pending, key: pandas_read_text-bc5a23b8662841cbf0f2aa0c97d4575f>,
 <Future: status: pending, key: pandas_read_text-e423eb879458e1baede77b52c66d3956>,
 <Future: status: pending, key: pandas_read_text-0f9a3ec9ba3f351aede693f98e1840ee>,
 <Future: status: pending, key: pandas_read_text-6b611c79ca55007bddd5cbfae9222e96>,
 <Future: status: pending, key: pandas_read_text-347e6ddcbd530efe148644b23601baa3>,
 <Future: status: pending, key: pandas_read_text-fc5e73db2ad00009ccee0b7c48e1351a>,
 <Future: status: pending, key: pandas_read_text-ac2d6159ca15202e08d06a7d3a6bd55a>,
 <Future: status: pending, key: pandas_read_text-dd52425f74171b3b0318a8796cf15275>]

In [7]:
dfs[0]

In [None]:
df = dfs[0].result()
df

In [None]:
df.tail()

In [None]:
df.columns

### Start with a sample on a single machine 

In [None]:
df_train, df_test = train_test_split(df)

In [None]:
%%time

columns = ['trip_distance', 'pickup_longitude', 'pickup_latitude', 
           'dropoff_longitude', 'dropoff_latitude', 'payment_type', 
           'fare_amount', 'mta_tax', 'tip_amount', 'tolls_amount']

est = RandomForestClassifier(n_estimators=4)
est.fit(df_train[columns], df_train.passenger_count)

### Score results

In [None]:
est.score(df_test[columns], df_test.passenger_count)

OK, 65% accuracy isn't bad.  

But really, always guessing a single passenger wouldn't be that much worse.

In [None]:
from sklearn.metrics import accuracy_score
import numpy as np

accuracy_score(df_test.passenger_count, 
               np.ones_like(df_test.passenger_count))

In [None]:
(df_test.passenger_count == 1).sum() / len(df_test)

So let's be upfront: I'm probably not choosing the correct algorithms here.  Machine learning requires at least a little bit of expertise to do well.
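To make the baseline comparison concrete, here is a minimal sketch using only the standard library. The helper name `majority_baseline_accuracy` and the passenger counts are made up for illustration; they are not taken from the taxi data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Hypothetical passenger counts, made up for illustration only.
passenger_count = [1, 1, 1, 2, 1, 5, 1, 3, 1, 2]

baseline = majority_baseline_accuracy(passenger_count)
print(baseline)  # 6 of the 10 rides have one passenger -> 0.6
```

Any classifier worth keeping should beat this number by a clear margin.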

### Distributed fit with `e.map`

Let's keep going through the motions of fitting on a cluster though.  It'll be informative, I promise.

We'll map a function across our futures with `e.map`.

In [None]:
len(dfs)

In [None]:
def fit(df):
    est = RandomForestClassifier(n_estimators=4)
    est.fit(df[columns], df.passenger_count)
    return est

train = dfs[:-1]
test = dfs[-1]

estimators = e.map(fit, train)
progress(estimators, complete=False)
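The same fit-per-partition pattern can be tried locally without a cluster. This is only a sketch: the standard library's `ThreadPoolExecutor` stands in for the dask `Client`, the toy `fit` replaces the random forest, and the partitions are made-up lists rather than DataFrames.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fit(labels):
    """Toy stand-in for a model: remember the most common label."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical partitions standing in for the per-file DataFrames.
partitions = [[1, 1, 2], [1, 5, 5], [2, 2, 1], [1, 1, 1]]
train = partitions[:-1]  # every partition except the last
test = partitions[-1]    # held-out partition

with ThreadPoolExecutor(max_workers=2) as pool:
    # One future per training partition, like the list that Client.map returns.
    futures = [pool.submit(fit, part) for part in train]
    # Collect the results, roughly what e.gather(futures) does.
    models = [f.result() for f in futures]

print(models)  # [1, 5, 2]
```

Each partition is fitted independently, so no task waits on another and the work spreads across however many workers are available.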

### Broadcast our test data across all nodes

In [None]:
test

In [None]:
%time e.replicate([test], n=48)

### Make predictions from each of our models

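We'll use `e.submit(function, *args)` in a loop to submit more tasks.  Futures can be passed directly as arguments, so each prediction task waits for its estimator to finish.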

In [None]:
def predict(est, X):
    return est.predict(X[columns])

predictions = [e.submit(predict, est, test) for est in estimators]
progress(predictions, complete=False)

In [None]:
x = predictions[3].result()
x

In [None]:
x.shape

### Aggregate by Majority Vote

In [None]:
from scipy.stats import mode
import numpy as np

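def mymode(*arrays):
    # Stack the per-model predictions into shape (n_models, n_rows)
    array = np.stack(arrays, axis=0)
    # Vote down the model axis; the reshape gives shape (n_rows,) whether
    # or not this SciPy version keeps the reduced axis in the result
    return np.asarray(mode(array, axis=0)[0]).reshape(array.shape[1:])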

In [None]:
a_few_predictions = e.gather(predictions[:4])
a_few_predictions

In [None]:
mymode(*a_few_predictions)
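To see what the vote does row by row, here is a small pure-Python sketch of a column-wise majority vote. The function `majority_vote` and the three prediction lists are hypothetical. Ties go to the smallest label, matching how `scipy.stats.mode` resolves them.

```python
from collections import Counter

def majority_vote(*predictions):
    """Column-wise mode across equally long prediction sequences.

    Ties go to the smallest label.
    """
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        voted.append(min(label for label, n in counts.items() if n == best))
    return voted

# Hypothetical predictions from three models for four rides.
model_a = [1, 1, 2, 5]
model_b = [1, 2, 2, 1]
model_c = [1, 2, 3, 6]

combined = majority_vote(model_a, model_b, model_c)
print(combined)  # [1, 2, 2, 1]; the last ride is a three-way tie
```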

### Tree reduce predictions together into a single prediction

We'll use `e.submit(...)` in a nested loop to build a tree of tasks: each pass merges groups of up to 10 predictions until only one remains.

In [None]:
from toolz import partition_all
preds = predictions
while len(preds) > 1:
    preds = [e.submit(mymode, *chunk) 
             for chunk in partition_all(10, preds)]
progress(preds, complete=False)
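The shape of this reduction is easy to check on plain lists. In the sketch below, `partition_all` is a pure-Python stand-in for the toolz function (it returns lists rather than lazy tuples), `tree_reduce` is a hypothetical helper, and a sum replaces `mymode` so the answer is easy to verify. The 364 items mirror the 364 training futures left after holding one out for testing.

```python
def partition_all(n, seq):
    """Split seq into consecutive chunks of at most n items.

    Pure-Python stand-in for toolz.partition_all, for lists only.
    """
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def tree_reduce(combine, items, fan_in=10):
    """Repeatedly combine chunks of `fan_in` items until one remains."""
    level_sizes = [len(items)]
    while len(items) > 1:
        items = [combine(*chunk) for chunk in partition_all(fan_in, items)]
        level_sizes.append(len(items))
    return items[0], level_sizes

total, sizes = tree_reduce(lambda *xs: sum(xs), list(range(364)))
print(sizes)  # [364, 37, 4, 1]
print(total)  # 66066, the same as sum(range(364))
```

Three passes are enough here, and no single task ever handles more than 10 inputs.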

In [None]:
result = e.gather(preds)[0]

In [None]:
result

In [None]:
accuracy_score(test.result().passenger_count, result)
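One caveat about voting in stages: a majority of chunk-level majorities is not guaranteed to match a single vote over all models. The sketch below shows this with made-up votes for one ride; the `mode` helper is hypothetical and breaks ties toward the smallest value.

```python
from collections import Counter

def mode(values):
    """Most common value; ties go to the smallest."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, n in counts.items() if n == best)

# Hypothetical votes from nine models for a single ride.
votes = [1, 1, 2, 1, 1, 2, 2, 2, 2]

# One flat vote: 2 wins with five of the nine votes.
flat = mode(votes)

# Staged vote in chunks of three: the chunk winners are 1, 1 and 2.
chunks = [votes[i:i + 3] for i in range(0, len(votes), 3)]
staged = mode([mode(chunk) for chunk in chunks])

print(flat, staged)  # 2 1
```

For an ad hoc experiment like this one the difference probably does not matter much, but it is worth keeping in mind when reading the accuracy above.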

### Too many single-passenger rides

In [None]:
from toolz import frequencies
frequencies(result)

In [None]:
frequencies(predictions[3].result())

### Conclusion

*  We saw the dask.distributed task API:
    * `e.submit(function, *args)`
    * `e.map(function, sequence)`
    * `e.gather(futures)`

*  Our machine learning algorithms could improve
*  Replicate with [dec2](https://github.com/dask/dec2/)