<a href="https://colab.research.google.com/github/harupy/mlflow/blob/rapids-optuna/rapids_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [5]:
!nvidia-smi

Wed Jul 29 15:28:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
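
The same check can be scripted rather than read off the table. The sketch below is one possible approach, not part of the original notebook: the helper names `is_supported_gpu` and `query_gpu_names` are hypothetical, and the list of supported models is taken from the text above. It relies on the `nvidia-smi --query-gpu=name --format=csv,noheader` query and returns an empty list when `nvidia-smi` is missing.

```python
import shutil
import subprocess

# GPU models named in the text above.
SUPPORTED_GPUS = ("T4", "P4", "P100")


def is_supported_gpu(gpu_name):
    """Return True if the GPU name contains one of the supported model tokens."""
    # Split on spaces and hyphens so "Tesla P100-PCIE-16GB" yields the token "P100".
    tokens = gpu_name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED_GPUS)


def query_gpu_names():
    """Ask nvidia-smi for GPU names; return [] if it is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpu_names = query_gpu_names()
if not gpu_names:
    print("No GPU detected; switch the runtime type to GPU.")
for name in gpu_names:
    status = "supported" if is_supported_gpu(name) else "not in the supported list"
    print(f"{name}: {status}")
```

Matching on whole tokens rather than substrings keeps a name such as "Tesla P40" from being mistaken for a P4.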

# Setup #

The setup script:
1. Installs the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
1. Removes incompatible files
1. Installs the RAPIDS libraries
1. Sets the necessary environment variables
1. Copies the RAPIDS .so files into the current working directory, a workaround for conda/Colab interactions
1. If running v0.11 or higher, updates the pyarrow library to 0.15.x

In [6]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 165, done.
remote: Counting objects: 100% (165/165), done.
remote: Compressing objects: 100% (160/160), done.
remote: Total 165 (delta 60), reused 20 (delta 4), pack-reused 0
Receiving objects: 100% (165/165), 48.48 KiB | 6.06 MiB/s, done.
Resolving deltas: 100% (60/60), done.
PLEASE READ
********************************************************************************************************
Changes:
1. Default stable version is now 0.14.  Nightly is now 0.15.  We have fixed the long conda install.  Hooray!
2. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.15, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <version/label>"
        Examples: '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.14', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh stable', or '!bash rapidsai-csp-utils/colab/rapids-colab.s
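
The path adjustment in the install cell assumes that `/usr/local/lib/python3.6/dist-packages` is already on `sys.path`; `list.index` raises `ValueError` if it is not. A more defensive variant might look like the sketch below. The helper `insert_before` is hypothetical, and the example runs on a sample list rather than the real `sys.path`.

```python
def insert_before(paths, anchor, new_entry):
    """Return a copy of paths with new_entry placed just before anchor.

    If new_entry is already present the list is returned unchanged;
    if anchor is missing, new_entry is appended at the end.
    """
    if new_entry in paths:
        return list(paths)
    try:
        idx = paths.index(anchor)
    except ValueError:
        idx = len(paths)
    return paths[:idx] + [new_entry] + paths[idx:]


# Sample list standing in for sys.path.
sample_path = ["/content", "/usr/local/lib/python3.6/dist-packages"]
updated = insert_before(
    sample_path,
    "/usr/local/lib/python3.6/dist-packages",
    "/usr/local/lib/python3.6/site-packages",
)
print(updated)
```

To apply it in the notebook, assign the result back with `sys.path = insert_before(sys.path, ...)`.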

# cuDF and cuML Examples #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU-resident DataFrame and perform a basic calculation.

Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

_Note_: On older RAPIDS releases you may need to import nvstrings and nvcategory before cudf to avoid errors. The cell below imports only cudf.

In [7]:
import cudf
import io, requests

# download CSV file from GitHub
url="https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

# read CSV from memory
tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip']/tips_df['total_bill']*100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

size
1    21.729202
2    16.571919
3    15.215685
4    14.594901
5    14.149549
6    15.622920
Name: tip_percentage, dtype: float64
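
To make the calculation concrete, here is a CPU-only sketch of the same tip-percentage and grouped-mean logic in plain Python. It is not how cuDF computes the result; the three rows are made up for illustration, and the helper name `mean_tip_percentage_by_size` is hypothetical.

```python
from collections import defaultdict

# Hypothetical (total_bill, tip, size) rows, not taken from tips.csv.
rows = [
    (10.0, 2.5, 2),
    (20.0, 2.5, 2),
    (40.0, 5.0, 4),
]


def mean_tip_percentage_by_size(rows):
    """Group rows by party size and average tip / total_bill * 100 per group."""
    groups = defaultdict(list)
    for total_bill, tip, size in rows:
        groups[size].append(tip / total_bill * 100)
    return {size: sum(values) / len(values) for size, values in sorted(groups.items())}


result = mean_tip_percentage_by_size(rows)
print(result)
```

Here the two parties of size 2 tip 25% and 12.5%, so their group mean is 18.75%; the single party of size 4 tips 12.5%.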


# [cuML](https://github.com/rapidsai/cuml) #

This snippet loads a small set of points into a GPU DataFrame and clusters them with DBSCAN.

As above, all calculations are performed on the GPU.

In [8]:
import cuml

# Create and populate a GPU DataFrame
df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]
df_float['2'] = [4.0, 2.0, 1.0]

# Set up and fit the DBSCAN clusterer
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)

print(dbscan_float.labels_)

0    0
1    1
2    2
dtype: int32
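
Each point receives its own label because no two points lie within `eps` of each other, and with `min_samples=1` every point is a core point on its own. The CPU-only sketch below checks the pairwise Euclidean distances for the three rows of `df_float`; the helper `euclidean` is written here for illustration.

```python
import math
from itertools import combinations

# Rows of df_float: columns '0', '1', '2' read across.
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
eps = 1.0


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance for every unordered pair of row indices.
distances = {
    (i, j): euclidean(points[i], points[j])
    for i, j in combinations(range(len(points)), 2)
}
for pair, d in distances.items():
    relation = "neighbors" if d <= eps else "not neighbors"
    print(pair, round(d, 3), relation)
```

The distances are 3, √34 and √11, all greater than `eps = 1.0`, which is consistent with the three distinct labels above.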


# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

## New code

In [9]:
!pip install --quiet optuna mlflow


In [10]:
import cudf
import numpy as np
import pandas as pd
import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

## Prepare train and test data

In [12]:
n_samples = 2**12
n_features = 399
n_info = 300
data_type = np.float32

In [13]:
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)
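
As a quick check on the split sizes: when `test_size` is a float, scikit-learn rounds the test-set size up and gives the remaining rows to the training set. The arithmetic below is a sketch of that rule using the values from the cells above.

```python
import math

n_samples = 2**12   # 4096 rows, as defined above
test_size = 0.2

# Test size is rounded up; the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
print("smallest possible accuracy step:", 1 / n_test)
```

With 820 test rows, accuracy can only change in steps of 1/820, so small differences between trials correspond to a handful of test samples.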


In [14]:
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

## Define objective function

In [15]:
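def objective(trial):
    # Search space: only the tree depth is tuned; Optuna samples an integer in [8, 16].
    params_to_optimize = {
        "max_depth": trial.suggest_int('max_depth', 8, 16)
    }

    # All other hyperparameters are fixed, so trials differ only in max_depth.
    cuml_model = curfc(n_estimators=40,
                       max_features=1.0,
                       seed=10,
                       **params_to_optimize)

    # Train on the GPU, then predict on the held-out test set.
    cuml_model.fit(X_cudf_train, y_cudf_train)
    fil_preds_orig = cuml_model.predict(X_cudf_test)
    # Return test accuracy, the value the study optimizes.
    return accuracy_score(y_test.to_numpy(), fil_preds_orig)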
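
Before launching a full study, it can help to call an objective once by hand. The sketch below uses a hypothetical `StubTrial` class (not part of Optuna) that returns preset values from `suggest_int`, and a toy objective with a made-up score in place of the cuML model, so it runs without a GPU. Optuna also provides `optuna.trial.FixedTrial` for this purpose.

```python
class StubTrial:
    """Hypothetical stand-in for an Optuna trial that returns preset values."""

    def __init__(self, params):
        self.params = dict(params)

    def suggest_int(self, name, low, high):
        # Return the preset value, checking it against the declared range.
        value = self.params[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        return value


def toy_objective(trial):
    # Same search space as the objective above; the score is a made-up
    # function of max_depth, used here instead of training a model.
    max_depth = trial.suggest_int("max_depth", 8, 16)
    return 1.0 - abs(max_depth - 10) / 100


score = toy_objective(StubTrial({"max_depth": 12}))
print(score)
```

Passing a value outside the declared range raises `ValueError`, which catches a mismatch between the preset parameters and the search space early.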

## Perform optimization

In [19]:
import optuna
from optuna.integration.mlflow import MLflowCallback

mlflc = MLflowCallback(metric_name='accuracy')
study = optuna.create_study(study_name='rapids', direction='maximize')
study.optimize(objective, n_trials=10, callbacks=[mlflc])


MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.


For reproducible results, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set

[I 2020-07-29 15:40:37,430] Trial 0 finished with value: 0.754878044128418 and parameters: {'max_depth': 10}. Best is trial 0 with value: 0.754878044128418.
[I 2020-07-29 15:40:40,696] Trial 1 finished with value: 0.745121955871582 and parameters: {'max_depth': 12}. Best is trial 0 with value: 0.754878044128418.
[I 2020-07-29 15:40:44,494] Trial 2 finished with value: 0.745121955871582 and parameters: {'max_depth': 14}. Best is trial 0 with value: 0.754878044128418.
[I 2020-07-29 15:40:48,251] Trial 3 finished with value: 0.745121955871582 and parameters: {'max_depth': 14}. Best is trial 0 with value: 0.754878044128418.
[I 2020-07-29 15:40:52,055] Trial 4 finished with value: 0.742682933807373 and parameters: {'max_depth': 15}.

## Check MLflow logging results

In [28]:
!cat ./mlruns/1/meta.yaml

artifact_location: file:///content/mlruns/1
experiment_id: '1'
lifecycle_stage: active
name: rapids
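
If the experiment ID is needed in code rather than read by eye, the flat `key: value` lines shown above can be parsed with the standard library. This is a minimal sketch that handles only this flat layout, not general YAML; the helper `parse_flat_yaml` is hypothetical, and the text is copied from the output above rather than read from disk.

```python
# Contents of ./mlruns/1/meta.yaml as printed above.
META_TEXT = """\
artifact_location: file:///content/mlruns/1
experiment_id: '1'
lifecycle_stage: active
name: rapids
"""


def parse_flat_yaml(text):
    """Parse flat 'key: value' lines into a dict of strings (not general YAML)."""
    result = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # Split at the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        result[key.strip()] = value.strip().strip("'\"")
    return result


meta = parse_flat_yaml(META_TEXT)
print(meta["name"], meta["experiment_id"])
```

The resulting `meta["experiment_id"]` is a string, which is the form `mlflow.search_runs(experiment_ids=[...])` is given in the next cell.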


In [35]:
import mlflow

runs = mlflow.search_runs(experiment_ids=["1"])
runs.sort_values("metrics.accuracy", ascending=False).head()[["params.max_depth", "metrics.accuracy"]]

|    | params.max_depth | metrics.accuracy |
|----|------------------|------------------|
| 20 | 8                | 0.762195         |
| 10 | 8                | 0.762195         |
| 3  | 9                | 0.760976         |
| 23 | 9                | 0.760976         |
| 13 | 11               | 0.757317         |
