# Testing RAPIDS Features

This notebook is my testing ground for some RAPIDS features, to understand how they work and whether they are useful for Sanaa's ML code.

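- dask_cuda - Dask utilities for running a cluster on GPUs (one worker per GPU).
- cudf - pandas-like DataFrames on the GPU.
- dask_cudf - Dask DataFrames backed by cudf, for partitioned GPU DataFrames.
- cuml - Similar functions to scikit-learn, but on the GPU.
- cupy - NumPy-like arrays on the GPU.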

Since all these libraries are part of RAPIDS, they should work well together, and they could potentially replace the scikit-learn code in the current ML script.

## The Libraries:

In [1]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

import cudf
import cuml
import cupy as cp
import dask_cudf

cluster = LocalCUDACluster()
client = Client(cluster)

distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize


In [2]:
print(client)

<Client: 'tcp://127.0.0.1:44763' processes=2 threads=2, memory=191.00 GiB>


## Testing dask delayed

In [100]:
from dask import delayed
import time

In [29]:
@delayed
def inc(x):
    time.sleep(0.5)
    return x + 1

@delayed
def double(x):
    time.sleep(0.5)
    return 2 * x

@delayed
def add(x, y):
    time.sleep(0.5)
    return x + y

In [40]:
data = cp.array([1, 2, 3, 4])
output = cp.array([])

In [62]:
data

array([1, 2, 3, 4])

In [63]:
%%time

#data = [1, 2, 3, 4]
data = cp.array([1, 2, 3, 4])

output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)

total = delayed(sum)(output)
total

CPU times: user 2.62 ms, sys: 752 µs, total: 3.38 ms
Wall time: 2.72 ms


Delayed('sum-7af87d9f-4f48-4637-bacc-07ae8b6c7efa')

In [32]:
%%time

total.compute()

CPU times: user 339 ms, sys: 48.6 ms, total: 387 ms
Wall time: 3.05 s


34
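The result and the roughly 3 s wall time above can be sanity-checked by hand. The sketch below is plain Python with no Dask involved: it recomputes the sum with non-delayed copies of the three functions and estimates the run time, assuming each task costs only its 0.5 s sleep and that the two workers reported by the client each run one task at a time. The final `sum` task does not sleep, so it is ignored.

```python
import math

# Plain (non-delayed) versions of the notebook's three functions, without the sleep.
def inc(x):
    return x + 1

def double(x):
    return 2 * x

def add(x, y):
    return x + y

data = [1, 2, 3, 4]
total = sum(add(inc(x), double(x)) for x in data)

task_seconds = 0.5        # time.sleep(0.5) inside each task
n_tasks = 3 * len(data)   # inc, double and add for every element
n_workers = 2             # processes=2, threads=2 reported by the client

# Time if every task ran one after another.
serial_seconds = n_tasks * task_seconds
# Lower bound on wall time if both workers stay busy the whole time.
parallel_seconds = math.ceil(n_tasks / n_workers) * task_seconds

print(total, serial_seconds, parallel_seconds)
```

Under these assumptions the serial estimate is 6 s and the two-worker estimate is 3 s, which appears consistent with the measured 3.05 s.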

## Converting main_MLR2_RF.py to GPU 

In [97]:
load_dir_dataset = "/g/data/w97/sho561/Downscale/BARRA/Training_Testing_new/"

In [122]:
train_grids = cp.array([642, 714, 720, 1207, 1233, 1682, 1728, 2348, 2817, 2855, 3002, 3114, 3346, 3809, 4233, 4322, 4615, 4623, 6081, 6145])
all_years = cp.arange(1990,2019, step=1)
train_years = cp.array([1990, 1991, 1992, 1995, 1996, 2001, 2003, 2004, 2016, 2018])
test_years = cp.array([1993, 1994, 1997, 1998, 1999, 2000, 2002, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2017, 2019]) 

featuresList = ['av_lat_hflx', 'av_mslp', 'av_netlwsfc', 'av_netswsfc', 'av_qsair_scrn', 'av_temp_scrn', 
'av_canopy_height', 'av_uwnd10m', 'av_vwnd10m', 'av_leaf_area_index', 'soil_albedo', 'soil_porosity', 'soil_bulk_density', 'topog' ]

seed = 100
ntrees = 100
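
Before loading any data, it may be worth checking that the year lists are consistent with each other. The sketch below uses plain Python lists and sets rather than cupy so it runs without a GPU; the values are copied from the cell above, and the names `overlap`, `never_loaded` and `unassigned` are only illustrative.

```python
all_years = list(range(1990, 2019))  # same values as cp.arange(1990, 2019, step=1)
train_years = [1990, 1991, 1992, 1995, 1996, 2001, 2003, 2004, 2016, 2018]
test_years = [1993, 1994, 1997, 1998, 1999, 2000, 2002, 2005, 2006, 2007,
              2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2017, 2019]

# Years that appear in both splits (should be empty).
overlap = sorted(set(train_years) & set(test_years))
# Years used in a split but outside all_years, so never read by a loop over all_years.
never_loaded = sorted((set(train_years) | set(test_years)) - set(all_years))
# Years in all_years that belong to neither split.
unassigned = sorted(set(all_years) - set(train_years) - set(test_years))

print("overlap:", overlap)
print("never loaded:", never_loaded)
print("unassigned:", unassigned)
```

Because `arange(1990, 2019)` stops at 2018, the year 2019 in `test_years` falls outside `all_years`. Whether that is intended depends on which files exist in the dataset directory.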

- Check that the array is a cupy (GPU) array

In [92]:
cp.get_array_module(all_years)

<module 'cupy' from '/opt/conda/envs/rapids/lib/python3.9/site-packages/cupy/__init__.py'>

In [None]:
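@delayed
def opendata(file):
    # Unfinished stub; the body is an assumption based on the loading loop below.
    return cudf.read_csv(file)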

- Training data

In [133]:
%%time

all_sample_df = cudf.DataFrame()
for year in all_years:
        
    filename_dataset = load_dir_dataset +'%s_%s_predictors_target.csv' %(train_grids[0], year)
    single_year_df = cudf.read_csv(filename_dataset)
    # Multi layer perceptron doesn't like Null values
    single_year_df = single_year_df.dropna(axis=0)
    all_sample_df = cudf.concat([all_sample_df, single_year_df])
        
    
# Concatenating leaves repeated index values, so reset the index of the combined dataframe
all_sample_df = all_sample_df.reset_index()

CPU times: user 0 ns, sys: 797 ms, total: 797 ms
Wall time: 612 ms
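
The loop above builds one filename per year and re-concatenates the growing dataframe on every pass. A possible first step towards restructuring it is to build the full list of filenames up front, as sketched below in plain Python. The grid value `642` is `train_grids[0]` from the earlier cell, and the name `filenames` is only illustrative.

```python
load_dir_dataset = "/g/data/w97/sho561/Downscale/BARRA/Training_Testing_new/"
grid = 642                     # train_grids[0] in the notebook
all_years = range(1990, 2019)  # same values as cp.arange(1990, 2019, step=1)

# One CSV path per year, using the same naming pattern as the loop above.
filenames = [
    f"{load_dir_dataset}{grid}_{year}_predictors_target.csv"
    for year in all_years
]

print(len(filenames))
print(filenames[0])
print(filenames[-1])
```

With the list in hand, the per-year dataframes could be collected in a Python list and passed to a single `cudf.concat` call at the end, instead of concatenating inside the loop.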


In [128]:
cp.get_array_module(single_year_df.to_cupy())

<module 'cupy' from '/opt/conda/envs/rapids/lib/python3.9/site-packages/cupy/__init__.py'>

In [141]:
# split the data into training and testing
index_testing = all_sample_df.where(all_sample_df['year'].isin(test_years))
#index_testing = index_testing.dropna(axis=0).to_cupy()
index_training = all_sample_df.where(all_sample_df['year'].isin(train_years))
#index_training = index_training.dropna(axis=0).to_cupy()
#in_sample_df = all_sample_df.iloc[index_training.to_cupy()]
#out_sample_df = all_sample_df.iloc[index_testing.to_cupy()]

In [142]:
index_training

Unnamed: 0,index,ref_coarse_cell,ref_fine_cell,year,month,day,target,av_lat_hflx,av_mslp,av_netlwsfc,...,topog,soil_albedo,ETnw,ETn,ETne,ETw,ETe,ETsw,ETs,ETse
0,0,642,38410,1990,1,1,127.458336,103.0,101483.25,-95.066406,...,558.253296,0.115572,108.75,114.75,138.75,100.25,116.5,102.25,107.75,126.5
1,1,642,38411,1990,1,1,130.458328,103.0,101483.25,-95.066406,...,578.896240,0.115198,108.75,114.75,138.75,100.25,116.5,102.25,107.75,126.5
2,2,642,38412,1990,1,1,130.916672,103.0,101483.25,-95.066406,...,589.701294,0.114823,108.75,114.75,138.75,100.25,116.5,102.25,107.75,126.5
3,3,642,38413,1990,1,1,131.083328,103.0,101483.25,-95.066406,...,571.424072,0.114449,108.75,114.75,138.75,100.25,116.5,102.25,107.75,126.5
4,4,642,38414,1990,1,1,132.666672,103.0,101483.25,-95.066406,...,535.188477,0.114075,108.75,114.75,138.75,100.25,116.5,102.25,107.75,126.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
677883,23355,642,42571,2018,9,9,69.166664,69.5,101989.75,-109.976562,...,697.690308,0.106115,67.00,72.25,72.25,59.00,74.0,63.00,70.50,78.5
677884,23356,642,42572,2018,9,9,72.000000,69.5,101989.75,-109.976562,...,609.880127,0.107231,67.00,72.25,72.25,59.00,74.0,63.00,70.50,78.5
677885,23357,642,42573,2018,9,9,71.791664,69.5,101989.75,-109.976562,...,558.244934,0.108348,67.00,72.25,72.25,59.00,74.0,63.00,70.50,78.5
677886,23358,642,42574,2018,9,9,73.083336,69.5,101989.75,-109.976562,...,542.575623,0.109464,67.00,72.25,72.25,59.00,74.0,63.00,70.50,78.5


In [149]:
X = all_sample_df[featuresList]
X = cuml.preprocessing.StandardScaler().fit_transform(X) 
#X = X[index_training.dropna(axis=0).to_cupy()]
#print('shape of X is ', X.shape)
#print(index_training.shape)

In [152]:
# target
y = all_sample_df.iloc[index_training.dropna(axis=0).to_cupy()]
y  = y[['target']]

In [153]:
%%time
import random

random.seed(seed)
regr = cuml.linear_model.LinearRegression()
regr.fit(X, y)

ValueError: Expected 677888 rows but got 233856 rows.
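
The error appears to come from a row mismatch: `X` was built from every row of `all_sample_df` (677888 rows), while `y` was restricted to the training years (233856 rows). One way to keep the two aligned is to build a single boolean mask and apply it to both. The sketch below shows the idea on a tiny made-up dataframe. It falls back to pandas when cudf cannot be imported, which is an assumption made only so the example runs without a GPU; the column values and the names `train_mask`, `X_train`, `y_train` and `X_test` are hypothetical.

```python
try:
    import cudf as xdf    # GPU path, as used in the notebook
except Exception:
    import pandas as xdf  # CPU fallback so the sketch runs without a GPU (assumption)

# Tiny made-up frame standing in for all_sample_df.
df = xdf.DataFrame({
    "year":    [1990, 1990, 1993, 1995, 1997, 2001],
    "av_mslp": [101483.25, 101480.0, 101500.5, 101989.75, 101700.0, 101650.25],
    "topog":   [558.25, 578.9, 589.7, 571.4, 535.2, 697.7],
    "target":  [127.5, 130.5, 130.9, 131.1, 132.7, 69.2],
})
train_years = [1990, 1995, 2001]
features = ["av_mslp", "topog"]

# One boolean mask, applied to both X and y, keeps their rows aligned.
train_mask = df["year"].isin(train_years)
X_train = df[train_mask][features]
y_train = df[train_mask]["target"]
X_test = df[~train_mask][features]

print(len(X_train), len(y_train), len(X_test))
```

In the notebook, the same mask would be applied before calling `regr.fit`, so that `X` and `y` have the same number of rows.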