<br>
<br>
<center><img src="images/horizontal.png" alt="Coiled logo" style="width: 500px;" align="center"/></center>
<br>
<center><img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" style="width: 500px;"/></center>

# Scalable Machine Learning

Let's do some machine learning using the UC Irvine ML repository's *Combined Cycle Power Plant Data Set* (https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).

This dataset consists of about 10,000 records of measurements from a combined cycle power plant. Each record contains:
- Temperature (AT), in the range 1.81°C to 37.11°C
- Ambient Pressure (AP), in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH), in the range 25.56% to 100.16%
- Exhaust Vacuum (V), in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE), in the range 420.26 to 495.76 MW

We want to model the power output as a function of the other parameters.
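To make "a function of the other parameters" concrete: the linear regression fitted later in this notebook assumes PE is roughly an intercept plus a weighted sum of AT, V, AP and RH. Here's a minimal plain-Python sketch of that functional form. The `predict_pe` helper and the coefficient values are hypothetical placeholders chosen only so the example runs; they are not fitted to the dataset.

```python
def predict_pe(at, v, ap, rh, coef, intercept):
    """Linear model: PE ~ intercept + coef[0]*AT + coef[1]*V + coef[2]*AP + coef[3]*RH."""
    features = (at, v, ap, rh)
    return intercept + sum(c * x for c, x in zip(coef, features))


# Hypothetical placeholder parameters, NOT values fitted to the data.
coef = (-2.0, -0.25, 0.05, -0.15)
intercept = 450.0

# One made-up observation: AT in °C, V in cm Hg, AP in millibar, RH in %.
pe = predict_pe(at=20.0, v=50.0, ap=1010.0, rh=70.0, coef=coef, intercept=intercept)
print(f"predicted PE: {pe:.2f} MW")
```

Fitting the model amounts to choosing `coef` and `intercept` so that predictions like this one sit as close as possible to the measured PE values.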

In [1]:
from dask.distributed import Client

client = Client(n_workers=4)  # start a local cluster with 4 workers

client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 65080 instead


| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:65081 | Workers: 4 |
| Dashboard: http://127.0.0.1:65080/status | Cores: 8 |
| | Memory: 8.59 GB |


In [2]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/powerplant.csv', sample=512000, blocksize=4e4)
ddf

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| npartitions=8 | | | | | |
| | float64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |


In [3]:
y = ddf.PE
y

Dask Series Structure:
npartitions=8
    float64
        ...
     ...   
        ...
        ...
Name: PE, dtype: float64
Dask Name: getitem, 16 tasks

In [4]:
X = ddf.drop(columns=['PE'])
X

| | AT | V | AP | RH |
|---|---|---|---|---|
| npartitions=8 | | | | |
| | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


In [5]:
X = X.to_dask_array(lengths=True)  # lengths=True computes the chunk sizes so the array has a known shape
X

| | Array | Chunk |
|---|---|---|
| Bytes | 2.45 MB | 306.18 kB |
| Shape | (76544, 4) | (9568, 4) |
| Count | 24 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |


In [6]:
y = y.to_dask_array(lengths=True)  # same conversion for the target column

In [7]:
y

| | Array | Chunk |
|---|---|---|
| Bytes | 612.35 kB | 76.54 kB |
| Shape | (76544,) | (9568,) |
| Count | 24 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
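
The byte counts in these summaries follow directly from the shapes: every element is a `float64`, which takes 8 bytes. Here's a quick plain-Python check. The `nbytes` helper is ours, written only for illustration, and the shapes are copied from the outputs above.

```python
from math import prod

FLOAT64_BYTES = 8  # size of one float64 element


def nbytes(shape, itemsize=FLOAT64_BYTES):
    """Memory footprint in bytes of a dense array with the given shape."""
    return prod(shape) * itemsize


# Shapes copied from the array summaries for X and y.
for name, shape in [
    ("X array", (76544, 4)),
    ("X chunk", (9568, 4)),
    ("y array", (76544,)),
    ("y chunk", (9568,)),
]:
    print(f"{name}: {nbytes(shape):,} bytes")
```

These come to 2,449,408, 306,176, 612,352 and 76,544 bytes, which match the 2.45 MB, 306.18 kB, 612.35 kB and 76.54 kB shown (the summaries use decimal units, so 1 kB is 1,000 bytes).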


In [8]:
from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)  # hold out 10% of the rows for testing

X_train

| | Array | Chunk |
|---|---|---|
| Bytes | 2.20 MB | 275.55 kB |
| Shape | (68888, 4) | (8611, 4) |
| Count | 64 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |


In [9]:
y_train

| | Array | Chunk |
|---|---|---|
| Bytes | 551.10 kB | 68.89 kB |
| Shape | (68888,) | (8611,) |
| Count | 64 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
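
The training shapes are consistent with each of the 8 chunks being split on its own, keeping 90% of its rows (rounded down) for training. We're inferring that from the numbers shown rather than from the dask-ml documentation, so treat this plain-Python check as a sketch under that assumption.

```python
n_chunks = 8
rows_per_chunk = 9568   # chunk length of X and y before the split
train_tenths = 9        # test_size=0.1 leaves 9/10 of the rows for training

# Assumption: each chunk is split independently and the training share is floored.
train_rows_per_chunk = rows_per_chunk * train_tenths // 10
train_rows_total = n_chunks * train_rows_per_chunk

print(train_rows_per_chunk)  # chunk length of X_train and y_train
print(train_rows_total)      # total rows in X_train and y_train
```

This gives 8,611 rows per chunk and 68,888 rows overall, the same as the shapes reported for `X_train` and `y_train`.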


In [10]:
from dask_ml.linear_model import LinearRegression

lr = LinearRegression(solver='lbfgs', max_iter=10)
lr_model = lr.fit(X_train, y_train)



In [11]:
y_predicted = lr_model.predict(X_test)

In [12]:
from dask_ml.metrics import mean_squared_error
from math import sqrt

sqrt(mean_squared_error(y_test, y_predicted))

4.542482915728186
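
The value above is the root mean squared error (RMSE): the square root of the average squared difference between the true and predicted outputs, expressed in the same units as PE (MW). As a sketch of the computation, here it is in plain Python on a few made-up numbers. The `rmse` helper and the sample values are illustrative only and are not taken from the dataset.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values in MW, for illustration only.
y_true = [450.0, 460.0, 470.0, 480.0]
y_pred = [453.0, 456.0, 470.0, 480.0]

print(rmse(y_true, y_pred))  # → 2.5
```

The errors here are -3, 4, 0 and 0 MW, so the mean squared error is 25 / 4 = 6.25 and its square root is 2.5 MW.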

In [13]:
client.close()