# MODULE 5 - Dimensionality Reduction

In the other modules we've seen a number of Machine Learning algorithms, but we've always applied a SUPERVISED learning pipeline, meaning we've always had a "LABEL" (the "correct answer") for a number of observations.


With UNSUPERVISED learning we don't have "the correct answer": there is no "label".
Humans deal with this situation all the time, and in this module we'll see how we can deal with it in Machine Learning as well.

A) Clustering
  - grouping things based on similarities
  - people do this all the time, but it can also lead to bias or even stereotyping

B) Dimensionality Reduction
  - projecting a 3D object onto a 2D screen
  - automated feature extraction (specifically, extracting features in a way that is useful to computers)
  - the extracted features are sometimes not easy to explain to a human

### Auto-encoders
An auto-encoder is a simple neural network, built with deep learning, that can learn to recognize "the good" transactions. You train it only on the good transactions; when the network runs in production, it produces a high reconstruction error for transactions that look different. This is not a definite indication that a transaction is fraudulent, but it is a good way to raise a red flag when things don't look quite right.

# 1. Importing dependencies

In [1]:
import h2o
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# 2. Starting H2O

H2O will automatically check whether an instance is already running and connect to it

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O cluster uptime:,6 hours 17 mins
H2O cluster timezone:,America/Sao_Paulo
H2O data parsing timezone:,UTC
H2O cluster version:,3.26.0.3
H2O cluster version age:,"21 days, 16 hours and 9 minutes"
H2O cluster name:,H2O_from_python_Semantix_zpo2sn
H2O cluster total nodes:,1
H2O cluster free memory:,2.977 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


# 3. Importing data

Let's take a look at the iris dataset again

In [3]:
url = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv"

In [4]:
data = h2o.import_file(url, destination_frame = "iris")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [5]:
data.set_names(["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"])

sepal_len,sepal_wid,petal_len,petal_wid,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa




# 4. Splitting the data 

We'll split the data into Train and Test datasets as usual (80/20)

In [6]:
train, test = data.split_frame([0.8], seed = 123)

In [7]:
print("%d/%d" % (train.nrows, test.nrows))

121/29
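The split above gives 121/29 rather than exactly 120/30. As far as the H2O documentation describes it, split_frame assigns each row to a split at random instead of cutting at a fixed row count. The following is only a NumPy sketch of that idea, not H2O's implementation; the seed and variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)  # illustrative seed, unrelated to H2O's seed
n_rows = 150                      # size of the iris dataset

# Draw one uniform number per row: rows below 0.8 go to train, the rest to test.
u = rng.random(n_rows)
train_idx = np.flatnonzero(u < 0.8)
test_idx = np.flatnonzero(u >= 0.8)

# The counts land near 120/30 but are usually not exactly that.
print("%d/%d" % (len(train_idx), len(test_idx)))
```

If an exact 80/20 cut is needed, shuffling the row indices and slicing at a fixed position is the usual alternative.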


# 5. Using Neural Networks to build Autoencoders for Dimensionality Reduction 

To do this we'll use the Deep Learning libraries we've already used.
The idea is that if we build a neural network that reproduces its input at its output while passing through a smaller number of nodes, the hidden layer gives us a representation of our data with fewer dimensions
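To make the bottleneck idea concrete, here is a minimal NumPy sketch of an autoencoder with one Tanh hidden layer, trained by plain gradient descent. It is not what H2O does internally; the toy data, layer sizes, learning rate and number of steps are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points with 4 features that really live on a 2-D plane.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_in, n_hidden = 4, 2                        # 4 inputs squeezed through 2 nodes
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b2 = np.zeros(n_in)

def forward(X):
    H = np.tanh(X @ W1 + b1)                 # encoder: the compressed representation
    return H, H @ W2 + b2                    # decoder: attempt to rebuild the input

_, X_hat = forward(X)
mse_before = np.mean((X_hat - X) ** 2)

lr = 0.02
for _ in range(3000):
    H, X_hat = forward(X)
    err = (X_hat - X) / len(X)               # gradient of 0.5 * mean squared row error
    dW2 = H.T @ err
    db2 = err.sum(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)         # back-propagate through tanh
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

H, X_hat = forward(X)
mse_after = np.mean((X_hat - X) ** 2)
print(H.shape, mse_before, mse_after)
```

The array H plays the role of the reduced representation: each row of the 4-column input is described by 2 numbers.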

### Defining the input variables to our Neural Network

In [8]:
x = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]

We now import the AutoEncoder estimator

In [9]:
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

To build this network we'll use the same number of hidden nodes as inputs (x) to keep things simple at the beginning.

It is recommended to use the Tanh activation function with autoencoders.
Note that, unlike the regression and classification examples we've used in supervised learning, we'll set the y variable to "None".

The rest of the parameters can normally be left at their defaults, but we've assigned some values so that we get a score at each iteration and can better understand what's going on.

In [10]:
m_AE_4 = H2OAutoEncoderEstimator(
    hidden = [4],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_4",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
)

%time m_AE_4.train(x, None, train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.08 s


Let's look at the score history (sh)

In [11]:
sh = m_AE_4.score_history()
sh.head()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
0,,2019-09-14 16:26:26,0.012 sec,"0,00000 obs/sec",0.0,0,0.0,0.312791,0.097838
1,,2019-09-14 16:26:26,0.018 sec,17285 obs/sec,1.0,1,121.0,0.268711,0.072206
2,,2019-09-14 16:26:26,0.023 sec,20166 obs/sec,2.0,2,242.0,0.22005,0.048422
3,,2019-09-14 16:26:26,0.027 sec,22687 obs/sec,3.0,3,363.0,0.188021,0.035352
4,,2019-09-14 16:26:26,0.031 sec,24200 obs/sec,4.0,4,484.0,0.164481,0.027054


The training_mse column is the reconstruction error, which tells us how close the output nodes are to the input nodes

We can take a look at the H2O Flow UI to see a plot of the training_mse

If we look at the tail we can see the lowest reconstruction error we got

In [12]:
sh.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
84,,2019-09-14 16:26:27,0.651 sec,18962 obs/sec,84.0,84,10164.0,0.02357,0.000556
85,,2019-09-14 16:26:27,0.657 sec,18976 obs/sec,85.0,85,10285.0,0.024299,0.00059
86,,2019-09-14 16:26:27,0.661 sec,19058 obs/sec,86.0,86,10406.0,0.023599,0.000557
87,,2019-09-14 16:26:27,0.665 sec,19140 obs/sec,87.0,87,10527.0,0.023877,0.00057
88,,2019-09-14 16:26:27,0.672 sec,18899 obs/sec,87.0,87,10527.0,0.02357,0.000556


### Considerations about early stopping

If we look carefully at the tail of the training MSE we see that it was still getting lower, although it had flattened out a bit. What is happening is that the algorithm looks at the past 5 scoring iterations and, if the metric hasn't improved enough, it stops before reaching the configured number of epochs (set to 300).

In the run shown above, early stopping was activated after 87 epochs (the exact number changes from run to run). Let's modify this parameter (stopping_rounds) and see what happens
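The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration, not H2O's exact rule: it compares the average of the latest window of scores with the best average of any earlier, non-overlapping window. The function name and the tolerance argument are assumptions made for the example.

```python
def should_stop(history, stopping_rounds=5, tolerance=0.0):
    """Simplified early-stopping check for a 'lower is better' metric."""
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False                       # not enough scoring events yet
    # Moving averages of length k over the whole score history.
    windows = [sum(history[i:i + k]) / k for i in range(len(history) - k + 1)]
    latest = windows[-1]
    best_earlier = min(windows[:-k])       # windows that do not overlap the latest one
    # Stop when the latest average is no better than the best earlier one.
    return latest >= best_earlier * (1 - tolerance)

improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
flat = [0.5] * 10

print(should_stop(improving))   # error still falling
print(should_stop(flat))        # error no longer improving
```

With a larger stopping_rounds the window is longer, so a short plateau or a single noisy score is less likely to end training early.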

In [13]:
m_AE_4 = H2OAutoEncoderEstimator(
    hidden = [4],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_4",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_AE_4.train(x, None, train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 8.33 s


In [14]:
sh = m_AE_4.score_history()
sh.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
264,,2019-09-14 16:26:35,2.377 sec,15298 obs/sec,264.0,264,31944.0,0.010744,0.000115
265,,2019-09-14 16:26:35,2.388 sec,15283 obs/sec,265.0,265,32065.0,0.010351,0.000107
266,,2019-09-14 16:26:35,2.396 sec,15283 obs/sec,266.0,266,32186.0,0.010115,0.000102
267,,2019-09-14 16:26:35,2.415 sec,15203 obs/sec,267.0,267,32307.0,0.014319,0.000205
268,,2019-09-14 16:26:35,2.426 sec,15125 obs/sec,267.0,267,32307.0,0.010115,0.000102


This time training ran much longer: 267 epochs in the run shown above, and in other runs it can go all the way to epoch 300.
This is because, if we use the same number of hidden nodes as the number of inputs, the network should theoretically be able to learn to describe our input perfectly (mse -> 0), so the error keeps improving

### Let's try with fewer hidden nodes

In [39]:
m_AE_3 = H2OAutoEncoderEstimator(
    hidden = [3],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_3",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_AE_3.train(x, None, train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.1 s


In [40]:
sh3 = m_AE_3.score_history()
sh3.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
238,,2019-09-14 16:53:28,1.775 sec,19447 obs/sec,239.0,239,28919.0,0.023287,0.000542
239,,2019-09-14 16:53:28,1.783 sec,19424 obs/sec,240.0,240,29040.0,0.023512,0.000553
240,,2019-09-14 16:53:28,1.791 sec,19453 obs/sec,241.0,241,29161.0,0.023135,0.000535
241,,2019-09-14 16:53:28,1.795 sec,19482 obs/sec,242.0,242,29282.0,0.023861,0.000569
242,,2019-09-14 16:53:28,1.803 sec,19379 obs/sec,242.0,242,29282.0,0.023025,0.00053


In this case we're asking our neural network to represent our 4 input variables with only 3 neurons. In the run shown above the model stopped at 242 epochs and we got a reconstruction error of about 0.0005

Let's try with 2 nodes

In [41]:
m_AE_2 = H2OAutoEncoderEstimator(
    hidden = [2],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_2",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_AE_2.train(x, None, train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.09 s


In [42]:
sh2 = m_AE_2.score_history()
sh2.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
237,,2019-09-14 16:53:34,1.826 sec,18573 obs/sec,241.0,241,29161.0,0.053625,0.002876
238,,2019-09-14 16:53:34,1.831 sec,18591 obs/sec,242.0,242,29282.0,0.053064,0.002816
239,,2019-09-14 16:53:34,1.838 sec,18621 obs/sec,243.0,243,29403.0,0.053073,0.002817
240,,2019-09-14 16:53:34,1.843 sec,18638 obs/sec,244.0,244,29524.0,0.053541,0.002867
241,,2019-09-14 16:53:34,1.848 sec,18656 obs/sec,245.0,245,29645.0,0.052927,0.002801


We could still represent our input with only 2 neurons, but the reconstruction error was significantly higher: about 0.003, roughly five times the error of the 3-neuron model

Let's try with 1

In [43]:
m_AE_1 = H2OAutoEncoderEstimator(
    hidden = [1],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_1",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_AE_1.train(x, None, train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.09 s


In [44]:
sh1 = m_AE_1.score_history()
sh1.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
63,,2019-09-14 16:53:39,0.313 sec,33085 obs/sec,70.0,70,8470.0,0.107031,0.011456
64,,2019-09-14 16:53:39,0.317 sec,33169 obs/sec,71.0,71,8591.0,0.108778,0.011833
65,,2019-09-14 16:53:39,0.321 sec,33585 obs/sec,73.0,73,8833.0,0.109231,0.011931
66,,2019-09-14 16:53:39,0.325 sec,33988 obs/sec,75.0,75,9075.0,0.107882,0.011639
67,,2019-09-14 16:53:39,0.329 sec,33487 obs/sec,75.0,75,9075.0,0.106694,0.011384


As expected, using a single neuron to represent our 4 inputs gives us a much higher reconstruction error: about 0.01

### Let's increase the number of hidden layers

In [45]:
m_AE_5_3_5 = H2OAutoEncoderEstimator(
    hidden = [5, 3, 5], #this symmetry is recommended
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_5_3_5",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_AE_5_3_5.train(x, None, train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.1 s


In [46]:
sh = m_AE_5_3_5.score_history()
sh.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
78,,2019-09-14 16:53:45,0.603 sec,20472 obs/sec,78.0,78,9438.0,0.025242,0.000637
79,,2019-09-14 16:53:45,0.615 sec,20209 obs/sec,79.0,79,9559.0,0.031297,0.000979
80,,2019-09-14 16:53:45,0.624 sec,20124 obs/sec,80.0,80,9680.0,0.026853,0.000721
81,,2019-09-14 16:53:45,0.628 sec,20208 obs/sec,81.0,81,9801.0,0.026183,0.000686
82,,2019-09-14 16:53:45,0.636 sec,19880 obs/sec,81.0,81,9801.0,0.025242,0.000637


You can see we got a reconstruction error of about 0.0006, but after only 81 epochs

### Stacked autoencoders

Another option is to stack multiple models together.
This is different from adding hidden layers to the same model, though. Let's see what the results are

In [49]:
train_AE_3 = m_AE_3.deepfeatures(train, 0) # take the 3-neuron model we've built and extract the features from its first hidden layer (index 0 in Python)

deepfeatures progress: |██████████████████████████████████████████████████| 100%


In [50]:
train_AE_3.dim

[121, 3]

In [51]:
train_AE_3

DF.L1.C1,DF.L1.C2,DF.L1.C3
0.155258,0.453122,0.460569
0.323651,0.439329,0.398356
0.278381,0.50335,0.423558
0.310595,0.499269,0.399517
0.129707,0.482094,0.470037
-0.030115,0.42582,0.453997
0.208125,0.537791,0.417995
0.191769,0.455314,0.442427
0.0670419,0.409021,0.483402
0.202885,0.486321,0.433777




We now take this data and use it to feed another 3-neuron autoencoder

In [52]:
m_AE_3x3 = H2OAutoEncoderEstimator(
    hidden = [3],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_AE_3x3",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_AE_3x3.train([0,1,2], None, train_AE_3) #the x variable is the 3-column vector from the m_AE_3 model


deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.1 s


In [53]:
sh3x3 = m_AE_3x3.score_history()
sh.tail()

Unnamed: 0,Unnamed: 1,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_mse
78,,2019-09-14 16:53:45,0.603 sec,20472 obs/sec,78.0,78,9438.0,0.025242,0.000637
79,,2019-09-14 16:53:45,0.615 sec,20209 obs/sec,79.0,79,9559.0,0.031297,0.000979
80,,2019-09-14 16:53:45,0.624 sec,20124 obs/sec,80.0,80,9680.0,0.026853,0.000721
81,,2019-09-14 16:53:45,0.628 sec,20208 obs/sec,81.0,81,9801.0,0.026183,0.000686
82,,2019-09-14 16:53:45,0.636 sec,19880 obs/sec,81.0,81,9801.0,0.025242,0.000637


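The table above is identical to the m_AE_5_3_5 output (about 0.0006 after 81 epochs) because the cell prints sh, which still holds that model's score history. To see the reconstruction error of the stacked autoencoder, look at sh3x3.tail() instead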

### Anomalies

For this exercise we're not trying to reduce dimensions but just to learn the data. So we'll first redefine the input variables to use all 5 columns of the original iris dataset (including class) and then build a single-layer neural network with 16 neurons

In [54]:
x = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

In [55]:
m_anomaly16 = H2OAutoEncoderEstimator(
    hidden = [16],
    activation = "Tanh",
    epochs = 300,
    model_id = "m_anomaly16",
    
    train_samples_per_iteration = train.nrow,
    score_interval = 0,
    score_duty_cycle = 1.0,
    stopping_rounds = 15
)

%time m_anomaly16.train(x, None, data) 


deeplearning Model Build progress: |██████████████████████████████████████| 100%
Wall time: 6.09 s


We can now build a dataset with the reconstruction errors from this neural network and store it into a pandas dataframe

In [56]:
anomalies = m_anomaly16.anomaly(data).cbind(data).as_data_frame()

In [57]:
s = anomalies.sort_values("Reconstruction.MSE", ascending=False)

By looking at the head and tail of this data frame we can see which flowers in our dataset the neural network is least and most sure about, respectively.

The tail shows that this particular network feels pretty sure about the setosa flowers (4 of the 5 rows), plus one versicolor

In [58]:
s.tail()

Unnamed: 0,Reconstruction.MSE,sepal_len,sepal_wid,petal_len,petal_wid,class
30,7e-06,4.8,3.1,1.6,0.2,Iris-setosa
46,7e-06,5.1,3.8,1.6,0.2,Iris-setosa
52,5e-06,6.9,3.1,4.9,1.5,Iris-versicolor
29,5e-06,4.7,3.2,1.6,0.2,Iris-setosa
11,3e-06,4.8,3.4,1.6,0.2,Iris-setosa


Looking at the head, we can see the ones it is less sure about.

In [59]:
s.head()

Unnamed: 0,Reconstruction.MSE,sepal_len,sepal_wid,petal_len,petal_wid,class
117,0.000367,7.7,3.8,6.7,2.2,Iris-virginica
41,0.000317,4.5,2.3,1.3,0.3,Iris-setosa
106,0.000317,4.9,2.5,4.5,1.7,Iris-virginica
60,0.000317,5.0,2.0,3.5,1.0,Iris-versicolor
119,0.000305,6.0,2.2,5.0,1.5,Iris-virginica


In a financial application, the head would give us the list of transactions we may want a person to double-check, as they don't look similar to the typical "good transactions"
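One way to turn the sorted list into a red-flag rule is to pick a cut-off on the reconstruction error. The sketch below uses a small hand-made pandas frame shaped like the anomalies frame above (the values are copied from the head and tail shown earlier); the 0.75 quantile is an arbitrary assumption, and in practice the cut-off would be tuned to how many cases a person can review.

```python
import pandas as pd

# Hypothetical sample with the same columns of interest as the `anomalies` frame.
scores = pd.DataFrame({
    "Reconstruction.MSE": [0.000367, 0.000007, 0.000317, 0.000005, 0.000003, 0.000305],
    "class": ["Iris-virginica", "Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-setosa", "Iris-virginica"],
})

# Flag every row whose error is above the chosen quantile of the error distribution.
threshold = scores["Reconstruction.MSE"].quantile(0.75)
flagged = scores[scores["Reconstruction.MSE"] > threshold]

print(flagged.sort_values("Reconstruction.MSE", ascending=False))
```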

# 6. PCA and GLRM for Dimensionality Reduction 

Another option to perform dimensionality reduction is to use PCA - Principal Component Analysis

### PCA (Principal Component Analysis)

Without going into the mathematical details of what PCA is, you can think of it as a technique to identify a number of UNCORRELATED VARIABLES (or axes, if you will) that you can use to represent your data.

You can look up more details here: https://en.wikipedia.org/wiki/Principal_component_analysis

In [60]:
from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator

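In this case we'll set k = 4, which gives us 4 uncorrelated variables (principal components) representing the data. With 4 numeric input variables this means we are not doing any compression. Note, however, that x was last redefined in the anomaly exercise and still includes class, which is most likely why the cumulative proportion of variance below stops just short of 1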

In [61]:
m_pca = H2OPrincipalComponentAnalysisEstimator(
    k = 4,
    impute_missing = True
)

%time m_pca.train(x, None, train)

pca Model Build progress: |███████████████████████████████████████████████| 100%
Wall time: 6.12 s


In [62]:
m_pca

Model Details
H2OPrincipalComponentAnalysisEstimator :  Principal Components Analysis
Model Key:  PCA_model_python_1568466473771_2135


Importance of components: 

Unnamed: 0,Unnamed: 1,pc1,pc2,pc3,pc4
0,Standard deviation,7.862345,1.489687,0.581041,0.252247
1,Proportion of Variance,0.958956,0.034426,0.005237,0.000987
2,Cumulative Proportion,0.958956,0.993382,0.998619,0.999606




ModelMetricsPCA: pca
** Reported on train data. **

MSE: NaN
RMSE: NaN

Scoring History for GramSVD: 

Unnamed: 0,Unnamed: 1,timestamp,duration,iterations
0,,2019-09-14 16:55:10,1.715 sec,0.0




In the model details above we should also have the coefficients indicating the importance of each variable (for some reason they don't show up on my version). You should see that PC1 alone explains more than 95% of the variance in the data, while the other components contribute very little.

This means we could perform dimensionality reduction.
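The "Importance of components" table can be reproduced by hand with a singular value decomposition. The sketch below uses NumPy on made-up data (four correlated columns driven by one hidden factor), so the numbers will not match the iris output above; it only shows where the three rows of the table come from.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 4 correlated features driven mostly by one hidden factor.
factor = rng.normal(size=(150, 1))
X = factor @ np.array([[2.0, 0.5, 3.0, 1.0]]) + 0.2 * rng.normal(size=(150, 4))

Xc = X - X.mean(axis=0)                        # PCA works on centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

variance = s ** 2 / (len(X) - 1)               # variance along each component
proportion = variance / variance.sum()
cumulative = np.cumsum(proportion)

print("Standard deviation    ", np.sqrt(variance).round(3))
print("Proportion of Variance", proportion.round(3))
print("Cumulative Proportion ", cumulative.round(3))
```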

To obtain the principal component representation we use the predict function

In [63]:
p_PCA = m_pca.predict(train)

pca prediction progress: |████████████████████████████████████████████████| 100%


In [64]:
p_PCA

PC1,PC2,PC3,PC4
-5.91038,-2.29922,0.194961,-0.0533056
-5.56902,-1.97345,0.115781,-0.291559
-5.44469,-2.09299,0.176716,-0.0592165
-5.43304,-1.87266,0.15335,-0.0488478
-5.87362,-2.32447,0.222313,0.0693357
-6.47457,-2.32338,0.260379,0.180781
-5.51382,-2.06833,0.233104,0.185269
-5.84792,-2.14793,0.177828,-0.0558771
-6.26279,-2.4232,0.202305,-0.0629065
-5.74869,-2.02189,0.188046,0.0641926




If PC1 carries most of the information and we are dealing with a more complex problem, we could choose to represent the data with PC1 alone and continue the rest of the workflow with just that column
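As a rough illustration of what keeping only PC1 costs, the NumPy sketch below projects made-up 4-column data onto its first principal component and maps it back. The data and names are assumptions for the example; the same steps apply to any numeric matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
factor = rng.normal(size=(150, 1))
X = factor @ np.array([[2.0, 0.5, 3.0, 1.0]]) + 0.2 * rng.normal(size=(150, 4))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 1                                    # keep only the first principal component
scores = (X - mean) @ Vt[:k].T           # 150 x 1: the reduced representation
X_back = scores @ Vt[:k] + mean          # map back to the original 4 columns

mse = np.mean((X - X_back) ** 2)         # information lost by dropping PC2..PC4
print(scores.shape, mse)
```

The single column in scores is what a downstream model would receive in place of the four original columns.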

### GLRM (Generalized Low Rank Models)

We can try to do the same thing we've done with PCA or with autoencoders before

In [65]:
from h2o.estimators.glrm import H2OGeneralizedLowRankEstimator

The main advantage of GLRM is that it handles categorical (enum) data well.
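For purely numeric data, the low-rank idea behind GLRM is to approximate the data matrix X by the product of two thin matrices, A (one short row per observation) and Y (one short column per feature). The sketch below fits them with alternating least squares in NumPy. It covers only the squared-error case without regularization and is not H2O's implementation; the toy data and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 5))   # 5 columns, true rank 3

k = 3
A = rng.normal(size=(100, k))          # one k-dimensional row per observation
Y = rng.normal(size=(k, 5))            # one k-dimensional column per feature

for _ in range(50):
    # Alternate: fix Y and solve for A, then fix A and solve for Y (least squares).
    A = np.linalg.lstsq(Y.T, X.T, rcond=None)[0].T
    Y = np.linalg.lstsq(A, X, rcond=None)[0]

sse = np.sum((X - A @ Y) ** 2)         # counterpart of the numeric sum of squared error
print(A.shape, Y.shape, sse)
```

Categorical columns need a different loss than squared error, which is where the "generalized" part of GLRM comes in.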

In [66]:
x = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

In [67]:
m_glrm3 = H2OGeneralizedLowRankEstimator(
    k = 3
)

%time m_glrm3.train(x, None, train)

glrm Model Build progress: |██████████████████████████████████████████████| 100%
Wall time: 6.14 s


In [68]:
m_glrm3

Model Details
H2OGeneralizedLowRankEstimator :  Generalized Low Rank Modeling
Model Key:  GLRM_model_python_1568466473771_2136


Model Summary: 

Unnamed: 0,Unnamed: 1,number_of_iterations,final_step_size,final_objective_value
0,,475.0,9.9e-05,9.897283




ModelMetricsGLRM: glrm
** Reported on train data. **

MSE: NaN
RMSE: NaN
Sum of Squared Error (Numeric): 9.897169768518502
Misclassification Error (Categorical): 0.0

Scoring History: 

Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,step_size,objective
0,,2019-09-14 16:55:21,0.194 sec,0.0,0.666667,964.061937
1,,2019-09-14 16:55:21,0.200 sec,1.0,0.444444,964.061937
2,,2019-09-14 16:55:21,0.200 sec,2.0,0.222222,964.061937
3,,2019-09-14 16:55:21,0.204 sec,3.0,0.074074,964.061937
4,,2019-09-14 16:55:21,0.208 sec,4.0,0.018519,964.061937
5,,2019-09-14 16:55:21,0.212 sec,5.0,0.019444,663.619635
6,,2019-09-14 16:55:21,0.218 sec,6.0,0.020417,592.750066
7,,2019-09-14 16:55:21,0.223 sec,7.0,0.021438,546.259785
8,,2019-09-14 16:55:21,0.224 sec,8.0,0.022509,501.00229
9,,2019-09-14 16:55:21,0.229 sec,9.0,0.023635,453.551159



See the whole table with table.as_data_frame()




In [69]:
p_GLRM3 = m_glrm3.predict(train)

glrm prediction progress: |███████████████████████████████████████████████| 100%


In [70]:
p_GLRM3

reconstr_sepal_len,reconstr_sepal_wid,reconstr_petal_len,reconstr_petal_wid,reconstr_class
5.07691,3.52883,1.4048,0.223692,Iris-setosa
4.72945,3.22073,1.4612,0.280057,Iris-setosa
4.66608,3.24201,1.31163,0.221477,Iris-setosa
4.59342,3.11725,1.46403,0.30018,Iris-setosa
5.0537,3.53291,1.36677,0.213031,Iris-setosa
5.50001,3.76599,1.67755,0.324453,Iris-setosa
4.70179,3.26753,1.36286,0.257736,Iris-setosa
4.98627,3.42232,1.48131,0.268108,Iris-setosa
5.37951,3.72689,1.4986,0.235343,Iris-setosa
4.87201,3.32048,1.51971,0.302603,Iris-setosa




If we look at the results on H2O Flow we can see the following:

OUTPUT - IMPORTANCE OF COMPONENTS

                              pc1      pc2      pc3
    Standard deviation        7.6560   2.1130   1.3167
    Proportion of Variance    0.9044   0.0689   0.0267
    Cumulative Proportion     0.9044   0.9733   1.0


So we can still capture about 90% of the variance with just one variable (PC1), but with only 3 components the first one explains less than it did in the PCA model above (about 90% versus 96%).

# 7. K-means clustering

Here you can read more about K-means clustering:

https://en.wikipedia.org/wiki/K-means_clustering

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html

https://www.datascience.com/blog/k-means-alternatives

https://en.wikipedia.org/wiki/Curse_of_dimensionality


In [71]:
from h2o.estimators.kmeans import H2OKMeansEstimator

The code is pretty simple and follows the syntax we've used so far. The only parameter worth noting is k, the number of clusters we want to use to describe our data.

Setting this parameter requires that we know something about the data
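To show what the algorithm does with k, here is a bare-bones version of Lloyd's algorithm in NumPy: assign every point to its nearest centre, move each centre to the mean of its points, and repeat. It is a teaching sketch on made-up 2-D blobs, not H2O's implementation, which adds things such as smarter initialization and standardization.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centre, then move centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random starting centres
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # leave an empty cluster's centre where it is
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

rng = np.random.default_rng(4)
# Three blobs in 2-D, 50 points each.
blob_centres = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in blob_centres])

labels, centres = kmeans(X, k=3)
print(np.bincount(labels, minlength=3))    # number of points per cluster
```

Choosing k larger or smaller than the number of real groups still produces k clusters, which is why some knowledge of the data is needed.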


In [72]:
x = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]

In [73]:
m = H2OKMeansEstimator(
    k = 5
)

%time m.train(x, training_frame = train)

kmeans Model Build progress: |████████████████████████████████████████████| 100%
Wall time: 6.1 s


In [74]:
p = m.predict(train)

kmeans prediction progress: |█████████████████████████████████████████████| 100%


In [75]:
p

predict
4
4
4
4
4
1
4
4
1
4




H2O's K-means can handle categorical variables, but like other distance-based methods it does not work well with high-dimensional data (see the curse of dimensionality link above)
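The problem with high-dimensional data can be seen with a small experiment: as the number of dimensions grows, the nearest and the farthest point from a given query end up at almost the same distance, so "closest centre" carries less and less meaning. The sketch below uses uniformly random points in NumPy; the point count and dimensions are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

This is one reason to run a dimensionality reduction step, such as the autoencoders or PCA shown earlier, before clustering wide datasets.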