# GLRM Showcase

The following demo is taken from Anqi (the original GLRM author at H2O).  

I want to illustrate three things with this demo:
- analyzing the archetypes (Y);
- analyzing the coefficients (X);
- the advantages of using GLRM to build your model.

In [None]:
library(h2o)
h2o.init(strict_version_check = FALSE)

## Gait Data

The following dataset contains measurements from human subjects walking on a treadmill.  Sensors are attached to each subject at various joints, and the data from these sensors are recorded over time.

In [None]:
#filename <- "http://s3.amazonaws.com/h2o-public-test-data/smalldata/glrm_test/subject01_walk1.csv"
filename <- "../../data/glrm/subject01_walk1.csv"
gait.hex <- h2o.importFile(path = filename, destination_frame = "gait.hex")
dim(gait.hex)
summary(gait.hex)

Because the dataset contains only numeric features, we build a GLRM model with quadratic loss and no regularization.  We skip the first column (time) and set k = 10.

In [None]:
gait.glrm <- h2o.glrm(training_frame = gait.hex, cols = 2:ncol(gait.hex), k = 10, loss = "Quadratic", 
                      regularization_x = "None", regularization_y = "None", max_iterations = 1000)
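
With quadratic loss and no regularization, fitting a GLRM amounts to minimizing the squared reconstruction error between A and XY, and the optimum of that problem is attained by a truncated SVD (the Eckart-Young theorem). The sketch below illustrates this on a small simulated matrix in base R. The matrix `A`, the rank `k = 2`, and all variable names are hypothetical; nothing here is taken from the gait data or from H2O's internals.

```r
# Minimal base-R sketch (simulated data, hypothetical names): with quadratic
# loss and no regularization, the best rank-k factorization A ~ X %*% Y
# is given by the truncated SVD.
set.seed(42)
n <- 50; d <- 6; k <- 2

# Simulate a matrix that is rank k plus a little noise.
A <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * d), k, d) +
     matrix(rnorm(n * d, sd = 0.01), n, d)

s <- svd(A)
X <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)  # n x k coefficients
Y <- t(s$v[, 1:k, drop = FALSE])                            # k x d archetypes

# The squared reconstruction error equals the sum of the discarded
# squared singular values.
err_rank_k   <- sum((A - X %*% Y)^2)
err_expected <- sum(s$d[(k + 1):length(s$d)]^2)
print(c(err_rank_k = err_rank_k, err_expected = err_expected))
```

H2O fits the model iteratively rather than through an SVD, so its X and Y may differ from these factors by an invertible transformation even when the product XY is essentially the same.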

The dataset contains the spatial coordinates of a subject's head, temple, toes, wrists, elbows, biceps, sternum, acromion (the point of the shoulder above the arm joint), midfoot, heels, rear/upper shank, thigh, ....  What do you think the archetypes will look like?


In [None]:
gait.y <- gait.glrm@model$archetypes
gait.y.mat <- as.matrix(gait.y)
# Columns are assumed to come in (x, y, z) triples per sensor: pick the x and y columns
x_coords <- seq(1, ncol(gait.y), by = 3)
y_coords <- seq(2, ncol(gait.y), by = 3)
# Drop the trailing axis character from each column name to get the sensor name
feat_nams <- sapply(colnames(gait.y), function(nam) { substr(nam, 1, nchar(nam)-1) })
feat_nams <- as.character(feat_nams[x_coords])
# One scatter plot per archetype (row of Y), with each point labelled by its sensor
for(k in seq_len(nrow(gait.y.mat))) {
    plot(gait.y.mat[k,x_coords], gait.y.mat[k,y_coords], xlab = "X-Coordinate Weight", ylab = "Y-Coordinate Weight",
         main = paste("Feature Weights of Archetype", k), col = "blue", pch = 19)
    text(gait.y.mat[k,x_coords], gait.y.mat[k,y_coords], labels = feat_nams, cex = 0.7, pos = 3)
}

Next, suppose we decompose our data as A ≈ XY.  For a given set of archetypes Y, what can we use X for?  What can X show us?
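
To make the role of X concrete: under quadratic loss with no regularization, once Y is fixed, each row of X is simply the least-squares fit of the corresponding data row onto the archetypes. The base-R sketch below uses simulated archetypes and two made-up observations; the names `Y`, `x_true`, `A_new`, and `X_new` are hypothetical and do not come from the demo.

```r
# Minimal base-R sketch (simulated data, hypothetical names): for fixed
# archetypes Y and quadratic loss, each row of X is a least-squares fit.
set.seed(1)
k <- 3; d <- 8
Y <- matrix(rnorm(k * d), k, d)          # k x d archetypes (one per row)

# Two new observations built from known, similar coefficients plus small noise.
x_true <- rbind(c(1.0, 0, 2.0),
                c(1.1, 0, 1.9))
A_new <- x_true %*% Y + matrix(rnorm(2 * d, sd = 0.01), 2, d)

# Solve min_X ||A_new - X Y||_F^2, i.e. X = A_new Y^T (Y Y^T)^{-1}.
X_new <- t(solve(Y %*% t(Y), Y %*% t(A_new)))
print(round(X_new, 2))
```

Rows of the data that are built from similar mixtures of archetypes end up with similar rows in X, which is the property the next section relies on.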

## ACS Data

In this example, we want to predict whether a firm will repeat an offense after a compliance action has been carried out against it.  The dataset includes information on each investigation, such as the zip code (ZCTA) where the firm is located, the number of violations found, and the civil penalties assessed.  The zip code by itself is not meaningful: it is a categorical feature with high cardinality (42000 levels).  If we used one-hot encoding to expand the zip code column into 42000 columns, our model would run slowly and probably overfit.

Instead, we use the American Community Survey (ACS) 5-year estimates of household characteristics to expand the zip code column.  Each row of the ACS dataset corresponds to a unique zip code and contains information such as household size, income, education level, and number of children.
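
The difference in width between the two approaches can be seen on a toy example. The base-R sketch below compares one-hot encoding of a categorical code with joining a small lookup table of numeric features. All names and values (`codes`, `cases`, `lookup`, `feat1`, `feat2`, and the 200 codes standing in for the 42000 zip codes) are hypothetical placeholders, not the WHD or ACS data.

```r
# Minimal base-R sketch (toy data, hypothetical names and values): compare
# one-hot encoding of a high-cardinality code with a lookup-table join.
set.seed(7)
n_codes <- 200                                   # stand-in for 42000 zip codes
codes <- sprintf("%05d", seq_len(n_codes))

# Investigation-level table: one code column plus one numeric column.
# Every code appears at least once so no factor level is empty.
cases <- data.frame(zip = c(codes, sample(codes, 800, replace = TRUE)),
                    violations = rpois(1000, 3),
                    stringsAsFactors = FALSE)

# One-hot encoding creates one indicator column per level.
one_hot <- model.matrix(~ factor(zip, levels = codes) - 1, data = cases)

# Lookup table: a few numeric features per code (random placeholders here).
lookup <- data.frame(zip = codes, feat1 = rnorm(n_codes), feat2 = rnorm(n_codes),
                     stringsAsFactors = FALSE)
joined <- merge(cases, lookup, by = "zip", all.x = TRUE)

print(c(one_hot_cols = ncol(one_hot), joined_cols = ncol(joined)))
```

The one-hot matrix grows with the number of distinct codes, while the joined table grows only with the number of features kept per code.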

In [None]:
#filename <- "http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip"
filename <- "../../data/glrm/ACS_13_5YR_DP02_cleaned.zip"
acs_orig <- h2o.importFile(path = filename, col.types = c("enum", rep("numeric", 149)))

In [None]:
dim(acs_orig)
summary(acs_orig)

In [None]:
acs_zcta_col <- acs_orig$ZCTA5
acs_full <- acs_orig[,-which(colnames(acs_orig) == "ZCTA5")]
dim(acs_full)
summary(acs_full)

After removing the zip code column, we build a GLRM model out of the ACS data with k=10.

In [None]:
acs_model <- h2o.glrm(training_frame = acs_full, k = 10, transform = "STANDARDIZE", 
                      loss = "Quadratic", regularization_x = "Quadratic", 
                      regularization_y = "L1", max_iterations = 100, gamma_x = 0.25, gamma_y = 0.5)
plot(acs_model)
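
Assuming H2O follows the standard GLRM formulation, the call above corresponds roughly to the objective below. The symbols are introduced here for illustration only: $\tilde{A}$ is the standardized ACS matrix, $\Omega$ is the set of observed (non-missing) entries, $x_i$ is row $i$ of X, and $y_j$ is column $j$ of Y.

$$
\min_{X,\,Y} \;\; \sum_{(i,j) \in \Omega} \left( \tilde{A}_{ij} - x_i y_j \right)^2
\;+\; \gamma_x \sum_{i} \lVert x_i \rVert_2^2
\;+\; \gamma_y \sum_{j} \lVert y_j \rVert_1 ,
\qquad \gamma_x = 0.25, \;\; \gamma_y = 0.5
$$

The quadratic penalty shrinks the coefficients in X toward zero, while the L1 penalty encourages sparse archetypes in Y.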

Each row of X contains the coefficients needed to represent the corresponding row of the ACS dataset using the archetypes in Y.  We expect similar cities to have similar X values.

In [None]:
zcta_arch_x <- h2o.getFrame(acs_model@model$representation_name)
head(zcta_arch_x)

In [None]:
idx <- ((acs_zcta_col == "10065") |   # Manhattan, NY (Upper East Side)
        (acs_zcta_col == "11219") |   # Manhattan, NY (East Harlem)
        (acs_zcta_col == "66753") |   # McCune, KS
        (acs_zcta_col == "84104") |   # Salt Lake City, UT
        (acs_zcta_col == "94086") |   # Sunnyvale, CA
        (acs_zcta_col == "95014"))    # Cupertino, CA

city_arch <- as.data.frame(zcta_arch_x[idx,1:2])
xeps <- (max(city_arch[,1]) - min(city_arch[,1])) / 10
yeps <- (max(city_arch[,2]) - min(city_arch[,2])) / 10
xlims <- c(min(city_arch[,1]) - xeps, max(city_arch[,1]) + xeps)
ylims <- c(min(city_arch[,2]) - yeps, max(city_arch[,2]) + yeps)
plot(city_arch[,1], city_arch[,2], xlim = xlims, ylim = ylims, xlab = "First Archetype", ylab = "Second Archetype", main = "Archetype Representation of Zip Code Tabulation Areas")
text(city_arch[,1], city_arch[,2], labels = c("Upper East Side", "East Harlem", "McCune", "Salt Lake City", "Sunnyvale", "Cupertino"), pos = 1)

Cities like Sunnyvale and Cupertino are more similar to each other than either is to East Harlem.  Note that although we are able to group similar cities by their archetype coefficients, we have no idea what each archetype actually represents.  This lack of interpretability is a common problem in machine learning.
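
Similarity in archetype space can also be measured numerically rather than read off a plot. The base-R sketch below computes pairwise Euclidean distances between rows of a small coefficient matrix. The matrix `coef_x`, its row names, and its values are made up for illustration and are not output from the ACS model.

```r
# Minimal base-R sketch (made-up coordinates, hypothetical names): compare
# rows of X by Euclidean distance in archetype space.
coef_x <- rbind(area_a = c(0.9, 0.1),
                area_b = c(1.0, 0.2),
                area_c = c(-0.8, 1.5))
colnames(coef_x) <- c("Arch1", "Arch2")

d <- as.matrix(dist(coef_x))   # pairwise Euclidean distances
print(round(d, 3))

# Nearest neighbour of each row (excluding the row itself).
diag(d) <- Inf
nearest <- rownames(d)[apply(d, 1, which.min)]
names(nearest) <- rownames(d)
print(nearest)
```

With the fitted model, the same idea could be applied to the rows of `as.data.frame(zcta_arch_x[idx, ])`, using all k archetype columns rather than only the first two.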

## WHD Data

Next, we build deep learning models on the WHD dataset to predict repeat and/or willful violators.  For comparison, we train three models: one on the original dataset, one on the dataset with the zip code column replaced by the compressed GLRM representation (the X matrix), and one on the dataset with the zip code column replaced by all the demographic features in the ACS dataset.

In [None]:
#filename <- "http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/whd_zcta_cleaned.zip"
filename <- "../../data/glrm/whd_zcta_cleaned.zip"
whd_zcta <- h2o.importFile(path = filename, col.types = c(rep("enum", 7), rep("numeric", 97)))
split <- h2o.runif(whd_zcta)
train <- whd_zcta[split <= 0.8,]
test <- whd_zcta[split > 0.8,]
myY <- "flsa_repeat_violator"
myX <- setdiff(5:ncol(train), which(colnames(train) == myY))
orig_time <- system.time(dl_orig <- h2o.deeplearning(x = myX, y = myY, training_frame = train, 
                                                     validation_frame = test, distribution = "multinomial",
                                                     epochs = 0.1, hidden = c(50,50,50)))


In [None]:
zcta_arch_x$zcta5_cd <- acs_zcta_col
whd_arch <- h2o.merge(whd_zcta, zcta_arch_x, all.x = TRUE, all.y = FALSE)
whd_arch$zcta5_cd <- NULL
train_mod <- whd_arch[split <= 0.8,]
test_mod  <- whd_arch[split > 0.8,]
myX <- setdiff(5:ncol(train_mod), which(colnames(train_mod) == myY))
mod_time <- system.time(dl_mod <- h2o.deeplearning(x = myX, y = myY, training_frame = train_mod, 
                                                   validation_frame = test_mod, distribution = "multinomial",
                                                   epochs = 0.1, hidden = c(50,50,50)))


In [None]:
colnames(acs_orig)[1] <- "zcta5_cd"
whd_acs <- h2o.merge(whd_zcta, acs_orig, all.x = TRUE, all.y = FALSE)
train_comb <- whd_acs[split <= 0.8,]
test_comb <- whd_acs[split > 0.8,]
myX <- setdiff(5:ncol(train_comb), which(colnames(train_comb) == myY))
comb_time <- system.time(dl_comb <- h2o.deeplearning(x = myX, y = myY, training_frame = train_comb,
                                                     validation_frame = test_comb, distribution = "multinomial",
                                                     epochs = 0.1, hidden = c(50,50,50)))

In [None]:
data.frame(original = c(orig_time[3], h2o.logloss(dl_orig, train = TRUE), h2o.logloss(dl_orig, valid = TRUE)),
           reduced  = c(mod_time[3], h2o.logloss(dl_mod, train = TRUE), h2o.logloss(dl_mod, valid = TRUE)),
           combined = c(comb_time[3], h2o.logloss(dl_comb, train = TRUE), h2o.logloss(dl_comb, valid = TRUE)),
           row.names = c("runtime", "train_logloss", "test_logloss"))

Compare the performance of the three models. We see that the model built on the reduced WHD dataset finishes almost 10 times faster than the model built on the original dataset, and it yields a lower log-loss. The model built on the combined WHD-ACS dataset does not improve significantly on this error. We can conclude that GLRM compressed the ZCTA demographics with little loss of information.

In [None]:
h2o.shutdown(prompt = FALSE)