In [None]:
# Install and load a few useful packages
# corrplot:    for making a pretty heatmap of the sample correlation matrix
# repr:        for adjusting plot dimensions in jupyter notebook
# MASS:        for multidimensional scaling
# rgl:         for Swiss Roll dataset + nonlinear dimensionality reduction

# installed.packages() returns a matrix; the package names are its row names
installed <- rownames(installed.packages())
if (!"corrplot" %in% installed) {install.packages("corrplot")}
if (!"MASS" %in% installed) {install.packages("MASS")}
if (!"repr" %in% installed) {install.packages("repr")}
if (!"rgl" %in% installed) {install.packages("rgl")}

library("corrplot")
library("MASS")
library("repr") 
library("rgl")

# seed the rng for reproducibility
set.seed(94608)


# Helpful Visualizations

We list a few visualizations that may be helpful when exploring high-dimensional datasets. Additional techniques are discussed in the accompanying [Factor Analysis in R](./Factor Analysis in R.ipynb) and [Factor Analysis Theory](./Factor Analysis Theory.ipynb) notebooks.

In [None]:
# load dataset of wine properties
d = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=',')
colnames(d) = c("type", "alcohol", "malic", "ash", "alcalinity", "magnesium", "phenols", "flavanoids", "nonflavanoids", "proanthocyanins", "color", "hue", "dilution", "proline")
d$type <- as.factor(d$type)

# standardize all numeric columns by subtracting the column mean (centering)
# and dividing by the column standard deviation (scaling)
ind = sapply(d, is.numeric)
d[ind] = lapply(d[ind], scale)

# inspect the standardized dataset
head(d)
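
As a quick sanity check on the standardization step, every numeric column should end up with mean 0 and standard deviation 1. The sketch below uses the built-in `iris` data as a stand-in (an assumption, made so the snippet runs without downloading the wine data); the same check could be applied to `d[ind]`.

```r
# Stand-in data (assumption): the numeric columns of the built-in iris dataset
x <- iris[, sapply(iris, is.numeric)]

# Standardize each column: subtract the mean, divide by the standard deviation
z <- as.data.frame(scale(x))

# After scaling, column means are ~0 and standard deviations are 1
col_means <- sapply(z, mean)
col_sds   <- sapply(z, sd)
print(round(col_means, 10))
print(col_sds)
```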

### Correlation Matrix
This technique allows you to quickly visualize the pairwise correlations between the dimensions in your dataset. We used it in the [Factor Analysis in R](./Factor Analysis in R.ipynb) notebook to look at the correlations between different personality dimensions.

In [None]:
# look at the correlations between dimensions in the wine dataset
corrplot(cor(d[,-1]), order = "hclust", tl.col='black', tl.cex=.75, type="upper")
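
A heatmap gives the overall picture, but it can also help to list the strongest pairs numerically. The following is a minimal sketch: `top_correlations` is a hypothetical helper (not part of `corrplot`), and the built-in `mtcars` data stands in for the wine data so the snippet is self-contained.

```r
# Hypothetical helper: rank variable pairs by absolute correlation
top_correlations <- function(df, n = 5) {
    r <- cor(df)
    # keep each pair once by looking only at the upper triangle
    idx <- which(upper.tri(r), arr.ind = TRUE)
    pairs <- data.frame(
        var1 = rownames(r)[idx[, "row"]],
        var2 = colnames(r)[idx[, "col"]],
        r    = r[idx],
        stringsAsFactors = FALSE
    )
    # sort by strength of correlation, strongest first
    pairs <- pairs[order(-abs(pairs$r)), ]
    rownames(pairs) <- NULL
    head(pairs, n)
}

# Stand-in data (assumption): built-in mtcars, where all columns are numeric
top <- top_correlations(mtcars, n = 3)
print(top)
```

For the wine data, the equivalent call would be `top_correlations(d[,-1])`, which drops the `type` column just as the `corrplot` call does.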

### Scatterplots and Multidimensional Scaling
Multidimensional scaling (MDS) is an approach to dimensionality reduction based on pairwise dissimilarities ("distances") between observations. Roger Shepard famously introduced nonmetric MDS in his search for a [universal law of generalization](http://smash.psych.nyu.edu/courses/spring16/learnmem/papers/Shepard1987.pdf).

Brief descriptions:
- **Classic MDS**: Given a distance matrix, find a lower-dimensional embedding that preserves pairwise distances as accurately as possible. Performed using the `cmdscale` function in the `stats` package. When the distances are Euclidean, it produces the same results as PCA (up to the sign of each axis).
- **Non-metric MDS**: An iterative, non-linear method. Given a distance matrix, attempts to find a lower-dimensional embedding that preserves the _relative ordering_ of the pairwise distances as closely as possible. Performed using the `isoMDS` function in the `MASS` package.
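
The equivalence between classic MDS and PCA can be checked numerically. The sketch below assumes Euclidean distances and uses the built-in `iris` data as a stand-in for the wine data; variable names such as `mds_fit` are illustrative only.

```r
# Stand-in data (assumption): standardized numeric columns of iris
X <- scale(iris[, 1:4])

# Classic MDS on Euclidean distances, keeping k = 2 dimensions
mds_fit <- cmdscale(dist(X), k = 2)

# PCA scores on the first two principal components
pca_fit <- prcomp(X)
scores  <- pca_fit$x[, 1:2]

# Each axis is only defined up to sign, so compare absolute values
max_diff <- max(abs(abs(mds_fit) - abs(scores)))
print(max_diff)  # should be numerically zero
```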

In [None]:
# adjust plotting options
options(repr.plot.width=10, repr.plot.height=4)
wine.par = par(mfrow=c(1, 3), pty="s")

# perform classic (metric) MDS with k=2 latent dimensions
# exclude the type column so the class labels do not enter the distances
# observe that this produces the same solution as PCA (up to axis sign)
mds = cmdscale(dist(d[,-1]), k=2)
plot(mds, col=d$type, pch=16, main='Metric MDS', xlab='Dim 1', ylab='Dim 2')
legend("bottomright", legend=c("Wine Type 1", "Wine Type 2", "Wine Type 3"), 
       fill=c("black", "red", "green"), bty='n', cex=0.65)

# perform non-metric MDS for k=2 latent dimensions (type column excluded)
mds = isoMDS(dist(d[,-1]), k=2)
plot(mds$points, col=d$type, pch=16, main='Non-Metric MDS', xlab='Dim 1', ylab='Dim 2')
legend("bottomright", legend=c("Wine Type 1", "Wine Type 2", "Wine Type 3"), 
       fill=c("black", "red", "green"), bty='n', cex=0.65)

# compare MDS against results found using the top 2 principal components
pca = prcomp(d[,-1])
plot(pca$x[,1:2], col=d[,1], pch=16, main="PCA")
legend("bottomright", legend=c("Wine Type 1", "Wine Type 2", "Wine Type 3"), 
       fill=c("black", "red", "green"), bty='n', cex=0.65)

# restore the plotting parameters saved before the layout was changed
par(wine.par)
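
Because non-metric MDS targets the relative ordering of distances, two quick diagnostics are the stress value returned by `isoMDS` and the rank correlation between the original and embedded distances. This is a minimal sketch using the built-in `mtcars` data as a stand-in (an assumption; it has no duplicate rows, which matters because `isoMDS` rejects zero distances).

```r
library(MASS)

# Stand-in data (assumption): standardized built-in mtcars
X <- scale(mtcars)
dist_orig <- dist(X)

# Non-metric MDS with k = 2; trace = FALSE silences the iteration log
fit <- isoMDS(dist_orig, k = 2, trace = FALSE)

# Stress (in percent): lower means the rank order is better preserved
print(fit$stress)

# Spearman correlation between original and embedded distances
dist_embed <- dist(fit$points)
rank_cor <- cor(as.vector(dist_orig), as.vector(dist_embed),
                method = "spearman")
print(rank_cor)
```

A rank correlation close to 1 suggests the two-dimensional embedding keeps most of the distance ordering; a low value is a cue to try a larger `k`.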

### Biplots (PCA)
`biplot` in the `stats` package can be useful when trying to interpret the output of PCA. In a biplot, the red vectors correspond to the dimensions from the original dataset, projected into the subspace spanned by the first two principal components. The length of each vector represents the strength of the correlation between the original data dimension and the principal components.

In [None]:
options(repr.plot.width=10, repr.plot.height=7)
biplot(pca, xlabs = rep("", nrow(d)), main="Wine PCA Biplot")
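
To see what the arrows encode, note that for standardized data the correlation between a variable and a principal component equals the variable's loading multiplied by that component's standard deviation. The sketch below checks this on the built-in `USArrests` data (a stand-in, chosen as an assumption so the snippet runs on its own).

```r
# Stand-in data (assumption): standardized built-in USArrests
X <- scale(USArrests)
pca_fit <- prcomp(X)

# Loadings: direction of each original variable in PC space
loadings <- pca_fit$rotation[, 1:2]

# Scale each loading column by the standard deviation of its PC
scaled_loadings <- sweep(loadings, 2, pca_fit$sdev[1:2], "*")

# Direct correlation between each variable and the PC scores
var_pc_cor <- cor(X, pca_fit$x[, 1:2])

print(round(scaled_loadings, 3))
print(round(var_pc_cor, 3))
```

The two printed matrices should agree, which is why longer arrows indicate variables that are more strongly correlated with the first two components.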

### Scree Plots
Scree plots are useful when trying to determine the number of components/factors/latent dimensions to use in your dimensionality reduction regimen. Here, we use a scree plot to identify how many principal components capture most of the variance in the wine dataset. Visually, it appears that the elbow of the plot occurs around 4 PCs.

In [None]:
options(repr.plot.width=6, repr.plot.height=4)
plot(pca, type='l', main='Wine PCA Scree Plot')

In [None]:
# here's a convenient function for displaying slightly more detailed scree plots
# credit: https://rstudio-pubs-static.s3.amazonaws.com/27823_dbc155ba66444eae9eb0a6bacb36824f.html
pcaCharts <- function(x) {
    options(repr.plot.width=10, repr.plot.height=7)
    x.var <- x$sdev ^ 2
    x.pvar <- x.var/sum(x.var)
    print("Proportion of variance:")
    print(x.pvar)
    
    par(mfrow=c(2,2))
    plot(x.pvar,xlab="Principal component", ylab="Proportion of variance explained", ylim=c(0,1), type='b')
    plot(cumsum(x.pvar),xlab="Principal component", ylab="Cumulative Proportion of variance explained", ylim=c(0,1), type='b')
    screeplot(x)
    screeplot(x,type="l")
    par(mfrow=c(1,1))
}

pcaCharts(pca)
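
Reading an elbow off a plot is subjective, so it can help to pair the scree plot with a simple numeric rule. The sketch below shows two common heuristics on the built-in `USArrests` data (a stand-in); the 0.8 threshold is an arbitrary assumption for illustration, not a recommendation.

```r
# Stand-in data (assumption): standardized built-in USArrests
pca_fit <- prcomp(scale(USArrests))

# Variance captured by each component and its cumulative share
pc_var  <- pca_fit$sdev^2
cum_var <- cumsum(pc_var) / sum(pc_var)

# Heuristic 1: smallest number of PCs reaching a target share of variance
# (the 0.8 threshold is an arbitrary choice for illustration)
threshold <- 0.8
k_threshold <- which(cum_var >= threshold)[1]

# Heuristic 2 (Kaiser criterion): keep PCs whose variance exceeds 1,
# i.e. PCs that explain more than one standardized variable would
k_kaiser <- sum(pc_var > 1)

print(cum_var)
print(c(threshold = k_threshold, kaiser = k_kaiser))
```

The two rules need not agree with each other or with the visual elbow, so they are best treated as a cross-check rather than a final answer.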

### Nonlinear Approaches: The importance of visualization
This example is meant to illustrate the importance of visualizing the fit of your dimensionality reduction method. Here we'll use a dataset from the ML community known as the "Swiss Roll dataset."

In [None]:
SwissRoll <- function(N=2000, Height=30, Plot=FALSE){
    # credit: https://github.com/Bioconductor-mirror/RDRToolbox/blob/master/R/SwissRoll.R
    ## build manifold
    p = (3 * pi / 2) * (1 + 2 * runif(N, 0, 1));  
    y = Height * runif(N, 0 , 1);
    samples = cbind(p * cos(p), y, p * sin(p));

    ## plot and return samples
    if(Plot){
        ## load rgl for three dimensional plots
        if(!require(rgl))
           stop("package rgl required for three dimensional plots")
        plot3d(samples, xlab="x", ylab="y", zlab="z");
    }
    return(samples)

}

sr = SwissRoll(Plot=TRUE)
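
One way to see why visualization matters here is to apply a linear method to the roll and look at the result. The sketch below regenerates a Swiss roll with the same parameterization as `SwissRoll()` (so that it runs on its own, without `rgl`), projects it with PCA, and colors each point by its position along the spiral. The sample size, seed, and color palette are arbitrary assumptions.

```r
set.seed(1)

# Generate a Swiss roll using the same formulas as SwissRoll() above
n <- 1000
p <- (3 * pi / 2) * (1 + 2 * runif(n))   # position along the spiral
y <- 30 * runif(n)                        # height
roll <- cbind(x = p * cos(p), y = y, z = p * sin(p))

# Linear projection onto the first two principal components
proj <- prcomp(roll)$x[, 1:2]

# Color points by their position along the spiral (p); in a good
# "unrolled" embedding, similar colors would sit next to each other
pal <- colorRampPalette(c("blue", "green", "red"))(100)
col_idx <- cut(p, breaks = 100, labels = FALSE)
plot(proj, col = pal[col_idx], pch = 16, cex = 0.6,
     main = "Swiss Roll: PCA projection", xlab = "PC 1", ylab = "PC 2")
```

Because PCA can only rotate and project the data, it cannot unroll the sheet, so points that are far apart along the spiral should end up overlapping in the plot. Summary statistics alone would not reveal this; the plot does.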