## Principal Component Analysis

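Principal Component Analysis (PCA) finds an orthogonal linear transformation that maps a set of possibly correlated observations onto linearly uncorrelated variables called principal components, ordered so that the first component captures the largest share of the variance.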
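To make the definition concrete, here is a minimal sketch of PCA built from the covariance matrix using only Julia's standard library. The toy matrix `X` and all variable names are made up for illustration and are not part of this notebook's data. It assumes one observation per column, the layout MultivariateStats also expects.

```julia
using LinearAlgebra, Statistics

# Toy data (made up): 3 features (rows) × 6 observations (columns)
X = [2.0 4.0 6.0 8.0 10.0 12.0;
     1.0 2.1 2.9 4.2 5.0 6.1;
     5.0 3.0 6.0 2.0 7.0 1.0]

μ  = mean(X, dims=2)                  # per-feature mean
Xc = X .- μ                           # center each feature
C  = cov(Xc, dims=2)                  # 3×3 feature covariance matrix

E     = eigen(Symmetric(C))           # eigenvalues come back in ascending order
order = sortperm(E.values, rev=true)  # reorder so the largest variance is first
λ     = E.values[order]               # variance along each principal direction
W     = E.vectors[:, order]           # columns are orthonormal principal directions

Y = W' * Xc                           # principal components, one observation per column
```

The covariance of `Y` is diagonal with `λ` on the diagonal, which is what "linearly uncorrelated" means here.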

In [45]:
# Packages we will use throughout this notebook
using DataFrames
using Statistics
using JSON
using CSV
using Distances
using MultivariateStats

download("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv","newhouses.csv")
houses = CSV.read("newhouses.csv", DataFrame)

| Row | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64? | Float64 |
| 1 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 |
| 2 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 3 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 5 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 6 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 |
| 7 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 8 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 9 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 10 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 |


In [37]:
size(houses)

(20640, 10)

In [43]:

X = houses[!, [:housing_median_age, :total_rooms, :population]]
# Convert to a matrix; here each row is one observation
X = Matrix(X)
# Odd-numbered rows form the training set, even-numbered rows the test set
X_tr = X[1:2:end, 1:3]
X_te = X[2:2:end, 1:3]

# MultivariateStats expects one observation per column, so transpose before fitting.
# Passing X_tr directly would treat it as 3 observations of 10320 features (indim = 10320).
# Train a PCA model, allowing up to 2 output dimensions
M = fit(PCA, permutedims(X_tr); maxoutdim=2)

In [44]:
# Apply the PCA model to the test set.
# `transform` is undefined in the installed MultivariateStats version;
# newer releases expose this step as `predict`.
Yte = predict(M, permutedims(X_te))
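A fitted PCA model in MultivariateStats reports a `principalratio`, the share of total variance captured by the retained components. The sketch below computes that kind of ratio with the standard library only. The helper `variance_ratio` and the matrix `A` are hypothetical and exist only for illustration; this is not the library's implementation.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: share of total variance kept by the first k components.
# X holds one observation per column.
function variance_ratio(X::AbstractMatrix, k::Integer)
    C = cov(X, dims=2)                          # feature covariance matrix
    λ = sort(eigvals(Symmetric(C)), rev=true)   # component variances, largest first
    return sum(λ[1:k]) / sum(λ)
end

# Made-up data: 3 features × 5 observations
A = [1.0 2.0 3.0 4.0 5.0;
     2.0 4.1 5.9 8.2 9.8;
     0.5 0.4 0.6 0.5 0.4]

r1 = variance_ratio(A, 1)   # variance share of the first component alone
r3 = variance_ratio(A, 3)   # keeping every component retains all of the variance
```

A ratio of exactly 1.0 with fewer components than features usually means the data are degenerate, for example when there are fewer observations than features.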

In [42]:
using MultivariateStats, RDatasets, Plots
plotly() # Plotly backend for interactive 3D plotting

# load iris dataset
iris = dataset("datasets", "iris")

# odd-numbered rows form the training set
# (Matrix replaces the obsolete DataArray conversion; permutedims puts observations in columns)
Xtr = permutedims(Matrix(iris[1:2:end, 1:4]))
Xtr_labels = string.(iris[1:2:end, 5])

# even-numbered rows form the testing set
Xte = permutedims(Matrix(iris[2:2:end, 1:4]))
Xte_labels = string.(iris[2:2:end, 5])

# Xtr and Xte are the training and testing data matrices,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set (`predict` replaces the older `transform`)
Yte = predict(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

# group results by testing set labels for color coding
setosa = Yte[:, Xte_labels .== "setosa"]
versicolor = Yte[:, Xte_labels .== "versicolor"]
virginica = Yte[:, Xte_labels .== "virginica"]

# visualize first 3 principal components in a 3D interactive plot
p = scatter(setosa[1,:], setosa[2,:], setosa[3,:], marker=:circle, linewidth=0)
scatter!(versicolor[1,:], versicolor[2,:], versicolor[3,:], marker=:circle, linewidth=0)
scatter!(virginica[1,:], virginica[2,:], virginica[3,:], marker=:circle, linewidth=0)
plot!(p, xlabel="PC1", ylabel="PC2", zlabel="PC3")
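The `reconstruct` step in the cell above maps low-dimensional scores back to the original feature space, and the result is only approximate when components are discarded. Here is a minimal standard-library sketch of projecting and reconstructing. The data values and the names `Wk`, `err_k`, and `err_full` are made up for illustration, and this is not the library's implementation.

```julia
using LinearAlgebra, Statistics

# Made-up data: 4 features × 6 observations (one observation per column)
X = [5.1 4.9 6.3 5.8 7.1 6.5;
     3.5 3.0 3.3 2.7 3.0 3.0;
     1.4 1.4 6.0 5.1 5.9 5.8;
     0.2 0.2 2.5 1.9 2.1 2.2]

μ = mean(X, dims=2)
E = eigen(Symmetric(cov(X, dims=2)))
W = E.vectors[:, sortperm(E.values, rev=true)]   # principal directions, largest variance first

k  = 2
Wk = W[:, 1:k]            # keep only the first k directions
Y  = Wk' * (X .- μ)       # k × n scores (the projection step)
Xr = Wk * Y .+ μ          # approximate reconstruction in the original space

err_k    = norm(X - Xr)                           # error from dropping directions
err_full = norm(X - (W * (W' * (X .- μ)) .+ μ))   # keeping all 4 directions loses nothing
```

Comparing `err_k` for different values of `k` is one way to judge how many components are worth keeping.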

In [41]:
X_tr

10320×3 Matrix{Float64}:
 41.0   880.0   322.0
 52.0  1467.0   496.0
 52.0  1627.0   565.0
 52.0  2535.0  1094.0
 42.0  2555.0  1206.0
 52.0  2202.0   910.0
 52.0  2491.0  1098.0
 52.0  2643.0  1212.0
 52.0  1966.0   793.0
 50.0  2239.0   990.0
  ⋮            
 20.0   755.0   457.0
 16.0  1698.0   731.0
 36.0  1124.0   504.0
 19.0  2043.0  1018.0
 11.0  2640.0  1257.0
 15.0  2319.0  1047.0
 28.0  2332.0  1041.0
 18.0   697.0   356.0
 18.0  1860.0   741.0
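The columns of `X_tr` differ in scale by orders of magnitude (ages in the tens, room counts in the thousands), and PCA on raw values is dominated by the feature with the largest variance. One common remedy is to standardize each feature before fitting. The sketch below assumes a features × observations layout; the helper `standardize` is hypothetical, and the sample values are the first four rows of the `X_tr` output above.

```julia
using Statistics

# Hypothetical helper: z-score each feature (row) of a features × observations matrix.
function standardize(X::AbstractMatrix)
    μ = mean(X, dims=2)   # per-feature mean
    σ = std(X, dims=2)    # per-feature standard deviation
    return (X .- μ) ./ σ
end

# First four rows of X_tr, transposed so each column is one observation
X = [ 41.0   52.0   52.0   52.0;    # housing_median_age
     880.0 1467.0 1627.0 2535.0;    # total_rooms
     322.0  496.0  565.0 1094.0]    # population

Z = standardize(X)        # every row now has mean 0 and standard deviation 1
```

Whether to standardize depends on whether the original units are meaningful for the analysis. If the training data are standardized, the same means and standard deviations should be applied to the test data.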