# Install Julia on Google Colab
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia (the Jupyter kernel for Julia) and other packages. You can update `JULIA_VERSION` and the other parameters, if you know what you're doing. Installation takes 2-3 minutes.
3. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the _Checking the Installation_ section.

* _Note_: If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2 and 3.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.4.2" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools PyCall PyPlot"
JULIA_PACKAGES_IF_GPU="CUDA"
JULIA_NUM_THREADS=4
#---------------------------------------------------#

if [ -n "$COLAB_GPU" ] && [ -z "$(which julia)" ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  if [ "$COLAB_GPU" = "1" ]; then
      JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"'
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f "$KERNEL_NAME" "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.4.2 on the current Colab Runtime...
2020-12-05 03:39:18 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.4/julia-1.4.2-linux-x86_64.tar.gz [99093958/99093958] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
    Cloning default registries into `~/.julia`
    Cloning registry from "https://github.com/JuliaRegistries/General.git"
      Added registry `General` to `~/.julia/registries/General`
  Resolving package versions...
  Installed Artifacts ─────── v1.3.0
  Installed VersionParsing ── v1.2.0
  Installed MbedTLS_jll ───── v2.16.8+1
  Installed ZeroMQ_jll ────── v4.3.2+5
  Installed SoftGlobalScope ─ v1.1.0
  Installed Parsers ───────── v1.0.13
  Installed IJulia ────────── v1.23.1
  Installed JLLWrappers ───── v1.1.3
  Installed JSON ──────────── v0.21.1
  Installed Conda ─────────── v1.5.0
  Installed ZMQ ───────────── v1.2.1
  Installed MbedTLS ───────── v1.0.3
Downloading artifact: MbedTLS

# Unit 4: Unsupervised ML

In [None]:
using Pkg

In [None]:
Pkg.add(["RDatasets","MultivariateStats","Clustering","Plots","CSV","DataFrames","StatsBase","StatsPlots","Distances"])

  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
 [no changes]
   Updating `~/.julia/environments/v1.4/Manifest.toml`
 [no changes]




---

## **Dimensionality Reduction: Principal Component Analysis**


---



In [None]:
using MultivariateStats, RDatasets, Plots
plotly() # use the Plotly backend for interactive 3D plots

# load iris dataset
iris = dataset("datasets", "iris")

# use every other row (odd rows) as the training set
Xtr = Matrix(iris[1:2:end,1:4])'
Xtr_labels = convert(Array,iris[1:2:end,5])

# use the remaining (even) rows as the testing set
Xte = Matrix(iris[2:2:end,1:4])'
Xte_labels = convert(Array,iris[2:2:end,5])

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = MultivariateStats.fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = MultivariateStats.transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = MultivariateStats.reconstruct(M, Yte)

# group results by testing set labels for color coding
setosa = Yte[:,Xte_labels.=="setosa"]
versicolor = Yte[:,Xte_labels.=="versicolor"]
virginica = Yte[:,Xte_labels.=="virginica"]

# visualize the first 3 principal components in an interactive 3D plot
p = scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0)
scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0)
scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3",size = (600, 600))
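
To make the fit, transform, and reconstruct steps above more concrete, here is a minimal sketch of PCA written with only the `LinearAlgebra` and `Statistics` standard libraries. It is not the MultivariateStats implementation (scaling conventions and component signs may differ), and the toy matrix `X` and the names `mu`, `P`, `Y`, and `explained` are assumptions made up for illustration.

```julia
using LinearAlgebra, Statistics

# Hypothetical toy data: 3 features (rows) x 6 observations (columns),
# the same "one observation per column" layout used for Xtr above.
X = [2.0 4.0 6.0 8.0 10.0 12.0;
     1.0 3.0 2.0 5.0  4.0  6.0;
     0.5 0.4 0.6 0.5  0.4  0.6]

k = 2                                   # number of components to keep

mu = mean(X, dims=2)                    # per-feature mean (3x1)
Xc = X .- mu                            # center every observation
C  = Xc * Xc' / (size(X, 2) - 1)        # sample covariance matrix (3x3)

E     = eigen(Symmetric(C))             # eigenvalues come back in ascending order
order = sortperm(E.values, rev=true)    # largest variance first
P     = E.vectors[:, order[1:k]]        # projection matrix (3xk)

Y  = P' * Xc                            # "transform": project onto k components
Xr = P * Y .+ mu                        # "reconstruct": approximate the original data

# fraction of the total variance captured by the k components
explained = sum(E.values[order[1:k]]) / sum(E.values)
println("variance explained by ", k, " components: ", explained)
```

Raising `k` to the number of features makes the reconstruction exact (up to rounding), which is a quick sanity check on the projection.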

In [None]:
using MultivariateStats, RDatasets, Plots, DataFrames, CSV;
plotly() # use the Plotly backend for interactive 3D plots

# load wine dataset (expects wine.csv in the working directory)
wine = DataFrame(CSV.File("wine.csv"))

# use every other row (odd rows) as the training set
Xtr = Matrix(wine[1:2:end,2:13])'
Xtr_labels = convert(Array,wine[1:2:end,1])

# use the remaining (even) rows as the testing set
Xte = Matrix(wine[2:2:end,2:13])'
Xte_labels = convert(Array,wine[2:2:end,1])

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model, allowing up to 6 dimensions
M = MultivariateStats.fit(PCA, Xtr; maxoutdim=6)

# apply PCA model to testing set
Yte = MultivariateStats.transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = MultivariateStats.reconstruct(M, Yte)

# group results by testing set labels for color coding
type_1_wine = Yte[:,Xte_labels.==1]
type_2_wine = Yte[:,Xte_labels.==2]
type_3_wine = Yte[:,Xte_labels.==3]

# visualize the first 3 principal components in an interactive 3D plot
p = scatter(type_1_wine[1,:],type_1_wine[2,:],type_1_wine[3,:],marker=:circle,linewidth=0)
scatter!(type_2_wine[1,:],type_2_wine[2,:],type_2_wine[3,:],marker=:circle,linewidth=0)
scatter!(type_3_wine[1,:],type_3_wine[2,:],type_3_wine[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3",size = (600, 600))






---

## **K-Means Clustering**


---


In [None]:
using RDatasets, Clustering, Plots
iris = dataset("datasets", "iris"); # load the data

features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
result = kmeans(features, 3); # run K-means for the 3 clusters

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments, color=:lightrainbow, legend=false)
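
As a rough illustration of what `kmeans` does internally, the sketch below alternates the two steps of Lloyd's algorithm (assign each point to its nearest center, then move each center to the mean of its points). It is a simplified stand-in, not the Clustering.jl implementation; the toy data, the naive choice of initial centers, and the function name `lloyd` are all hypothetical.

```julia
using Statistics

# Hypothetical toy data: 2 features x 6 points, one point per column
# (the same layout that `features` uses above).
X = [1.0 1.2 0.8 5.0 5.2 4.8;
     1.0 0.9 1.1 5.0 5.1 4.9]

# Naive initialization (an assumption for this sketch): use points 1 and 4 as centers.
initial_centers = X[:, [1, 4]]

function lloyd(X, initial_centers; iters=10)
    centers = copy(initial_centers)
    assignments = zeros(Int, size(X, 2))
    for _ in 1:iters
        # assignment step: nearest center by squared Euclidean distance
        for j in 1:size(X, 2)
            dists = [sum(abs2, X[:, j] .- centers[:, c]) for c in 1:size(centers, 2)]
            assignments[j] = argmin(dists)
        end
        # update step: move each center to the mean of its assigned points
        for c in 1:size(centers, 2)
            members = X[:, assignments .== c]
            if !isempty(members)
                centers[:, c] = vec(mean(members, dims=2))
            end
        end
    end
    return assignments, centers
end

assignments, centers = lloyd(X, initial_centers)
println("assignments: ", assignments)
println("centers: ", centers)
```

A fixed iteration count keeps the sketch short; a fuller version would stop once the assignments no longer change.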


## Without Normalization

In [None]:
using MultivariateStats, RDatasets, Plots, DataFrames, CSV, Clustering
plotly() # use the Plotly backend for interactive 3D plots

# load wine dataset
wine = DataFrame(CSV.File("./wine.csv"))

# use every other row (odd rows) as the training set
Xtr = Matrix(wine[1:2:end,2:13])'
Xtr_labels = convert(Array,wine[1:2:end,1])

# use the remaining (even) rows as the testing set
Xte = Matrix(wine[2:2:end,2:13])'
Xte_labels = convert(Array,wine[2:2:end,1])
 
# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = MultivariateStats.fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = MultivariateStats.transform(M, Xte)

# run K-means for the 3 clusters
result = kmeans(Yte, 3);

# reconstruct testing observations (approximately)
Xr = MultivariateStats.reconstruct(M, Yte)

# group results by K-means cluster assignment for color coding
type_1_wine = Yte[:,result.assignments.==1]
type_2_wine = Yte[:,result.assignments.==2]
type_3_wine = Yte[:,result.assignments.==3]

# visualize the first 3 principal components in an interactive 3D plot
p = scatter(type_1_wine[1,:],type_1_wine[2,:],type_1_wine[3,:],marker=:circle,linewidth=0)
scatter!(type_2_wine[1,:],type_2_wine[2,:],type_2_wine[3,:],marker=:circle,linewidth=0)
scatter!(type_3_wine[1,:],type_3_wine[2,:],type_3_wine[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3",size = (600, 600))





## With Normalization




In [None]:
using MultivariateStats, RDatasets, Plots, StatsBase, DataFrames, CSV, Clustering
plotly() # use the Plotly backend for interactive 3D plots

# load wine dataset
wine = DataFrame(CSV.File("./wine.csv"))

X = Matrix(wine[:,2:14])

# Normalize the Dataset
dt = StatsBase.fit(UnitRangeTransform,X,dims=1)
norm_df = StatsBase.transform(dt,X)

# split half to training set
Xtr = norm_df[1:2:end,:]'
Xtr_labels = convert(Array,wine[1:2:end,1])

# split other half to testing set
Xte = norm_df[2:2:end,:]'
Xte_labels = convert(Array,wine[2:2:end,1])

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = MultivariateStats.fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = MultivariateStats.transform(M, Xte)

# run K-means for the 3 clusters
result = kmeans(Yte, 3);

# reconstruct testing observations (approximately)
Xr = MultivariateStats.reconstruct(M, Yte)

# group results by K-means cluster assignment for color coding
type_1_wine = Yte[:,result.assignments.==1]
type_2_wine = Yte[:,result.assignments.==2]
type_3_wine = Yte[:,result.assignments.==3]

# visualize the first 3 principal components in an interactive 3D plot
p = scatter(type_1_wine[1,:],type_1_wine[2,:],type_1_wine[3,:],marker=:circle,linewidth=0)
scatter!(type_2_wine[1,:],type_2_wine[2,:],type_2_wine[3,:],marker=:circle,linewidth=0)
scatter!(type_3_wine[1,:],type_3_wine[2,:],type_3_wine[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3",size = (600, 600))
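
The normalization step above rescales every feature (column) to the range [0, 1]. A minimal sketch of that min-max rescaling, written by hand on a made-up matrix, may help show why it matters: without it, the column with the larger numeric range dominates Euclidean distances. The data and the names `lo`, `hi`, and `Xn` are hypothetical, and this is only assumed to mirror what `UnitRangeTransform` with `dims=1` computes.

```julia
# Hypothetical toy data: 3 observations (rows) x 2 features (columns).
# The second feature has a much larger numeric range than the first.
X = [1.0 100.0;
     2.0 300.0;
     3.0 200.0]

lo = minimum(X, dims=1)          # per-column minimum (1x2)
hi = maximum(X, dims=1)          # per-column maximum (1x2)
Xn = (X .- lo) ./ (hi .- lo)     # rescale each column to [0, 1]

println(Xn)
```

A constant column would make `hi .- lo` zero, so a real pipeline needs to handle that case separately.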





---

## **DBSCAN Clustering**


---


In [None]:
using Clustering, Distances, DataFrames, Plots

X1 = randn(2, 200) .+ [0., 5.]
X2 = randn(2, 200) .+ [-5., 0.]
X3 = randn(2, 200) .+ [5., 0.]
X = hcat(X1, X2, X3)

df = DataFrame(X', :auto) # columns are named x1 and x2

D = pairwise(Euclidean(), X, dims=2)

R = dbscan(D, 1.0, 5)

scatter(df.x1, df.x2, marker_z=R.assignments, color=:lightrainbow, legend=false)
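
The two numeric arguments passed to `dbscan` above are a neighborhood radius and a minimum neighbor count. The sketch below shows, on a tiny made-up point set, how those two numbers single out dense "core" points from a distance matrix. It only illustrates the idea: it assumes a point counts as its own neighbor and uses `<=` for the radius test, which may not match the library's exact conventions. The names `radius`, `minpts`, `neighbors`, and `is_core` are hypothetical.

```julia
# Hypothetical toy data: 2 x 7 points, one point per column.
# Points 1-4 form a dense group, points 5-6 a sparse pair, point 7 is isolated.
X = [0.0 0.1 0.2 0.1 5.0 5.1 9.0;
     0.0 0.1 0.0 0.2 5.0 5.0 9.0]

n = size(X, 2)

# Pairwise Euclidean distance matrix (n x n), computed by hand
D = [sqrt(sum(abs2, X[:, i] .- X[:, j])) for i in 1:n, j in 1:n]

radius = 0.5   # neighborhood radius
minpts = 3     # minimum number of neighbors for a core point

# number of points within `radius` of each point (each point counts itself)
neighbors = vec(sum(D .<= radius, dims=2))

# core points have enough neighbors; the rest are border points or noise
is_core = neighbors .>= minpts

println("neighbor counts: ", neighbors)
println("core points:     ", findall(is_core))
```

Shrinking `radius` or raising `minpts` turns more points into noise, which is the trade-off being tuned in the cells of this section.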



## Without Normalization

In [None]:
using RDatasets, Clustering, Plots, Distances

# load the data
iris = dataset("datasets", "iris"); 

# features to use for clustering
X = collect(Matrix(iris[:, 1:4]));

D = pairwise(Euclidean(), X, dims=1)

R = dbscan(D, 0.67, 26)

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=R.assignments, color=:lightrainbow, legend=false)

## With Normalization

In [None]:
using RDatasets, Clustering, Plots, Distances, StatsBase

# load the data
iris = dataset("datasets", "iris"); 

# features to use for clustering
features = collect(Matrix(iris[:, 1:4]));

# Normalize the Dataset
dt = StatsBase.fit(UnitRangeTransform, features, dims=1)
norm_feat = StatsBase.transform(dt,features)

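# calculate distance matrix (pairwise already returns a symmetric matrix,
# so no D += D' step is needed)
D = pairwise(Euclidean(), norm_feat, dims=1)

# neighborhood radius 0.15 (in normalized units), at least 15 points per neighborhood
R = dbscan(D, 0.15, 15)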

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=R.assignments, color=:lightrainbow, legend=false)



---

## **Hierarchical Clustering**

---




In [None]:
using Clustering,StatsPlots
D = rand(10, 10)
D += D'
hc = hclust(D, linkage=:single)
plot(hc)
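
The `linkage` keyword decides how the distance between two clusters is derived from the pairwise distances of their members. As a small illustration (not the `hclust` implementation), the sketch below computes the single, average, and complete linkage distances between two hand-picked clusters from a made-up symmetric distance matrix. The matrix `D` and the cluster index vectors `A` and `B` are hypothetical.

```julia
using LinearAlgebra, Statistics

# Hypothetical symmetric distance matrix for 4 points
D = [0.0 1.0 4.0 5.0;
     1.0 0.0 3.0 6.0;
     4.0 3.0 0.0 2.0;
     5.0 6.0 2.0 0.0]

# hierarchical clustering expects a symmetric matrix
println("symmetric: ", issymmetric(D))

A = [1, 2]   # indices of the points in the first cluster
B = [3, 4]   # indices of the points in the second cluster

between = D[A, B]                      # all cross-cluster distances (2x2)

single_link   = minimum(between)       # closest pair
average_link  = mean(between)          # mean over all pairs
complete_link = maximum(between)       # farthest pair

println("single: ", single_link, "  average: ", average_link, "  complete: ", complete_link)
```

Single linkage tends to chain clusters together through their closest points, while average linkage is less sensitive to individual pairs.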

## Without Normalization

In [None]:
using Clustering,StatsPlots,CSV,DataFrames,Distances

# load wholesale customers dataset
wholesale = DataFrame(CSV.File("./Wholesale_customers_data.csv"))

X = float(Matrix(wholesale[:,1:8]))

# calculate proximity (distance) matrix for X
R = pairwise(Euclidean(), X, dims=1)

println("Symmetry of R : " , R==R')
println()

# Since R is symmetric we ignore this step
# R += R' 

hc = hclust(R, linkage=:average)
plot(hc,size = (900, 600))

Symmetry of R : true





## With Normalization

In [None]:
using Clustering,StatsPlots,CSV,DataFrames,StatsBase,Distances

# load wholesale customers dataset
wholesale = DataFrame(CSV.File("./Wholesale_customers_data.csv"))

X = float(Matrix(wholesale[:,1:8]))

# Normalize the Dataset
dt = StatsBase.fit(UnitRangeTransform,X,dims=1)
norm_df = StatsBase.transform(dt,X)

# calculate proximity (distance) matrix for norm_df
R = pairwise(Euclidean(), norm_df, dims=1)

println("Symmetry of R : " , R==R')
println()

# Since R is symmetric we ignore this step
# R += R' 

hc = hclust(R, linkage=:average)
plot(hc,size = (900, 600))


Symmetry of R : true



