# Regression with Deep Neural Model (DNN) using Julia Flux.

by Uki D. Lucas

September 11, 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regression-with-Deep-Neural-Model-(DNN)-using-Julia-Flux." data-toc-modified-id="Regression-with-Deep-Neural-Model-(DNN)-using-Julia-Flux.-1">Regression with Deep Neural Model (DNN) using Julia Flux.</a></span></li><li><span><a href="#Motivation" data-toc-modified-id="Motivation-2">Motivation</a></span></li><li><span><a href="#Declare-libraries-to-be-used" data-toc-modified-id="Declare-libraries-to-be-used-3">Declare libraries to be used</a></span></li><li><span><a href="#Set-Hyper-Parameters" data-toc-modified-id="Set-Hyper-Parameters-4">Set Hyper Parameters</a></span></li><li><span><a href="#DataSet" data-toc-modified-id="DataSet-5">DataSet</a></span><ul class="toc-item"><li><span><a href="#Explore-possible-RDatasets-data-sets" data-toc-modified-id="Explore-possible-RDatasets-data-sets-5.1">Explore possible RDatasets data sets</a></span></li><li><span><a href="#Fetch-Iris-DataFrame" data-toc-modified-id="Fetch-Iris-DataFrame-5.2">Fetch Iris DataFrame</a></span></li><li><span><a href="#Write-the-dataset-locally" data-toc-modified-id="Write-the-dataset-locally-5.3">Write the dataset locally</a></span></li></ul></li><li><span><a href="#Exploratory-Data-Analysis-(EDA)" data-toc-modified-id="Exploratory-Data-Analysis-(EDA)-6">Exploratory Data Analysis (EDA)</a></span><ul class="toc-item"><li><span><a href="#Print-column-numbers-and-names" data-toc-modified-id="Print-column-numbers-and-names-6.1">Print column numbers and names</a></span></li><li><span><a href="#Group-rows-by-Species-name" data-toc-modified-id="Group-rows-by-Species-name-6.2">Group rows by Species name</a></span></li><li><span><a href="#Filter-DataFrame-row-for-Specie-name" data-toc-modified-id="Filter-DataFrame-row-for-Specie-name-6.3">Filter DataFrame row for Specie name</a></span></li><li><span><a href="#Show-Types-of-the-columns" data-toc-modified-id="Show-Types-of-the-columns-6.4">Show Types of the 
columns</a></span></li></ul></li><li><span><a href="#One-hot-encoding-of-the-categories" data-toc-modified-id="One-hot-encoding-of-the-categories-7">One-hot encoding of the categories</a></span><ul class="toc-item"><li><span><a href="#Insert-one-hot-columns-into-DataFrame" data-toc-modified-id="Insert-one-hot-columns-into-DataFrame-7.1">Insert one-hot columns into DataFrame</a></span></li><li><span><a href="#Insert-one-hot-encodings-for-each-row" data-toc-modified-id="Insert-one-hot-encodings-for-each-row-7.2">Insert one-hot encodings for each row</a></span></li><li><span><a href="#Save-the-dataset-to-the-disk" data-toc-modified-id="Save-the-dataset-to-the-disk-7.3">Save the dataset to the disk</a></span></li></ul></li><li><span><a href="#Split-the-dataset-into-training-and-validation-sets" data-toc-modified-id="Split-the-dataset-into-training-and-validation-sets-8">Split the dataset into training and validation sets</a></span></li><li><span><a href="#Define-DNN-model" data-toc-modified-id="Define-DNN-model-9">Define DNN model</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-10">Resources</a></span></li></ul></div>

# Motivation

I have not found a clear example of how to perform regression (prediction of a number) with a deep neural network in Julia, in this case using Flux.

I am using the Iris dataset because it is very well known, which removes one step of difficulty.

# Declare libraries to be used

In [1]:
using Flux
using Flux: logitcrossentropy, normalise, onecold, onehotbatch
using Statistics: mean
using Parameters: @with_kw

# Set Hyper Parameters

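Run the cell below only once per session: changing the fields of a struct after it has been defined typically requires restarting the Julia session.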

In [2]:
@with_kw mutable struct HyperParameters
    learning_rate::Float64 = 0.5
    epochs::Int = 110
    split_ratio::Float64 = 0.2
end

HyperParameters
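
The `@with_kw` macro generates a keyword constructor, so individual defaults can be overridden when the struct is created. A minimal sketch, using a hypothetical `DemoParameters` struct (not part of this notebook) so that the cell above does not have to be re-run:

```julia
using Parameters: @with_kw

# Hypothetical struct mirroring HyperParameters, for illustration only
@with_kw mutable struct DemoParameters
    learning_rate::Float64 = 0.5
    epochs::Int = 110
    split_ratio::Float64 = 0.2
end

defaults = DemoParameters()               # all default values
custom   = DemoParameters(epochs = 50)    # override a single field by keyword
custom.learning_rate = 0.1                # the struct is mutable, so fields can be changed later

println(defaults.epochs, " ", custom.epochs, " ", custom.learning_rate)
```

Keeping all tunable values in one struct means a single object can be passed to the splitting and training steps.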

# DataSet

## Explore possible RDatasets data sets

In [3]:
download("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/datasets.csv",
    "RDatasets.csv") # URL, name to save
using CSV, DataFrames
df1 = CSV.read("RDatasets.csv", DataFrame) # recent CSV.jl versions require the sink type as the second argument
println(size(df1))
df1[100:105,:]

(1303, 12)


| Row | Package | Item | Title |
|-----|---------|------|-------|
|     | String | String | String |
| 1 | carData | States | Education and Related Statistics for the U.S. States |
| 2 | carData | TitanicSurvival | Survival of Passengers on the Titanic |
| 3 | carData | Transact | Transaction data |
| 4 | carData | UN | National Statistics from the United Nations, Mostly From 2009-2011 |
| 5 | carData | UN98 | United Nations Social Indicators Data 1998] |
| 6 | carData | USPop | Population of the United States |


## Fetch Iris DataFrame

In [4]:
using RDatasets: dataset
iris = dataset("datasets", "iris") # return DataFrames.DataFrame
iris[1:5, :]

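| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Categorical… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |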


## Write the dataset locally

In [5]:
CSV.write("iris.csv", iris, delim = ',')

"iris.csv"

# Exploratory Data Analysis (EDA)


## Print column numbers and names

In [6]:
for (i, name) in enumerate(names(iris))
    println(i, " ", name)
end

1 SepalLength
2 SepalWidth
3 PetalLength
4 PetalWidth
5 Species


## Group rows by Species name

In [7]:
using DataFrames
groups = groupby(iris, [:Species] ) # GroupedDataFrame{DataFrame}

#[ groups[1][1,5] groups[2][1,5] groups[3][1,5] ] # show row 1, column 5 from each group

labels = []
for i in 1:length(groups)
    label = groups[i][1,5]
    push!(labels, label ) # example CategoricalString{UInt8} "setosa"
    println( label )
end

labels

setosa
versicolor
virginica


3-element Array{Any,1}:
 CategoricalString{UInt8} "setosa"
 CategoricalString{UInt8} "versicolor"
 CategoricalString{UInt8} "virginica"

## Filter DataFrame row for Specie name

In [8]:
filter_virginica = iris[!, :Species] .== "virginica" # for each row, determine whether Species is virginica
filter_virginica[end-3:end]

4-element BitArray{1}:
 1
 1
 1
 1

In [9]:
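# At this point iris has only 5 columns (the one-hot columns 6:8 are added in a later section),
# so selecting columns 5:8 here raises a BoundsError; select the existing columns instead.
x = iris[filter_virginica, :]
x[1:3, :]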

## Show Types of the columns

In [10]:
eltype.(eachcol(iris))

5-element Array{DataType,1}:
 Float64
 Float64
 Float64
 Float64
 CategoricalString{UInt8}

# One-hot encoding of the categories

- https://fluxml.ai/Flux.jl/stable/data/onehot/

## Insert one-hot columns into DataFrame

In [11]:
insertcols!(iris,                    # DataFrame to be changed
        6,                           # insert as this column number
        :setosa => 0,                # column name => initial value, make sure the type is right
        makeunique=true)             # if the column name already exists, rename it to name_1
insertcols!(iris, 7, :versicolor => 0, makeunique=true)
insertcols!(iris, 8, :virginica => 0, makeunique=true)
iris[1:3, 5:end]

| Row | Species | setosa | versicolor | virginica |
|-----|---------|--------|------------|-----------|
|     | Categorical… | Int64 | Int64 | Int64 |
| 1 | setosa | 0 | 0 | 0 |
| 2 | setosa | 0 | 0 | 0 |
| 3 | setosa | 0 | 0 | 0 |


## Insert one-hot encodings for each row

There is a **Flux.onehot** function, but I found it cryptic, so I preferred to write my own code.

In [20]:
using DataFrames
column_species     = 5                              # column number with labels
column_setosa      = 6
column_versicolor  = 7
column_virginica   = 8

number_of_rows = size(iris, 1)                      # number of rows

for i in 1:number_of_rows                           # go through all rows
    specie = string( iris[i, column_species] )      # CategoricalString{UInt8} to String
    
    if specie == "setosa" 
        iris[i, column_setosa] = 1
    elseif specie == "versicolor" 
        iris[i, column_versicolor] = 1
    elseif specie == "virginica" 
        iris[i, column_virginica] = 1
    else
        println( specie, " not found!" )
    end
end

iris[48:53, 5:end]                                  # a quick sanity check

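| Row | Species | setosa | versicolor | virginica |
|-----|---------|--------|------------|-----------|
|     | Categorical… | Int64 | Int64 | Int64 |
| 1 | setosa | 1 | 0 | 0 |
| 2 | setosa | 1 | 0 | 0 |
| 3 | setosa | 1 | 0 | 0 |
| 4 | versicolor | 0 | 1 | 0 |
| 5 | versicolor | 0 | 1 | 0 |
| 6 | versicolor | 0 | 1 | 0 |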
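
For comparison, here is a minimal sketch of the Flux helpers for the same encoding. The sample labels are made up for illustration and the class order is my assumption. Note that Flux puts classes in rows and samples in columns, the opposite of the DataFrame layout above.

```julia
using Flux: onehot, onehotbatch, onecold

classes = ["setosa", "versicolor", "virginica"]   # assumed class order
species = ["setosa", "virginica", "versicolor"]   # made-up sample labels, not rows from iris

single = onehot("versicolor", classes)    # one-hot vector for a single label
batch  = onehotbatch(species, classes)    # 3×3 matrix: one row per class, one column per sample

# onecold inverts the encoding and returns the original labels
decoded = onecold(batch, classes)
println(decoded)
```

The hand-written loop keeps everything inside the DataFrame, while `onehotbatch` produces the matrix shape that Flux loss functions expect.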


## Save the dataset to the disk

In [21]:
CSV.write("iris.csv", iris, delim = ',')

"iris.csv"

# Split the dataset into training and validation sets

In [35]:
using Random

function split_dataset(df, split_ratio)
    records = size(df, 1)
    validation_rows = Random.randsubseq(1:records, split_ratio)
    training_rows = [i for i in 1:records if isempty(searchsorted(validation_rows, i))]
    return (df[training_rows, :], df[validation_rows, :]) # training, validation sets
end

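hyper_parameters = HyperParameters()   # not named `params`, which would shadow the `params` function exported by Flux
training_set, validation_set = split_dataset(iris, hyper_parameters.split_ratio)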

(117×8 DataFrame. Omitted printing of 3 columns
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species      │
│     │ Float64     │ Float64    │ Float64     │ Float64    │ Categorical… │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────────┤
│ 1   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa       │
│ 2   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa       │
│ 3   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa       │
│ 4   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa       │
│ 5   │ 4.6         │ 3.4        │ 1.4         │ 0.3        │ setosa       │
│ 6   │ 5.0         │ 3.4        │ 1.5         │ 0.2        │ setosa       │
│ 7   │ 4.4         │ 2.9        │ 1.4         │ 0.2        │ setosa       │
│ 8   │ 4.9         │ 3.1        │ 1.5         │ 0.1        │ setosa       │
│ 9   │ 4.8         │ 3.4        │ 1.6         │ 0.2   
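
`randsubseq` includes each row independently with probability `split_ratio`, so the size of the validation set varies from run to run (here 33 of 150 rows rather than 30). If an exact and reproducible split is wanted, one option is to shuffle the row numbers and cut them at a fixed position. A minimal sketch, where `split_indices` and the seed `42` are hypothetical names and values of my own choosing:

```julia
using Random: shuffle, MersenneTwister

# Hypothetical helper: returns (training_rows, validation_rows) as sorted index vectors
function split_indices(records::Int, split_ratio::Float64; rng = MersenneTwister(42))
    shuffled = shuffle(rng, 1:records)                   # random permutation of the row numbers
    n_validation = round(Int, records * split_ratio)     # exact size of the validation set
    validation_rows = sort(shuffled[1:n_validation])
    training_rows = sort(shuffled[n_validation+1:end])
    return training_rows, validation_rows
end

training_rows, validation_rows = split_indices(150, 0.2)
println(length(training_rows), " training rows, ", length(validation_rows), " validation rows")
```

The returned vectors can index a DataFrame directly, for example `iris[training_rows, :]`.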

# Define DNN model


- $\sigma$ is the sigmoid activation function, $\sigma(x) = 1 / (1 + e^{-x})$
- https://fluxml.ai/Flux.jl/stable/models/layers/

In [22]:
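using Flux # Flux exports σ (Greek small sigma, U+03C3); the look-alike 𝜎 (U+1D70E) is likely what raised "UndefVarError: 𝜎 not defined"

number_of_neurons = 3   # used below as the input size of the Dense layer
number_of_outputs = 1   # a single number, as needed for regression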

model = Chain(
    Dense(number_of_neurons, number_of_outputs )
    # ...
    )

Chain(Dense(3, 1))
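
To make the regression setup concrete, here is a minimal sketch of a model with one sigmoid hidden layer and a linear output, evaluated with mean squared error. The hidden-layer width, the variable names, and the random data are my assumptions for illustration, not values from the Iris dataset. The sketch follows the `Dense(in, out)` form used above.

```julia
using Flux

number_of_inputs  = 3    # assumed: three measurements used as features
number_of_hidden  = 8    # hypothetical hidden-layer width
number_of_outputs = 1    # one predicted number

regression_model = Chain(
    Dense(number_of_inputs, number_of_hidden, σ),   # hidden layer with sigmoid activation
    Dense(number_of_hidden, number_of_outputs)      # linear output layer, no activation
    )

# Made-up data: 5 observations in columns, 3 features in rows
X_demo = rand(Float32, number_of_inputs, 5)
y_demo = rand(Float32, number_of_outputs, 5)

# Mean squared error is a common loss for regression
loss(x, y) = Flux.mse(regression_model(x), y)

predictions = regression_model(X_demo)   # 1×5 matrix, one predicted number per observation
println(size(predictions), " loss = ", loss(X_demo, y_demo))
```

Leaving the last layer without an activation lets the output take any real value, which is what distinguishes this from the classification setup, where the outputs are turned into class scores.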

In [None]:
X = Matrix(iris[1:100, 1:2])'  # The observations have to be in the columns


In [None]:

# Scratch cell: `svm` and `predict` are assumed to come from the SVM.jl package.
# SVM format expects observations in columns and features in rows
X = Matrix(iris[:, 1:4])'
p, n = size(X)

# SVM format expects positive and negative examples to be labelled +1/-1
Y = [species == "setosa" ? 1.0 : -1.0 for species in iris[!, :Species]]

# Select a random subset of the data for training, test on the rest.
train_mask = rand(Bool, n)

# We'll fit a model with all of the default parameters
model = svm(X[:, train_mask], Y[train_mask])

# And now evaluate that model on the test set
svm_accuracy = count(predict(model, X[:, .!train_mask]) .== Y[.!train_mask]) / count(.!train_mask)


# Resources
- https://medium.com/gft-engineering/start-to-learn-machine-learning-with-the-iris-flower-classification-challenge-4859a920e5e3
- https://medium.com/@Nivitus./iris-flower-classification-machine-learning-d4e337140fa4

In [None]:
using Flux
using Flux: logitcrossentropy, normalise, onecold, onehotbatch
using Statistics: mean
using Parameters: @with_kw

@with_kw mutable struct Args
    lr::Float64 = 0.5
    repeat::Int = 110
end

function get_processed_data(args)
    labels = Flux.Data.Iris.labels()
    features = Flux.Data.Iris.features()

    # Subtract mean, divide by std dev for normed mean of 0 and std dev of 1.
    normed_features = normalise(features, dims=2)

    klasses = sort(unique(labels))
    onehot_labels = onehotbatch(labels, klasses)

    # Split into training and test sets, 2/3 for training, 1/3 for test.
    train_indices = [1:3:150 ; 2:3:150]

    X_train = normed_features[:, train_indices]
    y_train = onehot_labels[:, train_indices]

    X_test = normed_features[:, 3:3:150]
    y_test = onehot_labels[:, 3:3:150]

    #repeat the data `args.repeat` times
    train_data = Iterators.repeated((X_train, y_train), args.repeat)
    test_data = (X_test,y_test)

    return train_data, test_data
end

# Accuracy Function
accuracy(x, y, model) = mean(onecold(model(x)) .== onecold(y))

# Function to build confusion matrix
function confusion_matrix(X, y, model)
    ŷ = onehotbatch(onecold(model(X)), 1:3)
    y * transpose(ŷ)
end

function train(; kws...)
    # Initialize hyperparameter arguments
    args = Args(; kws...)	

    #Loading processed data
    train_data, test_data = get_processed_data(args)

    # Declare model taking 4 features as inputs and outputting 3 scores (logits),
    # one for each species of iris.
    model = Chain(Dense(4, 3))

    # Defining loss function to be used in training
    # For numerical stability, we use logitcrossentropy here, which applies the softmax internally
    loss(x, y) = logitcrossentropy(model(x), y)
	
    # Training
    # Gradient descent optimiser with learning rate `args.lr`
    optimiser = Descent(args.lr)

    println("Starting training.")
    Flux.train!(loss, Flux.params(model), train_data, optimiser) # qualified, so a global named `params` cannot shadow it
	
    return model, test_data
end

function test(model, test_data)
    # Testing model performance on test data
    X_test, y_test = test_data
    accuracy_score = accuracy(X_test, y_test, model)

    println("\nAccuracy: $accuracy_score")

    # Sanity check.
    @assert accuracy_score > 0.8

    # To avoid confusion, here is the definition of a Confusion Matrix: https://en.wikipedia.org/wiki/Confusion_matrix
    println("\nConfusion Matrix:\n")
    display(confusion_matrix(X_test, y_test, model))
end

cd(@__DIR__)
model, test_data = train()
test(model, test_data)