In [None]:
using Plots
gr()

## A. Data Table choices

Machine learning is all about finding patterns in data, so it is reasonable to start with the data.

In [None]:
# Some Data (try your own)
x1 = [2,6.5,7,2]
x2 = [3,3.5,4,6]
y  = [10.1, 19.9, 30.1, 40.3]
pets = ["Cat","Dog","Bird","Hamster"]
plot(x1,y,
    label="X1 and Y", line=(7,:green), marker=(10,0.8,:red), xlims=(0,10), ylims=(0,50),
    xlabel="X",ylabel="Y")
plot!(x2,y,
    label="X2 and Y", line=(7,:blue), marker=(10,0.8,:red))
plot!(legend=:topleft)

In [3]:
plot(x1,x2,y,marker=(10,0.8,:red),st=:surface)
plot!(x1,x2,y,marker=(10,0.8,:red))
plot!(x1,zeros(x2),y,line=(7,:green),marker=(10,0.8,:red),zlim=(0,50))
plot!(zeros(x1),x2,y,line=(7,:blue),marker=(10,0.8,:red))

### A.1. Just a matrix please. (No labels, no extras, simple.)

In [4]:
data1 = [x1 x2 y pets]

4×4 Array{Any,2}:
 2.0  3.0  10.1  "Cat"    
 6.5  3.5  19.9  "Dog"    
 7.0  4.0  30.1  "Bird"   
 2.0  6.0  40.3  "Hamster"

### A.2. Data Frames: Inspired by the R universe.

In [5]:
using DataFrames
data2 = DataFrame(X1=x1,X2=x2,Y=y,Pets=pets) #  X1,X2,Y,Pets are labels (not data)

| Row | X1 | X2 | Y | Pets |
|---|---|---|---|---|
| 1 | 2.0 | 3.0 | 10.1 | Cat |
| 2 | 6.5 | 3.5 | 19.9 | Dog |
| 3 | 7.0 | 4.0 | 30.1 | Bird |
| 4 | 2.0 | 6.0 | 40.3 | Hamster |


In [6]:
data2[[2,4]]

| Row | X2 | Pets |
|---|---|---|
| 1 | 3.0 | Cat |
| 2 | 3.5 | Dog |
| 3 | 4.0 | Bird |
| 4 | 6.0 | Hamster |


In [None]:
#Pkg.add("CSV")
using CSV
CSV.write("data.csv", data2)

In [None]:
;cat data.csv

### A.3. Indexed Tables (Treat data like array indices, knows type information)

In [None]:
# Pkg.add("IndexedTables")
using  IndexedTables.Table
using IndexedTables
data3 = Table(Columns(X1=x1,X2=x2),Columns(Y=y,Pets=pets))

In [None]:
data3[6.5,3.5]

In [None]:
typeof.([data1,data2,data3])

### A.4. JuliaDB (Lots of bells and whistles, many files, parallelism, ...)

In [None]:
#Pkg.add("JuliaDB")
using JuliaDB:DTable
using JuliaDB

In [None]:
data4 = distribute(data3, 1) 

In [None]:
data5 = loadfiles(["data.csv"]) 

In [None]:
typeof(data4)

In [None]:
data4[1:2]

In [None]:
select(data4,1=>i->i≥7) 

In [None]:
filter(t->(t[1]>30),data4) 

### A.5. IterableTables

In [None]:

#using IterableTables, DataTables, TypedTables # not investigated much yet, but these look very nice

## B. Simple Plane Fitting

[So why is it called "Regression" anyway?](http://blog.minitab.com/blog/statistics-and-quality-data-analysis/so-why-is-it-called-regression-anyway) Galton's original meaning is not quite what the word means today.

### B.1. Linear Regression function

In [None]:
x1 = [2, 6.5, 7, 2]
x2 = [3, 3.5, 4, 6]
y  = [10.1,  19.9,  30.1 , 40.3]
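# b, w = linreg(x,y)
# Base's linreg fits a single predictor only, so for two predictors
# solve the least-squares problem A*c ≈ y directly.
A = [ones(length(x1)) x1 x2]   # columns: intercept, x1, x2
c = A\y                        # least-squares solution [b; w1; w2]
b = c[1]
w = c[2:3]
b,w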

In [None]:
@which linreg(x1[:],y[:])

In [None]:
#Pkg.add("PlotUtils") 
using PlotUtils

In [None]:
#gr()
glvisualize()
x1 = [2 6.5 7 2]
x2 = [3 3.5 4 6]
y  = [10.1  19.9  30.1  40.3]
b = -24.70401416765054
w = [1.63684, 10.3377]
plot(x1,x2,y,marker=(10,.8,:red))
plot!(0:.1:10, 0:.1:10, (x,y)->b+w[1]*x+w[2]*y,color=:bone,st=:surface)
gui()

The sections below are mathematically equivalent approaches.

### B.2. Linear Algebra Least Squares

In [None]:
q,r = qr(A)
r\(q'y)

### B.3. Basic Formula
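
A minimal sketch of the basic formula, assuming it refers to the normal equations $(A^T A)\,c = A^T y$ built from the same design matrix `A` as in B.1. The name `c` is introduced here for illustration, and the data is repeated so the cell runs on its own.

```julia
# Data repeated from B.1 so that this cell is self-contained.
x1 = [2, 6.5, 7, 2]
x2 = [3, 3.5, 4, 6]
y  = [10.1, 19.9, 30.1, 40.3]

# Design matrix: a column of ones for the intercept, then the two predictors.
A = [ones(length(x1)) x1 x2]

# Normal equations: solve (A'A) c = A'y for c = [b; w1; w2].
c = (A' * A) \ (A' * y)
b, w = c[1], c[2:3]
```

The result should agree with the other approaches in this section, which report b ≈ -24.704 and w ≈ [1.63684, 10.3377]. For larger or ill-conditioned problems, `A\y` or the QR route is usually preferred because forming `A'A` squares the condition number.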

### B.4. Optimization (think machine learning) via the package Optim.jl

In [None]:
using Optim   # Julia all the way down
loss(bw) = sum(abs2, bw[1] .+ x1*bw[2] .+ x2*bw[3] .- y)  # sum of squared residuals
optimize(loss,[0.0,0.0,0.0]).minimizer

### B.5. Optimization with the package JuMP

Note that not every Julia function can appear inside `@objective` or `@NLobjective`, although supporting arbitrary functions would be the goal. See the [linear and quadratic objective JuMP notes](http://www.juliaopt.org/JuMP.jl/0.18/refexpr.html) and the [nonlinear JuMP notes](http://www.juliaopt.org/JuMP.jl/0.18/nlp.html#syntax-notes).

In [None]:
#Pkg.add("Ipopt")

In [11]:
using JuMP, Ipopt
n = length(x1)
m = Model(solver=IpoptSolver(print_level=1))
@variable(m,w[1:3])
@objective(m, Min, sum((w[1] + w[2]*x1[i]+ w[3]*x2[i]-y[i])^2 for i in 1:n))
solve(m)
println( "w=",getvalue(w))


******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

w=[-24.704, 1.63684, 10.3377]


### B.6. Generalized Linear Models

The very fancy statistical thing.

In [8]:
#Pkg.add("GLM")
using GLM # Generalized Linear Models

In [10]:
lm(@formula(Y~X1+X2), data2)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X1 + X2

Coefficients:
             Estimate Std.Error  t value Pr(>|t|)
(Intercept)   -24.704   7.45481 -3.31383   0.1866
X1            1.63684  0.673333  2.43095   0.2484
X2            10.3377   1.40813  7.34139   0.0862


The Estimate column above reproduces b (the intercept) and w (the X1 and X2 coefficients).
We assume at the start that X is known without error, that b, w, and σ are unknown, that
the real Y is distributed like `b + w*X + σ*randn()`,
and that the Y we have are samples from this distribution.

Under these assumptions, if we repeated the fit on many fresh samples, the estimated b and w would be normally distributed, with the standard deviations given in the Std.Error column.

The third column is just the ratio of column 1 to column 2, thus normalizing the situation to a standard normal (strictly, a Student t distribution, because σ is itself estimated from the data).

When the probability column is less than .05, we can reject the hypothesis that the intercept/slope is 0 at the 5 percent significance level. What does this mean? It means we feel pretty good about our intercept and slope. If the probability is higher than .05, we cannot reject the null hypothesis, meaning that 0 could plausibly have been the true intercept/slope. In particular, a slope of 0 says that the dependent variable is not really statistically dependent after all.
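
As a check on the table, the t value and Pr(>|t|) columns can be recomputed by hand. This is a sketch under the assumption that the p-values are two-sided and come from a Student t distribution with 4 - 3 = 1 residual degree of freedom, which coincides with the standard Cauchy distribution. The names `est`, `se`, `tvalue`, and `pvalue` are introduced here for illustration.

```julia
# Values copied from the Estimate and Std.Error columns of the table above.
est = [-24.704, 1.63684, 10.3377]
se  = [7.45481, 0.673333, 1.40813]

# Third column: each estimate divided by its standard error.
tvalue = est ./ se

# Assumption: 4 observations minus 3 fitted coefficients leaves 1 degree of freedom.
# A t distribution with 1 degree of freedom is the standard Cauchy, whose CDF is
# 1/2 + atan(t)/π, so the two-sided p-value has a closed form.
pvalue = 1 .- (2 / π) .* atan.(abs.(tvalue))

# True where the hypothesis "coefficient is 0" is rejected at the 5 percent level.
reject_at_5_percent = pvalue .< 0.05
```

None of the three p-values falls below .05, so with only four data points none of the coefficients is significant at the 5 percent level.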

## C. Stochastic Gradient Descent

In [None]:
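x = x1  # assumption: x is never defined in this notebook; use the first feature as the single predictor
loss(w,b,i) = (w*x[i]+b-y[i])^2             # loss due to point i
Dloss(w,b,i) = 2*(w*x[i]+b-y[i])*[x[i];1]   # gradient of that loss with respect to (w, b)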
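
The gradient returned by `Dloss` follows from the chain rule applied to the squared residual of point $i$. Writing $r_i = w x_i + b - y_i$ (a symbol introduced here for illustration), the loss is $r_i^2$ and

$$
\nabla_{(w,b)}\, r_i^2
= 2\, r_i \begin{bmatrix} \partial r_i / \partial w \\ \partial r_i / \partial b \end{bmatrix}
= 2\,(w x_i + b - y_i) \begin{bmatrix} x_i \\ 1 \end{bmatrix}.
$$

The first component is the derivative with respect to $w$ and the second with respect to $b$, which is the order the update loop relies on when it reads `d[1]` and `d[2]`.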

In [None]:
w,b = 0.0, 0.0
for t=1:100000
    η = .002  # there seems to be an art to picking these steplengths
    i = rand(1:4)
    d = Dloss(w,b,i)
    w -= η * d[1]
    b -= η * d[2]  
end
println(b," ",w)

## D. Knet

In [None]:
#Pkg.add("Knet")
using Knet

In [None]:
predict(w,x) = w[2]*x .+ w[1]
loss(w,x,y) = sum(abs2, y - predict(w,x)) 

In [None]:
lossgradient = grad(loss)

In [None]:
function train(w, data; lr=.1)
    p=1
    for (x,y) in data
        println("This is pass $p")
        p+=1
        dw = lossgradient(w, x, y)
        for i in 1:length(w)
            w[i] -= lr * dw[i]
        end
    end
    return w
end

In [None]:
train([0.0,0.0],zip(x,y),lr=.01) # not enough data

In [None]:
data = [(x[i],y[i]) for i=1:4]

In [None]:
function train2(w, data; lr=.1)
    for t in 1:10000
        (x,y) = data[rand(1:4)]
        dw = lossgradient(w, x, y)
        for i in 1:length(w)
            w[i] -= lr * dw[i]
        end
    end
    return w
end

In [None]:
train2([0.0;0.0],data,lr=.01) 

## E. TensorFlow

In [None]:
#Pkg.add("TensorFlow")
using TensorFlow
session = Session()

In [None]:
W = TensorFlow.Variable(randn())
b = TensorFlow.Variable(randn())

In [None]:
X = placeholder(Float32)
Y = multiply(X,W).+b
Y_obs = placeholder(Float32)

In [None]:
Loss=sum( (Y.-Y_obs).^2 )

In [None]:
optimizer = TensorFlow.train.GradientDescentOptimizer(1e-3)
minimizer = TensorFlow.train.minimize(optimizer, Loss)

In [None]:
run(session, global_variables_initializer())
for i in 1:20000
    run(session, minimizer, Dict(X=>x, Y_obs=>y))
end

In [None]:
run(session, [b, W])