In [1]:
using Plots
gr()

Plots.GRBackend()

## A. Data Table choices

Machine learning is all about finding patterns in data, so it is very reasonable to start with data.

In [2]:
# Some Data (try your own)
x = [5,6.5,7,8]
y = [10.1, 19.9, 30.1, 40.3]
plot(x,y,
    label="Y", line=(7,:green), marker=(10,0.8,:red), xlims=(0,10), ylims=(0,50),
    xlabel="X",ylabel="Y")
    

### A.1. Just a matrix please. (No labels, no extras, simple.)

In [3]:
data1 = [x y]

4×2 Array{Float64,2}:
 5.0  10.1
 6.5  19.9
 7.0  30.1
 8.0  40.3
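
A plain matrix carries no column names, so columns and rows are addressed by position. A minimal sketch, assuming Julia 1.x (the names `xcol`, `ycol`, and `second_row` are illustrative only):

```julia
x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]
data1 = [x y]               # horizontal concatenation gives a 4×2 matrix

xcol = data1[:, 1]          # first column (a copy of the x values)
ycol = data1[:, 2]          # second column (a copy of the y values)
second_row = data1[2, :]    # one observation as a 2-element vector
nrows, ncols = size(data1)  # number of observations and variables
```

The price of this simplicity is that the reader has to remember which column is which.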

### A.2. Data Frames: Inspired by the R universe.

In [5]:
using DataFrames
data2 = DataFrame(X=x,Y=y) # Upper Case X and Y are labels (not data)

| Row | X   | Y    |
|-----|-----|------|
| 1   | 5.0 | 10.1 |
| 2   | 6.5 | 19.9 |
| 3   | 7.0 | 30.1 |
| 4   | 8.0 | 40.3 |


In [6]:
data2[1]

4-element DataArrays.DataArray{Float64,1}:
 5.0
 6.5
 7.0
 8.0
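
Single-integer indexing such as `data2[1]` belongs to the older DataFrames release used in this session. As a hedged sketch, assuming DataFrames 1.x is installed, a column is usually selected by name or with an explicit row selector (the variable names below are illustrative):

```julia
using DataFrames

x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]
data2 = DataFrame(X = x, Y = y)

col_by_name = data2.X        # the stored column vector itself, no copy
col_no_copy = data2[!, :X]   # same vector; `!` asks for the column without copying
col_copy    = data2[:, 1]    # a fresh copy of the first column
first_row   = data2[1, :]    # a row object for the first observation
```

Choosing between `!` and `:` matters mostly when the column is mutated afterwards: changes through `col_no_copy` show up in `data2`, changes to `col_copy` do not.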

In [7]:
using CSV
CSV.write("data.csv", data2)



CSV.Sink(    CSV.Options:
        delim: ','
        quotechar: '"'
        escapechar: '\\'
        null: ""
        dateformat: dateformat"yyyy-mm-dd", IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1), "data.csv", 8, true, String["X", "Y"], false)

In [8]:
;cat data.csv

"X","Y"
5.0,10.1
6.5,19.9
7.0,30.1
8.0,40.3


### A.3. Indexed Tables (Treat data like array indices, knows type information)

In [9]:
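# NOTE: this imports only the name `Table`, so `Columns` is not in scope (see the error below).
# A plain `using IndexedTables` would likely be needed instead.
using  IndexedTables.Table
data3 = Table(Columns(X=x),Columns(Y=y))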

LoadError: UndefVarError: Columns not defined

In [10]:
data3[6.5]

LoadError: UndefVarError: data3 not defined

In [11]:
typeof.([data1,data2,data3])

LoadError: UndefVarError: data3 not defined

### A.4. JuliaDB (Lots of bells and whistles, many files, parallelism, ...)

In [12]:
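# NOTE: only `DTable` is imported here, so `distribute`, `loadfiles`, and the other
# JuliaDB functions used below are not in scope (hence the errors that follow).
using JuliaDB:DTable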

In [13]:
data4 = distribute(data3, 1) 

LoadError: UndefVarError: distribute not defined

In [14]:
data5 = loadfiles(["data.csv"]) 

LoadError: UndefVarError: loadfiles not defined

In [15]:
typeof(data4)

LoadError: UndefVarError: data4 not defined

In [16]:
data4[1:2]

LoadError: UndefVarError: data4 not defined

In [17]:
select(data4,1=>i->i≥2) 

LoadError: UndefVarError: data4 not defined

In [18]:
filter(t->(t[1]<8 && t[2]>11),data4)

LoadError: UndefVarError: data4 not defined

### A.5 IterableTables

In [19]:
using IterableTables, DataTables, TypedTables # haven't investigated  much but looks very nice


Use "Cell{Name,ElType}(...) where {Name,ElType}" instead.

Use "Column{Name,StorageType}(...) where {Name,StorageType}" instead.

Use "Row{Names,Types}(...) where {Names,Types}" instead.

Use "Table{Names,StorageTypes}(...) where {Names,StorageTypes}" instead.
  likely near /Users/edelman/.julia/v0.6/TypedTables/src/show.jl:351

## B. Simple Line Fitting

[So why is it called "Regression" anyway?](http://blog.minitab.com/blog/statistics-and-quality-data-analysis/so-why-is-it-called-regression-anyway) Galton's original meaning is not quite what the term means today.

### B.1. Linear Regression function

In [20]:
b, w =  linreg(x,y)

(-42.45733333333333, 10.197333333333333)
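
`linreg` was part of Base in the Julia version used for this session and is no longer available in Julia 1.x. A minimal stand-in, assuming Julia 1.x and only the `Statistics` standard library (the name `linreg_compat` is hypothetical, not a library function):

```julia
using Statistics

# Hypothetical replacement: returns (intercept, slope), the same order as the old linreg.
function linreg_compat(x::AbstractVector, y::AbstractVector)
    w = cov(x, y) / var(x)       # slope
    b = mean(y) - w * mean(x)    # intercept
    return b, w
end

x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]
b, w = linreg_compat(x, y)
```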

In [21]:
plot()
plot(x,y,
    label="Y", line=(4,:blue), marker=(7,0.8,:blue), xlims=(0,10), ylims=(0,50),
    xlabel="X",ylabel="Y")
plot!(x->w*x+b,xlims=(minimum(x)-.5,maximum(x)+.5), line=(4,:red), label="best fit line")
plot!(x->w*x+b, x ,marker=(7,0.8,:red), label="" )
for i = 1:length(x)
    plot!([x[i],x[i]],[y[i],w*x[i]+b],line=(7,:black))
end
plot!(legend=:topleft)

Mathematically equivalent approaches:

### B.2. Linear Algebra Least Squares

In [22]:
[ones(x) x]\y 

2-element Array{Float64,1}:
 -42.4573
  10.1973

In [23]:
A = [ones(x) x]
(A'A)\A'y  # normal equations usually not recommended

2-element Array{Float64,1}:
 -42.4573
  10.1973
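
For reference, the normal equations used in this cell follow from setting the gradient of the squared error to zero. The notation $\beta = (b, w)^T$ and $A = [\mathbf{1}\;\; x]$ is introduced here only for the derivation:

$$
\begin{aligned}
\min_{\beta}\ \|A\beta - y\|_2^2
\;&\Longrightarrow\; 2A^T(A\beta - y) = 0
\;\Longrightarrow\; A^TA\,\beta = A^Ty, \\
A^TA &= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix},
\qquad
A^Ty = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}.
\end{aligned}
$$

Forming $A^TA$ squares the condition number of $A$, which is the usual reason the backslash and QR routes are preferred numerically.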

In [24]:
q,r=qr(A)
r\(q'y)

2-element Array{Float64,1}:
 -42.4573
  10.1973

In [25]:
[length(x) sum(x); sum(x) x⋅x]\[sum(y);x⋅y]

2-element Array{Float64,1}:
 -42.4573
  10.1973
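
The cells above rely on Julia 0.6 idioms: `ones(x)` with an array argument and the tuple-returning `qr` are not available in the same form in Julia 1.x. A sketch of the same four computations, assuming Julia 1.x with the `LinearAlgebra` standard library (the `β_...` names are illustrative):

```julia
using LinearAlgebra

x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]

A = [ones(length(x)) x]            # design matrix: a column of ones, then x

β_backslash = A \ y                # least squares via backslash
β_normal    = (A'A) \ (A'y)        # normal equations, usually not recommended

F = qr(A)                          # qr returns a factorization object
β_qr = F.R \ (Matrix(F.Q)' * y)    # thin Q is 4×2, R is 2×2

# the 2×2 normal equations written out with sums and dot products
β_sums = [length(x) sum(x); sum(x) x⋅x] \ [sum(y); x⋅y]
```

All four vectors hold the intercept first and the slope second.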

### B.3. Basic Formula

In [26]:
w = cov(x,y)/var(x) # same as (x.-mean(x))⋅(y.-mean(y))/sum(abs2,x.-mean(x))
b = mean(y)-w*mean(x)
b,w

(-42.45733333333333, 10.197333333333333)
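
Plugging the four data points into this formula by hand reproduces the numbers above. The $n-1$ factors in `cov` and `var` cancel, so only the raw sums of deviations matter:

$$
\begin{aligned}
\bar x &= \tfrac{26.5}{4} = 6.625, \qquad \bar y = \tfrac{100.4}{4} = 25.1, \\
\sum_i (x_i-\bar x)^2 &= 2.640625 + 0.015625 + 0.140625 + 1.890625 = 4.6875, \\
\sum_i (x_i-\bar x)(y_i-\bar y) &= 24.375 + 0.65 + 1.875 + 20.9 = 47.8, \\
w &= \frac{47.8}{4.6875} \approx 10.1973, \qquad
b = \bar y - w\,\bar x \approx 25.1 - 67.5573 = -42.4573.
\end{aligned}
$$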

In [27]:
@which linreg(x,y)

### B.4. Optimization (think machine learning) via the package Optim.jl

In [28]:
using Optim
loss(wb) = sum(abs2,wb[1]*x.+wb[2]-y) # uglyish
optimize(loss,[0.0,0.0]).minimizer

2-element Array{Float64,1}:
  10.1973
 -42.4573
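
Note that the minimizer comes back in the order `[w, b]`, matching how `loss` indexes `wb`, whereas `linreg` returned `(b, w)`. As a sanity check on any such minimizer, the gradient of the loss should vanish there. A minimal sketch, assuming Julia 1.x with only `LinearAlgebra`; the helper `loss_gradient` is hypothetical and not part of Optim:

```julia
using LinearAlgebra

x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]

loss(wb) = sum(abs2, wb[1] .* x .+ wb[2] .- y)

# Analytic gradient of the squared-error loss with respect to (w, b).
function loss_gradient(wb)
    r = wb[1] .* x .+ wb[2] .- y         # residuals at the current (w, b)
    return [2 * dot(r, x), 2 * sum(r)]
end

wb_star = reverse([ones(length(x)) x] \ y)   # least-squares solution reordered as [w, b]
g = loss_gradient(wb_star)                   # should be numerically close to [0, 0]
```

A derivative-free search such as the one above only approximates this point, so the gradient at its reported minimizer is small rather than exactly zero.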

### B.5. Optimization with the package JuMP (probably can be simpler)

In [29]:
Pkg.add("Ipopt")

INFO: Package Ipopt is already installed
INFO: METADATA is out-of-date — you may not have the latest version of Ipopt
INFO: Use `Pkg.update()` to get the latest versions of your packages

In [30]:
using JuMP, Ipopt
loss(w,b) = sum(abs2,w*x.+b-y)  # not ugly
m = Model(solver=IpoptSolver(print_level=0))
@variable(m,w)
@variable(m,b)
@NLobjective(m, Min, (w*x[1]+b-y[1])^2 + (w*x[2]+b-y[2])^2 + (w*x[3]+b-y[3])^2  + (w*x[4]+b-y[4])^2)
solve(m)
 println("w = ", getvalue(w), " b = ", getvalue(b))


******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

w = 10.19733333333333 b = -42.4573333333333
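
The heading suggests this can probably be simpler. A hedged sketch of the same model, assuming a recent JuMP release (1.x) with Ipopt installed, where `Model(Ipopt.Optimizer)`, `optimize!`, and `value` take the place of the older `solver=`, `solve`, and `getvalue` calls, and the objective is written as a generator sum rather than four hand-typed terms:

```julia
using JuMP, Ipopt

x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]

m = Model(Ipopt.Optimizer)
set_silent(m)                 # suppress the solver's iteration log
@variable(m, w)
@variable(m, b)
# sum of squared residuals, one term per data point
@objective(m, Min, sum((w * x[i] + b - y[i])^2 for i in eachindex(x)))
optimize!(m)
println("w = ", value(w), " b = ", value(b))
```

Because the objective is a plain quadratic, the generator form scales to any number of data points without editing the model.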


### B.6. Generalized Linear Models

The very fancy statistical thing.

In [31]:
using GLM # Generalized Linear Models

In [33]:
lm(@formula(Y~X), data2)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
             Estimate Std.Error  t value Pr(>|t|)
(Intercept)  -42.4573    9.9622 -4.26184   0.0509
X             10.1973   1.48405   6.8713   0.0205


The Estimate column above gives b (the intercept) and w (the slope).
We assume that X is known without error, that b, w, and σ are unknown, and that
the real Y is distributed like b+w*X+σ*randn(),
so the Y values we have are samples from this distribution.

Under these assumptions, if we drew new samples and refit many times, the fitted b and w would be normally distributed, with standard deviations estimated by the Std.Error column.
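
The repeated-fitting thought experiment can be simulated directly. The sketch below assumes Julia 1.x with the `Random`, `Statistics`, and `LinearAlgebra` standard libraries. The "true" values `b_true`, `w_true`, and `σ_true` are illustrative choices, not quantities given in the text:

```julia
using Random, Statistics, LinearAlgebra

x = [5, 6.5, 7, 8]
A = [ones(length(x)) x]

# Assumed "true" parameters, chosen for illustration only.
b_true, w_true, σ_true = -42.0, 10.0, 3.0

rng = MersenneTwister(1)
ntrials = 20_000
fits = zeros(2, ntrials)
for k in 1:ntrials
    # draw a fresh sample of Y from the assumed model and refit
    y_sim = b_true .+ w_true .* x .+ σ_true .* randn(rng, length(x))
    fits[:, k] = A \ y_sim
end

empirical_sd   = vec(std(fits, dims = 2))               # spread of (b, w) across fits
theoretical_sd = σ_true .* sqrt.(diag(inv(A' * A)))     # σ * sqrt(diag((AᵀA)⁻¹))
```

With σ known, the spread of the fitted coefficients is σ times the square root of the diagonal of (AᵀA)⁻¹. A regression table replaces σ with an estimate computed from the residuals, since the true value is not available in practice.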

The third column is just the ratio of column 1 to column 2, which puts each estimate on a standard scale. Under the null hypothesis that the true coefficient is 0, this ratio follows a Student t distribution with n−2 degrees of freedom (here 2), which approaches a standard normal as n grows.

When the probability column is less than .05, we can reject the hypothesis that the intercept/slope is 0 at the 5 percent significance level. What does this mean? It means we feel pretty good that the intercept or slope is really nonzero. If the probability is higher than .05, we cannot reject the null hypothesis, meaning that 0 remains a plausible value for the intercept/slope. In the table above the slope clears this bar (0.0205), while the intercept narrowly misses it (0.0509). In particular, a 0 slope says that the dependent variable is not really statistically dependent after all.
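
The four table columns can be reproduced without GLM. The following sketch assumes Julia 1.x and `LinearAlgebra` only, with illustrative variable names. The p-value line uses the closed-form CDF of the Student t distribution with exactly 2 degrees of freedom, which applies here only because n − 2 = 2; other sample sizes would need a statistics package.

```julia
using LinearAlgebra

x = [5, 6.5, 7, 8]
y = [10.1, 19.9, 30.1, 40.3]

A = [ones(length(x)) x]
coef = A \ y                                   # column 1: estimates (b, w)
resid = y - A * coef
n, p = size(A)
sigma2_hat = sum(abs2, resid) / (n - p)        # residual variance estimate
se = sqrt.(diag(sigma2_hat * inv(A' * A)))     # column 2: standard errors
tvals = coef ./ se                             # column 3: t values

# Column 4: two-sided p-value, P(|T| > |t|) = 1 - |t| / sqrt(2 + t^2).
# This closed form is valid for 2 degrees of freedom only.
pvals = 1 .- abs.(tvals) ./ sqrt.(2 .+ tvals .^ 2)
```

Comparing `se`, `tvals`, and `pvals` with the GLM output is a quick way to confirm what each column means.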

## C. Stochastic Gradient Descent