# Julia Demo

This is an adapted version of the tutorial at juliabox.com, chapter 2, DataSciences - Algorithms.

Let's run a simple benchmark comparing Python and Julia on linear regression.

In [None]:
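using Plots
# x values from 1 to 10 in steps of 0.5, each repeated twice
xvals = repeat(1:0.5:10, inner=2)
# y = 3 + x plus uniform noise in [-1, 1)
yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .- 1
scatter(xvals, yvals, color=:black, leg=false)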

Now we want to fit a line through these points, which is linear regression. Let's write a simple function in Julia:

In [None]:
using Statistics
function find_best_fit(xvals, yvals)
    meanx = mean(xvals)
    meany = mean(yvals)
    stdx = std(xvals)
    stdy = std(yvals)
    r = cor(xvals, yvals)
    a = r*stdy/stdx        # slope
    b = meany - a*meanx    # intercept
    return a, b
end
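
The closed form used in `find_best_fit` (slope `a = r*stdy/stdx`, intercept `b = meany - a*meanx`) is the ordinary least-squares solution for a single predictor. As a quick sanity check, the sketch below compares it with Julia's backslash least-squares solve on a small made-up data set. The names `xs`, `ys`, and `coef` are illustrative assumptions and not part of the original tutorial.

```julia
using Statistics
using LinearAlgebra

# Small made-up sample, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Closed-form slope and intercept, as in find_best_fit
a = cor(xs, ys) * std(ys) / std(xs)
b = mean(ys) - a * mean(xs)

# Least-squares solve of [1 x] * [b, a] ≈ y with the backslash operator
coef = [ones(length(xs)) xs] \ ys

# Both comparisons should print true if the two approaches agree
println(isapprox(coef[1], b), " ", isapprox(coef[2], a))
```

The backslash solve generalizes to several predictors, while the closed form only covers the single-predictor case used here.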

In [None]:
a,b = find_best_fit(xvals,yvals)
ynew = a*xvals .+ b

In [None]:
plot!(xvals,ynew)

Now with more data!

In [None]:
xvals = 1:100000;
xvals = repeat(xvals,inner=3);
yvals = 3 .+ xvals + 2 .* rand(length(xvals)).-1;

In [None]:
@time a,b = find_best_fit(xvals,yvals)
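
One caveat when reading a single `@time` result: the first call of a Julia function with new argument types also includes just-in-time compilation, so it usually takes longer than later calls. The sketch below illustrates this with a hypothetical helper, `sum_of_squares`, which is not part of the original tutorial. The actual timings depend on the machine.

```julia
# Hypothetical helper, used only to illustrate first-call overhead
sum_of_squares(v) = sum(abs2, v)

data = rand(1_000)

t_first = @elapsed sum_of_squares(data)    # includes JIT compilation
t_second = @elapsed sum_of_squares(data)   # runs already-compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```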

In [None]:
using PyCall
using Conda

In [None]:
py"""
import numpy
def find_best_fit_python(xvals,yvals):
    meanx = numpy.mean(xvals)
    meany = numpy.mean(yvals)
    stdx = numpy.std(xvals)
    stdy = numpy.std(yvals)
    r = numpy.corrcoef(xvals,yvals)[0][1]
    a = r*stdy/stdx
    b = meany - a*meanx
    return a,b
"""
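
A detail worth checking when comparing the two versions: `numpy.std` defaults to the population standard deviation (`ddof=0`), while Julia's `std` defaults to the sample standard deviation (`corrected=true`). Because the slope only uses the ratio `stdy/stdx`, the two conventions should give the same result. The sketch below checks this on made-up data; `xs` and `ys` are illustrative names.

```julia
using Statistics

# Small made-up sample, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Julia default: sample standard deviation (divides by n - 1)
ratio_sample = std(ys) / std(xs)

# NumPy default convention: population standard deviation (divides by n)
ratio_population = std(ys; corrected=false) / std(xs; corrected=false)

println(ratio_sample ≈ ratio_population)
```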

In [None]:
find_best_fit_python = py"find_best_fit_python"

In [None]:
xpy = PyObject(xvals)
ypy = PyObject(yvals)
@time a,b = find_best_fit_python(xpy,ypy)

In [None]:
using BenchmarkTools

In [None]:
# interpolate globals with $ so @btime does not measure global-variable access
@btime find_best_fit_python($xvals, $yvals)

In [None]:
# interpolate globals with $ so @btime does not measure global-variable access
@btime find_best_fit($xvals, $yvals)

## Data Processing

* Let's download some data and do some work on it.

In [None]:
using DataFrames
using CSV
download("http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv","houses.csv")
# recent CSV.jl versions require the sink type as the second argument
houses = CSV.read("houses.csv", DataFrame)

In [None]:
using StatsPlots  # the package formerly named StatPlots
@df houses scatter(:sq__ft,:price,markersize=3,xlab="square feet",ylab="price")

What's with those houses at zero size and positive prices? Those must be data errors.

In [None]:
using Query
# x = @from i in houses begin
#     @where i.sq__ft > 0
#     @select {i.sq__ft,i.price}
#     @collect DataFrame
# end
# @df x scatter(:sq__ft,:price,markersize=3,xlab="square feet",ylab="price")
# even better: in a pipeline!
houses |>
    @filter(_.sq__ft > 0) |>
    @df scatter(:sq__ft,:price,markersize=3,xlab="square feet",ylab="price")
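
The same row filter can also be written with DataFrames alone, without Query. The sketch below uses a small made-up table, `toy`, in place of the downloaded file; only the column names `sq__ft` and `price` come from the data above, and the values are invented for illustration.

```julia
using DataFrames

# Made-up rows standing in for the real file; values are illustrative only
toy = DataFrame(sq__ft = [800, 0, 1200, 0],
                price  = [60000, 65000, 70000, 72000])

# Keep only rows whose sq__ft is strictly positive
cleaned = filter(:sq__ft => >(0), toy)

println(nrow(cleaned), " rows kept out of ", nrow(toy))
```

The filtered table could then be passed to `@df` in the same way as `houses`.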