# Julia for Data Analysis

## Bogumił Kamiński

# Lecture 3. Julia's support for scaling projects

## Understanding Julia's type system

### A single function can have multiple methods

In [2]:
using Pkg
Pkg.activate(Base.current_project())

  Activating project at `~/repos/JuliaForDataAnalysis`


In [3]:
methods(cd)

### Every function has its own type that is a subtype of `Function`

In [4]:
sum isa Function

true

In [5]:
typeof(sum)

typeof(sum) (singleton type of function sum, subtype of Function)

In [6]:
typeof(sum) == Function

false
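
The three results above can be read together: the type of a function is a subtype of `Function`, but it is never `Function` itself. A minimal sketch of this relationship follows; the comparison with `max` is an illustrative choice and not part of the lecture.

```julia
# typeof(sum) is a concrete singleton type; Function is its abstract supertype.
println(typeof(sum) <: Function)      # subtype check
println(typeof(sum) == typeof(max))   # two different functions have different types
println(isabstracttype(Function))     # Function itself is an abstract type
```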

### Types in Julia are arranged in a hierarchy

In [7]:
supertype(typeof(sum))

Function

In [8]:
function print_supertypes(T)
  println(T)
  T == Any || print_supertypes(supertype(T))
  return nothing
end

print_supertypes (generic function with 1 method)

In [9]:
print_supertypes(Int64)

Int64
Signed
Integer
Real
Number
Any


In [10]:
function print_subtypes(T, indent_level=0)
  println(" "^indent_level, T)
  for S in subtypes(T)
    print_subtypes(S, indent_level + 2)
  end
  return nothing
end

print_subtypes (generic function with 2 methods)

In [11]:
print_subtypes(Integer)

Integer
  Bool
  Signed
    BigInt
    Int128
    Int16
    Int32
    Int64
    Int8
  Unsigned
    UInt128
    UInt16
    UInt32
    UInt64
    UInt8
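
The two helper functions above walk the hierarchy by hand. As a hedged sketch, the same information can be queried with built-in tools; this assumes a Julia version that provides `supertypes` (1.5 or later), and `using InteractiveUtils` is only needed outside the REPL or a notebook.

```julia
using InteractiveUtils  # provides subtypes and supertypes outside the REPL

# The full chain of supertypes, from the type itself up to Any.
println(supertypes(Int64))

# Leaves of the tree are concrete types; inner nodes are abstract types.
println(isconcretetype(Int64))
println(isabstracttype(Integer))

# The subtype operator checks whether one type lies below another in the tree.
println(Int64 <: Real)
println(Bool <: Signed)
```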


### Union of types

In [12]:
Union{Signed,Unsigned}

Union{Signed, Unsigned}

In [13]:
Union{String,Missing}

Union{Missing, String}
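
A `Union` is most often used as a type restriction in a method signature. The sketch below assumes a hypothetical helper named `describe_value`; it is not part of the lecture and only illustrates how a `Union{String, Missing}` argument behaves.

```julia
# Hypothetical helper: accepts either a String or missing, nothing else.
describe_value(x::Union{String, Missing}) =
    ismissing(x) ? "no value" : "text of length $(length(x))"

println(describe_value("abc"))
println(describe_value(missing))

# A type is a subtype of a Union if it is a subtype of any of its members.
println(Int64 <: Union{Signed, Unsigned})
println(Bool <: Union{Signed, Unsigned})   # Bool is an Integer, but neither Signed nor Unsigned
```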

### Finding a common supertype for values

In [14]:
print_supertypes(typeof([1.0, 2.0, 3.0]))

Vector{Float64}
DenseVector{Float64}
AbstractVector{Float64}
Any


In [15]:
print_supertypes(typeof(1:3))

UnitRange{Int64}
AbstractUnitRange{Int64}
OrdinalRange{Int64, Int64}
AbstractRange{Int64}
AbstractVector{Int64}
Any


In [16]:
AbstractVector

AbstractVector (alias for AbstractArray{T, 1} where T)

In [17]:
typejoin(typeof([1.0, 2.0, 3.0]), typeof(1:3))

AbstractVector (alias for AbstractArray{T, 1} where T)
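
The same `typejoin` function works for any pair of types, not only for containers. A short sketch with scalar types, chosen here for illustration:

```julia
# typejoin walks up the hierarchy until it finds a type covering both arguments.
println(typejoin(Int64, UInt8))     # both are integers
println(typejoin(Int64, Float64))   # both are real numbers
println(typejoin(Int64, String))    # nothing in common except Any
```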

## Multiple dispatch in Julia

### Defining methods of a function

In [16]:
fun(x) = println("unsupported type")

fun (generic function with 1 method)

In [17]:
fun(x::Number) = println("a number was passed")

fun (generic function with 2 methods)

In [18]:
fun(x::Float64) = println("a Float64 value")

fun (generic function with 3 methods)

In [19]:
methods(fun)

In [20]:
fun("hello!")

unsupported type


In [21]:
fun(1)

a number was passed


In [22]:
fun(1.0)

a Float64 value
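
The calls above show that Julia picks the most specific method whose signature matches the argument. The sketch below repeats the idea with a hypothetical function `kind` that returns strings instead of printing, which makes the selected method easy to check.

```julia
# Hypothetical function with three methods of increasing specificity.
kind(x) = "anything"
kind(x::Number) = "number"
kind(x::Integer) = "integer"

println(kind(1))      # Int64 <: Integer, so the Integer method wins over Number
println(kind(1.5))    # Float64 is a Number but not an Integer
println(kind("a"))    # falls back to the untyped method
println(length(methods(kind)))
```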


### Method ambiguity problem

In [23]:
bar(x, y) = "no numbers passed"

bar (generic function with 1 method)

In [24]:
bar(x::Number, y) = "first argument is a number"

bar (generic function with 2 methods)

In [25]:
bar(x, y::Number) = "second argument is a number"

bar (generic function with 3 methods)

In [26]:
bar("hello", "world")

"no numbers passed"

In [27]:
bar(1, "world")

"first argument is a number"

In [28]:
bar("hello", 2)

"second argument is a number"

In [29]:
bar(1, 2)

LoadError: MethodError: bar(::Int64, ::Int64) is ambiguous. Candidates:
  bar(x::Number, y) in Main at In[24]:1
  bar(x, y::Number) in Main at In[25]:1
Possible fix, define
  bar(::Number, ::Number)

In [30]:
bar(x::Number, y::Number) = "both arguments are numbers"

bar (generic function with 4 methods)

In [31]:
bar(1, 2)

"both arguments are numbers"

In [32]:
methods(bar)
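
The ambiguity and its fix can also be reproduced programmatically. The sketch below uses a hypothetical function `baz` so that it does not interfere with `bar`, and it catches the error instead of letting it stop execution.

```julia
baz(x::Number, y) = "first"
baz(x, y::Number) = "second"

# Neither method is more specific for two numbers, so the call fails.
err = try
    baz(1, 2)
catch e
    e
end
println(err isa MethodError)

# Adding a method that is more specific than both resolves the ambiguity.
baz(x::Number, y::Number) = "both"
println(baz(1, 2))
```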

### Improved implementation of Winsorized mean

In [33]:
function winsorized_mean(x::AbstractVector, k::Integer)
    k >= 0 || throw(ArgumentError("k must be non-negative"))
    length(x) > 2 * k || throw(ArgumentError("k is too large"))
    y = sort!(collect(x))
    for i in 1:k
        y[i] = y[k + 1]
        y[end - i + 1] = y[end - k]
    end
    return sum(y) / length(y)
end

winsorized_mean (generic function with 1 method)

In [34]:
winsorized_mean([8, 3, 1, 5, 7], 1)

5.0

In [35]:
winsorized_mean(1:10, 2)

5.5

In [36]:
winsorized_mean(1:10, "a")

LoadError: MethodError: no method matching winsorized_mean(::UnitRange{Int64}, ::String)
Closest candidates are:
  winsorized_mean(::AbstractVector, ::Integer) at In[33]:1

In [37]:
winsorized_mean(10, 1)

LoadError: MethodError: no method matching winsorized_mean(::Int64, ::Int64)
Closest candidates are:
  winsorized_mean(::AbstractVector, ::Integer) at In[33]:1

In [38]:
winsorized_mean(1:10, -1)

LoadError: ArgumentError: k must be non-negative

In [39]:
winsorized_mean(1:10, 5)

LoadError: ArgumentError: k is too large
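
To see where the result `5.0` for `[8, 3, 1, 5, 7]` with `k = 1` comes from, the steps of the function can be traced by hand. The sketch below is a step-by-step restatement of the same computation, written with broadcast assignment instead of the loop.

```julia
x = [8, 3, 1, 5, 7]
k = 1

y = sort(x)                    # [1, 3, 5, 7, 8]
y[1:k] .= y[k + 1]             # smallest k values replaced by the (k+1)-th smallest
y[end-k+1:end] .= y[end - k]   # largest k values replaced by the (k+1)-th largest
println(y)                     # [3, 3, 5, 7, 7]
println(sum(y) / length(y))    # 25 / 5
```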

## Working with packages and modules

### Including files

In [None]:
include("file1.jl")
include("file2.jl")
include("file3.jl")

### Defining modules

In [40]:
module ExampleModule
    function example()
        println("Hello")
    end
end # ExampleModule

Main.ExampleModule
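
A module becomes more useful once it exports names. The sketch below assumes a hypothetical module `Greeter` with one exported and one unexported function, and shows how each is reached after `using`.

```julia
module Greeter            # hypothetical module name
export greet              # names listed here are brought into scope by `using`

greet(name) = "Hello, $name"
shout(name) = uppercase(greet(name))   # not exported

end # Greeter

using .Greeter            # the leading dot refers to a module defined in the current scope

println(greet("Julia"))           # exported, so available directly
println(Greeter.shout("Julia"))   # not exported, so the qualified name is needed
```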

### Using packages

In [41]:
import Statistics
x = [1, 2, 3]

3-element Vector{Int64}:
 1
 2
 3

In [42]:
mean(x)

LoadError: UndefVarError: mean not defined

In [43]:
Statistics.mean(x)

2.0

In [44]:
using Statistics
mean(x)

2.0
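
Between `import Statistics`, which requires qualified names, and `using Statistics`, which brings in every exported name, there is a middle option: listing the names to load. A minimal sketch, using `mean` and `median` from the Statistics standard library:

```julia
using Statistics: mean, median   # bring only these two names into scope

x = [1, 2, 3]
println(mean(x))
println(median(x))
```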

### Using the StatsBase.jl package to compute Winsorized mean

In [45]:
using Statistics
using StatsBase

In [46]:
? winsor

search: winsor winsor! winsorized_mean



```
winsor(x::AbstractVector; prop=0.0, count=0)
```

Return an iterator of all elements of `x` that replaces either `count` or proportion `prop` of the highest elements with the previous-highest element and an equal number of the lowest elements with the next-lowest element.

The number of replaced elements could be smaller than specified if several elements equal the lower or upper bound.

To compute the Winsorized mean of `x` use `mean(winsor(x))`.

# Example

```julia
julia> collect(winsor([5,2,3,4,1], prop=0.2))
5-element Array{Int64,1}:
 4
 2
 3
 4
 2
```


In [47]:
mean(winsor([8, 3, 1, 5, 7], count=1))

5.0

## Using macros

### Macro syntax

In [48]:
@time 1 + 2

  0.000000 seconds


3

In [49]:
@time(1 + 2)

  0.000001 seconds


3

In [50]:
@assert 1 == 2 "1 is not equal 2"

LoadError: AssertionError: 1 is not equal 2

In [51]:
@assert(1 == 2, "1 is not equal 2")

LoadError: AssertionError: 1 is not equal 2

In [52]:
@macroexpand @assert(1 == 2, "1 is not equal 2")

:(if 1 == 2
      nothing
  else
      Base.throw(Base.AssertionError("1 is not equal 2"))
  end)
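
The expansion above shows that a macro rewrites code before it runs. As a hedged illustration of how such a macro is written, the sketch below defines a hypothetical macro `@twice`; it is not part of Base or of the lecture.

```julia
# Hypothetical macro that evaluates the expression it receives two times.
macro twice(ex)
    return quote
        $(esc(ex))
        $(esc(ex))
    end
end

counter = Ref(0)
@twice counter[] += 1
println(counter[])   # the increment ran twice

# The macro is expanded into ordinary code before it runs.
println(@macroexpand @twice f(x))
```

The call to `esc` makes the expression refer to variables at the call site, which is why `counter` is updated.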

In [53]:
@macroexpand @time 1 + 2

quote
    #= timing.jl:216 =#
    while false
        #= timing.jl:216 =#
    end
    #= timing.jl:217 =#
    local var"#39#stats" = Base.gc_num()
    #= timing.jl:218 =#
    local var"#41#elapsedtime" = Base.time_ns()
    #= timing.jl:219 =#
    local var"#42#compile_elapsedtime" = Base.cumulative_compile_time_ns_before()
    #= timing.jl:220 =#
    local var"#40#val" = $(Expr(:tryfinally, :(1 + 2), quote
    var"#41#elapsedtime" = Base.time_ns() - var"#41#elapsedtime"
    #= timing.jl:222 =#
    var"#42#compile_elapsedtime" = Base.cumulative_compile_time_ns_after() - var"#42#compile_elapsedtime"
end))
    #= timing.jl:224 =#
    local var"#43#diff" = Base.GC_Diff(Base.gc_num(), var"#39#stats")
    #= timing.jl:225 =#
    Base.time_print(var"#41#elapsedtime", (var"#43#diff").allocd, (var"#43#diff").total_time, Base.gc_alloc_count(var"#43#diff"), var"#42#compile_elapsedtime", true)
    #= tim

### Benchmarking with macros

In [54]:
using BenchmarkTools
x = rand(10^6); # use ; at the end of the expression passed in REPL to suppress printing its value to the terminal
@benchmark winsorized_mean($x, 10^5) # since x is a global variable use $x to ensure proper benchmarking of the tested code

BenchmarkTools.Trial: 90 samples with 1 evaluation.
 Range (min … max):  53.340 ms … 58.776 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     55.912 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.910 ms ±  1.484 ms  ┊ GC (mean ± σ):  0.59% ± 1.30%

 [histogram of sample times omitted]

In [55]:
using Statistics
using StatsBase
@benchmark mean(winsor($x; count=10^5))

BenchmarkTools.Trial: 347 samples with 1 evaluation.
 Range (min … max):  13.046 ms … 23.587 ms  ┊ GC (min … max): 0.00% … 15.44%
 Time  (median):     13.988 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.426 ms ±  1.228 ms  ┊ GC (mean ± σ):  2.43% ±  4.88%

 [histogram of sample times omitted]

In [56]:
@btime mean(winsor($x; count=10^5))

  12.982 ms (2 allocations: 7.63 MiB)


0.4997748954801521
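
One reason to prefer `@benchmark` and `@btime` over a single timing is that the first call of a function usually includes compilation. The sketch below uses only `@elapsed` from Base and a hypothetical function `sum_of_squares`; the measured times depend on the machine, so no specific values are shown.

```julia
sum_of_squares(v) = sum(abs2, v)   # hypothetical function to time

data = rand(10^6)

first_call = @elapsed sum_of_squares(data)    # typically includes compilation time
second_call = @elapsed sum_of_squares(data)   # already compiled

println("first call:  ", first_call, " seconds")
println("second call: ", second_call, " seconds")
```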