# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 25, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves.

In [1]:
using DataFrames

## Extras - selected functionalities of selected packages

### FreqTables: creating categorical tables

In [2]:
using FreqTables
df = DataFrame(a=rand('a':'d', 1000), b=rand(["x", "y", "z"], 1000))
ft = freqtable(df, :a, :b) # observe that dimensions are sorted if possible

4×3 Named Array{Int64,2}
a ╲ b │   x    y    z
──────┼──────────────
'a'   │  86   93   82
'b'   │  72   75   78
'c'   │  82  103   86
'd'   │  73   77   93

In [3]:
ft[1,1], ft['b', "z"] # you can index the result using numbers or names

(86, 78)

In [4]:
prop(ft, 1) # getting proportions - 1 means we want to calculate them in rows (first dimension)

4×3 Named Array{Float64,2}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.329502  0.356322  0.314176
'b'   │     0.32  0.333333  0.346667
'c'   │ 0.302583  0.380074  0.317343
'd'   │ 0.300412  0.316872  0.382716

In [5]:
prop(ft, 2) # and columns are normalized to 1.0 now

4×3 Named Array{Float64,2}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │  0.27476  0.267241  0.241888
'b'   │ 0.230032  0.215517  0.230088
'c'   │ 0.261981  0.295977  0.253687
'd'   │ 0.233227  0.221264  0.274336
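
To see what `prop` computes, here is a minimal sketch in plain Julia that reproduces both normalizations from the counts printed for `ft` above. It assumes Julia 1.x (the `dims` keyword), which may differ from the version this notebook was run on, and it does not use FreqTables; the names `counts`, `row_props`, and `col_props` are illustrative only.

```julia
# Counts copied from the printed frequency table above (rows 'a'..'d', columns x, y, z).
counts = [86 93 82;
          72 75 78;
          82 103 86;
          73 77 93]

# Divide each row by its row total: every row then sums to 1 (compare prop(ft, 1)).
row_props = counts ./ sum(counts, dims=2)

# Divide each column by its column total: every column then sums to 1 (compare prop(ft, 2)).
col_props = counts ./ sum(counts, dims=1)
```

The first entry of `row_props` is 86/261 and the first entry of `col_props` is 86/313, which agree with the values shown in the two outputs above.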

In [6]:
x = categorical(rand(1:3, 10))
levels!(x, [3, 1, 2, 4]) # reordering levels and adding an extra level
freqtable(x) # the order of levels is preserved and the unused level is shown

4-element Named Array{Int64,1}
Dim1  │ 
──────┼──
3     │ 2
1     │ 2
2     │ 6
4     │ 0

In [7]:
freqtable([1,1,2,3,missing]) # by default missings are listed

4-element Named Array{Int64,1}
Dim1    │ 
────────┼──
1       │ 2
2       │ 1
3       │ 1
missing │ 1

In [8]:
freqtable([1,1,2,3,missing], skipmissing=true) # but we can skip them

3-element Named Array{Int64,1}
Dim1  │ 
──────┼──
1     │ 2
2     │ 1
3     │ 1
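
The treatment of `missing` can be illustrated with a small hand-written counter. This is only a sketch under the assumption of Julia 1.x; `count_values` and its `skip_missing` keyword are hypothetical names made up for this example and are not part of FreqTables.

```julia
# Hypothetical helper: count occurrences of each value, optionally ignoring `missing`.
function count_values(xs; skip_missing=false)
    counts = Dict{Any,Int}()
    for x in xs
        # Skip missing entries only when asked to.
        skip_missing && ismissing(x) && continue
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

with_missing = count_values([1, 1, 2, 3, missing])
without_missing = count_values([1, 1, 2, 3, missing]; skip_missing=true)
```

Unlike `freqtable`, a `Dict` does not keep its keys sorted, so this shows only the counting logic, not the layout of the result.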

### DataFramesMeta - working on `DataFrame`

In [9]:
using DataFramesMeta
df = DataFrame(x=1:8, y='a':'h', z=repeat([true,false], outer=4))

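│ Row │ x │ y   │ z     │
├─────┼───┼─────┼───────┤
│ 1   │ 1 │ 'a' │ true  │
│ 2   │ 2 │ 'b' │ false │
│ 3   │ 3 │ 'c' │ true  │
│ 4   │ 4 │ 'd' │ false │
│ 5   │ 5 │ 'e' │ true  │
│ 6   │ 6 │ 'f' │ false │
│ 7   │ 7 │ 'g' │ true  │
│ 8   │ 8 │ 'h' │ false │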


In [10]:
@with(df, :x+:z) # expressions with columns of DataFrame

8-element Array{Int64,1}:
 2
 2
 4
 4
 6
 6
 8
 8

In [11]:
@with df begin # you can define code blocks
    a = :x[:z]
    b = :x[.!:z]
    :y + [a; b]
end

8-element Array{Char,1}:
 'b'
 'e'
 'h'
 'k'
 'g'
 'j'
 'm'
 'p'
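
The two `@with` results above can be checked by hand with plain vectors standing in for the columns of `df`. This is a sketch assuming Julia 1.x broadcasting syntax; the variables `total` and `shifted` are names introduced here for illustration.

```julia
# Plain vectors with the same contents as the columns of df.
x = collect(1:8)
y = collect('a':'h')
z = repeat([true, false], outer=4)

# Same values as @with(df, :x+:z): `true` counts as 1 and `false` as 0.
total = x .+ z

# Same steps as the begin ... end block passed to @with.
a = x[z]         # rows where z is true:  [1, 3, 5, 7]
b = x[.!z]       # rows where z is false: [2, 4, 6, 8]
shifted = y .+ [a; b]   # adding an integer to a Char shifts it by that many code points
```

For example, the first element is `'a' + 1`, which gives `'b'`, and the last is `'h' + 8`, which gives `'p'`.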

In [12]:
a # @with creates hard scope so variables do not leak out

LoadError: [91mUndefVarError: a not defined[39m

In [13]:
df2 = DataFrame(a = [:a, :b, :c])
@with(df2, :a .== ^(:a)) # sometimes we want to work on raw Symbol, ^() escapes it

3-element BitArray{1}:
  true
 false
 false

In [14]:
df2 = DataFrame(x=1:3, y=4:6, z=7:9)
@with(df2, _I_(2:3)) # _I_(expression) is translated to df2[expression]

│ Row │ y │ z │
├─────┼───┼───┤
│ 1   │ 4 │ 7 │
│ 2   │ 5 │ 8 │
│ 3   │ 6 │ 9 │


In [15]:
@where(df, :x .< 4, :z .== true) # very useful macro for filtering

│ Row │ x │ y   │ z    │
├─────┼───┼─────┼──────┤
│ 1   │ 1 │ 'a' │ true │
│ 2   │ 3 │ 'c' │ true │


In [16]:
@select(df, :x, y = 2*:x, z=:y) # create a new DataFrame based on the old one

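│ Row │ x │ y  │ z   │
├─────┼───┼────┼─────┤
│ 1   │ 1 │ 2  │ 'a' │
│ 2   │ 2 │ 4  │ 'b' │
│ 3   │ 3 │ 6  │ 'c' │
│ 4   │ 4 │ 8  │ 'd' │
│ 5   │ 5 │ 10 │ 'e' │
│ 6   │ 6 │ 12 │ 'f' │
│ 7   │ 7 │ 14 │ 'g' │
│ 8   │ 8 │ 16 │ 'h' │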


In [17]:
@transform(df, a=1, x = 2*:x, y=:x) # create a new DataFrame adding columns based on the old one

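│ Row │ x  │ y │ z     │ a │
├─────┼────┼───┼───────┼───┤
│ 1   │ 2  │ 1 │ true  │ 1 │
│ 2   │ 4  │ 2 │ false │ 1 │
│ 3   │ 6  │ 3 │ true  │ 1 │
│ 4   │ 8  │ 4 │ false │ 1 │
│ 5   │ 10 │ 5 │ true  │ 1 │
│ 6   │ 12 │ 6 │ false │ 1 │
│ 7   │ 14 │ 7 │ true  │ 1 │
│ 8   │ 16 │ 8 │ false │ 1 │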


In [18]:
@transform(df, a=1, b=:a) # old DataFrame is used and :a is not present there

LoadError: [91mKeyError: key :a not found[39m

WIP: @by, grouping, sorting

In [19]:
@orderby(df, :z, -:x) # sorting into a new data frame, less powerful than sort, but lightweight

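│ Row │ x │ y   │ z     │
├─────┼───┼─────┼───────┤
│ 1   │ 8 │ 'h' │ false │
│ 2   │ 6 │ 'f' │ false │
│ 3   │ 4 │ 'd' │ false │
│ 4   │ 2 │ 'b' │ false │
│ 5   │ 7 │ 'g' │ true  │
│ 6   │ 5 │ 'e' │ true  │
│ 7   │ 3 │ 'c' │ true  │
│ 8   │ 1 │ 'a' │ true  │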


In [20]:
@linq df |> # chaining of operations on DataFrame
    where(:x .< 5) |>
    orderby(:z) |>
    transform(x²=:x.^2) |>
    select(:z, :x, :x²)

│ Row │ z     │ x │ x² │
├─────┼───────┼───┼────┤
│ 1   │ false │ 2 │ 4  │
│ 2   │ false │ 4 │ 16 │
│ 3   │ true  │ 1 │ 1  │
│ 4   │ true  │ 3 │ 9  │
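
The steps of the `@linq` chain above can be traced by hand on plain vectors. This is a sketch assuming Julia 1.x, with vectors standing in for the columns of `df`; the names `keep`, `order`, and `x_squared` are introduced here for illustration and have nothing to do with how DataFramesMeta implements the chain.

```julia
# Plain vectors with the same contents as the x and z columns of df.
x = collect(1:8)
z = repeat([true, false], outer=4)

# where(:x .< 5): keep only the rows that satisfy the condition.
keep = x .< 5
xs, zs = x[keep], z[keep]

# orderby(:z): false sorts before true; sortperm keeps equal elements in their original order.
order = sortperm(zs)
xs, zs = xs[order], zs[order]

# transform(x² = :x.^2): add a derived column.
x_squared = xs .^ 2
```

The resulting `zs`, `xs`, and `x_squared` hold the same values as the `z`, `x`, and `x²` columns in the output above.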


In [21]:
f(df, col) = df[col] # you can define your own functions and put them in the chain
@linq df |> where(:x .<= 4) |> f(:x)

4-element Array{Int64,1}:
 1
 2
 3
 4

### DataFramesMeta - working on grouped `DataFrame`

In [22]:
df = DataFrame(a = 1:12, b = repeat('a':'d', outer=3))
g = groupby(df, :b)

DataFrames.GroupedDataFrame  4 groups with keys: Symbol[:b]
First Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ a │ b   │
├─────┼───┼─────┤
│ 1   │ 1 │ 'a' │
│ 2   │ 5 │ 'a' │
│ 3   │ 9 │ 'a' │
⋮
Last Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 4  │ 'd' │
│ 2   │ 8  │ 'd' │
│ 3   │ 12 │ 'd' │

In [23]:
@by(df, :b, first=first(:a), last=last(:a), mean=mean(:a)) # more convenient than by from DataFrames

│ Row │ b   │ first │ last │ mean │
├─────┼─────┼───────┼──────┼──────┤
│ 1   │ 'a' │ 1     │ 9    │ 5.0  │
│ 2   │ 'b' │ 2     │ 10   │ 6.0  │
│ 3   │ 'c' │ 3     │ 11   │ 7.0  │
│ 4   │ 'd' │ 4     │ 12   │ 8.0  │


In [24]:
@based_on(g, first=first(:a), last=last(:a), mean=mean(:a)) # the same as by but on grouped DataFrame

│ Row │ b   │ first │ last │ mean │
├─────┼─────┼───────┼──────┼──────┤
│ 1   │ 'a' │ 1     │ 9    │ 5.0  │
│ 2   │ 'b' │ 2     │ 10   │ 6.0  │
│ 3   │ 'c' │ 3     │ 11   │ 7.0  │
│ 4   │ 'd' │ 4     │ 12   │ 8.0  │
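
The per-group summaries produced by `@by` and `@based_on` follow the usual split-apply-combine pattern, which can be sketched in plain Julia as follows. The sketch assumes Julia 1.x (named tuples and the `Statistics` standard library); `stats` is a name introduced here, and the code is not how DataFramesMeta computes its result.

```julia
using Statistics  # provides mean in Julia 1.x

# Plain vectors with the same contents as the columns of df.
a = collect(1:12)
b = repeat(collect('a':'d'), outer=3)

# Split by key, apply the summaries to each group, combine into one vector of named tuples.
stats = map(sort(unique(b))) do key
    vals = a[b .== key]   # values of a that belong to this group
    (b = key, first = first(vals), last = last(vals), mean = mean(vals))
end
```

Each element of `stats` corresponds to one row of the output above; for example, group `'a'` contains 1, 5, and 9, so its mean is 5.0.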


In [25]:
@where(g, mean(:a) > 6.5) # filter groups on aggregate conditions

DataFrames.GroupedDataFrame  2 groups with keys: Symbol[:b]
First Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 3  │ 'c' │
│ 2   │ 7  │ 'c' │
│ 3   │ 11 │ 'c' │
⋮
Last Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 4  │ 'd' │
│ 2   │ 8  │ 'd' │
│ 3   │ 12 │ 'd' │

In [26]:
@orderby(g, -sum(:a)) # order groups on aggregate conditions

DataFrames.GroupedDataFrame  4 groups with keys: Symbol[:b]
First Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 4  │ 'd' │
│ 2   │ 8  │ 'd' │
│ 3   │ 12 │ 'd' │
⋮
Last Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ a │ b   │
├─────┼───┼─────┤
│ 1   │ 1 │ 'a' │
│ 2   │ 5 │ 'a' │
│ 3   │ 9 │ 'a' │

In [27]:
@transform(g, center = mean(:a), centered = :a - mean(:a)) # perform operations within a group and return an ungrouped DataFrame

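│ Row │ a  │ b   │ center │ centered │
├─────┼────┼─────┼────────┼──────────┤
│ 1   │ 1  │ 'a' │ 5.0    │ -4.0     │
│ 2   │ 5  │ 'a' │ 5.0    │ 0.0      │
│ 3   │ 9  │ 'a' │ 5.0    │ 4.0      │
│ 4   │ 2  │ 'b' │ 6.0    │ -4.0     │
│ 5   │ 6  │ 'b' │ 6.0    │ 0.0      │
│ 6   │ 10 │ 'b' │ 6.0    │ 4.0      │
│ 7   │ 3  │ 'c' │ 7.0    │ -4.0     │
│ 8   │ 7  │ 'c' │ 7.0    │ 0.0      │
│ 9   │ 11 │ 'c' │ 7.0    │ 4.0      │
│ 10  │ 4  │ 'd' │ 8.0    │ -4.0     │
⋮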
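
Within-group centering, as in the `@transform` call above, can be sketched in plain Julia. The sketch assumes Julia 1.x and uses vectors in place of the data frame; `center` and `centered` mirror the column names above, but the code is illustrative and not the DataFramesMeta implementation.

```julia
using Statistics  # provides mean in Julia 1.x

# Plain vectors with the same contents as the columns of df.
a = collect(1:12)
b = repeat(collect('a':'d'), outer=3)

# For every row, look up the mean of a over all rows that share the same key in b.
center = [mean(a[b .== key]) for key in b]

# Subtract the group mean from each value.
centered = a .- center
```

This sketch keeps the rows in their original order, whereas the output above lists them group by group.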


In [28]:
DataFrame(g) # a nice convenience function not defined in DataFrames

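│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 1  │ 'a' │
│ 2   │ 5  │ 'a' │
│ 3   │ 9  │ 'a' │
│ 4   │ 2  │ 'b' │
│ 5   │ 6  │ 'b' │
│ 6   │ 10 │ 'b' │
│ 7   │ 3  │ 'c' │
│ 8   │ 7  │ 'c' │
│ 9   │ 11 │ 'c' │
│ 10  │ 4  │ 'd' │
⋮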


In [29]:
@transform(g) # actually this is the same

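│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 1  │ 'a' │
│ 2   │ 5  │ 'a' │
│ 3   │ 9  │ 'a' │
│ 4   │ 2  │ 'b' │
│ 5   │ 6  │ 'b' │
│ 6   │ 10 │ 'b' │
│ 7   │ 3  │ 'c' │
│ 8   │ 7  │ 'c' │
│ 9   │ 11 │ 'c' │
│ 10  │ 4  │ 'd' │
⋮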


In [30]:
@linq df |> groupby(:b) |> where(mean(:a) > 6.5) |> DataFrame # you can do chaining on grouped DataFrames as well

│ Row │ a  │ b   │
├─────┼────┼─────┤
│ 1   │ 3  │ 'c' │
│ 2   │ 7  │ 'c' │
│ 3   │ 11 │ 'c' │
│ 4   │ 4  │ 'd' │
│ 5   │ 8  │ 'd' │
│ 6   │ 12 │ 'd' │
