# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 2, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under version `0.11`.
I will try to keep it up to date as the package evolves.

In [1]:
using DataFrames # load package

## Reshaping DataFrames

### Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

Row,id,id2,M1,M2
1,1,1,11,111
2,2,1,12,112
3,3,2,13,113
4,4,2,14,114


In [3]:
melt(x, :id, [:M1, :M2]) # pass the id variables first, then the measure variables; meltdf returns a view instead

Row,variable,value,id
1,M1,11,1
2,M1,12,2
3,M1,13,3
4,M1,14,4
5,M2,111,1
6,M2,112,2
7,M2,113,3
8,M2,114,4
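
To make the reshaping concrete, here is a minimal sketch of the same wide-to-long transformation written with plain vectors. The helper `manual_melt` is hypothetical and exists only for illustration; it is not how `melt` is implemented in `DataFrames`, and it assumes integer columns like the ones in `x`.

```julia
# Hand-rolled illustration of wide-to-long reshaping on plain vectors.
# `manual_melt` is a hypothetical helper, not part of DataFrames.
function manual_melt(ids, measures)
    variable = Symbol[]
    value = Int[]
    id = Int[]
    for (name, col) in measures        # one block of rows per measure column
        for (i, v) in zip(ids, col)    # one output row per original row
            push!(variable, name)
            push!(value, v)
            push!(id, i)
        end
    end
    return variable, value, id
end

variable, value, id = manual_melt([1, 2, 3, 4],
                                  [(:M1, [11, 12, 13, 14]), (:M2, [111, 112, 113, 114])])
```

The three returned vectors correspond to the `variable`, `value` and `id` columns of the table above: each measure column contributes one block of rows, and the id column is repeated once per block.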


In [4]:
# optionally you can rename the output columns; melt and stack are equivalent, but the order of their arguments is reversed
stack(x, [:M1, :M2], :id, variable_name=:key, value_name=:observed) # measure variables first, then id variables; stackdf creates a view

Row,key,observed,id
1,M1,11,1
2,M1,12,2
3,M1,13,3
4,M1,14,4
5,M2,111,1
6,M2,112,2
7,M2,113,3
8,M2,114,4


In [5]:
# if the second argument is omitted in melt or stack, all remaining columns are used in its place;
# if no columns are given at all, only columns with eltype <: AbstractFloat are selected as measure variables
melt(x, [:id, :id2])

Row,variable,value,id,id2
1,M1,11,1,1
2,M1,12,2,1
3,M1,13,3,2
4,M1,14,4,2
5,M2,111,1,1
6,M2,112,2,1
7,M2,113,3,2
8,M2,114,4,2


In [6]:
melt(x, [1, 2]) # you can use column indices instead of symbols

Row,variable,value,id,id2
1,M1,11,1,1
2,M1,12,2,1
3,M1,13,3,2
4,M1,14,4,2
5,M2,111,1,1
6,M2,112,2,1
7,M2,113,3,2
8,M2,114,4,2


In [7]:
bigx = DataFrame(rand(10^6, 10)) # a test comparing the creation of a new DataFrame with the creation of a view
bigx[:id] = 1:10^6
@time melt(bigx, :id)
@time meltdf(bigx, :id);

  0.349221 seconds (63.08 k allocations: 232.114 MiB, 25.32% gc time)
  0.259048 seconds (103.05 k allocations: 5.470 MiB)


In [8]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

Row,id,id2,a1,a2
1,1,'a',0.592718,0.136227
2,1,'b',0.922075,0.770645
3,1,'c',0.252682,0.380463


In [9]:
melt(x)

Row,variable,value,id,id2
1,a1,0.592718,1,'a'
2,a1,0.922075,1,'b'
3,a1,0.252682,1,'c'
4,a2,0.136227,1,'a'
5,a2,0.770645,1,'b'
6,a2,0.380463,1,'c'


In [10]:
melt(DataFrame(rand(3,2))) # by default stack and melt treat float columns as value columns

Row,variable,value
1,x1,0.198758
2,x1,0.223756
3,x1,0.369233
4,x2,0.704364
5,x2,0.231084
6,x2,0.504981
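
The default rule described above (float columns become value columns) can be sketched on plain data as follows. The `columns` list is a hypothetical stand-in for a data frame with made-up values, and the comprehension only mirrors the rule as stated in this notebook; it is not the package's own code.

```julia
# Hypothetical stand-in for a data frame: a list of (name, column) pairs.
columns = [(:id,  [1, 1, 1]),
           (:id2, ['a', 'b', 'c']),
           (:a1,  [0.1, 0.2, 0.3]),
           (:a2,  [0.4, 0.5, 0.6])]

# Columns whose element type is a float are treated as measure (value) columns...
measure_vars = [name for (name, col) in columns if eltype(col) <: AbstractFloat]

# ...and every other column is treated as an id column.
id_vars = [name for (name, col) in columns if !(eltype(col) <: AbstractFloat)]
```

Here `measure_vars` holds `:a1` and `:a2`, while `id_vars` holds `:id` and `:id2`, which matches the split seen in the output of `melt(x)` above.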


In [11]:
df = DataFrame(rand(3,2))
df[:key] = [1,1,1]
mdf = melt(df) # duplicates in key are silently accepted

Row,variable,value,key
1,x1,0.403734,1
2,x1,0.802916,1
3,x1,0.0837099,1
4,x2,0.129751,1
5,x2,0.33382,1
6,x2,0.685341,1


### Long to wide

In [12]:
y = melt(x, [1,2])
x,y

(3×4 DataFrames.DataFrame
│ Row │ id │ id2 │ a1       │ a2       │
├─────┼────┼─────┼──────────┼──────────┤
│ 1   │ 1  │ 'a' │ 0.592718 │ 0.136227 │
│ 2   │ 1  │ 'b' │ 0.922075 │ 0.770645 │
│ 3   │ 1  │ 'c' │ 0.252682 │ 0.380463 │, 6×4 DataFrames.DataFrame
│ Row │ variable │ value    │ id │ id2 │
├─────┼──────────┼──────────┼────┼─────┤
│ 1   │ a1       │ 0.592718 │ 1  │ 'a' │
│ 2   │ a1       │ 0.922075 │ 1  │ 'b' │
│ 3   │ a1       │ 0.252682 │ 1  │ 'c' │
│ 4   │ a2       │ 0.136227 │ 1  │ 'a' │
│ 5   │ a2       │ 0.770645 │ 1  │ 'b' │
│ 6   │ a2       │ 0.380463 │ 1  │ 'c' │)

In [13]:
unstack(y, :id2, :variable, :value) # standard unstack with a unique key

Row,id2,a1,a2
1,'a',0.592718,0.136227
2,'b',0.922075,0.770645
3,'c',0.252682,0.380463
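
As a rough illustration of what a long-to-wide reshape does, the sketch below rebuilds the wide columns from three plain vectors. The helper `manual_unstack` is hypothetical and is not the `DataFrames` implementation; in particular, using `NaN` as a placeholder for cells that are never filled is an assumption made only for this example.

```julia
# `manual_unstack` is a hypothetical helper for illustration only.
function manual_unstack(rowkeys, variables, values)
    keys_out = unique(rowkeys)                                  # one output row per distinct key
    rowindex = Dict(k => i for (i, k) in enumerate(keys_out))   # key => output row number
    # One output column per distinct variable name; NaN marks a cell that was never filled (assumption).
    columns = Dict(name => fill(NaN, length(keys_out)) for name in unique(variables))
    for (k, name, v) in zip(rowkeys, variables, values)
        columns[name][rowindex[k]] = v   # a repeated (key, variable) pair overwrites the earlier value
    end
    return keys_out, columns
end

id2 = ['a', 'b', 'c', 'a', 'b', 'c']
variable = [:a1, :a1, :a1, :a2, :a2, :a2]
value = [0.592718, 0.922075, 0.252682, 0.136227, 0.770645, 0.380463]
keys_out, columns = manual_unstack(id2, variable, value)
```

With a unique key such as `id2`, every cell is written exactly once, so `columns[:a1]` and `columns[:a2]` reproduce the `a1` and `a2` columns of the table above.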


In [14]:
unstack(y, :variable, :value) # all other columns are treated as keys

Row,id,id2,a1,a2
1,1,'a',0.592718,0.136227
2,1,'b',0.922075,0.770645
3,1,'c',0.252682,0.380463


In [15]:
unstack(y) # by default the :id, :variable and :value column names are assumed; here :id contains duplicate keys, so only one row survives



Row,id,a1,a2
1,1,0.252682,0.380463
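
Because duplicate keys collapse rows without any warning, it can be worth checking for them before unstacking. The following is a minimal sketch on plain vectors; `has_duplicate_keys` is a hypothetical helper, not a `DataFrames` function.

```julia
# `has_duplicate_keys` is a hypothetical helper for illustration only.
# It reports whether any (key, variable) pair occurs more than once.
function has_duplicate_keys(rowkeys, variables)
    seen = Set{Tuple{eltype(rowkeys), eltype(variables)}}()
    for pair in zip(rowkeys, variables)
        pair in seen && return true   # this pair was already encountered
        push!(seen, pair)
    end
    return false
end

variable = [:a1, :a1, :a1, :a2, :a2, :a2]
dup_by_id  = has_duplicate_keys([1, 1, 1, 1, 1, 1], variable)              # key :id as in y
dup_by_id2 = has_duplicate_keys(['a', 'b', 'c', 'a', 'b', 'c'], variable)  # key :id2 as in y
```

For the data in `y`, the `:id` key repeats within each variable, so `dup_by_id` is `true`, while `:id2` identifies each row uniquely and `dup_by_id2` is `false`.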


In [16]:
unstack(stack(DataFrame(rand(3,2))), :variable, :value) # unable to unstack when no key column is present

LoadError: BoundsError: attempt to access ()
  at index [1]

In [17]:
# this works, because stack now treats columns :x1 and :x2 as keys: they are Int, not Float64 as in the previous example
unstack(stack(DataFrame(rand(Int, 3,2))), :variable, :value)

Row,x1,x2
