# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), December 8, 2019**

In [1]:
using DataFrames # load package

## Reshaping DataFrames

### Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

| | id | id2 | M1 | M2 |
|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 11 | 111 |
| 2 | 2 | 1 | 12 | 112 |
| 3 | 3 | 2 | 13 | 113 |
| 4 | 4 | 2 | 14 | 114 |


In [3]:
stack(x, [:M1, :M2], :id) # pass the measure variables first, then the id-variable

| | variable | value | id |
|---|---|---|---|
| | Symbol | Int64 | Int64 |
| 1 | M1 | 11 | 1 |
| 2 | M1 | 12 | 2 |
| 3 | M1 | 13 | 3 |
| 4 | M1 | 14 | 4 |
| 5 | M2 | 111 | 1 |
| 6 | M2 | 112 | 2 |
| 7 | M2 | 113 | 3 |
| 8 | M2 | 114 | 4 |


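Add the `view=true` keyword argument to make `stack` return a view. In that case the columns of the resulting data frame share memory with the columns of the source data frame, so later in-place changes to the source are visible in the result, which makes the operation potentially unsafe.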
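
The effect of sharing memory can be seen in a minimal sketch like the one below. It assumes a DataFrames.jl version in which `stack` accepts the `view` keyword as described above; the names `src`, `copied` and `viewed` are illustrative only.

```julia
using DataFrames

# illustrative data frame, kept separate from `x` used in the cells
src = DataFrame(id = [1, 2, 3, 4], M1 = [11, 12, 13, 14], M2 = [111, 112, 113, 114])

copied = stack(src, [:M1, :M2], :id)               # default: result columns are newly allocated
viewed = stack(src, [:M1, :M2], :id, view = true)  # result columns wrap the columns of src

src.M1[1] = -1   # mutate the source column in place

copied.value[1]  # still the old value, the copy is independent
viewed.value[1]  # the new value, because the view shares memory with src
```

If the source data frame may be modified later, the default (copying) behavior is the safer choice.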

In [4]:
# optionally you can rename columns
stack(x, [:M1, :M2], :id, variable_name=:key, value_name=:observed)

| | key | observed | id |
|---|---|---|---|
| | Symbol | Int64 | Int64 |
| 1 | M1 | 11 | 1 |
| 2 | M1 | 12 | 2 |
| 3 | M1 | 13 | 3 |
| 4 | M1 | 14 | 4 |
| 5 | M2 | 111 | 1 |
| 6 | M2 | 112 | 2 |
| 7 | M2 | 113 | 3 |
| 8 | M2 | 114 | 4 |


If the id-variables argument is omitted in `stack`, all columns other than the measure variables are assumed to be id-variables.

In [5]:
stack(x, Not([:id, :id2]))

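| | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Int64 | Int64 | Int64 |
| 1 | M1 | 11 | 1 | 1 |
| 2 | M1 | 12 | 2 | 1 |
| 3 | M1 | 13 | 3 | 2 |
| 4 | M1 | 14 | 4 | 2 |
| 5 | M2 | 111 | 1 | 1 |
| 6 | M2 | 112 | 2 | 1 |
| 7 | M2 | 113 | 3 | 2 |
| 8 | M2 | 114 | 4 | 2 |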


In [6]:
stack(x, Not([1, 2])) # you can use column indices instead of symbols

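| | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Int64 | Int64 | Int64 |
| 1 | M1 | 11 | 1 | 1 |
| 2 | M1 | 12 | 2 | 1 |
| 3 | M1 | 13 | 3 | 2 |
| 4 | M1 | 14 | 4 | 2 |
| 5 | M2 | 111 | 1 | 1 |
| 6 | M2 | 112 | 2 | 1 |
| 7 | M2 | 113 | 3 | 2 |
| 8 | M2 | 114 | 4 | 2 |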


In [7]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

| | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.78389 | 0.476641 |
| 2 | 1 | 'b' | 0.672166 | 0.698446 |
| 3 | 1 | 'c' | 0.105133 | 0.2595 |


If `stack` is not passed any measure variables, the columns with a floating-point element type are selected as measures by default (here `a1` and `a2`; the integer column `id` stays an id-variable).

In [8]:
stack(x)

| | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Char |
| 1 | a1 | 0.78389 | 1 | 'a' |
| 2 | a1 | 0.672166 | 1 | 'b' |
| 3 | a1 | 0.105133 | 1 | 'c' |
| 4 | a2 | 0.476641 | 1 | 'a' |
| 5 | a2 | 0.698446 | 1 | 'b' |
| 6 | a2 | 0.2595 | 1 | 'c' |
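
If you would rather not rely on the default selection, the same choice can be spelled out explicitly. The sketch below is one possible way to do it, assuming the default picks floating-point columns as the output above suggests; `demo` and `floatcols` are hypothetical names and the values are fixed only to keep the example reproducible.

```julia
using DataFrames

# hypothetical data with the same layout as `x` above
demo = DataFrame(id = [1, 1, 1], id2 = ['a', 'b', 'c'],
                 a1 = [0.1, 0.2, 0.3], a2 = [0.4, 0.5, 0.6])

# names of the columns whose element type is a floating-point type
floatcols = [n for n in names(demo) if eltype(demo[!, n]) <: AbstractFloat]

# pass the measure variables explicitly; the remaining columns become id-variables
long = stack(demo, floatcols)
```

Listing the measures this way keeps the result stable if an integer-valued measurement column is added later.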


Here all columns are treated as measures:

In [9]:
stack(DataFrame(rand(3,2)))

| | variable | value |
|---|---|---|
| | Symbol | Float64 |
| 1 | x1 | 0.972563 |
| 2 | x1 | 0.208469 |
| 3 | x1 | 0.85896 |
| 4 | x2 | 0.89959 |
| 5 | x2 | 0.423649 |
| 6 | x2 | 0.415266 |


In [10]:
df = DataFrame(rand(3,2))
df.key = [1,1,1]
mdf = stack(df) # duplicates in key are silently accepted

| | variable | value | key |
|---|---|---|---|
| | Symbol | Float64 | Int64 |
| 1 | x1 | 0.772418 | 1 |
| 2 | x1 | 0.442216 | 1 |
| 3 | x1 | 0.360197 | 1 |
| 4 | x2 | 0.818011 | 1 |
| 5 | x2 | 0.045373 | 1 |
| 6 | x2 | 0.867399 | 1 |
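
Since `stack` does not check the id-variables, it can be worth checking them yourself when the key is meant to identify rows. A minimal sketch using `allunique` from Base; the data frame `dup` and the warning text are illustrative only.

```julia
using DataFrames

# hypothetical data with a repeated key, mirroring the cell above
dup = DataFrame(x1 = [0.1, 0.2, 0.3], x2 = [0.4, 0.5, 0.6], key = [1, 1, 1])

# allunique returns false here because every row has the same key
if !allunique(dup.key)
    @warn "key column contains duplicates; rows cannot be told apart by key after stacking"
end

long = stack(dup, [:x1, :x2], :key)  # still succeeds, duplicates are not rejected
```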


### Long to wide

In [11]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

| | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.385941 | 0.264271 |
| 2 | 1 | 'b' | 0.996747 | 0.746892 |
| 3 | 1 | 'c' | 0.0901839 | 0.526402 |


In [12]:
y = stack(x)

| | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Char |
| 1 | a1 | 0.385941 | 1 | 'a' |
| 2 | a1 | 0.996747 | 1 | 'b' |
| 3 | a1 | 0.0901839 | 1 | 'c' |
| 4 | a2 | 0.264271 | 1 | 'a' |
| 5 | a2 | 0.746892 | 1 | 'b' |
| 6 | a2 | 0.526402 | 1 | 'c' |


In [13]:
unstack(y, :id2, :variable, :value) # standard unstack with a specified key

| | id2 | a1 | a2 |
|---|---|---|---|
| | Char | Float64⍰ | Float64⍰ |
| 1 | 'a' | 0.385941 | 0.264271 |
| 2 | 'b' | 0.996747 | 0.746892 |
| 3 | 'c' | 0.0901839 | 0.526402 |


In [14]:
unstack(y, :variable, :value) # all other columns are treated as keys

| | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64⍰ | Float64⍰ |
| 1 | 1 | 'a' | 0.385941 | 0.264271 |
| 2 | 1 | 'b' | 0.996747 | 0.746892 |
| 3 | 1 | 'c' | 0.0901839 | 0.526402 |


In [15]:
# all columns other than those named :variable and :value are treated as keys
unstack(y)

| | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64⍰ | Float64⍰ |
| 1 | 1 | 'a' | 0.385941 | 0.264271 |
| 2 | 1 | 'b' | 0.996747 | 0.746892 |
| 3 | 1 | 'c' | 0.0901839 | 0.526402 |


In [16]:
# you can rename the unstacked columns
unstack(y, renamecols=n->Symbol(:unstacked_, n))

| | id | id2 | unstacked_a1 | unstacked_a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64⍰ | Float64⍰ |
| 1 | 1 | 'a' | 0.385941 | 0.264271 |
| 2 | 1 | 'b' | 0.996747 | 0.746892 |
| 3 | 1 | 'c' | 0.0901839 | 0.526402 |


In [17]:
df = stack(DataFrame(rand(3,2)))

| | variable | value |
|---|---|---|
| | Symbol | Float64 |
| 1 | x1 | 0.805485 |
| 2 | x1 | 0.133465 |
| 3 | x1 | 0.79315 |
| 4 | x2 | 0.601833 |
| 5 | x2 | 0.958318 |
| 6 | x2 | 0.419136 |


In [18]:
unstack(df, :variable, :value) # unable to unstack when no key column is present

ArgumentError: No key column found
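
One way around this, sketched below, is to add an explicit row identifier before stacking so that `unstack` has a key to work with. The column name `row` and the data are hypothetical; fixed values replace `rand` only to keep the example reproducible.

```julia
using DataFrames

wide = DataFrame(x1 = [0.1, 0.2, 0.3], x2 = [0.4, 0.5, 0.6])
wide.row = [1, 2, 3]   # hypothetical key column identifying each original row

# :row is carried through stack as the id-variable
long = stack(wide, [:x1, :x2], :row)

# :row now serves as the key, so the wide layout can be rebuilt
back = unstack(long, :row, :variable, :value)
```

The rebuilt data frame has the key column `row` followed by one column per measure.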