# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), October 5, 2022**

In [1]:
using DataFrames # load package

## Reshaping DataFrames

### Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

Row,id,id2,M1,M2
,Int64,Int64,Int64,Int64
1,1,1,11,111
2,2,1,12,112
3,3,2,13,113
4,4,2,14,114


In [3]:
stack(x, [:M1, :M2], :id) # first pass measure variables and then id-variable

Row,id,variable,value
,Int64,String,Int64
1,1,M1,11
2,2,M1,12
3,3,M1,13
4,4,M1,14
5,1,M2,111
6,2,M2,112
7,3,M2,113
8,4,M2,114


Add the `view=true` keyword argument to get a view instead of a copy. In that case the columns of the resulting data frame share memory with the columns of the source data frame, so later in-place changes to the source show up in the result and the operation is potentially unsafe.
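A minimal sketch of what sharing memory means in practice. The small data frame and the names `long_copy` and `long_view` are made up for illustration; the only API used is `stack` with the `view` keyword argument described above.

```julia
using DataFrames

# Hypothetical small data frame, used only for this illustration
src = DataFrame(id=[1, 2], M1=[11, 12], M2=[111, 112])

long_copy = stack(src, [:M1, :M2], :id)             # default: columns are copied
long_view = stack(src, [:M1, :M2], :id, view=true)  # columns wrap the source columns

# Mutate the source column in place
src.M1[1] = -1

long_copy.value[1]  # still the value from before the mutation
long_view.value[1]  # reflects the mutated source
```

If the source data frame may change later, the default copying behavior is the safer choice.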

In [4]:
# optionally you can rename columns
stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")

Row,id,key,observed
,Int64,String,Int64
1,1,M1,11
2,2,M1,12
3,3,M1,13
4,4,M1,14
5,1,M2,111
6,2,M2,112
7,3,M2,113
8,4,M2,114


If the id-variables argument is omitted in `stack`, all columns not selected as measure variables are assumed to be id-variables.

In [5]:
stack(x, Not([:id, :id2]))

Row,id,id2,variable,value
,Int64,Int64,String,Int64
1,1,1,M1,11
2,2,1,M1,12
3,3,2,M1,13
4,4,2,M1,14
5,1,1,M2,111
6,2,1,M2,112
7,3,2,M2,113
8,4,2,M2,114


In [6]:
stack(x, Not([1, 2])) # you can use index instead of symbol

Row,id,id2,variable,value
,Int64,Int64,String,Int64
1,1,1,M1,11
2,2,1,M1,12
3,3,2,M1,13
4,4,2,M1,14
5,1,1,M2,111
6,2,1,M2,112
7,3,2,M2,113
8,4,2,M2,114
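The measure variables are given as a column selector, so besides `Not` with names or indices, other selectors should work too. The sketch below assumes a regular expression is accepted in that position and checks that it selects the same columns as `Not([:id, :id2])`; the variable names are illustrative.

```julia
using DataFrames

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2],
              M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

by_not   = stack(x, Not([:id, :id2]))  # everything except the id columns
by_regex = stack(x, r"^M")             # columns whose name starts with "M"

# Both calls should give the same long table
by_not == by_regex
```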


In [7]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

Row,id,id2,a1,a2
,Int64,Char,Float64,Float64
1,1,a,0.58349,0.819258
2,1,b,0.669934,0.767127
3,1,c,0.10618,0.775644


If `stack` is not passed any measure variables, floating-point columns are selected as measures by default (here `a1` and `a2`; the integer column `id` stays an id-variable).

In [8]:
stack(x)

Row,id,id2,variable,value
,Int64,Char,String,Float64
1,1,a,a1,0.58349
2,1,b,a1,0.669934
3,1,c,a1,0.10618
4,1,a,a2,0.819258
5,1,b,a2,0.767127
6,1,c,a2,0.775644
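To make the default selection visible, here is a small sketch with one integer column that is not meant to be an identifier. The column names `n` and `f` are made up; the point is that only the `Float64` column is stacked, while both integer columns are kept as id-variables.

```julia
using DataFrames

# `n` holds integers, `f` holds floats; neither name comes from the notebook
df = DataFrame(id=[1, 2], n=[10, 20], f=[0.5, 1.5])

long = stack(df)  # no measure variables given

names(long)       # id-variables first, then the variable and value columns
long.variable     # only "f" appears here
```

If an integer column should be stacked as well, pass the measure variables explicitly, for example `stack(df, [:n, :f])`.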


Here all columns are floating point, so all of them are treated as measures:

In [9]:
stack(DataFrame(rand(3,2), :auto))

Row,variable,value
,String,Float64
1,x1,0.071003
2,x1,0.753937
3,x1,0.570336
4,x2,0.409501
5,x2,0.497623
6,x2,0.643734


In [10]:
df = DataFrame(rand(3,2), :auto)
df.key = [1,1,1]
mdf = stack(df) # duplicates in key are silently accepted

Row,key,variable,value
,Int64,String,Float64
1,1,x1,0.222378
2,1,x1,0.798064
3,1,x1,0.420131
4,1,x2,0.543722
5,1,x2,0.848304
6,1,x2,0.392049


### Long to wide

In [11]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

Row,id,id2,a1,a2
,Int64,Char,Float64,Float64
1,1,a,0.547439,0.633376
2,1,b,0.890724,0.904306
3,1,c,0.629677,0.0864275


In [12]:
y = stack(x)

Row,id,id2,variable,value
,Int64,Char,String,Float64
1,1,a,a1,0.547439
2,1,b,a1,0.890724
3,1,c,a1,0.629677
4,1,a,a2,0.633376
5,1,b,a2,0.904306
6,1,c,a2,0.0864275


In [13]:
unstack(y, :id2, :variable, :value) # standard unstack with a specified key

Row,id2,a1,a2
,Char,Float64?,Float64?
1,a,0.547439,0.633376
2,b,0.890724,0.904306
3,c,0.629677,0.0864275


In [14]:
unstack(y, :variable, :value) # all other columns are treated as keys

Row,id,id2,a1,a2
,Int64,Char,Float64?,Float64?
1,1,a,0.547439,0.633376
2,1,b,0.890724,0.904306
3,1,c,0.629677,0.0864275


In [15]:
# all columns other than named :variable and :value are treated as keys
unstack(y)

Row,id,id2,a1,a2
,Int64,Char,Float64?,Float64?
1,1,a,0.547439,0.633376
2,1,b,0.890724,0.904306
3,1,c,0.629677,0.0864275
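The `Float64?` element type in the outputs above means `Union{Missing, Float64}`: `unstack` allows for missing combinations even when none occur. The following sketch, with fixed values instead of `rand` so it is reproducible, checks that `stack` followed by `unstack` gives back the original values and then narrows the column types with `disallowmissing!`. The name `wide` is illustrative.

```julia
using DataFrames

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'],
              a1=[0.1, 0.2, 0.3], a2=[1.1, 1.2, 1.3])

y = stack(x)       # wide to long
wide = unstack(y)  # long back to wide

wide == x          # same names and values as the original
eltype(wide.a1)    # Union{Missing, Float64}

# No missing values are present, so the Union can be dropped in place
disallowmissing!(wide)
eltype(wide.a1)    # Float64
```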


In [16]:
# you can rename the unstacked columns
unstack(y, renamecols=n->string("unstacked_", n))

Row,id,id2,unstacked_a1,unstacked_a2
,Int64,Char,Float64?,Float64?
1,1,a,0.547439,0.633376
2,1,b,0.890724,0.904306
3,1,c,0.629677,0.0864275


In [17]:
df = stack(DataFrame(rand(3,2), :auto))

Row,variable,value
,String,Float64
1,x1,0.240547
2,x1,0.891005
3,x1,0.926475
4,x2,0.355812
5,x2,0.604021
6,x2,0.509928


Unstacking without a key column and with duplicates (see the end of this notebook for more examples of the `combine` keyword argument):

In [18]:
unstack(df, :variable, :value) # fails: with no key column all rows share the same (empty) key, so entries are duplicated

LoadError: ArgumentError: Duplicate entries in unstack at row 2 for key () and variable x1. Pass `combine` keyword argument to specify how they should be handled.

In [19]:
unstack(df, :variable, :value, combine=copy)

Row,x1,x2
,Array…?,Array…?
1,"[0.240547, 0.891005, 0.926475]","[0.355812, 0.604021, 0.509928]"
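`combine=copy` keeps every duplicate as a vector. As a hedged sketch of two other choices, the example below uses `last` to keep only the final duplicate and `length` to count duplicates, on a made-up data frame with an explicit key column. It assumes a DataFrames.jl version that supports the `combine` keyword argument, as this notebook does.

```julia
using DataFrames

# Key 1 has two entries for variable "a"; the data is invented for illustration
df = DataFrame(key=[1, 1, 2], variable=["a", "a", "b"], value=[10, 20, 30])

# Keep the last of the duplicated entries
last_wide = unstack(df, :key, :variable, :value, combine=last)

# Count entries per key/variable pair, filling absent pairs with 0
count_wide = unstack(df, :key, :variable, :value, combine=length, fill=0)
```

Any function that reduces a vector of values to a single value can be passed in the same way.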


`unstack` fills missing combinations with `missing`, but you can change this default with the `fill` keyword argument.

In [20]:
df = DataFrame(key=[1, 1, 2], variable=["a", "b", "a"], value=1:3)

Row,key,variable,value
,Int64,String,Int64
1,1,a,1
2,1,b,2
3,2,a,3


In [21]:
unstack(df, :variable, :value)

Row,key,a,b
,Int64,Int64?,Int64?
1,1,1,2
2,2,3,missing


In [22]:
unstack(df, :variable, :value, fill=0)

Row,key,a,b
,Int64,Int64,Int64
1,1,1,2
2,2,3,0


`unstack` can combine the values that fall into the same row/column combination when there is more than one of them. For example:

In [23]:
df = DataFrame(row=rand(1:3, 15), col=rand('a':'d', 15), value=1:15)

Row,row,col,value
,Int64,Char,Int64
1,1,a,1
2,3,a,2
3,3,a,3
4,2,c,4
5,3,d,5
6,1,c,6
7,1,a,7
8,2,d,8
9,2,c,9
10,1,d,10


In [24]:
unstack(df, :row, :col, :value, combine=sum)

Row,row,a,c,d,b
,Int64,Int64?,Int64?,Int64?,Int64?
1,1,8,18,36,missing
2,3,5,missing,5,13
3,2,missing,27,8,missing


For comparison:

In [25]:
combine(groupby(df, [:row, :col], sort=true), :value => sum)

Row,row,col,value_sum
,Int64,Char,Int64
1,1,a,8
2,1,c,18
3,1,d,36
4,2,c,27
5,2,d,8
6,3,a,5
7,3,b,13
8,3,d,5
