# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

### Reference

* https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

### Series

* https://deepstat.tistory.com/69 (01. constructors) (in English)
* https://deepstat.tistory.com/70 (01. constructors) (in Korean)
* https://deepstat.tistory.com/71 (02. basicinfo) (in English)
* https://deepstat.tistory.com/72 (02. basicinfo) (in Korean)
* https://deepstat.tistory.com/73 (03. missingvalues) (in English)
* https://deepstat.tistory.com/74 (03. missingvalues) (in Korean)
* https://deepstat.tistory.com/75 (04. loadsave) (in English)
* https://deepstat.tistory.com/76 (04. loadsave) (in Korean)
* https://deepstat.tistory.com/77 (05. columns) (in English)
* https://deepstat.tistory.com/78 (05. columns) (in Korean)
* https://deepstat.tistory.com/79 (06. rows) (in English)
* https://deepstat.tistory.com/80 (06. rows) (in Korean)
* https://deepstat.tistory.com/81 (07. factors) (in English)
* https://deepstat.tistory.com/82 (07. factors) (in Korean)
* https://deepstat.tistory.com/83 (08. joins) (in English)
* https://deepstat.tistory.com/84 (08. joins) (in Korean)
* https://deepstat.tistory.com/85 (09. reshaping) (in English)
* https://deepstat.tistory.com/86 (09. reshaping) (in Korean)

In [1]:
using DataFrames # load package

## Reshaping DataFrames

### Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

| Row | id | id2 | M1 | M2 |
|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 11 | 111 |
| 2 | 2 | 1 | 12 | 112 |
| 3 | 3 | 2 | 13 | 113 |
| 4 | 4 | 2 | 14 | 114 |


In [3]:
melt(x, :id, [:M1, :M2]) # pass id variables first, then measure variables; meltdf does the same but returns a view

| Row | variable | value | id |
|---|---|---|---|
| | Symbol | Int64 | Int64 |
| 1 | M1 | 11 | 1 |
| 2 | M1 | 12 | 2 |
| 3 | M1 | 13 | 3 |
| 4 | M1 | 14 | 4 |
| 5 | M2 | 111 | 1 |
| 6 | M2 | 112 | 2 |
| 7 | M2 | 113 | 3 |
| 8 | M2 | 114 | 4 |


In [4]:
# optionally you can rename the output columns; melt and stack are identical except that the order of arguments is reversed
stack(x, [:M1, :M2], :id, variable_name=:key, value_name=:observed) # measure variables first, then id variables; stackdf creates a view

| Row | key | observed | id |
|---|---|---|---|
| | Symbol | Int64 | Int64 |
| 1 | M1 | 11 | 1 |
| 2 | M1 | 12 | 2 |
| 3 | M1 | 13 | 3 |
| 4 | M1 | 14 | 4 |
| 5 | M2 | 111 | 1 |
| 6 | M2 | 112 | 2 |
| 7 | M2 | 113 | 3 |
| 8 | M2 | 114 | 4 |
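
`melt`, `meltdf` and `stackdf` belong to the DataFrames.jl API that was current when this notebook was written. As a rough sketch, assuming a DataFrames.jl 1.x installation (an assumption; the notebook itself runs an older release), the same reshaping can be written with `stack` alone. In those releases the id columns come first in the result and the variable column holds strings rather than symbols.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2],
              M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

# measure variables first, then id variables, as in the stack call above
long = stack(x, [:M1, :M2], :id; variable_name=:key, value_name=:observed)

# one row per (id, measure) pair
println(long)
```

The result has the same eight rows as the output above, only with the columns ordered `id`, `key`, `observed`.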


In [5]:
# if the second argument is omitted in melt or stack, all remaining columns are used for it;
# when no columns are given at all, measure variables are only the columns with eltype <: AbstractFloat
melt(x, [:id, :id2])

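| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Int64 | Int64 | Int64 |
| 1 | M1 | 11 | 1 | 1 |
| 2 | M1 | 12 | 2 | 1 |
| 3 | M1 | 13 | 3 | 2 |
| 4 | M1 | 14 | 4 | 2 |
| 5 | M2 | 111 | 1 | 1 |
| 6 | M2 | 112 | 2 | 1 |
| 7 | M2 | 113 | 3 | 2 |
| 8 | M2 | 114 | 4 | 2 |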


In [6]:
melt(x, [1, 2]) # you can use column indices instead of symbols

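| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Int64 | Int64 | Int64 |
| 1 | M1 | 11 | 1 | 1 |
| 2 | M1 | 12 | 2 | 1 |
| 3 | M1 | 13 | 3 | 2 |
| 4 | M1 | 14 | 4 | 2 |
| 5 | M2 | 111 | 1 | 1 |
| 6 | M2 | 112 | 2 | 1 |
| 7 | M2 | 113 | 3 | 2 |
| 8 | M2 | 114 | 4 | 2 |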


In [7]:
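bigx = DataFrame(rand(10^6, 10)) # a test comparing creation of a new DataFrame with creation of a view
bigx[:id] = 1:10^6
@time melt(bigx, :id)    # first call includes compilation time
@time melt(bigx, :id)    # copies the data into a new DataFrame
@time meltdf(bigx, :id)  # first call includes compilation time
@time meltdf(bigx, :id); # only creates a view, so almost nothing is allocated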

  0.255109 seconds (172.28 k allocations: 237.679 MiB, 34.60% gc time)
  0.203728 seconds (144 allocations: 228.889 MiB, 53.30% gc time)
  0.386479 seconds (633.47 k allocations: 32.617 MiB, 15.71% gc time)
  0.000075 seconds (117 allocations: 6.453 KiB)
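
The timings show the trade-off: the copying call allocates space for all stacked values, while the view only wraps the existing columns. A small sketch of the same comparison, assuming DataFrames.jl 1.x, where the view is requested with the `view=true` keyword of `stack`. The frame is deliberately small, and the name `small` is illustrative.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x

small = DataFrame(rand(1000, 3), :auto)  # columns x1, x2, x3
small.id = collect(1:1000)

eager = stack(small, Not(:id))             # copies the data into new vectors
lazy  = stack(small, Not(:id); view=true)  # columns are lazy wrappers around `small`

println(size(eager), " ", size(lazy))
```

Both results hold the same values; they differ only in whether the stacked columns are materialized.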


In [8]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.446038 | 0.735251 |
| 2 | 1 | 'b' | 0.508045 | 0.783346 |
| 3 | 1 | 'c' | 0.874669 | 0.724064 |


In [9]:
melt(x)

| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Char |
| 1 | a1 | 0.446038 | 1 | 'a' |
| 2 | a1 | 0.508045 | 1 | 'b' |
| 3 | a1 | 0.874669 | 1 | 'c' |
| 4 | a2 | 0.735251 | 1 | 'a' |
| 5 | a2 | 0.783346 | 1 | 'b' |
| 6 | a2 | 0.724064 | 1 | 'c' |


In [10]:
melt(DataFrame(rand(3,2))) # by default stack and melt treat float columns as value columns

| Row | variable | value |
|---|---|---|
| | Symbol | Float64 |
| 1 | x1 | 0.407512 |
| 2 | x1 | 0.958294 |
| 3 | x1 | 0.993427 |
| 4 | x2 | 0.121015 |
| 5 | x2 | 0.987261 |
| 6 | x2 | 0.438873 |
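
The type-based default is easiest to see on a frame that mixes column types. A minimal sketch, assuming DataFrames.jl 1.x and fixed values instead of `rand` so the result is reproducible:

```julia
using DataFrames  # assumption: DataFrames.jl 1.x

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'],
              a1=[0.1, 0.2, 0.3], a2=[0.4, 0.5, 0.6])

# no columns named: the Float64 columns a1 and a2 are stacked,
# the Int and Char columns are kept as id variables
long = stack(x)

println(long)
```

An integer measurement column would silently stay an id variable here, so naming the measure variables explicitly is the safer habit when column types are not all floats.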


In [11]:
df = DataFrame(rand(3,2))
df[:key] = [1,1,1]
mdf = melt(df) # duplicates in key are silently accepted

| Row | variable | value | key |
|---|---|---|---|
| | Symbol | Float64 | Int64 |
| 1 | x1 | 0.0148016 | 1 |
| 2 | x1 | 0.0783944 | 1 |
| 3 | x1 | 0.794611 | 1 |
| 4 | x2 | 0.113415 | 1 |
| 5 | x2 | 0.966621 | 1 |
| 6 | x2 | 0.0950933 | 1 |


### Long to wide

In [12]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.982001 | 0.765671 |
| 2 | 1 | 'b' | 0.00268151 | 0.780911 |
| 3 | 1 | 'c' | 0.333175 | 0.0896065 |


In [13]:
y = melt(x, [1,2])
display(x)
display(y)

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.982001 | 0.765671 |
| 2 | 1 | 'b' | 0.00268151 | 0.780911 |
| 3 | 1 | 'c' | 0.333175 | 0.0896065 |


| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Char |
| 1 | a1 | 0.982001 | 1 | 'a' |
| 2 | a1 | 0.00268151 | 1 | 'b' |
| 3 | a1 | 0.333175 | 1 | 'c' |
| 4 | a2 | 0.765671 | 1 | 'a' |
| 5 | a2 | 0.780911 | 1 | 'b' |
| 6 | a2 | 0.0896065 | 1 | 'c' |


In [14]:
unstack(y, :id2, :variable, :value) # standard unstack with a unique key

| Row | id2 | a1 | a2 |
|---|---|---|---|
| | Char | Float64⍰ | Float64⍰ |
| 1 | 'a' | 0.982001 | 0.765671 |
| 2 | 'b' | 0.00268151 | 0.780911 |
| 3 | 'c' | 0.333175 | 0.0896065 |


In [15]:
unstack(y, :variable, :value) # all other columns are treated as keys

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64⍰ | Float64⍰ |
| 1 | 1 | 'a' | 0.982001 | 0.765671 |
| 2 | 1 | 'b' | 0.00268151 | 0.780911 |
| 3 | 1 | 'c' | 0.333175 | 0.0896065 |
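
A round trip makes the relationship between the two directions concrete: stacking and then unstacking with all remaining columns as keys gives back the original values. The sketch below assumes DataFrames.jl 1.x and uses fixed numbers in place of `rand` so the result can be checked.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'],
              a1=[0.1, 0.2, 0.3], a2=[0.4, 0.5, 0.6])

long = stack(x, [:a1, :a2], [:id, :id2])

# explicit row key: only id2 identifies a row of the wide table
wide_key = unstack(long, :id2, :variable, :value)

# no row key given: every column except :variable and :value is a key
wide_all = unstack(long, :variable, :value)

println(wide_all)
```

The `⍰` after `Float64` in the outputs above marks columns that allow `missing`; unstacking produces such columns because a key/variable combination may be absent from the long table.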


In [16]:
# by default :id, :variable and :value names are assumed; in this case it produces duplicate keys
unstack(y)

│   caller = top-level scope at In[16]:1
└ @ Core In[16]:1
└ @ DataFrames /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/reshape.jl:244


| Row | id | a1 | a2 |
|---|---|---|---|
| | Int64 | Float64⍰ | Float64⍰ |
| 1 | 1 | 0.333175 | 0.0896065 |
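
In the call above only `id` is used as the row key, and `id` is 1 in every row of `y`, so each output cell has three candidate values; the output shows that the last one is kept. A minimal, library-independent sketch for spotting such collisions before unstacking follows. The helper `key_counts` and the variable names are hypothetical and not part of DataFrames.

```julia
# (id, variable) pairs taken from the long table `y` shown earlier
row_keys = [(1, "a1"), (1, "a1"), (1, "a1"), (1, "a2"), (1, "a2"), (1, "a2")]

# hypothetical helper: count how often each key occurs
function key_counts(ks)
    counts = Dict{eltype(ks),Int}()
    for k in ks
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

counts = key_counts(row_keys)

# keys that occur more than once would collide in the wide table
duplicated = sort([k for (k, n) in counts if n > 1])
println(duplicated)
```

If `duplicated` is non-empty, adding another key column (here `id2`) makes the rows distinguishable again.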


In [17]:
df = stack(DataFrame(rand(3,2)))

| Row | variable | value |
|---|---|---|
| | Symbol | Float64 |
| 1 | x1 | 0.524652 |
| 2 | x1 | 0.990633 |
| 3 | x1 | 0.419322 |
| 4 | x2 | 0.583264 |
| 5 | x2 | 0.0647236 |
| 6 | x2 | 0.0752103 |


In [18]:
unstack(df, :variable, :value) # unable to unstack when no key column is present

ArgumentError: No key column found
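
The error arises because the long table has nothing left to say which values belong to the same row. One way around it, sketched here under the assumption of DataFrames.jl 1.x, is to add a row identifier before stacking. The column name `row` is only illustrative.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x

df = DataFrame(x1=[0.1, 0.2, 0.3], x2=[0.4, 0.5, 0.6])
df.row = collect(1:nrow(df))  # hypothetical key column, one value per row

long = stack(df, [:x1, :x2], :row)

# the key column tells unstack which values belong to the same row
wide = unstack(long, :row, :variable, :value)
println(wide)
```

With the key in place the wide table has one row per original row, and `x1` and `x2` hold the original values.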