# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

### Source

* https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

### See also

* https://deepstat.tistory.com/69 (01. constructors)(in English)
* https://deepstat.tistory.com/70 (01. constructors)(in Korean)
* https://deepstat.tistory.com/71 (02. basicinfo)(in English)
* https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
* https://deepstat.tistory.com/73 (03. missingvalues)(in English)
* https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
* https://deepstat.tistory.com/75 (04. loadsave)(in English)
* https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
* https://deepstat.tistory.com/77 (05. columns)(in English)
* https://deepstat.tistory.com/78 (05. columns)(in Korean)
* https://deepstat.tistory.com/79 (06. rows)(in English)
* https://deepstat.tistory.com/80 (06. rows)(in Korean)
* https://deepstat.tistory.com/81 (07. factors)(in English)
* https://deepstat.tistory.com/82 (07. factors)(in Korean)
* https://deepstat.tistory.com/83 (08. joins)(in English)
* https://deepstat.tistory.com/84 (08. joins)(in Korean)
* https://deepstat.tistory.com/85 (09. reshaping)(in English)
* https://deepstat.tistory.com/86 (09. reshaping)(in Korean)

In [1]:
using DataFrames # load package

## Reshaping DataFrames

### Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

| Row | id | id2 | M1 | M2 |
|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 11 | 111 |
| 2 | 2 | 1 | 12 | 112 |
| 3 | 3 | 2 | 13 | 113 |
| 4 | 4 | 2 | 14 | 114 |


In [3]:
melt(x, :id, [:M1, :M2]) # pass the id variables first, then the measure variables; meltdf creates a view instead of a copy

| Row | variable | value | id |
|---|---|---|---|
| | Symbol | Int64 | Int64 |
| 1 | M1 | 11 | 1 |
| 2 | M1 | 12 | 2 |
| 3 | M1 | 13 | 3 |
| 4 | M1 | 14 | 4 |
| 5 | M2 | 111 | 1 |
| 6 | M2 | 112 | 2 |
| 7 | M2 | 113 | 3 |
| 8 | M2 | 114 | 4 |


In [4]:
# Optionally, the output columns can be renamed. melt and stack do the same job, but their argument order is reversed.
stack(x, [:M1, :M2], :id, variable_name=:key, value_name=:observed) # measure variables come first, then id variables; stackdf creates a view

| Row | key | observed | id |
|---|---|---|---|
| | Symbol | Int64 | Int64 |
| 1 | M1 | 11 | 1 |
| 2 | M1 | 12 | 2 |
| 3 | M1 | 13 | 3 |
| 4 | M1 | 14 | 4 |
| 5 | M2 | 111 | 1 |
| 6 | M2 | 112 | 2 |
| 7 | M2 | 113 | 3 |
| 8 | M2 | 114 | 4 |
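
The cells above use the DataFrames.jl API as it was in 2018. As a hedged sketch, assuming a DataFrames.jl 1.x installation (where, as far as I know, `melt` is no longer available and `stack` covers both uses), the same reshape with renamed output columns could look like this. In that version the id columns are placed first and the `key` column holds strings rather than `Symbol`s, so the printed layout differs from the table above.

```julia
using DataFrames

x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

# Measure variables come first, then id variables (assumes DataFrames.jl 1.x).
long = stack(x, [:M1, :M2], :id; variable_name=:key, value_name=:observed)

# One row per (id, measure variable) pair: 4 ids x 2 measures = 8 rows.
println(size(long))
println(names(long))
```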


In [5]:
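# If the second argument to melt or stack is omitted, all remaining columns are treated as the other group
# (measure variables for melt, id variables for stack).
# If no columns are given at all, only columns with element type <: AbstractFloat are selected as measure variables.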
melt(x, [:id, :id2])

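| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Int64 | Int64 | Int64 |
| 1 | M1 | 11 | 1 | 1 |
| 2 | M1 | 12 | 2 | 1 |
| 3 | M1 | 13 | 3 | 2 |
| 4 | M1 | 14 | 4 | 2 |
| 5 | M2 | 111 | 1 | 1 |
| 6 | M2 | 112 | 2 | 1 |
| 7 | M2 | 113 | 3 | 2 |
| 8 | M2 | 114 | 4 | 2 |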


In [6]:
melt(x, [1, 2]) # column indices can be used instead of column names

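| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Int64 | Int64 | Int64 |
| 1 | M1 | 11 | 1 | 1 |
| 2 | M1 | 12 | 2 | 1 |
| 3 | M1 | 13 | 3 | 2 |
| 4 | M1 | 14 | 4 | 2 |
| 5 | M2 | 111 | 1 | 1 |
| 6 | M2 | 112 | 2 | 1 |
| 7 | M2 | 113 | 3 | 2 |
| 8 | M2 | 114 | 4 | 2 |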


In [7]:
bigx = DataFrame(rand(10^6, 10)) # a large data frame, used to compare building a new DataFrame with creating a view
bigx[:id] = 1:10^6
@time melt(bigx, :id)
@time melt(bigx, :id)
@time meltdf(bigx, :id)
@time meltdf(bigx, :id);

  0.253764 seconds (172.28 k allocations: 237.679 MiB, 33.81% gc time)
  0.201528 seconds (144 allocations: 228.889 MiB, 53.45% gc time)
  0.387475 seconds (633.47 k allocations: 32.617 MiB, 15.52% gc time)
  0.000169 seconds (117 allocations: 6.453 KiB)
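
The timings above show that the view variant allocates almost nothing once compiled, because it does not copy the data. As a hedged sketch, assuming DataFrames.jl 1.x (where, to my knowledge, `meltdf` and `stackdf` were replaced by the `view=true` keyword of `stack`), the copy/view distinction could be reproduced as follows. The data frame is smaller than the one above to keep the example quick, and no timings are claimed.

```julia
using DataFrames

bigx = DataFrame(rand(10^4, 10), :auto)   # columns x1..x10
bigx.id = collect(1:10^4)

# Default: new columns are materialized (the data is copied).
copied = stack(bigx, Not(:id), :id)

# view=true: columns are lazy wrappers around the original data.
viewed = stack(bigx, Not(:id), :id; view=true)

# Both hold the same values; only the storage differs.
println(copied.value == viewed.value)
println(typeof(viewed.value))
```

Because the view shares memory with `bigx`, it is a reasonable choice when the long table is only read, not modified.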


In [8]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.327033 | 0.248548 |
| 2 | 1 | 'b' | 0.347882 | 0.853054 |
| 3 | 1 | 'c' | 0.862527 | 0.489913 |


In [9]:
melt(x)

| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Char |
| 1 | a1 | 0.327033 | 1 | 'a' |
| 2 | a1 | 0.347882 | 1 | 'b' |
| 3 | a1 | 0.862527 | 1 | 'c' |
| 4 | a2 | 0.248548 | 1 | 'a' |
| 5 | a2 | 0.853054 | 1 | 'b' |
| 6 | a2 | 0.489913 | 1 | 'c' |
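
The default selection shown above (only floating-point columns become measure variables) can be checked directly. This is a minimal sketch assuming DataFrames.jl 1.x, where `stack(x)` with no column arguments plays the role of `melt(x)`; the fixed numbers replace `rand(3)` only to make the result reproducible.

```julia
using DataFrames

x = DataFrame(id=[1,1,1], id2=['a','b','c'], a1=[0.1,0.2,0.3], a2=[0.4,0.5,0.6])

# No columns given: the Float64 columns a1 and a2 are stacked,
# while the Int64 and Char columns are kept as id variables.
long = stack(x)

println(names(long))
println(long.value)
```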


In [10]:
melt(DataFrame(rand(3,2))) # by default, stack and melt treat floating-point columns as value columns

| Row | variable | value |
|---|---|---|
| | Symbol | Float64 |
| 1 | x1 | 0.509455 |
| 2 | x1 | 0.873081 |
| 3 | x1 | 0.820428 |
| 4 | x2 | 0.27679 |
| 5 | x2 | 0.422507 |
| 6 | x2 | 0.535333 |


In [11]:
df = DataFrame(rand(3,2))
df[:key] = [1,1,1]
mdf = melt(df) # duplicate values in key are accepted silently

| Row | variable | value | key |
|---|---|---|---|
| | Symbol | Float64 | Int64 |
| 1 | x1 | 0.362025 | 1 |
| 2 | x1 | 0.61307 | 1 |
| 3 | x1 | 0.475587 | 1 |
| 4 | x2 | 0.778 | 1 |
| 5 | x2 | 0.771635 | 1 |
| 6 | x2 | 0.0312133 | 1 |


### Long to wide

In [12]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.675709 | 0.172469 |
| 2 | 1 | 'b' | 0.0652696 | 0.748051 |
| 3 | 1 | 'c' | 0.394419 | 0.452423 |


In [13]:
y = melt(x, [1,2])
display(x)
display(y)

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64 | Float64 |
| 1 | 1 | 'a' | 0.675709 | 0.172469 |
| 2 | 1 | 'b' | 0.0652696 | 0.748051 |
| 3 | 1 | 'c' | 0.394419 | 0.452423 |


| Row | variable | value | id | id2 |
|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Char |
| 1 | a1 | 0.675709 | 1 | 'a' |
| 2 | a1 | 0.0652696 | 1 | 'b' |
| 3 | a1 | 0.394419 | 1 | 'c' |
| 4 | a2 | 0.172469 | 1 | 'a' |
| 5 | a2 | 0.748051 | 1 | 'b' |
| 6 | a2 | 0.452423 | 1 | 'c' |


In [14]:
unstack(y, :id2, :variable, :value) # standard unstack with a single key column

| Row | id2 | a1 | a2 |
|---|---|---|---|
| | Char | Float64⍰ | Float64⍰ |
| 1 | 'a' | 0.675709 | 0.172469 |
| 2 | 'b' | 0.0652696 | 0.748051 |
| 3 | 'c' | 0.394419 | 0.452423 |


In [15]:
unstack(y, :variable, :value) # all remaining columns are treated as keys

| Row | id | id2 | a1 | a2 |
|---|---|---|---|---|
| | Int64 | Char | Float64⍰ | Float64⍰ |
| 1 | 1 | 'a' | 0.675709 | 0.172469 |
| 2 | 1 | 'b' | 0.0652696 | 0.748051 |
| 3 | 1 | 'c' | 0.394419 | 0.452423 |


In [16]:
# By default the column names (:id, :variable, :value) are assumed. Here the keys are duplicated, so a warning is printed and only the last value for each key is kept.
unstack(y)

│   caller = top-level scope at In[16]:1
└ @ Core In[16]:1
└ @ DataFrames /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/reshape.jl:244


| Row | id | a1 | a2 |
|---|---|---|---|
| | Int64 | Float64⍰ | Float64⍰ |
| 1 | 1 | 0.394419 | 0.452423 |

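
Since duplicate keys silently collapse rows in the output above, it can help to check for them before unstacking. The sketch below is one possible approach, not part of the original tutorial. It assumes DataFrames.jl 1.x and uses `nonunique` to test whether the (row key, variable) pairs repeat; the fixed numbers stand in for `rand(3)`.

```julia
using DataFrames

x = DataFrame(id=[1,1,1], id2=['a','b','c'], a1=[0.1,0.2,0.3], a2=[0.4,0.5,0.6])
y = stack(x, [:a1, :a2], [:id, :id2])

# With :id alone as the row key, every (id, variable) pair occurs three times.
dups_with_id = any(nonunique(y, [:id, :variable]))

# Adding :id2 to the row key makes each (id, id2, variable) combination unique.
dups_with_both = any(nonunique(y, [:id, :id2, :variable]))

# Unstack with the key that identifies rows unambiguously.
wide = unstack(y, [:id, :id2], :variable, :value)

println((dups_with_id, dups_with_both))
println(size(wide))
```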

In [17]:
df = stack(DataFrame(rand(3,2)))

| Row | variable | value |
|---|---|---|
| | Symbol | Float64 |
| 1 | x1 | 0.971137 |
| 2 | x1 | 0.0227282 |
| 3 | x1 | 0.0354664 |
| 4 | x2 | 0.42655 |
| 5 | x2 | 0.287566 |
| 6 | x2 | 0.278549 |


In [18]:
unstack(df, :variable, :value) # unstack fails when there is no key column

ArgumentError: ArgumentError: No key column found
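
One way to avoid this error is to give each row an identifier before stacking, so that `unstack` has a key column to work with. The following is a minimal sketch assuming DataFrames.jl 1.x; the column name `row` is a hypothetical helper introduced for illustration, and the fixed numbers replace `rand(3,2)`.

```julia
using DataFrames

df = DataFrame(x1=[0.1, 0.2, 0.3], x2=[0.4, 0.5, 0.6])

# Hypothetical helper column: a running row number that serves as the key.
df.row = collect(1:nrow(df))

long = stack(df, [:x1, :x2], :row)             # wide to long, keeping the key
wide = unstack(long, :row, :variable, :value)  # long to wide, using the key

println(names(wide))
println(wide.x1 == df.x1)
```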