# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

### Reference

* https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

### Series

* https://deepstat.tistory.com/69 (01. constructors) (in English)
* https://deepstat.tistory.com/70 (01. constructors) (in Korean)
* https://deepstat.tistory.com/71 (02. basicinfo) (in English)
* https://deepstat.tistory.com/72 (02. basicinfo) (in Korean)
* https://deepstat.tistory.com/73 (03. missingvalues) (in English)
* https://deepstat.tistory.com/74 (03. missingvalues) (in Korean)
* https://deepstat.tistory.com/75 (04. loadsave) (in English)
* https://deepstat.tistory.com/76 (04. loadsave) (in Korean)
* https://deepstat.tistory.com/77 (05. columns) (in English)
* https://deepstat.tistory.com/78 (05. columns) (in Korean)
* https://deepstat.tistory.com/79 (06. rows) (in English)
* https://deepstat.tistory.com/80 (06. rows) (in Korean)
* https://deepstat.tistory.com/81 (07. factors) (in English)
* https://deepstat.tistory.com/82 (07. factors) (in Korean)
* https://deepstat.tistory.com/83 (08. joins) (in English)
* https://deepstat.tistory.com/84 (08. joins) (in Korean)
* https://deepstat.tistory.com/85 (09. reshaping) (in English)
* https://deepstat.tistory.com/86 (09. reshaping) (in Korean)
* https://deepstat.tistory.com/87 (10. transforms) (in English)
* https://deepstat.tistory.com/88 (10. transforms) (in Korean)
* https://deepstat.tistory.com/89 (11. performance) (in English)
* https://deepstat.tistory.com/90 (11. performance) (in Korean)
* https://deepstat.tistory.com/91 (12. pitfalls) (in English)
* https://deepstat.tistory.com/92 (12. pitfalls) (in Korean)

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5))

│ Row │ x1       │ x2       │ x3       │ x4       │ x5        │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼──────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.674434 │ 0.383204 │ 0.742057 │ 0.607063 │ 0.312016  │
│ 2   │ 0.266033 │ 0.923104 │ 0.404489 │ 0.420966 │ 0.0165625 │
│ 3   │ 0.148082 │ 0.938792 │ 0.819837 │ 0.869153 │ 0.417423  │


In [3]:
y = DataFrame(x)
x === y # no copy was made: x and y are the same object


true

In [4]:
y = copy(x)
x === y # not the same object

false

In [5]:
all(x[i] === y[i] for i in 1:ncol(x)) # but the columns are the same objects

true

In [6]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # the same holds when constructing from arrays or assigning columns (ranges excepted)

│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │


In [7]:
y === df[:y] # the same object

true

In [8]:
typeof(x), typeof(df[:x]) # a range is converted to a vector

(UnitRange{Int64}, Array{Int64,1})
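
The distinction between copying a container and copying its contents can be sketched in plain Julia, without DataFrames, since the copying defaults of the DataFrames constructors depend on the package version. The `Dict` of vectors below is only a hypothetical stand-in for a table and is not how a `DataFrame` is implemented.

```julia
# Hypothetical stand-in for a table: a Dict mapping column names to vectors.
cols = Dict(:x => [1, 2, 3], :y => [4, 5, 6])

shallow = copy(cols)      # a new Dict holding the same vectors
deep = deepcopy(cols)     # a new Dict holding independent vectors

container_shared = shallow === cols          # false: a different container
column_shared = shallow[:x] === cols[:x]     # true: the column is shared
deep_column_shared = deep[:x] === cols[:x]   # false: the column was copied

# A range is not a Vector, so materializing it always allocates a new array.
r = 1:3
v = collect(r)
```

Here `===` tests object identity, which is why it can tell a shared column apart from an equal-valued copy.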

### Do not modify the parent of a `GroupedDataFrame`

In [9]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

GroupedDataFrame with 2 groups based on key: :id
First Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 3     │
│ 3   │ 1     │ 5     │
⋮
Last Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 2     │
│ 2   │ 2     │ 4     │
│ 3   │ 2     │ 6     │

In [10]:
x[1:3, 1]=[2,2,2]
g # the result is now wrong, because g is only a view of x

GroupedDataFrame with 2 groups based on key: :id
First Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 1     │
│ 2   │ 2     │ 3     │
│ 3   │ 1     │ 5     │
⋮
Last Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 2     │
│ 2   │ 2     │ 4     │
│ 3   │ 2     │ 6     │
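
Why the groups go stale can be sketched with plain arrays. This is a minimal illustration that assumes the row indices of each group are computed once at grouping time and not refreshed afterwards; the names `ids`, `vals`, and `group_rows` are hypothetical and this is not the DataFrames implementation.

```julia
ids = [1, 2, 1, 2, 1, 2]
vals = collect(1:6)

# Compute the row indices of each group once, as a grouping step would.
group_rows = Dict(k => findall(==(k), ids) for k in unique(ids))

# Mutate the key column afterwards, as in the cell above.
ids[1:3] = [2, 2, 2]

# The stored indices for key 1 still point at rows 1, 3, and 5,
# but two of those rows now carry key 2.
stale_keys = ids[group_rows[1]]   # [2, 2, 1]
```

Regrouping after any change to the parent avoids the mismatch.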

### Remember that the columns of a `DataFrame` can also be selected with booleans

In [11]:
using Random
Random.seed!(1)
x = DataFrame(rand(5, 5))

│ Row │ x1         │ x2       │ x3       │ x4        │ x5        │
│     │ Float64    │ Float64  │ Float64  │ Float64   │ Float64   │
├─────┼────────────┼──────────┼──────────┼───────────┼───────────┤
│ 1   │ 0.236033   │ 0.210968 │ 0.555751 │ 0.209472  │ 0.0769509 │
│ 2   │ 0.346517   │ 0.951916 │ 0.437108 │ 0.251379  │ 0.640396  │
│ 3   │ 0.312707   │ 0.999905 │ 0.424718 │ 0.0203749 │ 0.873544  │
│ 4   │ 0.00790928 │ 0.251662 │ 0.773223 │ 0.287702  │ 0.278582  │
│ 5   │ 0.488613   │ 0.986666 │ 0.28119  │ 0.859512  │ 0.751313  │


In [12]:
x[x[:x1] .< 0.25] # this selects columns, not rows (it only ran because the number of columns happens to equal the number of rows)

│ Row │ x1         │ x4        │
│     │ Float64    │ Float64   │
├─────┼────────────┼───────────┤
│ 1   │ 0.236033   │ 0.209472  │
│ 2   │ 0.346517   │ 0.251379  │
│ 3   │ 0.312707   │ 0.0203749 │
│ 4   │ 0.00790928 │ 0.287702  │
│ 5   │ 0.488613   │ 0.859512  │


In [13]:
x[x[:x1] .< 0.25, :] # this is probably what we wanted

│ Row │ x1         │ x2       │ x3       │ x4       │ x5        │
│     │ Float64    │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼────────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.236033   │ 0.210968 │ 0.555751 │ 0.209472 │ 0.0769509 │
│ 2   │ 0.00790928 │ 0.251662 │ 0.773223 │ 0.287702 │ 0.278582  │
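
The same ambiguity can be reproduced with a plain matrix, which makes it easier to see why a square shape hides the mistake. This is only a sketch using Base Julia arrays (the names `m`, `mask`, and `wide` are made up for illustration); a matrix requires both axes to be spelled out, so the axis a mask applies to is always explicit.

```julia
m = reshape(collect(1:25), 5, 5)   # 5×5 matrix, filled column by column
mask = [true, false, false, true, false]

rows = m[mask, :]   # keeps rows 1 and 4, all columns
cols = m[:, mask]   # keeps columns 1 and 4, all rows

# On a non-square matrix, a mask of the wrong length is rejected
# instead of silently selecting along the other axis.
wide = zeros(3, 5)
threw = try
    wide[:, [true, false, true]]   # 3-element mask against 5 columns
    false
catch err
    err isa BoundsError
end
```

Because a 5-element mask is valid for both axes of a 5×5 table, only an explicit row or column position makes the intent unambiguous.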


### Selecting a column of a `DataFrame` creates an alias unless it is explicitly copied

In [14]:
x = DataFrame(a=1:3)
x[:b] = x[1] # alias
x[:c] = x[:, 1] # also an alias
x[:d] = x[1][:] # copy
x[:e] = copy(x[1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

│ Row │ a     │ b     │ c     │ d     │ e     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │ 3     │ 3     │


│ Row │ a     │ b     │ c     │ d     │ e     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 100   │ 100   │ 100   │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │ 3     │ 3     │


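
The difference between an alias, a view, and a copy can be checked on a plain vector. This sketch uses only Base Julia; whether a particular `DataFrame` column lookup returns the stored vector or a copy depends on the indexing form and the package version, so that part is deliberately left out.

```julia
a = [1, 2, 3]

b = a            # alias: a second name for the same array
c = a[:]         # indexing with a colon allocates a new array
d = copy(a)      # explicit copy
e = view(a, :)   # view: no copy, reads and writes go through to `a`

a[1] = 100

result = (b[1], c[1], d[1], e[1])   # (100, 1, 1, 100)
```

When in doubt about whether two names share storage, `b === a` answers the question for arrays directly.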
