# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2017**


### Source

* https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

### See also

* https://deepstat.tistory.com/69 (01. constructors)(in English)
* https://deepstat.tistory.com/70 (01. constructors)(in Korean)
* https://deepstat.tistory.com/71 (02. basicinfo)(in English)
* https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
* https://deepstat.tistory.com/73 (03. missingvalues)(in English)
* https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
* https://deepstat.tistory.com/75 (04. loadsave)(in English)
* https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
* https://deepstat.tistory.com/77 (05. columns)(in English)
* https://deepstat.tistory.com/78 (05. columns)(in Korean)
* https://deepstat.tistory.com/79 (06. rows)(in English)
* https://deepstat.tistory.com/80 (06. rows)(in Korean)
* https://deepstat.tistory.com/81 (07. factors)(in English)
* https://deepstat.tistory.com/82 (07. factors)(in Korean)
* https://deepstat.tistory.com/83 (08. joins)(in English)
* https://deepstat.tistory.com/84 (08. joins)(in Korean)

In [1]:
using DataFrames # load package

## Joining DataFrames

### Preparing DataFrames for a join

In [2]:
x = DataFrame(ID=[1,2,3,4,missing], name = ["Alice", "Bob", "Conor", "Dave","Zed"])
y = DataFrame(id=[1,2,5,6,missing], age = [21,22,23,24,99])
println(x)
println(y)

5×2 DataFrame
│ Row │ ID      │ name   │
│     │ Int64⍰  │ String │
├─────┼─────────┼────────┤
│ 1   │ 1       │ Alice  │
│ 2   │ 2       │ Bob    │
│ 3   │ 3       │ Conor  │
│ 4   │ 4       │ Dave   │
│ 5   │ missing │ Zed    │
5×2 DataFrame
│ Row │ id      │ age   │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 21    │
│ 2   │ 2       │ 22    │
│ 3   │ 5       │ 23    │
│ 4   │ 6       │ 24    │
│ 5   │ missing │ 99    │


In [3]:
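rename!(x, :ID=>:id) # the key column must have the same name in both DataFrames before joining

5×2 DataFrame
│ Row │ id      │ name   │
│     │ Int64⍰  │ String │
├─────┼─────────┼────────┤
│ 1   │ 1       │ Alice  │
│ 2   │ 2       │ Bob    │
│ 3   │ 3       │ Conor  │
│ 4   │ 4       │ Dave   │
│ 5   │ missing │ Zed    │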


### Standard joins: inner, left, right, outer, semi, anti

In [4]:
join(x, y, on=:id) # performs an inner join by default; rows whose key is missing are matched too

3×3 DataFrame
│ Row │ id      │ name   │ age   │
│     │ Int64⍰  │ String │ Int64 │
├─────┼─────────┼────────┼───────┤
│ 1   │ 1       │ Alice  │ 21    │
│ 2   │ 2       │ Bob    │ 22    │
│ 3   │ missing │ Zed    │ 99    │
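
The `missing` row is joined because key matching behaves like `isequal` rather than `==`. A minimal plain-Julia sketch of that distinction follows; it only illustrates the comparison semantics and is not the DataFrames implementation. The vectors `x_ids` and `y_ids` are copies of the key columns, introduced here for illustration.

```julia
# `==` propagates missing, so it cannot decide whether two missing keys match.
eq_result = (missing == missing)          # evaluates to missing

# `isequal` treats missing as equal to itself, which is what key matching needs.
iseq_result = isequal(missing, missing)   # evaluates to true

# Illustrative copies of the key columns of x and y.
x_ids = [1, 2, 3, 4, missing]
y_ids = [1, 2, 5, 6, missing]

# Keep the keys of x that have an isequal partner in y (the inner-join keys).
matched = [k for k in x_ids if any(isequal(k), y_ids)]
println(matched)
```

The keys that survive are 1, 2 and `missing`, the same three rows as in the inner join.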


In [5]:
join(x, y, on=:id, kind=:left) # left join

5×3 DataFrame
│ Row │ id      │ name   │ age     │
│     │ Int64⍰  │ String │ Int64⍰  │
├─────┼─────────┼────────┼─────────┤
│ 1   │ 1       │ Alice  │ 21      │
│ 2   │ 2       │ Bob    │ 22      │
│ 3   │ 3       │ Conor  │ missing │
│ 4   │ 4       │ Dave   │ missing │
│ 5   │ missing │ Zed    │ 99      │


In [6]:
join(x, y, on=:id, kind=:right) # right join

5×3 DataFrame
│ Row │ id      │ name    │ age   │
│     │ Int64⍰  │ String⍰ │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ Alice   │ 21    │
│ 2   │ 2       │ Bob     │ 22    │
│ 3   │ missing │ Zed     │ 99    │
│ 4   │ 5       │ missing │ 23    │
│ 5   │ 6       │ missing │ 24    │


In [7]:
join(x, y, on=:id, kind=:outer) # outer join

7×3 DataFrame
│ Row │ id      │ name    │ age     │
│     │ Int64⍰  │ String⍰ │ Int64⍰  │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1       │ Alice   │ 21      │
│ 2   │ 2       │ Bob     │ 22      │
│ 3   │ 3       │ Conor   │ missing │
│ 4   │ 4       │ Dave    │ missing │
│ 5   │ missing │ Zed     │ 99      │
│ 6   │ 5       │ missing │ 23      │
│ 7   │ 6       │ missing │ 24      │
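
Because the keys in this example are unique, the set of keys that each of the four joins above keeps can be sketched with Base set operations. This is a plain-Julia illustration, not the DataFrames implementation; `x_ids` and `y_ids` are illustrative copies of the key columns.

```julia
# Illustrative copies of the key columns of x and y.
x_ids = [1, 2, 3, 4, missing]
y_ids = [1, 2, 5, 6, missing]

# Base set operations compare elements with isequal, so missing matches missing.
inner_keys = intersect(x_ids, y_ids)  # keys present in both tables
left_keys  = x_ids                    # every key of the left table
right_keys = y_ids                    # every key of the right table
outer_keys = union(x_ids, y_ids)      # keys present in either table

println("inner: ", inner_keys)
println("outer: ", outer_keys)
```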


In [8]:
join(x, y, on=:id, kind=:semi) # semi join

3×2 DataFrame
│ Row │ id      │ name   │
│     │ Int64⍰  │ String │
├─────┼─────────┼────────┤
│ 1   │ 1       │ Alice  │
│ 2   │ 2       │ Bob    │
│ 3   │ missing │ Zed    │


In [9]:
join(x, y, on=:id, kind=:anti) # anti join

2×2 DataFrame
│ Row │ id     │ name   │
│     │ Int64⍰ │ String │
├─────┼────────┼────────┤
│ 1   │ 3      │ Conor  │
│ 2   │ 4      │ Dave   │
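
The `join(...; kind=...)` form belongs to the DataFrames.jl release this notebook was run on. Later releases (1.x) provide one function per join kind and, by default, raise an error on `missing` keys unless `matchmissing=:equal` is passed. The sketch below assumes DataFrames.jl 1.x is installed; it will not run on the version used in this notebook.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x, not the version used in this notebook

x = DataFrame(id=[1, 2, 3, 4, missing], name=["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id=[1, 2, 5, 6, missing], age=[21, 22, 23, 24, 99])

# matchmissing=:equal asks for missing keys to match each other, as in the outputs above.
inner = innerjoin(x, y, on=:id, matchmissing=:equal)
left  = leftjoin(x, y, on=:id, matchmissing=:equal)
right = rightjoin(x, y, on=:id, matchmissing=:equal)
outer = outerjoin(x, y, on=:id, matchmissing=:equal)
semi  = semijoin(x, y, on=:id, matchmissing=:equal)
anti  = antijoin(x, y, on=:id, matchmissing=:equal)

# Compare row counts only; row order may differ between versions.
row_counts = map(nrow, (inner, left, right, outer, semi, anti))
println(row_counts)
```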


### Cross join

In [10]:
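# A cross join does not require the `on` argument.
# It produces the Cartesian product of its arguments.
# Note: the original version indexed xs[1], which on Julia 1.x returns the values rather than
# a name => values pair and raised
# `ArgumentError: unable to construct DataFrame from Pair{Int64,Int64}`.
function expand_grid(; xs...) # a simple version of R's expand.grid
    cols = collect(xs) # vector of name => values pairs
    # foldl fixes the left-to-right order; the Julia 0.6 form reduce(op, v0, itr) no longer exists
    foldl((df, col) -> join(df, DataFrame(Pair(col...)), kind=:cross),
          cols[2:end], init=DataFrame(Pair(cols[1]...)))
end

expand_grid(a=[1,2], b=["a","b","c"], c=[true,false])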
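
The Cartesian product behind a cross join can be sketched in plain Julia with `Iterators.product`. This only illustrates the idea and is not how DataFrames builds the result; the variable names are made up for the example, and the order of the combinations may differ from the row order of a cross join.

```julia
# Illustrative inputs mirroring the expand_grid call above.
a = [1, 2]
b = ["a", "b", "c"]
c = [true, false]

# Iterators.product yields every combination as a tuple; vec flattens the 2×3×2 array.
combos = vec(collect(Iterators.product(a, b, c)))

# One combination per row of the cross join: 2 * 3 * 2 of them.
println(length(combos))
```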

In [11]:
?reduce

search: [0m[1mr[22m[0m[1me[22m[0m[1md[22m[0m[1mu[22m[0m[1mc[22m[0m[1me[22m map[0m[1mr[22m[0m[1me[22m[0m[1md[22m[0m[1mu[22m[0m[1mc[22m[0m[1me[22m



```
reduce(op, itr; [init])
```

Reduce the given collection `itr` with the given binary operator `op`. If provided, the initial value `init` must be a neutral element for `op` that will be returned for empty collections. It is unspecified whether `init` is used for non-empty collections.

For empty collections, providing `init` will be necessary, except for some special cases (e.g. when `op` is one of `+`, `*`, `max`, `min`, `&`, `|`) when Julia can determine the neutral element of `op`.

Reductions for certain commonly-used operators may have special implementations, and should be used instead: `maximum(itr)`, `minimum(itr)`, `sum(itr)`, `prod(itr)`,  `any(itr)`, `all(itr)`.

The associativity of the reduction is implementation dependent. This means that you can't use non-associative operations like `-` because it is undefined whether `reduce(-,[1,2,3])` should be evaluated as `(1-2)-3` or `1-(2-3)`. Use [`foldl`](@ref) or [`foldr`](@ref) instead for guaranteed left or right associativity.

Some operations accumulate error. Parallelism will be easier if the reduction can be executed in groups. Future versions of Julia might change the algorithm. Note that the elements are not reordered if you use an ordered collection.

# Examples

```jldoctest
julia> reduce(*, [2; 3; 4])
24

julia> reduce(*, [2; 3; 4]; init=-1)
-24
```

---

```
reduce(f, A; dims=:, [init])
```

Reduce 2-argument function `f` along dimensions of `A`. `dims` is a vector specifying the dimensions to reduce, and the keyword argument `init` is the initial value to use in the reductions. For `+`, `*`, `max` and `min` the `init` argument is optional.

The associativity of the reduction is implementation-dependent; if you need a particular associativity, e.g. left-to-right, you should write your own loop or consider using [`foldl`](@ref) or [`foldr`](@ref). See documentation for [`reduce`](@ref).

# Examples

```jldoctest
julia> a = reshape(Vector(1:16), (4,4))
4×4 Array{Int64,2}:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> reduce(max, a, dims=2)
4×1 Array{Int64,2}:
 13
 14
 15
 16

julia> reduce(max, a, dims=1)
1×4 Array{Int64,2}:
 4  8  12  16
```
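
The associativity caveat in the help text matters whenever the operator is order-sensitive, as a join is (argument order determines column order). A small sketch of `foldl` and `foldr`, which the help text recommends when the grouping must be fixed:

```julia
# reduce makes no promise about grouping, so use foldl/foldr for non-associative operators.
left_assoc  = foldl(-, [1, 2, 3])   # (1 - 2) - 3
right_assoc = foldr(-, [1, 2, 3])   # 1 - (2 - 3)

# An explicit starting value is passed with the init keyword.
with_init = foldl(-, [1, 2, 3]; init=10)   # ((10 - 1) - 2) - 3

println((left_assoc, right_assoc, with_init))
```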


### Complex cases of joins

In [12]:
x = DataFrame(id1=[1,1,2,2,missing,missing],
              id2=[1,11,2,21,missing,99],
              name = ["Alice", "Bob", "Conor", "Dave","Zed", "Zoe"])
y = DataFrame(id1=[1,1,3,3,missing,missing],
              id2=[11,1,31,3,missing,999],
              age = [21,22,23,24,99, 100])
println(x)
println(y)

6×3 DataFrame
│ Row │ id1     │ id2     │ name   │
│     │ Int64⍰  │ Int64⍰  │ String │
├─────┼─────────┼─────────┼────────┤
│ 1   │ 1       │ 1       │ Alice  │
│ 2   │ 1       │ 11      │ Bob    │
│ 3   │ 2       │ 2       │ Conor  │
│ 4   │ 2       │ 21      │ Dave   │
│ 5   │ missing │ missing │ Zed    │
│ 6   │ missing │ 99      │ Zoe    │
6×3 DataFrame
│ Row │ id1     │ id2     │ age   │
│     │ Int64⍰  │ Int64⍰  │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ 11      │ 21    │
│ 2   │ 1       │ 1       │ 22    │
│ 3   │ 3       │ 31      │ 23    │
│ 4   │ 3       │ 3       │ 24    │
│ 5   │ missing │ missing │ 99    │
│ 6   │ missing │ 999     │ 100   │


In [13]:
join(x, y, on=[:id1, :id2]) # join on two columns

3×4 DataFrame
│ Row │ id1     │ id2     │ name   │ age   │
│     │ Int64⍰  │ Int64⍰  │ String │ Int64 │
├─────┼─────────┼─────────┼────────┼───────┤
│ 1   │ 1       │ 1       │ Alice  │ 22    │
│ 2   │ 1       │ 11      │ Bob    │ 21    │
│ 3   │ missing │ missing │ Zed    │ 99    │
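
Joining on two columns can be pictured as matching on the tuple `(id1, id2)`. The plain-Julia sketch below pairs up rows whose key tuples are `isequal`; it illustrates the matching rule only, not the DataFrames implementation, and all variable names are introduced for the example.

```julia
# Illustrative copies of the columns of x and y.
x_keys  = collect(zip([1, 1, 2, 2, missing, missing], [1, 11, 2, 21, missing, 99]))
x_names = ["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
y_keys  = collect(zip([1, 1, 3, 3, missing, missing], [11, 1, 31, 3, missing, 999]))
y_ages  = [21, 22, 23, 24, 99, 100]

# Pair every row of x with every row of y whose whole key tuple is isequal.
matches = [(x_names[i], y_ages[j])
           for i in eachindex(x_keys) for j in eachindex(y_keys)
           if isequal(x_keys[i], y_keys[j])]

println(matches)
```

Zoe's key `(missing, 99)` has no partner because only the first component matches.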


In [14]:
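join(x, y, on=[:id1], makeunique=true) # with duplicate keys, every combination of matching rows is produced (an inner join in this example)

8×5 DataFrame
│ Row │ id1     │ id2     │ name   │ id2_1   │ age   │
│     │ Int64⍰  │ Int64⍰  │ String │ Int64⍰  │ Int64 │
├─────┼─────────┼─────────┼────────┼─────────┼───────┤
│ 1   │ 1       │ 1       │ Alice  │ 11      │ 21    │
│ 2   │ 1       │ 1       │ Alice  │ 1       │ 22    │
│ 3   │ 1       │ 11      │ Bob    │ 11      │ 21    │
│ 4   │ 1       │ 11      │ Bob    │ 1       │ 22    │
│ 5   │ missing │ missing │ Zed    │ missing │ 99    │
│ 6   │ missing │ missing │ Zed    │ 999     │ 100   │
│ 7   │ missing │ 99      │ Zoe    │ missing │ 99    │
│ 8   │ missing │ 99      │ Zoe    │ 999     │ 100   │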
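
The eight rows follow from counting, for each row of `x`, how many rows of `y` share its `id1`. A plain-Julia sketch of that count, using illustrative copies of the two `id1` columns:

```julia
# Illustrative copies of the id1 columns of x and y.
x_id1 = [1, 1, 2, 2, missing, missing]
y_id1 = [1, 1, 3, 3, missing, missing]

# For each left row, count the right rows with an isequal key.
matches_per_row = [count(isequal(k), y_id1) for k in x_id1]

# An inner join emits one output row per matching pair.
total_rows = sum(matches_per_row)
println(total_rows)
```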


In [15]:
join(x, y, on=[:id1], kind=:semi) # the semi join is the exception: it does not produce every combination

4×3 DataFrame
│ Row │ id1     │ id2     │ name   │
│     │ Int64⍰  │ Int64⍰  │ String │
├─────┼─────────┼─────────┼────────┤
│ 1   │ 1       │ 1       │ Alice  │
│ 2   │ 1       │ 11      │ Bob    │
│ 3   │ missing │ missing │ Zed    │
│ 4   │ missing │ 99      │ Zoe    │
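
A semi join behaves like a filter on the left table: each row of `x` is kept at most once, however many rows of `y` share its key. A plain-Julia sketch of that filter, with illustrative copies of the columns:

```julia
# Illustrative copies of the columns of x and the key column of y.
x_id1  = [1, 1, 2, 2, missing, missing]
x_name = ["Alice", "Bob", "Conor", "Dave", "Zed", "Zoe"]
y_id1  = [1, 1, 3, 3, missing, missing]

# Keep a left row if at least one right row has an isequal key.
keep = [any(isequal(k), y_id1) for k in x_id1]
semi_names = x_name[keep]

println(semi_names)
```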
