# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 5, 2017**

A brief introduction to the basic usage of `DataFrames`. Tested against `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves.

In [1]:
using DataFrames # load package

## Handling missing values

In [2]:
missing, typeof(missing) # singleton type

(missing, Missings.Missing)

In [3]:
x = [1, 2, missing, 3] # arrays automatically create an appropriate union type

4-element Array{Union{Int64, Missings.Missing},1}:
 1       
 2       
  missing
 3       

In [4]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x) # check if a value is missing; use the dot form for an element-wise check

(false, true, false, Bool[false, false, true, false])

In [5]:
eltype(x), Missings.T(eltype(x)) # extract the type combined with Missing (useful for arrays)

(Union{Int64, Missings.Missing}, Int64)

In [6]:
missing == missing, missing != missing, missing < missing # missing comparisons produce missing

(missing, missing, missing)

In [7]:
1 == missing, 1 != missing, 1 < missing # the same with values of other types

(missing, missing, missing)

In [8]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing) # these always return a Bool

(true, true, false, true)
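
Because `==` can return `missing`, a comparison result cannot always be used directly as a condition or a Boolean mask. The sketch below shows two common workarounds. It assumes Julia 1.x, where `missing`, `ismissing` and `coalesce` are part of Base, and was not run under the 2017 setup used for the cells above; the variable names are only illustrative.

```julia
# Assumption: Julia 1.x (missing support in Base, no extra packages needed).
x = [1, 2, missing, 3]

# Element-wise `==` propagates missing, so the mask is not purely Boolean.
mask = x .== 2

# Option 1: treat "unknown" as false by replacing missing in the mask.
safe_mask = coalesce.(mask, false)

# Option 2: `isequal` never returns missing, so it can be used directly.
strict_mask = isequal.(x, 2)

# Either mask can now be used for indexing.
selected = x[safe_mask]
```

Which option fits depends on whether an unknown comparison should count as "not equal" (both options above) or should be handled separately, for example by checking `ismissing.(mask)` first.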

In [9]:
map(x -> x(missing), [sin, cos, zero, sqrt]) # many (not all) functions handle missing

4-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing

In [10]:
map(x -> x(missing, 1), [+, - , *, /, div]) # part 2: binary arithmetic operators propagate missing as well

5-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing
 missing

In [11]:
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, any, float]) # part 3: functions applied to an array containing missing

6-element Array{Any,1}:
 missing                                            
 missing                                            
 (missing, missing)                                 
 missing                                            
 missing                                            
 Union{Float64, Missings.Missing}[1.0, 2.0, missing]

In [12]:
collect(skipmissing([1, missing, 2, missing])) # skipmissing returns an iterator that skips missing values

2-element Array{Int64,1}:
 1
 2
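
The `skipmissing` iterator is most useful when passed straight to a reduction, without collecting it first. The following is a minimal sketch under the assumption of Julia 1.x, where `skipmissing` is part of Base; it was not run under the 2017 setup used for the cells above.

```julia
# Assumption: Julia 1.x (skipmissing and ismissing are in Base).
x = [1, missing, 2, missing]

propagated = sum(x)                 # the reduction propagates missing
total = sum(skipmissing(x))         # sum of the non-missing values only
largest = maximum(skipmissing(x))   # largest non-missing value
n_missing = count(ismissing, x)     # how many entries were skipped
```

Keeping `n_missing` next to the reduced value makes it easy to see how much of the data the result is actually based on.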

In [13]:
collect(Missings.replace([1.0, missing, 2.0, missing], NaN)) # similar, but replaces missing values (here with NaN) instead of skipping them

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  
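
A similar replacement can be written with a broadcast `coalesce`, which returns its first non-missing argument. This is only a sketch: it assumes Julia 1.x, where `coalesce` is available in Base, and is offered as an alternative spelling rather than as what `Missings.replace` does internally.

```julia
# Assumption: Julia 1.x (coalesce is in Base and works element-wise via broadcasting).
x = [1.0, missing, 2.0, missing]

# Replace every missing entry with NaN; non-missing entries are kept as they are.
filled = coalesce.(x, NaN)

# The result no longer needs to allow missing values.
filled_eltype = eltype(filled)
```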

In [14]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing]) # get unique values with or without missings

(Any[1, missing, 2], [1, 2])

In [15]:
x = [1,2,3]
y = allowmissing(x) # convert to type accepting missings
z = disallowmissing(y) # and back; this would fail if y contained a missing value
x,y,z

([1, 2, 3], Union{Int64, Missings.Missing}[1, 2, 3], [1, 2, 3])

In [16]:
x = DataFrame(Int, 2, 3)
showcols(x)
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("\n\nAfter: ", eltypes(x))

2×3 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values  │
├───────┼──────┼────────┼─────────┼─────────┤
│ 1     │ x1   │ Int64  │ 0       │ 0  …  0 │
│ 2     │ x2   │ Int64  │ 0       │ 0  …  0 │
│ 3     │ x3   │ Int64  │ 0       │ 0  …  0 │

After: Type[Union{Int64, Missings.Missing}, Int64, Union{Int64, Missings.Missing}]


In [17]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println("Complete cases:\n", completecases(x)) # find rows with all complete data
y = dropmissing(x) # return a new DataFrame without the rows that contain missing values
dropmissing!(x) # the same but in-place
[x, y]

Complete cases:
Bool[true, false, false, true]


2-element Array{DataFrames.DataFrame,1}:
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │