# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

In [1]:
using DataFrames # load package

## Handling missing values

The singleton type `Missings.Missing` allows us to deal with missing values. Its only instance is `missing`.

In [2]:
missing, typeof(missing)

(missing, Missings.Missing)

Arrays automatically create an appropriate union type.

In [3]:
x = [1, 2, missing, 3]

4-element Array{Union{Int64, Missings.Missing},1}:
 1       
 2       
  missing
 3       

`ismissing` checks if a variable is missing.

In [4]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)

(false, true, false, Bool[false, false, true, false])
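Because `ismissing` is an ordinary predicate, it can also be passed to the usual higher-order functions. The following is a small sketch, assuming Julia 1.x (where `missing` and `ismissing` live in Base and `findall` is available); the variable names are only illustrative.

```julia
# Assumes Julia 1.x, where `missing` and `ismissing` are part of Base.
x = [1, 2, missing, 3]

n_missing = count(ismissing, x)        # number of missing entries
where_missing = findall(ismissing, x)  # indices of the missing entries
has_missing = any(ismissing, x)        # true if at least one entry is missing

println((n_missing, where_missing, has_missing))
```

Passing the predicate directly avoids building the intermediate `Bool` array that `ismissing.(x)` would create.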

We can extract the type combined with `Missing` from a `Union` via `Missings.T`.

(This is useful for arrays!)

In [5]:
eltype(x), Missings.T(eltype(x))

(Union{Int64, Missings.Missing}, Int64)
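To illustrate why stripping `Missing` from an element type is useful for arrays, here is a hedged sketch for newer setups. It assumes Julia 1.3 or later, where Base exports `nonmissingtype` with the same purpose as `Missings.T` above; the fill value `0` is an arbitrary choice for the example.

```julia
# Assumes Julia 1.3 or later, where `nonmissingtype` is exported from Base.
x = [1, 2, missing, 3]

T = nonmissingtype(eltype(x))          # element type with Missing stripped
filled = T[coalesce(v, 0) for v in x]  # typed comprehension: a plain Vector{T}

println((T, filled))
```

The typed comprehension guarantees that the result cannot hold `missing`, which a plain comprehension over `x` would not promise on its own.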

`missing` comparisons produce `missing`.

In [6]:
missing == missing, missing != missing, missing < missing

(missing, missing, missing)

This is also true when `missing`s are compared with values of other types.

In [7]:
1 == missing, 1 != missing, 1 < missing

(missing, missing, missing)

`isequal`, `isless`, and `===` produce results of type `Bool`.

In [8]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)

(true, true, false, true)
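The difference between `==` and `isequal` matters in practice when we build a mask to select elements. The sketch below assumes Julia 1.x; it also shows that, since `isless(1, missing)` is `true`, sorting places `missing` after all other values.

```julia
# Assumes Julia 1.x, where `missing` is part of Base.
x = [1, 2, missing, 2]

mask_eq = x .== 2               # elements are Bool or missing
mask_isequal = isequal.(x, 2)   # elements are always Bool

# A mask that contains `missing` cannot be used for indexing,
# so we select with the isequal-based mask.
selected = x[mask_isequal]

# isless treats missing as greater than any other value, so it sorts last.
sorted = sort([3, missing, 1])

println((mask_eq, mask_isequal, selected, sorted))
```

An alternative with the same effect is `coalesce.(x .== 2, false)`, which turns the `missing` entries of the mask into `false`.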

In the next few examples, we see that many (not all) functions handle `missing`.

In [9]:
map(x -> x(missing), [sin, cos, zero, sqrt]) # part 1

4-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing

In [10]:
map(x -> x(missing, 1), [+, - , *, /, div]) # part 2 

5-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing
 missing

In [11]:
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, any, float]) # part 3

6-element Array{Any,1}:
 missing                                            
 missing                                            
 (missing, missing)                                 
 missing                                            
 missing                                            
 Union{Float64, Missings.Missing}[1.0, 2.0, missing]
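For a function that does not handle `missing` on its own, one option is to wrap it so that a `missing` argument is passed through untouched. This is a minimal sketch, assuming Julia 1.x; `lift_missing` and `safe_upper` are hypothetical names introduced only for this illustration, and `uppercase` is used as an example of a function that, to our knowledge, has no method for `missing`.

```julia
# Hypothetical helper (not part of any package): wrap f so that
# a missing argument short-circuits to missing instead of calling f.
lift_missing(f) = v -> ismissing(v) ? missing : f(v)

safe_upper = lift_missing(uppercase)
result = safe_upper.(["a", missing, "c"])

println(result)
```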

`skipmissing` returns an iterator that skips missing values. We can use `collect` and `skipmissing` to create an array that excludes these missing values.

In [12]:
collect(skipmissing([1, missing, 2, missing]))

2-element Array{Int64,1}:
 1
 2
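When only an aggregate is needed, the iterator returned by `skipmissing` can be reduced directly, without `collect`. The sketch below assumes Julia 1.x, where `mean` comes from the `Statistics` standard library.

```julia
using Statistics  # provides `mean` in Julia 1.x (assumption about the setup)

x = [1, missing, 2, missing]

total = sum(skipmissing(x))        # 3
largest = maximum(skipmissing(x))  # 2
average = mean(skipmissing(x))     # 1.5

println((total, largest, average))
```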

Similarly, here we use `collect` to create an array that replaces all missing values with `NaN`.

In [13]:
collect(Missings.replace([1.0, missing, 2.0, missing], NaN))

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

Another way to do this:

In [14]:
coalesce.([1.0, missing, 2.0, missing], NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

Caution: `nothing` would also be replaced here.

In [15]:
coalesce.([1.0, missing, nothing, missing], NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
 NaN  
 NaN  
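`coalesce` is not limited to a constant fallback: it returns the first argument that is not `missing`, so it can also fill gaps in one vector from another. A short sketch, assuming Julia 1.x; the vectors `primary` and `backup` are made up for the illustration.

```julia
# Assumes Julia 1.x, where `coalesce` is part of Base.
primary = [1, missing, 3]
backup = [10, 20, 30]

merged = coalesce.(primary, backup)  # take `backup` only where `primary` is missing
first_known = coalesce(missing, missing, 3)

println((merged, first_known))
```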

You can use `recode` if you have homogeneous output types.

In [16]:
recode([1.0, missing, 2.0, missing], missing=>NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

You can use `unique` or `levels` to get unique values with or without missings, respectively.

In [17]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing])

(Any[1, missing, 2], [1, 2])

In this next example, we convert `x` to `y` with `allowmissing`, where `y` has a type that accepts missings.

In [None]:
x = [1,2,3]
y = allowmissing(x)

Then, we convert back with `disallowmissing`. This would fail if `y` contained missing values!

In [18]:
z = disallowmissing(y)
x,y,z

([1, 2, 3], Union{Int64, Missings.Missing}[1, 2, 3], [1, 2, 3])
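To see the failure mode mentioned above without relying on a particular package version, we can mimic the two conversions with plain `convert` calls. This is only a rough Base-only analogue, assuming Julia 1.x; it is not a claim about how `allowmissing` and `disallowmissing` are implemented.

```julia
# Rough Base-only analogue of allowmissing / disallowmissing (Julia 1.x assumed).
x = [1, 2, 3]
y = convert(Vector{Union{Int, Missing}}, x)  # widened: may now hold missing
z = convert(Vector{Int}, y)                  # narrows fine while no missing is present

y[2] = missing                               # allowed after widening

# Narrowing now fails because `missing` cannot be converted to Int.
converted_ok = try
    convert(Vector{Int}, y)
    true
catch
    false
end

println((z, converted_ok))
```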

In this next example, we show that the type of each column in `x` is initially `Int64`. After using `allowmissing!` to accept missing values in columns 1 and 3, the types of those columns become `Union`s of `Int64` and `Missings.Missing`.

In [19]:
x = DataFrame(Int, 2, 3)
println("Before: ", eltypes(x))
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("After: ", eltypes(x))

Before: Type[Int64, Int64, Int64]
After: Type[Union{Int64, Missings.Missing}, Int64, Union{Int64, Missings.Missing}]


In this next example, we'll use `completecases` to find all the rows of a `DataFrame` that have complete data.

In [None]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println(x)
println("Complete cases:\n", completecases(x))

4×2 DataFrames.DataFrame
│ Row │ A       │ B       │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ A       │
│ 2   │ missing │ B       │
│ 3   │ 3       │ missing │
│ 4   │ 4       │ C       │
Complete cases:
Bool[true, false, false, true]

We can use `dropmissing` or `dropmissing!` to remove the rows with incomplete data from a `DataFrame` and either create a new `DataFrame` or mutate the original in-place.

In [20]:
y = dropmissing(x)
dropmissing!(x)
[x, y]

2-element Array{DataFrames.DataFrame,1}:
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │
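The mask returned by `completecases` can also be used for row indexing, and `dropmissing` can be restricted to selected columns. The sketch below assumes a recent DataFrames.jl release (1.x), whose API differs in places from the version used in this notebook; `df`, `only_a`, and `complete` are names chosen for the example.

```julia
using DataFrames  # assumption: a recent DataFrames.jl 1.x release

df = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])

# Drop only the rows where column A is missing; B may still contain missing.
only_a = dropmissing(df, :A)

# Selecting rows with the completecases mask keeps the fully observed rows.
complete = df[completecases(df), :]

println(only_a)
println(complete)
```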

Calling `showcols` on a `DataFrame` after dropping the rows with missing values shows that the columns still allow missing values.

In [21]:
showcols(x)

2×2 DataFrames.DataFrame
│ Col # │ Name │ Eltype                          │ Missing │ Values  │
├───────┼──────┼─────────────────────────────────┼─────────┼─────────┤
│ 1     │ A    │ Union{Int64, Missings.Missing}  │ 0       │ 1  …  4 │
│ 2     │ B    │ Union{Missings.Missing, String} │ 0       │ A  …  C │

Since we've excluded missing values, we can safely use `disallowmissing!` so that the columns will no longer accept missing values.

In [22]:
disallowmissing!(x)
showcols(x)

2×2 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values  │
├───────┼──────┼────────┼─────────┼─────────┤
│ 1     │ A    │ Int64  │ 0       │ 1  …  4 │
│ 2     │ B    │ String │ 0       │ A  …  C │