# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), October 4, 2018**

In [1]:
using DataFrames # load package

## Handling missing values

The singleton type `Missing`, whose only instance is `missing`, allows us to deal with missing values.

In [2]:
missing, typeof(missing)

(missing, Missing)

An array literal that contains `missing` automatically gets an appropriate `Union` element type.

In [3]:
x = [1, 2, missing, 3]

4-element Array{Union{Missing, Int64},1}:
 1       
 2       
  missing
 3       

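`ismissing` checks whether the value passed to it is `missing`. Applied to a whole array it returns `false`, so we broadcast it with `ismissing.(x)` to test each element.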

In [4]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)

(false, true, false, Bool[false, false, true, false])
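Because `ismissing` is an ordinary predicate, it can also be passed to functions from Base such as `count`, `findall`, and `any`. The following is a minimal sketch; the vector `readings` and the variable names are made up for illustration.

```julia
# Hypothetical sample data; any vector that may contain `missing` works.
readings = [1.5, missing, 2.0, missing, 3.5]

# `count` with a predicate counts the elements for which it returns `true`.
n_missing = count(ismissing, readings)

# `findall` returns the indices of those elements.
missing_positions = findall(ismissing, readings)

# `any` tells us whether there is at least one missing value.
has_missing = any(ismissing, readings)
```

Here `n_missing` is 2 and `missing_positions` is `[2, 4]`.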

We can extract the type that is combined with `Missing` in a `Union` via `Missings.T`.

(This is useful for arrays!)

In [5]:
eltype(x), Missings.T(eltype(x))

(Union{Missing, Int64}, Int64)
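One place where the stripped type is handy is when allocating a container for the non-missing values only. The sketch below assumes Julia 1.3 or later, where Base provides `nonmissingtype` with the same purpose as `Missings.T`; the names `v` and `buffer` are illustrative.

```julia
v = [1, 2, missing, 3]

# Element type of `v` with `Missing` removed.
# Assumption: `nonmissingtype` is available (Base, Julia 1.3+).
T = nonmissingtype(eltype(v))

# Allocate an empty vector that cannot hold `missing` at all.
buffer = Vector{T}(undef, 0)

# Copy over only the values that are present.
for item in v
    ismissing(item) || push!(buffer, item)
end
```

After the loop, `buffer` holds the three integers and its element type no longer includes `Missing`.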

Comparisons involving `missing` produce `missing`.

In [6]:
missing == missing, missing != missing, missing < missing

(missing, missing, missing)

This is also true when `missing`s are compared with values of other types.

In [7]:
1 == missing, 1 != missing, 1 < missing

(missing, missing, missing)

`isequal`, `isless`, and `===` always produce results of type `Bool`. Notice that `isless` treats `missing` as greater than any numeric value.

In [8]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)

(true, true, false, true)
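Two practical consequences follow from these rules. A comparison such as `v == 1` may return `missing`, which cannot be used as a condition in `if`, so we need to decide how to treat the unknown case. And since sorting relies on `isless`, missing values end up last. The helper `is_one` below is hypothetical and only illustrates one possible choice (treat unknown as `false`).

```julia
# `v == 1` is `missing` when `v` is missing; `coalesce` replaces that
# unknown result with `false`. Hypothetical helper for illustration.
is_one(v) = coalesce(v == 1, false)

flags = [is_one(v) for v in [1, missing, 2]]

# `sort` orders elements with `isless`, so `missing` is placed last.
sorted = sort([3, missing, 1])
```

Note that comparing `sorted` with an expected vector should use `isequal`, because `==` would again return `missing`.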

In the next few examples, we see that many (not all) functions handle `missing`.

In [9]:
map(x -> x(missing), [sin, cos, zero, sqrt]) # part 1

4-element Array{Missing,1}:
 missing
 missing
 missing
 missing

In [10]:
map(x -> x(missing, 1), [+, -, *, /, div]) # part 2

5-element Array{Missing,1}:
 missing
 missing
 missing
 missing
 missing

In [11]:
using Statistics # needed for mean
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, float]) # part 3

5-element Array{Any,1}:
 missing                                   
 missing                                   
 (missing, missing)                        
 missing                                   
 Union{Missing, Float64}[1.0, 2.0, missing]

`skipmissing` returns an iterator that skips missing values. We can combine `collect` and `skipmissing` to create an array that excludes these missing values.

In [12]:
collect(skipmissing([1, missing, 2, missing]))

2-element Array{Int64,1}:
 1
 2
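Collecting is not required when we only need a summary: reductions such as `sum` and `mean` accept the `skipmissing` iterator directly. A minimal sketch, with an illustrative vector `data`:

```julia
using Statistics # provides `mean`

data = [1, missing, 2, missing]

# Without skipping, the missing values propagate into the result.
propagated = mean(data)

# With `skipmissing`, only the values that are present are used.
avg = mean(skipmissing(data))
total = sum(skipmissing(data))
```

Here `propagated` is `missing`, while `avg` and `total` are computed from the values 1 and 2 only.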

Similarly, here we combine `collect` and `Missings.replace` to create an array that replaces all missing values with some value (`NaN` in this case).

In [13]:
collect(Missings.replace([1.0, missing, 2.0, missing], NaN))

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

Another way to do this is to broadcast `coalesce`, which returns its first non-missing argument:

In [14]:
coalesce.([1.0, missing, 2.0, missing], NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  
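`coalesce` accepts more than two arguments, so we can chain several fallbacks in one broadcast. In this sketch the vectors `primary` and `fallback` are made-up data, and `0.0` is an arbitrary last-resort default.

```julia
primary  = [1.0, missing, missing]
fallback = [10.0, 20.0, missing]

# For each position, take the first non-missing value among
# `primary`, `fallback`, and the constant 0.0.
filled = coalesce.(primary, fallback, 0.0)
```

The result takes the first element from `primary`, the second from `fallback`, and the third from the default.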

You can use `recode` if you have homogeneous output types.

In [15]:
recode([1.0, missing, 2.0, missing], missing=>NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

You can use `unique` or `levels` to get unique values with or without missings, respectively.

In [16]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing])

(Union{Missing, Int64}[1, missing, 2], [1, 2])

In this next example, we convert `x` to `y` with `allowmissing`, where `y` has a type that accepts missings.

In [17]:
x = [1,2,3]
y = allowmissing(x)

3-element Array{Union{Missing, Int64},1}:
 1
 2
 3

Then, we convert back with `disallowmissing`. This would fail if `y` contained missing values!

In [18]:
z = disallowmissing(y)
x,y,z

([1, 2, 3], Union{Missing, Int64}[1, 2, 3], [1, 2, 3])
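To see why narrowing the element type fails when a `missing` is present, we can try the analogous conversion with plain `convert` from Base. This is only a sketch of the idea and not how `disallowmissing` is implemented; `try_narrow` is a hypothetical helper, and `nonmissingtype` assumes Julia 1.3 or later.

```julia
# Hypothetical helper: return a vector whose element type excludes
# `Missing`, or `nothing` when an actual `missing` blocks the conversion.
function try_narrow(v)
    try
        return convert(Vector{nonmissingtype(eltype(v))}, v)
    catch
        # A `missing` cannot be converted to the narrower element type.
        return nothing
    end
end

ok  = try_narrow(Union{Missing, Int}[1, 2, 3]) # no missing values present
bad = try_narrow([1, missing, 3])              # contains a missing value
```

The first call succeeds because every element fits the narrower type; the second call returns `nothing`.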

In this next example, we show that the type of each column in `x` is initially `Int64`. After using `allowmissing!` to accept missing values in columns 1 and 3, the types of those columns become `Union{Missing, Int64}`.

In [19]:
x = DataFrame(Int, 2, 3)
println("Before: ", eltypes(x))
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("After: ", eltypes(x))

Before: Type[Int64, Int64, Int64]
After: Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]


In this next example, we'll use `completecases` to find all the rows of a `DataFrame` that have complete data.

In [20]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])

|   | A       | B       |
|---|---------|---------|
|   | Int64⍰  | String⍰ |
| 1 | 1       | A       |
| 2 | missing | B       |
| 3 | 3       | missing |
| 4 | 4       | C       |


In [21]:
x

|   | A       | B       |
|---|---------|---------|
|   | Int64⍰  | String⍰ |
| 1 | 1       | A       |
| 2 | missing | B       |
| 3 | 3       | missing |
| 4 | 4       | C       |


In [22]:
println("Complete cases:\n", completecases(x))

Complete cases:
Bool[true, false, false, true]
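Conceptually, a row is complete when none of its entries is missing. The sketch below reproduces that check on two plain vectors holding the same data as the columns above; it illustrates the idea only and is not the implementation used by `completecases`.

```julia
# Same values as columns :A and :B of the data frame above.
A = [1, missing, 3, 4]
B = ["A", "B", missing, "C"]

# A row is complete when neither entry is missing.
complete = map((a, b) -> !ismissing(a) && !ismissing(b), A, B)

# The Boolean mask can be used to keep only the complete rows.
kept_A = A[complete]
kept_B = B[complete]
```

The mask matches the output of `completecases(x)`, and indexing with it keeps rows 1 and 4.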


We can use `dropmissing` or `dropmissing!` to remove the rows with incomplete data from a `DataFrame` and either create a new `DataFrame` or mutate the original in-place.

In [23]:
y = dropmissing(x)
dropmissing!(x)

|   | A      | B       |
|---|--------|---------|
|   | Int64⍰ | String⍰ |
| 1 | 1      | A       |
| 2 | 4      | C       |


In [24]:
x

|   | A      | B       |
|---|--------|---------|
|   | Int64⍰ | String⍰ |
| 1 | 1      | A       |
| 2 | 4      | C       |


In [25]:
y

|   | A      | B       |
|---|--------|---------|
|   | Int64⍰ | String⍰ |
| 1 | 1      | A       |
| 2 | 4      | C       |


When we call `describe` on a `DataFrame` whose rows with missing values have been dropped, the columns still allow missing values: `nmissing` reports 0 for each column rather than being empty.

In [26]:
describe(x)

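|   | variable | mean   | min | median | max | nunique | nmissing | eltype   |
|---|----------|--------|-----|--------|-----|---------|----------|----------|
|   | Symbol   | Union… | Any | Union… | Any | Union…  | Int64    | DataType |
| 1 | A        | 2.5    | 1   | 2.5    | 4   |         | 0        | Int64    |
| 2 | B        |        | A   |        | C   | 2.0     | 0        | String   |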


Since we've excluded missing values, we can safely use `disallowmissing!` so that the columns no longer accept missing values. We can see this in the output of `describe`, where the `nmissing` column is now empty.

In [27]:
disallowmissing!(x)
describe(x)

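|   | variable | mean   | min | median | max | nunique | nmissing | eltype   |
|---|----------|--------|-----|--------|-----|---------|----------|----------|
|   | Symbol   | Union… | Any | Union… | Any | Union…  | Nothing  | DataType |
| 1 | A        | 2.5    | 1   | 2.5    | 4   |         |          | Int64    |
| 2 | B        |        | A   |        | C   | 2.0     |          | String   |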
