# Data Analysis in Julia with Data Frames
### (John Myles White)
* https://www.youtube.com/watch?v=XRClA5YLiIc

In [1]:
# Pkg.add("DataFrames")
# Pkg.add("RDatasets")

INFO: Cloning cache of DataArrays from https://github.com/JuliaStats/DataArrays.jl.git
INFO: Cloning cache of DataFrames from https://github.com/JuliaStats/DataFrames.jl.git
INFO: Cloning cache of FileIO from https://github.com/JuliaIO/FileIO.jl.git
INFO: Cloning cache of GZip from https://github.com/JuliaIO/GZip.jl.git
INFO: Cloning cache of Reexport from https://github.com/simonster/Reexport.jl.git
INFO: Cloning cache of SortingAlgorithms from https://github.com/JuliaLang/SortingAlgorithms.jl.git
INFO: Cloning cache of StatsBase from https://github.com/JuliaStats/StatsBase.jl.git
INFO: Installing DataArrays v0.3.9
INFO: Installing DataFrames v0.8.4
INFO: Installing FileIO v0.2.0
INFO: Installing GZip v0.2.20
INFO: Installing Reexport v0.0.3
INFO: Installing SortingAlgorithms v0.1.0
INFO: Installing StatsBase v0.11.1
INFO: Package database updated
INFO: METADATA is out-of-date — you may not have the latest version of DataFrames
INFO: Use `Pkg.update()` to get the latest versions of yo

In [5]:
using DataFrames
using RDatasets

INFO: Precompiling module DataFrames.


## How do we cope with missing data?

In [6]:
# This works fine
v = [0.5, 0.6, 0.7, 0.8, 0.9]
mean(v)

0.7

In [7]:
# But this doesn't: the array literal tries to convert NA to Float64.
# (NA comes from DataArrays, which DataFrames loads; without it, NA is not defined at all.)
v = [0.5, 0.6, 0.7, NA, 0.9]
mean(v)

LoadError: LoadError: MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Float64
This may have arisen from a call to the constructor Float64(...),
since type constructors fall back to convert methods.
while loading In[7], in expression starting on line 2

The NA type:
    * Represents a missing value
        - Like NULL in some systems
    * Poisons other values
        - Like NaN for floating point numbers

In [9]:
# Poisoning other values. 
println(1 + NA)

println(1 > NA)

println(isna(NA))

NA
NA
true
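
For comparison, on Julia 1.x (an assumption; the cells above ran on an older release) the built-in `missing` value plays the role of NA and `ismissing` replaces `isna`. A minimal sketch of the same poisoning behaviour, using only Base:

```julia
# Arithmetic and comparisons involving missing yield missing
sum_result = 1 + missing
cmp_result = 1 > missing

println(sum_result)
println(cmp_result)

# ismissing is the test for a missing value (the analogue of isna above)
println(ismissing(missing))

# Logical operators return a definite answer when one operand decides the result
println(true | missing)
println(false & missing)
```

The last two lines show that the poisoning is not absolute: `true | missing` is `true` and `false & missing` is `false`, because the unknown operand cannot change the outcome.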


In [11]:
println(NaN == NaN)

println(NaN < 1)

println(NaN > 1)

false
false
false
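
Because `NaN == NaN` is false, equality cannot be used to detect a NaN. A short sketch of the usual checks, using only Base functions (the variable names are illustrative):

```julia
x = NaN

# NaN propagates through arithmetic, much like NA does
shifted = x + 1
println(shifted)

# == cannot detect NaN; isnan can, and isequal treats NaN as equal to itself
println(isnan(x))
println(isequal(x, NaN))

# A single NaN poisons an aggregate over the whole vector
total = sum([0.5, NaN, 0.9])
println(isnan(total))
```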


DataArray{T} adds NA support to Array{T}.

A DataArray{T} can store values of type T or NA.

- T is the type parameter of DataArray; NA has its own type, NAtype

In [12]:
typeof(NA)

DataArrays.NAtype

In [14]:
dv = DataArray([1, 2, 3])
println(dv)

dv[1] = NA

join(dv, "::") # general way of string joining.

[1,2,3]


"NA::2::3"

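A rough modern counterpart, assuming Julia 1.x: an ordinary `Vector` with element type `Union{Int, Missing}` can hold either an `Int` or `missing`. This is a sketch of the same steps with Base and the `Statistics` standard library, not the DataArrays API used above.

```julia
using Statistics  # standard library; provides mean

# Element type Union{Int, Missing} lets each slot hold an Int or missing
dv = Union{Int, Missing}[1, 2, 3]

dv[1] = missing

# join prints each element, so the missing value appears by name
joined = join(dv, "::")
println(joined)

# Reductions propagate missing unless it is skipped explicitly
total = sum(dv)
known_mean = mean(skipmissing(dv))
println(total)
println(known_mean)
```
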
Convenience constructors:
    - zeros(): fill with 0
    - ones(): fill with 1
    - falses(): fill with false
    - trues(): fill with true
    - dataeye(): identity matrix
    - datadiagm(): place the given values along the diagonal

Note: these functions may have been removed or changed in later versions.
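
For reference, the plain-array versions of these constructors on Julia 1.x (an assumption about the reader's version) come from Base and the `LinearAlgebra` standard library. They return ordinary arrays rather than DataArrays, so this sketch does not exercise `dataeye` or `datadiagm` themselves.

```julia
using LinearAlgebra  # standard library; provides I and diagm

z  = zeros(3)                   # 3-element Vector{Float64} filled with 0.0
o  = ones(3)                    # 3-element Vector{Float64} filled with 1.0
f  = falses(3)                  # 3-element BitVector filled with false
t  = trues(3)                   # 3-element BitVector filled with true
id = Matrix{Float64}(I, 3, 3)   # 3x3 identity matrix
d  = diagm(0 => [1, 2, 3])      # 3x3 matrix with 1, 2, 3 on the main diagonal

println(id)
println(d)
```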

## How do we cope with heterogeneous data?

An example data set:

| Name       | Height | Weight | Gender |
|------------|--------|--------|--------|
| John Smith | 73.0   | NA     | Male   |
| Jane Doe   | 68.0   | 130    | Female |

In [22]:
type profile
    Name::String
    Body_size::Dict
    Gender::String
end



In [23]:
body_size = ["Height", "Weight"]

2-element Array{String,1}:
 "Height"
 "Weight"

In [25]:
a = profile("John Smith", Dict("Height"=>73.0, "Weight"=>NA), "Male")
b = profile("Jane Doe", Dict("Height"=>68.0, "Weight"=>130), "Female")

profile("Jane Doe",Dict{String,Any}(Pair{String,Any}("Height",68.0),Pair{String,Any}("Weight",130)),"Female")
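
On Julia 1.x the `type` keyword was replaced by `struct` and `mutable struct`. Below is a sketch of the same record with typed fields, assuming that version. `Profile`, the field names, and `people` are illustrative, and `missing` stands in for NA.

```julia
# One record per person; Union{Float64, Missing} marks fields that may be unknown
struct Profile
    name::String
    height::Union{Float64, Missing}
    weight::Union{Float64, Missing}
    gender::String
end

people = [
    Profile("John Smith", 73.0, missing, "Male"),
    Profile("Jane Doe", 68.0, 130.0, "Female"),
]

# Pulling out one "column" requires a pass over every record
weights = [p.weight for p in people]
println(weights)
```

Storing rows as records keeps each person's data together, but any column-wise operation has to walk the whole collection, which is one reason to consider a column-oriented structure instead.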

In [26]:
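# Try to build the same table as a DataFrame.
# This fails: DataVector[...] is typed-array syntax, so each element is
# converted to a DataVector, which raises the MethodError shown below.
df = DataFrame()
df["Name"] = DataVector["John Smith", "Jane Doe"]
df["Height"] = DataVector[73.0, 68.0]
df["Weight"] = DataVector[NA, 130]
df["Gender"] = DataVector["Male", "Female"]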



LoadError: LoadError: MethodError: Cannot `convert` an object of type String to an object of type DataArrays.DataArray{T,1}
This may have arisen from a call to the constructor DataArrays.DataArray{T,1}(...),
since type constructors fall back to convert methods.
while loading In[26], in expression starting on line 3
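
The error comes from the square-bracket syntax: `T[...]` builds an array whose elements are each converted to `T`, so `DataVector["John Smith", ...]` tries to turn every string into a DataVector. The sketch below reproduces the same pitfall with a Base type and then stores the table as plain named columns. It assumes Julia 1.x and uses `missing` in place of NA; the name `columns` is illustrative.

```julia
# T[...] converts every element to T; a String cannot become a Vector{Float64}
err = try
    Vector{Float64}["John Smith", "Jane Doe"]
catch e
    e
end
println(typeof(err))

# One vector per column, grouped in a NamedTuple
columns = (
    Name   = ["John Smith", "Jane Doe"],
    Height = [73.0, 68.0],
    Weight = [missing, 130],
    Gender = ["Male", "Female"],
)

println(eltype(columns.Weight))
println(columns.Name[2], " weighs ", columns.Weight[2])

# With a recent DataFrames.jl installed (an assumption, not run here),
# DataFrame(columns) would wrap these same columns in a DataFrame.
```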