# DataFrames, RCall and RData

Currently, the capabilities for working with data frames, model matrices, formulas, etc. are housed in the `DataFrames` package.  The current representation has some performance issues and may be replaced by another implementation within a few months to a year.

When the `RCall` package is attached, it starts an `R` process and provides two-way communication with it.  String literals prefixed by `R`, as in `R"..."`, are evaluated as R expressions.

In [1]:
using DataFrames, RCall

In [2]:
R"str(Formaldehyde)"  # one of the data sets from R's datasets package

'data.frame':	6 obs. of  2 variables:
 $ carb  : num  0.1 0.3 0.5 0.6 0.7 0.9
 $ optden: num  0.086 0.269 0.446 0.538 0.626 0.782


RCall.RObject{RCall.NilSxp}
NULL


In [3]:
formaldehyde = rcopy(R"Formaldehyde")

|   | carb | optden |
|---|------|--------|
| 1 | 0.1  | 0.086  |
| 2 | 0.3  | 0.269  |
| 3 | 0.5  | 0.446  |
| 4 | 0.6  | 0.538  |
| 5 | 0.7  | 0.626  |
| 6 | 0.9  | 0.782  |


In [4]:
typeof(formaldehyde)

DataFrames.DataFrame

In [5]:
typeof.(formaldehyde.columns) # check the type of each column

2-element Array{DataType,1}:
 DataArrays.DataArray{Float64,1}
 DataArrays.DataArray{Float64,1}

In [6]:
orchardsprays = rcopy(R"OrchardSprays")

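|    | decrease | rowpos | colpos | treatment |
|----|----------|--------|--------|-----------|
| 1  | 57.0     | 1.0    | 1.0    | D         |
| 2  | 95.0     | 2.0    | 1.0    | E         |
| 3  | 8.0      | 3.0    | 1.0    | B         |
| 4  | 69.0     | 4.0    | 1.0    | H         |
| 5  | 92.0     | 5.0    | 1.0    | G         |
| 6  | 90.0     | 6.0    | 1.0    | F         |
| 7  | 15.0     | 7.0    | 1.0    | C         |
| 8  | 2.0      | 8.0    | 1.0    | A         |
| 9  | 84.0     | 1.0    | 2.0    | C         |
| 10 | 6.0      | 2.0    | 2.0    | B         |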


In [7]:
typeof.(orchardsprays.columns)

4-element Array{DataType,1}:
 DataArrays.DataArray{Float64,1}           
 DataArrays.DataArray{Float64,1}           
 DataArrays.DataArray{Float64,1}           
 DataArrays.PooledDataArray{String,UInt8,1}

In [8]:
R"""
summary(fm <- lm(optden ~ 1 + carb, Formaldehyde))
"""

RCall.RObject{RCall.VecSxp}

Call:
lm(formula = optden ~ 1 + carb, data = Formaldehyde)

Residuals:
        1         2         3         4         5         6 
-0.006714  0.001029  0.002771  0.007143  0.007514 -0.011743 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.005086   0.007834   0.649    0.552    
carb        0.876286   0.013535  64.744 3.41e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008649 on 4 degrees of freedom
Multiple R-squared:  0.999,	Adjusted R-squared:  0.9988 
F-statistic:  4192 on 1 and 4 DF,  p-value: 3.409e-07



In [9]:
using GLM
lm(@formula(optden ~ 1 + carb), formaldehyde)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: optden ~ 1 + carb

Coefficients:
               Estimate  Std.Error  t value Pr(>|t|)
(Intercept)  0.00508571 0.00783368 0.649211   0.5516
carb           0.876286  0.0135345  64.7444    <1e-6


Notice that the first call to a function like `lm()` is rather slow because that function, and many of the functions it calls, must be compiled by the just-in-time (JIT) compiler.  Subsequent calls with the same argument types reuse the compiled code and are much faster.

In [10]:
@time lm(@formula(optden ~ 1 + carb), formaldehyde)

  0.000992 seconds (681 allocations: 51.109 KB)


DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: optden ~ 1 + carb

Coefficients:
               Estimate  Std.Error  t value Pr(>|t|)
(Intercept)  0.00508571 0.00783368 0.649211   0.5516
carb           0.876286  0.0135345  64.7444    <1e-6
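
The same compile-on-first-call effect can be seen without any packages.  The following is a minimal sketch; `sumsq` is a hypothetical toy function defined only for this illustration, and the timings it prints will vary from machine to machine.

```julia
# Hypothetical toy function, used only to illustrate JIT compilation.
sumsq(x) = sum(abs2, x)

v = [1.0, 2.0, 3.0]

t_first  = @elapsed sumsq(v)   # includes compiling sumsq for a Vector{Float64}
t_second = @elapsed sumsq(v)   # reuses the method compiled by the first call

println("first call:  ", t_first, " seconds")
println("second call: ", t_second, " seconds")
```

On a typical run the first timing should be noticeably larger than the second, even though both calls do exactly the same arithmetic.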


`@time` and `@formula` are calls to *macros*.  `R` provides *lazy evaluation*, which is what makes it possible to store the call to the `lm` function in `R` as part of the object generated by the call.  In `Julia`, all the arguments to a function are evaluated before the function is called.  Macros, on the other hand, receive their arguments as unevaluated expressions.  Formulas must be wrapped in a macro call because of the non-standard interpretation of `~` in a formula.
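
To make the distinction concrete, here is a small sketch contrasting a function with a macro.  The names `as_value` and `show_expr` are hypothetical and are not part of any package; the last lines simply quote a formula to inspect how the parser represents it, and the exact representation may differ between Julia versions.

```julia
# A function receives the *value* of its argument.
as_value(x) = string(x)

# A macro receives the *expression* itself, before it is evaluated.
# `show_expr` is a hypothetical name; it returns the expression's source text.
macro show_expr(ex)
    return string(ex)
end

println(as_value(1 + 2))     # the sum was evaluated before the call
println(@show_expr(1 + 2))   # the macro saw the expression 1 + 2

# Quoting a formula shows how the parser treats `~`.  No variables named
# `optden` or `carb` need to exist because nothing is evaluated.
f = :(optden ~ 1 + carb)
println(f.head)   # the kind of expression
println(f.args)   # its components
```

Because the macro works with the expression rather than its value, a macro such as `@formula` can give `~` and `+` a meaning of its own instead of performing ordinary arithmetic.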

The `RCall` package requires a local installation of `R`.  For access to data sets only, it can be more convenient to use the `RData` package to read saved `.RData` or `.rda` files.  `RData` is one of a group of packages that extend the `load` function for different file types.  Once it is attached, i.e. 
```julia
using RData
```
in a Julia session, the `.rda` and `.RData` extensions will be recognized.

The `MixedModels` package contains a saved `.rda` file with several sample data frames in its `"test"` directory.  Loading this file produces a `Dict` ("dictionary" of key-value pairs) of type `Dict{String,Any}`.  I find it more convenient to use `Symbol`s as keys because they are easier to type and because IJulia code blocks and the REPL provide tab completion of the symbols that are keys.

In [11]:
using RData

In [12]:
const dat = convert(Dict{Symbol,Any}, load(Pkg.dir("MixedModels", "test", "dat.rda")))

Dict{Symbol,Any} with 61 entries:
  :Assay         => 60×6 DataFrames.DataFrame…
  :WWheat        => 60×3 DataFrames.DataFrame…
  :Gasoline      => 32×6 DataFrames.DataFrame…
  :Alfalfa       => 72×4 DataFrames.DataFrame…
  :BIB           => 24×5 DataFrames.DataFrame…
  :IncBlk        => 24×8 DataFrames.DataFrame…
  :Semi2         => 72×6 DataFrames.DataFrame…
  :KWDYZ         => 28710×12 DataFrames.DataFrame…
  :Multilocation => 108×7 DataFrames.DataFrame…
  :Arabidopsis   => 625×8 DataFrames.DataFrame…
  :gb12          => 512×12 DataFrames.DataFrame…
  :Gcsemv        => 1905×5 DataFrames.DataFrame…
  :Hsb82         => 7185×8 DataFrames.DataFrame…
  :AvgDailyGain  => 32×6 DataFrames.DataFrame…
  :Dyestuff2     => 30×2 DataFrames.DataFrame…
  :InstEval      => 73421×7 DataFrames.DataFrame…
  :bdf           => 2287×28 DataFrames.DataFrame…
  :grouseticks   => 403×7 DataFrames.DataFrame…
  :Weights       => 399×4 DataFrames.DataFrame…
  :Mmmec         => 354×6 DataFrames.DataFrame…
  ⋮

In [13]:
dat[:Cultivation]

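|    | G | B | A   | Y    |
|----|---|---|-----|------|
| 1  | 1 | a | con | 27.4 |
| 2  | 1 | a | dea | 29.7 |
| 3  | 1 | a | liv | 34.5 |
| 4  | 1 | b | con | 29.4 |
| 5  | 1 | b | dea | 32.5 |
| 6  | 1 | b | liv | 34.4 |
| 7  | 2 | a | con | 28.9 |
| 8  | 2 | a | dea | 28.7 |
| 9  | 2 | a | liv | 33.4 |
| 10 | 2 | b | con | 28.7 |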
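
The change from `String` keys to `Symbol` keys can also be written out explicitly.  The sketch below uses a small made-up dictionary, `raw`, as a stand-in for the result of `load`; both `raw` and `symdat` are hypothetical names, and this is only one possible way to do the conversion.

```julia
# Hypothetical stand-in for the Dict{String,Any} returned by `load`.
raw = Dict{String,Any}("alpha" => [1, 2, 3], "beta" => "some text")

# Build a new Dict whose keys are Symbols made from the String keys.
symdat = Dict{Symbol,Any}(Symbol(k) => v for (k, v) in raw)

println(symdat[:alpha])          # index with :alpha instead of "alpha"
println(haskey(symdat, :beta))   # the other key was converted as well
```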


Prefacing an assignment with `const`, as in
```julia
const dat = convert(...
```
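is helpful, but not required, for global (i.e. top-level) identifiers.  It indicates that the type of the object bound to that name will not change, although its contents may be changed.  Knowing the type lets the compiler generate more efficient code for functions that refer to the global.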

For the purposes of this workshop, consider it an idiom.
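
As a brief illustration of what "the contents may be changed" means, consider the following sketch.  The name `settings` and its keys are hypothetical and chosen only for this example.

```julia
# `settings` is a hypothetical global; `const` fixes the binding and its type.
const settings = Dict{Symbol,Any}(:alpha => 0.05)

settings[:alpha] = 0.01     # changing a stored value is allowed
settings[:label] = "demo"   # adding a new key is allowed too

# settings = 3   # would rebind the name to an Int, which `const` is meant to rule out

println(settings[:alpha])
println(typeof(settings))
```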