# JuliaR

### On DataFrames and R (dplyr)-like functionality in Julia

<img src="meme.jpg" alt="Drawing" style="width: 500px;"/>


---

---

---

---

---

## Package management 

### To download

```Pkg.add``` $\equiv$ ```install.packages```

In [1]:
import Pkg # load Pkg package into namespace
Pkg.add("DataFrames") # download and install the package from the General registry

   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
 [no changes]
   Updating `~/.julia/environments/v1.4/Manifest.toml`
 [no changes]


### To load 

Either

In [2]:
import DataFrames 

┌ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
└ @ Base loading.jl:1260


Brings only the module name ```DataFrames``` into the namespace; its functions/variables/structs are then accessed with the ```DataFrames.``` prefix.

i.e. 

In [3]:
DataFrames.nrow

nrow (generic function with 2 methods)

refers to the ```nrow()``` function from ```DataFrames```;

sans prefix, ```nrow``` is undefined and throws an error.

In [4]:
nrow

UndefVarError: UndefVarError: nrow not defined

or

In [5]:
using DataFrames

Adds the exported functions/variables/structs from ```DataFrames``` to the namespace, so no prefix is needed.

i.e. ```nrow()``` calls the ```nrow()``` function from ```DataFrames```.

In [6]:
nrow

nrow (generic function with 2 methods)
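To try the two loading styles without installing anything, here is a minimal sketch that uses the `Statistics` standard library in place of ```DataFrames```. The substitution and the variable names `m1` and `m2` are assumptions made purely for illustration.

```julia
# `import` brings only the module name into scope,
# so every call must be qualified with the module prefix.
import Statistics
m1 = Statistics.mean([1, 2, 3])

# `using` additionally brings the exported names (such as `mean`) into scope.
using Statistics
m2 = mean([1, 2, 3])

println(m1 == m2)  # both calls reach the same function
```

The qualified form still works after `using`, which can help when two packages export the same name.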

---

---

---

---

---

## Getting help

1) Type ```?``` followed by the name of the thing you want help with in the REPL.

In [7]:
? + 

search: +



```
+(x, y...)
```

Addition operator. `x+y+z+...` calls this function with all arguments, i.e. `+(x, y, z, ...)`.

# Examples

```jldoctest
julia> 1 + 20 + 4
25

julia> +(1, 20, 4)
25
```

---

```
dt::Date + t::Time -> DateTime
```

The addition of a `Date` with a `Time` produces a `DateTime`. The hour, minute, second, and millisecond parts of the `Time` are used along with the year, month, and day of the `Date` to create the new `DateTime`. Non-zero microseconds or nanoseconds in the `Time` type will result in an `InexactError` being thrown.


2) Google it. 

3) Post in the Slack.

---

---

---

---

---

## Functions (very quickly)

Most similar to `R` is an anonymous function assigned to a name:

In [64]:
f1 = function(x,y)
    return x*y
end 

ErrorException: invalid redefinition of constant f1

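(This cell errors only because `f1` had already been defined as a named function earlier in this session; a name bound to a named function cannot be reassigned. In a fresh session the assignment works.)

Most similar to MATLAB is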

In [67]:
function head(df) 
    first(df,5)
end

head (generic function with 1 method)

---

---

---

---

---

## Loading data 

In [11]:
import CSV 

┌ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1260


Either 

```CSV.read("income.csv", DataFrame)```

or 

In [12]:
incomeCSV = CSV.File("income.csv") # a CSV.File object
income = DataFrame(incomeCSV)

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 12.9991 | male |
| 2 | 1992 | 11.618 | male |
| 3 | 1992 | 17.3773 | male |
| 4 | 1992 | 10.0613 | female |
| 5 | 1992 | 16.7567 | male |
| 6 | 1992 | 9.21617 | female |
| 7 | 1992 | 15.9587 | female |
| 8 | 1992 | 27.3692 | male |
| 9 | 1992 | 10.6392 | male |
| 10 | 1992 | 6.98195 | male |


which is equivalent to 

In [13]:
income = CSV.File("income.csv") |> DataFrame # piping!!

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 12.9991 | male |
| 2 | 1992 | 11.618 | male |
| 3 | 1992 | 17.3773 | male |
| 4 | 1992 | 10.0613 | female |
| 5 | 1992 | 16.7567 | male |
| 6 | 1992 | 9.21617 | female |
| 7 | 1992 | 15.9587 | female |
| 8 | 1992 | 27.3692 | male |
| 9 | 1992 | 10.6392 | male |
| 10 | 1992 | 6.98195 | male |


There is also a standard-library ```DelimitedFiles``` package, but I've never used it.

---

---

---

---

---

## Basic operations 

Some functions are used *exactly* as in R:

In [14]:
nrow(income)

11130

In [15]:
ncol(income)

3

In [16]:
names(income)

3-element Array{String,1}:
 "Year"
 "AHE"
 "Sex"

---

---

---

---

---

## Indexing ```DataFrames```

There are many options (too many?).

Either 

In [17]:
income.Sex # most similar to income$Sex in R
income."Sex"

11130-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:
 "male"
 "male"
 "male"
 "female"
 "male"
 "female"
 "female"
 "male"
 "male"
 "male"
 "female"
 "female"
 "male"
 ⋮
 "male"
 "male"
 "female"
 "male"
 "female"
 "female"
 "male"
 "female"
 "male"
 "male"
 "female"
 "female"

Or index with square brackets
```
income[rows, columns]
```

To get all rows, use either of the following; `:` returns a copy of the column data, whereas `!` returns it without copying.
```
: # as in MATLAB
```
or
```
! 
```

The columns argument can be any of 
```
:Sex # the syntax for a Symbol type
"Sex" 
3
```
or
```
[:Sex]
["Sex"]
[3]
```


The following are all ways to index a `DataFrame`. 

In [61]:
## most similar to income$Sex in R
income.Sex 
income."Sex"
# returns the stored column itself as a vector (NOT a copy), same as income[!,:Sex]

## return a COPY of the column(s)
# as a vector
income[:,:Sex] 
income[:,"Sex"]
income[:,3]
# as a DataFrame
income[:,[:Sex]]
income[:,["Sex"]]
income[:,[3]]

## return a VIEW of the column(s): the stored data itself, no copy is made
# as a vector
income[!,:Sex] 
income[!,"Sex"]
income[!,3]
# as a DataFrame
income[!,[:Sex]] # this is my personal preference
income[!,["Sex"]]
income[!,[3]] |> head


| Row | Sex |
|-----|-----|
| | String |
| 1 | 2 |
| 2 | male |
| 3 | female |
| 4 | male |
| 5 | male |
| 6 | male |
| 7 | male |
| 8 | male |
| 9 | male |
| 10 | male |


In [55]:
income.Sex 

11130-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:
 "male"
 "male"
 "female"
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "female"
 ⋮
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "male"
 "female"
 "female"
 "male"
 "female"

In [59]:
income."Sex"[1] = "2"

"2"

In [60]:
income

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1996 | 52.4434 | 2 |
| 2 | 1996 | 52.4434 | male |
| 3 | 1992 | 50.27 | female |
| 4 | 1994 | 50.2342 | male |
| 5 | 1996 | 49.9461 | male |
| 6 | 1996 | 49.9461 | male |
| 7 | 1998 | 49.4506 | male |
| 8 | 1998 | 48.0769 | male |
| 9 | 1992 | 48.0358 | male |
| 10 | 1992 | 48.0358 | male |


We can also pass arrays of indices

```income[!,[:Sex, :Year]]```

or invert selection with 

```income[!,Not(:Sex)]```


---

#### What is the difference between a *copy* and a *view*?

In [19]:
aCopy = income[:,[:Sex]]
aCopy[1,:Sex] = "hello"
first(aCopy,5)

| Row | Sex |
|-----|-----|
| | String |
| 1 | hello |
| 2 | male |
| 3 | male |
| 4 | female |
| 5 | male |


In [20]:
first(income,5)

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 12.9991 | male |
| 2 | 1992 | 11.618 | male |
| 3 | 1992 | 17.3773 | male |
| 4 | 1992 | 10.0613 | female |
| 5 | 1992 | 16.7567 | male |


In [21]:
aView = income[!,[:Sex]]
aView[1,:Sex] = "hello"
first(aView,5)

| Row | Sex |
|-----|-----|
| | String |
| 1 | hello |
| 2 | male |
| 3 | male |
| 4 | female |
| 5 | male |


In [22]:
first(income,5)

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 12.9991 | hello |
| 2 | 1992 | 11.618 | male |
| 3 | 1992 | 17.3773 | male |
| 4 | 1992 | 10.0613 | female |
| 5 | 1992 | 16.7567 | male |


***```income``` has now changed!!!!***

Be careful.
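The same distinction can be checked directly with `===`, which tests whether two names refer to the very same object. The sketch below assumes a small hypothetical `toy` DataFrame (the names `toy`, `a`, `b`, `colCopy`, `colShared` and `sub` are made up for illustration) and also shows `view`, which gives a true row/column view.

```julia
using DataFrames

toy = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])  # hypothetical toy data

colCopy   = toy[:, :a]   # `:` allocates a new vector
colShared = toy[!, :a]   # `!` hands back the stored column itself

println(colCopy === toy.a)    # is the copy the same object as the column?
println(colShared === toy.a)  # is the `!` result the same object as the column?

colCopy[1] = 100     # changes only the copy
colShared[2] = 200   # changes toy as well

sub = view(toy, 1:2, :)  # a SubDataFrame over rows 1 and 2
sub[1, :a] = -1          # writes through to toy

println(toy.a)
```

Writing through `colShared` or `sub` modifies `toy`, while writing to `colCopy` does not; when in doubt, `copy(...)` first.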

---

---

---

---

---

---

## Piping 

Pass the object on the left to the function on the right.

i.e. In R we use the `%>%` infix operator 
```
> aFun <- function(x) x^2
> b <- 2 
> b %>% aFun
[1] 4
```

In Julia it's `|>`

In [23]:
fun(x) = x^2
b = 2 
b |> fun  

4

Can broadcast via piping too

In [24]:
A = [1 2; 3 4]
A .|> fun 

2×2 Array{Int64,2}:
 1   4
 9  16

In [25]:
A |> fun 

2×2 Array{Int64,2}:
  7  10
 15  22

If this is your jam, then see the `Pipe` package (and its `@pipe` macro) for some cool stuff.
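Pipes can also be chained, and `∘` composes functions without naming an intermediate value. A minimal sketch using only Base Julia; the names `square`, `total`, `piped` and `composed` are hypothetical and chosen just for this example.

```julia
square(x) = x^2

# broadcast `square` over the range element-wise, then pipe the result into `sum`
total = ((1:3) .|> square) |> sum

# a chain of two pipes reads left to right: first sqrt, then string
piped = 16 |> sqrt |> string

# the same computation written as a composition, which reads right to left
composed = (string ∘ sqrt)(16)

println(total)
println(piped == composed)
```

Note that `|>` only passes a single argument, so functions needing more arguments are usually wrapped in an anonymous function such as `x -> first(x, 5)`.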

## A likeness with ```dplyr```

The ```DataFrames``` package covers much of the same data manipulation functionality as ```dplyr```.



```R``` functions and their ```Julia``` equivalents:

 - In R ```rename()``` $\equiv$ in Julia ```rename()``` - renames columns.

 - In R ```filter()``` $\equiv$ in Julia ```filter()``` - picks cases based on their values.

 - In R ```select()``` $\equiv$ in Julia ```select()``` - picks variables based on their names.

 - In R ```mutate()``` $\equiv$ in Julia ```transform()``` - adds new variables that are functions of existing variables.

 - In R ```summarise()``` $\equiv$ in Julia ```combine()``` - reduces multiple values down to a single summary.

 - In R ```arrange()``` $\equiv$ in Julia ```sort()``` - changes the ordering of the rows.

 - In R ```group_by()``` $\equiv$ in Julia ```groupby()``` - returns a ```GroupedDataFrame``` object.
 

```rename!(), filter!(), select!(), transform!(), sort!()``` also exist to manipulate DataFrames in place.
 
Here, the common syntax is one of 
```
:ColumnName => :NewName
:ColumnName => function => :NewName
:ColumnName => function
```

Examples:



In [26]:
incomeCopy = copy(income);
rename!(incomeCopy, :AHE => :Income) |> head # the ! means: operate on incomeCopy in place

| Row | Year | Income | Sex |
|-----|------|--------|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 12.9991 | hello |
| 2 | 1992 | 11.618 | male |
| 3 | 1992 | 17.3773 | male |
| 4 | 1992 | 10.0613 | female |
| 5 | 1992 | 16.7567 | male |


In [27]:
filter!(:Income => (x-> x>20), incomeCopy) |> head # the function comes first here, following Base's filter(f, collection) convention, unlike the other verbs
# the non-mutating filter() also takes an optional keyword argument view::Bool to specify whether to return a view or a copy

| Row | Year | Income | Sex |
|-----|------|--------|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 27.3692 | male |
| 2 | 1992 | 20.2197 | male |
| 3 | 1992 | 26.0806 | male |
| 4 | 1992 | 20.9769 | male |
| 5 | 1992 | 26.8107 | male |


In [28]:
select!(incomeCopy, Not(:Year)) |> head 

| Row | Income | Sex |
|-----|--------|-----|
| | Float64 | String |
| 1 | 27.3692 | male |
| 2 | 20.2197 | male |
| 3 | 26.0806 | male |
| 4 | 20.9769 | male |
| 5 | 26.8107 | male |


In [29]:
incomeAdjust(income,sex) = (isequal.(sex,"male")*0.87 + isequal.(sex,"female")) .* income
transform!(incomeCopy,[:Income,:Sex] => incomeAdjust => :AdjustedIncome) |> head 

| Row | Income | Sex | AdjustedIncome |
|-----|--------|-----|----------------|
| | Float64 | String | Float64 |
| 1 | 27.3692 | male | 23.8112 |
| 2 | 20.2197 | male | 17.5912 |
| 3 | 26.0806 | male | 22.6902 |
| 4 | 20.9769 | male | 18.2499 |
| 5 | 26.8107 | male | 23.3253 |


In [30]:
temp = reduce(vcat, fill(income, 200)) # stack 200 copies of income row-wise to get a larger test DataFrame
@time transform!(temp,[:AHE,:Sex] => incomeAdjust => :AdjustedIncome);

  0.364838 seconds (591.07 k allocations: 77.866 MiB, 3.57% gc time)


In [31]:
temp = copy(income)
@time transform(temp,[:AHE,:Sex] => incomeAdjust => :AdjustedIncome);

  0.029192 seconds (70.31 k allocations: 4.103 MiB)


In [32]:
incomeBySex = groupby(income, :Sex) 

Group `Sex = "male"`:

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 11.618 | male |
| 2 | 1992 | 17.3773 | male |
| 3 | 1992 | 16.7567 | male |
| 4 | 1992 | 27.3692 | male |
| 5 | 1992 | 10.6392 | male |
| 6 | 1992 | 6.98195 | male |
| 7 | 1992 | 14.4938 | male |
| 8 | 1992 | 20.2197 | male |
| 9 | 1992 | 9.79621 | male |
| 10 | 1992 | 16.7511 | male |

Group `Sex = "hello"`:

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 12.9991 | hello |


In [33]:
keys(incomeBySex)

3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (Sex = "male",)
 GroupKey: (Sex = "female",)
 GroupKey: (Sex = "hello",)

In [34]:
incomeBySex[(Sex="male",)] |> head 

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 11.618 | male |
| 2 | 1992 | 17.3773 | male |
| 3 | 1992 | 16.7567 | male |
| 4 | 1992 | 27.3692 | male |
| 5 | 1992 | 10.6392 | male |


In [35]:
incomeBySex[1] |> head 

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 11.618 | male |
| 2 | 1992 | 17.3773 | male |
| 3 | 1992 | 16.7567 | male |
| 4 | 1992 | 27.3692 | male |
| 5 | 1992 | 10.6392 | male |


In [36]:
incomeBySexYear = groupby(income, [:Sex, :Year])
keys(incomeBySexYear)

9-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (Sex = "hello", Year = 1992)
 GroupKey: (Sex = "male", Year = 1992)
 GroupKey: (Sex = "female", Year = 1992)
 GroupKey: (Sex = "female", Year = 1994)
 GroupKey: (Sex = "male", Year = 1994)
 GroupKey: (Sex = "female", Year = 1996)
 GroupKey: (Sex = "male", Year = 1996)
 GroupKey: (Sex = "male", Year = 1998)
 GroupKey: (Sex = "female", Year = 1998)

In [37]:
incomeBySexYear[(Sex="male",Year=1992)] |> head 

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1992 | 11.618 | male |
| 2 | 1992 | 17.3773 | male |
| 3 | 1992 | 16.7567 | male |
| 4 | 1992 | 27.3692 | male |
| 5 | 1992 | 10.6392 | male |
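A grouped DataFrame is typically passed to ```combine()``` to get one summary row per group, the analogue of `group_by() %>% summarise()` in R. The sketch below is only illustrative: it uses a small hypothetical `toy` DataFrame rather than the real `income.csv`, and the names `toy`, `meanBySex` and `MeanAHE` are assumptions.

```julia
using DataFrames
using Statistics: mean

# hypothetical data with the same column names as `income`
toy = DataFrame(Year = [1992, 1992, 1994, 1994],
                AHE  = [10.0, 20.0, 30.0, 50.0],
                Sex  = ["male", "female", "male", "female"])

# source column => function => new column name, applied within each group
meanBySex = combine(groupby(toy, :Sex), :AHE => mean => :MeanAHE)

println(meanBySex)
```

Grouping by several columns works the same way, e.g. `groupby(toy, [:Sex, :Year])`.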


In [38]:
describe(income)

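| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|-----|----------|------|-----|--------|-----|---------|----------|--------|
| | Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType |
| 1 | Year | 1994.87 | 1992 | 1994.0 | 1998 | | | Int64 |
| 2 | AHE | 16.2627 | 2.13649 | 14.9838 | 52.4434 | | | Float64 |
| 3 | Sex | | female | | male | 3.0 | | String |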


In [39]:
sort!(income, :AHE, rev = true) |> head 

| Row | Year | AHE | Sex |
|-----|------|-----|-----|
| | Int64 | Float64 | String |
| 1 | 1996 | 52.4434 | male |
| 2 | 1996 | 52.4434 | male |
| 3 | 1992 | 50.27 | female |
| 4 | 1994 | 50.2342 | male |
| 5 | 1996 | 49.9461 | male |


---

---

## Linear Models 

In [40]:
x -> 2*x + 1 

#7 (generic function with 1 method)

In-line function 

In [41]:
f1(x) = 2*x + 1

f1 (generic function with 1 method)

Other ways to define a function

In [42]:
f2 = function(x)
    2*x + 1 
end 

#9 (generic function with 1 method)

In [43]:
function f3(x)
    2*x + 1
end

f3 (generic function with 1 method)

Can also use a return keyword (like R)

In [44]:
function f4(x) 
    return 2*x + 1
end 

f4 (generic function with 1 method)

Optional keyword (named) arguments go after a `;` and must be passed by name

In [45]:
function f5(;x=1)
    return 2*x + 1
end

f5 (generic function with 1 method)

In [46]:
f5()

3

In [47]:
f5(1)

MethodError: MethodError: no method matching f5(::Int64)
Closest candidates are:
  f5(; x) at In[45]:2

In [48]:
f5(x=1)

3
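For contrast, a default value on a positional argument (no `;`) can be passed by position. A minimal sketch in Base Julia; the function names `g1`, `g2`, `g3` and the keyword `scale` are hypothetical.

```julia
# positional argument with a default: may be omitted or passed by position
g1(x = 1) = 2 * x + 1

# keyword argument (declared after the semicolon): may be omitted or passed by name
g2(; x = 1) = 2 * x + 1

# both kinds together: y is optional positional, scale is keyword
g3(x, y = 10; scale = 1) = scale * (x + y)

println(g1())                  # uses the default x = 1
println(g1(2))                 # positional
println(g2(x = 2))             # by name
println(g3(1))                 # y and scale take their defaults
println(g3(1, 2; scale = 3))   # everything supplied
```

Calling `g2(2)` would raise a `MethodError`, just as `f5(1)` does.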

Types and multiple dispatch are far too large a topic to cover here.

You can do most things without them to begin with. 

In [49]:
using GLM

In [50]:
lm1 = lm(@formula(log(AHE) ~ Year), income)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

:(log(AHE)) ~ 1 + Year

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error     t  Pr(>|t|)    Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  2.19738      3.79287     0.58    0.5624  -5.23732     9.63208
Year         0.000248876  0.00190131  0.13    0.8959  -0.00347803  0.00397578
─────────────────────────────────────────────────────────────────────────────

In [51]:
exp.(predict(lm1, DataFrame(Year=2021, Sex="female")))

1-element Array{Float64,1}:
 14.885074146300814

In [52]:
using Plots 

see also `StatsPlots`

In [53]:
scatter(GLM.fitted(lm1.model), GLM.residuals(lm1.model)) # qualified with GLM. because the unqualified call raised "UndefVarError: fitted not defined" with this GLM version

In [54]:
lm2 = lm(@formula(log(AHE) ~ Sex + Year), income, contrasts = Dict(:Year => DummyCoding()))

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

:(log(AHE)) ~ 1 + Sex + Year

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error       t  Pr(>|t|)   Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   2.64379     0.00930183  284.22    <1e-99   2.62556     2.66202
Sex: hello   -0.0789068   0.442598     -0.18    0.8585  -0.946476    0.788663
Sex: male     0.12926     0.00841026   15.37    <1e-52   0.112774    0.145745
Year: 1994   -0.0352327   0.0115052    -3.06    0.0022  -0.0577849  -0.0126805
Year: 1996   -0.0514071   0.0118822    -4.33    <1e-4   -0.0746983  -0.0281159
Year: 1998    0.00993609  0.0118891     0.84    0.4033  -0.0133687   0.0332409
──────────────────────────────────────────────────────────────────────────────

---

## A task