# Working with DataFrames.jl v1.3.0

# Part 2

## Bogumił Kamiński

In this part of the tutorial we will work with a data set taken from the paper:

D. F. Lott, "[Dominance relations and breeding rate in mature male American bison](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1439-0310.1979.tb00302.x)", Zeitschrift für Tierpsychologie, 1979, 49: 418-432

You can find the description of the interpretation of the data [here](http://moreno.ss.uci.edu/data.html#bison).
In short, the data set stores information about dominance encounters and breeding behaviors of 26 males in a herd of American bison.

The data set *bison.json* that we will work with is bundled with this file in a GitHub gist.

Each line in *bison.json* is a JSON entry giving:
* bison id `:id`
* bison breeding success `:breeding`
* a list of pairs giving the domination relation between this bison and other bison, each pair consisting of the other bison's id and a domination value

Our objective is to read this data into a `DataFrame` and then analyze whether a higher domination score of a bison correlates with its breeding success.

We start with loading the required packages.

In [1]:
using DataFrames
using JSON3
using Statistics

and changing the number of columns printed:

In [2]:
ENV["COLUMNS"] = 500

500

As usual, before we start let us make sure that you have the right versions of the packages installed.

The output of the command below should be:
```
  [6e4b80f9] BenchmarkTools v1.2.0
  [336ed68f] CSV v0.9.11
  [8be319e6] Chain v0.4.8
  [a93c6f00] DataFrames v1.3.0
  [7073ff75] IJulia v1.23.2
  [0f8b85d8] JSON3 v1.9.2
  [91a5bcdd] Plots v1.25.0
```

In [3]:
] status

[32m[1m      Status[22m[39m `C:\WORK\dev\DataFramesTutorials\DataFrames-Showcase\Project.toml`
 [90m [6e4b80f9] [39mBenchmarkTools v1.2.0
 [90m [336ed68f] [39mCSV v0.9.11
 [90m [8be319e6] [39mChain v0.4.8
 [90m [a93c6f00] [39mDataFrames v1.3.0
 [90m [7073ff75] [39mIJulia v1.23.2
 [90m [0f8b85d8] [39mJSON3 v1.9.2
 [90m [91a5bcdd] [39mPlots v1.25.0


Let us peek into the *bison.json* file:

In [4]:
readlines("bison.json")

26-element Vector{String}:
 "{\"id\":\"1\", \"3\":8, \"4\":5, \"2\":2, \"6\":6, \"8\":11, \"9\":3, \"5\":21, \"10\":5, \"7\":7, \"12\":1, \"17\":3, \"18\":5, \"13\":2, \"14\":3, \"21\":4, \"19\":2, \"15\":1, \"20\":7, \"breeding\":4}"
 "{\"id\":\"3\", \"4\":4, \"6\":4, \"8\":4, \"9\":8, \"5\":12, \"10\":12, \"7\":3, \"12\":2, \"17\":2, \"18\":7, \"13\":5, \"14\":1, \"21\":3, \"19\":4, \"20\":5, \"16\":1, \"breeding\":2}"
 "{\"id\":\"4\", \"1\":1, \"3\":1, \"2\":3, \"6\":1, \"8\":4, \"9\":4, \"5\":8, \"10\":10, \"7\":1, \"17\":1, \"18\":4, \"13\":6, \"21\":2, \"19\":2, \"15\":3, \"11\":1, \"20\":4, \"22\":3, \"breeding\":3}"
 "{\"id\":\"26\", \"2\":7, \"25\":1, \"23\":2, \"10\":3, \"21\":1, \"19\":1, \"breeding\":1}"
 "{\"id\":\"2\", \"1\":1, \"3\":2, \"4\":3, \"26\":1, \"6\":1, \"8\":3, \"5\":3, \"10\":3, \"7\":2, \"12\":2, \"17\":1, \"18\":3, \"13\":2, \"14\":4, \"21\":1, \"19\":1, \"15\":1, \"20\":2, \"breeding\":1}"
 "{\"id\":\"25\", \"1\":2, \"24\":8, \"9\":12, \"23\":1, \"10\":7, 

Indeed we see that we have 26 lines in the file. We immediately notice that the bison ids are numeric, which might be a challenge if we want to use them as column names (fortunately, since DataFrames.jl v0.21 this is not a problem).
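To make the structure of a single entry concrete, here is a minimal sketch that parses one line copied from the listing above with `JSON3.read`. The variable names (`line`, `entry`, and so on) are only illustrative and do not appear elsewhere in this tutorial.

```julia
using JSON3

# One entry copied from the file listing above (the bison with id "26").
line = "{\"id\":\"26\", \"2\":7, \"25\":1, \"23\":2, \"10\":3, \"21\":1, \"19\":1, \"breeding\":1}"

entry = JSON3.read(line)

# Keys are exposed as `Symbol`s, so the ids of the other bison become
# symbols made only of digits, e.g. Symbol("2").
id = entry[:id]
breeding = entry[:breeding]
value_vs_2 = entry[Symbol("2")]

# Number of key-value pairs stored in this entry.
n_fields = length(entry)
```

Because every entry lists only the bison that were actually encountered, different lines carry different sets of keys, which is why the data frame has to be built without a fixed schema.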

First we populate the data frame with the data stored in the file:

In [5]:
df = DataFrame()

In [6]:
foreach(JSON3.read.(readlines("bison.json"))) do row
    push!(df, row, cols=:union)
end

In [7]:
df

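| Row | breeding | 17 | 20 | 8 | 7 | 2 | 12 | 18 | 13 | 21 | 9 | 6 | 14 | 19 | id | 10 | 4 | 3 | 5 | 15 | 16 | 1 | 22 | 11 | 25 | 23 | 26 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | String | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? |
| 1 | 4 | 3 | 7 | 11 | 7 | 2 | 1 | 5 | 2 | 4 | 3 | 6 | 3 | 2 | 1 | 5 | 5 | 8 | 21 | 1 | missing | missing | missing | missing | missing | missing | missing | missing |
| 2 | 2 | 2 | 5 | 4 | 3 | missing | 2 | 7 | 5 | 3 | 8 | 4 | 1 | 4 | 3 | 12 | 4 | missing | 12 | missing | 1 | missing | missing | missing | missing | missing | missing | missing |
| 3 | 3 | 1 | 4 | 4 | 1 | 3 | missing | 4 | 6 | 2 | 4 | 1 | missing | 2 | 4 | 10 | missing | 1 | 8 | 3 | missing | 1 | 3 | 1 | missing | missing | missing | missing |
| 4 | 1 | missing | missing | missing | missing | 7 | missing | missing | missing | 1 | missing | missing | missing | 1 | 26 | 3 | missing | missing | missing | missing | missing | missing | missing | missing | 1 | 2 | missing | missing |
| 5 | 1 | 1 | 2 | 3 | 2 | missing | 2 | 3 | 2 | 1 | missing | 1 | 4 | 1 | 2 | 3 | 3 | 2 | 3 | 1 | missing | 1 | missing | missing | missing | missing | 1 | missing |
| 6 | 3 | missing | missing | missing | missing | missing | missing | missing | missing | 3 | 12 | missing | 8 | 3 | 25 | 7 | missing | missing | missing | 2 | missing | 2 | missing | missing | missing | 1 | missing | 8 |
| 7 | 4 | 1 | 4 | 5 | 1 | missing | 2 | 1 | 3 | missing | 2 | missing | 1 | 1 | 6 | 5 | missing | missing | 6 | 1 | missing | 1 | missing | missing | missing | missing | missing | missing |
| 8 | 1 | 4 | 7 | missing | missing | 1 | missing | 2 | 2 | missing | 11 | missing | 1 | 6 | 8 | 6 | 1 | missing | 3 | missing | 1 | 2 | 1 | 1 | missing | 6 | missing | 5 |
| 9 | 4 | 1 | 1 | 2 | missing | missing | missing | missing | 1 | 3 | 2 | missing | 3 | 2 | 24 | 2 | 1 | missing | missing | 2 | missing | missing | missing | missing | missing | 1 | missing | missing |
| 10 | 1 | 4 | 3 | 3 | 1 | missing | missing | missing | missing | 4 | missing | 2 | 2 | 4 | 9 | 9 | 1 | missing | missing | 3 | 2 | 4 | missing | missing | 1 | 2 | missing | 1 |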


First note that we have used `cols=:union` in the `push!` call when loading the JSON data into the `df` data frame. With this setting:
* new columns are added automatically: if the next JSON entry contains a column that is not yet present in the data frame, the column is added and the previously existing rows are filled with `missing` for it; you can see this e.g. in the column named `"26"` (the second-to-last column of `df`)
* if a JSON entry lacks some column, it is not a problem either: `missing` is simply put in that column in the corresponding row; you can see this e.g. in the column named `"17"` in row 4
* columns are automatically promoted to an appropriate element type (here the columns containing missing values were promoted in this way)

So, as you can see, with `push!` you can add data to a data frame without knowing its schema upfront. The same functionality is provided by `append!` and `vcat`, which also accept the `cols=:union` keyword argument.
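The three behaviors listed above can be reproduced on a tiny example. This is only a sketch: the rows and the column names `x`, `y`, and `z` are hypothetical and unrelated to the bison data.

```julia
using DataFrames

# Hypothetical toy rows; each one carries a different set of fields.
rows = [(id = "a", x = 1), (id = "b", y = 2)]

toy = DataFrame()
for row in rows
    # cols=:union adds unseen columns and fills the gaps with `missing`
    push!(toy, row, cols=:union)
end

# `append!` accepts the same keyword argument when adding a whole data frame.
append!(toy, DataFrame(id = ["c"], z = [0.5]), cols=:union)

toy
```

After these calls `toy` has the columns `id`, `x`, `y`, and `z`, and every column except `id` allows `missing`, because each of them was absent from at least one of the added rows.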

We note that the order of columns of our data frame is not very nice. This is due to the fact that JSON3 does not give guarantees on ordering of columns. Fortunately this is easily fixed:

In [8]:
select!(df, :id, :breeding, sort(names(df, r"\d"), by=x -> parse(Int, x)))

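| Row | id | breeding | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? |
| 1 | 1 | 4 | missing | 2 | 8 | 5 | 21 | 6 | 7 | 11 | 3 | 5 | missing | 1 | 2 | 3 | 1 | missing | 3 | 5 | 2 | 7 | 4 | missing | missing | missing | missing | missing |
| 2 | 3 | 2 | missing | missing | missing | 4 | 12 | 4 | 3 | 4 | 8 | 12 | missing | 2 | 5 | 1 | missing | 1 | 2 | 7 | 4 | 5 | 3 | missing | missing | missing | missing | missing |
| 3 | 4 | 3 | 1 | 3 | 1 | missing | 8 | 1 | 1 | 4 | 4 | 10 | 1 | missing | 6 | missing | 3 | missing | 1 | 4 | 2 | 4 | 2 | 3 | missing | missing | missing | missing |
| 4 | 26 | 1 | missing | 7 | missing | missing | missing | missing | missing | missing | missing | 3 | missing | missing | missing | missing | missing | missing | missing | missing | 1 | missing | 1 | missing | 2 | missing | 1 | missing |
| 5 | 2 | 1 | 1 | missing | 2 | 3 | 3 | 1 | 2 | 3 | missing | 3 | missing | 2 | 2 | 4 | 1 | missing | 1 | 3 | 1 | 2 | 1 | missing | missing | missing | missing | 1 |
| 6 | 25 | 3 | 2 | missing | missing | missing | missing | missing | missing | missing | 12 | 7 | missing | missing | missing | 8 | 2 | missing | missing | missing | 3 | missing | 3 | missing | 1 | 8 | missing | missing |
| 7 | 6 | 4 | 1 | missing | missing | missing | 6 | missing | 1 | 5 | 2 | 5 | missing | 2 | 3 | 1 | 1 | missing | 1 | 1 | 1 | 4 | missing | missing | missing | missing | missing | missing |
| 8 | 8 | 1 | 2 | 1 | missing | 1 | 3 | missing | missing | missing | 11 | 6 | 1 | missing | 2 | 1 | missing | 1 | 4 | 2 | 6 | 7 | missing | 1 | 6 | 5 | missing | missing |
| 9 | 24 | 4 | missing | missing | missing | 1 | missing | missing | missing | 2 | 2 | 2 | missing | missing | 1 | 3 | 2 | missing | 1 | missing | 2 | 1 | 3 | missing | 1 | missing | missing | missing |
| 10 | 9 | 1 | 4 | missing | missing | 1 | missing | 2 | 1 | 3 | missing | 9 | missing | missing | missing | 2 | 3 | 2 | 4 | missing | 4 | 3 | 4 | missing | 2 | 1 | 1 | missing |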


Again, note how expressive DataFrames.jl is. With `names(df, r"\d")` we selected, as strings, the names of all columns that contain a digit, and then we sorted them by their numeric value.

The same selection could have been written as `names(df, Not([:id, :breeding]))`. If we wanted to be more cautious with our regex we could have written `names(df, r"^\d+$")`. In this case all variants we described give exactly the same result.
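The selectors mentioned above agree on the bison data, but they are not equivalent in general. The sketch below uses a hypothetical data frame with an extra column `"x1"` (not present in the tutorial data) to show where the loose regex and the anchored regex differ.

```julia
using DataFrames

# Hypothetical toy data frame; "x1" contains a digit but is not a pure number.
toy = DataFrame("id" => ["a"], "breeding" => [1], "10" => [0], "2" => [0], "x1" => [0])

loose = names(toy, r"\d")                     # any name containing a digit
strict = names(toy, r"^\d+$")                 # names made of digits only
complement = names(toy, Not([:id, :breeding]))  # everything except the two listed columns

# Sorting by numeric value avoids the lexicographic order "10" < "2".
ordered = sort(strict, by = x -> parse(Int, x))
```

Here `loose` and `complement` both include `"x1"`, while `strict` does not, so only the anchored pattern is safe to pass to `parse(Int, x)`.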

Before we move forward let me highlight that it is very easy to access the columns with non-standard names (like strings consisting of numbers) in the following way:

In [9]:
df."1"

26-element Vector{Union{Missing, Int64}}:
  missing
  missing
 1
  missing
 1
 2
 1
 2
  missing
 4
 1
 2
 1
 3
  missing
 1
 1
 1
 1
  missing
  missing
  missing
  missing
  missing
  missing
  missing

or e.g.:

In [10]:
df[:, "1"]

26-element Vector{Union{Missing, Int64}}:
  missing
  missing
 1
  missing
 1
 2
 1
 2
  missing
 4
 1
 2
 1
 3
  missing
 1
 1
 1
 1
  missing
  missing
  missing
  missing
  missing
  missing
  missing
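The two access forms shown above return equal values, but they differ in whether the column is copied. A minimal sketch on a hypothetical two-row data frame (the names `toy`, `col_prop`, and so on are illustrative):

```julia
using DataFrames

# Hypothetical toy data frame with a digit-only column name.
toy = DataFrame("id" => ["a", "b"], "1" => [10, 20])

col_prop = toy."1"               # property access: the stored column, no copy
col_bang = toy[!, "1"]           # `!` row selector: the same stored column
col_copy = toy[:, "1"]           # `:` row selector: a fresh copy
col_sym = toy[!, Symbol("1")]    # a Symbol works as a column name too
```

Mutating `col_copy` leaves `toy` untouched, whereas mutating `col_prop` or `col_bang` changes the data frame itself.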

Now for each bison let us calculate an aggregate of domination values:

In [11]:
df2 = select(df, :breeding, AsTable(r"\d") => ByRow(sum∘skipmissing) => :score)

| Row | breeding | score |
|---|---|---|
| | Int64 | Int64 |
| 1 | 4 | 96 |
| 2 | 2 | 77 |
| 3 | 3 | 59 |
| 4 | 1 | 15 |
| 5 | 1 | 36 |
| 6 | 3 | 46 |
| 7 | 4 | 34 |
| 8 | 1 | 60 |
| 9 | 4 | 21 |
| 10 | 1 | 46 |


A similar way to achieve this would be to replace `missing` with `0` using `coalesce` on `df` and then just use `+` on whole columns:

In [12]:
select(coalesce.(df, 0), :breeding, r"\d" => (+) => :score)

| Row | breeding | score |
|---|---|---|
| | Int64 | Int64 |
| 1 | 4 | 96 |
| 2 | 2 | 77 |
| 3 | 3 | 59 |
| 4 | 1 | 15 |
| 5 | 1 | 36 |
| 6 | 3 | 46 |
| 7 | 4 | 34 |
| 8 | 1 | 60 |
| 9 | 4 | 21 |
| 10 | 1 | 46 |
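To see why the two approaches agree, it may help to run them on a small data frame where the sums can be checked by hand. The data below is hypothetical; only the transformation syntax is taken from the cells above.

```julia
using DataFrames

# Hypothetical toy data: two rows, three digit-named columns with gaps.
toy = DataFrame("id" => ["a", "b"],
                "1" => [1, missing],
                "2" => [missing, 5],
                "3" => [2, 3])

# Row-wise: AsTable passes each row as a NamedTuple, whose values are
# summed after the missing entries are skipped.
by_row = select(toy, :id, AsTable(r"\d") => ByRow(sum∘skipmissing) => :score)

# Column-wise: missings are replaced with 0 first, then the selected
# columns are passed to `+` as separate vectors and added elementwise.
by_col = select(coalesce.(toy, 0), :id, r"\d" => (+) => :score)
```

Both results contain the scores 3 and 8. The row-wise variant keeps the original data frame intact, while the column-wise variant first builds a copy of the whole data frame with the missings replaced.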


We finish by aggregating the `:score` column by the `:breeding` column:

In [13]:
combine(groupby(df2, :breeding, sort=true), :score .=> [mean, std, minimum, median, maximum])

| Row | breeding | score_mean | score_std | score_minimum | score_median | score_maximum |
|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Int64 | Float64 | Int64 |
| 1 | 0 | 22.4286 | 18.0172 | 1 | 21.0 | 57 |
| 2 | 1 | 30.5556 | 15.4119 | 15 | 28.0 | 60 |
| 3 | 2 | 39.0 | 33.2866 | 15 | 25.0 | 77 |
| 4 | 3 | 49.25 | 19.1028 | 24 | 52.5 | 68 |
| 5 | 4 | 50.3333 | 40.0791 | 21 | 34.0 | 96 |


Note that it is very easy to apply multiple transformations at the same time using broadcasting.
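The broadcasted `.=>` used above simply builds a vector of `column => function` pairs, one per function. A minimal sketch with hypothetical data (the names `toy`, `g`, and `specs` are illustrative):

```julia
using DataFrames
using Statistics

# Hypothetical toy data with two groups.
toy = DataFrame(g = [1, 1, 2, 2], score = [1, 3, 10, 20])

# Broadcasting pairs the single column name with each function in turn.
specs = :score .=> [mean, maximum]

# `combine` accepts the resulting vector of pairs and names the outputs
# by joining the column name and the function name with an underscore.
res = combine(groupby(toy, :g, sort=true), specs)
```

Writing `specs` out by hand as `[:score => mean, :score => maximum]` gives the same result; broadcasting only saves the repetition.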

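Looking at the data, it indeed seems that `:breeding` and `:score` are positively correlated: the mean score grows with `:breeding`, from about 22.4 in the group with breeding 0 to about 50.3 in the group with breeding 4, although the spread within groups is large.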
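One way to put a number on this impression is the Pearson correlation from the `Statistics` standard library. The sketch below is an assumption about how one might check it, not part of the original analysis. It uses a stand-in data frame holding only the ten `(breeding, score)` rows printed earlier, so its value is not the correlation for the full data set.

```julia
using DataFrames
using Statistics

# Stand-in for `df2`: only the ten rows printed above, not all 26 bison.
preview = DataFrame(breeding = [4, 2, 3, 1, 1, 3, 4, 1, 4, 1],
                    score = [96, 77, 59, 15, 36, 46, 34, 60, 21, 46])

# Pearson correlation coefficient between the two columns.
r = cor(preview.breeding, preview.score)
```

With the full data frame from this tutorial, the corresponding call would be `cor(df2.breeding, df2.score)`.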