# Tabular data with DataFrames.jl

The `results` that we got above is a `DataFrame` from [DataFrames.jl](http://juliadata.github.io/DataFrames.jl/stable/), one of the major Julia packages for tabular data.

## Basic stuff
Basic operations on such data include selecting specific columns, removing columns, and adding new columns or rows. The `path` column of `results` is not very useful here, so let's remove it:

In [None]:
select!(results, Not(:path))

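| Row | model | b | r | a | d |
|---|---|---|---|---|---|
| | String? | Int64? | Int64? | Int64? | Int64? |
| 1 | linear | 2 | 3 | 1 | missing |
| 2 | linear | 3 | 4 | 1 | missing |
| 3 | cubic | missing | 126 | 1 | 5 |
| 4 | cubic | missing | 217 | 1 | 6 |
| 5 | linear | 2 | 4 | 2 | missing |
| 6 | linear | 3 | 5 | 2 | missing |
| 7 | cubic | missing | 127 | 2 | 5 |
| 8 | cubic | missing | 218 | 2 | 6 |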
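
Note that the `!` in `select!` means the data frame is modified in place, while `select` without the `!` returns a new data frame and leaves the original untouched. A minimal sketch of the difference, using a small hypothetical table (the values and the `trimmed` name are made up for illustration, not taken from `results`):

```julia
using DataFrames

# Hypothetical stand-in table; only the column names follow the text above
df = DataFrame(path = ["a.jld2", "b.jld2"], model = ["linear", "cubic"], r = [3, 126])

# `select` (no `!`) returns a new DataFrame; `df` still has its `path` column
trimmed = select(df, Not(:path))

# `select!` removes the column from `df` itself
select!(df, Not(:path))
```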


To access specific rows or columns you can index a dataframe like a matrix, with rows first and columns second. For example, `df[:, 1:3]` gives all rows of the first three columns, while `df[:, [:a, :b]]` gives the columns selected by name.
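
As a rough illustration of the matrix-like indexing, here is a sketch on a small hypothetical data frame (the column names `a` to `d` and all values are assumptions made up for this example):

```julia
using DataFrames

# Hypothetical table, unrelated to `results`
df = DataFrame(a = [1, 2, 3, 4], b = ["w", "x", "y", "z"],
               c = [0.1, 0.2, 0.3, 0.4], d = [5, 6, 7, 8])

first_cols = df[:, 1:3]        # all rows, first three columns (a copy)
by_name    = df[:, [:a, :b]]   # all rows, columns `a` and `b`
first_rows = df[1:2, :]        # first two rows, all columns
col_b      = df.b              # a single column as a vector (no copy)
cell       = df[2, :c]         # a single cell: row 2 of column `c`
```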

Adding new data to a dataframe is also straightforward. One can add new rows or new columns. New rows are added either by providing a vector (whose length matches the number of columns), or by providing a named tuple that explicitly names which column gets which new value.

In [None]:
df1 = results[1:3, 1:3]

| Row | model | b | r |
|---|---|---|---|
| | String? | Int64? | Int64? |
| 1 | linear | 2 | 3 |
| 2 | linear | 3 | 4 |
| 3 | cubic | missing | 126 |


In [None]:
# add new row by order of columns
push!(df1, ["test", 5, 5])

| Row | model | b | r |
|---|---|---|---|
| | String? | Int64? | Int64? |
| 1 | linear | 2 | 3 |
| 2 | linear | 3 | 4 |
| 3 | cubic | missing | 126 |
| 4 | test | 5 | 5 |


In [None]:
# add new row by name of columns
push!(df1, (model = "test2", r=89, b = -5))

| Row | model | b | r |
|---|---|---|---|
| | String? | Int64? | Int64? |
| 1 | linear | 2 | 3 |
| 2 | linear | 3 | 4 |
| 3 | cubic | missing | 126 |
| 4 | test | 5 | 5 |
| 5 | test2 | -5 | 89 |


In [None]:
# add new column with given name by field assignment
df1.z = rand(size(df1, 1))
df1

| Row | model | b | r | z |
|---|---|---|---|---|
| | String? | Int64? | Int64? | Float64 |
| 1 | linear | 2 | 3 | 0.821092 |
| 2 | linear | 3 | 4 | 0.35409 |
| 3 | cubic | missing | 126 | 0.576274 |
| 4 | test | 5 | 5 | 0.781562 |
| 5 | test2 | -5 | 89 | 0.416933 |
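
Field assignment is only one of several ways to add a column. The sketch below shows a few alternatives from DataFrames.jl on a hypothetical two-row table (the column names `rsquared`, `id` and `rsqrt` are made up for illustration):

```julia
using DataFrames

# Hypothetical table, unrelated to `df1`
df = DataFrame(model = ["linear", "cubic"], r = [3, 126])

# `!` as the row index stores the given vector as a new column without copying it
df[!, :rsquared] = df.r .^ 2

# `insertcols!` places a new column at a chosen position (here: first)
insertcols!(df, 1, :id => 1:nrow(df))

# `transform!` derives a new column from existing ones, row by row
transform!(df, :r => ByRow(sqrt) => :rsqrt)
```

Which form to use is mostly a matter of taste; `insertcols!` is handy when the column position matters, and `transform!` when the new column is computed from existing ones.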


## Querying and manipulating a `DataFrame`

General querying and data manipulation can be done on `DataFrame`s using the functions that DataFrames.jl itself provides. However, it is [recommended](https://dataframes.juliadata.org/stable/man/querying_frameworks/#Data-manipulation-frameworks) to use a package dedicated to querying and data manipulation, since these offer a simpler syntax. Several options exist, each providing a slightly different variant of the syntax, so you can pick whichever you feel most comfortable with. Here we will use the LINQ-like syntax of [Query.jl](https://www.queryverse.org/Query.jl/stable/), which can be used with any tabular Julia data.

In [None]:
using Query
results

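| Row | model | b | r | a | d |
|---|---|---|---|---|---|
| | String? | Int64? | Int64? | Int64? | Int64? |
| 1 | linear | 2 | 3 | 1 | missing |
| 2 | linear | 3 | 4 | 1 | missing |
| 3 | cubic | missing | 126 | 1 | 5 |
| 4 | cubic | missing | 217 | 1 | 6 |
| 5 | linear | 2 | 4 | 2 | missing |
| 6 | linear | 3 | 5 | 2 | missing |
| 7 | cubic | missing | 127 | 2 | 5 |
| 8 | cubic | missing | 218 | 2 | 6 |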


Let's perform a basic query that we explain step-by-step:

In [None]:
# `@from ... begin` initializes a query.
# Here `row` (any name would be fine) stands for one
# row of the tabular data at a time. Think of it as a
# NamedTuple: you can access its values by name with the `.` syntax
q = @from row in results begin
    # `@where` filters elements where the following
    # expression is true.
    @where row.model == "linear"
    # `@select {stuff...}` creates a new named tuple
    @select {row.a, row.r, rsquared = row.r^2}
    # `@collect` collects the selected results into
    # the specified data structure
    @collect DataFrame
end

| Row | a | r | rsquared |
|---|---|---|---|
| | Int64? | Int64? | Int64? |
| 1 | 1 | 3 | 9 |
| 2 | 1 | 4 | 16 |
| 3 | 2 | 4 | 16 |
| 4 | 2 | 5 | 25 |


As you can see, you are not limited to selecting existing columns: you can also create new ones (specified by name), like `rsquared` here.

Of course, much more is possible in such a query. For more, see the documentation of Query.jl.
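
For comparison, a similar filter-and-select can also be written with functions from DataFrames.jl alone. The following is only a sketch on a small hypothetical table that has no missing values (the names `df`, `linear` and `q2` are made up for illustration); it is not meant as a replacement for the query above:

```julia
using DataFrames

# Hypothetical stand-in using the same column names as `results`
df = DataFrame(model = ["linear", "linear", "cubic"], a = [1, 2, 1], r = [3, 4, 126])

# keep only the rows where `model` is "linear"
linear = subset(df, :model => ByRow(==("linear")))

# keep `a` and `r`, and add `rsquared` computed row by row
q2 = select(linear, :a, :r, :r => ByRow(x -> x^2) => :rsquared)
```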

# Exercises


## Plotting subsets of a dataframe

`DataFrames` has a default dataset that is used in its test suite. Install the `CSV` package, and load this dataset with the command:
```julia
using DataFrames, CSV
iris = DataFrame(CSV.File(
    joinpath(dirname(pathof(DataFrames)), 
    "../docs/src/assets/iris.csv")
))
```

This dataset has various flower species (column `:Species`). For every species create a 1x2 figure with the following plots:

* [1,1] = scatter plot of `SepalLength` vs `SepalWidth`. 
* [1,2] = scatter plot of `PetalLength` vs `PetalWidth`.

For each of these scatter plots, calculate and print the Pearson correlation coefficient.
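
As a hint, the Pearson correlation coefficient of two vectors is their covariance divided by the product of their standard deviations, and the `Statistics` standard library provides it as `cor`. A minimal sketch on made-up numbers (not taken from the iris dataset):

```julia
using Statistics

# Made-up sample data for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Pearson correlation via the standard library
ρ = cor(x, y)

# The same quantity from its definition
ρ_manual = cov(x, y) / (std(x) * std(y))

println("Pearson correlation: ", round(ρ; digits = 3))
```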
