# Tabular data with DataFrames.jl

In this notebook we'll work with tabular data using [DataFrames.jl](http://juliadata.github.io/DataFrames.jl/stable/), one of the major Julia packages for this purpose.


In [None]:
import Pkg
Pkg.activate(joinpath(@__DIR__, ".."))

In [2]:
using DataFrames, CSV

The dataset we'll be looking at comes from [Kaggle](https://www.kaggle.com/datasets/deepu1109/star-dataset?resource=download), and contains observed features of a few stars. The data is stored in a `CSV` file, which we'll first read into a `DataFrame` object.

In [None]:
filepath = joinpath(@__DIR__, "..", "data", "star_features.csv") # path of data file
df = CSV.read(filepath, DataFrame) # read the file and convert it to a `DataFrame` 

We can see that this dataframe has 240 rows and 7 columns. To get the names of the columns in a Vector format, you can call the `names` function on the dataframe.

In [None]:
names(df)
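
As a small sketch of how to inspect the shape and column names of a table, consider the toy `DataFrame` below. The name `toy` and its values are made up for illustration and are not part of the star dataset.

```julia
using DataFrames

# Hypothetical toy table, not the star dataset
toy = DataFrame(a = [1, 2, 3], b = [0.5, 1.5, 2.5])

size(toy)           # (number of rows, number of columns)
nrow(toy)           # number of rows
ncol(toy)           # number of columns
names(toy)          # column names as a Vector{String}
propertynames(toy)  # column names as a Vector{Symbol}
```

Note that `names` returns strings, while `propertynames` returns the symbols used in expressions such as `toy.a`.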

## Basic stuff
One can do basic manipulation of this data, for example selecting specific columns, removing columns, adding new columns or rows, etc. In this notebook we won't be using the `star_type` column, so let's remove it:

In [None]:
select!(df, Not(:star_type))

To access specific rows or columns you can index a dataframe like a matrix, with rows first and columns second. For example, `df[1:3, :]` gives the first three rows, `df[:, 1:3]` gives the first three columns, and `df[:, [:a, :b]]` gives the columns selected by name.
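
The indexing patterns can be sketched on a toy table. The table `toy` and its columns `:a`, `:b`, `:c` are hypothetical and only serve as an illustration.

```julia
using DataFrames

# Hypothetical toy table with columns :a, :b, :c
toy = DataFrame(a = 1:4, b = [10.0, 20.0, 30.0, 40.0], c = ["w", "x", "y", "z"])

toy[1:2, :]       # first two rows, all columns
toy[:, 1:2]       # all rows, first two columns
toy[:, [:a, :c]]  # all rows, columns selected by name
toy[2, :b]        # a single value: row 2 of column :b
toy.b             # the whole column :b as a vector
```

In every case the first index selects rows and the second selects columns, just as for a matrix.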

Adding new data to a dataframe is also straightforward. One can add new rows or new columns. New rows are added with `push!`, either by providing a vector (whose length must match the number of columns), or by providing a named tuple that explicitly states which column gets which new value.

In [None]:
df1 = df[1:3, 1:3]

In [None]:
# add new row by order of columns
push!(df1, [5772, 1, 1])

In [None]:
# add new row by name of columns
push!(df1, (temperature_kelvin = 3600, radius_Rsun = 640, luminosity_Lsun = 65_000))

In [None]:
# add new column with given name by field assignment
df1.random_column = rand(size(df1, 1))
df1

## Selecting specific rows

There are many ways to select specific rows based on values of certain columns. For example, if we wanted to select the stars with effective temperatures below a certain value, we could do:

In [None]:
df[df.temperature_kelvin .<= 5772, :] # select all the rows for which temperature_kelvin is less than or equal to 5772

In that example, we also included all the columns. If we wanted to get only specific columns, we could do:

In [None]:
df[df.temperature_kelvin .<= 5772, [:luminosity_Lsun, :radius_Rsun]] # same as above, except we now only select two columns

We can also combine conditional statements to make more specific selections.

In [None]:
# select rows where temperature <= 5772, and radius is bigger than 0.11 and smaller than 0.67,
# and exclude the `star_colour` column
df[df.temperature_kelvin .<= 5772 .&& (0.11 .< df.radius_Rsun .< 0.67), Not(:star_colour)]
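
Besides boolean indexing, DataFrames.jl also provides the `filter` and `subset` functions for row selection. The following is a minimal sketch on a made-up table: `toy` and its values are assumptions for illustration, and only the column names mirror the star data.

```julia
using DataFrames

# Hypothetical toy table mimicking two columns of the star data
toy = DataFrame(temperature_kelvin = [3000, 5772, 9000, 4000],
                radius_Rsun = [0.2, 1.0, 2.5, 0.5])

# `filter` takes a `column => predicate` pair; the predicate sees one value at a time
cool = filter(:temperature_kelvin => T -> T <= 5772, toy)

# `subset` expects functions acting on whole columns, so row-wise
# predicates are wrapped in `ByRow`; several conditions are combined with "and"
cool_small = subset(toy,
    :temperature_kelvin => ByRow(<=(5772)),
    :radius_Rsun => ByRow(r -> 0.11 < r < 0.67))
```

Both functions return a new `DataFrame` and leave the original untouched; the in-place variants are `filter!` and `subset!`.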

## Querying and manipulating a `DataFrame`

General querying and data manipulation can be done on `DataFrame`s using the functions provided by DataFrames.jl itself. However, it is [recommended](https://dataframes.juliadata.org/stable/man/querying_frameworks/#Data-manipulation-frameworks) to use a package dedicated to querying and data manipulation that provides a simpler syntax for it. Several options exist, each providing a slightly different variant of the syntax, so you can pick whichever you feel most comfortable with. Here we will use the LINQ-like syntax of [Query.jl](https://www.queryverse.org/Query.jl/stable/), which can be used with any tabular data in Julia.

In [None]:
using Query
df

Let's perform a basic query that we explain step-by-step:

In [None]:
# `@from ... begin` starts a query.
# Here `row` (any name would be fine) is bound to each
# row of the table in turn. Think of it as a
# NamedTuple: you can access its values by name with the `.` syntax.
q = @from row in df begin
    # `@where` filters elements where the following
    # expression is true.
    @where row.spectral_class == "O"
    # `@select {stuff...}` creates a new named tuple
    @select {row.radius_Rsun, row.absolute_magnitude, Lsquared = row.luminosity_Lsun^2}
    # `@collect` collects the selected results into
    # the specified data structure
    @collect DataFrame
end

As you can see, you don't have to select only existing columns; you can also create new ones (specified by name), such as `Lsquared` here.
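
For comparison, a similar result can be obtained with DataFrames.jl functions alone. The sketch below assumes a small made-up table `stars` whose column names mirror those used in the query above. The values are invented for illustration and are not taken from the dataset.

```julia
using DataFrames

# Hypothetical stand-in for the star table (values are made up)
stars = DataFrame(
    spectral_class = ["O", "M", "O", "B"],
    radius_Rsun = [10.0, 0.2, 12.0, 5.0],
    absolute_magnitude = [-5.0, 16.0, -6.0, -2.0],
    luminosity_Lsun = [2.0e5, 0.002, 3.0e5, 1.0e3],
)

# keep only rows whose spectral class is "O" (the `@where` step)
o_stars = subset(stars, :spectral_class => ByRow(==("O")))

# keep two existing columns and create a new one (the `@select` step)
result = select(o_stars,
    :radius_Rsun,
    :absolute_magnitude,
    :luminosity_Lsun => ByRow(L -> L^2) => :Lsquared)
```

Which style reads better is largely a matter of taste: the Query.jl version keeps the whole pipeline in one block, while the DataFrames.jl version splits it into explicit steps.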

Of course, much more is possible in such a query. For details, see the documentation of Query.jl.

# Exercises


## Plotting subsets of a dataframe

`DataFrames` has a default dataset that is used in its test suite. Install the `CSV` package, and load this dataset with the command:
```julia
using DataFrames, CSV
iris = DataFrame(CSV.File(
    joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")
))
```

This dataset has various flower species (column `:Species`). For every species create a 1x2 figure with the following plots:

* [1,1] = scatter plot of `SepalLength` vs `SepalWidth`. 
* [1,2] = scatter plot of `PetalLength` vs `PetalWidth`.

For each of these scatter plots, calculate and print the Pearson correlation coefficient.
