# DataFrames and data wrangling

In this notebook we will look at the [DataFrames.jl](https://juliadata.github.io/DataFrames.jl/stable/) package.

`DataFrame` objects contain data tables consisting of a series of vectors, each representing a column or variable.

We will focus on the following operations:

* Filtering rows
* Selecting columns
* Adding and modifying columns
* Sorting
* Performing calculations on all rows or by groups of rows

We will also look at the [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) and [Query.jl](https://www.queryverse.org/Query.jl/stable/) packages that provide additional functionality for working with DataFrames.

As always we first have to load the package.

In [1]:
using DataFrames

Here is a simple `DataFrame` object created with code.

In [2]:
df = DataFrame(x = 1:5, y = ["red", "blue", "red", "blue", "green"])

Row,x,y
,Int64,String
1,1,red
2,2,blue
3,3,red
4,4,blue
5,5,green


Individual columns can be referenced as `df.y` or `df[!, :y]`. Neither of these makes a copy of the column, so if the data in the column changes, it will change for all references.

In [3]:
y = df[!, :y]

5-element Array{String,1}:
 "red"  
 "blue" 
 "red"  
 "blue" 
 "green"

In [4]:
y[3] = "purple"

"purple"

In [5]:
df

Row,x,y
,Int64,String
1,1,red
2,2,blue
3,3,purple
4,4,blue
5,5,green


To make a copy of a column, use `df[:, :y]` instead.

In [6]:
y2 = df[:, :y]

5-element Array{String,1}:
 "red"   
 "blue"  
 "purple"
 "blue"  
 "green" 

In [7]:
y2[3] = "yellow"

"yellow"

In [8]:
df, y2

(5×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ red    │
│ 2   │ 2     │ blue   │
│ 3   │ 3     │ purple │
│ 4   │ 4     │ blue   │
│ 5   │ 5     │ green  │, ["red", "blue", "yellow", "blue", "green"])

Notice that the data in `df` did not change this time.
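
The difference between the two indexing forms can also be checked directly with `===`, which tests whether two names refer to the same object. The snippet below is a minimal sketch on a small made-up table (`toy` is a hypothetical name, not part of the notebook's data):

```julia
using DataFrames

toy = DataFrame(x = 1:3, y = ["red", "blue", "green"])

# `toy.y` and `toy[!, :y]` both return the stored column itself (no copy).
same_object = toy.y === toy[!, :y]

# `toy[:, :y]` allocates a new vector with equal contents.
col_copy = toy[:, :y]
is_copy = (col_copy !== toy.y) && (col_copy == toy.y)

# Mutating the copy leaves the data frame untouched.
col_copy[1] = "yellow"
unchanged = toy.y[1] == "red"
```

All three flags (`same_object`, `is_copy`, `unchanged`) should come out `true`.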

For the rest of this notebook, we'll work with the `flights` dataset from the R package `nycflights13`.

In [9]:
using RData
nycflights = load("data/nycflights13.RData")

Dict{String,Any} with 5 entries:
  "airlines" => 16×2 DataFrame…
  "airports" => 1458×8 DataFrame. Omitted printing of 4 columns…
  "flights"  => 336776×19 DataFrame. Omitted printing of 13 columns…
  "planes"   => 3322×9 DataFrame. Omitted printing of 6 columns…
  "weather"  => 26115×15 DataFrame. Omitted printing of 8 columns…

We can see descriptive statistics for a DataFrame using the `describe` function.

In [10]:
flights = nycflights["flights"]
describe(flights)

Row,variable,mean,min,median,max
,Symbol,Union…,Any,Union…,Any
1,year,2013.0,2013,2013.0,2013
2,month,6.54851,1,7.0,12
3,day,15.7108,1,16.0,31
4,dep_time,1349.11,1,1401.0,2400
5,sched_dep_time,1344.25,106,1359.0,2359
6,dep_delay,12.6391,-43.0,-2.0,1301.0
7,arr_time,1502.05,1,1535.0,2400
8,sched_arr_time,1536.38,1,1556.0,2359
9,arr_delay,6.89538,-86.0,-5.0,1272.0
10,carrier,,9E,,YV


### Dot syntax for vectorizing functions and operators

Before we learn about filtering, we need to learn about Julia's "dot syntax" for vectorizing functions and operators.

Suppose we have a vector of floating point numbers:

In [11]:
A = [1.0, 2.0, 3.0]

3-element Array{Float64,1}:
 1.0
 2.0
 3.0

And suppose we want to calculate the sine of each number. We can calculate the sine of a number using the `sin` function like this:

In [12]:
sin(1.0)

0.8414709848078965

What happens if we call `sin` on `A`?

In [13]:
sin(A)

MethodError: MethodError: no method matching sin(::Array{Float64,1})
Closest candidates are:
  sin(!Matched::BigFloat) at mpfr.jl:743
  sin(!Matched::Missing) at math.jl:1138
  sin(!Matched::Complex{Float16}) at math.jl:1086
  ...

The `sin` function doesn't know how to operate on the type `Array{Float64,1}`, which is how Julia describes a one-dimensional array of double precision (64-bit) floating point numbers.

Some languages would require us to use a separate "vectorized" function, but in Julia, we can do this automatically using the following "dot" syntax:

In [14]:
sin.(A)

3-element Array{Float64,1}:
 0.8414709848078965
 0.9092974268256817
 0.1411200080598672

The same thing works with operators, but the dot comes before the operator. For example:

In [15]:
A .^ 3

3-element Array{Float64,1}:
  1.0
  8.0
 27.0
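
To tie these pieces together, the sketch below (plain Julia, no packages) shows that a dotted call is equivalent to `map` or `broadcast` for a single vector, and that several dotted operations can be combined in one expression. The `@.` macro, which adds a dot to every call and operator in an expression, is part of Base Julia but is not otherwise used in this notebook.

```julia
A = [1.0, 2.0, 3.0]

# Calling `sin` on the whole vector throws a MethodError.
throws_error = try
    sin(A)
    false
catch err
    err isa MethodError
end

# For a single vector, these three forms give the same result.
via_dot = sin.(A)
via_map = map(sin, A)
via_broadcast = broadcast(sin, A)

# Dotted calls and operators can be chained in one expression.
identity_check = sin.(A) .^ 2 .+ cos.(A) .^ 2

# `@.` dots every call and operator in the expression for us.
macro_form = @. sin(A)^2 + cos(A)^2
```

Every element of `identity_check` should be approximately 1, since sin² + cos² = 1 holds elementwise.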

### Filtering rows

Here is an example of filtering rows. We want only the rows where `month` equals 7 and `day` equals 17. We use the `first` function to display just the first 6 rows in the notebook.

Notice the use of dot syntax.

In [16]:
first(flights[(flights.month .== 7) .& (flights.day .== 17), :], 6)

Row,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time
,Int32,Int32,Int32,Int32⍰,Int32,Float64⍰,Int32⍰,Int32
1,2013,7,17,2,2030,212.0,117,2202
2,2013,7,17,5,2359,6.0,345,340
3,2013,7,17,6,2135,151.0,245,30
4,2013,7,17,9,2305,64.0,113,13
5,2013,7,17,104,2359,65.0,440,344
6,2013,7,17,143,2030,313.0,242,2202


If we wanted rows where `month` is 11 or 12, we could do this. (The output is suppressed by the ; at the end.)

In [17]:
flights[(flights.month .== 11) .| (flights.month .== 12), :];

The following does the same thing, but requires a little explanation. Here `in([11, 12])` returns a *function* that checks whether its argument is *in* the collection `[11, 12]`. This function is then vectorized (or broadcast) over the `month` column of the dataframe, producing a Boolean vector that selects the rows for which `month` is 11 or 12. We then display just the first 6 rows.

In [18]:
first(flights[in([11, 12]).(flights.month), :], 6)

Row,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time
,Int32,Int32,Int32,Int32⍰,Int32,Float64⍰,Int32⍰,Int32
1,2013,11,1,5,2359,6.0,352,345
2,2013,11,1,35,2250,105.0,123,2356
3,2013,11,1,455,500,-5.0,641,651
4,2013,11,1,539,545,-6.0,856,827
5,2013,11,1,542,545,-3.0,831,855
6,2013,11,1,549,600,-11.0,912,923
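
The filtering patterns in this section can be tried without loading the full dataset. The following is a minimal sketch on a made-up table (`toy` and `is_late` are hypothetical names). It also shows `filter` with a row-wise predicate, a DataFrames.jl alternative that is not used elsewhere in this notebook.

```julia
using DataFrames

# Hypothetical toy data standing in for the `flights` table.
toy = DataFrame(month = [7, 7, 11, 12, 1], day = [17, 18, 17, 25, 1])

# Each comparison broadcasts to a Bool vector; `.&` combines them elementwise.
mask = (toy.month .== 7) .& (toy.day .== 17)
by_mask = toy[mask, :]

# The same filter written with a row-wise predicate.
by_filter = filter(row -> row.month == 7 && row.day == 17, toy)

# `in([11, 12])` is a one-argument function, so it can be broadcast with a dot.
is_late = in([11, 12])
late = toy[is_late.(toy.month), :]
```

Here `by_mask` and `by_filter` should contain the same single row, and `late` the two rows for months 11 and 12.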


Julia takes a little getting used to, but can be very expressive and powerful.

### Selecting columns

We select columns using the `select` and `select!` functions. The `select` function returns a new dataframe, while `select!` modifies the existing dataframe in place, keeping only the selected columns. We'll just use `select`.

Let's refresh our memory of what columns we have available.

In [19]:
names(flights)

19-element Array{Symbol,1}:
 :year          
 :month         
 :day           
 :dep_time      
 :sched_dep_time
 :dep_delay     
 :arr_time      
 :sched_arr_time
 :arr_delay     
 :carrier       
 :flight        
 :tailnum       
 :origin        
 :dest          
 :air_time      
 :distance      
 :hour          
 :minute        
 :time_hour     

We can select columns by name like this:

In [20]:
first(select(flights, [:year, :month, :day]), 6)

Row,year,month,day
,Int32,Int32,Int32
1,2013,1,1
2,2013,1,1
3,2013,1,1
4,2013,1,1
5,2013,1,1
6,2013,1,1


We can also select columns by position, but this can easily result in errors if the columns of a dataframe later change.

In [21]:
first(select(flights, [1, 3, 5]), 6)

Row,year,day,sched_dep_time
,Int32,Int32,Int32
1,2013,1,515
2,2013,1,529
3,2013,1,540
4,2013,1,545
5,2013,1,600
6,2013,1,558


There are several other ways we can select columns, but a handy one matches column names using a regular expression.

In [22]:
first(select(flights, r"^(dep|arr)"), 6)

Row,dep_time,dep_delay,arr_time,arr_delay
,Int32⍰,Float64⍰,Int32⍰,Float64⍰
1,517,2.0,830,11.0
2,533,4.0,850,20.0
3,542,2.0,923,33.0
4,544,-1.0,1004,-18.0
5,554,-6.0,812,-25.0
6,554,-4.0,740,12.0
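
As a compact recap, the sketch below applies these selectors to a made-up table (`toy` and `trimmed` are hypothetical names). It also uses `Not`, a selector exported by DataFrames.jl that is not shown elsewhere in this notebook, and contrasts `select` with the in-place `select!`.

```julia
using DataFrames

toy = DataFrame(year = [2013, 2013], dep_time = [517, 533],
                dep_delay = [2.0, 4.0], carrier = ["UA", "AA"])

# Select by name, by regular expression, and by exclusion.
by_name = select(toy, [:year, :carrier])
by_regex = select(toy, r"^dep")
without_year = select(toy, Not(:year))

# `select` returned new data frames, so `toy` still has all four columns.
untouched = ncol(toy) == 4

# `select!` changes its argument in place, so we work on a copy here.
trimmed = copy(toy)
select!(trimmed, [:year, :carrier])
```

After this runs, `trimmed` has two columns while `toy` keeps all four.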


### Adding and modifying columns

We'll work with a subset of columns of the `flights` dataset. We will select those columns using the `@select` macro from the `Query.jl` package. This package provides functionality similar to the `dplyr` package for R.

In [23]:
using Query

# When piping, the source is supplied by `|>`, so `@select` takes only the selectors.
flights_sml = flights |>
  @select(1:3, endswith("delay"), :distance, :air_time) |>
  DataFrame

first(flights_sml, 6)

Row,year,month,day,dep_delay,arr_delay,distance,air_time
,Int32,Int32,Int32,Float64⍰,Float64⍰,Float64,Float64⍰
1,2013,1,1,2.0,11.0,1400.0,227.0
2,2013,1,1,4.0,20.0,1416.0,227.0
3,2013,1,1,2.0,33.0,1089.0,160.0
4,2013,1,1,-1.0,-18.0,1576.0,183.0
5,2013,1,1,-6.0,-25.0,762.0,116.0
6,2013,1,1,-4.0,12.0,719.0,150.0


Now we can add columns using the `@mutate` macro, also from `Query.jl`.

In [24]:
flights_sml |>
  @mutate(gain = _.dep_delay - _.arr_delay,
          speed = (_.distance / _.air_time) * 60) |>
  DataFrame |>
  (x -> first(x, 6))

Row,year,month,day,dep_delay,arr_delay,distance,air_time,gain,speed
,Int32,Int32,Int32,Float64⍰,Float64⍰,Float64,Float64⍰,Float64⍰,Float64⍰
1,2013,1,1,2.0,11.0,1400.0,227.0,-9.0,370.044
2,2013,1,1,4.0,20.0,1416.0,227.0,-16.0,374.273
3,2013,1,1,2.0,33.0,1089.0,160.0,-31.0,408.375
4,2013,1,1,-1.0,-18.0,1576.0,183.0,17.0,516.721
5,2013,1,1,-6.0,-25.0,762.0,116.0,19.0,394.138
6,2013,1,1,-4.0,12.0,719.0,150.0,-16.0,287.6
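
The same two columns can also be computed without `Query.jl`, using the dot syntax introduced earlier. The following is a minimal sketch on made-up data (`toy`, `with_cols`, and `via_transform` are hypothetical names). It assumes a reasonably recent DataFrames.jl for `transform` and `ByRow`, which are not used elsewhere in this notebook.

```julia
using DataFrames

# Hypothetical toy data; the `missing` mimics flights with no recorded delay.
toy = DataFrame(dep_delay = [2.0, 4.0, missing],
                arr_delay = [11.0, 20.0, 5.0],
                distance  = [1400.0, 1416.0, 1089.0],
                air_time  = [227.0, 227.0, 160.0])

# Assigning to a new property adds a column; the arithmetic is broadcast.
with_cols = copy(toy)
with_cols.gain = with_cols.dep_delay .- with_cols.arr_delay
with_cols.speed = with_cols.distance ./ with_cols.air_time .* 60

# `transform` returns a new data frame; `ByRow(-)` applies `-` row by row.
via_transform = transform(toy, [:dep_delay, :arr_delay] => ByRow(-) => :gain)
```

Note that `missing` propagates through the arithmetic: the third row has a missing `dep_delay`, so its `gain` is `missing` as well.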


### Sorting

A `DataFrame` can be sorted using the standard `sort` function. Note that `sort` produces a copy of the `DataFrame`. To sort in place, use `sort!` instead.

In [25]:
first(sort(flights, [:year, :month, :day]), 6)

Row,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time
,Int32,Int32,Int32,Int32⍰,Int32,Float64⍰,Int32⍰,Int32
1,2013,1,1,517,515,2.0,830,819
2,2013,1,1,533,529,4.0,850,830
3,2013,1,1,542,540,2.0,923,850
4,2013,1,1,544,545,-1.0,1004,1022
5,2013,1,1,554,600,-6.0,812,837
6,2013,1,1,554,558,-4.0,740,728
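
Because `flights` is already stored in date order, the effect of sorting is easier to see on a small shuffled table. The sketch below uses made-up data (`toy` and `in_place` are hypothetical names) and also shows the `rev` keyword and the `order` helper from DataFrames.jl, which are not used elsewhere in this notebook.

```julia
using DataFrames

toy = DataFrame(month = [3, 1, 3, 1], day = [5, 9, 2, 4])

# `sort` returns a new, sorted data frame; `toy` keeps its original order.
ascending = sort(toy, [:month, :day])

# `rev = true` reverses the order for all sort columns.
descending = sort(toy, [:month, :day], rev = true)

# `order` sets the direction per column: month ascending, day descending.
mixed = sort(toy, [:month, order(:day, rev = true)])

# `sort!` reorders the rows of its argument in place.
in_place = copy(toy)
sort!(in_place, :day)
```

Only `in_place` is modified by this code; `toy` still holds its rows in the original order.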
