# A deep dive into DataFrames.jl indexing
# Part 1: indexing in DataFrames.jl by example
### Bogumił Kamiński

What we are going to cover:
* `getindex`, a.k.a. `x[...]`
* `setindex!`, a.k.a. `x[...] =`
* `broadcast`, a.k.a. `fun.(x)`
* `broadcast!`, a.k.a. `x .= ...`
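
As a quick orientation, the four operations listed above can be sketched on a plain `Vector` from Base Julia before moving on to data frames. The variable `x` is illustrative only, and the correspondence between `x .= ...` and `broadcast!` is approximate (the dot syntax is lowered through a lazy broadcast object first).

```julia
x = [1, 2, 3]

# getindex: x[i] is syntax for getindex(x, i)
@assert x[2] == getindex(x, 2)

# setindex!: x[i] = v is syntax for setindex!(x, v, i)
x[1] = 10
setindex!(x, 20, 2)
@assert x == [10, 20, 3]

# broadcast: fun.(x) behaves like broadcast(fun, x) and allocates a new array
@assert abs.(x) == broadcast(abs, x)

# broadcast!: x .= ... writes the result into the existing array x
x .= x .* 2
@assert x == [20, 40, 6]
broadcast!(+, x, x, 1)   # roughly the same as x .= x .+ 1
@assert x == [21, 41, 7]
```

The rest of this part looks at how DataFrames.jl types behave under the same syntax.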

Indexable types that DataFrames.jl defines:
* `DataFrame`
* `SubDataFrame`
* `DataFrameRow`
* `DataFrameRows`
* `DataFrameColumns`
* `GroupedDataFrame`
* `GroupKeys`
* `GroupKey`
* `StackedVector`
* `RepeatedVector`

### Environment setup

In [1]:
using DataFrames

In [2]:
using CSV

In [20]:
using BenchmarkTools

┌ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
└ @ Base loading.jl:1278


In [25]:
using Dates

In [3]:
ENV["COLUMNS"] = 500 # allow displayed output to be up to 500 characters wide before it is truncated

500

In [15]:
ENV["LINES"] = 15 # we do not need to see too many lines in the examples we work with

15

In [16]:
df = CSV.File("fh_5yrs.csv") |> DataFrame

| Row | date | volume | open | high | low | close | adjclose | symbol |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Date… | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | String |
| 1 | 2020-07-02 | 257500 | 17.64 | 17.74 | 17.62 | 17.71 | 17.71 | AAAU |
| 2 | 2020-07-01 | 468100 | 17.73 | 17.73 | 17.54 | 17.68 | 17.68 | AAAU |
| 3 | 2020-06-30 | 319100 | 17.65 | 17.8 | 17.61 | 17.78 | 17.78 | AAAU |
| 4 | 2020-06-29 | 405500 | 17.67 | 17.69 | 17.63 | 17.68 | 17.68 | AAAU |
| 5 | 2020-06-26 | 335100 | 17.49 | 17.67 | 17.42 | 17.67 | 17.67 | AAAU |
| 6 | 2020-06-25 | 246800 | 17.6 | 17.6 | 17.52 | 17.59 | 17.59 | AAAU |
| 7 | 2020-06-24 | 329200 | 17.61 | 17.71 | 17.56 | 17.61 | 17.61 | AAAU |
| 8 | 2020-06-23 | 351800 | 17.55 | 17.66 | 17.55 | 17.66 | 17.66 | AAAU |
| 9 | 2020-06-22 | 308300 | 17.5 | 17.57 | 17.44 | 17.5 | 17.5 | AAAU |
| 10 | 2020-06-19 | 153800 | 17.27 | 17.4 | 17.26 | 17.4 | 17.4 | AAAU |


#### Warm up exercises

*Get short description of columns in our data frame*

(See https://github.com/JuliaData/DataFrames.jl/issues/2269 for a discussion of the design decisions here; feel free to comment there if you have an opinion.)

*Get information about exact types of the columns stored in the data frame*

*Get names of columns as strings*

*Get names of columns as `Symbol`s*
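
One possible way to approach the four warm-up prompts is sketched below. It uses a small hypothetical data frame, `df_small`, so that it runs without the CSV file; the same calls should apply to `df`. These are suggestions, not necessarily the solutions the author had in mind.

```julia
using DataFrames

# Hypothetical stand-in for the stock data; column names mimic the original
df_small = DataFrame(date = ["2020-07-02", "2020-07-01"],
                     volume = [257500, 468100],
                     close = [17.71, 17.68])

describe(df_small)            # short description of every column
eltype.(eachcol(df_small))    # exact element type of every column
names(df_small)               # column names as strings
propertynames(df_small)       # column names as Symbols
```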

## DataFrame and SubDataFrame

### `getindex`

Get a single column as a whole without copying

In [17]:
unique([df.date, df."date", df[!, 1], df[!, :date], df[!, "date"]])

1-element Array{Array{Dates.Date,1},1}:
 [Dates.Date("2020-07-02"), Dates.Date("2020-07-01"), Dates.Date("2020-06-30"), Dates.Date("2020-06-29"), Dates.Date("2020-06-26"), Dates.Date("2020-06-25"), Dates.Date("2020-06-24"), Dates.Date("2020-06-23"), Dates.Date("2020-06-22"), Dates.Date("2020-06-19")  …  Dates.Date("2015-01-21"), Dates.Date("2015-01-20"), Dates.Date("2015-01-16"), Dates.Date("2015-01-14"), Dates.Date("2015-01-13"), Dates.Date("2015-01-12"), Dates.Date("2015-01-09"), Dates.Date("2015-01-07"), Dates.Date("2015-01-05"), Dates.Date("2015-01-02")]

In [18]:
unique([getproperty(df, :date), getproperty(df, "date"), getindex(df, !, 1), getindex(df, !, :date), getindex(df, !, "date")])

1-element Array{Array{Dates.Date,1},1}:
 [Dates.Date("2020-07-02"), Dates.Date("2020-07-01"), Dates.Date("2020-06-30"), Dates.Date("2020-06-29"), Dates.Date("2020-06-26"), Dates.Date("2020-06-25"), Dates.Date("2020-06-24"), Dates.Date("2020-06-23"), Dates.Date("2020-06-22"), Dates.Date("2020-06-19")  …  Dates.Date("2015-01-21"), Dates.Date("2015-01-20"), Dates.Date("2015-01-16"), Dates.Date("2015-01-14"), Dates.Date("2015-01-13"), Dates.Date("2015-01-12"), Dates.Date("2015-01-09"), Dates.Date("2015-01-07"), Dates.Date("2015-01-05"), Dates.Date("2015-01-02")]

Get a single column as a whole with copying

In [13]:
unique([copy(df.date), copy(df."date"), df[:, 1], df[:, :date], df[:, "date"]])

1-element Array{Array{Dates.Date,1},1}:
 [Dates.Date("2020-07-02"), Dates.Date("2020-07-01"), Dates.Date("2020-06-30"), Dates.Date("2020-06-29"), Dates.Date("2020-06-26"), Dates.Date("2020-06-25"), Dates.Date("2020-06-24"), Dates.Date("2020-06-23"), Dates.Date("2020-06-22"), Dates.Date("2020-06-19")  …  Dates.Date("2015-01-21"), Dates.Date("2015-01-20"), Dates.Date("2015-01-16"), Dates.Date("2015-01-14"), Dates.Date("2015-01-13"), Dates.Date("2015-01-12"), Dates.Date("2015-01-09"), Dates.Date("2015-01-07"), Dates.Date("2015-01-05"), Dates.Date("2015-01-02")]
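
The `unique` calls above only show that all access styles return equal values; they do not show which ones copy. A minimal sketch of the difference, using a hypothetical data frame `df_demo` and the identity operator `===`:

```julia
using DataFrames

df_demo = DataFrame(a = [1, 2, 3])   # hypothetical example data

# `!` and property access return the stored column itself
@assert df_demo[!, :a] === df_demo.a

# `:` returns a new vector with equal contents
@assert df_demo[:, :a] !== df_demo.a
@assert df_demo[:, :a] == df_demo.a

# Mutating the non-copied vector changes the data frame
v = df_demo[!, :a]
v[1] = 100
@assert df_demo.a[1] == 100

# Mutating the copy leaves the data frame untouched
c = df_demo[:, :a]
c[2] = -1
@assert df_demo.a[2] == 2
```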

Let us compare the performance of various ways to get a column without copying

In [21]:
@btime $df.date
@btime $df."date"
@btime $df[!, 1]
@btime $df[!, :date]
@btime $df[!, "date"];

  13.326 ns (0 allocations: 0 bytes)
  39.011 ns (0 allocations: 0 bytes)
  4.604 ns (0 allocations: 0 bytes)
  13.426 ns (0 allocations: 0 bytes)
  38.950 ns (0 allocations: 0 bytes)


#### Exercise

Check the same but with copying
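
A possible starting point for this exercise is sketched below. It mirrors the non-copying benchmark but uses the copying access forms from the earlier cell. The data frame `df_bench` is a hypothetical stand-in so the snippet runs without the CSV file; timings are intentionally not shown, since they depend on the machine and the column length.

```julia
using DataFrames, BenchmarkTools

# Hypothetical data; replace df_bench with df to benchmark the real data
df_bench = DataFrame(date = collect(1:1000), volume = rand(1000))

@btime copy($df_bench.date)
@btime copy($df_bench."date")
@btime $df_bench[:, 1]
@btime $df_bench[:, :date]
@btime $df_bench[:, "date"];
```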

Do you think it really matters in practice how fast access to a column of a data frame is?

Let us check how lookup speed scales with the number of columns:

In [40]:
@time df_tmp = DataFrame(ones(1, 100_000))

  0.078036 seconds (599.57 k allocations: 47.574 MiB)


Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [44]:
@btime $df_tmp.x100000
@btime $df_tmp."x100000"
@btime $df_tmp[!, 100000];

  15.029 ns (0 allocations: 0 bytes)
  57.739 ns (0 allocations: 0 bytes)
  4.599 ns (0 allocations: 0 bytes)
