# Data Manipulation in Julia
## David Gold
### 10/26/16 - NYC Julia Users Meetup


# About me
* Ph.D. student in the Department of Statistics at the University of Washington
    * work with [Johannes Lederer](https://johanneslederer.com/people/) on high-dimensional inference
* Worked on NullableArrays.jl at the Recurse Center over the summer of 2015

# This Talk: Two Main Threads

* StructuredQueries.jl (summer project at [Julia Labs](http://julia.mit.edu/))
    * My blog post describing this work can be found [here](http://julialang.org/blog/2016/10/StructuredQueries)
* Tabular data support in Julia more generally
    * How data manipulation libraries are relevant to planned changes to DataFrames.jl

### Main takeaway: we're (slowly) making progress

## By the way

StructuredQueries.jl and all that follows are works in progress and subject to change. There are design choices that need to be made, implementations that need to be cleaned up/tuned, etc.

All that said, there is a release schedule in mind. 

# StructuredQueries.jl (SQ)

### Query representation framework

Goal: Represent the structure of a "query" with a directed acyclic graph (DAG) object
* "query" like in SQL, or a series of manipulations applied to data, as in `dplyr`

Why?
* ~~Everybody else is using DAGs, so we should too~~
* ~~DAG are my initials, and I want to name something in my package after myself~~
* Doing so allows us to decouple a query's representation from its execution
    * Queries can be generic over different backends (in-memory Julia objects, SQL databases)
    * Solve the "column-indexing" problem
    * Kind of solves the "`Nullable` lifting" problem

(More on the column-indexing and lifting problems later, if there's time)
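To make the decoupling idea concrete, here is a toy sketch. It is not how StructuredQueries.jl is implemented: the types `QueryNode`, `DataNode`, `FilterNode`, `SelectNode` and the function `execute` are hypothetical definitions written for this illustration, and the code assumes a current Julia 1.x. The point is only that the query is an ordinary value that can be built and inspected first and run later by whichever backend understands it.

```julia
# Toy query graph (hypothetical definitions, for illustration only).
abstract type QueryNode end

struct DataNode{S} <: QueryNode
    source::S                  # the data source, here a vector of named tuples
end

struct FilterNode{F} <: QueryNode
    predicate::F               # row -> Bool
    input::QueryNode
end

struct SelectNode <: QueryNode
    fields::Vector{Symbol}     # names of the fields to keep
    input::QueryNode
end

# One possible backend: walk the graph and execute it in memory.
execute(n::DataNode) = n.source
execute(n::FilterNode) = filter(n.predicate, execute(n.input))
execute(n::SelectNode) =
    [NamedTuple{Tuple(n.fields)}(row) for row in execute(n.input)]

# Building the graph does no work on the data ...
rows = [(name = "a", x = 1.0), (name = "b", x = 8.0), (name = "c", x = 9.5)]
q = SelectNode([:name], FilterNode(r -> r.x > 7.5, DataNode(rows)))

# ... until a backend is asked to run it.
result = execute(q)
```

A different backend would define its own methods over the same node types and leave the graph itself untouched.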

# An example

Consider the `iris` dataset.

In [1]:
Pkg.checkout("DataFrames", "master") # at your own risk

INFO: Checking out DataFrames master...
INFO: Pulling DataFrames latest master...
INFO: No packages to install, update or remove


In [2]:
using DataFrames
using DataStreams
using CSV
iris_csv = CSV.Source(joinpath(Pkg.dir("TablesDemo"), "csv/iris.csv"))
iris = Data.stream!(iris_csv, DataFrame(Data.schema(iris_csv)), false) # Thanks, @quinnj

| Row | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | "setosa" |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | "setosa" |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | "setosa" |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | "setosa" |


Suppose we want to restrict to rows whose value for `sepal_length` is greater than `7.5`.


## Vector-based row indexing into `DataFrame`s



In [None]:
iris[iris[:sepal_length] .> 7.5, :] # the trailing `, :` makes this a row selection

## DataFramesMeta.jl


In [None]:
using DataFramesMeta #?
@where iris :sepal_length .> 7.5

Simon Byrne's [JuliaCon 2016 talk](https://www.youtube.com/watch?v=ScCY_nE0hlU) has a good discussion of these approaches.

# `@query`

SQ provides the `@query` macro with which to describe a query/series of manipulations against a data source, e.g.

In [3]:
using StructuredQueries
q = @query filter(iris, sepal_length > 7.5)

Query against a source of type DataFrames.DataFrame

We can inspect the representation of the original query using `graph`:

In [4]:
graph(q)

FilterNode
  arguments:
      1)  sepal_length > 7.5
  inputs:
      1)  DataNode
            source:  source of type DataFrame


# What do I do with my `Query`? 

# Extend it

One can extend a `Query` object by querying it:

In [5]:
q2 = @query q |> select(species, petal_width)

Query against a source of type DataFrames.DataFrame

In [6]:
graph(q2)

SelectNode
  arguments:
      1)  species
      2)  petal_width
  inputs:
      1)  FilterNode
            arguments:
                1)  sepal_length > 7.5
            inputs:
                1)  DataNode
                      source:  source of type DataFrame


One can use such composability with functions, e.g.

In [7]:
f(q::Query) = @query q |> select(petal_width)
q3 = f(@query filter(iris, sepal_length > 7.5))
graph(q3)


SelectNode
  arguments:
      1)  petal_width
  inputs:
      1)  FilterNode
            arguments:
                1)  sepal_length > 7.5
            inputs:
                1)  DataNode
                      source:  source of type DataFrame


In [8]:
q4 = f(@query filter(iris, sepal_width < petal_width))
graph(q4)

SelectNode
  arguments:
      1)  petal_width
  inputs:
      1)  FilterNode
            arguments:
                1)  sepal_width < petal_width
            inputs:
                1)  DataNode
                      source:  source of type DataFrame


# `collect` it

Materialize a query as an in-memory Julia object (e.g. another `DataFrame`) using `collect`:

In [9]:
collect(q)

LoadError: LoadError: MethodError: no method matching _collect(::DataFrames.DataFrame, ::StructuredQueries.FilterNode)
while loading In[9], in expression starting on line 1

Whoops! StructuredQueries.jl only houses the graph-producing machinery. Collection machinery lives in another package...

...(tentatively titled)

In [10]:
using Collect # NOTE: not registered

In [11]:
collect(q)

| Row | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 7.6 | 3.0 | 6.6 | 2.1 | "virginica" |
| 2 | 7.7 | 3.8 | 6.7 | 2.2 | "virginica" |
| 3 | 7.7 | 2.6 | 6.9 | 2.3 | "virginica" |
| 4 | 7.7 | 2.8 | 6.7 | 2.0 | "virginica" |
| 5 | 7.9 | 3.8 | 6.4 | 2.0 | "virginica" |
| 6 | 7.7 | 3.0 | 6.1 | 2.3 | "virginica" |


"Collect.jl", or whatever it will be called, will re-export StructuredQueries.jl -- users will never have to write `using StructuredQueries`. 

SQ also provides a `@collect` macro that behaves the same as `@query` but automatically collects the resultant `Query`:

In [12]:
@collect iris |> filter(sepal_length > 7.5)

| Row | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 7.6 | 3.0 | 6.6 | 2.1 | "virginica" |
| 2 | 7.7 | 3.8 | 6.7 | 2.2 | "virginica" |
| 3 | 7.7 | 2.6 | 6.9 | 2.3 | "virginica" |
| 4 | 7.7 | 2.8 | 6.7 | 2.0 | "virginica" |
| 5 | 7.9 | 3.8 | 6.4 | 2.0 | "virginica" |
| 6 | 7.7 | 3.0 | 6.1 | 2.3 | "virginica" |


# How does it work? (for `DataFrame`s)

Without getting too much into the weeds, consider the following query:

In [13]:
graph(q)

FilterNode
  arguments:
      1)  sepal_length > 7.5
  inputs:
      1)  DataNode
            source:  source of type DataFrame


We transform the data source into an iterator over tuples (currently via `zip`) and pass this iterator to an internal function that applies the filtering predicate to each row returned by the iterator.

Note that passing the iterator through a function barrier circumvents the type-inferability difficulties associated with naively indexing into a `DataFrame` by field (e.g. naively trying to loop over `df[:sepal_length]`).

If the predicate is satisfied, the function pushes the row to the result `DataFrame`. 
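The zip-plus-function-barrier idea can be sketched in a few lines. This is a simplified illustration, not SQ's actual code: the `Dict{Symbol,Any}` stands in for a container whose column types are hidden from the compiler, `filter_rows` and `filter_by` are hypothetical names, the result is a plain vector of tuples rather than a `DataFrame`, and the code assumes a current Julia 1.x.

```julia
# Stand-in for a table whose column types the compiler cannot see.
columns = Dict{Symbol,Any}(
    :sepal_length => [5.1, 7.6, 7.7, 4.9],
    :species      => ["setosa", "virginica", "virginica", "setosa"],
)

# Inner kernel: when called, it is compiled for the concrete type of `rows`,
# so the loop body is type-stable.
function filter_rows(pred, rows)
    kept = Vector{eltype(rows)}()
    for row in rows
        pred(row) && push!(kept, row)
    end
    return kept
end

# Outer function: looks the columns up dynamically, zips them into an
# iterator over tuples, and hands that across the function barrier.
function filter_by(pred, columns, names::Symbol...)
    rows = zip((columns[n] for n in names)...)
    return filter_rows(pred, rows)
end

result = filter_by(row -> row[1] > 7.5, columns, :sepal_length, :species)
```

The dynamic lookup happens once per query in `filter_by`; the per-row work happens in `filter_rows`, where the tuple type is known.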


The filtering predicate itself is stored in the `FilterNode` object:

In [14]:
graph(q).helpers

1-element Array{StructuredQueries.FilterHelper,1}:
 StructuredQueries.FilterHelper{##3#4}(#3,Symbol[:sepal_length])

# Backend agnosticism

For SQL database sources, we can transform the graph into appropriate SQL.

See Yeesian Ng's (@yeesian) work at [SQLQuery](https://github.com/yeesian/SQLQuery.jl), in particular [PR #2](https://github.com/yeesian/SQLQuery.jl/pull/2)
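For a rough feel of what "transform the graph into SQL" means, here is a naive sketch. It is not how SQLQuery.jl or SQ do it: `TableSource`, `FilterStep`, `SelectStep`, `sql_expr` and `to_sql` are hypothetical names, only binary comparisons and arithmetic are translated, and the nested subqueries it emits are unoptimized (some SQL dialects would also require aliases on them). It assumes a current Julia 1.x.

```julia
# Toy graph whose predicates are kept as Julia expressions, so they can be
# translated instead of executed (hypothetical types, for illustration only).
struct TableSource
    name::String
end

struct FilterStep
    predicate::Expr
    input::Any
end

struct SelectStep
    fields::Vector{Symbol}
    input::Any
end

# Translate a small subset of Julia expressions into SQL text.
sql_expr(x::Symbol) = string(x)
sql_expr(x::Number) = string(x)
sql_expr(x::AbstractString) = "'" * replace(x, "'" => "''") * "'"
function sql_expr(ex::Expr)
    if ex.head == :call && length(ex.args) == 3
        op, lhs, rhs = ex.args
        sqlop = op == :(==) ? "=" : string(op)
        return string(sql_expr(lhs), " ", sqlop, " ", sql_expr(rhs))
    end
    error("unsupported expression: $ex")
end

# Walk the graph from the outermost node inward, nesting subqueries.
to_sql(t::TableSource) = string("SELECT * FROM ", t.name)
to_sql(f::FilterStep) =
    string("SELECT * FROM (", to_sql(f.input), ") WHERE ", sql_expr(f.predicate))
to_sql(s::SelectStep) =
    string("SELECT ", join(s.fields, ", "), " FROM (", to_sql(s.input), ")")

q = SelectStep([:species, :petal_width],
               FilterStep(:(sepal_length > 7.5), TableSource("iris")))
sql = to_sql(q)
```

Because the predicate is stored as an expression rather than a closure, the same graph could still be compiled to a Julia function for in-memory sources.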

# What else can I do?

# Projections / Transformations

In [19]:
@collect iris |>
    select(
        species,
        twice_petal_width = 2 * petal_width,
        something = digamma(log(sepal_length))
    )

| Row | species | twice_petal_width | something |
|---|---|---|---|
| 1 | "setosa" | 0.4 | 0.150852 |
| 2 | "setosa" | 0.4 | 0.116766 |
| 3 | "setosa" | 0.4 | 0.0800386 |
| 4 | "setosa" | 0.4 | 0.0605702 |
| 5 | "setosa" | 0.4 | 0.134118 |
| 6 | "setosa" | 0.8 | 0.197703 |
| 7 | "setosa" | 0.6 | 0.0605702 |
| 8 | "setosa" | 0.4 | 0.134118 |
| 9 | "setosa" | 0.4 | 0.0191522 |
| 10 | "setosa" | 0.2 | 0.116766 |


# Grouping / Aggregation


In [23]:
@collect iris |>
    groupby(species, sepal_length > .5) |>
    summarize(res = mean(petal_width))

Grouped DataFrames.DataFrame
Groupings by:
    species 
    pred_1 (with alias :sepal_length > 0.5) 
Source: 3×3 DataFrames.DataFrame
│ Row │ species      │ pred_1 │ res   │
├─────┼──────────────┼────────┼───────┤
│ 1   │ "setosa"     │ true   │ 0.244 │
│ 2   │ "versicolor" │ true   │ 1.326 │
│ 3   │ "virginica"  │ true   │ 2.026 │




# Joins

In [32]:
df1 = DataFrame(
    A1 = rand(1:3, 10),
    B1 = rand(1:10, 10)
)

| Row | A1 | B1 |
|---|---|---|
| 1 | 1 | 8 |
| 2 | 3 | 7 |
| 3 | 2 | 9 |
| 4 | 1 | 3 |
| 5 | 3 | 3 |
| 6 | 1 | 9 |
| 7 | 1 | 4 |
| 8 | 1 | 6 |
| 9 | 1 | 6 |
| 10 | 2 | 5 |


In [31]:
df2 = DataFrame(
    A2 = rand(1:3, 10),
    B2 = rand(10),
    C2 = rand(["green", "blue"], 10)
)

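| Row | A2 | B2 | C2 |
|---|---|---|---|
| 1 | 1 | 0.553672 | blue |
| 2 | 1 | 0.667555 | green |
| 3 | 2 | 0.609567 | blue |
| 4 | 3 | 0.998367 | blue |
| 5 | 3 | 0.9212 | green |
| 6 | 3 | 0.508623 | blue |
| 7 | 3 | 0.949373 | blue |
| 8 | 3 | 0.0672602 | blue |
| 9 | 2 | 0.709144 | blue |
| 10 | 3 | 0.675623 | green |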


In [33]:
@collect join(df1, df2, A1 = A2)

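| Row | A2 | B2 | C2 | A1 | B1 |
|---|---|---|---|---|---|
| 1 | 1 | 0.553672 | blue | 1 | 8 |
| 2 | 1 | 0.553672 | blue | 1 | 3 |
| 3 | 1 | 0.553672 | blue | 1 | 9 |
| 4 | 1 | 0.553672 | blue | 1 | 4 |
| 5 | 1 | 0.553672 | blue | 1 | 6 |
| 6 | 1 | 0.553672 | blue | 1 | 6 |
| 7 | 1 | 0.667555 | green | 1 | 8 |
| 8 | 1 | 0.667555 | green | 1 | 3 |
| 9 | 1 | 0.667555 | green | 1 | 9 |
| 10 | 1 | 0.667555 | green | 1 | 4 |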


In [34]:
@collect join(df1, df2, A1 = ifelse(B2 > .5, 1, 3))

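| Row | A2 | B2 | C2 | A1 | B1 |
|---|---|---|---|---|---|
| 1 | 1 | 0.553672 | blue | 1 | 8 |
| 2 | 1 | 0.553672 | blue | 1 | 3 |
| 3 | 1 | 0.553672 | blue | 1 | 9 |
| 4 | 1 | 0.553672 | blue | 1 | 4 |
| 5 | 1 | 0.553672 | blue | 1 | 6 |
| 6 | 1 | 0.553672 | blue | 1 | 6 |
| 7 | 1 | 0.667555 | green | 1 | 8 |
| 8 | 1 | 0.667555 | green | 1 | 3 |
| 9 | 1 | 0.667555 | green | 1 | 9 |
| 10 | 1 | 0.667555 | green | 1 | 4 |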


# Issues/TODOs

* Name resolution
    * Current parsing machinery assumes that all unadorned names inside function calls are "attributes"
    * What if I want to do 
    
    ```julia
    @collect iris |> summarize(sum = reduce(+, sepal_length))
    ```
    ?
    * Conversely, what if `f` is a field of some `df::DataFrame` whose respective column is a vector of functions? Then 
    ```julia
    @collect select(f(a))
    ``` 
    will not work as expected, because names of *called* functions are inherited from enclosing scope.
* Interpolation/parametrized queries
    * `@collect summarize($c * A)`
* Implementation-wise...
    * Joins like all get out (cf. Jamie Brandon's [blog post](http://scattered-thoughts.net/blog/2016/10/11/a-practical-relational-query-compiler-in-500-lines/) on [Imp](https://github.com/jamii/imp) (!))

# But where are the benchmarks?

\>_>

# ~~Ruining(?)~~ ~~Bettering(?)~~ Revising the Interface

In [1]:
# Restart the jupyter kernel
Pkg.checkout("StructuredQueries", "with")
using StructuredQueries

INFO: Checking out StructuredQueries with...
INFO: Pulling StructuredQueries latest with...
INFO: No packages to install, update or remove


# Revised interface

In [18]:
tbl = DataFrame(
    A = rand(10),
    B = rand(10)
);

In [19]:
q = @with tbl(i) do
    filter(i.A > .5)
    select(C = 5 * i.B)
end

Query against a source of type Tuple{DataFrames.DataFrame}

`i` denotes a "row token" that represents an arbitrary row of `tbl` in row-wise operations, such as `filter`ing by row, above.

Very much inspired by LINQ/Query.jl. However, it tries to keep the conceptual focus of the syntax on the data as a whole, not just on the iterative aspect.

In [20]:
graph(q)

Node{:select}
  arguments:
      1)  C=5 * i.B
  inputs:
      1)  Node{:filter}
            arguments:
                1)  i.A > 0.5
            inputs:
                1)  DataNode{:DataFrames.DataFrame}
                      source:  source of type DataFrame


Supports a general yet compact in-line/single-verb syntax:

In [21]:
q = @with tbl(i) filter(i.A > .5)

Query against a source of type Tuple{DataFrames.DataFrame}

In [22]:
graph(q)

Node{:filter}
  arguments:
      1)  i.A > 0.5
  inputs:
      1)  DataNode{:DataFrames.DataFrame}
            source:  source of type DataFrame


## Join parsing

Associating a "row token" with each data source encourages a regime in which no one data source argument to a manipulation verb is "privileged", unlike in the previously shown interface.

In [24]:
tbl1, tbl2, tbl3, tbl4 = DataFrame(), DataFrame(), DataFrame(), DataFrame();

In [25]:
q = @with tbl1(i), tbl2(j) filter(
    i.A < j.B, i.C == "foo", i.D == j.D
)

Query against a source of type Tuple{Tuple{DataFrames.DataFrame},Tuple{DataFrames.DataFrame}}

Note the difference between the type of source shown immediately above and that listed in the `show`ing of `Query` objects earlier. 

In [26]:
graph(q)

Node{:filter}
  arguments:
      1)  i.A < j.B
  inputs:
      1)  Node{:innerjoin}
            arguments:
                1)  i.D == j.D
            inputs:
                1)  DataNode{:DataFrames.DataFrame}
                      source:  source of type DataFrame
                2)  Node{:filter}
                      arguments:
                          1)  i.C == "foo"
                      inputs:
                          1)  DataNode{:DataFrames.DataFrame}
                                source:  source of type DataFrame


Things can get relatively complicated...

In [27]:
q = @with tbl1(i), tbl2(j), tbl3(k), tbl4(h) do
    filter(i.A > .5, i.B == j.B, baz(j.C) < .5)
    filter(i.C < .5, j.D == "foo", k.D == h.D, digamma(k.E) > .5, h.F == "bar")
    join(i.D == k.D)
    groupby(f(i.A) < .5, k.D, g(j.B * k.C))
end

Query against a source of type Tuple{Tuple{Tuple{DataFrames.DataFrame},Tuple{DataFrames.DataFrame}},Tuple{Tuple{DataFrames.DataFrame},Tuple{DataFrames.DataFrame}}}

...

In [28]:
graph(q)

Node{:groupby}
  arguments:
      1)  f(i.A) < 0.5
      2)  k.D
      3)  g(j.B * k.C)
  inputs:
      1)  Node{:innerjoin}
            arguments:
                1)  i.D == k.D
            inputs:
                1)  Node{:innerjoin}
                      arguments:
                          1)  k.D == h.D
                      inputs:
                          1)  Node{:filter}
                                arguments:
                                    1)  digamma(k.E) > 0.5
                                inputs:
                                    1)  DataNode{:DataFrames.DataFrame}
                                          source:  source of type DataFrame
                          2)  Node{:filter}
                                arguments:
                                    1)  h.F == "bar"
                                inputs:
                                    1)  DataNode{:DataFrames.DataFrame}
                                          source:  source of type DataFrame

## Advantages
* Name resolution is more straightforward
* Refer to attributes of a table in different contexts
    * `i.A` is a "row-wise" context
    * something like `:B` could be used as a "column-context"
        * e.g.
          ```julia
          @with iris() select(avg = mean(:sepal_length))
          ```
          This also allows more flexibility in transforming after aggregating, e.g.
          ```julia
          @with iris() select(log_avg = log(mean(:sepal_length)))
          ```
    (This is possible in the previously described interface as well.)
* Opens up possibilities for neat window function syntaxes, e.g.
   ```julia
   @with iris(i) filter(
       i.sepal_length > mean([ j.sepal_length for j in iris if j.petal_width > i.petal_width ])
   )
   ```
   (i.e. Restrict to observations `i` from `iris` whose value for `sepal_length` is greater than the mean of all values of `sepal_length` over observations `j` from `iris` whose value for `petal_width` is greater than the present observation `i`'s value of `petal_width`.)    
   (However, such functionality is not generic.)
* Encourages generality of manipulation verbs over the number of sources
* Support for more LINQ/Query.jl-like generality over iterators
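For reference, the window-style filter described in the list above computes the following, written out by hand. This is only a plain-Julia rendering of the semantics, not proposed SQ syntax or implementation: the four rows are made-up stand-ins for `iris`, `window_filter` is a hypothetical name, and dropping rows that have no wider-petalled peers is an assumption of this sketch. It assumes a current Julia 1.x, where `mean` lives in the `Statistics` standard library.

```julia
using Statistics: mean

# Made-up stand-in for a few rows of `iris`.
rows = [
    (sepal_length = 5.0, petal_width = 0.2),
    (sepal_length = 6.0, petal_width = 1.5),
    (sepal_length = 7.0, petal_width = 1.0),
    (sepal_length = 6.5, petal_width = 2.0),
]

function window_filter(rows)
    kept = eltype(rows)[]
    for i in rows
        # sepal lengths of all rows j whose petal is wider than row i's
        peers = Float64[j.sepal_length for j in rows if j.petal_width > i.petal_width]
        # Assumption: a row with no such peers has no mean to compare against
        # and is dropped.
        if !isempty(peers) && i.sepal_length > mean(peers)
            push!(kept, i)
        end
    end
    return kept
end

kept = window_filter(rows)
```

Written this way the work is quadratic in the number of rows, which is part of why a dedicated syntax that a backend could optimize would be attractive.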


## Disadvantages
* Additional complexity
    * for the user, e.g. one has to specify row tokens everywhere
    * for the developer (i.e., me): need to keep track of maps from tokens to maps from field names to column indices
* Why is the data source name now a function call?
* Abuse of `.` for denoting field reference
    * (This wouldn't be a problem if we had named tuples!)
    * can always use `Base.getfield` or a selector for actual field retrieval 

# Tabular Data Support More Generally

SQ is a data manipulation interface. Julian data scientists may deal with data in `DataFrame`s. So, let's talk about `DataFrame`s. 

(Stage direction: get on soap box.)

We said above that SQ works by iterating over a tuple-producing iterator. Query.jl works similarly. Iteration is fundamental to both implementations. 

So, we need row-wise iteration over the contents of a `DataFrame` to be fast. Which means that we need such iteration to be type-inferable. 

Naive iteration over `DataArray`s doesn't fit the bill:

In [29]:
using DataArrays
D = DataArray(1:10)
@code_warntype next(D, start(D)) # not good

Variables:
  #self#::Base.#next
  x::DataArrays.DataArray{Int64,1}
  state::Int64

Body:
  begin 
      return (Core.tuple)($(Expr(:invoke, MethodInstance for getindex(::DataArrays.DataArray{Int64,1}, ::Int64), :(DataArrays.getindex), :(x), :(state))),(Base.box)(Int64,(Base.add_int)(state::Int64,1)))::TUPLE{UNION{DATAARRAYS.NATYPE,INT64},INT64}
  end::TUPLE{UNION{DATAARRAYS.NATYPE,INT64},INT64}


`NullableArray`s are better:



In [30]:
X = NullableArray(1:10)
@code_warntype next(X, start(X))

Variables:
  #self#::Base.#next
  A::NullableArrays.NullableArray{Int64,1}
  i::Tuple{Base.OneTo{Int64},Int64}
  idx::Int64
  s::Int64
  #temp#@_6::Int64
  #temp#@_7::Nullable{Int64}

Body:
  begin 
      SSAValue(3) = (Base.getfield)(i::Tuple{Base.OneTo{Int64},Int64},2)::Int64
      SSAValue(5) = (Base.box)(Int64,(Base.add_int)(SSAValue(3),1))
      #temp#@_6::Int64 = $(QuoteNode(1))
      SSAValue(6) = (Base.box)(Int64,(Base.add_int)(1,1))
      idx::Int64 = SSAValue(3)
      #temp#@_6::Int64 = SSAValue(6)
      SSAValue(7) = (Base.box)(Int64,(Base.add_int)(2,1))
      s::Int64 = SSAValue(5)
      #temp#@_6::Int64 = SSAValue(7)
      SSAValue(8) = idx::Int64
      $(Expr(:inbounds, false))
      # meta: location /Users/David/.julia/v0.6/NullableArrays/src/indexing.jl getindex 22
      #temp#@_7::Nullable{Int64} = (Base.select_value)((Base.arrayref)((Core.getfield)(A::NullableArrays.NullableArray{Int64,1},:isnull)::Array{Bool,1},SSAValue(8))::Bool,$(Expr(:new, Nullable{Int64}, false))

But now (on DataFrames.jl master, where `DataFrame`s are built on top of `NullableArray`s and `CategoricalArray`s), indexing into a `DataFrame` column returns a `Nullable`:

In [31]:
iris[1, :sepal_length]

Nullable{Float64}(5.1)

On the one hand, this is the whole point of using `NullableArray`s. On the other hand, it's a bit of a pain:

In [3]:
f(x) = 2 * x
f.(iris[:sepal_length])

LoadError: LoadError: MethodError: no method matching *(::Int64, ::Nullable{Float64})
Closest candidates are:
  *(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:287
  *{S}(!Matched::Nullable{Union{}}, ::Nullable{S}) at /Users/David/.julia/v0.6/NullableArrays/src/operators.jl:139
  *{T<:Union{Int128,Int16,Int32,Int64,Int8,UInt128,UInt16,UInt32,UInt64,UInt8}}(::T<:Union{Int128,Int16,Int32,Int64,Int8,UInt128,UInt16,UInt32,UInt64,UInt8}, !Matched::T<:Union{Int128,Int16,Int32,Int64,Int8,UInt128,UInt16,UInt32,UInt64,UInt8}) at int.jl:33
  ...
while loading In[3], in expression starting on line 2

# How can we make `Nullable`s usable?

I.e., how can we recover all functionality over non-`Nullable`s for argument signatures with `Nullable`s?

This is often referred to as "lifting" functionality over `Nullable` arguments (modulo some technicalities).

# Method extension

One strategy is to manually extend each method such as `*` to handle signatures with `Nullable`s.


In [4]:
import Base.*
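# `T` must be bound as a type parameter, otherwise the null branch fails on an undefined `T`
*(x, y::Nullable{T}) where {T} = isnull(y) ? Nullable{T}() : Nullable(x * y.value)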

* (generic function with 154 methods)

In [5]:
f.(iris[:sepal_length])

150-element Array{Nullable{Float64},1}:
 10.2
 9.8 
 9.4 
 9.2 
 10.0
 10.8
 9.2 
 10.0
 8.8 
 9.8 
 10.8
 9.6 
 9.6 
 ⋮   
 12.0
 13.8
 13.4
 13.8
 11.6
 13.6
 13.4
 13.4
 12.6
 13.0
 12.4
 11.8

## Advantages

* Makes every such extended method work as expected
* Properly propagate invariants in three-valued logic

## Disadvantages

* That's a lot of methods...
* One might try to lift a small, core subset of functions and rely on generic dispatch for user-defined functions defined in terms of the core lifted functions, but this only works if the user-defined functions have untyped argument signatures (in which case they can't take advantage of multiple dispatch), e.g.

In [7]:
f(x, y) = x * y
f(Nullable(1), Nullable(2))



Nullable{Int64}(2)

In [13]:
g(x::Int, y::Int) = x * y
g(x::Float64, y::Float64) = x + y
g(Nullable(1), Nullable(2))



LoadError: LoadError: MethodError: no method matching g(::Nullable{Int64}, ::Nullable{Int64})
while loading In[13], in expression starting on line 3

# Higher-order lifting

Another way to go is to use a higher-order function, say, `lift`:

```julia
@inline function lift(f, x1, x2)
    if null_safe_op(f, typeof(x1), typeof(x2))
        return @compat Nullable(
            f(x1.value, x2.value), !(isnull(x1) | isnull(x2))
        )
    else
        U = Core.Inference.return_type(
            f, Tuple{eltype(typeof(x1)), eltype(typeof(x2))}
        )
        if isnull(x1) | isnull(x2)
            return Nullable{U}()
        else
            return Nullable(f(unsafe_get(x1), unsafe_get(x2)))
        end
    end
end
```


Higher-order lifting is fully general...

In [14]:
digamma(Nullable(1.5))

LoadError: LoadError: MethodError: no method matching digamma(::Nullable{Float64})
Closest candidates are:
  digamma(!Matched::BigFloat) at mpfr.jl:465
  digamma(!Matched::Union{Complex{Float64},Float64}) at special/gamma.jl:155
  digamma(!Matched::Union{Complex{Float32},Float32}) at special/gamma.jl:494
  ...
while loading In[14], in expression starting on line 1

In [16]:
using StructuredQueries: lift
lift(digamma, Nullable(1.5))

Nullable{Float64}(0.03649)

... handles "mixed-signature" arguments

In [18]:
h(x::Int, y::Float64) = log(x) * erf(y)
lift(h, 2, Nullable(1.5))

Nullable{Float64}(0.669653)

...as long as you write `lift` everywhere. =/

SQ automatically replaces function calls `f(xs...)` with `lift(f, xs...)` everywhere. So, `NullableArray`-based `DataFrame`s will be usable, at least through querying macros such as those provided by StructuredQueries.jl (and also Query.jl, which employs a variant of the method-extension strategy described above).
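The call-rewriting step can be sketched as a small expression walker. This is a simplified illustration, not SQ's macro machinery: it assumes a current Julia 1.x (where `Nullable` is no longer in Base), so `nothing` stands in for a null value and this toy `lift` is much cruder than the one shown earlier. `lift_calls` and `@lifted` are hypothetical names, `lift` is assumed to be visible at the call site, and keyword arguments, dot-broadcast calls, and short-circuit operators are ignored.

```julia
# Toy lifting: any `nothing` argument makes the whole call `nothing`.
lift(f, xs...) = any(x -> x === nothing, xs) ? nothing : f(xs...)

# Recursively rewrite every call `f(args...)` into `lift(f, args...)`.
lift_calls(x) = x                      # literals, symbols, line numbers: unchanged
function lift_calls(ex::Expr)
    args = map(lift_calls, ex.args)    # rewrite nested calls first
    if ex.head == :call
        return Expr(:call, :lift, args...)
    end
    return Expr(ex.head, args...)
end

macro lifted(ex)
    return esc(lift_calls(ex))
end

x = nothing
y = 3
a = @lifted(2 * y + 1)          # ordinary values pass straight through
b = @lifted(2 * x + 1)          # the null value propagates outward
c = @lifted(sqrt(abs(y - 12)))  # nested calls are lifted at every level
```

Inspecting `lift_calls(:(f(a, g(b))))` shows the rewritten expression `lift(f, a, lift(g, b))` without evaluating anything.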

Arguably, it isn't satisfactory to rely on a querying framework to make the `Nullable` approach to missing values usable. Missingness and data manipulation are orthogonal.

But as a matter of practicality, is it so bad? We have reason to encourage use of macro-based querying frameworks over the traditional indexing/vectorized `DataFrames` API anyway. Why not piggy-back a solution to the `Nullable` lifting problem?

# An ideal solution??

There may be none, but what about:

* Split `Nullable` into two types -- one that propagates null values (`A`) and one that doesn't (`B`) -- and then automatically lower any call `f(xs...)` where any of the `xs...` are of type `A` to `lift(f, xs...)`. 

This is fully general, user-friendly, and properly 

## Other alternatives

* Special case small `Union` types (e.g. `Union{NAtype, T}`)
* Write code like [`broadcast` for DataArrays](https://github.com/JuliaStats/DataArrays.jl/blob/master/src/broadcast.jl)

# Some Final Thoughts

* StructuredQueries.jl, Query.jl and others are
* DataStreams.jl 

# I'm very grateful to

* John Myles White (Facebook)
* Yeesian Ng (MIT)
* Alan Edelman & Viral Shah (MIT, Julia Computing)
* Many others at Julia Central and elsewhere

* Thank you to Spencer Lyons and the NYC Julia Users Meetup group for inviting me to speak, and thank you to eBay for hosting!