# Data science with Julia

Julia has an increasingly compelling story for data analysis, built around the `DataFrames.jl`, `Query.jl`, and `JuliaDB.jl` packages. The story will improve further with Julia 0.7 / 1.0, which include named tuples.

There is now a variety of packages for loading different file formats, many integrated into the `FileIO.jl` package, which loads a file based on its type.
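As a rough sketch of what loading by file type can look like: the snippet below assumes that the `CSVFiles.jl` package is installed alongside `FileIO.jl` and `DataFrames.jl` (it supplies the CSV loader that `load` dispatches to), and it writes a tiny made-up file so that it is self-contained.

```julia
using FileIO, CSVFiles, DataFrames

# Write a tiny, made-up CSV file into a temporary directory.
path = joinpath(mktempdir(), "toy_payments.csv")
write(path, "vendor_name,amount\nAcme,10.5\nBirch,20.0\n")

# `load` picks the loader from the ".csv" extension (here CSVFiles.jl);
# the result is a table source that can be materialized as a DataFrame.
toy = DataFrame(load(path))
```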

Note that Excel files may be read using the [ExcelReaders.jl](https://github.com/davidanthoff/ExcelReaders.jl) package.

## Reading CSVs

Let's download some data from Cincinnati:

https://data.cincinnati-oh.gov/Growing-Economy/City-of-Cincinnati-Vendor-Payments/qrj9-83t8

In [None]:
download("https://data.cincinnati-oh.gov/resource/wv6n-ukpk.csv", "vendor_data.csv")

And read it in:

In [None]:
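using CSV, DataFrames

# Passing `DataFrame` as the second argument states explicitly which table type to build
vendor_data = CSV.read("vendor_data.csv", DataFrame)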

In [None]:
typeof(vendor_data)

We see that `CSV.read` returns a **`DataFrame`**, the counterpart of a data frame in R or in the Python `pandas` package: a table in which each column may hold a different kind of data, and so may have a different Julia element type. Some entries may also be missing; these are labelled `missing`, a value provided by the `Missings.jl` package:

In [None]:
missing

In [None]:
typeof(missing)
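A few common ways of working with `missing` can be sketched as follows. This assumes Julia 1.0 or later, where `missing` and the helper functions used here are available without loading any package; the `amounts` vector is made up for illustration.

```julia
amounts = [1.5, missing, 3.0]          # element type Union{Missing, Float64}

# Most operations propagate `missing` rather than raising an error.
propagated = 1 + missing               # still `missing`

# `ismissing` tests for it; `==` would itself return `missing`.
n_missing = count(ismissing, amounts)

# Skip the missing entries, or replace them with a default value.
total  = sum(skipmissing(amounts))
filled = coalesce.(amounts, 0.0)
```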

We can find out the names of the columns:

In [None]:
names(vendor_data)

and extract a given column:

In [None]:
vendors = unique(vendor_data[:vendor_name])
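The cell above uses the indexing syntax of the `DataFrames.jl` release current when this was written. As a hedged sketch, assuming a recent `DataFrames.jl` (1.x) release, column access is spelled as below; the `toy` table is made up and stands in for `vendor_data`.

```julia
using DataFrames

# Small made-up table standing in for `vendor_data`.
toy = DataFrame(vendor_name = ["Acme", "Birch", "Acme"],
                amount      = [10.0, 25.0, 40.0])

col_property = toy.vendor_name        # the stored column itself, no copy
col_nocopy   = toy[!, :vendor_name]   # same column, indexing syntax
col_copy     = toy[:, :vendor_name]   # an independent copy

toy_vendors = unique(toy.vendor_name)
```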

We may want to extract all the transactions for a given vendor, which we do in a vectorized way:
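A minimal sketch of this vectorized pattern, on a small made-up table rather than the vendor data, and assuming a recent `DataFrames.jl` (1.x) release:

```julia
using DataFrames

toy = DataFrame(vendor_name = ["Acme", "Birch", "Acme"],
                amount      = [10.0, 25.0, 40.0])

# Broadcasting `==` with the dot syntax gives one Bool per row.
is_acme = toy.vendor_name .== "Acme"

# Indexing rows with the Boolean vector keeps only the matching rows.
acme_rows = toy[is_acme, :]
```

If the column contains `missing` entries, the comparison yields `missing` for those rows and indexing with such a vector fails; wrapping the comparison as `coalesce.(toy.vendor_name .== "Acme", false)` is one way around that.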

**Exercise**: Find which rows correspond to the second vendor, and extract those rows.

## Querying a database using `Query.jl`

The syntax for the above kind of operation gets very messy very fast, so we instead turn to a more powerful tool, `Query.jl`, which provides a syntax based on C#'s LINQ:

In [None]:
using Query, DataFrames

In [None]:
@from v in vendor_data begin
     @where v.vendor_name == "American Chemical Society"
     @select v
     @collect DataFrame
end
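The same query style can also filter on a numeric column and keep only some of the columns. The following is a sketch on a made-up `payments` table (the column names and values are illustrative, not taken from the Cincinnati data):

```julia
using DataFrames, Query

# Made-up table for illustration.
payments = DataFrame(vendor_name = ["Acme", "Birch", "Acme"],
                     amount      = [10.0, 25.0, 40.0])

large = @from p in payments begin
    @where p.amount > 20                  # keep rows with amount above 20
    @select {p.vendor_name, p.amount}     # project onto two named columns
    @collect DataFrame
end
```

The `{...}` form in `@select` builds a named tuple for each row, so the collected `DataFrame` has columns `vendor_name` and `amount`.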

**Exercise**: (i) Make an interactive tool for choosing transactions with a given vendor.

(ii) Find the number of transactions with each vendor.

## JuliaDB

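Another recent development in the data space is `JuliaDB.jl`, an efficient key-value store written in pure Julia. Its tables are fully typed, meaning that the column types are part of the table's type and so known to the compiler, which can make it more efficient than a standard `DataFrame`.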