# Exploratory Data Analysis


First, let's look at a classic example of why it's important to plot data in addition to looking at statistics.

## [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe's_quartet)

Anscombe's quartet is a set of four small datasets, each with 11 observations of two variables, `X` and `Y`.

In [None]:
using CSV
using DataFrames
using Gadfly
using Query
using RDatasets
using Statistics

First we will load the dataset.

In [None]:
anscombe = dataset("datasets", "anscombe")

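Next we calculate the mean and standard deviation of each variable. Notice that the four `X` variables share the same mean and standard deviation, and the four `Y` variables have nearly identical ones.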

In [None]:
describe(anscombe, :mean, :std)

Each `X`, `Y` pair also has nearly the same correlation (about 0.816).

In [None]:
# Collect the four correlations in one tuple so the cell displays them together
(cor(anscombe[!, :X1], anscombe[!, :Y1]),
 cor(anscombe[!, :X2], anscombe[!, :Y2]),
 cor(anscombe[!, :X3], anscombe[!, :Y3]),
 cor(anscombe[!, :X4], anscombe[!, :Y4]))

Now we will convert the dataset to tidy format so it's easier to create plots.

In [None]:
anscombe[!, :rowid] = 1:nrow(anscombe)

sa = stack(anscombe, Not(:rowid))

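# Split column names such as "X1" into the variable letter and the set number
r = r"(\w)(\d+)"
m = match.(r, string.(sa.variable))

# Use the regex capture groups rather than indexing characters of the matched string
sa[!, :variable] = map(x -> String(x.captures[1]), m)
sa[!, :set] = map(x -> String(x.captures[2]), m)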

tidy_anscombe = unstack(sa, :variable, :value);

In [None]:
first(tidy_anscombe, 6)
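
With the data in tidy form, the per-set statistics from the earlier cells can be recomputed in a single grouped call. The following is a minimal sketch, assuming a DataFrames version that provides `groupby` and `combine` with `source => function => name` pairs. So that it runs on its own, it builds its own long table (named `tidy` here, a hypothetical name) directly from the wide columns instead of reusing `tidy_anscombe`; `set_stats` and the output column names are also illustrative.

```julia
using DataFrames
using RDatasets
using Statistics

anscombe = dataset("datasets", "anscombe")

# Build a long table directly: one block of 11 rows per set
tidy = reduce(vcat, [
    DataFrame(set = string(i),
              X = anscombe[!, Symbol("X", i)],
              Y = anscombe[!, Symbol("Y", i)])
    for i in 1:4
])

# One row of summary statistics per set
set_stats = combine(groupby(tidy, :set),
    :X => mean => :mean_X,
    :Y => mean => :mean_Y,
    :X => std => :std_X,
    :Y => std => :std_Y,
    [:X, :Y] => cor => :cor_XY)
```

Each row of `set_stats` should show nearly the same values, which is exactly why the summary statistics alone cannot tell the four sets apart.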

Finally, let's plot each set of `X`, `Y` values with a linear regression line. Note that the point patterns are quite different even though the fitted regression lines are nearly identical.

In [None]:
plot(tidy_anscombe, xgroup=:set, x=:X, y=:Y, Geom.subplot_grid(Geom.point, Geom.smooth(method=:lm)))
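
To check the regression lines numerically rather than by eye, the intercept and slope of each fit can be computed from the covariance and variance. This is a minimal sketch of simple least squares using only `Statistics`; the helper `ols_line` is a hypothetical name introduced here, not part of Gadfly or of the original notebook, and it is not necessarily how `Geom.smooth` computes its line internally.

```julia
using RDatasets
using Statistics

# Hypothetical helper: least-squares fit of y = intercept + slope * x
function ols_line(x, y)
    slope = cov(x, y) / var(x)
    intercept = mean(y) - slope * mean(x)
    return (intercept = intercept, slope = slope)
end

anscombe = dataset("datasets", "anscombe")

# Fit each of the four sets using the original wide columns X1..X4, Y1..Y4
fits = [ols_line(anscombe[!, Symbol("X", i)], anscombe[!, Symbol("Y", i)])
        for i in 1:4]
```

All four fits come out with an intercept of roughly 3.0 and a slope of roughly 0.5, which matches the near-identical lines in the plots.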

### Final example

Here is a final example that illustrates why you should always plot your data.

In [None]:
# Recent CSV.jl versions require the sink type (here DataFrame) as the second argument
ds = CSV.read("../data/Datasaurus_data.csv", DataFrame, header=[:x, :y]);

Let's look at some statistics.

In [None]:
describe(ds)

In [None]:
cor(ds.x, ds.y)

And histograms of each variable.

In [None]:
plot(ds, x=:x, Geom.histogram(bincount=30))

In [None]:
plot(ds, x=:y, Geom.histogram(bincount=30))

Nothing really jumps out.

How about a scatterplot?

In [None]:
plot(ds, x=:x, y=:y, Geom.point)