# Exploratory Data Analysis


First, let's look at a classic example of why it's important to plot data in addition to looking at statistics.

## [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe's_quartet)

Anscombe's quartet is a set of four small datasets, each with 11 observations of the variables `X` and `Y`.

In [None]:
using CSV
using DataFrames
using Gadfly
using GLM
using RDatasets
using Statistics

First we will load the dataset.

In [None]:
anscombe = dataset("datasets", "anscombe")

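Next we calculate the mean and standard deviation of each variable. Notice that the four `X` variables share the same mean and standard deviation, and the four `Y` variables agree to about two decimal places.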

In [None]:
describe(anscombe, :mean, :std)

Each `X`, `Y` pair also has nearly the same correlation, approximately 0.82.

In [None]:
cor(anscombe[!, :X1], anscombe[!, :Y1]),
cor(anscombe[!, :X2], anscombe[!, :Y2]),
cor(anscombe[!, :X3], anscombe[!, :Y3]),
cor(anscombe[!, :X4], anscombe[!, :Y4])

Let's run linear regressions of each `Y` on its respective `X` and compare results. Note that the fitted slopes and intercepts are nearly the same.

In [None]:
lm1 = lm(@formula(Y1 ~ X1), anscombe)
lm2 = lm(@formula(Y2 ~ X2), anscombe)
lm3 = lm(@formula(Y3 ~ X3), anscombe)
lm4 = lm(@formula(Y4 ~ X4), anscombe)

print(lm1)
print(lm2)
print(lm3)
print(lm4)
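Reading four printed coefficient tables one after another makes the comparison harder than it needs to be. As a sketch, the intercepts, slopes, and R² values can be collected into a single table with `coef` and `r2` from GLM. The names `fits` and `comparison` are hypothetical, and the block refits the models so it can run on its own.

```julia
using DataFrames
using GLM
using RDatasets

anscombe = dataset("datasets", "anscombe")

# Refit the four regressions so this cell does not depend on earlier ones.
fits = [
    lm(@formula(Y1 ~ X1), anscombe),
    lm(@formula(Y2 ~ X2), anscombe),
    lm(@formula(Y3 ~ X3), anscombe),
    lm(@formula(Y4 ~ X4), anscombe),
]

# One row per X/Y pair: coef(m)[1] is the intercept, coef(m)[2] is the slope.
comparison = DataFrame(
    pair      = 1:4,
    intercept = [round(coef(m)[1], digits=2) for m in fits],
    slope     = [round(coef(m)[2], digits=2) for m in fits],
    r2        = [round(r2(m), digits=2) for m in fits],
)
```

Each row should show an intercept near 3.0, a slope near 0.5, and an R² near 0.67.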

Now let's plot each set of `X`, `Y` values with a linear regression line. Note that the four scatterplots look quite different even though the fitted regression lines are nearly identical.

In [None]:
p1 = plot(anscombe, x=:X1, y=:Y1, Geom.point, Geom.smooth(method=:lm))
p2 = plot(anscombe, x=:X2, y=:Y2, Geom.point, Geom.smooth(method=:lm))
p3 = plot(anscombe, x=:X3, y=:Y3, Geom.point, Geom.smooth(method=:lm))
p4 = plot(anscombe, x=:X4, y=:Y4, Geom.point, Geom.smooth(method=:lm))

title(gridstack([p1 p2; p3 p4]), "Anscombe's Quartet")

### Final example

Here is a final example that illustrates why you should always plot your data.

In [None]:
# Recent CSV.jl versions require the sink type (here DataFrame) as the second argument.
ds = CSV.read("../data/Datasaurus_data.csv", DataFrame, header=[:x, :y]);

Let's look at some statistics.

In [None]:
describe(ds)

In [None]:
cor(ds.x, ds.y)

And histograms of each variable.

In [None]:
plot(ds, x=:x, Geom.histogram(bincount=30))

In [None]:
plot(ds, x=:y, Geom.histogram(bincount=30))

So far, nothing really jumps out.

How about a scatterplot?

In [None]:
plot(ds, x=:x, y=:y, Geom.point)

And here it is again with a linear regression line.

In [None]:
plot(ds, x=:x, y=:y, Geom.point, Geom.smooth(method=:lm))

#### Always plot your data!