# Anscombe's Quartet Dataset

## Background of the Dataset

<p align="center">
  <img src="Francis_Anscombe.jpg">
  <br><b>Francis Anscombe</b><br>
</p>

Anscombe's quartet was constructed by Francis Anscombe in 1973. [1] It consists of four datasets, each containing 11 (x, y) pairs. Anscombe built it to illustrate the importance of graphing data. He argued that graphs help us perceive and appreciate the broad features of the data and let us look behind those features to see what else is there. [2] The four datasets have nearly identical summary statistics but look very different when plotted.


## Descriptive Statistics

In [1]:
import pandas as pd

df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/anscombe.csv")
                 
df

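| | Unnamed: 0 | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 1 | 2 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 2 | 3 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 3 | 4 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 4 | 5 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 5 | 6 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 6 | 7 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 7 | 8 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 8 | 9 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 9 | 10 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 10 | 11 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |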


In [2]:
df.loc[:,['x1', 'y1']]

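| | x1 | y1 |
|---|---|---|
| 0 | 10 | 8.04 |
| 1 | 8 | 6.95 |
| 2 | 13 | 7.58 |
| 3 | 9 | 8.81 |
| 4 | 11 | 8.33 |
| 5 | 14 | 9.96 |
| 6 | 6 | 7.24 |
| 7 | 4 | 4.26 |
| 8 | 12 | 10.84 |
| 9 | 7 | 4.82 |
| 10 | 5 | 5.68 |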


In [3]:
df.describe()

| | Unnamed: 0 | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|---|---|
| count | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| mean | 6.0 | 9.0 | 9.0 | 9.0 | 9.0 | 7.500909 | 7.500909 | 7.5 | 7.500909 |
| std | 3.316625 | 3.316625 | 3.316625 | 3.316625 | 3.316625 | 2.031568 | 2.031657 | 2.030424 | 2.030579 |
| min | 1.0 | 4.0 | 4.0 | 4.0 | 8.0 | 4.26 | 3.1 | 5.39 | 5.25 |
| 25% | 3.5 | 6.5 | 6.5 | 6.5 | 8.0 | 6.315 | 6.695 | 6.25 | 6.17 |
| 50% | 6.0 | 9.0 | 9.0 | 9.0 | 8.0 | 7.58 | 8.14 | 7.11 | 7.04 |
| 75% | 8.5 | 11.5 | 11.5 | 11.5 | 8.0 | 8.57 | 8.95 | 7.98 | 8.19 |
| max | 11.0 | 14.0 | 14.0 | 14.0 | 19.0 | 10.84 | 9.26 | 12.74 | 12.5 |


Every column has a count of 11.

Mean of each x column = 9.000000

Mean of each y column = 7.50 (correct to 2 d.p.)

Standard deviation of each x column = 3.316625

Standard deviation of each y column = 2.03 (correct to 2 d.p.)
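
The agreement between the columns can also be checked without reading it off the `describe()` table. The following is a minimal sketch, assuming the quartet's values are hard-coded so that it runs without network access; the names `quartet` and `summary` are illustrative and not part of the notebook. `ddof=1` is passed so that NumPy returns the sample standard deviation, which is what pandas reports.

```python
import numpy as np

# Anscombe's quartet, hard-coded so the sketch runs without network access.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "x1": x123,
    "x2": x123,
    "x3": x123,
    "x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

# (mean, sample standard deviation) per column, both rounded to 2 d.p.
# ddof=1 divides by n - 1, matching the "std" row of pandas' describe().
summary = {
    name: (
        round(float(np.mean(values)), 2),
        round(float(np.std(values, ddof=1)), 2),
    )
    for name, values in quartet.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: mean = {mean:.2f}, std = {std:.2f}")
```

Rounded to 2 d.p., every x column gives the same pair of values and so does every y column, which is the property the quartet was designed to have.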

In [4]:
x1 = df.loc[:,'x1']
x2 = df.loc[:,'x2']
x3 = df.loc[:,'x3']
x4 = df.loc[:,'x4']

y1 = df.loc[:,'y1']
y2 = df.loc[:,'y2']
y3 = df.loc[:,'y3']
y4 = df.loc[:,'y4']

### Determining the linear regression line

In [5]:
import numpy as np
import matplotlib.pyplot as plt

linear1 = np.polyfit(x1, y1, 1)

linear2 = np.polyfit(x2, y2, 1)

linear3 = np.polyfit(x3, y3, 1)

linear4 = np.polyfit(x4, y4, 1)

print("The slope of the line and the y intercept for the linear regression lines for each dataset are:")
      
print("x1 vs y1:", linear1)
print("x2 vs y2:", linear2)
print("x3 vs y3:", linear3)
print("x4 vs y4:", linear4)


The slope of the line and the y intercept for the linear regression lines for each dataset are:
x1 vs y1: [ 0.50009091  3.00009091]
x2 vs y2: [ 0.5         3.00090909]
x3 vs y3: [ 0.49972727  3.00245455]
x4 vs y4: [ 0.49990909  3.00172727]


We can see that the linear regression line for each dataset is approximately y = 0.5x + 3.00 (slope m correct to 1 d.p. and intercept c correct to 2 d.p.).
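
To make explicit what `np.polyfit` computes for a degree-1 fit, the slope and intercept can be written in closed form: the slope is the sum of products of the x and y deviations divided by the sum of squared x deviations, and the intercept follows from the means. The sketch below assumes hard-coded data so that it runs on its own; the helper `least_squares_line` and the dictionaries `pairs` and `fits` are hypothetical names introduced for illustration.

```python
import numpy as np

# Anscombe's quartet, hard-coded so the sketch runs without network access.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
pairs = {
    "x1 vs y1": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "x2 vs y2": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "x3 vs y3": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "x4 vs y4": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}


def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line (illustrative helper)."""
    x_dev = x - x.mean()
    slope = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


fits = {}
for label, (x, y) in pairs.items():
    slope, intercept = least_squares_line(x, y)
    # Cross-check the closed form against np.polyfit.
    assert np.allclose([slope, intercept], np.polyfit(x, y, 1))
    fits[label] = (float(slope), float(intercept))
    print(f"{label}: y = {slope:.1f}x + {intercept:.2f}")
```

With the slope formatted to 1 d.p. and the intercept to 2 d.p., each of the four lines prints as `y = 0.5x + 3.00`.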

In [9]:
# Adapted from https://stackoverflow.com/questions/6148207/linear-regression-with-matplotlib-numpy 
from scipy.stats import linregress

slope1, intercept1, rvalue1, pvalue1, stderr1 = linregress(x1,y1)

print(f"For dataset 1 the slope is: {slope1}, the y intercept for the regression line is {intercept1}, and the correlation coefficient is {rvalue1}")

For dataset 1 the slope is: 0.5000909090909091, the y intercept for the regression line is 3.0000909090909103, and the correlation coefficient is 0.8164205163448399
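
The cell above reports the correlation coefficient for dataset 1 only. As a sketch of how the same check could be extended to all four datasets, the example below swaps in `np.corrcoef` for `linregress` (a substitution made here to keep the example NumPy-only, not the notebook's method) and assumes the data are hard-coded; `pairs` and `correlations` are illustrative names.

```python
import numpy as np

# Anscombe's quartet, hard-coded so the sketch runs without network access.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
pairs = {
    "x1 vs y1": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "x2 vs y2": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "x3 vs y3": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "x4 vs y4": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation coefficient between x and y.
correlations = {
    label: float(np.corrcoef(x, y)[0, 1]) for label, (x, y) in pairs.items()
}

for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}")
```

All four coefficients agree to 2 d.p., so the correlation coefficient distinguishes the datasets no better than the means, standard deviations, or fitted lines do.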


## Plotting the Dataset

In [None]:
# Subplots from https://matplotlib.org/gallery/specialty_plots/anscombe.html
# Linear regression plotting from https://plot.ly/matplotlib/linear-fits/
# The linear regression line was determined above (in the Descriptive Statistics section) to be approximately y = 0.5x + 3.00

import numpy as np
import matplotlib.pyplot as plt

# Function to return the minimum and maximum x values, the two endpoints needed to draw the regression line
def xLine(xVal):
    return np.array([np.min(xVal), np.max(xVal)])

# Function to return y values for passed x values to produce linear regression line
def yLine(xArray):
    return 0.5 * xArray  + 3.00


plt.subplot(221)
plt.plot(x1, y1, 'r.', xLine(x1), yLine(xLine(x1)))

plt.subplot(222)
plt.plot(x2, y2, 'b.', xLine(x2), yLine(xLine(x2)))

plt.subplot(223)
plt.plot(x3, y3, 'g.', xLine(x3), yLine(xLine(x3)))

plt.subplot(224)
plt.plot(x4, y4, 'y.', xLine(x4), yLine(xLine(x4)))

plt.show()

## Why the Dataset is Interesting

Anscombe's quartet illustrates why it is important to graph a dataset rather than rely on summary statistics alone. In the first dataset the points are scattered as would be expected with a well-fitting linear model. [3] The remaining three are not. In the second dataset the points lie on a curve, so a linear regression is not appropriate. In the third dataset all the points lie on a straight line apart from one outlier, which has a large effect on the fitted line; it may be necessary to remove the outlier and repeat the regression without it. In the fourth dataset there does not appear to be any relationship between the x and y values: every x value is 8 except for a single outlier at x = 19, and that one point alone determines the slope of the fitted line.
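
The observation about the second dataset can be checked numerically by comparing a straight-line fit with a degree-2 polynomial fit. This is a minimal sketch with the dataset hard-coded; the variable names are illustrative, and the choice of a quadratic is an assumption made here to test the "points on a curve" description, not something the notebook itself computes.

```python
import numpy as np

# Second dataset of the quartet, hard-coded so the sketch runs on its own.
x2 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Fit a straight line (degree 1) and a parabola (degree 2) to the same points.
linear_coeffs = np.polyfit(x2, y2, 1)
quadratic_coeffs = np.polyfit(x2, y2, 2)

# Largest absolute residual of each fit over the 11 observations.
linear_worst = float(np.max(np.abs(y2 - np.polyval(linear_coeffs, x2))))
quadratic_worst = float(np.max(np.abs(y2 - np.polyval(quadratic_coeffs, x2))))

print(f"largest residual, linear fit:    {linear_worst:.3f}")
print(f"largest residual, quadratic fit: {quadratic_worst:.3f}")
```

The straight line misses at least one point by more than one unit of y, while the parabola passes within rounding distance of every point, which supports treating the relationship as curved.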

For datasets three and four the standard regression calculation should be accompanied by a warning that one observation has played a critical role. [1]
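
How much one observation matters in the third dataset can be seen by refitting without it. The sketch below is an illustration under stated assumptions: the data are hard-coded, the outlier is identified simply as the point with the largest absolute residual from the full fit (a heuristic chosen here, not a general outlier test), and all variable names are hypothetical.

```python
import numpy as np

# Third dataset of the quartet, hard-coded so the sketch runs on its own.
x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Fit using all 11 observations.
full_slope, full_intercept = np.polyfit(x3, y3, 1)

# Heuristic: treat the observation with the largest absolute residual as the outlier.
residuals = y3 - (full_slope * x3 + full_intercept)
outlier = int(np.argmax(np.abs(residuals)))

# Refit with that single observation excluded.
keep = np.arange(len(x3)) != outlier
trimmed_slope, trimmed_intercept = np.polyfit(x3[keep], y3[keep], 1)

print(f"outlier: x = {x3[outlier]:.0f}, y = {y3[outlier]:.2f}")
print(f"all points:      y = {full_slope:.3f}x + {full_intercept:.3f}")
print(f"outlier removed: y = {trimmed_slope:.3f}x + {trimmed_intercept:.3f}")
```

Dropping the single point at x = 13 lowers the slope from about 0.5 to roughly 0.35. The same experiment cannot be run on the fourth dataset: once the point at x = 19 is removed, every remaining x value is 8, so no slope can be estimated at all.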


## References

[1] Wikipedia. Anscombe's quartet.
https://en.wikipedia.org/wiki/Anscombe%27s_quartet

[2] San Jose State University. Graphs in Statistical Analysis.
www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf

[3] Eager Eyes. Anscombe's Quartet.
https://eagereyes.org/criticism/anscombes-quartet
