# Anscombe's Quartet Exercise

The purpose of this exercise is to demonstrate the importance of data visualization in statistical analysis. Anscombe's Quartet consists of four datasets that have nearly identical summary statistics (e.g., mean, variance, and correlation), but look significantly different when graphed.

By following this exercise, you will:

* Plot all four datasets.
* Visualize the differences between them.
* See the necessity of graphical analysis alongside numerical data summaries.


https://en.wikipedia.org/wiki/Anscombe%27s_quartet

## 1. Import Libraries
We start by importing the necessary libraries for data visualization and numerical operations:

In [1]:
import matplotlib.pyplot as plt
import numpy as np

## 2. Create the datasets
Next, we define the four datasets. Datasets I, II and III share the same x values, while dataset IV has its own.

In [2]:
# Define the x and y values of the four datasets
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = {
    'I': (x, y1),
    'II': (x, y2),
    'III': (x, y3),
    'IV': (x4, y4)
}


## 3. Do we think we are experts?
Let's compute some summary statistics for each dataset.

In [3]:
import pandas as pd

# Create a DataFrame for the summary statistics of each dataset
summary_stats = {
    'Dataset': ['I', 'II', 'III', 'IV'],
    'Mean_X': [np.mean(x), np.mean(x), np.mean(x), np.mean(x4)],
    'Mean_Y': [np.mean(y1), np.mean(y2), np.mean(y3), np.mean(y4)],
    'Variance_X': [np.var(x, ddof=1), np.var(x, ddof=1), np.var(x, ddof=1), np.var(x4, ddof=1)],
    'Variance_Y': [np.var(y1, ddof=1), np.var(y2, ddof=1), np.var(y3, ddof=1), np.var(y4, ddof=1)],
    'Correlation': [
        np.corrcoef(x, y1)[0, 1],
        np.corrcoef(x, y2)[0, 1],
        np.corrcoef(x, y3)[0, 1],
        np.corrcoef(x4, y4)[0, 1]
    ]
}

summary_df = pd.DataFrame(summary_stats)

# summary_df  # uncomment this line to display the table
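
The same statistics can also be computed in a loop and rounded, which makes the similarity easier to spot. The block below is a minimal sketch, not part of the original notebook: the helper `summarize` and the dictionary `rounded_stats` are names made up for illustration, and the data is repeated so the block runs on its own.

```python
import numpy as np

# Same data as in section 2, repeated so this block is self-contained
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
datasets = {'I': (x, y1), 'II': (x, y2), 'III': (x, y3), 'IV': (x4, y4)}


def summarize(x_vals, y_vals):
    """Hypothetical helper: mean, sample variance and Pearson correlation,
    each rounded to two decimals."""
    return {
        'mean_x': round(float(np.mean(x_vals)), 2),
        'mean_y': round(float(np.mean(y_vals)), 2),
        'var_x': round(float(np.var(x_vals, ddof=1)), 2),  # ddof=1: sample variance
        'var_y': round(float(np.var(y_vals, ddof=1)), 2),
        'corr': round(float(np.corrcoef(x_vals, y_vals)[0, 1]), 2),
    }


rounded_stats = {name: summarize(xs, ys) for name, (xs, ys) in datasets.items()}
for name, stats in rounded_stats.items():
    print(name, stats)
```

After rounding, the four rows should be almost indistinguishable: the means and variances of x coincide, and the remaining values differ only slightly, if at all.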

What do we see? This is a typical example of misleading statistical indicators! Let's plot the data.

## 4. Let's make some plots

Let's plot the points and fit a straight line to each dataset.
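
Before drawing anything, it can help to look at the fitted lines as numbers. The following is a small sketch, assuming an ordinary least-squares line obtained with `np.polyfit` (degree 1); the dictionary `fits` is a hypothetical name used only here, and the data is repeated so the block runs on its own.

```python
import numpy as np

# Same data as in section 2, repeated so this block is self-contained
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
datasets = {'I': (x, y1), 'II': (x, y2), 'III': (x, y3), 'IV': (x4, y4)}

fits = {}  # hypothetical container: dataset name -> (slope, intercept)
for name, (xs, ys) in datasets.items():
    # np.polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(xs, ys, 1)
    fits[name] = (float(slope), float(intercept))
    print(f"Dataset {name}: y = {intercept:.2f} + {slope:.2f} x")
```

All four lines should come out at roughly y = 3 + 0.5x, so the fitted line alone cannot tell the datasets apart either.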

In [4]:
# Set plot_figure to True to draw the plots
plot_figure = False

if plot_figure:
    # Create a figure with 2x2 subplots for the four datasets
    fig, axs = plt.subplots(2, 2, figsize=(10, 10))
    fig.suptitle("Anscombe's Quartet", fontsize=16)

    # Define a list of titles for each subplot
    titles = ['Dataset I', 'Dataset II', 'Dataset III', 'Dataset IV']

    # Loop over the datasets and plot each one on its corresponding axis
    for ax, title, (x_data, y_data) in zip(axs.flatten(), titles, datasets.values()):
        ax.scatter(x_data, y_data, color='blue')

        # Fit a straight line (degree-1 polynomial) and draw it in red
        x_line = np.unique(x_data)
        fitted_line = np.poly1d(np.polyfit(x_data, y_data, 1))
        ax.plot(x_line, fitted_line(x_line), color='red')

        ax.set_title(title)
        ax.set_xlabel('X')
        ax.set_ylabel('Y')

    # Adjust layout to prevent overlapping elements
    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()
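
Once the plots are drawn, dataset II stands out as a smooth curve rather than a noisy line. As an optional check that goes slightly beyond the original exercise, the sketch below compares a straight-line fit with a quadratic fit for dataset II. The degree-2 fit and the names `max_resid_line` and `max_resid_quad` are assumptions added for illustration.

```python
import numpy as np

# Dataset II from section 2, repeated so this block is self-contained
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

x_arr = np.array(x, dtype=float)
y_arr = np.array(y2, dtype=float)

# Least-squares fits of degree 1 (line) and degree 2 (parabola)
line = np.poly1d(np.polyfit(x_arr, y_arr, 1))
quad = np.poly1d(np.polyfit(x_arr, y_arr, 2))

# Largest absolute residual of each fit
max_resid_line = float(np.max(np.abs(y_arr - line(x_arr))))
max_resid_quad = float(np.max(np.abs(y_arr - quad(x_arr))))

print(f"Largest residual, straight line: {max_resid_line:.3f}")
print(f"Largest residual, quadratic:     {max_resid_quad:.3f}")
```

The quadratic residuals should be tiny compared with those of the straight line, which is the kind of structure the summary statistics hide and the plot reveals at a glance.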
