# Author: FBB

### This is a demo of the famous Anscombe's Quartet
### developed by Federica B Bianco, UD @fedhere
### for CUSP Principles of Urban Informatics.

## The moral of the story is: look at your data!
### version 1: August 2015 
### last update: October 2020


In [1]:
import os
import sys
import numpy as np
import pylab as pl

import json

%pylab inline

Populating the interactive namespace from numpy and matplotlib


### Let's read in a file called anscombe.txt. It is not a regular csv file, but let's use the pandas module and its read_csv function anyway

In [3]:
import pandas as pd
pd.read_csv("https://raw.githubusercontent.com/fedhere/PUS2020_FBianco/master/data/anscombe.txt")

Unnamed: 0,Anscombe's Quartet
0,Anscombe-Data-Set I I II II III III IV IV
1,X Y X Y X Y X Y
2,10 8.04 10 9.14 10 7.46 8 6.58
3,8 6.95 8 8.14 8 6.77 8 5.76
4,13 7.58 13 8.74 13 12.74 8 7.71
5,9 8.81 9 8.77 9 7.11 8 8.84
6,11 8.33 11 9.26 11 7.81 8 8.47
7,14 9.96 14 8.1 14 8.84 8 7.04
8,6 7.24 6 6.13 6 6.08 8 5.25
9,4 4.26 4 3.1 4 5.39 19 12.5


Notice that I used a GitHub link starting with raw, as in https://raw.githubusercontent.com/fedhere/UInotebooks/master/anscombe.txt. The page https://github.com/fedhere/UInotebooks/blob/master/anscombe.txt is an HTML page that hosts the text for rendering in the browser. Downloading that would give you an HTML file, not the data. Try it.
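The relation between the two kinds of link can be sketched in a few lines. This is a minimal illustration, assuming the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout. The helper name `to_raw_url` is hypothetical and not part of any library.

```python
def to_raw_url(blob_url):
    """Turn a github.com 'blob' page URL into the matching raw-content URL.

    Hypothetical helper: it only rewrites the string, it does not download anything.
    """
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("not a github.com blob URL: " + blob_url)
    # split "user/repo" from "branch/path/to/file"
    user_repo, rest = blob_url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/" + user_repo + "/" + rest


page = "https://github.com/fedhere/UInotebooks/blob/master/anscombe.txt"
raw = to_raw_url(page)
print(raw)
```

The resulting string is the kind of address that `pd.read_csv` can read directly, because it serves the plain text instead of a web page.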

## pandas organizes datasets in "data frames"

## Let's read the file a little better by identifying which rows to use as the header that names our variables

In [4]:
pd.read_csv("https://raw.githubusercontent.com/fedhere/PUS2020_FBianco/master/data/anscombe.txt", 
            header = [1, 2], delimiter=' ')

Unnamed: 0_level_0,Anscombe-Data-Set,I,I,II,II,III,III,IV,IV,Unnamed: 9_level_0,Unnamed: 10_level_0
Unnamed: 0_level_1,Unnamed: 0_level_1,X,Y,X,Y,X,Y,X,Y,Unnamed: 9_level_1,Unnamed: 10_level_1
0,,10.0,8.04,10.0,9.14,10.0,7.46,8.0,6.58,,
1,,8.0,6.95,8.0,8.14,8.0,6.77,8.0,5.76,,
2,,13.0,7.58,13.0,8.74,13.0,12.74,8.0,7.71,,
3,,9.0,8.81,9.0,8.77,9.0,7.11,8.0,8.84,,
4,,11.0,8.33,11.0,9.26,11.0,7.81,8.0,8.47,,
5,,14.0,9.96,14.0,8.1,14.0,8.84,8.0,7.04,,
6,,6.0,7.24,6.0,6.13,6.0,6.08,8.0,5.25,,
7,,4.0,4.26,4.0,3.1,4.0,5.39,19.0,12.5,,
8,,12.0,10.84,12.0,9.13,12.0,8.15,8.0,5.56,,
9,,7.0,4.82,7.0,7.26,7.0,6.42,8.0,7.91,,


## Let's read the file better yet by reading only the rows we want

In [5]:
pd.read_csv("https://raw.githubusercontent.com/fedhere/PUS2020_FBianco/master/data/anscombe.txt", 
            header = [1, 2], nrows = 11, delimiter=' ')

Unnamed: 0_level_0,Anscombe-Data-Set,I,I,II,II,III,III,IV,IV,Unnamed: 9_level_0,Unnamed: 10_level_0
Unnamed: 0_level_1,Unnamed: 0_level_1,X,Y,X,Y,X,Y,X,Y,Unnamed: 9_level_1,Unnamed: 10_level_1
0,,10,8.04,10,9.14,10,7.46,8,6.58,,
1,,8,6.95,8,8.14,8,6.77,8,5.76,,
2,,13,7.58,13,8.74,13,12.74,8,7.71,,
3,,9,8.81,9,8.77,9,7.11,8,8.84,,
4,,11,8.33,11,9.26,11,7.81,8,8.47,,
5,,14,9.96,14,8.1,14,8.84,8,7.04,,
6,,6,7.24,6,6.13,6,6.08,8,5.25,,
7,,4,4.26,4,3.1,4,5.39,19,12.5,,
8,,12,10.84,12,9.13,12,8.15,8,5.56,,
9,,7,4.82,7,7.26,7,6.42,8,7.91,,
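To see what `header` and `nrows` do without touching the network, here is a small sketch on an in-memory stand-in. The text below is an assumption made for illustration: it is comma-separated and has a made-up trailing line, so it only mimics the general shape of anscombe.txt (a title line, two header rows, then data) and is not a copy of it.

```python
import io

import pandas as pd

# Stand-in text (hypothetical): title line, two header rows, three data rows,
# and a trailing line that is not data.
text = """Anscombe's Quartet
I,I,II,II
X,Y,X,Y
10,8.04,10,9.14
8,6.95,8,8.14
13,7.58,13,8.74
source: footer note
"""

# header=[1, 2] skips line 0 and builds two-level column names from lines 1 and 2
full = pd.read_csv(io.StringIO(text), header=[1, 2])

# nrows=3 stops after three data rows, so the trailing line is never read
trimmed = pd.read_csv(io.StringIO(text), header=[1, 2], nrows=3)

print(len(full), len(trimmed))
print(trimmed.columns.tolist())
```

Reading with `nrows` keeps the numeric columns numeric, because the stray trailing line never ends up in the table.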


## If we are happy, let's save the data frame read by pandas in a variable
### (we could drop the columns we do not need after that, but I'm not gonna bother)

In [7]:
ansc = pd.read_csv("https://raw.githubusercontent.com/fedhere/PUS2020_FBianco/master/data/anscombe.txt", 
                   header = [1, 2], nrows = 11, delimiter=' ')
#fig=pl.figure(figsize=(10,10))

ansc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 11 columns):
(Anscombe-Data-Set, Unnamed: 0_level_1)       0 non-null float64
(I, X)                                        11 non-null int64
(I, Y)                                        11 non-null float64
(II, X)                                       11 non-null int64
(II, Y)                                       11 non-null float64
(III, X)                                      11 non-null int64
(III, Y)                                      11 non-null float64
(IV, X)                                       11 non-null int64
(IV, Y)                                       11 non-null float64
(Unnamed: 9_level_0, Unnamed: 9_level_1)      0 non-null float64
(Unnamed: 10_level_0, Unnamed: 10_level_1)    0 non-null float64
dtypes: float64(7), int64(4)
memory usage: 1.0 KB
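For anyone who does want to drop the empty columns reported by `info()`, one possible approach is `dropna` along the column axis. This is a sketch on a small hand-built frame (the name `demo` and its reduced set of columns are assumptions for illustration), not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Small stand-in with the same kind of two-level column names as ansc,
# including two columns that contain only missing values.
columns = pd.MultiIndex.from_tuples([
    ("Anscombe-Data-Set", "Unnamed: 0_level_1"),
    ("I", "X"),
    ("I", "Y"),
    ("Unnamed: 9_level_0", "Unnamed: 9_level_1"),
])
demo = pd.DataFrame(
    [[np.nan, 10, 8.04, np.nan],
     [np.nan, 8, 6.95, np.nan],
     [np.nan, 13, 7.58, np.nan]],
    columns=columns,
)

# axis=1 works on columns; how="all" drops a column only if every value is missing
clean = demo.dropna(axis=1, how="all")
print(clean.columns.tolist())
```

Using `how="all"` is the conservative choice here: a column with even one real value is kept.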


## The dataframe is an object: an instance of the pandas DataFrame class. In this case it hosts 4 datasets: I, II, III, IV. We can think of the dataframe as a python dictionary as well.
## If we think of it as an object, we can refer to the first dataset, identified as I, with attribute access: ansc.I
## Otherwise we can refer to it as a dictionary, using the name as a key: ansc["I"]

This structure propagates downward to the elements of ansc.I too: for example, ansc.I.X and ansc["I"]["X"] refer to the same column.
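The two ways of addressing the data can be compared side by side. This is a minimal sketch on a hand-built frame with two-level column names. The variable `demo` is a stand-in for `ansc`, filled with the first three rows of datasets I and II shown above.

```python
import pandas as pd

# Two datasets (I, II), each with an X and a Y column
columns = pd.MultiIndex.from_product([["I", "II"], ["X", "Y"]])
demo = pd.DataFrame(
    [[10, 8.04, 10, 9.14],
     [8, 6.95, 8, 8.14],
     [13, 7.58, 13, 8.74]],
    columns=columns,
)

first_attr = demo.I          # attribute-style access to dataset I
first_key = demo["I"]        # dictionary-style access to the same dataset

y_attr = demo.I.Y            # the same idea one level down
y_key = demo["I"]["Y"]
y_tuple = demo[("I", "Y")]   # both levels at once, as a tuple key

print(first_attr.equals(first_key))
print(y_attr.tolist())
```

Attribute access is shorter to type, but it only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute. The bracket form works for any column name.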