# Statistical Data Visualization



![logo](logo_thumbnail.png)

*Data Science @ SC*

![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)

## Problem statement

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [None]:
sns.set()

## Loading the data

In [None]:
TRAIN_FEATURES_URL = "https://drivendata-prod.s3.amazonaws.com/data/7/public/4910797b-ee55-40a7-8668-10efd5c1b960.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201013%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201013T211537Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=ca380fb79c2e6103b31dc88b7310d09999ec7df2d2aa208cf8f6ed66cbe4b6ca"
TRAIN_LABELS_URL = "https://drivendata-prod.s3.amazonaws.com/data/7/public/0bf8bc6e-30d0-4c50-956a-603fc693d966.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201013%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201013T211537Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=1a2031752342b10cd759d1f1a446256de027851ac28cfc991b5aae9d890c5315"
TEST_FEATURES_URL = "https://drivendata-prod.s3.amazonaws.com/data/7/public/702ddfc5-68cd-4d1d-a0de-f5f566f76d91.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201013%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201013T211537Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=70ce0575a5db9e798145b348df1532938e45dd90864c47ae035894ec61bd4b7a"

Make a GET request for each URL and write the response body to a `.csv` file.

**NOTE**: these URLs carry expiring credential tokens. If they no longer work, go to the [data download](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/) page of the competition to get fresh links.

In [None]:
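# Download each file. raise_for_status() stops on an HTTP error
# (for example an expired token) instead of saving the error page as a CSV.
r = requests.get(TRAIN_FEATURES_URL)
r.raise_for_status()
with open("train_features.csv", 'wb') as f:
    f.write(r.content)

r = requests.get(TRAIN_LABELS_URL)
r.raise_for_status()
with open("train_labels.csv", 'wb') as f:
    f.write(r.content)

r = requests.get(TEST_FEATURES_URL)
r.raise_for_status()
with open("test_features.csv", 'wb') as f:
    f.write(r.content)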

Read in the data using `pandas`.

In [None]:
X_train = pd.read_csv("train_features.csv")
y_train = pd.read_csv("train_labels.csv")
X_test = pd.read_csv("test_features.csv")

## Feature types

View the feature descriptions on the [competition page](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/#features_list).

In [None]:
X_types = X_train.dtypes
X_types

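Make lists of the categorical and numerical variables, which will be useful when choosing visualizations. `region_code` and `district_code` are stored as integers but are treated as categorical, and `id` is left out of the numerical list.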

In [None]:
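# Series.append was removed in pandas 2.0, so use pd.concat instead.
# region_code and district_code are integer-coded but categorical.
X_cat = (
    pd.concat([
        X_types[X_types == "object"],
        pd.Series({"region_code": "int64", "district_code": "int64"}),
    ])
    .index
)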

X_num = (X_types
    [(X_types == "int64") | (X_types == "float64")]
    .drop(["id", "region_code", "district_code"])
    .index
)
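
The same kind of split can be sketched with `DataFrame.select_dtypes` on a small toy frame. Apart from `id` and `region_code`, the column names and all values here are made up for illustration, and the `coded_as_int` and `identifier` lists are our own helper names.

```python
import pandas as pd

# Toy frame with invented values, standing in for the real features
df_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "amount": [0.0, 25.0, 6000.0],
    "height": [1390, 1399, 686],
    "basin": ["north", "south", "east"],
    "region_code": [11, 20, 21],
})

coded_as_int = ["region_code"]  # integers that are really categories
identifier = ["id"]             # row identifier, not a feature

non_numeric = df_toy.select_dtypes(exclude="number").columns.tolist()
numeric = df_toy.select_dtypes(include="number").columns.tolist()

# Categorical: non-numeric columns plus the integer-coded categories
toy_cat = non_numeric + coded_as_int
# Numerical: numeric columns minus identifiers and integer-coded categories
toy_num = [c for c in numeric if c not in identifier + coded_as_int]

print(toy_cat)
print(toy_num)
```

Keeping the integer-coded columns in an explicit list makes the manual override visible, rather than hiding it inside the dtype filter.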

## Why is visualization important?

A first approach to exploring a dataset is to compute summary statistics.

In [None]:
X_train[X_num].describe()

### Numbers don't tell the entire story

#### Example 1: Anscombe's quartet

In [None]:
df_quartet = sns.load_dataset("anscombe")

Summary statistics: look at the mean and standard deviation for each of these datasets.

In [None]:
(df_quartet
     [df_quartet["dataset"] == "I"]
     [["x", "y"]]
     .describe()
)

In [None]:
(df_quartet
     [df_quartet["dataset"] == "II"]
     [["x", "y"]]
     .describe()
)

In [None]:
(df_quartet
     [df_quartet["dataset"] == "III"]
     [["x", "y"]]
     .describe()
)

In [None]:
(df_quartet
     [df_quartet["dataset"] == "IV"]
     [["x", "y"]]
     .describe()
)
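
The four cells above can also be collapsed into a single table. This is a minimal sketch with the quartet values typed in by hand so that it runs without a network connection (`sns.load_dataset` downloads its data); the names `quartet`, `summary`, and `corr` are ours.

```python
import pandas as pd

# Anscombe's quartet, typed in by hand
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

frames = []
for name, y in ys.items():
    x = x4 if name == "IV" else x123
    frames.append(pd.DataFrame({"dataset": name, "x": x, "y": y}))
quartet = pd.concat(frames, ignore_index=True)

# Mean and standard deviation of x and y, one row per dataset
summary = quartet.groupby("dataset")[["x", "y"]].agg(["mean", "std"])

# Pearson correlation between x and y within each dataset
corr = quartet.groupby("dataset")[["x", "y"]].apply(lambda g: g["x"].corr(g["y"]))

print(summary.round(2))
print(corr.round(3))
```

Seeing the rows side by side makes it easier to notice that the four datasets are nearly indistinguishable by these numbers alone.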

Visualization: plotting each dataset with a fitted regression line shows how different they really are.

In [None]:
sns.lmplot(
    x="x", 
    y="y", 
    col="dataset", 
    hue="dataset", 
    data=df_quartet,
    col_wrap=2, 
    ci=None, 
    palette="rocket", 
    height=4,
    scatter_kws={"s": 50, "alpha": 0.5}
)

#### Example 2: Datasaurus Dozen

An even more extreme example: thirteen visually very different datasets whose summary statistics (mean, standard deviation, and correlation) are practically the same!

![Datasaurus dozen](https://d2f99xq7vri1nk.cloudfront.net/AllDinosGrey_1.png)

*Justin Matejka, George Fitzmaurice, Autodesk Research, 2017* | [link](https://www.autodesk.com/research/publications/same-stats-different-graphs)

## Statistical plots

### 1. Distributions

#### 1.1 Numeric

#### 1.2 Numeric | Numeric

#### 1.3 Numeric | Categorical

#### 1.4 Categorical

#### 1.5 Categorical | Categorical

#### 1.6 Many Numerics

### 2. Miscellaneous

#### 2.1 Summarizing relations

Using correlation or mutual information
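
A minimal sketch of the correlation route, using synthetic data in place of the competition features (the column names `a`, `b`, `c` and the way they are generated are assumptions for illustration). Mutual information is not shown here; it would need an additional library.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=n),  # strongly related to a
    "c": rng.normal(size=n),                      # independent noise
})

# Pairwise Pearson correlation between all numeric columns
corr = demo.corr()

# Heatmap with a diverging palette and a fixed [-1, 1] colour range
ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap="vlag", annot=True, fmt=".2f", square=True)
ax.set_title("Pearson correlation (synthetic data)")
```

Fixing the colour range to [-1, 1] keeps heatmaps comparable across different subsets of features. Pearson correlation only captures linear relations, which is one reason to consider mutual information as well.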

#### 2.2 Structure of missing values
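
One possible way to look at missing-value structure, sketched on a small invented frame (the column names and the pattern of gaps are placeholders, not the competition data): compute the share of missing values per column, then check whether columns tend to be missing together.

```python
import numpy as np
import pandas as pd

# Invented frame: "source" and "installer" are often missing in the same rows
demo = pd.DataFrame({
    "source": ["A", None, "B", None, "C", "D"],
    "installer": ["A", None, "B", None, "C", None],
    "height": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0],
})

is_missing = demo.isna()

# Fraction of missing values per column, largest first
missing_share = is_missing.mean().sort_values(ascending=False)

# Correlation between the missingness indicators: values near 1 mean
# two columns tend to be missing in the same rows
co_missing = is_missing.astype(int).corr()

print(missing_share)
print(co_missing.round(2))
```

The `co_missing` table can be passed to a heatmap in the same way as an ordinary correlation matrix.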

#### 2.3 Data transformations

Log or polar transformations may reveal extra structure.
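
A minimal sketch of the log case on a synthetic right-skewed sample (the lognormal data and the `amount` name are assumptions for illustration, not the competition data). `np.log1p` is used so that zeros, if present, stay defined.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Right-skewed synthetic sample: most values small, a few very large
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000), name="amount")
log_amount = np.log1p(amount)  # log(1 + x)

# Histograms of the raw and the log-transformed values, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(amount, bins=50)
axes[0].set_title("Raw scale")
axes[1].hist(log_amount, bins=50)
axes[1].set_title("log1p scale")

# Sample skewness before and after the transformation
skew_raw = amount.skew()
skew_log = log_amount.skew()
print(round(skew_raw, 2), round(skew_log, 2))
```

On the raw scale most of the histogram is squeezed into the first few bins; on the log scale the bulk of the distribution becomes visible.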