# Statistical Data Visualization



![logo](logo_thumbnail.png)

*Data Science @ SC*

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/datascienceucsc/workshops/blob/master/f2020/data-visualization/statistical_visualization.ipynb)

## DrivenData Water Table: Problem Statement

Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

## Our Goal

In this notebook, we'll be doing an initial data exploration, where you will learn about common statistical plots and what to look for when you first start a machine learning competition.

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import numpy as np

In [None]:
sns.set()

## Loading the data

In [None]:
TRAIN_LABELS_URL = "https://drivendata-prod.s3.amazonaws.com/data/7/public/0bf8bc6e-30d0-4c50-956a-603fc693d966.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201016%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201016T000043Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=332f27ce1bd763748a70eb7eefb9d9b9aab60b30fefe0b17e931d348b3cd3d03"
TRAIN_FEATURES_URL = "https://drivendata-prod.s3.amazonaws.com/data/7/public/4910797b-ee55-40a7-8668-10efd5c1b960.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYVI2LMPSY%2F20201016%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201016T000043Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=8324f307d1f98b8a31be32ff1c22b75cc8cf1e8cbd94333d49d9e6014ca864ba"

Make a GET request for each URL and write the response to a `.csv` file.

**NOTE**: these URLs carry expiring credential tokens. If they no longer work, go to the [data download](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/)
page of the competition to get fresh links.

In [None]:
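# Download the features file; fail loudly if the signed URL has expired.
r = requests.get(TRAIN_FEATURES_URL)
r.raise_for_status()
with open("train_features.csv", "wb") as f:
    f.write(r.content)

# Same for the labels file.
r = requests.get(TRAIN_LABELS_URL)
r.raise_for_status()
with open("train_labels.csv", "wb") as f:
    f.write(r.content)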

Read in the data using `pandas`

In [None]:
X_train = pd.read_csv("train_features.csv")
y_train = pd.read_csv("train_labels.csv")
X_train['status_group'] = y_train["status_group"]
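
The assignment above matches labels to features by row position, which works only if both files list the pumps in the same order. A more defensive sketch joins on the shared `id` column instead. The two small frames below are made-up stand-ins for the real files, and `validate="one_to_one"` is there to raise an error if an `id` appears more than once on either side.

```python
import pandas as pd

# Hypothetical stand-ins for train_features.csv and train_labels.csv.
features = pd.DataFrame({"id": [3, 1, 2], "gps_height": [120, 0, 45]})
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})

# Join on `id` so each pump gets its own label regardless of row order.
merged = features.merge(labels, on="id", how="left", validate="one_to_one")
print(merged)
```

With the real data, the same call would be `X_train.merge(y_train, on="id", ...)` applied before the `status_group` column is added by hand.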

## Feature types

View feature descriptions on the [competition page](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/#features_list)

In [None]:
X_types = X_train.dtypes
X_types

Make lists of categorical and numerical variables. This will be useful for making visualizations

In [None]:
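# Object columns are categorical; region_code and district_code are
# integer-coded categories, so add them by name.
# (Series.append was removed in pandas 2.0, so extend the Index instead.)
X_cat = (X_types
    [X_types == "object"]
    .index
    .append(pd.Index(["region_code", "district_code"]))
)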

X_num = (X_types
    [(X_types == "int64") | (X_types == "float64")]
    .drop(["id", "region_code", "district_code"])
    .index
)
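
As an alternative sketch, `select_dtypes` can make the same split without comparing dtype strings by hand. The toy frame and the `CODE_COLUMNS` and `ID_COLUMNS` lists are illustrative assumptions; the idea is that integer-coded identifiers such as `region_code` move from the numeric list to the categorical one.

```python
import pandas as pd

# Made-up frame with a mix of column types.
toy = pd.DataFrame({
    "id": [10, 11, 12],
    "funder": ["a", "b", "a"],
    "region_code": [1, 2, 1],
    "gps_height": [120, 0, 45],
    "longitude": [34.9, 35.1, 36.0],
})

# Integer columns that are really category codes or row identifiers.
CODE_COLUMNS = ["region_code"]
ID_COLUMNS = ["id"]

# Everything non-numeric is categorical, plus the integer-coded columns.
cat_cols = toy.select_dtypes(exclude="number").columns.append(pd.Index(CODE_COLUMNS))
# Everything numeric is a measurement, minus the codes and the id.
num_cols = toy.select_dtypes(include="number").columns.drop(CODE_COLUMNS + ID_COLUMNS)

print(list(cat_cols))
print(list(num_cols))
```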

## Why is visualization important?

First approach to exploring a dataset: taking summary statistics

In [None]:
X_train[X_num].describe()

### Numbers don't tell the entire story

#### Example 1: Anscombe's quartet

In [None]:
df_quartet = sns.load_dataset("anscombe")

Summary statistics. Look at the mean and standard deviation for each of these datasets.

In [None]:
(df_quartet
     [df_quartet["dataset"] == "I"]
     [["x", "y"]]
     .describe()
)

In [None]:
(df_quartet
     [df_quartet["dataset"] == "II"]
     [["x", "y"]]
     .describe()
)

In [None]:
(df_quartet
     [df_quartet["dataset"] == "III"]
     [["x", "y"]]
     .describe()
)

In [None]:
(df_quartet
     [df_quartet["dataset"] == "IV"]
     [["x", "y"]]
     .describe()
)
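
The four cells above can also be collapsed into a single grouped summary, which makes the comparison easier to read side by side. The helper name `summarize_by_dataset` and the toy frame are made up for illustration; the function only assumes columns called `dataset`, `x`, and `y`, as in `df_quartet`.

```python
import pandas as pd

def summarize_by_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of x and y for each value of `dataset`."""
    return df.groupby("dataset")[["x", "y"]].agg(["mean", "std"])

# Tiny made-up frame with the same column layout as the quartet.
toy = pd.DataFrame({
    "dataset": ["A", "A", "A", "B", "B", "B"],
    "x": [1.0, 2.0, 3.0, 0.0, 2.0, 4.0],
    "y": [2.0, 4.0, 6.0, 4.0, 4.0, 4.0],
})

summary = summarize_by_dataset(toy)
print(summary)
```

Calling `summarize_by_dataset(df_quartet)` should produce one row per dataset, so near-identical means and standard deviations line up in the same column.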

Visualization

In [None]:
sns.lmplot(
    x="x", 
    y="y", 
    col="dataset", 
    hue="dataset", 
    data=df_quartet,
    col_wrap=2, 
    ci=None, 
    palette="rocket", 
    height=4,
    scatter_kws={"s": 50, "alpha": 0.5}
)

#### Example 2: Datasaurus Dozen

An even more extreme example, where every dataset has the same summary statistics (to two decimal places)!

![Datasaurus dozen](https://d2f99xq7vri1nk.cloudfront.net/AllDinosGrey_1.png)

*Justin Matejka, George Fitzmaurice, Autodesk Research, 2017* | [link](https://www.autodesk.com/research/publications/same-stats-different-graphs)

## Statistical plots for the DrivenData Water Table competition

Now, we'll visualize the data from the DrivenData Water Table competition. We'd like to get a feel for our data before we dive into modeling.

First, we'll run pandas' `infer_objects`, which tries to infer a better data type (int, string, etc.) for each object column. In pandas, data types are called `dtypes`. Then, we'll run `.head(n)` to look at the first `n` rows of our data.

In [None]:
X_train = X_train.infer_objects()
X_train.head(5)

### 0. Missing Values

Real-world datasets are rarely perfect; very often they have missing entries. We'd like to know how many entries are missing from each column (to determine whether there is enough data to use), the number of rows with missing data, and the percentage of data that is missing. If very little data is missing, we could simply drop the rows with any missing value. If a lot is missing, however, we'll have to think about how to handle the missing values.

In [None]:
fig, ax = plt.subplots(figsize = (25, 10))
missing_count = X_train.isnull().sum()
ax = sns.barplot(x=missing_count.index, y=missing_count.values)
ax.xaxis.set_tick_params(rotation=45)
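
The bar plot covers the per-column counts. The other two quantities mentioned above, the percentage missing and the number of rows with any missing value, can be computed as in the sketch below. The toy frame is made up; on the real data, `toy` would be replaced by `X_train`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps.
toy = pd.DataFrame({
    "funder": ["a", None, "c", "d"],
    "gps_height": [120.0, np.nan, 45.0, 10.0],
    "population": [100, 250, 0, 30],
})

is_missing = toy.isnull()

# Count and percentage of missing entries in each column.
missing_per_column = is_missing.sum()
missing_pct = is_missing.mean() * 100

# Number of rows with at least one missing entry.
rows_with_missing = int(is_missing.any(axis=1).sum())

print(pd.DataFrame({"missing": missing_per_column, "percent": missing_pct}))
print("rows with at least one missing value:", rows_with_missing)
```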

### 1. Distributions

As you can see from our data above, we only have a few columns that are numeric. For the non-numeric columns, we can use Pandas `.value_counts()` to get the number of rows with each unique value. For example:

In [None]:
X_train['funder'].value_counts()

This gives us the number of rows with the Government of Tanzania as the funder, and so on. Below, we show the distribution of `latitude`.

#### 1.1 Numeric

Use a histogram

In [None]:
sns.displot(X_train, x="latitude", kind="hist")

#### 1.2 Numeric | Numeric

Use a joint distribution plot

In [None]:
sns.displot(X_train, x="latitude", y="longitude")

#### 1.3 Numeric | Categorical

Use a boxplot or a color-coded bivariate scatter plot

In [None]:
fig, ax = plt.subplots(figsize = (10,10))
sns.boxplot(data = X_train, x="status_group", y="population")
ax.xaxis.set_tick_params(rotation=45)
ax.set_yscale("log")

In [None]:
fig, ax = plt.subplots(figsize = (10,10))
ax = sns.scatterplot(data = X_train, x="amount_tsh", y="population", hue="status_group")
ax.set_yscale("log")

#### 1.4 Categorical

Use value count plots
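
A minimal sketch of a value count plot with `sns.countplot`. The toy column and its values are made up. Because columns like `funder` can have many distinct values, the sketch keeps only the most frequent ones (the cutoff of 2 is an arbitrary assumption) and orders the bars by frequency.

```python
import pandas as pd
import seaborn as sns

# Made-up categorical column.
toy = pd.DataFrame({"funder": ["a", "b", "a", "c", "a", "b"]})

# Most frequent categories, in descending order of count.
top_values = toy["funder"].value_counts().nlargest(2).index.tolist()

# Plot only those categories, with the bars in the same order.
ax = sns.countplot(
    data=toy[toy["funder"].isin(top_values)],
    x="funder",
    order=top_values,
)
ax.set_title("Rows per funder (top 2)")
```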

#### 1.5 Categorical | Categorical
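
One common option for two categorical variables is a contingency table drawn as a heatmap. The sketch below uses `pd.crosstab` with `normalize="index"`, so each row shows the share of each status within one category. The `water_source` column and all values are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Made-up data: one categorical feature and the target.
toy = pd.DataFrame({
    "water_source": ["river", "river", "well", "well", "river", "well"],
    "status_group": ["functional", "functional", "non functional",
                     "non functional", "non functional", "non functional"],
})

# Share of each status within each water source (every row sums to 1).
table = pd.crosstab(toy["water_source"], toy["status_group"], normalize="index")

ax = sns.heatmap(table, annot=True, vmin=0, vmax=1, cmap="rocket")
```

Normalizing by row keeps rare categories readable; dropping `normalize` would show raw counts instead.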

#### 1.6 Global view of data

In [None]:
sns.pairplot(X_train[X_num])

### 2. Miscellaneous

#### 2.1 Summarizing relations


##### 2.1.1 Pearson correlation
Here, we visualize the Pearson correlation between all numeric columns. Take this with a grain of salt: Pearson correlation only measures linear association, and not every integer-typed column is a genuine numeric quantity, so some of these coefficients carry little meaning.

In [None]:
sns.heatmap(X_train[X_num].corr(), cmap="rocket", vmin = -1, vmax=1)
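
To see why the coefficient deserves caution, the sketch below compares Pearson correlation with Spearman rank correlation (a different measure, brought in here only for contrast) on made-up data where `y` grows exponentially with `x`. The relationship is perfectly monotonic, yet Pearson reports a noticeably weaker association because it is not linear.

```python
import numpy as np
import pandas as pd

# Made-up data: y is a deterministic, strongly nonlinear function of x.
x = np.linspace(0, 10, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```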

##### 2.1.2 Mutual information

In [None]:
import sklearn
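
A minimal sketch of estimating mutual information with scikit-learn's `mutual_info_classif`, which scores how much each feature tells us about a class label. The data is synthetic: the `informative` column determines the target and the `noise` column is unrelated to it. On the real data, the categorical columns would need to be encoded as numbers first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: one feature that decides the label, one that is pure noise.
rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
toy = pd.DataFrame({
    "informative": informative,
    "noise": rng.normal(size=n),
})
target = (informative > 0).astype(int)

# Estimated mutual information between each feature and the target.
mi = pd.Series(
    mutual_info_classif(toy, target, random_state=0),
    index=toy.columns,
).sort_values(ascending=False)
print(mi)
```

Unlike Pearson correlation, mutual information can also pick up nonlinear dependence, which is why it is a useful complement to the heatmap above.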

#### 2.2 Data transformations

Log or polar transformations may reveal extra structure.
For instance, it's often good to use a log scale on heavy-tailed variables that
roughly follow a power law (e.g. population or income).
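
A small sketch of the log case, on synthetic data standing in for a heavy-tailed column such as `population`. It uses `np.log1p`, which computes `log(1 + x)` and so stays defined when a value is exactly 0. The distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed values: most are small, a few are very large.
rng = np.random.default_rng(0)
population = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=1000))

# log1p(x) = log(1 + x), safe for zero counts.
log_population = np.log1p(population)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = population.skew()
skew_after = log_population.skew()
print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

Plotting a histogram of `log_population` instead of `population` should show the bulk of the data rather than a single spike next to a long tail.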