In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Introduction

In this short chapter, we introduce basic visual checks that you can easily write for your data.
Think of these as the human-in-the-loop checks: they are hard to automate, but they still need your attention.

## Visual Diagnostics

Now that we're done with the schema checks, let's do some sanity checks on the data as well. This is my personal favourite too, as some of the activities here overlap with the early stages of exploratory data analysis.

We're going to switch datasets here, and move to a 'corrupted' version of the Boston Economic Indicators dataset. Its file path is: `./data/boston_ei-corrupt.csv`.

In [None]:
import pandas as pd
import seaborn as sns
sns.set_style('white')
%matplotlib inline

df = pd.read_csv('data/boston_ei-corrupt.csv')
df.head()

### Demo: Visual Diagnostics

We can use a package called `missingno`, which gives us a quick visual view of the completeness of the data. This is a good starting point for deciding whether you need to manually comb through the data or not.

In [None]:
# First, we check for missing data.
import missingno as msno
msno.matrix(df)

Immediately it's clear that there are a number of rows with empty values! Nothing beats a quick visual check like this one.

We can get a table version of this using another package called `pandas_summary`. 

In [None]:
# We can do the same using pandas-summary.
from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)
dfs.summary()

`dfs.summary()` returns a Pandas DataFrame; this means we can write tests for data completeness!
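As a minimal sketch of what such a completeness test could look like, the block below uses plain pandas (`isnull().mean()`) rather than the `pandas_summary` output, so it does not depend on the exact layout of that summary table. The helper names `missing_fraction` and `check_completeness`, the `max_missing` threshold, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Return the fraction of missing values in each column."""
    return df.isnull().mean()


def check_completeness(df, max_missing=0.0):
    """Raise an AssertionError if any column exceeds the allowed missing fraction."""
    frac = missing_fraction(df)
    offenders = frac[frac > max_missing]
    assert offenders.empty, f"Columns with too much missing data: {offenders.to_dict()}"


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
print(missing_fraction(toy))
```

Calling `check_completeness(toy)` would fail on column `a`, while `check_completeness(toy, max_missing=0.5)` would pass. Which threshold is acceptable is a judgment call that depends on your dataset.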

## Distributions

Most of what follows is a demonstration of tools that may be useful to you. Feel free to take a break from coding, as these checks aren't obviously automatable.

### Numerical Data

We can take the EDA portion further by plotting the empirical cumulative distribution of each data column.

In [None]:
import numpy as np
def compute_dimensions(length):
    """
    Given an integer, compute the "square-est" pair of dimensions for plotting.
    
    Examples:
    - length: 17 => rows: 4, cols: 5
    - length: 14 => rows: 4, cols: 4
    
    This is a utility function; can be tested separately.
    """
    sqrt = np.sqrt(length)
    floor = int(np.floor(sqrt))
    ceil = int(np.ceil(sqrt))
    
    if floor ** 2 >= length:
        return (floor, floor)
    elif floor * ceil >= length:
        return (floor, ceil)
    else:
        return (ceil, ceil)
    
compute_dimensions(length=17)

assert compute_dimensions(17) == (4, 5)
assert compute_dimensions(16) == (4, 4)
assert compute_dimensions(15) == (4, 4)
assert compute_dimensions(11) == (3, 4)

In [None]:
# Next, let's visualize the empirical CDF for each column of data.
import matplotlib.pyplot as plt

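def empirical_cumdist(data, ax, title=None):
    """
    Plots the empirical cumulative distribution of values.
    """
    # Drop missing values so they don't inflate the denominator of the ECDF.
    data = pd.Series(data).dropna()
    x, y = np.sort(data), np.arange(1, len(data)+1) / len(data)
    ax.scatter(x, y)
    ax.set_title(title)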
    
data_cols = [i for i in df.columns if i not in ['Year', 'Month']]
n_rows, n_cols = compute_dimensions(len(data_cols))

fig = plt.figure(figsize=(n_cols*3, n_rows*3))
from matplotlib.gridspec import GridSpec
gs = GridSpec(n_rows, n_cols)
for i, col in enumerate(data_cols):
    ax = plt.subplot(gs[i])
    empirical_cumdist(df[col], ax, title=col)
    
plt.tight_layout()
plt.show()

It's often a good idea to **standardize** numerical data (that aren't count data). The term **standardize** often refers to the statistical procedure of subtracting the mean and dividing by the standard deviation, yielding an empirical distribution of data centered on 0 and having standard deviation of 1.

### Exercise

Write a test for a function that standardizes a column of data. Then, write the function.

**Note:** This function is also implemented in the `scikit-learn` library as part of their `preprocessing` module. However, if you decide that you don't want to import an entire library just to use one function, you can re-implement it on your own.

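```python
def standard_scaler(x):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - x.mean()) / x.std()

def test_standard_scaler():
    # Generate synthetic data inside the test so that it needs no arguments
    # and can be collected by a test runner such as pytest.
    x = np.random.normal(loc=5, scale=3, size=1000)
    std = standard_scaler(x)
    assert np.allclose(std.mean(), 0)
    assert np.allclose(std.std(), 1)
```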
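One subtlety worth knowing when testing a standardizing function: pandas and NumPy use different defaults for the standard deviation. `Series.std()` defaults to `ddof=1` (sample standard deviation), while `ndarray.std()` defaults to `ddof=0` (population standard deviation). The sketch below, on made-up toy values, shows that the "standard deviation equals 1" check only holds when the same convention is used for scaling and for checking.

```python
import numpy as np
import pandas as pd

# Toy values chosen only for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas: Series.std() defaults to ddof=1 (sample standard deviation).
z_pandas = (x - x.mean()) / x.std()

# NumPy: ndarray.std() defaults to ddof=0 (population standard deviation).
arr = x.to_numpy()
z_numpy = (arr - arr.mean()) / arr.std()

print(z_pandas.std())                 # scaled and measured with ddof=1
print(z_numpy.std())                  # scaled and measured with ddof=0
print(np.std(z_pandas.to_numpy()))    # scaled with ddof=1, measured with ddof=0
```

The first two printed values are 1, while the third is `sqrt(3/4)`, because the scaling and the check disagree on `ddof`. For long columns the difference is small, but a strict equality test can still trip over it.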

### Exercise

Now, plot the grid of standardized values.

In [None]:
data_cols = [i for i in df.columns if i not in ['Year', 'Month']]
n_rows, n_cols = compute_dimensions(len(data_cols))

fig = plt.figure(figsize=(n_cols*3, n_rows*3))
from matplotlib.gridspec import GridSpec
gs = GridSpec(n_rows, n_cols)
for i, col in enumerate(data_cols):
    ax = plt.subplot(gs[i])
    empirical_cumdist(standard_scaler(df[col]), ax, title=col)
    
plt.tight_layout()
plt.show()

### Exercise

Did we just copy/paste the function?! It's time to stop doing this. Let's refactor the code into a function that can be called.
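One possible refactoring, offered as a sketch rather than the definitive answer, is shown below. The function name `plot_ecdf_grid` and its `transform` parameter are assumptions made for this example. To keep the block self-contained, it computes the grid size inline (a stand-in for `compute_dimensions`) and draws the ECDF directly (a stand-in for `empirical_cumdist`); in the notebook you would call the existing helpers instead. The toy DataFrame holds made-up values.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_ecdf_grid(df, columns, transform=None):
    """
    Plot one empirical CDF per column on a near-square grid.

    `transform` is an optional function applied to each column before
    plotting (for example, a standardizing function).
    """
    # Stand-in for compute_dimensions: a near-square grid that fits all columns.
    n_cols = math.ceil(math.sqrt(len(columns)))
    n_rows = math.ceil(len(columns) / n_cols)
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(n_cols * 3, n_rows * 3), squeeze=False
    )
    flat_axes = axes.flatten()
    for ax, col in zip(flat_axes, columns):
        data = df[col].dropna()
        if transform is not None:
            data = transform(data)
        # Stand-in for empirical_cumdist: sorted values against cumulative fraction.
        x = np.sort(data)
        y = np.arange(1, len(x) + 1) / len(x)
        ax.scatter(x, y)
        ax.set_title(col)
    # Hide any unused panels at the end of the grid.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame(
    {"a": [3.0, 1.0, 2.0], "b": [10.0, 30.0, 20.0], "c": [5.0, 5.0, 6.0]}
)
fig = plot_ecdf_grid(toy, ["a", "b", "c"], transform=lambda s: (s - s.mean()) / s.std())
```

With this in place, the raw and standardized panels differ only in the `transform` argument, so the plotting loop lives in exactly one place.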

### Categorical Data

For categorical-type data, we can plot the empirical distribution as well. (This example uses the `smartphone_sanitization.csv` dataset.)

In [None]:
from collections import Counter

def empirical_catdist(data, ax, title=None):
    """
    Plots the empirical distribution (counts) of categorical values.
    """
    d = Counter(data)
    print(d)
    x = range(len(d))
    labels = list(d.keys())
    y = list(d.values())
    ax.bar(x, y)
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.set_title(title)

smartphone_df = pd.read_csv('data/smartphone_sanitization.csv')
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
empirical_catdist(smartphone_df['site'], ax=ax)
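Unlike the plots, some checks on categorical data can be automated. The sketch below uses pandas' `value_counts` to tabulate counts and relative frequencies, and a small helper that flags labels outside an expected set. The labels, the `allowed` set, and the helper name `check_allowed_categories` are hypothetical; the actual values in the `site` column are not shown in this chapter.

```python
import pandas as pd

# Hypothetical labels standing in for a categorical column.
sites = pd.Series(["phone", "phone", "case", "phone", "case", "screen"])

counts = sites.value_counts()                    # absolute counts per label
fractions = sites.value_counts(normalize=True)   # relative frequencies per label


def check_allowed_categories(values, allowed):
    """Raise an AssertionError if any label falls outside the allowed set."""
    unexpected = set(values.dropna().unique()) - set(allowed)
    assert not unexpected, f"Unexpected categories: {sorted(unexpected)}"


check_allowed_categories(sites, allowed={"phone", "case", "screen"})
print(counts)
print(fractions)
```

A check like this catches typos and stray labels that are easy to miss in a bar chart with many categories.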

# Statistical Checks

- Report on deviations from normality.

## Normality?!

- The Gaussian (Normal) distribution is commonly assumed in downstream statistical procedures, e.g. outlier detection.
- We can test for normality by using a K-S test.

## K-S test

From Wikipedia:

> In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). It is named after Andrey Kolmogorov and Nikolai Smirnov.

![](https://upload.wikimedia.org/wikipedia/commons/c/cf/KS_Example.png)

In [None]:
from scipy.stats import ks_2samp
import numpy.random as npr

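# Simulate a standard normal distribution (mean 0, standard deviation 1) with 10000 draws.
normal_rvs = npr.normal(size=10000)
# Standardize the column first; otherwise the test mostly detects the difference in location and scale.
col_data = standard_scaler(df['labor_force_part_rate'].dropna())
result = ks_2samp(normal_rvs, col_data)
# True means we reject the null hypothesis that both samples come from the same distribution.
result.pvalue < 0.05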
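The cell above uses the two-sample test against simulated draws. As an alternative sketch, the one-sample variant mentioned in the Wikipedia quote compares a sample directly against the standard normal CDF via `scipy.stats.kstest`, which avoids the extra randomness of simulating a reference sample. The samples below are synthetic and the helper name `standardize` is an assumption for this example. Note that when the mean and standard deviation are estimated from the same data, the resulting p-values tend to be conservative, so treat them as a rough guide.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)


def standardize(x):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - x.mean()) / x.std()


# Synthetic samples: one drawn from a normal distribution, one clearly skewed.
normal_sample = standardize(rng.normal(loc=10, scale=2, size=2000))
skewed_sample = standardize(rng.exponential(scale=1.0, size=2000))

# One-sample K-S test against the standard normal distribution.
normal_result = kstest(normal_sample, "norm")
skewed_result = kstest(skewed_sample, "norm")

print(normal_result.statistic, normal_result.pvalue)
print(skewed_result.statistic, skewed_result.pvalue)
```

The skewed sample should give a much larger K-S statistic (and a very small p-value) than the normal one, which is the pattern to look for in real columns.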

In [None]:
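fig = plt.figure()
ax = fig.add_subplot(111)
empirical_cumdist(normal_rvs, ax=ax)
# Standardize so that the column is on the same scale as the standard normal draws.
empirical_cumdist(standard_scaler(df['hotel_occup_rate'].dropna()), ax=ax)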

## Exercise

Re-create the panel of cumulative distribution plots, this time adding on the Normal distribution, and annotating the p-value of the K-S test in the title.

In [None]:
data_cols = [i for i in df.columns if i not in ['Year', 'Month']]
n_rows, n_cols = compute_dimensions(len(data_cols))

fig = plt.figure(figsize=(n_cols*3, n_rows*3))
from matplotlib.gridspec import GridSpec
gs = GridSpec(n_rows, n_cols)
for i, col in enumerate(data_cols):
    ax = plt.subplot(gs[i])
    # Drop missing values before standardizing so that NaNs don't distort the K-S test.
    standardized = standard_scaler(df[col].dropna())
    test = ks_2samp(normal_rvs, standardized)
    empirical_cumdist(normal_rvs, ax)
    empirical_cumdist(standardized, ax, title=f"{col}, p={round(test.pvalue, 2)}")
    
plt.tight_layout()
plt.show()