<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Initial-Setup" data-toc-modified-id="Initial-Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Initial Setup</a></span><ul class="toc-item"><li><span><a href="#Loading-data-into-a-DataFrame" data-toc-modified-id="Loading-data-into-a-DataFrame-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading data into a DataFrame</a></span></li></ul></li><li><span><a href="#The-Boxplot-(and-friends)" data-toc-modified-id="The-Boxplot-(and-friends)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Boxplot (and friends)</a></span><ul class="toc-item"><li><span><a href="#Exercise:-What-can-we-say-about-these-two-distributions?" data-toc-modified-id="Exercise:-What-can-we-say-about-these-two-distributions?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><em>Exercise</em>: What can we say about these two distributions?</a></span></li></ul></li><li><span><a href="#The-Histogram" data-toc-modified-id="The-Histogram-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The Histogram</a></span><ul class="toc-item"><li><span><a href="#Exercise:-Compare-bp_young-and-bp_old-using-histograms-to-highlight-shape-differences." data-toc-modified-id="Exercise:-Compare-bp_young-and-bp_old-using-histograms-to-highlight-shape-differences.-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span><em>Exercise:</em> Compare bp_young and bp_old using histograms to highlight shape differences.</a></span></li><li><span><a href="#Binning-Decisions" data-toc-modified-id="Binning-Decisions-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Binning Decisions</a></span></li><li><span><a href="#Statistical-Error" data-toc-modified-id="Statistical-Error-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Statistical Error</a></span></li></ul></li></ul></div>

# Univariate Distributions 
We'll be looking at univariate features in this course, focusing on histograms, boxplots, and other methods.  The data we'll be investigating comes from the Pima Indians Diabetes dataset (an important study for diabetes research, especially in the context of pregnancy).

## Initial Setup
We load our packages and our data, then check that everything looks right.

In [None]:
%matplotlib inline 

In [None]:
# Basic imports
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10,8)
plt.rcParams['font.size'] = (16)

import seaborn as sns

import holoviews as hv
from holoviews import opts
import hvplot
import hvplot.pandas

# pd.options.plotting.backend = 'holoviews'

hv.extension('bokeh', 'matplotlib', width="600")


### Loading data into a DataFrame
This time around, we're starting with a CSV file (a pickle alternative is noted in the comment):

In [None]:
df = pd.read_csv('../data_sets/diabetes.csv')
# or use pd.read_pickle('/pghbio/dbmi/batmanlab/bpollack/data_course/data_sets/rats.p')
df.head() # View the first five rows

## The Boxplot (and friends)
The boxplot is one of the simplest methods of visualizing univariate data.  Is it always a useful tool?

In [None]:
# Make two groups by age and plot their blood-pressure distributions with a standard boxplot.
# 'outliers': Age < 22 or Age > 60; 'inliers': 30 < Age < 40.  Rows with BloodPressure <= 0 are excluded.
outliers = np.concatenate([
    df.query('Age<22 and BloodPressure>0').BloodPressure.values,
    df.query('Age>60 and BloodPressure>0').BloodPressure.values,
])
inliers = df.query('30<Age<40 and BloodPressure>0').BloodPressure.values
_ = plt.boxplot([outliers, inliers], patch_artist=True, labels=['outliers', 'inliers'])
plt.xticks(fontsize=20)
plt.grid(True)
plt.title('Boxplots of Blood Pressure for Age Inliers and Outliers', fontsize=18)
plt.ylabel('Blood Pressure', fontsize=18)
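
To make explicit what a boxplot actually draws, here is a minimal sketch of the usual Tukey-style summary: the quartiles, the interquartile range, and whiskers that reach the furthest points within 1.5 IQR of the box. The helper name `boxplot_summary` and the synthetic `fake_bp` array are hypothetical stand-ins and are not taken from the dataset.

```python
import numpy as np

def boxplot_summary(values, whis=1.5):
    """Return the quantities a standard (Tukey-style) boxplot displays."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Points beyond these fences are drawn individually as fliers
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3, 'iqr': iqr,
        # Whiskers end at the most extreme data points inside the fences
        'whisker_low': inside.min(), 'whisker_high': inside.max(),
        'n_fliers': int(values.size - inside.size),
    }

# Synthetic stand-in for a blood-pressure column (assumption, not the real data)
rng = np.random.default_rng(0)
fake_bp = rng.normal(loc=70, scale=12, size=500)
summary = boxplot_summary(fake_bp)
print(summary)
```

Everything else about the shape of the distribution (for example, whether it has one peak or two) is invisible in these few numbers, which is one reason a boxplot is not always enough.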

### *Exercise*: What can we say about these two distributions?
Using only 'inliers' and 'outliers', how else can we compare them?

In [None]:
# Visualize the distributions

## The Histogram
Histograms are the workhorse of univariate visualization.  They can be used in a number of different ways to dissect and explore your data.  Still, one must understand the potential limitations of a histogram to prevent misleading conclusions or inaccurate comparisons.

In [None]:
# Before we play with histograms, let's first bin our dataframe to reflect our earlier analysis.
bp_young = df.query('Age<22 and BloodPressure>0').BloodPressure
bp_old = df.query('Age>60 and BloodPressure>0').BloodPressure
bp_mid = df.query('30<Age<40 and BloodPressure>0').BloodPressure
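
One common source of inaccurate comparisons is overlaying histograms of groups with very different sizes or different bin edges. The sketch below, which uses synthetic arrays (`small_group` and `large_group` are hypothetical names, not the real data), shows one way to put two groups on a common footing: shared bin edges plus density normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for two groups of different size (assumption, not the real data)
small_group = rng.normal(loc=65, scale=10, size=60)
large_group = rng.normal(loc=75, scale=12, size=600)

# Shared bin edges so that the two histograms line up bin for bin
pooled = np.concatenate([small_group, large_group])
edges = np.histogram_bin_edges(pooled, bins=15)

# density=True rescales each histogram so that its area is 1
dens_small, _ = np.histogram(small_group, bins=edges, density=True)
dens_large, _ = np.histogram(large_group, bins=edges, density=True)

# Check: bar height times bin width sums to 1 for both groups
widths = np.diff(edges)
print((dens_small * widths).sum(), (dens_large * widths).sum())
```

The same idea carries over to plotting: `plt.hist` accepts `bins=edges` and `density=True`, so shapes can be compared without the larger group dominating the picture.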

### *Exercise:* Compare bp_young and bp_old using histograms to highlight shape differences.

In [None]:
# make histogram comparison here 


### Binning Decisions
The number and size of bins in a histogram have a large effect on what we see.  How can we choose a binning that accurately captures the underlying data distribution?

In [None]:
# Plot BloodPressure, varying the binning method
for n in [1,5,10,20,50,100,'auto']:
    plt.figure()
    df.BloodPressure.hist(bins=n)
    plt.title(f'bins={n}')
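
Besides a fixed bin count, NumPy offers several named rules for choosing bin widths from the data itself, and `'auto'` is one of them. As a rough sketch, the snippet below counts how many bins a few of these rules produce. The array `fake_bp` is synthetic and stands in for the real column, so the exact numbers will differ on the actual data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in for a blood-pressure column (assumption, not the real data)
fake_bp = rng.normal(loc=70, scale=12, size=500)

n_bins = {}
for rule in ['sqrt', 'sturges', 'scott', 'fd', 'auto']:
    # histogram_bin_edges returns the edges only, so there is one fewer bin than edges
    edges = np.histogram_bin_edges(fake_bp, bins=rule)
    n_bins[rule] = len(edges) - 1
    print(f'{rule:>8}: {n_bins[rule]} bins')
```

No single rule is right for every dataset. Comparing a few of them is a cheap way to see whether a feature in the histogram is robust or an artifact of the binning.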

### Statistical Error
The number of entries in a histogram is directly related to its statistical power.  If we assume the count in each bin arises from a Poisson process, then the statistical error on bin $i$ can be approximated as $\sqrt{n_i}$, where $n_i$ is the number of entries in that bin. See 


In [None]:
from skhep.visual.mpl_plotter import MplPlotter as skplt

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,8))
skplt.hist(df.BloodPressure[0:100], alpha=0.5, bins=15, errorbars=True, err_style='line', ax=ax[0])
ax[0].set_title('Hist With Errorbars (few data)')
ax[0].set_ylabel('Counts')
ax[0].set_xlabel('Blood Pressure')
skplt.hist(df.BloodPressure, alpha=0.5, bins=15, errorbars=True, err_style='line', ax=ax[1], color='C1')
ax[1].set_title('Hist With Errorbars (more data)')
ax[1].set_ylabel('Counts')
ax[1].set_xlabel('Blood Pressure')
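
The same $\sqrt{n_i}$ error bars can be computed with NumPy alone, which may be useful if `skhep.visual` is not available in your environment. This is a minimal sketch under the Poisson assumption stated above. The helper `poisson_errors` and the synthetic `fake_bp` array are hypothetical and not part of the course material.

```python
import numpy as np

def poisson_errors(values, bins=15):
    """Histogram the values and attach sqrt(n_i) errors to each bin."""
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    errors = np.sqrt(counts)  # Poisson approximation: sigma_i = sqrt(n_i)
    return centers, counts, errors

# Synthetic stand-in for a blood-pressure column (assumption, not the real data)
rng = np.random.default_rng(3)
fake_bp = rng.normal(loc=70, scale=12, size=1000)

c_few, n_few, e_few = poisson_errors(fake_bp[:100])  # few entries
c_all, n_all, e_all = poisson_errors(fake_bp)        # more entries

# Relative error of the most populated bin, equal to 1 / sqrt(n_max)
rel_few = e_few.max() / n_few.max()
rel_all = e_all.max() / n_all.max()
print(f'relative error, few data:  {rel_few:.3f}')
print(f'relative error, more data: {rel_all:.3f}')
```

The absolute error grows with the count, but the relative error shrinks as $1/\sqrt{n_i}$, which is why the histogram with more data looks smoother. The arrays can be drawn with `plt.errorbar(centers, counts, yerr=errors, fmt='o')`.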