# Assignment 3 - Building a Custom Visualization

---

In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. 

**Easiest option:** Implement the bar coloring as described above: a color scale with only three colors (e.g. blue, white, and red). Assume the user provides the y-axis value of interest as a parameter or variable.


**Harder option:** Implement the bar coloring as described in the paper, where the color of the bar is actually based on the amount of data covered (e.g. a gradient ranging from dark blue for the distribution being certainly below this y-axis, to white if the value is certainly contained, to dark red if the value is certainly not contained as the distribution is above the axis).

**Even Harder option:** Add interactivity to the above, which allows the user to click on the y axis to set the value of interest. The bar colors should change with respect to what value the user has selected.

**Hardest option:** Allow the user to interactively set a range of y values they are interested in, and recolor based on this (e.g. a y-axis band, see the paper for more details).

---

*Note: The data given for this assignment is not the same as the data used in the article and as a result the visualizations may look a little different.*

# Solution

**Student:** Ilya Rusin (irusin@protonmail.ch), Twitter  [@rusin](https://twitter.com/rusin)

**Date:** 15 July 2017

## I chose the Hardest option

In [1]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.widgets import SpanSelector, Cursor
import matplotlib.cm as cm
import matplotlib.colors as col

import pandas as pd
import numpy as np
from scipy.stats import norm

## Some theory about stats

Specifically, if $\sigma$ is the standard deviation of the distribution of the sample, then $\frac{\sigma}{\sqrt{N}}$ is the standard deviation of the mean.

$$ \sigma_{\textrm{mean}_i} = \frac{\sigma_{\textrm{distribution}_i}}{\sqrt{N_i}} $$

The mean is normally distributed (at least for large $N$) for any reasonable distribution by the _Central Limit Theorem_, and hence the 95% confidence interval of the mean is: 
$$ CI_i = \mu_i\pm1.96*\sigma_{\textrm{mean}_i} $$ 

The standard deviation of the mean is also often called the standard error.

### So the formula for Confidence Interval is

$$ CI_i = \mu_i\pm1.96*\frac{\sigma_{\textrm{distribution}_i}}{\sqrt{N_i}} $$ 
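The formula above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `mean_confidence_interval` and the random sample are assumptions made for the example, not part of the assignment code.

```python
import numpy as np

def mean_confidence_interval(sample, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% CI of the mean."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    # sample standard deviation (ddof=1), the same estimator pandas' describe() reports
    sigma = sample.std(ddof=1)
    # standard error = standard deviation of the mean
    standard_error = sigma / np.sqrt(n)
    half_width = z * standard_error
    return mean, mean - half_width, mean + half_width

# hypothetical sample, drawn only to exercise the helper
rng = np.random.default_rng(12345)
sample = rng.normal(32000, 200000, 3650)
mean, lower, upper = mean_confidence_interval(sample)
print(f"mean = {mean:,.0f}, 95% CI = ({lower:,.0f}, {upper:,.0f})")
```

The half-width `z * standard_error` is the quantity that a bar chart would typically draw as the error bar.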



## Main algorithm

We need to evaluate the probability $F(X)$ that a distribution's value falls within a range $\left(x_{\textrm{min}};x_{\textrm{max}}\right)$, that is, the probability that the distribution represented by the error bar is contained in the selected region. This probability is then mapped to the color of the corresponding bar.

A probability close to 1 means the distribution lies almost entirely inside the range; a probability close to 0 means it lies almost entirely outside.
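As a rough sketch of the probability-to-color step, a probability in $[0, 1]$ can be passed straight to a matplotlib colormap. The light-gray-to-dark-red ramp matches the one the class in this notebook builds; the helper name `color_for_probability` and the clamping are assumptions added for illustration.

```python
import matplotlib.colors as col

# two-color ramp: light gray for probability 0, dark red for probability 1
prob_cmap = col.LinearSegmentedColormap.from_list('mycolors', ['lightgray', 'darkred'])

def color_for_probability(p):
    """Map a probability to an RGBA tuple, clamping it to [0, 1] first."""
    p = min(max(float(p), 0.0), 1.0)
    return prob_cmap(p)

for p in (0.0, 0.5, 1.0):
    print(p, col.to_hex(color_for_probability(p)))
```

The returned RGBA tuple can be handed to a bar's `set_color` method.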


As [Wikipedia](https://en.wikipedia.org/wiki/Cumulative_distribution_function) says:
>In probability theory and statistics, the **cumulative distribution function (CDF)** of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.

>![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/320px-Normal_Distribution_CDF.svg.png)

>Cumulative distribution functions are also used to specify the distribution of multivariate random variables.

The probability that $X$ lies in the semi-closed interval $(a, b]$, where $a < b$, is therefore:

$$ P(a < X \le b) = F_X(b) - F_X(a) $$
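
As a quick sanity check (a worked example added for illustration, using rounded standard normal table values), take a standard normal variable $Z$ with CDF $\Phi$ and the interval $(-1.96, 1.96]$:

$$ P(-1.96 < Z \le 1.96) = \Phi(1.96) - \Phi(-1.96) \approx 0.975 - 0.025 = 0.95 $$

This is where the factor $1.96$ in the 95% confidence interval comes from.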

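Since we are interested in a range, this probability can also be written as the integral of the probability density function $f_X$ of the continuous random variable $X$ over that range:
$$ P(x_{\textrm{min}} < X \le x_{\textrm{max}}) = F_X(x_{\textrm{max}}) - F_X(x_{\textrm{min}}) = \int_{x_{\textrm{min}}}^{x_{\textrm{max}}}f_X(t)\,dt $$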

SciPy provides the function **scipy.stats.rv_continuous.cdf**, the cumulative distribution function of the given RV:

>```stats.rv_continuous.cdf(x, *args, **kwds)```

> or simply ```norm.cdf```

For a range:
>```F_X = norm.cdf(xmax, mean, sigma) - norm.cdf(xmin, mean, sigma)```
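
A minimal sketch of the range computation, cross-checked against numerical integration of the density. The helper name `range_probability` and the numbers used are hypothetical and chosen only for illustration.

```python
from scipy.stats import norm
from scipy.integrate import quad

def range_probability(xmin, xmax, mean, sigma):
    """P(xmin < X <= xmax) for X ~ Normal(mean, sigma)."""
    return norm.cdf(xmax, mean, sigma) - norm.cdf(xmin, mean, sigma)

# hypothetical mean, standard deviation and range of interest
mean, sigma = 40000.0, 2000.0
xmin, xmax = 38000.0, 43000.0

# difference of two CDF values
p_cdf = range_probability(xmin, xmax, mean, sigma)
# the same probability as the integral of the PDF over the range
p_int, _ = quad(norm.pdf, xmin, xmax, args=(mean, sigma))

print(f"CDF difference: {p_cdf:.4f}, numerical integral: {p_int:.4f}")
```

Both routes should agree up to numerical tolerance; the CDF difference is cheaper and is the form used for recoloring.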

In [6]:
class iProbabilisticBarChart:
    """
    A base class that can be used for creating clickable probabilistic charts and solving
    the challenges of interpreting plots with confidence intervals.
    """
    
    # basic greys: lighter for regular, darker for emphasis
    greys = ['#afafaf','#7b7b7b'] # ticks and boxes, arrows, legend ticks and text
    # horizontal bar: nice blue
    horzo_bar = '#004a80'
    # set bar colormap
    cmap = plt.get_cmap('RdBu')  # plt.get_cmap replaces cm.get_cmap, which newer matplotlib no longer provides
    n1 = 3650
    font = {'color':  'gray',
        'weight': 'normal',
        'size': 14,
        }     
    
    # instantiate the class
    def __init__(self): 
        """
        Initialize the data and a new figure.
        """
        
        # seed for data.
        np.random.seed(12345)
        # get some data to plot

        self.df = pd.DataFrame(np.c_[np.random.normal(32000, 200000, self.n1), # np.c_ class to transpose array faster
                   np.random.normal(43000, 100000, self.n1), 
                   np.random.normal(43500, 140000, self.n1), 
                   np.random.normal(48000, 70000, self.n1)], 
                  columns = [1992, 1993, 1994, 1995])
        
        self.df_stats = self.df.describe()
        
        # half-width of the 95% confidence interval of each mean: 1.96 * standard error (std / sqrt(n))
        self.means_std = [i/((self.n1)**0.5)*1.96 for i in self.df_stats.loc['std'].values]

        self.fig = plt.figure(figsize=(9, 6), dpi= 80, facecolor='w', edgecolor='k')
        self.ax1 = self.fig.add_subplot(111)
        
        # Create a linear color map running from light gray to dark red
        self.mymap = col.LinearSegmentedColormap.from_list('mycolors',['lightgray','darkred'])
        
        # Making a fake colormapping object
        # Using contourf to provide my colorbar info, then clearing the figure
        Z = [[0,0],[0,0]]
        levels = np.arange(0,1,0.11)
        self.CS3 = self.ax1.contourf(Z, levels, cmap = self.mymap)
        self.ax1.cla()        
        
        # plot the bar chart and make a reference to the rectangles
        self.rects = self.ax1.bar(
            range(len(self.df.columns)), 
            self.df_stats.loc['mean'].values,
            yerr = self.means_std,
            align='center', 
            alpha=1, 
            color=self.greys[0],
            error_kw=dict(ecolor = 'gray', lw = 2, capsize = 20, capthick = 2, elinewidth = 2))
        
        ## TICKS AND TEXT AND SPINES
        
        plt.title('Confidence Interval Interactivity:\n Hardest way. Select the Chart Range To Recolor', color=self.greys[1])
        plt.xticks(range(len(self.df.columns)), self.df.columns)
        

        
        # plot the colorbar
        self.clb = plt.colorbar(self.CS3, extendrect = False)
        self.clb.set_label('Probability', labelpad=10, y=0.45, fontdict=self.font)

        # do some formatting 
        self.formatArtists(plt.gca())        
        
        # provide text handlers:
        self.initProbTexts()

        
        # matplotlib >= 3.5 renamed span_stays to interactive and rectprops to props
        self.span = SpanSelector(self.ax1, self.onRangeSelect, 'vertical', useblit=True, interactive = True,
                    props=dict(alpha=0.5, facecolor='gray'))
        
        self.cursor = Cursor(self.ax1, vertOn = False, useblit = True, color = 'darkred', linewidth = 2)
        
    def initProbTexts(self):
        self.textProbs = [] 
        for i,rect in enumerate(self.rects):
            self.textProbs.append(self.ax1.text(rect.get_x() + rect.get_width()/2.,\
                              0.5*min(self.df_stats.loc['mean'].values),\
                              '',\
                              ha='center', va='bottom'))

      
        self.valueOfInterest = self.ax1.text(self.rects[0].get_x() + self.rects[0].get_width(),\
                                             1.15*max(self.df_stats.loc['mean'].values),\
                                             'Range of interest: select y-axis range',\
                                            fontdict=self.font)
            
        
    def printProbabilities(self, probs):
        for i, rect in enumerate(self.rects):
            self.textProbs[i].set_visible(probs[i] > 0.3)
            self.textProbs[i].set_text('$F(x) =$ {:.2f}'.format(probs[i]))
        
    def formatArtists(self, ax):
        """
        Does some recoloring and formatting of the ticks, labels, and spines.
        Receives the axes of the current figure.
        """
        # recolor the ticks
        ax.xaxis.set_tick_params(which='major', colors=self.greys[1])
        ax.yaxis.set_tick_params(which='major', colors=self.greys[1])
        
        self.clb.ax.yaxis.set_tick_params(which='major', colors=self.greys[1])

        # recolor the spines
        for pos in ['top', 'right', 'bottom', 'left']:
            ax.spines[pos].set_edgecolor(self.greys[0])
           
            
        ax.spines['right'].set_color('none')
        ax.spines['top'].set_color('none')
        ax.yaxis.set_ticks_position('left')
        ax.xaxis.set_ticks_position('none') 


        ax.yaxis.set_major_locator(ticker.MaxNLocator(8))
        ax.yaxis.set_minor_locator(ticker.MaxNLocator(24))
        ax.yaxis.set_major_formatter(ticker.StrMethodFormatter("{x:,}"))
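        # set_smart_bounds was removed in newer matplotlib releases, so call it only if available
        if hasattr(ax.spines['bottom'], 'set_smart_bounds'):
            ax.spines['bottom'].set_smart_bounds(True)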

        #ax1.set_ylim(0-df1.loc['mean'].max()*0.30, df1.loc['mean'].max()*1.15)
        plt.tick_params(axis='x', which='major', labelsize=16, pad = 20)            
            
    ## EVENT HANDLERS
    def onRangeSelect(self, ymin, ymax): 
        """
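        Recolor the bars when a y-axis range is selected with the span selector.
        """
        
        means = self.df_stats.loc['mean'].values  # .loc replaces the removed .ix indexer
        # self.means_std holds 1.96 * standard error (the error-bar half-width),
        # so divide by 1.96 to recover the standard error used as sigma here
        cdf_i = lambda i: norm.cdf(ymax, means[i], self.means_std[i] / 1.96) -\
                          norm.cdf(ymin, means[i], self.means_std[i] / 1.96)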

        CDFs = [cdf_i(j) for j in range(len(self.means_std))]
        
        for i, rect in enumerate(self.rects):
            rect.set_color(self.mymap(CDFs[i]))
        # update the probability labels once, after all bars are recolored
        self.printProbabilities(CDFs)
        
        self.valueOfInterest.set_text('y-axis range $({:,.2f} : {:,.2f})$'.format(ymin, ymax))    
        
        self.fig.canvas.draw()
        
        #plt.savefig('Ilya-Rusin-Week3-Hardest-WithRange.png')
        
    def showPlot(self, png = False):
        """
        Show the figure, or save it to a PNG file when png is True.
        Convenient when not using the inline display setup %matplotlib notebook.
        """
        if png:
            plt.savefig('Ilya-Rusin-Week3-Hardest.png')
        else:
            plt.show()   

In [7]:
ib = iProbabilisticBarChart()

In [8]:
ib.showPlot(png = False)