## Imports

In [None]:
# Standard Imports
import numpy              as np
import pandas             as pd

# Custom Modules
import graphs
import plotly_graphs      as pg

# Notebook Appearance
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))
%matplotlib inline

## Table Of Contents

- [Continuous Data](#Continuous-Data)
    - [Histograms](#Histograms)
    - [2d Histogram](#2d-Histogram)
    - [KDE Plots](#KDE-Plots)
    - [Box Plots](#Box-Plots)
    - [Violin Plots](#Violin-Plots)
    - [Regression Plots](#Regression-Plots)
    - [Heat Maps](#Heat-Maps)
    - [Residual Plots](#Residual-Plots)
    
- [Categorical Data](#Categorical-Data)
    - [Bar Plots](#Bar-Plots)
        - [Bar Plot](#Bar-Plot)
        - [Plotly Bar Chart](#Plotly-Bar-Chart)
    - [Count Plots](#Count-Plots)
    - [Categorical Box Plots](#Categorical-Box-Plots)
    - [Categorical Violin Plots](#Categorical-Violin-Plots)
    
- [Utility Functions](#Utility-Functions)
    - [Tables](#Tables)

## Reading In The Data

The data I will be using for these examples is the data I used for my capstone project at General Assembly: it is a dataset of MRI measurements from patients who have had heart attacks.  I will also be using a set of predictions from my capstone to illustrate the residual plots.

The project can be found [here](https://github.com/a-bergman/DSI-Capstone).

In [None]:
heart = pd.read_csv("../Examples/Data/mri_cleaned.csv")
preds = pd.read_csv("../Examples/Data/engineered_model_predictions.csv")

## <center>Continuous Data</center>

### Histograms

A histogram looks at a single _numeric_ variable: it counts the occurrences of each value (grouped into bins) and thus plots the distribution of values in a data set.

The red line indicates the mean of the distribution.

In [None]:
graphs.histograms(df      = heart,
                  columns = ["lvedv", "lvesv"],
                  titles  = ["End Diastolic Volume", "End Systolic Volume"],
                  labels  = ["Volume In mL", "Volume In mL"],
                  ylabel  = "Frequency",
                  ticks   = [np.arange(0,625,50), np.arange(0,525,50)],
                  dim     = (18,8),
                  row     = 2,
                  col     = 2)
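
As a rough illustration of what a histogram computes, the sketch below bins a sample with `np.histogram` and finds the mean that a reference line would mark. The sample is synthetic and the bin edges are chosen only for illustration; `graphs.histograms` may bin the real data differently.

```python
import numpy as np

# Synthetic volumes in mL (illustrative only, not the MRI data)
rng = np.random.default_rng(0)
volumes = rng.normal(loc=200, scale=50, size=1_000).clip(0, 599)

# Count how many values fall into each 50 mL bin
bin_edges = np.arange(0, 650, 50)
counts, edges = np.histogram(volumes, bins=bin_edges)

# The mean is where a vertical reference line would be drawn
mean_volume = volumes.mean()

print(counts)
print(f"mean = {mean_volume:.1f} mL")
```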

### 2d Histogram

A two-dimensional histogram also looks at numeric data: it bins two numeric columns jointly and uses a colormap to show where values of the two variables coincide.  This type of plot works best with large data sets, when you would otherwise need a large number of buckets.

In [None]:
pg.histogram_2d(x      = "age",
                y      = "lvedv",
                df     = heart,
                title  = "Age & EDV",
                xlabel = "Age",
                ylabel = "EDV",
                xticks = np.arange(0,110,10),
                yticks = np.arange(0,625,75))

This chart will not display on GitHub because it does not support iframes: click [here](https://nbviewer.jupyter.org/github/a-bergman/Easy-Graphing/blob/master/Examples/Example%20Charts.ipynb#2d-Histogram) to view this graph.
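
The joint binning behind a two-dimensional histogram can be sketched with `np.histogram2d`. The two samples below are synthetic stand-ins for age and volume, and the bin edges simply reuse the tick spacing from the call above; this is not necessarily how `pg.histogram_2d` bins internally.

```python
import numpy as np

# Synthetic stand-ins for two numeric columns (illustrative only)
rng = np.random.default_rng(1)
age = rng.uniform(20, 90, size=5_000)
edv = rng.normal(200, 50, size=5_000).clip(0, 599)

# One set of bin edges per axis
x_edges = np.arange(0, 110, 10)
y_edges = np.arange(0, 625, 75)

# counts[i, j] is the number of points in x-bin i and y-bin j;
# a colormap would shade each cell by this count
counts, _, _ = np.histogram2d(age, edv, bins=[x_edges, y_edges])

print(counts.shape)
```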

### KDE Plots

A KDE (kernel density estimate) plot is similar to a histogram, but subtly different.  A KDE plot estimates the probability density function for a continuous variable, but does not make an assumption about the underlying distribution of the data.  The curves are smooth because they plot an estimate, not the actual values.

More information can be found [here](https://en.wikipedia.org/wiki/Kernel_density_estimation).

In [None]:
graphs.kdeplots(df     = heart,
                cols   = ["lvedv", "lvesv"],
                title  = "End Diastolic & Systolic Volume",
                dim    = (14,7),
                colors = ["orange", "blue"],
                labels = ["Diastolic", "Systolic"],
                xlabel = "Volume In mL",
                ylabel = "Probability",
                ticks  = np.arange(0,650,50))
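
To make the idea of a kernel density estimate concrete, here is a minimal hand-rolled Gaussian KDE. The sample is synthetic, and the rule-of-thumb bandwidth (1.06 · σ · n^(-1/5)) is an assumption made for this sketch; the bandwidth that `graphs.kdeplots` ends up using may differ.

```python
import numpy as np

# Synthetic sample (illustrative only)
rng = np.random.default_rng(2)
sample = rng.normal(200, 50, size=500)
n = sample.size

# Rule-of-thumb bandwidth, assumed here for illustration
bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data
grid = np.linspace(sample.min() - 4 * bandwidth,
                   sample.max() + 4 * bandwidth, 1_000)

# Average of Gaussian kernels, one centered on each observation
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (Riemann-sum check)
area = density.sum() * (grid[1] - grid[0])
print(f"area under curve = {area:.3f}")
```

The y-axis of such a curve is a density rather than a count, which is why the area under it, not its height, is what sums to one.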

### Box Plots

A box plot is another way of visualizing continuous data and is useful because it summarizes the data: the box and whiskers mark the minimum, 25<sup>th</sup> percentile, median, 75<sup>th</sup> percentile, and maximum (excluding outliers), and any outliers are plotted as individual points.  In addition to that summary, these plots are useful because they indicate the presence and degree of outliers.

In [None]:
graphs.boxplots(df      = heart,
                columns = ["lvedv", "lvesv"],
                titles  = ["End Diastolic Volume", "End Systolic Volume"],
                labels  = ["Volume In mL", "Volume In mL"],
                ticks   = [np.arange(0,650,50), np.arange(0,550,50)],
                dim     = (18,8),
                row     = 2,
                col     = 2)
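
The numbers a box plot draws can be computed directly. The sketch below uses synthetic values and assumes the common 1.5 × IQR convention for flagging outliers; the exact whisker rule used by `graphs.boxplots` is not confirmed here.

```python
import numpy as np

# Synthetic values with two extreme points appended (illustrative only)
rng = np.random.default_rng(3)
values = np.append(rng.normal(200, 40, size=300), [450.0, 480.0])

# Quartiles define the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]

# Whiskers reach the most extreme points that are not outliers
whisker_low = values[~is_outlier].min()
whisker_high = values[~is_outlier].max()

print(f"box: {q1:.1f} | {median:.1f} | {q3:.1f}")
print(f"whiskers: {whisker_low:.1f} to {whisker_high:.1f}")
print(f"outliers: {np.sort(outliers)}")
```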

[Top](#Table-Of-Contents)

### Violin Plots

A violin plot shows the same information as a box plot but adds the probability density of the data on each side of the center line.  This has the benefit of showing the distribution of the data in relation to the five-number summary of the box plot.

In [None]:
graphs.violinplots(df      = heart,
                   columns = ["lvedv", "lvesv"],
                   titles  = ["End Diastolic Volume", "End Systolic Volume"],
                   labels  = ["Volume In mL", "Volume In mL"],
                   ticks   = [np.arange(0,650,50), np.arange(0,550,50)],
                   dim     = (18,8),
                   row     = 2,
                   col     = 2)

### Regression Plots

A scatter plot is a way of visualizing the relationship between two continuous variables, usually the target variable and a feature in the model.  In doing so, we can determine whether our variables have a linear relationship, which is important for many modeling techniques.

A regression plot is the same, but it adds a line of best fit to the data, which makes the correlation between the two variables easier to see.

In [None]:
graphs.regressionplots(df      = heart,
                       columns = ["lvesv", "lvef"],
                       y       = "lvedv",
                       titles  = ["ESV", "EF"],
                       labels  = ["Systolic Volume (mL)", "Fraction"],
                       ylabel  = "Diastolic Volume (mL)",
                       ticks   = [np.arange(0,550,50), np.arange(0,100,10)],
                       dim     = (18,8),
                       row     = 2,
                       col     = 2,)
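
A line of best fit can be sketched with an ordinary least squares fit from `np.polyfit`, alongside the correlation coefficient it helps visualize. The data below is synthetic, generated with a known slope so the fit can be sanity-checked; it is not the MRI data.

```python
import numpy as np

# Synthetic linear relationship with noise (illustrative only)
rng = np.random.default_rng(4)
x = rng.uniform(0, 500, size=200)
y = 1.2 * x + 60 + rng.normal(0, 25, size=200)

# Degree-1 least squares fit: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation between the two variables
r = np.corrcoef(x, y)[0, 1]

print(f"y = {slope:.2f} * x + {intercept:.1f}, r = {r:.3f}")
```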

[Top](#Table-Of-Contents)

### Heat Maps

A heat map shows the correlation between continuous variables in a shaded matrix.

In this function, the matrix is "split", meaning that it does not duplicate each correlation across the diagonal and does not show the correlation of each variable with itself.

In [None]:
graphs.heatmap(df      = heart,
               columns = ["age", "lvedv","lvesv", "lvef"],
               dim     = (10,10),
               title   = "Correlations Among Numeric Columns",
               vmin    = -1,
               vmax    = 1)
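
One way to get such a "split" matrix is to mask the upper triangle and the diagonal of the correlation matrix. The sketch below shows that idea on a synthetic data frame with made-up column names; it is an assumption about the approach, not the actual body of `graphs.heatmap`.

```python
import numpy as np
import pandas as pd

# Synthetic numeric columns (illustrative only)
rng = np.random.default_rng(5)
base = rng.normal(size=300)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.5, size=300),
    "c": rng.normal(size=300),
})

corr = df.corr()

# True above and on the diagonal: these cells are hidden
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Keep only the lower triangle; masked cells become NaN
lower = corr.mask(mask)
print(lower.round(2))
```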

### Residual Plots

Residual plots are created after generating predictions with a model and represent the difference between each actual and predicted value.  These graphs are used to detect heteroscedasticity in the residuals, i.e. whether the spread of the residuals follows a pattern: a pattern indicates that a model is not performing equally across all data.  This is a particular concern in linear regression because it indicates that one of the assumptions of linear regression, constant variance of the errors (homoscedasticity), is being violated.

For each plot this function makes, the x-axis represents true values and the y-axis represents predicted values.

In [None]:
graphs.residualplots(df      = preds,
                     x       = "Actual",
                     columns = ["Random Forest Reg.", "XGBoost Reg."],
                     titles  = ["Random Forest Reg.", "XGBoost Reg."],
                     dim     = (18,8),
                     row     = 2,
                     col     = 3,
                     xlabel  = "Actual",
                     ylabel  = "Predicted")
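
The quantity behind a residual plot, and a crude check for heteroscedasticity, can be sketched as follows. The actual and predicted values are synthetic, with an error that deliberately grows with the actual value; comparing the spread in the two halves of the data is only a rough illustration, not a formal test.

```python
import numpy as np

# Synthetic actual values and hypothetical predictions whose
# error grows with the actual value (illustrative only)
rng = np.random.default_rng(6)
actual = rng.uniform(50, 500, size=400)
predicted = actual + rng.normal(0, 0.1 * actual)

# Residual: how far each prediction is from the truth
residuals = actual - predicted

# Compare residual spread for small vs. large actual values
cutoff = np.median(actual)
low_spread = residuals[actual < cutoff].std()
high_spread = residuals[actual >= cutoff].std()
spread_ratio = high_spread / low_spread

print(f"spread ratio (high / low) = {spread_ratio:.2f}")
```

A ratio well above 1 means the errors fan out as the actual value grows, which is the kind of pattern a residual plot makes visible.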

[Top](#Table-Of-Contents)

## <center>Categorical Data</center>

### Bar Plots

A Seaborn bar plot shows a measure of central tendency (by default the mean) of a numeric variable for each level of a categorical variable, which is why it requires both an x and a y variable.

For a more classic bar plot, use the `countplots` function.

In [None]:
graphs.barplots(df      = heart,
                columns = ["aortic_reg", "mitral_reg"],
                y       = "lvedv",
                labels  = ["Severity", "Severity"],
                ylabel  = "End Diastolic Volume",
                titles  = ["Aortic Regurgitation", "Mitral Regurgitation"],
                dim     = (18,8),
                row     = 2,
                col     = 2)

The same graphs as above, but with the `hue` parameter included.

Here `0` is female and `1` is male.

In [None]:
graphs.barplots(df      = heart,
                columns = ["aortic_reg", "mitral_reg"],
                y       = "lvedv",
                labels  = ["Severity", "Severity"],
                ylabel  = "End Diastolic Volume",
                hue     = "sex",
                titles  = ["Aortic Regurgitation", "Mitral Regurgitation"],
                dim     = (18,8),
                row     = 2,
                col     = 2)
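
The bar heights in these plots correspond to a grouped mean, and adding a hue splits each group one level further. The small data frame below is made up for illustration, including its column names.

```python
import pandas as pd

# Made-up data: a category, a hue variable, and a numeric value
df = pd.DataFrame({
    "severity": [0, 0, 1, 1, 2, 2],
    "sex":      [0, 1, 0, 1, 0, 1],
    "volume":   [150.0, 170.0, 180.0, 200.0, 220.0, 260.0],
})

# One bar per category: the mean of the numeric column
bar_heights = df.groupby("severity")["volume"].mean()

# With a hue: one bar per (category, hue level) pair
hue_heights = df.groupby(["severity", "sex"])["volume"].mean().unstack("sex")

print(bar_heights)
print(hue_heights)
```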

#### Bar Plot

Same as the above, but only creates a single bar plot.

In [None]:
graphs.barplot(df     = heart,
               x      = "tricusp_reg",
               y      = "lvesv",
               title  = "Tricuspid Regurgitation",
               label  = "Severity",
               ylabel = "Systolic Volume",
               ticks  = np.arange(0,175,25),
               dim    = (10,5))

The same graph as above with the `hue` parameter included.

In [None]:
graphs.barplot(df     = heart,
               x      = "tricusp_reg",
               y      = "lvesv",
               hue    = "sex",
               title  = "Tricuspid Regurgitation",
               label  = "Severity",
               ylabel = "Systolic Volume",
               ticks  = np.arange(0,175,25),
               dim    = (10,5))

#### Plotly Bar Chart

There is another function to create a bar chart through Plotly: it generates a true bar chart of category frequencies, equivalent to a Seaborn count plot.

In [None]:
pg.bar_chart(df     = heart,
             col    = "tricusp_reg",
             widths = [0.75, 0.75, 0.75, 0.75,
                       0.75, 0.75, 0.75],
             width  = 750,
             height = 500,
             title  = "Severity Of Tricuspid Regurgitation",
             xlabel = "Severity",
             ticks  = np.arange(0,3500,500),
             text_pos = "none")

This chart will not display on GitHub because it does not support iframes: click [here](https://nbviewer.jupyter.org/github/a-bergman/Easy-Graphing/blob/master/Examples/Example%20Charts.ipynb) to view this graph.

[Top](#Table-Of-Contents)

### Count Plots

The count plot is simpler than the bar plot: it counts the frequency of each category in a data set.

In [None]:
graphs.countplots(df      = heart,
                  columns = ["aortic_reg", "mitral_reg"],
                  titles  = ["Aortic Regurgitation", "Mitral Regurgitation"],
                  labels  = ["Severity", "Severity"],
                  ylabel  = "Count",
                  dim     = (18,8),
                  row     = 2,
                  col     = 2)

The same as above but with `hue` included.

In [None]:
graphs.countplots(df      = heart,
                  columns = ["aortic_reg"],
                  titles  = ["Aortic Regurgitation"],
                  labels  = ["Severity"],
                  hue     = "sex",
                  ylabel  = "Count",
                  dim     = (18,8),
                  row     = 2,
                  col     = 2)
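
The tallies behind a count plot can be reproduced with `value_counts`, and the hue version with `pd.crosstab`. The values below are made up for illustration.

```python
import pandas as pd

# Made-up categorical data (illustrative only)
df = pd.DataFrame({
    "severity": [0, 0, 0, 1, 1, 2, 0, 1, 3],
    "sex":      [0, 1, 1, 0, 1, 0, 0, 1, 1],
})

# One bar per category: how often each value occurs
counts = df["severity"].value_counts().sort_index()

# With a hue: counts per category, split by the hue variable
counts_by_sex = pd.crosstab(df["severity"], df["sex"])

print(counts)
print(counts_by_sex)
```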

[Top](#Table-Of-Contents)

### Categorical Box Plots

A categorical box plot is a modified box plot that uses a categorical variable as its x-axis.  Thus it returns a box plot summary of data for each category.

In [None]:
graphs.categorical_boxplots(df      = heart,
                            x       = "aortic_stenosis",
                            columns = ["lvedv", "lvesv"],
                            titles  = ["End Diastolic Volume", "End Systolic Volume"],
                            labels  = ["Severity", "Severity"],
                            ylabels = ["Volume", "Volume"],
                            ticks   = [np.arange(0,650,50), np.arange(0,550,50)],
                            dim     = (18,12),
                            row     = 2,
                            col     = 2,
                            orient  = "v")

Same as the above graph but with the hue option included.

In [None]:
graphs.categorical_boxplots(df      = heart,
                            x       = "aortic_stenosis",
                            columns = ["lvedv"],
                            hue     = "sex",
                            titles  = ["End Diastolic Volume"],
                            labels  = ["Severity"],
                            ylabels = ["Volume"],
                            ticks   = [np.arange(0,650,50)],
                            dim     = (12,6),
                            row     = 1,
                            col     = 1,
                            orient  = "v")

[Top](#Table-Of-Contents)

### Categorical Violin Plots

A categorical violin plot is a modified violin plot that uses a categorical variable as its x-axis.  Thus it returns a violin plot summary of data for each category.

In [None]:
graphs.categorical_violinplots(df      = heart,
                               x       = "aortic_stenosis",
                               columns = ["age"],
                               titles  = ["Age"],
                               labels  = ["Severity"],
                               ylabels = ["Years"],
                               ticks   = [np.arange(0,110,10)],
                               dim     = (16,8),
                               row     = 1,
                               col     = 1)

In [None]:
graphs.categorical_violinplots(df      = heart,
                               x       = "aortic_stenosis",
                               columns = ["age", "lvef"],
                               titles  = ["Age", "Ejection Fraction"],
                               labels  = ["Severity", "Severity"],
                               ylabels = ["Years", "Percentage"],
                               ticks   = [np.arange(0,110,10), np.arange(0,100,10)],
                               dim     = (18,12),
                               row     = 2,
                               col     = 2)

Same as the above graph, but with a hue specified.

In [None]:
graphs.categorical_violinplots(df      = heart,
                               x       = "aortic_stenosis",
                               columns = ["age"],
                               hue     = "sex",
                               titles  = ["Age"],
                               labels  = ["Severity"],
                               ylabels = ["Years"],
                               ticks   = [np.arange(0,110,10)],
                               dim     = (12,6),
                               row     = 1,
                               col     = 1)

Same as the above graph, but with a hue and split specified.

In [None]:
graphs.categorical_violinplots(df      = heart,
                               x       = "aortic_stenosis",
                               columns = ["age"],
                               hue     = "sex",
                               titles  = ["Age"],
                               labels  = ["Severity"],
                               ylabels = ["Years"],
                               ticks   = [np.arange(0,110,10)],
                               dim     = (12,6),
                               row     = 1,
                               col     = 1,
                               split   = True)

[Top](#Table-Of-Contents)

## <center>Utility Functions</center>

### Tables

Plotly contains a function to create tables, which are easy to save as a .png and distribute.

In [None]:
pg.table(h_values = ["Actual Values", "Random Forest Reg.", 
                     "XGBoost Reg."],
         c_values = [preds["Actual"].head(10), preds["Random Forest Reg."].head(10),
                     preds["XGBoost Reg."].head(10)],
         width    = 750)

[Top](#Table-Of-Contents)