# Plotting Gallery

In [1]:
import pandas as pd
fm = pd.read_csv('../fm.csv', index_col='trip_log_id')

**Pie Charts**

A *pie chart* does a good job of showing the distribution of a categorical variable. The `piechart` function makes it easy to preview how different parameter choices change the chart:

![](img/dynamic_piechart.gif)
  
We can play with the sliders to find the image that looks just right. The *Merge Slider* makes an "Other" category by combining infrequent values. The *Drop Slider* in a sense does the opposite: it removes the highest-frequency values from the total. In this example, that's a reasonable thing to do since we don't want to include `JFK` or `LGA` in flights to NYC.
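
To make the two sliders concrete, here is a minimal pure-Python sketch of the kind of aggregation they describe. The helper `summarize_categories` is hypothetical and is not part of henchman; it only borrows the parameter names `drop_n` and `mergepast` from the `piechart` call, and its exact semantics (drop the most frequent values first, then merge everything past the first `mergepast` categories) are an assumption based on the description above.

```python
from collections import Counter


def summarize_categories(values, drop_n=0, mergepast=None):
    """Hypothetical helper (not henchman's implementation).

    Drops the `drop_n` most frequent categories, then merges every
    category past the first `mergepast` into a single 'Other' slice.
    Returns a dict mapping each label to its share of the remaining total.
    """
    # most_common() sorts by count, highest first
    counts = Counter(values).most_common()
    # Drop slider: remove the highest-frequency categories
    counts = counts[drop_n:]
    # Merge slider: fold the infrequent tail into 'Other'
    if mergepast is not None and len(counts) > mergepast:
        kept, rest = counts[:mergepast], counts[mergepast:]
        kept.append(('Other', sum(n for _, n in rest)))
        counts = kept
    total = sum(n for _, n in counts)
    if total == 0:
        return {}
    return {label: n / total for label, n in counts}


# Made-up destination codes, for illustration only
dests = ['JFK'] * 5 + ['LGA'] * 4 + ['BOS'] * 3 + ['ORD'] * 2 + ['SFO', 'SEA']
shares = summarize_categories(dests, drop_n=2, mergepast=2)
print(shares)
```

With these made-up values, `JFK` and `LGA` are dropped, `BOS` and `ORD` keep their own slices, and `SFO` and `SEA` are folded into "Other".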

Once we've settled on a parameter set, we can make a static plot by setting `dynamic=False` with the parameters we like.

In [2]:
from henchman.plotting import show
import henchman.plotting as hplot
show(hplot.piechart(fm['flights.dest'], 
                           drop_n=2, mergepast=10, dynamic=False), 
     title='Destination Airport for flights from NYC')
show(hplot.piechart(fm['flights.carrier'], 
                           sort=False, mergepast=None, dynamic=False),
     title='Airline percentages for flights to NYC')


**Histograms**

A histogram is a way of showing the distribution of a numeric variable. Histograms can be tricky: small changes to input parameters can dramatically change what the final graph will look like. In particular, the *number of bins* (`n_bins`) and the cutoffs for excluded values (`col_max` and `col_min`) change the height of particular bars! To sidestep this, we have an interactive method for histograms as well.

![](img/dynamic_histogram.gif)

Once we've settled on parameters, we can once again set `dynamic=False` to get a static plot that looks just how we would like.

In [3]:
from henchman.plotting import show
import henchman.plotting as hplot
show(hplot.histogram(fm['flights.MEAN(trip_logs.arr_delay)'], 
                            n_bins=50, dynamic=False),
     title='Histogram of average flight delay', height=400, width=400)
show(hplot.histogram(fm['flights.MEAN(trip_logs.arr_delay)'], 
                            n_bins=50, col_max=120, dynamic=False),
     title='Histogram of average flight delay', height=400, width=400)
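
To see why `n_bins` and `col_max` change the height of individual bars, the following sketch bins a small made-up list of delays by hand. The function `bin_counts` is hypothetical and only mirrors the parameter names used above; it assumes equal-width bins and that values outside `[col_min, col_max]` are excluded, which may differ from what `hplot.histogram` does internally.

```python
def bin_counts(values, n_bins, col_min=None, col_max=None):
    """Hypothetical equal-width binning (not henchman's implementation).

    Values outside [col_min, col_max] are excluded before counting.
    """
    lo = min(values) if col_min is None else col_min
    hi = max(values) if col_max is None else col_max
    kept = [v for v in values if lo <= v <= hi]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in kept:
        if width == 0:
            idx = 0  # degenerate range: everything lands in the first bin
        else:
            # the largest value is clamped into the last bin
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up delays in minutes, with one extreme outlier
delays = [0, 5, 10, 15, 20, 25, 30, 300]
all_values = bin_counts(delays, n_bins=3)
capped = bin_counts(delays, n_bins=3, col_max=30)
print(all_values)
print(capped)
```

Without a cap, the single outlier stretches the range so that almost every value lands in the first bar. Capping at 30 spreads the same values across all three bars, which is the same effect as adding `col_max` in the second plot above.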

In [4]:
show(hplot.histogram(
                fm['flights.MEAN(trip_logs.arr_delay)'], 
                fm['label'], 
                n_bins=50,
                col_max=150,
                normalized=False, dynamic=False),
          title='Actual delays overlaid over historical delays (real)')
show(hplot.histogram(
                fm['flights.MEAN(trip_logs.arr_delay)'], 
                fm['label'], 
                n_bins=50,
                col_max=150,
                normalized=True, dynamic=False),
          title='Actual delays overlaid over historical delays (normalized)')
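
The difference between the two plots above comes down to raw counts versus shares. As a rough illustration, the sketch below rescales each group's bin counts so they sum to 1. This is one common meaning of "normalized" and is an assumption here; `hplot.histogram` with `normalized=True` may scale differently. The helper `normalize` and the counts are made up.

```python
def normalize(counts):
    """Hypothetical helper: rescale bin counts so they sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Made-up bin counts for two label groups of very different size
on_time = [40, 30, 20, 10]
delayed = [1, 2, 3, 4]

on_time_share = normalize(on_time)
delayed_share = normalize(delayed)
print(on_time_share)
print(delayed_share)
```

With raw counts, the smaller group is barely visible next to the larger one. After rescaling, both groups are on the same scale, so the shapes of the two distributions can be compared directly.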

**Bivariate Plots**

In [5]:
show(hplot.scatterplot(fm['distance'].fillna(0), fm['scheduled_elapsed_time']),
     x_axis='Distance', y_axis='Scheduled Elapsed Time', height=300)
show(hplot.scatterplot(fm['distance'].fillna(0), fm['scheduled_elapsed_time'], fm['label'], hover=False),
     x_axis='Distance', y_axis='Scheduled Elapsed Time', height=300)

In [11]:
show(hplot.timeseries(fm['time'], fm['label'], n_bins=60, dynamic=False), 
     title='Mean labels per day', height=300, width=900)

**Machine Learning**

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from henchman.learning import create_model
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'bool']

X = fm.select_dtypes(include=numerics).fillna(0)
y = X.pop('label')

score, fit_model = create_model(X, y, RandomForestClassifier(), roc_auc_score, n_splits=4)

show(hplot.feature_importances(X, fit_model, n_feats=10), height=400)
show(hplot.roc_auc(X, y, RandomForestClassifier(), n_splits=4), height=400)
show(hplot.f1(X, y, RandomForestClassifier(), n_splits=4), height=400)