# Using Featuretools to Predict Missed Appointments
In this notebook, we use [Featuretools](https://github.com/Featuretools/featuretools) to automatically generate features relating to when patients don't show up for doctor appointments. We quickly reconstruct the features that were made by hand in the most popular [kernel](https://www.kaggle.com/somrikbanerjee/predicting-show-up-no-show) and make some other interesting features automatically.

In [64]:
import numpy as np
import pandas as pd
import featuretools as ft
print('Featuretools version {}'.format(ft.__version__))

# Data Wrangling
# After loading the data with pandas, we /have/ to fix typos in some column names
# but we change others as well to suit personal preference.
data = pd.read_csv("../input/KaggleV2-May-2016.csv", parse_dates=['AppointmentDay', 'ScheduledDay'])
data.index = data['AppointmentID']
data.rename(columns = {'Hipertension': 'hypertension',
                       'Handcap': 'handicap',
                       'PatientId': 'patient_id',
                       'AppointmentID': 'appointment_id',
                       'ScheduledDay': 'scheduled_time',
                       'AppointmentDay': 'appointment_day',
                       'Neighbourhood': 'neighborhood',
                       'No-show': 'no_show'}, inplace = True)
# Lower-case any remaining column names
data.columns = data.columns.str.lower()
# Move each appointment day to its last second (23:59:59)
data['appointment_day'] = data['appointment_day'] + pd.Timedelta('1d') - pd.Timedelta('1s')

# Encode the label as a boolean: True means the patient did not show up
data['no_show'] = data['no_show'].map({'No': False, 'Yes': True})

# Show the size of the data in a print statement
print('{} Appointments, {} Columns'.format(data.shape[0], data.shape[1]))
print('Appointments: {}'.format(data.shape[0]))
print('Schedule times: {}'.format(data.scheduled_time.nunique()))
print('Patients: {}'.format(data.patient_id.nunique()))
print('Neighborhoods: {}'.format(data.neighborhood.nunique()))
pd.options.display.max_columns=100 
pd.options.display.float_format = '{:.2f}'.format


data.head(3)

Featuretools version 0.1.20
110527 Appointments, 14 Columns
Appointments: 110527
Schedule times: 103549
Patients: 62299
Neighborhoods: 81


AppointmentID,patient_id,appointment_id,gender,scheduled_time,appointment_day,age,neighborhood,scholarship,hypertension,diabetes,alcoholism,handicap,sms_received,no_show
5642903,29872499824296.0,5642903,F,2016-04-29 18:38:08,2016-04-29 23:59:59,62,JARDIM DA PENHA,0,1,0,0,0,0,False
5642503,558997776694438.0,5642503,M,2016-04-29 16:08:27,2016-04-29 23:59:59,56,JARDIM DA PENHA,0,0,0,0,0,0,False
5642549,4262962299951.0,5642549,F,2016-04-29 16:19:04,2016-04-29 23:59:59,62,MATA DA PRAIA,0,0,0,0,0,0,False


This dataset is a single table of appointments with more than sixty thousand unique patients. Each row represents a scheduled appointment and our goal is to predict if the patient actually shows up for that appointment. From that table, we use Featuretools to automatically generate the features below.

*Note: For convenience, the next cell has all of the code necessary to create the feature matrix. We'll go through the content step-by-step in the next section.*

In [65]:
import featuretools.variable_types as vtypes
# This is all of the code from the notebook
# No need to run/read this cell if you're running everything else

# List the semantic type for each column
variable_types = {'gender': vtypes.Categorical,
                  'patient_id': vtypes.Categorical,
                  'age': vtypes.Ordinal,
                  'scholarship': vtypes.Boolean,
                  'hypertension': vtypes.Boolean,
                  'diabetes': vtypes.Boolean,
                  'alcoholism': vtypes.Boolean,
                  'handicap': vtypes.Boolean,
                  'no_show': vtypes.Boolean,
                  'sms_received': vtypes.Boolean}

# Use those variable types to make an EntitySet and Entity from that table
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='appointment_id',
                              time_index='scheduled_time',
                              secondary_time_index={'appointment_day': ['no_show', 'sms_received']},
                              variable_types=variable_types)

# Add a patients entity with patient-specific variables
es.normalize_entity('appointments', 'patients', 'patient_id',
                    additional_variables=['scholarship',
                                          'hypertension',
                                          'diabetes',
                                          'alcoholism',
                                          'handicap'])

# Make locations, ages and genders
es.normalize_entity('appointments', 'locations', 'neighborhood',
                    make_time_index=False)
es.normalize_entity('appointments', 'ages', 'age',
                    make_time_index=False)
es.normalize_entity('appointments', 'genders', 'gender',
                    make_time_index=False)

# Take the index and the scheduled time to use as a cutoff time
cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']].sort_values(by='scheduled_time')

# Rename cutoff time columns to avoid confusion
cutoff_times.rename(columns = {'scheduled_time': 'cutoff_time', 
                               'no_show': 'label'},
                    inplace = True)

# Make feature matrix from entityset/cutoff time pair
fm_final, _ = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=['count', 'percent_true'],
                      trans_primitives=['is_weekend', 'weekday', 'day', 'month', 'year'],
                      drop_contains=['appointment_day'],
                      approximate='3h',
                      cutoff_time=cutoff_times[20000:],
                      verbose=True)

print('Features: {}, Rows: {}'.format(fm_final.shape[1], fm_final.shape[0]))
fm_final.tail(3)

Features: 27, Rows: 90527


Unnamed: 0,appointment_id,label,neighborhood,gender,patient_id,age,WEEKEND(scheduled_time),WEEKDAY(scheduled_time),DAY(scheduled_time),MONTH(scheduled_time),YEAR(scheduled_time),patients.COUNT(appointments),patients.PERCENT_TRUE(appointments.no_show),patients.PERCENT_TRUE(appointments.sms_received),patients.WEEKDAY(first_appointments_time),patients.DAY(first_appointments_time),patients.MONTH(first_appointments_time),patients.YEAR(first_appointments_time),locations.COUNT(appointments),locations.PERCENT_TRUE(appointments.no_show),locations.PERCENT_TRUE(appointments.sms_received),ages.COUNT(appointments),ages.PERCENT_TRUE(appointments.no_show),ages.PERCENT_TRUE(appointments.sms_received),genders.COUNT(appointments),genders.PERCENT_TRUE(appointments.no_show),genders.PERCENT_TRUE(appointments.sms_received)
90524,5790466,False,JARDIM CAMBURI,M,356247857784.0,27,False,2,8,6,2016,1,0.0,0.0,2,8,6,2016,7701.0,0.18,0.32,1375.0,0.23,0.34,38679,0.19,0.28
90525,5790481,False,JARDIM CAMBURI,F,234131759175.0,30,False,2,8,6,2016,1,0.0,0.0,2,8,6,2016,7701.0,0.18,0.32,1520.0,0.23,0.34,71824,0.2,0.32
90526,5790484,False,JARDIM CAMBURI,F,5237164264312.0,27,False,2,8,6,2016,0,,,2,8,6,2016,7701.0,0.18,0.32,1375.0,0.23,0.34,71824,0.2,0.32


This feature matrix has features like `MONTH` and `WEEKDAY` of the scheduled time and also more complicated features like "how often do patients not show up to this location" (`locations.PERCENT_TRUE(appointments.no_show)`). It takes roughly 20 minutes of work to structure any data and make your first feature matrix using Featuretools. We'll walk through the steps now.

## Structuring the Data
We are given a single table of data. Feature engineering requires that we use what we understand about the data to build numeric rows (feature vectors) which we can use as input into machine learning algorithms. The primary benefit of Featuretools is that it does not require you to make those features by hand. The requirement instead is that you pass in what you know about the data.

That knowledge is stored in a Featuretools [EntitySet](https://docs.featuretools.com/loading_data/using_entitysets.html). An `EntitySet` is a collection of tables with information about relationships between tables and semantic typing for every column. We're going to show how to
+ pass in information about semantic types of columns,
+ load in a dataframe to an `EntitySet` and
+ tell the `EntitySet` about reasonable new `Entities` to make from that dataframe.

In [12]:
# List the semantic type for each column

import featuretools.variable_types as vtypes
variable_types = {'gender': vtypes.Categorical,
                  'patient_id': vtypes.Categorical,
                  'age': vtypes.Ordinal,
                  'scholarship': vtypes.Boolean,
                  'hypertension': vtypes.Boolean,
                  'diabetes': vtypes.Boolean,
                  'alcoholism': vtypes.Boolean,
                  'handicap': vtypes.Boolean,
                  'no_show': vtypes.Boolean,
                  'sms_received': vtypes.Boolean}

The `variable_types` dictionary is a place to store information about the semantic type of each column. While many types can be detected automatically, some are necessarily tricky. As an example, computers tend to read `age` as numeric. Even though ages are numbers, it can be useful to think of them as `Categorical` or `Ordinal`. Changing the variable type will change which functions are automatically applied to generate features.
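
To make the effect of a semantic type concrete, here is a small pandas-only sketch. It does not use Featuretools, and the toy `ages` values are made up. It contrasts the summaries that make sense when age is read as a number with those that make sense when it is read as an ordered category.

```python
import pandas as pd

# Toy ages, invented for illustration
ages = pd.Series([62, 56, 62, 8, 30])

# Read as numeric: arithmetic summaries such as the mean are meaningful
numeric_mean = ages.mean()

# Read as ordinal: values are ordered labels, so we count and group instead
ordinal = pd.Series(pd.Categorical(ages, ordered=True))
n_unique = ordinal.nunique()
most_common = ordinal.mode()[0]
counts = ordinal.value_counts()

print(numeric_mean, n_unique, most_common)
```

Grouping appointments by age, as we do later with an `ages` entity, is the kind of operation that only makes sense under the second reading.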

Next, we make an entity `appointments`:

In [13]:
# Make an entity named 'appointments' which stores dataset metadata with the dataframe
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='appointment_id',
                              time_index='scheduled_time',
                              secondary_time_index={'appointment_day': ['no_show', 'sms_received']},
                              variable_types=variable_types)
es['appointments']

Entity: appointments
  Variables:
    scheduled_time (dtype: datetime_time_index)
    appointment_day (dtype: datetime)
    neighborhood (dtype: categorical)
    gender (dtype: categorical)
    patient_id (dtype: categorical)
    age (dtype: ordinal)
    scholarship (dtype: boolean)
    hypertension (dtype: boolean)
    diabetes (dtype: boolean)
    alcoholism (dtype: boolean)
    handicap (dtype: boolean)
    no_show (dtype: boolean)
    sms_received (dtype: boolean)
    appointment_id (dtype: index)
  Shape:
    (Rows: 110527, Columns: 14)

We have turned the dataframe into an entity by calling the function `entity_from_dataframe`. Notice that we specified an index, a time index, a secondary time index and the `variable_types` from the last cell as keyword arguments. 

The time index and secondary time index denote when each piece of data was recorded. With that information, we can avoid using data from the future while creating features. Since the label is in the dataframe, we either need to specify a time index or drop the column entirely.
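
As a rough illustration of what the two time indices mean, the pandas sketch below builds the "snapshot" of a toy table that would be known at a given moment. The helper `visible_at` and the toy rows are hypothetical; this is our reading of the idea, not Featuretools' internal implementation.

```python
import numpy as np
import pandas as pd

# Toy appointments table; values are invented for illustration
appointments = pd.DataFrame({
    'appointment_id': [1, 2, 3],
    'scheduled_time': pd.to_datetime(['2016-04-01 09:00', '2016-04-10 14:00', '2016-04-20 08:00']),
    'appointment_day': pd.to_datetime(['2016-04-05 23:59:59', '2016-04-25 23:59:59', '2016-04-22 23:59:59']),
    'sms_received': [0, 1, 0],
    'no_show': [True, False, False],
})

def visible_at(df, cutoff, late_columns=('no_show', 'sms_received')):
    """Hypothetical helper: return what would be known at `cutoff`."""
    # Time index: a row exists only once the appointment has been scheduled
    known = df[df['scheduled_time'] <= cutoff].copy()
    # Secondary time index: these columns stay unknown until the appointment day
    not_yet = known['appointment_day'] > cutoff
    for col in late_columns:
        known[col] = known[col].astype(object)
        known.loc[not_yet, col] = np.nan
    return known

snapshot = visible_at(appointments, pd.Timestamp('2016-04-12'))
print(snapshot)
```

On 2016-04-12 the third appointment has not been scheduled yet, and the outcome of the second one is still unknown, so neither can leak into a feature.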

Finally, we build new entities from our existing one using `normalize_entity`. We take unique values from `patient_id`, `age`, `neighborhood` and `gender` and make a new `Entity` for each whose rows are the unique values. To do that we only need to specify where we start (`appointments`), the name of the new entity (e.g. `patients`) and what the index should be (e.g. `patient_id`). Having those additional `Entities` and `Relationships` tells the algorithm about reasonable groupings, which allows for some neat aggregations.

In [14]:
# Make a patients entity with patient-specific variables
es.normalize_entity('appointments', 'patients', 'patient_id',
                    additional_variables=['scholarship',
                                          'hypertension',
                                          'diabetes',
                                          'alcoholism',
                                          'handicap'])

# Make locations, ages and genders
es.normalize_entity('appointments', 'locations', 'neighborhood',
                    make_time_index=False)
es.normalize_entity('appointments', 'ages', 'age',
                    make_time_index=False)
es.normalize_entity('appointments', 'genders', 'gender',
                    make_time_index=False)


Entityset: Appointments
  Entities:
    appointments [Rows: 110527, Columns: 9]
    patients [Rows: 62299, Columns: 7]
    locations [Rows: 81, Columns: 1]
    ages [Rows: 104, Columns: 1]
    genders [Rows: 2, Columns: 1]
  Relationships:
    appointments.patient_id -> patients.patient_id
    appointments.neighborhood -> locations.neighborhood
    appointments.age -> ages.age
    appointments.gender -> genders.gender

In [16]:
# Show the patients entity
es['patients'].df.head(2)

patient_id (index),patient_id,scholarship,hypertension,diabetes,alcoholism,handicap,first_appointments_time
832256398961987,832256398961987,0,0,0,0,0,2015-11-10 07:13:56
91637474953513,91637474953513,0,1,0,0,0,2015-12-03 08:17:28
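
For intuition, a comparable `patients` table can be derived by hand in pandas. This is only a sketch on made-up rows. It assumes that `first_appointments_time` is the earliest `scheduled_time` seen for each patient, which matches the column name but is not confirmed here.

```python
import pandas as pd

# Toy appointments table; values are invented for illustration
appointments = pd.DataFrame({
    'appointment_id': [1, 2, 3, 4],
    'patient_id': [10, 20, 10, 30],
    'scheduled_time': pd.to_datetime(['2016-04-03', '2016-04-01', '2016-04-02', '2016-04-05']),
    'diabetes': [0, 1, 0, 0],
})

# One row per unique patient_id, carrying the patient-level column along
patients = (appointments
            .sort_values('scheduled_time')
            .groupby('patient_id', as_index=False)
            .agg(diabetes=('diabetes', 'first'),
                 first_appointments_time=('scheduled_time', 'min')))

# The child table keeps only the key that links it to the new parent table
appointments_child = appointments.drop(columns=['diabetes'])

print(patients)
```

The patient-level column moves to the parent table, which is why the `appointments` entity has fewer columns after normalization.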


## Generating Features with Deep Feature Synthesis
With our data structured in an `EntitySet`, we can immediately build features across our entities and relationships with Deep Feature Synthesis (DFS). As an example, the feature `locations.PERCENT_TRUE(appointments.no_show)` calculates the fraction of past appointments at a location that ended in a no-show.

This is where the time indices get used. We set the `cutoff_time` for each row to be when the patient schedules the appointment. That means that DFS, while building features, will only use the data that is known when the appointment is made. In particular, it won't use the label to create features.
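
To see why the cutoff time matters, here is a hand-rolled pandas sketch of a location's no-show rate "as known at" a cutoff, next to the leaky version computed on all rows. The helper `location_no_show_rate` and the toy data are hypothetical. The sketch simply skips appointments whose outcome is not yet known; Featuretools' own handling of such missing values may differ.

```python
import pandas as pd

# Toy appointments table; values are invented for illustration
appointments = pd.DataFrame({
    'appointment_id': [1, 2, 3, 4],
    'neighborhood': ['A', 'A', 'B', 'A'],
    'scheduled_time': pd.to_datetime(['2016-04-01', '2016-04-02', '2016-04-03', '2016-04-10']),
    'appointment_day': pd.to_datetime(['2016-04-04', '2016-04-06', '2016-04-05', '2016-04-12']),
    'no_show': [True, False, True, False],
})

def location_no_show_rate(df, neighborhood, cutoff):
    """Hypothetical helper: share of no-shows at a location, as known at `cutoff`."""
    past = df[(df['neighborhood'] == neighborhood)
              & (df['scheduled_time'] <= cutoff)      # already scheduled
              & (df['appointment_day'] <= cutoff)]    # outcome already observed
    return past['no_show'].mean() if len(past) else float('nan')

# Feature value for appointment 4, computed at the moment it was scheduled
rate = location_no_show_rate(appointments, 'A', pd.Timestamp('2016-04-10'))

# Leaky alternative: uses every row, including appointment 4's own label
leaky_rate = appointments.loc[appointments['neighborhood'] == 'A', 'no_show'].mean()

print(rate, leaky_rate)
```

The two numbers differ because the leaky version includes the outcome of the very appointment we are trying to predict.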

In [17]:
# Take the index and the scheduled time to use as a cutoff time
cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']].sort_values(by='scheduled_time')

# Rename columns to avoid confusion
cutoff_times.rename(columns = {'scheduled_time': 'cutoff_time', 
                               'no_show': 'label'},
                    inplace = True)

In [67]:
# Generate features using the constructed entityset
fm, features = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=['count', 'percent_true'],
                      trans_primitives=['is_weekend', 'weekday', 'day', 'month', 'year'],
                      max_depth=3, 
                      approximate='3h',
                      cutoff_time=cutoff_times[20000:],
                      verbose=True)
fm.tail()

Built 38 features
Elapsed: 03:35 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks


appointment_id,label,neighborhood,gender,patient_id,age,WEEKEND(scheduled_time),WEEKEND(appointment_day),WEEKDAY(scheduled_time),WEEKDAY(appointment_day),DAY(scheduled_time),DAY(appointment_day),MONTH(scheduled_time),MONTH(appointment_day),YEAR(scheduled_time),YEAR(appointment_day),patients.COUNT(appointments),patients.PERCENT_TRUE(appointments.no_show),patients.PERCENT_TRUE(appointments.sms_received),patients.WEEKDAY(first_appointments_time),patients.DAY(first_appointments_time),patients.MONTH(first_appointments_time),patients.YEAR(first_appointments_time),locations.COUNT(appointments),locations.PERCENT_TRUE(appointments.no_show),locations.PERCENT_TRUE(appointments.sms_received),ages.COUNT(appointments),ages.PERCENT_TRUE(appointments.no_show),ages.PERCENT_TRUE(appointments.sms_received),genders.COUNT(appointments),genders.PERCENT_TRUE(appointments.no_show),genders.PERCENT_TRUE(appointments.sms_received),patients.PERCENT_TRUE(appointments.WEEKEND(scheduled_time)),patients.PERCENT_TRUE(appointments.WEEKEND(appointment_day)),locations.PERCENT_TRUE(appointments.WEEKEND(scheduled_time)),locations.PERCENT_TRUE(appointments.WEEKEND(appointment_day)),ages.PERCENT_TRUE(appointments.WEEKEND(scheduled_time)),ages.PERCENT_TRUE(appointments.WEEKEND(appointment_day)),genders.PERCENT_TRUE(appointments.WEEKEND(scheduled_time)),genders.PERCENT_TRUE(appointments.WEEKEND(appointment_day))
5790461,False,JARDIM CAMBURI,M,729255235141745.0,54,False,False,2,2,8,8,6,6,2016,2016,5,0.0,0.2,4,6,5,2016,7701.0,0.18,0.32,1529.0,0.17,0.34,38679,0.19,0.28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5790464,False,JARDIM CAMBURI,F,947614361749238.1,43,False,False,2,2,8,8,6,6,2016,2016,2,0.5,0.0,2,18,5,2016,7701.0,0.18,0.32,1343.0,0.22,0.34,71824,0.2,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5790466,False,JARDIM CAMBURI,M,356247857784.0,27,False,False,2,2,8,8,6,6,2016,2016,1,0.0,0.0,2,8,6,2016,7701.0,0.18,0.32,1375.0,0.23,0.34,38679,0.19,0.28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5790481,False,JARDIM CAMBURI,F,234131759175.0,30,False,False,2,2,8,8,6,6,2016,2016,1,0.0,0.0,2,8,6,2016,7701.0,0.18,0.32,1520.0,0.23,0.34,71824,0.2,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5790484,False,JARDIM CAMBURI,F,5237164264312.0,27,False,False,2,2,8,8,6,6,2016,2016,0,,,2,8,6,2016,7701.0,0.18,0.32,1375.0,0.23,0.34,71824,0.2,0.32,,,0.0,0.0,0.0,0.0,0.0,0.0


We have applied and stacked **primitives** like `MONTH`, `WEEKDAY` and `PERCENT_TRUE` to build features across all the `Entities` in our `EntitySet`.

Feel free to fork this kernel and modify the parameters. By doing so, you can get very different feature matrices. Here's a short overview of the keywords used:
+ `target_entity` is the entity for which we're building features. It would be equally easy to make a feature matrix for the `locations` entity.
+ `agg_primitives` and `trans_primitives` are lists of the primitives used while constructing features. The full list can be found by running `ft.list_primitives()`.
+ `max_depth=3` says to stack up to 3 primitives deep.
+ `approximate='3h'` rounds cutoff times into blocks that are 3 hours long for faster computation.
+ `cutoff_time` is a dataframe that says when to calculate each row.
+ `verbose=True` shows the progress bar.

For more information, see the [documentation](https://docs.featuretools.com/automated_feature_engineering/afe.html) of dfs.
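
As a rough picture of what `approximate='3h'` buys us, the sketch below groups a few scheduled times from the table at the top of the notebook into 3-hour blocks with pandas. We assume here that each cutoff is rounded down to the start of its block; the exact rounding rule inside Featuretools is not shown in this notebook.

```python
import pandas as pd

# Scheduled times taken from the first rows of the appointments table
cutoffs = pd.Series(pd.to_datetime([
    '2016-04-29 16:08:27',
    '2016-04-29 16:19:04',
    '2016-04-29 18:38:08',
]))

# Assumption: each cutoff is rounded down to the start of its 3-hour block
blocks = cutoffs.dt.floor('3h')

# Expensive aggregations then need one computation per distinct block
n_computations = blocks.nunique()

print(blocks.tolist(), n_computations)
```

Three cutoff times collapse into two blocks here, so aggregations on parent entities would be computed twice instead of three times, at the cost of slightly stale values.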

## Machine Learning
We can put the created feature matrix directly into sklearn. Similar to the other kernels, we do not do a good job predicting no-shows. With one unshuffled train test split, our `roc_auc_score` is roughly .5 with similar scores for F1 and K-first. 
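
Assuming the feature matrix is ordered by cutoff time, an unshuffled split trains on earlier appointments and tests on later ones. A minimal sketch of that idea with a made-up table (`fm_toy` and its values are hypothetical):

```python
import pandas as pd

# Toy feature matrix, already sorted by cutoff time (values are invented)
fm_toy = pd.DataFrame({
    'cutoff_time': pd.date_range('2016-05-01', periods=10, freq='D'),
    'label': [False, True, False, False, True, False, False, True, False, False],
})

# Keep the first 70% of rows for training and the last 30% for testing
split = (len(fm_toy) * 7) // 10
train, test = fm_toy.iloc[:split], fm_toy.iloc[split:]

# Every training row precedes every test row in time
no_overlap = train['cutoff_time'].max() < test['cutoff_time'].min()

print(len(train), len(test), no_overlap)
```

Splitting this way keeps the evaluation honest: the model is never scored on appointments that were scheduled before the ones it learned from.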

In [69]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

X = fm.copy().fillna(0)
label = X.pop('label')
X = X.drop(['patient_id', 'neighborhood', 'gender'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, label, test_size=0.30, shuffle=False)
clf = RandomForestClassifier(n_estimators=150)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)
print('AUC score of {:.2f}'.format(roc_auc_score(y_test, probs[:,1])))

AUC score of 0.53


In [71]:
feature_imps = [(imp, X.columns[i]) for i, imp in enumerate(clf.feature_importances_)]
feature_imps.sort()
feature_imps.reverse()
print('Random Forest Feature Importances:')
for i, f in enumerate(feature_imps[0:8]):
    print('{}: {} [{:.3f}]'.format(i + 1, f[1], f[0]/feature_imps[0][0]))


Random Forest Feature Importances:
1: locations.COUNT(appointments) [1.000]
2: ages.COUNT(appointments) [0.939]
3: age [0.885]
4: DAY(appointment_day) [0.864]
5: ages.PERCENT_TRUE(appointments.no_show) [0.646]
6: locations.PERCENT_TRUE(appointments.no_show) [0.645]
7: ages.PERCENT_TRUE(appointments.sms_received) [0.636]
8: locations.PERCENT_TRUE(appointments.sms_received) [0.636]


In [27]:
from bokeh.models import HoverTool
from bokeh.models.sources import ColumnDataSource
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from sklearn.metrics import precision_score, recall_score, f1_score

def plot_roc_auc(y_test, probs, pos_label=1):
    fpr, tpr, thresholds = roc_curve(y_test, 
                                     probs[:, 1], 
                                     pos_label=pos_label)


    output_notebook()
    p = figure(height=400, width=400)
    p.line(x=fpr, y=tpr)
    p.title.text ='Receiver operating characteristic'
    p.xaxis.axis_label = 'False Positive Rate'
    p.yaxis.axis_label = 'True Positive Rate'

    p.line(x=fpr, y=fpr, color='red', line_dash='dashed')
    return p

def plot_f1(y_test, probs, nprecs):
    threshes = [x/1000. for x in range(50, nprecs)]
    precisions = [precision_score(y_test, probs[:,1] > t) for t in threshes]
    recalls = [recall_score(y_test, probs[:,1] > t) for t in threshes]
    fones = [f1_score(y_test, probs[:,1] > t) for t in threshes]
    
    output_notebook()
    p = figure(height=400, width=400)
    p.line(x=threshes, y=precisions, color='green', legend='precision')
    p.line(x=threshes, y=recalls, color='blue', legend='recall')
    p.line(x=threshes, y=fones, color='red', legend='f1')
    p.xaxis.axis_label = 'Threshold'
    p.title.text = 'Precision, Recall, and F1 by Threshold'
    return p

def plot_kfirst(y_test, probs, firstk=500):
    # Rank the test rows from most to least likely no-show
    A = pd.DataFrame(probs)
    A['y_test'] = y_test.values
    A = A.sort_values(by=1, ascending=False)
    # Start at K=1 because precision is undefined for zero predictions
    krange = list(range(1, firstk + 1))
    precisions = []
    for K in krange:
        # Predict "no-show" for each of the K highest-ranked rows
        b = A['y_test'].iloc[:K]
        a = [1] * len(b)
        precisions.append(precision_score(b, a))
    
    output_notebook()
    p = figure(height=400, width=400)
    p.step(x=krange, y=precisions)
    p.xaxis.axis_label = 'Predictions sorted by most likely'
    p.yaxis.axis_label = 'Precision'
    p.title.text = 'K-first'
    p.yaxis[0].formatter.use_scientific = False
    return p

p1 = plot_roc_auc(y_test, probs)
p2 = plot_f1(y_test, probs, 1000)
p3 = plot_kfirst(y_test, probs, 300)



In [62]:
show(gridplot([p1, p2, p3], ncols=1))
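
For reference, the quantity a K-first curve is meant to report, precision among the K highest-scoring predictions, can be sketched in a few lines of NumPy. The helper `precision_at_k` and the toy scores are hypothetical and independent of the plotting code above.

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Hypothetical helper: fraction of true positives among the k highest scores."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Indices of the k largest scores, highest first
    top_k = np.argsort(-scores, kind='stable')[:k]
    return y_true[top_k].mean()

# Toy labels and predicted no-show probabilities, invented for illustration
y_true = [True, False, True, False, False]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]

p_at_1 = precision_at_k(y_true, scores, 1)
p_at_2 = precision_at_k(y_true, scores, 2)
p_at_3 = precision_at_k(y_true, scores, 3)
print(p_at_1, p_at_2, p_at_3)
```

A model with real signal should show high precision for small K that decays toward the base no-show rate as K grows.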

## Some Plots
An interesting workflow with this dataset is to plot generated features to learn about the data. Here, we'll show the number of appointments and the probability of a no-show, by neighborhood and by age, as created by DFS.

In [59]:
tmp = fm.groupby('neighborhood').apply(lambda df: df.tail(1))['locations.COUNT(appointments)'].sort_values().reset_index().reset_index()
hover = HoverTool(tooltips=[
    ("Count", "@{locations.COUNT(appointments)}"),
    ("Place", "@neighborhood"),
])
source = ColumnDataSource(tmp)
p4 = figure(width=400, 
           height=400,
           tools=[hover, 'box_zoom', 'reset', 'save'])
p4.scatter('index', 'locations.COUNT(appointments)', alpha=.7, source=source, color='teal')
p4.title.text = 'Appointments by Neighborhood'
p4.xaxis.axis_label = 'Neighborhoods (hover to view)'
p4.yaxis.axis_label = 'Count'

tmp = fm.groupby('neighborhood').apply(lambda df: df.tail(1))[['locations.COUNT(appointments)', 
                                                               'locations.PERCENT_TRUE(appointments.no_show)']].sort_values(
    by='locations.COUNT(appointments)').reset_index().reset_index()
hover = HoverTool(tooltips=[
    ("Prob", "@{locations.PERCENT_TRUE(appointments.no_show)}"),
    ("Place", "@neighborhood"),
])
source = ColumnDataSource(tmp)
p5 = figure(width=400, 
           height=400,
           tools=[hover, 'box_zoom', 'reset', 'save'])
p5.scatter('index', 'locations.PERCENT_TRUE(appointments.no_show)', alpha=.7, source=source, color='maroon')
p5.title.text = 'Probability of no-show by Neighborhood'
p5.xaxis.axis_label = 'Neighborhoods (hover to view)'
p5.yaxis.axis_label = 'Probability of no-show'

tmp = fm.tail(5000).groupby('age').apply(lambda df: df.tail(1))[['ages.COUNT(appointments)']].sort_values(
    by='ages.COUNT(appointments)').reset_index().reset_index()
hover = HoverTool(tooltips=[
    ("Count", "@{ages.COUNT(appointments)}"),
    ("Age", "@age"),
])
source = ColumnDataSource(tmp)
p6 = figure(width=400, 
           height=400,
           tools=[hover, 'box_zoom', 'reset', 'save'])
p6.scatter('age', 'ages.COUNT(appointments)', alpha=.7, source=source, color='magenta')
p6.title.text = 'Appointments by Age'
p6.xaxis.axis_label = 'Age'
p6.yaxis.axis_label = 'Count'

source = ColumnDataSource(X.tail(5000).groupby('age').apply(lambda x: x.tail(1)))

hover = HoverTool(tooltips=[
    ("Prob", "@{ages.PERCENT_TRUE(appointments.no_show)}"),
    ("Age", "@age"),
])

p7 = figure(title="Probability of no-show by Age", 
           x_axis_label='Age', 
           y_axis_label='Probability of no-show',
           width=400,
           height=400,
           tools=[hover, 'box_zoom', 'reset', 'save']
)

p7.scatter('age', 'ages.PERCENT_TRUE(appointments.no_show)', 
          alpha=.7, 
          source=source)



In [60]:
show(gridplot([p4, p6, p5, p7], ncols=2))
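
The plotting code above takes the last row of each group because DFS features are calculated as of each cutoff time, so the most recent row per neighborhood or age reflects the most data. A similar selection can be sketched on a made-up table with `groupby(...).tail(1)` (`fm_toy` is hypothetical):

```python
import pandas as pd

# Toy feature matrix ordered by cutoff time; counts grow as time passes
fm_toy = pd.DataFrame({
    'neighborhood': ['A', 'B', 'A', 'B', 'A'],
    'locations.COUNT(appointments)': [1, 1, 2, 2, 3],
})

# Keep only the most recent row for each neighborhood
latest = fm_toy.groupby('neighborhood').tail(1).set_index('neighborhood')

print(latest)
```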