# Featuretools to Predict Missed Appointments
In this notebook, we use [Featuretools](https://github.com/Featuretools/featuretools) to automatically generate features for predicting when patients don't show up for doctor appointments. We follow the approach in the most popular [kernel]() to demonstrate the ways in which Featuretools simplifies and extends common data science operations. To get started, download the [Kaggle](https://www.kaggle.com/joniarroba/noshowappointments/data) data on appointment no-shows and store it in a `data` folder in this repository.


In [1]:
import numpy as np
import pandas as pd
import featuretools as ft
ft.__version__

'0.1.17'

# Step 1: Set an EntitySet structure for Featuretools
We load the data from a CSV file in the `data` folder of this repository. A few column names contain typos, which we fix, and we map the `No-show` label from `'No'`/`'Yes'` to 0/1.

In [2]:
data = pd.read_csv("data/KaggleV2-May-2016.csv")
data.index = data['AppointmentID']
data.rename(columns = {'Hipertension': 'Hypertension',
                       'Handcap': 'Handicap',
                       'PatientId': 'PatientID',
                       'No-show': 'NoShow'}, inplace = True)
data['NoShow'] = data['NoShow'].map({'No': 0, 'Yes': 1})
data.head()

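# Keep a copy of the appointment date under a second column name
data['AppointmentTime'] = data['AppointmentDay']
# Cutoff times: one row per appointment, giving the time at which its features are calculated
cutoff_times = data[['AppointmentID', 'AppointmentDay']]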

In [85]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.layouts import column

# Count appointments per age and per neighbourhood
hist1 = data[['Age', 'PatientID']].groupby(['Age']).count()
hist2 = data[['Neighbourhood', 'PatientID']].groupby(['Neighbourhood']).count()

output_notebook()
# Bar chart of appointment counts by age
plot1 = figure(plot_width=1000, plot_height=300)
plot1.vbar(x=hist1.index, width=0.5, bottom=0, top=hist1.iloc[:,0])

# Bar chart of appointment counts by neighbourhood, with vertical tick labels
plot2 = figure(x_range = list(hist2.index), plot_width=1000, plot_height=500)
plot2.vbar(x=hist2.index, width=0.5, bottom=0, top=hist2.iloc[:,0])
plot2.xaxis.major_label_orientation = 1.57

show(column(plot1, plot2))



Next, we set up an [EntitySet](). An EntitySet stores data, its metadata, and the relationships between tables, which makes it possible to generate features automatically. Column types are inferred from the types `pandas` reads, but we may want to specify a different type for a column explicitly. For instance, `Age` defaults to a Numeric column, but it's more accurate to think of it as an ordered categorical type (Ordinal). Similarly, the EntitySet will see that `SMS_received` contains only numbers, but we can explicitly set it to Boolean.

In [3]:
import featuretools.variable_types as vtypes

# Give featuretools column metadata
variable_types = {'Gender': vtypes.Categorical,
                  'Age': vtypes.Ordinal,
                  'Scholarship': vtypes.Boolean,
                  'Hypertension': vtypes.Boolean,
                  'Diabetes': vtypes.Boolean,
                  'Alcoholism': vtypes.Boolean,
                  'Handicap': vtypes.Boolean,
                  'NoShow': vtypes.Boolean,
                  'SMS_received': vtypes.Boolean}

We call our EntitySet "appointment_data" and build it from the ground up. First, we use `entity_from_dataframe` to turn our data into an entity and to apply the variable types we just set. Additionally, we can set a time index to attach a particular datetime to every row. However, not all columns are known at the same time.

For this dataset, we have both the time an appointment is scheduled (`ScheduledDay`) and the actual time of the appointment (`AppointmentDay`). Notably, we won't know whether a person is a no-show until the actual appointment day, so we use a secondary time index to mark `NoShow` and `SMS_received` as available only at `AppointmentDay`. By setting up our entity in this way, we can trust Featuretools to handle time-based label leakage problems so that we can spend more time selecting our features.

In [4]:
# Create an `EntitySet` named `appointment_data`
es = ft.EntitySet('appointment_data')

# Make an entity named 'appointments' which stores dataset metadata with the dataframe
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='AppointmentID',
                              time_index='ScheduledDay',
                              secondary_time_index={'AppointmentDay': ['NoShow', 'SMS_received']},
                              variable_types=variable_types)
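
To make the time handling concrete, here is a minimal pandas sketch of the filtering that a cutoff time implies. It runs on a made-up three-row frame and is only an illustration of the idea, not Featuretools' internal implementation; the helper name `visible_at` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): three appointments
toy = pd.DataFrame({
    'AppointmentID': [1, 2, 3],
    'ScheduledDay': pd.to_datetime(['2016-04-01', '2016-04-10', '2016-04-20']),
    'AppointmentDay': pd.to_datetime(['2016-04-15', '2016-04-12', '2016-05-02']),
    'NoShow': [1, 0, 1],
})

def visible_at(df, cutoff):
    """Hypothetical helper: return the rows and values known at `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    # A row exists only once the appointment has been scheduled (time index)
    known = df[df['ScheduledDay'] <= cutoff].copy()
    # NoShow is revealed only on the appointment day (secondary time index)
    known['NoShow'] = known['NoShow'].astype(float)
    known.loc[known['AppointmentDay'] > cutoff, 'NoShow'] = np.nan
    return known

snapshot = visible_at(toy, '2016-04-13')
print(snapshot)
```

In this sketch, an appointment scheduled after the cutoff is dropped entirely, while an appointment that is scheduled but has not yet taken place keeps its row and loses only its `NoShow` value.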

Next, we choose some interesting variables to group by and make additional entities to represent those concepts. As an example, it's possible that `NoShow` is directly related to the age of the patient. By normalizing our entity, we create a structure which groups together appointments with the same age. We'll do the same thing with `Gender` and `PatientID`, so that if the same patient has more than one appointment we use all of that information together.

Notice that along with new entities the relationships are also automatically generated.

In [5]:
es.normalize_entity('appointments', 'patients', 'PatientID', 
                    make_time_index=True)
es.normalize_entity('appointments', 'ages', 'Age',
                    make_time_index=False)
es.normalize_entity('appointments', 'genders', 'Gender',
                    make_time_index=False)

Entityset: appointment_data
  Entities:
    appointments (shape = [110527, 15])
    patients (shape = [62299, 2])
    ages (shape = [104, 1])
    genders (shape = [2, 1])
  Relationships:
    appointments.PatientID -> patients.PatientID
    appointments.Age -> ages.Age
    appointments.Gender -> genders.Gender
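
As a rough picture of what normalizing produces, the sketch below builds a `patients` table by hand with pandas on made-up data. It assumes that the new entity holds one row per unique `PatientID` plus the time of that patient's first appointment (the `first_appointments_time` column that shows up in the feature names later on); it is an illustration, not what `normalize_entity` runs internally.

```python
import pandas as pd

# Toy data (made up for illustration): four appointments by two patients
appointments = pd.DataFrame({
    'AppointmentID': [10, 11, 12, 13],
    'PatientID': [1, 2, 1, 2],
    'ScheduledDay': pd.to_datetime(['2016-04-03', '2016-04-01',
                                    '2016-04-20', '2016-04-05']),
})

# One row per unique PatientID, plus the time of that patient's first appointment
patients = (appointments
            .groupby('PatientID')['ScheduledDay']
            .min()
            .rename('first_appointments_time')
            .reset_index())

# Every appointment's PatientID points at exactly one row of `patients`
merged = appointments.merge(patients, on='PatientID', how='left')
print(patients)
```

The `PatientID` column left in `appointments` then acts as the link between the two tables, which is the relationship listed in the output above.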

# Step 2: Create features with Deep Feature Synthesis
Next, if we so choose, we can create custom primitives to pass to Deep Feature Synthesis. In this case, following the example notebook, we'll calculate the probability of a boolean event as "number of times the event happens" divided by "number of events". We can pass our custom `Prob` primitive into Deep Feature Synthesis along with a host of other primitives to calculate a feature matrix.

In [6]:
# Custom primitive: TODO fix to correctly incorporate label data while training
from featuretools.primitives import make_agg_primitive
def probability(boolean):
    # Share of entries equal to 1; float() avoids integer division on Python 2
    numtrue = len([x for x in boolean if x==1])
    return float(numtrue)/len(boolean)

Prob = make_agg_primitive(probability,
                          input_types=[vtypes.Boolean],
                          return_type=vtypes.Numeric)
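
To see what this aggregation amounts to, the sketch below applies the same kind of function once per age group with plain pandas on made-up data. It ignores cutoff times entirely, so each group's value includes every label in the group; treat it as an illustration of the arithmetic only.

```python
import pandas as pd

def probability(boolean):
    # Same idea as the primitive's function: share of entries equal to 1
    values = list(boolean)
    return sum(1 for x in values if x == 1) / float(len(values))

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'Age': [30, 30, 30, 45, 45],
    'NoShow': [1, 0, 0, 1, 1],
})

# Apply the function once per age group, mirroring an aggregation
# from the appointments table up to an ages table
prob_by_age = toy.groupby('Age')['NoShow'].apply(probability)
print(prob_by_age)
```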
    
    

In [7]:
from featuretools.primitives import Weekday, Hour, Count, Day, NUnique
fm, features = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=[Prob, Count, NUnique],
                      trans_primitives=[Weekday, Hour, Day],
                      max_depth=3,
                      drop_contains=['AppointmentDay'],
                      cutoff_time=cutoff_times,
                      features_only=False,
                      verbose=True)

Building features: 154it [00:00, 4378.25it/s]
Progress: 100%|██████████| 27/27 [02:40<00:00,  5.95s/cutoff time]


Some of these features are strings, which most machine learning models cannot use directly. We'll use the built-in feature matrix encoder to make new columns which one-hot encode `Neighbourhood` and `Gender`.

In [8]:
fm, features = ft.synthesis.encode_features(fm, features, 
                                            top_n=5, 
                                            include_unknown=False, 
                                            to_encode=['Neighbourhood', 'Gender'], 
                                            inplace=False, 
                                            verbose=True)
fm.tail(5)

Encoding pass 1: 100%|██████████| 69/69 [00:00<00:00, 191.90feature/s]
Encoding pass 2: 100%|██████████| 74/74 [00:00<00:00, 1701.69feature/s]


| AppointmentID | PatientID | Neighbourhood = JARDIM CAMBURI | Neighbourhood = MARIA ORTIZ | Neighbourhood = RESISTÊNCIA | Neighbourhood = JARDIM DA PENHA | Neighbourhood = ITARARÉ | Gender = F | Gender = M | Age | Scholarship | ... | ages.NUM_UNIQUE(appointments.HOUR(ScheduledDay)) | ages.NUM_UNIQUE(appointments.HOUR(AppointmentTime)) | ages.NUM_UNIQUE(appointments.DAY(ScheduledDay)) | ages.NUM_UNIQUE(appointments.DAY(AppointmentTime)) | genders.NUM_UNIQUE(appointments.WEEKDAY(ScheduledDay)) | genders.NUM_UNIQUE(appointments.WEEKDAY(AppointmentTime)) | genders.NUM_UNIQUE(appointments.HOUR(ScheduledDay)) | genders.NUM_UNIQUE(appointments.HOUR(AppointmentTime)) | genders.NUM_UNIQUE(appointments.DAY(ScheduledDay)) | genders.NUM_UNIQUE(appointments.DAY(AppointmentTime)) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5790461 | 729255200000000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 54 | 0 | ... | 15 | 1 | 31 | 23 | 6 | 6 | 16 | 1 | 31 | 24 |
| 5790464 | 947614400000000.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 43 | 0 | ... | 15 | 1 | 31 | 23 | 6 | 6 | 16 | 1 | 31 | 24 |
| 5790466 | 356247900000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 27 | 0 | ... | 15 | 1 | 30 | 24 | 6 | 6 | 16 | 1 | 31 | 24 |
| 5790481 | 234131800000.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 30 | 0 | ... | 15 | 1 | 31 | 23 | 6 | 6 | 16 | 1 | 31 | 24 |
| 5790484 | 5237164000000.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 27 | 0 | ... | 15 | 1 | 30 | 24 | 6 | 6 | 16 | 1 | 31 | 24 |
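
The encoding step can be pictured with a small pandas sketch on made-up data. It assumes that `top_n` keeps the most frequent values of a column and that, with `include_unknown=False`, rows holding any other value get zeros in every indicator column; it mimics the column naming seen above and is not the library's own code.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({'Neighbourhood': ['A', 'B', 'A', 'C', 'A', 'B', 'D']})

top_n = 2
# The top_n most frequent values each get their own indicator column
top_values = toy['Neighbourhood'].value_counts().index[:top_n]

encoded = pd.DataFrame(index=toy.index)
for value in top_values:
    column_name = 'Neighbourhood = {}'.format(value)
    encoded[column_name] = (toy['Neighbourhood'] == value).astype(int)

print(encoded)
```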


# Step 3: Predict

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

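# Labels come from the appointments entity; features are the encoded feature matrix
labels = es['appointments'].df['NoShow']
X_train, X_test, y_train, y_test = train_test_split(fm, labels, test_size=0.20)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
# Note: scikit-learn documents the order as roc_auc_score(y_true, y_score);
# the call below passes the hard 0/1 predictions first.
print(roc_auc_score(preds, y_test))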

0.5443598919556305
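
For comparison, a more conventional way to compute ROC AUC is to score the positive-class probabilities against the true labels. The sketch below does this on synthetic data from `make_classification`, which stands in for the feature matrix; the fixed `random_state` values are assumptions added for repeatability, and the result says nothing about the score on the appointments data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the appointments feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Probability of the positive class, one score per test row
scores = clf.predict_proba(X_test)[:, 1]

# roc_auc_score expects the true labels first, then the scores
auc = roc_auc_score(y_test, scores)
print(auc)
```

Using probabilities rather than hard 0/1 predictions lets the metric rank test rows across all thresholds instead of a single one.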


In [10]:
# Pair each importance with its column name and list the five largest
feature_imps = [(imp, fm.columns[i]) for i, imp in enumerate(clf.feature_importances_)]
feature_imps.sort(reverse=True)
feature_imps[0:5]


[(0.08848723183679484, 'PatientID'),
 (0.060011229317557556, 'HOUR(ScheduledDay)'),
 (0.0488132810834492, 'patients.HOUR(first_appointments_time)'),
 (0.045632960436193866, 'ages.PROBABILITY(appointments.NoShow)'),
 (0.04557003016542001, 'patients.DAY(first_appointments_time)')]

In [11]:
from bokeh.plotting import figure, output_notebook, show 
from bokeh.models import HoverTool

# appointments on the 8th of the month: unique (age, no-show probability) pairs
x = fm[fm['DAY(AppointmentTime)']==8]['Age']
y = fm[fm['DAY(AppointmentTime)']==8]['ages.PROBABILITY(appointments.NoShow)']
x, y = zip(*set(zip(x,y)))


hover = HoverTool(tooltips=[
    ("Prob", "$y{%0}"),
    ("Age", "$x{0}"),
])
# display the plot inline in the notebook
output_notebook()
# create a new plot with a title and axis labels
p = figure(title="Probability patient doesn't show up by age", 
           x_axis_label='Age', 
           y_axis_label='Probability of NoShow',
           tools=[hover])

# add a scatter renderer with semi-transparent markers
p.scatter(x, y, alpha=.7, radius=1.5)

# show the results
show(p)

