# Using Featuretools to Predict Missed Appointments
In this notebook, [Featuretools](https://github.com/Featuretools/featuretools) to automatically generate features relating to when patients don't show up for doctor appointments. We follow the approach in the most popular [kernel]() to demonstrate the ways in which Featuretools simplifies and extends common data science operations. An advantage of using automated feature extraction is that it allows us to flip the standard data science process: we can explore our data *after* building interesting features.

To get started, download the [Kaggle](https://www.kaggle.com/joniarroba/noshowappointments/data) data on appointment noshows and store it in a `data` folder in this repository.


In [1]:
import numpy as np
import pandas as pd
import featuretools as ft
ft.__version__

'0.1.17'

# Structuring the Data
The key to the whole game is setting up a dataset in a way that accurately represents what we are interested in learning. To start, we'll load the data into a `pandas.DataFrame` and do some minor modifications so that I can more easily remember the column names.

In [2]:
data = pd.read_csv("data/KaggleV2-May-2016.csv")
data.index = data['AppointmentID']
data.rename(columns = {'Hipertension': 'Hypertension',
                       'Handcap': 'Handicap',
                       'PatientId': 'PatientID',
                       'Neighbourhood': 'Neighborhood',
                       'No-show': 'Label'}, inplace = True)
data['Label'] = data['Label'].map({'No': 0, 'Yes': 1})
data.head(2)



Unnamed: 0_level_0,PatientID,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighborhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,Label
AppointmentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
5642903,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,0
5642503,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,0


Next, we set up an [EntitySet](https://docs.featuretools.com/loading_data/using_entitysets.html). An EntitySet is a way of storing data, data metadata, and relationships which makes it possible to automatically generate features. 

Even though the types of columns have been set when we loaded our data into a `pandas.DataFrame`, we might want to specify a different type for a column explicitly. A standard example comes up in this dataset: `Age` is a Numeric type (every element is a number) but it's more accurate to think of it as an ordered categorical type of information (Ordinal). Our other columns are also Numeric, but would be more accurately described as Boolean (0 or 1).

In [3]:
import featuretools.variable_types as vtypes

# Give featuretools column metadata
variable_types = {'Gender': vtypes.Categorical,
                  'Age': vtypes.Ordinal,
                  'Scholarship': vtypes.Boolean,
                  'Hypertension': vtypes.Boolean,
                  'Diabetes': vtypes.Boolean,
                  'Alcoholism': vtypes.Boolean,
                  'Handicap': vtypes.Boolean,
                  'Label': vtypes.Boolean,
                  'SMS_received': vtypes.Boolean}

We call our EntitySet "Appointments" and build it from the ground up. First, we use `entity_from_dataframe` to turn our data into an entity and to apply the variable types. We also set a time index to attatch a particular datetime to every row. However, in this dataset, not all columns occur at the same time.

In the provided data we are given both the time an appointment is set (`ScheduledDay`), and the actual time of the appointment (`AppointmentDay`). Most notably, we won't know whether or not a person shows up to their appointment when they schedule it. Accounting for that difference by assigning `Labels` a `secondary_time_index` lets us lean on Deep Feature Synthesis to not leak Label information into any predictions we might make.

In [4]:
data['AppointmentTime'] = data['AppointmentDay']

# Create an `EntitySet` named `appointment_data`
es = ft.EntitySet('Appointments')

# Make an entity named 'appointments' which stores dataset metadata with the dataframe
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='AppointmentID',
                              time_index='ScheduledDay',
                              secondary_time_index={'AppointmentDay': ['Label', 'SMS_received']},
                              variable_types=variable_types)

Next, we choose some interesting variables to group by and make additional entities to represent those concepts. We can theorize that people might be more or less likely to show up depending on their age. By `normalizing` our entity, we create a structure which groups together events where the `Age` column is the same. We'll do the same thing with the gender of the patient and `PatientID`. It's easy to suspect that if a particular patient has cancelled in the past, that might have an impact on if they will in the future.

Notice that along with new entities the relationships are also automatically generated.

In [5]:
es.normalize_entity('appointments', 'patients', 'PatientID', 
                    make_time_index=True)
es.normalize_entity('appointments', 'ages', 'Age',
                    make_time_index=False)
es.normalize_entity('appointments', 'genders', 'Gender',
                    make_time_index=False)

Entityset: Appointments
  Entities:
    appointments (shape = [110527, 15])
    patients (shape = [62299, 2])
    ages (shape = [104, 1])
    genders (shape = [2, 1])
  Relationships:
    appointments.PatientID -> patients.PatientID
    appointments.Age -> ages.Age
    appointments.Gender -> genders.Gender

# Generating Features with Deep Feature Synthesis
With our data nicely structued in an **EntitySet**, we can immediately build useful features. We can create features with built in primitives like `Hour`, which returns the hour of a datetime and `PercentTrue`, which will calculate how often a boolean value is true. As an example, a features like `ages.PERCENT_TRUE(Label)` will calculate the probability that a patient doesn't show up based on their age.

It is at this stage that our time indices come into play. Since we set the cutoff time to be the Appointment Day, Deep Feature Synthesis will only use data that was obtained prior to that cutoff time for each calculation. This gives us the flexibility to not care about which value in the time series we're trying to predict: our data will be valid at each point. Let's calculate features and look at the first 20 columns.

In [6]:
from featuretools.primitives import Weekday, Hour, Count, Day, NUnique, PercentTrue

cutoff_times = data[['AppointmentID', 'AppointmentDay']]

fm, features = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=[Count, NUnique, PercentTrue],
                      trans_primitives=[Weekday, Hour, Day],
                      drop_contains=['AppointmentDay'],
                      max_depth=3,
                      cutoff_time=cutoff_times,
                      features_only=False,
                      verbose=True)
fm.head(3).iloc[:,30:40]

Building features: 154it [00:00, 4622.68it/s]
Progress: 100%|██████████| 27/27 [02:19<00:00,  5.15s/cutoff time]


Unnamed: 0_level_0,ages.NUM_UNIQUE(appointments.PatientID),ages.NUM_UNIQUE(appointments.Neighborhood),ages.NUM_UNIQUE(appointments.Gender),ages.PERCENT_TRUE(appointments.Scholarship),ages.PERCENT_TRUE(appointments.Hypertension),ages.PERCENT_TRUE(appointments.Diabetes),ages.PERCENT_TRUE(appointments.Alcoholism),ages.PERCENT_TRUE(appointments.Handicap),ages.PERCENT_TRUE(appointments.Label),ages.PERCENT_TRUE(appointments.SMS_received)
AppointmentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
5217179,184.0,58,2,0.003215,0.598071,0.135048,0.003215,0.118971,0.0,0.0
5218520,186.0,57,2,0.0,0.660714,0.164286,0.0,0.114286,0.0,0.0
5235449,353.0,63,2,0.003322,0.598007,0.285714,0.028239,0.061462,0.0,0.0


In [7]:
fm[['Age', 'Gender', 'Neighborhood', 'ages.PERCENT_TRUE(appointments.Label)']].head(3)

Unnamed: 0_level_0,Age,Gender,Neighborhood,ages.PERCENT_TRUE(appointments.Label)
AppointmentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5217179,84,M,SANTO ANDRÉ,0.0
5218520,83,F,REDENÇÃO,0.0
5235449,74,F,MONTE BELO,0.0


Some of these features are strings, which won't work with machine learning. We'll use the built in feature matrix encoder to make new columns which one-hot-encode 'Neighborhood' and 'Gender'.

In [8]:
enc_fm, _ = ft.synthesis.encode_features(fm, features, 
                                            top_n=5, 
                                            include_unknown=False, 
                                            to_encode=['Neighborhood', 'Gender'], 
                                            inplace=False, 
                                            verbose=True)
enc_fm.tail(5)

Encoding pass 1: 100%|██████████| 69/69 [00:00<00:00, 166.53feature/s]
Encoding pass 2: 100%|██████████| 74/74 [00:00<00:00, 2137.18feature/s]


Unnamed: 0_level_0,PatientID,Neighborhood = JARDIM CAMBURI,Neighborhood = MARIA ORTIZ,Neighborhood = RESISTÊNCIA,Neighborhood = JARDIM DA PENHA,Neighborhood = ITARARÉ,Gender = F,Gender = M,Age,Scholarship,...,ages.NUM_UNIQUE(appointments.HOUR(ScheduledDay)),ages.NUM_UNIQUE(appointments.HOUR(AppointmentTime)),ages.NUM_UNIQUE(appointments.DAY(ScheduledDay)),ages.NUM_UNIQUE(appointments.DAY(AppointmentTime)),genders.NUM_UNIQUE(appointments.WEEKDAY(ScheduledDay)),genders.NUM_UNIQUE(appointments.WEEKDAY(AppointmentTime)),genders.NUM_UNIQUE(appointments.HOUR(ScheduledDay)),genders.NUM_UNIQUE(appointments.HOUR(AppointmentTime)),genders.NUM_UNIQUE(appointments.DAY(ScheduledDay)),genders.NUM_UNIQUE(appointments.DAY(AppointmentTime))
AppointmentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5790461,729255200000000.0,1,0,0,0,0,0,1,54,0,...,15,1,31,23,6,6,16,1,31,24
5790464,947614400000000.0,1,0,0,0,0,1,0,43,0,...,15,1,31,23,6,6,16,1,31,24
5790466,356247900000.0,1,0,0,0,0,0,1,27,0,...,15,1,30,24,6,6,16,1,31,24
5790481,234131800000.0,1,0,0,0,0,1,0,30,0,...,15,1,31,23,6,6,16,1,31,24
5790484,5237164000000.0,1,0,0,0,0,1,0,27,0,...,15,1,30,24,6,6,16,1,31,24


# Step 3: Predict

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

labels = es['appointments'].df['Label']
X_train, X_test, y_train, y_test = train_test_split(enc_fm, labels, test_size=0.40)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(roc_auc_score(preds, y_test))

0.5489206884505459


In [10]:
feature_imps = [(imp, enc_fm.columns[i]) for i, imp in enumerate(clf.feature_importances_)]
feature_imps.sort()
feature_imps.reverse()
feature_imps[0:5]


[(0.08692924279356477, 'PatientID'),
 (0.05710289121394349, 'HOUR(ScheduledDay)'),
 (0.05051720687253581, 'ages.PERCENT_TRUE(appointments.Label)'),
 (0.048519700503394055, 'patients.HOUR(first_appointments_time)'),
 (0.045590429156840855, 'patients.DAY(first_appointments_time)')]

# Data exploration

In [11]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.layouts import column

hist1 = data[['Age', 'PatientID']].groupby(['Age']).count()
hist2 = data[['Neighborhood', 'PatientID']].groupby(['Neighborhood']).count()

output_notebook()
plot1 = figure(plot_width=900, plot_height=300, title='Patients by Age')
plot1.vbar(x=hist1.index, 
           width=0.5, 
           bottom=0, 
           top=hist1.iloc[:,0], 
           color='purple', 
           alpha=.7)

plot2 = figure(x_range = list(hist2.index), plot_width=900, plot_height=500, title='Patients by Neighbourhood')
plot2.vbar(x=hist2.index, width=0.5, 
           bottom=0, 
           top=hist2.iloc[:,0], 
           alpha=.5, 
           color='red')
plot2.xaxis.major_label_orientation = 1.57


show(column(plot1, plot2))

In [12]:
from bokeh.models import HoverTool

# prepare some data
x = fm[fm['DAY(AppointmentTime)']==8]['Age']
y = fm[fm['DAY(AppointmentTime)']==8]['ages.PERCENT_TRUE(appointments.Label)']
x, y = zip(*set(zip(x,y)))


hover = HoverTool(tooltips=[
    ("Prob", "$y{%0}"),
    ("Age", "$x{0}"),
])
# output to static HTML file
output_notebook()
# create a new plot with a title and axis labels
p = figure(title="Probability patient doesn't show up by age", 
           x_axis_label='Age', 
           y_axis_label='Probability of NoShow',
           tools=[hover])

# add a line renderer with legend and line thickness
p.scatter(x, y, alpha=.7, radius=1.5)

# show the results
show(p)

  elif np.issubdtype(type(obj), np.float):


In [13]:
temp = fm[['PatientID', 'patients.PERCENT_TRUE(appointments.Label)']]
temp['AppointmentID'] = fm.index
hist = temp.groupby(['PatientID']).count()

output_notebook()
plot1 = figure(plot_width=900, plot_height=300, title='Appointments by Patient')
plot1.vbar(x=hist.index, 
           width=0.5, 
           bottom=0, 
           top=hist.iloc[:,0], 
           color='green', 
           alpha=.7)
show(plot1)

In [14]:
n = 10
patientiddf = data[['PatientID', 'AppointmentID']].groupby(['PatientID']).count().sort_values(by='AppointmentID', ascending=False)
patients_with_lessthan10 = patientiddf[patientiddf['AppointmentID']<=n]
patients_with_morethan10 = patientiddf.shape[0] - patients_with_lessthan10.shape[0]
print("{} percent of patients have {} or fewer appointments".format(
        patients_with_lessthan10.shape[0]/patientiddf.shape[0] * 100, n))

hist = patients_with_lessthan10['AppointmentID']
output_notebook()
plot1 = figure(plot_width=900, plot_height=300, title='Appointments by Patient')
plot1.vbar(x=hist.index, 
            width=0.5, 
            bottom=0, 
            top=hist, 
            color='green', 
            alpha=.7)
show(plot1)

99.60191977399316 percent of patients have 10 or fewer appointments


In [37]:
from featuretools.primitives import make_trans_primitive
cutoff_times = data[['AppointmentID', 'AppointmentDay']]

def high_volume(index):
    A = pd.DataFrame(index, columns=['Name'])
    A['ones'] = 1
    gb = A.groupby(['Name']).count().to_dict()['ones']
    #import pdb; pdb.set_trace()
    output=[]

    for key in index:
        if gb[key] >= 5:
            output.append(1)
        else:
            output.append(0)
    return output
    
FreqFlier = make_trans_primitive(high_volume,
                                 input_types=[vtypes.Id],
                                 cls_attributes={'needs_all_values': True},
                                 return_type=vtypes.Boolean)
fm, features = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=[Count],
                      drop_contains=['AppointmentDay'],
                      max_depth=3,
                      cutoff_time=cutoff_times,
                      trans_primitives=[FreqFlier],
                      features_only=False,
                      verbose=True)



Building features: 0it [00:00, ?it/s][A[A

Building features: 28it [00:00, 8383.22it/s][A[A

Progress:   0%|          | 0/27 [00:00<?, ?cutoff time/s][A[A

KeyError: "['ages.COUNT(appointments)'] not in index"

In [36]:
fm.head(20)

Unnamed: 0_level_0,PatientID,Neighborhood,Gender,Age,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,HIGH_VOLUME(PatientID),HIGH_VOLUME(Gender),HIGH_VOLUME(Age)
AppointmentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
5217179,1423329000000.0,SANTO ANDRÉ,M,84,0,1,1,0,1,0,1,1
5218520,4616858000000.0,REDENÇÃO,F,83,0,1,0,0,0,0,1,1
5235449,55589630000000.0,MONTE BELO,F,74,0,0,0,0,0,0,1,1
5235643,91896940000000.0,GURIGICA,F,70,0,1,1,0,0,0,1,1
5235655,1534482000000.0,JUCUTUQUARA,F,87,0,0,0,0,0,0,1,1
5236116,313648100000000.0,REDENÇÃO,M,71,0,1,1,0,0,0,1,1
5236380,159618300000000.0,PRAIA DO CANTO,F,88,0,1,0,0,0,0,1,1
5303666,96467680000000.0,RESISTÊNCIA,F,1,0,0,0,0,0,0,1,1
5304747,743764600000000.0,MARUÍPE,M,48,0,1,0,0,0,0,1,1
5317449,7414865000000.0,JESUS DE NAZARETH,F,77,0,1,0,0,0,1,1,1


In [17]:
es['patients']

Entity: patients
  Variables:
    PatientID (dtype: index)
    first_appointments_time (dtype: datetime_time_index)
  Shape:
    (62299, 2)