## ❓ No Show Prediction

Given *data about medical appointments*, let's try to predict whether a given patient will be a **no-show**.

We will use a TensorFlow artificial neural network (ANN) to make our predictions.

Data source: https://www.kaggle.com/datasets/joniarroba/noshowappointments

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import tensorflow as tf

from sklearn.metrics import classification_report, confusion_matrix



In [19]:
tf.random.set_seed(100)

In [2]:
data = pd.read_csv('KaggleV2-May-2016.csv')
data

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,2.987250e+13,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,5.589978e+14,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4.262962e+12,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,8.679512e+11,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8.841186e+12,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,2.572134e+12,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,3.596266e+12,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,1.557663e+13,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,9.213493e+13,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [4]:
data.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


### Cleaning

In [5]:
print("Total Missing Values: ", data.isna().sum().sum())

Total Missing Values:  0


In [6]:
{column: len(data[column].unique()) for column in data.columns}

{'PatientId': 62299,
 'AppointmentID': 110527,
 'Gender': 2,
 'ScheduledDay': 103549,
 'AppointmentDay': 27,
 'Age': 104,
 'Neighbourhood': 81,
 'Scholarship': 2,
 'Hipertension': 2,
 'Diabetes': 2,
 'Alcoholism': 2,
 'Handcap': 5,
 'SMS_received': 2,
 'No-show': 2}

In [7]:
data = data.drop(['PatientId', 'AppointmentID'], axis=1)
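The `describe()` output above reports a minimum `Age` of -1, which cannot be a real age. The notebook keeps such rows as they are. As a minimal sketch of one possible extra cleaning step, the toy frame below (illustrative values, not the real data) filters out negative ages. Whether to drop or impute these rows is a judgment call.

```python
import pandas as pd

# Toy frame standing in for the appointments data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [62, -1, 8, 115],
    "No-show": ["No", "No", "Yes", "No"],
})

# Keep only rows with a non-negative age and rebuild a clean 0..n-1 index.
cleaned = toy[toy["Age"] >= 0].reset_index(drop=True)

print(len(toy) - len(cleaned), "row(s) removed")
```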

### Feature Engineering

In [8]:
data

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...
110522,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


In [9]:
data.rename(columns = {'ScheduledDay': 'ScheduledDate', 'AppointmentDay': 'AppointmentDate'}, inplace=True)
data

Unnamed: 0,Gender,ScheduledDate,AppointmentDate,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...
110522,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


In [10]:
data['ScheduledYear'] = data['ScheduledDate'].apply(lambda x: int(x[0:4]))
data['ScheduledMonth'] = data['ScheduledDate'].apply(lambda x: int(x[5:7]))
data['ScheduledDay'] = data['ScheduledDate'].apply(lambda x: int(x[8:10]))
data['ScheduledHour'] = data['ScheduledDate'].apply(lambda x: int(x[11:13]))
data['ScheduledMinute'] = data['ScheduledDate'].apply(lambda x: int(x[14:16]))
data['ScheduledSecond'] = data['ScheduledDate'].apply(lambda x: int(x[17:19]))

data['AppointmentYear'] = data['AppointmentDate'].apply(lambda x: int(x[0:4]))
data['AppointmentMonth'] = data['AppointmentDate'].apply(lambda x: int(x[5:7]))
data['AppointmentDay'] = data['AppointmentDate'].apply(lambda x: int(x[8:10]))
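The string slicing above works because every timestamp follows the same fixed-width ISO 8601 layout. As an alternative sketch, `pd.to_datetime` can parse the strings and expose the same components through the `.dt` accessor. The `WaitDays` column is a hypothetical extra feature that the notebook does not use; it is shown only to illustrate what parsed dates make easy.

```python
import pandas as pd

# Two timestamp pairs copied from the table above.
toy = pd.DataFrame({
    "ScheduledDate": ["2016-04-29T18:38:08Z", "2016-05-03T09:15:35Z"],
    "AppointmentDate": ["2016-04-29T00:00:00Z", "2016-06-07T00:00:00Z"],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) datetimes.
scheduled = pd.to_datetime(toy["ScheduledDate"])
appointment = pd.to_datetime(toy["AppointmentDate"])

# Same components the notebook extracts by slicing.
toy["ScheduledHour"] = scheduled.dt.hour
toy["AppointmentMonth"] = appointment.dt.month

# Hypothetical feature (not in the notebook): whole calendar days
# between the booking date and the appointment date.
toy["WaitDays"] = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days

print(toy[["ScheduledHour", "AppointmentMonth", "WaitDays"]])
```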

In [11]:
data = data.drop(['ScheduledDate', 'AppointmentDate'], axis=1)
data

Unnamed: 0,Gender,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,ScheduledYear,ScheduledMonth,ScheduledDay,ScheduledHour,ScheduledMinute,ScheduledSecond,AppointmentYear,AppointmentMonth,AppointmentDay
0,F,62,JARDIM DA PENHA,0,1,0,0,0,0,No,2016,4,29,18,38,8,2016,4,29
1,M,56,JARDIM DA PENHA,0,0,0,0,0,0,No,2016,4,29,16,8,27,2016,4,29
2,F,62,MATA DA PRAIA,0,0,0,0,0,0,No,2016,4,29,16,19,4,2016,4,29
3,F,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,2016,4,29,17,29,31,2016,4,29
4,F,56,JARDIM DA PENHA,0,1,1,0,0,0,No,2016,4,29,16,7,23,2016,4,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,F,56,MARIA ORTIZ,0,0,0,0,0,1,No,2016,5,3,9,15,35,2016,6,7
110523,F,51,MARIA ORTIZ,0,0,0,0,0,1,No,2016,5,3,7,27,33,2016,6,7
110524,F,21,MARIA ORTIZ,0,0,0,0,0,1,No,2016,4,27,16,3,52,2016,6,7
110525,F,38,MARIA ORTIZ,0,0,0,0,0,1,No,2016,4,27,15,9,23,2016,6,7


In [12]:
{column: data[column].unique() for column in data.select_dtypes('object').columns}

{'Gender': array(['F', 'M'], dtype=object),
 'Neighbourhood': array(['JARDIM DA PENHA', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
        'REPÚBLICA', 'GOIABEIRAS', 'ANDORINHAS', 'CONQUISTA',
        'NOVA PALESTINA', 'DA PENHA', 'TABUAZEIRO', 'BENTO FERREIRA',
        'SÃO PEDRO', 'SANTA MARTHA', 'SÃO CRISTÓVÃO', 'MARUÍPE',
        'GRANDE VITÓRIA', 'SÃO BENEDITO', 'ILHA DAS CAIEIRAS',
        'SANTO ANDRÉ', 'SOLON BORGES', 'BONFIM', 'JARDIM CAMBURI',
        'MARIA ORTIZ', 'JABOUR', 'ANTÔNIO HONÓRIO', 'RESISTÊNCIA',
        'ILHA DE SANTA MARIA', 'JUCUTUQUARA', 'MONTE BELO',
        'MÁRIO CYPRESTE', 'SANTO ANTÔNIO', 'BELA VISTA', 'PRAIA DO SUÁ',
        'SANTA HELENA', 'ITARARÉ', 'INHANGUETÁ', 'UNIVERSITÁRIO',
        'SÃO JOSÉ', 'REDENÇÃO', 'SANTA CLARA', 'CENTRO', 'PARQUE MOSCOSO',
        'DO MOSCOSO', 'SANTOS DUMONT', 'CARATOÍRA', 'ARIOVALDO FAVALESSA',
        'ILHA DO FRADE', 'GURIGICA', 'JOANA D´ARC', 'CONSOLAÇÃO',
        'PRAIA DO CANTO', 'BOA VISTA', 'MORADA DE CAMBURI', 'SANT

### Encoding

In [13]:
def binary_encode(df, column, positive_value):
    df = df.copy()
    df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
    return df

def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [14]:
data = binary_encode(data, 'Gender', positive_value='M')
data = binary_encode(data, 'No-show', positive_value='Yes')

data = onehot_encode(data, 'Neighbourhood', prefix='N')
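To make the effect of the two helpers concrete, here is a small sketch that applies them to a three-row toy frame. The helper definitions are copied from the cell above so the block runs on its own; the toy values are illustrative.

```python
import pandas as pd

def binary_encode(df, column, positive_value):
    # Map the positive value to 1 and everything else to 0.
    df = df.copy()
    df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
    return df

def onehot_encode(df, column, prefix):
    # Replace the column with one 0/1 indicator column per distinct value.
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

toy = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Neighbourhood": ["CENTRO", "MARIA ORTIZ", "CENTRO"],
})

encoded = binary_encode(toy, "Gender", positive_value="M")
encoded = onehot_encode(encoded, "Neighbourhood", prefix="N")
print(encoded)
```

Each neighbourhood becomes its own indicator column, which is why the real data set grows by 81 columns (one per neighbourhood) after this step.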

In [15]:
data

Unnamed: 0,Gender,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,ScheduledYear,...,N_SANTOS REIS,N_SEGURANÇA DO LAR,N_SOLON BORGES,N_SÃO BENEDITO,N_SÃO CRISTÓVÃO,N_SÃO JOSÉ,N_SÃO PEDRO,N_TABUAZEIRO,N_UNIVERSITÁRIO,N_VILA RUBIM
0,0,62,0,1,0,0,0,0,0,2016,...,0,0,0,0,0,0,0,0,0,0
1,1,56,0,0,0,0,0,0,0,2016,...,0,0,0,0,0,0,0,0,0,0
2,0,62,0,0,0,0,0,0,0,2016,...,0,0,0,0,0,0,0,0,0,0
3,0,8,0,0,0,0,0,0,0,2016,...,0,0,0,0,0,0,0,0,0,0
4,0,56,0,1,1,0,0,0,0,2016,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,0,56,0,0,0,0,0,1,0,2016,...,0,0,0,0,0,0,0,0,0,0
110523,0,51,0,0,0,0,0,1,0,2016,...,0,0,0,0,0,0,0,0,0,0
110524,0,21,0,0,0,0,0,1,0,2016,...,0,0,0,0,0,0,0,0,0,0
110525,0,38,0,0,0,0,0,1,0,2016,...,0,0,0,0,0,0,0,0,0,0


### Splitting/Scaling

In [16]:
y = data['No-show'].copy()
X = data.drop('No-show', axis=1)

In [17]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)
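The cells above fit the scaler on all rows before splitting, so the means and standard deviations used for scaling also reflect the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the array names and sizes are assumptions for illustration, and the results reported in this notebook come from the original ordering.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(100)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Split first...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=100
)

# ...then learn the scaling statistics from the training rows only.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
# The test rows are transformed with the training statistics.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```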

### Training

In [22]:
y_train.mean()

0.2014657222624341

In [27]:
print("Class Distribution (Positive to Negative): {:.1f}% / {:.1f}%".format(y_train.mean()*100, (1 - y_train.mean())*100))

Class Distribution (Positive to Negative): 20.1% / 79.9%
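With only about 20% positives, a model can reach high accuracy while rarely predicting a no-show. One common way to counter this, which the notebook does not apply, is to weight each class inversely to its frequency. The sketch below computes such weights on synthetic labels using the "balanced" heuristic `n_samples / (n_classes * class_count)`. A dictionary of this form can be passed to the `class_weight` argument of Keras `model.fit`.

```python
import numpy as np

# Synthetic labels with the same rough 20% / 80% split as above.
y_toy = np.array([1] * 20 + [0] * 80)

n_samples = len(y_toy)
counts = np.bincount(y_toy)  # counts[0] = negatives, counts[1] = positives
n_classes = len(counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
class_weight = {
    c: float(n_samples / (n_classes * counts[c])) for c in range(n_classes)
}

print(class_weight)
```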


In [29]:
inputs = tf.keras.Input(shape=(X.shape[1],))  # trailing comma makes the shape a one-element tuple
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100


### Results

In [30]:
model.evaluate(X_test, y_test)



[0.4623696208000183, 0.795349657535553, 0.7088032960891724]

In [31]:
y_true = np.array(y_test)
y_pred = np.squeeze(np.array(model.predict(X_test) >= 0.5, dtype=int))



In [32]:
print("Classification Report:\n\n", classification_report(y_true, y_pred))

Classification Report:

               precision    recall  f1-score   support

           0       0.80      0.98      0.88     26427
           1       0.47      0.06      0.11      6732

    accuracy                           0.80     33159
   macro avg       0.64      0.52      0.50     33159
weighted avg       0.74      0.80      0.73     33159



In [33]:
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

Confusion Matrix:
 [[25966   461]
 [ 6325   407]]
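The figures in the classification report can be re-derived from the confusion matrix, which helps when interpreting them. The sketch below recomputes the class-1 (no-show) precision, recall and F1 score, along with overall accuracy and the accuracy of a baseline that always predicts class 0. The matrix values are copied from the output above.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
cm = np.array([[25966, 461],
               [6325, 407]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted no-shows, how many were real
recall = tp / (tp + fn)      # of real no-shows, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting class 0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} majority-class baseline={baseline:.4f}")
```

The model's accuracy is slightly below the majority-class baseline, and it catches only about 6% of actual no-shows, so accuracy alone overstates how useful these predictions are.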
