# Investigating the medical appointment no-show trends
The dataset contains information on about 110,000 medical appointments (110,521 rows in the cleaned file used here) that took place in Brazil between April and June 2016. It focuses on whether patients attended. Each row records characteristics of the patient and the appointment, such as gender, age, scheduling and appointment dates, health conditions, medical scholarship, and SMS reminders.

In this report I examine several factors and then fit a classification model to assess whether they affect patients' attendance at scheduled medical appointments:

- Age
- Long-term health condition or disability
- Medical Scholarship
- Waiting time for scheduled appointment if it was booked in advance
- SMS reminders

In [1]:
# libraries
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")
%matplotlib inline
# Logistic regression class
from sklearn.linear_model import LogisticRegression


In [2]:
# load the data
df = pd.read_csv("clean_nsa.csv")
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No_show,No_show_b,Waiting_time,Waiting_period,Age_group
0,29872500000000.0,5642903,F,2016-04-29,2016-04-29,62,JARDIM DA PENHA,0,1,0,0,0,0,No,0,0,Day,Adult
1,558997800000000.0,5642503,M,2016-04-29,2016-04-29,56,JARDIM DA PENHA,0,0,0,0,0,0,No,0,0,Day,Adult
2,4262962000000.0,5642549,F,2016-04-29,2016-04-29,62,MATA DA PRAIA,0,0,0,0,0,0,No,0,0,Day,Adult
3,867951200000.0,5642828,F,2016-04-29,2016-04-29,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,0,0,Day,Child
4,8841186000000.0,5642494,F,2016-04-29,2016-04-29,56,JARDIM DA PENHA,0,1,1,0,0,0,No,0,0,Day,Adult


In [3]:
df.tail()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No_show,No_show_b,Waiting_time,Waiting_period,Age_group
110516,2572134000000.0,5651768,F,2016-05-03,2016-06-07,56,MARIA ORTIZ,0,0,0,0,0,1,No,0,35,Trimester,Adult
110517,3596266000000.0,5650093,F,2016-05-03,2016-06-07,51,MARIA ORTIZ,0,0,0,0,0,1,No,0,35,Trimester,Adult
110518,15576630000000.0,5630692,F,2016-04-27,2016-06-07,21,MARIA ORTIZ,0,0,0,0,0,1,No,0,41,Trimester,Young adult
110519,92134930000000.0,5630323,F,2016-04-27,2016-06-07,38,MARIA ORTIZ,0,0,0,0,0,1,No,0,41,Trimester,Adult
110520,377511500000000.0,5629448,F,2016-04-27,2016-06-07,54,MARIA ORTIZ,0,0,0,0,0,1,No,0,41,Trimester,Adult


In [4]:
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No_show_b,Waiting_time
count,110521.0,110521.0,110521.0,110521.0,110521.0,110521.0,110521.0,110521.0,110521.0,110521.0,110521.0
mean,147490600000000.0,5675304.0,37.089386,0.098271,0.197257,0.071869,0.030401,0.020259,0.321043,0.201898,10.184345
std,256086000000000.0,71296.91,23.109885,0.297682,0.397929,0.258272,0.17169,0.140884,0.466879,0.401419,15.255153
min,39217.84,5030230.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172457000000.0,5640284.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731850000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,94389630000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,15.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,179.0


In [5]:
# Remove No_show, leaving only No_show_b, as this will be used in the model.
# No_show_b encodes missed appointments: 0 = No (attended), 1 = Yes (missed).
# drop() returns a new DataFrame; combining inplace=True with assignment would set df to None.
df = df.drop(columns=['No_show'])

### EDA Summary
The exploratory analysis of this particular dataset suggests the following:
- Age: certain age groups, notably teenagers and young adults, appear more likely to miss appointments.
- Long-term health condition or disability: these have little effect on the probability of a no-show.
- Medical Scholarship: patients with a scholarship tend to miss their appointments more often than those without one.
- Waiting time: the longer the gap between the booking date and the appointment, the fewer patients show up.
- SMS reminders: these have a slight effect on attendance, but the effect weakens as time passes.
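Comparisons like the ones summarised above can be obtained by averaging `No_show_b` within each group, since the mean of a 0/1 column is the share of missed appointments. The following is a minimal sketch of that idea, not the notebook's original EDA code: the helper `no_show_rate_by` is a hypothetical name, and the `toy` frame is made-up data that only borrows the dataset's column names.

```python
import pandas as pd


def no_show_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of missed appointments (No_show_b == 1) per value of `column`."""
    return frame.groupby(column)["No_show_b"].mean().sort_values(ascending=False)


# Made-up sample for illustration only; it does not reflect the real data.
toy = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 1, 1, 1, 1],
    "No_show_b":    [0, 0, 0, 1, 0, 0, 1, 1],
})

rates = no_show_rate_by(toy, "SMS_received")
print(rates)
```

On the real data the same call could be applied to `df` with columns such as `Age_group`, `Scholarship`, or `Waiting_period`.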

# Logistic regression
Since the outcome has only two possible values (show or no-show), this is a binary classification problem. I propose to use a logistic regression model to estimate the probability of a medical appointment no-show.
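For reference, the standard form of a logistic regression model is written out here. The notation is introduced for illustration and is not the notebook's own: $\mathbf{x}$ is the vector of patient features, $\beta_0$ and $\boldsymbol{\beta}$ are the fitted intercept and coefficients, and $y = 1$ corresponds to `No_show_b = 1`.

```latex
P(y = 1 \mid \mathbf{x}) = \sigma\!\left(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

A positive coefficient raises the estimated no-show probability as the corresponding feature increases, and a negative one lowers it.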

In [None]:
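# Features and target. The column choice is an assumption based on the factors
# listed in the introduction; the notebook does not show how X and y were built.
feature_cols = ["Age", "Scholarship", "Hypertension", "Diabetes",
                "Alcoholism", "Handicap", "SMS_received", "Waiting_time"]
X = df[feature_cols]
y = df["No_show_b"]

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response values for the observations in X
logreg.predict(X)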
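
Predicting on the same rows used for fitting tends to give an optimistic picture. In addition, roughly 20% of appointments in this dataset are no-shows (the mean of `No_show_b` is about 0.20), so a model that always predicts "show" would already reach about 80% accuracy. The sketch below shows one common way to check a model against that kind of baseline with a hold-out split. It runs on synthetic stand-in data, and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the appointment dataset): one feature playing
# the role of waiting time, and a binary target that becomes more likely
# as the feature grows.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 60, size=(2000, 1))
p_demo = 1 / (1 + np.exp(-(X_demo[:, 0] - 45) / 10))
y_demo = (rng.uniform(size=2000) < p_demo).astype(int)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy of always predicting the majority class, for comparison.
baseline_acc = max(y_test.mean(), 1 - y_test.mean())
test_acc = accuracy_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)  # share of class-1 rows that were caught

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {test_acc:.3f}")
print(f"recall (class 1):  {test_recall:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by the feature matrix and the `No_show_b` column. Recall on the no-show class is reported alongside accuracy because accuracy alone can look good on an imbalanced target.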