# No-Show Appointments

## 1. Introduction

### 1.1 About the Data
This dataset collects information from about 100k medical appointments in public hospitals in Vitoria, Brazil. The objective of this analysis is to investigate which characteristics make patients more likely to miss their appointments. The dataset was taken from [Kaggle](https://www.kaggle.com/joniarroba/noshowappointments).

### 1.2 Questions
The questions below will be answered in this analysis.
- What factors are important in order to predict if a patient will show up for their scheduled appointment?
- Do different age groups miss their appointments more frequently than others?
- Does the size of the hospital have an effect on the no-show rate?

## 2. Data Wrangling
### 2.1 Importing necessary packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date
import calendar

%matplotlib inline

### 2.2 Data Gathering

The data is imported as a DataFrame using pandas `read_csv`.

In [None]:
df = pd.read_csv('../input/KaggleV2-May-2016.csv')

### 2.3 Data Assessment

Here the data is explored to find the structure and check for missing values.

In [None]:
df.head()

In [None]:
df.info()

No values are missing, although some column headers contain typos and don't follow naming conventions. Some columns also have the wrong data types.
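The two checks described above can also be made explicit. The following is a minimal sketch on a toy frame; the column names and values are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy frame with invented values; column names are illustrative assumptions.
sample = pd.DataFrame({
    'PatientId': [1.0, 2.0, 3.0],
    'ScheduledDay': ['2016-04-29T18:38:08Z', '2016-04-29T16:08:27Z', '2016-04-29T16:19:04Z'],
    'Age': [62, 56, None],
    'No-show': ['No', 'No', 'Yes'],
})

# Count the missing values in each column.
missing_per_column = sample.isnull().sum()

# Check whether a date column is really stored as a datetime type.
stored_as_datetime = pd.api.types.is_datetime64_any_dtype(sample['ScheduledDay'])

print(missing_per_column)
print('ScheduledDay stored as datetime:', stored_as_datetime)
```

In this toy frame the dates are plain strings, which is the kind of data type problem that the cleaning step has to fix.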

### 2.4 Data Cleaning

The columns are renamed for better readability and to correct typos.

In [None]:
df.rename(columns=lambda x: x.lower(), inplace=True)

In [None]:
df.rename(columns={'patientid':'patient_id', 'appointmentid':'appointment_id', 
                   'scheduledday':'scheduled_day', 'appointmentday':'appointment_day',
                   'neighbourhood':'neighborhood', 'scholarship':'bolsa_familia',
                   'hipertension':'hypertension', 'handcap':'handicap',
                   'no-show':'no_show'}, inplace=True)

The `scholarship` column was renamed to `bolsa_familia` to avoid any confusion about what this column represents.

Next the unique values for every column are printed to check for wrong data and outliers.

In [None]:
print('Gender:',df.gender.unique())
print('Age:',sorted(df.age.unique()))
print('Neighborhood:',df.neighborhood.unique())
print('Bolsa Familia:',df.bolsa_familia.unique())
print('Hypertension:',df.hypertension.unique())
print('Diabetes:',df.diabetes.unique())
print('Alcoholism:',df.alcoholism.unique())
print('Handicap:',df.handicap.unique())
print('SMS Received:',df.sms_received.unique())
print('No-show:',df.no_show.unique())

The age column seems to have some entries with negative age and some entries with age over 100 years. Entries with these ages will be treated as outliers and will be removed from the data.

In [None]:
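# .copy() makes the filtered frame independent, so later column assignments don't raise SettingWithCopyWarning.
df = df[(df.age >= 0) & (df.age <= 100)].copy()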
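Before dropping rows, it can help to count how many fall outside the accepted range. This is a small sketch on made-up ages, not the notebook's data; `Series.between` is inclusive on both ends by default, which matches the `>= 0` and `<= 100` conditions.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([-1, 0, 5, 34, 62, 100, 102, 115], name='age')

# True for ages inside the accepted range, bounds included.
valid = ages.between(0, 100)

rows_kept = int(valid.sum())
rows_dropped = int((~valid).sum())

print('rows kept:', rows_kept)
print('rows dropped:', rows_dropped)
```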

The `scheduled_day` and `appointment_day` columns are not useful as they are. However, we can extract from them the weekday of the appointment and the waiting time between the scheduled day and the appointment day. To do this, both columns need to be converted to datetime.

*It would have been useful to extract the time of the appointment; however, the actual appointment hour was not included in the data.

In [None]:
df['scheduled_day'] = pd.to_datetime(df['scheduled_day'])
df['appointment_day'] = pd.to_datetime(df['appointment_day'])

Change the Yes / No values in `no_show` to 'no_show' and 'show' to avoid confusing the meaning of Yes and No in this case.

In [None]:
# Assign the result back instead of replacing in place on an attribute-accessed column.
df['no_show'] = df['no_show'].replace({'Yes': 'no_show', 'No': 'show'})

Create a new column called `waiting_days` that indicates how many days passed between scheduling the appointment and the appointment itself. Negative values occur because the `appointment_day` column doesn't include the time of the appointment, so a same-day appointment appears to take place before it was scheduled. These values are set to 0.

In [None]:
df['waiting_days'] = df['appointment_day'] - df['scheduled_day']

In [None]:
# .dt.days extracts whole days; astype('timedelta64[D]') is not supported by recent pandas versions.
df['waiting_days'] = df['waiting_days'].dt.days

In [None]:
df['waiting_days'] = np.where(df['waiting_days'] < 0, 0, df['waiting_days'])
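An alternative to clipping negative values, shown here only as a sketch and not as the approach used in this notebook, is to drop the time of day from both columns before subtracting. This counts calendar days, so results can differ by one day from the elapsed-time calculation above. The timestamps below are made up for illustration.

```python
import pandas as pd

# Made-up timestamps: one same-day appointment and one scheduled five calendar days ahead.
sample = pd.DataFrame({
    'scheduled_day': pd.to_datetime(['2016-04-29T18:38:08Z', '2016-04-27T08:36:51Z']),
    'appointment_day': pd.to_datetime(['2016-04-29T00:00:00Z', '2016-05-02T00:00:00Z']),
})

# Elapsed time: negative when the scheduling time is later in the day than midnight.
raw_diff = sample['appointment_day'] - sample['scheduled_day']

# Calendar days: normalize() resets both timestamps to midnight first.
calendar_diff = sample['appointment_day'].dt.normalize() - sample['scheduled_day'].dt.normalize()

sample['waiting_days_raw'] = raw_diff.dt.days
sample['waiting_days'] = calendar_diff.dt.days

print(sample[['waiting_days_raw', 'waiting_days']])
```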

Two new columns `appointment_weekday` and `scheduled_weekday` are created to indicate the weekday of the appointment and of when the appointment was scheduled.

In [None]:
# dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0.
df['appointment_weekday'] = df['appointment_day'].dt.day_name()

In [None]:
# dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0.
df['scheduled_weekday'] = df['scheduled_day'].dt.day_name()

## 3. Data Exploration

In [None]:
df.head()

### 3.1 Mean No-show

To begin the exploration, the mean no-show rate is calculated.

In [None]:
# The mean of a boolean mask is the fraction of rows labeled 'no_show'.
mean_no_show = (df['no_show'] == 'no_show').mean()

mean_no_show


The mean no-show rate for this population is 20%.

### 3.2 Weekday Analysis

First the no-show rates are calculated for the appointment weekday and the scheduled weekday, then these rates are plotted for easier visualization.

#### 3.2.1 Appointment Weekday

In [None]:
def groupby_rate(column):
    #Return a series with the no-show rates for a specific characteristic (column).
    
    func_count = df.groupby(['no_show', column]).count()['patient_id']
    func_rate = func_count['no_show'] / (func_count['no_show'] + func_count['show'])
    return func_rate
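The same kind of rate can also be computed with `pd.crosstab`, which is sketched below as an alternative to the function above. The toy data is invented, and the `'show'` / `'no_show'` labels follow the ones used in this notebook.

```python
import pandas as pd

# Invented toy data using the notebook's 'show' / 'no_show' labels.
sample = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'no_show': ['show', 'no_show', 'show', 'show', 'no_show'],
})

# normalize='index' turns each row of counts into fractions that sum to 1.
rates = pd.crosstab(sample['gender'], sample['no_show'], normalize='index')

# The 'no_show' column is then the no-show rate per group.
no_show_rate = rates['no_show']
print(no_show_rate)
```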

In [None]:
sort = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

apt_weekdays_rate = groupby_rate('appointment_weekday').reindex(sort)

In [None]:
ind = np.arange(len(apt_weekdays_rate))
width = 0.35
plt.bar(ind, apt_weekdays_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Weekday')
plt.title('No-show Rate by Appointment Weekday')
labels = sort
plt.xticks(location, labels)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(apt_weekdays_rate)

The bar plot above shows that the no-show rate varies little with the weekday of the appointment. It is somewhat higher for appointments in the last two days of the week, with the highest rate on Saturday and the lowest on Thursday.

#### 3.2.2 Scheduled Weekday

In [None]:
sch_weekdays_rate = groupby_rate('scheduled_weekday').reindex(sort)

In [None]:
ind = np.arange(len(sch_weekdays_rate))
width = 0.35
plt.bar(ind, sch_weekdays_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Weekday')
plt.title('No-show Rate by Scheduled Weekday')
labels = sort
plt.xticks(location, labels)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(sch_weekdays_rate)

The bar plot above shows that the no-show rate varies little across scheduled weekdays, with one exception: appointments scheduled on Saturdays have a no-show rate of 4.17%, while appointments scheduled on the other weekdays have a rate of about 20%.
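A rate is easier to interpret next to the number of appointments behind it, since a rate computed from a small group can swing widely. The sketch below shows one way to report both together; the data is invented and `missed` is a hypothetical helper column.

```python
import pandas as pd

# Invented data: five appointments scheduled on Monday, two on Saturday.
sample = pd.DataFrame({
    'scheduled_weekday': ['Monday'] * 5 + ['Saturday'] * 2,
    'no_show': ['show', 'no_show', 'show', 'show', 'show', 'show', 'show'],
})

# 'missed' is a hypothetical boolean column: True when the patient did not show up.
summary = (
    sample.assign(missed=sample['no_show'].eq('no_show'))
          .groupby('scheduled_weekday')['missed']
          .agg(rate='mean', appointments='count')
)

print(summary)
```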

#### 3.2.3 Weekday Conclusion

While Saturday appointments have slightly higher no-show rates than appointments on other weekdays, appointments scheduled on Saturdays have much lower no-show rates than ones scheduled on other weekdays. A possible explanation is that people would rather not go to the doctor on the weekend. However, if they do go during the weekend to schedule an appointment, they will probably show up, since it's likely to be urgent.

### 3.3 Patient Characteristic Analysis

In this part the no-show rates will be calculated for each patient characteristic (column).

In [None]:
gender_rate = groupby_rate('gender')
bolsa_rate = groupby_rate('bolsa_familia')
hypertension_rate = groupby_rate('hypertension')
diabetes_rate = groupby_rate('diabetes')
alcoholism_rate = groupby_rate('alcoholism')
handicap_rate = groupby_rate('handicap')
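The bar plots in the following subsections repeat the same plotting steps with different labels. As a sketch, those steps could be wrapped in a small helper; `plot_rate` is a hypothetical function that is not part of this notebook, and the example rates are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_rate(rate, xlabel, title, labels=None, ax=None):
    """Draw a no-show rate bar chart from a Series of rates (hypothetical helper)."""
    if ax is None:
        _, ax = plt.subplots()
    positions = np.arange(len(rate))
    ax.bar(positions, rate.to_numpy(), 0.35, color='r', alpha=.7, label='No-Show')
    # Use the Series index as tick labels unless explicit labels are given.
    ax.set_xticks(positions)
    ax.set_xticklabels(labels if labels is not None else rate.index)
    ax.set_ylabel('No-show Rate')
    ax.set_xlabel(xlabel)
    ax.set_title(title)
    ax.set_yticks(np.arange(0, 0.4, step=0.05))
    ax.legend()
    return ax

# Made-up rates for illustration only.
example_rate = pd.Series([0.20, 0.24], index=[0, 1])
ax = plot_rate(example_rate, 'Bolsa-Familia', 'No-show Rate by Bolsa-Familia')
```

Returning the `Axes` object keeps the helper usable both on its own and inside a larger figure.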

#### 3.3.1 Gender

In [None]:
ind = np.arange(len(gender_rate))
width = 0.35
plt.bar(ind, gender_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Gender')
plt.title('No-show Rate by Gender')
plt.xticks(location)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(gender_rate)

The plot above indicates that gender doesn't have much effect on the no-show rate.

#### 3.3.2 Bolsa-Familia

In [None]:
ind = np.arange(len(bolsa_rate))
width = 0.35
plt.bar(ind, bolsa_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Bolsa-Familia')
plt.title('No-show Rate by Bolsa-Familia recipients')
plt.xticks(location)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(bolsa_rate)

The plot above indicates that people who receive Bolsa-Familia are more likely not to show up for their appointments.

#### 3.3.3 Hypertension

In [None]:
ind = np.arange(len(hypertension_rate))
width = 0.35
plt.bar(ind, hypertension_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Hypertension')
plt.title('No-show Rate by Hypertension')
plt.xticks(location)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(hypertension_rate)

People with hypertension are less likely to miss their appointments. A probable cause is that they need to follow up on their treatment in order to receive the hypertension medicine they need.

#### 3.3.4 Diabetes

In [None]:
ind = np.arange(len(diabetes_rate))
width = 0.35
plt.bar(ind, diabetes_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Diabetes')
plt.title('No-show Rate by Diabetes')
plt.xticks(location)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(diabetes_rate)

People with diabetes are less likely to miss their appointments. As with hypertension, a probable cause is that they need to follow up in order to receive the medicine they need.

#### 3.3.5 Alcoholism

In [None]:
ind = np.arange(len(alcoholism_rate))
width = 0.35
plt.bar(ind, alcoholism_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Alcoholism')
plt.title('No-show Rate by Alcoholism')
plt.xticks(location)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(alcoholism_rate)

The plot above indicates that alcoholism doesn't have much effect on the no-show rate.

#### 3.3.6 Handicap

In [None]:
ind = np.arange(len(handicap_rate))
width = 0.35
plt.bar(ind, handicap_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Handicap')
plt.title('No-show Rate by Handicaps')
plt.xticks(location)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend()
print(handicap_rate)

The plot above shows that people with no handicaps have the same no-show rate as the population mean.

Patients with one handicap have a lower no-show rate than the mean. As the number of handicaps grows, the no-show rate grows too.

#### 3.3.7 Age

In [None]:
bins = [0, 18, 34, 50, 70, 100]

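# include_lowest=True keeps age 0 in the first bin; by default pd.cut excludes the left edge.
age_group = df.groupby(['no_show', pd.cut(df['age'], bins, include_lowest=True)]).size().unstack().transpose()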

age_group_rate = age_group['no_show'] / (age_group['no_show'] + age_group['show'])

In [None]:
ind = np.arange(len(age_group_rate))
width = 0.35
plt.bar(ind, age_group_rate, width, color='r', alpha=.7, label='No-Show')
location = ind
plt.ylabel('No-show Rate')
plt.xlabel('Age Group')
plt.title('No-show Rate by Age Group')
labels = ['0-18', '19-34', '35-50', '51-70', '71-100']
plt.xticks(location, labels)
plt.yticks(np.arange(0, 0.4, step=0.05))
plt.legend();
print(age_group_rate)

The bar plot above shows that patients aged 34 or younger are more likely to miss their appointment. Patients between the ages of 35 and 50 have the same no-show rate as the population mean, and patients aged 51 or older are less likely to miss their appointments.
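How `pd.cut` treats the bin edges matters for the age groups above. The sketch below uses made-up ages placed on the bin boundaries to show that the default intervals are open on the left, so an age of 0 falls outside the first bin unless `include_lowest=True` is passed. The `labels` argument is used here for readability only.

```python
import pandas as pd

# Made-up ages chosen to sit on the bin boundaries.
ages = pd.Series([0, 18, 19, 34, 35, 70, 71, 100])
bins = [0, 18, 34, 50, 70, 100]
labels = ['0-18', '19-34', '35-50', '51-70', '71-100']

# Default behaviour: intervals are (0, 18], (18, 34], ... so age 0 gets no bin.
default_groups = pd.cut(ages, bins)
unbinned = int(default_groups.isna().sum())

# include_lowest=True closes the first interval on the left, so age 0 is kept.
labeled_groups = pd.cut(ages, bins, labels=labels, include_lowest=True)

print('ages without a bin by default:', unbinned)
print(labeled_groups.tolist())
```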

### 3.4 Neighborhood Analysis

In this part the no-show rate and size (appointment count) of each neighborhood are compared and plotted. Since the size of the hospitals is not defined in the data, the relative size of each hospital is inferred from the number of appointments it had, with each neighborhood standing in for one hospital.

In [None]:
# Name both series explicitly so the combined columns don't depend on pandas' default names,
# which differ between versions for value_counts().
neighborhood_rate = groupby_rate('neighborhood').rename('no_show_rate')
neighborhood_count = df['neighborhood'].value_counts().rename('size')
neighborhood_cmb = pd.concat([neighborhood_rate, neighborhood_count], axis=1, sort=False)

In [None]:
plt.scatter(neighborhood_cmb['size'], neighborhood_cmb['no_show_rate'], c='r', alpha=0.7)
plt.ylabel('No-show Rate')
plt.xlabel('Neighborhood Size')
plt.title('No-show Rate by Neighborhood Size');

The scatter plot above doesn't show a relation between neighborhood size and no-show rate.
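A visual impression like this can be backed by a correlation coefficient between the two columns. The sketch below uses invented values, not the notebook's results, and computes both a Pearson coefficient and a rank-based (Spearman-style) coefficient by correlating the ranks.

```python
import pandas as pd

# Invented sizes and rates for illustration only.
example = pd.DataFrame({
    'size': [100, 500, 1500, 3000, 7000],
    'no_show_rate': [0.25, 0.18, 0.21, 0.19, 0.20],
})

# Pearson correlation measures linear association.
pearson = example['size'].corr(example['no_show_rate'])

# Correlating the ranks gives a Spearman-style coefficient that is less sensitive to outliers.
spearman = example['size'].rank().corr(example['no_show_rate'].rank())

print('Pearson:', round(pearson, 3))
print('Spearman:', round(spearman, 3))
```

Values close to 0 would be consistent with the absence of a clear relation.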

## 4. Conclusion

### 4.1 Results

- The analysis indicates that there's a 20% chance a patient will not show up to their appointment. There are also some characteristics that increase or lower this chance.

- Patients with diabetes and hypertension are less likely to miss their appointment, while patients who receive Bolsa-Familia are more likely to miss it. Patients are also slightly more likely to miss an appointment on Friday and Saturday, while appointments scheduled on Saturdays are not likely to be missed.

- Patients aged 34 or younger are more likely to miss their appointment, patients between the ages of 35 and 50 have the same no-show rate as the population mean, and patients aged 51 or older are less likely to miss their appointments.

- The size of the hospital doesn't seem to have an effect on the no-show rate.

### 4.2 Limitations

- All variables except age are categorical. This means that correlation coefficients cannot be computed directly to quantify the strength of the relationships found.
- The correlations drawn in this analysis are based on the data. However, as the saying goes, correlation does not imply causation. In other words, even though the data shows a relation between some characteristic and the no-show rate, it doesn't necessarily mean that the characteristic causes patients to miss their appointments.
- Many appointments were scheduled on the same day that they happened. This might mean that the hospital attended to a patient who didn't have an appointment and then simply entered the appointment in the schedule. If this is the case, the data might be skewed and the no-show rate may actually be higher than it appears.
