Liudmila Semenova
UDACITY Data Analyst Nanodegree Program
2019, April 16

PROJECT 2: Investigate a Dataset Medical Appointment No Shows
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions
Introduction
The subject of this analysis is the Kaggle dataset Medical Appointment No Shows, which describes more than 110,000 medical appointments in health units located within the city of Vitoria, Brazil, and 14 features of each appointment.

Not all of the data were clearly explained in the author's Data Dictionary; however, clarifications provided in the Discussion section lead to the following description of the features:

PatientId - identification of a patient.
AppointmentID - identification of each appointment.
Gender - gender of a patient: male or female.
ScheduledDay - a date on which an appointment was scheduled.
AppointmentDay - a date of an appointment.
Age - age of a patient.
Neighbourhood - a neighbourhood where an appointment took place.
Scholarship - if a patient has government financial aid Bolsa Familia: yes or no.
Hipertension - if a patient has high blood pressure: yes or no.
Diabetes - if a patient has diabetes: yes or no.
Alcoholism - if a patient is suffering from alcoholism: yes or no.
Handcap - if a patient has disabilities: the number of disabilities.
SMS_received - if a patient got a text message appointment reminder: yes or no.
No-show - if a patient came to an appointment: no for "came" and yes for "didn't come".

Thus, on the basis of these variables, the following questions can be investigated:

Are patient no-shows related to a patient's personal features such as age, gender, the mentioned diseases, disabilities or getting government financial aid?

Do appointment reminders affect whether patients come to their appointments?

Is the number of days between the date when the appointment was scheduled and the appointment date related to whether the patient comes to the appointment or not?

Are there neighbourhoods where the patients are more likely not to miss their appointments?


In [1]:
# load libraries;
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
%matplotlib inline

Data Wrangling
General Properties
At this stage we load in the data and perform several operations to get information about:

the numbers of rows and columns;
missing values of each column;
data types of all columns;
duplicated rows and duplicated values;
values of those columns which it is possible to check visually.

In [None]:
# load data, rename columns and check the dataframe;
labels = ['patient_id', 'app_id', 'gender', 'scheduled', 'appointment', 'age', 'neighbourhood', 'aid', 
          'hypertension', 'diabetes', 'alcoholism', 'disability', 'sms', 'no_show']
df = pd.read_csv('medical_appointment_no_shows.csv', header=0, names = labels)
df.head()


In [None]:

# check dimensions of the dataframe to get the numbers of rows and columns;
df.shape


These numbers of rows and columns reveal a mistake in the dataset description given by the author on Kaggle.com, which mentions 300,000 medical appointments and 15 variables for each. So the Introduction section should use the numbers found here instead.

In [None]:

# check concise summary of the dataframe;
df.info()

In [None]:
# check if there is any missing data another way;
df.isnull().sum()

In [None]:
# check if there are any duplicated rows in the dataframe;
sum(df.duplicated())

In [None]:
# check the numbers of unique values in each column;
df.nunique()

In [None]:
# check for duplicates in patient_id column;
sum(df.patient_id.duplicated())

In [None]:
# check the data for all columns except for patient_id, app_id, scheduled and appointment;
print('gender - ', df.gender.unique(), '\n', 
      'age - ', df.age.unique(), '\n',
      'neighbourhood - ', df.neighbourhood.unique(), '\n',
      'aid - ', df.aid.unique(), '\n',
      'hypertension - ', df.hypertension.unique(), '\n',
      'diabetes - ', df.diabetes.unique(), '\n',
      'alcoholism - ', df.alcoholism.unique(), '\n',
      'disability - ', df.disability.unique(), '\n',
      'sms - ', df.sms.unique(), '\n',
      'no_show - ', df.no_show.unique(), sep = '')

In [None]:
# check data types of each column; 
df.dtypes

In [None]:
# check object data types more detailed;
print('gender - ', type(df['gender'][0]), '\n', 
      'scheduled - ', type(df['scheduled'][0]), '\n',
      'appointment - ', type(df['appointment'][0]), '\n',
      'neighbourhood - ', type(df['neighbourhood'][0]), '\n',
      'no_show - ', type(df['no_show'][0]), sep = '')


Thus, we can conclude the following:

the dataset has 110,527 rows and 14 columns;
there are no missing data or duplicated rows;
the patient_id column has 48,228 duplicate values;
the scheduled and appointment columns have string data type instead of datetime;
the age column has values such as -1 and 0 (-1 looks like a typing error, and 0 means children under one year old according to the dataset author's explanation);
the disability column has 5 values for the exact number of patient disabilities which is redundant for further analysis;
the aid, hypertension, diabetes, alcoholism, sms columns have values 1 and 0 instead of Yes and No.


# Data Cleaning
Based on the results of the dataframe inspection we perform the next steps to clean the data:

drop the row where the age is equal to -1;
convert the scheduled and appointment columns to datetime data type;
create a new column waiting (days between scheduled and appointment date);
change the disability column values to binary: 0 for patient without disabilities and 1 for patients with any number of disabilities;
change aid, hypertension, diabetes, alcoholism, disability, sms columns values to Yes and No for convenience of further calculations and convert these columns to string data type;
drop the app_id column since it is redundant for further analysis.

In [None]:
# search for the rows where age is less than 0;
df.query('age < 0')

In [None]:
# drop the row where age is less than 0 and check changes;
df = df.query('age >= 0')
print(sorted(df.age.unique()))

As patients make appointments in advance, values in the scheduled column have to be earlier than or equal to the values in the appointment column, so we check this condition.

In [None]:
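# convert both columns to datetime and keep only the date part (assumed implementation of the conversion listed in the cleaning steps);
df = df.assign(scheduled=pd.to_datetime(df.scheduled).dt.normalize(), appointment=pd.to_datetime(df.appointment).dt.normalize())
# search for rows where the appointment date is earlier than the scheduled date;
df.query('appointment < scheduled')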


One entry has a 6-day difference between the scheduled and appointment dates, which is no doubt an error. The other entries differ by only one day, which can be either an entry error or an error due to the missing appointment time. So we drop all these rows.

In [None]:
# drop the rows where the appointment date is earlier than the scheduled date and check changes;
df = df.query('appointment >= scheduled')
df.query('appointment < scheduled')

In [None]:
# create a new column named waiting (days between scheduled and appointment days) and check changes;
df.insert(5, 'waiting', (df.appointment - df.scheduled).dt.days)
df.head()
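
The day count depends on whether the time of day is kept in both columns. The following is a minimal sketch on two invented rows (not taken from the dataset), assuming the scheduled values carry a time while the appointment values are recorded at midnight. It shows how truncating both columns with `dt.normalize()` changes the result for a same-day appointment:

```python
import pandas as pd

# Two invented rows: a same-day appointment and one booked two days ahead.
toy = pd.DataFrame({
    'scheduled': ['2016-04-29T18:38:08Z', '2016-04-27T08:36:51Z'],
    'appointment': ['2016-04-29T00:00:00Z', '2016-04-29T00:00:00Z'],
})

scheduled = pd.to_datetime(toy['scheduled'])
appointment = pd.to_datetime(toy['appointment'])

# With the time of day kept, the same-day appointment appears to take place
# before it was scheduled, so its day count is negative.
raw_days = (appointment - scheduled).dt.days

# With both columns truncated to midnight, only whole calendar days remain.
calendar_days = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days

print(raw_days.tolist())
print(calendar_days.tolist())
```

With date-only values a waiting time of 0 means "same day", which is the reading used in the rest of the analysis.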

In [None]:

# count the number of rows with disabilities equal to 2, 3, 4;
(df['disability'] > 1).sum()

In [None]:
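# change values 2, 3, 4 in disability column to 1 and check changes;
# assign the result back, because in-place replace on a column selection may not modify df in recent pandas;
df['disability'] = df['disability'].replace([2, 3, 4], 1)
(df['disability'] > 1).sum()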

In [None]:
# change values in aid, hypertension, diabetes, alcoholism, disability, sms columns to Yes and No and check changes;
def change_to_yes_no(dataframe, col):
    # convert to string and assign the replaced values back to the column;
    dataframe[col] = dataframe[col].astype(str).replace({'1': 'Yes', '0': 'No'})

change_to_yes_no(df, 'aid')
change_to_yes_no(df, 'hypertension')
change_to_yes_no(df, 'diabetes')
change_to_yes_no(df, 'alcoholism')
change_to_yes_no(df, 'disability')
change_to_yes_no(df, 'sms')

df.head()

In [None]:
# drop app_id column and check changes; 
df.drop(['app_id'], axis=1, inplace=True)
df.head()


Exploratory Data Analysis
At this stage we explore the obtained data to answer the research questions posed in the Introduction section.

Question 1. Are patient no-shows related to a patient's personal features such as age, gender, hypertension, diabetes, alcoholism, disabilities or getting government financial aid?
We explore the dataframe in terms of patient features such as age, gender, diseases, disabilities and financial aid. As many patients had more than one appointment, we create a new dataframe with only the first appointment of each patient to avoid analysing the same patient more than once.

In [None]:
# drop rows with duplicated patient_id values and check dimensions of the new dataframe;
df_patients = df.drop_duplicates(['patient_id'], keep='first')
df_patients.shape
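
Note that `keep='first'` keeps whichever row comes first in the current row order, so it selects each patient's earliest appointment only if the rows are ordered by date. A minimal sketch on invented rows, sorting before dropping duplicates (the sort step is an assumption, not part of the original analysis):

```python
import pandas as pd

# Invented rows: each patient has two appointments, listed out of date order.
toy = pd.DataFrame({
    'patient_id': [1, 2, 1, 2],
    'appointment': pd.to_datetime(['2016-05-10', '2016-05-03',
                                   '2016-05-02', '2016-05-20']),
    'no_show': ['Yes', 'No', 'No', 'Yes'],
})

# Without sorting, the first row encountered for each patient is kept.
unsorted_first = toy.drop_duplicates(['patient_id'], keep='first')

# Sorting by date first makes "first" mean "earliest appointment".
earliest = (toy.sort_values('appointment')
               .drop_duplicates(['patient_id'], keep='first'))

print(unsorted_first)
print(earliest)
```

In the toy data the two approaches keep different rows for patient 1, which shows why the row order matters.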


AGE
Explore patient distribution by age.

In [None]:
# patient distribution by age;
sns.histplot(df_patients['age'], bins=115, color='green');
plt.xlabel('Age of Patients');
plt.ylabel('Number of Patients');

In [None]:
# statistics: mean, min, 25%, 50%, 75%, max for age;
df_patients.age.describe()


As we can see, the average age of the patients is about 37 years; the youngest patients are children under 1 year old and the oldest patients are 115 years old. About 25% of the patients are children and teenagers, and 75% are not older than 56.

For further exploration we divide the patients into 5 age groups.

In [None]:
# cut the age column data into discrete chunks;
# pd.cut intervals are right-closed, so these edges give 0-14, 15-24, 25-54, 55-64 and 65-115;
age_bin_values = [-1, 14, 24, 54, 64, 115]
age_bin_names = ['0-14', '15-24', '25-54', '55-64', '65+']
ages = pd.cut(df_patients.age, bins=age_bin_values, labels=age_bin_names)
sns.countplot(x=ages, palette='Paired').set_xticklabels(age_bin_names);
plt.xlabel('Age of Patients');
plt.ylabel('Number of Patients');
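
`pd.cut` builds right-closed intervals by default, so an edge value belongs to the bin on its left and a value equal to the lowest edge falls outside every bin. A small sketch on invented ages (the edge lists are illustrative) shows how the choice of edges decides where 0, 14, 15 and 65 land:

```python
import pandas as pd

ages = pd.Series([0, 14, 15, 64, 65, 115])
names = ['0-14', '15-24', '25-54', '55-64', '65+']

# Intervals (0, 15], (15, 25], ...: age 0 gets no bin and 15 joins the first one.
open_low = pd.cut(ages, bins=[0, 15, 25, 55, 65, 115], labels=names)

# Intervals (-1, 14], (14, 24], ...: every age from 0 to 115 gets the label it reads as.
closed_low = pd.cut(ages, bins=[-1, 14, 24, 54, 64, 115], labels=names)

print(open_low.tolist())
print(closed_low.tolist())
```

Passing `right=False` with edges shifted up by one is an equivalent way to get left-closed groups.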


Explore how many patients from each age group came to their appointments and how many did not come in proportions, then visualize the data.

In [None]:
# divide each age bin into 2 groups: patients who came to the appointment and who did not
# and create new dataframe with obtained data in proportions;
df_age_no_show = df_patients.groupby([pd.cut(df_patients['age'],age_bin_values),'no_show']).count().patient_id.unstack()
df_age_no_show['No-show patients'] = df_age_no_show['Yes'] / (df_age_no_show['Yes'] + df_age_no_show['No'])
df_age_no_show['Show-up patients'] = df_age_no_show['No'] / (df_age_no_show['Yes'] + df_age_no_show['No'])
df_age_no_show.drop(['No', 'Yes'], axis=1, inplace=True)

# visualize the result;
df_age_no_show.plot(kind='bar').set_xticklabels(['0-14', '15-24', '25-54', '55-64', '65+'], rotation=0);
plt.xlabel('Age of Patients');
plt.ylabel('Number of Patients in Proportions');
plt.legend(loc=(1.02, 0.50));
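
The same per-group shares can also be obtained with `pd.crosstab` and `normalize='index'`, which divides each row by its row total. A compact sketch on invented data (the values are illustrative only and are not results from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'age_group': ['15-24'] * 4 + ['65+'] * 5,
    'no_show': ['Yes', 'Yes', 'No', 'No',        # 15-24: 2 of 4 missed
                'Yes', 'No', 'No', 'No', 'No'],  # 65+:   1 of 5 missed
})

# Each row sums to 1: share of missed ('Yes') and attended ('No') per group.
shares = pd.crosstab(toy['age_group'], toy['no_show'], normalize='index')
print(shares)
```

The resulting table can be passed to `.plot(kind='bar')` in the same way as the dataframe built above.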


As we can see, young patients (aged 15-24) tend to miss their appointments more often than other groups of patients, while elderly patients (aged 65 and above) miss them least often. So we can conclude that the older the patients get, the less likely they are to miss their appointments.

The only exception is children under the age of 15. Since they usually visit doctors accompanied by their parents, their ratio of missed to attended appointments should be approximately the same as for adult patients (aged 25-54), which the obtained data confirm.

# GENDER
Explore patient distribution by gender in absolute numbers and in proportions.

In [None]:

# patient distribution by gender in absolute numbers;
df_patients.gender.value_counts()

In [None]:
# patient distribution by gender in proportions;
df_patients.gender.value_counts(normalize=True)

In [None]:
# visualization of patient distribution by gender;
sns.countplot(x='gender', data=df_patients, palette='Paired').set_xticklabels(['Female - 64%', 'Male - 36%']);
plt.xlabel('Gender of Patients');
plt.ylabel('Number of Patients');


Explore how many females and males came to their appointments and how many did not come in proportions, then visualize the data.

In [None]:
# get the shares of females and males who came to their appointments and who did not;
f_no_show = df_patients.query('gender == "F"')['no_show'].value_counts(normalize=True)
m_no_show = df_patients.query('gender == "M"')['no_show'].value_counts(normalize=True)

# create dataframe using obtained data;
gender_no_show = [{'No-show patients': f_no_show['Yes'], 'Show-up patients': f_no_show['No']},
                  {'No-show patients': m_no_show['Yes'], 'Show-up patients': m_no_show['No']}]
df_gender_no_show = pd.DataFrame(gender_no_show)
df_gender_no_show.insert(0, 'Gender', ['Female', 'Male'])
df_gender_no_show = df_gender_no_show.set_index('Gender')

# visualize the proportion of females and males who came and did not come to their appointments;
df_gender_no_show.plot(kind='bar').set_xticklabels(['Female', 'Male'], rotation = 0);
plt.xlabel('Gender of Patients');
plt.ylabel('Number of Patients in Proportions');
plt.legend(loc=(1.02, 0.50));

As we can see, almost equal shares of females and males miss their appointments, so we cannot say that there is any relationship between gender and patient no-shows.

# GOVERNMENT FINANCIAL AID BOLSA FAMILIA
Explore patient distribution by getting government financial aid in absolute numbers and in proportions.

In [None]:
# patient distribution by getting government financial aid in absolute numbers;
df_patients.aid.value_counts()

In [None]:
# patient distribution by getting government financial aid in proportions;
df_patients.aid.value_counts(normalize=True)

In [None]:
# visualization of patient distribution by getting government financial aid;
sns.countplot(x='aid', data=df_patients, palette='Paired').set_xticklabels(['No - 91%', 'Yes - 9%']);
plt.xlabel('Getting Financial Aid');
plt.ylabel('Number of Patients');

Explore in proportions how many patients with and without financial aid came to their appointments and how many did not, then visualize the data.

In [None]:
# get the shares of patients with and without financial aid who came to their appointments
# and who did not;
bf_no_show = df_patients.query('aid == "Yes"')['no_show'].value_counts(normalize=True)
nobf_no_show = df_patients.query('aid == "No"')['no_show'].value_counts(normalize=True)

# create dataframe using obtained data;
aid_no_show = [{'No-show patients': bf_no_show['Yes'], 'Show-up patients': bf_no_show['No']},
                  {'No-show patients': nobf_no_show['Yes'], 'Show-up patients': nobf_no_show['No']}]
df_aid_no_show = pd.DataFrame(aid_no_show)
df_aid_no_show.insert(0, 'Getting financial aid', ['Yes', 'No'], True)
df_aid_no_show = df_aid_no_show.set_index('Getting financial aid')

# visualize the proportion of patients with and without financial aid 
# who came and did not come to their appointments;
df_aid_no_show.plot(kind='bar').set_xticklabels(['Yes', 'No'], rotation = 0);
plt.xlabel('Getting Financial Aid');
plt.ylabel('Number of Patients in Proportions');
plt.legend(loc=(1.02, 0.50));

As we can see, the patients who have financial aid tend to miss their appointments slightly more often than the patients who do not. However, this difference is small and requires further exploration.
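
One possible form of such further exploration is a two-proportion z-test, which checks whether a gap between two no-show shares is larger than chance variation would explain. The sketch below uses only the standard library; the counts are invented placeholders, not results from this dataset, and the helper name `two_proportion_z` is hypothetical:

```python
import math

def two_proportion_z(missed_a, total_a, missed_b, total_b):
    """Return the z statistic and two-sided p-value for share_a - share_b."""
    p_a = missed_a / total_a
    p_b = missed_b / total_b
    # Pooled share under the null hypothesis that both groups miss equally often.
    pooled = (missed_a + missed_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: 120 of 500 missed in group A, 200 of 1000 missed in group B.
z, p = two_proportion_z(120, 500, 200, 1000)
print(round(z, 2), round(p, 4))
```

A small p-value would suggest the gap is unlikely to be chance alone, although it says nothing about why the groups differ.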

# Question 2. Do appointment reminders affect whether patients come to their appointments?
We explore the distribution of appointments by appointment reminders. As it does not matter for this analysis whether an appointment was a first or a follow-up one, we use the original dataframe with non-unique patients.

In [None]:
# check how many appointments were preceded by sms and how many were not in absolute number;
df.sms.value_counts()

In [None]:
# check how many appointments were preceded by sms and how many were not in proportions;
df.sms.value_counts(normalize=True)

In [None]:
# visualize how many appointments were preceded by sms and how many were not;
sns.countplot(x='sms', data=df, palette='Paired').set_xticklabels(['No - 67%', 'Yes - 33%']);
plt.xlabel('SMS');
plt.ylabel('Number of Appointments');


Explore how many appointments with and without sms were missed and were not missed by patients in proportions, then visualize the data.

In [None]:
# get the shares of appointments with and without sms that were missed and were not missed;
sms_no_show = df.query('sms == "Yes"')['no_show'].value_counts(normalize=True)
nosms_no_show = df.query('sms == "No"')['no_show'].value_counts(normalize=True)

# create dataframe using obtained data (the list gets its own name so the series above is not overwritten);
sms_rows = [{'Missing appointments': sms_no_show['Yes'], 'Not missing appointments': sms_no_show['No']},
            {'Missing appointments': nosms_no_show['Yes'], 'Not missing appointments': nosms_no_show['No']}]
df_sms_no_show = pd.DataFrame(sms_rows)
df_sms_no_show.insert(0, 'SMS', ['Yes', 'No'], True)
df_sms_no_show = df_sms_no_show.set_index('SMS')

# visualize the proportion of appointments with and without sms which were missed and were not;
df_sms_no_show.plot(kind='bar').set_xticklabels(['Yes', 'No'], rotation = 0);
plt.ylabel('Number of Appointments in Proportions');
plt.xlabel('SMS');
plt.legend(loc=(1.02, 0.50));

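As we can see, the appointments preceded by an sms were missed noticeably more often than the appointments without one. So, according to the obtained results, receiving a reminder is associated with more no-shows rather than fewer. This is an association only and does not show that reminders cause patients to miss their appointments.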
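
One way to probe such a counter-intuitive association is to compare no-show shares within groups that share another feature, for example waiting time. The sketch below uses invented counts purely to show the mechanics; whether anything similar holds in the real data is not established here, and the column names `waiting_group` and `missed` are assumptions:

```python
import pandas as pd

# Invented counts: (waiting group, sms, missed, attended).
counts = [
    ('short', 'No', 2, 6), ('short', 'Yes', 1, 3),
    ('long', 'No', 1, 1), ('long', 'Yes', 4, 4),
]
rows = []
for group, sms, missed, attended in counts:
    rows += [(group, sms, 'Yes')] * missed + [(group, sms, 'No')] * attended
toy = pd.DataFrame(rows, columns=['waiting_group', 'sms', 'no_show'])
toy['missed'] = toy['no_show'] == 'Yes'

# Overall share of missed appointments with and without a reminder.
overall = toy.groupby('sms')['missed'].mean()

# The same share inside each waiting group.
by_group = toy.groupby(['waiting_group', 'sms'])['missed'].mean().unstack()

print(overall)
print(by_group)
```

In this toy data the overall share is higher with a reminder, yet inside each waiting group the shares are identical, because reminders are concentrated in the group with long waits.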

# Question 3. Is the number of days between the date when the appointment was scheduled and the appointment date related to whether the patient comes to the appointment or not?
We explore the distribution of appointments by the number of days between the scheduled and the appointment dates. As it does not matter for this analysis whether an appointment was a first or a follow-up one, we use the original dataframe with non-unique patients.

In [None]:
# appointment distribution by waiting days;
sns.histplot(df['waiting'], bins=175, color='red');
plt.ylabel('Number of Appointments');
plt.xlabel('Days');

In [None]:
# statistics: mean, min, 25%, 50%, 75%, max for waiting days;
df.waiting.describe()

As we can see, the average waiting time (days between the scheduled and the appointment dates) is about 10 days, although about 25% of patients can visit the doctor on the same day. While the maximum waiting time is 179 days, 75% of patients wait for the appointment no longer than 15 days.

For further exploration we divide the waiting days into 5 groups by waiting time.

In [None]:
# cut the waiting column data into discrete chunks;
# pd.cut intervals are right-closed, so -1 as the lowest edge puts 0 (same day) into its own bin;
waiting_bin_values = [-1, 0, 7, 30, 60, 180]
waiting_bin_names = ['Same day', '1-7 days', '8-30 days', '31-60 days', '61 days and longer']
waiting = pd.cut(df.waiting, bins = waiting_bin_values, labels = waiting_bin_names)
sns.countplot(x=waiting, palette='Paired').set_xticklabels(waiting_bin_names, rotation = 45);
plt.ylabel('Number of Appointments');
plt.xlabel('Waiting Days');

Explore how many patients from each waiting days group came to their appointments and how many did not come in proportions, then visualize the data.

In [None]:
# divide each waiting bin into 2 groups: patients who came to the appointment and who did not
# and create new dataframe with obtained data in proportions;
df_waiting_no_show = df.groupby([pd.cut(df['waiting'],waiting_bin_values),'no_show']).count().patient_id.unstack()
df_waiting_no_show['No-show patients'] = df_waiting_no_show['Yes'] / (df_waiting_no_show['Yes'] + df_waiting_no_show['No'])
df_waiting_no_show['Show-up patients'] = df_waiting_no_show['No'] / (df_waiting_no_show['Yes'] + df_waiting_no_show['No'])
df_waiting_no_show.drop(['No', 'Yes'], axis=1, inplace=True)

# visualize the result;
df_waiting_no_show.plot(kind='bar').set_xticklabels(['Same day', '1-7 days', '8-30 days', '31-60 days', '61 days and longer'], rotation=45);
plt.xlabel('Waiting Days');
plt.ylabel('Number of Appointments in Proportions');
plt.legend(loc=(1.02, 0.50));


As we can see, the patients who can visit clinics on the day they scheduled the appointment tend to miss the appointment less often than the patients who have to wait longer. The patients who have to wait more than one week but less than one month miss their appointments more often than other patients. However, after a month of waiting, the longer patients wait, the less they tend to miss their appointments.



# Question 4. Are there neighbourhoods where the patients are more likely not to miss their appointments?
We explore the distribution of appointments by neighbourhood. As it does not matter for this analysis whether an appointment was a first or a follow-up one, we use the original dataframe with non-unique patients.

In [None]:
# get the data how many appointments were scheduled in clinics of each neighbourhood in absolute numbers;
df_neigh_total=df.groupby(['neighbourhood','no_show']).count().patient_id.unstack().fillna(0)
df_neigh_total['Total appointments'] = df_neigh_total['Yes'] + df_neigh_total['No']
df_neigh_total.drop(['Yes','No'], axis=1, inplace=True)
df_neigh_total.sort_values('Total appointments', ascending = False, inplace = True)

# visualize the result;
df_neigh_total.plot(kind='bar', figsize=(20,10), color='DarkOrange');
plt.xlabel('Neighbourhoods');
plt.ylabel('Number of Appointments');
plt.legend(loc=(1.02, 0.50));


Jardim Camburi is the neighbourhood whose clinics have the largest number of scheduled appointments. However, some of these appointments could have been missed, so we should analyse which neighbourhood's clinics have the largest number of attended appointments.

In [None]:
# get the data how many missed and attended appointments there were in clinics
# of each neighbourhood in absolute numbers;
df_neigh_abs=df.groupby(['neighbourhood','no_show']).count().patient_id.unstack()
df_neigh_abs['No-show patients'] = df_neigh_abs['Yes']
df_neigh_abs['Show-up patients'] = df_neigh_abs['No']
df_neigh_abs.drop(['Yes','No'], axis=1, inplace=True)
df_neigh_abs = df_neigh_abs.sort_values('Show-up patients', ascending = False).head(80)

# visualize the result;
df_neigh_abs['Show-up patients'].plot(kind='bar', figsize=(20,10), color='#db8457');
df_neigh_abs['No-show patients'].plot(kind='bar', figsize=(20,10), color='#4e73ae');
plt.xlabel('Neighbourhoods');
plt.ylabel('Number of Appointments');
plt.legend(loc=(1.02, 0.50));


Jardim Camburi is also the neighbourhood whose clinics have the largest number of attended appointments.

We can also explore which neighbourhood's clinics have the largest share of attended appointments.

In [None]:

# get the data how many missed and attended appointments there were in clinics of each neighbourhood in proportions;
df_neigh=df.groupby(['neighbourhood','no_show']).count().patient_id.unstack()
df_neigh['No-show patients'] = df_neigh['Yes'] / (df_neigh['Yes'] + df_neigh['No'])
df_neigh['Show-up patients'] = df_neigh['No'] / (df_neigh['Yes'] + df_neigh['No'])
df_neigh.drop(['Yes','No'], axis=1, inplace=True)
df_neigh = df_neigh.sort_values('Show-up patients', ascending = False).head(80)

# visualize the result;
df_neigh['Show-up patients'].plot(kind='bar', figsize=(20,10), color='#db8457');
df_neigh['No-show patients'].plot(kind='bar', figsize=(20,10), color='#4e73ae');
plt.xlabel('Neighbourhoods');
plt.ylabel('Number of Appointments in Proportions');
plt.legend(loc=(1.02, 0.50));
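
Shares computed from only a handful of appointments can swing to extreme values, so a ranking by proportion is easier to interpret when neighbourhoods with very few appointments are set aside. A sketch on invented data; the threshold `min_appointments` is an arbitrary assumption and is not part of the original analysis:

```python
import pandas as pd

toy = pd.DataFrame({
    'neighbourhood': ['A'] * 10 + ['B'] * 10 + ['C'] * 2,
    'no_show': ['No'] * 8 + ['Yes'] * 2      # A: 8 of 10 attended
             + ['No'] * 7 + ['Yes'] * 3      # B: 7 of 10 attended
             + ['No'] * 2,                   # C: 2 of 2 attended
})

# Counts per neighbourhood; fill_value=0 keeps neighbourhoods without no-shows.
counts = toy.groupby(['neighbourhood', 'no_show']).size().unstack(fill_value=0)
counts['total'] = counts['No'] + counts['Yes']
counts['show_up_share'] = counts['No'] / counts['total']

min_appointments = 5  # arbitrary cut-off, chosen only for this illustration
ranked = (counts[counts['total'] >= min_appointments]
          .sort_values('show_up_share', ascending=False))
print(ranked)
```

Here neighbourhood C has a perfect share based on just two appointments and is excluded from the ranking by the cut-off.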

# Conclusions
The analysis showed the following:

● In terms of age, the greatest tendency to miss appointments is shown by young patients and the least by patients aged 65 and above. The older the patients get, the less likely they are to miss their appointments, except for children under the age of 15: since they usually visit doctors accompanied by their parents, their ratio of missed to attended appointments is about the same as for the patients aged 25-54.

● Women and men miss their appointments at almost equal rates; however, there are almost twice as many female patients as male patients. This may be explained by women accompanying their children to medical appointments more often than men.

● Most patients do not receive the government financial aid Bolsa Familia, and those who do tend to miss their appointments slightly more often.

● The vast majority of the patients do not suffer from hypertension, diabetes or alcoholism and do not have any disabilities. The patients with hypertension, alcoholism and disabilities tend to miss their appointments more often than the patients who do not have these health issues, while the patients with diabetes, on the contrary, miss their appointments less often than the patients without diabetes.

● Contrary to what was expected, the appointments preceded by sms reminders were missed noticeably more often than the appointments without sms reminders.

● The average waiting time for the appointments is about 10 days, although about 25% of the patients can visit their doctors on the same day. While the maximum waiting time is 179 days, 75% of the patients wait for the appointment no longer than 15 days. The patients who can visit clinics on the day they scheduled the appointment tend to miss their appointments less often than the patients who have to wait longer. The patients who have to wait more than one week but less than one month miss their appointments more often than other patients. However, after a month of waiting, the longer the patients wait, the less they tend to miss their appointments.

● The neighbourhood whose clinics have the largest numbers of both scheduled and attended appointments is Jardim Camburi. The neighbourhood with the largest share of attended appointments is Ilha Do Boi.