<a id='intro'></a>
## Introduction

The "No-show Appointments" dataset on Kaggle records medical appointments in Brazil during 2016. It covers 110,527 appointments, with features describing patient demographics, medical history, appointment details, and attendance.



### Research Question

Is there a correlation between having a chronic illness and appointment attendance?

In [5]:
#import Python libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

<a id='wrangling'></a>
## Data Wrangling

In [6]:
#Read and display dataset 
df = pd.read_csv('medical_no_shows_2016.csv')
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [7]:
#Use df.info() to view details about the dataframe

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [8]:
#Use df.shape (an attribute, not a method) to see the shape of the dataframe

df.shape

(110527, 14)

In [9]:
#Inspect for duplicated rows using .duplicated()

df.duplicated().value_counts()

False    110527
dtype: int64

In [10]:
#df.isnull() to inspect for NaN values

df.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

In [11]:
#Verify that age range is possible

df['Age'].max(), df['Age'].min()

(115, -1)

In [12]:
#Verify that the boolean columns have 2 possible values  

df.nunique()

PatientId          62299
AppointmentID     110527
Gender                 2
ScheduledDay      103549
AppointmentDay        27
Age                  104
Neighbourhood         81
Scholarship            2
Hipertension           2
Diabetes               2
Alcoholism             2
Handcap                5
SMS_received           2
No-show                2
dtype: int64
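
The `nunique()` output shows that `Handcap` takes 5 distinct values rather than 2. A minimal sketch of how that check could be automated follows; the toy DataFrame and the `expected_binary` list are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the appointments data (values are illustrative only)
toy = pd.DataFrame({
    'Scholarship': [0, 1, 0, 1],
    'Diabetes':    [0, 0, 1, 1],
    'Handcap':     [0, 1, 2, 4],
})

# Hypothetical list of columns we expect to be 0/1 flags
expected_binary = ['Scholarship', 'Diabetes', 'Handcap']

# Collect every column that contains a value outside {0, 1}
not_binary = [col for col in expected_binary
              if not toy[col].isin([0, 1]).all()]

print(not_binary)
```

Any column that lands in `not_binary` is worth inspecting with `value_counts()` before treating it as a yes/no flag.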

### Data Cleaning

In [13]:
# Use .astype() to convert the PatientId column to a 64-bit integer
# (an explicit 'int64' avoids a 32-bit default on some platforms)
df['PatientId'] = df['PatientId'].astype('int64')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   PatientId       110527 non-null  int64 
 1   AppointmentID   110527 non-null  int64 
 2   Gender          110527 non-null  object
 3   ScheduledDay    110527 non-null  object
 4   AppointmentDay  110527 non-null  object
 5   Age             110527 non-null  int64 
 6   Neighbourhood   110527 non-null  object
 7   Scholarship     110527 non-null  int64 
 8   Hipertension    110527 non-null  int64 
 9   Diabetes        110527 non-null  int64 
 10  Alcoholism      110527 non-null  int64 
 11  Handcap         110527 non-null  int64 
 12  SMS_received    110527 non-null  int64 
 13  No-show         110527 non-null  object
dtypes: int64(9), object(5)
memory usage: 11.8+ MB


In [14]:
#Identify which patient had an invalid age value of -1 (row index 99832) and drop that row

invalid_age = df.query('Age < 0')
df.drop(invalid_age.index, inplace=True)

In [15]:
#Verify the row was dropped and changes were saved to df

df.query('Age <= -1')

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show


In [16]:
#Rename the hypertension column so that it is spelled correctly 

df.rename(columns={'Hipertension': 'Hypertension'}, inplace=True)

In [17]:
#Convert No-show column values to 0s and 1s (0 = attended, 1 = missed the appointment)
df['No-show'] = df['No-show'].map({'No': 0, 'Yes': 1})

In [18]:
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872499824296,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,0
1,558997776694438,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,0
2,4262962299951,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,0
3,867951213174,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,0
4,8841186448183,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,0


<a id='eda'></a>
## Exploratory Data Analysis


### Does having a medical condition affect appointment attendance?

In [20]:
# Identify the distributions of people who do not have any of the specified conditions and whether they attended 
count_healthy_total = df[
    (df['Diabetes'] == 0) &
    (df['Hypertension'] == 0) 
].shape[0]

count_healthy_no_show = df[
    (df['Diabetes'] == 0) &
    (df['Hypertension'] == 0) &
    (df['No-show'] == 1)
].shape[0]

count_healthy_showed = df[
    (df['Diabetes'] == 0) &
    (df['Hypertension'] == 0) &
    (df['No-show'] == 0)
].shape[0]

print(f'Healthy Total:', count_healthy_total)
print(f'Healthy No-showed:', count_healthy_no_show, (count_healthy_no_show/count_healthy_total) *100 ,'%')
print(f'Healthy Showed:', count_healthy_showed, (count_healthy_showed/count_healthy_total) *100 ,'%')

Healthy Total: 87268
Healthy No-showed: 18258 20.921758261905854 %
Healthy Showed: 69010 79.07824173809415 %


#### Is there a correlation between having diabetes and appointment attendance? 

In [22]:
# Identify the distributions of people who are diabetic and whether they attended 
count_diabetic_total = df[
    (df['Diabetes'] == 1) &
    (df['Hypertension'] == 0)
].shape[0]

count_diabetic_no_show = df[
    (df['Diabetes'] == 1) &
    (df['Hypertension'] == 0) &
    (df['No-show'] == 1)
].shape[0]

count_diabetic_showed = df[
    (df['Diabetes'] == 1) &
    (df['Hypertension'] == 0) &
    (df['No-show'] == 0)
].shape[0]

print(f'Diabetic Total:', count_diabetic_total)
print(f'Diabetic No-showed:', count_diabetic_no_show, (count_diabetic_no_show/count_diabetic_total) *100 ,'%')
print(f'Diabetic Showed:', count_diabetic_showed, (count_diabetic_showed/count_diabetic_total) *100 ,'%')

Diabetic Total: 1457
Diabetic No-showed: 289 19.83527796842828 %
Diabetic Showed: 1168 80.16472203157173 %


#### Is there a correlation between having hypertension and appointment attendance? 


In [38]:
count_hypertension_total = df[
    (df['Hypertension'] == 1) &
    (df['Diabetes'] == 0)
].shape[0]

count_hypertension_no_show = df[
    (df['Hypertension'] == 1) &
    (df['Diabetes'] == 0) &
    (df['No-show'] == 1)
].shape[0]

count_hypertension_showed = df[
    (df['Hypertension'] == 1) &
    (df['Diabetes'] == 0) &
    (df['No-show'] == 0)
].shape[0]

print(f'High BP Total:', count_hypertension_total)
print(f'High BP No-showed:', count_hypertension_no_show, (count_hypertension_no_show/count_hypertension_total) *100 ,'%')
print(f'High BP Showed:', count_hypertension_showed, (count_hypertension_showed/count_hypertension_total) *100 ,'%')

High BP Total: 15315
High BP No-showed: 2631 17.179236043095006 %
High BP Showed: 12684 82.820763956905 %


#### Does having both diabetes and hypertension affect appointment attendance?

In [28]:
# Create a new column that indicates whether a patient has both diabetes and hypertension
df['diabetes_hypertension'] = (df['Diabetes'] == 1) & (df['Hypertension'] == 1)

In [29]:
count_both_total = df[
    (df['diabetes_hypertension'] == True)
].shape[0]

count_both_no_show = df[
    (df['diabetes_hypertension'] == True) & 
    (df['No-show'] == 1)
].shape[0]

count_both_showed = df[
    (df['diabetes_hypertension'] == True) &
    (df['No-show'] == 0)
].shape[0]

print(f'Diabetic & High BP Total:', count_both_total)
print(f'Diabetic & High BP No-showed:', count_both_no_show, (count_both_no_show/count_both_total) *100 ,'%')
print(f'Diabetic & High BP Showed:', count_both_showed, (count_both_showed/count_both_total) *100 ,'%')

Diabetic & High BP Total: 6486
Diabetic & High BP No-showed: 1141 17.59173604687018 %
Diabetic & High BP Showed: 5345 82.40826395312982 %
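
The four blocks above repeat the same filter-and-count pattern. As an alternative sketch, the group sizes and no-show rates could be computed in one pass with `groupby`. The toy data below is made up for illustration; only the column names `Diabetes`, `Hypertension`, and `No-show` come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration; 'No-show' uses 1 = missed, 0 = attended
toy = pd.DataFrame({
    'Diabetes':     [0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
    'Hypertension': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'No-show':      [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
})

# One row per (Diabetes, Hypertension) combination:
# count of appointments, count of no-shows, and the no-show share
summary = (
    toy.groupby(['Diabetes', 'Hypertension'])['No-show']
       .agg(total='count', no_shows='sum', no_show_rate='mean')
)

# Express the share as a percentage
summary['no_show_rate'] *= 100

print(summary)
```

Because `No-show` is coded as 0/1, its mean within each group is the no-show rate, which removes the need for separate "showed" and "no-showed" filters.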


### Statistical Analysis 

#### Use Chi-Square Test of Independence to determine if there is a significant association between chronic illness status and appointment attendance.

In [30]:
#Exploring whether there is a statistically significant association between diabetic status and attendance
contingency_table = pd.crosstab(df['Diabetes'], df['No-show'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi2: {chi2}, p-value: {p}")

Chi2: 25.326693550869877, p-value: 4.839646820880228e-07


In [31]:
#Exploring whether there is a statistically significant association between blood pressure status and attendance
contingency_table = pd.crosstab(df['Hypertension'], df['No-show'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi2: {chi2}, p-value: {p}")

Chi2: 140.66859528017784, p-value: 1.9011212241495915e-32


In [32]:
#Exploring whether there is a statistically significant association between having both diabetes and high blood pressure and attendance
contingency_table = pd.crosstab(df['diabetes_hypertension'], df['No-show'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi2: {chi2}, p-value: {p}")

Chi2: 28.76934712167211, p-value: 8.153135678619768e-08


The Chi2 value of 25.33 reflects a gap between the observed and expected frequencies of diabetes status and appointment no-shows. The p-value of 4.84×10^-7 is far below the common significance level of 0.05, which is strong evidence against the null hypothesis of independence. There is therefore a statistically significant association between having diabetes and the likelihood of a no-show.

The Chi2 value of 140.67 indicates a larger gap between observed and expected frequencies for hypertension status and appointment no-shows. The p-value of 1.90×10^-32 is strong evidence against the null hypothesis, so hypertension is also significantly associated with the likelihood of a no-show.

The combined indicator for diabetes and hypertension gives a Chi2 value of 28.77 and a p-value of 8.15×10^-8, which is likewise significant at the 0.05 level.

All three tests show significant associations with the likelihood of missing an appointment. The group percentages above indicate the direction: patients with hypertension, alone (17.2%) or together with diabetes (17.6%), missed appointments less often than patients with neither condition (20.9%). With more than 110,000 appointments, even small differences produce very low p-values, so statistical significance here does not by itself imply a large effect.
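
One common way to gauge the strength of an association in a 2×2 table is the phi coefficient, sqrt(chi2 / n). The sketch below applies it to the Chi2 values printed above, assuming n = 110,526 rows (the original 110,527 minus the one dropped row). The helper name `phi_coefficient` is hypothetical. Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so the resulting values are approximate.

```python
import math

def phi_coefficient(chi2, n):
    """Effect size for a 2x2 contingency table: sqrt(chi2 / n)."""
    return math.sqrt(chi2 / n)

# Assumption: rows remaining after the invalid-age row was dropped
n_rows = 110527 - 1

# Chi2 statistics copied from the notebook output
reported_chi2 = {
    'Diabetes': 25.326693550869877,
    'Hypertension': 140.66859528017784,
    'diabetes_hypertension': 28.76934712167211,
}

for name, chi2 in reported_chi2.items():
    print(f'{name}: phi = {phi_coefficient(chi2, n_rows):.3f}')
```

All three values come out well below 0.1, a range conventionally read as a weak association.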

#### Use Logistic Regression to assess how diabetes, hypertension, and their combination relate to the probability of no-shows

In [34]:
# Prepare the features and target variable
X = df[['Diabetes', 'Hypertension', 'diabetes_hypertension']]
y = df['No-show']

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Output the intercept and coefficients (scikit-learn does not report p-values)
coefficients = model.coef_
intercept = model.intercept_

print(f"Intercept: {intercept}")
print(f"Coefficients: {coefficients}")

Intercept: [-1.3296341]
Coefficients: [[-0.06609475 -0.24336145  0.09504308]]


Diabetes Coefficient: -0.06609475. This suggests that having diabetes slightly decreases the likelihood of not showing up, but the effect is small since the coefficient is close to zero.

Hypertension Coefficient: -0.24336145. This indicates that having hypertension reduces the likelihood of not showing up for an appointment more noticeably.

Interaction Term Coefficient: 0.09504308. A positive coefficient here means that when both conditions are present, the combined reduction is slightly smaller than the sum of the two individual effects. The net effect for these patients is still a lower likelihood of a no-show (-0.066 - 0.243 + 0.095 ≈ -0.214), and the interaction itself is relatively small.
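
Logistic regression coefficients are on the log-odds scale, so they can be easier to read after conversion. The sketch below takes the intercept and coefficients printed by the model above and converts them to odds ratios and to predicted no-show probabilities for each group. The helper name `predicted_probability` is hypothetical. Keep in mind that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the coefficients are slightly shrunk relative to an unpenalized fit.

```python
import math

# Values copied from the model output above
intercept = -1.3296341
coefs = {
    'Diabetes': -0.06609475,
    'Hypertension': -0.24336145,
    'diabetes_hypertension': 0.09504308,
}

# exp(coefficient) is the multiplicative change in the odds of a no-show
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

def predicted_probability(diabetes, hypertension):
    """No-show probability for a 0/1 diabetes and 0/1 hypertension status."""
    both = diabetes * hypertension  # interaction term is 1 only if both are 1
    log_odds = (intercept
                + coefs['Diabetes'] * diabetes
                + coefs['Hypertension'] * hypertension
                + coefs['diabetes_hypertension'] * both)
    return 1 / (1 + math.exp(-log_odds))

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio = {ratio:.3f}')

for d in (0, 1):
    for h in (0, 1):
        p = predicted_probability(d, h)
        print(f'Diabetes={d}, Hypertension={h}: P(no-show) = {p:.3f}')
```

The predicted probabilities can be compared against the group percentages from the exploratory analysis as a sanity check on the fit.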