# Project: Exploring the influence of patient's data on the possibility of not showing up at a medical appointment

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In the following analysis, I am going to take a closer look at the "Medical No-Show Appointments" dataset originally sourced on Kaggle (source link: https://www.kaggle.com/joniarroba/noshowappointments).

This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row. (Source: https://docs.google.com/document/d/e/2PACX-1vTlVmknRRnfy_4eTrjw5hYGaiQim5ctr9naaRd4V9du2B5bxpd8FEH3KtDgp8qVekw7Cj1GLk1IXdZi/pub?embedded=True)

I am going to dive into the patients' characteristics and analyse whether they influence the probability of patients showing up at an appointment.

Research questions:

1: People of which age group are most likely not showing up at appointments?

2: Patients from which neighbourhoods are more likely to miss an appointment?

3: Is there an influence of diseases or a physical handicap?

4: Does a scholarship from Bolsa Familia welfare program increase the chance of patients showing up at their appointments?

In [238]:
%matplotlib inline


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import unicodecsv

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [239]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

with open('noshowappointments-kagglev2-may-2016.csv', 'rb') as d:
    reader = unicodecsv.DictReader(d)
    noshowapp = list(reader)
    
noshowapp[0]    


OrderedDict([('PatientId', '29872499824296'),
             ('AppointmentID', '5642903'),
             ('Gender', 'F'),
             ('ScheduledDay', '2016-04-29T18:38:08Z'),
             ('AppointmentDay', '2016-04-29T00:00:00Z'),
             ('Age', '62'),
             ('Neighbourhood', 'JARDIM DA PENHA'),
             ('Scholarship', '0'),
             ('Hipertension', '1'),
             ('Diabetes', '0'),
             ('Alcoholism', '0'),
             ('Handcap', '0'),
             ('SMS_received', '0'),
             ('No-show', 'No')])

Loading the data using Python's unicodecsv module. Printing out the first record of the dataset via the DictReader to check whether the columns and data were loaded correctly.
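As an alternative sketch, pandas can read a CSV file directly into a DataFrame without going through unicodecsv. The example below uses a two-row in-memory sample (copied from the records shown in this notebook) as a stand-in for the real file, and passes dtype=str so that all values stay as text, as with the DictReader approach. The names sample_csv and sample_df are illustrative only.

```python
import io

import pandas as pd

# Two sample rows in the same layout as the Kaggle file
# (illustrative stand-in for 'noshowappointments-kagglev2-may-2016.csv').
sample_csv = io.StringIO(
    "PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,"
    "Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show\n"
    "29872499824296,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,"
    "JARDIM DA PENHA,0,1,0,0,0,0,No\n"
    "558997776694438,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,"
    "JARDIM DA PENHA,0,0,0,0,0,0,No\n"
)

# dtype=str keeps every value as text, matching what DictReader returns.
sample_df = pd.read_csv(sample_csv, dtype=str)

# Show the first record as a dictionary, similar to noshowapp[0].
print(sample_df.iloc[0].to_dict())
```

With the real file, the StringIO object would simply be replaced by the file name.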

In [240]:
df = pd.DataFrame(noshowapp)
df

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872499824296,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997776694438,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962299951,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951213174,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186448183,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,2572134369293,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,3596266328735,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,15576631729893,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,92134931435557,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


Creating and printing a Pandas DataFrame to get an overview of the dataset and an idea of its dimensions (110527 rows x 14 columns).

In [241]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   PatientId       110527 non-null  object
 1   AppointmentID   110527 non-null  object
 2   Gender          110527 non-null  object
 3   ScheduledDay    110527 non-null  object
 4   AppointmentDay  110527 non-null  object
 5   Age             110527 non-null  object
 6   Neighbourhood   110527 non-null  object
 7   Scholarship     110527 non-null  object
 8   Hipertension    110527 non-null  object
 9   Diabetes        110527 non-null  object
 10  Alcoholism      110527 non-null  object
 11  Handcap         110527 non-null  object
 12  SMS_received    110527 non-null  object
 13  No-show         110527 non-null  object
dtypes: object(14)
memory usage: 11.8+ MB


Accessing general information about the dataframe by using the df.info function.

(Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) 

Result: The dataset is structured in 14 columns (column index 0-13) and has 110527 entries (0-110526).
Every column counts 110527 non-null/non-NA values, which means that there is no missing data in any of the columns.
The df.info function also shows the data types used in the dataset. In this dataset, every column has the data type object (text or mixed numeric and non-numeric values), because the DictReader returns all values as strings.

(Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)

Some columns therefore need to be assigned a suitable data type in the following data cleaning steps to prepare the dataset for further analysis.
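The missing-value check read off the df.info output can also be made explicit. The following is a minimal sketch on a small made-up frame (the frame demo and its values are assumptions for illustration, not data from the file): isna().sum() counts the missing entries per column, and dtypes lists the data type of each column.

```python
import pandas as pd

# Small illustrative frame; the None marks a deliberately missing value.
demo = pd.DataFrame({"Age": ["62", "56", None], "Gender": ["F", "M", "F"]})

# Number of missing entries per column (0 everywhere means no missing data).
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Data type of each column.
print(demo.dtypes)
```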



In [242]:
df.describe()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
count,110527,110527,110527,110527,110527,110527,110527,110527,110527,110527,110527,110527,110527,110527
unique,62299,110527,2,103549,27,104,81,2,2,2,2,5,2,2
top,822145925426128,5579643,F,2016-05-06T07:09:54Z,2016-06-06T00:00:00Z,0,JARDIM CAMBURI,0,0,0,0,0,0,No
freq,88,1,71840,24,4692,3539,7717,99666,88726,102584,107167,108286,75045,88208


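Using the Pandas df.describe function to show some more insights about the dataset. The output of this function depends on the data types that are stored in the dataframe. As all columns currently have the data type object, the output contains the rows count, unique, top and freq.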

In this case, the output shows some interesting information:

The row "unique" shows the count of unique values for each column.<br/>
--> 110527 unique values in the column "AppointmentID" (ok, no duplicate appointment records).<br/>
--> fewer unique values in the column "PatientId" (plausible, patients usually have more than one appointment).<br/>
--> 2 unique values for "Gender" (ok, male/female)<br/>
--> 104 unique values in the "Age" column (plausible)<br/>

The rows "top" and "freq" show the value that occurred most frequently in each column and how often it occurred.<br/>
--> the most frequent age is 0 with 3539 records (plausible, children in their first year of life)<br/>
--> 5 unique values for the column "Handcap" (interesting, because the other columns that describe the patients' conditions (e.g. "Diabetes" or "Alcoholism") only contain 2 unique values (0 for FALSE, 1 for TRUE)) --> further inspection of this column to check the unique values.

(Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
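The uniqueness checks described above can also be run directly instead of being read off the describe output. This is a small sketch on made-up IDs (the frame demo and its values are assumptions): nunique counts distinct values per column, is_unique tells whether a column contains duplicates, and value_counts shows how many appointments belong to each patient.

```python
import pandas as pd

# Made-up IDs: patient "a" has two appointments, patient "b" has one.
demo = pd.DataFrame({
    "PatientId": ["a", "a", "b"],
    "AppointmentID": ["1", "2", "3"],
})

# Number of distinct values per column.
unique_counts = demo.nunique()

# True if no appointment ID occurs twice.
appointments_unique = demo["AppointmentID"].is_unique

# Number of appointments per patient, most frequent patient first.
appointments_per_patient = demo["PatientId"].value_counts()

print(unique_counts)
print(appointments_unique)
print(appointments_per_patient)
```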




In [243]:
df['Handcap'].unique()

array(['0', '1', '2', '3', '4'], dtype=object)

Printing out the unique values of column "Handcap". 
The description for this column on kaggle says that "Handcap" only has the values TRUE(=1) or FALSE(=0). 

(Source: https://www.kaggle.com/joniarroba/noshowappointments)

In our dataset, the column has the 5 unique values shown in the array above, which does not match the description on kaggle. In the following steps, the column will be cleaned (0 = False, 1-4 = True) to make it comparable with the other columns.

### Data Cleaning

**Cleaning step 1: cleaning column "Handcap"**

In [244]:
df = df.astype({'Handcap':'int64'})

df['Handcap'] = np.where(df['Handcap'] > 1, 1, df['Handcap'])

df['Handcap'].unique()

array([0, 1])

Cleaning the column "Handcap": 

--> changed the data type of the column's values to int64 using the DataFrame.astype function

(Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)

--> replaced the values 2, 3 and 4 with 1 using the numpy.where function (reason: the column can then be treated as a True/False flag like the other condition columns)

(Source: https://numpy.org/doc/stable/reference/generated/numpy.where.html)

--> checked the output of the cleaning step: the array now only contains the values 0 and 1 --> ok
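As a side note, the same capping can be sketched with the pandas Series.clip method instead of numpy.where. The example below works on a small stand-alone series with the five values found in the column, not on the full dataframe; the names handcap and handcap_binary are illustrative.

```python
import pandas as pd

# The five distinct values found in the "Handcap" column, still stored as strings.
handcap = pd.Series(["0", "1", "2", "3", "4"], name="Handcap")

# Convert to integers, then cap every value above 1 at 1.
handcap_binary = handcap.astype("int64").clip(upper=1)

print(handcap_binary.tolist())  # [0, 1, 1, 1, 1]
```

Both variants give the same result here; clip just states the intent (an upper bound of 1) a little more directly.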

**Cleaning step 2: changing datatypes for columns where needed**

In [245]:
dtypeDict = {
    "PatientId" : "object",
    "AppointmentID" : "object",
    "Gender" : "object",
    "ScheduledDay" : "datetime64",
    "AppointmentDay" : "datetime64",
    "Age" : "int64",
    "Neighbourhood" : "object",
    "Scholarship" : "int64",
    "Hipertension" : "int64",
    "Diabetes" : "int64",
    "Alcoholism" : "int64",
    "Handcap" : "int64",
    "SMS_received" : "int64",
    "No-show" : "object"
}

for col in df:
    df[col] = df[col].astype(dtypeDict[col])  
    
df.dtypes

PatientId                 object
AppointmentID             object
Gender                    object
ScheduledDay      datetime64[ns]
AppointmentDay    datetime64[ns]
Age                        int64
Neighbourhood             object
Scholarship                int64
Hipertension               int64
Diabetes                   int64
Alcoholism                 int64
Handcap                    int64
SMS_received               int64
No-show                   object
dtype: object

Cleaning the data types for the columns where needed:

--> creating the Dictionary dtypeDict containing the column names as Dict keys and the pandas data types to be assigned as values

--> looping over the columns and setting the datatype for each column according to the values in the dict

--> checking the output by printing out the data types using the df.dtypes function 
(Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)
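Depending on the installed pandas version, the unit-less "datetime64" string passed to astype may not be accepted. A possible alternative, sketched here on a two-row sample and not taken from the original notebook, is to use the dedicated converters pd.to_datetime and pd.to_numeric. The frame raw is an illustrative stand-in for df.

```python
import pandas as pd

# Two sample rows with the same string formats as the real columns.
raw = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
    "Age": ["62", "56"],
})

for col in ["ScheduledDay", "AppointmentDay"]:
    # Parse the ISO 8601 strings as UTC, then drop the timezone
    # to get plain (timezone-naive) timestamps.
    raw[col] = pd.to_datetime(raw[col], utc=True).dt.tz_localize(None)

# Convert the numeric strings to integers.
raw["Age"] = pd.to_numeric(raw["Age"])

print(raw.dtypes)
```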

In [246]:
df

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872499824296,5642903,F,2016-04-29 18:38:08,2016-04-29,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997776694438,5642503,M,2016-04-29 16:08:27,2016-04-29,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962299951,5642549,F,2016-04-29 16:19:04,2016-04-29,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951213174,5642828,F,2016-04-29 17:29:31,2016-04-29,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186448183,5642494,F,2016-04-29 16:07:23,2016-04-29,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,2572134369293,5651768,F,2016-05-03 09:15:35,2016-06-07,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,3596266328735,5650093,F,2016-05-03 07:27:33,2016-06-07,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,15576631729893,5630692,F,2016-04-27 16:03:52,2016-06-07,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,92134931435557,5630323,F,2016-04-27 15:09:23,2016-06-07,38,MARIA ORTIZ,0,0,0,0,0,1,No


<a id='eda'></a>
## Exploratory Data Analysis

Research questions:

1: People of which age group are most likely not showing up at appointments?

2: Patients from which neighbourhoods are more likely to miss an appointment?

3: Is there an influence of diseases or a physical handicap?

4: Does a scholarship from Bolsa Familia welfare program increase the chance of patients showing up at their appointments?

### 1:  People of which age group are most likely not showing up at appointments? 

Adding a column to the df based on predefined agegroups to see differences between the age groups.

Creating a list of conditions and matching the age group to every row in the dataframe by using the np.select function.
(Source: https://numpy.org/doc/stable/reference/generated/numpy.select.html)

In [247]:
conditions = [
    (df['Age'] >= 0) & (df['Age'] < 16),
    (df['Age'] >= 16) & (df['Age'] < 20),
    (df['Age'] >= 20) & (df['Age'] < 30),
    (df['Age'] >= 30) & (df['Age'] < 50),
    (df['Age'] >= 50) & (df['Age'] < 60),
    (df['Age'] >= 60)
    ]

agegroups = ['age 0-15', 'age 16-19', 'age 20-29', 'age 30-49', 'age 50-59','age > 60']

df['Age Group'] = np.select(conditions, agegroups)

df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,Age Group
0,29872499824296,5642903,F,2016-04-29 18:38:08,2016-04-29,62,JARDIM DA PENHA,0,1,0,0,0,0,No,age > 60
1,558997776694438,5642503,M,2016-04-29 16:08:27,2016-04-29,56,JARDIM DA PENHA,0,0,0,0,0,0,No,age 50-59
2,4262962299951,5642549,F,2016-04-29 16:19:04,2016-04-29,62,MATA DA PRAIA,0,0,0,0,0,0,No,age > 60
3,867951213174,5642828,F,2016-04-29 17:29:31,2016-04-29,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,age 0-15
4,8841186448183,5642494,F,2016-04-29 16:07:23,2016-04-29,56,JARDIM DA PENHA,0,1,1,0,0,0,No,age 50-59
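The same binning can also be sketched with pd.cut, which takes the bin edges once instead of a list of conditions. This is an alternative illustration on made-up ages, not the method used in this notebook; the label "age 60+" and the sample values are assumptions. With right=False every bin includes its left edge and excludes its right edge, so the bins match the conditions above. Ages outside all bins (for example negative values) become NaN instead of receiving a default label.

```python
import numpy as np
import pandas as pd

# Made-up ages, including edge cases at the bin borders and one invalid value.
ages = pd.Series([-1, 0, 15, 16, 19, 20, 45, 59, 60, 115], name="Age")

# Bin edges: [0, 16), [16, 20), [20, 30), [30, 50), [50, 60), [60, inf)
bins = [0, 16, 20, 30, 50, 60, np.inf]
labels = ["age 0-15", "age 16-19", "age 20-29", "age 30-49", "age 50-59", "age 60+"]

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "Age Group": age_group}))
```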


Looking at each age group and counting how many appointments were missed (each appointment counts, it is not taken into account that there could be multiple records for one patient).

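Using value_counts() on the grouped "No-show" column with normalize=True to see, within each age group, the ratio of no-show appointments to appointments where patients showed up. In this dataset, "No-show" = "Yes" means that the patient missed the appointment.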


In [248]:
df_grouped = df.groupby("Age Group")

result = df_grouped['No-show'].value_counts(normalize=True)

result

Age Group  No-show
0          No         1.000000
age 0-15   No         0.785516
           Yes        0.214484
age 16-19  No         0.748780
           Yes        0.251220
age 20-29  No         0.753267
           Yes        0.246733
age 30-49  No         0.789653
           Yes        0.210347
age 50-59  No         0.825002
           Yes        0.174998
age > 60   No         0.846880
           Yes        0.153120
Name: No-show, dtype: float64

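In the output above, the group labelled 0 is the default value of np.select: it contains the records whose age matches none of the conditions (that is, ages below 0), and all of them are recorded as "No". The label "age > 60" also includes patients who are exactly 60, since its condition is Age >= 60.

Sorting the age groups by highest no-show ratio: filtering the result series so that it only contains the no-show ("Yes") ratios and sorting it in descending order. The 0 group has no "Yes" entries and therefore drops out.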

In [249]:
result_noshow = result.filter(like='Yes', axis=0)

result_noshow.sort_values(ascending=False)


Age Group  No-show
age 16-19  Yes        0.251220
age 20-29  Yes        0.246733
age 0-15   Yes        0.214484
age 30-49  Yes        0.210347
age 50-59  Yes        0.174998
age > 60   Yes        0.153120
Name: No-show, dtype: float64

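Result: The age group with the highest ratio of no-show appointments is 16-19 (about 25.1%), closely followed by 20-29 (about 24.7%). The lowest ratio is found for patients aged 60 and above (about 15.3%).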
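As a sketch of a more compact way to get the same ratios, the "No-show" column can be turned into a True/False flag and averaged per group, since the mean of a boolean column is the share of True values. The data below are made up for illustration; the names demo, missed and noshow_rate are assumptions.

```python
import pandas as pd

# Made-up appointments: 1 of 4 missed in the first group, 0 of 2 in the second.
demo = pd.DataFrame({
    "Age Group": ["age 16-19"] * 4 + ["age 60+"] * 2,
    "No-show": ["Yes", "No", "No", "No", "No", "No"],
})

# True where the appointment was missed.
missed = demo["No-show"].eq("Yes")

# Mean of the boolean flag per group = no-show rate per group.
noshow_rate = missed.groupby(demo["Age Group"]).mean().sort_values(ascending=False)

print(noshow_rate)
```

Unlike the filter on "Yes", this variant keeps groups without any missed appointment and reports them with a rate of 0.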

### 2: Patients from which neighbourhoods are most likely to miss an appointment? 

Grouping by neighbourhood, calculating the normalized share of show and no-show appointments within each neighbourhood and keeping only the no-show ("Yes") share. Sorting the result in descending order to show the neighbourhoods with the highest no-show rate.

In [250]:
nbh_grouped = df.groupby("Neighbourhood")

nbh_result = nbh_grouped['No-show'].value_counts(normalize=True).filter(like='Yes', axis=0).sort_values(ascending=False)

nbh_result


Neighbourhood                No-show
ILHAS OCEÂNICAS DE TRINDADE  Yes        1.000000
SANTOS DUMONT                Yes        0.289185
SANTA CECÍLIA                Yes        0.274554
SANTA CLARA                  Yes        0.264822
ITARARÉ                      Yes        0.262664
                                          ...   
DE LOURDES                   Yes        0.154098
SOLON BORGES                 Yes        0.147122
MÁRIO CYPRESTE               Yes        0.145553
AEROPORTO                    Yes        0.125000
ILHA DO BOI                  Yes        0.085714
Name: No-show, Length: 80, dtype: float64

Result: the neighbourhood of ILHAS OCEÂNICAS DE TRINDADE has the highest no-show rate (rate = 1 --> need to take a look at the absolute number of entries for this neighbourhood, as it seems to be a very low number of entries or even just 1).

Showing the absolute numbers of entries for the first 3 neighbourhoods, as ILHAS OCEÂNICAS DE TRINDADE seems to be an outlier.

In [251]:
nbh_count_abs = df.groupby("Neighbourhood")['No-show'].value_counts()

nbh_count_abs.loc[['ILHAS OCEÂNICAS DE TRINDADE','SANTOS DUMONT','SANTA CECÍLIA']]

Neighbourhood                No-show
ILHAS OCEÂNICAS DE TRINDADE  Yes          2
SANTA CECÍLIA                No         325
                             Yes        123
SANTOS DUMONT                No         907
                             Yes        369
Name: No-show, dtype: int64

Result: ILHAS OCEÂNICAS DE TRINDADE has only 2 entries in total, both of which were no-show appointments. This result can therefore be disregarded, as the number of entries is too small to be meaningful.

Therefore, SANTOS DUMONT (369 of 1276 appointments missed, rate 0.289) can be stated as the neighbourhood with the highest rate of no-show appointments, followed by SANTA CECÍLIA (123 of 448 appointments missed, rate 0.275).
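To avoid being misled by neighbourhoods with very few entries, the rate and the number of appointments can be computed side by side and filtered by a minimum group size. The following is only a sketch on made-up data: the neighbourhood names A, B and C, the frame demo and the cutoff MIN_APPOINTMENTS are hypothetical and would need to be chosen for the real dataset.

```python
import pandas as pd

# Made-up data: A has 2 appointments (both missed), B has 10 (3 missed), C has 10 (1 missed).
demo = pd.DataFrame({
    "Neighbourhood": ["A"] * 2 + ["B"] * 10 + ["C"] * 10,
    "No-show": ["Yes"] * 2 + ["Yes"] * 3 + ["No"] * 7 + ["Yes"] * 1 + ["No"] * 9,
})

MIN_APPOINTMENTS = 5  # hypothetical cutoff for a group to be considered

# Per neighbourhood: number of appointments and share of missed appointments.
summary = (
    demo.assign(missed=demo["No-show"].eq("Yes"))
        .groupby("Neighbourhood")["missed"]
        .agg(appointments="count", noshow_rate="mean")
)

# Keep only neighbourhoods with enough appointments, highest rate first.
reliable = (
    summary[summary["appointments"] >= MIN_APPOINTMENTS]
    .sort_values("noshow_rate", ascending=False)
)

print(reliable)
```

In this toy example, A is excluded despite its rate of 1.0 because it has only 2 appointments.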

### 3: Is there an influence of diseases or physical handicaps? 

Writing a function that calculates the no-show ratio for the given disease (e.g. Alcoholism). Calling that function for every single disease to compare No-show ratio of patients with disease vs. ratio of patients without disease.

In [252]:
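def noshow_dis(column):
    # Group the appointments by the values of the given column (e.g. 0/1 for "Diabetes")
    grouped = df.groupby(column)
    # Share of show/no-show per group; keep only the no-show ('Yes') share, highest first
    result = grouped['No-show'].value_counts(normalize=True).filter(like='Yes', axis=0).sort_values(ascending=False)
    return result

noshow_dis("Diabetes")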

Diabetes  No-show
0         Yes        0.203628
1         Yes        0.180033
Name: No-show, dtype: float64

In [253]:
noshow_dis("Hipertension")

Hipertension  No-show
0             Yes        0.209037
1             Yes        0.173020
Name: No-show, dtype: float64

In [254]:
noshow_dis("Alcoholism")

Alcoholism  No-show
0           Yes        0.201946
1           Yes        0.201488
Name: No-show, dtype: float64

In [255]:
noshow_dis("Handcap")

Handcap  No-show
0        Yes        0.202353
1        Yes        0.181615
Name: No-show, dtype: float64

RESULT: Alcoholism seems to have no influence on the likelihood of patients showing up at their appointment. The no-show rates for patients with and without alcoholism are about the same (about 20.1% vs. 20.2%).

For the other diseases and for handicaps, the no-show rates are a bit lower (about 17-18%) compared to patients without the disease or handicap (about 20-21%).
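No significance test is run in this analysis. As a sketch of how such a difference between two no-show rates could be checked, the function below implements a standard two-proportion z-test using only the Python standard library. The function name two_proportion_z and the counts in the usage example are made up for illustration and are not results from the dataset. The test also assumes independent observations, which is only approximately true here because one patient can have several appointments.

```python
import math


def two_proportion_z(events_a, n_a, events_b, n_b):
    """Return the z statistic and two-sided p-value for H0: both rates are equal."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 180 of 1000 appointments missed in group A, 200 of 1000 in group B.
z, p_value = two_proportion_z(180, 1000, 200, 1000)
print(z, p_value)
```

A small p-value would suggest that the difference between the two rates is unlikely to be due to chance alone.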

### 4: Does a scholarship from Bolsa Familia welfare program increase the chance of patients showing up at their appointments?

Applying the noshow_dis() function to see if there is an influence of the Bolsa Familia Scholarship.

In [256]:
noshow_dis("Scholarship")

Scholarship  No-show
1            Yes        0.237363
0            Yes        0.198072
Name: No-show, dtype: float64

Result: A scholarship does not increase the chance of patients showing up at their appointment. In fact, the no-show rate for patients without a scholarship (about 19.8%) is lower than for patients who take part in the welfare program (about 23.7%).

<a id='conclusions'></a>
## Conclusions

People from the age group of 16-19 years have the highest no-show ratio compared to other age groups. 

Furthermore, disregarding ILHAS OCEÂNICAS DE TRINDADE (only 2 entries), the patients from the SANTOS DUMONT neighbourhood have the highest ratio of no-show appointments, followed by SANTA CECÍLIA.

Diseases and handicaps seem to have an influence on the likelihood of patients showing up at their doctor's appointment, as the no-show rates are lower compared to people without these specific diseases. There is no statistical proof of this, as no statistical tests have been performed; it is just a tendency that can be seen in the data.

Alcoholism seems to have no influence on the behaviour of patients, as the no-show ratio is similar to that of patients without alcoholism.

A scholarship does not increase the ratio of patients showing up at their appointment. In fact, the no-show rate for patients without a scholarship is lower than for people who take part in the welfare program.

The results of this analysis are limited, as only the normalized no-show ratio was taken into account to compare, for example, different age groups or people with or without a disease. The absolute number of entries, which could give an idea of how significant a group of entries is, was only taken into account in the neighbourhood comparison.
Therefore, the results only give a first tendency of how likely it is that patients with specific properties (according to the dataset) show up at their appointments.

For further investigation of the patients' data, some kind of no-show score (which combines all properties that have an influence on the no-show ratio) could be calculated, but this exceeds the time frame of this project.