# Project: No-show Appointment Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

The dataset contains information on more than 100,000 medical appointments in Brazil (110,527 rows). Each row describes one appointment and the patient who booked it:
<ul>
<li>'PatientId' identifies a specific patient.</li>
<li>'AppointmentID' identifies a specific appointment.</li>
<li>'Gender' indicates the gender of the patient.</li>
<li>'ScheduledDay' is the date on which the patient set up the appointment.</li>
<li>'AppointmentDay' is the date of the appointment itself.</li>
<li>'Age' indicates the age of the patient.</li>
<li>'Neighbourhood' indicates the location of the hospital.</li>
<li>'Scholarship' indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.</li>
<li>'Hipertension' (spelled this way in the data) indicates whether or not the patient has hypertension.</li>
<li>'Diabetes' indicates whether or not the patient has diabetes.</li>
<li>'Alcoholism' indicates whether or not the patient suffers from alcoholism.</li>
<li>'Handcap' (spelled this way in the data) indicates whether or not the patient has a disability; its values range from 0 to 4.</li>
<li>'SMS_received' indicates whether the patient received an SMS reminder before the appointment.</li>
<li>'No-show' indicates whether the patient missed the appointment: 'No' means the patient showed up, 'Yes' means the patient did not.</li>
</ul>

This analysis compares patients who showed up for their medical appointments with patients who did not.
The goal is to find differences between the two groups.

### Questions
<ul>
<li>Do no-shows differ with the age of the patient? Hypothesis: Older patients are more responsible and have a lower no-show rate.</li>
<li>Do no-shows differ if patients received an SMS? Hypothesis: Patients who received an SMS are reminded of their appointment and have a lower no-show rate.</li>
<li>Do no-shows differ if patients have a scholarship? Hypothesis: Patients with a scholarship receive financial support, so they don't need to worry about the cost of an appointment and have a lower no-show rate.</li>
<li>Do no-shows differ if the appointment is scheduled far in advance? Hypothesis: Patients who schedule their appointments far in advance forget about them and have a higher no-show rate.</li>
</ul>

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [1]:
# import of packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [2]:
# load data 
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')

In [4]:
df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   patient_id       110527 non-null  float64
 1   appointment_id   110527 non-null  int64  
 2   gender           110527 non-null  object 
 3   scheduled_day    110527 non-null  object 
 4   appointment_day  110527 non-null  object 
 5   age              110527 non-null  int64  
 6   neighbourhood    110527 non-null  object 
 7   scholarship      110527 non-null  int64  
 8   hipertension     110527 non-null  int64  
 9   diabetes         110527 non-null  int64  
 10  alcoholism       110527 non-null  int64  
 11  handicap         110527 non-null  int64  
 12  messaged         110527 non-null  int64  
 13  no_show          110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [40]:
df.describe()

| | patient_id | appointment_id | age | scholarship | hipertension | diabetes | alcoholism | handicap | messaged |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


In [24]:
df.duplicated().sum()

0

In [33]:
# column names as a plain Python list
column_list = df.columns.tolist()

In [41]:
df.duplicated(subset='patient_id').sum()

48228

In [46]:
# number of rows that repeat an earlier value, per column
for col in df.columns:
    print(col)
    print(df.duplicated(subset=col).sum())

patient_id
48228
appointment_id
0
gender
110525
scheduled_day
6978
appointment_day
110500
age
110423
neighbourhood
110446
scholarship
110525
hipertension
110525
diabetes
110525
alcoholism
110525
handicap
110522
messaged
110525
no_show
110525


In [57]:
# frequency of each distinct value, per column
for col in df.columns:
    print(df[col].value_counts())

8.221459e+14    88
9.963767e+10    84
2.688613e+13    70
3.353478e+13    65
7.579746e+13    62
                ..
1.779297e+13     1
9.985120e+11     1
3.256827e+13     1
9.232297e+13     1
5.133834e+14     1
Name: patient_id, Length: 62299, dtype: int64
5771266    1
5680512    1
5602682    1
5598584    1
5584243    1
          ..
5686642    1
5692785    1
5647727    1
5645678    1
5769215    1
Name: appointment_id, Length: 110527, dtype: int64
F    71840
M    38687
Name: gender, dtype: int64
2016-05-06T07:09:54Z    24
2016-05-06T07:09:53Z    23
2016-04-25T17:18:27Z    22
2016-04-25T17:17:46Z    22
2016-04-25T17:17:23Z    19
                        ..
2016-06-03T07:11:11Z     1
2016-05-11T15:41:11Z     1
2016-05-25T08:37:46Z     1
2016-05-12T08:25:12Z     1
2016-05-02T07:22:37Z     1
Name: scheduled_day, Length: 103549, dtype: int64
2016-06-06T00:00:00Z    4692
2016-05-16T00:00:00Z    4613
2016-05-09T00:00:00Z    4520
2016-05-30T00:00:00Z    4514
2016-06-08T00:00:00Z    4479
2016-05-11

In [63]:
df.scheduled_day.min()

'2015-11-10T07:13:56Z'

In [65]:
df.scheduled_day.max()

'2016-06-08T20:07:23Z'

In [66]:
df.appointment_day.min()

'2016-04-29T00:00:00Z'

In [67]:
df.appointment_day.max()

'2016-06-08T00:00:00Z'

In [68]:
df.age.describe()

count    110527.000000
mean         37.088874
std          23.110205
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: age, dtype: float64

In [100]:
# count appointments per neighbourhood; rename_axis/reset_index(name=...) names the
# columns explicitly, so the result does not depend on the pandas version
neighbourhood = df.neighbourhood.value_counts().rename_axis('neighbourhood').reset_index(name='count')
# str.title() already lower-cases the remaining letters, so no separate str.lower() is needed
neighbourhood['neighbourhood'] = neighbourhood['neighbourhood'].str.title()
neighbourhood = neighbourhood.sort_values(by=['neighbourhood'])
neighbourhood

| | neighbourhood | count |
|---|---|---|
| 78 | Aeroporto | 8 |
| 17 | Andorinhas | 2262 |
| 66 | Antônio Honório | 271 |
| 65 | Ariovaldo Favalessa | 282 |
| 59 | Barro Vermelho | 423 |
| ... | ... | ... |
| 21 | São José | 1977 |
| 14 | São Pedro | 2448 |
| 6 | Tabuazeiro | 3132 |
| 71 | Universitário | 152 |


### Data Cleaning 

<ul>
<li>Adjust column names</li>
<li>Adjust data types</li>
<li>Check for missing (NaN) values</li>
<li>Adjust unrealistic values for age</li>
<li>Adjust values for neighbourhood (title case)</li>
<li>Adjust values for neighbourhood (Santa Lucia and Santa Luzia)</li>
</ul>
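The cleaning steps from the list above that follow the column renaming could look roughly like the code below. This is a minimal sketch on a made-up three-row frame (`toy`), not the notebook's actual cleaning code. Dropping negative ages and adding a boolean `missed` helper column are assumptions, and the Santa Lucia / Santa Luzia adjustment is left out because the intended fix is not stated.

```python
import pandas as pd

# Toy rows that mimic the renamed columns; the values are made up for illustration.
toy = pd.DataFrame({
    'scheduled_day': ['2016-04-29T18:38:08Z', '2016-04-27T08:00:00Z', '2016-05-02T09:15:00Z'],
    'appointment_day': ['2016-04-29T00:00:00Z', '2016-05-04T00:00:00Z', '2016-05-10T00:00:00Z'],
    'age': [62, -1, 8],
    'neighbourhood': ['JARDIM DA PENHA', 'MATA DA PRAIA', 'SANTA LUZIA'],
    'no_show': ['No', 'Yes', 'No'],
})

# Adjust data types: parse the ISO 8601 strings into timezone-aware datetimes.
for col in ['scheduled_day', 'appointment_day']:
    toy[col] = pd.to_datetime(toy[col])

# Check for missing values: count NaN entries per column.
missing_per_column = toy.isna().sum()

# Adjust unrealistic ages: negative ages are dropped here (an assumption;
# capping or imputing would be alternatives).
cleaned = toy[toy['age'] >= 0].copy()

# Adjust neighbourhood values: title-case the upper-case names.
cleaned['neighbourhood'] = cleaned['neighbourhood'].str.title()

# Hypothetical helper column: True means the appointment was missed ('Yes').
cleaned['missed'] = cleaned['no_show'] == 'Yes'

print(missing_per_column)
print(cleaned)
```

Working on a copy (`cleaned`) keeps the raw frame available for comparison while the cleaning rules are still being decided.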

In [15]:
# adjust column names to snake_case ('Handcap' becomes 'handicap', 'SMS_received' becomes 'messaged')
df.rename(columns={'PatientId':'patient_id', 'AppointmentID':'appointment_id', 'Gender':'gender', 
                   'ScheduledDay':'scheduled_day', 'AppointmentDay':'appointment_day', 'Age':'age', 
                   'Neighbourhood':'neighbourhood', 'Scholarship':'scholarship', 'Hipertension':'hipertension',
                   'Diabetes':'diabetes', 'Alcoholism':'alcoholism', 'Handcap':'handicap', 'SMS_received':'messaged',
                   'No-show':'no_show'}, inplace=True)
df.head()

| | patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hipertension | diabetes | alcoholism | handicap | messaged | no_show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


<a id='eda'></a>
## Exploratory Data Analysis

### Question 1

In [None]:
# difference between scheduled day and appointment day
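One possible way to compute the gap between scheduling and appointment is sketched below. It assumes both columns have already been parsed to datetimes, uses a made-up frame `toy`, and introduces a hypothetical column name `waiting_days`. Because `appointment_day` carries no time of day in this dataset, the sketch compares calendar dates only.

```python
import pandas as pd

# Made-up rows for illustration; the column names follow the renamed dataset.
toy = pd.DataFrame({
    'scheduled_day': pd.to_datetime(
        ['2016-04-29T18:38:08Z', '2016-04-27T08:00:00Z', '2016-05-02T09:15:00Z']),
    'appointment_day': pd.to_datetime(
        ['2016-04-29T00:00:00Z', '2016-05-04T00:00:00Z', '2016-05-10T00:00:00Z']),
})

# normalize() sets the time of day to midnight, so a same-day booking
# yields 0 days instead of a negative fraction of a day.
toy['waiting_days'] = (
    toy['appointment_day'].dt.normalize() - toy['scheduled_day'].dt.normalize()
).dt.days

print(toy['waiting_days'].tolist())
```

Negative values in the real data, if any appear, would point to appointments recorded as scheduled after they took place and would need a separate decision.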

In [None]:
# plot ages
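Before plotting, the ages could be grouped and the no-show rate computed per group. The sketch below is only an illustration: the rows in `toy` are made up, and the bin edges, labels, and the `missed` helper column are assumptions rather than choices made in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'age': [3, 17, 25, 40, 41, 66, 70, 90],
    'no_show': ['Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No'],
})
# 'Yes' in no_show means the appointment was missed.
toy['missed'] = toy['no_show'] == 'Yes'

# Assumed age bands; right=False makes each band include its lower edge only.
bins = [0, 18, 40, 65, 120]
labels = ['0-17', '18-39', '40-64', '65+']
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels, right=False)

# The mean of a boolean column is the share of True values, i.e. the no-show rate.
rate_by_age = toy.groupby('age_group', observed=False)['missed'].mean()
print(rate_by_age)
```

Calling `rate_by_age.plot(kind='bar')` afterwards would draw the rates as a bar chart.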

In [69]:
# difference between missed and attended appointments, with hypothesis testing
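The notebook does not say which test is planned. As one option, a two-proportion z-test can compare the no-show rates of two groups, for example patients with and without an SMS. The function name `two_proportion_z_test` and the counts in the usage line are hypothetical; the counts are not taken from the dataset.

```python
import math


def two_proportion_z_test(missed_a, total_a, missed_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share one no-show rate."""
    rate_a = missed_a / total_a
    rate_b = missed_b / total_b
    # Pooled rate under the null hypothesis of equal proportions.
    pooled = (missed_a + missed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 30 of 100 missed in group A, 20 of 100 missed in group B.
z, p = two_proportion_z_test(30, 100, 20, 100)
print(z, p)
```

With samples as large as this dataset, even small differences tend to come out as statistically significant, so the size of the difference is worth reporting alongside the p-value.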

In [None]:
# several appointments for the same patient id: do they need different handling?
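The wrangling output shows 48,228 rows whose `patient_id` repeats an earlier row, across 62,299 distinct patients, so many patients appear more than once. A rough sketch of how appointments could be summarised per patient follows. The frame `toy` is made up, and the output column names (`appointments`, `missed`, `no_show_rate`) are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'patient_id': [1, 1, 1, 2, 3, 3],
    'no_show': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})
# 'Yes' in no_show means the appointment was missed.
toy['missed'] = toy['no_show'] == 'Yes'

# One row per patient: number of appointments, number missed, and the rate.
per_patient = toy.groupby('patient_id')['missed'].agg(
    appointments='count',
    missed='sum',
    no_show_rate='mean',
)

# Patients with more than one appointment.
repeat_patients = per_patient[per_patient['appointments'] > 1]
print(repeat_patients)
```

Whether repeated patients should be analysed per appointment or per patient is left open here; the summary only makes the repetition visible.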

<a id='conclusions'></a>
## Conclusions

### Limitations



### Question 1:

Hypothesis: 

