# Project: No-show appointments in Brazil

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.
 - ‘ScheduledDay’ tells us on what day the patient set up their appointment.
 - ‘Neighbourhood’ indicates the location of the hospital.
 - ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
 - Be careful about the encoding of the last column: it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
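
Because ‘Yes’ means the patient missed the appointment, it can help to derive an explicit boolean flag before doing any analysis. The snippet below is only a minimal sketch on a tiny made-up frame; the column name `no_show` is a hypothetical choice and is not part of the dataset.

```python
import pandas as pd

# Tiny made-up sample; only the 'No-show' column name mirrors the real dataset.
sample = pd.DataFrame({'No-show': ['No', 'Yes', 'No']})

# 'Yes' means the patient did NOT show up, so it maps to True.
# `no_show` is a hypothetical column name used only for illustration.
sample['no_show'] = sample['No-show'].map({'Yes': True, 'No': False})

print(sample)
```

With a flag like this, `sample['no_show'].mean()` reads directly as the share of missed appointments.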

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

In [174]:
df = pd.read_csv('./data/noshowappointments.csv')
df.tail(n=2)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
110525,92134930000000.0,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No
110526,377511500000000.0,5629448,F,2016-04-27T13:30:56Z,2016-06-07T00:00:00Z,54,MARIA ORTIZ,0,0,0,0,0,1,No


### Types of data inside the dataframe

In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


### Size of the original dataframe

In [142]:
df.shape

(110527, 14)

### Number of unique neighbourhoods in the dataframe

In [158]:
df['Neighbourhood'].unique().size

81

## Conjecture: 
Patients who wait longer between scheduling and the appointment are more likely to miss it.

To verify this conjecture, we add a new column to the dataframe that holds the waiting time, in days, between the day the appointment was scheduled and the appointment itself.

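**One important thing to notice: for a same-day appointment the wait-time calculation gives a negative output, which is not a possible waiting time. In the rows shown above, `AppointmentDay` is recorded at 00:00:00 while `ScheduledDay` includes the time of booking, which would explain the negative values. Those values are replaced by 0.**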
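
To make the same-day case concrete, here is a small sketch with made-up timestamps (they are not taken from the dataset). It assumes the appointment is stored at midnight and the booking carries a time of day, as in the rows shown earlier.

```python
import pandas as pd

# Made-up same-day appointment: booked at 08:30, appointment stored at midnight.
scheduled = pd.to_datetime('2016-06-07T08:30:00Z')
appointment = pd.to_datetime('2016-06-07T00:00:00Z')

# Dividing by a one-day Timedelta gives the wait as a fractional number of days.
raw_wait_days = (appointment - scheduled) / pd.Timedelta(days=1)

# A negative wait is not meaningful, so it is replaced by 0.
wait_days = max(raw_wait_days, 0)

print(raw_wait_days, wait_days)
```

The raw value is negative only because the appointment's time of day is missing, so treating it as a zero-day wait seems a reasonable reading.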

In [175]:
# Parse both timestamp columns and take their difference.
appointment_time_stamp = pd.to_datetime(df['AppointmentDay'])
scheduled_time_stamp = pd.to_datetime(df['ScheduledDay'])
wait_time = appointment_time_stamp - scheduled_time_stamp

# Convert the difference to fractional days and replace negative values
# (same-day appointments) with 0.
# Note: despite its name, wait_hour holds the waiting time in days.
wait_hour = (wait_time / pd.Timedelta(days=1)).clip(lower=0)

In [176]:
df = df.assign(wait_time = wait_hour)

### Drop unnecessary columns from the dataframe

In [177]:
drop_col = ['PatientId', 'AppointmentID', 'ScheduledDay','AppointmentDay']
df.drop(labels = drop_col, axis = 1, inplace = True)

In [178]:
df.tail(n = 2)

Unnamed: 0,Gender,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,wait_time
110525,F,38,MARIA ORTIZ,0,0,0,0,0,1,No,40.368484
110526,F,54,MARIA ORTIZ,0,0,0,0,0,1,No,40.436852
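
With `wait_time` available as a column, one way the conjecture could be examined is to bin the waiting time and compare the share of missed appointments per bin. The following is a rough sketch on made-up rows, not a result from the dataset; the bin edges and the names `toy`, `wait_bin` and `no_show_rate` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows; only the column names mirror the cleaned dataframe.
toy = pd.DataFrame({
    'wait_time': [0.0, 0.0, 2.5, 3.0, 20.0, 45.0],
    'No-show':   ['No', 'No', 'No', 'Yes', 'Yes', 'No'],
})

# Hypothetical bin edges in days; include_lowest keeps wait_time == 0 in the first bin.
bins = [0, 1, 7, 30, 365]
toy['wait_bin'] = pd.cut(toy['wait_time'], bins=bins, include_lowest=True)

# Share of missed appointments per bin ('Yes' means the patient did not show up).
missed = toy['No-show'].eq('Yes')
no_show_rate = missed.groupby(toy['wait_bin'], observed=False).mean()

print(no_show_rate)
```

On the real data, a rate that rises from the shortest to the longer bins would be consistent with the conjecture, although it would not by itself establish a cause.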


### Save the cleaned data to a CSV file

In [155]:
df.to_csv('./data/cleaned-appointment-data.csv', index = False)