

# Project: Investigate the No Show Appointments Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we analyze a dataset of medical appointments in Brazil, focusing on whether or not patients showed up for their appointments. In particular, we look for trends among patients with different health conditions and other factors. The work proceeds through data wrangling, data cleaning and exploratory data analysis, and ends with conclusions.

#### Import all of the packages that we will need for this project

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

#### Load dataset and print out the first 5 rows of the dataset

In [None]:
df=pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head(5)

#### Check the structure of the dataset - 110527 rows and 14 columns

In [None]:
df.shape

#### Describe the dataset - two values look odd: the minimum Age is -1 and the maximum Handcap is 4

In [None]:
df.describe()
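
The odd values that `describe()` reveals can also be counted directly. The following is a minimal sketch on a small made-up frame rather than the real data; the `toy` frame and its values are assumptions used only for illustration.

```python
import pandas as pd

# Toy stand-in for the appointments data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [34, -1, 62, 0, 115],
    "Handcap": [0, 0, 4, 1, 0],
})

# Count rows with an impossible (negative) age.
negative_ages = (toy["Age"] < 0).sum()

# Tally how often each Handcap level occurs, ordered by level.
handcap_levels = toy["Handcap"].value_counts().sort_index()

print(negative_ages)
print(handcap_levels)
```

Counting the suspicious values first shows how many rows a cleaning step would affect before anything is dropped.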

#### Look for instances of missing or possibly errant data - no missing data in our dataset

In [None]:
df.info()

#### Plot a histogram of each numeric variable - apart from Age (not normally distributed), AppointmentID, PatientId and Handcap (which goes up to 4), the variables take only the values 0 or 1

In [None]:
df.hist(figsize=(12,12));



### Data Cleaning

#### Trim and clean the data by dropping the columns that we will not be using - PatientId, AppointmentID and Neighbourhood

In [None]:
df.drop(['PatientId','AppointmentID','Neighbourhood'],axis=1, inplace=True)

In [None]:
df.head(3)

#### Drop the row where the Age is -1

In [None]:
df.query('Age == -1')

In [None]:
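# Select the row by condition instead of a hard-coded label; axis is not needed with index=
df.drop(index=df.query('Age == -1').index, inplace=True)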

#### Double check on the -1 Age row - nothing shows up

In [None]:
df.query('Age==-1')

#### Drop the rows where Handcap is 4

In [None]:
df.query('Handcap==4')

In [None]:
# Select the rows by condition instead of hard-coded labels; axis is not needed with index=
df.drop(index=df.query('Handcap == 4').index, inplace=True)

#### Double check the rows where Handcap equals 4 - nothing shows up

In [None]:
df.query('Handcap==4')
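
The same cleaning can be expressed as a single boolean filter, which avoids looking up row labels at all. This is a sketch under assumptions: the `toy` frame and the names `valid` and `cleaned` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for the appointments data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [34, -1, 62, 0, 115],
    "Handcap": [0, 0, 4, 1, 0],
})

# Keep only rows that pass both checks; the original frame is left untouched.
valid = (toy["Age"] >= 0) & (toy["Handcap"] != 4)
cleaned = toy[valid].reset_index(drop=True)

print(len(toy) - len(cleaned), "rows removed")
```

Filtering into a new frame keeps the raw data available for comparison, at the cost of holding two copies in memory.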

#### Create a dummy variable for the column No-show - 0 when the patient showed up for the appointment, and 1 when the patient didn't

In [None]:
# dtype=int keeps the dummy as 0/1 (recent pandas versions return booleans by default)
dummies = pd.get_dummies(df['No-show'], drop_first=True, dtype=int)
df=pd.concat([df,dummies],axis=1)
df.head()

#### Drop the original No-show column and rename the dummy variable as No_show

In [None]:
df.drop('No-show',axis=1, inplace=True)

In [None]:
df.rename(columns={'Yes':'No_show'}, inplace=True)
df.head()
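
For a two-valued text column, mapping the labels to integers is an alternative to creating a dummy, concatenating and renaming. This is a minimal sketch on invented data, not the approach used above; the `toy` frame is an assumption.

```python
import pandas as pd

# Toy stand-in for the No-show column (values are invented for illustration).
toy = pd.DataFrame({"No-show": ["No", "Yes", "No", "No"]})

# Map the text labels straight to integers: 0 = showed up, 1 = missed.
toy["No_show"] = toy["No-show"].map({"No": 0, "Yes": 1})

# Remove the original text column now that the numeric one exists.
toy = toy.drop(columns="No-show")

print(toy)
```

Any label missing from the mapping becomes `NaN`, which makes unexpected values easy to spot.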

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 - Does Age show any patterns among patients who showed up and patients who didn't?

#### First, let's find out how many patients showed up and how many didn't show up for their appointment

In [None]:
df.groupby('No_show').count().Age

#### Find out the proportion of patients who showed up

In [None]:
88205/(88205+22318)
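
The same proportion can be computed from the column itself rather than from typed-in counts, so it stays correct if the cleaning steps change. This is a sketch on invented data; the `toy` frame is an assumption.

```python
import pandas as pd

# Toy stand-in: 0 = showed up, 1 = missed (values are invented for illustration).
toy = pd.DataFrame({"No_show": [0, 0, 0, 0, 1]})

# The mean of a boolean mask is the share of rows where the mask is True.
show_up_rate = (toy["No_show"] == 0).mean()

# value_counts(normalize=True) gives the share of every value at once.
shares = toy["No_show"].value_counts(normalize=True)

print(show_up_rate)
print(shares)
```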

#### Split the data into 2 parts based on whether they showed up or not

In [None]:
show=df.query('No_show==0')
no_show=df.query('No_show==1')

#### Find the mean Age for each group of patients - patients who showed up were about 37 on average and patients who didn't show up were about 34

In [None]:
print(show['Age'].mean())
no_show['Age'].mean()

#### Now let's show the age distribution of each group using a histogram

In [None]:
show['Age'].hist(alpha=0.5, color='r', label='show')
no_show['Age'].hist(alpha=0.5, label='no show')
plt.xlabel('Age')
plt.legend();  # the labels above only appear once a legend is drawn

Conclusion: Far more patients showed up to their medical appointments than didn't. Patients who showed up were older, on average, than patients who didn't (about 37 versus 34). Among those who didn't show up, the majority were under 60 years old; among those who showed up, children under 10 and people in their 50s formed the largest groups. Because the histogram shows counts rather than rates, these peaks reflect how many appointments each age group had, not necessarily a higher show-up rate.
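
To compare age groups by rate rather than by count, the ages can be binned and the no-show share computed per bin. This is a sketch only: the `toy` frame, the 20-year bands and the `age_band` column are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are invented for illustration).
toy = pd.DataFrame({
    "Age":     [3, 8, 25, 33, 47, 52, 68, 71],
    "No_show": [0, 1,  1,  0,  0,  0,  0,  0],
})

# Bin ages into bands; right=False makes each band include its lower edge only.
bins = [0, 20, 40, 60, 120]
labels = ["0-19", "20-39", "40-59", "60+"]
toy["age_band"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of ones, i.e. the no-show rate.
rate_by_band = toy.groupby("age_band", observed=True)["No_show"].mean()

print(rate_by_band)
```

A rate per band is not affected by how many appointments each band has, which is what distorts a comparison of raw histogram heights.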

### Research Question 2 - Which health condition is associated with the most no-shows?

#### Count the patients with and without Hipertension, Diabetes and Alcoholism, and plot the counts using bar charts

In [None]:
df.groupby('Hipertension').No_show.count()

In [None]:
nohip = df.groupby('Hipertension').No_show.count()[0]

In [None]:
hip = df.groupby('Hipertension').No_show.count()[1]

In [None]:
plt.bar(["No Hipertension","Hipertension"],[nohip, hip])
plt.title("Number of patients who had Hipertension and who didn't")
plt.ylabel("Number of Patients");

#### Do the same for the variables Diabetes and Alcoholism

In [None]:
df.groupby('Diabetes').No_show.count()

In [None]:
nodia = df.groupby('Diabetes').No_show.count()[0]

In [None]:
dia =df.groupby('Diabetes').No_show.count()[1]

In [None]:
plt.bar(["No Diabetes","Diabetes"],[nodia, dia])
plt.title("Number of patients who had Diabetes and who didn't")
plt.ylabel("Number of Patients");

In [None]:
df.groupby('Alcoholism').No_show.count()

In [None]:
noalcho = df.groupby('Alcoholism').No_show.count()[0]

In [None]:
alcho = df.groupby('Alcoholism').No_show.count()[1]

In [None]:
plt.bar(["No Alcoholism","Alcoholism"],[noalcho, alcho])
plt.title("Number of patients who had Alcoholism and who didn't")
plt.ylabel("Number of Patients");

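Conclusion: Among the 3 health conditions we examined, Hipertension was the most common, so patients with Hipertension formed the largest group and likely account for the most no-shows in absolute terms. The bar charts count every appointment in each group, not only the no-shows, so they do not show which condition has the highest no-show rate.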
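
Since `.count()` counts every row in a group, isolating the no-shows takes a sum (or a mean for the rate) over the `No_show` column. This is a sketch on invented data; the `toy` frame and the `summary` table are assumptions, not results from the real dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are invented for illustration).
toy = pd.DataFrame({
    "Hipertension": [1, 1, 0, 0, 1, 0],
    "Diabetes":     [0, 1, 0, 0, 1, 0],
    "Alcoholism":   [0, 0, 1, 0, 0, 0],
    "No_show":      [1, 0, 1, 0, 0, 1],
})

rows = {}
for condition in ["Hipertension", "Diabetes", "Alcoholism"]:
    # No_show values for the patients who have this condition.
    with_condition = toy.loc[toy[condition] == 1, "No_show"]
    rows[condition] = {
        "patients": with_condition.size,        # everyone with the condition
        "no_shows": int(with_condition.sum()),  # only the missed appointments
        "no_show_rate": with_condition.mean(),  # missed / everyone
    }

# One row per condition, one column per measure.
summary = pd.DataFrame(rows).T
print(summary)
```

Reporting the count and the rate side by side separates "this condition is common" from "patients with this condition miss appointments more often".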

### Research Question 3 - Does receiving an SMS increase the show-up rate?

#### Find out, among the patients who received an SMS, how many showed up and how many didn't

In [None]:
sms=df[df['SMS_received']==1]

#### Calculate the show-up rate when patients received an SMS - 72%

In [None]:
sms[sms['No_show']==0].shape[0]/sms.shape[0]*100

#### Find out, among the patients who didn't receive an SMS, how many showed up and how many didn't

In [None]:
nosms=df[df['SMS_received']==0]

#### Calculate the show-up rate when patients didn't receive an SMS - 83%

In [None]:
nosms[nosms['No_show']==0].shape[0]/nosms.shape[0]*100

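Conclusion: Interestingly, patients who received an SMS had a lower show-up rate (72%) than patients who didn't (83%) in our dataset. This is an association only; it does not show that sending an SMS caused the lower rate.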
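
Both rates can be obtained in one step by grouping on `SMS_received`. This is a sketch on invented data; the `toy` frame and its values are assumptions and do not reproduce the 72% and 83% figures.

```python
import pandas as pd

# Toy stand-in for the cleaned data (values are invented for illustration).
toy = pd.DataFrame({
    "SMS_received": [1, 1, 1, 1, 0, 0, 0, 0],
    "No_show":      [1, 0, 0, 1, 0, 0, 0, 1],
})

# Mean of No_show per group is the no-show rate; 1 minus it is the show-up rate.
show_up_pct = (1 - toy.groupby("SMS_received")["No_show"].mean()) * 100

print(show_up_pct)
```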

<a id='conclusions'></a>
## Conclusions

Overall, the dataset we examined in this exercise showed that the majority of patients showed up to their medical appointments, with a show-up rate of about 80%.
Patients who showed up were slightly older on average (about 37) than patients who didn't (about 34). Among the 3 health conditions we examined, Hipertension was the most common, so patients with Hipertension formed the largest group and likely account for the most no-shows in absolute terms, though not necessarily the highest no-show rate. Patients who received an SMS had a lower show-up rate (72%) than patients who didn't (83%) in our dataset.

However, there are many limitations due to the nature of the dataset. For example, the data is highly imbalanced, the distribution of the variable Age is not normal, and most of the other variables are binary with 0 being the majority value.





