# Project: No-show Appointments Data Investigation
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
> In this project we work with the no-show appointments data set to take a closer look at what makes patients miss their appointments.


In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [None]:
# Load data and explore it.
df = pd.read_csv("KaggleV2-May-2016.csv")
#print out a few lines
df.head()

In [None]:
df.columns

**Changes needed:**
<ol>
 <li>Make the column headers lower case so they are easier to work with.</li>
 <li>Rename the column headers with underscores to make them easier to read and work with.</li>
 <li>Explore missing data.</li>
 <li>Explore duplicated data.</li>
 <li>Change the scheduled and appointment dates to datetime.</li>
</ol>

In [None]:
df.info()

In [None]:
df.shape

#### The data set has 110527 rows and 14 columns

In [None]:
# look for missing values
df.isnull().sum()

#### There are no missing values

In [None]:
# look for duplicated data
df.duplicated().sum()

#### There is no duplicated data

In [None]:
df.describe()


## Data Cleaning 

<h3>1- Make column names lower case</h3>

In [None]:
df.columns = df.columns.str.lower()

<h3>2- Changing column names and adding underscores</h3>

In [None]:
df.columns

In [None]:
column_id = ['patient_id', 'appointment_id', 'gender', 'scheduled_day',
       'appointment_day', 'age', 'neighbourhood', 'scholarship', 'hipertension',
       'diabetes', 'alcoholism', 'handcap', 'sms_received', 'no_show']
df.columns = column_id
# for i in column_id:
#     for i in column_id:
#         column_id = "_".join(column_id)
# column_id
df.columns
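
The renaming above is done by hand. As a rough sketch of how the same names could be derived automatically, the helper below converts a header to snake_case with a regular expression. The function name `to_snake_case` and the raw header spellings are assumptions for illustration; the conversion needs the original mixed-case headers, so it would have to run before the lower-casing step.

```python
import re

def to_snake_case(name):
    """Convert a header such as 'ScheduledDay' or 'No-show' to snake_case."""
    # Insert an underscore wherever a lowercase letter is followed by an uppercase one.
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name)
    # Treat hyphens as word separators and lower-case everything.
    return name.replace("-", "_").lower()

# Assumed raw headers of the CSV (illustrative, not printed in this notebook).
raw_columns = ["PatientId", "AppointmentID", "ScheduledDay", "SMS_received", "No-show"]
snake_columns = [to_snake_case(c) for c in raw_columns]
print(snake_columns)
```

With a helper like this, `df.columns = [to_snake_case(c) for c in df.columns]` would replace the hand-written list.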

In [None]:
# column_day = ['scheduledday','appointmentday']
# for i in column_day:
#     df[i] = df[i].apply(lambda x: x[:-3] +  x[-3:])
# df

<h3>3- Changing the scheduled and appointment dates to datetime</h3>

In [None]:
column_day = ['scheduled_day','appointment_day']
for date in column_day:
    df[date] = pd.to_datetime(df[date])
df
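
To see what the conversion does, here is a small sketch on two hand-written timestamps. The ISO 8601 layout of the strings is an assumption about how the CSV stores dates; the values themselves are illustrative.

```python
import pandas as pd

# Two timestamps in the ISO 8601 layout assumed for the CSV (illustrative values).
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-04-29T00:00:00Z"])

# pd.to_datetime parses the strings into a datetime column.
parsed = pd.to_datetime(raw)

# The .dt accessor then exposes the individual components.
print(parsed.dt.year.tolist())
print(parsed.dt.hour.tolist())
```

Once a column is a datetime, components such as year, month or hour are available through `.dt` without any string slicing.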

In [None]:
df.info()

<h3>4- Changing the ids from float to int</h3>

In [None]:
# use an explicit 64-bit integer so large ids cannot overflow on platforms where int is 32-bit
df["appointment_id"] = df["appointment_id"].astype("int64")
df["appointment_id"]

In [None]:
# patient ids are large numbers, so cast to an explicit 64-bit integer
df["patient_id"] = df["patient_id"].astype("int64")

In [None]:
df

<h3>5- Changing age to int</h3>

In [None]:
df['age'] = df['age'].astype(int)
df

<h3>6- Changing the no_show variable to 0/1 to plot it more easily
    (0 for not attending and 1 for attending)</h3>

In [None]:
df["no_show"].nunique()

In [None]:
df["no_show"] = np.where((df["no_show"]=="Yes"), 0, 1)
df["no_show"]
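
The encoding above inverts the raw label: "Yes" in the raw column means the appointment was missed, and it becomes 0. The sketch below repeats the rule on a made-up series and shows an explicit mapping as an alternative; the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw column (made-up values, not the real data).
raw = pd.Series(["No", "Yes", "No", "No"], name="no_show")

# Same rule as above: "Yes" (missed) -> 0, anything else -> 1 (attended).
encoded = np.where(raw == "Yes", 0, 1)

# Equivalent explicit mapping; an unexpected label would become NaN instead of 1.
mapped = raw.map({"Yes": 0, "No": 1})

print(encoded.tolist())
print(mapped.tolist())
```

The explicit mapping is a little more defensive, because `np.where` silently encodes any label other than "Yes" as 1.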

<h3>7- Changing the gender variable to 0/1 to plot it more easily (1 for female, 0 for male)</h3>

In [None]:
df["gender"] = np.where((df["gender"]=="F"), 1, 0)
df["gender"]

In [None]:
df

In [None]:
df.info()

<a id='eda'></a>
## Exploratory Data Analysis


### Create a relation between every variable and the no-show variable

#### Create a new data frame to store the no-shows and shows data

In [None]:
# no_show is now an integer column, so compare with the number 0, not the string "0"
no_shows = df.query('no_show == 0')
no_shows

In [None]:
# no_show is now an integer column, so compare with the number 1, not the string "1"
shows = df.query('no_show == 1')
shows

### 1- How many people attend or miss their appointments

In [None]:
# Number of patients that attended
Nshows = (df["no_show"] == 1).sum()
Nshows

In [None]:
# Number of patients that did not attend
Nno_shows = (df["no_show"] == 0).sum()
Nno_shows

#### The number of people that showed up is greater than the number of people that did not.

In [None]:
# Data to plot
labels = ['No Shows', 'Shows']
sizes = [Nno_shows, Nshows]
colors = [ 'tomato' ,'lightblue']
explode = (0, 0.1)

# The plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=40)
plt.title('Attendance percentage')
plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(5,5);
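
The percentages drawn on the pie chart can also be read as plain numbers. A minimal sketch on a made-up 0/1 column, using `value_counts` with `normalize=True`:

```python
import pandas as pd

# Made-up encoded column: 1 = attended, 0 = missed.
no_show = pd.Series([1, 1, 1, 0, 1], name="no_show")

counts = no_show.value_counts()                # absolute counts per class
shares = no_show.value_counts(normalize=True)  # fractions that sum to 1

print(counts.loc[1], counts.loc[0])
print(round(shares.loc[1] * 100, 1), round(shares.loc[0] * 100, 1))
```

On the real data frame the same call would be `df["no_show"].value_counts(normalize=True)`.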

### 2- Relation between gender and showing

### The gender of all patients

In [None]:
Nfemales = df[df["gender"] == 1]
Nfemales

In [None]:
# Number of all females
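# count rows rather than summing the codes, so the result does not depend on female being coded as 1
F = Nfemales['gender'].count()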
F

In [None]:
Nmales = df[df["gender"] == 0]
Nmales

In [None]:
# Number of all males
M = Nmales['gender'].count()
M

In [None]:
# Data to plot
labels = ['All Female', 'All Male']
sizes = [F, M]
colors = ['pink', 'royalblue']
explode = (0, 0.1)

# The plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=40)
plt.title('Gender Distribution of All Patients')
plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(5,5);

### Showing

#### Females that show up for their appointments

In [None]:
female_shows = shows[shows['gender']==1]
female_shows

In [None]:
# count rows rather than summing the codes
f = female_shows["gender"].count()
f

#### Males that show up for their appointments

In [None]:
males_shows = shows[shows['gender']==0]
males_shows

In [None]:
m = males_shows['gender'].count()
m

In [None]:
# Data to plot
labels = ['Female', 'Male']
sizes = [f, m]
colors = [ 'tomato' ,'lightblue']
explode = (0, 0.1)

# The plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=40)
plt.title('Gender Distribution by Shows')
plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(5,5);

#### Females account for the larger share of attended appointments

### Not showing

#### Females that do not show up for their appointments

In [None]:
female_no_shows = no_shows[no_shows['gender']==1]
female_no_shows

In [None]:
# count rows rather than summing the codes
no_f = female_no_shows['gender'].count()
no_f

#### Males that do not show up for their appointments

In [None]:
males_no_shows = no_shows[no_shows['gender']==0]
males_no_shows

In [None]:
no_m = males_no_shows['gender'].count()
no_m

In [None]:
# Data to plot
labels = ['Female', 'Male']
sizes = [no_f, no_m]
colors = ['lightblue', 'lightgreen']
explode = (0, 0.1)

# The plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=40)
plt.title('Gender Distribution by No Shows')
plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(5,5);

#### Females also account for the larger share of missed appointments, which mainly reflects that most patients are female
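
The pie charts show each gender's share within the shows and within the no-shows. That is a different quantity from the attendance rate within each gender, which is what "more likely to show" refers to. A minimal sketch of both on made-up rows (the toy frame is illustrative, not the real data):

```python
import pandas as pd

# Made-up rows; gender: 1 = female, 0 = male; no_show: 1 = attended, 0 = missed.
toy = pd.DataFrame({
    "gender":  [1, 1, 1, 1, 0, 0],
    "no_show": [1, 1, 1, 0, 1, 0],
})

# Share of each gender among the missed appointments (what the pie chart shows).
share_of_missed = toy.loc[toy["no_show"] == 0, "gender"].value_counts(normalize=True)

# Attendance rate within each gender (the mean of a 0/1 column is a proportion).
attendance_rate = toy.groupby("gender")["no_show"].mean()

print(share_of_missed)
print(attendance_rate)
```

In this toy example both genders have an equal share of the missed appointments, yet their attendance rates differ, so the two views can tell different stories.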

### 3- Relation between Age and showing

#### The average age of all patients

In [None]:
all_age_mean = df['age'].mean()
all_age_mean

#### The average age of patients that show up for their appointments

In [None]:
age_show_mean = shows['age'].mean()
age_show_mean

The average age of patients that show up for their appointments is about 38.

#### The average age of patients that do not show up for their appointments

In [None]:
age_no_show_mean = no_shows['age'].mean()
age_no_show_mean

The average age of patients that do not show up for their appointments is about 34.

In [None]:
# Data to plot 
showAge = shows['age']
noshowAge = no_shows['age']

In [None]:
## The plot of show data
plt.hist(showAge, bins=100)
plt.title('Age Distribution by Shows')
plt.xlabel('Age')
plt.ylabel('Number of Appointments')

In [None]:
## Plot no show data 
plt.hist(noshowAge, bins=100, color= "pink")
plt.title('Age Distribution by no Shows')
plt.xlabel('Age')
plt.ylabel('Number of Appointments')
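
The two averages can also be produced in one step by grouping on the attendance flag. A small sketch on made-up ages (illustrative values only), with the median added because age distributions are often skewed:

```python
import pandas as pd

# Made-up rows; no_show: 1 = attended, 0 = missed.
toy = pd.DataFrame({
    "age":     [20, 40, 60, 10, 30],
    "no_show": [1, 1, 1, 0, 0],
})

# One row per attendance group, one column per statistic.
age_by_attendance = toy.groupby("no_show")["age"].agg(["mean", "median"])
print(age_by_attendance)
```

On the real data the call would be `df.groupby("no_show")["age"].agg(["mean", "median"])`.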

### 4- The relation between the neighbourhood and the attendance of the patients

In [None]:
df["neighbourhood"]

#### The relation between shows and neighbourhood

In [None]:
hood_show = shows.groupby('neighbourhood')['no_show'].count().reset_index(name= "Count").sort_values("Count")
hood_show

In [None]:
# The five neighbourhoods with the most attended appointments
hood_show.tail()

In [None]:
# The neighbourhood with the most attended appointments
hood_show[hood_show["Count"] == hood_show["Count"].max()] 

#### JARDIM CAMBURI is the neighbourhood with the most attended appointments

In [None]:
# hood_show is sorted in ascending order, so tail(10) gives the ten largest counts
hood = hood_show["neighbourhood"].tail(10)
hood

In [None]:
cHood = hood_show["Count"].tail(10)
cHood

In [None]:
## The plot of show data 
fig = plt.figure()
ax = fig.add_axes([0,0,4,3])
ax.bar(hood, cHood)

#### The relation between no shows and neighbourhood

In [None]:
hood_no_show = no_shows.groupby('neighbourhood')['no_show'].count().reset_index(name= "Count").sort_values("Count")
hood_no_show

In [None]:
# The five neighbourhoods with the most missed appointments
hood_no_show.tail()

In [None]:
# The neighbourhood with the most missed appointments
hood_no_show[hood_no_show["Count"] == hood_no_show["Count"].max()]

In [None]:
# hood_no_show is sorted in ascending order, so tail(10) gives the ten largest counts
no_hood = hood_no_show["neighbourhood"].tail(10)
no_cHood = hood_no_show["Count"].tail(10)

#### JARDIM CAMBURI is also the neighbourhood with the most missed appointments

In [None]:
## The plot of noshow data 
fig = plt.figure()
ax = fig.add_axes([0,0,4,3])
ax.bar(no_hood, no_cHood)
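
Raw counts favour neighbourhoods that simply have many appointments, which is why the same neighbourhood can top both lists. A sketch of a per-neighbourhood attendance rate on made-up rows; the neighbourhood names "A" and "B" are hypothetical:

```python
import pandas as pd

# Made-up rows; no_show: 1 = attended, 0 = missed.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "no_show":       [1, 1, 0, 0, 1, 1],
})

# Count, attended total and attendance rate per neighbourhood in one pass.
summary = (
    toy.groupby("neighbourhood")["no_show"]
       .agg(appointments="count", attended="sum", show_rate="mean")
       .sort_values("show_rate", ascending=False)
)
print(summary)
```

Here "A" has more appointments but a lower attendance rate than "B", a difference the count-based bar charts cannot show.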

### 4- The relation between the Bolsa Família scholarship and showing up for appointments

In [None]:
d = df.groupby("scholarship")["no_show"].count().reset_index(name= "Count")
d.head()

#### In the whole data set, 99666 patients are not in the scholarship and 10861 are in the scholarship

In [None]:
att = shows.groupby("scholarship")["no_show"].count().reset_index(name= "Count")
att

#### Among the patients that showed up for their appointments, 79925 are not in the scholarship and 8283 are in the scholarship

In [None]:
not_att = no_shows.groupby("scholarship")["no_show"].count().reset_index(name= "Count")
not_att

#### Among the patients that did not show up for their appointments, 19741 are not in the scholarship and 2578 are in the scholarship
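
The counts reported in this section can be turned into attendance rates, which are easier to compare than raw totals. The sketch below uses only the four counts quoted above; the column names `showed`, `missed`, `total` and `show_rate` are chosen here for illustration.

```python
import pandas as pd

# Counts quoted in this section, indexed by scholarship status (0 = no, 1 = yes).
counts = pd.DataFrame(
    {"showed": [79925, 8283], "missed": [19741, 2578]},
    index=pd.Index([0, 1], name="scholarship"),
)

counts["total"] = counts["showed"] + counts["missed"]
counts["show_rate"] = counts["showed"] / counts["total"]
print(counts)
```

The totals reproduce the 99666 and 10861 patients reported for the whole data set, and the rates come out at roughly 80% without the scholarship and 76% with it.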

### 5- What is the patients' health and does it affect their attendance

In [None]:
df.head()

In [None]:
#Total Hipertension
df_Hip1 = df[df["hipertension"]==1]
df_Hip1

In [None]:
df_Hip1["hipertension"].count()

#### The total number of patients with hypertension is 21801

In [None]:
#Total Diabetes
df_dia1 = df[df["diabetes"]==1]
df_dia1

In [None]:
df_dia1["diabetes"].count()

#### The total number of patients with diabetes is 7943

In [None]:
#Total Alcoholism
df_alc1 = df[df["alcoholism"]==1]
df_alc1

In [None]:
df_alc1["alcoholism"].count()

#### The total number of patients with alcoholism is 3360

In [None]:
#Total handicap 
df_hand = df[df["handcap"]==1] 
df_hand

In [None]:
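# count the filtered frame; df["handcap"].count() would return the size of the whole data set
df_hand["handcap"].count()

#### The total number of patients with a handicap is the row count of `df_hand`; 110527 is the size of the whole data set, not the handicap count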

### Total shows

In [None]:
# hipertension
df_hip2 = shows[shows["hipertension"]==1]
df_hip2

In [None]:
# diabetes
df_dia2 = shows[shows["diabetes"]==1]
df_dia2

In [None]:
#alcoholism
df_alch2 = shows[shows["alcoholism"]==1]
df_alch2

In [None]:
# Handicap
df_hand2 = shows[shows["handcap"]==1]
df_hand2

In [None]:
# The plot
labels = ['Hypertension','Diabetes','Alcoholism', 'Handicap']
sizes = [df_hip2.shape[0], df_dia2.shape[0], df_alch2.shape[0], df_hand2.shape[0]]
colors = ['palevioletred', 'lightpink', 'lavender', 'plum']
explode = (0, 0, 0.1, 0)

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=70)

plt.title('Health Designation by Shows')
plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(5,5)

### Total no-shows

In [None]:
# hipertension
df_hip3 = no_shows[no_shows["hipertension"]==1]
df_hip3

In [None]:
# diabetes
df_dia3 = no_shows[no_shows["diabetes"]==1]
df_dia3

In [None]:
#alcoholism
df_alch3 = no_shows[no_shows["alcoholism"]==1]
df_alch3

In [None]:
# Handicap
df_hand3 = no_shows[no_shows["handcap"]==1]
df_hand3

In [None]:
# The plot
labels = ['Hypertension','Diabetes','Alcoholism', 'Handicap']
sizes = [df_hip3.shape[0], df_dia3.shape[0], df_alch3.shape[0], df_hand3.shape[0]]
colors = ['steelblue', 'lightblue', 'lavender', 'turquoise']
explode = (0, 0, 0.1, 0)

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=70)

plt.title('Health Designation by no Show')
plt.axis('equal')
fig = plt.gcf()
fig.set_size_inches(5,5)
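
A patient can have more than one condition, so the pie slices above overlap and only show how common each condition is within a group. To compare attendance between conditions, one option is the attendance rate among patients with each condition. A sketch on made-up rows, using two of the column names from this notebook:

```python
import pandas as pd

# Made-up rows; condition flags are 0/1, no_show: 1 = attended, 0 = missed.
toy = pd.DataFrame({
    "hipertension": [1, 1, 0, 0, 1, 0],
    "diabetes":     [0, 1, 0, 0, 1, 0],
    "no_show":      [1, 0, 1, 1, 1, 0],
})

conditions = ["hipertension", "diabetes"]

# Attendance rate among the patients that have each condition.
rates = {c: toy.loc[toy[c] == 1, "no_show"].mean() for c in conditions}
print(rates)
```

The same dictionary comprehension would work on `df` with all four condition columns.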

### 6- Are patients that received an SMS more likely to show up for their appointments?

In [None]:
# Total number of patients that received an SMS
smsNo = df[df["sms_received"] == 1] 
smsNo

In [None]:
smsNo.shape[0]

#### The total number of patients that received an SMS is 35482

### Number of patients that received an SMS, by attendance

In [None]:
sms_show = shows[shows["sms_received"] == 1] 
sms_show

In [None]:
a = sms_show.shape[0]
a

#### The number of patients that received an SMS and showed up for their appointments is 25698

In [None]:
sms_no_show = no_shows[no_shows["sms_received"] == 1] 
sms_no_show

In [None]:
b = sms_no_show.shape[0]
b

#### The number of patients that received an SMS and did not show up for their appointments is 9784

In [None]:
## The plot of SMS data 
locations = [1, 2]
heights = [a, b]
labels = ['Shows', 'No-Shows']

bar1 = plt.bar(locations, heights, tick_label=labels, color=['slateblue','darkslateblue'])
plt.title('SMS Messages Received')
plt.xlabel('Appointments')
plt.ylabel('Number of Appointments');


### Number of patients that did not receive an SMS, by attendance

In [None]:
no_sms_show = shows[shows["sms_received"] == 0] 
no_sms_show

In [None]:
x = no_sms_show.shape[0]
x

In [None]:
no_sms_no_show = no_shows[no_shows["sms_received"] == 0] 
no_sms_no_show

In [None]:
y = no_sms_no_show.shape[0]
y

In [None]:
## The plot of SMS data 
locations = [1, 2]
heights = [x, y]  # counts for patients that did not receive an SMS
labels = ['Shows', 'No-Shows']

bar1 = plt.bar(locations, heights, tick_label=labels, color=['lightblue','pink'])
plt.title('SMS Messages not Received')
plt.xlabel('Appointments')
plt.ylabel('Number of Appointments');
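
The bar charts compare raw counts, but the SMS and no-SMS groups differ in size, so rates are a fairer comparison. The sketch below works only from counts reported in this notebook: the two SMS counts from this section and the show and no-show totals implied by the scholarship counts. The variable names are chosen here for illustration.

```python
# Counts reported in this section.
sms_show, sms_no_show = 25698, 9784

# Totals implied by the scholarship counts reported earlier.
total_show = 79925 + 8283
total_no_show = 19741 + 2578

# Patients without an SMS are the remainder of each total.
no_sms_show = total_show - sms_show
no_sms_no_show = total_no_show - sms_no_show

rate_with_sms = sms_show / (sms_show + sms_no_show)
rate_without_sms = no_sms_show / (no_sms_show + no_sms_no_show)

print(round(rate_with_sms, 3), round(rate_without_sms, 3))
```

Taken at face value, these counts give an attendance rate of about 72% with an SMS and about 83% without one, so the SMS group does not show up more often in this data.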


<a id='conclusions'></a>
## Conclusions
> In general, about 79.8% of all patients showed up for their appointments, and 20.2% did not.
<p>The factors examined for their effect on attendance:</p>
<ol>
    <li>Gender:
        <ul>
            <li>Most patients are women: 65% of all patients are female and 35% are male.</li>
            <li>Among the patients that showed up, 64.9% are female and 35.1% are male.</li>
            <li>Among the patients that did not show up, about 65% are female and 34.6% are male.</li>
            <li>The split is almost the same in both groups, so gender on its own does not appear to change attendance.</li>
        </ul>
    </li>
    <li>Age:
        <ul>
            <li>The average age of all patients is 37.</li>
            <li>The average age of patients that showed up for their appointments is 38.</li>
            <li>The average age of patients that did not show up for their appointments is 34.</li>
        </ul>
    </li>
    <li>Neighbourhood:
        <ul>
            <li>JARDIM CAMBURI has the most attended appointments, with 6252.</li>
            <li>JARDIM CAMBURI also has the most missed appointments, with 1465, which reflects that it has the most appointments overall.</li>
        </ul>
    </li>
    <li>Scholarship in the Bolsa Família:
        <ul>
            <li>In the whole data set, 99666 patients are not in the scholarship and 10861 are in the scholarship.</li>
            <li>Among the patients that showed up, 79925 are not in the scholarship and 8283 are in the scholarship.</li>
            <li>Among the patients that did not show up, 19741 are not in the scholarship and 2578 are in the scholarship.</li>
            <li>That is, about 80% of patients without the scholarship showed up, against about 76% of those with it.</li>
        </ul>
    </li>
    <li>Patients' health:
        <ul>
            <li>The total number of patients with hypertension is 21801.</li>
            <li>The total number of patients with diabetes is 7943.</li>
            <li>The total number of patients with alcoholism is 3360.</li>
            <li>The handicap figure of 110527 is the size of the whole data set; the number of patients with a handicap is the row count of <code>df_hand</code>.</li>
            <li>Among these conditions, hypertension makes up the largest share of both the shows and the no-shows, and handicap the smallest.</li>
        </ul>
    </li>
    <li>Receiving an SMS:
        <ul>
            <li>The total number of patients that received an SMS is 35482.</li>
            <li>Of these, 25698 (about 72.4%) showed up for their appointments and 9784 (about 27.6%) did not.</li>
            <li>This is below the overall attendance of 79.8%, so receiving an SMS is not associated with better attendance in this data.</li>
        </ul>
    </li>
</ol>

As we can see, these features are not enough to determine exactly what affects patient attendance. Collecting additional information, such as weather conditions or the distance between a patient's home and the clinic, might help.




In [None]:
from subprocess import call
# name the output format explicitly; recent nbconvert versions require --to
call(['python', '-m', 'nbconvert', '--to', 'html', 'Investigate_a_Dataset.ipynb'])