

# Project: Investigate a Dataset (Medical Appointment No Shows From Kaggle)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
A patient makes a doctor's appointment, receives all the instructions, and then does not show up. In this notebook we try to answer the question: why do 30% of patients miss their scheduled appointments?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [2]:
# Load the data and print the first few rows to inspect it
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [3]:
df.shape

(110527, 14)

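This dataset contains 110527 rows and 14 variables.
The variables are:
- PatientId: number that identifies the patient
- AppointmentID: number that identifies the appointment
- Gender: gender of the patient (F or M)
- ScheduledDay: timestamp at which the appointment was booked
- AppointmentDay: date of the appointment
- Age: age of the patient in years
- Neighbourhood: neighbourhood recorded for the appointment
- Scholarship: 0/1 flag for enrolment in the Bolsa Família welfare programme
- Hipertension: 0/1 flag for hypertension
- Diabetes: 0/1 flag for diabetes
- Alcoholism: 0/1 flag for alcoholism
- Handcap: handicap level (0 to 4 in this data)
- SMS_received: 0/1 flag for whether the patient received an SMS reminder
- No-show: "Yes" if the patient missed the appointment, "No" if the patient attended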

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [5]:
# Count missing values per column to look for missing or possibly errant data.
df.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

*There are no missing values in the dataset.*
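
As a sanity check on how a missing-value count behaves, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from the dataset). `isnull().sum()` counts the missing entries in each column, so a clean column reports 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age and one missing gender
toy = pd.DataFrame({
    "age": [62, np.nan, 8],
    "gender": ["F", "M", None],
    "sms_received": [0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if the frame contains any missing value at all
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```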

### Data Cleaning 

In [6]:
# Check for fully duplicated rows
df.duplicated().sum()

0

In [7]:
# Rename all columns to a conventional style (lowercase, underscores instead of hyphens)
df.rename(columns=lambda x: x.lower().replace('-', '_'), inplace=True)

In [8]:
df.columns

Index(['patientid', 'appointmentid', 'gender', 'scheduledday',
       'appointmentday', 'age', 'neighbourhood', 'scholarship', 'hipertension',
       'diabetes', 'alcoholism', 'handcap', 'sms_received', 'no_show'],
      dtype='object')

In [9]:
# Check whether any patient has more than one appointment
df.patientid.unique().shape

(62299,)

*Only 62299 distinct patients booked appointments, which means that some patients have more than one appointment.*
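
To see how many patients are behind the repeated bookings, one option is to count appointments per patient. The sketch below uses a made-up `toy` frame (hypothetical IDs, not from the dataset) and only the `patientid` column name from this notebook.

```python
import pandas as pd

# Hypothetical toy data: patient 1 has three appointments,
# patient 2 has one, patient 3 has two
toy = pd.DataFrame({"patientid": [1, 1, 2, 1, 3, 3]})

# Number of appointments booked by each patient
appointments_per_patient = toy["patientid"].value_counts()

# Number of distinct patients, and how many of them appear more than once
distinct_patients = toy["patientid"].nunique()
repeat_patients = int((appointments_per_patient > 1).sum())
print(distinct_patients, repeat_patients)
```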

In [10]:
# Count rows where a patient is repeated with the same no-show state
df[['patientid','no_show']].duplicated().sum()

38710

**We treat these rows as duplicates and drop them, because a repeated outcome for the same patient reflects that patient's usual behaviour (or a rebooking) rather than new information.**

In [11]:
# Drop rows that repeat the same patient with the same no-show state
df.drop_duplicates(['patientid','no_show'], inplace=True)

In [12]:
df.shape

(71817, 14)
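
The effect of de-duplicating on a subset of columns can be seen on a small made-up frame. This is only a sketch: the `toy` rows are hypothetical, and the column names follow the renamed columns above.

```python
import pandas as pd

# Hypothetical toy data: patient 1 attended twice and missed once
toy = pd.DataFrame({
    "patientid": [1, 1, 1, 2],
    "no_show": ["No", "No", "Yes", "No"],
    "age": [30, 30, 30, 55],
})

# Keep the first row for each (patientid, no_show) pair
deduped = toy.drop_duplicates(["patientid", "no_show"])
print(deduped)
```

Note that a patient with both outcomes still contributes two rows, one per outcome, so the de-duplicated data counts patient-outcome pairs rather than appointments.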

In [13]:
# Drop variables that are not useful for this analysis: the IDs are specific to one patient or appointment rather than generic, and the other two are appointment dates
df.drop(['patientid','appointmentid','scheduledday','appointmentday'],axis = 1, inplace = True)

In [14]:
df.shape[1]

10

In [15]:
df['gender'].unique()

array(['F', 'M'], dtype=object)

In [16]:
df.head()

Unnamed: 0,gender,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,F,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,F,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,F,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [17]:
# Get summary statistics to check whether any values are inconsistent
df.describe()

Unnamed: 0,age,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received
count,71817.0,71817.0,71817.0,71817.0,71817.0,71817.0,71817.0
mean,36.526978,0.095534,0.195065,0.070958,0.025036,0.020135,0.335561
std,23.378518,0.293954,0.396254,0.256757,0.156235,0.155337,0.47219
min,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,17.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,36.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,115.0,1.0,1.0,1.0,1.0,4.0,1.0


In [18]:
df['age'].quantile([0.1,0.9])

0.1     4.0
0.9    68.0
Name: age, dtype: float64

In [21]:
df[df['age'] >= 100]

Unnamed: 0,gender,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
58014,F,102,CONQUISTA,0,0,0,0,0,0,No
63912,F,115,ANDORINHAS,0,0,0,0,1,0,Yes
76284,F,115,ANDORINHAS,0,0,0,0,1,0,No
79270,M,100,TABUAZEIRO,0,0,0,0,1,0,No
90372,F,102,MARIA ORTIZ,0,0,0,0,0,0,No
92084,F,100,ANTÔNIO HONÓRIO,0,0,0,0,0,1,No
97666,F,115,SÃO JOSÉ,0,1,0,0,0,1,No
108506,F,100,MARUÍPE,0,0,0,0,0,0,No


In [None]:
# Inspect rows with a negative age
df.query('age < 0')

In [None]:
# Drop the rows with a negative age
df.drop(index=df.query('age < 0').index, inplace=True)

<a id='eda'></a>
## Exploratory Data Analysis

The main question is: **Which variables affect attendance at the appointment?**

In [None]:
# Get a general overview from the variable distributions of the whole dataset
df.hist(figsize = (10,10));

In [None]:
df['no_show'].value_counts().plot(kind = 'pie',title = 'No-show status',autopct='%1.1f%%');

In [None]:
# Boolean masks that split the dataset into two groups: patients who showed up and patients who did not
show = df.no_show == "No"
no_show = df.no_show == "Yes"

In [None]:
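# numeric_only=True skips the text columns, which cannot be averaged
df[show].mean(numeric_only=True)

In [None]:
df[no_show].mean(numeric_only=True)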

> *Up to this point no single variable clearly stands out, so we continue by analysing each variable separately.*
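
The two group means can also be read side by side in a single table. The following is a minimal sketch on made-up data (the `toy` frame is hypothetical), assuming the cleaned column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy data using the notebook's cleaned column names
toy = pd.DataFrame({
    "age": [10, 40, 60, 20, 30],
    "sms_received": [0, 1, 0, 1, 1],
    "gender": ["F", "M", "F", "F", "M"],
    "no_show": ["No", "No", "No", "Yes", "Yes"],
})

# Mean of every numeric column for each outcome, one row per outcome
means_by_outcome = toy.groupby("no_show").mean(numeric_only=True)
print(means_by_outcome)
```

Because the flag columns hold 0 and 1, their means are the share of patients in each group that have the flag set.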

## Does age affect attendance?

In [None]:
# Does the patient's age affect attendance?
plt.figure(figsize=(15,5))
df['age'][show].hist(bins = 10, label = 'show', color = 'green')
df['age'][no_show].hist(bins = 10, label = 'no show', color = 'red')
plt.legend()
plt.xlabel('Age')
plt.ylabel('No. of Patients')
plt.title('Age vs No. of Patients')
plt.show()

In [None]:
f"Min age is: {df['age'].min()} Max age is: {df['age'].max()}"

In absolute numbers, patients aged 0 to 10 attend the most appointments, and patients aged 80 to 115 attend the fewest.
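
Since the histogram shows raw counts, it can help to look at the share of patients who attended within each age band as well. The sketch below uses made-up data (the `toy` frame is hypothetical), and the band edges are an illustrative assumption rather than a choice made in this notebook.

```python
import pandas as pd

# Hypothetical toy data using the notebook's cleaned column names
toy = pd.DataFrame({
    "age": [3, 7, 9, 25, 34, 52, 85, 90],
    "no_show": ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# 1 if the patient attended, 0 otherwise
toy["attended"] = (toy["no_show"] == "No").astype(int)

# Age bands, closed on the left: [0, 10), [10, 20), ...
# The edges are an illustrative assumption
edges = [0, 10, 20, 40, 60, 80, 116]
toy["age_band"] = pd.cut(toy["age"], bins=edges, right=False)

# Number of patients and share that attended in each non-empty band
by_band = toy.groupby("age_band", observed=True)["attended"].agg(["count", "mean"])
print(by_band)
```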

## Do chronic diseases affect attendance?

In [None]:
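def plot_col(col):
    """
    Plot a bar chart of the given column against the number of patients.
    Argument:
        col: column name.
    Return:
        None. Draws a figure with the show counts in green and the no-show counts in red.
    """
    show_counts = df[show][col].value_counts()
    no_show_counts = df[no_show][col].value_counts()
    # Align both series on the same categories so that bars sharing an x-position are comparable
    order = show_counts.index.union(no_show_counts.index, sort=False)
    show_counts.reindex(order, fill_value=0).plot(kind = 'bar', label = 'Show', color = 'g')
    no_show_counts.reindex(order, fill_value=0).plot(kind = 'bar', label = 'No Show', color = 'r')
    plt.xlabel(col.title())
    plt.ylabel('No. of Patients')
    plt.title(f'{col.title()} VS No. of Patients.')
    plt.legend()
    plt.show()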

In [None]:
# First: hypertension status
plot_col('hipertension')

In [None]:
df[show].hipertension.value_counts(),df[no_show].hipertension.value_counts()

In [None]:
# Second: diabetes
plot_col('diabetes')

In [None]:
# Third: hypertension and diabetes combined
df[show].groupby(['hipertension', 'diabetes']).count()['no_show'].plot(kind = 'bar', label = 'Show', color = 'g')
df[no_show].groupby(['hipertension', 'diabetes']).count()['no_show'].plot(kind = 'bar', label = 'No Show', color = 'r')
plt.xlabel('Hypertension and Diabetes State')
plt.ylabel('No. of Patients')
plt.title('Hypertension and Diabetes State VS No. of Patients')
plt.legend()
plt.show()

From the plots above, chronic diseases do not appear to be an influential factor; there is only a tiny effect when a patient has just one of the two conditions.

## Does gender affect attendance?

In [None]:
df[show].gender.value_counts().plot(kind = 'pie',title = 'Gender with show',autopct='%1.1f%%');

In [None]:
df[no_show].gender.value_counts().plot(kind = 'pie',title = 'Gender with no show',autopct='%1.1f%%');

Gender does not have a strong effect.

In [None]:
# Does receiving an SMS affect attendance at the appointment?
plot_col('sms_received')

The number of patients who received an SMS and attended is lower than the number who did not receive an SMS and attended, which suggests that the SMS campaign should be reorganized.
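
Because the two groups differ in size, the counts can be complemented with the share of each outcome within each SMS group. Here is a minimal sketch using `pd.crosstab` on made-up data (the `toy` frame is hypothetical and says nothing about the real result).

```python
import pandas as pd

# Hypothetical toy data using the notebook's cleaned column names
toy = pd.DataFrame({
    "sms_received": [0, 0, 0, 0, 1, 1, 1, 1],
    "no_show": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

# Raw counts: rows are the SMS status, columns are the outcome
counts = pd.crosstab(toy["sms_received"], toy["no_show"])

# Row-normalised shares: each row sums to 1
rates = pd.crosstab(toy["sms_received"], toy["no_show"], normalize="index")
print(counts)
print(rates)
```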

## Does neighbourhood affect attendance?

In [None]:
plt.figure(figsize = [16,8])
plot_col('neighbourhood')

Neighbourhood has a clear effect: JARDIM CAMBURI has the most appointments and the most attended appointments, while PARQUE INDUSTRIAL has the fewest.
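
Neighbourhoods differ a lot in how many patients they have, so a per-neighbourhood attendance rate with a minimum group size may be easier to compare than raw counts. The sketch below is illustrative only: the rows in `toy` are made up, and `MIN_PATIENTS` is a hypothetical threshold.

```python
import pandas as pd

# Hypothetical toy data; the neighbourhood names appear in this notebook,
# but the rows themselves are made up
toy = pd.DataFrame({
    "neighbourhood": ["JARDIM CAMBURI"] * 4 + ["MARIA ORTIZ"] * 3 + ["PARQUE INDUSTRIAL"],
    "no_show": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})
toy["attended"] = toy["no_show"] == "No"

# Number of patients and attendance rate per neighbourhood
summary = toy.groupby("neighbourhood")["attended"].agg(
    patients="count", attendance_rate="mean"
)

# Ignore neighbourhoods with too few patients (illustrative threshold)
MIN_PATIENTS = 3
reliable = summary[summary["patients"] >= MIN_PATIENTS].sort_values(
    "attendance_rate", ascending=False
)
print(reliable)
```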

## Does Scholarship affect attendance?

In [None]:
plot_col('scholarship')

Patients who have a Scholarship attend their appointments more than those who do not.

## Does alcoholism have an effect?

In [None]:
df[show]['alcoholism'].value_counts().plot(kind = 'pie',title = 'Alcoholism with show',autopct='%1.1f%%');

In [None]:
df[no_show]['alcoholism'].value_counts().plot(kind = 'pie',title = 'Alcoholism with no show',autopct='%1.1f%%');

Alcoholism has no effect.

## Does handcap have an effect?

In [None]:
plot_col('handcap')

There is only a tiny relation, because most patients with a handcap attend and most patients without a handcap also attend.
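
The summary statistics above show that `handcap` takes values up to 4, so it is a level rather than a plain yes/no flag. One way to inspect it, sketched here on made-up data (the `toy` frame and the `has_handcap` column are hypothetical), is to count the levels and then compare attendance for a collapsed yes/no version.

```python
import pandas as pd

# Hypothetical toy data using the notebook's cleaned column names
toy = pd.DataFrame({
    "handcap": [0, 0, 0, 1, 2, 0, 4, 1],
    "no_show": ["No", "Yes", "No", "No", "No", "No", "Yes", "No"],
})

# How many patients fall in each handcap level
level_counts = toy["handcap"].value_counts().sort_index()

# Collapse the levels into a yes/no flag (an illustrative simplification)
toy["has_handcap"] = toy["handcap"] > 0

# Share of patients who attended, without and with a handcap
attendance = (toy["no_show"] == "No").groupby(toy["has_handcap"]).mean()
print(level_counts)
print(attendance)
```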

<a id='conclusions'></a>
## Conclusions

* Age has a strong effect: patients aged 0 to 10 are the group that attends appointments the most, and patients between 48 and 58 come second.
* Patients who live in JARDIM CAMBURI and MARIA ORTIZ attend their appointments the most.
* Patients who have a Scholarship attend their appointments more than those who do not.
* Fewer patients who received an SMS attended than patients who did not receive one, which suggests that the SMS campaign should be reorganized.

## Limitations
There is no clear correlation between attendance at appointments and gender or chronic diseases.