

# Project: Investigate a Dataset - [Medical Appointment No Shows]

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

> This dataset contains information about more than 100,000 medical appointments for patients from different neighborhoods in Brazil. It addresses an important question: why does a person make a doctor's appointment, receive all the instructions, and then not show up? I will ask a series of questions and answer them to get closer to an explanation for this problem.
 
**Note**: In the columns that hold 0/1 values, 1 means True and 0 means False.

### Question(s) for Analysis

- 1- Are the variables in the dataset correlated with each other?

- 2- What is the ratio of female to male patients?

- 3- Does gender affect whether the patient shows up?

- 4- Does gender combined with scholarship affect whether the patient shows up?

- 5- Does gender combined with hypertension affect whether the patient shows up?

- 6- Does gender combined with diabetes affect whether the patient shows up?

- 7- Does gender combined with alcoholism affect whether the patient shows up?

- 8- Does gender combined with handicap affect whether the patient shows up?

- 9- Does receiving an SMS affect showing up for the appointment, by gender?

- 10- What is the distribution of age?

- 11- Does the age of the patient affect showing up on the appointment day?

- 12- Which day of the week has the highest percentage of showing up?

- 13- Which day of the week has the highest percentage of no-shows?

- 14- What is the ratio of shows to no-shows for each month?

- 15- Do waiting days affect whether the patient shows up?



In [None]:
# Use this cell to set up import statements for all of the packages that you
#   plan to use.

# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling



In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

In [None]:
df.info(verbose=True, show_counts=True)

In [None]:
df.shape

In [None]:
df.describe(include=['O'])

In [None]:
df.describe()


### Data Cleaning

 

In [None]:
# make a copy of the dataframe to avoid changing the original
df_new = df.copy()

In [None]:
df_new.dtypes

#### Column types that need to be changed:
 1- PatientId is float64 and needs to be converted to str, because I don't want it treated as numeric data when I run calculations or use describe().
 
 2- AppointmentID is int and I will convert it to str for the same reason as PatientId.
 
 3- ScheduledDay and AppointmentDay need to be converted to datetime, because I will extract the month and the day from them.


In [None]:
# let's convert PatientId to str
df_new['PatientId'] = df_new['PatientId'].astype(str)
df_new['PatientId'].dtypes

In [None]:
# let's convert AppointmentID to str
df_new['AppointmentID'] = df_new['AppointmentID'].astype(str)
df_new['AppointmentID'].dtypes

In [None]:
# convert ScheduledDay and AppointmentDay to datetime
df_new['ScheduledDay'] = pd.to_datetime(df_new['ScheduledDay'])
df_new['AppointmentDay'] = pd.to_datetime(df_new['AppointmentDay'])
df_new[['ScheduledDay','AppointmentDay']].dtypes

In [None]:
# see the new data type after converting 
df_new.dtypes
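
The conversions above can be checked on a small toy frame. This is only a sketch: the rows are made-up stand-ins for the raw file, and the extra `astype('int64')` step is my assumption for dropping the trailing `.0` that a float ID keeps when it is turned into text. It would truncate any non-integer IDs, so it may not suit the real data.

```python
import pandas as pd

# Toy rows that mimic the raw column types (illustrative values only).
toy = pd.DataFrame({
    'PatientId': [29872499824296.0, 558997776694438.0],
    'AppointmentID': [5642903, 5642503],
    'ScheduledDay': ['2016-04-29T18:38:08Z', '2016-04-29T16:08:27Z'],
    'AppointmentDay': ['2016-04-29T00:00:00Z', '2016-04-29T00:00:00Z'],
})

# IDs become text so describe() no longer summarizes them as numbers.
# Going through int64 first (an assumption) avoids strings like '29872499824296.0'.
toy['PatientId'] = toy['PatientId'].astype('int64').astype(str)
toy['AppointmentID'] = toy['AppointmentID'].astype(str)

# Timestamps become datetimes so the month and weekday can be extracted.
toy['ScheduledDay'] = pd.to_datetime(toy['ScheduledDay'])
toy['AppointmentDay'] = pd.to_datetime(toy['AppointmentDay'])

print(toy['PatientId'].tolist())
print(toy['AppointmentDay'].dt.day_name().tolist())
```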

### Renaming some columns

In [None]:
# first, let's convert the column names from uppercase to lowercase
df_new.rename(columns=lambda x: x.strip().lower(), inplace=True)

In [None]:
# second, let's fix the misspelled names and add underscores to some columns
rename_column = { 'patientid' : 'patient_id', 'appointmentid': 'appointment_id',
                 'scheduledday' : 'scheduled_day', 'appointmentday':'appointment_day',
                  'hipertension':'hypertension' ,'handcap':'handicap','no-show':'show'
                 }
df_new.rename(columns=rename_column,inplace=True)

In [None]:
df_new.head(3)

#### Changing Show values:
Before the rename, the No-show column held 'No' if the patient showed up for the appointment and 'Yes' if they did not.

Now that the column is called show, I will change 'No' to 1 and 'Yes' to 0. This avoids confusion and makes it consistent with the other columns (1 means True, 0 means False).

So across the whole dataset, 1 = True and 0 = False.

In [None]:
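# change the show column values: 'No' (showed up) -> 1, 'Yes' (did not show up) -> 0
# assign the result back instead of calling replace(..., inplace=True) on a single column
df_new['show'] = df_new['show'].map({'Yes': 0, 'No': 1})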
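
A small sketch of why the 0/1 coding is convenient, using a made-up series rather than the real column: once showing up is coded as 1, the mean of the column is directly the show-up rate.

```python
import pandas as pd

# Hypothetical raw values in the original 'No-show' sense.
no_show = pd.Series(['No', 'Yes', 'No', 'No'], name='no-show')

# 'No' means the patient showed up, so it becomes 1.
show = no_show.map({'No': 1, 'Yes': 0})

print(show.tolist())
print(show.mean())  # share of appointments where the patient showed up
```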

#### Changing the gender column values:
from F and M to Female and Male

In [None]:
df_new['gender'] = df_new['gender'].replace({'F': 'Female', 'M': 'Male'})

#### Check for missing values

In [None]:
df_new.isnull().sum()

#### Check for duplicate rows

In [None]:
df_new.duplicated().sum()

There are no duplicated rows either.

<a id='eda'></a>

## Exploratory Data Analysis




### First: Let's see an overall visualization of the show column against the other features:

In [None]:
def get_bar_chart(df_new):
    # the columns I want to plot against the show column
    feature = ['scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received']

    plt.figure(figsize=(25,10))

    # one subplot per feature in a 2 x 3 grid
    for position, column in enumerate(feature, start=1):
        plt.subplot(2, 3, position)
        sns.countplot(x=column, hue='show', data=df_new, palette='Blues')

    plt.show()

get_bar_chart(df_new)

### Is there a correlation between the variables in the dataset?


In [None]:
df_new.corr(numeric_only=True)  # skip text columns such as gender and the ID strings

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(df_new.corr(numeric_only=True), annot=True, linewidths=2, cmap= 'PuBuGn')
plt.title('Correlation between the dataset', fontsize=15)
plt.show()

**After these two steps, here is what I found:**

**Hypertension and diabetes** have a moderate positive correlation (0.43).

**Hypertension and age** have a moderate positive correlation (0.5).

**Scholarship and show** have a very weak negative correlation (-0.02). Patients without a scholarship are slightly more likely to show up, but the relationship is close to zero.

**Alcoholism and show** have no linear relationship (0.0002). Alcoholism does not appear to affect whether the patient shows up.

**Sms_received and show** have a weak negative correlation (-0.13). Patients who did not receive an SMS showed up somewhat more often than those who did.
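
To make a coefficient between two 0/1 columns easier to read, it can be placed next to the show-up rate of each group. The sketch below uses invented numbers, not the real dataset: a negative coefficient simply means the group with `sms_received = 1` has the lower show-up rate.

```python
import pandas as pd

# Made-up data: 4 patients with an SMS, 6 without.
toy = pd.DataFrame({
    'sms_received': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'show':         [1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
})

# Pearson correlation between the two binary columns.
r = toy['sms_received'].corr(toy['show'])

# Show-up rate within each SMS group, the quantity the sign of r reflects.
rates = toy.groupby('sms_received')['show'].mean()

print(round(r, 3))
print(rates)
```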

## What is the ratio of female to male patients?

In [None]:
df_new.head()

In [None]:
gender_counts = df_new['gender'].value_counts()  # number of appointments per gender

fig = px.pie(values = gender_counts.values,
             names = gender_counts.index,  # labels in the same order as the values
             color_discrete_sequence=px.colors.sequential.Darkmint
            )

fig.update_traces(text = gender_counts.values,
                  textinfo = 'label+percent')

fig.update_layout(title_text = "Gender ratio", title_x = 0.5)


fig.show()

## Does gender affect whether the patient shows up?

#### Let's first take a look at gender based on the count of show up or no show.

In [None]:
df_new.groupby(['gender','show']).count()['patient_id']

In [None]:
plt.style.use('fivethirtyeight')

plt.figure(figsize=(10,6), dpi=60)

sns.countplot(data= df_new, x= 'gender', hue='show', palette= 'Blues')

plt.title('Number of female and male according to show')
plt.ylabel('Show and No-show patients count')
plt.show()

> This is just an overall count by gender of who showed up on the appointment day and who didn't. To answer whether gender affects showing up, I need to calculate the percentage for each gender.

In [None]:
# this just gets the count per gender; I use patient_id to count the number of females and males

gender_count = df_new.groupby('gender').count()['patient_id']
gender_count

In [None]:
# get the count of females and males who showed up and who didn't show up on the appointment day.

gender_show = df_new[['gender','show']].value_counts()
gender_show

In [None]:
# percentage of females who showed up (1) and who didn't (0), in the same order as the labels
female_ratio = (gender_show['Female'].reindex([1, 0]) / gender_count['Female']) * 100

label_names = ['Female showed up ', "Female didn't show up"]

explode = [0, 0.15] # to explode the part where females did not show up

colors = ['#4F6272', '#B7C3F3']

plt.pie(female_ratio, radius=1.5, shadow=True ,labels = label_names, explode=explode,colors = colors ,startangle=180,
        autopct='%0.2f%%',textprops = {"fontsize":15, "fontname":"Comic Sans MS"})

plt.title("Ratio of female who showed up and who didn't show up ",fontsize=20, y=1.2, fontname='Comic Sans MS') 

plt.show()


In [None]:
# percentage of males who showed up (1) and who didn't (0), in the same order as the labels
male_ratio = (gender_show['Male'].reindex([1, 0]) / gender_count['Male']) * 100

label_names = ['Male showed up ', "Male didn't show up"]

explode = [0, 0.15] # to explode the part where males did not show up

colors = ['#4F6272', '#B7C3F3']

plt.pie(male_ratio, radius=1.5, shadow=True ,labels = label_names, explode=explode,colors=colors, startangle=180,
        autopct='%0.2f%%',textprops = {"fontsize":15,"fontname":"Comic Sans MS" })

plt.title("Ratio of male who showed up and who didn't show up ",fontsize=20, y=1.2, fontname='Comic Sans MS') 

plt.show()

> **Gender does not affect whether the patient shows up**, because the show-up percentages for females and males are almost equal.
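
The same percentages can also be read from a single table. A minimal sketch with made-up rows, assuming the `gender` and `show` column names used in this notebook: `pd.crosstab` with `normalize='index'` turns each gender's counts into row percentages.

```python
import pandas as pd

# Invented rows: 4 female and 2 male appointments.
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male'],
    'show':   [1, 1, 1, 0, 1, 0],
})

# Each row sums to 100: the share of no-shows (0) and shows (1) within a gender.
rate_table = pd.crosstab(toy['gender'], toy['show'], normalize='index') * 100

print(rate_table)
```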

## Does gender combined with scholarship affect whether the patient shows up?

In [None]:
df_new.groupby(['gender', 'scholarship', 'show']).count()['patient_id']

In [None]:
# convert the gender column to numeric values (Female = 1, Male = 0) so it can go on the x-axis
genderr = df_new['gender'].map({'Female': 1, 'Male': 0})

show = df_new['show']*100 # scale show to 0/100 so the bar height (the mean) reads as a percentage


plt.figure(figsize=(12,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='scholarship', palette= 'Blues',data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the scholarship')

plt.show()

>**Yes, gender combined with scholarship affects whether the patient shows up.** There is a negative relationship between scholarship and showing up: for both male and female patients, those without a scholarship are more likely to show up.

## Does gender combined with hypertension affect whether the patient shows up?

In [None]:
df_new.groupby(['gender', 'hypertension', 'show']).count()['patient_id']

In [None]:
plt.figure(figsize=(10,6),dpi=60)

sns.barplot(x=genderr, y=show, hue='hypertension',palette = 'Blues',data=df_new)

plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the hypertension')

plt.show()

>**Yes, gender combined with hypertension affects whether the patient shows up.** There is a positive relationship between having hypertension and showing up: in both genders, patients with hypertension show up on the appointment day more often than those without it.

## Does gender combined with diabetes affect whether the patient shows up?

In [None]:
df_new.groupby(['gender', 'diabetes', 'show']).count()['patient_id']

In [None]:
plt.figure(figsize=(12,6),dpi=60)

sns.barplot(x=genderr, y=show, hue='diabetes', palette = 'Blues', data=df_new)

plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the diabetes')

plt.show()

>**Yes, gender combined with diabetes affects whether the patient shows up.** There is a positive relationship between having diabetes and showing up: in both genders, patients with diabetes show up on the appointment day more often than those without it. The pattern is the same as for hypertension.

## Does gender combined with alcoholism affect whether the patient shows up?

In [None]:
df_new.groupby(['gender', 'alcoholism', 'show']).count()['patient_id']

In [None]:
plt.figure(figsize=(10,6),dpi=60)

sns.barplot(x=genderr, y=show, hue='alcoholism', palette = 'Blues', data=df_new)

plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the alcoholism')

plt.show()

> **Alcoholism doesn't affect whether the patient shows up**, and there isn't a strong relationship between alcoholism and showing up, because the two genders point in opposite directions:

>Male: the ones who drink show up on the appointment day more than the ones who don't drink.

>Female: quite the opposite of male; the ones who don't drink show up on the appointment day more than the ones who do.

In [None]:
df_new.head()

## Does gender combined with handicap affect whether the patient shows up?

In [None]:
# To know the unique values of the handicap column
df_new['handicap'].unique()

In [None]:
# To know the value counts of the handicap column
df_new['handicap'].value_counts()

##### Here I have a problem visualizing it with the other variables. 0 means not handicapped, and [1,2,3,4] means the person is handicapped. So I am going to convert [1,2,3,4] to 1.

In [None]:
df_new['handicap'] = (df_new['handicap'] > 0).astype(int)  # 0 stays 0; 1, 2, 3 and 4 become 1

In [None]:
df_new['handicap'].value_counts()
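
As a quick illustration of collapsing the handicap levels into a yes/no flag, here is a sketch on a made-up series. Comparing with zero is one possible way to do it, not necessarily the only one.

```python
import pandas as pd

# Hypothetical handicap levels as they appear in the raw column.
handicap = pd.Series([0, 1, 2, 0, 4, 3])

# Any level above 0 counts as handicapped.
handicap_flag = (handicap > 0).astype(int)

print(handicap_flag.value_counts().sort_index())
```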

In [None]:
plt.figure(figsize=(14,6),dpi=60)

sns.barplot(x=genderr, y=show, hue='handicap', palette = 'Blues', data=df_new)

plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and handicap')

plt.show()

>**Yes, gender combined with handicap affects whether the patient shows up.** There is a positive relationship between being handicapped and showing up: in both genders, handicapped patients show up on the appointment day more often than those who aren't handicapped.

## Does receiving an SMS affect showing up for the appointment, by gender?

In [None]:
df_new['sms_received'].value_counts()

In [None]:
df_new.groupby(['gender', 'sms_received', 'show']).count()['patient_id']

In [None]:
plt.figure(figsize=(16,6),dpi=60)

sns.barplot(x=genderr, y=show, hue='sms_received', palette = 'Blues', data=df_new)

plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and SMS received')

plt.show()

>**No, receiving an SMS does not appear to improve showing up for either gender.** In both genders, patients who didn't receive an SMS show up for the appointment more often than those who did. So a missing SMS does not explain why patients fail to show up.

In [None]:
df_new.head()

## What is the distribution of age?

In [None]:
df_new['age'].unique()

In [None]:
df_new['age'].value_counts()

In [None]:
df_new['age'].describe()

In [None]:
df_new['age'].mode()

In [None]:
df_new.query('age == -1')

In [None]:
# delete the row with age -1 from the dataset (selected by condition, not by a hard-coded position)
df_new.drop(df_new[df_new['age'] < 0].index, inplace=True)

In [None]:
# know the count of the people with age 0
df_new.query('age == 0')['age'].count()

**Some analysis of the age column:**

min: -1. It is just one row and it doesn't make sense, so I dropped it.

max: 115. It's unusual but possible, so I will leave it as it is.

mode: 0, with 3539 rows. Zero might stand for babies who are not born yet, but that doesn't make sense here, so I am going to convert these values to NaN.

In [None]:
#convert the zero values into nan values.
agee = df_new['age'].replace(0,np.nan)
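
The two cleaning steps for age can be sketched together on a toy column with invented values: negative ages are filtered out by condition, and zeros are turned into NaN so that summary statistics and plots ignore them.

```python
import numpy as np
import pandas as pd

# Made-up ages including one impossible value and two zeros.
toy = pd.DataFrame({'age': [-1, 0, 0, 5, 62, 115]})

# Keep only non-negative ages.
toy = toy[toy['age'] >= 0].copy()

# Treat 0 as missing; the column becomes float because NaN is a float.
age_clean = toy['age'].replace(0, np.nan)

print(age_clean.describe())
```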

In [None]:
fig = px.box(df_new,agee, labels=dict(x = 'Age'))

fig.update_layout(title_text = 'The distribution of the age', title_x = 0.5)


fig.show()

In [None]:
df_new.query('age == 115')

> There are **outliers**: five people at the age of 115.

## Does the age of the patient affect showing up on the appointment day?

In [None]:
# let's first add an age range column to group the ages so that we can visualize them easily.

# right=False makes each bin include its left edge, so 18 falls in '18-39' and 40 in '40-115'
bins = [1, 18, 40, 116]
labels = [ '1-17','18-39', '40-115']
df_new['age_range'] = pd.cut(agee, bins, labels = labels, right = False)
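
How `pd.cut` treats the bin edges decides which label a boundary age such as 18 or 40 receives. The sketch below, on a few hand-picked ages, compares the default right-closed bins with left-closed bins (`right=False`); the upper edge of 116 is my choice so that age 115 is still included.

```python
import pandas as pd

ages = pd.Series([1, 17, 18, 39, 40, 115])
labels = ['1-17', '18-39', '40-115']

# Default: intervals are closed on the right, e.g. (1, 18] and (18, 40].
right_closed = pd.cut(ages, bins=[1, 18, 40, 115], labels=labels, include_lowest=True)

# right=False: intervals are closed on the left, e.g. [1, 18) and [18, 40).
left_closed = pd.cut(ages, bins=[1, 18, 40, 116], labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

With the left-closed version, the labels describe exactly the ages each bin contains.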

In [None]:
plt.figure(figsize=(10,6),dpi=60)

sns.countplot(x='age_range', hue='show',palette= 'crest',data= df_new)

plt.ylabel('Count of show',fontsize=15)
plt.xlabel('Age range',fontsize=15)
plt.title('Show up of the patient based on the age range',fontsize=15)

plt.show()

>**Yes, the age of the patient affects showing up on the appointment day.** There is a positive relationship between age and showing up: as patients get older, the likelihood of showing up on the appointment day increases.

In [None]:
df_new.head()

### Which days do people choose most for their appointments?

In [None]:
# first lets convert scheduled_day and appointment_day to date
df_new['scheduled_day'] = pd.to_datetime(df_new['scheduled_day']).dt.date
df_new['appointment_day'] = pd.to_datetime(df_new['appointment_day']).dt.date

# after that lets add the day name of the appointment
df_new['day_of_appointment'] = pd.to_datetime(df_new['appointment_day']).dt.day_name()
df_new.head()

In [None]:
# checking that scheduled day is before appointment day

check = df_new[df_new['scheduled_day'] > df_new['appointment_day']][['scheduled_day','appointment_day']]
check

My doubts were justified: there are five rows where the appointment day is before the scheduled day. This doesn't make any sense, so I will drop them.

In [None]:
df_new.drop(check.index, inplace=True)  # drop the rows found above by their index labels
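
Selecting the inconsistent rows with a boolean mask avoids depending on fixed row positions. A minimal sketch on made-up dates, assuming the column names used in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'scheduled_day': pd.to_datetime(['2016-05-10', '2016-05-10', '2016-05-18']),
    'appointment_day': pd.to_datetime(['2016-05-12', '2016-05-09', '2016-05-18']),
})

# True where the booking was recorded after the appointment itself.
bad = toy['scheduled_day'] > toy['appointment_day']

# Keep only the consistent rows.
toy_clean = toy[~bad]

print(bad.tolist())
print(len(toy_clean))
```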

In [None]:
plt.figure(figsize=(12,5))

sns.countplot(x='day_of_appointment', data=df_new, palette = 'crest')

plt.title('Days of the appointment',fontsize=15)
plt.xlabel('Day of the week',fontsize=15)
plt.ylabel('count of the appointments',fontsize=15)

plt.show()

 Appointments take place mainly **from Monday to Friday**.

 The day people chose most for their appointments is **Wednesday**.

 **Sunday** has no appointments.

 **Saturday** has the fewest appointments.

### Which day of the week has the highest percentage of showing up?

In [None]:
# the count of each day of appointment
day_count = df_new.groupby('day_of_appointment')['show'].count()
day_count

In [None]:
# the number of people who show up in the appointment day
show_count = df_new.groupby('day_of_appointment')['show'].sum()
show_count

In [None]:
# the proportion of people who show up on each day of the week
show_prop = df_new.groupby('day_of_appointment')['show'].mean()
show_prop

In [None]:
fig = px.pie(values= show_prop.values,
            names = show_prop.index,  # day names in the same order as the values
            color_discrete_sequence=px.colors.sequential.Emrld
            )

fig.update_traces(textinfo = 'label+percent')

fig.update_layout(title_text = 'Share of each day of the week according to show up', title_x = 0.5)

fig.show()

> **Thursday** has the highest percentage of showing up, and **Wednesday** has the second highest.
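
The ranking of days can also be read straight from the per-day show-up rates, without going through pie slices (which show each day's share of the summed rates rather than the rate itself). A sketch on invented appointments:

```python
import pandas as pd

toy = pd.DataFrame({
    'appointment_day': pd.to_datetime([
        '2016-05-02', '2016-05-02',   # Monday
        '2016-05-03', '2016-05-03',   # Tuesday
        '2016-05-05', '2016-05-05',   # Thursday
    ]),
    'show': [1, 0, 1, 1, 1, 0],
})

toy['day_of_appointment'] = toy['appointment_day'].dt.day_name()

# Mean of a 0/1 column per day = show-up rate for that day, highest first.
rate_by_day = (toy.groupby('day_of_appointment')['show']
                  .mean()
                  .sort_values(ascending=False))
best_day = rate_by_day.idxmax()

print(rate_by_day)
print(best_day)
```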



### Which day of the week has the highest percentage of no-shows?

In [None]:
# the numbers of the people who don't show up in the appointment days 
no_show_count = day_count -show_count
no_show_count

In [None]:
# the proportion of people who didn't show up on each day of the week
no_show_prop = 1- show_prop
no_show_prop

In [None]:
fig = px.pie(values= no_show_prop.values,
            names = no_show_prop.index,  # day names in the same order as the values
            color_discrete_sequence=px.colors.sequential.Emrld
            )

fig.update_traces(textinfo = 'label+percent')

fig.update_layout(title_text = "Share of each day of the week according to no show up", title_x = 0.5)

fig.show()

> **Saturday** has the highest percentage of no-shows, followed by **Friday**.

### What is the ratio of shows to no-shows for each month?


In [None]:
df_new['appointment_month'] = pd.to_datetime(df_new['appointment_day']).dt.month_name()
df_new.head()

In [None]:
df_new['appointment_month'].unique()

In [None]:
df_new['appointment_month'].value_counts()

In [None]:
# the number of appointments in each month
month_count = df_new.groupby('appointment_month')['show'].count()
month_count

In [None]:
# the number of the people who show up in these months
month_show = df_new.groupby('appointment_month')['show'].sum()
month_show

In [None]:
# the proportion of people who show up in each month
show_month_prop = df_new.groupby('appointment_month')['show'].mean()
show_month_prop

In [None]:
fig = px.pie(values= show_month_prop.values,
            names = show_month_prop.index,  # month names in the same order as the values
            color_discrete_sequence=px.colors.sequential.Darkmint
            )

fig.update_traces(textinfo = 'label+percent')

fig.update_layout(title_text = 'Month of the appointment and show up ', title_x = 0.5)

fig.show()

> The show-up ratio is almost the same across the three months, and we can't name the month with the most show-ups because the number of appointments differs hugely between them: **May** has 64037 show-ups, **June** has 21568, and **April** has 2602.
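
Because the months differ so much in size, it helps to look at the number of appointments and the show-up rate side by side. A sketch with invented rows, using named aggregation on the grouped `show` column:

```python
import pandas as pd

toy = pd.DataFrame({
    'appointment_month': ['April', 'May', 'May', 'May', 'May', 'June', 'June'],
    'show':              [1, 1, 1, 0, 1, 1, 0],
})

# One row per month: how many appointments there were and what share showed up.
month_summary = toy.groupby('appointment_month')['show'].agg(
    appointments='count',
    show_rate='mean',
)

print(month_summary)
```

A month with very few appointments gives a rate that rests on little data, so the two columns are best read together.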

In [None]:
# the number of the people who didn't show up in these months
month_no_show = month_count - month_show
month_no_show

In [None]:
# the proportion of people who didn't show up in each month
no_show_month_prop = 1- show_month_prop
no_show_month_prop

In [None]:
fig = px.pie(values= no_show_month_prop.values,
            names = no_show_month_prop.index,  # month names in the same order as the values
            color_discrete_sequence=px.colors.sequential.Darkmint
            )

fig.update_traces( textinfo = 'label+percent')

fig.update_layout(title_text = "Month of the appointment and don't show up ", title_x = 0.5)

fig.show()

> We can't name the highest and the lowest ratio because the number of appointments differs hugely between the three months: **May** has 16799 no-shows, **June** has 4882, and **April** has 633.


### Do waiting days affect whether the patient shows up?

In [None]:
df_new['waiting_days'] = df_new['appointment_day'] - df_new['scheduled_day']

In [None]:
df_new.head()

In [None]:
# let's keep only the whole number of days as an integer (no string splitting needed)
df_new['waiting_days'] = (pd.to_datetime(df_new['appointment_day'])
                          - pd.to_datetime(df_new['scheduled_day'])).dt.days
df_new.head()

In [None]:
df_new['waiting_days'].unique()

In [None]:
days = df_new['waiting_days']
# right=False makes each bin include its left edge, so the bins match the labels (e.g. 21 falls in '21-45')
bins= [0,21,46,65,85,105,125,142,160,180]
labels = ['0-20', '21-45', '46-64', '65-84', '85-104', '105-124', '125-141','142-159', '160-179']
df_new['days_range']= pd.cut(days, bins, labels = labels, right = False)
df_new.head()
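
The waiting time and its ranges can be sketched end to end on a toy frame. The dates are invented and only the first three ranges are used: the difference of two datetime columns is a timedelta, `.dt.days` extracts the whole days as integers, and left-closed bins keep the labels aligned with the values.

```python
import pandas as pd

toy = pd.DataFrame({
    'scheduled_day': pd.to_datetime(['2016-05-01', '2016-05-01', '2016-04-01', '2016-05-10']),
    'appointment_day': pd.to_datetime(['2016-05-01', '2016-05-22', '2016-05-16', '2016-05-30']),
})

# Whole days between booking and appointment.
toy['waiting_days'] = (toy['appointment_day'] - toy['scheduled_day']).dt.days

# Left-closed bins: [0, 21), [21, 46), [46, 65).
toy['days_range'] = pd.cut(toy['waiting_days'], bins=[0, 21, 46, 65],
                           labels=['0-20', '21-45', '46-64'], right=False)

print(toy[['waiting_days', 'days_range']])
```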

In [None]:
# graph the count of the waiting days
plt.figure(figsize=(12,5))

sns.countplot(x='days_range', data=df_new, palette= 'crest')

plt.title('Waiting days between scheduled days and appointment days',fontsize=15)
plt.xlabel('Delay in days', fontsize=15)
plt.ylabel('Count of the waiting days',fontsize=15)

plt.show()

In [None]:
# graph the waiting days according to showing up or not
plt.figure(figsize=(12,5))
sns.barplot(x='days_range', y='show',data=df_new, palette='crest')
plt.title('Showing up or not based on waiting days',fontsize=15)
plt.ylabel('Percent of show',fontsize=15)
plt.xlabel('Delay in days ')
plt.show()

> Waiting days **don't appear to affect** whether the patient shows up.

<a id='conclusions'></a>
## Conclusions


- Gender doesn't affect showing up, because the show-up ratios for female and male patients are almost the same.


- Scholarship affects whether the patient shows up, for both female and male patients: people without a scholarship show up more than those with one. It's a negative correlation, although a weak one.


- Hypertension and diabetes are both associated with showing up: patients with either condition show up more often.


- There is no relation between alcoholism and showing up.


- Handicapped patients tend to show up more than non-handicapped patients.


- Receiving an SMS doesn't improve showing up. It's quite the opposite: people who didn't receive an SMS tend to show up more than those who did.


- Age affects showing up: the likelihood of showing up increases as people get older.


- No specific day or month stands out from the others; they are all about the same when it comes to showing up.


- Waiting days have no effect on whether the patient shows up.





### Limitations

- The handicap column has five different values (0, 1, 2, 3, 4). After a lot of research I figured out that 1 to 4 mean handicapped with different degrees of handicap, so I converted (1, 2, 3, 4) to 1, meaning handicapped, and kept 0 for not handicapped.


- The age column has a large number of 0 values, which doesn't make sense in the context of this project.


- The data covers only three months (April, May, June).


- Most of the variables are categorical, which limits the statistical methods that can be applied.



In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])