> **Tip**: Welcome to the Investigate a Dataset project! You will find tips in quoted sections like this to help organize your approach to your investigation. Once you complete this project, remove these **Tip** sections from your report before submission. First things first, you might want to double-click this Markdown cell and change the title so that it reflects your dataset and investigation.

Inverstigating a dataset - No Show Appointments - kagglev2 , may/2016

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description
This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.
● ‘ScheduledDay’ tells us on what day the patient set up their appointment.
● ‘Neighborhood’ indicates the location of the hospital.
● ‘Scholarship’ indicates whether or not the patient is enrolled in Brasilian welfare program Bolsa Família.
● ‘No_show’ it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.





### Question(s) for Analysis

Questions we are trying to answer :
1 - what is the overall appointment show-up vs. no show-up rate
2 - what are the most feature that matter the most of ('Age', 'being alcoholic','Having an SMS', 'Gender', 'Scholarship'), to make the patient make it to his appiontment?¶


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

% matplotlib inline


In [None]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0

<a id='wrangling'></a>
## Data Wrangling

checking for cleanliness, and then trim and clean the dataset for analysis.



### General Properties


In [None]:
# Load the data. 
df = pd.read_csv('noshowappointments.csv')
df.describe()
pd.isna(df).sum()
# as we can see there is no null values
# showing the first 5 lines of the data!
df.head()
# checking the info of the data (data types, null values etc..)
df.info() 

# check if there is a dplicates in the data 
print("Num of dublicated : ", + sum(df.duplicated()))

#check if there is an age with minus or 0 value
df[df["Age"] <= 0]



### Data Cleaning
> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).
 

In [None]:
df.rename(columns = {'Hipertension': 'Hypertension',
                'Handcap': 'Handicap','No-show':'No_show'}, inplace = True)
# converting some columns that has date to a datetime datatype
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
# we fix any age has 0 or less values because it's not make sense to have those values
# in the data (there is no patient has the exactly the age 0 or less)
meanAge = df['Age'].mean()
df[df['Age'] <= 0] = meanAge
df.No_show[df['No_show'] == 'Yes'] = '1'
df.No_show[df['No_show'] == 'No'] = '0'
df['No_show'] = pd.to_numeric(df['No_show'])
# create a mask for people who came
showed = df['No_show'] == 0
not_showed = df['No_show'] == 1
df['showed'] = showed
df['not_showed'] = not_showed


<a id='eda'></a>
## Exploratory Data Analysis






### Research Question 1 (what is the overall appointment show-up vs. no show-up rate?)


In [None]:
allP = df['showed'].value_counts()
print(allP[1] / allP.sum() * 100)
pieChart = allP.plot.pie(figsize=(10,10), autopct='%1.1f%%', fontsize = 12);
pieChart.set_title("Status" + ' (%) (Per appointment)\n', fontsize = 15);
plt.legend();


### Research Question 2 (what are the most feature that matter the most of ('Age', 'being alcoholic','Having an SMS', 'Gender', 'Scholarship'), to make the patient make it to his appiontment?)


In [None]:

df.head()
df.Age[showed].mean()
df.Age[not_showed].mean()
# using group by function to find relations between features 
# visualize the average of people who came + print the percentage

df.groupby('Age')['showed'].mean().hist(alpha=0.6,bins=25,label='show');
plt.xlabel("Showed")
plt.ylabel("Age")
plt.title("Age of people who showed")
plt.legend()
# using group by function to find relations between features
# and then see the relation between alcoholic people and people who showed up and who didn't
print(df.groupby('Alcoholism')['showed'].mean())
df.groupby('Alcoholism')['showed'].mean().plot(kind='bar',figsize=(22,10));
plt.xlabel("Showed")
plt.ylabel("Alcoholism")
plt.title("relation between alcoholics and patients who showed")
plt.legend()
# using group by function to find relations between features
# and then see the relation between sms recievers and people who showed up and who didn't

print(df.groupby('SMS_received')['showed'].mean())
df.groupby('SMS_received')['showed'].mean().plot(kind='bar',figsize=(22,10));
plt.xlabel("Showed")
plt.ylabel("SMS_recieved")
plt.title("relation between SMS_recievers and patients who showed")
plt.legend()
# using group by function to find relations between features
# and then see the relation between Gender and people who showed up and who didn't

print(df.groupby('Gender')['showed'].mean())
df.groupby('Gender')['showed'].mean().plot(kind='bar',figsize=(22,10));
plt.xlabel("Showed")
plt.ylabel("Gender")
plt.title("relation between Gender and patients who showed")
plt.legend()
# using group by function to find relations between features
# and then see the relation between Gender and people who showed up and who didn't

print(df.groupby('Gender')['showed'].mean())
df.groupby('Gender')['showed'].mean().plot(kind='bar',figsize=(22,10));
plt.xlabel("Showed")
plt.ylabel("Gender")
plt.title("relation between Gender and patients who showed")
plt.legend()


<a id='conclusions'></a>
## Conclusions
• As we can see sending an SMS for the appiontment is not neccessary the right option to make sure that the patient will come

• As we can see in our investigation the Age is the most important factor that decided if a patient would come or not the average of age for people who will be most likely to show up is 39.07518726482 , and the average age for people who are not likely to show up is 35.329151291512915.

• As we can see about 22.8% of people that schedule an appointment did not make it to thier appointment

• As we can see most of people who has Scholarship are most likely to miss thier appointments with a percentage of 76.2% of showing and patients who don't have a scholarship have the percentage 80.1%

• the features such as different gender or alcoholic is not a factor to decide if the person would come to his appointment or not!







## Submitting your Project 

> **Tip**: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> **Tip**: Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> **Tip**: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [1]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

0