# Project: Investigate a Dataset - No-show appointments

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

**The No-show appointments dataset collects information from about 110,000 medical appointments in Brazil (110,527 rows in the file used here). It is intended to help explain why patients who book a doctor's appointment fail to show up. The patient and appointment attributes it records can be explored for patterns associated with missed appointments, which helps answer the questions posed in the analysis section and supports efforts to reduce problems such as inadequate management of patients' ailments.**

The columns in the dataset are: 

- Patient ID - This is a unique identification for individual patients and it helps identify one patient from another.


- Appointment ID - This is a unique identifier for each appointment, and it distinguishes one appointment from another.


- Gender - This column holds the gender of the patient, either male or female.


- Scheduled day - This is the day the patient booked the appointment. Compared with the appointment day, it shows how far in advance the booking was made, which may be related to missed appointments.


- Appointment day - This is the day of the appointment, when the patient is expected to see the doctor. Together with the No show column, it tells us which appointments were missed.


- Age - This is how old the patient is. It helps us identify which age groups attend their appointments.


- Neighbourhood - This is the location of the health facility. It helps in determining which hospitals are most visited by patients.


- Scholarship - This shows if the patient is enrolled in the Bolsa Familia welfare program. It helps us see whether program members attend their appointments.


- Diabetes - This shows if the patient has diabetes or not. It helps us see whether patients with diabetes attend their appointments.



- Hipertension - This shows if the patient has hypertension or not (the column name is misspelled in the source file). It helps us see whether patients with hypertension attend their appointments.


- Alcoholism - This shows if the patient is an alcoholic. It helps us see whether patients with alcoholism attend their appointments.


- Handcap - This shows if the patient has a handicap (the column name is misspelled in the source file). It helps us see whether patients with a handicap attend their appointments.


- SMS received - This shows if the patient has received an SMS from the hospital. It helps us see whether patients who receive an SMS attend their appointments.


- No show - This shows whether the patient missed the appointment. In this dataset `No` means the patient showed up and `Yes` means they did not, so this is the outcome the analysis tries to explain.


### Question(s) for Analysis
The following are some of the questions that we seek to answer from the analysis of this dataset:

1. What share of patients attend their appointments?


2. What is the correlation between the independent variables and the dependent variable?


3. What factors are most important in predicting the attendance of the patients to their appointments?


In [3]:
#Import statements for all of the packages that are used.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling

In this section the data is loaded into the notebook and inspected for problems, which are then fixed in the Data Cleaning subsection.


### General Properties

In [21]:
# Load the data and print the first few rows to inspect it
df_noshow = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df_noshow.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [22]:
# Checking general information on the dataset, i.e. datatypes and the number of rows and columns
df_noshow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [23]:
# Looking for instances of missing data.
df_noshow.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

In [24]:
# Looking for instances of duplicated data
df_noshow.duplicated().sum()

0

In [35]:
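# Counting the unique values in each column (the output shown is from a later run,
# after the renaming and column drops done in Data Cleaning)
df_noshow.nunique()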

Gender                 2
ScheduledDay      103549
AppointmentDay        27
Age                  104
Neighbourhood         81
Scholarship            2
Hypertension           2
Diabetes               2
Alcoholism             2
Handicap               5
SMS_received           2
No-show                2
dtype: int64

In [26]:
#Looking at the unique values in the Handcap (handicap) column
df_noshow.Handcap.unique()

array([0, 1, 2, 3, 4], dtype=int64)
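
The output above shows five distinct levels rather than a simple 0/1 flag. Before deciding how to treat the column, it may help to check how often each level occurs. The following is a minimal sketch of that check; the small `toy` frame is made-up data standing in for `df_noshow`, not values taken from the real file.

```python
import pandas as pd

# Made-up stand-in for df_noshow; the values are illustrative only.
toy = pd.DataFrame({"Handcap": [0, 0, 1, 0, 2, 1, 0]})

# Count how many rows fall into each level, ordered by the level itself.
level_counts = toy["Handcap"].value_counts().sort_index()
print(level_counts)
```

Applied to the real data, the same call would be `df_noshow["Handcap"].value_counts().sort_index()`.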

In [27]:
#Looking at the datatypes of the column values
df_noshow.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

### Data Cleaning

Inspecting the dataset revealed a few problems:

1. The column names Hipertension and Handcap are misspelled.

2. The Patient ID and Appointment ID columns are not relevant to this analysis.

3. The datatypes of Scheduled day and Appointment day are incorrect. They are stored as objects (strings), but because they hold dates they should be datetime datatypes.

In this Data Cleaning section we correct the problems listed above.

In [28]:
# Renaming Hipertension column to Hypertension
df_noshow.rename(columns={'Hipertension': 'Hypertension'}, inplace=True)

#Confirming changes
df_noshow.head(1)

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |


In [29]:
#Renaming the Handcap column to Handicap
df_noshow.rename(columns={'Handcap':'Handicap'}, inplace=True)

#Confirming changes
df_noshow.head(1)

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |


In [None]:
#Renaming the SMS_received column to SMS received
df_noshow.rename(columns={'SMS_received':'SMS received'}, inplace=True)

#Confirming changes
df_noshow.head(1)

In [30]:
#Dropping columns we do not need for this analysis
df_noshow.drop(['PatientId','AppointmentID'],axis=1, inplace=True)

#Confirming changes
df_noshow.head(1)

| | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |


In [37]:
#Converting Scheduled day from object to datetime datatype
df_noshow['ScheduledDay'] = pd.to_datetime(df_noshow['ScheduledDay'])
#Confirming changes
df_noshow.dtypes

Gender                         object
ScheduledDay      datetime64[ns, UTC]
AppointmentDay                 object
Age                             int64
Neighbourhood                  object
Scholarship                     int64
Hypertension                    int64
Diabetes                        int64
Alcoholism                      int64
Handicap                        int64
SMS_received                    int64
No-show                        object
dtype: object

In [38]:
#Converting Appointment day from object to datetime datatype
df_noshow['AppointmentDay'] = pd.to_datetime(df_noshow['AppointmentDay'])
#Confirming changes
df_noshow.dtypes

Gender                         object
ScheduledDay      datetime64[ns, UTC]
AppointmentDay    datetime64[ns, UTC]
Age                             int64
Neighbourhood                  object
Scholarship                     int64
Hypertension                    int64
Diabetes                        int64
Alcoholism                      int64
Handicap                        int64
SMS_received                    int64
No-show                        object
dtype: object
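
With both columns stored as datetimes, the gap between booking and appointment can be derived. The sketch below is one possible approach and is not part of the original analysis: the column name `WaitingDays` and the toy rows are assumptions made for illustration. Both timestamps are normalized to midnight first, because `AppointmentDay` carries no time of day and a same-day booking would otherwise produce a negative gap.

```python
import pandas as pd

# Made-up rows in the same timestamp format as the dataset.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Same conversion as in the cleaning cells above.
for col in ["ScheduledDay", "AppointmentDay"]:
    toy[col] = pd.to_datetime(toy[col])

# Drop the time of day from both columns, then take the difference in whole days.
# "WaitingDays" is a hypothetical column name chosen for this sketch.
toy["WaitingDays"] = (
    toy["AppointmentDay"].dt.normalize() - toy["ScheduledDay"].dt.normalize()
).dt.days

print(toy)
```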

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1: What share of patients attend their appointments?

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.
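
A minimal sketch of how this question might be approached, assuming the `No-show` column keeps its original `Yes`/`No` labels, where `No` means the patient attended. The `toy` frame is made-up data used only so the snippet runs on its own; the same calls would be applied to `df_noshow`.

```python
import pandas as pd

# Made-up stand-in for df_noshow; "No" = attended, "Yes" = missed.
toy = pd.DataFrame({"No-show": ["No", "No", "Yes", "No", "Yes"]})

# Absolute number of attended and missed appointments.
counts = toy["No-show"].value_counts()

# The same breakdown as proportions of all appointments.
rates = toy["No-show"].value_counts(normalize=True)

print(counts)
print(rates)
```

On the real data, `rates.plot(kind="bar")` would give a simple single-variable view of attendance.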


### Research Question 2: What is the correlation between the independent variables and the dependent variable?

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.
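
One way to approach this question is to encode the outcome as a number and correlate it with the numeric columns. The sketch below assumes `No-show` is the dependent variable and uses a hypothetical helper column named `missed`; the toy rows are invented for illustration and say nothing about the real data. Any coefficient obtained this way describes association only, not causation.

```python
import pandas as pd

# Made-up stand-in for df_noshow with two numeric columns and the outcome.
toy = pd.DataFrame({
    "Age": [62, 56, 8, 30, 45, 21],
    "SMS_received": [0, 0, 1, 1, 0, 0],
    "No-show": ["No", "No", "Yes", "Yes", "No", "Yes"],
})

# Encode the outcome as 1 = missed, 0 = attended ("missed" is a hypothetical name).
toy["missed"] = (toy["No-show"] == "Yes").astype(int)

# Pearson correlation of each numeric column with the encoded outcome.
corr_with_missed = (
    toy[["Age", "SMS_received", "missed"]].corr()["missed"].drop("missed")
)

print(corr_with_missed)
```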


### Research Question 3: What factors are most important in predicting the attendance of the patients to their appointments?
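
Without fitting a model, a simple starting point is to compare the no-show rate within each level of a factor. The sketch below does that with a hypothetical helper, `no_show_rate_by`, on made-up rows; it assumes `Yes` in `No-show` marks a missed appointment. Differences in these rates are descriptive and do not by themselves establish predictive importance or causation.

```python
import pandas as pd


def no_show_rate_by(df, column):
    """Return the share of missed appointments for each value of `column`."""
    missed = df["No-show"] == "Yes"
    # The mean of a boolean series is the fraction of True values in each group.
    return missed.groupby(df[column]).mean()


# Made-up stand-in for df_noshow.
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "Scholarship": [0, 1, 0, 1],
    "No-show": ["No", "Yes", "Yes", "Yes"],
})

for factor in ["Gender", "Scholarship"]:
    print(no_show_rate_by(toy, factor))
```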

<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed in relation to the question(s) provided at the beginning of the analysis. Summarize the results accurately, and point out where additional research can be done or where additional information could be useful.

> **Tip**: Make sure that you are clear with regards to the limitations of your exploration. You should have at least 1 limitation explained clearly. 

> **Tip**: If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it is satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

## Submitting your Project 

> **Tip**: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> **Tip**: Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> **Tip**: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])