# Project: Investigate a Dataset (Medical Appointment No Shows)

### Table of Contents
+ [Introduction](#introduction)
+ [Data Wrangling](#data_wrangling)
+ [Exploratory Data Analysis](#exploratory_data_analysis)
+ [Conclusions](#conclusions)

<a id='introduction'></a>
## Introduction:
__Selected Dataset:__ [No Show Appointments](https://www.kaggle.com/datasets/joniarroba/noshowappointments)

__Dataset Description:__
This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

+ __PatientId:__ indicates the patient ID; duplication is possible due to cases where the same patient booked more than one appointment.
+ __AppointmentID:__ indicates the appointment ID; this field should be unique.
+ __Gender:__ indicates the patient's gender __(M/F)__
+ __ScheduledDay:__ indicates the date/time the patient booked their appointment.
+ __AppointmentDay:__ indicates the date of the appointment itself (recorded without a time of day).
+ __Age:__ indicates the patient's age.
+ __Neighbourhood:__ indicates the location of the hospital.
+ __Scholarship:__ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
+ __Hipertension:__ indicates whether or not the patient is experiencing Hypertension.
+ __Diabetes:__ indicates whether or not the patient is experiencing Diabetes.
+ __Alcoholism:__ indicates whether or not the patient is experiencing Alcoholism.
+ __Handcap:__ indicates whether or not the patient has special needs (the column holds values from 0 to 4 rather than a simple 0/1 flag).
+ __SMS_received:__ indicates whether or not the patient has received a reminder text message.
+ __No-show:__ __‘No’__ if the patient showed up to their appointment, and __‘Yes’__ if they did not show up.
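
Because the `No-show` labels read in the opposite direction to "showed up", it may help to derive an explicit flag before analysing. The following is a minimal sketch on toy rows (illustrative only, not taken from the dataset); the column name `showed_up` is an assumption introduced here.

```python
import pandas as pd

# Toy rows (illustrative only) using the dataset's "No-show" encoding.
sample = pd.DataFrame({"No-show": ["No", "Yes", "No", "No"]})

# "No" means the patient attended, so turn the label into an explicit flag.
# `showed_up` is a hypothetical column name, not part of the original data.
sample["showed_up"] = sample["No-show"].eq("No")

# Share of appointments in the toy sample that were missed.
no_show_rate = (~sample["showed_up"]).mean()
print(no_show_rate)
```

Keeping the original column alongside the derived flag makes it easy to cross-check the two during cleaning.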

__Questions to be analyzed from the dataset__
+ What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?
+ How do age and alcoholism relate to whether a patient shows up?
+ Are no-shows higher on certain days?
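
One possible way to approach the question about days is to group appointments by weekday and compare the share that were missed. This is only a sketch on toy data (the rows below are made up for illustration); the column names `weekday` and `missed` are assumptions.

```python
import pandas as pd

# Toy appointments (illustrative only, not taken from the real file).
toy = pd.DataFrame({
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z",
                       "2016-05-02T00:00:00Z", "2016-05-03T00:00:00Z"],
    "No-show": ["No", "Yes", "No", "Yes"],
})

# Parse the ISO timestamps so the weekday name can be extracted.
toy["AppointmentDay"] = pd.to_datetime(toy["AppointmentDay"])
toy["weekday"] = toy["AppointmentDay"].dt.day_name()

# "Yes" in No-show means the appointment was missed.
toy["missed"] = toy["No-show"].eq("Yes")

# Mean of a boolean column gives the no-show rate per weekday.
rate_by_weekday = toy.groupby("weekday")["missed"].mean()
print(rate_by_weekday)
```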



In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='data_wrangling'></a>
## Data Wrangling:

### 1. Gather and Assess Dataset:
Load the data, then inspect its general properties (shape, data types, unique values, and summary statistics) to build intuition about the dataset.

In [21]:
# Load dataset
df = pd.read_csv('noshowappointments.csv')

# Display first 2 rows of dataset
df.head(2)

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |


In [22]:
# get the general info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
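
The `info()` output lists `ScheduledDay` and `AppointmentDay` as `object`, meaning they are stored as strings. A sketch of one way to parse them, using the two timestamp pairs shown in `df.head(2)`, is given here; the derived column `waiting_days` is an assumption added for illustration.

```python
import pandas as pd

# The two timestamp pairs displayed by df.head(2).
rows = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Convert the ISO 8601 strings to timezone-aware datetimes.
for col in ["ScheduledDay", "AppointmentDay"]:
    rows[col] = pd.to_datetime(rows[col])

# AppointmentDay carries no time of day, so compare calendar dates only.
# `waiting_days` is a hypothetical helper column.
rows["waiting_days"] = (
    rows["AppointmentDay"].dt.normalize() - rows["ScheduledDay"].dt.normalize()
).dt.days
print(rows.dtypes)
```

Normalizing both columns to midnight avoids negative gaps for same-day appointments, where the booking time falls later in the day than the midnight appointment stamp.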


In [23]:
df.shape

(110527, 14)

In [27]:
# Get the number of unique values in each column
df.nunique()

PatientId          62299
AppointmentID     110527
Gender                 2
ScheduledDay      103549
AppointmentDay        27
Age                  104
Neighbourhood         81
Scholarship            2
Hipertension           2
Diabetes               2
Alcoholism             2
Handcap                5
SMS_received           2
No-show                2
dtype: int64
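
The counts above show fewer unique `PatientId` values than rows, while `AppointmentID` is unique per row, so some patients must have more than one appointment. A minimal sketch of how repeat patients could be counted follows; the IDs are hypothetical toy values.

```python
import pandas as pd

# Toy data with hypothetical IDs; patient 101 has three appointments.
toy = pd.DataFrame({
    "PatientId": [101, 102, 101, 103, 101],
    "AppointmentID": [1, 2, 3, 4, 5],
})

# Number of appointments booked by each patient.
appointments_per_patient = toy["PatientId"].value_counts()

# Patients that appear more than once.
repeat_patients = (appointments_per_patient > 1).sum()

# AppointmentID should identify each row uniquely.
appointment_ids_unique = toy["AppointmentID"].is_unique
print(repeat_patients, appointment_ids_unique)
```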

In [46]:
df.describe()

| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |
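
The summary statistics report a minimum `Age` of -1, which cannot be a real age. One way such rows might be flagged and dropped during cleaning is sketched here on toy values (illustrative only); whether to drop or correct them in the real data is a judgment call.

```python
import pandas as pd

# Toy ages (illustrative only), including one invalid negative value.
toy = pd.DataFrame({"Age": [62, 56, -1, 0, 115]})

# Flag ages below zero; an age of 0 is kept as it can denote an infant.
invalid_age = toy["Age"] < 0

# Keep only the rows with a plausible age.
cleaned = toy.loc[~invalid_age].reset_index(drop=True)
print(cleaned)
```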


In [24]:
# Do a quick viz
# Check for duplicates, and datatypes
# Explore Data


__Dataset Observation:__
+ There are no missing values.
+ There are 14 columns and 110527 rows.
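
Observations like these can be checked directly rather than read off `info()`. The sketch below counts missing values and fully duplicated rows on a small toy frame (made-up values, illustrative only); the same two calls could be applied to `df`.

```python
import pandas as pd

# Toy frame (illustrative only) with one missing value and one duplicated row.
toy = pd.DataFrame({
    "AppointmentID": [1, 2, 2, 3],
    "Age": [30, 41, 41, None],
})

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Rows that exactly repeat an earlier row.
duplicate_rows = toy.duplicated().sum()
print(missing_per_column, duplicate_rows)
```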

Next we plot the dataframe to get a general overview and preliminary understanding of the dataset.

In [None]:
# Replace hyphens with underscores and lowercase the column labels

df.rename(columns=lambda x: x.strip().lower().replace("-", "_"), inplace=True)
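
Column labels such as `PatientId` and `No-show` mix casing and punctuation. A minimal sketch of one way to normalise them to snake_case, including an underscore before a trailing `Day` or `Id`, is shown here. The helper name `to_snake_case` and its regular expression are assumptions, not part of the original notebook.

```python
import re
import pandas as pd

def to_snake_case(label: str) -> str:
    """Hypothetical helper: convert a column label to snake_case."""
    label = label.strip().replace("-", "_")
    # Insert an underscore before a trailing "Day", "Id" or "ID" suffix.
    label = re.sub(r"(?<=[a-z])(Day|Id|ID)$", r"_\1", label)
    return label.lower()

# Empty frame carrying a few of the dataset's column labels.
columns = ["PatientId", "AppointmentID", "ScheduledDay",
           "AppointmentDay", "SMS_received", "No-show"]
renamed = pd.DataFrame(columns=columns).rename(columns=to_snake_case)
print(list(renamed.columns))
```

Passing a function to `rename` keeps the mapping in one place, so the same rule applies to every column.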

<a id='exploratory_data_analysis'></a>
## Exploratory Data Analysis:


<a id='conclusions'></a>
## Conclusions:
