<a href="https://colab.research.google.com/github/dishankkalra23/Medical-Appointment-No-Shows/blob/main/Medical_Appointment_No_Shows.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Medical Appointment No Shows

>[Project: Medical Appointment No Shows](#scrollTo=BmfbjWRNOPeO)

>>>[Importing Libraries](#scrollTo=zy3dSfeNPPCc)

>>>[Downloading dataset](#scrollTo=by5KNcWYVbXS)

>>>[Loading dataset](#scrollTo=N14ODJZWXEZd)

>[Data Wrangling](#scrollTo=zzG5_tl6_OJF)

>>[Making new column Scheduled time](#scrollTo=kislWLDHJly8)

>>[Rename columns](#scrollTo=HaJI_xlfgW12)

>>[Changing the no_show column to show to avoid confusion](#scrollTo=FHgwFcz7kNr4)

>>[TO BE CHECKED: handicap and sms_received](#scrollTo=Cq1jFItJpjz8)

>>[Duplicates in data](#scrollTo=KLgc2zuUrD_9)

>>[Exploratory Data Analysis](#scrollTo=z2dmfUqaOPee)

>>>[Research Question 1 (Replace this header name!)](#scrollTo=z2dmfUqaOPee)

>>>[Research Question 2  (Replace this header name!)](#scrollTo=2HAUXwNDOPef)

>>[Conclusions](#scrollTo=LfvthrGmOPeg)



### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Downloading dataset

In [2]:
! pip install -q kaggle

In [3]:
# Upload your kaggle.json file containing API token
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [4]:
# Copy the uploaded token to ~/.kaggle/kaggle.json (-p avoids an error if the folder already exists)
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/

# Restrict the token file to its owner, since it contains API credentials
! chmod 600 ~/.kaggle/kaggle.json
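
> The same token setup can be sketched in plain Python. This is a minimal illustration, not part of the original notebook: `install_kaggle_token` is a hypothetical helper, and the token written here is a placeholder in a temporary directory rather than a real credential.

```python
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_file: Path, home: Path) -> Path:
    """Copy a kaggle.json token into <home>/.kaggle with owner-only permissions."""
    target_dir = home / ".kaggle"
    # exist_ok=True means a second run does not fail when the folder exists
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    target.write_bytes(token_file.read_bytes())
    target.chmod(0o600)  # read/write for the owner only
    return target


# Demonstration in a throwaway directory with a placeholder token
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    fake_token = tmp_path / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(fake_token, home=tmp_path / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    # Running the helper again is safe because the directory may already exist
    rerun = install_kaggle_token(fake_token, home=tmp_path / "home")
```

> In Colab, `home` would be `Path.home()` and `token_file` the uploaded `kaggle.json`.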


In [5]:
# Copy API command to download dataset
! kaggle datasets download -d joniarroba/noshowappointments
! unzip \*.zip
! rm *.zip

Downloading noshowappointments.zip to /content
  0% 0.00/2.40M [00:00<?, ?B/s]
100% 2.40M/2.40M [00:00<00:00, 80.0MB/s]
Archive:  noshowappointments.zip
  inflating: KaggleV2-May-2016.csv   


### Loading dataset

In [6]:
df = pd.read_csv('/content/KaggleV2-May-2016.csv')

In [7]:
df.sample(5)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
1534,128412400000000.0,5503003,M,2016-03-22T16:20:40Z,2016-04-29T00:00:00Z,16,TABUAZEIRO,0,0,0,0,0,1,Yes
83901,8264633000000.0,5732117,F,2016-05-24T11:00:37Z,2016-05-24T00:00:00Z,74,REPÚBLICA,0,1,1,0,0,0,No
20677,92944550000000.0,5697916,F,2016-05-16T07:03:38Z,2016-05-25T00:00:00Z,25,SÃO JOSÉ,1,0,0,0,0,1,No
16442,28677350000.0,5626371,M,2016-04-27T08:09:33Z,2016-05-12T00:00:00Z,20,TABUAZEIRO,0,0,0,0,0,1,No
82429,967716100000000.0,5591348,F,2016-04-15T16:43:05Z,2016-05-31T00:00:00Z,19,ITARARÉ,0,0,0,0,0,1,No


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


> No missing values in the data.

> **Column data types to fix:**
1. PatientId is a unique patient identifier, so storing it as a float makes no sense; an integer type is a better fit.
2. ScheduledDay and AppointmentDay are stored as strings; converting them to datetime makes them usable in the analysis.
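
> The two observations above can be checked programmatically. The sketch below is an illustration on two rows copied from the `df.sample(5)` output, not on the full dataset; the variable names are made up for this example.

```python
import pandas as pd

# Two rows copied from the sample output above (only the relevant columns)
sample = pd.DataFrame({
    "PatientId": [128412400000000.0, 8264633000000.0],
    "ScheduledDay": ["2016-03-22T16:20:40Z", "2016-05-24T11:00:37Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-24T00:00:00Z"],
    "Age": [16, 74],
})

# Count missing values per column, then in total
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

# List the date columns that are not yet stored as datetime
needs_datetime = [
    col for col in ["ScheduledDay", "AppointmentDay"]
    if not pd.api.types.is_datetime64_any_dtype(sample[col])
]
```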


# Data Wrangling

In [9]:
df.PatientId = df.PatientId.astype('int64')
df.PatientId.dtypes

dtype('int64')

In [10]:
df.ScheduledDay = pd.to_datetime(df.ScheduledDay)
df.AppointmentDay = pd.to_datetime(df.AppointmentDay)
df[['ScheduledDay','AppointmentDay']].dtypes

ScheduledDay      datetime64[ns, UTC]
AppointmentDay    datetime64[ns, UTC]
dtype: object

## Making new column Scheduled time 
> Splitting the timestamps: ScheduledDay and AppointmentDay keep only the date, while the new ScheduledTime and AppointmentTime columns store the time of day.

In [11]:
# Both columns are already datetime64 (converted above), so .dt can be used directly
df['ScheduledTime'] = df.ScheduledDay.dt.time
df['AppointmentTime'] = df.AppointmentDay.dt.time

In [12]:
df['ScheduledDay'] = df['ScheduledDay'].dt.date
df['AppointmentDay'] = df['AppointmentDay'].dt.date
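
> One thing to keep in mind: `.dt.date` returns plain Python `date` objects, so the column dtype becomes `object` and the `.dt` accessor is no longer available on it. A possible alternative, shown here as a sketch on two timestamps from the sample output and not as the notebook's method, is `.dt.normalize()`, which sets the time to midnight but keeps the datetime dtype.

```python
import pandas as pd

# Two ScheduledDay values taken from the sample output above
scheduled = pd.to_datetime(
    pd.Series(["2016-03-22T16:20:40Z", "2016-05-24T11:00:37Z"])
)

as_objects = scheduled.dt.date           # Python date objects, dtype is object
as_midnight = scheduled.dt.normalize()   # time set to 00:00:00, dtype stays datetime64

# The .dt accessor still works on the normalized column
weekday = as_midnight.dt.day_name()
```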

In [13]:
df.AppointmentTime.sample(5)

18170     00:00:00
110369    00:00:00
92637     00:00:00
34327     00:00:00
46996     00:00:00
Name: AppointmentTime, dtype: object

In [14]:
df.AppointmentTime.nunique()

1

> AppointmentTime is 00:00:00 in every row, so it carries no information for the analysis. Dropping the AppointmentTime column.
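
> The same check can be generalized to find every column that holds a single value. This is a small sketch on toy data (times borrowed from the sample output); `constant_columns` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame: one constant column and one varying column
frame = pd.DataFrame({
    "AppointmentTime": ["00:00:00", "00:00:00", "00:00:00"],
    "ScheduledTime": ["16:20:40", "11:00:37", "07:03:38"],
})

# A column with exactly one distinct value (counting NaN) cannot explain anything
constant_columns = [
    col for col in frame.columns if frame[col].nunique(dropna=False) == 1
]
trimmed = frame.drop(columns=constant_columns)
```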

In [15]:
df.drop(columns='AppointmentTime',inplace=True)
df.columns

Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show',
       'ScheduledTime'],
      dtype='object')

In [16]:
df.Age.describe()

count    110527.000000
mean         37.088874
std          23.110205
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: Age, dtype: float64

> Age can never be negative, so rows with an age below 0 are removed.

In [17]:
df_less_0 = df.query('Age < 0')
df_less_0.Age.count()

1

> Only a single row has an age below 0.

In [18]:
df = df.query('Age >= 0')
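
> A filtered frame can behave like a view of the original in some pandas versions, which may trigger a `SettingWithCopyWarning` when it is modified later. The sketch below, on toy ages, shows one way to avoid that by taking an explicit copy. It is an optional variation, not the notebook's code.

```python
import pandas as pd

# Toy data with one impossible age, mirroring the single -1 found above
ages = pd.DataFrame({"Age": [16, -1, 74, 0]})

invalid = ages[ages["Age"] < 0]          # rows to inspect before dropping
valid = ages[ages["Age"] >= 0].copy()    # independent copy, safe to modify

# Later in-place style changes now touch only the copy
valid["age_is_zero"] = valid["Age"] == 0
```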

## Rename columns

In [21]:
labels = {'PatientId':"patient_id", 'AppointmentID':'appointment_id', 'Gender':'gender', 
        'ScheduledDay':'scheduled_day',
       'AppointmentDay':'appointment_day', 'Age':'age', 'Neighbourhood':'neighbourhood', 
       'Scholarship':'scholarship', 'Hipertension':'hypertension',
       'Diabetes':'diabetes', 'Alcoholism':'alcoholism', 'Handcap':'handicap', 
       'SMS_received':'sms_received', 'No-show':'show',
       'ScheduledTime':'scheduled_time'}
df.rename(columns=labels,inplace=True)

In [22]:
df.columns

Index(['patient_id', 'appointment_id', 'gender', 'scheduled_day',
       'appointment_day', 'age', 'neighbourhood', 'scholarship',
       'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received',
       'show', 'scheduled_time'],
      dtype='object')

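## Changing the no_show column to show to avoid confusion

The No-show column was renamed to show above, so its values are inverted to match: Yes (the patient missed the appointment) becomes 0 and No (the patient showed up) becomes 1.

In [None]:
# Assign the result back; a bare astype(int) call would discard the converted column
df['show'] = df['show'].map({'Yes': 0, 'No': 1}).astype(int)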
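
> `Series.map` turns any value missing from the dictionary into `NaN`, so it can be worth checking for unmapped values before casting to `int`. A minimal sketch on three toy values:

```python
import pandas as pd

# Toy version of the original No-show column
no_show = pd.Series(["Yes", "No", "No"], name="No-show")

# Invert the meaning: 1 means the patient showed up
show = no_show.map({"Yes": 0, "No": 1})

# Values not present in the mapping would appear as NaN here
unmapped = int(show.isna().sum())

# Cast only after confirming nothing was left unmapped
show = show.astype(int)
```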

In [29]:
df.columns

Index(['patient_id', 'appointment_id', 'gender', 'scheduled_day',
       'appointment_day', 'age', 'neighbourhood', 'scholarship',
       'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received',
       'show', 'scheduled_time'],
      dtype='object')

## TO BE CHECKED: handicap and sms_received

 

In [45]:
handicap_patients = df.query("handicap > 0").handicap.count()
total_patients = df.handicap.count()
handicap_patients/total_patients

0.020275772216492047

In [47]:
got_sms = df.query("sms_received == 1").sms_received.count()
total_patients = df.sms_received.count()
got_sms/total_patients

0.3210285362720084
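
> Both proportions can also be computed as the mean of a boolean mask, which avoids the separate count and division. The sketch below uses made-up toy values, so the numbers differ from the results above.

```python
import pandas as pd

# Toy values only; the handicap column is assumed to hold levels above 1 as well,
# which is why the notebook filters on "handicap > 0"
flags = pd.DataFrame({
    "handicap": [0, 0, 2, 0],
    "sms_received": [1, 0, 0, 1],
})

# The mean of a True/False mask is the fraction of True rows
handicap_share = (flags["handicap"] > 0).mean()
sms_share = (flags["sms_received"] == 1).mean()

# Distribution of the distinct handicap values
handicap_levels = flags["handicap"].value_counts().sort_index()
```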

## Duplicates in data

In [24]:
df.duplicated().sum()

0
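
> `df.duplicated()` only flags rows that match in every column. If `appointment_id` is meant to be unique per appointment (an assumption, not something the notebook verifies), duplicates can also be checked on that column alone. A sketch on toy data:

```python
import pandas as pd

# Toy frame: no fully identical rows, but one appointment_id appears twice
appointments = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "appointment_id": [10, 11, 11],
})

# Rows identical across all columns
full_row_duplicates = int(appointments.duplicated().sum())

# Rows that repeat an earlier appointment_id
repeated_ids = int(appointments.duplicated(subset="appointment_id").sum())
```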

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 (Replace this header name!)

In [25]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [26]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
