# MEDICAL APPOINTMENT / SHOWING OR NOT SHOWING UP

This dataset records patients who made medical appointments and whether they eventually showed up. It has 14 features in total that can help us find patterns in why people miss their appointments:<br />
<div display='flex'>
    -PatientId (float) 39k to 1e15 approx<br />
    -AppointmentID (int) 5e6 to 6e6 approx<br />
    -Gender (String) F/M<br />
    -ScheduledDay (String) Format=YYYY-MM-DDTHH:MM:SSZ where the following chars are constant: [-,T,:,Z]<br />
    -AppointmentDay (String) Format=YYYY-MM-DDTHH:MM:SSZ where the following chars are constant: [-,T,:,Z]<br />
    -Age (int) -1 to 115<br />
    -Neighbourhood (String) Names of the different neighbourhoods<br />
    -Scholarship (int) 0 to 1<br />
    -Hipertension (int) 0 to 1<br />
    -Diabetes (int) 0 to 1<br />
    -Alcoholism (int) 0 to 1<br />
    -Handcap (int) 0 to 4<br />
    -SMS_received (int) 0 to 1<br />
    -No-show (String) No/Yes<p />
Source: https://www.kaggle.com/joniarroba/noshowappointments<p />

### For proper analysis, non-numerical data needs to be transformed into numerical values:
The following data is already numerical and progressive (ordered), so no conversion is needed:<br />
[PatientId, AppointmentID, Age, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received]<br />
The following data is not numerical but can be converted into progressive numerical data:<br />
[ScheduledDay, AppointmentDay]<br />
The following data is not numerical but can be converted into discrete numerical data:<br />
[Gender, No-show]<br />
The following data is not numerical and cannot be converted into progressive numerical data due to lack of information:<br />
[Neighbourhood] I don't see how neighbourhood contributes to any classification without being able to rank the neighbourhoods by another variable, such as average income or average travel time to the hospital.
Since I don't have a basis for this ranking, but it makes sense that neighbourhood carries weight here, I will convert it into integer values based on alphabetical order.
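
As a minimal sketch of that alphabetical encoding, assuming pandas' categorical type is an acceptable way to do it (the notebook itself does not show this step), the neighbourhood names can be turned into integer codes as below. The `sample` frame and the `NeighbourhoodCode` column name are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy frame built from neighbourhood names that appear in df.head()
sample = pd.DataFrame({
    "Neighbourhood": [
        "JARDIM DA PENHA",
        "MATA DA PRAIA",
        "PONTAL DE CAMBURI",
        "JARDIM DA PENHA",
    ]
})

# astype("category") sorts the distinct strings, so the integer codes
# follow alphabetical order: the first name alphabetically gets code 0.
sample["NeighbourhoodCode"] = sample["Neighbourhood"].astype("category").cat.codes

print(sample)
```

These codes only reflect spelling, so any ordering a model picks up from them is arbitrary, which is the limitation already noted above.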


## Initial Data Analysis

In [1]:
groupNames=['Gender','ScheduledDay','AppointmentDay','Neighbourhood','No-show']

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import datetime

In [3]:
df=pd.read_csv('medicalappointments.csv')

In [4]:
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


#### Determine Data Types

In [5]:
df.describe() #This will help determine numerical types

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


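Numerical [PatientId, AppointmentID, Age, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received]. Note that `describe()` prints every statistic as a float; in the `df.head()` output only PatientId carries a decimal point, so the remaining columns appear to be stored as integers. Handcap also reaches a maximum of 4, so it is not a plain 0/1 flag.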
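
Since `describe()` hides the storage type, a more direct check is to ask pandas for the dtypes. The sketch below uses a small hypothetical `sample` frame with values copied from `df.head()`; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# Toy frame with a few columns and values taken from df.head()
sample = pd.DataFrame({
    "PatientId": [29872500000000.0, 558997800000000.0],
    "AppointmentID": [5642903, 5642503],
    "Gender": ["F", "M"],
    "Age": [62, 56],
})

# dtypes lists the storage type of every column
print(sample.dtypes)

# select_dtypes splits the columns into numeric and non-numeric groups
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```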

##### Non Numerical Data
For the rest of the features, I will group by each one and print the group sizes to see every value it can take.

In [6]:
groupNames=['Gender','ScheduledDay','AppointmentDay','Neighbourhood','No-show']
group=[]
for name in groupNames:
    grouped=df.groupby(by=name)
    group.append(grouped)
    print(grouped.size())  # number of rows for each distinct value

Gender
F    71840
M    38687
dtype: int64
ScheduledDay
2015-11-10T07:13:56Z    1
2015-12-03T08:17:28Z    1
2015-12-07T10:40:59Z    1
2015-12-07T10:42:42Z    1
2015-12-07T10:43:01Z    1
2015-12-07T10:43:17Z    1
2015-12-07T10:43:34Z    1
2015-12-07T10:43:50Z    1
2015-12-07T10:44:07Z    1
2015-12-07T10:44:25Z    1
2015-12-07T10:44:40Z    1
2015-12-07T10:45:01Z    1
2015-12-08T13:30:21Z    1
2015-12-08T13:30:41Z    1
2015-12-08T13:31:04Z    1
2015-12-08T13:31:21Z    1
2015-12-08T13:31:45Z    1
2015-12-08T13:32:14Z    1
2015-12-08T13:32:34Z    1
2015-12-08T13:33:09Z    1
2015-12-08T13:33:28Z    1
2015-12-08T13:33:50Z    1
2015-12-08T13:58:50Z    1
2015-12-08T13:59:33Z    1
2015-12-08T14:00:52Z    1
2015-12-08T14:01:28Z    1
2015-12-08T14:02:04Z    1
2015-12-08T14:02:31Z    1
2015-12-08T14:03:00Z    1
2015-12-08T14:03:23Z    1
                       ..
2016-06-08T17:23:59Z    1
2016-06-08T17:24:34Z    1
2016-06-08T17:37:08Z    1
2016-06-08T17:42:29Z    1
2016-06-08T17:45:18Z    1
2016-06-0
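
Printing every group becomes unwieldy for columns such as ScheduledDay, where nearly every timestamp is unique. A possible alternative, sketched here on a hypothetical toy frame (the `sample` values and the threshold of 2 are made up for illustration), is to print full counts only for low-cardinality columns and just the number of distinct values otherwise.

```python
import pandas as pd

# Illustrative toy data, not taken from the real file
sample = pd.DataFrame({
    "Gender": ["F", "M", "F", "F"],
    "ScheduledDay": [
        "2016-04-29T18:38:08Z",
        "2016-04-29T16:08:27Z",
        "2016-04-29T16:19:04Z",
        "2016-04-29T17:29:31Z",
    ],
    "No-show": ["No", "No", "Yes", "No"],
})

for name in ["Gender", "ScheduledDay", "No-show"]:
    n_unique = sample[name].nunique()
    if n_unique <= 2:  # threshold chosen only for this illustration
        # Few distinct values: show how often each one occurs
        print(sample[name].value_counts())
    else:
        # Many distinct values: the count alone is more readable
        print(f"{name}: {n_unique} distinct values")
```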

## Restructurization for Analysis
Thanks to the previous analysis, we now know all the data types and what needs to be done to transform them into useful data, so let's do it.

#### Gender

In [7]:
# Encode Gender as a discrete number: F -> 0, M -> 1
df['Gender']=df['Gender'].replace({'F':0,'M':1})
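
The same kind of encoding applies to No-show, the other column listed earlier as convertible to discrete values. The notebook does not show this step, so the following is only a sketch: the choice of No -> 0 and Yes -> 1 and the `NoShowFlag` column name are assumptions, and `sample` is a toy frame.

```python
import pandas as pd

# Toy frame with the two values the No-show column can take
sample = pd.DataFrame({"No-show": ["No", "Yes", "No"]})

# Assumed convention: 1 means the patient missed the appointment
sample["NoShowFlag"] = sample["No-show"].map({"No": 0, "Yes": 1})

print(sample)
```

Writing to a new column keeps the original strings available, so the cell can be re-run without mapping already-converted values.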

#### Dates
This part is the most complex. In my opinion, several features can be extracted from the dates in their current format:<br />
-Year<br />
-Month<br />
-Day<br />
-Weekday from 0 to 6, where 0 is Monday<br />
-Time will not be considered, since it seems to be missing (AppointmentDay shows 00:00:00 in every row printed by `df.head()`)
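
The features listed above could also be obtained through pandas' datetime parsing rather than by slicing the strings. This is a sketch under the assumption that every value follows the YYYY-MM-DDTHH:MM:SSZ format described earlier; `sample` is a toy frame, and the column names simply mirror the variable names used in this notebook.

```python
import pandas as pd

# Two appointment dates in the same format as the dataset
sample = pd.DataFrame({
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-06-08T00:00:00Z"],
})

# Parse the ISO 8601 strings once, then read each part via the .dt accessor
parsed = pd.to_datetime(sample["AppointmentDay"])
sample["Ayear"] = parsed.dt.year
sample["Amonth"] = parsed.dt.month
sample["Aday"] = parsed.dt.day
sample["Aweekday"] = parsed.dt.weekday  # 0 = Monday, 6 = Sunday

print(sample)
```

`Series.dt.weekday` uses the same convention as `datetime.date.weekday()`, so the results agree with the manual approach.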

In [37]:
# Slice the fixed-width YYYY-MM-DD prefix. Each result is a new Series,
# so the original date columns are never overwritten.
Ayear=df['AppointmentDay'].str[0:4].astype(int)
Amonth=df['AppointmentDay'].str[5:7].astype(int)
Aday=df['AppointmentDay'].str[8:10].astype(int)
Syear=df['ScheduledDay'].str[0:4].astype(int)
Smonth=df['ScheduledDay'].str[5:7].astype(int)
Sday=df['ScheduledDay'].str[8:10].astype(int)

In [38]:
# Weekday (0 = Monday) for every row, including row 0
Aweekday=pd.Series([datetime.date(y,m,d).weekday() for y,m,d in zip(Ayear,Amonth,Aday)],index=df.index)
Sweekday=pd.Series([datetime.date(y,m,d).weekday() for y,m,d in zip(Syear,Smonth,Sday)],index=df.index)

In [25]:
int?

In [15]:
today=datetime.date(year=2017,month=12,day=1)
print(today.weekday())

4
