This notebook contains a preliminary analysis of the data. We look at the attributes, their summaries, the completeness of the information, and a few basic plots.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime

In [2]:
data = pd.read_csv('all_data.csv', dtype={'PatientId': object, 'AppointmentID': object})

In [3]:
data.shape

(110527, 14)

In [4]:
data.dtypes

PatientId         object
AppointmentID     object
Gender            object
ScheduledDay      object
AppointmentDay    object
Age                int64
Neighbourhood     object
Scholarship        int64
Hipertension       int64
Diabetes           int64
Alcoholism         int64
Handcap            int64
SMS_received       int64
No-show           object
dtype: object

This looks like cleaned data.
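One way to back up this impression is to count missing values and duplicated rows explicitly. The snippet below is a minimal sketch on a small hypothetical frame called `sample` (its values are illustrative, not taken from `all_data.csv`); the same two calls could be run on `data`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `data`; values are illustrative only.
sample = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [62, 56, 8],
    "No-show": ["No", "No", "Yes"],
})

# Count missing values per column; all zeros means every field is populated.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count fully duplicated rows as a second sanity check.
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)
```

A column with a non-zero count in `missing_per_column` would need a closer look before modelling.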

## Analysis of gender

In [5]:
data['Gender'].unique()

array(['F', 'M'], dtype=object)

## Analysis of date-time fields

In [7]:
data['ScheduledDay'].unique()

array(['2016-04-29T18:38:08Z', '2016-04-29T16:08:27Z',
       '2016-04-29T16:19:04Z', ..., '2016-04-27T16:03:52Z',
       '2016-04-27T15:09:23Z', '2016-04-27T13:30:56Z'], dtype=object)

In [14]:
data['ScheduledDate'] = [datetime.date.fromisoformat(s[0:10]) for s in data['ScheduledDay']]
data['ScheduledTime'] = [datetime.time.fromisoformat(s[11:19]) for s in data['ScheduledDay']]

In [20]:
data['ScheduledDate'].min()

datetime.date(2015, 11, 10)

In [21]:
data['ScheduledDate'].max()

datetime.date(2016, 6, 8)

In [22]:
data['ScheduledTime'].min()

datetime.time(6, 9, 36)

In [23]:
data['ScheduledTime'].max()

datetime.time(21, 27, 35)

In [24]:
data['AppointmentDay'].unique()

array(['2016-04-29T00:00:00Z', '2016-05-03T00:00:00Z',
       '2016-05-10T00:00:00Z', '2016-05-17T00:00:00Z',
       '2016-05-24T00:00:00Z', '2016-05-31T00:00:00Z',
       '2016-05-02T00:00:00Z', '2016-05-30T00:00:00Z',
       '2016-05-16T00:00:00Z', '2016-05-04T00:00:00Z',
       '2016-05-19T00:00:00Z', '2016-05-12T00:00:00Z',
       '2016-05-06T00:00:00Z', '2016-05-20T00:00:00Z',
       '2016-05-05T00:00:00Z', '2016-05-13T00:00:00Z',
       '2016-05-09T00:00:00Z', '2016-05-25T00:00:00Z',
       '2016-05-11T00:00:00Z', '2016-05-18T00:00:00Z',
       '2016-05-14T00:00:00Z', '2016-06-02T00:00:00Z',
       '2016-06-03T00:00:00Z', '2016-06-06T00:00:00Z',
       '2016-06-07T00:00:00Z', '2016-06-01T00:00:00Z',
       '2016-06-08T00:00:00Z'], dtype=object)

Recall that <code>AppointmentDay</code> is the date of the appointment. Its time component is always <code>00:00:00</code>, so only the date part carries information.

In [25]:
data['AppointmentDate'] = [datetime.date.fromisoformat(s[0:10]) for s in data['AppointmentDay']]

In [26]:
data['AppointmentDate'].min()

datetime.date(2016, 4, 29)

In [27]:
data['AppointmentDate'].max()

datetime.date(2016, 6, 8)

Scheduled dates go back as far as November 2015, but appointment dates start only on 29 April 2016. **Investigate this gap if these fields turn out to be important later on, which they may well be: appointments booked far in advance may be forgotten.**
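The gap between booking and appointment can be measured directly as a lead time in days. The sketch below assumes a hypothetical column name `LeadDays` and uses three timestamp pairs modelled on rows shown in this notebook; it parses the strings with `pd.to_datetime` rather than slicing them, which is an alternative to the approach used in the cells above, not a replacement the notebook itself makes.

```python
import pandas as pd

# Small illustrative frame; the timestamp pairs mirror rows shown in this notebook.
sample = pd.DataFrame({
    "ScheduledDay": [
        "2016-04-29T18:38:08Z",
        "2016-04-27T10:46:12Z",
        "2016-04-08T14:29:17Z",
    ],
    "AppointmentDay": [
        "2016-04-29T00:00:00Z",
        "2016-04-29T00:00:00Z",
        "2016-05-16T00:00:00Z",
    ],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
scheduled = pd.to_datetime(sample["ScheduledDay"], utc=True)
appointment = pd.to_datetime(sample["AppointmentDay"], utc=True)

# Compare calendar dates only, since AppointmentDay carries no time of day.
# `LeadDays` is a hypothetical column name introduced for this sketch.
sample["LeadDays"] = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days
print(sample["LeadDays"].tolist())
```

A negative `LeadDays` value would indicate an appointment dated before it was scheduled, which is worth checking for on the full data.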

## Analysis of age

In [28]:
data['Age'].unique()

array([ 62,  56,   8,  76,  23,  39,  21,  19,  30,  29,  22,  28,  54,
        15,  50,  40,  46,   4,  13,  65,  45,  51,  32,  12,  61,  38,
        79,  18,  63,  64,  85,  59,  55,  71,  49,  78,  31,  58,  27,
         6,   2,  11,   7,   0,   3,   1,  69,  68,  60,  67,  36,  10,
        35,  20,  26,  34,  33,  16,  42,   5,  47,  17,  41,  44,  37,
        24,  66,  77,  81,  70,  53,  75,  73,  52,  74,  43,  89,  57,
        14,   9,  48,  83,  72,  25,  80,  87,  88,  84,  82,  90,  94,
        86,  91,  98,  92,  96,  93,  95,  97, 102, 115, 100,  99,  -1])

Some age records are $0$ or less (one value is $-1$). Others are $100$ or more, up to $115$.

In [6]:
data.loc[(data['Age'] <= 0)]

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
59,71844284745331,5638545,F,2016-04-29T08:08:43Z,2016-04-29T00:00:00Z,0,CONQUISTA,0,0,0,0,0,0,No
63,236623344873175,5628286,M,2016-04-27T10:46:12Z,2016-04-29T00:00:00Z,0,SÃO BENEDITO,0,0,0,0,0,0,No
64,188517384712787,5616082,M,2016-04-25T13:28:21Z,2016-04-29T00:00:00Z,0,ILHA DAS CAIEIRAS,0,0,0,0,0,1,No
65,271881817799985,5628321,M,2016-04-27T10:48:50Z,2016-04-29T00:00:00Z,0,CONQUISTA,0,0,0,0,0,0,No
67,86471282513499,5639264,F,2016-04-29T08:53:02Z,2016-04-29T00:00:00Z,0,NOVA PALESTINA,0,0,0,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110345,147395196662956,5702537,F,2016-05-16T12:30:58Z,2016-06-01T00:00:00Z,0,RESISTÊNCIA,0,0,0,0,0,0,No
110346,5577525313231,5777724,M,2016-06-06T14:22:34Z,2016-06-08T00:00:00Z,0,RESISTÊNCIA,0,0,0,0,0,0,No
110454,614245995575,5772400,F,2016-06-03T15:18:44Z,2016-06-03T00:00:00Z,0,RESISTÊNCIA,0,0,0,0,0,0,No
110460,43218463343323,5769545,F,2016-06-03T08:56:51Z,2016-06-03T00:00:00Z,0,RESISTÊNCIA,0,0,0,0,0,0,No


In [34]:
data.loc[data['Age'] >= 100]

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,ScheduledDate,ScheduledTime,AppointmentDate
58014,976294800000000.0,5651757,F,2016-05-03T09:14:53Z,2016-05-03T00:00:00Z,102,CONQUISTA,0,0,0,0,0,0,No,2016-05-03,09:14:53,2016-05-03
63912,31963210000000.0,5700278,F,2016-05-16T09:17:44Z,2016-05-19T00:00:00Z,115,ANDORINHAS,0,0,0,0,1,0,Yes,2016-05-16,09:17:44,2016-05-19
63915,31963210000000.0,5700279,F,2016-05-16T09:17:44Z,2016-05-19T00:00:00Z,115,ANDORINHAS,0,0,0,0,1,0,Yes,2016-05-16,09:17:44,2016-05-19
68127,31963210000000.0,5562812,F,2016-04-08T14:29:17Z,2016-05-16T00:00:00Z,115,ANDORINHAS,0,0,0,0,1,0,Yes,2016-04-08,14:29:17,2016-05-16
76284,31963210000000.0,5744037,F,2016-05-30T09:44:51Z,2016-05-30T00:00:00Z,115,ANDORINHAS,0,0,0,0,1,0,No,2016-05-30,09:44:51,2016-05-30
79270,9739430000000.0,5747809,M,2016-05-30T16:21:56Z,2016-05-31T00:00:00Z,100,TABUAZEIRO,0,0,0,0,1,0,No,2016-05-30,16:21:56,2016-05-31
79272,9739430000000.0,5747808,M,2016-05-30T16:21:56Z,2016-05-31T00:00:00Z,100,TABUAZEIRO,0,0,0,0,1,0,No,2016-05-30,16:21:56,2016-05-31
90372,234283600000.0,5751563,F,2016-05-31T10:19:49Z,2016-06-02T00:00:00Z,102,MARIA ORTIZ,0,0,0,0,0,0,No,2016-05-31,10:19:49,2016-06-02
92084,55783130000000.0,5670914,F,2016-05-06T14:55:36Z,2016-06-03T00:00:00Z,100,ANTÔNIO HONÓRIO,0,0,0,0,0,1,No,2016-05-06,14:55:36,2016-06-03
97666,748234600000000.0,5717451,F,2016-05-19T07:57:56Z,2016-06-03T00:00:00Z,115,SÃO JOSÉ,0,1,0,0,0,1,No,2016-05-19,07:57:56,2016-06-03
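Rather than dropping these rows straight away, the unusual ages can be flagged for later review. The sketch below uses assumed thresholds and hypothetical column names (`AgeInvalid`, `AgeReview`): a negative age is treated as invalid, an age of $100$ or more is only marked for review, and an age of $0$ is left alone on the assumption that it denotes an infant.

```python
import pandas as pd

# Illustrative ages, including the edge values seen in the unique() output.
sample = pd.DataFrame({"Age": [62, 0, -1, 115, 100, 8]})

# Assumed rule: negative ages are impossible and therefore invalid.
sample["AgeInvalid"] = sample["Age"] < 0

# Assumed rule: ages of 100 or more are plausible but rare, so only flag them.
sample["AgeReview"] = sample["Age"] >= 100

print(sample)
```

Keeping flags instead of filtering leaves the decision about these rows open until their effect on the analysis is known.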


## Analysis of neighbourhood

In [35]:
data['Neighbourhood'].unique()

array(['JARDIM DA PENHA', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
       'REPÚBLICA', 'GOIABEIRAS', 'ANDORINHAS', 'CONQUISTA',
       'NOVA PALESTINA', 'DA PENHA', 'TABUAZEIRO', 'BENTO FERREIRA',
       'SÃO PEDRO', 'SANTA MARTHA', 'SÃO CRISTÓVÃO', 'MARUÍPE',
       'GRANDE VITÓRIA', 'SÃO BENEDITO', 'ILHA DAS CAIEIRAS',
       'SANTO ANDRÉ', 'SOLON BORGES', 'BONFIM', 'JARDIM CAMBURI',
       'MARIA ORTIZ', 'JABOUR', 'ANTÔNIO HONÓRIO', 'RESISTÊNCIA',
       'ILHA DE SANTA MARIA', 'JUCUTUQUARA', 'MONTE BELO',
       'MÁRIO CYPRESTE', 'SANTO ANTÔNIO', 'BELA VISTA', 'PRAIA DO SUÁ',
       'SANTA HELENA', 'ITARARÉ', 'INHANGUETÁ', 'UNIVERSITÁRIO',
       'SÃO JOSÉ', 'REDENÇÃO', 'SANTA CLARA', 'CENTRO', 'PARQUE MOSCOSO',
       'DO MOSCOSO', 'SANTOS DUMONT', 'CARATOÍRA', 'ARIOVALDO FAVALESSA',
       'ILHA DO FRADE', 'GURIGICA', 'JOANA D´ARC', 'CONSOLAÇÃO',
       'PRAIA DO CANTO', 'BOA VISTA', 'MORADA DE CAMBURI', 'SANTA LUÍZA',
       'SANTA LÚCIA', 'BARRO VERMELHO', 'ESTRELINHA', 'FORTE SÃO 

## Analysis of other demographic attributes

In [37]:
data['Scholarship'].unique()

array([0, 1])

In [38]:
data['Hipertension'].unique()

array([1, 0])

In [39]:
data['Diabetes'].unique()

array([0, 1])

In [40]:
data['Alcoholism'].unique()

array([0, 1])

In [41]:
data['Handcap'].unique()

array([0, 1, 2, 3, 4])

In [42]:
data['SMS_received'].unique()

array([0, 1])

## Analysis of response

In [43]:
data['No-show'].unique()

array(['No', 'Yes'], dtype=object)

In [6]:
data[['AppointmentID', 'No-show']].groupby('No-show').count()

No-show,AppointmentID
No,88208
Yes,22319


The two classes are not evenly represented: about $20\%$ of the appointments ($22319$ of $110527$) are no-shows. Still, the imbalance is not so severe that it needs special treatment from the start.
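The class shares follow directly from the two counts in the table. A minimal sketch of the arithmetic, with the counts typed in by hand:

```python
import pandas as pd

# Class counts copied from the groupby output in this section.
counts = pd.Series({"No": 88208, "Yes": 22319}, name="appointments")

# Share of each class among all appointments.
shares = counts / counts.sum()
print(shares.round(3))
```

On the full frame, `data['No-show'].value_counts(normalize=True)` should give the same shares in one call.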

## Conclusion
1. The data looks clean.
2. Separate date and time from the <code>ScheduledDay</code> and <code>AppointmentDay</code> fields.
3. Rename <code>No-show</code> as <code>No-show-orig</code> and create a new field <code>No-show</code> with values $0$ and $1$.
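The recoding in step 3 could be sketched as follows. The mapping of `'No'` to $0$ and `'Yes'` to $1$ is an assumption (it makes $1$ mean "the patient did not show up"), and `sample` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for the response column of `data`.
sample = pd.DataFrame({"No-show": ["No", "Yes", "No"]})

# Keep the original labels under a new name.
sample = sample.rename(columns={"No-show": "No-show-orig"})

# Assumed encoding: 'No' -> 0 (patient attended), 'Yes' -> 1 (patient missed).
sample["No-show"] = sample["No-show-orig"].map({"No": 0, "Yes": 1})
print(sample)
```

Any label other than `'No'` or `'Yes'` would become `NaN` under `map`, so an unexpected value would surface rather than be silently encoded.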

In [44]:
data.dtypes

PatientId          float64
AppointmentID        int64
Gender              object
ScheduledDay        object
AppointmentDay      object
Age                  int64
Neighbourhood       object
Scholarship          int64
Hipertension         int64
Diabetes             int64
Alcoholism           int64
Handcap              int64
SMS_received         int64
No-show             object
ScheduledDate       object
ScheduledTime       object
AppointmentDate     object
dtype: object