## Feature Engineering (FE)

This notebook prepares the data for modeling and evaluation. <br>
Since **EDA.ipynb** showed that summary.csv contains all the data from the 50 daily CSV files, we will work with summary.csv only.

In [12]:
import pandas as pd
import numpy as np

In [13]:
df = pd.read_csv('../data/summary.csv', index_col='Unnamed: 0')

Our categorical variables have only a few possible classes. <br>
The classes are also well distributed. <br>
For that reason we will use one-hot encoding (OHE). <br>
Note that this is by no means the only or the best choice. <br>
The best choice can only be found empirically, through experimentation. <br>

Henk Van Veen (Nubank Data Scientist // Kaggle Grandmaster - aka triskelion // https://mlwave.com author) has a great presentation on the topic of feature engineering.<br> https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
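
To make the one-hot encoding concrete, here is a minimal sketch on a toy frame. The category values are the ones that appear in summary.csv, but the rows themselves and the names `toy` and `encoded` are made up for illustration.

```python
import pandas as pd

# Toy frame: category values mirror summary.csv, the rows are invented.
toy = pd.DataFrame({
    'pain': ['severe pain', 'no pain', 'moderate pain'],
    'priority': ['normal', 'urgent', 'normal'],
})

# One indicator column per category, named "<feature>_<value>".
encoded = pd.get_dummies(toy, columns=['pain', 'priority'], dtype=int)
print(encoded.columns.tolist())
```

Each row ends up with exactly one 1 per original feature, which is why OHE stays cheap when a feature has only a handful of classes.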

In [14]:
df[:3]

Unnamed: 0,arrival_time,assessment_end_time,assessment_start_time,consultation_end_time,consultation_start_time,day,duration,pain,patient,priority,temperature
0,430,1055,878,1595,1084,1,510,severe pain,3,normal,36.8
1,280,741,308,1620,773,1,847,no pain,1,urgent,36.6
2,288,851,764,1881,905,1,976,severe pain,2,normal,36.7


In [4]:
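features = ['pain', 'priority']

for feature in features:
    # Replace each categorical column with its one-hot indicator columns (0/1 integers).
    dummies = pd.get_dummies(df[feature], prefix=feature, dtype=int)
    df = pd.concat([df.drop(columns=feature), dummies], axis=1)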

The consultation time might vary with the number of previously seen patients (doctors are people; they get tired). <br>

So let's engineer that feature too.

In [5]:
# Position of each patient within their day, starting at 1.
df['patient_number_in_day'] = df.groupby('day')['patient'].transform(lambda x: x - x.min() + 1)

Let's not drop "patient", even though we won't use it in our model, because it will serve as an ID for result validation. <br>

Another good feature would be how many people are waiting ahead of each patient. <br>

In [6]:
def patients_in_line(row, priority=None):
    """Count same-day patients already assessed but not yet in consultation
    at the moment this row's assessment ends.

    Reads the module-level `df`. `priority` may be None (count everyone),
    'normal' or 'urgent' (count only that one-hot priority column).
    """
    assessment_end_time = row['assessment_end_time']

    waiting = (
        (df['day'] == row['day'])
        & (df['assessment_end_time'] < assessment_end_time)
        & (df['consultation_start_time'] > assessment_end_time)
    )

    if priority is not None:
        # Restrict the count to patients flagged with the requested priority.
        waiting &= df[f'priority_{priority}'] == 1

    return int(waiting.sum())

In [7]:
df['patients_in_line'] = df.apply(lambda row: patients_in_line(row), axis=1)
df['normal_patients_in_line'] = df.apply(lambda row: patients_in_line(row, priority='normal'), axis=1)
df['urgent_patients_in_line'] = df.apply(lambda row: patients_in_line(row, priority='urgent'), axis=1)
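
The row-wise `apply` above rescans the whole frame for every patient. If that ever gets slow, the same count can be computed per day with NumPy broadcasting. This is only a sketch of an alternative, not what this notebook ran; `count_waiting` and the `toy` frame are hypothetical names and data.

```python
import numpy as np
import pandas as pd

def count_waiting(day_df):
    """For each patient of one day, count patients assessed earlier who are still waiting."""
    end = day_df['assessment_end_time'].to_numpy()
    start = day_df['consultation_start_time'].to_numpy()
    # waiting[i, j] is True when patient j finished triage before patient i did
    # and starts consultation only after patient i's triage ends.
    waiting = (end[None, :] < end[:, None]) & (start[None, :] > end[:, None])
    return pd.Series(waiting.sum(axis=1), index=day_df.index)

# Invented rows: three patients on day 1, one on day 2.
toy = pd.DataFrame({
    'day': [1, 1, 1, 2],
    'assessment_end_time': [100, 200, 300, 100],
    'consultation_start_time': [150, 400, 450, 120],
})

# Process each day separately, then align the results back on the index.
toy['patients_in_line'] = pd.concat([count_waiting(g) for _, g in toy.groupby('day')])
print(toy['patients_in_line'].tolist())
```

Only the third patient of day 1 has someone ahead: the second patient was assessed at 200 but does not see a doctor until 400.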

In [8]:
df[:5]

Unnamed: 0,arrival_time,assessment_end_time,assessment_start_time,consultation_end_time,consultation_start_time,day,duration,patient,temperature,pain_moderate pain,pain_no pain,pain_severe pain,priority_normal,priority_urgent,patient_number_in_day,patients_in_line,normal_patients_in_line,urgent_patients_in_line
0,430,1055,878,1595,1084,1,510,3,36.8,0,0,1,1,0,3,0,0,0
1,280,741,308,1620,773,1,847,1,36.6,0,1,0,0,1,1,0,0,0
2,288,851,764,1881,905,1,976,2,36.7,0,0,1,1,0,2,0,0,0
3,944,1244,1089,2105,1294,1,810,4,36.6,0,0,1,0,1,4,0,0,0
4,2099,2248,2139,2717,2289,1,427,8,36.7,0,1,0,1,0,8,0,0,0


Now let's take a step back. <br>

The problem description (https://github.com/holding-digital/coding_test-ds) clearly states that:
    
<blockquote>"On this task you will have to create a model to **predict the time a patient will leave a clinic**, after having a consultation with a doctor, in an appointment-free service.<br>The clinic receives patients from 14:00 (time 0) to 18:00. Patients that arrived before 18:00 wait inside the clinic for their consultation.<br>There is a triage room with capacity for one patient at a time. As soon as it is available, arriving patients will pass a quick assessment to flag the patients that should have priority in seeing a doctor. At the end of this assessment, patients should be given an estimate of the time they will be free, i.e., the expected time for finishing their consultation. Your job is to create a system to give that estimate."</blockquote>

So the target variable, the duration, should run from the end of the assessment to the end of the consultation. <br>
**HOWEVER**, the duration given in the dataset is "consultation end" - "consultation start". <br>

Hence, we will use a different target: the total period as stated in the problem description. <br>
This might look like poor model performance if the target is taken to be the duration of the consultation alone; however, it is what the exercise actually asks for.

In [9]:
df['actual_duration'] = df['consultation_end_time'] - df['assessment_end_time']
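
As a quick sanity check of the two definitions, here is the arithmetic on the three rows displayed earlier in this notebook. The frame `sample` and the two series names are just for illustration.

```python
import pandas as pd

# First three rows of summary.csv, copied from the output shown above.
sample = pd.DataFrame({
    'assessment_end_time': [1055, 741, 851],
    'consultation_start_time': [1084, 773, 905],
    'consultation_end_time': [1595, 1620, 1881],
    'duration': [510, 847, 976],
})

# What the dataset's "duration" appears to measure: the consultation only.
consultation_only = sample['consultation_end_time'] - sample['consultation_start_time']

# What the task asks for: waiting time after triage plus the consultation.
wait_plus_consultation = sample['consultation_end_time'] - sample['assessment_end_time']

print(consultation_only.tolist())       # [511, 847, 976]
print(wait_plus_consultation.tolist())  # [540, 879, 1030]
```

The consultation-only values match the provided `duration` column, with the first row off by one unit, while the second series is what ends up in `actual_duration`.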

Some models do better with normalized features. <br>
We will be using LightGBM, a tree-based model whose splits depend only on the ordering of feature values, so we won't bother normalizing the numerical features. <br>
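
To illustrate why scaling does not matter for tree splits, here is a toy exhaustive split search on one feature. It is a simplified stand-in, not LightGBM's actual histogram-based algorithm; `best_split_partition` and the synthetic data are assumptions made for this sketch.

```python
import numpy as np

def best_split_partition(x, y):
    """Return the boolean mask (left side) of the threshold split with the lowest squared error."""
    best_sse, best_mask = np.inf, None
    # Skip the largest value so the right side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        sse = ((y[left] - y[left].mean()) ** 2).sum() + ((y[~left] - y[~left].mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_mask = sse, left
    return best_mask

rng = np.random.default_rng(0)
x = rng.uniform(0, 2000, size=200)                        # a time-like feature
y = np.where(x > 900, 800.0, 500.0) + rng.normal(0, 20, size=200)

x_scaled = (x - x.mean()) / x.std()                       # standardized copy of the same feature

mask_raw = best_split_partition(x, y)
mask_scaled = best_split_partition(x_scaled, y)
same_partition = np.array_equal(mask_raw, mask_scaled)
print(same_partition)
```

Standardizing is a monotonic transformation, so the rows keep their order and the split sends the same patients left and right; only the threshold value changes.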

In [10]:
df[:3]

Unnamed: 0,arrival_time,assessment_end_time,assessment_start_time,consultation_end_time,consultation_start_time,day,duration,patient,temperature,pain_moderate pain,pain_no pain,pain_severe pain,priority_normal,priority_urgent,patient_number_in_day,patients_in_line,normal_patients_in_line,urgent_patients_in_line,actual_duration
0,430,1055,878,1595,1084,1,510,3,36.8,0,0,1,1,0,3,0,0,0,540
1,280,741,308,1620,773,1,847,1,36.6,0,1,0,0,1,1,0,0,0,879
2,288,851,764,1881,905,1,976,2,36.7,0,0,1,1,0,2,0,0,0,1030


In [11]:
df.to_csv('../data/FE_data.csv', index=False)