<a href="https://colab.research.google.com/github/annabavaresco/pstrentino/blob/main/Emergency_room_Trentino_Linear_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Model

To predict waiting times in the emergency rooms of Trentino hospitals, we implemented three distinct linear models. 
The first one (referred to as model_wgb) models waiting times for the white, green and blue triage levels and is trained on data collected from every emergency room except Trento-ginecologico and Trento-oculistico. Since those two emergency rooms usually receive fewer patients than the others, we found that developing a separate model for them leads to more accurate predictions.
The second model (model_totg) is therefore trained only on data collected from the Trento-ginecologico and Trento-oculistico emergency rooms. Just like model_wgb, it does not take into consideration waiting times for the orange and red triage levels.
As for patients with orange triage, since they are fewer, we preferred to develop a linear model with fewer predictors, trained on the data collected from all of the emergency rooms.
Finally, we made several attempts to model the waiting times for the red triage, but none of them proved satisfactory: they led either to inaccurate predictions or even to negative predicted waiting times. Consequently, the best prediction we can provide given the data we collected is simply the median of the waiting times for the red triage.
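
The division of labour described above can be summarised as a small routing function. This is only a sketch: the function name `select_model`, the label `median_red` and the set `SMALL_ER_CODES` are hypothetical, and the integer codes 10 and 11 are assumed to identify the two small Trento emergency rooms.

```python
# Hypothetical routing helper: names and codes are illustrative assumptions,
# not part of the notebook's actual code.
SMALL_ER_CODES = {10, 11}  # assumed codes of the two small Trento emergency rooms


def select_model(triage: str, hosp_code: int) -> str:
    """Return the name of the predictor to use for a given patient."""
    if triage == 'red':
        return 'median_red'    # no regression: fall back to the median waiting time
    if triage == 'orange':
        return 'model_orange'  # reduced set of predictors, all emergency rooms
    if hosp_code in SMALL_ER_CODES:
        return 'model_totg'    # white/green/blue in the two small emergency rooms
    return 'model_wgb'         # white/green/blue everywhere else


print(select_model('green', 3))
print(select_model('white', 10))
print(select_model('orange', 10))
print(select_model('red', 2))
```

Checking the triage color before the emergency room reflects the description above, where the orange and red cases are handled the same way regardless of hospital.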







In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None
import datetime as dt
import numpy as np
import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from datetime import datetime

In [None]:
pip install mysql-connector-python



Before building the models, the data needs to be retrieved from an Amazon-hosted MySQL database. The following function connects to the database, selects all the data collected between the 5th of May and the 5th of August, and converts it into a pandas dataframe. 

In [None]:
import mysql.connector

In [None]:
def get_data():
  '''
    Creates a connection with the db hosting our data and converts it into a Pandas dataframe.
  '''
  connection = mysql.connector.connect(
      host = 'emergencyroom.ci8zphg60wmc.us-east-2.rds.amazonaws.com',
      port =  3306,
      user = 'admin',
      database = 'er_trentino',
      password = 'emr00mtr3nt036'
    )

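  connection.autocommit = True

  try:
    # Load the whole patients table into a dataframe
    data = pd.read_sql('SELECT * FROM er_trentino.patients', con=connection)
  finally:
    # Always release the connection, even if the query fails
    connection.close()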

  return data

df = get_data()
df = df.dropna()
df.head()


Unnamed: 0,triage,hospital,start,end,waiting_time,others,more_severe,less_severe
0,white,001-PS-PSC,2021-05-05 07:30:00,2021-05-05 08:40:00,01:10:00,0,0,0
1,green,001-PS-PSC,2021-05-05 08:40:00,2021-05-05 08:50:00,00:10:00,0,0,0
2,green,001-PS-PSC,2021-05-05 09:00:00,2021-05-05 09:10:00,00:10:00,0,0,0
3,white,001-PS-PSC,2021-05-05 09:10:00,2021-05-05 09:40:00,00:30:00,0,0,0
4,white,001-PS-PSC,2021-05-05 10:30:00,2021-05-05 11:30:00,01:00:00,0,0,0


Let's have a look at the data retrieved from the database. Here is the description of each column:


*   triage: the color associated with the triage
*   hospital: the code identifying the specific emergency room. There are 11 different emergency rooms in Trentino
*   start: timestamp referring to the moment when the patient entered the waiting room
*   end: timestamp referring to the moment when the patient left the waiting room
*   waiting_time: duration in hh:mm:ss format showing how long the patient waited
*   others: number of patients with the same triage color present at the moment of arrival
*   more_severe: patients with higher level of priority present at the moment of arrival
*   less_severe: patients with lower level of priority present at the moment of arrival
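
As a quick sanity check on these definitions, `waiting_time` should equal the difference between `end` and `start`, which holds for the rows displayed above. Here is a minimal sketch on two of those rows; the column name `waiting_minutes` is introduced only for illustration.

```python
import pandas as pd

# Two rows copied from the output above (start and end timestamps only)
sample = pd.DataFrame({
    'start': pd.to_datetime(['2021-05-05 07:30:00', '2021-05-05 08:40:00']),
    'end': pd.to_datetime(['2021-05-05 08:40:00', '2021-05-05 08:50:00']),
})

# Waiting time in whole minutes, computed as end - start
sample['waiting_minutes'] = (
    (sample['end'] - sample['start']).dt.total_seconds() // 60
).astype(int)

print(sample['waiting_minutes'].tolist())  # [70, 10]
```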


The following functions are meant for data preparation. Here is the list of changes which will be applied on the dataset described above:

1. Splitting into 6 different dataframes: five are based on triage levels (white, green, blue, orange and red) and one contains only the data collected from the Trento-ginecologico and Trento-oculistico emergency rooms. 
2. Converting values of the "hospital" column into integers and storing them inside a new column named "hosp_code"
3. Converting values of the "waiting_time" column into integers representing the number of minutes
4. Creating a new column "timeslot" containing an integer whose value depends on the time of the day when the patient arrives at the emergency room.  
5. Creating a new column "weekday" with an integer encoding the day of the week when the patient arrived at the emergency room. 
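6. Scaling the values of the columns "hosp_code" and "timeslot" so that their values are between 0 and 1. Note that the version of `process_df` shown below leaves both columns on their original integer scale. 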
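
Steps 4 and 6 can be sketched in a few lines. The helpers `timeslot_of` and `min_max_scale` are hypothetical and not part of the notebook's code; the sketch assumes that the day is divided into 10-minute slots numbered from 1 to 144 and that scaling means a plain min-max rescaling.

```python
import pandas as pd


def timeslot_of(ts: pd.Timestamp) -> int:
    """10-minute slot of the day, numbered from 1 (00:00) to 144 (23:50)."""
    return ts.hour * 6 + ts.minute // 10 + 1


def min_max_scale(col: pd.Series) -> pd.Series:
    """Rescale a numeric column linearly onto the interval [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())


starts = pd.Series(pd.to_datetime([
    '2021-05-05 00:00:00',
    '2021-05-05 07:30:00',
    '2021-05-05 23:50:00',
]))

slots = starts.apply(timeslot_of)
scaled = min_max_scale(slots)

print(slots.tolist())  # [1, 46, 144]
print(scaled.tolist())
```

Min-max scaling only makes sense when the column takes more than one distinct value; otherwise the denominator is zero.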

In [None]:
def convert_to_mins(last: str) -> int:
    '''
        Converts a string in the format hh:mm:ss into an integer representing the number of minutes.
        Seconds are ignored.
    '''
    hours, minutes = (int(s) for s in last.split(':')[:2])
    return hours * 60 + minutes

def process_df(data_frame):
    '''
      Takes as input the data retrieved from the database and outputs 6 different datasets with processed data.
    '''
    data = data_frame.copy()

    hosp_dict = {'001-PS-PSC': 10,
                '001-PS-PSG': 11,
                    '001-PS-PSO': 1,
                    '001-PS-PS': 2,
                    '001-PS-PSP': 3,
                    '006-PS-PS': 4,
                    '007-PS-PS': 5,
                    '010-PS-PS': 6,
                    '004-PS-PS': 7,
                    '014-PS-PS': 8,
                    '005-PS-PS': 9
                }
    
    data['hosp_code'] = data['hospital'].apply(lambda x: hosp_dict[x])
    data['weekday'] = data['start'].apply(lambda x: x.weekday())
    data['waiting_time'] = data['waiting_time'].apply(lambda x: convert_to_mins(str(x)))
    # Map every 10-minute mark of the day ('00:00:00', '00:10:00', ..., '23:50:00') to a slot number from 1 to 144
    timeslot_dict = {}
    ind = 1
    for n in range(24):
        for i in range(6):
            timeslot_dict[f'{n:02d}:{i}0:00'] = ind
            ind += 1

    # This lookup assumes that every arrival time falls exactly on a 10-minute mark
    data['timeslot'] = data['start'].apply(lambda x: timeslot_dict[x.strftime("%H:%M:%S")])
    data = data.loc[:,['triage', 'waiting_time', 'others', 'more_severe', 'hosp_code', 'weekday', 'timeslot']]
    # Emergency rooms with codes 10 and 11 get their own dataframe; all the others are split by triage
    ndf1 = data.loc[(data['hosp_code']==11) | (data['hosp_code']==10), :]
    data = data.loc[(data['hosp_code']!=11) & (data['hosp_code']!=10), :]
    ndf_white = data.loc[data['triage'] == "white",:]
    ndf_green = data.loc[data['triage'] == "green",:]
    ndf_blue = data.loc[data['triage'] == "blue",:]
    ndf_orange = data.loc[data['triage'] == "orange",:]
    ndf_red = data.loc[data['triage'] == "red",:]
    
    return ndf1, ndf_white, ndf_green, ndf_blue, ndf_orange, ndf_red

process_df(df)[0]

Unnamed: 0,triage,waiting_time,others,more_severe,hosp_code,weekday,timeslot
0,white,70,0,0,10,2,46
1,green,10,0,0,10,2,53
2,green,10,0,0,10,2,55
3,white,30,0,0,10,2,56
4,white,60,0,0,10,2,64
...,...,...,...,...,...,...,...
46787,blue,10,0,0,11,3,126
46788,green,10,0,1,11,3,132
46789,blue,10,0,0,11,3,132
46790,blue,10,0,0,11,3,136


These are the datasets which are going to be used to build the models.

In [None]:
dtf1, white, green, blue, orange, red = process_df(df)

## White, green and blue model

As mentioned before, this model is going to be used in order to predict white, blue and green triage waiting times for every emergency room with the exception of Trento-oculistico and Trento-ginecologico. 
The predictors we selected are:
1. the color of the triage (column "triage")
2. the code identifying the emergency room (column "hosp_code")
3. number of patients waiting at the same emergency room with the same triage color (column "others")
4. number of patients waiting at the same emergency room with a higher level of priority (column "more_severe")
5. the day of the week when the patient entered the emergency room (column "weekday")
6. the time of the day when the patient entered the emergency room (column "timeslot")

In [None]:
df_list = [white, green, blue]

for dataframe in df_list:
    # Cap waiting times at the 99th percentile to limit the influence of potential outliers
    upper_lim = dataframe.loc[:, 'waiting_time'].quantile(0.99)
    dataframe.loc[dataframe['waiting_time']>upper_lim, 'waiting_time'] = round(upper_lim)
    

dataset = pd.concat([white, green, blue]) 

# turning the triage variable into numeric factors
d= {'white': 1, 'green': 2, 'blue': 3}
dataset['triage'] = dataset['triage'].apply(lambda x: d[x])

dataset.head()

Unnamed: 0,triage,waiting_time,others,more_severe,hosp_code,weekday,timeslot
1594,1,10,0,0,1,1,120
1596,1,10,0,1,1,1,126
1598,1,10,0,4,1,1,132
1606,1,10,0,0,1,2,13
1614,1,130,0,0,1,2,46
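
The capping of extreme waiting times applied before concatenating the dataframes can also be expressed with `Series.clip`. The sketch below uses invented waiting times and a hypothetical helper `cap_at_quantile`; for integer minutes it should give the same result as the boolean-mask assignment used in the notebook.

```python
import pandas as pd


def cap_at_quantile(values: pd.Series, q: float = 0.99) -> pd.Series:
    """Replace every value above the q-th quantile with the rounded quantile."""
    upper = round(values.quantile(q))
    return values.clip(upper=upper)


# Invented waiting times in minutes, with one obvious outlier
waits = pd.Series([10, 20, 30, 40, 1000])

# A low quantile is used here only so that the effect is visible on five values
capped = cap_at_quantile(waits, q=0.75)
print(capped.tolist())  # [10, 20, 30, 40, 40]
```

Unlike the in-place assignment, `clip` returns a new Series, which avoids modifying a slice of another dataframe.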


In [None]:
X = dataset.loc[:,['triage','hosp_code', 'others', 'more_severe', 'weekday', 'timeslot']]
y = dataset['waiting_time']

trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.1, random_state=0)
regressor_wgb = LinearRegression().fit(trainX, trainy)

In [None]:
testy_pred = regressor_wgb.predict(testX)
myMae = mean_absolute_error(testy, testy_pred)
print(f'The mean absolute error I get with the linear model is {myMae} minutes.')

The mean absolute error I get with the linear model is 45.60324522642791 minutes.
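
A mean absolute error is easier to interpret next to a trivial baseline. The sketch below is not run on the emergency room data: it uses synthetic values, a single invented predictor, and scikit-learn's `DummyRegressor` as a median baseline. It also clips the linear predictions at zero, since a waiting time cannot be negative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one predictor and a noisy waiting time
X_demo = rng.integers(0, 10, size=(500, 1)).astype(float)
y_demo = np.clip(10 + 8 * X_demo[:, 0] + rng.normal(0, 15, size=500), 0, None)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
baseline = DummyRegressor(strategy='median').fit(X_tr, y_tr)

# Waiting times cannot be negative, so predictions are clipped at zero
linear_pred = np.clip(linear.predict(X_te), 0, None)

mae_linear = mean_absolute_error(y_te, linear_pred)
mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
print(f'linear: {mae_linear:.1f} min, median baseline: {mae_baseline:.1f} min')
```

If a model does not clearly beat the median baseline on held-out data, its predictors are adding little information.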


## Orange model

This model is used to predict waiting times for the orange triage. It uses fewer predictors than the previous model. The reason is that, for some emergency rooms, patients with orange triage are usually very few and therefore do not constitute a reliable basis for the model to learn how to predict waiting times. For example, we noticed that the Trento-oculistico emergency room received only 11 patients with orange triage in the timespan of two months. In this case, it clearly does not make sense to include "timeslot" (which ranges from 1 to 144) among the predictors. 
Hence, after several trials and considerations, we concluded that the only reliable predictors for the orange triage are "others" and "weekday".   
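
The sparsity argument can be made concrete by counting how many of the possible values of a predictor are actually observed. In the sketch below the 11 arrivals are invented and `level_coverage` is a hypothetical helper; only the counts (11 patients, 144 timeslots, 7 weekdays) come from the text above.

```python
import pandas as pd


def level_coverage(column: pd.Series, n_levels: int) -> float:
    """Fraction of the possible levels that appear at least once in `column`."""
    return column.nunique() / n_levels


# 11 hypothetical orange-triage arrivals (values invented for illustration)
timeslots = pd.Series([7, 46, 49, 53, 55, 96, 109, 118, 121, 136, 140])
weekdays = pd.Series([0, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6])

print(f'timeslot coverage: {level_coverage(timeslots, 144):.1%}')
print(f'weekday coverage:  {level_coverage(weekdays, 7):.1%}')
```

Even in the best case, 11 observations cover at most 11 of the 144 timeslots, whereas they can cover all 7 weekdays.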

In [None]:
dataset = orange
upper_lim = dataset.loc[:, 'waiting_time'].quantile(0.99)
dataset.loc[dataset['waiting_time']>upper_lim, 'waiting_time'] = round(upper_lim)
dataset.head()

Unnamed: 0,triage,waiting_time,others,more_severe,hosp_code,weekday,timeslot
1595,orange,0,0,0,1,1,121
1601,orange,10,0,0,1,1,136
1603,orange,10,0,0,1,1,140
1604,orange,0,0,0,1,2,7
1609,orange,10,0,0,1,2,49


In [None]:
X = dataset.loc[:,['others', 'weekday']]
y = dataset['waiting_time']


trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.1, random_state=0)

regressor_orange = LinearRegression().fit(trainX, trainy)


In [None]:
testy_pred = regressor_orange.predict(testX)
myMae = mean_absolute_error(testy, testy_pred)
print(f'The mean absolute error I get with the linear model is {myMae} minutes.')

The mean absolute error I get with the linear model is 7.4125978209743275 minutes.


## Trento-oculistico and Trento-ginecologico

This last model, as mentioned above, is used to predict waiting times for the white, green and blue triage levels for patients arriving at the Trento-oculistico and Trento-ginecologico emergency rooms.
The same considerations made for the orange model hold here: since the amount of data is not large enough to provide many examples for each level of "timeslot", we decided to exclude it from the set of predictors.

In [None]:
dataset = dtf1

dataset = dataset.loc[(dataset['triage']!= 'orange')&(dataset['triage']!= 'red'),:]
coldict = {'white':1, 'green':2, 'blue':3}
dataset['triage'] = dataset['triage'].apply(lambda x: coldict[x])
# Cap waiting times at the 90th percentile of each triage level to deal with potential outliers
for i in range(1,4):
    upper_lim = dataset.loc[dataset['triage']==i, 'waiting_time'].quantile(0.9)
    dataset.loc[(dataset['triage']==i)&(dataset['waiting_time']>upper_lim), 'waiting_time'] = upper_lim

dataset.head()

Unnamed: 0,triage,waiting_time,others,more_severe,hosp_code,weekday,timeslot
0,1,50.0,0,0,10,2,46
1,2,10.0,0,0,10,2,53
2,2,10.0,0,0,10,2,55
3,1,30.0,0,0,10,2,56
4,1,50.0,0,0,10,2,64


In [None]:
X = dataset.loc[:,['triage', 'others', 'more_severe', 'weekday']]
y = dataset['waiting_time']


trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.1, random_state=0)

regressor_totg = LinearRegression().fit(trainX, trainy)

In [None]:
testy_pred = regressor_totg.predict(testX)
myMae = mean_absolute_error(testy, testy_pred)
print(f'The mean absolute error I get with the linear model is {myMae} minutes.')

The mean absolute error I get with the linear model is 12.414883972351864 minutes.


## Red triage

In [None]:
dataset = red
upper_lim = dataset.loc[:, 'waiting_time'].quantile(0.99)
dataset.loc[dataset['waiting_time']>upper_lim, 'waiting_time'] = round(upper_lim)
dataset.head()

Unnamed: 0,triage,waiting_time,others,more_severe,hosp_code,weekday,timeslot
2291,red,10,0,0,1,0,135
4106,red,10,0,0,2,2,34
4160,red,0,0,0,2,2,96
4183,red,0,0,0,2,2,109
4197,red,10,0,0,2,2,118


In [None]:
dataset.waiting_time.median()

0.0
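
Using the median as the prediction for the red triage is a natural choice when the metric is the mean absolute error, because among all constant predictions the median is the one that minimises it. The sketch below illustrates this on a handful of illustrative waiting times (not the full red dataset); `constant_mae` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Illustrative red-triage waiting times in minutes
red_waits = pd.Series([10, 10, 0, 0, 10, 0, 0])


def constant_mae(values: pd.Series, constant: float) -> float:
    """Mean absolute error obtained when always predicting `constant`."""
    return mean_absolute_error(values, np.full(len(values), constant))


mae_median = constant_mae(red_waits, red_waits.median())
mae_mean = constant_mae(red_waits, red_waits.mean())

print(f'median prediction: {red_waits.median()} min, MAE {mae_median:.2f}')
print(f'mean prediction:   {red_waits.mean():.2f} min, MAE {mae_mean:.2f}')
```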