# Task 2

### Description

We used pandas DataFrames to import and handle the data. We preprocessed the data by taking the maximum of each measurement per patient (with Age averaged per patient) and using the pid as the index. We then used make_pipeline from scikit-learn to impute the missing values before applying a Random Forest classifier for the classification sub-task and a Random Forest regressor for the regression sub-task. To get a separate prediction for each test, we fit the pipeline once per label column within each sub-task.

### Main

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer

In [2]:
# Show up to 100 columns when displaying a DataFrame
pd.set_option('display.max_columns', 100)

In [3]:
# Specify the location of the project folder
folder = '/Users/gian-andreagottini/Documents/Coding/Introduction to Machine Learning/Task 2/'

In [4]:
# Import data
train_data = pd.read_csv(folder + 'train_features.csv')
test_data = pd.read_csv(folder + 'test_features.csv')
train_labels = pd.read_csv(folder + 'train_labels.csv').set_index('pid').sort_index()

In [5]:
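# Collapse the per-time-step rows into one row per patient (indexed by pid)
def preprocess_data(data):
    # Per-patient maximum of every measurement; Time is dropped
    data_df = data.drop(['Age', 'Time'], axis=1).groupby(by=['pid']).max()
    # Age is added back as the per-patient mean
    data_df['Age'] = data.groupby(by=['pid'])['Age'].mean()
    return data_df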
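
To make the aggregation concrete, here is a minimal sketch on made-up data. The frame `toy` and its values are hypothetical and only mirror the column names used above. It shows that the per-patient maximum skips missing values and stays missing only when a patient has no measurement at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two patients with two rows each
toy = pd.DataFrame({
    'pid': [2, 2, 1, 1],
    'Time': [1, 2, 1, 2],
    'Age': [40.0, 40.0, 65.0, 65.0],
    'Heartrate': [80.0, np.nan, 70.0, 95.0],
    'Lactate': [np.nan, np.nan, 1.2, 0.9],
})

# Per-patient maximum of each measurement; NaN is ignored unless all values are NaN
toy_features = toy.drop(['Age', 'Time'], axis=1).groupby('pid').max()
# Age is constant per patient here, so the mean simply recovers it
toy_features['Age'] = toy.groupby('pid')['Age'].mean()

print(toy_features)
```

The result is sorted by pid, which is why the labels are also sorted by pid before training.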

In [6]:
# Select the label columns that belong to one sub-task
def preprocess_labels(labels, cats):
    labels_df = labels[cats]
    return labels_df

### Sub-Task 1

In [7]:
# Select relevant columns for task 1
cats_task1 = ['LABEL_BaseExcess', 'LABEL_Fibrinogen', 'LABEL_AST', 'LABEL_Alkalinephos', 'LABEL_Bilirubin_total', 'LABEL_Lactate', 'LABEL_TroponinI', 'LABEL_SaO2', 'LABEL_Bilirubin_direct', 'LABEL_EtCO2']

In [8]:
train_df_task1 = preprocess_data(train_data)
test_df_task1 = preprocess_data(test_data)
labels_df_task1 = preprocess_labels(train_labels, cats_task1)

In [9]:
predictions = pd.DataFrame(index = test_df_task1.index, columns = train_labels.columns)

In [10]:
# clf = classifier pipeline
# make_pipeline = chains the steps so they run in sequence
# SimpleImputer() = fills missing values with the mean of each column (default strategy)
# RandomForestClassifier() = ensemble of decision trees
clf = make_pipeline(SimpleImputer(), RandomForestClassifier())
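
As a small illustration of the imputation step, the sketch below runs `SimpleImputer` with its default mean strategy on a made-up array. The array `X` is hypothetical and is not part of the task data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array with one missing value per column
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

imputer = SimpleImputer()  # default strategy is 'mean'
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # column means learned during fit
print(X_filled)             # missing entries replaced by those means
```

Inside the pipeline, the means are learned from the training features during `fit` and reused unchanged when predicting on the test features.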

In [11]:
# Fill the columns of task 1 with predicted probabilities, fitting one model per label
for cat in cats_task1:
    clf.fit(train_df_task1, labels_df_task1[cat])

    # predict_proba returns one column per class (0 and 1)
    # [:, 1] keeps the second column, i.e. the probability of class 1
    probs_task1 = clf.predict_proba(test_df_task1)[:, 1]
    predictions[cat] = probs_task1
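
The column layout of `predict_proba` can be checked on synthetic data. The sketch below is illustrative only: the data, the `random_state` and the reduced `n_estimators` are assumptions and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem: the label depends on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
proba = model.predict_proba(X[:5])

print(model.classes_)    # class order that the columns of proba follow
print(proba)             # one row per sample, one column per class
p_positive = proba[:, 1] # probability of class 1
```

Each row of `proba` sums to 1, so the first column carries no extra information for a binary label.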

### Sub-Task 2 (dropped)

### Sub-Task 3

In [12]:
# Select relevant columns for task 3
cats_task3 = ['LABEL_RRate', 'LABEL_ABPm', 'LABEL_SpO2', 'LABEL_Heartrate']

In [13]:
train_df_task3 = preprocess_data(train_data)
test_df_task3 = preprocess_data(test_data)
labels_df_task3 = preprocess_labels(train_labels, cats_task3)

In [14]:
# Same pipeline as in sub-task 1, but with a regressor as the final step (the name clf is reused)
clf = make_pipeline(SimpleImputer(), RandomForestRegressor())

In [15]:
# Fill the columns of task 3 with the regressor's predictions, fitting one model per label
for cat in cats_task3:
    clf.fit(train_df_task3, labels_df_task3[cat])
    preds_task3 = clf.predict(test_df_task3)
    predictions[cat] = preds_task3
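
The per-label loop can be tried end to end on synthetic data. This is a minimal sketch under stated assumptions: the feature names `a`, `b`, `c`, the labels `LABEL_x` and `LABEL_y`, and the forest settings are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic features with some missing values in column 'b'
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(60, 3)), columns=['a', 'b', 'c'])
X_train.iloc[::7, 1] = np.nan
X_test = pd.DataFrame(rng.normal(size=(5, 3)), columns=['a', 'b', 'c'])

# Two hypothetical regression targets
y_train = pd.DataFrame({
    'LABEL_x': X_train['a'] * 2.0,
    'LABEL_y': X_train['c'] - 1.0,
})

# Empty output table, filled one label column at a time
out = pd.DataFrame(index=X_test.index, columns=y_train.columns)
reg = make_pipeline(SimpleImputer(),
                    RandomForestRegressor(n_estimators=20, random_state=0))
for label in y_train.columns:
    reg.fit(X_train, y_train[label])   # refitting overwrites the previous fit
    out[label] = reg.predict(X_test)

print(out)
```

Because the same pipeline object is refitted for every label, only the predictions are kept, not the fitted models.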

In [16]:
# Save the predictions table as a CSV file
predictions.to_csv(folder + 'solution.csv', float_format='%.3f') # %.3f = round floats to three decimal places
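
The effect of `float_format` can be seen by writing a small frame to an in-memory buffer instead of a file. The frame `demo` and its column name are hypothetical.

```python
import io
import pandas as pd

# Hypothetical predictions for three patients
demo = pd.DataFrame({'LABEL_demo': [0.123456, 0.5, 2.0]},
                    index=pd.Index([1, 2, 3], name='pid'))

buffer = io.StringIO()
demo.to_csv(buffer, float_format='%.3f')  # floats are written with three decimals
csv_text = buffer.getvalue()
print(csv_text)
```

The integer pid index is written unchanged, since `float_format` only applies to floating-point values.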