# 4 - Clustering & Classification

We have implemented functions that transform the patient data into a feature matrix and label vector, and then split the data into training and test sets, `(X_train, y_train)` and `(X_test, y_test)`. For this tutorial, we use only 2,000 patients for training and 500 for testing.

We will learn a linear classifier that separates the training data into two classes based on whether or not the patient dies during the hospital stay. The labels in `y_train` are binary labels in {−1, 1}, where 1 means that the patient died in the hospital.

In [None]:
#@title Run this cell to download the data and helper files. { display-mode: "form" }
!pip install -U wget
!rm -rf data.zip data lib
!mkdir lib

import wget
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/config.yaml', 'lib/config.yaml')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/helper.py', 'lib/helper.py')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/data.zip', 'data.zip')

import zipfile
with zipfile.ZipFile("data.zip","r") as zip_ref:
    zip_ref.extractall(".")

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from lib.helper import load_data, config

In [None]:
#@title Run this cell to define the three preprocessing functions. { display-mode: "form" }
def generate_feature_vector(df):
    """
    Reads a dataframe containing all measurements for a single patient
    within the first 48 hours of the ICU admission, and converts it into
    a feature vector.
    
    Args:
        df: pd.DataFrame, with columns [Time, Variable, Value]
    
    Returns:
        a python dictionary of format {feature_name: feature_value}
        for example, {'Age': 32, 'Gender': 0, 'mean_HR': 84, ...}
    """
    static_variables = config['invariant']
    timeseries_variables = config['timeseries']

    # Replace unknown values (encoded as -1) with NaN
    df = df.replace({-1: np.nan})
    
    # Split time invariant and time series
    static, timeseries = df.iloc[0:5], df.iloc[5:]
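    # Keyword arguments are required by pandas >= 2.0 and also work in older versions
    static = static.pivot(index='Time', columns='Variable', values='Value')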

    feature_dict = static.iloc[0].to_dict()
    for variable in timeseries_variables:
        measurements = timeseries[timeseries['Variable'] == variable].Value
        feature_dict['mean_' + variable] = np.mean(measurements)
    
    return feature_dict

def impute_missing_values(X):
    """
    For each feature column, impute missing values (np.nan) with the
    population mean for that feature.
    
    Args:
        X: np.array, shape (N, d). X could contain missing values
    Returns:
        X: np.array, shape (N, d). X does not contain any missing values
    """
    from sklearn.impute import SimpleImputer
    return SimpleImputer().fit_transform(X)

def normalize_feature_matrix(X):
    """
    For each feature column, normalize all values to range [0, 1].

    Args:
        X: np.array, shape (N, d).
    Returns:
        X: np.array, shape (N, d). Values are normalized per column.
    """
    from sklearn.preprocessing import MinMaxScaler
    return MinMaxScaler().fit_transform(X)

In [None]:
# Load the dataset
# `raw_data` is a dictionary mapping patient ID to the data associated with that patient
raw_data, df_labels = load_data(N=2500)

# Generate features
features = [generate_feature_vector(df) for _, df in tqdm(sorted(raw_data.items()), desc='Generating feature vectors')]
df_features = pd.DataFrame(features).sort_index(axis=1)
feature_names = df_features.columns.tolist()

In [None]:
# Apply imputation and normalization
X, y = df_features.values, df_labels['In-hospital_death'].values
X = impute_missing_values(X)
X = normalize_feature_matrix(X)

# Split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=3)
del X, y

## Clustering & Visualization

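Clustering should be done using **only** the continuous features, since $k$-means relies on Euclidean distances, which are less meaningful for binary or categorical values.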

- Documentation for [sklearn.cluster.KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- $k$-means Example: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
- PCA documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- PCA example: https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html
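
The two tools linked above can be combined as in the following minimal sketch. It runs on synthetic data from `make_blobs`, not on the patient data, and every name ending in `_demo` is a placeholder introduced only for illustration. Because cluster ids are arbitrary, the agreement with the labels is checked under both possible assignments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for a continuous feature matrix (assumption: not the patient data)
X_demo, y_demo = make_blobs(n_samples=300, n_features=5, centers=2, random_state=0)

# Fit k-means with 2 clusters; fit_predict returns one cluster id (0 or 1) per row
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans_demo.fit_predict(X_demo)

# Cluster ids carry no meaning, so take the better of the two possible matchings
match = np.mean(cluster_ids == y_demo)
agreement = max(match, 1 - match)

# Project onto the first two principal components to get 2-D coordinates for plotting
X_2d = PCA(n_components=2).fit_transform(X_demo)
print(X_2d.shape, agreement)
```

The columns of `X_2d` can be passed to a scatter plot and colored by `cluster_ids`. Note that the labels here are in {0, 1}; labels in {−1, 1} would first need to be mapped, for example with `(y == 1).astype(int)`.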

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [None]:
# Keep only the continuous features (drop the binary/categorical columns)
X = X_train.copy()
X = np.delete(X, [1, 3, 15, 30], axis=1)

In [None]:
## TODO
# Perform k-means clustering on X with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0)
...

In [None]:
## TODO
# How often does the cluster assignment match our label (mortality)?


In [None]:
## TODO
# Run PCA on X


# Visualize the resulting clusters on axes defined by the first two principal components of X



## Other ideas...
- Try other values of $k$
- Try a different clustering algorithm (e.g., spectral clustering)
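
One possible way to compare values of $k$ is the silhouette score, which is our suggestion here and not something the tutorial prescribes. The sketch below uses synthetic data and placeholder names (`X_demo`, `scores`); it only illustrates the loop, and the scores it produces say nothing about the patient data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the continuous feature matrix (assumption)
X_demo, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# Fit k-means for several k and record the mean silhouette score of each clustering
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Swapping `KMeans` for `sklearn.cluster.SpectralClustering` in the loop is one way to try a different algorithm with the same evaluation.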

In [None]:
# ...

---
## Training a classifier

Most sklearn models follow the same three-step pattern:
```python
# 1. Create a classifier
clf = sklearn.SomeModel()

# 2. Train the classifier
clf.fit(X, y)

# 3. Use the classifier
clf.predict(X)
```
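
As a concrete instance of the three steps, here is a minimal sketch on a synthetic problem generated with `make_classification`. The variable names (`Xd_train`, `clf_demo`, and so on) are placeholders chosen to avoid clashing with the tutorial's own variables, and `max_iter=1000` is an arbitrary choice to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the patient data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
y_demo = 2 * y_demo - 1  # map {0, 1} to {-1, 1} to mirror the tutorial's labels
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# 1. Create a classifier
clf_demo = LogisticRegression(max_iter=1000)

# 2. Train the classifier on the training split only
clf_demo.fit(Xd_train, yd_train)

# 3. Use the classifier on held-out data and measure accuracy
yd_pred = clf_demo.predict(Xd_test)
acc = accuracy_score(yd_test, yd_pred)
print(f'Demo accuracy: {acc:.3f}')
```

Evaluating on the held-out split, not on the data used for fitting, gives an estimate of how the model behaves on unseen examples.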

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
# 1. Create a classifier
clf = LogisticRegression()

In [None]:
# 2. Train the classifier
...

In [None]:
# 3. Predictions for test set
y_pred = ...

In [None]:
# Calculate accuracy
accuracy = ...
print('Accuracy:', accuracy)

## Extra

Try changing the classifier (use a different class, change the arguments, etc.) and recalculating the test accuracy. Create a table that lists the test accuracy for each model you tried.

| Model class |  Test accuracy |
|------|------|
|   Logistic regression  | ??% |
|   SVM  | ??% |
|   ...  | ... |
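
A table like the one above could be generated with a loop such as the following sketch. It uses synthetic data and two example models; the `models` dictionary and the `_demo`/`d_` names are illustrative assumptions, and the printed accuracies are unrelated to the patient data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the patient data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Candidate models to compare; add or change entries as needed
models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(kernel='linear'),
}

# Fit each model and record its accuracy on the held-out split
results = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    results[name] = model.score(Xd_test, yd_test)

# Print the results as a markdown table
print('| Model class | Test accuracy |')
print('|------|------|')
for name, acc in results.items():
    print(f'| {name} | {acc:.1%} |')
```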