# Unsupervised -> Gaussian Mixture Modeling

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'

# Initialize the postgres engine
engine = create_engine(f'postgresql://{postgres_user}:{postgres_pw}@{postgres_host}:{postgres_port}/{postgres_db}')

# Read the data from a sql query to the engine
heartdisease_df = pd.read_sql('SELECT * from {}'.format(postgres_db), con=engine)

# Dispose of the engine
engine.dispose()

# Define the features and the outcome
X = heartdisease_df.iloc[:, :13]
y = heartdisease_df.iloc[:, 13]

# Replace missing values (marked by ?) with a 0
X = X.replace(to_replace='?', value=0)

# Binarize y so that 1 means heart disease diagnosis and 0 means no diagnosis
y = np.where(y > 0, 1, 0)

### 1. Apply GMM to the heart disease data by setting n_components=2. Get ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?


In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [6]:
from sklearn.mixture import GaussianMixture
from sklearn import metrics

gmm = GaussianMixture(n_components=2)
preds = gmm.fit_predict(X_std)

print('ARI score:', metrics.adjusted_rand_score(y, preds))
print('Silhouette Score:', metrics.silhouette_score(X_std, preds))

ARI score: 0.4207322145049338
Silhouette Score: 0.16118591340148433


With `n_components=2`, GMM has the highest ARI score (0.42) of the three algorithms, meaning its clusters agree with the ground truth far more closely than the k-means and hierarchical solutions did. Its silhouette score (0.16), however, is in the same ballpark as theirs, so it groups similar data points into clusters about as well as k-means did.
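To make the ARI comparison more concrete, here is a minimal pure-Python sketch of how the adjusted Rand index can be computed from pair counts. The function name `adjusted_rand_index` and the toy labelings are illustrative assumptions, not part of the notebook; in practice `metrics.adjusted_rand_score` should be used. The sketch also shows that ARI only depends on which points are grouped together, not on which integer each cluster is given.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (illustrative helper, not sklearn's code)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: treat as perfect agreement

    # Number of point pairs that share a (true class, predicted cluster) cell,
    # a true class, or a predicted cluster, respectively.
    cell_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    true_pairs = sum(comb(c, 2) for c in Counter(labels_true).values())
    pred_pairs = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = true_pairs * pred_pairs / total_pairs  # agreement expected by chance
    max_index = (true_pairs + pred_pairs) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (cell_pairs - expected) / (max_index - expected)


# Toy labelings (assumed for illustration only)
truth = [0, 0, 0, 1, 1, 1]
flipped = [1, 1, 1, 0, 0, 0]      # same grouping, cluster ids swapped
alternating = [0, 1, 0, 1, 0, 1]  # grouping unrelated to the truth

ari_flipped = adjusted_rand_index(truth, flipped)
ari_alternating = adjusted_rand_index(truth, alternating)
print("ARI, ids swapped:", ari_flipped)
print("ARI, unrelated grouping:", ari_alternating)
```

Because the score is corrected for chance, a grouping unrelated to the truth lands near zero (it can even be slightly negative), while a perfect grouping scores 1 regardless of how the clusters are numbered.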

In [12]:
co_types = ['full', 'tied', 'diag', 'spherical']

for co_type in co_types:
    gmm = GaussianMixture(n_components=2, covariance_type=co_type)
    preds = gmm.fit_predict(X_std)

    print('Covariance Type:', co_type)
    print('ARI score:', metrics.adjusted_rand_score(y, preds))
    print('Silhouette Score:', metrics.silhouette_score(X_std, preds))
    print('-------------------------------------')

Covariance Type: full
ARI score: 0.4207322145049338
Silhouette Score: 0.16118591340148433
-------------------------------------
Covariance Type: tied
ARI score: 0.18389186035089963
Silhouette Score: 0.13628813153331445
-------------------------------------
Covariance Type: diag
ARI score: 0.18389186035089963
Silhouette Score: 0.13628813153331445
-------------------------------------
Covariance Type: spherical
ARI score: 0.20765243525722465
Silhouette Score: 0.12468753110276873
-------------------------------------
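The fits above do not set `random_state`, so the scores may change between runs because GMM starts from a random initialization. The following is a hedged sketch of one way to make such a comparison repeatable. It uses synthetic data from `make_blobs` as a stand-in for `X_std`, and the helper name `fit_gmm_labels`, the seed value, and `n_init=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized features (assumption: not the heart disease data)
X_demo, _ = make_blobs(n_samples=300, centers=2, n_features=5, random_state=0)
X_demo_std = StandardScaler().fit_transform(X_demo)


def fit_gmm_labels(X, covariance_type, seed):
    """Fit a 2-component GMM with a fixed seed and several restarts (hypothetical helper)."""
    gmm = GaussianMixture(
        n_components=2,
        covariance_type=covariance_type,
        n_init=5,           # keep the best of 5 initializations
        random_state=seed,  # fix the seed so reruns give the same fit
    )
    return gmm.fit_predict(X)


labels_a = fit_gmm_labels(X_demo_std, "full", seed=123)
labels_b = fit_gmm_labels(X_demo_std, "full", seed=123)
print("Identical labels across reruns:", (labels_a == labels_b).all())
```

Fixing the seed makes the covariance-type comparison repeatable, and a larger `n_init` reduces the chance that one type looks worse only because of an unlucky starting point.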


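The `full` covariance type yielded the best scores on both metrics: its ARI (0.42) is roughly double that of the other three types (0.18 to 0.21), and its silhouette score (0.16) is also the highest. Of the models tried so far, it best recovers the heart disease diagnosis labels.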
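One way to see how the four covariance types differ is to count the free covariance parameters each one estimates. The sketch below does this for 2 components and the 13 features used here. The helper name `covariance_param_count` is hypothetical, and the counts cover the covariance terms only (not the means or mixing weights).

```python
def covariance_param_count(covariance_type, n_components, n_features):
    """Free covariance parameters for each GMM covariance type (illustrative helper)."""
    k, d = n_components, n_features
    counts = {
        "full": k * d * (d + 1) // 2,  # one symmetric d x d matrix per component
        "tied": d * (d + 1) // 2,      # one symmetric d x d matrix shared by all components
        "diag": k * d,                 # one variance per feature per component
        "spherical": k,                # a single variance per component
    }
    return counts[covariance_type]


for co_type in ["full", "tied", "diag", "spherical"]:
    n_params = covariance_param_count(co_type, n_components=2, n_features=13)
    print(co_type, n_params)
# full 182
# tied 91
# diag 26
# spherical 2
```

The `full` type is the most flexible because each cluster gets its own shape and orientation. That flexibility may explain its higher ARI here, although it also means more parameters to estimate from the same data.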
