1.Apply GMM to the heart disease data by setting n_components=2. Get the ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn import datasets, metrics
from sqlalchemy import create_engine

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
heartdisease_df = pd.read_sql_query('select * from heartdisease',con=engine)

engine.dispose()

In [3]:
# Make sure the number of rows is even, so the data divides evenly into two samples.
rows = heartdisease_df.shape[0] - heartdisease_df.shape[0] % 2
df = heartdisease_df.iloc[:rows, :]


# Define the features and the outcome
X = df.iloc[:, :13]
y = df.iloc[:, 13]

# Replace missing values (marked by ?) with a 0
X = X.replace(to_replace='?', value=0)

# Binarize y so that 1 means heart disease diagnosis and 0 means no diagnosis
y = np.where(y > 0, 1, 0)
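
The cleaning steps above can be checked on a toy frame. The sketch below uses made-up values and hypothetical column names (`ca`, `num`), not the actual heart disease table, and the cluster assignment is invented for illustration. It shows the `?` replacement, the binarization, and the fact that ARI depends only on how samples are grouped, so swapping the 0/1 coding of the outcome leaves the score unchanged.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

# Toy data, made up for illustration (not the heart disease table)
toy = pd.DataFrame({"ca": ["0", "?", "2", "1"], "num": [0, 2, 1, 0]})

# Replace the '?' marker with zero, then convert the column to a numeric dtype
ca = pd.to_numeric(toy["ca"].replace("?", "0"))

# Binarize: 1 for any positive diagnosis code, 0 otherwise
y_toy = np.where(toy["num"] > 0, 1, 0)

# A hypothetical cluster assignment for the four rows
clusters_toy = np.array([0, 1, 1, 0])

# ARI compares partitions, so flipping the label coding does not change it
ari = metrics.adjusted_rand_score(y_toy, clusters_toy)
ari_flipped = metrics.adjusted_rand_score(1 - y_toy, clusters_toy)
print(ari, ari_flipped)
```

Note that a 0 filled in for a missing value is treated as a real measurement by everything downstream, including the scaler.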

In [4]:
# Standardizing the features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
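
As a quick sanity check on what standardization does, here is a minimal sketch on synthetic data (the array `X_demo` and its scales are assumptions chosen for illustration). After `StandardScaler`, each column should have a mean of about 0 and a standard deviation of about 1, so no single feature dominates the distance computations behind the silhouette score.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative only)
X_demo = np.column_stack([
    rng.normal(50, 10, 200),    # large-scale feature
    rng.normal(0.5, 0.1, 200),  # small-scale feature
])

X_demo_std = StandardScaler().fit_transform(X_demo)

# Per-column mean and (population) standard deviation after scaling
col_means = X_demo_std.mean(axis=0)
col_stds = X_demo_std.std(axis=0)
print(col_means, col_stds)
```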

In [5]:
# Defining the Gaussian mixture model
gmm_cluster = GaussianMixture(n_components=2, random_state=123)

# Fit model
clusters = gmm_cluster.fit_predict(X_std)

In [10]:
print("Adjusted Rand Index of the GMM solution: {}"
      .format(metrics.adjusted_rand_score(y, clusters)))
print("The silhouette score of the GMM solution: {}"
      .format(metrics.silhouette_score(X_std, clusters, metric='euclidean')))
print('------------------------------------------------------------------------')
print('Adjusted Rand Index of the DBSCAN solution: -0.002660945249813766')
print('The silhouette score of the DBSCAN solution: -0.1123388915588819')
print('------------------------------------------------------------------------')
print('Adjusted Rand Index of two-cluster k-means: 0.43661540614807665')
print('The silhouette score of two-cluster k-means: 0.17440650461256255')

Adjusted Rand Index of the GMM solution: 0.18230716541111341
The silhouette score of the GMM solution: 0.1356012327371289
------------------------------------------------------------------------
Adjusted Rand Index of the DBSCAN solution: -0.002660945249813766
The silhouette score of the DBSCAN solution: -0.1123388915588819
------------------------------------------------------------------------
Adjusted Rand Index of two-cluster k-means: 0.43661540614807665
The silhouette score of two-cluster k-means: 0.17440650461256255


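Two-cluster k-means has the highest ARI (0.437) and the highest silhouette score (0.174) of the three solutions, so it performs best on this data. GMM comes second on both metrics, and DBSCAN scores slightly below zero on both.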
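
The k-means and DBSCAN numbers above were pasted in from earlier checkpoints. A side-by-side comparison can also be computed in one cell, which keeps every score tied to the same data. The following is only a sketch under stated assumptions: it uses synthetic blobs from `make_blobs` instead of the heart disease table, and the names `X_demo`, `y_demo`, and `scores` are made up for the example.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic two-cluster data standing in for the real features and outcome
X_demo, y_demo = make_blobs(n_samples=300, centers=2, n_features=5,
                            random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

models = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=123),
    "GMM": GaussianMixture(n_components=2, random_state=123),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    scores[name] = {
        # ARI needs the ground truth; silhouette only needs the features
        "ari": metrics.adjusted_rand_score(y_demo, labels),
        "silhouette": metrics.silhouette_score(X_demo, labels,
                                               metric="euclidean"),
    }

for name, result in scores.items():
    print(name, result)
```

One point to keep in mind when reading such a table: the silhouette score is built on Euclidean distances, which is close to what k-means optimizes, so it may favor k-means over models that allow elongated clusters.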
2.The GMM implementation in scikit-learn has a parameter called covariance_type, which determines the type of covariance parameters to use. There are four types you can specify: full, tied, diag, and spherical. Try all of them. Which one performs best in terms of ARI and silhouette scores?
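
One way to see what the four options mean is to look at the shape of the fitted `covariances_` attribute. The sketch below fits each type on random synthetic data (4 features, chosen arbitrarily for illustration) and records the shapes; the variable names are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))  # synthetic data: 300 samples, 4 features

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=123).fit(X_demo)
    # full:      (n_components, n_features, n_features)
    # tied:      (n_features, n_features)
    # diag:      (n_components, n_features)
    # spherical: (n_components,)
    shapes[cov_type] = gmm.covariances_.shape

print(shapes)
```

The fewer parameters a type has, the more constrained the cluster shapes: spherical components can only be round, while full components can be stretched and rotated freely.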

1.full: This is the default. Each component has its own general covariance matrix.

In [12]:
# Defining the Gaussian mixture model with covariance_type=full
gmm_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='full')

# Fit model
clusters = gmm_cluster.fit_predict(X_std)

In [13]:
print("Adjusted Rand Index of the GMM solution with covariance_type=full: {}"
      .format(metrics.adjusted_rand_score(y, clusters)))
print("The silhouette score of the GMM solution with covariance_type=full: {}"
      .format(metrics.silhouette_score(X_std, clusters, metric='euclidean')))

Adjusted Rand Index of the GMM solution with covariance_type=full: 0.18230716541111341
The silhouette score of the GMM solution with covariance_type=full: 0.1356012327371289


2.tied: All components share the same general covariance matrix.

In [14]:
# Defining the Gaussian mixture model with covariance_type=tied
gmm_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='tied')

# Fit model
clusters = gmm_cluster.fit_predict(X_std)

In [15]:
print("Adjusted Rand Index of the GMM solution with covariance_type=tied: {}"
      .format(metrics.adjusted_rand_score(y, clusters)))
print("The silhouette score of the GMM solution with covariance_type=tied: {}"
      .format(metrics.silhouette_score(X_std, clusters, metric='euclidean')))

Adjusted Rand Index of the GMM solution with covariance_type=tied: 0.18230716541111341
The silhouette score of the GMM solution with covariance_type=tied: 0.1356012327371289


3.diag: Each component has its own diagonal covariance matrix.

In [18]:
# Defining the Gaussian mixture model with covariance_type=diag
gmm_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='diag')

# Fit model
clusters = gmm_cluster.fit_predict(X_std)

In [19]:
print("Adjusted Rand Index of the GMM solution with covariance_type=diag: {}"
      .format(metrics.adjusted_rand_score(y, clusters)))
print("The silhouette score of the GMM solution with covariance_type=diag: {}"
      .format(metrics.silhouette_score(X_std, clusters, metric='euclidean')))

Adjusted Rand Index of the GMM solution with covariance_type=diag: 0.18230716541111341
The silhouette score of the GMM solution with covariance_type=diag: 0.1356012327371289


4.spherical: Each component has its own single variance.

In [20]:
# Defining the Gaussian mixture model with covariance_type=spherical
gmm_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='spherical')

# Fit model
clusters = gmm_cluster.fit_predict(X_std)

In [21]:
print("Adjusted Rand Index of the GMM solution with covariance_type=spherical: {}"
      .format(metrics.adjusted_rand_score(y, clusters)))
print("The silhouette score of the GMM solution with covariance_type=spherical: {}"
      .format(metrics.silhouette_score(X_std, clusters, metric='euclidean')))

Adjusted Rand Index of the GMM solution with covariance_type=spherical: 0.2060175349560907
The silhouette score of the GMM solution with covariance_type=spherical: 0.12345483213377387


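The full, tied, and diag covariance types produce identical scores (ARI 0.182, silhouette 0.136). Only covariance_type=spherical differs, with a higher ARI (0.206) but a lower silhouette score (0.123), so no single setting is best on both metrics.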
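
The four experiments above repeat the same two cells with one argument changed, which is how a label can end up out of sync with the setting it describes. A loop avoids that. The sketch below is one possible way to write it, not the assignment's required form: `score_covariance_types` is a hypothetical helper, and the data comes from `make_blobs` so that the block runs without the database. With the notebook's own variables it would be called as `score_covariance_types(X_std, y)`.

```python
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def score_covariance_types(X, y, n_components=2, random_state=123):
    """Fit one GMM per covariance type and return ARI and silhouette scores.

    Hypothetical helper written for this example.
    """
    results = {}
    for cov_type in ["full", "tied", "diag", "spherical"]:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=cov_type,
                              random_state=random_state)
        labels = gmm.fit_predict(X)
        results[cov_type] = {
            "ari": metrics.adjusted_rand_score(y, labels),
            "silhouette": metrics.silhouette_score(X, labels,
                                                   metric="euclidean"),
        }
    return results


# Synthetic stand-in for the standardized features and the binary outcome
X_demo, y_demo = make_blobs(n_samples=300, centers=2, n_features=5,
                            random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

results = score_covariance_types(X_demo, y_demo)
for cov_type, result in results.items():
    print("covariance_type={}: ARI={:.3f}, silhouette={:.3f}".format(
        cov_type, result["ari"], result["silhouette"]))
```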