In [1]:
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

# 1. Measuring class similarities

The machine learning experiments we ran suggested a new hypothesis: the classes in our aggregate measure of vascular risk may be similar to one another, which would make the multi-class classification problem difficult. The most clearly differentiated classes appeared to be patients with zero VRFs (the healthiest) and patients with five VRFs (the most at risk), since comparing these two yielded the highest performance metrics. However, this hypothesis needs to be tested.

In [2]:
# Loading the data
data = pd.read_csv('casuality_data_final_factor_analyzer.csv')
data.shape

(2065, 1427)

In [3]:
# Filtering data groups and combining datasets
heart_df = data.filter(regex='heart')
cardio_cmr_df = data.filter(regex='cardio_cmr')
brain_df = data.filter(regex='brain')
agg_score = data['agg_score']
data = pd.concat((heart_df, cardio_cmr_df, brain_df, agg_score), axis=1)
data.shape

(2065, 1384)

In [4]:
# Splitting each class from our target variable
S_0 = data.loc[data['agg_score'] == 0]
S_1 = data.loc[data['agg_score'] == 1]
S_2 = data.loc[data['agg_score'] == 2]
S_3 = data.loc[data['agg_score'] == 3]
S_4 = data.loc[data['agg_score'] == 4]
S_5 = data.loc[data['agg_score'] == 5]
S_0.shape, S_1.shape, S_2.shape, S_3.shape, S_4.shape, S_5.shape

((523, 1384), (606, 1384), (555, 1384), (273, 1384), (91, 1384), (17, 1384))
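
The six explicit filters above can also be written as a single `groupby`. The sketch below uses a small synthetic frame: the column `feature_a` and all values are made up for illustration, and only the `agg_score` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; only `agg_score` is a real column name.
toy = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.3, 0.9, 0.7, 0.2],
    "agg_score": [0, 1, 1, 2, 0, 2],
})

# One sub-frame per class, keyed by the aggregate score.
classes = {score: group for score, group in toy.groupby("agg_score")}

# Number of samples per class, analogous to the shapes printed above.
class_sizes = toy["agg_score"].value_counts().sort_index()
```

Keeping the classes in a dictionary makes it possible to loop over them in later steps instead of repeating each statement six times.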

# 2. Extracting factors from each class

To test this hypothesis, we extract five latent factors from each class using all the features, the same number of factors we extracted for the ML experiments, and then run the k-means clustering algorithm from the scikit-learn package in Python to find the centroid of each class in factor space.

In [5]:
Y0 = S_0["agg_score"]
Y1 = S_1["agg_score"]
Y2 = S_2["agg_score"]
Y3 = S_3["agg_score"]
Y4 = S_4["agg_score"]
Y5 = S_5["agg_score"]

In [6]:
# FactorAnalysis is unsupervised: the second argument (y) is accepted for API
# consistency but ignored. Each S_k still contains the agg_score column.
factor_S0 = FactorAnalysis(n_components=5).fit_transform(S_0, Y0)
factor_S1 = FactorAnalysis(n_components=5).fit_transform(S_1, Y1)
factor_S2 = FactorAnalysis(n_components=5).fit_transform(S_2, Y2)
factor_S3 = FactorAnalysis(n_components=5).fit_transform(S_3, Y3)
factor_S4 = FactorAnalysis(n_components=5).fit_transform(S_4, Y4)
factor_S5 = FactorAnalysis(n_components=5).fit_transform(S_5, Y5)
factor_S0.shape, factor_S1.shape, factor_S2.shape, factor_S3.shape, factor_S4.shape, factor_S5.shape

((523, 5), (606, 5), (555, 5), (273, 5), (91, 5), (17, 5))
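
Two properties of scikit-learn's `FactorAnalysis` are worth checking at this step: the `y` argument does not influence the result, and the data are centered before fitting, so the factor scores of the training samples average to approximately zero in every factor. A minimal check on synthetic data follows; the arrays `X` and `y` are made up and do not come from the study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data with a 5-factor structure plus noise (illustration only).
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 12))
X = latent @ loadings + 0.5 * rng.normal(size=(200, 12))
y = np.zeros(200)  # constant label, as within a single class

scores_with_y = FactorAnalysis(n_components=5, random_state=0).fit_transform(X, y)
scores_without_y = FactorAnalysis(n_components=5, random_state=0).fit_transform(X)

# Passing y makes no difference to the extracted scores.
same_scores = np.allclose(scores_with_y, scores_without_y)

# Largest absolute column mean of the scores; expected to be numerically zero.
max_abs_mean = np.abs(scores_with_y.mean(axis=0)).max()
```

Since each class is fitted separately in this notebook, the scores of each class are centered on that class's own mean.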

# 3. K-means clustering

This algorithm clusters data by trying to separate the samples into n groups of equal variance. To do so, k-means divides a set of samples into disjoint clusters, each described by the mean of the samples in the cluster. These means are known as the cluster “centroids” [1].
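
For reference, the criterion that k-means minimizes is the within-cluster sum of squares. The symbols below are notation introduced here, not taken from the notebook: $C_j$ is the set of samples assigned to cluster $j$, $\mu_j$ is its centroid, and $k$ is the number of clusters.

$$
\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

With $k = 1$ every sample belongs to the same cluster, so the single centroid is the mean of all samples.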

In [7]:
# S0 (with n_clusters=1, the single centroid is the mean of the factor scores)
Kmean_S0 = KMeans(n_clusters=1)
Kmean_S0.fit(factor_S0)
centroid_SO = Kmean_S0.cluster_centers_.tolist()

# S1
Kmean_S1 = KMeans(n_clusters=1)
Kmean_S1.fit(factor_S1)
centroid_S1 = Kmean_S1.cluster_centers_.tolist()

# S2
Kmean_S2 = KMeans(n_clusters=1)
Kmean_S2.fit(factor_S2)
centroid_S2 = Kmean_S2.cluster_centers_.tolist()

# S3
Kmean_S3 = KMeans(n_clusters=1)
Kmean_S3.fit(factor_S3)
centroid_S3 = Kmean_S3.cluster_centers_.tolist()

# S4
Kmean_S4 = KMeans(n_clusters=1)
Kmean_S4.fit(factor_S4)
centroid_S4 = Kmean_S4.cluster_centers_.tolist()

# S5
Kmean_S5 = KMeans(n_clusters=1)
Kmean_S5.fit(factor_S5)
centroid_S5 = Kmean_S5.cluster_centers_.tolist()
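
As a sanity check on what `n_clusters=1` returns, the sketch below confirms on synthetic data that the single k-means centroid coincides with the column means of the input. The array `scores` is a hypothetical stand-in for one class's factor scores.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
scores = rng.normal(loc=2.0, size=(50, 5))  # hypothetical stand-in for factor scores

km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(scores)
centroid = km.cluster_centers_[0]

# With one cluster, the centroid is the per-column mean of the data.
matches_mean = np.allclose(centroid, scores.mean(axis=0))
```

In other words, `cluster_centers_` for each class here carries the same information as calling `.mean(axis=0)` on that class's factor scores.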

# 4. Plotting cluster centroids

To plot the results and visualize the distances between our classes, we select only the first two coordinates of each class centroid, corresponding to the first two latent factors, so that they can be drawn in a two-dimensional graph.

In [8]:
centroid_SO[0][:2]

[-1.6281855829588817e-16, -7.518948285319894e-16]

In [9]:
centroid_S1[0][:2]

[-3.0153272618068145e-16, -1.3703742944218102e-16]

In [10]:
centroid_S2[0][:2]

[-2.9949766390648226e-16, -1.401281493243103e-16]

In [11]:
centroid_S3[0][:2]

[-6.100126508929432e-17, 3.0093957444051864e-16]

In [12]:
centroid_S4[0][:2]

[2.342448579428902e-16, 2.5486328554307165e-15]

In [13]:
centroid_S5[0][:2]

[-5.1592717026698455e-16, 1.6718652606120006e-15]
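
A scatter plot of such centroid coordinates could be drawn along the following lines. This is a sketch rather than the code that generated the figure below: the function name `plot_centroids`, the axis labels, and the example coordinates are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt

def plot_centroids(centroids_2d):
    """Scatter the first two centroid coordinates of each class and label the points."""
    fig, ax = plt.subplots(figsize=(6, 5))
    for label, (x, y) in centroids_2d.items():
        ax.scatter(x, y)
        ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5))
    ax.set_xlabel("Factor 1")
    ax.set_ylabel("Factor 2")
    ax.set_title("Class centroids on the first two latent factors")
    return fig, ax

# Placeholder coordinates for illustration only; in the notebook these would be
# centroid_SO[0][:2], centroid_S1[0][:2], and so on.
example = {"S0": (0.0, 0.0), "S1": (0.2, 0.1), "S5": (1.0, 0.8)}
fig, ax = plot_centroids(example)
```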

<center><img src="Figures/kmeanscluster.png"></center>

In this plot, S5 (patients with an aggregate score of five) and S0 (patients with an aggregate score of zero) are the most distant pair, which supports our hypothesis that these two classes differ the most. The second largest distance is between S0 and S4, which is consistent with this comparison achieving the second highest performance metrics after 0 vs 5. Lastly, S1, S2 and S3 are the classes closest to S0, which is consistent with these comparisons obtaining the lowest performance metrics and suggests that they are the most similar classes in our target variable.
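
The visual comparison can be complemented with explicit numbers. The following sketch computes pairwise Euclidean distances between centroids; the function name `pairwise_centroid_distances` and the example coordinates are hypothetical and chosen only so the result is easy to verify by hand.

```python
import numpy as np

def pairwise_centroid_distances(centroids):
    """Return the labels and the matrix of Euclidean distances between centroids."""
    labels = list(centroids)
    points = np.array([centroids[k] for k in labels], dtype=float)
    # Broadcasting builds all pairwise differences in one step.
    diffs = points[:, None, :] - points[None, :, :]
    return labels, np.sqrt((diffs ** 2).sum(axis=-1))

# Made-up coordinates, not values from the study.
example = {"S0": [0.0, 0.0], "S1": [3.0, 4.0], "S5": [6.0, 8.0]}
labels, dist = pairwise_centroid_distances(example)
```

The resulting matrix is symmetric with zeros on the diagonal, and the entry `dist[i, j]` is the distance between `labels[i]` and `labels[j]`.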

# 5. References

[1] [Pedregosa, F. et al. (2011)](https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.