In [1]:
from sklearn.datasets import load_diabetes
from numpy import cov as covM
from numpy import mean
from numpy import median
from numpy import std
from math import sqrt

In [2]:
diabetes_data = load_diabetes()
# diabetes_data

In [3]:
diabetes_features = diabetes_data.data 
diabetes_target = diabetes_data.target

**Part A: Compute the covariance matrix (CovM) of the non-target attributes of the data set**

In [4]:
cov_matrix = covM(diabetes_features, rowvar=False)
# cov_matrix

In [5]:
cov_matrix.shape

(10, 10)

The covariance matrix has dimensions 10x10.

**Part B: Compute the correlation of the age and bp attributes (directly from the elements of the covariance matrix)**

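$$\mathrm{Corr}(\mathrm{Age}, \mathrm{BP}) = \frac{\mathrm{Cov}(\mathrm{Age}, \mathrm{BP})}{\sqrt{\mathrm{Cov}(\mathrm{Age}, \mathrm{Age}) \cdot \mathrm{Cov}(\mathrm{BP}, \mathrm{BP})}}$$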

We can directly use the covariance matrix to get these values.

In [6]:
cov_age_bp = cov_matrix[0, 3]
cov_age_age = cov_matrix[0, 0]
cov_bp_bp = cov_matrix[3, 3]

In [7]:
corr_age_bp = cov_age_bp / sqrt(cov_age_age * cov_bp_bp)
corr_age_bp

0.33542671054424283

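So, the correlation of age and blood pressure is approximately 0.3354. We found this by using the formula for correlation, which is the covariance divided by the product of the two standard deviations, i.e. a standardized version of the covariance. Age is column 0 and bp is column 3 of the feature matrix, so entry [0, 3] of the covariance matrix holds their covariance, and the diagonal entries [0, 0] and [3, 3] hold the two variances. Indexing these three positions gives everything needed to compute the correlation.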
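As a sanity check on the formula, the same quantity can be compared against NumPy's built-in `corrcoef`. The sketch below uses a small synthetic two-column array rather than the diabetes features, so the names `features`, `corr_from_cov`, and `corr_builtin` are illustrative assumptions and the numbers are not the ones reported above.

```python
import numpy as np

# Synthetic stand-in for a feature matrix (an assumption, not the diabetes data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
features = np.column_stack([x, y])

# Correlation computed by hand from the covariance matrix entries.
cov = np.cov(features, rowvar=False)
corr_from_cov = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Correlation computed directly by NumPy.
corr_builtin = np.corrcoef(features, rowvar=False)[0, 1]

# The two values should agree up to floating-point error.
print(np.isclose(corr_from_cov, corr_builtin))
```

Running the same comparison with `diabetes_features` and indices `[0, 3]` should confirm the value computed in the cell above.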

**Part C: Evaluate previous results**

Based on the previous result, I would expect older patients in the dataset to have higher blood pressure on average. Because age and blood pressure are positively correlated (in this dataset), blood pressure tends to increase as age increases, which is how a positive correlation is interpreted. With a correlation of about 0.34 the tendency is moderate, so many individual patients will not follow it. Since the features in this dataset are all scaled to the same spread, we can think of the data as being globally "related" by a line with a positive slope of about 0.34, which is the trend just described.
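The link between correlation and slope can be made concrete: for a least-squares line of y on x, the slope equals the correlation multiplied by the ratio of the standard deviations, so the two coincide only when both variables have the same spread. The sketch below illustrates this on synthetic data (the arrays `x` and `y` are assumptions, not the age and bp columns).

```python
import numpy as np

# Synthetic data with deliberately different spreads in x and y (an assumption).
rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=500)
y = 0.3 * x + rng.normal(size=500)

r = np.corrcoef(x, y)[0, 1]

# Least-squares slope of y on x; polyfit returns [slope, intercept] for degree 1.
slope = np.polyfit(x, y, 1)[0]

# Slope predicted from the correlation and the two standard deviations.
slope_from_r = r * np.std(y) / np.std(x)

# After standardizing both variables, the fitted slope equals r itself.
zx = (x - x.mean()) / np.std(x)
zy = (y - y.mean()) / np.std(y)
slope_standardized = np.polyfit(zx, zy, 1)[0]

print(np.isclose(slope, slope_from_r), np.isclose(slope_standardized, r))
```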

**Part D: Check whether the data is consistent with what we found above**

In [8]:
# compute the mean age, used as the threshold that separates older from younger patients
mean_age = mean(diabetes_features[:, 0])
mean_age

-3.6396225400041895e-16

In [9]:
older_patients = diabetes_features[diabetes_features[:, 0] > mean_age]
younger_patients = diabetes_features[diabetes_features[:, 0] < mean_age]

In [10]:
median_older_patients = median(older_patients[:, 3])
median_older_patients

0.0115437429137471

In [11]:
median_younger_patients = median(younger_patients[:, 3])
median_younger_patients

-0.0228849640236156

The median blood pressure for younger patients is lower than that of older patients. This is evidence in favor of our findings from above. We compute the difference between medians as a number of standard deviations of the blood pressure attribute below:

In [12]:
std_bp = std(diabetes_features[:, 3])
median_difference = abs(median_older_patients - median_younger_patients)
median_diff_in_std = median_difference / std_bp
median_diff_in_std

0.7238221126281117

The difference between the median blood pressures of the two groups is about 0.72 standard deviations of the blood pressure attribute.
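One detail worth keeping in mind when mixing these functions: `numpy.std` defaults to the population formula (`ddof=0`), while `numpy.cov` defaults to the sample formula (dividing by n - 1). The sketch below shows the size of that discrepancy on synthetic data of the same length as the dataset (442 rows); the array `features` is an assumption, not the diabetes data.

```python
import numpy as np

# Synthetic stand-in with 442 rows, matching the dataset's sample count.
rng = np.random.default_rng(2)
features = rng.normal(size=(442, 2))
n = features.shape[0]

# np.cov divides by n - 1 by default, so its diagonal is the sample variance.
var_from_cov = np.cov(features, rowvar=False)[0, 0]
std_sample = np.std(features[:, 0], ddof=1)
std_population = np.std(features[:, 0])  # default ddof=0

print(np.isclose(var_from_cov, std_sample ** 2))
# The two standard deviations differ by a factor of sqrt((n - 1) / n).
print(std_population / std_sample, np.sqrt((n - 1) / n))
```

With several hundred samples the factor is very close to 1, so the choice has only a small effect on the standardized difference reported above. The correlation itself is unaffected, because the factor cancels between numerator and denominator.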