## Setup

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

The data: Protein quantity measurements of brain tissue samples from multiple regions.

Here we read in the data, store the brain-structure label (structure_acronym) for coloring the plots later, and drop the identifier and label columns from the dataset.

In [32]:
data = pd.read_csv('ProteinAndPathologyQuantifications.csv')
structures = data['structure_acronym']

In [33]:
clean_data = data.drop(columns=['donor_id', 'donor_name', 'structure_id', 'structure_acronym'])

## Pre-processing

Next, we replace NaN values with the mean of their column and standardize each feature to zero mean and unit variance.

In [34]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
clean_data = imp_mean.fit_transform(clean_data)

print(clean_data)

[[7.79282730e-05 2.35802722e-03 1.13707657e-03 ... 9.38000000e+00
  1.17800000e+01 5.23292251e+02]
 [6.28024529e-05 2.76219563e-03 1.27181308e-03 ... 8.10000000e+00
  4.50200000e+01 8.14938750e+01]
 [6.41346519e-05 3.46832366e-03 1.37873651e-02 ... 2.70000000e+01
  1.58200000e+01 4.70734514e+02]
 ...
 [6.63881638e-05 2.27534536e-03 6.33726487e-03 ... 2.19600000e+01
  6.00000000e-01 1.81375000e-01]
 [7.92516891e-04 4.38366336e-03 1.26577462e-03 ... 0.00000000e+00
  1.24600000e+01 2.05886650e+02]
 [7.92517323e-05 2.13531976e-03 1.35058218e-03 ... 8.82000000e+00
  9.50000000e+00 3.78056250e-01]]
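To make the imputation step concrete, here is a minimal sketch on a toy matrix. The values are made up for illustration and do not come from the dataset; the point is only that `strategy='mean'` fills each NaN with the mean of the observed values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (made-up values, not from the dataset): one NaN per column.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_imputed = toy_imputer.fit_transform(toy)

# Column 0 has observed values 1 and 3, so its NaN becomes 2.0.
# Column 1 has observed values 4 and 8, so its NaN becomes 6.0.
print(toy_imputed)
```

Note that `fit_transform` returns a NumPy array, which is why `clean_data` prints as a plain array above and no longer carries column names.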


In [35]:
scaler = StandardScaler()

clean_data = scaler.fit_transform(clean_data)

print(clean_data)
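The columns printed earlier span very different scales (roughly 1e-05 up to several hundred), and PCA would otherwise be dominated by the largest ones. The sketch below uses synthetic stand-in data, not the dataset, to check what `StandardScaler` does: after scaling, every column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration): two columns
# on very different scales, similar in spirit to the printed array above.
rng = np.random.default_rng(0)
toy = rng.normal(loc=[0.001, 50.0], scale=[0.0005, 20.0], size=(100, 2))

toy_scaled = StandardScaler().fit_transform(toy)

# StandardScaler subtracts the column mean and divides by the column
# standard deviation (population form, ddof=0).
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means, col_stds)
```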

## PCA

Finally, we set up our PCA model, fit it, and transform the data (embedding it in a low-dimensional space).

In [36]:
pca = PCA(n_components=clean_data.shape[1])
# Fit our PCA model and transform the data (storing it in a separate variable named 'transformed')
...

Ellipsis
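The cell above leaves the fit-and-transform step as a placeholder. One possible way to complete it, shown here as a sketch on random stand-in data rather than on the dataset, is scikit-learn's `fit_transform`, which fits the model and projects the data in a single call. The name `demo_data` is hypothetical; in the notebook the input would be `clean_data`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled data (assumption: 50 samples, 6 features).
rng = np.random.default_rng(0)
demo_data = StandardScaler().fit_transform(rng.normal(size=(50, 6)))

# Keep as many components as there are features, as in the cell above.
pca = PCA(n_components=demo_data.shape[1])

# Fit the model and project the data onto the principal components.
transformed = pca.fit_transform(demo_data)
print(transformed.shape)
```

With the default solver, `n_components` must not exceed `min(n_samples, n_features)`, so keeping every component only works when there are at least as many samples as features.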

We can see how much variance each PC explains.

In [37]:
# Display the explained variance ratio of the first 5 PCs
...
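The cell above is also left as a placeholder. A fitted scikit-learn `PCA` object exposes the fraction of variance captured by each component as `explained_variance_ratio_`, so one way to display the first five is to slice that array. The sketch below again uses random stand-in data (`demo_data` is a hypothetical name), so the numbers it prints say nothing about the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled data (assumption: 50 samples, 6 features).
rng = np.random.default_rng(0)
demo_data = StandardScaler().fit_transform(rng.normal(size=(50, 6)))

pca = PCA(n_components=demo_data.shape[1]).fit(demo_data)

# Fraction of total variance explained by each PC, largest first.
ratios = pca.explained_variance_ratio_
print(ratios[:5])

# Running total: how much variance the first k PCs capture together.
print(np.cumsum(ratios)[:5])
```

Because all components are kept, the ratios sum to 1; the cumulative sum is a convenient way to judge how many PCs are worth plotting.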

## Plots

Here we see that two main clusters are apparent in this low-dimensional space.

In [38]:
sns.scatterplot(x=transformed[:,0], y=transformed[:,1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

NameError: name 'transformed' is not defined

After coloring by brain region, we can see that the two clusters separate the hippocampus from the other brain regions.

In [None]:
sns.scatterplot(x=transformed[:,0], y=transformed[:,1], hue=structures)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

We can look at other PCs as well. There are outliers that become apparent here!

In [None]:
sns.scatterplot(x=transformed[:,1], y=transformed[:,2], hue=structures)
plt.xlabel('PC2')
plt.ylabel('PC3')
plt.show()