## Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

The data: Protein quantity measurements of brain tissue samples from multiple regions.

Here we read in the data, store the brain-structure label for coloring the plots later, and drop the identifier and non-numeric columns from the dataset.

In [2]:
data = pd.read_csv('ProteinAndPathologyQuantifications.csv')
structures = data['structure_acronym']

In [3]:
clean_data = data.drop(columns=['donor_id', 'donor_name', 'structure_id', 'structure_acronym'])
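
As a quick sanity check on the drop step, the sketch below builds a tiny stand-in frame and confirms that only numeric columns remain. The `protein_a`/`protein_b` columns and all values are made up for illustration; only the four dropped column names come from the cell above.

```python
import pandas as pd

# Toy stand-in for the real CSV; the protein columns and all values are hypothetical.
toy = pd.DataFrame({
    'donor_id': [1, 2],
    'donor_name': ['H1', 'H2'],
    'structure_id': [10, 20],
    'structure_acronym': ['HIP', 'FWM'],
    'protein_a': [0.5, 1.2],
    'protein_b': [3.1, 2.7],
})

# Same drop as in the notebook cell.
toy_clean = toy.drop(columns=['donor_id', 'donor_name', 'structure_id', 'structure_acronym'])

# Any column that is still non-numeric would break the imputer and scaler later.
non_numeric = toy_clean.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Running the same `select_dtypes` check on `clean_data` should show whether any text columns were missed.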

## Pre-processing

Next, we replace NaN values with the mean of each column and standardize the data so that every feature has zero mean and unit variance.

In [4]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
clean_data = ... # fit and transform the data using our imputer

In [5]:
clean_data = ... # fit and transform the data using the StandardScaler
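
The two pre-processing steps can be sketched on a small toy frame. This is one possible way to do it, not necessarily the intended solution to the cells above; the column names and values are hypothetical, and `imputed`/`scaled` are illustrative variable names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value per column (hypothetical values).
toy = pd.DataFrame({
    'protein_a': [1.0, np.nan, 3.0, 5.0],
    'protein_b': [10.0, 20.0, np.nan, 30.0],
})

# Replace each NaN with the mean of its column; the result is a NumPy array.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = imputer.fit_transform(toy)

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(imputed)

print(imputed)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Both `fit_transform` calls return plain arrays, so the column names are lost unless they are kept separately.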

## PCA

Finally, we set up our PCA model, fit it, and transform the data (embed it in a low-dimensional space).

In [6]:
pca = PCA(n_components=clean_data.shape[1])
# Fit our PCA model and transform the data (storing it in a separate variable named 'transformed')
...
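
For reference, fitting and transforming in one call might look like the sketch below. It runs on random stand-in data rather than the protein measurements, and `toy_pca` and `toy_transformed` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the standardized data: 50 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Keep as many components as there are features, as in the cell above.
toy_pca = PCA(n_components=X.shape[1])

# fit_transform learns the components and projects the data onto them.
toy_transformed = toy_pca.fit_transform(X)

print(toy_transformed.shape)
```

Each column of the result holds the sample scores on one principal component, ordered from most to least variance explained.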

We can see how much variance each PC explains.

In [7]:
# Display the explained variance ratio of the first 5 PCs
...
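
A fitted scikit-learn `PCA` object exposes the per-component share of variance as `explained_variance_ratio_`. The sketch below shows one way to inspect it, again on random stand-in data with one deliberately inflated feature; the names `toy_pca` and `ratios` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data: 100 samples, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X[:, 0] *= 5.0  # inflate one feature's variance so the first PC dominates

toy_pca = PCA(n_components=X.shape[1]).fit(X)

# Fraction of total variance explained by each PC, sorted from largest to smallest.
ratios = toy_pca.explained_variance_ratio_
print(ratios[:5])

# The running total shows how many PCs are needed to reach a given share of variance.
print(np.cumsum(ratios)[:5])
```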

## Plots

Here we see that two main clusters are apparent in this low-dimensional space.

In [8]:
sns.scatterplot(x=transformed[:,0], y=transformed[:,1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

After coloring by brain region, we can see that the two islands separate the hippocampus from the other brain regions.

In [9]:
sns.scatterplot(x=transformed[:,0], y=transformed[:,1], hue=structures)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

We can look at other PCs as well. There are outliers that become apparent here!

In [10]:
sns.scatterplot(x=transformed[:,1], y=transformed[:,2], hue=structures)
plt.xlabel('PC2')
plt.ylabel('PC3')
plt.show()