Now we want to reduce the dimensionality of the proteomic data.
First, we retrieve the stored data.

In [2]:
%store -r normal_patients
%store -r normal_prot
%store -r all_patients
%store -r all_prot

In [None]:
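# Install scikit-learn into the current conda environment if it is missing.
# You may need to restart the kernel after installing.
%conda install scikit-learn

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler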

In [15]:
print(all_prot.max().max())
print(all_prot.min().min())

7.796314684465253
-7.985884888123239


PCA is sensitive to the scale of each feature, so we want the data normalized first. The values span roughly -8 to 7.8. Because raw expression measurements can't be negative, the data has likely been transformed already (for example, to log ratios).
Let's look at the per-protein means and standard deviations to see what we're dealing with.

In [23]:
prot_dist_stats = all_prot.describe().T[['mean','std']]
print(prot_dist_stats['mean'].min())
print(prot_dist_stats['mean'].max())
print(prot_dist_stats['std'].min())
print(prot_dist_stats['std'].max())

-4.533969510866671
2.0996212270047736
0.11167970676536301
2.98387673374958


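The per-protein means range from about -4.5 to 2.1 and the standard deviations from about 0.11 to 2.98, so the proteins sit on very different scales. The data may have been centered as a whole, but individual proteins have not been standardized.
Let's standardize each protein to zero mean and unit variance.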

In [27]:
scaler = StandardScaler()
prot_scaled = pd.DataFrame(
    scaler.fit_transform(all_prot), 
    index=all_prot.index, 
    columns=all_prot.columns
)
prot_scaled


Name,ARF5,M6PR,ESRRA,FKBP4,NDUFAF7,FUCA2,DBNDD1,SEMA3F,CFTR,CYP51A1,...,DDHD1,WIZ,GBF1,APOA5,WIZ,LDB1,WIZ,RFX7,SWSAP1,SVIL
Database_ID,ENSP00000000233.5,ENSP00000000412.3,ENSP00000000442.6,ENSP00000001008.4,ENSP00000002125.4,ENSP00000002165.5,ENSP00000002501.6,ENSP00000002829.3,ENSP00000003084.6,ENSP00000003100.8,...,ENSP00000500986.2,ENSP00000500993.1,ENSP00000501064.1,ENSP00000501141.1,ENSP00000501256.3,ENSP00000501277.1,ENSP00000501300.1,ENSP00000501317.1,ENSP00000501355.1,ENSP00000501521.1
Patient_ID,,,,,,,,,,,,,,,,,,,,,
01BR001,0.188284,-1.814247,,-0.666932,2.888948,-0.752438,1.420498,-0.411501,,-0.142503,...,-1.362787,-1.327957,-0.919117,,,-1.654606,-0.290793,-0.334686,-1.038080,
01BR008,-1.236712,1.437186,0.421297,-0.807571,1.120189,-1.032339,-2.042332,-0.577602,-0.340193,0.234808,...,1.540711,,0.105539,,,-0.120659,0.708855,0.682004,-1.617735,
01BR009,-0.415389,0.173460,0.755232,-0.506623,3.217055,-0.938549,-0.303962,0.235078,-0.705326,0.693133,...,1.129658,,-1.372489,,,-0.596078,0.362824,0.782083,-0.702105,
01BR010,0.440114,1.180979,-0.808225,-1.287785,0.434174,-0.727899,-0.139721,0.271867,,1.195923,...,-0.851456,,1.400585,,,-1.133106,-0.394588,0.387457,-1.874066,0.613515
01BR015,-1.222947,-1.648399,,-0.008252,-0.588847,-0.557360,0.563014,0.146210,,-0.796452,...,1.028137,-1.143365,-0.785598,,,0.109407,3.209620,-0.169682,-0.181832,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21BR010,1.584004,0.074477,-1.196397,-0.075532,-0.002527,1.122262,0.367695,0.458972,0.869647,-0.059968,...,-0.281975,1.255194,1.031448,,-0.636325,-0.125450,-0.805658,-0.308279,-1.002666,0.542512
22BR005,-1.331816,0.679751,,1.247367,-0.449744,0.138453,-0.158708,0.530704,,-0.263727,...,1.902622,,0.653675,,,0.110199,-0.736742,0.252909,,0.390202
22BR006,1.064039,0.660140,,-0.454233,0.008209,1.167842,-0.804041,-0.213537,1.248475,0.928210,...,-0.074873,,0.666276,-1.616797,,-0.116793,-0.918482,-0.434764,,1.057679
CPT000814,-1.249180,0.976071,0.527714,0.446050,1.418448,-1.784575,1.005284,0.050648,,-0.372543,...,-4.448773,,0.115693,,,-3.302382,2.039134,,,
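
As a sanity check on what the scaling step does, here is a minimal sketch on a small made-up table (`toy` and its values are hypothetical, not taken from `all_prot`). It compares `StandardScaler` against a manual per-column z-score and shows that missing values are skipped when the statistics are computed but remain NaN in the result, which is why NaNs still appear in the table above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for all_prot: rows are patients, columns are proteins.
toy = pd.DataFrame(
    {"protA": [1.0, 2.0, np.nan, 4.0], "protB": [-3.0, -1.0, 0.5, 2.0]},
    index=["p1", "p2", "p3", "p4"],
)

# Manual z-score per column. pandas skips NaN by default, and ddof=0
# matches the population standard deviation that StandardScaler uses.
manual = (toy - toy.mean()) / toy.std(ddof=0)

scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    index=toy.index,
    columns=toy.columns,
)

# The two results agree, and the missing entry is still missing.
print(np.allclose(manual.to_numpy(), scaled.to_numpy(), equal_nan=True))
print(int(scaled.isna().sum().sum()))
```

Any step that can't handle NaN will therefore still need the missing values dealt with after scaling.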


Note that we standardized across all patients. Later, we'll select the rows for the patients of interest from this scaled table rather than standardizing each group separately. But let's cross that bridge when we come to it.
Now that we've normalized the data, we can reduce it to its principal components.

In [28]:
pca = PCA(n_components=2)

x_pca = pca.fit_transform(prot_scaled)
print("Original shape: ", prot_scaled.shape)
print("Reduced shape: ", x_pca.shape)
print("Explained variance ratio: ", pca.explained_variance_ratio_)

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
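
The error message points at imputation as one way forward. The following is only a sketch of that idea on made-up data, not necessarily the approach this analysis ends up using: `toy_scaled`, the 50% missingness cutoff (`max_missing`), and the choice of mean imputation are all assumptions for illustration. It drops proteins that are mostly missing, fills the remaining gaps with each protein's mean, and then runs the same PCA call.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Hypothetical stand-in for prot_scaled: 20 patients x 6 proteins.
toy_scaled = pd.DataFrame(
    rng.normal(size=(20, 6)),
    columns=[f"prot{i}" for i in range(6)],
)
toy_scaled.iloc[::4, 1] = np.nan   # a protein with a few missing values
toy_scaled.iloc[:15, 5] = np.nan   # a protein missing in most patients

# Assumed cutoff: drop proteins missing in more than half of the patients.
max_missing = 0.5
keep = toy_scaled.columns[toy_scaled.isna().mean() <= max_missing]
filtered = toy_scaled[keep]

# Fill the remaining gaps with each protein's mean. For standardized data
# that mean is close to 0, so imputed values sit near the center.
imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(filtered),
    index=filtered.index,
    columns=filtered.columns,
)

toy_pca = PCA(n_components=2)
toy_x_pca = toy_pca.fit_transform(imputed)
print("Filtered shape: ", filtered.shape)
print("Reduced shape: ", toy_x_pca.shape)
print("Explained variance ratio: ", toy_pca.explained_variance_ratio_)
```

Mean imputation is the simplest option and shrinks the variance of the imputed proteins. Dropping every protein with any missing value, or using a more careful imputer, are alternatives to weigh against how much data each one discards.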