# IDing numerical candidates for PCA

Principal component analysis (PCA) is a powerful way to reduce dimensionality by extracting the principal axes of variation from multiple variables. Once highly correlated variables are identified, the largest principal components can be extracted and the smallest discarded.

<!-- TEASER_END -->

One challenge for data analysts is finding which variables are highly correlated. Here's a way to do it in Python, using NumPy and pandas.

In [116]:
import numpy as np
import pandas as pd

In [100]:
# set up a toy dataset with 10 variables
r = 10
c = 10
np.random.seed([0])
toy_set = np.random.rand(r, c)
column_labels = ['v'+str(i) for i in range(1, c+1)]
toy_df = pd.DataFrame(toy_set, columns=column_labels)

In [101]:
toy_df.head()

,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10
0,0.844422,0.757954,0.420572,0.258917,0.511275,0.404934,0.783799,0.303313,0.476597,0.583382
1,0.908113,0.504687,0.281838,0.755804,0.618369,0.250506,0.909746,0.982785,0.810217,0.902166
2,0.310148,0.729832,0.898838,0.683984,0.472143,0.100701,0.434172,0.610887,0.913011,0.966606
3,0.47701,0.86531,0.260492,0.805028,0.548699,0.014042,0.719705,0.398824,0.824845,0.668153
4,0.001143,0.493578,0.867603,0.243911,0.325204,0.870471,0.191067,0.567511,0.238616,0.96754


Pandas has a method that returns correlation coefficients for a DataFrame (`df.corr()`), but it returns a huge matrix that can be difficult to deal with. I think it's easier to iterate through each combination and build a two-column DataFrame.

In [137]:
toy_df.corr()

,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10
v1,1.0,0.317837,-0.471922,0.03003,0.71122,0.172376,0.444754,-0.207561,0.129242,-0.681106
v2,0.317837,1.0,-0.068083,-0.031912,0.21386,-0.260664,0.232821,-0.239623,0.097666,-0.182455
v3,-0.471922,-0.068083,1.0,-0.028233,-0.257406,-0.106317,-0.148948,-0.022415,-0.239249,0.130396
v4,0.03003,-0.031912,-0.028233,1.0,0.33517,-0.569691,0.49658,0.098425,0.619628,0.096153
v5,0.71122,0.21386,-0.257406,0.33517,1.0,0.198254,0.549044,-0.3619,0.205398,-0.520312
v6,0.172376,-0.260664,-0.106317,-0.569691,0.198254,1.0,-0.537242,-0.451787,-0.682344,-0.286444
v7,0.444754,0.232821,-0.148948,0.49658,0.549044,-0.537242,1.0,0.10941,0.39469,-0.142269
v8,-0.207561,-0.239623,-0.022415,0.098425,-0.3619,-0.451787,0.10941,1.0,0.498102,0.562419
v9,0.129242,0.097666,-0.239249,0.619628,0.205398,-0.682344,0.39469,0.498102,1.0,0.1234
v10,-0.681106,-0.182455,0.130396,0.096153,-0.520312,-0.286444,-0.142269,0.562419,0.1234,1.0


It's hard to read this. The individual values are easy to access, however.

In [144]:
toy_df.corr()['v1']['v2']

0.31783677602419358

Ahhh. Much easier to understand.
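
The chained lookup above selects a column and then a row. A single label-based lookup with `.loc` reads the same cell in one step. Here's a minimal sketch, assuming a small made-up DataFrame (`demo_df` is a hypothetical stand-in for `toy_df`):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for toy_df: 10 rows, 3 variables
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.random((10, 3)), columns=['v1', 'v2', 'v3'])

corr_matrix = demo_df.corr()

# chained lookup: select column 'v1', then row 'v2'
chained = corr_matrix['v1']['v2']

# single label-based lookup: row 'v2', column 'v1' (the same cell)
single = corr_matrix.loc['v2', 'v1']

# the correlation matrix is symmetric, so swapping the labels gives the same value
swapped = corr_matrix.loc['v1', 'v2']

print(chained, single, swapped)
```

Either form works for reading a value. The single lookup just avoids building an intermediate Series.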

The function below calculates the correlations, iterates through each possible pair of variables, and returns a sorted DataFrame that makes it easy to identify highly correlated variables.

[Itertools](https://docs.python.org/2/library/itertools.html#itertools.combinations) is a module in the Python standard library that handles combinations, permutations, etc.
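
As a quick illustration of what `itertools.combinations` produces, here's a small sketch with four made-up labels:

```python
import itertools

labels = ['v1', 'v2', 'v3', 'v4']

# every unordered pair of labels, each appearing exactly once
pairs = list(itertools.combinations(labels, 2))
print(pairs)
# [('v1', 'v2'), ('v1', 'v3'), ('v1', 'v4'), ('v2', 'v3'), ('v2', 'v4'), ('v3', 'v4')]
```

Each pair shows up once, and a variable is never paired with itself, which is what a table of pairwise correlations needs.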

In [155]:
import itertools

def corr_df(data):
    '''
    input: pandas DataFrame
    output: pandas DataFrame listing every possible pair of variables and the square of
            their correlation coefficient (R^2), sorted high-to-low
    '''
    # get column labels
    column_labels = data.columns
    
    # create the initial correlation table
    # (named corr_matrix so it doesn't shadow the function's own name)
    corr_matrix = data.corr()

    # create a generator that will iterate through all possible pairs of variables
    combs = itertools.combinations(column_labels, 2)

    # iterate through each pair, squaring the correlations
    corrs = [[comb, corr_matrix[comb[0]][comb[1]]**2] for comb in combs]
    
    # return a DataFrame of the correlations, sorted high-to-low
    return pd.DataFrame(corrs, columns=['Comb', 'R^2']).sort_values('R^2', ascending=False)

Let's check it out on the toy set.

In [156]:
corr_df(toy_df).head()

,Comb,R^2
3,"(v1, v5)",0.505834
37,"(v6, v9)",0.465593
8,"(v1, v10)",0.463905
28,"(v4, v9)",0.383938
25,"(v4, v6)",0.324548


Looks like it's working fine. A small random dataset like this isn't likely to have highly correlated variables, though. There are no good candidates for PCA.

A dataset with many more variables should yield some highly correlated pairs, if only by chance given that there are just 10 rows, and give an idea of how this function will scale up.
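
With only 10 rows, purely random columns can look strongly correlated once there are enough pairs to choose from. Here's a rough sketch of that effect. The helper `max_r2` and the seed are my own choices for illustration, not part of the analysis in this post:

```python
import numpy as np

def max_r2(data):
    """Largest squared Pearson correlation among all pairs of columns."""
    corr = np.corrcoef(data, rowvar=False)       # column-by-column correlations
    upper = np.triu_indices(corr.shape[0], k=1)  # each pair once, diagonal excluded
    return (corr[upper] ** 2).max()

rng = np.random.default_rng(0)   # arbitrary seed, chosen for illustration
data = rng.random((10, 1000))    # 10 observations of 1000 independent variables

# look at the first 10, 100 and 1000 columns of the same data
for n_cols in (10, 100, 1000):
    print(n_cols, round(max_r2(data[:, :n_cols]), 3))
```

Because each smaller set of columns is a subset of the next, the largest R^2 can only stay the same or grow as more columns are included.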

In [106]:
# set up a larger dataset with 1000 variables
big_r = 10
big_c = 1000
big_set = np.random.rand(big_r, big_c)
big_column_labels = ['v'+str(i) for i in range(1, big_c+1)]
big_df = pd.DataFrame(big_set, columns=big_column_labels)

Let's run the built-in pandas method first, to get a benchmark.

In [134]:
%%time
big_df.corr().head()

CPU times: user 51.2 ms, sys: 5.85 ms, total: 57 ms
Wall time: 62.6 ms


,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v991,v992,v993,v994,v995,v996,v997,v998,v999,v1000
v1,1.0,0.600583,0.265525,0.382618,0.646353,0.051537,0.202357,0.140303,-0.136602,-0.076423,...,0.026115,0.421334,0.100374,-0.056942,0.418734,-0.261678,-0.428925,0.158484,0.071713,0.119796
v2,0.600583,1.0,-0.085824,0.516069,0.249677,0.100644,-0.137207,0.307585,0.446485,-0.084315,...,0.280711,0.36847,0.196018,-0.679984,0.377732,0.268507,0.016908,-0.062176,0.266576,0.138984
v3,0.265525,-0.085824,1.0,-0.348503,0.384715,0.132708,0.130515,0.515675,-0.013319,0.532671,...,0.236077,0.064315,0.208143,0.530063,0.106356,-0.361143,-0.63987,0.518781,0.430666,0.18053
v4,0.382618,0.516069,-0.348503,1.0,0.426266,0.081357,0.060344,-0.057238,0.280474,-0.009246,...,-0.495823,0.041701,0.112435,-0.639355,0.141044,0.327731,0.419215,0.088404,0.11158,-0.078165
v5,0.646353,0.249677,0.384715,0.426266,1.0,0.270706,-0.17268,0.328237,-0.065074,0.447937,...,-0.408177,0.425591,-0.153277,0.012187,0.457297,0.045057,0.002612,0.15172,0.280005,0.011309


In [130]:
%%time
big_corrs = corr_df(big_df)

CPU times: user 6.77 s, sys: 76.7 ms, total: 6.85 s
Wall time: 6.86 s


There's some computational overhead to create this new view: about 7 seconds, roughly 100 times longer than `corr()` alone. That's not too serious here. 1000 variables means that the number of combinations is: $$\dfrac {1000!}{2!\cdot 998!}=499{,}500$$
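
The pair count is easy to double-check in code. This sketch assumes Python 3.8 or later, where `math.comb` is available:

```python
import math

n_vars = 1000

# number of ways to choose 2 variables out of n_vars: n! / (2! * (n - 2)!)
n_pairs = math.comb(n_vars, 2)
print(n_pairs)  # 499500
```

The count grows roughly with the square of the number of variables, so doubling the variables about quadruples the pairs.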

With extremely large datasets, more efficient code may be needed.
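
One possible speed-up, offered as a sketch rather than a drop-in replacement, is to pull the upper triangle out of the correlation matrix with NumPy instead of looking up each pair one at a time. The name `corr_pairs_fast` and the small `demo_df` are hypothetical, and I haven't timed this against the function above:

```python
import numpy as np
import pandas as pd

def corr_pairs_fast(data):
    """Hypothetical vectorized variant: squared correlation for every pair of columns."""
    corr = data.corr().to_numpy()
    labels = data.columns.to_numpy()

    # indices of the upper triangle, excluding the diagonal: each pair once
    rows, cols = np.triu_indices(len(labels), k=1)

    pairs = pd.DataFrame({
        'Comb': list(zip(labels[rows], labels[cols])),
        'R^2': corr[rows, cols] ** 2,
    })
    return pairs.sort_values('R^2', ascending=False)

# small made-up dataset for a quick check
rng = np.random.default_rng(1)
demo_df = pd.DataFrame(rng.random((10, 6)),
                       columns=['v' + str(i) for i in range(1, 7)])
fast = corr_pairs_fast(demo_df)
print(fast.head())
```

`np.triu_indices` walks the pairs in the same order as `itertools.combinations`, so the row labels of the result should line up with those of the loop-based version.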

Let's see if there are any candidates for PCA.

In [150]:
big_corrs[big_corrs['R^2'] >= .95]

,Comb,R^2
359364,"(v471, v521)",0.983346
46827,"(v49, v53)",0.974711
468849,"(v752, v978)",0.955567


Looks like there are three good candidate pairs. Let's take a look at one.

In [120]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(big_df[['v471', 'v521']])
pca.explained_variance_ratio_

array([ 0.99584182,  0.00415818])

The first principal component explains about 99.6% of the variance, so it's reasonably safe to transform `v471` and `v521` into a single variable, thereby reducing the dimensions of the dataset.
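
To actually collapse a pair into one variable, one option is to keep only the first component. Here's a minimal sketch, assuming a made-up pair of strongly correlated columns (`pair_df`, `a` and `b` are hypothetical stand-ins for the two columns in `big_df`):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# hypothetical stand-in for two highly correlated columns
rng = np.random.default_rng(0)
x = rng.random(10)
pair_df = pd.DataFrame({'a': x,
                        'b': 2 * x + rng.normal(scale=0.01, size=10)})

# keep only the first principal component
pca = PCA(n_components=1)
combined = pca.fit_transform(pair_df)

print(combined.shape)                 # (10, 1)
print(pca.explained_variance_ratio_)  # share of variance kept by the single component
```

The single column in `combined` could then replace the original pair in the dataset.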