# Feature Selection using Pearson correlation coefficients

## Data

The website https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ hosts a collection of datasets for classification and regression. We will use some of them to test our feature selection algorithms.

In [1]:
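try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

filename = "german.numer_scale"
url = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/" + filename
# download the dataset into the current working directory
f = urlretrieve(url, filename)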

## Preprocessing of the data

_MLlib_ relies on _LabeledPoint_, a data structure that stores a numerical vector (dense or sparse) and a numerical label. An RDD of LabeledPoint objects represents the dataset given as input to train or test supervised machine learning models.

Spark provides a built-in function that transforms a LIBSVM dataset into an RDD[LabeledPoint].

In [2]:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

rdd = MLUtils.loadLibSVMFile(sc, filename)
ncols = rdd.first().features.size  # number of columns (no class) of the dataset
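
To make the input format concrete: each line of a LIBSVM file holds a label followed by sparse `index:value` pairs, with 1-based indexes that Spark converts to 0-based when it builds the feature vector. The sketch below mimics that parsing in plain Python, without Spark. The helper name `parse_libsvm_line` and the sample line are made up for illustration and are not part of MLlib.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-formatted line into (label, {index: value}).

    Hypothetical helper for illustration only. Feature indexes are
    shifted from the 1-based convention of the file to 0-based.
    """
    tokens = line.strip().split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)
    return label, features


# made-up sample line: label 1, three non-zero features
label, features = parse_libsvm_line("1 1:-0.5 3:0.25 24:1")
print(label, features)
```

Features that do not appear on a line are implicitly zero, which is why a sparse vector is a natural fit for this format.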

## Pearson correlation coefficients

In this notebook we first compute the Pearson correlation coefficients (PCCs) between the class and each feature (_scoreClass_), then the PCCs between every pair of features (_scoreMatrix_). Once these intermediate results are available, we proceed to the feature selection.
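
As a reminder, the PCC between two variables is their covariance divided by the product of their standard deviations, so it always lies in [-1, 1]. The sketch below computes it locally with NumPy, without Spark, on made-up toy values. The helper name `pearson` and the data are assumptions for illustration only.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# toy data: a binary class label and one feature that follows it closely
labels = [1, -1, 1, -1, 1]
feature = [0.9, -0.8, 0.7, -0.6, 0.5]
r = pearson(labels, feature)
print(r)
```

The result agrees with `np.corrcoef(labels, feature)[0, 1]`. A constant variable has zero variance, so the denominator becomes zero and the coefficient is undefined in that case.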

In [7]:
from scipy.stats import pearsonr
import numpy as np

def meltLPclass(lp):
    '''
    This function creates a list of k,v tuples, one for each
    label-feature combination. 'k' is the index of the feature
    and 'v' is a tuple of two elements: the value of the label
    and the value of the feature
    
    Parameters
    ----------
    lp : LabeledPoint
        a point in the feature space with label
    '''
    label = lp.label
    features = lp.features
    r = range(features.size)
    return [(i, (label, features[i])) for i in r]

def meltLPfeatures(lp):
    '''
    This function creates a list of k,v tuples, one for each
    feature-feature combination. 'k' is the tuple (i, j) of the
    indexes of the two features, with i < j, and 'v' is the
    tuple of the values of the two features
    
    Parameters
    ----------
    lp : LabeledPoint
        a point in the feature space with label
    '''
    features = lp.features
    r = range(features.size)
    # keep only i < j so that each unordered pair is emitted once
    return [((i, j), (features[i], features[j])) for i in r for j in r if i < j]

def corr(x):
    '''
    This function calculates the Pearson correlation coefficient
    between two variables. It returns the index (or the pair of
    indexes) of the feature(s) and the correlation coefficient
    
    Parameters
    ----------
    x : tuple
        x[0] is a scalar value (or a tuple), representing the index(es) of the feature(s)
        x[1] is a pyspark.resultiterable.ResultIterable object of (value, value) tuples
    '''
    idx = x[0]
    values = list(x[1])
    
    # split the list of pairs into the two variables to correlate
    v1, v2 = zip(*values)
    p = pearsonr(v1, v2)[0]
    
    return (idx, p)

def createScoreMatrix(pairs, ncols):
    '''
    This function builds an NxN matrix, where N = ncols.
    Rows and columns correspond to the indexes of the features
    and values are the computed scores. Only the entries listed
    in 'pairs' are filled; all the others are left at zero.
    
    Parameters
    ----------
    pairs : list
        each element of the list is a k,v tuple,
        'k' is a tuple (a, b), where 'a' and 'b' are both feature indexes
        'v' is the score value
    ncols : int
        number of features (no class) in the dataset
    '''
    scoreMatrix = np.zeros((ncols, ncols))
    
    for i in range(len(pairs)):
        t = pairs[i]
        row = t[0][0]
        col = t[0][1]
        v = t[1]
        # the assignment must stay inside the loop so that every pair is stored
        scoreMatrix[row][col] = v
    
    return scoreMatrix

In [8]:
res = rdd.flatMap(meltLPclass).groupByKey().map(corr).collect()
resIdx, resScore = zip(*res)

scoreClass = [resScore[resIdx.index(i)] for i in range(ncols)]
fpairs = rdd.flatMap(meltLPfeatures).groupByKey().map(corr).collect()
scoreMatrix = createScoreMatrix(fpairs, ncols)
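
The selection step itself is not shown in this section. One plausible way to combine the two intermediate results is a greedy criterion that rewards correlation with the class and penalises correlation with features already chosen. The sketch below is an assumption about how _scoreClass_ and _scoreMatrix_ might be used, not the notebook's actual method. The helper name `greedy_select` and the toy scores are made up, and the scores are assumed to contain no NaN values.

```python
import numpy as np


def greedy_select(score_class, score_matrix, k):
    """Hypothetical helper: greedily pick k feature indexes.

    Each step picks the feature with the highest |PCC with the class|
    minus its mean |PCC| with the features selected so far.
    """
    relevance = np.abs(np.asarray(score_class, dtype=float))
    # the score matrix is filled only above the diagonal, so mirror it
    upper = np.triu(np.abs(np.asarray(score_matrix, dtype=float)), 1)
    redundancy = upper + upper.T

    k = min(k, len(relevance))
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        candidates = [i for i in range(len(relevance)) if i not in selected]
        scores = [relevance[i] - redundancy[i, selected].mean() for i in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected


# toy scores: features 0 and 1 are both relevant but highly redundant
toy_class = [0.9, 0.85, 0.3]
toy_matrix = np.zeros((3, 3))
toy_matrix[0][1] = 0.95
toy_matrix[0][2] = 0.1
toy_matrix[1][2] = 0.2
print(greedy_select(toy_class, toy_matrix, 2))
```

On the toy scores, feature 1 is skipped in favour of feature 2 even though it correlates more strongly with the class, because it is almost a duplicate of feature 0.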