##### Problem

The goal is to select the three strongest of the ten available features. Ideally these are a1, b1 and c1 (see dataset-generation.ipynb). a2, b2, c2 and c3 may appear as substitutes for those three, but this is less desirable. d1, d2 and d3 should not be selected, as they are just noise.

In [1]:
import findspark
findspark.init( '/usr/local/spark' )
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
import numpy as np
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
from pyspark.ml.regression import RandomForestRegressor, DecisionTreeRegressor, LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

##### Creating a Spark session

In [2]:
Myspark = SparkSession.builder.master( 'local' ).appName( 'Features' ).getOrCreate()
print( Myspark.version )

2.4.0


##### Reading the file

In [None]:
DATA = Myspark.read.csv( '~/Documents/GitHub/feature-selection-spark/data.csv', 
                        inferSchema=True, header=True )

In [4]:
DATA.dtypes

[('label', 'double'),
 ('a1', 'int'),
 ('a2', 'double'),
 ('b1', 'int'),
 ('b2', 'double'),
 ('c1', 'double'),
 ('c2', 'double'),
 ('c3', 'double'),
 ('d1', 'double'),
 ('d2', 'double'),
 ('d3', 'double')]

In [5]:
DATA.show( 1, vertical=True )

-RECORD 0---------------------
 label | 0.05505462852698298  
 a1    | 1                    
 a2    | 1.45847822002572     
 b1    | 0                    
 b2    | -1.529300753353182   
 c1    | -0.17780551487679785 
 c2    | 4.341582850035831    
 c3    | -1.0191714284369406  
 d1    | 0.13589342212035893  
 d2    | 2.0050211070711055   
 d3    | 1.4536196801663166   
only showing top 1 row



In [6]:
names = DATA.columns[ 1: ]

##### The function

In [7]:
def SelectFeatures( dataframe, feature_names, trials, groups=12 ) :
    """ Selects most important uncorrelated features from a high-dimensional dataset. Based on trying diverse
        decision trees in order to uncover hidden relationships.
            dataframe: a Spark dataframe
            feature_names: a list of all feature names considered for modeling
            trials: number of different trees to try (the more, the better; limited by the available computing capacity)
            groups: level of detail of each tree (roughly speaking, the number of groups the total
                    sample is divided into; larger values mean more detailed trees and a larger number of
                    selected features)
        Note:
        It is hard to estimate the number of selected features in advance. Several values for groups= should be
            tried on small trials= (10-30) to understand what value provides approximately the desired number
            of features.
        Returns a list of (metric value, array of relevant column indices) tuples sorted by ascending metric
        value. Trials that produce exactly the same metric value overwrite each other. """
    vector = VectorAssembler( inputCols=feature_names, outputCol='features' )
    evaluator = RegressionEvaluator( metricName='mae' )
    np.random.seed( 333 )
    seeds = np.random.randint( 0, 4e9, size=trials )
    leaf = round( dataframe.count() / groups )
    summary = {}
    for oneseed in seeds :
        tree = RandomForestRegressor( numTrees=1, minInstancesPerNode=leaf, maxDepth=30, maxBins=999, 
                                     subsamplingRate=0.5, featureSubsetStrategy='onethird', seed=oneseed )
        pipe = Pipeline( stages=[ vector, tree ] )
        model = pipe.fit( dataframe )
        metricvalue = evaluator.evaluate( model.transform( dataframe ) )
        summary[ metricvalue ] = model.stages[ 1 ].featureImportances.indices
        del tree, pipe, model, metricvalue
    summary = sorted( summary.items() )
    return summary
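
To see how groups translates into tree size, the arithmetic inside SelectFeatures can be traced by hand. The sketch below is plain Python. The row count of 10,000 is a made-up figure (the notebook does not state the size of data.csv), and the comment on 'onethird' reflects an assumption about Spark's behaviour rather than anything shown in this notebook.

```python
import math

def leaf_size(n_rows, groups):
    """Minimum rows per leaf, computed the same way as in SelectFeatures."""
    return round(n_rows / groups)

def max_leaves(n_rows, groups):
    """Upper bound on the number of leaves if every leaf holds at least leaf_size rows."""
    return n_rows // leaf_size(n_rows, groups)

n_rows = 10_000      # hypothetical; the real row count is not given
n_features = 10      # a1 ... d3

for groups in (8, 12, 13, 20):
    # columns: groups, rows per leaf, maximum number of leaves
    print(groups, leaf_size(n_rows, groups), max_leaves(n_rows, groups))

# Assumption: 'onethird' makes Spark consider ceil(n_features / 3) features per split.
features_per_split = math.ceil(n_features / 3)
print(features_per_split)
```

Because each tree is also fitted with subsamplingRate=0.5, the number of leaves actually reached may well be lower than this bound, which is one reason the number of selected features is hard to predict from groups alone.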

##### Applying the function

In [None]:
possible_features = SelectFeatures( DATA, names, trials=5000, groups=13 )

##### Printing the best 4 feature sets (by MAE)

In [14]:
for i in [ 0, 1, 2, 3 ] :
    print( possible_features[ i ] )

(0.4750700437352839, array([0, 2], dtype=int32))
(0.592239726070981, array([0, 2, 4], dtype=int32))
(0.5924034269231778, array([0, 2, 4], dtype=int32))
(0.6055078534263866, array([0, 2, 5, 6], dtype=int32))


In [13]:
selected = [ names[ i ] for i in possible_features[ 1 ][ 1 ] ]
print( selected )

['a1', 'b1', 'c1']


With a little tuning of the "groups" parameter, the function returns the desired variables in the second- and third-best sets at 5000 trials (the best set contains only two variables, a1 and b1).
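
Picking the second entry by hand works here, but the same choice can be written as a rule: take the lowest-MAE set that has exactly the desired number of features. The helper below, best_set_of_size, is a hypothetical addition and not part of the original function. It uses plain lists copied from the printed output above in place of the NumPy arrays that SelectFeatures returns.

```python
def best_set_of_size(ranked, feature_names, size):
    """Return (mae, names) for the first set in `ranked` with exactly `size` features.

    `ranked` is a list of (mae, indices) tuples sorted by ascending MAE,
    i.e. the shape returned by SelectFeatures. Returns None if nothing matches.
    """
    for mae, indices in ranked:
        if len(indices) == size:
            return mae, [feature_names[i] for i in indices]
    return None

names = ['a1', 'a2', 'b1', 'b2', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3']

# The four best sets as printed above (indices written as plain lists).
ranked = [
    (0.4750700437352839, [0, 2]),
    (0.592239726070981, [0, 2, 4]),
    (0.5924034269231778, [0, 2, 4]),
    (0.6055078534263866, [0, 2, 5, 6]),
]

result = best_set_of_size(ranked, names, size=3)
print(result)
```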

##### Comparison with a regular decision tree

In [8]:
leaf = round( DATA.count() / 12 )
vector = VectorAssembler( inputCols=names, outputCol='features' )
evaluator = RegressionEvaluator( metricName='mae' )
regtree = DecisionTreeRegressor( minInstancesPerNode=leaf, maxDepth=30, maxBins=999 )
pipe = Pipeline( stages=[ vector, regtree ] )
model = pipe.fit( DATA )
regtree_selected = [ names[ i ] for i in model.stages[ 1 ].featureImportances.indices ]
print( regtree_selected )

['c1']


A regular decision tree selects only c1, the weak linear variable. It fails to detect the important variables a1 and b1.
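
One way a greedy tree can miss a pair of variables like this is a pure interaction, where neither variable is informative on its own. The sketch below uses a made-up XOR-style target (the label is 1 when a1 and b1 differ, 0 otherwise). This is an assumption for illustration only, since the real generating process lives in dataset-generation.ipynb and is not shown here. It computes the variance reduction a regression tree would see for a first split on either variable.

```python
from itertools import product
from statistics import pvariance

# Hypothetical balanced data: every (a1, b1) combination appears equally often.
rows = [(a1, b1, float(a1 != b1)) for a1, b1 in product((0, 1), repeat=2)] * 25

def variance_reduction(rows, column):
    """Drop in label variance from splitting on a binary column (0 vs 1)."""
    labels = [r[2] for r in rows]
    left = [r[2] for r in rows if r[column] == 0]
    right = [r[2] for r in rows if r[column] == 1]
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(rows)
    return pvariance(labels) - weighted

gain_a1 = variance_reduction(rows, 0)
gain_b1 = variance_reduction(rows, 1)

# After a split on a1, b1 separates the labels perfectly inside each branch.
branch = [r for r in rows if r[0] == 0]
gain_b1_given_a1 = variance_reduction(branch, 1)

print(gain_a1, gain_b1, gain_b1_given_a1)
```

In this toy setting the first split on a1 or b1 yields no gain at all, so a single greedy tree has no reason to prefer either of them over a weakly linear variable. Randomizing the feature subset and the sample, as SelectFeatures does, may give such pairs a chance to be split on.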

##### Comparison with a lasso regression

In [9]:
lasreg = LinearRegression( maxIter=100, regParam=0.01, elasticNetParam=1.0 )
pipe = Pipeline( stages=[ vector, lasreg ] )
model = pipe.fit( DATA )
model.stages[ 1 ].coefficients

DenseVector([0.0, 0.0, 0.0, 0.0, 0.2939, 0.0, 0.002, 0.0, 0.0061, 0.0])

In [10]:
print( [ names[ i ] for i, item in enumerate( model.stages[ 1 ].coefficients ) if item != 0 ] )

['c1', 'c3', 'd2']


A lasso regression (with the regularization tuned to return three nonzero coefficients) selects only linear variables (c1 and c3) plus one noise variable (d2). It therefore also misses the most important relationship, which involves a1 and b1.
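
The size of the surviving coefficients is also worth a look. The sketch below reuses the coefficient vector printed above as a plain list and applies a cutoff. The 0.01 threshold and the select_by_magnitude helper are arbitrary choices made for illustration, not something used in the notebook.

```python
names = ['a1', 'a2', 'b1', 'b2', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3']

# Coefficients copied from the DenseVector printed above (rounded as displayed).
coefficients = [0.0, 0.0, 0.0, 0.0, 0.2939, 0.0, 0.002, 0.0, 0.0061, 0.0]

def select_by_magnitude(feature_names, coefs, threshold=0.0):
    """Keep the features whose absolute coefficient exceeds `threshold`."""
    return [name for name, c in zip(feature_names, coefs) if abs(c) > threshold]

nonzero = select_by_magnitude(names, coefficients)
above_cutoff = select_by_magnitude(names, coefficients, threshold=0.01)  # arbitrary cutoff

print(nonzero)
print(above_cutoff)
```

With the default threshold this reproduces the three names printed above. With the illustrative cutoff only c1 remains, which suggests that c3 and d2 survive the penalty by a narrow margin, provided the features are on comparable scales.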

##### Comparison with the ChiSqSelector function

In [11]:
# Transforms the target into a binary variable
DATA = DATA.withColumn( 'label', when( DATA.label >= 0, 1 ).otherwise( 0 ) )

In [12]:
chisq = ChiSqSelector( numTopFeatures=3 )
pipe = Pipeline( stages=[ vector, chisq ] )
model = pipe.fit( DATA )
chisq_selected = [ names[ i ] for i in model.stages[ 1 ].selectedFeatures ]
print( chisq_selected )

['d3', 'd1', 'c1']


Chi-squared feature selection chooses one linear variable (c1) and two noise variables (d3 and d1), which is, again, insufficient.

In [13]:
Myspark.stop()

##### Conclusion

The function can select a small set of the most important variables while taking complex relationships among them into account. The other feature-filtering algorithms tried here (a single decision tree, lasso regression and ChiSqSelector) fail to capture such relationships.