##### Problem

It was agreed that we need to select the three strongest of the ten available features. Preferably, these should be a1, b1 and c1 (see dataset-generation.ipynb). a2, b2, c2 and c3 may also appear as substitutes for those three, but this is less desirable. d1, d2 and d3 should not be selected, as they are just noise.
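To make the acceptance criteria above easy to check, here is a minimal sketch that counts how many selected features fall into each group. The helper `grade_selection` and the three set names are hypothetical and are not part of the notebook; the group memberships are taken from the problem statement.

```python
# Feature groups as described in the problem statement.
PREFERRED = {"a1", "b1", "c1"}
SUBSTITUTES = {"a2", "b2", "c2", "c3"}
NOISE = {"d1", "d2", "d3"}


def grade_selection(selected):
    """Count how many selected feature names fall into each group (hypothetical helper)."""
    chosen = set(selected)
    return {
        "preferred": len(chosen & PREFERRED),
        "substitutes": len(chosen & SUBSTITUTES),
        "noise": len(chosen & NOISE),
    }


# The ideal outcome: all three preferred features, nothing else.
print(grade_selection(["a1", "b1", "c1"]))
# A poor outcome: one preferred feature and two noise features.
print(grade_selection(["c1", "d1", "d3"]))
```

A selection with a nonzero "noise" count is a failure by the criteria above; a nonzero "substitutes" count is tolerable but not ideal.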

In [1]:
import findspark
findspark.init( '/usr/local/spark' )
import pyspark
from pyspark.sql import SparkSession
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import LinearRegression

##### Creating a Spark session

In [2]:
Myspark = SparkSession.builder.master( 'local' ).appName( 'Features' ).getOrCreate()
print( Myspark.version )

2.4.0


##### Reading the file

In [3]:
DATA = Myspark.read.csv( '/home/demetrius/Documents/GitHub/feature-selection-spark/data.csv', 
                        inferSchema=True, header=True )

In [4]:
DATA.dtypes

[('label', 'double'),
 ('a1', 'int'),
 ('a2', 'double'),
 ('b1', 'int'),
 ('b2', 'double'),
 ('c1', 'double'),
 ('c2', 'double'),
 ('c3', 'double'),
 ('d1', 'double'),
 ('d2', 'double'),
 ('d3', 'double')]

In [5]:
DATA.show( 1, vertical=True )

-RECORD 0---------------------
 label | 0.05505462852698298  
 a1    | 1                    
 a2    | 1.45847822002572     
 b1    | 0                    
 b2    | -1.529300753353182   
 c1    | 0.1574185974126949   
 c2    | 1.8302946297668106   
 c3    | -0.08012381649940253 
 d1    | -0.5668537223849737  
 d2    | 2.400231562805219    
 d3    | 3.2270143265957785   
only showing top 1 row



In [6]:
names = DATA.columns[ 1: ]

##### The function

In [8]:
def SelectFeatures( dataframe, feature_names, trials, groups=12 ) :
    """ Selects the most important uncorrelated features from a high-dimensional dataset. Based on trying diverse
        decision trees in order to uncover hidden relationships.
            dataframe: a Spark dataframe with a 'label' column
            feature_names: a list of all feature names considered for modeling
            trials: number of different trees to try (the more, the better; limited by the available computing capacity)
            groups: level of detail of each tree (roughly speaking, the number of groups the total
                    sample is divided into; larger values mean more detailed trees and a larger number of
                    selected features)
        Note:
        It is hard to estimate the number of selected features in advance. Several values for groups= should be
            tried with a small trials= (10-30) to find the value that yields approximately the desired number
            of features.
        Returns a list of (MAE, feature indices) tuples sorted by ascending MAE; the indices refer to
        positions in feature_names """
    vector = VectorAssembler( inputCols=feature_names, outputCol='features' )
    # assemble the feature vector from the dataframe passed in
    assembled = vector.transform( dataframe )
    evaluator = RegressionEvaluator( metricName='mae' )
    # minimum number of rows per leaf; limits how finely each tree can partition the sample
    leaf = round( dataframe.count() / groups )
    forest = RandomForestRegressor( numTrees=trials, minInstancesPerNode=leaf, maxDepth=30, maxBins=999, 
                                     subsamplingRate=0.5, featureSubsetStrategy='onethird', seed=333 )
    model = forest.fit( assembled )
    summary = []
    for i in range( trials ) :
        print( i )  # progress indicator
        tree = model.trees[ i ]
        # score each tree on its own and record the features it actually used
        metricvalue = evaluator.evaluate( tree.transform( assembled ) )
        summary.append( ( metricvalue, tree.featureImportances.indices ) )
    # a list (rather than a dict keyed by MAE) keeps trees that happen to share the same MAE
    summary.sort( key=lambda pair : pair[ 0 ] )
    return summary
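The docstring's note about `groups` can be made more concrete with some rough arithmetic. The sketch below is an approximation under stated assumptions, not a description of Spark internals: it assumes each tree sees about `subsamplingRate` times the row count, that `'onethird'` means one third of the features rounded up, and it uses a hypothetical row count of 10000 because the notebook does not show the size of data.csv. The helper `tree_budget` is hypothetical.

```python
import math


def tree_budget(n_rows, groups, subsampling_rate=0.5, n_features=10):
    """Rough per-tree limits implied by the forest settings (hypothetical helper)."""
    leaf = round(n_rows / groups)              # minInstancesPerNode, as computed in SelectFeatures
    sampled = n_rows * subsampling_rate        # approximate number of rows one tree is trained on
    max_leaves = max(1, int(sampled // leaf))  # every leaf must hold at least `leaf` rows
    max_splits = max_leaves - 1                # a binary tree with k leaves has k - 1 splits
    per_split = math.ceil(n_features / 3)      # assumed meaning of featureSubsetStrategy='onethird'
    return {
        "leaf": leaf,
        "max_leaves": max_leaves,
        "max_features_used": max_splits,       # each split uses one feature, so this is an upper bound
        "candidates_per_split": per_split,
    }


# n_rows=10000 is an assumed value chosen only for illustration.
print(tree_budget(n_rows=10000, groups=14))
print(tree_budget(n_rows=10000, groups=12))
```

Under these assumptions a larger `groups` raises the upper bound on the number of features a single tree can use, which matches the docstring's advice to tune `groups` until the selected sets have roughly the desired size.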

##### Applying the function

In [9]:
from time import time

In [10]:
start = time()
possible_features = SelectFeatures( DATA, names, trials=100, groups=14 )
print( time() - start )

0
1
2
...
99
426.4614806175232


##### Printing the best 3 feature sets (by MAE)

In [11]:
for i in [ 0, 1, 2 ] :
    print( possible_features[ i ] )

(0.644477272060006, array([0, 1, 2, 6], dtype=int32))
(0.7196706658336792, array([4], dtype=int32))
(0.7202045800323651, array([3, 4], dtype=int32))


In [13]:
selected = [ names[ i ] for i in possible_features[ 2 ][ 1 ] ]
print( selected )

['b2', 'c1']
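The same index-to-name lookup can be applied to all three of the printed sets at once. The sketch below hard-codes the column order from `DATA.dtypes` and the three results printed above, so it runs without Spark; `to_names` is a hypothetical helper.

```python
# Feature columns in the order reported by DATA.dtypes (label excluded).
names = ["a1", "a2", "b1", "b2", "c1", "c2", "c3", "d1", "d2", "d3"]

# The three best (MAE, feature indices) pairs copied from the output above.
top_sets = [
    (0.644477272060006, [0, 1, 2, 6]),
    (0.7196706658336792, [4]),
    (0.7202045800323651, [3, 4]),
]


def to_names(indices, feature_names):
    """Translate positional feature indices into column names (hypothetical helper)."""
    return [feature_names[i] for i in indices]


for mae, indices in top_sets:
    print(round(mae, 4), to_names(indices, names))
```

In this run the lowest-MAE tree uses a1, a2, b1 and c3, the second uses only c1, and the third uses b2 and c1.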


So, with a little tuning of the 'groups' parameter (to restrict the selection to only 3 variables), the function returns the desired set of variables (see dataset-generation.ipynb) at 1000 trials. Note that the run shown above uses trials=100.

##### Comparison with a regular decision tree

In [7]:
leaf = round( DATA.count() / 12 )
vector = VectorAssembler( inputCols=names, outputCol='features' )
evaluator = RegressionEvaluator( metricName='mae' )
regtree = DecisionTreeRegressor( minInstancesPerNode=leaf, maxDepth=30, maxBins=999 )
pipe = Pipeline( stages=[ vector, regtree ] )
model = pipe.fit( DATA )
regtree_selected = [ names[ i ] for i in model.stages[ 1 ].featureImportances.indices ]
print( regtree_selected )

['c1', 'd1', 'd3']


A regular decision tree selects the weak linear variable (c1) and two garbage variables (d1 and d3). It is unable to detect the important relationship between a1, b1 and the label.
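One way a greedy tree can miss such a relationship is sketched below. The real dependence of the label on a1 and b1 is defined in dataset-generation.ipynb and is not shown here, so this toy example assumes an XOR-style interaction purely for illustration. With that assumption, splitting on either feature alone yields no variance reduction, so a greedy split criterion has no reason to pick it first.

```python
from itertools import product


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(rows, feature):
    """Variance reduction obtained by splitting the rows on one binary feature."""
    labels = [r["label"] for r in rows]
    left = [r["label"] for r in rows if r[feature] == 0]
    right = [r["label"] for r in rows if r[feature] == 1]
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(rows)
    return variance(labels) - weighted


# ASSUMPTION: a toy XOR interaction, not the notebook's actual data-generating rule.
rows = [{"a1": a, "b1": b, "label": float(a ^ b)} for a, b in product([0, 1], repeat=2)]

gain_a1 = split_gain(rows, "a1")   # gain from splitting on a1 alone
gain_b1 = split_gain(rows, "b1")   # gain from splitting on b1 alone

# Once the data is already split on a1, b1 becomes fully informative.
a1_zero = [r for r in rows if r["a1"] == 0]
gain_b1_given_a1 = split_gain(a1_zero, "b1")

print(gain_a1, gain_b1, gain_b1_given_a1)
```

Randomizing the candidate features per split and the training sample, as the forest inside `SelectFeatures` does, gives some trees a chance to make the initially unrewarding split that a single deterministic tree would skip.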

##### Comparison with a lasso regression

In [21]:
lasreg = LinearRegression( maxIter=100, regParam=0.011, elasticNetParam=1.0 )
pipe = Pipeline( stages=[ vector, lasreg ] )
model = pipe.fit( DATA )
model.stages[ 1 ].coefficients

DenseVector([0.0, 0.0, 0.0, -0.0006, 0.1896, 0.0, 0.0, 0.0007, 0.0, 0.0])

In [25]:
print( [ names[ i ] for i, item in enumerate( model.stages[ 1 ].coefficients ) if item != 0 ] )

['b2', 'c1', 'd1']


A lasso regression (with the regularization strength adjusted to leave 3 nonzero coefficients) selects none of the "a" features, so it also fails to detect the most important relationship, the one between a1 and b1.
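Ranking the nonzero coefficients by magnitude shows how lopsided this selection is. The sketch below copies the rounded values from the `DenseVector` printed above, so it runs without Spark. Comparing raw magnitudes is only meaningful if the features are on similar scales, which is an assumption here.

```python
# Feature columns in the order reported by DATA.dtypes (label excluded).
names = ["a1", "a2", "b1", "b2", "c1", "c2", "c3", "d1", "d2", "d3"]

# Rounded lasso coefficients as printed in the notebook output.
coefficients = [0.0, 0.0, 0.0, -0.0006, 0.1896, 0.0, 0.0, 0.0007, 0.0, 0.0]

# Keep only the features the lasso did not shrink to zero, largest magnitude first.
ranked = sorted(
    ((abs(coef), name) for name, coef in zip(names, coefficients) if coef != 0),
    reverse=True,
)

# Share of the total absolute coefficient mass carried by the top feature.
total = sum(magnitude for magnitude, _ in ranked)
top_share = ranked[0][0] / total

for magnitude, name in ranked:
    print(name, magnitude)
print("share of top feature:", round(top_share, 3))
```

On these rounded values c1 carries almost all of the coefficient mass, while b2 and d1 are barely above zero.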

In [None]:
Myspark.stop()

##### Conclusion

The function is able to select a small set of the most important variables while taking into account complex relationships among them. The other feature-filtering algorithms compared here (a single decision tree and a lasso regression) fail to capture all kinds of relationships.