In [1]:
import lightgbm as lgbm
import pandas as pd
import numpy as np
from MLFeatureSelection import sequence_selection, importance_selection, coherence_selection,tools

In this Demo notebook, three major selection methods are introducing including 

- ***sequece selection***

- ***importance selection***

- ***coherence selection***

and tools like ***readlog***

you can check the ***algorithms details*** [here](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs)

the detailed ***parameters and functions*** [here](https://github.com/duxuhao/Feature-Selection/blob/master/MLFeatureSelection)

and more ***examples*** [here](https://github.com/duxuhao/Feature-Selection/tree/master/example)

We use the dataset for evaluating the speech intelligibility in a classroom. Details of the dataset is availabel [here](https://github.com/duxuhao/Classroom-Acoustics-Research)

#### 1. Read your dataset

In [2]:
def read():
    df = pd.read_csv('Example/Speech_Intelligibility/CRS.csv')
    return df

#### 2. Define the required loss function based on your requirement, here we use the MAE

In [3]:
def lossfunction(y_pred, y_test):
    """define your own loss function with y_pred and y_test
    return score
    """
    return np.mean(np.abs(y_pred - y_test)/y_test)

#### 3. Define your validation method. It is quite flexible here because you can do whatever you want as long as it return

- evaluation score
- last classifier or estimator

you can do ***k-fold***, ***70%-30%*** validations, etc. In this example, we just simplify the validation with all data training and testing

In [4]:
def validate(X, y, features, clf, lossfunction):
    """define your own validation function with 5 parameters
    input as X, y, features, clf, lossfunction
    clf is set by SetClassifier()
    lossfunction is import earlier
    features will be generate automatically
    function return score and trained classfier
    """
    clf.fit(X[features],y)
    y_pred = clf.predict(X[features])
    score = lossfunction(y,y_pred)
    return score, clf

#### 4. Define the selection methods you use, you can use single method or combination

##### 4.1. sequence selection

- initialized selector with wanted process (sequence, random, cross features) [details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs/sequence_selection.png)

- import dataframe and define the label name

- import loosfunction and improve direction ('ascend' or 'descend') based on evaluation metrics (accuracy, logloss, etc)

- if cross features process is selected, import dictionary of cross method

- define features that are not trainable

- define list initial features combination (start with [] will select from scratch, start with all features will do backward searching at the beginning)

- generate candidate features list

- can set time limit or features limit

- define selected estimator

- set the log file name

- start running

In [5]:
"""
define dictionary of cross methods
"""

def add(x,y):
    return x + y

def substract(x,y):
    return x - y

def times(x,y):
    return x * y

def divide(x,y):
    return (x + 0.001)/(y + 0.001)

def sq(x,y):
    return x ** 2


CrossMethod = {#'+':add,
               #'-':substract,
               '*':times,
               #'/':divide,
               #'^': sq,
               }

In [6]:
def seq(df,f, notusable,estimator):
    sf = sequence_selection.Select(Sequence = True, Random = False, Cross = True) #initialized selector with wanted process
    sf.ImportDF(df,label = 'SI') #import dataframe and define the label name
    sf.ImportLossFunction(lossfunction, direction = 'descend') #import loosfunction and improve direction
    sf.ImportCrossMethod(CrossMethod) #import dictionary of cross method
    sf.InitialNonTrainableFeatures(notusable) #define features that are not trainable
    sf.InitialFeatures(f) #define list initial features combination
    sf.GenerateCol() #generate candidate features list
    sf.clf = estimator #define selected estimator
    sf.SetLogFile('record_seq.log') #set the log file name
    return sf.run(validate) #start running

##### 4.2. importance selection

- initialized selector [details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs/importance_selection.png)

- import dataframe and define the label name

- import loosfunction and improve direction ('ascend' or 'descend') based on evaluation metrics (accuracy, logloss, etc)

- define remove features quantity each iteration

- can set time limit

- define selected estimator

- set the log file name

- start running

In [7]:
def imp(df,f,estimator):
    sf = importance_selection.Select() #initialized selector
    sf.ImportDF(df,label = 'SI') #import dataset
    sf.ImportLossFunction(lossfunction, direction = 'descend')  #import loosfunction and improve direction
    sf.InitialFeatures(f)  #define list initial features combination
    sf.SelectRemoveMode(batch = 1) #define remove features quantity each iteration
    sf.clf = estimator #define selected estimator
    sf.SetLogFile('record_imp.log') #set the log file name
    return sf.run(validate) #start running

##### 4.3. coherence selection

- initialized selector [details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs/coherence_selection.png)

- import dataframe and define the label name

- import loosfunction and improve direction ('ascend' or 'descend') based on evaluation metrics (accuracy, logloss, etc)

- define remove features quantity each iteration and the removable criteria

- can set time limit

- define selected estimator

- set the log file name

- start running

In [8]:
def coh(df,f,estimator):
    sf = coherence_selection.Select() #initialized selector
    sf.ImportDF(df,label = 'SI') #import dataset
    sf.ImportLossFunction(lossfunction, direction = 'descend') #import loosfunction and improve direction 
    sf.InitialFeatures(f) #define list initial features combination
    sf.SelectRemoveMode(batch=1, lowerbound = 0.5) #define remove features quantity each iteration and selection threshold
    sf.clf = estimator #define selected estimator
    sf.SetLogFile('record_coh.log') #set the log file name
    return sf.run(validate) #start running

#### 5. Run combination features selection

In [9]:
def run():
    df = read() # read dataset
    notusable = ['SI'] #not trainable features  
    f = tools.readlog('record2.log',0.086342) # use readlog to read the out log (filename, required score)
    #f = ['SNR','BN'] #initial features combination
    clf = lgbm.LGBMRegressor(num_leaves=35, max_depth=-1)
    uf = f[:]
    print('sequence selection')
    uf = seq(df, uf, notusable,clf)
    print('importance selection')
    uf = imp(df,uf,clf)
    print('coherence selection')
    uf = coh(df,uf,clf)
    return uf

In [10]:
bf = run()

sequence selection
Features Quantity Limit: inf
Time Limit: inf min(s)
test performance of initial features combination
Mean loss: 0.08634175498285059
--------------------start greedy--------------------
SNR
SPL
RT
G
******************** 5 round ********************
BN
0/1
Mean loss: 0.08597677125607062
SNR
reverse 0/4
Mean loss: 0.0869340780457153
SPL
reverse 1/4
Mean loss: 0.08607867917223372
RT
reverse 2/4
Mean loss: 0.09211142301516553
G
reverse 3/4
Mean loss: 0.08948553448408178
******************** 6 round ********************
SNR
reverse 0/4
Mean loss: 0.0869340780457153
SPL
reverse 1/4
Mean loss: 0.08607867917223372
RT
reverse 2/4
Mean loss: 0.09211142301516553
G
reverse 3/4
Mean loss: 0.08948553448408178
--------------------complete greedy--------------------
random select starts with:
 ['SNR', 'SPL', 'RT', 'G', 'BN']
 score: 0.08597677125607062
small cycle cross
0/25
Mean loss: 0.08591000858005234
1/25
Mean loss: 0.08595127270999792
2/25
Mean loss: 0.08590996269800664
3/25
Me

In [11]:
print(bf)

['RT', 'BN', '(SNR*BN)', '(SPL*RT)', '(RT*BN)', '(G*BN)']
