# Task 7: AutoFeatureSelector Tool
## This task is to test your understanding of various Feature Selection methods outlined in the lecture and the ability to apply this knowledge in a real-world dataset to select best features and also to build an automated feature selection tool as your toolkit

### Use your knowledge of different feature selector methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 players attributes like Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, Heading, Accuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [9]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [10]:
player_df = pd.read_csv("fifa19.csv")

In [11]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [12]:
player_df = player_df[numcols+catcols]

In [13]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [14]:
traindf = pd.DataFrame(traindf,columns=features)

In [15]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [16]:
X.head()

Unnamed: 0,Crossing,Finishing,ShortPassing,Dribbling,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Stamina,...,Nationality_Uganda,Nationality_Ukraine,Nationality_United Arab Emirates,Nationality_United States,Nationality_Uruguay,Nationality_Uzbekistan,Nationality_Venezuela,Nationality_Wales,Nationality_Zambia,Nationality_Zimbabwe
0,84.0,95.0,90.0,97.0,87.0,96.0,91.0,86.0,91.0,72.0,...,0,0,0,0,0,0,0,0,0,0
1,84.0,94.0,81.0,88.0,77.0,94.0,89.0,91.0,87.0,88.0,...,0,0,0,0,0,0,0,0,0,0
2,79.0,87.0,84.0,96.0,78.0,95.0,94.0,90.0,96.0,81.0,...,0,0,0,0,0,0,0,0,0,0
3,17.0,13.0,50.0,18.0,51.0,42.0,57.0,58.0,60.0,43.0,...,0,0,0,0,0,0,0,0,0,0
4,93.0,82.0,92.0,86.0,91.0,91.0,78.0,76.0,79.0,90.0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
len(X.columns)

223

### Set some fixed set of features

In [18]:
feature_name = list(X.columns)
# no of maximum features we need to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [19]:
def get_support_feature(feature_names, coefficients, num_feats):
    feature_indexes = np.argsort(np.abs(coefficients))[-num_feats:]
    cor_support = [index in feature_indexes for index, name in enumerate(feature_names)]
    cor_feature = X.iloc[:,feature_indexes].columns.to_list()
    return cor_support, cor_feature


In [20]:
def cor_selector(X, y, num_feats):
    feature_names = X.columns.to_list()
    coefficients = [np.corrcoef(X[name], y)[0, 1] for name in feature_names]
    coefficients = [0 if np.isnan(coef) else coef for coef in coefficients]
    return get_support_feature(feature_names, coefficients, num_feats)

In [21]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [22]:
cor_feature

['Nationality_Costa Rica',
 'Position_LAM',
 'Nationality_Uruguay',
 'Acceleration',
 'SprintSpeed',
 'Strength',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Stamina',
 'Weak Foot',
 'Agility',
 'Crossing',
 'Nationality_Belgium',
 'Dribbling',
 'ShotPower',
 'LongShots',
 'Finishing',
 'BallControl',
 'FKAccuracy',
 'LongPassing',
 'Volleys',
 'ShortPassing',
 'Position_RF',
 'Position_LF',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_Courtois',
 'Body Type_Neymar',
 'Body Type_Messi',
 'Body Type_C. Ronaldo',
 'Reactions']

## Filter Feature Selection - Chi-Sqaure

In [23]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [24]:
def chi_squared_selector(X, y, num_feats):
    feature_names = X.columns.to_list()
    selector = SelectKBest(score_func=chi2,
                           k=num_feats)
    X_norm = MinMaxScaler().fit_transform(X)
    top_features = selector.fit(X_norm,y)
    coefficients = top_features.scores_
    return get_support_feature(feature_names, coefficients, num_feats)

In [25]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square 

In [26]:
chi_feature

['Nationality_England',
 'BallControl',
 'ShortPassing',
 'Nationality_France',
 'Position_RB',
 'Position_CM',
 'Nationality_Slovakia',
 'Nationality_Spain',
 'LongPassing',
 'LongShots',
 'FKAccuracy',
 'Finishing',
 'Volleys',
 'Nationality_Croatia',
 'Position_LW',
 'Nationality_Egypt',
 'Nationality_Costa Rica',
 'Reactions',
 'Position_LAM',
 'Nationality_Uruguay',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Nationality_Belgium',
 'Position_RF',
 'Position_LF',
 'Body Type_Neymar',
 'Body Type_Messi',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_C. Ronaldo']

## Wrapper Feature Selection - Recursive Feature Elimination

In [27]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [28]:
def rfe_selector(X, y, num_feats):
    feature_names = X.columns.to_list()
    lr = LogisticRegression()
    rfe_lr = RFE(estimator=lr,
                 n_features_to_select=num_feats,
                 step=1,
                 verbose=5)
    X_norm = MinMaxScaler().fit_transform(X)
    rfe_lr = rfe_lr.fit(X_norm, y)
    coefficients = rfe_lr.ranking_
    return get_support_feature(feature_names, coefficients, num_feats)

In [29]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

Fitting estimator with 223 features.
Fitting estimator with 222 features.
Fitting estimator with 221 features.
Fitting estimator with 220 features.
Fitting estimator with 219 features.
Fitting estimator with 218 features.
Fitting estimator with 217 features.
Fitting estimator with 216 features.
Fitting estimator with 215 features.
Fitting estimator with 214 features.
Fitting estimator with 213 features.
Fitting estimator with 212 features.
Fitting estimator with 211 features.
Fitting estimator with 210 features.
Fitting estimator with 209 features.
Fitting estimator with 208 features.
Fitting estimator with 207 features.
Fitting estimator with 206 features.
Fitting estimator with 205 features.
Fitting estimator with 204 features.
Fitting estimator with 203 features.
Fitting estimator with 202 features.
Fitting estimator with 201 features.
Fitting estimator with 200 features.
Fitting estimator with 199 features.
Fitting estimator with 198 features.
Fitting estimator with 197 features.
F

### List the selected features from RFE

In [30]:
rfe_feature

['Nationality_Uzbekistan',
 'Nationality_Bermuda',
 'Nationality_Liberia',
 'Nationality_Guatemala',
 'Nationality_Niger',
 'Nationality_Rwanda',
 'Nationality_Guyana',
 'Nationality_Antigua & Barbuda',
 'Nationality_Mauritius',
 'Nationality_Lebanon',
 'Body Type_Akinfenwa',
 'Nationality_Palestine',
 'Nationality_Mauritania',
 'Nationality_Sudan',
 'Nationality_Guam',
 'Nationality_Afghanistan',
 'Nationality_Hong Kong',
 'Nationality_Nicaragua',
 'Nationality_Jordan',
 'Nationality_Ethiopia',
 'Nationality_Andorra',
 'Nationality_Puerto Rico',
 'Nationality_Grenada',
 'Nationality_Belize',
 'Nationality_St Lucia',
 'Nationality_South Sudan',
 'Nationality_Qatar',
 'Nationality_Botswana',
 'Nationality_Malta',
 'Nationality_Indonesia']

## Embedded Selection - Lasso: SelectFromModel

In [31]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [32]:
def embedded_log_reg_selector(X, y, num_feats):
    feature_names = X.columns.to_list()
    lr = LogisticRegression()
    embedded_selector = SelectFromModel(lr,
                                        max_features=num_feats)
    X_norm = MinMaxScaler().fit_transform(X)
    embedded_selector = embedded_selector.fit(X_norm, y)
    coefficients = embedded_selector.estimator_.coef_[0]
    return get_support_feature(feature_names, coefficients, num_feats)

In [33]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [34]:
embedded_lr_feature

['Nationality_France',
 'Weak Foot',
 'Nationality_Netherlands',
 'Nationality_Croatia',
 'Position_RCB',
 'Position_RM',
 'Position_CAM',
 'Body Type_Courtois',
 'Nationality_Costa Rica',
 'Nationality_Gabon',
 'Body Type_Stocky',
 'Position_RW',
 'Body Type_Lean',
 'Position_LCB',
 'Position_LM',
 'Position_RB',
 'Agility',
 'SprintSpeed',
 'Nationality_Uruguay',
 'Nationality_Belgium',
 'Volleys',
 'Finishing',
 'BallControl',
 'Nationality_Slovenia',
 'Position_CM',
 'Strength',
 'ShortPassing',
 'LongPassing',
 'Position_GK',
 'Reactions']

## Tree based(Random Forest): SelectFromModel

In [35]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [36]:
def embedded_rf_selector(X, y, num_feats):
    rf = RandomForestClassifier(n_estimators=100)
    embedded_selector = SelectFromModel(rf,
                                        max_features=num_feats)
    X_norm = MinMaxScaler().fit_transform(X)
    embedded_selector = embedded_selector.fit(X_norm, y)
    support = embedded_selector.get_support()
    features = X.loc[:, support].columns.tolist()
    return support, features

In [37]:
embeded_rf_support, embeded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embeded_rf_feature)), 'selected features')

24 selected features


In [38]:
embeded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Left',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Germany']

## Tree based(Light GBM): SelectFromModel

In [39]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [40]:
def embedded_lgbm_selector(X, y, num_feats):
    lgbmc = LGBMClassifier(n_estimators=500,
                           learning_rate=0.05,
                           num_leaves=32,
                           colsample_bytree=0.2,
                           reg_alpha=3,
                           reg_lambda=1,
                           min_split_gain=0.01,
                           min_child_weight=40)
    embedded_lgbm_selector = SelectFromModel(lgbmc,
                                             max_features=num_feats)
    X_norm = MinMaxScaler().fit_transform(X)
    embedded_lgbm_selector = embedded_lgbm_selector.fit(X_norm, y)
    embedded_lgbm_support = embedded_lgbm_selector.get_support()
    embedded_lgbm_feature = X.loc[:, embedded_lgbm_support].columns.tolist()
    return embedded_lgbm_support, embedded_lgbm_feature

In [41]:
embedded_lgbm_support, embeded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embeded_lgbm_feature)), 'selected features')

30 selected features


In [42]:
embeded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Left',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CB',
 'Position_CDM',
 'Position_CF',
 'Position_CM',
 'Position_GK',
 'Position_LAM',
 'Position_LB']

## Putting all of it together: AutoFeatureSelector Tool

In [43]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature': feature_name,
                                     'Pearson': cor_support,
                                     'Chi-2': chi_support,
                                     'RFE': rfe_support,
                                     'Logistics': embedded_lr_support,
                                     'Random Forest':embeded_rf_support,
                                     'LightGBM': embedded_lgbm_support})
# count the selected times for each feature
feature_selection_df['Total'] = np.sum(feature_selection_df, axis=1)
# display the top 30
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'],
                                                        ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

  return reduction(axis=axis, out=out, **passkwargs)


Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistics,Random Forest,LightGBM,Total
1,Volleys,True,True,False,True,True,True,5
2,ShortPassing,True,True,False,True,True,True,5
3,Reactions,True,True,False,True,True,True,5
4,LongPassing,True,True,False,True,True,True,5
5,Finishing,True,True,False,True,True,True,5
6,BallControl,True,True,False,True,True,True,5
7,Weak Foot,True,False,False,True,True,True,4
8,Strength,True,False,False,True,True,True,4
9,SprintSpeed,True,False,False,True,True,True,4
10,LongShots,True,True,False,False,True,True,4


## Can you build a Python script that takes dataset and a list of different feature selection methods that you want to try and output the best (maximum votes) features from all methods?

In [44]:
def autoFeatureSelector(dataset_path, num_feats, methods=[]):
    def preprocess_dataset(dataset_path):
        player_df = pd.read_csv(dataset_path)
        numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
        catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']
        player_df = player_df[numcols+catcols]
        traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
        features = traindf.columns
        traindf = traindf.dropna()

        traindf = pd.DataFrame(traindf,columns=features)
        y = traindf['Overall']>=87
        X = traindf.copy()
        del X['Overall']
        return X, y

    def get_support_feature(feature_names, coefficients, num_feats):
        feature_indexes = np.argsort(np.abs(coefficients))[-num_feats:]
        cor_support = [index in feature_indexes for index, name in enumerate(feature_names)]
        cor_feature = X.iloc[:,feature_indexes].columns.to_list()
        return cor_support, cor_feature

    def cor_selector(X, y, num_feats):
        feature_names = X.columns.to_list()
        coefficients = [np.corrcoef(X[name], y)[0, 1] for name in feature_names]
        coefficients = [0 if np.isnan(coef) else coef for coef in coefficients]
        return get_support_feature(feature_names, coefficients, num_feats)

    def chi_squared_selector(X, y, num_feats):
        feature_names = X.columns.to_list()
        selector = SelectKBest(score_func=chi2,
                               k=num_feats)
        X_norm = MinMaxScaler().fit_transform(X)
        top_features = selector.fit(X_norm,y)
        coefficients = top_features.scores_
        return get_support_feature(feature_names, coefficients, num_feats)

    def rfe_selector(X, y, num_feats):
        feature_names = X.columns.to_list()
        lr = LogisticRegression()
        rfe_lr = RFE(estimator=lr,
                     n_features_to_select=num_feats,
                     step=1,
                     verbose=5)
        X_norm = MinMaxScaler().fit_transform(X)
        rfe_lr = rfe_lr.fit(X_norm, y)
        coefficients = rfe_lr.ranking_
        return get_support_feature(feature_names, coefficients, num_feats)

    def embedded_log_reg_selector(X, y, num_feats):
        feature_names = X.columns.to_list()
        lr = LogisticRegression()
        embedded_selector = SelectFromModel(lr,
                                            max_features=num_feats)
        X_norm = MinMaxScaler().fit_transform(X)
        embedded_selector = embedded_selector.fit(X_norm, y)
        coefficients = embedded_selector.estimator_.coef_[0]
        return get_support_feature(feature_names, coefficients, num_feats)

    def embedded_rf_selector(X, y, num_feats):
        rf = RandomForestClassifier(n_estimators=100)
        embedded_selector = SelectFromModel(rf,
                                            max_features=num_feats)
        X_norm = MinMaxScaler().fit_transform(X)
        embedded_selector = embedded_selector.fit(X_norm, y)
        support = embedded_selector.get_support()
        features = X.loc[:, support].columns.tolist()
        return support, features

    def embedded_lgbm_selector(X, y, num_feats):
        lgbmc = LGBMClassifier(n_estimators=500,
                               learning_rate=0.05,
                               num_leaves=32,
                               colsample_bytree=0.2,
                               reg_alpha=3,
                               reg_lambda=1,
                               min_split_gain=0.01,
                               min_child_weight=40)
        embedded_lgbm_selector = SelectFromModel(lgbmc,
                                                 max_features=num_feats)
        X_norm = MinMaxScaler().fit_transform(X)
        embedded_lgbm_selector = embedded_lgbm_selector.fit(X_norm, y)
        embedded_lgbm_support = embedded_lgbm_selector.get_support()
        embedded_lgbm_feature = X.loc[:, embedded_lgbm_support].columns.tolist()
        return embedded_lgbm_support, embedded_lgbm_feature

    X, y = preprocess_dataset(dataset_path)
    methods_support = {'Feature': feature_name}

    if 'pearson' in methods:
        cor_support, _ = cor_selector(X, y, num_feats)
        methods_support['Pearson'] = cor_support
    if 'chi-square' in methods:
        chi_support, _ = chi_squared_selector(X, y, num_feats)
        methods_support['Chi-2'] = chi_support
    if 'rfe' in methods:
        rfe_support, _ = rfe_selector(X, y, num_feats)
        methods_support['RFE'] = rfe_support
    if 'log-reg' in methods:
        embedded_lr_support, _ = embedded_log_reg_selector(X, y, num_feats)
        methods_support['Logistics'] = embedded_lr_support
    if 'rf' in methods:
        embedded_rf_support, _ = embedded_rf_selector(X, y, num_feats)
        methods_support['Random Forest'] = embedded_rf_support
    if 'lgbm' in methods:
        embedded_lgbm_support, _ = embedded_lgbm_selector(X, y, num_feats)
        methods_support['LightGBM'] = embedded_lgbm_support

    pd.set_option('display.max_rows', None)

    feature_selection_df = pd.DataFrame(methods_support)
    feature_selection_df['Total'] = np.sum(feature_selection_df, axis=1)
    feature_selection_df = feature_selection_df.sort_values(['Total','Feature'],
                                                            ascending=False)
    feature_selection_df.index = range(1, len(feature_selection_df)+1)
    return feature_selection_df.head(num_feats)['Feature']

In [45]:
best_features = autoFeatureSelector(dataset_path="fifa19.csv", num_feats=30, methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
best_features

Fitting estimator with 223 features.
Fitting estimator with 222 features.
Fitting estimator with 221 features.
Fitting estimator with 220 features.
Fitting estimator with 219 features.
Fitting estimator with 218 features.
Fitting estimator with 217 features.
Fitting estimator with 216 features.
Fitting estimator with 215 features.
Fitting estimator with 214 features.
Fitting estimator with 213 features.
Fitting estimator with 212 features.
Fitting estimator with 211 features.
Fitting estimator with 210 features.
Fitting estimator with 209 features.
Fitting estimator with 208 features.
Fitting estimator with 207 features.
Fitting estimator with 206 features.
Fitting estimator with 205 features.
Fitting estimator with 204 features.
Fitting estimator with 203 features.
Fitting estimator with 202 features.
Fitting estimator with 201 features.
Fitting estimator with 200 features.
Fitting estimator with 199 features.
Fitting estimator with 198 features.
Fitting estimator with 197 features.
F

  return reduction(axis=axis, out=out, **passkwargs)


1                    Volleys
2               ShortPassing
3                  Reactions
4                LongPassing
5                  Finishing
6                BallControl
7                  Weak Foot
8                   Strength
9                SprintSpeed
10      Nationality_Slovenia
11                 LongShots
12                FKAccuracy
13        Body Type_Courtois
14                   Agility
15                   Stamina
16                 ShotPower
17              Position_LAM
18               Position_CM
19       Nationality_Uruguay
20         Nationality_Gabon
21    Nationality_Costa Rica
22       Nationality_Belgium
23                 Dribbling
24                  Crossing
25              Acceleration
26               Position_RF
27               Position_RB
28               Position_LF
29               Position_GK
30              Position_CAM
Name: Feature, dtype: object

### Last, Can you turn this notebook into a python script, run it and submit the python (.py) file that takes dataset and list of methods as inputs and outputs the best features