# Task 7: AutoFeatureSelector Tool


## This task tests your understanding of the feature selection methods covered in the lecture, your ability to apply them to a real-world dataset to select the best features, and your ability to build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selector methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [422]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [423]:
player_df = pd.read_csv("fifa19.csv")

In [424]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [425]:
player_df = player_df[numcols+catcols]

In [426]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [427]:
traindf = pd.DataFrame(traindf,columns=features)

In [428]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [429]:
X.head()

| | Crossing | Finishing | ShortPassing | Dribbling | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Stamina | ... | Nationality_Uganda | Nationality_Ukraine | Nationality_United Arab Emirates | Nationality_United States | Nationality_Uruguay | Nationality_Uzbekistan | Nationality_Venezuela | Nationality_Wales | Nationality_Zambia | Nationality_Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.0 | 95.0 | 90.0 | 97.0 | 87.0 | 96.0 | 91.0 | 86.0 | 91.0 | 72.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 84.0 | 94.0 | 81.0 | 88.0 | 77.0 | 94.0 | 89.0 | 91.0 | 87.0 | 88.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 79.0 | 87.0 | 84.0 | 96.0 | 78.0 | 95.0 | 94.0 | 90.0 | 96.0 | 81.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 17.0 | 13.0 | 50.0 | 18.0 | 51.0 | 42.0 | 57.0 | 58.0 | 60.0 | 43.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 93.0 | 82.0 | 92.0 | 86.0 | 91.0 | 91.0 | 78.0 | 76.0 | 79.0 | 90.0 | ... | False | False | False | False | False | False | False | False | False | False |


In [430]:
len(X.columns)

223

### Set some fixed set of features

In [431]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [432]:
def cor_selector(X, y,num_feats):
    # Your code goes here (Multiple lines)
    cor_list = []
    feature_name = X.columns.tolist()
    
    # calculate correlation between features and target
    for i in feature_name:
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # convert to dataframe
    feature_value = pd.DataFrame(
        {'Feature': feature_name,
         'Correlation': cor_list})
    
    # sort by absolute correlation value
    feature_value['Correlation'] = feature_value['Correlation'].abs()
    feature_value = feature_value.sort_values('Correlation', ascending=False)
    
    # select top features
    topk_feature = feature_value.iloc[:num_feats, :]
    
    # create boolean mask
    cor_support = []
    for feat in feature_name:
        if feat in topk_feature['Feature'].tolist():
            cor_support.append(True)
        else:
            cor_support.append(False)
            
    # get selected feature names
    cor_feature = X.columns[cor_support].tolist()
    
    # Your code ends here
    return cor_support, cor_feature

In [433]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [434]:
cor_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'ShotPower',
 'Strength',
 'LongShots',
 'Weak Foot',
 'Position_LAM',
 'Position_LF',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
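As a compact alternative to the explicit loop in `cor_selector`, the same ranking could be sketched with pandas' `DataFrame.corrwith`. The helper name `cor_selector_vectorized` and the toy data are illustrative assumptions, not part of the assignment; constant columns (whose correlation is NaN) are treated as zero correlation here, and ties may be broken differently than in the loop version.

```python
import pandas as pd

def cor_selector_vectorized(X, y, num_feats):
    # Hypothetical helper: absolute Pearson correlation of every column with y.
    # Casting to float lets boolean one-hot columns and a boolean target work.
    cor = X.astype(float).corrwith(y.astype(float)).abs().fillna(0.0)
    # Keep the num_feats columns with the largest absolute correlation.
    top = set(cor.nlargest(num_feats).index)
    support = [col in top for col in X.columns]
    feature = X.columns[support].tolist()
    return support, feature

# Toy data (assumed for illustration): "a" tracks the target, "b" does not.
X_toy = pd.DataFrame({"a": [1, 2, 8, 9, 1, 7], "b": [5, 1, 4, 2, 3, 3]})
y_toy = pd.Series([False, False, True, True, False, True])

support_toy, feature_toy = cor_selector_vectorized(X_toy, y_toy, num_feats=1)
print(feature_toy)
```

The function returns the same `(support, feature)` pair as `cor_selector`, so it could be swapped in without changing the voting step.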

## Filter Feature Selection - Chi-Square

In [435]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [436]:
def chi_squared_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    X_norm = MinMaxScaler().fit_transform(X)
    chi_selector = SelectKBest(chi2, k=num_feats)
    chi_selector.fit(X_norm, y)
    chi_support = chi_selector.get_support()
    chi_feature = X.loc[:,chi_support].columns.tolist()
    # Your code ends here
    return chi_support, chi_feature

In [437]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square 

In [438]:
chi_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'LongShots',
 'Position_CM',
 'Position_LAM',
 'Position_LF',
 'Position_LW',
 'Position_RB',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Egypt',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Slovakia',
 'Nationality_Slovenia',
 'Nationality_Spain',
 'Nationality_Uruguay']
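`chi_squared_selector` rescales `X` with `MinMaxScaler` before scoring because scikit-learn's `chi2` only accepts non-negative feature values. The sketch below illustrates that requirement on a small made-up array (`X_demo` and `y_demo` are assumptions for illustration only).

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumed): the first column contains negative values.
X_demo = np.array([[-1.0, 3.0],
                   [0.5, 1.0],
                   [2.0, 0.0],
                   [-0.5, 2.0]])
y_demo = np.array([0, 1, 1, 0])

# chi2 rejects negative inputs with a ValueError.
try:
    chi2(X_demo, y_demo)
    raised = False
except ValueError:
    raised = True

# Rescaling each column to [0, 1] makes the input valid for chi2.
X_scaled = MinMaxScaler().fit_transform(X_demo)
scores, p_values = chi2(X_scaled, y_demo)
print(raised, scores.shape)
```

The FIFA attributes used here are already non-negative, so the scaling mainly puts the numerical ratings and the 0/1 dummy columns on a comparable range.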

## Wrapper Feature Selection - Recursive Feature Elimination

In [439]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [440]:
def rfe_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    X_norm = MinMaxScaler().fit_transform(X)
    rfe_selector = RFE(
        estimator=LogisticRegression(random_state=42),
        n_features_to_select=num_feats,
        step=1
    )
    rfe_selector.fit(X_norm, y)
    rfe_support = rfe_selector.get_support()
    rfe_feature = X.loc[:,rfe_support].columns.tolist()
    # Your code ends here
    return rfe_support, rfe_feature

In [441]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

30 selected features


### List the selected features from RFE

In [442]:
rfe_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'Reactions',
 'Strength',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LF',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_Lean',
 'Body Type_Normal',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovenia',
 'Nationality_Uruguay',
 'Nationality_Wales']

## Embedded Selection - L1-penalized Logistic Regression (Lasso-style): SelectFromModel

In [443]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [444]:
def embedded_log_reg_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    X_norm = MinMaxScaler().fit_transform(X)
    embedded_lr_selector = SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear',random_state=42),
        max_features=num_feats
    )
    embedded_lr_selector.fit(X_norm, y)
    embedded_lr_support = embedded_lr_selector.get_support()
    embedded_lr_feature = X.loc[:,embedded_lr_support].columns.tolist()
    # Your code ends here
    return embedded_lr_support, embedded_lr_feature

In [445]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

27 selected features


In [446]:
embedded_lr_feature

['LongPassing',
 'Reactions',
 'Balance',
 'Aggression',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RCB',
 'Position_RW',
 'Body Type_Lean',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Brazil',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Germany',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Nationality_Portugal',
 'Nationality_Slovenia',
 'Nationality_Uruguay']

## Tree-based (Random Forest): SelectFromModel

In [447]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [448]:
def embedded_rf_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    X_norm = MinMaxScaler().fit_transform(X)
    embedded_rf_selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        max_features=num_feats
    )
    embedded_rf_selector.fit(X_norm, y)
    embedded_rf_support = embedded_rf_selector.get_support()
    embedded_rf_feature = X.loc[:,embedded_rf_support].columns.tolist()
    # Your code ends here
    return embedded_rf_support, embedded_rf_feature

In [449]:
embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')

24 selected features


In [450]:
embedded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Slovenia']

## Tree-based (Light GBM): SelectFromModel

In [451]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [452]:
def embedded_lgbm_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    X_norm = MinMaxScaler().fit_transform(X)
    lgbc = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, random_state=42, verbosity=-1) # Verbosity = -1 to suppress warnings
    embedded_lgbm_selector = SelectFromModel(lgbc, max_features=num_feats)
    embedded_lgbm_selector.fit(X_norm, y)
    embedded_lgbm_support = embedded_lgbm_selector.get_support()
    embedded_lgbm_feature = X.loc[:,embedded_lgbm_support].columns.tolist()
    # Your code ends here
    return embedded_lgbm_support, embedded_lgbm_feature

In [453]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')



22 selected features


In [454]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Body Type_Lean',
 'Nationality_Germany',
 'Nationality_Italy']
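The three `SelectFromModel` selectors above return 27, 24, and 22 features rather than 30. In scikit-learn, `max_features` is only an upper bound: features must also pass `threshold`, which by default is `1e-5` for L1-penalized estimators and the mean importance otherwise. The sketch below, on synthetic data (an assumption, not the FIFA dataset), shows that passing `threshold=-np.inf` makes the selector keep exactly `max_features` features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumed): 20 features, only 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
k = 10

# Default threshold ("mean" for a random forest): at most k features survive.
default_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), max_features=k
).fit(X_demo, y_demo)

# threshold=-np.inf disables the importance cut, so exactly k are kept.
exact_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=k,
    threshold=-np.inf,
).fit(X_demo, y_demo)

n_default = int(default_sel.get_support().sum())
n_exact = int(exact_sel.get_support().sum())
print(n_default, n_exact)
```

Whether to force exactly `num_feats` features is a design choice: keeping the default threshold lets each embedded method drop features it considers unimportant, which is why the vote counts per method differ.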

## Putting all of it together: AutoFeatureSelector Tool

In [455]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,
                                    'Random Forest':embedded_rf_support, 'LightGBM':embedded_lgbm_support})
# count how many methods selected each feature
feature_selection_df['Total'] = feature_selection_df.iloc[:, 1:].astype(int).sum(axis=1)
# display the top num_feats features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | LightGBM | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | Reactions | True | True | True | True | True | True | 6 |
| 2 | LongPassing | True | True | True | True | True | True | 6 |
| 3 | Volleys | True | True | True | False | True | True | 5 |
| 4 | ShortPassing | True | True | True | False | True | True | 5 |
| 5 | Nationality_Slovenia | True | True | True | True | True | False | 5 |
| 6 | Finishing | True | True | True | False | True | True | 5 |
| 7 | BallControl | True | True | True | False | True | True | 5 |
| 8 | Strength | True | False | True | False | True | True | 4 |
| 9 | SprintSpeed | True | False | True | False | True | True | 4 |
| 10 | Nationality_Uruguay | True | True | True | True | False | False | 4 |
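Because the cell above sorts with `ascending=False` on both keys, features with equal vote counts appear in reverse alphabetical order (Reactions before LongPassing). If alphabetical tie-breaking were preferred, a per-column `ascending` list could be passed instead. The sketch below uses a small made-up vote table to illustrate; the values are assumptions, not results from the notebook.

```python
import pandas as pd

# Toy vote table (assumed): two methods voting on four features.
votes = pd.DataFrame({
    "Feature": ["Reactions", "LongPassing", "Volleys", "Balance"],
    "Pearson": [True, True, True, False],
    "RFE": [True, True, False, False],
})

# Summing booleans across the method columns counts the votes per feature.
votes["Total"] = votes.drop(columns="Feature").sum(axis=1)

# Most votes first; ties broken alphabetically by feature name.
ranked = votes.sort_values(
    ["Total", "Feature"], ascending=[False, True]
).reset_index(drop=True)
print(ranked["Feature"].tolist())
```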


## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best (maximum votes) features across all methods?

In [456]:
def preprocess_dataset(dataset_path):
    # Your code starts here (Multiple lines)
    player_df = pd.read_csv(dataset_path)
    
    # Define numerical and categorical columns
    numcols = ['Overall', 'Crossing','Finishing', 'ShortPassing', 'Dribbling','LongPassing', 
               'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Stamina','Volleys',
               'FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots',
               'Aggression','Interceptions']
    catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']
    
    # Select only the columns we want
    player_df = player_df[numcols+catcols]
    
    # Create the training dataframe with one-hot encoding for categorical variables
    traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])], axis=1)
    traindf = traindf.dropna()
    
    # Create X and y
    y = traindf['Overall']>=87
    X = traindf.copy()
    del X['Overall']
    
    # Set number of features to select
    num_feats = 30
    
    # Your code ends here
    return X, y, num_feats

In [457]:
def autoFeatureSelector(dataset_path, methods=None):
    # Parameters
    # dataset_path - path to the dataset to be analyzed (csv file)
    # methods - names of the feature selection methods outlined above to run (list)
    methods = methods or []  # avoid a mutable default argument
    
    # preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)
    feature_name = list(X.columns)
    
    # Dictionary to store support indicators for each method
    support_dict = {}
    feature_dict = {}
    
    # Run every method we outlined above from the methods list and collect returned best features from every method
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y,num_feats)
        support_dict['pearson'] = cor_support
        feature_dict['pearson'] = cor_feature
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
        support_dict['chi-square'] = chi_support
        feature_dict['chi-square'] = chi_feature
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
        support_dict['rfe'] = rfe_support
        feature_dict['rfe'] = rfe_feature
    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
        support_dict['log-reg'] = embedded_lr_support
        feature_dict['log-reg'] = embedded_lr_feature
    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
        support_dict['rf'] = embedded_rf_support
        feature_dict['rf'] = embedded_rf_feature
    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        support_dict['lgbm'] = embedded_lgbm_support
        feature_dict['lgbm'] = embedded_lgbm_feature
    
    # Combine all the above feature list and count the maximum set of features that got selected by all methods
    #### Your Code starts here (Multiple lines)
    # Create dataframe with all selection methods
    feature_selection_df = pd.DataFrame({'Feature':feature_name})
    # iterate over the methods that actually ran, so an unrecognized name cannot raise a KeyError
    for method in support_dict:
        feature_selection_df[method] = support_dict[method]
    
    # Count the total votes for each feature
    feature_selection_df['Total'] = feature_selection_df.iloc[:, 1:].sum(axis=1)
    
    # Sort features by total votes and feature name
    feature_selection_df = feature_selection_df.sort_values(['Total','Feature'], ascending=False)
    
    # Select features with maximum votes
    max_votes = feature_selection_df['Total'].max()
    best_features = feature_selection_df[feature_selection_df['Total'] == max_votes]['Feature'].tolist()
    
    #### Your Code ends here
    return best_features

In [458]:
best_features = autoFeatureSelector(dataset_path="fifa19.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
best_features



['Reactions', 'LongPassing']

### Last, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?

In [459]:
# Yes. File is called auto_feature_selector.py
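The script itself is not shown in this notebook. As a rough sketch of how such a script might accept its inputs, the command-line interface below uses `argparse`; the function name `parse_args`, the `--methods` flag, and the positional `dataset_path` argument are assumptions for illustration. Only the method keys are taken from `autoFeatureSelector` above.

```python
import argparse

# Method keys recognized by autoFeatureSelector in this notebook.
VALID_METHODS = ["pearson", "chi-square", "rfe", "log-reg", "rf", "lgbm"]

def parse_args(argv=None):
    # Hypothetical CLI: a dataset path plus an optional list of methods.
    parser = argparse.ArgumentParser(
        description="Vote-based automatic feature selection (sketch)."
    )
    parser.add_argument("dataset_path", help="path to the CSV dataset")
    parser.add_argument(
        "--methods",
        nargs="+",
        choices=VALID_METHODS,
        default=VALID_METHODS,
        help="feature selection methods to run (default: all)",
    )
    return parser.parse_args(argv)

# Example invocation with an explicit argument list instead of sys.argv.
args = parse_args(["fifa19.csv", "--methods", "pearson", "rfe"])
print(args.dataset_path, args.methods)
```

In a real script, the parsed values would then be passed on as `autoFeatureSelector(dataset_path=args.dataset_path, methods=args.methods)`.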

# Conclusions

## Comparison of Feature Selection Methods

After analyzing the outputs of different feature selection methods:

1. **Pearson Correlation**
   - Focused heavily on numerical player attributes (Crossing, Finishing, ShortPassing)
   - Selected Body Type categories that appear to identify individual players (C. Ronaldo, Messi, Neymar)
   - Good mix of technical and physical attributes

2. **Chi-Square**
   - More emphasis on categorical variables
   - Selected more nationality features
   - Strong focus on playing positions (CM, LAM, LF)

3. **RFE (Recursive Feature Elimination)**
   - Balanced selection between numerical and categorical
   - Emphasized positional features
   - Included key physical attributes (Acceleration, SprintSpeed)

4. **Logistic Regression (Embedded)**
   - Selected fewer numerical attributes
   - Strong preference for positional and nationality features
   - Focused on tactical positions (CAM, CM, LM)

5. **Random Forest & LightGBM**
   - Most consistent with core player attributes
   - Selected fewer categorical features
   - Selected all 19 numerical attributes, technical and physical alike

### Key Findings:
- Technical skills were selected most consistently: Reactions and LongPassing by all 6 methods, BallControl by 5
- Tree-based methods (RF, LGBM) agreed most closely, sharing all 19 numerical attributes
- Position and nationality features were more prominent in Chi-Square, RFE, and L1 logistic regression
- Physical attributes such as Strength and SprintSpeed were selected by 4 of the 6 methods, slightly less often than the top technical skills
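One way to put a number on the agreement between two methods is the Jaccard index (intersection size divided by union size) of their selected feature sets. The sketch below applies it to the Random Forest and LightGBM lists printed earlier in this notebook; the helper name `jaccard` is an assumption for illustration.

```python
# The 19 numerical attributes that both tree-based selectors returned above.
numeric = ["Crossing", "Finishing", "ShortPassing", "Dribbling", "LongPassing",
           "BallControl", "Acceleration", "SprintSpeed", "Agility", "Stamina",
           "Volleys", "FKAccuracy", "Reactions", "Balance", "ShotPower",
           "Strength", "LongShots", "Aggression", "Interceptions"]

# Selected features as listed in the Random Forest and LightGBM outputs.
rf_selected = set(numeric + ["Weak Foot", "Preferred Foot_Right",
                             "Body Type_Courtois", "Body Type_Normal",
                             "Nationality_Slovenia"])
lgbm_selected = set(numeric + ["Body Type_Lean", "Nationality_Germany",
                               "Nationality_Italy"])

def jaccard(a, b):
    # Hypothetical helper: share of features common to both selections.
    return len(a & b) / len(a | b)

overlap = jaccard(rf_selected, lgbm_selected)
print(round(overlap, 3))
```

The two sets share 19 of 27 distinct features, an overlap of about 0.70.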

### Best Features Analysis
When all 6 feature selection methods are run together, only two features are selected unanimously:

1. **Reactions** - This mental attribute represents a player's ability to respond quickly to game situations. Its selection by every method suggests it is a strong predictor of high-rated players (Overall >= 87).
2. **LongPassing** - This technical skill measures a player's ability to make accurate long-distance passes. Its unanimous selection suggests that elite players (rated 87+) tend to excel in this fundamental aspect of the game.

Agreement on these two features across filter methods (correlation, chi-square), wrapper methods (RFE), and embedded methods (logistic regression, random forest, LightGBM) supports their importance in determining top-tier player ratings in FIFA 19.