# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods outlined in the lecture, your ability to apply them to a real-world dataset to select the best features, and your ability to build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selection methods to build an automatic feature selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [3]:
player_df = pd.read_csv("./fifa19.csv")

In [4]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [5]:
player_df = player_df[numcols+catcols]

In [6]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [7]:
traindf = pd.DataFrame(traindf,columns=features)

In [8]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [9]:
X.head()

| | Crossing | Finishing | ShortPassing | Dribbling | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Stamina | ... | Nationality_Uganda | Nationality_Ukraine | Nationality_United Arab Emirates | Nationality_United States | Nationality_Uruguay | Nationality_Uzbekistan | Nationality_Venezuela | Nationality_Wales | Nationality_Zambia | Nationality_Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.0 | 95.0 | 90.0 | 97.0 | 87.0 | 96.0 | 91.0 | 86.0 | 91.0 | 72.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 84.0 | 94.0 | 81.0 | 88.0 | 77.0 | 94.0 | 89.0 | 91.0 | 87.0 | 88.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 79.0 | 87.0 | 84.0 | 96.0 | 78.0 | 95.0 | 94.0 | 90.0 | 96.0 | 81.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 17.0 | 13.0 | 50.0 | 18.0 | 51.0 | 42.0 | 57.0 | 58.0 | 60.0 | 43.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 93.0 | 82.0 | 92.0 | 86.0 | 91.0 | 91.0 | 78.0 | 76.0 | 79.0 | 90.0 | ... | False | False | False | False | False | False | False | False | False | False |


In [10]:
len(X.columns)

223

### Fix the number of features to select

In [11]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [16]:
def cor_selector(X, y,num_feats):
    # Your code goes here (Multiple lines)
    
    # Compute the Pearson correlation coefficients
    cor_list = []
    for col in X.columns:
        cor = np.corrcoef(X[col], y)[0, 1]  # Compute correlation
        cor_list.append(abs(cor))  # Use absolute value for ranking
    
    # Create a Pandas Series for sorting and feature selection
    cor_list = np.array(cor_list)
    cor_df = pd.Series(cor_list, index=X.columns)
    
    # Sort the features by correlation value (descending)
    cor_feature = cor_df.nlargest(num_feats).index.tolist()
    
    # Create a mask for the selected features
    cor_support = [True if feature in cor_feature else False for feature in X.columns]
    
    # Your code ends here
    return cor_support, cor_feature

In [17]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [18]:
cor_feature

['Reactions',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_C. Ronaldo',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Position_LF',
 'Position_RF',
 'ShortPassing',
 'Volleys',
 'LongPassing',
 'FKAccuracy',
 'BallControl',
 'Finishing',
 'LongShots',
 'ShotPower',
 'Dribbling',
 'Nationality_Belgium',
 'Crossing',
 'Agility',
 'Weak Foot',
 'Stamina',
 'Nationality_Slovenia',
 'Nationality_Gabon',
 'Strength',
 'SprintSpeed',
 'Acceleration',
 'Nationality_Uruguay',
 'Position_LAM',
 'Nationality_Costa Rica']
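
The per-column loop in `cor_selector` can also be written as a single vectorized call. The sketch below is one possible variant, not the assignment's reference solution: `pearson_top_k` is a hypothetical helper name, the data are synthetic, and filling `NaN` scores with 0 (so that constant columns rank last) is an assumption about the desired behavior.

```python
import numpy as np
import pandas as pd

def pearson_top_k(X, y, k):
    """Hypothetical helper: rank columns by |Pearson r| with y and keep the top k."""
    # Cast to float so boolean dummy columns and a boolean target are treated alike.
    X_num = X.astype(float)
    y_num = pd.Series(np.asarray(y, dtype=float), index=X.index)
    # corrwith computes the column-wise Pearson correlation in one call.
    # Constant columns have undefined correlation (NaN); score them as 0 (assumption).
    scores = X_num.corrwith(y_num).abs().fillna(0.0)
    selected = scores.nlargest(k).index.tolist()
    support = X.columns.isin(selected)  # boolean mask aligned with X.columns
    return support, selected

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
demo_X = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.1, size=200),
    "weak": signal + rng.normal(scale=5.0, size=200),
    "constant": np.ones(200),
})
demo_y = signal > 0
support, selected = pearson_top_k(demo_X, demo_y, k=1)
print(selected)
```

Because the mask is built from `X.columns.isin(selected)`, the mask and the name list always describe the same set of features.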

## Filter Feature Selection - Chi-Square

In [19]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [20]:
def chi_squared_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Apply SelectKBest with Chi-Square
    chi_selector = SelectKBest(score_func=chi2, k=num_feats)
    chi_selector.fit(X, y)
    
    # Get the support mask and selected feature names
    chi_support = chi_selector.get_support()
    chi_feature = X.columns[chi_support].tolist()
    # Your code ends here
    return chi_support, chi_feature

In [21]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square 

In [22]:
chi_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LF',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
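
scikit-learn's `chi2` requires non-negative feature values, and its statistic grows with the magnitude of a feature, which may help explain why every 0-100 skill rating appears in the list above. The cell imports `MinMaxScaler` but `chi_squared_selector` does not use it. The sketch below shows one way scaling could be added. `chi2_top_k_scaled` is a hypothetical name and the data are synthetic, so treat it as an illustration under those assumptions rather than the intended solution.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

def chi2_top_k_scaled(X, y, k):
    """Hypothetical variant: min-max scale, then keep the k best chi-square features."""
    # After scaling, every value lies in [0, 1], so chi2's non-negativity
    # requirement holds and all features share the same range.
    X_scaled = MinMaxScaler().fit_transform(X)
    selector = SelectKBest(score_func=chi2, k=k).fit(X_scaled, y)
    support = selector.get_support()
    return support, X.columns[support].tolist()

# Synthetic data for illustration only; "informative" contains negative values,
# which chi2 would reject without scaling.
rng = np.random.default_rng(1)
label = rng.integers(0, 2, size=300)
demo_X = pd.DataFrame({
    "informative": label * 2.0 - 1.0 + rng.normal(scale=0.3, size=300),
    "noise": rng.normal(size=300),
})
support, selected = chi2_top_k_scaled(demo_X, label, k=1)
print(selected)
```

Scaling changes the relative chi-square scores, so rankings from this variant may differ from the unscaled run recorded above.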

## Wrapper Feature Selection - Recursive Feature Elimination

In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [32]:
def rfe_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Encode categorical features in X if needed
    if isinstance(X, pd.DataFrame):
        X = pd.get_dummies(X, drop_first=True)

    # Fill any missing values in X
    X = X.fillna(0)

    # Scale the features in X using MinMaxScaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Ensure y is a 1D array
    y = np.array(y).ravel()

    # Define the Logistic Regression model
    model = LogisticRegression(max_iter=10000, random_state=42)

    # Apply Recursive Feature Elimination (RFE)
    rfe_selector = RFE(estimator=model, n_features_to_select=num_feats, step=1)
    rfe_selector.fit(X_scaled, y)

    # Get the support mask and selected feature names
    rfe_support = rfe_selector.get_support()  # Boolean mask of selected features
    rfe_feature = X.columns[rfe_support].tolist()  # Selected feature names

    # Your code ends here
    return rfe_support, rfe_feature

In [36]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')


30 selected features


### List the selected features from RFE

In [37]:
rfe_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'Reactions',
 'Strength',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LF',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_Lean',
 'Body Type_Normal',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovenia',
 'Nationality_Uruguay',
 'Nationality_Wales']
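
Besides the boolean mask, a fitted `RFE` object exposes `ranking_`, which records the order in which features were eliminated (rank 1 means selected). The sketch below uses a small synthetic dataset from `make_classification`. The feature names `f0` to `f5` are placeholders, not FIFA columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# With step=1, RFE drops one feature per iteration until two remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
rfe.fit(demo_X, y_demo)

# Rank 1 marks the selected features; larger ranks were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=demo_X.columns).sort_values()
print(ranking)
```

Inspecting `ranking_` can help judge how close a dropped feature was to being kept, which the mask alone does not show.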

## Embedded Selection - Lasso: SelectFromModel

In [77]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [96]:
def embedded_log_reg_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    elr = LogisticRegression(penalty='l1', solver='liblinear', max_iter=500)
    elr.fit(X, y)
    coef = elr.coef_[0]
    sorted_indices = np.argsort(np.abs(coef))[-num_feats:]
    selected_features = np.where(coef != 0)[0]  # Indices of non-zero coefficients
    final_selected_features = np.intersect1d(sorted_indices, selected_features)
    embedded_lr_support = [True if i in final_selected_features else False for i in range(X.shape[1])]
    embedded_lr_feature = X.columns[embedded_lr_support].tolist()
    # Your code ends here
    return embedded_lr_support, embedded_lr_feature

In [97]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [98]:
embedded_lr_feature

['Crossing',
 'Finishing',
 'Dribbling',
 'LongPassing',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RW',
 'Body Type_Lean',
 'Nationality_Brazil',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Germany',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Nationality_Uruguay']
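
The section heading mentions `SelectFromModel`, while `embedded_log_reg_selector` ranks the coefficients by hand. For comparison, here is a sketch of the same idea expressed through `SelectFromModel`. The helper name `l1_select_from_model`, the added min-max scaling, and the synthetic data are assumptions made for illustration. Unlike the function above, this variant always returns exactly `k` features, even when some of their coefficients are zero.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def l1_select_from_model(X, y, k):
    """Hypothetical helper: top-k features by |coefficient| of an L1 logistic regression."""
    # Scale first so coefficient magnitudes are comparable across features (assumption).
    X_scaled = MinMaxScaler().fit_transform(X)
    model = LogisticRegression(penalty="l1", solver="liblinear", max_iter=500)
    # threshold=-np.inf disables the importance cut-off, so max_features alone
    # decides how many features are kept.
    selector = SelectFromModel(model, max_features=k, threshold=-np.inf)
    selector.fit(X_scaled, y)
    support = selector.get_support()
    return support, X.columns[support].tolist()

# Synthetic data for illustration only.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
support, selected = l1_select_from_model(demo_X, y_demo, k=3)
print(selected)
```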

## Tree-based (Random Forest): SelectFromModel

In [54]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [117]:
def embedded_rf_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Encode categorical features in X if needed
    if isinstance(X, pd.DataFrame):
        X = pd.get_dummies(X, drop_first=True)

    # Fill any missing values in X
    X = X.fillna(0)

    # Scale the features in X using MinMaxScaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Ensure y is a 1D array
    y = np.array(y).ravel()

    # Create a Random Forest Classifier model
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

    # Fit the Random Forest model
    rf_model.fit(X_scaled, y)

    # Apply SelectFromModel: keep features whose importance is at or above the mean,
    # capped at num_feats (max_features is an upper bound, not a target count)
    selector = SelectFromModel(rf_model, max_features=num_feats, threshold="mean")
    selector.fit(X_scaled, y)

    # Get the support mask and selected feature names
    embedded_rf_support = selector.get_support()  # Boolean mask of selected features
    embedded_rf_feature = X.columns[embedded_rf_support].tolist()  # Selected feature names

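    # If fewer than num_feats were selected, pad the name list with unselected columns
    # (taken in column order, not by importance; the support mask is left unchanged)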
    if len(embedded_rf_feature) < num_feats:
        remaining_features = [f for i, f in enumerate(X.columns) if not embedded_rf_support[i]]
        embedded_rf_feature += remaining_features[:num_feats - len(embedded_rf_feature)]

    # Your code ends here
    return embedded_rf_support, embedded_rf_feature

In [110]:
embedder_rf_support, embeded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embeded_rf_feature)), 'selected features')

23 selected features


In [111]:
embeded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Body Type_Courtois',
 'Nationality_Belgium',
 'Nationality_Slovenia']
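
The run above reports 23 features rather than 30. That is consistent with `threshold="mean"` keeping only features whose importance reaches the mean, while `max_features` merely caps the count. If exactly `num_feats` features are wanted, one option is to rank `feature_importances_` directly. The sketch below assumes that goal. `rf_top_k` is a hypothetical helper and the data are synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def rf_top_k(X, y, k, random_state=42):
    """Hypothetical helper: keep the k features with the highest forest importance."""
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    selected = importances.nlargest(k).index.tolist()
    # Build the mask from the selected names so both outputs always agree.
    support = X.columns.isin(selected)
    return support, selected

# Synthetic data for illustration only.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
support, selected = rf_top_k(demo_X, y_demo, k=4)
print(selected)
```

Tree ensembles do not depend on feature scale, so this sketch skips the `MinMaxScaler` step.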

## Tree-based (Light GBM): SelectFromModel

In [72]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [73]:
def embedded_lgbm_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Encode categorical features in X if needed
    if isinstance(X, pd.DataFrame):
        X = pd.get_dummies(X, drop_first=True)

    # Fill any missing values in X
    X = X.fillna(0)

    # Scale the features in X using MinMaxScaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Ensure y is a 1D array
    y = np.array(y).ravel()

    # Create a LightGBM model
    lgbm_model = LGBMClassifier(n_estimators=100, random_state=42)

    # Fit the LightGBM model
    lgbm_model.fit(X_scaled, y)

    # Apply SelectFromModel: keep features whose importance is at or above the mean,
    # capped at num_feats (max_features is an upper bound, not a target count)
    selector = SelectFromModel(lgbm_model, max_features=num_feats, threshold="mean")
    selector.fit(X_scaled, y)

    # Get the support mask and selected feature names
    embedded_lgbm_support = selector.get_support()  # Boolean mask of selected features
    embedded_lgbm_feature = X.columns[embedded_lgbm_support].tolist()  # Selected feature names

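    # If fewer than num_feats were selected, pad the name list with unselected columns
    # (taken in column order, not by importance; the support mask is left unchanged)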
    if len(embedded_lgbm_feature) < num_feats:
        remaining_features = [f for i, f in enumerate(X.columns) if not embedded_lgbm_support[i]]
        embedded_lgbm_feature += remaining_features[:num_feats - len(embedded_lgbm_feature)]

    
    # Your code ends here
    return embedded_lgbm_support, embedded_lgbm_feature

In [74]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')



[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001249 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1801
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555




[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001117 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1801
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
30 selected features


In [75]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LCB',
 'Body Type_Lean',
 'Nationality_Italy',
 'Weak Foot',
 'Preferred Foot_Left',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CB',
 'Position_CDM',
 'Position_CF',
 'Position_CM']

## Putting all of it together: AutoFeatureSelector Tool

In [120]:
print(f"Length of feature_name: {len(feature_name)}")
print(f"Length of cor_support: {len(cor_support)}")
print(f"Length of chi_support: {len(chi_support)}")
print(f"Length of rfe_support: {len(rfe_support)}")
print(f"Length of embedded_lr_support: {len(embedded_lr_support)}")
print(f"Length of embeded_rf_support: {len(embedder_rf_support)}")
print(f"Length of embedded_lgbm_support: {len(embedded_lgbm_support)}")

Length of feature_name: 223
Length of cor_support: 223
Length of chi_support: 223
Length of rfe_support: 223
Length of embedded_lr_support: 223
Length of embeded_rf_support: 223
Length of embedded_lgbm_support: 223


In [118]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,
                                    'Random Forest':embedder_rf_support, 'LightGBM':embedded_lgbm_support})
# count the selected times for each feature
feature_selection_df['Total'] = feature_selection_df.iloc[:, 1:].sum(axis=1)
# display the top 100
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(100)

| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | LightGBM | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | SprintSpeed | True | True | True | True | True | True | 6 |
| 2 | Reactions | True | True | True | True | True | True | 6 |
| 3 | LongPassing | True | True | True | True | True | True | 6 |
| 4 | Finishing | True | True | True | True | True | True | 6 |
| 5 | Agility | True | True | True | True | True | True | 6 |
| 6 | Acceleration | True | True | True | True | True | True | 6 |
| 7 | Volleys | True | True | True | False | True | True | 5 |
| 8 | Strength | True | True | True | False | True | True | 5 |
| 9 | ShortPassing | True | True | True | False | True | True | 5 |
| 10 | FKAccuracy | True | True | False | True | True | True | 5 |
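
The voting step reduces to summing boolean masks row by row. A minimal sketch with three made-up masks (toy values chosen for illustration, not results from the dataset) shows the mechanics:

```python
import pandas as pd

# Toy masks for illustration only; each list says which features a method selected.
features = ["Reactions", "Finishing", "Balance", "Stamina"]
masks = {
    "Pearson": [True, True, False, False],
    "Chi-2": [True, False, False, True],
    "RFE": [True, True, False, False],
}

votes = pd.DataFrame(masks, index=features)
# Booleans sum as 0/1, so the row sum is the number of methods that chose a feature.
votes["Total"] = votes.sum(axis=1)
ranked = votes.sort_values("Total", ascending=False, kind="stable")
print(ranked)

# Keep features chosen by at least two of the three methods (threshold is an assumption).
top = ranked.index[ranked["Total"] >= 2].tolist()
print(top)
```

The cut-off of two votes is arbitrary here. In practice it could be tied to the number of methods, for example a simple majority.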


## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best (maximum votes) features across all methods?

In [138]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [142]:
def preprocess_dataset(dataset_path):
    # Load the dataset
    data = pd.read_csv(dataset_path)

    # Sanitize column names to remove special characters
    data.columns = data.columns.str.replace('[^A-Za-z0-9_]', '', regex=True)

    # Ensure only numeric columns are used for feature selection
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    data = data[numeric_cols].copy()  # copy so later changes do not act on a slice

    # Handle missing values by filling with the column mean
    data = data.fillna(data.mean())

    # Separate features and target variable
    X = data.drop('Overall', axis=1, errors='ignore')  # Adjust target column if necessary
    y = (data['Overall'] >= 87).astype(int)  # Binary target variable based on 'Overall' >= 87

    # Number of features to select
    num_feats = 30

    return X, y, num_feats

In [144]:
def autoFeatureSelector(dataset_path, methods=None):
    # Parameters
    # dataset_path - path to the dataset to be analyzed (csv file)
    # methods - names of the feature selection methods to run (list), e.g. ['pearson', 'rf']
    methods = methods or []  # avoid a mutable default argument

    # preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)

    # Map each method name to its output column and the selector function defined above
    selectors = {
        'pearson': ('Pearson', cor_selector),
        'chi-square': ('Chi-2', chi_squared_selector),
        'rfe': ('RFE', rfe_selector),
        'log-reg': ('Logistics', embedded_log_reg_selector),
        'rf': ('Random Forest', embedded_rf_selector),
        'lgbm': ('LightGBM', embedded_lgbm_selector),
    }

    # Run every requested method and count how many methods selected each feature
    #### Your Code starts here (Multiple lines)
    feature_selection_df = pd.DataFrame({'Feature': X.columns})
    for name, (column, selector) in selectors.items():
        if name in methods:
            support, _ = selector(X, y, num_feats)
        else:
            # a method that was not requested casts no votes
            support = [False] * len(X.columns)
        feature_selection_df[column] = np.asarray(support, dtype=bool)

    # Count the total selections for each feature
    feature_selection_df['Total'] = feature_selection_df.iloc[:, 1:].sum(axis=1)

    # Sort by Total and return every feature that received at least one vote
    feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
    best_features = feature_selection_df.loc[feature_selection_df['Total'] > 0, 'Feature'].tolist()
    #### Your Code ends here
    return best_features

In [145]:
X, y, num_feats = preprocess_dataset("./fifa19.csv")
print(X.shape, y.shape, num_feats)  # Check if the data is loaded and split correctly

(18207, 43) (18207,) 30


In [146]:
best_features = autoFeatureSelector(dataset_path="./fifa19.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
best_features



[LightGBM] [Info] Number of positive: 55, number of negative: 18152
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000648 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3759
[LightGBM] [Info] Number of data points in the train set: 18207, number of used features: 43
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003021 -> initscore=-5.799203
[LightGBM] [Info] Start training from score -5.799203




[LightGBM] [Info] Number of positive: 55, number of negative: 18152
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003797 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3759
[LightGBM] [Info] Number of data points in the train set: 18207, number of used features: 43
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003021 -> initscore=-5.799203
[LightGBM] [Info] Start training from score -5.799203


['Unnamed0',
 'Reactions',
 'Potential',
 'Positioning',
 'Finishing',
 'FKAccuracy',
 'Composure',
 'Volleys',
 'Stamina',
 'Special',
 'LongShots',
 'LongPassing',
 'JerseyNumber',
 'InternationalReputation',
 'ID',
 'Dribbling',
 'Curve',
 'Agility',
 'Vision',
 'Strength',
 'SlidingTackle',
 'ShotPower',
 'Penalties',
 'Jumping',
 'GKPositioning',
 'GKDiving',
 'Crossing',
 'BallControl',
 'Balance',
 'Age',
 'Acceleration',
 'SkillMoves',
 'ShortPassing',
 'Marking',
 'Interceptions',
 'HeadingAccuracy',
 'GKReflexes',
 'GKHandling',
 'WeakFoot',
 'StandingTackle',
 'GKKicking',
 'Aggression']

### Last, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?
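
One possible shape for the script's command-line interface is sketched below with `argparse`. The positional `dataset` argument and the `--methods` flag are hypothetical names chosen for this sketch. The method keys are the ones checked inside `autoFeatureSelector`. The call to `autoFeatureSelector` is left as a comment because the function lives in the notebook.

```python
import argparse

# Method keys recognized by autoFeatureSelector in the notebook.
VALID_METHODS = ["pearson", "chi-square", "rfe", "log-reg", "rf", "lgbm"]

def parse_args(argv=None):
    """Parse the dataset path and the list of selection methods (hypothetical CLI)."""
    parser = argparse.ArgumentParser(
        description="Vote-based feature selection over a CSV dataset."
    )
    parser.add_argument("dataset", help="path to the CSV file")
    parser.add_argument(
        "--methods",
        nargs="+",
        choices=VALID_METHODS,
        default=VALID_METHODS,
        help="selection methods to run (default: all)",
    )
    return parser.parse_args(argv)

# Example invocation with an explicit argument list instead of sys.argv.
args = parse_args(["fifa19.csv", "--methods", "pearson", "rf"])
print(args.dataset, args.methods)

# In the script, the parsed values would then be passed on, for example:
# best_features = autoFeatureSelector(args.dataset, methods=args.methods)
```

Restricting `--methods` with `choices` lets `argparse` reject a misspelled method name before any model is trained.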