## Introduction
This section constructs a high-dimensional player profile. The aim is to identify the attributes that best describe a player's characteristics.

Ratios such as pass accuracy are converted to absolute numbers so that their magnitude is captured. The ratios themselves are generally not retained, since they may introduce redundancy.
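As a minimal sketch of that conversion, the toy table below turns a completion percentage back into a count. The column names follow the naming pattern used later in this notebook (`Total_Att`, `Total_Cmp%`), but the two players and their values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: values are invented, only the column naming
# pattern is borrowed from the notebook.
stats = pd.DataFrame(
    {"Total_Att": [400, 50], "Total_Cmp%": [85.0, 90.0]},
    index=["player_a", "player_b"],
)

# Percentage times attempts gives the absolute number of completed passes.
stats["Total_Cmp"] = (
    (stats["Total_Cmp%"] / 100 * stats["Total_Att"]).round().astype(int)
)

# The ratio alone ranks player_b higher; the absolute count shows that
# player_a completed far more passes.
print(stats)
```

The ratio column could then be dropped with `stats.drop(columns=["Total_Cmp%"])` if it is judged redundant.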

## Import

In [190]:
import utils
import numpy as np

features, player_info = utils.load_player_statistics()

mask = (player_info["Matches Played"] > 8) & (player_info["Playing Time_Min"] > 60)
player_info = player_info[mask]
features = features[mask]

playing_time_cols = ['Playing Time_Minutes', 'Playing Time_Mn/MP','Starts', 'Mn/Start', 'Compl',
                     'Subs', 'unSub', 'PPM','onG', 'onGA','On-Off','W', 'D', 'L'
                    ]

col_to_drop = playing_time_cols
features = features.drop(columns = col_to_drop)

### Helper

In [191]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

def evaluate(select_features, X, y):
    encoder = LabelEncoder()
    X = X[select_features]
    y = encoder.fit_transform(y)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        ('svc', SVC())                 
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # Report per-class metrics. Labels are cast to str so that boolean or
    # integer targets (the binary one-vs-rest case) are accepted as names.
    class_names = [str(name) for name in encoder.classes_]
    print(classification_report(y_test, y_pred, target_names=class_names))

## Selection for all Positions

 - `f_classif` is based on ANOVA (analysis of variance). It computes the F-value between each feature and the target variable, which compares how much the feature's mean varies between classes with how much the feature varies within each class. It therefore mainly captures linear dependency. Features whose class means differ strongly get high scores.
 - `mutual_info_classif` is based on mutual information, which measures the amount of information shared between two variables and can also capture non-linear dependency. It computes the mutual information between each feature and the target variable. Features that are highly informative with respect to the target variable get high scores.
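The difference between the two score functions can be illustrated on synthetic data. The sketch below is not part of the original analysis: it builds three invented features, one whose class means differ, one whose classes share a mean but differ in spread, and one that is pure noise, and then scores them with both functions.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)

# Feature 0: the class shifts the mean (a linear dependency).
x_mean_shift = y_demo + rng.normal(0, 1, size=n)
# Feature 1: same mean in both classes, but a very different spread.
x_spread = rng.normal(0, 1, size=n) * np.where(y_demo == 1, 3.0, 0.5)
# Feature 2: independent noise.
x_noise = rng.normal(0, 1, size=n)

X_demo = np.column_stack([x_mean_shift, x_spread, x_noise])

f_scores, _ = f_classif(X_demo, y_demo)
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

for name, f, mi in zip(["mean_shift", "spread", "noise"], f_scores, mi_scores):
    print(f"{name:>10}  F={f:8.2f}  MI={mi:.3f}")
```

In this setup the F-value should be large only for the mean-shift feature, while mutual information should also flag the spread feature, which is the kind of dependency ANOVA cannot see.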

In [192]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import LabelEncoder

scaler = StandardScaler()
X = scaler.fit_transform(features)

encoder = LabelEncoder()
y = encoder.fit_transform(player_info["Global Pos"])

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

Selected Features

In [193]:
selected_features = features.columns[selector.get_support()]
selected_features

Index(['Touches_Def Pen', 'Touches_Def 3rd', 'Touches_Mid 3rd',
       'Touches_Att Pen', 'Total_PrgDist', 'Short_Cmp%', 'Medium_Cmp%',
       'Long_Att', 'SCA90', 'Shots/90'],
      dtype='object')

In [194]:
evaluate(selected_features, features, player_info["Global Pos"])

              precision    recall  f1-score   support

          DF       0.93      0.93      0.93       319
          FW       0.86      0.85      0.86       244
          GK       1.00      1.00      1.00        45
          MF       0.81      0.82      0.82       275

    accuracy                           0.88       883
   macro avg       0.90      0.90      0.90       883
weighted avg       0.88      0.88      0.88       883



### Conclusion
With 10 features, players can already be separated by position. Even with 2 features, separability is very high.
For `k=2`, `mutual_info_classif` appears to perform significantly better than ANOVA (`f_classif`).
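One way to compare both score functions at several values of `k` is sketched below. Two things here are assumptions rather than part of the notebook: the data comes from `make_classification` instead of the player statistics, and the selector is placed inside the pipeline. The cells above fit the selector on all rows before `evaluate` splits them, so moving it into the pipeline keeps the held-out rows out of the selection step.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the player data: 4 classes, 30 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

results = {}
score_funcs = [("f_classif", f_classif),
               ("mutual_info_classif", mutual_info_classif)]
for name, score_func in score_funcs:
    for k in (2, 10):
        pipe = Pipeline([
            ("scaler", StandardScaler()),
            # Selection is refit on the training folds only.
            ("select", SelectKBest(score_func=score_func, k=k)),
            ("svc", SVC()),
        ])
        scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
        results[(name, k)] = scores.mean()
        print(f"{name:>20}  k={k:<2}  mean accuracy={scores.mean():.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `features` and the encoded `player_info["Global Pos"]`.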

## Selection for all Forwards

 - `f_classif` 
 - `mutual_info_classif` performed strongly for all positions

In [207]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import LabelEncoder

scaler = StandardScaler()
X = scaler.fit_transform(features)

encoder = LabelEncoder()
y = encoder.fit_transform(player_info["Global Pos"] == "FW")

selector = SelectKBest(score_func=mutual_info_classif, k=30)
X_selected = selector.fit_transform(X, y)

Selected Features

In [208]:
selected_features = features.columns[selector.get_support()]
selected_features

Index(['Tackles_Def 3rd', 'Interceptions', 'Clearances', 'Dribblers_Tkl_Succ',
       'Blocks_Shots', 'Touches_Number', 'Touches_Def Pen', 'Touches_Def 3rd',
       'Touches_Mid 3rd', 'Touches_Att Pen', 'Carries_CPA', 'Receiving_PrgR',
       'Total_Cmp', 'Total_Att', 'Total_Cmp%', 'Total_TotDist',
       'Total_PrgDist', 'Short_Cmp%', 'Medium_Cmp', 'Medium_Att',
       'Medium_Cmp%', 'Long_Cmp', 'Long_Att', 'Passes_to_1/3', 'SCA90',
       'Goals', 'SoT', 'SoT%', 'Shots/90', 'Off'],
      dtype='object')

In [209]:
evaluate(selected_features, features, y)

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       639
           1       0.88      0.88      0.88       244

    accuracy                           0.93       883
   macro avg       0.92      0.92      0.92       883
weighted avg       0.93      0.93      0.93       883



### Conclusion
With `k=30`, the selected attributes give good suggestions for how a striker can be identified. This can definitely help in choosing attributes for similarity search.
The prediction accuracy is also high.

- dribbling
- touches
- carries
- goals and shots
- passes
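To make the link to similarity search concrete, here is a minimal sketch using `NearestNeighbors` on standardized attributes. The notebook does not show its similarity search, so the choice of Euclidean distance and the toy players below are assumptions; only the column names are taken from the selected features.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical toy players with invented values.
toy = pd.DataFrame(
    {
        "Goals": [20, 18, 2, 1, 15],
        "Shots/90": [3.5, 3.2, 0.6, 0.4, 2.9],
        "Touches_Att Pen": [150, 140, 20, 15, 120],
    },
    index=["fw_1", "fw_2", "df_1", "df_2", "fw_3"],
)

# Standardize so that no attribute dominates the distance by scale alone.
X_toy = StandardScaler().fit_transform(toy)

nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X_toy)

# Query with the first player; the nearest hit is the player itself.
distances, indices = nn.kneighbors(X_toy[[0]])
neighbours = toy.index[indices[0]].tolist()
print(neighbours)
```

Restricting the columns to the attributes chosen by the selector is what ties this search to the feature selection above.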

## Selection for all Midfielders

 - `f_classif` 
 - `mutual_info_classif`
 
One would expect the lowest accuracy here, since midfielders operate across the whole field and attempt every possible action in a game.

In [240]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import LabelEncoder

scaler = StandardScaler()
X = scaler.fit_transform(features)

encoder = LabelEncoder()
y = encoder.fit_transform(player_info["Global Pos"] == "MF")

selector = SelectKBest(score_func=mutual_info_classif, k=30)
X_selected = selector.fit_transform(X, y)

Selected Features

In [241]:
selected_features = features.columns[selector.get_support()]
selected_features

Index(['Tackles_Att', 'Tackles_Mid 3rd', 'Tackles_Att 3rd', 'Clearances',
       'Dribblers_Tkl_Att', 'Dribblers_Tkl_Lost', 'Blocks_Shots',
       'Touches_Def Pen', 'Touches_Def 3rd', 'Touches_Att 3rd',
       'Take-Ons_Succ', 'Take-Ons_Tkld%', 'Carries_1/3', 'Carries_Dis',
       'Receiving_PrgR', 'Total_Cmp%', 'Total_PrgDist', 'Short_Cmp%',
       'Medium_Cmp%', 'Long_Cmp%', 'Key Passes', 'Passes_to_1/3',
       'Progressive Passes', 'SCA90', 'Shots/90', 'Launched_Att', 'Passes_Thr',
       '#OPA/90', 'Penalty Kicks_PKatt', 'Aerial Duels_Lost'],
      dtype='object')

In [242]:
evaluate(selected_features, features, y)

              precision    recall  f1-score   support

           0       0.91      0.95      0.93       608
           1       0.89      0.79      0.84       275

    accuracy                           0.90       883
   macro avg       0.90      0.87      0.88       883
weighted avg       0.90      0.90      0.90       883



### Conclusion
With `k=30`, the selected attributes give good suggestions for how a midfielder can be identified. This can definitely help in choosing attributes for similarity search.
The prediction accuracy is also high.

- passing
- aerial duels
- dribbles
- tackles
- shot creation

## Selection for all Defenders

 - `f_classif` 
 - `mutual_info_classif`
 

In [264]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import LabelEncoder

scaler = StandardScaler()
X = scaler.fit_transform(features)

encoder = LabelEncoder()
y = encoder.fit_transform(player_info["Global Pos"] == "DF")

selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)

Selected Features

In [265]:
selected_features = features.columns[selector.get_support()]
selected_features

Index(['Tackles_Def 3rd', 'Interceptions', 'Clearances', 'Dribblers_Tkl_Succ',
       'Blocks_Shots', 'Touches_Def Pen', 'Touches_Def 3rd', 'Touches_Att 3rd',
       'Carries_Mis', 'Carries_Dis', 'Total_PrgDist', 'Short_Cmp%',
       'Medium_Cmp', 'Medium_Att', 'Medium_Cmp%', 'Long_Att', 'SCA90', 'Shots',
       'SoT', 'Shots/90'],
      dtype='object')

In [266]:
evaluate(selected_features, features, y)

              precision    recall  f1-score   support

           0       0.95      0.96      0.96       564
           1       0.94      0.91      0.92       319

    accuracy                           0.95       883
   macro avg       0.94      0.94      0.94       883
weighted avg       0.95      0.95      0.95       883



### Conclusion
With `k=20`, the selected attributes give good suggestions for how a defender can be identified. This can definitely help in choosing attributes for similarity search.
The prediction accuracy is also high.

- passing
- aerial duels
- dribbles
- tackles
- shot creation
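The three one-vs-rest selections can also be compared with each other. The sketch below copies the feature lists printed in the output cells above into plain Python sets (the variable names are introduced here for illustration) and looks at what they share.

```python
# Feature lists copied from the printed selector outputs above.
fw_features = {
    "Tackles_Def 3rd", "Interceptions", "Clearances", "Dribblers_Tkl_Succ",
    "Blocks_Shots", "Touches_Number", "Touches_Def Pen", "Touches_Def 3rd",
    "Touches_Mid 3rd", "Touches_Att Pen", "Carries_CPA", "Receiving_PrgR",
    "Total_Cmp", "Total_Att", "Total_Cmp%", "Total_TotDist",
    "Total_PrgDist", "Short_Cmp%", "Medium_Cmp", "Medium_Att",
    "Medium_Cmp%", "Long_Cmp", "Long_Att", "Passes_to_1/3", "SCA90",
    "Goals", "SoT", "SoT%", "Shots/90", "Off",
}
mf_features = {
    "Tackles_Att", "Tackles_Mid 3rd", "Tackles_Att 3rd", "Clearances",
    "Dribblers_Tkl_Att", "Dribblers_Tkl_Lost", "Blocks_Shots",
    "Touches_Def Pen", "Touches_Def 3rd", "Touches_Att 3rd",
    "Take-Ons_Succ", "Take-Ons_Tkld%", "Carries_1/3", "Carries_Dis",
    "Receiving_PrgR", "Total_Cmp%", "Total_PrgDist", "Short_Cmp%",
    "Medium_Cmp%", "Long_Cmp%", "Key Passes", "Passes_to_1/3",
    "Progressive Passes", "SCA90", "Shots/90", "Launched_Att", "Passes_Thr",
    "#OPA/90", "Penalty Kicks_PKatt", "Aerial Duels_Lost",
}
df_features = {
    "Tackles_Def 3rd", "Interceptions", "Clearances", "Dribblers_Tkl_Succ",
    "Blocks_Shots", "Touches_Def Pen", "Touches_Def 3rd", "Touches_Att 3rd",
    "Carries_Mis", "Carries_Dis", "Total_PrgDist", "Short_Cmp%",
    "Medium_Cmp", "Medium_Att", "Medium_Cmp%", "Long_Att", "SCA90", "Shots",
    "SoT", "Shots/90",
}

# Attributes picked in every one-vs-rest run.
common = fw_features & mf_features & df_features
# Attributes picked only for forwards.
fw_only = fw_features - mf_features - df_features

print("shared by all three:", sorted(common))
print("forwards only:", sorted(fw_only))
```

Defensive statistics such as `Clearances` appear in the forward selection as well. A plausible reading, not verified here, is that mutual information rewards a feature whose low values mark a forward just as much as one whose high values do.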