![Photo by DATAIDEA](../assets/banner4.png)

## ANOVA for Feature Selection

This bonus notebook shows, from another angle, how ANOVA is actually used in practice.

In [None]:
# !pip install -U dataidea

Let's import some packages: `scipy`, whose `stats` module provides `f_oneway` for performing Analysis of Variance (ANOVA); DATAIDEA's `loadDataset` for loading the Fantasy Premier League (FPL) dataset; and scikit-learn's `SelectKBest` for univariate feature selection based on statistical tests.

In [2]:
import scipy as sp
import scipy.stats  # make sure sp.stats is loaded, even on older SciPy versions
from sklearn.feature_selection import SelectKBest
from dataidea.datasets import loadDataset

In [3]:
# load the built-in fpl dataset
fpl = loadDataset('fpl')

# show the first 5 rows
fpl.head(n=5)

Unnamed: 0,First_Name,Second_Name,Club,Goals_Scored,Assists,Total_Points,Minutes,Saves,Goals_Conceded,Creativity,Influence,Threat,Bonus,BPS,ICT_Index,Clean_Sheets,Red_Cards,Yellow_Cards,Position
0,Bruno,Fernandes,MUN,18,14,244,3101,0,36,1414.9,1292.6,1253,36,870,396.2,13,0,6,MID
1,Harry,Kane,TOT,23,14,242,3083,0,39,659.1,1318.2,1585,40,880,355.9,12,0,1,FWD
2,Mohamed,Salah,LIV,22,6,231,3077,0,41,825.7,1056.0,1980,21,657,385.8,11,0,0,MID
3,Heung-Min,Son,TOT,17,11,228,3119,0,36,1049.9,1052.2,1046,26,777,315.2,13,0,0,MID
4,Patrick,Bamford,LEE,17,11,194,3052,0,50,371.0,867.2,1512,26,631,274.6,10,0,3,FWD


ANOVA helps us determine whether there's a significant difference between the means of several groups.

This concept can be used to pick the best features for a categorical outcome: we prefer features whose values differ most clearly between the categories. After an ANOVA test, the features with the higher F-statistics fit this idea best.
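To make the F-statistic more concrete, here is a minimal sketch that computes it by hand and compares it with `scipy`. The three groups are made-up toy numbers (not taken from the FPL dataset), and `one_way_f` is a hypothetical helper name used only for illustration.

```python
import numpy as np
from scipy import stats

# Made-up toy groups (NOT from the fpl dataset), purely for illustration
group_a = np.array([10.0, 12.0, 11.0, 13.0])
group_b = np.array([4.0, 5.0, 6.0, 5.0])
group_c = np.array([1.0, 0.0, 2.0, 1.0])
groups = [group_a, group_b, group_c]


def one_way_f(groups):
    """Compute the one-way ANOVA F-statistic by hand (hypothetical helper)."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


manual_f = one_way_f(groups)
scipy_f, scipy_p = stats.f_oneway(*groups)

print("manual F:", manual_f)
print("scipy F: ", scipy_f)
```

The F-statistic grows when the group means are far apart relative to the spread inside each group, which is why a higher value suggests the feature separates the categories better.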

Below, I've created groups of goals scored by player position, i.e. goals scored by forwards make one group, goals scored by midfielders another, and so on.

In [6]:
# Create groups of goals scored for each player position

forwards_goals = fpl[fpl.Position == 'FWD']['Goals_Scored']
midfielders_goals = fpl[fpl.Position == 'MID']['Goals_Scored']
defenders_goals = fpl[fpl.Position == 'DEF']['Goals_Scored']
goalkeepers_goals = fpl[fpl.Position == 'GK']['Goals_Scored']

Let's run an ANOVA test to see if there's a significant difference between the means of the goals scored by each of the groups.

In [8]:
# Perform the ANOVA test for the groups

f_statistic, p_value = sp.stats.f_oneway(forwards_goals, midfielders_goals,
                                         defenders_goals, goalkeepers_goals
                                        )
print("F-statistic:", f_statistic)
print("p-value:", p_value)

F-statistic: 33.281034594400445
p-value: 3.9257634156019246e-20


We observe an *F-statistic* of `33.281` (seems big) and a *p-value* of `3.926e-20`, which is extremely small and shows significance at the `95%`, `97%` and even `99%` confidence levels (it is far below the corresponding `0.05`, `0.03` and `0.01` thresholds).

Below, I've created groups of assists by player position, i.e. assists made by forwards make one group, assists made by midfielders another, and so on.

In [9]:
# Create groups of assists for each player position

forwards_assists = fpl[fpl.Position == 'FWD']['Assists']
midfielders_assists = fpl[fpl.Position == 'MID']['Assists']
defenders_assists = fpl[fpl.Position == 'DEF']['Assists']
goalkeepers_assists = fpl[fpl.Position == 'GK']['Assists']

Let's run an ANOVA test to see if there's a significant difference between the means of the assists by each of the groups.

In [11]:
# Perform the ANOVA test for the groups

f_statistic, p_value = sp.stats.f_oneway(forwards_assists, midfielders_assists,
                                         defenders_assists, goalkeepers_assists
                                        )
print("F-statistic:", f_statistic)
print("p-value:", p_value)

F-statistic: 19.263717036430815
p-value: 5.124889288362087e-12


We observe an *F-statistic* of `19.264` (seems big too) and a *p-value* of `5.125e-12`, which is also extremely small and shows significance at the `95%`, `97%` and even `99%` confidence levels.

As we can observe, both features have *big* and significant F-statistics, but goals scored is clearly ahead of assists. Since the F-statistic compares the variation between the group means with the variation inside the groups, it's fair to say that goals scored differentiates between player positions better than assists does.
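The two manual tests above follow the same pattern, so they can be folded into a loop. The sketch below is illustrative only: it uses a small made-up table that mimics the `Position`, `Goals_Scored` and `Assists` columns rather than the real `fpl` data, and `anova_by_group` is a hypothetical helper name.

```python
import pandas as pd
from scipy import stats

# Made-up toy data that only mimics the fpl columns used above
toy = pd.DataFrame({
    "Position": ["FWD", "FWD", "FWD", "MID", "MID", "MID", "DEF", "DEF", "DEF"],
    "Goals_Scored": [15, 12, 18, 7, 9, 5, 1, 2, 0],
    "Assists": [5, 3, 6, 8, 4, 9, 2, 3, 1],
})


def anova_by_group(df, feature, target):
    """Run a one-way ANOVA of `feature` across the categories of `target`."""
    # One sample (a pandas Series) per category of the target column
    samples = [values for _, values in df.groupby(target)[feature]]
    return stats.f_oneway(*samples)


results = {}
for feature in ["Goals_Scored", "Assists"]:
    f_statistic, p_value = anova_by_group(toy, feature, "Position")
    results[feature] = f_statistic
    print(f"{feature}: F = {f_statistic:.3f}, p = {p_value:.3g}")

# Rank the features from the highest to the lowest F-statistic
ranking = sorted(results, key=results.get, reverse=True)
print("Ranking:", ranking)
```

With the real dataset, the same helper could be called as `anova_by_group(fpl, 'Goals_Scored', 'Position')`, assuming the column names shown in the table above.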

That's good, but it's a lot of work when we can wrap it all up in the `SelectKBest` class from `sklearn`, as demonstrated below.

In [None]:
# Use scikit-learn's SelectKBest (its default score function is f_classif, the ANOVA F-test)
test = SelectKBest(k=1)

# fit on the two numeric features, with Position as the target
fit = test.fit(fpl[['Goals_Scored', 'Assists']], fpl.Position)

# get the f-statistics, one per feature
scores = fit.scores_

# keep only the column of the best feature
features = fit.transform(fpl[['Goals_Scored', 'Assists']])

# get the indices of the selected features (optional)
selected_indices = test.get_support(indices=True)

# print indices and scores
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)

Feature Scores:  [33.28103459 19.26371704]
Selected Features Indices:  [0]


As we can observe, the feature at index `0`, which is Goals Scored, is selected as the best 1 of the 2 features, as expected from the F-statistics.
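Indices are easy to misread once there are many columns, so it can help to map the selection back to column names. The sketch below, again on made-up toy data rather than the real `fpl` table, shows one way to do that with the boolean mask from `get_support()`, and also checks that the score reported by `SelectKBest` agrees with a manual `f_oneway` call.

```python
import pandas as pd
from scipy import stats
from sklearn.feature_selection import SelectKBest, f_classif

# Made-up toy data that only mimics the fpl columns used above
toy = pd.DataFrame({
    "Position": ["FWD", "FWD", "FWD", "MID", "MID", "MID", "DEF", "DEF", "DEF"],
    "Goals_Scored": [15, 12, 18, 7, 9, 5, 1, 2, 0],
    "Assists": [5, 3, 6, 8, 4, 9, 2, 3, 1],
})

X = toy[["Goals_Scored", "Assists"]]
y = toy["Position"]

# Spell out the score function explicitly instead of relying on the default
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask, which indexes the column names directly
selected_names = list(X.columns[selector.get_support()])
print("Selected feature names:", selected_names)

# Cross-check: the score for Goals_Scored should match a manual one-way ANOVA
goal_groups = [values for _, values in toy.groupby("Position")["Goals_Scored"]]
manual_f, _ = stats.f_oneway(*goal_groups)
print("SelectKBest score:", selector.scores_[0])
print("f_oneway F:       ", manual_f)
```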

