<div style="border: 2px solid white; padding: 10px;">
    <h2 style='text-align: center; margin-top:5px'><u>Variance Inflation Factor and Tolerance:</u></h2>
    <hr></hr>
    <ul>
        <li>
            <h4>Variance Inflation Factor:</h4>
            <pre style='line-height: 1.5'>
    The Variance Inflation Factor, known as <mark>VIF</mark>, 
    measures the amount of <mark>multicollinearity</mark> in a regression analysis...
    On its own, that definition does not help a lot.
            </pre>
            <pre style='line-height: 1.5'>
    <b>In simpler terms:</b> 
        If the data points of a feature, let's say Force, are too <mark>similar</mark> 
        to the data points of another feature, let's say LightSaber,
        then Force and LightSaber are both giving <mark>the same kind</mark>
        <mark>of information</mark> to the model. This redundancy makes the model's
        coefficient estimates unstable, and so less reliable.
            </pre>
            <pre style='line-height: 1.5'>
    <b>A VIF of:</b>
        • 1: Perfect, the feature is uncorrelated with the others.
        • >5: Bad, the feature is highly correlated with others.
        • >10: Severe. Drop that gun, on the ground, NOW.
            </pre>
        </li>
        <li>
            <h4>Tolerance:</h4>
            <pre style='line-height: 1.5'>
    <code>Tolerance = 1 / VIF</code>
    Tolerance is <mark>the inverse</mark> of the VIF.
    If VIF measures how much a feature overlaps with the others,
    tolerance measures how <mark>independent a feature is</mark> from the others.
            </pre>
            <pre style='line-height: 1.5'>
    • If tolerance is <mark>low</mark>, it means the feature is <mark>redundant</mark>.
    • If tolerance is <mark>high</mark>, it means the feature is more <mark>independent</mark>.
            </pre>
        </li>
    </ul>
</div>
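The definitions above can be made concrete with the usual formula VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others, so that Tolerance = 1 - R². The following is a minimal sketch on synthetic data, not the Knights dataset: the helper `vif_from_r2` and the three made-up columns (`force`, `lightsaber`, `survival`) are assumptions chosen for illustration. The helper adds its own intercept, so the data does not need to be centered first.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j, obtained by regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Design matrix: an intercept column plus every other feature.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    # R^2 = 1 - (unexplained variance / total variance)
    r2 = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r2)

# Hypothetical data: lightsaber is almost a copy of force, survival is unrelated.
rng = np.random.default_rng(0)
force = rng.normal(size=500)
lightsaber = force + 0.1 * rng.normal(size=500)
survival = rng.normal(size=500)
X = np.column_stack([force, lightsaber, survival])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
tolerances = [1.0 / v for v in vifs]

for name, v, t in zip(["force", "lightsaber", "survival"], vifs, tolerances):
    print(f"{name:>10}  VIF={v:8.2f}  Tolerance={t:.4f}")
```

In this setup the two near-duplicate columns should both land far above 10, while the unrelated column stays close to 1.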

In [None]:
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
df = pd.read_csv('../../subject/Knights/Train_Knight.csv')

In [316]:
# Encode the target so that every column is numeric.
if 'knight' in df.columns:
    df['knight'] = df['knight'].map({'Jedi': 0, 'Sith': 1})
# Standardize every column; centering matters because
# variance_inflation_factor does not add an intercept on its own.
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
# One row per feature: its VIF and the matching tolerance.
full_VIF_display = pd.DataFrame()
full_VIF_display[''] = df_scaled.columns
full_VIF_display['VIF'] = [variance_inflation_factor(df_scaled.values, i)
                           for i in range(len(df_scaled.columns))]
full_VIF_display['Tolerance'] = 1 / full_VIF_display['VIF']
print(full_VIF_display)


                           VIF  Tolerance
0     Sensitivity  4405.611729   0.000227
1        Hability    11.315560   0.088374
2        Strength  4507.502279   0.000222
3           Power   415.206742   0.002408
4         Agility     7.658262   0.130578
5       Dexterity    54.742161   0.018267
6       Awareness    70.187104   0.014248
7      Prescience    54.807121   0.018246
8      Reactivity     4.074263   0.245443
9   Midi-chlorien    14.678468   0.068127
10          Slash    81.359476   0.012291
11           Push     4.076030   0.245337
12           Pull    74.078367   0.013499
13     Lightsaber    47.718093   0.020956
14       Survival     3.663389   0.272971
15        Repulse    15.265039   0.065509
16     Friendship    19.172066   0.052159
17       Blocking    14.670653   0.068163
18     Deflection     4.550875   0.219738
19           Mass    10.506650   0.095178
20       Recovery   799.098802   0.001251
21          Evade    17.664258   0.056611
22          Stims   346.143357   0

In [317]:
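# Drop the feature with the highest VIF, then recompute every VIF,
# until all remaining features are below the threshold of 5.
while True:
    vif_data = pd.DataFrame()
    vif_data[''] = df_scaled.columns
    vif_data['VIF'] = [variance_inflation_factor(df_scaled.values, i)
                       for i in range(len(df_scaled.columns))]
    if vif_data['VIF'].max() < 5:
        break
    # idxmax() returns a row label of vif_data, which matches the column position.
    df_scaled = df_scaled.drop(columns=df_scaled.columns[vif_data['VIF'].idxmax()])
vif_data['Tolerance'] = 1 / vif_data['VIF']
print(vif_data)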

                        VIF  Tolerance
0        Hability  1.707505   0.585650
1         Agility  2.698801   0.370535
2      Reactivity  2.911166   0.343505
3   Midi-chlorien  3.655478   0.273562
4            Push  1.849124   0.540797
5            Pull  2.203770   0.453768
6        Survival  1.684175   0.593763
7      Friendship  3.762652   0.265770
8        Blocking  4.141421   0.241463
9      Deflection  2.180713   0.458566
10           Mass  4.102290   0.243766
11          Burst  3.770857   0.265192
12         knight  2.897881   0.345080


<div style="border: 2px solid white; padding: 10px;">
            <h3 style='text-align:center'><u>A look at the results:</u></h3>
            <pre style='line-height: 1.5'>
    As you can see, at first we got a pretty terrible VIF score for most
    features. The thing is, we cannot just keep the ones with the best
    VIF score and stop there. That would be an error and we'd be missing
    some precious information.
            </pre>
            <pre style='line-height: 1.5'>
    Now that we know that VIF measures how redundant a feature is,
    we can guess that <mark>if a feature is multicollinear</mark> with another, 
    then <mark>we shouldn't drop both</mark>!
    Therefore, the best method is to <mark>remove the feature</mark> with the highest score, 
    and <mark>recalculate</mark> the VIF score of <mark>all</mark> the remaining features.
    Every time a feature is removed, or added, all the other 
    features have their VIF score changed.
    As shown above, we ran a loop, removing the feature with the highest VIF score,
    one at a time, recalculating every VIF each time, 
    until every feature had a VIF score below 5.
            </pre>
</div>
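To see why both members of a collinear pair should not be dropped at once, here is a small sketch of the same remove-and-recompute idea on synthetic data. The function `drop_high_vif`, the helper `vif`, and the three columns are hypothetical names for illustration only; the threshold of 5 mirrors the one used in the notebook. This version uses plain NumPy rather than statsmodels, and adds its own intercept.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - (y - A @ coef).var() / y.var()
    return 1.0 / (1.0 - r2)

def drop_high_vif(X, names, threshold=5.0):
    """Remove the worst feature one at a time, recomputing all VIFs each round."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

# Hypothetical data: force and lightsaber are near-duplicates of each other.
rng = np.random.default_rng(1)
force = rng.normal(size=500)
lightsaber = force + 0.1 * rng.normal(size=500)
survival = rng.normal(size=500)
X = np.column_stack([force, lightsaber, survival])
names = ["force", "lightsaber", "survival"]

before = [vif(X, j) for j in range(X.shape[1])]
X_kept, kept = drop_high_vif(X, names)
after = [vif(X_kept, j) for j in range(X_kept.shape[1])]

print("before:", dict(zip(names, np.round(before, 2))))
print("after: ", dict(zip(kept, np.round(after, 2))))
```

Only one of the two near-duplicates is removed; once it is gone, the VIF of its twin falls back toward 1, which is the information that would be lost by discarding every high-VIF feature in a single pass.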