# Practice: Forward selection of predictors

* In supervised learning, a set of predictors must be chosen.
* Here we explore the dataset to judge which of the available variables are worth using as predictors.
* Preprocessing may also be applied to the dataset to handle missing or outlying values.


# The steps are as follows

1. Data analysis
2. Summarizing the data
 * Data structure

3. Model evaluation based on the RSS (residual sum of squares)
 * Predictor search with the forward stepwise approach
 * Predictor search with the backward stepwise approach
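
For reference, the RSS criterion in step 3 and the cost of a forward stepwise search can be written as follows. The symbols $n$ (number of observations), $p$ (number of candidate predictors), $y_i$ (observed response) and $\hat{y}_i$ (fitted value) are notation introduced here, not taken from the notebook.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad\qquad
\#\{\text{models fitted by forward stepwise}\} = \sum_{k=0}^{p-1} (p - k) = \frac{p(p+1)}{2}
```

With $p = 19$ this gives 190 fits, compared with the $2^{19} = 524288$ subsets an exhaustive search would have to consider.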

 

* References
 1. "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani


* Import the required modules

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
import itertools
import time
import statsmodels.api as sm
import matplotlib.pyplot as plt

* Load the dataset

In [4]:
df = pd.read_csv('Hitters.csv')

* Dataset description: shape

In [5]:
# Print the dimensions of the original Hitters data (322 rows x 21 columns)
print(df.shape)

(322, 21)


In [6]:
# Display the first 5 rows
df.head()

Unnamed: 0,Player,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,...,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,-Andy Allanson,293,66,1,30,29,14,1,293,66,...,30,29,14,A,E,446,33,20,,A
1,-Alan Ashby,315,81,7,24,38,39,14,3449,835,...,321,414,375,N,W,632,43,10,475.0,N
2,-Alvin Davis,479,130,18,66,72,76,3,1624,457,...,224,266,263,A,W,880,82,14,480.0,A
3,-Andre Dawson,496,141,20,65,78,37,11,5628,1575,...,828,838,354,N,E,200,11,3,500.0,N
4,-Andres Galarraga,321,87,10,39,42,30,2,396,101,...,48,46,33,N,E,805,40,4,91.5,N


In [7]:
# Data exploration: number of rows with a missing Salary value
print(df["Salary"].isnull().sum())

59


In [8]:
# Drop any rows that contain missing values, along with the player names
df = df.dropna().drop('Player', axis=1)
# Print the dimensions of the modified Hitters data (263 rows x 20 columns)
print(df.shape)
# One last check: should return 0
print(df["Salary"].isnull().sum())

(263, 20)
0
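
In the outputs above, 322 - 263 = 59 rows were dropped, which equals the number of missing Salary values, so every dropped row was one without a salary. Comparing per-column missing counts with the number of rows removed is a quick way to make that check. A minimal sketch on a made-up toy frame (the `toy` data and the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Hitters data (values are made up for illustration)
toy = pd.DataFrame({
    "Player": ["-A", "-B", "-C", "-D"],
    "Hits": [66, 81, 130, 141],
    "Salary": [np.nan, 475.0, np.nan, 500.0],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
clean = toy.dropna()
rows_dropped = len(toy) - len(clean)
print(rows_dropped)
```

If the number of dropped rows exceeded the largest per-column count, rows would be getting lost because of more than one column.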


# Preparing the DataFrame for selection

In [9]:
# Create numeric dummy variables to replace the categorical variables
# (cast to float so that every column of X is numeric, whatever the pandas version)
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']]).astype('float64')
 
y = df.Salary
# Drop the dependent variable (Salary) and the columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
# Print the dimensions of the feature set
print(X.shape)
# Display the first 5 rows
X.head()

(263, 19)


Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,League_N,Division_W,NewLeague_N
1,315.0,81.0,7.0,24.0,38.0,39.0,14.0,3449.0,835.0,69.0,321.0,414.0,375.0,632.0,43.0,10.0,1.0,1.0,1.0
2,479.0,130.0,18.0,66.0,72.0,76.0,3.0,1624.0,457.0,63.0,224.0,266.0,263.0,880.0,82.0,14.0,0.0,1.0,0.0
3,496.0,141.0,20.0,65.0,78.0,37.0,11.0,5628.0,1575.0,225.0,828.0,838.0,354.0,200.0,11.0,3.0,1.0,0.0,1.0
4,321.0,87.0,10.0,39.0,42.0,30.0,2.0,396.0,101.0,12.0,48.0,46.0,33.0,805.0,40.0,4.0,1.0,0.0,1.0
5,594.0,169.0,4.0,74.0,51.0,35.0,11.0,4408.0,1133.0,19.0,501.0,336.0,194.0,282.0,421.0,25.0,0.0,1.0,0.0
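
Each of the three categorical variables has two levels, so `get_dummies` produces two complementary indicator columns per variable, and the notebook keeps only one of each pair (`League_N`, `Division_W`, `NewLeague_N`). A minimal sketch on made-up data shows why the second column of a pair carries no extra information. `drop_first=True` is shown as an alternative to selecting columns by hand; it is not what the notebook does.

```python
import pandas as pd

# Toy categorical columns (made-up rows, same level names as in Hitters)
toy = pd.DataFrame({
    "League": ["A", "N", "N", "A"],
    "Division": ["E", "W", "E", "W"],
})

# Full encoding: one indicator column per level
full = pd.get_dummies(toy).astype(float)
print(list(full.columns))

# The two League indicators always sum to 1, so one of them is redundant
league_sum = full["League_A"] + full["League_N"]
print(league_sum.tolist())

# drop_first=True keeps a single indicator for each two-level variable
reduced = pd.get_dummies(toy, drop_first=True).astype(float)
print(list(reduced.columns))
```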


In [45]:
def processSubset(feature_set):
    # Fit an OLS model on feature_set (no intercept is added) and calculate its RSS
    model = sm.OLS(y,X[list(feature_set)])
    regr = model.fit()
    RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()
    return {"modele":regr, "RSS":RSS}
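
`sm.OLS` fits exactly the columns it is given and does not add an intercept, so the models in `processSubset` are regressions through the origin; statsmodels provides `sm.add_constant` for appending a column of ones. The sketch below uses NumPy's `lstsq` in place of statsmodels, on made-up data, purely to illustrate how the RSS changes when a constant column is included. The helper `rss_least_squares` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y depends on x plus a non-zero offset
n = 50
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=n)

def rss_least_squares(design, target):
    """Fit least squares with NumPy and return the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float((residuals ** 2).sum())

# Design matrix without an intercept, as in processSubset
X_no_const = x.reshape(-1, 1)
# Same design with an explicit column of ones
X_const = np.column_stack([np.ones(n), x])

rss_no_const = rss_least_squares(X_no_const, y)
rss_const = rss_least_squares(X_const, y)
print(rss_no_const, rss_const)
```

Whether an intercept belongs in the model is a modelling choice; the point here is only that the two designs are not interchangeable when comparing RSS values.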

* Function that finds the best model for a given number k of predictors

In [46]:
def forwardSelection(predictors):
    # Candidate predictors: those not yet in the current model
    remaining_predictors = [p for p in X.columns if p not in predictors]
    tic = time.time()
    results = []
    for p in remaining_predictors:
        # Fit the current model extended with one extra predictor
        results.append(processSubset(predictors+[p]))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].idxmin()]
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic))
    # Return the best model (a row holding the fitted model and its RSS)
    return best_model
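
The selection step is a lookup of the row with the smallest RSS. With label-based `.loc`, `idxmin()` is a natural pairing because it returns an index label; `argmin()` returns an integer position in recent pandas versions, and the two agree only when the index is the default 0..n-1 range. A small sketch with made-up numbers and hypothetical predictor names:

```python
import pandas as pd

# Made-up RSS values, indexed by hypothetical candidate names
rss = pd.Series({"x1": 45.0, "x2": 36.0, "x3": 49.0})

# idxmin() returns the index label of the smallest value
best_label = rss.idxmin()
print(best_label)

# The label can be passed straight to .loc
best_rss = rss.loc[best_label]
print(best_rss)
```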

# Feature selection based on RSS

* Goal: start with the minimum number of predictors and add one per iteration, keeping at each step the model with the lowest RSS

In [52]:
# Run the forward search: one predictor is added per iteration
models2 = pd.DataFrame(columns=["RSS", "modele"])
 
tic = time.time()
predictors = []
for i in range(1,len(X.columns)+1):
    models2.loc[i] = forwardSelection(predictors)
    # Carry the selected predictors over to the next iteration
    predictors = models2.loc[i]["modele"].model.exog_names

toc = time.time()
 
print("Total elapsed time:", (toc-tic), "seconds.")

('Processed ', 19, 'models on', 1, 'predictors in', 0.0690000057220459)
('Processed ', 18, 'models on', 2, 'predictors in', 0.04399991035461426)
('Processed ', 17, 'models on', 3, 'predictors in', 0.04800009727478027)
('Processed ', 16, 'models on', 4, 'predictors in', 0.0409998893737793)
('Processed ', 15, 'models on', 5, 'predictors in', 0.0410001277923584)
('Processed ', 14, 'models on', 6, 'predictors in', 0.039999961853027344)
('Processed ', 13, 'models on', 7, 'predictors in', 0.03299999237060547)
('Processed ', 12, 'models on', 8, 'predictors in', 0.03099989891052246)
('Processed ', 11, 'models on', 9, 'predictors in', 0.028999805450439453)
('Processed ', 10, 'models on', 10, 'predictors in', 0.026000022888183594)
('Processed ', 9, 'models on', 11, 'predictors in', 0.023999929428100586)
('Processed ', 8, 'models on', 12, 'predictors in', 0.02499985694885254)
('Processed ', 7, 'models on', 13, 'predictors in', 0.026000022888183594)
('Processed ', 6, 'models on', 14, 'predictors i
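
The log above shows the pattern of a greedy forward search: 19 candidate models at the first step, 18 at the second, and so on. The same loop can be sketched without pandas or statsmodels on made-up data, using NumPy's `lstsq` as a stand-in for `sm.OLS`. The names `X_toy`, `y_toy` and `rss_for` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (made up): only columns 0 and 3 influence y
n, p = 200, 5
X_toy = rng.normal(size=(n, p))
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + rng.normal(scale=0.5, size=n)

def rss_for(columns):
    """Least-squares RSS using only the given column indices (no intercept)."""
    design = X_toy[:, columns]
    coef, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
    return float(((y_toy - design @ coef) ** 2).sum())

selected = []   # predictors chosen so far, in order of entry
rss_path = []   # best RSS after each step
n_fits = 0      # total number of models fitted

for step in range(p):
    remaining = [j for j in range(p) if j not in selected]
    # Try adding each remaining predictor and keep the one with the lowest RSS
    scores = {j: rss_for(selected + [j]) for j in remaining}
    n_fits += len(scores)
    best = min(scores, key=scores.get)
    selected.append(best)
    rss_path.append(scores[best])

print(selected)
print(n_fits)
```

Since each model contains the previous one, the training RSS can only stay equal or decrease along `rss_path`.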

In [51]:
# Details of the model with 2 predictors, using the summary() function:
print(models2.loc[2, "modele"].summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.761
Model:                            OLS   Adj. R-squared:                  0.760
Method:                 Least Squares   F-statistic:                     416.7
Date:                Wed, 11 Jan 2017   Prob (F-statistic):           5.80e-82
Time:                        14:41:31   Log-Likelihood:                -1907.6
No. Observations:                 263   AIC:                             3819.
Df Residuals:                     261   BIC:                             3826.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Hits           2.9538      0.261     11.335      0.0