Univariate feature selection: each feature is evaluated independently with respect to the response variable.
 

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression as LR
from matplotlib import pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv('kc_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [4]:
X = df[['bathrooms','sqft_above','sqft_basement','sqft_lot','yr_built','yr_renovated','view','waterfront','zipcode']]
y = df['price']

### Feature Selection: Univariate Selection
* http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/

Univariate selection can help pick the top features for model improvement in some settings, but it is **UNABLE TO REMOVE REDUNDANCY**: it cannot, for example, select only the best feature among **a subset of strongly correlated features**. That task is better left to other methods.
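
A minimal sketch of the redundancy problem, on hypothetical synthetic data (the names `x1`, `x2`, `x3` and all coefficients are made up for illustration). `x2` is a near-copy of `x1`, while `x3` carries independent but weaker signal. Scoring each column on its own and keeping the top two picks both copies and drops `x3`.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 1000

# Hypothetical data: x1 drives y, x2 is a near-copy of x1,
# x3 is an independent but weaker driver of y.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

X_demo = np.column_stack([x1, x2, x3])

# Univariate score: absolute Pearson correlation of each column with y.
scores = np.array([abs(np.corrcoef(X_demo[:, j], y)[0, 1])
                   for j in range(X_demo.shape[1])])

# Keep the two best-scoring columns, as a top-k univariate filter would.
top2 = sorted(np.argsort(scores)[-2:].tolist())
print(scores.round(3), top2)
```

Columns 0 and 1 win even though they carry almost the same information, which is the situation a multivariate method is needed for.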

#### Pearson Correlation
SciPy's `pearsonr` function returns both the correlation coefficient and a p-value for it. The p-value roughly indicates the probability that an uncorrelated system would produce a correlation at least this large in magnitude.

Pros:
* faster than more sophisticated methods
* shows whether the relationship is positive or negative

Cons:
* captures only **linear dependency**!
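
A small illustration of the linear-only limitation, on made-up data: `y_quadratic` is fully determined by `x`, yet its Pearson correlation with `x` is essentially zero because the relationship is symmetric around zero rather than linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y_linear = 3.0 * x                # perfectly linear in x
y_quadratic = x ** 2              # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
print(r_linear, r_quadratic)
```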

In [5]:
from scipy.stats import pearsonr
for feature in X.columns:
    print(feature, pearsonr(X[feature], y))

bathrooms (0.52513750541396187, 0.0)
sqft_above (0.60556729835607825, 0.0)
sqft_basement (0.32381602071198395, 0.0)
sqft_lot (0.089660860587100114, 7.9725045103261473e-40)
yr_built (0.054011531494792715, 1.929872809374955e-15)
yr_renovated (0.12643379344089295, 1.0213478858043326e-77)
view (0.39729348829450428, 0.0)
waterfront (0.26636943403060209, 0.0)
zipcode (-0.053202854298325608, 5.0110505033187622e-15)


In [6]:
col = X.columns
[(i, j, pearsonr(X[i], X[j])) for i in col for j in col if i<=j]

[('bathrooms', 'bathrooms', (1.0, 0.0)),
 ('bathrooms', 'sqft_above', (0.68534247587615915, 0.0)),
 ('bathrooms', 'sqft_basement', (0.28377003400466877, 0.0)),
 ('bathrooms', 'sqft_lot', (0.087739661531261948, 3.3340851719574847e-38)),
 ('bathrooms', 'yr_built', (0.50601943828525331, 0.0)),
 ('bathrooms', 'yr_renovated', (0.050738977648059562, 8.4128623727100258e-14)),
 ('bathrooms', 'view', (0.18773702397664169, 1.1964071644867705e-170)),
 ('bathrooms', 'waterfront', (0.063743629135636942, 6.5861097041278159e-21)),
 ('bathrooms', 'zipcode', (-0.20386627357626266, 1.6500992147613623e-201)),
 ('sqft_above', 'sqft_above', (1.0, 0.0)),
 ('sqft_above',
  'sqft_basement',
  (-0.051943306770727428, 2.1539158563766889e-14)),
 ('sqft_above', 'sqft_lot', (0.18351228086523363, 5.137221092667776e-163)),
 ('sqft_above', 'yr_built', (0.42389835166374401, 0.0)),
 ('sqft_above', 'yr_renovated', (0.023284687865467713, 0.0006183571709836459)),
 ('sqft_above', 'view', (0.16764934410325252, 5.29758031208

## sklearn.feature_selection
* selectors: SelectKBest, SelectPercentile, SelectFpr, SelectFwe, GenericUnivariateSelect

* scoring functions for regression: f_regression, mutual_info_regression
* scoring functions for classification: chi2, f_classif, mutual_info_classif

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

example code:

In [9]:
from sklearn.feature_selection import SelectKBest, f_regression
SKB = SelectKBest(f_regression, k=4)
X_new = SKB.fit_transform(X,y)
SKB.scores_

array([  8228.9432278 ,  12514.0608974 ,   2531.50632597,    175.14030523,
           63.2290479 ,    351.07483789,   4050.45898116,   1650.46303578,
           61.34451836])

##### F-test:
1. The cross correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).
2. It is converted to an F score and then to a p-value.

-----> captures only **LINEAR DEPENDENCY**! (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html#sphx-glr-auto-examples-feature-selection-plot-f-test-vs-mi-py)
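
The two steps above can be sketched by hand. This assumes the default `center=True`, in which case the F score for a single regressor is r² / (1 - r²) × (n - 2), where r is the correlation from step 1 and n is the number of samples. The data and the names `X_demo`, `y_demo` are synthetic and only for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=200)

n_samples = X_demo.shape[0]

# Step 1: Pearson correlation of each column with the target.
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])

# Step 2: convert the correlation to an F score with n - 2 degrees of freedom.
f_manual = r ** 2 / (1.0 - r ** 2) * (n_samples - 2)

f_sklearn, p_values = f_regression(X_demo, y_demo)
print(f_manual)
print(f_sklearn)
```

Because the score depends on each feature only through r², ranking by F score is the same as ranking by absolute Pearson correlation.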

In [10]:
from sklearn.feature_selection import f_regression, mutual_info_regression
f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)
print(list(zip(col, f_test)))

[('bathrooms', 0.65757576978979304), ('sqft_above', 1.0), ('sqft_basement', 0.20229295244181325), ('sqft_lot', 0.013995481296054574), ('yr_built', 0.0050526402596444411), ('yr_renovated', 0.028054429394893526), ('view', 0.32367262828331822), ('waterfront', 0.13188868500063505), ('zipcode', 0.0049020472940469236)]


##### MI & MIC:
-----> **captures any kind of statistical dependency**

However, mutual information (MI) can be inconvenient to use directly for feature ranking, for two reasons. First, it is not a metric and is not normalized (i.e. it doesn't lie in a fixed range), so MI values can be incomparable between two datasets. Second, it can be inconvenient to compute for continuous variables: in general the variables need to be discretized by binning, and the mutual information score can be quite sensitive to bin selection.

**Maximal information coefficient** (MIC) is a technique developed to address these shortcomings. It searches for an optimal binning and turns the mutual information score into a normalized score that lies in the range [0, 1]. In Python, MIC is available in the minepy library.
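
To make the bin-sensitivity point concrete, here is a rough sketch of a histogram-based ("plug-in") MI estimate evaluated with different bin counts. The helper `binned_mutual_info` and the data are hypothetical and written only for this illustration; this is not how `mutual_info_regression` or minepy compute their scores.

```python
import numpy as np

def binned_mutual_info(x, y, bins):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint cell probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    nonzero = p_xy > 0                         # skip empty cells (0 * log 0 = 0)
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log(ratio)))

rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = x ** 2 + 0.1 * rng.normal(size=500)

# Same data, three different binnings.
estimates = {bins: binned_mutual_info(x, y, bins) for bins in (4, 16, 64)}
for bins, value in estimates.items():
    print(bins, round(value, 3))
```

The estimate keeps growing as the bins get finer on the very same sample, so a raw binned MI value means little without knowing the binning behind it.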

In [11]:
mi = mutual_info_regression(X, y)
mi /= np.max(mi)
print(list(zip(col, mi)))

[('bathrooms', 0.48313805417875244), ('sqft_above', 0.62345865146860457), ('sqft_basement', 0.16855055426958279), ('sqft_lot', 0.14501153554990898), ('yr_built', 0.17520338054615969), ('yr_renovated', 0.0032042748241083475), ('view', 0.13446007600176257), ('waterfront', 0.029273394717551429), ('zipcode', 1.0)]


In [12]:
from minepy import MINE
m = MINE()
for c in col:
    m.compute_score(X[c], y)
    print(c, m.mic())

bathrooms 0.184667777851
sqft_above 0.217266693397
sqft_basement 0.0731383173777
sqft_lot 0.0803989424276
yr_built 0.0573512220128
yr_renovated 0.0389160834264
view 0.10493931877
waterfront 0.036629801995
zipcode 0.384736343043


#### Pipeline

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. **Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.**

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting to None.

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [13]:
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
# You can set the parameters using the names issued
# For instance, fit using a k of 10 in the SelectKBest
# and a parameter 'C' of the svm
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)

prediction = anova_svm.predict(X)
anova_svm.score(X, y)                        

# getting the selected features chosen by anova_filter
anova_svm.named_steps['anova'].get_support()

array([False, False,  True,  True, False, False,  True,  True, False,
        True, False,  True,  True, False,  True, False,  True,  True,
       False, False], dtype=bool)
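
The `<step name>__<parameter name>` convention is also what allows the steps of a pipeline to be tuned and cross-validated together. A minimal sketch with `GridSearchCV`, under a few assumptions: the grid values are arbitrary, the names `X_demo`, `y_demo`, `pipe` are made up, and `f_classif` (the scorer intended for classification targets) is swapped in for `f_regression`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_informative=5, n_redundant=0,
                                     random_state=42)

pipe = Pipeline([("anova", SelectKBest(f_classif)),
                 ("svc", SVC(kernel="linear"))])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {"anova__k": [3, 5, 10], "svc__C": [0.1, 1.0]}

# Feature selection is refit inside every fold, so the held-out part
# of each split never influences which features are kept.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```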

#### Random Forest Regressor
Random Forest for Regression:
* https://www.quora.com/How-does-random-forest-work-for-regression-1

Many machine learning models either have an inherent internal ranking of features or make it easy to derive one from the structure of the model. This applies to regression models, SVMs, decision trees, random forests, etc.
For linear models, when all features are on the same scale, the most important features should have the largest coefficients (in absolute value), while features uncorrelated with the output variable should have coefficients close to zero. This approach can work well even with simple linear regression models when the data is not very noisy (or there is a lot of data compared to the number of features) and the features are (relatively) independent.
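
A minimal sketch of coefficient-based ranking, on hypothetical data with features on very different scales (all names and numbers are made up for illustration). The raw coefficients, 0.003 and 50, would suggest the opposite ranking; after standardizing, the coefficient sizes reflect each feature's actual contribution.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500

# Hypothetical features on very different scales.
strong = rng.normal(scale=1000.0, size=n)
weak = rng.normal(scale=0.01, size=n)
noise_feature = rng.normal(scale=1.0, size=n)   # unrelated to y
y = 0.003 * strong + 50.0 * weak + rng.normal(scale=0.5, size=n)

X_demo = np.column_stack([strong, weak, noise_feature])

# Put every column on the same scale (zero mean, unit variance).
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(n), X_scaled])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Rank features by the absolute size of their standardized coefficient.
weights = np.abs(coef[1:])
print(weights.round(3))
```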

In [14]:
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
from sklearn.ensemble import RandomForestRegressor

# Load the Boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor(n_estimators=20, max_depth=4)
scores = []
# Score each feature on its own: fit the forest on a single column
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=.3))
    scores.append((round(np.mean(score), 3), names[i]))
print(sorted(scores, reverse=True))

[(0.662, 'LSTAT'), (0.528, 'RM'), (0.407, 'NOX'), (0.329, 'TAX'), (0.302, 'PTRATIO'), (0.287, 'INDUS'), (0.23, 'CRIM'), (0.146, 'ZN'), (0.118, 'RAD'), (0.106, 'DIS'), (0.061, 'B'), (0.017, 'AGE'), (0.004, 'CHAS')]


Interpreting random forests:
http://blog.datadive.net/interpreting-random-forests/

### IMPUTE MISSING DATA

http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py

In [21]:
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
missing_rate = 0.75
# Cast to int: NumPy array shapes must be integers
n_missing_samples = int(np.floor(n_samples * missing_rate))
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
##################################################################
# HERE IS THE IMPUTATION CODE:

# SimpleImputer replaces the older preprocessing.Imputer and always
# imputes column-wise, so the axis argument is gone
estimator = Pipeline([("imputer", SimpleImputer(missing_values=0,
                                                strategy="mean")),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
##################################################################
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)

Score with the entire dataset = 0.56
Score without the samples containing missing values = 0.48
Score after imputation of the missing values = 0.57
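
What the imputation step does is easier to see on a tiny array. A minimal sketch, assuming missing entries are marked with `NaN` rather than the `0` marker used above; `X_small` is a made-up example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_small = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])

# Each missing entry is replaced by the mean of the observed values
# in the same column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_small)
print(X_filled)
```

Here the first column's gap becomes 2.0 (the mean of 1 and 3) and the second column's gap becomes 15.0 (the mean of 10 and 20).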




#### MultiOutput Regression

http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_regression_multioutput.html#sphx-glr-auto-examples-ensemble-plot-random-forest-regression-multioutput-py

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
# additional code omitted
# y_train has 2 targets, so its shape is n_samples-by-2

max_depth = 30
# MultiOutputRegressor fits one copy of the wrapped "estimator" (here a random forest) per target
regr_multirf = MultiOutputRegressor(RandomForestRegressor(max_depth=max_depth, random_state=0))
regr_multirf.fit(X_train, y_train)

# Using Random Forest directly: it handles multi-output targets natively
regr_rf = RandomForestRegressor(max_depth=max_depth, random_state=2)
regr_rf.fit(X_train, y_train)
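
Since the data preparation is omitted above, here is a self-contained sketch of the same comparison. The synthetic data (one input, two targets built from sine and cosine) is an assumption loosely modeled on the linked example, and the names `X_demo`, `y_demo`, `X_tr`, `X_te` are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(1)

# One input feature, two noisy targets: shape (n_samples, 2).
X_demo = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y_demo = np.array([np.pi * np.sin(X_demo).ravel(),
                   np.pi * np.cos(X_demo).ravel()]).T
y_demo += 0.5 - rng.rand(*y_demo.shape)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=400, test_size=200, random_state=4)

# One independent forest per target column.
multi = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=50, max_depth=30, random_state=0))
multi.fit(X_tr, y_tr)

# A single forest predicting both targets at once.
direct = RandomForestRegressor(n_estimators=50, max_depth=30, random_state=2)
direct.fit(X_tr, y_tr)

print(multi.predict(X_te).shape, direct.predict(X_te).shape)
print(len(multi.estimators_))
```

Both approaches return one prediction column per target; the wrapper trains a separate model for each target, while the direct forest shares its trees across targets.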

# Summary

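This session covers univariate feature selection, which suits settings where features are (relatively) uncorrelated, since it cannot remove redundancy.

Regression on scaled features is another way to do feature selection: with all features on the same scale, the size of each coefficient indicates how much weight that feature carries in the prediction.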