# Regularized feature selection -- Boston

1. Main idea
When all features are on the same scale, the most important features should have the largest coefficients (in absolute value) in the model, while features unrelated to the output variable should have coefficients close to zero. Even with a simple linear regression model, this approach works well when the data is not very noisy (or when there are many more samples than features) and the features are (relatively) independent.
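The idea can be illustrated with a minimal sketch on synthetic data. The three features, their scales, and the true weights are assumptions made up for this example; they are not taken from the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Three independent, hypothetical features on very different scales
X = np.column_stack([
    rng.normal(0, 1, n),     # strong signal
    rng.normal(0, 100, n),   # weak signal, large scale
    rng.normal(0, 0.01, n),  # pure noise, tiny scale
])
y = 3.0 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(0, 0.5, n)

# Coefficients on the raw features depend on each feature's units
raw_coefs = LinearRegression().fit(X, y).coef_

# After standardization the coefficients become comparable
X_std = StandardScaler().fit_transform(X)
coefs = LinearRegression().fit(X_std, y).coef_

# Rank features by absolute coefficient, largest first
ranking = np.argsort(-np.abs(coefs))
print("raw coefficients:         ", raw_coefs)
print("standardized coefficients:", coefs)
print("ranking (most important first):", ranking)
```

On the standardized features the ranking follows each feature's actual contribution to y, whereas the raw coefficients mostly reflect the units of each column.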

2. Regularization model
Regularization adds constraints or penalties to an existing model's loss function to prevent overfitting and improve generalization. The loss function E(X, Y) becomes E(X, Y) + alpha * ||w||, where w is the vector of model coefficients (also called parameters), ||·|| is usually the L1 norm or the (squared) L2 norm, and alpha is a tunable parameter that controls the strength of the regularization. When applied to linear models, L1 and L2 regularization are known as Lasso and Ridge, respectively.
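As a rough sketch, the penalized loss can be written out directly. The function name `penalized_loss` and the use of mean squared error for E(X, Y) are assumptions for illustration; scikit-learn's estimators scale the data term slightly differently.

```python
import numpy as np

def penalized_loss(X, y, w, alpha, norm="l1"):
    """Mean squared error plus alpha times an L1 or squared-L2 penalty on w."""
    residual = y - X @ w
    data_loss = np.mean(residual ** 2)
    if norm == "l1":
        penalty = np.sum(np.abs(w))   # L1 norm, as in Lasso
    elif norm == "l2":
        penalty = np.sum(w ** 2)      # squared L2 norm, as in Ridge
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return data_loss + alpha * penalty

# Tiny example where w fits y exactly, so only the penalty remains
X = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([1.0, -2.0])
y = X @ w
l1_value = penalized_loss(X, y, w, alpha=0.5, norm="l1")  # 0 + 0.5 * (1 + 2)
l2_value = penalized_loss(X, y, w, alpha=0.5, norm="l2")  # 0 + 0.5 * (1 + 4)
print(l1_value, l2_value)
```

A larger alpha makes the penalty dominate, so the optimizer prefers smaller coefficients even at the cost of a worse fit.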

1) L1 Regularization / Lasso Regression
L1 regularization adds the L1 norm of the coefficient vector w to the loss function as a penalty term. Because every non-zero coefficient contributes to the penalty, the coefficients of weak features are pushed to exactly zero. L1 regularization therefore tends to produce sparse models (many coefficients are 0), which makes it a good feature selection method.

Lasso picks out some good features while driving the coefficients of the others to zero. This is useful when you need to reduce the number of features, but less useful for understanding the data.
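The sparsity effect can be seen in a small sketch on synthetic data. The data, the choice of three informative features out of ten, and the alpha values are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three (hypothetical) features influence y
y = 4 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, 200)
X = StandardScaler().fit_transform(X)

# Count how many coefficients stay non-zero as alpha grows
nonzero_counts = {}
for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    nonzero_counts[alpha] = int(np.sum(coef != 0))
    print("alpha = %.2f -> %d non-zero coefficients" % (alpha, nonzero_counts[alpha]))
```

With a weak penalty most coefficients survive; as alpha increases, the uninformative features are dropped first and the informative ones are shrunk.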

2) L2 Regularization / Ridge Regression
L2 regularization adds the squared L2 norm of the coefficient vector to the loss function.

Because the coefficients enter the L2 penalty quadratically, L2 behaves quite differently from L1. The most obvious difference is that L2 regularization spreads coefficient values out more evenly.
For correlated features, this means that they tend to receive similar coefficients:
Ridge distributes the regression coefficients evenly across the correlated variables.

- L2 regularization is more stable for feature selection than L1 regularization, whose coefficients can fluctuate with small changes in the data. The two methods therefore serve different purposes, and L2 regularization is more useful for feature understanding: a feature with strong predictive power receives a non-zero coefficient.
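A minimal sketch of how Ridge shares weight between correlated features, using the extreme case of two identical columns. The data and alpha=1.0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])            # two identical (perfectly correlated) features
y = 2.0 * x + rng.normal(0, 0.5, 200)  # the true total effect of x is 2

ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", ridge.coef_)
```

Because the quadratic penalty is smallest when the weight is split equally, both columns receive the same coefficient, and the two coefficients together add up to roughly the total effect of 2.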

Each feature has a corresponding weight coefficient in coef_. The sign of the coefficient indicates whether the feature is positively or negatively associated with the target value, and, when features are on the same scale, its absolute value indicates the feature's importance.

The fit() method of LinearRegression in scikit-learn solves for θ using the training set. The intercept_ and coef_ attributes of LinearRegression correspond to θ0 and θ1 through θn, respectively.
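The correspondence between the fitted attributes and θ can be checked with a short sketch. The synthetic data and true weights are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Hypothetical ground truth: theta0 = 1.5, theta1..theta3 = 2.0, -1.0, 0.5
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, y)
theta0 = model.intercept_  # θ0
theta = model.coef_        # θ1 .. θn, one entry per feature

# predict() is the same as θ0 + X·θ
manual_pred = theta0 + X @ theta
print(np.allclose(manual_pred, model.predict(X)))
```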

In [4]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

In [5]:
# Load the Boston data (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
boston = datasets.load_boston()

bos = pd.DataFrame(boston.data, columns=boston.feature_names)
X = bos[bos.columns]
bos["PRICE"] = boston.target
y = bos["PRICE"]

# Filter out outliers (target values of 50 or more)
X = X[y < 50]
y = y[y < 50]
reg = LinearRegression()
reg.fit(X, y)
# Indices that sort coef_ in ascending order
coefSort = reg.coef_.argsort()
# featureNameSort: feature names ordered by their effect on the target, from smallest to largest coefficient
# featureCoefSore: the coef_ values in the same order
featureNameSort = boston.feature_names[coefSort]
featureCoefSore = reg.coef_[coefSort]
print("featureNameSort:", featureNameSort)
print("featureCoefSore:", featureCoefSore)

featureNameSort: ['NOX' 'DIS' 'PTRATIO' 'LSTAT' 'CRIM' 'INDUS' 'AGE' 'TAX' 'B' 'ZN' 'RAD'
 'CHAS' 'RM']
featureCoefSore: [-1.23981083e+01 -1.21096549e+00 -8.38180086e-01 -3.50107918e-01
 -1.06715912e-01 -4.38830943e-02 -2.36790549e-02 -1.37774382e-02
  7.85316354e-03  3.53133180e-02  2.51301879e-01  4.52209315e-01
  3.75945346e+00]



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

Results analysis:

- The feature with the largest positive coefficient is "RM", the average number of rooms, with a coefficient of 3.76.
- The feature with the largest negative coefficient is "NOX", the nitric oxides concentration, with a coefficient of -12.40.
- The features were not standardized before fitting, so these magnitudes depend on each feature's units and are not directly comparable.

In [2]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston


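def pretty_print_linear(coefs, names=None, sort=False):
    """Format a linear model as a string such as '2.0 * X0 + -1.5 * X1'."""
    # Use "is None": comparing a NumPy array of names with == is ambiguous
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # Largest absolute coefficient first
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)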

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)
lasso.coef_
# print("Lasso model: {}".format(pretty_print_linear(lasso.coef_, names, sort = True)))



array([-0.24227912,  0.081819  , -0.        ,  0.53987192, -0.69891258,
        2.99322993, -0.        , -1.08091325,  0.        , -0.        ,
       -1.75561249,  0.62831526, -3.70463287])
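To read this array as a feature selection result, the coefficients can be paired with the feature names. The sketch below copies the values from the output above and assumes the columns are in the dataset's standard feature_names order.

```python
import numpy as np

# Assumed: standard column order of the Boston dataset's feature_names
names = np.array(["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                  "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"])
# Coefficients copied from the lasso.coef_ output above
coefs = np.array([-0.24227912, 0.081819, -0.0, 0.53987192, -0.69891258,
                  2.99322993, -0.0, -1.08091325, 0.0, -0.0,
                  -1.75561249, 0.62831526, -3.70463287])

kept = coefs != 0                          # features Lasso retained
dropped_names = names[~kept].tolist()      # features with a zero coefficient
order = np.argsort(-np.abs(coefs[kept]))   # largest absolute coefficient first

for name, coef in zip(names[kept][order], coefs[kept][order]):
    print("%-8s % .3f" % (name, coef))
print("Dropped:", dropped_names)
```

With alpha=0.3, four of the thirteen coefficients are exactly zero, and LSTAT and RM carry the largest weights on the standardized features.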

In [9]:
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
size = 100

#We run the method 10 times with different random seeds
for i in range(10):
    print("Random seed {}".format(i))
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T


    lr = LinearRegression()
    lr.fit(X,Y)
    print("Linear model: {}".format(pretty_print_linear(lr.coef_)))

    ridge = Ridge(alpha=10)
    ridge.fit(X,Y)
    print("Ridge model: {}".format(pretty_print_linear(ridge.coef_)))

Random seed 0
Linear model: 0.728 * X0 + 2.309 * X1 + -0.082 * X2
Ridge model: 0.938 * X0 + 1.059 * X1 + 0.877 * X2
Random seed 1
Linear model: 1.152 * X0 + 2.366 * X1 + -0.599 * X2
Ridge model: 0.984 * X0 + 1.068 * X1 + 0.759 * X2
Random seed 2
Linear model: 0.697 * X0 + 0.322 * X1 + 2.086 * X2
Ridge model: 0.972 * X0 + 0.943 * X1 + 1.085 * X2
Random seed 3
Linear model: 0.287 * X0 + 1.254 * X1 + 1.491 * X2
Ridge model: 0.919 * X0 + 1.005 * X1 + 1.033 * X2
Random seed 4
Linear model: 0.187 * X0 + 0.772 * X1 + 2.189 * X2
Ridge model: 0.964 * X0 + 0.982 * X1 + 1.098 * X2
Random seed 5
Linear model: -1.291 * X0 + 1.591 * X1 + 2.747 * X2
Ridge model: 0.758 * X0 + 1.011 * X1 + 1.139 * X2
Random seed 6
Linear model: 1.199 * X0 + -0.031 * X1 + 1.915 * X2
Ridge model: 1.016 * X0 + 0.89 * X1 + 1.091 * X2
Random seed 7
Linear model: 1.474 * X0 + 1.762 * X1 + -0.151 * X2
Ridge model: 1.018 * X0 + 1.039 * X1 + 0.901 * X2
Random seed 8
Linear model: 0.084 * X0 + 1.88 * X1 + 1.107 * X2
Ridge model:

As the examples show, the linear regression coefficients vary widely depending on the generated data. With L2 regularization, however, the coefficients are very stable and closely reflect how the data was generated (all coefficients are close to 1).
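One way to put a number on this stability is to compare the spread of each coefficient across the ten seeds. This is a sketch that regenerates the same synthetic data as the cell above; summarizing with the standard deviation is a choice made here, not part of the original experiment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

size = 100
ols_coefs, ridge_coefs = [], []
for seed in range(10):
    np.random.seed(seed)
    X_seed = np.random.normal(0, 1, size)
    # Three noisy copies of the same underlying variable
    X = np.array([X_seed + np.random.normal(0, 0.1, size) for _ in range(3)]).T
    Y = X.sum(axis=1) + np.random.normal(0, 1, size)
    ols_coefs.append(LinearRegression().fit(X, Y).coef_)
    ridge_coefs.append(Ridge(alpha=10).fit(X, Y).coef_)

# Standard deviation of each coefficient across seeds
ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
ridge_mean = np.mean(ridge_coefs, axis=0)
print("OLS   std per coefficient:", ols_std)
print("Ridge std per coefficient:", ridge_std)
print("Ridge mean per coefficient:", ridge_mean)
```

A much smaller standard deviation for Ridge than for plain linear regression is what "stable" means in this context.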