# Exercise

# #1

In [1]:
#Load the Boston dataset, setting X as boston.data and y as boston.target
#Attempt a grid search using polynomial regression + (linear, ridge, lasso, elastic net), and answer:
#Do the feature mechanisms (regularization) of ridge/lasso/elastic net help here?
#What is the optimal polynomial degree?  What does it mean?
#Why do you think the result is like this?
#What are the values of the lambdas (alpha in scikit-learn), and what do they mean?
#YOUR CODE HERE

In [5]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [7]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

# Solution

# #1

In [2]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error, r2_score
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

import pandas as pd
import numpy as np

#we know in advance that ElasticNet is going to complain about
#needing more iterations.  To keep my pc from crashing,
#I will simply ignore this warning; otherwise I would
#likely set tol a bit higher
warnings.filterwarnings("ignore", category=ConvergenceWarning,
                        module="sklearn")

boston = load_boston()
X = boston.data
y = boston.target
#scaling helps the solvers converge faster
#NOTE: the scaler is fit on all of X, before the train/test split
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size = 0.3, random_state=5)

params_linear = {'polynomialfeatures__degree': np.arange(1, 10)}
linear = make_pipeline(PolynomialFeatures(), LinearRegression(normalize=True))
params_Ridge = {'polynomialfeatures__degree': np.arange(1, 10),
                'ridge__alpha': np.logspace(-1, -4, 4)}
ridge = make_pipeline(PolynomialFeatures(), Ridge(normalize=True))  
params_Lasso = {'polynomialfeatures__degree': np.arange(1, 10),
                'lasso__alpha': np.logspace(-1, -4, 4)}
lasso = make_pipeline(PolynomialFeatures(), 
                      Lasso(normalize=True))
params_Elasticnet = {'polynomialfeatures__degree': np.arange(1, 10),
                'elasticnet__alpha': np.logspace(-1, -4, 4),
                "elasticnet__l1_ratio": np.linspace(0, 1, 3)}
elastic = make_pipeline(PolynomialFeatures(), 
                      ElasticNet(normalize=True))

params = [params_linear, params_Ridge, params_Lasso, params_Elasticnet]
models = [linear, ridge, lasso, elastic]
features_name = ['linearregression', 'ridge', 'lasso', 'elasticnet']

for ix, model in enumerate(models):
    cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state=42)
    print(model)
    grid = GridSearchCV(model, params[ix], cv=cv)
    
    #grid.fit fits the model at each grid point and, since refit=True
    #by default, refits the best configuration on all of X_train
    grid.fit(X_train, y_train)

    #print the best parameters
    print("Best params: ", grid.best_params_)

    #make predictions with the best estimator (already refit by GridSearchCV)
    model = grid.best_estimator_
    y_pred = model.predict(X_test)

    #print the stats
    print("Coefficients: ", model.named_steps[features_name[ix]].coef_)
    print(f"r^2 = {r2_score(y_test, y_pred):.3f}")
    print(f"MSE = {mean_squared_error(y_test, y_pred):.2f}")
    #NOTE: n and p come from the full, unexpanded X (506 samples, 13 features),
    #not from the test split or the number of polynomial features
    n, p = X.shape[0], X.shape[1]
    adjusted_rsqrt = 1-(1-r2_score(y_test, y_pred))*(n-1)/(n-p-1)
    print(f"adjusted $r^2$ = {adjusted_rsqrt:.3f}")

'''
We can quickly tell that this Boston dataset contains some less relevant features, which
lower the accuracy of the LinearRegression model.  Lasso and Ridge therefore seem to provide much
better results; however, it is not yet clear which of the two performs better.

The alpha value is also informative: an alpha of 0.001 indicates that only a slightly regularized
model is needed to improve the accuracy by almost 20 percent.

Looking at the coefficients, take the degree-1 terms from Ridge [coefficients number 1 - 13; number 0 is the bias column]:

  -1.44816043e-01 -5.64496705e-02  3.98000454e-01
  4.38845564e-01 -8.51675780e-01  3.55010742e+00 -1.37778256e+00
 -1.86222774e+00  1.10837142e+00 -1.06506148e+00 -8.12821444e-01
  1.25327660e+00 -3.26288191e+00
  
Coefficient number 6 clearly has the strongest positive impact on the target,
while coefficient number 13 has the strongest inverse relationship with the target.  Feature 6 is RM,
the average number of rooms per dwelling, which sensibly pushes the price up.  Feature 13 is LSTAT,
the percentage of lower-status population, which naturally pushes the price down.

The good news is that all models agree on the direction, so features 6 and 13 seem to have the strongest impact.
Can you tell which two features are the least important?

Another observation is that the best polynomial degree is 1 or 2, which indicates
an almost linear relationship between the Boston features and the Boston target.

Adjusted r-squared does not differ much from r-squared, since the number of samples is large relative
to the number of features, so the denominator (n - p - 1) is not dramatically different from the numerator (n - 1).

For ElasticNet, interestingly, the search chooses an l1_ratio of 0.5, meaning a balance between Lasso and Ridge.
Anyhow, I am not sure how typical my pc is, but it took an entire night(!) to complete the model for a mere
2% increase in accuracy compared to Lasso and Ridge.

In conclusion, the intuition is that there are some very prominent features but also some very irrelevant
features, so a regularized model works better in this situation.  I would go for Ridge over
ElasticNet since it takes much less time to run; if I wanted to understand the interaction between the
l1 and l2 penalties, I would run ElasticNet.  Lasso is also a good choice since all models suggest that the second feature
- ZN - is nearly irrelevant to the target, so it may be useful to shrink its coefficient to 0, which Lasso and ElasticNet do.
'''
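
The coefficient reading in the commentary above can be checked mechanically. What follows is a minimal sketch in plain Python: it pairs the 13 first-degree Ridge coefficients quoted in the commentary with the names from `boston.feature_names` and sorts them by absolute value. It assumes X was standardized (as in the solution) so that magnitudes are comparable; the helper name `rank_by_magnitude` is made up for illustration.

```python
# Feature names as printed by boston.feature_names
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# First-degree Ridge coefficients quoted in the commentary (bias column omitted)
ridge_coefs = [-1.44816043e-01, -5.64496705e-02, 3.98000454e-01,
               4.38845564e-01, -8.51675780e-01, 3.55010742e+00,
               -1.37778256e+00, -1.86222774e+00, 1.10837142e+00,
               -1.06506148e+00, -8.12821444e-01, 1.25327660e+00,
               -3.26288191e+00]


def rank_by_magnitude(names, coefs):
    """Hypothetical helper: (name, coef) pairs, largest |coef| first."""
    return sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)


ranked = rank_by_magnitude(feature_names, ridge_coefs)
for name, coef in ranked:
    print(f"{name:8s} {coef:+.3f}")

# Two largest and two smallest coefficients by absolute value
strongest = [name for name, _ in ranked[:2]]
weakest = [name for name, _ in ranked[-2:]]
print("strongest:", strongest)
print("weakest:  ", weakest)
```

Keep in mind that these are only the linear terms of a degree-2 fit, so the ranking ignores the squared and interaction terms that the same model also uses.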

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('linearregression', LinearRegression(normalize=True))])
Best params:  {'polynomialfeatures__degree': 1}
Coefficients:  [ 0.         -1.32750493  0.96447433 -0.17391979  0.19945597 -1.49757922
  2.83543215 -0.2962689  -2.80831532  2.76854145 -2.12866545 -2.11368262
  1.15569647 -3.29628098]
r^2 = 0.677
MSE = 30.70
adjusted $r^2$ = 0.669
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('ridge', Ridge(normalize=True))])
Best params:  {'polynomialfeatures__degree': 2, 'ridge__alpha': 0.01}
Coefficients:  [ 0.00000000e+00 -1.44816043e-01 -5.64496705e-02  3.98000454e-01
  4.38845564e-01 -8.51675780e-01  3.55010742e+00 -1.37778256e+00
 -1.86222774e+00  1.10837142e+00 -1.06506148e+00 -8.12821444e-01
  1.25327660e+00 -3.26288191e+00  1.63850840e-01  4.71665252e-01
 -1.16864891e-01  2.94839213e+00 -6.22042398e-01  3.06562044e-01
 -4.06958944e-01  3.23915966e-01 -5.05610073e-01  7.88507767
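
The adjusted r^2 printed above follows 1 - (1 - r^2)(n - 1)/(n - p - 1). Here is a small sketch of that formula, assuming the rounded r^2 = 0.677 from the LinearRegression output and the solution's choice of n = 506 and p = 13. The second call uses an assumed test-split size of 152 rows (30% of 506, rounded up) to show how sensitive the correction is to n. `adjusted_r2` is an illustrative helper, not part of scikit-learn.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r^2 for n samples and p predictors (requires n > p + 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.677  # rounded value taken from the printed output

# n and p as used in the solution: full dataset, 13 raw features
full = adjusted_r2(r2, n=506, p=13)

# Assumption: only the held-out rows are counted (152 of 506)
test_only = adjusted_r2(r2, n=152, p=13)

print(f"full-data n: {full:.4f}")
print(f"test-only n: {test_only:.4f}")
```

Because the input r^2 is already rounded, the first value can differ from the printed 0.669 in the last digit. With the smaller n the penalty grows noticeably, so which n and p are plugged in matters when comparing models.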

'\nWe can quickly tell that there are some irrelevant features in this Boston dataset, which thus\nlower the accuracy of the LinearRegression model.   Thus, Lasso and Ridge seems to provide much\nbetter results; however, it is not yet clear which performs better.  \n\nAnother point of observation is that the polynomial degree is around 1 and 2, which indicates\nan almost linear relationship between boston data and boston target.\n\nAdjusted r-square does not change dramatically, since the number of samples scale well with\nnumber of features, thus the lower term (n - p - 1) is not dramatically different from only (n)\n\nSadly, my pc takes a long long time to run Elastic net.  It would be interesting to see what\nElastic net picks between l1 and l2 penalty, as well as the ratio.  If anyone can tell me, \nlet me know in the online discussion!\n\n'
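
Recent scikit-learn releases (1.2 and later) no longer ship `load_boston` or the `normalize=` argument of the linear models, so the solution cell needs an older version to run as written. As a hedged sketch of the same grid-search pattern on a current version, the block below puts `StandardScaler` inside the pipeline (so it is refit on each CV training fold) and uses synthetic `make_regression` data as a stand-in. The data, grid sizes and variable names are assumptions for illustration and will not reproduce the Boston numbers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption: NOT the Boston dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())
params = {'polynomialfeatures__degree': [1, 2, 3],
          'ridge__alpha': np.logspace(-4, -1, 4)}

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(pipe, params, cv=cv)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
# GridSearchCV refits the best pipeline on X_train, so it can score directly
print(f"test r^2 = {grid.score(X_test, y_test):.3f}")
```

The same pattern extends to `Lasso` and `ElasticNet` by swapping the final step and renaming the grid keys (`lasso__alpha`, `elasticnet__alpha`, `elasticnet__l1_ratio`).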