#### Pair Problem - Week2 - Day5 - Regression Practice

Practice the Lasso regularization technique in six steps:

1) Load the diabetes dataset from scikit-learn (`sklearn.datasets.load_diabetes()`).  Note that the features are already normalized (mean-centered and scaled to unit L2 norm).

2) Use the `KFold` class from sklearn's cross-validation tools (`sklearn.model_selection` in current releases) to divide the data into 5 training/test sets.  Shuffle the folds (`shuffle=True`, `random_state=0`).

3) Tune the lambda (`alpha`) parameter of the lasso model by looping over a grid of candidate lambdas (`sklearn.linear_model.Lasso`):

```
For each candidate lambda, loop over the 5 training/test sets.  
On each training/test set run the lasso model on the training set and then compute and record the prediction error in the test set.  
Finally total the prediction error for the 5 training/test sets.
```

4) Set lambda to be the value that minimizes prediction error.

5) Run the lasso model again with the optimal lambda determined in step 4. Which variables would you consider excluding on the basis of these results?

6) Repeat with Ridge, ElasticNet, and a baseline LinearRegression model.  Compare your results.

Report the best score.
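
Steps 2 through 5 can be sketched as an explicit loop. This is a minimal sketch rather than the reference solution: the lambda grid, the choice of summed squared error as the "prediction error", and `max_iter=10000` are assumptions made for illustration, and it uses the full dataset instead of a separate training split.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

# Step 2: five shuffled folds with a fixed seed.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 3: candidate lambdas (an arbitrary illustrative grid).
alphas = np.logspace(-4, 0.1, 30)
total_errors = []
for alpha in alphas:
    total = 0.0
    for train_idx, test_idx in kfold.split(X):
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[train_idx], y[train_idx])
        residuals = y[test_idx] - model.predict(X[test_idx])
        # Accumulate the squared prediction error over the five test folds.
        total += float(np.sum(residuals ** 2))
    total_errors.append(total)

# Step 4: keep the lambda with the smallest total error.
best_alpha = alphas[int(np.argmin(total_errors))]

# Step 5: refit with that lambda and inspect the coefficients.
final_model = Lasso(alpha=best_alpha, max_iter=10000).fit(X, y)
print("best alpha:", best_alpha)
print("coefficients:", final_model.coef_)
```

Coefficients that come back as exactly zero are the natural candidates for exclusion in step 5.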

**Extra Credit**:  Try some Feature Engineering (Polynomials etc) to fit the data better.  Plot the data to see relationships.
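
One possible starting point for the feature engineering is a pipeline that expands the ten features into degree-2 polynomial terms before fitting Lasso. This is a sketch under stated assumptions: the degree, the rescaling step, and `alpha=1.0` are placeholders picked for illustration and would still need tuning.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Degree-2 expansion: 10 linear + 10 squared + 45 interaction terms = 65 features.
poly_lasso = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),  # put the expanded features on a common scale
    Lasso(alpha=1.0, max_iter=50000),  # alpha is a placeholder, not a tuned value
)

# cross_val_score reports R^2 per fold for regressors by default.
poly_scores = cross_val_score(poly_lasso, X, y, cv=kfold)
print("Degree-2 Lasso mean R^2:", poly_scores.mean())
```

Whether this beats the plain linear models depends on the chosen `alpha`; compare the mean score against the baseline before keeping the extra features.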

In [None]:
from __future__ import division, print_function  # Python 2 and 3 Compatibility

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression

from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.model_selection import GridSearchCV

### Load Data

In [None]:
diabetes = load_diabetes()

In [None]:
diabetes.keys()

* The data matrix is X and the target vector is y

In [None]:
len(diabetes.data)

### Build a Holdout Set to Test Different Models

In [None]:
X_train, X_holdout, y_train, y_holdout = train_test_split(diabetes.data, diabetes.target, test_size=0.1, random_state=42)

### Split Training Data into Multiple Folds

In [None]:
# Notice that we are splitting the X_train data into 5 folds.
# KFold only stores the settings; the splits are generated when it is passed as cv=kfold.

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

### Baseline: Linear Regression

In [None]:
lin_reg_est = LinearRegression()

scores = cross_val_score(lin_reg_est, X_train, y_train, cv=kfold)
print(scores)
print("Linear Reg Mean Score: ", np.mean(scores))

# Build the Model
lin_reg_est.fit(X_train, y_train)

### Evaluating Model

In [None]:
# Fitted vs. Actual
y_train_pred = lin_reg_est.predict(X_train)

plt.scatter(y_train, y_train_pred, alpha=0.2)
plt.plot([0, 400], [0, 400])

In [None]:
# Fitted vs. Actual (holdout set)
y_holdout_pred = lin_reg_est.predict(X_holdout)

plt.scatter(y_holdout, y_holdout_pred)
plt.plot([0, 400], [0, 400])

In [None]:
# Plot Residuals

lin_reg_residuals = y_train - y_train_pred

plt.scatter(y_train, lin_reg_residuals)
plt.plot([0,400], [0, 0])
plt.title("Actual vs. Residuals")

### Lasso

In [None]:
print("Lasso Model:")
params = {
    "alpha": np.logspace(-4, -.1, 20)
}

grid_est = GridSearchCV(Lasso(), param_grid=params, cv=kfold)
grid_est.fit(X_train, y_train)
# cv_results_ holds one entry per candidate alpha
df = pd.DataFrame(grid_est.cv_results_)
df["alpha"] = df.params.apply(lambda val: val["alpha"])
plt.plot(np.log(df.alpha), df.mean_test_score);
df[["alpha", "mean_test_score", "std_test_score"]]

### Ridge

In [None]:
print("Ridge Model:")
params = {
    "alpha": np.logspace(-4, -.1, 20)
}

grid_est = GridSearchCV(Ridge(), param_grid=params, cv=kfold)
grid_est.fit(X_train, y_train)
# cv_results_ holds one entry per candidate alpha
df = pd.DataFrame(grid_est.cv_results_)
df["alpha"] = df.params.apply(lambda val: val["alpha"])
plt.plot(np.log(df.alpha), df.mean_test_score);
df[["alpha", "mean_test_score", "std_test_score"]]

### Make Functions for repeatable Code - DRY

In [None]:
def build_grid_search_est(model, X, y, cv=5, **params):
    """Grid-search `model` over `params` and plot the mean CV score per parameter."""
    grid_est = GridSearchCV(model, param_grid=params, cv=cv)
    grid_est.fit(X, y)
    df = pd.DataFrame(grid_est.cv_results_)
    for param in params:
        # Bind `param` as a default argument so each lambda reads its own key.
        df[param] = df.params.apply(lambda val, p=param: val[p])
        # Plot against the parameter being searched, on a log-scaled x-axis.
        plt.semilogx(df[param], df.mean_test_score)
    return grid_est

In [None]:
print("Lasso Grid Search")
lasso_grid_est = build_grid_search_est(Lasso(), X_train, y_train, cv=kfold,
                                       alpha=np.logspace(-4, 0.1, 30))

In [None]:
print("Ridge Grid Search")
ridge_grid_est = build_grid_search_est(Ridge(), X_train, y_train, cv=kfold,
                                       alpha=np.logspace(-4, 0.1, 10))

In [None]:
print("Elastic Net Grid Search")
elastic_net_grid_est = build_grid_search_est(ElasticNet(), X_train, y_train, cv=kfold,
                                             alpha=np.logspace(-4, 0.1, 10))
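
The search above varies only `alpha`, so ElasticNet keeps its default `l1_ratio` of 0.5. A sketch of searching both parameters together follows; the three `l1_ratio` values and `max_iter=10000` are assumptions chosen for illustration, and the block reloads the data so it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_diabetes(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# l1_ratio=1 corresponds to Lasso; values near 0 approach Ridge.
param_grid = {
    "alpha": np.logspace(-4, 0.1, 10),
    "l1_ratio": [0.2, 0.5, 0.8],  # illustrative values only
}

enet_search = GridSearchCV(ElasticNet(max_iter=10000), param_grid=param_grid, cv=kfold)
enet_search.fit(X, y)

print("Best parameters:", enet_search.best_params_)
print("Best mean CV R^2:", enet_search.best_score_)
```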

In [None]:
print("Lasso Grid Scores")
pd.DataFrame(lasso_grid_est.cv_results_)[["params", "mean_test_score", "std_test_score"]]

In [None]:
print("Ridge Grid Scores")
pd.DataFrame(ridge_grid_est.cv_results_)[["params", "mean_test_score", "std_test_score"]]

In [None]:
print("Elastic Net Grid Scores")
pd.DataFrame(elastic_net_grid_est.cv_results_)[["params", "mean_test_score", "std_test_score"]]

### Evaluating the Four Models on the Holdout Set

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

y_pred = lin_reg_est.predict(X_holdout)
print("Linear Regression:", r2_score(y_holdout, y_pred))

y_pred = lasso_grid_est.predict(X_holdout)
print("Lasso Regression:", r2_score(y_holdout, y_pred))

y_pred = ridge_grid_est.predict(X_holdout)
print("Ridge Regression:", r2_score(y_holdout, y_pred))

y_pred = elastic_net_grid_est.predict(X_holdout)
print("ElasticNet Regression:", r2_score(y_holdout, y_pred))
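
The cell above imports `mean_squared_error` but reports only R^2. A small helper that prints both metrics might look like the sketch below. The name `holdout_report` is hypothetical, and the block rebuilds the split and the baseline model so that it runs on its own; in the notebook the fitted grid-search estimators could be passed in the same way.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.1, random_state=42)


def holdout_report(name, est):
    """Print and return RMSE and R^2 of a fitted estimator on the holdout set."""
    y_pred = est.predict(X_holdout)
    rmse = np.sqrt(mean_squared_error(y_holdout, y_pred))
    r2 = r2_score(y_holdout, y_pred)
    print("{}: RMSE={:.2f}, R^2={:.3f}".format(name, rmse, r2))
    return rmse, r2


lin_rmse, lin_r2 = holdout_report(
    "Linear Regression", LinearRegression().fit(X_train, y_train))
```

RMSE is in the units of the target, which makes it easier to judge how large a typical miss is.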

In [None]:
pd.DataFrame(list(zip(range(10), lasso_grid_est.best_estimator_.coef_)))
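
Pairing the coefficients with the dataset's feature names, instead of the indices 0 to 9, makes it easier to see which variables Lasso dropped. The sketch below refits Lasso so that it is self-contained; `alpha=0.1` is a placeholder assumption, and in the notebook `lasso_grid_est.best_estimator_.coef_` would take its place.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

diabetes = load_diabetes()

# alpha=0.1 is a placeholder; substitute the alpha selected by cross-validation.
lasso = Lasso(alpha=0.1).fit(diabetes.data, diabetes.target)

# Pair each coefficient with its feature name, smallest magnitude first.
coef_rows = sorted(zip(diabetes.feature_names, lasso.coef_),
                   key=lambda row: abs(row[1]))
for name, coef in coef_rows:
    print("{:>4}  {: .2f}".format(name, coef))

# Features whose coefficients were shrunk exactly to zero.
zeroed = [name for name, coef in coef_rows if coef == 0]
print("Zeroed by Lasso:", zeroed)
```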

### EDA

In [None]:
diabetes_df = pd.DataFrame(diabetes.data)
diabetes_df.columns = ["X" + str(col) for col in diabetes_df.columns]
diabetes_df["target"] = diabetes.target

In [None]:
diabetes_df.corr()

In [None]:
sns.pairplot(diabetes_df)