___
# Ridge Regression
___

## Import Libraries

In [1]:
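# Import the DS environment. ds_setup is a local helper module; it is assumed
# to provide pd, np, and the T()/t() helpers used throughout this notebook.
import os
import sys
sys.path.append(os.path.expanduser('~/Google Drive/my/projects/python/'))
from ds_setup import *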
import datetime
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

%load_ext autoreload
%autoreload

## Data

This example uses the `diabetes` dataset that ships with scikit-learn.

In [2]:
from sklearn import datasets
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

data1 = pd.DataFrame(data= np.c_[diabetes['data'], diabetes['target']],
                     columns= diabetes['feature_names'] + ['target'])

# Keep only the target, BMI, and age columns
data1 = T(data1).select("target", "bmi", "age")
data1.head()

index,target,bmi,age
0,151.0,0.061696,0.038076
1,75.0,-0.051474,-0.001882
2,141.0,0.044451,0.085299
3,206.0,-0.011595,-0.089063
4,135.0,-0.036385,0.005383


Note that BMI is more strongly correlated with the target than age is.
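The first five rows alone do not show this correlation, so it is worth computing directly. The following is a minimal sketch that loads the dataset with plain scikit-learn and pandas instead of the `ds_setup` helpers; the names `df` and `corr_with_target` are illustrative only.

```python
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset into a DataFrame (plain scikit-learn, no ds_setup helpers)
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Pearson correlation of each selected feature with the target
corr_with_target = df[["bmi", "age", "target"]].corr()["target"].drop("target")
print(corr_with_target)
```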

## Train Test Split

Now let's split the data into a training set and a test set. We will train our model on the training set and then use the test set to evaluate it.

In [3]:
X = data1[['bmi', "age"]]
y = data1['target']

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
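Fixing `random_state` makes the split reproducible: the same rows land in the training and test sets on every run. A small self-contained sketch of that behaviour, assuming the dataset is loaded directly from scikit-learn (the `x_train2` and `x_test2` names are illustrative):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the features and target as pandas objects and keep the two columns used here
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]

# Two splits with the same seed select exactly the same rows
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
x_train2, x_test2, _, _ = train_test_split(X, y, test_size=0.4, random_state=101)

print(x_train.shape, x_test.shape)
print(x_train.index.equals(x_train2.index))
```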

## Grid Search

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

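# Candidate regularization strengths
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]}

# Plain scikit-learn equivalent, defined for reference only (it is never fit in this notebook)
ridge_regressor = GridSearchCV(Ridge(), parameters, scoring='neg_mean_squared_error', cv=5)

# Run the search through the ds_setup helper; according to its log output it uses 10 folds, not 5
classifier = t().grid_search(Ridge(), x_train, y_train.values.ravel(), parameters, scoring='neg_mean_squared_error')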

Fitting 10 folds for each of 10 candidates, totalling 100 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done  85 out of 100 | elapsed:    4.9s remaining:    0.8s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    4.9s finished

Best Estimator
---------------------------
Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

Best Estimator Parameters
---------------------------
{'alpha': 0.01}
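The search above runs through a helper from `ds_setup`, whose internals are not shown here. As a rough equivalent, the same search can be sketched with `GridSearchCV` alone. This assumes 10 folds (to match the log) and the same train/test split; the `search` name is illustrative, and the selected alpha may differ from the helper's result.

```python
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

# Same alpha grid as in the notebook; cv=10 is an assumption based on the helper's log
parameters = {"alpha": [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]}
search = GridSearchCV(Ridge(), parameters, scoring="neg_mean_squared_error", cv=10)
search.fit(x_train, y_train)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE for the best alpha
```

Because the scoring is the negated MSE, `best_score_` is negative and the sign is flipped before printing.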


### Fitting the Best Grid Search Model

In [7]:
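# Refit Ridge with the best alpha found by the grid search.
# `normalize` is omitted: False was its default, and the argument was removed in scikit-learn 1.2.
rf_reg = Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
               random_state=None, solver='auto', tol=0.001)

model = rf_reg.fit(x_train, y_train.values.ravel())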

### Run Validation

Let's validate the model with cross-validation. Note that the scores below are computed on the full dataset (`X`, `y`), not only on the training split.

In [8]:
t().cross_validation_score(rf_reg, X, y.values.ravel())

Scores: [63.10119569 60.78668734 67.38040387 61.59683761 62.30918424 62.44852425
 66.12466639 52.94106892 70.06035537 55.93900689]

Mean: 62.268793057834536
Standard deviation: 4.818219468463213
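The helper's metric is not documented here, but the scores are on the same scale as an RMSE. Assuming it reports a per-fold RMSE over 10 folds, a comparable check with plain scikit-learn might look like the sketch below. The exact numbers may differ from the helper's output, for example if it shuffles the folds.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]

# cross_val_score returns the negated MSE per fold; flip the sign and take the root
neg_mse = cross_val_score(Ridge(alpha=0.01), X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-neg_mse)

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```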


### Evaluation

In [9]:
predictions = model.predict(x_test)

In [10]:
T().regression_score(y_test, predictions)

MAE: 54.292108610182126
MSE: 4204.236590676257
RMSE: 64.84008475222913
r2: 0.353774042538367


An R2 of 35.4% is too low for a useful model.

This notebook only demonstrates how to run the code; it is not an example of a good model.
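The four metrics reported above can also be computed with `sklearn.metrics` directly. This is a minimal sketch, assuming the same two features, split, and alpha as in the notebook; it does not use the `regression_score` helper, so small differences from the output above are possible.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

model = Ridge(alpha=0.01).fit(x_train, y_train)
predictions = model.predict(x_test)

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE
r2 = r2_score(y_test, predictions)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("r2:", r2)
```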

### Weights

In [13]:
# In a linear model, each coefficient is the change in the prediction per unit change in that feature
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
coeff_df.sort_values("Coefficient", ascending=False)

feature,Coefficient
bmi,914.098022
age,99.395047
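The coefficients look large because the features returned by `load_diabetes` are mean-centered and scaled so that each column's sum of squares is 1; a one-unit change in a feature is therefore far larger than any change seen in the data. The sketch below, fitted on the full dataset purely for illustration, checks that scaling and shows that a Ridge prediction is the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]

# Each feature column is scaled so that its squared values sum to 1
print((X ** 2).sum())

# Illustrative fit on all rows (the notebook fits on the training split instead)
model = Ridge(alpha=0.01).fit(X, y)

# A prediction is intercept + sum(coefficient * feature value)
manual = model.intercept_ + X.to_numpy() @ model.coef_
print(np.allclose(manual, model.predict(X)))
```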


### Explainer

In [14]:
import eli5
from eli5.sklearn import PermutationImportance
# define a permutation importance object
perm = PermutationImportance(model).fit(X, y)
# show the importance
eli5.show_weights(perm, feature_names=X.columns.values)

Weight,Feature
0.6651  ± 0.0548,bmi
0.0114  ± 0.0103,age
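scikit-learn provides its own permutation importance in `sklearn.inspection`, which can serve as an alternative to `eli5`. The sketch below assumes the same split and alpha as above and scores on the held-out test set rather than on the full data; `n_repeats=10` and `random_state=0` are arbitrary choices, so the values will not match the table above exactly.

```python
from sklearn import datasets
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

model = Ridge(alpha=0.01).fit(x_train, y_train)

# Shuffle each feature in turn and measure the drop in the model's score (R2 by default)
result = permutation_importance(model, x_test, y_test, n_repeats=10, random_state=0)

for name, mean, std in zip(X.columns, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

Measuring importance on data the model has not seen avoids rewarding features that the model has merely memorized, which matters more for flexible models than for a two-feature Ridge.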
