### 1. Load data

I'll be using a voting dataset I made.

---

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

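# GridSearchCV and cross_val_score live in sklearn.model_selection;
# the older grid_search and cross_validation modules have been removed.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score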

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.linear_model import Lasso, Ridge, ElasticNet

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

%matplotlib inline

In [2]:
issues = pd.read_csv('./datasets/city_issues_nlp.csv')

In [3]:
issues.head()

Unnamed: 0,num_votes,source[Map Widget],source[Mobile Site],source[New Map Widget],source[android],source[city_initiated],source[iphone],source[web],tag_type[T.abandoned_vehicles],tag_type[T.animal_problem],...,way,way street,weeds,westwood,wharf,white,windows,working,yard,yard waste
0,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


### 2. Create X, Y, and normalize X

**Y is num_votes here**. We're doing regression.

---

In [5]:
# Target: number of votes per issue
Y = issues.num_votes
print(Y.value_counts())
Y = Y.values

# Predictors: every column except the target
X = issues[[c for c in issues.columns if c != 'num_votes']]
print(X.shape)

2     997
3     487
4     229
5     129
6      53
7      52
8      25
9      16
10     10
11      2
Name: num_votes, dtype: int64
(2000, 545)


In [6]:
Xn = ((X - X.mean()) / X.std()).values
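
One thing to watch with this kind of normalization: a column that never varies has a standard deviation of 0, so dividing by it produces NaN. The sketch below is a small illustration with a made-up array, not the city issues data, and the `standardize` helper is a hypothetical name. It uses `ddof=1` so that it matches the pandas default for `.std()`.

```python
import numpy as np

def standardize(X):
    """Z-score each column; constant columns come out as zeros instead of NaN."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)              # ddof=1 matches pandas' default
    safe_std = np.where(std == 0, 1.0, std)  # avoid 0/0 for constant columns
    return (X - mean) / safe_std

# Made-up data: the middle column is constant.
demo = np.array([[1.0, 5.0, 0.0],
                 [2.0, 5.0, 1.0],
                 [3.0, 5.0, 0.0]])
demo_n = standardize(demo)
print(demo_n)
```

If you would rather drop constant columns than zero them out, that is an equally reasonable choice.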

### 3. Linear regression

1. Cross-validate a linear regression with 5 folds. 
2. Build a linear regression model with the full X and Y.

---

In [7]:
linreg = LinearRegression()
linreg_scores = cross_val_score(linreg, Xn, Y, cv = 5)
print(linreg_scores)
print(np.mean(linreg_scores))

[ -6.16582890e+27  -5.72482157e+27  -5.86767849e+27  -2.15572371e+27
  -1.07346060e+27]
-4.19750265486e+27


### 4. Ridge regression

1. Use either `GridSearchCV` or `RidgeCV` to find the best `alpha`. **Remember that bigger alphas mean stronger regularization. (Some other scikit-learn estimators, such as `LogisticRegression`, use `C` instead, which is the inverse of alpha: smaller Cs mean stronger regularization.)**
2. Cross-validate the R2 with Ridge using your optimal alpha.
3. Build a final Ridge model and fit it on the full X and Y as you did above.

---

In [12]:
# ridge1 = Ridge(alpha = 1)
# ridge2 = Ridge(alpha = 10)
# ridge3 = Ridge(alpha = 100)
# ridge4 = Ridge(alpha = 1000)
# ridge5 = Ridge(alpha = 10000)

# for r in [ridge1, ridge2, ridge3, ridge4, ridge5]:
#     print np.mean(cross_val_score(r, Xn, Y, cv = 5))

ridge = Ridge()
ridge_params = {
    'alpha': [1, 10, 100, 1000, 10000]
}

ridge_gs = GridSearchCV(ridge, ridge_params, cv = 5)

ridge_gs.fit(Xn, Y)

GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': [1, 10, 100, 1000, 10000]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [13]:
print(ridge_gs.best_params_)
print(ridge_gs.best_score_)

{'alpha': 1000}
0.275599411289
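
The `RidgeCV` route, plus steps 2 and 3 from the list above, could look roughly like the sketch below. It runs on synthetic data from `make_regression` as a stand-in for `Xn` and `Y` (an assumption, so that the cell is self-contained), and the `*_demo` and `ridge_final` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for Xn and Y (assumption: not the city issues data).
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 noise=10.0, random_state=0)

alphas = [1, 10, 100, 1000, 10000]

# RidgeCV searches the candidate alphas with 5-fold cross-validation.
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_demo, y_demo)
best_alpha = ridge_cv.alpha_

# Step 2: cross-validate the R2 at the chosen alpha.
ridge_scores = cross_val_score(Ridge(alpha=best_alpha), X_demo, y_demo, cv=5)
print(best_alpha, np.mean(ridge_scores))

# Step 3: final model fit on all of the data.
ridge_final = Ridge(alpha=best_alpha).fit(X_demo, y_demo)
```

On the real data you would pass `Xn` and `Y` in place of the synthetic arrays.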


### 5. Lasso regression

1. Use either `GridSearchCV` or `LassoCV` to find the optimal `alpha` for the Lasso regression.
2. Cross-validate the R2 with Lasso using your optimal alpha.
3. Build a final Lasso model fit on the full X and Y.

---

In [27]:
lasso = Lasso()

lasso_params = {
    'alpha': np.logspace(-4, 5, 50)
}

lasso_gs = GridSearchCV(lasso, lasso_params, cv = 5)

lasso_gs.fit(Xn, Y)



GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e-04,   1.52642e-04,   2.32995e-04,   3.55648e-04,
         5.42868e-04,   8.28643e-04,   1.26486e-03,   1.93070e-03,
         2.94705e-03,   4.49843e-03,   6.86649e-03,   1.04811e-02,
         1.59986e-02,   2.44205e-02,   3.72759e-02,   5.68987e-02,
         8....    1.20679e+04,   1.84207e+04,   2.81177e+04,   4.29193e+04,
         6.55129e+04,   1.00000e+05])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [28]:
print(lasso_gs.best_params_)
print(lasso_gs.best_score_)

{'alpha': 0.056898660290182992}
0.315248725458


In [29]:
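# Fraction of coefficients the Lasso set exactly to zero.
# Divide by the number of features (columns), not the number of rows.
print(np.sum(lasso_gs.best_estimator_.coef_ == 0.) / float(Xn.shape[1]))

0.911926605505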


In [30]:
print(lasso_gs.best_estimator_.coef_)

[  0.00000000e+00   1.99489693e-02  -6.17657131e-04  -0.00000000e+00
  -1.22192956e-02   0.00000000e+00   1.26668829e-01   0.00000000e+00
  -0.00000000e+00   6.36115618e-02   5.52454512e-03   0.00000000e+00
  -0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -2.48692708e-02  -0.00000000e+00   0.00000000e+00
  -0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
   2.46250900e-03  -0.00000000e+00   3.46857915e-02   0.00000000e+00
   1.80563267e-02  -4.33198880e-02   0.00000000e+00  -0.00000000e+00
   0.00000000e+00   0.00000000e+00  -4.03232577e-03  -0.00000000e+00
   7.56780258e-03  -2.25394291e-01   2.28824646e-01   6.39808366e-17
   5.42257989e-01   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00

### 6. ElasticNet regression

Now you'll get to try out the ElasticNet. It is a combination of the Ridge and the Lasso to leverage the benefits of both!

Arguments to optimize:

    alpha : same as the Ridge/Lasso above
    l1_ratio: the share of the penalty that is L1 (Lasso); the remainder is L2 (Ridge).
        An l1_ratio of 0.0 is a pure Ridge
        An l1_ratio of 1.0 is a pure Lasso
        
1. Use `GridSearchCV` or `ElasticNetCV` to search for the optimal `alpha` and `l1_ratio`. 
2. Explain the probable reason why it chose the parameters it did as the best ones.
3. Cross-validate the R2 with the ElasticNet using the optimal parameters.
4. Fit the ElasticNet on all X and Y.
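
To make the roles of the two arguments concrete, this is the objective that the scikit-learn documentation gives for `ElasticNet`, written with $\rho$ standing for `l1_ratio` and $n$ for the number of samples:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \, \rho \, \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the L2 term vanishes and only the Lasso penalty remains; with $\rho = 0$ only the L2 (Ridge-style) penalty remains. In both cases $\alpha$ scales the overall strength of the penalty.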

---

In [33]:
enet_cv = ElasticNetCV(l1_ratio = np.linspace(0.025, 1., 50), n_alphas = 50, cv = 5, verbose = 1)

enet_cv.fit(Xn, Y)

ElasticNetCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
       l1_ratio=array([ 0.025  ,  0.0449 ,  0.0648 ,  0.08469,  0.10459,  0.12449,
        0.14439,  0.16429,  0.18418,  0.20408,  0.22398,  0.24388,
        0.26378,  0.28367,  0.30357,  0.32347,  0.34337,  0.36327,
        0.38316,  0.40306,  0.42296,  0.44286,  0.46276,  0.48265,
        0.50255,  0.522...4082,
        0.86071,  0.88061,  0.90051,  0.92041,  0.94031,  0.9602 ,
        0.9801 ,  1.     ]),
       max_iter=1000, n_alphas=50, n_jobs=1, normalize=False,
       positive=False, precompute='auto', random_state=None,
       selection='cyclic', tol=0.0001, verbose=1)

In [34]:
print(enet_cv.alpha_)
print(enet_cv.l1_ratio_)

0.049531997604
1.0


In [35]:
enet = ElasticNet( alpha = enet_cv.alpha_, l1_ratio = enet_cv.l1_ratio_)
enet_scores = cross_val_score(enet, Xn, Y, cv = 5)
print(enet_scores)
print(np.mean(enet_scores))

[ 0.32757509  0.27178607  0.28140379  0.36879677  0.33183592]
0.316279530534


### 7. DecisionTreeRegressor

1. Use `GridSearchCV` to find the best `max_features`, `max_depth`, and `min_samples_leaf`. Read the documentation and think about what range of parameters would be good to search for each!
2. Cross-validate the R2 as above.
3. Fit a DecisionTreeRegressor on all X and Y as above.

---
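
A minimal sketch of the three steps above, run on synthetic data from `make_regression` rather than the city issues data so that it is self-contained. The parameter ranges are only a starting guess, not tuned values, and the `dtr_*` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and Y (assumption: not the city issues data).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)

# Candidate values; these ranges are a starting guess only.
dtr_params = {
    'max_features': [None, 'sqrt', 0.5],
    'max_depth': [2, 4, 8, None],
    'min_samples_leaf': [1, 5, 20],
}

# Step 1: grid search with 5-fold cross-validation.
dtr_gs = GridSearchCV(DecisionTreeRegressor(random_state=0), dtr_params, cv=5)
dtr_gs.fit(X_demo, y_demo)
print(dtr_gs.best_params_)

# Step 2: cross-validate the R2 with the best parameters.
dtr_best = DecisionTreeRegressor(random_state=0, **dtr_gs.best_params_)
dtr_scores = cross_val_score(dtr_best, X_demo, y_demo, cv=5)
print(dtr_scores.mean())

# Step 3: fit on all of the data.
dtr_best.fit(X_demo, y_demo)
```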

### 8. BaggingRegressor

Now we'll use bagging with the DecisionTreeRegressor. Yes, regressions can be done with bagging too!

---

Remember that with the Bagging regressor you first have to initialize the internal "base estimator" that it will copy:

```python
dtr = DecisionTreeRegressor()
```

A cool thing to note is that you can actually grid search over the internal base estimator's parameters as well. So, not only are you finding the best parameters for the BaggingRegressor but also for the DecisionTreeRegressors that it copies inside:

```python
bag_params = {
    'base_estimator__max_features':[None],
    'base_estimator__max_depth':[None],
    'base_estimator__min_samples_leaf':[1],
    'max_features': [0.33, 0.66, 0.99],
    'max_samples': [0.1, 0.2, 0.4, 0.6, 0.8, 0.9],
    'n_estimators': [100]
}
```

**Be careful putting too many parameters into the `GridSearchCV`! It can really explode the possible permutations!**

That being said, keep the grid modest here: this dataset has 545 columns, so every single fit is relatively expensive.
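
To see how quickly a grid grows, you can count the combinations before running anything. The sketch below only uses the `bag_params` dictionary shown above and the standard library; `cv_folds`, `n_combinations`, and `n_fits` are illustrative names.

```python
from functools import reduce
from operator import mul

bag_params = {
    'base_estimator__max_features': [None],
    'base_estimator__max_depth': [None],
    'base_estimator__min_samples_leaf': [1],
    'max_features': [0.33, 0.66, 0.99],
    'max_samples': [0.1, 0.2, 0.4, 0.6, 0.8, 0.9],
    'n_estimators': [100]
}
cv_folds = 5

# The grid is the Cartesian product of all the value lists,
# so its size is the product of their lengths.
n_combinations = reduce(mul, (len(v) for v in bag_params.values()), 1)

# Every combination is fit once per cross-validation fold.
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 18 90
```

Each of those 90 cross-validation fits trains 100 trees, and adding just one more two-valued parameter would double the total.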

Next you initialize the BaggingRegressor, putting the desired model as the first argument:

```python
bag = BaggingRegressor(dtr)
```

This tells the BaggingRegressor that you want it to spawn DecisionTreeRegressors as its internal "children" base estimators.

Lastly, you'll put the BaggingRegressor into the grid searcher and fit on the data (it will cross-validate with the specified `cv` folds).

```python
bag_gs = GridSearchCV(bag, bag_params, cv=5, verbose=1)
bag_gs.fit(X, Y)
```

---

As before...

1. Use `GridSearchCV` to find the best `BaggingRegressor` and optionally internal `DecisionTreeRegressor` parameters.
2. Cross-validate the R2.
3. Fit a `BaggingRegressor` on all X and Y with the optimal parameters.

---
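
Putting the pieces together, an end-to-end version might look like the sketch below. It uses synthetic data and a deliberately tiny grid so that it runs quickly; both are assumptions for illustration. The nested-parameter prefix is `base_estimator` in older scikit-learn releases and `estimator` in newer ones, so the sketch looks it up rather than hard-coding it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and Y (assumption: not the city issues data).
X_demo, y_demo = make_regression(n_samples=150, n_features=12, n_informative=4,
                                 noise=10.0, random_state=0)

# The tree is passed positionally, which works across scikit-learn versions.
bag = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

# Pick whichever nested-parameter prefix this scikit-learn version uses.
prefix = 'estimator' if 'estimator' in bag.get_params() else 'base_estimator'

bag_params = {
    prefix + '__max_depth': [3, None],
    'max_features': [0.5, 1.0],
    'max_samples': [0.5, 1.0],
    'n_estimators': [20],
}

# Step 1: grid search over the bagging and inner-tree parameters.
bag_gs = GridSearchCV(bag, bag_params, cv=5)
bag_gs.fit(X_demo, y_demo)
print(bag_gs.best_params_)

# Step 3: with refit=True (the default), best_estimator_ is already
# fit on all of the data using the best parameters.
bag_best = bag_gs.best_estimator_

# Step 2: cross-validate the R2 for that configuration.
bag_scores = cross_val_score(bag_best, X_demo, y_demo, cv=5)
print(bag_scores.mean())
```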

### 9. Get feature importances from `RandomForestRegressor`

The `RandomForestRegressor` has an attribute called `.feature_importances_`. These are the importances of your predictors, measured by how much each one reduced the splitting criterion across the trees and normalized so that they sum to 1.

As you may recall, the `RandomForestRegressor` is closely related to a `BaggingRegressor` of decision trees: both train each tree on a bootstrap sample, but the random forest also considers a random subset of features at every split. You've already built something very similar above. A practical difference is that this class gives us the feature importances directly, whereas the "generalized" bagging regressor class does not.

1. Save the column names X in a variable.
2. Grid search optimal parameters for the `RandomForestRegressor`.
3. Fit a `RandomForestRegressor` using the optimal parameters you found on the full X and Y.
4. Get out the feature importances.
5. Create a pandas DataFrame where one column is the feature importances and the other column is the X column names.
6. Sort the dataframe you made by feature importances in descending order.
7. Plot the feature importances.

---
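
Steps 1 through 6 could be sketched as follows. The data is synthetic and the `feat_0`, `feat_1`, ... column names are made up so that the cell is self-contained; the grid is a small guess rather than a recommendation. The resulting frame is named `feature_importances` with `feature` and `importance` columns, which is what the plotting cell below expects.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and Y with made-up column names.
X_arr, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['feat_%d' % i for i in range(8)])

# 1. Save the column names.
feature_names = X_demo.columns

# 2. Grid search a few RandomForestRegressor parameters.
rf_params = {
    'max_depth': [3, None],
    'max_features': [0.33, 1.0],
    'n_estimators': [50],
}
rf_gs = GridSearchCV(RandomForestRegressor(random_state=0), rf_params, cv=5)
rf_gs.fit(X_demo, y_demo)

# 3. best_estimator_ is refit on the full data with the best parameters.
rf = rf_gs.best_estimator_

# 4-5. Pair each importance with its column name.
feature_importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_,
})

# 6. Sort by importance, largest first.
feature_importances = feature_importances.sort_values('importance',
                                                      ascending=False)
print(feature_importances.head())
```

With 545 real columns you will probably want to plot only the top rows, for example `feature_importances.head(30)`.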

In [None]:
sns.set(style="whitegrid")

f, ax = plt.subplots(figsize=(6, 15))

sns.set_color_codes("muted")
sns.barplot(x="importance", y="feature", data=feature_importances,
            label="feature importances", color="b")

sns.despine(left=True, bottom=True)

plt.show()

### 10. [BONUS] Use a different regression class with the `BaggingRegressor`

You could try `Ridge`, `Lasso`, `ElasticNet`, `SVR`, `KNeighborsRegressor` or any kind of regression class you're interested in!

---
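
As one possible starting point, here is a sketch that bags `Ridge` models instead of trees. The data is synthetic, and the `alpha`, `n_estimators`, and `max_samples` values are arbitrary choices for illustration rather than tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and Y (assumption: not the city issues data).
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 noise=10.0, random_state=0)

# Any regressor can be the base estimator; here each of the 25 copies
# is a Ridge model trained on a random 80% sample of the rows.
bag_ridge = BaggingRegressor(Ridge(alpha=10.0), n_estimators=25,
                             max_samples=0.8, random_state=0)

bag_ridge_scores = cross_val_score(bag_ridge, X_demo, y_demo, cv=5)
print(bag_ridge_scores.mean())

# Fit on all of the data.
bag_ridge.fit(X_demo, y_demo)
```

Swapping `Ridge` for another regressor only changes the first argument to `BaggingRegressor`.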