# Ensemble methods

* Pre-reqs:
    * Linear classifiers in Python
    * Machine Learning with Tree-Based Models in Python
    * Supervised Learning with Scikit-Learn

### CAPSTONE TO DO: choose an evaluation metric for ensemble model: accuracy, sensitivity, specificty, other?
* **Accuracy:** shows the percentage of the dataset instances correctly predicted by the model developed by the machine learning algorithm
* **Sensitivity:** shows the percentage of COVID-19 positive patients correctly by the models
* **Specificity:**  shows the percentage of COVID-19 negative patients correctly by the models

* Create cross-val table of: CCI, TP, FP, Precision, Recall, F Measure/F1 Score, ROC/AUC?
* **F1 Score for imbalanced data classes**


### Combining multiple models
* When you're building a model, you want to choose the one that performs the best according to some evaluation metric
* Ensemble methods: form a new model by combining existing ones; the combined responses of different models will likely lead to a better decision than relying on a single response.
    * the combined model will have better performance than any of the individual models (or at least be as good as the best individual model)
* Ensemble learning is one of the most effective techniques in machine learning
* A useful Python library for machine learning called `mlxtend`
* Main scikit-learn module: `sklearn.ensemble`


```
from sklearn.ensemble import MetaEstimator
# Base estimators
est1 = Model1()
est2 = Model1()
estN = ModelN()
# Meta estimator
est_combined = MetaEstimator(estimators=[est1, est2,  ..., estN], 
                # Additional parameters (specific to the ensemble method)
)
# Train and test
est_combined.fit(X_train, y_train)
pred = est_combined.predict(X_test)
```
* The best feature of the meta estimator is that it works similarly to the scikit-learn estimators you already know, with the standard methods of fit and predict
* Decision trees are the building block of many ensemble methods

RECAP DT CODE:

```
# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the regressor
reg_dt = DecisionTreeRegressor(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Fit to the training set
reg_dt.fit(X_train, y_train)

# Evaluate the performance of the model on the test set
y_pred = reg_dt.predict(X_test)
print('MAE: {:.3f}'.format(mean_absolute_error(y_test, y_pred)))
```

### Voting

* Concept of "wisdom of the crowds": refers to the collective intelligence based on a group of individuals instead of a single expert 
* Also known as **collective intelligence**
* The aggregated opinion of the crowd can be as good as (and is usually superior to) the answer of any individual, even that of an expert.
* Useful technique commonly applied to problem solving, decision making, innovation, and prediction (we are particularly interested in prediction)

* **Majority voting:** a technique that combines the output of many classifiers using a maority voting approach
    * Properties:
        * Classification problems
        * Majority voting $\Rightarrow$ Mode of individual predictions
        * **It is recommended to use an odd number of classifiers:** Use at least **three** classifiers, and when problem constraints allow it, use five or more
        
        
* **Wise Crowd Characteristics:**
    * **The ensemble needs to be diverse:** do this by using different algorithms or different datasets.
    * **Independent and uncorrelated:** Each prediction needs to be independent and uncorrelated from the rest.
    * **Use individual knowledge:** Each model should be able to make its own predition without relying on the other predictions
    * **Aggregate individual predictions:** into a collective one
    
**NOTE:** Majority Voting can only be applied to classification problems
    * For regression problems, see: ensemble voting regressor
    
```
from sklearn.ensemble import VotingClassifier
clf_voting = VotingClassifier(
       estmators = [
           ('label1', clf_1),
           ('label2', clf_2),
           ('labelN', clf_N)])
         
```

```
# Create the individual models
clf_knn = KNeightborsClassifier(5) # k=5 nearest neighbors (to avoid multi-modal predictions)
clf_dt = DecisionTreeClassifier()
clf_lr = LogisticRegression()
# Create voting classifier
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_dt),
        ('lr', clf_lr)])
# Fit combined model to the training set and predict
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)
```
#### Evaluate the performance

```
# Get the accuracy score
acc = accuracy_score(X_test, y_pred)
print("Accuracy: {:0.3f}".format(acc))
```


* The main input- with keyword "estimators" - is a list of (string, estimator) tuples.
* Each string is a label and each estimtor is an sklearn classifier
* **You do not need to fit the classifiers individually, as the voting classifier will take care of that.**
    * But you do need to tune hyperparameters? Not sure...
    
```
# Make the invidual predictions
pred_lr = clf_lr.predict(X_test)
pred_dt = clf_dt.predict(X_test)
pred_knn = clf_knn.predict(X_test)

# Evaluate the performance of each model
score_lr = f1_score(y_test, pred_lr)
score_dt = f1_score(y_test, pred_dt)
score_knn = f1_score(y_test, pred_knn)
```

* **1) Choose the best model**
    * generate predictions as per each model
    * get evaluation metric scores for each model's predictions (F1 score, accuracy, precision, recall, etc)
    * Decide which model performs the best
    
* **2) Instantiate models**

```
# Instantiate the individual models
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Create and fit the voting classifier
clf_vote = VotingClassifier(
    estimators=[('knn', clf_knn), ('lr', clf_lr), ('dt', clf_dt)]
)
clf_vote.fit(X_train, y_train)
```

* **3) Evaluate your ensemble**

```
# Calculate the predictions using the voting classifier
pred_vote = clf_vote.predict(X_test)

# Calculate the F1-Score of the voting classifier
score_vote = f1_score(y_test, pred_vote)
print('F1-Score: {:.3f}'.format(score_vote))

# Calculate the classification report
report = classification_report(y_test, pred_vote)
print(report)
```

### Averaging aka "Soft Voting"
* Averaging is another popular ensemble method
* Averaging is also referred to as **soft voting**
* Averaging/ Soft Voting can be applied to both classification and regression.
* In this technique, the combined prediction is the mean of the individual predictions
* **Soft voting: Mean**
    * **Regression:** mean of predicted values
    * **Classification:** mean of predicted *probabilities*
* As the mean doesn't have ambiguous cases (like the mode), we can use any number of estimators (odd or even), as long as we have at least two.

* To build an averaging classifier, we'll use the same class as before: **`VotingClassifier()`**
    * **Main difference:** we specify an additional parameter: **`voting='soft`**
        * Default value of voting is `'hard'`
#### Averaging Classifier    
    
```
from sklearn.ensemble import VotingClassifier
clf_voting = VotingClassifier(
                estimators=[
                    ('label1', clf_1),
                    ('label2', clf_2),
                    ...
                    ('labelN', clf_N)],
                voting = 'soft', 
                weights=[w_1, w_2, ..., w_N]
)
```
* parameter `weights` is optional; specifies a weight for each of the estimators
    * If specified, the combined prediction is a weighted average of the individual ones
    * Otherwise, weights are considered uniform.
* To build an **`AveragingRegressor()`:**
#### Averaging Regressor:

```
from sklearn.ensemble import VotingRegressor
reg_voting = VotingRegressor(
        estimators = [
            ('label1', reg_1),
            ('label2', reg_2),
            ...
            ('labelN', reg_N)],
        voting = 'soft',
        weights = [w_1, w_2, ..., w_N]
)
```
* The first parameter is also a list of the string/estimator tuples, but instead of classifiers, we use regressors


***
* **Averaging classifier example:**
```
from sklearn.ensemble import VotingRegressor
clf_knn = KNeighborsClassifier(5)
clf_dt = DecisionTreeClassifier()
clf_lr = LogisticRegression()
clf_voting = VotingClassifier(
        estimators = [
            ('knn', clf_knn),
            ('dt', clf_dt),
            ('lr', clf_lr)],
        voting='soft',
        weights = [1, 2, 1]
)
```
* Assuming we know that Decision Tree has better individual performance, we give it a higher weight
* Ideally, the weights should be tuned while training the model, for example, using grid search cross-validation (`GridSearchCV`).

```
# Build the individual models
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf_svm = SVC(probability=True, class_weight='balanced', random_state=500)

# List of (string, estimator) tuples
estimators = [('lr', clf_lr), ('dt', clf_dt), ('svm', clf_svm)]

# Build and fit an averaging classifier
clf_avg = VotingClassifier(estimators = estimators,voting='soft')
clf_avg.fit(X_train, y_train)

# Evaluate model performance
acc_avg = accuracy_score(y_test,  clf_avg.predict(X_test))
print('Accuracy: {:.2f}'.format(acc_avg))
```

## Bagging
#### The strength of weak models
* What is a weak model and how to identify one by its properties?
* Voting and averaging work by combining the predictions of already trained models
    * Small number of estimators
    * **Fine-tuned estimators**
    * Individually optimized and trained for the problem
    
#### Weak vs. fine-tuned model

#### Weak estimator
* A "weak" model does not mean that it is a "bad" model, just that it is not as strong as a highly-optimized, finely-tuned model.
* Performance is slightly better than random guessing
* The error rate is less than 50%, but close to it
* A weak model should be light in terms of space and computational requirements, and fast during training and modeling.
* One good example: a decision tree limited to a depth of two
* **The three desired properties:
    * Low performance (just above guessing)
    * It is light
    * Low training and evaluation time
* Common examples of weak models:
    * decision tree with small depth
    * Logistic Regression (makes the assumption that the classes are linearly separable)
        * could also limit number of iterations for training
        * or, specify a high value of the parameter C to use a weak regularization
    * Linear Regression 
        * Makes the assumption that the output is a linear function of the input features
        * could limit the number of iterations or not use normalization
    * Other restricted models

#### Bagging = Bootstrap Aggregating
* Heterogenous vs homogenous ensemble methods
#### Heterogenous:
* Different algorithms (fine-tuned)
* Work well with a small number of estimators
* For example, we could combine a decision tree, a logistic regression, and a support vector machine using voting to improve the results
* Included: Voting, Averaging, Stacking
#### Homogenous
* methods such as bagging
* work by applying the same algorithm on all the estimators, and this algorithm must be a "weak" model
* Large number of weak estimators
* Bagging and Boosting are some of the most popular of this kind

#### Condorcet's Jury Theorem:
* **Requirements:**
* All models must be independent
* Each model performs better than random guessing
* All individual models have similar performance

* If these three conditions are met, then adding more models increases the probability of the ensemble to be correct, and makes this probability tend to 1 (1 equivalent to 100%)
* The second and third requirements can be fulfilled by using the same weak model for all the estimators

* To guarantee the first requirement of the theorem, the **bagging algorithm** trains individual models using a random subsample for each (known as **bootstrapping**)

* A wise crowd needs to be diverse, either through using different datasets or algorithms

* **Bootstrapping** requires:
    * Random subsamples $\Rightarrow$ provides **diversity** of data
    * Using replacement
    
* Bagging helps reduce variance, as the sampling is truly random
* Bias can be reduced since we use voting or averaging to combine the models 
* Bagging provides stablity and robustness
* However, **bagging is computationally expensive in terms of space and time**

* To take a sample, you'll use pandas' `.sample()` method, which has a replace parameter. For example, the following line of code samples with replacement from the whole DataFrame df:
    * `df.sample(frac=1.0, replace=True, random_state=42)`
    
```
# Take a sample with replacement
X_train_sample = X_train.sample(frac=1.0, replace=True, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]

# Build a "weak" Decision Tree classifier
clf = DecisionTreeClassifier(max_depth=4, random_state=500)

# Fit the model to the training sample
clf.fit(X_train_sample, y_train_sample)
```

```
def build_decision_tree(X_train, y_train, random_state=None):
    # Takes a sample with replacement,
    # builds a "weak" decision tree,
    # and fits it to the train set

def predict_voting(classifiers, X_test):
    # Makes the individual predictions 
    # and then combines them using "Voting"
```

#### BaggingClassifier
* **Heterogenous Ensemble Functions**
* A key difference between the ensemble functions from heterogenous and homogenous methods:

```
het_est = HeterogenousEnsemble(
    estimators= [('est1', est1), ('est2', est2), ...], 
    # additional parameters
)
```

* **Homogenous Ensemble Functions:**
* To build a homogenous ensemble model, instead of a list of estimators, we pass the parameter base_estimator, which is the instantiated "weak" model we have chosen for our ensemble:

```
hom_est = HomogenousEnsemble(
            base_estimator= est_base,
            n_estimators= chosen_number,
            #additional paramters
)
```

#### BaggingClassifier example:

```
clf_dt = DecisionTreeClassifier(max_depth=3)
clf_bag = BaggingClassifer(
            base_estimator= clf_dt,
            n_estimators=5
)
clf_bag.fit(X_train, y_train)
y_pred = clf_bag.predict(X_test)
```

#### BaggingRegressor example

```
reg_lr = LinearRegression(normalization=False)
reg_bag = BaggingRegressor(
            base_estimator=reg_lr,
            oob_score = True # oob_score = out-of-bag score
)
reg_bag.fit(X_train, y_train)
y_pred = reg_bag.predict(X_test)
```
* Note: default number of estimators is ten, when left undefined as above

#### Out-of-bag score
* In a bagging ensemble, each estimator is trained on a bootstrap sample. Therefore, each of the samples will leave out some of the instances, which are then used to evaluate each estimator, similar to a train-test split
* To get the out-of-bag score for each instance, we calculate the predictions using all the estimators for which it was out of the sample
* Then, combine individual predictions
* Evaluate the metric on those predictions
    * For **classification** the default metric is accuracy
    * For **regression** the default metric is R2 (aka **coefficient of determination**)
* **Out of bag score helps avoid the need for an independent test set**
    * However, it's often lower than the actual performance
* To get the out-of-bag score from a Bagging ensemble, we need to set the parameter `oob_score` to True
* After training the model, we can access the oob score using the attribute `.oob_score_`
* It's good to compare oob score to actual metric (for example, R2 or accuracy)
    * The two values being close is a good indicator of the model's ability to generalize to new data
    
```
# Instantiate the base model
clf_dt = DecisionTreeClassifier(max_depth=4)

# Build and train the Bagging classifier
clf_bag = BaggingClassifier(
  base_estimator=clf_dt,
  n_estimators=21,
  random_state=500)
clf_bag.fit(X_train, y_train)

# Predict the labels of the test set
pred = clf_bag.predict(X_test)

# Show the F1-score
print('F1-Score: {:.3f}'.format(f1_score(y_test, pred)))
```

```
# Build and train the bagging classifier
clf_bag = BaggingClassifier(
  base_estimator=clf_dt,
  n_estimators=21,
  oob_score=True,
  random_state=500)
clf_bag.fit(X_train, y_train)

# Print the out-of-bag score
print('OOB-Score: {:.3f}'.format(clf_bag.oob_score_))

# Evaluate the performance on the test set to compare
pred = clf_bag.predict(X_test)
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, pred)))
```

#### Bagging parameters: tips and tricks
* **Basic parameters for bagging:**
    * **`base_estimator`**: the "weak" model which will be built for each sample
    * **`n_estimtors`**: specifies the number of estimators to use; 10 by default; in practice you'll want to use more than 10 (the larger the better-- usually between 100 and 500 trees are enough)
    * **`oob_score`**: T/F 
* **Additional parameters for bagging:**
    * **`max_samples`**: the number of samples to draw for each estimator; default is 1.0, the equivalent of 100%
    * **`max_features`**: the number of features to draw (randomly) for each estimator; default is 1.0, the equivalent of 100%
        * Using lower values of both of the above provides more diversity for the individual models and reduces the correlation among them, as each will get a different sample of both features and instances.
        * For classification, the optimal value lies around the square root of the number of features
        * For regression, the optimal value is usually close to one third of the number of features
    * **`bootstrap`**: boolean; indicates whether samples are drawn with replacement; default is `True`
        * If passed as True, it is recommended to use max_samples of 100%
        * If False, then max_samples should be lower than 100%, because otherwise all the samples would be identical
        
## Random Forest Bagging
* Random Forests are a special case of bagging where the base estimators are decision trees
* If you want to use decision trees as base estimators, it is recommended to use the Random Forest classes instead, as they are specifically designed for trees.
* The **sklearn implementation for RFs combines the models using averaging instead of voting, so there is no need to use an odd number of estimators**
* For classification: **`RandomForestClassifier`**
* For regression: **`RandomForestRegressor`**
* **Some of the most important parameters:**
    * **Parameters shared with bagging:**
        * **`n_estimators`**
        * **`max_features`**
        * **`oob_score`**
    * **Tree-specific parameters:**
        * **`max_depth`**
        * **`min_samples_split`** (min samples required to split a node)
        * **`min_samples_leaf`** (min samples required in a leaf)
        * **`class_weight`** (`"balanced"`): allows you to specify the weights for each class using a dictionary: balanced will use the class distribution to calculate balanced weights $\Rightarrow$ therefore RFs are able to deal with imbalanced targets
        
        
#### Recap: Bias-variance trade-off
* A simple model has low variance, but high bias
* Adding more complexity to the model may reduce the bias but increase the variance of predictions
* This is why it's important to optimize the parameters of the ensemble models that minimize the total error and find the balance between bias and variance
    
```
# Build a balanced logistic regression
clf_lr = LogisticRegression(class_weight="balanced")

# Build and fit a bagging classifier
clf_bag = BaggingClassifier(clf_lr, max_features=10, oob_score=True, random_state=500)
clf_bag.fit(X_train, y_train)

# Evaluate the accuracy on the test set and show the out-of-bag score
pred = clf_bag.predict(X_test)
print('Accuracy:  {:.2f}'.format(accuracy_score(y_test, pred)))
print('OOB-Score: {:.2f}'.format(clf_bag.oob_score_))

# Print the confusion matrix
print(confusion_matrix(y_test, pred))
```

```
# Build a balanced logistic regression
clf_base = LogisticRegression(class_weight='balanced', random_state=42)

# Build and fit a bagging classifier with custom parameters
clf_bag = BaggingClassifier(base_estimator=clf_base, n_estimators=500, max_samples=0.65, max_features=10, bootstrap=False, random_state=500)
clf_bag.fit(X_train, y_train)

# Calculate predictions and evaluate the accuracy on the test set
y_pred = clf_bag.predict(X_test)
print('Accuracy:  {:.2f}'.format(accuracy_score(y_test, y_pred)))

# Print the classification report
print(classification_report(y_test, y_pred))
```


## Boosting

* **Boosting** is class of ensemble learning algorithms based on a technique known as **gradual learning**
* Collective learning vs gradual learning:
    * **Collective learning:**
        * the "wisdom of the crowd" principle
        * idea that the combined prediction of individual models is superior to any of the individual predictions on their own.
        * For collective learning to be efficient, the estimators need to be independent and uncorrelated
        * All the estimators are learning the same task for the same goal
        * Because the estimators are independent, they can be trained in parallel to speed up the model building
    * **Gradual learning:**
        * Based on the principle of iterative learning
        * In this approach, each subsequent model tries to fix the errors of the previous model 
        * Gradual learning creates dependent estimators, as each model takes advantage of the knowledge from the previous estimator
        * Each model is learning a different task, but each one contributes to the same goal of accurately predicting the target variable
        * As gradual learning follows a sequential model building process, models cannot be trained in parallel
        * Intuitively, gradual learning is similar to the way we, as humans, learn
        * In gradual learning, instead of the same model being corrected in every iteration, a new model is built that tries to fix the errors of the previous model
        * **Careful of fitting to noise!**
        * You want to avoid having an estimator that is fitting to noise, which will lead to overfitting
            * One way to control this is to stop training after the errors of an estimator start to display white noise

* **White noise:**
    * Errors are uncorrelated with the input features 
    * Errors are unbiased and have constant variance
    
* **Another approach to control fitting to white noise is Improvement Tolerance:**
    * **Improvement tolerance:**
        * If the difference in performance does not meet a defined threshold, then the training is stopped
        * **?** Cross search for threshold defintion **?**
        
* You'll build another linear regression, but this time the target values are the errors from the base model, calculated as follows:

* `y_train_error = pred_train - y_train`
* `y_test_error = pred_test - y_test`

```
# Fit a linear regression model to the previous errors
reg_error = LinearRegression(normalize=True)
reg_error.fit(X_train_pop, y_train_error)

# Calculate the predicted errors on the test set
pred_error = reg_error.predict(X_test_pop)

# Evaluate the updated performance
rmse_error = np.sqrt(mean_squared_error(y_test_error, pred_error))
print('RMSE: {:.3f}'.format(rmse_error))
```

### AdaBoost: Adaptive Boosting

* Award winning model with a high potential to solve complex problems 
* The first practical boosting algorithm
* Proposed in 1997, it remains highly used and well-known among machine learning practitioners
* There are two distinctive properties of Adaptive Boosting compared to other boosting algorithms

#### AdaBoost Properties
* 1. Instances are drawn using a sample distribution of the training data into each subsequent dataset
    * This sample distribution makes sure that instances which were harder to predict for the previous estimator have a higher chance to be included in the training set for the next estimator by giving them higher weights
    * Distribution is initialized to be uniform
* 2. The estimators are combined through weighted majority voting
    * The voting weights are based on the estimators' training error
    * Estimators which have shown good performance are rewarded with higher weights for voting
    * **Good estimators are given higher weights.**
* 3. Guaranteed to improve as the estimator grows
    * **AdaBoost is guaranteed to improve as the ensemble grows if each estimator has an error rate less than 0.5.
    
* **Each estimator needs to be a "weak" 
* **Similar to bagging, AdaBoost can be used for both Classification and Regression with its two variations

```
from sklearn.ensembe import AdaBoostClassifier
clf_ada = AdaBoostClassifier
clf_ada = AdaBoostClassifier(base_estimator, 
                                n_estimators,
                                learning_rate
)
```

* **Parameters:**
* `base_estimator`: the weak model template for all the estimators; default is a DecisionTreeClassifier with a max depth of 1, also known as a decision stump
* `n_estimators`: number of estimators to use; default is 50
    * If there's a perfect fit, or an estimator with error higher than 50%, no more estimators are built
* `learning_rate`: which represents how much each estimator contributes to the ensemble; 1.0 by default
    * There is a trade between `n_estimators` and `learning_rate`
    
```
from sklearn.ensemble import AdaBoostRegressor
reg_ada = AdaBoostRegressor(
            base_estimator,
            n_estimators,
            learning_rate,
            loss
)
```
* There is a difference with the parameter `base_estimator` between classification and regression versions of AdaBoost. If it's not specified, the default will be a Decision Tree Regressor with a max_depth of **3** (as compared with the AdaBoost Classifier with had as default a Decision Tree Classifier with max_depth of **1** by default.
* `loss` parameter is the function used to update weights. By default it is linear, but you can also use the square of exponential loss

```
# Instantiate a normalized linear regression model
reg_lm = LinearRegression(normalize=True)

# Build and fit an AdaBoost regressor
reg_ada = AdaBoostRegressor(base_estimator=reg_lm, n_estimators=12, random_state=500)
reg_ada.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = reg_ada.predict(X_test)

# Evaluate the performance using the RMSE
rmse = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE: {:.3f}'.format(rmse))
```

```
# Build and fit a tree-based AdaBoost regressor
reg_ada = AdaBoostRegressor(n_estimators=12, random_state=500)
reg_ada.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = reg_ada.predict(X_test)

# Evaluate the performance using the RMSE
rmse = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE: {:.3f}'.format(rmse))
```

# Gradient boosting

* 1. Initial model (weak estimator fit to the dataset)
* 2. On each subsequent iteration, a new model is built and fitted to the residual error from the previous iteration.
* 3. After each individual estimator is built, the resuly is a new **additive** model, which is an improvement on the previous estimate
* 4. Repeat this process n times or until the error is small enough such that the difference in performance in negligible
* 5. After the algorithm is finished, the result is a final improved additive model.
        * This is a peculiarity of Gradient Boosting, as the individual estimators are not combined through voting or average, but by addition
        * This is because only the first model is fitted to the target variable, and the rest are estimates of the residual errors 
        
* Why "Gradient Boosting?"
    * Because it's equivalent to applying **gradient descent** as the optimization algorithm.
    * The residuals = negative gradient
    
* The residuals are defined as $F_i$(x)
    * This ($F_i$(x)) represents the error that the model has at iteration *i*
    
* **Gradient Descent** is an iterative optimization algorithm that attempts to minimize the loss of an estimator
* On every iteration steps are taken in the direction of the negative gradient, which points toward the minimum 
* The gradient is the derivative of the loss with respect to the approximate function
* The result is $F_i$(x) - *y*
* We are actually improving the model using Gradient Descent on each iteration 
* **Gradient Boosting Classifier:**

```
from sklearn.ensemble import GradientBoostingClassifier
clf_gbm = GradientBoostingClassifier(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=3,
            min_samples_split,
            max_samples_leaf,
            max_features
)
```

* Unlike with other ensemble methods, here we don't specify the `base_estimator` as Gradient Boosting is implemented with regression trees as the individual estimators.
* In classification, the trees are fitted to the class probabilities
* `n_estimators`= 100 by default 
* `learning_rate`= 0.1 by default
* `max_depth` = 3 by default
* **In gradient boosting, it is recommended to use all the features.**
* **Gradient Boosting Regressor:**

```
from sklearn.ensemble import GradientBoostingRegressor
reg_gbm = GradientBoostingRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth= 3
            min_samples_split,
            min_samples_leaf,
            max_features
)
```

```
# Build and fit a Gradient Boosting classifier
clf_gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=500)
clf_gbm.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = clf_gbm.predict(X_test)

# Evaluate the performance based on the accuracy
acc = accuracy_score(y_test, pred)
print('Accuracy: {:.3f}'.format(acc))

# Get and show the Confusion Matrix
cm = confusion_matrix(y_test, pred)
print(cm)
```                   

### Gradient Boosting "flavors"

* Some variations or "flavors" in the gradient boosting family of algorithms (along with their implementations in Python)

#### XGBoost
* **Extreme Gradient Boosting**; implemented with **XGBoost**
* a more advance implementation of the Gradient Boosting algorithm
* optimized for distributed computing for both training and prediction phases
* uses parallel processing for training each estimator, thus speeding up the processing
* scalable, portable, accurate
* can work with huge datasets

```
import xgboost as xgb
clf_xgb = xgb.XGBClassifier(
            n_estimtors= 100,
            learning_rate=0.1,
            max_depth= 3,
            random_state
)
clf_xgb.fit(X_train, y_train)
pred = clf_xgb.predict(X_test)
```


#### LightGBM
* **Light Gradient Boosting Machine**; implemented with **LightGBM**
* framework developed by Microsoft in 2017
* Provides faster training and higher efficiency
* Lighter in terms of space and memory usage
* Being a distributed algorithm means it's optimized for parallel and GPU processing
* Useful for problems involving big datasets and/or constraints of speed or memory
* Note that for LGBM max_depth is **-1** by default (meaning **no limit**)

```
import lightgbm as lgb
clf_lgb = lgb.LGBMClassifier(
            n_estimtors=100,
            learning_rate=0.1,
            max_depth= -1,
            random_state
)
clf_lgb.fit(X_train, y_train)
pred = clf_lgb.predict(X_test)           
```

```
# Build and fit an XGBoost regressor
reg_xgb = xgb.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, objective='reg:squarederror', random_state=500)
reg_xgb.fit(X_train, y_train)

# Build and fit a LightGBM regressor
reg_lgb = lgb.LGBMRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, objective='mean_squared_error', seed=500)
reg_lgb.fit(X_train, y_train)

# Calculate the predictions and evaluate both regressors
pred_xgb = reg_xgb.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, pred_xgb))
pred_lgb = reg_lgb.predict(X_test)
rmse_lgb = np.sqrt(mean_squared_error(y_test, pred_lgb))

print('Extreme: {:.3f}, Light: {:.3f}'.format(rmse_xgb, rmse_lgb))
```

#### CatBoost
* **Categorical Boosting**; implemented with **CatBoost** (the newest "flavor")
* alias = `cb`
* `cb` gives us access to CatBoostClassifier **and** CatBoostRegressor
* the most recent Gradient Boosting "flavor"
* open sourced by **Yandex**, a Russian tech company, in April 2017
* CatBoost has built-in capacity to handle categorical features, so you don't need to do the preprocessing yourself
* It is a fast implementation which can scale to large datasets and run on a GPU if required
* Accurate and robust
* Fast and scalable
* User-friendly interface/API that integrates well with scikit-learn
* Similar set of parameters to other scikit-learn machine learning class, but the default values are somewhat different (see below)

```
import catboost as cb
clf_cat = cb.CatBoostClassifier(
            n_estimators=1000,
            learning_rate=0.03,
            max_depth=6,
            random_state
)
clf_cat.fit(X_train, y_train)
pred = clf_cat.predict(X_test)
```

#### Movie revenue prediction with CatBoost

```
# Build and fit a CatBoost regressor
reg_cat = cb.CatBoostRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=500)
reg_cat.fit(X_train, y_train)

# Calculate the predictions on the set set
pred = reg_cat.predict(X_test)

# Evaluate the performance using the RMSE
rmse_cat = np.sqrt(mean_squared_error(y_test, pred))
print('RMSE (CatBoost): {:.3f}'.format(rmse_cat))
```

# Cat Boost

* CatBoost is an open-sourced gradient boosting library.
* CatBoost deals with the categorical data quite well out-of-the-box. However, it also has a huge number of training parameters, which provide fine control over the categorical features preprocessing. 
* According to [CatBoost documentation](https://catboost.ai/en/docs/features/categorical-features): **don't use one hot encoding on preprocessing of data fed to CatBoost** "This affects both training speed and resulting quality."
* Alternately, according to tds: "CatBoost supports some traditional methods of categorical data preprocessing, such as One-hot Encoding and Frequency Encoding."
* The core idea behind CatBoost categorical features preprocessing is Ordered Target Encoding: a random permutation of the dataset is performed and then target encoding of some type (for example just computing mean of the target for objects of this category) is performed on each example using only the objects that are placed before the current object.

* Generally transforming categorical features to numerical features in CatBoost includes the following steps:
    * 1. **Permutation** of the training objects in random order.
    * 2. **Quantization** i.e. converting the target value from a floating-point to an integer depending on the task type:
        * **Classification** — Possible values for target value are “0” (doesn’t belong to the specified target class) and “1” (belongs to the specified target class).
        * **Multiclassification** — The target values are integer identifiers of target classes (starting from “0”).
        * **Regression** — Quantization is performed on the label value. The mode and number of buckets are set in the starting parameters. All values located inside a single bucket are assigned a label value class — an integer in the range defined by the formula: <bucket ID — 1>.
    * 3. **Encoding** the categorical feature values.
    
    
* CatBoost creates four permutations of the training objects and for each permutation, a separate model is trained. Three models are used for the tree structure selection and the fourth is used to compute the leaves values of the final model that we save. At each iteration one of the three models is chosen randomly; this model is used to choose the new tree structure and to calculate the leaves values for all the four models.

* **Another important point is that CatBoost can create new categorical features combining the existing ones. And it will actually do so unless you explicitly tell it not to**



## XGBoost

#### Notes: 
* **change MSE evaluation metric to RMSE (or MAE) in capstone?**

* Common regression metrics: 
    * RMSE
        * allows us to treat negative and positive differences equally 
        * but, it tends to punish larger differences between predicted and actual values much more than smaller ones 
    * MAE
        * simply sums the absolute differences between predicted and actual values 
        * Although MAE isn't affected by large differences as much as RMSE, it lacks some nice mathematical properties that make it much less frequently used as an evaluation metric
* Decision trees can be effectively applied to both classification and regression probems, an important property that makes them prime candidates to be the building blocks of XGBoost models 

* Objective (loss) functions and base learners
    * Two critical concepts to understand in order to grasp why XGBoost is such a powerful approach to building supervised regression models
    * The goal of any ML model is to find the model that yields the minimum value of the loss function
    
#### Common loss functions and XGBoost
* Loss functions have specific naming conventions in XGBoost:
    * **reg:linear** $\Rightarrow$ use for regression problems 
    * **reg:logistic** $\Rightarrow$ use for binary classification problems when you want just the decision, **not** the probability
    * **binary:logistic** $\Rightarrow$ use when you want probability rather than just the decision for binary classification problems. 
    
#### Base learners
* XGBoost is an ensemble learning method composed of many individual models that are added together to generte a single prediction
* Each of the individual models that are trained and combined are called base learners
* **The goal of XGBoost is to have base learners that are slightly better than random guessing on certain subsets of training examples and uniformly bad on the remainder** so tht when all of the predictions are combined, the uniformly bad predictions cancel out and those slightly better than chance combine into a single very good prediction.
* Two kinds of base learners: tree and linear

### Trees as base learners in XGBoost for a regression problem with sklearn's API:

```
from sklearn.model_selection import train_test_split
import xgboost as xgb
X, y = boston_data.iloc[:, :-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

xg_reg = xgb.XGBRegressor(
                objective='reg:linear', 
                n_estimators=10, 
                seed=123
)
xg_reg.fit(X_train, y_train)

preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
```

### Regularization and base learners in XGBoost

* Loss functions in XGboost don't just take into account how close a model's predictions are to the actual values, but also take into account how complex the model is. 
* The idea of penalizing models as they become more complex is called **regularization**
* So, loss functions in XGBoost are used to find models that are both accurate and as simple as they can possibly be
* **Regularization parameters in XGBoost:**
    * **gamma**: minimum loss reduction allowed for a split to occur; for tree-based learners 
        * higher values lead to fewer splits
    * **alpha**: l1 regularization (penalty) on leaf weights (rather than on feature weights, as is the cse in linear or logistic regression)
        * larger values mean more regularization
        * higher alpha values lead to stronger L1 regularization, which causes many leaf weights in the base learners to go to 0.
    * **lambda**: l2 regularization on leaf weights
        * L2 regularization is a much smoother penalty than L1 and causes leaf weights to smoothly decrease, instead of enforcing strong sparsity constraints on the leaf weights as in L1
* More about regularization in DC's Supervised Learning with Scikit Learn Course

* **L1 regularization in XGBoost example:**
* first: import necessary libraries and dataset, create feature matrix X and target vector y

```
boston_dmatrix = xgb.DMatrix(data=X, label=y)
params= {"objective":"reg:linear", "max_depth":4}
l1_params = [1,10,100]
rmses_l1 = []
for reg in l1_params:
    params['aplha'] = reg
    cv_results = xgb.cv(
                    dtrain=boston_dmatrix, 
                    params=params, 
                    nfold=4,
                    num_boost_round =10,
                    metrics='rmse',
                    as_pandas=True,
                    seed=123
    )
    rmses_l1.append(cv_results['test-rmse-mean'].tail(1).values[0])
print('Best rmse as a function of l1:')
print(pd.DataFrame(list(zip(l1_params, rmse_l1)), columns=['l1', 'rmse']))
```

* **Tree base learner:**
    * Decision tree
    * Boosted model is weighted sum of decision trees (nonlinear)
    * Almost exclusively used in XGBoost
        

### Tunable parameters in XGBoost
* The parameters that can be tuned are significantly different for each base learner
* For the tree based learner, most frequentyly tuned parameters are:
    * **learning rate:** affects how quickly the model fits the residual error using additional base learners 
        * A low learning rate will require more boosting rounds to achieve the same reduction in residual error as an XGBoost with a high learning rate. 
    * **gamma:** min loss reduction to create new tree split
    * **lambda:** L2 reg on leaf weights
    * **alpha:** L1 reg on leaf weights
    * **max_depth:** max depth per tree; must be a positive integer value
    * **subsample:** % samples used per tree; must be a value between 0 and 1 
        * if the value is low, then the fractino of your training data used per boosting round would be low (and you may run into underfitting problems
        * a high value can lead to overfitting
    * **colsample_bytree:** % features used per tree; must be a value between 0 and 1
        * a large value means that almost all features can be used to build a tree during a given boosting round; large values may in certain cases overfit a trained model.
        * a small value means that the fraction of features that can be selected from is very small; small values can be thought of as providing additional regularization to the model 


```
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:linear", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))
```

```
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:linear"}

# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))
```

```
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:linear","max_depth":3}

# Create list of hyperparameter values
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params["colsample_bytree"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))
```

### Review of grid search and random search

* How do we find the optimal values for several hyperparameters simultaneously, leading to the lowest loss possible, when their values interact in non-obvious, non-linear ways?
* **GridSearch is a method of exhaustively searching through a collection of possible parameter values
    * Number of models = number of distinct values per hyperparameter multiplied across hyperparameter
    * In GridSearch, you try every parameter configuration, evaluate some metric for that configuration, and pick the parameter configuration that gave you the best value for the metric you were using.
    
```
gbm_param_grid = {'learning_rate':[0.01, 0.1, 0.5, 0.9],
                  'n_estimtors': [200],
                  'subsample': [0.3, 0.5, 0.9]}
gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=gbm, 
                        param_grid=gbm_param_grid, 
                        scoring = 'neg_mean_squared_error',
                        cv=4,
                        verbose=1)
grid_mse.fit(X, y)
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
```

#### RandomSearch: review
* Create a (possibly infinite) range of hyperparameter values per hyperparameter that you would like to search over.
* Set the number of iterations you would like for the random search to continue

```
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm, scoring='neg_mean_squared_error', cv=4, verbose=1)


# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
```

#### Limits of GridSearch and RandomSearch
* Grid search: computationally expensive
* Random search: parameter space to explore can be massive

## Review of pipelines using sklearn
* Pipelines in sklearn are objects that take a list of named tuples (name, pipline_step) as input
* The named tuples must always contain a string name as the first element in each tuple
* Tuples can contain any arbitrary scikit-learn compatible estimator or transformer object
* Each named tuple in the pipeline is called a step and the list of transformations that are contained in it are executed in order once some data is passed through the pipeline.
* Pipeline implements sklearn's standard fit/predict paradigm.
* Pipelines can be used as input estimators into grid/randomized search and cross_val_score methods

#### Pipeline example using Random Forest Regression

```
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

data = ...
X, y = ...
rf_pipeline = Pipeline[("st_scaler", StandardScaler()), ('rf_model', RandomForestRegressor())]
scores = cross_val_score(rf_pipeline, X, y, scoring = 'neg_mean_squared_error, cv=10)

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print("final RMSE: ", final_avg_rmse)
```
$\Uparrow$ to get a root mean squared error across all 10 cross-validation folds

* **neg_mean_squared_error**: sklearn's API-specific way of calculating the mean squared error (mse) in an API-comptible way; negative mean squared error don't eactually exist, as all squares must be positive when working with real numbers

#### Dictvectorizer
* DictVectorizer is a class found in scikit-learn's feature extraction submodule and is traditionally used in text processing pipelines by converting lists of feature mappings into vectors
* Need to convert a DataFrame into a list of dictionary entries

```
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor())]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Fit the pipeline
xgb_pipeline.fit(X.to_dict("records"), y)
```

#### Incorporating XGBoost into pipelines
* to get XGBoost to work within a pipeline, all that's really required is that you use XGBoost's scikit-learn API within a pipeline object

```
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

data = ...
X, y = ...
 
scores = cross_val_score(xgb_pipeline, X, y, scoring = "neg_mean_squared_error", cv=10)
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print("Final XGB RMSE: " final_avg_rmse)
```

### Additional components introduced for pipelines
* `sklearn_pandas`: is a separate library that attempts to bridge the gap between working with pandas and working with scikit-learn, as they don't always work seamlessly together 
    * `DataFrameMapper`: Interoperability between `pandas` and `scikit-learn`
    * `CategoricalImputer`: a class that allows us to impute missing categorical values directly 
* `sklearn.preprocessing`
    * `Imputer`: Native imputation of numerical columns in scikit-learn
* `sklearn.pipeline`:
    * `FutureUnion`: combine multiple pipelines of features into a single pipeline of features 
        * as we would need to do, for example, if we had one set of preprocessing steps we needed to perform on categorical features of a dataset and a distinct set of preprocessing steps on the numeric features found in a dataset
        
```
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:linear"))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict("records"), y, cv=10, scoring='neg_mean_squared_error')

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))
```

```
# Import necessary modules
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import CategoricalImputer

# Check number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
                                            [([numeric_feature], Imputer(strategy="median")) for numeric_feature in non_categorical_columns],
                                            input_df=True,
                                            df_out=True
                                           )

# Apply categorical imputer
categorical_imputation_mapper = DataFrameMapper(
                                                [(category_feature, CategoricalImputer()) for category_feature in categorical_columns],
                                                input_df=True,
                                                df_out=True
                                               )
```

```
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ("num_mapper", numeric_imputation_mapper),
                                          ("cat_mapper", categorical_imputation_mapper)
                                         ])
```

```
# Create full pipeline
pipeline = Pipeline([
                     ("featureunion", numeric_categorical_union),
                     ("dictifier", Dictifier()),
                     ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier(max_depth=3))
                    ])

# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, kidney_data, y, scoring="roc_auc", cv=3)

# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))
```

### Tuning XGBoost hyperparameters in a pipeline
* **Note:** In order for the hyperparameters to be passed to the appropriate step, you have to name the parameters in the dictionary with the name of the step being referenced, followed by two underscore signs and then the name of the hyperparameter you want to iterate over
* Since (below) the xgboost step is called xgb_model, all of our hyperparameter keys will start with xgboost_model__

```
data = ...
x, y = ...
xgb_pipeline = Pipeline[("st_scaler", ...: StandardScaler()), ('xgb_model', xgb.XGBRegressor())]
gbm_param_grid = {
    ...:      'xgb_model__subsample': np.arange(.05, 1, .05),
    ...:      'xgb_model__max_depth': np.arange(3, 20, 1),
    ...:      'xgb_model__colsample_bytree': np.arange(0.1, 1.05, .05)}
randomized_neg_mse = RandomizedSearchCV(estimator=xgb_pipeline, 
    ...:      param_distributions=gbm_param_grid, n_iter=10,
    ...:      scoring='neg_mean_squared_error', cv=4)
randomized_neg_mse.fit(X, y)

print("Best rmse: ", np.sqrt(np.abs(randomized_neg_mse.best_score_)))

print("Best model: ", randomized_neg_mse.best_estimtor_)
```

```
# Create the parameter grid
gbm_param_grid = {
    'clf__learning_rate': np.arange(.05, 1, .05),
    'clf__max_depth': np.arange(3,10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator=pipeline,
                                        param_distributions=gbm_param_grid,
                                        n_iter=2, scoring='roc_auc', cv=2, verbose=1)

# Fit the estimator
randomized_roc_auc.fit(X, y)

# Compute metrics
print(randomized_roc_auc.best_score_)
print(randomized_roc_auc.best_estimator_)
```