**Classification with XGBoost**
___
Introduction:
- Supervised learning
    - **has labeled data** - we have a record of past outcomes for the problem we are trying to solve or the quantity we are trying to predict
    - **Classification** - binary or multi-class
        - AUC is a common metric for binary classification
            - Area under the receiver operating characteristic (ROC) curve
            - probability that a randomly chosen positive data point will be ranked higher than a randomly chosen negative data point
            - higher AUC, better model
        - for multi-class problems, the confusion matrix and accuracy score are the usual metrics
- features are either numeric or categorical
    - numeric features should be scaled for scale-sensitive models (e.g., SVMs)
    - categorical features should be encoded (one-hot)
- **Ranking** - predicting an ordering on a set of choices
- **Recommending** - recommending an item to a user based on consumption history and profile
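
The pairwise definition of AUC given above can be checked by hand. The following is a minimal pure-Python sketch, not the scikit-learn or XGBoost implementation; the function name `pairwise_auc` and the sample labels and scores are made up for illustration.

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
print(pairwise_auc(labels, scores))  # 5 of the 6 positive/negative pairs are ordered correctly
```

Comparing every pair is quadratic in the number of rows, so this is only meant to make the definition concrete; library implementations compute the same quantity from sorted scores.
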
___

**Introduction to XGBoost**
___
- optimized gradient boosting machine learning library
- written in C++
- fast
- parallelizable

In [None]:
#XGBoost: Fit/Predict

#It's time to create your first XGBoost model! As Sergey showed you
#in the video, you can use the scikit-learn .fit() / .predict()
#paradigm that you are already familiar with to build your XGBoost
#models, as the xgboost library has a scikit-learn compatible API!

#Here, you'll be working with churn data. This dataset contains imaginary
#data from a ride-sharing app with user behaviors over their first month
#of app usage in a set of imaginary cities as well as whether they used
#the service 5 months after sign-up. It has been pre-loaded for you into
#a DataFrame called churn_data - explore it in the Shell!

#Your goal is to use the first month's worth of data to predict whether
#the app's users will remain users of the service at the 5 month mark.
#This is a typical setup for a churn prediction problem. To do this,
#you'll split the data into training and test sets, fit a small xgboost
#model on the training set, and evaluate its performance on the test set
#by computing its accuracy.

#pandas and numpy have been imported as pd and np, and train_test_split
#has been imported from sklearn.model_selection. Additionally, the arrays
#for the features and the target have been created as X and y.

# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
#X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
#X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
#xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
#xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
#preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
#accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
#print("accuracy: %f" % (accuracy))

#################################################
#<script.py> output:
#    accuracy: 0.743300
#################################################

**What is a decision tree?**
___
- series of binary choices
- prediction happens at the "leaves" of the tree
- **base learner** - individual learning algorithm in an ensemble algorithm
- decision trees and CART are constructed iteratively until a stopping criterion is met
- individual decision trees tend to overfit
    - low bias, high variance
    - generalize to new data poorly
- **Classification and Regression Trees (CART)**
    - each leaf *always* contains a real-valued score which can be later converted into categories
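
To make the CART idea concrete, here is a minimal sketch of a single-split tree (a "stump") whose leaves hold real-valued scores that are converted into a category afterwards. It is a simplified illustration, not scikit-learn's or XGBoost's tree builder; the function names and the tumor-size data are hypothetical.

```python
def fit_stump(xs, ys):
    """Find the single binary split on a 1-D feature that minimizes squared error.

    Returns (threshold, left_score, right_score); each leaf score is the mean
    target of the training points that fall into that leaf.
    """
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values to split")
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_score = sum(left) / len(left)
        right_score = sum(right) / len(right)
        sse = (sum((y - left_score) ** 2 for y in left)
               + sum((y - right_score) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_score, right_score)
    return best[1:]

def leaf_score(stump, x):
    """Route x to a leaf and return that leaf's real-valued score."""
    threshold, left_score, right_score = stump
    return left_score if x < threshold else right_score

# Hypothetical tumor sizes with 0 = benign, 1 = malignant (illustration only).
sizes = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1]
stump = fit_stump(sizes, labels)
score = leaf_score(stump, 3.2)        # real-valued leaf score
category = 1 if score >= 0.5 else 0   # converted into a class afterwards
print(stump, score, category)
```

The 0.5 cut-off used to turn the score into a class is an assumption of this sketch; any threshold could be applied to the leaf score.
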
___

In [None]:
#Decision trees

#Your task in this exercise is to make a simple decision tree using
#scikit-learn's DecisionTreeClassifier on the breast cancer dataset
#that comes pre-loaded with scikit-learn.

#This dataset contains numeric measurements of various dimensions of
#individual tumors (such as perimeter and texture) from breast biopsies
#and a single outcome value (the tumor is either malignant, or benign).

#We've preloaded the dataset of samples (measurements) into X and the
#target values per tumor into y. Now, you have to split the complete
#dataset into training and testing sets, and then train a
#DecisionTreeClassifier. You'll specify a parameter called max_depth.
#Many other parameters can be modified within this model, and you can
#check all of them out at
#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
#dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
#dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
#y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
#accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
#print("accuracy:", accuracy)

#################################################
#<script.py> output:
#   accuracy: 0.9649122807017544
#################################################

**What is Boosting?**
___
- a concept that can be applied to a set of machine learning models
- ensemble meta-algorithm used to convert many weak learners into a strong learner
- **weak learner**
    - ML algorithm that is slightly better than chance (50/50)
- **strong learner**
    - any algorithm that can be tuned to achieve good performance
- iteratively learning a set of weak models on subsets of the data
- weighting each weak prediction according to each weak learner's performance
- combine weighted predictions to obtain a single weighted prediction
- **Cross-validation** is baked into XGBoost
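
The "weight each weak learner by its performance, then combine" recipe can be sketched with an AdaBoost-style loop over one-split stumps. This is an illustration of boosting in general, under the assumption that re-weighting the rows stands in for "subsets of the data"; XGBoost itself uses gradient boosting rather than this exact scheme, and all names and data below are made up.

```python
import math

def stump_predict(x, threshold, left_label):
    """Weak learner: one binary split that predicts left_label below the threshold."""
    return left_label if x < threshold else -left_label

def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, left_label) for the best single split."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for left_label in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, left_label) != y)
            if best is None or error < best[0]:
                best = (error, threshold, left_label)
    return best

def boost(xs, ys, rounds):
    """Train `rounds` weak learners; each one receives a vote weight (alpha)."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, left_label = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)  # better learner, larger vote
        ensemble.append((alpha, threshold, left_label))
        # Misclassified points gain weight, so the next learner focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, left_label))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Combine the weighted votes into a single prediction (+1 or -1)."""
    vote = sum(alpha * stump_predict(x, t, label) for alpha, t, label in ensemble)
    return 1 if vote >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy 1-D data that no single split can separate (illustration only).
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
print(accuracy(boost(xs, ys, 1), xs, ys))  # one weak learner
print(accuracy(boost(xs, ys, 3), xs, ys))  # three weak learners combined
```

On this toy data a single stump gets 7 of 10 points right, while the weighted vote of three stumps classifies all 10 correctly.
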
___

In [None]:
#Measuring accuracy

#You'll now practice using XGBoost's learning API through its baked
#in cross-validation capabilities. As Sergey discussed in the previous
#video, XGBoost gets its lauded performance and efficiency gains by
#utilizing its own optimized data structure for datasets called a
#DMatrix.

#In the previous exercise, the input datasets were converted into
#DMatrix data on the fly, but when you use the xgb.cv() function,
#you have to first explicitly convert your data into a DMatrix. So,
#that's what you will do here before running cross-validation on
#churn_data.

# Create arrays for the features and the target: X, y
#X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
#churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
#params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
#cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
#                    nfold=3, num_boost_round=5,
#                    metrics="error", as_pandas=True, seed=123)

# Print cv_results
#print(cv_results)

# Print the accuracy
#print(((1-cv_results["test-error-mean"]).iloc[-1]))

#################################################
#<script.py> output:
#       train-error-mean  train-error-std  test-error-mean  test-error-std
#    0           0.28232         0.002366          0.28378        0.001932
#    1           0.26951         0.001855          0.27190        0.001932
#    2           0.25605         0.003213          0.25798        0.003963
#    3           0.25090         0.001845          0.25434        0.003827
#    4           0.24654         0.001981          0.24852        0.000934
#    0.75148
#################################################
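
What `nfold=3` means can be shown without the library. The sketch below is a plain-Python illustration of k-fold splitting and averaging the out-of-fold error, not how `xgb.cv` is implemented; the helper names, the majority-class "model" and the churn labels are all hypothetical.

```python
import random

def kfold_indices(n_samples, nfold, seed):
    """Shuffle row indices once, then deal them into `nfold` disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::nfold] for i in range(nfold)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_val_error(n_samples, nfold, seed, fit, error):
    """Average the out-of-fold error; `fit` and `error` are supplied by the caller."""
    fold_errors = []
    for train_idx, test_idx in kfold_indices(n_samples, nfold, seed):
        model = fit(train_idx)
        fold_errors.append(error(model, test_idx))
    return sum(fold_errors) / len(fold_errors)

labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical churn labels

def fit_majority(train_idx):
    """Stand-in model: always predict the majority class of the training rows."""
    ones = sum(labels[i] for i in train_idx)
    return 1 if 2 * ones >= len(train_idx) else 0

def error_rate(model, test_idx):
    return sum(labels[i] != model for i in test_idx) / len(test_idx)

cv_error = cross_val_error(len(labels), nfold=3, seed=123,
                           fit=fit_majority, error=error_rate)
print("accuracy:", 1 - cv_error)  # same error-to-accuracy conversion as above
```

Every row lands in exactly one test fold, so each row is scored once by a model that never saw it during fitting.
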

#Measuring AUC
#Now that you've used cross-validation to compute average out-of-sample
#accuracy (after converting from an error), it's very easy to compute
#any other metric you might be interested in. All you have to do is pass
#it (or a list of metrics) in as an argument to the metrics parameter
#of xgb.cv().

#Your job in this exercise is to compute another common metric used in
#binary classification - the area under the curve ("auc"). As before,
#churn_data is available in your workspace, along with the DMatrix
#churn_dmatrix and parameter dictionary params.

#Perform cross_validation: cv_results
#cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
#                  nfold=3, num_boost_round=5,
#                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results
#print(cv_results)

# Print the AUC
#print((cv_results["test-auc-mean"]).iloc[-1])

#################################################
#<script.py> output:
#       train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
#    0        0.768893       0.001544       0.767863      0.002820
#    1        0.790864       0.006758       0.789157      0.006846
#    2        0.815872       0.003900       0.814476      0.005997
#    3        0.822959       0.002018       0.821682      0.003912
#    4        0.827528       0.000769       0.826191      0.001937
#    0.826191
#################################################

**When should I use XGBoost?**
___
- any supervised learning example with:
    - large number of training samples (1000+ samples with fewer than 100 features)
    - number of features < number of training samples
    - a mixture of categorical and numeric features
    - just numeric features
- do not use XGBoost with:
    - image recognition
    - computer vision
    - natural language processing/understanding
    - smaller number of training samples (see above)
___

**Regression Review**
___
- Common regression metrics
    - root mean squared error (RMSE)
        - square root of the mean of the squared differences between actual and predicted values
        - treats negative and positive errors equally
        - tends to punish larger differences between predicted and actual values more heavily
    - mean absolute error (MAE)
        - mean of the absolute differences between actual and predicted values
- **Algorithms**
    - linear regression
    - decision trees
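
A small worked example shows how the two metrics react to a single large error. This is a plain-Python sketch of the formulas described above, with made-up numbers.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

actual = [3.0, 5.0, 7.0, 9.0]
small_errors = [4.0, 4.0, 8.0, 8.0]     # every prediction is off by 1
one_big_error = [3.0, 5.0, 7.0, 13.0]   # a single prediction is off by 4

print(mae(actual, small_errors), rmse(actual, small_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

Both prediction sets have the same MAE, but the RMSE of the second set is twice as large because squaring amplifies the single big miss.
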
___

**Objective (loss) functions and base learners**
___
- Quantifies how far off a prediction is from the actual result
- Measures the difference between the estimated and true values for some collection of data
- **Goal**: find the model that yields the minimum value of the loss function
- in xgboost:
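    - reg:linear - use for regression problems (deprecated; newer versions name it reg:squarederror)
    - reg:logistic - logistic regression; the model output is a probability
    - binary:logistic - logistic regression for binary classification; also outputs a probability, which can be thresholded when only a decision is needed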
- base learners and why we need them
    - we want base learners that when combined create a final prediction that is **non-linear**
    - each base learner should be good at distinguishing or predicting different parts of the dataset
- two kinds of base learners: tree and linear
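
As a rough illustration of what an objective measures, the sketch below writes out the per-sample squared error used for regression and the logistic loss used for binary classification. These are the textbook formulas, written in plain Python rather than taken from the xgboost source; the helper names and sample values are made up.

```python
import math

def squared_error(y_true, y_pred):
    """Per-sample loss for regression (the idea behind reg:squarederror)."""
    return (y_true - y_pred) ** 2

def sigmoid(raw_score):
    """Map a raw model score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def log_loss(y_true, probability):
    """Per-sample logistic loss for a 0/1 label and a predicted probability."""
    return -(y_true * math.log(probability)
             + (1 - y_true) * math.log(1 - probability))

def mean_loss(loss, targets, predictions):
    return sum(loss(t, p) for t, p in zip(targets, predictions)) / len(targets)

targets = [1, 0, 1]
confident = [sigmoid(2.0), sigmoid(-2.0), sigmoid(2.0)]  # raw scores point the right way
unsure = [sigmoid(0.0)] * 3                              # probability 0.5 everywhere

print(mean_loss(log_loss, targets, confident))
print(mean_loss(log_loss, targets, unsure))
print(squared_error(3.0, 1.0))
```

The model whose probabilities lean toward the true labels gets the lower loss, which is exactly what training tries to achieve.
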

In [None]:
#Decision trees as base learners

#It's now time to build an XGBoost model to predict house prices -
#not in Boston, Massachusetts, as you saw in the video, but in Ames,
#Iowa! This dataset of housing prices has been pre-loaded into a
#DataFrame called df. If you explore it in the Shell, you'll see that
#there are a variety of features about the house and its location in
#the city.

#In this exercise, your goal is to use trees as base learners. By default,
#XGBoost uses trees as base learners, so you don't have to specify that you
#want to use trees here with booster="gbtree".

#xgboost has been imported as xgb and the arrays for the features and
#the target are available in X and y, respectively.

# Create the training and test sets
#X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
#xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10, seed=123)

# Fit the regressor to the training set
#xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
#preds = xg_reg.predict(X_test)

# Compute the rmse: rmse (mean_squared_error comes from sklearn.metrics)
#from sklearn.metrics import mean_squared_error
#rmse = np.sqrt(mean_squared_error(y_test, preds))
#print("RMSE: %f" % (rmse))

#################################################
#<script.py> output:
#    [18:40:17] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
#    RMSE: 78847.401758
#################################################

#Linear base learners

#Now that you've used trees as base models in XGBoost, let's use the
#other kind of base model that can be used with XGBoost - a linear
#learner. This model, although not as commonly used in XGBoost, allows
#you to create a regularized linear regression using XGBoost's powerful
#learning API. However, because it's uncommon, you have to use XGBoost's
#own non-scikit-learn compatible functions to build the model, such as
#xgb.train().

#In order to do this you must create the parameter dictionary that
#describes the kind of booster you want to use (similarly to how you
#created the dictionary in Chapter 1 when you used xgb.cv()). The
#key-value pair that defines the booster type (base model) you need is
#"booster":"gblinear".

#Once you've created the model, you can use the .train() and .predict()
#methods of the model just like you've done in the past.

#Here, the data has already been split into training and testing sets,
#so you can dive right into creating the DMatrix objects required by the
#XGBoost learning API.

# Convert the training and testing sets into DMatrixes: DM_train, DM_test
#DM_train = xgb.DMatrix(data=X_train, label=y_train)
#DM_test =  xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
#params = {"booster":"gblinear", "objective":"reg:linear"}

# Train the model: xg_reg
#xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
#preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
#rmse = np.sqrt(mean_squared_error(y_test,preds))
#print("RMSE: %f" % (rmse))

#################################################
#<script.py> output:
#    [18:52:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
#    RMSE: 40738.238504
#################################################

#Evaluating model quality
#It's now time to begin evaluating model quality.

#Here, you will compare the RMSE and MAE of a cross-validated XGBoost
#model on the Ames housing data. As in previous exercises, all necessary
#modules have been pre-loaded and the data is available in the DataFrame
#df.

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary: params
#params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
#cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="mae", as_pandas=True, seed=123)

# Print cv_results
#print(cv_results)

# Extract and print the final boosting round metric
#print((cv_results["test-mae-mean"]).tail(1))

#results printed below for "rmse" first and "mae" second
#################################################
#train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
#    0    141767.531250      429.454591   142980.429688    1193.794436
#    1    102832.544922      322.474657   104891.394532    1223.158855
#    2     75872.619140      266.472468    79478.935547    1601.344218
#    3     57245.650390      273.624608    62411.922851    2220.149653
#    4     44401.297851      316.422372    51348.279297    2963.377719
#    4    51348.279297
#   Name: test-rmse-mean, dtype: float64

#train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
#    0   127343.476562     668.342129  127633.980469   2404.003469
#    1    89770.052735     456.962096   90122.501953   2107.915156
#    2    63580.791992     263.403452   64278.561524   1887.563452
#    3    45633.153321     151.884551   46819.167969   1459.819091
#    4    33587.092774      86.999100   35670.644531   1140.607997
#    4    35670.644531
#    Name: test-mae-mean, dtype: float64
#################################################

**Regularization and base learners in XGBoost**
___
- Regularization is a control on model complexity
- Want models that are both accurate and as simple as possible
- Regularization parameters in XGBoost
    - *gamma* - minimum loss reduction required for a split to occur (higher values mean fewer splits)
    - *alpha* - L1 regularization on leaf weights (higher values mean more regularization; many weights will go to zero)
    - *lambda* - L2 regularization on leaf weights (shrinks them smoothly rather than forcing them to zero)
- Linear base learner
    - sum of linear terms
    - boosted model is weighted sum of linear models (linear)
    - rarely used (you can get the same performance from a regularized linear model)
- Tree base learner
    - decision tree
    - boosted model is weighted sum of decision trees (nonlinear)
    - almost always used in XGBoost
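
How the three regularization parameters act on a tree can be sketched with the leaf-weight and split-gain formulas of the regularized boosting objective. Treat this as an assumption-laden illustration, not the library's code: the function names are invented, alpha is left out of the gain for brevity, and for squared error each row is assumed to contribute gradient `prediction - target` and hessian 1.

```python
def soft_threshold(gradient_sum, alpha):
    """L1 shrinkage: pull the gradient sum toward zero by alpha, stopping at zero."""
    if gradient_sum > alpha:
        return gradient_sum - alpha
    if gradient_sum < -alpha:
        return gradient_sum + alpha
    return 0.0

def leaf_weight(gradient_sum, hessian_sum, reg_lambda=1.0, alpha=0.0):
    """Leaf score that minimizes the regularized objective for the rows in the leaf."""
    return -soft_threshold(gradient_sum, alpha) / (hessian_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the gamma penalty for the extra leaf."""
    def quality(g, h):
        return g * g / (h + reg_lambda)
    gain = 0.5 * (quality(g_left, h_left) + quality(g_right, h_right)
                  - quality(g_left + g_right, h_left + h_right))
    return gain - gamma

# A leaf holding 5 rows that are each under-predicted by 2:
# gradient sum = 5 * (-2) = -10, hessian sum = 5.
print(leaf_weight(-10.0, 5.0, reg_lambda=0.0))             # no regularization
print(leaf_weight(-10.0, 5.0, reg_lambda=5.0))             # lambda shrinks the weight
print(leaf_weight(-10.0, 5.0, reg_lambda=0.0, alpha=2.0))  # alpha shrinks it too
print(leaf_weight(-10.0, 5.0, reg_lambda=0.0, alpha=12.0)) # large alpha zeroes it

# The same candidate split is kept or rejected depending on gamma.
print(split_gain(-6.0, 3.0, 4.0, 2.0, gamma=0.0) > 0)
print(split_gain(-6.0, 3.0, 4.0, 2.0, gamma=10.0) > 0)
```

Larger lambda divides the leaf score by a bigger denominator, alpha subtracts from the numerator until the score hits zero, and gamma simply raises the bar a split has to clear.
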

In [None]:
#Using regularization in XGBoost
#Having seen an example of l1 regularization in the video, you'll now
#vary the l2 regularization penalty - also known as "lambda" - and see
#its effect on overall model performance on the Ames housing dataset.

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X, label=y)

#reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
#params = {"objective":"reg:linear","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
#rmses_l2 = []

# Iterate over reg_params
#for reg in reg_params:

    # Update l2 strength
#    params["lambda"] = reg

    # Pass this updated param dictionary into cv
#    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)

    # Append best rmse (final round) to rmses_l2
#    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
#print("Best rmse as a function of l2:")
#print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2","rmse"]))

#################################################
#Best rmse as a function of l2:
#        l2          rmse
#    0    1  52275.359375
#    1   10  57746.064453
#    2  100  76624.625000
#################################################

In [None]:
#Visualizing individual XGBoost trees
#Now that you've used XGBoost to both build and evaluate regression as
#well as classification models, you should get a handle on how to
#visually explore your models. Here, you will visualize individual trees
#from the fully boosted model that XGBoost creates using the entire
#housing dataset.

#XGBoost has a plot_tree() function that makes this type of
#visualization easy. Once you train a model using the XGBoost learning
#API, you can pass it to the plot_tree() function along with the number
#of trees you want to plot using the num_trees argument.

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
#params = {"objective":"reg:linear", "max_depth":2}

# Train the model: xg_reg
#xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree (plt is matplotlib.pyplot)
#import matplotlib.pyplot as plt
#xgb.plot_tree(xg_reg, num_trees=0)
#plt.show()

# Plot the fifth tree
#xgb.plot_tree(xg_reg, num_trees=4)
#plt.show()

# Plot the last tree sideways
#xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR")
#plt.show()

![_images/11.1.svg](_images/11.1.svg)
![_images/11.2.svg](_images/11.2.svg)
![_images/11.3.svg](_images/11.3.svg)

In [None]:
#Visualizing feature importances: What features are most important in
#my dataset

#Another way to visualize your XGBoost models is to examine the
#importance of each feature column in the original dataset within the
#model.

#One simple way of doing this involves counting the number of times
#each feature is split on across all boosting rounds (trees) in the
#model, and then visualizing the result as a bar graph, with the
#features ordered according to how many times they appear. XGBoost has
#a plot_importance() function that allows you to do exactly this, and
#you'll get a chance to use it in this exercise!

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
#params = {"objective":"reg:linear", "max_depth":4}

# Train the model: xg_reg
#xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the feature importances
#xgb.plot_importance(xg_reg)
#plt.show()

![_images/11.4.svg](_images/11.4.svg)
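
The split-counting idea behind this kind of importance plot can be sketched in a few lines. The trees below are hand-written nested dictionaries, a hypothetical stand-in for a trained booster; the feature names and leaf scores are invented for illustration, and the functions are not part of the xgboost API.

```python
from collections import Counter

def count_splits(node, counts):
    """Walk one tree and count how often each feature is used for a split."""
    if "feature" not in node:   # leaf nodes carry only a score
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

def split_count_importance(trees):
    """Total number of splits per feature across all boosting rounds."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()

# Two hypothetical boosted trees (illustration only).
trees = [
    {"feature": "GrLivArea",
     "left": {"score": -1.5},
     "right": {"feature": "OverallQual",
               "left": {"score": 0.4},
               "right": {"score": 2.1}}},
    {"feature": "GrLivArea",
     "left": {"feature": "LotArea",
              "left": {"score": -0.3},
              "right": {"score": 0.2}},
     "right": {"score": 0.9}},
]

importance = split_count_importance(trees)
print(importance)
```

A feature that is split on often is not necessarily the one that reduces the loss the most, so split counts are best read as a first look rather than a final ranking.
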

**Why tune your model?**
___
- tuning hyperparameters can noticeably improve a model, e.g. a ~10-15% reduction of RMSE
___

In [None]:
#Tuning the number of boosting rounds
#Let's start with parameter tuning by seeing how the number of boosting
#rounds (number of trees you build) impacts the out-of-sample performance
#of your XGBoost model. You'll use xgb.cv() inside a for loop and build
#one model per num_boost_round parameter.

#Here, you'll continue working with the Ames housing dataset. The features
#are available in the array X, and the target vector is contained in y.

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(X,y)

# Create the parameter dictionary for each tree: params
#params = {"objective":"reg:linear", "max_depth":3}

# Create list of number of boosting rounds
#num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
#final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
#for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
#    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)

    # Append final round RMSE
#    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
#num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
#print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

#################################################
#   num_boosting_rounds          rmse
#                   5       50903.299479
#                   10      34774.192708
#                   15      32895.098958
#################################################

#Automated boosting round selection using early_stopping

#Now, instead of attempting to cherry pick the best possible number
#of boosting rounds, you can very easily have XGBoost automatically
#select the number of boosting rounds for you within xgb.cv(). This is
#done using a technique called early stopping.

#Early stopping works by testing the XGBoost model after every boosting
#round against a hold-out dataset and stopping the creation of additional
#boosting rounds (thereby finishing training of the model early) if the
#hold-out metric ("rmse" in our case) does not improve for a given number
#of rounds. Here you will use the early_stopping_rounds parameter in
#xgb.cv() with a large possible number of boosting rounds (50). Bear in
#mind that if the holdout metric continuously improves up through when
#num_boost_round is reached, then early stopping does not occur.

#Here, the DMatrix and parameter dictionary have been created for you.
#Your task is to use cross-validation with early stopping. Go for it!

# Create your housing DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
#params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
#cv_results = xgb.cv(dtrain=housing_dmatrix,
#                    params=params,
#                    nfold=3,
#                    early_stopping_rounds=10,
#                    num_boost_round=50,
#                    metrics="rmse",
#                    as_pandas=True,
#                    seed=123)

# Print cv_results
#print(cv_results)

#################################################
#train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
#    0     141871.635417      403.636200   142640.651042     705.559164
#    1     103057.033854       73.772531   104907.664062     111.112417
#    2      75975.966146      253.726099    79262.059895     563.766991
#    3      57420.531250      521.656754    61620.136719    1087.694282
#    4      44552.955729      544.170190    50437.561198    1846.446330
#    5      35763.946615      681.797429    43035.661458    2034.469207
#    6      29861.464193      769.571238    38600.880208    2169.796232
#    7      25994.676432      756.520565    36071.817708    2109.795430
#    8      23306.836588      759.238254    34383.184896    1934.546688
#    9      21459.769531      745.624998    33509.142578    1887.377024
#    11     19215.382812      641.388842    32197.832682    1734.456935
#    12     18627.388021      716.257152    31770.852865    1802.155484
#    13     17960.694661      557.043073    31482.782552    1779.123767
#    14     17559.736328      631.412555    31389.990886    1892.319967
#    15     17205.712565      590.171852    31302.882162    1955.165902
#    16     16876.571940      703.631755    31234.058594    1880.705796
#    17     16597.662110      703.677609    31318.348959    1828.860617
#    18     16330.460937      607.274494    31323.634115    1775.908526
#    19     16005.972982      520.470911    31204.134766    1739.075860
#    20     15814.301432      518.604477    31089.862630    1756.021674
#    21     15493.405599      505.616447    31047.996094    1624.673955
#    23     15086.381836      503.912899    31024.983724    1548.985354
#    25     14709.589518      449.668010    30989.476563    1686.667469
#    26     14457.286458      376.787206    30952.113932    1613.172643
#    27     14185.567383      383.102234    31066.902344    1648.534310
#    28     13934.066732      473.465449    31095.641276    1709.225654
#    29     13749.645182      473.670437    31103.887370    1778.880069
#    30     13549.836914      454.898399    30976.085938    1744.515079
#    31     13413.485351      399.603618    30938.469401    1746.052445
#    32     13275.916016      415.408786    30931.001302    1772.469115
#    33     13085.878255      493.792509    30929.057291    1765.540568
#    34     12947.181315      517.789746    30890.630208    1786.511392
#    35     12846.027344      547.732372    30884.493490    1769.729215
#    36     12702.379232      505.523221    30833.542969    1691.002065
#    37     12532.244141      508.298241    30856.687500    1771.445978
#    38     12384.055013      536.225042    30818.016927    1782.784630
#    39     12198.444010      545.165197    30839.393229    1847.327435
#    40     12054.583659      508.841772    30776.966146    1912.780507
#    41     11897.036133      477.177991    30794.702474    1919.674832
#    42     11756.221354      502.992395    30780.955078    1906.820029
#    43     11618.846029      519.837502    30783.755860    1951.260705
#    44     11484.080404      578.428621    30776.731120    1953.446309
#    45     11356.553060      565.368380    30758.544271    1947.455425
#    46     11193.558268      552.298848    30729.972656    1985.699788
#    47     11071.315429      604.089960    30732.663411    1966.997809
#    48     10950.777995      574.863209    30712.241536    1957.751573
#    49     10824.865560      576.665405    30720.854167    1950.511057
#################################################
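
The stopping rule described above boils down to a patience counter. Here is a minimal sketch of that logic on a list of per-round hold-out metrics; it is not the xgboost implementation, and the function name and metric values are hypothetical.

```python
def early_stopping(metric_per_round, patience):
    """Return (best_round, stopped_round) for a metric that should decrease.

    stopped_round is None when the metric kept improving often enough that
    the patience budget never ran out.
    """
    best_round = 0
    best_value = metric_per_round[0]
    for current_round, value in enumerate(metric_per_round):
        if value < best_value:
            best_round, best_value = current_round, value
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, None

# Hypothetical hold-out RMSE per boosting round (illustration only).
stalls = [5.0, 4.0, 3.0, 3.1, 3.2, 3.3, 2.0]
keeps_improving = [5.0, 4.0, 3.0, 2.0]

print(early_stopping(stalls, patience=3))
print(early_stopping(keeps_improving, patience=3))
```

In the first series training stops three rounds after the best round, so the later improvement is never seen; a larger patience value trades extra training time for a lower chance of stopping too soon.
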

**Overview of XGBoost's hyperparameters**
___
- Common tree tunable parameters
    - **learning rate** (eta) - how quickly the model fits the residual error using additional base learners; a low learning rate requires more base learners
    - **gamma** - minimum loss reduction to create a new split
    - **lambda** - L2 regularization on leaf weights
    - **alpha** - L1 regularization on leaf weights
    - **max_depth** - [integer] max depth per tree per boosting round
    - **subsample** - [0-1] % of samples used per tree/boosting round
        - too low - underfitting
        - too high - overfitting
    - **colsample_bytree** - [0-1] % of features used per tree
        - smaller values provide additional regularization
        - larger values may overfit
- Linear tunable parameters
    - **lambda**: L2 regularization on weights
    - **alpha**: L1 regularization on weights
    - **lambda_bias**: L2 regularization term on bias
- you can also tune the number of estimators used for both base model types
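
The trade-off between learning rate and number of base learners can be demonstrated with a toy gradient-boosting loop. This is a simplified sketch with one-split trees on squared error, not XGBoost's algorithm; the function names, the step-shaped data and the tolerance are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Least-squares split on a 1-D feature: (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def rounds_needed(xs, ys, eta, tolerance, max_rounds=1000):
    """Count boosting rounds until the training MSE drops below `tolerance`."""
    predictions = [sum(ys) / len(ys)] * len(ys)   # start from the mean target
    for round_number in range(1, max_rounds + 1):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # Each new tree is shrunk by eta before it is added to the ensemble.
        predictions = [p + eta * (left_mean if x < threshold else right_mean)
                       for x, p in zip(xs, predictions)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        if mse < tolerance:
            return round_number
    return max_rounds

# Step-shaped toy data (illustration only).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]

print(rounds_needed(xs, ys, eta=0.5, tolerance=0.01))
print(rounds_needed(xs, ys, eta=0.1, tolerance=0.01))
```

With the smaller eta every tree removes a smaller share of the remaining residual, so many more rounds are needed to reach the same training error.
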
___

In [None]:
#Tuning eta

#It's time to practice tuning other XGBoost hyperparameters in earnest
#and observing their effect on model performance! You'll begin by tuning
#the "eta", also known as the learning rate.

#The learning rate in XGBoost is a parameter that can range between 0
#and 1. Lower values of "eta" shrink each new tree's contribution more
#strongly, which regularizes the model but requires more boosting rounds.

# Create your housing DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
#params = {"objective":"reg:linear", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
#eta_vals = [0.001, 0.01, 0.1]
#best_rmse = []

# Systematically vary the eta
#for curr_val in eta_vals:

#    params["eta"] = curr_val

    # Perform cross-validation: cv_results
#    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
#                        num_boost_round=10, early_stopping_rounds=5,
#                        metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
#    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
#print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))

#################################################
#      eta      best_rmse
#    0  0.001  195736.406250
#    1  0.010  179932.187500
#    2  0.100   79759.411458
#################################################

In [None]:
#Tuning max_depth

#In this exercise, your job is to tune max_depth, which is the parameter
#that dictates the maximum depth that each tree in a boosting round can
#grow to. Smaller values will lead to shallower trees, and larger values
#to deeper trees.

# Create your housing DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
#params = {"objective":"reg:linear"}

# Create list of max_depth values
#max_depths = [2, 5, 10, 20]
#best_rmse = []

# Systematically vary the max_depth
#for curr_val in max_depths:

#    params["max_depth"] = curr_val

    # Perform cross-validation
#    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
#                 num_boost_round=10, early_stopping_rounds=5,
#                 metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
#    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
#print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))

#################################################
#     max_depth     best_rmse
#            2     37957.468750
#            5     35596.599610
#           10     36065.548829
#           20     36739.578125
#################################################

In [None]:
#Tuning colsample_bytree

#Now, it's time to tune "colsample_bytree". You've already seen something
#similar if you've ever worked with scikit-learn's RandomForestClassifier
#or RandomForestRegressor, where the related parameter is called
#max_features. Both limit the features a tree may split on, but sklearn
#draws a new subset at every split, while xgboost's colsample_bytree draws
#one subset per tree. In xgboost, colsample_bytree must be specified as a
#float between 0 and 1.

# Create your housing DMatrix
#housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
#params={"objective":"reg:squarederror","max_depth":3}

# Create list of hyperparameter values
#colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
#best_rmse = []

# Systematically vary the hyperparameter value
#for curr_val in colsample_bytree_vals:

#    params["colsample_bytree"] = curr_val

    # Perform cross-validation
#    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
#                 num_boost_round=10, early_stopping_rounds=5,
#                 metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
#    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
#print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))

#################################################
#       colsample_bytree     best_rmse
#    0               0.1  48193.451172
#    1               0.5  36013.544922
#    2               0.8  35932.962891
#    3               1.0  35836.044922
#################################################

**Review of grid search and random search**
___
- **Grid search**
    - search exhaustively over a given grid of hyperparameter values, training and evaluating one model per combination
- **Random search**
    - create a range of hyperparameter values per hyperparameter that you would like to search over
    - set the number of iterations for the search
    - during each iteration, draw a random value for each hyperparameter searched over and train a model with those hyperparameters
    - after the maximum number of iterations has completed, select the hyperparameter values with the best evaluated score
___
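
The two strategies above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements them: `evaluate` is a hypothetical stand-in for training and cross-validating a model, and the grid values are made up.

```python
import itertools
import random

# Hypothetical search space (values chosen only for illustration)
param_grid = {
    "max_depth": [2, 5, 10],
    "colsample_bytree": [0.3, 0.7, 1.0],
}

def evaluate(params):
    """Hypothetical stand-in for a cross-validated error (lower is better)."""
    return abs(params["max_depth"] - 5) + abs(params["colsample_bytree"] - 0.7)

# Grid search: evaluate every combination exactly once
names = list(param_grid)
grid_candidates = [dict(zip(names, values))
                   for values in itertools.product(*param_grid.values())]
grid_best = min(grid_candidates, key=evaluate)

# Random search: draw a fixed number of random combinations
rng = random.Random(123)
n_iter = 4
random_candidates = [{name: rng.choice(values) for name, values in param_grid.items()}
                     for _ in range(n_iter)]
random_best = min(random_candidates, key=evaluate)

print(len(grid_candidates), grid_best)
print(len(random_candidates), random_best)
```

Grid search always visits all 9 combinations here, while random search only looks at the 4 it happens to draw, so its best candidate can be no better than the grid's.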

In [None]:
#Grid search with XGBoost

#Now that you've learned how to tune parameters individually with XGBoost,
#let's take your parameter tuning to the next level by using scikit-learn's
#grid search and randomized search capabilities with internal cross-validation,
#provided by the GridSearchCV and RandomizedSearchCV classes. You will use these
#to find the best model from a collection of possible parameter values across
#multiple parameters simultaneously. Let's get to work, starting with
#GridSearchCV!

# Create the parameter grid: gbm_param_grid
#gbm_param_grid = {
#    'colsample_bytree': [0.3, 0.7],
#    'n_estimators': [50],
#    'max_depth': [2, 5]
#}

# Instantiate the regressor: gbm
#gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
#grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid,
#                        scoring='neg_mean_squared_error', cv=4, verbose=1)
#grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
#print("Best parameters found: ", grid_mse.best_params_)
#print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

#################################################
#Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
#Lowest RMSE found:  29916.562522854438
#################################################

In [None]:
#Random search with XGBoost

#Often, GridSearchCV can be really time consuming, so in practice, you
#may want to use RandomizedSearchCV instead, as you will do in this
#exercise. The good news is you only have to make a few modifications
#to your GridSearchCV code to do RandomizedSearchCV. The key difference
#is you have to specify a param_distributions parameter instead of a
#param_grid parameter.

# Create the parameter grid: gbm_param_grid
#gbm_param_grid = {
#    'n_estimators': [25],
#    'max_depth': range(2, 12)
#}

# Instantiate the regressor: gbm
#gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
#randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid,
#                                    n_iter=5, scoring='neg_mean_squared_error', cv=4, verbose=1)
#randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
#print("Best parameters found: ",randomized_mse.best_params_)
#print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

#################################################
#Best parameters found:  {'n_estimators': 25, 'max_depth': 6}
#Lowest RMSE found:  36909.98213965752
#################################################

**Limits of grid search and random search**
___
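- Grid search: the number of models to train is the product of the number of values tried per hyperparameter, so run time grows quickly as values or hyperparameters are added
- Random search: the search space grows in the same way, so a fixed number of iterations covers a shrinking fraction of it as more hyperparameters are added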
___
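
A quick back-of-the-envelope count makes the cost concrete. The grid sizes, fold count and iteration budget below are hypothetical numbers chosen for illustration.

```python
import math

# Hypothetical grid: number of candidate values per hyperparameter
values_per_param = {
    "max_depth": 4,
    "colsample_bytree": 4,
    "learning_rate": 5,
    "n_estimators": 3,
}
cv_folds = 4

# Grid search fits one model per combination and per fold
n_combinations = math.prod(values_per_param.values())  # 4 * 4 * 5 * 3
grid_fits = n_combinations * cv_folds

# Random search fits a fixed budget of candidates, whatever the grid size
n_iter = 20
random_fits = n_iter * cv_folds

# Upper bound on the share of the grid that random search can visit
coverage = n_iter / n_combinations

print(n_combinations, grid_fits, random_fits, round(coverage, 3))
```

Adding a fifth hyperparameter with 5 values would multiply the grid search cost by 5, while the random search cost would stay the same and its coverage would drop by the same factor.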


**Review of pipelines using sklearn**
___
- takes a list of named 2-tuples (name, pipeline_step) as input
- tuples can contain any arbitrary scikit-learn compatible estimator or transformer object
- pipeline implements fit/predict methods
- can be used as input estimator into grid/randomized search and cross_val_score methods
___
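
The points above can be sketched on synthetic data. To keep the example dependent on scikit-learn only, `GradientBoostingRegressor` stands in for `xgb.XGBRegressor` (an assumption of this sketch); the data and step names are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (made up for illustration)
rng = np.random.RandomState(123)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X[::25, 2] = np.nan  # a few missing values for the imputer to fill

# Each step is a (name, transformer_or_estimator) tuple
steps = [
    ("imputer", SimpleImputer(strategy="median")),
    ("model", GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=123)),
]
pipeline = Pipeline(steps)

# The pipeline behaves like a single estimator
pipeline.fit(X, y)
predictions = pipeline.predict(X[:5])

# It can also be passed directly to cross_val_score
scores = cross_val_score(pipeline, X, y, cv=4, scoring="neg_mean_squared_error")
rmse = np.sqrt(np.mean(np.abs(scores)))
print(predictions.shape, len(scores), rmse)
```

Because the imputer lives inside the pipeline, it is refit on the training part of every fold, which avoids leaking information from the held-out fold.
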
**Preprocessing I**
- *LabelEncoder* - converts a categorical column of strings into integers
- *OneHotEncoder* - takes the column of integers and encodes them as dummy variables

**Preprocessing II**
- *DictVectorizer* - converts lists of feature mappings (dictionaries) into vectors
___

In [None]:
#Encoding categorical columns I: LabelEncoder

#Now that you've seen what will need to be done to get the housing data
#ready for XGBoost, let's go through the process step-by-step.

#First, you will need to fill in missing values - as you saw previously,
#the column LotFrontage has many missing values. Then, you will need to
#encode any categorical columns in the dataset using one-hot encoding so
#that they are encoded numerically.

#The data has five categorical columns: MSZoning, PavedDrive, Neighborhood,
#BldgType, and HouseStyle. Scikit-learn has a LabelEncoder class that
#converts the values in a categorical column into integers. You'll practice
#using this here.

# Import LabelEncoder
#from sklearn.preprocessing import LabelEncoder

# Fill missing values with 0
#df.LotFrontage = df.LotFrontage.fillna(0)

# Create a boolean mask for categorical columns
#categorical_mask = (df.dtypes == object)

# Get list of categorical column names
#categorical_columns = df.columns[categorical_mask].tolist()

# Print the head of the categorical columns
#print(df[categorical_columns].head())

# Create LabelEncoder object: le
#le = LabelEncoder()

# Apply LabelEncoder to categorical columns
#df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))

# Print the head of the LabelEncoded categorical columns
#print(df[categorical_columns].head())

#################################################
#<script.py> output:
#      MSZoning PavedDrive Neighborhood BldgType HouseStyle
#    0       RL          Y      CollgCr     1Fam     2Story
#    1       RL          Y      Veenker     1Fam     1Story
#    2       RL          Y      CollgCr     1Fam     2Story
#    3       RL          Y      Crawfor     1Fam     2Story
#    4       RL          Y      NoRidge     1Fam     2Story
#       MSZoning  PavedDrive  Neighborhood  BldgType  HouseStyle
#    0         3           2             5         0           5
#    1         3           2            24         0           2
#    2         3           2             5         0           5
#    3         3           2             6         0           5
#    4         3           2            15         0           5
#################################################

In [None]:
#Encoding categorical columns II: OneHotEncoder

#Okay - so you have your categorical columns encoded numerically. Can you
#now move on to using pipelines and XGBoost? Not yet! In the categorical
#columns of this dataset, there is no natural ordering between the entries.
#As an example: Using LabelEncoder, the CollgCr Neighborhood was encoded
#as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6.
#Is Veenker "greater" than Crawfor and CollgCr? No - and allowing the
#model to assume this natural ordering may result in poor performance.

#As a result, there is another step needed: You have to apply a one-hot
#encoding to create binary, or "dummy" variables. You can do this using
#scikit-learn's OneHotEncoder.

# Import OneHotEncoder
#from sklearn.preprocessing import OneHotEncoder

# Create OneHotEncoder: ohe
#ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)

# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
#df_encoded = ohe.fit_transform(df)

# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
#print(df_encoded[:5, :])

# Print the shape of the original DataFrame
#print(df.shape)

# Print the shape of the transformed array
#print(df_encoded.shape)

#################################################
#    (1460, 21)
#    (1460, 62)
#################################################
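
Recent scikit-learn releases no longer accept the `categorical_features` argument used above; column selection is handled by `ColumnTransformer` instead, and `OneHotEncoder` can encode string columns directly, so the LabelEncoder step is not needed. A minimal sketch of that approach, on a tiny made-up frame rather than the housing data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with one numeric and two categorical columns
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
    "PavedDrive": ["Y", "Y", "N", "Y"],
})

# Treat every non-numeric column as categorical
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

encoder = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore"), categorical_columns)],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# One-hot columns come first, passed-through columns last
df_encoded = encoder.fit_transform(df)

print(df.shape)
print(df_encoded.shape)
```

Here 3 neighborhoods and 2 paved-drive values become 5 dummy columns, plus the untouched `LotArea` column.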

In [None]:
#Encoding categorical columns III: DictVectorizer

#Alright, one final trick before you dive into pipelines. The two-step
#process you just went through - LabelEncoder followed by OneHotEncoder
#- can be simplified by using a DictVectorizer.

#Using a DictVectorizer on a DataFrame that has been converted to a
#dictionary allows you to get label encoding as well as one-hot
#encoding in one go.

#Your task is to work through this strategy in this exercise!

# Import DictVectorizer
#from sklearn.feature_extraction import DictVectorizer

# Convert df into a dictionary: df_dict
#df_dict = df.to_dict("records")

# Create the DictVectorizer object: dv
#dv = DictVectorizer(sparse=False)

# Apply dv on df: df_encoded
#df_encoded = dv.fit_transform(df_dict)

# Print the resulting first five rows
#print(df_encoded[:5,:])

# Print the vocabulary
#print(dv.vocabulary_)

#################################################
#{'MSSubClass': 23, 'LotFrontage': 22, 'LotArea': 21, 'OverallQual': 55,
#'OverallCond': 54, 'YearBuilt': 61, 'Remodeled': 59, 'GrLivArea': 11,
#'BsmtFullBath': 6, 'BsmtHalfBath': 7, 'FullBath': 9, 'HalfBath': 12,
#'BedroomAbvGr': 0, 'Fireplaces': 8, 'GarageArea': 10, 'MSZoning=RL': 27,
#'PavedDrive=Y': 58, 'Neighborhood=CollgCr': 34, 'BldgType=1Fam': 1,
#'HouseStyle=2Story': 18, 'SalePrice': 60, 'Neighborhood=Veenker': 53,
#'HouseStyle=1Story': 15, 'Neighborhood=Crawfor': 35, 'Neighborhood=NoRidge': 44,
#'Neighborhood=Mitchel': 40, 'HouseStyle=1.5Fin': 13, 'Neighborhood=Somerst': 50,
#'Neighborhood=NWAmes': 43, 'MSZoning=RM': 28, 'Neighborhood=OldTown': 46,
#'Neighborhood=BrkSide': 32, 'BldgType=2fmCon': 2, 'HouseStyle=1.5Unf': 14,
#'Neighborhood=Sawyer': 48, 'Neighborhood=NridgHt': 45, 'Neighborhood=NAmes': 41,
#'BldgType=Duplex': 3, 'Neighborhood=SawyerW': 49, 'PavedDrive=N': 56,
#'Neighborhood=IDOTRR': 38, 'Neighborhood=MeadowV': 39, 'BldgType=TwnhsE': 5,
#'MSZoning=C (all)': 24, 'Neighborhood=Edwards': 36, 'PavedDrive=P': 57,
#'Neighborhood=Timber': 52, 'HouseStyle=SFoyer': 19, 'MSZoning=FV': 25,
#'Neighborhood=Gilbert': 37, 'HouseStyle=SLvl': 20, 'BldgType=Twnhs': 4,
#'Neighborhood=StoneBr': 51, 'HouseStyle=2.5Unf': 17, 'Neighborhood=ClearCr': 33,
#'Neighborhood=NPkVill': 42, 'HouseStyle=2.5Fin': 16, 'Neighborhood=Blmngtn': 29,
#'Neighborhood=BrDale': 31, 'Neighborhood=SWISU': 47, 'MSZoning=RH': 26,
#'Neighborhood=Blueste': 30}
#################################################

In [None]:
#Preprocessing within a pipeline

#Now that you've seen what steps need to be taken individually to
#properly process the Ames housing data, let's use the much cleaner
#and more succinct DictVectorizer approach and put it alongside an
#XGBRegressor inside of a scikit-learn pipeline.

# Import necessary modules
#from sklearn.feature_extraction import DictVectorizer
#from sklearn.pipeline import Pipeline

# Fill LotFrontage missing values with 0
#X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
#steps = [("ohe_onestep", DictVectorizer(sparse=False)),
#         ("xgb_model", xgb.XGBRegressor())]

# Create the pipeline: xgb_pipeline
#xgb_pipeline = Pipeline(steps)

# Fit the pipeline
#xgb_pipeline.fit(X.to_dict("records"), y)

**Incorporating XGBoost into pipelines**
___
- pipeline behavior is the same as for other scikit-learn algorithms
- advanced data wrangling using additional components is included in examples below:
    - sklearn_pandas (This library is not really maintained anymore, so let's do preprocessing differently)
        - DataFrameMapper - interoperability between pandas and scikit-learn
        - CategoricalImputer - allow for imputation of categorical variables before conversion to integers
    - sklearn.preprocessing
        - Imputer - native imputation of numerical columns in scikit-learn (superseded by SimpleImputer in sklearn.impute in newer releases)
    - sklearn.pipeline
        - FeatureUnion - combine multiple pipelines of features into a single pipeline of features
___
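
One way to do the preprocessing without sklearn_pandas is scikit-learn's own `ColumnTransformer` with `SimpleImputer`. The sketch below is an assumption about what "differently" could look like, not the course's method; the frame is made up and only borrows a few column names from the kidney data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up frame with missing numeric and categorical values
X = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "bp": [80.0, 50.0, np.nan, 70.0],
    "rbc": pd.Series(["normal", np.nan, "normal", "abnormal"], dtype=object),
})

numeric_columns = ["age", "bp"]
categorical_columns = ["rbc"]

# Median for numeric columns, most frequent value for categorical columns
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_columns),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_columns),
])

# Output columns follow the order of the transformers above
X_imputed = pd.DataFrame(imputer.fit_transform(X),
                         columns=numeric_columns + categorical_columns)
print(X_imputed)
```

The `ColumnTransformer` plays the role of both `DataFrameMapper` and `FeatureUnion` here: it routes columns to transformers and concatenates the results.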

In [None]:
#Cross-validating your XGBoost model

#In this exercise, you'll go one step further by using the pipeline
#you've created to preprocess and cross-validate your model.

# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Fill LotFrontage missing values with 0
#X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
#steps = [("ohe_onestep", DictVectorizer(sparse=False)),
#         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]

# Create the pipeline: xgb_pipeline
#xgb_pipeline = Pipeline(steps)

# Cross-validate the model
#cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict("records"), y, cv=10, scoring="neg_mean_squared_error")

# Print the 10-fold RMSE
#print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))

#################################################
#10-fold RMSE:  29867.603720688923
#################################################

In [None]:
#Kidney disease case study I: Categorical Imputer

#You'll now continue your exploration of using pipelines with a
#dataset that requires significantly more wrangling. The chronic
#kidney disease dataset contains both categorical and numeric
#features, but contains lots of missing values. The goal here is to
#predict who has chronic kidney disease given various blood
#indicators as features.

#As Sergey mentioned in the video, you'll be introduced to a new
#library, sklearn_pandas, that allows you to chain many more
#processing steps inside of a pipeline than are currently supported
#in scikit-learn. Specifically, you'll be able to impute missing
#categorical values directly using the CategoricalImputer() class
#in sklearn_pandas, and the DataFrameMapper() class to apply any
#arbitrary sklearn-compatible transformer on DataFrame columns,
#where the resulting output can be either a NumPy array or DataFrame.

#We've also created a transformer called a Dictifier that encapsulates
#converting a DataFrame using .to_dict("records") without you having
#to do it explicitly (and so that it works in a pipeline). Finally,
#we've also provided the list of feature names in kidney_feature_names,
#the target name in kidney_target_name, the features in X, and the
#target in y.

#In this exercise, your task is to apply the CategoricalImputer to
#impute all of the categorical columns in the dataset. You can refer
#to how the numeric imputation mapper was created as a template.
#Notice the keyword arguments input_df=True and df_out=True? This is
#so that you can work with DataFrames instead of arrays. By default,
#the transformers are passed a numpy array of the selected columns as
#input, and as a result, the output of the DataFrame mapper is also an
#array. Scikit-learn transformers have historically been designed to
#work with numpy arrays, not pandas DataFrames, even though their basic
#indexing interfaces are similar.

# Import necessary modules
#from sklearn_pandas import DataFrameMapper
#from sklearn_pandas import CategoricalImputer

# Check number of nulls in each feature column
#nulls_per_column = X.isnull().sum()
#print(nulls_per_column)

# Create a boolean mask for categorical columns
#categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
#categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
#non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
#numeric_imputation_mapper = DataFrameMapper(
#                                            [([numeric_feature],Imputer(strategy="median")) for numeric_feature in non_categorical_columns],
#                                            input_df=True,
#                                            df_out=True
#                                           )

# Apply categorical imputer
#categorical_imputation_mapper = DataFrameMapper(
#                                                [(category_feature, CategoricalImputer()) for category_feature in categorical_columns],
#                                                input_df=True,
#                                                df_out=True
#                                               )

#################################################
#<script.py> output:
#    age        9
#    bp        12
#    sg        47
#    al        46
#    su        49
#    bgr       44
#    bu        19
#    sc        17
#    sod       87
#    pot       88
#    hemo      52
#    pcv       71
#    wc       106
#    rc       131
#    rbc      152
#    pc        65
#    pcc        4
#    ba         4
#    htn        2
#    dm         2
#    cad        2
#    appet      1
#    pe         1
#    ane        1
#    dtype: int64
#################################################

In [None]:
#Kidney disease case study II: Feature Union

#Having separately imputed numeric as well as categorical columns,
#your task is now to use scikit-learn's FeatureUnion to concatenate
#their results, which are contained in two separate transformer objects
#- numeric_imputation_mapper, and categorical_imputation_mapper,
#respectively.

#You may have already encountered FeatureUnion in Machine Learning
#with the Experts: School Budgets. Just like with pipelines, you
#have to pass it a list of (string, transformer) tuples, where the
#first element of each tuple is the name of the transformer.

# Import FeatureUnion
#from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
#numeric_categorical_union = FeatureUnion([
#                                          ("num_mapper", numeric_imputation_mapper),
#                                          ("cat_mapper", categorical_imputation_mapper)
#                                         ])
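
As a small self-contained illustration of what FeatureUnion does, the sketch below applies two unrelated transformers to the same toy array and concatenates their outputs column-wise. The transformers and step names are arbitrary choices for the example, not the imputation mappers used above.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy input with two features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Every transformer receives the full input; outputs are stacked side by side
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("logged", FunctionTransformer(np.log)),
])
X_union = union.fit_transform(X)

print(X_union.shape)
```

Because each transformer sees every input column, the mappers in the exercise have to select their own columns (numeric or categorical) before the union joins the results.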

In [None]:
#Kidney disease case study III: Full pipeline

#It's time to piece together all of the transforms along with an
#XGBClassifier to build the full pipeline!

#Besides the numeric_categorical_union that you created in the
#previous exercise, there are two other transforms needed: the
#Dictifier() transform which we created for you, and the
#DictVectorizer().

#After creating the pipeline, your task is to cross-validate it to
#see how well it performs.

# Create full pipeline
#pipeline = Pipeline([
#                     ("featureunion", numeric_categorical_union),
#                     ("dictifier", Dictifier()),
#                     ("vectorizer", DictVectorizer(sort=False)),
#                     ("clf", xgb.XGBClassifier(max_depth=3))
#                    ])

# Perform cross-validation
#cross_val_scores = cross_val_score(pipeline, kidney_data, y, scoring="roc_auc", cv=3)

# Print avg. AUC
#print("3-fold AUC: ", np.mean(cross_val_scores))

#################################################
#<script.py> output:
#    3-fold AUC:  0.998637406769937
#################################################
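
The `Dictifier` used above was supplied by the course and its source is not shown in these notes. A hypothetical reconstruction, assuming it only wraps `.to_dict("records")` in the transformer interface, could look like this:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

class Dictifier(BaseEstimator, TransformerMixin):
    """Hypothetical reconstruction: turn tabular input into a list of row dicts."""

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the transformer interface
        return self

    def transform(self, X):
        # Accept a DataFrame or anything pandas can wrap as one
        frame = X if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        return frame.to_dict("records")

# Made-up two-row frame
df = pd.DataFrame({"age": [48.0, 62.0], "rbc": ["normal", "abnormal"]})
records = Dictifier().fit_transform(df)

# Chained with DictVectorizer, as in the full pipeline above
to_matrix = Pipeline([
    ("dictifier", Dictifier()),
    ("vectorizer", DictVectorizer(sparse=False)),
])
matrix = to_matrix.fit_transform(df)

print(records)
print(matrix.shape)
```

Wrapping the conversion in a transformer is what lets it sit between the feature union and the `DictVectorizer` inside a single pipeline.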

**Tuning XGBoost hyperparameters**
___

In [None]:
#Bringing it all together

#Alright, it's time to bring together everything you've learned so
#far! In this final exercise of the course, you will combine your
#work from the previous exercises into one end-to-end XGBoost
#pipeline to really cement your understanding of preprocessing and
#pipelines in XGBoost.

#Your work from the previous 3 exercises, where you preprocessed the
#data and set up your pipeline, has been pre-loaded. Your job is to
#perform a randomized search and identify the best hyperparameters.

# Create the parameter grid
#gbm_param_grid = {
#    'clf__learning_rate': np.arange(.05, 1, .05),
#    'clf__max_depth': np.arange(3,10, 1),
#    'clf__n_estimators': np.arange(50, 200, 50)
#}

# Perform RandomizedSearchCV
#randomized_roc_auc = RandomizedSearchCV(estimator=pipeline,
#                                        param_distributions=gbm_param_grid,
#                                        n_iter=2, scoring='roc_auc', cv=2, verbose=1)

# Fit the estimator
#randomized_roc_auc.fit(X, y)

# Compute metrics
#print(randomized_roc_auc.best_score_)
#print(randomized_roc_auc.best_estimator_)

#################################################
#<script.py> output:
#    Fitting 2 folds for each of 2 candidates, totalling 4 fits
#    0.9965333333333334
#    Pipeline(memory=None,
#             steps=[('featureunion',
#                     FeatureUnion(n_jobs=None,
#                                  transformer_list=[('num_mapper',
#                                                     DataFrameMapper(default=False,
#                                                                     df_out=True,
#                                                                     features=[(['age'],
#                                                                                Imputer(axis=0,
#                                                                                        copy=True,
#                                                                                        missing_values='NaN',
#                                                                                        strategy='median',
#                                                                                        verbose=0)),
#                                                                               (['bp'],
#                                                                                Imputer(axis=0,
#                                                                                        copy=True,
#                                                                                        missing_values='NaN',
#                                                                                        strategy='median',
#                                                                                        verbose=0)),
#                                                                               (['sg'],
#                                                                                Imputer(axis=0,
#                                                                                        copy=...
#                     XGBClassifier(base_score=0.5, booster='gbtree',
#                                   colsample_bylevel=1, colsample_bynode=1,
#                                   colsample_bytree=1, gamma=0,
#                                   learning_rate=0.9500000000000001,
#                                   max_delta_step=0, max_depth=4,
#                                   min_child_weight=1, missing=None,
#                                   n_estimators=100, n_jobs=1, nthread=None,
#                                   objective='binary:logistic', random_state=0,
#                                   reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
#                                   seed=None, silent=None, subsample=1,
#                                   verbosity=1))],
#             verbose=False)
#################################################
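
The `clf__` prefix in the parameter grid follows scikit-learn's `<step name>__<parameter name>` convention for reaching into a pipeline step. A minimal sketch of that convention, with `GradientBoostingClassifier` standing in for `xgb.XGBClassifier` (an assumption made so the example needs only scikit-learn):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("clf", GradientBoostingClassifier()),  # stand-in for xgb.XGBClassifier
])

# "<step name>__<parameter name>" sets a parameter on the named step
pipeline.set_params(clf__max_depth=4, clf__learning_rate=0.1)

# The same names are the valid keys for param_grid / param_distributions
clf_params = sorted(name for name in pipeline.get_params() if name.startswith("clf__"))
print(clf_params[:5])
```

Listing `pipeline.get_params()` is a quick way to check which keys a search object will accept before launching a long run.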

**Final Thoughts**
___
- Using XGBoost for classification and regression tasks
- Tuning XGBoost's most important hyperparameters
- Incorporating XGBoost into sklearn pipelines
- Not covered:
    - using XGBoost for ranking/recommendation problems (Netflix/Amazon)
    - using more sophisticated hyperparameter tuning strategies for tuning XGBoost models (Bayesian Optimization)
    - using XGBoost as part of an ensemble of other models for regression/classification