###  Chapter 1: Classification with XGBoost

This chapter will introduce you to the fundamental idea behind XGBoost—boosted learners. Once you understand how XGBoost works, you'll apply it to solve a common classification problem found in industry: predicting whether a customer will stop being a customer at some point in the future.
<br />

- 1.1 Course Intro:
    - Which of these is a classification problem?
    - Which of these is a binary classification problem?  
    <br />

- 1.2 Introducing XGBoost
    - XGBoost: Fit/Predict
    - Decision trees  
    <br />

- 1.3 What is Boosting?
    - Measuring accuracy
    - Measuring AUC  
    <br />
    
- 1.4 When should I use XGBoost?
    - Using XGBoost

#### 1.2 Introducing XGBoost

- 1.2.1 XGBoost: Fit/Predict
    - Import xgboost as xgb.
    - Create training and test sets such that 20% of the data is used for testing. Use a random_state of 123.
    - Instantiate an XGBClassifier as xg_cl using xgb.XGBClassifier(). Set n_estimators to 10 and objective to 'binary:logistic'. Don't worry about what these mean just yet; you will learn about these parameters later in this course.
    - Fit xg_cl to the training set (X_train, y_train) using the .fit() method.
    - Predict the labels of the test set (X_test) using the .predict() method, then compute and print the accuracy.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# churn_data = pd.read_csv("datasets/churn.csv") # This dataset is not the one that was used in the course. 
churn_data = pd.read_csv("datasets/churn_data.csv")

print(churn_data.info())
churn_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   avg_dist                     50000 non-null  float64
 1   avg_rating_by_driver         49799 non-null  float64
 2   avg_rating_of_driver         41878 non-null  float64
 3   avg_inc_price                50000 non-null  float64
 4   inc_pct                      50000 non-null  float64
 5   weekday_pct                  50000 non-null  float64
 6   fancy_car_user               50000 non-null  bool   
 7   city_Carthag                 50000 non-null  int64  
 8   city_Harko                   50000 non-null  int64  
 9   phone_iPhone                 50000 non-null  int64  
 10  first_month_cat_more_1_trip  50000 non-null  int64  
 11  first_month_cat_no_trips     50000 non-null  int64  
 12  month_5_still_here           50000 non-null  int64  
dtypes: bool(1), floa

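|   | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_inc_price | inc_pct | weekday_pct | fancy_car_user | city_Carthag | city_Harko | phone_iPhone | first_month_cat_more_1_trip | first_month_cat_no_trips | month_5_still_here |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.67 | 5.0 | 4.7 | 1.1 | 15.4 | 46.2 | True | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 8.26 | 5.0 | 5.0 | 1.0 | 0.0 | 50.0 | False | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0.77 | 5.0 | 4.3 | 1.0 | 0.0 | 100.0 | False | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 2.36 | 4.9 | 4.6 | 1.14 | 20.0 | 80.0 | True | 0 | 1 | 1 | 1 | 0 | 1 |
| 4 | 3.13 | 4.9 | 4.4 | 1.19 | 11.8 | 82.4 | False | 0 | 0 | 0 | 1 | 0 | 0 |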


In [4]:
# Import xgboost
import xgboost as xgb 

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', eval_metric ="error", n_estimators=10, seed=123, use_label_encoder=False)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))


accuracy: 0.758200
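
The accuracy line in the cell above counts the predictions that match the true labels and divides by the number of test rows. Here is a minimal sketch of the same arithmetic in plain Python. The helper name `accuracy_manual` and the toy label lists are made up for illustration and are not part of churn_data.

```python
def accuracy_manual(y_true, y_pred):
    """Return the fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count positions where the prediction matches the label
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels (illustrative only): 3 of the 5 predictions are correct
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]

toy_accuracy = accuracy_manual(toy_true, toy_pred)
print("accuracy: %f" % toy_accuracy)
```

An accuracy of 0.7582 on the churn test set therefore means that roughly three out of four test customers were labeled correctly.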


- 1.2.2 Decision trees

Our task in this exercise is to make a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant, or benign).

We've preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You'll specify a parameter called max_depth.

In [5]:
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load the breast cancer dataset
df = pd.read_csv("datasets/breast_cancer_classification_data.csv", index_col=0)
diagnosis_type = {'M': 1, 'B': 0}
df.diagnosis = [diagnosis_type[item] for item in df.diagnosis]

# Drop any column that contains NaN values
for col in df.columns:
    null_num = df[col].isnull().values.any()
    if null_num:
        print(f"col {col} has {null_num} null values")
        print(f"Droping {col}")
        df = df.drop(col, axis=1)

X, y = df.drop("diagnosis", axis=1), df.diagnosis
print("X.shape: {}, y.shape: {}".format(X.shape, y.shape))

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth = 4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)


col Unnamed: 32 has True null values
Droping Unnamed: 32
X.shape: (569, 30), y.shape: (569,)
accuracy: 0.9736842105263158
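
To make max_depth more concrete: a tree with max_depth = 1 (often called a stump) asks a single threshold question about one feature, and deeper trees stack more such questions. The sketch below fits a stump on one numeric feature by trying every midpoint between neighboring values and keeping the one with the fewest misclassifications. It is a simplification under stated assumptions: `fit_stump` is a hypothetical helper, the perimeter values are invented, and scikit-learn's DecisionTreeClassifier chooses its splits with an impurity criterion (Gini by default) rather than a raw error count.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that misclassifies the fewest points.

    The stump predicts 1 when x > threshold and 0 otherwise.
    Returns (best_threshold, number_of_errors).
    """
    distinct = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    best_threshold, best_errors = None, len(ys) + 1
    for threshold in candidates:
        errors = sum(
            1 for x, y in zip(xs, ys) if (1 if x > threshold else 0) != y
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Invented tumor perimeters with labels (1 = malignant, 0 = benign)
perimeters = [70.0, 75.0, 80.0, 110.0, 120.0, 130.0]
labels = [0, 0, 0, 1, 1, 1]

stump_threshold, stump_errors = fit_stump(perimeters, labels)
print("split at perimeter >", stump_threshold, "with", stump_errors, "errors")
```

With max_depth = 4, as in the cell above, the classifier can chain up to four such questions along any path from the root to a leaf.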


#### 1.3 What is Boosting?

- 1.3.1 Measuring accuracy

You'll now practice using XGBoost's learning API through its built-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by using its own optimized data structure for datasets, called a DMatrix.

In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgb.cv() function, you first have to convert your data into a DMatrix explicitly. That's what you will do here before running cross-validation on churn_data.

In [6]:
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
print(type(X))

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

<class 'pandas.core.frame.DataFrame'>



   train-error-mean  train-error-std  test-error-mean  test-error-std
0           0.28232         0.002366          0.28378        0.001932
1           0.26951         0.001855          0.27190        0.001932
2           0.25605         0.003213          0.25798        0.003963
3           0.25090         0.001845          0.25434        0.003827
4           0.24654         0.001981          0.24852        0.000934
0.75148
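
The nfold=3 argument means the rows are split into three folds, and each fold takes one turn as the held-out test set while the other two are used for training. The reported test-error-mean is the average over those three turns, and the final printed value is simply 1 minus the last round's mean error. Here is a rough sketch of the fold bookkeeping in plain Python. The `kfold_indices` helper is hypothetical and, unlike a real cross-validation routine, does no shuffling.

```python
def kfold_indices(n_rows, nfold):
    """Yield (train_idx, test_idx) pairs; each row lands in exactly one test fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [
        n_rows // nfold + (1 if i < n_rows % nfold else 0) for i in range(nfold)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_rows=10, nfold=3))
for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(fold_number, "train:", train_idx, "test:", test_idx)

# Accuracy is the complement of the error rate reported by the "error" metric
last_round_error = 0.24852  # test-error-mean of the final boosting round above
print(1 - last_round_error)
```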


- 1.3.2 Measuring AUC

In [7]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.768893       0.001544       0.767863      0.002820
1        0.790864       0.006758       0.789156      0.006847
2        0.815872       0.003899       0.814476      0.005997
3        0.822959       0.002018       0.821682      0.003912
4        0.827528       0.000769       0.826191      0.001938
0.8261913333333334
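
One way to read an AUC value: it is the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. The function name `auc_from_scores` and the toy scores are illustrative assumptions, and this is not how XGBoost computes the metric internally.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (illustrative only)
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.7, 0.6, 0.4, 0.2]

toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)  # 5 of the 6 positive/negative pairs are ordered correctly
```

Unlike accuracy, this measure depends only on how the scores rank the examples, not on any particular classification threshold.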


Fantastic! An AUC of about 0.83 is quite strong. As you have seen, XGBoost's learning API makes it very easy to compute any metric you may be interested in. In Chapter 3, you'll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it's time to learn a little about exactly when to use XGBoost.

- When to use XGBoost:
    - You have a large number of training samples (more than 1,000 training samples, fewer than 100 features, and fewer features than training samples)
    - You have a mixture of categorical and numeric features (or just numeric features)  
    <br/>
- When to NOT use XGBoost:
    - Image recognition
    - Computer Vision
    - Natural language processing (NLP) and understanding problems
    - Number of training samples is significantly smaller than number of features