# Extreme Gradient Boosting with XGBoost

__What is XGBoost?__

* Optimized gradient-boosting machine learning library
* Originally written in C++
* Has APIs in several languages: Python, R, Scala, Julia, Java

__What makes XGBoost so popular?__

* Speed and performance
* Core algorithm is parallelizable
* Consistently outperforms single-algorithm methods
* State-of-the-art performance in many ML tasks

In [13]:
import xgboost as xgb
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.metrics import mean_squared_error

In [2]:
dir1 = '/disk1/sousae/Classes/udemy_machineLearning_A-Z/Part10_Model_Selection_Boosting/'
class_data = pd.read_csv(dir1+'Churn_Modelling.csv')

X, y = class_data.iloc[:, 3:13].values, class_data.iloc[:, 13].values

# Encoding categorical data
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
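# Note: categorical_features was removed in scikit-learn 0.22; newer versions need ColumnTransformer for this step
onehotencoder = OneHotEncoder(categorical_features = [1])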
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.852000



### Decision trees as base learners

* Base learner - Individual learning algorithm in an ensemble algorithm
* Composed of a series of binary questions
* Predictions happen at the "leaves" of the tree

__Individual decision trees tend to overfit.__

* Low bias
* High variance

#### CART: Classification and Regression Trees

* Each leaf always contains a real-valued score
* Can later be converted into categories
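
To make the two CART points above concrete, here is a minimal hand-written sketch of a tree whose leaves hold real-valued scores, followed by the conversion of a score into a category. The feature names, thresholds, and leaf scores are hypothetical and chosen only for illustration; the sigmoid-then-threshold conversion is one common choice for binary problems, not the only one.

```python
import math

def tree_score(sample):
    """A tiny hypothetical tree: binary questions lead to a real-valued leaf score."""
    if sample["mean_radius"] > 15.0:        # first binary question
        return 1.2                          # leaf score (made up)
    if sample["mean_texture"] > 20.0:       # second binary question
        return 0.3                          # leaf score (made up)
    return -0.9                             # leaf score (made up)

def score_to_probability(score):
    """Squash a real-valued score into (0, 1) with the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-score))

def score_to_category(score, threshold=0.5):
    """Convert the score into a 0/1 category by thresholding the probability."""
    return int(score_to_probability(score) >= threshold)

large_tumor = {"mean_radius": 18.0, "mean_texture": 22.0}
small_tumor = {"mean_radius": 11.0, "mean_texture": 15.0}

for name, sample in [("large", large_tumor), ("small", small_tumor)]:
    score = tree_score(sample)
    print(name, score, round(score_to_probability(score), 3), score_to_category(score))
```

Keeping real-valued scores at the leaves is what lets an ensemble add the outputs of several trees together before the final conversion into a category.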


#### Decision trees
This exercise builds a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that ships with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant or benign).

The samples (measurements) are stored in X and the target value per tumor in y. The steps are to split the complete dataset into training and testing sets and then train a DecisionTreeClassifier with the max_depth parameter set. Many other parameters can be modified within this model; they are listed in the scikit-learn documentation for DecisionTreeClassifier.

```python 
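# Import the necessary modules
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the breast cancer dataset: X holds the measurements, y the outcomes
X, y = load_breast_cancer(return_X_y=True)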

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
```

### What is Boosting?

* Not a specific machine learning algorithm
* Concept that can be applied to a set of machine learning models (_"Meta-algorithm"_)
* Ensemble meta-algorithm used to convert many weak learners into a strong learner

#### Weak learners and strong learners

__Weak learner__:
* ML algorithm that is slightly better than chance
    - Example: Decision tree whose predictions are slightly better than 50%

_Boosting converts a collection of weak learners into a strong learner._

__Strong learner__:
* Any algorithm that can be tuned to achieve good performance

#### How is boosting accomplished?

* Iteratively learning a set of weak models on subsets of the data
* Weighting each weak prediction according to each weak learner's performance
* Combining the weighted predictions to obtain a single weighted prediction that is much better than the individual predictions themselves
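
The steps above can be sketched with an AdaBoost-style loop on a toy one-dimensional dataset. AdaBoost is one specific boosting algorithm, used here only for illustration: it reweights the training samples each round instead of drawing subsets, and it is not the procedure XGBoost itself uses (XGBoost is a gradient boosting library). The data and the helper names (`fit_stump`, `adaboost`, `accuracy`) are assumptions made for this example.

```python
import math

def fit_stump(xs, ys, weights):
    """Weak learner: the single-threshold rule with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best

def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x >= threshold else -polarity

def adaboost(xs, ys, n_rounds):
    """Return a list of (learner_weight, stump) pairs."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        err = min(max(stump[2], 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)      # better learners get more say
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Combine the weighted votes of all weak learners into one prediction."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data: no single threshold separates the two classes
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

print("1 weak learner :", accuracy(adaboost(xs, ys, 1), xs, ys))
print("3 weak learners:", accuracy(adaboost(xs, ys, 3), xs, ys))
```

On this toy data a single stump cannot classify every point, while the weighted combination of three stumps can, which is the weak-to-strong conversion described above.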

#### Cross-validation in XGBoost example
```python
import xgboost as xgb
import pandas as pd

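# Load the churn data; the label month_5_still_here is assumed to be the last column
churn_data = pd.read_csv("classification_data.csv")
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:,:-1], label=churn_data.month_5_still_here)

# 4-fold cross-validation over 10 boosting rounds, tracking classification error
params = {"objective": "binary:logistic", "max_depth": 4}
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4,
                    num_boost_round=10, metrics="error", as_pandas=True)

# Accuracy is 1 - error; report the value from the final boosting round
print("Accuracy: %f" % ((1 - cv_results["test-error-mean"]).iloc[-1]))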
```

### When should I use XGBoost?

* You have a large number of training samples
* Greater than 1000 training samples and fewer than 100 features
* The number of features < number of training samples
* You have a mixture of categorical and numeric features, or just numeric features

#### When to NOT use XGBoost

* Image recognition
* Computer vision
* Natural language processing and understanding problems
* When the number of training samples is significantly smaller than the number of features



## Regression Review

__Regression basics__: Outcome is real-valued

#### Common regression metrics

* Root mean squared error (RMSE)
* Mean absolute error (MAE)
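
As a small illustration of the two metrics above, the sketch below computes both by hand on made-up values (the numbers are not from any dataset in these notes). RMSE squares each error before averaging, so a single large miss raises it more than it raises MAE.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, take the square root."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions; the last prediction is off by 4
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 3.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print("RMSE: %f" % rmse_value)
print("MAE:  %f" % mae_value)
```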

### Objective (loss) functions and base learners
```python
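# Load the Boston housing data: every column but the last is a feature, the last is the target
boston_data = pd.read_csv("bostonhousing.csv")
X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# 'reg:linear' is the squared-error objective; newer XGBoost releases name it 'reg:squarederror'
xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10, seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

# Evaluate with root mean squared error on the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))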
```
#### Objective Functions and Why We Use Them

* Quantifies how far off a prediction is from the actual result
* Measures the difference between estimated and true values for some collection of data
* __Goal__: Find the model that yields the minimum value of the loss function
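
The goal stated above can be sketched as: compute the loss of each candidate model on the same data and keep the one with the smallest value. The example uses log loss for a binary problem; the labels, the two sets of predicted probabilities, and the candidate names are all made up for illustration.

```python
import math

def log_loss(y_true, p_pred):
    """Average negative log-likelihood of the true 0/1 labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 1]

# Predicted probabilities of class 1 from two hypothetical models
candidates = {
    "confident": [0.9, 0.1, 0.8, 0.7],
    "unsure": [0.6, 0.5, 0.5, 0.4],
}

# Evaluate every candidate with the same loss function
losses = {name: log_loss(labels, probs) for name, probs in candidates.items()}

# The preferred model is the one with the minimum loss
best = min(losses, key=losses.get)
for name, value in losses.items():
    print("%s: %f" % (name, value))
print("lowest loss:", best)
```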

#### Loss function names in xgboost:

* reg:linear - use for regression problems (named reg:squarederror in newer XGBoost releases)
* reg:logistic - use for classification problems when you want just the decision, not the probability
* binary:logistic - use when you want the probability rather than just the decision

#### Base Learners and Why We Need Them

* XGBoost involves creating a meta-model that is composed of many individual models that combine to give a final prediction
* Individual models = base learners
* Want base learners that, when combined, create a final prediction that is non-linear
* Each base learner should be good at distinguishing or predicting different parts of the dataset
* Two kinds of base learners: tree and linear
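
A rough sketch of how simple base learners add up to a non-linear prediction is given below. It is plain gradient boosting for squared error with one-split trees (stumps), written from scratch; XGBoost's actual algorithm adds regularization and other refinements that are not shown. The toy data, the learning rate of 0.5, and the helper names are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Base learner: one split on x, each side predicting its mean residual."""
    best = None
    for threshold in sorted(set(xs))[1:]:          # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Fit each new stump to the residuals left by the stumps before it."""
    base = sum(ys) / len(ys)                       # start from the mean target
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def predict(model, x):
    """Final prediction: the base value plus the scaled sum of all base learners."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Non-linear toy target: y = x squared
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]

errors = {n: training_mse(boost(xs, ys, n), xs, ys) for n in (1, 5, 20)}
for n, err in errors.items():
    print("%2d base learners -> training MSE %f" % (n, err))
```

Each stump on its own is a crude step function, but because every new stump targets what the previous ones got wrong, their sum tracks the curved target more closely as learners are added.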

In [10]:
boston_data = pd.read_csv("./data/BostonHousing.csv")
boston_data.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [14]:
X, y = boston_data.iloc[:,:-1],boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)
xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10, seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))

RMSE: 9.749041
