
In [3]:
# Extreme Gradient Boosting
from xgboost import XGBClassifier
# XGBoost's scikit-learn API accepts NumPy arrays as well as pandas DataFrames;
# here we load the CSV straight into a NumPy array instead of a DataFrame
from numpy import loadtxt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = loadtxt('diabetes.csv', delimiter=',')
# column names: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
data

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])
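`loadtxt` parses every line as numbers, so it only works when the CSV has no header row. As a minimal sketch, assuming a copy of the file that does start with the column names, `skiprows=1` skips that line. The two data rows below are made up for illustration and are not taken from `diabetes.csv`.

```python
from io import StringIO

import numpy as np

# Hypothetical in-memory CSV with a header line and two made-up rows,
# laid out like the column names listed in the cell above.
csv_text = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,0,25.0,0.5,40,1\n"
    "0,90,60,20,0,22.5,0.2,25,0\n"
)

# skiprows=1 drops the header; without it loadtxt cannot convert the
# column names to floats and fails to parse the file.
sample = np.loadtxt(StringIO(csv_text), delimiter=",", skiprows=1)
print(sample.shape)
```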

In [0]:
# split the data into input features (x) and target (y)
x = data[:, 0:8]  # columns 0 through 7: the eight feature columns
y = data[:, 8]    # column 8: Outcome
#y[:10]
# split into train/test
xtrain,xtest, ytrain,ytest = train_test_split(x,y, test_size=0.2, random_state=42)


In [13]:
model = XGBClassifier(n_estimators=150, learning_rate=1e-3)
model.fit(xtrain, ytrain)  # fit() returns the fitted model itself, not a training history
print(model)

# the hyperparams to tweak
# 1-> learning_rate -> shrinkage applied to each new tree's contribution
# 2-> n_estimators -> number of boosting rounds, i.e. trees built one after another
#     (not in parallel)
# 3-> n_jobs -> number of threads XGBoost may use; the work of building each tree is
#     parallelized, the trees themselves are still sequential.
#     nthread is the older, deprecated name for the same setting
# 4-> missing -> which value in the data should be treated as a missing value
# 5-> objective -> the loss to optimize, which sets the kind of inference
#     (classification, e.g. 'binary:logistic', or regression)
# 6-> max_depth -> maximum depth of each tree; helps control bias-variance tradeoff.
#     Keeping max_depth small (3 here) keeps each tree a weak learner, which is what
#     boosting expects; it does not turn the model into bagged trees

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.001, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=150, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)


In [12]:
predictions = model.predict(xtest)
accuracy = accuracy_score(ytest, predictions)
accuracy

#predictions = [round(value) for value in predictions]
#predictions


0.7597402597402597

In [20]:
# with lr=1e-2
model = XGBClassifier(learning_rate=1e-2)
model.fit(xtrain, ytrain)  # fit() returns the fitted model itself, not a training history
predictions = model.predict(xtest)
accuracy = accuracy_score(ytest, predictions)
accuracy
# to get better accuracy, tune hyperparameters via GRID search (exhaustive over the values
# you list) or Random Search (samples a fixed number of candidates, so the cost is capped)

# Grid -> practical when the search space is small and discrete (a few values per hyperparameter)
# e.g. ARIMA/SARIMAX orders, k in KNN, SVM kernel, batch_size
# Random -> better for continuous or very large search spaces, where any number in a range
# could be the right answer, e.g. learning rate, number of epochs, layer dimensions


0.7727272727272727
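The grid-versus-random distinction in the comments above can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. This is an illustration under assumptions, not the notebook's own tuning run: the data is synthetic, the parameter ranges are arbitrary, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so the sketch runs without xgboost installed. Since `XGBClassifier` follows the same estimator interface, it should be possible to swap it in.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data with 8 features (assumption: not the diabetes file).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Grid search: evaluates every combination of the listed values (2 x 2 = 4 candidates).
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.01, 0.1]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

# Random search: draws n_iter candidates; learning_rate comes from a
# continuous log-uniform range instead of a fixed list.
rand = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4],
        "learning_rate": loguniform(1e-3, 3e-1),
    },
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
rand.fit(X, y)

print("grid best:", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```

The grid's cost grows multiplicatively with every value added, while the random search always costs `n_iter` candidates, which is why it tends to be the more practical choice for continuous ranges.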

TREES-> Bagging trains individual trees independently, each on a bootstrap sample of the data, and combines their predictions. Boosted trees are instead built ONE after the OTHER, such that each new tree is fit to correct the errors of the ensemble built so far.  

https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

https://shirinsplayground.netlify.com/2018/11/ml_basics_gbm/ 
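To make the bagging-versus-boosting contrast concrete, here is a small sketch using scikit-learn's `BaggingClassifier` and `GradientBoostingClassifier` on synthetic data. The dataset, tree counts, and depths are assumptions chosen only for illustration, and which ensemble scores higher will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes file).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: 50 full-depth trees, each trained independently on its own
# bootstrap resample; their predictions are combined at the end.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(Xtr, ytr)

# Boosting: 50 shallow trees built sequentially, each one fit to the
# errors left by the trees before it.
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
boosting.fit(Xtr, ytr)

print("bagging accuracy:", bagging.score(Xte, yte))
print("boosting accuracy:", boosting.score(Xte, yte))
```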

XGBoost-> Extreme Gradient Boosting

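1) The core is written in C++ with a Python interface on top, so it is VERY fast, and the work of building each tree is parallelized across threads 

2) v/s other packages-> it has often matched or outperformed other gradient boosting implementations, especially on tabular data 

3) choice of hyperparameters is HUGE-> this makes it much more customizable than many other models 

AT the CORE-> it's still gradient boosting: each new tree is fit to the gradient of the loss with respect to the current predictions, so the ensemble takes a descent step on the loss (or, equivalently, an ascent step on the likelihood) 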
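The gradient boosting idea can be sketched from scratch in a few lines of NumPy. This is a simplified illustration, assuming squared-error loss, a single feature, and depth-1 trees (stumps); the helper names `fit_stump` and `predict_stump` are made up for this example. It is not XGBoost's actual algorithm, which also uses second-order gradient information and regularization.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the one-split tree on a 1-D feature that best fits `residual`."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Synthetic regression data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
losses = []
for _ in range(100):  # 100 boosting rounds, strictly one after another
    residual = y - prediction  # negative gradient of 0.5 * (y - F)^2 w.r.t. F
    stump = fit_stump(x, residual)
    prediction += learning_rate * predict_stump(stump, x)  # shrunken descent step
    losses.append(np.mean((y - prediction) ** 2))

print("MSE after round 1:", losses[0])
print("MSE after round 100:", losses[-1])
```

Each round moves the predictions a small step against the gradient of the loss, which is where `learning_rate` and `n_estimators` trade off: a smaller step needs more rounds to reach the same fit.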

Regression Example:

https://www.datacamp.com/community/tutorials/xgboost-in-python

Without a cluster (of machines or GPUs), and on a dataset this small, XGBoost will likely perform about the same as its contemporary algorithms from scikit-learn or Keras 

Keras -> a neural network with just 1 layer is actually classical machine learning! For example, a single sigmoid output unit is logistic regression. 
https://playground.tensorflow.org
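The one-layer claim can be checked without Keras at all. The sketch below, on synthetic data with arbitrary settings, trains a single sigmoid unit by gradient descent on binary cross-entropy, which is exactly logistic regression. In Keras the same model would presumably be a single `Dense(1, activation="sigmoid")` layer; plain NumPy is used here only to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-feature data: label is 1 when x0 + x1 > 0 (linearly separable).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# One "layer": a weight vector and a bias feeding a sigmoid.
w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)           # forward pass
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean binary cross-entropy
    grad_b = np.mean(p - y)
    w -= lr * grad_w                 # gradient descent step
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print("weights:", w, "bias:", b, "train accuracy:", accuracy)
```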

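tf.estimator -> high-level machine learning API from TensorFlow (deprecated in recent TensorFlow releases in favor of Keras); its premade estimators cover the most basic inference algorithms (like linear regression)

But a custom estimator can also wrap a complex network such as a CNN!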
https://towardsdatascience.com/first-contact-with-tensorflow-estimator-69a5e072998d




