 # <div style="text-align: center">Tutorial on Ensemble Learning </div>
 <img src='https://data-science-blog.com/wp-content/uploads/2017/12/ensemble-learning-stacking.png' width=400 height=400 >
### <div style="text-align: center"> Quite Practical and Far from any Theoretical Concepts </div>
<div style="text-align:center">last update: <b>07/02/2019</b></div>


>You are reading **10 Steps to Become a Data Scientist** and are now in the 8th step : 

1. [Learn Python](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-1)
2. [Python Packages](https://www.kaggle.com/mjbahmani/the-data-scientist-s-toolbox-tutorial-2)
3. [Mathematics and Linear Algebra](https://www.kaggle.com/mjbahmani/linear-algebra-for-data-scientists)
4. [Programming & Analysis Tools](https://www.kaggle.com/mjbahmani/20-ml-algorithms-15-plot-for-beginners)
5. [Big Data](https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora)
6. [Data visualization](https://www.kaggle.com/mjbahmani/top-5-data-visualization-libraries-tutorial)
7. [Data Cleaning](https://www.kaggle.com/mjbahmani/machine-learning-workflow-for-house-prices)
8. <font color="red">You are in the 8th step</font>
9. [A Comprehensive ML  Workflow with Python](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)
10. [Deep Learning](https://www.kaggle.com/mjbahmani/top-5-deep-learning-frameworks-tutorial)

---------------------------------------------------------------------
You can fork and run this kernel on <font color="red">GitHub</font>:

> ###### [ GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)
-------------------------------------------------------------------------------------------------------------
 **I hope you find this kernel helpful and some <font color='red'> UPVOTES</font> would be very much appreciated**
 
 -----------

<a id="top"></a> <br>
## Notebook  Content
1. [Introduction](#1)
1. [Import packages](#2)
    1. [Version](#21)
    1. [Setup](#22)
    1. [Data Collection](#23)
1. [What's Ensemble Learning?](#3)
    1. [Why Ensemble Learning?](#31)
1. [Ensemble Techniques](#4)
    1. [What is the difference between bagging and boosting?](#41)
1. [Model Deployment](#5)
    1. [Families of ML algorithms](#51)
    1. [XGBoost?](#52)
    1. [Installing XGBoost](#521)
    1. [Prepare Features & Targets](#53)
    1. [RandomForest](#54)
    1. [Bagging classifier](#55)
    1. [AdaBoost classifier](#56)
    1. [Gradient Boosting Classifier](#57)
    1. [Linear Discriminant Analysis](#58)
    1. [Quadratic Discriminant Analysis](#59)
    1. [XGBoost](#510)
    1. Extremely Randomized Trees
1. [Conclusion](#6)
1. [References & Credits](#10)

<a id="1"></a> <br>
#  1- Introduction
In this kernel, I want to start exploring **ensemble modeling**. I will run several algorithms on a few datasets. I hope you enjoy it and give me feedback.

<a id="2"></a> <br>
## 2- Import packages

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from pandas import get_dummies
import plotly.graph_objs as go
from sklearn import datasets
# plotly.plotly moved to the separate chart_studio package in plotly 4; it is not used in this kernel
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import numpy
import json
import sys
import csv
import os

<a id="21"></a> <br>
### 2-1 Version

In [4]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

matplotlib: 2.2.3
sklearn: 0.20.2
scipy: 1.1.0
seaborn: 0.9.0
pandas: 0.23.4
numpy: 1.16.0
Python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0]


<a id="22"></a> <br>
### 2-2 Setup

A few tiny adjustments for better **code readability**

In [5]:
warnings.filterwarnings('ignore')
sns.set(color_codes=True)
plt.style.available
%matplotlib inline
%precision 2

'%.2f'

<a id="23"></a> <br>
### 2-3 Data Collection

In [6]:
# import Dataset to play with it
dataset = pd.read_csv('../input/iris-dataset/Iris.csv')

**<< Note 1 >>**

* Each row is an observation (also known as : sample, example, instance, record)
* Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)

<a id="3"></a> <br>
## 3- What's Ensemble Learning?
Let us review some definitions of ensemble learning:

1. **Ensemble learning** is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem[9]
1. **Ensemble Learning** is a powerful way to improve the performance of your model. It usually pays off to apply ensemble learning over and above various models you might be building. Time and again, people have used ensemble models in competitions like Kaggle and benefited from it.[6]
1. **Ensemble methods** are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would.[10]
<img src='https://hub.packtpub.com/wp-content/uploads/2018/02/ensemble_machine_learning_image_1-600x407.png'  width=400 height=400>
[img-ref](https://hub.packtpub.com/wp-content/uploads/2018/02/ensemble_machine_learning_image_1-600x407.png)
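
To make the "more accurate than a single model" claim concrete, here is a small worked calculation. It is only a sketch under a strong assumption that real models rarely satisfy: a binary task where every model is correct with the same probability and the models err independently. The helper name `majority_vote_accuracy` is made up for this illustration. With three such models at 70% accuracy, the majority is right with probability 0.7³ + 3 · 0.7² · 0.3 = 0.784.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes a binary task and models that err independently of each other.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 3, 11):
    print(n, "models ->", round(majority_vote_accuracy(n, 0.7), 3))
```

In practice the errors of different models are correlated, so the real gain is smaller than this idealized number suggests.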

> <font color="red"><b>Note</b></font>
Ensemble Learning is a Machine Learning concept in which the idea is to train multiple models using the same learning algorithm. The ensembles take part in a bigger group of methods, called multiclassifiers, where a set of hundreds or thousands of learners with a common objective are fused together to solve the problem.[11]

> <font color="red"><b>Note</b></font>
This Kernel assumes a basic understanding of Machine Learning algorithms. I would recommend going through this [**kernel**](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)  to familiarize yourself with these concepts.


<a id="31"></a> <br>
## 3-1 Why Ensemble Learning?
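Combining models pays off when the individual models differ from one another. They can differ for several reasons:
1. Difference in population
1. Difference in hypothesis
1. Difference in modeling technique
1. Difference in initial seed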
<br>
[go to top](#top)

<a id="4"></a> <br>
# 4- Ensemble Techniques
The goal of any machine learning problem is to find a single model that will best predict our wanted outcome. Rather than making one model and hoping this model is the best/most accurate predictor we can make, ensemble methods take a myriad of models into account, and average those models to produce one final model.[12]
<img src='https://uploads.toptal.io/blog/image/92062/toptal-blog-image-1454584029018-cffb1b601292e8d328556e355ed4f7e0.jpg' width=300 height=300>
[img-ref](https://www.toptal.com/machine-learning/ensemble-methods-machine-learning)
The main families of ensemble techniques are:

1. Bagging-based ensemble learning
1. Boosting-based ensemble learning
1. Voting-based ensemble learning
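
Voting is the simplest of the three families to try out. The sketch below is one possible setup, not a recommended configuration: it assumes scikit-learn's bundled copy of Iris, and the three base models and their settings are arbitrary choices made for illustration. `VotingClassifier` with `voting="hard"` predicts the class that most base models vote for.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's copy of Iris stands in for the Kaggle CSV
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three different model types, combined by majority vote
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
voting_accuracy = voter.score(X_test, y_test)
print("voting accuracy:", voting_accuracy)
```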

<a id="41"></a> <br>
## 4-1 What is the difference between bagging and boosting?
**Bagging**: a method that decreases the variance of a model by generating additional training sets from your original data set, using combinations with repetitions (sampling with replacement) to produce multisets of the same size as your original data. A separate model is trained on each multiset, and their predictions are averaged or put to a vote.

**Boosting**: a method that trains models sequentially, where each new model gives more weight to the observations the previous models predicted poorly. The final prediction combines all models, usually through a weighted average or weighted vote.
<img src='https://www.globalsoftwaresupport.com/wp-content/uploads/2018/02/ds33ggg.png'>
[img-ref](https://www.globalsoftwaresupport.com/boosting-adaboost-in-machine-learning/)
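
The "multisets of the same size" used by bagging can be sketched with plain NumPy. This is only an illustration of the sampling step: the ten row indices and the choice of three models are placeholders, and no model is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for the row indices of a training set

# Bagging: each model gets a bootstrap sample, drawn with replacement
# and of the same size as the original data.
bootstrap_samples = [
    rng.choice(data, size=data.size, replace=True) for _ in range(3)
]

for i, sample in enumerate(bootstrap_samples):
    left_out = np.setdiff1d(data, sample)  # rows this model never sees
    print(f"model {i}: rows {np.sort(sample)}, left out {left_out}")
```

Because rows are drawn with replacement, some rows appear several times in a sample and others not at all, which is what makes the individual models differ.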
<br>
[go to top](#top)

<a id="5"></a> <br>
## 5- Model Deployment
In this section, several learning algorithms are applied to the Iris data set. Working through them plays an important role in building your experience and improving your knowledge of ML techniques.

> **<< Note 3 >>** : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.
<br>
[go to top](#top)

<a id="51"></a> <br>
## 5-1 Families of ML algorithms
There are several categories for machine learning algorithms, below are some of these categories:
* Linear
    * Linear Regression
    * Logistic Regression
    * Support Vector Machines
* Tree-Based
    * Decision Tree
    * Random Forest
    * GBDT
* KNN
* Neural Networks

-----------------------------
If we instead categorize ML algorithms by the type of learning task, we get the following groups:
* Classification

    * k-Nearest Neighbors
    * Logistic Regression
    * SVM
    * DT
    * NN

* Clustering

    * K-means
    * HCA
    * Expectation Maximization

* Visualization and dimensionality reduction

    * Principal Component Analysis (PCA)
    * Kernel PCA
    * Locally-Linear Embedding (LLE)
    * t-distributed Stochastic Neighbor Embedding (t-SNE)

* Association rule learning

    * Apriori
    * Eclat
* Semisupervised learning
* Reinforcement Learning
    * Q-learning
* Batch learning & Online learning
* Ensemble Learning

**<< Note >>**
> There is no method that outperforms all others on all tasks.
<br>
[go to top](#top)

<a id="52"></a> <br>
## 5-2 XGBoost?
* **XGBoost** is an algorithm that has been dominating applied machine learning and Kaggle competitions for structured or tabular data.
* **XGBoost** is an implementation of gradient boosted decision trees designed for speed and performance.
* **XGBoost** is short for the e**X**treme **G**radient **Boost**ing package.

Its main strengths are:

* Speed and performance: the core is written in C++, and it is comparatively fast next to other ensemble classifiers.

* Parallelizable core algorithm: because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It can also run on GPUs and across networks of computers, which makes training on very large datasets feasible.

* Strong benchmark results: it has shown better performance than many other algorithms on a variety of machine learning benchmark datasets.

* Wide variety of tuning parameters: XGBoost has parameters for cross-validation, regularization, user-defined objective functions, missing values, and tree construction, plus a scikit-learn compatible API.
* Popular on Kaggle: many winning Kaggle solutions use XGBoost.
<br>
[go to top](#top)

<a id="521"></a> <br>
## 5-2-1 Installing XGBoost

There is a comprehensive installation guide on the [XGBoost documentation website](http://xgboost.readthedocs.io/en/latest/build.html).

###  XGBoost in R
If you are an R user, the best place to get started is the [CRAN page for the xgboost package](https://cran.r-project.org/web/packages/xgboost/index.html).

###  XGBoost in Python
Installation instructions are available on the Python section of the XGBoost installation guide.

The official Python Package Introduction is the best place to start when working with XGBoost in Python.

To get started quickly, you can type:
<br>
[go to top](#top)

In [7]:
# pip install xgboost   (run in a terminal, or as !pip install xgboost inside a notebook)

<a id="53"></a> <br>
## 5-3 Prepare Features & Targets
First of all, separate the data into independent (feature) and dependent (target) variables.

**<< Note 4 >>**
* X==>>Feature
* y==>>Target

In [8]:

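# Features: every column except the last; target: the last column (the species label)
# Note: if the CSV contains a row-identifier column, drop it first so it is not used as a feature
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)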

After loading the data via **pandas**, it is worth checking what the content looks like, for example with `dataset.head()` and `dataset.describe()`, before training any model.
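
A first look at the data could go as follows. This sketch assumes scikit-learn's bundled copy of Iris in place of the Kaggle CSV, so the column names differ from those in `Iris.csv`; `iris_df` is a name introduced only for this example.

```python
from sklearn.datasets import load_iris

# Assumption: the scikit-learn copy of Iris stands in for the Kaggle CSV
iris_df = load_iris(as_frame=True).frame

print(iris_df.head())                    # first five rows
print(iris_df.describe())                # summary statistics per numeric column
print(iris_df["target"].value_counts())  # class balance
```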
<br>
[go to top](#top)

<a id="54"></a> <br>
## 5-4 RandomForest
A random forest is a meta estimator that **fits a number of decision tree classifiers** on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
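
One practical consequence of bootstrap sampling is that every tree leaves some rows unused, and those rows can serve as a built-in validation set. The sketch below shows this with `oob_score=True`; it assumes scikit-learn's copy of Iris, and the choice of 100 trees is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's copy of Iris stands in for the Kaggle CSV
X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees (arbitrary choice for this sketch)
    bootstrap=True,     # draw each tree's training rows with replacement
    oob_score=True,     # score each row using only trees that did not see it
    random_state=0,
)
forest.fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
print("feature importances:", forest.feature_importances_)
```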

In [9]:
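from sklearn.ensemble import RandomForestClassifier
# Shallow trees (max_depth=2) keep each tree simple; the forest combines many of them
Model=RandomForestClassifier(max_depth=2)
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred,y_test))
# Accuracy score (symmetric in its two arguments)
print('accuracy is ',accuracy_score(y_pred,y_test))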

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

      micro avg       0.97      0.97      0.97        30
      macro avg       0.98      0.94      0.96        30
   weighted avg       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is  0.9666666666666667


<a id="55"></a> <br>
## 5-5 Bagging classifier 
A Bagging classifier is an ensemble **meta-estimator** that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples (without replacement), the algorithm is known as Pasting. If samples are drawn with replacement, the method is known as Bagging. When random subsets of the dataset are drawn as random subsets of the features, the method is known as Random Subspaces. Finally, when base estimators are built on subsets of both samples and features, the method is known as Random Patches.[13]
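
The four variants named above map onto parameters of `BaggingClassifier`. The sketch below is one way to set them up; the fractions (0.7 of the samples, 0.5 of the features), the 25 estimators, and the use of scikit-learn's copy of Iris are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# How each variant draws the training data for a base estimator
variants = {
    "pasting": dict(max_samples=0.7, bootstrap=False),            # rows, no replacement
    "bagging": dict(max_samples=0.7, bootstrap=True),             # rows, with replacement
    "random subspaces": dict(max_features=0.5, bootstrap=False),  # all rows, feature subsets
    "random patches": dict(max_samples=0.7, max_features=0.5),    # row and feature subsets
}

variant_scores = {}
for name, kwargs in variants.items():
    # The base estimator is passed positionally, which works across
    # scikit-learn versions that name this parameter differently.
    model = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=0, **kwargs
    )
    variant_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {variant_scores[name]:.3f}")
```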
<br>
[go to top](#top)

In [10]:
from sklearn.ensemble import BaggingClassifier
Model=BaggingClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
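# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred,y_test))
# Accuracy score (symmetric in its two arguments)
print('accuracy is ',accuracy_score(y_pred,y_test))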

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

      micro avg       0.97      0.97      0.97        30
      macro avg       0.98      0.94      0.96        30
   weighted avg       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is  0.9666666666666667


<a id="56"></a> <br>
##  5-6 AdaBoost classifier

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
This class implements the algorithm known as **AdaBoost-SAMME** .
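
To see what "adjusting the weights of incorrectly classified instances" means, here is a hand-written sketch of a single SAMME round. It is not scikit-learn's internal code: it assumes scikit-learn's copy of Iris, a depth-1 decision tree as the weak learner, and a weak-learner error strictly between 0 and 1 - 1/K (K being the number of classes), which holds for this data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_samples, n_classes = len(y), len(np.unique(y))

# Round 0: every sample starts with the same weight
weights = np.full(n_samples, 1.0 / n_samples)

# Fit a weak learner (a decision stump) on the weighted data
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=weights)

# Weighted error of the weak learner
missed = stump.predict(X) != y
error = weights[missed].sum() / weights.sum()

# SAMME: the learner's say in the final vote grows as its error shrinks
alpha = np.log((1.0 - error) / error) + np.log(n_classes - 1.0)

# Increase the weights of the misclassified samples, then renormalize
weights = weights * np.exp(alpha * missed)
weights /= weights.sum()

print("weighted error:", round(error, 3), "alpha:", round(alpha, 3))
print("weight of a missed sample:", weights[missed][0])
print("weight of a correct sample:", weights[~missed][0])
```

The next weak learner would be fitted with these updated weights, so it concentrates on the samples the stump got wrong.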

In [11]:
from sklearn.ensemble import AdaBoostClassifier
Model=AdaBoostClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
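# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred,y_test))
# Accuracy score (symmetric in its two arguments)
print('accuracy is ',accuracy_score(y_pred,y_test))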

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

      micro avg       0.97      0.97      0.97        30
      macro avg       0.98      0.94      0.96        30
   weighted avg       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is  0.9666666666666667


<a id="57"></a> <br>
## 5-7 Gradient Boosting Classifier
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.
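
The forward stage-wise idea is easiest to see with squared-error regression, where the negative gradient of the loss is simply the residual. The sketch below is a hand-rolled illustration, not the `GradientBoostingClassifier` implementation: the synthetic sine data, the learning rate of 0.1, the depth-2 trees, and the 50 stages are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: a constant model
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a shrunken copy of the new tree to the running model
    prediction += learning_rate * tree.predict(X)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print("training MSE at stage 0:", round(mse_per_stage[0], 4))
print("training MSE at stage 50:", round(mse_per_stage[-1], 4))
```

Swapping in a different differentiable loss only changes how `residuals` is computed, which is why the method handles arbitrary differentiable losses.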

In [12]:
from sklearn.ensemble import GradientBoostingClassifier
Model=GradientBoostingClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
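# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred,y_test))
# Accuracy score (symmetric in its two arguments)
print('accuracy is ',accuracy_score(y_pred,y_test))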

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.83      0.91         6

      micro avg       0.97      0.97      0.97        30
      macro avg       0.98      0.94      0.96        30
   weighted avg       0.97      0.97      0.97        30

[[11  0  0]
 [ 0 13  1]
 [ 0  0  5]]
accuracy is  0.9666666666666667


<a id="58"></a> <br>
## 5-8 Linear Discriminant Analysis
Linear Discriminant Analysis (discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (discriminant_analysis.QuadraticDiscriminantAnalysis) are two classic classifiers, with, as their names suggest, a **linear and a quadratic decision surface**, respectively.

These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no **hyperparameters** to tune.

In [13]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Model=LinearDiscriminantAnalysis()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
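# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred,y_test))
# Accuracy score (symmetric in its two arguments)
print('accuracy is ',accuracy_score(y_pred,y_test))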

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

      micro avg       1.00      1.00      1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is  1.0


<a id="59"></a> <br>
## 5-9 Quadratic Discriminant Analysis
A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.

The model fits a **Gaussian** density to each class.

In [14]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Model=QuadraticDiscriminantAnalysis()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
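# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred,y_test))
# Accuracy score (symmetric in its two arguments)
print('accuracy is ',accuracy_score(y_pred,y_test))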

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

      micro avg       1.00      1.00      1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
accuracy is  1.0


<a id="510"></a> <br>
##  5-10 XGBoost
Finally, let us see how to train a model with XGBoost.

In [15]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [18]:
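from sklearn.datasets import dump_svmlight_file

# Alternative: write the data in svmlight/libsvm text format and load it from disk
dump_svmlight_file(X_train, y_train, 'dtrain.svm', zero_based=True)
dump_svmlight_file(X_test, y_test, 'dtest.svm', zero_based=True)
# Newer XGBoost releases may require the format in the URI, e.g. 'dtrain.svm?format=libsvm'
dtrain_svm = xgb.DMatrix('dtrain.svm')
dtest_svm = xgb.DMatrix('dtest.svm')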

[10:29:02] 120x4 matrix with 480 entries loaded from dtrain.svm
[10:29:02] 30x4 matrix with 120 entries loaded from dtest.svm


In [19]:
param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the learning rate (step size shrinkage) applied at each boosting round
    'verbosity': 0,  # logging mode - quiet (replaces the deprecated 'silent' flag)
    'objective': 'multi:softprob',  # multiclass objective that outputs one probability per class
    'num_class': 3}  # the number of classes that exist in this dataset
num_round = 20  # the number of boosting rounds

In [20]:
bst = xgb.train(param, dtrain, num_round)

In [21]:
bst.dump_model('dump.raw.txt')

In [22]:
preds = bst.predict(dtest)

In [23]:
best_preds = np.asarray([np.argmax(line) for line in preds])
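
With the `multi:softprob` objective, each row of the prediction holds one probability per class, and the predicted label is the column with the highest value. The sketch below shows that step on a small, made-up probability array (not real model output), using the vectorized form `np.argmax(..., axis=1)`.

```python
import numpy as np

# Hypothetical class probabilities for three samples and three classes
example_probs = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.20, 0.70],
    [0.25, 0.60, 0.15],
])

# Index of the largest probability in each row = predicted class
example_labels = np.argmax(example_probs, axis=1)
print(example_labels)  # [0 2 1]
```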

Determine the precision of this prediction:

In [24]:
from sklearn.metrics import precision_score

print (precision_score(y_test, best_preds, average='macro'))

1.0


## 5-11 Extremely Randomized Trees
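In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule.[13] The cell below computes a single decision tree baseline on a synthetic dataset.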

In [25]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
random_state=0)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()   

0.9823000000000001
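
The baseline above can be put next to the two tree ensembles on the same synthetic data. This is a sketch along the lines of the scikit-learn documentation [13]; the choice of 10 trees per ensemble is an assumption, and the exact scores depend on the installed scikit-learn version.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset as in the decision tree baseline
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=10, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model
cv_means = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_means.items():
    print(f"{name}: {score:.4f}")
```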

<a id="6"></a> <br>
# 6-Conclusion
In this kernel we saw:
* that XGBoost is a library for developing fast and high-performance gradient boosting tree models;
* that XGBoost achieves strong performance on a range of difficult machine learning tasks.
<br>
[go to top](#top)

you can follow me on:
> ###### [ GitHub](https://github.com/mjbahmani)
> ###### [Kaggle](https://www.kaggle.com/mjbahmani/)

 

<a id="10"></a> <br>
# 7-References & Credits

1. [datacamp](https://www.datacamp.com/community/tutorials/xgboost-in-python)
1. [Xgboost presentation](https://www.oreilly.com/library/view/data-science-from/9781491901410/ch04.html)
1. [machinelearningmastery](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)
1. [analyticsvidhya](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
1. [Github](https://github.com/mjbahmani)
1. [analyticsvidhya](https://www.analyticsvidhya.com/blog/2015/08/introduction-ensemble-learning/)
1. [ensemble-learning-python](https://www.datacamp.com/community/tutorials/ensemble-learning-python)
1. [image-header-reference](https://data-science-blog.com/blog/2017/12/03/ensemble-learning/)
1. [scholarpedia](http://www.scholarpedia.org/article/Ensemble_learning)
1. [toptal](https://www.toptal.com/machine-learning/ensemble-methods-machine-learning)
1. [quantdare](https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/)
1. [towardsdatascience](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)
1. [scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html)

>If you have finished this notebook, you can continue with the next steps: [Course Home Page](https://www.kaggle.com/mjbahmani/10-steps-to-become-a-data-scientist)