### Ensembling

Ensembling is a machine learning technique in which multiple models (built with different algorithms or with the same algorithm) are combined into a single model. The underlying models are called `learner`s or `base model`s. Combining them lets the strengths of the individual learners complement each other, so that the errors of any single learner tend to be offset by the others.

Commonly used ensembling methods are:
- `Bagging`: multiple learners are trained independently of each other (typically on bootstrapped samples), and their predictions are aggregated
- `Boosting`: learners are trained sequentially such that each subsequent learner attempts to rectify the errors of the preceding model. 

### Random Forest Algorithm

The `Random Forest` algorithm is a machine learning algorithm based on decision trees. It comprises multiple decision trees, which work on the same data in different ways. `Forest` indicates the composition of multiple `trees` (decision trees), and `random` indicates the variation in the construction of these decision trees. At each node of every decision tree, a random subset of the input features is chosen and evaluated for the best split (based on `gini` or `entropy`). In addition to this variation in the construction of the trees, the sample data provided to each tree is also different, due to the incorporation of `bootstrapping`.
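
The two split-quality measures named above can be computed by hand. The following is a minimal sketch rather than scikit-learn's implementation; the helper names `gini_impurity` and `entropy` and the example labels are made up for illustration.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Share of each class among the labels that reach a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    return sum(-p * log2(p) for p in class_proportions(labels))


mixed_node = ['Y', 'Y', 'N', 'N']   # hypothetical labels reaching a node
pure_node = ['N', 'N', 'N', 'N']    # a node that contains a single class

print(gini_impurity(mixed_node), entropy(mixed_node))   # 0.5 1.0
print(gini_impurity(pure_node), entropy(pure_node))     # both are zero for a pure node
```

Both measures are largest when the classes are evenly mixed and zero when a node is pure, which is why a split that lowers them is preferred.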

`Bootstrapping` is a sampling technique in which multiple samples are created from the same data, such that each sample has different data in it.

<h3 style='text-align: center;'>Original Sample</h3>

| Category | Sub-Category | Sales |
| --- | --- | --- |
| Technology | Phones | 2300 |
| Furniture | Chairs | 150 |
| Technology | Laptops | 1500 |
| Technology | Phones | 900 |

For example, for the above dataset, bootstrapping can create multiple samples:<br>

<div style='display:flex; align-items: center;'>
    <div style='margin: 10px;'>
        <h3 style='text-align: center;'>Sample #1</h3>
        <table>
          <tr>
            <th>Category</th>
            <th>Sub-Category</th>
            <th>Sales</th>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Phones</td>
            <td>2300</td>
          </tr>
          <tr>
            <td>Furniture</td>
            <td>Chairs</td>
            <td>150</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
          </tr>
        </table>
    </div>
    <div style='margin: 10px;'>
        <h3 style='text-align: center;'>Sample #2</h3>
        <table>
          <tr>
            <th>Category</th>
            <th>Sub-Category</th>
            <th>Sales</th>
          </tr>
          <tr>
            <td>Furniture</td>
            <td>Chairs</td>
            <td>150</td>
          </tr>
          <tr>
            <td>Furniture</td>
            <td>Chairs</td>
            <td>150</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Phones</td>
            <td>900</td>
          </tr>
        </table>
    </div>
    <div style='margin: 10px;'>
        <h3 style='text-align: center;'>Sample #3</h3>
        <table>
          <tr>
            <th>Category</th>
            <th>Sub-Category</th>
            <th>Sales</th>
          </tr>
          <tr>
            <td>Furniture</td>
            <td>Chairs</td>
            <td>150</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
            </tr>
        </table>
    </div>
    <div style='margin: 10px;'>
        <h3 style='text-align: center;'>Sample #4</h3>
        <table>
          <tr>
            <th>Category</th>
            <th>Sub-Category</th>
            <th>Sales</th>
          </tr>
          <tr>
            <td>Furniture</td>
            <td>Chairs</td>
            <td>150</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Phones</td>
            <td>900</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Laptops</td>
            <td>1500</td>
          </tr>
          <tr>
            <td>Technology</td>
            <td>Phones</td>
            <td>2300</td>
            </tr>
        </table>
    </div>
    <div style='margin: 10px;'>
        <div style='margin-top: 100%;'>...and so on.</div>
    </div>
</div>

Bootstrapping will create multiple samples from the same original sample, such that:
- each sample will have the same number of observations as the original sample
- observations are `sampled with replacement`: this basically means that the same observation may be selected multiple times in a sample.
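
The two properties above can be sketched with Python's standard library alone. This is only an illustration of sampling with replacement, not the code a random forest runs internally; the function name `bootstrap_sample` and the seed are arbitrary choices.

```python
import random

# The original sample from the table above, one tuple per observation.
original_sample = [
    ('Technology', 'Phones', 2300),
    ('Furniture', 'Chairs', 150),
    ('Technology', 'Laptops', 1500),
    ('Technology', 'Phones', 900),
]


def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(42)  # arbitrary seed, only so that the run is repeatable
samples = [bootstrap_sample(original_sample, rng) for _ in range(3)]

for number, sample in enumerate(samples, start=1):
    print(f'Sample #{number}')
    for row in sample:
        print('   ', row)
```

Every printed sample has four rows, and every row is one of the four original observations; some observations appear more than once while others are missing.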

With bootstrapping, a particular observation may be used to train one decision tree but not another. This way, each decision tree in the random forest is trained on a slightly different version of the training data, even though only one original dataset is available.

Because `bootstrapping` is coupled with the random selection of features at each node (of each decision tree), random forests tend to `learn` general patterns in the data rather than `memorize` the data provided for training.

The final prediction computed by the random forest algorithm is an aggregation of the multiple predictions made by each constituent decision tree. This, along with bootstrapping, is why random forests are classified as bagging ensemble algorithms (`b`ootstrap-`agg`regat`ing`).
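
The aggregation step can be sketched as follows: a majority vote for classification and an average for regression. The function names and the tree predictions are hypothetical. Note that scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees instead of counting hard votes, so this is a simplified picture.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Majority vote: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Average the numeric predictions of all trees."""
    return mean(tree_predictions)


# Hypothetical predictions from five trees for a single observation.
print(aggregate_classification(['Y', 'N', 'Y', 'Y', 'N']))   # Y
print(aggregate_regression([1400, 1500, 1550, 1450, 1600]))  # 1500
```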

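A random forest can be evaluated using the `out-of-bag` error. The out-of-bag samples of a decision tree are the observations that were left out of its bootstrapped sample. The out-of-bag error is simply the proportion of out-of-bag samples that were incorrectly predicted by the trees that did not see them during training.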
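
As a rough illustration, the sketch below finds the out-of-bag rows for a single tree and computes an error rate from made-up predictions. The row count, the seed, the labels and the helper `oob_error` are all assumptions for the example. In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` makes the fitted model expose an out-of-bag accuracy as `oob_score_`.

```python
import random

n_observations = 10
rng = random.Random(0)  # arbitrary seed for repeatability

# Bootstrap sample for one tree: row indices drawn with replacement.
in_bag = rng.choices(range(n_observations), k=n_observations)

# Rows that were never drawn are "out-of-bag" for this tree.
out_of_bag = sorted(set(range(n_observations)) - set(in_bag))

print('in-bag rows:    ', sorted(in_bag))
print('out-of-bag rows:', out_of_bag)


def oob_error(actual, predicted):
    """Share of out-of-bag observations that were predicted incorrectly."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical labels and predictions for four out-of-bag observations.
print(oob_error(['Y', 'N', 'N', 'Y'], ['Y', 'N', 'Y', 'Y']))  # 0.25
```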

In [1]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

In [3]:
insurance_data = pd.read_csv(r'https://raw.githubusercontent.com/puneettrainer/datasets/main/insurance_fraud.csv')
insurance_data.head()

Unnamed: 0,ACCOUNT_AGE,CUSTOMER_AGE,POLICY_NUMBER,POLICY_START_DATE,POLICY_STATE,LIABILITY_AMOUNT,DEDUCTABLE,ANNUAL_FEE,UMBRELLA_LIMIT,ZIP_CODE,...,WITNESSES,POLICE_REPORT,TOTAL_CLAIM_AMOUNT,INJURY_CLAIM,PROPERTY_CLAIM,VEHICLE_CLAIM,AUTO_MAKE,AUTO_MODEL,AUTO_YEAR,FRAUD
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,...,2,YES,71610,6510,13020,52080,Saab,92x,2004,Y
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,...,0,NO,5070,780,780,3510,Mercedes,E400,2007,Y
2,134,29,687698,2000-09-06,OH,100/300,2000,1413.14,5000000,430632,...,3,NO,34650,7700,3850,23100,Dodge,RAM,2007,N
3,256,41,227811,1990-05-25,IL,250/500,2000,1415.74,6000000,608117,...,2,NO,63400,6340,6340,50720,Chevrolet,Tahoe,2014,Y
4,228,44,367455,2014-06-06,IL,500/1000,1000,1583.91,6000000,610706,...,1,NO,6500,1300,650,4550,Accura,RSX,2009,N


In [4]:
target_field = 'FRAUD'
input_fields = ['ACCOUNT_AGE', 'CUSTOMER_AGE', 'LIABILITY_AMOUNT', 'DEDUCTABLE', 'ANNUAL_FEE',
                'UMBRELLA_LIMIT', 'GENDER', 'EDUCATION_LEVEL', 'OCCUPATION', 'CAPITAL_GAINS', 'INCIDENT_TYPE', 'INCIDENT_SEVERITY',
                'AUTHORITIES', 'NUMBER_OF_VEHICLES', 'TOTAL_CLAIM_AMOUNT']

In [5]:
categorical_fields = list(insurance_data[input_fields].select_dtypes(exclude='number').columns)

In [6]:
training_data, test_data = train_test_split(insurance_data
                                           ,test_size=0.3
                                           ,random_state=10)

In [7]:
encoder = OneHotEncoder().fit(insurance_data[categorical_fields])
training_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(training_data[categorical_fields]).toarray()
test_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(test_data[categorical_fields]).toarray()

In [8]:
input_fields = list(insurance_data[input_fields].select_dtypes(include='number')) + list(encoder.get_feature_names_out())

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = DecisionTreeClassifier()
model.fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

accuracy_score(test_data[target_field], predictions)

0.7166666666666667

In [10]:
model = RandomForestClassifier()
model.fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

accuracy_score(test_data[target_field], predictions)

0.79

### Accessing individual decision trees within a random forest

The `RandomForestClassifier` (or `RandomForestRegressor`) class has an attribute which allows us to access individual decision trees within the random forest:

`RandomForestObject.estimators_[n]`

Here, $n$ is the index of the decision tree (starting from $0$).

In [11]:
model.estimators_[10]

In [12]:
model.estimators_[19]

### `n_jobs` Hyperparameter

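The `n_jobs` parameter in the RandomForestClassifier (RandomForestRegressor) is used to specify how many jobs (CPU cores) are used to train the decision trees in parallel. By default, `n_jobs` is `None` (which means one job, unless a `joblib` parallel backend context says otherwise). We can specify this as $-1$ to use all available processors. This speeds up training, but it does not change what the random forest learns; small differences in accuracy between runs come from the randomness of the algorithm (no `random_state` is set here).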

In [13]:
model = RandomForestClassifier(n_jobs=-1)
model.fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

accuracy_score(test_data[target_field], predictions)

0.8

### `n_estimators` Hyperparameter

`n_estimators` is used to specify how many decision trees are going to be created in the random forest. A greater number of decision trees generally makes the predictions more stable, but the improvement levels off while the training time keeps growing. By default, `n_estimators` = $100$.

In [14]:
model = RandomForestClassifier(n_jobs=-1, n_estimators=50)
model.fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

accuracy_score(test_data[target_field], predictions)

0.7633333333333333

In [15]:
model = RandomForestClassifier(n_jobs=-1, n_estimators=150)
model.fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

accuracy_score(test_data[target_field], predictions)

0.7866666666666666

### XGBoost for Regression

Let's assume the original sample from the bootstrapping example (Category, Sub-Category, Sales) is our training data and we are trying to predict Sales (using XGBoost for regression).

<ol>
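	<li>the algorithm will compute an initial prediction ($P_0$), which in this walkthrough is the average of the target column (in the <code>xgboost</code> library, the initial prediction is controlled by the <code>base_score</code> parameter, which defaulted to 0.5 in older versions)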
		<table>
			<tr>
				<th>Category</th>
				<th>Sub-Category</th>
				<th>Sales</th>
				<th>$P_0$</th>
			</tr>
			<tr>
				<td>Technology</td>
				<td>Phones</td>
				<td>2300</td>
				<td>1212.5</td>
			</tr>
			<tr>
				<td>Furniture</td>
				<td>Chairs</td>
				<td>150</td>
				<td>1212.5</td>
			</tr>
			<tr>
				<td>Technology</td>
				<td>Laptops</td>
				<td>1500</td>
				<td>1212.5</td>
			</tr>
			<tr>
				<td>Technology</td>
				<td>Phones</td>
				<td>900</td>
				<td>1212.5</td>
			</tr>
		</table>
	</li>
	<li>the algorithm then computes the difference between the observed values and the predicted values (pseudo-residuals, $R_0$)<br>$\implies R_0 = Sales_i - P_0$
		<table>
			<tr>
				<th>Category</th>
				<th>Sub-Category</th>
				<th>Sales</th>
				<th>$P_0$</th>
				<th>$R_0$</th>
			</tr>
			<tr>
				<td>Technology</td>
				<td>Phones</td>
				<td>2300</td>
				<td>1212.5</td>
				<td>1087.5</td>
			</tr>
			<tr>
				<td>Furniture</td>
				<td>Chairs</td>
				<td>150</td>
				<td>1212.5</td>
				<td>-1062.5</td>
			</tr>
			<tr>
				<td>Technology</td>
				<td>Laptops</td>
				<td>1500</td>
				<td>1212.5</td>
				<td>287.5</td>
			</tr>
			<tr>
				<td>Technology</td>
				<td>Phones</td>
				<td>900</td>
				<td>1212.5</td>
				<td>-312.5</td>
			</tr>
		</table>
	</li>
	<li>The algorithm then constructs a tree based on these pseudo-residuals. Among the candidate splits, the one with the greatest $Gain$ is used to build the tree that makes the next set of predictions. From the above data, the algorithm has the values:
		<table>
			<tr><th>$R_0$</th></tr>
			<tr><td>1087.5</td></tr>
			<tr><td>-1062.5</td></tr>
			<tr><td>287.5</td></tr>
			<tr><td>-312.5</td></tr>
		</table>
	</li>
	<li>the algorithm constructs a tree using these values:<br>$\text{Root Node} = [1087.5, -1062.5, 287.5, -312.5]$
        <ul>
            <li>it first splits the root node by setting the criteria for the split as the mean of the first two pseudo-residuals ($1087.5, -1062.5$), generating the following split:<br>$Leaf_1 = [1087.5]$<br>$Leaf_2 = [-1062.5, 287.5, -312.5]$</li>
            <li>it calculates the $\text{Similarity Score}$ of each node (taking $\lambda = 0$ in this example)<br>$\text{Similarity Score (for regression)} = \frac{(\sum R) ^ 2}{count(R) + \lambda}$<br>$\implies SS_{root} = \frac{(1087.5 - 1062.5 + 287.5 - 312.5) ^ 2}{4 + \lambda} = \frac{0 ^ 2}{4} = 0$<br>$SS_{Leaf_1} = \frac{(1087.5) ^ 2}{1 + \lambda} = 1182656.25$<br>$SS_{Leaf_2} = \frac{(-1062.5 + 287.5 - 312.5) ^ 2}{3 + \lambda} = \frac{(-1087.5) ^ 2}{3} = 394218.75$<br>The root's score is $0$ because residuals taken around the mean always sum to zero.</li>
            <li>After calculating $\text{Similarity Score}$ for each node, the algorithm computes the $\text{Gain}$<br>$\text{Gain} = SS_{Leaf_1} + SS_{Leaf_2} - SS_{root}$<br>$\implies \text{Gain} = 1182656.25 + 394218.75 - 0 = 1576875$</li>
            <li>these calculations are repeated for different splits:
            <table>
                <tr>
                    <th>Leaf</th>
                    <th>$Split_1$</th>
                    <th>$Split_2$</th>
                    <th>$Split_3$</th>
                </tr>
                <tr>
                    <td>1</td>
                    <td>1087.5</td>
                    <td>1087.5, -1062.5</td>
                    <td>1087.5, -1062.5, 287.5</td>
                </tr>
                <tr>
                    <td>2</td>
                    <td>-1062.5, 287.5, -312.5</td>
                    <td>287.5, -312.5</td>
                    <td>-312.5</td>
                </tr>
                <tr>
                    <td>$Gain$</td>
                    <td>1576875</td>
                    <td>625</td>
                    <td>130208.33</td>
                </tr>
            </table>
        </li>
        <li>the algorithm attempts to split these further by assessing the $\text{Gain}$; whichever split gives the greatest value of $\text{Gain}$, is chosen to construct the tree</li>
        </ul>        
	</li>
	<li>based on the above table, the first split ($Split_1$) gives the greatest $\text{Gain}$, so the tree that we constructed is the optimal one, and the algorithm uses it to compute predictions from the input features. The value predicted by each leaf is given by the $\text{Output Value}$:<br>$\text{Output value}_{leaf} = \frac{\sum residuals_{leaf}}{count(residuals_{leaf}) + \lambda}$<br>
    <strong>NOTE</strong>: $count(residuals_{leaf})$ is called $\text{Cover}$. For regression, this is simply the number of residuals in the leaf. We can set the minimum $\text{Cover}$ that a leaf must have using the hyperparameter <strong>min_child_weight</strong>.</li>
	<li>this goes on iteratively till the depth that we have defined</li>
	<li>after all predictions are made by the subsequent trees, the algorithm makes a final prediction by adding up the scaled predictions of each tree, where $\epsilon$ is the learning rate<br>$Prediction_{XGBoost} = P_0 + \epsilon \times P_1 + ... + \epsilon \times P_n$</li>
	<li>Additionally, to avoid overfitting, the algorithm also prunes the trees at each iteration by calculating the difference between $\text{Gain}$ and $\gamma$. $\gamma$ is a hyperparameter that we choose to control overfitting in the constructed trees. If the difference ($\text{Gain} - \gamma$) is negative, the branch is pruned; otherwise it remains.</li>
</ol>
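
The arithmetic of the walkthrough above can be reproduced in a few lines of plain Python. This is a sketch of the simplified procedure described here, not of the `xgboost` library's implementation. It assumes $\lambda = 0$; the values chosen for `learning_rate` and `gamma`, and all function names, are illustrative assumptions.

```python
sales = [2300, 150, 1500, 900]
reg_lambda = 0        # regularisation term (lambda), assumed to be 0 here
learning_rate = 0.3   # epsilon, an assumed value for illustration
gamma = 1000          # pruning threshold, an assumed value for illustration

# Initial prediction (mean of the target) and pseudo-residuals.
p0 = sum(sales) / len(sales)
residuals = [value - p0 for value in sales]


def similarity_score(values, lam=reg_lambda):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(values) ** 2 / (len(values) + lam)


def gain(left, right, lam=reg_lambda):
    """Similarity of both leaves minus the similarity of their parent."""
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(left + right, lam))


def output_value(values, lam=reg_lambda):
    """sum of residuals / (number of residuals + lambda)."""
    return sum(values) / (len(values) + lam)


# Gain of the three candidate splits of the root node.
gains = {i: gain(residuals[:i], residuals[i:]) for i in range(1, len(residuals))}
best = max(gains, key=gains.get)
left, right = residuals[:best], residuals[best:]

# A split is pruned when Gain - gamma is negative.
would_prune = {i: (g - gamma) < 0 for i, g in gains.items()}

# Leaf outputs and the prediction after adding one scaled tree.
leaf_outputs = [output_value(left)] * len(left) + [output_value(right)] * len(right)
p1 = [p0 + learning_rate * out for out in leaf_outputs]

print(p0)                                            # 1212.5
print(residuals)                                     # [1087.5, -1062.5, 287.5, -312.5]
print({i: round(g, 2) for i, g in gains.items()})    # {1: 1576875.0, 2: 625.0, 3: 130208.33}
print(best, output_value(left), output_value(right)) # 1 1087.5 -362.5
print(would_prune)                                   # {1: False, 2: True, 3: False}
print([round(p, 2) for p in p1])
```

With these assumed settings, the first split has by far the largest gain and survives pruning, and one boosting round moves every prediction a fraction $\epsilon$ of the way towards its leaf's output value.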

### XGBoost for Classification

For classification problems, XGBoost constructs a model just like it does for regression problems. However, because the learners (and therefore the entire model) are predicting `probabilities` and not actual continuous values, there is a minor adjustment. The output value is calculated as follows:
    $\text{Output Value}_{leaf} = \frac{\sum residuals}{\sum(probability_{previous} \times (1 - probability_{previous})) + \lambda}$<br>
    <strong>NOTE</strong>: In case of classification, $Cover$ is $\sum(probability_{previous} \times (1 - probability_{previous}))$, i.e. the denominator of the output value without $\lambda$
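
A small numeric sketch of the output value formula for a single leaf follows. The labels, the previous probabilities and the function name are hypothetical, and $\lambda$ is assumed to be $0$.

```python
def classification_output_value(residuals, previous_probabilities, lam=0.0):
    """Return (output value, cover) of a leaf for a classification tree."""
    # Sum of p * (1 - p) over the observations in the leaf.
    cover = sum(p * (1 - p) for p in previous_probabilities)
    return sum(residuals) / (cover + lam), cover


# Hypothetical leaf: three observations, each previously predicted as 0.5.
labels = [1, 0, 1]
previous = [0.5, 0.5, 0.5]
residuals = [y - p for y, p in zip(labels, previous)]

value, cover = classification_output_value(residuals, previous)
print(residuals)   # [0.5, -0.5, 0.5]
print(cover)       # 0.75
print(round(value, 4))
```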

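Because the trees work with the log of the odds rather than with probabilities directly, the predictions are combined on the log(odds) scale:
$Prediction_{XGBoost} (P_F) = log(\frac{P_0}{1 - P_0}) + \epsilon \times P_1 + ... + \epsilon \times P_n$

Here, $P_1, ..., P_n$ are the output values of the trees and $\epsilon$ is the learning rate.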


So the final predicted value that we use is given by:<br>
$Prediction = \frac{e ^ {P_F}}{1 + e ^ {P_F}}$<br>

Here, $e$ is the base of the natural logarithm (Euler's number). This conversion maps $P_F$ to a probability between $0$ and $1$.
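
The conversion above is the logistic function. A minimal sketch, with the function name and the example values of $P_F$ chosen only for illustration:

```python
from math import exp


def to_probability(p_f):
    """Logistic function: e^x / (1 + e^x)."""
    return exp(p_f) / (1 + exp(p_f))


for p_f in (-2.0, 0.0, 2.0):   # hypothetical values of P_F
    print(p_f, round(to_probability(p_f), 4))
```

A value of $P_F = 0$ maps to a probability of exactly $0.5$; negative values map below $0.5$ and positive values above it.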

### Implementing XGBoost using `xgboost`

For regression problems, we use the `XGBRegressor` class and for classification we use the `XGBClassifier` class.

In [49]:
housing_data = pd.read_csv(r'https://raw.githubusercontent.com/puneettrainer/datasets/main/housing.csv')
housing_data.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [50]:
target_field = 'price'
input_fields = list(housing_data.columns)
input_fields.remove(target_field)

In [51]:
categorical_fields = list(housing_data[input_fields].select_dtypes(include='object').columns)
numeric_fields = list(housing_data[input_fields].select_dtypes(exclude='object').columns)

In [52]:
training_data, test_data = train_test_split(housing_data
                                           ,test_size=0.2
                                           ,random_state=99)

In [53]:
encoder = OneHotEncoder().fit(training_data[categorical_fields])
training_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(training_data[categorical_fields]).toarray()
test_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(test_data[categorical_fields]).toarray()


In [54]:
input_fields = numeric_fields + list(encoder.get_feature_names_out())

In [1]:
from xgboost import XGBRegressor
from sklearn.metrics import root_mean_squared_error

model = XGBRegressor().fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

In [56]:
root_mean_squared_error(test_data[target_field], predictions)

np.float64(1408823.085311332)

In [59]:
model_1 = XGBRegressor(n_estimators=50).fit(training_data[input_fields], training_data[target_field])
predictions = model_1.predict(test_data[input_fields])
root_mean_squared_error(test_data[target_field], predictions)

np.float64(1379867.3723508616)

In [60]:
model_1 = XGBRegressor(n_estimators=50, max_depth=3).fit(training_data[input_fields], training_data[target_field])
predictions = model_1.predict(test_data[input_fields])
root_mean_squared_error(test_data[target_field], predictions)

np.float64(1272477.4031598414)

In [67]:
model_2 = XGBRegressor(n_estimators=50, max_depth=3).fit(training_data[input_fields], training_data[target_field])
predictions = model_2.predict(test_data[input_fields])
root_mean_squared_error(test_data[target_field], predictions)

np.float64(1272477.4031598414)

In [86]:
model_3 = XGBRegressor(n_estimators=150, max_depth=3).fit(training_data[input_fields], training_data[target_field])
predictions = model_3.predict(test_data[input_fields])
root_mean_squared_error(test_data[target_field], predictions)

np.float64(1364762.5224144387)