### Out-of-Bag (OOB)

![OOB](https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Sampling_with_replacement_and_out-of-bag_dataset_-_medical_context.jpg/752px-Sampling_with_replacement_and_out-of-bag_dataset_-_medical_context.jpg)

In a Random Forest, the out-of-bag (OOB) error is an estimate of the model's performance on unseen data. It is computed from the training data alone, so the model can be evaluated without setting aside a separate validation dataset.

Here's how out-of-bag works in Random Forest:

1. **Bootstrap Sampling**:
   - When constructing each decision tree in the Random Forest ensemble, bootstrap sampling is used. This means that for each tree, a random sample of the training data is drawn with replacement. Some data points are likely to be included multiple times in the sample, while others may not be included at all.

2. **Out-of-Bag Samples**:
   - Because each bootstrap sample is drawn with replacement, some data points from the original dataset never appear in it. The data points that are not used to train a particular tree are that tree's out-of-bag samples.

3. **Estimation of Model Performance**:
   - For each data point in the original dataset, identify the trees whose bootstrap samples did not include it. Only those trees are used to predict that data point, and their predictions are aggregated (for example, by majority vote or by averaging class probabilities).
   - The out-of-bag error is then the error of these aggregated predictions, measured over all data points that were out-of-bag for at least one tree. This provides an estimate of how well the model generalizes to unseen data.

4. **Usage**:
   - The out-of-bag error serves as an internal validation measure during the training process of the Random Forest model. It helps to assess the model's performance and can guide hyperparameter tuning decisions, such as selecting the optimal number of trees or other model parameters.

5. **Advantages**:
   - The out-of-bag error estimation is computationally efficient since it leverages the training data itself and does not require a separate validation dataset.
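   - It provides a nearly unbiased estimate of the model's performance, because each data point is scored only by trees that did not see it during training. The estimate can be slightly pessimistic, since each point is predicted by a subset of the trees rather than the full ensemble.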
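
The bootstrap and out-of-bag steps above can be sketched with NumPy alone. This is a minimal illustration, not scikit-learn's internal code; the sample size of 1000 and the variable names are assumptions chosen for the example. It draws one bootstrap sample and checks what fraction of rows ends up out-of-bag, which should be close to (1 - 1/n)^n, roughly 0.37 for large n.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # hypothetical dataset size, chosen for illustration

# Draw one bootstrap sample: n_samples row indices chosen with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are the out-of-bag rows for this tree.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_idx = np.flatnonzero(~in_bag)

oob_fraction = oob_idx.size / n_samples
# Probability that a given row is missed by all n_samples draws.
expected_fraction = (1 - 1 / n_samples) ** n_samples

print(f"OOB fraction in this draw: {oob_fraction:.3f}")
print(f"Expected OOB fraction:     {expected_fraction:.3f}")
```

Because roughly a third of the rows are left out of every tree, each row is out-of-bag for many trees once the forest has a reasonable number of estimators.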

In summary, out-of-bag estimation in Random Forest provides a convenient and efficient way to estimate the model's performance on unseen data, helping to evaluate and fine-tune the model during training.
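
To make the estimation step concrete, here is a simplified by-hand OOB estimate built from individual decision trees. It is a sketch under stated assumptions: the data is synthetic (`make_classification`, not the heart dataset), the number of trees is arbitrary, and hard votes are used for aggregation, whereas scikit-learn's `RandomForestClassifier` averages class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n_samples, n_classes, n_trees = X.shape[0], 2, 50

rng = np.random.default_rng(0)
votes = np.zeros((n_samples, n_classes))  # OOB class votes per row

for _ in range(n_trees):
    # Bootstrap sample for this tree.
    idx = rng.integers(0, n_samples, size=n_samples)
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[idx] = False

    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X[idx], y[idx])

    # Each tree votes only on rows it never saw during training.
    pred = tree.predict(X[oob_mask])
    votes[np.flatnonzero(oob_mask), pred] += 1

# Score only rows that were out-of-bag for at least one tree.
has_vote = votes.sum(axis=1) > 0
oob_pred = votes.argmax(axis=1)
oob_accuracy = np.mean(oob_pred[has_vote] == y[has_vote])
print(f"Manual OOB accuracy: {oob_accuracy:.3f}")
```

The key point is that the score is computed per data point from the trees that excluded it, not per tree.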

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
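# Features are every column except the last; the last column (target) is the label.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)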

In [4]:
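# oob_score=True makes fit() also compute the out-of-bag accuracy.
# No random_state is set, so the scores will vary slightly between runs.
rf = RandomForestClassifier(oob_score=True)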

In [5]:
rf.fit(X_train,y_train)

In [6]:
rf.oob_score_

0.7975206611570248

The value `0.7975206611570248` is the out-of-bag (OOB) score of the trained Random Forest model. For a classifier, scikit-learn reports this score as accuracy. Let's break down what this means:

1. **Out-of-Bag (OOB) Score**:
   - Each decision tree is built from a bootstrap sample of the training data, so some data points are left out of the training of that tree; these are its out-of-bag (OOB) samples.
   - The OOB score is calculated by predicting each training point using only the trees that did not see it, then measuring the accuracy of those aggregated predictions. It is an estimate of the model's performance on unseen data.
   - Because it is obtained during training, the OOB score acts as an internal validation measure and can guide hyperparameter tuning decisions.

2. **Interpretation**:
   - For this Random Forest model, the OOB score of `0.7975206611570248` means that approximately 79.75% of the training points were classified correctly by the trees that had not seen them.
   - This suggests reasonable predictive performance, but it remains an estimate derived from the training data; accuracy on the held-out test set provides an independent check.

3. **Usage**:
   - The OOB score can be used to compare different Random Forest models or assess the impact of hyperparameter changes during model training. A higher OOB score generally indicates better model performance, but it should be interpreted in the context of the specific dataset and problem at hand.
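
As an illustration of using the OOB score to compare settings, the sketch below fits forests with different numbers of trees and records `oob_score_` for each. The data is synthetic and the candidate values of `n_estimators` are arbitrary assumptions; the same loop could be run on `X_train` and `y_train` from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the heart dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

oob_scores = {}
for n_trees in (25, 50, 100, 200):
    # random_state is fixed so the comparison is repeatable.
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=42)
    rf.fit(X, y)
    oob_scores[n_trees] = rf.oob_score_

for n_trees, score in oob_scores.items():
    print(f"n_estimators={n_trees:>3}  OOB accuracy={score:.3f}")

# Candidate with the highest OOB accuracy.
best_n = max(oob_scores, key=oob_scores.get)
print("Best n_estimators by OOB score:", best_n)
```

No separate validation split is needed here, because every score comes from predictions on rows the contributing trees did not train on.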

In summary, the OOB score provides a useful estimate of the Random Forest model's performance on unseen data, helping to evaluate its generalization ability and guide model tuning decisions.
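
The per-sample predictions behind the score can also be inspected. When `oob_score=True`, a fitted scikit-learn forest exposes `oob_decision_function_`, holding the OOB class probabilities for each training row. The sketch below, on synthetic data (an assumption, not the heart dataset), recomputes the accuracy from those probabilities and compares it with `oob_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the heart dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# One row per training sample, one column per class; each row is built
# only from trees for which that sample was out-of-bag.
oob_proba = rf.oob_decision_function_

# Turn probabilities into class labels and score them against the truth.
oob_pred = rf.classes_[np.argmax(oob_proba, axis=1)]
manual_score = np.mean(oob_pred == y)

print(f"oob_score_:           {rf.oob_score_:.4f}")
print(f"Recomputed from proba: {manual_score:.4f}")
```

With the default accuracy metric, the two numbers should agree, which shows that the OOB score is an accuracy over per-sample aggregated predictions.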

In [7]:
# Compare the OOB estimate with accuracy on the held-out test set.
y_pred = rf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8360655737704918
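
The test accuracy (about 0.836) is somewhat higher than the OOB score (about 0.798). As a rough sanity check on whether that gap is meaningful, the sketch below computes a binomial standard error for each figure. The sample sizes 242 and 61 are assumptions inferred from the reported scores (which equal 193/242 and 51/61) and the 80/20 split; they are not shown in the notebook output. The binomial formula also treats predictions as independent, which is only approximately true for OOB predictions.

```python
import math

# Scores reported above; the sample sizes are assumptions (see lead-in).
oob_score, n_train = 0.7975206611570248, 242
test_acc, n_test = 0.8360655737704918, 61


def binomial_se(p, n):
    """Standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1 - p) / n)


se_oob = binomial_se(oob_score, n_train)
se_test = binomial_se(test_acc, n_test)
gap = test_acc - oob_score

print(f"OOB score:     {oob_score:.3f} (SE ~ {se_oob:.3f})")
print(f"Test accuracy: {test_acc:.3f} (SE ~ {se_test:.3f})")
print(f"Gap:           {gap:.3f}")
```

Under these assumptions the gap is smaller than the standard error of the test accuracy, so the two estimates are consistent with each other given the small test set.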