# **Bootstrap Sampling**

In statistics, Bootstrap Sampling is a method that involves repeatedly drawing samples, with replacement, from a data source to estimate a population parameter.

Wait – that’s too complex. Let’s break it down and understand the key terms:

- Sampling: In statistics, sampling is the process of selecting a subset of items from a vast collection of items (the population) to estimate a certain characteristic of the entire population.

- Sampling with replacement: Each drawn data point is returned to the pool before the next draw, so the same data point can appear more than once within a sample and can also reappear in later samples.

- Parameter estimation: It is a method of estimating parameters of the population using samples. A parameter is a measurable characteristic associated with a population, for example, the average height of residents in a city or the count of red blood cells.


With that knowledge, go ahead and re-read the definition above. It’ll make much more sense now!
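To make the "with replacement" part concrete, here is a minimal sketch using only Python's standard library. The five height values and the seed are made up for illustration; `random.sample` draws without replacement, while `random.choices` draws with replacement.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (an illustrative choice)

population = [150, 155, 160, 165, 170]  # made-up heights in cm

# Without replacement: each value can appear at most once in the sample.
without_replacement = random.sample(population, k=5)

# With replacement: every draw is independent, so values can repeat.
with_replacement = random.choices(population, k=5)

print(sorted(without_replacement))  # the full population, just reordered
print(sorted(with_replacement))     # may contain duplicates and omit some values
```

Drawing five items without replacement from a population of five always returns every item exactly once, which is why resampling only becomes interesting once replacement is allowed.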

# **Why Do We Need Bootstrap Sampling?**

This is a fundamental question I’ve seen machine learning enthusiasts grapple with. What is the point of Bootstrap Sampling? Where can you use it? Let me take an example to explain this.

Let’s say we want to find the mean height of all the students in a school (which has a total population of 1,000). So, how can we perform this task?

One approach is to measure the height of all the students and then compute the mean height. I’ve illustrated this process below:


<p align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/img_1-1.png" />
</p>

However, this would be a tedious task. Just think about it: we would have to measure the heights of 1,000 students individually and then compute the mean height. It would take days! We need a smarter approach here.

This is where Bootstrap Sampling comes into play.

Instead of measuring the heights of all the students, we can draw a random sample of 5 students, with replacement, and measure their heights. We would repeat this process 20 times and then average the 100 collected height measurements (5 x 20). Because the draws are made with replacement, the same student may be measured more than once. This average height would be an estimate of the mean height of all the students in the school.

Pretty straightforward, right? This is the basic idea of Bootstrap Sampling.




<p align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/img_2-1.png" />
</p>

<strong> Hence, when we have to estimate a parameter of a large population, we can use Bootstrap Sampling. </strong>
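The school example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1,000 heights are simulated from a normal distribution (mean 160 cm, standard deviation 10 cm), since no real measurements are given, and the seed is arbitrary.

```python
import random
import statistics

random.seed(42)  # arbitrary seed, only for repeatability

# Hypothetical population: simulated heights (cm) of 1,000 students
heights = [random.gauss(160, 10) for _ in range(1000)]

collected = []
for _ in range(20):                        # repeat the draw 20 times
    sample = random.choices(heights, k=5)  # 5 students, with replacement
    collected.extend(sample)

estimate = statistics.mean(collected)      # average of the 100 measurements
true_mean = statistics.mean(heights)       # known here only because we simulated it
print(f"estimate: {estimate:.2f}, true mean: {true_mean:.2f}")
```

In a real school the true mean would be unknown; it is computed here only so the estimate can be compared against it.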

# **Bootstrap Sampling in Machine Learning**

Bootstrap sampling is used in a machine learning ensemble algorithm called bootstrap aggregating (also called bagging). It helps reduce overfitting and improves the stability of machine learning algorithms.

In bagging, a certain number of equally sized subsets of a dataset are extracted with replacement. Then, a machine learning algorithm is applied to each of these subsets and the outputs are combined into a single prediction, as I have illustrated below:


<p align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/Bagging.png" />
</p>
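The bagging procedure described above can be sketched roughly as follows. Several choices here are assumptions made for illustration and do not come from the article: the data is synthetic (`y = 2x + 1` plus noise), the base learner is a simple least-squares line (`fit_line` is a hypothetical helper), each bootstrap sample is as large as the dataset, and the outputs are combined by averaging.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only for repeatability

# Hypothetical noisy data following y = 2x + 1
xs = [i / 10 for i in range(100)]
data = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in xs]

def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mean_x = statistics.mean(x for x, _ in points)
    mean_y = statistics.mean(y for _, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Train one model per bootstrap sample (drawn with replacement)
models = []
for _ in range(25):
    boot = random.choices(data, k=len(data))
    models.append(fit_line(boot))

def bagged_predict(x):
    """Combine the models by averaging their predictions."""
    return statistics.mean(slope * x + intercept for slope, intercept in models)

print(bagged_predict(5.0))  # should land near 2 * 5 + 1 = 11
```

For classification, the averaging step would typically be replaced by a majority vote over the models' predicted labels.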


# **Bootstrap Implementation**

In [1]:
# Import libraries

import numpy as np
import random

In [2]:
# Population: 10,000 values from a normal distribution (mean 500, std 1)
x = np.random.normal(loc=500.0, scale=1.0, size=10000)

np.mean(x)

499.9828980294433

In [3]:
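sample_mean = []

# Bootstrap sampling: draw 40 samples of size 5, with replacement
for i in range(40):
  # random.choices samples with replacement (random.sample does not)
  y = random.choices(x.tolist(), k=5)
  avg = np.mean(y)

  sample_mean.append(avg)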

In [4]:
np.mean(sample_mean)

500.02162469459773

**As you can see, the mean of the sample means turns out to be pretty close to the population mean! This is why Bootstrap Sampling is such a useful technique in statistics and machine learning.**

# **Conclusion**

Here are a few key benefits of bootstrapping:

- The parameter estimated by bootstrap sampling is comparable to the actual population parameter

- Since we only need a few samples for bootstrapping, the computational requirement is very low

- In Random Forest, a bootstrap sample size of even 20% of the dataset gives pretty good performance, as shown below:


<p align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/rf.png" />
</p>

The model's performance reaches its maximum when the data provided is less than a 0.2 fraction (20%) of the original dataset.