# Data Sampling Techniques for Machine Learning
Sampling is important in machine learning because it supports:
1. Model generalization
2. Data balancing
3. Computation reduction
4. Building robust estimators

There are several techniques for data sampling:
1. Random Sampling
2. Stratified Sampling
3. Bootstrapping
4. Oversampling
5. Undersampling
6. Cluster Sampling
7. Systematic Sampling
8. Adaptive Sampling

## Table of Contents
1. [Random Sampling](#1-random-sampling)
2. [Stratified Sampling](#2-stratified-sampling)
3. [Bootstrapping](#3-bootstrapping)
4. [Oversampling (SMOTE)](#4-oversamplingsmote)
5. [Undersampling](#5-undersampling)
6. [Cluster Sampling](#6-cluster-sampling)
7. [Systematic Sampling](#7--systematic-sampling)
8. [Adaptive Sampling](#8-adaptive-sampling)

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("dataset/")

### 1. Random Sampling
Random sampling selects data points uniformly at random, without considering the class distribution.

Use Cases
* Large datasets
* Creating quick train/test splits
* When the data is roughly uniformly distributed across classes or groups


In [None]:
# Draw 10 rows without replacement; random_state makes the draw reproducible.
random_sample = df.sample(n=10, random_state=52)
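
The "quick train/test split" use case can be sketched with the same `sample` call. This is a minimal illustration on a made-up frame (`toy_df` and its columns are assumptions, not part of the dataset above): draw a fraction of rows for testing and keep the rest for training.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`.
toy_df = pd.DataFrame({"Feature1": range(100), "Label": ["A", "B"] * 50})

# Draw 20% of the rows without replacement for the test part.
test_part = toy_df.sample(frac=0.2, random_state=52)

# Every row that was not drawn goes to the training part.
train_part = toy_df.drop(test_part.index)

print(len(train_part), len(test_part))
```

Because the draw ignores `Label`, the class shares in the two parts can differ slightly from the original; that is the gap stratified sampling closes.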

### 2. Stratified Sampling
Stratified sampling ensures each class or group is represented in proportion to its share of the full dataset.

Use Cases
* Classification problems
* Imbalanced datasets
* Train/test split best practice

In [None]:
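# Sample 30% of the rows within each class so class proportions are preserved.
# GroupBy.sample (pandas >= 1.1) keeps the 'Label' column in the result.
stratified_sample = df.groupby('Label').sample(frac=0.3, random_state=42)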

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['Feature1','Feature2']], df['Label'], 
    test_size=0.3, 
    stratify=df['Label'], 
    random_state=42
)
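
One way to check that stratification did what it promises is to compare class shares before and after sampling. The sketch below uses an invented, imbalanced frame (`toy_df` is an assumption for illustration only) with a 90/10 class ratio.

```python
import pandas as pd

# Hypothetical imbalanced data: 900 rows of class A, 100 rows of class B.
toy_df = pd.DataFrame({
    "Feature1": range(1000),
    "Label": ["A"] * 900 + ["B"] * 100,
})

# Take 30% of each class separately.
stratified = toy_df.groupby("Label").sample(frac=0.3, random_state=42)

# Class shares in the full data and in the sample.
original_share = toy_df["Label"].value_counts(normalize=True)
sampled_share = stratified["Label"].value_counts(normalize=True)

print(original_share.to_dict())
print(sampled_share.to_dict())
```

If the two printed dictionaries match, the sample kept the original class balance.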

### 3. Bootstrapping
Bootstrapping draws samples **with replacement**, so the same data point can appear multiple times in one sample.

Use Cases
* Bagging / Random Forest
* Estimating confidence intervals
* Reducing variance by averaging models trained on different resamples

In [None]:
# Draw 20 rows with replacement; duplicate rows are expected.
bootstrapped_sample = df.sample(n=20, replace=True, random_state=42)
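
To make the "estimating confidence intervals" use case concrete, here is a rough sketch of a percentile bootstrap for a mean. The `values` array is synthetic and stands in for one numeric column; the number of resamples (1000) is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements standing in for one numeric column of `df`.
values = rng.normal(loc=50.0, scale=10.0, size=200)

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample the data with replacement, keeping the original size.
    resample = rng.choice(values, size=values.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile method: the middle 95% of the bootstrap means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean={values.mean():.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the simplest to read.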

### 4. Oversampling(SMOTE)
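Oversampling is used for imbalanced classification. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority classes by interpolating between existing minority points and their nearest neighbors.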

Use Cases

* Fraud detection
* Medical diagnosis datasets
* Rare-event classification


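**Important:**
SMOTE works **only on numeric features**. Encode categorical variables before using it, and apply it to the training split only so that synthetic points do not leak into the evaluation data.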

In [None]:
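from imblearn.over_sampling import SMOTE

# Resample only the training split (from the stratified split above) so the
# test set keeps its original class distribution.
oversample = SMOTE(random_state=42)
X_oversampled, y_oversampled = oversample.fit_resample(X_train, y_train)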
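
The interpolation idea behind SMOTE can be illustrated without the library. The sketch below is a simplified stand-in, not imblearn's implementation: the function name `synthesize`, the points in `minority`, and the choice `k=2` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points with two numeric features.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

def synthesize(points, n_new, k=2):
    """Create n_new points, each on the segment between a random
    minority point and one of its k nearest minority neighbours."""
    new_points = []
    for _ in range(n_new):
        base = points[rng.integers(len(points))]
        dists = np.linalg.norm(points - base, axis=1)
        # argsort puts the base point itself first (distance 0), so skip it.
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = points[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        new_points.append(base + gap * (neighbour - base))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6)
print(synthetic.shape)
```

Because every new point is a weighted average of two numeric vectors, the approach has no natural meaning for unencoded categorical columns.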


### 5. Undersampling

Undersampling reduces the number of samples from the majority class.

Use Cases
* When dataset is large
* When oversampling could cause overfitting
* Quick baselines

Note: Sampling every class down to the size of the smallest class, as in the code below, avoids accidentally dropping too much or too little data.

In [None]:
class_A = df[df['Label']=='A']
class_B = df[df['Label']=='B']
class_C = df[df['Label']=='C']

min_size = min(len(class_A), len(class_B), len(class_C))

undersampled_df = pd.concat([
    class_A.sample(n=min_size, random_state=42),
    class_B.sample(n=min_size, random_state=42),
    class_C.sample(n=min_size, random_state=42)
])
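
When the class names are not known in advance, the same balancing step can be written without listing each class by hand. This is a sketch on invented data (`toy_df` and its class counts are assumptions); it relies on `GroupBy.sample`, available in pandas 1.1 and later.

```python
import pandas as pd

# Hypothetical imbalanced data with three classes.
toy_df = pd.DataFrame({
    "Feature1": range(60),
    "Label": ["A"] * 40 + ["B"] * 15 + ["C"] * 5,
})

# Size of the smallest class, whatever the labels happen to be.
min_size = int(toy_df["Label"].value_counts().min())

# Draw min_size rows from every class.
balanced = toy_df.groupby("Label").sample(n=min_size, random_state=42)

print(balanced["Label"].value_counts().to_dict())
```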

### 6. Cluster Sampling

Cluster sampling selects whole groups (clusters) at random and keeps every sample inside the chosen groups, instead of drawing individual samples.

* Useful for very large datasets
* Reduces cost of data collection
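
A minimal sketch of cluster sampling with pandas follows. The `store_id` column is a hypothetical cluster key invented for this example, as is the choice of 3 clusters out of 8.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 8 stores (clusters) with 5 transactions each.
toy_df = pd.DataFrame({
    "store_id": np.repeat(np.arange(8), 5),
    "amount": np.arange(40, dtype=float),
})

rng = np.random.default_rng(42)

# Pick 3 whole clusters at random, without replacement.
cluster_ids = toy_df["store_id"].unique()
chosen = rng.choice(cluster_ids, size=3, replace=False)

# Keep every row that belongs to a chosen cluster.
cluster_sample = toy_df[toy_df["store_id"].isin(chosen)]

print(sorted(chosen.tolist()), len(cluster_sample))
```

The randomness is applied to clusters, not rows, so the sample size depends on how large the chosen clusters are.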

### 7.  Systematic Sampling

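Systematic sampling selects every k-th observation, usually after a random starting point.

* Good for streaming data
* Simple and efficient
* Not suitable if the data has a periodic pattern whose period aligns with k, because the sample can then be biased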
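
Systematic sampling maps directly onto slice notation. In this sketch the frame `toy_df` and the step `k = 10` are assumptions chosen for illustration; the random start keeps the first `k` rows equally likely to be included.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered data with 100 observations.
toy_df = pd.DataFrame({"value": np.arange(100)})

k = 10  # keep every 10th observation
rng = np.random.default_rng(42)
start = int(rng.integers(k))  # random offset in [0, k)

# Take rows start, start + k, start + 2k, ...
systematic_sample = toy_df.iloc[start::k]

print(systematic_sample.index.tolist())
```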


### 8. Adaptive Sampling

Adaptive sampling changes which points are sampled based on model performance. It is used in reinforcement learning and active learning.

* Useful when labeling is expensive
* Used in robotics and medical ML
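
One common form of adaptive sampling in active learning is uncertainty sampling: ask for labels on the points the current model is least sure about. The sketch below assumes a binary classifier whose predicted probabilities are already available; the `probs` values and the labeling `budget` are made up for the example.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
# for six unlabeled points.
probs = np.array([0.95, 0.52, 0.10, 0.45, 0.80, 0.33])

# Uncertainty is highest (1.0) at probability 0.5 and lowest (0.0) at 0 or 1.
uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)

# Labeling budget: how many points can be sent to an annotator this round.
budget = 2

# Indices of the most uncertain points, most uncertain first.
query_idx = np.argsort(-uncertainty)[:budget]

print(query_idx.tolist())
```

After the selected points are labeled, the model is retrained and the procedure repeats, which is what makes the sampling adaptive.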
