# Technique: 09 Data Reduction by Sampling

### What is this?
Sampling means selecting a small subset of the data to represent the whole dataset. Instead of processing 100,000 rows, we might work with only 1,000.

### Why use it?
1. **Save Time**: Large datasets take a long time to process.
2. **Save Money**: Working with a small sample needs less compute, so it is cheaper.
3. **Prototypes**: You can build and test your model quickly on a sample before running it on the full data.

### Sampling Methods:
1. **SRS (Simple Random Sampling)**:
   * **SRSWOR**: Without replacement. Each row can be picked at most once.
   * **SRSWR**: With replacement. The same row can be picked more than once.
2. **Stratified Sampling**: Divide the data into groups (strata), such as Age or Region, then sample from each group. This works best for skewed data, where some groups are much smaller than others.
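
The difference between the two SRS variants can be sketched with the `replace` flag of pandas `sample()`. This is a minimal illustration on a hypothetical 10-row frame (`df_toy` and `row_id` are made-up names, not part of the course dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
df_toy = pd.DataFrame({"row_id": range(10)})

# SRSWOR: without replacement, so no row can appear twice
srswor = df_toy.sample(n=5, replace=False, random_state=42)

# SRSWR: with replacement, so the same row may be drawn again
srswr = df_toy.sample(n=15, replace=True, random_state=42)

print("SRSWOR has duplicates:", srswor["row_id"].duplicated().any())
print("SRSWR has duplicates:", srswr["row_id"].duplicated().any())
```

Drawing 15 rows from only 10 is possible only with replacement, so the second sample must contain repeated rows.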

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from data_generator import generate_dtt_dataset, GLOBAL_SEED

# 1. Get the full dataset (1000 rows)
df_full = generate_dtt_dataset(n_samples=1000)

print(f"Original dataset size: {len(df_full)} rows")

# 2. Check the distribution of 'Region' (This is our 'Strata')
print("\nOriginal count by Region:")
print(df_full['Region'].value_counts())

Original dataset size: 1000 rows

Original count by Region:
Region
East     256
West     253
South    251
North    240
Name: count, dtype: int64


## Method 1: Simple Random Sampling (SRS)
We use `sample()` to pick 10% of the rows at random. By default, `sample()` draws without replacement (SRSWOR). Setting `random_state` makes the result reproducible, so we get the same sample every time.

In [5]:
# Pick 100 rows randomly from 1000
df_srs = df_full.sample(frac=0.1, random_state=GLOBAL_SEED)

print(f"SRS Sample size: {len(df_srs)} rows")

# Check if the regions are still balanced
print("\nSRS Sample count by Region:")
print(df_srs['Region'].value_counts())

SRS Sample size: 100 rows

SRS Sample count by Region:
Region
South    28
North    28
East     25
West     19
Name: count, dtype: int64


## Method 2: Stratified Sampling
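Simple Random Sampling can under-represent or even miss small groups. In the SRS sample above, West makes up only 19% of the rows (19 of 100), although it is 25.3% of the full dataset (253 of 1000). Stratified Sampling samples each group (here 'Region') in proportion to its size, so every group is represented fairly, just like the Kitchener survey example in the slides.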

In [6]:
from sklearn.model_selection import train_test_split

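# We use train_test_split to pick 10% of the data:
# test_size=0.9 sends 90% of the rows to the second part (discarded as _), leaving 10% in the first
# The 'stratify' parameter keeps each Region's share the same as in the original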
df_stratified, _ = train_test_split(
    df_full, 
    test_size=0.9, 
    random_state=GLOBAL_SEED, 
    stratify=df_full['Region']
)

print(f"Stratified Sample size: {len(df_stratified)} rows")

# Check the regions - their shares should closely match the original now
print("\nStratified Sample count by Region:")
print(df_stratified['Region'].value_counts())

Stratified Sample size: 100 rows

Stratified Sample count by Region:
Region
East     26
South    25
West     25
North    24
Name: count, dtype: int64
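
As an alternative to `train_test_split`, pandas can draw the same fraction from each group with `groupby(...).sample()`. The sketch below assumes a hypothetical, deliberately skewed frame (`df_skewed`, with 950 "Common" rows and 50 "Rare" rows) rather than the dataset above:

```python
import pandas as pd

# Hypothetical skewed data: one large group and one rare group
df_skewed = pd.DataFrame({
    "Group": ["Common"] * 950 + ["Rare"] * 50,
    "value": range(1000),
})

# Draw 10% from each group separately (proportional stratified sampling)
df_strat = df_skewed.groupby("Group").sample(frac=0.1, random_state=42)

# Each stratum contributes 10% of its own rows
print(df_strat["Group"].value_counts())
```

Because each group is sampled on its own, the rare group cannot be skipped, which is the risk with plain SRS.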


### Summary from Lecture Slides:
* **SRS**: Easy to do, but it might miss or under-represent rare groups if the data is skewed.
* **Stratified Sampling**: Best for skewed data. It keeps the proportions of subgroups (strata) almost exactly the same as in the full dataset (up to rounding).
* **Efficiency**: A good sample preserves the key patterns of the full dataset but is much faster to process.
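
One way to check how well a sample preserves a pattern is to compare category shares between the full data and the sample. The helper below is a minimal sketch, not part of the lecture material; `max_proportion_gap`, `df_toy`, and `df_sample` are hypothetical names, and the toy data has four equally sized regions:

```python
import pandas as pd

def max_proportion_gap(full, sample, column):
    """Largest absolute difference in category share between full data and sample."""
    full_share = full[column].value_counts(normalize=True)
    sample_share = sample[column].value_counts(normalize=True)
    # Categories missing from the sample count as a share of 0
    sample_share = sample_share.reindex(full_share.index, fill_value=0.0)
    return float((full_share - sample_share).abs().max())

# Hypothetical toy data: 4 regions with 250 rows each
df_toy = pd.DataFrame({"Region": ["East", "West", "South", "North"] * 250})

# A 10% simple random sample of the toy data
df_sample = df_toy.sample(frac=0.1, random_state=42)

gap = max_proportion_gap(df_toy, df_sample, "Region")
print(f"Largest gap in Region share: {gap:.3f}")
```

A gap near 0 means the sample mirrors the full data for that column. A large gap suggests that a stratified sample may be the better choice.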