# 07. Sampling and Estimation

## Learning Objectives
- Understand the concepts of sampling and parameter estimation
- Distinguish between different sampling techniques
- Compute confidence intervals
- Understand the concepts of bias and precision

## Topics
1. What Is Sampling
2. Sampling Techniques
3. Point Estimation
4. Interval Estimation
5. Confidence Intervals
6. Applications in Data Analysis


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("Libraries imported successfully!")
print("NumPy version:", np.__version__)


## 1. What Is Sampling

**Sampling** is the process of selecting a subset of a population for analysis. It is needed because analyzing the entire population is often impossible or impractical.

### Types of Sampling:
1. **Probability Sampling**: Every element of the population has a known, nonzero chance of being selected
   - Simple Random Sampling
   - Stratified Sampling
   - Cluster Sampling
   - Systematic Sampling
2. **Non-Probability Sampling**: Selection probabilities are unknown, and some elements may have no chance of being selected at all
   - Convenience Sampling
   - Purposive Sampling
   - Quota Sampling
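
To make the difference between the two families concrete, here is a minimal sketch that compares a simple random sample with a convenience sample. The population (values that trend upward with their position in the list) and the choice of "the first `n` records" as the convenient subset are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10,000 values that trend upward with their position,
# e.g. records stored in the order they were collected.
population = np.linspace(20, 80, 10_000) + rng.normal(0, 5, 10_000)

n = 200

# Simple random sampling: every element has the same known chance n/N.
srs = rng.choice(population, size=n, replace=False)

# Convenience sampling: take whatever is easiest to reach (here, the first n records).
convenience = population[:n]

print(f"Population mean:  {population.mean():.2f}")
print(f"SRS mean:         {srs.mean():.2f}")
print(f"Convenience mean: {convenience.mean():.2f}")
```

The convenience sample only sees the start of the list, so its mean reflects that region rather than the population as a whole, and collecting more of the same records would not fix this.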


## 7. Advanced Sampling Techniques

### A. Stratified Sampling:
- **Definition**: The population is divided into strata (layers) based on a chosen characteristic, and a sample is drawn from each stratum
- **Advantage**: Guarantees that every stratum is represented, proportionally when proportional allocation is used
- **Disadvantage**: Requires knowing which stratum each population element belongs to
- **Applications**: Demographic surveys, market analysis
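
When the allocation is not proportional, the stratum means have to be weighted by the stratum shares of the population. The sketch below assumes three hypothetical strata and an equal allocation of 50 units per stratum; the names `strata`, `n_h`, and the chosen sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population with three strata of very different sizes and means.
strata = {
    "small":  rng.normal(30, 5, 1_000),
    "medium": rng.normal(50, 5, 6_000),
    "large":  rng.normal(80, 5, 3_000),
}
N = sum(len(values) for values in strata.values())
true_mean = np.concatenate(list(strata.values())).mean()

# Equal allocation: 50 units per stratum, regardless of stratum size.
n_h = 50
samples = {name: rng.choice(values, n_h, replace=False) for name, values in strata.items()}

# Naive mean ignores that strata are over- or under-represented in the sample.
naive_mean = np.concatenate(list(samples.values())).mean()

# Stratified estimator: weight each stratum mean by its population share N_h / N.
weighted_mean = sum(len(strata[name]) / N * samples[name].mean() for name in strata)

print(f"True mean:     {true_mean:.2f}")
print(f"Naive mean:    {naive_mean:.2f}")
print(f"Weighted mean: {weighted_mean:.2f}")
```

With proportional allocation the weights and the sample shares coincide, which is why a plain mean of the combined sample is enough in that case.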

### B. Cluster Sampling:
- **Definition**: The population is divided into clusters, a random subset of clusters is selected, and every element in the selected clusters is observed
- **Advantage**: More efficient for geographically dispersed populations
- **Disadvantage**: Less precise than a simple random sample of the same size when clusters differ strongly from one another
- **Applications**: National surveys, health research

### C. Systematic Sampling:
- **Definition**: Select every k-th element of an ordered list, starting from a random position
- **Advantage**: Easy to implement
- **Disadvantage**: Can be biased if the list has a periodic pattern that lines up with k
- **Applications**: Quality control, product sampling
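
The periodicity problem is easiest to see in a small simulation. This sketch assumes a hypothetical production line in which every 10th item is systematically higher; `systematic_sample` is an illustrative helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical production line: every 10th item comes from a machine that runs hot,
# so the measurements repeat with period 10.
N = 10_000
population = rng.normal(50, 2, N)
population[9::10] += 20  # items 9, 19, 29, ... are systematically higher

def systematic_sample(values, k, rng):
    """Take every k-th element, starting from a random position in [0, k)."""
    start = rng.integers(0, k)
    return values[start::k]

# k = 10 matches the period: the sample contains either only hot items or none.
aligned = systematic_sample(population, 10, rng)
# k = 7 does not line up with the period, so hot items make up about 1/10 of the sample.
unaligned = systematic_sample(population, 7, rng)

print(f"Population mean:       {population.mean():.2f}")
print(f"Systematic, k=10 mean: {aligned.mean():.2f}")
print(f"Systematic, k=7 mean:  {unaligned.mean():.2f}")
```

Whichever start position is drawn, the `k = 10` sample misrepresents the share of hot items, while the `k = 7` sample cycles through every position in the period.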

### D. Multi-Stage Sampling:
- **Definition**: Sampling is carried out in stages, for example by selecting clusters first and then sampling elements within each selected cluster
- **Advantage**: Flexible and efficient
- **Disadvantage**: Complex to implement
- **Applications**: Large-scale surveys, national research
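
A two-stage design is the simplest multi-stage case. The sketch below assumes a hypothetical population of 50 equally sized schools; the school and student counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population: 50 schools (clusters) with 200 students each.
# Each school has its own average score, so clusters differ from one another.
n_schools, school_size = 50, 200
school_means = rng.normal(50, 8, n_schools)
scores = rng.normal(school_means[:, None], 10, (n_schools, school_size))

# Stage 1: select 10 schools by simple random sampling.
chosen_schools = rng.choice(n_schools, size=10, replace=False)

# Stage 2: select 20 students by simple random sampling within each chosen school.
two_stage_sample = np.concatenate([
    rng.choice(scores[s], size=20, replace=False) for s in chosen_schools
])

print(f"Population mean: {scores.mean():.2f}")
print(f"Two-stage sample mean (n={two_stage_sample.size}): {two_stage_sample.mean():.2f}")
```

Because all schools have the same size and the same number of students is drawn from each, the plain sample mean is a reasonable estimator here; with unequal cluster sizes the estimate would need weights.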

### E. Bootstrap Sampling:
- **Definition**: Resampling with replacement from an existing sample
- **Advantage**: Does not require assuming a parametric distribution for the data
- **Disadvantage**: Needs a sufficiently large and representative original sample
- **Applications**: Parameter estimation, hypothesis testing
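
The bootstrap is most useful for statistics that have no simple standard-error formula. As a sketch, the following estimates the standard error and a percentile interval for a median; the skewed sample and the number of resamples are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical skewed sample (e.g. waiting times), where the median is a natural summary.
sample = rng.exponential(scale=10, size=200)

n_resamples = 2_000
# Draw all resamples at once: each row indexes one bootstrap sample
# of the same size as the original, drawn with replacement.
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
boot_medians = np.median(sample[idx], axis=1)

se_median = boot_medians.std(ddof=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print(f"Sample median: {np.median(sample):.2f}")
print(f"Bootstrap SE of the median: {se_median:.2f}")
print(f"95% percentile interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

Generating the index matrix in one call avoids a Python loop; a loop like the one in the demonstration cell gives the same kind of result.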


In [None]:
# Demonstration of advanced sampling techniques
print("=== DEMONSTRATION OF ADVANCED SAMPLING TECHNIQUES ===")

# Population data for the demonstration
np.random.seed(42)
population_size = 10000
population = np.random.normal(50, 15, population_size)

# 1. Stratified Sampling
print("\n1. STRATIFIED SAMPLING:")
# Build strata from value ranges
strata1 = population[population < 40]  # Low values
strata2 = population[(population >= 40) & (population < 60)]  # Medium values
strata3 = population[population >= 60]  # High values

print(f"Stratum 1 (< 40): {len(strata1)} values")
print(f"Stratum 2 (40-60): {len(strata2)} values")
print(f"Stratum 3 (≥ 60): {len(strata3)} values")

# Proportional allocation: each stratum contributes in proportion to its size.
# round() avoids always truncating down; the total may still differ from sample_size by one.
sample_size = 300
strata1_sample = np.random.choice(strata1, round(sample_size * len(strata1) / population_size), replace=False)
strata2_sample = np.random.choice(strata2, round(sample_size * len(strata2) / population_size), replace=False)
strata3_sample = np.random.choice(strata3, round(sample_size * len(strata3) / population_size), replace=False)

stratified_sample = np.concatenate([strata1_sample, strata2_sample, strata3_sample])
print(f"Stratified sample: {len(stratified_sample)} values")
print(f"Stratified mean: {np.mean(stratified_sample):.2f}")

# 2. Cluster Sampling
print("\n2. CLUSTER SAMPLING:")
# Build clusters (e.g. 20 clusters of 500 values each)
n_clusters = 20
cluster_size = population_size // n_clusters
clusters = [population[i*cluster_size:(i+1)*cluster_size] for i in range(n_clusters)]

# Select 5 clusters at random
selected_clusters = np.random.choice(n_clusters, 5, replace=False)
cluster_sample = np.concatenate([clusters[i] for i in selected_clusters])

print(f"Number of clusters: {n_clusters}")
print(f"Selected clusters: {selected_clusters}")
print(f"Cluster sample: {len(cluster_sample)} values")
print(f"Cluster mean: {np.mean(cluster_sample):.2f}")

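# 3. Systematic Sampling
print("\n3. SYSTEMATIC SAMPLING:")
# Sort the population, then take every k-th element from a random start
sorted_population = np.sort(population)
k = population_size / sample_size  # Sampling interval (fractional, so the whole list is covered)
start = np.random.uniform(0, k)  # Random start within the first interval
systematic_indices = (start + k * np.arange(sample_size)).astype(int)
systematic_sample = sorted_population[systematic_indices]

print(f"Sampling interval (k): {k:.2f}")
print(f"Systematic sample: {len(systematic_sample)} values")
print(f"Systematic mean: {np.mean(systematic_sample):.2f}")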

# 4. Bootstrap Sampling
print("\n4. BOOTSTRAP SAMPLING:")
# Draw the initial sample
initial_sample = np.random.choice(population, 100, replace=False)
n_bootstrap = 1000
bootstrap_means = []

# Resample with replacement, keeping the original sample size each time
for _ in range(n_bootstrap):
    bootstrap_sample = np.random.choice(initial_sample, len(initial_sample), replace=True)
    bootstrap_means.append(np.mean(bootstrap_sample))

bootstrap_mean = np.mean(bootstrap_means)
# The std of the bootstrap means estimates the standard error of the sample mean
bootstrap_std = np.std(bootstrap_means, ddof=1)

print(f"Initial sample: {len(initial_sample)} values")
print(f"Bootstrap samples: {n_bootstrap}")
print(f"Bootstrap mean: {bootstrap_mean:.2f}")
print(f"Bootstrap std: {bootstrap_std:.2f}")

# 5. Comparison of sampling techniques
# Note: the samples differ in size, so their errors are not directly comparable.
print("\n5. COMPARISON OF SAMPLING TECHNIQUES:")
techniques = ['Simple Random', 'Stratified', 'Cluster', 'Systematic', 'Bootstrap']
samples = [
    np.random.choice(population, sample_size, replace=False),
    stratified_sample,
    cluster_sample,
    systematic_sample,
    initial_sample  # the sample the bootstrap resamples from (n=100), not the resamples
]

print("Sampling technique - mean and std:")
for tech, sample in zip(techniques, samples):
    print(f"  {tech:15s}: Mean={np.mean(sample):6.2f}, Std={np.std(sample):6.2f}")

# 6. Parameter estimation with the bootstrap
print("\n6. PARAMETER ESTIMATION WITH THE BOOTSTRAP:")
# Percentile confidence interval for the mean (np.percentile does not need sorted input)
ci_lower_bootstrap = np.percentile(bootstrap_means, 2.5)
ci_upper_bootstrap = np.percentile(bootstrap_means, 97.5)

print(f"Bootstrap 95% CI: [{ci_lower_bootstrap:.2f}, {ci_upper_bootstrap:.2f}]")
print(f"Population mean ({np.mean(population):.2f}) within CI: {ci_lower_bootstrap <= np.mean(population) <= ci_upper_bootstrap}")

# 7. Visualization of the sampling techniques
plt.figure(figsize=(18, 12))

# Plot 1: Stratified Sampling
plt.subplot(3, 3, 1)
plt.hist(strata1, bins=20, alpha=0.5, color='red', label='Strata 1', density=True)
plt.hist(strata2, bins=20, alpha=0.5, color='green', label='Strata 2', density=True)
plt.hist(strata3, bins=20, alpha=0.5, color='blue', label='Strata 3', density=True)
plt.hist(stratified_sample, bins=20, alpha=0.7, color='black', label='Stratified Sample', density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Stratified Sampling')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Cluster Sampling
plt.subplot(3, 3, 2)
selected_labeled = other_labeled = False  # make each legend entry appear exactly once
for i, cluster in enumerate(clusters):
    if i in selected_clusters:
        plt.hist(cluster, bins=10, alpha=0.7, color='red', label='' if selected_labeled else 'Selected Clusters')
        selected_labeled = True
    else:
        plt.hist(cluster, bins=10, alpha=0.3, color='gray', label='' if other_labeled else 'Other Clusters')
        other_labeled = True
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Cluster Sampling')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Systematic Sampling
plt.subplot(3, 3, 3)
plt.hist(population, bins=50, alpha=0.5, color='lightblue', label='Population', density=True)
plt.hist(systematic_sample, bins=20, alpha=0.7, color='red', label='Systematic Sample', density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Systematic Sampling')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 4: Bootstrap Distribution
plt.subplot(3, 3, 4)
plt.hist(bootstrap_means, bins=30, alpha=0.7, color='purple', edgecolor='black')
plt.axvline(bootstrap_mean, color='red', linestyle='-', linewidth=2, label=f'Bootstrap Mean: {bootstrap_mean:.2f}')
plt.axvline(np.mean(population), color='blue', linestyle='--', linewidth=2, label=f'Population Mean: {np.mean(population):.2f}')
plt.xlabel('Bootstrap Means')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 5: Comparison of Sampling Techniques
plt.subplot(3, 3, 5)
sample_means = [np.mean(sample) for sample in samples]
plt.bar(techniques, sample_means, color=['lightblue', 'lightgreen', 'lightcoral', 'lightyellow', 'lightpink'], alpha=0.7)
plt.axhline(np.mean(population), color='red', linestyle='--', linewidth=2, label=f'Population Mean: {np.mean(population):.2f}')
plt.ylabel('Mean')
plt.title('Comparison of Sample Means')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 6: Bootstrap Confidence Interval
plt.subplot(3, 3, 6)
plt.hist(bootstrap_means, bins=30, alpha=0.7, color='purple', edgecolor='black')
plt.axvline(ci_lower_bootstrap, color='red', linestyle='--', linewidth=2, label=f'CI Lower: {ci_lower_bootstrap:.2f}')
plt.axvline(ci_upper_bootstrap, color='red', linestyle='--', linewidth=2, label=f'CI Upper: {ci_upper_bootstrap:.2f}')
plt.axvline(np.mean(population), color='blue', linestyle='-', linewidth=2, label=f'Population Mean: {np.mean(population):.2f}')
plt.xlabel('Bootstrap Means')
plt.ylabel('Frequency')
plt.title('Bootstrap 95% CI')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 7: Sampling Error Comparison
plt.subplot(3, 3, 7)
sampling_errors = [abs(np.mean(sample) - np.mean(population)) for sample in samples]
plt.bar(techniques, sampling_errors, color=['lightblue', 'lightgreen', 'lightcoral', 'lightyellow', 'lightpink'], alpha=0.7)
plt.ylabel('Sampling Error')
plt.title('Comparison of Sampling Error')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

# Plot 8: Sample Size vs Precision
plt.subplot(3, 3, 8)
sample_sizes = [50, 100, 200, 500, 1000]
precisions = []
for size in sample_sizes:
    sample = np.random.choice(population, size, replace=False)
    # Precision of the mean is the inverse of its standard error, s / sqrt(n);
    # the sample std alone does not shrink as n grows.
    standard_error = np.std(sample, ddof=1) / np.sqrt(size)
    precisions.append(1 / standard_error)

plt.plot(sample_sizes, precisions, 'o-', linewidth=2, markersize=6, color='blue')
plt.xlabel('Sample Size')
plt.ylabel('Precision (1/SE)')
plt.title('Sample Size vs Precision')
plt.grid(True, alpha=0.3)

# Plot 9: Central Limit Theorem
plt.subplot(3, 3, 9)
n_samples = 1000
sample_means_clt = []
for _ in range(n_samples):
    sample = np.random.choice(population, 30, replace=False)
    sample_means_clt.append(np.mean(sample))

plt.hist(sample_means_clt, bins=30, alpha=0.7, color='green', edgecolor='black', density=True)
x = np.linspace(min(sample_means_clt), max(sample_means_clt), 100)
y = stats.norm.pdf(x, np.mean(population), np.std(population)/np.sqrt(30))
plt.plot(x, y, 'r-', linewidth=2, label='Theoretical Normal')
plt.xlabel('Sample Means')
plt.ylabel('Density')
plt.title('Central Limit Theorem')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 8. Conclusions and recommendations
print("\n8. CONCLUSIONS AND RECOMMENDATIONS:")
print("   - Simple Random: Suited to homogeneous populations")
print("   - Stratified: Suited to heterogeneous populations with known strata")
print("   - Cluster: Suited to geographically dispersed populations")
print("   - Systematic: Suited to ordered lists without periodic patterns")
print("   - Bootstrap: Suited to estimation without parametric distribution assumptions")
print("   - Choose the technique that matches the population's characteristics")
print("   - Consider both bias and precision")
print("   - Use multiple techniques for validation")
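
The confidence interval computed in the next cell follows the standard t-based formula for a mean. As a reference sketch, with $\bar{x}$ the sample mean, $s$ the sample standard deviation (computed with `ddof=1`), and $n$ the sample size:

$$
\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad
\mathrm{CI}_{1-\alpha} = \bar{x} \pm t_{1-\alpha/2,\; n-1} \cdot \mathrm{SE}
$$

For illustration only (these numbers are assumed, not output of the notebook): with $n = 100$ and $s = 15$, $\mathrm{SE} = 1.5$, and the 95% critical value $t_{0.975,\,99} \approx 1.984$ gives a margin of error of about $2.98$. The formula assumes the sample mean is approximately normally distributed, which is reasonable for a sample of this size.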


In [None]:
# Sampling and estimation simulation
np.random.seed(42)

# Population (the full data, which in practice we would not observe).
# population_mean and population_std are the generating parameters;
# the realized mean of the 10,000 values differs from them slightly.
population_mean = 50
population_std = 15
population_size = 10000
population = np.random.normal(population_mean, population_std, population_size)

print("=== SAMPLING AND ESTIMATION SIMULATION ===")
print("Population:")
print(f"  - Size: {population_size}")
print(f"  - Mean: {population_mean}")
print(f"  - Standard Deviation: {population_std}")

# Simple Random Sampling
sample_size = 100
sample = np.random.choice(population, sample_size, replace=False)

print(f"\nSample (n={sample_size}):")
print(f"  - Sample mean: {np.mean(sample):.2f}")
print(f"  - Sample standard deviation: {np.std(sample, ddof=1):.2f}")

# Point Estimation
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)  # ddof=1 gives the unbiased variance estimator

print("\nPoint estimates:")
print(f"  - Estimated population mean: {sample_mean:.2f}")
print(f"  - Estimated population std: {sample_std:.2f}")

# Interval Estimation - Confidence Interval
confidence_level = 0.95
alpha = 1 - confidence_level
n = len(sample)

# Standard Error
se = sample_std / np.sqrt(n)

# Critical value (t-distribution)
t_critical = stats.t.ppf(1 - alpha/2, n - 1)

# Margin of Error
margin_error = t_critical * se

# Confidence Interval
ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error

print(f"\nInterval estimate ({confidence_level:.0%} confidence interval):")
print(f"  - Standard Error: {se:.2f}")
print(f"  - t-critical: {t_critical:.2f}")
print(f"  - Margin of Error: {margin_error:.2f}")
print(f"  - CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"  - Population mean ({population_mean}) within CI: {ci_lower <= population_mean <= ci_upper}")

# Visualization
plt.figure(figsize=(15, 10))

# Plot 1: Population vs sample distribution
plt.subplot(2, 3, 1)
plt.hist(population, bins=50, alpha=0.7, color='lightblue', label='Population', density=True)
plt.hist(sample, bins=20, alpha=0.7, color='red', label='Sample', density=True)
plt.axvline(population_mean, color='blue', linestyle='-', linewidth=2, label=f'Pop Mean: {population_mean}')
plt.axvline(sample_mean, color='red', linestyle='--', linewidth=2, label=f'Sample Mean: {sample_mean:.2f}')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Population vs Sample Distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Confidence Interval
plt.subplot(2, 3, 2)
plt.errorbar(0, sample_mean, yerr=margin_error, fmt='o', capsize=10, capthick=2, 
             color='red', markersize=8, label='Sample Mean ± CI')
plt.axhline(population_mean, color='blue', linestyle='-', linewidth=2, label=f'Population Mean: {population_mean}')
plt.axhline(ci_lower, color='red', linestyle=':', alpha=0.7, label=f'CI Lower: {ci_lower:.2f}')
plt.axhline(ci_upper, color='red', linestyle=':', alpha=0.7, label=f'CI Upper: {ci_upper:.2f}')
plt.xlim(-0.5, 0.5)
plt.xlabel('')
plt.ylabel('Value')
plt.title(f'{confidence_level:.0%} Confidence Interval')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Sampling Distribution
plt.subplot(2, 3, 3)
n_samples = 1000
sample_means = []
for _ in range(n_samples):
    sample_i = np.random.choice(population, sample_size, replace=False)
    sample_means.append(np.mean(sample_i))

plt.hist(sample_means, bins=30, alpha=0.7, color='green', density=True)
plt.axvline(population_mean, color='blue', linestyle='-', linewidth=2, label=f'Pop Mean: {population_mean}')
plt.axvline(np.mean(sample_means), color='green', linestyle='--', linewidth=2, label=f'Sample Means Mean: {np.mean(sample_means):.2f}')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.title('Sampling Distribution of Means')
plt.legend()
plt.grid(True, alpha=0.3)

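# Plot 4: Bias and Precision
plt.subplot(2, 3, 4)
# Bias is a property of the estimator, not of one sample: estimate it as the average
# error over the repeated samples from Plot 3. Precision is the inverse of the
# spread of those sample means (the empirical standard error).
bias = np.mean(sample_means) - population_mean
precision = 1 / np.std(sample_means, ddof=1)

plt.scatter(bias, precision, s=100, color='red', alpha=0.7)
plt.axvline(0, color='blue', linestyle='--', alpha=0.7, label='No Bias')
plt.xlabel('Bias')
plt.ylabel('Precision (1/SE)')
plt.title('Bias vs Precision')
plt.legend()
plt.grid(True, alpha=0.3)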

# Plot 5: Convergence of the estimate
plt.subplot(2, 3, 5)
sample_sizes = [10, 20, 50, 100, 200, 500]
mean_estimates = []
for size in sample_sizes:
    sample_i = np.random.choice(population, size, replace=False)
    mean_estimates.append(np.mean(sample_i))

plt.plot(sample_sizes, mean_estimates, 'o-', color='red', alpha=0.7, label='Sample Mean')
plt.axhline(population_mean, color='blue', linestyle='-', linewidth=2, label=f'Population Mean: {population_mean}')
plt.xlabel('Sample Size')
plt.ylabel('Mean Estimate')
plt.title('Convergence of the Mean Estimate')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 6: Error bars for different sample sizes
plt.subplot(2, 3, 6)
# Compute the mean and its standard error from the same sample,
# so that each error bar belongs to the point it is drawn on.
means_with_se = []
errors = []
for size in sample_sizes:
    sample_i = np.random.choice(population, size, replace=False)
    means_with_se.append(np.mean(sample_i))
    errors.append(np.std(sample_i, ddof=1) / np.sqrt(size))

plt.errorbar(sample_sizes, means_with_se, yerr=errors, fmt='o-', capsize=5,
             color='red', alpha=0.7, label='Mean ± SE')
plt.axhline(population_mean, color='blue', linestyle='-', linewidth=2, label=f'Population Mean: {population_mean}')
plt.xlabel('Sample Size')
plt.ylabel('Mean Estimate')
plt.title('Mean Estimate with Standard Error')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
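
A single interval either contains the population mean or it does not; the 95% refers to how often the procedure succeeds over repeated samples. The sketch below checks this empirically. It assumes samples drawn directly from a normal distribution with mean 50 and standard deviation 15 (rather than from the finite `population` array), and the number of trials is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Assumed setup, mirroring the simulation above: normal data with mean 50 and std 15.
true_mean, true_std = 50, 15
n, n_trials, confidence_level = 100, 2_000, 0.95

# Draw all trials at once: each row is one independent sample of size n.
samples = rng.normal(true_mean, true_std, size=(n_trials, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Same t-based interval as in the simulation cell, applied to every trial.
t_critical = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)
lower = means - t_critical * ses
upper = means + t_critical * ses

# Fraction of intervals that contain the true mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Empirical coverage over {n_trials} trials: {coverage:.3f}")
```

The printed coverage should land close to the nominal confidence level, with some run-to-run variation that shrinks as `n_trials` grows.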
