<span style="color:#333333; font-size:24px; font-weight:bold"> Compiled by <a href=https://github.com/cyterat style="color:#00b2b7;">cyterat</a></span>

# Practical considerations:

- __Data Preparation__: Ensure your data is clean and properly formatted before sampling.

- __Sample Size__: Determine an appropriate sample size based on statistical power calculations or practical constraints.

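- __Randomization__: Seed the random number generator (for example, pass `random_state` to pandas or create one with `np.random.default_rng(seed)`) so that every sample can be reproduced exactly.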

- __Documentation__: Always document your sampling method and parameters for reproducibility.

- __Validation__: Check if your sample is representative of the population using descriptive statistics.
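
The validation step above can be sketched as a side-by-side comparison of descriptive statistics. This is a minimal illustration, not a formal test of representativeness; the `population` DataFrame, its `value` and `group` columns, and the sample size are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical population: one numeric column and one categorical column
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "value": rng.normal(loc=50, scale=10, size=10_000),
    "group": rng.choice(["A", "B", "C"], size=10_000, p=[0.5, 0.3, 0.2]),
})

sample = population.sample(n=1_000, random_state=42)

# Compare numeric summaries side by side
summary = pd.DataFrame({
    "population": population["value"].describe(),
    "sample": sample["value"].describe(),
})

# Compare category shares side by side
shares = pd.DataFrame({
    "population": population["group"].value_counts(normalize=True),
    "sample": sample["group"].value_counts(normalize=True),
})

print(summary)
print(shares)
```

Large gaps between the two columns suggest the sample may not reflect the population well; what counts as "large" depends on the analysis.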

In [None]:
import numpy as np
import pandas as pd

# 1. Simple Random Sampling
__Use case__: When you need an unbiased representation of the entire population.

In [None]:
# Assuming you have a DataFrame 'data'
sample_size = 100
simple_random_sample = data.sample(n=sample_size, random_state=42)

# 2. Stratified Sampling
__Use case__: When you want to ensure representation from different subgroups in the population.

In [None]:
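def stratified_sample(data, strata, size, random_state=None):
    # Draw up to `size` rows from each stratum; smaller strata are taken whole
    parts = [g.sample(n=min(len(g), size), random_state=random_state)
             for _, g in data.groupby(strata)]
    return pd.concat(parts)

# Assuming 'data' is your DataFrame and 'group' is the column for stratification
# (the result gets its own name so it does not overwrite the function)
stratified = stratified_sample(data, 'group', size=50, random_state=42)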
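
A common variant is proportional allocation, where every stratum contributes the same fraction of its rows so the sample keeps the population's group shares. The sketch below uses pandas' `DataFrameGroupBy.sample`; the `demo` DataFrame and the 10% fraction are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three strata of unequal size
demo = pd.DataFrame({
    "group": ["A"] * 600 + ["B"] * 300 + ["C"] * 100,
    "value": np.arange(1000),
})

# Draw 10% of every stratum so the sample mirrors the population's group shares
proportional = demo.groupby("group").sample(frac=0.1, random_state=42)

print(proportional["group"].value_counts())
```

Fixed-size allocation (as in the function above) over-represents small strata, while proportional allocation preserves their relative weight; which one fits depends on the analysis.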

# 3. Cluster Sampling
__Use case__: When the population is spread over a wide geographic area, and you can identify natural clusters.

In [None]:
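def cluster_sample(data, cluster_col, n_clusters, seed=None):
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    clusters = data[cluster_col].unique()
    # Pick whole clusters at random, then keep every row in the chosen clusters
    selected_clusters = rng.choice(clusters, size=n_clusters, replace=False)
    return data[data[cluster_col].isin(selected_clusters)]

# Assuming 'data' is your DataFrame and 'region' is the cluster column
# (the result gets its own name so it does not overwrite the function)
cluster_rows = cluster_sample(data, 'region', n_clusters=5, seed=42)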
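
To see what cluster sampling returns, here is a small self-contained run on made-up data. The `demo` DataFrame (8 regions of 25 rows each) and the choice of 3 clusters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 8 regions with 25 rows each
demo = pd.DataFrame({
    "region": np.repeat([f"R{i}" for i in range(8)], 25),
    "value": np.arange(200),
})

rng = np.random.default_rng(42)

# Select 3 whole regions, then keep all of their rows
chosen = rng.choice(demo["region"].unique(), size=3, replace=False)
cluster_rows = demo[demo["region"].isin(chosen)]

print(sorted(chosen))
print(len(cluster_rows))  # 3 regions x 25 rows = 75
```

Note that the number of rows returned depends on the sizes of the selected clusters, not on a target sample size.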

# 4. Systematic Sampling
__Use case__: When you have an ordered list and want to select items at regular intervals.

In [None]:
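def systematic_sample(data, step, start=0):
    # Take every `step`-th row, beginning at position `start` (0 <= start < step)
    return data.iloc[start::step, :]

# Assuming 'data' is your DataFrame
# (the result gets its own name so it does not overwrite the function)
systematic = systematic_sample(data, step=10)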
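
Systematic sampling is often described with a random starting point, so that every row has a chance of selection rather than always beginning at the first one. A minimal sketch of that variant follows; the function name `systematic_sample_random_start` and the `demo` data are hypothetical.

```python
import numpy as np
import pandas as pd

def systematic_sample_random_start(data, step, seed=None):
    """Take every `step`-th row, starting at a random offset in [0, step)."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # upper bound is exclusive
    return data.iloc[start::step, :]

demo = pd.DataFrame({"value": np.arange(95)})
picked = systematic_sample_random_start(demo, step=10, seed=42)
print(picked["value"].tolist())
```

If the data has a periodic pattern whose period matches `step`, a systematic sample can be biased regardless of the starting point.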

# 5. Convenience Sampling
__Use case__: When you need to quickly collect data and representativeness is not crucial.

In [None]:
# In practice, this might involve selecting easily accessible data points
# For simulation:
convenience_sample = data.head(100)  # First 100 rows

# 6. Quota Sampling
__Use case__: When you need to ensure specific proportions of different subgroups in your sample.

In [None]:
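def quota_sample(data, group_col, quotas, random_state=None):
    parts = []
    for group, quota in quotas.items():
        group_data = data[data[group_col] == group]
        # Take at most `quota` rows; a group smaller than its quota is taken whole
        parts.append(group_data.sample(n=min(len(group_data), quota), random_state=random_state))
    # Return an empty frame with the same columns if no quotas were given
    return pd.concat(parts) if parts else data.iloc[:0]

# Assuming 'data' is your DataFrame and 'category' is the group column
quotas = {'A': 50, 'B': 30, 'C': 20}
# (the result gets its own name so it does not overwrite the function)
quota_rows = quota_sample(data, 'category', quotas, random_state=42)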

# 7. Weighted Sampling
__Use case__: When certain observations are more important and should have a higher chance of selection.

In [None]:
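def weighted_sample(data, weights, n, random_state=None):
    # Rows are drawn with replacement, so a heavily weighted row can appear more than once
    return data.sample(n=n, weights=weights, replace=True, random_state=random_state)

# Assuming 'data' is your DataFrame and 'importance' is a column of non-negative weights
# (the result gets its own name so it does not overwrite the function)
weighted = weighted_sample(data, weights=data['importance'], n=100, random_state=42)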
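
The effect of the weights is easiest to see by drawing many rows from a tiny table and checking how often each one appears. The `demo` DataFrame and its weights below are made up for illustration; pandas normalizes the weights so they sum to 1.

```python
import pandas as pd

# Hypothetical items whose weights are in the ratio 1 : 3 : 6
demo = pd.DataFrame({
    "item": ["low", "medium", "high"],
    "importance": [1, 3, 6],
})

# Passing a column name as `weights` uses that column's values
draws = demo.sample(n=10_000, weights="importance", replace=True, random_state=42)
shares = draws["item"].value_counts(normalize=True)
print(shares)
```

The observed shares should land close to 0.1, 0.3, and 0.6, with some random variation.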

# 8. Time-based Sampling
__Use case__: When dealing with time series data and you want to sample based on time intervals.

In [None]:
def time_based_sample(data, freq):
    # Keep the first observation in each interval; drop only intervals with no data at all
    return data.resample(freq).first().dropna(how='all')

# Assuming 'data' is a time-indexed DataFrame
time_sample = time_based_sample(data, freq='1D')  # Daily sample
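
Here is a small self-contained run on a made-up hourly series, showing both the "first observation per day" approach used above and a random pick per day. The `demo` series and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering three full days
index = pd.date_range("2024-01-01", periods=72, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"value": np.arange(72)}, index=index)

# Deterministic: keep the first observation of each calendar day
daily_first = demo.resample("1D").first()

# Random: pick one observation per calendar day
daily_random = demo.groupby(demo.index.normalize()).sample(n=1, random_state=42)

print(daily_first)
print(daily_random)
```

Taking the first value per interval always samples the same time of day, which can hide intraday patterns; a random pick per interval avoids that at the cost of irregular spacing.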

# 9. Reservoir Sampling
__Use case__: When you need to sample from a large or streaming dataset of unknown size.

In [None]:
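def reservoir_sample(iterator, k, seed=None):
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    reservoir = []
    for i, item in enumerate(iterator):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # Keep item i with probability k / (i + 1), replacing a random slot
            j = rng.integers(0, i + 1)  # upper bound is exclusive
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage example with a large range
large_dataset = range(1000000)
# (the result gets its own name so it does not overwrite the function)
reservoir = reservoir_sample(large_dataset, 100, seed=42)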
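
The approach is most useful when the data arrives as a stream whose length is not known in advance. The sketch below feeds a generator into the same single-pass procedure (commonly known as Algorithm R); the names `reservoir_sample_stream` and `squares` are hypothetical.

```python
import numpy as np

def reservoir_sample_stream(stream, k, seed=None):
    """Uniform sample of k items from an iterable of unknown length, in one pass."""
    rng = np.random.default_rng(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = int(rng.integers(0, i + 1))  # uniform over 0..i
            if j < k:
                reservoir[j] = item
    return reservoir

def squares(limit):
    """Hypothetical stream: yields values one at a time instead of building a list."""
    for n in range(limit):
        yield n * n

picked = reservoir_sample_stream(squares(10_000), k=5, seed=42)
print(picked)

# A stream shorter than k is returned whole
print(reservoir_sample_stream(iter(range(3)), k=5))
```

Only `k` items are held in memory at any time, regardless of how long the stream is.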