This example comes from here: https://towardsdatascience.com/data-sampling-methods-in-python-a4400628ea1b

Random sampling.

The simplest data sampling technique. 

Every observation in the population has the same probability of being selected.



In [1]:
import numpy as np

# generating population data following Normal Distribution
N = 10000
mu = 10
std = 2
population_df = np.random.normal(mu,std,N)

# function that draws a simple random sample without replacement
def random_sampling(data: np.ndarray, n: int) -> np.ndarray:
    random_sample = np.random.choice(data, replace = False, size = n)
    return random_sample

randomSample = random_sampling(population_df, 1000)


In [2]:
print(randomSample.mean())
randomSample.std()

10.044295020600478


2.0178096982138496
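
As a rough check on the numbers above, the following sketch repeats the draw with a seeded generator and compares the sample mean with the standard error one would expect for a sample of this size. The seed, the variable names, and the use of `np.random.default_rng` instead of the legacy `np.random` functions are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

# Assumption: a fixed seed, chosen only so the sketch is reproducible
rng = np.random.default_rng(42)

N, mu, std, n = 10000, 10, 2, 1000
population = rng.normal(mu, std, N)

# simple random sample without replacement: every observation is equally likely
sample = rng.choice(population, size=n, replace=False)

# expected standard error of the sample mean, including the finite population
# correction that applies when sampling without replacement
standard_error = population.std() / np.sqrt(n) * np.sqrt((N - n) / (N - 1))

print(f"population mean: {population.mean():.3f}")
print(f"sample mean:     {sample.mean():.3f}")
print(f"standard error:  {standard_error:.3f}")
```

The sample mean should normally land within a few standard errors of the population mean, which is consistent with the values printed by the cell above.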

Systematic sampling.

A probability sampling approach in which elements of the target population are selected starting from a random point and then at a fixed sampling interval.

It extends the probability sampling techniques.

Members of the population are selected at regular intervals to form the sample.

The sampling interval is calculated by dividing the population size by the desired sample size.

Systematic Sampling usually produces a random sample, but it does not address bias in the created sample.
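
One way such bias can arise is when the order of the population has a repeating pattern that lines up with the sampling interval. The sketch below uses a made-up population (hypothetical, not the data used in this notebook) whose values repeat every 10 positions, and samples it with an interval of 10.

```python
import numpy as np

# Hypothetical population with a repeating pattern of length 10: 0, 1, ..., 9, 0, 1, ...
population = np.arange(10000) % 10

step = 10                         # sampling interval that happens to match the period
systematic = population[3::step]  # start at position 3, then take every 10th element

print(population.mean())   # 4.5
print(systematic.mean())   # 3.0, because every selected value is 3
```

With a different starting position the sample mean would be a different constant, so no single systematic sample represents this population well.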

In [3]:
import pandas as pd
# generating population data following Normal Distribution
N = 10000
mu = 10
std = 2
population_df = np.random.normal(mu,std,N)

# function that creates a sample using Systematic Sampling
def systematic_sampling(data: np.ndarray, step: int) -> pd.DataFrame:
    # one id per observation; np.arange(len(data)) has the same length as data
    ids = pd.Series(np.arange(len(data)))
    values = pd.Series(data)
    df_pd = pd.concat([ids, values], axis = 1)
    df_pd.columns = ["id", "data"]
    # these positions increase by the step amount, not by 1
    # note: the start is fixed at position 1 here rather than drawn at random
    selected_index = np.arange(1, len(data), step)
    print(f'selected index is: {selected_index}')
    # using iloc to get the rows at the selected positions
    systematic_sample = df_pd.iloc[selected_index]
    return systematic_sample

n = 10
step = int(N/n)
sample = systematic_sampling(population_df, step)

selected index is: [   1 1001 2001 3001 4001 5001 6001 7001 8001 9001]
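
The description of systematic sampling calls for a random starting point, while the cell above always starts at position 1. A minimal sketch of the random-start variant follows. The helper name `systematic_sample_positions` and the seed are hypothetical and chosen for illustration only.

```python
import numpy as np

def systematic_sample_positions(population_size: int, sample_size: int, rng) -> np.ndarray:
    """Return the positions selected by systematic sampling with a random start."""
    step = population_size // sample_size   # sampling interval
    start = rng.integers(0, step)           # random start in [0, step)
    # truncate in case population_size is not a multiple of sample_size
    return np.arange(start, population_size, step)[:sample_size]

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility
population = rng.normal(10, 2, 10000)

positions = systematic_sample_positions(len(population), 10, rng)
sample = population[positions]
print(positions)
```

Drawing the start from the first interval means every position in the population can be selected.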


Cluster Sampling

A probability sampling technique.

The population is divided into multiple clusters (groups) based on certain clustering criteria.

Clusters are then selected at random, using random sampling or systematic sampling.

In [4]:
import numpy as np
import pandas as pd

# Generating Population data 

# price_vb generated using a Uniform distribution
price_vb = pd.Series(np.random.uniform(1,4,size = N))

#Id, as simple as that
id = pd.Series(np.arange(0,len(price_vb),1))

#event type, categorical variable with 3 possible outputs: type1, type2, type3
event_type = pd.Series(np.random.choice(["type1","type2","type3"],size = len(price_vb)))

#Binary variable: 0 - no click ; 1 - click
click = pd.Series(np.random.choice([0,1],size = len(price_vb)))
df = pd.concat([id,price_vb,event_type, click],axis = 1)
df.columns = ["id","price","event_type", "click"]
df

        id     price event_type  click
0        0  1.983634      type3      0
1        1  2.075907      type1      0
2        2  2.519015      type3      0
3        3  1.314358      type1      1
4        4  3.281814      type1      0
...    ...       ...        ...    ...
9995  9995  2.584460      type3      1
9996  9996  2.929263      type3      1
9997  9997  3.737684      type1      0
9998  9998  3.115006      type2      1


Note that Cluster Sampling usually produces a random sample, but it does not address bias in the created sample.

In [17]:
def get_clustered_Sample(df: pd.DataFrame, n_per_cluster: int, num_select_clusters: int):
    N = len(df)
    K = int(N/n_per_cluster)
    clusters = []
    for k in range(K):
        # draw n_per_cluster rows at random and label them as cluster k
        sample_k = df.sample(n_per_cluster).assign(cluster = k)
        # remove the drawn rows so every row lands in exactly one cluster
        df = df.drop(index = sample_k.index)
        clusters.append(sample_k)
    data = pd.concat(clusters, axis = 0)

    # pick distinct clusters; np.random.randint could return the same cluster twice
    random_chosen_clusters = np.random.choice(K, size = num_select_clusters, replace = False)
    samples = data[data.cluster.isin(random_chosen_clusters)]
    return samples

sample = get_clustered_Sample(df = df, n_per_cluster = 100, num_select_clusters = 6)
sample

        id     price event_type  click  cluster
6359  6359  1.668066      type1      1        2
7556  7556  2.212269      type3      0        2
3544  3544  3.229351      type3      0        2
931    931  3.509514      type2      0        2
4941  4941  2.907123      type1      0        2
...    ...       ...        ...    ...      ...
7398  7398  3.807042      type2      0       96
4738  4738  3.761345      type1      0       96
4723  4723  2.173289      type3      0       96
8794  8794  3.358589      type2      0       96
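
The same two steps, forming clusters and then selecting whole clusters at random, can be sketched more compactly. The example below is an illustration on a small made-up DataFrame, not the notebook's `df`; the column names, the cluster size, and the seeds are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # assumption: seeded only for reproducibility

# Hypothetical population of 1,000 rows
population = pd.DataFrame({"id": np.arange(1000), "price": rng.uniform(1, 4, 1000)})

cluster_size = 100
# shuffle the rows, then assign consecutive blocks of 100 rows to clusters 0..9
shuffled = population.sample(frac=1, random_state=1).reset_index(drop=True)
shuffled["cluster"] = np.arange(len(shuffled)) // cluster_size

# choose 3 distinct clusters; replace=False rules out picking the same cluster twice
chosen = rng.choice(shuffled["cluster"].unique(), size=3, replace=False)
cluster_sample = shuffled[shuffled["cluster"].isin(chosen)]

print(sorted(chosen))
print(len(cluster_sample))  # 300
```

Because whole clusters are kept, the sample size is always the number of chosen clusters times the cluster size.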


Weighted Sampling.

A sampling technique in which each observation is selected with a probability proportional to the weight associated with it.

Weighted Sampling uses weights to compensate for observations having been selected with unequal probabilities (oversampling), for non-coverage, for non-response, and for other types of bias.

Weighted Sampling addresses bias in the sample by creating a sample that takes into account the proportions of the observation types in the population.

In [6]:
def get_weighted_sample(df: pd.DataFrame, n: int):
    def get_class_prob(x):
        # rows to draw from this group: n * (clicks in the group) / (clicks in the population)
        # each group size is rounded, so the total can differ slightly from n
        weight_x = int(np.rint(n * len(x[x.click != 0]) / len(df[df.click != 0])))
        sampled_x = x.sample(weight_x).reset_index(drop=True)
        return sampled_x

    # we are grouping by the target class we use for the proportions
    weighted_sample = df.groupby('event_type').apply(get_class_prob)
    print(weighted_sample["event_type"].value_counts())
    return weighted_sample

sample = get_weighted_sample(df,100)
sample

type2    34
type3    34
type1    33
Name: event_type, dtype: int64


                  id     price event_type  click
event_type
type1      0     333  1.885327      type1      0
type1      1    1221  1.554620      type1      0
type1      2    8312  2.455083      type1      0
type1      3    7790  2.301309      type1      0
type1      4    8041  1.355008      type1      0
...        ...   ...       ...        ...    ...
type3      29   9512  3.340707      type3      0
type3      30   1805  3.243580      type3      0
type3      31   3468  3.116329      type3      0
type3      32   2944  2.217287      type3      1
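
The function above sets group sizes from click proportions. A more direct reading of "probability proportional to weight" is to give each row its own weight and let pandas draw accordingly. The sketch below does this with the `weights` argument of `DataFrame.sample`; the data, the 3-to-1 weighting, and the seeds are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # assumption: seeded only for reproducibility

# Hypothetical data: clicked rows get three times the weight of non-clicked rows
data = pd.DataFrame({
    "id": np.arange(1000),
    "click": rng.choice([0, 1], size=1000),
})
data["weight"] = np.where(data["click"] == 1, 3.0, 1.0)

# pandas normalizes the weights to sum to 1 and uses them as selection probabilities
weighted = data.sample(n=100, weights="weight", random_state=2)

# share of clicked rows in the sample; it tends to be well above the roughly 0.5 in the data
print(weighted["click"].mean())
```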


Stratified Sampling


A sampling approach in which the population is divided into homogeneous subpopulations called strata. The division is based on specific characteristics.

Every member of the population studied should be in exactly one stratum.

Each stratum is then sampled with another probability sampling method, such as Cluster Sampling, which allows statistical measures to be estimated for each subpopulation. Stratified Sampling is used when the population's characteristics are diverse. It ensures that every characteristic is properly represented in the sample.

In [7]:
def get_stratified_sample(df: pd.DataFrame, n: int, num_clusters_needed: int):
    N = len(df)
    num_obs_per_cluster = int(N/n)
    K = int(N/num_obs_per_cluster)

    def get_weighted_sample(df, num_obs_per_cluster):
        def get_sample_per_class(x):
            # rows to draw from this stratum, proportional to its share of clicks
            n_x = int(np.rint(num_obs_per_cluster*len(x[x.click != 0])/len(df[df.click != 0])))
            sample_x = x.sample(n_x)
            return sample_x
        weighted_sample = df.groupby("event_type").apply(get_sample_per_class)
        return weighted_sample

    strata = []
    for k in range(K):
        # each cluster is drawn from the full population,
        # so the same row can appear in more than one cluster
        weighted_sample_k = get_weighted_sample(df, num_obs_per_cluster).reset_index(drop = True)
        weighted_sample_k["cluster"] = np.repeat(k, len(weighted_sample_k))
        strata.append(weighted_sample_k)
    strata = pd.concat(strata, axis = 0)
    # pick distinct clusters; np.random.randint could return the same cluster twice
    selected_strata_clusters = np.random.choice(K, size = num_clusters_needed, replace = False)
    stratified_samples = strata[strata.cluster.isin(selected_strata_clusters)]
    return stratified_samples

sample = get_stratified_sample(df = df, n = 100, num_clusters_needed = 2)
sample

      id     price event_type  click  cluster
0   6200  3.823171      type1      1       32
1   3811  3.262055      type1      1       32
2   5073  1.712319      type1      0       32
3   2217  1.682040      type1      0       32
4   2000  3.579834      type1      1       32
..   ...       ...        ...    ...      ...
96  7866  2.328389      type3      0       43
97  7578  1.654607      type3      0       43
98  6163  3.061433      type3      0       43
99  7075  1.422799      type3      1       43
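
For comparison, a common and simpler form of stratified sampling draws the same fraction from every stratum, so each stratum keeps its share of the population. The sketch below is an alternative to the function above, not a rewrite of it. It assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`) and uses a made-up DataFrame with unequal strata sizes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # assumption: seeded only for reproducibility

# Hypothetical population with strata of 500, 300 and 200 rows
data = pd.DataFrame({
    "id": np.arange(1000),
    "event_type": np.repeat(["type1", "type2", "type3"], [500, 300, 200]),
    "price": rng.uniform(1, 4, 1000),
})

# draw 10% from every stratum, so each stratum keeps its population share
stratified = data.groupby("event_type").sample(frac=0.1, random_state=3)

print(stratified["event_type"].value_counts().sort_index())
# type1 50, type2 30, type3 20
```

Here every row belongs to exactly one stratum, and the sample reproduces the 50/30/20 split of the population.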
