# Stratified Sampling in Python
#### This kernel gives a simple solution for stratified sampling in Python.

By one common definition, "in statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations." This method can be advantageous because it keeps in the sample the same proportion of each stratum (each combination of values of the chosen variables) that is present in the population. A simple random sample does not guarantee this.
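As a minimal sketch of that idea, the snippet below draws a 10% sample per stratum from a toy population and compares the stratum shares before and after. The `population` frame and its `amount` column are invented for illustration; `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Toy population (hypothetical data): 70% MasterCard, 30% VISA
population = pd.DataFrame({
    "card_type": ["MasterCard"] * 700 + ["VISA"] * 300,
    "amount": range(1000),
})

# Draw 10% of the rows from each stratum separately
stratified = population.groupby("card_type").sample(frac=0.1, random_state=0)

# Share of each stratum in the population and in the sample
pop_share = population["card_type"].value_counts(normalize=True)
sample_share = stratified["card_type"].value_counts(normalize=True)
print(pop_share)
print(sample_share)
```

Because every stratum is sampled at the same rate, the two sets of shares match, which a simple random sample of the same size would only do approximately.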

There are many use cases for stratified sampling. The main idea is that we want to mimic a population with a sample. It is widely used to generate comparable samples when the objective is to perform hypothesis testing of any kind.

I have faced this situation many times, so I developed a Python module with functions that perform stratified sampling on a pandas DataFrame. I hope it can be useful in your endeavors.

For this example I will use a dataset of ATM transactions combined with weather information. It is useful here because it contains many variables that can serve as strata.



In [1]:
# Required libraries
import pandas as pd
from tqdm import tqdm
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('Data_edited2.csv')
df.shape

(1250000, 34)

In [3]:
df=df.drop(['id','year','message_text','service','atm_status','atm_lat', 'atm_lon',
       'atm_manufacturer','atm_location','atm_streetname',
       'atm_street_number','atm_zipcode','weather_lat',
       'weather_lon','weather_city_id','weather_city_name','weather_description','weather_id'],axis=1)
df.shape

(1250000, 16)

In [4]:
df['rain_3h']=df['rain_3h'].fillna(0)
df.message_code=df.message_code.fillna(1)

In [5]:
df['temp_band']=0
df.loc[df['temp']<=280,'temp_band']=0
df.loc[(df['temp']>280)&(df['temp']<=301.15),'temp_band']=1
df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,humidity,wind_speed,wind_deg,rain_3h,clouds_all,weather_main,temp_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,87,7,260,0.215,92,Rain,1
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,93,9,250,0.59,92,Rain,1
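The banding cells in this notebook all follow the same pattern of one `df.loc` assignment per band. As a hedged alternative, the same temperature bands can be sketched with `pd.cut`. The `temps` series below is made-up sample data, and the final `fillna(0)` is only there to mimic the original behaviour, where values above the last threshold keep the initial band 0.

```python
import numpy as np
import pandas as pd

# Made-up temperatures in Kelvin, including one value above 301.15
temps = pd.Series([275.0, 280.0, 280.64, 281.15, 301.15, 305.0])

# Right-closed bins: (-inf, 280] -> 0, (280, 301.15] -> 1
# labels=False returns the integer bin position instead of a category
band = pd.cut(temps, bins=[-np.inf, 280, 301.15], labels=False)

# Values outside every bin come back as NaN; the original cell leaves them at 0
band = band.fillna(0).astype(int)
print(band.tolist())
```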


In [6]:
df['day_band']=0
df.loc[df['day']<=15,'day_band']=0
df.loc[(df['day']>15)&(df['day']<=31),'day_band']=1
df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,humidity,wind_speed,wind_deg,rain_3h,clouds_all,weather_main,temp_band,day_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,87,7,260,0.215,92,Rain,1,0
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,93,9,250,0.59,92,Rain,1,0


In [7]:
df['hour_band']=0
df.loc[df['hour']<=8,'hour_band']=0
df.loc[(df['hour']>8)&(df['hour']<=16),'hour_band']=1
df.loc[(df['hour']>16)&(df['hour']<=24),'hour_band']=2
df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,humidity,wind_speed,wind_deg,rain_3h,clouds_all,weather_main,temp_band,day_band,hour_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,87,7,260,0.215,92,Rain,1,0,0
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,93,9,250,0.59,92,Rain,1,0,0


In [9]:
df.wind_speed.min()

0

In [8]:
df['pressure_band']=0
df.loc[df['pressure']<=1014,'pressure_band']=0
df.loc[(df['pressure']>1014)&(df['pressure']<=1057),'pressure_band']=1
df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,humidity,wind_speed,wind_deg,rain_3h,clouds_all,weather_main,temp_band,day_band,hour_band,pressure_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,87,7,260,0.215,92,Rain,1,0,0,0
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,93,9,250,0.59,92,Rain,1,0,0,1


In [9]:
df['humidity_band']=0
df.loc[df['humidity']<=68,'humidity_band']=0
df.loc[(df['humidity']>68)&(df['humidity']<=124),'humidity_band']=1
df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,...,wind_speed,wind_deg,rain_3h,clouds_all,weather_main,temp_band,day_band,hour_band,pressure_band,humidity_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,...,7,260,0.215,92,Rain,1,0,0,0,1
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,...,9,250,0.59,92,Rain,1,0,0,1,1


In [10]:
df['wind_speed_band']=0
df.loc[df['wind_speed']<=40,'wind_speed_band']=0
df.loc[(df['wind_speed']>40)&(df['wind_speed']<=77),'wind_speed_band']=1

df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,...,wind_deg,rain_3h,clouds_all,weather_main,temp_band,day_band,hour_band,pressure_band,humidity_band,wind_speed_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,...,260,0.215,92,Rain,1,0,0,0,1,0
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,...,250,0.59,92,Rain,1,0,0,1,1,0


In [11]:
# df['wind_deg_band']=0
# df.loc[df['wind_deg']<=90,'wind_deg_band']=0
# df.loc[(df['wind_deg']>90)&(df['wind_deg']<=180),'wind_deg_band']=1
# df.loc[(df['wind_deg']>180)&(df['wind_deg']<=270),'wind_deg_band']=2
# df.loc[(df['wind_deg']>270)&(df['wind_deg']<=360),'wind_deg_band']=3
# df.head(2)
df=df.drop('wind_deg',axis=1)

In [12]:
df['rain_3h_band']=0
df.loc[df['rain_3h']<=0.002,'rain_3h_band']=0
df.loc[(df['rain_3h']>0.002),'rain_3h_band']=1

df.head(2)

Unnamed: 0,month,day,weekday,hour,atm_id,currency,card_type,message_code,temp,pressure,...,rain_3h,clouds_all,weather_main,temp_band,day_band,hour_band,pressure_band,humidity_band,wind_speed_band,rain_3h_band
0,January,1,Sunday,0,1,DKK,MasterCard,1.0,281.15,1014,...,0.215,92,Rain,1,0,0,0,1,0,1
1,January,1,Sunday,0,2,DKK,MasterCard,1.0,280.64,1020,...,0.59,92,Rain,1,0,0,1,1,0,1


In [13]:
df=df.drop(['temp','day','hour','pressure','humidity','wind_speed','rain_3h','clouds_all'],axis=1)

In [14]:
# the functions:
def stratified_sample(df, strata, size=None, seed=None, keep_index= True):
    '''
    It samples data from a pandas dataframe using strata. These functions use
    proportionate stratification:
    n1 = (N1/N) * n
    where:
        - n1 is the sample size of stratum 1
        - N1 is the population size of stratum 1
        - N is the total population size
        - n is the sampling size
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    :seed: sampling seed
    :keep_index: if True, it keeps a column with the original population index indicator
    
    Returns
    -------
    A sampled pandas dataframe based on a set of strata.
    Examples
    --------
    >> df.head()
    	id  sex age city 
    0	123 M   20  XYZ
    1	456 M   25  XYZ
    2	789 M   21  YZX
    3	987 F   40  ZXY
    4	654 M   45  ZXY
    ...
    # This returns a sample stratified by sex and city containing 30% of the size of
    # the original data
    >> stratified = stratified_sample(df=df, strata=['sex', 'city'], size=0.3)
    Requirements
    ------------
    - pandas
    - numpy
    '''
    population = len(df)
    size = __smpl_size(population, size)
    # copy the strata columns so that adding 'size' does not write to a view of df
    tmp = df[strata].copy()
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)

    # controlling variable to create the dataframe or append to it
    first = True 
    for i in tqdm(range(len(tmp_grpd))):
        # query generator for each iteration
        qry=''
        for s in range(len(strata)):
            stratum = strata[s]
            value = tmp_grpd.iloc[i][stratum]
            n = tmp_grpd.iloc[i]['samp_size']

            if isinstance(value, str):
                value = "'" + str(value) + "'"
            
            if s != len(strata)-1:
                qry = qry + stratum + ' == ' + str(value) +' & '
            else:
                qry = qry + stratum + ' == ' + str(value)
        
        # final dataframe
        if first:
            stratified_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
            first = False
        else:
            tmp_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            stratified_df = pd.concat([stratified_df, tmp_df], ignore_index=True)
    
    return stratified_df



def stratified_sample_report(df, strata, size=None):
    '''
    Generates a dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Returns
    -------
    A dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    '''
    population = len(df)
    size = __smpl_size(population, size)
    # copy the strata columns so that adding 'size' does not write to a view of df
    tmp = df[strata].copy()
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)
    return tmp_grpd


def __smpl_size(population, size):
    '''
    A function to compute the sample size. If not informed, a sampling 
    size will be calculated using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Parameters
    ----------
        :population: population size
        :size: sample size (default = None)
    Returns
    -------
    Calculated sample size to be used in the functions:
        - stratified_sample
        - stratified_sample_report
    '''
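    if size is None:
        # Cochran's formula with Z=1.96, p=q=0.5 and e=0.05, then the finite-population adjustment
        cochran_n = round(((1.96)**2 * 0.5 * 0.5)/ 0.05**2)
        n = round(cochran_n/(1+((cochran_n -1) /population)))
    elif size < 0:
        raise ValueError('Parameter "size" must be a non-negative integer or a proportion between 0 and 1.')
    elif size < 1:
        # a value below 1 is interpreted as a proportion of the population
        n = round(population * size)
    else:
        # a value of 1 or more is interpreted as an absolute number of rows
        n = size
    print(n)
    return n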

Note that the functions above already include documentation.
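To make the default sample size concrete, here is a small worked sketch of the Cochran calculation described in the docstrings. The function name `cochran_sample_size` and its keyword arguments are hypothetical; the defaults (Z = 1.96, p = 0.5, e = 0.05) are the constants hard-coded in `__smpl_size`.

```python
def cochran_sample_size(population, z=1.96, p=0.5, e=0.05):
    """Hypothetical helper mirroring the default branch of __smpl_size."""
    q = 1 - p
    # Cochran's formula for a very large population
    cochran_n = round((z ** 2 * p * q) / e ** 2)
    # Finite-population adjustment
    return round(cochran_n / (1 + (cochran_n - 1) / population))


# With the 1,250,000 rows used in this notebook the adjustment barely matters
print(cochran_sample_size(1_250_000))  # 384
# For a small population it reduces the sample size noticeably
print(cochran_sample_size(1_000))      # 278
```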

Let's first take a look at the `stratified_sample_report` function:

In [15]:
help(stratified_sample_report)

Help on function stratified_sample_report in module __main__:

stratified_sample_report(df, strata, size=None)
    Generates a dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of t

#### Creating a Stratified Sampled DataFrame
Let's take a look at the function's parameters:

In [16]:
help(stratified_sample)

Help on function stratified_sample in module __main__:

stratified_sample(df, strata, size=None, seed=None, keep_index=True)
    It samples data from a pandas dataframe using strata. These functions use
    proportionate stratification:
    n1 = (N1/N) * n
    where:
        - n1 is the sample size of stratum 1
        - N1 is the population size of stratum 1
        - N is the total population size
        - n is the sampling size
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
         

The help output shows that it takes essentially the same parameters as the previous function. The difference is that here we can pass a seed in order to get the same sample on every run. We can also ask it to keep the original DataFrame's index as a column or to discard it and create a new one.
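A minimal sketch of what those two options do, using plain pandas calls on a toy frame (the `toy` data is invented for illustration): `random_state` plays the role of `seed`, and `reset_index(drop=...)` is the call `stratified_sample` uses to honour `keep_index`.

```python
import pandas as pd

# Toy frame with a non-default index so the effect is visible
toy = pd.DataFrame({"card_type": ["MasterCard", "VISA"] * 5}, index=range(100, 110))

# The same seed yields the same rows on every call
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)
same = first.equals(second)

# keep_index=True corresponds to drop=False: the old index becomes an 'index' column
kept = first.reset_index(drop=False)
# keep_index=False corresponds to drop=True: the old index is discarded
dropped = first.reset_index(drop=True)

print(same)
print(list(kept.columns))
print(list(dropped.columns))
```

Keeping the index adds one column to the result, which is worth remembering when comparing the shape of the sample with the shape of the original DataFrame.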

Let's create a sample of 125,000 rows using all of the DataFrame's columns as strata:

In [None]:
df_sample = stratified_sample(df, list(df.columns), size=125000, seed=None, keep_index= True)
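The function above filters each stratum with a generated query string. As a hedged alternative sketch, the same proportionate allocation can be written by iterating over `groupby` groups directly, which avoids building query strings. The name `stratified_sample_groupby` and the `toy` data are hypothetical and not part of the module above.

```python
import pandas as pd


def stratified_sample_groupby(df, strata, size, seed=None):
    """Hypothetical helper: proportionate stratified sample via groupby iteration."""
    fraction = size / len(df)
    parts = []
    for _, group in df.groupby(strata):
        # Proportionate allocation: n_h = (N_h / N) * n, rounded to an integer
        n = int(round(fraction * len(group)))
        if n > 0:
            parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)


# Toy data: 60 MasterCard and 40 VISA rows, split evenly over two hour bands
toy = pd.DataFrame({
    "card_type": ["MasterCard"] * 60 + ["VISA"] * 40,
    "hour_band": [0, 1] * 50,
})

sample = stratified_sample_groupby(toy, ["card_type", "hour_band"], size=20, seed=0)
print(sample.groupby(["card_type", "hour_band"]).size())
```

This sketch assumes at least one stratum receives a non-zero allocation; otherwise `pd.concat` would be called on an empty list and raise an error.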

In [27]:
df_sample.shape

(52566, 17)
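The sample contains fewer rows than the 125,000 requested. One plausible contributor, given how `samp_size` is computed, is rounding: each stratum's allocation is rounded to an integer, so small strata can round down to zero. The toy numbers below are invented to show the effect and are not taken from the dataset.

```python
import pandas as pd

# Toy population: two large strata and three very small ones
population = pd.DataFrame(
    {"stratum": ["A"] * 60 + ["B"] * 30 + ["C"] * 4 + ["D"] * 3 + ["E"] * 3}
)
requested = 10

# Same allocation rule as in the functions above: round(n / N * N_h)
counts = population.groupby("stratum").size().rename("size").reset_index()
counts["samp_size"] = (requested / len(population) * counts["size"]).round().astype(int)

total_allocated = int(counts["samp_size"].sum())
print(counts)
print(total_allocated)  # 9, one short of the 10 requested
```

With many stratifying columns the number of small strata grows quickly, so the gap between the requested and the realised sample size can become large.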

In [28]:
# df.to_csv('all_data.csv', index=False)
df_sample.to_csv('sample_data.csv', index=False)

In [29]:
df_sample.shape

(76608, 15)