python_stratified_sampling

This is a helper Python module intended to be used alongside pandas. It draws stratified samples from a dataframe based on the given strata.

Documentation

stratified_sample(df, strata, size=None, seed=None)

It samples data from a pandas dataframe using strata. These functions use proportionate stratification:

    n1 = (N1 / N) * n

    where:
        - n1 is the sample size of stratum 1
        - N1 is the population size of stratum 1
        - N is the total population size
        - n is the sampling size
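As a rough illustration of the proportionate allocation formula above, the sketch below applies n1 = (N1 / N) * n to each stratum of a made-up population. The function name `proportionate_allocation` and the stratum counts are hypothetical and are not part of this module.

```python
def proportionate_allocation(stratum_sizes, n):
    """Return the per-stratum sample size n_h = (N_h / N) * n, rounded."""
    total = sum(stratum_sizes.values())  # N, the total population size
    return {
        stratum: round(n * size / total)  # n_h for this stratum
        for stratum, size in stratum_sizes.items()
    }


# Hypothetical population of 1,000 rows split over three strata.
strata_counts = {("M", "XYZ"): 500, ("F", "ZXY"): 300, ("M", "YZX"): 200}
allocation = proportionate_allocation(strata_counts, n=100)
print(allocation)
# {('M', 'XYZ'): 50, ('F', 'ZXY'): 30, ('M', 'YZX'): 20}
```

Because each stratum is rounded independently, the allocated sizes may not always add up to exactly n when the proportions do not divide evenly.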

Parameters
----------
:df: pandas dataframe from which data will be sampled.
:strata: list containing columns that will be used in the stratified sampling.
:size: sampling size. A fractional value such as 0.3 requests that proportion
    of the original data (see Examples). If not provided, a sampling size will
    be calculated using Cochran's adjusted sampling formula:
    cochran_n = (Z**2 * p * q) / e**2

    where:
        - Z is the z-value. In this case we use 1.96, which corresponds to a
            95% confidence level
        - p is the estimated proportion of the population which has an
            attribute. In this case we use 0.5
        - q is 1 - p
        - e is the margin of error

    This formula is adjusted as follows:
    adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))

    where:
        - cochran_n = result of the previous formula
        - N is the population size

Returns
-------
A sampled pandas dataframe based on a set of strata.

Examples
--------
>>> df.head()
    id sex  age city
0  123   M   20  XYZ
1  456   M   25  XYZ
2  789   M   21  YZX
3  987   F   40  ZXY
4  654   M   45  ZXY
...
# This returns a sample stratified by sex and city containing 30% of the size of
# the original data
>>> stratified = stratified_sample(df=df, strata=['sex', 'city'], size=0.3)

Requirements
------------
- pandas
- numpy

stratified_sample_report(df, strata, size=None)

Generates a dataframe reporting the counts in each stratum and the counts
for the final sampled dataframe.

Parameters
----------
:df: pandas dataframe from which data will be sampled.
:strata: list containing columns that will be used in the stratified sampling.
:size: sampling size. If not provided, a sampling size will be calculated
    using Cochran's adjusted sampling formula:
    cochran_n = (Z**2 * p * q) / e**2

    where:
        - Z is the z-value. In this case we use 1.96, which corresponds to a
            95% confidence level
        - p is the estimated proportion of the population which has an
            attribute. In this case we use 0.5
        - q is 1 - p
        - e is the margin of error

    This formula is adjusted as follows:
    adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))

    where:
        - cochran_n = result of the previous formula
        - N is the population size

Returns
-------
A dataframe reporting the counts in each stratum and the counts
for the final sampled dataframe.
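To give a feel for what such a report contains, here is a minimal sketch built directly with pandas rather than with this module. The toy data, the 50% sampling fraction, and the column names `size` and `samp_size` are assumptions made for illustration; the actual output of stratified_sample_report may be laid out differently.

```python
import pandas as pd

# Toy data: 4 rows in stratum (M, XYZ) and 6 rows in stratum (F, ZXY).
df = pd.DataFrame({
    "sex": ["M"] * 4 + ["F"] * 6,
    "city": ["XYZ"] * 4 + ["ZXY"] * 6,
})
strata = ["sex", "city"]
fraction = 0.5  # assumed sampling fraction (n / N)

# Count the rows in each stratum.
report = df.groupby(strata).size().reset_index(name="size")

# Proportionate allocation: each stratum keeps the same fraction of its rows.
report["samp_size"] = (report["size"] * fraction).round().astype(int)
print(report)
```

Each row of the resulting frame describes one stratum: its population count and the number of rows a proportionate sample would draw from it.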

__smpl_size(population, size)

An internal function to compute the sample size. If size is not provided, a sampling size will be calculated using Cochran's adjusted sampling formula:
    cochran_n = (Z**2 * p * q) / e**2

    where:
        - Z is the z-value. In this case we use 1.96, which corresponds to a
            95% confidence level
        - p is the estimated proportion of the population which has an
            attribute. In this case we use 0.5
        - q is 1 - p
        - e is the margin of error

    This formula is adjusted as follows:
    adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))

    where:
        - cochran_n = result of the previous formula
        - N is the population size

Parameters
----------
:population: population size
:size: sample size (default = None)

Returns
-------
Calculated sample size to be used in the functions:
    - stratified_sample
    - stratified_sample_report
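The Cochran calculation described in this section can be sketched as follows. This is an illustrative reimplementation, not the module's own code: the function name `cochran_sample_size` is hypothetical, and the default margin of error of 0.05 is an assumption, since the value used internally is not stated here.

```python
def cochran_sample_size(population, z=1.96, p=0.5, e=0.05):
    """Cochran's sample size with the finite-population adjustment.

    z and p follow the values documented above; e=0.05 is an assumed
    margin of error chosen only for this example.
    """
    q = 1 - p
    cochran_n = (z ** 2 * p * q) / e ** 2
    # The whole term 1 + (cochran_n - 1) / N is the denominator.
    adjusted = cochran_n / (1 + (cochran_n - 1) / population)
    return round(adjusted)


print(cochran_sample_size(1_000))          # 278
print(cochran_sample_size(1_000_000_000))  # 384
```

For a very large population the adjustment has almost no effect and the result approaches the unadjusted Cochran value; for small populations it reduces the required sample noticeably.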
