# Sampling Methods

* **Sampling** is a *process used in statistical analysis in which a predetermined number of observations are taken from a larger population.*

---

## Simple Random Sampling
* **Simple random sampling** is the *basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population).* Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection. 

![Simple random sampling of a sample “n” of 3 from a population “N” of 12. Image: Dan Kernler |Wikimedia Commons](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2014/12/Simple_random_sampling-300x231.png)
*Simple random sampling of a sample “n” of 3 from a population “N” of 12. Image: Dan Kernler |Wikimedia Commons*

* Technically, a simple random sample is a set of n objects in a population of N objects where all possible samples are equally likely to happen. Here’s a basic example of how to get a simple random sample: put 100 numbered bingo balls into a bowl (this is the population N). Select 10 balls from the bowl without looking (this is your sample n). Note that it’s important not to look as you could (unknowingly) bias the sample. While the “lottery bowl” method can work fine for smaller populations, in reality you’ll be dealing with much larger populations.
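
The bingo-ball draw described above can be sketched with Python's standard library. This is only an illustration: the seed value and the variable names are assumptions, not part of the original example.

```python
import random

rng = random.Random(42)  # seed chosen arbitrarily so the draw is repeatable

population = list(range(1, 101))     # 100 numbered balls (N = 100)
sample = rng.sample(population, 10)  # draw n = 10 balls without replacement

print(sorted(sample))
```

`random.sample` draws without replacement, so no ball can appear twice, and every subset of 10 balls is equally likely.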

![](https://research-methodology.net/wp-content/uploads/2015/04/Simple-random-sampling2.png)

In [None]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5000, 4), columns=list('ABCD'))

In [None]:
df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | -2.552401 | 0.155559 | -0.145723 | 0.034853 |
| 1 | -0.532157 | -1.04227 | 0.562122 | 0.237846 |
| 2 | -0.921828 | -2.538457 | 0.33413 | -0.296985 |
| 3 | 0.524958 | 0.225118 | -0.993656 | -1.057299 |
| 4 | 0.14132 | 0.672611 | -0.516755 | 0.193816 |


In [None]:
sample_df = df.sample(100)

In [None]:
sample_df.shape

(100, 4)

In [None]:
sample_df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 4719 | 0.071678 | -1.662591 | -0.212507 | 0.194211 |
| 4219 | 1.219382 | -0.672356 | 1.048785 | -2.135515 |
| 645 | -0.587713 | -0.512634 | -1.156702 | -0.865832 |
| 1933 | -0.070003 | -0.964697 | 1.0682 | -0.703325 |
| 3582 | -1.210951 | 0.827845 | -0.412599 | 1.884682 |


## Stratified Sampling

---

* **Stratified random sampling** is a method of sampling that *involves the division of a population into smaller sub-groups known* as **strata**. In stratified random sampling, or stratification, the strata are formed based on members' shared attributes or characteristics, such as income or educational attainment.

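* **Stratified random sampling** is also called *proportional random sampling*. It is sometimes loosely called *quota random sampling*, although quota sampling in the strict sense picks the units within each group non-randomly.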

![](https://image.slidesharecdn.com/sampling-stratifiedvscluster-170115160432/95/sampling-stratified-vs-cluster-2-638.jpg?cb=1484496290)

##### Assume that we need to estimate the average number of votes for each candidate in an election. Assume that the country has 3 towns:
* Town A has 1 million factory workers,
* Town B has 2 million workers, and
* Town C has 3 million retirees.
* We could draw a simple random sample of size 60 from the entire population, but there is some chance that the sample turns out to be poorly balanced across the towns and is therefore biased, causing a significant estimation error.
* If we instead draw random samples of 10, 20 and 30 from Towns A, B and C respectively (proportional to their populations), we can achieve a smaller estimation error for the same total sample size.
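
The per-town sample sizes in this example follow from proportional allocation, which can be sketched as below. The dictionary layout and variable names are assumptions made for illustration.

```python
# Town populations from the example above
populations = {"A": 1_000_000, "B": 2_000_000, "C": 3_000_000}
total_sample_size = 60

total_population = sum(populations.values())

# Each town's share of the sample equals its share of the population
allocation = {
    town: round(total_sample_size * size / total_population)
    for town, size in populations.items()
}

print(allocation)  # {'A': 10, 'B': 20, 'C': 30}
```

The shares divide evenly here. In general the rounded values may not add up to the requested total, so a real allocation would need a rule for distributing the remainder.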

### Method

In [None]:
from sklearn.datasets import load_iris

In [None]:
X,y = load_iris(return_X_y=True)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,  # class labels: each split keeps the class proportions
                                                    test_size=0.25)

In [None]:
X_train.shape

(112, 4)

## Reservoir Sampling

---
![](https://kapilddatascience.files.wordpress.com/2015/06/reservoir.jpg)

* **Reservoir sampling** is a *family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number.* Typically n is large enough that the list doesn’t fit into main memory; examples include the streams of search queries handled by Google or Facebook.

![](https://image.slidesharecdn.com/t10part1-141208215154-conversion-gate02/95/sampling-for-big-data-1-21-638.jpg?cb=1418075560)


In [None]:
import random

def generator(limit):
    # Yield the integers 1..limit one at a time, simulating a stream
    for number in range(1, limit + 1):
        yield number

# Create a stream generator
stream = generator(10000)

# Reservoir sampling from the stream
k = 5
reservoir = []
for i, element in enumerate(stream):
    if i < k:
        # Fill the reservoir with the first k elements
        reservoir.append(element)
    else:
        # Keep element number i+1 with probability k/(i+1)
        probability = k / (i + 1)
        if random.random() < probability:
            # Replace one of the k items already selected, chosen uniformly
            reservoir[random.randrange(k)] = element
print(reservoir)

[6859, 7151, 2308, 4500, 4533]


It can be proved mathematically that, once n elements of the stream have been processed, each of them is in the reservoir with the same probability, k/n.
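
The uniformity claim can also be checked empirically. The sketch below wraps the same algorithm in a hypothetical helper, `reservoir_sample`, and counts how often each item of a 10-element stream ends up in a reservoir of size 3. The seed and the number of trials are arbitrary choices.

```python
import random
from collections import Counter

def reservoir_sample(stream, k, rng):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements
            reservoir.append(element)
        elif rng.random() < k / (i + 1):
            # Replace a uniformly chosen slot with the new element
            reservoir[rng.randrange(k)] = element
    return reservoir

rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
n, k, trials = 10, 3, 20000

counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(n), k, rng))

# Each observed frequency should be close to k/n = 0.3
frequencies = {item: counts[item] / trials for item in range(n)}
print(frequencies)
```

If the algorithm is uniform, every frequency should sit near 0.3, whether the item arrived early or late in the stream.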

## Random Undersampling and Oversampling

---

![](https://miro.medium.com/max/700/0*u6pKLqdCDsG_5kXa.png)

* A widely adopted technique for dealing with highly imbalanced datasets is called resampling. It consists of *removing samples from the majority class* (**under-sampling**) and/or *adding more examples from the minority class* (**over-sampling**).

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=100, random_state=10
)
X = pd.DataFrame(X)
X['target'] = y

In [None]:
X

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.327419 | -0.123936 | 0.377707 | -0.650123 | 0.267562 | 1.228781 | 2.208772 | -0.185977 | 0.238732 | -2.565438 | -0.383111 | 0.644056 | 0.104375 | -1.703024 | -0.510083 | -0.108812 | -0.230132 | 1.553707 | 1.497538 | -1.476485 | 0 |
| 1 | -0.843981 | -0.018691 | -0.841018 | 1.374583 | 0.157199 | -0.599719 | 2.217041 | -2.032194 | -2.310214 | -0.490477 | -0.304583 | 1.360939 | -1.844740 | -0.341096 | 0.137243 | 1.704764 | 0.464255 | 1.225786 | -0.842880 | 1.303258 | 0 |
| 2 | -0.204642 | 0.472155 | -0.140616 | -2.902493 | -1.513665 | 1.149545 | 2.283673 | -0.809117 | -1.723535 | -0.958556 | -0.259129 | -0.279701 | -1.431391 | 0.260146 | -0.501306 | -2.320545 | 0.422214 | 1.386474 | -0.073335 | 0.586859 | 0 |
| 3 | 0.208274 | -0.156982 | 0.063369 | -0.545759 | -0.395416 | -2.679969 | 1.507772 | 0.391485 | -0.487337 | -0.946147 | 0.339852 | -1.011854 | -1.124795 | 0.347291 | -1.078836 | 0.046923 | -0.978324 | 1.100517 | -0.697134 | 0.339577 | 0 |
| 4 | 0.785568 | 0.208472 | 0.760082 | -0.046130 | 0.310844 | -0.403927 | 1.462897 | 0.962173 | -0.520996 | 1.647360 | 0.146524 | 0.316792 | -0.261528 | -1.260698 | 0.822700 | 0.141031 | -0.294805 | 2.216364 | -1.129875 | -1.059984 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | -0.218414 | 0.393699 | 0.653777 | -1.312950 | -1.025031 | 0.134849 | 3.840511 | -0.804299 | -3.165732 | -2.021668 | -0.681023 | -0.581732 | -0.151500 | -0.150625 | 0.092327 | -0.251294 | -0.192647 | 2.009518 | -1.362301 | 2.664974 | 0 |
| 96 | 0.212049 | -1.717439 | 0.306482 | 1.734235 | -2.058806 | -0.417814 | 0.376990 | -0.834218 | -1.355098 | 1.287204 | 0.492321 | -0.295757 | -2.241017 | -0.534599 | 0.533367 | 0.891498 | 0.910063 | 0.349467 | 1.161048 | 0.556070 | 0 |
| 97 | -0.011497 | -0.670154 | -0.509845 | 0.703307 | 0.857470 | 0.257115 | 1.448881 | -0.093277 | -0.259108 | -1.068263 | 1.526283 | -0.359560 | -0.688662 | 2.041108 | -2.756415 | 1.373629 | -0.388022 | 1.090378 | -1.388104 | -0.304678 | 0 |
| 98 | 1.107590 | 1.034863 | 0.461980 | -0.423340 | -0.180139 | 0.624847 | 1.994488 | 2.654549 | -0.980035 | -1.141259 | 0.568099 | 0.366621 | -1.099235 | 0.514940 | -0.740902 | 0.321945 | 0.299279 | 1.336946 | -0.117431 | -0.324022 | 0 |


We can now do random oversampling and undersampling using:

In [None]:
num_0 = len(X[X['target']==0])
num_1 = len(X[X['target']==1])
print(num_0,num_1)
# random undersample
undersampled_data = pd.concat([ X[X['target']==0].sample(num_1) , X[X['target']==1] ])
print(len(undersampled_data))
# random oversample
oversampled_data = pd.concat([ X[X['target']==0] , X[X['target']==1].sample(num_0, replace=True) ])
print(len(oversampled_data))

90 10
20
180


## Undersampling and Oversampling using imbalanced-learn

* imbalanced-learn (`imblearn`) is a Python package for working with imbalanced datasets. It provides a variety of methods to undersample and oversample.

#### A. Undersampling using Tomek Links:
One such method is Tomek links. A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbors.
In this algorithm, we remove the majority-class element of each Tomek link, which tends to give a classifier a cleaner decision boundary.

![](https://miro.medium.com/max/700/0*huy_9J15wzYJ2o5S)
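
To make the idea concrete, here is a minimal pure-Python sketch that finds Tomek links as mutual nearest neighbors with different labels and then drops the majority-class member of each link. It is not how imbalanced-learn implements the method; the `tomek_links` helper and the toy points are hypothetical.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    def nearest(i):
        # Index of the closest other point (Euclidean distance)
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

# Hypothetical toy data: class 0 is the majority, class 1 the minority
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.1, 3.2)]
labels = [0, 0, 0, 1, 1, 1]

links = tomek_links(points, labels)
print(links)  # [(2, 3)]

# Undersample by dropping the majority-class member of each link
majority = 0
to_drop = {i for pair in links for i in pair if labels[i] == majority}
kept = [i for i in range(len(points)) if i not in to_drop]
print(kept)  # [0, 1, 3, 4, 5]
```

Only points 2 and 3 form a link: they are each other's nearest neighbors and belong to different classes, so the majority-class point 2 is removed.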

In [None]:
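from imblearn.under_sampling import TomekLinks

# Recent imbalanced-learn releases no longer accept ratio, return_indices or fit_sample
tl = TomekLinks(sampling_strategy='majority')
# Resample the features only; X also holds the 'target' column
X_tl, y_tl = tl.fit_resample(X.drop(columns='target'), y)
id_tl = tl.sample_indices_  # indices of the samples that were kept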

#### B. Oversampling using SMOTE:

In SMOTE (Synthetic Minority Oversampling Technique) we synthesize elements for the minority class, in the vicinity of already existing elements.

![](https://miro.medium.com/max/700/0*UrGYcz_Ab-HTo4-B.png)
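
The core interpolation step can be sketched in a few lines of plain Python: pick a minority point, pick one of its nearest minority neighbors, and place a new point somewhere on the segment between them. This is a simplified illustration, not the imbalanced-learn implementation; the `smote_like` helper, the seed and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbor))
        )
    return synthetic

rng = random.Random(0)  # arbitrary seed for repeatability
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]  # hypothetical points
new_points = smote_like(minority, n_new=5, k=2, rng=rng)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, the new examples stay in the vicinity of the original ones.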

In [None]:
from imblearn.over_sampling import SMOTE

# Resample the features only; X also holds the 'target' column
features = X.drop(columns='target')
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(features, y)

In [None]:
X_sm.shape, features.shape

((180, 20), (100, 20))

#### There are a variety of other methods in the imblearn package for both undersampling (Cluster Centroids, NearMiss, etc.) and oversampling (ADASYN and Borderline-SMOTE) that you can check out.

* For more, see the [imbalanced-learn documentation](https://imbalanced-learn.readthedocs.io/en/stable/index.html).