<center><h1><strong>Sampling</strong></h1></center>

Sampling is the process of selecting a subset of data from a large dataset.

Population = the full dataset <br>
Sample = the subset selected from it

In [1]:
import pandas as pd
import numpy as np

In [8]:
data = pd.DataFrame({
    "customer_id":range(1,1001),
    "age":np.random.randint(18,65,1000),
    "income":np.random.randint(20000,120000,1000),
    "purchased":np.random.choice([0,1],size=1000,p=[0.8,0.2])
})

The target `purchased` is imbalanced by design: roughly 80% No (0) and 20% Yes (1).
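The cell above draws from NumPy's global random state without a seed, so the exact values change on every run. A minimal sketch of a reproducible variant follows, assuming NumPy's `Generator` API is acceptable here; the name `data_seeded` and the seed value `0` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the synthetic data is identical on every run
# (the seed value 0 is an arbitrary, illustrative choice).
rng = np.random.default_rng(0)
n_rows = 1000

data_seeded = pd.DataFrame({
    "customer_id": range(1, n_rows + 1),
    "age": rng.integers(18, 65, size=n_rows),             # upper bound is exclusive: ages 18..64
    "income": rng.integers(20000, 120000, size=n_rows),
    "purchased": rng.choice([0, 1], size=n_rows, p=[0.8, 0.2]),
})

# Observed class shares; these will be close to, but not exactly, 0.8 / 0.2
print(data_seeded["purchased"].value_counts(normalize=True))
```

Because each row's label is drawn independently, the observed class shares only approximate the requested probabilities.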

In [10]:
data.head()

   customer_id  age  income  purchased
0            1   39   71661          0
1            2   42  103656          1
2            3   60   37435          0
3            4   26   99067          0
4            5   29  109374          0


In [22]:
data.groupby("purchased")["customer_id"].count()

purchased
0    798
1    202
Name: customer_id, dtype: int64

<h3><strong>Random Sampling</strong></h3>

Every record has an equal probability of being selected.

<code>When to use</code>
1. The data is fairly homogeneous, with no important subgroups <br>
2. Class imbalance is not a concern

In [11]:
sample_random = data.sample(n=200,random_state=42)

In [12]:
sample_random.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 521 to 78
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  200 non-null    int64
 1   age          200 non-null    int32
 2   income       200 non-null    int32
 3   purchased    200 non-null    int64
dtypes: int32(2), int64(2)
memory usage: 6.2 KB


In [14]:
sample_random.head()

     customer_id  age  income  purchased
521          522   36   31639          0
737          738   45   24994          0
740          741   21   57983          0
660          661   54   26760          0
411          412   54   48681          1


In [17]:
sample_random.groupby("purchased")["customer_id"].count()

purchased
0    154
1     46
Name: customer_id, dtype: int64

Using a fraction instead of a fixed count: `frac=0.2` selects 20% of the rows.

In [26]:
sample_random = data.sample(frac=0.2,random_state=30)

In [27]:
sample_random.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 923 to 961
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  200 non-null    int64
 1   age          200 non-null    int32
 2   income       200 non-null    int32
 3   purchased    200 non-null    int64
dtypes: int32(2), int64(2)
memory usage: 6.2 KB


In [28]:
sample_random.groupby("purchased")["customer_id"].count()

purchased
0    161
1     39
Name: customer_id, dtype: int64
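The two random samples above contain 46 and 39 purchasers out of 200 rows, while the population share of 20.2% would suggest about 40. The sketch below illustrates that spread by repeating the draw with several seeds. It builds its own small stand-in population (the names `population` and `sample_rates` are hypothetical) so that it runs independently of the cells above.

```python
import numpy as np
import pandas as pd

# Stand-in population with an imbalanced binary target (about 80% / 20%)
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "customer_id": range(1, 1001),
    "purchased": rng.choice([0, 1], size=1000, p=[0.8, 0.2]),
})

# Share of purchased == 1 in the full population
population_rate = population["purchased"].mean()

# Share of purchased == 1 in 20 simple random samples of 200 rows each
sample_rates = [
    population.sample(n=200, random_state=seed)["purchased"].mean()
    for seed in range(20)
]

print(f"population rate: {population_rate:.3f}")
print(f"sample rates: {min(sample_rates):.3f} to {max(sample_rates):.3f}")
```

The sample rates scatter around the population rate rather than matching it, which is the behaviour that stratified sampling is meant to remove.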

<h3 style="color:red;"><strong>Stratified Sampling</strong></h3>

Stratified sampling keeps the class proportions in the sample the same as in the population.

<code>When to use</code>
1. Classification problems<br>
2. Imbalanced Data

In [29]:
from sklearn.model_selection import train_test_split

X = data.drop("purchased",axis=1)
y = data["purchased"]

In [30]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)

In [34]:
y.value_counts(normalize=True)

purchased
0    0.798
1    0.202
Name: proportion, dtype: float64

In [33]:
y_train.value_counts(normalize=True)

purchased
0    0.7975
1    0.2025
Name: proportion, dtype: float64
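`train_test_split` stratifies a train/test split. To draw a single stratified sample straight from a DataFrame, one option is `DataFrameGroupBy.sample`, available in pandas 1.1 and later. The sketch below assumes that pandas version and uses a hypothetical stand-in `population` frame so it runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in population with an imbalanced binary target (about 80% / 20%)
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "customer_id": range(1, 1001),
    "purchased": rng.choice([0, 1], size=1000, p=[0.8, 0.2]),
})

# Draw 20% of the rows from each class separately (requires pandas >= 1.1)
stratified_sample = population.groupby("purchased").sample(frac=0.2, random_state=42)

# The class shares in the sample should closely match those of the population
print(population["purchased"].value_counts(normalize=True))
print(stratified_sample["purchased"].value_counts(normalize=True))
```

Sampling the same fraction within each class is what preserves the proportions; any small difference comes from rounding the per-class row counts.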

<h3><strong>Systematic Sampling</strong></h3>

Pick every k-th record after a random start.

If N = 1000 and we want n = 100 samples, then k = N / n = 10.

In [35]:
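k = 10  # sampling interval: N / n = 1000 / 100
# Simplification: this starts at row 0; a random start would slice from an offset in 0..k-1
systematic_sample = data.iloc[::k]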

In [37]:
systematic_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 990
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  100 non-null    int64
 1   age          100 non-null    int32
 2   income       100 non-null    int32
 3   purchased    100 non-null    int64
dtypes: int32(2), int64(2)
memory usage: 2.5 KB
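The cell above always starts at row 0, whereas the definition calls for a random start. A minimal sketch of that variant follows, using a hypothetical stand-in `population` frame and an arbitrary seed so that it runs independently.

```python
import numpy as np
import pandas as pd

# Stand-in population of 1000 records
population = pd.DataFrame({"customer_id": range(1, 1001)})

n_samples = 100
k = len(population) // n_samples      # sampling interval, here 10

# Random starting offset in 0..k-1 (the seed 42 is an arbitrary choice)
rng = np.random.default_rng(42)
start = int(rng.integers(0, k))

# Take every k-th row beginning at the random offset
systematic_sample = population.iloc[start::k]

print(start, len(systematic_sample))
```

Note that systematic sampling can give a biased sample if the row order has a periodic pattern that lines up with k, so shuffling or checking the ordering first may be worthwhile.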
