<center><h1><strong>Sampling Tutorial</strong></h1></center>

Sampling is the process of selecting a subset of data from a larger population. In machine learning, effective sampling ensures that your model is trained on data that accurately represents the real-world scenarios it will encounter.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,StratifiedKFold

Load the Titanic dataset, dropping rows with missing values in age, embarked, sex, or deck.

In [6]:
df = sns.load_dataset('titanic').dropna(subset=['age','embarked','sex','deck'])
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [7]:
df.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

<h3>Simple random sampling</h3>

In [11]:
# Select a fixed number of rows (n=100), drawn without replacement
random_sample = df.sample(n=100, random_state=1)

# Select 10% of the rows
fractional_sample = df.sample(frac=0.1, random_state=1)

In [None]:
fractional_sample.info()
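
As a small check of how the two calls above behave, the sketch below repeats them on a hypothetical toy frame (the name toy and its columns are made up for illustration; only the row count of 182 mirrors df). It assumes the pandas defaults: sampling without replacement, and a fractional sample size rounded to the nearest whole row.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (182 rows, like df).
toy = pd.DataFrame({"id": range(182), "flag": [i % 3 == 0 for i in range(182)]})

# Fixed-size sample: 100 distinct rows, drawn without replacement by default.
fixed = toy.sample(n=100, random_state=1)

# Fractional sample: 10% of 182 rows is 18.2, which pandas rounds to 18 rows.
fraction = toy.sample(frac=0.1, random_state=1)

# Reusing the same random_state reproduces exactly the same rows.
repeat = toy.sample(n=100, random_state=1)

print(len(fixed), len(fraction))
print(fixed.index.is_unique, fixed.index.equals(repeat.index))
```

Fixing random_state is what makes the sample reproducible between runs; leaving it out gives a different subset each time.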

<h3>Stratified sampling</h3>

Datasets are often imbalanced. Stratification ensures that each sub-sample keeps the same class proportions as the original population.
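
One way to draw a stratified sample directly in pandas is to sample within each class. The sketch below is only an illustration on hypothetical toy data (the frame toy and its 70/30 class split are assumptions, not the Titanic data); it relies on groupby(...).sample, which is available in pandas 1.1 and later.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 140 rows of class 0 and 60 rows of class 1.
toy = pd.DataFrame({"feature": range(200), "target": [0] * 140 + [1] * 60})

# Draw 10% from each class separately so the sample keeps the 70/30 ratio.
stratified = toy.groupby("target").sample(frac=0.1, random_state=1)

# Compare class proportions in the full data and in the stratified sample.
print(toy["target"].value_counts(normalize=True).sort_index())
print(stratified["target"].value_counts(normalize=True).sort_index())
```

A plain toy.sample(frac=0.1) would only match these proportions on average, so with small or heavily imbalanced data the minority class can end up under-represented.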

In [15]:
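# Features: every column except the target.
# Note: 'alive' is a yes/no copy of 'survived' and should be dropped before training a model.
x = df.drop('survived', axis=1)
y = df['survived']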

In [16]:
x

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [17]:
y

1      1
3      1
6      0
10     1
11     1
      ..
871    1
872    0
879    1
887    1
889    1
Name: survived, Length: 182, dtype: int64

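Split the data with stratification on the target, so that the training and test sets both keep the survival ratio of the full dataset.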

In [None]:
# Hold out 20% for testing; stratify=y preserves the class proportions of y in both splits
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)
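
To confirm that a stratified split keeps the class balance, the class share can be compared across the full data, the training set, and the test set. This is a minimal sketch on hypothetical toy data (X_toy, y_toy, and the 70/30 imbalance are assumptions for illustration), using the same train_test_split arguments as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows with a 70/30 class imbalance.
X_toy = pd.DataFrame({"feature": range(200)})
y_toy = pd.Series([0] * 140 + [1] * 60, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Each split should show roughly the same share of class 1 as the full data.
for name, part in [("full", y_toy), ("train", y_tr), ("test", y_te)]:
    print(name, len(part), round(part.mean(), 3))
```

The same check can be run on the tutorial's own split by calling value_counts(normalize=True) on y, Y_train, and Y_test.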


In [25]:
X_train.info()
print(Y_train)

<class 'pandas.core.frame.DataFrame'>
Index: 145 entries, 110 to 879
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   pclass       145 non-null    int64   
 1   sex          145 non-null    object  
 2   age          145 non-null    float64 
 3   sibsp        145 non-null    int64   
 4   parch        145 non-null    int64   
 5   fare         145 non-null    float64 
 6   embarked     145 non-null    object  
 7   class        145 non-null    category
 8   who          145 non-null    object  
 9   adult_male   145 non-null    bool    
 10  deck         145 non-null    category
 11  embark_town  145 non-null    object  
 12  alive        145 non-null    object  
 13  alone        145 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(3), object(5)
memory usage: 13.5+ KB
110    0
571    1
52     1
449    1
329    1
      ..
337    1
248    1
462    0
550    1
879    1
Name: survived, Lengt

In [24]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, 3 to 11
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   pclass       37 non-null     int64   
 1   sex          37 non-null     object  
 2   age          37 non-null     float64 
 3   sibsp        37 non-null     int64   
 4   parch        37 non-null     int64   
 5   fare         37 non-null     float64 
 6   embarked     37 non-null     object  
 7   class        37 non-null     category
 8   who          37 non-null     object  
 9   adult_male   37 non-null     bool    
 10  deck         37 non-null     category
 11  embark_town  37 non-null     object  
 12  alive        37 non-null     object  
 13  alone        37 non-null     bool    
dtypes: bool(2), category(2), float64(2), int64(3), object(5)
memory usage: 3.8+ KB
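
The StratifiedKFold class imported at the top of this tutorial applies the same idea to cross-validation: every fold keeps approximately the class proportions of the full dataset. The sketch below is an illustration on hypothetical toy arrays (X_toy, y_toy, and the 80/20 imbalance are assumptions, not the Titanic data).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 100 samples with an 80/20 class imbalance.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# Five shuffled folds, each stratified on y_toy.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_toy, y_toy)):
    # Share of class 1 in this validation fold.
    rate = y_toy[val_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold}: {len(val_idx)} validation rows, class-1 share {rate:.2f}")
```

With the tutorial's data, skf.split(x, y) would yield positional indices in the same way, to be used with x.iloc and y.iloc.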
