# Pandas Sampling

### 1) Simple Random Sampling:

**Concept:** This is the most basic form of sampling. Every individual in the dataset has an equal chance of being selected.

**Example:** Imagine you have a bowl of 100 differently colored marbles and you want to get a sense of the colors. If you close your eyes and pick 10 marbles at random, that's simple random sampling.

In [3]:
import pandas as pd

df = pd.read_csv('bike.csv')

# Pick 10 rows at random from the DataFrame (without replacement)
simple_random_sample = df.sample(n=10)

simple_random_sample

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
6687,2012-03-15 22:00:00,1,0,1,1,21.32,25.0,68,8.9981,32,137,169
6722,2012-03-17 09:00:00,1,0,0,2,18.04,21.97,88,6.0032,104,217,321
7367,2012-05-06 08:00:00,2,0,0,2,20.5,24.24,88,8.9981,23,91,114
301,2011-01-14 00:00:00,1,0,1,1,4.92,6.82,50,12.998,0,14,14
4328,2011-10-12 09:00:00,4,0,1,2,22.14,25.76,88,30.0026,14,183,197
6057,2012-02-08 15:00:00,1,0,1,3,12.3,15.91,61,7.0015,5,49,54
1598,2011-04-12 12:00:00,2,0,1,2,22.96,26.515,64,31.0009,9,83,92
3647,2011-09-02 21:00:00,3,0,1,2,26.24,30.305,73,6.0032,28,121,149
7311,2012-05-04 00:00:00,2,0,1,1,24.6,28.79,78,22.0028,19,70,89
6958,2012-04-08 06:00:00,2,0,0,1,14.76,18.94,37,0.0,7,21,28
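
Because `df.sample` draws different rows on every run, the output above will not repeat exactly. The following is a minimal sketch, assuming a small hypothetical DataFrame (`demo_df`) in place of `bike.csv`, of how `random_state` makes a sample reproducible and how `frac` samples a proportion of rows rather than a fixed count:

```python
import pandas as pd

# Hypothetical stand-in for the bike data, used only for illustration
demo_df = pd.DataFrame({"hour": range(100), "count": range(100, 200)})

# Fixing random_state makes the same rows come back on every run
sample_a = demo_df.sample(n=10, random_state=42)
sample_b = demo_df.sample(n=10, random_state=42)

# frac draws a proportion of the rows instead of a fixed number (here 10%)
sample_frac = demo_df.sample(frac=0.1, random_state=42)

print(sample_a.equals(sample_b))  # the two seeded samples are identical
print(len(sample_frac))           # number of rows drawn with frac=0.1
```

The seed value 42 is arbitrary; any fixed integer gives a repeatable sample.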


### 2) Stratified Random Sampling:
**Concept:** In this method, the population is divided into smaller groups called strata based on certain shared characteristics (like age, gender, etc.). Then, a random sample is taken from each stratum.

**Example:** Let's say you have a class of 50 male and 50 female students and you want to survey 20 students. If you pick 10 males and 10 females randomly, ensuring the gender ratio is maintained, that's stratified random sampling.

In [8]:
import pandas as pd
import numpy as np

# Create a sample dataset
np.random.seed(42)  # for reproducibility
data = {
    'ID': range(1, 101),
    'Age': np.random.randint(18, 30, 100),
    'Gender': np.random.choice(['Male', 'Female'], 100),
    'State': np.random.choice(['StateA', 'StateB', 'StateC', 'StateD', 'StateE'], 100)
}
df1 = pd.DataFrame(data)
df1.head()

Unnamed: 0,ID,Age,Gender,State
0,1,24,Male,StateC
1,2,21,Male,StateB
2,3,28,Male,StateD
3,4,25,Female,StateA
4,5,22,Male,StateA


In [14]:
from sklearn.model_selection import train_test_split

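# Hold out 10% of the rows as the sample; stratify keeps the
# Male/Female proportions the same in both splits
training_data, sample_data = train_test_split(df1, test_size=0.1, stratify=df1['Gender'])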

## Training data
90% of the total data

In [15]:
training_data

Unnamed: 0,ID,Age,Gender,State
52,53,26,Female,StateD
84,85,25,Male,StateA
4,5,22,Male,StateA
8,9,24,Male,StateA
97,98,29,Female,StateC
...,...,...,...,...
5,6,24,Female,StateA
44,45,22,Male,StateE
38,39,20,Female,StateD
46,47,24,Male,StateB


## Sample data
10% of the total data

In [17]:
sample_data

Unnamed: 0,ID,Age,Gender,State
90,91,19,Female,StateD
53,54,29,Male,StateB
88,89,25,Female,StateB
68,69,21,Male,StateC
54,55,19,Male,StateC
86,87,29,Female,StateD
13,14,21,Male,StateC
64,65,29,Female,StateD
82,83,27,Male,StateD
47,48,22,Male,StateD
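
Stratified sampling can also be done in pandas alone. The sketch below is one possible approach, not the method used in the cells above: it assumes pandas 1.1 or newer (where `groupby(...).sample` is available) and uses a hypothetical `people` DataFrame with a deliberate 70/30 gender split so the preserved ratio is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with an uneven 70/30 gender split
rng = np.random.default_rng(0)
people = pd.DataFrame({
    "ID": range(1, 101),
    "Gender": ["Male"] * 70 + ["Female"] * 30,
    "Age": rng.integers(18, 30, size=100),
})

# Draw 10% from each gender group so the sample keeps the 70/30 ratio
stratified = people.groupby("Gender").sample(frac=0.1, random_state=0)

# Count how many rows came from each stratum
print(stratified["Gender"].value_counts())
```

Sampling by `frac` within each group keeps the strata proportional; passing `n` instead would draw the same number of rows from every group regardless of its size.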


### 3) Systematic Sampling:
**Concept:** Instead of selecting each item independently at random, you pick every kth item from an ordered dataset, often after choosing the starting point at random.

**Example:** From a list of 100 students, if you decide to pick every 10th student, starting from the 1st, you'd pick the 1st, 11th, 21st,... and so on.

In [20]:
k = 10

# Take every kth row, starting from the first row (position 0)
systematic_sample_data = df1.iloc[::k, :]

systematic_sample_data

Unnamed: 0,ID,Age,Gender,State
0,1,24,Male,StateC
10,11,28,Male,StateE
20,21,25,Female,StateC
30,31,29,Male,StateA
40,41,24,Male,StateA
50,51,19,Male,StateA
60,61,21,Female,StateC
70,71,25,Female,StateD
80,81,29,Male,StateB
90,91,19,Female,StateD
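
Always starting at the first row means the same rows are selected every time. A common variant picks the starting position at random from the first k rows. This is a minimal sketch under that assumption, using a hypothetical 100-row `frame` rather than `df1`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"ID": range(1, 101)})  # hypothetical 100-row dataset
k = 10

# Pick the starting position at random from the first k rows
rng = np.random.default_rng(42)
start = int(rng.integers(0, k))

# Then take every kth row from that starting position
systematic = frame.iloc[start::k]
print(systematic["ID"].tolist())
```

With 100 rows and k = 10, every possible start still yields exactly 10 rows, each spaced 10 positions apart.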


### 4) Cluster Sampling:
**Concept:** The population is divided into clusters (groups) and a few clusters are selected at random. All observations from these selected clusters are included in the sample.

**Example:** Imagine a country with 50 states, and you want to survey the population. Instead of surveying people from all 50 states, you randomly select 5 states and survey everyone from those 5 states.

**Pandas Implementation:** This is a bit more involved, since you first group your data into clusters and then randomly sample from those clusters.

In [34]:
# Sample 2 states randomly
selected_states = np.random.choice(df1['State'].unique(), 2, replace=False)

# Select all rows that belong to the 2 randomly selected states
sampled_data = df1[df1['State'].isin(selected_states)]

sampled_data.head()

Unnamed: 0,ID,Age,Gender,State
1,2,21,Male,StateB
3,4,25,Female,StateA
4,5,22,Male,StateA
5,6,24,Female,StateA
6,7,27,Female,StateA
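
To make the cluster selection repeatable and confirm that only the chosen clusters appear in the result, something like the following could be used. It is a sketch on a hypothetical `survey` DataFrame with five equally sized states, and it uses a seeded NumPy `Generator` instead of the global `np.random.choice` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 clusters (states) with 20 rows each
survey = pd.DataFrame({
    "ID": range(1, 101),
    "State": np.repeat(["StateA", "StateB", "StateC", "StateD", "StateE"], 20),
})

# Seeded generator so the same clusters are chosen on every run
rng = np.random.default_rng(7)
states = survey["State"].unique().tolist()
chosen = rng.choice(states, size=2, replace=False)

# Keep every row from the chosen clusters and none from the others
cluster_sample = survey[survey["State"].isin(chosen)]

print(sorted(chosen))
print(cluster_sample["State"].value_counts())
```

Because whole clusters are kept, the sample size depends on how large the selected clusters are; here each state has 20 rows, so two states give 40 rows.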
