# Data Sampling

Data sampling is the process of selecting a subset of data from a larger dataset. It is often used to reduce the computation time and storage space required to work with large datasets. Common sampling methods include random sampling, stratified sampling, cluster sampling, and systematic sampling.

## Random Sampling

Random sampling (more precisely, simple random sampling) is a type of probability sampling in which every item in the population has an equal chance of being selected. It is useful when you want an unbiased sample from the population.

Example:
Let's say you have a dataset with 100 rows and you want to take a random sample of 10 rows. You can use the `sample` function from Python's built-in `random` module to select 10 items from the dataset without replacement.

In [1]:
import random

# create a dummy dataset with 100 observations
data = [i for i in range(1, 101)]

# take a random sample of 10 items (without replacement)
sample_size = 10
random_sample = random.sample(data, sample_size)

print(random_sample)

[79, 86, 21, 92, 95, 5, 7, 60, 64, 55]


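The same idea applies to tabular data. With Python's `pandas` library, the `sample` method of a DataFrame draws random rows, and passing `random_state` makes the draw reproducible.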

In [2]:
import pandas as pd

# create a custom DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
})

# take a random sample of 3 rows from the DataFrame
random_sample = df.sample(n=3, random_state=42)

# display the random sample
print(random_sample)

   A    B
8  9  100
1  2   30
5  6   70
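
When the sample size is easier to state as a share of the data than as a row count, `DataFrame.sample` also accepts a `frac` argument. The following is a minimal sketch on a DataFrame like the one above; the 30% fraction and the variable names `sample_a` and `sample_b` are chosen purely for illustration.

```python
import pandas as pd

# same shape as the example DataFrame above: 10 rows, two columns
df = pd.DataFrame({
    'A': range(1, 11),
    'B': range(20, 120, 10),
})

# draw 30% of the rows instead of a fixed number of rows
sample_a = df.sample(frac=0.3, random_state=42)
sample_b = df.sample(frac=0.3, random_state=42)

print(len(sample_a))              # number of rows drawn
print(sample_a.equals(sample_b))  # same seed, so the same rows are drawn
```

Reusing the same `random_state` is what makes the two draws identical; omitting it gives a different sample on each run.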


## Stratified Sampling

Stratified sampling is a type of probability sampling where the population is divided into subgroups (strata) based on some characteristic or feature. Each subgroup is then sampled separately, either in proportion to its size in the population (proportionate allocation) or with a fixed number of items per subgroup (equal allocation). Stratified sampling is useful when subgroups of the population differ significantly and you want to ensure that each subgroup is represented in the sample.

Example:
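Let's say you have a dataset of students' grades with a column for the class they belong to. You want every class to be represented in the sample, so you take 5 students from each class (20 students in total across the four classes). You can use the `groupby` method from `pandas` to group the dataset by the class column and then use the `apply` method to take a random sample of students from each group. Because every class contributes the same number of students regardless of its size, this is an equal-size allocation rather than sampling in proportion to class size.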

In [3]:
import pandas as pd
import numpy as np

# Generate random data
np.random.seed(42)
data = {'student_id': np.arange(1, 101),
        'class': np.random.choice(['A', 'B', 'C', 'D'], size=100),
        'grade': np.random.randint(0, 101, size=100)
}

# Convert data to pandas DataFrame
df = pd.DataFrame(data)

# Define the number of samples to take from each class
n_samples = 5

# Define a lambda function to take a random sample of n_samples from each group
sample_func = lambda x: x.sample(n=n_samples)

# Apply the sample_func to each group
sampled_data = df.groupby('class').apply(sample_func)

# Reset the index of the sampled data
sampled_data = sampled_data.reset_index(drop=True)

print(sampled_data)

    student_id class  grade
0           68     A     77
1           22     A     23
2            7     A     61
3            8     A     99
4           36     A      7
5           72     B     89
6           76     B     78
7           71     B      4
8           21     B     52
9           94     B     12
10          40     C     80
11          20     C     81
12          14     C     77
13          48     C      6
14           9     C     13
15          82     D     62
16          24     D     88
17          91     D     42
18          46     D     40
19          45     D      4
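
The example above takes the same number of students from every class. A proportionate variant, where each class contributes in line with its share of the population, might look like the sketch below. It assumes pandas 1.1 or later, where `groupby(...).sample` is available; the class sizes (50, 30, 20) and the 20% fraction are hypothetical values picked so the arithmetic is easy to follow.

```python
import pandas as pd

# hypothetical population with unequal class sizes: 50 / 30 / 20
df = pd.DataFrame({
    'student_id': range(1, 101),
    'class': ['A'] * 50 + ['B'] * 30 + ['C'] * 20,
})

# sample 20% of every class, so each class keeps its population share
stratified = df.groupby('class').sample(frac=0.2, random_state=42)

# rows drawn per class
print(stratified['class'].value_counts().sort_index())
```

With these sizes the sample holds 10, 6, and 4 students from classes A, B, and C, mirroring the 50/30/20 split of the population.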


## Cluster Sampling

Cluster sampling is a type of probability sampling where the population is divided into clusters or groups, and then a random sample of clusters is selected. In one-stage cluster sampling, all the items within the selected clusters are included; in two-stage cluster sampling, a further random sample is drawn from within the selected clusters. Cluster sampling is useful when the population is widely dispersed and difficult to sample directly.

Example:
Let's say you have a dataset of houses in a city, with a column for the neighborhood they are located in. You want to take a sample of 10 houses drawn from a few randomly chosen neighborhoods. You can collect the unique values of the neighborhood column, randomly select 5 neighborhoods, and use `isin` together with `loc` to keep only the houses in the selected neighborhoods. Finally, you draw 10 houses at random from that subset. Because the last step samples within the selected clusters instead of keeping every house, this is a two-stage cluster sample.

In [4]:
import pandas as pd

# Create sample dataset
df = pd.DataFrame({
    'HouseID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'Neighborhood': ['A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'F', 'F', 'F', 'G', 'G', 'G', 'G', 'G']
})

# Take a cluster sample of 10 houses, from 5 randomly selected neighborhoods
neighborhoods = df['Neighborhood'].unique()
selected_neighborhoods = pd.Series(neighborhoods).sample(n=5, random_state=42)
cluster_sample = df.loc[df['Neighborhood'].isin(selected_neighborhoods)].sample(n=10, random_state=42)

print(cluster_sample)

    HouseID Neighborhood
5         6            C
0         1            A
13       14            F
14       15            F
2         3            B
1         2            B
12       13            F
4         5            C
11       12            E
3         4            C
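
For comparison, a one-stage cluster sample keeps every house in the chosen neighborhoods and skips the second random draw. The sketch below reuses the same house data; selecting 3 neighborhoods and the names `clusters`, `selected`, and `one_stage_sample` are illustrative assumptions.

```python
import pandas as pd

# same 20 houses and 7 neighborhoods as in the example above
df = pd.DataFrame({
    'HouseID': range(1, 21),
    'Neighborhood': list('ABBCCCDDDDEEFFFGGGGG'),
})

# randomly pick 3 neighborhoods (the clusters)
clusters = pd.Series(df['Neighborhood'].unique())
selected = clusters.sample(n=3, random_state=42)

# keep every house in the selected neighborhoods (no second-stage sampling)
one_stage_sample = df[df['Neighborhood'].isin(selected)]

print(sorted(selected))
print(one_stage_sample)
```

The size of a one-stage sample is not fixed in advance: it depends on how many houses the selected neighborhoods happen to contain.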


## Systematic Sampling

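Systematic sampling is a type of probability sampling where items are selected at regular intervals from an ordered list: a starting point is chosen at random within the first interval, and every k-th item after it is taken. Systematic sampling is useful when the population is large and ordered, and a random sample cannot be easily obtained. It can give a biased sample if the ordering has a periodic pattern that lines up with the sampling interval.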

Example:
Let's say you have a dataset of 100 employees in a company, with a column for the employee ID (column `A` below). You want to take a sample of 10 employees, and you want to select them in a systematic way, taking every 10th employee after a random starting point.

In [5]:
import pandas as pd
import numpy as np

# create a dummy dataset
df = pd.DataFrame({
    'A': range(1, 101),
    'B': np.random.randint(1, 11, size=100),
    'C': np.random.choice(['Male', 'Female'], size=100)
})

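# set the seed so the random starting point below is reproducible
# (columns B and C were generated before this call, so they differ between runs)
np.random.seed(123)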

# define the sample size
sample_size = 10

# calculate the sampling interval
n = len(df)
k = n // sample_size

# randomly choose the starting point
start = np.random.randint(0, k)

# select the indices for the sample
indices = range(start, n, k)

# create the systematic sample
systematic_sample = df.loc[indices]

print(systematic_sample)

     A  B       C
2    3  8  Female
12  13  3  Female
22  23  7    Male
32  33  1    Male
42  43  9    Male
52  53  1  Female
62  63  1    Male
72  73  2  Female
82  83  5  Female
92  93  4    Male


This code creates a dummy dataset with 100 rows and 3 columns and defines the sample size as 10. The seed is set only after the random columns `B` and `C` have been generated, so it makes the starting point reproducible but not the values in those columns. The sampling interval `k` is the integer quotient of `n` and `sample_size`, where `n` is the number of rows in the dataset; here `k` is 10. The starting point is chosen with `np.random.randint(0, k)`, which returns an integer from 0 to `k - 1`, and the indices for the sample are generated with `range(start, n, k)`. Finally, `df.loc[indices]` selects the rows with those index labels, which works here because the DataFrame has the default index 0 to 99. The resulting systematic sample is then printed.
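
The steps above can also be wrapped in a small helper that selects rows by position, so it does not depend on the index labels. This is a sketch under stated assumptions, not a definitive implementation: the function name `systematic_sample`, its `seed` parameter, and the 103-row test frame are all hypothetical. When the number of rows is not a multiple of the sample size, the helper simply trims the result, which makes the last few rows slightly less likely to be chosen.

```python
import numpy as np
import pandas as pd

def systematic_sample(frame, sample_size, seed=None):
    """Return every k-th row of `frame`, starting at a random offset in [0, k)."""
    if not 0 < sample_size <= len(frame):
        raise ValueError("sample_size must be between 1 and len(frame)")
    k = len(frame) // sample_size        # sampling interval
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, k))      # random start, 0 <= start < k
    # slice by position, then trim in case len(frame) is not a multiple of sample_size
    return frame.iloc[start::k].iloc[:sample_size]

df = pd.DataFrame({'A': range(1, 104)})  # 103 rows, not a multiple of 10
sample = systematic_sample(df, sample_size=10, seed=123)
print(sample['A'].tolist())
```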