# Sampling

## Key Terms
- A ***population*** is the group of all items of interest. Ex: All the students on a university campus.
- A ***sample*** is a set of data drawn from the studied population. Ex: 500 students from the campus.
- A few sampling techniques are:
    - ***Random Sampling***: Every record has an equal probability of being selected into the sample. If there are n records, the probability of choosing any one record in a single draw is 1/n.
    - ***Stratified Sampling***: Involves dividing the entire population into homogeneous groups called strata, then drawing a random sample from each stratum.
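
The 1/n claim for random sampling can be checked empirically. The following is a minimal sketch, assuming a hypothetical population of 10 record IDs (not the credit card data used later): it draws a single record many times and compares each record's selection frequency with 1/n.

```python
import numpy as np

# Hypothetical population of n = 10 record IDs (illustrative only)
n = 10
population = np.arange(n)

# Draw a single record 100,000 times with a seeded generator
rng = np.random.default_rng(0)
draws = rng.choice(population, size=100_000)

# Relative frequency with which each record was selected
frequencies = np.bincount(draws, minlength=n) / draws.size

print("Expected probability:", 1 / n)
print("Observed frequencies:", frequencies.round(3))
```

Each observed frequency should sit close to 0.1, with small deviations that shrink as the number of draws grows.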

## Working on real data

### Importing data and libraries

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import os, sys

In [2]:
# Loading data
data = pd.read_excel('datasets/CreditCardData.xlsx')
data.head()

| | Card_ID | Campaign_Responce | Registration_Date | Gender | Birth_Date |
|---|---|---|---|---|---|
| 0 | 100005950 | False | 1998-11-18 | M | 1984-02-06 |
| 1 | 100022191 | True | 1999-09-15 | F | 1959-09-11 |
| 2 | 100025442 | False | 1998-05-12 | M | 1970-08-25 |
| 3 | 100026513 | False | 1999-02-12 | M | 1951-03-12 |
| 4 | 100039145 | False | 2000-08-12 | M | 1949-06-08 |


In [3]:
# Shape of the data
print('Shape:', data.shape)

Shape: (297, 5)


In [4]:
# Checking the data types
print(data.dtypes)

Card_ID                       int64
Campaign_Responce              bool
Registration_Date    datetime64[ns]
Gender                       object
Birth_Date           datetime64[ns]
dtype: object


### Simple Random Sampling

- Random sampling without replacement: A record cannot be selected more than once.
- Random sampling with replacement: The same record can be selected more than once.
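
The difference between the two modes can be seen on a small example. This is a sketch on a hypothetical six-record frame named `toy` (an assumption for illustration, not part of the dataset), using the `replace` parameter of `DataFrame.sample`.

```python
import pandas as pd

# Hypothetical toy frame with 6 records (illustrative only)
toy = pd.DataFrame({"record": list("ABCDEF")})

# Without replacement: every selected index is unique
without = toy.sample(n=6, replace=False, random_state=1)
print(without.index.is_unique)  # True

# With replacement: n may exceed the population size, so repeats must occur
with_repl = toy.sample(n=10, replace=True, random_state=1)
print(with_repl.index.duplicated().sum())  # number of repeated draws

# Without replacement, asking for more records than exist raises an error
try:
    toy.sample(n=10, replace=False)
    oversample_failed = False
except ValueError as err:
    oversample_failed = True
    print("ValueError:", err)
```

Drawing 10 records from 6 with replacement guarantees at least 4 repeated indexes, while the same request without replacement is rejected.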

In [6]:
# Select 5 records randomly from the dataset 
df1 = data.sample(n = 5, random_state = 40)
df1.head()

| | Card_ID | Campaign_Responce | Registration_Date | Gender | Birth_Date |
|---|---|---|---|---|---|
| 43 | 100867201 | True | 1999-12-17 | F | 1957-05-10 |
| 166 | 103584288 | False | 2001-11-26 | M | 1970-11-28 |
| 18 | 100215710 | True | 2001-09-22 | M | 1969-05-03 |
| 103 | 102464771 | False | 1999-12-10 | F | 1967-03-10 |
| 252 | 105607063 | False | 1998-05-27 | F | 1940-03-21 |


In [7]:
# Check the indexes of the selected sample
df1.index

Index([43, 166, 18, 103, 252], dtype='int64')

In [8]:
# Check the counts of campaign response
df1['Campaign_Responce'].value_counts()

Campaign_Responce
False    3
True     2
Name: count, dtype: int64

In [7]:
# Check the proportions of campaign response
df1['Campaign_Responce'].value_counts(normalize= True)

Campaign_Responce
False    0.6
True     0.4
Name: proportion, dtype: float64

In [8]:
# Proportions of campaign response in the full dataset, for comparison
data['Campaign_Responce'].value_counts(normalize= True)

Campaign_Responce
False    0.835017
True     0.164983
Name: proportion, dtype: float64

In [9]:
# Select 100 records randomly (df1 is overwritten with the larger sample)
df1 = data.sample(n = 100, random_state= 44)
df1.head()

| | Card_ID | Campaign_Responce | Registration_Date | Gender | Birth_Date |
|---|---|---|---|---|---|
| 252 | 105607063 | False | 1998-05-27 | F | 1940-03-21 |
| 175 | 103828605 | False | 1998-05-29 | M | 1966-04-26 |
| 192 | 104235958 | False | 1998-01-27 | F | 1951-04-25 |
| 43 | 100867201 | True | 1999-12-17 | F | 1957-05-10 |
| 28 | 100614020 | True | 1998-07-21 | M | 1958-11-19 |


In [10]:
df1['Campaign_Responce'].value_counts(normalize= True)

Campaign_Responce
False    0.86
True     0.14
Name: proportion, dtype: float64
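
The 5-record and 100-record samples land at different distances from the population proportion of `True` responses (about 0.165). The sketch below illustrates why sample size matters. It assumes a synthetic stand-in for the response column with 49 `True` values out of 297 (which reproduces the 0.164983 proportion shown above); the helper `proportion_spread` is hypothetical and simply repeats the sampling many times.

```python
import numpy as np

# Synthetic stand-in for the response column: 49 True out of 297 records,
# matching the population proportion shown above (49 / 297 = 0.164983...)
responses = np.array([True] * 49 + [False] * 248)

def proportion_spread(values, n, repeats=2000, seed=0):
    """Standard deviation of the True-proportion over repeated samples of size n."""
    rng = np.random.default_rng(seed)
    proportions = [
        rng.choice(values, size=n, replace=False).mean()
        for _ in range(repeats)
    ]
    return float(np.std(proportions))

# Spread of the sample proportion for three sample sizes
spreads = {n: proportion_spread(responses, n) for n in (5, 30, 100)}
for n, spread in spreads.items():
    print(n, round(spread, 3))
```

The spread shrinks as n grows, so a 100-record sample is expected to track the population proportion more closely than a 5-record sample.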

In [11]:
# Select 10% of records randomly, and create a dataframe df2
df2 = data.sample(frac = 0.1, random_state= 44)
df2.shape

(30, 5)

In [12]:
# Check indexes
df2.index

Index([252, 175, 192,  43,  28,  33,  12, 134, 185, 278,  20, 287, 179, 148,
       201,  46, 294,  95, 110, 107, 174, 211, 226,  36,  65, 135, 277,   7,
       229,  88],
      dtype='int64')

In [13]:
# Check the proportions of campaign response
df2['Campaign_Responce'].value_counts(normalize= True)

Campaign_Responce
False    0.8
True     0.2
Name: proportion, dtype: float64

In [14]:
# Select 35 records with replacement. Create a dataframe df3
df3 = data.sample(n = 35, random_state= 44, replace= True)
df3.shape

(35, 5)

In [15]:
# Check indexes
df3.index

Index([276, 241, 173,  59,  96,  84, 239, 120, 151, 195, 199,  67, 227, 109,
       245, 100,  57, 257,  14, 120, 213,  96, 287,  72, 189,  72,  86, 242,
       144, 116,  50,  18,  92, 285,   1],
      dtype='int64')

In [16]:
# Check for duplicates
sum(df3.index.duplicated())

3

### Stratified Sampling

In [17]:
# Original data: find the % distribution by Gender
data['Gender'].value_counts(normalize=True)

Gender
M    0.572391
F    0.427609
Name: proportion, dtype: float64

In [17]:
# Split the records stratified by Gender: df2 holds the 30% sample, df1 the remaining 70%
from sklearn.model_selection import train_test_split

df1, df2 = train_test_split(data, test_size = 0.3, stratify = data['Gender'],  random_state = 44)
df1.shape, df2.shape

((207, 5), (90, 5))
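
The split sizes are consistent with the test partition being rounded up to a whole record. The sketch below only reproduces that arithmetic for 297 records; the rounding rule is an assumption inferred from the shapes shown, and the 0.25 case is included to show the same calculation for a 25% split.

```python
import math

n_records = 297  # number of rows in the dataset, from data.shape

sizes = {}
for test_size in (0.3, 0.25):
    # Assumption: the test partition is rounded up to a whole record
    n_test = math.ceil(test_size * n_records)
    n_train = n_records - n_test
    sizes[test_size] = (n_train, n_test)
    print(test_size, n_train, n_test)
```

For `test_size = 0.3` this gives 207 and 90 records, matching the shapes above.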

In [18]:
# Counting for df2
df2['Gender'].value_counts(normalize= True)

Gender
M    0.577778
F    0.422222
Name: proportion, dtype: float64

In [19]:
# Counting for df1
df1['Gender'].value_counts(normalize= True)

Gender
M    0.570048
F    0.429952
Name: proportion, dtype: float64

In [20]:
# Select 25% of records from 'data' stratified based on 'Campaign_Responce' variable.
df3, df4 = train_test_split(data, test_size = 0.25, stratify = data['Campaign_Responce'],  random_state = 44)
df3.shape, df4.shape

((222, 5), (75, 5))

In [21]:
# Counting values for df3
df3['Campaign_Responce'].value_counts(normalize= True)

Campaign_Responce
False    0.833333
True     0.166667
Name: proportion, dtype: float64

In [22]:
# Counting values for df4
df4['Campaign_Responce'].value_counts(normalize= True)

Campaign_Responce
False    0.84
True     0.16
Name: proportion, dtype: float64
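
`train_test_split` is one way to draw a stratified sample. A pandas-only alternative is `groupby(...).sample(frac=...)`, which draws the same fraction from each stratum. The following is a minimal sketch on a hypothetical toy frame (60 `M` and 40 `F` records, not the credit card data); the column name `value` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data: 60 'M' and 40 'F' records (illustrative only)
toy = pd.DataFrame({
    "Gender": ["M"] * 60 + ["F"] * 40,
    "value": range(100),
})

# Draw 30% from each Gender stratum so the sample keeps the 60/40 split
stratified = toy.groupby("Gender").sample(frac=0.3, random_state=44)

# Compare proportions in the population and in the stratified sample
comparison = pd.DataFrame({
    "population": toy["Gender"].value_counts(normalize=True),
    "sample": stratified["Gender"].value_counts(normalize=True),
})
print(comparison)
```

Because each stratum is sampled separately, the sample contains 18 `M` and 12 `F` records and its proportions match the population exactly. Unlike `train_test_split`, this returns only the sample and not the complementary partition.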