## Sampling

Datasets may not always contain all the required information, or may not contain an equal distribution of datapoints. 
Sampling is used to select a subset of data in order to provide an estimate of the whole population.
Because a sample is smaller than the full dataset, analysing it is also faster and cheaper.

Sampling frame - the list of items forming the dataset from which the sample is taken.

Sample size - the number of items to take from a dataset so that the sample provides enough information about the population at the required accuracy.
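
As a rough illustration of how a required accuracy translates into a sample size, the sketch below uses Cochran's formula for estimating a proportion, with an optional finite population correction. The function name `required_sample_size` and its default values (95% confidence via z = 1.96, a 5% margin of error, p = 0.5) are assumptions chosen for illustration and do not come from these notes.

```python
import math

def required_sample_size(margin_of_error=0.05, z=1.96, p=0.5, population=None):
    """Cochran's formula for a proportion; z=1.96 corresponds to roughly 95% confidence."""
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # finite population correction: fewer samples are needed for a small population
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(required_sample_size())                # very large population
print(required_sample_size(population=365))  # e.g. one year of daily records
```

A smaller margin of error or a higher confidence level increases the required size, while a small population reduces it.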

## Sampling techniques 
Probability sampling (every element has a known, non-zero chance of being selected, which gives the best chance of a representative sample):
- Simple random sampling - every element has the same chance of being selected.

- Systematic sampling - the first sample is selected randomly and the others are selected using a fixed sampling interval.

- Stratified sampling - the dataset is divided into subgroups (strata) based on different traits, and samples are selected from each subgroup. Used to obtain representation from all subgroups.

- Cluster sampling - the dataset is split into clusters/subgroups, and one or more whole clusters are randomly selected to be used in the study.

Non-probability sampling (selection is not random, so the chance of an element being selected is unknown):
- Convenience - samples are selected based on their availability and willingness to take part.
- Quota - samples are chosen to fill a quota based on predetermined characteristics of the population, e.g. people with the letter 'L' in their name.
- Judgement - depends on the judgement of the selector when choosing whom to ask to participate.
- Snowball - existing participants are asked to nominate further people known to them, so that the sample increases in size.

## Examples

In [1]:
# %load imports.py
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

- Random sampling

In [2]:
import random
dataset=[45,35,2,8,9,2,4,65,16,23,19,48,97,54,37,7,6,2,5,10,12,15,34,23,87,56,47,2,1,4,3,8,56,45,12,13,17,76,23,23,2,4,5,1,56]
sample= random.sample(dataset,10)
print('random sample of 10 datapoints', sample)

random sample of 10 datapoints [6, 16, 17, 13, 45, 76, 56, 2, 10, 97]


In [3]:
data=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv')
data.head()

Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp,average_min_temp,average_max_temp,record_min_temp,record_max_temp,record_min_temp_year,record_max_temp_year,actual_precipitation,average_precipitation,record_precipitation
0,2014-7-1,81,70,91,67,89,56,104,1919,2012,0.0,0.1,5.91
1,2014-7-2,85,74,95,68,89,56,101,2008,1931,0.0,0.1,1.53
2,2014-7-3,82,71,93,68,89,56,99,2010,1931,0.14,0.11,2.5
3,2014-7-4,75,64,86,68,89,55,99,1933,1955,0.0,0.1,2.63
4,2014-7-5,72,60,84,68,89,57,100,1967,1954,0.0,0.1,1.65


In [4]:
random_sample=data.sample(n=10).sort_values(by='actual_max_temp')
random_sample.head()

Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp,average_min_temp,average_max_temp,record_min_temp,record_max_temp,record_min_temp_year,record_max_temp_year,actual_precipitation,average_precipitation,record_precipitation
207,2015-1-24,43,33,53,30,51,7,74,1963,1967,0.16,0.1,1.79
189,2015-1-6,41,24,57,30,50,5,73,1884,1950,0.0,0.11,3.45
297,2015-4-24,54,38,70,49,74,36,96,1893,1925,0.0,0.1,2.33
98,2014-10-7,67,53,80,52,74,34,91,1935,1951,0.0,0.11,2.7
318,2015-5-15,73,63,82,56,79,40,94,1888,1962,0.0,0.1,2.11


- Systematic sampling

In [5]:

# select a random row of the dataset as the starting sample,
# then take a sample every `step` rows from that row onwards
# (rows before the starting row are never sampled)
def sys_sample(data, step):
    # randrange keeps the start below len(data), so the sample is never empty
    start = random.randrange(len(data))
    index = np.arange(start, len(data), step=step)
    sample = data.iloc[index]
    return sample
systematic_sample = sys_sample(data, 4)
systematic_sample.head()

Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp,average_min_temp,average_max_temp,record_min_temp,record_max_temp,record_min_temp_year,record_max_temp_year,actual_precipitation,average_precipitation,record_precipitation
178,2014-12-26,43,27,59,30,51,6,76,1983,1889,0.0,0.11,1.72
182,2014-12-30,39,32,45,30,50,-5,76,1880,1984,0.06,0.1,1.21
186,2015-1-3,50,47,52,30,50,8,74,1887,2004,0.25,0.11,1.93
190,2015-1-7,29,16,41,29,50,6,77,2014,1890,0.0,0.12,1.76
194,2015-1-11,31,15,46,29,50,0,78,1886,1949,0.08,0.12,1.78
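
The function above can start anywhere in the dataset, so it only covers the rows after the starting point. A more conventional variant picks the random start inside the first interval, so the sample spans the whole dataset. The sketch below shows that variant on a hypothetical stand-in DataFrame (`demo`, with a made-up `day` column); the function name `systematic_sample` and the `seed` parameter are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, step, seed=None):
    """Take every `step`-th row, starting from a random row in the first interval."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # random start in [0, step)
    return df.iloc[start::step]

# hypothetical stand-in for the weather data: 365 daily rows
demo = pd.DataFrame({"day": np.arange(365)})
sys_demo = systematic_sample(demo, step=4, seed=0)
print(len(sys_demo), sys_demo["day"].head().tolist())
```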


- Stratified sampling

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit
data={'id': [1,2,1,2,1,2,1,2],
     'number': [1,4,5,2,6,7,4,3],
     'price': [12.99,12.99,9.99,10,10,9.99,10,10]}
df=pd.DataFrame(data)
df.head()


Unnamed: 0,id,number,price
0,1,1,12.99
1,2,4,12.99
2,1,5,9.99
3,2,2,10.0
4,1,6,10.0


In [7]:
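# one stratified split: 40% of the rows are drawn, preserving the proportion of each 'id'
sample = StratifiedShuffleSplit(n_splits=1, test_size=0.4)

# split() yields (train indices, test indices); the test indices form the sample
for train_index, sample_index in sample.split(df, df['id']):
    newSample = df.iloc[sample_index]
newSample.head()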

Unnamed: 0,id,number,price
5,2,7,9.99
0,1,1,12.99
7,2,3,10.0
4,1,6,10.0
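
When no train/test split is needed, a stratified sample can also be drawn directly in pandas. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample`) and uses a hypothetical, deliberately imbalanced dataset `demo`; it is one possible approach rather than the only one.

```python
import pandas as pd

# hypothetical dataset with an imbalanced 'id' column (6 rows of id 1, 2 rows of id 2)
demo = pd.DataFrame({
    "id":     [1, 1, 1, 1, 1, 1, 2, 2],
    "number": [1, 4, 5, 2, 6, 7, 4, 3],
})

# draw 50% of each stratum, so the sample keeps the 3:1 ratio of the full dataset
strat_demo = demo.groupby("id").sample(frac=0.5, random_state=0)
print(strat_demo["id"].value_counts().to_dict())
```

Sampling the same fraction from each stratum keeps the subgroup proportions of the original data. Passing `n=` instead of `frac=` would take the same number of rows from every subgroup.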


- Cluster sampling
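
A minimal sketch of cluster sampling, assuming a hypothetical dataset `demo` in which every row already carries a `cluster` label (for example, one label per weather station). Whole clusters are chosen at random and every row of the chosen clusters is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# hypothetical dataset: 12 rows grouped into 4 clusters of 3 rows each
demo = pd.DataFrame({
    "cluster": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "value": np.arange(12),
})

# randomly pick 2 whole clusters, then keep every row belonging to them
cluster_names = demo["cluster"].unique().tolist()
chosen = rng.choice(cluster_names, size=2, replace=False).tolist()
cluster_demo = demo[demo["cluster"].isin(chosen)]
print(sorted(chosen), len(cluster_demo))
```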

## Resampling
Resampling is used when building predictive models: samples are repeatedly drawn from a training dataset and a model is fitted to each one, so that every datapoint is used and more information is obtained about how the model behaves.
It also provides accuracy metrics for a model without requiring new data for testing.

Two methods:

- Cross validation - e.g. k-fold: the dataset is split into k groups/partitions, and each group is used in turn as the test dataset while the remainder is used as the training data.
- Bootstrapping - samples are drawn from the dataset with replacement, so each datapoint stays in the dataset after it is drawn and can appear more than once in a bootstrapped sample. Bootstrapping can be used to make estimates about a dataset by averaging a statistic measured over many repeated samples.

## Cross validation

In [8]:
# Kfold
from sklearn.model_selection import KFold
data=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv')
data.head()
x= data['actual_mean_temp']
y=data['actual_precipitation']
kfold= KFold(n_splits=3)

# split data into test and train datasets based on the kfold split
for train_index, test_index in kfold.split(x):
    x_train, x_test, y_train, y_test= x[train_index], x[test_index], y[train_index], y[test_index]

print('x train: %d, x test: %d' % (len(x_train),len(x_test)))

x train: 244, x test: 121


In [9]:
# Repeated K fold- repeats k fold n times
from sklearn.model_selection import RepeatedKFold
Rkfold= RepeatedKFold(n_splits=3, n_repeats=2)

# split data into test and train datasets based on the kfold split
for train_index, test_index in Rkfold.split(x):
    x_train, x_test, y_train, y_test= x[train_index], x[test_index], y[train_index], y[test_index]

print('x train: %d, x test: %d' % (len(x_train),len(x_test)))

x train: 244, x test: 121


In [None]:
# Leave one out (LOO) - each training set is made by taking all samples but one
# builds n models from n samples, rather than k models. 
# usually results in high variance. K fold is usually preferable.
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
for train_index, test_index in loo.split(x):
    x_train, x_test, y_train, y_test= x[train_index], x[test_index], y[train_index], y[test_index]

print('x train: %d, x test: %d' % (len(x_train),len(x_test)))

x train: 364, x test: 1


In [None]:
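# Leave P out (LPO) - similar to LOO but holds out p samples for each test set
# unlike k fold, the test sets overlap when p > 1
# every combination of p samples is used, so with 365 rows and p=3 this loop
# runs about 8 million times and is slow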
from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=3)
for train_index, test_index in lpo.split(x):
    x_train, x_test, y_train, y_test= x[train_index], x[test_index], y[train_index], y[test_index]

print('x train: %d, x test: %d' % (len(x_train),len(x_test)))

In [None]:
# Stratified K fold - similar to k fold but each set contains the same percentage of samples
# of each target class as the complete set
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)

# stratification needs class labels, but precipitation is continuous, so a binary
# label is derived here (assumption: a day counts as wet if precipitation > 0)
wet_day = (y > 0).astype(int)

for train_index, test_index in skf.split(x, wet_day):
    x_train, x_test, y_train, y_test= x[train_index], x[test_index], y[train_index], y[test_index]

print('x train: %d, x test: %d' % (len(x_train),len(x_test)))

In [None]:
# Group k fold - ensures that the same group is not represented in both the test and train datasets
from sklearn.model_selection import GroupKFold
# numpy arrays are used so that x and y can be indexed with the index arrays returned by split()
x = np.array([1,2,1,3,1,2,3,2,3])
y = np.array([12.99,12.99,9.99,10,10,9.99,10,10,9.99])
groups = [1,1,1,2,2,2,3,3,3]

gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(x, y, groups=groups):
    x_train, x_test, y_train, y_test= x[train_index], x[test_index], y[train_index], y[test_index]

print('x train: %d, x test: %d' % (len(x_train),len(x_test)))
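
The cells above only produce the train/test indices. To obtain accuracy metrics from the splits, a model is fitted and scored on each fold. The sketch below does this with `cross_val_score`; the synthetic data and the choice of `LinearRegression` are assumptions made for illustration and are not part of the weather example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# synthetic data (an assumption): y depends linearly on X, plus noise
X = rng.normal(size=(120, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=120)

# shuffle=True avoids folds that simply follow the row order
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("mean R^2: %.3f" % scores.mean())
```

Reporting the mean and the spread of the per-fold scores gives an estimate of model performance without setting aside a separate test dataset.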

## Bootstrapping

In [None]:
# the 2 important parameters of bootstrapping are the sample size and the number of repetitions to perform
# the sample size is usually the same as the original dataset, but if the dataset is too large, 50% of it can be used

In [None]:
from sklearn.utils import resample
data=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KCLT.csv')


#create one bootstrap sample of 10 points
bootstrap=resample(data, replace=True, n_samples=10)
# use this sample to generate stats
bootmax= bootstrap['actual_max_temp'].max()
print('bootstrap max temp: ', bootmax, 'actual max temp: ', data['actual_max_temp'].max())

print('bootstrap mean temp: ', bootstrap['actual_mean_temp'].mean(), 'actual mean temp: ', data['actual_mean_temp'].mean())
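
A single bootstrap sample of 10 points gives a noisy estimate. The usual approach is to repeat the resampling many times and summarise the results. The sketch below does this for the mean, using a synthetic stand-in for a temperature column (an assumption, so that it runs without downloading the CSV) and a percentile interval as one common way to express the uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for a temperature column (an assumption, not the real data)
temps = rng.normal(loc=60, scale=15, size=365)

n_repeats = 1000
boot_means = np.empty(n_repeats)
for i in range(n_repeats):
    # resample with replacement, same size as the original data
    resampled = rng.choice(temps, size=temps.size, replace=True)
    boot_means[i] = resampled.mean()

# percentile interval: the middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("sample mean: %.2f" % temps.mean())
print("bootstrap estimate: %.2f, 95%% interval: (%.2f, %.2f)" % (boot_means.mean(), lower, upper))
```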


## Errors & Bias

Bias
- a systematic difference between the sampled result and the full population result, caused by incorrect measurements or by measurements on a non-representative sample.

Sampling bias can result from sampling error when the way the sample is chosen makes it unrepresentative of the full population.

Sampling error
- the difference between the sampled result and the full population result that arises from the selection of datapoints included in the sample.

Types of sampling error:
- Sample frame error - the sample is taken from the wrong population, e.g. sampling Netflix customers from a mobile phone survey.
- Selection error - occurs when the sample is not chosen at random, e.g. respondents select themselves, or respondents for a university survey are picked from your friendship circle.
- Non-response error - those who respond differ from those who do not, e.g. in a survey on the use of Sky TV, many customers respond but non-customers do not, giving a biased view of the product.
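
The difference between chance variation and bias can be seen in a small simulation. The sketch below uses a synthetic population (an assumption) and compares random samples with samples drawn only from a "reachable" part of the population, which mimics a survey that misses one group entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic population (an assumption): 10,000 values with a mean close to 50
population = rng.normal(loc=50, scale=10, size=10_000)

# random samples differ from the population only by chance
random_means = [rng.choice(population, size=100, replace=False).mean() for _ in range(500)]

# biased samples: only values above 50 can be selected, e.g. a survey that
# only reaches one part of the population
reachable = population[population > 50]
biased_means = [rng.choice(reachable, size=100, replace=False).mean() for _ in range(500)]

print("population mean:     %.2f" % population.mean())
print("random sample means: %.2f (average over 500 samples)" % np.mean(random_means))
print("biased sample means: %.2f (average over 500 samples)" % np.mean(biased_means))
```

Averaging over many random samples lands close to the population mean, whereas the biased samples stay well above it no matter how many are taken.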