# Sampling!

In machine learning we often need to work with very large datasets, which can be computationally expensive. In such cases it makes more sense to draw a smaller sample from the large dataset and train our models on that sample. While doing this, it is important to ensure that we do not lose statistical information about the population. We also need to ensure that our sample is not biased and is representative of the population. We explore some methods to ensure this.

For the purpose of this notebook we will work with the California Housing dataset.

In [18]:
import pandas as pd
dataset=pd.read_csv('https://raw.githubusercontent.com/marquisvictor/Creating-a-Bias-Free-Testset/master/housing.csv')
print('The size of the Dataset is', len(dataset))
dataset.head()

The size of the Dataset is 20640


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


We will take two approaches at this juncture:


1. ## Simple Random Sampling
   * This is fairly easy to achieve and is the most direct method of probability sampling.
   * There is a risk of introducing sampling bias: purely by chance, some subgroups may be over- or under-represented in the sample.
   * To be more confident in the sample, statistical tests may be performed on each feature of the dataset.

2. ## Stratified Random Sampling
   * Ensures the sample is representative of the whole population with respect to the chosen strata.
   * Subpopulations (strata) are defined, and a simple random sample is drawn from each subpopulation.
   * This approach reduces the sampling error.

# Simple Random Sampling
We use [`pandas.DataFrame.sample`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) to get a simple random sample. It returns a random sample of items from an axis of the object; here we draw one fifth of the rows.



In [19]:
# Draw one fifth of the rows uniformly at random, without replacement
simple_sample_1=dataset.sample(len(dataset)//5)
simple_sample_1.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
7714,-118.12,33.95,36.0,2752.0,459.0,1211.0,452.0,5.0526,269800.0,<1H OCEAN
9190,-118.35,34.32,52.0,102.0,29.0,54.0,32.0,1.9875,191700.0,<1H OCEAN
20598,-121.58,39.15,38.0,1756.0,396.0,837.0,401.0,1.9122,55500.0,INLAND
15342,-117.37,33.22,35.0,2204.0,482.0,1435.0,462.0,3.676,125600.0,NEAR OCEAN
3736,-118.4,34.18,32.0,3724.0,899.0,1912.0,791.0,3.5711,312700.0,<1H OCEAN
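Because the draw is random, rerunning the cell yields a different sample each time. As a minimal sketch of how the draw could be made repeatable, the example below uses the `frac` and `random_state` parameters of `DataFrame.sample` on a toy frame. The toy data and the seed value `42` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the housing data (hypothetical values)
toy = pd.DataFrame({"value": range(100)})

# frac=0.2 draws 20% of the rows; random_state fixes the random draw
sample_a = toy.sample(frac=0.2, random_state=42)
sample_b = toy.sample(frac=0.2, random_state=42)

print(len(sample_a))              # number of rows drawn
print(sample_a.equals(sample_b))  # same seed, so the same rows are drawn
```

Fixing the seed is useful when the p-values computed later need to be reproduced exactly.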


## (Optional) Checking the sample with statistical tests

To check that our sample remains representative of the population, we conduct some statistical tests. For an easier implementation, we make a simplifying assumption: each variable (feature/column) is considered independently of the others.
For each feature we compare the distribution of the sample with that of the population. If no feature shows a statistically significant difference, the sample "passes our test"; otherwise we retry with another sample.
We use the two-sample Kolmogorov-Smirnov (KS) test.

To conduct these tests we use the [`scipy`](https://docs.scipy.org/doc/scipy//reference/index.html) library, an open-source Python library for mathematics, engineering, and scientific and technical computing.
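To make the test less of a black box, here is a small sketch of what the two-sample KS statistic measures: the largest vertical gap between the empirical cumulative distribution functions of the two samples. The synthetic normal data and the helper name `ks_statistic` are assumptions for illustration only; the hand-computed value is compared with `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.normal(size=5000)                          # synthetic "population"
sample = rng.choice(population, size=1000, replace=False)   # simple random sample

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # evaluate both CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

manual = ks_statistic(sample, population)
result = stats.ks_2samp(sample, population)
print(manual, result.statistic, result.pvalue)
```

A small statistic (and therefore a large p-value) means the two empirical distributions are close to each other.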

In [65]:
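from scipy import stats

def get_p_values(population, sample):
  # Run a two-sample KS test per column; missing values are dropped first.
  # The KS test is meant for continuous numeric data, so the p-value for a
  # categorical column such as ocean_proximity is only indicative.
  p_values_dict={}
  for column in population.columns.tolist():
    statistic, p_value=stats.ks_2samp(sample[column].dropna().tolist(), population[column].dropna().tolist(), alternative='two-sided')
    p_values_dict[column]=p_value
  return p_values_dict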

In [66]:
get_p_values(dataset, simple_sample_1)

{'households': 0.9450908836343869,
 'housing_median_age': 0.7639712229370625,
 'latitude': 0.5887490487249476,
 'longitude': 0.9496780476935938,
 'median_house_value': 0.9961336167245299,
 'median_income': 0.8648295782662301,
 'ocean_proximity': 0.9215267286618063,
 'population': 0.782002948631074,
 'total_bedrooms': 0.8443265432149597,
 'total_rooms': 0.9999595657447887}

We see that all the columns have a p-value > 0.05, so we cannot reject the null hypothesis that the sample and the population come from the same distribution. In other words, the tests give no evidence that the sample is unrepresentative of the population.
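The "retry with another sample" idea mentioned earlier can be sketched as a loop. This is only one possible arrangement: the function name `draw_representative_sample`, its parameters (`alpha`, `max_tries`, `seed`), and the synthetic data are all assumptions for illustration, and only numeric columns are tested.

```python
import numpy as np
import pandas as pd
from scipy import stats

def draw_representative_sample(population, frac=0.2, alpha=0.05, max_tries=10, seed=0):
    """Resample until no numeric column differs significantly (hypothetical helper)."""
    numeric_cols = population.select_dtypes(include="number").columns
    for attempt in range(max_tries):
        candidate = population.sample(frac=frac, random_state=seed + attempt)
        p_values = [
            stats.ks_2samp(candidate[col].dropna(), population[col].dropna()).pvalue
            for col in numeric_cols
        ]
        if min(p_values) > alpha:  # every column passes the KS test
            return candidate, attempt + 1
    raise RuntimeError("no acceptable sample found within max_tries")

# Synthetic stand-in for the housing data
rng = np.random.default_rng(1)
toy = pd.DataFrame({"a": rng.normal(size=2000), "b": rng.exponential(size=2000)})

candidate, tries = draw_representative_sample(toy)
print(len(candidate), tries)
```

Capping the number of attempts keeps the loop from running indefinitely if the threshold is set too strictly.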

# Stratified Random Sampling
In stratified random sampling it is important to choose the strata (subpopulations) well. A reasonable way to do this is to pick the most important feature (here, the one with the highest correlation with the target variable) and stratify the population on the basis of that feature.

In [73]:
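# Correlation is only defined for numeric columns, so exclude ocean_proximity
correlation_matrix=dataset.select_dtypes(include='number').corr()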
correlation_matrix['median_house_value'].sort_values(ascending=False)

median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64

In this example median_income has the highest correlation with median_house_value, so we choose this feature to stratify the dataset. Because median_income is continuous, we first need to create a new categorical column that defines the strata.

In [76]:
import numpy as np

# Divide by 1.5 to limit the number of income categories
dataset["median_income_category"] = np.ceil(dataset["median_income"] / 1.5)
# Show the frequency of each category
dataset.median_income_category.value_counts().sort_index()

1.0      822
2.0     6581
3.0     7236
4.0     3639
5.0     1423
6.0      532
7.0      189
8.0      105
9.0       50
10.0      14
11.0      49
Name: median_income_category, dtype: int64

In [77]:
# Merge every category of 5 or above into category 5
dataset["median_income_category"] = dataset["median_income_category"].where(dataset["median_income_category"] < 5, 5.0)
dataset.median_income_category.value_counts().sort_index()

1.0     822
2.0    6581
3.0    7236
4.0    3639
5.0    2362
Name: median_income_category, dtype: int64

All we did above is create 5 strata (or subpopulations) on the basis of which we will sample.
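To make the binning rule concrete, here is a small worked example on a handful of made-up income values (the values are assumptions for illustration only). Dividing by 1.5 and rounding up assigns the category, and `clip(upper=5)` is used here as an equivalent way to cap the category at 5.

```python
import numpy as np
import pandas as pd

# Hypothetical median_income values
income = pd.Series([0.5, 1.5, 1.6, 3.0, 4.6, 7.4, 15.0])

# ceil(income / 1.5) gives the raw category; clip caps it at 5
category = np.ceil(income / 1.5).clip(upper=5)

print(category.tolist())  # [1.0, 1.0, 2.0, 2.0, 4.0, 5.0, 5.0]
```

For instance, an income of 4.6 gives 4.6 / 1.5 ≈ 3.07, which rounds up to category 4, while 15.0 gives 10 and is capped at 5.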

In [79]:
from sklearn.model_selection import StratifiedShuffleSplit

# One stratified split; the 20% "test" part serves as our sample
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
for train_index, test_index in split.split(dataset, dataset["median_income_category"]):
    # split() yields positional indices, so select rows with iloc
    stratified_sample = dataset.iloc[test_index]
stratified_sample.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,median_income_category
13262,-117.65,34.1,19.0,1688.0,365.0,622.0,322.0,3.6,136400.0,INLAND,3.0
7799,-118.08,33.9,42.0,1768.0,372.0,1155.0,368.0,3.558,161100.0,<1H OCEAN,3.0
16085,-122.49,37.73,37.0,1399.0,224.0,530.0,235.0,3.9219,433300.0,NEAR OCEAN,3.0
16280,-121.28,37.92,36.0,499.0,115.0,451.0,124.0,2.1705,60300.0,INLAND,2.0
5444,-118.43,34.0,30.0,2148.0,597.0,1341.0,559.0,3.3995,324000.0,<1H OCEAN,3.0


In [80]:
get_p_values(dataset, stratified_sample)

{'households': 0.7864504892205115,
 'housing_median_age': 0.6555839945801963,
 'latitude': 0.9496780476935938,
 'longitude': 0.7952660345425161,
 'median_house_value': 0.9974053517401111,
 'median_income': 0.9849022969654343,
 'median_income_category': 1.0,
 'ocean_proximity': 0.9999998611981844,
 'population': 0.853363140598451,
 'total_bedrooms': 0.5439835016882955,
 'total_rooms': 0.7222572871039292}

We see that all the columns have a p-value > 0.05, so we cannot reject the null hypothesis that the sample and the population come from the same distribution. In other words, the tests give no evidence that the stratified sample is unrepresentative of the population.
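A more direct way to see what stratification achieves is to compare the share of each stratum in the population with its share in each sample. The sketch below does this on synthetic data; the column name `income_cat`, the category probabilities, and the seeds are assumptions for illustration and do not come from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic population with five unevenly sized strata
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=5000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

# Stratified 20% sample (positional indices, hence iloc)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
_, test_idx = next(split.split(toy, toy["income_cat"]))
stratified = toy.iloc[test_idx]

# Simple random 20% sample for comparison
simple = toy.sample(frac=0.2, random_state=0)

# Share of each stratum in the population and in both samples
proportions = pd.DataFrame({
    "population": toy["income_cat"].value_counts(normalize=True),
    "stratified": stratified["income_cat"].value_counts(normalize=True),
    "simple": simple["income_cat"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

The stratified column should track the population column almost exactly, while the simple random sample can drift somewhat, especially for the smaller strata.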