# Engineering Data Analysis

> **Mohamad M. Hallal, PhD** <br> Teaching Professor, UC Berkeley

[![License](https://img.shields.io/badge/license-CC%20BY--NC--ND%204.0-blue)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
***

# Sampling

In this notebook, we'll select random samples of different sizes to investigate sampling variation. Sampling variation refers to the differences that can occur in statistics (such as the mean) when different random samples are taken from the same population.

Let's get started!

## Dataset

We will create a dataset of bedrock depth at Treasure Island as our population from which we'll draw samples.

Run the cell below to create the dataset and define a plotting function that we will use later. You won't see any output after running this cell because it only defines variables and a function.

In [None]:
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import random
import ipywidgets as widgets

# create [row, column] indexes for every grid cell, then drop the cells with no data
indexes = [[i,j] for i in range(9) for j in range(14)]
del_indexes = [[0,0], [0,1], [1,0], [0,12], [0,13], [1,13], [7,0], [8,0], [8,1]]
indexes = [item for item in indexes if item not in del_indexes]

# create depths data (0 marks grid cells with no data)
depths = np.array([[0 , 0 , 45, 48, 51, 49, 50, 52, 49, 53, 54, 55, 0 , 0 ], 
                  [0 , 42, 47, 48, 43, 41, 45, 49, 50, 50, 55, 54, 49, 0 ],
                  [37, 38, 45, 46, 49, 47, 51, 52, 51, 55, 56, 54, 59, 58],
                  [39, 41, 42, 44, 46, 43, 47, 49, 55, 57, 54, 59, 60, 62],
                  [41, 38, 39, 47, 49, 51, 47, 45, 53, 60, 57, 62, 63, 59],
                  [43, 36, 40, 49, 50, 48, 53, 54, 51, 58, 54, 60, 49, 65],
                  [49, 46, 48, 52, 58, 60, 65, 59, 62, 55, 64, 59, 66, 68],
                  [0 , 50, 55, 56, 51, 65, 68, 66, 70, 66, 71, 68, 69, 73], 
                  [0 , 0 , 60, 57, 55, 50, 48, 55, 63, 70, 69, 72, 71, 75]])

# define plotting function
def plot_sample(size, repetitions):
    # create figure
    fig, ax = plt.subplots(1,2, figsize=(10,5), gridspec_kw={'width_ratios': [2, 1]}, dpi=300)
    ax[0].imshow(img)
    # define colors
    colors = plt.get_cmap('jet', repetitions)  # plt.cm.get_cmap was removed in newer Matplotlib releases
    colors = [colors(i) for i in range(repetitions)]
    # iterate through repetitions
    for i in range(repetitions):
        # select random sample (random.choices draws with replacement, so a cell can repeat)
        indexes_rand = random.choices(indexes, k=size)
        total = 0
        # iterate through samples
        for j in indexes_rand:
            # plot location of sample
            ax[0].scatter(245+j[1]*125+125/2, 205+j[0]*127+127/2, s=300, ec=colors[i],  facecolor=(0,0,0,0))
            total += depths[j[0]][j[1]]
        
        # plot sample mean
        ax[1].scatter(0, total/size, fc=colors[i], zorder=10)
    
    # control axes
    ax[1].text(-0.025, 55, 'Sample Mean', rotation = 90, verticalalignment='center')
    ax[1].plot([-0.01, 0.01], [np.mean(depths[depths!=0])]*2, color='r')
    ax[1].text(0.015, np.mean(depths[depths!=0]), 'Population Mean', c='r', verticalalignment='center')
    ax[1].axvline(0, color='k')
    ax[1].set_ylim(35,75)
    ax[1].set_xlim(-0.03,0.03)
    ax[0].set_axis_off()
    ax[1].set_axis_off()
    
    # Show the plot
    plt.show()

## Simple Random Sampling

Simple random sampling is a technique where each member of the population has an equal chance of being selected. This method helps to avoid selection bias, so samples are representative of the population on average, although any single sample can still differ from it by chance.
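
As a minimal sketch of this idea, the snippet below draws a simple random sample from a small hypothetical list of grid cells. The `cells` list and the fixed seed are assumptions made for illustration and are not part of the notebook's dataset. `random.sample` draws without replacement, so no cell can appear twice, while `random.choices` (the function used by `plot_sample` above) draws with replacement, so the same cell may be selected more than once.

```python
import random

random.seed(0)  # fixed seed (an assumption) so the sketch is repeatable

# hypothetical population: the cells of a 3-by-4 grid, as [row, column] pairs
cells = [[i, j] for i in range(3) for j in range(4)]

# without replacement: every cell is equally likely and appears at most once
sample_without = random.sample(cells, k=5)

# with replacement: every draw is equally likely, but repeats are possible
sample_with = random.choices(cells, k=5)

print("without replacement:", sample_without)
print("with replacement:   ", sample_with)
```

Both calls give every cell the same chance of selection on each draw; they differ only in whether a cell can be picked again.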

Let's draw random samples of different sizes from this population. We'll calculate the mean for each sample and compare them to the population mean. 

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Run the cell below and change the sliders to adjust the sample size and how many samples of that size to select. Pay attention to how the sample means change as you change the sample size.</div> 

In [None]:
# Load the image
img = Image.open('resources/treasure_island.png')

# create 2 sliders for size and repetitions; the decorated function
# takes the values from the sliders and plots the results
@widgets.interact(size=(1,30,5), repetitions=(1,10,2))
def sampling(size, repetitions):
    # Call the function to plot
    plot_sample(size, repetitions)

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> What do you observe from the plots above? Specifically: <br> <b>Do samples of the same size have the same sample mean? Why or why not?</b> 
    <br> <b>Does changing the sample size affect sampling variation? Why or why not?</b>
    <br> <b>Do larger or smaller samples give better estimates of the population mean? Why?</b>
    <br> <b>Does sampling variation get eliminated for larger samples? Why or why not? </b></div> 
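
One way to check your observations numerically is to repeat the sampling many times and measure how much the sample means spread out. The sketch below is not part of the original notebook: it uses a hypothetical population of 117 made-up depths (`population`), an arbitrary seed, and a helper named `sample_means` that is introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed (an assumption)

# hypothetical population of 117 depths; stands in for the real grid
population = rng.integers(36, 76, size=117)

def sample_means(size, repetitions=2000):
    """Return the means of `repetitions` random samples of a given size."""
    # each row is one sample drawn with replacement, as random.choices does
    samples = rng.choice(population, size=(repetitions, size), replace=True)
    return samples.mean(axis=1)

# standard deviation of the sample means for a few sample sizes
spread = {}
for size in (1, 6, 11, 16, 21, 26):
    spread[size] = sample_means(size).std()
    print(f"size={size:2d}  std of sample means = {spread[size]:.2f}")
```

The standard deviation of the sample means is one possible measure of sampling variation; comparing it across sizes shows whether larger samples cluster more tightly around the population mean.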

## Convenience Sampling

Convenience sampling is a non-random sampling technique where samples are taken from a group that is conveniently accessible to the researcher. This method can introduce bias because the accessible group might not represent the entire population accurately. Let's select random samples of different sizes assuming we have access to only part of Treasure Island.

Run the cell below to define the new dataset:

In [None]:
# restrict the indexes to the accessible 5-by-5 corner of the grid, then drop the cells with no data
indexes = [[i,j] for i in range(5) for j in range(5)]
del_indexes = [[0,0], [0,1], [1,0]]
indexes = [item for item in indexes if item not in del_indexes]

Now, let's draw random samples of different sizes from the area we have access to. We'll calculate the mean for each sample and compare them to the population mean, which is still computed over the whole island.

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Run the cell below and change the sliders to adjust the sample size and how many samples of that size to select. Pay attention to how the sample means change as you change the sample size.</div> 

In [None]:
# Load the image
img = Image.open('resources/treasure_island_convenience.png')

# create 2 sliders for size and repetitions; the decorated function
# takes the values from the sliders and plots the results
@widgets.interact(size=(1,30,5), repetitions=(1,10,2))
def sampling(size, repetitions):
    # Call the function to plot
    plot_sample(size, repetitions)

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> What do you observe from the plots above? Specifically: <br> <b>How does the mean of the convenience sample compare to the population mean?</b> 
    <br> <b>Does increasing the sample size provide a better estimate of the population mean? Why or why not?</b>
    <br> <b>What does this tell you about the potential bias introduced by convenience sampling?</b></div> 
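
The bias can also be checked directly. As a sketch that reuses the `depths` grid from the Dataset section (zeros mark cells with no data), the snippet below compares the mean over the accessible 5-by-5 corner with the mean over the whole island. The names `population_mean`, `accessible`, and `accessible_mean` are introduced here for illustration only.

```python
import numpy as np

# bedrock depths from the Dataset section; 0 marks cells with no data
depths = np.array([[0 , 0 , 45, 48, 51, 49, 50, 52, 49, 53, 54, 55, 0 , 0 ],
                   [0 , 42, 47, 48, 43, 41, 45, 49, 50, 50, 55, 54, 49, 0 ],
                   [37, 38, 45, 46, 49, 47, 51, 52, 51, 55, 56, 54, 59, 58],
                   [39, 41, 42, 44, 46, 43, 47, 49, 55, 57, 54, 59, 60, 62],
                   [41, 38, 39, 47, 49, 51, 47, 45, 53, 60, 57, 62, 63, 59],
                   [43, 36, 40, 49, 50, 48, 53, 54, 51, 58, 54, 60, 49, 65],
                   [49, 46, 48, 52, 58, 60, 65, 59, 62, 55, 64, 59, 66, 68],
                   [0 , 50, 55, 56, 51, 65, 68, 66, 70, 66, 71, 68, 69, 73],
                   [0 , 0 , 60, 57, 55, 50, 48, 55, 63, 70, 69, 72, 71, 75]])

# population: every cell that has data
population_mean = depths[depths != 0].mean()

# convenience region: rows 0-4 and columns 0-4, again skipping empty cells
accessible = depths[:5, :5]
accessible = accessible[accessible != 0]
accessible_mean = accessible.mean()

print(f"population mean: {population_mean:.1f}")
print(f"accessible mean: {accessible_mean:.1f}")
print(f"difference:      {accessible_mean - population_mean:.1f}")
```

The accessible corner covers a shallower part of the grid, so samples drawn only from it tend to underestimate the population mean, whatever their size.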