In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

# Seed Python's built-in random module (this does not seed NumPy or pandas sampling)
random.seed(42)

In [2]:
# Loading dataset
spotify = pd.read_feather("data/Sampling/spotify_2000_2020.feather")  # Use read_feather instead of read_csv
attrition = pd.read_feather("data/Sampling/attrition.feather")
coffee = pd.read_feather("data/Sampling/coffee_ratings_full.feather")

<h2>Sampling In Python</h2>
<p>Sampling and population are fundamental concepts in statistics, particularly in data analysis, research, and machine learning. Here's a breakdown of both:</p>
<h3>Population:</h3>
<ul>
	<li><strong>Definition</strong>: The entire group of individuals, items, or events that you are interested in studying.</li>
	<li><strong>Example</strong>: If you want to study the average height of adults in a country, the population would be <strong>all adults in that country</strong>.</li>
	<li><strong>Types</strong>:
		<ul>
			<li><strong>Finite Population</strong>: A population with a limited number of members, like all employees in a company.</li>
			<li><strong>Infinite Population</strong>: A population that is conceptually unlimited, like all possible rolls of a die.</li>
		</ul>
	</li>
	<li><strong>Parameters</strong>: Characteristics of the population (e.g., population mean $\mu$, population standard deviation $\sigma$).</li>
</ul>
<h3>Sample:</h3>
<ul>
	<li><strong>Definition</strong>: A subset of the population that is used to gather insights and make inferences about the population.</li>
	<li><strong>Example</strong>: You might measure 500 adults chosen at random from across the country to estimate the average height of all adults in that country.</li>
	<li><strong>Sampling Methods</strong>:
		<ul>
			<li><strong>Simple Random Sampling</strong>: Every individual has an equal chance of being selected.</li>
			<li><strong>Stratified Sampling</strong>: The population is divided into subgroups (strata) based on shared characteristics, and a sample is taken from each group.</li>
			<li><strong>Cluster Sampling</strong>: The population is divided into clusters, and entire clusters are randomly selected.</li>
			<li><strong>Systematic Sampling</strong>: A sample is drawn using a fixed interval (e.g., every 10th person).</li>
			<li><strong>Convenience Sampling</strong>: Samples are chosen based on ease of access (can introduce bias).</li>
		</ul>
	</li>
</ul>
<h3>Why Sampling?</h3>
<ul>
	<li><strong>Cost and Time Efficiency</strong>: It's often impractical or impossible to study an entire population due to cost or time constraints.</li>
	<li><strong>Accuracy</strong>: A well-designed sample can provide a reliable approximation of population parameters.</li>
	<li><strong>Statistical Inference</strong>: By analyzing a sample, we can make inferences about the population through estimators (like sample mean and sample standard deviation).</li>
</ul>
<h3>Bias and Error in Sampling:</h3>
<ul>
	<li><strong>Sampling Bias</strong>: Occurs when certain members of the population are more likely to be included in the sample than others, leading to non-representative samples.</li>
	<li><strong>Sampling Error</strong>: The difference between the sample statistic (e.g., sample mean) and the actual population parameter.</li>
</ul>
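<p>The difference between a parameter and a statistic can be made concrete with a small simulation. The sketch below assumes a made-up population of heights (the distribution, its size, and the seed are illustrative choices, not taken from any of the datasets loaded above) and compares the sample mean with the population mean.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Hypothetical population: heights (cm) of 100,000 adults
population = rng.normal(loc=170, scale=10, size=100_000)
population_mean = population.mean()  # the parameter (mu)

# Simple random sample of 500 individuals, drawn without replacement
sample = rng.choice(population, size=500, replace=False)
sample_mean = sample.mean()  # the statistic (point estimate of mu)

# Sampling error: statistic minus parameter
sampling_error = sample_mean - population_mean
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"sampling error:  {sampling_error:+.2f}")
```

<p>In practice the population mean is usually unknown, so the sampling error cannot be computed directly. It can only be estimated, for example with a standard error.</p>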

In [3]:
# Sampling in Pandas Dataframe
coffee.sample(n=10)

Unnamed: 0,total_cup_points,species,owner,country_of_origin,farm_name,lot_number,mill,ico_number,company,altitude,...,color,category_two_defects,expiration,certification_body,certification_address,certification_contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
455,83.17,Arabica,racafe & cia s.c.a,Colombia,,,,3-37-0119,racafe & cia s.c.a,de 1600 a 1950 msnm,...,,4.0,"November 5th, 2015",Almacafé,e493c36c2d076bf273064f7ac23ad562af257a25,70d3c0c26f89e00fdae6fb39ff54f0d2eb1c38ab,m,1600.0,1950.0,1775.0
1077,80.5,Arabica,juan luis alvarado romero,Guatemala,barranca de las flores,,beneficio exportacafe agua santa,11/52/492,exportcafe,6100 metros,...,Green,1.0,"February 13th, 2014",Asociacion Nacional Del Café,b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53,724f04ad10ed31dbb9d260f0dfd221ba48be8a95,ft,1859.28,1859.28,1859.28
1123,80.0,Arabica,ngu shwe li,Myanmar,doe kwin,,ngu shwe li coffee estate,Unspecified,ngu shwe li coffee estate,3845,...,Green,4.0,"July 16th, 2016",Coffee Quality Institute,1d4c7f93129f9fb1c8a5f0ce0e36cc1cf4c2f4d7,0f62c9236e3ff5c4921da1e22a350aa99482779d,m,3845.0,3845.0,3845.0
392,83.42,Arabica,federacion nacional de cafeteros,Colombia,,,,03-01-0424,,,...,,3.0,"February 1st, 2012",Almacafé,e493c36c2d076bf273064f7ac23ad562af257a25,70d3c0c26f89e00fdae6fb39ff54f0d2eb1c38ab,m,,,
1110,80.17,Arabica,israel eduardo paz garcia,Mexico,el carrizo,,"zaragoza itundujia, oaxaca",1405791535,cafeorganico.mx,1550,...,Green,2.0,"September 13th, 2013",AMECAFE,59e396ad6e22a1c22b248f958e1da2bd8af85272,0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7,m,1550.0,1550.0,1550.0
259,83.92,Arabica,cafes tomari sa de cv,Mexico,cafetal,101.0,cafes tomari sa de cv,016-1273-101,cafes tomari sa de cv,1300,...,Green,5.0,"July 3rd, 2018",Centro Agroecológico del Café A.C.,3b8dfdd621590b424ff64e0b76df7d6a92e1c628,d470dc009281519e30da6ead1c649fcd7670f386,m,1300.0,1300.0,1300.0
458,83.17,Arabica,mayra yessenia torres,Honduras,el cerron,,rio hamaca,13-01-2369,"olam honduras, s.a.",1450 msnn,...,Green,5.0,"April 12th, 2015",Instituto Hondureño del Café,b4660a57e9f8cc613ae5b8f02bfce8634c763ab4,7f521ca403540f81ec99daec7da19c2788393880,m,1450.0,1450.0,1450.0
74,85.42,Arabica,grounds for health admin,El Salvador,sierra nevada,,beneficio las tres puertas,9-060-60D-L-1D,,1400 m,...,,0.0,"May 31st, 2011",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1400.0,1400.0,1400.0
400,83.33,Arabica,andreas kussmaul,Mexico,ecc,,ecc,1620280402,exportadora café california,1200,...,Green,2.0,"July 13th, 2016",Asociación Mexicana De Cafés y Cafeterías De E...,3441698871fa609a44ce947e8944ee42eb4428b9,9894541e8065ee718165a1d432389d114defc38c,m,1200.0,1200.0,1200.0
487,83.08,Arabica,cqi taiwan icp cqi台灣合作夥伴,Taiwan,alishan zou zhu yuan 阿里山鄒築園,,alishan zou zhu yuan 阿里山鄒築園,,blossom valley宸嶧國際,1300 m,...,,0.0,"December 26th, 2014",Blossom Valley International,fc45352eee499d8470cf94c9827922fb745bf815,de73fc9412358b523d3a641501e542f31d2668b0,m,1300.0,1300.0,1300.0


In [4]:
# Sampling in Pandas Series
coffee['species'].sample(n=5)

1314    Robusta
1103    Arabica
1213    Arabica
306     Arabica
1086    Arabica
Name: species, dtype: object

<hr>

<h2>Some Sampling Techniques</h2>
<p>Let's dive into specific examples of sampling techniques and how they can be implemented in pandas for real-world data analysis. We'll cover a few sampling techniques beyond basic random sampling, such as <strong>stratified sampling</strong> and <strong>systematic sampling</strong>, as well as <strong>weighted sampling</strong> and creating train-test splits commonly used in machine learning.</p>
<h4><strong>1. Basic Random Sampling</strong></h4>
<p>This is the simplest form of sampling, where each row in the DataFrame has an equal probability of being selected.</p>

In [5]:
# Sample DataFrame
data = {
    'Name': ['John', 'Jane', 'Tom', 'Lucy', 'Jake', 'Mary', 'Alice', 'Bob'],
    'Age': [25, 30, 22, 29, 32, 35, 28, 40],
    'Score': [85, 90, 88, 92, 95, 89, 91, 87]
}

df = pd.DataFrame(data)

# Random sample of 3 rows
sampled_df = df.sample(n=3)
sampled_df

Unnamed: 0,Name,Age,Score
7,Bob,40,87
1,Jane,30,90
4,Jake,32,95


<h4>2. <strong>Stratified Sampling</strong></h4>
<p>Stratified sampling is used when you want to ensure that different subgroups (strata) in your data are all represented in the sample, either in proportion to their size or with equal counts. pandas has no single stratified-sampling function, but you can achieve this by grouping the data first and then sampling within each group.</p>

In [6]:
# Stratified sampling: take one randomly selected row from each Age group
# Every Age value is unique in this toy data, so each group holds one row and all 8 rows come back

stratified_sample = df.groupby('Age').sample(1)
stratified_sample

Unnamed: 0,Name,Age,Score
2,Tom,22,88
0,John,25,85
6,Alice,28,91
3,Lucy,29,92
1,Jane,30,90
4,Jake,32,95
5,Mary,35,89
7,Bob,40,87


This example assumes that the stratifying variable is Age, but we can use other columns to stratify as needed.
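<p>With one row per age, the example above cannot show the effect of stratifying. A minimal sketch with larger strata follows; the <code>employees</code> DataFrame and its column names are hypothetical. It contrasts a proportional sample (<code>frac</code>) with an equal-count sample (<code>n</code>).</p>

```python
import pandas as pd

# Hypothetical data: 100 employees in three departments of unequal size
employees = pd.DataFrame({
    "department": ["Sales"] * 60 + ["Engineering"] * 30 + ["HR"] * 10,
    "salary": range(100),
})

# Proportional stratified sample: 20% of each department
proportional = employees.groupby("department").sample(frac=0.2, random_state=42)

# Equal-count stratified sample: 5 rows from each department
equal_counts = employees.groupby("department").sample(n=5, random_state=42)

print(proportional["department"].value_counts())
print(equal_counts["department"].value_counts())
```

<p>The proportional version keeps the 60/30/10 balance of the population, while the equal-count version over-represents the small departments on purpose.</p>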

<h4>3. <strong>Systematic Sampling</strong></h4>
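<p>Systematic sampling involves selecting items from the population at regular intervals. In pandas, this can be achieved by slicing the DataFrame with a fixed step. This is only safe when the row order is unrelated to the values being studied; otherwise, shuffle the rows first.</p>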

In [7]:
# Systematic sampling: Select every 2nd row from the DataFrame
interval = 2

systematic_sample = df.iloc[::interval]
systematic_sample

Unnamed: 0,Name,Age,Score
0,John,25,85
2,Tom,22,88
4,Jake,32,95
6,Alice,28,91
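<p>The interval above is hard-coded. A sketch of deriving it from a target sample size, and of shuffling first when the row order might carry a pattern, could look like this. The <code>population</code> DataFrame is hypothetical.</p>

```python
import pandas as pd

# Hypothetical data: 100 rows in some stored order
population = pd.DataFrame({"id": range(100), "value": [i % 7 for i in range(100)]})

sample_size = 10
interval = len(population) // sample_size  # 100 // 10 = 10

# Plain systematic sample: every 10th row, starting from row 0
systematic = population.iloc[::interval]

# If the row order may be related to the values, shuffle first.
# sample(frac=1) returns all rows in random order.
shuffled = population.sample(frac=1, random_state=42).reset_index(drop=True)
systematic_shuffled = shuffled.iloc[::interval]

print(systematic["id"].tolist())
print(len(systematic_shuffled))
```

<p>Once the rows are shuffled, systematic sampling is essentially equivalent to simple random sampling, so the shuffle is a safeguard rather than a distinct technique.</p>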


<h4>4. <strong>Weighted Sampling</strong></h4>
<p>In some cases, certain rows might be more important or have higher relevance, so we can assign weights to rows during sampling. Here's an example of how to do weighted sampling in pandas:</p>

In [8]:
# Weighted sampling: Give higher probability to rows with higher 'Score'
weights = df['Score'] / df['Score'].sum()

weighted_sample = df.sample(n=3, weights=weights)
weighted_sample

Unnamed: 0,Name,Age,Score
0,John,25,85
5,Mary,35,89
6,Alice,28,91


This assigns a higher probability of being selected to rows with higher scores.
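<p>With weights as close together as the scores above, the effect is hard to see in a sample of three rows. The sketch below uses a hypothetical <code>items</code> DataFrame with deliberately uneven weights and draws many rows with replacement, so that the selection frequencies can be compared with the weights.</p>

```python
import pandas as pd

# Hypothetical data: three items with very different weights
items = pd.DataFrame({"item": ["a", "b", "c"], "weight": [1, 2, 7]})

# Draw 10,000 rows with replacement and count how often each item appears.
# pandas normalizes weights that do not sum to 1, so raw values are fine here.
draws = items.sample(n=10_000, replace=True, weights="weight", random_state=42)
frequencies = draws["item"].value_counts(normalize=True)

print(frequencies)
```

<p>The observed frequencies should land close to 0.1, 0.2, and 0.7, which are the weights divided by their sum.</p>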

<h4>5. <strong>Train-Test Split for Machine Learning</strong></h4>
<p>Sampling is frequently used in splitting datasets into training and testing sets for model building in machine learning. A random 70-30 or 80-20 split is often used to divide the data into training and testing sets.</p>

In [9]:
from sklearn.model_selection import train_test_split

# Sample DataFrame (input features and labels)
X = df[['Age', 'Score']]  # Features
y = df['Name']  # Labels

# Split into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("X_train:\n", X_train)
print("X_test:\n", X_test)

X_train:
    Age  Score
7   40     87
2   22     88
4   32     95
3   29     92
6   28     91
X_test:
    Age  Score
1   30     90
5   35     89
0   25     85


This uses the <code>train_test_split</code> function from scikit-learn, but we could also use pandas’ <code>sample()</code> method for custom splits.
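<p>A custom split with <code>sample()</code> might look like the sketch below. The <code>data</code> DataFrame is hypothetical, and the approach assumes the index values are unique, since <code>drop()</code> removes rows by index label.</p>

```python
import pandas as pd

# Hypothetical data: 10 rows
data = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# 70% of the rows, chosen at random without replacement, form the training set
train = data.sample(frac=0.7, random_state=42)

# Every row not selected for training goes to the test set
test = data.drop(train.index)

print(len(train), len(test))
```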

<h4>6. <strong>Bootstrapping (Sampling with Replacement)</strong></h4>
<p>In statistical analysis, bootstrapping refers to sampling with replacement, which is useful when you want to create multiple samples from the same dataset for estimation purposes.</p>

In [10]:
# Sampling with replacement to create a bootstrapped sample
bootstrapped_sample = df.sample(n=len(df), replace=True)
bootstrapped_sample

Unnamed: 0,Name,Age,Score
0,John,25,85
3,Lucy,29,92
4,Jake,32,95
2,Tom,22,88
3,Lucy,29,92
3,Lucy,29,92
1,Jane,30,90
7,Bob,40,87


<h4>7. <strong>Cluster Sampling</strong></h4>
<p>Cluster sampling involves dividing the population into clusters, randomly selecting some of those clusters, and then sampling within each selected cluster. It&rsquo;s useful when the dataset is very large or when collecting data from every group is costly. We can simulate cluster sampling by first randomly choosing a few groups and then sampling from each chosen group.</p>
<p>Stratified sampling vs. cluster sampling: stratified sampling splits the population into subgroups and then uses simple random sampling on every one of them. Cluster sampling limits the number of subgroups in the analysis by picking a few of them with simple random sampling, and then performs simple random sampling within each picked subgroup.</p>

<p>Stage 1: sampling for subgroups</p>
The first stage of cluster sampling is to cut down the number of varieties by randomly selecting a few of them.

In [11]:
varieties_pop = list(coffee.variety.unique())

varieties_samp = random.sample(varieties_pop, k=3)
varieties_samp

['Pacas', 'Catimor', None]

<p>Stage 2: sampling each group</p>
The second stage of cluster sampling is to perform simple random sampling on each of the three varieties we randomly selected.<br>We first filter the dataset for rows where the variety is one of the three selected, using the <code>.isin()</code> method.
<br>If <code>variety</code> is stored as a categorical column, the filtered data still keeps the levels that now have zero rows. Calling <code>.cat.remove_unused_categories()</code> on that column drops them; without it, we might receive an error when sampling by variety level. The cell below skips this call, which works when the column holds plain strings rather than a categorical dtype.

In [12]:
variety_condition = coffee['variety'].isin(varieties_samp)
coffee_ratings_cluster = coffee[variety_condition]
coffee_ratings_cluster

Unnamed: 0,total_cup_points,species,owner,country_of_origin,farm_name,lot_number,mill,ico_number,company,altitude,...,color,category_two_defects,expiration,certification_body,certification_address,certification_contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
0,90.58,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,...,Green,0.0,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
3,89.00,Arabica,yidnekachew dabessa,Ethiopia,yidnekachew dabessa coffee plantation,,wolensu,,yidnekachew debessa coffee plantation,1800-2200,...,Green,2.0,"March 25th, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1800.0,2200.0,2000.0
5,88.83,Arabica,ji-ae ahn,Brazil,,,,,,,...,Bluish-Green,1.0,"September 3rd, 2014",Specialty Coffee Institute of Asia,726e4891cf2c9a4848768bd34b668124d12c4224,b70da261fcc84831e3e9620c30a8701540abc200,m,,,
7,88.67,Arabica,ethiopia commodity exchange,Ethiopia,aolme,,c.p.w.e,010/0338,,1570-1700,...,,0.0,"September 2nd, 2011",Ethiopia Commodity Exchange,a176532400aebdc345cf3d870f84ed3ecab6249e,61bbaf6a9f341e5782b8e7bd3ebf76aac89fe24b,m,1570.0,1700.0,1635.0
8,88.42,Arabica,ethiopia commodity exchange,Ethiopia,aolme,,c.p.w.e,010/0338,,1570-1700,...,,0.0,"September 2nd, 2011",Ethiopia Commodity Exchange,a176532400aebdc345cf3d870f84ed3ecab6249e,61bbaf6a9f341e5782b8e7bd3ebf76aac89fe24b,m,1570.0,1700.0,1635.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,78.75,Robusta,luis robles,Ecuador,robustasa,Lavado 1,our own lab,,robustasa,,...,Blue-Green,1.0,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,
1334,78.08,Robusta,luis robles,Ecuador,robustasa,Lavado 3,own laboratory,,robustasa,40,...,Blue-Green,0.0,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,40.0,40.0,40.0
1335,77.17,Robusta,james moore,United States,fazenda cazengo,,cafe cazengo,,global opportunity fund,795 meters,...,,6.0,"December 23rd, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,795.0,795.0,795.0
1336,75.08,Robusta,cafe politico,India,,,,14-1118-2014-0087,cafe politico,,...,Green,1.0,"August 25th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,


In [13]:
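# Take 2 rows per selected variety. groupby() drops missing keys by default,
# so the None "variety" picked in stage 1 contributes no rows.
coffee_ratings_cluster.groupby("variety").sample(n=2, random_state=2021)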


Unnamed: 0,total_cup_points,species,owner,country_of_origin,farm_name,lot_number,mill,ico_number,company,altitude,...,color,category_two_defects,expiration,certification_body,certification_address,certification_contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
192,84.25,Arabica,松澤宏樹 koju matsuzawa,Thailand,matsuzawa coffee,MCCRNX115/16,matsuzawa coffee,,matsuzawa coffee,1200,...,Green,0.0,"November 2nd, 2017",Specialty Coffee Institute of Asia,726e4891cf2c9a4848768bd34b668124d12c4224,b70da261fcc84831e3e9620c30a8701540abc200,m,1200.0,1200.0,1200.0
444,83.17,Arabica,"sunvirtue co., ltd.",Vietnam,apollo estate,Oriental Paris Natural Coffee,yes,,"sunvirtue co., ltd.",1550,...,,0.0,"May 8th, 2018",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1550.0,1550.0,1550.0
1024,80.92,Arabica,juan luis alvarado romero,Guatemala,el morito,,el morito,11/972/29,"armajaro guatemala, s. a.",5000 ft.,...,Green,5.0,"June 10th, 2015",Asociacion Nacional Del Café,b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53,724f04ad10ed31dbb9d260f0dfd221ba48be8a95,ft,1524.0,1524.0,1524.0
1199,79.0,Arabica,cafes finos de exportacion s de r.l.,Honduras,various,,cafes finos de exportacion s. de r.l.,13-117-87,cafes finos de exportacion s de r.l.,1350,...,Green,12.0,"April 26th, 2015",Instituto Hondureño del Café,b4660a57e9f8cc613ae5b8f02bfce8634c763ab4,7f521ca403540f81ec99daec7da19c2788393880,m,1350.0,1350.0,1350.0
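<p>To illustrate both stages together, including the categorical clean-up step described above, here is a self-contained sketch. The <code>beans</code> DataFrame, its five varieties, and the seeds are hypothetical and stand in for the coffee data.</p>

```python
import random
import pandas as pd

random.seed(42)

# Hypothetical data: 5 varieties with 6 rows each, stored as a categorical column
beans = pd.DataFrame({
    "variety": pd.Categorical(["A", "B", "C", "D", "E"] * 6),
    "score": range(30),
})

# Stage 1: randomly pick 2 of the 5 varieties (the clusters)
varieties_pop = list(beans["variety"].unique())
varieties_samp = random.sample(varieties_pop, k=2)

# Stage 2: keep only the selected varieties, drop the unused category levels,
# then take a simple random sample of 3 rows from each remaining variety
cluster_rows = beans[beans["variety"].isin(varieties_samp)].copy()
cluster_rows["variety"] = cluster_rows["variety"].cat.remove_unused_categories()
cluster_sample = cluster_rows.groupby("variety", observed=True).sample(n=3, random_state=2021)

print(cluster_sample)
```

<p>Only the two selected varieties appear in the result, with three rows each. Passing <code>observed=True</code> is a second safeguard so that the grouping only considers levels that actually occur.</p>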


<h4>Summary of Techniques:</h4>
<ul>
<li><strong>Random Sampling</strong>: Use <code>df.sample()</code>.</li>
<li><strong>Stratified Sampling</strong>: Group the data using <code>groupby()</code> and apply <code>sample()</code> within each group.</li>
<li><strong>Systematic Sampling</strong>: Slice the DataFrame using <code>iloc</code> with a regular interval.</li>
<li><strong>Weighted Sampling</strong>: Use <code>weights</code> in <code>sample()</code>.</li>
<li><strong>Train-Test Split</strong>: Use <code>train_test_split</code> from scikit-learn or <code>sample()</code> for custom splits.</li>
<li><strong>Bootstrapping</strong>: Sample with replacement using <code>replace=True</code>.</li>
<li><strong>Cluster Sampling</strong>: Randomly pick a few groups, then sample within each picked group.</li>
</ul>
<hr>

<h2>Convenience Sampling</h2>
<p><strong>Convenience sampling</strong> is a non-probability sampling technique where samples are drawn based on ease of access, availability, and proximity to the researcher. In contrast to probability methods such as simple random or stratified sampling, where every element has a known chance of being selected, convenience sampling gathers data from whatever sources are readily accessible.</p>
<h4>Characteristics of Convenience Sampling:</h4>
<ol>
<li><strong>Ease of Access</strong>: Data is collected from a group that is easily available.</li>
<li><strong>Cost and Time Efficiency</strong>: Often quicker and less expensive since it doesn't require complex randomization.</li>
<li><strong>Higher Risk of Bias</strong>: Since the sample may not represent the population well, the results may be biased and not generalizable.</li>
</ol>
<h4>Example Scenario:</h4>
<ol>
    <li>You are conducting a study on smartphone usage, and instead of randomly selecting people, you ask your friends and colleagues who are easily available. This might introduce bias because your friends and colleagues may share similar characteristics, but it's convenient and quick.</li>
    <li>In 1936, a magazine called The Literary Digest ran an extensive poll to try to predict the next US presidential election. It mailed ballots to about 10 million people, drawn largely from telephone directories and automobile registration lists, and received over 2 million responses. About 1.3 million people said they would vote for Landon, and just under 1 million said they would vote for Roosevelt, so Landon was predicted to get 57% of the vote and Roosevelt 43%. Because the sample was so large, the poll was presumed to be very accurate. In the election, however, Roosevelt won by a landslide with about 61% of the vote. What went wrong? In 1936, telephones and cars were luxuries, so the people The Literary Digest contacted were relatively rich. The sample was not representative of the whole population of voters, and the poll suffered from sampling bias. The data was collected from the lists that were easiest to obtain. This is convenience sampling, and it is often prone to sampling bias. Before sampling, we need to think about our data collection process to avoid biased results.</li>
</ol>
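<p>The effect of convenience sampling can be sketched with made-up data. The example assumes a hypothetical table of scores stored from highest to lowest, so that grabbing the first rows (the "easiest" ones) is not representative.</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical population: 1,000 scores stored from highest to lowest
scores = pd.DataFrame({"score": np.sort(rng.normal(loc=80, scale=5, size=1_000))[::-1]})
population_mean = scores["score"].mean()

# Convenience sample: the first 50 rows, simply because they are easiest to grab
convenience_mean = scores.head(50)["score"].mean()

# Simple random sample of the same size
random_mean = scores.sample(n=50, random_state=42)["score"].mean()

print(f"population:  {population_mean:.2f}")
print(f"convenience: {convenience_mean:.2f}")
print(f"random:      {random_mean:.2f}")
```

<p>The convenience mean overshoots the population mean by a wide margin, while the random sample lands close to it. A larger convenience sample would not fix this, which is the lesson of the 1936 poll.</p>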

<h2>Pseudo-random number generation</h2>
<h3 class="css-1qgaovm">What does random mean?</h3>
<ul>
	<li>There are several meanings of random in English. The one that matters for sampling is unpredictability.</li>
	<li>If we want to choose data points at random from a population, we shouldn't be able to predict which data points would be selected ahead of time in some systematic way.</li>
</ul>
<h3 class="css-1qgaovm">True random numbers</h3>
<ul>
	<li>To generate truly random numbers, we typically have to use a physical process like flipping coins or rolling dice. The Hotbits service generates numbers from radioactive decay, and RANDOM.ORG generates numbers from atmospheric noise, which is radio noise generated by lightning.</li>
	<li>Unfortunately, these processes are fairly slow and expensive for generating random numbers.</li>
</ul>
<h3>Pseudo-random number</h3>
<ul>
	<li>For most use cases, pseudo-random number generation is better since it is cheap and fast.</li>
	<li>Pseudo-random means that although each value appears to be random, it is actually calculated from the previous random number.</li>
	<li>Since you have to start the calculations somewhere, the first random number is calculated from what is known as a <strong>seed</strong> value.</li>
	<li>The prefix "pseudo" emphasizes that this process isn't really random.</li>
	<li>If we start from a particular seed value, the generator produces the same sequence of numbers every time.</li>
</ul>
<h3>Seed Value</h3>
<ul>
	<li>A seed value initializes the random number generator (RNG).</li>
	<li>The same seed will always produce the same sequence of pseudo-random numbers.</li>
	<li>A random seed (like system time) can be used to vary the sequence.</li>
</ul>
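<p>A quick way to see what a seed does is to generate the same sequence twice. The sketch below uses the standard library <code>random</code> module; the seed values 123 and 456 are arbitrary.</p>

```python
import random

# Same seed -> same sequence
random.seed(123)
first_run = [random.random() for _ in range(3)]

random.seed(123)
second_run = [random.random() for _ in range(3)]

# Different seed -> a different sequence
random.seed(456)
third_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
print(first_run == third_run)   # False
```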
<h3>Generating Random Numbers in Python</h3>
<ul>
	<li>The <code>random</code> module in Python&rsquo;s standard library and the <code>numpy.random</code> module in NumPy both provide tools for generating random numbers, but there are key differences in terms of functionality, performance, and usage.</li>
	<li><strong><code>random</code> module</strong>: This is the standard Python library for generating random numbers. It is simpler and provides basic functionality for random number generation, primarily used for small-scale tasks like games, simulations, or basic randomization. It is designed for simple, lightweight operations and lacks the optimizations for handling large datasets.</li>
	<li><strong><code>numpy.random</code> module</strong>: Part of the NumPy library, this module is designed for large-scale scientific computing. It offers more powerful and flexible random number generation, especially useful for handling large arrays and matrices of random numbers. It is much faster when generating random numbers for large datasets or multidimensional arrays. It is optimized for performance, especially for numerical operations, which is critical in fields like data science or machine learning.</li>
</ul>

In [14]:
import random

# Set the seed for reproducibility.
# Changing the seed value gives a different sequence of random numbers;
# removing the 'random.seed(42)' line gives a different sequence on every run.
random.seed(42)

# Single random number
print(random.random())   # float between 0 and 1

# Random number in a range
print(random.randint(1, 10))  # integer between 1 and 10

# Randomly shuffle a list
my_list = [1, 2, 3, 4]
random.shuffle(my_list)
print(my_list)

0.6394267984578837
1
[2, 4, 1, 3]


In [15]:
import numpy as np

# Set the seed for reproducibility
np.random.seed(42)

# Generate an array of random floats
print(np.random.random(5))  # array of 5 floats between 0 and 1

# Generate an array of random integers
print(np.random.randint(1, 10, size=(2, 3)))  # 2x3 array of integers between 1 and 9

# Generate an array of random numbers from a normal distribution
print(np.random.normal(0, 1, 5))  # 5 floats from a normal distribution

# Repeat the same calls without reseeding: the generator continues its sequence,
# so the values below differ from the first batch
print(np.random.random(5))  # array of 5 floats between 0 and 1

# Generate an array of random integers
print(np.random.randint(1, 10, size=(2, 3)))  # 2x3 array of integers between 1 and 9

# Generate an array of random numbers from a normal distribution
print(np.random.normal(0, 1, 5))  # 5 floats from a normal distribution

[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
[[3 7 8]
 [5 4 8]]
[-0.58087813 -0.52516981 -0.57138017 -0.92408284 -2.61254901]
[0.29122914 0.61185289 0.13949386 0.29214465 0.36636184]
[[3 7 4]
 [9 3 5]]
[ 0.95036968 -1.15099358  0.37569802 -0.60063869 -0.29169375]
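<p>The cell above relies on NumPy's global random state. NumPy also provides a generator object through <code>np.random.default_rng</code>, which keeps its state separate from the rest of the program. A minimal sketch, with an arbitrary seed, is shown here as an alternative rather than as a change to this notebook's approach.</p>

```python
import numpy as np

# Create an independent generator with its own seed
rng = np.random.default_rng(42)

floats = rng.random(5)                        # 5 floats in [0, 1)
integers = rng.integers(1, 10, size=(2, 3))   # 2x3 integers from 1 to 9 (upper bound excluded)
normals = rng.normal(loc=0, scale=1, size=5)  # 5 draws from a standard normal

# A second generator with the same seed reproduces the same values
rng_again = np.random.default_rng(42)
print(np.array_equal(floats, rng_again.random(5)))  # True
```

<p>Note that a generator created this way does not produce the same numbers as <code>np.random.seed(42)</code> followed by the legacy functions, because the two use different underlying algorithms.</p>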


<h3 class="css-1qgaovm">Numpy Random number generating functions</h3>
<ul>
	<li>NumPy has many functions for generating random numbers from statistical distributions.</li>
	<img style="margin-left:2%" src="img/x1.jpeg" width=50%>
	<li>To use any of these, prefix the function name with <code>numpy.random.</code> (or <code>np.random.</code> with the import used here).</li>
	<li>Some of them, like <code>.uniform</code> and <code>.normal</code>, may be familiar, but others have more niche applications.</li>
</ul>

In [16]:
# Generate 10 random numbers from a Uniform distribution(low=-3, high=3)
uniforms = np.random.uniform(low=-3, high=3, size=10)
uniforms

array([ 1.10539816, -0.35908504, -2.26777059, -0.02893854, -2.79366887,
        2.45592241, -1.44732011,  0.97513371, -1.12973354,  0.12040813])

In [17]:
# Generate 10 random numbers from a Normal distribution(mean=5, sd=2)
normals = np.random.normal(loc=5, scale=2, size=10)
normals

array([2.3436279 , 5.39372247, 6.47693316, 5.34273656, 4.76870344,
       4.39779261, 2.04295602, 3.56031158, 4.07872246, 7.11424445])

<h2>What is a Sampling Distribution?</h2>
<p>Imagine you own a big jar filled with 10,000 marbles, some red and some blue. You want to know how many marbles are red, but counting all 10,000 is hard. Instead, you decide to take a smaller sample&mdash;say, 100 marbles at a time&mdash;and count how many are red.</p>
<p>Now, if you do this once, you&rsquo;ll get a percentage of red marbles in your sample. But what if you take <strong>many</strong> samples of 100 marbles each? Each time, you might get slightly different percentages of red marbles. <strong>The sampling distribution is the distribution of all those percentages.</strong></p>
<p>Let&rsquo;s walk through it:</p>
<p><strong>Example:</strong></p>
<ol>
	<li><strong>Population</strong>: The jar with 10,000 marbles.</li>
	<li><strong>Statistic</strong>: The percentage of red marbles in a sample.</li>
	<li><strong>Sample</strong>: A random group of 100 marbles you pick out of the jar.</li>
	<li><strong>Sampling Distribution</strong>: After taking, say, 100 samples of 100 marbles each, you calculate the percentage of red marbles in each sample. The collection of these percentages forms the sampling distribution.</li>
</ol>
<h3>Key Points in the Example:</h3>
<ul>
	<li><strong>Each Sample is Different</strong>: Each time you take a sample of 100 marbles, the percentage of red marbles will likely be a bit different (one sample might have 60% red marbles, another might have 55%, etc.).</li>
	<li><strong>Sampling Distribution</strong>: If you plot the percentages from all your samples on a graph, you&rsquo;ll get a distribution that shows how often each percentage occurs.</li>
	<li><strong>Central Limit Theorem (CLT)</strong>: As the size of each sample grows, the shape of this sampling distribution looks more and more like a normal (bell-shaped) curve, even if the population (the jar) isn&rsquo;t perfectly balanced between red and blue. Taking more samples only draws that shape more clearly.</li>
</ul>
<h3>Let&rsquo;s Visualize:</h3>
<ol>
	<li><strong>Sample 1</strong>: You pick 100 marbles &rarr; 57% are red.</li>
	<li><strong>Sample 2</strong>: You pick another 100 marbles &rarr; 62% are red.</li>
	<li><strong>Sample 3</strong>: You pick again &rarr; 53% are red.</li>
</ol>
<p>If you keep doing this, you&rsquo;ll get a bunch of percentages, like:</p>
<ul>
	<li>57%, 62%, 53%, 60%, 55%, etc.</li>
</ul>
<p>These percentages form the <strong>sampling distribution</strong> of the sample proportion.</p>
<h3>Why is This Important?</h3>
<ol>
	<li><strong>Averages Tell Us Something</strong>: The average of all these percentages will give you an estimate of the true percentage of red marbles in the entire jar.</li>
	<li><strong>Spread (Standard Error)</strong>: The spread (or how much the percentages vary) tells us how much our sample estimates differ from the true population value.</li>
</ol>
<h3>Conclusion:</h3>
<ul>
	<li>The <strong>sampling distribution</strong> helps us understand the variation in sample results. The larger each sample is, the narrower the distribution becomes, and the more confident we can be that a single sample&rsquo;s percentage is close to the true population percentage.</li>
</ul>
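<p>The marble example can be simulated directly. The sketch below assumes a jar in which 60% of the marbles are red and draws 1,000 samples of 100 marbles each. Both numbers, and the seed, are illustrative choices. It also compares the simulated spread with the usual approximation for the standard error of a proportion, $\sqrt{p(1-p)/n}$.</p>

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Hypothetical jar: 10,000 marbles, 60% red (1 = red, 0 = blue)
jar = np.array([1] * 6_000 + [0] * 4_000)
true_proportion = jar.mean()

sample_size = 100
n_samples = 1_000

# Each entry is the proportion of red marbles in one sample of 100 marbles
sample_proportions = np.array([
    rng.choice(jar, size=sample_size, replace=False).mean()
    for _ in range(n_samples)
])

# Center and spread of the simulated sampling distribution
print(f"true proportion:            {true_proportion:.3f}")
print(f"mean of sample proportions: {sample_proportions.mean():.3f}")
print(f"standard error (simulated): {sample_proportions.std(ddof=1):.3f}")

# Textbook approximation for the standard error of a proportion
approx_se = np.sqrt(true_proportion * (1 - true_proportion) / sample_size)
print(f"standard error (formula):   {approx_se:.3f}")
```

<p>Plotting <code>sample_proportions</code> as a histogram (for example with <code>plt.hist</code>) should show a roughly bell-shaped curve centered near 0.6.</p>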

<h2>What is Bootstrapping?</h2>
<p><strong>Bootstrapping</strong> is a statistical technique used to estimate the distribution of a sample statistic by resampling with replacement from the original data. It's especially useful when we don't have a formula for the sampling distribution or the population data is unknown.</p>
<p>Let&rsquo;s break down <strong>bootstrapping</strong> with an example:</p>
<p>Imagine you have a small sample of data from a larger population, but you don&rsquo;t know the exact properties of the population (like the true mean, standard deviation, etc.). Instead of trying to guess, bootstrapping allows you to estimate these properties by creating <strong>many new samples</strong> from your original sample. These new samples are called <strong>resamples</strong>, and each one is drawn <strong>with replacement</strong> from the original sample.</p>
<p>In other words, bootstrapping simulates what would happen if you could take multiple samples from the population, but it uses the original sample as a stand-in for the population.</p>
<h3>Example of Bootstrapping:</h3>
<p>Let&rsquo;s say you have this sample of exam scores from 5 students:</p>
<ul>
	<li><strong>Sample</strong>: 70, 75, 80, 85, 90</li>
</ul>
<p>You want to estimate the <strong>mean</strong> and <strong>standard error</strong> of the mean, but with just 5 data points, it&rsquo;s hard to know how reliable that estimate is.</p>
<h3>Steps in Bootstrapping:</h3>
<ol>
	<li>
		<p><strong>Resampling with Replacement</strong>:</p>
		<ul>
			<li>From your original sample (70, 75, 80, 85, 90), randomly pick 5 values <strong>with replacement</strong> (so the same number can be chosen multiple times).</li>
			<li>Example of a resample: 70, 70, 85, 85, 90.</li>
		</ul>
	</li>
	<li>
		<p><strong>Repeat Many Times</strong>:</p>
		<ul>
			<li>Do this resampling process many times (typically 1,000 times or more) to create multiple resamples.</li>
			<li>Examples of other resamples:
				<ul>
					<li>75, 75, 80, 85, 90</li>
					<li>70, 80, 80, 85, 85</li>
					<li>90, 70, 75, 75, 85</li>
				</ul>
			</li>
		</ul>
	</li>
	<li>
		<p><strong>Calculate Statistic for Each Resample</strong>:</p>
		<ul>
			<li>For each of these resamples, calculate the statistic of interest (e.g., the mean).</li>
			<li>Mean of first resample (70, 70, 85, 85, 90): $\frac{70 + 70 + 85 + 85 + 90}{5} = 80$</li>
			<li>Mean of second resample (75, 75, 80, 85, 90): $\frac{75 + 75 + 80 + 85 + 90}{5} = 81$</li>
			<li>Do this for all resamples.</li>
		</ul>
	</li>
	<li>
		<p><strong>Create a Bootstrapped Distribution</strong>:</p>
		<ul>
			<li>After resampling 1,000 times, you now have 1,000 means. These means form the <strong>bootstrapping distribution</strong> of the sample mean.</li>
			<li>The spread of this distribution gives you an idea of the variability (or <strong>standard error</strong>) of your sample mean.</li>
		</ul>
	</li>
	<li>
		<p><strong>Use the Bootstrapped Distribution</strong>:</p>
		<ul>
			<li>You can use this bootstrapped distribution to create <strong>confidence intervals</strong> or make inferences about the population mean.</li>
			<li>For example, if 95% of your bootstrapped means fall between 78 and 82, you can say the <strong>95% confidence interval</strong> for the population mean is between 78 and 82.</li>
		</ul>
	</li>
</ol>
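<p>The five steps above can be sketched in a few lines of NumPy. The seed and the choice of the percentile method for the interval are assumptions made for this illustration; other bootstrap interval methods exist.</p>

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

scores = np.array([70, 75, 80, 85, 90])  # the original sample
n_resamples = 1_000

# Each resample has the same size as the original and is drawn with replacement
bootstrap_means = np.array([
    rng.choice(scores, size=len(scores), replace=True).mean()
    for _ in range(n_resamples)
])

# The standard deviation of the bootstrap means estimates the standard error
standard_error = bootstrap_means.std(ddof=1)

# Percentile method: the middle 95% of the bootstrap means
ci_low, ci_high = np.quantile(bootstrap_means, [0.025, 0.975])

print(f"sample mean:    {scores.mean():.1f}")
print(f"standard error: {standard_error:.2f}")
print(f"95% interval:   {ci_low:.1f} to {ci_high:.1f}")
```

<p>With only five data points the interval comes out much wider than the 78 to 82 used as an illustration above, which reflects how little information such a small sample carries.</p>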
<h3>Key Points:</h3>
<ul>
	<li><strong>Resampling with Replacement</strong>: This is crucial because it allows us to generate multiple "new" samples from the original data.</li>
	<li><strong>Estimates Variation</strong>: Bootstrapping helps estimate how much the sample statistic (like the mean) might vary from one sample to another.</li>
	<li><strong>Few Assumptions About the Population</strong>: Bootstrapping doesn't assume the population follows any specific distribution, which makes it very flexible. It does assume that the original sample is representative of the population.</li>
</ul>
<h3>Why is Bootstrapping Useful?</h3>
<ul>
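	<li><strong>When You Have Small Samples</strong>: It&rsquo;s hard to judge the reliability of an estimate from a small dataset. Bootstrapping approximates how the estimate would vary across repeated samples, although it cannot add information the original sample lacks, so very small samples still give rough results.</li>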
	<li><strong>No Complex Math Needed</strong>: You don&rsquo;t need to know any complicated formulas for the sampling distribution&mdash;bootstrapping does the work by simply resampling.</li>
</ul>
<h3>Summary:</h3>
<p>Bootstrapping is a powerful tool that creates a <strong>distribution of a sample statistic</strong> by randomly resampling from the data you already have. It&rsquo;s especially useful when the data is limited or when the population distribution is unknown.</p>