<a href="https://colab.research.google.com/github/gitmystuff/DTSC5082/blob/main/Sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sampling

### Parameter vs Statistic

* https://www.statisticshowto.com/statistics-basics/how-to-tell-the-difference-between-a-statistic-and-a-parameter/
* Parameter describes population, statistic describes sample
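
As a minimal sketch of the distinction, the block below builds a made-up population and compares its mean (a parameter) with the mean of one random sample (a statistic). The heights, the sizes, and the names `population`, `parameter_mean`, and `statistic_mean` are illustrative assumptions, not values from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 100,000 adults
population = rng.normal(loc=170, scale=10, size=100_000)

# Parameter: computed from every member of the population
parameter_mean = population.mean()

# Statistic: computed from a sample drawn without replacement
sample = rng.choice(population, size=1_000, replace=False)
statistic_mean = sample.mean()

print(f"parameter (population mean): {parameter_mean:.2f}")
print(f"statistic (sample mean):     {statistic_mean:.2f}")
```

In practice only the statistic is observable; the parameter is known here only because the population was simulated.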

### Statistical Inference

According to Wikipedia (2022):

>  the process of using data analysis to infer properties of an underlying distribution of probability (para 1).

Statistical Inference. (2022, January 24). In *Wikipedia*. https://en.wikipedia.org/wiki/Statistical_inference.

https://www.analyticsvidhya.com/blog/2022/03/an-overview-of-data-collection-data-sources-and-data-mining/

* Mean and standard deviation of a sample should be similar to the mean and standard deviation of the population
* Mean: the center of the distribution; for a symmetric distribution, it is where most of the observations lie
* Standard Deviation: a measure of how far observations typically fall from the mean, along the x axis, expressed in the same units as the data

### Sampling Types

According to Wikipedia (2022):

>  Sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population [similar elements] to estimate characteristics of the whole population (para 1).

Sampling. (2022, January 24). In *Wikipedia*. https://en.wikipedia.org/wiki/Sampling_(statistics)

### Probabilistic Sampling

* Simple random sampling: https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/simple-random-sample/
* Systematic sampling: https://www.open.edu/openlearncreate/mod/oucontent/view.php?id=233&section=1.5.2
* Stratified sampling: https://www.geeksforgeeks.org/stratified-sampling-in-pandas/
* Cluster sampling: https://www.voxco.com/blog/stratified-sampling-vs-cluster-sampling/
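
The four methods listed above can be sketched with NumPy alone. This is an illustration under simple assumptions: the population is just 1,000 unit IDs, and the `stratum` and `cluster_id` labels are invented so that each method has something to work with.

```python
import numpy as np

rng = np.random.default_rng(1)
ids = np.arange(1_000)      # hypothetical population: unit IDs 0..999
stratum = ids % 4           # hypothetical stratum label for each unit (0-3)
cluster_id = ids // 50      # hypothetical cluster label: 20 clusters of 50

# Simple random sampling: every unit has the same chance of selection
simple = rng.choice(ids, size=100, replace=False)

# Systematic sampling: random start, then every k-th unit
k = len(ids) // 100
start = rng.integers(0, k)
systematic = ids[start::k]

# Stratified sampling: draw 25 units from each of the four strata
stratified = np.concatenate(
    [rng.choice(ids[stratum == s], size=25, replace=False) for s in range(4)]
)

# Cluster sampling: pick whole clusters at random and keep every unit in them
chosen = rng.choice(20, size=2, replace=False)
cluster = ids[np.isin(cluster_id, chosen)]

for name, s in [("simple", simple), ("systematic", systematic),
                ("stratified", stratified), ("cluster", cluster)]:
    print(f"{name:>10}: n={len(s)}, first five sorted IDs={np.sort(s)[:5]}")
```

All four samples have 100 units, but they differ in which units can appear together: stratified sampling guarantees every stratum is represented, while cluster sampling keeps only units from the chosen clusters.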

### Non-Probabilistic Sampling Methods

https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/
* Convenience sampling
* Quota sampling
* Snowball sampling


In [None]:
# # numpy's random module
# import numpy as np

# print(np.random.rand())
# print(np.random.randint(0, 9))

In [None]:
# # create population
# # https://stats.stackexchange.com/questions/560281/what-is-the-meaning-of-loc-and-scale-for-the-distributions-in-scipy-stats
# import numpy as np
# import matplotlib.pyplot as plt

# mu, std, n = (0, 1, 10000) # unpacking a tuple
# pop = np.random.normal(loc=mu, scale=std, size=n)
# plt.hist(pop);

In [None]:
# # random sample
# np.random.choice(pop, replace=False, size=50)

### Visualizing Population and Sample

In [None]:
# # normal distribution - what mean and standard deviation look like for the population
# import numpy as np
# import matplotlib.pyplot as plt

# ndist = np.random.normal(loc=0.0, scale=1.0, size=50000)
# plt.hist(ndist)
# plt.show()

In [None]:
# # what the mean and standard deviation look like for the sample
# plt.hist(np.random.choice(ndist, size=5000));

### Precision vs Accuracy

**Accuracy** is how close your measurements are to the true value. Think of it like hitting a bullseye on a dartboard.

**Precision** is how close your measurements are to each other, regardless of whether they're close to the true value.  Imagine throwing darts that all land very close together, even if they're far from the bullseye.

You can have high precision and low accuracy, high accuracy and low precision, both, or neither. The ideal is to have both high precision and high accuracy.
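
One way to put numbers on the two ideas is to treat accuracy as the distance between the average measurement and the true value (the bias), and precision as the spread of the measurements. The sketch below assumes two hypothetical scales weighing the same 100 g object; all values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 100.0   # hypothetical true weight in grams

# Two hypothetical scales, 50 readings each
scale_a = rng.normal(loc=100.0, scale=5.0, size=50)  # centered on truth, noisy
scale_b = rng.normal(loc=110.0, scale=0.5, size=50)  # tight, but reads ~10 g high

for name, readings in [("A", scale_a), ("B", scale_b)]:
    bias = readings.mean() - true_value      # accuracy: distance from truth
    spread = readings.std(ddof=1)            # precision: spread of readings
    print(f"scale {name}: bias = {bias:+.2f} g, spread = {spread:.2f} g")
```

Scale A is the more accurate of the two and scale B the more precise, which is the "high accuracy, low precision" versus "low accuracy, high precision" contrast described above.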

I pull into a gas station and ask for directions to a place to eat. One attendant says the restaurant is 1.5 miles north; another gives street directions.

* **Attendant 1: "1.5 miles north"**

    * **Precision:** This is **high precision**. The attendant gives you a specific distance (1.5 miles) and a specific direction (north). It's a very exact measurement.
    * **Accuracy:** This is **unknown accuracy**.  While precise, it might be completely wrong! Maybe the restaurant is actually 1.8 miles north, or 1.5 miles northeast. Without further information, we don't know how accurate the directions are. And it doesn't help since I can't drive directly north.

* **Attendant 2: Street directions**

    * **Precision:** This is **lower precision**.  Street directions are less exact. They might involve turns, landmarks, and estimations ("go down this road for a bit, then turn left at the big oak tree").
    * **Accuracy:** This is **potentially higher accuracy**. If the attendant knows the area well and gives you correct turns and landmarks, you're more likely to reach your destination, even if the directions are a bit vague.

**Think of it like this:**

* **Attendant 1** gave you coordinates on a map. Very precise, but if the map is wrong, you'll be lost.
* **Attendant 2** drew you a rough sketch of the route. Less precise, but if the landmarks are correct, you'll get there.

**In this scenario:**

It's likely that the street directions (Attendant 2) are more helpful, even if they are less precise.  Knowing the general route and landmarks is usually more valuable than having a very specific distance and direction that might be inaccurate.

This illustrates that in real-world situations, **accuracy is often more important than precision**, especially when getting directions!

**However**, in classification projects we will sometimes need a more comprehensive evaluation of how well a model categorizes data, which introduces a new term: Recall.



### Accuracy and Precision of a Sample

**Accuracy of a Sample**

* **Parameter:** The true value of a characteristic in the population (e.g., the average height of all adults in a country).
* **Statistic:** The estimate of that characteristic calculated from a sample (e.g., the average height of 1,000 adults in that country).
* **Accuracy:** How close the *statistic* is to the true *parameter*.  A highly accurate sample will produce a statistic that's very close to the actual population value.

**Precision of a Sample**

* **Precision:**  How close the estimates from *different samples* are to each other. If you took multiple samples and calculated the statistic each time, high precision means those statistics would be very similar.
* **Relationship to variability:** Precision is related to the variability or spread of the data in your sample. A low-variability sample (data points clustered closely together) will generally lead to higher precision.

**How to assess accuracy and precision of a sample**

* **Accuracy:**
    * **Difficult to measure directly:**  You usually don't know the true population parameter (that's why you're sampling!).
    * **Techniques:**
        * **Confidence intervals:** Provide a range of values where the true parameter is likely to lie. A narrower interval reflects a more precise estimate; it speaks to accuracy only if the sampling method is unbiased.
        * **Comparing to known benchmarks or previous studies:** If you have some external information about the population, you can compare your sample statistic to it.

* **Precision:**
    * **Measure:**  Calculate the standard error of the statistic. A smaller standard error indicates higher precision.
    * **Factors that influence precision:**
        * **Sample size:** Larger samples generally lead to higher precision.
        * **Variability of the population:**  More variability in the population makes it harder to get precise estimates.
        * **Sampling method:**  Proper random sampling techniques improve precision.
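
The standard error and a confidence interval can be computed directly from a sample. The sketch below assumes a made-up, right-skewed income population and uses the normal approximation (z = 1.96) for an approximate 95% interval; the helper name `mean_with_ci` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.exponential(scale=50_000, size=200_000)  # hypothetical incomes

def mean_with_ci(sample, z=1.96):
    """Return the sample mean, its standard error, and an approximate 95% CI."""
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean
    return mean, se, (mean - z * se, mean + z * se)

for n in (50, 500, 5_000):
    sample = rng.choice(population, size=n, replace=False)
    mean, se, (low, high) = mean_with_ci(sample)
    print(f"n={n:>5}: mean={mean:>9.0f}  SE={se:>7.0f}  95% CI=({low:.0f}, {high:.0f})")
```

The standard error shrinks roughly with the square root of the sample size, so a sample 100 times larger gives an interval about 10 times narrower.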

**Example:**

Imagine you're estimating the average income of people in a city.

* **Accurate sample:** The average income you calculate from your sample is close to the true average income of the entire city.
* **Precise sample:** If you took multiple samples, the average income calculated from each sample would be very similar to each other.

**Key takeaway:**

Accuracy and precision are important considerations when evaluating the quality of a sample and the reliability of the statistics derived from it. By understanding these concepts, you can better assess how well your sample represents the population and how much confidence you can have in your results.


### Wash, Rinse, Repeat

The need for multiple samples

**Precision:**

* **Multiple samples reveal variability:** If you take multiple samples from the same population, you'll likely get slightly different estimates each time. This variation between the sample estimates gives you a sense of the precision.
    * Low variability between samples = high precision
    * High variability between samples = low precision

* **Example:** Imagine trying to estimate the average weight of apples from an orchard. If you take three samples and get average weights of 150g, 152g, and 151g, that suggests higher precision than if the averages were 140g, 165g, and 155g.

**Accuracy:**

* **Multiple samples help identify bias:**  If you consistently get estimates that are off in the same direction (e.g., always overestimating), it suggests there might be a bias in your sampling method.
* **Averaging can improve the estimate:**  The average of multiple sample estimates usually lands closer to the population parameter than any single sample does, although averaging cannot remove a systematic bias in the sampling method.
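
A small simulation can make both points concrete. It assumes a hypothetical orchard of 10,000 apples and compares repeated random samples with a deliberately biased method that can only reach the heaviest half of the apples; the weights and the bias mechanism are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
apples = rng.normal(loc=150, scale=20, size=10_000)  # hypothetical weights (g)
true_mean = apples.mean()

def repeated_means(pool, n, repeats=200):
    """Draw `repeats` samples of size n from pool and return each sample mean."""
    return np.array([rng.choice(pool, size=n, replace=False).mean()
                     for _ in range(repeats)])

random_means = repeated_means(apples, n=30)
# Biased method (assumption): only the heaviest half of the orchard is reachable
biased_means = repeated_means(np.sort(apples)[5_000:], n=30)

for label, means in [("random", random_means), ("biased", biased_means)]:
    print(f"{label}: average estimate = {means.mean():.1f} g "
          f"(true {true_mean:.1f}), spread = {means.std(ddof=1):.1f} g")
```

The spread of the 200 estimates describes precision, while the gap between their average and the true mean exposes bias. The biased method is consistently too high no matter how many samples are averaged.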

**Think of it like this:**

Imagine you're throwing darts at a dartboard.

* **One dart:** You can't really tell much about your accuracy or precision.
* **Multiple darts:**
    * **High accuracy, high precision:** All darts clustered tightly around the bullseye.
    * **High accuracy, low precision:** Darts scattered around the bullseye, but their average position is close to the center.
    * **Low accuracy, high precision:** Darts clustered tightly together, but far from the bullseye.
    * **Low accuracy, low precision:** Darts scattered all over the dartboard.

**In the context of sampling:**

* Multiple samples are like multiple darts. They give you more information to assess both the accuracy (how close you are to the true population value) and the precision (how consistent your estimates are).

**Key takeaway:**

While a single sample can provide some information, multiple samples are essential for a more complete understanding of accuracy and precision in the context of sampling and statistical estimation.


#### Law of Large Numbers

**Law of Large Numbers**

* **What it is:**  A fundamental principle in probability and statistics stating that as the size of a sample increases, the sample average gets closer and closer to the true average of the entire population.
* **In simpler terms:**  If you flip a coin a few times, you might get a lot of heads or tails in a row. But if you flip it thousands of times, the proportion of heads will be much closer to 50%.
* **Why it matters:**  It's crucial for understanding how random events behave over time. It's the reason insurance companies can reliably predict risks, pollsters can estimate election results, and casinos can make a profit.

**Law of Small Numbers**

* **What it is:** This is more of a cognitive bias than a real law. It refers to people's tendency to incorrectly believe that small samples accurately represent the larger population.
* **In simpler terms:**  If you meet two unfriendly people from a city, you might assume everyone in that city is unfriendly. This is the law of small numbers in action—drawing broad conclusions from limited evidence.
* **Why it matters:**  This bias can lead to faulty decision-making, stereotypes, and inaccurate generalizations. It's important to be aware of it to avoid making these mistakes.

**Here's an analogy:**

Imagine a giant jar filled with thousands of marbles, with 90% red and 10% blue.

* **Law of Large Numbers:** If you draw a handful of marbles (small sample), you might get mostly red or even all blue. But if you draw hundreds of marbles (large sample), you'll likely get a distribution close to 90% red and 10% blue.
* **Law of Small Numbers:** If you draw only a few marbles and they're all blue, you might mistakenly believe the jar is mostly blue.

**Key Differences:**

* The law of large numbers is a proven mathematical principle, while the law of small numbers is a psychological bias.
* The law of large numbers describes how random events converge to their true probabilities with increasing sample size, while the law of small numbers describes our tendency to misinterpret small samples.

Understanding both these concepts is important for critical thinking and sound decision-making, especially when dealing with statistics and probability in everyday life.
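
The coin-flip description can be checked with a short simulation. This sketch assumes a fair coin and tracks the running share of heads as the number of flips grows.

```python
import numpy as np

rng = np.random.default_rng(5)
flips = rng.integers(0, 2, size=100_000)          # 1 = heads, 0 = tails
running_share = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} flips: share of heads = {running_share[n - 1]:.4f}")
```

Early values can sit far from 0.5, which is exactly where the law of small numbers misleads; the later values settle close to 0.5.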


In [None]:
# # find the mean of a large distribution
# import numpy as np

# np.random.seed(42)

# data = np.random.randint(0, 5000, 5000)
# np.mean(data)

In [None]:
# # find sample means from generated list
# samples = np.random.choice(data, 5)
# # print(samples)
# print(np.mean(samples))
# samples = np.random.choice(data, 50)
# print(np.mean(samples))
# samples = np.random.choice(data, 500)
# print(np.mean(samples))
# samples = np.random.choice(data, 1000)
# print(np.mean(samples))
# samples = np.random.choice(data, 10000)
# print(np.mean(samples))
# samples = np.random.choice(data, 100000)
# print(np.mean(samples))

#### Central Limit Theorem

The Central Limit Theorem (CLT) is a cornerstone of statistics. It's a bit complex, but incredibly powerful in its implications. Here's a breakdown:

**What it says:**

Imagine you have a population of anything – people's heights, test scores, the number of petals on flowers, whatever. This population has some overall average and some variability.

Now, you repeatedly take random samples from this population (and importantly, each sample is independent of the others).  The CLT states that if you calculate the *average* of each of these samples, the distribution of those sample averages will tend towards a normal distribution, *regardless* of the shape of the original population's distribution.

**Here's the kicker:**

*   This works even if the original population isn't normally distributed. It could be skewed, uniform, bimodal, whatever!
*   The larger your sample size (the number of data points in each sample), the closer the distribution of the sample averages will be to a normal distribution.

**Why is this so important?**

1. **Inference:** It allows us to make inferences about a population based on a sample. Even if we don't know the true population distribution, we can use the CLT to estimate population parameters (like the mean) with confidence intervals.

2. **Hypothesis Testing:**  Many statistical tests rely on the assumption of normality. The CLT allows us to use these tests even when the underlying population isn't normal, as long as our sample size is large enough.

3. **Real-world applications:** It's used everywhere! In quality control, finance, opinion polling, medical research – any field where you're trying to draw conclusions about a population from a sample.

**Analogy time:**

Imagine you have a bag of differently shaped candies. You take a handful (a sample), calculate the average weight of the candies in your hand, and record it. You do this many times. The CLT says that even though the candies have different individual weights, the distribution of those average weights you record will look like a bell curve.

**Important note:**

While the CLT is powerful, it does have some assumptions:

*   The samples must be random and independent.
*   The sample size should be sufficiently large (generally, a sample size of 30 or more is considered sufficient, but it can depend on the original population distribution).

If these assumptions are violated, the CLT might not hold.
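
A standard consequence of the theorem is that the standard deviation of the sample means is approximately the population standard deviation divided by the square root of the sample size. The sketch below checks this on a made-up right-skewed population; the gamma parameters and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
population = rng.gamma(shape=1.0, scale=1_000.0, size=50_000)  # right-skewed
sigma = population.std()

for n in (5, 30, 200):
    # 5,000 samples of size n, drawn with replacement, one mean per sample
    means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    print(f"n={n:>3}: mean of means={means.mean():8.1f}  "
          f"sd of means={means.std(ddof=1):7.1f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):7.1f}")
```

The last two columns should track each other closely, and the mean of the sample means should stay near the population mean at every sample size.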

Hopefully, this explanation helps you grasp the core idea of the Central Limit Theorem! It's a complex concept, but understanding its essence is crucial for anyone working with statistics.


* Pierre-Simon Laplace, 1749-1827 (also revived Bayes' theorem)
* Central Limit Theorem, in probability theory, a theorem that establishes the normal distribution as the distribution to which the mean (average) of almost any set of **independent and randomly generated (identically distributed) variables** rapidly converges. The central limit theorem explains why the normal distribution arises so commonly and why it is generally an excellent approximation for the mean of a collection of data (often with as few as 10 variables).
* The standard version of the central limit theorem, first proved by the French mathematician Pierre-Simon Laplace in 1810, states that the sum or average of an infinite sequence of independent and identically distributed random variables, when suitably rescaled, tends to a normal distribution.

According to Wikipedia (2022):

>  In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions (para 1).

And according to Scribbr (2023):

> The central limit theorem states that if you take sufficiently large samples from a population, the samples’ means will be normally distributed, even if the population isn’t normally distributed.

And according to Towards Data Science (2023):

> If we take many samples from a population, and calculate a mean for each sample, then the distribution of these means across the samples (that is, the sampling distribution of the mean) will be the well-known normal distribution with the mean same as the population mean and with the standard deviation equal to the population standard deviation divided by [the square root of] the number of observations per sample. That’s the case even when the population is not normally distributed.

References<br />
* Central limit theorem. (2023, May 30). In *Scribbr*. https://www.scribbr.com/statistics/central-limit-theorem/
* Central limit theorem. (2022, February 1). In *Wikipedia*. https://en.wikipedia.org/wiki/Central_limit_theorem.
* https://towardsdatascience.com/central-limit-theorem-70b12a0a68d8

In [None]:
# # https://levelup.gitconnected.com/large-numbers-and-central-limit-theorem-using-numpy-1c8199ef63b1
# # create right skewed data
# import numpy as np
# import seaborn as sns

# data = np.random.gamma(1, 1000, 5000)
# sns.kdeplot(data)
# print('mean: ', np.mean(data))

In [None]:
# # what a normal curve (distribution) looks like
# import numpy as np
# import matplotlib.pyplot as plt
# import scipy.stats as stats

# x = np.linspace(-3, 3, 100)
# y = stats.norm.pdf(x, 0, 1)
# plt.plot(x, y)
# plt.show()

In [None]:
# # visualize the distribution of 100,000 sample means (each from a sample of size 100)
# import numpy as np
# import seaborn as sns

# means = []
# for i in range(100000):
#     means.append(np.random.choice(data, 100).mean())

# sns.kdeplot(means)
# print('population mean: ', np.mean(data), 'mean of sample means: ', np.mean(means))

### Accuracy and Precision in Regression

**Accuracy in Regression**

* **Focus:** How close the predicted values are to the true values.
* **Challenge:**  In regression, you're predicting a continuous value (e.g., house price, temperature), not just a category.  So, it's less about being exactly right and more about how close you are on average.
* **Metrics:**
    * **Mean Absolute Error (MAE):** The average absolute difference between predicted and true values.
    * **Mean Squared Error (MSE):** The average squared difference between predicted and true values (penalizes larger errors more).
    * **Root Mean Squared Error (RMSE):** The square root of MSE (in the same units as the target variable).
    * **R-squared:**  Represents the proportion of variance in the target variable explained by the model.

* **Interpretation:** Lower values for MAE, MSE, and RMSE indicate higher accuracy. Higher R-squared indicates a better fit. This is what this class is all about!
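
The four metrics can be computed by hand with NumPy, which makes their definitions explicit. The prices below are made-up values in thousands of dollars, used only for illustration.

```python
import numpy as np

# Hypothetical true and predicted house prices, in thousands of dollars
y_true = np.array([200.0, 250.0, 310.0, 180.0, 400.0])
y_pred = np.array([210.0, 240.0, 300.0, 200.0, 390.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # average absolute error
mse = np.mean(errors ** 2)             # average squared error
rmse = np.sqrt(mse)                    # back in the units of the target
# R-squared: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R^2={r2:.4f}")
```

RMSE is larger than MAE here because the single 20-unit miss is weighted more heavily once errors are squared.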

**Precision in Regression**

* **Focus:** How consistent the predictions are for different data points or different runs of the model.
* **Relationship to variance:**  Precision in regression is closely tied to the variance of the model's predictions.
    * Low variance = high precision (predictions are tightly clustered)
    * High variance = low precision (predictions are spread out)

* **Factors affecting precision:**
    * **Model complexity:** More complex models can capture more intricate patterns but might also be more sensitive to small variations in the data, leading to lower precision.
    * **Data quality:** Noisy or inconsistent data can reduce precision.
    * **Regularization:** Techniques like L1 or L2 regularization can help improve precision by preventing overfitting.

* **Visualization:**  Plotting predicted values against true values can help visualize both accuracy and precision. A narrow band of points around the ideal 45-degree line indicates high accuracy and precision.

**Example:**

Imagine a model predicting house prices.

* **High accuracy:** The predicted prices are, on average, close to the actual sale prices.
* **High precision:** The model's predictions are consistent and don't vary wildly for similar houses.

**Key takeaway:**

In regression, accuracy is about how close you get to the true values, while precision is about the consistency of your predictions. Both are important for building effective regression models. By using appropriate metrics and visualizations, you can assess and improve both the accuracy and precision of your models.


### Accuracy and Precision in Classification

**Key Metrics in a Classification Report**

* **Accuracy:** The overall correctness of the model.  
    * It answers: "Out of all the predictions, how many were correct?"
    * Can be misleading with imbalanced datasets (where one class has far more instances than another).

* **Precision:**  Focuses on the accuracy of positive predictions.
    *  It answers the question: "Of all the instances the model predicted as positive, how many were actually correct?"
    *  Useful when the cost of false positives is high (e.g., diagnosing someone with a disease they don't have).

* **Recall (Sensitivity):** Measures the model's ability to identify all actual positive instances.
    * It answers: "Of all the actual positive instances, how many did the model correctly identify?"
    * Crucial when the cost of false negatives is high (e.g., missing a critical medical diagnosis).

**How they work together**

The classification report uses these metrics to give you a more complete picture of your model's performance than accuracy alone.  

* **Example:** Imagine a model detecting spam emails.
    * High precision means it correctly labels most of the emails it flags as spam.
    * High recall means it catches most of the actual spam emails.
    * Accuracy might be high overall, but if it misses a lot of spam (low recall), it's not very useful.
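
The spam example can be worked through with counts. The sketch assumes 1,000 hypothetical emails, 5% of them spam, and a model that flags only 10 of the 50 spam messages and nothing else.

```python
import numpy as np

# Hypothetical labels for 1,000 emails: 1 = spam, 0 = not spam (5% spam)
y_true = np.array([1] * 50 + [0] * 950)
# Hypothetical model that flags only 10 of the 50 spam emails and no others
y_pred = np.array([1] * 10 + [0] * 40 + [0] * 950)

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Accuracy and precision both look strong, yet recall shows that most spam still gets through, which is why accuracy alone can mislead on imbalanced data.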

**In essence:**

A classification report helps you understand not just if your model is right or wrong, but *how* it's right or wrong, and where it might need improvement. This is crucial for building effective and reliable classification models.

### Regression and Classification Reminder

**Regression**

* **Goal:** Predict a continuous or numerical outcome. Think of it like estimating a value on a scale.
* **Output:** A number (e.g., house price, temperature, stock price).
* **Examples:**
    * Predicting house prices based on size, location, etc.
    * Forecasting sales for the next quarter
    * Estimating the time it will take to complete a task

**Classification**

* **Goal:**  Predict a categorical or discrete outcome.  Think of it like putting things into buckets.
* **Output:** A class label (e.g., "spam", "not spam", "cat", "dog", "red", "green", "blue").
* **Examples:**
    * Image recognition (classifying images as cat, dog, bird, etc.)
    * Spam detection (classifying emails as spam or not spam)
    * Medical diagnosis (classifying a tumor as benign or malignant)

**Key Differences**

| Feature | Classification | Regression |
|---|---|---|
| Output | Discrete (categories, labels) | Continuous (numbers) |
| Algorithms | Logistic regression, decision trees, support vector machines, naive Bayes | Linear regression, polynomial regression, support vector regression, random forests |
| Evaluation Metrics | Accuracy, precision, recall, F1-score | Mean squared error (MSE), R-squared |

**In essence:**

Classification is about assigning things to categories, while regression is about predicting numerical values. Both are fundamental tasks in machine learning and data science, used in a wide range of applications.

### Data Types

* Numerical: has measurement
    * Discrete: can be counted (number of people in a class)
    * Continuous: described as a measurement (height, weight, temps), usually rounded off
* Categorical: some type of group or category; it can be coded as a number (female: 1, male: 2), but the number has no mathematical meaning
* Nominal: named or labeled (31 flavors)
* Ordinal: ordered or scaled (first, second, etc.; very happy, happy, meh, sad, very sad)
* Cardinal: the number of things, counted (1, 2, 3, 4, 5, etc.)
* Interval: ordered numbers with equal distances between them but no true zero (50 degrees, 60 degrees, etc.)
* Ratio: interval data with a true zero (height can't be -2 inches; the distance between 1 and 2 is the same as between 3 and 4, and 4 is twice 2)

### Bias

Data bias is a sneaky problem! It can creep into your data in many ways, often without you even realizing it. Here are some of the common types of bias to watch out for:

**1. Sampling Bias**

* **What it is:** Your sample doesn't accurately represent the population you're interested in. This can lead to inaccurate conclusions about the population.
* **Examples:**
    * **Convenience Sampling:**  Surveying only people who are easy to reach (e.g., your friends, people on the street).
    * **Undercoverage:**  Some groups in the population are underrepresented in the sample (e.g., a survey about internet usage that only includes people with landline phones).

**2. Measurement Bias**

* **What it is:** The way you collect data influences the results.
* **Examples:**
    * **Leading Questions:**  Survey questions that are worded in a way that suggests a particular answer.
    * **Social Desirability Bias:** People tend to answer questions in a way that makes them look good, even if it's not entirely truthful.
    * **Instrument Bias:**  Problems with the measurement tool itself (e.g., a faulty scale).

**3. Confirmation Bias**

* **What it is:** You tend to favor information that confirms your existing beliefs and ignore information that contradicts them.
* **Example:**  Only looking for research that supports your hypothesis and dismissing studies that don't.

**4. Observer Bias**

* **What it is:** The researcher's expectations or beliefs influence how they perceive or interpret the data.
* **Example:**  A doctor might be more likely to diagnose a patient with a condition they're already familiar with, even if the symptoms could indicate something else.

**5. Omitted Variable Bias**

* **What it is:**  An important variable that influences the relationship between the variables you're studying is left out of your analysis.
* **Example:**  Analyzing the relationship between exercise and heart health without considering factors like diet or genetics.

**6. Survivorship Bias**

* **What it is:**  You only focus on the "survivors" of a process and ignore those that didn't make it.
* **Example:**  Studying the characteristics of successful businesses without considering the businesses that failed.

**7.  Historical Bias**

* **What it is:** Data from the past may reflect societal biases or norms that are no longer relevant or acceptable.
* **Example:**  Analyzing historical hiring data might reveal discriminatory practices that were prevalent at the time.

**8.  Recall Bias**

* **What it is:**  People's memories of past events can be inaccurate or influenced by their current beliefs.
* **Example:**  Asking people to recall their childhood eating habits might lead to inaccurate data.


**9. Selection Bias**

* **What it is:** This occurs when the process of selecting individuals, groups, or data for analysis results in a sample that is not representative of the population of interest. This can lead to skewed results and inaccurate conclusions.
* **Examples:**
    * **Sampling Bias (as mentioned before):**  A specific type of selection bias where the sample isn't representative due to the method of selection.
    * **Attrition Bias:** This happens when participants drop out of a study non-randomly, leading to a biased sample of those who remain.
    * **Non-response Bias:** Occurs when people who choose to participate in a study differ systematically from those who don't.

**10. Publication Bias**

* **What it is:** This bias arises in academic research when studies with positive or significant results are more likely to be published than studies with negative or non-significant results. This can create a misleading impression of the true state of knowledge on a topic.
* **Examples:**
    * **File Drawer Problem:**  Studies with negative results are often "filed away" and never published, leading to an overrepresentation of positive findings in the literature.
    * **Outcome Reporting Bias:** Researchers may selectively report outcomes that support their hypotheses, while downplaying or ignoring those that don't.

**Why are selection and publication bias important?**

* **Selection bias** can distort the results of any study, making it difficult to generalize findings to the broader population.
* **Publication bias** can lead to a biased view of the evidence on a particular topic, potentially influencing decision-making in areas like healthcare, policy, and research.

**How to mitigate these biases:**

* **Selection Bias:**
    * Use random sampling methods whenever possible.
    * Carefully consider the inclusion and exclusion criteria for your study.
    * Analyze the characteristics of participants who drop out or don't respond to identify potential biases.

* **Publication Bias:**
    * Encourage the publication of negative results through platforms like pre-print servers and journals that specifically publish null findings.
    * Conduct systematic reviews and meta-analyses that include unpublished studies to get a more complete picture of the evidence.


**Why is it important to be aware of bias?**

Bias can lead to inaccurate conclusions, flawed models, and unfair or discriminatory outcomes.  By understanding the different types of bias, you can take steps to minimize their impact and ensure that your data analysis is as objective and reliable as possible.
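
To see how a biased sample distorts an estimate, the sketch below simulates the undercoverage example from the sampling bias section: a survey about internet usage that reaches only people with landline phones. The 30% landline share and the hours-online figures are invented assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
# Hypothetical population: 30% have a landline; landline owners go online less
has_landline = rng.random(n) < 0.30
hours_online = np.where(has_landline,
                        rng.normal(2.0, 0.5, n),   # landline owners
                        rng.normal(5.0, 1.0, n))   # everyone else
true_mean = hours_online.mean()

random_sample = rng.choice(hours_online, size=500, replace=False)
landline_only = rng.choice(hours_online[has_landline], size=500, replace=False)

print(f"true mean hours online:   {true_mean:.2f}")
print(f"random sample estimate:   {random_sample.mean():.2f}")
print(f"landline-only estimate:   {landline_only.mean():.2f}")
```

Both samples have the same size, so the gap in the landline-only estimate comes from who was reachable, not from how many people were asked.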

**Selection Bias, Measurement Bias, Outcome Reporting Bias Example**

The story of a Texas school district that needed good scores on a standardized achievement test given in 10th grade: it held back poor performers in 9th grade and then, after a couple of years, bumped them up to 11th grade, skipping the 10th-grade test.

### Big Data

How Data Happened, Wiggins and Jones

* Better tools, services, and public goods or privacy incursions and invasive marketing?
* Understand online communities and political movements or track protesters and suppress speech?
* Transform how we study human communication and culture or narrow research options and alter what research means?
* Tendency with too much data is selecting what we like and ignoring the rest
* Like a whiteout in a snowstorm, too much data blinds us to the signal or presents too many signals, and we start seeing patterns in random noise
* Noise is increasing faster than information
* As Nate Silver says in his book *The Signal and the Noise*, noise distracts us from the truth (objective truth?)
* Objective Truth is what exists and can be proved in this physicality. (The sun moves across the sky each day.)
* Normative Truth is what we, as a group, agree is true. (English speakers agreed to use the word day to name that time when the sky is lit by the sun.)
* Subjective Truth is how the individual sees or experiences the world. (Today is a good day for me.)
* Complex Truth recognizes the validity of all those truths and allows you to focus on the one that is most useful at any given time. (The sun is up; the day is bright. Today is a good day for MOM, so let's take advantage of that and ask for ice cream for dinner.)
* An objective statement is factual; it has a definite correspondence to reality, independent of anyone's feelings or biases.
* The distinction between subjectivity and objectivity is a basic idea of philosophy, particularly epistemology and metaphysics. It is often related to discussions of consciousness, agency, personhood, philosophy of mind, philosophy of language, reality, truth, and communication.

**Chapter 1 Summary**:

In this chapter, the authors explore the complex relationship between data, technology, and society, emphasizing the need for greater awareness of the potential biases and harms associated with data-driven decision-making. They also stress the importance of accountability and fairness in algorithmic systems, especially given the increasing prevalence of automated decision-making in various sectors.

The authors argue that while data and technology offer promising advancements, they are not neutral and can perpetuate existing inequalities if not developed and deployed responsibly.  They highlight concerns about privacy, surveillance, and the potential for algorithmic bias to amplify social injustices.

The chapter also touches on the historical context of data collection and analysis, emphasizing that the issues we face today are not entirely new. The authors point to past instances where the accumulation and analysis of data have posed threats to privacy and exacerbated social inequalities.

Overall, this chapter serves as a call for greater critical engagement with data and technology, urging readers to consider the potential consequences of algorithmic systems and advocate for fairness, transparency, and accountability in their development and implementation.

Resources:

* Wiggins, C., & Jones, M. L. (2023). *How Data Happened: A History from the Age of Reason to the Age of Algorithms*.
* https://www.hsdinstitute.org/resources/four-truths.html
* https://www.gotquestions.org/objective-truth.html
* https://en.wikipedia.org/wiki/Subjectivity_and_objectivity_(philosophy)


### Randomized Controlled Trials (RCTs or AB Testing), Multicollinearity, and Confounders

A child's shoe size and reading level are correlated, but consider the child's age.

* Randomized Controlled Trial: A study design that randomly assigns participants into an experimental group or a control group. As the study is conducted, the only expected difference between the control and experimental groups in a randomized controlled trial (RCT) is the outcome variable being studied.
* Multicollinearity: two or more independent variables are correlated with each other
    * Removing variables?
* Confounders
  * An extraneous variable whose presence affects the variables being studied so that the results do not reflect the actual relationship between the variables under study.
  * A variable that influences both the dependent variable and independent variable, causing a spurious association.
* Lurking variable
  * A lurking variable is a variable that is not included in a statistical analysis but can still affect the outcome of that analysis
* Statistical Control
    * Variables may be controlled directly by holding them constant throughout a study
    * Controlling a variable also affects any confounder that shares a relationship with what is being controlled
    * Confounders need to be considered in the big picture
    * Sometimes you end up controlling the thing you are trying to measure (Ezra Klein)
    
Sources:
* https://himmelfarb.gwu.edu/tutorials/studydesign101/rcts.cfm
* https://harini.blog/2021/10/10/understanding-multicollinearity-and-condounding-variables-in-regression/
* https://statisticsbyjim.com/basics/lurking-variable/
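
The shoe size and reading level example can be simulated to show how a confounder produces a spurious association. The sketch below is illustrative only: the coefficients, noise levels, and age range are made-up assumptions, and "controlling for age" is done here by correlating the residuals left after a straight-line fit on age (a partial correlation), which is one of several ways to apply statistical control.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

n = 2000
age = rng.uniform(5, 12, size=n)                        # hypothetical ages in years
shoe_size = 0.8 * age + rng.normal(0, 1.0, size=n)      # driven by age only
reading_level = 1.1 * age + rng.normal(0, 1.5, size=n)  # driven by age only

# Raw correlation: shoe size appears to "predict" reading level
raw_r = np.corrcoef(shoe_size, reading_level)[0, 1]

def residuals(y, x):
    """Return what is left of y after a least-squares line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Partial correlation: remove the age effect from both variables, then correlate
partial_r = np.corrcoef(residuals(shoe_size, age),
                        residuals(reading_level, age))[0, 1]

print(f"raw correlation:     {raw_r:.2f}")
print(f"controlling for age: {partial_r:.2f}")
```

In this simulated data the raw correlation is strong, while the correlation after removing the age effect is close to zero, because age is the only thing linking the two variables.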

In [None]:
# # create population
# # https://stats.stackexchange.com/questions/560281/what-is-the-meaning-of-loc-and-scale-for-the-distributions-in-scipy-stats
# import numpy as np
# import matplotlib.pyplot as plt

# mu, std, n = (0, 1, 10000) # unpacking a tuple
# pop = np.random.normal(loc=mu, scale=std, size=n)
# plt.hist(pop);  # plot the population created above

## Sampling Examples

### Random Sampling

In [None]:
# # random sample
# np.random.choice(pop, replace=False, size=50)

### Systematic Sampling

In [None]:
# # systematic sampling [start, end, interval]
# print(pop[0:7])
# pop[0:25:3]
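
The slice above takes every third value from the first 25 elements. A systematic sample over the whole population is commonly drawn with a fixed interval `k = N // n` and a random starting point inside the first interval. The following is a minimal sketch under that assumption; `n_sample`, `k`, `start`, and `sys_sample` are illustrative names, and `pop` is regenerated here so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(loc=0, scale=1, size=10_000)  # stand-in for the population above

n_sample = 50
k = len(pop) // n_sample      # sampling interval: take every k-th element
start = rng.integers(0, k)    # random start within the first interval
sys_sample = pop[start::k]    # indices start, start + k, start + 2k, ...

print(len(sys_sample), k, start)
print(sys_sample.mean(), pop.mean())  # sample mean should be close to the population mean
```

The random start matters: always starting at index 0 would give the first element a selection probability of 1 and most other elements a probability of 0.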

### Stratified Sampling

In [None]:
# # load the data to stratify
# import seaborn as sns

# mpg = sns.load_dataset('mpg')
# print(mpg.shape)
# mpg.head()

In [None]:
# # get value counts for origin to determine the size of each stratum
# mpg['origin'].value_counts()

In [None]:
# # stratified sampling
# import numpy as np
# import pandas as pd

# # shift + alt + arrow down or up
# # boolean indexing keeps only the matching rows (where() would keep every row and fill non-matches with NaN)
# clst1 = mpg[mpg['origin'] == 'usa'].sample(50, replace=False) # False is default
# clst2 = mpg[mpg['origin'] == 'japan'].sample(50)
# clst3 = mpg[mpg['origin'] == 'europe'].sample(50)


# # merge into one dataframe (concat, merge, join)
# # https://pandas.pydata.org/docs/user_guide/merging.html
# clst_df = pd.concat([clst1, clst2, clst3])
# print(clst_df.shape)
# clst_df.head()

**1. Stratifying by Country of Origin**

*   **Purpose:** You're creating strata (subgroups) within your dataset based on the country of origin (e.g., USA, Japan, Europe). This ensures that each origin group is represented equally in your final sample.
*   **Why this matters:** Stratified sampling helps prevent imbalances in your sample that could lead to biased results. If one origin group is overrepresented, your analysis might not accurately reflect the overall patterns in the data.

**2. Creating a New DataFrame**

*   **Purpose:** You're forming a new dataframe to hold your stratified sample. This keeps your original dataset intact while allowing you to work with the specific sample you've created.
*   **Why this matters:**  Organization and clarity. It's good practice to keep your raw data separate from your processed or sampled data.

In [None]:
# # weighted - normalize=True returns each label's proportion of the original population
# mpg['origin'].value_counts(normalize=True)

**3. Value Count with Normalization**

*   **Purpose:** You're calculating the proportion of each origin group in the original `mpg` dataset. Normalization ensures these proportions sum to 1 (or 100%).
*   **Why this matters:**  This gives you a clear picture of how unevenly the origin groups are distributed in the population, which is the baseline that a weighted or proportional sample should reproduce.

In [None]:
# # weighted possible solution based on percentage of each label for origin
# # look up each label's proportion by name rather than by position
# weights = mpg['origin'].value_counts(normalize=True).to_dict()
# mpg['weights'] = mpg['origin'].map(weights)
# mpg.sample(n=300, weights='weights')['origin'].value_counts(normalize=True)

**4. Assigning Weights**

*   **Purpose:** You're assigning each row a weight equal to its origin group's share of the original population and passing those weights to `sample`, which treats them as relative selection probabilities for individual rows.
*   **Why this matters:**  Weighting changes which rows are likely to be drawn. With these weights, rows from the largest group are both more numerous and individually more likely to be picked, so the sample tends to over-represent that group. Weights proportional to the inverse of each group's size would instead pull the sample toward equal representation.

In [None]:
# # break down the stratified sample by cylinders and origin
# clst_df.groupby(['cylinders', 'origin'])['origin'].count()

**5. Group By with Origin Count**

*   **Purpose:** You're grouping your stratified sample by `cylinders` and `origin` and counting the number of observations in each combination.
*   **Why this matters:**  The counts for each origin group should still add up to 50, and the breakdown shows how cylinder counts are distributed within each origin group, a characteristic the sampling did not explicitly balance.

**In essence, you're demonstrating the key principles of stratified sampling:**

*   Divide the population into meaningful subgroups (strata).
*   Sample from each stratum, either equally (50 rows per origin, as above) or in proportion to its size, to ensure representation.
*   Adjust for any differences in population proportions through weighting.

By following these steps, you're creating a sample that is more likely to accurately reflect the characteristics of the overall population, leading to more reliable and generalizable results in your analysis.
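
The difference between proportional and equal allocation can be sketched with `DataFrame.groupby(...).sample(...)`. The data frame below is a hypothetical stand-in that only mimics the origin imbalance of `mpg` (the counts and the `value` column are made up), so the block runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical data with an imbalanced grouping column, loosely modeled on mpg
df = pd.DataFrame({
    "origin": ["usa"] * 250 + ["japan"] * 80 + ["europe"] * 70,
    "value": range(400),
})

# Proportional allocation: the same fraction from every stratum
prop_sample = df.groupby("origin").sample(frac=0.2, random_state=0)

# Equal allocation: the same number of rows from every stratum
equal_sample = df.groupby("origin").sample(n=50, random_state=0)

print(df["origin"].value_counts(normalize=True))           # population proportions
print(prop_sample["origin"].value_counts(normalize=True))  # matches the population
print(equal_sample["origin"].value_counts())               # 50 rows per group
```

Proportional allocation preserves the population mix, while equal allocation gives small groups enough rows to analyze on their own at the cost of needing weights when estimating population-level quantities.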

## Bayesian Inference

Sampling is **fundamental** to Bayesian inference, especially when you are updating your **posterior distribution**.

Sampling methods are the computational engine that makes modern Bayesian inference possible.

---

## The Core Problem: Intractability of the Posterior

In Bayesian inference, the goal is to calculate the **posterior distribution**, which represents your updated belief about the model parameters ($\theta$) after seeing the data ($D$). This is governed by Bayes' Theorem:

$$P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}$$

Where:
* $P(\theta | D)$ is the **Posterior** (what we want).
* $P(D | \theta)$ is the **Likelihood** (how well the data fits the parameters).
* $P(\theta)$ is the **Prior** (our initial belief).
* $P(D)$ is the **Marginal Likelihood** or **Evidence** (a normalizing constant).

The term $P(D)$ is the problem. It requires calculating a complex integral over the entire parameter space: $P(D) = \int P(D | \theta) P(\theta) d\theta$. For most real-world models, this integral is **intractable**—it cannot be solved analytically or in closed form.

This means you can calculate a value proportional to the posterior ($P(D | \theta) P(\theta)$), but you can't get the exact probability density function needed to fully define the distribution.
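
For a model with a single parameter, the integral for $P(D)$ can still be approximated numerically on a grid, which makes the role of the normalizing constant concrete. The sketch below assumes a hypothetical coin-flip dataset (7 heads in 10 flips) and a uniform prior; neither comes from the text above. Grid approximation stops being practical as the number of parameters grows, which is where sampling takes over.

```python
import numpy as np
from math import comb

heads, flips = 7, 10                 # hypothetical data D
theta = np.linspace(0, 1, 1001)      # grid over the parameter space
d_theta = theta[1] - theta[0]        # grid spacing

prior = np.ones_like(theta)          # uniform prior P(theta) on [0, 1]
likelihood = comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)
unnormalized = likelihood * prior    # numerator of Bayes' Theorem

evidence = np.sum(unnormalized) * d_theta  # Riemann-sum approximation of P(D)
posterior = unnormalized / evidence        # density that integrates to about 1

post_mean = np.sum(theta * posterior) * d_theta
print(f"P(D) ~ {evidence:.4f}")
print(f"posterior mean ~ {post_mean:.3f}")
```

With a uniform prior this posterior is a Beta(8, 4) distribution, whose mean is 8/12, so the grid result can be checked against a known answer.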

---

## The Sampling Solution: Monte Carlo Integration

Since we can't get the exact formula for the posterior, we use **sampling** techniques to generate a large number of simulated parameter values ($\theta_1, \theta_2, \dots, \theta_N$) that are distributed, at least approximately, according to the posterior $P(\theta | D)$.

Using such samples to approximate integrals is called **Monte Carlo Integration**. Once you have a set of samples that represent the posterior, you can:

* **Approximate the Distribution:** The histogram of the samples is an approximation of the posterior probability density function.
* **Estimate Summary Statistics:** Instead of calculating the expected value $E[h(\theta)] = \int h(\theta) P(\theta | D) d\theta$, you simply average the function $h(\theta)$ over your samples:
    $$E[h(\theta)] \approx \frac{1}{N} \sum_{i=1}^{N} h(\theta_i)$$
    This allows you to easily estimate the posterior mean, variance, medians, and credible intervals.
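
The averaging formula above can be sketched with posterior draws from a distribution that is easy to sample directly. The Beta(8, 4) posterior used here is an assumption for illustration (it corresponds to a uniform prior updated with 7 heads in 10 coin flips); in practice the draws would come from a sampler such as MCMC.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed posterior: Beta(8, 4), sampled directly for illustration
samples = rng.beta(8, 4, size=100_000)

post_mean = samples.mean()              # h(theta) = theta
post_var = samples.var()                # spread of the posterior
prob_gt_half = (samples > 0.5).mean()   # h(theta) = indicator that theta > 0.5
ci_low, ci_high = np.percentile(samples, [2.5, 97.5])  # 95% credible interval

print(f"mean ~ {post_mean:.3f}, variance ~ {post_var:.4f}")
print(f"P(theta > 0.5 | D) ~ {prob_gt_half:.3f}")
print(f"95% credible interval ~ ({ci_low:.3f}, {ci_high:.3f})")
```

Every quantity is just an average or a percentile of the draws, so no integral has to be solved by hand.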

---

## Key Sampling Methods for Posterior Updating

The most common methods used for drawing samples from the complex posterior distribution are:

* **Markov Chain Monte Carlo (MCMC):** This is the most widely used family of methods.
    * MCMC algorithms (like **Metropolis-Hastings** and **Hamiltonian Monte Carlo**) construct a sequence of random parameter states, forming a "chain."
    * This chain is designed to eventually settle into a pattern where the frequency of visiting any particular parameter value is **proportional to its posterior probability**.
    * After a 'burn-in' period, the samples generated by the chain are treated as draws from the true posterior distribution.

* **Sequential Monte Carlo (SMC):** Also known as **Particle Filters**.
    * SMC is particularly useful for **online** or **sequential** Bayesian inference, where you are constantly updating your posterior as new data arrives (e.g., in tracking problems).
    * It maintains a population of samples (particles) and updates their importance weights sequentially based on the new data, making it efficient for dynamic models.

* **Importance Sampling (IS):**
    * This technique uses a proposal distribution that is easy to sample from (like the prior) and then assigns a **weight** to each sample, correcting for the difference between the proposal and the true posterior.
    * It's less efficient than MCMC for high-dimensional problems but useful in certain contexts.
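
As a minimal sketch of the MCMC idea, the block below implements a random-walk Metropolis sampler, the simplest member of the Metropolis-Hastings family. It only needs the unnormalized posterior $P(D | \theta) P(\theta)$, because $P(D)$ cancels in the acceptance ratio. The coin-flip data (7 heads in 10 flips), the uniform prior, the step size, and the burn-in length are all illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def log_unnormalized_posterior(theta, heads=7, flips=10):
    """Log of P(D | theta) * P(theta) for a coin with a uniform prior (hypothetical data)."""
    if theta <= 0 or theta >= 1:
        return -np.inf  # outside the support of the prior
    return heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

def metropolis(log_target, n_steps=20_000, step_size=0.2, start=0.5, seed=0):
    """Random-walk Metropolis: returns the full chain, burn-in included."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    current, current_lp = start, log_target(start)
    for i in range(n_steps):
        proposal = current + rng.normal(0, step_size)  # symmetric proposal
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current))
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current  # a rejected proposal repeats the current state
    return chain

chain = metropolis(log_unnormalized_posterior)
draws = chain[2_000:]  # discard the burn-in period
print(f"posterior mean ~ {draws.mean():.3f}")
```

Because the exact posterior for this toy model is Beta(8, 4) with mean 8/12, the chain's average gives a quick sanity check. Real models usually rely on a tested library rather than a hand-written sampler.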

In essence, **sampling is the primary mechanism by which we realize the posterior update prescribed by Bayes' Theorem when an analytical solution is unavailable.**
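
Importance sampling with the prior as the proposal can be sketched in a few lines. When the proposal is the prior, the ratio of unnormalized posterior to proposal reduces to the likelihood, so each draw is weighted by how well it explains the data. The coin-flip data (7 heads in 10 flips) and uniform prior are illustrative assumptions, and the self-normalized form is used because $P(D)$ is treated as unknown.

```python
import numpy as np

rng = np.random.default_rng(2)
heads, flips = 7, 10                        # hypothetical data D

theta = rng.uniform(0, 1, size=100_000)     # draws from the proposal (the uniform prior)

# Unnormalized posterior / proposal density: the prior cancels, leaving the likelihood
weights = theta**heads * (1 - theta)**(flips - heads)
weights /= weights.sum()                    # self-normalize so the weights sum to 1

is_mean = np.sum(weights * theta)           # weighted estimate of the posterior mean
ess = 1.0 / np.sum(weights**2)              # effective sample size of the weighted draws

print(f"posterior mean ~ {is_mean:.3f}")
print(f"effective sample size ~ {ess:.0f} of {len(theta)}")
```

The effective sample size shows the cost of a poorly matched proposal: the further the prior is from the posterior, the fewer draws carry meaningful weight.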