# Week 3 - Sampling and Point Estimation

## Lesson 1 - Population and Sample

### Population and Sample


1. **Population and Sample**:
   - **Population**: The entire group of individuals or items that you want to study. Its size is denoted by \( N \).
     - Example: All people in Statistopia (10,000 individuals).
   - **Sample**: A smaller subset of the population that is actually observed or measured. Its size is denoted by \( n \).
     - Example: A random selection of 100 individuals from Statistopia.

2. **Sampling Strategies**:
   - **Random Sampling**: The best method to ensure an unbiased sample. Each individual in the population has an equal chance of being selected.
   - **Non-Random Sampling**: Should be avoided as it may lead to biased results (e.g., selecting the shortest individuals when measuring height).

3. **Independent and Identically Distributed (i.i.d.) Samples**:
   - **Independent Samples**: Each sample should be independent of others. The selection of one individual should not affect the selection of another.
   - **Identically Distributed**: Every observation is drawn from the same population distribution, so the rule used to select individuals must stay the same across all selections.

4. **Importance in Machine Learning**:
   - **Data Sets as Samples**: In ML, every dataset is a sample of the possible data. No matter how large, it does not represent the entire population.
   - **Representative Data Sets**: Ensuring that your dataset is representative of the population's distribution is crucial for model accuracy.
     - Example: For cat image classification, the dataset should include diverse images (cats on grass, couches, etc.) to avoid biased learning.

5. **Formal Definitions**:
   - **Population**: The entire set of individuals or elements under study, sharing common characteristics.
   - **Sample**: A subset of the population used to draw conclusions about the entire population.
   - **N**: Denotes the population size.
   - **n**: Denotes the sample size.
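
The contrast between random and non-random sampling can be sketched with a short simulation. The population below is made up for illustration (10,000 heights drawn from a normal distribution with mean 160 cm and standard deviation 10 cm); none of these numbers come from the lesson.

```python
import random
import statistics

random.seed(5)  # fixed seed so the run is reproducible

# Hypothetical population of N = 10,000 heights in cm (assumed distribution).
population = [random.gauss(160, 10) for _ in range(10_000)]
mu = statistics.fmean(population)

# Random sample: every individual has the same chance of being selected.
random_sample = random.sample(population, 100)
random_mean = statistics.fmean(random_sample)

# Non-random sample: deliberately pick the 100 shortest individuals.
shortest_sample = sorted(population)[:100]
shortest_mean = statistics.fmean(shortest_sample)

print(f"population mean:        {mu:.1f} cm")
print(f"random sample mean:     {random_mean:.1f} cm")
print(f"shortest-100 mean:      {shortest_mean:.1f} cm")
```

The random sample should land close to the population mean, while the hand-picked sample is badly biased downward.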


### Sample Mean


1. **Population Mean ($\mu$)**:
   - The average value of a characteristic (e.g., height) in the entire population.
   - Example: In Statistopia, if the population size is 10, and the average height is 160 cm, $\mu = 160$ cm.

2. **Sample Mean ($\bar{X}$)**:
   - The average value calculated from a subset of the population.
   - Represented by $\bar{X}_1$, $\bar{X}_2$, etc., for different samples.
   - Example: For a sample of 6 people, $\bar{X}_1 = 160.97$ cm.

3. **Sample Size ($n$)**:
   - The number of individuals in the sample. Larger sample sizes generally provide better estimates of the population mean.
   - Example: A sample of size 6 ($n=6$) generally gives a better estimate of $\mu$ than a sample of size 2 ($n=2$).

4. **Sampling and Estimation**:
   - **Random Sampling**: Essential for getting a representative estimate of the population mean.
     - Example: If a sample is formed by the shortest individuals, it might not be a good estimate.
   - **Estimate Accuracy**: The accuracy of the sample mean as an estimate of the population mean improves with larger sample sizes.
     - Example: $\bar{X}_1$ (with $n=6$) is a better estimate than $\bar{X}_3$ (with $n=2$).

5. **Estimating Variance**:
   - When estimating the population variance using a sample, the calculated sample variance will typically be close to the population variance but not exactly the same.
   - Part of this discrepancy is systematic: a sample variance computed by dividing by $n$ tends to underestimate the population variance. Later lessons will explore this in more detail.

6. **Key Takeaway**:
   - The larger the sample size ($n$), the better the estimate of the population mean ($\mu$) and variance you will obtain from the sample.

#### Notes for Exam Preparation:
- Understand the difference between the population mean ($\mu$) and the sample mean ($\bar{X}$).
- Recognize how sample size ($n$) impacts the accuracy of estimating $\mu$.
- Remember that while the sample variance is close to the population variance, the version that divides by $n$ tends to underestimate it on average.
- Be aware that random sampling is crucial for obtaining a good estimate of the population parameters.

This summary outlines the essential points about estimating population means and variance using samples, including the importance of sample size and the concept of random sampling.
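
The effect of sample size on the sample mean can be checked with a small simulation. This is only a sketch: the population (10,000 heights from a normal distribution with mean 160 cm and standard deviation 10 cm) and the helper name `mean_abs_error` are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 heights in cm (assumed distribution).
population = [random.gauss(160, 10) for _ in range(10_000)]
mu = statistics.fmean(population)  # population mean

def mean_abs_error(n, trials=2_000):
    """Average |sample mean - mu| over many random samples of size n."""
    errors = []
    for _ in range(trials):
        sample = random.sample(population, n)  # simple random sample
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

error_by_n = {n: mean_abs_error(n) for n in (2, 6, 100)}
for n, err in error_by_n.items():
    print(f"n={n:>3}: typical error of the sample mean = {err:.2f} cm")
```

Any single small sample can still happen to land close to $\mu$; the claim is about the typical error, which shrinks as $n$ grows.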

### Sample Proportion

1. **Population Proportion (P)**:
   - The proportion of the entire population that has a specific characteristic.
   - Calculated as $ P = \frac{x}{N} $, where $ x $ is the number of individuals with the characteristic and $ N $ is the total population size.
   - **Example**: In Statistopia, if 4 out of 10 people own a bicycle, the population proportion $ P = \frac{4}{10} = 0.4 $ or 40%.

2. **Sample Proportion ($\hat{P}$)**:
   - The proportion of a characteristic observed in a randomly selected sample from the population.
   - Calculated as $ \hat{P} = \frac{x_{\text{sample}}}{n} $, where $ x_{\text{sample}} $ is the number of individuals with the characteristic in the sample, and $ n $ is the sample size.
   - **Example**: If 2 out of 6 randomly sampled people own a bicycle, the sample proportion $ \hat{P} = \frac{2}{6} = 0.333 $ or 33.3%.

3. **Relationship Between P and $\hat{P}$**:
   - The sample proportion ($\hat{P}$) serves as an estimate of the population proportion (P).
   - While $\hat{P}$ may not exactly equal $ P $, it provides a useful approximation, especially when a large and representative sample is used.

4. **Key Takeaway**:
   - Population proportion (P) gives the true proportion of a characteristic within the entire population.
   - Sample proportion ($\hat{P}$) is an estimate of $ P $ based on a subset of the population and may vary depending on the sample size and representativeness.
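
A minimal sketch of $P$ versus $\hat{P}$, assuming a hypothetical population of 10,000 people in which exactly 40% own a bicycle (the population size and the helper name `sample_proportion` are illustrative):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical population: True means "owns a bicycle".
N = 10_000
population = [True] * 4_000 + [False] * 6_000
P = sum(population) / N  # population proportion x / N

def sample_proportion(n):
    """Proportion of bicycle owners in a random sample of size n."""
    sample = random.sample(population, n)
    return sum(sample) / n

p_hat_small = sample_proportion(6)
p_hat_large = sample_proportion(2_000)
print(f"P = {P}, p_hat (n=6) = {p_hat_small:.3f}, p_hat (n=2000) = {p_hat_large:.3f}")
```

With $n = 6$ the estimate can only take the values 0, 1/6, 2/6, and so on, so it is coarse; the larger sample usually lands much closer to $P$.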


### Sample Variance

This section covers population variance and sample variance, and how to estimate the variance when working with a sample instead of the entire population. Here's a summary:

1. **Population Variance (σ²):**
   - Measures how spread out data points are around the population mean (μ).
   - Formula: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$, where $N$ is the population size.

2. **Sample Variance (s²):**
   - Used when you only have a sample of the population.
   - The sample mean ($\bar{x}$) replaces the population mean, and $n$ (sample size) replaces $N$.
   - Initial estimation: $ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $. However, this formula introduces a bias because the sample mean is used instead of the population mean.

3. **Bias Correction with $n-1$:**
   - To correct the bias, divide by $n-1$ instead of $n$, resulting in an unbiased estimator of the population variance.
   - Formula: $ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $.
   - This adjustment is necessary because using the sample mean tends to underestimate the variance.

4. **Impact of Sample Size:**
   - The difference between dividing by $n$ and $n-1$ becomes less significant as the sample size increases. For small samples, the difference is more noticeable.

5. **Alternative Variance Estimators:**
   - Some statistical techniques, like maximum likelihood estimation, use the $ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $ formula, which has a small bias.
   - Despite this, the $s^2$ estimator (using $n-1$) is the most common and preferred method in practice for estimating variance from a sample.

In summary, when estimating variance from a sample, using the corrected formula with $n-1$ provides a more accurate (unbiased) estimate of the population variance. This approach is standard in most statistical analyses.
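
The bias of the $\frac{1}{n}$ estimator can be seen empirically. The sketch below assumes a Normal(0, sd = 2) population, so the true variance is 4, and averages both estimators over many small samples; the setup is illustrative and not taken from the lesson.

```python
import random

random.seed(2)  # fixed seed so the run is reproducible

TRUE_VARIANCE = 4.0   # assumed population: Normal(0, sd=2)
n = 5                 # small sample, where the bias is most visible
trials = 20_000

biased, unbiased = [], []
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    x_bar = sum(sample) / n
    squared_devs = sum((x - x_bar) ** 2 for x in sample)
    biased.append(squared_devs / n)          # divides by n
    unbiased.append(squared_devs / (n - 1))  # divides by n - 1

avg_biased = sum(biased) / trials
avg_unbiased = sum(unbiased) / trials
print(f"true variance:                {TRUE_VARIANCE}")
print(f"average of 1/n estimator:     {avg_biased:.3f}")
print(f"average of 1/(n-1) estimator: {avg_unbiased:.3f}")
```

On average the $\frac{1}{n}$ estimator comes out near $\frac{n-1}{n}\sigma^2 = 3.2$, while the $\frac{1}{n-1}$ estimator averages close to 4.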

### Law of Large Numbers

The Law of Large Numbers is a fundamental concept in probability and statistics that states that as the size of a sample drawn from a population increases, the sample mean (or average) will get closer and closer to the population mean.

#### Example with Human Height:
Let's say you want to estimate the average height of all humans. If you measure the height of just one person, you get an estimate, but it's likely to be off due to randomness. If you measure two or three people and take the average of their heights, your estimate improves. As you increase the number of people measured—say 10, 100, or 1,000—the average of these heights will get closer to the true average height of the entire human population.

#### Dice Example:
Imagine you have a fair 4-sided die with possible outcomes of 1, 2, 3, and 4. The true average (mean) outcome of a single roll is $(1 + 2 + 3 + 4)/4 = 2.5$. To illustrate the Law of Large Numbers, consider an experiment where you roll the die twice and calculate the average of the two rolls. There are 16 equally likely pairs of outcomes (e.g., (1,1), (1,2), etc.), each with its own average between 1 and 4.

If you take just one such sample (e.g., rolling a 4 and a 3), the average is 3.5, well above the true mean of 2.5. But if you keep rolling and average over all the rolls so far, that running average converges towards 2.5.

#### Formal Statement:
Mathematically, the Law of Large Numbers states that if you take an increasing number of independent and identically distributed (i.i.d) samples $ X_1, X_2, ..., X_n $ from a population, the average of these samples:

$$\bar{X}_n = \frac{X_1 + X_2 + ... + X_n}{n}$$

will tend to get closer and closer to the population mean $ \mu_X $ as the sample size $ n $ increases.

#### Conditions:
1. **Random Sampling**: The samples must be drawn randomly from the population.
2. **Large Sample Size**: The sample size must be sufficiently large. Larger samples provide more accurate estimates.
3. **Independence**: The individual observations in the sample must be independent of each other.

In summary, the Law of Large Numbers ensures that as you collect more data, your estimate of the population mean becomes increasingly accurate, assuming the conditions are met.
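
A quick way to watch this convergence is to simulate the 4-sided die for increasing numbers of rolls. This is a minimal sketch; the roll counts and the helper name `average_of_rolls` are arbitrary choices for illustration.

```python
import random

random.seed(3)  # fixed seed so the run is reproducible

TRUE_MEAN = 2.5  # (1 + 2 + 3 + 4) / 4 for a fair 4-sided die

def average_of_rolls(n):
    """Average of n independent rolls of a fair 4-sided die."""
    return sum(random.randint(1, 4) for _ in range(n)) / n

averages = {n: average_of_rolls(n) for n in (2, 10, 100, 10_000, 100_000)}
for n, avg in averages.items():
    print(f"n={n:>6}: average = {avg:.4f}, |error| = {abs(avg - TRUE_MEAN):.4f}")
```

The error does not have to shrink at every single step, but for large $n$ the average stays very close to 2.5.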

### Central Limit Theorem - Discrete Random Variable

The **Central Limit Theorem (CLT)** is one of the most important results in statistics. It states that the distribution of the sum (or average) of a large number of independent, identically distributed (i.i.d.) random variables tends towards a normal distribution, even if the original variables themselves are not normally distributed. This holds for any original distribution with finite variance, as long as the number of variables ($n$) is sufficiently large.

#### Example with Coin Flips:
Let's take the example of flipping a fair coin, where the probability of getting heads (X = 1) is 0.5 and tails (X = 0) is also 0.5. For a single flip, the probability distribution is simple: there's a 50% chance of getting either heads or tails.

Now, consider flipping two coins. The possible outcomes are:
- 0 heads (TT) with a probability of 1/4,
- 1 head (HT or TH) with a probability of 2/4,
- 2 heads (HH) with a probability of 1/4.

This gives us a distribution that's starting to look more complex, but it's still discrete.

As you increase the number of flips:
- With three coins, the distribution has more outcomes (0, 1, 2, or 3 heads) and looks more spread out.
- With 10 coins, the distribution of the number of heads becomes more bell-shaped and starts resembling a normal distribution.

#### Why Does This Happen?
Each coin flip is a Bernoulli trial, which is a simple random variable that can take one of two values: 1 (for heads) or 0 (for tails). When you sum the results of multiple coin flips, you're essentially adding up a number of Bernoulli random variables. The CLT states that as you increase the number of these trials (coin flips), the distribution of their sum (or average) will approach a normal distribution.

#### Calculating the Mean and Variance:
For a single coin flip (n = 1):
- The mean $\mu$ is $np = 1 \times 0.5 = 0.5$.
- The variance $\sigma^2$ is $np(1-p) = 1 \times 0.5 \times 0.5 = 0.25$.

As you increase the number of flips (n):
- The mean becomes $\mu = np$.
- The variance becomes $\sigma^2 = np(1-p)$.

For example:
- For 10 flips, $\mu = 10 \times 0.5 = 5$ and $\sigma^2 = 10 \times 0.5 \times 0.5 = 2.5$.

#### Implication of the Central Limit Theorem:
The Central Limit Theorem is powerful because it allows us to make inferences about the sum or average of a large number of independent random variables, regardless of their original distribution. This is why the normal distribution appears so frequently in statistical analyses, even when the underlying data is not normally distributed.

In summary, the Central Limit Theorem shows that as you increase the number of observations or trials, the distribution of their sum (and of their average) tends to become normal (Gaussian). For $n$ coin flips, the number of heads has mean $np$ and variance $np(1-p)$, while the proportion of heads has mean $p$ and variance $\frac{p(1-p)}{n}$. This makes the theorem a cornerstone of probability theory and statistics.
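
As a rough numerical check, the exact distribution of the number of heads can be compared with a normal density that has the same mean $np$ and variance $np(1-p)$. The helper names and the choice of $n$ values below are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k heads in n flips)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

p = 0.5
gaps = {}
for n in (2, 10, 100):
    mean, variance = n * p, n * p * (1 - p)  # mean and variance of the number of heads
    # Largest gap between the exact pmf and the normal density at k = 0..n.
    gaps[n] = max(
        abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, variance))
        for k in range(n + 1)
    )
    print(f"n={n:>3}: mean={mean}, variance={variance}, max gap = {gaps[n]:.4f}")
```

The gap shrinks as $n$ grows, which is the bell shape emerging from the coin flips.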

### Central Limit Theorem - Continuous Random Variable

This section illustrates the **Central Limit Theorem (CLT)** through a practical experiment, showing how the theorem applies to a continuous random variable. Here's a summary of the key points:

1. **Central Limit Theorem (CLT):** It states that when you take a large enough sample (n ≥ 30 is a common rule of thumb) from any distribution with finite variance, the sample mean will be approximately normally distributed, regardless of the original distribution's shape. The larger the sample size, the more closely the distribution of the sample mean resembles a Gaussian (normal) distribution.

2. **Example Using Call Wait Time:**
   - **Initial Setup:** The wait time for a call to be answered follows a uniform distribution between 0 and 15 minutes.
   - **Sampling and Averaging:** The experiment involves taking different sample sizes (n = 1, 2, 3, etc.) and calculating the average wait time for each sample. Repeating this process many times and plotting the resulting averages provides a histogram of the sample means.
   - **Observations:**
     - For **n = 1**: The distribution of the sample mean resembles the original uniform distribution.
     - For **n = 2**: The distribution begins to take on a triangular shape, centered around the population mean (7.5 minutes).
     - For **n = 3** and beyond: The distribution starts to become more bell-shaped and symmetric, approaching the normal distribution.
   - **Mean and Variance:** 
     - The mean of the sample means remains constant at the population mean (7.5 minutes).
     - The variance of the sample means decreases as n increases, showing that the sample mean becomes more concentrated around the population mean.

3. **Standardization:** 
   - To compare distributions of sample means for different n, standardization is used: subtract the mean of the sample mean ($\mu$) and divide by its standard deviation ($\sigma / \sqrt{n}$). The standardized variable has mean 0 and variance 1 for every n, although its shape is not yet normal for small n.
   - This makes it easier to observe the effects of the CLT, as the distribution of the standardized sample means becomes closer to a standard normal distribution as n increases.

4. **Formal Definition of CLT:** 
   - As n approaches infinity, the standardized average of n independent and identically distributed (i.i.d.) random variables with mean $\mu$ and finite variance $\sigma^2$, $\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$, converges in distribution to a standard normal distribution.
   - Equivalently, the standardized sum $\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}}$ converges to the same standard normal distribution.

5. **Practical Application:** 
   - In practice, the CLT allows statisticians to make inferences about population parameters using sample data. The fact that the sample mean follows a normal distribution (for sufficiently large n) makes it easier to calculate confidence intervals and conduct hypothesis tests, even when the original data is not normally distributed.

This experiment and explanation show how the CLT is applicable in real-world situations, making it an essential tool in statistics for analyzing and interpreting data from various distributions.
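
The wait-time experiment can be reproduced in a few lines. The sketch below draws wait times from Uniform(0, 15), as in the example, and checks that the mean of the sample means stays near 7.5 while their variance falls roughly as $\sigma^2 / n$; the number of repetitions and the helper name `sample_means` are assumptions.

```python
import random
import statistics

random.seed(4)  # fixed seed so the run is reproducible

POP_MEAN = 7.5        # (0 + 15) / 2
POP_VAR = 15**2 / 12  # variance of Uniform(0, 15) = 18.75

def sample_means(n, repeats=20_000):
    """Means of `repeats` samples, each of n wait times from Uniform(0, 15)."""
    return [
        sum(random.uniform(0, 15) for _ in range(n)) / n
        for _ in range(repeats)
    ]

results = {}
for n in (1, 2, 10, 30):
    means = sample_means(n)
    results[n] = (statistics.fmean(means), statistics.pvariance(means))
    print(f"n={n:>2}: mean of sample means = {results[n][0]:.3f}, "
          f"variance = {results[n][1]:.3f} (theory: {POP_VAR / n:.3f})")
```

Plotting a histogram of `sample_means(n)` for each `n` would show the flat, triangular, and then bell-shaped curves described above.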

## Lesson 2 - Point Estimation

### Point Estimation

In this lesson, you’re diving into the important concept of **estimation**, which is central to statistics and machine learning. Here’s a breakdown of the key ideas you'll be exploring:

#### 1. **Estimation in Statistics:**
   - **Point Estimation:** This involves estimating an unknown parameter of a population (like the mean or variance) using sample data. A point estimate provides a single value as an estimate of the parameter.
   - **Maximum Likelihood Estimation (MLE):** MLE is one of the most widely used methods for point estimation. It aims to find the parameter values that maximize the likelihood function, meaning they make the observed data most probable. MLE is crucial in various machine learning models, including logistic regression, neural networks, and more.

#### 2. **Maximum a Posteriori (MAP) Estimation:**
   - MAP is an extension of MLE that incorporates prior knowledge or beliefs about the parameter, using Bayes' theorem. While MLE only considers the likelihood of the data, MAP combines this likelihood with a prior distribution, resulting in an estimate that reflects both the data and prior beliefs.
   - **Bayes' Theorem in MAP:** Bayes' theorem updates the probability estimate for a parameter as more evidence (data) becomes available. The MAP estimate is the mode of the posterior distribution, which is proportional to the product of the likelihood function and the prior distribution.
   - **MAP vs. MLE:** MAP can be seen as a regularized version of MLE. In machine learning, regularization is a technique used to prevent overfitting by adding a penalty for complexity (e.g., large coefficients in a regression model). MAP estimation introduces this regularization naturally through the prior distribution.

#### 3. **Regularization in Machine Learning:**
   - **Overfitting Prevention:** Regularization techniques, such as L1 (lasso) and L2 (ridge) regularization, are used in machine learning to prevent models from fitting the noise in the data too closely, which leads to poor generalization to new data.
   - **Connection to MAP:** The lesson will show you how MAP estimation can be interpreted as MLE with a regularization term, where the prior distribution in MAP serves as the source of regularization. This perspective is not only elegant but also practically useful in building robust machine learning models.

#### 4. **Detailed Walkthrough:**
   - The lesson will guide you through the mathematical details of MLE and MAP estimation, demonstrating how these methods work in practice.
   - You’ll learn how to derive MLE and MAP estimates, understand their properties, and see examples of their application in real-world problems.

This lesson sets the stage for understanding the core techniques used in statistical inference and machine learning, particularly in how models are trained and optimized to make predictions based on data.

### Maximum Likelihood Estimation Motivation

In this video, you're learning about **Maximum Likelihood Estimation (MLE)**, a fundamental concept in statistics and machine learning that's used to train models by finding the most probable explanation for observed data. Let's break down the key points:

#### **Concept of MLE:**
- **Scenario Setup:** MLE is about inferring the most likely scenario that explains a given set of evidence. For instance, you see popcorn on the floor and want to determine what led to this. You consider different scenarios, like people watching a movie, playing board games, or someone taking a nap.
- **Probability Assessment:** Each scenario has a different probability of leading to popcorn on the floor:
  - **Movies:** High probability
  - **Board games:** Medium probability
  - **Taking a nap:** Low probability

- **Choosing the Most Likely Scenario:** Since watching a movie has the highest probability of resulting in popcorn on the floor, you infer that this scenario is the most likely to have occurred. This process of choosing the scenario that maximizes the probability of the observed evidence is known as **maximum likelihood**.

#### **Application to Machine Learning:**
- **Model Selection:** In machine learning, you have a dataset (your evidence) and several possible models (scenarios) that could have generated this data. MLE helps you choose the model that most likely produced the data.
- **Maximizing Conditional Probability:** You estimate the probability of observing your data given each model (i.e., P(Data | Model 1), P(Data | Model 2), etc.). The model that maximizes this probability is considered the best fit.
  
#### **Link to Linear Regression:**
- **Data Points and Models:** Imagine you have data points and three possible linear models (lines) that could explain these points.
- **Probability of Data Given a Model:** MLE would involve evaluating how likely it is that the data points were generated by each of these lines. The line that makes the observed data most probable is the one you choose.
- **Linear Regression Context:** In linear regression, you're finding the line that best fits your data by maximizing the likelihood of the data given the line (or minimizing the sum of squared errors, which is closely related).

#### **Summary:**
- **MLE in Simple Terms:** MLE involves picking the scenario or model that makes the observed data most likely. In machine learning, it helps you choose the model that best explains the data you have.
- **Why It's Important:** MLE is a cornerstone of many machine learning algorithms, including linear regression, where it helps in selecting the best model to predict future outcomes based on past data.

You'll delve into more detailed examples and mathematical explanations of MLE later, but this gives you a solid conceptual foundation to understand how MLE works and why it's so widely used in machine learning.

### MLE: Bernoulli Example

This example uses coin tosses to make the idea of MLE concrete.

#### **Scenario:**
- **Coin Tosses:** You tossed a coin 10 times, resulting in 8 heads and 2 tails.
- **Coins:** You have three possible coins:
  - **Coin 1:** Probability of heads = 0.7
  - **Coin 2:** Probability of heads = 0.5 (fair coin)
  - **Coin 3:** Probability of heads = 0.3

#### **Objective:**
Determine which coin most likely produced the observed results (8 heads, 2 tails).

#### **Maximum Likelihood Estimation (MLE):**
1. **Calculate Likelihood for Each Coin:**
   - **Coin 1:**
     - Probability of 8 heads and 2 tails = $0.7^8 \times 0.3^2$
     - Compute: $0.7^8 \approx 0.0576$, $0.3^2 = 0.09$
     - Likelihood = $0.0576 \times 0.09 \approx 0.0052$

   - **Coin 2:**
     - Probability of 8 heads and 2 tails = $0.5^{10}$
     - Compute: $0.5^{10} \approx 0.00098$

   - **Coin 3:**
     - Probability of 8 heads and 2 tails = $0.3^8 \times 0.7^2$
     - Compute: $0.3^8 \approx 0.0000656$, $0.7^2 = 0.49$
     - Likelihood = $0.0000656 \times 0.49 \approx 0.000032$

   The highest likelihood is for **Coin 1** (about 0.0052), so it is the coin most likely to have generated the data. (These are probabilities of one specific sequence of 8 heads and 2 tails; the binomial coefficient is the same for every coin, so it does not affect the comparison.)

2. **Finding the Optimal Probability $P$:**
   - Suppose we don't know the exact probability of heads $P$, but we can use MLE to estimate it.
   - The likelihood function for $P$ (where $P$ is the probability of heads) is:
     $$   \text{Likelihood}(P) = P^8 \times (1 - P)^2$$
   - To simplify, use the **log likelihood**:
     $$   \text{Log Likelihood} = 8 \log(P) + 2 \log(1 - P)$$
   - Take the derivative of the log likelihood with respect to $P$ and set it to 0:
     $$   \frac{d}{dP} [8 \log(P) + 2 \log(1 - P)] = \frac{8}{P} - \frac{2}{1 - P} = 0$$
   - Solve for $P$:
     $$   \frac{8}{P} = \frac{2}{1 - P}$$
     $$   8(1 - P) = 2P$$
     $$   8 - 8P = 2P$$
     $$   8 = 10P$$
     $$   P = \frac{8}{10} = 0.8$$

   The estimated probability of heads $P$ is 0.8, which aligns with the observed data (8 heads in 10 tosses).

#### **General Case:**
- **Bernoulli Variable:** For a sequence of $n$ coin flips with $k$ heads, the likelihood function is:
  $$\text{Likelihood}(P) = P^k \times (1 - P)^{n - k}$$
- **Log Likelihood:**
  $$\text{Log Likelihood} = k \log(P) + (n - k) \log(1 - P)$$
- **Maximization:** Taking the derivative and solving gives:
  $$\hat{P} = \frac{k}{n}$$
  Thus, the MLE for the probability of heads is simply the proportion of heads observed in the flips.
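
The differentiation step behind this result can be written out as a sketch that mirrors the 8-heads calculation, with $k$ and $n - k$ in place of 8 and 2 (assuming $0 < k < n$ so that both logarithms are defined at the maximum):

$$\frac{d}{dP} \left[ k \log(P) + (n - k) \log(1 - P) \right] = \frac{k}{P} - \frac{n - k}{1 - P} = 0$$
$$k(1 - P) = (n - k)P$$
$$k - kP = nP - kP$$
$$k = nP \quad \Rightarrow \quad \hat{P} = \frac{k}{n}$$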

This example illustrates how MLE can be used to estimate parameters (like the probability of heads) by maximizing the likelihood of observing the given data.
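
Both steps of the coin example can be checked numerically. The sketch below compares the three candidate coins and then searches a grid of $P$ values for the maximizer of the log-likelihood; the grid resolution and the function names are illustrative choices, and the grid search stands in for the calculus shown above.

```python
import math

heads, tails = 8, 2

def likelihood(p):
    """Probability of one specific sequence with 8 heads and 2 tails."""
    return p**heads * (1 - p) ** tails

def log_likelihood(p):
    return heads * math.log(p) + tails * math.log(1 - p)

# Step 1: compare the three candidate coins.
coins = {"Coin 1": 0.7, "Coin 2": 0.5, "Coin 3": 0.3}
for name, p in coins.items():
    print(f"{name} (p={p}): likelihood = {likelihood(p):.6f}")
best_coin = max(coins, key=lambda name: likelihood(coins[name]))

# Step 2: search a grid of p values for the maximizer of the log-likelihood.
grid = [i / 1000 for i in range(1, 1000)]  # 0.001 ... 0.999, avoids log(0)
p_hat = max(grid, key=log_likelihood)

print(f"best coin: {best_coin}, grid-search MLE: {p_hat}")
```

Maximizing the log-likelihood gives the same answer as maximizing the likelihood itself, because the logarithm is an increasing function.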

### MLE: Gaussian Example

#### **Understanding Maximum Likelihood Estimation (MLE) with Gaussian Distributions**

In this scenario, you're given some observations (1 and -1) and need to determine which of the given Gaussian distributions most likely generated these observations.

#### **Step-by-Step Solution:**

1. **Initial Choices:**
   - **Distribution 1:** Normal distribution with mean 10 and standard deviation 1
   - **Distribution 2:** Normal distribution with mean 2 and standard deviation 1

   To determine which distribution is more likely to have generated the observations, calculate the likelihood of observing 1 and -1 under each distribution.

2. **Calculate Likelihoods:**

   - **Distribution 1:** Mean = 10, SD = 1
     - Likelihood of observing 1: $\phi(1; 10, 1)$
     - Likelihood of observing -1: $\phi(-1; 10, 1)$

     Using the normal density function:
     $$\phi(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
     For mean 10, SD 1:
     - $\phi(1; 10, 1) = \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{(1 - 10)^2}{2} \right) \approx 1.0 \times 10^{-18}$
     - $\phi(-1; 10, 1) = \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{(-1 - 10)^2}{2} \right) \approx 2.1 \times 10^{-27}$

   - **Distribution 2:** Mean = 2, SD = 1
     - Likelihood of observing 1: $\phi(1; 2, 1)$
     - Likelihood of observing -1: $\phi(-1; 2, 1)$

     For mean 2, SD 1:
     - $\phi(1; 2, 1) = \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{(1 - 2)^2}{2} \right) \approx 0.242$
     - $\phi(-1; 2, 1) = \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{(-1 - 2)^2}{2} \right) \approx 0.0044$

   You’ll find that the second distribution (mean = 2, SD = 1) provides higher likelihoods for both observations 1 and -1 compared to the first distribution.

3. **Compare Likelihoods:**

   Now consider three new candidate Gaussians:
   - **Gaussian 1:** Mean = -1, SD = 1
   - **Gaussian 2:** Mean = 0, SD = 1
   - **Gaussian 3:** Mean = 1, SD = 1

   Calculate the likelihood of observing 1 and -1 for each Gaussian:

   - **Gaussian 1:** Mean = -1, SD = 1
     - Likelihoods: $\phi(1; -1, 1) \approx 0.054$ and $\phi(-1; -1, 1) \approx 0.399$
     - Product of likelihoods: $0.054 \times 0.399 \approx 0.022$

   - **Gaussian 2:** Mean = 0, SD = 1
     - Likelihoods: $\phi(1; 0, 1) \approx 0.242$ and $\phi(-1; 0, 1) \approx 0.242$
     - Product of likelihoods: $0.242 \times 0.242 \approx 0.059$

   - **Gaussian 3:** Mean = 1, SD = 1
     - Likelihoods: $\phi(1; 1, 1) \approx 0.399$ and $\phi(-1; 1, 1) \approx 0.054$
     - Product of likelihoods: $0.399 \times 0.054 \approx 0.022$

   **Conclusion:** The Gaussian distribution with mean 0 and standard deviation 1 has the highest product of likelihoods, making it the most likely distribution to have generated the observations 1 and -1.

4. **Variance and Standard Deviation:**

   - **Variance of Observations:** $\text{Var} = \frac{(1 - \text{mean})^2 + (-1 - \text{mean})^2}{2}$. For the observations (1, -1), the mean is 0, so variance = 1.
   - **Candidate Distributions (each taken to have mean 0):**
     - SD = 0.5 → Variance = 0.25
     - SD = 1 → Variance = 1
     - SD = 2 → Variance = 4

   The distribution with SD 1 (variance 1) matches the variance of the observations, so it gives the highest likelihood of the three.

#### **Summary:**

- **Gaussian Distribution with Mean 0 and SD 1** is the most likely candidate for generating the observations 1 and -1 due to its highest likelihood value, which aligns well with both the data’s mean and variance.
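
These comparisons are easy to verify numerically. The sketch below evaluates the joint likelihood of the observations 1 and -1 for the candidate means (SD fixed at 1) and for the candidate SDs; fixing the mean at 0 for the SD comparison is an assumption, and the helper names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd^2) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd**2)) / math.sqrt(2 * math.pi * sd**2)

observations = [1, -1]

def likelihood(mean, sd):
    """Joint likelihood of the independent observations."""
    return math.prod(normal_pdf(x, mean, sd) for x in observations)

# Candidate means with SD fixed at 1.
by_mean = {mean: likelihood(mean, 1) for mean in (-1, 0, 1)}
# Candidate SDs with the mean fixed at 0 (assumed).
by_sd = {sd: likelihood(0, sd) for sd in (0.5, 1, 2)}

best_mean = max(by_mean, key=by_mean.get)
best_sd = max(by_sd, key=by_sd.get)

for mean, value in by_mean.items():
    print(f"mean={mean:>2}, sd=1: likelihood = {value:.4f}")
for sd, value in by_sd.items():
    print(f"mean=0, sd={sd}: likelihood = {value:.4f}")
print(f"best mean: {best_mean}, best sd: {best_sd}")
```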

### MLE: Linear Regression

#### **Link Between Maximum Likelihood Estimation (MLE) and Linear Regression**

This section shows how Maximum Likelihood Estimation (MLE) applies to linear regression: finding the line that best fits a set of data points is equivalent to maximizing the likelihood of observing those points given the line. Here's a breakdown of the key concepts and steps:

##### **1. MLE and Linear Regression:**
- **Objective:** In linear regression, the goal is to find the line that best fits the given data points. This can be approached using MLE, where we model the data generation process probabilistically.

##### **2. Data Generation Model:**
- **Assumption:** We assume that each observed $y_i$ is generated from a Gaussian distribution centered on the line. For each $x_i$, the Gaussian is centered at the line's value at $x_i$, and the vertical distance from the data point to the line is the deviation from that center.

##### **3. Likelihood Calculation:**
- **Gaussian Likelihood:** For a given line $y = mx + b$, the vertical distance from each data point $(x_i, y_i)$ to the line is $d_i = y_i - (m x_i + b)$. The likelihood of observing each data point under the Gaussian distribution centered at the line can be written as:
  $$P(y_i | x_i, \text{line}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{d_i^2}{2\sigma^2} \right)$$
  where $\sigma$ is the standard deviation of the Gaussian.

- **Total Likelihood:** Since the data points are assumed to be independently generated, the total likelihood is the product of individual likelihoods:
  $$L(\text{line}) = \prod_{i} P(y_i | x_i, \text{line})$$
  Taking the logarithm of the likelihood (log-likelihood) simplifies the multiplication into summation:
  $$\log L(\text{line}) = \sum_{i} \log P(y_i | x_i, \text{line})$$

##### **4. Maximizing the Likelihood:**
- **Simplified Expression:** Substituting the Gaussian likelihood into the log-likelihood equation:
  $$\log L(\text{line}) = \sum_{i} \left( -\frac{d_i^2}{2\sigma^2} - \log (\sqrt{2 \pi \sigma^2}) \right)$$
  Since $\log (\sqrt{2 \pi \sigma^2})$ is a constant that does not depend on the line, it can be ignored in the optimization process:
  $$\log L(\text{line}) = -\frac{1}{2\sigma^2} \sum_{i} d_i^2 + \text{constant}$$
  Maximizing the likelihood is thus equivalent to minimizing the sum of squared distances $\sum_{i} d_i^2$.

- **Least Squares Error:** The objective function in linear regression is to minimize the sum of squared residuals (or errors), which is:
  $$\text{Sum of Squared Errors} = \sum_{i} d_i^2$$

  This is exactly what is done in linear regression.

##### **5. Conclusion:**
- **MLE and Linear Regression:** The line that maximizes the likelihood of the data being generated from that line, under the Gaussian assumption, is the same as the line that minimizes the least squares error. Hence, linear regression can be understood as a specific case of MLE where the data is assumed to be generated from a Gaussian distribution with the residuals centered around the line.

#### **Example:**

In the given example:
- **Models:** Different lines are tested (Model 1, Model 2, Model 3).
- **Evaluation:** Each line is evaluated by calculating the likelihood of the given data points assuming they are generated from a Gaussian centered at each line.
- **Best Fit:** The line with the highest likelihood (or equivalently, the line with the smallest sum of squared distances) is selected as the best fit for the data.

This illustrates how MLE provides a probabilistic framework for fitting models and connects directly to familiar regression techniques.
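
The equivalence can be demonstrated on a toy problem. In the sketch below, the data points, the three candidate lines, and the noise level `SIGMA` are all hypothetical values chosen for illustration; the point is only that ranking lines by Gaussian log-likelihood and by sum of squared errors selects the same line.

```python
import math

# Hypothetical data points and candidate lines (slope m, intercept b).
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
models = {"Model 1": (1.0, 2.0), "Model 2": (2.0, 0.0), "Model 3": (3.0, -2.0)}
SIGMA = 1.0  # assumed standard deviation of the Gaussian noise

def sum_squared_errors(m, b):
    """Sum of squared vertical distances d_i = y_i - (m * x_i + b)."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

def log_likelihood(m, b):
    """Gaussian log-likelihood of the points given the line y = m*x + b."""
    n = len(points)
    constant = n * math.log(math.sqrt(2 * math.pi * SIGMA**2))
    return -sum_squared_errors(m, b) / (2 * SIGMA**2) - constant

for name, (m, b) in models.items():
    print(f"{name}: SSE = {sum_squared_errors(m, b):.2f}, "
          f"log-likelihood = {log_likelihood(m, b):.2f}")

best_by_likelihood = max(models, key=lambda k: log_likelihood(*models[k]))
best_by_sse = min(models, key=lambda k: sum_squared_errors(*models[k]))
print(f"best by likelihood: {best_by_likelihood}, best by SSE: {best_by_sse}")
```

Changing `SIGMA` rescales the log-likelihoods but does not change which line wins, since the constant term and the factor $\frac{1}{2\sigma^2}$ are the same for every candidate.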

### Regularization

#### **Regularization in the Context of Model Complexity and Maximum Likelihood**

Regularization is a technique used to prevent overfitting by adding a penalty for model complexity. Here’s a breakdown of how it works and its connection to probability and maximum likelihood estimation (MLE).

##### **1. Regularization Overview**

- **Objective:** Regularization aims to improve the generalization of a model by discouraging overly complex models that might fit the training data too closely (overfitting). This is achieved by adding a penalty term to the loss function based on model complexity.

##### **2. Types of Regularization**

- **L2 Regularization (Ridge Regression):** Adds a penalty proportional to the sum of the squared coefficients (excluding the intercept term). The regularization term is given by:
  $$\text{Penalty} = \lambda \sum_{i=1}^n w_i^2$$
  where $\lambda$ is the regularization parameter, and $w_i$ are the model coefficients.

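- **Regularized Loss Function:** The total loss with regularization is:
  $$\text{Regularized Loss} = \text{Loss} + \lambda \sum_{i=1}^n w_i^2$$
  where $\text{Loss}$ is the unregularized training loss (for example, squared error for regression or log loss for classification).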

##### **3. Example of Regularization**

- **Models:**
  - **Linear Model:** $y = 4x + 3$ with a loss of 10
  - **Quadratic Model:** $y = 2x^2 - 4x + 5$ with a loss of 2
  - **Polynomial Model of Degree 10:** with a loss of 0.1

- **Penalties:**
  - **Linear Model:** $4^2 = 16$
  - **Quadratic Model:** $2^2 + (-4)^2 = 20$
  - **Polynomial Model:** Sum of the squares of all coefficients, which equals 262

- **Regularized Losses:**
  - **Linear Model:** $10 + \lambda \times 16$
  - **Quadratic Model:** $2 + \lambda \times 20$
  - **Polynomial Model:** $0.1 + \lambda \times 262$

  The regularization parameter $\lambda$ controls how strongly complexity is penalized. With $\lambda = 0$ the degree-10 polynomial has the lowest regularized loss; as $\lambda$ grows, the penalty term dominates and simpler models become more favorable.
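
To see how the choice of $\lambda$ changes the winner, the regularized losses can be evaluated directly. The losses and penalties are the ones from the example; the specific $\lambda$ values tried and the function names are illustrative assumptions.

```python
# Training losses and L2 penalties taken from the example above.
models = {
    "linear": {"loss": 10.0, "penalty": 16.0},
    "quadratic": {"loss": 2.0, "penalty": 20.0},
    "degree-10 polynomial": {"loss": 0.1, "penalty": 262.0},
}


def regularized_loss(loss, penalty, lam):
    """Return loss + lambda * penalty."""
    return loss + lam * penalty


def best_model(lam):
    """Name of the model with the smallest regularized loss for this lambda."""
    return min(models, key=lambda name: regularized_loss(lam=lam, **models[name]))


# The lambda values below are illustrative choices, not values from the lesson.
for lam in (0.0, 0.1, 3.0):
    scores = {name: regularized_loss(lam=lam, **m) for name, m in models.items()}
    print(lam, best_model(lam), scores)
```

With these numbers the polynomial wins only when $\lambda$ is close to zero, the quadratic wins for moderate values such as $\lambda = 0.1$, and the linear model takes over once $\lambda > 2$ (where $10 + 16\lambda < 2 + 20\lambda$).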

##### **4. Connection to Probability and MLE**

- **MLE without Regularization:** Finds the model parameters that maximize the likelihood of the observed data. For complex models, this often leads to overfitting since the model can perfectly fit the training data without considering generalization.

- **MLE with Regularization:** Incorporates regularization into the likelihood framework. The goal is to maximize the likelihood while also considering the penalty for model complexity. The regularization term helps to balance between fitting the data well and keeping the model simple.

##### **5. How Regularization Affects Model Selection**

- **Unregularized Selection:** Chooses the model with the lowest training error. In the absence of regularization, the complex polynomial (degree 10) might win due to its low error on the training data.

- **Regularized Selection:** The addition of the regularization term adjusts the selection criteria, favoring models that are less complex even if they have slightly higher training error. The model selection is now influenced by both the fit to the data and the complexity penalty.

##### **6. Summary**

- **Regularization** is used to penalize model complexity, helping to select models that generalize better to unseen data.
- **Incorporation into MLE:** Regularization modifies the standard MLE approach by adding a complexity penalty, leading to a trade-off between fit and simplicity.
- **Practical Impact:** Regularization adjusts the loss function to prevent overfitting, thereby improving the model's ability to generalize to new data.

Regularization helps ensure that the model you select is not only the best fit for the training data but also maintains a balance between complexity and generalization.

### Back to "Bayesics"

This example highlights a fundamental concept in probability and Bayesian inference: we need to consider both the likelihood of observing the evidence given a scenario and the prior probability of that scenario itself. The sections below break this down using the example of popcorn on the floor.

#### **Understanding the Example**

##### **Initial Setup:**

1. **Scenarios:**
   - **Movies:** High probability of popcorn on the floor.
   - **Board Games:** Medium probability of popcorn on the floor.
   - **Nap:** Low probability of popcorn on the floor.

2. **New Candidates:**
   - **Movies:** High probability of popcorn on the floor.
   - **Popcorn Throwing Contest:** Very high probability of popcorn on the floor.

##### **Evaluating Scenarios:**

1. **Probability of Evidence Given the Scenario:**
   - **P(Popcorn | Movies)**: High
   - **P(Popcorn | Contest)**: Very High

2. **Prior Probability of the Scenarios:**
   - **P(Movies)**: High
   - **P(Contest)**: Low

##### **Finding the Most Likely Scenario:**

To find the most likely scenario, you should consider both:

- **Likelihood of Evidence Given the Scenario** (how probable is the evidence if the scenario is true?)
- **Prior Probability of the Scenario** (how likely is the scenario itself?)

The correct approach is to maximize the **joint probability** of the evidence and the scenario. This involves multiplying the likelihood of the evidence given the scenario by the prior probability of the scenario:

- **For Movies:**
  $$  \text{Joint Probability} = P(\text{Popcorn} | \text{Movies}) \times P(\text{Movies})$$

- **For Contest:**
  $$  \text{Joint Probability} = P(\text{Popcorn} | \text{Contest}) \times P(\text{Contest})$$

##### **Applying Bayes' Theorem:**

Bayes’ Theorem helps in updating our beliefs about the scenarios based on the evidence:

$$P(\text{Movies} | \text{Popcorn}) = \frac{P(\text{Popcorn} | \text{Movies}) \times P(\text{Movies})}{P(\text{Popcorn})}$$

$$P(\text{Contest} | \text{Popcorn}) = \frac{P(\text{Popcorn} | \text{Contest}) \times P(\text{Contest})}{P(\text{Popcorn})}$$

Here:

- $P(\text{Popcorn})$ is the overall probability of finding popcorn on the floor, regardless of the scenario.
- The denominator $P(\text{Popcorn})$ normalizes the probabilities so they sum to 1.

##### **Why Movies Might Be Preferred:**

Even though the contest has a very high likelihood of popcorn on the floor, its prior probability is very low. Thus, when considering both the likelihood and prior probability, movies might turn out to be the more probable scenario due to its higher prior probability, despite a slightly lower likelihood of generating popcorn.
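
A small numeric sketch makes the trade-off concrete. The lesson only describes the probabilities qualitatively (high, very high, low), so every number below is a hypothetical stand-in, and the normalization is done over just the two candidate scenarios rather than over every possible scenario.

```python
# Hypothetical numbers: the lesson only describes these as high / very high / low.
likelihood = {"movies": 0.7, "contest": 0.99}   # P(popcorn | scenario)
prior = {"movies": 0.30, "contest": 0.001}      # P(scenario)

# Joint probability P(popcorn | scenario) * P(scenario) for each scenario.
joint = {s: likelihood[s] * prior[s] for s in prior}

# Normalizing over just these two candidates gives their relative posterior weight.
total = sum(joint.values())
posterior = {s: joint[s] / total for s in joint}

most_likely = max(joint, key=joint.get)
print(joint, posterior, most_likely)
```

Under these assumed values the contest has the higher likelihood, but the movies scenario has the larger joint probability and therefore the larger posterior.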

#### **Summary**

- **Maximum Likelihood:** Looks at which scenario makes the observed evidence most probable but doesn’t consider how likely the scenario itself is.
- **Bayesian Inference:** Considers both the likelihood of the evidence given the scenario and the prior probability of the scenario. It calculates the joint probability of evidence and scenario, updating beliefs in light of the evidence.

In practice, considering both the likelihood and prior helps ensure that we choose the most realistic and plausible scenario, balancing evidence and prior knowledge effectively.

### Bayesian Statistics - Frequentist vs. Bayesian

The debate between frequentist and Bayesian approaches in statistics revolves around how probabilities are interpreted and used in inference. The key differences between these philosophies are broken down below using a coin-tossing example:

#### **Frequentist Approach:**

1. **Probability Interpretation:**
   - **Frequentists** view probability as the long-term frequency of an event occurring if an experiment is repeated an infinite number of times. For them, probability is objective and derived purely from data.
   
2. **Inference:**
   - In the coin example, after tossing the coin 10 times and observing 8 heads and 2 tails, a frequentist would estimate the probability of heads as:
     $$\text{Probability of Heads} = \frac{\text{Number of Heads}}{\text{Total Tosses}} = \frac{8}{10} = 0.8$$
   - This result is based entirely on the observed data, without incorporating any prior beliefs about the fairness of the coin.

3. **Objective Evidence:**
   - Frequentists rely on data collected from experiments or trials to make inferences. They use methods like Maximum Likelihood Estimation (MLE) to find the model parameters that make the observed data most likely, but they do not incorporate prior beliefs or external information.

#### **Bayesian Approach:**

1. **Probability Interpretation:**
   - **Bayesians** view probability as a measure of belief or certainty about an event. This belief can be updated as new evidence is collected. Probabilities represent a degree of confidence in an event occurring.

2. **Prior Beliefs:**
   - Bayesians use **prior distributions** to incorporate prior knowledge or beliefs about a parameter before seeing the data. For instance, if the Bayesian initially believes the coin is likely fair, they might start with a prior belief that the probability of heads is around 0.5.

3. **Updating Beliefs:**
   - After observing the data (8 heads out of 10 tosses), Bayesians update their beliefs using **Bayes’ Theorem**:
     $$P(\text{Probability of Heads} | \text{Data}) = \frac{P(\text{Data} | \text{Probability of Heads}) \times P(\text{Probability of Heads})}{P(\text{Data})}$$
   - Here, \(P(\text{Probability of Heads})\) is the prior belief, \(P(\text{Data} | \text{Probability of Heads})\) is the likelihood of the data given the probability of heads, and \(P(\text{Data})\) is the overall probability of observing the data.

4. **Posterior Distribution:**
   - The result is a **posterior distribution**, which reflects updated beliefs about the probability of heads after considering the evidence. Even if the data shows 8 heads, the Bayesian might still assign a high probability to the coin being fair but will adjust the belief slightly based on the evidence.

#### **Conceptual Differences:**

- **Frequentist Methods:**
  - **Model Selection:** Find the model that maximizes the likelihood of the observed data.
  - **Parameter Estimation:** Focus on point estimates like MLE.

- **Bayesian Methods:**
  - **Model Selection:** Update prior beliefs based on observed data to get a posterior distribution.
  - **Parameter Estimation:** Use the entire posterior distribution to make inferences, allowing for uncertainty in parameter estimates.

#### **Illustrating with the Coin Toss Example:**

- **Frequentist Result:** The probability of heads is estimated as 0.8 based on observed frequencies.
  
- **Bayesian Result:** The estimate moves from the prior belief (0.5) toward the observed frequency (0.8), landing somewhere in between. How far it moves depends on how strong the prior is; for example, a moderately confident fair-coin prior might give an estimate like 0.65.

In summary, while frequentists rely solely on observed data to make inferences, Bayesians incorporate prior beliefs and update these beliefs with new evidence. This difference leads to different approaches in handling uncertainty and making predictions.
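
The two estimates can be compared in code. The sketch below assumes a Beta(5, 5) prior, a hypothetical choice that is symmetric around 0.5, and uses the posterior mean as the Bayesian point estimate; the lesson does not specify either of these.

```python
heads, tails = 8, 2

# Frequentist estimate: observed relative frequency (also the MLE).
frequentist_estimate = heads / (heads + tails)

# Assumed prior: Beta(5, 5), symmetric around 0.5 (a hypothetical choice).
prior_a, prior_b = 5, 5

# A Beta prior combined with coin-flip data gives a Beta posterior.
post_a, post_b = prior_a + heads, prior_b + tails

# The mean of Beta(a, b) is a / (a + b).
bayesian_estimate = post_a / (post_a + post_b)

print(frequentist_estimate, bayesian_estimate)
```

With this particular prior the posterior is Beta(13, 7) and its mean is 0.65, between the prior belief of 0.5 and the observed frequency of 0.8. A stronger or weaker prior would land elsewhere in that range.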

### Bayesian Statistics - MAP


#### **Different Priors and Their Impact**

1. **Conservative Bayesian:**
   - **Prior:** Very narrow, centered around 0.5.
   - **Initial Belief:** Strongly believes the coin is fair.
   - **Update After 10 Tosses (8 Heads):** Minimal change in belief. The posterior is very close to the prior, reflecting that the Bayesian is highly skeptical of deviating from their initial belief.
   - **Posterior Mode (MAP Estimation):** Close to 0.501, almost unchanged from the prior belief.

2. **Moderate Bayesian:**
   - **Prior:** Wider distribution, still centered around 0.5 but with more spread.
   - **Initial Belief:** Believes the coin is likely fair but is open to some bias.
   - **Update After 10 Tosses (8 Heads):** Shows a noticeable shift in belief. The prior was more flexible, so the evidence has a more substantial impact on the posterior.
   - **Posterior Mode (MAP Estimation):** Around 0.607, reflecting some adjustment from the initial belief but not as extreme.

3. **Non-Informative Bayesian:**
   - **Prior:** Uniform distribution, assigning equal probability to all possible values.
   - **Initial Belief:** No strong prior belief; all outcomes are equally likely.
   - **Update After 10 Tosses (8 Heads):** Significant change in belief. Since the prior was non-informative, the posterior reflects the data more strongly.
   - **Posterior Mode (MAP Estimation):** Around 0.8, which aligns with the frequentist estimate given the observed data.

#### **Summary of Concepts:**

- **Prior Distribution:**
  - Represents initial beliefs before seeing the data.
  - **Conservative Prior:** Strongly anchored belief; changes minimally.
  - **Moderate Prior:** Allows for more flexibility; adjusts moderately.
  - **Non-Informative Prior:** Equal belief across all possible outcomes; changes significantly based on data.

- **Posterior Distribution:**
  - Represents updated beliefs after observing data.
  - **MAP Estimation:** The value of the parameter that maximizes the posterior distribution. It serves as the point estimate for the parameter considering both the prior and the data.

- **Frequentist vs. Bayesian:**
  - **Frequentist:** Relies only on the observed data. With a non-informative (uniform) prior, the MAP estimate coincides with the frequentist MLE, because the posterior is then proportional to the likelihood.
  - **Bayesian:** The result depends on the prior belief. With informative priors, the posterior reflects both the prior and the evidence.

#### **Key Takeaway:**

Bayesian statistics allows for the incorporation of prior beliefs, which can significantly affect the results of statistical inference. The choice of prior impacts the posterior distribution and, consequently, the MAP estimation. In contrast, frequentist approaches are solely data-driven and do not incorporate prior beliefs.
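
The effect of prior width on the MAP estimate can be sketched with symmetric Beta(a, a) priors, assuming the data are 8 heads in 10 tosses as in the frequentist example. The specific values of `a` are hypothetical; they were picked only because they yield MAP estimates close to the ones listed above, and the lesson does not say which priors were actually used.

```python
def beta_map(alpha, beta):
    """Mode of Beta(alpha, beta); valid when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


heads, tails = 8, 2

# Hypothetical symmetric Beta(a, a) priors; a larger a means a narrower prior.
priors = {"conservative": 1500, "moderate": 10, "non-informative": 1}

map_estimates = {}
for name, a in priors.items():
    # Conjugate update: add the observed counts to the prior parameters.
    post_alpha, post_beta = a + heads, a + tails
    map_estimates[name] = beta_map(post_alpha, post_beta)

for name, estimate in map_estimates.items():
    print(f"{name}: {estimate:.3f}")
```

Under these assumptions the same ten tosses barely move the conservative prior, shift the moderate prior to roughly 0.61, and give exactly the frequentist value of 0.8 under the uniform prior.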



### Bayesian Statistics - Updating Priors

Bayes' theorem is a fundamental concept in Bayesian statistics, and understanding it is crucial for performing belief updates based on new evidence. Here's a summary of the key points from the video:

#### Bayes' Theorem Overview

1. **Bayes' Theorem Formula**:
   $$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
   - **Posterior ($P(A \mid B)$)**: Updated probability of event A given evidence B.
   - **Prior ($P(A)$)**: Initial probability of event A before considering evidence B.
   - **Likelihood ($P(B \mid A)$)**: Probability of evidence B given that event A is true.
   - **Marginal Likelihood ($P(B)$)**: Total probability of evidence B, considering all possible scenarios.

2. **Example**:
   - **Event A**: Being offered a job.
   - **Evidence B**: Receiving a request for a follow-up phone call.
   - **Posterior**: Probability of being offered the job given the follow-up phone call.
   - **Prior**: Initial belief about the likelihood of getting the job.
   - **Likelihood**: Probability of receiving a follow-up call if you are to get the job.
   - **Marginal Likelihood**: Overall probability of receiving a follow-up call.

#### Updating Beliefs with Bayesian Inference

1. **Coin Example**:
   - **Types of Coins**: Fair (0.5 heads) and Biased (0.8 heads).
   - **Priors**: Initial belief about the coin's type (e.g., 75% fair, 25% biased).
   - **Evidence**: Outcome of a coin flip (e.g., heads).

2. **Steps to Update Beliefs**:
   - **Calculate Posterior for Fair Coin**:
     $$P(\text{Fair} \mid \text{Heads}) = \frac{P(\text{Heads} \mid \text{Fair}) \cdot P(\text{Fair})}{P(\text{Heads})}$$
   - **Calculate Marginal Likelihood (Overall Probability of Heads)**:
     $$P(\text{Heads}) = P(\text{Heads} \mid \text{Fair}) \cdot P(\text{Fair}) + P(\text{Heads} \mid \text{Biased}) \cdot P(\text{Biased})$$
   - **Update Priors**:
     - **Posterior Probability**: Reflects updated belief after observing the evidence.
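     - **Example Calculation**: For a single flip that lands heads, with priors of 75% fair and 25% biased, $P(\text{Heads}) = 0.5 \cdot 0.75 + 0.8 \cdot 0.25 = 0.575$, so:
       - **Fair Coin**: Posterior probability drops from 75% to $0.375 / 0.575 \approx 65\%$.
       - **Biased Coin**: Posterior probability rises from 25% to $0.2 / 0.575 \approx 35\%$.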
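
The update steps can be written as a short function. This is a minimal sketch using the coin probabilities and priors from the example; the function name `update` and the order of the ten flips are illustrative assumptions.

```python
p_heads = {"fair": 0.5, "biased": 0.8}   # P(heads | coin type)
prior = {"fair": 0.75, "biased": 0.25}   # initial beliefs


def update(belief, flip):
    """Return the posterior over coin types after one flip ('H' or 'T')."""
    likelihood = {c: p_heads[c] if flip == "H" else 1 - p_heads[c] for c in belief}
    marginal = sum(likelihood[c] * belief[c] for c in belief)  # P(flip)
    return {c: likelihood[c] * belief[c] / marginal for c in belief}


after_one_head = update(prior, "H")

# Repeated updating: each posterior becomes the prior for the next flip.
belief = prior
for flip in "HHHHHHHHTT":  # 8 heads and 2 tails, in an arbitrary order
    belief = update(belief, flip)
after_ten_flips = belief

print(after_one_head, after_ten_flips)
```

A single heads moves the belief in a fair coin from 75% to about 65%. After ten flips with 8 heads and 2 tails, the same procedure leaves only about 30% belief in the fair coin, since each additional heads favors the biased coin.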

#### Generalized Formula for Continuous and Discrete Variables

1. **Both Discrete** ($X$ and $Y$ discrete):
   $$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

2. **Both Continuous** ($X$ and $Y$ continuous):
   $$f(Y \mid X) = \frac{f(X \mid Y) \cdot f(Y)}{f(X)}$$

3. **Discrete $Y$, Continuous $X$**:
   $$P(Y \mid X) = \frac{f(X \mid Y) \cdot P(Y)}{f(X)}$$

4. **Continuous $Y$, Discrete $X$**:
   $$f(Y \mid X) = \frac{P(X \mid Y) \cdot f(Y)}{P(X)}$$

Here $P$ denotes a probability (mass function) and $f$ a probability density function.

#### Maximum A Posteriori (MAP) Estimation

- **MAP Estimation**: To determine the most likely value of the parameter after observing the evidence, you use the mode of the posterior distribution.

#### Summary

Bayes' theorem provides a systematic way to update beliefs about an event or parameter based on new evidence. By calculating the posterior probability, you refine your initial beliefs (priors) and can make more informed decisions or predictions. The process involves considering the likelihood of the evidence, updating priors accordingly, and using various formulas depending on whether the variables are discrete or continuous.


### Bayesian Statistics - Full Worked Example 


#### Bayesian Statistics Summary and Notes

#### **Overview:**
Bayesian statistics is a method of statistical inference where prior beliefs are updated with new data. The core concept involves revising the probability of a hypothesis based on observed evidence. This process uses Bayes' theorem to update the prior distribution to a posterior distribution.

#### **Key Concepts:**

1. **Bayes' Theorem:**
   Bayes' theorem allows us to update our beliefs about a probability based on new evidence.
   $$P(\theta | x) = \frac{P(x | \theta) \cdot P(\theta)}{P(x)}$$
   where:
   - $P(\theta | x)$ is the posterior probability of $\theta$ given the data $x$.
   - $P(x | \theta)$ is the likelihood of the data given $\theta$.
   - $P(\theta)$ is the prior probability of $\theta$.
   - $P(x)$ is the marginal likelihood of the data.

2. **Setting Up the Problem:**
   - **Define $\theta$**: Random variable representing the probability of heads in a coin flip.
   - **Data Representation**: Collect data $x = (x_1, x_2, ..., x_{10})$, where each $x_i$ is a Bernoulli random variable (1 for heads, 0 for tails).

3. **Likelihood Calculation:**
   - For a single flip, the likelihood is $\theta^{x_i} \cdot (1 - \theta)^{1 - x_i}$. Multiplying over $n$ independent flips gives $\theta^k \cdot (1 - \theta)^{n - k}$, where $k$ is the number of heads and $n$ is the total number of flips.
   - Example: For 8 heads and 2 tails, the likelihood is $\theta^8 \cdot (1 - \theta)^2$.

4. **Choosing Priors:**
   - If no prior information is available, use a uniform prior: $P(\theta) = 1$ for $\theta$ in [0,1].
   - This prior is often represented as a Beta distribution with parameters $\alpha = 1$ and $\beta = 1$, i.e., Beta(1,1).

5. **Posterior Distribution:**
   - Combine the likelihood with the prior to get the posterior distribution:
   $$P(\theta | x) \propto \theta^k \cdot (1 - \theta)^{n - k} \cdot P(\theta)$$
   - For a uniform prior, the posterior distribution will be Beta($k+1$, $n-k+1$).

6. **Updating Beliefs with New Data:**
   - After the initial data (e.g., 10 flips), the posterior becomes the new prior for subsequent updates.
   - Example: After 10 flips (8 heads, 2 tails) with a Beta(1,1) prior, the posterior is Beta(9, 3). If 10 more flips yield 6 heads and 4 tails, update the prior to Beta(9,3) and calculate the new posterior as Beta(15,7).

7. **Maximum A Posteriori (MAP) Estimation:**
   - The MAP estimate is the mode of the posterior distribution.
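   - For a Beta distribution Beta($\alpha$, $\beta$) with $\alpha, \beta > 1$, the MAP estimate is:
   $$\frac{\alpha - 1}{\alpha + \beta - 2}$$
   - Example: For Beta(9, 3), MAP estimate is $\frac{9 - 1}{9 + 3 - 2} = \frac{8}{10} = 0.8$, which matches the frequentist estimate because the prior was uniform.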

8. **Frequentist vs. Bayesian Approach:**
   - Frequentist methods focus on the data alone and ignore prior distributions.
   - Bayesian methods incorporate prior beliefs and update them with new data.
   - With large amounts of data, Bayesian and frequentist results converge, but Bayesian methods are preferred when prior information is crucial or data is limited.
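
The worked example can be reproduced with two small helpers. This is a sketch of the conjugate Beta update and the mode formula from the steps above; the function names `beta_update` and `beta_map` and the grid-search cross-check are illustrative additions.

```python
def beta_update(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin flips under a Beta prior."""
    return alpha + heads, beta + tails


def beta_map(alpha, beta):
    """Mode of Beta(alpha, beta); valid when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


prior = (1, 1)                                   # uniform prior, Beta(1, 1)
first = beta_update(*prior, heads=8, tails=2)    # Beta(9, 3)
second = beta_update(*first, heads=6, tails=4)   # Beta(15, 7)

# Updating in two batches matches a single update on all 20 flips.
all_at_once = beta_update(*prior, heads=14, tails=6)

# Cross-check the closed-form mode against a grid search over the
# unnormalized posterior theta^(a-1) * (1-theta)^(b-1).
grid = [i / 1000 for i in range(1, 1000)]
a, b = first
grid_map = max(grid, key=lambda t: t ** (a - 1) * (1 - t) ** (b - 1))

print(first, beta_map(*first), second, beta_map(*second), grid_map)
```

The grid search ignores the normalizing constant entirely and still finds the same maximizer as the closed-form mode, which is why constant terms can be dropped when only the MAP estimate is needed.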

#### **Important Points:**

- **Beta Distribution**: Conjugate prior for Bernoulli likelihood; posterior is also a Beta distribution.
- **Constant Terms**: Often ignored in practical calculations, as they do not affect the shape or the MAP estimate.
- **Updating Priors**: Iteratively update priors with new data to refine beliefs.


### Relationship between MAP, MLE, and Regularization

1. **Maximum Likelihood Estimation (MLE)**: 
   - MLE aims to find the model parameters that maximize the probability of the observed data given the model. For example, if you have multiple models, you choose the one that best explains the data.

2. **Regularization**:
   - Regularization helps prevent overfitting by penalizing complex models. This is done by adding a term to the loss function that discourages large coefficients.

3. **Combining MLE and Regularization**:
   - When you incorporate regularization into MLE, you adjust the likelihood with a prior probability distribution over the model parameters. This can be seen as combining the likelihood of the data given the model with the probability of the model itself.
   - This combination leads to a loss function that includes both the fit of the model to the data (e.g., squared loss) and a regularization term (e.g., L2 norm of coefficients).

4. **Mathematical Derivation**:
   - To combine MLE with regularization, you transform the product of probabilities into a sum by taking logarithms. This converts the problem into one of minimizing a loss function that includes both the data fitting term (like squared loss) and a regularization term.
   - For example, if the model parameters are assumed to be drawn from a normal distribution centered at zero, the prior is the product of their Gaussian densities. Taking the logarithm turns this product into a sum of squared parameters, which is the L2 regularization term.

5. **Final Model Selection**:
   - The final model is selected by maximizing the posterior probability of the model given the data, which translates into minimizing the regularized loss function. This combines minimizing the sum of squared errors (fit) with the regularization term.
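
The equivalence can be checked numerically on a one-parameter model $y = w x$. The sketch below assumes Gaussian noise with standard deviation `SIGMA` and a zero-mean Gaussian prior on $w$ with standard deviation `TAU`; the data and both values are hypothetical. Under these assumptions the implied regularization strength is $\lambda = \sigma^2 / \tau^2$.

```python
# Hypothetical data for a one-parameter model y = w * x (no intercept).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.9, 6.1, 8.3]

SIGMA = 1.0                # assumed noise standard deviation
TAU = 0.5                  # assumed standard deviation of the Gaussian prior on w
LAM = SIGMA**2 / TAU**2    # L2 strength implied by these two assumptions


def log_posterior(w):
    """Unnormalized log posterior: Gaussian log-likelihood plus Gaussian log-prior."""
    log_lik = sum(-((y - w * x) ** 2) / (2 * SIGMA**2) for x, y in zip(xs, ys))
    log_prior = -(w**2) / (2 * TAU**2)
    return log_lik + log_prior


def regularized_loss(w):
    """Sum of squared errors plus the L2 penalty LAM * w^2."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + LAM * w**2


# Search a fine grid of candidate weights with both criteria.
grid = [i / 1000 for i in range(0, 4001)]
w_map = max(grid, key=log_posterior)
w_ridge = min(grid, key=regularized_loss)

# Closed-form minimizers for this one-parameter model.
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
w_closed_form = sum_xy / (sum_xx + LAM)   # regularized (MAP) solution
w_mle = sum_xy / sum_xx                   # unregularized (MLE) solution

print(w_map, w_ridge, w_closed_form, w_mle)
```

Maximizing the log posterior and minimizing the regularized loss pick the same grid point, and both sit below the unregularized MLE: the prior pulls the weight toward zero, which is exactly what the L2 penalty does.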
