<a href="https://colab.research.google.com/github/anytaaly/machine-learning/blob/main/Machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction and Probability Review
This module will introduce the main idea behind statistical learning for data science. You will learn to differentiate between model-driven and data-driven approaches to address complex problems, as well as supervised and unsupervised learning techniques. We will start by providing a crash course on probability tools that will be needed throughout the course, such as probability spaces, random variables, probability distributions, expectations, Bayes’ rule, and multivariate probability. Finally, we will introduce the programming language Python, which will be used to illustrate our theoretical advances throughout the course.

# Learning Objectives

*   Understand the difference between model-driven and data-driven approaches to address complex problems.
*   Understand the difference between supervised and unsupervised learning problems.
*   Remember the probability tools to study statistical learning problems in future modules.
*  Install and explore the Python programming language using notebooks.

**Stochastic modeling** is a mathematical approach to represent and predict systems or phenomena that involve an element of randomness or chance.



---

This course will not focus on programming or implementation, for which you can find excellent courses online, but on developing an understanding about why, how and when some of these learning techniques work from a statistical perspective.

Choosing the right learning technique to implement in your code must be based on an understanding of the tool you're implementing, instead of blindly trying different alternatives.

On the other hand, those making decisions from data must understand the scope and limitations of the techniques used to digest this data in order to make informed and valuable decisions.


Data science is about drawing useful conclusions from large and diverse data sets through three phases, which I describe here as exploration, prediction and inference.

# 1- Exploration Phase:
In the exploration phase, we try to identify patterns that are useful in our analysis using visualization and descriptive statistics.
Also known as Exploratory Data Analysis (EDA), it helps uncover insights, identify anomalies and outliers, formulate hypotheses, and select appropriate analytical methods before building models or making conclusions.


# 2- Prediction Phase:
In a second phase, which I call prediction, we use information to make informed guesses about values we wish we knew using machine learning and other tools.

# 3- Inference
In the third phase, called inference, we quantify the degree of certainty or uncertainty in our models. In other words, we try to answer the question: how accurate are our predictions? The main tools in this third phase are statistical tests and models.

In data science, inference refers to the process of using data analysis and statistical methods to draw conclusions about a larger population or system based on a sample of data.

Statistics is a central component in data science because it studies how to make robust conclusions based on incomplete information, and this will be the focus of this course.



---



# Data-Driven Learning
**1. Model-Based Approach (First-Principles Driven)**

Definition: This approach relies on theoretical models derived from established scientific principles (e.g., Newton’s laws of motion, Maxwell’s equations, Einstein’s relativity).

**Strengths:**

Provides explainability (why something happens).

Models often have predictive power in well-defined domains (physics, chemistry, engineering).

Require relatively less data since the governing equations are already known.

**Limitations:**

Breaks down in complex, chaotic, or poorly understood systems where exact governing laws are unknown or too difficult to model (e.g., climate change in detail, human behavior, stock markets).

**2. Data-Driven Approach (Empirical / Machine Learning)**

Definition: Instead of relying on known physical laws, this approach uses data itself to infer patterns, correlations, and predictive models.

Examples:

Stock market prediction (too many hidden variables).

Human psychology and behavior modeling.

Brain activity and neural processing.

Strengths:

Can uncover patterns in systems that are too complex or nonlinear for first-principle models.

Improves as more data becomes available (big data + machine learning).

Limitations:

Often lacks interpretability—models may be “black boxes.”

Requires large amounts of high-quality data.

May capture correlation, not causation.

These systems are so complex that we have no equivalent of Newton's laws for them.


![](https://drive.google.com/uc?export=view&id=1WXt-RLUhV3kjLWHZTtSgYC2u3qTyS1vX)




---

![](https://drive.google.com/uc?export=view&id=1gYRBIsnNuwrnGQ3khC2Ptkd39v4Df0PZ)




---



*Consider a collection of 1000 gray-scale images of dogs and cats. Each image has a resolution of 32 x 32 pixels, and each pixel has 256 possible gray levels. Your learning problem is to say what is the animal in an image. In this scenario, what is the number of features p and the number of data points N ?*

Looking at this machine learning scenario, let me break down the components:
Number of data points (N):
N = 1000
This is simply the total number of images in your collection.
Number of features (p):
Each image has 32 × 32 = 1,024 pixels, and if we treat each pixel as a feature, then:
p = 1,024
Each pixel can take one of 256 possible gray levels (typically 0-255), but the number of features is determined by the dimensionality of the input space, not the number of possible values each feature can take.
So in summary:

N = 1,000 (data points)
p = 1,024 (features)

This gives you a scenario where p ≈ N, which is interesting from a machine learning perspective since you're in a regime where the number of features is comparable to the number of training examples. This can present challenges like overfitting and may require techniques like regularization, dimensionality reduction, or data augmentation to achieve good generalization performance.
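The counting argument above can be sketched in a few lines of Python. This is only an illustration: the images are random stand-in data (an assumption, since no real dataset is given here), and the names `images` and `X` are hypothetical.

```python
import random

# Hypothetical stand-in data: 1000 random 32 x 32 gray-scale images,
# each pixel an integer gray level in 0..255.
random.seed(0)
images = [
    [[random.randint(0, 255) for _ in range(32)] for _ in range(32)]
    for _ in range(1000)
]

# Flatten each 32 x 32 grid into a single feature vector.
X = [[pixel for row in image for pixel in row] for image in images]

N = len(X)      # number of data points
p = len(X[0])   # number of features per data point
print(N, p)     # 1000 1024
```

Note that the 256 possible gray levels never enter the computation of p; they only determine which values each feature can take.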

# Topics Covered



1.   Introduction
2.   Statistical Learning
3.   Linear Regression
4.   Classification
5.   Resampling Methods
6.   Model Selection and Regularization
7.   Tree-Based Methods
8.   Support Vector Machines
9.   Unsupervised Learning




---



All the topics in the list above can be classified using two families of learning problems.

# Supervised Learning:
The first family is what's called **supervised learning**. In a supervised learning problem, we are given a training data set D. Inside this data set, we have a collection of N pairs. Each pair, $ x_i $  and $ y_i $ , represents two elements.


*   Given: a training dataset $ D = \{(x_1,y_1), \dots, (x_N,y_N)\} $, where
    *   $ x_i $ are p-dimensional inputs (aka features, regressors, covariates)
    *   $ y_i $ are outcome variables (aka response, target, dependent variables)

The first element, $ x_i $, is a p-dimensional input.
A p-dimensional input is nothing but a collection of features that we can use to classify the input. For example, if the input is an image, $ x_i $ can be the collection of its pixels. If it is a voice recording, $ x_i $ would be the voice signal or some features of that signal, such as the power spectral density.

**What is a feature in Machine Learning?**

A feature is an individual measurable property of your data that helps the model make predictions or decisions.

Think of features as the inputs (the “X” values) that describe each data point.

Features are what the algorithm “looks at” to learn patterns and make classifications or predictions.



**Example 1: Image Recognition**

Suppose you have a 32 × 32 grayscale image. Each pixel value (0–255 for brightness) is one feature.

Total features =
32 × 32 = 1024

So, when you train your model, every image is represented as a vector of 1024 features.
![](https://drive.google.com/uc?export=view&id=1tbbSKZ5A27nwtP5dinyBsVKOe7d7OQ-d)


**Example 2: Audio Classification**

You have a 2-second audio recording sampled at 100 Hz. That means 200 samples (like little points of sound intensity).

Each sample is a feature → total = 200 features.


**Example 3: Predicting House Prices**

If you’re predicting the price of a house, some features might be:

*   Size of the house (in square feet)
*   Number of bedrooms
*   Location (zip code)
*   Age of the house
*   Whether it has a garage


Here, each of these is a feature that describes the house.

**In summary**

Features = inputs that describe your data point.

Labels = outputs you want to predict (e.g., cat/dog, price, phoneme type).

The number of features = the length of the vector describing one data point.

## 🔹 What is a Vector?

At its simplest, a vector is just an ordered list of numbers. Each number represents a feature. The vector represents one data point in your dataset.

The variable $ y_i $ is what's called the outcome or the output, also called the response, target, or dependent variable. This $ y_i $ is typically either a value or a label that we want to assign to the input $ x_i $.
In practice, we want to find a function able to map inputs to outcome variables.
As I mentioned before, the outcome variable can be of two types. If $ y_i $ is a quantitative variable, we say that the problem is called a **regression problem**, think of quantitative variables such as salary. In contrast, the outcome can be a qualitative variable. In that case, the problem is called a **classification problem**.


---

Statistical learning refers to a vast set of tools for understanding data.
These tools can be classified as *supervised* or *unsupervised*.

Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.





---
# 🔹 Input vs. Output in ML


*   Each data point is described by input features -> denoted as $ x_i $
*   The output / label we want to predict is $ y_i $
*   The goal of ML is to learn a function

$$
f(x_i) \approx y_i
$$

so that given a new input, the model can predict its outcome.
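As a rough illustration of learning such a function, the sketch below fits a straight line $f(x) = a + b x$ by least squares. The data values and the choice of a linear model are assumptions made only for this example; many other choices of $f$ are possible.

```python
# Hypothetical training data: house size (sq ft) -> price (in $1000s).
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [200, 290, 410, 500, 600]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Closed-form least-squares slope b and intercept a for f(x) = a + b * x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
a = mean_y - b * mean_x


def f(x):
    """Predict the outcome for a new input x."""
    return a + b * x


print(f(1800))  # prediction for an input that is not in the training data
```

Because the outcome here is a number, this is a regression problem in the terminology used in this section.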


🔹 Two Main Types of Outcome Variables

**1. Regression Problems**

When $ y_i $ is quantitative (numeric, continuous).

Example:

* Predicting salary of an employee ($45,000, $67,000, etc.)

* Predicting house price ($120k, $300k, …)

* Predicting temperature tomorrow (23.5°C, 30.2°C, …)

The model outputs a real number.

**2. Classification Problems**

When  $ y_i $ is qualitative (categorical, discrete).

Example:

* Predicting if an email is spam or not spam (labels: 0, 1).

* Predicting whether an animal in an image is dog or cat (labels: dog, cat).

* Predicting the customer’s sentiment: positive, neutral, or negative.

The model outputs a class label.



---
🔹 What does p-dimensional mean?

It means that each data point is described by p numbers (features).

**🏠 Example: House Price Prediction**

Suppose each house has 3 features:

* Size (in sq ft)

* Number of bedrooms

* Age of the house

Then each house (one data point) can be written as:

$ 𝑥_𝑖 = [ 2000 , 3 ,10 ] $

Here, p=3.
So, the input space is 3-dimensional.


![](https://drive.google.com/uc?export=view&id=1e8ST6Et49YqgmsYUd4GyzNsdz4nItsIE)

---
# Unsupervised Learning:

Apart from supervised learning, we also have unsupervised learning problems. In this type of problem, we are given a collection of inputs alone. So instead of pairs, as we had before, we're only given the inputs $ \{x_1, \dots, x_N\} $.

Our problem is to find groups of variables that behave similarly.

For example, any clustering problem in which we are not given labeled data would be an unsupervised learning problem.


With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.

# Probability Theory:
Probability theory is the study of uncertainty. Throughout this class, we will be relying on concepts from probability theory for deriving machine learning algorithms. These notes attempt to cover the basics of probability theory at a level appropriate for CS 229.

The mathematical theory of probability is very sophisticated, and delves into a branch of analysis known as measure theory. In these notes, we provide a basic treatment of probability that does not address these finer details.

# Elements of Probability:
In order to define a probability on a set, we need a few basic elements, the building blocks of probability theory:

1.   **Sample Space: $ \Omega $**

The set of all the possible outcomes of a random experiment. Here, each outcome
$ \omega \in \Omega$  can be thought of as a complete description of the state of the real world at the end of the experiment.

Example: When rolling two dice: Each die has outcomes  {1,2,3,4,5,6}.

The sample space is all ordered pairs: S = {(i,j):i∈{1,…,6},j∈{1,…,6}}

So |S| = 36.

Each element ($ \omega \in \Omega$) is a complete description of what could happen in one trial of the experiment.


2.   **Events: $ A \subseteq \Omega $**

An event is a subset of the sample space.

Example (simple event): rolling (2,5) corresponds to the event A = {(2,5)}.

3.   **Event Space: $ \mathscr{F} $**

Since the sample space is finite (36 elements), the event space is the power set of S. That means:
$$ \mathscr{F} = \{A : A \subseteq S\} $$

Number of possible events: $2^{36}$ (a huge number, about 68.7 billion!).
Includes:
* $\emptyset$ (impossible event).
* $ S $ (certain event).
* Any subset of outcomes (like the event A above).
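The sizes quoted above can be checked with a short sketch. Enumerating all $2^{36}$ events is impractical, so the power set itself is listed only for a single coin flip, which is an example chosen here for illustration.

```python
from itertools import combinations

# Sample space for two dice: all ordered pairs (i, j).
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
print(len(S))        # 36 outcomes
print(2 ** len(S))   # 68719476736 events in the power set

# Power set of a much smaller sample space: one coin flip.
coin = ["H", "T"]
events = [set(c) for r in range(len(coin) + 1) for c in combinations(coin, r)]
print(len(events))   # 4 events: empty set, {H}, {T}, {H, T}
```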



![](https://drive.google.com/uc?export=view&id=1J-lRVheikuSGyFAAsD4TdLnLHdiPDs3G)


4.   **Probability measure:**

This is a function that assigns a probability to every event in $ \mathscr{F} $. It must satisfy the axioms of probability listed under "Properties of Probability".

* In finite or countable cases, all subsets of outcomes are events, and each gets a probability.

* In continuous cases, only measurable sets (those in the σ-algebra) are assigned probabilities, not every conceivable subset.

## Properties of Probability.

**4.1- Non-negativity:**

$ P(A) \geq 0 $ , $ \forall A \in \mathscr{F} $

→ Probabilities can’t be negative.

![](https://drive.google.com/uc?export=view&id=1f4RZJYYLoAVyh8GIGSdLOXQ37z_EB-DT)

**4.2- Normalization:**

$ P(\Omega)=1 $

→ The probability of the entire sample space (something in Ω must happen) = 1

**4.3- Additivity (for disjoint events):**

If $A_1$, $A_2$, ... are disjoint (they don't overlap), then:

$ Pr(A_1 \cup A_2) = Pr(A_1) + Pr(A_2) $

Or, more generally,

$$ Pr\left(\bigcup_i A_i\right) = \sum_i Pr(A_i) $$

**4.4- Probability of the intersection of two sets A and B:**
$$ Pr (A \cap B) \leq \min(Pr(A) , Pr(B)) $$

The probability of the intersection of two sets A and B is less than or equal to the minimum of the probability of A and the probability of B.

**5- Probability of the union of two sets A and B:**
$$ Pr (A \cup B) \leq Pr(A) + Pr(B) $$
The second property states that the probability of the union is less than or equal to the sum of the probability of A and the probability of B.

**6-:**
If $A_1,....,A_k $ are a partition of $\Omega $, then $ \sum_{i=1}^{k} Pr(A_i) = 1$

What does “partition of Ω” mean?

A partition of the sample space Ω means:

* The events $A_1, A_2, \dots, A_k $ are disjoint (no overlap, $ A_i \cap A_j = \varnothing $ for $ i \neq j $)

* Together they cover the whole sample space ($A_1 \cup A_2 \cup \dots \cup A_k = \Omega $)


![](https://drive.google.com/uc?export=view&id=1D5lvGMupl9lqrNKZRZFgASfGXuL5IsQF)


✅ In short:
If you break the entire sample space into mutually exclusive events (no overlaps) that cover all possibilities, then the probabilities of those events must add up to 1, because something in Ω must happen.
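A small sketch can confirm this for the two-dice sample space. The partition used here (grouping outcomes by the sum of the two dice) and the helper `prob` are choices made for this example, and all 36 outcomes are assumed equally likely.

```python
from fractions import Fraction

S = [(i, j) for i in range(1, 7) for j in range(1, 7)]


def prob(event):
    """Probability of an event, assuming all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))


# Partition of S: A_k = outcomes whose two dice sum to k, for k = 2..12.
partition = {k: [w for w in S if sum(w) == k] for k in range(2, 13)}

# Every outcome lands in exactly one A_k, so the sizes add up to |S|.
print(sum(len(A) for A in partition.values()))  # 36

# The probabilities of the pieces add up to 1.
total = sum(prob(A) for A in partition.values())
print(total)  # 1
```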

# Conditional Probability and independence:


Conditional probability measures the probability of an event occurring given that another event has already occurred or is known to be true.
The conditional probability of any event A given an event B is defined as,

$$ Pr (A|B) = \frac{Pr(A \cap B)}{Pr(B)} $$


* $A\cap B$ = “both A and B happen.”

* $Pr(A \cap B)$ = probability that both events occur.

* We divide by Pr(B) because we are restricting the world to situations where B has already occurred. This requires Pr(B) > 0.

In plain words, $ Pr(A|B) $ represents the probability of event A after observing the occurrence of event B.

Two events are called independent if and only if:
$$ Pr (A \cap B) = Pr(A) Pr(B)  $$
or equivalently (when Pr(B) > 0), $ Pr(A|B) = Pr(A) $.


✅ In simple words: Conditional probability zooms into a smaller “world” (where
B has happened) and asks: within that smaller world, how likely is A?

If A and B are independent, then

$ P(A∣B)=P(A) $

(knowing  B doesn’t change the probability of A).
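The definitions above can be tried out on the two-dice sample space. The events A, B and C below are picked for illustration only, and the outcomes are assumed equally likely.

```python
from fractions import Fraction

S = {(i, j) for i in range(1, 7) for j in range(1, 7)}


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))


def cond_prob(A, B):
    """Pr(A | B) = Pr(A and B) / Pr(B); requires Pr(B) > 0."""
    return prob(A & B) / prob(B)


A = {w for w in S if w[0] == 6}      # first die shows 6
B = {w for w in S if sum(w) == 7}    # the two dice sum to 7
C = {w for w in S if sum(w) >= 10}   # the two dice sum to at least 10

print(cond_prob(A, B), prob(A))                # 1/6 1/6 -> A and B are independent
print(prob(A & B) == prob(A) * prob(B))        # True
print(cond_prob(A, C), prob(A))                # 1/2 1/6 -> A and C are not independent
```

Knowing that the sum is 7 tells us nothing about the first die, while knowing that the sum is at least 10 makes a 6 on the first die much more likely.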

![](https://drive.google.com/uc?export=view&id=1mZcn-EeVF8adpSudKAJmtceYgfY3Ju9t)



# Random Variable

Despite the name, a random variable is not “random” and not exactly a “variable” in the usual sense.

👉 A random variable (RV) is a function that assigns a number to each outcome of a random experiment.

The experiment produces an outcome ω (an element of the sample space Ω).

The random variable maps ω to a real number.

Formally:

$$ X : \Omega \to \mathbb{R} $$

🔹 Why do we need Random Variables?
The sample space Ω can be complicated (words, colors, faces, sounds, …).
It’s easier to work with numbers.
So we define a random variable that converts outcomes into numerical values we can analyze.

![](https://drive.google.com/uc?export=view&id=14V6KEmVdMTfs_qN56XUeDxTS2AtfvcUP)



# 🔹 Types of Random Variables

**Discrete Random Variable**

* Takes values from a countable set.

Examples:

* Die roll = {1,2,3,4,5,6}
* Number of cars passing in 1 hour = {0,1,2,…}
* Coin toss indicator = {0,1}
* Number of emergency calls in an hour
* Number of bulbs that produce a flower

**Continuous Random Variable**

Takes values from an interval of real numbers.

Examples:

* Height of a person (e.g., 165.2 cm)
* Temperature tomorrow (e.g., 23.6 °C)
* Time until next bus arrives
* Weight of a Suitcase
* time taken to get to a fire
* Length of a tulip stem

Good Youtube: https://www.youtube.com/watch?v=lHCpYeFvTs0&t=81s


# Probability Distribution Functions (PDFs)

![](https://drive.google.com/uc?export=view&id=1nqfXvo-XKGTqljcTfb5WhWkQEy5kOXt5)

## Probability Mass Function

The Probability Mass Function (PMF) applies to discrete random variables, such as the die roll or the coin toss indicator listed above.


👉 It gives the probability that the random variable equals some value:
$$ p_X(x) = Pr( X=x)  $$

![](https://drive.google.com/uc?export=view&id=1bPEOaNCGUPLgBLeNaZ1YHD9yUwohUI1m)
The cumulative probability at 4 is the probability of rolling 4 or less.

One of the properties of the cumulative function is that the final bar must equal 1. For example, the probability of getting 6 or less is 100%, because you can't roll a 7 on a six-sided die.
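A minimal sketch of the PMF and the cumulative function for a fair six-sided die, using exact fractions (the variable names are illustrative):

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]

# PMF of a fair die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in values}

# Cumulative function: running sum of the PMF, Pr(X <= x).
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

print(cdf[4])  # 2/3, the probability of rolling 4 or less
print(cdf[6])  # 1, the final bar
```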


Reference video: https://www.youtube.com/watch?v=YXLVjCKVP7U&t=16s






🔹 How to Find a PMF (General Strategy)

Whenever you’re asked “What is the PMF of X?”:

1- Understand the random variable (RV).

* What does X represent?

* Is it discrete or continuous?

* What values can it take?

2- List the sample space (Ω).

* Write all possible outcomes of the experiment.

3- Map outcomes to values of X.

* For each outcome, compute the value of X.

4- Count frequencies or compute probabilities.

* For a discrete RV, the PMF is:

$ p(x) = P(X = x) = \frac{\text{number of outcomes with } X=x}{\text{total number of outcomes}} $ (when all outcomes are equally likely)

Check two things:

Non-negativity:
$ p(x)\geq 0 $

Normalization:
$ \sum p(x)=1 $

# Question 1.1: #


---
Consider the experiment of flipping three coins. What is the size of the sample space?

Each coin has 2 outcomes and there are 3 coins, so $|S| = 2^3 = 8$.

# Question 1.2: #


---
Consider the experiment of flipping three coins. Assuming the coins are all fair, what is the probability of observing exactly two heads and one tail (in any order)? Input your answer as an irreducible fraction p/q, where p and q are integer numbers. For example, if your answer is 0.6, enter 3/5.

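{(H,H,H), (H,H,T), (H,T,H), (T,H,H), (H,T,T), (T,H,T), (T,T,H), (T,T,T)}

Three of the 8 equally likely outcomes have exactly two heads, namely (H,H,T), (H,T,H) and (T,H,H), so the answer is 3/8.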



# Question 2.1:
---

Q: Consider the experiment of flipping three fair coins. Define the random variable H as the number of heads observed in the experiment. What is the PMF of H ?

**Step 1: What values can the random variable take?**

We write X for the number of heads (called H in the question).

Minimum = 0 (no heads at all).

Maximum = 3 (all heads).
So possible values:  $ X \in \{0,1,2,3\} $

**Step 2: Write sample space (Ω).**

For 3 coin flips:
Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
Total outcomes = 8

**Step 3: Map outcomes to X**

X=3: {HHH} → 1 outcome.

X=2: {HHT, HTH, THH} → 3 outcomes.

X=1: {HTT, THT, TTH} → 3 outcomes.

X=0: {TTT} → 1 outcome.

**Step 4: Compute probabilities.**

Each outcome has probability 1/8 (fair coin).

So:

P(X=0) = 1/8, P(X=1) = 3/8 , P(X=2) = 3/8, P(X=3) = 1/8

**Step 5: Check sum = 1.**
1/8 + 3/8 + 3/8 + 1/8 = 1
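The four steps of the general strategy can be sketched in Python by brute-force enumeration. This assumes fair coins, so that all 8 outcomes are equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# List the sample space: all sequences of three flips.
outcomes = list(product("HT", repeat=3))

# Map each outcome to the number of heads and count how often each value occurs.
counts = Counter(w.count("H") for w in outcomes)

# Divide by the total number of (equally likely) outcomes.
pmf = {h: Fraction(c, len(outcomes)) for h, c in sorted(counts.items())}

for h, p in pmf.items():
    print(h, p)

# Check normalization.
print(sum(pmf.values()))  # 1
```

The entry for 2 heads is 3/8, which matches the answer to Question 1.2.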


## Probability Density function (PDF)

Reference video: https://www.youtube.com/watch?v=oI3hZJqXJuc

Probability Distribution Function: https://www.youtube.com/watch?v=YXLVjCKVP7U&t=144s





![](https://drive.google.com/uc?export=view&id=1VZVhOYRiR9LxW4iBRk6ACxAvMOXsaUP5)


# Conditional Probability Density Function

$f_{Y|X} (y|x) $
is the conditional probability density function (PDF) of Y given that X = x.

It describes the distribution of Y once you fix the value of X.
So instead of asking "*how likely is Y to be near y in general?*", you are asking "*how likely is Y to be near y given that X = x?*"



### 2. Formal Definition

Conditional density is defined using the joint density $ f_{X,Y}(x,y) $ and the marginal density $ f_X(x)$:


$$ f_{Y|X} (y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} \quad \text{for } f_X(x) > 0 $$


# Expectation



---

The expected value (or expectation) of a probability distribution is the long-term average of the outcomes if the random experiment were repeated many times.

Expected value is the probability-weighted average: it multiplies each outcome by its chance and adds up the results. Think of it as the balance point of all possible outcomes. This method predicts long-term results from random events.

It tells you the average result over many trials.

In sports, it can predict win chances and score.

In finance, it balances risk with potential reward.


For a **discrete distribution**, it's calculated by summing the product of each possible value and its probability (Σ xP(x)).

Given a discrete random variable X with PMF $p_X$ and a function  $g:  \mathbb{R} \to \mathbb{R} $, the expectation (or expected value) of g(X) is defined as:

$$ \mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}}  g(x) \cdot  p_X(x)   $$

Example:
Think of the case where X is the throw of a die and the function g is the squaring function.
$$ X \to \text{Die}   $$
$$ g(x) = x^2   $$
So g is a function that maps real numbers to real numbers.
$$ \mathbb{E}[g(X)] = \mathbb{E}[X^2]  $$

Remember that the argument here is X, the outcome of rolling a fair 6-sided die,
and we know that $p_X(x)$ is equal to 1/6 for all possible outcomes of the die.
$$ = \sum_{x \in \mathcal{X}}  x^2  \cdot  1/6 $$

where $\mathcal{X}$ is the set of values that the die can take, which is nothing but the set 1, 2, up to 6.

$$  \mathcal{X} = \{ 1, 2, 3, \dots , 6\}  $$


* Random variable X = outcome of rolling a fair 6-sided die.

* Sample space: 𝜒={1,2,3,4,5,6}.

* PMF: p(x)=1/6 for each value (since fair die).

We want:
$$ \mathbb{E}[X^2] = \sum_{x=1}^{6}  x^2 \cdot  1/6  $$

$$ \mathbb{E}[X^2] =   1/6 (1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2)  $$

$$ \mathbb{E}[X^2] =  91/6  \approx 15.17 $$

✅ Interpretation: If you roll a die many, many times, and each time square the result, the average squared outcome will approach 15.17
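The same sum can be written as a small helper. The function name `expectation` and its signature are illustrative choices, not part of any library.

```python
from fractions import Fraction


def expectation(pmf, g=lambda x: x):
    """Expected value of g(X) for a discrete X given as a {value: probability} dict."""
    return sum(g(x) * p for x, p in pmf.items())


# PMF of a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}

e_x2 = expectation(die, lambda x: x ** 2)
print(e_x2)         # 91/6
print(float(e_x2))  # about 15.17
```

Calling `expectation(die)` with the default `g` returns the plain mean, 7/2.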


![](https://drive.google.com/uc?export=view&id=1gDFZR2S1-9HyTM29qO03SYRZxMIkdwtl)



## Expectations for continuous distribution:

For a continuous distribution, the expectation is found by integrating the product of each value and its probability density function (∫ xf(x)dx). Essentially, it's a weighted average of all possible values, where the weights are given by the density.

When X is continuous, probabilities are described using the probability density function (PDF), $ f_X(x)$

👉 Instead of a sum, we use an integral:

$$ \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, dx $$

* g(x) = some function of X.

* $ f_X(x)$ = PDF of X

* The integral accumulates the weighted contributions over all real values of x.

![](https://drive.google.com/uc?export=view&id=19L3ODIptv_RB29b4V_kIWp9beE8OUhZq)


In the continuous case, there are infinitely many possible values (like height, time, or temperature).

You can’t just “add” probabilities like in a discrete list, because each individual value has probability 0.

Instead, you use an integral (continuous sum) to “accumulate” probability mass across values.

👉 So the integral is the continuous analog of the sum.
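To make the "continuous sum" concrete, the sketch below approximates the integral with a midpoint Riemann sum. The example distribution (X uniform on [0, 1]) and the helper name `expectation_continuous` are assumptions made for illustration; a numerical integration library would normally be used instead.

```python
def expectation_continuous(g, pdf, a, b, n=100_000):
    """Approximate the integral of g(x) * pdf(x) over [a, b] with a midpoint Riemann sum."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * dx   # midpoint of the k-th slice
        total += g(x) * pdf(x)
    return total * dx


def uniform_pdf(x):
    """PDF of a uniform random variable on [0, 1] (assumed example)."""
    return 1.0


mean = expectation_continuous(lambda x: x, uniform_pdf, 0.0, 1.0)
second_moment = expectation_continuous(lambda x: x ** 2, uniform_pdf, 0.0, 1.0)
print(mean)           # close to 1/2
print(second_moment)  # close to 1/3
```

Each slice contributes g(x) times the density times the slice width, which is exactly the discrete formula with the PMF replaced by f(x) dx.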



---


# Question
What is the expected value of rolling a die if you earn \$1 for an odd number and \$2 for an even number?

The expected value is calculated as
$$ E[X] = \left( 1 \cdot \frac{3}{6} \right) + \left( 2 \cdot \frac{3}{6} \right) = 1.5 $$

So, on average, you earn \$1.50 per roll.



---


# Question
Imagine playing a game where you win \$5 with a probability of $ \frac{1}{2} $ and lose \$3 with a probability of $\frac{1}{2}$. What is the expected value of the game?

The expected value is calculated as
$$ E[X] = \left( 5 \cdot \frac{1}{2} \right) - \left( 3 \cdot \frac{1}{2} \right) = 1 $$

So, on average, you win \$1 per game.



---


# Question
In a lottery, you have a probability of $\frac{1}{1000} $ to win \$1000, otherwise you lose \$1. What is the expected value of playing the lottery?

The expected value is calculated as
$$ E[X] = \left( 1000 \cdot \frac{1}{1000} \right) - \left( 1 \cdot \left( 1 - \frac{1}{1000} \right) \right) = 1 - 0.999 = 0.001 $$

So, on average, you win \$0.001 per game.
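The three questions above follow the same pattern, which can be sketched with one helper. The function `expected_value` and its list-of-pairs input format are illustrative choices; losses are entered as negative amounts.

```python
from fractions import Fraction


def expected_value(payoffs):
    """Expected value of a game given as (amount, probability) pairs."""
    assert sum(p for _, p in payoffs) == 1, "probabilities must sum to 1"
    return sum(amount * p for amount, p in payoffs)


die_game = expected_value([(1, Fraction(3, 6)), (2, Fraction(3, 6))])
coin_game = expected_value([(5, Fraction(1, 2)), (-3, Fraction(1, 2))])
lottery = expected_value([(1000, Fraction(1, 1000)), (-1, Fraction(999, 1000))])

print(die_game)   # 3/2
print(coin_game)  # 1
print(lottery)    # 1/1000
```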

# Properties of Expectation



---


**Property 1: Constants and Scaling**

$ E[a] = a$ , $ E[ag(X)] = aE[g(X)] $

What it means: If you take a constant (say a = 5), its expectation is just the constant itself:

$$ E[5] = 5 $$

because it is not random.

If you scale a random variable by a constant a:

$$  E[ag(X)] = a E[g(X)] $$

This is called the linearity of expectation with respect to scaling.

**Example:**
Let X = outcome of a fair die, so E[X] = 3.5 (1 x 1/6 + 2 x 1/6 + 3 x 1/6 + 4 x 1/6 + 5 x 1/6 + 6 x 1/6).

Then E[2X] = 2 · E[X] = 7.


**Property 2: Linearity of Expectation (Addition Rule)**

$$ E[f(X) + g(X)] = E[f(X)] + E[g(X)] $$

What it means: The expectation of a sum is the sum of expectations. This is always true (no independence required).

Example: Let X = die roll, Y = coin flip ( 1=Head, 0=Tail).



*   E[X] = 3.5, E[Y] = 0.5
*   Then E[X + Y] = E[X] + E[Y] = 3.5 + 0.5 = 4

Even though a coin and a die are totally different experiments, expectation "adds up" beautifully.
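The die-and-coin example can be verified by enumerating the joint outcomes. For the enumeration only, the sketch assumes the die and the coin are fair and independent, so that all 12 pairs are equally likely; linearity itself does not need independence.

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)   # X: die roll
coin = (0, 1)       # Y: 1 = Head, 0 = Tail

# All 12 (x, y) pairs, assumed equally likely.
joint = list(product(die, coin))
p = Fraction(1, len(joint))

e_x = sum(x * p for x, _ in joint)
e_y = sum(y * p for _, y in joint)
e_sum = sum((x + y) * p for x, y in joint)

print(e_x, e_y, e_sum)       # 7/2 1/2 4
print(e_sum == e_x + e_y)    # True
```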



# Variance

Variance is a measure of how spread out the values of a random variable are around its mean.

If the variance is small, values of the random variable cluster closely around the mean.

If the variance is large, values are more spread out.

Formally, for a random variable X:

$$ Var(X) = E[(X - E[X])^2] $$

## Intuition
*  The mean E[X] tells us the average outcome.
*  Variance tells us how far, on average, outcomes deviate from that average.
*  Squaring (X - E[X]):
    *  Makes deviations positive
    *  Penalizes larger deviations more heavily

So variance captures the average squared distance from the mean.

## Alternative formula

There's a useful shortcut:

$$ Var(X) = E[X^2] - (E[X])^2 $$

If Var(X) is given, you can rearrange this equation to calculate the other quantities, for example $E[X^2] = Var(X) + (E[X])^2$.

General Identity:
Applying this rearranged shortcut to a random variable of the form $a_i - b_i$ gives a very useful identity:

$$ E[(a_i - b_i)^2] = Var(a_i - b_i) + (E[a_i - b_i])^2$$

Or



![](https://drive.google.com/uc?export=view&id=1C9z-KeSFHTW9yrZSOC4EUgvDeSre_gJt)


---
# Question

Question 1: Define a random variable (r.v.) that assigns a value 1 if the flip of a coin comes out Heads, and 0 if it comes out Tails. Input the mean and variance for the r.v. defined as the sum of two fair coins flipped simultaneously (Heads = 1 and Tails = 0).


Let $X_1$ = 1st Flip

and $X_2$ = 2nd Flip

where X₁ and X₂ are independent Bernoulli(1/2) random variables. Their sum is $ Y = X_1 + X_2 $, with $ Y \in \{0,1,2\} $.

**Step # 4: Find Mean of the Sum**
Using the linearity of expectation: $ E[Y] = E[X_1 + X_2] $
$ = E[X_1] + E[X_2] $
$ = 1/2 + 1/2 = 1 $

**Step # 5: Find the Variance of the Sum**

Since X₁ and X₂ are independent, we can use:
Var(Y) = Var(X₁ + X₂) = Var(X₁) + Var(X₂) = 1/4 + 1/4 = 1/2


**Key Concepts Used:**

* *Linearity of Expectation:* E[X + Y] = E[X] + E[Y] (always true)
* *Independence for Variance:* If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y)
* *Bernoulli Distribution:* A single trial with two outcomes (success/failure)

**Alternative Verification:**
You can verify this by listing all possible outcomes:

* (T,T): Y = 0, P = 1/4
* (T,H) or (H,T): Y = 1, P = 1/2
* (H,H): Y = 2, P = 1/4

E[Y] = 0×(1/4) + 1×(1/2) + 2×(1/4) = 1 ✓

Var(Y) = E[Y²] - (E[Y])² = (0²×1/4 + 1²×1/2 + 2²×1/4) - 1² = 1.5 - 1 = 0.5 ✓

---
# Question

Input the mean and variance (separated by comma and space) of the random variable associated with a fair die with 6 faces (answer non-integer numbers as irreducible fractions).

Step # 1:

Let X = outcome of a fair die roll
* Possible values : {1,2,3,4,5,6}
* Each equally likely P(X = x) = 1/6

Step # 2:

$$ E[X] = \sum^{6}_{x=1} x \cdot P(X=x) $$
$$ = (1+2+3+4+5+6) \cdot 1/6 $$
$$ E[X] = 21/6 = 7/2 $$

Step # 3: Formula for Variance:

$$ Var(X) = E[X^2] - (E[X])^2 $$

First Compute $E[X^2]$ :

$$ E[X^2] = \sum^{6}_{x=1} x^2 \cdot P(X=x) = \frac{1}{6} \sum^{6}_{x=1} x^2 $$


$$ \sum^{6}_{x=1} x^2 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2 = 91 $$

$$ E[X^2] = 91 / 6 $$

Now Variance:

$$ Var(X) = 91/6  - (7/2)^2 $$

$$ = 91/6 - 49/4 = 35/12 $$
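Both variance questions above can be checked with the shortcut formula $Var(X) = E[X^2] - (E[X])^2$. The helper `mean_and_variance` is an illustrative name; exact fractions are used so the results match the hand calculations.

```python
from fractions import Fraction


def mean_and_variance(pmf):
    """Mean and variance of a discrete random variable given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x ** 2 * p for x, p in pmf.items())
    return mean, second_moment - mean ** 2


# Fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}

# Sum of two fair coins (Heads = 1, Tails = 0).
two_coins = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

print(*mean_and_variance(die))        # 7/2 35/12
print(*mean_and_variance(two_coins))  # 1 1/2
```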


# Probability Distribution:



*   Normal Distribution
*   Uniform Distribution
*   Binomial Distribution
*   Poisson Distribution
*   Triangular Distribution
*   Weibull Distribution
*   Bernoulli Distribution


Reference: https://www.youtube.com/watch?v=3VylC_mIAjE


## 1- T-Distribution

![](https://drive.google.com/uc?export=view&id=1mnaZ51NrxCJjnDlt83TsD4taFMEZvj3d)


## 3- Bernoulli Distribution - Common Discrete R.V

A special case of the Binomial Distribution with a single trial. Instead of considering all the outcomes across the x-axis, we consider just 2 outcomes: True or False (Yes or No).

For example, "Did we roll a six?" is a Yes-or-No question, so it falls under the Bernoulli Distribution.
If we repeated this enough times, we should roll a six about 1 out of every 6 times.


![](https://drive.google.com/uc?export=view&id=1A3Cw_3usXyDKJQ54DZJm-sdpIHBIA2LH)


![](https://drive.google.com/uc?export=view&id=1TaPHzwFIWbHBg-CPwb3Ed2eahSEwGeOn)


**Khan Academy Resource**
Reference Video: https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library/binomial-mean-standard-dev-formulas/v/mean-and-variance-of-bernoulli-distribution-example


**Expectations and Variance of a Bernoulli Distribution**

mean:
$$ \mathbb{E}[X] = 1 \cdot p + 0 \cdot (1 - p) = p $$
variance:
$$ Var(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1-p) $$

![](https://drive.google.com/uc?export=view&id=1zFf0ULftcpaqwqVfg6lc-UGeVHuE4dG1)


## 2- Binomial Distribution - Common Discrete R.V

![](https://drive.google.com/uc?export=view&id=1CAU1PyFfHfFnt9xIvwBbo4yPxMuVKgug)


🔹 What the Graph Shows

This is a Binomial Distribution with:

Number of trials  𝑛 = 10 (flipping 10 coins).

Probability of success (Heads) : p=0.5.

Random variable  X = number of Heads observed.

The x-axis = number of heads (0 through 10).
The y-axis = probability of that outcome.


🔹**How to Interpret the Bars**

*Example*: The bar at X=5 shows ~24.6%.

→ If you flip 10 coins, the most likely outcome is 5 heads (about 25% of the time).

*Example*: The bar at X=0 shows 0.1%.
→ This does not mean “0 heads OR 0 tails” in a single flip.

→ It means: out of 10 flips, the chance of getting all Tails (0 Heads) is 0.1%.

Similarly, the bar at  X=10 = 0.1%.
→ The chance of getting all Heads in 10 flips is also 0.1%.


**Binomial Distribution Basics:**

* X ~ Bin(n,p) means X follows a binomial distribution
* n = number of independent trials
* p = probability of success on each trial
* X counts the total number of successes in n trials

**What is Variance:**
Variance measures how spread out the values are from the mean. For any random variable:

* Variance = E[(X - μ)²] where μ is the mean
* It tells us about the "scatter" or variability of the distribution

**Key Properties of Binomial Distribution:**

* Each trial has only two outcomes: success (probability p) or failure (probability 1-p).
* Trials are independent.
* The probability p remains constant across all trials.

**Conceptual Thinking for Binomial Variance:**

* If p is close to 0 or 1: Most outcomes will be very predictable (low variance)
* If p is around 0.5: Maximum uncertainty in each trial (higher variance)
* More trials (larger n): Generally increases the total variability
* The variance depends on both the number of trials AND the uncertainty in each trial

**Mathematical Foundation:**
Since a binomial random variable is the sum of n independent Bernoulli trials, and variances of independent random variables add up, the binomial variance is:

* The variance of a single Bernoulli trial, $p(1-p)$
* Multiplied by the number of trials, $n$

$$ Var(X) = np(1-p) $$

The probability of observing exactly h successes in n trials is:

$$ p_X(h) = \binom{n}{h} p^h (1-p)^{n-h} $$
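To connect the binomial formula with the bar chart discussed earlier (n = 10 coin flips, p = 0.5), here is a minimal sketch that evaluates the probability mass function with the standard library. The helper name `binomial_pmf` is made up for this note.

```python
from math import comb

def binomial_pmf(h, n, p):
    """Probability of exactly h successes in n independent trials."""
    return comb(n, h) * p**h * (1 - p)**(n - h)

n, p = 10, 0.5
pmf = [binomial_pmf(h, n, p) for h in range(n + 1)]

# Mean and variance computed directly from the pmf
mean = sum(h * q for h, q in enumerate(pmf))
variance = sum((h - mean) ** 2 * q for h, q in enumerate(pmf))

print("P(X=5) =", pmf[5])    # the tallest bar
print("P(X=0) =", pmf[0])    # all tails
print("mean =", mean, " variance =", variance)
```

The variance computed from the pmf can be compared with n times the single-trial Bernoulli variance p(1-p).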

## 4-  Uniform Distribution - Common Continuous Random Variable (R.V)

What is a Uniform Distribution?

* A uniform distribution is a probability distribution where every outcome in a given interval is equally likely.

* If a random variable X follows a uniform distribution between a and b, we write:
$$ X \sim Unif(a,b) $$


![](https://drive.google.com/uc?export=view&id=1Ipt9jMwGOKwsjSILAf7RHDtWGDsWQCyI)

Mean and variance of the uniform distribution:

Mean:
$$ E[X] = \frac{a+b}{2} $$

Variance:
$$ Var(X) = \frac{(b-a)^2}{12} $$


# Empirical Mean
![](https://drive.google.com/uc?export=view&id=1HWGSm3jRZKUgo1-W5fJC7BGqU6rxicf4)




---



# Question
 Input the mean and variance of a random real number chosen uniformly at random from the range [-2, 2].

 ## Answer
For $ X \sim Uniform(a,b) $:

The mean $\mu$ of a uniform distribution is calculated using the formula

$$ E[X] = \frac{a+b}{2} $$

The variance $\sigma^2$ of a uniform distribution is calculated using the formula

$$ Var(X) = \frac{(b - a)^2}{12}$$

With $a = -2$ and $b = 2$, the mean is $ E[X] = \frac{-2 + 2}{2} = 0/2 = 0 $

and Variance is

$$ Var(X) = \frac{(2-(-2))^2}{12} = \frac{4^2}{12} = \frac{16}{12}= \frac{4}{3}  $$
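The answer can be sanity-checked numerically. This sketch evaluates the two formulas for the range [-2, 2] and compares them with a large simulated sample; the seed and sample size are arbitrary assumptions.

```python
import numpy as np

a, b = -2.0, 2.0

# Closed-form mean and variance of Uniform(a, b)
mean_formula = (a + b) / 2
var_formula = (b - a) ** 2 / 12

# Simulated check with a fixed seed
rng = np.random.default_rng(0)
samples = rng.uniform(a, b, size=500_000)

print("mean:", mean_formula, "vs simulated", samples.mean())
print("var: ", var_formula, "vs simulated", samples.var())
```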


![](https://drive.google.com/uc?export=view&id=1hNj8wS1hrcEKlFkLiXgpuk5Mhollwok8)

# Question

Consider N independent random variables $X_1, X_2, \ldots, X_N$ with identical means and variances equal to $\mu$ and $\sigma^2$, respectively.

What is the mean of the empirical mean, defined as $\bar{X} = \frac{1}{N} \sum^{N}_{i=1} X_i$?
 ## Answer
 We have N independent and identically distributed random variables:

 $$ X_1, X_2, ...., X_N $$

 with

 $$ E[X_i] = \mu $$

 $$ Var(X_i) = \sigma^2 $$



The question is: what is the mean (expectation) of the empirical mean? The empirical mean is defined as
$$ \bar{X} = \frac{1}{N} \sum^{N}_{i=1} X_i $$

so we want $ E[\bar{X}] $.

By linearity of expectation:

$$ E[aX + bY] = aE[X] + bE[Y]  $$

Applying this to $\bar{X}$, the constant $\frac{1}{N}$ comes out of the expectation, and the expectation of the sum splits into a sum of expectations:

$$ E[\bar{X}] = E[ \frac{1}{N} \sum^{N}_{i=1} X_i] $$
$$ E[\bar{X}] = \frac{1}{N} E[ \sum^{N}_{i=1} X_i] $$
$$ E[\bar{X}] = \frac{1}{N} \sum^{N}_{i=1} E[X_i] $$

since each $E[X_i] = \mu $
$$ E[\bar{X}] = \frac{1}{N} N\mu $$

N cancels out and we are left with
$$ E[\bar{X}] = \mu $$

Whenever you see a question like this:

1.   Recognize the structure: empirical mean = sum of r.v.s divided by N.
2.   Apply linearity of expectation: you can pull constants out and split sums.
3.   Substitute known expectations: if each 𝐸[𝑋𝑖]=𝜇, then the average also has expectation 𝜇

👉 This is why the sample mean is an unbiased estimator of the population mean.

Population mean (𝜇) = the true mean of the entire distribution (theoretical).

Empirical mean (𝑋ˉ) = the mean of your finite observed sample.
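To see the unbiasedness claim in action, the sketch below repeats the experiment many times: draw N values, compute their empirical mean, and then average those empirical means. The normal distribution and the values of μ, σ, and N are assumptions chosen for illustration; the derivation above only needs a common mean μ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 3.0, 2.0, 25      # hypothetical population mean, std, sample size
n_experiments = 100_000

# Each row is one experiment with N independent draws
samples = rng.normal(mu, sigma, size=(n_experiments, N))

# Empirical mean of each experiment
xbar = samples.mean(axis=1)

# Averaging the empirical means approximates E[X-bar]
avg_of_xbar = xbar.mean()
print("average of empirical means:", avg_of_xbar, " mu:", mu)
```

Any single empirical mean differs from μ, but their average settles very close to μ.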


# Statistical Learning


Read Sections 2.1 and 2.2 in Chapter 2 from the book Introduction to Statistical Learning

This module introduces the standard theoretical framework used to analyze statistical learning problems. We start by covering the concept of regression function and the need for parametric models to estimate it due to the curse of dimensionality. We continue by presenting tools to assess the quality of a parametric model and discuss the Bias-Variance tradeoff as a theoretical framework to understand overfitting and optimal model flexibility. Finally, we will continue our introduction to the Python programming language.

**Learning Objectives**

* Understand the theoretical framework to analyze statistical learning problems and the regression function.
* Understand the curse of dimensionality.
* Understand the parametric models and the bias-variance tradeoff.
* Evaluate model quality and optimal model flexibility.
* Create programs in Python using Jupyter notebooks.

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.
In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X1 might be the TV budget, X2 the radio budget, and X3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable—in this case, sales—is often called the response or dependent variable, and is typically denoted using the symbol Y . Throughout this book, we will use all of these terms interchangeably.

More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, . . . , Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

$ Y =f(X)+ε $

Here f is some fixed but unknown function of  𝑋1,...,𝑋𝑝  , and ε is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y .

We assume that X is a vector of random variables and Y is a scalar random variable.


In essence, statistical learning refers to a set of approaches for estimating f . In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.


**Prediction**
In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

$ \hat{Y} = \hat{f}(X) $

where $ \hat{f} $ represents our estimate for f, and $ \hat{Y} $ represents the resulting prediction for Y. In this setting, $ \hat{f} $ is often treated as a black box, in the sense that one is not typically concerned with the exact form of $ \hat{f} $, provided that it yields accurate predictions for Y.

**🔹 1. The idea of 𝑓**

In supervised learning, we assume there is some true relationship between inputs 𝑋 and output 𝑌
$ 𝑌 = 𝑓(𝑋) + 𝜀 $

𝑓 = the true (but unknown) function that links  𝑋 and 𝑌.

𝜀 = random error (noise), usually assumed to average to 0.

👉 Example: If 𝑋 = house size and 𝑌 = price, then

𝑓( 𝑋 ) is the true “law” of how size affects price.


**🔹 2. The role of $ \hat{f}(𝑋) $**

Since we don’t know the true 𝑓, we estimate it from data. That estimate is written as:

$ \hat{f}(𝑋) $  
* The hat symbol means “estimate of.”

* $ \hat{f}$ could come from linear regression, decision trees, neural networks, etc.

👉 Example: If you fit a line through data points (linear regression), that fitted line is your $ \hat{f} $.

**🔹 3. Using $ \hat{f}$ to make predictions**

Once we have  $ \hat{f}$ , we can predict $ \hat{Y}$.

$ \hat{Y}$ = the predicted outcome (not the true Y, just our model’s guess).

In practice, $ \hat{f}$  is often treated as a black box — we may not know or care about its exact formula, as long as it predicts well.

**🔹 4. Prediction setup**

We want to predict Y from inputs X.

True relationship:
$$ Y = f(X) + \varepsilon $$
where $ \varepsilon $ = random noise (can't be predicted).

Our model uses an estimate $ \hat{f} $:
$$ \hat{Y} = \hat{f}(X) $$
So the prediction error is:

$$ Y - \hat{Y} = f(X) + \varepsilon - \hat{f}(X) $$


**🔹 5. Two sources of error**

The text introduces two categories of error:

**Reducible error**

* This comes from the fact that our estimate $ \hat{f}$ is not exactly the true f.
* If we choose better algorithms, more data, or tune parameters, we can reduce this gap.

Formally:
$$ [f(X) - \hat{f}(X)]^2$$

**Irreducible error**

This comes from the noise term $ \varepsilon $

Even if we knew the true f exactly, we cannot predict ε because it represents random variation (e.g., unmeasured factors, randomness in the world).

Formally: $ Var(\varepsilon)$  

**🔹 6. Why irreducible error > 0?**

There are always factors affecting Y that are not in X.

**Example**: A patient’s reaction to a drug might vary due to stress, sleep, manufacturing variation in the pill, etc. These things are unpredictable with the data we have.

So no matter how good your model is, there will always be some error left.

**🔹 7. What are we trying to measure?**

When we build a prediction model, we want to measure:

How far off are our predictions $\hat{Y}$ from the true values Y?

A natural measure is the squared prediction error:

$$ (Y - \hat{Y})^2 $$

Squaring is useful because:
* it makes errors positive (no cancellation of under/over predictions).
* it penalizes large errors more heavily.


**🔹 8. Why expectation?**

One prediction error $ (Y - \hat{Y})^2 $  depends on one data point. But we want a general measure of average error across possible data. So we take the expected value (the average over the data distribution):

$$ E [(Y - \hat{Y} )^2 ] $$

This is called the expected prediction error or mean squared error (MSE)

**🔹 9. Substituting the data-generating process**

We know from the model:

$$ Y = f(X) + \varepsilon $$

and our prediction is:

$$ \hat{Y} = \hat{f}(X)  $$

So the error becomes:

$$ Y - \hat{Y} = f(X) + \varepsilon -  \hat{f}(X)  $$

Squaring:
$$ ( Y - \hat{Y}) ^2  = ( f(X) + \varepsilon -  \hat{f}(X) ) ^2  $$

**🔹 10. Taking expectation**

Treat $X$ and $\hat{f}$ as fixed, so the only randomness comes from $\varepsilon$. Expanding the square:
$$ E [( Y - \hat{Y}) ^2]  = E[( f(X) + \varepsilon -  \hat{f}(X) ) ^2 ] $$
$$ = E[ (f(X) -  \hat{f}(X) )^2 + 2(f(X) -  \hat{f}(X)) \varepsilon + \varepsilon^2 ] $$
$$ = (f(X) -  \hat{f}(X) )^2 + 2(f(X) -  \hat{f}(X))E[\varepsilon] + E[\varepsilon^2]  $$

* The middle term vanishes because $ E[\varepsilon] = 0 $. This is an assumption in most regression/statistical models: the noise has zero mean (no bias, just random fluctuation) and some finite variance.
* The last term is $ E[\varepsilon^2] = Var(\varepsilon) $, again because $ E[\varepsilon] = 0 $.

$$ E [( Y - \hat{Y}) ^2]  = (f(X) -  \hat{f}(X) )^2  + Var(\varepsilon)  $$
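The split into reducible and irreducible error can be checked with a small simulation. Everything in this sketch is an assumption for illustration: a made-up true function f(x) = 2x, a deliberately imperfect estimate f̂(x) = 1.5x, a fixed input x = 2, and standard normal noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2.0 * x        # hypothetical true function

def f_hat(x):
    return 1.5 * x        # hypothetical (imperfect) estimate

x = 2.0                   # X is held fixed
sigma = 1.0               # standard deviation of the noise

eps = rng.normal(0.0, sigma, size=500_000)
y = f(x) + eps            # many noisy observations at the same x
y_hat = f_hat(x)          # the prediction does not depend on the noise

mse = np.mean((y - y_hat) ** 2)          # simulated E[(Y - Y_hat)^2]
reducible = (f(x) - f_hat(x)) ** 2       # (f(x) - f_hat(x))^2
irreducible = sigma ** 2                 # Var(eps)

print("simulated MSE:", mse)
print("reducible + irreducible:", reducible + irreducible)
```

Improving `f_hat` shrinks only the reducible part; the noise variance stays no matter how good the estimate is.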

![](https://drive.google.com/uc?export=view&id=1ckPYFbGfuZD3t_l40E_QxE-bYwLLKZpV)

![](https://drive.google.com/uc?export=view&id=1pyCYalpeNV6Tl2AkuDnHaCvblGmc8vbx)

# Regression function

Using tools from statistics, one can prove that the regression function is equal to this conditional expectation,

$$ f(x) = \mathbb{E} [Y | X = x] $$

essentially the expectation of Y given X.

If we have access to the conditional pdf $f_{Y|X}$ of Y given X, we can compute this expectation $ f(x) = \mathbb{E}[Y | X = x] $ using the explicit formula introduced before: integrate y times $f_{Y|X}$ over all values of y.

Notice that the result of the integral is a function of $ \textbf{x}$ alone.

$$ f(x) = \mathbb{E} [Y|X = x] = \int_{-\infty}^{+\infty} y \; f_{Y|X}(y|\textbf{x}) \, dy $$



## 🔹 1. Conditional expectation definition
The regression function is defined as:

$$ f(x) = E[Y | X = x] $$

This means:

For a given input value 𝑥, the “best guess” of 𝑌 is the conditional expectation of 𝑌 given 𝑋=𝑥

In words: if you know the input
𝑋=𝑥, what’s the average value of 𝑌 you should expect?

👉 Example: If 𝑋 = study hours and 𝑌 = exam score, then  𝑓(5) means: “the expected exam score if someone studies 5 hours.”

## 🔹 2. Expressed with conditional density
If the conditional density of Y given X = x is $f_{Y|X} (y|x) $, then:
$$ f(x) = \mathbb{E} [Y|X = x] = \int_{-\infty}^{+\infty} y \; f_{Y|X}(y|\textbf{x}) \, dy $$

Here $ f_{Y|X}(y|\textbf{x}) $ is the conditional probability density function of Y given X = $\textbf{x} $. The integral just says:
* Take all possible values of Y.
* Weight them by how likely they are (the conditional density).
* That gives the average = the regression function.

## 🔹 3. Why it depends only on x

Notice that once you integrate over all possible y, the result is just a function of the input x.

That's why f(x) is called the regression function of Y on X.

## 🔹 4. Optimization characterization

The regression function can also be viewed as the solution to an optimization problem:

$$ f(\cdot) = \underset{g(\cdot)}{\mathrm{argmin}} \; \mathbb{E} [(Y - g(X))^2] $$

This says:

* Imagine trying to approximate Y with some function of X, call it g(X).
* Measure how good your guess is using the mean squared error (MSE), $ \mathbb{E} [(Y - g(X))^2] $.

Here, one seeks the function g that minimizes the expected squared error between the actual value Y and the prediction g(X).

Key Insights:
* The regression function depends only on x and not directly on y.
* Although the probability density function $ f_{Y|X} $ is usually unknown in practice, this framework justifies why learning algorithms try to estimate the conditional expectation.
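The optimization view can be illustrated at a single input value. In the sketch below we fix x, simulate many values of Y given X = x from an assumed model Y = 2X + ε, and try a grid of constant guesses c for g(x). The model, the grid, and the seed are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = 1.5
# Samples of Y given X = x under the assumed model Y = 2X + eps, eps ~ N(0, 1)
y = 2.0 * x + rng.normal(0.0, 1.0, size=200_000)

# Candidate guesses for g(x), spaced 0.05 apart
candidates = np.linspace(0.0, 6.0, 121)

# Mean squared error of each candidate guess
mse = np.array([np.mean((y - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

print("best guess on the grid:", best)
print("sample conditional mean:", y.mean())
```

The guess with the smallest MSE should sit next to the sample mean of Y at that x, which is what the argmin characterization predicts.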


**In Summary:**

The regression function f(x) is the conditional mean of Y given X = x. It can be computed if the conditional density $ f_{Y|X}$ is known, and it minimizes the mean squared error among all possible functions used to predict Y from X.
* The regression function is not necessarily a line or polynomial; it's just the conditional expectation E[Y|X = x].
* All regression methods (linear, logistic, neural nets, etc.) are ways of computing an estimate $\hat{f}(x)$, since the true f(x) is usually unknown.


# Question
Let's work through a concrete example step by step, so you can see how the regression function is computed. We compute $ \mathbb{E}[Y|X=x]$ for a simple case:

$$ Y = 2X + \varepsilon $$

Where
* X is the input,
* $ \varepsilon \sim N(0,1) $ (noise, normally distributed with mean 0 and variance 1), independent of X.

By definition: $ f(x) = \mathbb{E}[Y | X = x] $

Substitute the model for Y:
$$ f(x) = \mathbb{E} [2X + \varepsilon | X = x] $$
Using linearity of expectation:
$$ f(x) = \mathbb{E}[2X | X = x] + \mathbb{E}[\varepsilon | X =x]  $$

First term: $ \mathbb{E}[2X | X = x] = 2x $. Once we condition on X = x, 2x isn't random anymore, it's just a number, and the expectation of a constant is the constant itself.


Second term: Since $ \varepsilon $ is independent of X and has mean 0, $ \mathbb{E}[\varepsilon | X = x] = \mathbb{E}[\varepsilon] = 0 $. Therefore:
$$ f(x) = 2x + 0 = 2x $$
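A rough numerical check of this result: simulate the model, keep only the samples whose X is very close to a chosen x₀, and average their Y values. The uniform input distribution, the tolerance 0.05, and x₀ = 1 are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

X = rng.uniform(-3.0, 3.0, size=n)        # assumed input distribution
Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)

x0 = 1.0
near = np.abs(X - x0) < 0.05              # samples with X close to x0
cond_mean = Y[near].mean()                # approximates E[Y | X = x0]

print("samples near x0:", near.sum())
print("approx. E[Y | X = 1]:", cond_mean, " theory:", 2.0 * x0)
```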

# Question

In practice, we do not know the conditional PDF $ f_{Y|X} $ explicitly; however, we can sample data points from the additive model.
Hence, we are given a dataset $ D = \{(x_1,y_1), \ldots, (x_N,y_N)\} $ where the $(x_i, y_i)$ are random samples drawn independently from the additive model (notice that $ x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p $ is a p-dimensional vector).

Find an estimate of the regression function f. We will denote our estimate by $ \hat{f}$.

![](https://drive.google.com/uc?export=view&id=1dXlhmJ3e7q1YR8ZYpBFSHVxcHdWC9EmS)


$ y_i = f(x_i) + \varepsilon_i $
where $ \mathbb{E}[\varepsilon_i] = 0 $

We approximate f(x) using the empirical conditional mean, i.e.:

![](https://drive.google.com/uc?export=view&id=1dqj0Hx6KCb2OIbaSd0WimsV1KYoMOSY9)



![](https://drive.google.com/uc?export=view&id=1QL8yXDEUZ9uM0lXqK1cVCTL_idQJKUwY)


![](https://drive.google.com/uc?export=view&id=1JfM2g62pa12JEk8vWWpY1tAEnEDfh1-1)


![](https://drive.google.com/uc?export=view&id=1llya1EeAhspdP2f6u3gH-BaFWfNEDveI)

Reference (windowing technique): https://canvas.upenn.edu/courses/1878459/pages/%3E-regression-function-11-36?module_item_id=34205529


# Question:

![](https://drive.google.com/uc?export=view&id=1JlIHDVcghrmAYEqfIUpLqUrT-l4ZtR_W)


![](https://drive.google.com/uc?export=view&id=1NsTZtQgUXcjO2vi1MnplCie-Ijs6YdDx)





---
# Local Averaging
The slide is explaining local averaging as a practical way to approximate the regression function f(x) when you only have sample data.

In theory, we said:

$$ f(x) = E[Y|X =x]  $$

In practice, we don’t know the true distribution, so we approximate using empirical averages of nearby data points.

![](https://drive.google.com/uc?export=view&id=123HbKN5_3Nk_cirf0qxUYGrJlgUdQi5m)


## 🔹 Step-by-step process (illustrated in the figure)


1.   Pick a point of interest, say 𝑥=2.
2.   Choose a window around it, e.g. the interval [2 - r, 2 + r].
3.   Collect all data points inside that window.
4.   Average their y-values (outputs).

![](https://drive.google.com/uc?export=view&id=1gV8xIEf1j--XpybenBsb7CBqMEpCSH0h)

Where $D_{2,r}$ is the set of all data points with inputs in [2-r, 2+r].

5.   That average gives an estimate of f(2). Graphically, this is a point somewhere in the middle of the cloud of y-values around x = 2.
6.   Repeat across all x by sliding the window, and you trace out a curve that runs through the middle of the scatterplot.

That curve is your estimate of the regression function.
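The steps above can be sketched in a few lines of NumPy. The helper `local_average` and the synthetic data (a sine curve plus noise) are assumptions made for this note, not part of the course material.

```python
import numpy as np

def local_average(x_data, y_data, x0, r):
    """Average the y-values of all points whose x lies in [x0 - r, x0 + r]."""
    in_window = np.abs(x_data - x0) <= r
    if not in_window.any():
        return np.nan                     # no neighbors in the window
    return y_data[in_window].mean()

# Synthetic data: y = sin(x) + noise (hypothetical example)
rng = np.random.default_rng(0)
x_data = rng.uniform(0.0, 4.0, size=2000)
y_data = np.sin(x_data) + rng.normal(0.0, 0.2, size=2000)

estimate = local_average(x_data, y_data, x0=2.0, r=0.1)
print("local average at x = 2:", estimate)
print("true f(2) = sin(2):    ", np.sin(2.0))
```

A smaller r follows the curve more closely but leaves fewer points to average; a larger r smooths more but blurs local structure.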



# Example: Predicting House Prices with 1 feature

Suppose you want to estimate the price of a house(Y) based on some features (X).

**Case # 1: 1D (Only one feature)**
Let's say X = house size (sqft)



*   You have a dataset of 500 houses: sizes vs. prices.
*   If you want to estimate the price of a 2,000 sqft house, you can do local averaging: take all houses with sizes between 1,900 and 2,100 sqft and average their prices. That average is your estimate of f(2000).

Works great! You have lots of nearby examples.


Scatterplot: each blue dot = one house.

$x_i$ = house size (sqft)
$y_i$ = sale price

Local Averaging at one point

Suppose we want to estimate the price for a 2,000 sqft house:
1. Pick a window around 2,000 (say, between 1,900 and 2,100 sqft).
2. Collect all houses in that range (the "neighbors").
3. Compute the average price of those neighbors.
4. Plot that average as a single point (green dot) at x = 2000.

That's $\hat{f}(2000)$, the estimated regression function at 2,000 sqft.


Slide the window:
Now repeat:
1. Move the window to 1,000 sqft, average prices there.
2. Move to 1,500 sqft, average prices there.
3. Move to 2,500 sqft, and so on.

At each location, plot the average.

Connect the averages:
If you connect all those averages smoothly, you get the green line, an estimate of the true regression function f(x) = E[Y|X = x].

It passes through the middle of the cloud of blue points.

It's not a perfect line, because it adapts to the local shape of the data.
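Here is a minimal sketch of the sliding-window idea on made-up housing data. The 500 synthetic houses, the linear price rule, and the noise level are invented for illustration; they are not real prices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic dataset of 500 houses (hypothetical numbers)
sizes = rng.uniform(800, 3200, size=500)                      # sqft
prices = 50_000 + 150 * sizes + rng.normal(0, 20_000, 500)    # dollars

half_width = 100                       # window is [s - 100, s + 100] sqft
grid = np.arange(1000, 3001, 250)      # window centers: 1000, 1250, ..., 3000

curve = []
for s in grid:
    in_window = np.abs(sizes - s) <= half_width    # the "neighbors"
    curve.append(prices[in_window].mean())         # local average price
curve = np.array(curve)

for s, avg in zip(grid, curve):
    print(f"{s} sqft -> estimated price {avg:,.0f}")
```

Plotting `grid` against `curve` would give the line that runs through the middle of the scatterplot.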




# Further Explanation:
The set is $S_1 = [-1,1] \subset \mathbb{R}^1$.

Suppose N points are placed uniformly at random in this interval.

Pick some location $ x \in S_1 $ and a neighborhood around it:

$$ D_r (x) = [x-r, x+r] $$
In 1D, the length of an interval [a,b] is simply:

$$ Length = b-a $$

$$ Length(D_r(x)) = (x+r) - (x-r) = 2r $$

# Density of points in 1D
* Total interval length = 1 - (-1) = 2
* If the interval of length 2 has N points uniformly distributed, then the expected number of points per unit length is:

$$ Density = \frac{N}{ total \; length } = \frac{N}{2} $$

So, the expected number of points inside that window (assuming the window lies entirely inside $S_1$) is:

$$ \text{Expected number in window} = Density \times \text{Length of window} = \frac{N}{2} \cdot 2r = Nr $$
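The density argument can be checked by simulation: scatter N points uniformly on [-1, 1] many times and count how many fall inside a window of half-width r. The values of N, r, the window center, and the number of repetitions are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, x0 = 1000, 0.1, 0.0      # window [x0 - r, x0 + r] lies inside [-1, 1]
trials = 2000

# Each row is one dataset of N uniform points on [-1, 1]
pts = rng.uniform(-1.0, 1.0, size=(trials, N))
counts = (np.abs(pts - x0) <= r).sum(axis=1)

avg_count = counts.mean()
expected = (N / 2) * (2 * r)   # density times window length

print("average count in window:", avg_count)
print("density x length:       ", expected)
```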


![](https://drive.google.com/uc?export=view&id=1S7AQHCFOuktIMDOAyC6tHpUjXKh3vR9n)



![](https://drive.google.com/uc?export=view&id=1ByAzzCWzkQJcOF6kHxbBr0czKCokaNI8)



---
# Curse of Dimensionality

This approach works fine when p = 1 (only one input variable).
But when you have many input variables (p>1), something called the curse of dimensionality kicks in.

*   Data points become sparse in high-dimensional space.
*   Your local window may contain very few or no neighbors, even with a large dataset.
*   Local averaging becomes unreliable.

So while local averaging is a nice intuitive method in 1D, in higher dimensions we need more sophisticated statistical learning techniques (like linear regression, kernel methods, decision trees, neural nets, etc.)







---



# Case 2: High-D (Many features)


*   $X_1$ = house size,
*   $X_2$ = number of bedrooms
*   $X_3$ = lot size,
*   $X_4$ = year built
*   $X_5$ = distance to city center.

That's a 5D input space (p = 5).

If you want to estimate the price of a house with 2,000 sqft, a 5,000 sqft lot, built in 1990, 10 miles from the city center, and you try local averaging, you are looking for "neighbors" in 5D space.

It's very unlikely you'll find many houses exactly matching or even close in all 5 features at once.

To get enough neighbors, you'd need to expand your window a lot, but then you're averaging across very different houses (e.g. 1,000 sqft vs. 3,000 sqft, or 1 bedroom vs. 5 bedrooms).

This is the curse of dimensionality:
In higher dimensions, data points are spread far apart.
Local neighborhoods have too few points, even if you have a large dataset.






---

What does "sampled uniformly at random" mean?

![](https://drive.google.com/uc?export=view&id=1iEu4-JQkQDRbMYT1eyA44ZtvWces2K8W)

![](https://drive.google.com/uc?export=view&id=1yXETD-EU38ooL6hT2o-p75GAhxZ6k9Vb)


---



# Conclusion of Conditional Expectation and Empirical Estimates:

![](https://drive.google.com/uc?export=view&id=11ln8NkslEtOULKONuAfyu2ROrKjaE4d3)


# Parametric Models

## Curse of Dimensionality
The curse of dimensionality (CoD) means that when the number of inputs (features, or p) is large, local averaging around a point x doesn't work well to estimate the regression function f(x).

For small p (e.g. 1 or 2 dimensions), you have a decent number of points close to x.

As p increases, the number of samples around x quickly drops toward zero.
This happens because in higher dimensions, points get farther apart and the data becomes sparse.

In 1D, your window around a point is a short line segment (easy to capture nearby points).

In 2D, it's a disk.

In 10D, it's a hypersphere, and most of the data lies at the edges, far away from the center. So local averaging fails because you don't have enough data near your target point.

To beat the CoD, we can't rely only on local averages. We need parametric models.
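To get a feel for how fast neighborhoods empty out, the sketch below scatters the same number of points uniformly in the cube [-1, 1]^p and counts how many land in a window around the origin. For simplicity it uses a cube-shaped window of half-width r rather than a sphere, which is an assumption of this example; the expected count is then N·rᵖ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 100_000, 0.5            # number of points and window half-width

counts = {}
for p in (1, 2, 5, 10):
    pts = rng.uniform(-1.0, 1.0, size=(N, p))
    # A point is a neighbor only if it is within r of the origin in EVERY coordinate
    inside = np.all(np.abs(pts) <= r, axis=1)
    counts[p] = int(inside.sum())
    print(f"p = {p:2d}: {counts[p]:6d} neighbors (expected about {N * r**p:.1f})")
```

Even with 100,000 points and a window covering half of each axis, very few points remain nearby once p reaches 10.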

## Parametric Models:

The idea:  
Instead of trying to average over local neighborhoods, we assume the regression function has a certain **parametric form** (a formula with parameters to estimate).

**Linear model example**
$$ f_L(x;\beta) = \beta_0 + \beta_1 \cdot x_1 +  \beta_2 \cdot x_2 + ... +  \beta_p \cdot x_p $$

* Here $ \beta_0, \beta_1, \beta_2, \ldots, \beta_p $ are parameters (weights).
* $ x = (x_1, x_2, x_3, \ldots, x_p) $ is your input vector.


**Estimation:**
Using your dataset D, you estimate the parameters $\hat{\beta}$.

Then the estimated function becomes:
$$ \hat{f}_L(x) = f_L(x;\hat{\beta}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_1 +  \hat{\beta}_2 \cdot x_2 + ... +  \hat{\beta}_p \cdot x_p $$

Why this beats the CoD:
* In local averaging, predictions rely only on the few data points near x.
* In a parametric model, the prediction uses the whole dataset to estimate the global parameters $ \hat{\beta}_i $
* that means even if data is sparse in high dimensions, we can still make predictions because the model "shares strength" across the entire dataset.



# Example of Parametric Models

Let's say we have $ y = 2x + \varepsilon $, with $ \varepsilon \sim N(0,1) $ (some noise).

Let's first try 1D (easy for local averaging).

Then extend to 5D (where the curse of dimensionality starts hitting).
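A possible sketch of the 5D case follows. The notes do not specify the 5D model, so the coefficient vector `beta_true`, the uniform inputs, the sample size, and the window size are all assumptions. The code compares how many neighbors local averaging finds around one point with how well a least-squares fit on the whole dataset recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, r = 500, 5, 0.2

beta_true = np.array([2.0, -1.0, 0.5, 0.0, 1.0])    # hypothetical coefficients
X = rng.uniform(-1.0, 1.0, size=(N, p))
y = X @ beta_true + rng.normal(0.0, 1.0, size=N)    # zero intercept, N(0, 1) noise

# Local averaging around x0 = 0: how many neighbors fall in the 5D window?
x0 = np.zeros(p)
neighbors = np.all(np.abs(X - x0) <= r, axis=1)
n_neighbors = int(neighbors.sum())

# Parametric fit: least squares on ALL N points (first column is the intercept)
A = np.column_stack([np.ones(N), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("neighbors in the 5D window:", n_neighbors)
print("estimated intercept:", beta_hat[0])
print("estimated slopes:   ", beta_hat[1:])
```

The window is usually almost empty, while the linear fit uses all 500 points to estimate just six numbers.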



# Week 3

## Linear Regression
In this module, we will cover the problem of linear regression.
We will start with a formal statement of the problem, derive a solution as an **optimization problem**, and provide a closed-form expression using the matrix pseudoinverse. We will then move on to analyze the statistical properties of the linear regression coefficients, such as their covariance and variances.

We will use this statistical analysis to determine coefficient accuracy and analyze confidence intervals.
We will then move on to the topic of hypothesis testing, which we use to determine dependencies between input variables and outputs.

We will then finalize with a collection of metrics to measure model accuracy and a continuation of the introduction to the Python programming language.

## Learning Objectives

* Understand linear regression problems and their solutions.
* Analyze statistical properties of linear coefficients.
* Evaluate coefficient accuracy and confidence intervals.
* Understand hypothesis testing and implications.
* Use Pandas in Python.

# Linear Algebra:
Reference Video:
https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab


## Basic Operations:

---


![](https://drive.google.com/uc?export=view&id=1vRApRCf7FlYA83-E_xH5Z3l4jAXOz9j0)


![](https://drive.google.com/uc?export=view&id=1t8XgzdwlEj2ldwWfmMGbDfZYWCL-A5BU)



## Multiplication:

![](https://drive.google.com/uc?export=view&id=1Pv7xULS1GDjQaHvTmdRw9FvTpqNs1yxM)


* Matrix multiplication is associative $(AB) C = A (BC)$
* Matrix multiplication is distributive $A(B + C )  = AB + AC $
* but (in general) not commutative 𝐴𝐵 ≠ 𝐵𝐴

In linear algebra, when we say matrix multiplication is not commutative, it means:

𝐴𝐵 ≠ 𝐵𝐴 in general.

🔹 What does this mean?

With real numbers, multiplication is commutative: 2×3=3×2.

With matrices, the order matters:
Multiplying A by B does not usually give the same result as multiplying
𝐵 by 𝐴.
In fact, sometimes one of the products doesn’t even exist (if dimensions don’t match).
![](https://drive.google.com/uc?export=view&id=1Hwm_ERz-rz1LkA-iUaRcsZTo74auz_KK)
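These multiplication rules are easy to verify with NumPy. The three small matrices below are arbitrary examples chosen for this note.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [1, 3]])

AB = A @ B     # swaps the columns of A
BA = B @ A     # swaps the rows of A

print("AB =\n", AB)
print("BA =\n", BA)
print("AB equals BA?         ", np.array_equal(AB, BA))
print("(AB)C equals A(BC)?   ", np.array_equal((A @ B) @ C, A @ (B @ C)))
print("A(B+C) equals AB + AC?", np.array_equal(A @ (B + C), A @ B + A @ C))
```

Note that `@` is matrix multiplication, while `*` on NumPy arrays multiplies element by element.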



## Transpose & Conjugate:

The transpose of a matrix A is denoted by $ A^T $ i.e. $[A^T]_{ij} =[A]_{ji} $.

In other words, the transpose results from converting rows of A into columns of $A^T$.
The transpose operation satisfies:

* $(A^T)^T = A $
* $(AB)^T = B^T A^T $
* $(A + B) ^ T = A^T + B^T $
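The three transpose properties can be checked on small example matrices. The matrices below are arbitrary; `D` is a second 2 x 3 matrix introduced only so that the sum with `A` is defined.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2 x 3
B = np.array([[1, 0],
              [2, 1],
              [0, 3]])             # 3 x 2
D = np.array([[0, 1, 1],
              [2, 0, 5]])          # 2 x 3, same shape as A

print("A.T has shape", A.T.shape)                              # rows become columns
print("(A^T)^T == A:        ", np.array_equal(A.T.T, A))
print("(AB)^T == B^T A^T:   ", np.array_equal((A @ B).T, B.T @ A.T))
print("(A+D)^T == A^T + D^T:", np.array_equal((A + D).T, A.T + D.T))
```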

![](https://drive.google.com/uc?export=view&id=1X03_FZg9HYpHXg0GWdpTRlQrlfKdwuk5)




Question # 1:
---

![](https://drive.google.com/uc?export=view&id=1si4hJcMWbk5t_GMv3q8LeVcCollkSvBl)

Both x and y are column vectors:

Step # 1: Dimensions
* x is 3 x 1
* y is 3 x 1
* $ x^T $ is 1 x 3

Multiplying $ x^T y $:
(1 x 3) (3 x 1) = 1 x 1
The result is a scalar (a single number).

$$
x^T y =
\begin{bmatrix} 1 & 3 & 5 \end{bmatrix}
\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}
= (1)(2) + (3)(4) + (5)(6) = 2 + 12 + 30 = 44
$$
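The same product can be reproduced in NumPy, using the entries from the term-by-term computation above (x = (1, 3, 5) and y = (2, 4, 6)) stored as explicit column vectors.

```python
import numpy as np

x = np.array([[1], [3], [5]])    # 3 x 1 column vector
y = np.array([[2], [4], [6]])    # 3 x 1 column vector

result = x.T @ y                 # (1 x 3)(3 x 1) gives a 1 x 1 matrix
scalar = result.item()           # pull out the single number

print("shape of x^T y:", result.shape)
print("value:", scalar)
```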



# Week 3 - Question # 2

# Week 3 - Question # 3

# Transformation:

To sum up, linear transformations are a way to move around space such that grid lines remain parallel and evenly spaced, and such that the origin remains fixed.


![](https://drive.google.com/uc?export=view&id=1L28E5lY9sicXTPUVSZE9i2igckPBVt-f)



![](https://drive.google.com/uc?export=view&id=1NzujqcU2nvkEuf81qnSDBaD84T_J1g3C)


A linear transformation is completely determined by where it takes the basis vectors of the space, which for two dimensions means $ \hat{i} $ and $ \hat{j} $.
This is because any other vector can be described as a linear combination of those basis vectors.

 ![](https://drive.google.com/uc?export=view&id=19UycmAhRI29BQH8UFrVIrWbDxuwjra6o)





---


# Linear Regression - Additive Model


Regression analysis is like any other inferential methodology. Our goal is to draw a random sample from a population and use it to estimate the properties of that population.

 ![](https://drive.google.com/uc?export=view&id=1ZR8-c1YZz6N07_yzxkz73HMudPSnrY1I)


According to this additive model, the output variable y is going to be generated as a function that depends on a vector of inputs x, and a collection of parameters Beta plus some measurement noise.

$$ Y = f_L (\textbf{X};\beta) + \varepsilon = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + . . . + \beta_p X_p + \varepsilon $$

* The bold vector $\textbf{X} = (X_1, X_2, \ldots, X_p)$ contains the input variables $X_1, X_2, \ldots, X_p$, which are random inputs drawn from some distribution $f_X(x)$.

* The coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are deterministic but unknown.
* The measurement noise follows a distribution $ \varepsilon \sim f_{\varepsilon}$.

✅ In plain words:
“The additive model induces a joint PDF” means: once you assume
𝑌 is generated by a linear function of 𝑋 plus random noise, you automatically define how the pair (𝑋,𝑌) is distributed together — i.e., their joint probability density function.



---


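Notice that given the marginal of X and the equation of the additive model, we can compute a joint PDF, $f_{XY}(x, y)$.

Here is how to do that:

**Step # 1: Restate the additive model**

$$ Y = f_L (\textbf{X};\beta) + \varepsilon = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + . . . + \beta_p X_p + \varepsilon $$

Where
* $ \textbf{X} \sim f_X(x) $ (we know the marginal distribution of the predictors)
* $ \varepsilon \sim f_{\varepsilon}(\varepsilon) $ (independent of X).

**Step # 2: Express the conditional distribution:**
Given X = x,
$$ Y |X = x = \beta_0 + \beta^T x + \varepsilon $$
where $ \beta^T x = \beta_1 x_1 + \ldots + \beta_p x_p $. So, given X = x, Y is just the noise shifted by a constant, and its conditional density is
$$ f_{Y|X}(y|x) = f_{\varepsilon}(y - \beta_0 - \beta^T x) $$

**Step # 3: Multiply by the marginal:**
$$ f_{XY}(x, y) = f_X(x) \, f_{Y|X}(y|x) = f_X(x) \, f_{\varepsilon}(y - \beta_0 - \beta^T x) $$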




---


# Quick Note on ~
Meaning of ~
$$ (x_i, y_i) \sim f_{XY} $$

it means:
* The random vector $(x_i, y_i)$ is distributed according to the probability distribution with density $f_{XY}$.

In other words: $ (x_i, y_i) $ follows the joint distribution $f_{XY}$.

 ![](https://drive.google.com/uc?export=view&id=1cVNNZW2GHViMoZ6XrKYfEcuA7pjlw3JG)


# Linear Regression Problem

* Given a training dataset $D_{Tr} = \{(x_i, y_i)\}^N_{i=1}$ consisting of N independent samples $ (x_i, y_i) \sim f_{XY} $
* Estimate values for the unknown coefficients $ \beta_0, \beta_1, \beta_2, \ldots, \beta_p $, denoted by $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$. Because the true $\beta$ values are known only to nature, we work with the estimates $ \hat{\beta} $.
* Once we have these estimates, we can make a prediction about the output variable corresponding to a new input $ x = [x_1, \ldots, x_p] ^T$, as follows:
$$ \hat{y} = f_L (\textbf{x};\hat{\beta}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + . . . + \hat{\beta}_p x_p $$



---
## Sum of Squared Errors:


  ![](https://drive.google.com/uc?export=view&id=1XW2n5nfjQJ0UDYkmqjnYqvjK-0jm1RLN)

Square the difference between each blue dot and the yellow line, then sum it all up: this produces the Sum of Squared Errors. We need to find the $\beta_0 $ and $ \beta_1 $ that minimize this number.

#  Calculate Coefficients $\beta_0$ & $\beta_1$ with One Predictor Variable

  We want to use the known data points to estimate the values of $ \beta_0 $ & $ \beta_1 $. The estimates that we obtain are denoted $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $.
  We compute $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $ as some function of the data that we have collected. Of course, your estimates could be so good that they turn out to be exactly $ \beta_0 $ and $ \beta_1 $, and in that case your white line would coincide with the red line entirely. In reality we get a slight deviation.


  ![](https://drive.google.com/uc?export=view&id=1oO1GK2GDPDN-jvJ6GZR8MRDjcut2GEhc)

  So the question now is: how do we obtain the best estimates $\hat{\beta}_0$ & $\hat{\beta}_1$?


$$ \hat{\beta}_1 = \frac{\sum^n_{i=1} (x_i - \bar{x}) (y_i - \bar{y})}{\sum^n_{i=1} (x_i - \bar{x})^2} $$

To calculate $\hat{\beta}_0$:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} $$
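
The two formulas above can be sketched in a few lines of NumPy. This is only an illustration: the function name `fit_simple_linear_regression` and the demo data (points lying exactly on y = 1 + 2x) are made up for this example.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (beta0_hat, beta1_hat) for the model y = beta0 + beta1 * x + noise."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point (x_bar, y_bar)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Hypothetical data lying exactly on the line y = 1 + 2x
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [1.0, 3.0, 5.0, 7.0]
beta0_hat, beta1_hat = fit_simple_linear_regression(x_demo, y_demo)
print(beta0_hat, beta1_hat)
```

Because the demo points have no noise, the estimates recover the intercept 1 and slope 2; with noisy data you would see the slight deviation described above.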



---

The residual is:
$$ e_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i ) $$

Where $Y_i$ is the actual Y value and $\hat{Y}_i $ is the estimated one.

$$ e_i = Y_i - \hat{Y}_i $$

Using the above statement we could simply add up all the residuals, but doing so can be misleading: some residuals are positive and some are negative, so they can cancel each other out and make a poor fit look good. To get around this problem, we square each residual before summing (the sum of squared residuals), so that residuals cannot offset each other.

  ![](https://drive.google.com/uc?export=view&id=1UMHRyFAn2H64Y86jLHbpG-CAkIAUFUv6)

The fit in the image above is not good. Computing the sum of squared residuals for each candidate line and picking the line with the smallest sum will give you a much better fit (the least squares fit).
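
Here is a small sketch of why the plain sum of residuals is misleading. The data and the two candidate lines are hypothetical: a flat line through the mean of y (a poor fit) and the line the points actually lie on.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

def residuals(beta0, beta1):
    """Residuals e_i = y_i - (beta0 + beta1 * x_i) for a candidate line."""
    return y - (beta0 + beta1 * x)

flat = residuals(beta0=y.mean(), beta1=0.0)  # horizontal line through the mean
good = residuals(beta0=0.0, beta1=2.0)       # the line the points lie on

sum_flat, sse_flat = flat.sum(), np.sum(flat ** 2)
sum_good, sse_good = good.sum(), np.sum(good ** 2)

print("flat line: sum of residuals =", sum_flat, " SSE =", sse_flat)
print("good line: sum of residuals =", sum_good, " SSE =", sse_good)
```

Both lines have residuals that sum to zero, so the plain sum cannot tell them apart. The sum of squared residuals can: it is 20 for the flat line and 0 for the good one.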




# Calculate Coefficients $\beta_0, \beta_1, \dots, \beta_p$ with Multiple Predictor Variables
## Linear Regression using Matrix

The Equation:

$$ \hat{\beta} = (M_X^T M_X)^{-1} M_X^T y $$

Where
* $ \hat{\beta} $ : This is the vector of estimated coefficients. It has entries $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, ...

* $ \hat{\beta}_0 $ = intercept (the "starting value" on the Y-axis).
* $ \hat{\beta}_1 $ , $ \hat{\beta}_2 $, ... = slopes, one for each predictor variable.
* $M_X $ : This is the design matrix. It’s basically a big table that holds all your input x-values:
  * The first column is all 1s (so the model can include the intercept).
  * The other columns are your predictor variables ( $x_1$, $x_2$, ...).

* $y$ : This is your outcome (dependent variable) vector.
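* $(M_X^T M_X)^{-1} M_X^T $ : This is the Moore–Penrose pseudoinverse of the matrix $M_X$ (when the columns of $M_X$ are linearly independent, so that $M_X^T M_X$ is invertible). Multiplying it by $y$ is the mathematical trick that gives you the “best fit” coefficients for your data points in the least squares sense.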
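
The matrix formula can be sketched in NumPy as below. The simulated data, the "true" coefficients and the noise level are all hypothetical. Solving the linear system $M_X^T M_X \hat{\beta} = M_X^T y$ is used instead of forming the inverse explicitly, which is a common numerical choice and not something these notes prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two predictors, made-up true coefficients
n = 50
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -3.0])  # [beta0, beta1, beta2], assumed for the demo

# Design matrix: first column of ones (intercept), then the predictors
M_X = np.column_stack([np.ones(n), X])
y = M_X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: (M_X^T M_X) beta_hat = M_X^T y
beta_hat = np.linalg.solve(M_X.T @ M_X, M_X.T @ y)

# Same estimate through NumPy's Moore-Penrose pseudoinverse
beta_hat_pinv = np.linalg.pinv(M_X) @ y

print(beta_hat)
print(beta_hat_pinv)
```

The two routes agree up to floating-point error, and both land close to the assumed true coefficients because the simulated noise is small.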



# Quick Note on Optimization

 ![](https://drive.google.com/uc?export=view&id=1QE5VDvl69IXPTvqt0llgKZbfFheN3zZc)



# Univariate Linear Regression
Univariate means there is one independent variable (one predictor).
The model looks like:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where
* y = dependent (output) variable.
* x = independent (input) variable (the predictor).
* $\beta_0 , \beta_1$ = coefficients we want to estimate
* $\varepsilon $ = random error



  ![](https://drive.google.com/uc?export=view&id=1jIP9wDuFZKxTFlErxFWTcIHkLk3BJV9h)



---

reference video: https://www.youtube.com/watch?v=54ewnkdWU6w&t=496s

# Confidence Interval

We can compute $\hat{\beta}_0, \dots, \hat{\beta}_p$ from D alone. These are estimates that aim to approximate the values of $\beta_0, \dots, \beta_p$ using data.

These estimates are random variables, each centered around the corresponding true parameter $\beta_i$.


We analyze the uncertainty of our estimates using confidence intervals. In other words, we can claim that the true value of a parameter $\beta_i$ is within a particular interval with 95% probability.

The estimate $ \hat{\beta}_i$ is distributed according to a Gaussian distribution that is centered around the true value of $\beta_i$ (this holds exactly when the noise $\varepsilon$ is Gaussian).




---

Slope $\beta_1$  & intercept $ \beta_0 $: for every one-unit increase in x, we would see y change by $\beta_1$ on average; the intercept $\beta_0$ is the value of y when x = 0.


reference video: https://www.youtube.com/watch?v=54ewnkdWU6w&t=496s

The example in the given video teaches us that, when we work with samples, the point estimate of the coefficient is not necessarily (and almost never is) the true value.

Instead of using only the point estimate, it's better to work with ranges of values for our estimates, which we call confidence intervals.

It is standard to work with 95% confidence intervals, which means we are 95% certain the true value lies within our interval.


$$ \beta_i \in [\hat{\beta}_i - 2 \cdot SD (\hat{\beta}_i), \hat{\beta}_i + 2 \cdot SD (\hat{\beta}_i) ] $$

What we can claim is that the value that nature uses, this $ \beta_i $ that is unknown to us, is contained within this interval with 95 percent probability.
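
A rough sketch of this interval for the one-predictor case is shown below. The notes do not give a formula for $SD(\hat{\beta}_i)$, so the code assumes the standard textbook expressions for simple linear regression: $SD(\hat{\beta}_1)^2 = \hat{\sigma}^2 / \sum (x_i - \bar{x})^2$ and $SD(\hat{\beta}_0)^2 = \hat{\sigma}^2 (1/n + \bar{x}^2 / \sum (x_i - \bar{x})^2)$, with $\hat{\sigma}^2$ estimated from the residuals. The simulated data and its true coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: assumed truth is beta0 = 3, beta1 = 0.5, noise SD = 1
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Point estimates (closed-form simple linear regression)
x_bar, y_bar = x.mean(), y.mean()
s_xx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Noise variance estimated from the residuals (n - 2 degrees of freedom)
resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

# Standard deviations of the estimates (assumed textbook formulas)
sd_beta1 = np.sqrt(sigma2_hat / s_xx)
sd_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx))

# Approximate 95% intervals: estimate +/- 2 standard deviations
ci_beta0 = (beta0_hat - 2 * sd_beta0, beta0_hat + 2 * sd_beta0)
ci_beta1 = (beta1_hat - 2 * sd_beta1, beta1_hat + 2 * sd_beta1)

print("beta0:", beta0_hat, ci_beta0)
print("beta1:", beta1_hat, ci_beta1)
```

Rerunning with a different seed gives a different dataset and therefore a different interval, which is the sense in which the estimates are random variables.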





---
# Hypothesis Testing:

Determines how likely it is for a particular input $ X_i $ to influence the output Y. For this task we use hypothesis testing, which allows us to build statistical evidence to reject a null hypothesis of the form: $X_i$ does not influence Y (legal parable: the null hypothesis is "Bob did not kill Alice", and we need enough evidence to reject it).
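
One common way to build that evidence, sketched here for the one-predictor case, is the t-statistic $t = \hat{\beta}_1 / SD(\hat{\beta}_1)$. It is not defined in these notes, so treat the formula and the rule of thumb "reject when $|t|$ is larger than about 2" (which matches the $\pm 2 \cdot SD$ interval above) as assumptions of this sketch. The function name and the simulated data are hypothetical.

```python
import numpy as np

def slope_t_statistic(x, y):
    """t-statistic for the null hypothesis beta1 = 0 in y = beta0 + beta1 * x + noise."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    resid = y - (beta0_hat + beta1_hat * x)
    sigma2_hat = np.sum(resid ** 2) / (n - 2)  # noise variance estimate
    sd_beta1 = np.sqrt(sigma2_hat / s_xx)      # SD of the slope estimate
    return beta1_hat / sd_beta1

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y_influenced = 1.0 + 2.0 * x + rng.normal(size=100)  # x really drives y
y_unrelated = rng.normal(size=100)                   # x plays no role

t_influenced = slope_t_statistic(x, y_influenced)
t_unrelated = slope_t_statistic(x, y_unrelated)
print("t when x influences y:", t_influenced)
print("t when x is unrelated:", t_unrelated)
```

A large $|t|$ means the estimated slope is many standard deviations away from zero, which is strong evidence against "x does not influence Y".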





---


# Categorical Inputs

Recap:

1. The Linear Regression Model

The model is usually written as (for a one-dimensional model):

$$ Y = \beta_0 + \beta_1 X + \varepsilon $$  

(for a multi-dimensional model, here with two predictors):


$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon $$  

Where:
* $\beta_0$ (intercept): where the regression line crosses the y-axis (the value of y when x = 0).

* $\beta_1$ (Slope): how much y changes, on average, for a one-unit increase in x.

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x $$

with “hats” showing estimated values (the fitted prediction has no noise term $\varepsilon$).

Here each input $x_i$ is a p-dimensional vector and each output $y_i$ is a scalar.

These random input vectors are drawn from a distribution $f_x$, which is the marginal distribution of the input vector $x_i$.


**The Key Idea**

In simple regression, X is just $X_1$, which is easy to plot.
In multiple regression, there isn't one single "X-axis": the predictors together define a higher-dimensional space.


That’s why in textbooks you usually see either:

* Plots of the fitted line/plane with 1 or 2 predictors, or
* Plots that show the effect of one predictor while holding others fixed.

---
1. Input Vectors in Machine Learning / Statistics

Suppose you’re working with data:

$$ (x_i, y_i), \quad i = 1, \dots, n $$

$ x_i $ = input vector (features), e.g. age, income, blood pressure.

$ y_i $ = output (label/response), e.g. disease outcome.

So $ x_i $ itself is a random vector, because it comes from some population and is not fixed by you.


2. Marginal Distribution of x
In probability, when we talk about the joint distribution of (x,y), we mean a probability law that governs both input and output together.

If you only care about the inputs x, you look at the marginal distribution of
x:

$$ f_x(x) = \int f_{x,y}(x,y)dy $$
This "marginalizes out" the y-part, leaving just the distribution for the inputs.
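
The integral can be checked numerically. The sketch below makes several assumptions that are not in the notes: one predictor with $X \sim N(0, 1)$, Gaussian noise with standard deviation 0.5, coefficients $\beta_0 = 1$, $\beta_1 = 2$, and the additive-model joint density $f_{x,y}(x, y) = f_x(x) \, f_{\varepsilon}(y - \beta_0 - \beta_1 x)$. The integral over y is approximated by a simple Riemann sum on a fine grid.

```python
import numpy as np

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Hypothetical model parameters, assumed only for this demo
beta0, beta1, noise_sd = 1.0, 2.0, 0.5

def joint_pdf(x, y):
    # f_xy(x, y) = f_x(x) * f_eps(y - beta0 - beta1 * x)
    return normal_pdf(x) * normal_pdf(y - beta0 - beta1 * x, sd=noise_sd)

x_grid = np.linspace(-2.0, 2.0, 9)
y_grid = np.linspace(-12.0, 14.0, 5201)  # wide enough to cover y for every x in x_grid
dy = y_grid[1] - y_grid[0]

# Evaluate the joint density on the grid, then sum over the y-axis
joint = joint_pdf(x_grid[:, None], y_grid[None, :])
marginal_x = joint.sum(axis=1) * dy

print(np.max(np.abs(marginal_x - normal_pdf(x_grid))))
```

The printed number is the largest gap between the numerically marginalized density and the assumed $f_x$; it should be tiny, showing that integrating out y leaves just the input distribution.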





---
# Qualitative Inputs:

Linear models can handle qualitative inputs, also called categorical variables, taking a discrete set of values.


Analyze the differences in credit card balance between males and females, ignoring other variables. The output variable $ y_i $ represents the credit card balance of individual i. We will consider the gender of individual i as the only input $x_i$. How do we build a linear model?

Step # 1: Create a dummy variable:
$x_i = 1$ if individual i is female and $x_i = 0$ if individual i is male

Step # 2:

---


  ![](https://drive.google.com/uc?export=view&id=1naTsJnlThcTuG0e7kOLhLPpXrnCcTxNY)

  ![](https://drive.google.com/uc?export=view&id=1tj13cskfjBLGAoXeR06WDdHzQs4tayxu)
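
As a rough sketch of the dummy-variable idea, the code below encodes gender as 0/1 and fits $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ by least squares. The six balances are made-up numbers for illustration, not real credit card data.

```python
import numpy as np

# Hypothetical balances, invented for this example
gender = np.array(["female", "male", "female", "male", "male", "female"])
balance = np.array([520.0, 480.0, 610.0, 450.0, 510.0, 550.0])

# Step 1: dummy variable, 1 for female and 0 for male
x = (gender == "female").astype(float)

# Fit y = beta0 + beta1 * x by least squares using a design matrix with an intercept column
M_X = np.column_stack([np.ones_like(x), x])
beta0_hat, beta1_hat = np.linalg.lstsq(M_X, balance, rcond=None)[0]

mean_male = balance[gender == "male"].mean()
mean_female = balance[gender == "female"].mean()

print("beta0_hat:", beta0_hat, " mean balance (male):", mean_male)
print("beta0_hat + beta1_hat:", beta0_hat + beta1_hat, " mean balance (female):", mean_female)
```

With a single 0/1 input, the intercept estimate equals the average balance of the group coded 0 (males here), and the slope estimate equals the difference between the two group averages.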



Reference video: https://www.youtube.com/watch?v=9yTui_LoSOc




---

# Problems

# Calculate $M_X^T$ from a Dataset to calculate $\beta_0$ and $\beta_1$

Given a dataset $D_{Tr} = \{(x_i, y_i)\}_{i=1}^4 = \{(1,2),(2,5),(3,13),(4,20)\}$

Coefficients of a quadratic model:
$$ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 X^2 $$

For a simple linear regression (intercept + x), the inputs are

x = [1,2,3,4]

For a quadratic model (intercept + x + $x^2$), the design matrix has one column per term:



$$ M_X =
\begin{bmatrix}
1 & 1 & 1\\
1 & 2 & 4\\
1 & 3 & 9\\
1 & 4 & 16
\end{bmatrix} $$

Transposing the Matrix

$$ M_X^T =
\begin{bmatrix}
1 & 1 & 1 & 1\\
1 & 2 & 3 & 4\\
1 & 4 & 9 & 16
\end{bmatrix} $$


# Calculate $M_X^T$ $M_X$ from a Dataset to calculate $\beta_0$ and $\beta_1$

Multiplication:

  ![](https://drive.google.com/uc?export=view&id=1an4Tpl2tdo3XWv2zGwMlopoax5FRfvBB)
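
The multiplication can be double-checked with NumPy on the same four data points. This sketch also carries the calculation through to $\hat{\beta}$ by solving the normal equations; the variable names (`gram`, `beta_hat`) are just choices for this example.

```python
import numpy as np

# Dataset from the problem: (1,2), (2,5), (3,13), (4,20)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 13.0, 20.0])

# Quadratic design matrix: columns are 1, x, x^2
M_X = np.column_stack([np.ones_like(x), x, x ** 2])

# M_X^T M_X (3 x 3) and M_X^T y (length 3)
gram = M_X.T @ M_X
moment = M_X.T @ y

# Solve (M_X^T M_X) beta_hat = M_X^T y
beta_hat = np.linalg.solve(gram, moment)

print(gram)
print(moment)
print(beta_hat)
```

The entries of $M_X^T M_X$ are just sums of powers of x (for example, the top-left entry is the number of points, 4, and the bottom-right is $\sum x_i^4 = 354$). Solving gives approximately $\hat{\beta}_0 = -0.5$, $\hat{\beta}_1 = 1.2$, $\hat{\beta}_2 = 1.0$.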




# Mean Squared Error:
It measures how far off your model's predictions are from the true values.

Formula:

$$ MSE = \frac{1}{n} \sum^n_{i=1} (y_i - \hat{y}_i)^2 $$

Where
  * $y_i$ = actual (true) value
  * $\hat{y}_i$ = predicted value
  * n = number of data points

🔹 Why square the error?

* If you just subtract $ (y_i - \hat{y}_i) $, positive and negative errors could cancel out.

* Squaring makes all errors positive and penalizes large mistakes more.

* Then we average, so it’s on a comparable scale across models.


🔹 Intuition

Imagine you're predicting house prices:
  * True price: \$200K, Prediction: \$210K -> Error = 200K - 210K = -10K, squared = 100M
  * True price: \$200K, Prediction: \$250K -> Error = 200K - 250K = -50K, squared = 2.5B

  The big mistake is punished far more (5 times the error, 25 times the penalty). That's why MSE is sensitive to outliers.

  🔹 In practice

Low MSE = predictions close to reality (good model).

High MSE = predictions are often far off (bad model).
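
The formula is short enough to write by hand. The sketch below reuses the two house-price predictions from the intuition above; the function name `mean_squared_error` is just a choice for this example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# House price example: errors of -10K and -50K
y_true = [200_000, 200_000]
y_pred = [210_000, 250_000]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # (100M + 2.5B) / 2 = 1.3B
```

Note how the single 50K miss contributes 2.5B of the 2.6B total before averaging, which is the outlier sensitivity in action.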


# CLASSIFICATION

Consider a discrete set C, which contains K class labels. In classification problems, the output variable Y is qualitative and takes values from C.

The goal is usually to build a classifier C(x) that assigns a class label from C to an input x.

For example:
C = ["Cats", "Dogs", "Parrots", ...] (K labels in total)
$$ C: \mathbb{R}^p \to C $$

  ![](https://drive.google.com/uc?export=view&id=18UG60E6ZOE4ni4O4Q4BJo3UdlUPq4D2a)

## 🔹 Generative Model Idea
  The generative model assumes that the data $(x_i, y_i)$ are generated in two steps:
  1- Sample the input $ x_i $: first, draw  $ x_i $  from the marginal probability distribution of inputs:
  $$ x_i \sim f_X(x) $$

  Where $ f_X $ is the marginal probability density function (PDF).
  This step says: "inputs come from some underlying distribution of data" (like natural images, speech signals, etc.).

  2- Sample the output $ y_i $: next, given the sampled input $ x_i $, draw the label $ y_i $ according to a conditional probability mass function (PMF):

  $$  p_k (x_i) = Pr(Y = k | X = x_i )  $$
  for all k ∈ C

  This is called the conditional class probability. It tells us how likely each class label is, given the input.

  ✅ So in simple words:

Drawing from $ f_X $ = “pick a random image from the universe of possible images.”

Then, given that image, assign a label (“cat” or “dog”) with probability
$ Pr(Y = k | X = x_i )  $.
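
The two sampling steps can be simulated directly. Everything specific in this sketch is an assumption made for illustration: a one-dimensional input with $f_X$ standard normal, two classes (0 and 1), and a logistic curve standing in for the conditional class probability $p_1(x)$.

```python
import numpy as np

rng = np.random.default_rng(3)

def p1(x):
    """Hypothetical conditional class probability Pr(Y = 1 | X = x): a logistic curve."""
    return 1.0 / (1.0 + np.exp(-x))

n = 1000

# Step 1: sample the inputs x_i from the marginal f_X (assumed standard normal)
x = rng.normal(size=n)

# Step 2: given each x_i, sample the label y_i with Pr(Y = 1 | X = x_i) = p1(x_i)
y = rng.binomial(1, p1(x))

# Labels are mostly 1 where p1(x) is high and mostly 0 where it is low
frac_high = y[x > 1.0].mean()
frac_low = y[x < -1.0].mean()
print("fraction of label 1 for x > 1: ", frac_high)
print("fraction of label 1 for x < -1:", frac_low)
```

The same input value can receive different labels on different draws, which is exactly what a conditional PMF (rather than a deterministic rule) allows.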





  ![](https://drive.google.com/uc?export=view&id=1czGO-9JUt6BqvceuPHK1XRTa87o4LjmR)

  The y-axis represents the conditional class probability $p_1(x) = Pr(Y = 1 \mid X = x)$,

  i.e. the probability that the label is 1 given the input x.

  ![](https://drive.google.com/uc?export=view&id=1GWkQ4i2cT51-EWsrbo_z5LS3xVQLyixv)
  