## Bayes' Formula

Bayes' theorem describes how to update probabilities based on new evidence. It is given by:

$$
P(A | B) = \frac{P(B | A) P(A)}{P(B)}
$$

where:
- \( P(A | B) \) is the **posterior probability** (probability of \( A \) given \( B \)).
- \( P(B | A) \) is the **likelihood** (probability of \( B \) given \( A \)).
- \( P(A) \) is the **prior probability** (initial probability of \( A \)).
- \( P(B) \) is the **marginal probability** (total probability of \( B \)).

### **Example:**
Suppose a medical test for a disease has:
- **Sensitivity**: \( P(\text{Positive} | \text{Disease}) = 0.99 \)
- **False positive rate**: \( P(\text{Positive} | \neg \text{Disease}) = 0.05 \)
- **Prevalence**: \( P(\text{Disease}) = 0.01 \)

Using Bayes' theorem, the probability that a person actually has the disease given a positive test result is:

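$$
P(\text{Disease} | \text{Positive}) = \frac{0.99 \times 0.01}{(0.99 \times 0.01) + (0.05 \times 0.99)} = \frac{0.0099}{0.0594} \approx 0.167
$$

The denominator is \( P(\text{Positive}) \) from the law of total probability; the second \( 0.99 \) is \( P(\neg \text{Disease}) = 1 - 0.01 \). Even after a positive result, the probability of disease is only about 17% because the disease is rare.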
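As a quick check of the arithmetic, the example can be sketched in a few lines of Python. The variable names are illustrative only; the numbers are the ones given in the example.

```python
# Values taken from the medical test example.
sensitivity = 0.99          # P(Positive | Disease)
false_positive_rate = 0.05  # P(Positive | not Disease)
prevalence = 0.01           # P(Disease)

# Law of total probability gives the marginal P(Positive).
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: posterior = likelihood * prior / marginal.
posterior = sensitivity * prevalence / p_positive
print(f"P(Disease | Positive) = {posterior:.4f}")  # 0.1667
```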

This formula helps in **medical diagnostics, spam filtering, machine learning, and decision-making**.


## Capital Pi

#### Product Notation (Capital Pi) (Π)

The product of a sequence is defined as:

$$
P = \prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdot \dots \cdot a_n
$$

##### Example Calculation:
$$
\prod_{i=1}^{4} i = 1 \times 2 \times 3 \times 4 = 24
$$
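The same product can be checked in Python. This is a minimal sketch using the standard library's `math.prod` (available in Python 3.8 and later).

```python
import math

# Product of i for i = 1..4, i.e. 1 * 2 * 3 * 4.
product = math.prod(range(1, 5))
print(product)  # 24
```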

## Derivative


The **derivative** of a function measures the rate at which the function value changes as its input changes. Mathematically, it is defined as:

$$
f'(x) = \lim_{{h \to 0}} \frac{f(x + h) - f(x)}{h}
$$

### Example Calculation:
For the function \( f(x) = x^2 \), the derivative is calculated as:

$$
\frac{d}{dx} x^2 = 2x
$$

Thus, at \( x = 3 \):

$$
f'(3) = 2(3) = 6
$$
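The limit definition can be approximated numerically by using a small but nonzero \( h \). The sketch below uses a central difference, a common variant of the forward difference in the definition; the helper names and the step size are illustrative choices.

```python
def derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    return x ** 2


approx = derivative(square, 3.0)
print(approx)  # close to the exact value 2 * 3 = 6
```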

---


## Gradient

The **gradient** of a multivariable function is a vector containing its partial derivatives with respect to each variable. It points in the direction of the steepest ascent.

For a function \( f(x, y) \), the gradient is defined as:

$$
\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)
$$

### Example Calculation:
Given \( f(x, y) = x^2 + y^2 \), the gradient is:

$$
\nabla f = \left( \frac{\partial}{\partial x} (x^2 + y^2), \frac{\partial}{\partial y} (x^2 + y^2) \right)
$$

Computing the partial derivatives:

$$
\nabla f = \left( 2x, 2y \right)
$$

At \( (x, y) = (1, 2) \):

$$
\nabla f(1,2) = (2(1), 2(2)) = (2, 4)
$$

This means the function increases most rapidly in the direction of **(2,4)**.
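A numerical check of this gradient, assuming a central-difference approximation for each partial derivative (the function names and step size are illustrative):

```python
def f(x, y):
    return x ** 2 + y ** 2


def numerical_gradient(func, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) by varying one input at a time."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy


grad = numerical_gradient(f, 1.0, 2.0)
print(grad)  # approximately (2, 4)
```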


## Linear Regression

### **Definition:**
Linear Regression models the relationship between an independent variable \( X \) and a dependent variable \( Y \) using a linear function.

### **Equation:**
The equation for simple linear regression is:

$$
Y = w_0 + w_1 X + \epsilon
$$

Where:
- \( Y \) is the **observed output** (dependent variable).
- \( X \) is the **input feature** (independent variable).
- \( w_0 \) (intercept) and \( w_1 \) (coefficient) are the **parameters** learned from data.
- \( \epsilon \) represents the **error term** (the part of \( Y \) not explained by the linear function).

For multiple features \( X_1, X_2, ..., X_n \), the **multiple linear regression** model is:

$$
Y = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n + \epsilon
$$

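The **objective** in linear regression is to minimize the **Mean Squared Error (MSE)** between the observed values \( Y_i \) and the predictions \( \hat{Y}_i \) (the fitted linear function without the error term) over \( N \) observations: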

$$
MSE = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2
$$
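A minimal sketch of simple linear regression on made-up data, assuming the standard closed-form least-squares solution for \( w_0 \) and \( w_1 \) (the data values are hypothetical and chosen only for illustration):

```python
# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least squares: slope = covariance(x, y) / variance(x).
w1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
w0 = y_mean - w1 * x_mean

# Predictions and mean squared error.
y_hat = [w0 + w1 * xi for xi in x]
mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}, MSE = {mse:.2f}")
```

Any other choice of \( w_0 \) and \( w_1 \) gives a larger MSE on this data.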

---

## Logistic Regression

### **Definition:**
Logistic Regression is used for **binary classification**. It predicts the probability that an instance belongs to a particular class.

### **Equation:**
Logistic regression applies the **sigmoid function** to a linear model:

$$
P(Y = 1 | X) = \sigma(w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n)
$$

where the **sigmoid function** \( \sigma(z) \) is:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

This ensures that the output is in the range \( (0,1) \), representing a probability.

### **Decision Rule:**
To classify an instance:
- If \( P(Y=1|X) \geq 0.5 \), predict class **1**.
- Otherwise, predict class **0**.

### **Objective Function:**
Logistic regression minimizes the **log-loss (cross-entropy loss)**:

$$
J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ Y_i \log(\hat{Y}_i) + (1 - Y_i) \log(1 - \hat{Y}_i) \right]
$$

where \( \hat{Y}_i \) is the predicted probability.
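The sigmoid, the decision rule, and the log-loss can be sketched together. The weights, bias, and data below are hypothetical values picked by hand, not parameters learned from data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """P(Y = 1 | X) for one instance: sigmoid of the linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_loss(y_true, y_prob):
    """Average cross-entropy between labels and predicted probabilities."""
    n = len(y_true)
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / n


# Hypothetical parameters and data (two features per instance).
weights, bias = [1.5, -0.5], 0.2
X = [[1.0, 2.0], [-1.0, 0.5]]
y = [1, 0]

probs = [predict_proba(weights, bias, row) for row in X]
labels = [1 if p >= 0.5 else 0 for p in probs]  # decision rule at 0.5
loss = log_loss(y, probs)
print(probs, labels, loss)
```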


## Matrix


#### Addition
$$
A + B =
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} +
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} =
\begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{bmatrix}
$$

$$
\begin{bmatrix} 10 & 8 \\ 6 & 4 \end{bmatrix} +
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} =
\begin{bmatrix} 11 & 10 \\ 9 & 8 \end{bmatrix}
$$

#### Subtraction

$$
A - B =
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} -
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} =
\begin{bmatrix} a_{11} - b_{11} & a_{12} - b_{12} \\ a_{21} - b_{21} & a_{22} - b_{22} \end{bmatrix}
$$

$$
\begin{bmatrix} 10 & 8 \\ 6 & 4 \end{bmatrix} -
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} =
\begin{bmatrix} 9 & 6 \\ 3 & 0 \end{bmatrix}
$$


#### Multiplication 

$$
\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \times
\begin{bmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{bmatrix} =
\begin{bmatrix} (1\cdot7 + 2\cdot9 + 3\cdot11) & (1\cdot8 + 2\cdot10 + 3\cdot12) \\ (4\cdot7 + 5\cdot9 + 6\cdot11) & (4\cdot8 + 5\cdot10 + 6\cdot12) \end{bmatrix} =
\begin{bmatrix} 58 & 64 \\ 139 & 154 \end{bmatrix}
$$
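The row-by-column rule used in this example can be written as a short function. This is a plain-Python sketch for illustration (the `matmul` helper is a name introduced here); in practice a library such as NumPy would normally be used.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = matmul(A, B)
print(C)  # [[58, 64], [139, 154]]
```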


## Random Variables

A **discrete random variable** is a random variable that takes on a **countable** number of distinct values, typically integers.

### **Mathematical Representation:**
A discrete random variable \( X \) has a probability mass function (PMF) given by:

$$
P(X = x) = p(x), \quad \sum_{x} p(x) = 1
$$

where \( p(x) \) is the probability of \( X \) taking a specific value \( x \).

### **Example:**
Consider rolling a six-sided die. The possible outcomes are:

$$
X = \{1, 2, 3, 4, 5, 6\}
$$

Each outcome has an equal probability:

$$
P(X = x) = \frac{1}{6}, \quad x \in \{1,2,3,4,5,6\}
$$
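The die PMF can be represented as a dictionary. The sketch below uses exact fractions to confirm that the probabilities sum to 1 and to compute the probability of an event by adding PMF values.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

total = sum(pmf.values())        # must equal 1 for a valid PMF
p_at_most_2 = pmf[1] + pmf[2]    # P(X <= 2)
print(total, p_at_most_2)        # 1 1/3
```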

A **continuous random variable** is a random variable that can take on **uncountably many values**, such as any real number within a given range. It is described by a **probability density function (PDF)** instead of a PMF.

### **Mathematical Representation:**
A continuous random variable \( X \) has a probability density function (PDF) \( f(x) \), where:

$$
P(a \leq X \leq b) = \int_{a}^{b} f(x) dx
$$

and the total probability satisfies:

$$
\int_{-\infty}^{\infty} f(x) dx = 1
$$

### **Example:**
Consider a standard normal distribution \( X \sim N(0,1) \), which has the probability density function:

$$
f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}
$$

The probability that \( X \) lies between -1 and 1 is:

$$
P(-1 \leq X \leq 1) = \int_{-1}^{1} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} dx
$$

This represents the area under the normal curve between -1 and 1.
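This integral has no elementary closed form, but it can be evaluated through the error function. A minimal sketch, using the standard identity \( \Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right) \) for the standard normal CDF:

```python
import math


def standard_normal_cdf(x):
    """CDF of N(0, 1) expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


prob = standard_normal_cdf(1.0) - standard_normal_cdf(-1.0)
print(f"P(-1 <= X <= 1) = {prob:.4f}")  # 0.6827
```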

---

## Skewness

Skewness measures the **asymmetry** of the distribution of a dataset.

$$
\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3
$$

### Definitions:
- $n$ = Number of observations
- $x_i$ = Individual data point
- $\bar{x}$ = Mean of the data
- $s$ = Sample standard deviation

### Interpretation:
- **Skewness = 0** → Perfectly symmetric (e.g., a normal distribution)
- **Skewness > 0** → Right-skewed (positive skew, tail on the right)
- **Skewness < 0** → Left-skewed (negative skew, tail on the left)
- **|Skewness| > 1** → Highly skewed (might need transformation)
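The skewness formula can be implemented directly. The sketch assumes \( s \) is the sample standard deviation (dividing by \( n - 1 \)); the two datasets are made up for illustration.

```python
import math


def sample_skewness(data):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value creates a right tail

print(sample_skewness(symmetric))     # zero for symmetric data
print(sample_skewness(right_skewed))  # positive
```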

---

####  Fixing Skewness
If the data is **highly skewed**, transformations can help make it more normally distributed.

#### Fixing Right (Positive) Skewness
If the data has a **long right tail**, try these transformations:

- **Log Transformation**:  
  $$
  X' = \log(X + c)
  $$
  (Useful for right-skewed data, where \( c \) is a small constant to avoid log(0))
  
- **Square Root Transformation**:  
  $$
  X' = \sqrt{X}
  $$
  (Moderate effect on skewness)
  
- **Reciprocal Transformation**:  
  $$
  X' = \frac{1}{X}
  $$
  (Strong effect but inverts the order)
  
- **Box-Cox Transformation**:  
  Best for normalizing skewed data (**requires all positive values**).

### Fixing Left (Negative) Skewness

If the data has a **long left tail**, it means the majority of values are concentrated on the right side. To correct this, we can apply transformations that **stretch** the right side (large values) and **compress** the left tail.

#### **Power Transformation**
Power transformations raise the values to an exponent greater than 1, making large values grow faster than small ones (for non-negative data).

$$
X' = X^p, \quad p > 1
$$

where $p$ is a power greater than 1. Common choices are:
- **Square transformation**: $X' = X^2$
- **Cube transformation**: $X' = X^3$

#### **Exponential Transformation**
Exponential transformation amplifies differences among large values, spreading out the right side so that the left tail becomes relatively shorter.

$$
X' = e^X
$$

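Because \( e^X \) is defined and strictly increasing for every real \( X \), this transformation also works when the data contain **negative values** (squaring, by contrast, does not preserve order across zero).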

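#### **Reflect and Apply Right-Skewness Correction**
For left-skewed datasets, including those containing **negative or zero values**, a simple power transformation might not work. Instead, **reflect the data** (which turns the left skew into a right skew), apply a right-skew correction (like log or Box-Cox), and then reflect it back.

1. Reflect the data:  
   $$
   X_{\text{reflected}} = \max(X) + \min(X) - X
   $$
2. Apply **log transformation**, choosing \( c \) so that \( X_{\text{reflected}} + c > 0 \):
   $$
   X' = \log(X_{\text{reflected}} + c)
   $$
3. Reflect the transformed values back so that the original ordering is restored:
   $$
   X'' = \max(X') + \min(X') - X'
   $$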


## Kurtosis

Kurtosis measures the **tailedness** of the distribution of a dataset. It describes whether the data has **heavy tails (high kurtosis)** or **light tails (low kurtosis)**. A normal distribution has a **kurtosis of 3** (excess kurtosis = 0). Lower values mean fewer extreme observations, and higher values mean more. Too many extreme values can be a problem because they may distort the fitted distribution (if they are not consistent with the actual distribution), while too few can leave a model without enough information about the tails.

$$
\text{Excess Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
$$


### Interpretation:
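- **Kurtosis ≈ 3** (excess ≈ 0) → Normal distribution (**Mesokurtic**)
- **Kurtosis > 3** (excess > 0) → Heavy tails, more extreme outliers (**Leptokurtic**)
- **Kurtosis < 3** (excess < 0) → Light tails, fewer outliers (**Platykurtic**)

The sample formula above already subtracts the normal baseline, so its result is excess kurtosis and should be compared with 0 rather than 3.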
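A direct implementation of the sample formula, as a sketch that assumes \( s \) is the sample standard deviation (dividing by \( n - 1 \)). Because of the subtracted term, the function returns excess kurtosis; the data are made up for illustration.

```python
import math


def sample_excess_kurtosis(data):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


flat = [1, 2, 3, 4, 5]  # evenly spread values, light tails
print(sample_excess_kurtosis(flat))  # about -1.2, i.e. platykurtic
```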

---

### **Fixing High Kurtosis (Heavy Tails)**

#### **1. Winsorization (Clipping Extreme Values)**
Winsorization limits extreme values to reduce their impact.

$$
X' = 
\begin{cases} 
\text{upper bound}, & X > \text{upper threshold} \\
\text{lower bound}, & X < \text{lower threshold} \\
X, & \text{otherwise}
\end{cases}
$$

In [None]:
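from scipy.stats.mstats import winsorize
from scipy.stats import boxcox
import numpy as np
import pandas as pd

# Assumes `df` is an existing pandas DataFrame with a numeric column 'feature'
# whose values are non-negative (needed by the log, sqrt, and Box-Cox lines).

# High kurtosis: clip (not drop) the lowest and highest 5% of values
df['winsorized'] = winsorize(df['feature'], limits=[0.05, 0.05])

# Left skew: power and exponential transformations
df['power_transformed'] = np.power(df['feature'], 2)  # Square transformation
df['exp_transformed'] = np.exp(df['feature'])  # Exponential transformation

# Left skew: reflect, apply log transformation, then reflect back
df['reflected'] = df['feature'].max() + df['feature'].min() - df['feature']
df['log_reflected'] = np.log(df['reflected'] + 1)
# Reflect within the log scale so the original ordering is restored
df['final_transformed'] = df['log_reflected'].max() + df['log_reflected'].min() - df['log_reflected']

# Right skew: log, square root, and Box-Cox transformations
df['log_transformed'] = np.log(df['feature'] + 1)  # Avoid log(0)
df['sqrt_transformed'] = np.sqrt(df['feature'])
df['boxcox_transformed'], _ = boxcox(df['feature'] + 1)  # Box-Cox needs strictly positive input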



## Standardization

### **You Should Standardize If:**
**You’re Using Distance-Based or Other Scale-Sensitive Algorithms**  
   - Algorithms that rely on **distance metrics** (e.g., Euclidean, Manhattan), on feature variances, or on gradient-based optimization perform better when features are standardized.
   - Examples:
     - $K$-Nearest Neighbors (KNN)
     - Support Vector Machines (SVM)
     - Principal Component Analysis (PCA), which is driven by feature variances
     - $K$-Means Clustering
     - Linear Regression (if gradient descent is used), because convergence is faster on standardized features

**Your Features Have Different Scales**  
   - If some features are on **very different scales**, standardization helps **prevent one feature from dominating the model**.
   - Example:
     - **Feature 1:** Age (20 to 80)
     - **Feature 2:** Salary (20,000 to 200,000)
   - **Without standardization**, the model might focus too much on salary because it has larger values.

**You’re Using Regularization (Lasso, Ridge, ElasticNet)**  
   - Standardization ensures that **penalty terms** ($L_1, L_2$) are applied **fairly across all features**.

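**Your Model is Sensitive to Outliers**  
   - Standardizing puts features on a common scale by centering around zero, but it does **not remove extreme values**: the mean and standard deviation are themselves pulled by outliers. When outliers are present, consider robust scaling (median and interquartile range) instead.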
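Standardization here means the usual z-score, \( z = (x - \bar{x}) / s \). The sketch below applies it to hypothetical age and salary values matching the ranges in the example above; it assumes the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical values spanning the ranges from the example.
ages = [20, 35, 50, 65, 80]
salaries = [20_000, 65_000, 110_000, 155_000, 200_000]

z_ages = standardize(ages)
z_salaries = standardize(salaries)
print(z_ages)
print(z_salaries)
```

After standardization both features have mean 0 and standard deviation 1, so neither dominates a distance computation simply because of its units.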

---

### **You Can Skip Standardization If:**
**You’re Using Tree-Based Models**  
   - **Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM, CatBoost)** **do not require standardization**.
   - These models are not affected by feature magnitudes because they split data **based on thresholds, not distance**.

**Your Features Are Already on Similar Scales**  
   - If all features are on similar scales (e.g., all between $0$ and $1$), standardization may not add much value.

**You’re Working with Categorical Variables**  
   - Standardization applies only to **numerical features**.
   - **One-hot encoded variables do not need standardization**.

## Summation Notation (Σ)


The summation of a sequence is defined as:

$$
S = \sum_{i=1}^{n} a_i = a_1 + a_2 + \dots + a_n
$$

##### Example Calculation:
$$
\sum_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 = 15
$$
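The same sum in Python, as a one-line check with the built-in `sum`:

```python
# Sum of i for i = 1..5, i.e. 1 + 2 + 3 + 4 + 5.
total = sum(range(1, 6))
print(total)  # 15
```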

## Unbiased Estimator

An **unbiased estimator** is a statistical estimator whose expected value is equal to the parameter it is estimating. Formally, an estimator \( \hat{\theta} \) of a parameter \( \theta \) is **unbiased** if:

$$
E[\hat{\theta}] = \theta
$$

This means that, on average, the estimator correctly estimates the true parameter value.

### **Example:**
Consider a random sample \( X_1, X_2, \dots, X_n \) drawn from a population with mean \( \mu \). The sample mean:

$$
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i
$$

is an **unbiased estimator** of the population mean \( \mu \), since:

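$$
E[\hat{\mu}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu
$$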

This implies that the expected value of the sample mean is equal to the true population mean.
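Unbiasedness can be illustrated by simulation: draw many samples, compute the sample mean of each, and average those means. The population parameters, sample size, and seed below are arbitrary assumptions chosen for the sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma = 5.0, 2.0          # hypothetical population mean and std
n, repetitions = 10, 20_000   # sample size and number of repeated samples

sample_means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(repetitions)
]

# Individual sample means vary, but their average sits close to mu.
average_of_means = statistics.fmean(sample_means)
print(average_of_means)
```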

## Variance Inflation Factor (VIF)

$$
VIF_i = \frac{1}{1 - R^2_i}
$$

where:

- $VIF_i$ is the Variance Inflation Factor for the $i$th predictor variable.
- $R_i^2$ is the coefficient of determination obtained by regressing $X_i$ on all the other independent variables in the model.
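A minimal sketch of the VIF computation for a model with only two predictors, where regressing one predictor on the other is a simple linear regression and \( R^2 \) equals the squared correlation. The data and the `r_squared` helper are hypothetical; with more predictors, each \( R_i^2 \) comes from a multiple regression instead.

```python
def r_squared(target, predictor):
    """R^2 from regressing `target` on one `predictor` with an intercept."""
    n = len(target)
    t_mean = sum(target) / n
    p_mean = sum(predictor) / n
    s_tp = sum((t - t_mean) * (p - p_mean) for t, p in zip(target, predictor))
    s_tt = sum((t - t_mean) ** 2 for t in target)
    s_pp = sum((p - p_mean) ** 2 for p in predictor)
    return s_tp ** 2 / (s_tt * s_pp)


# Hypothetical predictor columns.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r2 = r_squared(x1, x2)   # R^2 of X1 regressed on X2
vif = 1.0 / (1.0 - r2)
print(f"R^2 = {r2:.2f}, VIF = {vif:.2f}")
```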


## Interpretation of VIF

$$
\begin{array}{|c|c|}
\hline
\textbf{VIF Value} & \textbf{Interpretation} \\
\hline
1 & \text{No multicollinearity (ideal)} \\
\hline
\text{1 to 5} & \text{Low to moderate multicollinearity (acceptable)} \\
\hline
\text{5 to 10} & \text{High multicollinearity (requires investigation)} \\
\hline
> 10 & \text{Severe multicollinearity (problematic, needs correction)} \\
\hline
\end{array}
$$

## Handling High VIF
- **Remove one of the correlated variables** (e.g., drop redundant predictors).
- **Feature engineering** (e.g., combine correlated variables using PCA).
- **Collect more data** (can sometimes reduce collinearity).
- **Use Ridge Regression** (adds regularization to limit high coefficient values).



## Requirements for Statistical Inference

#### Random
- The sample must be **randomly selected** to avoid bias.

#### Independent
- Each observation must be **independent** of others.

#### Sufficient Records
- A larger sample **reduces variability** and provides **better estimates**. Small samples often lead to **higher standard errors** and **less reliable inferences**.

#### Normal Distribution
- Many tests (**t-test, ANOVA, regression**) assume **normally distributed data**. If \( n \) is large, normality is **less of a concern** (the Central Limit Theorem applies).
- **Shapiro-Wilk Test:** Checks for normality.
- **Q-Q Plot:** Data should align with the diagonal.

#### Homoscedasticity (Equal Variance) 
- The variance should be **constant** across different levels of the independent variable. Unequal variance (heteroscedasticity) can **bias standard errors**.
- **Levene’s Test:** Tests for equal variances.
- **Residual Plot:** If residuals show a **funnel shape**, variance is **not equal**.

DATASCI203

DATASCI 205 - t
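## Difference in Expectation

**Basic setup:**
- Suppose \( X_1, X_2, \dots \) are i.i.d. with mean \( \mu_X \).
- Suppose \( Y_1, Y_2, \dots \) are i.i.d. with mean \( \mu_Y \).

**Null hypothesis:** \( H_0: \mu_X = \mu_Y \) (the two population means are equal).

**Alternative hypotheses:**
- \( H_1: \mu_X \neq \mu_Y \) (two-sided)
- \( H_2: \mu_X > \mu_Y \) (one-sided)
- \( H_3: \mu_X < \mu_Y \) (one-sided)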

**Two-sample t-test:** tests whether two populations have the same expectation while accounting for variability.

$$
T = \frac{\bar{X} - \bar{Y}}{\text{estimated standard deviation of } (\bar{X} - \bar{Y})}
$$

Procedure:
1. Specify the model, the null hypothesis, and the rejection criteria.
2. Calculate the statistic.
3. Locate the statistic on the null distribution to get the p-value.

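#### Student's t-Test (Pooled Estimate)
- Tests whether the mean of \( X \) equals the mean of \( Y \).
- Uses a pooled estimate of the standard deviation.
- Only appropriate if the two population standard deviations are the same.
- Levene's test can check whether the standard deviations are the same, but it is of limited value.

#### Welch's t-Test
- More general: does not require the standard deviations to be the same.
- The degrees of freedom are more complex to compute.

#### Degrees of Freedom
The number of independent pieces of information that are free to vary once the parameters have been estimated.
- One-sample t-test: \( df = n - 1 \)
- Two-sample (pooled) t-test: \( df = n_1 + n_2 - 2 \)

**Which to use:** Welch's. Student's pooled test offers too little benefit to justify its extra equal-variance assumption.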
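The notes do not spell out Welch's degrees of freedom. A minimal sketch, assuming the standard Welch-Satterthwaite approximation and two made-up samples (the function name is illustrative):

```python
import math
import statistics


def welch_t(x, y):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    se_squared = v1 / n1 + v2 / n2  # variance of the difference in means
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(se_squared)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_squared ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df


# Hypothetical samples with clearly different variances.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
t_stat, df = welch_t(x, y)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The resulting degrees of freedom fall between \( \min(n_1, n_2) - 1 \) and the pooled value \( n_1 + n_2 - 2 \). Turning the statistic into a p-value requires the t distribution's CDF, which is not in the Python standard library.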

Correlation is a measure of linear dependence. What, then, are the possible values for correlation when one random variable is a linear function of another? To fix terms, suppose that \( X \) and \( Y \) are random variables, \( a \neq 0 \) is a constant, and \( b \) is any real constant. Furthermore, suppose that \( Y \) is a function of \( X \) of the form \( Y = aX + b \). What are the possible values for the correlation between \( X \) and \( Y \)?
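Since \( \text{Cov}(X, aX + b) = a \, \text{Var}(X) \) and the standard deviation of \( aX + b \) is \( |a| \) times that of \( X \), the correlation is \( a / |a| \): it is \( +1 \) when \( a > 0 \) and \( -1 \) when \( a < 0 \). A quick numerical check on made-up data (the helper and values are illustrative):

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    s_yy = sum((b - y_mean) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [1.0, 2.0, 4.0, 7.0]  # hypothetical values of X

for a, b in [(2.0, 1.0), (-0.5, 3.0)]:
    y = [a * xi + b for xi in x]
    print(a, b, correlation(x, y))  # +1 for a > 0, -1 for a < 0
```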

- **Quantitative data** is numbers-based, countable, or measurable.
- **Qualitative data** is interpretation-based, descriptive, and relates to language.
- **Empirical:** derived from or guided by direct experience or experiment, rather than abstract principles or theory.

Probability: Reasoning under uncertainty

"All models are wrong, some are useful."

**Subjective probability:** someone's assessment of what the probability is, based on the information available to them.

**Canonical** (mathematics): of an equation, coordinate, etc., in simplest or standard form.

**Axiomatic approach** (Thomas Kuhn): start with axiomatic statements, move to intermediate statements, and end with a testable hypothesis.

**Axiomatic:** pertaining to or of the nature of an axiom; self-evident; obvious.

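Set notation:
- \( A \setminus B \): elements in \( A \) but not in \( B \).
- \( A^c \) (complement): everything that is not in \( A \).
- A set contains objects, is not ordered, and cannot contain repeats. It is written with braces \( \{\,\} \), and a colon means "such that".
- \( \in \): is an element of; \( \notin \): is not an element of; \( \emptyset \): the empty set.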

Probability theory is a mathematical construct used to represent processes involving randomness, unpredictability, or intrinsic uncertainty.

In a setting in which there are several possible outcomes, each with some probability of occurring, we refer to the process by which the outcome is determined as a random generative process.


**Efficiency:** measured by mean squared error. An efficient estimator approaches the true value more quickly (with fewer observations) than other estimators.

**Consistency:** as the number of observations grows, the estimator gets progressively closer to the true parameter value.

**IID (independent and identically distributed):** the formal definition of a "random sample".
- If each observation comes from the same probability distribution, the observations are identically distributed.
- If no observation in the sample provides information about any other observation in the sample, they are independent.
