<a href="https://colab.research.google.com/github/davidofitaly/notes_02_50_key_stats_ds/blob/main/05_chapter/01_raw_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Naïve Bayes Classifier

#####Naïve Bayes is a probabilistic classification model based on **Bayes' theorem** with the assumption of feature independence.

##### **Bayes' Theorem**  

Bayes' theorem describes the relationship between conditional probabilities:  
$$ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} $$  
Where:  
- $ P(A \mid B) $ – **posterior probability** (after observing $ B $),  
- $ P(B \mid A) $ – **likelihood** (conditional probability of $ B $ given $ A $),  
- $ P(A) $ – **prior probability** (before observing $ B $),  
- $ P(B) $ – **marginal probability** of $ B $ (the evidence), which normalizes the result.
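
As a quick numeric check of the formula, here is a minimal sketch. The function name `posterior` and all the probabilities are hypothetical, chosen only for illustration; $ P(B) $ is expanded with the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from Bayes' theorem."""
    # Total probability of B: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical numbers: A occurs 1% of the time, B is observed in 95% of
# cases where A holds and in 10% of cases where it does not.
p = posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.10)
print(round(p, 4))  # about 0.0876
```

Even with a strong likelihood, the small prior keeps the posterior below 9% in this example.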

##### **Independence Assumption**  

For a feature set $ x_1, x_2, \dots, x_n $, the model assumes conditional independence:  
$$ P(x_1, x_2, \dots, x_n \mid C) = P(x_1 \mid C) P(x_2 \mid C) \dots P(x_n \mid C) $$  

##### **Classification Rule**  

For a given class $ C_k $, the posterior probability is computed as:  
$$ P(C_k \mid x_1, x_2, \dots, x_n) = \frac{P(x_1 \mid C_k) P(x_2 \mid C_k) \dots P(x_n \mid C_k) P(C_k)}{P(x_1, x_2, \dots, x_n)} $$  
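The denominator is the same for every class, so the predicted class is the one that maximizes the numerator $ P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) $, which is also the class with the highest posterior probability.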
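
The rule can be sketched from scratch for categorical features by estimating every probability with simple counts. The toy dataset and the names `rows` and `posteriors` are hypothetical. No smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (outlook, windy) -> play
rows = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
    (("rainy", "yes"), "yes"),
]

class_counts = Counter(label for _, label in rows)
# feature_counts[(label, j)][value] = how often feature j takes `value` within the class
feature_counts = defaultdict(Counter)
for features, label in rows:
    for j, value in enumerate(features):
        feature_counts[(label, j)][value] += 1

def posteriors(features):
    """Return P(C_k | x) for every class under the independence assumption."""
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / len(rows)  # prior P(C_k)
        for j, value in enumerate(features):
            score *= feature_counts[(label, j)][value] / n_c  # P(x_j | C_k)
        scores[label] = score
    evidence = sum(scores.values())  # P(x_1, ..., x_n), shared by all classes
    return {label: s / evidence for label, s in scores.items()}

post = posteriors(("sunny", "yes"))
prediction = max(post, key=post.get)
print(post, prediction)
```

For the query `("sunny", "yes")` the class `"no"` receives posterior 2/3 and is predicted.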



### Discriminant Analysis

#####Discriminant Analysis is a technique used to classify observations into predefined classes based on predictor variables. It focuses on finding boundaries between classes by analyzing differences in means and variances.

##### 1. Covariance

#####In discriminant analysis, covariance represents the relationship between variables within each class. For **Linear Discriminant Analysis (LDA)**, we assume that all classes share the same covariance matrix, denoted $ \Sigma $.

#####Covariance matrix for class $ k $:

$$
\Sigma_k = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} (x_i - \mu_k)(x_i - \mu_k)^T
$$

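Where:
- $ n_k $ is the number of observations in class $ k $,
- $ x_i $ is the $ i $-th observation in class $ k $,
- $ \mu_k $ is the mean vector for class $ k $.

The shared matrix $ \Sigma $ used by LDA is typically estimated by pooling the class covariances, with $ N $ the total number of observations and $ K $ the number of classes:

$$
\Sigma = \frac{1}{N - K} \sum_{k=1}^{K} (n_k - 1) \Sigma_k
$$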
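
The class covariance formula can be checked by hand on a few points. This is a minimal pure-Python sketch; the helper names and the three observations are hypothetical.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def covariance_matrix(points):
    """Sample covariance of one class (divides by n_k - 1)."""
    n = len(points)
    mu = mean_vector(points)
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        diff = [p[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b]  # entry of (x_i - mu_k)(x_i - mu_k)^T
    return [[v / (n - 1) for v in row] for row in cov]

# Hypothetical observations for one class
class_k = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
sigma_k = covariance_matrix(class_k)
print(sigma_k)  # [[1.0, 2.0], [2.0, 4.0]]
```

The diagonal holds the variance of each predictor and the off-diagonal entries hold their covariance.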

##### 2. Discriminant Function

#####The **discriminant function** is a linear function used to classify new observations. It is defined as:

$$
g_k(x) = \mathbf{w_k}^T x + b_k
$$

Where:
- $ g_k(x) $ is the discriminant function for class $ k $,
- $ \mathbf{w_k} $ is the weight vector,
- $ x $ is the predictor vector,
- $ b_k $ is the bias term.

##### 3. Weights of the Discriminant Function

#####The **weights** reflect the importance of each predictor in classifying data. The weight vector $ \mathbf{w_k} $ is calculated as:

$$
\mathbf{w_k} = \Sigma^{-1} (\mu_k - \mu)
$$

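Where:
- $ \Sigma^{-1} $ is the inverse of the pooled covariance matrix,
- $ \mu_k $ is the mean vector for class $ k $,
- $ \mu $ is the overall mean vector of all classes.

Subtracting $ \mu $ shifts every $ g_k(x) $ by the same amount $ x^T \Sigma^{-1} \mu $, so it does not change which class scores highest. Many texts therefore write the weights simply as $ \mathbf{w_k} = \Sigma^{-1} \mu_k $.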

#####The bias term is calculated as:

$$
b_k = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln(\pi_k)
$$

Where:
- $ \pi_k $ is the prior probability of class $ k $.
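
Putting the pieces together, here is a minimal two-feature LDA sketch in pure Python. It rests on several assumptions: the data and function names are hypothetical, the pooled covariance is estimated by dividing the within-class scatter by $ N - K $, priors are taken as class proportions, and the weights use the common form $ \mathbf{w_k} = \Sigma^{-1} \mu_k $, which differs from the centered form only by a term shared by all classes.

```python
import math

def mean_vector(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def lda_fit(classes):
    """classes maps a label to a list of 2-D points; returns label -> (w, b)."""
    n_total = sum(len(pts) for pts in classes.values())
    k = len(classes)
    means = {c: mean_vector(pts) for c, pts in classes.items()}
    # Pooled covariance: within-class scatter summed over classes, divided by N - K
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for c, pts in classes.items():
        for x in pts:
            d = [x[0] - means[c][0], x[1] - means[c][1]]
            for a in range(2):
                for b in range(2):
                    pooled[a][b] += d[a] * d[b] / (n_total - k)
    inv = inverse_2x2(pooled)
    params = {}
    for c, pts in classes.items():
        mu = means[c]
        w = [inv[0][0] * mu[0] + inv[0][1] * mu[1],
             inv[1][0] * mu[0] + inv[1][1] * mu[1]]
        prior = len(pts) / n_total  # pi_k estimated as the class proportion
        bias = -0.5 * (mu[0] * w[0] + mu[1] * w[1]) + math.log(prior)
        params[c] = (w, bias)
    return params

def lda_predict(params, x):
    scores = {c: w[0] * x[0] + w[1] * x[1] + b for c, (w, b) in params.items()}
    return max(scores, key=scores.get)

# Hypothetical training data for two classes
data = {
    "a": [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)],
    "b": [(6.0, 6.0), (7.0, 5.0), (7.0, 7.0), (8.0, 6.0)],
}
params = lda_fit(data)
print(lda_predict(params, (2.5, 2.0)), lda_predict(params, (6.5, 6.0)))
```

A new point is assigned to the class whose discriminant function returns the largest value.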


### Logistic Regression


#####**Logistic regression** is a statistical method for modeling the relationship between a binary dependent variable and one or more independent variables. It applies when the outcome is categorical, most often with two classes (e.g., success/failure, yes/no).

#####The model predicts the probability that a given observation belongs to a particular class (usually coded as 0 or 1).

##### Key Concepts:

1. **Logistic Function** (Sigmoid Function):
   The logistic regression model uses the logistic function to predict probabilities. The logistic function is an S-shaped curve that outputs values between 0 and 1. The formula is:
   $$
   P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
   $$  
   Where:
   - $P(y=1|X)$ is the probability of the class being 1 (success).
   - $\beta_0$ is the intercept.
   - $\beta_1, \dots, \beta_n$ are the coefficients for the independent variables $X_1, X_2, \dots, X_n$.
   - $e$ is the base of the natural logarithm.

2. **Log-Odds**:
   The linear combination $\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$ is the log-odds (logit) of the outcome: it equals $\ln\frac{P(y=1|X)}{1 - P(y=1|X)}$, so the model is linear in the log-odds rather than in the probability itself.

3. **Odds**:
   The odds represent the ratio of the probability of an event happening to the probability of it not happening. The odds of event $y=1$ can be expressed as:
   $$
   \text{Odds} = \frac{P(y=1|X)}{1 - P(y=1|X)}
   $$

4. **Maximum Likelihood Estimation (MLE)**:
   The parameters $\beta_0, \beta_1, \dots, \beta_n$ are estimated using **Maximum Likelihood Estimation (MLE)**, which finds the values of the parameters that maximize the likelihood of the observed data.

5. **Interpretation of Coefficients**:
   - Each coefficient $\beta_i$ is the change in the log-odds of $y=1$ for a one-unit increase in $X_i$, holding the other predictors fixed.
   - Exponentiating a coefficient gives the **odds ratio** (OR): the factor by which the odds of the outcome are multiplied for a one-unit increase in that predictor.

   The odds ratio for a predictor $X_i$ is given by:
   $$
   OR_i = e^{\beta_i}
   $$
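
The key concepts above can be tied together in one small sketch with a single predictor. Everything here is illustrative: the data are hypothetical, and plain gradient ascent with a fixed step size is only one assumed way to maximize the likelihood (statistical packages typically use Newton-type methods).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of the observed labels under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.1, steps=5000):
    """Estimate (beta_0, beta_1) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient components: sum of (y - p), and sum of (y - p) * x
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data; the classes overlap, so the maximum likelihood estimate is finite
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit(xs, ys)

p5 = sigmoid(b0 + b1 * 5)
p6 = sigmoid(b0 + b1 * 6)
odds_at_5 = p5 / (1 - p5)
odds_at_6 = p6 / (1 - p6)
odds_ratio = math.exp(b1)  # equals odds_at_6 / odds_at_5
print(b0, b1, odds_ratio)
```

Raising the predictor by one unit multiplies the odds by `exp(b1)` regardless of the starting value, which is why the odds ratio is a convenient summary of a coefficient.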



### Evaluation of Classification Models

#####Evaluating a classification model is essential to understand its performance. Various metrics help assess how well a model distinguishes between different classes.

##### 1. Accuracy
Accuracy measures the proportion of correctly classified instances in the dataset:
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$
- $TP$ (True Positives): Correctly predicted positive cases.
- $TN$ (True Negatives): Correctly predicted negative cases.
- $FP$ (False Positives): Incorrectly predicted positive cases.
- $FN$ (False Negatives): Incorrectly predicted negative cases.

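**Limitation:** Accuracy may be misleading when classes are imbalanced: a model that always predicts the majority class can score high accuracy while never detecting the minority class.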
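
The limitation is easy to reproduce with made-up labels. In this sketch, both the 95/5 class split and the always-negative "model" are hypothetical.

```python
# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy, tp)  # 0.95 accuracy, yet 0 positives found
```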

##### 2. Confusion Matrix
The confusion matrix summarizes model performance by comparing predicted vs. actual values.

| Actual / Predicted | Positive (1) | Negative (0) |
|--------------------|-------------|-------------|
| **Positive (1)**   | TP          | FN          |
| **Negative (0)**   | FP          | TN          |

It helps calculate other classification metrics.
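
The four cells can be counted directly from paired labels. This is a minimal sketch assuming binary labels coded as 1 (positive) and 0 (negative); the function name and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for labels coded as 1 (positive) and 0 (negative)."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
# Rows are actual classes, columns are predicted classes, as in the table
matrix = [[tp, fn], [fp, tn]]
print(matrix)  # [[3, 1], [1, 3]]
```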

##### 3. Sensitivity (Recall)
Sensitivity (also called recall or **true positive rate**) measures how well the model identifies actual positive cases:
$$
\text{Sensitivity} = \frac{TP}{TP + FN}
$$
- High sensitivity means fewer false negatives.
- Important for medical diagnosis and fraud detection.

##### 4. Specificity
Specificity (**true negative rate**) measures how well the model identifies actual negative cases:
$$
\text{Specificity} = \frac{TN}{TN + FP}
$$
- High specificity means fewer false positives.
- Useful when false positives are costly (e.g., legal cases).

##### 5. Precision
Precision (**positive predictive value**) measures the proportion of predicted positive cases that are actually positive:
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
- High precision means fewer false positives.
- Important when false positives have high costs (e.g., spam filtering).
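
Sensitivity, specificity, and precision all follow from the same four counts. The counts below are hypothetical and the helper name is only illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives that were found
        "specificity": tn / (tn + fp),  # share of actual negatives that were found
        "precision": tp / (tp + fp),    # share of predicted positives that are correct
    }

# Hypothetical confusion-matrix counts
m = classification_metrics(tp=40, fn=10, fp=20, tn=130)
print(m)
```

Here sensitivity is 0.8 while precision is only about 0.67, showing that the two metrics answer different questions about the same predictions.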

##### 6. ROC Curve
The **Receiver Operating Characteristic (ROC) curve** plots the **true positive rate** (sensitivity) against the **false positive rate** (1 - specificity) for different classification thresholds.  
- A **perfect classifier** reaches the top-left corner (100% sensitivity, 100% specificity).
- The **diagonal line** represents a random classifier.

##### 7. AUC (Area Under Curve)
The **AUC** measures the area under the ROC curve. It equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case:
- **AUC = 1** → Perfect classification.
- **AUC = 0.5** → Random guessing.
- **Higher AUC** means the model ranks positives above negatives more reliably.
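
A ROC curve and its AUC can be computed by sweeping the threshold over the model's scores. The sketch below uses hypothetical labels and scores and assumes the trapezoidal rule for the area.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from highest to lowest
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and predicted scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, and the computed area is 8/9.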

##### 8. Lift
**Lift** measures how much better a model performs compared to random selection. It is defined as:
$$
\text{Lift} = \frac{\text{Precision of the model}}{\text{Baseline Precision}}
$$
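- **Baseline precision** is the proportion of positives in the whole dataset, i.e. the precision expected from random selection.
- A **lift of 2** means the cases the model selects are positive twice as often as cases picked at random.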
- Used in marketing and fraud detection.
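
In practice lift is often reported for the top-scored share of cases. The sketch below applies the definition that way. The function name, the data, and the choice of the top 20% are all hypothetical.

```python
def lift_at_top(y_true, scores, fraction):
    """Precision among the top-scored fraction divided by the overall positive rate."""
    n_top = max(1, round(len(y_true) * fraction))
    ranked = sorted(zip(scores, y_true), reverse=True)
    top_precision = sum(y for _, y in ranked[:n_top]) / n_top
    baseline = sum(y_true) / len(y_true)
    return top_precision / baseline

# Hypothetical labels and model scores for ten cases
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift = lift_at_top(y_true, scores, fraction=0.2)
print(round(lift, 2))
```

One of the two top-scored cases is positive (precision 0.5) against a base rate of 0.3, so targeting the top 20% finds positives about 1.67 times as often as random selection.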
