## Bayes' Theorem
- Mathematical way to invert conditional probabilities
- $P(A|B) = \frac{P(A\cap{B})}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$
- $P(A|B)$ and $P(B|A)$ are called conditional probabilities; $P(A)$ and $P(B)$ are called marginal probabilities
- Let $f_k(x) = P(X = x | Y = k)$ denote the density function of X for an observation that comes from the kth class. Then
  $P(Y = k | X = x) = \frac{P(Y = k, X = x)}{P(X = x)} = \frac{P(X = x | Y = k) P(Y = k)}{P(X = x)}$
  = $\frac{P(X = x | Y = k) P(Y = k)}{\sum_{l=1}^{K} P(X = x | Y = l) P(Y = l)}$ = $\frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$
- $P(Y = k | X = x)$ is the posterior probability that an observation X = x belongs to the kth class, and $\pi_k$ represents the overall or `prior probability` that a randomly chosen observation comes from the kth class
- To compute $P(Y = k | X = x)$, we need estimates of the $\pi_k$'s and $f_k$'s for k = 1, 2, ..., K
- Estimating the prior probabilities $\pi_1$, $\pi_2$, ..., $\pi_K$ is typically straightforward: for instance, we can estimate $\hat{\pi}_k$ as the proportion of training observations belonging to the kth class, for k = 1, ..., K.
- Naive Bayes assumes that, within the kth class, the p features are independent. Stated mathematically, this assumption means that for k = 1, ..., K,
  <br> $f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$
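
The posterior formula and the independence assumption above can be sketched in a few lines of NumPy. This is only an illustration, not how any particular library implements it: it assumes each per-feature density $f_{kj}$ is a univariate Gaussian, and the function names `fit_gaussian_nb` and `posterior`, as well as the toy data, are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the priors pi_k and per-class, per-feature Gaussian parameters."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])           # pi_k: class proportions
    means = np.array([X[y == k].mean(axis=0) for k in classes])     # mean of feature j in class k
    variances = np.array([X[y == k].var(axis=0) for k in classes])  # variance of feature j in class k
    return classes, priors, means, variances

def posterior(x, priors, means, variances):
    """Return P(Y = k | X = x) for every class k via Bayes' theorem."""
    # f_kj(x_j): assumed univariate Gaussian density of feature j in class k
    densities = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    f_k = densities.prod(axis=1)           # independence: f_k(x) = prod_j f_kj(x_j)
    numerators = priors * f_k              # pi_k * f_k(x)
    return numerators / numerators.sum()   # divide by sum_l pi_l * f_l(x)

# Hypothetical toy data: two features, two classes with an 80/20 split
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(80, 2)),
               rng.normal([3, 3], 1.0, size=(20, 2))])
y = np.array([0] * 80 + [1] * 20)

classes, priors, means, variances = fit_gaussian_nb(X, y)
post = posterior(np.array([3.0, 3.0]), priors, means, variances)
print(priors)  # estimated prior probabilities
print(post)    # posterior probabilities for x = (3, 3); they sum to 1
```

With many features the product of densities can underflow, so practical implementations usually sum log-densities instead of multiplying densities.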
  

In [1]:
import pandas as pd

In [10]:
df = pd.read_csv(r'Default.csv')
df['student'] = df['student'].factorize()[0]
df.head()

Unnamed: 0,default,student,balance,income
0,No,0,729.526495,44361.62507
1,No,1,817.180407,12106.1347
2,No,0,1073.549164,31767.13895
3,No,0,529.250605,35704.49394
4,No,0,785.655883,38463.49588


In [21]:
from sklearn.naive_bayes import GaussianNB
## https://scikit-learn.org/1.5/modules/naive_bayes.html

In [13]:
model = GaussianNB()
model.fit(df[['student', 'balance', 'income']], df['default'])

In [15]:
# Accuracy is measured on the same data used for fitting, so it is an optimistic estimate
model.score(df[['student', 'balance', 'income']], df['default'])

0.9707

In [16]:
from sklearn.metrics import confusion_matrix

In [17]:
y_pred = model.predict(df[['student', 'balance', 'income']])
confusion_matrix(df['default'], y_pred)

array([[9620,   47],
       [ 246,   87]], dtype=int64)
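
The counts in this confusion matrix are enough to recompute the headline metrics by hand. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in sorted label order (`No`, `Yes`); the variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows: true class, columns: predicted class; class order: No, Yes)
tn, fp = 9620, 47   # true "No":  predicted No, predicted Yes
fn, tp = 246, 87    # true "Yes": predicted No, predicted Yes
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_yes = tp / (tp + fp)   # of the predicted defaulters, how many really defaulted
recall_yes = tp / (tp + fn)      # of the real defaulters, how many were caught
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Accuracy of always predicting "No", for comparison
baseline = (tn + fp) / total

print(round(accuracy, 4), round(precision_yes, 2), round(recall_yes, 2), round(f1_yes, 2))
print(round(baseline, 4))
```

Always predicting `No` would already score 0.9667, so the 0.9707 accuracy says little on its own; the recall of about 0.26 for `Yes` shows that most defaulters are missed.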

In [18]:
from sklearn.metrics import classification_report

In [20]:
print(classification_report(df['default'], y_pred))

              precision    recall  f1-score   support

          No       0.98      1.00      0.98      9667
         Yes       0.65      0.26      0.37       333

    accuracy                           0.97     10000
   macro avg       0.81      0.63      0.68     10000
weighted avg       0.96      0.97      0.96     10000



In [25]:
# Per-class feature means (rows: classes No, Yes; columns: student, balance, income)
model.theta_

array([[2.91403745e-01, 8.03943750e+02, 3.35661666e+04],
       [3.81381381e-01, 1.74782169e+03, 3.20891471e+04]])

## LDA (Linear Discriminant Analysis)   
- If we assume that X follows a multivariate Gaussian distribution within each class (with a class-specific mean vector $\mu_k$ and a covariance matrix $\Sigma$ common to all classes), the resulting model is called LDA.
- With this assumption and some manipulation (taking the log of the posterior and dropping terms that do not depend on k), we get
  $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$ <br>
  We assign x to the class k for which $\delta_k(x)$ is the largest. $\delta_k$ is called the `discriminant` function; it is linear in x, hence the name LDA.
- If each class has its own covariance matrix $\Sigma_k$, we get `QDA (Quadratic Discriminant Analysis)`, whose discriminant function is quadratic in x.
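
The discriminant function can be evaluated directly with NumPy. This is a minimal sketch, not a reference implementation: it assumes the common covariance matrix is estimated by pooling the within-class scatter with an n - K denominator, and the names `fit_lda` and `discriminants` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means, and the pooled (common) covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        centered = X[y == k] - mu
        pooled += centered.T @ centered      # within-class scatter of class k
    pooled /= n - len(classes)               # assumed estimator: divide by n - K
    return classes, priors, means, pooled

def discriminants(X, priors, means, pooled):
    """Return delta_k(x) for every row x of X and every class k, shape (n, K)."""
    sigma_inv = np.linalg.inv(pooled)
    linear = X @ sigma_inv @ means.T                            # x^T Sigma^{-1} mu_k
    constant = 0.5 * np.sum(means @ sigma_inv * means, axis=1)  # 1/2 mu_k^T Sigma^{-1} mu_k
    return linear - constant + np.log(priors)                   # + log pi_k

# Hypothetical synthetic data: two Gaussian classes sharing one covariance matrix
rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, size=100),
               rng.multivariate_normal([4, 4], cov, size=50)])
y = np.array([0] * 100 + [1] * 50)

classes, priors, means, pooled = fit_lda(X, y)
scores = discriminants(X, priors, means, pooled)
y_pred = classes[np.argmax(scores, axis=1)]   # assign the class with the largest delta_k
print((y_pred == y).mean())                   # training accuracy on the toy data
```

For real use, scikit-learn provides `LinearDiscriminantAnalysis` and `QuadraticDiscriminantAnalysis` in `sklearn.discriminant_analysis`; their covariance estimates may differ slightly from this sketch.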