In [31]:
import pandas as pd
df = pd.read_csv('titanic_train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [32]:
df[(df['Survived']==0)]['Age']

0      22.0
4      35.0
5       NaN
6      54.0
7       2.0
       ... 
884    25.0
885    39.0
886    27.0
888     NaN
890    32.0
Name: Age, Length: 549, dtype: float64

In [33]:
df[(df['Survived']==1)]['Age']

1      38.0
2      26.0
3      35.0
8      27.0
9      14.0
       ... 
875    15.0
879    56.0
880    25.0
887    19.0
889    26.0
Name: Age, Length: 342, dtype: float64

In [14]:
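# normalize=True divides every cell by the grand total, so each entry is a joint probability P(Pclass, Survived)
pd.crosstab(df.Pclass, df.Survived, margins=True, normalize=True)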

Pclass,Survived=0,Survived=1,All
1,0.089787,0.152637,0.242424
2,0.108866,0.097643,0.20651
3,0.417508,0.133558,0.551066
All,0.616162,0.383838,1.0


In [13]:
# P(Survived=0 | Pclass=1), read directly from the counts: 80 / 216
80/216
# The same value via Bayes' theorem:
# P(Pclass=1 | Survived=0) * P(Survived=0) / P(Pclass=1)
# = (80/549) * (549/891) / (216/891)

0.37037037037037035

In [None]:
0.3703703703703704

### **Naïve Bayes Algorithm: A Probabilistic Classifier**  

Naïve Bayes is a probabilistic machine learning algorithm based on **Bayes' Theorem**. It is widely used for **classification tasks** such as spam filtering, sentiment analysis, and document classification.  

#### **Key Concept: Bayes' Theorem**  
Bayes' Theorem provides a mathematical formula to calculate the probability of a hypothesis given some observed evidence:  

$$
P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)} 
$$

Where:  
- P(A | B) is the **posterior probability**: the probability of class A given feature B.  
- P(B | A) is the **likelihood**: the probability of feature B given class A.  
- P(A) is the **prior probability**: the probability of class A before seeing the data.  
- P(B) is the **evidence**: the probability of observing feature B.  
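
As a quick numerical check, the formula can be applied to the Titanic counts used in the cells above (80 first-class passengers who did not survive, 549 non-survivors, 216 first-class passengers, 891 passengers in total). This is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the Titanic cells earlier in this notebook.
n_total = 891           # all passengers
n_died = 549            # Survived == 0
n_first = 216           # Pclass == 1
n_first_and_died = 80   # Pclass == 1 and Survived == 0

likelihood = n_first_and_died / n_died   # P(B | A) = P(Pclass=1 | Survived=0)
prior = n_died / n_total                 # P(A)     = P(Survived=0)
evidence = n_first / n_total             # P(B)     = P(Pclass=1)

posterior = likelihood * prior / evidence   # P(A | B) = P(Survived=0 | Pclass=1)
direct = n_first_and_died / n_first         # same quantity read straight from the counts

print(round(posterior, 4), round(direct, 4))
```

Both routes give the same value, which is what Bayes' Theorem guarantees: the 549 and 891 cancel, leaving 80/216.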

#### **Assumption of Naïve Bayes**  
The algorithm makes a **naïve assumption** that all features are **independent** of each other given the class label. This simplifies the probability calculation significantly.  

$$
P(A | B_1, B_2, ..., B_n) = \frac{P(B_1 | A) \cdot P(B_2 | A) \cdots P(B_n | A) \cdot P(A)}{P(B_1, B_2, ..., B_n)}
$$ 

#### **Types of Naïve Bayes Classifiers**  
1. **Gaussian Naïve Bayes** – Used for continuous data assuming a normal distribution.  
2. **Multinomial Naïve Bayes** – Used for discrete data, like text classification (word counts).  
3. **Bernoulli Naïve Bayes** – Used for binary data (e.g., whether a word appears in a document or not).  

#### **Steps in Naïve Bayes Classification**  
1. **Calculate Prior Probability** for each class.  
2. **Compute Likelihood** for each feature given the class.  
3. **Multiply the Probabilities** and use Bayes' Theorem to compute the posterior probability.  
4. **Choose the class with the highest posterior probability** as the prediction.  

#### **Advantages**  
- **Fast and efficient** for large datasets.  
- **Works well with high-dimensional data** (like text data).  
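- **Needs relatively little training data**, because each feature's distribution is estimated separately for each class.  
- **Tolerates missing feature values at prediction time**, because the factor for a missing feature can simply be left out of the product.  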

#### **Limitations**  
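- **Assumes conditional independence between features**, which is often unrealistic.  
- **Struggles with correlated features**: correlated evidence is counted more than once, so the probability estimates become overconfident.  
- **Gives unreliable estimates when data is scarce**: a feature value never seen with a class gets a likelihood of zero, which zeroes out the whole product for that class.  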

#### **Applications of Naïve Bayes**  
- **Spam Filtering** (classifying emails as spam or not)  
- **Sentiment Analysis** (positive or negative reviews)  
- **Text Classification** (categorizing documents)  
- **Medical Diagnosis** (predicting diseases based on symptoms)  



### **Naïve Bayes Algorithm: Step-by-Step Explanation**

The **Naïve Bayes classifier** is based on **Bayes' Theorem** and the assumption that all features are conditionally independent given the class label. It follows the steps below during **training** and **testing**.

---

## **1. Training Phase**

### **Step 1: Compute Prior Probability for Each Class**
The prior probability of a class \( Y = y \) is calculated as:

$$
P(Y = y) = \frac{\text{Number of instances in class } y}{\text{Total number of instances}}
$$

where:
- \( P(Y = y) \) is the prior probability of class \( y \).
- The numerator represents the count of samples belonging to class \( y \).
- The denominator is the total number of training samples.
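
The prior estimate above is just a relative frequency, which can be sketched in a few lines. The function name `class_priors` and the label list are hypothetical and used only for illustration.

```python
from collections import Counter

def class_priors(labels):
    """Return {class: P(Y = class)} estimated by relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}

# Hypothetical labels, for illustration only.
labels = ["Yes", "No", "Yes", "Yes", "No"]
priors = class_priors(labels)
print(priors)
```

The priors always sum to 1, since every training sample belongs to exactly one class.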

### **Step 2: Compute Likelihood for Each Feature Given a Class**
For each feature \( X_i \) and each class \( Y = y \), we calculate the conditional probability:

$$
P(X_i = x | Y = y)
$$

There are different ways to compute this depending on the type of data:

- **For Categorical Features**:
  Using the **frequency-based probability estimate**:

  $$
  P(X_i = x | Y = y) = \frac{\text{Count of } X_i = x \text{ in class } y}{\text{Total count of instances in class } y}
  $$

- **For Continuous Features**:
  Assuming a **Gaussian (Normal) distribution**, we estimate the likelihood using:

  $$
  P(X_i = x | Y = y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp \left( -\frac{(x - \mu_y)^2}{2\sigma_y^2} \right)
  $$

  where:
  - \( \mu_y \) is the **mean** of feature \( X_i \) for class \( y \).
  - \( \sigma_y^2 \) is the **variance** of feature \( X_i \) for class \( y \).
  - \( \exp \) represents the exponential function.
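
Both likelihood estimates can be sketched as follows. The helper names and the small data lists are assumptions made for this example; no smoothing is applied, so a value that never occurs with a class gets a likelihood of exactly zero.

```python
import math
from collections import Counter

def categorical_likelihoods(values, labels):
    """Return {(value, cls): P(X = value | Y = cls)} from relative frequencies."""
    class_counts = Counter(labels)
    pair_counts = Counter(zip(values, labels))
    return {
        (value, cls): pair_counts[(value, cls)] / class_counts[cls]
        for value in set(values)
        for cls in class_counts
    }

def gaussian_density(x, mean, variance):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical data, for illustration only.
wind = ["Weak", "Strong", "Weak", "Weak", "Strong"]
play = ["Yes", "No", "Yes", "No", "No"]
table = categorical_likelihoods(wind, play)
print(table[("Weak", "No")], table[("Strong", "Yes")])
print(gaussian_density(0.0, mean=0.0, variance=1.0))
```

Note that the Gaussian expression is a probability density rather than a probability, so its value can exceed 1 when the variance is small. That is harmless here because only the comparison between classes matters.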

### **Step 3: Store the Probabilities**
After computing:
1. Prior probabilities \( P(Y = y) \)
2. Likelihoods \( P(X_i = x | Y = y) \)

We store these values for the prediction phase.

---

## **2. Testing Phase (Prediction)**
### **Step 4: Compute Posterior Probability for Each Class**
For a given **new sample** with features \( X_1, X_2, ..., X_n \), we compute the **posterior probability** using Bayes' Theorem:

$$
P(Y = y | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = \frac{P(Y = y) \cdot P(X_1 = x_1 | Y = y) \cdot P(X_2 = x_2 | Y = y) \cdots P(X_n = x_n | Y = y)}{P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)}
$$

Since the denominator \( P(X_1, X_2, ..., X_n) \) is constant for all classes, we only need to compute the **numerator**:

$$
P(Y = y) \cdot \prod_{i=1}^{n} P(X_i = x_i | Y = y)
$$

### **Step 5: Select the Class with the Highest Probability**
We choose the class \( Y = y^* \) that has the **maximum posterior probability**:

$$
y^* = \arg\max_{y \in K} P(Y = y) \cdot \prod_{i=1}^{n} P(X_i = x_i | Y = y)
$$

where \( K \) is the set of all possible classes.
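
The argmax rule can be sketched as below. The stored probabilities and feature names are hypothetical. The sketch accumulates the score as a sum of logarithms, which is a common numerical variant and an assumption of this example rather than something the text above prescribes; it selects the same class as the plain product because the logarithm is monotonic.

```python
import math

def predict(sample, priors, likelihoods):
    """Return (best class, log-scores) for one sample.

    The score of a class is log P(Y=y) + sum_i log P(X_i=x_i | Y=y),
    i.e. the logarithm of the numerator used in the argmax rule.
    """
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for feature, value in sample.items():
            p = likelihoods[(feature, value, cls)]
            # A zero likelihood rules the class out entirely.
            score += math.log(p) if p > 0 else float("-inf")
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical stored probabilities, for illustration only.
priors = {"A": 0.5, "B": 0.5}
likelihoods = {
    ("x1", "red", "A"): 0.8, ("x1", "red", "B"): 0.3,
    ("x2", "big", "A"): 0.4, ("x2", "big", "B"): 0.5,
}
best, scores = predict({"x1": "red", "x2": "big"}, priors, likelihoods)
print(best)
```

Working in log space avoids floating-point underflow when many small factors are multiplied, which matters once the number of features grows.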

---

## **3. Example Calculation**
Consider a dataset with two classes \( K = \{A, B\} \) and two features \( X_1 \) and \( X_2 \), each having categorical values.

1. **Calculate Prior Probabilities**:
   $$
   P(Y = A) = \frac{\text{Instances in class } A}{\text{Total instances}}
   $$
   $$
   P(Y = B) = \frac{\text{Instances in class } B}{\text{Total instances}}
   $$

2. **Calculate Likelihood Probabilities**:
   $$
   P(X_1 = x_1 | Y = A) = \frac{\text{Count of } X_1 = x_1 \text{ in class } A}{\text{Total count of instances in class } A}
   $$
   $$
   P(X_2 = x_2 | Y = A) = \frac{\text{Count of } X_2 = x_2 \text{ in class } A}{\text{Total count of instances in class } A}
   $$

3. **Compute Posterior Probabilities** (up to the shared normalizing constant \( P(X_1 = x_1, X_2 = x_2) \)):
   $$
   P(Y = A | X_1 = x_1, X_2 = x_2) \propto P(Y = A) \cdot P(X_1 = x_1 | Y = A) \cdot P(X_2 = x_2 | Y = A)
   $$

   $$
   P(Y = B | X_1 = x_1, X_2 = x_2) \propto P(Y = B) \cdot P(X_1 = x_1 | Y = B) \cdot P(X_2 = x_2 | Y = B)
   $$

4. **Predict the Class**:
   Choose the class with the highest probability:

   $$
   y^* = \arg\max_{y \in \{A, B\}} P(Y = y) \cdot P(X_1 = x_1 | Y = y) \cdot P(X_2 = x_2 | Y = y)
   $$

---

## **4. Summary**
1. **Train the model** by computing:
   - Prior probabilities \( P(Y) \).
   - Conditional probabilities \( P(X_i | Y) \).
2. **Predict the class** for a new sample by:
   - Computing the posterior probability for each class.
   - Choosing the class with the highest probability.

This is how the **Naïve Bayes** classifier works mathematically! 🚀
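
The training and prediction phases summarized above fit in one small class. This is a sketch under stated assumptions: the class name `CategoricalNaiveBayes`, its methods, and the toy data are all hypothetical, features are categorical, and no smoothing is applied.

```python
from collections import Counter

class CategoricalNaiveBayes:
    """Frequency-based Naive Bayes for categorical features (no smoothing)."""

    def fit(self, rows, labels):
        # Training: store the counts needed for priors and likelihoods.
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(feature_index, value, cls)] = number of matching rows
        self.value_counts = Counter(
            (i, value, cls)
            for row, cls in zip(rows, labels)
            for i, value in enumerate(row)
        )
        return self

    def scores(self, row):
        """Unnormalized posterior P(Y=y) * prod_i P(X_i=x_i | Y=y) per class."""
        result = {}
        for cls, n_cls in self.class_counts.items():
            score = n_cls / self.total
            for i, value in enumerate(row):
                score *= self.value_counts[(i, value, cls)] / n_cls
            result[cls] = score
        return result

    def predict(self, row):
        s = self.scores(row)
        return max(s, key=s.get)

# Hypothetical toy data: (Outlook, Wind) -> Play
rows = [("Sunny", "Weak"), ("Sunny", "Strong"), ("Rain", "Weak"),
        ("Rain", "Weak"), ("Sunny", "Weak"), ("Rain", "Strong")]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.scores(("Rain", "Weak")))
print(model.predict(("Rain", "Weak")))
```

Storing raw counts instead of precomputed probabilities keeps `fit` to a single pass over the data; the divisions happen only for the feature values that actually appear in the sample being classified.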


### **Naïve Bayes Probability Calculation**

In a **Naïve Bayes** classification model, the number of probabilities that need to be calculated depends on:  

- \( n \): The number of feature columns.  
- \( k(i) \): The number of unique values in the \( i \)-th feature column.  
- \( K \): The number of unique classes in the output (target) column. (Here \( K \) is a count, not the set of classes as in the previous section.)  

#### **Types of Probabilities to Compute**  

1. **Prior Probabilities**  
   - For each of the \( K \) output classes \( y \), we compute:  
     $$ P(Y = y) $$
   - This requires computing **\( K \)** probabilities.

2. **Likelihood Probabilities**  
   - For each feature column \( i \) and for each class \( y \), we compute:  
     $$ P(X_i = x | Y = y) $$
   - Since \( X_i \) can take \( k(i) \) different values, we need to compute \( k(i) \) probabilities per class.  
   - For all \( n \) features and \( K \) output classes, the total likelihood probabilities are:  
     $$ \sum_{i=1}^{n} k(i) \cdot K $$  

#### **Total Number of Probabilities to Calculate**  

$$
K + \sum_{i=1}^{n} k(i) \cdot K
$$

where:  
- \( K \) accounts for the prior probabilities.  
- \( \sum_{i=1}^{n} k(i) \cdot K \) accounts for the likelihood probabilities.  

#### **Example Calculation**  
Suppose we have:  
- \( n = 3 \) features  
- \( k(1) = 4 \), \( k(2) = 3 \), \( k(3) = 5 \) unique values in each feature  
- \( K = 2 \) (binary classification)  

$$
\text{Total probabilities} = 2 + (4 \times 2 + 3 \times 2 + 5 \times 2)  
$$

$$
= 2 + (8 + 6 + 10) = 26
$$

This formula generalizes for any number of features and classes in a Naïve Bayes model. 🚀
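
The count can be checked with a short helper. The function name is hypothetical. The second call uses the PlayTennis columns loaded later in this notebook, where Outlook and Temperature take 3 values each and Humidity and Wind take 2 values each.

```python
def count_probabilities(unique_values_per_feature, n_classes):
    """Return K + sum_i k(i) * K for a categorical Naive Bayes model."""
    n_priors = n_classes
    n_likelihoods = sum(k * n_classes for k in unique_values_per_feature)
    return n_priors + n_likelihoods

# Worked example from the text: k = (4, 3, 5), K = 2.
print(count_probabilities([4, 3, 5], 2))

# PlayTennis: k = (3, 3, 2, 2), K = 2.
print(count_probabilities([3, 3, 2, 2], 2))
```

For PlayTennis this gives 20 likelihoods plus 2 priors, which agrees with the 20 entries in the likelihood dictionary built further down.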


In [16]:
import numpy as np
import pandas as pd

In [17]:
df = pd.read_csv('https://gist.githubusercontent.com/DiogoRibeiro7/c6590d0cf119e87c39e31c21a9c0f3a8/raw/4a8e3da267a0c1f0d650901d8295a5153bde8b21/PlayTennis.csv')

In [18]:
df

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Play Tennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


In [19]:
pd.crosstab(df['Outlook'], df['Play Tennis'],normalize='columns').stack().to_dict()

{('Overcast', 'No'): 0.0,
 ('Overcast', 'Yes'): 0.4444444444444444,
 ('Rain', 'No'): 0.4,
 ('Rain', 'Yes'): 0.3333333333333333,
 ('Sunny', 'No'): 0.6,
 ('Sunny', 'Yes'): 0.2222222222222222}

In [23]:
D = {}
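# Collect P(feature value | class) for every feature into one lookup table.
# Keys are (value, class), which relies on value names being unique across features.
for col in ['Outlook', 'Temperature', 'Humidity', 'Wind']:
    D.update(pd.crosstab(df[col], df['Play Tennis'], normalize='columns').stack().to_dict())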


In [24]:
D

{('Overcast', 'No'): 0.0,
 ('Overcast', 'Yes'): 0.4444444444444444,
 ('Rain', 'No'): 0.4,
 ('Rain', 'Yes'): 0.3333333333333333,
 ('Sunny', 'No'): 0.6,
 ('Sunny', 'Yes'): 0.2222222222222222,
 ('Cool', 'No'): 0.2,
 ('Cool', 'Yes'): 0.3333333333333333,
 ('Hot', 'No'): 0.4,
 ('Hot', 'Yes'): 0.2222222222222222,
 ('Mild', 'No'): 0.4,
 ('Mild', 'Yes'): 0.4444444444444444,
 ('High', 'No'): 0.8,
 ('High', 'Yes'): 0.3333333333333333,
 ('Normal', 'No'): 0.2,
 ('Normal', 'Yes'): 0.6666666666666666,
 ('Strong', 'No'): 0.6,
 ('Strong', 'Yes'): 0.3333333333333333,
 ('Weak', 'No'): 0.4,
 ('Weak', 'Yes'): 0.6666666666666666}

In [34]:
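# Sample to classify: Outlook=Sunny, Temperature=Cool, Humidity=Normal, Wind=Weak
# Unnormalized score for "Yes": likelihoods from D times the prior P(Yes) = 9/14
y = 0.2222222222222222*0.3333333333333333*0.6666666666666666*0.6666666666666666*9/14
# Unnormalized score for "No": likelihoods from D times the prior P(No) = 5/14
n = 0.6*0.2*0.2*0.4*5/14
y > n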

True
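
The two scores compared above are unnormalized numerators. Dividing each by their sum turns them into posterior probabilities. The sketch below restates the same numbers as fractions (2/9, 3/9, 6/9, 6/9 for "Yes"); the variable names `evidence`, `p_yes` and `p_no` are introduced here for illustration.

```python
# Unnormalized scores for the sample (Sunny, Cool, Normal, Weak).
y = (2 / 9) * (3 / 9) * (6 / 9) * (6 / 9) * (9 / 14)   # P(Yes) * product of P(x_i | Yes)
n = 0.6 * 0.2 * 0.2 * 0.4 * (5 / 14)                   # P(No)  * product of P(x_i | No)

# The evidence P(x_1, ..., x_n) is the sum of the numerators over all classes.
evidence = y + n
p_yes = y / evidence
p_no = n / evidence

print(round(p_yes, 3), round(p_no, 3))
```

Under the naïve independence assumption the model assigns roughly 86% to "Yes" for this sample. The normalization does not change which class wins, so it can be skipped when only the predicted label is needed.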