# Naive Bayes

Adapted from: https://appliedmachinelearning.blog/2017/05/23/understanding-naive-bayes-classifier-from-scratch-python-code/

---

- Naive Bayes is built on Bayes' theorem:

$$
P(O|E) = \cfrac{P(E|O) * P(O)}{P(E)}
$$

- Where:
  - $P(O)$ is the **prior probability**, the probability that $O$ occurs
  - $P(E)$ is the **evidence**, the probability that event $E$ occurs
  - $P(O|E)$ is the **posterior probability**, the probability that $O$ occurs once event $E$ is known
  - $P(E|O)$ is the **likelihood**, which answers the question "*What is the probability of event $E$ given that $O$ occurs?*"
  
  
- Equivalently:

$$
\text{posterior probability} = \cfrac{\text{likelihood} * \text{prior probability}}{\text{evidence}}
$$
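The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior` and the numbers passed to it are illustrative assumptions, not part of the original source.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(O|E) = P(E|O) * P(O) / P(E)."""
    if evidence <= 0:
        raise ValueError("evidence must be positive")
    return likelihood * prior / evidence


# Illustrative (made-up) inputs: P(E|O) = 0.6, P(O) = 0.625, P(E) = 0.5
p = posterior(likelihood=0.6, prior=0.625, evidence=0.5)
print(round(p, 3))  # 0.75
```

Here, knowing that $E$ occurred raises the probability of $O$ from the prior 0.625 to a posterior of 0.75.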


---
**Why is it called *Naive*?**

- In practice, the probability of $O$ may be influenced by several events.
- Mathematically,
  - if $O$ depends on the events $E_{1}, E_{2}, \cdots, E_{n}$
  - and we assume that $E_{1}, E_{2}, \cdots, E_{n}$ are independent of one another, then
  
$$
P(O|E_{1}, E_{2}, \cdots, E_{n}) = \cfrac{P(E_{1}|O) * P(E_{2}|O) * \cdots * P(E_{n}|O)}{P(E_{1}) * P(E_{2}) * \cdots * P(E_{n})} * P(O)
$$

- In other words, the model **naively** assumes that the events do not influence one another.
- In real data this rarely holds exactly, because features often do influence one another.
- Even so, Naive Bayes often gives good results in practice, even compared with classification models that are far more computationally expensive.
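
The product formula above follows from Bayes' theorem in two steps. The derivation below is a sketch added for clarity and is not spelled out in the original source. Start from Bayes' theorem applied to all events jointly:

$$
P(O|E_{1}, \cdots, E_{n}) = \cfrac{P(E_{1}, \cdots, E_{n}|O) * P(O)}{P(E_{1}, \cdots, E_{n})}
$$

The naive assumption lets the joint likelihood factor into one term per event:

$$
P(E_{1}, \cdots, E_{n}|O) = P(E_{1}|O) * P(E_{2}|O) * \cdots * P(E_{n}|O)
$$

Writing the denominator as $P(E_{1}) * \cdots * P(E_{n})$ additionally assumes that the events are independent without conditioning on $O$. Because the denominator does not depend on $O$, it does not change which outcome gets the largest value.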

---

## Example 1 - Categorical Data

- Given the following data:

|chills|runny nose|headache|fever|flu (target)|
|:--:|:--:|:--:|:--:|:--:|
|Y|N|Mild|Y|N|
|Y|Y|No|N|Y|
|Y|N|Strong|Y|Y|
|N|Y|Mild|Y|Y|
|N|N|No|N|N|
|N|Y|Strong|Y|Y|
|N|Y|Strong|N|N|
|Y|Y|Mild|Y|Y|

- Classify the following record: does it indicate flu or not?

|chills|runny nose|headache|fever|flu (target)|
|:--:|:--:|:--:|:--:|:--:|
|Y|N|Mild|N|?|


- How do we classify it?


- Let
  - $E_{1} = (\text{chills} = Y)$
  - $E_{2} = (\text{runny nose} = N)$
  - $E_{3} = (\text{headache} = \text{Mild})$
  - $E_{4} = (\text{fever} = N)$


- Next, we compute
  - $P(\text{flu} = Y | E_{1}, E_{2}, E_{3}, E_{4})$ 
  - $P(\text{flu} = N | E_{1}, E_{2}, E_{3}, E_{4})$ 
  
  
- Finally, we compare the two probabilities and pick the larger one.


- If $P(\text{flu}=Y | E_{1}, E_{2}, E_{3}, E_{4}) > P(\text{flu}=N | E_{1}, E_{2}, E_{3}, E_{4})$, the record is classified as flu; otherwise it is classified as not flu.

---
**Calculation**

- $P(\text{flu} = Y) = P(O) = \cfrac{5}{8} = 0.625$

- $P(\text{chills} = Y | \text{flu} = Y) = P(E_{1} | O) = \cfrac{3}{5} = 0.6$

- $P(\text{runny nose} = N | \text{flu} = Y) = P(E_{2} | O) = \cfrac{1}{5} = 0.2$

- $P(\text{headache} = \text{Mild} | \text{flu} = Y) = P(E_{3} | O) = \cfrac{2}{5} = 0.4$

- $P(\text{fever} = N | \text{flu} = Y) = P(E_{4} | O) = \cfrac{1}{5} = 0.2$

- $P(\text{chills} = Y) = P(E_{1}) = \cfrac{4}{8} = 0.5$

- $P(\text{runny nose} = N) = P(E_{2}) = \cfrac{3}{8} = 0.375$

- $P(\text{headache} = \text{Mild}) = P(E_{3}) = \cfrac{3}{8} = 0.375$

- $P(\text{fever} = N) = P(E_{4}) = \cfrac{3}{8} = 0.375$

---

- Therefore

$$
P(\text{flu}=Y|E_{1}, E_{2}, E_{3}, E_{4}) = \cfrac{P(E_{1}|O) * P(E_{2}|O) * P(E_{3}|O) * P(E_{4}|O)}{P(E_{1}) * P(E_{2}) * P(E_{3}) * P(E_{4})} * P(O)
$$

$$
P(\text{flu}=Y|E_{1}, E_{2}, E_{3}, E_{4}) = \cfrac{0.6 * 0.2 * 0.4 * 0.2}{0.5 * 0.375 * 0.375 * 0.375} * 0.625
$$

$$
P(\text{flu}=Y|E_{1}, E_{2}, E_{3}, E_{4}) \approx 0.228
$$


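- Applying the same steps with $O = (\text{flu} = N)$, where $P(O) = \cfrac{3}{8}$ and the four likelihoods are $\cfrac{1}{3}, \cfrac{2}{3}, \cfrac{1}{3}, \cfrac{2}{3}$, we obtain

$$
P(\text{flu}=N|E_{1}, E_{2}, E_{3}, E_{4}) = \cfrac{\frac{1}{3} * \frac{2}{3} * \frac{1}{3} * \frac{2}{3}}{0.5 * 0.375 * 0.375 * 0.375} * 0.375 \approx 0.702
$$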


---
- Because $P(\text{flu}=Y | E_{1}, E_{2}, E_{3}, E_{4}) < P(\text{flu}=N | E_{1}, E_{2}, E_{3}, E_{4})$,
- the posterior probability of the **no flu** class is larger than that of the **flu** class,
- so the prediction is **no flu**.
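
The hand calculation can be reproduced with exact fractions. This is a minimal sketch: the counts are read off the training table above, while the helper `product` and the variable names are assumptions made for illustration.

```python
from fractions import Fraction as F

# Priors from the 8 training rows: 5 with flu = Y, 3 with flu = N.
prior = {"Y": F(5, 8), "N": F(3, 8)}

# P(E_k | class) for E1..E4 = (chills=Y, runny nose=N, headache=Mild, fever=N)
likelihood = {
    "Y": [F(3, 5), F(1, 5), F(2, 5), F(1, 5)],
    "N": [F(1, 3), F(2, 3), F(1, 3), F(2, 3)],
}

# P(E_k) computed over all 8 rows
evidence = [F(4, 8), F(3, 8), F(3, 8), F(3, 8)]


def product(values):
    """Multiply a list of fractions together."""
    result = F(1)
    for v in values:
        result *= v
    return result


scores = {
    label: product(likelihood[label]) / product(evidence) * prior[label]
    for label in prior
}
for label, score in scores.items():
    print(label, round(float(score), 3))

prediction = max(scores, key=scores.get)
print("prediction:", prediction)
```

The two scores add up to about 0.93 rather than 1, because the product of per-feature evidences is only an approximation of the joint evidence. The comparison between classes is unaffected.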

---
## Try It in Python

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Build the dataset
data = {'chills': ['Y', 'Y', 'Y', 'N', 'N', 'N', 'N', 'Y'],
        'runny nose': ['N', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
        'headache': ['Mild', 'No', 'Strong', 'Mild', 'No', 'Strong', 'Strong', 'Mild'],
        'fever': ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y']}

target = ['N', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y']


# Build the DataFrame
df = pd.DataFrame(data)
df['flu?'] = target

# Display the DataFrame
df

| |chills|runny nose|headache|fever|flu?|
|:--:|:--:|:--:|:--:|:--:|:--:|
|0|Y|N|Mild|Y|N|
|1|Y|Y|No|N|Y|
|2|Y|N|Strong|Y|Y|
|3|N|Y|Mild|Y|Y|
|4|N|N|No|N|N|
|5|N|Y|Strong|Y|Y|
|6|N|Y|Strong|N|N|
|7|Y|Y|Mild|Y|Y|


In [3]:
# Preprocess data
def preprocess_dataframe(data):
    # Convert categorical to numerical data
    # Chills
    data['chills'] = data['chills'].apply(lambda x: 0 if x=='N' else 1)

    # Runny nose
    data['runny nose'] = data['runny nose'].apply(lambda x: 0 if x=='N' else 1)

    # Headache
    data['headache'] = data['headache'].apply(lambda x: 0 if x=='No' else 1 if x=='Mild' else 2)

    # Fever
    data['fever'] = data['fever'].apply(lambda x: 0 if x=='N' else 1)

    # Target
    if 'flu?' in data.columns:
        data['flu?'] = data['flu?'].apply(lambda x: 0 if x=='N' else 1)

        # Set X & y
        y = data['flu?'].to_numpy()
        X = data.drop(['flu?'], axis=1).to_numpy()
    
        return X, y
    else:
        X = data.to_numpy()
        
        return X, None

# Execute preprocessing data
X, y = preprocess_dataframe(df)
print(X)
print(y)

[[1 0 1 1]
 [1 1 0 0]
 [1 0 2 1]
 [0 1 1 1]
 [0 0 0 0]
 [0 1 2 1]
 [0 1 2 0]
 [1 1 1 1]]
[0 1 1 1 0 1 0 1]
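
Note that `preprocess_dataframe` overwrites the columns of the DataFrame it receives. A non-mutating alternative can be sketched with `Series.map` and lookup dictionaries. The names `YES_NO`, `HEADACHE`, `ENCODERS` and `encode` are hypothetical, and the two-row frame is only a small illustration.

```python
import pandas as pd

# Hypothetical lookup tables mirroring the lambdas in preprocess_dataframe.
YES_NO = {"N": 0, "Y": 1}
HEADACHE = {"No": 0, "Mild": 1, "Strong": 2}
ENCODERS = {"chills": YES_NO, "runny nose": YES_NO,
            "headache": HEADACHE, "fever": YES_NO, "flu?": YES_NO}


def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Return an encoded copy; the input frame is left untouched."""
    encoded = frame.copy()
    for column, table in ENCODERS.items():
        if column in encoded.columns:
            encoded[column] = encoded[column].map(table)
    return encoded


raw = pd.DataFrame({"chills": ["Y", "N"], "runny nose": ["N", "Y"],
                    "headache": ["Mild", "Strong"], "fever": ["Y", "N"],
                    "flu?": ["N", "Y"]})
encoded = encode(raw)
print(encoded.to_numpy())
```

One behavioural difference: with `map`, a value missing from the lookup table becomes `NaN`, whereas the lambdas silently treat anything other than `'N'` as 1.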


---
Computing the **prior probability**

In [4]:
def occurence(data):
    labels, counts = np.unique(data, return_counts=True)
    
    prior_data = {}
    for i in range(len(labels)):
        prior_data[labels[i]] = counts[i] / len(data)
        
    return prior_data

# Test the function on the target data
prior_data = occurence(y)
print(prior_data)

{0: 0.375, 1: 0.625}
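
As a cross-check, the same priors can be obtained with pandas. This is a sketch that assumes the encoded target shown above; `y_example`, `prior_series` and `prior_check` are illustrative names.

```python
import numpy as np
import pandas as pd

y_example = np.array([0, 1, 1, 1, 0, 1, 0, 1])  # same encoded target as above

# value_counts(normalize=True) returns relative frequencies per label
prior_series = pd.Series(y_example).value_counts(normalize=True).sort_index()
prior_check = {int(label): float(p) for label, p in prior_series.items()}
print(prior_check)  # {0: 0.375, 1: 0.625}
```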


---
Computing the **likelihood**

In [5]:
def likelihood(X, y):    
    n_class = len(np.unique(y))
    _, n_event = X.shape
    
    likelihood_data = {}
    
    for class_i in range(n_class):
        likelihood_data[class_i] = {}

        for event_j in range(n_event):
            event_data = X[:,event_j][y==class_i]
            proba_event = occurence(event_data)
        
            likelihood_data[class_i][event_j] = proba_event
            
    return likelihood_data
    
# Test the function
likelihood_data = likelihood(X, y)
print(likelihood_data)

{0: {0: {0: 0.6666666666666666, 1: 0.3333333333333333}, 1: {0: 0.6666666666666666, 1: 0.3333333333333333}, 2: {0: 0.3333333333333333, 1: 0.3333333333333333, 2: 0.3333333333333333}, 3: {0: 0.6666666666666666, 1: 0.3333333333333333}}, 1: {0: {0: 0.4, 1: 0.6}, 1: {0: 0.2, 1: 0.8}, 2: {0: 0.2, 1: 0.4, 2: 0.4}, 3: {0: 0.2, 1: 0.8}}}
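
The nested dictionary is hard to read, so a single feature can be cross-checked with `pd.crosstab`. This is a sketch using only the `chills` column and the target from the training data; the variable names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({
    "chills": ["Y", "Y", "Y", "N", "N", "N", "N", "Y"],
    "flu?":   ["N", "Y", "Y", "Y", "N", "Y", "N", "Y"],
})

# Rows: class (flu?), columns: feature value.
# normalize="index" makes each row sum to 1, giving P(chills | flu?).
table = pd.crosstab(train["flu?"], train["chills"], normalize="index")
print(table)
```

The entry in row `Y`, column `Y` corresponds to `likelihood_data[1][0][1]`, which is 0.6.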


---
Computing the **evidence**

In [6]:
def evidence(X):
    _, n_event = X.shape
        
    evidence_data = {}
    for event_i in range(n_event):
        event_data = X[:, event_i]
        proba_event = occurence(event_data)
        
        evidence_data[event_i] = proba_event
        
    return evidence_data

# Test the function
evidence_data = evidence(X)
print(evidence_data)

{0: {0: 0.5, 1: 0.5}, 1: {0: 0.375, 1: 0.625}, 2: {0: 0.25, 1: 0.375, 2: 0.375}, 3: {0: 0.375, 1: 0.625}}
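
For a single feature, the evidence can also be recovered from the likelihoods and priors through the law of total probability. The sketch below checks this for `chills = Y`, with the values copied from the outputs above; the variable names are illustrative.

```python
prior = {0: 0.375, 1: 0.625}             # P(flu = N), P(flu = Y)
chills_given_class = {0: 1 / 3, 1: 0.6}  # P(chills = Y | class)

# Law of total probability: P(E) = sum over classes of P(E | class) * P(class)
evidence_chills = sum(chills_given_class[c] * prior[c] for c in prior)
print(round(evidence_chills, 3))  # 0.5
```

This identity holds per feature. The product of per-feature evidences used for a whole record is, in general, only an approximation of the joint evidence.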


---
**Enter the data to classify**

In [7]:
# Build the test dataset
data_test = {'chills': ['Y', 'Y'],
             'runny nose': ['N', 'Y'],
             'headache': ['Mild', 'Strong'],
             'fever': ['N', 'Y']}

# Build the DataFrame
df_test = pd.DataFrame(data_test)

# Display the DataFrame
df_test

| |chills|runny nose|headache|fever|
|:--:|:--:|:--:|:--:|:--:|
|0|Y|N|Mild|N|
|1|Y|Y|Strong|Y|


In [8]:
# Preprocess data test
X_test, _ = preprocess_dataframe(df_test)

print(X_test)

[[1 0 1 0]
 [1 1 2 1]]


---
**Make the prediction**

In [9]:
def predict(X_test, prior, likelihood, evidence):
    n_class = len(prior)
    n_data, n_event = X_test.shape
    
    # Build the unnormalized posterior, indexed as posterior_data[class][data]
    # so that it matches the normalization and prediction loops below
    posterior_data = {}
    for class_j in range(n_class):
        posterior_data[class_j] = {}

        for data_i in range(n_data):
            tot_likelihood = 1
            tot_evidence = 1
            for event_k in range(n_event):
                tot_likelihood *= likelihood[class_j][event_k][X_test[data_i][event_k]]
                tot_evidence *= evidence[event_k][X_test[data_i][event_k]]

            posterior = (tot_likelihood/tot_evidence) * prior[class_j]
            posterior_data[class_j][data_i] = posterior
        
    # Normalize so that, for each record, the class posteriors sum to 1
    for data_j in range(n_data):
        sum_prob = 0
        for class_i in range(n_class):
            sum_prob += posterior_data[class_i][data_j]
            
        for class_i in range(n_class):
            posterior_data[class_i][data_j] /= sum_prob
    
    # Prediction: for each record, pick the class with the largest posterior
    placeholder_class = np.zeros((n_data, n_class))
    for class_i in range(n_class):
        for data_j in range(n_data):
            placeholder_class[data_j][class_i] = posterior_data[class_i][data_j]
    
    choosen_class = np.argmax(placeholder_class, axis=1)

    return choosen_class, posterior_data


# Test the function
choosen_class, posterior_data = predict(X_test, prior_data, likelihood_data, evidence_data)
print(choosen_class)
# Posteriors as {class: {record: probability}}, rounded to 4 decimals
print({c: {d: round(float(p), 4) for d, p in row.items()} for c, row in posterior_data.items()})

[0 1]
{0: {0: 0.7553, 1: 0.046}, 1: {0: 0.2447, 1: 0.954}}


In [10]:
def preprocess_dataframe_invert(data):
    # Convert numerical codes back to categorical labels
    # Chills
    data['chills'] = data['chills'].apply(lambda x: 'N' if x==0 else 'Y')

    # Runny nose
    data['runny nose'] = data['runny nose'].apply(lambda x: 'N' if x==0 else 'Y')

    # Headache
    data['headache'] = data['headache'].apply(lambda x: 'No' if x==0 else 'Mild' if x==1 else 'Strong')

    # Fever
    data['fever'] = data['fever'].apply(lambda x: 'N' if x==0 else 'Y')

    # Target
    data['flu?'] = data['flu?'].apply(lambda x: 'N' if x==0 else 'Y')
    
    return data

# Attach the predictions and convert the codes back to labels
df_test['flu?'] = choosen_class

df_test = preprocess_dataframe_invert(df_test)
df_test

| |chills|runny nose|headache|fever|flu?|
|:--:|:--:|:--:|:--:|:--:|:--:|
|0|Y|N|Mild|N|N|
|1|Y|Y|Strong|Y|Y|


---

## Example 2 - Continuous Data

- Given the following data:

|gender (target)|height|weight|foot_size|
|:--:|:--:|:--:|:--:|
|male|6.00|180|12|
|male|5.92|190|11|
|male|5.58|170|12|
|male|5.92|165|10|
|female|5.00|100|6|
|female|5.50|150|8|
|female|5.42|130|7|
|female|5.75|150|9|

- Classify each of the following records as `male` or `female`:

|height|weight|foot_size|
|:--:|:--:|:--:|
|6.00|130|8|
|5.60|190|10|


---
- To answer this, we can apply Bayes' theorem as before:

$$
p(\text{class} | \textbf{data}) = \cfrac{p(\textbf{data} | \text{class}) * p(\text{class})}{p(\textbf{data})}
$$

- In Naive Bayes we do not need the exact value of the posterior probability $p(\text{class} | \textbf{data})$;
- we only need to know which class has the largest posterior.
- Because the marginal probability, or evidence, $p(\textbf{data})$ is the same for every class, we can drop it.
- This leaves us with

$$
p(\text{class} | \textbf{data}) \propto  p(\textbf{data} | \text{class}) * p(\text{class})
$$

- where
  - $p(\textbf{data} | \text{class})$ is the likelihood
  - $p(\text{class})$ is the prior
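
A small numeric check shows why the evidence can be dropped. The sketch below reuses the first test record from Example 1, where the products of likelihoods are 4/81 (no flu) and 6/625 (flu) and the product of per-feature evidences is 27/1024. The helper `normalize` and the variable names are assumptions made for illustration.

```python
def normalize(scores):
    """Scale class scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# likelihood * prior for the first flu test record from Example 1
unnormalized = {"N": (4 / 81) * (3 / 8), "Y": (6 / 625) * (5 / 8)}
evidence = 27 / 1024  # product of the per-feature evidences for that record

with_evidence = normalize({k: v / evidence for k, v in unnormalized.items()})
without_evidence = normalize(unnormalized)

print(max(with_evidence, key=with_evidence.get))
print(max(without_evidence, key=without_evidence.get))
```

Both lines print the same class, and the normalized probabilities are identical, because dividing every class score by the same constant cancels out during normalization.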

- Because the data is continuous, we make two assumptions:
  1. **Naive assumption**. Within each class, the features are assumed not to depend on one another.
  2. Within each class, the values of each feature are assumed to follow a particular distribution. Here we assume a Gaussian distribution, a common default for continuous features. For example, for height in the female class:
  
$$
p(\text{height} | \text{female}) = \cfrac{1}{\sqrt{2 \pi \sigma_{\text{height} | \text{female}}^{2}}} \cdot e^{- \cfrac{\left ( \text{height} - \mu_{\text{height} | \text{female}} \right ) ^{2}}{2 \sigma_{\text{height} | \text{female}}^{2}}}
$$

---
## Try It in Python

In [11]:
# Build the dataset
data = {'height': [6.00, 5.92, 5.58, 5.92, 5.00, 5.50, 5.42, 5.75],
        'weight': [180., 190., 170., 165., 100., 150., 130., 150.],
        'foot size': [12., 11., 12., 10., 6., 8., 7., 9.]}

target = ['male', 'male', 'male', 'male',
          'female', 'female', 'female', 'female']

# Build the DataFrame
df_baru = pd.DataFrame(data)
df_baru['gender'] = target

# Encode the target (female = 0, male = 1) and split into X and y
y = df_baru['gender'].apply(lambda x: 0 if x=='female' else 1).copy().to_numpy()
X = df_baru.drop(['gender'], axis=1).copy().to_numpy()

print(X)
print(y)

[[  6.   180.    12.  ]
 [  5.92 190.    11.  ]
 [  5.58 170.    12.  ]
 [  5.92 165.    10.  ]
 [  5.   100.     6.  ]
 [  5.5  150.     8.  ]
 [  5.42 130.     7.  ]
 [  5.75 150.     9.  ]]
[1 1 1 1 0 0 0 0]


---
Computing the **prior probability**

In [12]:
def prior_gaussian_NB(y):
    classes, counts = np.unique(y, return_counts=True)
    n_class = len(classes)
    n_data = len(y)
    
    prior_data = {}
    for class_i in range(n_class):
        prior_proba = counts[class_i] / n_data
        prior_data[class_i] = prior_proba
    
    return prior_data
        

# Test the function on the target data
prior_data = prior_gaussian_NB(y)
print(prior_data)

{0: 0.5, 1: 0.5}


---
Computing the **mean & variance of the data**

In [13]:
def extract_data(X, y):
    classes = np.unique(y)
    n_class = len(classes)
    n_data, n_feature = X.shape
    
    data_properties = {}
    for class_i in range(n_class):
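        # select the rows that belong to this class
        data_in_class_i = X[y==class_i, :]
        
        # per-feature mean
        mean_data = np.mean(data_in_class_i, axis=0)
        
        # per-feature variance (np.var defaults to the population variance, ddof=0)
        var_data = np.var(data_in_class_i, axis=0)
        
        # store the results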
        data_properties[class_i] = {}
        data_properties[class_i]['mean'] = mean_data
        data_properties[class_i]['var'] = var_data
        
    return data_properties


# Test the function on the data
data_properties = extract_data(X, y)
print(data_properties)

{0: {'mean': array([  5.4175, 132.5   ,   7.5   ]), 'var': array([7.291875e-02, 4.187500e+02, 1.250000e+00])}, 1: {'mean': array([  5.855, 176.25 ,  11.25 ]), 'var': array([2.62750e-02, 9.21875e+01, 6.87500e-01])}}
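
The variances above are population variances, because `np.var` divides by $n$ by default. Other presentations of this example may use the sample variance, which divides by $n - 1$ and gives noticeably larger values on only four rows per class. A short sketch of the difference, using the female heights from the table:

```python
import numpy as np

female_heights = np.array([5.00, 5.50, 5.42, 5.75])

population_var = np.var(female_heights)       # ddof=0, divides by n
sample_var = np.var(female_heights, ddof=1)   # divides by n - 1
print(population_var, sample_var)
```

The first value matches the `7.291875e-02` shown in the output above. Which estimator to use is a modelling choice, and the predictions later in this example rely on the `ddof=0` values.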


---
Computing the **likelihood of the data**

In [14]:
def likelihood_gaussian(data, mean, var):
    prob = (1/(np.sqrt(2 * np.pi * var))) * np.exp((-1) * ((data-mean)**2) / (2*var))
    
    return prob
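
The Gaussian density can be cross-checked against `statistics.NormalDist` from the Python standard library. The sketch below re-implements the same formula with the `math` module under the assumed name `gaussian_pdf`, and evaluates it for a height of 6.00 in the female class, using the mean and variance printed earlier.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mean, var):
    """Same formula as likelihood_gaussian, written with the math module."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Height 6.00 under the female class (mean and variance from the output above)
mean, var = 5.4175, 0.07291875
ours = gaussian_pdf(6.00, mean, var)
reference = NormalDist(mu=mean, sigma=math.sqrt(var)).pdf(6.00)
print(math.isclose(ours, reference))  # True
```

The value is a probability density, not a probability, so it can be larger than 1 when the variance is small.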

---
Making the **prediction**

In [15]:
# Build the test dataset
data_test = {'height': [6.00, 5.60],
             'weight': [130., 190.],
             'foot size': [8., 10.]}

# Build the DataFrame
df_test = pd.DataFrame(data_test)

# Convert to a NumPy array and print it
X_test = df_test.to_numpy()
print(X_test)

[[  6.  130.    8. ]
 [  5.6 190.   10. ]]


In [16]:
def predict(X_test, prior, data_properties):
    # Use the prior passed as an argument, not the global prior_data
    n_class = len(prior.keys())
    n_data, n_feature = X_test.shape
    
    posterior_data = {}
    for class_i in range(n_class):
        posterior_data[class_i] = {}
        
        for data_j in range(n_data):
            # Select the current record
            current_data = X_test[data_j]

            # Compute the likelihood as a product over the features
            tot_likelihood = 1
            for feature_k in range(n_feature):
                data_feature_k = current_data[feature_k]
                mean_feature_k = data_properties[class_i]['mean'][feature_k]
                var_feature_k = data_properties[class_i]['var'][feature_k]
                
                likelihood_prob = likelihood_gaussian(data=data_feature_k,
                                                      mean=mean_feature_k,
                                                      var=var_feature_k)
                
                tot_likelihood *= likelihood_prob
                
            posterior_prop = tot_likelihood * prior[class_i]
            posterior_data[class_i][data_j] = posterior_prop
    
    # Normalize so that, for each record, the class posteriors sum to 1
    for data_j in range(n_data):
        sum_prob = 0
        for class_i in range(n_class):
            sum_prob += posterior_data[class_i][data_j]
            
        for class_i in range(n_class):
            posterior_data[class_i][data_j] /= sum_prob
    
    # Prediction: for each record, pick the class with the largest posterior
    placeholder_class = np.zeros((n_data, n_class))
    for class_i in range(n_class):
        for data_j in range(n_data):
            placeholder_class[data_j][class_i] = posterior_data[class_i][data_j]
    
    choosen_class = np.argmax(placeholder_class, axis=1)

    return choosen_class, posterior_data
        

# Test the function
choosen_class, posterior_data = predict(X_test, prior_data, data_properties)
print(choosen_class)
print(posterior_data)

[0 1]
{0: {0: 0.9999998455713321, 1: 0.007821961019943816}, 1: {0: 1.5442866782106927e-07, 1: 0.9921780389800562}}
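
Multiplying many small densities can underflow when there are many features, so the same computation is often carried out with log-probabilities. The following is a vectorized sketch under that assumption. The function name `predict_log_space` and its array-based arguments are hypothetical, and the means and variances are copied from the output of `extract_data` above.

```python
import numpy as np


def predict_log_space(X_test, priors, means, variances):
    """Gaussian Naive Bayes posteriors computed with log-probabilities.

    priors: shape (n_class,); means, variances: shape (n_class, n_feature).
    """
    diff = X_test[:, None, :] - means[None, :, :]      # (n_data, n_class, n_feature)
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    log_joint = log_pdf.sum(axis=2) + np.log(priors)   # (n_data, n_class)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilise before exp
    proba = np.exp(log_joint)
    proba /= proba.sum(axis=1, keepdims=True)          # each row sums to 1
    return proba.argmax(axis=1), proba


means = np.array([[5.4175, 132.5, 7.5], [5.855, 176.25, 11.25]])
variances = np.array([[0.07291875, 418.75, 1.25], [0.026275, 92.1875, 0.6875]])
priors = np.array([0.5, 0.5])
X_new = np.array([[6.00, 130.0, 8.0], [5.60, 190.0, 10.0]])

labels, proba = predict_log_space(X_new, priors, means, variances)
print(labels)
print(proba.round(4))
```

Here `proba[i, c]` is the posterior of class `c` for record `i`, which is the transpose of the `posterior_data[class][data]` layout used by `predict`.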


In [17]:
def preprocess_dataframe_invert(data):
    # Convert the numerical target back to its categorical label
    # Target
    data['gender?'] = data['gender?'].apply(lambda x: 'female' if x==0 else 'male')
    
    return data

# Attach the predictions and convert the codes back to labels
df_test['gender?'] = choosen_class

df_test = preprocess_dataframe_invert(df_test)
df_test

| |height|weight|foot size|gender?|
|:--:|:--:|:--:|:--:|:--:|
|0|6.0|130.0|8.0|female|
|1|5.6|190.0|10.0|male|
