# Bayes' Theorem
Bayes' theorem describes the probability of an event given prior knowledge of conditions that may be related to that event.

For example, if diabetes is related to age, then the estimated probability that a person has diabetes becomes more accurate when age is taken into account. Mathematically, Bayes' theorem is written as follows.

$$
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}
$$


- P(A|B) = probability of A given that B has occurred (the posterior)
- P(B|A) = probability of B given that A has occurred (the likelihood)
- P(A) = probability of A regardless of any other event (the prior)
- P(B) = probability of B regardless of any other event
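
As a minimal numeric sketch of the theorem, take A = "has diabetes" and B = "is older than 50". The three input probabilities below are hypothetical, chosen only to illustrate the arithmetic; they are not estimated from the dataset used later.

```python
# Hypothetical inputs (assumptions for illustration only)
p_a = 0.10              # P(A): prior probability of diabetes
p_b_given_a = 0.60      # P(B|A): probability of being older than 50, given diabetes
p_b_given_not_a = 0.20  # P(B|not A): probability of being older than 50, given no diabetes

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 3))  # 0.25
```

Under these assumed numbers, knowing that a person is older than 50 raises the probability of diabetes from the prior of 0.10 to a posterior of 0.25.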

# Load Package

In [1]:
import pandas as pd 
from datetime import datetime, timedelta, date
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score,GridSearchCV
from sklearn.preprocessing import LabelEncoder

In [3]:
from sklearn.naive_bayes import GaussianNB

# 1. Data Preprocessing

The dataset comes from kaggle.com: https://www.kaggle.com/uciml/pima-indians-diabetes-database. The variables are:

- Pregnancies : Number of times pregnant
- Glucose : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure : Diastolic blood pressure (mm Hg)
- SkinThickness : Triceps skin fold thickness (mm)
- Insulin : 2-hour serum insulin (mu U/ml)
- BMI : Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction : Diabetes pedigree function
- Age : Age (years)
- Outcome : 1 = diabetes, 0 = no diabetes

## 1.1 Import Datasets

In [4]:
url = 'https://raw.githubusercontent.com/arofiqimaulana/Course-Python-For-Data-Analysis/762d8c9c29ec0d12251c0268bd85f3c6c0b3ae24/Section%206%20(Data%20Visualization)/data/diabetes.csv'
df = pd.read_csv(url)

In [5]:
df

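| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |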


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [14]:
feature_cols = ['Age','Glucose','BloodPressure']
X = df[feature_cols]
y = df.Outcome

In [15]:
# Alternative, larger feature set (not used in the rest of this notebook):
# feature_cols = ['Pregnancies','Insulin','BMI','Age','Glucose','BloodPressure','DiabetesPedigreeFunction']

## 1.2 Split datasets
Split the dataset into training data and testing data.
- the training data is used to fit the model
- the testing data is used to measure how accurate the model is

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
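
The sketch below shows what a 70/30 split produces for a dataset of 768 rows, and how the optional `stratify` argument of `train_test_split` keeps the class ratio similar in both parts. The arrays `X_demo` and `y_demo` are synthetic stand-ins (assumptions for illustration), not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same number of rows as the dataset (768).
X_demo = np.arange(768 * 3).reshape(768, 3)
# Imbalanced hypothetical labels: one third of the rows belong to class 1.
y_demo = (np.arange(768) % 3 == 0).astype(int)

# Plain 70/30 split, same arguments as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)
print(X_tr.shape, X_te.shape)  # (537, 3) (231, 3)

# Stratified split: class proportions are preserved in train and test.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)
print(y_demo.mean(), y_tr_s.mean(), y_te_s.mean())
```

Stratifying is a common choice when one class is much rarer than the other, since a purely random split can leave the test set with a noticeably different class ratio.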

# 2. Modeling

In [9]:
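# Fit a Gaussian Naive Bayes classifier on the training data
fit = GaussianNB().fit(X_train, y_train.values.ravel())
# Predict the class (0 or 1) for every row of the test set
y_pred = fit.predict(X_test)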

In [46]:
X_test

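| | Age | Glucose | BloodPressure |
|---|---|---|---|
| 668 | 43 | 98 | 58 |
| 324 | 21 | 112 | 75 |
| 624 | 21 | 108 | 64 |
| 690 | 34 | 107 | 80 |
| 473 | 50 | 136 | 90 |
| ... | ... | ... | ... |
| 619 | 24 | 119 | 0 |
| 198 | 26 | 109 | 64 |
| 538 | 23 | 127 | 80 |
| 329 | 37 | 105 | 70 |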


In [45]:
y_pred

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## 2.2 Model Evaluation
This step checks whether the model is good enough to be used as a prediction tool. Several metrics can be used to evaluate it, each with a target of more than 80%:

- Accuracy (> 80%)
- Precision (> 80%)
- F1-Score (> 80%)
- Recall (> 80%)

In [10]:
confusion_matriks = confusion_matrix(y_test, y_pred)
classification_reports = classification_report(y_test, y_pred)

In [11]:
print(confusion_matriks)

[[129  22]
 [ 39  41]]


In [12]:
print(classification_reports)

              precision    recall  f1-score   support

           0       0.77      0.85      0.81       151
           1       0.65      0.51      0.57        80

    accuracy                           0.74       231
   macro avg       0.71      0.68      0.69       231
weighted avg       0.73      0.74      0.73       231
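
The class-1 figures in the report can be reproduced by hand from the confusion matrix printed above, which helps make the definitions concrete. The sketch assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix from the output above:
# [[129  22]
#  [ 39  41]]
tn, fp = 129, 22  # true class 0: correctly rejected, wrongly flagged as diabetes
fn, tp = 39, 41   # true class 1: missed cases, correctly detected cases

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all predictions that are correct
precision = tp / (tp + fp)                          # share of predicted positives that are real
recall = tp / (tp + fn)                             # share of real positives that are detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.74 precision=0.65 recall=0.51 f1=0.57
```

These values match the report: the accuracy of 0.74 is below the 80% target, and the recall of 0.51 means roughly half of the diabetic patients in the test set are missed.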



## 2.3 Prediction
Consider two patients with the following characteristics:
- patient 1 : age 25, glucose 200, blood pressure 58
- patient 2 : age 40, glucose 100, blood pressure 90

Does the model predict diabetes for each of them?

In [78]:
pasien = pd.DataFrame([[25,200,58],
                      [40,100,90]],columns= ['Age','Glucose','BloodPressure'])

In [79]:
prediksi = fit.predict(pasien)

In [80]:
pasien['hasil_prediksi'] = prediksi

In [82]:
"""
hasil_prediksi = 1 : predicted to have diabetes
hasil_prediksi = 0 : predicted not to have diabetes
"""
pasien

| | Age | Glucose | BloodPressure | hasil_prediksi |
|---|---|---|---|---|
| 0 | 25 | 200 | 58 | 1 |
| 1 | 40 | 100 | 90 | 0 |
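
`predict` returns only the class label. To obtain a probability for each class, `GaussianNB` also provides `predict_proba`. The sketch below is self-contained, so it trains on synthetic data: the class means and spreads are hypothetical assumptions, and the printed probabilities will therefore differ from what the notebook's fitted model would give.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 200  # rows per class in the synthetic training set (assumption)

def make_group(age, glucose, pressure):
    """Draw hypothetical patients around the given (mean, std) pairs."""
    return pd.DataFrame({
        'Age': rng.normal(*age, n),
        'Glucose': rng.normal(*glucose, n),
        'BloodPressure': rng.normal(*pressure, n),
    })

# Hypothetical class distributions, for illustration only.
X_demo = pd.concat(
    [make_group((30, 8), (110, 20), (70, 10)),    # class 0: no diabetes
     make_group((40, 10), (145, 25), (75, 10))],  # class 1: diabetes
    ignore_index=True,
)
y_demo = np.array([0] * n + [1] * n)

model = GaussianNB().fit(X_demo, y_demo)

# Same two patients as in the cells above.
pasien_demo = pd.DataFrame([[25, 200, 58],
                            [40, 100, 90]],
                           columns=['Age', 'Glucose', 'BloodPressure'])

proba = model.predict_proba(pasien_demo)  # one row per patient, one column per class
print(model.classes_)   # column order of proba
print(proba.round(3))
```

With the notebook's own objects, the analogous call would be `fit.predict_proba(pasien[['Age','Glucose','BloodPressure']])`; selecting the three feature columns is needed because `pasien` also holds the `hasil_prediksi` column by that point.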
