## Task 3
Using the diabetes dataset, build a voting ensemble from the following algorithms:

1. Logistic Regression
2. SVM with a polynomial kernel
3. Decision Tree

## Answer

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder # for label encoding

### Data Preparation

In [2]:
df = pd.read_csv('assets/diabetes.csv')

df.head(100)

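|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 6 | 144 | 72 | 27 | 228 | 33.9 | 0.255 | 40 | 0 |
| 96 | 2 | 92 | 62 | 28 | 0 | 31.6 | 0.130 | 24 | 0 |
| 97 | 1 | 71 | 48 | 18 | 76 | 20.4 | 0.323 | 22 | 0 |
| 98 | 6 | 93 | 50 | 30 | 64 | 28.7 | 0.356 | 23 | 0 |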


In [3]:
# Count the number of zeros in each column using pandas
count_zeros_per_column = (df == 0).sum()

print("Number of zeros in each column:")
print(count_zeros_per_column)

Number of zeros in each column:
Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64


### Data analysis: wait, why are there zeros?
Several columns contain zeros. That is suspicious, because no living person has zero glucose, blood pressure, or insulin. To handle this, the zeros are treated as missing values and imputed with the mean of the non-zero values in the same column. Note that the imputation below is applied to every feature column, including Pregnancies, where 0 is a legitimate value.
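
As a rough illustration of what mean imputation does to a single column, here is a minimal plain-Python sketch. The helper name `impute_zeros_with_mean` and the sample values are made up for this example; the notebook itself relies on scikit-learn's `SimpleImputer` instead.

```python
from statistics import fmean


def impute_zeros_with_mean(values):
    """Replace zeros with the mean of the non-zero entries (hypothetical helper)."""
    non_zero = [v for v in values if v != 0]
    if not non_zero:
        # Nothing to average over, so leave the column unchanged.
        return list(values)
    mean = fmean(non_zero)
    return [mean if v == 0 else v for v in values]


# Made-up glucose-like readings; the zero stands in for a missing measurement.
glucose = [100, 0, 140, 120]
imputed = impute_zeros_with_mean(glucose)
print(imputed)
```

The mean is taken over the non-zero readings only, so the placeholder zeros do not drag the imputed value down.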

### Data imputation

In [4]:
# Impute zeros with the mean
from sklearn.impute import SimpleImputer

feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

fill_values = SimpleImputer(missing_values=0, strategy="mean", copy=False)

df[feature_columns] = fill_values.fit_transform(df[feature_columns])

In [5]:
df.head(100)

|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6.000000 | 148.0 | 72.0 | 35.00000 | 155.548223 | 33.6 | 0.627 | 50.0 | 1 |
| 1 | 1.000000 | 85.0 | 66.0 | 29.00000 | 155.548223 | 26.6 | 0.351 | 31.0 | 0 |
| 2 | 8.000000 | 183.0 | 64.0 | 29.15342 | 155.548223 | 23.3 | 0.672 | 32.0 | 1 |
| 3 | 1.000000 | 89.0 | 66.0 | 23.00000 | 94.000000 | 28.1 | 0.167 | 21.0 | 0 |
| 4 | 4.494673 | 137.0 | 40.0 | 35.00000 | 168.000000 | 43.1 | 2.288 | 33.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 6.000000 | 144.0 | 72.0 | 27.00000 | 228.000000 | 33.9 | 0.255 | 40.0 | 0 |
| 96 | 2.000000 | 92.0 | 62.0 | 28.00000 | 155.548223 | 31.6 | 0.130 | 24.0 | 0 |
| 97 | 1.000000 | 71.0 | 48.0 | 18.00000 | 76.000000 | 20.4 | 0.323 | 22.0 | 0 |
| 98 | 6.000000 | 93.0 | 50.0 | 30.00000 | 64.000000 | 28.7 | 0.356 | 23.0 | 0 |


### Split Data

In [6]:
X = df[feature_columns]
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Data analysis: standardization needed
The features sit on very different scales: some are decimals, some are integers, some are small, and some are large. Standardization puts every column on a comparable scale so that training works better.
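
To make the idea concrete, the sketch below standardizes one made-up feature by hand: subtract the training mean and divide by the training standard deviation, then reuse those same two numbers for the test values. The sample numbers and variable names are illustrative assumptions; the population standard deviation is used, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Made-up values for a single feature, split into train and test.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 9.0]

# Statistics are estimated on the training data only.
mu = fmean(train)
sigma = pstdev(train)

# The same mean and standard deviation are applied to both splits.
train_std = [(x - mu) / sigma for x in train]
test_std = [(x - mu) / sigma for x in test]

print(mu, sigma)
print(test_std)
```

Fitting on the training split and only transforming the test split keeps information about the test data out of the scaling step.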

### Standardization with StandardScaler

In [7]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Standardize the features in X_train and X_test
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

In [8]:
X_train_std

array([[-0.82506128, -1.25828206,  0.01321033, ...,  0.01501323,
        -0.49073479, -1.03594038],
       [ 1.57255664, -0.32735374,  0.8068672 , ..., -0.59935041,
         2.41502991,  1.48710085],
       [-1.16757813,  0.57032714, -2.17095414, ..., -0.52719904,
         0.54916055, -0.94893896],
       ...,
       [ 1.91507348, -0.69307558,  1.13773624, ...,  1.91151712,
         1.981245  ,  0.44308379],
       [ 0.02940616,  0.63682202,  0.01321033, ...,  1.44974838,
        -0.78487662, -0.33992901],
       [ 0.02940616,  0.10486298,  1.96490883, ..., -1.42187598,
        -0.61552223, -1.03594038]])

### Training an SVM with a Polynomial Kernel

In [9]:
from sklearn.svm import SVC

model = SVC(kernel='poly') # SVC model without tuning

model.fit(X_train_std, y_train) # fit the model

y_pred_svm = model.predict(X_test_std) # predict with the model

acc_svm_pol = accuracy_score(y_test, y_pred_svm) # check the accuracy of the polynomial SVM

print("Test set accuracy: {:.2f}".format(acc_svm_pol))
print(f"Test set accuracy: {acc_svm_pol}")

Test set accuracy: 0.74
Test set accuracy: 0.7402597402597403


### Hyperparameter Tuning for the SVM (setting degree and C)

In [12]:
from sklearn.svm import SVC

model = SVC(kernel='poly', degree=1, C=1.0) # SVC model with degree=1 (default is 3) and C=1.0 (the default)

model.fit(X_train_std, y_train) # fit the model

y_pred_svm = model.predict(X_test_std) # predict with the model

acc_svm_pol = accuracy_score(y_test, y_pred_svm) # check the accuracy of the polynomial SVM

print("Test set accuracy: {:.2f}".format(acc_svm_pol))
print(f"Test set accuracy: {acc_svm_pol}")

Test set accuracy: 0.77
Test set accuracy: 0.7662337662337663
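
For context on what `degree` controls, scikit-learn documents the polynomial kernel as `(gamma * <x, x'> + coef0) ** degree`, with `coef0` defaulting to 0. The plain-Python sketch below evaluates that formula for two made-up vectors. The function name `poly_kernel` and the fixed `gamma=1.0` are assumptions for illustration (the default in `SVC` is `gamma='scale'`, which is derived from the data).

```python
def poly_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel value for two vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# Two made-up feature vectors.
x = [1.0, 2.0]
y = [3.0, 4.0]

k_degree_1 = poly_kernel(x, y, degree=1)  # same as the plain dot product here
k_degree_3 = poly_kernel(x, y, degree=3)  # dot product raised to the third power
print(k_degree_1, k_degree_3)
```

With `degree=1` and `coef0=0` the kernel reduces to a scaled dot product, so the model in the cell above behaves like a linear SVM.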


### Training a Decision Tree

In [13]:
from sklearn.tree import DecisionTreeClassifier # import DT

dt = DecisionTreeClassifier()

# Fit dt to the training set
dt.fit(X_train_std, y_train)

# Predict the test set labels
y_pred_dt = dt.predict(X_test_std)

# Compute test set accuracy
acc_dt = accuracy_score(y_test, y_pred_dt)
print("Test set accuracy: {:.2f}".format(acc_dt))
print(f"Test set accuracy: {acc_dt}")

Test set accuracy: 0.70
Test set accuracy: 0.7012987012987013


### Training Logistic Regression

In [14]:
from sklearn.linear_model import LogisticRegression # import LR

model = LogisticRegression()

model.fit(X_train_std, y_train)

# Predict the test set labels
y_pred_lr = model.predict(X_test_std)

# Compute test set accuracy for logistic regression
acc_lr = accuracy_score(y_test, y_pred_lr)
print("Test set accuracy: {:.2f}".format(acc_lr))
print(f"Test set accuracy: {acc_lr}")


### Ensemble Voting (HARD)


In [15]:
from sklearn.ensemble import VotingClassifier # import the voting classifier

# Define the models
model1 = SVC(kernel='poly')
model2 = DecisionTreeClassifier()
model3 = LogisticRegression()

# Pass the models as estimators
ensemble_model = VotingClassifier(estimators=[('SVC', model1), ('DecisionTreeClassifier', model2), ('LogisticRegression', model3)], voting='hard')

# Fit the ensemble
ensemble_model.fit(X_train_std, y_train) 

# Predict
y_pred_ens = ensemble_model.predict(X_test_std)

# Compute test set accuracy
acc_ensemble = accuracy_score(y_test, y_pred_ens)
print("Test set accuracy: {:.2f}".format(acc_ensemble))
print(f"Test set accuracy: {acc_ensemble}") 

Test set accuracy: 0.78
Test set accuracy: 0.7792207792207793


### About Ensemble Voting
Ensemble voting is a machine learning technique that combines several models, each with its own tuning. With hard voting, the final prediction is the class that appears most often among the models' predictions. For example, if two of the three models predict 0 and one predicts 1, the result is 0 (the majority class).
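
The majority rule can be sketched in a few lines of plain Python. The helper name `hard_vote` and the prediction lists are made up for illustration; in the notebook, `VotingClassifier` with `voting='hard'` does this on the fitted models.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority class for each sample (hypothetical helper)."""
    voted = []
    # zip(*...) regroups the lists so each tuple holds one sample's votes.
    for votes in zip(*predictions_per_model):
        majority_class, _ = Counter(votes).most_common(1)[0]
        voted.append(majority_class)
    return voted


# Made-up predictions from three models on four samples.
svm_preds = [0, 1, 1, 0]
tree_preds = [0, 1, 0, 0]
logreg_preds = [1, 1, 0, 1]

final_preds = hard_vote([svm_preds, tree_preds, logreg_preds])
print(final_preds)
```

With three models and two classes a tie cannot occur, which is one reason an odd number of voters is convenient for binary problems.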