## Practicum Assignment 3

Using the diabetes dataset, build a voting ensemble from the following algorithms:

1. Logistic Regression

2. SVM with a polynomial kernel

3. Decision Tree

You may also explore hyperparameter tuning.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [3]:
# Load data
df = pd.read_csv('asset/diabetes.csv')
df.head()

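|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |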


In [4]:
# Check the column names
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [5]:
# Check each column for null values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [6]:
# Count zero values per feature column (the data has no nulls, so zeros may stand in for missing values)
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for column in feature_columns:
    print("============================================")
    print(f"{column} ==> Missing zeros : {len(df.loc[df[column] == 0])}")

Pregnancies ==> Missing zeros : 111
Glucose ==> Missing zeros : 5
BloodPressure ==> Missing zeros : 35
SkinThickness ==> Missing zeros : 227
Insulin ==> Missing zeros : 374
BMI ==> Missing zeros : 11
DiabetesPedigreeFunction ==> Missing zeros : 0
Age ==> Missing zeros : 0


In [8]:
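# Replace every zero in the feature columns with the column mean.
# Note: this also treats 0 in Pregnancies as missing, and the imputer is
# fitted on the full dataset before the train/test split.
fill_values = SimpleImputer(missing_values=0, strategy="mean", copy=False)

df[feature_columns] = fill_values.fit_transform(df[feature_columns])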
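The imputation above replaces zeros in every feature column, including `Pregnancies`, where zero is a plausible value, and it is fitted before the data is split. A minimal sketch of one alternative follows, assuming that only some columns use zero as a placeholder and that the imputer should learn its means from the training rows only. The `toy` frame and the `zero_as_missing` list are made up for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Small synthetic frame standing in for the diabetes data (values are made up).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 0, 5, 1, 3, 0, 4],
    "Glucose": [148, 0, 183, 89, 137, 0, 120, 110],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1, 30.0, 0.0, 25.0],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Assumption: zero is a real value for Pregnancies but a placeholder elsewhere.
zero_as_missing = ["Glucose", "BMI"]

X_toy = toy.drop(columns="Outcome").astype(float)
y_toy = toy["Outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# Fit on the training rows only, then reuse the learned means on the test rows.
imputer = SimpleImputer(missing_values=0, strategy="mean")
X_tr[zero_as_missing] = imputer.fit_transform(X_tr[zero_as_missing])
X_te[zero_as_missing] = imputer.transform(X_te[zero_as_missing])

print(X_tr)
print(X_te)
```

Which columns belong in `zero_as_missing` is a domain decision; the list here is only an example.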

In [9]:
# Split the data into training and test sets
X = df[feature_columns]
y = df.Outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
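The split above is purely random, so the share of positive cases can differ between the training and test sets. A minimal sketch of a stratified split is shown below, using a made-up label vector (`y_demo`, about 35% positives) rather than the notebook's data. `stratify` is a standard `train_test_split` parameter; whether it helps here is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 130 negatives and 70 positives (illustrative only).
y_demo = np.array([0] * 130 + [1] * 70)
X_demo = np.arange(200).reshape(-1, 1)

# stratify=y_demo keeps the class ratio roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```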

In [10]:
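# Standardize the features (fit on the training set only).
# Note: the cells below fit on the unscaled X_train / X_test,
# so X_train_scaled and X_test_scaled are not used afterwards.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)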
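The scaled arrays are not passed to the grid searches that follow, which fit on the unscaled `X_train`. One way to make sure scaling is applied, and re-fitted inside every cross-validation fold, is to wrap the scaler and the model in a `Pipeline`. The sketch below uses synthetic data from `make_classification` and a raised `max_iter`; both are assumptions for illustration, not settings from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training part of each CV fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_te, y_te))
```

The fitted `search.best_estimator_` is itself a pipeline, so it can be passed to a `VotingClassifier` and will scale its inputs on its own.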

In [11]:
# Create the Logistic Regression model
logreg = LogisticRegression(random_state=42)

# Hyperparameter tuning for Logistic Regression
param_grid_logreg = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear']
}
grid_logreg = GridSearchCV(logreg, param_grid_logreg, cv=5, n_jobs=-1, verbose=1)
grid_logreg.fit(X_train, y_train)

# Best model, refitted on the full training set
best_logreg = grid_logreg.best_estimator_

Fitting 5 folds for each of 8 candidates, totalling 40 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [12]:
# Create the SVM model with a polynomial kernel
# (probability=True is required for soft voting later on)
svm = SVC(kernel='poly', probability=True, random_state=42)

# Hyperparameter tuning for the polynomial-kernel SVM
param_grid_svm = {
    'C': [0.01, 0.1, 1, 10],
    'degree': [2, 3, 4],
    'coef0': [0.0, 0.1, 0.5]
}
grid_svm = GridSearchCV(svm, param_grid_svm, cv=5, n_jobs=-1, verbose=1)
grid_svm.fit(X_train, y_train)

# Best model, refitted on the full training set
best_svm = grid_svm.best_estimator_

Fitting 5 folds for each of 36 candidates, totalling 180 fits


In [13]:
# Create the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# Hyperparameter tuning for the Decision Tree
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}
grid_dt = GridSearchCV(dt, param_grid_dt, cv=5, n_jobs=-1, verbose=1)
grid_dt.fit(X_train, y_train)

# Best model, refitted on the full training set
best_dt = grid_dt.best_estimator_

Fitting 5 folds for each of 36 candidates, totalling 180 fits


In [14]:
# Best parameters for each model
print("\nBest parameters for Logistic Regression:", grid_logreg.best_params_)
print("Best parameters for SVM:", grid_svm.best_params_)
print("Best parameters for Decision Tree:", grid_dt.best_params_)


Best parameters for Logistic Regression: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
Best parameters for SVM: {'C': 10, 'coef0': 0.1, 'degree': 3}
Best parameters for Decision Tree: {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 2}


In [17]:
# Best mean cross-validated accuracy (5 folds, training set) for each model
print("\nBest Score for Logistic Regression:", grid_logreg.best_score_)
print("Best Score for SVM:", grid_svm.best_score_)
print("Best Score for Decision Tree:", grid_dt.best_score_)


Best Score for Logistic Regression: 0.783939079266182
Best Score for SVM: 0.7802872966424367
Best Score for Decision Tree: 0.7523018345448251
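`best_score_` reports only the winning candidate's mean cross-validated score. To see how close the other candidates were, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. The sketch below runs a small grid on synthetic data; the data and the reduced grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, None]},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked candidate comes first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results)
```

Comparing `std_test_score` with the gaps between the mean scores gives a rough sense of whether the differences between candidates are meaningful.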


In [20]:
# Build the Voting Classifier from the tuned models
voting_clf_tuned = VotingClassifier(estimators=[('logreg', best_logreg), ('svm', best_svm), ('dt', best_dt)], voting='soft')

# Train the Voting Classifier
voting_clf_tuned.fit(X_train, y_train)

# Predict on the test data
y_pred = voting_clf_tuned.predict(X_test)

# Evaluate performance
print("Accuracy Voting Classifier:", accuracy_score(y_test, y_pred))
print("\nClassification Report Voting Classifier :\n", classification_report(y_test, y_pred))

Accuracy Voting Classifier: 0.7532467532467533

Classification Report Voting Classifier :
               precision    recall  f1-score   support

           0       0.79      0.84      0.82       151
           1       0.66      0.59      0.62        80

    accuracy                           0.75       231
   macro avg       0.73      0.71      0.72       231
weighted avg       0.75      0.75      0.75       231



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
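With `voting='soft'` and no weights, the ensemble averages the class probabilities of its members and predicts the class with the highest average. The sketch below reproduces that computation by hand on synthetic data, with two members instead of three to keep it short; the data and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

clf = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ],
    voting="soft",
)
clf.fit(X_demo, y_demo)

# Average the members' class probabilities, then take the most probable class.
mean_proba = np.mean([est.predict_proba(X_demo) for est in clf.estimators_], axis=0)
manual_pred = clf.classes_[np.argmax(mean_proba, axis=1)]

print("Matches VotingClassifier.predict:", np.array_equal(manual_pred, clf.predict(X_demo)))
```

Because every member must expose `predict_proba`, an `SVC` used in a soft-voting ensemble needs `probability=True`, as in the notebook's SVM cell.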
