# Pre-Lab Assignment

The Pre-Lab Assignment uses the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package/download?datasetVersionNumber=2) _dataset_. Without considering time (`Date`), predict the rain status for the following day (`RainTomorrow`). Output `1` if rain is predicted for the following day and `0` otherwise.

<br>
By:
<br>
1. 13520135 - Muhammad Alif Putra Yasa
<br>
2. 13520165 - Ghazian Tsabit Alkamil

# 0. Data and Library Preparation

In [20]:
# Place library imports here.
import numpy as np
import pandas as pd
import scipy.stats as zscore
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [21]:
# Read the data here.

data = pd.read_csv("weatherAUS.csv")
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


# I. Data Understanding
The goal of this section is for participants to understand the quality of the data provided. It covers the following:
1. Data size
2. Statistics of each feature
3. Outliers
4. Correlation
5. Distribution

## I.1
Find:
1. The size of the data (instances and features)
2. The type of each feature
3. The number of unique values of each categorical feature
4. The minimum, maximum, mean, median, and standard deviation of each non-categorical feature
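
The four items above could be computed with pandas roughly as follows. This is a minimal sketch, not the submitted answer: it runs on a small hypothetical frame (`toy`) whose values are made up and only mimic a few `weatherAUS.csv` columns. In the notebook, the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are made up for illustration only.
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "Rainfall": [0.6, 0.0, 0.0, 1.0],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# 1. Size of the data: (instances, features)
n_rows, n_cols = toy.shape

# 2. Type of each feature
dtypes = toy.dtypes

# 3. Number of unique values per categorical (non-numeric) feature
categorical = toy.select_dtypes(exclude="number")
unique_counts = categorical.nunique()

# 4. Min, max, mean, median, and standard deviation of numeric features
numeric = toy.select_dtypes(include="number")
stats = numeric.agg(["min", "max", "mean", "median", "std"]).T

print(n_rows, n_cols)
print(dtypes)
print(unique_counts)
print(stats)
```

Missing values are skipped by these aggregations by default, so the statistics describe only the observed values of each feature.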

In [22]:
# I.1 Code here.


## I.2
Find:
1. Missing values in each feature
2. Outliers in each feature

In [23]:
# I.2.1 Missing values in each feature

missing_value = df.isnull().sum()
missing_value = missing_value.to_frame()
missing_value.columns = ['missing value count']
missing_value

Unnamed: 0,missing value count
Date,0
Location,0
MinTemp,1485
MaxTemp,1261
Rainfall,3261
Evaporation,62790
Sunshine,69835
WindGustDir,10326
WindGustSpeed,10263
WindDir9am,10566


In [24]:
'''
I.2.2 Outliers in each feature

Outliers are detected with the IQR method for numerical features and with a
frequency-distribution method for categorical features.
For a numerical feature, a value is treated as an outlier if it lies outside the
range [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where IQR = Q3 - Q1.
For a categorical feature, a category is treated as an outlier if the z-score of
its frequency is greater than 2.5.
'''

def IQR_outlier(data, col):
    # Q1 and Q3
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    # IQR
    IQR = Q3 - Q1
    # Lower bound and upper bound
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Rows whose value falls outside the bounds
    outlier = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    return outlier, lower_bound, upper_bound

# Outliers in each feature, collected as one row per feature
rows = []

for col in df.columns:
    if df[col].dtype == 'float64' or df[col].dtype == 'int64':
        outliers, lower_bound, upper_bound = IQR_outlier(df, col)
        rows.append({'Feature': col, 'Outlier count': len(outliers), 'Upper bound': upper_bound, 'Lower bound': lower_bound, 'Outlier values': None})
    else:
        # Frequency of each category
        freq = df[col].value_counts()
        # z-score of each category's frequency
        z = zscore.zscore(freq)
        # Categories whose frequency z-score exceeds the threshold
        outliers = freq[z > 2.5]
        # For categorical features, 'Lower bound' holds the z-score threshold
        rows.append({'Feature': col, 'Outlier count': len(outliers), 'Upper bound': None, 'Lower bound': 2.5, 'Outlier values': ', '.join(map(str, outliers.index)) or None})

outliers_data = pd.DataFrame(rows, columns = ['Feature', 'Outlier count', 'Upper bound', 'Lower bound', 'Outlier values'])
outliers_data

# For numerical features, the outlier values themselves are not added to the data frame because there are
# too many to display. For categorical features, the outlier categories are listed in the
# 'Outlier values' column.

Unnamed: 0,Feature,Outlier count,Upper bound,Lower bound,Outlier values
0,Date,0,,2.5,
1,Location,0,,2.5,
2,MinTemp,54,30.85,-6.35,
3,MaxTemp,489,43.65,2.45,
4,Rainfall,25578,2.0,-1.2,
5,Evaporation,1995,14.6,-4.6,
6,Sunshine,0,19.3,-3.9,
7,WindGustDir,0,,2.5,
8,WindGustSpeed,3092,73.5,5.5,
9,WindDir9am,1,,2.5,N


## I.3
Do the following:
1. Find the correlation between features
2. Visualize the distribution of each feature (categorical and continuous)
3. Visualize the distribution of each feature per target (`RainTomorrow`)
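
As a rough illustration of the three items above, the sketch below computes the numbers that such plots would be drawn from. It is an assumption-laden example rather than the submitted answer: the frame `toy` is synthetic, and only its column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in for df; the values are synthetic.
humidity = rng.uniform(10, 100, n)
toy = pd.DataFrame({
    "Humidity3pm": humidity,
    "Sunshine": 12 - humidity / 10 + rng.normal(0, 1, n),
    "WindGustDir": rng.choice(["N", "E", "S", "W"], n),
    "RainTomorrow": np.where(humidity > 70, "Yes", "No"),
})

# 1. Pairwise Pearson correlation between numeric features
corr = toy.corr(numeric_only=True)

# 2. Distribution of a categorical feature (relative frequencies)
dir_dist = toy["WindGustDir"].value_counts(normalize=True)

# 3. Distribution of a numeric feature per target class
per_target = toy.groupby("RainTomorrow")["Humidity3pm"].describe()

print(corr.round(2))
print(dir_dist.round(2))
print(per_target.round(1))
```

For actual plots, `DataFrame.hist` (which needs matplotlib installed) is one option for the numeric features, and a bar chart of `value_counts()` is one option for the categorical ones.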

In [25]:
# I.3 Code here.

## I.4
Perform further analysis if needed, then do the following:
1. Add features where possible
2. Drop features you consider unnecessary
3. Handle missing values
4. Transform categorical data into numerical data (_encoding_)
5. _Scaling_ with `MinMaxScaler`

### I.4.1 Adding features where possible

Based on the data analysis, no new features need to be added. The existing data
is sufficient to predict the rain status for the following day.

In [26]:
'''
I.4.2 Dropping unneeded features

Based on the data analysis, the unneeded features are 'Date' and 'Location'.
'''

# Drop the unneeded features
df.drop(['Date', 'Location'], axis = 1, inplace = True)
df.head()


Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [27]:
'''
I.4.3 Handling missing values

Based on the data analysis, missing values are handled in two ways, depending on
the feature type. For numerical features, missing values are filled with the
feature's mean. For categorical features, missing values are filled with the
feature's mode.
'''

# Fill missing values: mean for numerical features, mode for categorical features.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
for col in df.columns:
    if df[col].dtype == 'float64' or df[col].dtype == 'int64':
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

In [28]:
# I.4.4 Transforming categorical data into numerical data (encoding)

encoder = LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = encoder.fit_transform(df[col])
df.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,13.4,22.9,0.6,5.468232,7.611178,13,44.0,13,14,20.0,...,71.0,22.0,1007.7,1007.1,8.0,4.50993,16.9,21.8,0,0
1,7.4,25.1,0.0,5.468232,7.611178,14,44.0,6,15,4.0,...,44.0,25.0,1010.6,1007.8,4.447461,4.50993,17.2,24.3,0,0
2,12.9,25.7,0.0,5.468232,7.611178,15,46.0,13,15,19.0,...,38.0,30.0,1007.6,1008.7,4.447461,2.0,21.0,23.2,0,0
3,9.2,28.0,0.0,5.468232,7.611178,4,24.0,9,0,11.0,...,45.0,16.0,1017.6,1012.8,4.447461,4.50993,18.1,26.5,0,0
4,17.5,32.3,1.0,5.468232,7.611178,13,41.0,1,7,7.0,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,0,0


In [29]:
# I.4.5 Scaling with MinMaxScaler

scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns = df.columns)
scaled_df.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,0.516509,0.523629,0.001617,0.037712,0.524909,0.866667,0.294574,0.866667,0.933333,0.153846,...,0.71,0.22,0.449587,0.48,0.888889,0.501103,0.508439,0.522073,0.0,0.0
1,0.375,0.565217,0.0,0.037712,0.524909,0.933333,0.294574,0.4,1.0,0.030769,...,0.44,0.25,0.497521,0.4912,0.494162,0.501103,0.514768,0.570058,0.0,0.0
2,0.504717,0.57656,0.0,0.037712,0.524909,1.0,0.310078,0.866667,1.0,0.146154,...,0.38,0.3,0.447934,0.5056,0.494162,0.222222,0.594937,0.548944,0.0,0.0
3,0.417453,0.620038,0.0,0.037712,0.524909,0.266667,0.139535,0.6,0.0,0.084615,...,0.45,0.16,0.613223,0.5712,0.494162,0.501103,0.533755,0.612284,0.0,0.0
4,0.613208,0.701323,0.002695,0.037712,0.524909,0.866667,0.271318,0.066667,0.466667,0.053846,...,0.82,0.33,0.500826,0.4624,0.777778,0.888889,0.527426,0.673704,0.0,0.0


# II. Experiment Design
The goal of this section is for participants to understand how to properly run experiments to find the best method. It covers the following:
1. Model building
2. Validation process
3. _Hyperparameter tuning_

## II.1
Determine the metrics to be used in this experiment. More than one metric may be used.

1. Accuracy
2. Recall
3. F1 Score
4. Precision

## II.2
Split the data with a ratio of 0.8 for training data and 0.2 for validation data.

In [39]:
# II.2 Code here

# Note: the features are taken from df (encoded but unscaled); scaled_df from I.4.5 is not used here.
x = df[['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday']]
y = df['RainTomorrow']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,  train_size=0.8, random_state = 42)

0         0
1         0
2         0
3         0
4         0
         ..
145455    0
145456    0
145457    0
145458    0
145459    0
Name: RainToday, Length: 145460, dtype: int32

## II.3
Do the following:
1. Predict using a _logistic regression_ model as the _baseline_.
2. Show the evaluation of the resulting model using the metrics chosen in II.1.
3. Show the _confusion matrix_.

In [32]:
# II.3.1  Prediction using a logistic regression model as the baseline

lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
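
The warning above suggests either scaling the data or raising `max_iter`. The sketch below shows one way to do both, under stated assumptions: the data is synthetic (three made-up features on very different scales), and `max_iter=1000` is an arbitrary choice rather than a tuned value. Wrapping the scaler and the model in a pipeline also means the scaler is fitted on the training split only.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
n = 1000

# Synthetic features on very different scales (loosely like pressure, rainfall, humidity).
X = np.column_stack([
    rng.normal(1015, 7, n),
    rng.exponential(2.0, n),
    rng.uniform(0, 100, n),
])
y = (X[:, 2] + rng.normal(0, 15, n) > 60).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale to [0, 1] first, then fit logistic regression with a higher iteration limit.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_train, y_train)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("ConvergenceWarning count:", n_convergence_warnings)
print("Validation accuracy:", model.score(X_val, y_val))
```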


In [33]:
# II.3.2  Evaluation of the logistic regression model

print("Evaluation of the logistic regression model using the previously chosen metrics: ")
print(classification_report(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))

Evaluation of the logistic regression model using the previously chosen metrics: 
              precision    recall  f1-score   support

           0       0.86      0.95      0.90     22672
           1       0.71      0.45      0.55      6420

    accuracy                           0.84     29092
   macro avg       0.79      0.70      0.73     29092
weighted avg       0.83      0.84      0.82     29092

Accuracy:  0.8387872954764196
Precision:  0.7128444881889764
Recall:  0.4512461059190031
F1 Score:  0.552651659671881


In [34]:
# II.3.3  Confusion matrix of the logistic regression model

print("Confusion matrix of the logistic regression model: ")
print(pd.DataFrame(confusion_matrix(y_test, y_pred), columns = ['Predicted No', 'Predicted Yes'], index = ['Actual No', 'Actual Yes']))

Confusion matrix of the logistic regression model: 
            Predicted No  Predicted Yes
Actual No          21505           1167
Actual Yes          3523           2897
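
As a worked check, the four metrics reported in II.3.2 can be recomputed by hand from the confusion matrix above, treating "Yes" (`1`) as the positive class. The variable names below are illustrative only.

```python
# Counts copied from the confusion matrix above.
tn, fp = 21505, 1167   # Actual No:  predicted No / predicted Yes
fn, tp = 3523, 2897    # Actual Yes: predicted No / predicted Yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8388
print(f"Precision: {precision:.4f}")  # 0.7128
print(f"Recall:    {recall:.4f}")     # 0.4512
print(f"F1 score:  {f1:.4f}")         # 0.5527
```

The low recall comes from the 3523 rainy days predicted as "No", which is more than the 2897 rainy days predicted correctly.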


## II.4 
Do the following:
1. Train another model
2. _Hyperparameter tuning_ for the chosen model using _grid search_ (pay attention to the _random factor_ in some model algorithms)
3. Validation with _cross validation_

In [35]:
# II.4 Code here.

# III. Improvement
In this section, you are expected to:
1. train on _oversampled_ / _undersampled_ data, with proper validation; and
2. apply several methods for combining multiple models.

Both are examples of methods for improving model performance.

## III.1
Do the following:
1. _Oversampling_ of the minority class in the training data
2. _Undersampling_ of the majority class in the training data

At each step, train the *baseline* model (II.3) and validate on the validation data. The training and validation data are those prepared in section II.2.

In [36]:
# III.1.1 Oversampling the minority class in the training data

over_sampler = RandomOverSampler(sampling_strategy="not majority", random_state = 42)
over_x_train, over_y_train = over_sampler.fit_resample(x_train, y_train)

lr.fit(over_x_train, over_y_train)
over_y_pred = lr.predict(x_test)

print("Evaluation of the baseline model with oversampling: ")
print(classification_report(y_test, over_y_pred))
print('Accuracy: ', accuracy_score(y_test, over_y_pred))
print('Precision: ', precision_score(y_test, over_y_pred))
print('Recall: ', recall_score(y_test, over_y_pred))
print('F1 Score: ', f1_score(y_test, over_y_pred))

# Note: this cross-validates on the original x_train / y_train, not on the oversampled data.
crossval_score = cross_val_score(lr, x_train, y_train.values.ravel(), cv=5)
print("Cross validation score: ", crossval_score)

Evaluation of the baseline model with oversampling: 
              precision    recall  f1-score   support

           0       0.92      0.79      0.85     22672
           1       0.51      0.75      0.60      6420

    accuracy                           0.78     29092
   macro avg       0.71      0.77      0.73     29092
weighted avg       0.83      0.78      0.80     29092

Accuracy:  0.7826206517255603
Precision:  0.5050135784416127
Recall:  0.7531152647975078
F1 Score:  0.6046017256471177


Cross validation score:  [0.84368824 0.83651285 0.83857523 0.83929876 0.83916985]


In [37]:
# III.1.2 Undersampling the majority class in the training data

under_sampler = RandomUnderSampler(sampling_strategy=1, random_state = 42)
under_x_train, under_y_train = under_sampler.fit_resample(x_train, y_train)

lr.fit(under_x_train, under_y_train)
under_y_pred = lr.predict(x_test)

print("Evaluation of the baseline model with undersampling: ")
print(classification_report(y_test, under_y_pred))
print('Accuracy: ', accuracy_score(y_test, under_y_pred))
print('Precision: ', precision_score(y_test, under_y_pred))
print('Recall: ', recall_score(y_test, under_y_pred))
print('F1 Score: ', f1_score(y_test, under_y_pred))

# Note: this cross-validates on the original x_train / y_train, not on the undersampled data.
crossval_score = cross_val_score(lr, x_train, y_train.values.ravel(), cv=5)
print("Cross validation score: ", crossval_score)

Evaluation of the baseline model with undersampling: 
              precision    recall  f1-score   support

           0       0.92      0.79      0.85     22672
           1       0.50      0.76      0.60      6420

    accuracy                           0.78     29092
   macro avg       0.71      0.77      0.72     29092
weighted avg       0.83      0.78      0.79     29092

Accuracy:  0.7792520280489482
Precision:  0.4998969497114592
Recall:  0.755607476635514
F1 Score:  0.601711734061027


Cross validation score:  [0.84368824 0.83651285 0.83857523 0.83929876 0.83916985]
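
The cross-validation scores in III.1.1 and III.1.2 are identical because `cross_val_score` is called on the original `x_train` and `y_train` in both cells, so the resampling never enters the validation. One possible way to cross-validate a resampled model is to resample inside each training fold only, as sketched below. This is an illustration under stated assumptions: the data is synthetic, and `sklearn.utils.resample` is swapped in for `RandomOverSampler` so that the example depends only on scikit-learn. The helper `oversample_minority` is hypothetical and assumes class `1` is the minority.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for x_train / y_train.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3,
    weights=[0.78, 0.22], random_state=42,
)

def oversample_minority(X_part, y_part, seed):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    minority = y_part == 1
    X_extra, y_extra = resample(
        X_part[minority], y_part[minority],
        replace=True,
        n_samples=int((~minority).sum()),
        random_state=seed,
    )
    X_bal = np.vstack([X_part[~minority], X_extra])
    y_bal = np.concatenate([y_part[~minority], y_extra])
    return X_bal, y_bal

recalls = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Resample the training fold only; the held-out fold keeps its original class ratio.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], seed=fold)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    recalls.append(recall_score(y[val_idx], clf.predict(X[val_idx])))

print("Per-fold recall:", np.round(recalls, 3))
```

Resampling before splitting would let duplicated minority rows appear in both the training and the held-out fold, which tends to inflate the scores.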


## III.2
Do the following:
1. Explore _soft voting_, _hard voting_, and _stacking_.
2. Build _logistic regression_ and SVM models.
3. Apply _soft voting_ to the models built in item 2.
4. Apply _hard voting_ to the models built in item 2.
5. Apply _stacking_ to the models built in item 2, with _logistic regression_ as the _final classifier_.
6. Validate items 3, 4, and 5 using the chosen metrics.

(Write the results of the III.2 item 1 exploration here.)
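
In brief, soft voting averages the class probabilities predicted by the base models, hard voting takes the majority of their predicted labels, and stacking trains a final classifier on the base models' predictions. Items 2 to 6 could be sketched with scikit-learn as below. This is a minimal example on synthetic data, not the submitted answer. The helper `base_models` is hypothetical, and F1 is used as a single representative metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the prepared features and target.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.78, 0.22], random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def base_models():
    # Fresh estimators per ensemble; SVC needs probability=True for soft voting.
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=42)),
    ]

ensembles = {
    "soft voting": VotingClassifier(base_models(), voting="soft"),
    "hard voting": VotingClassifier(base_models(), voting="hard"),
    "stacking": StackingClassifier(
        base_models(),
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_val, model.predict(X_val))
    print(f"{name}: F1 = {scores[name]:.3f}")
```

With only two base models, hard voting can tie. scikit-learn then picks the class that comes first in ascending label order, which here is `0`, so hard voting may lean toward predicting "no rain".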

In [38]:
# III.2 Code here.

# IV. Analisis
Compare the results of the following:
1. _Baseline_ model (II.3)
2. Other model (II.4)
3. _Undersampling_ results
4. _Oversampling_ results
5. _Soft voting_ results
6. _Hard voting_ results
7. _Stacking_ results

(Write the answer for section IV here.)
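
One possible starting point for the comparison is to collect the validation metrics in a single table. The sketch below covers only the three results that have recorded outputs in this notebook (II.3, III.1.1, III.1.2), with values rounded to four decimals. Rows for the other models would be added once those cells are filled in.

```python
import pandas as pd

# Validation metrics copied from the outputs of II.3, III.1.1, and III.1.2.
results = pd.DataFrame(
    {
        "accuracy": [0.8388, 0.7826, 0.7793],
        "precision": [0.7128, 0.5050, 0.4999],
        "recall": [0.4512, 0.7531, 0.7556],
        "f1": [0.5527, 0.6046, 0.6017],
    },
    index=["baseline", "oversampling", "undersampling"],
)

# Sort by F1 so the best balance of precision and recall comes first.
print(results.sort_values("f1", ascending=False))
```

On these numbers, both resampling variants trade accuracy and precision for a much higher recall on the rain class, and their F1 scores are slightly above the baseline.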