# Machine Learning Challenge - Kaggle Competition

Everything about the Kaggle Challenge can be accessed [here](https://www.kaggle.com/competitions/machine-learning-challenge-kaggle-competition/overview).

## Background
Pak Mulyono is a father who wants to make sure his children achieve the best possible academic results while studying in Singapore. However, because he is busy managing various businesses and national projects such as MBG, Food Estate, and IKN together with El Gemoy, he does not have enough time to analyze his children's academic data manually.

One of his children, Fufufafa, proposed using artificial intelligence (AI) to predict their college grades automatically and accurately. With help from the students of Purwadhika JCDS-2804, a machine learning model is to be built so that Pak Mulyono can monitor his children's academic performance without disrupting his busy schedule.

---

## Problem

Pak Mulyono wants to know his children's **final grade (Grade)** in college without having to check each one manually. The grades are not directly available and must be predicted from various supporting data. A Machine Learning-based **predictive model** is therefore needed to estimate their grades quickly, accurately, and efficiently.

---

## Goals

This project aims to:

1. **Build a classification model** using a Machine Learning algorithm to predict students' final grade (Grade) from academic and non-academic characteristics.
2. **Evaluate model performance** using suitable metrics (for example accuracy and F1-score) to verify prediction quality.
3. Provide **insight into the important features** that most influence students' final grades.
4. Produce predictions for students whose grades are not yet known, so that Pak Mulyono can conveniently monitor his children's academic progress.

> With this model, grade monitoring is expected to become more modern, automated, and in step with current needs.
---
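
Goal 2 names accuracy and F1-score as candidate metrics. As a minimal sketch of how the two differ, the snippet below computes both with scikit-learn on a handful of made-up labels (hypothetical values, not taken from the competition data).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for illustration only: 1 = pass, 0 = fail
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)             # share of correct predictions
f1_pass = f1_score(y_true, y_pred)                    # F1 for the positive class (1)
f1_macro = f1_score(y_true, y_pred, average='macro')  # unweighted mean of the per-class F1 scores

print(f"accuracy={accuracy:.3f}, f1_pass={f1_pass:.3f}, f1_macro={f1_macro:.3f}")
```

Accuracy treats every prediction equally, while the macro F1 averages the per-class scores, so a weak minority class pulls it down more visibly.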

## Dataset Column Descriptions

| **Column**                     | **Description**                                                                  |
|--------------------------------|----------------------------------------------------------------------------------|
| **Student_ID**                 | Unique identifier for each student.                                              |
| **First_Name**                 | Student's first name.                                                            |
| **Last_Name**                  | Student's last name.                                                             |
| **Email**                      | Student's email address (may be anonymized).                                     |
| **Gender**                     | Student's gender (Male, Female, Other).                                          |
| **Age**                        | Student's age.                                                                   |
| **Department**                 | Department the student is enrolled in (for example: Engineering, Business).      |
| **Attendance (%)**             | Attendance percentage (0–100%).                                                  |
| **Midterm_Score**              | Midterm exam score (out of 100).                                                 |
| **Final_Score**                | Final exam score (out of 100).                                                   |
| **Assignments_Avg**            | Average assignment score (out of 100).                                           |
| **Quizzes_Avg**                | Average quiz score (out of 100).                                                 |
| **Participation_Score**        | Class participation score (scale 0–10).                                          |
| **Projects_Score**             | Final project score (out of 100).                                                |
| **Total_Score**                | Overall total score.                                                             |
| **Grade**                      | Student's final grade (1 = Pass (A–C), 0 = Fail (D–F)).                          |
| **Study_Hours_per_Week**       | Average study hours per week.                                                    |
| **Extracurricular_Activities** | Participation in extracurricular activities (Yes/No).                            |
| **Internet_Access_at_Home**    | Internet access at home (Yes/No).                                                |
| **Parent_Education_Level**     | Parents' highest education level (None, High School, Bachelor's, Master's, PhD). |
| **Family_Income_Level**        | Family income level (Low, Medium, High).                                         |
| **Stress_Level (1-10)**        | Self-reported stress level (1 = Low, 10 = High).                                 |
| **Sleep_Hours_per_Night**      | Average hours of sleep per night.                                                |
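
The `Grade` target is binary: 1 for a pass (A–C) and 0 for a fail (D–F). The notebook only loads an already mapped file, so the mapping step itself is not shown. As a rough sketch of what such a mapping could look like, assuming the raw data had a letter-grade column (the name `Letter_Grade` is hypothetical):

```python
import pandas as pd

# Hypothetical raw letter grades; the pre-mapping data is not part of this notebook
raw = pd.DataFrame({'Letter_Grade': ['A', 'C', 'D', 'B', 'F']})

passing = {'A', 'B', 'C'}  # per the table: A–C = pass, everything else = fail
raw['Grade'] = raw['Letter_Grade'].isin(passing).astype(int)

print(raw['Grade'].tolist())
```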


In [2]:
# Build a model from the train data, predict Grade on the test data, join the predictions with the test data's Student_ID column, save as CSV, then upload to Kaggle, which computes the accuracy

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

### Experiment 1 (pipeline)

In [105]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission file
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Features and target
X = df_train.drop(columns='Grade')
y = df_train['Grade']

# Detect numeric and categorical columns
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
# .tolist() is optional but recommended:
# .columns returns an Index object that behaves like a list; .tolist() converts it to a plain Python list, which is safer for pipelines and other APIs that expect a list

# Split train & validation
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# Column detection runs before the split so the column lists come from the full training frame and do not depend on which rows land in the validation split

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Imputing missing values
    ('scaler', StandardScaler())  # Standardizing features
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Imputing missing categorical data
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # OneHotEncoding
])

# Combine the preprocessing steps
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# Model pipeline
model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Fit the model on the training data
model.fit(X_train, y_train)

# Evaluate on the validation data
y_pred = model.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict the test data (no refit needed: the fitted pipeline applies the preprocessing automatically)
test_pred = model.predict(df_test)

# Build the submission file
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})
# submission.to_csv('submission.csv', index=False)

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.63      0.57      0.59        83
           1       0.73      0.78      0.76       127

    accuracy                           0.70       210
   macro avg       0.68      0.67      0.68       210
weighted avg       0.69      0.70      0.69       210
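
The pipeline above relies on `OneHotEncoder(handle_unknown='ignore')` to cope with category values that were not present when the encoder was fitted. The toy example below sketches that behaviour; the department names are illustrative and not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames; 'Mathematics' never appears in the fitting data
train = pd.DataFrame({'Department': ['Engineering', 'Business', 'Engineering']})
new = pd.DataFrame({'Department': ['Business', 'Mathematics']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

encoded = encoder.transform(new).toarray()  # transform returns a sparse matrix by default
print(encoder.categories_[0])  # categories learned from the fitting data only
print(encoded)                 # the unseen category is encoded as an all-zero row
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error instead.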



### Experiment 2 (train and test combined for preprocessing)

In [104]:
# Load Dataset
train_df = pd.read_csv('train_data_mapped.csv')
test_df = pd.read_csv('test_data.csv')

# Save Student_ID from test for the submission
test_ids = test_df['Student_ID'] # saved up front so it cannot change by accident once test is combined with the train data

# Target
target = 'Grade'

# Drop non-predictor columns
drop_cols = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
X = train_df.drop(columns=drop_cols + [target])
y = train_df[target]

# Drop the same columns from test
X_test = test_df.drop(columns=drop_cols)

# Combine train and test for joint preprocessing
combined = pd.concat([X, X_test], axis=0) # some category values may appear only in the test data; combining lets the preprocessing learn from every value present

# Separate numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns
cat_cols = combined.select_dtypes(include='object').columns

# A baseline that is good enough to start with (harmless, though possibly suboptimal); refine it later while watching the score

# Impute numeric data
imputer_num = SimpleImputer(strategy='mean') # mean and most_frequent are the most common baseline choices; median, group-wise imputation, or model-based imputation are more robust alternatives
combined[num_cols] = imputer_num.fit_transform(combined[num_cols])

# Impute categorical data
imputer_cat = SimpleImputer(strategy='most_frequent')
combined[cat_cols] = imputer_cat.fit_transform(combined[cat_cols]) # the test data is included in the fit (not ideal in practice, but tolerable on Kaggle: test has no labels, so there is no target leakage, and scoring happens on Kaggle's server)
# fit learns the statistics (here from train and test combined); transform applies them to the data

# One-hot encode the categorical columns
combined = pd.get_dummies(combined, columns=cat_cols)

# Split back into X_train and X_test
X_encoded = combined.iloc[:len(X)] # len(X) is the number of train rows before combining (no +1 needed: row positions start at 0 and the slice end is exclusive)
X_test_encoded = combined.iloc[len(X):]

# Feature scaling
scaler = StandardScaler()
X_encoded = scaler.fit_transform(X_encoded)
X_test_encoded = scaler.transform(X_test_encoded)

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X_encoded, y, test_size=0.2, random_state=42) # done after preprocessing so the preprocessing sees all train and validation rows
# the validation set is 20% of the train data, held out from training, to check that the model does not overfit and can generalize
# the test data is not split because it has no target labels

# Modeling
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train) # fit here trains the model: it learns the relationship between features and target and stores the learned coefficients
# an ML model has no transform step; use predict() for class predictions and predict_proba() for probabilities

# Evaluation
y_pred = model.predict(X_val) # measure how well the model performs
print(classification_report(y_val, y_pred)) # internal evaluation only, not the final Kaggle score

# Predict test data
test_pred = model.predict(X_test_encoded) # final predictions on the test data, for the submission CSV
 
# Submission file
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})
# submission.to_csv('submission2.csv', index=False)

Accuracy: 0.6619047619047619
              precision    recall  f1-score   support

           0       0.62      0.48      0.54        87
           1       0.68      0.79      0.73       123

    accuracy                           0.66       210
   macro avg       0.65      0.64      0.64       210
weighted avg       0.66      0.66      0.65       210
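
Experiment 2 concatenates train and test before calling `pd.get_dummies` so that both end up with the same dummy columns. The sketch below, on toy data with illustrative values, shows the mismatch that can appear when the two frames are encoded separately.

```python
import pandas as pd

# Toy data; the category values are illustrative
X = pd.DataFrame({'Family_Income_Level': ['Low', 'High']})
X_test = pd.DataFrame({'Family_Income_Level': ['Medium', 'Low']})

# Encoding each frame on its own yields different column sets
separate_train = pd.get_dummies(X)
separate_test = pd.get_dummies(X_test)
print(list(separate_train.columns))
print(list(separate_test.columns))

# Encoding the concatenation yields one shared column set
combined = pd.get_dummies(pd.concat([X, X_test], axis=0))
X_encoded = combined.iloc[:len(X)]
X_test_encoded = combined.iloc[len(X):]
print(list(X_encoded.columns) == list(X_test_encoded.columns))
```

A fitted `OneHotEncoder(handle_unknown='ignore')`, as used in Experiment 1, is another way to keep the columns aligned without fitting on the test rows.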



### Experiment 3 (train and test combined, pipeline)

In [109]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so the frames can be combined
df_test['Grade'] = None

# Combine train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# Build the model pipeline
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Split back into train and test data
combined_train = combined[combined['Grade'].notnull()].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
combined_train['Grade'] = combined_train['Grade'].astype(int) # the None placeholder is an object, so Grade became object dtype after the concat
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Split train-val
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit the model and evaluate
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict the test data
test_pred = model_pipeline.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})
# submission.to_csv('submission3.csv', index=False)

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.63      0.57      0.59        83
           1       0.73      0.78      0.76       127

    accuracy                           0.70       210
   macro avg       0.68      0.67      0.68       210
weighted avg       0.69      0.70      0.69       210
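
Experiment 3 adds a `None` placeholder for `Grade` to the test frame, concatenates, and later splits the rows back apart by checking which rows have a label. The sketch below walks through that round trip on toy data (the values are made up); it takes an explicit `.copy()` before casting so the assignment acts on an independent frame.

```python
import pandas as pd

# Toy frames standing in for the real train and test data
train = pd.DataFrame({'Age': [20, 21], 'Grade': [1, 0]})
test = pd.DataFrame({'Age': [22]})
test['Grade'] = None  # placeholder so both frames share the same columns

combined = pd.concat([train, test], axis=0)
print(combined['Grade'].dtype)  # dtype of the mixed int/None column after the concat

# Rows with a label are train rows; rows without one are test rows
train_part = combined[combined['Grade'].notnull()].copy()
train_part['Grade'] = train_part['Grade'].astype(int)
test_part = combined[combined['Grade'].isnull()]

print(len(train_part), len(test_part))
```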



### Experiment 4 (hyperparameter tuning)

In [12]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so the frames can be combined
df_test['Grade'] = None

# Combine train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# Build the model pipeline
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])
# Parameter grid to search
param_grid = {
    'classifier__C': [0.1, 1, 10, 100],  # Inverse regularization strength for Logistic Regression (smaller C = stronger regularization)
    'classifier__penalty': ['l1', 'l2'],  # Regularization type (L1 or L2)
    'classifier__solver': ['liblinear', 'saga'],  # Solvers that support both L1 and L2 penalties
    'preprocessing__num__imputer__strategy': ['mean', 'median']  # Imputation strategy for the numeric columns
}

# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Split back into train and test data
combined_train = combined[combined['Grade'].notnull()].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
combined_train['Grade'] = combined_train['Grade'].astype(int)  # Make sure the Grade target is an int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Split train-val
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")

# Evaluate on the validation set
y_pred = grid_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict the test data
test_pred = grid_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

# submission.to_csv('submission4.csv', index=False)

Best parameters: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear', 'preprocessing__num__imputer__strategy': 'median'}
Best cross-validation score: 0.6666666666666666
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.63      0.57      0.59        83
           1       0.73      0.78      0.76       127

    accuracy                           0.70       210
   macro avg       0.68      0.67      0.68       210
weighted avg       0.69      0.70      0.69       210
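
The keys in the parameter grid follow scikit-learn's `<step>__<parameter>` naming, chained through nested pipelines. The sketch below rebuilds a reduced version of the pipeline (a single hypothetical numeric column, `Age`) and checks which keys it exposes through `get_params()`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced stand-in for the notebook's pipeline: one numeric branch only
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer([('num', num_pipeline, ['Age'])])
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression()),
])

# Every tunable parameter is addressable by its double-underscore path
params = model_pipeline.get_params()
print('classifier__C' in params)
print('preprocessing__num__imputer__strategy' in params)

# set_params uses the same paths; this is what the search does for each candidate
model_pipeline.set_params(preprocessing__num__imputer__strategy='median')
print(model_pipeline.get_params()['preprocessing__num__imputer__strategy'])
```

Listing `sorted(model_pipeline.get_params())` is a quick way to catch a misspelled key before starting a long search.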



In [128]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so the frames can be combined
df_test['Grade'] = None

# Combine train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# Build the model pipeline
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])
# Parameter distributions to sample from
param_dist = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(model_pipeline, param_dist, n_iter=100, cv=5, n_jobs=-1, scoring='accuracy', random_state=42)

# Split back into train and test data
combined_train = combined[combined['Grade'].notnull()].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
combined_train['Grade'] = combined_train['Grade'].astype(int)  # Make sure the Grade target is an int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Split train-val
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# Evaluate on the validation set
y_pred = random_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict the test data
test_pred = random_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

# submission.to_csv('submission4.csv', index=False)

Best parameters: {'preprocessing__num__imputer__strategy': 'median', 'classifier__solver': 'liblinear', 'classifier__penalty': 'l2', 'classifier__C': 0.1}
Best cross-validation score: 0.6666666666666666
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.63      0.57      0.59        83
           1       0.73      0.78      0.76       127

    accuracy                           0.70       210
   macro avg       0.68      0.67      0.68       210
weighted avg       0.69      0.70      0.69       210
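
The randomized search above asks for `n_iter=100`, but the search space consists only of short lists. The sketch below counts the combinations with `ParameterGrid` and draws from the same space with `ParameterSampler`, the sampler that `RandomizedSearchCV` uses internally. When every parameter is given as a list, sampling is done without replacement, so the number of candidates is capped at the grid size.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the cell above
param_dist = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Total number of distinct combinations: 4 * 2 * 2 * 2
n_combinations = len(ParameterGrid(param_dist))
print(n_combinations)

# Asking for more iterations than combinations yields each combination once
# (scikit-learn emits a UserWarning about this)
sampled = list(ParameterSampler(param_dist, n_iter=100, random_state=42))
print(len(sampled))
```

Under these settings the randomized search evaluates the same candidates as an exhaustive grid search over this space, which would be consistent with both cells reporting the same best parameters and scores.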



### Experiment 5 (change encoder and scaler)

In [None]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so the frames can be combined
df_test['Grade'] = None

# Combine train and test
combined = pd.concat([df_train, df_test], axis=0)

# Features
num_cols = ['Age', 'Attendance (%)', 'Midterm_Score', 'Final_Score', 'Assignments_Avg',
            'Quizzes_Avg', 'Participation_Score', 'Projects_Score', 'Total_Score',
            'Study_Hours_per_Week', 'Stress_Level (1-10)', 'Sleep_Hours_per_Night']

nominal_cols = ['Gender', 'Department', 'Extracurricular_Activities', 'Internet_Access_at_Home']
ordinal_cols = ['Parent_Education_Level', 'Family_Income_Level']

# Ordinal encoder mapping
education_order = ['None', 'High School', "Bachelor's", "Master's", 'PhD']
income_order = ['Low', 'Medium', 'High']

# Build the numeric, nominal, and ordinal pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

nominal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[education_order, income_order]))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('nom', nominal_pipeline, nominal_cols),
    ('ord', ordinal_pipeline, ordinal_cols)
])

# Build the model pipeline
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])
# Parameter distributions to sample from
param_dist = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(model_pipeline, param_dist, n_iter=100, cv=5, n_jobs=-1, scoring='accuracy', random_state=42)

# Split back into train and test data
combined_train = combined[combined['Grade'].notnull()].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
combined_train['Grade'] = combined_train['Grade'].astype(int)  # Make sure the Grade target is an int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Split train-val
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# Evaluate on the validation set
y_pred = random_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict the test data
test_pred = random_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

# submission.to_csv('submission5.csv', index=False)

Best parameters: {'preprocessing__num__imputer__strategy': 'mean', 'classifier__solver': 'saga', 'classifier__penalty': 'l2', 'classifier__C': 1}
Best cross-validation score: 0.6761904761904762
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.64      0.57      0.60        83
           1       0.74      0.79      0.76       127

    accuracy                           0.70       210
   macro avg       0.69      0.68      0.68       210
weighted avg       0.70      0.70      0.70       210
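
Experiment 5 encodes `Parent_Education_Level` and `Family_Income_Level` with an `OrdinalEncoder` and explicit category orders, so that the integer codes follow the natural ranking rather than alphabetical order. A small sketch with made-up rows shows the resulting codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same orderings as in the cell above
education_order = ['None', 'High School', "Bachelor's", "Master's", 'PhD']
income_order = ['Low', 'Medium', 'High']

# Made-up rows for illustration
sample = pd.DataFrame({
    'Parent_Education_Level': ['PhD', 'High School', 'None'],
    'Family_Income_Level': ['Medium', 'High', 'Low'],
})

# One category list per column, in the same order as the columns
encoder = OrdinalEncoder(categories=[education_order, income_order])
codes = encoder.fit_transform(sample)
print(codes)  # each value is replaced by its position in the corresponding order list
```

Without the `categories` argument the encoder would sort the values alphabetically, which would place `"Bachelor's"` before `'High School'`.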



### Experiment 6 (feature selection, with changed encoder and scaler)

In [None]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so the frames can be combined
df_test['Grade'] = None

# Combine train and test
combined = pd.concat([df_train, df_test], axis=0)

# Features
num_cols = ['Age', 'Attendance (%)', 'Midterm_Score', 'Final_Score', 'Assignments_Avg',
            'Quizzes_Avg', 'Participation_Score', 'Projects_Score', 'Total_Score',
            'Study_Hours_per_Week', 'Stress_Level (1-10)', 'Sleep_Hours_per_Night']

nominal_cols = ['Gender', 'Department', 'Extracurricular_Activities', 'Internet_Access_at_Home']
ordinal_cols = ['Parent_Education_Level', 'Family_Income_Level']

# Ordinal encoder mapping
education_order = ['None', 'High School', "Bachelor's", "Master's", 'PhD']
income_order = ['Low', 'Medium', 'High']

# Build the numeric, nominal, and ordinal pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

nominal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[education_order, income_order]))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('nom', nominal_pipeline, nominal_cols),
    ('ord', ordinal_pipeline, ordinal_cols)
])

# Pipeline with feature selection
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=20)),  # number of features to keep
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Parameter distributions to sample from
param_dist = {
    'feature_selection__k': [10, 15, 20, 'all'],
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(model_pipeline, param_dist, n_iter=100, cv=5, n_jobs=-1, scoring='accuracy', random_state=42)

# Split back into train and test data
combined_train = combined[combined['Grade'].notnull()].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
combined_train['Grade'] = combined_train['Grade'].astype(int)  # Make sure the Grade target is an int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Split train-val
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# Evaluate on the validation set
y_pred = random_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict the test data
test_pred = random_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

submission.to_csv('submission6.csv', index=False)

Best parameters: {'preprocessing__num__imputer__strategy': 'mean', 'feature_selection__k': 'all', 'classifier__solver': 'saga', 'classifier__penalty': 'l2', 'classifier__C': 1}
Best cross-validation score: 0.6761904761904762
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.64      0.57      0.60        83
           1       0.74      0.79      0.76       127

    accuracy                           0.70       210
   macro avg       0.69      0.68      0.68       210
weighted avg       0.70      0.70      0.70       210
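
The `feature_selection` step uses `SelectKBest` with `f_classif`, which scores each feature with a one-way ANOVA F-test against the target and keeps the `k` highest-scoring ones. The sketch below uses synthetic data (not the competition data) with one clearly informative column among three noise columns.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)

# Synthetic binary target: 50 rows of each class
y = np.array([0] * 50 + [1] * 50)

# Three pure-noise columns and one column that closely tracks the target
noise = rng.normal(size=(100, 3))
informative = y + rng.normal(scale=0.1, size=100)
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Keep only the single best-scoring feature
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
print(selector.scores_.round(1))  # F-statistic per column
print(selector.get_support())     # boolean mask of the kept columns

# k='all' disables the selection and keeps every column
keep_all = SelectKBest(score_func=f_classif, k='all').fit(X, y)
print(keep_all.get_support().all())
```

Because the step sits inside the pipeline, the scores are recomputed on each cross-validation training fold, so the selection does not see the held-out fold.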



### Experiment 7 (feature selection without changing the preprocessing)

In [13]:
# Load Dataset
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Save Student_ID for the submission
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so the frames can be combined
df_test['Grade'] = None

# Combine train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# Pipeline with feature selection
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=20)),  # number of features to keep
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Parameter grid to search
param_grid = {
    'feature_selection__k': [10, 15, 20, 'all'],
    'classifier__C': [0.1, 1, 10, 100],  # Inverse regularization strength for Logistic Regression (smaller C = stronger regularization)
    'classifier__penalty': ['l1', 'l2'],  # Regularization type (L1 or L2)
    'classifier__solver': ['liblinear', 'saga'],  # Solvers that support both L1 and L2 penalties
    'preprocessing__num__imputer__strategy': ['mean', 'median']  # Imputation strategy for the numeric columns
}

# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Split the combined frame back into train and test
combined_train = combined[combined['Grade'].notnull()].copy()  # copy to avoid SettingWithCopyWarning
combined_train['Grade'] = combined_train['Grade'].astype(int)  # make sure the Grade target is int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")

# Evaluate on the validation set
y_pred = grid_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict on the test data
test_pred = grid_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

submission.to_csv('submission7_.csv', index=False)

Best parameters: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'saga', 'feature_selection__k': 20, 'preprocessing__num__imputer__strategy': 'mean'}
Best cross-validation score: 0.6738095238095237
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.64      0.57      0.60        83
           1       0.74      0.80      0.77       127

    accuracy                           0.70       210
   macro avg       0.69      0.68      0.68       210
weighted avg       0.70      0.70      0.70       210
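
`GridSearchCV` tries every combination in the grid, so the cost of the search above can be estimated before running it. The sketch below counts the candidates with scikit-learn's `ParameterGrid`; the grid is copied from the cell above with the penalty values written as `'l1'` and `'l2'`, and `cv_folds` is just a local name for the `cv=5` setting.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the experiment above.
param_grid = {
    'feature_selection__k': [10, 15, 20, 'all'],
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Number of distinct parameter combinations (4 * 4 * 2 * 2 * 2).
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates, {n_fits} fits")
```

The final refit of the best candidate on the full training split comes on top of these fits.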



In [5]:
# Load the datasets
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Keep Student_ID for the submission file
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so it can be concatenated with train
df_test['Grade'] = None

# Concatenate train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# Pipeline with feature selection
model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=20)),  # number of features to keep
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Parameter distributions to sample from
param_dist = {
    'feature_selection__k': [10, 15, 20, 'all'],
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(model_pipeline, param_dist, n_iter=100, cv=5, n_jobs=-1, scoring='accuracy', random_state=42)

# Split the combined frame back into train and test
combined_train = combined[combined['Grade'].notnull()].copy()  # copy to avoid SettingWithCopyWarning
combined_train['Grade'] = combined_train['Grade'].astype(int)  # make sure the Grade target is int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")

# Evaluate on the validation set
y_pred = random_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict on the test data
test_pred = random_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

# submission.to_csv('submission7.csv', index=False)

Best parameters: {'preprocessing__num__imputer__strategy': 'mean', 'feature_selection__k': 20, 'classifier__solver': 'saga', 'classifier__penalty': 'l2', 'classifier__C': 0.1}
Best cross-validation score: 0.6738095238095237
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.64      0.57      0.60        83
           1       0.74      0.80      0.77       127

    accuracy                           0.70       210
   macro avg       0.69      0.68      0.68       210
weighted avg       0.70      0.70      0.70       210
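
The randomized search above reports the same best parameters as the grid search. One likely reason is coverage: when every entry in `param_dist` is a list, scikit-learn samples candidates without replacement, so `n_iter=100` visits 100 distinct combinations out of the full grid. A small sketch of that check, using `ParameterSampler` directly (the `unique` set is only a helper for this example):

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the randomized search above.
param_dist = {
    'feature_selection__k': [10, 15, 20, 'all'],
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'preprocessing__num__imputer__strategy': ['mean', 'median'],
}

# Size of the full grid that GridSearchCV would evaluate.
grid_size = len(ParameterGrid(param_dist))

# Candidates drawn the same way RandomizedSearchCV draws them.
sampled = list(ParameterSampler(param_dist, n_iter=100, random_state=42))

# Turn each candidate dict into a hashable tuple to count distinct ones.
unique = {tuple(sorted(p.items(), key=lambda kv: kv[0])) for p in sampled}

coverage = len(unique) / grid_size
print(f"{len(unique)} distinct candidates out of {grid_size} ({coverage:.0%} of the grid)")
```

With this much coverage the randomized search saves little time over the full grid; a smaller `n_iter` or continuous distributions would make the trade-off more meaningful.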



### Experiment 8 (experiment 6 + random forest)

In [None]:
# Load the datasets
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Keep Student_ID for the submission file
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so it can be concatenated with train
df_test['Grade'] = None

# Concatenate train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

from sklearn.feature_selection import SelectFromModel

model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42))
])

# GridSearchCV parameter grid for the Random Forest
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2],
    'preprocessing__num__imputer__strategy': ['mean', 'median']
}

# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Split the combined frame back into train and test
combined_train = combined[combined['Grade'].notnull()].copy()  # copy to avoid SettingWithCopyWarning
combined_train['Grade'] = combined_train['Grade'].astype(int)  # make sure the Grade target is int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")

# Evaluate on the validation set
y_pred = grid_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict on the test data
test_pred = grid_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

# submission.to_csv('submission4.csv', index=False)

Best parameters: {'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100, 'preprocessing__num__imputer__strategy': 'mean'}
Best cross-validation score: 0.6547619047619048
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.62      0.60      0.61        83
           1       0.74      0.76      0.75       127

    accuracy                           0.70       210
   macro avg       0.68      0.68      0.68       210
weighted avg       0.69      0.70      0.69       210
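
The pipeline above passes no `threshold` to `SelectFromModel`. For an estimator without an L1 penalty, such as a random forest, scikit-learn then keeps the features whose importance is at least the mean importance. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption made for the example and not the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 12 features, only 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=42
)

# Same selector configuration as in the pipeline above (default threshold).
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X, y)

# Importances of the forest fitted inside the selector.
importances = selector.estimator_.feature_importances_

# Boolean mask of the features that pass the threshold.
mask = selector.get_support()

print("threshold used :", selector.threshold_)
print("mean importance:", importances.mean())
print("features kept  :", int(mask.sum()), "of", X.shape[1])
```

Because the cut-off is relative to the mean, the number of features kept is not fixed in advance; passing `threshold` or `max_features` explicitly would make it tunable in the grid.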



In [3]:
# Load the datasets
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Keep Student_ID for the submission file
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so it can be concatenated with train
df_test['Grade'] = None

# Concatenate train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Build the numeric and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

from sklearn.feature_selection import SelectFromModel

model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42))
])

# GridSearchCV parameter grid for the Random Forest (wider than the previous run)
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'preprocessing__num__imputer__strategy': ['mean', 'median']
}


# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Split the combined frame back into train and test
combined_train = combined[combined['Grade'].notnull()].copy()  # copy to avoid SettingWithCopyWarning
combined_train['Grade'] = combined_train['Grade'].astype(int)  # make sure the Grade target is int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")

# Evaluate on the validation set
y_pred = grid_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict on the test data
test_pred = grid_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

submission.to_csv('submission8.csv', index=False)

Best parameters: {'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100, 'preprocessing__num__imputer__strategy': 'mean'}
Best cross-validation score: 0.6547619047619048
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.62      0.60      0.61        83
           1       0.74      0.76      0.75       127

    accuracy                           0.70       210
   macro avg       0.68      0.68      0.68       210
weighted avg       0.69      0.70      0.69       210



### Highest score :) (random forest without scaler)

In [None]:
# Load the datasets
df_train = pd.read_csv('train_data_mapped.csv')
df_test = pd.read_csv('test_data.csv')

# Keep Student_ID for the submission file
test_ids = df_test['Student_ID']

# Drop non-predictor columns
drop_columns = ['Student_ID', 'First_Name', 'Last_Name', 'Email']
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

# Add a placeholder Grade column to test so it can be concatenated with train
df_test['Grade'] = None

# Concatenate train and test
combined = pd.concat([df_train, df_test], axis=0)

# Detect numeric and categorical columns
num_cols = combined.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = combined.select_dtypes(include=['object']).drop('Grade', axis=1).columns.tolist()

# Numeric and categorical pipelines (no scaler on the numeric columns)
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

from sklearn.feature_selection import SelectFromModel

model_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42))
])

# GridSearchCV parameter grid for the Random Forest
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'preprocessing__num__imputer__strategy': ['mean', 'median']
}


# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Split the combined frame back into train and test
combined_train = combined[combined['Grade'].notnull()].copy()  # copy to avoid SettingWithCopyWarning
combined_train['Grade'] = combined_train['Grade'].astype(int)  # None is treated as object, so Grade becomes object dtype after the concat; cast it back to int
combined_test = combined[combined['Grade'].isnull()]

# Features and target
X = combined_train.drop(columns='Grade')
y = combined_train['Grade']
X_test = combined_test.drop(columns='Grade')

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Show the best result
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")

# Evaluate on the validation set
y_pred = grid_search.predict(X_val)
print("Classification Report (Validation):")
print(classification_report(y_val, y_pred))

# Predict on the test data
test_pred = grid_search.predict(X_test)

# Submission
submission = pd.DataFrame({
    'Student_ID': test_ids,
    'Grade': test_pred
})

# submission.to_csv('submission8_.csv', index=False)

Best parameters: {'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 100, 'preprocessing__num__imputer__strategy': 'mean'}
Best cross-validation score: 0.6547619047619048
Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.63      0.57      0.59        83
           1       0.73      0.78      0.76       127

    accuracy                           0.70       210
   macro avg       0.68      0.67      0.68       210
weighted avg       0.69      0.70      0.69       210
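
Dropping the scaler is a reasonable choice for a random forest: tree splits depend on the ordering of feature values, not their scale, so standardizing is expected to change the predictions very little. The sketch below checks this on synthetic data (an assumption for the example, not the competition data) by comparing a forest trained on raw features with one trained on standardized features. Small differences can still appear through floating-point ties, which may explain the slightly different report above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Forest trained on the raw features.
raw_model = RandomForestClassifier(n_estimators=100, random_state=42)
raw_model.fit(X_train, y_train)

# Forest trained on standardized features (scaler fitted on train only).
scaler = StandardScaler().fit(X_train)
scaled_model = RandomForestClassifier(n_estimators=100, random_state=42)
scaled_model.fit(scaler.transform(X_train), y_train)

pred_raw = raw_model.predict(X_val)
pred_scaled = scaled_model.predict(scaler.transform(X_val))

# Fraction of validation rows on which both forests agree.
agreement = float(np.mean(pred_raw == pred_scaled))
print(f"Prediction agreement with and without scaling: {agreement:.3f}")
```

Scaling still matters for the logistic regression experiments earlier in the notebook, where the penalty and the solver are sensitive to feature scale.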

