# Student Performance Prediction Workflow (Original Dataset)

This notebook uses the original dataset **Student_performance_data _.csv** from Kaggle (source: https://www.kaggle.com/datasets/rabieelkharoua/students-performance-dataset).

Steps:
1. Import the dataset
2. Initial exploration
3. Preprocessing (impute missing values, encoding, feature selection)
4. Create the performance label (`aman` = safe, `beresiko` = at risk, derived from GPA)
5. Save the preprocessed data → `data_integration.csv`
6. Train/test split
7. Train a Decision Tree
8. Cross-validation
9. Evaluation (confusion matrix and classification report)
10. Visualization
11. Save the predictions → `data_validation.csv`


In [None]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"C:\Users\Asadul\Downloads\Data Mining Kelompok\Student_performance_data _.csv")
df.head()


Inspect the dataset info

In [None]:

df.info()


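Handle missing values

In [None]:

# Fill gaps: column mean for numeric columns, most frequent value otherwise
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Confirm that no missing values remain
df.isnull().sum()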
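
To make the imputation rule concrete, here is a minimal sketch on a made-up frame. The column names `city` and `score` and their values are illustrative only and do not come from the Kaggle file. Numeric columns receive their mean, and other columns receive their most frequent value.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "city": ["A", "B", None, "A"],
    "score": [1.0, np.nan, 3.0, 5.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Mean of the observed values 1.0, 3.0, 5.0
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # mode() ignores missing entries and returns the most frequent value first
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Mean imputation is a simple default. For integer-coded categorical columns, the mode may be a better fit, since a mean can produce a value that is not a valid category code.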


Create the `performance` column: `aman` (safe) when GPA >= 2.5, `beresiko` (at risk) when GPA < 2.5

In [None]:

df["performance"] = np.where(df["GPA"] >= 2.5, "aman", "beresiko")
df["performance"].value_counts()
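
As a quick check of the threshold rule, the sketch below applies the same `np.where` expression to a few made-up GPA values (not taken from the dataset). It shows that a GPA of exactly 2.5 falls on the `aman` side because the comparison is `>=`.

```python
import numpy as np
import pandas as pd

# Made-up GPA values chosen to sit on both sides of the 2.5 cutoff
gpa = pd.Series([1.9, 2.49, 2.5, 3.7], name="GPA")

# Same rule as in the notebook: >= 2.5 is "aman", everything else "beresiko"
label = np.where(gpa >= 2.5, "aman", "beresiko")

for value, tag in zip(gpa.tolist(), label.tolist()):
    print(value, tag)
```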


Save the preprocessed data

In [None]:

df.to_csv(r"C:\Users\Asadul\Downloads\Data Mining Kelompok\data_integration.csv", index=False)
print("data_integration.csv saved successfully!")


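Drop the ID, target, and GPA columns, then split into train and test sets

In [None]:

# performance is derived from GPA, so GPA must not be a feature (it would leak the label).
# If the file has other GPA-derived columns (for example GradeClass), drop them as well.
X = df.drop(columns=["performance", "StudentID", "GPA"])
y = df["performance"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)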
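
Because `performance` is computed directly from `GPA`, any model that sees `GPA` as a feature can reproduce the label with a single split. The sketch below illustrates this on synthetic data. The `noise_a` and `noise_b` columns and the `holdout_accuracy` helper are hypothetical and exist only for this demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: GPA plus two pure-noise columns (all values are made up)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "GPA": rng.uniform(0, 4, n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
toy["performance"] = np.where(toy["GPA"] >= 2.5, "aman", "beresiko")

def holdout_accuracy(feature_cols):
    """Fit a tree on 70% of the rows and return accuracy on the remaining 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[feature_cols], toy["performance"], test_size=0.3, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_with_gpa = holdout_accuracy(["GPA", "noise_a", "noise_b"])
acc_without_gpa = holdout_accuracy(["noise_a", "noise_b"])
print(f"with GPA: {acc_with_gpa:.3f}, without GPA: {acc_without_gpa:.3f}")
```

In the real dataset the remaining features presumably carry some signal, so the gap should be smaller than in this toy case. A near-perfect score with `GPA` included would still mostly measure the labelling rule, not the model.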


In [None]:

# Train the decision tree on the training set, then predict the held-out test set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)


In [None]:

scores = cross_val_score(clf, X, y, cv=5)
print("Cross Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
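
With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling. If shuffled, reproducible folds are wanted, one option is to pass an explicit `StratifiedKFold`. The sketch below does this on synthetic data from `make_classification`. The names `X_demo`, `y_demo`, and `scores_demo` are placeholders, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds, shuffled once with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_demo = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=cv, scoring="accuracy"
)
print("Fold accuracies:", scores_demo)
print("Mean accuracy:", scores_demo.mean())
```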


In [None]:

cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("\nClassification Report:\n", report)
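
A raw confusion matrix is easier to read once the row and column order is fixed. The sketch below uses hand-written labels (not model output) and passes `labels=` explicitly, so rows are the true classes and columns are the predicted classes in a known order.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hand-written example labels, for illustration only
y_true_demo = ["aman", "aman", "beresiko", "beresiko", "beresiko"]
y_pred_demo = ["aman", "beresiko", "beresiko", "beresiko", "aman"]
labels = ["aman", "beresiko"]

# Rows follow the true class, columns follow the predicted class
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

cm_table = pd.DataFrame(
    cm_demo,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_table)
```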


In [None]:

results = pd.DataFrame({
    "y_true": y_test,
    "y_pred": y_pred
})
results.to_csv(r"C:\Users\Asadul\Downloads\Data Mining Kelompok\data_validation.csv", index=False)
print("data_validation.csv saved successfully!")


In [None]:

plt.figure(figsize=(8,5))
sns.histplot(df["GPA"], bins=20, kde=True)
plt.title("Student GPA Distribution")
plt.xlabel("GPA")
plt.ylabel("Number of Students")
plt.show()


In [None]:

plt.figure(figsize=(6,4))
sns.countplot(x="performance", data=df)
plt.title("Number of Students per Performance Category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()


In [None]:

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap="Blues")
plt.title("Confusion Matrix")
plt.show()


In [None]:

feature_importances = clf.feature_importances_
features = X.columns
sorted_idx = np.argsort(feature_importances)

plt.figure(figsize=(10,6))
plt.barh(features[sorted_idx], feature_importances[sorted_idx])
plt.title("Feature Importance - Decision Tree")
plt.xlabel("Importance")
plt.show()
