# **1. Dataset Introduction**


In the first stage, find and use a dataset that meets the following requirement:

1. **Dataset Source**:  
   The dataset can come from a variety of sources, such as public repositories (*Kaggle*, *UCI ML Repository*, *Open Data*) or primary data that you collect yourself.


# **2. Import Libraries**

In this stage, import the Python libraries needed for data analysis and for building machine learning or deep learning models.

In [26]:
import os

import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

import scipy.sparse as sp

# **3. Loading the Dataset**

In this stage, load the dataset into the notebook. If the dataset is in CSV format, you can read it with the pandas library. Inspect the first few rows to understand the structure and to confirm that the data loaded correctly.

If the dataset is stored on Google Drive, connect Google Drive to Colab first. Once the dataset has loaded successfully, the next step is to check that the data is consistent and ready for further analysis.

If the dataset is unstructured data, adapt the format to the one used in the Machine Learning Pengembangan or Machine Learning Terapan classes.
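
As a minimal sketch of the loading check described above, the snippet below reads a CSV with pandas and fails early if an expected column is missing. The helper name `load_and_check` and the inline two-row sample are hypothetical; they stand in for the real file only so that the example runs on its own.

```python
import io

import pandas as pd


def load_and_check(source, expected_cols):
    """Read a CSV and raise an error if any expected column is missing."""
    df = pd.read_csv(source)
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df


# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Age,Gender,Income,Credit Score\n"
    "25,Female,50000,High\n"
    "30,Male,100000,High\n"
)

sample_df = load_and_check(sample_csv, ["Age", "Income", "Credit Score"])
print(sample_df.shape)
print(sample_df.head())
```

The same helper accepts a file path in place of the in-memory buffer, because `pd.read_csv` handles both.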

In [8]:
df = pd.read_csv('../Credit Score Classification Dataset_raw.csv')
print(df.shape)
print(f"\n{df.dtypes}\n")
df.head()

(164, 8)

Age                    int64
Gender                object
Income                 int64
Education             object
Marital Status        object
Number of Children     int64
Home Ownership        object
Credit Score          object
dtype: object



| | Age | Gender | Income | Education | Marital Status | Number of Children | Home Ownership | Credit Score |
|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Female | 50000 | Bachelor's Degree | Single | 0 | Rented | High |
| 1 | 30 | Male | 100000 | Master's Degree | Married | 2 | Owned | High |
| 2 | 35 | Female | 75000 | Doctorate | Married | 1 | Owned | High |
| 3 | 40 | Male | 125000 | High School Diploma | Single | 0 | Owned | High |
| 4 | 45 | Female | 100000 | Bachelor's Degree | Married | 3 | Owned | High |


# **4. Exploratory Data Analysis (EDA)**

In this stage, you will perform **Exploratory Data Analysis (EDA)** to understand the characteristics of the dataset.

The goal of EDA is to gain solid initial insight into the data and to decide on the next steps for analysis or modeling.

In [9]:
df.describe()

| | Age | Income | Number of Children |
|---|---|---|---|
| count | 164.0 | 164.0 | 164.0 |
| mean | 37.97561 | 83765.243902 | 0.652439 |
| std | 8.477289 | 32457.306728 | 0.883346 |
| min | 25.0 | 25000.0 | 0.0 |
| 25% | 30.75 | 57500.0 | 0.0 |
| 50% | 37.0 | 83750.0 | 0.0 |
| 75% | 45.0 | 105000.0 | 1.0 |
| max | 53.0 | 162500.0 | 3.0 |


In [10]:
na_count = df.isna().sum()
print(na_count.sort_values(ascending=False))

Age                   0
Gender                0
Income                0
Education             0
Marital Status        0
Number of Children    0
Home Ownership        0
Credit Score          0
dtype: int64


In [11]:
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
for c in cat_cols:
    print(c, df[c].nunique(), df[c].unique()[:10])

Gender 2 ['Female' 'Male']
Education 5 ["Bachelor's Degree" "Master's Degree" 'Doctorate' 'High School Diploma'
 "Associate's Degree"]
Marital Status 2 ['Single' 'Married']
Home Ownership 2 ['Rented' 'Owned']
Credit Score 3 ['High' 'Average' 'Low']


In [12]:
target_col = "Credit Score"
print(df[target_col].value_counts())

Credit Score
High       113
Average     36
Low         15
Name: count, dtype: int64
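
The class counts above are uneven, with `High` making up roughly 69% of the 164 rows. The sketch below turns the counts into proportions and a simple largest-to-smallest ratio. The counts are copied by hand from the output above, and the ratio is only one rough, illustrative way to describe imbalance.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"High": 113, "Average": 36, "Low": 15}, name="count")

# Relative frequency of each class.
proportions = counts / counts.sum()
print(proportions.round(3))

# Ratio between the largest and smallest class, a rough imbalance indicator.
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

On the notebook's own DataFrame, `df[target_col].value_counts(normalize=True)` gives the same proportions directly.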


In [13]:
num_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
num_cols = [c for c in num_cols if c != target_col]
cat_cols = [c for c in df.columns if c not in num_cols + [target_col]]

print("Numeric:", num_cols)
print("Categorical:", cat_cols)

Numeric: ['Age', 'Income', 'Number of Children']
Categorical: ['Gender', 'Education', 'Marital Status', 'Home Ownership']


# **5. Data Preprocessing**

Data preprocessing is an important step for ensuring data quality before the data is used in a machine learning model.

If you are working with text data, keep in mind that raw data often contains missing values, duplicates, or inconsistent value ranges, all of which can affect model performance. This process therefore aims to clean and prepare the data so that the analysis runs well.

The following steps can be applied, but you are **not limited** to them:
1. Removing or Handling Missing Values
2. Removing Duplicate Data
3. Feature Normalization or Standardization
4. Outlier Detection and Handling
5. Encoding Categorical Data
6. Binning (Grouping Data)

Simply adapt these steps to the characteristics of the data you are using, especially when you are working with unstructured data.
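
Three of the listed steps (duplicate removal, outlier detection, and binning) can be sketched on a small toy table as follows. The toy values, the 1.5 × IQR rule, and the age bin edges are illustrative assumptions, not choices taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "Age": [25, 30, 30, 45, 52],
    "Income": [50000, 100000, 100000, 75000, 400000],
})

# Step 2: count and remove exact duplicate rows.
n_dupes = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

# Step 4: flag outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = toy["Income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["Income"] < q1 - 1.5 * iqr) | (toy["Income"] > q3 + 1.5 * iqr)

# Step 6: bin Age into labeled groups; the bin edges are arbitrary assumptions.
toy["Age Group"] = pd.cut(
    toy["Age"], bins=[0, 30, 45, 120], labels=["<=30", "31-45", ">45"]
)

print("Duplicates removed:", n_dupes)
print(toy.assign(Income_outlier=is_outlier))
```

Whether a flagged row should be dropped, capped, or kept depends on the data. The sketch only marks it.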

In [15]:
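# Numeric features: fill missing values with the median, then standardize.
numeric_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical features: fill missing values with the most frequent value,
# then one-hot encode. Categories unseen during fit become all-zero rows.
categorical_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Apply each pipeline to its own group of columns.
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_tf, num_cols),
        ("cat", categorical_tf, cat_cols),
    ]
)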

In [20]:
X = df.drop(columns=[target_col])
y = df[target_col].copy()

In [21]:
Xt = preprocess.fit_transform(X)

if sp.issparse(Xt):
    Xt = Xt.toarray()

print(Xt)

[[-1.53531997 -1.04348337 -0.74086152 ...  1.          0.
   1.        ]
 [-0.94370231  0.50172     1.53019061 ...  0.          1.
   0.        ]
 [-0.35208465 -0.27088169  0.39466455 ...  0.          1.
   0.        ]
 ...
 [ 0.12120947 -0.65718253  1.53019061 ...  0.          1.
   0.        ]
 [ 0.71282713  0.11541915 -0.74086152 ...  1.          1.
   0.        ]
 [ 1.30444478 -0.19362152  0.39466455 ...  0.          1.
   0.        ]]


In [23]:
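# Recover the one-hot column names from the fitted encoder.
ohe = preprocess.named_transformers_["cat"].named_steps["onehot"]
ohe_feature_names = ohe.get_feature_names_out(cat_cols).tolist()
# Numeric columns come first in the output and get a "num__" prefix.
final_cols = [f"num__{c}" for c in num_cols] + ohe_feature_names

# Rebuild a labeled DataFrame and attach the unchanged target column.
Xt_df = pd.DataFrame(Xt, columns=final_cols)
Xt_df[target_col] = y.values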

Xt_df.head()

| | num__Age | num__Income | num__Number of Children | Gender_Female | Gender_Male | Education_Associate's Degree | Education_Bachelor's Degree | Education_Doctorate | Education_High School Diploma | Education_Master's Degree | Marital Status_Married | Marital Status_Single | Home Ownership_Owned | Home Ownership_Rented | Credit Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.53532 | -1.043483 | -0.740862 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | High |
| 1 | -0.943702 | 0.50172 | 1.530191 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | High |
| 2 | -0.352085 | -0.270882 | 0.394665 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | High |
| 3 | 0.239533 | 1.274322 | -0.740862 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | High |
| 4 | 0.831151 | 0.50172 | 2.665717 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | High |


In [29]:
out_path = "Credit Score Classification Dataset_preprocessing.csv"
Xt_df.to_csv(out_path, index=False)
print("Saved:", out_path)

Saved: Credit Score Classification Dataset_preprocessing.csv
